Implementation of Deep Learning Networks on an Embedded FPGA Platform


Abstract

In recent years, deep learning networks have developed rapidly and have been applied effectively in many fields; in computer vision in particular, deep learning methods have achieved great success. However, as problems become more abstract and complex, the computational complexity of deep learning networks grows with them, placing high demands on the computing power and throughput of the hardware system. For embedded systems in particular, the characteristics of deep learning networks make them difficult to integrate into edge devices such as smartphones and smart glasses. How to implement deep learning networks with high performance and low energy consumption has therefore become a research focus for many research institutions.

Thanks to its programmable logic array, an FPGA is reconfigurable, unlike an ASIC implementing the same design, and offers a shorter design cycle. Compared with CPUs and GPUs, FPGAs also consume less power. As deep learning networks keep growing in complexity, neural network accelerators based on FPGA platforms have gradually attracted attention.

This thesis uses an FPGA as the experimental platform and designs a hardware accelerator for the LeNet-5 model at the RTL level. The thesis first analyzes the deep learning network model and identifies the bottlenecks and design priorities for the hardware implementation of each layer type. The convolution layer is computation-intensive, so the key to its design is raising computation speed; the bottleneck of the fully connected layer is its large number of parameters and low throughput, so the key to its design is raising memory-access throughput. The hardware architecture designed at the RTL level divides into three parts: control, data processing, and the storage system. The control part uses a state machine to sequence the execution of the network's layers and to control the data-read, memory-access, and data-processing enable signals within each layer. The data processing part comprises a convolver and 8 parallel PE units, which perform the convolution, pooling, fully connected, and ReLU computations. In the storage system, three blocks of on-chip FPGA BRAM are allocated, and a ping-pong mechanism is used to read and write the intermediate results of the network computation. The parallelization, data reuse, and data slicing techniques used in this architecture help raise system throughput and lower memory-access power. All data in the network are quantized to 8-bit fixed-point numbers, which saves storage resources and reduces both storage and computation power consumption.
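As a rough illustration of the 8-bit fixed-point quantization mentioned above, the following Python sketch shows one common scheme; the Q3.4 format, the function names, and the rounding and saturation choices here are assumptions for illustration, not the thesis's actual implementation.

```python
import numpy as np

def quantize_q8(x, frac_bits=4):
    """Quantize a float array to signed 8-bit fixed point.

    frac_bits is the number of fractional bits (a Q3.4 format here,
    chosen arbitrarily for illustration); the thesis does not state
    its exact fixed-point format.
    """
    scale = 1 << frac_bits                      # 2^frac_bits
    q = np.round(x * scale)                     # scale, then round to integers
    q = np.clip(q, -128, 127).astype(np.int8)   # saturate to the int8 range
    return q

def dequantize_q8(q, frac_bits=4):
    """Recover an approximate float value from the fixed-point code."""
    return q.astype(np.float32) / (1 << frac_bits)

# Example: weights are quantized once offline before being loaded on chip.
w = np.array([0.75, -1.5, 0.0625, 3.2], dtype=np.float32)
wq = quantize_q8(w)
print(wq, dequantize_q8(wq))   # [12 -24 1 51] -> [0.75 -1.5 0.0625 3.1875]
```

Storing each value in 8 bits rather than 32 cuts weight and activation storage by a factor of four, which is what makes it feasible to hold intermediate results in on-chip BRAM.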

The LeNet accelerator designed in this thesis achieves 91% utilization of the FPGA platform's computational resources, nearly reaching the platform's upper limit.

Keywords: neural network acceleration, embedded, FPGA

ABSTRACT

In recent years, deep learning networks have developed rapidly and have been applied in many fields. In computer vision in particular, deep learning methods have achieved great success. However, as problems become more abstract and complex, the computational complexity of deep learning networks increases, imposing high requirements on the computing power and throughput of the hardware system. For embedded systems in particular, the characteristics of deep learning networks make them difficult to integrate into edge devices such as smartphones and smart glasses. Therefore, how to implement deep learning networks with high performance and low energy consumption has become the research focus of many research institutions.

Thanks to their programmable logic arrays, FPGAs are reconfigurable, unlike ASICs implementing the same design, and have shorter design cycles. FPGAs also consume less power than CPUs and GPUs. With the increasing complexity of deep learning networks, neural network accelerators based on FPGA platforms have gradually attracted attention.

In this paper, an FPGA is used as the experimental platform, and a hardware accelerator for the LeNet-5 model is designed at the RTL level. This paper first analyzes the deep learning network model and identifies the bottlenecks and design priorities of the hardware design for each layer type. Because the convolution layer is computation-intensive, the key to its design is increasing computation speed; the bottleneck of the fully connected layer is its large number of parameters and low throughput, so the key to its design is increasing memory-access throughput. The hardware architecture, designed at the RTL level, can be divided into three parts: the control part, the data processing part, and the storage system. The control part uses a state machine to sequence the execution of the network's layers and to drive the data-read, memory-access, and data-processing enables within each layer. The data processing part consists of a convolver and 8 parallel PE units, which carry out the convolution, pooling, fully connected, and ReLU layer computations. In the storage system, three blocks of FPGA on-chip BRAM are allocated, and a ping-pong mechanism is used to read and write the intermediate results of the network computation. The parallelization, data reuse, and data slicing techniques adopted in this architecture improve system throughput and reduce memory-access power consumption. All data in the neural network are quantized to 8-bit fixed-point numbers, which saves storage resources and reduces both storage and computation power consumption.
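The ping-pong mechanism described above can be sketched in software terms as follows; this Python sketch is only an analogy for the RTL behavior, and the buffer names and layer interface are hypothetical, not taken from the thesis.

```python
# Minimal sketch of ping-pong buffering between two on-chip buffers:
# while the compute logic reads one layer's inputs from one BRAM, it
# writes that layer's outputs to the other; the roles then swap for
# the next layer. (run_layer and Layer are illustrative assumptions.)

class Layer:
    def __init__(self, f):
        self.f = f

def run_layer(layer, src):
    """Placeholder for one layer's computation (conv/pool/FC/ReLU)."""
    return [layer.f(x) for x in src]

def run_network(layers, input_data):
    bram_a, bram_b = list(input_data), []   # two of the three BRAMs
    for layer in layers:
        bram_b = run_layer(layer, bram_a)   # read "ping", write "pong"
        bram_a, bram_b = bram_b, bram_a     # swap roles for the next layer
    return bram_a

# Example: a toy two-"layer" pipeline.
net = [Layer(lambda x: x * 2), Layer(lambda x: max(x, 0))]
print(run_network(net, [1, -3, 5]))  # [2, 0, 10]
```

In hardware, this double buffering lets reads from one BRAM overlap with writes to another, so the datapath does not stall waiting for a single shared memory.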

The LeNet network accelerator designed in this paper utilizes 91% of the computing resources of the FPGA platform, nearly reaching the platform's upper limit.

KEY WORDS: neural network acceleration, embedded, FPGA

Contents

Abstract

ABSTRACT

Contents

Chapter 1: Introduction

1.1 Research Background

1.2 Deep Learning Network Acceleration and Performance Evaluation Methods

1.3 Current Research on Embedded FPGA Accelerators

1.4 Research Content and Significance of This Thesis

1.5 Thesis Organization

Chapter 2: Convolutional Network Analysis

2.1 Overview of DNNs

2.1.1 Convolution Layer

2.1.2 Fully Connected Layer

2.1.3 Pooling Layer

2.2 LeNet-5 Network Structure

2.3 Network Complexity Analysis

Chapter 3: Hardware Implementation

3.1 Overall Architecture

3.2 Control Logic

3.3 Data Processing

3.3.1 Convolver Design

3.3.2 Processing Element (PE) Design

Chapter 4: Performance Optimization

4.1 Parallelization

4.2 Data Reuse

4.2.1 Data Reuse Schemes

4.2.2 Data Slicing

4.3 Quantization

Chapter 5: Storage System

5.1 Buffer Design

5.2 Data Layout Rules

Chapter 6: Evaluation of the Hardware Acceleration Architecture

Chapter 7: Conclusion and Outlook

7.1 Conclusion

7.2 Remaining Issues and Future Work

References

Acknowledgements

Introduction
