Application of Electronic Technique
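An FPGA-based implementation of CNN hardware accelerator
Qiu Zhenbo
College of Photoelectric Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
Abstract: This paper proposes a general-purpose CNN hardware accelerator design based on an FPGA. For the convolutional layers, which account for most of the computation, three forms of parallelism are used: input-channel parallelism, intra-kernel parallelism, and output-channel parallelism, with the degree of each chosen according to the FPGA's on-chip resources. For data loading, adjacent data words are merged into wider transfers, which raises the accelerator's effective transfer bandwidth. An input buffer module is designed around row-based data-flow loading; it needs to buffer only two rows before convolution can begin, which brings forward the start of the computation. The data input, computation, and data output modules are connected using pipelined loop optimization, which substantially improves the hardware's computing performance. Finally, the accelerator is applied to the VGG16 and Darknet-19 networks. Experiments show computing performance of 34.30 GOPS and 33.68 GOPS, respectively, with DSP computing efficiency of 79.45% and 78.01%.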
CLC number: TP391    Document code: A    DOI: 10.16157/j.issn.0258-7998.234372
Citation: Qiu Zhenbo. An FPGA-based implementation of CNN hardware accelerator[J]. Application of Electronic Technique, 2023, 49(12): 20-25.
Key words: convolutional neural network acceleration; FPGA; row data loading; module division; pipeline structure
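
The row-based loading and the three forms of parallelism named in the abstract can be pictured with a small software model. The sketch below is not the paper's hardware design, which this excerpt does not show. It assumes a 3×3 kernel, stride 1, no padding, and channel counts that are multiples of the tile sizes. The names `stream_conv`, `mac_array`, `tm`, and `tn` are hypothetical, and rows are processed at whole-row granularity rather than pixel by pixel.

```python
from collections import deque

K = 3  # kernel size assumed for this sketch (stride 1, no padding)


def mac_array(band, weights, col, m0, n0, tm, tn):
    """Model one cycle of a MAC array: tm * tn * K * K products, summed per output channel."""
    partial = [0] * tm
    for dm in range(tm):              # output-channel parallelism
        for dn in range(tn):          # input-channel parallelism
            for i in range(K):        # intra-kernel parallelism (window rows)
                for j in range(K):    # intra-kernel parallelism (window columns)
                    partial[dm] += (band[i][n0 + dn][col + j]
                                    * weights[m0 + dm][n0 + dn][i][j])
    return partial


def stream_conv(rows, weights, tm=2, tn=2):
    """Consume rows shaped [C_in][W] one at a time; yield (row_index, [C_out][W-K+1]).

    Assumes C_out is a multiple of tm and C_in is a multiple of tn.
    """
    c_out, c_in = len(weights), len(weights[0])
    line_buffer = deque(maxlen=K - 1)         # holds only the two previous rows
    for r, incoming in enumerate(rows):
        if len(line_buffer) == K - 1:         # two rows buffered: convolution can start
            band = [*line_buffer, incoming]   # three rows, oldest first
            width = len(incoming[0]) - K + 1
            out_row = [[0] * width for _ in range(c_out)]
            for col in range(width):
                for m0 in range(0, c_out, tm):        # output-channel tiles
                    for n0 in range(0, c_in, tn):     # input-channel tiles
                        partial = mac_array(band, weights, col, m0, n0, tm, tn)
                        for dm in range(tm):
                            out_row[m0 + dm][col] += partial[dm]
            yield r - (K - 1), out_row
        line_buffer.append(incoming)          # the oldest buffered row is evicted


def direct_conv(rows, weights):
    """Plain loop nest over the whole image, used only to check the streaming version."""
    h, c_in, w = len(rows), len(rows[0]), len(rows[0][0])
    return [[[sum(rows[r + i][n][c + j] * weights[m][n][i][j]
                  for n in range(c_in) for i in range(K) for j in range(K))
              for c in range(w - K + 1)]
             for m in range(len(weights))]
            for r in range(h - K + 1)]


# Small deterministic test data: 5 rows, 4 input channels, 6 columns, 4 output channels.
image = [[[(3 * r + 2 * n + c) % 7 for c in range(6)] for n in range(4)] for r in range(5)]
weights = [[[[(m + n + i * j) % 3 - 1 for j in range(K)] for i in range(K)]
            for n in range(4)] for m in range(4)]
streamed = list(stream_conv(iter(image), weights, tm=2, tn=2))
```

In this model the first output row is produced as soon as the third input row arrives, while only two rows are ever held in `line_buffer`. In hardware, the three innermost loops of `mac_array` would typically be fully unrolled, so `tm * tn * K * K` sets the number of multipliers required and would be sized to the available DSP resources.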

0 Introduction

With the rapid development of deep learning, neural network models have driven major advances in image recognition, object detection, and image segmentation [1-2]. Compared with traditional algorithms, however, neural networks achieve their higher performance at the cost of high computational complexity, which has made acceleration on dedicated hardware a focus of attention in the deployment of neural network models. The main hardware acceleration options today are GPUs, ASICs, and FPGAs. Compared with GPUs, FPGAs offer lower cost and power consumption; compared with ASICs, they offer flexible model implementation, faster development, and lower overall cost. These properties suit the current need to deploy neural networks on edge devices, so FPGA-based acceleration of neural network models has become an active research topic [3-5].

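In most neural network models, convolutional layers account for more than 90% of the total computation, so a network can be accelerated by executing the convolution operations on an FPGA [6-7]. Reference [6] implements a general matrix multiplication accelerator on an FPGA to accelerate neural networks and reports good acceleration performance. Reference [7] proposes a matrix multiplication module based on a systolic array structure and applies it to neural network acceleration, obtaining a considerable performance improvement. References [8-9] study fast algorithms for the convolution itself. Liang Y et al. [8] implemented a CNN on an FPGA using the two-dimensional Winograd algorithm; compared with a conventional convolution unit, their Winograd-based unit reduces the number of multiplications by 56%. Tahmid Abtahi et al. [10] used the fast Fourier transform (FFT) to optimize the convolutions in the ResNet-20 model, reducing the DSP resources used by a single convolution unit. Beyond accelerating the convolution itself, other aspects of neural network acceleration have also been studied in depth [10-14]. Reference [10] proposes block convolution, a memory-efficient alternative to conventional convolution that moves the intermediate data buffers entirely from external DRAM to on-chip memory, although accuracy decreases as the number of blocked layers increases. Reference [11] proposes merging the bit widths of adjacent layers and reordering the weight parameters to optimize data transfer, which increases transfer parallelism while using fewer channels. References [12-14] adopt a ping-pong structure, applied to the input module, the convolution unit, and the output module respectively, to raise the throughput of the convolution operation.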
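
The 56% figure quoted for the two-dimensional Winograd algorithm can be reproduced with a short calculation. The sketch below assumes the F(2×2, 3×3) variant, which this excerpt does not name for reference [8]; it is chosen because its element-wise multiplication count (16 instead of 36 per 2×2 output tile) matches the quoted reduction. Only the element-wise products are counted, the transform matrices are the standard ones for this variant, and the helper names are hypothetical.

```python
from fractions import Fraction

H = Fraction(1, 2)
# Standard F(2x2, 3x3) transform matrices.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]   # input transform
G = [[1, 0, 0], [H, H, H], [H, -H, H], [0, 0, 1]]                  # kernel transform
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]                                # output transform


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def winograd_tile(d, g):
    """Compute a 2x2 output tile from a 4x4 input tile d and a 3x3 kernel g."""
    u = matmul(matmul(G, g), transpose(G))       # transformed kernel, 4x4
    v = matmul(matmul(BT, d), transpose(BT))     # transformed input, 4x4
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]   # 16 multiplications
    return matmul(matmul(AT, m), transpose(AT))  # back to the 2x2 output


def direct_tile(d, g):
    """Same 2x2 output tile by sliding the kernel: 4 outputs x 9 multiplications."""
    return [[sum(d[r + i][c + j] * g[i][j] for i in range(3) for j in range(3))
             for c in range(2)] for r in range(2)]


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
g = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

direct_mults = 2 * 2 * 3 * 3        # 36
winograd_mults = 4 * 4              # 16
reduction = 1 - winograd_mults / direct_mults   # about 0.556
```

Exact fractions are used so that the two results can be compared with `==`. In a hardware unit the kernel transform `u` would normally be precomputed offline, and the input and output transforms reduce to additions and subtractions, which is why the saving shows up mainly in multiplier (DSP) usage.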



For the full text of this article, please download: http://ihrv.cn/resource/share/2000005800


