   The electronic thesis has not yet been authorized for public access; for the print copy, please consult the library catalog.
(Note: if the record cannot be found, or the holdings status shows "closed stacks, not available", the thesis is not in the stacks and cannot be accessed.)
System ID: U0026-1310202015031700
Title (Chinese): 量化卷積神經網路之二維脈動陣列加速器設計
Title (English): Design of 2D Systolic Array Accelerator for Quantized Convolutional Neural Networks
University: National Cheng Kung University
Department (Chinese): 電機工程學系
Department (English): Department of Electrical Engineering
Academic Year: 109
Semester: 1
Publication Year: 109 (2020)
Author (Chinese): 劉珈寧
Author (English): Chia-Ning Liu
Student ID: N26064820
Degree: Master's
Language: Chinese
Pages: 59
Committee: Advisor - 郭致宏
Committee Member - 陳中和
Committee Member - 邱瀝毅
Keywords (Chinese): 深度學習, 卷積神經網路, 硬體加速器, 脈動陣列
Keywords (English): deep learning, convolutional neural networks, hardware accelerator, systolic array
Subject Classification:
Abstract (Chinese): Deep learning and artificial intelligence have become increasingly popular in recent years, with a wide range of applications. To improve accuracy, neural networks have grown deeper and larger, and their computation and parameter counts have increased accordingly. Network quantization and many dedicated hardware accelerators have therefore been proposed to reduce computation and accelerate processing. This thesis presents an accelerator architecture for quantized convolutional neural networks in which both input activations and weights are quantized to 8-bit integers. The accelerator supports convolutional layers and fully-connected layers of different sizes in various neural network models. The compute core adopts a 2D systolic array structure to reduce memory-access energy and increase throughput. To cooperate with the systolic array and minimize external memory access, a dedicated on-chip memory is designed, and a Fetcher architecture is proposed for input access. Compared with Eyeriss, the overall design reduces external memory access by 1.79x and 1.63x for the convolutional layers of VGG-16 and AlexNet, and internal memory access by 17.48x and 7.31x, respectively.
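As a generic illustration of the 8-bit integer quantization mentioned above (a minimal sketch of a common per-tensor symmetric scheme, not necessarily the exact procedure used in this thesis), the following Python snippet quantizes floating-point activations and weights to int8 and carries out the multiply-accumulate in integer arithmetic with a 32-bit accumulator; the names quantize_int8 and int8_dot are hypothetical and exist only for this example.

    import numpy as np

    def quantize_int8(x):
        # Per-tensor symmetric int8 quantization: x ~= q * scale.
        # A symmetric scheme (zero-point fixed at 0) keeps the integer
        # multiply-accumulate datapath simple; the thesis's scheme may differ.
        max_abs = float(np.max(np.abs(x)))
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
        return q, scale

    def int8_dot(q_a, q_b, s_a, s_b):
        # Dot product on int8 operands with a 32-bit accumulator,
        # rescaled back to floating point at the end.
        acc = np.dot(q_a.astype(np.int32), q_b.astype(np.int32))
        return acc * (s_a * s_b)

    # The int8 result stays close to the full-precision reference.
    rng = np.random.default_rng(0)
    act, wgt = rng.standard_normal(256), rng.standard_normal(256)
    q_act, s_act = quantize_int8(act)
    q_wgt, s_wgt = quantize_int8(wgt)
    print(int8_dot(q_act, q_wgt, s_act, s_wgt), float(np.dot(act, wgt)))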
Abstract (English): Deep learning and artificial intelligence (AI) have received considerable attention recently. With large data sets and dedicated algorithms, machines can be trained to perform specific tasks. Most operations in these algorithms are multiplications and additions, a characteristic well suited to dedicated hardware accelerators. Quantized networks are now being studied to reduce the computing and memory requirements of deep neural networks: they quantize full-precision weights and activations to lower bit-width fixed-point or integer representations with only a small loss of accuracy, which is helpful for hardware with limited power and storage capacity. In this work, we propose a systolic-array-based architecture to accelerate convolutional networks with 8-bit integer data. The accelerator supports both the convolutional layers (CLs) and fully-connected layers (FCLs) of various neural network models. The computing unit balances computation with I/O and improves throughput through its systolic structure. To cooperate with the systolic array and minimize external memory access, we also design a dedicated on-chip memory. Compared with Eyeriss, the external memory access of the CLs is reduced by 1.63x and 1.79x for AlexNet and VGG-16, and the internal memory access is reduced by 7.31x and 17.48x, respectively.
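To make the systolic dataflow concrete, below is a minimal, purely illustrative Python model of an output-stationary 2D systolic array computing an int8 matrix multiply (e.g., a convolution lowered to a GEMM). The skewed input streams and one-hop-per-cycle operand movement show the general systolic principle only; the actual dataflow, PE array dimensions, and buffering of this thesis are described in Chapter 4 and may differ, and the function name systolic_matmul is hypothetical.

    import numpy as np

    def systolic_matmul(A, B):
        # Cycle-level sketch of an output-stationary 2D systolic array.
        # PE(i, j) keeps a local 32-bit accumulator for C[i, j]. Operands of A
        # flow left-to-right and operands of B flow top-to-bottom, skewed so
        # that A[i, k] and B[k, j] meet at PE(i, j) on cycle i + j + k.
        M, K = A.shape
        _, N = B.shape
        acc = np.zeros((M, N), dtype=np.int32)    # per-PE accumulators
        a_reg = np.zeros((M, N), dtype=np.int32)  # A operand held by PE(i, j)
        b_reg = np.zeros((M, N), dtype=np.int32)  # B operand held by PE(i, j)
        for t in range(M + N + K - 2):            # total pipeline cycles
            # Operands advance one PE to the right / one PE down per cycle.
            a_reg = np.roll(a_reg, 1, axis=1)
            b_reg = np.roll(b_reg, 1, axis=0)
            # Skewed injection at the left and top edges of the array.
            for i in range(M):
                k = t - i
                a_reg[i, 0] = A[i, k] if 0 <= k < K else 0
            for j in range(N):
                k = t - j
                b_reg[0, j] = B[k, j] if 0 <= k < K else 0
            # Every PE performs one multiply-accumulate per cycle.
            acc += a_reg * b_reg
        return acc

    # Check against a plain int32 matrix multiply.
    rng = np.random.default_rng(1)
    A = rng.integers(-128, 128, size=(4, 6), dtype=np.int8)
    B = rng.integers(-128, 128, size=(6, 3), dtype=np.int8)
    assert np.array_equal(systolic_matmul(A, B), A.astype(np.int32) @ B.astype(np.int32))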
Table of Contents: Abstract (Chinese) I
Abstract (English) II
Acknowledgments XVII
Table of Contents XVIII
List of Tables XX
List of Figures XXI
Chapter 1 Introduction 1
1-1 Preface 1
1-2 Motivation 1
1-3 Contributions 2
1-4 Thesis Organization 3
Chapter 2 Background 4
2-1 Deep Learning and Neural Networks 4
2-2 Convolutional Neural Networks 5
2-3 Quantization of Convolutional Neural Networks 8
2-4 Systolic Arrays 9
2-5 Tensor Processing Unit (TPU) 10
Chapter 3 Review of Convolutional Network Hardware Accelerators 13
3-1 CNN Acceleration Hardware Architectures 13
3-1-1 The DianNao Series 13
3-1-2 The Eyeriss Series 14
3-1-3 Low-Precision Network Accelerators 16
3-1-3-1 UNPU 16
3-1-3-2 QUEST 17
3-1-4 Systolic Array Accelerators 18
3-1-4-1 MPNA 18
3-1-4-2 VWA 19
3-2 Comparison of Related Approaches 19
Chapter 4 CNN Acceleration Hardware Design 21
4-1 Computing Unit 22
4-1-1 Systolic Architecture Design 22
4-1-2 SCALE-Sim 36
4-1-3 Fully-Connected Layer Computation 38
4-2 Memory Architecture 40
4-2-1 Fetcher Architecture 40
4-2-2 Storage Size and Energy Analysis 45
Chapter 5 Experimental Environment and Results Analysis 47
Chapter 6 Conclusion and Future Work 55
6-1 Conclusion 55
6-2 Future Work 55
References 56
References: [1] Chen, Yu-Hsin, et al. "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks." IEEE Journal of Solid-State Circuits 52.1: 127-138, 2016.
[2] Chen, Yu-Hsin, et al. "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices." IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9.2: 292-308, 2019.
[3] Lee, Jinmook, et al. "UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision." 2018 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2018.
[4] Ueyoshi, Kodai, et al. "QUEST: Multi-purpose log-quantized DNN inference engine stacked on 96-MB 3-D SRAM using inductive coupling technology in 40-nm CMOS." IEEE Journal of Solid-State Circuits 54.1: 186-196, 2018.
[5] Vasudevan, Aravind, Andrew Anderson, and David Gregg. "Parallel multi channel convolution using general matrix multiplication." 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2017.
[6] Samajdar, Ananda, et al. "SCALE-sim: Systolic CNN accelerator." arXiv preprint arXiv:1811.02883, 2018.
[7] Horowitz, Mark. "1.1 computing's energy problem (and what we can do about it)." 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE, 2014.
[8] Krizhevsky, Alex. "One weird trick for parallelizing convolutional neural networks." arXiv preprint arXiv:1404.5997, 2014.
[9] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6, pp. 386-408, 1958.
[10] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11: 2278-2324, 1998.
[11] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
[12] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556, 2014.
[13] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition, 2015.
[14] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[15] Redmon, Joseph, and Ali Farhadi. "Yolov3: An incremental improvement." arXiv preprint arXiv:1804.02767, 2018.
[16] M. Courbariaux, Y. Bengio, and J.-P. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in Advances in neural information processing systems, pp. 3123-3131, 2015.
[17] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in European Conference on Computer Vision, pp. 525-542, 2016.
[18] F. Li, B. Zhang, and B. Liu, "Ternary weight networks," arXiv preprint arXiv:1605.04711, 2016.
[19] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
[20] Hubara, Itay, et al. "Quantized neural networks: Training neural networks with low precision weights and activations." The Journal of Machine Learning Research, vol. 18, no.1, pp. 6869-6898, 2017.
[21] Kung, H.T., and Charles E. Leiserson. "Systolic arrays (for VLSI)." Sparse Matrix Proceedings 1978. Vol. 1. Society for Industrial and Applied Mathematics, 1979.
[22] Kung, H. T. "Why systolic architectures?" Computer 1: 37-46, 1982.
[23] Chen, Tianshi, et al. "Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning." ACM SIGARCH Computer Architecture News 42.1: 269-284, 2014.
[24] Chen, Yunji, et al. "Dadiannao: A machine-learning supercomputer." 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2014.
[25] Du, Zidong, et al. "ShiDianNao: Shifting vision processing closer to the sensor." Proceedings of the 42nd Annual International Symposium on Computer Architecture. 2015.
[26] Liu, Daofu, et al. "Pudiannao: A polyvalent machine learning accelerator." ACM SIGARCH Computer Architecture News 43.1: 369-381, 2015.
[27] Liu, Shaoli, et al. "Cambricon: An instruction set architecture for neural networks." 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016.
[28] Zhang, Shijin, et al. "Cambricon-x: An accelerator for sparse neural networks." 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016.
[29] Hanif, Muhammad Abdullah, et al. "MPNA: A massively-parallel neural array accelerator with dataflow optimization for convolutional neural networks." arXiv preprint arXiv:1810.12910, 2018.
[30] K. Chang and T. Chang, "VWA: Hardware Efficient Vectorwise Accelerator for Convolutional Neural Network," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 1, pp. 145-154, Jan. 2020.
[31] Jouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit." Proceedings of the 44th Annual International Symposium on Computer Architecture. 2017.
[32] Google (2020). "Cloud TPU System Architecture." Retrieved from https://cloud.google.com/tpu/docs/system-architecture.
[33] Parhi, Keshab K. "VLSI digital signal processing systems: design and implementation." John Wiley & Sons, 2007.
[34] 吳庭嘉 (2020). Design of a One-Dimensional Convolution Accelerator Supporting Data Reuse and Filter-Size Scalability and Its Electronic System Level Verification Platform. Institute of Electrical Engineering, National Cheng Kung University, Tainan.
Full-Text Use Authorization
  • On-campus browsing/printing of the electronic full text is authorized, publicly available from 2022-10-30.
  • Off-campus browsing/printing of the electronic full text is authorized, publicly available from 2022-10-30.

