System ID U0026-2011201800480700
Title (Chinese) 移植Tensorflow至CASLAB-GPUSIM模擬平台與矩陣函式庫優化
Title (English) Porting Tensorflow to CASLAB-GPUSIM and Optimization of Matrix Multiplication Library
University National Cheng Kung University
Department (Chinese) 電腦與通信工程研究所
Department (English) Institute of Computer and Communication Engineering
Academic Year 107
Semester 1
Year of Publication 107
Author (Chinese) 蘇郁翔
Author (English) Yu-Xiang Su
E-mail su2278370@gmail.com
Student ID Q36054057
Degree Master
Language Chinese
Pages 75
Committee Advisor - 陳中和
Committee Member - 邱瀝毅
Committee Member - 郭致宏
Committee Member - 林英超
Keywords (Chinese) 終端裝置, 通用繪圖處理器, 矩陣乘法, 機器學習
Keywords (English) Edge Device, GPGPU, Matrix Multiplication, Machine Learning
Subject Classification
Abstract (Chinese) With the rapid development of cloud computing, machine learning applications have gradually extended to edge devices. To support performance analysis during the development stage of edge hardware or of edge applications, this thesis integrates the machine learning framework Tensorflow with the OpenCL Runtime developed by our laboratory, successfully porting the Tensorflow Runtime to our CASLAB-GPUSIM simulation platform. We then carried out a series of system verifications with test programs written in Tensorflow, simulating machine learning application scenarios on edge devices.
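To make the verification flow concrete, the following is a minimal sketch of the kind of Tensorflow test program that could exercise the ported runtime. It is not the thesis's actual test code: the MNIST-like matrix sizes and the session-style Tensorflow 1.x API are assumptions, chosen to match the Tf-coriander generation of Tensorflow that the thesis builds on.

    # Minimal sketch (assumed sizes and API, not the thesis's test program):
    # run one MatMul kernel operation and check it against a CPU reference.
    import numpy as np
    import tensorflow as tf  # Tensorflow 1.x-style API

    x_np = np.random.rand(1, 784).astype(np.float32)   # MNIST-sized input row
    w_np = np.random.rand(784, 10).astype(np.float32)  # fully connected weights

    y = tf.matmul(tf.constant(x_np), tf.constant(w_np))  # a MatMul kernel op

    with tf.Session() as sess:   # the runtime dispatches this to the device
        device_result = sess.run(y)

    # Compare against a host-side reference to verify the ported runtime.
    cpu_reference = x_np @ w_np
    print("max abs error:", np.abs(device_result - cpu_reference).max())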
Beyond building this edge machine learning simulation platform, this thesis observes that in solutions that adopt a GPGPU as the edge accelerator, linear algebra libraries do not adapt to the application scenario or the available compute resources. Matrix multiplication is affected the most, since it is the basic computational unit of the convolution and fully connected layers in convolutional neural network models. In view of this, this thesis proposes an optimization of the matrix multiplication algorithm in the CLBlast library: for the computation patterns of edge machine learning applications, the library's preprocessing is reduced, which shortens the overall execution time of the matrix multiplication library.
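The reduction of both layer types to matrix multiplication can be seen in the standard im2col lowering; the short NumPy sketch below illustrates the idea. The function name and the single-channel 28x28 sizes are illustrative assumptions, not code from the thesis or from CLBlast.

    import numpy as np

    def im2col(image, k):
        # Unroll every k x k patch of a single-channel image into one row.
        h, w = image.shape
        out_h, out_w = h - k + 1, w - k + 1
        cols = np.empty((out_h * out_w, k * k), dtype=image.dtype)
        for i in range(out_h):
            for j in range(out_w):
                cols[i * out_w + j] = image[i:i + k, j:j + k].ravel()
        return cols

    image = np.random.rand(28, 28).astype(np.float32)      # MNIST-sized input
    filters = np.random.rand(3 * 3, 8).astype(np.float32)  # eight 3x3 filters

    # The whole convolution layer becomes a single GEMM:
    # (26*26 patches x 9) @ (9 x 8 filters) -> (676 x 8) feature maps.
    feature_maps = im2col(image, 3) @ filters
    print(feature_maps.shape)  # (676, 8)

A fully connected layer is the degenerate case in which one "patch" covers the entire input, so it is a plain matrix-vector product.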
Abstract (English) With the rapid development of cloud computing, machine learning applications have gradually expanded to edge devices. To analyze the performance of edge applications during the early development stage of edge hardware, we integrate Tensorflow with the GPGPU simulator CASLAB-GPUSIM.
In addition to building the edge-device simulation platform, we propose a matrix multiplication library for machine learning applications on edge devices that use a GPGPU as the acceleration solution. According to our experimental results, we achieve an average speedup of 5.6x in the fully connected layers of our benchmarks, including the MNIST model, LeNet-5, and MobileNet.
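To illustrate where such a speedup can come from, the sketch below estimates the padding overhead that a tiled GEMM pipeline incurs on an edge-sized fully connected layer. The tile size of 32 and the 1024x1000 MobileNet-like layer shape are assumptions for illustration; CLBlast's actual pre/post-processing kernels and tuned parameters vary by device.

    TILE = 32  # assumed tile size required by a tuned GEMM kernel

    def padded(n):
        # Round a dimension up to the next multiple of the tile size.
        return -(-n // TILE) * TILE

    # Batch-1 fully connected layer, MobileNet-like: (1 x 1024) @ (1024 x 1000).
    m, k, n = 1, 1024, 1000
    shapes = {"A (activations)": (m, k),
              "B (weights)": (k, n),
              "C (output)": (m, n)}

    for name, (r, c) in shapes.items():
        ratio = (padded(r) * padded(c)) / (r * c)
        print(f"{name}: {ratio:.1f}x elements after padding")
    # A and C blow up ~32x because the batch dimension of 1 is padded to a
    # full tile; those padding copies run as extra kernels before and after
    # the GEMM itself.

Skipping these pre/post-processing copies for matrices that a direct kernel can handle is the kind of saving behind the reported fully connected layer speedup.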
Table of Contents Abstract I
Summary II
Acknowledgements VI
List of Figures XI
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Contribution 2
1.3 Organization 2
Chapter 2 Background 3
2.1 Tensorflow Runtime 3
2.1.1 Tensorflow Kernel Operation 3
2.1.2 Tensorflow Stream Executor 5
2.1.3 Tf-coriander 6
2.2 OpenCL Runtime 7
2.2.1 OpenCL Programming Model 7
2.2.2 HSA Runtime 9
2.3 GPGPU Hardware 11
2.3.1 GPGPU Architecture 11
2.3.2 GPGPU Memory Model 14
Chapter 3 Related Research on Matrix Multiplication and Machine Learning 15
3.1 Convolution Neural Network 15
3.1.1 Convolution Layer 16
3.1.2 Pooling Layer 17
3.1.3 Fully Connected Layer 18
3.1.4 Activation Function 19
3.2 Matrix Multiplication in CNN 20
3.2.1 Implementation of Convolution Layer 20
3.2.2 Implementation of Fully Connected Layer 21
Chapter 4 Matrix Multiplication Optimization on GPGPU 23
4.1 Matrix Multiplication on GPGPU 23
4.2 Matrix Multiplication Optimization 24
4.2.1 Direct Implementation 25
4.2.2 Matrix Transposition 28
4.2.3 Shared Memory 29
4.2.4 Auto-Tuning Technique 30
4.3 Matrix Multiplication on Edge Device 33
4.3.1 Edge Computation 33
4.3.2 CASLAB Implementation 35
Chapter 5 Tensorflow Porting and Matrix Multiplication Library Implementation 38
5.1 Platform Introduction 38
5.2 Running Tensorflow on CASLAB-GPUSIM 42
5.2.1 OpenCL Runtime Implementation 43
5.2.2 Finalizer Implementation 44
5.3 Implementation of Matrix Multiplication 45
5.3.1 Kernel Operation Implementation 45
5.3.2 CLBlast Library 48
Chapter 6 Experiments on Matrix Multiplication for Edge Machine Learning Applications 52
6.1 Experiment Environment and Benchmarks 52
6.2 Verification of Tensorflow porting 55
6.3 Performance of CASLAB MM implementation 64
6.3.1 Performance Summary 64
6.3.2 MNIST Benchmarks 66
6.3.3 MobileNet Fully Connected Layer 69
6.4 Experiment Limitation and Recommendation 70
Chapter 7 Conclusion 71
References 72
References [1] “Movidius Official Website.” [Online]. Available: https://www.movidius.com/.
[2] “Tensorflow Official Website.” [Online]. Available: https://www.Tensorflow.org/.
[3] “Eigen Library Official Website.” [Online]. Available: https://eigen.tuxfamily.org/dox/.
[4] “Nvidia CUDA Toolkit.” [Online]. Available: https://developer.nvidia.com/cuda-downloads.
[5] “Documentation for StreamExecutor open source proposal.” [Online]. Available: https://github.com/henline/streamexecutordoc.
[6] “cuBLAS Official Website.” [Online]. Available: https://developer.nvidia.com/cublas.
[7] “Tf-coriander GitHub repository.” [Online]. Available: https://github.com/hughperkins/Tf-coriander.
[8] “Tuned OpenCL BLAS, CLBlast.” [Online]. Available: https://github.com/CNugteren/CLBlast.
[9] “EasyCL GitHub repository.” [Online]. Available: https://github.com/hughperkins/EasyCL.
[10] “coriander GitHub repository.” [Online]. Available: https://github.com/hughperkins/coriander/tree/f069f52b0574148c51151b7baee13616daba56f5.
[11] “The LLVM Compiler Infrastructure.” [Online]. Available: https://llvm.org/.
[12] A. Munshi, “OpenCL 1.2 Specification,” Version 1.2, p. 380, 2012.
[13] “Khronos Official Website.” [Online]. Available: https://www.khronos.org/.
[14] “OpenCL Offline Compiler.” [Online]. Available: https://github.com/HSAFoundation/CLOC.
[15] Khronos Group, “OpenCL API 1.2 Reference Card,” pp. 1–8, 2011.
[16] HSA Foundation, “HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer’s Guide, and Object Format (BRIG),” pp. 1–391, May 2013.
[17] HSA Foundation, “HSA Runtime Programmer’s Reference Manual,” pp. 1–147, 2015.
[18] “PTX ISA.” [Online]. Available: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html.
[19] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition, Morgan Kaufmann, 2006.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
[21] “The MNIST dataset.” [Online]. Available: http://yann.lecun.com/exdb/mnist/.
[22] “Linear Regression.” [Online]. Available: https://en.wikipedia.org/wiki/Linear_regression.
[23] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cuDNN: Efficient Primitives for Deep Learning,” arXiv preprint arXiv:1410.0759, pp. 1–9, 2014.
[24] “Tensorflow MNIST tutorial.” [Online]. Available: https://www.Tensorflow.org/tutorials/.
[25] “Tensorflow Lenet-5 Model.” [Online]. Available: https://blog.csdn.net/NNNNNNNNNNNNY/article/details/70216265.
[26] T. D. Han and T. S. Abdelrahman, “Reducing branch divergence in GPU programs,” Proc. Fourth Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-4), pp. 3:1–3:8, 2011.
[27] “Direct Implementation.” [Online]. Available: https://www.quantstart.com/articles/Matrix-Matrix-Multiplication-on-the-GPU-with-Nvidia-CUDA.
[28] X. Cui, Y. Chen, C. Zhang, and H. Mei, “Auto-tuning dense matrix multiplication for GPGPU with cache,” Proc. Int. Conf. on Parallel and Distributed Systems (ICPADS), pp. 237–242, 2010.
[29] B. Wu, F. Iandola, P. H. Jin, and K. Keutzer, “SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving,” arXiv preprint arXiv:1612.01051, pp. 129–137, 2016.
[30] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” arXiv preprint arXiv:1704.04861, 2017.
[31] X. Sun and N. Ansari, “EdgeIoT: Mobile Edge Computing for the Internet of Things,” IEEE Commun. Mag., vol. 54, no. 12, pp. 22–29, 2016.
[32] P. N. Glaskowsky, “NVIDIA’s Fermi: The First Complete GPU Computing Architecture,” white paper, pp. 1–26, Sep. 2009.
[33] K. Mo, “MS108 Computer System (1) Final Report: gpgpu-sim,” pp. 1–17, 2014.
[34] “SystemC Official Website.” [Online]. Available: http://www.accellera.org/downloads/standards/systemc.
[35] “GeForce 10 series Specification.” [Online]. Available: https://en.wikipedia.org/wiki/GeForce_10_series.
[36] “Adding a New Op.” [Online]. Available: https://www.Tensorflow.org/extend/adding_an_op.
[37] “SWIG Official Website.” [Online]. Available: http://www.swig.org/tutorial.html.
[38] “Tensorflow Tensorboard.” [Online]. Available: https://www.Tensorflow.org/guide/summaries_and_tensorboard.
[39] “Python3.3 time library.” [Online]. Available: https://docs.python.org/3/library/time.html.
Full-Text Usage Permission
  • Authorized for on-campus browsing/printing of the electronic full text, available from 2018-11-26.
  • Authorized for off-campus browsing/printing of the electronic full text, available from 2018-11-26.

