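System ID U0026-2510201813253400
Title (Chinese) 時序精確SIMT核心設計與實作
Title (English) Design of Cycle-accurate SIMT Core and Implementation
University National Cheng Kung University
Department (Chinese) 電腦與通信工程研究所
Department (English) Institute of Computer & Communication
Academic year 107 (ROC calendar)
Semester 1
Year of publication 107 (ROC calendar; 2018)
Author (Chinese) 鄭基漢
Author (English) Jhi-Han Jheng
Student ID Q36054031
Degree Master's
Language Chinese
Pages 62
Committee Advisor: 陳中和 (Chung-Ho Chen)
Committee member: 邱瀝毅
Committee member: 郭致宏
Committee member: 林英超
Keywords (Chinese) 通用繪圖處理器  時序精確模組
Keywords (English) GPGPU  Cycle-accurate Model
Subject classification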
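Abstract (translated from Chinese) In today's high-performance computing, GPUs are used to accelerate non-graphics applications. Both parallel algorithms and deep learning applications rely on the GPU for acceleration, so GPU design and implementation play an important role in the development of computing systems. Developing a GPU computing system, however, is a complex process: hardware and software must both be taken into account before the whole computing platform can be verified. The TLM methodology helps overcome the obstacles to realizing such a complex system. Its incremental development flow starts from highly abstract hardware models, builds a prototype of the software system, performs integrated hardware/software verification at an early development stage, and then progressively realizes more realistic hardware and software at lower abstraction levels.
Cycle-accurate is one of the abstraction levels defined by TLM; it requires describing the behavior of a hardware module at every clock edge. With a cycle-accurate specification, designers can develop lower-abstraction-level hardware modules from the functional model of the hardware. This thesis studies the design method for cycle-accurate models and applies it to describe in detail the design of the cycle-accurate SIMT core inside a GPU. It discusses how to specify a basic cycle-accurate model, analyzes and lists the functional requirements of the SIMT core architecture, and presents the microarchitecture-level hardware module design and performance metrics. Finally, integration tests are run on the CASLAB-GPUSim platform to analyze the performance metrics, examine performance bottlenecks, and compare performance against another low-end computing system. Test programs with good parallelism achieve speedups of 4.7 to 20.1 times, and when the GPU clock is raised to 1.2 GHz, GEMM achieves a 52.6 times speedup.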
Abstract (English) Developing a GPU computing platform requires both software and hardware development. To manage this complex development process, the TLM methodology builds the system incrementally, which makes verification and validation possible at an early development stage. The cycle-accurate model, the most detailed functional model in TLM, describes the behavior of a module at each clock edge and serves as the basis for implementing hardware modules that can be translated to RTL. We develop a cycle-accurate SIMT core using a basic cycle-accurate modeling approach and evaluate its performance on the CASLAB-GPUSim co-simulation platform. A performance comparison between a low-end GPU and a 1.2 GHz embedded CPU shows that the low-end GPU achieves a 4.7 to 20.1 times speedup in test cases with good parallelism. When the low-end GPU is clocked at 1.2 GHz, it achieves a 52.6 times speedup on GEMM, which is the most time-consuming operation in deep learning applications.
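The abstract characterizes a cycle-accurate model as one that describes a module's behavior at each clock edge. As a rough illustration of that idea only, the following plain C++ sketch (not SystemC, and not taken from the thesis) models a two-stage pipeline whose registers update together on every edge. The names `Reg`, `TwoStagePipe`, `tick`, and `run` are hypothetical, and the "add 1" stage is an arbitrary stand-in for real logic.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical clocked register: read() returns the value latched at the
// previous clock edge; write() takes effect only when commit() is called.
template <typename T>
class Reg {
public:
    explicit Reg(T reset = T{}) : cur_(reset), next_(reset) {}
    const T& read() const { return cur_; }
    void write(const T& v) { next_ = v; }
    void commit() { cur_ = next_; }  // called once per rising edge
private:
    T cur_;
    T next_;
};

// Hypothetical two-stage pipeline: stage 1 latches the input,
// stage 2 latches (stage 1 value + 1).
struct TwoStagePipe {
    Reg<std::uint32_t> s1{0};
    Reg<std::uint32_t> s2{0};
    std::uint64_t cycle = 0;

    // One rising clock edge: evaluate every stage from the current register
    // values, then commit all registers together.
    void tick(std::uint32_t in) {
        s2.write(s1.read() + 1);  // sees the value s1 held before this edge
        s1.write(in);
        s1.commit();
        s2.commit();
        ++cycle;
    }

    std::uint32_t out() const { return s2.read(); }
};

// Drive the pipeline with one input per cycle and record the output
// visible after each edge.
std::vector<std::uint32_t> run(const std::vector<std::uint32_t>& inputs) {
    TwoStagePipe p;
    std::vector<std::uint32_t> trace;
    for (std::uint32_t v : inputs) {
        p.tick(v);
        trace.push_back(p.out());
    }
    assert(p.cycle == inputs.size());  // exactly one edge per input
    return trace;
}
```

Calling `run({10, 20, 30})` yields the trace 1, 11, 21: each input reaches the output one edge after it is latched, because stage 2 always reads the value stage 1 held before the current edge. Separating evaluation from commit is one common way to keep the result independent of the order in which stages are evaluated; a SystemC model would typically get the same effect from signal update semantics.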
Table of contents Abstract III
List of Figures XIV
Chapter 1 Introduction 1
1.1 Motivation 2
1.2 Contribution 3
1.3 Organization 3
Chapter 2 Background 4
2.1 GPU 4
2.1.1 GPU Programming Model 4
2.1.2 GPU Architecture 5
2.1.3 Warp Scheduling and Branch Divergence Handling 6
2.2 ESL 6
2.2.1 TLM 6
2.2.2 SystemC 8
Chapter 3 Related Works 10
3.1 GPU Simulator 10
3.2 Cycle Accurate Specification 12
Chapter 4 Cycle-accurate SIMT Core 15
4.1 Cycle-accurate Specification in SystemC 15
4.2 Functional Requirement of SIMT Core 19
4.2.1 Workgroup Receive, Initialization, and Exit 19
4.2.2 Multiple Instruction Thread In-order Single Issuing 20
4.2.3 SIMT Masking 20
4.2.4 Workgroup Synchronization 21
4.2.5 Operand Collection 21
4.2.6 SIMT Execution 22
4.2.7 Data Write Back 22
4.2.8 Control Type Instruction Commitment 23
4.3 Design of Cycle-Accurate SIMT Core 24
4.3.1 Fetch Unit 24
4.3.2 Instruction Buffer 26
4.3.3 Scoreboard 28
4.3.4 SIMT Stack 29
4.3.5 Warp Scheduler 30
4.3.6 Collector Unit 32
4.3.7 Arbitrator 34
4.3.8 Dispatcher 35
4.3.9 SIMD Execution Unit 36
4.3.10 Load/Store Unit 38
4.3.11 Write-back Unit 39
4.3.12 Summary 41
4.4 Profiler 42
Chapter 5 CASLAB-GPUSIM Simulation Platform 43
5.1 Platform Introduction 43
5.2 Instruction Set Architecture 44
5.3 CASLAB-GPU 46
5.4 Runtime System 47
5.4.1 OpenCL Runtime 47
5.4.2 HAS Runtime 49
Chapter 6 Experiment and Evaluation 50
6.1 Experiment Evaluation 50
6.1.1 Simulation Environment 50
6.1.2 Benchmark 51
6.1.3 Profiling Result Analysis 52
6.1.4 Total Cycle Breakdown of Warp Scheduler 54
6.1.5 Performance Evaluation 56
6.2 Experimental Limitations and Suggestions 58
Chapter 7 Conclusion 59
References 60
References [1] Hennessy, John L., and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
[2] Black, David C., et al. SystemC: From the ground up. Vol. 71. Springer Science & Business Media, 2009.
[3] Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 163-174, 2009.
[4] Aamodt, T. M., and A. Boktor. "GPGPU-Sim 3.x: A performance simulator for many-core accelerator research." International Symposium on Computer Architecture (ISCA), http://www.gpgpu-sim.org/isca2012-tutorial. 2012.
[5] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. "GPUWattch: enabling energy optimizations in GPGPUs." ACM SIGARCH Computer Architecture News, Vol. 41, No. 3, pp. 487-498, 2013.
[6] Thomas, Donald, and Philip Moorby. "Cycle-Accurate Specification." The Verilog® Hardware Description Language (2002): 195-210.
[7] Chupilko, M., and A. Kamkin. "Developing cycle-accurate contract specifications for synchronous parallel-pipeline hardware: application to verification." Electronics Conference (BEC), 2010 12th Biennial Balti
[8] HSA Foundation, “HSA Programmer's Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer, and Object Format (BRIG),”
http://www.cs.nthu.edu.tw/~ychung/slides/HSA/HSA-PRM-1.02.pdf
[9] Khronos OpenCL Working Group, “The OpenCL Specification,”
https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf
[10] HSA Foundation. “Heterogeneous System Architecture,” http://www.hsafoundation.com/.
[11] Pouchet, Louis-Noël. "Polybench: The polyhedral benchmark suite." URL: http://www.cs.ucla.edu/pouchet/software/polybench (2012).
[12] Heng-Yi Chen, “An HSAIL ISA Conformed GPU Platform,” Thesis for Master of Science, Institute of Computer and Communication Engineering, National Cheng Kung University, July, 2015
[13] Kuan-Chieh Hsu, Chung-Ho Chen, “Performance Prediction Model on HSA-Compatible General-Purpose GPU System” the thesis for Master of Science. National Cheng Kung University, Tainan, Taiwan. 2016.
[14] Wan-Shan Hsieh, Chung-Ho Chen, “Micro-Architecture Optimization of HSA-Compatible GPU” the thesis for Master of Science. National Cheng Kung University, Tainan, Taiwan. 2016.
[15] Sen-Chih Tsai, Chung-Ho Chen, “Optimization of Workgroup Scheduling on CASLAB-GPUSIM” the thesis for Master of Science. National Cheng Kung University, Tainan, Taiwan. 2017.
[16] Chien-Ming Chiu, Chung-Ho Chen, “GPU Warp Scheduling Using Memory Stall Sampling on CASLAB-GPUSIM” the thesis for Master of Science. National Cheng Kung University, Tainan, Taiwan. 2017.
[17] Bo-Xiang Zeng, Chung-Ho Chen, “Architecture Exploration and Optimization of CASLAB-GPUSIM Memory Subsystem” the thesis for Master of Science. National Cheng Kung University, Tainan, Taiwan. 2017.
[18] Chetlur, Sharan, et al. "cudnn: Efficient primitives for deep learning." arXiv preprint arXiv:1410.0759 (2014).
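Full-text usage rights
  • Authorized for on-campus browsing/printing of the electronic full text, publicly available from 2018-11-05.
  • Authorized for off-campus browsing/printing of the electronic full text, publicly available from 2018-11-05.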

