System ID: U0026-0408201611443700
Title (Chinese): HSA繪圖處理器之效能預測模型
Title (English): Performance Prediction Model on HSA-Compatible General-Purpose GPU System
University: National Cheng Kung University
Department (Chinese): 電腦與通信工程研究所
Department (English): Institute of Computer and Communication Engineering
Academic year: 104
Semester: 2
Year of publication: 105 (2016)
Author (Chinese): 許冠傑
Author (English): Kuan-Chieh Hsu
Student ID: q36034049
Degree: Master
Language: English
Pages: 66
Committee: Advisor - Chung-Ho Chen (陳中和)
Committee member - 蕭勝夫
Committee member - 李宗南
Committee member - 郭致宏
Committee member - 蘇文鈺
Keywords (Chinese): memory system, multicore simulation platform, prediction model
Keywords (English): Memory system, Multicore simulation, Prediction model
Abstract (Chinese): In this thesis, we first present a fully implemented memory subsystem. Its host platform is the custom general-purpose GPU (GPGPU) architecture previously developed in our laboratory. To enable fast development and implementation of such a chip, at the early design stage we extended the earlier C++ simulator with timing models, striving to keep the memory traffic simulation complete and detailed while keeping simulation time short, because memory traffic accounts for most of a benchmark's simulation time. For example, a level-one cache access request sent onto the network-on-chip (NoC) experiences a non-constant latency, and choices such as the cache coherence scheme and the memory controller scheduler all affect the latency seen by the architecture's cores. For memory space allocation, we examine both coarse-grain and fine-grain interleaving methods. In the NoC module, we discuss why the mesh topology is geometrically robust and adopt this topology.
Another feature of this thesis is a machine-learning performance prediction model. Using kmeans and SVM models, we can determine, for each benchmark, the parameter values that yield the best performance across all possible hardware configurations, and further predict the hardware parameters under which an unseen benchmark will achieve its peak performance. The kmeans algorithm clusters the performance results of all training benchmarks into several groups with similar characteristics, which serve as reference templates for the prediction model. The SVM model is then trained on memory-system-related measurements to produce the final predictor. Since we argue that program performance is dominated by the memory subsystem, analysis using only such features achieves, under an eight-cluster setting, at least 46.48% of test points within 10% error; when the number of clusters is varied, up to 57.97% of test points fall within 10% error. Finally, we find that the best performance does not necessarily occur at the maximum hardware resource count, because traffic on the NoC is often congested.
Combining the memory subsystem development with the prediction model, we aim to provide a reliable and accurate early-stage development platform, so that future chip implementations can build on the present performance exploration and be completed quickly.
Abstract (English): In this thesis, we present the memory subsystem of a customized general-purpose GPU architecture. For fast development, the C++ simulated architecture should be kept lightweight while remaining timing accurate, since most of the benchmark simulation time comes from memory-subsystem-related latencies. For example, a level-one cache miss triggers network-on-chip (NoC) traffic, and the cache coherence and memory controller scheduling policies also affect the latency seen by a streaming multiprocessor in this GPGPU architecture. We also discuss memory space partitioning methods in a later section, covering both coarse-grain and fine-grain partitioning. For the NoC module, we adopt previous research in this work and discuss the geometric features of the chosen topology, a mesh structure, for its robustness.
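To make the coarse-grain versus fine-grain distinction concrete, below is a minimal sketch of memory-space partitioning modeled as address-to-channel interleaving, one common way to realize it. The channel count, block sizes, and function names are illustrative assumptions, not the thesis's actual parameters.

```python
# Hypothetical sketch: coarse-grain vs. fine-grain memory space
# partitioning modeled as address-to-channel interleaving.
# NUM_CHANNELS and the bit positions are assumed values.

NUM_CHANNELS = 8     # assumed number of DRAM channels/controllers
FINE_BITS = 8        # fine grain: switch channel every 256-byte block
COARSE_BITS = 20     # coarse grain: switch channel every 1 MiB region

def channel_fine(addr: int) -> int:
    """Fine grain: consecutive small blocks map to different channels,
    spreading one kernel's traffic across the whole memory system."""
    return (addr >> FINE_BITS) % NUM_CHANNELS

def channel_coarse(addr: int) -> int:
    """Coarse grain: a large contiguous region maps to a single channel,
    preserving locality but risking traffic concentration on one spot."""
    return (addr >> COARSE_BITS) % NUM_CHANNELS

if __name__ == "__main__":
    addrs = [i * 256 for i in range(16)]       # a strided access pattern
    print([channel_fine(a) for a in addrs])    # rotates over all 8 channels
    print([channel_coarse(a) for a in addrs])  # stays on channel 0
```

Under fine-grain interleaving the strided pattern above spreads evenly over the channels, while under coarse-grain interleaving it lands on a single channel, which is one way the traffic-concentration effects discussed below can arise.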
Another contribution of this work is the use of two machine learning models to predict architecture performance and depict the performance trend across a large set of hardware configuration settings. We aim to estimate a reasonable summit value on the performance surface by the following procedure. First, the kmeans algorithm clusters the training benchmarks into a chosen number of clusters. A multi-class support vector machine (SVM) is then trained on memory-related features only. During the validation phase, the summit performance values of the testing benchmarks are predicted from the training results. Under an eight-cluster setting, 46.48% of the predicted cycle counts across all tested benchmarks are within 10% error of the real performance values; by varying the number of clusters, up to 57.97% of the points fall within 10% error. We also show that summit performance does not necessarily occur under maximum hardware resources; our discussion points out memory traffic issues that significantly slow down certain benchmark access patterns.
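As a rough illustration of this two-stage flow, here is a minimal sketch using scikit-learn's KMeans and SVC in place of the R e1071 package cited in the references; the feature set, array shapes, and data are hypothetical stand-ins.

```python
# Minimal sketch of the kmeans + multi-class SVM prediction flow.
# All data here is random; real inputs would be simulator measurements.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Assumed training data: one row per training benchmark.
# Features: memory-related measurements (e.g. L1 hit rate, coalescing
# rate, NoC traffic), hypothetical choices for this sketch.
train_features = rng.random((40, 3))
# Performance surface: normalized performance of each benchmark across
# a sweep of hardware configurations (16 configurations assumed).
train_surfaces = rng.random((40, 16))

# Stage 1: cluster the performance surfaces into k groups of similar
# behavior; each cluster centroid acts as a reference template.
k = 8
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(train_surfaces)

# Stage 2: train a multi-class SVM to map memory-related features
# to a cluster label.
svm = SVC(kernel="rbf")
svm.fit(train_features, cluster_ids)

# Prediction for an unseen benchmark: classify it into a cluster from
# its memory features, then read the predicted summit (best-performing
# configuration) off that cluster's template surface.
new_features = rng.random((1, 3))
cluster = svm.predict(new_features)[0]
template = kmeans.cluster_centers_[cluster]
best_config = int(np.argmax(template))
print(f"predicted cluster {cluster}, best configuration index {best_config}")
```

The prediction error can then be measured by comparing the template surface against the benchmark's real simulated surface, which matches the spirit of the 10%-error statistics reported above.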
Combining these contributions, we aim to provide a reliable and accurate early-stage simulation platform for future IC chip implementation in an efficient way.
Table of Contents
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Contribution 2
1.3 Organization 3
Chapter 2 Background 4
2.1 System Platform Overview 4
2.2 Compile Flow 5
2.4 HSAIL Programming Model 6
2.4 HSAIL Instruction Set 7
2.4.1 MAD (multiply-add) 7
2.4.2 CBR (conditional branch) 8
2.4.3 LD (load instruction) 8
2.4.4 Barrier 9
2.5 Architecture Overview 9
2.6 Warp Scheduling 10
2.6.1 Round Robin Scheduling 11
2.6.2 GTO Scheduling 11
2.6.3 Two Level Scheduling 12
2.6.4 Other Scheduling Policies 12
2.7 SIMD Divergence Scheme 13
Chapter 3 Related Work 14
3.1 Related Work 14
Chapter 4 Memory Subsystem 15
4.1 Pyramid Design of Memory Space 15
4.2 Memory Space Partition 17
4.3 Coalescer 17
4.4 Cache Hierarchy 19
4.4.1 Coherence Issues 20
4.5 Miss Status Holding Registers Merging Cases 21
4.6 Interconnection Design 22
4.7 NoC Architecture 25
4.8 DRAM System 27
4.8.1 Memory Controller Scheduling 29
4.9 Summary 33
Chapter 5 Prediction Model 34
5.1 Model Construction 34
5.1.1 Performance Surface 36
5.1.2 Kmeans Clustering 37
5.1.3 Multi-class SVM Classifier 38
5.2 Estimation 41
Chapter 6 Experiment Results 42
6.1 Memory System 42
6.1.1 Cache Hit Ratio 43
6.1.2 Relationship between Coalescing Rate and L1 Concurrency 44
6.1.3 NoC Traffic 45
6.1.4 Fine-Grain & Coarse-Grain Comparison 46
6.1.5 Memory Controller Scheduling 49
6.2 Prediction Model 51
6.2.1 Diversity of Extended Benchmark Suite 51
6.2.2 Special Cases (1) – Cluster Five: Memory Access Traffic Concentration 53
6.2.3 Special Cases (2) – Cluster Eight: Another Extreme Case 55
6.2.4 Surface Error 57
6.2.5 Sensitivity to the Number of Clusters 59
6.2.6 Discussion and Future Work 60
Chapter 7 Conclusion 62
References 64

References
[1] CLOC source code download. [Online] Available: https://github.com/HSAFoundation/HSA-Docs-AMD/wiki/CLOC-Compiler-and-Sample-SDK
[2] Heterogeneous System Architecture standard. [Online] Available: http://www.hsafoundation.com/standards/
[3] Timothy G. Rogers, Mike O’Connor, and Tor M. Aamodt, “Cache-Conscious Wavefront Scheduling,” in MICRO, 2012.
[4] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” in Proc. of the 44th International Symposium on Microarchitecture (MICRO-44), Dec 2011.
[5] B. Pichai, L. Hsu, and A. Bhattacharjee, “Architectural Support for Address Translation on GPUs: Designing Memory Management Units for CPU/GPUs with Unified Address Spaces,” in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2014, pp. 743–758.
[6] Shin-Ying Lee, Akhil Arunkumar, and Carole-Jean Wu, “CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads,” in Proceedings of International Symposium on Computer Architecture (ISCA), 2015.
[7] Yun-Chi Huang, Kuan-Chieh Hsu, Wan-shan Hsieh, Chen-Chieh Wang, Chia-Han Lu, and Chung-Ho Chen, “Dynamic SIMD Re-Convergence with Paired-Path Comparison,” in Proceeding of IEEE International Symposium on Circuits and Systems (ISCAS), 2016.
[8] M. Rhu and M. Erez, “The Dual-path execution model for efficient GPU control flow,” High Performance Computer Architecture (HPCA), 2013.
[9] A. Bakhoda, G. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” in IEEE ISPASS, April 2009.
[10] Jason Power, Joel Hestness, Marc S. Orr, Mark D. Hill, and David A. Wood, “gem5-gpu: A heterogeneous CPU-GPU Simulator,” Computer Architecture Letters vol. 13, no. 1, Jan 2014.
[11] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood, “The gem5 Simulator,” ACM SIGARCH Computer Architecture News. May 2011.
[12] Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, and David Kaeli, “Multi2Sim: A Simulation Framework for CPU-GPU Computing,” PACT, 2012.
[13] G. Wu, J. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou, “GPGPU performance and power estimation using machine learning,” in: Proceedings of IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 564–576.
[14] Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O’Connor, and Tor M. Aamodt, “Cache Coherence for GPU Architectures,” HPCA 2013.
[15] Chien-Hsuan Yen, “A Memory-Efficient NoC System for Manycore Platform,” Thesis for Master of Science, Institute of Computer and Communication Engineering, National Cheng Kung University, July, 2014.
[16] A. W. Yin, T. C. Xu, P. Liljeberg, and H. Tenhunen, “Explorations of Honeycomb Topologies for Network-on-Chip,” in Sixth IFIP International Conference on Network and Parallel Computing, pp. 73–79, 2009.
[17] SK Hynix website. [Online] Available: https://www.skhynix.com/eng/index.jsp
[18] Scott Rixner, William J. Dally, Ujval J. Kapasi, et al., “Memory Access Scheduling,” in Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA ’00), pp. 128–138.
[19] Onur Mutlu and Thomas Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems,” in ISCA-35, 2008.
[20] Onur Mutlu and Thomas Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” in MICRO-40, 2007.
[21] Heng-Yi Chen, “An HSAIL ISA Conformed GPU Platform,” Thesis for Master of Science, Institute of Computer and Communication Engineering, National Cheng Kung University, July, 2015.
[22] Souley Madougou, Ana Varbanescu, Cees de Laat, and Rob van Nieuwpoort, “The landscape of GPGPU performance modeling tools,” Parallel Computing 56 (2016) 18-33.
[23] E1071 package download. [Online] Available: https://cran.r-project.org/web/packages/e1071/index.html
[24] AMD APP SDK download. [Online] Available: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/
[25] NVIDIA OpenCL benchmarks download. [Online] Available: https://developer.nvidia.com/opencl
[26] Rodinia benchmark suite download. [Online] Available: http://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Rodinia:Accelerating_Compute-Intensive_Applications_with_Accelerators
[27] Open source lecture video from Professor Onur Mutlu. [Online] Available: https://www.youtube.com/watch?v=tpQPN01i3GA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=26
Full-text usage rights
  • Authorized for on-campus browsing/printing of the electronic full text, available from 2020-08-31.
  • Authorized for off-campus browsing/printing of the electronic full text, available from 2020-08-31.