System ID U0026-1608201716391000
Title (Chinese) A GPU Warp Scheduling Mechanism Using Memory Stall Sampling and Its Implementation on CASLAB-GPUSIM (使用記憶體延遲取樣之繪圖處理器執行緒排程機制與其在CASLAB-GPUSIM上之實現)
Title (English) GPU Warp Scheduling Using Memory Stall Sampling on CASLAB-GPUSIM
University National Cheng Kung University
Department (Chinese) Institute of Computer and Communication Engineering
Department (English) Institute of Computer & Communication
Academic Year 105
Semester 2
Publication Year 106
Student (Chinese) 邱健鳴
Student (English) Chien-Ming Chiu
Student ID Q36044256
Degree Master's
Language English
Pages 75
Committee Advisor: 陳中和
Convener: 朱元三
Committee Member: 黃稚存
Committee Member: 邱瀝毅
Committee Member: 劉濱達
Keywords GPU, Warp scheduling, Memory contention
Subject Classification
Abstract (Chinese) In recent years, as applications that process massive amounts of parallel data, such as data mining, machine learning, and image recognition, have become increasingly popular, graphics processing units (GPUs) have been widely used to accelerate these non-graphics workloads. Modern GPUs rely on a very high degree of multithreading and the fine-grained multithreading technique to hide computation and pipeline latencies. However, recent studies show that memory contention is one of the most serious bottlenecks preventing modern GPUs from reaching peak performance. The more threads run concurrently, the worse memory contention becomes, because the memory system is overloaded; yet too few concurrent threads weaken the ability to hide computation and pipeline latencies. We propose Memory-Contention Aware Warp Scheduling to find a balance between memory system resources and workload. The mechanism uses dynamic sampling to accurately identify the severity of memory contention and provides the most suitable degree of thread concurrency for each situation. Our experimental results show that, for cache-sensitive workloads, the proposed scheduler achieves a geometric-mean speedup of up to 96.4% over the baseline Loose Round-Robin scheme on GPGPU-Sim. In addition, it achieves an overall performance improvement of 17.4% on CASLAB-GPUSIM.
Abstract (English) In recent years, Graphics Processing Units (GPUs), well known for parallel computing, have been widely adopted to accelerate non-graphics workloads such as data mining, machine learning, and image recognition. Modern GPUs rely on a huge number of concurrent threads and the fine-grained multithreading technique to overlap operation latencies. However, recent research has shown that memory contention is one of the most serious bottlenecks preventing modern GPUs from achieving peak performance. Memory contention worsens as the degree of multithreading rises, because the memory system becomes overloaded, while a low degree of multithreading weakens latency-hiding ability. We propose Memory-Contention Aware Warp Scheduling (MAWS) to strike a balance between memory workload and memory resources. This scheme uses dynamic sampling to accurately gauge the severity of memory contention and provides an appropriate degree of thread concurrency accordingly. Our experiments show that MAWS achieves a geometric-mean speedup of 96.4% over the baseline Loose Round-Robin scheduler for cache-sensitive workloads on GPGPU-Sim. MAWS also achieves an overall speedup of 17.4% on CASLAB-GPUSIM.
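The mechanism summarized in the abstract (sample memory stalls over an interval, then cap warp concurrency according to the measured severity) can be sketched in a few lines. The sketch below is an illustrative reconstruction, not the thesis's actual implementation: the stall threshold, sample interval, warp count, and the linear severity-to-concurrency mapping are all assumptions.

```python
# Illustrative sketch of sampling-based warp throttling, loosely modeled on
# the MAWS idea described in the abstract. All parameters here (threshold,
# interval, severity mapping) are hypothetical, not taken from the thesis.

LSU_STALL_THRESHOLD = 0.5   # fraction of sampled cycles stalled on the LSU (assumed)
SAMPLE_INTERVAL = 1000      # cycles per sampling window (assumed)
MAX_WARPS = 48              # warps per SIMT core (typical for Fermi-class GPUs)

def severity_to_active_warps(stall_fraction: float) -> int:
    """Map the sampled LSU-stall fraction to a permitted warp concurrency."""
    if stall_fraction < LSU_STALL_THRESHOLD:
        return MAX_WARPS    # little contention: allow full multithreading
    # Heavier contention -> fewer concurrently schedulable warps (linear mapping).
    scale = 1.0 - (stall_fraction - LSU_STALL_THRESHOLD) / (1.0 - LSU_STALL_THRESHOLD)
    return max(1, int(MAX_WARPS * scale))

class SamplingScheduler:
    """Tracks LSU stalls per sampling window and throttles warp concurrency."""

    def __init__(self) -> None:
        self.cycle = 0
        self.stall_cycles = 0
        self.active_warps = MAX_WARPS

    def tick(self, lsu_stalled: bool) -> None:
        """Called once per core cycle with the LSU stall signal."""
        self.cycle += 1
        if lsu_stalled:
            self.stall_cycles += 1
        if self.cycle % SAMPLE_INTERVAL == 0:
            fraction = self.stall_cycles / SAMPLE_INTERVAL
            self.active_warps = severity_to_active_warps(fraction)
            self.stall_cycles = 0   # start a new sampling window
```

Under this sketch, a window in which most cycles stall on the load/store unit throttles concurrency sharply, while a mostly stall-free window restores full multithreading; Sections 5.3 and 5.4 of the thesis study precisely these two knobs (the LSU stall threshold and the sample interval).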
Table of Contents Abstract (Chinese) I
ABSTRACT II
Acknowledgements III
LIST OF CONTENTS IV
LIST OF TABLES VII
LIST OF FIGURES VIII
CHAPTER 1 INTRODUCTION 1
CHAPTER 2 BACKGROUND 4
2.1 Baseline GPU Architecture 4
2.2 Stall Factors 7
2.2.1 SP / SFU / LSU Stall 7
2.2.2 Dependency Stall 7
2.2.3 I-Buffer Stall 8
2.2.4 Warp Invalid 8
2.3 Performance Bottleneck Analysis 9
2.4 Impact of Scheduling on Memory Contention 13
2.4.1 Infinite Memory Resources 15
2.4.2 Loose Round-Robin Scheduling 16
2.4.3 Memory Contention Aware Warp Scheduling 17
CHAPTER 3 RELATED WORK 19
3.1 Reducing Memory Contention 19
3.1.1 Improving Cache Locality 19
3.1.2 Improving Latency Hiding Ability 20
3.1.3 Thread Block Scheduling 20
3.2 Improving Latency Hiding Ability 21
3.3 Resolving Warp Criticality Problem 22
CHAPTER 4 MEMORY-CONTENTION AWARE WARP SCHEDULING 23
4.1 Memory Contention Detection 23
4.2 MAWSα 28
4.3 MAWSβ 33
4.4 Multiple Schedulers 36
4.5 Hardware Overhead 37
CHAPTER 5 EXPERIMENTAL RESULTS 38
5.1 Methodology 38
5.2 Results 40
5.3 LSU Stall Threshold 47
5.4 Sample Interval 51
CHAPTER 6 CASLAB-GPUSIM SIMT CORE IMPLEMENTATION 53
6.1 CASLAB-GPUSIM SIMT Core 53
6.1.1 Fetch and Decode 54
6.1.2 Scoreboard 55
6.1.3 Issue 57
6.1.4 SIMT Stack 57
6.1.5 Operand Collector 58
6.1.6 SPs and SFU 60
6.1.7 LSU and Memory Subsystem 61
6.2 Custom Instruction Set Architecture 61
6.2.1 Heterogeneous System Architecture Intermediate Language (HSAIL) 61
6.2.2 Implemented Instructions in CASLAB-GPUSIM 62
6.2.3 Experimental Results 66
CHAPTER 7 CONCLUSION 69
REFERENCES 70
References [1] NVIDIA, "NVIDIA Fermi Compute Architecture Whitepaper," 2009. [Online]. Available: https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
[2] NVIDIA, "NVIDIA Kepler GK110 Architecture Whitepaper," 2012. [Online]. Available: https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.
[3] NVIDIA, "NVIDIA GeForce GTX 680 Whitepaper," 2012. [Online]. Available: http://la.nvidia.com/content/PDF/product-specifications/GeForce_GTX_680_Whitepaper_FINAL.pdf.
[4] NVIDIA, "NVIDIA GeForce GTX 980 Whitepaper," 2014. [Online]. Available: https://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF.
[5] NVIDIA, "NVIDIA GeForce GTX 750 Ti Whitepaper," 2014. [Online]. Available: http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf.
[6] NVIDIA, "NVIDIA GeForce GTX 1080 Whitepaper," 2016. [Online]. Available: http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf.
[7] Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi, "Characterizing and Improving the Use of Demand-Fetched Caches in GPUs," in International Conference on Supercomputing (ICS), 2012.
[8] Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi, "MRPB: Memory Request Prioritization for Massively Parallel Processors," in IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014.
[9] Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das, "Orchestrated Scheduling and Prefetching for GPGPUs," in International Symposium on Computer Architecture (ISCA), 2013.
[10] Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
[11] Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Andreas Moshovos, "Demystifying GPU Microarchitecture through Microbenchmarking," in IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), 2010.
[12] Mark Gebhart, Daniel R. Johnson, David Tarjan, Stephen W. Keckler, William J. Dally, Erik Lindholm, and Kevin Skadron, "A Hierarchical Thread Scheduler and Register File for Energy-efficient Throughput Processors," ACM Transactions on Computer Systems, 2011.
[13] Onur Kayıran, Adwait Jog, Mahmut T. Kandemir, and Chita R. Das, "Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
[14] Minseok Lee, Seokwoo Song, Joosik Moon, John Kim, Woong Seo, Yeongon Cho, and Soojung Ryu, "Improving GPGPU Resource Utilization Through Alternative Thread Block Scheduling," in International Symposium on High Performance Computer Architecture (HPCA), 2014.
[15] Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
[16] HSA Foundation, HSA Programmer's Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer, and Object Format (BRIG) Version 1.0 Final, HSA Foundation, 2015.
[17] Accellera Systems Initiative, "SystemC," 2017. [Online]. Available: http://www.vhdl.org/downloads/standards/systemc.
[18] Shin-Ying Lee and Carole-Jean Wu, "CAWS: Criticality-aware warp scheduling for GPGPU workloads," in International Conference on Parallel Architecture and Compilation Techniques (PACT), 2014.
[19] Shin-Ying Lee, Akhil Arunkumar, and Carole-Jean Wu, "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads," in ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 2015.
[20] Myung Kuk Yoon, Keunsoo Kim, Sangpil Lee, Won Woo Ro, and Murali Annavaram, "Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit," in ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016.
[21] NVIDIA, "CUDA Toolkit," NVIDIA, 2017. [Online]. Available: https://developer.nvidia.com/cuda-toolkit.
[22] Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos, "Auto-tuning a high-level language targeted to GPU codes," in Innovative Parallel Computing (InPar), 2012.
[23] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," in IEEE International Symposium on Workload Characterization (IISWC), 2009.
[24] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, "A Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads," in IEEE International Symposium on Workload Characterization, 2010.
[25] Timothy G. Rogers, Mike O'Connor, and Tor M. Aamodt, "Cache-Conscious Wavefront Scheduling," in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2012.
[26] Ankit Sethia, D. Anoushe Jamshidi, and Scott Mahlke, "Mascar: Speeding up GPU warps by reducing memory pitstops," in IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015.
[27] H.-Y. Chen, "An HSAIL ISA conformed GPU platform," Master's thesis, Institute of Computer and Communication Engineering, National Cheng Kung University, Taiwan, 2015.
[28] Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt, "Improving GPU performance via large warps and two-level warp scheduling," in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011.
[29] Myung Kuk Yoon, Seung Hun Kim, and Won Woo Ro, "DRAW: Investigating Benefits of Adaptive Fetch Group Size on GPU," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2015.
[30] Keunsoo Kim, Sangpil Lee, Myung Kuk Yoon, Gunjae Koo, Won Woo Ro, and Murali Annavaram, "Warped-preexecution: A GPU pre-execution approach for improving latency hiding," in IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016.
[31] B.-X. Zeng, "Architecture Exploration and Optimization of CASLAB-GPUSIM Memory Subsystem," Master's thesis, Institute of Computer and Communication Engineering, National Cheng Kung University, Taiwan, 2017.
[32] Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007.
[33] Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt, "Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware," ACM Transactions on Architecture and Code Optimization (TACO), Jun. 2009.
[34] Samuel Liu, John Erik Lindholm, Ming Y. Siu, Brett W. Coon, and Stuart F. Oberman, "Operand collector architecture," U.S. Patent 7,834,881 B2, Nov. 16, 2010.
[35] S.-C. Yu, "Design of Special Function Unit with Dual-Precision Function Approximation," Master's thesis, Institute of Electrical Engineering, National Cheng Kung University, Taiwan, 2017.
[36] The Khronos Group Inc., "The open standard for parallel programming of heterogeneous systems," The Khronos Group Inc., 2017. [Online]. Available: https://www.khronos.org/opencl/.
[37] Advanced Micro Devices, Inc., "APP SDK – A Complete Development Platform," Advanced Micro Devices, Inc., 2017. [Online]. Available: http://developer.amd.com/amd-accelerated-parallel-processing-app-sdk/.
Full-Text Access Rights
  • On-campus browsing/printing of the electronic full text is authorized, available to the public from 2017-08-29.
  • Off-campus browsing/printing of the electronic full text is authorized, available to the public from 2017-08-29.

