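System ID U0026-1608201714094500
Title (Chinese) 繪圖處理器之執行緒區塊排程優化與其在CASLAB-GPUSIM上之實現 (Optimization of GPU Thread Block Scheduling and Its Implementation on CASLAB-GPUSIM)
Title (English) Optimization of Workgroup Scheduling on CASLAB-GPUSIM
University National Cheng Kung University
Department (Chinese) 電腦與通信工程研究所
Department (English) Institute of Computer & Communication
Academic year 105 (ROC calendar; 2016-2017)
Semester 2
Publication year 106 (ROC calendar; 2017)
Author (Chinese) 蔡森至
Author (English) Sen-Chih Tsai
E-mail a2215689@gmail.com
Student ID Q36044191
Degree Master's
Language Chinese
Pages 60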
Oral examination committee Advisor: 陳中和
Committee member: 邱瀝毅
Committee member: 劉濱達
Committee chair: 朱元三
Committee member: 黃稚存
Keywords (Chinese) 繪圖處理器; multikernel; 多元程式; 排程; 執行緒級平行處理
Keywords (English) Graphics processing units; multikernel; multiprogramming; scheduling; thread-level parallelism
Subject classification
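Abstract (Chinese) General-purpose graphics processing units (GPGPUs) are attracting growing attention. Our laboratory has built CASLAB-GPUSIM, a GPGPU simulation platform based on the Single Instruction Multiple Thread (SIMT) architecture and written in the high-level language SystemC. The platform also includes a memory subsystem and a software programming interface, and it passes verification programs taken from Rodinia, AMD, and NVIDIA.
This thesis studies the performance of thread block (workgroup) scheduling on GPGPUs. It proposes the Kernel Aware Warp Scheduler (KWS) to ease the imbalance between kernel workloads and the hardware resources they use. KWS must be paired, at the workgroup scheduling level, with Mixed Concurrent Kernel Execution, which lets different kernels run on the same streaming multiprocessor; warp priorities are then adjusted according to kernel and instruction type, which raises hardware utilization and improves performance. The thesis also proposes the Profiling Based Workgroup Scheduler (PBWS) to ease the imbalance between kernel demand and memory subsystem resources. PBWS first uses static analysis to set an initial limit on the number of workgroups, then uses dynamic analysis to adjust the workgroup limit inside each streaming multiprocessor step by step. Finally, both mechanisms are implemented on the CASLAB-GPUSIM platform and evaluated experimentally for improvements in hardware utilization, cache hit rate, and performance.
In summary, when the GPU runs one arithmetic-intensive kernel and one memory-intensive kernel concurrently, KWS improves performance by about 20%; when the GPU runs a single kernel, PBWS improves performance by about 11%.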
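The two-phase throttling idea behind PBWS (a static initial workgroup limit, then step-by-step dynamic adjustment per streaming multiprocessor) can be illustrated with a minimal sketch. This record does not give the thesis's actual formulas, so everything below is an assumption made for illustration: the function names, the use of a memory-instruction ratio as the static input, and the use of measured IPC with a simple hill-climbing rule as the dynamic feedback.

```python
def initial_limit(mem_instr_ratio, max_workgroups):
    """Static phase (hypothetical rule): the more memory-bound a kernel looks,
    the lower the starting workgroup limit for a streaming multiprocessor."""
    limit = round(max_workgroups * (1.0 - mem_instr_ratio))
    # Clamp so at least one workgroup can always run.
    return max(1, min(max_workgroups, limit))


def adjust_limit(limit, prev_ipc, cur_ipc, direction, max_workgroups):
    """Dynamic phase (hypothetical rule): hill-climb on IPC.
    Keep stepping the same way while IPC does not drop; reverse when it drops."""
    if cur_ipc < prev_ipc:
        direction = -direction
    new_limit = max(1, min(max_workgroups, limit + direction))
    return new_limit, direction


if __name__ == "__main__":
    limit = initial_limit(mem_instr_ratio=0.5, max_workgroups=8)
    direction = -1  # assumed starting direction: try fewer workgroups first
    # IPC samples from successive profiling windows (made-up numbers).
    samples = [1.00, 1.20, 1.10, 1.15]
    for prev_ipc, cur_ipc in zip(samples, samples[1:]):
        limit, direction = adjust_limit(limit, prev_ipc, cur_ipc, direction, 8)
        print(limit, direction)
```

The clamp keeps the limit between one workgroup and the hardware maximum, so the adjustment can never stall a streaming multiprocessor or exceed its capacity. A real scheduler would likely use a memory-side signal such as cache hit rate rather than IPC alone.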
Abstract (English) General-purpose graphics processing units (GPGPUs) have become increasingly important in recent years. We developed CASLAB-GPUSIM, a GPGPU simulation platform based on the single instruction multiple thread (SIMT) architecture and written in SystemC. The platform also includes the memory subsystem and the software toolchain, and is verified with benchmarks from Rodinia, AMD, and NVIDIA.
This thesis examines the performance problems of workgroup scheduling and warp scheduling on CASLAB-GPUSIM and proposes two methods. The first is KWS, a kernel-aware warp scheduler, which must be used together with mixed concurrent kernel execution. KWS prioritizes warps by the kernel they belong to and by instruction type, easing the imbalance between kernel workload and hardware resources. The second is PBWS, a profiling-based workgroup scheduler, which restricts the maximum number of workgroups allocated to each streaming multiprocessor. PBWS mitigates the imbalance between the memory requests issued by a kernel and the memory subsystem. Both mechanisms are implemented in CASLAB-GPUSIM and evaluated with the benchmarks. KWS with mixed concurrent kernel execution yields a 20% speedup over traditional concurrent kernel execution with a Loose Round-Robin warp scheduler. PBWS yields an 11% speedup over a Round-Robin workgroup scheduler.
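To make the idea of prioritizing warps by kernel and instruction type concrete, here is a minimal sketch of a warp picker for a streaming multiprocessor that hosts warps from two kernels at once. The abstract does not state the actual priority ordering used by KWS, so the `PRIORITY` table, the `Warp` fields, and the `pick_warp` function are all hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical ranking: (kernel class, next instruction type) -> rank.
# Lower rank issues first. This ordering is an assumption, not the
# table used in the thesis.
PRIORITY = {
    ("memory", "mem"): 0,  # start long-latency loads early
    ("arith", "alu"): 1,   # keep the ALUs busy meanwhile
    ("memory", "alu"): 2,
    ("arith", "mem"): 3,
}


@dataclass
class Warp:
    warp_id: int
    kernel_class: str  # "arith" or "memory"
    next_instr: str    # "alu" or "mem"
    ready: bool = True


def pick_warp(warps):
    """Return the ready warp with the lowest rank; ties go to the lowest warp_id.
    Returns None when no warp is ready to issue."""
    ready = [w for w in warps if w.ready]
    if not ready:
        return None
    return min(ready, key=lambda w: (PRIORITY[(w.kernel_class, w.next_instr)], w.warp_id))


if __name__ == "__main__":
    warps = [
        Warp(0, "arith", "mem"),
        Warp(1, "memory", "alu"),
        Warp(2, "arith", "alu"),
        Warp(3, "memory", "mem", ready=False),
    ]
    print(pick_warp(warps))
```

A Loose Round-Robin scheduler, the baseline named in the abstract, would instead rotate through ready warps without looking at which kernel they belong to; the table lookup is the only part that makes this sketch kernel-aware.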
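Table of contents Abstract I
Acknowledgments VI
Contents VII
List of Tables X
List of Figures XI
Chapter 1 Introduction 1
1.1 Motivation 2
1.2 Contributions 5
1.3 Organization 6
Chapter 2 Background and Related Work 7
2.1 General-Purpose Graphics Processing Units 7
2.1.1 Single Instruction Multiple Thread 7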
2.1.2 Kernel Execution 8
2.1.3 Workgroup Scheduling 10
2.1.4 Warp Scheduling 10
2.1.5 Control Flow and Divergence 12
2.2 Related Work 13
Chapter 3 Optimization of Thread Block Scheduling 15
3.1 Mixed Workload Mechanism 15
3.1.1 Mixed Concurrent Kernel Execution 15
3.1.2 KWS: Kernel Aware Warp Scheduler 17
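3.2 Thread Throttling Mechanism 21
3.2.1 Static Kernel Analysis 21
3.2.2 Dynamic Kernel Analysis 23
Chapter 4 CASLAB-GPUSIM Software-Layer Simulation and Implementation of the Thread Block Scheduler 24
4.1 CASLAB-GPUSIM Platform Overview 25
4.1.1 GPU Instruction Set 25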
4.2 Runtime System 26
4.2.1 OpenCL Runtime 26
4.2.2 HSA Runtime 28
4.3 Driver Layer 29
4.4 Hardware Design of the Experimental Platform 30
4.4.1 Task Dispatch Unit 30
4.4.2 Streaming MultiProcessor 31
4.4.3 Memory Subsystem 32
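Chapter 5 Experimental Evaluation 33
5.1 Experimental Environment 33
5.1.1 Environment Settings 33
5.1.2 Benchmark Programs 34
5.2 Evaluation Results 34
5.2.1 Experimental Evaluation of Mixed Workload 34
5.2.2 Analysis of Mixed Workload 42
5.2.3 Experimental Evaluation of Thread Throttling 55
Chapter 6 Conclusion 57
6.1 Discussion of Experimental Results 57
References 58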
References [1] "Rodinia: A Benchmark Suite for Heterogeneous Computing," [Online]. Available: http://lava.cs.virginia.edu/Rodinia/download_links.htm.
[2] "AMD APP SDK – A Complete Development Platform," [Online]. Available: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/.
[3] "NVIDIA OpenCL SDK Code Samples," [Online]. Available: https://developer.nvidia.com/opencl.
[4] "THE GREEN500," [Online]. Available: https://www.top500.org/green500/.
[5] "TOP500," [Online]. Available: https://www.top500.org/.
[6] "Whitepaper NVIDIA’s Next Generation CUDA Compute Architecture: Fermi," NVIDIA, 2009. [Online]. Available: https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
[7] Minseok Lee, Seokwoo Song, Joosik Moon, John Kim, Woong Seo, Yeongon Cho, Soojung Ryu, "Improving GPGPU Resource Utilization Through Alternative Thread Block Scheduling," in High Performance Computer Architecture (HPCA), Orlando, FL, USA, 2014.
[8] NVIDIA Corporation, 2012. [Online]. Available: https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.
[9] Yun-Chi Huang, Kuan-Chieh Hsu, Wan-shan Hsieh, Chen-Chieh Wang, Chia-Han Lu, and Chung-Ho Chen, "Dynamic SIMD Re-Convergence with Paired-Path Comparison," in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), 2016.
[10] Onur Kayıran, Adwait Jog, Mahmut T. Kandemir, Chita R. Das, "Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs," in Parallel Architectures and Compilation Techniques (PACT), Edinburgh, Scotland, UK, 2013.
[11] T. Rogers, M. O'Connor, T. Aamodt, "Cache-Conscious Wavefront Scheduling," in 45th International Symposium on Microarchitecture (MICRO-45), Vancouver, BC, Canada, 2012.
[12] Qiumin Xu, Hyeran Jeon, Keunsoo Kim, Won Woo Ro, Murali Annavaram, "Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming," in Proceedings of the 43rd International Symposium on Computer Architecture, Seoul, South Korea, 2016.
[13] "HSA Foundation github," [Online]. Available: https://github.com/HSAFoundation/.
[14] H-Y. Chen, C-H. Chen, "An HSAIL ISA Conformed GPU Platform," M.S. thesis, National Cheng Kung University, Tainan, Taiwan, 2015.
[15] "The OpenCL Specification Version: 2.0," Khronos OpenCL Working Group, 2014.
[16] "HSA Runtime Programmer’s Reference Manual Version 1.0," HSA Foundation, 2015.
[17] "HSA Platform System Architecture Specification Version 1.0 Final," HSA Foundation, 2015.
[18] "HSA Programmer's Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer, and Object Format (BRIG) Version 1.0 Final," HSA Foundation, 2015.
[19] Wan-Shan Hsieh, Chung-Ho Chen, "Micro-Architecture Optimization of HSA-Compatible GPU," M.S. thesis, National Cheng Kung University, Tainan, Taiwan, 2017.
[20] C-M. Chiu, C-H. Chen, "GPU Warp Scheduling Using Memory Stall Sampling on CASLAB-GPUSIM," M.S. thesis, National Cheng Kung University, Tainan, Taiwan, 2015.
[21] B-X. Zeng, C-H. Chen, "Architecture Exploration and Optimization of CASLAB-GPUSIM Memory Subsystem," M.S. thesis, National Cheng Kung University, Tainan, Taiwan, 2017.
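Full-text access rights
  • Authorized for on-campus browsing and printing of the electronic full text, publicly available from 2017-08-28.
  • Authorized for off-campus browsing and printing of the electronic full text, publicly available from 2017-08-28.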