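System ID U0026-1108201516051200
Title (Chinese) 使用對徑比較法之動態單指令多數據流收斂
Title (English) Dynamic SIMD Re-convergence with Paired-Path Comparison
University National Cheng Kung University
Department (Chinese) 電腦與通信工程研究所
Department (English) Institute of Computer & Communication
Academic year 103 (ROC calendar; 2014-2015)
Semester 2
Year of publication 104 (ROC calendar; 2015)
Author (Chinese) 黃昀棨
Author (English) Yun-Chi Huang
Student ID Q36024109
Degree Master's
Language English
Pages 61
Committee Advisor: 陳中和
Committee member: 蕭勝夫
Committee member: 邱瀝毅
Committee member: 范倫達
Keywords (Chinese) GPGPU, OpenCL, SIMD Control Divergence
Keywords (English) GPGPU, OpenCL, SIMD Control Divergence
Subject classification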
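Abstract (Chinese, translated) In current GPGPU (general-purpose graphics processing unit) architectures, SIMD divergence is one of the main causes of degraded parallel computing performance. We evaluate a GPU simulator based on the HSAIL instruction set and run OpenCL kernels on it to observe the GPU's performance and results. The smallest unit of execution in SIMD is the wavefront, which corresponds to a thread in SISD. When a wavefront executes a conditional branch and the work-items in that wavefront evaluate the branch condition differently, work-items of the same wavefront must execute different instructions; this situation is called control divergence. Once control divergence occurs, an auxiliary mechanism must be enabled so that the wavefront can let its different work-items execute their different instructions in turn. Handling control divergence in this way requires cooperation between the compiler and the GPU, and the choice of handling algorithm also affects GPU performance under control divergence. This thesis proposes a new stack-based re-convergence mechanism that lets a wavefront re-converge on its own during execution. The mechanism can be used with or without the re-convergence hint instructions generated by the finalizer; without them, it requires no compiler support and avoids executing the extra instructions. With this dynamic re-convergence method, the GPU obtains an average improvement of 13.36% in activity factor when running programs with irregular control flow. The variant that does not rely on re-convergence hint instructions improves overall execution performance by saving the time otherwise spent executing the extra instructions.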
Abstract (English) SIMD divergence is one of the main causes of reduced parallel computing efficiency in contemporary GPGPU (general-purpose graphics processing unit) architectures. In this thesis, we evaluate a cycle-accurate, HSAIL-based GPU simulation platform under the OpenCL framework by offloading kernel programs to the simulator. A wavefront (the AMD term; NVIDIA calls it a warp) is a group of threads that execute the same instruction in SIMD fashion. When a wavefront executes a conditional branch instruction, its threads may proceed to different PCs if they have different branch targets; this is called SIMD control divergence. Re-convergence mechanisms help a divergent wavefront execute its instructions correctly. We develop a new dynamic stack-based re-convergence scheme that can be implemented with or without finalizer-generated re-convergence instructions. With the proposed scheme, a divergent warp re-converges dynamically and achieves a 13.36% average improvement in activity factor from opportunistic early re-convergence in unstructured control flow. Performance is better when warps re-converge without the finalizer-generated hint instructions.
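As a rough illustration of the control divergence and activity factor described in the abstract, the following Python sketch models a single if/else branch handled by a conventional stack that re-converges at the end of the branch. This is the textbook baseline, not the paired-path scheme proposed in the thesis. The 8-lane warp, the block lengths, the function names, and the definition of activity factor as the average fraction of active lanes per issued instruction are all assumptions made for the example.

```python
def run_if_else(cond, then_len, else_len, tail_len):
    """Return the active-lane mask of every instruction issued for
    `branch; then-block; else-block; tail` (hypothetical block lengths)."""
    full = (1 << len(cond)) - 1
    taken = sum(1 << lane for lane, c in enumerate(cond) if c)
    not_taken = full & ~taken
    issued = [full]                 # the branch itself runs with all lanes
    stack = [(full, tail_len)]      # re-convergence entry, popped last
    for mask, length in ((not_taken, else_len), (taken, then_len)):
        if mask:                    # skip a path that no lane takes
            stack.append((mask, length))
    while stack:                    # paths are serialized, one mask at a time
        mask, length = stack.pop()
        issued.extend([mask] * length)
    return issued


def activity_factor(issued, warp_size):
    """Average fraction of active lanes per issued instruction (assumed definition)."""
    active = sum(bin(mask).count("1") for mask in issued)
    return active / (len(issued) * warp_size)


if __name__ == "__main__":
    divergent = run_if_else([True] * 4 + [False] * 4, 2, 2, 1)
    uniform = run_if_else([True] * 8, 2, 2, 1)
    print(activity_factor(divergent, 8), activity_factor(uniform, 8))
```

In this toy model, a warp whose lanes split evenly issues both blocks with half of its lanes idle, while a uniform warp keeps every lane busy. Re-converging earlier than the end of the branch shortens the half-masked stretch, which is the kind of gain the abstract reports as an activity factor improvement.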
Table of contents
Chapter 1 Introduction
1.1 Motivation
1.2 Contribution
1.3 Organization
Chapter 2 Background
2.1 OpenCL Programming
2.1.1 OpenCL Platform and Execution Model
2.1.2 OpenCL Memory Model
2.1.3 OpenCL Framework
2.2 Heterogeneous System Architecture (HSA)
2.2.1 HSA Feature
2.2.2 HSAIL
2.3 General Purpose Computing on Graphics Processing Units (GPGPU)
2.3.1 Workitems of a Kernel mapping to a SM
2.3.2 Streaming Multiprocessors
2.3.3 Warp Scheduling
2.3.4 SIMD Divergence and Re-convergence Schemes
Chapter 3 Related Work
3.1 Dual-Path Execution Model
3.1.1 Execution Example
3.2 Implicit Stack-less Re-convergence
3.2.1 Re-convergence Mechanism
3.2.2 Divergent Control Flow Traversal
3.3 Unstructured Control Flow
Chapter 4 Dynamic Re-convergence in Dual-Path Stack
4.1 Observation
4.2 Re-convergence with Dynamic Paired-Path Comparison
4.2.1 Re-convergence Schemes Algorithm
4.2.2 Divergent Control Flow Traversal
4.2.3 Re-convergence Detection Methods
4.2.4 Behavior with Synchronization Barrier
4.2.5 Divergence Stack Implementation
Chapter 5 GPU Simulation Platform
5.1 Overview of HSAIL GPU Simulation Platform
5.2 Streaming Multiprocessor Pipeline
5.3 Finalizer
5.4 Configuration
Chapter 6 Benchmarks and Evaluation
6.1 Benchmarks
6.2 Evaluation
6.2.1 Activity Factor
6.2.2 LD/ST Unit Idle Ratio
6.2.3 SIMD Unit Utilization
6.2.4 Dynamic Instruction Counts
6.2.5 Overall Performance
Chapter 7 Conclusion
Reference
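Full-text access rights
  • On-campus browsing and printing of the electronic full text is authorized; publicly available from 2016-08-18.
  • Off-campus browsing and printing of the electronic full text is authorized; publicly available from 2016-08-18.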