System ID U0026-1608201715564600
Title (Chinese) 繪圖處理器之子記憶體架構探勘及優化與其在CASLAB-GPUSIM上之實現
Title (English) Architecture Exploration and Optimization of CASLAB-GPUSIM Memory Subsystem
University National Cheng Kung University
Department (Chinese) 電腦與通信工程研究所
Department (English) Institute of Computer & Communication Engineering
Academic year 105
Semester 2
Year of publication 106
Author (Chinese) 曾柏翔
Author (English) Bo-Xiang Zeng
Email an4001022@gmail.com
Student ID Q36044010
Degree Master's
Language Chinese
Pages 73
Committee Advisor: 陳中和
Member: 黃稚存
Member: 邱瀝毅
Member: 劉濱達
Convener: 朱元三
Keywords (Chinese) 快取記憶體架構  繪圖處理器  矩陣運算應用
Keywords (English) Cache Architecture  GPGPU  Matrix Operation
Subject classification
Abstract (Chinese) In today's deep learning applications, matrix operations such as matrix multiplication and convolution are indispensable basic computation kernels. However, the traditional GPU memory subsystem makes no architectural adjustment for the characteristics of matrix operations, so it suffers from poor performance on these applications, and the memory subsystem is one of the main factors limiting GPU performance. Modifying the memory subsystem architecture is therefore necessary for GPU chips that target deep learning applications.
This thesis proposes two cache optimization techniques tailored to the characteristics of matrix operations: the Read Bypass Scheme (RBS) and the Write Pseudo Allocate Policy (WPAP). RBS addresses the set-index contention caused by the 2D data layouts commonly used in matrix operations, while WPAP addresses the problems caused by the separation of input and output data addresses and by strided access patterns. In the preliminary evaluation, using the GPGPU-Sim platform as the baseline and running 11 matrix-operation benchmarks, RBS improves performance by 161%, WPAP improves performance by 17.3%, and combining the two techniques yields a 194.1% performance improvement. Finally, the optimized GPU memory subsystem architecture is integrated into our laboratory's CASLAB-GPUSIM, giving the lab's full-system simulation platform a high-performance memory subsystem.
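To make the set-index contention mentioned above concrete, here is a minimal sketch; the L1 geometry and matrix size are assumptions of mine for illustration, not the configuration used in the thesis. Walking a row-major matrix column by column produces a large power-of-two stride between consecutive accesses, so the whole walk maps onto only a few cache sets:

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative L1 geometry (assumed, not the thesis configuration):
// 128-byte lines and 64 sets, so the set index is (addr / 128) mod 64.
constexpr uint64_t kLineBytes = 128;
constexpr uint64_t kNumSets   = 64;

uint64_t SetIndex(uint64_t addr) {
  return (addr / kLineBytes) % kNumSets;
}

int main() {
  // A 1024 x 1024 float matrix stored row-major: a column walk advances
  // by 1024 * sizeof(float) = 4096 bytes per access.
  constexpr uint64_t base = 0x10000000;
  constexpr uint64_t row_stride = 1024 * sizeof(float);

  for (uint64_t row = 0; row < 8; ++row) {
    uint64_t addr = base + row * row_stride;
    // The walk alternates between just two of the 64 sets (0 and 32),
    // which is the kind of index contention RBS is meant to tolerate.
    std::printf("row %llu -> set %llu\n",
                static_cast<unsigned long long>(row),
                static_cast<unsigned long long>(SetIndex(addr)));
  }
  return 0;
}
```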
Abstract (English) The memory subsystem architecture plays a significant role in the performance of a general-purpose graphics processing unit (GPGPU). The traditional cache architecture is not specifically designed for matrix-operation applications, so it performs poorly on these benchmarks. However, deep learning has become a popular topic in recent years, and matrix operations are its basic building block. To design GPGPU chips for such future applications, modifying the memory subsystem is necessary.
To solve this problem, we propose two cache optimization techniques that improve GPGPU memory subsystem performance on matrix-operation benchmarks. The Read Bypass Scheme (RBS) maximizes memory-subsystem resource utilization: the cache stalls only when those resources are exhausted. The Write Pseudo Allocate Policy (WPAP) minimizes network-on-chip (NoC) traffic. Our results on the GPGPU-Sim platform show that RBS yields a 161% speedup and reduces GPU cache stalls by 72%, while WPAP yields an 11.6% speedup over a write-back, write-allocate cache and reduces requests to the network on chip by 18.7%. Finally, we implement the optimized memory subsystem in CASLAB-GPUSIM.
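Since the abstract only summarizes the two policies, the following conceptual sketch shows the decision logic they describe; the class, helper names, and exact conditions are assumptions for illustration, not the thesis's implementation. RBS forwards a read that cannot obtain a miss resource (e.g., a free MSHR) directly to the interconnect instead of stalling, and WPAP services a write miss without allocating or fetching the line, so no fill traffic reaches the NoC:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical request type standing in for the simulator's real structure.
struct Request { uint64_t addr; bool is_write; };

class L1Controller {
 public:
  // Returns true if the request was fully handled at the L1.
  bool Access(const Request& req) {
    return req.is_write ? HandleWrite(req) : HandleRead(req);
  }

 private:
  bool HandleRead(const Request& req) {
    if (TagHit(req.addr)) return true;   // ordinary read hit
    if (AllocateMshr(req)) return true;  // miss is tracked, no stall needed
    // Read Bypass Scheme (as summarized in the abstract): when MSHRs or
    // other miss resources are exhausted, forward the read straight to
    // the NoC/L2 instead of stalling the pipeline; the line is not cached.
    SendToNoc(req);
    return false;
  }

  bool HandleWrite(const Request& req) {
    if (TagHit(req.addr)) return true;   // write hit updates the line
    // Write Pseudo Allocate Policy (as summarized in the abstract): on a
    // write miss, do not allocate or fetch the line; forward the write
    // data downstream, avoiding fill traffic for write-once output data.
    SendToNoc(req);
    return true;
  }

  // Placeholder stubs; a real simulator would query tag arrays, MSHRs,
  // and the interconnect model here.
  bool TagHit(uint64_t) { return false; }
  bool AllocateMshr(const Request&) { return false; }
  void SendToNoc(const Request& req) {
    std::printf("to NoC: %s 0x%llx\n", req.is_write ? "W" : "R",
                static_cast<unsigned long long>(req.addr));
  }
};

int main() {
  L1Controller l1;
  l1.Access({0x1000, /*is_write=*/false});  // read miss -> bypassed to NoC
  l1.Access({0x2000, /*is_write=*/true});   // write miss -> pseudo allocate
  return 0;
}
```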
Table of Contents Abstract I
Summary II
Acknowledgements VII
Table of Contents VIII
List of Tables XI
List of Figures XII
Chapter 1 Introduction 1
1.1 Research Motivation 1
1.2 Research Contributions 2
1.3 Thesis Organization 2
Chapter 2 Background 3
2.1 Deep Learning Applications 3
2.2 Cache Coherence Issues 5
2.2.2 Software Cache Coherence Solutions 5
2.2.3 Hardware Cache Coherence Solutions 6
2.3 Introduction to the GPU Memory Subsystem 7
2.3.2 Coalescer 8
2.3.3 Miss Status Holding Register (MSHR) 9
2.3.4 Local Memory 10
Chapter 3 Related Work on GPU Memory Subsystem Performance 12
3.1 Cache Coherence Solutions 12
3.1.1 CPU Cache Coherence Solutions 12
3.1.2 GPU Cache Coherence Solutions 14
3.1.3 CPU-GPU Cache Coherence Solutions 16
3.1.4 Vendor Cache Coherence Solutions 17
3.1.5 Summary 17
3.2 GPU Architecture Optimization Techniques 18
3.2.1 Warp Scheduler 18
3.2.2 Prefetcher 19
3.2.3 Prefetcher + Warp Scheduler 20
3.2.4 GPU Memory Subsystem 21
3.2.5 Cache Bypassing 21
3.2.6 Summary 22
Chapter 4 GPU Memory Subsystem Optimization Techniques 23
4.1 Application Observations 23
4.1.2 Poor Cache Read Performance 24
4.1.3 Poor Cache Write Performance 32
4.2 Cache Optimization Techniques 36
4.2.1 Read Bypass Scheme (RBS) 36
4.2.2 Write Pseudo Allocate Policy (WPAP) 38
4.2.3 Hardware Cost Evaluation 43
4.3 Experimental Platform 43
4.4 Experimental Results 45
4.4.1 Read Bypass Scheme (RBS) 45
4.4.2 Write Pseudo Allocate Policy (WPAP) 47
4.4.3 RBS & WPAP Results Analysis 49
Chapter 5 Implementation of the CASLAB-GPUSIM Memory Subsystem 57
5.1 Introduction to the CASLAB-GPUSIM Platform 57
5.1.2 Heterogeneous System Architecture (HSA) 58
5.1.3 OpenCL & HSA Runtime 58
5.1.4 Streaming Multiprocessor (SM) 59
5.1.5 Network on Chip (NOC) 59
5.2 CASLAB-GPUSIM Implementation Details 60
5.2.2 GPU Memory Subsystem 60
5.2.3 Network on Chip Topology 60
5.2.4 Cache Coherence Issues 61
5.3 Experimental Evaluation 61
5.3.1 Experimental Platform and Benchmarks 61
5.3.2 GPU Memory Subsystem Performance 63
Chapter 6 Conclusion 66
References 67


Full-Text Access Permissions
  • On-campus browsing/printing of the electronic full text is authorized, available from 2017-08-28.
  • Off-campus browsing/printing of the electronic full text is authorized, available from 2017-08-28.

