進階搜尋


下載電子全文  
系統識別號 U0026-2601201921585700
論文名稱(中文) 應用通訊避免演算法執行GMRES數值解法於計算力學之分析
論文名稱(英文) Application of GMRES ( General Minimized RESidual ) method for computational mechanics applications using communication avoiding algorithms
校院名稱 成功大學
系所名稱(中) 機械工程學系
系所名稱(英) Department of Mechanical Engineering
學年度 107
學期 1
出版年 108
研究生(中文) 陳加恩
研究生(英文) Ja-En Chen
學號 N16054271
學位類別 碩士
語文別 英文
論文頁數 178頁
口試委員 指導教授-曾建洲
口試委員-楊天祥
口試委員-溫昌達
中文關鍵字 廣義最小殘值法(GMRES)  通訊避免演算法  MPI-CUDA叢集電腦  Open MPI通用訊息傳遞介面  圖形顯示卡(GPU) 
英文關鍵字 General Minimal Residual method (GMRES)  Communication Avoiding algorithm  MPI-CUDA cluster  Open MPI  Graphical processing Unit (GPU) 
學科別分類
中文摘要 本篇論文的目標是要分析適用於計算力學求解之平行化廣義最小殘值(GMRES)法,論文中所寫的程式架構為使用Open MPI做為使用C語言的平行介面。在本研究中共測試了三種計算力學中的係數矩陣以作為平行解法的測試,分別有二維穩態熱傳之係數矩陣、使用SIMPLE法求解穴流問題中Navier-Stokes方程時所求解的速度係數矩陣、使用有限元素法在固體力學中的剛性係數矩陣。
二維穩態熱傳為Laplace方程差分離散格式。使用SIMPLE求解二維穩態穴流場分佈其中共需解三個線性系統,分別求解兩個維度的速度和流場壓力。在研究中測試了將求解x方向速度的線性系統交由GMRES求解。其中有限元素法使用常應變三角形元素模擬中間有圓孔洞的薄平板以及T字型帶有倒圓角之薄板的案例。在模擬中計算其應力集中係數,且每個三角形元素的von Mises應力都會被計算並且從中找出最大值作為近似的最大應力。另外,為了改進使用一般桌上型電腦的硬體架構所限制的解法最大運算量,論文中亦發展了由MATLAB能直接產出對應於有限元素法之剛性係數矩陣的CSR壓縮形式之方法。
共軛梯度法(CG)在論文作為廣義最小殘值法在誤差收斂速度上的對照。另外使用不同迭代重置間隔的廣義最小殘值法也互相進行了比較,以進一步了解較佳的間隔次數。在GMRES的平行化中,為了降低不同平行線程之間的通訊所造成之延遲,在論文中也探討了建構並使用牛頓基底的另一種GMRES對平行效率的影響。
英文摘要 The goal of this research is the development of a distributed parallel implementation of the general minimal residual (GMRES) solver with the target application of computational mechanics. The program of the solver is aimed to be a two-layered parallel structure, it could have the potential to be implemented on the usage of multi-GPU. The first layer employs OpenMPI (or simply MPI) for management of communications between distributed computing nodes. The second layer of parallelization is the use of a local GPU device for heterogeneous parallel computation using Nvidia’s CUDA. For the sake of simplicity, the investigated Finite Element Method is a stiffness-method based approach for solving the system Kx = F where the stiffness matrix K is computed using a Rayleigh-Ritz method based on a linear shape function, i.e. a Constant Strain Triangle (CST) formulation. The data from the stiffness matrix is stored using a Compressed Sparse Row (CSR) method for eliminating the need to store zero values in the sparse matrix. In this study, this stiffness method is first generated using MATLAB and later exported to the GPU for solution.
The accuracy of the GMRES implementation is tested through several benchmarks. 2D heat transfer being discretized from the Laplace’s equation is tested as benchmark for Newton basis GMRES. The lid-driven cavity solved using SIMPLE is also used to be the benchmark. Finally, the solver is applied to the computation of stress concentration factors for several flat plates using multiple Nvidia GPU devices distributed across an MPI cluster. First, the stress concentration factors of a flat plate with round circular hole located at middle and a T-shape flat plate with round fillet are computed for various loading conditions and geometries. With each different radius, the von Mises stress of each element is calculated and used to create a stress concentration relationship.
The convergence performance of conjugate gradient method (CG) is compared with GMRES since our stiffness matrices are symmetric. GMRES with different restart is also investigated to find the best number for restart. To further investigate the possibility of decreasing the overhead cause by communication between threads, Newton basis GMRES and its communication avoiding algorithm is also studied.
論文目次 摘要 i
Abstract iii
誌謝 vi
Contents vii
List of Tables xii
List of Figures xiii
Nomenclature xx
Chapter 1 Introduction 1
1.1 Background survey 1
1.2 Motivation 2
Chapter 2 Theory of Parallel computing 4
2.1 Efficiency Theory of parallel computing 4
2.1.1 Amdahl's Law 6
2.1.2 Gustafson–Barsis's Law 8
2.2 Program Structure with the usage of CUDA 9
2.2.1 Memory in GPU 10
2.2.2 Memory allocation in CUDA 10
2.2.3 Message passing in CUDA 11
Chapter 3 Sparse Linear Systems 13
3.1 Challenges of solving real linear systems 13
3.1.1 Condition Number 14
3.1.2 Direct Solvers for Linear Systems 17
3.2 Iterative Solvers for Large Linear System 21
3.2.1 Jacobi Iteration 21
3.2.2 Conjugate Gradient Method (CG) 23
3.2.3 General Minimized Residual Method (GMRES) 25
3.2.4 Newton-Basis GMRES 30
3.3 Storage Methods for Sparse Matrix Storage 31
3.3.1 Coordinate Storage Format (COO) 32
3.3.2 Compressed Sparse Row Format (CSR) 32
Chapter 4 Message Passing Interface (MPI) 34
4.1 Message Passing Interface (MPI) 34
4.1.1 Message Passing Interface (MPI) API 34
4.1.2 Designing the size of message passing 35
4.1.3 MPI Environment (MPE) and the Jumpshot analyzing tool 37
4.2 Beowulf Cluster using GPU 38
4.2.1 Environment settings 38
4.2.2 The Structure of the Cluster 39
4.3 Basic Parallel Vector Operations 39
4.3.1 Data Separation of CSR Form Arrays 39
4.3.2 The Parallel Vector-Vector Dot Product 39
4.3.3 The Parallel Matrix-Vector Dot Product in CSR Form 40
Chapter 5 Finite Element Methodology 42
5.1 Finite Element Procedure 42
5.1.1 The Rayleigh-Ritz Method 42
5.1.2 2D Constant Strain Triangular (CST) Element 43
5.1.3 Linear Elastics 45
5.1.4 Strong Form and Weak Form 48
5.2 CSR Form for Finite Element 51
5.2.1 Assembly of Local Stiffness Matrix 51
5.2.2 Assembly of Global Stiffness Matrix in CSR Form 52
5.2.3 Reduced Global Stiffness Matrix 55
Chapter 6 Methodology 57
6.1 CUDA-MPI Algorithm Description 57
6.2 Data Structure 57
6.3 Host-Host Communications 58
6.4 Host-Device Communications 59
6.5 GMRES Parallelization using multiple CUDA devices 59
6.5.1 Arnoldi Iteration Part One with Matrix-Vector Production 60
6.5.2 Arnoldi Iteration Part Two with Vector-Vector Production 62
6.5.3 Arnoldi Iteration Part Three 63
6.5.4 Givens Rotation 63
6.5.5 Backward Substitution of the Least Square Problem 64
6.6 Newton Basis GMRES Parallelization using MPI 64
6.6.1 The First Iteration of Newton Basis GMRES 64
6.6.2 Construction of the Newton Basis 64
6.6.3 QR Factorization of the Basis 65
Chapter 7 Benchmarks and Results 66
7.1 Introduction of Heat Transfer Benchmark 66
7.2 Introduction of the Lid-driven Cavity Benchmark 66
7.3 Introduction of FEM Benchmarks 66
7.3.1 Simple Beam with Axial Load 67
7.3.2 Cantilever Beam with Point Load at Free-end 67
7.3.3 A Circular Hole at the middle of a Flat Plate 68
7.3.4 Flat Plate with Fillet 68
7.4 FEM usage in Results 69
7.4.1 Results with Meshes and von Mises Stress 69
7.5 Speed of Convergence with respect of Iterations 71
7.5.1 Jacobi Preconditioner 71
7.5.2 Restart 72
7.5.3 Comparison with CG 73
7.6 Parallel Performance using Multi-GPUs 73
7.7 Speedup of Communication Avoiding Algorithm 75
7.7.1 Speedup comparison between MGS and Newton Basis GMRES Running with different number of CPU core 75
7.7.2 Analyzation of the bottleneck of the parallel process 75
Chapter 8 Conclusion 77
References 79
Table 83
Figures 87
參考文獻 [1]J. Dongarra, "An Overview of High Performance Computing and Challenges for the Future," in VECPAR, 2008, p. 1.
[2]D. B. Kirk and W. H. Wen-Mei, Programming massively parallel processors: a hands-on approach: Morgan kaufmann, 2016.
[3]J. Bolz, I. Farmer, E. Grinspun, and P. Schröoder, "Sparse matrix solvers on the GPU: conjugate gradients and multigrid," ACM transactions on graphics (TOG), vol. 22, pp. 917-924, 2003.
[4]E. J. Kelmelis, J. R. Humphrey, J. P. Durbano, and F. E. Ortiz, "Accelerated modeling and simulation with a desktop supercomputer," in Enabling Technologies for Simulation Science X, 2006, p. 62270N.
[5]A. A. Khokhar, V. K. Prasanna, M. E. Shaaban, and C.-L. Wang, "Heterogeneous computing: Challenges and opportunities," Computer, vol. 26, pp. 18-27, 1993.
[6]J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, pp. 532-533, 1988.
[7]J. Sanders and E. Kandrot, CUDA by example: an introduction to general-purpose GPU programming: Addison-Wesley Professional, 2010.
[8]P. G. Ciarlet, B. Miara, and J.-M. Thomas, Introduction to numerical linear algebra and optimisation: Cambridge University Press, 1989.
[9]Z.-C. Li, C.-S. Chien, and H.-T. Huang, "Effective condition number for finite difference method," Journal of computational and applied mathematics, vol. 198, pp. 208-235, 2007.
[10]A. K. Cline, C. B. Moler, G. W. Stewart, and J. H. Wilkinson, "An estimate for the condition number of a matrix," SIAM Journal on Numerical Analysis, vol. 16, pp. 368-375, 1979.
[11]J. H. Wilkinson, The algebraic eigenvalue problem vol. 87: Clarendon Press Oxford, 1965.
[12]N. Jamil, "A comparison of direct and indirect solvers for linear systems of equations," International Journal of Emerging Sciences, vol. 2, pp. 310-321, 2012.
[13]Y. Saad, "Communication complexity of the Gaussian elimination algorithm on multiprocessors," Linear Algebra and its Applications, vol. 77, pp. 315-340, 1986.
[14]N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha, "LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware," in Proceedings of the 2005 ACM/IEEE conference on Supercomputing, 2005, p. 3.
[15]D. Coppersmith and S. Winograd, "Matrix multiplication via arithmetic progressions," in Proceedings of the nineteenth annual ACM symposium on Theory of computing, 1987, pp. 1-6.
[16]H. C. Elman, "Iterative methods for large, sparse, nonsymmetric systems of linear equations," Yale University New Haven, Conn, 1982.
[17]A. Chronopoulos and C. W. Gear, "s-Step iterative methods for symmetric linear systems," Journal of Computational and Applied Mathematics, vol. 25, pp. 153-168, 1989.
[18]G. H. Golub and C. F. Van Loan, Matrix computations vol. 3: JHU Press, 2012.
[19]Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM Journal on scientific and statistical computing, vol. 7, pp. 856-869, 1986.
[20]V. Frayssé, L. Giraud, S. Gratton, and J. Langou, "Algorithm 842: A set of GMRES routines for real and complex arithmetics on high performance computers," ACM Transactions on Mathematical Software (TOMS), vol. 31, pp. 228-238, 2005.
[21]I. C. Ipsen and C. D. Meyer, "The idea behind Krylov methods," American Mathematical Monthly, pp. 889-899, 1998.
[22]P. N. Brown, "A theoretical comparison of the Arnoldi and GMRES algorithms," SIAM Journal on Scientific and Statistical Computing, vol. 12, pp. 58-78, 1991.
[23]Z. Bai, D. Hu, and L. Reichel, "A Newton basis GMRES implementation," IMA Journal of Numerical Analysis, vol. 14, pp. 563-581, 1994.
[24]D. Calvetti, J. Petersen, and L. Reichel, "A parallel implementation of the GMRES method," L. Reichel, A. Ruttan and RS Varga, de Gruyter, Berlin, pp. 31-45, 1993.
[25]E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, et al., "Open MPI: Goals, concept, and design of a next generation MPI implementation," in European Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting, 2004, pp. 97-104.
[26]J. Erhel, "A parallel GMRES version for general sparse matrices," Electronic Transactions on Numerical Analysis, vol. 3, pp. 160-176, 1995.
[27]H. Huang, L. Wang, E.-J. Lee, and P. Chen, "An MPI-CUDA implementation and optimization for parallel sparse equations and least squares (LSQR)," Procedia Computer Science, vol. 9, pp. 76-85, 2012.
[28]A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, "Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks," in Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, 2009, pp. 233-244.
[29]K. He, S. X.-D. Tan, H. Zhao, X.-X. Liu, H. Wang, and G. Shi, "Parallel GMRES solver for fast analysis of large linear dynamic systems on GPU platforms," INTEGRATION, the VLSI journal, vol. 52, pp. 10-22, 2016.
[30]R. D. Cook, Concepts and applications of finite element analysis: John Wiley & Sons, 2007.
[31]J. Fish and T. Belytschko, "A first course in finite elements," 2007.
[32]H. Bang and Y. W. Kwon, The finite element method using MATLAB: CRC press, 2000.
[33]J. Langou, "Computing the R of the QR factorization of tall and skinny matrices using MPI_Reduce," arXiv preprint arXiv:1002.4250, 2010.
[34]C. Lipson and R. C. Juvinall, Handbook of stress and strength: design and material applications: Macmillan, 1963.
論文全文使用權限
  • 同意授權校內瀏覽/列印電子全文服務,於2019-01-30起公開。
  • 同意授權校外瀏覽/列印電子全文服務,於2019-01-30起公開。


  • 如您有疑問,請聯絡圖書館
    聯絡電話:(06)2757575#65773
    聯絡E-mail:etds@email.ncku.edu.tw