Application of the GMRES (Generalized Minimal RESidual) Method for Computational Mechanics Applications Using Communication-Avoiding Algorithms
Department of Mechanical Engineering
Generalized Minimal Residual Method (GMRES)
Communication-Avoiding Algorithm
Graphics Processing Unit (GPU)
The goal of this research is the development of a distributed parallel implementation of the generalized minimal residual (GMRES) solver, with computational mechanics as the target application. The solver is designed as a two-layer parallel structure with the potential to run on multiple GPUs. The first layer employs OpenMPI (or simply MPI) to manage communication between distributed computing nodes. The second layer uses a local GPU device for heterogeneous parallel computation through Nvidia's CUDA. For the sake of simplicity, the investigated finite element method is a stiffness-method approach for solving the system Kx = F, where the stiffness matrix K is computed using a Rayleigh-Ritz method based on linear shape functions, i.e., a Constant Strain Triangle (CST) formulation. The stiffness matrix is stored in the Compressed Sparse Row (CSR) format, which eliminates the need to store the zero entries of the sparse matrix. In this study, the stiffness matrix is first generated in MATLAB and later exported to the GPU for solution.
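As an illustration of the CSR scheme described above (a minimal Python sketch for exposition only; the thesis code itself is written in MATLAB and CUDA C), a dense matrix can be converted to the three CSR arrays and multiplied against a vector as follows:

```python
def dense_to_csr(A):
    """Convert a dense matrix (list of lists) to CSR arrays.

    Returns (values, col_indices, row_ptr): only nonzeros are stored,
    and row_ptr[i]:row_ptr[i+1] delimits the entries of row i.
    """
    values, col_indices, row_ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0.0:
                values.append(a)
                col_indices.append(j)
        row_ptr.append(len(values))
    return values, col_indices, row_ptr


def csr_matvec(values, col_indices, row_ptr, x):
    """Compute y = A @ x directly from the CSR arrays."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_indices[k]]
        y.append(s)
    return y
```

For example, the diagonal matrix [[4, 0], [0, 2]] stores only the two nonzero values, with row_ptr = [0, 1, 2]. The same row-slice structure is what makes the per-row work easy to distribute across threads or MPI ranks.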
The accuracy of the GMRES implementation is tested through several benchmarks. A 2D heat transfer problem, discretized from Laplace's equation, is used as a benchmark for the Newton basis GMRES. The lid-driven cavity problem, solved using the SIMPLE algorithm, serves as a second benchmark. Finally, the solver is applied to the computation of stress concentration factors for several flat plates using multiple Nvidia GPU devices distributed across an MPI cluster: a flat plate with a circular hole at its center and a T-shaped flat plate with a round fillet are analyzed under various loading conditions and geometries. For each radius, the von Mises stress of every element is calculated and used to construct a stress concentration relationship.
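For reference, discretizing Laplace's equation with the standard five-point stencil on a uniform interior grid yields the kind of sparse symmetric system this benchmark solves. The following NumPy sketch (assuming a uniform grid spacing h and homogeneous Dirichlet boundaries; it is an illustration, not the thesis benchmark code) assembles that system in dense form:

```python
import numpy as np


def laplace_system(n, h=1.0):
    """Assemble the five-point stencil for Laplace's equation on an
    n-by-n interior grid; Dirichlet boundary terms would be folded
    into the right-hand side (zero here)."""
    N = n * n
    A = np.zeros((N, N))
    for i in range(n):
        for j in range(n):
            k = i * n + j
            A[k, k] = 4.0 / h**2  # center of the stencil
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    A[k, ii * n + jj] = -1.0 / h**2  # four neighbors
    return A
```

Each row has at most five nonzeros, so the matrix is naturally suited to the CSR storage and sparse matrix-vector products used throughout this work.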
Since the stiffness matrices are symmetric, the convergence behavior of the conjugate gradient (CG) method is compared with that of GMRES. GMRES with different restart values is also investigated to find the best restart number. To further explore reducing the overhead caused by communication between processes, the Newton basis GMRES and its communication-avoiding variant are also studied.
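The restarted GMRES variant investigated here can be sketched serially as follows (a NumPy illustration, not the parallel MPI-CUDA implementation; the restart length m, tolerance, and restart limit are arbitrary choices for exposition):

```python
import numpy as np


def gmres_restarted(A, b, m=20, tol=1e-10, max_restarts=50):
    """Restarted GMRES(m): minimize ||b - A x|| over a Krylov subspace
    of dimension at most m, restarting to bound memory and the cost of
    orthogonalization."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_restarts):
        r = b - A @ x
        beta = np.linalg.norm(r)
        if beta < tol:
            return x
        # Arnoldi process with modified Gram-Schmidt orthogonalization
        Q = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        Q[:, 0] = r / beta
        k_used = m
        for k in range(m):
            w = A @ Q[:, k]
            for i in range(k + 1):
                H[i, k] = Q[:, i] @ w
                w -= H[i, k] * Q[:, i]
            H[k + 1, k] = np.linalg.norm(w)
            if H[k + 1, k] < 1e-14:  # happy breakdown: exact subspace found
                k_used = k + 1
                break
            Q[:, k + 1] = w / H[k + 1, k]
        # Small least-squares problem: min ||beta*e1 - H y||
        e1 = np.zeros(k_used + 1)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H[:k_used + 1, :k_used], e1, rcond=None)
        x = x + Q[:, :k_used] @ y
    return x
```

The inner dot products and the matrix-vector product are exactly the operations parallelized in Chapter 6; in the production solver the small least-squares problem is solved with Givens rotations and backward substitution rather than a dense least-squares routine.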
List of Tables xii
List of Figures xiii
Chapter 1 Introduction 1
1.1 Background survey 1
1.2 Motivation 2
Chapter 2 Theory of Parallel Computing 4
2.1 Efficiency Theory of Parallel Computing 4
2.1.1 Amdahl's Law 6
2.1.2 Gustafson–Barsis's Law 8
2.2 Program Structure with the usage of CUDA 9
2.2.1 Memory in GPU 10
2.2.2 Memory allocation in CUDA 10
2.2.3 Message passing in CUDA 11
Chapter 3 Sparse Linear Systems 13
3.1 Challenges of Solving Real Linear Systems 13
3.1.1 Condition Number 14
3.1.2 Direct Solvers for Linear Systems 17
3.2 Iterative Solvers for Large Linear Systems 21
3.2.1 Jacobi Iteration 21
3.2.2 Conjugate Gradient Method (CG) 23
3.2.3 Generalized Minimal Residual Method (GMRES) 25
3.2.4 Newton-Basis GMRES 30
3.3 Storage Formats for Sparse Matrices 31
3.3.1 Coordinate Storage Format (COO) 32
3.3.2 Compressed Sparse Row Format (CSR) 32
Chapter 4 Message Passing Interface (MPI) 34
4.1 Message Passing Interface (MPI) 34
4.1.1 Message Passing Interface (MPI) API 34
4.1.2 Designing the size of message passing 35
4.1.3 MPI Environment (MPE) and the Jumpshot analyzing tool 37
4.2 Beowulf Cluster using GPU 38
4.2.1 Environment settings 38
4.2.2 The Structure of the Cluster 39
4.3 Basic Parallel Vector Operations 39
4.3.1 Data Separation of CSR Form Arrays 39
4.3.2 The Parallel Vector-Vector Dot Product 39
4.3.3 The Parallel Matrix-Vector Dot Product in CSR Form 40
Chapter 5 Finite Element Methodology 42
5.1 Finite Element Procedure 42
5.1.1 The Rayleigh-Ritz Method 42
5.1.2 2D Constant Strain Triangular (CST) Element 43
5.1.3 Linear Elasticity 45
5.1.4 Strong Form and Weak Form 48
5.2 CSR Form for Finite Element 51
5.2.1 Assembly of Local Stiffness Matrix 51
5.2.2 Assembly of Global Stiffness Matrix in CSR Form 52
5.2.3 Reduced Global Stiffness Matrix 55
Chapter 6 Methodology 57
6.1 CUDA-MPI Algorithm Description 57
6.2 Data Structure 57
6.3 Host-Host Communications 58
6.4 Host-Device Communications 59
6.5 GMRES Parallelization using multiple CUDA devices 59
6.5.1 Arnoldi Iteration Part One with Matrix-Vector Product 60
6.5.2 Arnoldi Iteration Part Two with Vector-Vector Product 62
6.5.3 Arnoldi Iteration Part Three 63
6.5.4 Givens Rotation 63
6.5.5 Backward Substitution of the Least Square Problem 64
6.6 Newton Basis GMRES Parallelization using MPI 64
6.6.1 The First Iteration of Newton Basis GMRES 64
6.6.2 Construction of the Newton Basis 64
6.6.3 QR Factorization of the Basis 65
Chapter 7 Benchmarks and Results 66
7.1 Introduction of Heat Transfer Benchmark 66
7.2 Introduction of the Lid-driven Cavity Benchmark 66
7.3 Introduction of FEM Benchmarks 66
7.3.1 Simple Beam with Axial Load 67
7.3.2 Cantilever Beam with Point Load at Free-end 67
7.3.3 A Circular Hole at the middle of a Flat Plate 68
7.3.4 Flat Plate with Fillet 68
7.4 FEM usage in Results 69
7.4.1 Results with Meshes and von Mises Stress 69
7.5 Speed of Convergence with respect to Iterations 71
7.5.1 Jacobi Preconditioner 71
7.5.2 Restart 72
7.5.3 Comparison with CG 73
7.6 Parallel Performance using Multi-GPUs 73
7.7 Speedup of Communication Avoiding Algorithm 75
7.7.1 Speedup Comparison between MGS and Newton Basis GMRES with Different Numbers of CPU Cores 75
7.7.2 Analysis of the Bottleneck of the Parallel Process 75
Chapter 8 Conclusion 77