Scalable communication for high-order stencil computations using CUDA-aware MPI
dc.contributor | Aalto-yliopisto | fi |
dc.contributor | Aalto University | en |
dc.contributor.author | Pekkilä, Johannes | en_US |
dc.contributor.author | Väisälä, Miikka S. | en_US |
dc.contributor.author | Käpylä, Maarit J. | en_US |
dc.contributor.author | Rheinhardt, Matthias | en_US |
dc.contributor.author | Lappi, Oskar | en_US |
dc.contributor.department | Department of Computer Science | en |
dc.contributor.groupauthor | Professorship Korpi-Lagg Maarit | en |
dc.contributor.groupauthor | Computer Science Professors | en |
dc.contributor.groupauthor | Computer Science - Large-scale Computing and Data Analysis (LSCA) - Research area | en |
dc.contributor.organization | Åbo Akademi University | en_US |
dc.contributor.organization | Academia Sinica Institute of Astronomy and Astrophysics | en_US |
dc.date.accessioned | 2022-08-10T08:15:30Z | |
dc.date.available | 2022-08-10T08:15:30Z | |
dc.date.issued | 2022-07 | en_US |
dc.description | openaire: EC/H2020/818665/EU//UniSDyn. Funding Information: This work was supported by the Academy of Finland ReSoLVE Centre of Excellence (grant number 307411); the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Project UniSDyn, grant agreement no. 818665); and CHARMS within ASIAA from Academia Sinica. Publisher Copyright: © 2022 The Authors | |
dc.description.abstract | Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has increased faster than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated by the introduction of graphics processing units, which can provide several times higher throughput in data-parallel tasks than central processing units. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge–Kutta integration. We put particular focus on improving the intra-node locality of workloads. Our GPU implementation scales strongly from one to 64 devices at 50%–87% of the expected efficiency based on a theoretical performance model. Compared with a multi-core CPU solver, our implementation exhibits a 20–60× speedup and 9–12× improved energy efficiency in compute-bound benchmarks on 16 nodes. | en |
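The "generic communication scheme using CUDA-aware MPI" mentioned in the abstract refers to exchanging stencil halos directly between GPU buffers. The sketch below illustrates that general pattern only and is not code from the paper: with a CUDA-aware MPI build, device pointers returned by cudaMalloc can be passed straight to MPI point-to-point calls, so halo data moves without explicit staging through host memory. All names and sizes (field, nx, a halo width of 3 matching a sixth-order central difference) are illustrative assumptions.

    // Minimal sketch of a CUDA-aware MPI halo exchange for a 1-D domain
    // decomposition; all names and sizes are illustrative, not from the paper.
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int nx   = 256;  // interior cells per rank (illustrative)
        const int halo = 3;    // halo width for a 6th-order central difference
        double *field;         // device layout: [lower halo | interior | upper halo]
        cudaMalloc(&field, (nx + 2 * halo) * sizeof(double));

        int up   = (rank + 1) % nprocs;            // periodic neighbors
        int down = (rank - 1 + nprocs) % nprocs;

        // With CUDA-aware MPI, device pointers go directly into MPI calls:
        // receive into the halo regions, send the boundary-adjacent interior cells.
        MPI_Request reqs[4];
        MPI_Irecv(field,             halo, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(field + halo + nx, halo, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(field + halo,      halo, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(field + nx,        halo, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[3]);
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        cudaFree(field);
        MPI_Finalize();
        return 0;
    }

This sketch shows only the data path; per the abstract, the paper's scheme additionally focuses on intra-node locality, and production stencil solvers typically overlap such exchanges with computation on the interior of the domain.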
dc.description.version | Peer reviewed | en |
dc.format.extent | 12 | |
dc.format.mimetype | application/pdf | en_US |
dc.identifier.citation | Pekkilä, J, Väisälä, M S, Käpylä, M J, Rheinhardt, M & Lappi, O 2022, 'Scalable communication for high-order stencil computations using CUDA-aware MPI', Parallel Computing, vol. 111, 102904, pp. 1-12. https://doi.org/10.1016/j.parco.2022.102904 | en |
dc.identifier.doi | 10.1016/j.parco.2022.102904 | en_US |
dc.identifier.issn | 0167-8191 | |
dc.identifier.issn | 1872-7336 | |
dc.identifier.other | PURE UUID: 238ed93e-75ae-4830-8f90-2257071a8208 | en_US |
dc.identifier.other | PURE ITEMURL: https://research.aalto.fi/en/publications/238ed93e-75ae-4830-8f90-2257071a8208 | en_US |
dc.identifier.other | PURE LINK: http://www.scopus.com/inward/record.url?scp=85127169118&partnerID=8YFLogxK | |
dc.identifier.other | PURE FILEURL: https://research.aalto.fi/files/82048920/Scalable_communication_for_high_order_stencil_computations_using_CUDA_aware_MPI.pdf | en_US |
dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/115702 | |
dc.identifier.urn | URN:NBN:fi:aalto-202208104524 | |
dc.language.iso | en | en |
dc.publisher | Elsevier | |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/818665/EU//UniSDyn | en_US |
dc.relation.ispartofseries | Parallel Computing | en |
dc.relation.ispartofseries | Volume 111, pp. 1-12 | en |
dc.rights | openAccess | en |
dc.subject.keyword | High-performance computing | en_US |
dc.subject.keyword | Graphics processing units | en_US |
dc.subject.keyword | Stencil computations | en_US |
dc.subject.keyword | Computational physics | en_US |
dc.subject.keyword | Magnetohydrodynamics | en_US |
dc.title | Scalable communication for high-order stencil computations using CUDA-aware MPI | en |
dc.type | A1 Original article in a scientific journal | en |
dc.type.version | publishedVersion |