Scalable communication for high-order stencil computations using CUDA-aware MPI

dc.contributor: Aalto-yliopisto
dc.contributor: Aalto University
dc.contributor.author: Pekkilä, Johannes
dc.contributor.author: Väisälä, Miikka S.
dc.contributor.author: Käpylä, Maarit J.
dc.contributor.author: Rheinhardt, Matthias
dc.contributor.author: Lappi, Oskar
dc.contributor.department: Department of Computer Science
dc.contributor.groupauthor: Professorship Korpi-Lagg Maarit
dc.contributor.groupauthor: Computer Science Professors
dc.contributor.groupauthor: Computer Science - Large-scale Computing and Data Analysis (LSCA) - Research area
dc.contributor.organization: Åbo Akademi University
dc.contributor.organization: Academia Sinica Institute of Astronomy and Astrophysics
dc.date.accessioned: 2022-08-10T08:15:30Z
dc.date.available: 2022-08-10T08:15:30Z
dc.date.issued: 2022-07
dc.description: openaire: EC/H2020/818665/EU//UniSDyn. Funding Information: This work was supported by the Academy of Finland ReSoLVE Centre of Excellence (grant number 307411); the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Project UniSDyn, grant agreement no. 818665); and CHARMS within ASIAA from Academia Sinica. Publisher Copyright: © 2022 The Authors
dc.description.abstract: Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has increased faster than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated by the introduction of graphics processing units, which can deliver several times higher throughput in data-parallel tasks than central processing units. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge–Kutta integration. We put particular focus on improving the intra-node locality of workloads. Our GPU implementation scales strongly from one to 64 devices at 50%–87% of the expected efficiency based on a theoretical performance model. Compared with a multi-core CPU solver, our implementation exhibits a 20–60× speedup and 9–12× improved energy efficiency in compute-bound benchmarks on 16 nodes.
dc.description.version: Peer reviewed
dc.format.extent: 12
dc.format.mimetype: application/pdf
dc.identifier.citation: Pekkilä, J, Väisälä, M S, Käpylä, M J, Rheinhardt, M & Lappi, O 2022, 'Scalable communication for high-order stencil computations using CUDA-aware MPI', Parallel Computing, vol. 111, 102904, pp. 1-12. https://doi.org/10.1016/j.parco.2022.102904
dc.identifier.doi: 10.1016/j.parco.2022.102904
dc.identifier.issn: 0167-8191
dc.identifier.issn: 1872-7336
dc.identifier.other: PURE UUID: 238ed93e-75ae-4830-8f90-2257071a8208
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/238ed93e-75ae-4830-8f90-2257071a8208
dc.identifier.other: PURE LINK: http://www.scopus.com/inward/record.url?scp=85127169118&partnerID=8YFLogxK
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/82048920/Scalable_communication_for_high_order_stencil_computations_using_CUDA_aware_MPI.pdf
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/115702
dc.identifier.urn: URN:NBN:fi:aalto-202208104524
dc.language.iso: en
dc.publisher: Elsevier
dc.relation: info:eu-repo/grantAgreement/EC/H2020/818665/EU//UniSDyn
dc.relation.ispartofseries: Parallel Computing
dc.relation.ispartofseries: Volume 111, pp. 1-12
dc.rights: openAccess
dc.subject.keyword: High-performance computing
dc.subject.keyword: Graphics processing units
dc.subject.keyword: Stencil computations
dc.subject.keyword: Computational physics
dc.subject.keyword: Magnetohydrodynamics
dc.title: Scalable communication for high-order stencil computations using CUDA-aware MPI
dc.type: A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä (A1 Original article in a scientific journal)
dc.type.version: publishedVersion