In the past several years, numerical algorithms involving linear algebra and pairwise N-body force calculations have been adapted with much success to run well on GPUs. These algorithms share the feature that high computational intensity can be achieved while holding only small amounts of data in on-chip storage. In previous work, we combined a briquette data structure with heavily pipelined CFD processing of these data briquettes in sequence, which results in a very small on-chip data workspace and high performance for our multifluid PPM gas dynamics algorithm on CPUs with standard-sized caches. The on-chip data workspace produced in that earlier work is nevertheless not small enough to meet the requirements of today's GPUs, which demand that no more than 32 kB of on-chip data be associated with a single thread of control (a warp). Here we report a variant of our earlier technique that allows a user-controllable trade-off between workspace size and redundant computation, which can be advantageous on GPUs. We use our multifluid PPM gas dynamics algorithm to illustrate this technique. In 32-bit precision, the performance of this algorithm on a recently introduced dual-chip GPU, the Nvidia K80, is 1.7 times that on a similarly recent dual-CPU node using two 16-core Intel Haswell chips. The redundant computation that allows the on-chip data context for each thread of control to be less than 32 kB is roughly 9% of the total. To ease the programming burden involved in applying our technique, we have built an automatic translator from a Fortran expression to CUDA.
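The trade-off between workspace size and redundant computation can be illustrated with a deliberately simplified model. The sketch below is not the PPM algorithm or the briquette pipeline described above; it assumes a generic one-dimensional pipeline of stencil stages in which each tile is processed independently, so that halo cells must be recomputed by every tile that needs them. The function name `tile_costs`, the parameters `n_out`, `radius`, and `stages`, and the two-buffer workspace estimate are all hypothetical choices made for illustration.

```python
def tile_costs(n_out, radius, stages, bytes_per_value=4):
    """Estimate (workspace_bytes, redundant_fraction) for one 1D tile.

    Assumed model (illustrative only): `stages` stencil stages are applied
    in sequence, each reading `radius` neighbours on either side, and the
    tile is computed without sharing intermediate results with other tiles.
    """
    # Work that contributes directly to the tile's own n_out output cells.
    useful = stages * n_out

    # Stage s (0-based) must also cover the halo still needed by later stages,
    # so early stages compute more cells than the tile finally keeps.
    total = sum(n_out + 2 * radius * (stages - 1 - s) for s in range(stages))

    # The input must extend `radius` cells per stage beyond the tile on each side.
    input_cells = n_out + 2 * radius * stages

    # Assumption: two buffers of the largest size (read one, write the other).
    workspace_bytes = 2 * input_cells * bytes_per_value

    return workspace_bytes, (total - useful) / total


if __name__ == "__main__":
    # Smaller tiles need less on-chip workspace but redo more halo work.
    for n_out in (8, 16, 32, 64):
        workspace, redundant = tile_costs(n_out, radius=1, stages=4)
        print(f"tile={n_out:3d}  workspace={workspace:4d} B  redundant={redundant:.1%}")
```

In this model, shrinking the tile lowers the workspace requirement while raising the share of recomputed halo cells, which is the kind of user-controllable balance the abstract refers to. The actual figures reported above (a 32 kB limit per warp and roughly 9% redundant computation) come from the authors' measurements, not from this sketch.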
Bibliographical note
Funding Information:
Development of the PPM algorithms and codes that led to the work reported here, beginning with our work on the Los Alamos Roadrunner machine, has been supported through contracts from Los Alamos National Laboratory (subcontract 237111) and Sandia National Laboratory (subcontract 1254431). Our work at the University of Minnesota has also been supported by the National Science Foundation through grant 1413548 and PRAC grant 1440025 for access to NCSA’s Blue Waters system. We have also carried out tests on early examples of advanced hardware at Sandia’s CSRI. This work was also performed under the auspices of the US Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-JRNL-673849.
© 2016 Elsevier Inc. All rights reserved.
- Code transformation
- Computational fluid dynamics
- GPU computation