Abstract
In the past several years, there has been much success in adapting numerical algorithms involving linear algebra and pairwise N-body force calculations to run well on GPUs. These algorithms share the feature that high computational intensity can be achieved while holding only small amounts of data in on-chip storage. In previous work, we combined a briquette data structure with heavily pipelined CFD processing of these data briquettes in sequence, which results in a very small on-chip data workspace and high performance for our multifluid PPM gas dynamics algorithm on CPUs with standard-sized caches. The on-chip data workspace produced in that earlier work is not small enough to meet the requirements of today's GPUs, which demand that no more than 32 kB of on-chip data be associated with a single thread of control (a warp). Here we report a variant of our earlier technique that allows a user-controllable trade-off between workspace size and redundant computation, a trade-off that can be a win on GPUs. We use our multifluid PPM gas dynamics algorithm to illustrate this technique. Performance results for this algorithm in 32-bit precision on a recently introduced dual-chip GPU, the Nvidia K80, are 1.7 times those on a similarly recent dual-CPU node using two 16-core Intel Haswell chips. The redundant computation that allows the on-chip data context for each thread of control to stay below 32 kB is roughly 9% of the total. We have built an automatic translator from a Fortran expression of the algorithm to CUDA to ease the programming burden involved in applying our technique.
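The workspace-for-recomputation trade-off described above is, in spirit, the familiar one of letting each on-chip tile redundantly recompute the ghost (halo) values it needs from its neighbors, so that a multi-stage update can run entirely out of a small shared-memory workspace with no inter-block exchange. The following is a minimal CUDA sketch of that idea, not the paper's actual code: the two-stage 3-point stencil, the tile sizes NX and HALO, and the kernel name two_stage_stencil are all illustrative assumptions.

```cuda
#include <cuda_runtime.h>

#define NX    64                // interior cells per on-chip tile (assumed)
#define HALO  2                 // ghost cells redundantly recomputed per side
#define TILE  (NX + 2 * HALO)

// Two fused 3-point stencil stages per kernel launch. Each block keeps its
// whole workspace (two TILE-length arrays) in shared memory and recomputes
// the intermediate values in its halo, duplicating work that neighboring
// blocks also perform, instead of exchanging data with them.
__global__ void two_stage_stencil(const float* __restrict__ in,
                                  float* __restrict__ out, int n)
{
    __shared__ float u[TILE];   // input tile plus halo
    __shared__ float w[TILE];   // stage-1 result, halo computed redundantly

    int base = blockIdx.x * NX; // global index of this tile's first interior cell

    // Cooperative load of the tile plus its halo, clamped at the domain edges.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x) {
        int g = base - HALO + i;
        u[i] = in[min(max(g, 0), n - 1)];
    }
    __syncthreads();

    // Stage 1: valid everywhere except the outermost cell on each side.
    // The ghost-cell results here are the redundant computation: adjacent
    // blocks compute these same values for themselves.
    for (int i = threadIdx.x + 1; i < TILE - 1; i += blockDim.x)
        w[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0f;
    __syncthreads();

    // Stage 2: now valid on exactly the NX interior cells, which this block
    // alone writes back to global memory.
    for (int i = threadIdx.x + HALO; i < TILE - HALO; i += blockDim.x) {
        int g = base - HALO + i;
        if (g < n)
            out[g] = (w[i - 1] + w[i] + w[i + 1]) / 3.0f;
    }
}

// Hypothetical launch: one block per NX-cell tile.
// two_stage_stencil<<<(n + NX - 1) / NX, 64>>>(d_in, d_out, n);
```

In this sketch each block's workspace is just two small shared arrays; fusing more stages into one kernel (and hence enlarging HALO) shrinks off-chip traffic further at the cost of a larger redundant fraction, which the paper reports as roughly 9% for its full multifluid PPM algorithm.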
Original language | English (US)
---|---
Pages (from-to) | 56-65
Number of pages | 10
Journal | Journal of Parallel and Distributed Computing
Volume | 93-94
DOIs |
State | Published - Jul 1 2016
Bibliographical note
Publisher Copyright: © 2016 Elsevier Inc. All rights reserved.
Keywords
- Code transformation
- Computational fluid dynamics
- GPU computation
- Precompilation