In modern HPC systems, failures are treated as the norm rather than the exception. To avoid rerunning applications from scratch, checkpoint/restart techniques periodically write intermediate data to parallel file systems. To speed up HPC checkpointing, distributed burst buffers (DBB) have been proposed, which use node-local NVRAM to absorb bursty checkpoint data. Without proper coordination, however, a DBB is prone to low resource utilization. To solve this problem, we propose an NVRAM-based burst buffer coordination system named collaborative distributed burst buffer (CDBB). CDBB coordinates all available burst buffers, based on their priorities and states, to help overburdened burst buffers and maximize resource utilization. We built a proof-of-concept prototype and tested CDBB at the Minnesota Supercomputing Institute. Compared with a traditional DBB system, CDBB speeds up checkpointing by up to 8.4x under medium and heavy workloads and introduces only negligible overhead.
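The abstract does not spell out the coordination policy, so the following is only a toy sketch of the general idea of redirecting checkpoint data from an overburdened burst buffer to a peer chosen by priority and state. Every name here (`BurstBuffer`, `pick_helper`, `place_checkpoint`), the meaning of `priority` (a lower value is preferred), and the fallback to the parallel file system are assumptions made for illustration, not the CDBB design.

```python
from dataclasses import dataclass


@dataclass
class BurstBuffer:
    node: str            # hypothetical node identifier
    capacity_gb: float   # total node-local NVRAM capacity
    used_gb: float       # space currently occupied (the buffer's "state")
    priority: int        # assumption: lower value = preferred helper

    @property
    def free_gb(self) -> float:
        return self.capacity_gb - self.used_gb


def pick_helper(buffers, source_node, size_gb):
    """Pick a remote buffer able to absorb size_gb, or None if none can."""
    candidates = [
        b for b in buffers
        if b.node != source_node and b.free_gb >= size_gb
    ]
    if not candidates:
        return None
    # Prefer the best priority; break ties by the most free space.
    return min(candidates, key=lambda b: (b.priority, -b.free_gb))


def place_checkpoint(buffers, source_node, size_gb):
    """Return the node that absorbs the checkpoint, or "pfs" as a fallback."""
    local = next(b for b in buffers if b.node == source_node)
    if local.free_gb >= size_gb:
        target = local  # the local buffer is not overburdened
    else:
        target = pick_helper(buffers, source_node, size_gb)
    if target is None:
        return "pfs"  # assumption: write straight to the parallel file system
    target.used_gb += size_gb
    return target.node
```

In this sketch a checkpoint stays local whenever it fits. Only an overflowing buffer asks its peers for help, which is the situation an uncoordinated DBB would leave to the slower parallel file system.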
| Field | Value |
|---|---|
| Original language | English (US) |
| Number of pages | 12 |
| State | Published - 2018 |
| Event | 26th High Performance Computing Symposium, HPC 2018, Part of the 2018 Spring Simulation Multi-Conference, SpringSim 2018 - Baltimore, United States |
| Duration | Apr 15 2018 → Apr 18 2018 |
Bibliographical note
Funding Information:
This work is partially supported by the following NSF awards: 1305237, 1421913, 1439622 and 1525617. This work is also supported by Hewlett Packard Enterprise.
Keywords
- Burst buffer
- Coordination system
- Non-volatile memory
- Parallel file system