TY - JOUR
T1 - Optimization of checkpointing-related I/O for high-performance parallel and distributed computing
AU - Subramaniyan, Rajagopal
AU - Grobelny, Eric
AU - Studham, Scott
AU - George, Alan D.
PY - 2008/11/1
Y1 - 2008/11/1
N2 - Checkpointing, the process of saving program/application state, usually to a stable storage, has been the most common fault-tolerance methodology for high-performance applications. The rate of checkpointing (how often) is primarily driven by the failure rate of the system. If the checkpointing rate is low, fewer resources are consumed but the chance of high computational loss is increased and vice versa if the checkpointing rate is high. It is important to strike a balance, and an optimum rate of checkpointing is required. In this paper, we analytically model the process of checkpointing in terms of mean-time-between-failure of the system, amount of memory being checkpointed, sustainable I/O bandwidth to the stable storage, and frequency of checkpointing. We identify the optimum frequency of checkpointing to be used on systems with given specifications thereby making way for efficient use of available resources and maximum performance of the system without compromising on the fault-tolerance aspects. Further, we develop discrete-event models simulating the checkpointing process to verify the analytical model for optimum checkpointing. Using the analytical model, we also investigate the optimum rate of checkpointing for systems of varying resource levels ranging from small embedded cluster systems to large supercomputers.
KW - Checkpointing
KW - Distributed computing
KW - Fault tolerance
KW - High-performance computing
KW - Modeling
KW - Parallel computing
KW - Supercomputing
KW - Technology growth
UR - http://www.scopus.com/inward/record.url?scp=54149107334&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=54149107334&partnerID=8YFLogxK
U2 - 10.1007/s11227-007-0162-0
DO - 10.1007/s11227-007-0162-0
M3 - Article
AN - SCOPUS:54149107334
SN - 0920-8542
VL - 46
SP - 150
EP - 180
JO - Journal of Supercomputing
JF - Journal of Supercomputing
IS - 2
ER -