Systematic checkpointing of the machine state makes it possible to restart execution from a safe state upon detection of an error. The time and energy overhead of checkpointing, however, grows with the frequency of checkpointing. Given the growth of expected error rates, amortizing this overhead becomes especially challenging, as checkpointing frequency tends to increase with increasing error rates. Based on the observation that, due to imbalanced technology scaling, recomputing a data value can be more energy efficient than retrieving (i.e., loading) a stored copy, this paper explores how recomputation of data values (which otherwise would be read from a checkpoint in memory or secondary storage) can reduce the machine state to be checkpointed and, thereby, the checkpointing overhead. Even in a relatively small scale system, recomputation-based checkpointing can reduce the storage overhead by up to 23.91%; time overhead, by 11.92%; and energy overhead, by 12.53%.
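The trade-off described in the abstract can be illustrated with a toy cost model. The sketch below is not the paper's mechanism: the `Value` record, the `plan_checkpoint` and `storage_reduction` helpers, and all energy numbers are hypothetical, and it assumes that the inputs needed to recompute a value are themselves recoverable after a restart. It only shows the basic decision: leave a value out of the checkpoint when regenerating it is estimated to cost less energy than loading a stored copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    name: str
    size_bytes: int          # bytes the value would occupy in a checkpoint
    recompute_energy: float  # assumed energy to regenerate it (arbitrary units)


def plan_checkpoint(values, load_energy_per_byte):
    """Split values into (stored, recomputed) by comparing energy estimates."""
    stored, recomputed = [], []
    for v in values:
        load_energy = v.size_bytes * load_energy_per_byte
        if v.recompute_energy < load_energy:
            recomputed.append(v)  # cheaper to regenerate: omit from checkpoint
        else:
            stored.append(v)      # cheaper to load: keep in checkpoint
    return stored, recomputed


def storage_reduction(values, stored):
    """Fraction of checkpoint bytes saved by omitting recomputable values."""
    total = sum(v.size_bytes for v in values)
    kept = sum(v.size_bytes for v in stored)
    return 0.0 if total == 0 else 1.0 - kept / total


# Illustrative numbers only; they are not taken from the paper.
values = [
    Value("a", 64, 10.0),
    Value("b", 64, 500.0),
    Value("c", 128, 40.0),
]
stored, recomputed = plan_checkpoint(values, load_energy_per_byte=1.0)
```

With these made-up inputs, only `b` stays in the checkpoint, since it is the only value whose recomputation estimate exceeds its load estimate. A realistic model would also have to account for the time cost of recomputation during recovery and for dependencies between values.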
|Original language||English (US)|
|Title of host publication||Proceedings - 2020 IEEE International Symposium on High Performance Computer Architecture, HPCA 2020|
|Publisher||Institute of Electrical and Electronics Engineers Inc.|
|Number of pages||14|
|State||Published - Feb 2020|
|Event||26th IEEE International Symposium on High Performance Computer Architecture, HPCA 2020 - San Diego, United States|
|Duration||Feb 22 2020 → Feb 26 2020|
Bibliographical note
Funding Information: This work was supported by NSF CAREER CCF-1553042.
© 2020 IEEE.