TY - GEN
T1 - Fault tolerant wide-area parallel computing
AU - Weissman, Jon B.
PY - 2000
Y1 - 2000
N2 - Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and not compromise the high-performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but wide-area replication outperforms checkpoint-recovery for larger-grain problems, precisely the problems most suited for the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.
AB - Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and not compromise the high-performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but wide-area replication outperforms checkpoint-recovery for larger-grain problems, precisely the problems most suited for the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.
UR - http://www.scopus.com/inward/record.url?scp=84876363498&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84876363498&partnerID=8YFLogxK
U2 - 10.1007/3-540-45591-4_168
DO - 10.1007/3-540-45591-4_168
M3 - Conference contribution
AN - SCOPUS:84876363498
SN - 354067442X
SN - 9783540674429
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 1214
EP - 1225
BT - Parallel and Distributed Processing - 15 IPDPS 2000 Workshops, Proceedings
A2 - Rolim, Jose
PB - Springer Verlag
T2 - 15 Workshops Held in Conjunction with the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2000
Y2 - 1 May 2000 through 5 May 2000
ER -