Fault tolerant wide-area parallel computing

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Scopus citations

Abstract

Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and not compromise the high performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but wide-area replication outperforms checkpoint-recovery for larger-grain problems, precisely the problems most suited for the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.

Original language: English (US)
Title of host publication: Parallel and Distributed Processing - 15 IPDPS 2000 Workshops, Proceedings
Pages: 1214-1225
Number of pages: 12
State: Published - Dec 1 2000
Event: 15 Workshops Held in Conjunction with the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2000 - Cancun, Mexico
Duration: May 1 2000 to May 5 2000

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 1800 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 15 Workshops Held in Conjunction with the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2000
Country: Mexico
City: Cancun
Period: 5/1/00 to 5/5/00
