Accession Number : ADA290430

Title : High-Level Fault Tolerance in Distributed Programs

Corporate Author : CARNEGIE-MELLON UNIV PITTSBURGH PA SCHOOL OF COMPUTER SCIENCE

Personal Author(s) : Seligman, Erik ; Beguelin, Adam

Report Date : DEC 1994

Pagination or Media Count : 12

Abstract : We have been developing high-level checkpoint and restart methods for Dome (Distributed Object Migration Environment), a C++ library of data-parallel objects that are automatically distributed using PVM. There are several levels of programming abstraction at which fault tolerance mechanisms can be designed: high-level, where the checkpoint and restart are built into our C++ objects, but the program structure is severely constrained; high-level with preprocessing, where a preprocessor inserts extra C++ statements into the code to facilitate checkpoint and restart; and low-level, where an interrupt periodically causes a memory image to be written out. Because we consider portability (both of our libraries and of the checkpoints they produce) to be an important goal, we focus on the higher-level checkpointing methods. In addition, we describe an implementation of high-level checkpointing, demonstrate it on multiple architectures, and show that it is efficient enough to provide good expected run times with low overhead, even in the case of frequent failures.

Descriptors :   *DISTRIBUTED DATA PROCESSING, *FAULT TOLERANT COMPUTING, ENVIRONMENTS, COMPUTER PROGRAMMING, COMPUTER ARCHITECTURE, MEMORY DEVICES, IMAGES, MIGRATION, RESTARTING, LIBRARIES, HIGH LEVEL LANGUAGES, PREPROCESSING, INSERTS.

Subject Categories : Computer Programming and Software

Distribution Statement : APPROVED FOR PUBLIC RELEASE