Accession Number : ADA299747

Title :   Application-Transparent Fault Management.

Descriptive Note : Research rept.

Corporate Author : CARNEGIE-MELLON UNIV PITTSBURGH PA SCHOOL OF COMPUTER SCIENCE

Personal Author(s) : Russinovich, Mark E.

PDF Url : ADA299747

Report Date : AUG 1994

Pagination or Media Count : 145

Abstract : As computers continue to proliferate and are used in more demanding environments, data integrity and continuous availability are increasingly important aspects of their designs. Since operating systems are common to all computers, and it is at the operating system level that there is maximum system visibility and control, it is appropriate for the operating system to provide policies which detect, contain and tolerate faults. These policies and the mechanisms that support them form an operating system's fault management. A fault management mechanism, the sentry mechanism, has been designed and implemented for a UNIX 4.3 BSD server running on the Mach 3.0 microkernel. Fault-tolerant policies have been designed for a range of computer systems, from a single computer, to mirrored computers, to distributed systems. The policies first addressed provide single-computer applications with application-transparent fault tolerance with respect to transient faults and certain types of permanent faults. Contributions to this area include algorithms for concurrent process journaling, disk checkpointing and memory checkpointing. Formal proofs are made of the journal sequencing algorithm and the disk checkpointing algorithm. Performance measurements from an implementation of the single-computer algorithms show an average performance overhead of less than 5% and a requirement of only 10 MB of dedicated stable disk storage. The system provides fault tolerance with no additional hardware other than a hard disk, and works with unmodified applications such as the X Window System. Sentry policies that provide software-based fault tolerance for duplicated and triplicated computer systems, as well as distributed systems, have also been designed.

Descriptors :   *OPERATING SYSTEMS(COMPUTERS), *COMPUTER PROGRAM VERIFICATION, *FAULT TOLERANT COMPUTING, ALGORITHMS, TRANSIENTS, STABILITY, POLICIES, DETECTION, MANAGEMENT, COMPUTERS, MEMORY DEVICES, AVAILABILITY, NUMERICAL INTEGRATION, COMPUTER APPLICATIONS, DISKS, FAULT TOLERANCE, MAGNETIC DISKS.

Subject Categories : Computer Programming and Software

Distribution Statement : APPROVED FOR PUBLIC RELEASE