Reliable Computer Systems



Digitized by the Internet Archive in 2011

http://www.archive.org/details/reliablecomputerOOsiew

RELIABLE COMPUTER SYSTEMS

DESIGN AND EVALUATION

THIRD EDITION

Daniel P. Siewiorek
Carnegie Mellon University
Pittsburgh, Pennsylvania

Robert S. Swarz
Worcester Polytechnic Institute
Worcester, Massachusetts

A K Peters
Natick, Massachusetts

Editorial, Sales, and Customer Service Office

A K Peters, Ltd.
63 South Avenue
Natick, MA 01760

Copyright 1998 by A K Peters, Ltd.

All rights reserved.

No part of the material protected by this copyright notice may be reproduced or utilized in any form, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without written permission from the copyright owner.

Trademark products mentioned in the book are listed on page 890.

Library of Congress Cataloging-in-Publication Data

Siewiorek, Daniel P.
Reliable computer systems : design and evaluation / Daniel P. Siewiorek, Robert S. Swarz. - 3rd ed.
p. cm.
First ed. published under title: The theory and practice of reliable system design.
Includes bibliographical references and index.
ISBN 1-56881-092-X
1. Electronic digital computers - Reliability. 2. Fault-tolerant computing. I. Swarz, Robert S. II. Siewiorek, Daniel P. Theory and practice of reliable system design. III. Title.
QA76.5.S537 1998
004-dc21
98-202237
CIP

Printed in the United States of America

02 01 00 99 98

10 9 8 7 6 5 4 3 2 1

CREDITS

Figure 1-3: Eugene Foley, "The Effects of Microelectronics Revolution on Systems and Board Test," Computers, Vol. 12, No. 10 (October 1979). Copyright 1979 IEEE. Reprinted by permission.

Figure 1-6: S. Russell Craig, "Incoming Inspection and Test Programs," Electronics Test (October 1980). Reprinted by permission.

Credits are continued on pages 885-890, which are considered a continuation of the copyright page.

To Karon and Lonnie

A Special Remembrance:

During the development of this book, a friend, colleague, and fault-tolerant pioneer passed away. Dr. Wing N. Toy documented his 37 years of experience in designing several generations of fault-tolerant computers for the Bell System electronic switching systems described in Chapter 8. We dedicate this book to Dr. Toy in the confidence that his writings will continue to influence designs produced by those who learn from these pages.

CONTENTS

Preface

I THE THEORY OF RELIABLE SYSTEM DESIGN

1 FUNDAMENTAL CONCEPTS
  Physical Levels in a Digital System
  Temporal Stages of a Digital System
  Cost of a Digital System
  Summary
  References

2 FAULTS AND THEIR MANIFESTATIONS
  System Errors
  Fault Manifestations
  Fault Distributions
  Distribution Models for Permanent Faults: The MIL-HDBK-217 Model
  Distribution Models for Intermittent and Transient Faults
  Software Fault Models
  Summary
  References
  Problems

3 RELIABILITY TECHNIQUES
  Steven A. Elkind and Daniel P. Siewiorek
  System-Failure Response Stages
  Hardware Fault-Avoidance Techniques
  Hardware Fault-Detection Techniques
  Hardware Masking Redundancy Techniques
  Hardware Dynamic Redundancy Techniques
  Software Reliability Techniques
  Summary
  References
  Problems

4 MAINTAINABILITY AND TESTING TECHNIQUES
  Specification-Based Diagnosis
  Symptom-Based Diagnosis
  Summary
  References
  Problems

5 EVALUATION CRITERIA
  Stephen McConnel and Daniel P. Siewiorek
  Introduction
  Survey of Evaluation Criteria: Hardware
  Survey of Evaluation Criteria: Software
  Reliability Modeling Techniques: Combinatorial Models
  Examples of Combinatorial Modeling
  Reliability and Availability Modeling Techniques: Markov Models
  Examples of Markov Modeling
  Availability Modeling Techniques
  Software Assistance for Modeling Techniques
  Applications of Modeling Techniques to Systems Designs
  Summary
  References
  Problems

6 FINANCIAL CONSIDERATIONS
  Fundamental Concepts
  Cost Models
  Summary
  References
  Problems

II THE PRACTICE OF RELIABLE SYSTEM DESIGN
  Fundamental Concepts
  General-Purpose Computing
  High-Availability Systems
  Long-Life Systems
  Critical Computations

7 GENERAL-PURPOSE COMPUTING
  Introduction
  Generic Computer
  DEC
  IBM
  The DEC Case: RAMP in the VAX Family
  Daniel P. Siewiorek
    The VAX Architecture
    First-Generation VAX Implementations
    Second-Generation VAX Implementations
    References
  The IBM Case Part I: Reliability, Availability, and Serviceability in IBM 308X and IBM 3090 Processor Complexes
  Daniel P. Siewiorek
    Technology
    Manufacturing
    Overview of the 3090 Processor Complex
    References
  The IBM Case Part II: Recovery Through Programming: MVS Recovery Management
  C.T. Connolly
    Introduction
    RAS Objectives
    Overview of Recovery Management
    MVS/XA Hardware Error Recovery
    MVS/XA Serviceability Facilities
    Availability
    Summary
    Reference
    Bibliography

8 HIGH-AVAILABILITY SYSTEMS
  Introduction
  AT&T Switching Systems
  Tandem Computers, Inc.
  Stratus Computers, Inc.
  References
  The AT&T Case Part I: Fault-Tolerant Design of AT&T Telephone Switching System Processors
  W.N. Toy
    Introduction
    Allocation and Causes of System Downtime
    Duplex Architecture
    Fault Simulation Techniques
    First-Generation ESS Processors
    Second-Generation Processors
    Third-Generation 3B20D Processor
    Summary
    References
  The AT&T Case Part II: Large-Scale Real-Time Program Retrofit Methodology in AT&T 5ESS Switch
  L.C. Toy
    5ESS Switch Architecture Overview
    Software Replacement
    Summary
    References
  The Tandem Case: Fault Tolerance in Tandem Computer Systems
  Joel Bartlett, Wendy Bartlett, Richard Carr, Dave Garcia, Jim Gray, Robert Horst, Robert Jardine, Doug Jewett, Dan Lenoski, and Dix McGuire
    Hardware
    Processor Module Implementation Details
    Integrity S2
    Maintenance Facilities and Practices
    Software
    Operations
    Summary and Conclusions
    References
  The Stratus Case: The Stratus Architecture
  Steven Webber
    Stratus Solutions to Downtime
    Issues of Fault Tolerance
    System Architecture Overview
    Recovery Scenarios
    Stratus Software
    Architecture Tradeoffs
    Service Strategies
    Summary

9 LONG-LIFE SYSTEMS
  Introduction
  Generic Spacecraft
  Deep-Space Planetary Probes
  Other Noteworthy Spacecraft Designs
  References
  The Galileo Case: Galileo Orbiter Fault Protection System
  Robert W. Kocsis
    The Galileo Spacecraft
    Attitude and Articulation Control Subsystem
    Command and Data Subsystem
    AACS/CDS Interactions
    Sequences and Fault Protection
    Fault-Protection Design Problems and Their Resolution
    Summary
    References

10 CRITICAL COMPUTATIONS
  Introduction
  C.vmp
  SIFT
  The C.vmp Case: A Voted Multiprocessor
  Daniel P. Siewiorek, Vittal Kini, Henry Mashburn, Stephen McConnel, and Michael Tsao
    System Architecture
    Issues of Processor Synchronization
    Performance Measurements
    Operational Experiences
    References
  The SIFT Case: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control

J