CS717 Detection and Tolerance of Complex Faults in Computing Systems CS 717 Greg Bronevetsky


The Problem

• Systems fail
  – Hardware failures
  – Software failures
  – Hacker attacks
• Failures can be made less probable
  – ex: Higher quality hardware
• Cannot be prevented
• Key challenge
  – Detect real-world failures
  – Allow applications to tolerate them

Types of Failures

• Hardware Failures
  – Wear and tear on wires
  – Electric interference
  – Radiation hitting electronics
• Software Failures
  – System misconfigurations
  – Buggy code
• Hacker Attacks
  – Worst-case scenario
  – Arbitrary modifications of system state/code

Classes of Faults

• Permanent Faults
  – Same input will result in same (erroneous) output
  – Ex:
    • Broken wires
    • OS misconfigurations
• Transient Faults
  – Temporarily erroneous output
  – Not replicable
  – Ex:
    • Radiation hitting wires, flipping bits
    • Hacker attacks
  – Typically harder to detect

Fault Models

• To study faults, multiple fault models have been developed
  – Too many to cover here
• Major Types:
  – Fail-stop
  – Random
  – Human-Induced
  – Byzantine

Fail-Stop Faults

• System stops on failure
• No random misbehaving
• Very simple to detect
• Vast body of work exists assuming this model
  – Mostly focused on failure tolerance

Random Faults

• State of computer changed randomly
• Computer may be arbitrarily complex
• Meant to model physical problems in hardware
• Work ranges from theoretical to physical
  – Abstract circuits with randomly failing gates
    vs.
  – Shooting CPUs with protons

Human-Induced Faults

• Targets failures typically caused by humans
• Focus of work to deal with specific problems
  – Buffer overflow attacks
  – System misconfiguration
  – Buggy code …
• Types of research
  – Bug Prevention: Software Engineering, Programming Languages, User Interfaces
  – Hacker Attacks: Security

Byzantine Faults

• Worst-case scenario
• Adversary can change system state/code arbitrarily
• Typically, limits placed on adversary
  – Decidable
  – Polynomial-time probabilistic
  – Longer-running than checker algorithm
  …
• Very hard to detect these efficiently
• Solutions either very specific or very expensive

Goal

• Suppose system fails according to some model
• Can we
  – Detect fault and recover?
  – Ensure that result is correct despite failure?
    • ex: guarantee provided by error-correcting codes
• Is 100% correctness achievable?
  – Probably not
  – High-probability correctness doable

What Might We Want to Do

• Detect any errors that happen in computation
• Detect most errors
  – Random ones
  – Human-induced
  – Byzantine with limited adversary
• Once detected, correct errors
• Tolerate errors without explicit detection

What Might We Want to Do

• Want to have deterministic correctness guarantees
  – May be possible if adversary limited enough
• Probabilistic guarantees may be sufficient
  – A 1-in-100-years chance of failure is effectively a 0% chance
• Would like to have provable guarantees
• May settle for experimental evidence of effectiveness

Coverage of Course

• Work exists in many subfields
• Theory
  – Checkers for restricted algorithm classes
  – Complexity work on efficient proving systems
• Systems
  – Replicated systems
  – Checking specific algorithms
  – Checking program control flows
  – Theory of placement of manual checks

Coverage of Course

• Software Engineering
  – Programmers helping with system checks
  – Assertions
  – Modeling of programs
• Hardware work
  – Variations on hardware replications
  – Lock-step execution
  – Thread-level speculation

State of the Field

• Many communities, little communication
  – Systems work boring to theorists
  – Theory obscure/useless to systems people
• Solution space ranges from algorithm-specific to general
• Mostly useless for general programs
• If automatic, then very high time/hardware overheads (100%+)
• If not automatic, then significant manual work for programmer

What Do We Want?

• Want to
  – Never worry about failures
  – Not pay much overhead for the privilege
• This means
  – Arbitrary programs
  – Reasonably powerful adversaries (at least, random faults)
  – No (or minimal) programmer interaction
  – Efficiency

Solution Parameters

Need solution that is both:
• Automatic
  – Applies to any program
  – No programmer interaction
• Intelligent
  – Tailored to each particular program
  – Application-specific knowledge used for better reliability/efficiency

The Black Hole

Little work on solutions both Automatic AND Intelligent

• Theory of checking: establishes limits and capabilities of checkers
• Algorithm-specific solutions (efficient checkers tied to particular algorithms)
• Blind mechanisms (simple, application independent)
• Automatic AND Intelligent???

The Black Hole

Our Goal: Automatic AND Intelligent

[Diagram: theory of checking, algorithm-specific solutions, and blind mechanisms surround a gap labeled "Automatic AND Intelligent!"]


Our Research Goal

• Given an algorithm, create custom checker

• Detects errors with high probability

• Runs in time asymptotically smaller than original algorithm (Or just faster by constant factor)

Our Research Goal

• Intelligent checking works at application-level
• Lower-level techniques only check simple components
  – Correctness of addition, memory state, etc.
  – Poor ability to cut overhead
    • Since no application knowledge
• Application-level checking: application's own semantics preserved through faults
  – i.e. if app says matrix A=BC, ensure that this holds
  – Solution tailored to application
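
To make the A=BC example concrete, here is a minimal sketch of one well-known application-level check, Freivalds' randomized test. The slides do not name this technique; it is swapped in purely as an illustration, and the function name `freivalds_check` and its parameters are hypothetical. It assumes square integer matrices stored as lists of lists.

```python
import random

def freivalds_check(B, C, A, trials=20, rng=None):
    """Probabilistically test the claim A == B*C for square integer matrices.

    Each trial costs O(n^2) arithmetic operations, versus O(n^3) for
    recomputing the product with the schoolbook algorithm.
    """
    rng = rng or random.Random()
    n = len(A)

    def mat_vec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(trials):
        r = [rng.randint(0, 1) for _ in range(n)]  # random 0/1 vector
        # If A == B*C then B*(C*r) == A*r for every r.
        if mat_vec(B, mat_vec(C, r)) != mat_vec(A, r):
            return False  # the claimed product is certainly wrong
    # A wrong A survives each trial with probability at most 1/2.
    return True
```

The check never rejects a correct product, and a faulty product slips through all trials with probability at most 2^-trials, which is the kind of high-probability guarantee the course is after.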

Our Research Goal

• Unknown if Automatic/Intelligent is possible
  – Some subproblems shown impossible via complexity theory
  – Problem looks very hard
• Field very large but largely nonexistent
• Goal of class: Find inspiration from current work to embark on new research

Coverage of Course

• Lectures cover papers in multiple fields
• Different communities bring different techniques
  – Hence, different sources of inspiration
• Multiple fault models
  – Random and Byzantine in particular
• By semester end, may find good leads to follow

Outline of Course

(in no particular order)

• Hardware/Software Replication
• Algorithm Based Fault Tolerance
• Control-Flow Checking
• Checkers for Specific Algorithms
• Data Structure Checkers
• Complexity Work on Provers (PCP Theorem)
• Programmers Helping Checkers
• Fault Detection in Parallel Systems
• Hardware Fault Tolerance
• Fault Tolerant Circuits
• Experimental Evaluation of Checkers
• Physics Experiments
• Machine Learning

Hardware Replication

• Comes from Systems community
• Run same algorithm on multiple processors
• Compare results, take majority vote
• Tolerates almost arbitrary faults in minority of processors
  – 2f+1 replicas needed for f failures
• Triplication of hardware most common
• Used by NASA to secure against errors
  – Recent gravity experiment saved by backup processor
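
The majority-vote rule with 2f+1 replicas can be sketched as follows. This is an illustrative sketch only: the function name `majority_vote` is hypothetical, and it assumes every replica replies and that replies are hashable values.

```python
from collections import Counter

def majority_vote(replies, f):
    """Return the value reported by at least f+1 of the 2f+1 replicas.

    Assumes every replica replies and at most f replies are wrong, so the
    f+1 (or more) correct replicas always agree on the winning value.
    """
    if len(replies) != 2 * f + 1:
        raise ValueError("expected exactly 2f+1 replies")
    value, count = Counter(replies).most_common(1)[0]
    if count < f + 1:
        # Only possible if more than f replicas misbehaved.
        raise RuntimeError("no value has a majority")
    return value
```

With triplication (f = 1), `majority_vote([42, 42, 7], f=1)` picks 42 even though one replica returned garbage.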

Byzantine Quorums

• Basic replication assumes: every processor replies with answer
• Suppose faulty processors can stay silent
• Allow f faulty processors out of n
  – Then we must decide on correct answer after n-f replies
  – But f out of those n-f might be wrong
  – Thus, must take majority decision out of n-2f
• Bottom line: to tolerate f faults, need 3f+1 replicas
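
The counting argument behind the 3f+1 bound can be checked mechanically. The sketch below only encodes the arithmetic from the bullets above; the function names are hypothetical and no actual quorum protocol is implemented.

```python
def worst_case_correct_replies(n, f):
    """Correct replies guaranteed among the first n - f replies received.

    Up to f faulty replicas may stay silent, so we can only wait for n - f
    replies; up to f of those may still come from faulty replicas.
    """
    return (n - f) - f

def can_decide(n, f):
    """True if correct replies always outnumber the (at most f) wrong ones."""
    return worst_case_correct_replies(n, f) > f

def min_replicas(f):
    """Smallest n for which can_decide(n, f) holds."""
    n = 1
    while not can_decide(n, f):
        n += 1
    return n
```

Since n - 2f > f is the same as n > 3f, the search in `min_replicas` stops at n = 3f+1.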

Byzantine Quorums

• Byzantine Quorum Systems provide protocols to manage this 3f+1 replication
• Cryptographically sign all communication
• Maintain known core of 3f+1 good processors at all times
• Protocols somewhat mindbending but pretty cool

Algorithm-Based Fault Tolerance

• Comes from Scientific Computing/Numerical Analysis community
• Fault tolerance for basic linear algebra algorithms
• Input encoded in algorithm-specific code
  – Input matrices typically encoded per row/column
• Algorithm run on encoded input, returns encoded output
• Encoded output decoded, checked for inconsistencies

• Encoding guarantees detection/tolerance of up to f errors in each row/column
• Approach meant for parallel systems
• If processor fails, all its results likely wrong
• Thus, algorithms modified s.t. no processor touches >1 entry in a row/column
• Approach fairly general, but each algorithm needs own solution
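
As an illustration of encoding per row/column, here is a minimal sketch of checksum-encoded matrix multiplication in the style commonly used in ABFT. It assumes exact integer arithmetic and plain lists of lists; all helper names are hypothetical, and real ABFT schemes for floating-point data need tolerance thresholds that are not shown.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add_column_checksums(A):
    """Encode A by appending a row that holds the sum of each column."""
    return A + [[sum(col) for col in zip(*A)]]

def add_row_checksums(B):
    """Encode B by appending to each row the sum of that row."""
    return [row + [sum(row)] for row in B]

def find_inconsistencies(Cf):
    """Return (bad_rows, bad_cols) of a product carrying both checksums."""
    n, m = len(Cf) - 1, len(Cf[0]) - 1
    bad_rows = [i for i in range(n + 1) if sum(Cf[i][:m]) != Cf[i][m]]
    bad_cols = [j for j in range(m + 1)
                if sum(Cf[i][j] for i in range(n)) != Cf[n][j]]
    return bad_rows, bad_cols
```

Multiplying the encoded inputs yields an output whose last row and column are checksums of the true product, so a single corrupted entry shows up as exactly one inconsistent row and one inconsistent column, which also locates it.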

• ABFT produces checkers for data elements
• Can develop theory of check placement in parallel systems
• Given assignment of data to processors and checks to data, can derive number of faults detectable/tolerable by arrangement
  – Detectability/tolerance depends on detailed failure model
• Multiple evaluation algorithms available

Control-Flow Checking

• Checking general programs is hard
• Control-flow follows basic stack pattern
  – Much easier to check
  – Present in most programs
• Solutions typically annotate program, check annotations
• Typically check that
  – Program exits blocks it entered
  – Program executes each block’s correct code

Control-Flow Checking

• Hardware
  – Watchdog processor watches fetched instructions
  – Yells if illegal block sequence or illegal instructions in block
• Software
  – Program modified to check itself
  – Can’t check each instruction
    • Too costly in software
  – Just checks that program moves through blocks/functions correctly
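
A software control-flow check of the block-to-block kind might look like the sketch below. It is an assumption-laden illustration: the `ControlFlowMonitor` class, the `enter` call inserted at the top of each block, and the example graph `CFG` are all hypothetical and not taken from any particular paper.

```python
class ControlFlowError(Exception):
    """Raised when the program takes a transition the graph does not allow."""

class ControlFlowMonitor:
    def __init__(self, successors, entry):
        self.successors = successors  # block name -> set of legal next blocks
        self.entry = entry
        self.current = None

    def enter(self, block):
        """Call at the top of every instrumented block."""
        if self.current is None:
            legal = (block == self.entry)
        else:
            legal = block in self.successors.get(self.current, set())
        if not legal:
            raise ControlFlowError(
                f"illegal transition {self.current!r} -> {block!r}")
        self.current = block

# Hypothetical control-flow graph of a small loop.
CFG = {"init": {"test"}, "test": {"body", "exit"},
       "body": {"test"}, "exit": set()}
```

Only block entries are checked, not individual instructions, which matches the cost trade-off described for software schemes.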

Checkers for Specific Algorithms

• Work from Theory and Software Engineering communities
• Given specific algorithm can usually develop efficient checker for it
• Exist checkers for whole algorithm classes
  – ex: Linear recurrences
• Self-correctors available
  – Corrector calls faulty algorithm on several random inputs
  – Collects results into (likely) correct answer
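
A self-corrector of this kind can be sketched for a linear function over the integers mod a prime. The example is an illustration under stated assumptions: the modulus `P`, the function name `self_correct`, and the choice of a linear function are all hypothetical, and the faulty program is assumed to be wrong on only a small fraction of inputs.

```python
import random
from collections import Counter

P = 10007  # prime modulus, chosen only for illustration

def self_correct(faulty_f, x, trials=15, rng=None):
    """Recover f(x) for a linear f (mod P) from a mostly-correct program.

    For linear f, f(x) = f(x + r) - f(r) (mod P) for any r. Both x + r and
    r are uniformly random, so each call rarely hits a faulty input.
    """
    rng = rng or random.Random()
    votes = Counter()
    for _ in range(trials):
        r = rng.randrange(P)
        votes[(faulty_f((x + r) % P) - faulty_f(r)) % P] += 1
    # The (likely) correct answer is the most frequent candidate.
    return votes.most_common(1)[0][0]
```

Even when the program is wrong on the queried input x itself, the corrector never needs to trust that particular call.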

• For example, sorting:
  – Invariants:
    • Output is permutation of input
    • Output is in non-decreasing order
  – To check:
    • Can easily check order in linear time
    • Modify sorter to output for each input value its post-sort index
    • Can use index list to verify permutation in linear time
  – This O(n log n) algorithm has O(n)-time checker
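
The sorting checker described above can be sketched as follows. The function names `sort_with_trail` and `check_sort` are hypothetical; the sorter here is just Python's built-in sort wrapped so that it also emits the index list.

```python
def sort_with_trail(values):
    """Sort, and also report for each input position its post-sort index."""
    order = sorted(range(len(values)), key=values.__getitem__)
    output = [values[i] for i in order]
    index_of = [0] * len(values)
    for out_pos, in_pos in enumerate(order):
        index_of[in_pos] = out_pos
    return output, index_of

def check_sort(values, output, index_of):
    """Verify both invariants in O(n) time using the index list."""
    n = len(values)
    if len(output) != n or len(index_of) != n:
        return False
    # Invariant 1: output is in non-decreasing order.
    if any(output[i] > output[i + 1] for i in range(n - 1)):
        return False
    # Invariant 2: index_of is a one-to-one map from inputs to equal outputs.
    seen = [False] * n
    for in_pos, out_pos in enumerate(index_of):
        if not (0 <= out_pos < n) or seen[out_pos]:
            return False
        if output[out_pos] != values[in_pos]:
            return False
        seen[out_pos] = True
    return True
```

The checker does not trust the index list: a corrupted list either repeats a slot, points out of range, or maps an input to an unequal output, and each case is rejected.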

• Different algorithms have different checkers
• Some use “certification trails” (additional mini-proof to help verify correctness)
  – ex: Sorting checker
• Checker for one algorithm rarely applicable to other algorithms
• Specific checkers give technique ideas and show how efficient general algorithms can be

Data Structure Checkers

• Algorithms only produce answers
• Usually interested in maintaining state reliably
• Need checkers for data structures
• Solutions exist for
  – Generic RAMs (most expensive)
  – Stacks
  – Queues
  – Trees
  – Graphs
  …

Data Structure Checkers

• Some solutions use a little secure memory safe from adversary
• Others use certification trails to prove correctness of encoding
• Solution for RAMs applicable to all other data structures
• Custom-tailored checkers more efficient
  – Not directly applicable to other data structures
  – Tend to provide good inspiration though
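
To illustrate the "little secure memory" idea, here is a minimal sketch of a stack whose contents live in untrusted memory while a single 32-byte hash digest is kept in trusted memory. This hash-chain construction is an illustrative choice, not necessarily the scheme used in the papers the course covers; the class names are hypothetical, and it assumes SHA-256 is collision resistant and that items have a stable `repr`.

```python
import hashlib

class CorruptionDetected(Exception):
    pass

class CheckedStack:
    _EMPTY = b"\x00" * 32  # digest representing the empty stack

    def __init__(self):
        self._untrusted = []        # memory the adversary may modify
        self._digest = self._EMPTY  # the only trusted state: 32 bytes

    @staticmethod
    def _link(prev_digest, item):
        return hashlib.sha256(prev_digest + repr(item).encode()).digest()

    def push(self, item):
        # Store the item with the digest of everything below it.
        self._untrusted.append((item, self._digest))
        self._digest = self._link(self._digest, item)

    def pop(self):
        if not self._untrusted:
            if self._digest != self._EMPTY:
                raise CorruptionDetected("entries are missing")
            raise IndexError("pop from empty stack")
        item, prev_digest = self._untrusted.pop()
        if len(prev_digest) != 32 or self._link(prev_digest, item) != self._digest:
            raise CorruptionDetected("top entry was modified")
        self._digest = prev_digest
        return item
```

Each pop verifies the top entry against the trusted digest and then rolls the digest back, so tampering anywhere in the stack is detected no later than when the affected entry is popped.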

Complexity Work on Provers

• Complexity community wants to know: How small can proofs get?
• To show a string is in an NP language, need poly-length proof
• How small can proof be if only probabilistic guarantee?
• Big theorems developed
  – IP = PSPACE
  – PCP Theorem

• IP = PSPACE
  – IP (Interactive Proofs) = Languages where membership can be probabilistically proven via poly-many queries
  – PSPACE = Languages computable using poly space
• PCP Theorem: can prove a string is in an NP language by
  – Using O(log n) random bits
  – Showing 3 random bits of (very long) proof

• Work done by complexity theorists, so:
  – Very cheap for checker
  – VERY expensive for prover
  – Not directly applicable to real world
• However in theorem development, multiple tools developed
  – Checkers for polynomials
  – Secure encodings via polynomials
  …
• Tools useful for our purposes

Programmers Helping Checkers

• Software Engineering community
• Gives up on purely automated solutions, asks programmer for help
• Simplest example: Assertions & Exceptions
  – Insert boolean checks into code
  – If check fails
    • Assertions: program informs user of failure
    • Exception: program executes exception handler to fix problem
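
The difference between the two reactions to a failed check can be sketched with a square-root routine whose result is cheap to verify. The names `sqrt_with_assertion`, `sqrt_with_handler`, and `CheckFailed` are hypothetical, and the "possibly faulty" implementation is passed in as a parameter purely for illustration.

```python
import math

class CheckFailed(Exception):
    pass

def sqrt_with_assertion(x, sqrt_impl):
    """Assertion style: a failed check just reports the problem."""
    root = sqrt_impl(x)
    assert math.isclose(root * root, x, rel_tol=1e-9), "sqrt check failed"
    return root

def sqrt_with_handler(x, sqrt_impl):
    """Exception style: a failed check triggers a handler that repairs it."""
    try:
        root = sqrt_impl(x)
        if not math.isclose(root * root, x, rel_tol=1e-9):
            raise CheckFailed("sqrt check failed")
        return root
    except CheckFailed:
        # Handler: fall back to a trusted (here: library) implementation.
        return math.sqrt(x)
```

In both cases the programmer supplies the boolean check; only the second variant lets the program continue with a corrected value.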

Programmers Helping Checkers

• Another example: Programmer-implemented checkers
• SCCM – two languages in one
  – Regular imperative language
  – Auxiliary functional language
    • Allows programmer to specify algorithm itself
    • Hooks to associate algorithm’s variables to program variables
  – System checks main program’s state using auxiliary program
• Pro: Guards against bugs & faults
• Con: Annoying to use

Fault Detection in Parallel Systems

• Parallel systems bring new challenges
• May wish to compute average of numbers held by all nodes
  – Some nodes may be faulty
  – Can use polynomials to encode data to help system compute approximate answer
• May wish to know which nodes failed (or number of failed nodes)
  – Hard when faults are Byzantine
  – If this was known, could
    • Move replicas away from failed nodes
    • Increase degree of replication

Fault Detection in Parallel Systems

• Some techniques guard against malicious humans
• SETI@Home has users who return fake data without doing work
• Can develop schemes to make this unprofitable
• Parallel techniques valuable for
  – Scientific computations (usually parallel and large)
  – Large business computations (ex: databases)

Hardware Fault Tolerance

• If hardware can be made more reliable then less need for software-level reliability
• Quality of circuitry will not improve in near future
  – Driven by physics and economics
• Can use replication of circuits to achieve high reliability
• Faults caught at hardware level: fast, invisible correction

Hardware Fault Tolerance

• Simplest version: lock-step execution
  – Two processors execute in lock-step
  – If output of any instruction pair disagrees, redo instruction
• Modern versions
  – Thread-level speculation (TLS) allows threads to be aborted mid-stream
  – Can guess that branch will go TRUE and speculatively execute ahead
  – If guess was wrong, all speculative thread actions aborted

Fault Tolerance via TLS

• Can run multiple threads with identical code
  – Regularly compare results
  – If disagreement, abort threads, try again
• TLS can be done on same processor or multiple processors
• Threads touch hardware differently, so affected differently by physical problems
  – ex: radiation, faulty wires, etc.
• Guarantees very low-level
• Low overhead (if TLS already available)
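
The run, compare, and retry pattern can be sketched in software as below. This is only a sequential stand-in: real TLS hardware runs the copies as speculative threads and aborts them in hardware, none of which is modeled here, and the function name `run_redundantly` and its parameters are hypothetical.

```python
def run_redundantly(task, arg, copies=2, max_retries=3):
    """Run identical copies of a task, compare results, retry on mismatch.

    Assumes faults are transient, so a re-execution is likely to succeed.
    """
    for _ in range(max_retries + 1):
        results = [task(arg) for _ in range(copies)]
        if all(r == results[0] for r in results):
            return results[0]  # all copies agree: accept the result
        # Disagreement: discard ("abort") this attempt and try again.
    raise RuntimeError("copies kept disagreeing; fault may be permanent")
```

Note that agreement among copies only detects faults that affect the copies differently, which is why the differing hardware footprint of the threads matters.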

Fault Tolerant Circuits

• Very old field, begun by von Neumann in the 1950s
• Model:
  – Computer made up of binary logic gates
  – Each gate fails with constant probability
• Approach
  – Replicate each gate log n times
    • n = number of gates in overall circuit
  – Feed output of copies into combiner circuit
  – Each replicated gate then fails with probability O(1/n)
  – Thus, probability of 1 of n gates failing = constant
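
The effect of replicating each gate about log n times can be explored numerically with the sketch below. It makes simplifying assumptions that go beyond the slide: gate copies fail independently with probability `eps`, the combiner is a perfect majority voter, and the replica count is `c * log2(n)` rounded up to an odd number, where the constant `c` and all function names are hypothetical.

```python
import math

def majority_failure_prob(eps, k):
    """Probability that more than half of k independent copies fail."""
    return sum(math.comb(k, i) * eps**i * (1 - eps)**(k - i)
               for i in range(k // 2 + 1, k + 1))

def replicas_for(n, c=1):
    """An odd replica count of roughly c * log2(n)."""
    k = max(1, math.ceil(c * math.log2(n)))
    return k if k % 2 == 1 else k + 1

def circuit_failure_bound(eps, n, c=1):
    """Union bound on the probability that any of n replicated gates fails."""
    return min(1.0, n * majority_failure_prob(eps, replicas_for(n, c)))
```

Whether the bound stays constant as n grows depends on `eps` and on the constant `c`: the per-gate failure probability must shrink at least as fast as 1/n for the union bound over n gates to remain bounded.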

Fault Tolerant Circuits

• Get constant failure probability for any size circuit
  – Note, constant number of replicas becomes worse as replicas become larger (more gates)
• Limitation: can’t replicate <log n since output gate needs log n replication
• Recent work: encoded computation
  – Encode all inputs using Reed-Solomon codes
  – Transform circuit to work on encoded data
  – Output encoded data
  – Allows for <log n replication since output > 1 gate

Experimental Evaluation of Checkers

• Many different approaches to fault tolerance
• Little work on comparing their effectiveness in real-world setting
• Example: control-flow checking
  – Many different papers
  – Each uses different evaluation method
  – Net effect: no insight into how techniques compare
• No surprise: experiments not glamorous
• Will cover few papers that do experimental comparisons

Physics Experiments

• Multiple sources of hardware faults
  – Manufacturing defects
  – Temperature
  – Wire failure
  – Radiation
• Manufacturing defects and wire failure hard to study
• Temperature-induced failures
  – Easy to study
  – However, few papers around (that I’ve seen)

Physics Experiments

• Radiation-induced failures studied extensively by physicists
  – Stick CPU inside particle accelerator
  – Shoot protons or ions at it
  – Run sample programs and watch what happens
• Experiments run on various CPUs
• Give insight into vulnerability of CPUs to radiation
  – Approximate space operation conditions

Machine Learning

• Machine learning algorithms can be thought of as “fault removers”
  – Given true data + adversary-induced noise
  – Return expression describing true data
    • Decision tree, linear plane, neural net, etc.
  – Thus, undo effects of adversary
• If limit data & adversary complexity
  – Can prove given learning algorithm effective
    • ex: PAC learning
  – Use complexity measurements like V-C dimension

Machine Learning

• Can assure correct output of algorithms with low complexity
  – i.e. low V-C dimension
• Applies to broad range of algorithms
• Not very intelligent (i.e. not algorithm-specific)
• Would like to cover in course
• I don’t have background to find papers
• Anyone want to help?

CS717

• Failures happen, want to pretend they don’t
• Need techniques for detecting and correcting system faults
• Requirements
  – Automatic – little/no programmer interaction
  – Efficient – small cost for fault tolerance
• Must check correctness at application-level
  – Ensure application-level semantics
  – Checker tailored to application – potential for low overhead

Something for Everyone

• Field spans many different areas
  – Theory
    • Algorithms
    • Complexity
  – Systems
  – Computer Architecture
  – Software Engineering
  – Scientific Computing/Numerical Analysis
  – Machine Learning
• Great place for cross-area collaboration


Pulling Fields Together

• Breadth of field makes collaboration hard

• Difficult to see from one side to another

• Nobody expert in all sub-areas

• This semester will get basic grounding in many

Goal of Semester

• If your computer is lying to you, how do you know?
• Big question, no good answers
• Very fundamental to Computer Science
• Goal:
  – By end of semester get ideas for possible answers
  – May found new field in process

The Black Hole

• Many disjoint efforts
• Core problem still wide open
• Let's find the answer!

[Diagram: theory of checking, algorithm-specific solutions, and blind mechanisms surround a gap labeled "Automatic AND Intelligent???"]
