Dealing with Dependent Failures: Theory and Practice

Flavio Junqueira, Alejandro Hevia, Ranjita Bhagwan, Keith Marzullo, and Geoffrey M. Voelker

A typical day of an Internet worm…

Host A runs WidowsOS

Host B runs LoonixOS

Hum? A vulnerability in Widows! … it isn't in Loonix!

Dealing with Dependent Failures

There are two traditional ways of developing a distributed system in an environment with dependent failures:

1. Redesign the system so that failures are independent.

2. Replicate enough so that no failure will cause too many replicas to fail.

A novel approach:

3. Develop new protocols that are aware of the dependency of failures and can act accordingly.

... note that a similar problem arises when components have different probabilities of failure.

Have developed an abstraction for representing information about dependent failures with differing probabilities (non-IID environment).

Generalizes the threshold assumption of no more than t of n components being faulty.

Have used it to generate new lower bounds and optimal protocols for consensus and related problems.

Discovered sufficient conditions under which one can transform a protocol that was written for the threshold model to execute in a non-IID environment.

... broad application in process control, safety-critical systems, distributed sensor networks, ...

An application: Informed Replication

Having a group of nodes cooperatively share their resources in an efficient manner to obtain high availability.

• Motivation
• System model
• Host diversity
• Searching for replica sets
• The Phoenix Recovery System

Setting up the stage

• Past worm outbreaks
  • Code Red (2001): compromised over 359,000 hosts
  • Nimda (2001): multiple forms of infection
  • Slammer (2003): fastest worm in history (90% of vulnerable hosts in 10 minutes)
  • Witty (2004): first to contain malicious payload
• Coping with worms
  • Containment is hard
  • Recover from catastrophes [HotOS03]
• Goal: minimize data loss

Defining the problem

• How are Internet pathogens so successful?
  • Shared vulnerabilities
  • Design or implementation flaw in a software system
• Countermeasure: replicate data using informed replication
  • Replica sets based on shared vulnerabilities
• Problem: identifying vulnerabilities

System model

• A set of hosts (H)
  • A host fails by losing its state
• A set of attributes (A)
  • Attribute = software system
  • Operating systems + applications
• Configuration
  • One operating system
  • Set of applications
• A set of configurations C ⊆ 2^A
• Conf: H → C

[Figure: hosts mapped to their configurations, each a set of attributes (software systems)]

Cores

A set S ⊆ H is a core iff:
• ∃A′ ⊆ A : ∀a ∈ A′ : ∃h ∈ S : a ∉ Conf(h)
• S is minimal

Ideally A′ = A

[Figure: hosts, their configurations, and example cores]
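The core definition can be checked mechanically. The following is a minimal sketch, not the authors' code: it assumes the set of attributes to cover (A′) is passed in explicitly as `target`, and the host names and configurations are made up for illustration.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical representation of Conf: host name -> set of attributes.
Conf = Dict[str, FrozenSet[str]]


def covered_attributes(hosts: Set[str], conf: Conf, attrs: Set[str]) -> Set[str]:
    """Attributes a in attrs for which some host in `hosts` does NOT have a."""
    return {a for a in attrs if any(a not in conf[h] for h in hosts)}


def is_core(hosts: Set[str], conf: Conf, target: Set[str]) -> bool:
    """True iff `hosts` covers every attribute in `target` and is minimal."""
    if not hosts or covered_attributes(hosts, conf, target) != target:
        return False
    # Coverage only grows as hosts are added, so it is enough to check
    # that dropping any single host loses some attribute.
    return all(
        covered_attributes(hosts - {h}, conf, target) != target for h in hosts
    )


# Made-up example data.
conf = {
    "h1": frozenset({"Windows", "IIS"}),
    "h2": frozenset({"Linux", "Apache"}),
    "h3": frozenset({"Windows", "Apache"}),
}
attrs = {"Windows", "Linux", "IIS", "Apache"}
```

With this data, {h1, h2} is a core for A′ = A, {h1, h3} is not (both run Windows), and {h1, h2, h3} covers every attribute but is not minimal.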

Host diversity

• The distribution of configurations affects the efficiency of this approach
  • The distribution is skewed
• Study of the UCSD network
  • nmap tool
    • Port scans: detect open ports
    • OS fingerprinting: guess the OS from error messages
  • Total number of scanned devices: 11,963
  • 2,963 general-purpose hosts (port data + OS)
• Conservative assumptions
  • Same open port = run the same service
  • Ignore OS versions

Top 10 operating systems and services

OS                      Service
Windows       54.1%     netbios-ssn        55.3%
Solaris       10.1%     epmap              50.4%
Mac OS X      10.0%     microsoft-ds       39.0%
Linux         10.0%     sshd               30.7%
Mac OS         6.9%     sunrpc             25.3%
FreeBSD        2.2%     active directory   24.8%
IRIX           2.0%     smtp               19.4%
HP-UX          1.1%     httpd              18.0%
BSD/OS         0.9%     ftpd               17.8%
Tru64 Unix     0.7%     printer            15.6%

Configuration distribution

• Distribution is indeed skewed: 50% of hosts comprise
  • All: 20%
  • Multiple: 15%
  • Atts in at least 100: 8%

Visualizing diversity

• Qualitative view
• More diversity across operating systems
• Still a fair amount of diversity for the same OS

Searching for cores

• What is the practical problem?
  • Determine replica sets
  • Our approach: find cores
• Computing a core of optimal size is NP-complete
  • Use heuristics
• Host as both client and server
  • Client: requests cores
  • Server: participates in cores
• Core
  • Host that requests it (original copy)
  • Replicas

Finding cores: basics

[Figure: four host configurations over the attributes (software systems), with the possible cores marked]

Representing advertised configurations

• Container abstraction
• Containers (B)
  • One for each operating system in A
  • Each container b ∈ B has a set SB(b) of sub-containers, one for each non-OS attribute in A
• A host h advertises its configuration by associating itself with every sub-container s ∈ SB(b)
  • b is the container for the OS of h
  • s is the sub-container in SB(b) for some attribute of h
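One way to picture the container abstraction is as a two-level map from operating system to attribute to hosts. The sketch below is an illustration under assumptions, not the system's actual data structure: the function name and the `(os, applications)` record are hypothetical, and it only creates sub-containers that have at least one host, whereas the slide defines one sub-container per non-OS attribute in A.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# Hypothetical record per host: (operating system, set of non-OS attributes).
HostConfig = Tuple[str, Set[str]]


def build_containers(configs: Dict[str, HostConfig]) -> Dict[str, Dict[str, Set[str]]]:
    """Return containers[os][attribute] = set of hosts advertised there."""
    containers: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for host, (os_name, apps) in configs.items():
        # The host associates itself with every sub-container of its OS
        # container that corresponds to one of its own attributes.
        for app in apps:
            containers[os_name][app].add(host)
    # Convert to plain dicts so lookups of unknown keys do not create entries.
    return {os_name: dict(subs) for os_name, subs in containers.items()}
```

For example, two Windows hosts that both run sshd end up in the same sub-container of the Windows container, while a Linux host running sshd lands in a different container.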

Container abstraction

[Figure: host configurations grouped into containers and sub-containers]

Heuristics

• Random
  • Ignore configurations
  • Choose randomly a number n of hosts from H
• Uniform
  1. Different OS
     i. Choose a container b randomly
     ii. Choose a sub-container sb randomly from b
     iii. Choose a host randomly from sb
  2. Same OS (same b where h is placed)
     i. Choose a sub-container sb randomly from b
     ii. Choose a host randomly from sb
• Weighted: containers weighted by the number of hosts
• Doubly-weighted: sub-containers also weighted
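The two selection steps of the uniform heuristic can be sketched as follows. This is an illustration only: the function names and the `containers[os][attribute]` layout are hypothetical, the sketch assumes that "different OS" means any container other than the requester's own, and it does not show how the picks are combined into a complete core.

```python
import random
from typing import Dict, List, Optional, Set

# Hypothetical layout: containers[os][attribute] = hosts advertising that attribute.
Containers = Dict[str, Dict[str, Set[str]]]


def _pick_host(container: Dict[str, Set[str]], rng: random.Random,
               exclude: str) -> Optional[str]:
    """Pick a random non-empty sub-container, then a random host in it."""
    subs: List[List[str]] = []
    for _, hosts in sorted(container.items()):
        candidates = sorted(hosts - {exclude})  # never pick the requester itself
        if candidates:
            subs.append(candidates)
    if not subs:
        return None
    return rng.choice(rng.choice(subs))


def pick_different_os(containers: Containers, own_os: str, requester: str,
                      rng: random.Random) -> Optional[str]:
    """Step 1: random container for another OS, then sub-container, then host."""
    others = sorted(os_name for os_name in containers if os_name != own_os)
    if not others:
        return None
    return _pick_host(containers[rng.choice(others)], rng, requester)


def pick_same_os(containers: Containers, own_os: str, requester: str,
                 rng: random.Random) -> Optional[str]:
    """Step 2: stay in the requester's own container b."""
    return _pick_host(containers.get(own_os, {}), rng, requester)
```

Both functions return `None` when no suitable host exists. The weighted variants would replace the uniform `rng.choice` calls with choices weighted by the number of hosts.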

Simulations

• Population: 2,963 general-purpose hosts
• One run: each host computes a core
• Questions
  • How much replication is needed?
  • How many other hosts does a particular host have to service?
  • How well do the chosen cores protect hosts?
• Metrics
  • Average core size (core size)
    • Core size averaged across all the hosts
  • Maximum load (load)
    • Maximum number of other hosts that any host services
  • Average coverage (coverage)
    • Coverage: percentage of attributes covered in a core
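The three metrics can be computed from the cores produced in one run. The sketch below is a hedged illustration: it assumes each core is stored as a set that includes the requesting host, and that coverage is measured over the requesting host's own attributes (an attribute counts as covered when some host in the core lacks it). The function name and data layout are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Set, Tuple


def simulation_metrics(cores: Dict[str, Set[str]],
                       conf: Dict[str, FrozenSet[str]]) -> Tuple[float, int, float]:
    """Return (average core size, maximum load, average coverage)."""
    load: Counter = Counter()
    coverage_sum = 0.0
    for client, core in cores.items():
        # Every core member other than the client services that client.
        for member in core:
            if member != client:
                load[member] += 1
        own = conf[client]
        covered = sum(1 for a in own if any(a not in conf[h] for h in core))
        coverage_sum += covered / len(own) if own else 1.0
    avg_size = sum(len(core) for core in cores.values()) / len(cores)
    return avg_size, max(load.values(), default=0), coverage_sum / len(cores)
```

The load limit L discussed later in the deck would cap the per-host counter in `load` during core selection; this sketch only measures the load after the fact.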

A sample run

• Random
  • Better load balance
  • Worse coverage
  • Worse core size
• Load is too high for other heuristics
• Proposed modification
  • Limit the load of each host
  • Intuition: force load balance
  • Each host services at most L other hosts
  • L = load limit or simply limit

             Core size   Coverage   Load
Random       5           0.977      12
Uniform      2.56        0.9997     284
Weighted     2.64        0.9995     84
DWeighted    2.58        0.9997     91

Core size

• Random increases linearly with load
  • Intrinsic to the heuristic
• Other heuristics
  • Core size less than 3
  • For many hosts, one single replica

Coverage

• Lower bound on limit: 2
  • Dependent on the diversity
• Uniform: limit at least 3 to achieve 3 nines coverage
• Weighted: achieves 3 nines coverage for limit values at least 2
• Random: core size at least 9 to achieve same coverage

Uncovered hosts

• Share of hosts that are not fully covered is small
• Uniform
  • Limit 3: slightly over 1%
  • Limit > 4: around 0.5%
• Weighted
  • Around 0.5%
• Random
  • Core size greater than 8 to achieve similar results

Load variance

• Downside of uniform
  • Worst variance
• Variance is similar for small values of limit
• Load limit forces better distribution

Summary of simulation results

• How many replicas are needed?
  • Around 1.3 on average
• How many other hosts does a particular host have to service?
  • Uniform: 3 for good coverage
  • Weighted: 2 for good coverage
• How well do the chosen cores protect hosts?
  • Uniform: coverage greater than 0.999, L ≥ 3
  • Weighted: coverage greater than 0.999, L ≥ 2
• Uniform heuristic
  • Simpler
• Weighted heuristics
  • Better load balance

Exploits on k attributes

• Illustrate with k = 2
• A variant of uniform
  1. Client c chooses a host h with different OS
  2. Find a core for c using uniform
  3. Find a core for h using uniform
  4. Combine the 2 cores to form a 2-resilient core

L     2-cov   1-cov   Core size
5     0.76    0.86    4.18
6     0.86    0.92    4.58
7     0.95    0.99    5.00
8     0.97    1.00    5.11
9     0.98    1.00    5.16
10    0.98    1.00    5.17
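The four steps of the uniform variant for k = 2 can be sketched as below. This is only an outline under assumptions: `find_core` stands in for the uniform heuristic and is a hypothetical callable supplied by the caller, `os_of` is a hypothetical map from host to operating system, and the partner host is picked uniformly at random among hosts with a different OS.

```python
import random
from typing import Callable, Dict, Optional, Set

# Hypothetical: returns a core (set of hosts) for the given host.
FindCore = Callable[[str], Set[str]]


def two_resilient_core(client: str, os_of: Dict[str, str], find_core: FindCore,
                       rng: random.Random) -> Optional[Set[str]]:
    """Combine a core for the client with a core for a host running another OS."""
    # Step 1: choose a host h whose OS differs from the client's.
    others = sorted(h for h in os_of if os_of[h] != os_of[client])
    if not others:
        return None
    partner = rng.choice(others)
    # Steps 2-4: find a core for each host and take the union.
    return find_core(client) | find_core(partner)
```

Because the result is a union of two cores, its size is at most the sum of the two, which is consistent with the core sizes of roughly 4 to 5 in the table above.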

The Phoenix Recovery System

• Backup data on cores
• Requirement: set of operating systems and applications is not known
• Macedon framework
  • Pastry DHT
• Advertising configurations
  • Container → Zone
  • Sub-container → Sub-zone
• OS hint lists
  • Empty zones
  • Doesn’t need to be accurate

Prototype evaluation

• On PlanetLab
• Total number of hosts: 63
  • 62 PlanetLab hosts
  • 1 UCSD host
• Configurations manually set
  • 63 randomly chosen out of the 2,963

Evaluation results

• Simulated attack
  • Parameters
    • Backup file: 5 MB
    • L = 3
    • Interval between announcements: 120 s
  • Target: Windows hosts (60%)
  • Caused hosts to crash almost simultaneously
• All hosts recovered
  • For 35: avg 100 s
  • For 3: several minutes (transient network failures)

L    Core size         Coverage          Load var.
     Imp.    Sim.      Imp.    Sim.      Imp.    Sim.
3    2.12    2.22      1       1         1.65    1.94
5    2.10    2.23      1       1         2.88    2.72
7    2.10    2.22      1       1         4.44    3.33

• Imp. = implementation
• Sim. = simulation

Conclusions

• Informed replication
  • Replica sets based on attributes
  • Internet catastrophes: software systems
• Survivable data at a low replication cost
  • Core size is less than 3 on average
  • Hosts service at most 3 other hosts
• Diversity study
  • Approach is realistic
• Side-effects of load limit scheme
  • Upper bounds the amount of work any host has to accomplish
  • Constrains damage in case of individual malicious behavior

Future work

• Real deployment
  • Tune current prototype
  • Security features
  • Cope with real threats
• More data sets to determine diversity
• Mechanism to monitor resource usage
• Informed replication
  • With other approaches for cooperative backup
  • With other types of attributes
    • E.g. resource utilization
