Dealing with Dependent Failures: Theory and Practice
Flavio Junqueira, Alejandro Hevia, Ranjita Bhagwan, Keith Marzullo, and Geoffrey M. Voelker

A typical day of an Internet worm…
Host A runs WidowsOS
Host B runs LoonixOS
"Hum? A vulnerability in Widows! … it isn't in Loonix!"

Dealing with Dependent Failures
There are two traditional ways of developing a distributed system in an environment with dependent failures:
1. Redesign the system so that failures are independent.
2. Replicate enough so that no failure will cause too many replicas to fail.

Dealing with Dependent Failures
A novel approach:
3. Develop new protocols that are aware of the dependency of failures and can act accordingly.
... note that a similar problem arises when components have different probabilities of failure.

Dealing with Dependent Failures
Have developed an abstraction for representing information about dependent failures with differing probabilities (non-IID environment).
Generalizes the threshold assumption of no more than t of n components being faulty.
Have used it to generate new lower bounds and optimal protocols for consensus and related problems.
Discovered sufficient conditions under which one can transform a protocol that was written for the threshold model to execute in a non-IID environment.
... broad application in process control, safety-critical systems, distributed sensor networks, ...

An application: Informed Replication
Having a group of nodes cooperatively share their resources in an efficient manner to obtain high availability.
• Motivation
• System model
• Host diversity
• Searching for replica sets
• The Phoenix Recovery System

Setting up the stage
• Past worm outbreaks
  • Code Red (2001): compromised over 359,000 hosts
  • Nimda (2001): multiple forms of infection
  • Slammer (2003): fastest worm in history (90% of vulnerable hosts in 10 minutes)
  • Witty (2004): first to contain malicious payload
• Coping with worms
  • Containment is hard
  • Recover from catastrophes [HotOS03]
• Goal: minimize data loss

Defining the problem
• How are Internet pathogens so successful?
  • Shared vulnerabilities
    • Design or implementation flaw in a software system
• Countermeasure: replicate data using informed replication
  • Replica sets based on shared vulnerabilities
• Problem: identifying vulnerabilities

System model
• A set of hosts (H)
  • A host fails by losing its state
• A set of attributes (A)
  • Attribute = software system
  • Operating systems + applications
• Configuration
  • One operating system
  • Set of applications
• A set of configurations C ⊆ 2^A
• Conf: H → C
[Figure: hosts mapped to configurations; each configuration is a set of attributes (software systems)]

Cores
A set S ⊆ H is a core iff:
• ∃A’ ⊆ A : ∀a ∈ A’ : ∃h ∈ S : a ∉ Conf(h)
• S is minimal
Ideally A’ = A
[Figure: hosts and their configurations, with cores marked]
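The core condition can be checked mechanically. The following is a minimal Python sketch, assuming configurations are stored as a dictionary from host name to a set of attribute names. The function names `covers` and `is_core`, the host names, and the sample attributes are all hypothetical and chosen only for illustration.

```python
def covers(hosts, attrs, conf):
    """True if each attribute in attrs is missing from at least one host."""
    return all(any(a not in conf[h] for h in hosts) for a in attrs)


def is_core(hosts, attrs, conf):
    """True if hosts covers attrs and no proper subset of hosts does."""
    hosts = set(hosts)
    if not covers(hosts, attrs, conf):
        return False
    # Adding hosts never breaks coverage, so it is enough to test the
    # subsets obtained by dropping a single host.
    return not any(covers(hosts - {h}, attrs, conf) for h in hosts)


# Hypothetical sample configurations (Conf: H -> C).
conf = {
    "h1": {"Windows", "IIS"},
    "h2": {"Linux", "Apache"},
    "h3": {"Windows", "Apache"},
}
all_attrs = set().union(*conf.values())  # the ideal case A' = A

print(is_core({"h1", "h2"}, all_attrs, conf))        # True
print(is_core({"h1", "h3"}, all_attrs, conf))        # False: both run Windows
print(is_core({"h1", "h2", "h3"}, all_attrs, conf))  # False: not minimal
```

Here `attrs` plays the role of A’, so passing the full attribute set corresponds to the ideal case.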

Host diversity
• The distribution of configurations affects the efficiency of this approach.
  • The distribution is skewed
• Study of the UCSD network
  • nmap tool
    • Port scans: detect open ports
    • OS fingerprinting: guess the OS from error messages
  • Total number of scanned devices: 11,963
  • 2,963 general-purpose hosts (port data + OS)
• Conservative assumptions
  • Same open port = same service running
  • Ignore OS versions

Top 10 operating systems and services
OS            %        Service             %
Windows       54.1%    netbios-ssn         55.3%
Solaris       10.1%    epmap               50.4%
Mac OS X      10.0%    microsoft-ds        39.0%
Linux         10.0%    sshd                30.7%
Mac OS         6.9%    sunrpc              25.3%
FreeBSD        2.2%    active directory    24.8%
IRIX           2.0%    smtp                19.4%
HP-UX          1.1%    httpd               18.0%
BSD/OS         0.9%    ftpd                17.8%
Tru64 Unix     0.7%    printer             15.6%

Configuration distribution
Distribution is indeed skewed: 50% of hosts comprise
• All: 20%
• Multiple: 15%
• Attributes in at least 100: 8%

Visualizing diversity
• Qualitative view
• More diversity across operating systems
• Still a fair amount of diversity for the same OS

Searching for cores
• What is the practical problem?
  • Determine replica sets
  • Our approach: find cores
• Computing a core of optimal size is NP-complete
  • Use heuristics
• Host as both client and server
  • Client: requests cores
  • Server: participates in cores
• Core
  • Host that requests it (original copy)
  • Replicas

Finding cores: basics
[Figure: four configurations, each a set of attributes (software systems), with the possible cores marked]
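To make "possible cores" concrete, here is a small exhaustive search in Python. It is only a sketch for tiny inputs: it tries host subsets in order of increasing size, which is exponential in the number of hosts and is the reason heuristics are needed in practice. The function name `smallest_cores` and the four sample configurations are hypothetical.

```python
from itertools import combinations


def smallest_cores(conf, attrs):
    """Return every minimum-size host set in which each attribute in attrs
    is missing from at least one host (exhaustive search)."""
    hosts = sorted(conf)
    for size in range(1, len(hosts) + 1):
        found = [
            set(combo)
            for combo in combinations(hosts, size)
            if all(any(a not in conf[h] for h in combo) for a in attrs)
        ]
        if found:
            # A minimum-size covering set has no smaller covering subset,
            # so each set returned here is a core.
            return found
    return []


# Hypothetical sample: four hosts, two operating systems.
conf = {
    "h1": {"Windows", "IIS"},
    "h2": {"Windows", "Apache"},
    "h3": {"Linux", "Apache"},
    "h4": {"Linux", "sshd"},
}
all_attrs = set().union(*conf.values())

for core in smallest_cores(conf, all_attrs):
    print(sorted(core))
# ['h1', 'h3']
# ['h1', 'h4']
# ['h2', 'h4']
```

Pairs that share an attribute, such as h2 and h3 (both run Apache), are rejected because an exploit on the shared attribute would take out both copies.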

Representing advertised configurations
• Container abstraction
• Containers (B)
  • One for each operating system in A
  • Each container b ∈ B has a set SB(b) of sub-containers, one for each non-OS attribute in A
• A host h advertises its configuration by associating itself with every sub-container s ∈ SB(b), where
  • b is the container for the OS of h
  • s is the sub-container in SB(b) for some attribute of h

Container abstraction
[Figure: containers and sub-containers holding the advertised host configurations]
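One way to picture the container abstraction is as a nested dictionary: OS name to sub-container name to the set of hosts advertised there. The Python sketch below assumes each configuration is given as an (OS, applications) pair. The name `build_containers` and the sample hosts are hypothetical, and sub-containers are created lazily, only for attributes that some host actually advertises.

```python
def build_containers(conf):
    """Build {os: {non_os_attribute: {hosts}}} from {host: (os, apps)}."""
    containers = {}
    for host, (os_name, apps) in conf.items():
        sub_containers = containers.setdefault(os_name, {})
        for app in apps:
            # The host associates itself with one sub-container per
            # non-OS attribute, inside the container of its own OS.
            sub_containers.setdefault(app, set()).add(host)
    return containers


# Hypothetical sample configurations.
conf = {
    "h1": ("Windows", {"IIS"}),
    "h2": ("Windows", {"IIS", "sshd"}),
    "h3": ("Linux", {"sshd"}),
}
containers = build_containers(conf)
print(sorted(containers))                    # ['Linux', 'Windows']
print(sorted(containers["Windows"]["IIS"]))  # ['h1', 'h2']
```

A host that runs several applications appears in several sub-containers of the same container, as h2 does here.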

Heuristics
• Random
  • Ignore configurations
  • Choose randomly a number n of hosts from H
• Uniform
  1. Different OS
     i. Choose a container b randomly
     ii. Choose a sub-container sb randomly from b
     iii. Choose a host randomly from sb
  2. Same OS (same b where h is placed)
     i. Choose a sub-container sb randomly from b
     ii. Choose a host randomly from sb
• Weighted: containers weighted by the number of hosts
• Doubly-weighted: sub-containers also weighted
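The selection steps of the uniform heuristic, and the container weighting used by the weighted variant, can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: containers are taken to be a nested dictionary of OS to sub-container to hosts, the names `hosts_in` and `pick_host` are hypothetical, and the sketch neither excludes the requesting host nor checks whether the chosen host improves coverage.

```python
import random


def hosts_in(container):
    """All hosts that appear in any sub-container of one container."""
    return set().union(*container.values())


def pick_host(containers, rng, different_from=None, same_as=None, weighted=False):
    """Choose a container, then a sub-container, then a host.

    same_as: stay in this OS container (the "same OS" step).
    different_from: skip this OS container (the "different OS" step).
    weighted: weight containers by their number of hosts.
    """
    if same_as is not None:
        b = same_as
    else:
        names = sorted(n for n in containers if n != different_from)
        weights = [len(hosts_in(containers[n])) if weighted else 1 for n in names]
        b = rng.choices(names, weights=weights)[0]
    sb = rng.choice(sorted(containers[b]))       # sub-container, uniformly
    return rng.choice(sorted(containers[b][sb]))  # host, uniformly


# Hypothetical sample containers.
containers = {
    "Windows": {"IIS": {"h1", "h2"}},
    "Linux": {"sshd": {"h3"}},
}
rng = random.Random(0)
print(pick_host(containers, rng, different_from="Windows"))  # h3, the only non-Windows host
```

Sorting before each random choice keeps the result reproducible for a given seed, since set iteration order is not guaranteed.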

Simulations
• Population: 2,963 general-purpose hosts
• One run: each host computes a core
• Questions
  • How much replication is needed?
  • How many other hosts does a particular host have to service?
  • How well do the chosen cores protect hosts?
• Metrics
  • Average core size (core size)
    • Core size averaged across all the hosts
  • Maximum load (load)
    • Maximum number of other hosts that any host services
  • Average coverage (coverage)
    • Coverage: percentage of attributes covered in a core
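The three metrics can be computed from one run as in the Python sketch below. It assumes that each core includes the requesting host, that load counts only service to other hosts, and that coverage is measured over the attributes of the requesting host's own configuration. That last point is an assumption made for this sketch, and `core_metrics` and the sample data are hypothetical.

```python
from collections import Counter


def core_metrics(cores, conf):
    """cores: {client: set of hosts in its core, client included}.
    conf: {host: set of attributes}.
    Returns (average core size, maximum load, average coverage)."""
    avg_size = sum(len(core) for core in cores.values()) / len(cores)

    # Load of a host = number of *other* hosts whose core it belongs to.
    load = Counter(
        h for client, core in cores.items() for h in core if h != client
    )
    max_load = max(load.values(), default=0)

    def coverage(client, core):
        attrs = conf[client]
        covered = [a for a in attrs if any(a not in conf[h] for h in core)]
        return len(covered) / len(attrs)

    avg_cov = sum(coverage(c, core) for c, core in cores.items()) / len(cores)
    return avg_size, max_load, avg_cov


# Hypothetical sample run with three hosts.
conf = {
    "h1": {"Windows", "IIS"},
    "h2": {"Linux", "sshd"},
    "h3": {"Windows", "sshd"},
}
cores = {"h1": {"h1", "h2"}, "h2": {"h2", "h1"}, "h3": {"h3", "h1"}}
avg_size, max_load, avg_cov = core_metrics(cores, conf)
print(avg_size, max_load, round(avg_cov, 3))  # 2.0 2 0.833
```

In this sample h3 is only half covered, because its single replica h1 also runs Windows.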

A sample run
Heuristic    Core size   Coverage   Load
Random       5           0.977      12
Uniform      2.56        0.9997     284
Weighted     2.64        0.9995     84
DWeighted    2.58        0.9997     91
• Random
  • Better load balance
  • Worse coverage
  • Worse core size
• Load is too high for other heuristics
• Proposed modification
  • Limit the load of each host
  • Intuition: force load balance
  • Each host services at most L other hosts
  • L = load limit or simply limit
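The load limit can be pictured as a filter applied before a replica is chosen. The Python sketch below assumes a dictionary that tracks how many other hosts each host already services. The name `choose_replica` and the sample values are hypothetical, and the real system may enforce the limit differently.

```python
import random


def choose_replica(candidates, load, limit, rng):
    """Pick a random candidate that services fewer than `limit` other hosts.

    Updates `load` in place and returns the chosen host, or None when
    every candidate has already reached the limit."""
    open_hosts = sorted(h for h in candidates if load.get(h, 0) < limit)
    if not open_hosts:
        return None
    chosen = rng.choice(open_hosts)
    load[chosen] = load.get(chosen, 0) + 1
    return chosen


# Hypothetical sample: two candidate hosts and a load limit L = 1.
rng = random.Random(0)
load = {}
first = choose_replica(["h1", "h2"], load, 1, rng)
second = choose_replica(["h1", "h2"], load, 1, rng)
third = choose_replica(["h1", "h2"], load, 1, rng)
print(sorted([first, second]), third)  # ['h1', 'h2'] None
```

With L = 1 the second request is forced onto the host that the first request did not use, which is the load-balancing effect the limit is meant to produce.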

Core size
• Random increases linearly with load
  • Intrinsic to the heuristic
• Other heuristics
  • Core size less than 3
  • For many hosts, one single replica

Coverage
• Lower bound on limit: 2
  • Dependent on the diversity
• Uniform: limit at least 3 to achieve 3 nines coverage
• Weighted: achieves 3 nines coverage for limit values of at least 2
• Random: core size at least 9 to achieve same coverage

Uncovered hosts
• Share of hosts that are not fully covered is small
• Uniform
  • Limit 3: slightly over 1%
  • Limit > 4: around 0.5%
• Weighted
  • Around 0.5%
• Random
  • Core size greater than 8 to achieve similar results

Load variance
• Downside of uniform
  • Worst variance
• Variance is similar for small values of limit
• Load limit forces better distribution

Summary of simulation results
• How many replicas are needed?
  • Around 1.3 on average
• How many other hosts does a particular host have to service?
  • Uniform: 3 for good coverage
  • Weighted: 2 for good coverage
• How well do the chosen cores protect hosts?
  • Uniform: coverage greater than 0.999, L ≥ 3
  • Weighted: coverage greater than 0.999, L ≥ 2
• Uniform heuristic
  • Simpler
• Weighted heuristics
  • Better load balance

Exploits on k attributes
• Illustrate with k=2
• A variant of uniform
  1. Client c chooses a host h with a different OS
  2. Find a core for c using uniform
  3. Find a core for h using uniform
  4. Combine the 2 cores to form a 2-resilient core
L     2-cov   1-cov   Core size
5     0.76    0.86    4.18
6     0.86    0.92    4.58
7     0.95    0.99    5.00
8     0.97    1.00    5.11
9     0.98    1.00    5.16
10    0.98    1.00    5.17
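The four steps for k=2 can be sketched in Python as below. The core-finding step is passed in as a callable so that any heuristic can be plugged in. The name `two_resilient_core`, the `find_core` parameter, and the fixed sample cores are hypothetical stand-ins rather than the authors' implementation.

```python
import random


def two_resilient_core(client, os_of, find_core, rng):
    """Combine a core for the client with a core for a host on another OS.

    os_of: {host: operating system name}.
    find_core: callable mapping a host to its core (a set of hosts)."""
    # Step 1: choose a host h whose OS differs from the client's.
    others = sorted(h for h in os_of if os_of[h] != os_of[client])
    h = rng.choice(others)
    # Steps 2-4: find both cores and take their union.
    return find_core(client) | find_core(h)


# Hypothetical sample: precomputed cores stand in for the uniform heuristic.
os_of = {"c": "Windows", "h": "Linux"}
fixed_cores = {"c": {"c", "h"}, "h": {"h", "x"}}
core = two_resilient_core("c", os_of, fixed_cores.__getitem__, random.Random(0))
print(sorted(core))  # ['c', 'h', 'x']
```

Taking the union explains why the core sizes in the table are roughly twice those of the single-exploit case.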

The Phoenix Recovery System
• Backup data on cores
  • Requirement: set of operating systems and applications is not known
• Macedon framework
  • Pastry DHT
• Advertising configurations
  • Container → Zone
  • Sub-container → Sub-zone
• OS hint lists
  • Empty zones
  • Doesn’t need to be accurate

Prototype evaluation
• On PlanetLab
• Total number of hosts: 63
  • 62 PlanetLab hosts
  • 1 UCSD host
• Configurations manually set
  • 63 randomly chosen out of the 2,963

Evaluation results
• Simulated attack
• Parameters
  • Backup file: 5 MB
  • L = 3
  • Interval between announcements: 120 s
• Target: Windows hosts (60%)
  • Caused hosts to crash almost simultaneously
• All hosts recovered
  • For 35: avg 100 s
  • For 3: several minutes (transient network failures)
        Core size       Coverage        Load var.
L       Imp.    Sim.    Imp.    Sim.    Imp.    Sim.
3       2.12    2.22    1       1       1.65    1.94
5       2.10    2.23    1       1       2.88    2.72
7       2.10    2.22    1       1       4.44    3.33
• Imp. = implementation
• Sim. = simulation

Conclusions
• Informed replication
  • Replica sets based on attributes
  • Internet catastrophes: software systems
• Survivable data at a low replication cost
  • Core size is less than 3 on average
  • Hosts service at most 3 other hosts
• Diversity study
  • Approach is realistic
• Side-effects of load limit scheme
  • Upper-bounds the amount of work any host has to accomplish
  • Constrains damage in case of individual malicious behavior

Future work
• Real deployment
  • Tune current prototype
  • Security features
  • Cope with real threats
• More data sets to determine diversity
• Mechanism to monitor resource usage
• Informed replication
  • With other approaches for cooperative backup
  • With other types of attributes
    • E.g. resource utilization