Why Recovery Should Be Free, And Often Can Be · 2018-09-05 · Why Recovery Should Be Free, And Often Can Be. Armando Fox, Stanford University. 1st Intl. Conference on Autonomic

Why Recovery Should Be Free,Why Recovery Should Be Free,And Often Can BeAnd Often Can Be

Armando Fox, Stanford UniversityArmando Fox, Stanford University

1st Intl. Conference on Autonomic Computing1st Intl. Conference on Autonomic ComputingNew York, May 18, 2004New York, May 18, 2004

© 2004 Armando Fox

2.5 years later, what is autonomic computing?2.5 years later, what is autonomic computing?

IBM Autonomic Manifesto says it’s the creation of systems that aIBM Autonomic Manifesto says it’s the creation of systems that arereselfself--configuring/upgradingconfiguring/upgrading

selfself--optimizing (performance)optimizing (performance)

selfself--healing (from failures)healing (from failures)

selfself--protecting (from attacks)protecting (from attacks)

Googlism.comGooglism.com says:says:(#1) autonomic computing is going to be the next big thing(#1) autonomic computing is going to be the next big thing

autonomic computing is not a productautonomic computing is not a product

autonomic computing is shipping nowautonomic computing is shipping now

autonomic computing is something hp does alreadyautonomic computing is something hp does already

But seriously:But seriously:autonomic computing is all about addressing complexityautonomic computing is all about addressing complexity

autonomic computing is essentially to minimize human supervisionautonomic computing is essentially to minimize human supervision

autonomic computing is indicative of both why and how the world autonomic computing is indicative of both why and how the world of computer of computer science is changingscience is changing

© 2004 Armando Fox

Practicing What We PreachPracticing What We Preach

““...our ability to analyze and predict the performance of the ...our ability to analyze and predict the performance of the enormously complex software systems that lies at the core of ourenormously complex software systems that lies at the core of oureconomy is painfully inadequate.” (economy is painfully inadequate.” (ChoudhuryChoudhury & & WeikumWeikum, 2000 , 2000 PITAC Report)PITAC Report)

Approaches to autonomic performance tuningApproaches to autonomic performance tuning

Hippodrome (HP), Microsoft SQL Server Hippodrome (HP), Microsoft SQL Server AutoAdminAutoAdmin, ..., ...

Performance debugging for distributed systems of black boxes (HPPerformance debugging for distributed systems of black boxes (HP))

Control theoretic modeling for system performance tuning (IBM)Control theoretic modeling for system performance tuning (IBM)

Statistical learning of measured lowStatistical learning of measured low--level performance parameters that level performance parameters that correlate with highcorrelate with high--level performance (HP)level performance (HP)

Intrusion detection using statistical anomalies (UNM/SRI)Intrusion detection using statistical anomalies (UNM/SRI)

1. Complex HW/SW systems ascollections of black boxes

1. Complex HW/SW systems as1. Complex HW/SW systems ascollections of black boxescollections of black boxes

© 2004 Armando Fox

““Collections of Black Boxes”Collections of Black Boxes”

Build model by measurement & analysisBuild model by measurement & analysisControl theory: model system as closedControl theory: model system as closed--loop feedbackloop feedback

Statistical analysis: correlate trends in desired macroscopic Statistical analysis: correlate trends in desired macroscopic parameters with trends in lowerparameters with trends in lower--level parameters for predictionlevel parameters for prediction

Structural model: Structural model:

Rely on Rely on external controlexternal control, using mechanisms that respect , using mechanisms that respect the black boxthe black box

“Increase the size of the DB connection pool” [“Increase the size of the DB connection pool” [HellersteinHellerstein et al]et al]

“Reallocate one or more whole machines” [“Reallocate one or more whole machines” [IBM FCRC paper]IBM FCRC paper]

“Reboot/rejuvenate one or more machines” [“Reboot/rejuvenate one or more machines” [TrivediTrivedi, Fox, others], Fox, others]

“Shoot one of the blocked “Shoot one of the blocked txnstxns” [everyone]” [everyone]

“Induce memory pressure on other apps” [“Induce memory pressure on other apps” [WaldspurgerWaldspurger et al]et al]

© 2004 Armando Fox

Example Example -- building models through measurementbuilding models through measurement

Idea: many known intrusions involve exploiting a Idea: many known intrusions involve exploiting a weakness by making “unusual” sequences of system calls weakness by making “unusual” sequences of system calls ((HofmeyrHofmeyr et al 1998)et al 1998)

Collect kCollect k--length sequences of system calls under “normal” length sequences of system calls under “normal” conditionsconditions

Look for call sequences that differ from “known normal” Look for call sequences that differ from “known normal” sequences by more than a threshold amountsequences by more than a threshold amount

Goal: Goal: intrusion detection with low false positive rate and nearintrusion detection with low false positive rate and near--zero false negative ratezero false negative rate

In practice, reducing false negatives increases false positivesIn practice, reducing false negatives increases false positives

© 2004 Armando Fox

Another exampleAnother example

Observation: under normal conditions, number of Observation: under normal conditions, number of observed observed code paths through an application is much code paths through an application is much smaller than number of smaller than number of possible possible paths (paths (KicimanKiciman 2003)2003)

Build probabilistic contextBuild probabilistic context--free grammar from observed pathsfree grammar from observed paths

Raise alarm when a path is observed corresponding to a very Raise alarm when a path is observed corresponding to a very lowlow--probability parse tree generated by the PCFGprobability parse tree generated by the PCFG

Raising recall (% of actual faults detected) from 40% to 80% Raising recall (% of actual faults detected) from 40% to 80% increases false positive rate from 0.1% to 1.8%increases false positive rate from 0.1% to 1.8%

2. Statistical techniques identify “interesting” features and relationships from large datasets,

but are prone to false positives

2. Statistical techniques identify “interesting” 2. Statistical techniques identify “interesting” features and relationships from large datasets, features and relationships from large datasets,

but are prone to false positivesbut are prone to false positives

© 2004 Armando Fox

So What’s Missing?So What’s Missing?

Separation of recovery/operations from diagnosis/repairSeparation of recovery/operations from diagnosis/repairPrevious examples rely on human to filter false positivesPrevious examples rely on human to filter false positives

But need to keep human “out of the loop” during normal But need to keep human “out of the loop” during normal operations (or else say goodbye to 5 nines)operations (or else say goodbye to 5 nines)

How should we survive false positives?How should we survive false positives?

Missing ingredient: “microMissing ingredient: “micro--recovery”recovery”

3. Make “micro-recovery” so inexpensive that occasional false positives don’t matter

3. Make “micro3. Make “micro--recovery” so inexpensive that recovery” so inexpensive that occasional false positives don’t matteroccasional false positives don’t matter

© 2004 Armando Fox

““MicroMicro--recovery” to survive false positivesrecovery” to survive false positives

““Salubrious”: returns some part of system to known stateSalubrious”: returns some part of system to known stateReclaim resources (memory, DB Reclaim resources (memory, DB connsconns, sockets, DHCP lease...), sockets, DHCP lease...)

Throw away corrupt transient stateThrow away corrupt transient state

Possibly setup to retry operation, if appropriatePossibly setup to retry operation, if appropriate

Safe: affects only performance, not correctnessSafe: affects only performance, not correctness

NonNon--disruptive: performance impact is “small”disruptive: performance impact is “small”

Predictable: impact and timePredictable: impact and time--toto--complete is stablecomplete is stable

Separates recovery from diagnosis/repairSeparates recovery from diagnosis/repair

4. Not really recovery, but continuous adaptation4. Not really recovery, but 4. Not really recovery, but continuous adaptationcontinuous adaptation

© 2004 Armando Fox

Summary so farSummary so far

1.1. Treat complex systems as collections of black boxesTreat complex systems as collections of black boxes

2.2. Statistical methods are good at identifying “interesting” Statistical methods are good at identifying “interesting” features from much larger collection, albeit with false features from much larger collection, albeit with false positivespositives

3.3. Tolerate false positives by making “microTolerate false positives by making “micro--recovery” very recovery” very inexpensiveinexpensive

4.4. Result: blur line between “normal” and “recovery”Result: blur line between “normal” and “recovery”

Goal is to provide “recovery management invariants” to automated statistical techniques, just as Transactions

provide semantic invariants to the programs that use them

Goal is to provide “recovery management invariants” to Goal is to provide “recovery management invariants” to automated statistical techniques, just as Transactions automated statistical techniques, just as Transactions

provide semantic invariants to the programs that use themprovide semantic invariants to the programs that use them

© 2004 Armando Fox

CrashCrash--Only Building BlocksOnly Building Blocks

Pinpoint: Anomalous code Pinpoint: Anomalous code paths and component paths and component interactions (Probabilistic interactions (Probabilistic contextcontext--free grammar)free grammar)

Modify Modify appserverappserverto to undeployundeploy/ / redeploy redeploy EJB’sEJB’s and and stall pending stall pending reqsreqs

MicrorebootsMicroreboots of of EJB’sEJB’s

JAGR (J2EE JAGR (J2EE application application server) server) [AMS [AMS 2003 & in prep.]2003 & in prep.]

Time series of state and activity Time series of state and activity (Tarzan)(Tarzan)

QuorumQuorum--like like redundancy and redundancy and predictable repair; predictable repair; relaxed relaxed consistencyconsistency

WholeWhole--node node reboot (need reboot (need not preserve not preserve state)state)

DStoreDStore(persistent (persistent hashtablehashtable) ) [in [in preparation]preparation]

Time series of state and activity Time series of state and activity (Tarzan)(Tarzan)

QuorumQuorum--like like redundancy; redundancy; relaxed relaxed consistencyconsistency

WholeWhole--node fast node fast reboot (doesn’t reboot (doesn’t preserve state)preserve state)

SSM (diskless SSM (diskless session state session state store) store) [NSDI [NSDI 04]04]

Statistical monitoringStatistical monitoringHow realizedHow realizedControl pointControl pointSubsystemSubsystem

• Control points are safe, predictable, non-disruptive

• Crash-only design: shutdown=crash, recover=restart

• Makes state-management subsystems as easy to manage as stateless Web servers

© 2004 Armando Fox

Optimizing for Specialized State TypesOptimizing for Specialized State Types

Two singleTwo single--key (“Berkeley DB”) get/set state storeskey (“Berkeley DB”) get/set state storesTypical applications of “clusterTypical applications of “cluster--based hash tables”: user session based hash tables”: user session state, application workflow state, persistent user profiles, state, application workflow state, persistent user profiles, merchandise catalogs, ...merchandise catalogs, ...

Replication to a set of N bricks provides durabilityReplication to a set of N bricks provides durabilityWrite to subset, wait for subset, remember subsetWrite to subset, wait for subset, remember subset

DStoreDStore: state persists “forever” as long as : state persists “forever” as long as N/2N/2 bricks survivebricks survive

SSM: If client loses cookie, state is lost; otherwise, persists SSM: If client loses cookie, state is lost; otherwise, persists for for time time t t with probability with probability p, p, where where t, p t, p = F(N, node MTBF)= F(N, node MTBF)

Recovery==restart, takes seconds or lessRecovery==restart, takes seconds or lessEfficacy doesn’t depend on whether replica is behaving correctlyEfficacy doesn’t depend on whether replica is behaving correctly

SSM: node state SSM: node state not preserved not preserved (in(in--memory only)memory only)

DStoreDStore: node state preserved, read: node state preserved, read--repair fixesrepair fixes

© 2004 Armando Fox

Managing Managing DStoreDStore and SSMand SSM

Rebooting is the only control mechanismRebooting is the only control mechanismHas predictable effect and takes predictable time, regardless ofHas predictable effect and takes predictable time, regardless ofwhat the process is doingwhat the process is doing

•• Like kill Like kill --9, “turning off” a VM, or pulling power cord9, “turning off” a VM, or pulling power cord

Intuition: the “infrastructure” supporting the power switch is Intuition: the “infrastructure” supporting the power switch is simpler than the applications using itsimpler than the applications using it

Due to slight Due to slight overprovisioningoverprovisioning inherent in replication, rebooting inherent in replication, rebooting has minimal effect on throughput & latencyhas minimal effect on throughput & latency

Activity and state statistics collected per brick every Activity and state statistics collected per brick every second; any deviation => reboot bricksecond; any deviation => reboot brick

Makes it as easy as managing a stateless server farmMakes it as easy as managing a stateless server farm

© 2004 Armando Fox

Detecting “Anomalous” ConditionsDetecting “Anomalous” Conditions

9 metrics collected per brick every second9 metrics collected per brick every secondNumDroppedNumDropped, , NumWriteProcessedNumWriteProcessed, , NumReadProcessedNumReadProcessed,,TimeIntervalTimeInterval, , FreeMemoryFreeMemory, , NumElementsNumElements, , InboxSizeInboxSize, , NumRequestsHandledNumRequestsHandled, , MemoryUsedMemoryUsed

““ActivityActivity” statistics capture a notion of “forward progress”” statistics capture a notion of “forward progress”

“State” statistics capture resource utilization under “normal” “State” statistics capture resource utilization under “normal” circumstancescircumstances

Metrics compared against those of “peer” bricksMetrics compared against those of “peer” bricksBasic idea: Changes in workload tend to affect all bricks equallBasic idea: Changes in workload tend to affect all bricks equallyy

Underlying (weak) assumption: “Most bricks are doing mostly the Underlying (weak) assumption: “Most bricks are doing mostly the right thing most of the time”right thing most of the time”

Anomaly in 6 or more (out of 9) metrics => reboot brickAnomaly in 6 or more (out of 9) metrics => reboot brick

© 2004 Armando Fox

Detection & recovery in SSMDetection & recovery in SSM

““Activity” statistics compared against other bricks Activity” statistics compared against other bricks (absolute median deviation)(absolute median deviation)

“State” statistics use simple time“State” statistics use simple time--series analysis (Tarzan)series analysis (Tarzan)keep Nkeep N--length time series, length time series, discretizediscretize each data pointeach data point

count relative frequencies of all substrings of length count relative frequencies of all substrings of length k k or shorteror shorter

Works even when period is irregular or not known Works even when period is irregular or not known a prioria priori

Note! We are not Note! We are not SLT/ML researchers!SLT/ML researchers!

Goal is to Goal is to enable enable those those techniques to be techniques to be brought to bearbrought to bear

© 2004 Armando Fox

Detection & recovery in Detection & recovery in DStoreDStore

Metrics and algorithm comparable to Metrics and algorithm comparable to those used in SSMthose used in SSM

We inject “failWe inject “fail--stutter” behavior by stutter” behavior by increasing request latencyincreasing request latency

Bottom case: more aggressive Bottom case: more aggressive detection also results in 2 detection also results in 2 “unnecessary” reboots“unnecessary” reboots

But they don’t matter muchBut they don’t matter much

Currently some voodoo constants for Currently some voodoo constants for thresholds in both SSM and thresholds in both SSM and DStoreDStore

TradeTrade--off of fast detection vs. false off of fast detection vs. false positivespositives

© 2004 Armando Fox

What faults does this handle?What faults does this handle?

Substantially all nonSubstantially all non--Byzantine faults we injected:Byzantine faults we injected:Memory Memory bitflipsbitflips in code, data, and checksums (=> crash)in code, data, and checksums (=> crash)

hang/timeout/freezehang/timeout/freeze

FailFail--stutter: Network loss (drop up to 70% of packets randomly)stutter: Network loss (drop up to 70% of packets randomly)

Periodic slowdown (Periodic slowdown (egeg from garbage collection)from garbage collection)

Persistent slowdown (one node lags the others)Persistent slowdown (one node lags the others)

All anomalies are “coerced” to crash faults All anomalies are “coerced” to crash faults If that turned out to be the wrong thing, it didn’t cost you mucIf that turned out to be the wrong thing, it didn’t cost you much h to try itto try it

Human notified after threshold number of restartsHuman notified after threshold number of restarts

These systems are “always recovering”These systems are “always recovering”

© 2004 Armando Fox

Casting repartitioning as recoveryCasting repartitioning as recovery

Split key spaceSplit key space

Take replica offline and Take replica offline and clone itclone it

existing repair mechanisms existing repair mechanisms used for “recovery” afterwardused for “recovery” afterward

“RISC“RISC--like” moral like” moral --expressing other expressing other maintenance operations in maintenance operations in terms of failure/recoveryterms of failure/recovery

00 10 00 10 00 1000 1010

0 5 10 15 20 25 30 35Time (minutes)

0

50Repairs

0

50Repairs0K

3K

6K

GE

T re

q/se

c

0K

3K

6K

GE

T re

q/se

c

1. brick offline2. data copy3. bricks online

1. brick offline2. data copy3. bricks online

© 2004 Armando Fox

Pinpoint: Anomalous Path DetectionPinpoint: Anomalous Path Detection

Capture paths through Capture paths through EJB’sEJB’s as dynamic call trees (intraas dynamic call trees (intra--method calls hidden)method calls hidden)

Build probabilistic contextBuild probabilistic context--free grammar from thesefree grammar from these

Detect trees that correspond to very low probability Detect trees that correspond to very low probability parsesparses

Component interaction analysisComponent interaction analysiscurrently finds 55currently finds 55--75% of 75% of failures.failures.

Path shape analysis detects Path shape analysis detects >90% of failures; but correctly>90% of failures; but correctlydiagnoses fewer.diagnoses fewer.

Across all expts:80% detection rate with 1.8% FP rate

Across 92% of expts:40% detection rate with 0.2% FP rate

False positive rate

Det

ectio

n ra

te

© 2004 Armando Fox

JAGR: JAGR: JBossJBoss with Microwith Micro--rebootsreboots

performabilityperformability of of RUBiSRUBiS ((goodputgoodput/sec vs. time)/sec vs. time)vanilla vanilla JBossJBoss w/manual restarting of appw/manual restarting of app--server, vs. server, vs. JAGR w/automatic recovery and microJAGR w/automatic recovery and micro--rebootingrebooting

JAGR/JAGR/RUBiSRUBiS does 78% better than does 78% better than JBoss/RUBiSJBoss/RUBiS

Maintains 20 Maintains 20 reqreq/sec, even in the face of faults/sec, even in the face of faults

Lower steadyLower steady--state after recovery in first graph: class reloading, state after recovery in first graph: class reloading, recompiling, etc., which is not necessary with microrecompiling, etc., which is not necessary with micro--rebootsreboots

Also used to fix memory leaks without rebooting whole Also used to fix memory leaks without rebooting whole appserverappserver

© 2004 Armando Fox

A General Architecture for SLT/MLA General Architecture for SLT/MLChallenges:Challenges:

online SLT algorithmsonline SLT algorithms

Data collection without Data collection without perturbing systemperturbing system

Data storage and Data storage and management for modelsmanagement for models

Wily attackers who can Wily attackers who can game the algorithmsgame the algorithms

MultiMulti--level learning and level learning and multimulti--timescale learningtimescale learning

Stability and “autopilot” Stability and “autopilot” issues due to model updateissues due to model update

Recovery synthesis

Client requests

Responses

Datacenter boundary

Collection

Short-termstore

Long-termstore

Onlinealgo.

Onlinealgo.

Observations fromother datacenters

Offlinealgo.

Offlinealgo.

Recovery actions toother datacenters

Observations toother datacenters

Application component

Application server

Programmable network elements provide extension of approach to Programmable network elements provide extension of approach to other layersother layers

© 2004 Armando Fox

Autonomic & Technology TrendsAutonomic & Technology Trends

CPU speed increases slowing down, need more explicit parallelismCPU speed increases slowing down, need more explicit parallelism

Use extra CPU to collect and locally analyze data; exploit tempoUse extra CPU to collect and locally analyze data; exploit temporal ral localitylocality

Disk space is free (though bandwidth and disasterDisk space is free (though bandwidth and disaster--recovery aren’t)recovery aren’t)

egeg A9.com keeps 10’s of TB of scanned pagesA9.com keeps 10’s of TB of scanned pages

Can keep history of several models for regression analysis, trenCan keep history of several models for regression analysis, trending, etc.ding, etc.

VM’sVM’s being used as unit of software distributionbeing used as unit of software distribution

Fault isolationFault isolation

Opportunity for Opportunity for nonintrusivenonintrusive observationobservation

Action that is independent of the hosted appAction that is independent of the hosted app

This approach scales down as well as upThis approach scales down as well as up

© 2004 Armando Fox

ConclusionConclusion

The real reason to reduce MTTRis to tolerate false positives: recovery → adaptation

The The real real reason to reduce MTTRreason to reduce MTTRis to tolerate false positives: is to tolerate false positives: recovery recovery →→ adaptationadaptation

Toward “new science” in autonomic computingToward “new science” in autonomic computing“...Ultimately, these aspects [of autonomic systems] will be “...Ultimately, these aspects [of autonomic systems] will be emrgentemrgent properties of a general architecture, and distinctions will properties of a general architecture, and distinctions will blur into a more general notion of selfblur into a more general notion of self--maintenance.” (maintenance.” (The The Vision of Autonomic ComputingVision of Autonomic Computing))

Wanted...Wanted...“autonomic computing is not going to be as successful as it coul“autonomic computing is not going to be as successful as it could be if there is no d be if there is no standard” (standard” (GooglismGooglism))

Generic and open architectures for pervasive integration of SLTGeneric and open architectures for pervasive integration of SLT

Share data! (Success stories with Share data! (Success stories with EbayEbay, , TellmeTellme, Amazon), Amazon)

Engage mathematiciansEngage mathematicians

© 2004 Armando Fox

Thank you...Thank you...

“The Internet is the first computational artifact “The Internet is the first computational artifact that we must approach with humility, the same that we must approach with humility, the same way that physicists approach the Universe, or way that physicists approach the Universe, or

biologists approach the cell.”biologists approach the cell.”Christos Papadimitriou, UC BerkeleyChristos Papadimitriou, UC Berkeley

Documents

Why Recovery Should Be Free, And Often Can Be · 2018-09-05 · Why Recovery Should Be Free, And Often Can Be. Armando Fox, Stanford University. 1st Intl. Conference on Autonomic