Image: xkcd.com
Dependable Cloud Architecture
@mikewo
Mike Wood
http://mvwood.com
“Failure is always an option.”
Image: Discovery Channel, Fair Use
Protection From:
What are we looking for?
Check out: http://bit.ly/wazbizcont
Images: Office ClipArt & Godzilla Releasing Corp (Fair Use)
• Hardware Failure
• Data Corruption
• Network Failure
• Loss of Facilities
Image: FOX, Fair Use
Human Error
What we’re trying to achieve
1. Monitoring
2. Resilient Solutions
Image: Cohdra
Image: Office ClipArt
Cost vs Risk
99.999% $1, … ,000.00
To get more 9’s here add more 0’s here.
Image: NASA
Monitoring
Functional Transparency
Image: Office ClipArt
Logging Messages
Hardware Health
Dependent Services Health
Telemetry
Image: NASA
Analyze your Data
Resilience
Image: Office ClipArt
Remember: Failure is always an option.
Common Points of Failure
• Machine/application crashes
• Throttling (exceeding capacity)
• Connectivity/Network
• External service dependencies
Focus less on the uptime of the hardware and more on how the solution responds WHEN something fails!
Try/catch != Resilient
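// Requires: using System; using System.Diagnostics; using System.IO;
private void createFile() {
    string fileName = @"c:\workingDirectory\someFileName.txt";
    try {
        // File.Create returns an open FileStream; dispose it to release the handle.
        using (File.Create(fileName)) { }
    }
    catch (DirectoryNotFoundException ex) {
        // Logging and rethrowing reports the failure but does not recover from it.
        Trace.WriteLine(String.Format("Unable to create {0}. {1}", fileName, ex));
        throw;
    }
}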
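The snippet above catches, logs, and rethrows, so the caller still sees the failure. A minimal sketch of one more resilient variant follows. It assumes the missing directory is safe to create on demand, and the names FileCreationSketch and CreateFileResilient are hypothetical, not from the talk.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class FileCreationSketch
{
    // Hypothetical helper: creates the file and recovers from a missing
    // directory instead of only logging and rethrowing.
    public static void CreateFileResilient(string fileName)
    {
        try
        {
            // File.Create returns an open FileStream, so dispose it right away.
            using (File.Create(fileName)) { }
        }
        catch (DirectoryNotFoundException ex)
        {
            Trace.WriteLine(String.Format(
                "Directory missing for {0}, creating it. {1}", fileName, ex.Message));

            // Assumption: creating the directory on demand is acceptable here.
            string directory = Path.GetDirectoryName(Path.GetFullPath(fileName));
            Directory.CreateDirectory(directory);

            // Second attempt; if this also fails, the exception reaches the caller.
            using (File.Create(fileName)) { }
        }
    }
}
```

Whether recovery means creating the directory, writing somewhere else, or queuing the work for later depends on the solution. The point is that the catch block does something about the failure.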
Image: Michael Wood
Decompose your system…
Capacity Buffering
Content Delivery Networks (CDNs)
Distributed Application Cache
Local Content Cache
Enables recovery during outages or
spikes in load
Image: jepler
Always carry a spare
75% Capacity, half of our load
75% Capacity, half of our load
50% more capacity than needed
• Can absorb temporary spikes
• Time to react if we need to add capacity
100% of load, 150% Capacity
0% Capacity, redirect all load
Over-allocated, but still functioning
• Degrade, but don’t fail
SYSTEM FAILURE!!!
Image: Kevin Rosseel
Request Buffering
Image: Joe Shlabotnik
Queues
Retry Policies
Async Workloads
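As an illustration of a retry policy, here is a minimal sketch of retrying with exponential backoff. RetrySketch and Execute are hypothetical names, not an API from the talk or from any library, and the sketch treats every exception as transient, which real code should not do.

```csharp
using System;
using System.Threading;

public static class RetrySketch
{
    // Hypothetical helper: runs an operation, retrying with exponential backoff.
    public static T Execute<T>(Func<T> operation, int maxAttempts, TimeSpan initialDelay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Assumption: every exception is transient. Real code should
                // filter for the specific errors that are worth retrying.
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2); // double the wait each time
            }
        }
    }
}
```

A call such as `RetrySketch.Execute(() => CallRemoteService(), 5, TimeSpan.FromMilliseconds(200))`, where CallRemoteService stands in for any operation that may fail transiently, makes up to five attempts. On the last attempt the exception filter no longer matches, so the failure propagates to the caller.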
Dept. of Redundancy Dept.
Have a backup, somewhere else
More than one? Cost to benefit ratio?
Ready State
Hot = full capacity
Warm = scaled down, but ready to grow
Cold = mothballed, starts from zero
Image: Mr. White
Redundancy - It’s about probability
95% uptime 95% uptime 95% uptime 95% uptime
1 box : 5% downtime or 438 hrs per year (that’s 18 ¼ days!)
2 boxes : 5/100 * 5/100 = 25/10,000 = 0.25% downtime or 22 hrs per year
4 boxes : 5/100 * 5/100 * 5/100 * 5/100 = 625/100,000,000 = 0.000625% downtime or 3.285 MINUTES per year
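The downtime figures for 1, 2, and 4 boxes can be worked through as follows. This assumes the boxes fail independently and that the system is down only when every box is down at once, with per-box downtime p = 0.05 and 8,760 hours in a year.

```latex
\begin{align*}
P(\text{all } n \text{ boxes down}) &= p^{n}, \qquad p = 0.05 \\
n = 1: \quad & 0.05 \times 8760\ \text{h} = 438\ \text{h} = 18.25\ \text{days} \\
n = 2: \quad & 0.05^{2} = 0.0025, \qquad 0.0025 \times 8760\ \text{h} = 21.9\ \text{h} \\
n = 4: \quad & 0.05^{4} = 6.25 \times 10^{-6}, \qquad 6.25 \times 10^{-6} \times 8760\ \text{h} = 0.05475\ \text{h} = 3.285\ \text{min}
\end{align*}
```

Correlated failures, such as shared power, a shared network, or a bad deployment pushed to every box, break the independence assumption and make the real figure worse.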
Total Outage Duration =
Time to Detect
+ Time to Diagnose
+ Time to Decide
+ Time to Act
Image: Office ClipArt
Dynamic Addressing & Configuration
What about your data?
Image: barrymieny
Availability via Degradation
Image: Michael Wood
Images: Gizmodo
Virtualization and Automation
Images: Orion Pictures owns Terminator Franchise
The “HI” Point
Check out: http://bit.ly/wazinternals
Images: Office Clip Art
Image: NASA
“Don't be too proud of this technological terror you've constructed…”
ADMIT:
• Your solution WILL fail at some point
• You can learn from others’ failures as well as your own
DO:
• Root cause analysis
• Read others’ root cause analyses
• Plan for failure
DON’T:
• Get cocky
• Stick your head in the sand
Images: LucasFilm, Fair Use
Questions
@mikewo
Mike Wood
http://mvwood.com
http://bit.ly/CloudFailSafe
Thanks