
System Availability Talk


DESCRIPTION

Talk I gave on high availability (HA), resiliency and recovery of systems


Page 1: System Availability Talk

Michael Richardson (Twitter: @Mr_SPB)

© 2011 Energized Work - www.energizedwork.com

Availability and Recoverability

Page 2: System Availability Talk

So what is High Availability?

• Five 9s?
• No single point of failure?
• Multiple data centres?
• Fault tolerance?
• Load balancing?
• Uptime?

© 2012 Energized Work - www.energizedwork.com

Page 3: System Availability Talk

The 9’s of Availability


Page 4: System Availability Talk

The 9’s of Availability


Availability             Downtime per Year
One nine (90%)           36.5 days
Two nines (99%)          3.65 days
Three nines (99.9%)      8.76 hours
Four nines (99.99%)      52.56 minutes
Five nines (99.999%)     5.26 minutes
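
The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function names here are illustrative, not from the talk):

```python
# Downtime per year implied by an availability percentage.
# Assumes a 365-day year, which matches the figures in the table.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed at this availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


def describe(hours: float) -> str:
    """Format a duration in hours using the largest convenient unit."""
    if hours >= 24:
        return f"{hours / 24:.2f} days"
    if hours >= 1:
        return f"{hours:.2f} hours"
    return f"{hours * 60:.2f} minutes"


for pct in (90, 99, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {describe(downtime_hours_per_year(pct))}")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the later rows are so much harder to reach than the early ones.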

Page 5: System Availability Talk

Problem with the 9’s


• What do they mean?
• Guaranteed or just an SLA?
• Multiplicity: availabilities of chained components multiply (99.9% * 99.9% * 99.9% ≈ 99.7%)
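
The multiplication can be checked with a few lines of code. This is a sketch that assumes every component must be up for the service to be up and that failures are independent; `serial_availability` is an illustrative name:

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up.

    Each argument is a fraction (0.999 means 99.9%). Assumes the
    components fail independently of each other.
    """
    return prod(availabilities)


combined = serial_availability(0.999, 0.999, 0.999)
print(f"three 99.9% components in series: {combined:.2%}")
```

The longer the chain, the lower the combined figure, so a per-component SLA says little about the availability of the whole service.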

Page 6: System Availability Talk

SLA availability numbers just aim to provide a level of confidence in a website’s service.


Page 7: System Availability Talk

No Single Point of Failure (SPOF)


Page 8: System Availability Talk

Two of everything?


Page 9: System Availability Talk

Start with this


[Diagram: Users -> Index.html]

Page 10: System Availability Talk

End with this


[Diagram labels: Users; Firewall 1, Firewall 2; switch 1, switch 2; WEB1, WEB2; APP1, APP2; DB1, DB2]

Page 11: System Availability Talk

Problems with eliminating SPOF

• It’s expensive ££
• Where do you draw the line?
• Are failures independent?
• Can you guarantee no SPOF?
• Increased complexity
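
To see why "two of everything" is attractive and why the independence question matters, here is a rough sketch of the usual redundancy arithmetic. The 99% per-box figure and the five-tier chain (firewall, switch, web, app, db) are assumptions made for illustration only:

```python
def redundant_availability(single: float, copies: int = 2) -> float:
    """Availability of `copies` redundant components where one survivor is enough.

    Assumes the copies fail completely independently. Shared power,
    shared software bugs or a shared data centre break that assumption.
    """
    return 1 - (1 - single) ** copies


# Hypothetical figure: every individual box is 99% available.
single_box = 0.99
pair = redundant_availability(single_box)

# Assumed layout: five tiers in series, each built as a redundant pair.
tiers = 5
print(f"one box:           {single_box:.4%}")
print(f"redundant pair:    {pair:.4%}")
print(f"five single boxes: {single_box ** tiers:.4%}")
print(f"five paired tiers: {pair ** tiers:.4%}")
```

The gain only holds while failures really are independent. Any shared cause puts a ceiling on the result that no amount of duplication removes.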

Page 12: System Availability Talk

Problem: Data Centres Fail


Page 13: System Availability Talk

Solution: Get a 2nd Data Centre


Page 14: System Availability Talk

Hot/Hot Multisite


• Full range of services available in multiple locations.
• Easy to automate failover of sites
• Data consistency is hard.
• Capacity planning concerns

Page 15: System Availability Talk

Hot/Warm Multisite


• Simpler than Hot/Hot
• Read/write ratio dependent
• Replicate data synchronously or asynchronously?

Page 16: System Availability Talk

Hot/Cold Multisite


• Easy to set up
• Will it work?
• Can it be trusted?
• Cold site rapidly becomes stale
• Is it actually valuable?

Page 17: System Availability Talk

DR Multisite


• Fingers crossed you never need it.
• How can/should you test it?
• Cloud?

Page 18: System Availability Talk

Problems with Multiple sites


• ££ - it’s expensive
• Managing more systems
• Managing consistency of data
• Managing capacity
• Is it still fail proof?
• Unless you test it, it’s just a plan

Page 19: System Availability Talk


We now have a Complex System

Page 20: System Availability Talk

Complex Systems

• More redundancy and automation leads to more complexity.
• More complexity often adds more points of failure.

Page 21: System Availability Talk

“How Complex Systems Fail”

Author: Dr. Richard Cook

• Catastrophe is always just around the corner.
• Human operators have dual roles.
• Change introduces new forms of failure.

Page 22: System Availability Talk

Failure and Recovery


Page 23: System Availability Talk

Questions for the Customer


• What is the cost of downtime?

• What are the RTO and RPO?

Page 24: System Availability Talk


RTO = Recovery Time Objective (how long recovery may take)

RPO = Recovery Point Objective (how much recent data may be lost)

Page 25: System Availability Talk

Aggressive RTO and RPO targets are expensive and have a performance impact.


Page 26: System Availability Talk

RTO / RPO example


Problem

• Simple DB
• Business can tolerate up to 15 minutes of downtime (RTO)
• 10-minute window of data loss (RPO)

Page 27: System Availability Talk

RTO / RPO example


Possible solution

1. Continuously replicate data to a 2nd host
2. Continue with nightly backups and also copy DB transaction logs from the primary host to another system.
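
One way to reason about such options is to compare each one's worst case against the two objectives. The sketch below uses the 15-minute RTO and 10-minute RPO from the example. The per-option numbers, the log-shipping intervals and the names `RecoveryOption` and `meets_objectives` are all assumptions made up for illustration, not measurements or figures from the talk:

```python
from dataclasses import dataclass

RTO_MINUTES = 15  # from the example: tolerable downtime
RPO_MINUTES = 10  # from the example: tolerable window of data loss


@dataclass
class RecoveryOption:
    name: str
    worst_case_data_loss_min: float  # assumed value, not from the talk
    worst_case_recovery_min: float   # assumed value, not from the talk


def meets_objectives(option: RecoveryOption,
                     rto: float = RTO_MINUTES,
                     rpo: float = RPO_MINUTES) -> bool:
    """True if the option's worst case fits inside both objectives."""
    return (option.worst_case_recovery_min <= rto
            and option.worst_case_data_loss_min <= rpo)


# Hypothetical worst cases for each approach.
options = [
    RecoveryOption("continuous replication to 2nd host", 1, 5),
    RecoveryOption("nightly backup + logs shipped every 5 min", 5, 14),
    RecoveryOption("nightly backup + logs shipped every 30 min", 30, 14),
]

for option in options:
    verdict = "meets" if meets_objectives(option) else "misses"
    print(f"{option.name}: {verdict} RTO/RPO")
```

With log shipping, the shipping interval sets the worst-case data loss, so it has to be chosen from the RPO rather than picked for convenience.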

Page 28: System Availability Talk

So what’s more important?


Increasing Availability

Or

Reducing Recovery Time

Page 29: System Availability Talk


MTBF

Or

MTTR

What about MTTD?
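
The trade-off can be made concrete with the common steady-state approximation availability = MTBF / (MTBF + MTTR), taking MTBF as mean time between failures and MTTR as mean time to recover. The hour values below are hypothetical, and treating detection time (MTTD) as an extra slice of each outage is an assumption of this sketch:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and to recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: a failure every 1000 hours, 2 hours to recover.
baseline = availability(1000, 2)
doubled_mtbf = availability(2000, 2)  # fail half as often
halved_mttr = availability(1000, 1)   # recover twice as fast

print(f"baseline:     {baseline:.4%}")
print(f"doubled MTBF: {doubled_mtbf:.4%}")
print(f"halved MTTR:  {halved_mttr:.4%}")

# Assumption: the outage clock starts at the failure, not at the alert,
# so slow detection (MTTD) lengthens the effective recovery time.
mttd_hours = 0.5
with_detection = availability(1000, 2 + mttd_hours)
print(f"baseline + 30 min detection delay: {with_detection:.4%}")
```

Under this formula, halving MTTR buys the same availability as doubling MTBF, which is one reason the answer to "which matters more" depends on which of the two is cheaper to change.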

Page 30: System Availability Talk


Answer?

It Depends

Page 31: System Availability Talk


Failure is inevitable

Page 32: System Availability Talk


Ask anyone

Page 33: System Availability Talk


Thank you

The End

Twitter - @Mr_SPB