Cloudy with a Chance of Scaling: A Guide to High Availability in the Cloud

Lee Atchison, Principal Cloud Architect and Advocate at New Relic, Inc.

©2008-16 New Relic, Inc. All rights reserved.

Velocity - cloudy with a chance of scaling


Page 2: Velocity - cloudy with a chance of scaling

Safe Harbor

This document and the information herein (including any information that may be incorporated by reference) is provided for informational purposes only and should not be construed as an offer, commitment, promise or obligation on behalf of New Relic, Inc. (“New Relic”) to sell securities or deliver any product, material, code, functionality, or other feature. Any information provided hereby is proprietary to New Relic and may not be replicated or disclosed without New Relic’s express written permission.

Such information may contain forward-looking statements within the meaning of federal securities laws. Any statement that is not a historical fact or refers to expectations, projections, future plans, objectives, estimates, goals, or other characterizations of future events is a forward-looking statement. These forward-looking statements can often be identified as such because the context of the statement will include words such as “believes,” “anticipates,” “expects,” or words of similar import.

Actual results may differ materially from those expressed in these forward-looking statements, which speak only as of the date hereof, and are subject to change at any time without notice. Existing and prospective investors, customers and other third parties transacting business with New Relic are cautioned not to place undue reliance on this forward-looking information. The achievement or success of the matters covered by such forward-looking statements are based on New Relic’s current assumptions, expectations, and beliefs and are subject to substantial risks, uncertainties, assumptions, and changes in circumstances that may cause the actual results, performance, or achievements to differ materially from those expressed or implied in any forward-looking statement. Further information on factors that could affect such forward-looking statements is included in the filings we make with the SEC from time to time. Copies of these documents may be obtained by visiting New Relic’s Investor Relations website at http://ir.newrelic.com or the SEC’s website at www.sec.gov.

New Relic assumes no obligation and does not intend to update these forward-looking statements, except as required by law. New Relic makes no warranties, expressed or implied, in this document or otherwise, with respect to the information provided.

Page 3: Velocity - cloudy with a chance of scaling

Who am I?

Lee Atchison
Principal Cloud Architect and Advocate

Specialize in:
• Cloud computing
• Services & Microservices
• Scalability, Availability

29 years in industry
• 7 in Amazon Retail & AWS (Built SW/VG AppStore, AWS Elastic Beanstalk)
• 4 in New Relic (Architecture Lead, Cloud, Service Migration)

@leeatchison leeatchison


Page 5: Velocity - cloudy with a chance of scaling

I want to tell you a story…

This was a recently overheard conversation…

You tell me if this is ok or not…


Page 7: Velocity - cloudy with a chance of scaling

Is this ok?

“We were wondering how changing a setting on our MySQL database might impact our performance…

… but we were worried that the change may cause our production database to fail…”


Page 9: Velocity - cloudy with a chance of scaling

Is this ok?

“… Since we didn’t want to bring down production, we decided to make the change to our backup (replica, hot standby) database instead…

… After all, it wasn’t being used for anything at the moment.”

[Image: “Under Construction” sign]


Page 11: Velocity - cloudy with a chance of scaling

Is this ok?

Until, of course, the backup was needed…

This was a true story.

[Image: “Under Construction” sign; X marks]

Page 12: Velocity - cloudy with a chance of scaling

I fly radio-controlled model airplanes

There’s an old adage:

“Keep your plane at least two mistakes high.”

Page 13: Velocity - cloudy with a chance of scaling

“Keep your plane at least two mistakes high.”

But why?


Page 17: Velocity - cloudy with a chance of scaling

Why Two Mistakes High?

You perform some stunt, and it fails… You lose altitude.

Now you are lower, and you are trying to recover.

You still want to be high enough that if you make another mistake, you won’t crash.

You always want to be high enough to make a mistake, even if you’ve just made a mistake…

Page 18: Velocity - cloudy with a chance of scaling

Put another way…

…flying two mistakes high, you always have a backup plan for recovering from a mistake…

… even if you are currently recovering from a mistake

Page 19: Velocity - cloudy with a chance of scaling


Don’t screw up...

…while you are screwing up

Page 20: Velocity - cloudy with a chance of scaling

The same applies when building highly available, high-scale applications


Page 23: Velocity - cloudy with a chance of scaling

How do we keep “Two Mistakes High” in an application?

Walk through ramifications and recovery plan

Make sure recovery plan works
• Has no mistakes
• Has its own recovery plan

If recovery plan doesn’t work… it’s not a good recovery plan


Page 25: Velocity - cloudy with a chance of scaling

EXAMPLE: How many nodes do we need?

How many nodes do I need to handle my traffic demands?

Building a service designed to handle 1,000 req/sec (assume single node = 300 req/sec)

Page 26: Velocity - cloudy with a chance of scaling

EXAMPLE: How many nodes do we need?

ceil[1,000 / 300] = 4 nodes

With four nodes, we can handle our traffic PLUS we have enough nodes that we can lose one! We have redundancy!

Right???

Page 27: Velocity - cloudy with a chance of scaling

EXAMPLE: Well no…

You think 4 nodes give you redundancy, but they don’t...

If you lose one of those nodes:
• Remaining nodes can only handle 300 * 3 = 900 req/sec
• Cannot handle the 1,000 req/sec load

Page 28: Velocity - cloudy with a chance of scaling

EXAMPLE: How many do we need?

4 nodes... allows handling our traffic, but we cannot handle a node failure

5 nodes... allows handling a single node failure. But… no upgrading

6 nodes... allows handling a multi-node failure. Or… handling a failure during an upgrade

or more…
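
The arithmetic on these slides can be sketched in a few lines of Python. This is only an illustration under the slide's own numbers (1,000 req/sec peak, 300 req/sec per node); the function names `nodes_needed` and `capacity` are made up for this example, and a real sizing exercise would also account for uneven load and degraded nodes.

```python
import math


def nodes_needed(peak_req_per_sec, node_req_per_sec, spare_nodes=0):
    """Nodes required to carry the peak load with `spare_nodes` out of service."""
    base = math.ceil(peak_req_per_sec / node_req_per_sec)
    return base + spare_nodes


def capacity(nodes_in_service, node_req_per_sec):
    """Total req/sec the nodes currently in service can handle."""
    return nodes_in_service * node_req_per_sec


# Assumed figures from the slides: 1,000 req/sec peak, 300 req/sec per node.
for spare in range(3):
    total = nodes_needed(1000, 300, spare_nodes=spare)
    print(f"tolerate {spare} node(s) out of service: {total} nodes")

# Losing one of four nodes leaves less capacity than the peak load.
print(capacity(3, 300), "req/sec with one of four nodes down")
```

The `spare_nodes` argument is where "two mistakes high" shows up: one spare covers a failure or an upgrade, two spares cover a failure during an upgrade.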

Page 29: Velocity - cloudy with a chance of scaling

LESSON: Fly Two Mistakes High

Even if you think you have redundancy…

Think through the failure modes … and make sure

Page 30: Velocity - cloudy with a chance of scaling

EXAMPLE: Rolling Deploys

Page 31: Velocity - cloudy with a chance of scaling

What is a Rolling Deploy?

[Diagram: a load balancer in front of five servers]

1. Remove one server from service
2. Deploy new application version to this server
3. Put back into service
4. Repeat 1 by 1 with remaining servers

Allows deploying changes to your servers without bringing your entire application down
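
The rolling deploy steps above can be sketched as a simple loop. This is a minimal illustration, not any particular deploy tool: the `servers` dictionary, the `deploy` and `is_healthy` callbacks, and the version strings are all hypothetical stand-ins for whatever your load balancer and deploy system actually provide.

```python
def rolling_deploy(servers, new_version, deploy, is_healthy):
    """Upgrade servers one at a time.

    `servers` maps a server name to {"version": str, "in_service": bool}.
    Returns (succeeded, lowest number of servers in service at any point).
    """
    lowest = sum(s["in_service"] for s in servers.values())
    for name, state in servers.items():
        state["in_service"] = False                 # remove one server from service
        lowest = min(lowest, sum(s["in_service"] for s in servers.values()))
        deploy(name, new_version)                   # deploy the new version to it
        state["version"] = new_version
        if not is_healthy(name):
            return False, lowest                    # halt; remaining servers are untouched
        state["in_service"] = True                  # put it back into service
    return True, lowest


# Hypothetical five-server fleet behind a load balancer.
servers = {f"server-{i}": {"version": "v1", "in_service": True} for i in range(1, 6)}
deployed = []
ok, lowest = rolling_deploy(
    servers,
    "v2",
    deploy=lambda name, version: deployed.append(name),  # stand-in for a real deploy step
    is_healthy=lambda name: True,                        # stand-in for a real health check
)
print(ok, lowest)
```

Returning the lowest in-service count makes the cost of the deploy visible: the fleet runs one server short for the whole rollout, which is the capacity the next example asks about.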

Page 36: Velocity - cloudy with a chance of scaling

EXAMPLE: Rolling Deploys

You need 10 nodes to run your application

You have 11 nodes, so that you can do a rolling deploy
• Bring one node down at a time to upgrade…
• Always at least 10 available...

Are you safe?

Page 37: Velocity - cloudy with a chance of scaling

EXAMPLE: Well no…

What if that node fails during upgrade?

What if you now have to roll back?

With the failed server to contend with… you have no room to do an upgrade or rollback, and you are at risk for another failure
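
A quick way to check this scenario is to count the headroom left after subtracting nodes that are upgrading or failed. The sketch below uses the slide's numbers (10 nodes required, 11 provisioned); the `headroom` helper is a hypothetical name introduced only for this example.

```python
def headroom(total_nodes, required_nodes, upgrading=0, failed=0):
    """Spare nodes left in service; a negative value means under capacity."""
    return total_nodes - upgrading - failed - required_nodes


# Assumed figures from the slide: 10 nodes required, 11 provisioned.
print(headroom(11, 10, upgrading=1))             # normal rolling deploy
print(headroom(11, 10, failed=1))                # one node lost, nothing left to upgrade with
print(headroom(11, 10, upgrading=1, failed=1))   # failure plus upgrade or rollback
print(headroom(12, 10, upgrading=1, failed=1))   # one more node restores the margin
```

A headroom of zero means the next event, planned or not, takes the service below its required capacity.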

Page 38: Velocity - cloudy with a chance of scaling

LESSON: Fly Two Mistakes High

Make sure you can handle failures
• Even during “exceptional” events, such as upgrades
• Exceptional events can cause failures


Page 40: Velocity - cloudy with a chance of scaling

EXAMPLE: Unknown dependencies

You have your application running on 20 servers…
• You can run on 15 servers if necessary
• Plenty of redundancy

Are you safe?


Page 45: Velocity - cloudy with a chance of scaling

EXAMPLE: Well, depends…

Are any of the 20 servers in the same rack?

Share the same power supply?

Share the same power source?

Share the same A/C system?

The Cloud is not immune!

Page 46: Velocity - cloudy with a chance of scaling

LESSON: Fly Two Mistakes High

Redundancy is not redundancy when the resources are not independent
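
One way to test for independence is to group servers by each shared resource and ask how many you lose if the largest group goes down. The sketch below assumes a hypothetical `placement` map (20 servers spread over four racks and two power feeds); the rack and feed names, and both function names, are invented for illustration.

```python
from collections import Counter


def worst_single_domain_loss(placement):
    """For each kind of shared resource, find the domain whose failure loses the most servers.

    `placement` maps server name -> {kind: domain id}.
    Returns {kind: (domain id, servers lost)}.
    """
    kinds = {kind for domains in placement.values() for kind in domains}
    worst = {}
    for kind in kinds:
        counts = Counter(d[kind] for d in placement.values() if kind in d)
        worst[kind] = counts.most_common(1)[0]
    return worst


def survives_any_single_domain_failure(placement, required_servers):
    """True if losing any one rack, power feed, etc. still leaves enough servers."""
    worst = worst_single_domain_loss(placement)
    return all(len(placement) - lost >= required_servers for _, lost in worst.values())


# Hypothetical layout: 20 servers, 4 racks of 5, but only 2 power feeds of 10.
placement = {
    f"server-{i}": {"rack": f"rack-{i % 4}", "power": f"feed-{i % 2}"}
    for i in range(20)
}
print(worst_single_domain_loss(placement))
print(survives_any_single_domain_failure(placement, required_servers=15))
```

In this layout a rack failure leaves 15 servers, which is enough, but a power feed failure leaves only 10, so the 20 servers are not as redundant as the count suggests.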


Page 48: Velocity - cloudy with a chance of scaling

EXAMPLE: Failure loop

Are you safe from power outages?

• You live in an apartment…
• The apartment provides an enclosed garage to store things in
• The power goes out in your place a lot…
• ... you buy a generator, store it in the garage

Page 49: Velocity - cloudy with a chance of scaling

EXAMPLE: Failure loop

Oops… the garage:
• Has a single door, the big garage door
• It has a garage door opener
• That requires electricity to open...

The generator is only available... when you already have power…

Page 50: Velocity - cloudy with a chance of scaling

LESSON: Fly Two Mistakes High

Make sure your recovery plans actually are operational when you are in a failure mode
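
A failure loop like the garage example can be caught by writing down what each recovery resource requires and checking whether it transitively depends on the thing that has failed. The sketch below is a minimal illustration; the `requires` map and the `depends_on` helper are hypothetical, and the dependency names simply restate the story.

```python
def depends_on(resource, target, requires, seen=None):
    """True if `resource` needs `target`, directly or through other dependencies."""
    seen = set() if seen is None else seen
    for dep in requires.get(resource, ()):
        if dep == target:
            return True
        if dep not in seen:
            seen.add(dep)
            if depends_on(dep, target, requires, seen):
                return True
    return False


# Hypothetical dependency map for the garage story.
requires = {
    "generator": ["open garage door"],
    "open garage door": ["garage door opener"],
    "garage door opener": ["mains power"],
}

failure = "mains power"
recovery = "generator"
if depends_on(recovery, failure, requires):
    print(f"Failure loop: {recovery} is unavailable when {failure} is down")
```

The same check applies to runbooks stored on the system that is down, or failover tooling that authenticates against the failed service.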

Page 51: Velocity - cloudy with a chance of scaling

EXAMPLE: High redundancy in action


Page 53: Velocity - cloudy with a chance of scaling

EXAMPLE: A real system…

Great example:
• Redundant
• Highly independent
• Multi-level error recovery
• Highly recoverable system

In fact, one of the very first large-scale software applications utilizing extreme redundancy and failure management

Page 54: Velocity - cloudy with a chance of scaling

EXAMPLE: What is this system?

Page 55: Velocity - cloudy with a chance of scaling

EXAMPLE: US Space Shuttle Program

They had problems… serious mechanical problems...

But the software system utilized state of the art:
• Redundancy techniques
• Error recovery techniques

Page 56: Velocity - cloudy with a chance of scaling

EXAMPLE: US Space Shuttle System

Five onboard computers

Four were identical (we’ll talk about the fifth later)

All four:
– Ran the exact same program during critical periods
– Given same data
– Expected to generate the same result


Page 59: Velocity - cloudy with a chance of scaling

EXAMPLE: Four computers

Computers voted on the proper outcome

If any one computer did not generate the same results:

Those that disagreed with the outcome were turned off for remainder of the flight

Ultimate in democratic systems…

Page 60: Velocity - cloudy with a chance of scaling

EXAMPLE: Four computers

Could FLY with only THREE computers working

Could LAND with only TWO computers working
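
The voting idea can be illustrated with a small majority-vote sketch. This is not the Shuttle's actual algorithm, only a toy model of the behavior the slides describe: identical computers compare results, dissenters are switched off, and the flight rules depend on how many remain. The computer names, the `vote` function, and the output strings are all hypothetical; the two thresholds come from the slide.

```python
from collections import Counter

MIN_TO_FLY = 3   # from the slide: could fly with three computers working
MIN_TO_LAND = 2  # from the slide: could land with two computers working


def vote(results, active):
    """Majority vote among the active computers.

    `results` maps computer name -> output; `active` is the set still enabled.
    Returns (agreed output or None, computers that remain active).
    """
    ballots = {c: results[c] for c in active}
    if not ballots:
        return None, set()
    winner, count = Counter(ballots.values()).most_common(1)[0]
    if count <= len(ballots) / 2:
        return None, set(active)  # no strict majority: a deadlock to resolve elsewhere
    # Computers that disagreed with the majority are turned off.
    return winner, {c for c, out in ballots.items() if out == winner}


active = {"A", "B", "C", "D"}
outcome, active = vote({"A": "pitch +2", "B": "pitch +2", "C": "pitch +2", "D": "pitch +9"}, active)
print(outcome, sorted(active))
print("can fly:", len(active) >= MIN_TO_FLY, "can land:", len(active) >= MIN_TO_LAND)
```

A two-against-two split returns no outcome at all, which is the deadlock case the next slides address.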


Page 62: Velocity - cloudy with a chance of scaling

EXAMPLE: Deadlock

What if the four computers couldn’t decide? (software bug or multiple failures)

Fifth computer was used as a tie breaker
• Much simpler version of software… only used for key decisions
• Software written by independent software team, unconnected with rest of software developers
• (In theory) would not introduce same software errors…

Page 63: Velocity - cloudy with a chance of scaling

Highly Successful

30-year operation of Space Shuttle:
• Never a case where a serious life-threatening problem occurred that was a result of a software problem
• Even though software was the most complex software ever built for a space program

Page 64: Velocity - cloudy with a chance of scaling

US Space Shuttle

• This is extreme (not needed by most projects)
• Shows what is possible...
• Independence is critical to high availability


Page 68: Velocity - cloudy with a chance of scaling

LESSON: Fly Two Mistakes High

Use availability solution consistent with the risk

Higher the risk, higher the focus on availability

Don’t over invest, don’t under invest

But think ahead, avoid the surprise

Page 69: Velocity - cloudy with a chance of scaling

And remember…

“Keep your plane at least two mistakes high.”


Page 70: Velocity - cloudy with a chance of scaling

Want to Learn More?

Architecting for Scale
By: Lee Atchison
Published by: O’Reilly Media, Available: June 2016
www.architectingforscale.com

Preview edition available at New Relic booth

Velocity Events
• “Static vs Dynamic Cloud”: Thursday 12noon, New Relic Booth
• Office Hours: Thursday 3pm, O’Reilly Booth
• Book Signing: Today 2:30pm, O’Reilly Booth; throughout show, New Relic Booth

@leeatchison leeatchison

Page 71: Velocity - cloudy with a chance of scaling

Thank you.

Lee Atchison
Principal Cloud Architect and Advocate at New Relic, Inc.

Architecting for Scale
Published by: O’Reilly Media, Available: June 2016
www.architectingforscale.com

@leeatchison leeatchison