#DevoxxUS Architecting for failures in micro services: Patterns and lessons learned Bhakti Mehta @bhakti_mehta

Page 1: Devoxx2017

#DevoxxUS

Architecting for failures in micro services:

Patterns and lessons learned

Bhakti Mehta

@bhakti_mehta

Page 2: Devoxx2017

INTRODUCTION

➤ Platform@Atlassian

➤ Previously Platform Lead at BlueJeans Network

➤ Worked at Sun Microsystems/Oracle for 13 years

➤ Committer to numerous open source projects including GlassFish Application Server

Page 3: Devoxx2017

MY RECENT BOOK

Page 4: Devoxx2017

PREVIOUS BOOK

Page 5: Devoxx2017

ATLASSIAN

Page 6: Devoxx2017

Microservices

Page 7: Devoxx2017

PATH TO MICROSERVICES

➤ Advantages

➤ Simplicity

➤ Isolation of problems

➤ Scale up and scale down

➤ Easy deployment

➤ Polyglotism and heterogeneity

Page 8: Devoxx2017

Sounds great!!

Page 9: Devoxx2017

In reality…

Page 10: Devoxx2017

MONOLITHS TO MICRO SERVICES

Page 11: Devoxx2017

RESILIENT SYSTEM

➤ Processes transactions even under transient impulses or persistent stresses

➤ Functions even when there are component failures disrupting normal processing

➤ Accepts that failures will happen

➤ Design for crumple zones

Page 12: Devoxx2017

RESILIENT SYSTEM

How should the customer perceive you?

Be the duck: behave normally in the face of outages, even when the system is not performing as expected

Page 13: Devoxx2017

RESILIENT SYSTEM

How does the system need to function? Heal quickly, before customers notice

Page 14: Devoxx2017

KINDS OF FAILURES

➤ Challenges at scale

➤ Integration point failures

➤ Network errors

➤ Semantic errors

➤ Slow responses

➤ Outright hang

➤ GC issues

Page 15: Devoxx2017

THE NEW WAY OF LIFE

You build it, you run it! (You own it, you plan for it!)

Page 16: Devoxx2017
Page 17: Devoxx2017

PERFECT STORM

Page 18: Devoxx2017

THINGS THAT WENT WRONG

➤ Bad node in load balancer group

➤ Deployment of new code

➤ Gradual increase in latency

➤ Abuse by clients

➤ Not enough production-like data in staging

➤ No easy way to trigger stale/lenient fallbacks

➤ Too few alerts

Page 19: Devoxx2017

LESSONS LEARNED

Errors can be frequent, but latencies are consequential!

Page 20: Devoxx2017

ACTION PLAN

Development time

➤ Circuit breakers

➤ Fallback (lenient acceptable values)

➤ Predictive caching

➤ Reduce the surface area exposed to clients

Before a deploy

➤ Load tests

➤ Failure injection testing

Post deploy

➤ Monitor

➤ Alerts

Page 21: Devoxx2017

The more you sweat on the field, the less you bleed in war!

Page 22: Devoxx2017

RESILIENCY PLANNING STAGE 1

➤ When developing code

➤ Avoiding Cascading failures

➤ Circuit breaker

➤ Timeouts

➤ Retry

➤ Bulkhead

➤ Cache optimisations

➤ Avoid malicious clients

➤ Rate limiting

Page 23: Devoxx2017

RESILIENCY PLANNING STAGE 2

➤ Planning for dealing with failures before deploy to prod

➤ Load test

➤ A/B test

➤ Longevity test

➤ Dark launch features

Page 24: Devoxx2017

RESILIENCY PLANNING STAGE 3

➤ Watching out for failures after deploy to prod

➤ Health check

➤ Metrics

Page 25: Devoxx2017
Page 26: Devoxx2017

CASCADING FAILURES

Caused by Chain reactions

For example

One node in a load-balanced group fails

Others need to pick up its work

Eventually performance can degrade
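
As a rough illustration of the chain reaction above, the sketch below spreads the load of failed nodes over the survivors and keeps failing nodes that are pushed past capacity. The function names, node counts, request rates, and capacities are made-up assumptions, not figures from the talk.

```python
def load_per_node(total_rps, healthy_nodes):
    """Requests per second that each healthy node must absorb."""
    if healthy_nodes <= 0:
        raise ValueError("no healthy nodes left")
    return total_rps / healthy_nodes


def simulate_cascade(total_rps, nodes, capacity_rps):
    """Fail one node, then keep failing nodes that are pushed past capacity.

    Returns the number of healthy nodes after each step.
    """
    history = []
    healthy = nodes - 1  # the initial failure
    while healthy > 0:
        history.append(healthy)
        if load_per_node(total_rps, healthy) <= capacity_rps:
            break  # the survivors can absorb the extra work
        healthy -= 1  # an overloaded node tips over, shifting load again
    else:
        history.append(0)  # nothing left: the whole group is down
    return history
```

With four nodes serving 900 requests per second, a per-node capacity of 300 absorbs one failure, while a capacity of 250 does not and the group collapses node by node.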

Page 27: Devoxx2017

HYSTRIX: CIRCUIT BREAKER PATTERN

• Fault tolerance pattern as a library

• Automatic fail fast

• Automatic fail over

• Metrics: circuit breaker open, calls/sec, execution time (median, 90th, 95th, and 99th percentiles)

• If a command has had a high failure rate in the last 10 seconds, it is unlikely to succeed now
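
The idea in the last bullet can be sketched in a few lines. This is not Hystrix's API: the class name, the 50% failure ratio, the minimum call count, and the cooldown are illustrative assumptions, and the sketch is not thread-safe.

```python
import time
from collections import deque


class CircuitBreaker:
    """Hypothetical minimal circuit breaker with a rolling failure window."""

    def __init__(self, window_s=10.0, failure_ratio=0.5, min_calls=5,
                 cooldown_s=5.0, clock=time.monotonic):
        self.window_s = window_s
        self.failure_ratio = failure_ratio
        self.min_calls = min_calls
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.results = deque()   # (timestamp, succeeded) pairs
        self.opened_at = None    # None means the circuit is closed

    def _failure_rate_too_high(self):
        cutoff = self.clock() - self.window_s
        while self.results and self.results[0][0] < cutoff:
            self.results.popleft()  # drop results older than the window
        if len(self.results) < self.min_calls:
            return False
        failures = sum(1 for _, ok in self.results if not ok)
        return failures / len(self.results) >= self.failure_ratio

    def call(self, func, fallback):
        now = self.clock()
        if self.opened_at is not None and now - self.opened_at < self.cooldown_s:
            return fallback()  # fail fast while the circuit is open
        # Closed, or cooldown elapsed: let the call through as a trial.
        try:
            result = func()
        except Exception:
            self.results.append((now, False))
            if self.opened_at is not None or self._failure_rate_too_high():
                self.opened_at = now  # open (or re-open) the circuit
            return fallback()
        self.results.append((now, True))
        if self.opened_at is not None:
            self.opened_at = None  # trial succeeded: close the circuit
            self.results.clear()
        return result
```

While the circuit is open the wrapped function is never invoked, so a struggling dependency gets room to recover and callers receive the fallback value immediately.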

Page 28: Devoxx2017

TIMEOUTS PATTERN

Page 29: Devoxx2017

RETRY PATTERN AND TIMEOUTS

➤ Retry on network failures, timeouts, or server errors

➤ Helps with transient errors such as dropped connections or server failover
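
A minimal sketch of retrying transient failures, assuming the wrapped call enforces its own timeout and signals retryable problems with a hypothetical `TransientError`. The attempt count and delays are illustrative; the jitter follows the "Jitter with Retries" entry on the cheat sheet.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable errors (timeouts, dropped connections, 5xx)."""


def retry(func, attempts=4, base_delay_s=0.1, max_delay_s=2.0,
          sleep=time.sleep, rng=random.random):
    """Call func, retrying on TransientError with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Backoff doubles per attempt up to max_delay_s; the random factor
            # (full jitter) keeps many clients from retrying in lockstep.
            ceiling = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(rng() * ceiling)
```

Only errors marked as transient are retried; anything else propagates at once, which avoids hammering a service with requests that can never succeed.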

Page 30: Devoxx2017

BULKHEAD
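
One common way to build a bulkhead is to cap the number of concurrent calls to each dependency, so that one slow dependency cannot use up every worker. The sketch below assumes a semaphore-based design; the class and exception names are hypothetical.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the compartment for a dependency has no free slot."""


class Bulkhead:
    """Hypothetical bulkhead: at most max_concurrent calls run at once."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func):
        # Reject immediately instead of queueing, so callers are not
        # left waiting behind a dependency that is already saturated.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slot for this dependency")
        try:
            return func()
        finally:
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` instance keeps a failure in one compartment from flooding the others.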

Page 31: Devoxx2017

RATE LIMITING

Page 32: Devoxx2017

RATE LIMITING

➤ Restricting the number of requests that can be made by a client

➤ Client can be identified based on the access token used

➤ Additionally clients can be identified based on IP address
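
The bullets above can be sketched as a token bucket kept per client. The token-bucket algorithm, the in-memory dictionary, and all names here are assumptions for illustration; the talk does not say which algorithm or store was used.

```python
import time


def client_key(access_token, ip_address):
    """Identify the client by access token, falling back to IP address."""
    return f"token:{access_token}" if access_token else f"ip:{ip_address}"


class RateLimiter:
    """Hypothetical per-client token bucket held in process memory."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate_per_s = rate_per_s  # tokens added per second
        self.burst = burst            # bucket size
        self.clock = clock
        self._buckets = {}            # client key -> (tokens, last_refill_time)

    def allow(self, key):
        now = self.clock()
        tokens, last = self._buckets.get(key, (float(self.burst), now))
        # Refill for the time elapsed since the last request, up to the burst.
        tokens = min(self.burst, tokens + (now - last) * self.rate_per_s)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[key] = (tokens, now)
        return allowed
```

A request handler would call `allow(client_key(token, ip))` and reject the request (for example with HTTP 429) when it returns `False`.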

Page 33: Devoxx2017

CACHE OPTIMIZATIONS

Getting from first level cache

Getting from second level cache

Getting from the DB
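
The lookup order on this slide can be sketched as a fall-through read. Plain dictionaries stand in for both cache levels and `load_from_db` is a hypothetical loader; real caches would also need eviction and expiry.

```python
def get_with_fallthrough(key, first_level, second_level, load_from_db):
    """Return (value, source), trying the cheapest store first."""
    if key in first_level:
        return first_level[key], "first-level cache"
    if key in second_level:
        value = second_level[key]
        first_level[key] = value  # promote so the next lookup is faster
        return value, "second-level cache"
    value = load_from_db(key)     # slowest path: go to the database
    second_level[key] = value
    first_level[key] = value
    return value, "database"
```

Each miss populates the levels above it, so repeated reads of the same key stop reaching the database.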

Page 34: Devoxx2017

TALE OF THE NEVER LEAVING CACHE ENTRIES

➤ Longer TTL

➤ Not evicted soon enough

➤ Bottlenecks

➤ Failures
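
To make the TTL problem concrete, here is a minimal sketch of a cache whose entries expire and can be invalidated explicitly. The class name and the evict-on-read strategy are assumptions, not the implementation described in the talk.

```python
import time


class TTLCache:
    """Hypothetical cache whose entries expire after ttl_s seconds."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def put(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl_s)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # evict the stale entry on read
            return default
        return value

    def invalidate(self, key):
        """Drop an entry early, for example after the source data changes."""
        self._entries.pop(key, None)
```

A shorter `ttl_s` trades extra backend reads for fresher data; explicit invalidation covers the cases where waiting for expiry would serve stale values.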

Page 35: Devoxx2017

LOGGING BEST PRACTICES

➤ Use a detailed, consistent pattern across service logs

➤ Obfuscate sensitive data

➤ Identify caller or initiator as part of logs

➤ Do not log payloads

➤ Request tracing across services
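
A small sketch of these practices using Python's standard `logging` module. The `key=value` line format, the `token=` masking rule, and the service name are illustrative assumptions.

```python
import logging
import re
import uuid

logger = logging.getLogger("orders-service")  # hypothetical service name

# Assumed convention: secrets appear as token=<value> in messages.
TOKEN_PATTERN = re.compile(r"(token=)[^\s&]+")


def obfuscate(message):
    """Mask token values so they never reach the log files."""
    return TOKEN_PATTERN.sub(r"\1***", message)


def new_request_id(incoming=None):
    """Reuse the caller's request ID if present, so traces span services."""
    return incoming or uuid.uuid4().hex


def format_log_line(request_id, caller, message):
    """One consistent pattern for every service: ID, caller, then message."""
    return f"request_id={request_id} caller={caller} msg={obfuscate(message)}"


def log_request(request_id, caller, message):
    line = format_log_line(request_id, caller, message)
    logger.info(line)
    return line
```

Passing the same request ID to every downstream call lets a single search pull up the whole path of a request across services.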

Page 36: Devoxx2017

RESILIENCY PLANNING STAGE 2

➤ Before deploy

➤ Load testing

➤ Longevity testing

➤ Capacity planning

Page 37: Devoxx2017

LOAD TESTING

➤ Ensure that you test for load on APIs

➤ Plan for longevity testing

Page 38: Devoxx2017

CAPACITY PLANNING

➤ Anticipate growth

➤ Design for handling exponential growth

Page 39: Devoxx2017

RESILIENCY PLANNING STAGE 3

➤ After deploy

➤ Health check

➤ Metrics and Monitoring

➤ Phased rollout of features

Page 40: Devoxx2017

Health Check

Page 41: Devoxx2017

HEALTH CHECK

➤ Memory

➤ CPU

➤ Threads

➤ Error rate

➤ If any check exceeds a threshold, send an alert
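
The threshold rule can be sketched as a comparison of current readings against limits. The metric names and limit values below are placeholders chosen for illustration, not values from the talk.

```python
# Hypothetical limits; real values depend on the service and its hardware.
DEFAULT_THRESHOLDS = {
    "memory_pct": 90.0,
    "cpu_pct": 85.0,
    "threads": 500,
    "error_rate_pct": 1.0,
}


def breached_checks(readings, thresholds=DEFAULT_THRESHOLDS):
    """Return the names of checks whose reading exceeds its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if readings.get(name, 0) > limit
    )
```

A health endpoint could report unhealthy, and trigger an alert, whenever `breached_checks` returns a non-empty list.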

Page 42: Devoxx2017

Metrics and Monitoring

Page 43: Devoxx2017

METRICS

➤ Response times, throughput

➤ Identify slow running DB queries

➤ GC rate and pause duration

➤ Garbage collection can cause slow responses

➤ Monitor unusual activity

➤ Create alerts when thresholds are exceeded

➤ Run books for actions to be taken on alerts
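
As an example of turning response-time samples into an alert, the sketch below computes a nearest-rank percentile and compares the 95th percentile with a limit. The function names and the choice of the nearest-rank method are assumptions; monitoring systems differ in how they estimate percentiles.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alert(samples_ms, p95_limit_ms):
    """Return an alert message if p95 latency exceeds the limit, else None."""
    p95 = percentile(samples_ms, 95)
    if p95 > p95_limit_ms:
        return f"p95 latency {p95} ms exceeds {p95_limit_ms} ms"
    return None
```

Alerting on a high percentile rather than the average surfaces the slow tail that a mean can hide.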

Page 44: Devoxx2017

Thoughts of the on-call person paged at 3 am, debugging an issue in your code

Page 45: Devoxx2017

MONITORING

[Diagram: a monitoring server runs checks against the environment and sends alerts by email]

Page 46: Devoxx2017

SAVED BY THE METRICS AND ALERTS

➤ MaxDBConnection alert

➤ CPU Utilisation spiking up

➤ Analysed slow running queries

➤ Some SELECT queries were taking very long: average 718 ms, 95th percentile 2030 ms

➤ Root cause, initially unidentified: a bug fix had introduced pagination, and its ORDER BY clause needed to match a function-based index

Page 47: Devoxx2017

ROLLOUT OF NEW FEATURES

➤ Phasing rollout of new features

➤ Dark launch features

➤ Have a way to turn features off if not behaving as expected

➤ Alerts and more alerts!
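
A phased rollout with an off switch can be sketched as a percentage-based feature flag. The class name, the hash-based bucketing, and the flag name in the test are assumptions for illustration; the talk does not describe the mechanism used.

```python
import hashlib


class FeatureFlag:
    """Hypothetical flag: on for a stable percentage of users, with a kill switch."""

    def __init__(self, name, rollout_pct=0, enabled=True):
        self.name = name
        self.rollout_pct = rollout_pct  # 0 to 100
        self.enabled = enabled          # set to False to turn the feature off

    def is_on_for(self, user_id):
        if not self.enabled:
            return False
        # Hash the flag name with the user ID so each user lands in the same
        # bucket on every request and buckets differ between features.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:2], "big") % 100
        return bucket < self.rollout_pct
```

Raising `rollout_pct` step by step only ever adds users, since a user's bucket never changes, and setting `enabled` to `False` turns the feature off for everyone at once.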

Page 48: Devoxx2017

AWS S3 OUTAGE

➤ S3 outage in US East

➤ Number of services affected

➤ Third-party services we depend on had degraded performance

➤ Lots of key takeaways from this

Page 49: Devoxx2017

Cheat sheet

A  Alerts                 K  Key invalidations
B  Bulkheads              L  Logging
C  Circuit breakers       M  Metrics & monitoring
D  Data obfuscation       N  Network latencies
E  Eventual consistency   O  Optimizing queries
F  Fallbacks & Hystrix    P  Phased rollouts
G  GC settings            Q  Queues bounded
H  Health checks          R  Run books
I  Injecting failure      S  Staged deployments
J  Jitter with retries    T  Timeouts

Page 50: Devoxx2017

TAKEAWAY

➤ Inevitability of failures

➤ Expect systems will fail

➤ Failure prevention: plan for failures; it is not if but when

➤ Automate

Keep Calm and Cloud On!

Page 52: Devoxx2017

#DevoxxUS

Questions

@bhakti_mehta