#DevoxxUS
Architecting for failures in micro services:
Patterns and lessons learned
Bhakti Mehta
@bhakti_mehta
INTRODUCTION
➤ Platform@Atlassian
➤ In the past Platform Lead at BlueJeans Network
➤ Worked at Sun Microsystems/Oracle for 13 years
➤ Committer to numerous open source projects including GlassFish Application Server
MY RECENT BOOK
PREVIOUS BOOK
ATLASSIAN
Microservices
PATH TO MICROSERVICES
➤ Advantages
➤ Simplicity
➤ Isolation of problems
➤ Scale up and scale down
➤ Easy deployment
➤ Polyglotism and heterogeneity
Sounds great!!
In reality……..
MONOLITHS TO MICRO SERVICES
RESILIENT SYSTEM
➤ Processes transactions even when there are transient impulses or persistent stresses
➤ Functions even when there are component failures disrupting normal processing
➤ Accepts failures will happen
➤ Design for crumple zones
RESILIENT SYSTEM
Be the duck
How should the customer perceive you?
Behave normally in the face of outages
Behave normally when the system is not performing as expected
RESILIENT SYSTEM
How does the system need to function? Heal quickly, before customers notice
KINDS OF FAILURES
➤ Challenges at scale
➤ Integration point failures
➤ Network errors
➤ Semantic errors
➤ Slow responses
➤ Outright hang
➤ GC issues
THE NEW WAY OF LIFE
You build it, you run it!! (You own it, you plan for it!!!)
PERFECT STORM
THINGS THAT WENT WRONG
➤ Bad node in load balancer group
➤ Deployment of new code
➤ Gradual increase in latency
➤ Abuse by clients
➤ Not enough production-like data in staging
➤ No easy way to trigger stale/lenient fallbacks
➤ Too few alerts
LESSONS LEARNED
Errors can be frequent, but latencies are consequential!!
ACTION PLAN
➤ Circuit breakers
➤ Fallback (lenient acceptable values)
➤ Predictive caching
➤ Reduce surface area by clients
➤ Load tests
➤ Failure injection testing
➤ Monitor
➤ Alerts
Development time
Before a deploy
Post deploy
The more you sweat on the field the less you bleed in war!!!
RESILIENCY PLANNING STAGE 1
➤ When developing code
➤ Avoiding Cascading failures
➤ Circuit breaker
➤ Timeouts
➤ Retry
➤ Bulkhead
➤ Cache optimisations
➤ Avoid malicious clients
➤ Rate limiting
RESILIENCY PLANNING STAGE 2
➤ Planning for dealing with failures before deploy to prod
➤ Load test
➤ A/B test
➤ Longevity
➤ Dark launch features
RESILIENCY PLANNING STAGE 3
➤ Watching out for failures after deploy to prod
➤ Health check
➤ Metrics
CASCADING FAILURES
Caused by chain reactions
For example:
One node in a load-balanced group fails
The others need to pick up its work
Eventually performance can degrade
HYSTRIX: CIRCUIT BREAKER PATTERN
• Fault tolerance pattern as a library
• Automatic fail fast
• Automatic fail over
• Metrics: circuit breaker open, calls/sec, execution time (median, 90th, 95th and 99th percentiles)
• If a command has had a high failure rate in the last 10 seconds, it is unlikely to succeed now
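The fail-fast idea above can be sketched in a few lines. This is a minimal illustration, not Hystrix's API or implementation: the class name, the parameters, and the use of consecutive failures (instead of a failure rate over a rolling 10-second window) are all assumptions made to keep the example short.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not touch the struggling dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Reset timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            if fallback is not None:
                return fallback()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Passing a `fallback` that returns a stale or lenient value gives the automatic fail-over behaviour; without one, callers see the original error or `CircuitOpenError`.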
TIMEOUTS PATTERN
RETRY PATTERN AND TIMEOUTS
➤ Retry on network failures, timeouts or server errors
➤ Helps with transient errors such as dropped connections or server failover
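A retry helper along these lines might look as follows. It is a sketch under assumptions: the function name, the default values, and the choice of capped exponential backoff with full jitter are illustrative, and only Python's built-in `ConnectionError` and `TimeoutError` are treated as transient.

```python
import random
import time


def retry(func, attempts=3, base_delay=0.1, max_delay=2.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, retrying transient errors with capped exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the capped exponential delay,
            # so many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

Errors outside `retry_on` propagate immediately, which keeps non-transient failures (for example a validation error) from being retried.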
BULKHEAD
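One common way to realise a bulkhead is to cap the number of concurrent calls into each dependency, so that one slow dependency cannot exhaust every worker. The sketch below assumes a semaphore-based compartment; the class and exception names are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slot."""


class Bulkhead:
    """Hypothetical compartment that admits at most max_concurrent calls at once."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func):
        # Non-blocking acquire: reject immediately instead of queueing callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot in this compartment")
        try:
            return func()
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` instance keeps a failure in one compartment from flooding the others.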
RATE LIMITING
➤ Restricting the number of requests that can be made by a client
➤ Client can be identified based on the access token used
➤ Additionally clients can be identified based on IP address
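As a sketch of per-client limiting, the example below uses a token bucket keyed by whatever identifies the client (an access token or an IP address). The token-bucket algorithm, the class name and its parameters are assumptions chosen for illustration; the slides do not say which algorithm was used.

```python
import time


class RateLimiter:
    """Token bucket per client: `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock      # injectable so tests can control time
        self._buckets = {}      # client key -> (tokens, last_refill_time)

    def allow(self, client_key):
        now = self.clock()
        tokens, last = self._buckets.get(client_key, (self.capacity, now))
        # Refill in proportion to the elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[client_key] = (tokens - 1, now)
            return True
        self._buckets[client_key] = (tokens, now)
        return False
```

A caller would typically answer with HTTP 429 when `allow` returns `False`.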
CACHE OPTIMIZATIONS
Getting from first level cache
Getting from second level cache
Getting from the DB
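The lookup order above (first level, then second level, then the DB) can be sketched as below. Both cache levels are plain dictionaries here, standing in for an in-process cache and a shared cache; the class name, the `load_from_db` callback and the TTL handling are assumptions.

```python
import time


class TwoLevelCache:
    """Hypothetical read-through cache with two levels and a TTL per entry."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self.load_from_db = load_from_db
        self.ttl = ttl_seconds
        self.clock = clock
        self.l1 = {}  # in-process cache: key -> (value, expires_at)
        self.l2 = {}  # stand-in for a shared cache: key -> (value, expires_at)

    def _fresh(self, store, key):
        entry = store.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry
        store.pop(key, None)  # evict expired entries so they cannot linger
        return None

    def get(self, key):
        """Return (value, source) where source is 'l1', 'l2' or 'db'."""
        entry = self._fresh(self.l1, key)
        if entry is not None:
            return entry[0], "l1"
        entry = self._fresh(self.l2, key)
        if entry is not None:
            self.l1[key] = entry  # promote to the first level
            return entry[0], "l2"
        value = self.load_from_db(key)
        entry = (value, self.clock() + self.ttl)
        self.l1[key] = entry
        self.l2[key] = entry
        return value, "db"
```

Returning the source alongside the value is only there to make the lookup path visible; a real cache would return the value alone.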
TALE OF THE NEVER LEAVING CACHE ENTRIES
➤ Longer TTL
➤ Not evicted soon enough
➤ Bottlenecks
➤ Failures
LOGGING BEST PRACTICES
➤ Include detailed, consistent pattern across service logs
➤ Obfuscate sensitive data
➤ Identify caller or initiator as part of logs
➤ Do not log payloads
➤ Request tracing across services
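A small helper can enforce several of these practices at once. The sketch below is illustrative: the field names (`request_id`, `caller`) and the list of sensitive keys are assumptions, and it expects callers to pass selected fields, not whole payloads.

```python
import json

# Illustrative list of keys whose values must never reach the logs.
SENSITIVE_KEYS = {"password", "access_token", "authorization", "ssn"}


def log_line(service, request_id, caller, event, **fields):
    """Build one structured log line with a consistent shape across services."""
    record = {
        "service": service,
        "request_id": request_id,  # propagated between services for request tracing
        "caller": caller,          # who initiated the call
        "event": event,
    }
    for key, value in fields.items():
        # Obfuscate sensitive values instead of logging them.
        record[key] = "***" if key.lower() in SENSITIVE_KEYS else value
    return json.dumps(record, sort_keys=True)
```

For example, `log_line("billing", "req-123", "web-frontend", "charge_failed", status=502)` yields a JSON line that can be joined with other services' logs on `request_id`.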
RESILIENCE PLANNING STAGE 2
➤ Before deploy
➤ Load testing
➤ Longevity testing
➤ Capacity planning
LOAD TESTING
➤ Ensure that you test for load on APIs
➤ Plan for longevity testing
CAPACITY PLANNING
➤ Anticipate growth
➤ Design for handling exponential growth
RESILIENCE PLANNING STAGE 3
➤ After deploy
➤ Health check
➤ Metrics and Monitoring
➤ Phased rollout of features
Health Check
HEALTH CHECK
➤ Memory
➤ CPU
➤ Threads
➤ Error rate
➤ If any check exceeds its threshold, send an alert
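The threshold rule above can be sketched as a pure function over the latest readings. The metric names, the threshold values in the usage note and the `send_alert` callback are all hypothetical.

```python
def run_health_check(readings, thresholds, send_alert):
    """Compare each reading with its threshold and alert on every breach."""
    breached = sorted(
        name for name, limit in thresholds.items()
        if readings.get(name, 0) > limit
    )
    for name in breached:
        send_alert(f"{name} is {readings[name]}, above threshold {thresholds[name]}")
    return {"status": "unhealthy" if breached else "healthy", "breached": breached}
```

With thresholds such as `{"memory_pct": 90, "cpu_pct": 85, "threads": 500, "error_rate": 0.05}`, a reading of `cpu_pct=97` would mark the service unhealthy and trigger one alert.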
Metrics and Monitoring
METRICS
➤ Response times, throughput
➤ Identify slow running DB queries
➤ GC rate and pause duration
➤ Garbage collection can cause slow responses
➤ Monitor unusual activity
➤ Create alerts when thresholds are exceeded
➤ Run books for actions to be taken on alerts
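Averages hide slow outliers, so response times are usually tracked as percentiles. The sketch below uses the nearest-rank definition; the function names are assumptions, and `pct` is expected to be a whole number between 1 and 100.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Summarise response times the way a dashboard or alert rule might."""
    return {
        name: percentile(samples_ms, pct)
        for name, pct in (("median", 50), ("p90", 90), ("p95", 95), ("p99", 99))
    }
```

An alert rule could then compare, say, `latency_summary(samples)["p95"]` against a threshold instead of the mean.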
Thoughts of the on call person paged at 3 am
debugging an issue in your code
MONITORING
[Diagram: monitoring server, environment, checks, alerts]
SAVED BY THE METRICS AND ALERTS
➤ MaxDBConnection alert
➤ CPU utilisation spiking
➤ Analysed slow running queries
➤ Some SELECT queries were taking very long: average 718 ms, 95th percentile 2030 ms
➤ The cause, initially unidentified, was a bug fix that introduced pagination: its ORDER BY clause needed to match a function-based index
ROLLOUT OF NEW FEATURES
➤ Phasing rollout of new features
➤ Dark launch features
➤ Have a way to turn features off if not behaving as expected
➤ Alerts and more alerts!
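A phased rollout with an off switch can be sketched by hashing each user into a stable bucket. The function name, the bucketing scheme and the `enabled` flag are assumptions for illustration, not a description of any particular feature-flag product.

```python
import hashlib


def in_rollout(feature, user_id, percent, enabled=True):
    """Return True if this user falls inside the first `percent` buckets for the feature."""
    if not enabled:
        return False  # kill switch: turns the feature off for everyone
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99 for this feature/user pair
    return bucket < percent
```

Because the bucket is derived from the feature name and user id, raising `percent` from 10 to 50 only adds users; nobody who already had the feature loses it.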
AWS S3 OUTAGE
➤ S3 outage in US East
➤ Number of services affected
➤ Third-party services we depend on had degraded performance
➤ Lots of key takeaways from this
Cheat sheet
A  Alerts                  K  Key invalidations
B  Bulkheads               L  Logging
C  Circuit breakers        M  Metrics & monitoring
D  Data obfuscation        N  Network latencies
E  Eventual consistency    O  Optimizing queries
F  Fallbacks & Hystrix     P  Phased rollouts
G  GC settings             Q  Queues bounded
H  Health checks           R  Run books
I  Injecting failure       S  Staged deployments
J  Jitter with retries     T  Timeouts
TAKEAWAY
➤ Inevitability of failures
➤ Expect systems will fail
➤ Failure prevention: plan for failures. Not if, but when
➤ Automate
Keep Calm and Cloud On!
Questions
@bhakti_mehta