NETFLIX’S CHAOS MONKEY
Michael Whitehead
“EVERYTHING FAILS ALL THE TIME”
- WERNER VOGELS
CHAOS MONKEY
A service that causes failure and wreaks havoc on instances in Auto Scaling Groups
A member of the Simian Army developed by Netflix
WHY WOULD WE INTENTIONALLY CAUSE FAILURE?!?
• Failure is inevitable
• Infrastructure is complex
• Forcing failure puts you in control
• Identify faults in your architecture
  • Do your load balancers reroute traffic correctly?
  • Do your instances function correctly when they come back up?
  • Are your monitoring tools alerting you on important events?
GETTING STARTED WITH CHAOS MONKEY
• Amazon Web Services
• Must be using Auto Scaling Groups
• Uses Amazon SimpleDB for event storage
• Simple Email Service setup (optional, for notifications)
• Can be used with Netflix’s Asgard (optional)
• Java 7 JDK or newer
EXAMPLE WITH CLOUDFORMATION
BUILDING & CONFIGURATION
• Clone the SimianArmy repo from GitHub
• Builds using Gradle
• By default, runs 6 times a day during business hours (9am to 3pm)
• Does not run on holidays or weekends
• Timeframes and frequency of runs can be configured
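The default schedule described above can be sketched in a few lines of Python. This is only an illustration of the rules on the slide, not SimianArmy's own code: the names `is_monkey_time`, `runs_per_day`, and `HOLIDAYS` are hypothetical, the holiday date is a placeholder, and the half-open 9am to 3pm window with hourly runs is an assumption chosen because it yields the six runs per day mentioned above.

```python
from datetime import date, datetime

# Assumed defaults taken from the slide: weekdays only, 9am to 3pm.
OPEN_HOUR = 9
CLOSE_HOUR = 15

# Hypothetical holiday list; a real deployment would configure its own calendar.
HOLIDAYS = {date(2024, 12, 25)}


def is_monkey_time(now: datetime) -> bool:
    """Return True if a run would be allowed at `now` under the assumed rules."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    if now.date() in HOLIDAYS:
        return False
    # Half-open window: hourly runs at 9, 10, ..., 14 give six runs per day.
    return OPEN_HOUR <= now.hour < CLOSE_HOUR


def runs_per_day(frequency_hours: int = 1) -> int:
    """Number of runs that fit in the window at the given frequency."""
    return (CLOSE_HOUR - OPEN_HOUR) // frequency_hours
```

Changing `OPEN_HOUR`, `CLOSE_HOUR`, or `frequency_hours` in this sketch mirrors the idea that the timeframe and frequency are configurable.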
IMPORTANT PROPERTIES
Enabling Chaos Monkey
• Set simianarmy.chaos.enabled = true
• Set simianarmy.chaos.leashed = false
Probability of 1 instance being terminated per day per ASG
• simianarmy.chaos.ASG.probability = 1.0
Opt-in or Opt-out model
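To make the interaction between `simianarmy.chaos.enabled` and `simianarmy.chaos.leashed` concrete, here is a minimal Python sketch that reads properties text and reports the resulting mode. The helpers `parse_properties` and `chaos_mode` are hypothetical, the parser handles only simple `key = value` lines, and the fallback values used when a key is missing are assumptions, not documented defaults. A leashed monkey is treated here as a dry run that does not actually terminate instances.

```python
def parse_properties(text: str) -> dict:
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def chaos_mode(props: dict) -> str:
    """Return 'disabled', 'leashed' (dry run), or 'unleashed'."""
    # Fallbacks below are assumptions for this sketch: off and leashed.
    enabled = props.get("simianarmy.chaos.enabled", "false").lower() == "true"
    leashed = props.get("simianarmy.chaos.leashed", "true").lower() == "true"
    if not enabled:
        return "disabled"
    return "leashed" if leashed else "unleashed"


settings = parse_properties("""
# settings from the slide
simianarmy.chaos.enabled = true
simianarmy.chaos.leashed=false
""")
print(chaos_mode(settings))
```

With the two settings from the slide, the sketch reports the unleashed mode, meaning terminations would really happen.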
OPT-IN / OPT-OUT MODEL
simianarmy.chaos.ASG.enabled = false
• Set to false = Opt-in
• Set to true = Opt-out
When Opt-In (false), you must enable each auto scaling group you want Chaos Monkey to run in:
simianarmy.chaos.ASG.<auto scaling group name>.enabled = true
When Opt-Out (true), you must disable each auto scaling group you do not want it to run in:
simianarmy.chaos.ASG.<auto scaling group name>.enabled = false
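The opt-in and opt-out rules amount to a default with per-group overrides. The Python sketch below shows that lookup under stated assumptions: `is_group_enabled` is a hypothetical helper, the group names are invented, and the per-group key form `simianarmy.chaos.ASG.<group>.enabled` is an assumption about the key layout that should be checked against the Chaos Monkey configuration page listed in the links.

```python
def is_group_enabled(props: dict, group: str) -> bool:
    """Resolve whether Chaos Monkey would act on `group`.

    The global ASG flag supplies the default; a per-group key overrides it.
    """
    # Assumed fallback for this sketch: opt-in (false) when the key is absent.
    default = props.get("simianarmy.chaos.ASG.enabled", "false")
    value = props.get(f"simianarmy.chaos.ASG.{group}.enabled", default)
    return value.strip().lower() == "true"


# Opt-in model: nothing runs unless a group is explicitly enabled.
opt_in = {
    "simianarmy.chaos.ASG.enabled": "false",
    "simianarmy.chaos.ASG.web-frontend.enabled": "true",  # hypothetical group
}

# Opt-out model: everything runs unless a group is explicitly disabled.
opt_out = {
    "simianarmy.chaos.ASG.enabled": "true",
    "simianarmy.chaos.ASG.billing-db.enabled": "false",  # hypothetical group
}
```

In the opt-in dictionary only `web-frontend` resolves to enabled, while in the opt-out dictionary every group except `billing-db` does.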
EMAIL NOTIFICATIONS
ARE TERMINATIONS ALL IT CAN DO?
• Block all network traffic
• Detach all EBS volumes
SSH REQUIRED
• Burn CPU
• Burn IO
• Fill Disk
• Kill Processes
• Network Loss
• Null-Route (all EC2 <-> EC2 traffic)
• Fail DNS
• Fail EC2 API
• Fail S3 API
• Fail DynamoDB API
• Network Corruption
• Network Latency
LINKS
CloudFormation Template: https://github.com/joehack3r/aws/blob/master/cloudformation/templates/chaosMonkey.json
Chaos Monkey Announcement: http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
Simian Army Quick Start Guide: https://github.com/Netflix/SimianArmy/wiki/Quick-Start-Guide
Chaos Monkey Configuration: https://github.com/Netflix/SimianArmy/wiki/Chaos-Settings
Chaos Monkey Army: https://github.com/Netflix/SimianArmy/wiki/The-Chaos-Monkey-Army