Intro to Netflix's Chaos Monkey

NETFLIX’S CHAOS MONKEY

Michael Whitehead

“EVERYTHING FAILS ALL THE TIME”

- WERNER VOGELS

CHAOS MONKEY

A service that causes failure and wreaks havoc on instances in Auto Scaling Groups

A member of the Simian Army developed by Netflix

WHY WOULD WE INTENTIONALLY CAUSE FAILURE?!?

• It is inevitable
• Infrastructure is complex
• Forcing failure puts you in control
• Identify faults in your architecture
  • Do your load balancers reroute traffic correctly?
  • Do your instances function correctly when they come back up?
  • Are your monitoring tools alerting you on important events?

GETTING STARTED WITH CHAOS MONKEY

• Amazon Web Services
• Must be using Auto Scaling Groups
• Uses Amazon SimpleDB for event storage
• Simple Email Service setup (optional, for notifications)
• Can be used with Netflix’s Asgard (optional)
• Java 7 JDK or newer

EXAMPLE WITH CLOUDFORMATION

BUILDING & CONFIGURATION

• Clone the SimianArmy repo from GitHub
• Builds using Gradle
• Runs 6 times a day during business hours, 9am to 3pm
• Does not run on holidays or weekends
• Timeframes and frequency of runs can be configured
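The scheduling rules above can be modelled in a few lines. This is a minimal sketch of the stated behaviour only, not SimianArmy's own code: the names `OPEN_HOUR`, `CLOSE_HOUR`, `is_monkey_time`, and `runs_per_day` are hypothetical, and it assumes one run per hour starting at 9am with the last run before 3pm.

```python
from datetime import date, datetime

# Hypothetical constants mirroring the slide: hourly runs from 9am, none at or after 3pm.
OPEN_HOUR = 9
CLOSE_HOUR = 15


def is_monkey_time(now, holidays=frozenset()):
    """Return True if a run would be allowed at `now` under the rules on the slide."""
    if now.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return False
    if now.date() in holidays:  # caller supplies its own holiday calendar
        return False
    return OPEN_HOUR <= now.hour < CLOSE_HOUR


def runs_per_day():
    """One run per hour inside the window gives the 6 daily runs mentioned above."""
    return CLOSE_HOUR - OPEN_HOUR


if __name__ == "__main__":
    print(is_monkey_time(datetime(2024, 1, 8, 10, 0)))  # a Monday morning
    print(runs_per_day())
```

Passing a set of `date` objects as `holidays` is one way to express the "does not run on holidays" rule; the real tool ships its own calendar settings, which should be checked in the repository's properties files.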

IMPORTANT PROPERTIES

Enabling Chaos Monkey

• Set simianarmy.chaos.enabled = true
• Set simianarmy.chaos.leashed = false
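How the two switches combine can be sketched as follows. This is an illustration, not the tool's implementation: it assumes that a leashed monkey only logs what it would have done, and the helper names `parse_properties` and `chaos_mode`, the mode labels, and the conservative fallbacks for missing keys are all hypothetical.

```python
def parse_properties(text):
    """Parse simple `key = value` lines, ignoring blanks and `#` comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def chaos_mode(props):
    """Return 'disabled', 'dry-run', or 'live' for the given properties."""
    # Assumed fallbacks when a key is missing: not enabled, and leashed.
    enabled = props.get("simianarmy.chaos.enabled", "false").lower() == "true"
    leashed = props.get("simianarmy.chaos.leashed", "true").lower() == "true"
    if not enabled:
        return "disabled"
    return "dry-run" if leashed else "live"


if __name__ == "__main__":
    config = """
    simianarmy.chaos.enabled = true
    simianarmy.chaos.leashed=false
    """
    print(chaos_mode(parse_properties(config)))
```

Both spacing styles from the slide (`key = value` and `key=value`) parse to the same result, so either form works in this sketch.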

Probability of 1 instance being terminated per day per ASG

simianarmy.chaos.ASG.probability = 1.0
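One way to reason about a per-day probability when the monkey runs several times a day is to spread it evenly across the runs. The sketch below works through that arithmetic under this assumption; the even split and the function names are illustrative and should be checked against the SimianArmy source before relying on them.

```python
def per_run_probability(daily_probability, runs_per_day=6):
    """Assumed model: the daily probability is divided evenly across the runs."""
    return daily_probability / runs_per_day


def expected_terminations_per_day(daily_probability, runs_per_day=6):
    """Expected number of terminations per ASG per day under the even split."""
    return per_run_probability(daily_probability, runs_per_day) * runs_per_day


def chance_of_at_least_one(daily_probability, runs_per_day=6):
    """Chance that at least one run in the day terminates an instance."""
    p = per_run_probability(daily_probability, runs_per_day)
    return 1 - (1 - p) ** runs_per_day


if __name__ == "__main__":
    print(per_run_probability(1.0))
    print(expected_terminations_per_day(1.0))
    print(round(chance_of_at_least_one(1.0), 3))
```

Under this model a setting of 1.0 gives one expected termination per ASG per day, while any single day still has a chance of seeing none.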

Opt-in or Opt-out model

OPT-IN / OPT-OUT MODEL

simianarmy.chaos.ASG.enabled = false

• Set to false = Opt-in
• Set to true = Opt-out

When Opt-In (false) you must enable each auto scaling group you want to run Chaos Monkey in

simianarmy.chaos.<<auto scaling group name>>.enabled = true

When Opt-Out (true) you must disable each auto scaling group you do not want it to run in

simianarmy.chaos.<<auto scaling group name>>.enabled = false
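The opt-in and opt-out rules amount to a default with per-group overrides. A minimal sketch of that lookup follows; the function `group_enabled` and the `key_pattern` parameter are hypothetical, and the per-group key follows the form shown on the slides, so the exact key layout should be confirmed against the chaos.properties file shipped with SimianArmy.

```python
DEFAULT_KEY = "simianarmy.chaos.ASG.enabled"


def group_enabled(props, group_name, key_pattern="simianarmy.chaos.{name}.enabled"):
    """Return True if Chaos Monkey should act on `group_name`.

    The per-group setting wins when present; otherwise the ASG-wide default applies.
    `key_pattern` copies the key form from the slides and is an assumption here.
    """
    default = props.get(DEFAULT_KEY, "false")  # assumed fallback: opt-in model
    value = props.get(key_pattern.format(name=group_name), default)
    return value.strip().lower() == "true"


if __name__ == "__main__":
    opt_in = {
        "simianarmy.chaos.ASG.enabled": "false",
        "simianarmy.chaos.web-asg.enabled": "true",  # hypothetical group name
    }
    print(group_enabled(opt_in, "web-asg"))
    print(group_enabled(opt_in, "batch-asg"))
```

In the opt-in example only `web-asg` is affected; flipping the default to `true` and setting a group's key to `false` gives the opt-out behaviour.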

EMAIL NOTIFICATIONS

ARE TERMINATIONS ALL IT CAN DO?

• Block all network traffic
• Detach all EBS volumes

SSH REQUIRED

• Burn CPU
• Burn IO
• Fill Disk
• Kill Processes
• Network Loss
• Null-Route
  • All EC2 <-> EC2 traffic
• Fail DNS
• Fail EC2 API
• Fail S3 API
• Fail DynamoDB API
• Network Corruption
• Network Latency