NETFLIX’S CHAOS MONKEY Michael Whitehead

Intro to Netflix's Chaos Monkey


“EVERYTHING FAILS ALL THE TIME”

- WERNER VOGELS

CHAOS MONKEY

A service that causes failure and wreaks havoc on instances in Auto Scaling Groups

A member of the Simian Army developed by Netflix

WHY WOULD WE INTENTIONALLY CAUSE FAILURE?!?

• It is inevitable
• Infrastructure is complex
• Forcing failure puts you in control
• Identify faults in your architecture
  • Do your load balancers reroute traffic correctly?
  • Do your instances function correctly when they come back up?
  • Are your monitoring tools alerting you on important events?

GETTING STARTED WITH CHAOS MONKEY

• Amazon Web Services
• Must be using Auto Scaling Groups
• Uses Amazon SimpleDB for event storage
• Simple Email Service setup (optional, for notifications)
• Can be used with Netflix’s Asgard (optional)
• Java 7 JDK or newer

EXAMPLE WITH CLOUDFORMATION

WOW! NEAT! AWESOME! COOL! NO WAY!

BUILDING & CONFIGURATION

• Clone the SimianArmy repo from GitHub
• Builds using Gradle
• Runs 6 times a day during business hours (9am to 3pm)
• Does not run on holidays or weekends
• Timeframes and frequency of runs can be configured
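The default schedule above can be sketched in a few lines of Python. This is only an illustration of the rule on the slide, not Chaos Monkey's own code: the names `is_monkey_time` and `run_times` are hypothetical, the holiday set is supplied by the caller, and the even hourly spacing of the 6 runs is an assumption.

```python
from datetime import date, datetime, time

# Window taken from the slide: 9am to 3pm, weekdays only.
OPEN = time(9, 0)
CLOSE = time(15, 0)


def is_monkey_time(now, holidays=frozenset()):
    """Return True if `now` falls inside the window in which runs are allowed."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    if now.date() in holidays:  # holidays: a set of datetime.date values
        return False
    return OPEN <= now.time() < CLOSE


def run_times(day, runs_per_day=6):
    """Spread the runs evenly across the window (assumed spacing)."""
    start = datetime.combine(day, OPEN)
    end = datetime.combine(day, CLOSE)
    step = (end - start) / runs_per_day
    return [start + i * step for i in range(runs_per_day)]
```

With the defaults, `run_times` yields one run per hour starting at 9am, and `is_monkey_time` rejects weekends, listed holidays, and anything from 3pm onward.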

IMPORTANT PROPERTIES

Enabling Chaos Monkey

• Set simianarmy.chaos.enabled = true
• Set simianarmy.chaos.leashed = false

Probability of 1 instance being terminated per day, per ASG

simianarmy.chaos.ASG.probability = 1.0
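One way to read a per-day probability when the monkey runs several times a day is to split it evenly across the runs. The Python sketch below works under that assumption only; the slides give just the daily figure, and the function names are hypothetical. It also places no cap on terminations per day.

```python
import random


def per_run_probability(daily_probability, runs_per_day=6):
    """Assumed model: the daily probability is divided evenly across runs."""
    return daily_probability / runs_per_day


def simulate_day(daily_probability, runs_per_day=6, rng=None):
    """Count how many runs in one simulated day pick an instance to terminate."""
    rng = rng or random.Random()
    p = per_run_probability(daily_probability, runs_per_day)
    return sum(1 for _ in range(runs_per_day) if rng.random() < p)
```

Under this model a daily value of 1.0 with 6 runs gives each run a 1-in-6 chance, so the expected number of terminations per ASG per day is 1.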

Opt-in or Opt-out model

OPT-IN / OPT-OUT MODEL

simianarmy.chaos.ASG.enabled = false

• Set to false = Opt-in
• Set to true = Opt-out

When Opt-In (false), you must enable each auto scaling group you want Chaos Monkey to run in

simianarmy.chaos.ASG.<<auto scaling group name>>.enabled = true

When Opt-Out (true), you must disable each auto scaling group you do not want it to run in

simianarmy.chaos.ASG.<<auto scaling group name>>.enabled = false
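The opt-in / opt-out rule amounts to "a per-group setting wins, otherwise fall back to the global default". A minimal Python sketch of that lookup follows. The helper names are hypothetical, the properties are passed in as a plain dict of strings, and the per-group key form `simianarmy.chaos.ASG.<name>.enabled` is an assumption of this sketch.

```python
def parse_bool(value):
    """Interpret a properties-file string as a boolean."""
    return value.strip().lower() == "true"


def is_group_enabled(props, group_name):
    """Return True if Chaos Monkey should act on the named auto scaling group."""
    # Global default: "false" means opt-in, "true" means opt-out.
    default = parse_bool(props.get("simianarmy.chaos.ASG.enabled", "false"))
    # A per-group override, if present, takes precedence (assumed key form).
    specific = props.get(f"simianarmy.chaos.ASG.{group_name}.enabled")
    if specific is None:
        return default
    return parse_bool(specific)
```

For example, with the global flag set to false, only groups that carry their own `enabled = true` line are targeted. With it set to true, every group is targeted except those explicitly set to false.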

EMAIL NOTIFICATIONS

ARE TERMINATIONS ALL IT CAN DO?

• Block all network traffic
• Detach all EBS volumes

SSH REQUIRED

• Burn CPU
• Burn IO
• Fill Disk
• Kill Processes
• Network Loss
• Null-Route
  • All EC2 <-> EC2 traffic
• Fail DNS
• Fail EC2 API
• Fail S3 API
• Fail DynamoDB API
• Network Corruption
• Network Latency