Winston Diagnostic and Remediation Engineering (DaRE) Vinay Shah & Jean-Sebastien Jeannotte

Winston - Netflix's event driven auto remediation and diagnostics tool

Page 1: Winston - Netflix's event driven auto remediation and diagnostics tool

Winston

Diagnostic and Remediation Engineering (DaRE)

Vinay Shah & Jean-Sebastien Jeannotte

Page 2: Winston - Netflix's event driven auto remediation and diagnostics tool

● Introduction

● Internals - How it works?

● Demo - See it in action!

● Learnings and challenges

● Metrics & Road ahead

● Additional resources

Topics

Page 3: Winston - Netflix's event driven auto remediation and diagnostics tool

Introduction

Page 4: Winston - Netflix's event driven auto remediation and diagnostics tool

Landscape

Operational load vs. new features

Scale and Growth

Availability

Page 5: Winston - Netflix's event driven auto remediation and diagnostics tool

Application or Service

Monitoring

Alerting

PagerDuty / Email / Winston

Page 6: Winston - Netflix's event driven auto remediation and diagnostics tool

● Reduce MTTR

● Reduce risk of human errors

● Reduce pager fatigue, provide tier 1 support

● Don’t worry about infrastructure, focus on your business logic

● Best practice for runbook lifecycle management

Business goals

Page 7: Winston - Netflix's event driven auto remediation and diagnostics tool

Winston is an event driven runbook automation platform. It is designed to host and execute runbooks in response to operational events.
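
The idea of executing runbooks in response to operational events can be sketched roughly as follows. This is only an illustration of the event-driven pattern, not Winston's actual code: `Event`, `RunbookRegistry`, `restart_process`, and the `process_down` event name are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """An operational event, e.g. an alert raised by a monitoring system."""
    name: str
    payload: Dict[str, str] = field(default_factory=dict)


# A runbook is modelled here as a plain function that takes the event
# and returns a short human-readable result.
Runbook = Callable[[Event], str]


class RunbookRegistry:
    """Hypothetical registry that maps event names to hosted runbooks."""

    def __init__(self) -> None:
        self._runbooks: Dict[str, List[Runbook]] = {}

    def register(self, event_name: str, runbook: Runbook) -> None:
        self._runbooks.setdefault(event_name, []).append(runbook)

    def dispatch(self, event: Event) -> List[str]:
        """Run every runbook registered for the event, in registration order."""
        return [runbook(event) for runbook in self._runbooks.get(event.name, [])]


def restart_process(event: Event) -> str:
    # Placeholder: a real runbook would act on the affected host.
    return f"restart requested on {event.payload.get('host', 'unknown')}"


registry = RunbookRegistry()
registry.register("process_down", restart_process)
results = registry.dispatch(Event("process_down", {"host": "node-1"}))
unmatched = registry.dispatch(Event("unknown_event"))
```

An event with no registered runbook simply produces no results, which keeps unrelated alerts from triggering anything.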

Page 8: Winston - Netflix's event driven auto remediation and diagnostics tool

Internals

Page 9: Winston - Netflix's event driven auto remediation and diagnostics tool

How is it deployed?

Page 10: Winston - Netflix's event driven auto remediation and diagnostics tool

Execution Flow

Page 11: Winston - Netflix's event driven auto remediation and diagnostics tool

● One stop portal for all things Winston

● Supports Create, Read, Update, Delete, Execute and Diagnose functionality

● Implements best practices

○ Compliance/Auditing

○ Persistence

○ Security (Authentication/Authorization)

● Self serve & scalable

Winston Studio

Page 12: Winston - Netflix's event driven auto remediation and diagnostics tool

● Pack

A group of related automations, typically organized around a discrete service or product

● Action

Set of steps to help with diagnostics or remediations written as code

● Event & event source

External services that are the source of events that trigger a runbook

Terminology
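
One way to picture how these three terms relate is the small data model below. It is a sketch under assumptions, not Winston's or StackStorm's real schema: the class names mirror the slide's vocabulary, while the fields (`source`, `kind`, `payload`, `triggers`) and the Kafka example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Event:
    """Something an external event source reported."""
    source: str   # hypothetical source name, e.g. an alerting system
    kind: str     # hypothetical event type, e.g. "disk_full"
    payload: Dict[str, str] = field(default_factory=dict)


@dataclass
class Action:
    """Diagnostic or remediation steps written as code."""
    name: str
    run: Callable[[Event], str]


@dataclass
class Pack:
    """Related automations grouped around one service or product."""
    service: str
    actions: Dict[str, Action] = field(default_factory=dict)
    # Maps an event kind to the name of the action that handles it.
    triggers: Dict[str, str] = field(default_factory=dict)

    def add(self, action: Action, on_event: str) -> None:
        self.actions[action.name] = action
        self.triggers[on_event] = action.name

    def handle(self, event: Event) -> Optional[str]:
        """Run the action bound to this event kind, or return None."""
        action_name = self.triggers.get(event.kind)
        if action_name is None:
            return None
        return self.actions[action_name].run(event)


kafka_pack = Pack(service="kafka")
kafka_pack.add(
    Action("restart_broker", lambda e: f"restart {e.payload['broker']}"),
    on_event="broker_unresponsive",
)
outcome = kafka_pack.handle(
    Event(source="alerting", kind="broker_unresponsive", payload={"broker": "b-7"})
)
ignored = kafka_pack.handle(Event(source="alerting", kind="unrelated"))
```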

Page 13: Winston - Netflix's event driven auto remediation and diagnostics tool

Demo

Page 15: Winston - Netflix's event driven auto remediation and diagnostics tool

● False positives

○ Cassandra ring health

● Diagnostics - correlation can point towards causation, e.g.:

○ Querying Chronos events

○ Querying dependencies upstream and downstream for anomalous behaviour

● Remediation

○ Clean up disk space

○ Restart Kafka process

Sample use cases
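
As an illustration of the "clean up disk space" remediation, a runbook step might look something like the sketch below. The slides do not describe how Winston performs the cleanup, so the age-based policy, the function name `clean_old_files`, and the one-day threshold are all assumptions made for the example.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def clean_old_files(directory: str, max_age_seconds: float,
                    now: Optional[float] = None) -> List[str]:
    """Delete regular files in `directory` older than `max_age_seconds`.

    Returns the sorted names of removed files so the run can be audited.
    Only the top level of the directory is scanned; symlinks are skipped.
    """
    current = time.time() if now is None else now
    removed = []
    for entry in Path(directory).iterdir():
        if entry.is_symlink() or not entry.is_file():
            continue
        age = current - entry.stat().st_mtime
        if age > max_age_seconds:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)


# Usage on a throwaway directory: one stale file, one fresh file.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old.log"
    new = Path(tmp) / "new.log"
    old.write_text("stale")
    new.write_text("fresh")
    # Backdate old.log by two days.
    two_days_ago = time.time() - 2 * 86400
    os.utime(old, (two_days_ago, two_days_ago))
    removed = clean_old_files(tmp, max_age_seconds=86400)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

Returning the list of removed files, rather than deleting silently, fits the auditing emphasis elsewhere in the deck.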

Page 16: Winston - Netflix's event driven auto remediation and diagnostics tool

Learnings & challenges

Page 17: Winston - Netflix's event driven auto remediation and diagnostics tool

Common patterns

Page 18: Winston - Netflix's event driven auto remediation and diagnostics tool

● Usage

○ Culture of automating the manual and repeatable

○ Noisy signals become more interesting

○ The less the control, the more the opportunity

● Product

○ Safety is crucial

○ Usability is important

○ Resiliency

Insights

Page 19: Winston - Netflix's event driven auto remediation and diagnostics tool

● Don’t reinvent the wheel

● Start simple and iterate

● Allow experimentation

● Pay special attention to the usability of your product

● Push for changing the culture - usage will follow

● Talk to us/others who have gone through some of the pains and learnings

Recommendations to get started

Page 20: Winston - Netflix's event driven auto remediation and diagnostics tool

Metrics and Road ahead

Page 21: Winston - Netflix's event driven auto remediation and diagnostics tool

● Adoption. Adoption. Adoption.

● Usability

○ Polyglot support (Groovy based actions)

○ Deeper Integrations

● Safety

○ Resource isolation (Containers)

○ Rate limiting

The road ahead
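
The slides list rate limiting as a planned safety feature without saying how it would work. One common approach is a token bucket, sketched below purely as an assumption: `TokenBucket`, its parameters, and the injected clock are hypothetical and not taken from Winston.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow short bursts of executions while capping the sustained rate."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        """Return True and consume a token if an execution may proceed."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


# A fake clock makes the behaviour deterministic for the example.
fake_now = [0.0]
bucket = TokenBucket(capacity=2, refill_per_second=1.0,
                     clock=lambda: fake_now[0])
first_burst = [bucket.allow() for _ in range(3)]
fake_now[0] = 1.0  # one second later, one token has been refilled
after_refill = bucket.allow()
```

With a limiter like this in front of runbook execution, a storm of identical alerts would trigger at most a bounded number of remediations per time window.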

Page 22: Winston - Netflix's event driven auto remediation and diagnostics tool

● Introducing Winston: http://techblog.netflix.com/2016/08/introducing-winston-event-driven.html

● Stackstorm: https://docs.stackstorm.com/

● Reach out: [email protected] or [email protected]

We are hiring

Senior Software Engineer - https://jobs.netflix.com/jobs/860752

Links & resources

Page 23: Winston - Netflix's event driven auto remediation and diagnostics tool

Thank you.