Winston - Netflix's event driven auto remediation and diagnostics tool

Winston

Diagnostic and Remediation Engineering (DaRE)

Vinay Shah & Jean-Sebastien Jeannotte

● Introduction

● Internals - How does it work?

● Demo - See it in action!

● Learnings and challenges

● Metrics & Road ahead

● Additional resources

Topics

Introduction

Landscape

Operational load vs. new features

Scale and Growth

Availability

Application or Service

Monitoring

Alerting

PagerDuty / Email / Winston

● Reduce MTTR

● Reduce risk of human errors

● Reduce pager fatigue, provide tier 1 support

● Don’t worry about infrastructure, focus on your business logic

● Best practice for runbook lifecycle management

Business goals

Winston is an event driven runbook automation platform. It is designed to host and execute runbooks in response to operational events.

Internals

How is it deployed?

Execution Flow

● One stop portal for all things Winston

● Supports Create, Read, Update, Delete, Execute and Diagnose functionality

● Implements best practices

○ Compliance/Auditing

○ Persistence

○ Security (Authentication/Authorization)

● Self serve & scalable

Winston Studio

● Pack

A group of related automations, typically organized around a discrete service or product

● Action

Set of steps to help with diagnostics or remediations written as code

● Event & event source

External services that are the source of events that trigger a runbook

Terminology
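The three terms above can be related in a small Python sketch. This is an illustration only, not Winston's or StackStorm's actual API: the class names, fields, and the `register`/`handle` methods are all hypothetical, chosen to show how an event from an event source could be routed to the actions of a pack.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """An operational event emitted by an event source (hypothetical shape)."""
    source: str                      # e.g. an alerting system
    name: str                        # e.g. "process_down"
    payload: dict = field(default_factory=dict)


@dataclass
class Action:
    """A set of diagnostic or remediation steps written as code."""
    name: str
    run: Callable[[Event], str]


@dataclass
class Pack:
    """A group of related actions organized around one service."""
    service: str
    actions: Dict[str, Action] = field(default_factory=dict)
    # Maps an event name to the names of the actions it should trigger.
    rules: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, event_name: str, action: Action) -> None:
        self.actions[action.name] = action
        self.rules.setdefault(event_name, []).append(action.name)

    def handle(self, event: Event) -> List[str]:
        """Run every action registered for this event and collect the results."""
        return [self.actions[n].run(event) for n in self.rules.get(event.name, [])]


# Assumed example: a Kafka pack whose action only reports what it would do.
kafka_pack = Pack(service="kafka")
kafka_pack.register(
    "process_down",
    Action("restart_kafka", lambda e: f"would restart kafka on {e.payload['host']}"),
)
results = kafka_pack.handle(Event("alerting", "process_down", {"host": "host-1"}))
```

Events with no registered action return an empty list, so unknown signals are ignored rather than raising an error.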

Demo

● False positives

○ Cassandra ring health

● Diagnostics - correlation can point towards causation, e.g.:

○ Querying Chronos events

○ Querying dependencies upstream and downstream for anomalous behaviour

● Remediation

○ Clean up disk space

○ Restart Kafka process

Sample use cases
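As an illustration of the "clean up disk space" use case, here is a minimal Python sketch of a remediation step. It is not taken from Winston: the function names, the age-based policy, and the dry-run default are assumptions made for the example.

```python
import os
import shutil
import time
from typing import List


def disk_used_fraction(path: str) -> float:
    """Diagnostic: fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def find_stale_files(directory: str, max_age_seconds: float) -> List[str]:
    """List regular files directly in `directory` older than `max_age_seconds`."""
    cutoff = time.time() - max_age_seconds
    stale = []
    for entry in os.scandir(directory):
        if (entry.is_file(follow_symlinks=False)
                and entry.stat(follow_symlinks=False).st_mtime < cutoff):
            stale.append(entry.path)
    return sorted(stale)


def clean_up(directory: str, max_age_seconds: float, dry_run: bool = True) -> List[str]:
    """Remediation: delete stale files; with dry_run=True only report them."""
    stale = find_stale_files(directory, max_age_seconds)
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale
```

Defaulting to a dry run is one way to reduce the risk of an automated runbook deleting the wrong files; a caller would pass `dry_run=False` only once the reported list looks right.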

Learnings & challenges

Common patterns

● Usage

○ Culture of automating the manual and repeatable

○ Noisy signals become more interesting

○ The lesser the control, the greater the opportunity

● Product

○ Safety is crucial

○ Usability is important

○ Resiliency

Insights

● Don’t reinvent the wheel

● Start simple and iterate

● Allow experimentation

● Pay special attention to the usability of your product

● Push for changing the culture - usage will follow

● Talk to us/others who have gone through some of the pains and learnings

Recommendations to get started

Metrics and Road ahead

● Adoption. Adoption. Adoption.

● Usability

○ Polyglot support (Groovy-based actions)

○ Deeper Integrations

● Safety

○ Resource isolation (Containers)

○ Rate limiting

The road ahead
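To make the rate limiting item concrete, here is a minimal Python sketch of a sliding-window limiter that could cap how often an automated action runs. The slides do not describe how Winston plans to implement this; the `RateLimiter` class, its parameters, and the injectable clock are assumptions for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque


class RateLimiter:
    """Allow at most `max_runs` executions within any `window_seconds` span."""

    def __init__(self, max_runs: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_runs = max_runs
        self.window_seconds = window_seconds
        self._clock = clock              # injectable so the limiter can be tested
        self._runs: Deque[float] = deque()

    def allow(self) -> bool:
        """Return True and record the run if it fits in the current window."""
        now = self._clock()
        # Forget runs that have aged out of the window.
        while self._runs and now - self._runs[0] >= self.window_seconds:
            self._runs.popleft()
        if len(self._runs) >= self.max_runs:
            return False
        self._runs.append(now)
        return True
```

A remediation wrapper could call `allow()` before executing and fall back to paging a human when it returns False, so a flapping alert cannot restart a process in a tight loop.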

● Introducing Winston: http://techblog.netflix.com/2016/08/introducing-winston-event-driven.html

● StackStorm: https://docs.stackstorm.com/

● Reach out: vshah@netflix.com or jjeannotte@netflix.com

We are hiring

Senior Software Engineer - https://jobs.netflix.com/jobs/860752

Links & resources

Thank you.
