Winston
Diagnostic and Remediation Engineering (DaRE)
Vinay Shah & Jean-Sebastien Jeannotte
● Introduction
● Internals - How does it work?
● Demo - See it in action!
● Learnings and challenges
● Metrics & Road ahead
● Additional resources
Topics
Introduction
Landscape
Operational load vs. new features
Scale and Growth Availability
Application or Service
Monitoring
Alerting
PagerDuty, Email, Winston
● Reduce MTTR
● Reduce risk of human errors
● Reduce pager fatigue, provide tier 1 support
● Don’t worry about infrastructure, focus on your business logic
● Best practices for runbook lifecycle management
Business goals
Winston is an event-driven runbook automation platform. It is designed to host and execute runbooks in response to operational events.
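As a rough illustration of what "event-driven runbook automation" means, the sketch below maps event types to runbook functions and runs the matching ones when an event arrives. It is not Winston's API: `RunbookRegistry`, `on`, `dispatch`, the `disk_full` event type and the event fields are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]            # assumed shape: {"type": ..., "host": ...}
Runbook = Callable[[Event], str]  # a runbook takes an event, returns a summary


class RunbookRegistry:
    """Hypothetical registry that hosts runbooks and runs them on events."""

    def __init__(self) -> None:
        self._rules: Dict[str, List[Runbook]] = defaultdict(list)

    def on(self, event_type: str) -> Callable[[Runbook], Runbook]:
        """Decorator that registers a runbook for one event type."""
        def register(runbook: Runbook) -> Runbook:
            self._rules[event_type].append(runbook)
            return runbook
        return register

    def dispatch(self, event: Event) -> List[str]:
        """Run every runbook registered for the event's type."""
        return [runbook(event) for runbook in self._rules.get(event["type"], [])]


registry = RunbookRegistry()


@registry.on("disk_full")
def report_disk_usage(event: Event) -> str:
    # A real runbook would diagnose or remediate; this one only reports.
    return f"would inspect disk usage on {event['host']}"


results = registry.dispatch({"type": "disk_full", "host": "node-1"})
```

Events with no registered runbook simply produce an empty result list, so an unknown alert never triggers automation by accident.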
Internals
How is it deployed?
Execution Flow
● One stop portal for all things Winston
● Supports Create, Read, Update, Delete, Execute and Diagnose functionality
● Implements best practices
○ Compliance/Auditing
○ Persistence
○ Security (Authentication/Authorization)
● Self-serve & scalable
Winston Studio
● Pack
A group of related automations, typically organized around a discrete service or product
● Action
A set of steps, written as code, that helps with diagnostics or remediation
● Event & event source
External services that emit the events that trigger a runbook
Terminology
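One way to picture how these three terms relate is as a small data model. The classes below are an assumption made for illustration only; the field names (`service`, `triggers`, `run`) and the `actions_for` helper do not come from Winston.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    """Steps for diagnostics or remediation, written as code."""
    name: str
    run: Callable[[Dict[str, str]], str]


@dataclass
class EventSource:
    """An external service that emits events (for example an alerting system)."""
    name: str


@dataclass
class Pack:
    """Related automations grouped around one service or product."""
    service: str
    actions: List[Action] = field(default_factory=list)
    # Hypothetical wiring: event source name -> names of actions to run.
    triggers: Dict[str, List[str]] = field(default_factory=dict)

    def actions_for(self, source: EventSource) -> List[Action]:
        """Return the actions this pack runs for events from `source`."""
        wanted = self.triggers.get(source.name, [])
        return [action for action in self.actions if action.name in wanted]


restart = Action("restart_kafka", lambda event: "restart requested on " + event["host"])
pack = Pack(service="kafka", actions=[restart], triggers={"alerting": ["restart_kafka"]})
found = pack.actions_for(EventSource("alerting"))
```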
Demo
Winston Studio
DEMO
● False positives
○ Cassandra ring health
● Diagnostics: correlation can point towards causation, e.g.:
○ Querying Chronos events
○ Querying dependencies upstream and downstream for anomalous behaviour
● Remediation
○ Clean up disk space
○ Restart Kafka process
Sample use cases
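The "clean up disk space" remediation could look something like the sketch below, which removes files older than a given age from one directory. The function name, its parameters and the dry-run default are assumptions for this example, not the runbook shown in the talk.

```python
import time
from pathlib import Path
from typing import List


def cleanup_old_files(directory: Path, max_age_seconds: float,
                      dry_run: bool = True) -> List[Path]:
    """Delete regular files in `directory` older than `max_age_seconds`.

    Returns the matching paths. With dry_run=True (the default) nothing
    is deleted, so the runbook can be reviewed before it acts.
    """
    cutoff = time.time() - max_age_seconds
    matched: List[Path] = []
    for path in sorted(directory.iterdir()):
        # Only top-level regular files are considered; subdirectories are skipped.
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            matched.append(path)
    return matched
```

Defaulting to a dry run is a deliberate choice here: a remediation that deletes data should have to be switched on explicitly.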
Learnings & challenges
Common patterns
● Usage
○ Culture of automating the manual and repeatable
○ Noisy signals become more interesting
○ The less control you have, the greater the opportunity
● Product
○ Safety is crucial
○ Usability is important
○ Resiliency
Insights
● Don’t reinvent the wheel
● Start simple and iterate
● Allow experimentation
● Pay special attention to the usability of your product
● Push for changing the culture - usage will follow
● Talk to us/others who have gone through some of the pains and learnings
Recommendations to get started
Metrics and Road ahead
● Adoption. Adoption. Adoption.
● Usability
○ Polyglot support (Groovy-based actions)
○ Deeper Integrations
● Safety
○ Resource isolation (Containers)
○ Rate limiting
The road ahead
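To make the rate-limiting item concrete, here is a minimal sliding-window limiter that caps how often a runbook may execute. The slides do not describe a design, so the class name, parameters and the sliding-window approach are all assumptions.

```python
import time
from collections import deque
from typing import Callable, Deque


class ExecutionRateLimiter:
    """Allow at most `max_runs` executions per `window_seconds` (sliding window)."""

    def __init__(self, max_runs: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_runs = max_runs
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._runs: Deque[float] = deque()

    def allow(self) -> bool:
        """Record and permit a run if the window has capacity, else refuse."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._runs and now - self._runs[0] >= self.window_seconds:
            self._runs.popleft()
        if len(self._runs) >= self.max_runs:
            return False
        self._runs.append(now)
        return True
```

A limiter like this guards against a flapping alert restarting the same process over and over; a refused run is a natural point to fall back to paging a human.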
● Introducing Winston: http://techblog.netflix.com/2016/08/introducing-winston-event-driven.html
● Stackstorm: https://docs.stackstorm.com/
● Reach out: vshah@netflix.com or jjeannotte@netflix.com
We are hiring
Senior Software Engineer - https://jobs.netflix.com/jobs/860752
Links & resources
Thank you.