Practical Operability Techniques for Teams
Matthew Skelton, Conflux
confluxdigital.net / @ConfluxHQ
Continuous Lifecycle London, 17 May 2018
Practical operability
Why do we need a focus on operability?
5 practical operability techniques to try
Use modern logging, Run Book dialogue sheets, endpoint healthchecks, correlation IDs, and user personas as team collaboration techniques
About me
Matthew Skelton, Conflux
@matthewpskelton
matthewskelton.net
Leeds, UK
assemblyconf.com
Team Guide to Software Operability
Matthew Skelton & Rob Thatcher
skeltonthatcher.com/publications
Download a free sample chapter
You?
Software Developer
Tester / QA
DevOps Engineer
Team Leader
Head of Department
Operability
making software work well in Production
“But can’t we just give those things to a DevOp?”
“But can’t we just give those things to the DevOp?”
Operability is a shared concern
#BizDevTestSecOps
A self-managed Kubernetes cluster near you
Practical operability techniques
1. Modern logging
2. Run Book dialogue sheets
3. Endpoint healthchecks
4. Correlation IDs
5. Lightweight User Personas
Lack of observability for distributed systems
Logging with Event IDs
Modern logging w/ Event IDs
Distinct application states
No “logorrhoea” (!)
Distributed tracing via logs
Build a shared understanding
Search by event: Event ID in {Delivered, InTransit, Arrived}
Transaction trace: Correlation ID 612999958…
How many distinct event types (state transitions) in your application?
Represent distinct states
enum
Human-readable sets: unique values, sparse, immutable
C#, Java, Python, node (Ruby, PHP, …)
public enum EventID
{
    // Badly-initialised logging data
    NotSet = 0,
    // An unrecognised event has occurred
    UnexpectedError = 10000,
    ApplicationStarted = 20000,
    ApplicationShutdownNoticeReceived = 20001,
    MessageQueued = 40000,
    MessagePeeked = 40001,
    BasketItemAdded = 60001,
    BasketItemRemoved = 60002,
    CreditCardDetailsSubmitted = 70001,
    // ...
}
BasketItemAdded = 60001
BasketItemRemoved = 60002
Log using Event IDs with a modern ‘structured logging’ library
Example: https://github.com/EqualExperts/opslogger (Sean Reilly, @seanjreilly)
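As a rough illustration of enum-based Event IDs combined with structured logging, here is a minimal Python sketch using only the standard library. It is not the opslogger API. The logger name "shop", the JSON field names, and the helper functions are assumptions made for the example; the Event ID values are taken from the enum above.

```python
import io
import json
import logging
from enum import IntEnum, unique


@unique  # reject duplicate values so every Event ID stays distinct
class EventID(IntEnum):
    NOT_SET = 0
    UNEXPECTED_ERROR = 10000
    APPLICATION_STARTED = 20000
    BASKET_ITEM_ADDED = 60001
    BASKET_ITEM_REMOVED = 60002


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object (field names are assumptions)."""

    def format(self, record):
        event = getattr(record, "event", EventID.NOT_SET)
        return json.dumps({
            "level": record.levelname,
            "event_id": int(event),
            "event_name": event.name,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })


def make_logger(stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("shop")  # hypothetical application name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def log_event(logger, event, message, correlation_id=None):
    """Attach the Event ID and optional Correlation ID as structured fields."""
    logger.info(message, extra={"event": event, "correlation_id": correlation_id})


buffer = io.StringIO()
log = make_logger(buffer)
log_event(log, EventID.BASKET_ITEM_ADDED, "item added to basket", "348e1cf8")
record = json.loads(buffer.getvalue())
```

Because the Event ID is a separate numeric field rather than part of the message text, a log search tool can filter on `event_id` directly, and the same record can be joined to others through `correlation_id`.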
Example: video processing
On-demand processing of TV and mobile streaming adverts
Ad-agency → TV broadcaster
High throughput
Glitch-free video & audio
[Diagram: Upload, Queue, Worker Job, Storage I/O]
Example: video processing
Discover processing bottlenecks
Trigger alerts
Report on KPIs
Target areas for improvement
Modern logging w/ Event IDs
Reduce time to detect problems
Increase team engagement
Increase configurability
Enhance collaboration
Modern Logging: Collaborate on Event IDs and Correlation traces for better system awareness
Operational aspects not addressed, or addressed too late in the cycle
Run Book dialogue sheets
Checklists for typical operational considerations
Team-friendly exploration
Run Book dialogue sheets: runbooktemplate.info
System characteristics
Hours of operation
During what hours does the service or system actually need to operate? Can portions or features of the system be unavailable at times if needed?
Hours of operation - core features
(e.g. 03:00-01:00 GMT+0)
Hours of operation - secondary features
(e.g. 07:00-23:00 GMT+0)
Data and processing flows
How and where does data flow through the system? What controls or triggers data flows? (e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data)
…
Run Book dialogue sheets
Early discovery of operational requirements
Input to team backlog
“Shift-left” testing
Avoid operational problems
Run Book dialogue sheets: Collaborate on operational requirements for better system awareness
“Why has my deployment failed again?”
“Why is Pre-Prod always so flaky?”
Endpoint healthchecks
Simple HTTP check
Common way to assess any service/app/component
Key operational requirement
Every runnable app/service/daemon exposes /status/health
An HTTP GET to the endpoint returns:
200 – "I am healthy"
500 – "I am sick"
Each component is responsible for determining its own health and viability – this is very contextual
Use JSON as a response type – parsable by both machines and humans!
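A /status/health endpoint along these lines could be sketched in Python with the standard library alone. The check names, the JSON shape, and the default port are assumptions for illustration; a real component would decide for itself what "healthy" means.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_checks():
    """Hypothetical checks; each maps a check name to pass (True) or fail (False)."""
    return {"config_loaded": True, "queue_reachable": True}


def build_health_response(checks):
    """Map named pass/fail checks to an HTTP status code and a JSON body."""
    healthy = all(checks.values())
    body = json.dumps({"healthy": healthy, "checks": checks}, sort_keys=True)
    return (200 if healthy else 500), body


class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET /status/health with 200 or 500 plus JSON details."""

    def do_GET(self):
        if self.path != "/status/health":
            self.send_error(404)
            return
        status, body = build_health_response(run_checks())
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Run the healthcheck server; the port is an arbitrary choice for the sketch."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8080/status/health` should return the JSON body; any failing check turns the status into 500 while the body still says which check failed.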
For databases and other non-HTTP components, run a lightweight HTTP service in front of the component that returns 200 / 500 responses
Helper service
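The probe inside such a helper service might look like the sketch below. It uses SQLite from the Python standard library purely as a stand-in for a real database, and the function names and JSON fields are assumptions; the helper's HTTP layer would simply return the status and body produced here.

```python
import json
import sqlite3


def database_health(connect):
    """Run a trivial query and map the outcome to a 200 / 500 status and JSON body."""
    try:
        connection = connect()
        try:
            connection.execute("SELECT 1").fetchone()
        finally:
            connection.close()
        return 200, json.dumps({"healthy": True, "component": "database"})
    except sqlite3.Error as error:
        return 500, json.dumps(
            {"healthy": False, "component": "database", "error": str(error)}
        )


def working_connection():
    """Stand-in for a reachable database."""
    return sqlite3.connect(":memory:")


def broken_connection():
    """Stand-in for an unreachable database."""
    raise sqlite3.OperationalError("unable to open database file")
```

Keeping the probe as a plain function makes it easy to exercise both the healthy and the sick path without starting a server.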
Question:
What does this look like for Serverless?
¯\_(ツ)_/¯
Endpoint healthchecks
Rapid dashboarding and diagnosis
Reduce confusion around environment state
“Fail fast” aka “learn sooner”
Endpoint healthchecks: Collaborate on component health status for better system awareness
“Which containers [servers] handled the request?”
Correlation IDs
Unique-ish identifiers
Trace calls across machine & container boundaries
Re-assemble HTTP request/response later
‘Unique-ish’ identifier for each request
Passed through downstream layers
Synchronous HTTP:
X-HEADER, e.g. X-trace-id
X-trace-id: 348e1cf8
If header is present, pass it on
(Yes, RFC6648, but this is internal only)
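The "if header is present, pass it on" rule can be sketched as a small Python helper. The function names and the 8-character ID length are assumptions chosen to match the example value above; header names are compared case-insensitively because HTTP header names are case-insensitive.

```python
import uuid

TRACE_HEADER = "X-trace-id"


def find_trace_id(headers):
    """Return the incoming trace ID, matching the header name case-insensitively."""
    for name, value in headers.items():
        if name.lower() == TRACE_HEADER.lower():
            return value
    return None


def outgoing_headers(incoming_headers):
    """Headers to attach to every downstream call made while handling a request."""
    # Reuse the caller's ID when present; otherwise start a new 'unique-ish' one.
    trace_id = find_trace_id(incoming_headers) or uuid.uuid4().hex[:8]
    return {TRACE_HEADER: trace_id}
```

For example, `outgoing_headers({"x-trace-id": "348e1cf8"})` carries the same ID forward, while a request with no trace header gets a freshly generated one at the edge of the system.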
Asynchronous (queues, etc.):
Message Attributes, name:value pair
e.g. "trace-id":"348e1cf8"
AWS SQS: SendMessage() / ReceiveMessage()
Log the Correlation ID if present
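The same idea for queues, sketched with Python's in-memory `queue.Queue` standing in for a real broker such as SQS. The message shape, function names, and log line format are assumptions for illustration and do not reflect the SQS API.

```python
import queue
import uuid


def send_message(q, body, trace_id=None):
    """Enqueue a message, carrying the trace ID as a name:value attribute."""
    attributes = {"trace-id": trace_id or uuid.uuid4().hex[:8]}
    q.put({"body": body, "attributes": attributes})


def receive_message(q, log_lines):
    """Dequeue one message and log the Correlation ID if it is present."""
    message = q.get_nowait()
    trace_id = message["attributes"].get("trace-id")
    if trace_id is not None:
        log_lines.append(f"trace-id={trace_id} received {message['body']}")
    else:
        log_lines.append(f"received {message['body']}")
    return message
```

A worker that makes further calls while processing the message would pass the same `trace-id` on, so the synchronous and asynchronous parts of a transaction share one identifier.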
Example: OpenTracing / PCF
3 tracing elements: TraceID, SpanID, ParentSpan
"X-B3-TraceId" "X-B3-SpanId" "X-B3-ParentSpanId"
Always log the TraceID as-is
Log calling SpanID as ParentSpan
Log new SpanID
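These three rules can be sketched as a function that derives the headers for a downstream call from the caller's headers. The function names are assumptions; the header names follow the B3 propagation format, where the parent header is spelled `X-B3-ParentSpanId`, and IDs are generated here as 16 hex characters.

```python
import secrets


def new_id():
    """A 64-bit identifier rendered as 16 lower-case hex characters."""
    return secrets.token_hex(8)


def child_span_headers(incoming):
    """Build headers for a downstream call from the caller's B3 headers."""
    # Rule 1: keep the TraceID as-is (start a new trace only if none arrived).
    trace_id = incoming.get("X-B3-TraceId") or new_id()
    # Rule 3: every hop gets a new SpanID.
    headers = {"X-B3-TraceId": trace_id, "X-B3-SpanId": new_id()}
    # Rule 2: the caller's SpanID becomes this hop's ParentSpan.
    caller_span = incoming.get("X-B3-SpanId")
    if caller_span:
        headers["X-B3-ParentSpanId"] = caller_span
    return headers
```

Logging all three values on each hop is what lets a trace be reassembled later as a tree: the TraceID groups the records, and each ParentSpan points at the span that made the call.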
[Diagram: Trace, Span, ParentSpan]
Correlation IDs
Detect bottlenecks and unexpected interactions
Increase transparency
Learn about the system
Correlation IDs: Collaborate on distributed tracing for better system awareness
Software is difficult to operate: poor UX for Ops.
Lightweight user personas
Simple characterisation of user needs for Dev/Test/Ops
Based on full UX user personas but less detailed
Lightweight user personas:
Ops Engineer
Test Engineer
Build & Deployment Engineer
Service Owner
Consider the User Experience (UX) of engineers and team members using and working with the software
http://www.keepitusable.com/blog/?tag=alan-cooper
Motivations
Goals
Frustrations
What data does the User Persona need visible on a dashboard in order to make decisions rapidly & safely?
https://www.geckoboard.com/blog/visualisation-upgrades-progressing-towards-a-more-useful-and-beautiful-dashboard/
Lightweight User Personas
Empathise better with people from other roles
Capture missing operational requirements
Lightweight User Personas: Collaborate on user needs for better system awareness
Summary
Operability
making software work well in Production
Lack of observability
Operational aspects not known
“Why has deployment failed?”
What handled the request?
Poor UX for Ops
Logging with Event IDs
Use enum-based Event IDs to explore runtime behaviour and fault conditions
Run Book dialogue sheets
Explore and establish operational requirements as a team, around a physical table, together
Endpoint healthchecks
HTTP 200 / 500 responses to /status/health call with JSON details – good for tools and humans
Correlation IDs
Trace execution using correlation IDs: synchronous (HTTP X-trace-id), async (SQS MessageAttribute)
Lightweight user personas
Explore the UX and needs of different roles for rapid decisions via dashboards
Use modern logging, Run Book dialogue sheets, endpoint healthchecks, correlation IDs, and user personas as team collaboration techniques
Resources
• Team Guide to Software Operability by Matthew Skelton and Rob Thatcher (Skelton Thatcher Publications, 2016) http://operabilitybook.com/
• Run Book template & Run Book dialogue sheets http://runbooktemplate.info/
thank you
confluxdigital.net / @ConfluxHQ