Upload
rob-harrop
View
377
Download
0
Embed Size (px)
Citation preview
The performance of a distributed system is the combined performance of
its collaborating services and their communication links
Who am I?
▸ CTO @ Skipjaq▸ ML-driven performance optimisation
▸ Co-founder of SpringSource
▸ Once upon a time I…▸ Contributed to Spring Framework
▸ Wrote a book about Spring
▸ Talked a lot about Spring
Who am I?
▸ I’m on Twitter: ▸ @robertharrop
▸ I’m on Github: ▸ github.com/robharrop
▸ I write about maths and performance ▸ https://robharrop.github.io
If you have questions after the session, {grab, tweet} me.
After this talk you will know how to:
▸ Measure performance correctly
▸ Find potential performance disasters
▸ Identify the best candidates for optimisation
▸ Model complex micro services systems
▸ Forecast system scalability
How fast can I do a thing, and
how many things can I do every period?
What do we mean by performance?
Throughput
▸ The rate of processing: x per y
▸ Requests per second
▸ Records per minute
▸ Messages per second
▸ Tasks per day
Latency
▸ Time taken… for something▸ Service time?
▸ First byte?
▸ First response complete?
▸ Last byte?
▸ Render?
▸ Moral of the story: define what you mean by latency
Crib Sheet
▸ Record timestamped requests with observed latency and success/error
▸ Throughput▸ Min, max, mean
▸ Varying time windows (10s, 30s, 1m, 5m, …)
▸ Latency▸ Min, max, 95th, 99th, 99.9th and other tail percentiles
▸ Mean just means meaningless
We need to talk about latency
▸ Latency isn’t exponentially-distributed▸ And it certainly isn’t normally-distributed
▸ Latency distributions have heavy tails
▸ Latency distributions are multi-modal
▸ Customers see tail latencies way more than you think▸ Don’t let percentiles trick you
▸ Understand what latency means to your business
What can we conclude?
▸ Latency and throughput work in tandem
▸ At high throughput, latency degrades considerably
Service latencies stack
▸ For simple cases (feed-forward networks), latencies are additive▸ Analytical models are available
▸ http://robharrop.github.io/maths/performance/2016/03/15/queue-networks.html
▸ For most interesting cases this cannot be assumed▸ Simulation is the best option▸ Pretty Damn Quick (PDQ) is a great tool, but requires a chunk of effort▸ Guesstimate is great for quick and dirty models
Amdahl’s Law
▸ Theoretical improvement in latency given a fixed workload
Theoretical max system speedup
Speedup of part under optimisation
Percentage of execution time in part under optimisation
Amdahl’s Law in the Limit
▸ Theoretical max is limited by parts of the system not under improvement
Theoretical max system speedup Percentage of execution time in part
under optimisation
Speedup of part under optimisation
Summary
▸ Measurements are critical▸ Garbage in, garbage out
▸ Monitor utilisation for early-warning of disaster▸ Little’s Law
▸ Monitor latency per-user, not just per-request
▸ Select optimisation targets carefully▸ Amdahl’s Law
▸ Monitor crosstalk to forecast scalability▸ Universal Scalability Law