Engineer’s guide to Data Analysis
Avishai Ish-Shalom
github.com/avishai-ish-shalom@[email protected]
Wix in numbers
~400 engineers, ~1400 employees
~100M sites
~250 microservices
IaaS (Insult as a Service)
▪ Thin API, written in Flask (Python)
▪ CouchDB
▪ Apache proxy
▪ StatsD, Graphite, ELK
Architecture
StatsD
Graphite
▪ Metrics collector, storage and UI
▪ Math functions
▪ Common
▪ De-facto standard
Oops, I think something is broken
What is this “metric” you speak of?
A metric is:
▪ Numeric data
▪ Often with a timestamp (time series)
▪ A “measurement” of something
▪ Discrete
Where do metrics come from?
▪Events with numeric data
▪Counting/aggregating
▪Sampling
Sampling
Events
▪ Data about something that happened
▪ Timestamp (time series data)
▪ Has properties, numeric and non-numeric

{
  "timestamp": "2016-11-15T18:43:39+00:00",
  "host": "test01.example.net",
  "status": "ok",
  "latency": 14.31
}
How much data?
10,000 events/sec × 0.5 KB/event ≈ 400 GB a day
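The slide’s back-of-the-envelope arithmetic can be checked in a few lines (the event rate and size are the slide’s own figures):

```python
# Reproduce the slide's data-volume estimate.
events_per_sec = 10_000
bytes_per_event = 0.5 * 1024          # 0.5 KB per event
seconds_per_day = 86_400

bytes_per_day = events_per_sec * bytes_per_event * seconds_per_day
gb_per_day = bytes_per_day / 1024**3  # bytes -> GiB

print(f"{gb_per_day:.0f} GB/day")     # roughly 400 GB a day
```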
Telemetry is a big data problem
Aggregates are lossy compression
We must decide in advance how we’ll use the metric
Aggregates
▪Max, Min, Sum, Average, etc
▪Last, random point
▪Percentiles (quantiles)
▪ Histograms, reverse quantiles
▪Each is suitable for a particular use case
Averages are mean to me
Percentiles
p99: the sampled value that is larger than 99% of samples
▪ O(n) memory complexity
▪ O(n log n) computation complexity
▪ Shortcuts exist for p50 (median), p100 (max), p0 (min)
Use when clients experience individual values
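A minimal nearest-rank percentile, matching the slide’s definition and complexity bounds (the sort is the O(n log n) part; the sample data is made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the sampled value larger than p% of samples."""
    ordered = sorted(samples)                 # O(n log n) sort, O(n) memory
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank of the percentile
    return ordered[max(rank - 1, 0)]

# Hypothetical latency samples in ms, with a long tail.
latencies = [12, 15, 14, 200, 13, 16, 14, 15, 13, 500]
print(percentile(latencies, 99))  # the worst value a client actually saw
print(percentile(latencies, 50))  # the median
```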
Percentiles
▪ Percentiles are not additive
▪ You cannot average percentiles
Example:
s1 (100 points) = [0, 0, ..., 100, 100] => p99 = 100
s2 (100 points) = [0, 0, ..., 50, 50] => p99 = 50
p99(s1 + s2) = 50, but avg(p99(s1), p99(s2)) = 75. Fail!
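The slide’s counterexample, run directly (nearest-rank p99, same series as above):

```python
def p99(samples):
    ordered = sorted(samples)
    return ordered[int(0.99 * len(ordered)) - 1]  # nearest-rank, fine here

s1 = [0] * 98 + [100, 100]   # p99 = 100
s2 = [0] * 98 + [50, 50]     # p99 = 50

merged = s1 + s2
print(p99(merged))                 # 50: the true p99 over all samples
print((p99(s1) + p99(s2)) / 2)     # 75: averaging percentiles is wrong
```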
Histograms
Distribution visualization of a sample
▪ Count of events in each bin
▪ Bins are usually evenly spaced
▪ Use logarithmically spaced bins for long tails
▪ Additive
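A sketch of why histograms are additive, with logarithmically spaced bins for a long-tailed latency distribution (bin edges and samples are made up):

```python
def histogram(samples, bin_edges):
    """Count samples per bin; bin_edges is a sorted list of upper edges."""
    counts = [0] * len(bin_edges)
    for s in samples:
        for i, edge in enumerate(bin_edges):
            if s <= edge:
                counts[i] += 1
                break
    return counts

# Log-spaced bins: 2ms, 4ms, 8ms, ... 1024ms.
bins = [2**k for k in range(1, 11)]

h1 = histogram([3, 5, 120, 900], bins)   # host 1
h2 = histogram([4, 7, 15, 600], bins)    # host 2

# Additive: merging hosts is just element-wise addition of bin counts,
# and the result equals the histogram of the pooled samples.
merged = [a + b for a, b in zip(h1, h2)]
print(merged)
```

This is exactly the property percentiles lack: you can roll histograms up across hosts or time windows without losing correctness.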
Histograms :-(
So why aren’t we all using this?
▪Storage
▪Have to decide on bins schema
▪ Not many tools support this
Choosing the right aggregate
▪ Percentiles/histograms for latency
▪ Max/min for latency and sizes
▪ Histogram analysis for sizes and latency
▪ Sums/averages for capacity and money
▪ Aggregate per domain
▪ Look for deviations
Resolution
▪ Humans need ~5 data points to see a trend
▪ Hides faster changes
▪ Rollups/downscaling is hard
▪ Multi tier FTW!
“It ain't what you don’t know that gets you into trouble. It's what you know for sure that just ain’t so.”
Peak Erasure/Spike erosion
■ When lowering resolution, data points are aggregated
■ Default aggregation is average
■ Peaks are erased
■ This can happen in storage or visualization
Peak Erasure/Spike erosion
■ Storages down-sample to save space
■ Aggregation function may be configurable
■ Metric collectors aggregate too
○ carbon-cache uses last value
○ StatsD - gauges, timers, counters
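A toy illustration of spike erosion (not any particular storage engine; the latency series is hypothetical): the same points rolled up with the default average vs. max.

```python
def downsample(points, factor, agg):
    """Roll up consecutive windows of `factor` points using `agg`."""
    return [agg(points[i:i + factor]) for i in range(0, len(points), factor)]

latency = [10, 11, 10, 950, 10, 12, 11, 10]  # one big spike

mean = lambda xs: sum(xs) / len(xs)
print(downsample(latency, 4, mean))  # spike eroded into the window average
print(downsample(latency, 4, max))   # spike preserved
```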
Counters vs Gauges
Behaviour in low res time window
■ Low res sampling erases fast changes
■ “Round numbers” syndrome
■ Counters smear changes, but don’t erase them
TLDR: use counters when possible
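A toy model of the counter-vs-gauge difference (not the real StatsD protocol; the traffic series is made up): a gauge read between samples misses a burst entirely, while a counter smears it into the next read without losing it.

```python
requests_per_sec = [5, 5, 500, 5, 5, 5, 5, 5, 5, 5]  # 1s ticks, burst at t=2

# Gauge: report the instantaneous value at sample time (t=0 here) --
# the burst at t=2 is invisible to a 10s sampling window.
gauge_sample = requests_per_sec[0]

# Counter: accumulate and report the delta over the whole window --
# the burst is averaged out ("smeared") but still counted.
counter_delta = sum(requests_per_sec)

print(gauge_sample, counter_delta)
```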
Mixed modes
Aggregating multiple modes reduces usability of aggregates
■ Different transaction types differ in latencies/sizes
■ Errors, successes have very different latencies/sizes
■ Makes your graphs weird
TLDR: use separate metrics for different things
Building useful graphs
Visualization
■ Timeframe
■ No more than 3 series
■ Be wary of multiple Y scales, but scale if needed
■ Only related series on the same graph
■ Never mix X scales
■ Visual references: bounds, Y min/max values, legend
Metric design
■ Choose your aggregates wisely
■ Decide on a proper resolution, sampling rate, aggregation time windows
■ Explore the distribution
■ Separate known modes to independent metrics
Separate signal from noise
■ Use low-pass filters to smooth
■ Trend changes
■ Timeshifts
■ Filter out outliers
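One simple low-pass filter is a moving average; a minimal sketch (window size and noisy series are illustrative):

```python
def moving_average(series, window):
    """Simple low-pass filter: smooth per-point noise to expose the trend."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

noisy = [10, 14, 9, 13, 11, 30, 12, 10, 13, 11]
print(moving_average(noisy, 3))  # shorter by window-1; spikes are damped
```

Note the trade-off from the spike-erosion slides: smoothing damps the outlier at 30, which is what you want for trend-spotting but not for alerting on peaks.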
Working with clusters
■ Most-deviant/outliers
■ Max/Min
■ Sum (capacity)
■ Pre-aggregate percentiles
Thank You
github.com/avishai-ish-shalom@[email protected]
Questions?
github.com/avishai-ish-shalom@[email protected]