Engineer’s Guide to Data Analysis

Avishai Ish-Shalom

github.com/avishai-ish-shalom

Wix in numbers

~ 400 Engineers

~ 1400 employees

~ 100M Sites

~ 250 micro services

IaaS (Insult as a Service)

▪Thin API, written in Flask (Python)

▪CouchDB

▪Apache proxy

▪StatsD, Graphite, ELK

Architecture

StatsD

Graphite

▪Metrics collector, storage and UI

▪Math functions

▪Common

▪De-facto standard

Oops, I think something is broken

What is this “metric” you speak of?

A metric is

▪Numeric data

▪Often with timestamp (time series)

▪A “measurement” of something

▪Discrete

Where do metrics come from?

▪Events with numeric data

▪Counting/aggregating

▪Sampling

Sampling


Events

▪Data about something that happened

▪Timestamp (time series data)

▪Has properties - numeric and non-numeric

{
  "timestamp": "2016-11-15T18:43:39+00:00",
  "host": "test01.example.net",
  "status": "ok",
  "latency": 14.31
}

How much data?

10000 events/sec x 0.5 KB/event ≈ 400 GB a day
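The arithmetic behind that figure, as a quick sketch. It assumes decimal units (0.5 KB = 500 bytes, 1 GB = 10^9 bytes); the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_volume_gb(events_per_sec, bytes_per_event):
    """Raw event volume per day, in decimal gigabytes."""
    return events_per_sec * bytes_per_event * SECONDS_PER_DAY / 1e9


# 10,000 events/sec at 500 bytes each
print(daily_volume_gb(10_000, 500))  # 432.0
```

So the slide's round number is slightly on the low side: the exact product is 432 GB a day, before any indexing or replication overhead.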

Telemetry is a big data problem

Aggregates are lossy compression

We must decide in advance how we’ll use the metric

Aggregates

▪Max, Min, Sum, Average, etc

▪Last, random point

▪Percentiles (quantiles)

▪Histograms, reverse quantiles

▪Each is suitable for a particular use case

Averages are mean to me

Percentiles

p99 - the sampled value that is larger than 99% of the samples

▪O(n) memory complexity

▪ O(n*log n) computation complexity

▪Some shortcuts for p50 (median), p100 (max), p0 (min)

Use when clients experience individual values

Percentiles

▪Percentiles are not additive

▪ You cannot average percentiles

Example:

s1 (100 points) = [0, 0, ....., 100, 100] => p99 = 100

s2 (100 points) = [0, 0, …., 50, 50] => p99 = 50

p99(s1 : s2) = 50, avg(p99(s1), p99(s2)) = 75

Fail
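The example can be checked with a few lines of Python. This is a sketch under two assumptions: the elided middle of each sample is all zeros (98 zeros followed by two non-zero points), and percentiles use the nearest-rank definition. Other percentile definitions interpolate and may give slightly different numbers.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: sort, then take the value at rank ceil(p/100 * n)."""
    if not samples:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(samples)  # O(n log n) time, O(n) memory
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Assumed contents: the slide only shows the first and last points.
s1 = [0] * 98 + [100, 100]
s2 = [0] * 98 + [50, 50]

p99_s1 = percentile(s1, 99)                # 100
p99_s2 = percentile(s2, 99)                # 50
p99_combined = percentile(s1 + s2, 99)     # 50
avg_of_p99 = (p99_s1 + p99_s2) / 2         # 75.0

print(p99_s1, p99_s2, p99_combined, avg_of_p99)
```

The combined p99 has to be computed from the raw samples; the two per-sample p99 values alone do not carry enough information to recover it.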

Histograms

Distribution visualization of sample

▪Count of events in each bin

▪Bins are usually evenly spaced

▪Use logarithmically spaced bins for long tails

▪ Additive
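A minimal sketch of what "additive" buys you, assuming power-of-two bins and positive values (say, latencies in milliseconds). The function names are illustrative, not from any particular tool.

```python
import math
from collections import Counter


def log2_bin(value):
    """Index i of the logarithmic bin [2**i, 2**(i+1)) that contains value."""
    if value <= 0:
        raise ValueError("logarithmic bins need positive values")
    return math.floor(math.log2(value))


def histogram(samples):
    """Count of samples per bin, keyed by bin index."""
    return Counter(log2_bin(x) for x in samples)


host_a = [1, 3, 5, 900]     # hypothetical latencies from one host
host_b = [2, 6, 7, 1500]    # and from another

# Merging histograms is just adding the counts bin by bin...
merged = histogram(host_a) + histogram(host_b)

# ...and gives the same result as a histogram over all the raw samples.
print(merged == histogram(host_a + host_b))  # True
```

This only works because both hosts use the same bin schema, which is one reason the schema has to be decided up front.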

Histograms :-(

So why aren’t we all using this?

▪Storage

▪Have to decide on a bin schema

▪ Not many tools support this

Choosing the right aggregate

▪ Percentiles/histograms for latency

▪ Max/min for latency and sizes

▪ Histogram analysis for sizes and latency

▪ Sums/averages for capacity and money

▪ Aggregate per domain

▪ Look for deviations

Resolution

▪ Humans need ~5 data points to see a trend

▪ Hides faster changes

▪ Rollups/downscaling is hard

▪ Multi tier FTW!

“It ain't what you don’t know that gets you into trouble.

It's what you know for sure that just ain’t so.”

Peak Erasure/Spike erosion

■ When lowering resolution, data points are aggregated

■ Default aggregation is average

■ Peaks are erased

■ This can happen in storage or visualization
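A small sketch of the effect, using a made-up series with a single spike. The `downsample` helper is hypothetical; real storage and dashboard tools do the equivalent internally.

```python
def downsample(points, factor, agg):
    """Replace each run of `factor` consecutive points with one aggregated point."""
    return [agg(points[i:i + factor]) for i in range(0, len(points), factor)]


def mean(xs):
    return sum(xs) / len(xs)


# Twelve points at full resolution, one of them a spike to 500.
series = [10, 10, 10, 10, 10, 10, 10, 500, 10, 10, 10, 10]

by_avg = downsample(series, 6, mean)  # spike is flattened to roughly 92
by_max = downsample(series, 6, max)   # spike survives as 500

print(by_avg)
print(by_max)
```

Averaging keeps the total roughly right but hides the peak; max keeps the peak but exaggerates the typical level. Which one is appropriate depends on what the metric is used for.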

Peak Erasure/Spike erosion

■ Storages down-sample to save space

■ Aggregation function may be configurable

■ Metric collectors aggregate too

○ carbon-cache uses last value

○ StatsD - gauges, timers, counters
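For reference, the three StatsD types differ only in the suffix of the plain-text line sent over UDP. The sketch below formats such lines; the helper names and metric names are made up, and the host and port (8125 is the usual StatsD default) are assumptions about the local setup.

```python
import socket

# StatsD type suffixes: counter, gauge, timer (milliseconds)
TYPES = {"counter": "c", "gauge": "g", "timer": "ms"}


def statsd_line(name, value, kind):
    """Format one metric in the StatsD line protocol: <name>:<value>|<type>."""
    return f"{name}:{value}|{TYPES[kind]}"


def send(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; assumes a StatsD daemon listens on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


print(statsd_line("api.requests", 1, "counter"))    # api.requests:1|c
print(statsd_line("api.queue_depth", 42, "gauge"))  # api.queue_depth:42|g
print(statsd_line("api.latency", 14.31, "timer"))   # api.latency:14.31|ms
```

The daemon aggregates each type differently within its flush interval, which is why the choice of type is also a choice of aggregate.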

Counters vs Gauges

Behaviour in low res time window

■ Low res sampling erases fast changes

■ “Round numbers” syndrome

■ Counters smear changes, but don’t erase them

TLDR: use counters when possible
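A toy simulation of the difference, with made-up numbers: a steady 5 requests/sec, a 3-second burst of 200 requests/sec, and a collector that samples every 10 seconds.

```python
from itertools import accumulate

# Per-second request rate over one minute, with a short burst at t = 12..14.
rate = [5] * 60
for t in (12, 13, 14):
    rate[t] = 200

INTERVAL = 10  # the collector samples every 10 seconds

# Gauge: the instantaneous rate at each sampling instant (t = 0, 10, ..., 50).
gauge_samples = rate[::INTERVAL]

# Counter: running total of requests, sampled at t = 0, 10, ..., 60.
counter = [0] + list(accumulate(rate))
counter_samples = counter[::INTERVAL]
deltas = [b - a for a, b in zip(counter_samples, counter_samples[1:])]

print(gauge_samples)  # [5, 5, 5, 5, 5, 5]       -> the burst is invisible
print(deltas)         # [50, 635, 50, 50, 50, 50] -> the burst is smeared, not lost
```

The counter cannot say when within the interval the burst happened, but the total is preserved at any sampling resolution.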

Mixed modes

Aggregating multiple modes reduces usability of aggregates

■ Different transaction types differ in latencies/sizes

■ Errors, successes have very different latencies/sizes

■ Makes your graphs weird

TLDR: use separate metrics for different things
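A sketch with invented latencies shows why: fast successes and slow errors (timeouts, say) averaged together produce a number that describes neither group.

```python
from statistics import mean

# Hypothetical latencies in milliseconds.
successes = [10, 12, 11, 9, 13, 10, 11, 12]
errors = [1000, 1020]  # e.g. requests that hit a timeout

combined = successes + errors

print(mean(successes))  # 11
print(mean(errors))     # 1010
print(mean(combined))   # 210.8, a latency no request actually experienced
```

With separate metrics, a rise in the error rate shows up as a change in error count, not as a mysterious shift in "average latency".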

Building useful graphs

Visualization

■ Timeframe

■ No more than 3 series

■ Be wary of multiple Y scales, but scale if needed

■ Only related series on the same graph

■ Never mix X scales

■ Visual references: bounds, Y min/max values, legend

Metric design

■ Choose your aggregates wisely

■ Decide on a proper resolution, sampling rate, aggregation time windows

■ Explore the distribution

■ Separate known modes to independent metrics

Separate signal from noise

■ Use low-pass filters to smooth

■ Trend changes

■ Timeshifts

■ Filter out outliers
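One simple low-pass filter is an exponential moving average. The sketch below is one option among several (a plain moving average or moving median would also qualify); the `alpha` value and the sample series are arbitrary choices for illustration.

```python
def ema(series, alpha):
    """Exponential moving average: each output mixes the new point with the previous output.

    Smaller alpha means heavier smoothing and slower reaction to real changes.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    current = None
    for x in series:
        current = x if current is None else alpha * x + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


noisy = [10, 14, 6, 14, 6, 14, 6]  # oscillates +/- 4 around 10
smooth = ema(noisy, alpha=0.25)

print(smooth)
print(max(noisy) - min(noisy), max(smooth) - min(smooth))
```

The trade-off is lag: the same smoothing that removes noise also delays and shrinks genuine spikes, so a smoothed series is for reading trends, not for spotting peaks.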

Working with clusters

■ Most-deviant/outliers

■ Max/Min

■ Sum (capacity)

■ Pre-aggregate percentiles
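A rough sketch of the first three ideas, using invented per-host CPU utilisation. "Most deviant" is taken here to mean furthest from the cluster median, which is one reasonable definition and an assumption on my part.

```python
from statistics import median

# Hypothetical latest CPU utilisation per host (0..1).
per_host = {"web01": 0.42, "web02": 0.45, "web03": 0.91, "web04": 0.40}


def most_deviant(values_by_host, n=1):
    """Hosts whose value is furthest from the cluster median, most deviant first."""
    center = median(values_by_host.values())
    ranked = sorted(values_by_host,
                    key=lambda h: abs(values_by_host[h] - center),
                    reverse=True)
    return ranked[:n]


print(most_deviant(per_host))            # ['web03']
print(max(per_host, key=per_host.get))   # busiest host
print(sum(per_host.values()))            # total load, for capacity planning
```

Graphing only the few most deviant hosts keeps a cluster graph readable no matter how many hosts there are.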

Thank You


Questions?

