Engineer’s guide to Data Analysis Avishai Ish-Shalom github.com/avishai-ish-shalom @nukemberg [email protected]

Page 1: Engineers guide to data analysis

Engineer’s guide to Data Analysis

Avishai Ish-Shalom

github.com/avishai-ish-shalom @nukemberg [email protected]

Page 2: Engineers guide to data analysis

Wix in numbers

~ 400 Engineers

~ 1400 employees

~ 100M Sites

~ 250 micro services

Page 3: Engineers guide to data analysis

IaaS (Insult as a Service)

▪Thin API, written in Flask (Python)

▪CouchDB

▪Apache proxy

▪StatsD, Graphite, ELK

Page 4: Engineers guide to data analysis

Architecture

StatsD

Page 5: Engineers guide to data analysis

Graphite

▪Metrics collector, storage and UI

▪Math functions

▪Common

▪De-facto standard

Page 6: Engineers guide to data analysis

Oops, I think something is broken

Page 7: Engineers guide to data analysis

What is this “metric” you speak of?

Page 8: Engineers guide to data analysis

A metric is

▪Numeric data

▪Often with timestamp (time series)

▪A “measurement” of something

▪Discrete

Page 9: Engineers guide to data analysis

Where do metrics come from?

▪Events with numeric data

▪Counting/aggregating

▪Sampling

Page 10: Engineers guide to data analysis

Sampling

Page 11: Engineers guide to data analysis

Sampling

Page 12: Engineers guide to data analysis

Events

▪Data about something that happened

▪Timestamp (time series data)

▪Has properties - numeric and non-numeric

{
  "timestamp": "2016-11-15T18:43:39+00:00",
  "host": "test01.example.net",
  "status": "ok",
  "latency": 14.31
}

Page 13: Engineers guide to data analysis

How much data?

10000 events/sec x 0.5kb/event = ~5MB/sec, or roughly 400GB a day
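The estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes "0.5kb" means 500 bytes and that the rate is constant over a full day of 86,400 seconds. The variable names are illustrative.

```python
# Back-of-the-envelope telemetry volume, using the figures from the slide.
EVENTS_PER_SEC = 10_000
BYTES_PER_EVENT = 500            # assumption: "0.5kb" read as 500 bytes
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

bytes_per_sec = EVENTS_PER_SEC * BYTES_PER_EVENT
gb_per_day = bytes_per_sec * SECONDS_PER_DAY / 1e9

print(f"{bytes_per_sec / 1e6:.0f} MB/sec, {gb_per_day:.0f} GB/day")
# prints "5 MB/sec, 432 GB/day"
```

Under these assumptions the exact figure is 432GB, which the slide rounds to 400GB a day.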

Page 14: Engineers guide to data analysis

Telemetry is a big data problem

Page 15: Engineers guide to data analysis

Aggregates are lossy compression

We must decide in advance how we’ll use the metric

Page 16: Engineers guide to data analysis

Aggregates

▪Max, Min, Sum, Average, etc

▪Last, random point

▪Percentiles (quantiles)

▪Histograms, reverse quantiles

▪Each is suitable for a particular use case

Page 17: Engineers guide to data analysis

Averages are mean to me

Page 18: Engineers guide to data analysis

Percentiles

p99 - The sampled value that is larger than 99% of the other samples

▪O(n) memory complexity

▪ O(n*log n) computation complexity

▪Some shortcuts for p50 (median), p100 (max), p0 (min)

Use when clients experience individual values
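A minimal sketch of where the O(n) memory and O(n*log n) computation come from, assuming the "nearest rank" definition of a percentile (other definitions interpolate between samples). The function name `percentile` and the sample latencies are made up for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the sample at rank ceil(p/100 * n) after sorting."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)                       # O(n log n); all n samples are kept
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 14]

print(percentile(latencies_ms, 50))    # 13, the median
print(percentile(latencies_ms, 99))    # 250
print(percentile(latencies_ms, 0) == min(latencies_ms))     # True
print(percentile(latencies_ms, 100) == max(latencies_ms))   # True
```

The last two lines show the shortcut mentioned above: p0 and p100 are just min and max, which can be tracked in constant memory without sorting.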

Page 19: Engineers guide to data analysis

Percentiles

▪Percentiles are not additive

▪ You cannot average percentiles

Example:

s1 (100 points) = [0, 0, ..., 100, 100] => p99 = 100

s2 (100 points) = [0, 0, ..., 50, 50] => p99 = 50

p99(s1 : s2) = 50, avg(p99(s1), p99(s2)) = 75

Fail
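The example above can be reproduced in Python. The slide elides the middle of each series, so this sketch assumes each one is 98 zeros followed by two high values, and it assumes a nearest-rank p99; both are assumptions, not something the slide states.

```python
import math

def p99(samples):
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = math.ceil(99 * len(ordered) / 100)
    return ordered[rank - 1]

s1 = [0] * 98 + [100, 100]   # assumed shape of the elided series
s2 = [0] * 98 + [50, 50]

true_p99 = p99(s1 + s2)               # percentile of the combined samples
averaged = (p99(s1) + p99(s2)) / 2    # what you get by averaging per-series p99s

print(p99(s1), p99(s2))     # 100 50
print(true_p99, averaged)   # 50 75.0
```

The averaged value of 75 does not appear anywhere in the data, which is why per-host percentiles cannot be combined after the fact.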

Page 20: Engineers guide to data analysis

Histograms

Distribution visualization of sample

▪Count of events in each bin

▪Bins are usually evenly spaced

▪Use logarithmically spaced bins for long tails

▪ Additive
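A small sketch of why histograms are additive, using power-of-two bins as one possible logarithmic bin schema. The bin layout, function names and sample values are all assumptions chosen for illustration, not a scheme taken from any particular tool.

```python
import math
from collections import Counter

def log2_bin(value):
    """Bin k covers [2**k, 2**(k+1)); anything below 1 is folded into bin 0."""
    return 0 if value < 1 else int(math.log2(value))

def histogram(samples):
    """Count of samples per logarithmic bin."""
    return Counter(log2_bin(v) for v in samples)

def quantile_upper_bound(hist, p):
    """Upper edge of the bin that holds the p-th percentile (nearest rank)."""
    total = sum(hist.values())
    rank = max(1, math.ceil(p * total / 100))
    seen = 0
    for k in sorted(hist):
        seen += hist[k]
        if seen >= rank:
            return 2 ** (k + 1)
    raise ValueError("empty histogram")

host_a_ms = [1, 2, 3, 5, 9, 300]
host_b_ms = [1, 1, 4, 6, 40]

# Adding per-host histograms gives the histogram of the combined samples.
merged = histogram(host_a_ms) + histogram(host_b_ms)
print(merged == histogram(host_a_ms + host_b_ms))   # True

# A percentile can then be estimated from the merged counts, to bin precision.
print(quantile_upper_bound(merged, 90))             # 64 (the true p90 here is 40)
```

The price of additivity is precision: the estimate is only as fine as the bins, which is why the bin schema has to be chosen up front.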

Page 21: Engineers guide to data analysis

Histograms :-(

So why aren’t we all using this?

▪Storage

▪Have to decide on a bin schema in advance

▪ Not many tools support this

Page 22: Engineers guide to data analysis

Choosing the right aggregate

▪ Percentiles/histograms for latency

▪ Max/min for latency and sizes

▪ Histogram analysis for sizes and latency

▪ Sums/averages for capacity and money

▪ Aggregate per domain

▪ Look for deviations

Page 23: Engineers guide to data analysis

Resolution

▪ Humans need ~5 data points to see a trend

▪ Low resolution hides faster changes

▪ Rollups/downscaling is hard

▪ Multi tier FTW!

Page 24: Engineers guide to data analysis

“It ain't what you don’t know that gets you into trouble. It's what you know for sure that just ain’t so.”

Page 25: Engineers guide to data analysis

Peak Erasure/Spike erosion

■ When lowering resolution, data points are aggregated

■ Default aggregation is average

■ Peaks are erased

■ This can happen in storage or visualization
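Peak erasure is easy to reproduce. The sketch below downsamples a made-up one-minute latency series by a factor of 10, once with an average and once with max. The `downsample` helper and the data are hypothetical and do not model any specific storage engine.

```python
def downsample(points, factor, aggregate):
    """Collapse each run of `factor` consecutive points into one value."""
    return [aggregate(points[i:i + factor]) for i in range(0, len(points), factor)]

def mean(window):
    return sum(window) / len(window)

# One point per second for a minute: flat 10 ms latency with a single 500 ms spike.
latency_ms = [10] * 60
latency_ms[37] = 500

print(downsample(latency_ms, 10, mean))  # [10.0, 10.0, 10.0, 59.0, 10.0, 10.0]
print(downsample(latency_ms, 10, max))   # [10, 10, 10, 500, 10, 10]
```

With averaging the 500 ms spike shrinks to 59 ms, and every further rollup shrinks it again. With max the peak survives, at the cost of losing the typical value.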

Page 26: Engineers guide to data analysis

Peak Erasure/Spike erosion

■ Storages down-sample to save space

■ Aggregation function may be configurable

■ Metric collectors aggregate too

○ carbon-cache uses last value

○ StatsD - gauges, timers, counters

Page 27: Engineers guide to data analysis

Counters vs Gauges

Behaviour in low res time window

■ Low res sampling erases fast changes

■ “Round numbers” syndrome

■ Counters smear changes, but don’t erase them

TLDR: use counters when possible
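A rough simulation of the difference, assuming a collector that reads the metric once every 10 seconds while a 3-second burst happens between reads. The traffic pattern and the `SAMPLE_EVERY` interval are invented for the illustration.

```python
# Per-second request rate: quiet, except for a 3-second burst between samples.
per_second = [5] * 30
per_second[12:15] = [200, 200, 200]

SAMPLE_EVERY = 10  # seconds between collector reads (assumed)

# Gauge: the collector sees only the instantaneous value at each read.
gauge_samples = [per_second[t] for t in range(0, len(per_second), SAMPLE_EVERY)]

# Counter: the application keeps a running total; the collector reads it at
# the same interval and takes the difference between consecutive reads.
running_total = [0]
for rate in per_second:
    running_total.append(running_total[-1] + rate)
counter_reads = [running_total[t] for t in range(0, len(running_total), SAMPLE_EVERY)]
counter_deltas = [b - a for a, b in zip(counter_reads, counter_reads[1:])]

print(gauge_samples)    # [5, 5, 5]      the burst is invisible
print(counter_deltas)   # [50, 635, 50]  the burst is smeared over one window
print(sum(counter_deltas) == sum(per_second))   # True, nothing was lost
```

The counter cannot say when inside the window the burst happened, but the total is preserved, so the change is smeared rather than erased.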

Page 28: Engineers guide to data analysis

Mixed modes

Aggregating multiple modes reduces usability of aggregates

■ Different transaction types differ in latencies/sizes

■ Errors, successes have very different latencies/sizes

■ Makes your graphs weird

TLDR: use separate metrics for different things
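To see how mixing modes distorts an aggregate, consider hypothetical latencies where successful requests take around 125 ms and errors fail fast in a few milliseconds. All numbers below are invented for the illustration.

```python
from statistics import mean

# Hypothetical latencies (ms): successes do real work, errors fail fast.
success_ms = [120, 130, 125, 135, 128, 122, 131, 127]
error_ms = [2, 3, 2, 4]

mixed_ms = success_ms + error_ms

print(mean(success_ms))   # 127.25
print(mean(error_ms))     # 2.75
print(mean(mixed_ms))     # 85.75, a latency no request actually had

# More errors make the mixed average look better, not worse.
more_errors_ms = success_ms + error_ms * 3
print(mean(more_errors_ms) < mean(mixed_ms))   # True
```

Tracked as one metric, a rising error rate shows up as an improvement in average latency. Tracked separately, each mode stays readable.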

Page 29: Engineers guide to data analysis

Building useful graphs

Page 30: Engineers guide to data analysis

Visualization

■ Timeframe

■ No more than 3 series

■ Be wary of multiple Y scales, but scale if needed

■ Only related series on the same graph

■ Never mix X scales

■ Visual references: bounds, Y min/max values, legend

Page 31: Engineers guide to data analysis

Metric design

■ Choose your aggregates wisely

■ Decide on a proper resolution, sampling rate, aggregation time windows

■ Explore the distribution

■ Separate known modes to independent metrics

Page 32: Engineers guide to data analysis

Separate signal from noise

■ Use low-pass filters to smooth

■ Trend changes

■ Timeshifts

■ Filter out outliers
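Two common smoothing techniques, sketched under assumptions: an exponential moving average as a simple low-pass filter, and a trailing median as one way to filter out isolated outliers. The slide does not name specific filters, so these are illustrative choices, and the `alpha` and `window` defaults are arbitrary.

```python
from statistics import median

def ema(points, alpha=0.2):
    """Exponential moving average, a simple low-pass filter. Smaller alpha = smoother."""
    smoothed = []
    current = points[0]          # seed with the first point; expects a non-empty list
    for value in points:
        current = alpha * value + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

def rolling_median(points, window=5):
    """Median over a trailing window; an isolated outlier does not survive it."""
    return [median(points[max(0, i - window + 1):i + 1]) for i in range(len(points))]

noisy = [10, 12, 9, 11, 10, 95, 10, 12, 11, 9, 10, 11]

print([round(v, 1) for v in ema(noisy)])
print(rolling_median(noisy))
```

The moving average dampens the 95 but still shows a bump, so real trend changes remain visible, only later. The rolling median drops the single outlier entirely, which is what you want for noise and exactly what you do not want if the spike is the signal.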

Page 33: Engineers guide to data analysis

Working with clusters

■ Most-deviant/outliers

■ Max/Min

■ Sum (capacity)

■ Pre-aggregate percentiles
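A minimal sketch of a "most deviant" selection over a cluster, assuming each host reports one value and that deviation is measured as distance from the cluster median. The host names and values are hypothetical, and this is not the implementation used by any particular monitoring tool.

```python
from statistics import median

def most_deviant(per_host, n=1):
    """Return the n hosts whose value is farthest from the cluster median."""
    center = median(per_host.values())
    ranked = sorted(per_host, key=lambda host: abs(per_host[host] - center), reverse=True)
    return ranked[:n]

# Hypothetical p99 latency (ms), computed on each host before collection.
p99_ms = {"web01": 41, "web02": 39, "web03": 40, "web04": 180, "web05": 42}

print(most_deviant(p99_ms))                        # ['web04']
print(max(p99_ms.values()), min(p99_ms.values()))  # 180 39
```

Graphing only the most deviant hosts, plus the cluster max and min, keeps the number of series small while still surfacing the host that misbehaves.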

Page 34: Engineers guide to data analysis

Thank You

github.com/avishai-ish-shalom @nukemberg [email protected]

Page 35: Engineers guide to data analysis

Questions?
