SAMPLING A STREAM OF EVENTS WITH A SKETCH • PREETAM JINKA • BARON SCHWARTZ • VIVIDCORTEX • MONITORAMA • JUNE 2015

Sampling Is Hard

Page 1: Sampling Is Hard

SAMPLING A STREAM OF EVENTS WITH A SKETCH

PREETAM JINKA • BARON SCHWARTZ • VIVIDCORTEX

MONITORAMA • JUNE 2015

Page 2: Sampling Is Hard

INTRODUCTIONS

VividCortex is the best way to see what your databases are doing in production

Preetam Jinka, Software Engineer

@PreetamJinka

Baron Schwartz, CEO/Founder

@xaprb

Page 3: Sampling Is Hard

A STREAM OF EVENTS IN TIME

[Figure: events plotted along a time axis]

Page 4: Sampling Is Hard

COMPUTE METRICS ABOUT THE EVENTS

Page 5: Sampling Is Hard

METRICS ALONE ARE NOT ENOUGH

Page 6: Sampling Is Hard

WE WANT SAMPLES OF THESE EVENTS

Page 7: Sampling Is Hard

REPRESENTATIVE SAMPLING IS HARD

It’s Hard To Pick Individual Samples

Page 8: Sampling Is Hard

REPRESENTATIVE SAMPLING MATTERS

Page 9: Sampling Is Hard

EVENTS ARE DIVERSE AND COMPLEX

Page 10: Sampling Is Hard

GOALS

Sample enough events but not too many

Select representative events

Bias “important” events

Avoid “private” events

Balance sampling between rare and frequent events

Achieve desired overall sampling rate

Page 11: Sampling Is Hard

CONFLICTING GOALS

Bias towards “important” events, versus rate limiting

Rare versus frequent versus overall sampling rate

Correctness versus efficiency

Page 12: Sampling Is Hard

POSSIBLE APPROACHES

Select every Nth event

Select worst event per time period

Select random event per time period
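The three candidate approaches on this slide can be sketched in a few lines of Python. The event shape (a `(timestamp, latency)` tuple), the function names, the fixed-width time buckets, and the use of latency as the measure of "worst" are all assumptions made for illustration; the talk does not specify them.

```python
import random
from collections import defaultdict

# Hypothetical event shape: (timestamp_seconds, latency_seconds).
events = [(0.5, 0.01), (1.2, 0.30), (1.9, 0.02), (3.1, 0.05), (3.4, 0.90), (5.0, 0.04)]

def every_nth(events, n):
    """Keep every n-th event (the n-th, 2n-th, ...)."""
    return [e for i, e in enumerate(events, start=1) if i % n == 0]

def by_period(events, period):
    """Group events into consecutive time buckets of `period` seconds."""
    buckets = defaultdict(list)
    for e in events:
        buckets[int(e[0] // period)].append(e)
    return [buckets[k] for k in sorted(buckets)]

def worst_per_period(events, period):
    """Keep the slowest event in each time bucket."""
    return [max(b, key=lambda e: e[1]) for b in by_period(events, period)]

def random_per_period(events, period, rng=random):
    """Keep one uniformly chosen event in each time bucket."""
    return [rng.choice(b) for b in by_period(events, period)]

nth = every_nth(events, 3)
worst = worst_per_period(events, 2.0)
rand = random_per_period(events, 2.0, random.Random(0))
```

Each approach is simple, but none of them balances rare against frequent categories on its own: every-Nth selection, for example, mostly returns events from whichever categories dominate the stream.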

Page 13: Sampling Is Hard

STATISTICS TO THE RESCUE?

If events are generated by a Poisson process, then:

Constant average rate, exponential inter-arrival times

If we choose samples using an exponential probability, then samples would be Poisson too

[Figure: selection probability versus time; the curve approaches 1.0]

Page 14: Sampling Is Hard

USING EXPONENTIAL PROBABILITIES

Probability of selecting an event, given the time $t$ since the last event, is $1 - e^{-\lambda t}$

[Figure: events on a time axis, with the selection probability plotted against time]
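As a minimal sketch, the exponential selection probability $1 - e^{-\lambda t}$ from this slide could be computed as follows. The names `elapsed` (for $t$), `rate` (for $\lambda$), and `should_sample` are assumptions for illustration, not identifiers from the talk.

```python
import math
import random

def exponential_probability(elapsed, rate):
    """Probability 1 - e^(-rate * elapsed) of selecting an event."""
    return 1.0 - math.exp(-rate * elapsed)

def should_sample(elapsed, rate, rng=random):
    """Draw a uniform number and compare it with the selection probability."""
    return rng.random() < exponential_probability(elapsed, rate)

p0 = exponential_probability(0.0, 0.5)       # no time has passed
p_small = exponential_probability(1.0, 0.5)  # one second has passed
p_large = exponential_probability(100.0, 0.5)
```

The probability starts at 0 when no time has passed and approaches 1.0 as the elapsed time grows, which matches the shape of the curve on the slide.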

Page 15: Sampling Is Hard

WE CHOSE A SIMPLER APPROACH

[Figure: events on a time axis, with the selection probability plotted against time]

We use a linearly increasing probability that will produce a uniform distribution of samples
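A linearly increasing probability might look like the sketch below. The slide does not give the slope or say what happens once the line reaches 1.0, so the `interval` parameter (the elapsed time at which the probability reaches 1.0) and the clamping to the range 0 to 1 are assumptions.

```python
def linear_probability(elapsed, interval):
    """Probability grows linearly with elapsed time and is capped at 1.0."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return min(1.0, max(0.0, elapsed / interval))

p_early = linear_probability(2.0, 10.0)   # one fifth of the way to the cap
p_capped = linear_probability(25.0, 10.0) # past the cap
```

Compared with the exponential form, this needs only a division and a comparison per event.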

Page 16: Sampling Is Hard

ANALOGY: WAITING FOR A BUS

Buses arrive at a stop every 5 minutes on average

You arrive 2 minutes after the last bus left

How long should you expect to wait?
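The answer depends on what "every 5 minutes on average" means, so the two textbook cases are worth writing out. This is a worked illustration under stated assumptions rather than a result quoted from the talk; $W$ (the remaining wait) is a symbol introduced here.

```latex
% Case 1: buses run on a perfectly regular 5-minute schedule.
W = 5 - 2 = 3 \text{ minutes}

% Case 2: arrivals follow a Poisson process with mean gap 1/\lambda = 5 minutes.
% Exponential gaps are memoryless, so the 2 minutes already elapsed do not matter:
\mathbb{E}[W \mid \text{2 minutes elapsed}] = \frac{1}{\lambda} = 5 \text{ minutes}
```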

Page 17: Sampling Is Hard

WHY DO WE DO THAT?

Easy to understand

Less computationally expensive

Reliable in low-frequency scenarios

Page 18: Sampling Is Hard

BEWARE: PROBABILITY GOTCHA

Should we use time since last event, or time since last sample, to compute the probability of selecting an event? Which strategy is correct?

[Figure: events on a time axis, with the selection probability plotted against time; one event is labeled "We Sampled This Event"]

Page 19: Sampling Is Hard

USE THE TIME SINCE THE LAST EVENT

[Figure: events on a time axis, with the selection probability plotted against time]

Probabilities are additive, and the probability computed from the time since the last sample grows a lot faster than the probability computed from the time since the last event
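A small calculation illustrates the difference between the two strategies. It assumes a linear probability of `elapsed / interval`, a target of one sample every 10 seconds, and one event per second; these numbers and all names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
interval = 10.0   # assumed target: one sample per 10 seconds
spacing = 1.0     # assumed: one event per second

def probability(elapsed):
    return min(1.0, elapsed / interval)

# Strategy A: probability from the time since the last event.
# Every event gets the same probability, so the expected gap between
# samples is spacing / p.
p_event = probability(spacing)
gap_since_last_event = spacing / p_event

# Strategy B: probability from the time since the last sample.
# The k-th event after a sample is selected with probability(k * spacing),
# so the expected gap is the sum over k of P(no sample among the first k events).
gap_since_last_sample = 0.0
p_no_sample_yet = 1.0
k = 0
while p_no_sample_yet > 0.0:
    gap_since_last_sample += spacing * p_no_sample_yet
    k += 1
    p_no_sample_yet *= 1.0 - probability(k * spacing)
```

Under these assumptions, strategy A yields an expected gap of 10 seconds, which matches the target, while strategy B yields an expected gap of about 3.66 seconds, so it samples well above the intended rate.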

Page 20: Sampling Is Hard

EFFICIENCY CHALLENGES

“Remembering” millions of categories creates memory and CPU load

We use an LRU to “forget” stale categories for efficiency

Lots of edge cases can result (oversampling, undersampling)
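The LRU idea can be sketched with Python's `collections.OrderedDict`. The class name `LastSeenLRU`, the `touch` method, and the category strings are hypothetical; the talk does not describe the actual implementation.

```python
from collections import OrderedDict

class LastSeenLRU:
    """Bounded map of category -> last-seen timestamp (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._seen = OrderedDict()

    def touch(self, category, now):
        """Record `now` for `category`; return the previous timestamp or None."""
        previous = self._seen.pop(category, None)
        self._seen[category] = now           # most recently used goes last
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)   # evict the least recently used
        return previous

lru = LastSeenLRU(capacity=2)
first = lru.touch("select_users", 1.0)
lru.touch("select_orders", 2.0)
lru.touch("update_cart", 3.0)                # evicts "select_users"
forgotten = lru.touch("select_users", 4.0)   # looks like a brand-new category
```

Once a category has been evicted, its next event looks as if the category had never been seen. How the sampler treats that case is one source of the oversampling and undersampling edge cases mentioned on the slide.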

Page 21: Sampling Is Hard

EFFICIENCY SOLUTION?

We’d like a cheap way to “remember” the last time we’ve seen a category of query, even if it’s approximate

Page 22: Sampling Is Hard

A SKETCH TO THE RESCUE!

A sketch is a compact, probabilistic data structure

Trades off accuracy for resources (CPU, memory)

Similar in nature to a bloom filter

Page 23: Sampling Is Hard

WE “INVENTED” A SKETCH

We were inspired by the Count-Min Sketch

Instead of frequency, we needed last-seen timestamp

We call it the “Last-Seen Sketch”

It is compact and efficient (memory + CPU)

It errs on the side of undersampling

Page 24: Sampling Is Hard

THE LAST-SEEN SKETCH

The sketch is several arrays of timestamps

Categories of events map to cells by hash and modulus.

Each event maps, by hash and modulus, to one cell in each array.

With 4 arrays, its timestamp is stored in 4 places.

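| | TS 0 | TS 1 | TS 2 | TS 3 | TS 4 | TS 5 |
|---|---|---|---|---|---|---|
| ARRAY 0 | 1 | 4 | 7 | 9 | 8 | 3 |
| ARRAY 1 | 7 | 3 | 2 | 9 | 1 | |
| ARRAY 2 | 1 | 3 | 9 | 7 | | |
| ARRAY 3 | 3 | 9 | 7 | | | |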

Page 25: Sampling Is Hard

STORING AN EVENT’S TIMESTAMP

Store ts=8 for an event that hashes to 20. Where are its values stored?

20 % 6 => index 2 (ARRAY 0)

20 % 5 => index 0 (ARRAY 1)

20 % 4 => index 0 (ARRAY 2)

20 % 3 => index 2 (ARRAY 3)

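| | TS 0 | TS 1 | TS 2 | TS 3 | TS 4 | TS 5 |
|---|---|---|---|---|---|---|
| ARRAY 0 | 1 | 4 | 8 | 9 | 8 | 3 |
| ARRAY 1 | 8 | 3 | 2 | 9 | 1 | |
| ARRAY 2 | 8 | 3 | 9 | 7 | | |
| ARRAY 3 | 3 | 9 | 8 | | | |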
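The store step on this slide can be reproduced with plain Python lists. The array sizes and cell values are taken from the slides; the names `sketch`, `cell_indices`, and `store` are assumptions for illustration.

```python
# Sketch contents before the store, as shown on the previous slide.
sketch = [
    [1, 4, 7, 9, 8, 3],  # ARRAY 0, 6 cells
    [7, 3, 2, 9, 1],     # ARRAY 1, 5 cells
    [1, 3, 9, 7],        # ARRAY 2, 4 cells
    [3, 9, 7],           # ARRAY 3, 3 cells
]

def cell_indices(sketch, hashed):
    """One index per array: the hash modulo that array's length."""
    return [hashed % len(row) for row in sketch]

def store(sketch, hashed, timestamp):
    """Overwrite the event's cell in every array with `timestamp`."""
    for row, index in zip(sketch, cell_indices(sketch, hashed)):
        row[index] = timestamp

indices = cell_indices(sketch, 20)  # [2, 0, 0, 2]
store(sketch, 20, 8)
```

Because each array has a different length, two hashes that share a cell in one array will often land in different cells in another.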

Page 26: Sampling Is Hard

LOOKING UP A VALUE

Example: find the stored timestamp for an event that hashes to 13.

Indices are 1, 3, 1, 1, which hold the values 4, 9, 3, 9.

Choose the lowest value.

Result: value is 3.

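| | TS 0 | TS 1 | TS 2 | TS 3 | TS 4 | TS 5 |
|---|---|---|---|---|---|---|
| ARRAY 0 | 1 | 4 | 8 | 9 | 8 | 3 |
| ARRAY 1 | 8 | 3 | 2 | 9 | 1 | |
| ARRAY 2 | 8 | 3 | 9 | 7 | | |
| ARRAY 3 | 3 | 9 | 8 | | | |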
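The lookup on this slide can be checked the same way. The cell values come from the slide; the names `sketch`, `lookup`, `candidates`, and `last_seen` are assumptions for illustration.

```python
# Sketch contents after ts=8 was stored for the event that hashes to 20.
sketch = [
    [1, 4, 8, 9, 8, 3],  # ARRAY 0, 6 cells
    [8, 3, 2, 9, 1],     # ARRAY 1, 5 cells
    [8, 3, 9, 7],        # ARRAY 2, 4 cells
    [3, 9, 8],           # ARRAY 3, 3 cells
]

def lookup(sketch, hashed):
    """Return the smallest timestamp among the event's cells."""
    return min(row[hashed % len(row)] for row in sketch)

candidates = [row[13 % len(row)] for row in sketch]  # [4, 9, 3, 9]
last_seen = lookup(sketch, 13)                       # 3
```

Assuming timestamps are stored in non-decreasing order, a collision can only push a cell to a later value, so the smallest candidate is the one least affected by other categories. It can still be later than the category's true last-seen time, which makes the elapsed time look shorter and so errs toward undersampling.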

Page 27: Sampling Is Hard

PUTTING IT ALL TOGETHER

Events are categorized and flagged in various ways

Important events: long-running, has an error, etc.

Ineligible: has blacklisted text that’s sensitive/private, etc.

Events are then eligible to be selected as samples

Page 28: Sampling Is Hard

PUTTING IT ALL TOGETHER

Probability of selecting the event is determined with the Last-Seen Sketch

On collision, we err on the side of undersampling (very small probability)

Events are selected and transmitted to our APIs
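Putting the pieces together, a sampler built on a last-seen sketch might look like the following. This is a sketch under several assumptions, not the VividCortex implementation: the class and method names, the use of `zlib.crc32` as the hash, the array sizes, the initial timestamp of 0, the linear probability with a target `interval`, and the choice to let ineligible events still update the sketch are all hypothetical. Biasing towards important events is left out.

```python
import random
import zlib

class LastSeenSketch:
    """Several arrays of timestamps, indexed by hash modulo array length."""

    def __init__(self, sizes=(1021, 1019, 1013, 1009), start=0.0):
        self.rows = [[start] * size for size in sizes]

    def _cells(self, category):
        hashed = zlib.crc32(category.encode("utf-8"))
        return [(row, hashed % len(row)) for row in self.rows]

    def last_seen(self, category):
        """Smallest timestamp among the category's cells."""
        return min(row[i] for row, i in self._cells(category))

    def update(self, category, now):
        for row, i in self._cells(category):
            row[i] = now

class Sampler:
    def __init__(self, interval, rng=None):
        self.interval = interval           # assumed target seconds between samples
        self.sketch = LastSeenSketch()
        self.rng = rng or random.Random()

    def offer(self, category, now, eligible=True):
        """Return True if this event should be sent as a sample."""
        elapsed = now - self.sketch.last_seen(category)
        self.sketch.update(category, now)  # time since last event, not last sample
        if not eligible:
            return False                   # e.g. sensitive text: never sample
        probability = min(1.0, max(0.0, elapsed / self.interval))
        return self.rng.random() < probability

sampler = Sampler(interval=10.0, rng=random.Random(1))
first = sampler.offer("select_users", now=20.0)    # elapsed 20s, probability 1.0
repeat = sampler.offer("select_users", now=20.0)   # elapsed 0s, probability 0.0
private = sampler.offer("select_users", now=45.0, eligible=False)
```

When two categories collide in every array, the looked-up timestamp is more recent than the true one, so the computed probability is smaller than it should be. That is the undersampling behavior the slide describes.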

Page 29: Sampling Is Hard

NOW WE HAVE METRICS + SAMPLES

Page 30: Sampling Is Hard

RATE LIMITS

Important to prevent DOS’ing ourselves

Current: implemented with a global sample quota per interval of time

Future: will likely use an EWMA to influence overall sampling probability
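A global sample quota per interval could be as simple as the counter below. The class name `SampleQuota`, its parameters, and the fixed-window reset are assumptions for illustration; the talk does not say how the quota is implemented.

```python
class SampleQuota:
    """Allow at most `limit` samples in each window of `period` seconds."""

    def __init__(self, limit, period):
        self.limit = limit
        self.period = period
        self.window = None
        self.used = 0

    def allow(self, now):
        window = int(now // self.period)
        if window != self.window:      # a new interval resets the quota
            self.window = window
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True

quota = SampleQuota(limit=2, period=60.0)
decisions = [quota.allow(t) for t in (1.0, 2.0, 3.0, 61.0)]
```

In this example the third request is refused because the window's quota is used up, and the fourth is allowed because it falls in the next window. A hard quota like this drops whatever arrives after the limit is reached, important or not.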

Page 31: Sampling Is Hard

IMPORTANT EVENTS

Not all events are created equal

Bias sampling towards important events

Extremely helpful for one-in-a-million problems in production

Challenging to balance with rate limits

Page 32: Sampling Is Hard

EXAMPLE IN VIVIDCORTEX

Suppose we are sniffing queries off the wire, and a small fraction of them, say 0.001%, produce warnings or errors

If we aren’t sampling this query category enough, we won’t have the warning-producing SQL to examine!

Page 33: Sampling Is Hard

PRESTO!

Page 34: Sampling Is Hard

MONGODB QUERY WITH ERROR

Page 35: Sampling Is Hard

@PreetamJinka

linkedin.com/in/preetamjinka

@xaprb

linkedin.com/in/xaprb

Thanks to John Berryman, who helped implement and peer reviewed .

Page 36: Sampling Is Hard

PHOTO CREDITS

Chocolates: skrb - https://www.flickr.com/photos/skrb/5984342555

Dew: taufuuu - https://www.flickr.com/photos/ghailon/11565221176

Silhouette: https://www.flickr.com/photos/28481088@N00/2925783507

Bus stop: Robert Couse-Baker - https://www.flickr.com/photos/29233640@N07/14033204315

calla edge: mclcbooks - https://www.flickr.com/photos/39877441@N05/5455416496/

Windmills: omarparada - https://www.flickr.com/photos/omarparada/9776594294

Airplanes: presidioofmonterey - https://www.flickr.com/photos/presidioofmonterey/10710648865

Droplet collision: https://www.flickr.com/photos/69294818@N07/8682467843

1000 layers: doug88888 - https://www.flickr.com/photos/doug88888/3139395660

Balancing Rocks: light_seeker - https://www.flickr.com/photos/light_seeker/7780857224

Capilano Dam: barabanov - https://www.flickr.com/photos/barabanov/4733415724

Survival Bias: hjl - https://www.flickr.com/photos/hjl/15942299782