Sketching Big Data with Spark Reynold Xin @rxin Sep 29, 2015 @ Strata NY

Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics


Page 1: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Sketching Big Data with Spark

Reynold Xin @rxin Sep 29, 2015 @ Strata NY

Page 2: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

About Databricks

Founded by creators of Spark in 2013

Cloud service for end-to-end data processing
•  Interactive notebooks, dashboards, and production jobs

We are hiring!

Page 3: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Spark

Page 4: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Count-min sketch

Page 5: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Approximate frequent items

Page 6: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Taylor Swift

Page 7: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics
Page 8: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

“Spark is the Taylor Swift of big data software.” - Derrick Harris, Fortune

Page 9: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Who is this guy?

Co-founder & architect for Spark at Databricks
Former PhD student at UC Berkeley AMPLab
A “systems” guy, which means I won’t be showing equations and this talk might be the easiest to consume in HDS

Page 10: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

This talk

1.  Develop intuitions on these sketches so you know when to use them

2.  Understand how certain parts in distributed data processing (e.g. Spark) work

Page 11: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics
Page 12: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Sketch: Reynold’s not-so-scientific definition

1. Use a small amount of space to summarize a large dataset.
2. Go over each data point once, a.k.a. “streaming algorithm”, or “online algorithm”.
3. Parallelizable, with only a small amount of communication.

Page 13: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

What for?

Exploratory analysis
Feature engineering
Combine sketch and exact to speed up processing

Page 14: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Sketches in Spark

Set membership (Bloom filter)
Cardinality (HyperLogLog)
Histogram (count-min sketch)
Frequent pattern mining
Frequent items
Stratified Sampling
…

Page 15: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

This Talk

Set membership (Bloom filter)
Cardinality (HyperLogLog)
Histogram (count-min sketch)
Frequent pattern mining
Frequent items
Stratified Sampling
…

Page 16: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Set membership

Page 17: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Set membership

Identify whether an item is in a set
e.g. “You have bought this item before”

Page 18: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Exact set membership

Track every member of the set
•  Space: size of data
•  One pass: yes
•  Parallelizable & communication: size of data

Page 19: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Approximate set membership

Take 1. Use a 32-bit integer hash map to track members
•  ~4 bytes per record
•  Max 4 billion items

Take 2. Hash items to 256 buckets
•  Memory usage only 256 bits
•  Good if num records is small
•  Bad if num records is large (256+ items, collision rate 100%!)

Page 20: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Bloom filter

Bloom filter algorithm
•  k hash functions
•  To insert: hash the item into k separate positions and set those bits
•  To query: if any of the k positions is not set, then the item is definitely not in the set

Properties
•  ~600MB needed to have 10% error rate on 1 billion items
•  See http://hur.st/bloomfilter?n=1000000000&p=0.1
•  False positives possible, false negatives are not
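As a rough check on the sizing figure above, here is the usual textbook formula for a Bloom filter. It assumes the number of hash functions k is chosen optimally and that the hashes behave as independent and uniform. The symbols n (items), p (false positive rate) and m (bits) are notation introduced here, not taken from the slides.

```latex
m = -\frac{n \ln p}{(\ln 2)^2}, \qquad k = \frac{m}{n} \ln 2

% Plugging in n = 10^9 and p = 0.1:
m = \frac{10^9 \times 2.3026}{0.4805} \approx 4.79 \times 10^9 \text{ bits} \approx 6 \times 10^8 \text{ bytes},
\qquad k \approx 4.79 \times 0.693 \approx 3.3
```

That is a few hundred megabytes with about three hash functions, the same order of magnitude as the figure quoted on the slide, and far smaller than storing a billion keys exactly.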

Page 21: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Use case beyond exploration

SELECT * FROM A JOIN B ON A.key = B.key
1.  Assume A and B are both large, i.e. “shuffle join”
2.  Some rows in A might not have matched rows in B
3.  Wouldn’t it be nice if we only need to shuffle rows that match?

Answer: use a Bloom filter to filter out the ones that don’t match
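A minimal sketch of that idea in plain Python, not Spark's implementation. The `BloomFilter` class, the `prefilter_for_join` helper, the default sizes and the SHA-256 double-hashing scheme are all assumptions made for illustration. Rows are modelled as `(key, value)` tuples.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a bit array plus k derived hash positions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions from one digest via double hashing (an assumption;
        # any k reasonably independent hash functions would do).
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def prefilter_for_join(rows_a, rows_b, num_bits=1 << 16, num_hashes=3):
    """Drop rows of A whose key cannot match any key in B."""
    bloom = BloomFilter(num_bits, num_hashes)
    for key, _ in rows_b:
        bloom.add(key)
    return [row for row in rows_a if bloom.might_contain(row[0])]


rows_b = [(key, "b") for key in range(100)]
rows_a = [(key, "a") for key in range(1000)]
kept = prefilter_for_join(rows_a, rows_b)
```

Every row of A that has a partner in B survives the filter. A few non-matching rows may slip through as false positives, and the join itself then discards them, so the result stays exact while fewer rows are shuffled.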

Page 22: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Frequent items

Page 23: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Frequent Items

Find items more frequent than 1/k

Page 24: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Source: http://www.macfreek.nl/memory/Letter_Distribution

Page 25: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

[Figure: bar chart, "Twitter Followers of NBA teams (in 1,000s), September 2015". Y-axis: Twitter followers in thousands, 0 to 5,000. The 30 bars fall from 4,474 for the most-followed team to 366 for the least-followed.]

Source: http://www.statista.com/statistics/240386/twitter-followers-of-national-basketball-association-teams/

Page 26: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Frequent Items

Exploration
•  Identify important members in a network
•  E.g. “the”, LA Lakers, Taylor Swift

Feature Engineering
•  Identify outliers
•  Ignore low frequency items

Page 27: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Frequent Items: Exact Algorithm

SELECT item, count(*) cnt FROM corpus GROUP BY item HAVING cnt > (SELECT count(*) FROM corpus) / k

•  Space: linear in |item| (the number of distinct items)
•  One pass: no (two passes)
•  Parallelizable & communication: linear in |item|

Page 28: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics
Page 29: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Example 1: Find items with frequency > ½ (k=2)

Page 30: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

draw

Put back if any pair of balls are the same color

Page 31: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics
Page 32: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

draw

Remove if the balls are all different colors

Page 33: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Example 1: Find Items Frequency > 1/2

Blue ball left (frequent item)

Page 34: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Example 2: Find items with frequency > ½ (k=2)

Page 35: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

draw

Page 36: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics
Page 37: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

draw

Page 38: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

draw

Page 39: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

1 ball left (frequent item)

Page 40: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

How do we implement this?

Maintain a hash table of counts

Page 41: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Increment for every ball we see

0 => 1

Page 42: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Increment for every ball we see

1 => 2

Page 43: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Increment for every ball we see

0 => 4

Page 44: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Increment for every ball we see

0 => 4

Page 45: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Increment for every ball we see

4

0 => 1

Page 46: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

When the hash table has k items, decrement each item’s count by 1 and remove any item whose count reaches 0

4 => 3

1 => 0

Page 47: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

3

Page 48: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

3

0 => 1

Page 49: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

2

Page 50: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

2

0 => 1

Page 51: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

1

Page 52: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Implementation

Maintains a hash table of counts
•  For each item, increment its count
•  If hash table size == k:
–  decrement 1 from each item; and
–  remove items whose count == 0

Parallelization: merge hash tables of max size k
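The steps above can be sketched in a few lines of Python. This style of summary is commonly known as the Misra-Gries algorithm. The function names are made up for illustration, and the merge rule (add the counts, then subtract the k-th largest count and keep what stays positive) is one common way to merge such tables, not something the slides spell out.

```python
def frequent_items_sketch(stream, k):
    """One pass; keeps at most k - 1 counters between items."""
    counts = {}
    for item in stream:
        counts[item] = counts.get(item, 0) + 1
        if len(counts) == k:
            # Table is full: decrement every counter and drop the zeros.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts


def merge_sketches(left, right, k):
    """Combine two tables built on different partitions of the data."""
    merged = dict(left)
    for item, count in right.items():
        merged[item] = merged.get(item, 0) + count
    if len(merged) >= k:
        # Subtract the k-th largest count so at most k - 1 counters survive.
        kth_largest = sorted(merged.values(), reverse=True)[k - 1]
        merged = {
            item: count - kth_largest
            for item, count in merged.items()
            if count > kth_largest
        }
    return merged


# Usage: "a" makes up 5 of 9 items, so it must survive with k = 2.
whole = frequent_items_sketch(list("abacadaea"), k=2)
halves = merge_sketches(
    frequent_items_sketch(list("abaca"), k=2),
    frequent_items_sketch(list("daea"), k=2),
    k=2,
)
```

Any item that makes up more than 1/k of the data is guaranteed to be in the final table. The reverse does not hold: the table can also contain items below the threshold, which is why the surviving counts are lower bounds rather than exact frequencies.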

Page 53: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Comparing Exact vs Approximate

                 Naïve Exact    Sketch
# Passes         2              1
Memory           |item|         k
Communication    |item|         k

Page 54: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Comparing Exact vs Approximate

                 Naïve Exact    Sketch    Smart Exact
# Passes         2              1         2 (1st pass using sketch)
Memory           |item|         k         k
Communication    |item|         k         k
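A sketch of what the second pass of a "smart exact" approach might look like, assuming the first pass has already produced a small candidate set from the sketch. The function name and the example data are hypothetical.

```python
from collections import Counter


def exact_frequent_items(stream, candidates, k):
    """Second pass: count only the candidates, then apply the exact 1/k test."""
    counts = Counter()
    total = 0
    for item in stream:
        total += 1
        if item in candidates:
            counts[item] += 1
    # Keep items whose exact frequency is strictly greater than total / k.
    return {item: count for item, count in counts.items() if count * k > total}


# "b" stands in for a false positive left over from the sketch pass.
stream = ["a", "a", "b", "c"]
result = exact_frequent_items(stream, candidates={"a", "b"}, k=3)
```

Memory stays proportional to the number of candidates rather than the number of distinct items, and the exact counts remove any false positives the sketch let through.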

Page 55: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Quiz: an example with false positive?

K = 3

Page 56: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

How to use it in Spark?

Frequent items for multiple columns independently
•  df.stat.freqItems(["columnA", "columnB", …])

Frequent items for composite keys
•  df.withColumn("ab", struct("columnA", "columnB")).stat.freqItems(["ab"])

Page 57: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Stratified sampling

Page 58: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Bernoulli sampling & Variance

Sample US population (300m) using rate 0.000002 (~600 samples)
•  Wyoming (0.5m) should have 1
•  Bernoulli sampling gives Wyoming 0 samples about 37% of the time (≈ 1/e)

Intuition: uniform sampling leads to ~600 samples.
•  i.e. it might be 600, or 601, or 599, or …
•  Impact on WY when going from 600 to 601 is much larger than on CA
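The arithmetic behind this slide can be checked with a short calculation. The helper names are illustrative, and the model assumes each person is kept independently with the same probability, which is what Bernoulli sampling means.

```python
import math


def expected_samples(population, rate):
    """Mean number of sampled records."""
    return population * rate


def prob_no_samples(population, rate):
    """Chance that a stratum of this size ends up with zero records."""
    return (1.0 - rate) ** population


def sample_count_stddev(population, rate):
    """Standard deviation of the binomial sample count."""
    return math.sqrt(population * rate * (1.0 - rate))


rate = 0.000002
us_expected = expected_samples(300_000_000, rate)
us_spread = sample_count_stddev(300_000_000, rate)
wyoming_expected = expected_samples(500_000, rate)
wyoming_empty = prob_no_samples(500_000, rate)
```

The expected total is 600 with a spread of a couple of dozen, while Wyoming expects a single record and comes up empty with probability close to 1/e. That mismatch between a small stratum and the overall variance is the motivation for sampling each stratum to an exact size.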

Page 59: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Stratified sampling

Existing “exact” algorithms
•  Draw-by-draw
•  Selection-rejection
•  Reservoir
•  Random sort

Either sequential or expensive (full global sort)

Page 60: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Random sort

Example: sampling probability p = 0.1 on 100 items.
1.  Generate random keys
•  (0.644, t1), (0.378, t2), … (0.500, t99), (0.471, t100)
2.  Sort and select the smallest 10 items
•  (0.028, t94), (0.029, t44), …, (0.137, t69), …, (0.980, t26), (0.988, t60)
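The two steps can be sketched directly in Python. The function name and the use of `round` for the target sample size are choices made here for illustration.

```python
import random


def random_sort_sample(items, p, seed=None):
    """Exact-size sample: attach a random key to every item, sort, keep the smallest."""
    rng = random.Random(seed)
    keyed = [(rng.random(), t) for t in items]   # step 1: generate random keys
    keyed.sort(key=lambda pair: pair[0])         # step 2: sort by key
    sample_size = round(p * len(items))
    return [t for _, t in keyed[:sample_size]]


sample = random_sort_sample(list(range(100)), p=0.1, seed=42)
```

The sample always has exactly the requested size, but every item takes part in the sort, which is the expensive global step in a distributed setting.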

Page 61: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Heuristics

Qualitatively speaking (u is the random key generated for item t, p is the sampling probability)
•  If u is “much larger” than p, then t is “unlikely” to be selected
•  If u is “much smaller” than p, then t is “likely” to be selected

Set two thresholds q1 and q2 (q1 < p < q2), such that:
•  If u < q1, accept t directly
•  If u > q2, reject t directly
•  Otherwise, put t in a buffer to be sorted
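A sketch of the two-threshold idea under simplifying assumptions. The thresholds here are p plus or minus a few standard deviations of the sampled fraction (the `width` parameter is hypothetical), which is not how the cited paper derives them. A fallback to the full sort covers the rare case where the thresholds turn out too tight.

```python
import math
import random


def random_sort_sample(items, p, seed):
    """Reference: key every item, sort everything, keep the smallest keys."""
    rng = random.Random(seed)
    keyed = [(rng.random(), t) for t in items]
    keyed.sort(key=lambda pair: pair[0])
    return [t for _, t in keyed[:round(p * len(items))]]


def threshold_sample(items, p, seed, width=4.0):
    """Return (sample, number of items that had to be sorted)."""
    n = len(items)
    if n == 0:
        return [], 0
    size = round(p * n)
    # Hypothetical thresholds: p +/- `width` standard deviations.
    margin = width * math.sqrt(p * (1.0 - p) / n)
    q1, q2 = max(0.0, p - margin), min(1.0, p + margin)

    rng = random.Random(seed)
    accepted, buffer = [], []
    for t in items:
        u = rng.random()
        if u < q1:
            accepted.append(t)        # accept directly
        elif u <= q2:
            buffer.append((u, t))     # undecided: keep for sorting
        # u > q2: reject directly

    need = size - len(accepted)
    if need < 0 or need > len(buffer):
        # Thresholds were too tight for this draw: fall back to the full sort.
        return random_sort_sample(items, p, seed), n
    buffer.sort(key=lambda pair: pair[0])
    return accepted + [t for _, t in buffer[:need]], len(buffer)


items = list(range(100000))
sample, sorted_count = threshold_sample(items, p=0.1, seed=7)
```

With the same seed, both functions see the same keys and select the same items, but the threshold version normally sorts only the small buffer of keys that fall between q1 and q2 instead of every item.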

Page 62: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Spark’s stratified sampling algorithm

Combines “exact” and “sketch” to achieve parallelization & low memory overhead
•  rdd.sampleByKeyExact(withReplacement, fractions, seed) on a key-value RDD

Xiangrui Meng. Scalable Simple Random Sampling and Stratified Sampling. ICML 2013  

Page 63: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

This Talk

Set membership (Bloom filter)
Cardinality (HyperLogLog)
Histogram (count-min sketch)
Frequent pattern mining
Frequent items
Stratified Sampling
…

Page 64: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Conclusion

Sketches can be useful in exploration, feature engineering, as well as building faster exact algorithms. We are building a lot of these into Spark so you don’t need to reinvent the wheel!

Page 65: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics

Thank you. Meetup tonight @ Civic Hall, 6:30pm  156 5th Avenue, 2nd floor, New York, NY