
Scalable Simple Random Sampling Algorithms

Page 1: Scalable Simple Random Sampling Algorithms

Scalable Simple Random Sampling Algorithms

Xiangrui Meng

Joint ICME/Statistics Seminar in Data Science

Xiangrui Meng (Databricks) ScaSRS March 3, 2014 1 / 40

Page 2: Scalable Simple Random Sampling Algorithms

Spark workshop (April 4, 2014)

http://icme.stanford.edu/news/2014/spark-workshop

Reza Zadeh ([email protected])

Apache Spark is a fast and general engine for large-scale data processing.

Page 3: Scalable Simple Random Sampling Algorithms

Statistical analysis of big data

Analyzing data sets of billions of records has now become a regular task in many companies and institutions. The continuous increase in data size keeps challenging the design of algorithms.

Design and implement new scalable algorithms.
- Algorithms:
  - alternating direction method of multipliers (Boyd et al., 2011)
  - matrix factorization for recommender systems (Koren et al., 2009)
- Libraries:
  - Vowpal Wabbit, Apache Mahout, H2O, and Spark MLlib

Reduce the data size and use traditional algorithms.
- Sampling is a systematic and cost-effective way, sometimes with provable performance:
  - Coresets for k-means and k-median clustering (Har-Peled et al., 2004).
  - Coresets for ℓ1, ℓ2, and ℓp regression (Clarkson, 2005; Drineas et al., 2006; Dasgupta et al., 2009; ...)

However, even the sampling algorithms do not always scale well ...

Page 7: Scalable Simple Random Sampling Algorithms

Outline

1. Simple random sampling without replacement
   - Existing algorithms
   - Algorithm ScaSRS
   - Streaming environments
   - Stratified sampling
   - Empirical evaluation

2. Simple random sampling with replacement
   - Existing algorithms
   - Algorithm ScaSRSWR
   - Streaming environments

Page 9: Scalable Simple Random Sampling Algorithms

Simple random sampling (SRS)

Simple random sampling (Thompson, 2012)

Simple random sampling is a sampling design in which s distinct items are selected from the n items in the population in such a way that every possible combination of s items is equally likely to be the sample selected.

SRS is often used as

a sampling technique,

a building block for complex sampling methods.

Given an item set T, which contains n items t1, ..., tn, and an integer s ≤ n, we want to generate a simple random sample of size s from T.

Page 10: Scalable Simple Random Sampling Algorithms

The draw-by-draw method

Draw-by-draw

1: Set S = ∅.
2: for i from 1 to s do
3:   Select one item t with equal probability from T − S.
4:   Let S := S + {t}.
5: end for
6: Return S.

Selecting one item with equal probability is hard due to
- variable-length records,
- no indices.

Representing T − S is also hard when data is large.

Page 12: Scalable Simple Random Sampling Algorithms

The selection-rejection algorithm

Selection-rejection (Fan, 1962)

1: Set i = 0.
2: for j from 1 to n do
3:   With probability (s − i)/(n − j + 1), select tj and let i := i + 1.
4: end for

Pros:

One-pass.

O(1) storage.

Cons:

Sequential.

Needs both n and s to work.
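
As an illustrative sketch (not from the slides), the selection-rejection pass translates directly into Python; the list input and seeded RNG are my own choices:

```python
import random

def selection_rejection(items, s, seed=0):
    """One-pass SRS without replacement (Fan, 1962).
    Requires both n = len(items) and s up front."""
    rng = random.Random(seed)
    n = len(items)
    sample, i = [], 0  # i = number of items selected so far
    for j, t in enumerate(items, start=1):
        # select t_j with probability (s - i) / (n - j + 1)
        if rng.random() < (s - i) / (n - j + 1):
            sample.append(t)
            i += 1
    return sample
```

The sample size is exactly s: once only s − i slots remain for s − i items, the selection probability reaches 1, and once i = s it drops to 0.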

Page 14: Scalable Simple Random Sampling Algorithms

The reservoir algorithm

Reservoir (Vitter, 1985)

1: Store the first s items into a reservoir R.
2: for i from s + 1 to n do
3:   With probability s/i, replace an item from R chosen with equal probability and let ti take its place.
4: end for
5: Select the items in R.

Pros:

Does not require n.

Cons:

Sequential.

O(s) storage.
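
A minimal Python sketch of the reservoir pass (my own variable names; the stream can be any iterable, so n never needs to be known):

```python
import random

def reservoir_sample(stream, s, seed=0):
    """Reservoir sampling (Vitter, 1985): one sequential pass,
    O(s) storage, no knowledge of n required."""
    rng = random.Random(seed)
    reservoir = []
    for i, t in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(t)              # fill the reservoir first
        elif rng.random() < s / i:           # keep t_i with probability s/i
            reservoir[rng.randrange(s)] = t  # evict a uniformly chosen item
    return reservoir
```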

Page 16: Scalable Simple Random Sampling Algorithms

The random sort algorithm

Random sort (Sunter, 1977)

1: Associate each ti with an independent key ui drawn from U(0, 1).
2: Sort T in ascending order with respect to the key.
3: Select the s items with the smallest keys.

Cons:

A random permutation of the entire data set.

Pros:

The process of generating {uj} is embarrassingly parallel.

Sorting is scalable.
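
The three steps above can be sketched in a few lines of Python (an illustrative, single-machine version):

```python
import random

def random_sort_sample(items, s, seed=0):
    """Random sort (Sunter, 1977): key each item with an independent
    U(0,1) draw, sort by key, keep the smallest s."""
    rng = random.Random(seed)
    keyed = [(rng.random(), t) for t in items]  # embarrassingly parallel step
    keyed.sort()                                # the expensive global step
    return [t for _, t in keyed[:s]]
```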

Page 18: Scalable Simple Random Sampling Algorithms

An example of random sort

Set n = 100, s = 10, and hence the sampling probability p = s/n = 0.1.

1. Generate random keys:
   (0.644, t1), (0.378, t2), ..., (0.587, t10), ..., (0.500, t99), (0.471, t100)

2. Sort and select the smallest 10 items:
   (0.028, t94), (0.029, t44), ..., (0.137, t69), ..., (0.980, t26), (0.988, t60)
   where the first 10 pairs, up to and including (0.137, t69), form the sample.

Fact: the 10th item after the sort is associated with a random key 0.137.

Page 21: Scalable Simple Random Sampling Algorithms

Heuristics

Qualitatively speaking,

if ui is “much larger” than p, then ti is “very unlikely” to be selected;

if ui is “much smaller” than p, then ti is “very likely” to be selected.

Set two thresholds q1 and q2, such that

if ui > q1, reject ti directly;

if ui < q2, select ti directly;

otherwise, put ti onto a waiting list that goes to the sort phase.

The resulting algorithm fails

if we reject more than n − s items,

or if we select more than s items.

Otherwise, it returns the same result as the random sort algorithm.

Page 24: Scalable Simple Random Sampling Algorithms

ScaSRS: a scalable simple random sampling algorithm

1: Let l = 0 and let W = ∅ be the waiting list.
2: for each item ti ∈ T do
3:   Draw a key ui independently from U(0, 1).
4:   if ui < q2 then
5:     Select ti and let l := l + 1.
6:   else if ui < q1 then
7:     Associate ti with ui and add it into W.
8:   end if
9: end for
10: Sort W's items in ascending order of the key.
11: Select the smallest pn − l items from W.

If it succeeds, ScaSRS outputs the same result as the random sort algorithm given the same sequence of random keys.

ScaSRS is embarrassingly parallel.
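
A sequential Python sketch of ScaSRS (the q1/q2 formulas are the ones from the deck's theorem; the default δ and the list input are my own choices for illustration):

```python
import math
import random

def scasrs(items, s, delta=5e-5, seed=0):
    """Sketch of ScaSRS: most items are accepted or rejected on the fly;
    only a small waiting list reaches the sort phase."""
    rng = random.Random(seed)
    n = len(items)
    p = s / n
    g1 = -math.log(delta) / n
    g2 = -2.0 * math.log(delta) / (3.0 * n)
    q1 = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
    q2 = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
    selected, waiting = [], []
    for t in items:
        u = rng.random()
        if u < q2:
            selected.append(t)        # select directly
        elif u < q1:
            waiting.append((u, t))    # defer to the sort phase
        # else: reject directly
    waiting.sort()
    # (on the rare failure event one would fall back to a full random sort)
    selected.extend(t for _, t in waiting[: s - len(selected)])
    return selected, len(waiting)
```

Since the waiting list is O(√s) with high probability, the sort at the end is cheap compared with sorting all of T.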

Page 25: Scalable Simple Random Sampling Algorithms

A quantitative analysis

Theorem

For a fixed δ > 0, if we choose

  q1 = min(1, p + γ1 + √(γ1² + 2γ1p)), where γ1 = −(log δ)/n,
  q2 = max(0, p + γ2 − √(γ2² + 3γ2p)), where γ2 = −(2 log δ)/(3n),

then ScaSRS succeeds with probability at least 1 − 2δ. Moreover, with high probability, it only needs O(√s) storage and runs in O(n) time.

Page 26: Scalable Simple Random Sampling Algorithms

A practical choice of δ

Set δ = 0.00005. We get the following thresholds:

  q1 = min(1, p + 10/n + √(100/n² + 20p/n)),
  q2 = max(0, p + 20/(3n) − √(400/(9n²) + 20p/n)),

and ScaSRS succeeds with probability at least 1 − 2δ = 99.99%.
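
To make the thresholds concrete, here is a quick numeric check (the example values p = 0.05, n = 1e6 are mine):

```python
import math

def scasrs_thresholds(p, n, delta=5e-5):
    """q1/q2 from the theorem; with delta = 0.00005, -log(delta) is about
    9.9, which is where the simplified 10/n and 20/(3n) terms come from."""
    g1 = -math.log(delta) / n
    g2 = -2.0 * math.log(delta) / (3.0 * n)
    q1 = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
    q2 = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
    return q1, q2

q1, q2 = scasrs_thresholds(p=0.05, n=10**6)
# q2 < p < q1, and the gap between them shrinks as n grows
```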

Page 27: Scalable Simple Random Sampling Algorithms

Sketch of proof

Denote by Ui the random key associated with item ti and let Yi = 1{Ui < q1}.

- E[Yi] = q1 and E[Yi²] = q1.
- Y = Σi Yi is the number of un-rejected items during the scan.
- Apply a Bernstein-type inequality (Maurer, 2003):

    log Pr{Y ≤ pn} ≤ −(q1 − p)²n / (2q1).

- With the choice of q1 in our theorem, we have Pr{Y ≤ s} ≤ δ.
- By similar arguments, we can bound the number of selected items during the scan:

    Pr{Σi 1{Ui < q2} ≥ s} ≤ δ.

Page 28: Scalable Simple Random Sampling Algorithms

The size of the waiting list = O(√s), w.h.p.

[Figure: empirical pdfs of the number of un-rejected items and the number of accepted items for n = 1e6, k = 50000, p = 0.05; both concentrate within an O(√k) range around 50000.]

Page 29: Scalable Simple Random Sampling Algorithms

Streaming environments (when only p is given)

If only p is given, we can update the thresholds q1 and q2 on the fly based on the number of items seen so far, denoted by i:

  q1,i = min(1, p + γ1,i + √(γ1,i² + 2γ1,ip)), where γ1,i = −(log δ)/i,
  q2,i = max(0, p + γ2,i − √(γ2,i² + 3γ2,ip)), where γ2,i = −(2 log δ)/(3i).

O(log n + √(s log n)) storage

at least 1 − 2δ success rate

not necessary to know the exact s, just a good lower bound:
- use the local count on each process,
- or a global count updated less frequently.
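
The steps above can be sketched as a streaming variant of the earlier algorithm (illustrative single-stream version; the per-item threshold updates follow the formulas above, and the final s is derived from p at the end of the stream):

```python
import math
import random

def streaming_scasrs(stream, p, delta=5e-5, seed=0):
    """Streaming sketch when only p is known: q1 and q2 are recomputed
    from i, the number of items seen so far, instead of from n."""
    rng = random.Random(seed)
    selected, waiting = [], []
    i = 0
    for i, t in enumerate(stream, start=1):
        g1 = -math.log(delta) / i
        g2 = -2.0 * math.log(delta) / (3.0 * i)
        q1 = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
        q2 = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
        u = rng.random()
        if u < q2:
            selected.append(t)        # safe to accept even without n
        elif u < q1:
            waiting.append((u, t))
    s = int(p * i)                    # exact sample size known only at the end
    waiting.sort()
    selected.extend(t for _, t in waiting[: s - len(selected)])
    return selected
```

For small i the thresholds are very conservative (q2,i is 0 and q1,i is large), so early items mostly go to the waiting list; the thresholds tighten as i grows.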

Page 30: Scalable Simple Random Sampling Algorithms

Streaming environments (when only s is given)

When only s is given, we can no longer accept items on the fly because the sampling probability could be arbitrarily small. However, we can still reject items on the fly based on s and i:

  q1,i = min(1, s/i + γ1,i + √(γ1,i² + 2γ1,i · s/i)), where γ1,i = −(log δ)/i.

s(log n + 1) + O(√s + log n) storage

at least 1 − δ success rate

Page 31: Scalable Simple Random Sampling Algorithms

Stratified sampling

If the item set is heterogeneous, it may be possible to partition it into several non-overlapping homogeneous subsets, called strata. Applying SRS within each stratum is preferred to applying SRS to the entire set for better representativeness. This approach is called stratified sampling.

Applications:

U.S. Census survey

political survey

Stratification:

based on training labels

based on days of the week

Page 32: Scalable Simple Random Sampling Algorithms

Stratified sampling (cont.)

Applying ScaSRS to stratified sampling is straightforward. Let m be the number of strata. We have the following result:

- If the size of each stratum is given, we need O(√(ms)) storage.
- If only the sampling probability p is given, we need O(m log n + √(ms log n)) storage.
- If only the sample size s is given, we need s(log n + 1) + O(√(sm log n)) storage.
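
A minimal sketch of the per-stratum approach (illustrative only: records are routed by a stratum key, and plain random sort stands in for the within-stratum sampler that ScaSRS would replace; the parity-based strata are a made-up example):

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, p, seed=0):
    """Stratified sampling sketch: route records by stratum, then draw an
    exact SRS of size round(p * stratum size) inside each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[stratum_of(r)].append((rng.random(), r))
    sample = []
    for keyed in strata.values():
        keyed.sort()                                   # random sort per stratum
        sample.extend(r for _, r in keyed[: round(p * len(keyed))])
    return sample

# hypothetical data: stratify 1000 integers by parity, p = 0.1
out = stratified_sample(list(range(1000)), stratum_of=lambda r: r % 2, p=0.1)
```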

Page 34: Scalable Simple Random Sampling Algorithms

Empirical evaluation: MapReduce implementation

1: Set l = 0.
2: function map(ti)
3:   Generate ui from U(0, 1).
4:   if ui < q2 then
5:     Select (and output directly) ti.
6:     l := l + 1.
7:   else if ui < q1 then
8:     Emit (ui, ti).
9:   end if
10: end function
11: function reduce(..., (ui, ti), ...)
12:   Select the first pn − l items.
13: end function
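
The map/reduce split can be simulated in pure Python (an illustrative sketch, not Hadoop/Spark code; the partitioning scheme and per-mapper seeds are my own):

```python
import math
import random

def scasrs_mapreduce(partitions, p, n, delta=5e-5):
    """Simulation of the MapReduce formulation: each mapper filters its
    partition independently; one reducer sorts the small waiting list."""
    g1 = -math.log(delta) / n
    g2 = -2.0 * math.log(delta) / (3.0 * n)
    q1 = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
    q2 = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
    selected, shuffled = [], []
    for pid, part in enumerate(partitions):   # map phase, parallelizable
        rng = random.Random(pid)              # independent per-mapper RNG
        for t in part:
            u = rng.random()
            if u < q2:
                selected.append(t)            # output directly
            elif u < q1:
                shuffled.append((u, t))       # emitted to the reducer
    shuffled.sort()                           # reduce phase
    s = int(p * n)
    selected.extend(t for _, t in shuffled[: s - len(selected)])
    return selected

parts = [range(k, k + 25000) for k in range(0, 100000, 25000)]
sample = scasrs_mapreduce(parts, p=0.01, n=100000)
```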

Page 35: Scalable Simple Random Sampling Algorithms

Empirical evaluation: simple random sampling

                     P1     P2     P3     P4     P5     P6
n                  6.0e7  6.0e7  3.0e8  3.0e8  1.5e9  1.5e9
p                   0.01    0.1   0.01    0.1   0.01    0.1
s                  6.0e5  6.0e6  3.0e6  3.0e7  1.5e7  1.5e8
Selection-Rejection  281    355   1371   1475  >3600  >3600
Reservoir            288    299   1285   1571  >3600  >3600
Random Sort          513    581   1629   2344  >3600  >3600
ScaSRS                96    103    126    127    140    158
ScaSRSp               98    114    109    139    162    214
|W|                6.9e3  2.2e4  1.6e4  4.9e4  3.4e4  1.1e5
|Wp|               5.8e4  1.8e5  2.9e5  9.1e5  1.5e6  4.5e6

Table: Test problems, running times (in seconds), and waiting list sizes.

Page 36: Scalable Simple Random Sampling Algorithms

Empirical evaluation: stratified sampling

Problem and setup:

23.25 billion page-view events, 7 terabytes, 8 strata.

The ratio between the size of the largest stratum and that of the smallest stratum is approximately 15000.

p = 0.01.

3000 mappers and 5 reducers.

Result:

509 seconds.

Within the waiting list, the ratio between the size of the largest stratum and that of the smallest stratum is 861.2.

Page 37: Scalable Simple Random Sampling Algorithms

Summary for ScaSRS

based on the random sort algorithm

uses probabilistic thresholds to decide on the fly whether to select, reject, or wait-list an item, independently of the others

embarrassingly parallel

high success rate and O(√s) storage

works in streaming environments

extends to stratified sampling

straightforward MapReduce implementation

Page 38: Scalable Simple Random Sampling Algorithms

Outline

1. Simple random sampling without replacement
   - Existing algorithms
   - Algorithm ScaSRS
   - Streaming environments
   - Stratified sampling
   - Empirical evaluation

2. Simple random sampling with replacement
   - Existing algorithms
   - Algorithm ScaSRSWR
   - Streaming environments

Page 39: Scalable Simple Random Sampling Algorithms

Simple random sampling with replacement (SRSWR)

A simple random sample with replacement (SRSWR) of size s from a population of n items can be thought of as drawing s independent samples of size 1, where each of the s items in the sample is selected from the population with equal probability.

An item may appear more than once in the sample.

Equivalent to sampling from Multinomial(s, (1/n, 1/n, ..., 1/n)).

Applications:

bootstrapping

ensemble methods

generating random tuples
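
In code, SRSWR is just s independent uniform draws (illustrative sketch; the tiny example population is mine):

```python
import random

def srswr(items, s, seed=0):
    """SRSWR: s independent uniform draws from the population,
    i.e., one draw from Multinomial(s, (1/n, ..., 1/n))."""
    rng = random.Random(seed)
    return [items[rng.randrange(len(items))] for _ in range(s)]

sample = srswr(list(range(10)), 20)
# with s = 20 draws from n = 10 items, repeats are guaranteed
```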

Page 41: Scalable Simple Random Sampling Algorithms

The draw-by-draw method

Draw-by-draw

1: Set S = ∅.
2: for i from 1 to s do
3:   Select one item t with equal probability from T.
4:   Let S := S + {t}.
5: end for
6: Return S.

Selecting one item with equal probability is hard due to
- variable-length records,
- no indices.

Page 42: Scalable Simple Random Sampling Algorithms

The Poisson-approximation algorithm

Poisson-approximation (Laserson, 2013)

1: for each item ti ∈ T do
2:   Generate a number si from distribution Pois(p).
3:   if si > 0 then
4:     Repeat ti for si times in the sample.
5:   end if
6: end for

Pros:

One-pass.

O(1) storage.

Embarrassingly parallel.

Cons:

Variable sample size.
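
A stdlib-only sketch of the Poisson-approximation pass (the small-rate Poisson sampler via Knuth's method is my own helper, adequate here because p is tiny):

```python
import math
import random

def poisson_draw(lam, rng):
    """Pois(lam) via Knuth's product-of-uniforms method."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def poisson_approx_sample(items, p, seed=0):
    """Laserson's Poisson approximation: every item independently appears
    Pois(p) times, so the total sample size is Pois(p*n), not exactly p*n."""
    rng = random.Random(seed)
    sample = []
    for t in items:
        sample.extend([t] * poisson_draw(p, rng))
    return sample

out = poisson_approx_sample(list(range(100000)), p=0.01)
# len(out) fluctuates around p*n = 1000 from seed to seed
```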

Page 44: Scalable Simple Random Sampling Algorithms

The Poisson-approximation algorithm (cont.)

- If Yi ∼ Pois(p), i = 1, ..., n, are independent, then conditioned on Σi Yi = s, (Y1, Y2, ..., Yn) follows Multinom(s, (1/n, 1/n, ..., 1/n)).
  - So if the sample from the Poisson-approximation algorithm happens to have size s = pn, it is a simple random sample with replacement.
- If Xi ∼ Pois(λi), i = 1, ..., n, are independent and λ = Σi λi, we have Y = Σi Xi ∼ Pois(λ).
  - So the size of the sample from the Poisson-approximation algorithm follows distribution Pois(pn).

Page 46: Scalable Simple Random Sampling Algorithms

How to obtain the exact sample size?

To get the exact sample size, we follow an approach similar to what we have in ScaSRS.

Generate a Poisson sequence to pre-accept items on the fly, where each value follows Pois(p1) independently for some p1 < p.

Generate another Poisson sequence to wait-list items, where each value follows Pois(p2) independently for some p2 > 0.

Let a be the number of items we pre-accepted. Select a simple random sample without replacement of size s − a from the waiting list and merge it into the final sample.

Page 49: Scalable Simple Random Sampling Algorithms

ScaSRSWR: a scalable algorithm for SRSWR

1: Choose δ ∈ (0, 1), p1 < p such that F_Pois(p1·n)(s) ≥ 1 − δ, and p2 > 0 such that F_Pois((p1+p2)·n)(s) ≤ δ.
2: function map(ti)
3:   Generate a number s1i from distribution Pois(p1).
4:   Include ti for s1i times in the sample.
5:   Generate a number s2i from distribution Pois(p2).
6:   for j ∈ {1, ..., s2i} do
7:     Draw a value u from U(0, 1) and emit (u, ti).
8:   end for
9: end function
10: function reduce(..., (uk, tk), ...)
11:   Let a be the number of items accepted in step 4.
12:   Select the first s − a items ordered by key.
13: end function
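
A sequential Python sketch of the idea (illustrative: instead of inverting Poisson CDFs to pick p1 and p2, it borrows the closed-form choices p1 = p − 5√(p/n), p2 = 10√(p/n) from the streaming analysis later in the deck):

```python
import math
import random

def poisson_draw(lam, rng):
    """Pois(lam) via Knuth's product-of-uniforms method."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def scasrswr(items, s, seed=0):
    """ScaSRSWR sketch: pre-accept Pois(p1) copies per item, wait-list
    Pois(p2) keyed copies, then top up to exactly s from the waiting list."""
    rng = random.Random(seed)
    n = len(items)
    p = s / n
    p1 = max(0.0, p - 5.0 * math.sqrt(p / n))
    p2 = 10.0 * math.sqrt(p / n)
    sample, waiting = [], []
    for t in items:
        sample.extend([t] * poisson_draw(p1, rng))   # pre-accepted copies
        for _ in range(poisson_draw(p2, rng)):       # wait-listed copies
            waiting.append((rng.random(), t))
    waiting.sort()
    sample.extend(t for _, t in waiting[: s - len(sample)])
    return sample
```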

Page 50: Scalable Simple Random Sampling Algorithms

ScaSRSWR: a scalable algorithm for SRSWR (cont.)

Theorem

For a fixed δ > 0, ScaSRSWR outputs a simple random sample with replacement of size s with probability at least 1 − 2δ. Moreover, with high probability, it only needs O(√s) storage and runs in O(n) time.

The algorithm fails to output a sample of size s if

it pre-accepted too many items, i.e., a > s,

or it wait-listed too few items, i.e., a + w < s, where w is the size of the waiting list.

So the overall failure rate is at most 2δ. Given a ≤ s and a + w ≥ s, we can prove that the output is a simple random sample with replacement.

Page 51: Scalable Simple Random Sampling Algorithms

Streaming environments

It is possible to extend ScaSRSWR to a streaming environment with some tweaks. Assuming that only p is given, we need three Poisson sequences:

1. The first sequence generates pre-accepted items, where X1i ∼ Pois(p1i), which means including ti for X1i times.

2. The second sequence gets "merged" with the first sequence at the end of the stream such that each element in the merged sequence follows Pois(p1), which is the same as the non-streaming case.

3. The third sequence is collected at the end of the stream and transformed such that each element in the transformed sequence follows Pois(p2), which is the same as the non-streaming case.

Page 54: Scalable Simple Random Sampling Algorithms

How to merge two Poisson sequences?

- If Xi ∼ Pois(λi) are independent and λ = Σi λi, then Y = Σi Xi ∼ Pois(λ).
  - Suppose the ith item in the first sequence follows Pois(p1i).
  - We need the ith item in the second sequence to follow Pois(p1 − p1i) in order to have the sum follow Pois(p1). However, p1 depends on n, which is unknown until we reach the end of the stream.
- If X ∼ Pois(λ) and Y ∼ Binom(X, p), then Y ∼ Pois(λp).
  - We can transform the second sequence at the end of the stream to make each item in the merged sequence follow Pois(p1), as long as p1i ≤ p1 and p1i + pc1i ≥ p1 for i = 1, ..., n:

    If X1i ∼ Pois(p1i), Xc1i ∼ Pois(p − p1i), and Yc1i ∼ Binom(Xc1i, (p1 − p1i)/(p − p1i)), then X1i + Yc1i ∼ Pois(p1).
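
The thinning identity is easy to check empirically. This Monte Carlo sketch (parameters λ = 2.0 and q = 0.3 are my own choices) verifies that both the mean and the variance of the thinned variable match Pois(λq), as Poisson equidispersion requires:

```python
import math
import random

def poisson_draw(lam, rng):
    """Pois(lam) via Knuth's product-of-uniforms method."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def binomial_draw(n, q, rng):
    """Binom(n, q) as n Bernoulli trials; fine for the small counts here."""
    return sum(rng.random() < q for _ in range(n))

rng = random.Random(0)
lam, q = 2.0, 0.3
# thinning: X ~ Pois(2.0), Y | X ~ Binom(X, 0.3)  =>  Y ~ Pois(0.6)
draws = [binomial_draw(poisson_draw(lam, rng), q, rng) for _ in range(200000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
```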

Page 56: Scalable Simple Random Sampling Algorithms

Streaming environments

For any λ > 100, it is easy to verify the following bounds:

  Pr{X ≤ λ} > 0.99995,  X ∼ Pois(λ − 5√λ),
  Pr{X < λ} < 1 − 0.99995,  X ∼ Pois(λ + 5√λ).

Assuming that n > n0 > 100/p, we are going to choose p1 = p − 5√(p/n) and p2 = 10√(p/n). For each i ∈ {1, ..., n}, we set

  p1i = p − 5√(p/max(i, n0)),
  pc1i = p − p1i = 5√(p/max(i, n0)),
  p2i = 10√(p/max(i, n0)).

The algorithm succeeds with probability at least 0.9999. The expected storage is

  Σ_{i=1}^{n} (pc1i + p2i) ≤ Σ_{i=1}^{n} 15√(p/i) = O(√(pn)) = O(√s).

Page 57: Scalable Simple Random Sampling Algorithms

Streaming ScaSRSWR (when only p is given)

1: function map(ti)
2:   Let p1i = p − 5√(p/max(i, n0)) and generate s1i ∼ Pois(p1i).
3:   Include ti for s1i times in the sample.
4:   Let pc1i = p − p1i and generate sc1i ∼ Pois(pc1i).
5:   Emit (0, (p1i, sc1i, ti)) if sc1i > 0.
6:   Let p2i = 10√(p/max(i, n0)) and generate s2i ∼ Pois(p2i).
7:   Emit (1, (p2i, s2i, ti)) if s2i > 0.
8: end function
9: Compute p1 = p − 5√(p/n) and p2 = 10√(p/n).
10: function reduce(0, [..., (p1k, sc1k, tk), ...])
11:   Generate s̄c1k ∼ Binom(sc1k, (p1 − p1k)/(p − p1k)) and include tk for s̄c1k times.
12: end function
13: Let a be the number of accepted items and W = ∅.
14: function reduce(1, [..., (p2k, s2k, tk), ...])
15:   Generate s̄2k ∼ Binom(s2k, p2/p2k) and add tk to W for s̄2k times.
16: end function
17: Select a simple random sample of size s − a from W and output it.

Page 60: Scalable Simple Random Sampling Algorithms

Streaming ScaSRSWR (when only s is given)

When only s is given, we can no longer accept items on the fly because the sampling probability could be arbitrarily small. However, we can still generate a Poisson sequence as the waiting list W:

- Choose Xi ∼ Pois((s + 5√s)/max(i, n0)).
- At the end of the stream, we can adjust the sequence using Binomial values to make the Poisson numbers i.i.d.
- The sum of the adjusted sequence is greater than s with high probability.
- Storage: O(Σi (s + 5√s)/max(i, n0)) = O(s log n).

After adjustment, generate a simple random sample without replacement of size s from W and output it as the final sample.

Page 61: Scalable Simple Random Sampling Algorithms

Summary

Scalable simple random sampling algorithms — ScaSRS and ScaSRSWR:

independently select, reject, or wait-list an item on the fly

embarrassingly parallel

high success rate and O(√s) storage

streaming environments

extension to stratified sampling

open-source implementations:
- Apache Spark: https://github.com/mengxr/spark-sampling/
- Apache DataFu: http://datafu.incubator.apache.org/
