Hokusai: Sketching streams in real time. Sergiy Matusevych (1), Alexander J. Smola (2), Amr Ahmed (2). (1) Yahoo! Research, Santa Clara, CA; (2) Google, Mountain View, CA. UAI 2012.

Hokusai - Sketching streams in real time


Page 1: Hokusai - Sketching streams in real time

Hokusai: Sketching streams in real time

Sergiy Matusevych (1)

Alexander J. Smola (2)

Amr Ahmed (2)

(1) Yahoo! Research, Santa Clara, CA
(2) Google, Mountain View, CA

UAI 2012

Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 1

Page 2: Hokusai - Sketching streams in real time

Thanks

Alex Smola, Google and CMU

Amr Ahmed, Google



Page 4: Hokusai - Sketching streams in real time

Motivation

- Compute frequencies of elements in the data stream
  - Item frequencies change over time.
  - Number of items unknown and variable.
  - Example: logging query frequency over time.

- Applications
  - Flow counting for IP traffic (who sent what, when, and how much)
  - Spam detection and filtering (detect bursts immediately)
  - Website analytics (feedback to editors, trend detection)

- State of the art
  - CountMin sketch is instantaneous but does not log time.
  - Naive snapshotting costs linear memory.
  - MapReduce batch job provides exact counts but with long delays.

- Resource constraints
  - Fixed memory footprint for the entire sketch regardless of duration
  - High query throughput
  - Real-time aggregation and response


Page 5: Hokusai - Sketching streams in real time

Strategy

1. Use CountMin sketch to store snapshots of data (this solves the real-time logging problem).

2. Compress snapshots linearly as they age.
   - We care most about recent events.
   - Logarithmic storage since $\sum_{t=1}^{T} t^{-1} = O(\log T)$.

3. Exploit the CountMin data structure for efficient compression.
   - Variant 1: reduce storage per snapshot
   - Variant 2: increase timespan per snapshot

4. Interpolate between both variants for improved accuracy.
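The logarithmic storage claim in step 2 can be checked numerically. This is only a sketch: it assumes that a snapshot of age t is allotted a 1/t share of a full snapshot, and the name `harmonic_storage` is illustrative rather than taken from the slides.

```python
import math

def harmonic_storage(T: int) -> float:
    """Total storage, in units of one full snapshot, when the snapshot of
    age t keeps a 1/t share (assumed reading of 'compress linearly')."""
    return sum(1.0 / t for t in range(1, T + 1))

for T in (10, 1_000, 100_000):
    total = harmonic_storage(T)
    # The harmonic sum lies between ln(T) and ln(T) + 1, hence O(log T).
    print(T, round(total, 3), round(math.log(T) + 1, 3))
```

Growing the history by a factor of 100 adds only about 4.6 snapshot-equivalents of storage under this assumption.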


Page 6: Hokusai - Sketching streams in real time

CountMin Sketch (Cormode & Muthukrishnan)

[Figure: sketch matrix $M \in \mathbb{R}^{d \times n}$ with $d$ hash functions $h_1, \dots, h_d$ (rows) and $n$ bins (columns); an item $x$ maps to one bin per row.]

- In-memory data structure for instantaneous retrieval

- Aggregate statistic of observation interval (instantaneous retrieval)

- Intuition: Bloom filter with integers

Algorithm

insert(x):
  for i = 1 to d do
    $M[i, h_i(x)] \leftarrow M[i, h_i(x)] + 1$
  end for

query(x):
  $\hat{n}_x \leftarrow \min_{i \in \{1, \dots, d\}} M[i, h_i(x)]$
  return $\hat{n}_x$
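The insert and query routines above could look roughly as follows in Python. This is a minimal sketch, not the authors' implementation: the slides do not specify the hash family, so the salted BLAKE2b hash in `_bin` is an assumption, as are the class and method names.

```python
import hashlib

class CountMinSketch:
    """Minimal CountMin sketch: d rows (hash functions) by n bins."""

    def __init__(self, d: int, n: int):
        self.d, self.n = d, n
        self.M = [[0] * n for _ in range(d)]

    def _bin(self, i: int, x: str) -> int:
        # Assumed hash family: BLAKE2b salted with the row index i.
        digest = hashlib.blake2b(x.encode("utf-8"), digest_size=8,
                                 salt=i.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.n

    def insert(self, x: str, count: int = 1) -> None:
        # Increment one bin in every row.
        for i in range(self.d):
            self.M[i][self._bin(i, x)] += count

    def query(self, x: str) -> int:
        # Collisions only ever add to a bin, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(self.M[i][self._bin(i, x)] for i in range(self.d))

sketch = CountMinSketch(d=4, n=1024)
for word in ["wave", "wave", "fuji", "wave"]:
    sketch.insert(word)
print(sketch.query("wave"))  # at least 3; the sketch never undercounts
```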


Page 7: Hokusai - Sketching streams in real time

Guarantees

[Figure: sketch matrix $M \in \mathbb{R}^{d \times n}$ with $d$ hash functions $h_1, \dots, h_d$ (rows) and $n$ bins (columns); an item $x$ maps to one bin per row.]

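- Approximation guarantee
  For a sketch with $d = \lceil \log \frac{1}{\delta} \rceil$ and $n = \lceil \frac{e}{\varepsilon} \rceil$ we have, with probability $1 - \delta$, that the estimate $\hat{n}_x$ deviates from the count $n_x$ via

  $$n_x \le \hat{n}_x \le n_x + \varepsilon \sum_{x'} n_{x'} \quad \text{for all } x.$$

- Linear statistic of the data
- Power law distributions with exponent $z$ only use $O(N \varepsilon^{-1/z})$ space.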
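As a worked example of the sizing rule in the guarantee, the helper below turns a target error and failure probability into sketch dimensions. It is a sketch under the assumption that the logarithm is the natural logarithm; the function name `countmin_dimensions` is illustrative.

```python
import math

def countmin_dimensions(epsilon: float, delta: float) -> tuple:
    """Return (d, n) for additive error epsilon * (total count),
    holding with probability at least 1 - delta."""
    d = math.ceil(math.log(1.0 / delta))  # number of hash functions (rows)
    n = math.ceil(math.e / epsilon)       # number of bins per row
    return d, n

d, n = countmin_dimensions(epsilon=0.001, delta=0.01)
print(d, n)  # 5 2719
```

With these settings each estimate overshoots the true count by at most 0.1% of the total stream length, except with probability 1%.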


Page 8: Hokusai - Sketching streams in real time

Step 1: Combining time intervals

[Figure: sketch matrix $M \in \mathbb{R}^{d \times n}$ with $d$ hash functions $h_1, \dots, h_d$ (rows) and $n$ bins (columns); an item $x$ maps to one bin per row.]

- $M_T$ and $M_{T'}$ are sketches over time intervals $T$ and $T'$ with $T \cap T' = \emptyset$.
- Combine sketches by adding them up: $M_{T \cup T'} = M_T + M_{T'}$.
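Because every insert is a plain increment, two sketches built with the same hash functions can be merged entry by entry. A minimal illustration follows, with sketches represented as lists of rows (a representation chosen here for brevity, not taken from the slides):

```python
def add_sketches(a, b):
    """Entrywise sum of two sketches that share dimensions and hash functions."""
    assert len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

m_t = [[1, 0, 2, 0], [0, 3, 0, 0]]         # sketch for interval T
m_t_prime = [[0, 4, 1, 0], [2, 0, 0, 1]]   # sketch for a disjoint interval T'
combined = add_sketches(m_t, m_t_prime)
print(combined)  # [[1, 4, 3, 0], [2, 3, 0, 1]]
```

The combined matrix is identical to the one obtained by inserting both intervals' items into a single sketch.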


Page 9: Hokusai - Sketching streams in real time

Step 1: Efficient computation

- Keep aggregates for time intervals of length $\{1, 1, 2, 4, 8, \dots, 2^m\}$.
- Insert into the leftmost aggregation interval.
- Aggregate as cumulative sum from the left using $1 + \sum_{i=0}^{n} 2^i = 2^{n+1}$.
- Computation is $\sum_{n=1}^{\infty} n \cdot 2^{-n} = O(1)$ amortized time, $O(\log t)$ space.

[Figure: aggregation intervals of length 1, 1, 2, 4 being merged from the left as time advances.]
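One possible reading of the cumulative-sum update is sketched below. It is an illustration under stated assumptions, not the authors' code: each sketch is flattened to a list of counts, and `time_aggregate`, `levels`, and `carry` are names introduced here.

```python
def time_aggregate(unit_sketches, levels):
    """Fold a stream of per-interval sketches into dyadic aggregates.

    unit_sketches: iterable of flat count lists, one per unit time interval.
    Returns M, where M[j] sums the 2**j unit intervals ending at the last
    time step divisible by 2**j (None if that level was never filled).
    """
    M = [None] * levels
    t = 0
    for unit in unit_sketches:
        carry = list(unit)  # sketch of the newest unit interval
        t += 1
        j = 0
        # Level j is refreshed only when t is a multiple of 2**j, so the
        # expected work per step is a constant (amortized O(1)).
        while j < levels and t % (2 ** j) == 0:
            old = M[j] if M[j] is not None else [0] * len(carry)
            # M[j] takes the running sum; the running sum absorbs the old M[j].
            M[j], carry = carry, [a + b for a, b in zip(carry, old)]
            j += 1
        # carry is discarded here, so history beyond the top level is lost.
    return M

# Use the time index itself as the count so the sums are easy to read.
print(time_aggregate([[t] for t in range(1, 6)], levels=3))  # [[5], [7], [10]]
```

After five steps the levels hold step 5, steps 3 to 4, and steps 1 to 4 respectively.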


Page 12: Hokusai - Sketching streams in real time

Step 2: Folding over

[Figure: sketch matrix $M \in \mathbb{R}^{d \times n}$ with $d$ hash functions $h_1, \dots, h_d$ (rows) and $n$ bins (columns); an item $x$ maps to one bin per row.]

- $M^b$ is a sketch with $n = 2^b$ bins.

- $M^{b-1}$ can be obtained as

  $$M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$$

  by "folding over" the sketch.
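The folding rule can be illustrated with a few lines of Python. This is a sketch with zero-based indices and sketches stored as lists of rows; the function name `fold` is an assumption made here.

```python
def fold(sketch):
    """Halve the number of bins: new[i][j] = old[i][j] + old[i][j + n // 2]."""
    n = len(sketch[0])
    assert n % 2 == 0, "bin count must be even (a power of two in the slides)"
    half = n // 2
    return [[row[j] + row[j + half] for j in range(half)] for row in sketch]

m3 = [[1, 0, 2, 0, 0, 5, 0, 1],   # 2**3 bins per row
      [0, 3, 0, 0, 4, 0, 0, 2]]
m2 = fold(m3)                      # 2**2 bins per row
print(m2)  # [[1, 5, 2, 1], [4, 3, 0, 2]]
```

An item that hashed to bin 5 of the wide sketch is found in bin 5 mod 4 = 1 of the folded one, so queries keep working with the same hash function reduced modulo the new width.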


Page 13: Hokusai - Sketching streams in real time

Step 2: Efficient computation

- Halve the size of the sketch every $2^t$ intervals.
- Computation costs $O(1)$ time and $O(\log t)$ space.

[Figure: item aggregation. Interval 1 is stored as 1 x 16 bins, intervals 2 and 3 as 2 x 8 bins, intervals 4 to 7 as 4 x 4 bins, and so on.]
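The halving schedule in the figure can be reproduced with a small helper. This is a sketch under the assumption that an interval of age a keeps the initial bin count divided by $2^{\lfloor \log_2 a \rfloor}$, which matches the 16, 8, 8, 4, 4, 4, 4 pattern shown; the floor of one bin and the name `bins_at_age` are choices made here.

```python
def bins_at_age(age: int, initial_bins: int) -> int:
    """Bins kept for the unit interval that is `age` steps old (age >= 1)."""
    halvings = age.bit_length() - 1          # floor(log2(age))
    return max(initial_bins >> halvings, 1)  # assumed floor of one bin

ages = range(1, 8)
print([bins_at_age(a, 16) for a in ages])     # [16, 8, 8, 4, 4, 4, 4]
print(sum(bins_at_age(a, 16) for a in ages))  # 48, i.e. 3 levels of 16 bins
```

Each doubling of the history adds one more level of the same total width, which is where the logarithmic space bound comes from.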


Page 14: Hokusai - Sketching streams in real time

Step 3: Resolution Interpolation

- Time aggregation reports good estimate over long time interval.

- Item aggregation reports poor estimate over short time interval.

- Marginals of joint distribution: assume independence and interpolate.

[Figure: joint counts with marginals $n(t)$ and $n(x)$ and total $n$.]

- Torso and tail
  - Item aggregated estimate $n_x$
  - Time aggregated estimate $n_t$
  - Count interpolation

    $$n_{xt} = \frac{n_x \cdot n_t}{n} \quad \text{where} \quad n = \sum_t n_t = \sum_x n_x$$

- Head
  - Sketch accuracy decreases with $e \cdot t$
  - Use regular CountMin sketch whenever

    $$n(x, t) > e \cdot t \cdot 2^{-b}$$
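The head versus torso-and-tail rule can be written as a small decision function. This is a sketch only: it assumes that `direct` is the count read from the regular CountMin sketch, that `t` and `b` are the quantities appearing in the threshold above, and all function and parameter names are illustrative.

```python
import math

def interpolate_count(n_x: float, n_t: float, n: float) -> float:
    """Independence-based estimate n_x * n_t / n from the two marginals."""
    return n_x * n_t / n if n > 0 else 0.0

def estimate(direct: float, n_x: float, n_t: float, n: float,
             t: int, b: int) -> float:
    """Use the direct sketch value for heavy items, else interpolate."""
    threshold = math.e * t / 2 ** b
    if direct > threshold:
        return direct
    return interpolate_count(n_x, n_t, n)

print(interpolate_count(30, 200, 1000))          # 6.0
print(estimate(500, 30, 200, 1000, t=8, b=4))    # 500 (head item)
print(estimate(1, 30, 200, 1000, t=8, b=4))      # 6.0 (interpolated)
```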


Page 15: Hokusai - Sketching streams in real time

Setup and Throughput

[Figure: term frequency histograms (number of unique terms vs. term frequency, log-log axes). Web query data, 5 days sample: 97.9M unique terms, 378.1M total. Wikipedia data: 4.5M unique terms, 1291.5M total.]

Configuration

- Platform
  - 64-bit Linux
  - 4-core 2GHz x86
  - 16GB RAM
  - Gigabit network

- Sketch setup
  - 4 hash functions
  - $2^{23}$ bins
  - $2^{11}$ aggregation intervals (7 days in 5 minute intervals)

- 3-gram interpolation: 12GB sketch with
  - 3 hash functions
  - $2^{30}$ bins
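The configuration numbers can be sanity-checked with a little arithmetic. The counter width is not stated on the slide; the sketch below assumes 4-byte counters, which is the width that reproduces the quoted 12GB figure.

```python
BYTES_PER_COUNTER = 4  # assumption: 32-bit counters (not stated on the slide)

def sketch_bytes(hash_functions: int, log2_bins: int) -> int:
    """Memory for one sketch with the given number of rows and 2**log2_bins bins."""
    return hash_functions * (1 << log2_bins) * BYTES_PER_COUNTER

print(sketch_bytes(3, 30) / 2**30)    # 12.0 (GiB) for the 3-gram sketch
print(sketch_bytes(4, 23) / 2**20)    # 128.0 (MiB) per full-resolution snapshot
intervals = 7 * 24 * 60 // 5          # 5 minute intervals in 7 days
print(intervals, intervals <= 2**11)  # 2016 True
```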


Page 16: Hokusai - Sketching streams in real time

Setup and Throughput

[Figure: term frequency histograms for the web query data (97.9M unique terms, 378.1M total) and the Wikipedia data (4.5M unique terms, 1291.5M total), repeated from the previous slide.]

Speed

- Software
  - Client-server system
  - ICE middleware
  - 1 server, 10 clients

- Throughput/s
  - 50k inserts
  - 22k requests (time aggregation)
  - 8.5k requests (resolution interp.)

- Limiting factors
  - TCP/IP overhead
    - Package query
  - Memory latency
    - Random access


Page 17: Hokusai - Sketching streams in real time

Accuracy (aggregate absolute error $\hat{n} - n$)


Page 18: Hokusai - Sketching streams in real time

Accuracy (stratified absolute error $\hat{n} - n$)


Page 19: Hokusai - Sketching streams in real time

Sketching for Graphical Models

- Goal
  - Observe stream of observations
  - Estimate joint probability in $O(1)$ time
  - CountMin is good for head but interpolation better for torso and tail

- General strategy
  - Markov network with junction tree: cliques $\mathcal{C}$ and separator sets $\mathcal{S}$.
  - Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate

    $$p(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}.$$

  - Estimates are fast: only lookup in CountMin sketch. No need to solve convex program for graphical model inference.

- Markov chain
  - Unigrams (three cliques, no separator): $p(abc) \approx n^{-3} \cdot n_a \cdot n_b \cdot n_c$
  - Bigrams (two cliques, one separator): $p(abc) \approx n^{-1} \cdot \frac{n_{ab} \cdot n_{bc}}{n_b}$

Backoff smoothing (e.g. Kneser-Ney) in practice.
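To make the Markov chain approximations concrete, the sketch below estimates a trigram count from unigram and bigram counts, using $p(a)p(b)p(c)$ and $p(ab)\,p(bc)/p(b)$ scaled by $n$. It uses exact counts from `collections.Counter` on a toy sentence instead of CountMin lookups, and it treats the number of tokens as $n$ for every n-gram order; both are simplifying assumptions, and the function names are illustrative.

```python
from collections import Counter

tokens = "the great wave off the great coast".split()
n = len(tokens)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def count_from_unigrams(a, b, c):
    """Expected trigram count if the three words were independent."""
    return unigrams[a] * unigrams[b] * unigrams[c] / n ** 2

def count_from_bigrams(a, b, c):
    """Expected trigram count under a first-order Markov chain."""
    if unigrams[b] == 0:
        return 0.0
    return bigrams[(a, b)] * bigrams[(b, c)] / unigrams[b]

print(count_from_unigrams("the", "great", "wave"))  # 4/49, about 0.08
print(count_from_bigrams("the", "great", "wave"))   # 1.0, the true count here
```

In the streaming setting each `Counter` lookup would be replaced by a sketch query, so the estimate still costs a constant number of lookups.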


Page 20: Hokusai - Sketching streams in real time

n-gram Interpolation

- Trigram approximation

- Wikipedia dataset (1291.5M terms, 405M unique trigrams)

                                 Absolute error       Relative error
  Unigram approximation          2.50 · 10^7          0.266
  Bigram approximation           1.22 · 10^6          0.013
  Trigram sketching (CountMin)   8.35 · 10^6          0.089

- Sketching trigrams is not accurate enough on the tail.


Page 21: Hokusai - Sketching streams in real time

Summary

- Fast and simple algorithm to aggregate statistics of data streams.
- Effective compressed representation of the temporal data.
- Works well for graphical models.
- High-performance scalable implementation with $O(1)$ time access.
- Can be distributed over many servers.

Hokusai Katsushika

Great Wave off Kanagawa
