Hokusai - Sketching streams in real time

Hokusai - Sketching streams in real time

Sergiy Matusevych (1), Alexander J. Smola (2), Amr Ahmed (2)

(1) Yahoo! Research, Santa Clara, CA
(2) Google, Mountain View, CA

UAI 2012

Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 1

Thanks

Alex Smola, Google and CMU

Amr Ahmed, Google

Motivation

• Compute frequencies of elements in the data stream.
• Item frequencies change over time.
• Number of items unknown and variable.
• Example: logging query frequency over time.

• Applications
    • Flow counting for IP traffic (who sent what, when and how much)
    • Spam detection and filtering (detect bursts immediately)
    • Website analytics (feedback to editors, trend detection)

• State of the art
    • CountMin sketch is instantaneous but does not log time.
    • Naive snapshotting costs linear memory.
    • MapReduce batch job provides exact counts but long delays.

• Resource constraints
    • Fixed memory footprint for entire sketch regardless of duration
    • High query throughput
    • Real time aggregation and response

Strategy

1. Use CountMin sketch to store snapshots of data (this solves the real time logging problem).

2. Compress snapshots linearly as they age.
    • We care most about recent events.
    • Logarithmic storage since $\sum_{t=1}^{T} t^{-1} = O(\log T)$.

3. Exploit CountMin data structure for efficient compression.
    • Variant 1: reduce storage per snapshot
    • Variant 2: increase timespan per snapshot

4. Interpolate between both variants for improved accuracy.

CountMin Sketch (Cormode & Muthukrishnan)

$M \in \mathbb{R}^{d \times n}$ matrix: d hash functions (rows), n bins (columns)

    hash h1:  M11  M12  M13  M14  M15  M16  ...  M1n
    hash h2:  M21  M22  M23  M24  M25  M26  ...  M2n
    hash h3:  M31  M32  M33  M34  M35  M36  ...  M3n

Each item x is hashed into one bin per row.

• In-memory data structure for instantaneous retrieval

• Aggregate statistic of observation interval (instantaneous retrieval)

• Intuition: Bloom filter with integers

Algorithm

insert(x):
    for i = 1 to d do
        M[i, h_i(x)] ← M[i, h_i(x)] + 1
    end for

query(x):
    n̂_x ← min over i ∈ {1, ..., d} of M[i, h_i(x)]
    return n̂_x
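The insert and query routines above could be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `CountMinSketch`, the helper `_bin`, and the use of salted BLAKE2b digests as the d hash functions are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Toy CountMin sketch with d rows (hash functions) and n bins per row."""

    def __init__(self, d, n):
        self.d = d
        self.n = n
        self.M = [[0] * n for _ in range(d)]

    def _bin(self, i, x):
        # Assumption: row i uses BLAKE2b salted with the row index as h_i.
        digest = hashlib.blake2b(
            x.encode("utf-8"), digest_size=8, salt=i.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.n

    def insert(self, x, count=1):
        # M[i, h_i(x)] <- M[i, h_i(x)] + count for every row i.
        for i in range(self.d):
            self.M[i][self._bin(i, x)] += count

    def query(self, x):
        # The minimum over rows is the estimate; it never undercounts.
        return min(self.M[i][self._bin(i, x)] for i in range(self.d))


sketch = CountMinSketch(d=4, n=1024)
for word in ["wave", "wave", "kanagawa"]:
    sketch.insert(word)
print(sketch.query("wave"))  # at least 2; collisions can only inflate it
```

Because every row is only ever incremented, collisions can push an estimate up but never down, which is why the minimum over rows is the tightest available answer.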

Guarantees

• Approximation guarantee: for a sketch with $d = \lceil \log \frac{1}{\delta} \rceil$ and $n = \lceil \frac{e}{\epsilon} \rceil$ we have with probability $1 - \delta$ that the estimate $\hat{n}_x$ deviates from the count $n_x$ via

  $n_x \le \hat{n}_x \le n_x + \epsilon \sum_{x'} n_{x'}$ for all $x$.

• Linear statistic of the data.
• Power law distributions with exponent $z$ only use $O(N \epsilon^{-1/z})$ space.
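As a worked example of the sizing rule, the snippet below turns a target error ε and failure probability δ into sketch dimensions. It assumes the logarithm is the natural logarithm, as is standard for CountMin; the function name `sketch_dimensions` is made up for this illustration.

```python
import math


def sketch_dimensions(epsilon, delta):
    """Return (d, n) with d = ceil(ln(1/delta)) rows and n = ceil(e/epsilon) bins."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    d = math.ceil(math.log(1.0 / delta))
    n = math.ceil(math.e / epsilon)
    return d, n


# 1% of the total count as additive error, 1% failure probability.
print(sketch_dimensions(0.01, 0.01))  # (5, 272)
```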

Step 1: Combining time intervals

• $M_T$ and $M_{T'}$ are sketches for time intervals $T$ and $T'$ with $T \cap T' = \emptyset$.
• Combine the sketches by adding them up: $M_{T \cup T'} = M_T + M_{T'}$.
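Adding two sketches is an element-wise sum of their count matrices. The snippet below is a small sketch of that step, assuming both matrices were built with the same d, n and hash functions; `merge_sketches` is a hypothetical helper name and the matrices are plain lists of lists.

```python
def merge_sketches(A, B):
    """Element-wise sum of two count matrices of identical shape."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("sketches must have the same number of rows and bins")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


M_T = [[1, 0, 2], [0, 3, 0]]        # counts seen during interval T
M_T_prime = [[0, 4, 1], [2, 0, 2]]  # counts seen during the disjoint interval T'
print(merge_sketches(M_T, M_T_prime))  # [[1, 4, 3], [2, 3, 2]]
```

The result is the same matrix one would get by inserting both intervals' items into a single sketch, since each insert only adds to fixed bins.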

Step 1: Efficient computation

• Keep aggregates for time intervals of length $\{1, 1, 2, 4, 8, \ldots, 2^m\}$.
• Insert into the leftmost aggregation interval.
• Aggregate as cumulative sum from the left using $1 + \sum_{i=0}^{n} 2^i = 2^{n+1}$.
• Computation is $\sum_{n=1}^{\infty} n \cdot 2^{-n} = O(1)$ amortized time, $O(\log t)$ space.

[Figure: aggregation intervals of length 1, 1, 2, 4 being merged from the left as new unit intervals arrive.]
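One way to read the cumulative-sum scheme is sketched below. It is an illustration under assumptions, not the authors' code: plain `Counter` objects stand in for CountMin matrices (both support addition), and the names `TimeAggregation`, `insert` and `tick` are invented here. After each unit interval, the running sum is pushed through levels 0, 1, ... up to the number of trailing zero bits of the time step, so level j always holds an aggregate over 2^j unit intervals.

```python
from collections import Counter


class TimeAggregation:
    def __init__(self):
        self.current = Counter()  # the open unit interval (leftmost slot)
        self.levels = []          # levels[j] aggregates 2**j unit intervals
        self.t = 0

    def insert(self, item, count=1):
        self.current[item] += count

    def tick(self):
        """Close the current unit interval and update the aggregates."""
        self.t += 1
        # Highest level touched: number of trailing zero bits of t.
        top = (self.t & -self.t).bit_length() - 1
        acc = self.current
        for j in range(top + 1):
            if j == len(self.levels):
                self.levels.append(Counter())
            # Level j receives the sum so far; the sum absorbs the old level j.
            acc, self.levels[j] = acc + self.levels[j], acc
        self.current = Counter()


agg = TimeAggregation()
for step in range(1, 5):
    agg.insert("query", step)  # 1, 2, 3, 4 occurrences in successive intervals
    agg.tick()
print([level["query"] for level in agg.levels])  # [4, 7, 10]
```

Level j is touched only on every 2^j-th tick, which is where the O(1) amortized cost comes from, and only about log2(t) levels ever exist.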

Step 2: Folding over

• $M^b$ is a sketch with $n = 2^b$ bins.

• $M^{b-1}$ can be obtained as

  $M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$

  by "folding over" the sketch.
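The folding step can be written in a few lines. The sketch below assumes the bin index is the hash value modulo the number of bins, so that bins j and j + 2^(b-1) of the large sketch both land in bin j of the small one; `fold` is a hypothetical helper operating on a list-of-lists count matrix.

```python
def fold(M):
    """Halve the number of bins: new[i][j] = M[i][j] + M[i][j + n/2]."""
    n = len(M[0])
    if n < 2 or n % 2 != 0:
        raise ValueError("number of bins must be even")
    half = n // 2
    return [[row[j] + row[j + half] for j in range(half)] for row in M]


M3 = [[1, 0, 2, 0, 0, 5, 0, 1]]  # one row, 2**3 bins
M2 = fold(M3)                    # 2**2 bins
print(M2)                        # [[1, 5, 2, 1]]

# An item whose hash value is 5 sits in bin 5 % 8 = 5 before folding
# and in bin 5 % 4 = 1 afterwards; its count can only grow.
assert M2[0][5 % 4] >= M3[0][5 % 8]
```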

Step 2: Efficient computation

• Halve the size of the sketch every $2^t$ intervals.
• Computation costs $O(1)$ time and $O(\log t)$ space.

[Figure: sketch sizes by age]
    interval 1:        1 x 16 bins
    intervals 2, 3:    2 x 8 bins
    intervals 4 to 7:  4 x 4 bins
    ...
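The storage pattern in the figure can be reproduced with a short calculation. This is a reading of the figure, not a formula stated on the slide: it assumes a unit interval that is `age` steps old keeps 2^(m - floor(log2(age))) bins, starting from 2^m bins when fresh. Both function names are made up for the example.

```python
def bins_at_age(age, m):
    """Bins kept for the unit interval that is `age` steps old (age >= 1)."""
    if age < 1:
        raise ValueError("age must be at least 1")
    halvings = age.bit_length() - 1  # floor(log2(age))
    return 2 ** max(m - halvings, 0)


def total_bins(t, m):
    """Total bins per hash row across the t most recent unit intervals."""
    return sum(bins_at_age(age, m) for age in range(1, t + 1))


print([bins_at_age(age, 4) for age in range(1, 8)])  # [16, 8, 8, 4, 4, 4, 4]
print(total_bins(7, 4))                              # 48
```

Each age group (1, 2 to 3, 4 to 7) contributes the same 16 bins, so the total grows with the number of groups, which is logarithmic in t.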

Step 3: Resolution Interpolation

• Time aggregation reports a good estimate over a long time interval.

• Item aggregation reports a poor estimate over a short time interval.

• Marginals of joint distribution: assume independence and interpolate.

[Figure: joint counts with marginals $n(t)$ and $n(x)$ and total $n$.]

• Torso and tail
    • Item aggregated estimate $n_x$
    • Time aggregated estimate $n_t$
    • Count interpolation: $\hat{n}_{xt} = \frac{n_x \cdot n_t}{n}$ where $n = \sum_t n_t = \sum_x n_x$

• Head
    • Sketch accuracy decreases with $e \cdot t$
    • Use regular CountMin sketch whenever $n(x, t) > e \cdot t \cdot 2^{-b}$
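The head versus torso-and-tail rule can be sketched as a single decision function. The snippet is an illustration under assumptions: `n_xt_direct` stands for the direct sketch lookup n(x, t), the symbols `t` and `b` are used exactly as in the threshold above without further interpretation, and both function names are invented here.

```python
import math


def interpolate(n_x, n_t, n):
    """Independence-based estimate n_x * n_t / n from the two marginals."""
    return n_x * n_t / n if n > 0 else 0.0


def estimate_count(n_xt_direct, n_x, n_t, n, t, b):
    """Trust the direct sketch value for heavy items, interpolate otherwise."""
    threshold = math.e * t * 2.0 ** (-b)
    if n_xt_direct > threshold:
        return float(n_xt_direct)      # head: the sketch is accurate enough
    return interpolate(n_x, n_t, n)    # torso and tail


# Heavy item: direct count far above the threshold e * 8 / 16.
print(estimate_count(500, n_x=20, n_t=50, n=1000, t=8, b=4))  # 500.0
# Rare item: fall back to 20 * 50 / 1000.
print(estimate_count(1, n_x=20, n_t=50, n=1000, t=8, b=4))    # 1.0
```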

Setup and Throughput

[Figure: log-log histograms of term frequency (x-axis) against number of unique terms (y-axis) for two datasets.]

• Web query data, 5 days sample: 97.9M unique terms, 378.1M total
• Wikipedia data: 4.5M unique terms, 1291.5M total

Configuration

• Platform
    • 64-bit Linux
    • 4-core 2GHz x86
    • 16GB RAM
    • Gigabit network

• Sketch setup
    • 4 hash functions
    • $2^{23}$ bins
    • $2^{11}$ aggregation intervals (7 days in 5 minute intervals)

• 3-gram interpolation: 12GB sketch with
    • 3 hash functions
    • $2^{30}$ bins
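The configuration numbers can be sanity-checked with a little arithmetic. The counter width is not stated on the slide; the snippet assumes 4-byte counters, which is the width that makes the 3-gram sketch come out at 12GB.

```python
MINUTES_PER_DAY = 24 * 60

# 7 days of 5-minute intervals must fit into 2**11 aggregation intervals.
intervals_needed = 7 * MINUTES_PER_DAY // 5
intervals_available = 2 ** 11
print(intervals_needed, intervals_available)  # 2016 2048

# Assumption (not in the source): each bin is a 4-byte counter.
BYTES_PER_COUNTER = 4
trigram_sketch_bytes = 3 * 2 ** 30 * BYTES_PER_COUNTER
trigram_sketch_gib = trigram_sketch_bytes / 2 ** 30
print(trigram_sketch_gib)  # 12.0
```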

Speed

• Software
    • Client-server system
    • ICE middleware
    • 1 server, 10 clients

• Throughput/s
    • 50k inserts
    • 22k requests (time aggregation)
    • 8.5k requests (resolution interp.)

• Limiting factors
    • TCP/IP overhead (package query)
    • Memory latency (random access)

Accuracy (aggregate absolute error $\hat{n} - n$)

Accuracy (stratified absolute error $\hat{n} - n$)

Sketching for Graphical Models

• Goal
    • Observe stream of observations
    • Estimate joint probability in $O(1)$ time
    • CountMin is good for the head, but interpolation is better for torso and tail

• General strategy
    • Markov network with junction tree: cliques $\mathcal{C}$ and separator sets $\mathcal{S}$.
    • Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate

      $p(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}$.

    • Estimates are fast: only a lookup in the CountMin sketch. No need to solve a convex program for graphical model inference.

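• Markov chain
    • Unigrams: $p(abc) \approx n^{-3} \cdot n_a \cdot n_b \cdot n_c$
    • Bigrams: $p(abc) \approx n^{-1} \cdot \frac{n_{ab} \cdot n_{bc}}{n_b}$ (two cliques and one separator, so $n^{|\mathcal{S}| - |\mathcal{C}|} = n^{-1}$)
    • Backoff smoothing (e.g. Kneser-Ney) in practice.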
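The two Markov chain approximations amount to a handful of multiplications on looked-up counts. In the sketch below the counts are plain integers; in the setting of the slides they would come from CountMin lookups. The normalisation follows the clique and separator formula: three singleton cliques give n^(-3) for unigrams, and two cliques with one separator give n^(-1) for bigrams. The function names are hypothetical.

```python
def unigram_estimate(n, n_a, n_b, n_c):
    """p(abc) ~ n**-3 * n_a * n_b * n_c (all three tokens independent)."""
    return n_a * n_b * n_c / n ** 3


def bigram_estimate(n, n_ab, n_bc, n_b):
    """p(abc) ~ n**-1 * n_ab * n_bc / n_b (cliques ab, bc; separator b)."""
    if n_b == 0:
        return 0.0
    return n_ab * n_bc / (n_b * n)


n = 10  # total number of observations (toy value)
print(unigram_estimate(n, n_a=5, n_b=4, n_c=2))   # 0.04
print(bigram_estimate(n, n_ab=3, n_bc=2, n_b=4))  # 0.15
```

No smoothing is applied here, so an unseen separator simply yields zero; a practical system would back off to lower-order counts instead.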

n-gram Interpolation

• Trigram approximation

• Wikipedia dataset (1291.5M terms, 405M unique trigrams)

    Method                          Absolute error    Relative error
    Unigram approximation           2.50 · 10^7       0.266
    Bigram approximation            1.22 · 10^6       0.013
    Trigram sketching (CountMin)    8.35 · 10^6       0.089

• Sketching trigrams is not accurate enough on the tail.

Summary

• Fast and simple algorithm to aggregate statistics of data streams.
• Effective compressed representation of the temporal data.
• Works well for graphical models.
• High-performance scalable implementation with $O(1)$ time access.
• Can be distributed over many servers.

Hokusai Katsushika

Great Wave off Kanagawa
