Hokusai: Sketching streams in real time
Sergiy Matusevych (1)
Alexander J. Smola (2)
Amr Ahmed (2)
(1) Yahoo! Research, Santa Clara, CA
(2) Google, Mountain View, CA
UAI 2012
Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 1
Thanks
Alex Smola, Google and CMU
Amr Ahmed, Google
Motivation
- Compute frequencies of elements in the data stream.
- Item frequencies change over time.
- Number of items unknown and variable.
- Example: logging query frequency over time.
- Applications
  - Flow counting for IP traffic (who sent what, when and how much)
  - Spam detection and filtering (detect bursts immediately)
  - Website analytics (feedback to editors, trend detection)
- State of the art
  - CountMin sketch is instantaneous but does not log time.
  - Naive snapshotting costs linear memory.
  - MapReduce batch job provides exact counts but long delays.
- Resource constraints
  - Fixed memory footprint for entire sketch regardless of duration
  - High query throughput
  - Real time aggregation and response
Strategy
1. Use CountMin sketch to store snapshots of data (this solves the real time logging problem).
2. Compress snapshots linearly as they age.
   - We care most about recent events.
   - Logarithmic storage since $\sum_{t=1}^{T} t^{-1} = O(\log T)$.
3. Exploit CountMin data structure for efficient compression.
   - Variant 1: reduce storage per snapshot
   - Variant 2: increase timespan per snapshot
4. Interpolate between both variants for improved accuracy.
CountMin Sketch (Cormode & Muthukrishnan)
[Figure: the sketch is a matrix $M \in \mathbb{R}^{d \times n}$ with $d$ hash functions (one row per hash $h_1, h_2, h_3, \ldots$) and $n$ bins per row; an item $x$ is hashed into one bin of each row.]
- In-memory data structure for instantaneous retrieval
- Aggregate statistic of the observation interval (instantaneous retrieval)
- Intuition: a Bloom filter with integer counters
Algorithm

insert(x):
    for i = 1 to d do
        M[i, h_i(x)] ← M[i, h_i(x)] + 1
    end for

query(x):
    $\hat{n}_x \leftarrow \min_{i \in \{1, \ldots, d\}} M[i, h_i(x)]$
    return $\hat{n}_x$
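The insert and query routines above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `CountMinSketch` and the use of a salted BLAKE2b digest as the hash family are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """A d x n table of counters; row i is indexed by hash function h_i."""

    def __init__(self, depth, width):
        self.depth = depth  # d hash functions
        self.width = width  # n bins per row
        self.table = [[0] * width for _ in range(depth)]

    def _bin(self, row, item):
        # Assumption: a BLAKE2b digest salted with the row index plays
        # the role of h_row; any pairwise independent family would do.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def insert(self, item, count=1):
        # Increment one counter per row.
        for i in range(self.depth):
            self.table[i][self._bin(i, item)] += count

    def query(self, item):
        # The minimum over rows is the least contaminated counter.
        return min(self.table[i][self._bin(i, item)] for i in range(self.depth))


sketch = CountMinSketch(depth=4, width=1024)
for word in ["wave", "wave", "fuji", "wave"]:
    sketch.insert(word)
print(sketch.query("wave"))
```

Because counters are only ever incremented, a query can overestimate a count when items collide but can never underestimate it.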
Guarantees
- Approximation guarantee: for a sketch with $d = \lceil \log \frac{1}{\delta} \rceil$ and $n = \lceil \frac{e}{\epsilon} \rceil$ we have with probability $1 - \delta$ that the estimate $\hat{n}_x$ deviates from the count $n_x$ via
  $n_x \le \hat{n}_x \le n_x + \epsilon \sum_{x'} n_{x'}$ for all $x$.
- Linear statistic of the data
- Power law distributions with exponent $z$ only use $O(N \epsilon^{-1/z})$ space.
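As a worked example of the guarantee, the sketch dimensions can be computed directly from the target error and failure probability. The helper name `sketch_dimensions` is hypothetical, and the logarithm is assumed to be the natural logarithm, as in the original CountMin analysis.

```python
import math


def sketch_dimensions(epsilon, delta):
    """Return (d, n): number of hash functions and number of bins."""
    # Assumption: "log" on the slide denotes the natural logarithm.
    depth = math.ceil(math.log(1.0 / delta))
    width = math.ceil(math.e / epsilon)
    return depth, width


# Error of at most 0.1% of the stream total, with 99% probability.
depth, width = sketch_dimensions(epsilon=0.001, delta=0.01)
print(depth, width)
```

For these settings the sketch needs 5 hash functions and 2719 bins, independent of the number of distinct items in the stream.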
Step 1: Combining time intervals
- $M_T$ and $M_{T'}$ are sketches for time intervals $T$ and $T'$ with $T \cap T' = \emptyset$.
- Combine sketches by adding them up: $M_{T \cup T'} = M_T + M_{T'}$.
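The additivity of sketches over disjoint intervals can be checked with a small example. The helper names (`bin_index`, `build_sketch`, `add_sketches`), the sizes, and the BLAKE2b-based hash are assumptions chosen for illustration.

```python
import hashlib

DEPTH, WIDTH = 3, 64  # illustrative sizes, not the talk's configuration


def bin_index(row, item):
    # Assumption: a digest of "row:item" stands in for hash function h_row.
    digest = hashlib.blake2b(f"{row}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % WIDTH


def build_sketch(stream):
    table = [[0] * WIDTH for _ in range(DEPTH)]
    for item in stream:
        for i in range(DEPTH):
            table[i][bin_index(i, item)] += 1
    return table


def add_sketches(a, b):
    # Entry-wise sum of two sketches that share hash functions and sizes.
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


morning = ["wave", "fuji", "wave"]
evening = ["fuji", "crane"]
merged = add_sketches(build_sketch(morning), build_sketch(evening))
print(merged == build_sketch(morning + evening))
```

The merged table is identical to the sketch of the concatenated stream, which is what makes aggregation over longer time spans possible without revisiting the data.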
Step 1: Efficient computation
- Keep aggregates for time intervals of length $\{1, 1, 2, 4, 8, \ldots, 2^m\}$.
- Insert into the leftmost aggregation interval.
- Aggregate as cumulative sum from the left using $1 + \sum_{i=0}^{n} 2^i = 2^{n+1}$.
- Computation is $\sum_{n=1}^{\infty} n \cdot 2^{-n} = O(1)$ amortized time, $O(\log t)$ space.
[Figure: aggregation intervals of length 1, 1, 2, 4 being updated as new unit intervals arrive.]
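The cascading update described above can be sketched as follows. This is one possible reading of the scheme, under stated assumptions: the class name `TimeAggregator` is hypothetical, and a plain list of counters stands in for a full CountMin table (the update is linear, so the same logic applies entry-wise to a sketch).

```python
class TimeAggregator:
    """Keeps dyadic aggregates: levels[j] sums the latest 2**j unit intervals
    that ended on a multiple of 2**j."""

    def __init__(self, width):
        self.width = width
        self.t = 0
        self.current = [0] * width  # open unit interval (leftmost aggregate)
        self.levels = []

    def insert(self, bin_index, count=1):
        self.current[bin_index] += count

    def tick(self):
        """Close the current unit interval and cascade sums to coarser levels."""
        self.t += 1
        carry = self.current
        j = 0
        while True:
            if j == len(self.levels):
                self.levels.append([0] * self.width)
            previous = self.levels[j]
            self.levels[j] = carry
            # Level j + 1 is refreshed only when t is a multiple of 2**(j + 1).
            if self.t % (2 ** (j + 1)) != 0:
                break
            carry = [a + b for a, b in zip(carry, previous)]
            j += 1
        self.current = [0] * self.width


agg = TimeAggregator(width=1)
for step in range(1, 9):
    agg.insert(0, step)  # interval number `step` receives `step` events
    agg.tick()
print(agg.levels)
```

After eight ticks the levels hold the sums over the last 1, 2, 4, and 8 unit intervals. Level j is touched only every 2^j ticks, which is where the constant amortized cost comes from.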
Step 2: Folding over
- $M^b$ is a sketch with $n = 2^b$ bins.
- $M^{b-1}$ can be obtained as
  $M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$
  by "folding over" the sketch.
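A short check of the folding identity, assuming bin indices are obtained by reducing a fixed hash value modulo the number of bins. The helper names (`hash_value`, `build_sketch`, `fold`) and the BLAKE2b-based hash are illustrative assumptions.

```python
import hashlib


def hash_value(row, item):
    # Assumption: a 64-bit digest of "row:item" stands in for h_row(item).
    digest = hashlib.blake2b(f"{row}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def build_sketch(stream, depth, bits):
    width = 2 ** bits
    table = [[0] * width for _ in range(depth)]
    for item in stream:
        for i in range(depth):
            table[i][hash_value(i, item) % width] += 1
    return table


def fold(table):
    # Add the upper half of every row onto the lower half.
    half = len(table[0]) // 2
    return [[row[j] + row[j + half] for j in range(half)] for row in table]


stream = ["wave", "fuji", "wave", "crane", "pine"]
folded = fold(build_sketch(stream, depth=3, bits=4))
print(folded == build_sketch(stream, depth=3, bits=3))
```

Folding works because a value reduced modulo $2^b$ and then modulo $2^{b-1}$ equals the value reduced modulo $2^{b-1}$ directly, so the folded table is exactly the sketch one would have built with half as many bins.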
Step 2: Efficient computation
- Halve the size of the sketch every $2^t$ intervals.
- Computation costs $O(1)$ time and $O(\log t)$ space.

[Figure: sketch size by age. Interval 1: 1 x 16 bins; intervals 2 to 3: 2 x 8 bins; intervals 4 to 7: 4 x 4 bins; and so on.]
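The halving schedule can be illustrated with a small history buffer. This is a simplified sketch under stated assumptions: the class name `ItemAggregator` is hypothetical, each snapshot is reduced to a single row of counters, and a snapshot is folded whenever its age (counted from 1 for the newest) reaches a power of two.

```python
def fold(row):
    # Add the upper half of the row onto the lower half.
    half = len(row) // 2
    return [row[j] + row[j + half] for j in range(half)]


class ItemAggregator:
    def __init__(self, bits):
        self.bits = bits
        self.history = []  # history[-1] is the newest snapshot (age 1)

    def push(self, row):
        assert len(row) == 2 ** self.bits
        self.history.append(list(row))
        # Fold every snapshot whose age has just become 2, 4, 8, ...
        age = 2
        while age <= len(self.history):
            index = len(self.history) - age
            if len(self.history[index]) > 1:
                self.history[index] = fold(self.history[index])
            age *= 2

    def widths(self):
        # Number of bins per snapshot, ordered from newest to oldest.
        return [len(row) for row in reversed(self.history)]


hist = ItemAggregator(bits=4)
for step in range(7):
    snapshot = [0] * 16
    snapshot[step] = 1
    hist.push(snapshot)
print(hist.widths())
```

With 16 initial bins and seven snapshots this reproduces the 1 x 16, 2 x 8, 4 x 4 layout: each doubling of age adds the same total amount of storage, so memory grows logarithmically in the number of intervals.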
Step 3: Resolution Interpolation
- Time aggregation reports good estimate over long time interval.
- Item aggregation reports poor estimate over short time interval.
- Marginals of joint distribution: assume independence and interpolate.
- Torso and tail
  - Item aggregated estimate $n_x$
  - Time aggregated estimate $n_t$
  - Count interpolation: $\hat{n}_{xt} = \frac{n_x \cdot n_t}{n}$, where $n = \sum_t n_t = \sum_x n_x$
- Head
  - Sketch accuracy decreases with $e \cdot t$.
  - Use regular CountMin sketch whenever $n(x, t) > e \cdot t \cdot 2^{-b}$.
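The interpolation rule and the head threshold can be written out as a small worked example. The function names `interpolate` and `estimate` are hypothetical, and the symbols `t` and `bits` (for $b$) are taken as given on the slide without further interpretation.

```python
import math


def interpolate(n_x, n_t, n):
    """Independence estimate of the joint count from its two marginals."""
    if n == 0:
        return 0.0
    return n_x * n_t / n


def estimate(direct, n_x, n_t, n, t, bits):
    """Use the direct sketch estimate for head items, else interpolate."""
    threshold = math.e * t * 2.0 ** (-bits)
    if direct > threshold:
        return float(direct)
    return interpolate(n_x, n_t, n)


# An item seen 50 times overall, in an interval holding 200 of 1000 events.
print(interpolate(50, 200, 1000))
# With t = 64 and b = 4 the threshold is e * 64 / 16, roughly 10.9.
print(estimate(direct=500, n_x=50, n_t=200, n=1000, t=64, bits=4))
print(estimate(direct=3, n_x=50, n_t=200, n=1000, t=64, bits=4))
```

The first call gives 10.0; the second keeps the direct estimate of 500 because it exceeds the threshold, and the third falls back to the interpolated value.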
Setup and Throughput
[Figure: two log-log plots of number of unique terms against term frequency.
 Web query data, 5 days sample: 97.9M unique terms, 378.1M total.
 Wikipedia data: 4.5M unique terms, 1291.5M total.]
Configuration
- Platform
  - 64-bit Linux
  - 4-core 2GHz x86
  - 16GB RAM
  - Gigabit network
- Sketch setup
  - 4 hash functions
  - $2^{23}$ bins
  - $2^{11}$ aggregation intervals (7 days in 5 minute intervals)
- 3-gram interpolation: 12GB sketch with
  - 3 hash functions
  - $2^{30}$ bins
Speed
- Software
  - Client-server system
  - ICE middleware
  - 1 server, 10 clients
- Throughput per second
  - 50k inserts
  - 22k requests (time aggregation)
  - 8.5k requests (resolution interpolation)
- Limiting factors
  - TCP/IP overhead (package query)
  - Memory latency (random access)
Accuracy (aggregate absolute error $\hat{n} - n$)
Accuracy (stratified absolute error $\hat{n} - n$)
Sketching for Graphical Models
- Goal
  - Observe stream of observations
  - Estimate joint probability in $O(1)$ time
  - CountMin is good for the head, but interpolation is better for torso and tail
- General strategy
  - Markov network with junction tree: cliques $\mathcal{C}$ and separator sets $\mathcal{S}$.
  - Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate
    $p(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}$.
  - Estimates are fast: only lookups in the CountMin sketch. No need to solve a convex program for graphical model inference.
- Markov chain
  - Unigrams: $p(abc) \approx n^{-3} \cdot n_a \cdot n_b \cdot n_c$
  - Bigrams: $p(abc) \approx n^{-1} \cdot \frac{n_{ab} \cdot n_{bc}}{n_b}$
  - Backoff smoothing (e.g. Kneser-Ney) in practice.
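The two Markov chain approximations can be tried on a toy token stream. In this sketch exact `Counter` lookups stand in for CountMin queries, and the function names `p_unigram` and `p_bigram` are hypothetical. The bigram estimate applies the junction tree formula with two cliques (ab, bc) and one separator (b), which gives the factor $n^{-1}$, and it uses the token count as the normalizer for bigrams as well, which is an approximation.

```python
from collections import Counter

tokens = "a b c a b c a b d".split()
n = len(tokens)
unigrams = Counter(tokens)                  # stands in for sketch lookups n_a
bigrams = Counter(zip(tokens, tokens[1:]))  # stands in for sketch lookups n_ab


def p_unigram(a, b, c):
    # Independent tokens: three cliques, no separators.
    return unigrams[a] * unigrams[b] * unigrams[c] / n ** 3


def p_bigram(a, b, c):
    # Cliques (a, b) and (b, c) with separator b.
    if unigrams[b] == 0:
        return 0.0
    return bigrams[(a, b)] * bigrams[(b, c)] / (n * unigrams[b])


print(p_unigram("a", "b", "c"))
print(p_bigram("a", "b", "c"))
```

On this stream the unigram estimate is 18/729 and the bigram estimate is 6/27, compared with an empirical trigram frequency of 2 occurrences among 7 trigrams. Each estimate needs only a constant number of count lookups.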
n-gram Interpolation
- Trigram approximation
- Wikipedia dataset (1291.5M terms, 405M unique trigrams)

                                  Absolute error    Relative error
  Unigram approximation           2.50 · 10^7       0.266
  Bigram approximation            1.22 · 10^6       0.013
  Trigram sketching (CountMin)    8.35 · 10^6       0.089

- Sketching trigrams is not accurate enough on the tail.
Summary
- Fast and simple algorithm to aggregate statistics of data streams.
- Effective compressed representation of the temporal data.
- Works well for graphical models.
- High-performance scalable implementation with $O(1)$ time access.
- Can be distributed over many servers.
Hokusai Katsushika
Great Wave off Kanagawa