Hokusai: Sketching streams in real time
Sergiy Matusevych (1), Alexander J. Smola (2), Amr Ahmed (2)
(1) Yahoo! Research, Santa Clara, CA
(2) Google, Mountain View, CA
UAI 2012
Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 1
Thanks
Alex Smola, Google and CMU
Amr Ahmed, Google
Motivation
- Compute frequencies of elements in the data stream.
  - Item frequencies change over time.
  - The number of items is unknown and variable.
  - Example: logging query frequency over time.
- Applications
  - Flow counting for IP traffic (who sent what, when, and how much)
  - Spam detection and filtering (detect bursts immediately)
  - Website analytics (feedback to editors, trend detection)
- State of the art
  - CountMin sketch is instantaneous but does not log time.
  - Naive snapshotting costs linear memory.
  - A MapReduce batch job provides exact counts but with long delays.
- Resource constraints
  - Fixed memory footprint for the entire sketch, regardless of duration
  - High query throughput
  - Real-time aggregation and response
Strategy
1. Use CountMin sketch to store snapshots of data (this solves the real-time logging problem).
2. Compress snapshots linearly as they age.
   - We care most about recent events.
   - Logarithmic storage, since $\sum_{t=1}^{T} t^{-1} = O(\log T)$.
3. Exploit the CountMin data structure for efficient compression.
   - Variant 1: reduce storage per snapshot.
   - Variant 2: increase timespan per snapshot.
4. Interpolate between both variants for improved accuracy.
CountMin Sketch (Cormode & Muthukrishnan)
$M \in \mathbb{R}^{d \times n}$ matrix with $d$ hash functions (rows) and $n$ bins (columns). Each hash $h_i$ maps an item $x$ to one bin $M[i, h_i(x)]$ in row $i$:

hash h_1:  M_11  M_12  M_13  ...  M_1n
hash h_2:  M_21  M_22  M_23  ...  M_2n
hash h_3:  M_31  M_32  M_33  ...  M_3n

- In-memory data structure for instantaneous retrieval
- Aggregate statistic of the observation interval (instantaneous retrieval)
- Intuition: Bloom filter with integers
Algorithm

insert(x):
    for i = 1 to d do
        M[i, h_i(x)] ← M[i, h_i(x)] + 1
    end for

query(x):
    n_x ← min over i ∈ {1, ..., d} of M[i, h_i(x)]
    return n_x
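The insert and query routines above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `CountMinSketch`, the `count` parameter, and the use of salted SHA-256 digests as the hash family are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Minimal CountMin sketch: d rows (hash functions) by n bins."""

    def __init__(self, depth, width):
        self.depth = depth  # d in the slides
        self.width = width  # n in the slides
        self.table = [[0] * width for _ in range(depth)]

    def _bin(self, row, x):
        # Assumption: a row-salted SHA-256 digest stands in for hash h_row.
        digest = hashlib.sha256(f"{row}:{x}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def insert(self, x, count=1):
        # M[i, h_i(x)] <- M[i, h_i(x)] + count for every row i
        for i in range(self.depth):
            self.table[i][self._bin(i, x)] += count

    def query(self, x):
        # The minimum over rows never underestimates the true count.
        return min(self.table[i][self._bin(i, x)] for i in range(self.depth))


sketch = CountMinSketch(depth=4, width=1024)
for word in ["wave", "wave", "fuji", "wave"]:
    sketch.insert(word)
print(sketch.query("wave"), sketch.query("fuji"))
```

Every row receives exactly one increment per insert, so each row sums to the total number of inserted items.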
Guarantees
- Approximation guarantee: for a sketch with $d = \lceil \log \frac{1}{\delta} \rceil$ and $n = \lceil \frac{e}{\varepsilon} \rceil$ we have with probability $1 - \delta$ that the estimate $\hat{n}_x$ deviates from the count $n_x$ via
  $n_x \le \hat{n}_x \le n_x + \varepsilon \sum_{x'} n_{x'}$ for all $x$.
- Linear statistic of the data
- Power law distributions with exponent $z$ only use $O(N \varepsilon^{-1/z})$ space.
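As a worked example of the guarantee, the following snippet computes the sketch dimensions for a chosen failure probability and error level. The function name `sketch_dimensions` is hypothetical, and reading the logarithm as the natural logarithm is an assumption; the slide does not state the base.

```python
import math


def sketch_dimensions(delta, epsilon):
    """Return (d, n) with d = ceil(log(1/delta)) and n = ceil(e/epsilon)."""
    depth = math.ceil(math.log(1.0 / delta))  # assumption: natural logarithm
    width = math.ceil(math.e / epsilon)
    return depth, width


depth, width = sketch_dimensions(delta=0.01, epsilon=0.001)
print(depth, width)

# With 1,000,000 items in the stream, the additive error bound
# epsilon * sum of all counts is 1000 occurrences.
error_bound = 0.001 * 1_000_000
```

Under these assumptions a failure probability of 1% and a relative error of 0.1% of the stream length need 5 rows of 2719 bins.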
Step 1: Combining time intervals
- $M_T$ and $M_{T'}$ are sketches over time intervals $T$ and $T'$ with $T \cap T' = \emptyset$.
- Combine sketches by adding them up: $M_{T \cup T'} = M_T + M_{T'}$.
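Because the sketch is a linear statistic, adding two sketches built with the same hash functions gives the sketch of the combined interval. A small sketch of this property, assuming list-of-lists tables and salted SHA-256 digests as stand-in hash functions (the names `bin_index`, `sketch_of`, and `add_sketches` are illustrative only):

```python
import hashlib

DEPTH, WIDTH = 3, 64


def bin_index(row, x, width=WIDTH):
    # Assumption: a row-salted SHA-256 digest stands in for hash h_row.
    digest = hashlib.sha256(f"{row}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % width


def sketch_of(items):
    """Build a DEPTH x WIDTH CountMin table from a list of items."""
    table = [[0] * WIDTH for _ in range(DEPTH)]
    for x in items:
        for i in range(DEPTH):
            table[i][bin_index(i, x)] += 1
    return table


def add_sketches(a, b):
    """Entrywise sum of two tables of identical shape."""
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


interval_t = ["wave", "fuji", "wave"]
interval_t_prime = ["fuji", "crane"]
combined = add_sketches(sketch_of(interval_t), sketch_of(interval_t_prime))

# Same table as sketching both intervals in one pass.
print(combined == sketch_of(interval_t + interval_t_prime))
```

The merge only works when both sketches share dimensions and hash functions, which is why the helper functions use module-level constants here.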
Step 1: Efficient computation
- Keep aggregates for time intervals of length $\{1, 1, 2, 4, 8, \ldots, 2^m\}$.
- Insert into the leftmost aggregation interval.
- Aggregate as a cumulative sum from the left, using $1 + \sum_{i=0}^{n} 2^i = 2^{n+1}$.
- Computation takes $\sum_{n=1}^{\infty} n \cdot 2^{-n} = O(1)$ amortized time and $O(\log t)$ space.
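One possible way to realize the cumulative-sum update is shown below. It is a sketch under stated assumptions rather than the authors' code: `collections.Counter` stands in for a CountMin sketch (both are additive), the class name `TimeAggregator` and its methods are hypothetical, and the bookkeeping keeps level `j` equal to the sum over the most recent complete block of `2**j` unit intervals.

```python
from collections import Counter


class TimeAggregator:
    """Dyadic time aggregation; Counter is a stand-in for an additive sketch."""

    def __init__(self):
        self.t = 0
        self.levels = []          # levels[j]: last complete block of 2**j intervals
        self.current = Counter()  # unit interval currently being filled

    def insert(self, x, count=1):
        self.current[x] += count

    def tick(self):
        """Close the current unit interval and push it into the hierarchy."""
        self.t += 1
        carry = self.current
        # Largest l such that t is divisible by 2**l.
        top = (self.t & -self.t).bit_length() - 1
        for j in range(top + 1):
            if j == len(self.levels):
                self.levels.append(Counter())
            previous = self.levels[j]
            self.levels[j] = carry       # newest block of length 2**j
            carry = carry + previous     # cumulative sum, length 2**(j+1)
        self.current = Counter()


agg = TimeAggregator()
for step in range(1, 9):
    agg.insert("query", count=step)  # interval number `step` sees `step` hits
    agg.tick()
print([level["query"] for level in agg.levels])
```

Level `j` is touched only when `t` is a multiple of `2**j`, which is where the constant amortized cost comes from; after `t` ticks only about `log2(t)` levels exist.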
Step 2: Folding over
- $M^b$ is a sketch with $n = 2^b$ bins.
- $M^{b-1}$ can be obtained by "folding over" the sketch:
  $M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$
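The folding step can be illustrated as follows. The example assumes that bin indices are taken as a hash value modulo `2**b`, so that the index in the folded sketch is simply the same hash modulo `2**(b-1)`; the helper names `bin_index`, `fold`, and `query` are hypothetical.

```python
import hashlib


def bin_index(row, x, bits):
    # Assumption: bins are the low `bits` bits of a row-salted SHA-256 digest.
    digest = hashlib.sha256(f"{row}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


def fold(table):
    """Halve the width: new[i][j] = old[i][j] + old[i][j + width // 2]."""
    half = len(table[0]) // 2
    return [[row[j] + row[j + half] for j in range(half)] for row in table]


def query(table, x, bits):
    return min(table[i][bin_index(i, x, bits)] for i in range(len(table)))


BITS, DEPTH = 4, 3
items = ["wave", "wave", "fuji", "crane"]
table = [[0] * (1 << BITS) for _ in range(DEPTH)]
for x in items:
    for i in range(DEPTH):
        table[i][bin_index(i, x, BITS)] += 1

folded = fold(table)
print(len(folded[0]), query(folded, "wave", BITS - 1))
```

The folded table is identical to a sketch built directly with `2**(b-1)` bins, so it can still be queried, only with a coarser error bound.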
Step 2: Efficient computation
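- Halve the size of the sketch every $2^t$ intervals.
- Computation costs $O(1)$ time and $O(\log t)$ space.

[Figure: sketch size by age. Interval 1 is stored with 16 bins (1 x 16), intervals 2-3 with 8 bins each (2 x 8), intervals 4-7 with 4 bins each (4 x 4), and so on.]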
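The halving schedule shown in the figure (16, 8, 8, 4, 4, 4, 4 bins) can be reproduced with a small helper. This is an illustrative reading of the figure: the function name `bins_at_age` is hypothetical, and clamping the width at a single bin for very old intervals is an assumption.

```python
def bins_at_age(age, full_bits):
    """Width of the sketch for the interval that is `age` steps old (age >= 1)."""
    halvings = age.bit_length() - 1  # floor(log2(age))
    # Assumption: never shrink below one bin.
    return 1 << max(full_bits - halvings, 0)


widths = [bins_at_age(age, full_bits=4) for age in range(1, 8)]
print(widths)

# Each dyadic block of ages [2**k, 2**(k+1)) stores 2**k sketches of
# 2**(4 - k) bins, i.e. 16 bins per block and per hash function.
total_bins = sum(bins_at_age(age, full_bits=4) for age in range(1, 16))
print(total_bins)
```

Since every doubling of the history adds one more block of constant size, the total storage grows logarithmically in the number of intervals until the width reaches its minimum.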
Step 3: Resolution Interpolation
- Time aggregation reports a good estimate over a long time interval.
- Item aggregation reports a poor estimate over a short time interval.
- Marginals of the joint distribution (marginals $n(t)$ and $n(x)$, total $n$): assume independence and interpolate.
- Torso and tail
  - Item aggregated estimate $n_x$
  - Time aggregated estimate $n_t$
  - Count interpolation: $n_{xt} = \frac{n_x \cdot n_t}{n}$, where $n = \sum_t n_t = \sum_x n_x$
- Head
  - Sketch accuracy decreases with $e \cdot t$.
  - Use the regular CountMin sketch whenever $n(x, t) > e \cdot t \cdot 2^{-b}$.
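The rule above can be written as a short function. This is a sketch only: the function names `interpolate` and `estimate` are hypothetical, the arguments `t` and `bits` simply mirror the symbols $t$ and $b$ in the threshold, and `direct` denotes the plain CountMin estimate of $n(x, t)$.

```python
import math


def interpolate(n_x, n_t, n):
    """Independence-based estimate n_xt = n_x * n_t / n."""
    return n_x * n_t / n if n > 0 else 0.0


def estimate(direct, n_x, n_t, n, t, bits):
    """Use the direct sketch estimate for the head, interpolation otherwise."""
    threshold = math.e * t * 2.0 ** (-bits)
    if direct > threshold:
        return float(direct)  # head: the sketch count is above the error level
    return interpolate(n_x, n_t, n)  # torso and tail


head = estimate(direct=500, n_x=20, n_t=50, n=1000, t=1000, bits=10)
tail = estimate(direct=2, n_x=20, n_t=50, n=1000, t=1000, bits=10)
print(head, tail)
```

In this example the threshold is roughly 2.65, so a direct count of 500 is trusted as is, while a direct count of 2 is replaced by the interpolated value.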
Setup and Throughput
[Figure: Web query data, 5 days sample. Number of unique terms vs. term frequency (log-log axes). 97.9M unique terms, 378.1M total.]
[Figure: Wikipedia data. Number of unique terms vs. term frequency (log-log axes). 4.5M unique terms, 1291.5M total.]

Configuration
- Platform
  - 64-bit Linux
  - 4-core 2GHz x86
  - 16GB RAM
  - Gigabit network
- Sketch setup
  - 4 hash functions
  - $2^{23}$ bins
  - $2^{11}$ aggregation intervals (7 days in 5 minute intervals)
- 3-gram interpolation: 12GB sketch with
  - 3 hash functions
  - $2^{30}$ bins
Speed
- Software
  - Client-server system
  - ICE middleware
  - 1 server, 10 clients
- Throughput per second
  - 50k inserts
  - 22k requests (time aggregation)
  - 8.5k requests (resolution interpolation)
- Limiting factors
  - TCP/IP overhead (package query)
  - Memory latency (random access)
Accuracy (aggregate absolute error $\hat{n} - n$)
Accuracy (stratified absolute error $\hat{n} - n$)
Sketching for Graphical Models
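- Goal
  - Observe a stream of observations.
  - Estimate the joint probability in $O(1)$ time.
  - CountMin is good for the head, but interpolation is better for torso and tail.
- General strategy
  - Markov network with junction tree: cliques $\mathcal{C}$ and separator sets $\mathcal{S}$.
  - Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate
    $p(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}$.
  - Estimates are fast: they need only lookups in the CountMin sketch. There is no need to solve a convex program for graphical model inference.
- Markov chain
  - Unigrams: $p(abc) \approx n^{-3} \cdot n_a \cdot n_b \cdot n_c$
  - Bigrams: $p(abc) \approx n^{-1} \cdot \frac{n_{ab} \cdot n_{bc}}{n_b}$ (two cliques $ab$, $bc$ and one separator $b$, so $n^{|\mathcal{S}| - |\mathcal{C}|} = n^{-1}$)
  - Backoff smoothing (e.g. Kneser-Ney) in practice.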
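The two Markov chain approximations can be tried on a toy corpus. In this sketch exact `collections.Counter` tables stand in for CountMin lookups, the function names are hypothetical, and the corpus length `n` is used as the normalizer for both unigram and bigram counts for simplicity. For the bigram chain the clique formula gives the factor $n^{|\mathcal{S}| - |\mathcal{C}|} = n^{1 - 2} = n^{-1}$, which is what the code uses.

```python
from collections import Counter

tokens = "the great wave off the great sea".split()
n = len(tokens)
unigrams = Counter(tokens)                  # stand-in for sketch lookups n_a
bigrams = Counter(zip(tokens, tokens[1:]))  # stand-in for sketch lookups n_ab


def p_unigram(a, b, c):
    """Independence model: p(abc) ~ n^-3 * n_a * n_b * n_c."""
    return unigrams[a] * unigrams[b] * unigrams[c] / n ** 3


def p_bigram(a, b, c):
    """Chain model: p(abc) ~ n^-1 * n_ab * n_bc / n_b."""
    if unigrams[b] == 0:
        return 0.0
    return bigrams[(a, b)] * bigrams[(b, c)] / (n * unigrams[b])


print(p_unigram("the", "great", "wave"))
print(p_bigram("the", "great", "wave"))
```

Each estimate costs a constant number of count lookups, which matches the point that no inference step is needed at query time.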
n-gram Interpolation
- Trigram approximation
- Wikipedia dataset (1291.5M terms, 405M unique trigrams)

                                Absolute error    Relative error
Unigram approximation           2.50 · 10^7       0.266
Bigram approximation            1.22 · 10^6       0.013
Trigram sketching (CountMin)    8.35 · 10^6       0.089

- Sketching trigrams is not accurate enough on the tail.
Summary
- Fast and simple algorithm to aggregate statistics of data streams.
- Effective compressed representation of the temporal data.
- Works well for graphical models.
- High-performance scalable implementation with $O(1)$ time access.
- Can be distributed over many servers.
Hokusai Katsushika
Great Wave off Kanagawa