Stream Algorithmics Albert Bifet March 2012

COMP423A/COMP523A Data Stream Mining · abifet/523/StreamAlgorithmics-Slides.pdf


Stream Algorithmics

Albert Bifet

March 2012


Data Streams

Big Data & Real Time

Data Streams

• Sequence is potentially infinite
• High amount of data: sublinear space
• High speed of arrival: sublinear time per example
• Once an element from a data stream has been processed it is discarded or archived

Data Stream Algorithmics

Example
Puzzle: Finding Missing Numbers

• Let $\pi$ be a permutation of $\{1, \ldots, n\}$.
• Let $\pi_{-1}$ be $\pi$ with one element missing.
• $\pi_{-1}[i]$ arrives in increasing order.

Task: Determine the missing number

• Use an n-bit vector to memorize all the numbers ($O(n)$ space)
• Data Streams: $O(\log(n))$ space. Store

  $$\frac{n(n+1)}{2} - \sum_{j \le i} \pi_{-1}[j].$$
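The running-sum idea in the puzzle can be sketched in a few lines of Python. This is only an illustration: the function name `find_missing` and the choice to pass `n` explicitly are assumptions, not part of the slides.

```python
def find_missing(stream, n):
    """Return the one value of {1, ..., n} that never appears in `stream`.

    Only a single integer is kept in memory, which needs O(log n) bits.
    """
    remaining = n * (n + 1) // 2   # sum of 1..n
    for value in stream:
        remaining -= value         # subtract each arriving element
    return remaining               # what is left is the missing number


print(find_missing([1, 2, 4, 5], n=5))
```

The stream is consumed once and never stored, which is the point of the puzzle.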

Data Streams

Approximation algorithms

• Small error rate with high probability
• An algorithm $(\varepsilon, \delta)$-approximates $F$ if it outputs $\tilde{F}$ for which

  $$\Pr[|\tilde{F} - F| > \varepsilon F] < \delta.$$


Data Stream Algorithmics

Examples

1. Compute the number of distinct pairs of IP addresses seen in a router

2. Compute the top-k most used words in tweets

Two problems: find the number of distinct items and find the most frequent items.


8 Bits Counter

1 0 1 0 1 0 1 0

What is the largest number we can store in 8 bits?

[Plot: f(x) = log(1 + x)/log(2), with f(0) = 0 and f(1) = 1, shown for x from 0 to 100 and again for x from 0 to 10]

[Plot: f(x) = log(1 + x/30)/log(1 + 1/30), with f(0) = 0 and f(1) = 1, shown for x from 0 to 10 and again for x from 0 to 100]
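One way to read these plots is that the 8-bit register could hold f(x) instead of x. The sketch below evaluates both curves and their inverses to see which count the largest register value, 255, would stand for. The names `encode`, `decode` and the parameter `a` (1 for the first curve, 30 for the second) are assumptions introduced for illustration.

```python
import math


def encode(x, a=1.0):
    """f(x) = log(1 + x/a) / log(1 + 1/a); a = 1 gives log(1 + x)/log(2)."""
    return math.log(1 + x / a) / math.log(1 + 1 / a)


def decode(c, a=1.0):
    """Inverse of encode: the count x represented by register value c."""
    return a * ((1 + 1 / a) ** c - 1)


# Largest count representable when the register saturates at 255.
print(decode(255, a=1.0))    # first curve, about 2**255
print(decode(255, a=30.0))   # second curve, far smaller range but finer steps
```

The larger `a` is, the closer f stays to the identity for small x, at the price of a smaller maximum.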

8 bits Counter

MORRIS APPROXIMATE COUNTING ALGORITHM

Init counter c ← 0
for every event in the stream
  do rand = random number between 0 and 1
     if rand < p
       then c ← c + 1

What is the largest number we can store in 8 bits?

• With $p = 1/2$ we can store $2 \times 256$ with standard deviation $\sigma = \sqrt{n}/2$
• With $p = 2^{-c}$, $E[2^c] = n + 2$ with variance $\sigma^2 = n(n+1)/2$
• If $p = b^{-c}$, then $E[b^c] = n(b-1) + b$ and $\sigma^2 = (b-1)n(n+1)/2$

These expectations correspond to a counter that starts at c = 1; starting from c = 0 as in the pseudocode gives $E[2^c] = n + 1$ and $E[b^c] = n(b-1) + 1$.
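A minimal sketch of the Morris counter with $p = b^{-c}$ follows. It assumes the counter starts at c = 0, in which case $E[b^c] = n(b-1) + 1$ and the estimate below is unbiased. The class name `MorrisCounter` and the injectable `rng` are illustrative choices, not taken from the slides.

```python
import random


class MorrisCounter:
    """Approximate counter: stores only the small integer c."""

    def __init__(self, base=2.0, rng=None):
        self.base = base
        self.c = 0  # assumption: start at 0 (the slide's formulas match c = 1)
        self.rng = rng if rng is not None else random.Random()

    def increment(self):
        # Increase c with probability base**(-c).
        if self.rng.random() < self.base ** (-self.c):
            self.c += 1

    def estimate(self):
        # Inverts E[base**c] = n*(base - 1) + 1 for a counter started at 0.
        return (self.base ** self.c - 1) / (self.base - 1)


counter = MorrisCounter(base=2.0, rng=random.Random(1))
for _ in range(100_000):
    counter.increment()
print(counter.c, counter.estimate())
```

A single counter has a large variance, so in practice one would average several of them or pick a base closer to 1.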


Data Stream Algorithmics

Examples

1. Compute the number of distinct pairs of IP addresses seen in a router
   IPv4: 32 bits
   IPv6: 128 bits

2. Compute top-k most used words in tweets

Find number of distinct items

Data Stream Algorithmics

Memory unit        Size     Binary size
kilobyte (kB/KB)   10^3     2^10
megabyte (MB)      10^6     2^20
gigabyte (GB)      10^9     2^30
terabyte (TB)      10^12    2^40
petabyte (PB)      10^15    2^50
exabyte (EB)       10^18    2^60
zettabyte (ZB)     10^21    2^70
yottabyte (YB)     10^24    2^80

Find number of distinct items
IPv4: 32 bits, IPv6: 128 bits


Data Stream Algorithmics

Example

1. Compute the number of distinct pairs of IP addresses seen in a router
   IPv4: 32 bits, IPv6: 128 bits
   Using 256 words of 32 bits gives an accuracy of 5%

Find number of distinct items


Data Stream Algorithmics

Example

1. Compute the number of distinct pairs of IP addresses seen in a router

Selecting n random numbers,
• half of these numbers have the first bit as zero,
• a quarter have the first and second bit as zero,
• an eighth have the first, second and third bit as zero, ...

A pattern $0^i 1$ appears with probability $2^{-(i+1)}$, so $n \approx 2^{i+1}$

Find number of distinct items

Data Stream Algorithmics

FLAJOLET-MARTIN PROBABILISTIC COUNTING ALGORITHM

Init bitmap[0 ... L − 1] ← 0
for every item x in the stream
  do index = ρ(hash(x))    ▷ position of the least significant 1-bit
     if bitmap[index] = 0
       then bitmap[index] = 1
b ← position of leftmost zero in bitmap
return $2^b/0.77351$

$E[pos] \approx \log_2 \varphi n \approx \log_2 (0.77351 \cdot n)$
$\sigma(pos) \approx 1.12$

Data Stream Algorithmics

item x   hash(x)   ρ(hash(x))   bitmap
a        0110      1            01000
b        1001      0            11000
c        0111      1            11000
d        1100      0            11000
a
b
e        0101      1            11000
f        1010      0            11000
a
b

$b = 2$, $n \approx 2^2/0.77351 = 5.17$
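The bitmap procedure can be sketched as follows. Two things are assumptions of this sketch rather than of the slides: the hash is the first 32 bits of SHA-1 (`hash32` is a made-up helper), and `rho` uses the numerically least significant 1-bit, whereas the table above reads hash bits from left to right (for example ρ(0110) = 1).

```python
import hashlib

PHI = 0.77351  # correction constant from the slides


def hash32(item):
    """Hypothetical helper: a 32-bit hash taken from SHA-1."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rho(value, bits=32):
    """0-based position of the least significant 1-bit (`bits` if value is 0)."""
    if value == 0:
        return bits
    return (value & -value).bit_length() - 1


def fm_estimate(stream, bits=32):
    bitmap = [0] * (bits + 1)
    for item in stream:
        bitmap[rho(hash32(item), bits)] = 1
    # b is the position of the first zero in the bitmap.
    b = bitmap.index(0) if 0 in bitmap else len(bitmap)
    return 2 ** b / PHI


print(fm_estimate(["a", "b", "c", "d", "a", "b", "e", "f", "a", "b"]))
```

Repeated items hash to the same bit, so duplicates never change the bitmap. A single bitmap gives only a rough estimate, which is what stochastic averaging addresses.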

Data Stream Algorithmics

FLAJOLET-MARTIN PROBABILISTIC COUNTING ALGORITHM

A variant that keeps only the largest ρ value instead of the bitmap:

Init M ← −∞
for every item x in the stream
  do M = max(M, ρ(h(x)))
b ← M + 1    ▷ position of leftmost zero in bitmap
return $2^b/0.77351$

Data Stream Algorithmics

Stochastic Averaging
Perform m experiments in parallel

$$\sigma' = \sigma/\sqrt{m}$$

Relative accuracy is $0.78/\sqrt{m}$

HYPERLOGLOG COUNTER

• the stream is divided into $m = 2^b$ substreams
• the estimation uses the harmonic mean
• Relative accuracy is $1.04/\sqrt{m}$

Data Stream Algorithmics

HYPERLOGLOG COUNTER

Init M[0 ... m − 1] ← −∞
for every item x in the stream
  do index = $h_b(x)$
     M[index] = max(M[index], ρ($h^b(x)$))
return $\alpha_m m^2 / \sum_{j=0}^{m-1} 2^{-M[j]}$

Here $h_b(x)$ denotes the first b bits of h(x) and $h^b(x)$ the remaining bits. For example, if h(x) = 010011000111 then $h_3(x) = 010$ and $h^3(x) = 011000111$.
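A rough Python sketch of the HyperLogLog counter is given below. It departs from the pseudocode in ways that should be read as assumptions: registers start at 0 rather than −∞, ρ is the 1-based position of the leftmost 1-bit of the remaining hash bits, the hash is the first 64 bits of SHA-1, and $\alpha_m$ uses the common approximation $0.7213/(1 + 1.079/m)$, which is meant for m ≥ 128. No small-range or large-range correction is applied.

```python
import hashlib


def hash64(item):
    """Hypothetical helper: a 64-bit hash taken from SHA-1."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HyperLogLog:
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b                      # number of substreams
        self.registers = [0] * self.m        # assumption: 0 instead of -inf
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        h = hash64(item)
        rest_bits = 64 - self.b
        index = h >> rest_bits               # first b bits pick the register
        rest = h & ((1 << rest_bits) - 1)    # remaining bits
        # 1-based position of the leftmost 1-bit in the remaining bits.
        rank = rest_bits - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        # Harmonic-mean estimate alpha_m * m^2 / sum_j 2^(-M[j]).
        return self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)


hll = HyperLogLog(b=10)
for i in range(20000):
    hll.add(f"item-{i}")
print(hll.estimate())
```

With m = 1024 registers the expected relative error is about 1.04/32, roughly 3%.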

Methodology

Paolo Boldi
Facebook: Four degrees of separation

Big Data does not need big machines, it needs big intelligence


Data Stream Algorithmics

Examples

1. Compute the number of distinct pairs of IP addresses seen in a router

2. Compute top-k most used words in tweets

Find most frequent items

Data Stream Algorithmics

MAJORITY

Init counter c ← 0
for every item s in the stream
  do if counter is zero
       then pick up the item
     if item is the same
       then increment counter
       else decrement counter

Find the item that is contained in more than half of the instances
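The MAJORITY procedure might look like this in Python. The function name `majority_candidate` is an illustrative choice. The returned item is only guaranteed to be the majority item if one exists, so a second pass would be needed to verify it otherwise.

```python
def majority_candidate(stream):
    """One pass, one stored item and one counter."""
    candidate = None
    count = 0
    for item in stream:
        if count == 0:
            candidate = item   # pick up the item
            count = 1
        elif item == candidate:
            count += 1         # same item: increment
        else:
            count -= 1         # different item: decrement
    return candidate


print(majority_candidate("aabacaa"))
```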

Data Stream Algorithmics

FREQUENT

for every item i in the stream
  do if item i is not monitored
       do if < k items monitored
            then add a new item with count 1
            else if an item z whose count is zero exists
                   then replace this item z by the new one
                   else decrement all counters by one
       else    ▷ item i is monitored
            increase its counter by one

Figure: Algorithm FREQUENT to find most frequent items
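A direct transcription of FREQUENT into Python could look as follows. One detail is an assumption: when a zero-count item z is replaced, the new item is given count 1, which the pseudocode does not state explicitly.

```python
def frequent(stream, k):
    """Monitor at most k items; counts are lower bounds on true frequencies."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1                    # item is monitored
        elif len(counters) < k:
            counters[item] = 1                     # free slot available
        else:
            zero = next((z for z, c in counters.items() if c == 0), None)
            if zero is not None:
                del counters[zero]                 # replace a zero-count item
                counters[item] = 1                 # assumption: start at 1
            else:
                for z in counters:
                    counters[z] -= 1               # decrement all counters
    return counters


print(frequent(list("abababcabd"), k=2))
```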

Data Stream Algorithmics

LOSSYCOUNTING

for every item i in the stream
  do if item i is not monitored
       then add a new item with count 1 + ∆
       else    ▷ item i is monitored
            increase its counter by one
     if ⌊n/k⌋ ≠ ∆
       then ∆ = ⌊n/k⌋
            decrement all counters by one
            remove items with zero counts

Figure: Algorithm LOSSYCOUNTING to find most frequent items

Data Stream Algorithmics

SPACE SAVING

for every item i in the stream
  do if item i is not monitored
       do if < k items monitored
            then add a new item with count 1
            else replace the item with lower counter
                 increase its counter by one
       else    ▷ item i is monitored
            increase its counter by one

Figure: Algorithm SPACE SAVING to find most frequent items
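SPACE SAVING can be sketched with a plain dictionary. Finding the minimum by scanning all k counters is a simplification chosen for brevity; the function name `space_saving` is illustrative.

```python
def space_saving(stream, k):
    """Monitor at most k items; counts are upper bounds on true frequencies."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1                       # item is monitored
        elif len(counters) < k:
            counters[item] = 1                        # free slot available
        else:
            victim = min(counters, key=counters.get)  # item with lowest counter
            # The new item inherits the victim's count, plus one.
            counters[item] = counters.pop(victim) + 1
    return counters


stream = []
for i in range(40):
    stream += ["x", i]
stream += ["x"] * 20
print(space_saving(stream, k=5))
```

Every arrival raises the total of all counters by exactly one, so the counters always sum to the stream length.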

Data Stream Algorithmics

[Figure: an item j is hashed by h1(j), h2(j), h3(j) and h4(j) to one cell in each of rows 1 to 4, and each of those cells is incremented (+I)]

Figure: A CM sketch structure example of ε = 0.4 and δ = 0.02

Count-Min Sketch

A two dimensional array with width w and depth d

$$w = \left\lceil \frac{e}{\varepsilon} \right\rceil, \quad d = \left\lceil \ln \frac{1}{\delta} \right\rceil$$

It uses space $wd = \frac{e}{\varepsilon} \ln \frac{1}{\delta}$ with update time $d = \ln \frac{1}{\delta}$

CM-Sketch computes frequency data adding and removing real values.
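The dimensions above translate into a small Python class. The row hashes here are SHA-1 salted with the row number, which is an assumption made for a self-contained sketch and not the pairwise-independent hash family the analysis of the sketch relies on. The names `CountMinSketch`, `update` and `query` are illustrative.

```python
import hashlib
import math


class CountMinSketch:
    def __init__(self, epsilon, delta):
        self.width = math.ceil(math.e / epsilon)      # w = ceil(e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))   # d = ceil(ln(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _cell(self, row, item):
        # Hypothetical hashing scheme: one salted SHA-1 per row.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, item, amount=1):
        # One cell per row is touched, so an update costs d operations.
        for row in range(self.depth):
            self.table[row][self._cell(row, item)] += amount

    def query(self, item):
        # The minimum over rows is the least contaminated estimate.
        return min(self.table[row][self._cell(row, item)]
                   for row in range(self.depth))


sketch = CountMinSketch(epsilon=0.4, delta=0.02)
print(sketch.width, sketch.depth)
```

As long as every item's net count stays non-negative, `query` never underestimates the true count.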

Data Stream Algorithmics

Problem
Given a data stream, choose k items with the same probability, storing only k elements in memory.

RESERVOIR SAMPLING

Data Stream Algorithmics

RESERVOIR SAMPLING

for every item i in the first k items of the stream
  do store item i in the reservoir
n = k
for every item i in the stream after the first k items of the stream
  do n = n + 1
     select a random number r between 1 and n
     if r ≤ k
       then replace item r in the reservoir with item i

Figure: Algorithm RESERVOIR SAMPLING
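Reservoir sampling can be sketched as below, with n counting the items seen so far, including the current one, before r is drawn. The function name and the injectable `rng` are assumptions made so the example is reproducible.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly from a stream of unknown length."""
    rng = rng if rng is not None else random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)       # the first k items fill the reservoir
        else:
            r = rng.randint(1, n)        # uniform in 1..n, both ends included
            if r <= k:
                reservoir[r - 1] = item  # replace item r with the new item
    return reservoir


print(reservoir_sample(range(1000), k=5, rng=random.Random(0)))
```

Item n enters the reservoir with probability k/n, which keeps every item seen so far equally likely to be in the sample.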

Mean and Variance

Given a stream $x_1, x_2, \ldots, x_n$

$$\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\sigma_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2.$$

Both can be maintained incrementally from two running sums:

$$s_n = \sum_{i=1}^{n} x_i, \quad q_n = \sum_{i=1}^{n} x_i^2$$

$$s_n = s_{n-1} + x_n, \quad q_n = q_{n-1} + x_n^2$$

$$\bar{x}_n = s_n/n$$

$$\sigma_n^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}_n^2 \right) = \frac{1}{n-1} \left( q_n - s_n^2/n \right)$$
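The two running sums translate directly into code. The class name `RunningStats` is illustrative. This form can lose precision through cancellation when the variance is tiny relative to the mean, so treat it as a sketch of the formulas rather than a numerically robust implementation.

```python
class RunningStats:
    """Keeps n, s_n = sum of x_i and q_n = sum of x_i^2."""

    def __init__(self):
        self.n = 0
        self.s = 0.0
        self.q = 0.0

    def add(self, x):
        self.n += 1
        self.s += x        # s_n = s_{n-1} + x_n
        self.q += x * x    # q_n = q_{n-1} + x_n^2

    def mean(self):
        return self.s / self.n

    def variance(self):
        # (q_n - s_n^2 / n) / (n - 1); needs n >= 2.
        return (self.q - self.s ** 2 / self.n) / (self.n - 1)


stats = RunningStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(x)
print(stats.mean(), stats.variance())
```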

Data Stream Sliding Window

1011000111 1010101
10110001111 0101011
101100011110 1010111
1011000111101 0101110
10110001111010 1011101
101100011110101 0111010

Sliding Window
We can maintain simple statistics over sliding windows, using $O(\frac{1}{\varepsilon} \log^2 N)$ space, where
• N is the length of the sliding window
• ε is the accuracy parameter

M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002

Exponential Histograms

M = 2

Buckets:   1010101   101   11   1   1   1
Content:   4         2     2    1   1   1
Capacity:  7         3     2    1   1   1

Buckets:   1010101   101   11   11   1
Content:   4         2     2    2    1
Capacity:  7         3     2    2    1

Buckets:   1010101   10111   11   1
Content:   4         4       2    1
Capacity:  7         5       2    1

Exponential Histograms

Buckets:   1010101   101   11   1   1
Content:   4         2     2    1   1
Capacity:  7         3     2    1   1

Error < content of the last bucket W/M
ε = 1/(2M) and M = 1/(2ε)

To give answers in O(1) time, it maintains three counters LAST, TOTAL and VARIANCE.

M · log(W/M) buckets to maintain the data stream sliding window
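The bucket merging shown in the M = 2 example can be sketched as follows for counting 1s in a sliding window. The merge rule used here (merge the two oldest buckets of a size as soon as more than M buckets of that size exist) is inferred from the example, and only a TOTAL counter is kept; the class and attribute names are hypothetical.

```python
class ExponentialHistogram:
    """Approximate count of 1s among the last `window` bits."""

    def __init__(self, window, max_per_size=2):
        self.window = window
        self.max_per_size = max_per_size   # M in the slides
        self.buckets = []                  # (time of newest 1, content), newest first
        self.time = 0
        self.total = 0                     # sum of all bucket contents

    def add(self, bit):
        self.time += 1
        # Drop buckets whose newest 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.total -= self.buckets.pop()[1]
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        self.total += 1
        size = 1
        while True:
            idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idx) <= self.max_per_size:
                break
            newer, older = idx[-2], idx[-1]   # the two oldest buckets of this size
            self.buckets[newer] = (self.buckets[newer][0], 2 * size)
            del self.buckets[older]
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        # Only part of the oldest bucket may still be inside the window.
        return self.total - self.buckets[-1][1] // 2


eh = ExponentialHistogram(window=1000, max_per_size=2)
for _ in range(11):
    eh.add(1)
print([content for _, content in reversed(eh.buckets)])  # oldest bucket first
```

Feeding ten 1s and then an eleventh reproduces the contents 4 2 2 1 1 and 4 4 2 1 from the slides.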