Data Streams
Topics in Data Mining, Fall 2015
Bruno Ribeiro
© 2015 Bruno Ribeiro
Data Streams Applications
◦ Stream item counting
◦ Stream statistics
◦ Stream classification
◦ Stream matching
What are Data Streams?
◦ Data streams: continuous, ordered, changing, fast, huge in volume
◦ Traditional DBMS: data stored in finite, persistent data sets
Characteristics
◦ Huge volumes of continuous data, possibly infinite
◦ Fast changing, requiring fast, real-time responses
◦ Random access is expensive: single-scan algorithms (only one pass)
◦ Store only a summary of the data seen thus far
◦ Most stream data are low-level or multi-dimensional in nature, and need multi-level, multi-dimensional processing
Ack. From Jiawei Han
Examples
◦ Telecommunications: calling records
◦ Business: credit card transaction flows
◦ Network monitoring and traffic engineering
◦ Financial markets: stock exchange
◦ Engineering & industrial processes: power supply & manufacturing
◦ Sensing, monitoring & surveillance: video streams, RFIDs
◦ Security monitoring
◦ Web logs and Web page click streams
◦ Massive data sets (even when saved, random access is too expensive)
Ack. From Jiawei Han
DBMS versus DSMS
◦ Persistent relations vs. transient streams
◦ One-time queries vs. continuous queries
◦ Random access vs. sequential access
◦ “Unbounded” disk store vs. bounded main memory
◦ Only current state matters vs. historical data is important
◦ No real-time services vs. real-time requirements
◦ Relatively low update rate vs. possibly multi-GB arrival rate
◦ Data at any granularity vs. data at fine granularity
◦ Assume precise data vs. stale/imprecise data
◦ Access plan determined by query processor and physical DB design vs. unpredictable/variable data arrival and characteristics
Ack. From Motwani’s PODS tutorial slides
In General: Streaming algorithm
[Figure: a continuous data stream X1, X2, …, Xn (terabytes) flows through a stream processing engine with only gigabytes of memory; the engine keeps an in-memory summary and an estimate of θ = g(X1, …, Xn), answering a query Q through this “indirect” observation.]
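The schematic can be sketched in code: a minimal single-pass engine that sees each element once, keeps only O(1) state, and answers a continuous query from its summary alone. The particular statistics (count, mean, max) are illustrative choices of g, not from the slides.

```python
class StreamSummary:
    """Single-pass streaming summary: O(1) memory, one update per element."""

    def __init__(self):
        self.n = 0                   # elements seen so far
        self.total = 0.0             # running sum
        self.max = float("-inf")     # running maximum

    def update(self, x):
        """Process one stream element in O(1) time and memory."""
        self.n += 1
        self.total += x
        if x > self.max:
            self.max = x

    def query(self):
        """Answer the continuous query from the summary alone."""
        return {"count": self.n,
                "mean": self.total / self.n if self.n else None,
                "max": self.max}

s = StreamSummary()
for x in [3, 1, 4, 1, 5]:
    s.update(x)
```

The point of the design is that `query()` never touches the stream itself, only the summary, which is all a streaming engine can afford to keep.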
Hashing
Querying
Query types
◦ One-time queries vs. continuous queries (evaluated continuously as the stream arrives)
◦ Predefined queries vs. ad-hoc queries (issued on-line)
Unbounded memory requirements
◦ For real-time response, a main-memory algorithm should be used
◦ Memory requirements are unbounded if one must join against future tuples
Approximate query answering
◦ With bounded memory, it is not always possible to produce exact answers
◦ High-quality approximate answers are desired
◦ Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.
Ack. From Jiawei Han
Synopses/Approximate Answers
Major challenges
◦ Keep track of a large universe, e.g., pairs of IP addresses, not ages
Methodology
◦ Synopses (trade-off between accuracy and storage): a summary given in brief terms that covers the major points of the data
◦ Use synopsis data structures that are much smaller (O(log^k N) space) than their base data set (O(N) space)
◦ Compute an approximate answer within a small error range (within a factor ε of the actual answer)
Major methods
◦ Random sampling
◦ Histograms
◦ Sliding windows
◦ Multi-resolution models
◦ Sketches
◦ Randomized algorithms
Ack. From Jiawei Han
Types of Streaming Algorithms
Sliding windows
◦ Operate only over a sliding window of recent stream data
◦ An approximation, but often more desirable in applications
Batched processing, sampling and synopses
◦ Batched: if updates are fast but computation is slow, compute periodically (not very timely)
◦ Sampling: if updates are slow but computation is fast, compute using sampled data
◦ Synopsis data structures: maintain a small synopsis or sketch of the data; good for querying historical data
Blocking operators, e.g., sorting, avg, min, etc.
◦ Blocking: unable to produce the first output until seeing the entire input
Ack. From Jiawei Han
Stream Processing
Random sampling (without knowing the total stream length in advance)
Sliding windows
◦ Make decisions based only on recent data, within a sliding window of size w
◦ An element arriving at time t expires at time t + w
Histograms
◦ Approximate the frequency distribution of element values in a stream
◦ Partition data into a set of contiguous buckets
◦ Equal-width (equal value range per bucket) vs. V-optimal (minimizing frequency variance within each bucket)
Multi-resolution models
◦ Popular models: balanced binary trees, micro-clusters, and wavelets
Ack. From Jiawei Han
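Random sampling without knowing the total stream length in advance is classically solved by reservoir sampling. A minimal sketch of Vitter’s Algorithm R, which keeps a uniform sample of k items in one pass and O(k) memory:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """One-pass uniform sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)       # fill the reservoir with the first k items
        else:
            j = rng.randrange(i + 1)  # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = x
    return reservoir

sample = reservoir_sample(range(10_000), k=10, rng=random.Random(0))
```

After every arrival, each element seen so far is in the reservoir with probability exactly k/(i+1), which is what makes the sample uniform at all times.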
Random Sampling: A Simple Approach to Item Counts
Random Sampling: Packet sampling
[Figure: a router between Internet links applies Bernoulli sampling to passing packets.]
◦ Widely used: processing overhead is controlled by the sampling rate (e.g., 1/200)
◦ Traffic summary, e.g., find the % of traffic from Netflix @ Purdue
◦ Estimate packet-level statistics
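A minimal sketch of Bernoulli packet sampling with the standard inversion estimator (dividing the sampled count by the sampling probability). The traffic mix and labels below are made-up illustration values, not measurements from the slides.

```python
import random

def bernoulli_sample(packets, p, rng):
    """Sample each packet independently with probability p (e.g. p = 1/200)."""
    return [pkt for pkt in packets if rng.random() < p]

def estimate_total(sampled_count, p):
    """Inversion estimator: E[sampled_count / p] equals the true count."""
    return sampled_count / p

rng = random.Random(42)
# Hypothetical traffic: 30% of packets from one source, 70% from the rest.
packets = ["netflix"] * 60_000 + ["other"] * 140_000
sampled = bernoulli_sample(packets, p=1 / 200, rng=rng)

est_total = estimate_total(len(sampled), 1 / 200)
netflix_share = sum(1 for s in sampled if s == "netflix") / len(sampled)
```

Packet-level shares like `netflix_share` are estimated well this way; the next slides show why flow-level statistics are much harder to recover from the same samples.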
A Fair Measure: Flow-level Statistics
◦ Find the % of connections from Netflix @ Purdue
◦ Estimate flow-level statistics
◦ Estimate the flow size distribution
Flow-level Statistics from Sampled Packets?
◦ A reverse problem (an inference problem)
Finding estimates – schematic view
[Figure: the original flow data passes through a sampling stage; an estimator then recovers flow statistics from the sampled data.]
Flow size distribution: maximum likelihood estimation
◦ Sampling rate = 1/200; 128,000 sampled flows
◦ EM algorithm with 2 initializations
[Plot: cumulative % of flows (30%–100%) vs. flow size (1–19), comparing Estimate 1, Estimate 2, and the original distribution.]
Estimates are highly sensitive to initialization.
MLE: more samples
◦ Packet sampling rate = 1/200; 1 trillion sampled flows
[Plot: cumulative % of flows (30%–100%) vs. flow size (1–20) for the resulting estimate.]
Problem: Uniform sampling
[Image, from Wikipedia: 71% of Earth’s surface is water, so a uniform sample of the surface mostly returns water.]
Importance Sampling
◦ Dedicate precious memory only to “important” observations
◦ Sample flows, rather than packets. Problem? Will likely miss large flows
◦ Sample flows ∝ flow size. Problem? In a streaming setting, we don’t yet know the size
◦ Example of a compromise: Sample and Hold. Sample packets, then keep all remaining packets of the same flow
Different Sampling Designs
◦ Packet sampling: sample elements (packets) with probability p
◦ Flow sampling: sample sets (flows) with probability q
◦ Sample & Hold: randomly sample elements with probability q′ from the stream, but collect all future elements of the same flow (“color”)
◦ Dual sampling: sample the first element with high probability; sample following elements with low probability and use “sequence numbers” to recover elements lost “in the middle”
[Figure: traffic seen as a stream of colored elements, one color per flow.]
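A minimal sketch of Sample & Hold as described above: each packet of an untracked flow is sampled with probability q′; once a flow is held, every later packet of that flow is counted. The flow names and sizes below are illustrative.

```python
import random

def sample_and_hold(packets, q, rng):
    """Sample & Hold: hold a flow with probability q per packet,
    then count all subsequent packets of held flows exactly."""
    counts = {}
    for flow in packets:              # each packet is tagged with its flow id
        if flow in counts:
            counts[flow] += 1         # flow already held: count everything
        elif rng.random() < q:
            counts[flow] = 1          # start holding this flow
    return counts

rng = random.Random(1)
# Hypothetical stream: one large ("elephant") flow and a few tiny flows.
packets = ["elephant"] * 5_000 + ["mouse1", "mouse2", "mouse3"]
held = sample_and_hold(packets, q=0.01, rng=rng)
```

Note the built-in bias toward large flows: a flow of size s is held with probability 1 − (1 − q)^s, so elephants are almost surely caught while most mice are skipped, which is exactly the importance-sampling compromise the slide describes.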
Results: Different Sampling Designs
◦ FS = flow sampling
◦ SH = sample and hold
◦ DS = dual sampling
◦ PS = packet sampling
Tune & Veitch, 2014
Sketches
◦ Note that not every problem can be solved well with sampling. Example: flow size estimation
◦ “Sketch”: a linear transformation of the input. Model the stream as defining a vector; the sketch is the result of multiplying the stream vector by an (implicit) matrix
[Figure: the stream X1, X2, …, Xn is linearly projected into a much smaller sketch.]
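Linearity is what makes sketches mergeable: the sketch of a combined stream equals the sum of the individual sketches. A minimal illustration using a single count-sketch-style row (each item is hashed to a bucket with a ±1 sign); the hash construction from SHA-1 is an implementation assumption for the demo, not from the slides.

```python
import hashlib

def bucket_and_sign(item, width):
    """Hash an item to (bucket, ±1 sign) — one row of the implicit matrix."""
    h = hashlib.sha1(str(item).encode()).digest()
    bucket = int.from_bytes(h[:8], "big") % width
    sign = 1 if h[8] & 1 else -1
    return bucket, sign

def sketch(counts, width):
    """Linear sketch of a frequency vector: scatter each coordinate into
    one of `width` buckets with a ±1 sign."""
    s = [0] * width
    for item, c in counts.items():
        b, sign = bucket_and_sign(item, width)
        s[b] += sign * c
    return s

a = {"x": 3, "y": 5}           # frequency vector of stream A
b = {"y": 2, "z": 7}           # frequency vector of stream B
merged = {"x": 3, "y": 7, "z": 7}
sa, sb, sm = sketch(a, 16), sketch(b, 16), sketch(merged, 16)
```

Because the transformation is linear, `sa + sb` (elementwise) equals `sm` exactly, even though colliding items share buckets inside each sketch.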
Counting Sketch: Abhishek Kumar et al. 2004
Definitions
◦ N → number of flows
◦ W → maximum flow size
◦ M → memory size
Space complexity
◦ Available memory M = k N log W, with k < 1
Flow Size Sketch: Kumar et al. 2004
Motivation: estimate the flow size distribution
◦ A hash function f uniformly at random associates each newly arrived flow with a counter
◦ Elements of each flow (blue, red, green, …) increment their flow’s counter; distinct flows can collide on the same counter
◦ Uses precious memory only on counters > 0
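A minimal sketch in the spirit of the counter array described above (not the authors’ exact data structure): every packet increments the single counter its flow hashes to, so colliding flows share a counter and the estimation phase must disambiguate. The flow names are illustrative.

```python
import hashlib

class CounterSketch:
    """Array of counters indexed by a hash of the flow id."""

    def __init__(self, num_counters):
        self.counters = [0] * num_counters

    def _index(self, flow_id):
        # Uniform-looking hash of the flow id into a counter index.
        h = hashlib.sha1(flow_id.encode()).digest()
        return int.from_bytes(h[:8], "big") % len(self.counters)

    def add_packet(self, flow_id):
        self.counters[self._index(flow_id)] += 1

    def counter_for(self, flow_id):
        # May exceed the flow's true size if another flow collides here.
        return self.counters[self._index(flow_id)]

sk = CounterSketch(1024)
for _ in range(12):
    sk.add_packet("flow-blue")
for _ in range(3):
    sk.add_packet("flow-red")
```

The sketch phase is this cheap per-packet increment; recovering the flow size distribution from the collided counters is the expensive back-end estimation step the next slide describes.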
Data Streaming on Flow Size Estimation
[Figure: sketch phase at the router — a universal hash function maps each flow to a counter in the array, and collisions can occur; estimation phase at a powerful back-end server — the counter summary is disambiguated into a flow size distribution estimate.]
Issues with Kumar et al.
◦ Effectively works only if the counter load is < 2
◦ In practice, reduces the required memory by 1/2
◦ Very resource-intensive estimation procedure
Eviction Sketch
Ribeiro et al. 2008
Eviction Sketch: Probabilistic collision avoidance
◦ Maximum hash value = M, but only M/2 counters
◦ If hash(packet) < M/2, the flow is “red”; otherwise hash(packet) mod M/2 gives its counter and the flow is “blue”
◦ A collision between flows of the same color is undetectable
◦ A blue–red collision is detectable: 1 extra bit per counter is required
[Figure: flows hashed into the counter array, with same-color and blue–red collisions.]
Eviction
Collision policy (“folding”):
◦ A red flow cannot increment a blue counter
◦ A blue flow overwrites a red counter
◦ Counters equal to 0 are red
Interesting fact: with, e.g., 1 counter per flow, all red counters equal to 0 also serve as blue counters, which virtually expands the hash table by ≈ 50% (virtually 2 counters per flow)
◦ The number of eviction classes is ∞; the policy evicts a random flow
◦ Blue counters evict red counters, giving a flow sampling effect: ≈ 15% of flows are discarded at random
[Figure: counter array before and after eviction, with the extra color bit per counter.]
Group large flow sizes & Probabilistic counting [Morris 78]
◦ Reduce counter size via probabilistic counter increments
◦ Above counter value k, the counter is incremented with probability p = 1/m1; above k+1, with probability p = 1/m2; and so on
◦ Counter value k → average flow sizes in [k, k + m1 − 1]; counter value k+1 → average flow sizes in [k + m1, k + m1 + m2 − 1]
◦ With bin widths m_a = 2^a, a 6-bit counter bins flows up to an average size of 10^14
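The slides describe general bin widths m_a = 2^a; the classic Morris counter is the special case where the increment probability halves at each counter value. A minimal sketch (the averaging over many counters is only to demonstrate unbiasedness):

```python
import random

class MorrisCounter:
    """Classic Morris approximate counter: stores only the small state c,
    increments it with probability 2**-c, and estimates n as 2**c - 1."""

    def __init__(self, rng):
        self.c = 0
        self.rng = rng

    def increment(self):
        if self.rng.random() < 2.0 ** -self.c:
            self.c += 1

    def estimate(self):
        return 2 ** self.c - 1

# Average many independent counters to show E[2**c - 1] tracks the true count.
counters = [MorrisCounter(random.Random(i)) for i in range(200)]
for m in counters:
    for _ in range(1_000):
        m.increment()
avg_estimate = sum(m.estimate() for m in counters) / len(counters)
```

A single counter’s estimate is noisy (its variance grows like n²/2), but it needs only about log₂ log₂ n bits, which is the memory saving the slide is after.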
Experiment
◦ Evaluated with simulations
◦ Our worst result, with Internet core traces: 9.5 million flows, 8 MB of memory, k = 16, W = 10^14
◦ The same accuracy without counter folding requires 13 MB of memory
Final estimation result (over Internet traffic)
◦ Input: 10^6 flows with 250 KB of memory
Approximate Search: Bloom filters
Good tutorial: Andrei Broder and Michael Mitzenmacher, Network Applications of Bloom Filters: A Survey, Internet Mathematics Vol. 1, No. 4: 485–509, 2003
How Bloom Filters Work
◦ S = set of items, n = |S|; k hash functions f1, …, fk; m = number of bits in the filter. Assume kn < m
◦ To insert x ∈ S, set bits f1(x), …, fk(x) to 1
◦ To check membership y ∈ S, check whether fi(y), 1 ≤ i ≤ k, are all set to 1
◦ If not, y ∉ S; else, we conclude that y ∈ S, but sometimes y ∉ S (a false positive)
◦ In many applications, false positives are OK as long as they happen with small probability
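A minimal Bloom filter along the lines above; deriving the k hash functions from salted SHA-1 is an implementation assumption, not part of the slides.

```python
import hashlib

class BloomFilter:
    """Bit-array Bloom filter: no false negatives, small false positive rate."""

    def __init__(self, m_bits, k):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # k hash functions simulated by salting SHA-1 with the index i.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(m_bits=8 * 1024, k=6)   # m/n = 8 bits per item for n = 1024
for i in range(1024):
    bf.add(f"item-{i}")
```

Every inserted item is always reported present; only absent items can (rarely) be reported present, which is the false-positive behavior analyzed on the next slide.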
Why Bloom Filters Work
Bloom filter errors. Assumption: the hash functions look random
◦ Given m bits for the filter and n elements, choose the number k of hash functions to minimize false positives:
◦ Let p = Pr[a given bit is still 0] = (1 − 1/m)^{kn} ≈ e^{−kn/m}
◦ Then the false positive rate is f = (1 − p)^k ≈ (1 − e^{−kn/m})^k
◦ As k increases, there are more chances to find at least one 0, but we also insert more 1’s into the bit vector
◦ Optimal at k = (ln 2) m/n (derivative = 0, second derivative > 0)
Example
◦ m/n = 8; optimal k = 8 ln 2 ≈ 5.5
[Plot: false positive rate (0–0.1) vs. number of hash functions (0–11); the curve reaches its minimum near k ≈ 5.5.]
Ack. Mitzenmacher
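The plotted curve can be reproduced numerically from the approximation f ≈ (1 − e^{−kn/m})^k; a short sketch for the m/n = 8 case:

```python
import math

def false_positive_rate(m, n, k):
    """Approximate Bloom filter false positive rate (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

m_over_n = 8
rates = {k: false_positive_rate(m_over_n, 1, k) for k in range(1, 16)}
best_k = min(rates, key=rates.get)        # best integer choice of k
k_opt = math.log(2) * m_over_n            # continuous optimum (ln 2) m/n
```

With 8 bits per item the continuous optimum is k = 8 ln 2 ≈ 5.545, and among integer values of k the minimum false positive rate (about 2%) is attained at k = 6.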