49
Distributed Data Management Summer Semester 2015 TU Kaiserslautern Prof. Dr.-Ing. Sebastian Michel Databases and Information Systems Group (AG DBIS) http://dbis.informatik.uni-kl.de/ Distributed Data Management, SoSe 2015, S. Michel 1

Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Distributed Data ManagementSummer Semester 2015

TU Kaiserslautern

Prof. Dr.-Ing. Sebastian Michel

Databases and Information Systems Group (AG DBIS)

http://dbis.informatik.uni-kl.de/

Distributed Data Management, SoSe 2015, S. Michel 1

Page 2: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Announcements

• Please send specific questions you have to Evica Milchevski or Kiril Panev ([email protected] or [email protected] )

• They will collect them and compile a FAQ list to walk through in the exam preparation session on July 21 during regular exercise session hours.

• Please do not forget to register for the exam also at the examination office.

• Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional for the next exercise session; everyone gets the “point” for this assignment.

Distributed Data Management, SoSe 2015, S. Michel 2

Page 3: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

(DISTRIBUTED) DATA STREAM PROCESSING

Distributed Data Management, SoSe 2015, S. Michel 3

Page 4: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Data Stream• A data stream is a sequence of data tuples.

• Think of standard tuples of relational databases.

• With time information (timestamps)

• One after the other, or in batches, they are generated.

• That means, Data is moving! Continuously generated (assumed infinite!)

• Potentially high pace.

• System has to process data without first storing everything (how would that be possible anyway if stream is infinite?!)

Distributed Data Management, SoSe 2015, S. Michel 4

Page 5: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Sensor Networks as Data Streams Origin• E.g., in Environmental Monitoring

Distributed Data Management, SoSe 2015, S. Michel 5

StationStream(timestamp, humidity, solarRadiation, windSpeed, snowHeight)

• Various application scenarios:– avalanche risk level

computation

– insights for agriculture

– air pollution (urban) monitoring

Page 6: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Sample Applications: Pothole Patrol• The Pothole Patrol

• Detecting and reporting the surface conditions of roads; using sensors in vehicles

• Using 3-axis accelerometer+GPS + learning

Distributed Data Management, SoSe 2015, S. Michel 6

Eriksson et al. The Pothole Patrol: Using a Mobile Sensor Network for Road Surface Monitoring. MobiSys 2008.

Page 7: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Sample Application: Swiss Experiment

Distributed Data Management, SoSe 2015, S. Michel7

• Environmental monitoring

• Sensor data management and meta data sharing.

• Across many different types of measurement: (hydrology, alpine monitoring, atmospheric phenomena, earthquakes, …)

• Also higher level applications like putting sensors and interpretations on maps, computing statistics over streams.

http://www.swiss-experiment.ch

Page 8: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Sensors

Distributed Data Management, SoSe 2015, S. Michel 8

• Mobile vs. static

• Large vs. tiny (smart dust!)

• bytes/hours vs. > GB/minute

ambient temp.sensor of a car

Sensors at LHC@CERN

Tiny sensor atU Michigan

Page 9: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Example Sensor (Tinynode) on Top of Extension Board

Distributed Data Management, SoSe 2015, S. Michel 9

Antenna

Light Sensor

Connector Board for extras

RS-233 for programming

htt

p:/

/ww

w.t

inyn

od

e.co

m/

Actual Sensor

Page 10: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Social Sensors• Explicitly: Snow Tweets (http://snowcore.uwaterloo.ca/snowtweets/)

– #snowtweets 50.0 cm. at K1A 0A2

– #snowtweets 10.0 in. at 20500

– #snowtweets 4 cm at Palmerston North 4414

• Implicitly: By mentioning topics, people, in social communication

Distributed Data Management, SoSe 2015, S. Michel 10

time

Page 11: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

11

Earthquake News on Twitter

Distributed Data Management, SoSe 2015, S. Michel

Page 12: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

12

Earthquake News on Twitter

Distributed Data Management, SoSe 2015, S. Michel

Page 13: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

13

Earthquake News on Twitter

Distributed Data Management, SoSe 2015, S. Michel

Page 14: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

14source: http://blog.socialflow.com/

Earthquake News on Twitter

Distributed Data Management, SoSe 2015, S. Michel

Page 15: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Classic Example: Stock Market

• Real-time analysis of stock marked changes

• Computing statistics over streams, e.g., for decision support

• Opportunities for reacting in real-time

• Even with fully automated means: algorithmic trading

Distributed Data Management, SoSe 2015, S. Michel 15

Page 16: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

So Far: Databases/NoSQL Datastores• Data is changing, yes, but this is more due to

inserts and update to stored data items

• Historic data is kept

• Queries operate on full data (tables)

• MapReduce is extreme, Write-once & Read-many times

• Data warehousing, too: periodically loading data in store for deep(er) analytics

• Data mining

Distributed Data Management, SoSe 2015, S. Michel 16

Page 17: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Data Stream Management vs. Traditional Data Management

• At query time, data is accessed as a whole

• Data is persistently stored

• Queries are ad-hoc (mainly)

Distributed Data Management, SoSe 2015, S. Michel 17

DATA Base/Store

Query & ResultsInsert

Update

Delete

Page 18: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Data Stream Management vs. Traditional Data Management (Cont’d)

• Data is moving! Continuously generated (assumed infinite!)• At high pace• Queries are (mainly) continuous (aka. standing). Registered

once, observed “forever”.• Answer to queries in (near) real-time required (often)• Probabilistic methods for efficiency or considering only part of

the stream (sliding window)Distributed Data Management, SoSe 2015, S. Michel 18

DATA STREAM

Set of queries

Page 19: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

DBMS vs. DSMS

Distributed Data Management, SoSe 2015, S. Michel 19

Database management system (DBMS)

Data stream management system (DSMS)

Persistent data (relations) Volatile data streams

Random access Sequential access

One-time queries Continuous queries

(theoretically) unlimited secondary storage Limited main memory

Only the current state is relevant Consideration of the order of the input

Relatively low update rate Potentially extremely high update rate

Little or no time requirements Real-time requirements

Assumes exact data Assumes outdated/inaccurate data

Plannable query processingVariable data arrival and data characteristics

http://en.wikipedia.org/wiki/Data-stream_management_system

Page 20: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Data Stream Model

• Stream of data items is unbounded (available memory is not)

• No way to store entire stream (how could we, its (probably) not ending)

• To compute query results, need to devise algorithm with little memory consumption

Distributed Data Management, SoSe 2015, S. Michel 20

Page 21: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Overview of Data Stream Topics• Synopses:

– concise representations of stream content– tailored to tasks, e.g., counting distinct elements– usually not exact, but approximations (estimators) of

true values. – generally useful for representing data compactly– We will look at some of them today

• (Sliding) Windows:– focus of certain recent subset of data– computation of functions/joins over window(s)

content– Will look at CQL language: think “SQL” for streaming

data

Distributed Data Management, SoSe 2015, S. Michel 21

Page 22: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Data Stream Mining: Teasers

• I tell you integer numbers between 1 and N

• I will tell all but one number

• After N-1 numbers I ask: which number was missing?

Distributed Data Management, SoSe 2015, S. Michel 22

481 324 122 412 871 231 849 447 641 …

Page 23: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Data Stream Mining: Teasers (Cont’d)

• Keep Boolean array of length N:

– Mark position for observed number

– Size required: N

– Computation at end: N to find missing number

Distributed Data Management, SoSe 2015, S. Michel 23

Page 24: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Data Stream Mining: Teasers (Cont’d)

• Keep Boolean array of length N:

– Mark position for observed number

– Size required: N

– Computation at end: N to find missing number

Distributed Data Management, SoSe 2015, S. Michel 24

• Much better:

– keep sum of numbers: S

– Missing number is N*(N+1)/2 - S

Page 25: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Counting Occurrences

• Consider a stream of elements ai

…, a2, a84, a41, a2, a77, a231, a2, a4, a54, …

• How often does a2 occur?

• How to implement?

Distributed Data Management, SoSe 2015, S. Michel 25

Page 26: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Counting Occurrences

• Consider a stream of elements ai

…, a2, a84, a41, a2, a77, a231, a2, a4, a54, …

• How often does a2 occur?

• How to implement?

Distributed Data Management, SoSe 2015, S. Michel 26

• Keep counter for each id

• Required space #ids (=N)

• Not feasible of N is very large

Page 27: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Probabilistic Counting: Count-Min Sketch

• Keep 2-dim array (h, r)

• h hash functions hi that map to range 0…(r-1)

Distributed Data Management, SoSe 2015, S. Michel 27

Cormode, Muthukrishnan (2004). An Improved Data Stream Summary: The Count-Min Sketch and its Applications. J. Algorithms 55: 29–38.

0 1 2 3 4 5

• Arriving item x.

• For each j: array[j, hj(x)]++

h1

h2

h3

h4

Page 28: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Count-Min Sketch: Insert Example

Distributed Data Management, SoSe 2015, S. Michel 28

0 1 2 3 4 5

h1

h2

h3

h4

a, b, a, a, c, a, c, ….Data stream is

x h1(x) h2(x) h3(x) h4(x)

a 4 5 0 2

b 3 5 1 3

c 2 2 0 3

… … … … …

Consider the following 4 hash functions, for ease of usage, displayed by their value when applied to a, b, or c:

Page 29: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Count-Min Sketch: Insert Example

Distributed Data Management, SoSe 2015, S. Michel 29

1

1

1

1

0 1 2 3 4 5

h1

h2

h3

h4

a, b, a, a, c, a, c, ….Data stream is

x h1(x) h2(x) h3(x) h4(x)

a 4 5 0 2

b 3 5 1 3

c 2 2 0 3

… … … … …

Consider the following 4 hash functions, for ease of usage, displayed by their value when applied to a, b, or c:

red = inserted

Page 30: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Count-Min Sketch: Insert Example

Distributed Data Management, SoSe 2015, S. Michel 30

1 1

2

1 1

1 1

0 1 2 3 4 5

h1

h2

h3

h4

a, b, a, a, c, a, c, ….Data stream is

x h1(x) h2(x) h3(x) h4(x)

a 4 5 0 2

b 3 5 1 3

c 2 2 0 3

… … … … …

Consider the following 4 hash functions, for ease of usage, displayed by their value when applied to a, b, or c:

red = inserted

Page 31: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Count-Min Sketch: Insert Example

Distributed Data Management, SoSe 2015, S. Michel 31

1 2

3

2 1

2 1

0 1 2 3 4 5

h1

h2

h3

h4

x h1(x) h2(x) h3(x) h4(x)

a 4 5 0 2

b 3 5 1 3

c 2 2 0 3

… … … … …

Consider the following 4 hash functions, for ease of usage, displayed by their value when applied to a, b, or c:

a, b, a, a, c, a, c, ….Data stream is

red = inserted

Page 32: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Count-Min Sketch: Insert Example

Distributed Data Management, SoSe 2015, S. Michel 32

1 3

4

3 1

3 1

0 1 2 3 4 5

h1

h2

h3

h4

x h1(x) h2(x) h3(x) h4(x)

a 4 5 0 2

b 3 5 1 3

c 2 2 0 3

… … … … …

Consider the following 4 hash functions, for ease of usage, displayed by their value when applied to a, b, or c:

a, b, a, a, c, a, c, ….Data stream is

red = inserted

Page 33: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Count-Min Sketch: Insert Example

Distributed Data Management, SoSe 2015, S. Michel 33

1 1 3

1 4

4 1

3 2

0 1 2 3 4 5

h1

h2

h3

h4

x h1(x) h2(x) h3(x) h4(x)

a 4 5 0 2

b 3 5 1 3

c 2 2 0 3

… … … … …

Consider the following 4 hash functions, for ease of usage, displayed by their value when applied to a, b, or c:

a, b, a, a, c, a, c, ….Data stream is

Imagine that continues now a bit, then we might end up with ……

red = inserted

Page 34: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Count-Min Sketch: Counting

• How often did we see item a?

• Recall the hash function values for a:

h1(a) = 4, h2(a)=5, h3(a)=0, h4(a)=2

Distributed Data Management, SoSe 2015, S. Michel 34

5 3 4 4 9 3

4 7 1 4 4 8

8 4 6 7 2 1

3 1 4 8 7 5

0 1 2 3 4 5

h1

h2

h3

h4

Page 35: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Count-Min Sketch: Counting

• How often did we see item a?

• Look at positions: h1(a) = 4, h2(a)=5, h3(a)=0, h4(a)=2

• Take minimum of the corresponding values: Here: 4

Distributed Data Management, SoSe 2015, S. Michel 35

5 3 4 4 9 3

4 7 1 4 4 8

8 4 6 7 2 1

3 1 4 8 7 5

0 1 2 3 4 5

h1

h2

h3

h4

9

8

8

4

Is this estimator generally underestimating or overestimating or can’t we say anything about that?

Page 36: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Count-Min Sketch: Counting

• How often did we see item a?

• Look at positions: h1(a) = 4, h2(a)=5, h3(a)=0, h4(a)=2

• Take minimum of the corresponding values: Here: 4

• Estimate is never underestimating

• Overestimation probabilistically bounded

Distributed Data Management, SoSe 2015, S. Michel 36

5 3 4 4 9 3

4 7 1 4 4 8

8 4 6 7 2 1

3 1 4 8 7 5

0 1 2 3 4 5

h1

h2

h3

h4

9

8

8

4

Page 37: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Unbiased vs. Biased Estimators

• Given a real number and an estimator of it, denoted as

• E.g., number of distinct elements in a set S

• is called an unbiased estimator if

• and biased otherwise, in which case

Distributed Data Management, SoSe 2015, S. Michel 37

Page 38: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Counting Distinct Elements

• Consider a stream of elements ai

…, a2, a84, a41, a2, a77, a231, a2, a4, a54, …

• How to compute/estimate the number of distinct elements observed?

Distributed Data Management, SoSe 2015, S. Michel 38

Page 39: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Application• Streams (one pass, little memory footprint)

• Distributed systems: compact data exchange (recall Bloom filter)

• But also counting in Database Systems

• Sketches for partial data can often be merged for global view

Distributed Data Management, SoSe 2015, S. Michel 39

sketch

sketch

Efficient Counting, Comparing:How many distinct elements altogether?

Page 40: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Flajolet Martin (FM) Sketch (aka. Hash Sketch)

• Allocate a bitvector B of size m = log(N)

• Hash items to bitvector positions:

– Hash each item i to a m-bit number h(i)

– Compute position k of the least-significant “1” of h(i)

– Set the bit B[k] to “1”

Given data stream 17, 5, 19, 211, 17, 5, 31

h(17) = 010100 then least-sig. 1 bit = 2h(5) = 000101 then least-sig. 1 bit = 0...

Proposed originally by Flajolet and Martin in 1985

Distributed Data Management, SoSe 2015, S. Michel 40

Page 41: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

FM-Sketch: Estimator

• Get then position t of left-most “0” bit of B

• Count-Distinct Estimate of real distinct number n:

here: with t = 3:

• Improvement: – If you use more bitmaps and compute an average

position t, you can improve count-distinct estimate

Distributed Data Management, SoSe 2015, S. Michel 41

111010 B:• In the end B might look like this

Note: Be careful with left-most bit, least significant bit; depends on interpretation of bits

with

Page 42: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

FM Sketch: Intuition/Idea

• B[0] is set approximately n/2 times

• B[1] is set approximately n/4 times

• B[i] = 0 if i >> log2(n)

• B[i] = 1 if i<<log2(n)

• “Mix” of 1s and 0s around i≈log2(n)

• Use left-most zero at indicator for log2(n):

n ≈ 2position of left most zero bit

Distributed Data Management, SoSe 2015, S. Michel 42

Page 43: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

FM-Sketch: Union

• Given: two multisets S and T and theirs sketches and of size m

• Then:

The sketch

is the sketch of

BSBT

Distributed Data Management, SoSe 2015, S. Michel 43

Page 44: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

K-Min Value (KMV) Synopsis

• KMV synopsis is ordered set of k smallest values

},...,,{ )()2()1( kUUUL

0 1

• Unbiased Estimator:

– Exact error analysis based on theory of order statistics

– Asymptotically optimal as k becomes large

Distributed Data Management, SoSe 2015, S. Michel 44

• Given set S of values. Want number of distinct elements n := D(S) (notation)

• Hashing outputs values uniformly in [0,1]

k-min values

Slide based on PPT slides from Beyer et al. ‘07

[Beyer et al. , 2007]

Page 45: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

K-Min Value (KMV) Synopsis (Cont’d)

• KMV synopsis:

• Distance between each hash value is 1/n (assuming uniformity)

• We are interested in knowing the value of n

• Natural derivation of basic estimator:

Observation: Thus:

1/n

},...,,{ )()2()1( kUUUL

0 1U(1)

U(2)

U(k)

...

Hashing = dropping the n DVs uniformly on [0,1]

k-min values

Slide based on PPT slides from Beyer et al. ‘07

Why does this work?

This is a biased estimator.(k-1) instead of k renders it unbiased

Page 46: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

KMV Example• Implemented hash function using MD5 (and

mapped to [0,1]. Items are characters a-z.

Distributed Data Management, SoSe 2015, S. Michel

[0.043, 0.172, 0.281, 0.354, 0.382, 0.421, 0.443, 0.459, 0.463, 0.523, 0.556, 0.565, 0.569, 0.57, 0.59, 0.644, 0.652, 0.675, 0.682, 0.724, 0.818, 0.864, 0.89, 0.938, 0.994, 0.997]

k U(k)1 0.000 0,043

2 5.814 0,172

3 7.117 0,281

4 8.475 0,354

5 10.471 0,382

6 11.876 0,421

7 13.544 0,443

8 15.251 0,459

9 17.279 0,463

10 17.208 0,523

11 17.986 0,556

12 19.469 0,565

13 21.090 0,569

14 22.807 0,570

15 23.729 0,590

16 23.292 0,644

17 24.540 0,652

18 25.185 0,675

19 26.393 0,682

20 26.243 0,724

21 24.450 0,818

The sorted (ascending) hash values for the 26 characters a-z are:

• In general, why are distinct elements counted?

• What if we have one KMV sketch for a multiset A and one for multiset B? Can we combine them?

46

Page 47: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

(Multiset) Union of Partitions

0

k-min

0

k-min

0

k-min

U(k)

L

LA LB

• Combine KMV synopses: L = LA LB

• Theorem: L is a KMV synopsis of AB

… 1 … 1

… 1

Distributed Data Management, SoSe 2015, S. Michel 47

Take union of values and consider again the k smallest ones:

Slide based on PPT slides from Beyer et al. ‘07

Page 48: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Distributed Data Management, SoSe 2015, S. Michel

• L=LALB as before (union): contains k elements– L corresponds to a uniform random sample of DVs in

AB

• K = # values in L that are also in D(AB)Theorem: Can compute from LA and LB alone

• K/ k estimates Jaccard coefficient:

estimates

• Unbiased estimator of #DVs in the intersection:

(Multiset) Intersection of Partitions

48

D(set) = distinct values

Slide based on PPT slides from Beyer et al. ‘07

Page 49: Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Literature

Distributed Data Management, SoSe 2015, S. Michel 49

• Graham Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1): 58-75 (2005)

• Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)

• Kevin S. Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, Rainer Gemulla: On synopses for distinct-value estimation under multisetoperations. SIGMOD Conference 2007: 199-210