Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional

Distributed Data ManagementSummer Semester 2015

TU Kaiserslautern

Prof. Dr.-Ing. Sebastian Michel

Databases and Information Systems Group (AG DBIS)

http://dbis.informatik.uni-kl.de/

Distributed Data Management, SoSe 2015, S. Michel 1

Announcements

• Please send specific questions you have to Evica Milchevski or Kiril Panev ([email protected] or [email protected] )

• They will collect them and compile a FAQ list to walk through in the exam preparation session on July 21 during regular exercise session hours.

• Please do not forget to register for the exam also at the examination office.

• Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional for the next exercise session; everyone gets the “point” for this assignment.


mailto:[email protected]

mailto:[email protected]

(DISTRIBUTED) DATA STREAM PROCESSING


Data Stream• A data stream is a sequence of data tuples.

• Think of standard tuples of relational databases.

• With time information (timestamps)

• One after the other, or in batches, they are generated.

• That means, Data is moving! Continuously generated (assumed infinite!)

• Potentially high pace.

• System has to process data without first storing everything (how would that be possible anyway if stream is infinite?!)


Sensor Networks as Data Streams Origin• E.g., in Environmental Monitoring


StationStream(timestamp, humidity, solarRadiation, windSpeed, snowHeight)

• Various application scenarios:– avalanche risk level

computation

– insights for agriculture

– air pollution (urban) monitoring

Sample Applications: Pothole Patrol• The Pothole Patrol

• Detecting and reporting the surface conditions of roads; using sensors in vehicles

• Using 3-axis accelerometer+GPS + learning


Eriksson et al. The Pothole Patrol: Using a Mobile Sensor Network for Road Surface Monitoring. MobiSys 2008.

Sample Application: Swiss Experiment

Distributed Data Management, SoSe 2015, S. Michel7

• Environmental monitoring

• Sensor data management and meta data sharing.

• Across many different types of measurement: (hydrology, alpine monitoring, atmospheric phenomena, earthquakes, …)

• Also higher level applications like putting sensors and interpretations on maps, computing statistics over streams.

http://www.swiss-experiment.ch

Sensors


• Mobile vs. static

• Large vs. tiny (smart dust!)

• bytes/hours vs. > GB/minute

ambient temp.sensor of a car

Sensors at LHC@CERN

Tiny sensor atU Michigan

Example Sensor (Tinynode) on Top of Extension Board


Antenna

Light Sensor

Connector Board for extras

RS-233 for programming

htt

p:/

/ww

w.t

inyn

od

e.co

m/

Actual Sensor

Social Sensors• Explicitly: Snow Tweets (http://snowcore.uwaterloo.ca/snowtweets/)

– #snowtweets 50.0 cm. at K1A 0A2

– #snowtweets 10.0 in. at 20500

– #snowtweets 4 cm at Palmerston North 4414

• Implicitly: By mentioning topics, people, in social communication


time

11

Earthquake News on Twitter

Distributed Data Management, SoSe 2015, S. Michel

12



13



14source: http://blog.socialflow.com/



Classic Example: Stock Market

• Real-time analysis of stock marked changes

• Computing statistics over streams, e.g., for decision support

• Opportunities for reacting in real-time

• Even with fully automated means: algorithmic trading


So Far: Databases/NoSQL Datastores• Data is changing, yes, but this is more due to

inserts and update to stored data items

• Historic data is kept

• Queries operate on full data (tables)

• MapReduce is extreme, Write-once & Read-many times

• Data warehousing, too: periodically loading data in store for deep(er) analytics

• Data mining


Data Stream Management vs. Traditional Data Management

• At query time, data is accessed as a whole

• Data is persistently stored

• Queries are ad-hoc (mainly)


DATA Base/Store

Query & ResultsInsert

Update

Delete

Data Stream Management vs. Traditional Data Management (Cont’d)

• Data is moving! Continuously generated (assumed infinite!)• At high pace• Queries are (mainly) continuous (aka. standing). Registered

once, observed “forever”.• Answer to queries in (near) real-time required (often)• Probabilistic methods for efficiency or considering only part of

the stream (sliding window)Distributed Data Management, SoSe 2015, S. Michel 18

DATA STREAM

Set of queries

DBMS vs. DSMS


Database management system (DBMS)

Data stream management system (DSMS)

Persistent data (relations) Volatile data streams

Random access Sequential access

One-time queries Continuous queries

(theoretically) unlimited secondary storage Limited main memory

Only the current state is relevant Consideration of the order of the input

Relatively low update rate Potentially extremely high update rate

Little or no time requirements Real-time requirements

Assumes exact data Assumes outdated/inaccurate data

Plannable query processingVariable data arrival and data characteristics

http://en.wikipedia.org/wiki/Data-stream_management_system

Data Stream Model

• Stream of data items is unbounded (available memory is not)

• No way to store entire stream (how could we, its (probably) not ending)

• To compute query results, need to devise algorithm with little memory consumption


Overview of Data Stream Topics• Synopses:

– concise representations of stream content– tailored to tasks, e.g., counting distinct elements– usually not exact, but approximations (estimators) of

true values. – generally useful for representing data compactly– We will look at some of them today

• (Sliding) Windows:– focus of certain recent subset of data– computation of functions/joins over window(s)

content– Will look at CQL language: think “SQL” for streaming

data


Data Stream Mining: Teasers

• I tell you integer numbers between 1 and N

• I will tell all but one number

• After N-1 numbers I ask: which number was missing?


481 324 122 412 871 231 849 447 641 …

Data Stream Mining: Teasers (Cont’d)

• Keep Boolean array of length N:

– Mark position for observed number

– Size required: N

– Computation at end: N to find missing number


Data Stream Mining: Teasers (Cont’d)

• Keep Boolean array of length N:

– Mark position for observed number

– Size required: N

– Computation at end: N to find missing number


• Much better:

– keep sum of numbers: S

– Missing number is N*(N+1)/2 - S

Counting Occurrences

• Consider a stream of elements ai

…, a2, a84, a41, a2, a77, a231, a2, a4, a54, …

• How often does a2 occur?

• How to implement?


Counting Occurrences


…, a2, a84, a41, a2, a77, a231, a2, a4, a54, …

• How often does a2 occur?

• How to implement?


• Keep counter for each id

• Required space #ids (=N)

• Not feasible of N is very large

Probabilistic Counting: Count-Min Sketch

• Keep 2-dim array (h, r)

• h hash functions hi that map to range 0…(r-1)


Cormode, Muthukrishnan (2004). An Improved Data Stream Summary: The Count-Min Sketch and its Applications. J. Algorithms 55: 29–38.

0 1 2 3 4 5

• Arriving item x.

• For each j: array[j, hj(x)]++

h1

h2

h3

h4

Count-Min Sketch: Insert Example


0 1 2 3 4 5

h1

h2

h3

h4

a, b, a, a, c, a, c, ….Data stream is

x h1(x) h2(x) h3(x) h4(x)

a 4 5 0 2

b 3 5 1 3

c 2 2 0 3

… … … … …

Consider the following 4 hash functions, for ease of usage, displayed by their value when applied to a, b, or c:



1

1

1

1

0 1 2 3 4 5

h1

h2

h3

h4


x h1(x) h2(x) h3(x) h4(x)

a 4 5 0 2

b 3 5 1 3

c 2 2 0 3

… … … … …


red = inserted



1 1

2

1 1

1 1

0 1 2 3 4 5

h1

h2

h3

h4


x h1(x) h2(x) h3(x) h4(x)

a 4 5 0 2

b 3 5 1 3

c 2 2 0 3

… … … … …


red = inserted



1 2

3

2 1

2 1

0 1 2 3 4 5

h1

h2

h3

h4

x h1(x) h2(x) h3(x) h4(x)

a 4 5 0 2

b 3 5 1 3

c 2 2 0 3

… … … … …



red = inserted



1 3

4

3 1

3 1

0 1 2 3 4 5

h1

h2

h3

h4

x h1(x) h2(x) h3(x) h4(x)

a 4 5 0 2

b 3 5 1 3

c 2 2 0 3

… … … … …



red = inserted



1 1 3

1 4

4 1

3 2

0 1 2 3 4 5

h1

h2

h3

h4

x h1(x) h2(x) h3(x) h4(x)

a 4 5 0 2

b 3 5 1 3

c 2 2 0 3

… … … … …



Imagine that continues now a bit, then we might end up with ……

red = inserted

Count-Min Sketch: Counting

• How often did we see item a?

• Recall the hash function values for a:

h1(a) = 4, h2(a)=5, h3(a)=0, h4(a)=2


5 3 4 4 9 3

4 7 1 4 4 8

8 4 6 7 2 1

3 1 4 8 7 5

0 1 2 3 4 5

h1

h2

h3

h4



• Look at positions: h1(a) = 4, h2(a)=5, h3(a)=0, h4(a)=2

• Take minimum of the corresponding values: Here: 4


5 3 4 4 9 3

4 7 1 4 4 8

8 4 6 7 2 1

3 1 4 8 7 5

0 1 2 3 4 5

h1

h2

h3

h4

9

8

8

4

Is this estimator generally underestimating or overestimating or can’t we say anything about that?



• Look at positions: h1(a) = 4, h2(a)=5, h3(a)=0, h4(a)=2

• Take minimum of the corresponding values: Here: 4

• Estimate is never underestimating

• Overestimation probabilistically bounded


5 3 4 4 9 3

4 7 1 4 4 8

8 4 6 7 2 1

3 1 4 8 7 5

0 1 2 3 4 5

h1

h2

h3

h4

9

8

8

4

Unbiased vs. Biased Estimators

• Given a real number and an estimator of it, denoted as

• E.g., number of distinct elements in a set S

• is called an unbiased estimator if

• and biased otherwise, in which case


n̂

Counting Distinct Elements


…, a2, a84, a41, a2, a77, a231, a2, a4, a54, …

• How to compute/estimate the number of distinct elements observed?


Application• Streams (one pass, little memory footprint)

• Distributed systems: compact data exchange (recall Bloom filter)

• But also counting in Database Systems

• Sketches for partial data can often be merged for global view


sketch

sketch

Efficient Counting, Comparing:How many distinct elements altogether?

Flajolet Martin (FM) Sketch (aka. Hash Sketch)

• Allocate a bitvector B of size m = log(N)

• Hash items to bitvector positions:

– Hash each item i to a m-bit number h(i)

– Compute position k of the least-significant “1” of h(i)

– Set the bit B[k] to “1”

Given data stream 17, 5, 19, 211, 17, 5, 31

h(17) = 010100 then least-sig. 1 bit = 2h(5) = 000101 then least-sig. 1 bit = 0...

Proposed originally by Flajolet and Martin in 1985


FM-Sketch: Estimator

• Get then position t of left-most “0” bit of B

• Count-Distinct Estimate of real distinct number n:

here: with t = 3:

• Improvement: – If you use more bitmaps and compute an average

position t, you can improve count-distinct estimate


111010 B:• In the end B might look like this

Note: Be careful with left-most bit, least significant bit; depends on interpretation of bits

with

FM Sketch: Intuition/Idea

• B[0] is set approximately n/2 times

• B[1] is set approximately n/4 times

• B[i] = 0 if i >> log2(n)

• B[i] = 1 if i<<log2(n)

• “Mix” of 1s and 0s around i≈log2(n)

• Use left-most zero at indicator for log2(n):

n ≈ 2position of left most zero bit


FM-Sketch: Union

• Given: two multisets S and T and theirs sketches and of size m

• Then:

The sketch

is the sketch of

BSBT


K-Min Value (KMV) Synopsis

• KMV synopsis is ordered set of k smallest values

},...,,{ )()2()1( kUUUL

0 1

• Unbiased Estimator:

– Exact error analysis based on theory of order statistics

– Asymptotically optimal as k becomes large


• Given set S of values. Want number of distinct elements n := D(S) (notation)

• Hashing outputs values uniformly in [0,1]

k-min values

Slide based on PPT slides from Beyer et al. ‘07

[Beyer et al. , 2007]

K-Min Value (KMV) Synopsis (Cont’d)

• KMV synopsis:

• Distance between each hash value is 1/n (assuming uniformity)

• We are interested in knowing the value of n

• Natural derivation of basic estimator:

Observation: Thus:

1/n

},...,,{ )()2()1( kUUUL

0 1U(1)

U(2)

U(k)

...

Hashing = dropping the n DVs uniformly on [0,1]

k-min values


Why does this work?

This is a biased estimator.(k-1) instead of k renders it unbiased

KMV Example• Implemented hash function using MD5 (and

mapped to [0,1]. Items are characters a-z.


[0.043, 0.172, 0.281, 0.354, 0.382, 0.421, 0.443, 0.459, 0.463, 0.523, 0.556, 0.565, 0.569, 0.57, 0.59, 0.644, 0.652, 0.675, 0.682, 0.724, 0.818, 0.864, 0.89, 0.938, 0.994, 0.997]

k U(k)1 0.000 0,043

2 5.814 0,172

3 7.117 0,281

4 8.475 0,354

5 10.471 0,382

6 11.876 0,421

7 13.544 0,443

8 15.251 0,459

9 17.279 0,463

10 17.208 0,523

11 17.986 0,556

12 19.469 0,565

13 21.090 0,569

14 22.807 0,570

15 23.729 0,590

16 23.292 0,644

17 24.540 0,652

18 25.185 0,675

19 26.393 0,682

20 26.243 0,724

21 24.450 0,818

The sorted (ascending) hash values for the 26 characters a-z are:

• In general, why are distinct elements counted?

• What if we have one KMV sketch for a multiset A and one for multiset B? Can we combine them?

46

(Multiset) Union of Partitions

0

k-min

0

k-min

0

k-min

U(k)

L

LA LB

• Combine KMV synopses: L = LA LB

• Theorem: L is a KMV synopsis of AB

… 1 … 1

… 1


Take union of values and consider again the k smallest ones:



• L=LALB as before (union): contains k elements– L corresponds to a uniform random sample of DVs in

AB

• K = # values in L that are also in D(AB)Theorem: Can compute from LA and LB alone

• K/ k estimates Jaccard coefficient:

estimates

• Unbiased estimator of #DVs in the intersection:

(Multiset) Intersection of Partitions

48

D(set) = distinct values


Literature


• Graham Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1): 58-75 (2005)

• Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)

• Kevin S. Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, Rainer Gemulla: On synopses for distinct-value estimation under multisetoperations. SIGMOD Conference 2007: 199-210

Documents

Distributed Data Managementdbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf · •Since we did not capture CQL in today’s lecture, the assignment 3 on sheet 6 is optional