Distributed Data Management, Summer Semester 2015
TU Kaiserslautern
Prof. Dr.-Ing. Sebastian Michel
Databases and Information Systems Group (AG DBIS)
http://dbis.informatik.uni-kl.de/
Distributed Data Management, SoSe 2015, S. Michel 1
Announcements
• Please send specific questions you have to Evica Milchevski or Kiril Panev ([email protected] or [email protected] )
• They will collect them and compile a FAQ list to walk through in the exam preparation session on July 21 during regular exercise session hours.
• Please do not forget to register for the exam also at the examination office.
• Since we did not cover CQL in today’s lecture, assignment 3 on sheet 6 is optional for the next exercise session; everyone gets the “point” for this assignment.
(DISTRIBUTED) DATA STREAM PROCESSING
Data Stream
• A data stream is a sequence of data tuples.
• Think of standard tuples of relational databases.
• With time information (timestamps)
• They are generated one after the other, or in batches.
• That means data is moving! It is generated continuously (and assumed to be infinite!)
• Potentially high pace.
• System has to process data without first storing everything (how would that be possible anyway if stream is infinite?!)
Sensor Networks as Data Streams Origin
• E.g., in Environmental Monitoring
StationStream(timestamp, humidity, solarRadiation, windSpeed, snowHeight)
• Various application scenarios:
– avalanche risk level computation
– insights for agriculture
– air pollution (urban) monitoring
Sample Application: Pothole Patrol
• The Pothole Patrol
• Detecting and reporting the surface conditions of roads; using sensors in vehicles
• Using a 3-axis accelerometer, GPS, and machine learning
Eriksson et al. The Pothole Patrol: Using a Mobile Sensor Network for Road Surface Monitoring. MobiSys 2008.
Sample Application: Swiss Experiment
• Environmental monitoring
• Sensor data management and meta data sharing.
• Across many different types of measurement: (hydrology, alpine monitoring, atmospheric phenomena, earthquakes, …)
• Also higher level applications like putting sensors and interpretations on maps, computing statistics over streams.
http://www.swiss-experiment.ch
Sensors
• Mobile vs. static
• Large vs. tiny (smart dust!)
• bytes/hours vs. > GB/minute
[Pictures: ambient temperature sensor of a car; sensors at LHC@CERN; tiny sensor at U Michigan]
Example Sensor (Tinynode) on Top of Extension Board
[Picture labels: Antenna; Light Sensor; Connector Board for extras; RS-232 for programming; Actual Sensor]
http://www.tinynode.com/
Social Sensors
• Explicitly: Snow Tweets (http://snowcore.uwaterloo.ca/snowtweets/)
– #snowtweets 50.0 cm. at K1A 0A2
– #snowtweets 10.0 in. at 20500
– #snowtweets 4 cm at Palmerston North 4414
• Implicitly: By mentioning topics, people, in social communication
Earthquake News on Twitter
[Figure sequence with a time axis; source: http://blog.socialflow.com/]
Classic Example: Stock Market
• Real-time analysis of stock market changes
• Computing statistics over streams, e.g., for decision support
• Opportunities for reacting in real-time
• Even with fully automated means: algorithmic trading
So Far: Databases/NoSQL Datastores
• Data is changing, yes, but mostly through inserts and updates to stored data items
• Historic data is kept
• Queries operate on full data (tables)
• MapReduce is the extreme case: write once, read many times
• Data warehousing, too: periodically loading data in store for deep(er) analytics
• Data mining
Data Stream Management vs. Traditional Data Management
• At query time, data is accessed as a whole
• Data is persistently stored
• Queries are ad-hoc (mainly)
[Diagram: Insert, Update, and Delete operations go into the database/store; queries are posed against it and return results]
Data Stream Management vs. Traditional Data Management (Cont’d)
• Data is moving! Continuously generated (assumed infinite!)
• At high pace
• Queries are (mainly) continuous (aka. standing). Registered once, observed “forever”.
• Answer to queries in (near) real-time required (often)
• Probabilistic methods for efficiency or considering only part of the stream (sliding window)
[Diagram: a data stream flows past a set of registered queries]
DBMS vs. DSMS
Database management system (DBMS) | Data stream management system (DSMS)
Persistent data (relations) | Volatile data streams
Random access | Sequential access
One-time queries | Continuous queries
(Theoretically) unlimited secondary storage | Limited main memory
Only the current state is relevant | Consideration of the order of the input
Relatively low update rate | Potentially extremely high update rate
Little or no time requirements | Real-time requirements
Assumes exact data | Assumes outdated/inaccurate data
Plannable query processing | Variable data arrival and data characteristics
Source: http://en.wikipedia.org/wiki/Data-stream_management_system
Data Stream Model
• Stream of data items is unbounded (available memory is not)
• No way to store the entire stream (how could we, if it (probably) never ends?)
• To compute query results, we need to devise algorithms with little memory consumption
Overview of Data Stream Topics
• Synopses:
– concise representations of stream content
– tailored to tasks, e.g., counting distinct elements
– usually not exact, but approximations (estimators) of true values
– generally useful for representing data compactly
– We will look at some of them today
• (Sliding) Windows:
– focus on a certain recent subset of the data
– computation of functions/joins over window(s) content
– Will look at CQL language: think “SQL” for streaming data
Data Stream Mining: Teasers
• I tell you integer numbers between 1 and N
• I will tell all but one number
• After N-1 numbers I ask: which number was missing?
Example stream: 481 324 122 412 871 231 849 447 641 …
Data Stream Mining: Teasers (Cont’d)
• Keep Boolean array of length N:
– Mark position for observed number
– Size required: N bits
– Computation at end: scan up to N positions to find the missing number
• Much better:
– keep sum of numbers: S
– Missing number is N*(N+1)/2 - S
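The sum-based solution can be sketched in a few lines of Python. This is a minimal illustration, assuming the stream contains every number from 1 to N exactly once except for the missing one; the function name `find_missing` is made up for this example.

```python
def find_missing(stream, n):
    """Return the one number from 1..n that does not appear in stream."""
    total = 0
    for x in stream:
        total += x  # constant memory: only the running sum S is kept
    # The numbers 1..n sum to n*(n+1)/2, so the gap is the missing number.
    return n * (n + 1) // 2 - total


# Example: the numbers 1..10 without 7, consumed as a one-pass iterator.
numbers = (x for x in range(1, 11) if x != 7)
print(find_missing(numbers, 10))
```

Compared to the Boolean array, this keeps a single counter instead of N bits and needs no scan at the end.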
Counting Occurrences
• Consider a stream of elements ai
…, a2, a84, a41, a2, a77, a231, a2, a4, a54, …
• How often does a2 occur?
• How to implement?
• Keep a counter for each id
• Required space: #ids (= N)
• Not feasible if N is very large
Probabilistic Counting: Count-Min Sketch
• Keep a 2-dim array with h rows and r columns
• h hash functions hj that map to the range 0…(r-1)
• Arriving item x: for each j, array[j, hj(x)]++
[Figure: array with rows h1, h2, h3, h4 and columns 0 1 2 3 4 5]
Cormode, Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications. J. Algorithms 55(1): 58-75 (2005).
Count-Min Sketch: Insert Example
Data stream is: a, b, a, a, c, a, c, …
Consider the following 4 hash functions, for ease of use shown by their values when applied to a, b, or c:
x | h1(x) | h2(x) | h3(x) | h4(x)
a | 4 | 5 | 0 | 2
b | 3 | 5 | 1 | 3
c | 2 | 2 | 0 | 3
… | … | … | … | …
Array (rows h1 to h4, columns 0 to 5) after each insertion; cells that are empty on the slides are shown as 0.
After a:
     0 1 2 3 4 5
h1   0 0 0 0 1 0
h2   0 0 0 0 0 1
h3   1 0 0 0 0 0
h4   0 0 1 0 0 0
After a, b:
     0 1 2 3 4 5
h1   0 0 0 1 1 0
h2   0 0 0 0 0 2
h3   1 1 0 0 0 0
h4   0 0 1 1 0 0
After a, b, a:
     0 1 2 3 4 5
h1   0 0 0 1 2 0
h2   0 0 0 0 0 3
h3   2 1 0 0 0 0
h4   0 0 2 1 0 0
After a, b, a, a:
     0 1 2 3 4 5
h1   0 0 0 1 3 0
h2   0 0 0 0 0 4
h3   3 1 0 0 0 0
h4   0 0 3 1 0 0
After a, b, a, a, c:
     0 1 2 3 4 5
h1   0 0 1 1 3 0
h2   0 0 1 0 0 4
h3   4 1 0 0 0 0
h4   0 0 3 2 0 0
Imagine that this continues for a while; then we might end up with an array like the one in the counting example.
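The insert steps can be replayed in code. The following is a small sketch that hard-codes the hash values from the table for a, b, and c (so `HASHES` is a lookup table, not a real hash function) and applies the update rule array[j, hj(x)]++ to the first five stream items.

```python
# Hash values copied from the table for items a, b, c.
# This is a lookup table standing in for the four hash functions h1..h4.
HASHES = {"a": (4, 5, 0, 2), "b": (3, 5, 1, 3), "c": (2, 2, 0, 3)}
ROWS, COLS = 4, 6  # 4 hash functions, each mapping to the range 0..5


def insert(array, item):
    """Increment one counter per row, at the column given by that row's hash."""
    for j, col in enumerate(HASHES[item]):
        array[j][col] += 1


def replay(stream):
    """Start from an all-zero array and insert every item of the stream."""
    array = [[0] * COLS for _ in range(ROWS)]
    for item in stream:
        insert(array, item)
    return array


state = replay(["a", "b", "a", "a", "c"])
for j, row in enumerate(state, start=1):
    print(f"h{j}", row)
```

Note that a and c collide in row h3 (both map to column 0), which is why that counter ends up larger than the true count of either item.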
Count-Min Sketch: Counting
• How often did we see item a?
• Recall the hash function values for a: h1(a) = 4, h2(a) = 5, h3(a) = 0, h4(a) = 2
     0 1 2 3 4 5
h1   5 3 4 4 9 3
h2   4 7 1 4 4 8
h3   8 4 6 7 2 1
h4   3 1 4 8 7 5
• Look at these positions: the values are 9 (h1), 8 (h2), 8 (h3), and 4 (h4)
• Take the minimum of the corresponding values: here, 4
• Is this estimator generally underestimating or overestimating, or can’t we say anything about that?
• The estimate never underestimates
• Overestimation is probabilistically bounded
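A compact version of the whole structure might look as follows. This is only a sketch under stated assumptions: the class name `CountMinSketch`, the default sizes, and the use of salted MD5 digests as the row hash functions are choices made for this illustration, not taken from the lecture or the paper.

```python
import hashlib


class CountMinSketch:
    def __init__(self, rows=4, cols=64):
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def _col(self, j, item):
        # Assumption: one hash function per row, obtained by salting MD5 with j.
        digest = hashlib.md5(f"{j}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.cols

    def add(self, item, count=1):
        for j in range(self.rows):
            self.table[j][self._col(j, item)] += count

    def estimate(self, item):
        # Every occurrence of item raised all of its counters, so each counter
        # is at least the true count; collisions can only add to it.
        return min(self.table[j][self._col(j, item)] for j in range(self.rows))


# The query from the counting example: positions 4, 5, 0, 2 in rows h1..h4.
SLIDE_TABLE = [
    [5, 3, 4, 4, 9, 3],
    [4, 7, 1, 4, 4, 8],
    [8, 4, 6, 7, 2, 1],
    [3, 1, 4, 8, 7, 5],
]
slide_estimate = min(SLIDE_TABLE[j][p] for j, p in enumerate((4, 5, 0, 2)))
print(slide_estimate)  # min(9, 8, 8, 4) = 4

cms = CountMinSketch()
for x in ["a", "b", "a", "a", "c", "a", "c"]:
    cms.add(x)
print(cms.estimate("a"))  # at least 4, the true count of a
```

Taking the minimum picks the counter that suffered the fewest collisions, which is why more rows reduce the chance of a large overestimate.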
Unbiased vs. Biased Estimators
• Given a real number $n$ and an estimator of it, denoted as $\hat{n}$
• E.g., $n$ is the number of distinct elements in a set S
• $\hat{n}$ is called an unbiased estimator if $E[\hat{n}] = n$
• and biased otherwise, in which case the bias is $E[\hat{n}] - n \neq 0$
Counting Distinct Elements
• Consider a stream of elements ai
…, a2, a84, a41, a2, a77, a231, a2, a4, a54, …
• How to compute/estimate the number of distinct elements observed?
Application
• Streams (one pass, little memory footprint)
• Distributed systems: compact data exchange (recall Bloom filter)
• But also counting in Database Systems
• Sketches for partial data can often be merged for a global view
[Diagram: two sketches are combined. Efficient counting and comparing: how many distinct elements altogether?]
Flajolet Martin (FM) Sketch (aka. Hash Sketch)
• Allocate a bitvector B of size m = log2(N)
• Hash items to bitvector positions:
– Hash each item i to an m-bit number h(i)
– Compute position k of the least-significant “1” of h(i)
– Set the bit B[k] to “1”
Given data stream 17, 5, 19, 211, 17, 5, 31
h(17) = 010100, then least-sig. 1 bit = 2
h(5) = 000101, then least-sig. 1 bit = 0
...
Proposed originally by Flajolet and Martin in 1985
FM-Sketch: Estimator
• In the end B might look like this: B = 111010 (B[0] is the left-most bit as written)
• Then get the position t of the left-most “0” bit of B
• Count-distinct estimate of the real distinct number n: $\hat{n} = 2^t / \varphi$ with $\varphi \approx 0.77351$
here, with t = 3: $\hat{n} = 2^3 / 0.77351 \approx 10.34$
• Improvement:
– If you use more bitmaps and compute an average position t, you can improve the count-distinct estimate
Note: Be careful with left-most bit vs. least-significant bit; it depends on the interpretation of the bits
FM Sketch: Intuition/Idea
• B[0] is set approximately n/2 times
• B[1] is set approximately n/4 times
• B[i] = 0 if i >> log2(n)
• B[i] = 1 if i<<log2(n)
• “Mix” of 1s and 0s around i≈log2(n)
• Use the left-most zero as an indicator for log2(n):
$n \approx 2^{\text{position of left-most zero bit}}$
FM-Sketch: Union
• Given: two multisets S and T and their sketches $B_S$ and $B_T$ of size m
• Then: the sketch $B_S \lor B_T$ (bitwise OR) is the sketch of $S \cup T$
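The FM sketch, its estimator, and the union can be put together in a short sketch. Several things here are assumptions for illustration: the class name `FMSketch`, the use of MD5 as hash function, storing B as a Python integer (bit k of the integer plays the role of B[k]), and the correction constant 0.77351, which comes from the Flajolet-Martin paper.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin (1985)


def lsb_position(h):
    """Position of the least-significant 1 bit of h (h must be > 0)."""
    return (h & -h).bit_length() - 1


class FMSketch:
    def __init__(self, m=32):
        self.m = m
        self.bits = 0  # bit k of this integer is B[k]

    def _hash(self, item):
        digest = hashlib.md5(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") % (1 << self.m)

    def add(self, item):
        h = self._hash(item)
        # Convention chosen here: an all-zero hash sets the last position.
        k = lsb_position(h) if h else self.m - 1
        self.bits |= 1 << k

    def first_zero(self):
        """Position t of the first 0 bit, scanning from B[0] upwards."""
        t = 0
        while (self.bits >> t) & 1:
            t += 1
        return t

    def estimate(self):
        return 2 ** self.first_zero() / PHI

    def union(self, other):
        merged = FMSketch(self.m)
        merged.bits = self.bits | other.bits  # bitwise OR of both bitvectors
        return merged


# The bitvector B = 111010 from the estimator slide: B[0], B[1], B[2], B[4] set.
slide = FMSketch(m=6)
slide.bits = 0b010111
print(slide.first_zero(), round(slide.estimate(), 2))
```

Because duplicates hash to the same position, inserting an item twice does not change B, and OR-ing two sketches gives exactly the sketch that a single pass over both inputs would have produced.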
K-Min Value (KMV) Synopsis [Beyer et al., 2007]
• Given set S of values. Want number of distinct elements n := D(S) (notation)
• Hashing outputs values uniformly in [0,1]
• KMV synopsis is the ordered set of the k smallest hash values (k-min values): $L = \{U_{(1)}, U_{(2)}, \ldots, U_{(k)}\}$
• Unbiased estimator: $\hat{n} = (k-1)/U_{(k)}$
– Exact error analysis based on theory of order statistics
– Asymptotically optimal as k becomes large
Slide based on PPT slides from Beyer et al. ‘07
K-Min Value (KMV) Synopsis (Cont’d)
• KMV synopsis: $L = \{U_{(1)}, U_{(2)}, \ldots, U_{(k)}\}$, the k-min values
• Hashing = dropping the n DVs uniformly on [0,1]
• Distance between adjacent hash values is about 1/n (assuming uniformity)
• We are interested in knowing the value of n
• Why does this work? Natural derivation of the basic estimator:
Observation: $E[U_{(k)}] \approx k/n$. Thus: $\hat{n} = k/U_{(k)}$
• This is a biased estimator. Using (k-1) instead of k renders it unbiased
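A KMV synopsis can be maintained over a stream with memory for only k hash values. The sketch below is one possible way to do this; the class name `KMV`, the helper `hash01`, the heap-based bookkeeping, and the fallback to an exact count when fewer than k distinct values were seen are all assumptions made for this illustration.

```python
import hashlib
import heapq


def hash01(item):
    """Map an item to a pseudo-uniform value in [0, 1] via MD5."""
    digest = hashlib.md5(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMV:
    def __init__(self, k):
        self.k = k
        self._heap = []        # negated values: heap top is the largest kept hash
        self._members = set()  # hash values currently in the synopsis

    def add(self, item):
        u = hash01(item)
        if u in self._members:
            return  # duplicate item: same hash value, nothing changes
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -u)
            self._members.add(u)
        elif u < -self._heap[0]:
            # u belongs to the k smallest: replace the current largest value.
            evicted = -heapq.heappushpop(self._heap, -u)
            self._members.discard(evicted)
            self._members.add(u)

    def values(self):
        return sorted(-v for v in self._heap)

    def estimate(self):
        vals = self.values()
        if len(vals) < self.k:
            return float(len(vals))  # fewer than k distinct values: exact count
        return (self.k - 1) / vals[-1]  # unbiased estimator (k-1)/U(k)


sketch = KMV(k=64)
for i in range(5000):
    sketch.add(i % 1000)  # 1000 distinct values, each seen five times
print(round(sketch.estimate()))
```

Since equal items always hash to the same value, repetitions never enter the synopsis twice, which is why the estimate counts distinct elements rather than stream length.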
KMV Example
• Implemented hash function using MD5 (and mapped to [0,1]). Items are characters a-z.
The sorted (ascending) hash values for the 26 characters a-z are:
[0.043, 0.172, 0.281, 0.354, 0.382, 0.421, 0.443, 0.459, 0.463, 0.523, 0.556, 0.565, 0.569, 0.57, 0.59, 0.644, 0.652, 0.675, 0.682, 0.724, 0.818, 0.864, 0.89, 0.938, 0.994, 0.997]
k | (k-1)/U(k) | U(k)
1 | 0.000 | 0.043
2 | 5.814 | 0.172
3 | 7.117 | 0.281
4 | 8.475 | 0.354
5 | 10.471 | 0.382
6 | 11.876 | 0.421
7 | 13.544 | 0.443
8 | 15.251 | 0.459
9 | 17.279 | 0.463
10 | 17.208 | 0.523
11 | 17.986 | 0.556
12 | 19.469 | 0.565
13 | 21.090 | 0.569
14 | 22.807 | 0.570
15 | 23.729 | 0.590
16 | 23.292 | 0.644
17 | 24.540 | 0.652
18 | 25.185 | 0.675
19 | 26.393 | 0.682
20 | 26.243 | 0.724
21 | 24.450 | 0.818
• In general, why are distinct elements counted?
• What if we have one KMV sketch for a multiset A and one for multiset B? Can we combine them?
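The estimate column of the example can be recomputed from the listed hash values. The short sketch below assumes the estimate is (k-1)/U(k), which matches the tabulated numbers; the function name `kmv_estimate` is made up for this illustration.

```python
# Sorted hash values for the 26 characters a-z, as listed in the example.
U = [0.043, 0.172, 0.281, 0.354, 0.382, 0.421, 0.443, 0.459, 0.463,
     0.523, 0.556, 0.565, 0.569, 0.57, 0.59, 0.644, 0.652, 0.675,
     0.682, 0.724, 0.818, 0.864, 0.89, 0.938, 0.994, 0.997]


def kmv_estimate(sorted_hashes, k):
    """Unbiased KMV estimate (k-1)/U(k); U(k) is the k-th smallest hash value."""
    return (k - 1) / sorted_hashes[k - 1]


for k in (2, 10, 21):
    print(k, round(kmv_estimate(U, k), 3))
```

The estimates fluctuate for small k and move towards the true value of 26 distinct items as k grows, at the price of a larger synopsis.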
(Multiset) Union of Partitions
• Take the union of the values and consider again the k smallest ones
• Combine KMV synopses: $L = L_A \oplus L_B$
• Theorem: L is a KMV synopsis of $A \cup B$
[Figure: the k-min values of $L_A$ and $L_B$ on [0,1] are merged into L, keeping the k smallest values up to U(k)]
(Multiset) Intersection of Partitions
• $L = L_A \oplus L_B$ as before (union): contains k elements
– L corresponds to a uniform random sample of DVs in $A \cup B$
• K = # values in L that are also in $D(A \cap B)$. Theorem: K can be computed from $L_A$ and $L_B$ alone
• K/k estimates the Jaccard coefficient: $K/k$ estimates $D(A \cap B) / D(A \cup B)$
• Unbiased estimator of #DVs in the intersection: $\hat{D}_\cap = \frac{K}{k} \cdot \frac{k-1}{U_{(k)}}$
D(set) = distinct values
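Union and intersection can be tried out on two synthetic sets. This is a rough sketch under stated assumptions: `hash01`, `synopsis`, `kmv_union`, and `kmv_intersection` are names invented here, MD5 stands in for the hash function, and K is computed as the number of values of L that occur in both synopses, following Beyer et al.

```python
import hashlib


def hash01(item):
    """Map an item to a pseudo-uniform value in [0, 1] via MD5."""
    digest = hashlib.md5(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def synopsis(items, k):
    """KMV synopsis: the k smallest distinct hash values, ascending."""
    return sorted({hash01(x) for x in items})[:k]


def kmv_union(la, lb):
    """Combine two synopses: k smallest values of their union, k = min size."""
    k = min(len(la), len(lb))
    return sorted(set(la) | set(lb))[:k]


def kmv_intersection(la, lb):
    """Return (Jaccard estimate K/k, estimated #DVs in the intersection)."""
    l = kmv_union(la, lb)
    k = len(l)
    in_both = set(la) & set(lb)
    K = sum(1 for v in l if v in in_both)  # values of L present in both synopses
    jaccard = K / k
    return jaccard, jaccard * (k - 1) / l[-1]


A = range(0, 3000)     # 3000 distinct values
B = range(2000, 5000)  # 3000 distinct values, 1000 shared with A
LA, LB = synopsis(A, 512), synopsis(B, 512)
jaccard, d_intersection = kmv_intersection(LA, LB)
print(round(jaccard, 3), round(d_intersection))  # true values: 0.2 and 1000
```

Only the two synopses are needed for both operations, so partitions can be summarized independently and combined later without another pass over the data.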
Literature
• Graham Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1): 58-75 (2005)
• Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)
• Kevin S. Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, Rainer Gemulla: On synopses for distinct-value estimation under multiset operations. SIGMOD Conference 2007: 199-210