PIPES: A Resource Adaptive Data Stream Management System

Bernhard Seeger, Philipps-University Marburg, Germany

Research supported by the German Research Society (DFG) grant Se 553/4-2

2

Information Landscape

[Figure: information landscape. Labels: Input, Output, DSMS, several DBMS instances, several File System instances.]

3

Outline

Motivation and problem definition

Sliding Windows

Query Processing in PIPES

Data Stream Model

Logical Operators

Algebraic Query Optimization

Physical Operators

Runtime Environment

Dynamic Plan Migration

Conclusions

4

Example Application

Traffic monitoring

Data format: HighwayStream( lane, speed, length, timestamp )

Continuous dataflow streams

Variable stream rates

Time + location dependence

Queries: continuous, long-running

“At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”

5

Data Streams

Continuously arriving sequence of records: time as an integral component

Autonomous data sources: sensors, mobile devices, software agents, …

Important type of data: miniaturization of hardware, ubiquitous networks

[Figure: a data stream drawn as a sequence of elements arriving over time]

6

Requirements

Declarative Query Language

Expressive like (Temporal) SQL: join of data streams according to time; combination of data streams with persistent databases

assigns meaning to data

query results as a data stream

Publish/Subscribe Paradigm

Subscribe: users register new queries

Publish: continuous report of results

Quality of Service (QoS), e.g. at least one record per second

Scalability: number of data sources, number of subscribed queries

7

Stream Query Processing

Similar to Traditional DBMS

1. Queries expressed in CQL: SQL-like query language

2. Logical Query Plan: algebra with "relational" operators

3. Query Optimization: algebraic rules; simple, but accurate cost model

4. Physical Query Plan: select physical operators

5. Processing of the Query

8

What is special about PIPES?

PIPES provides an Infrastructure for DSMS

DSMS = Data Stream Management System

PIPES = Public Infrastructure for Processing and Exploring Data Streams

Differences to DBMS

Semantics is borrowed from Temporal Databases: expressiveness, query optimization

Data-Driven Query Processing: publish/subscribe

Adaptive Runtime Environment: dynamic assignment of resources at runtime (scalability, QoS)

Continuous Optimization of Queries: plan migration (scalability, QoS)

10

2. Sliding Windows

Requirement of Users: no impact of outdated data on the result; integration of different streams according to time

Moving Temporal Windows: finite subsequence of an infinite stream; query processing is restricted to the most recent data

Important for expressive and efficient query processing

Options

Count-based windows: FIFO queue of size w

Time-based windows: t is the time stamp of an element; t + w + 1 is the end of the validity of an element

11

Problem: Determinism

Data-driven Processing

Count-based Windows: w = 2

Non-Determinism: result of a query depends on scheduling

Example: Symmetric Join

[Figure: symmetric join of stream A (a1, a2, a3) and stream B (b1, b2, b3) with count-based windows of size 2. The animation shows two different result sequences for the same inputs, depending on the scheduling: a3b1, a3b2, a2b3, a3b3 and a1b3, a2b3, a3b2, a3b3.]
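The scheduling dependence of count-based windows can be reproduced with a small sketch. It assumes a symmetric join without a join predicate (a plain cross product), so its output differs from the result sequences on the slide; the function name `symmetric_join` and the two schedules are illustrative choices, not part of PIPES.

```python
from collections import deque

def symmetric_join(schedule, w=2):
    """Symmetric cross join with count-based windows of size w.

    `schedule` is the order in which the operator receives elements,
    given as (stream, value) pairs with stream in {"A", "B"}.
    """
    windows = {"A": deque(maxlen=w), "B": deque(maxlen=w)}
    results = []
    for stream, value in schedule:
        other = "B" if stream == "A" else "A"
        # Probe the window of the opposite input first ...
        for partner in windows[other]:
            a, b = (value, partner) if stream == "A" else (partner, value)
            results.append(a + b)
        # ... then insert into the own window (oldest element is evicted).
        windows[stream].append(value)
    return results

a = [("A", "a1"), ("A", "a2"), ("A", "a3")]
b = [("B", "b1"), ("B", "b2"), ("B", "b3")]

# Schedule 1: all of A is processed before any element of B.
a_first = symmetric_join(a + b)
# Schedule 2: elements of A and B are processed alternately.
interleaved = symmetric_join([x for pair in zip(a, b) for x in pair])

print(a_first)
print(interleaved)
```

Both runs see the same six input elements, yet the result sets differ because the window contents at probe time depend on the arrival order.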

12

Temporal Windows in CQL

SELECT sectionID
FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID
       FROM HighwayStream1 [Range 15 minutes]
       UNION ALL
       …
       UNION ALL
       SELECT AVG(speed) AS avgSpeed, 20 AS sectionID
       FROM HighwayStream20 [Range 15 minutes] )
WHERE avgSpeed < 15;

“At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”

14

3. Query Processing in PIPES

Data Stream Model

Input Streams: autonomous sources

Logical Streams: semantics

Physical Streams: implementation of the semantics, but more expressive

15

Input Streams

Sequence of Records

Arbitrary, but fixed schema

No limitation to the relational model

Records with timestamps

Temporally ordered

Schema: HighwayStream( short lane, float speed, float length, Timestamp timestamp )

Input Stream:

(5; 18.28; 5.27; 5:00:08)

(2; 21.33; 4.62; 5:01:32)

(4; 19.69; 9.97; 5:02:16)

16

Physical Stream

PIPES: Time Intervals instead of Points

Validity of an element e

Processing of e restricted to its time interval

Removal of invalid records

Sequence of tuples (e, [tS, tE))

Ordered by tS and tE

((5; 18.28; 5.27; 5:00:08), [5:00:08, 5:00:09))

((2; 21.33; 4.62; 5:01:32), [5:01:32, 5:01:33))

((4; 19.69; 9.97; 5:02:16), [5:02:16, 5:02:17))

Transformation: input stream → physical stream
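A minimal sketch of this transformation, assuming timestamps are integer seconds since midnight and the time granularity is one second, as the example intervals suggest. The names `hms` and `to_physical` are hypothetical helpers, not PIPES API.

```python
def hms(h, m, s):
    """Seconds since midnight for a clock time (illustrative helper)."""
    return h * 3600 + m * 60 + s

def to_physical(input_stream, granularity=1):
    """Map each timestamped record to (record, (tS, tE)).

    The record's timestamp t becomes the half-open validity
    interval [t, t + granularity).
    """
    for record in input_stream:
        ts = record[-1]  # assumption: the timestamp is the last field
        yield (record, (ts, ts + granularity))

input_stream = [
    (5, 18.28, 5.27, hms(5, 0, 8)),
    (2, 21.33, 4.62, hms(5, 1, 32)),
    (4, 19.69, 9.97, hms(5, 2, 16)),
]
physical = list(to_physical(input_stream))
```

Each record keeps its payload and gains a validity interval of exactly one time unit, matching the three example tuples.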

17

Data Stream Operators

Window Operator

Relational Operators: "relational" algebra on data streams

projection

selection

Cartesian product

union

difference

temporal extension of operators

18

Window Operator

Purpose: extension of the validity of an element by w time units

Overlap of windows of elements: elements need to be processed together

Window: w = 15 minutes

(e1, [5:00:08, 5:15:09))

(e2, [5:01:32, 5:16:33))

(e3, [5:02:16, 5:17:17))

[Figure: sliding window of 15 minutes; an element with start time tS is valid until tS + 1 + w, an interval of length w + 1]
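The window operator can be sketched as a one-line interval extension. This assumes physical-stream elements of the form (e, (tS, tE)) with integer seconds since midnight; `hms` and `window` are illustrative names.

```python
def hms(h, m, s):
    """Seconds since midnight for a clock time (illustrative helper)."""
    return h * 3600 + m * 60 + s

def window(physical_stream, w):
    """Extend the validity of every element by w time units."""
    for element, (ts, te) in physical_stream:
        yield (element, (ts, te + w))

physical = [
    ("e1", (hms(5, 0, 8), hms(5, 0, 9))),
    ("e2", (hms(5, 1, 32), hms(5, 1, 33))),
    ("e3", (hms(5, 2, 16), hms(5, 2, 17))),
]
windowed = list(window(physical, w=15 * 60))
```

With w = 15 minutes, e1 becomes valid on [5:00:08, 5:15:09), which is tS + 1 + w as stated for time-based windows.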

19

Relational Stream Operators

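Snapshot-Reducibility

Snapshot: mapping of a physical stream to a non-temporal relation. The relation comprises all elements valid at point t.

[Figure: commuting diagram. Taking snapshots at time t maps the input streams S1, …, Sn to relations R1, …, Rn. The relational stream operator maps S1, …, Sn to Sout, the relational operator maps R1, …, Rn to Rout, and the snapshot of Sout at t equals Rout.]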
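Snapshot-reducibility can be checked mechanically for a simple operator. The sketch below uses a selection as the example and assumes integer time and elements of the form (e, (tS, tE)); all function names are hypothetical.

```python
def snapshot(physical_stream, t):
    """Non-temporal relation of all elements valid at time t."""
    return [e for e, (ts, te) in physical_stream if ts <= t < te]

def stream_filter(physical_stream, predicate):
    """Selection on a physical stream: intervals are left untouched."""
    return [(e, iv) for e, iv in physical_stream if predicate(e)]

def relational_filter(relation, predicate):
    """Ordinary selection on a non-temporal relation."""
    return [e for e in relation if predicate(e)]

def is_snapshot_reducible(physical_stream, predicate, times):
    """Compare snapshot-after-operator with operator-after-snapshot."""
    return all(
        snapshot(stream_filter(physical_stream, predicate), t)
        == relational_filter(snapshot(physical_stream, t), predicate)
        for t in times
    )

stream = [(12.0, (0, 10)), (20.0, (5, 15)), (9.0, (8, 18))]
ok = is_snapshot_reducible(stream, lambda speed: speed < 15, range(0, 20))
```

For every time instant, filtering the stream and then taking the snapshot yields the same relation as taking the snapshot and then filtering it.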

20

Query Optimization

Application of Well-known Rules from Temporal Databases

Slivinskas, Jensen, Snodgrass: Query Plans for Conventional and Temporal Queries Involving Duplicates and Ordering (ICDE 2000)

many rules directly applicable to streams

conventional + temporal rules

Basis for Effective Query Optimization

21

Steps

1) Query

2) Logical Query Plan

3) Query Optimization

4) Physical Query Plan

Query:

SELECT sectionID
FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID
       FROM HighwayStream1 [Range 15 minutes]
       UNION ALL
       …
       UNION ALL
       SELECT AVG(speed) AS avgSpeed, 20 AS sectionID
       FROM HighwayStream20 [Range 15 minutes] )
WHERE avgSpeed < 15;

“At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”

Operators of the plan, from the root down to the sources:

Map: projection on sectionID

Filter: avgSpeed < 15

Union: merge of data streams

Aggregation: average speed (avgSpeed)

Map: projection on speed, assigning sectionID

Window: w = 15 minutes
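To make the operator chain concrete, the following sketch evaluates the query at a single time instant t, that is, on one snapshot. It is a simplification and not how PIPES executes the plan: records are reduced to (timestamp in seconds, speed) pairs, and the names `avg_speed_at` and `slow_sections` are invented for illustration.

```python
from statistics import mean

WINDOW = 15 * 60  # window size w in seconds

def avg_speed_at(stream, t, w=WINDOW):
    """Window + aggregation: average speed of all records valid at t.

    A record with timestamp ts is valid on [ts, ts + 1 + w).
    """
    speeds = [speed for ts, speed in stream if ts <= t < ts + 1 + w]
    return mean(speeds) if speeds else None

def slow_sections(streams, t, threshold=15.0):
    """Union over all sections, filter on avgSpeed, projection on sectionID."""
    result = []
    for section_id, stream in sorted(streams.items()):
        avg = avg_speed_at(stream, t)
        if avg is not None and avg < threshold:
            result.append(section_id)
    return result

streams = {
    1: [(0, 10.0), (60, 12.0)],  # slow traffic
    2: [(0, 30.0)],              # free-flowing traffic
}
at_two_minutes = slow_sections(streams, t=120)
after_expiry = slow_sections(streams, t=2000)
```

At t = 120 s only section 1 qualifies; at t = 2000 s all records have left their windows, so no section is reported.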

22

Physical Operators

Stateless Operators

Processing of an element is independent of the previous ones.

Examples: filter, map

Stateful Operators

Processing of an element depends on previous elements

Restricted to elements in the sliding window

Explicit management of status

Examples: join, aggregation

23

Data-driven Joins

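Input streams A and B and sliding window of size w

Join predicate P

Output records ((a,b), [tS,tE)) where

P(a,b) holds

the intervals of a and b overlap

[Figure: elements a and b with overlapping validity intervals; the result (a,b) is valid on the overlap [tS, tE)]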

24

Methodology

Adaptation of the Sweepline Technique

tA = start time of the last element of A

tB = start time of the last element of B

Status for each input

Status of A: elements of A with end time ≥ tB

Status of B: elements of B with end time ≥ tA

Continuous Processing

[Figure: inputs A and B with StatusA and StatusB; each arriving element is inserted into the status of its own input, and the status of the other input is probed and reorganised]
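The status handling can be sketched as follows. This is an illustrative reading of the slide, not the PIPES implementation: the status is a plain list rather than an efficient index, each input is assumed to arrive ordered by start time, and the class and field names are invented.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Element:
    value: Any
    ts: int  # start of validity (inclusive)
    te: int  # end of validity (exclusive)

class SweeplineJoin:
    def __init__(self, predicate):
        self.predicate = predicate
        self.status = {"A": [], "B": []}

    def process(self, side, elem):
        """Handle one arriving element of input `side` ("A" or "B")."""
        other = "B" if side == "A" else "A"
        # Reorganisation: keep only elements of the other input whose
        # end time is >= the start time of the newest element.
        self.status[other] = [o for o in self.status[other] if o.te >= elem.ts]
        # Probing: emit results for partners with overlapping intervals.
        results = []
        for o in self.status[other]:
            ts, te = max(elem.ts, o.ts), min(elem.te, o.te)
            a, b = (elem, o) if side == "A" else (o, elem)
            if ts < te and self.predicate(a.value, b.value):
                results.append(((a.value, b.value), (ts, te)))
        # Insertion into the status of the own input.
        self.status[side].append(elem)
        return results

join = SweeplineJoin(lambda a, b: a == b)
r1 = join.process("A", Element("x", 0, 10))
r2 = join.process("B", Element("x", 5, 15))
r3 = join.process("A", Element("x", 20, 30))
```

The second call produces one result valid on the overlap [5, 10); the third call finds no partner and purges the expired element from the status of B.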

25

Runtime Environment of PIPES

[Figure: the PIPES runtime environment, showing a query graph between sources and sinks]

27

4. Plan Migration

Re-Optimization of Query Plans at Runtime

Identification of poorly performing subgraphs in the query graph

Plan Migration

Substitution of the old plan by a new one

Requirements

Preservation of snapshot reducibility

Continuous production of results

Short migration time

28

Example

[Figure: query graph with sources R, S, T, U and sinks C1, C2]

29

Semantics Problems

Duplicates: parallel insertion of new elements into both plans

Loss of Results: exclusive insertion of new elements into the new plan

30

Approach in PIPES

Assumptions

Streams A and B

Window of length w

Equivalent query plans Pold and Pnew

Earliest split time: tsplit = max {tA, tB} + w

Splitting of the input at split time tsplit

31

Approach in PIPES

Production of Results

Acceptance of all results received from the old plan Pold

Selection of results received from the new plan Pnew: acceptance only if start time > tsplit

[Figure: inputs A and B each pass through a Split operator that feeds Pold and Pnew]
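The two rules for the split time and for result acceptance can be written down directly. The sketch assumes results of the form (value, (tS, tE)) with integer time; it covers only the acceptance logic, not the Split operator itself, and the function names are hypothetical.

```python
def split_time(t_a, t_b, w):
    """Earliest split time: tsplit = max{tA, tB} + w."""
    return max(t_a, t_b) + w

def merge_results(old_results, new_results, t_split):
    """Accept every result of the old plan, and a result of the new
    plan only if its start time is greater than tsplit."""
    accepted = list(old_results)
    accepted += [r for r in new_results if r[1][0] > t_split]
    return accepted

t_split = split_time(t_a=100, t_b=120, w=900)
old = [("x", (1000, 1010))]
new = [("y", (1015, 1030)), ("z", (1025, 1040))]
merged = merge_results(old, new, t_split)
```

Result "y" of the new plan starts before tsplit and is dropped, which is how this rule avoids reporting results that the old plan is still responsible for.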

32

Properties

Method is broadly applicable

Arbitrary plans

Many data streams

Different window sizes

Migration Time

Worst-case: w time units

34

5. Conclusions

Applications: traffic management, alarm systems, observation of production lines

Basic ideas of stream processing in PIPES: temporal databases, data-driven query processing, adaptivity at runtime, continuous optimization at runtime

Dynamic Plan Migration: broadly applicable approach

35

Current Work

Problems

Cost models for optimization: new techniques

Strategies for adaptation: memory, CPU, QoS

Runtime environment: realtime applications

Real applications for DSMS

Observation of patients in hospitals

Processing of sensor data

Coupling of PIPES and commercial products

36

Related Work

Abadi, Carney, Cetintemel et al.: Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120-139, 2003.

Arasu, Babu, and Widom: The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, Stanford University, 2003.

Tucker, Maier, Sheard, and Fegaras: Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowledge and Data Eng., 15(3):555-568, 2003.

Law, Wang, and Zaniolo: Query languages and data models for database sequences and data streams. In VLDB, pages 492-503, 2004.

37

Papers on PIPES/XXL

Michael Cammert, Jürgen Krämer, Bernhard Seeger, Sonny Vaupel: An Approach to Adaptive Memory Management in Data Stream Systems, will appear in Proc. ICDE 2006.

Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Sortierbasierte Joins über Datenströmen, BTW 2005, Karlsruhe, Germany, March 2-4.

Björn Blohsfeld, Christoph Heinz, Bernhard Seeger: Maintaining Nonparametric Estimators over Data Streams, BTW 2005, Karlsruhe, Germany, March 2-4.

Christoph Heinz, Bernhard Seeger: Wavelet Density Estimators over Data Streams (Extended Abstract), ACM Symposium on Applied Computing, Santa Fe, New Mexico, 2005.

Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Anfrageverarbeitung auf Datenströmen, Datenbank-Spektrum 11: 5-13 (2004).

Jürgen Krämer, Bernhard Seeger: PIPES–A Public Infrastructure for Processing and Exploring Data Streams. Proc. SIGMOD 2004 (Demo).

Jochen Van den Bercken, Björn Blohsfeld, Jens-Peter Dittrich, Jürgen Krämer, Tobias Schäfer, Martin Schneider, Bernhard Seeger: XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries, In Proc. of the Conf. on Very Large Databases (VLDB), 39-48, September 2001.

38

Future Work

Query optimization

Adequate cost model: not only stream rates; runtime statistics such as delays, memory usage, etc.

Static query optimization: multi-query optimization, subquery sharing

Dynamic query optimization: detection of suitable subgraphs, plan migration at runtime

Temporal aspects: coalesce

Thank you!

Any questions?

For more information check our website:

http://dbs.mathematik.uni-marburg.de/Home/Research/Projects/PIPES

40

Reorganization

Restriction of memory usage

Which elements can be discarded in internal data structures?

All elements where $t_E \le \min_j t_{S_j}$, with $t_{S_j}$ the latest start timestamp of input stream j

Why?

Ordering invariant: no temporal overlap with future stream elements

41

Aggregation

Incremental computation

Efficient implementation

Aggregation segment-tree

Amortized logarithmic costs per element

[Figure: example for the aggregate Sum. A new element is inserted into the current state (the aggregates), followed by a reorganization step.]

42

Outline

Motivation and problem definition

Query formulation

Our temporal approach: stream types, logical query plans, query optimization, physical query plans, query execution

Exploration of Data Streams

Conclusions

43

Exploration of Data Streams

Example

Estimation of selectivity during runtime of continuous range queries:

select * from Stream S where S.measure between min and max

Our Approach

Exploit the density p of the distribution

Represents all information about the distribution

Suitable for estimating the selectivity of multiple queries

Selectivity: $\int_{\min}^{\max} p(x)\,dx$

44

Requirement

Problem

Density is unknown

Adaptation of a non-parametric density estimation technique: kernels, wavelets, sampling and CDF

Requirements

Low resource consumption (memory and CPU)

Memory and CPU adaptive: increasing memory size → higher accuracy

Valid estimation at each point in time

Adapt to a changing distribution

45

Reservoir Sampling

CDF is built on top of the iid samples

Disadvantages

Estimation relies on a few elements

No advantage from increasing memory

Advantage

Low processing overhead

[Figure: a data stream (... 34 ... 5 ... 12 4 ... 27 ...) observed from position 0 to j, with the samples 12, 5, 27, 34, 4 held in main memory]
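A minimal sketch of a reservoir with a sample-based selectivity estimate. It uses the classic replacement scheme (Algorithm R); the slides do not say which variant PIPES uses, and the class name, the `seed` parameter and the `selectivity` method are assumptions made for illustration.

```python
import random

class Reservoir:
    """Fixed-size uniform sample over a stream of unknown length."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, x):
        self.seen += 1
        if len(self.samples) < self.capacity:
            # Fill phase: the first `capacity` elements are all kept.
            self.samples.append(x)
        else:
            # Keep x with probability capacity / seen, replacing a
            # uniformly chosen slot.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = x

    def selectivity(self, lo, hi):
        """Fraction of samples in [lo, hi], i.e. the empirical CDF
        evaluated over the query range."""
        if not self.samples:
            return 0.0
        hits = sum(1 for s in self.samples if lo <= s <= hi)
        return hits / len(self.samples)

reservoir = Reservoir(capacity=100, seed=0)
for value in range(10_000):
    reservoir.add(value)
estimate = reservoir.selectivity(0, 4_999)  # true selectivity is 0.5
```

The memory footprint stays at `capacity` elements regardless of stream length, which also illustrates the stated disadvantage: the estimate rests on only those few elements.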

46

Blockwise Estimation

Stream is transformed into blocks

For simplicity: blocks are of the same size

Idea

Estimation of the first k blocks is available

Compute the estimation of k+1 blocks iteratively

Example (Average)

$avg_{k+1} = \frac{k}{k+1}\, avg_k + \frac{1}{k+1}\, avg_{act}$

Generalization for density functions: straightforward extension

Problem: violates the requirement of limited memory
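The iterative update for the average can be sketched as below, assuming equally sized blocks and taking avg_act to be the average of the block that has just been completed. The function names are illustrative.

```python
def update_average(avg_k, k, avg_act):
    """avg_{k+1} = k/(k+1) * avg_k + 1/(k+1) * avg_act."""
    return (k * avg_k + avg_act) / (k + 1)

def blockwise_average(stream, block_size):
    """Consume the stream block by block and maintain the overall average.

    Returns the estimate and the number of completed blocks; elements of
    an incomplete trailing block are ignored in this sketch.
    """
    avg, k = 0.0, 0
    block = []
    for x in stream:
        block.append(x)
        if len(block) == block_size:
            avg = update_average(avg, k, sum(block) / block_size)
            k += 1
            block = []
    return avg, k

estimate, blocks = blockwise_average(range(1, 9), block_size=4)
```

For the values 1 to 8 in two blocks of four, the block averages 2.5 and 6.5 combine to the overall average 4.5.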

47

Cumulative-Compressed Estimation

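Compression

Cubic splines

Weighting strategies

Amortized cost for updates: O(log M)

$\hat{f}_{k+1}(x) = \mathrm{compress}\big((1-\omega_{k+1})\,\hat{f}_k(x) + \omega_{k+1}\,\hat{s}_{k+1}(x)\big)$, with the weight $\omega_{k+1}$ given by the weighting strategy

[Figure: the current estimator at time k is combined with the sample held in main memory (12, 5, 27, 34, 4) in the step from k to k+1]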

48

Experimental Comparison

Streaming data from a real traffic data set

Arithmetic weights

Memory size: 5000
