Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases...

Data Stream Management

Systems

Tore Risch

Dept. of information technology

Uppsala University

Sweden

2016-12-05

New applications

• Data comes as huge data streams, e.g.

- Satellite data

- Scientific instruments

- Colliders

- Industrial equipment

- Patient monitoring

- Stock data

- Traffic monitoring

Too much data to store on disk

Would like to query data directly in the streams

DataBase Management

Systems, DBMS

e.g. Oracle, MySQL, SQL Server

Query Processor

Data Manager

Meta –

Dynamic SQL Queries

Stored

DataStream Management

Systems, DSMS

e.g. Tipco Streambase

Query Processor

Stream ManagerData

streams

Meta –

Dynamic Continuous Queries (CQs)

streams RDB

Data managers, key/value

stores

Data manager

Java/C++/JavaScript programs

Stored

e.g. BerkeleyDB, Riak, Cassandra, DynamoDB

Data Stream Managers

e.g. Storm, Kinesis, Stream Mill, Flink

streams

Data stream

manager

Java/C++/JavaScript programs

sa.engine Stream Analythics

Engine

Continuous queries over streams

from expresswaysSchema for stream CarLocStr of tuples:

CarLocStr(car_id, /* unique car identifier */

speed, /* speed of the car */

exp_way, /* expressway: 0..10 */

lane, /* lane: 0,1,2,3 */

dir, /* direction: 0(east), 1(west) */

x-pos); /* coordinate in express way */

Continuous query expressed in the query language CQL (Stanford):

Continuously get the cars in a window of the stream every 30 seconds:

SELECT DISTINCT car_id

FROM CarLocStr [RANGE 30 SECONDS]

Continiuously compute the average speed of vehicles per expressway, direction, segment each 5 minutes:

SELECT exp_way, dir, seg, AVG(speed) as speed,

FROM CarSegStr [RANGE 5 MINUTES]

GROUP BY exp_way, dir, seg

DSMSs vs. Relational Databases• Streams ordered

– Regular relational DBs are based on sets and bags

• The order of the result of a relational (sub-query) is insignificant

• Used in lots of query optimization methods

– Must preserve order when result is stream of tuples

• Streams potentially infinite in size– Regular DBs based on queries to finite tables

– A stream must normally be accessed in one pass

– A regular relational join (e.g. nested-loop join) may scan tablesseveral times

• NLJ impossible over streams

• Merge Join (MJ) possible in stream (tuple arrival) order only

– Cannot materialize (store) intermediate results

• Regular hash join impossible over streams

Different query processing for streams

DSMSs vs. relational databases• Often very high stream data volume and rate

– Regular DBs much less demanding

– Regular database are optimized for high-watermark

• Once loaded it does not change in size very fast

– Streams may have extreme insert and delete rates as tuples flow through the DSMS

• Real-time delivery, Quality of Service– Regular databases also have QoS requirements, but slower!

• E.g. guaranteed average response time of 1s (soft real-time)

– If a DSMS does not keep up with the flow some actions have to be made

• E.g. skipping elements, shedding

DSMSs vs. relational databases• Stream queries are continuous

– Regular DB queries passive

• Client send query – server replies

– Queries over stream run continuously until they are stopped

Continuous queries, CQs

• Active databases (triggers) are related, but:– Active database are very side effect oriented (very procedural)

– The action in a trigger typically updates the database

– Triggers activated by database state changes

• Continuous queries are non-pocedural– CQs can be seen as side effect free functions (algebra) over

streams

– Database updates are handled as streams insering tuples intosome database (after sufficient data reduction by CQ filters)

Continuous queries• CQs are turned on and run until stop condition becomes

true or when they are explicitly terminated– Regular queries finsish when demanded result data delivered

• CQs often based on time stamps (logs) of stream elements, i.e. streams contain temporal data– Regular queries often not temporal

– Temporal databases provides queries over time stamped stored data

• Non-monotone operators in CQs (e.g. grouping and sorting) must be made over stream windows– A stream window is a bounded stream segment

– By contrast, regular DBMS queries specified over entire tables

A simple example

peakdetect(winagg(readstream(“SIG2"),256,256));

Stream windows• Limited size section of stream stored temporarily in

database– A window is regarded as a transient stored regular database

• Need window operators to chop stream into segments– as [RANGE 30 SECONDS] in CQL

• Window size based on:– A time window

• E.g. elements last 30 seconds

– i.e. a window contains all event processed during the last second

– Number of elements, a counting window

• E.g. last 10 elements

– i.e. windows has fixed size of 10 elements

– A landmark window

• All events from time t0 belongs to window

– c.f. log file

Stream windows

• Windows also have stride– Rule for how fast they move forward,

• E.g. 256 elements for a 256 element counting windowor 30s for a 30s time window

=> A tumbling window

• E.g. 128 elements for a 256 element counting windowor 10s for a 30s time window

=> A sliding windows

• The CQ language needs window operators to collect stream tuples into stream of windows of tuples– Stride and window size are parameters

Window joins

• Join over temporary data in window– Can be full join with aggregate functions (group-by)

• Window join operators become approximate– Since only windows of elements are joined rather than entire

stream

• In a relational database all elements in tables are matched in a join

– Often approximate summaries of streaming data maintained

• E.g. moving average, standard deviation, max, min

• The summaries also become dynamic streams!

Continuous query languages

• CQL– The classical CQ language developed by Stanford CSD

– Minimal extension to SQL by streams over tuples

– FROM clause is augumented with a […] construct to specify streams and windows:

SELECT DISTINCT car_id

FROM CarLocStr [RANGE 30 SECONDS];

SELECT exp_way, dir, seg, AVG(speed) as speed,

FROM CarSegStr [RANGE 5 MINUTES]

GROUP BY exp_way, dir, seg;

– [now] indicates no window:SELECT car_id, speed

FROM CarLocStr [now]

Continuous query languages

• There is no standard CQ language

• Most DSMS use variants of CQL– E.g. Tipco StreamBase, Azure Stream Analythics, SQL Stream

Variables always bound to rows in streams, tuple algebra andqueries over tuples

Syntactic sugar added to handle windows and sequences

Numerical models must be expressed in terms of rows, which many look unnatural for numerical models

• sa.engine (and sa.amos) based on domain algebra and queriesVariables bound to objects rather than rows

• For example, number, vectors, arrays, strings etc.

• Mathematical models easy and natural to express algebraically

Combines regular mathematical algebra with stream algebra, relational algebra, and object-oriented queries

Parallel DSMSs

• Classical DSMSs execute on one computer– Can be many threads (multi-core)

• The streams contain rather low volume events– Rather slow reading from sensors

• Sensor networks

• Smart dust

• Typically relatively simple processing (conditions) over events– E.g. thresholds

• Temperature > 100 and pressure < 200

• CQ typically makes data reduction by source filtering

• Could still be substantial data volumes at the data sink the receiver of the stream.– Collect all event into a relational database for SQL analyses

Parallel DSMSs

• New applications for DSMS require

– Advanced queries and computations on stream elements, e.g.:

• E.g. applying advanced statistical operators over windows

• Complex numerical computations (FFT, PCA) and other mathematical models

• Advanced pattern matching and AI techniques

– Scalability over both data volume and computations

• Solution

– Parallel DSMS• Computations made in in parallel

Well known DSMS SystemsSTREAM (Stanford):StreaMon: Baby & Widom: An Adaptive Engine for

Stream Query Processing, SIGMOD 2004, Classical, as DSMSprototype, defines CQL. Used by SQLStream product.

Aurora (Brown,MIT,Brandeis): Carney et al: Monitoring Streams – A New Class of Data ManagemBnt Applications, VLDB 2003. Classical, a classical DSMS, now Tipco Streambase product.

STORM (http://storm.incubator.apache.org/): Data stream manager developed by Twitter. Stream processing engine where programmers specify communicating distributed computations in Java (Scala, Clojure). Program library, no query language.

Kinesis (https://aws.amazon.com/kinesis/streams/): Amazon’s data stream manager.

Flink (https://flink.apache.org/): A data stream manager. CQL-similar query language being developed (https://calcite.apache.org/).

Borealis (Brown & Brandeis): Ahmad et al: StreaMon: An Adaptive Engine for Stream Query Processing, SIGMOD 2005, first parallell DSMS prototype.

Azure Stream Analytics (https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction), Microsoft’s DSMS, own CQL dialect.

sa.engine (SCSQ++)• The world’s fastest DSMS 2010-now

• Measured with DSMS benchmark Linear Road: http://www.cs.brandeis.edu/~linearroad/

• Processes demanding computations on

streaming data at network speed

• Performance obtained by non-trivial

parallelization on clusters

• See: http://user.it.uu.se/~udbl/SCSQ.html

• SVALI: SCSQ extended with features for

analyses of streams from industrial equipment:http://link.springer.com/article/10.1007/s10844-016-0427-2

Performance

2,5 2,5 564

Best timely response to compute intense

queries and analyses over high data stream

rates, www.cs.brandeis.edu/~linearroad:

sa.engine

How: Massively parallel stream query processing

Use case: fleet management

and analysis

Communicated

data streams

sa.enginesa.engine

sa.engine

Central

analytics

Equipment Monitoring centers Analysts

Stream Analysis and

Query Language SAQL

R/Matlab:

3 000 000 users

Hadoop

100 000

Empowers analysts to both implement and deploy

both analytical models and queries

by combining regular algebra and stream algebra in

very high-level queries

Automatic generation of (distributed) optimized execution

plans ensures efficiency and scalability

Existing algorithms in e.g. C or Java can be

reused in queries.

The sa.engine approch

Real-time query and analysis of streaming data

End user analytics on very high level by query and analysis

language

Capability of handling compute intense analyses

over massive data streams in real-time (AI, learning, prediction)

Light-weight implementation enables edge analytics

on embedded IoT systems and mobile devices

Plug-in architecture where existing algorithms and indexes

can easily be reused

Overview paper

L. Golab and T. Özsu: Issues in Stream Data Management, SIGMOD Records, 32(2), June 2003, http://www.acm.org/sigmod/record/issues/0306/1.golab-ozsu1.pdf

Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases...

Documents

Time-value curves to provide dynamic QoS for time sensitive file transfer

10 Supporting Time-Based QoS Requirements in Software ... · Supporting time-based QoS requirements in Software Transactional Memory 10:3 the adaptive transaction mode and with the

Real-time Detection and Containment of Network Attacks using QoS Regulation

Multiple constraints QoS Routing Given: - a (real time) connection request with specified QoS requirements (e.g., Bdw, Delay, Jitter, packet loss, path

Real-Time HSPA Emulator for End-to-Edge QoS Evaluation in

Sticky CSMA/CA: Implicit synchronization and real-time QoS in … · 2009-02-19 · Sticky CSMA/CA: Implicit synchronization and real-time QoS in mesh networks Sumit Singh a,*, Prashanth

Sticky CSMA/CA: Implicit Synchronization and Real-time QoS ...ebelding/courses/284/papers/singh_stickycsma.pdf · Sticky CSMA/CA: Implicit Synchronization and Real-time QoS in Mesh

1 evaluation of Time-kill curve analysis and … strain that had a slightly slower growth. 172 Time-kill curves. Pharmacodynamic functions

Real-Time Demonstrator for QoS testing in 3G/4G mobile

Real-Time Demonstrator for QoS testing in 3G/4G mobile ...research.ac.upc.edu/EW2004/papers/127.pdf · Real-Time Demonstrator for QoS testing in 3G/4G mobile networks with IP multimedia

RLT300 VoIP-QoS-Solution Brief Jan2011 Web - Tone … nEricsson nHP nIBM ... ReliaTel triggers alarms in real time based on ... RLT300_VoIP-QoS-Solution_Brief_Jan2011_Web.cdr Author:

Quality of Service(QoS). Outline Why QoS is important? What is QoS? QoS approach. Conclusion

QoS-enabled Component Middleware for Distributed Real-Time and

Provisioning Dynamic QoS Adaptation for Enterprise Distributed Real-time Embedded (DRE) Systems

Regressive admission control enabled by real time qos measurements

Data Stream Management Systems - Uppsala University · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed

QoS Routing Using Traffic Forecast - A Case Study of Time-Dependent Routing

QoS-driven Lifecycle Management of Service-oriented Distributed Real-time & Embedded Systems QoS-driven Lifecycle Management of Service-oriented Distributed

Model Driven Engineering for QoS Management in Distributed Real-time & Embedded Systems

Even slower