View
1
Download
0
Category
Preview:
Citation preview
Data Stream Management
Systems
Tore Risch
Dept. of information technology
Uppsala University
Sweden
2016-12-05
New applications
• Data comes as huge data streams, e.g.
- Satellite data
- Scientific instruments
- Colliders
- Industrial equipment
- Patient monitoring
- Stock data
- Traffic monitoring
Too much data to store on disk
Would like to query data directly in the streams
DataBase Management
Systems, DBMS
4
e.g. Oracle, MySQL, SQL Server
Query Processor
Data Manager
Meta –
data
DBMS
Dynamic SQL Queries
Stored
Data
DataStream Management
Systems, DSMS
5
e.g. Tipco Streambase
Query Processor
Stream ManagerData
streams
Meta –
data
DSMS
Dynamic Continuous Queries (CQs)
Data
streams RDB
Data managers, key/value
stores
6
Data manager
Java/C++/JavaScript programs
Stored
Data
e.g. BerkeleyDB, Riak, Cassandra, DynamoDB
Data Stream Managers
7
e.g. Storm, Kinesis, Stream Mill, Flink
Data
streams
Data
streams
Data stream
manager
Java/C++/JavaScript programs
sa.engine Stream Analythics
Engine
8
Continuous queries over streams
from expresswaysSchema for stream CarLocStr of tuples:
CarLocStr(car_id, /* unique car identifier */
speed, /* speed of the car */
exp_way, /* expressway: 0..10 */
lane, /* lane: 0,1,2,3 */
dir, /* direction: 0(east), 1(west) */
x-pos); /* coordinate in express way */
Continuous query expressed in the query language CQL (Stanford):
Continuously get the cars in a window of the stream every 30 seconds:
SELECT DISTINCT car_id
FROM CarLocStr [RANGE 30 SECONDS]
Continiuously compute the average speed of vehicles per expressway, direction, segment each 5 minutes:
SELECT exp_way, dir, seg, AVG(speed) as speed,
FROM CarSegStr [RANGE 5 MINUTES]
GROUP BY exp_way, dir, seg
DSMSs vs. Relational Databases• Streams ordered
– Regular relational DBs are based on sets and bags
• The order of the result of a relational (sub-query) is insignificant
• Used in lots of query optimization methods
– Must preserve order when result is stream of tuples
• Streams potentially infinite in size– Regular DBs based on queries to finite tables
– A stream must normally be accessed in one pass
– A regular relational join (e.g. nested-loop join) may scan tablesseveral times
• NLJ impossible over streams
• Merge Join (MJ) possible in stream (tuple arrival) order only
– Cannot materialize (store) intermediate results
• Regular hash join impossible over streams
Different query processing for streams
DSMSs vs. relational databases• Often very high stream data volume and rate
– Regular DBs much less demanding
– Regular database are optimized for high-watermark
• Once loaded it does not change in size very fast
– Streams may have extreme insert and delete rates as tuples flow through the DSMS
• Real-time delivery, Quality of Service– Regular databases also have QoS requirements, but slower!
• E.g. guaranteed average response time of 1s (soft real-time)
– If a DSMS does not keep up with the flow some actions have to be made
• E.g. skipping elements, shedding
DSMSs vs. relational databases• Stream queries are continuous
– Regular DB queries passive
• Client send query – server replies
– Queries over stream run continuously until they are stopped
Continuous queries, CQs
• Active databases (triggers) are related, but:– Active database are very side effect oriented (very procedural)
– The action in a trigger typically updates the database
– Triggers activated by database state changes
• Continuous queries are non-pocedural– CQs can be seen as side effect free functions (algebra) over
streams
– Database updates are handled as streams insering tuples intosome database (after sufficient data reduction by CQ filters)
Continuous queries• CQs are turned on and run until stop condition becomes
true or when they are explicitly terminated– Regular queries finsish when demanded result data delivered
• CQs often based on time stamps (logs) of stream elements, i.e. streams contain temporal data– Regular queries often not temporal
– Temporal databases provides queries over time stamped stored data
• Non-monotone operators in CQs (e.g. grouping and sorting) must be made over stream windows– A stream window is a bounded stream segment
– By contrast, regular DBMS queries specified over entire tables
A simple example
peakdetect(winagg(readstream(“SIG2"),256,256));
Stream windows• Limited size section of stream stored temporarily in
database– A window is regarded as a transient stored regular database
table
• Need window operators to chop stream into segments– as [RANGE 30 SECONDS] in CQL
• Window size based on:– A time window
• E.g. elements last 30 seconds
– i.e. a window contains all event processed during the last second
– Number of elements, a counting window
• E.g. last 10 elements
– i.e. windows has fixed size of 10 elements
– A landmark window
• All events from time t0 belongs to window
– c.f. log file
Stream windows
• Windows also have stride– Rule for how fast they move forward,
• E.g. 256 elements for a 256 element counting windowor 30s for a 30s time window
=> A tumbling window
• E.g. 128 elements for a 256 element counting windowor 10s for a 30s time window
=> A sliding windows
• The CQ language needs window operators to collect stream tuples into stream of windows of tuples– Stride and window size are parameters
Window joins
• Join over temporary data in window– Can be full join with aggregate functions (group-by)
• Window join operators become approximate– Since only windows of elements are joined rather than entire
stream
• In a relational database all elements in tables are matched in a join
– Often approximate summaries of streaming data maintained
• E.g. moving average, standard deviation, max, min
• The summaries also become dynamic streams!
Continuous query languages
• CQL– The classical CQ language developed by Stanford CSD
– Minimal extension to SQL by streams over tuples
– FROM clause is augumented with a […] construct to specify streams and windows:
SELECT DISTINCT car_id
FROM CarLocStr [RANGE 30 SECONDS];
SELECT exp_way, dir, seg, AVG(speed) as speed,
FROM CarSegStr [RANGE 5 MINUTES]
GROUP BY exp_way, dir, seg;
– [now] indicates no window:SELECT car_id, speed
FROM CarLocStr [now]
Continuous query languages
• There is no standard CQ language
• Most DSMS use variants of CQL– E.g. Tipco StreamBase, Azure Stream Analythics, SQL Stream
Variables always bound to rows in streams, tuple algebra andqueries over tuples
Syntactic sugar added to handle windows and sequences
Numerical models must be expressed in terms of rows, which many look unnatural for numerical models
• sa.engine (and sa.amos) based on domain algebra and queriesVariables bound to objects rather than rows
• For example, number, vectors, arrays, strings etc.
• Mathematical models easy and natural to express algebraically
Combines regular mathematical algebra with stream algebra, relational algebra, and object-oriented queries
Parallel DSMSs
• Classical DSMSs execute on one computer– Can be many threads (multi-core)
• The streams contain rather low volume events– Rather slow reading from sensors
• Sensor networks
• Smart dust
• Typically relatively simple processing (conditions) over events– E.g. thresholds
• Temperature > 100 and pressure < 200
• CQ typically makes data reduction by source filtering
• Could still be substantial data volumes at the data sink the receiver of the stream.– Collect all event into a relational database for SQL analyses
Parallel DSMSs
• New applications for DSMS require
– Advanced queries and computations on stream elements, e.g.:
• E.g. applying advanced statistical operators over windows
• Complex numerical computations (FFT, PCA) and other mathematical models
• Advanced pattern matching and AI techniques
– Scalability over both data volume and computations
• Solution
– Parallel DSMS• Computations made in in parallel
Well known DSMS SystemsSTREAM (Stanford):StreaMon: Baby & Widom: An Adaptive Engine for
Stream Query Processing, SIGMOD 2004, Classical, as DSMSprototype, defines CQL. Used by SQLStream product.
Aurora (Brown,MIT,Brandeis): Carney et al: Monitoring Streams – A New Class of Data ManagemBnt Applications, VLDB 2003. Classical, a classical DSMS, now Tipco Streambase product.
STORM (http://storm.incubator.apache.org/): Data stream manager developed by Twitter. Stream processing engine where programmers specify communicating distributed computations in Java (Scala, Clojure). Program library, no query language.
Kinesis (https://aws.amazon.com/kinesis/streams/): Amazon’s data stream manager.
Flink (https://flink.apache.org/): A data stream manager. CQL-similar query language being developed (https://calcite.apache.org/).
Borealis (Brown & Brandeis): Ahmad et al: StreaMon: An Adaptive Engine for Stream Query Processing, SIGMOD 2005, first parallell DSMS prototype.
Azure Stream Analytics (https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction), Microsoft’s DSMS, own CQL dialect.
sa.engine (SCSQ++)• The world’s fastest DSMS 2010-now
• Measured with DSMS benchmark Linear Road: http://www.cs.brandeis.edu/~linearroad/
• Processes demanding computations on
streaming data at network speed
• Performance obtained by non-trivial
parallelization on clusters
• See: http://user.it.uu.se/~udbl/SCSQ.html
• SVALI: SCSQ extended with features for
analyses of streams from industrial equipment:http://link.springer.com/article/10.1007/s10844-016-0427-2
Performance
2,5 2,5 564
512
1024
0
200
400
600
800
1000
1200
Best timely response to compute intense
queries and analyses over high data stream
rates, www.cs.brandeis.edu/~linearroad:
SAQL
sa.engine
How: Massively parallel stream query processing
Use case: fleet management
and analysis
Communicated
data streams
sa.enginesa.engine
sa.engine
sa.engine
sa.engine
sa.engine
sa.engine
sa.engine
sa.engine
Central
analytics
Edge
analytics
Edge
analytics
Equipment Monitoring centers Analysts
Stream Analysis and
Query Language SAQL
R/Matlab:
3 000 000 users
Hadoop
100 000
users
Empowers analysts to both implement and deploy
both analytical models and queries
by combining regular algebra and stream algebra in
very high-level queries
Automatic generation of (distributed) optimized execution
plans ensures efficiency and scalability
Existing algorithms in e.g. C or Java can be
reused in queries.
The sa.engine approch
Real-time query and analysis of streaming data
End user analytics on very high level by query and analysis
language
Capability of handling compute intense analyses
over massive data streams in real-time (AI, learning, prediction)
Light-weight implementation enables edge analytics
on embedded IoT systems and mobile devices
Plug-in architecture where existing algorithms and indexes
can easily be reused
Overview paper
L. Golab and T. Özsu: Issues in Stream Data Management, SIGMOD Records, 32(2), June 2003, http://www.acm.org/sigmod/record/issues/0306/1.golab-ozsu1.pdf
Recommended