Data Streams and Continuous Query Systems CS 240B: Professor Zaniolo Eric Sytwu Joseph Joswig


Data Streams and Continuous Query Systems

CS 240B: Professor Zaniolo

Eric Sytwu

Joseph Joswig


Outline

1. Review of Data Streams

2. NiagaraCQ

3. TelegraphCQ

4. Conclusion

5. Bibliography


Data Sets vs. Data Streams

• Data Sets

– Infrequently changing data

– Ex. Employee personnel table, contact database, library system

• Data Streams

– Data arriving continuously

– Ex. Stock streamer, sensor networks, weather monitoring system


Traditional Database Query

• In a traditional query, the query engine returns a subset of the data that is currently in the system.

[Figure: End User / Application, Query Processor, Static Data Sets; flows labeled Query and Results]


Continuous Queries

• Continuous queries are persistent queries that allow users to get new results as new information enters the system.

[Figure: End User / Application, Query Processor, Data Streams, Workspace; flows labeled Query and Continuous Results]


NiagaraCQ


Goal:

• Allow users to obtain new results from a database without having to issue the same query repeatedly.

• Develop a system that lets a large number of users register continuous queries in a high-level language such as XML-QL.


What’s wrong with previous continuous querying systems?

• Previous group optimization efforts focused on finding an optimal plan for a small number of queries.

• Computationally too expensive to handle a large number of queries

• Not designed for the web, which is constantly changing


Benefits of NiagaraCQ

• Based on group optimization

• Grouped queries can share computation

• Common execution plans of grouped queries reside in memory, saving on I/O costs compared to executing each query separately.


How do we get the benefits?

• Incremental group optimization

• Groups are created for existing queries according to their signatures, which represent similar structures among the queries.

• Each individual query in a query group shares the results from the execution of the group plan.

• When a new query is submitted, the group optimizer considers existing groups as potential optimization choices, and the new query is merged into an existing group whose signature it matches.
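The incremental grouping idea above can be sketched in a few lines. This is a minimal illustration, not NiagaraCQ's implementation: the predicate syntax, the `signature` helper, and the `GroupOptimizer` class are all hypothetical names introduced here. A signature keeps the data source, path, and operator of a selection predicate and replaces the constant with a placeholder, so queries that differ only in the constant land in the same group.

```python
import re
from collections import defaultdict

# Hypothetical predicate shape: '<path> <op> "<constant>"'.
PREDICATE = re.compile(r'^\s*([\w.]+)\s*(=|<|>)\s*"?([^"]+?)"?\s*$')

def signature(source, predicate):
    """Split a predicate into (signature, constant).

    The signature keeps source, path and operator; the constant is factored out.
    """
    m = PREDICATE.match(predicate)
    if m is None:
        raise ValueError(f"unsupported predicate: {predicate!r}")
    path, op, constant = m.groups()
    return (source, path, op), constant

class GroupOptimizer:
    def __init__(self):
        # signature -> constant table (constant -> ids of queries using it)
        self.groups = defaultdict(lambda: defaultdict(list))

    def register(self, query_id, source, predicate):
        """Merge the query into a matching group, or start a new group."""
        sig, constant = signature(source, predicate)
        is_new_group = sig not in self.groups
        self.groups[sig][constant].append(query_id)
        return sig, is_new_group

opt = GroupOptimizer()
sig_a, new_a = opt.register("Q1", "quotes.xml", 'Quotes.Quote.Symbol = "INTC"')
sig_b, new_b = opt.register("Q2", "quotes.xml", 'Quotes.Quote.Symbol = "MSFT"')
sig_c, new_c = opt.register("Q3", "quotes.xml", 'Quotes.Quote.Price > "50"')
```

Q1 and Q2 share one signature and therefore one group, while Q3 opens a second group because its path and operator differ.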


Example:

• XML-QL query

• Expression signature

[Figure: expression signature tree with operator "=" and operands "Quotes.Quote.Symbol in quotes.xml" and "constant"]


Query Plan

• Query plan

[Figure: two separate plans. File Scan (Quotes.xml) → Select Symbol=“INTC” → Trigger Action I; File Scan (Quotes.xml) → Select Symbol=“MSFT” → Trigger Action J]


Group Plan

• Group plan

[Figure: group plan. File Scan (Quotes.xml) and File Scan (Constant Table) feed a Join on Symbol = Constant value; a Split operator routes the join output to Trigger Action I, Trigger Action J, …]
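A rough sketch of how such a group plan could run: the per-query selections are replaced by one join between the scanned tuples and a constant table, and a split step hands each matching tuple to the trigger action of the query that asked for that constant. The table contents, destination names, and the `run_group_plan` function are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical constant table for the signature "Symbol = constant":
# each row pairs a constant value with the destination of one query.
constant_table = [
    ("INTC", "trigger_action_i"),
    ("MSFT", "trigger_action_j"),
]

# Made-up tuples standing in for a file scan over quotes.xml.
quotes = [
    {"Symbol": "INTC", "Price": 30.0},
    {"Symbol": "MSFT", "Price": 52.0},
    {"Symbol": "IBM", "Price": 90.0},
]

def run_group_plan(tuples, constants):
    # Index the constant table once so the join is one hash lookup per tuple.
    by_constant = defaultdict(list)
    for value, destination in constants:
        by_constant[value].append(destination)

    # Join on Symbol = constant value, then split the output by destination.
    outputs = defaultdict(list)
    for row in tuples:
        for destination in by_constant.get(row["Symbol"], []):
            outputs[destination].append(row)
    return dict(outputs)

results = run_group_plan(quotes, constant_table)
```

The source is scanned once regardless of how many queries are in the group; adding a query with a new constant only adds a row to the constant table.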


Materialized Intermediate Files

• Query split with intermediate files

[Figure: the Split operator writes intermediate files File_i, File_j, …; a File Scan over each intermediate file feeds Trigger_Act_i and Trigger_Act_j respectively]


General Selection Predicates

• “Attribute op Constant”

• Attribute = path expression without wildcards

• Op = “=”, “<”, “>”, …


Join Operators

• A join signature in our approach contains the names of the two data sources and the predicate for the join. Join queries with the same join signature are grouped together.


Processing Continuous Queries

• 1. The Continuous Query Manager (CQM) adds continuous queries with file and timer information to enable the Event Detector (ED) to monitor events.

• 2. ED asks the Data Manager (DM) to monitor changes to files.

• 3. When a timer event happens, ED asks DM for the last modified time of files.

• 4. DM informs ED of changes to push-based data sources.

• 5. If file changes and timer events are satisfied, ED provides CQM with a list of firing CQs.

• 6. CQM invokes the Query Engine (QE) to execute firing CQs.

• 7. The file scan operator calls DM to retrieve selected documents.

• 8. DM only returns data changes between the last fire time and the current fire time.
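The steps above can be sketched as a small simulation. Everything here is an assumption for illustration (integer clock ticks, the class and function names, a change log kept in memory); it shows only the two ideas that matter: a query fires when its timer condition and its file-change condition both hold, and execution sees only the changes since the last fire time.

```python
from dataclasses import dataclass, field

@dataclass
class ContinuousQuery:
    name: str
    files: set            # files the query reads
    interval: int = 0     # 0 = change-based; > 0 = timer-based (ticks)
    last_fired: int = 0

@dataclass
class DataManager:
    # file name -> list of (modification time, change), in time order
    log: dict = field(default_factory=dict)

    def record(self, file, time, change):
        self.log.setdefault(file, []).append((time, change))

    def last_modified(self, file):
        entries = self.log.get(file, [])
        return entries[-1][0] if entries else 0

    def delta(self, file, since, until):
        # Step 8: only changes between the last and the current fire time.
        return [c for t, c in self.log.get(file, []) if since < t <= until]

def firing_queries(queries, dm, now):
    """Steps 3 and 5: CQs whose timer (if any) is due and whose files changed."""
    firing = []
    for cq in queries:
        timer_due = cq.interval == 0 or now - cq.last_fired >= cq.interval
        changed = any(dm.last_modified(f) > cq.last_fired for f in cq.files)
        if timer_due and changed:
            firing.append(cq)
    return firing

def execute(cq, dm, now):
    """Steps 6 to 8: run over the delta only, then advance the fire time."""
    out = {f: dm.delta(f, cq.last_fired, now) for f in sorted(cq.files)}
    cq.last_fired = now
    return out

dm = DataManager()
dm.record("quotes.xml", 1, "MSFT 52")
dm.record("quotes.xml", 4, "MSFT 56")
cqs = [ContinuousQuery("on_change", {"quotes.xml"}),
       ContinuousQuery("every_10", {"quotes.xml"}, interval=10)]

fired_at_5 = [cq.name for cq in firing_queries(cqs, dm, now=5)]
first_run = execute(cqs[0], dm, now=5)
dm.record("quotes.xml", 7, "MSFT 58")
second_run = execute(cqs[0], dm, now=8)
fired_at_10 = [cq.name for cq in firing_queries(cqs, dm, now=10)]
```

The second run of `on_change` receives only the change recorded at tick 7, not the two it already processed.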


Experimental results

• Performed on a Sun Ultra 6000 with 1GB RAM running JDK1.2 on Solaris 2.6


Experimental Results


Analysis of NiagaraCQ

• Pros

– Scalable to a large number of queries and users

– Works with both change-based and timer-based continuous queries

– Better performance, less I/O required to execute queries.

• Cons

– No dynamic re-grouping of groups; eventually, groups become sub-optimal

– Assumes queries have common structure, which is not always the case

– Incremental grouping works only for select and join as of now. Eventually, aggregation may be included.


Niagara in Review

• The goal was to develop an Internet-scale continuous query system using group optimization based on the assumption that many continuous queries on the Internet will have some similarities.

• Proposed novel “incremental grouping” methodology

• Supports both timer-based and change-based queries.


TelegraphCQ


TelegraphCQ Design Overview

• Focus: Continuously Adaptive Query Processing of high volume and highly variable data streams.

• Large scale

• Deeply networked nature

• Unpredictability of the environment

• Need for close user interaction

• Data constantly moving and changing


TelegraphCQ Restrictions

• Data is pushed to the query processor

• Data arrival rate can be high and bursty

• On-the-fly processing: data can be stored, but real-time, one-pass analysis is important

• Ordering of data is of significant importance.


Design Goals

1. scheduling and resource management for groups of queries

2. support for out-of-core (non main memory) data

3. variable adaptivity

4. dynamic QoS support

5. parallel cluster-based processing and distributed computation.


TelegraphCQ

• Complete redesign and re-implementation of the Telegraph system with a focus on support for shared, continuous query processing over query and data streams.

• The name TelegraphCQ distinguishes it from the Telegraph project’s broader focus on adaptive dataflow in general, and emphasizes the challenges addressed in the new implementation.


Telegraph Module Types

• Ingress and Caching

– Interface with external data sources

• TeSS – HTML/XML Screenscraper

• TelNape – Interfaces with popular P2P networks

– Local caching to hide network delays

• Query Processing

– Pipelined, non-blocking versions of standard relational operators such as joins, selections, projections, grouping and aggregation, and duplicate elimination.

– State Modules (SteMs)

• Adaptive Routing

– Ability to “re-optimize” the plan on a continuous basis while a query is running.

– Eddies

– Flux (Fault-tolerant, Load-balancing eXchange): Opaque dataflow module handles buffering and reordering of streams


Eddies

• Role: Continuously route tuples among a set of other modules according to a routing policy

• Intercept tuples and choose the order that they travel between modules

• Eddy can shut down each module when the end of all of its input streams is reached and the modules have completed current processing.

• Not designed as general purpose scheduler, no enforcement of resource management policies

• Multiple eddies run as parallel threads on queries with disjoint sets of tables and streams.
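To make the routing role concrete, here is a toy eddy that decides, per tuple, the order in which it visits a set of selection modules. The routing policy used here (visit the module with the lowest observed pass rate first) is an assumption chosen for brevity and is not claimed to be the policy TelegraphCQ uses; the `Module` and `Eddy` classes and the sample tuples are likewise hypothetical.

```python
class Module:
    """A selection module that tracks how often tuples pass it."""
    def __init__(self, name, predicate):
        self.name, self.predicate = name, predicate
        self.seen = self.passed = 0

    def process(self, row):
        self.seen += 1
        ok = self.predicate(row)
        self.passed += ok
        return ok

    def pass_rate(self):
        # Unobserved modules are treated as passing everything.
        return self.passed / self.seen if self.seen else 1.0

class Eddy:
    def __init__(self, modules):
        self.modules = modules

    def route(self, row):
        """Visit each module once, most selective (so far) first."""
        visited = []
        remaining = list(self.modules)
        while remaining:
            nxt = min(remaining, key=Module.pass_rate)  # assumed policy
            remaining.remove(nxt)
            visited.append(nxt.name)
            if not nxt.process(row):
                return False, visited   # tuple dropped
        return True, visited            # tuple satisfied every module

price = Module("price>50", lambda r: r["price"] > 50)
symbol = Module("symbol=MSFT", lambda r: r["symbol"] == "MSFT")
eddy = Eddy([price, symbol])

stream = [{"symbol": "IBM", "price": 90},
          {"symbol": "INTC", "price": 60},
          {"symbol": "MSFT", "price": 52}]
trace = [eddy.route(row) for row in stream]
```

After the first tuple shows that the symbol filter rejects more tuples, later tuples are routed to it first, so the order adapts while the query is running instead of being fixed by a plan chosen up front.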


Adaptive Processing W/Eddies & SteMs

• SteM - temporary repository of tuples, essentially corresponding to half of a traditional join operator.

• It stores homogeneous tuples (i.e., tuples spanning the same set of tables) formed during query processing.

• Supports insert (build), search (probe), and optionally delete (eviction) operations.

• Two kinds of tuples can be routed to a SteM.

– When a tuple t ∈ T (a build tuple) is routed to SteM_T, t is added to the set of tuples in SteM_T.

– When a tuple p ∉ T (a probe tuple) is routed to SteM_T, SteM_T returns concatenated matches for it to the Eddy. These concatenated matches are the tuples in {p} ⋈ SteM_T that satisfy all query predicates that can be evaluated on the columns in p and T.

[Figure: an Eddy routing tuples of S and T between SteM_S and SteM_T; edges labeled S build, T build, S probe, T probe, and ST matches]
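The build and probe operations described above can be illustrated with a small symmetric join. This is a sketch under simplifying assumptions: a single equality join key, distinct non-key column names in S and T, and a fixed routing order (build into the tuple's own SteM, then probe the other one). The `SteM` class and `eddy_join` function are names introduced here.

```python
class SteM:
    """Half of a join: stores one table's tuples and answers probes."""
    def __init__(self, table, key):
        self.table, self.key = table, key
        self.index = {}   # join-key value -> list of stored tuples

    def build(self, row):
        self.index.setdefault(row[self.key], []).append(row)

    def probe(self, row):
        # Concatenate the probe tuple with every stored match.
        return [{**stored, **row} for stored in self.index.get(row[self.key], [])]

    def evict(self, predicate):
        # Optional delete: drop stored tuples for which predicate is true.
        for k in list(self.index):
            self.index[k] = [r for r in self.index[k] if not predicate(r)]
            if not self.index[k]:
                del self.index[k]

def eddy_join(arrivals, key):
    """Join S and T: build into the tuple's own SteM, probe the other."""
    stems = {"S": SteM("S", key), "T": SteM("T", key)}
    matches = []
    for table, row in arrivals:
        other = "T" if table == "S" else "S"
        stems[table].build(row)
        matches.extend(stems[other].probe(row))
    return matches

arrivals = [
    ("S", {"id": 1, "s_val": "a"}),
    ("T", {"id": 1, "t_val": "x"}),
    ("T", {"id": 2, "t_val": "y"}),
    ("S", {"id": 2, "s_val": "b"}),
]
st_matches = eddy_join(arrivals, "id")
```

Because each side is stored independently, a match is produced whichever side arrives first, which is what lets the eddy interleave S and T tuples freely.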


Fjords

• Inter Module Communications API

• Form the links between modules

• Supports a mixture of Push (streaming) and Pull (static) operations for query plans

• Allows modules to ignore the specifics of the data source.

• Supports non-blocking dequeue operations
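A minimal sketch of the idea, assuming a queue-backed link: producers of streaming data push into the link, a static source is pulled from on demand, and the consumer's dequeue returns immediately in both cases. The `Fjord` class below is an illustrative stand-in, not TelegraphCQ's API, and it uses `None` to signal "nothing available", which assumes `None` is never a real data item.

```python
import queue

class Fjord:
    """A link between two modules; dequeue never blocks."""
    def __init__(self, source=None):
        self._queue = queue.Queue()
        # Optional pull side: an iterable standing in for a static table.
        self._source = iter(source) if source is not None else None

    def push(self, item):
        # Streaming producers push items as they arrive.
        self._queue.put(item)

    def dequeue(self):
        """Return the next item, or None if nothing is available now."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            pass
        if self._source is not None:
            return next(self._source, None)  # pull on demand
        return None

stream_link = Fjord()                         # push-based
table_link = Fjord(source=["row1", "row2"])   # pull-based

empty_read = stream_link.dequeue()
stream_link.push("tick")
pushed_read = stream_link.dequeue()
pulled = [table_link.dequeue() for _ in range(3)]
```

The consuming module calls `dequeue` the same way on both links, so it does not need to know whether its input is a stream or a table.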


System Specifications

• Built on the PostgreSQL platform

– Process-per-connection model

– Coded in C/C++


Example: Landmark Query

The input windows of these queries have a fixed beginning point in the timeline, and a forward moving endpoint.

Example: “Select all the days after the hundredth trading day, on which the closing price of MSFT has been greater than $50. Keep this query standing in the system for a thousand trading days”.

SELECT closingPrice, timestamp

FROM ClosingStockPrices

WHERE stockSymbol = ‘MSFT’

AND closingPrice > 50.00

for (t = 101; t <= 1100; t++ )

{

WindowIs(ClosingStockPrices, 101, t);

}

MSFT 101 $60

MSFT 102 $48

MSFT 103 $52

MSFT 104 $60
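The landmark semantics can be checked against the four sample rows above with a short simulation. This is an illustration in Python rather than TelegraphCQ syntax; the tuple layout and the `landmark_query` function are assumptions. The window's left edge stays at day 101 while the right edge t advances, and the query is re-evaluated for each t.

```python
# Sample rows from the slide: (symbol, trading day, closing price).
closing_stock_prices = [
    ("MSFT", 101, 60.0),
    ("MSFT", 102, 48.0),
    ("MSFT", 103, 52.0),
    ("MSFT", 104, 60.0),
]

def landmark_query(rows, start, t):
    """Evaluate the SELECT over the window [start, t]; start is fixed."""
    return [(price, day) for symbol, day, price in rows
            if start <= day <= t and symbol == "MSFT" and price > 50.00]

# The slide's loop runs to t = 1100; only four days of data exist here.
results_by_t = {t: landmark_query(closing_stock_prices, 101, t)
                for t in range(101, 105)}
```

Day 102 never appears in any result because its closing price of $48 fails the predicate, while each qualifying day stays in the result for every later t.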


Example: Sliding Window Query

The input windows of these queries have forward moving beginning and end points.

Example: “On every third trading day starting today, calculate the average closing price of MSFT for the three most recent trading days. Keep the query standing for fifty trading days”.

SELECT AVG(closingPrice)

FROM ClosingStockPrices

WHERE stockSymbol = ‘MSFT’

for (t = ST; t < ST + 50; t += 3 )

{

WindowIs(ClosingStockPrices, t - 2, t);

}

MSFT 101 $60

MSFT 102 $48

MSFT 103 $52

MSFT 104 $56

MSFT 105 $55

MSFT 106 $58

MSFT 107 $52

MSFT 108 $60
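Applied to the sample rows above, the sliding window can be simulated as follows. The start time ST is not given on the slide, so ST = 103 is an assumption made here so that the first window is full; the tuple layout and `sliding_avg` function are likewise illustrative. Both window edges move forward by three days per step.

```python
# Sample rows from the slide: (symbol, trading day, closing price).
closing_stock_prices = [
    ("MSFT", 101, 60.0), ("MSFT", 102, 48.0), ("MSFT", 103, 52.0),
    ("MSFT", 104, 56.0), ("MSFT", 105, 55.0), ("MSFT", 106, 58.0),
    ("MSFT", 107, 52.0), ("MSFT", 108, 60.0),
]

def sliding_avg(rows, t, width=3):
    """AVG(closingPrice) of MSFT over the window [t - (width - 1), t]."""
    window = [price for symbol, day, price in rows
              if symbol == "MSFT" and t - (width - 1) <= day <= t]
    return sum(window) / len(window) if window else None

ST = 103  # assumed start time; not specified on the slide
last_day = max(day for _, day, _ in closing_stock_prices)

# The slide's loop runs for 50 days; stop early where the data ends.
averages = {t: sliding_avg(closing_stock_prices, t)
            for t in range(ST, min(ST + 50, last_day + 1), 3)}
```

With this data the query produces one average for days 101 to 103 and one for days 104 to 106; unlike the landmark query, older days drop out of the window as it moves.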


Example: Temporal Band Join Query

These queries join tuples in one stream with tuples in another based on timestamp.

Example: “For the five most recent trading days starting today, select all stocks that closed higher than MSFT on a given day. Keep the query standing for twenty trading days”.

Select c2.*

FROM ClosingStockPrices as c1,

ClosingStockPrices as c2

WHERE c1.stockSymbol = ‘MSFT’ and

c2.stockSymbol!= ‘MSFT’ and

c2.closingPrice > c1.closingPrice and

c2.timestamp = c1.timestamp

for (t = ST; t < ST +20 ; t++ )

{

WindowIs(c1, t - 4, t);

WindowIs(c2, t - 4, t);

}
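A small simulation of one evaluation of this query, written in Python for illustration. The rows below are invented (the slide gives no data for this example), ST = 2 is an assumed start time, and the `band_join` function assumes at most one MSFT row per trading day. Both aliases see the same five-day window, and rows are joined on equal timestamps.

```python
# Hypothetical rows: (symbol, trading day, closing price).
rows = [
    ("MSFT", 1, 50.0), ("IBM", 1, 55.0), ("INTC", 1, 45.0),
    ("MSFT", 2, 52.0), ("IBM", 2, 51.0), ("INTC", 2, 53.0),
]

def band_join(rows, t, width=5):
    """Stocks that closed higher than MSFT on the same day, in [t-4, t]."""
    lo = t - (width - 1)
    in_window = [r for r in rows if lo <= r[1] <= t]
    # c1 side: MSFT closing price per day (assumes one MSFT row per day).
    msft = {day: price for sym, day, price in in_window if sym == "MSFT"}
    # c2 side: non-MSFT rows joined on equal timestamp.
    return [(sym, day, price) for sym, day, price in in_window
            if sym != "MSFT" and day in msft and price > msft[day]]

ST = 2  # assumed start time
result = band_join(rows, ST)
```

IBM beats MSFT on day 1 and INTC beats it on day 2; rows from different days are never compared because of the timestamp equality.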


Pros and Cons of System

• Pros

– Focus on extreme adaptability

– New code is multithreaded to help boost system parallelism and enhance performance, particularly in multiprocessor scenarios.

• Cons

– Code not fully multithreaded (existing PostgreSQL code)

– Queries separated into classes for processing based on disjoint footprints.

– Still in early development stages

• Issues still need to be solved

• No extensive performance analysis


TelegraphCQ Future Work

• Egress Modules

– Include fault tolerance in delivery of results, i.e., in mobile networks

– Improved interface with overlay networks

• Cluster and Distributed Implementations

– Extension of Flux module

– Integration with TAG system


Conclusion and Review

• NiagaraCQ

– NiagaraCQ is a system that establishes scalability with a general strategy of incremental group optimization.

• TelegraphCQ

– TelegraphCQ is a system that combines prior work in Fjords, Eddies, and PSoup in order to query streaming data on large scales.

• Other Data Streaming solutions

– Aurora

– STREAM

– StreamMill


Thank You for your time!


Bibliography

• J. Chen, D. DeWitt, F. Tian, Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In Proc. of the ACM SIGMOD Conf. on Management of Data, 2000.

• Xiaoning Wang. NiagaraCQ presentation. www.cs.wpi.edu/~cs561/s03/talks/niagara-cq.ppt

• Chandrasekaran et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. UC Berkeley. CIDR Conference, 2003.