Data Streams and Continuous Query Systems CS 240B: Professor Zaniolo Eric Sytwu Joseph Joswig

Data Streams and Continuous Query Systems

CS 240B: Professor Zaniolo

Eric Sytwu

Joseph Joswig

Outline

1. Review of Data Streams

2. NiagaraCQ

3. TelegraphCQ

4. Conclusion

5. Bibliography

Data Sets VS Data Streams

• Data Sets
  – Infrequently changing data
  – Ex. Employee personnel table, contact database, library system

• Data Streams
  – Data arriving continuously
  – Ex. Stock streamer, sensor networks, weather monitoring system

Traditional Database Query

• In a traditional query, the query engine returns a subset of the data that is currently in the system.

[Figure: the End User / Application sends a Query to the Query Processor, which evaluates it over Static Data Sets and returns Results.]

Continuous Queries

• Continuous queries are persistent queries that allow users to get new results as new information enters the system.

[Figure: End User / Application, Query Processor, Data Streams, and Workspace; the application submits a Query and receives Continuous Results.]
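The contrast between the two query models can be sketched in a few lines of Python. This is only an illustration: the record shape and the names `one_shot_query`, `continuous_query`, and `is_msft` are assumptions made for this example, not part of either system.

```python
from typing import Callable, Iterable, Iterator

Quote = dict  # hypothetical record shape: {"symbol": str, "price": float}


def one_shot_query(table: list[Quote], predicate: Callable[[Quote], bool]) -> list[Quote]:
    """Traditional query: evaluate once over the data currently stored."""
    return [row for row in table if predicate(row)]


def continuous_query(stream: Iterable[Quote], predicate: Callable[[Quote], bool]) -> Iterator[Quote]:
    """Continuous query: stays registered and emits a result per matching arrival."""
    for row in stream:
        if predicate(row):
            yield row


def is_msft(row: Quote) -> bool:
    return row["symbol"] == "MSFT"


# One-shot: the answer is a subset of what is stored right now.
stored = [{"symbol": "MSFT", "price": 60.0}, {"symbol": "INTC", "price": 30.0}]
snapshot = one_shot_query(stored, is_msft)

# Continuous: results appear as new data enters the system.
arrivals = iter([{"symbol": "INTC", "price": 31.0}, {"symbol": "MSFT", "price": 48.0}])
standing = continuous_query(arrivals, is_msft)
first_new_result = next(standing)
```

The generator stands in for the persistent query: it produces nothing until a matching tuple arrives.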

NiagaraCQ

Goal:

• Allow users to obtain new results from a database without having to issue the same query repeatedly.

• Develop a system that lets a large number of users register continuous queries in a high-level language such as XML-QL.

What’s wrong with previous continuous querying systems?

• Previous group optimization efforts focused on finding an optimal plan for a small number of queries.

• Computationally too expensive to handle a large number of queries

• Not designed for the web, which is constantly changing

Benefits of NiagaraCQ

• Based on group optimization

• Grouped queries can share computation

• Common execution plans of grouped queries reside in memory, saving on I/O costs compared to executing each query separately.

How do we get the benefits?

• Incremental group optimization

• Groups are created for existing queries according to their signatures, which represent similar structures among the queries.

• Each individual query in a query group shares the results from the execution of the group plan

• When a new query is submitted, the group optimizer considers existing groups as potential optimization choices and merges the new query into an existing group whose signature it matches.
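The grouping step above can be sketched as follows. This is a minimal illustration, assuming selection queries only; `SelectionQuery`, `signature`, and `IncrementalGroupOptimizer` are hypothetical names, and the real optimizer works on query plans rather than flat records.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionQuery:
    query_id: str
    source: str      # e.g. "quotes.xml"
    attribute: str   # path expression, e.g. "Quotes.Quote.Symbol"
    op: str          # "=", "<", ">"
    constant: str


def signature(q: SelectionQuery) -> tuple[str, str, str]:
    """The signature keeps the query's structure and drops its constant."""
    return (q.source, q.attribute, q.op)


class IncrementalGroupOptimizer:
    """Hypothetical helper: assigns each new query to a group by signature."""

    def __init__(self) -> None:
        self.groups: dict[tuple[str, str, str], list[SelectionQuery]] = defaultdict(list)

    def register(self, q: SelectionQuery) -> tuple[str, str, str]:
        sig = signature(q)
        self.groups[sig].append(q)  # joins a matching group, or creates it on first use
        return sig


optimizer = IncrementalGroupOptimizer()
optimizer.register(SelectionQuery("I", "quotes.xml", "Quotes.Quote.Symbol", "=", "INTC"))
optimizer.register(SelectionQuery("J", "quotes.xml", "Quotes.Quote.Symbol", "=", "MSFT"))
optimizer.register(SelectionQuery("K", "quotes.xml", "Quotes.Quote.Price", ">", "50"))
group_sizes = sorted(len(members) for members in optimizer.groups.values())
```

Queries I and J differ only in their constant, so they land in the same group; query K has a different structure and starts a new one. Existing groups are never rebuilt, which is what makes the scheme incremental.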

Example:

• XML-QL query

• Expression signature

[Figure: expression signature: Quotes.Quote.Symbol in quotes.xml = constant]

Query Plan

• Query plan

[Figure: two separate query plans. Each runs a File Scan over Quotes.xml followed by a Select (Symbol=“INTC” in one plan, Symbol=“MSFT” in the other), feeding Trigger Action I and Trigger Action J.]

Group Plan

• Group plan

[Figure: group plan. File Scans over Quotes.xml and the Constant Table feed a Join on Symbol = Constant value; a Split operator then distributes the join output to Trigger Action I, Trigger Action J, ...]
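A group plan of this shape can be sketched in Python. This is an illustration only: the quote rows are made up, the pairing of INTC with Trigger Action I and MSFT with Trigger Action J is assumed from the order on the slide, and `run_group_plan` is a hypothetical name.

```python
from collections import defaultdict

# Constant table: one row per registered query (constant value, trigger action).
constant_table = [("INTC", "Trigger Action I"), ("MSFT", "Trigger Action J")]

# Hypothetical contents of quotes.xml after a file scan.
quotes = [
    {"Symbol": "MSFT", "Price": 60.0},
    {"Symbol": "IBM", "Price": 95.0},
    {"Symbol": "INTC", "Price": 30.0},
]


def run_group_plan(quotes, constant_table):
    """Join quotes with the constant table on Symbol, then split by destination."""
    by_constant = defaultdict(list)
    for constant, action in constant_table:
        by_constant[constant].append(action)

    split_output = defaultdict(list)
    for quote in quotes:  # one scan serves every query in the group
        # Join: Symbol = constant value.
        for action in by_constant.get(quote["Symbol"], []):
            # Split: route the joined tuple to the owning query's trigger action.
            split_output[action].append(quote)
    return dict(split_output)


results = run_group_plan(quotes, constant_table)
```

Adding a query to the group only adds a row to the constant table; the scan and the join are shared.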

Materialized Intermediate Files

• Query split with intermediate files

[Figure: the Split operator writes its output to intermediate files File_i, File_j, ...; a File Scan over each file feeds Trigger_Act_i, Trigger_Act_j, ...]

General Selection Predicates

• “Attribute op Constant”

• Attribute = path expression without wildcards

• Op = “=”, “<”, “>”, …

Join Operators

• A join signature in our approach contains the names of the two data sources and the predicate for the join. Join queries with the same join signature are grouped together.

Processing Continuous Queries

1. The Continuous Query Manager (CQM) adds continuous queries with file and timer information to enable the Event Detector (ED) to monitor events.

2. ED asks the Data Manager (DM) to monitor changes to files.

3. When a timer event happens, ED asks DM for the last modified time of files.

4. DM informs ED of changes to push-based data sources.

5. If file changes and timer events are satisfied, ED provides CQM with a list of firing CQs.

6. CQM invokes the Query Engine (QE) to execute firing CQs.

7. The file scan operator calls DM to retrieve selected documents.

8. DM only returns data changes between the last fire time and the current fire time.
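The timer-driven part of this flow, and in particular the delta retrieval in the last step, can be sketched as below. This is a simplified model under stated assumptions: the class and function names are hypothetical, time is an integer counter, and the Event Detector and Query Manager are folded into one function.

```python
from dataclasses import dataclass, field


@dataclass
class DataManager:
    """Hypothetical DM: keeps timestamped changes per file."""
    changes: dict[str, list[tuple[int, str]]] = field(default_factory=dict)

    def record_change(self, filename: str, time: int, data: str) -> None:
        self.changes.setdefault(filename, []).append((time, data))

    def last_modified(self, filename: str) -> int:
        entries = self.changes.get(filename, [])
        return max((t for t, _ in entries), default=-1)

    def delta(self, filename: str, last_fire: int, now: int) -> list[str]:
        """Return only the changes made in the interval (last_fire, now]."""
        return [d for t, d in self.changes.get(filename, []) if last_fire < t <= now]


@dataclass
class ContinuousQuery:
    name: str
    filename: str
    last_fire: int = 0


def on_timer_event(dm: DataManager, queries: list[ContinuousQuery], now: int) -> dict[str, list[str]]:
    """Find the queries that fire at this timer event and hand each its delta."""
    fired = {}
    for cq in queries:
        if dm.last_modified(cq.filename) > cq.last_fire:  # file changed since last fire
            fired[cq.name] = dm.delta(cq.filename, cq.last_fire, now)
            cq.last_fire = now
    return fired


dm = DataManager()
cq = ContinuousQuery("msft_watch", "quotes.xml")
dm.record_change("quotes.xml", 5, "MSFT 60")
first = on_timer_event(dm, [cq], now=10)
dm.record_change("quotes.xml", 12, "MSFT 48")
second = on_timer_event(dm, [cq], now=20)
third = on_timer_event(dm, [cq], now=30)  # no change since time 20, so nothing fires
```

Each firing sees only what changed since the previous firing, so a query never reprocesses old data.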

Experimental results

• Performed on a Sun Ultra 6000 with 1GB RAM running JDK1.2 on Solaris 2.6


Analysis of NiagaraCQ

• Pros
  – Scalable to large number of queries, users
  – Works with both change and timer based continuous queries
  – Better performance, less I/O required to execute queries.

• Cons
  – No dynamic re-grouping of groups, eventually, groups become sub-optimal
  – Assumes queries have common structure, not always the case
  – Incremental grouping works only for select and join as of now. Eventually, aggregation may be included.

Niagara in Review

• The goal was to develop an Internet-scale continuous query system using group optimization based on the assumption that many continuous queries on the Internet will have some similarities.

• Proposed novel “incremental grouping” methodology

• Supports both timer-based and change-based queries.

TelegraphCQ

TelegraphCQ Design Overview

• Focus: Continuously Adaptive Query Processing of high volume and highly variable data streams.

• Large scale

• Deeply networked nature

• Unpredictability of the environment

• Need for close user interaction

• Data constantly moving and changing

TelegraphCQ Restrictions

• Data is pushed to the query processor

• Data arrival rate can be high and bursty

• On-the-fly processing: data can be stored, but real-time, one-pass analysis is important

• Ordering of data is of significant importance.

Design Goals

1. scheduling and resource management for groups of queries

2. support for out-of-core (non main memory) data

3. variable adaptivity

4. dynamic QoS support

5. parallel cluster-based processing and distributed computation.

TelegraphCQ

• Complete redesign and re-implementation of the Telegraph system, with a focus on support for shared, continuous query processing over query and data streams.

• The name distinguishes it from the Telegraph project’s broader focus on adaptive dataflow in general, and emphasizes the challenges addressed in the new implementation.

Telegraph Module Types

• Ingress and Caching
  – Interface with external data sources
    • TeSS – HTML/XML Screenscraper
    • TelNape – Interfaces with popular P2P networks
  – Local caching to hide network delays

• Query Processing
  – pipelined, non-blocking versions of standard relational operators such as joins, selections, projections, grouping and aggregation, and duplicate elimination.
  – State Module (SteMs)

• Adaptive Routing
  – ability to “re-optimize” the plan on a continuous basis while a query is running.
  – Eddies
  – Flux (Fault-tolerant, Load-balancing eXchange): Opaque dataflow module handles buffering and reordering of streams

Eddies

• Role: Continuously route tuples among a set of other modules according to a routing policy

• Intercept tuples and choose the order that they travel between modules

• Eddy can shut down each module when the end of all of its input streams is reached and the modules have completed current processing.

• Not designed as a general-purpose scheduler; does not enforce resource management policies

• Multiple eddies run as parallel threads on queries with disjoint sets of tables and streams.
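Per-tuple routing can be sketched with selection operators only. Note the assumption: the routing policy below (send the tuple to the pending operator with the lowest observed pass rate) is a deterministic stand-in chosen for this example, not the policy TelegraphCQ uses, and `Operator` and `eddy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Operator:
    name: str
    predicate: Callable[[dict], bool]
    seen: int = 0
    passed: int = 0

    def pass_rate(self) -> float:
        # Unvisited operators report 0.0 so that they are tried early.
        return self.passed / self.seen if self.seen else 0.0

    def process(self, tup: dict) -> bool:
        self.seen += 1
        ok = self.predicate(tup)
        self.passed += ok
        return ok


def eddy(tuples: list[dict], operators: list[Operator]) -> list[dict]:
    """Route each tuple through every operator, choosing the order per tuple."""
    output = []
    for tup in tuples:
        pending = list(operators)
        alive = True
        while pending and alive:
            # Routing policy (an assumption): lowest observed pass rate goes first.
            nxt = min(pending, key=Operator.pass_rate)
            pending.remove(nxt)
            alive = nxt.process(tup)
        if alive:
            output.append(tup)
    return output


ops = [
    Operator("price>50", lambda t: t["price"] > 50),
    Operator("symbol=MSFT", lambda t: t["symbol"] == "MSFT"),
]
stream = [
    {"symbol": "MSFT", "price": 60}, {"symbol": "INTC", "price": 70},
    {"symbol": "IBM", "price": 95}, {"symbol": "MSFT", "price": 48},
]
matches = eddy(stream, ops)
```

Once the symbol filter proves more selective, later tuples visit it first, and the price filter is skipped for tuples it would never have needed to see. The order is decided while the query runs rather than fixed in a plan.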

Adaptive Processing W/Eddies & SteMs

• SteM - temporary repository of tuples, essentially corresponding to half of a traditional join operator.

• It stores homogeneous tuples (i.e., tuples spanning the same set of tables) formed during query processing.

• Supports insert (build), search (probe), and optionally delete (eviction) operations.

• Two kinds of tuples can be routed to a SteM.

– When a tuple t in T (a build tuple) is routed to SteM_T, t is added to the set of tuples in SteM_T.

– When a tuple p ∉ T (a probe tuple) is routed to SteM_T, SteM_T returns concatenated matches for it to the Eddy. These concatenated matches are the tuples in {p} join SteM_T that satisfy all query predicates that can be evaluated on the columns in p and T.

[Figure: an Eddy routing tuples from sources S and T to the SteMs on S and on T. Labeled flows: S build, T build, S probe, T probe, and ST matches.]
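The build and probe operations can be sketched as a join between two streams S and T. This is a minimal illustration, assuming an equality join on one key and a fixed build-then-probe routing; the `SteM` class, `eddy_join`, and the sample tuples are hypothetical.

```python
from collections import defaultdict


class SteM:
    """Hypothetical SteM over one table, indexed on the join key."""

    def __init__(self, table: str, key: str) -> None:
        self.table = table
        self.key = key
        self.index: dict[object, list[dict]] = defaultdict(list)

    def build(self, tup: dict) -> None:
        """Insert a tuple that belongs to this SteM's table."""
        self.index[tup[self.key]].append(tup)

    def probe(self, tup: dict) -> list[dict]:
        """Return concatenated matches for a tuple from the other table."""
        return [{**match, **tup} for match in self.index.get(tup[self.key], [])]


def eddy_join(arrivals: list[tuple[str, dict]], key: str) -> list[dict]:
    """Route each arrival to its own SteM (build), then to the other SteM (probe)."""
    stems = {"S": SteM("S", key), "T": SteM("T", key)}
    results = []
    for table, tup in arrivals:
        other = "T" if table == "S" else "S"
        stems[table].build(tup)
        results.extend(stems[other].probe(tup))
    return results


arrivals = [
    ("S", {"day": 101, "s_price": 60}),
    ("T", {"day": 101, "t_price": 25}),
    ("T", {"day": 102, "t_price": 26}),
    ("S", {"day": 102, "s_price": 48}),
]
st_matches = eddy_join(arrivals, key="day")
```

Each SteM is one half of a symmetric hash join, so matches are produced whichever side arrives first, and each matching pair is produced exactly once.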

Fjords

• Inter Module Communications API

• Form the links between modules

• Supports a mixture of Push (streaming) and Pull (static) operations for query plans

• Allows modules to ignore the specifics of the data source.

• Supports non-Blocking dequeue operations
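The queue abstraction can be sketched with Python's standard `queue` module. The class names `PushQueue` and `PullQueue` and the `drain` consumer are assumptions made for this example; the point is that the consumer calls the same `dequeue` on either kind and never blocks on the push side.

```python
import queue
from typing import Callable, Optional


class PushQueue:
    """Push side: the source enqueues; the consumer polls without blocking."""

    def __init__(self) -> None:
        self._q: queue.Queue = queue.Queue()

    def enqueue(self, item: dict) -> None:
        self._q.put(item)

    def dequeue(self) -> Optional[dict]:
        try:
            return self._q.get_nowait()  # non-blocking: returns immediately
        except queue.Empty:
            return None                  # nothing yet; the caller can do other work


class PullQueue:
    """Pull side: a dequeue asks the upstream source for its next item."""

    def __init__(self, produce: Callable[[], Optional[dict]]) -> None:
        self._produce = produce

    def dequeue(self) -> Optional[dict]:
        return self._produce()


def drain(q, limit: int) -> list[dict]:
    """A consumer module that works the same way over either queue kind."""
    items = []
    for _ in range(limit):
        item = q.dequeue()
        if item is not None:
            items.append(item)
    return items


sensor = PushQueue()
sensor.enqueue({"reading": 21})
stored_rows = iter([{"reading": 18}, {"reading": 19}])
table = PullQueue(lambda: next(stored_rows, None))

from_stream = drain(sensor, limit=3)
from_table = drain(table, limit=3)
```

Because an empty push queue returns control instead of waiting, a module with several inputs can keep servicing the ones that do have data.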

System Specifications

• Built on the PostgreSQL platform
  – Process-per-connection model
  – Coded in C/C++

Example: Landmark Query

The input windows of these queries have a fixed beginning point in the timeline, and a forward moving endpoint.

Example: “Select all the days after the hundredth trading day, on which the closing price of MSFT has been greater than $50. Keep this query standing in the system for a thousand trading days”.

SELECT closingPrice, timestamp
FROM ClosingStockPrices
WHERE stockSymbol = ‘MSFT’
AND closingPrice > 50.00
for (t = 101; t <= 1100; t++ ) {
    WindowIs(ClosingStockPrices, 101, t);
}

MSFT 101 $60

MSFT 102 $48

MSFT 103 $52

MSFT 104 $60
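The effect of the landmark window on the four sample rows can be traced with a short Python sketch. It models only the window semantics, not how TelegraphCQ executes the query; `landmark_query` is a hypothetical helper.

```python
# Sample rows from the slide: (stockSymbol, timestamp, closingPrice).
closing_stock_prices = [
    ("MSFT", 101, 60.0),
    ("MSFT", 102, 48.0),
    ("MSFT", 103, 52.0),
    ("MSFT", 104, 60.0),
]


def landmark_query(rows, start, t):
    """Evaluate the query over the window [start, t]; start never moves."""
    return [
        (price, ts)
        for symbol, ts, price in rows
        if start <= ts <= t and symbol == "MSFT" and price > 50.00
    ]


# The query's loop runs t up to 1100; the sample data only covers days 101 to 104.
results_per_window = {
    t: landmark_query(closing_stock_prices, 101, t) for t in range(101, 105)
}
```

The result set only grows: day 102 adds nothing because its price is below 50, while days 103 and 104 each add one row.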

Example: Sliding Window Query

The input windows of these queries have forward moving beginning and end points.

Example: “On every third trading day starting today, calculate the average closing price of MSFT for the three most recent trading days. Keep the query standing for fifty trading days”.

Select AVG(closingPrice)
From ClosingStockPrices
Where stockSymbol = ‘MSFT’
for (t = ST; t < ST + 50; t += 3) {
    WindowIs(ClosingStockPrices, t - 2, t);
}

MSFT 101 $60

MSFT 102 $48

MSFT 103 $52

MSFT 104 $56

MSFT 105 $55

MSFT 106 $58

MSFT 107 $52

MSFT 108 $60
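The sliding window can be traced over the eight sample rows in Python. Two assumptions are made for the example: the start time ST is taken to be day 103, so that the first window covers days 101 to 103, and windows that the sample data does not fully cover are skipped. `sliding_averages` is a hypothetical helper.

```python
# Sample MSFT closing prices from the slide, keyed by trading day.
prices = {
    101: 60.0, 102: 48.0, 103: 52.0, 104: 56.0,
    105: 55.0, 106: 58.0, 107: 52.0, 108: 60.0,
}


def sliding_averages(prices, st, duration, step, width):
    """Average the last `width` days at every `step`-th day from `st`."""
    averages = {}
    for t in range(st, st + duration, step):
        # WindowIs(ClosingStockPrices, t - 2, t) when width is 3.
        window = [prices[d] for d in range(t - width + 1, t + 1) if d in prices]
        if len(window) == width:  # skip windows the sample data does not cover
            averages[t] = sum(window) / width
    return averages


averages = sliding_averages(prices, st=103, duration=50, step=3, width=3)
```

With these inputs the windows are days 101 to 103 and days 104 to 106, giving averages of 160/3 and 169/3. Both ends of the window advance by three days between evaluations, so consecutive windows do not overlap.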

Example: Temporal Band Join Query

These queries join tuples in one stream with tuples in another based on timestamp.

Example: “For the five most recent trading days starting today, select all stocks that closed higher than MSFT on a given day. Keep the query standing for twenty trading days”.

Select c2.*

FROM ClosingStockPrices as c1,

ClosingStockPrices as c2

WHERE c1.stockSymbol = ‘MSFT’ and

c2.stockSymbol!= ‘MSFT’ and

c2.closingPrice > c1.closingPrice and

c2.timestamp = c1.timestamp

for (t = ST; t < ST +20 ; t++ )

{

WindowIs(c1, t - 4, t);

WindowIs(c2, t - 4, t);

}
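The join condition and the shared window can be sketched in Python. The rows below are invented for illustration (the slides give no sample data for this query), `band_join` is a hypothetical helper, and the sketch assumes one MSFT row per timestamp.

```python
# Hypothetical rows: (stockSymbol, timestamp, closingPrice).
rows = [
    ("MSFT", 1, 50.0), ("IBM", 1, 95.0), ("INTC", 1, 30.0),
    ("MSFT", 2, 52.0), ("IBM", 2, 51.0), ("INTC", 2, 53.0),
]


def band_join(rows, t, width=5):
    """Stocks that closed above MSFT on the same day, within [t - 4, t]."""
    lo = t - (width - 1)
    # Both c1 and c2 use the same window: WindowIs(c, t - 4, t).
    window = [r for r in rows if lo <= r[1] <= t]
    # c1 side: MSFT closing price per timestamp.
    msft = {ts: price for sym, ts, price in window if sym == "MSFT"}
    # c2 side: join on equal timestamps and keep the higher closes.
    return [
        (sym, ts, price)
        for sym, ts, price in window
        if sym != "MSFT" and ts in msft and price > msft[ts]
    ]


higher_at_2 = band_join(rows, t=2)
higher_at_6 = band_join(rows, t=6)
```

At t = 6 the window has moved past day 1, so the day 1 match drops out of the result.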

Pros and Cons of System

• Pros
  – Focus on extreme adaptability
  – New code is multithreaded to help boost system parallelism and enhance performance, particularly in multiprocessor scenarios.

• Cons
  – Code not fully multithreaded (existing PostgreSQL code)
  – Queries separated into classes for processing based on disjoint footprints.
  – Still in early development stages
    • Issues still need to be solved
    • No extensive performance analysis

TelegraphCQ Future Work

• Egress Modules
  – Include fault tolerance in delivery of results, e.g., in mobile networks
  – Improved interface with overlay networks

• Cluster and Distributed Implementations
  – Extension of FLuX module
  – Integration with TAG system

Conclusion and Review

• NiagaraCQ
  – NiagaraCQ is a system that establishes scalability with a general strategy of incremental group optimization.

• TelegraphCQ
  – TelegraphCQ is a system that combines prior work in Fjords, Eddies, and PSoup in order to query streaming data on large scales

• Other Data Streaming solutions
  – Aurora
  – STREAM
  – StreamMill

Thank You for your time!

Bibliography

• J. Chen, D. DeWitt, F. Tian, Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In Proc. of the ACM SIGMOD Conf. on Management of Data, 2000.

• Xiaoning Wang, NiagaraCQ presentation. www.cs.wpi.edu/~cs561/s03/talks/niagara-cq.ppt

• Chandrasekaran, et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. UC Berkeley. 2003 CIDR Conference.
