The Stanford Data Streams Research Project

Profs. Rajeev Motwani & Jennifer Widom

And a cast of full- and part-time students: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Gurmeet Manku, Liadan O’Callaghan, Justin Rosentein, Qi Sun, Rohit Varma

stanfordstreamdatamanager

Data Streams

• Traditional DBMS -- data stored in finite, persistent data sets
• New applications -- data as multiple, continuous, rapid, time-varying data streams
  – Network monitoring and traffic engineering
  – Security applications
  – Telecom call records
  – Financial applications
  – Web logs and click-streams
  – Sensor networks
  – Manufacturing processes

Challenges

• Multiple, continuous, rapid, time-varying streams of data
• Queries may be continuous (not just one-time)
  – Evaluated continuously as stream data arrives
  – Answer updated over time
• Queries may be complex
  – Beyond element-at-a-time processing
  – Beyond stream-at-a-time processing

Using Traditional Database

[Figure: traditional database setup. Labels: User/Application, Loader, Query, Result.]

New Approach for Data Streams

[Figure: Data Stream Management System (DSMS). Labels: User/Application, Register Query, Stream Query Processor, Result, Scratch Space (Memory and/or Disk).]

DBMS versus DSMS

DBMS                                      DSMS
• Persistent relations                    • Transient streams (and persistent relations)
• One-time queries                        • Continuous queries
• Random access                           • Sequential access
• Access plan determined by query         • Unpredictable data arrival and
  processor and physical DB design          characteristics
• “Unbounded” disk store                  • Bounded main memory

Sample Applications

• Network management and traffic engineering (e.g., Sprint)
  – Streams of measurements and packet traces
  – Queries: detect anomalies, adjust routing
• Telecom call data (e.g., AT&T)
  – Streams of call records
  – Queries: fraud detection, customer call patterns, billing

Sample Applications (cont’d)

• Network security (e.g., iPolicy, NetForensics/Cisco, Netscreen)
  – Network packet streams, user session information
  – Queries: URL filtering, detecting intrusions & DOS attacks & viruses
• Financial applications (e.g., Traderbot)
  – Streams of trading data, stock tickers, news feeds
  – Queries: arbitrage opportunities, analytics, patterns

Sample Applications (cont’d)

• Web tracking and personalization (e.g., Yahoo, Google, Akamai)
  – Clickstreams, user query streams, log records
  – Queries: monitoring, analysis, personalization
• Truly massive databases (e.g., Astronomy Archives)
  – Stream the data by once (or over and over)
  – Queries do the best they can

Making Things Concrete

• Database = two streams of mobile call records
  – Outgoing(connectionID, caller, start, end)
  – Incoming(connectionID, callee, start, end)
• Query language = SQL
  – FROM clauses can refer to streams and/or relations

Query Example 1

• Find all outgoing calls longer than 2 minutes (relational selection)

    SELECT O.connectionID, O.caller
    FROM Outgoing O
    WHERE O.end - O.start > 2

• Result requires unbounded storage
• Can provide result as data stream
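As a rough illustration of why the selection above can be delivered as a stream, here is a minimal Python sketch. The `Outgoing` tuple type, the `long_calls` function, the sample records, and the choice of minutes as the time unit are all assumptions made for this example; they are not part of the STREAM system.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Outgoing(NamedTuple):
    """One record of the Outgoing stream (times in minutes, by assumption)."""
    connectionID: int
    caller: str
    start: float
    end: float


def long_calls(stream: Iterable[Outgoing],
               min_minutes: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Emit (connectionID, caller) for every call longer than min_minutes."""
    for call in stream:
        # Each tuple is examined once and then forgotten: the operator
        # itself keeps no state between tuples.
        if call.end - call.start > min_minutes:
            yield (call.connectionID, call.caller)


# Hypothetical sample data.
calls = [
    Outgoing(1, "alice", 0.0, 5.0),
    Outgoing(2, "bob", 1.0, 2.5),
    Outgoing(3, "carol", 3.0, 9.0),
]
selected = list(long_calls(calls))
```

The operator needs no memory of its own; storage becomes unbounded only if the consumer insists on materializing the whole answer, as `list(...)` does here.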

Query Example 2

• Pair up callers and callees (relational join)

    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.connectionID = I.connectionID

• Can still provide result as data stream
• Requires unbounded temporary storage (without additional assumptions)
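One common way to evaluate such a join over two streams is a symmetric hash join. The sketch below is an illustration of that general technique, not the STREAM implementation; the event layout `(stream_name, connectionID, party)` and the interleaved arrival order are assumptions.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# (stream name, connectionID, caller-or-callee); a hypothetical layout.
Event = Tuple[str, int, str]


def symmetric_hash_join(events: Iterable[Event]) -> Iterator[Tuple[str, str]]:
    """Join Outgoing and Incoming on connectionID, emitting (caller, callee)."""
    callers: Dict[int, List[str]] = {}  # connectionID -> callers seen so far
    callees: Dict[int, List[str]] = {}  # connectionID -> callees seen so far
    for stream, conn_id, party in events:
        if stream == "Outgoing":
            callers.setdefault(conn_id, []).append(party)
            # Probe the other side's table for matches that arrived earlier.
            for callee in callees.get(conn_id, []):
                yield (party, callee)
        elif stream == "Incoming":
            callees.setdefault(conn_id, []).append(party)
            for caller in callers.get(conn_id, []):
                yield (caller, party)
        else:
            raise ValueError(f"unknown stream: {stream}")


events = [
    ("Outgoing", 1, "alice"),
    ("Incoming", 2, "dave"),
    ("Incoming", 1, "bob"),
    ("Outgoing", 2, "carol"),
]
pairs = list(symmetric_hash_join(events))
```

Results are emitted as soon as both halves of a connection have arrived, so the output is a stream. The two hash tables, however, never shrink: without an extra assumption (for example, that each connectionID appears at most once per stream), no entry can be safely discarded.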

Query Example 3

• Find total connection time for each caller (relational grouping and aggregation)

    SELECT O.caller, sum(O.end - O.start)
    FROM Outgoing O
    GROUP BY O.caller

• Cannot provide result in (append-only) stream
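A small sketch may help show why the aggregate does not fit an append-only stream. The function name `running_totals` and the `(caller, start, end)` tuple layout are assumptions for illustration only.

```python
from typing import Dict, Iterable, Iterator, Tuple


def running_totals(
    stream: Iterable[Tuple[str, float, float]]
) -> Iterator[Tuple[str, float]]:
    """Emit the current total connection time of a caller after each call."""
    totals: Dict[str, float] = {}
    for caller, start, end in stream:
        totals[caller] = totals.get(caller, 0.0) + (end - start)
        # This output supersedes any total emitted earlier for the same
        # caller, so the consumer must treat it as an update, not an append.
        yield (caller, totals[caller])


calls = [("alice", 0, 3), ("bob", 1, 2), ("alice", 5, 9)]
updates = list(running_totals(calls))
```

The second output for "alice" invalidates the first one. With an unbounded input no total is ever final, so the answer has to be maintained as a stored, updatable relation (or as a stream of revisions) rather than as an append-only stream.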

Project Goal

Reconsider all aspects of data management and processing in the presence of data streams

Remainder of Talk

• Data stream model
• Queries over data streams
  – Language, semantics, evaluation & optimization
• DSMS query processing architecture and system internals
• Results to date
• Ongoing work
• Related work

Data Model

• Database: relations + data streams
• Stream characteristics
  – Type of data (schema)
  – Data distribution
  – Flow rate
  – Stability of distribution and flow
  – Ordering and other constraints
  – Synchronization of multiple streams
  – Distributed streams

Data Stream Queries -- Basic Issues

• Answer availability
  – One-time
  – Multiple-time
  – Continuous (“standing”), stored or streamed
• Registration time
  – Predefined
  – Ad hoc
• Stream access
  – Arbitrary
  – Sliding window (special case: size = 1)
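To make the sliding-window access pattern concrete, here is a minimal sketch of a count-based window that maintains a running sum. The function name and the choice of a sum as the aggregate are assumptions for the example; time-based windows would work along similar lines.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_window_sums(stream: Iterable[float], size: int) -> Iterator[float]:
    """After each arrival, emit the sum of the most recent `size` elements."""
    if size < 1:
        raise ValueError("window size must be at least 1")
    window: Deque[float] = deque()
    total = 0.0
    for value in stream:
        window.append(value)
        total += value
        if len(window) > size:
            # Evict the oldest element so memory stays bounded by `size`.
            total -= window.popleft()
        yield total


sums_3 = list(sliding_window_sums([1, 2, 3, 4, 5], size=3))
sums_1 = list(sliding_window_sums([1, 2, 3, 4, 5], size=1))
```

With `size=1` the operator sees only the current element, which corresponds to the special case noted above.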

Query Language & Semantics

• Specifying queries over streams
  – SQL-like versus dataflow network of operators
  – Sliding windows as first-class query construct
• Semantic issues
  – Blocking operators, e.g., aggregation, order-by
  – Streams as sets versus lists
  – Timestamping

Query Evaluation -- Approximation

• Why approximate?
  – Streams are coming too fast
  – Exact answer requires unbounded storage or significant computational resources
  – Ad hoc queries reference history
• Issues in approximation
  – Sliding windows, sampling, synopses, …
  – How is approximation controlled?
  – How is it understood by user?
• Accuracy-efficiency-storage tradeoff
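Sampling is one of the approximation tools listed above. As an illustration, the sketch below uses reservoir sampling, a standard technique for keeping a fixed-size uniform sample of a stream of unknown length. It is offered as a generic example, not as the method used in the project; the function name and parameters are assumptions.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return a uniform random sample of k items using O(k) memory."""
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = item      # keep the new item with probability k/(i+1)
    return sample


rng = random.Random(0)                # seeded only to make the run repeatable
sample = reservoir_sample(range(10_000), k=100, rng=rng)
estimated_mean = sum(sample) / len(sample)
```

The sample size `k` is the knob behind the accuracy-storage tradeoff: a larger reservoir gives tighter estimates (here, of the mean) at the cost of more memory.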

Query Evaluation -- Adaptivity

• Why adaptivity?
  – Queries are long-running
  – Fluctuating stream arrival & data characteristics
  – Evolving query loads
• Issues in adaptivity
  – Adaptive resource allocation (memory, computation)
  – Adaptive query execution plans

Query Evaluation -- Multiple Queries

• Possibly large number of continuous queries
• Long-running
• Shared resources
• Multi-query optimization

Query Evaluation -- Distributed Streams

1. Many physical streams but one logical stream
  – E.g., maintain top 100 visited pages at Yahoo
2. Correlate streams at distributed servers
  – E.g., network monitoring
3. Many streams controlled by a few servers
  – E.g., sensor networks
• Issues
  – Move processing to streams, not streams to processor
  – Approximation-bandwidth tradeoff
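The top-pages scenario and the approximation-bandwidth tradeoff can be sketched as follows. Each server summarizes its own stream and ships only its `m` most frequent pages to a coordinator. The function names, the page names, and the truncation scheme are assumptions chosen for brevity, not a description of any deployed system.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def local_summary(page_hits: Iterable[str], m: int) -> Counter:
    """Computed at each server: keep only the m most frequent pages."""
    return Counter(dict(Counter(page_hits).most_common(m)))


def merged_top_k(summaries: Iterable[Counter], k: int) -> List[Tuple[str, int]]:
    """Computed at the coordinator: add up the summaries and rank pages."""
    total: Counter = Counter()
    for summary in summaries:
        total.update(summary)         # Counter.update adds counts
    return total.most_common(k)


# Hypothetical click streams at two servers.
server_a = ["home"] * 3 + ["news"] * 2 + ["mail"]
server_b = ["news"] * 3 + ["sports"] * 2 + ["home"]

top = merged_top_k([local_summary(server_a, 2), local_summary(server_b, 2)], k=2)
```

Only two counters per server cross the network, but the merged count for "home" is 3 although its true count is 4, because server B dropped it from its summary. Sending larger summaries narrows that error at the cost of bandwidth.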

Query Processing Architecture

[Figure: DSMS query processing architecture. Labels: Input Data Streams; Output Stream; Users issue continuous and ad hoc queries; Applications register continuous queries; Administrator can monitor query execution and adjust run-time parameters; Waiting Op, Ready Op, Running Op; Synopses; Query Plans.]

DSMS Internals

• Query plans: operators, synopses, queues
• Memory management
  – Dynamic allocation to buffers, queues, synopses
  – Accuracy vs. memory tradeoff
  – Operators adapt gracefully to memory reallocation
• Scheduler
  – Handles variable-rate input streams
  – Handles varying operator and query requirements
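The interplay of operators, queues, and a scheduler can be pictured with a toy sketch. Everything here is hypothetical: the `Operator` class, the "longest queue first" policy, and the two-operator plan are assumptions for illustration and say nothing about how STREAM actually schedules work.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


class Operator:
    """A plan operator with an input queue and an optional downstream operator."""

    def __init__(self, name: str, fn: Callable[[int], Iterable[int]],
                 downstream: Optional["Operator"] = None) -> None:
        self.name = name
        self.fn = fn
        self.downstream = downstream
        self.queue: deque = deque()

    def run_once(self, output: List[int]) -> None:
        """Process one queued tuple and route the results onward."""
        item = self.queue.popleft()
        for result in self.fn(item):
            if self.downstream is None:
                output.append(result)
            else:
                self.downstream.queue.append(result)


def run(operators: List[Operator], output: List[int]) -> None:
    """Toy scheduler: repeatedly run the ready operator with the longest queue."""
    while True:
        ready = [op for op in operators if op.queue]
        if not ready:
            break                      # every operator is waiting for input
        max(ready, key=lambda op: len(op.queue)).run_once(output)


double = Operator("double", lambda x: [x * 2])
evens = Operator("evens", lambda x: [x] if x % 2 == 0 else [], downstream=double)
evens.queue.extend(range(6))           # tuples 0..5 arrive on the input queue

out: List[int] = []
run([evens, double], out)
```

Swapping the `max(...)` line for a different policy changes which queues grow and when results appear, which is the kind of decision a DSMS scheduler has to make under variable input rates.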

Some Results to Date

• Algorithms on data streams
  – Online clustering [FOCS 2000, ICDE 2002]
  – Online quantiles [SIGMOD 98, SIGMOD 99]
  – Statistics over sliding windows [SODA 2002]
  – Online frequency counting
• Theory of stream query processing
  – Memory requirements of stream queries [PODS 2002]
• System design
  – STREAM: STanford stREam datA Manager
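To give a flavor of online frequency counting, the sketch below implements the Misra-Gries summary, a classic bounded-memory technique. It is swapped in here for brevity and is not claimed to be the algorithm from the project's own papers; the function name and sample stream are assumptions.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Summarize a stream with at most k - 1 counters.

    Every item occurring more than n/k times in a stream of length n is
    guaranteed to be present, and each stored count undercounts the true
    frequency by at most n/k.
    """
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"]
summary = misra_gries(stream, k=3)
```

Here "a" occurs 5 times out of 10, which exceeds n/k, so it is guaranteed to survive in the summary even though only two counters are kept.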

STREAM System Implementation

• Comprehensive DSMS query processor
• Broad suite of operators and synopses
• Sophisticated “developer’s workbench” interface
  – Submit queries in extended SQL or algebra
  – Submit or edit query plans in XML or GUI
  – Query plan execution visualizer
  – On-the-fly modification of memory allocation, scheduling policies, etc.

Ongoing Work

• Algebra for streams
• Synopses and algorithmic issues
• Memory management issues
• Exploiting constraints on streams
• Approximation in query processing
• Distributed stream processing
• System development

Ongoing Work -- Constraints

• Exploiting constraints on streams in query processing
  – Foreign-key joins, referential integrity, clustering, ordering
  – Need not be exact (e.g., k-clustered)
  – Reduce memory requirements
  – Unblock blocking operators
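As one illustration of how a constraint can unblock an operator, suppose the stream is clustered on the grouping attribute, meaning all tuples of a caller arrive contiguously. Under that assumption a group's total is final as soon as the next caller appears. The function name and tuple layout below are hypothetical.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, Tuple


def clustered_totals(
    stream: Iterable[Tuple[str, float, float]]
) -> Iterator[Tuple[str, float]]:
    """Total connection time per caller, assuming the input is clustered by caller.

    The assumption is not checked: if a caller's tuples were not contiguous,
    that caller would simply be reported more than once.
    """
    for caller, calls in groupby(stream, key=itemgetter(0)):
        # Only the running sum of the current group is held in memory.
        yield (caller, sum(end - start for _, start, end in calls))


calls = [("alice", 0, 3), ("alice", 5, 9), ("bob", 1, 2)]
totals = list(clustered_totals(calls))
```

Each output is final, so the result fits an append-only stream, and memory no longer grows with the number of callers.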

Ongoing Work -- Approximation in Query Processing

• Understanding behavior of approximate operators when composed
• Memory allocation to operators in a plan, given per-operator memory-accuracy curve
• Best query plan, assuming best memory allocation
• Multiple (weighted) queries sharing resources
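The memory-allocation question can be sketched under strong simplifying assumptions: each operator has a known curve giving accuracy as a function of memory units, curves show diminishing returns, and the goal is to maximize the plain sum of accuracies. None of these assumptions comes from the talk, and how accuracies really compose across operators is exactly the open issue listed above.

```python
from typing import Dict, List


def allocate_memory(curves: Dict[str, List[float]], budget: int) -> Dict[str, int]:
    """Greedily hand out memory units by largest marginal accuracy gain.

    curves[name][m] is the (hypothetical) accuracy of operator `name`
    when it is given m units of memory.
    """
    alloc = {name: 0 for name in curves}
    for _ in range(budget):
        best, best_gain = None, 0.0
        for name, curve in curves.items():
            m = alloc[name]
            if m + 1 < len(curve):
                gain = curve[m + 1] - curve[m]
                if gain > best_gain:
                    best, best_gain = name, gain
        if best is None:
            break                      # no operator benefits from more memory
        alloc[best] += 1
    return alloc


# Made-up memory-accuracy curves for two operators in one plan.
curves = {
    "join": [0.0, 0.5, 0.8, 0.9],
    "agg": [0.0, 0.6, 0.7, 0.75],
}
allocation = allocate_memory(curves, budget=3)
```

For curves with diminishing returns and an additive objective, this greedy rule yields an optimal split; with other objectives or curve shapes it is only a heuristic.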

Related Work

• Triggers, alerters, materialized views, continuous queries on conventional DBs, pub/sub, sequence & temporal databases, …
• Telegraph project at UC Berkeley
• Niagara project at Wisconsin/OGI
• Amazon project at Cornell
• Aurora project at Brown/MIT
• And others

For Papers and General Info.

http://www-db.stanford.edu/stream