28
Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Embed Size (px)

Citation preview

Page 1: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Beth Plale Computer Science Dept.

Indiana University

Calder: OGSA-DAI access to data in streams

Page 2: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Contributors

• PhD students:– Nithya Vijayakumar– Ying Liu

• Funding agencies:– Department of Energy

• Early Career grant

– National Science Foundation• LEAD project

Page 3: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Problem Statement• Data streams (e.g., from sensor networks,

instruments) are growing in pervasivness and importance

• Bringing streaming sources to a grid by means of wrapping each sensor and instrument as grid or web service is a naïve solution.

• It is the data in the streams that is of value to growing groups of user, not the instruments.– Select few will ‘steer’ instruments

• Need fresh solution that provides access to collections of data streams.

Page 4: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Outline

• Characterization of data stream systems

Page 5: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Data streams:• Indefinite sequence of events (or

messages, tuples)

• Often time marked– Generation time, that is, timestamp, and– Logical time

• Events continuously generated– pushed or pulled from providers to remote

consumers

Page 6: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Types of Data Stream Systems

Data manipulationsystems-process large amounts of data

Stream routing systems- delivery of events

Stream detectionsystems-detect unusual behavior

sizeand numberof stream“chunks”analyzed

timeliness demandson response

overlap

Page 7: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Stream Routing Systems• Known by various names

– Publish/subscribe, selective data dissemination, document filtering, message oriented middleware (MOM)

• Decisions made event-by-event– Set of queries (usually very large number)

managed over long time duration, arriving event matched against set of queries.

• Stream routing projects– Xfilter (UMaryland), Xyleme (INRIA),

XPushMachine (UWashington), NaradaBroker (IndianaU), Bayou(XeroxParc), Echo (GeorgiaTech)

route

detectmanip

Page 8: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Stream Routing Example

stock quote stream

queries added through user interface

Each arriving event matched against set of queries

Results multicast to owners of satisfied queries

- Timeliness requirement necessitates focus on efficient matching

Long-standingqueries

Page 9: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Data Manipulation Systems

• Event streams subject to transformation, filtering, aggregation.

• Looser timeliness requirements on results• Long running queries, often periodic (based

on assumption of synchronous streams)• Results in generation of new streams• Projects:

– Antarctic Monitoring(UNottingham), sensor network query layer (Cornell), dQUOB (IndianaU), STREAM (Stanford), Fjords (Berkeley), NiagraCQ (UWisconsin)

route

detectmanip

Page 10: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Stream Detection Systems

• Event-oriented (versus periodic)• Less predictable, asynchronous streams• Intent is to detect anomalous behavior• Timeliness is critical, time markers key to

decision making• Result is notification message• Examples:

– R-GMA (EU DataGrid), dQUOB (IndianaU), Conquer (GeorgiaTech), Gigascope (AT&T), Fjords (Berkeley)

route

detectmanip

Page 11: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Claim: streams systems that qualify as Data Stream Resource are only those circled

• A Data resource has coherence and meaning• Both systems qualify as ‘data resource’ because: -- distributed global snapshot on stream behavior alone -- distributed global snapshot has meaning and coherence

Data manipulationsystems

Stream routing systems

Stream detectionsystemsoverlap

Justification:

Page 12: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Rowsetgrid dataservice

Rowsetgrid dataservice

Rowsetgrid dataservice

Rowsetgrid dataservice

OGSA-DAI OGSI access to data sources

RowsetGrid dataservice

RowsetGrid dataservice

Grid dataservice

registry

Grid dataservice

registry

Grid dataservice

registry

Grid dataservice

registryGrid dataService(GDS)

Grid dataService(GDS)

R-DBMSR-DBMS

Data description - database description

Data access - query/update access

Data management - monitor service

Data factory - create rowset instance

Page 13: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Rowsetgrid dataservice

Rowsetgrid dataservice

Rowsetgrid dataservice

Rowsetgrid dataservice

Database Access using OGSA-DAI OGSI

RowsetGrid dataservice

RowsetGrid dataservice

grid dataservice

registry

grid dataservice

registry

grid dataservicefactory

grid dataservicefactory

Grid dataservice

registry

Grid dataservice

registry

Grid dataservice

registry

Grid dataservice

registry

Grid dataService(GDS)

Grid dataService(GDS)

R-DBMS

Grid servicehandle offactory

createGDS

handle ofGDS

query

servicecreate

response(row-by-row)

Page 14: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Rowsetgrid dataservice

Rowsetgrid dataservice

Rowsetgrid dataservice

Rowsetgrid dataservice

Database Access using OGSA-DAI OGSI

RowsetGrid dataservice

RowsetGrid dataservice

grid dataservice

registry

grid dataservice

registry

grid dataservicefactory

grid dataservicefactory

Grid dataservice

registry

Grid dataservice

registry

Grid dataservice

registry

Grid dataservice

registry

Grid dataService(GDS)

Grid dataService(GDS)

Grid servicehandle offactory

createGDS

handle ofGDS

query

servicecreate

response(row-by-row)

Page 15: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Calder: presents db interface to application and supports multiple stream systems (like ogsa-dai)

Page 16: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams
Page 17: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Specifying Long-Running Queries: An Example

Select from 3 data types, CAPS radar, ACARS data, and NEXRAD Doppler level II,

Filter out events not falling within 80 mile radius around New OrleansExecute beginning immediately and terminating execution in 1 hour. Set sliding window size (RANGE) as window over which joins are carried out

SELECT (caps, acars, nexradII) WHERE REGION(90W, 30N, 62.5KM) START now EXPIRE 1hr RANGE 6 min

Page 18: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Retrieving results from rowset service: Rowset Request API

Results obtained through request to rowset service of form:-- single event based on timestamp,-- range of events bounded by time range,-- most recent n events, or-- stream of events.

getTuple(timestamp, ringBufferID)getRangeTuple(startTS, endTS, ringBufferID)getMostRecent(lastRecent, num_events, ringBufferID)getStream(ringbufferID)

Page 19: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams
Page 20: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams
Page 21: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Experimental Environment• GDS, rowset server, stream processing server

– Dual Dell 2.8 GHz Precision workstation, 2 GB memory, RHEL

• Planner– Solaris UltraSPARC 502MHz, 1GB memory,

SunOS 5.8• Computational mesh (1 node)

– Xeon Intel 2.8 Ghz, 2GB RAM, RedHat 8.0• 1 Gbps switched Ethernet network• OGSA-DAI OGSI v4.0• dQUOB v1.0

Page 22: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Evaluation

• Benchmark following steps – GDS setup - ogsa-dai factory call– Query plan time

• Plan how query is to be distributed

– Query compile - compile into portable script form– Distribute query - deploy script into computational

mesh (1 node in test)– Ring buffer setup - allocate space in rowset

service, return handle to GDS.

Page 23: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Benchmark results: average taken over 100 different runs

Page 24: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Time interval (a) obtained bygetRangeTuple(startTS, end TS, ringBufferID)drops as # queries at node increases; turnaround increases

Page 25: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Related Research

• Common Middleware Instrument Architecture (CIMA), McMullen, Bramley et al. • GATES,

– Aggarwal, HPDC04

• Data Cutter, Saltz• Grid Stream Database Manager (GSDM)

– Koparanova and Risch, AxGrids 2003– Stores into O-O DBMS

• DQP – Manchester and Newcastle

• DB community: Widom, Borealis, NiagaraCQ

Page 26: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Summary

• Model of data stream store provides conceptual framework for retrieving data from streams by means of rich and meaningful queries

• GGF DAIS framework of OGSA-DAI is intuitive abstraction for accessing data stream store

• It is not about the instruments and sensors … it is about the streams!

Real-Time WRF Real-Time WRF executed on executed on Grid when Grid when

environment environment primed and primed and

storms presentstorms present

On-DemandOn-DemandResourceResource

SchedulingScheduling

Page 27: Beth Plale Computer Science Dept. Indiana University Calder: OGSA-DAI access to data in streams

Parting Views

• Our work brings data streams to the grid in way that user intuitively thinks about accessing data resources

• Querying heterogeneous data management tools is a difficult problem. Heterogeneous query languages will exist– I.e., continuous query languages will always be

different from database query languages.

• Scientists need a lot of coaching to understand why their monolithic “pile of perl” is not a grid service, and further why they should invest in breaking up that “pile of perl”.