Data Streaming in Wide-Area Computations: the Missing Link

Beth Plale
Computer Science Dept.
Indiana University

07 December 2001

Grid services that facilitate access to remote data currently focus on the file as the unit of transfer.

In this talk, we argue that the data stream and its delivery of a stream of events is a valuable complementary form of remote data access in grid applications.

Talk Outline

• Wide area computing applications
  – Data oriented services
• Data streams: the missing link
  – dQUOB system for querying streaming data
• WAN performance results

Grid Applications

• Authenticate once
• Submit a grid computation (code, resources, data, …)
• Locate resources
• Negotiate authorization, acceptable use, etc.
• Select and acquire resources
• Initiate data transfers, computation
• Monitor progress
• Steer computation
• Store and distribute results
• Account for usage

[Figure: LIGO in Louisiana; panels a, b, c]

Slide courtesy Ann Chervenak, ISI

Data Oriented Grid Services

[Diagram components:]
• Application
• Metadata Service
• Replica Location Service
• Information Services
• Planner: data location, replica selection, selection of compute and storage nodes
• Security and Policy
• Executor: initiates data transfers and computations
• Data Movement
• Data Access
• Compute Resources
• Storage Resources

Courtesy Ann Chervenak, ISI

Metadata

• Information that describes the contents of data
• Typically, each discipline develops an ontology:
  – What attributes are important?
  – What is the exact meaning of attributes?
  – How is information structured?
• Must be able to interpret, share and reproduce data

Metadata Examples

• UNIX-style file system metadata
  – file size, access permissions, creation time, modify time, etc.
• Information needed to read/interpret the bits
  – Format of information: MPEG, JPEG, GIF, ppt, doc, …
  – File type: ascii, binary, …
  – Big-endian or little-endian
  – For removable media: what device wrote bits?
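The byte-order item above matters whenever binary data moves between machines. A minimal C sketch follows, assuming 32-bit values; the helper names `host_is_little_endian`, `swap32`, and `read_be32` are illustrative and do not come from any grid toolkit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the host stores the least significant byte first. */
static int host_is_little_endian(void) {
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Reverses the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t v) {
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

/* Interprets 4 bytes written big-endian, regardless of host byte order. */
static uint32_t read_be32(const unsigned char *bytes) {
    assert(bytes != NULL);
    return ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
}
```

Recording the writer's byte order as metadata lets a reader choose between using a value directly and passing it through `swap32`.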

Metadata Examples (cont.)

• Descriptions of meaning of files: what do ones and zeroes represent?
  – Satellite image of South America
  – Results of Monte Carlo simulation
  – Word document
• Contextual information
  – When was data created? By whom?
  – Under what experimental conditions?
  – Using what application software, operating system software, hardware configuration?
  – What input files/parameter settings?
  – Goal: experimental results understandable, repeatable

A Metadata Service

• Records metadata attributes associated with data
  – Typically stored in attribute:value pairs
  – Schema determined by application domain
• Answers queries
  – Attribute-based search capability
  – Identify files that contain data with specified metadata attributes
  – “Find precipitation measurements over North America for January to June 1999”
• Example Metadata Service
  – a Metadata CATalog (MCAT), SDSC
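The attribute:value model and attribute-based search above can be sketched in a few lines of C. This is an in-memory illustration only, not MCAT's interface; the types `Attr` and `Entry` and the function `catalog_find` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ATTRS 8

/* One attribute:value pair, e.g. "variable":"precipitation". */
typedef struct {
    const char *name;
    const char *value;
} Attr;

/* Catalog entry: a file name plus its metadata attributes. */
typedef struct {
    const char *file;
    Attr attrs[MAX_ATTRS];
    size_t n_attrs;
} Entry;

/* Returns 1 if the entry carries the given attribute:value pair. */
static int entry_has(const Entry *e, const char *name, const char *value) {
    for (size_t i = 0; i < e->n_attrs; i++) {
        if (strcmp(e->attrs[i].name, name) == 0 &&
            strcmp(e->attrs[i].value, value) == 0)
            return 1;
    }
    return 0;
}

/* Writes the files matching ALL query pairs into out (up to out_cap)
   and returns how many were written. */
static size_t catalog_find(const Entry *catalog, size_t n_entries,
                           const Attr *query, size_t n_query,
                           const char **out, size_t out_cap) {
    size_t found = 0;
    assert(catalog != NULL && query != NULL && out != NULL);
    for (size_t i = 0; i < n_entries && found < out_cap; i++) {
        int match = 1;
        for (size_t q = 0; q < n_query && match; q++)
            match = entry_has(&catalog[i], query[q].name, query[q].value);
        if (match)
            out[found++] = catalog[i].file;
    }
    return found;
}
```

A real service would add range predicates (for example a date interval) and a domain-specific schema; exact-match pairs keep the sketch short.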


Replica Management

• Terabytes or petabytes of data shared by researchers around the world
• Often read-only data, “published” by experiments
• Replicate data at multiple locations
  1. Fault tolerance
     • Avoid single points of failure
  2. Performance
     • Avoid wide area data transfer latencies
     • Load balancing

Replica Management, cont.

• Issues:
  – Location: finding copies of files
  – Aggregation: manage groups of files to improve convenience and scalability
  – Consistency model: how out of date is the file one obtains from a replica?
• Current Replication Grid Services
  – Globus Replica Catalog, ANL/ISI
  – Grid Data Mirroring Package (GDMP), CERN, Caltech, and Fermilab
  – Storage Resource Broker (SRB), SDSC
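The location issue above amounts to mapping one logical file name to several physical copies and picking one. The C sketch below assumes a flat in-memory table and a lowest-estimated-latency selection rule; the `Replica` type, its fields, and `select_replica` are hypothetical and do not reflect the interfaces of the services listed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One physical copy of a logical file. All fields are illustrative. */
typedef struct {
    const char *logical_name;   /* name shared by all copies */
    const char *location;       /* site holding this copy */
    double est_latency_ms;      /* estimated transfer latency to the client */
} Replica;

/* Returns the lowest-latency replica of logical_name, or NULL if none. */
static const Replica *select_replica(const Replica *catalog, size_t n,
                                     const char *logical_name) {
    const Replica *best = NULL;
    assert(catalog != NULL && logical_name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(catalog[i].logical_name, logical_name) != 0)
            continue;
        if (best == NULL || catalog[i].est_latency_ms < best->est_latency_ms)
            best = &catalog[i];
    }
    return best;
}
```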


Information Services

• Repository of information about people, organizations, computational resources, software, storage devices, etc.
• Emerging Grid standard is Globus MDS2
  – Level 1: distributed LDAP directory servers, typically one per organization or administrative domain. These are called GRIS servers.
  – Level 2: aggregating directory servers (GIIS servers).
  – Grid applications capable of querying either.


Security

• Forms of Security:
  – Authorization: Verify that users are allowed to perform requested operations
  – Privacy: Knowledge of existence, location and content of data must be controlled
  – Integrity: Prevent adversary from tampering with stored data or data transfers (access control)
• Emerging standards:
  – Grid Security Infrastructure (GSI), Globus
    • for authentication and access control
  – Community Authorization Service (CAS), Globus
    • for authentication and access control


Data Movement

• Want efficient, secure movement of large amounts of data for:
  – Publishing large data collections
  – Replication of large files or collections of files
  – As input data to grid application
• Emerging standard: GridFTP, Globus
  – Extends standard FTP protocol: get/put etc.
  – Secure (GSS security bindings)
  – Parallel data movement
  – Automatic and manual TCP buffer setting
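TCP buffer setting is usually guided by the bandwidth-delay product: the amount of data that must be in flight to keep a path full. The C sketch below shows only that arithmetic. The function name `bdp_bytes` is illustrative, and how GridFTP itself chooses buffer sizes is not described in these slides.

```c
#include <assert.h>

/* Bandwidth-delay product in bytes:
   bandwidth (bits per second) * round-trip time (seconds) / 8. */
static double bdp_bytes(double bandwidth_bits_per_s, double rtt_s) {
    assert(bandwidth_bits_per_s >= 0.0 && rtt_s >= 0.0);
    return bandwidth_bits_per_s * rtt_s / 8.0;
}
```

For instance, a 100 Mbit/s path with a 50 ms round trip needs roughly 625,000 bytes of buffer to stay full; both figures are made-up inputs for illustration.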

Data Access

• Fine-grain operations on large data sets
  – Partial file accesses
  – Database queries
  – Structured data formats (HDF)
  – Containers
• Current Grid data access services
  – Storage Resource Broker (SRB), UCSD
    • Container can hold logically connected files

[Figure: a container holding four data objects]

Talk Outline

• Wide area computing applications
  – Data oriented services
• Data streams: the missing link
  – dQUOB system for querying streaming data
• WAN performance results

Role of Topography on Tornado Formation

[Figure: raw Level 2 data over long time frames from three sites (Peachtree City, GA; Grier, SC; Hytop, AL). Labeled stages: convert data format and stream; exponential average; candidate substreams; archive; unneeded data.]

[Figure: streams from Peachtree City, GA; Hytop, AL; and Grier, SC shown as relations R1 to R4. Other labels: dispatcher, SQL queries, user functions.]

Relational Schema
R1: attrib1, attrib2, …
R2: attrib3, attrib4, …
R3: attrib5, attrib6, …
R4: attrib7, attrib8, …

SQL Queries to Extract, Transform and Filter Data Streams

• View data stream as set of relations (tables) in DBMS
• Data streams joined
  – Join on valid time (timestamp or logical time)
• Materialized views
  – New event streams created
• Queries embedded into data streams on-the-fly
• Statistics collected about data at runtime.
  – Sample data stream, build histograms
  – Use statistics to reoptimize queries
• Associate user supplied (mathematical) function with SQL query to strengthen transformation capability
  – e.g., FFT
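The idea of a query that filters a stream and hands surviving events to a user-supplied function can be sketched in C as follows. This is an illustration under stated assumptions, not dQUOB's API: the `Event` type, `run_query`, and the example predicate and function are all hypothetical, and a simple scaling function stands in for something like an FFT.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative event: one tuple of a stream viewed as a relation. */
typedef struct {
    unsigned long timestamp;  /* valid time */
    int level;                /* attribute visible to the query */
    double value;             /* payload sample */
} Event;

typedef int (*Predicate)(const Event *);   /* plays the role of WHERE */
typedef double (*UserFn)(double);          /* user-supplied function */

/* Applies "SELECT fn(value) FROM in WHERE pred" and writes the resulting
   new event stream to out; returns the number of events emitted. */
static size_t run_query(const Event *in, size_t n, Predicate pred, UserFn fn,
                        Event *out, size_t out_cap) {
    size_t emitted = 0;
    assert(in != NULL && pred != NULL && out != NULL);
    for (size_t i = 0; i < n && emitted < out_cap; i++) {
        if (!pred(&in[i]))
            continue;                              /* filtered out */
        out[emitted] = in[i];
        if (fn != NULL)
            out[emitted].value = fn(in[i].value);  /* transform */
        emitted++;
    }
    return emitted;
}

/* Example predicate and function, for illustration only. */
static int level_at_least_5(const Event *e) { return e->level >= 5; }
static double halve(double v) { return v / 2.0; }
```

The output array corresponds to the "new event streams created" bullet: downstream consumers see only the filtered, transformed events.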

Instantiation of a Query

[Figure: query instantiation, with steps numbered 1 to 6. Labeled components: repository, dQUOB server, script, command channel, event channel, provider, consumer, quoblet, preexisting event handler, action routines, dQUOB library, TCL interpreter, dQUOB runtime, compiled queries.]

ECho: Event Delivery Middleware

• Publish/subscribe model of event flow
• Receivers register for events that are pushed from a source
• Sender unaware of identity and number of receivers
• Binary data transmission, based on dynamically defined event formats
• Georgia Tech, Eisenhauer, Schwan
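The publish/subscribe flow above can be illustrated with a small in-process channel in C. This is a sketch of the pattern only and is not ECho's actual interface; `Channel`, `channel_subscribe`, `channel_publish`, and `sum_handler` are hypothetical names, and real middleware would deliver events over the network.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SUBSCRIBERS 8

typedef void (*Handler)(const void *event, void *client_data);

/* Illustrative event channel holding the registered receivers. */
typedef struct {
    Handler handlers[MAX_SUBSCRIBERS];
    void *client_data[MAX_SUBSCRIBERS];
    size_t count;
} Channel;

/* Registers a receiver; returns 0 on success, -1 if the channel is full. */
static int channel_subscribe(Channel *ch, Handler h, void *client_data) {
    assert(ch != NULL && h != NULL);
    if (ch->count == MAX_SUBSCRIBERS)
        return -1;
    ch->handlers[ch->count] = h;
    ch->client_data[ch->count] = client_data;
    ch->count++;
    return 0;
}

/* Pushes one event to every registered receiver and returns how many
   receivers were reached. The sender does not inspect who they are. */
static size_t channel_publish(const Channel *ch, const void *event) {
    assert(ch != NULL);
    for (size_t i = 0; i < ch->count; i++)
        ch->handlers[i](event, ch->client_data[i]);
    return ch->count;
}

/* Example receiver: adds the int event to a running total. */
static void sum_handler(const void *event, void *client_data) {
    *(long *)client_data += *(const int *)event;
}
```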


Current Research Issues

• Better memory utilization

• Integrated support for mathematical transformations

• Support for complex data types

Efficient Memory Utilization

• Stream arrival rates can vary vastly between one stream and another
• Detect stream arrival characteristics, adjust sliding window size
• Adjustments done within context of global memory usage

Relation R contains tuples a, b, …, l, …, z
Relation S contains tuples 1, 2, …, 5, …, 20

[Figure: streams R (tuples a to l shown) and S (tuples 1 to 5 shown), each with a sliding window]

Can detect that stream (relation) S is both slow and erratic, so reduce sliding window size.

Notes:
-- Joins are typically Cartesian Product.
-- Sliding window controls number of tuples participating in join.
-- Participating tuples must be retained in memory, thus consume resources.
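One way to adjust window sizes within a global memory budget is to share the budget in proportion to observed arrival rates, so a slow stream such as S keeps fewer tuples. The C sketch below assumes that proportional policy, which the slides do not specify; `size_windows` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Splits a global budget of retained tuples across streams.
   One tuple per stream is reserved; the remainder is shared in proportion
   to each stream's observed arrival rate (tuples per second), rounding
   down. If no arrivals were observed, the remainder is split evenly. */
static void size_windows(const double *rates, size_t n_streams,
                         size_t budget, size_t *window) {
    double total = 0.0;
    size_t spare;
    assert(rates != NULL && window != NULL);
    assert(n_streams > 0 && budget >= n_streams);
    spare = budget - n_streams;
    for (size_t i = 0; i < n_streams; i++)
        total += rates[i];
    for (size_t i = 0; i < n_streams; i++) {
        size_t share = (total > 0.0)
            ? (size_t)((double)spare * rates[i] / total)
            : spare / n_streams;
        window[i] = 1 + share;
    }
}
```

Because a join retains every tuple inside both windows, shrinking the window of the slow stream frees memory that the faster stream can use.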

Efficient memory utilization, cont.

[Figure: quoblet containing a dispatcher, event handlers, and a memory space; queries QM and QN; joins]

-- All tuples participating in joins are resident in common memory space.
-- Most tuples are either forwarded on, used as input for a new tuple (materialized view), or discarded.

Better support for large data

typedef struct EC_Data_ {
    int tid;                  /* task id */
    int tag;                  /* logical timestep */
    int aid;                  /* adaptation id */
    unsigned long timestamp;  /* event timestamp */
    double time_measure;
    int spec_nr;              /* species type */
    int lon_min;              /* longitude */
    int lon_max;
    int lon_count;
    int lat_min;              /* latitude */
    int lat_max;
    int lat_count;
    int level_min;            /* atmospheric 'level' */
    int level_max;
    int level_count;
    int values_size;
    float *values;
} EC_Data, *EC_DataPtr;

-- Attributes: fields over which queries can be expressed
-- Data: int values_size = 32768; 262K of data as vector
-- Event instance transported in binary network format using TCP

Problem: data not visible to query (though it is visible to user-defined function or ‘action’)

Better support for large data, cont.

Issues:
• Files too large: streaming on ‘record-by-record’ basis becomes impractical
• Ability to query data as well as attributes.
  – Attribute points to location in vector.

Solutions:
• DATALINK SQL data type
  • data type is pointer to file where data resides
  • vector replaced by pointer to file
• Object-relational representation

Integrate mathematical transformations

[Figure: quoblet containing a dispatcher, event handlers, and a memory space; a query (selects, projects, joins) alongside a user supplied function]

-- Results of user-supplied function available to query at next timestep

Talk Outline

• Wide area computing applications
  – Data oriented services
• Data streams: the missing link
  – dQUOB system for querying streaming data
• WAN performance results

Measuring Wide-area Performance

Workload: 540 events generated by global atmospheric transport model

Environment:
-- Georgia Tech: Sun Ultra 30 cluster, Solaris 7
-- Albuquerque High Performance Computing Center (AHPCC): Onyx 2 8 processor, Irix64 6.5
-- NCSA, Urbana-Champaign Illinois: Origin 2000 Array, 48 processor, Irix

Network: Abilene (2.4 Gbits per second)

[Figure: AHPCC, Albuquerque, NM; Georgia Tech, Atlanta, GA; and NCSA, Urbana, IL connected by Abilene at 2.4 Gbits per second]

-- Baseline case (LAN communication)
-- Query’s filtering capability progressively strengthens in response to changes in network environment

Pushing data transformation closer to source yields total duration times that are closer to baseline (LAN) time.
-- Variance due to traffic flow control in TCP

[Figure: total duration for three cases: filter at destination, filter at source, baseline]

[Figure: three cases compared: filter at source, baseline, filter at destination]

NCSA Origin appears to throttle process-to-process communication when both processes reside on the same machine.

Related Research

• SQL query processing; non-traditional application
  – Active Disks (U Maryland)
  – Continual Queries (Georgia Tech)
  – Snodgrass (U Arizona)
  – Eddies (UC Berkeley)
• Data stream computation
  – ABACUS (CMU)
  – Data Cutter (U Maryland)
  – Distributed Laboratories (Georgia Tech)

Summary

• The grid is an emerging computational and networking infrastructure providing pervasive, uniform, and reliable access to remote data, computational, sensor, and human resources.
• Grid data services include metadata, replication, information services, data movement, and data access.
• Data streams play an important role in grid applications and are the missing link.
• Our view of data streams as a database enables scientists to better manage data streams.
• dQUOB: an implementation of our approach.

http://www.cs.indiana.edu/~plale/projects/dQUOB