156
VLDB 2003 Berlin Grid Data Management Systems & Services Data Grid Management Systems – Part I Arun Jagatheesan, Reagan Moore Grid Services for Structured data –Part II Paul Watson, Norman Paton VLDB Tutorial Berlin, 2003

Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

  • Upload
    ledung

  • View
    214

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin

Grid Data Management Systems & Services

Data Grid Management Systems – Part IArun Jagatheesan, Reagan Moore

Grid Services for Structured data –Part IIPaul Watson, Norman Paton

VLDB TutorialBerlin, 2003

Page 2: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin

Part I:Data Grid Management Systems

Arun JagatheesanReagan Moore

{arun, moore}@sdsc.eduSan Diego Supercomputer CenterUniversity of California, San Diego

VLDB TutorialBerlin, 2003

http://www.npaci.edu/DICE/SRB/

Page 3: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 3

Tutorial Part I Outline• Concepts

• Introduction to Grid Computing• Proliferation of Data Grids• Data Grid Concepts

• Practice• Real life use cases SDSC Storage Resource Broker

(SRB)• Hands on Session

• Research• Active Datagrid Collections• Data Grid Management Systems (DGMS)• Open Research Issues

Page 4: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 4

Distributed Computing

© Images courtesy of Computer History Museum

Page 5: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 5

Distributed Data Management• Data collecting

• Sensor systems, object ring buffers and portals• Data organization

• Collections, manage data context• Data sharing

• Data grids, manage heterogeneity• Data publication

• Digital libraries, support discovery• Data preservation

• Persistent archives, manage technology evolution• Data analysis

• Processing pipelines, manage knowledge extraction

Page 6: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 6

What is a Grid?

“Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations”

Ian Foster, ANL

What is Middleware?Software that manages distributed state information for results of remote services

Reagan Moore, SDSC

Page 7: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 7

Data Grids• A datagrid provides the coordinated

management mechanisms for data distributed across remote resources.

• Data Grid• Coordinated sharing of information storage• Logical name space for location independent identifiers• Abstractions for storage repositories, information

repositories, and access APIs• Computing grid and the datagrid part of the Grid.

• Data generation versus data management

Page 8: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 8

Tutorial Part I Outline• Concepts

• Introduction to Grid Computing• Proliferation of Data Grids• Data Grid Concepts

• Practice• Real life use cases SDSC Storage Resource Broker

(SRB)• Hands on Session

• Research• Active Datagrid Collections• Data Grid Management Systems (DGMS)• Open Research Issues

Are data grids in production use?How are they

applied?

Page 9: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 9

Storage Resource Broker at SDSC

More features, 60 Terabytes and counting

Page 10: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 10

NSF Infrastructure Programs• Partnership for Advanced Computational Infrastructure -

PACI• Data grid - Storage Resource Broker

• Distributed Terascale Facility - DTF/ETF• Compute, storage, network resources

• Digital Library Initiative, Phase II - DLI2• Publication, discovery, access

• Information Technology Research projects - ITR• SCEC Southern California Earthquake Center• GEON GeoSciences Network• SEEK Science Environment for Ecological Knowledge• GriPhyN Grid Physics Network• NVO National Virtual Observatory

• National Middleware Initiative - NMI• Hardening of grid technology (security, job execution, grid services)

• National Science Digital Library - NSDL• Support for education curricula modules

Page 11: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 11

Federal Infrastructure Programs• NASA

• Information Power Grid - IPG• Advanced Data Grid - ADG• Data Management System - Data Assimilation Office

• Integration of DODS with Storage Resource Broker• Earth Observing Satellite EOS data pools • Consortium of Earth Observing Satellites CEOS data grid

• Library of Congress• National Digital Information Infrastructure and Preservation Program -

NDIIPP• National Archives and Records Administration (NARA) and

National Historical Public Records Commission• Prototype persistent archives

• NIH• Biomedical Informatics Research Network data grid

• DOE• Particle Physics Data Grid

Page 12: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 12

NSF GriPhyN/iVDGL

• Petabyte scale Virtual Data Grids • GriPhyN, iVDGL, PPDG – Trillium

• Grid Physics Network• International Virtual Data Grid Laboratory• Particle Physics Data Grid

• Distributed worldwide • Harness Petascale processing, data resources

• DataTAG – Transatlantic with European Side

Page 13: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 13

Tera Grid

• Launched in August 2001• SDSC, NCSA, ANL, CACR, PSC

• 20 Tera flops of computing power • One peta byte of storage• 40 Gb/sec (FASTEST network on planet)• “Building the Computational Infrastructure for

Tomorrow's Scientific Discovery”

Page 14: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 14

European Datagrid

• European Union• Different Communities

• High Energy Physics• Biology• Earth Science

• Collaborate and complement other European and US projects

Page 15: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 15

NIH BIRN

• Biomedical Informatics Research Network• Access and analyze biomedical image data• Data resources distributed throughout the country• Medical schools and research centers across the nation

• Stable high performance grid based environment• Coordinate sharing of virtual data collections and data

mining• Growing fast!

Page 16: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 16

Commonality in all these projects• Distributed data management

• Authenticity• Access controls• Curation

• Data sharing across administrative domains• Common name space for all registered digital entities

• Data publication • Browsing and discovery of data in collections

• Data Preservation• Management of technology evolution

Page 17: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 17

Data and Requirements• Mostly Unstructured, heterogeneous

• Images, Files, Semi-structured, databases, streams, …• File systems, SAN, FTP sites, Web servers, Archives

• Community-Based• Shared amongst one or more communities

• Meta-data• Different Meta-data schemas for the same data• Different Notations, ontologies

• Sensitive to Sharing• Nobel Prizes, Federal Agreements, project data

Page 18: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 18

Tutorial Part I Outline• Concepts

• Introduction to Grid Computing• Proliferation of Data Grids• Data Grid Concepts

• Practice• Real life use cases SDSC Storage Resource Broker

(SRB)• Hands on Session

• Research• Active Datagrid Collections• Data Grid Management Systems (DGMS)• Open Research Issues

Page 19: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 19

Data GridData Grid

Using a Data Grid – in Abstract

Ask for d

ata

•User asks for data from the data grid

Data delivered

•The data is found and returned•Where & how details are managed by data grid•But access controls are specified by owner

Page 20: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 20

Data Grid Transparencies• Find data without knowing the identifier

• Descriptive attributes• Access data without knowing the location

• Logical name space• Access data without knowing the type of storage

• Storage repository abstraction• Retrieve data using your preferred API

• Access abstraction• Provide transformations for any data collection

• Data behavior abstraction

Page 21: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 21

Logical Layers (bits,data,information,..)

Storage Resource Transparency

Storage Location Transparency

E:\srbVault\image.jpg /users/srbVault/image.jpg Select … from srb.mdas.td where...Data Identifier Transparency

image_0.jpg…image_100.jpgData Replica Transparency

image.sqlimage.cgi image.wsdlVirtual Data Transparency

Semantic data Organization (with behavior)patientRecordsCollectionmyActiveNeuroCollection

Inter-organizational Information

Storage Management

Page 22: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 22

Storage Resource Transparency (1)

• Storage repository abstraction • Archival systems, file systems, databases, FTP sites, …

• Logical resources • Combine physical resources into a logical set of resources• Hide the type and protocol of physical storage system• Load balancing – based on access patterns• Unlike DBMS, user is aware of logical resources• Flexibility to changes in mass storage technology

Page 23: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 23

Storage Resource Transparency (2)

• Standard operations at storage repositories • POSIX like operations on all resources

• Storage specific operations• Databases - bulk metadata access• Object ring buffers - object based access• Hierarchical resource managers - status and staging

requests

Page 24: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 24

Storage Location Transparency

• No fixed home or location for distributed data• Transparent of physical location and physical resource

• Virtualization of distributed data resources• Data placement managed by data grid

• Redundancy for preservation• Resource redundancy – “m of n” resources in list • Location redundancy – replicate at multiple locations

Page 25: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 25

Data Identifier Transparency • Four Types of Data Identifiers:• Unique name

• OID or handle • Descriptive name

• Descriptive attributes – meta data• Semantic access to data

• Collective name • Logical name space of a collection of data sets• Location independent

• Physical name• Physical location of resource and physical path of data

Page 26: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 26

Data Replica Transparency• Replication

• Improve access time• Improve reliability• Provide disaster backup and preservation• Physically or Semantically equivalent replicas

• Replica consistency• Synchronization across replicas on writes• Updates might use “m of n” or any other policy• Distributed locking across multiple sites

• Versions of files• Time-annotated snapshots of data

Page 27: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 27

Virtual Data Abstraction

• Virtual Data or “On Demand Data”• Created on demand is not already available• Recipe to create derived data• Grid based computation to create derived data product

• Object based storage (extended data operations)• Data subsetting at the remote storage repository• Data formatting at the remote storage repository• Metadata extraction at the remote storage repository• Bulk data manipulation at the remote storage repository

Page 28: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 28

Data Organization• Physical Organization of the data

• Distributed Data• Heterogeneous resources • Multiple formats (structured and unstructured)

• Logical Organization • Impose logical structure for data sets• Collections of semantically related data sets• Users create their own views (collections) of the data grid

• Digital Ontology• Characterization of structures in data sets and collections• Mapping of semantic labels to the structures

Page 29: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 29

Data Behavior Abstraction

• Loose coupling between data and behavior• Collection provides a structure to related data sets• Related data sets operated using a collective behavior• Any behavior (set of operations) is associated with a

collection• Data Grid Collections impose behavior

• Describe a generic standard behavior using WSDL• Each collection gets its specific behavior by extending the

generic behavior• Generic WSDL is extended using portType (or interface)

inheritance

Page 30: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 30

Tutorial Part I Outline• Concepts

• Introduction to Grid Computing• Proliferation of Data Grids• Data Grid Concepts

• Practice• Real life use cases SDSC Storage Resource Broker

(SRB)• Hands on Session

• Research• Active Datagrid Collections• Data Grid Management Systems (DGMS)• Open Research Issues

Page 31: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 31

SDSC SRB – The History

• Started in 1995 funded by DARPA• Massive Data Analysis System (MDAS)• PI: Reagan Moore• “Support data-intensive applications that manipulate very

large data sets by building upon object-relational database technology and archival storage technology”

• Multiple projects for many federal agencies• DoD, NSF, NARA, NIH, DoE, NLM, Library of Congress,

NASA• In production or evaluation at multiple academic and

research institutions round the world

Page 32: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 32

SDSC SRB Team - Data “R” Us :-)Camera-shy• Wayne Schroeder• Vicky Rowley (BIRN)• Lucas Gilbert • Marcio Faerman (SCEC)• Antoine De Torcy (IN2P3)• Students & emeritus

• Erik Vandekieft• Reena Mathew• Xi (Cynthia) Sheng• Allen Ding• Grace Lin• Qiao Xin• Daniel Moore• Ethan Chen

World’s first ‘datagrid engineer’?

Page 33: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 33

SDSC Collaborations• Hayden Planetarium Simulation &

Visualization • NVO -Digital Sky Project (NSF)• ASCI - Data Visualization Corridor

(DOE)• Particle Physics Data Grid (DOE)

{GrPhyN (NSF)}• Information Power Grid (NASA)• Biomedical Informatics Research

Network (NIH) • Knowledge Network for

BioComplexity (NSF)• Mol Science – JCSG, AfCS• Visual Embryo Project (NLM)

• RoadNet (NSF)• Earth System Sciences – CEED,

Bionome, SIO Explorer• Advanced Data Grid (NASA)• Hyper LTER • Grid Portal (NPACI)• Tera Scale Computing (NSF)• Long Term Archiving Project

(NARA)• Education – Transana (NPACI)• NSDL – National Science Digital

Library (NSF)• Digital Libraries – ADL, Stanford,

UMichigan, UBerkeley, CDL• … 31 additional collaborations

Page 34: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 34

Production Data Grid• SDSC Storage Resource Broker

• Federated client-server system, managing• Over 50 TBs of data at SDSC• Over 8.64 million files

• Manages data collections stored in• Archives (HPSS, UniTree, ADSM, DMF)• Hierarchical Resource Managers• Tapes, tape robots• File systems (Unix, Linux, Mac OS X, Windows)• FTP sites• Databases (Oracle, DB2, Postgres, SQLserver, Sybase,

Informix)• Virtual Object Ring Buffers

Page 35: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 35

SRBserver

SRB agent

SRBserver

Federated SRB server model

MCAT

Read Application

SRB agent

1

2

34

6

5

Logical NameOr

Attribute Condition

1.Logical-to-Physical mapping2.Identification of Replicas3.Access & Audit Control

Peer-to-peer Brokering

Server(s) SpawningData

Access

Parallel Data Access

R1 R2

5/6

Page 36: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 36

Logical Name SpaceExample - Hayden Planetarium

• Generate “fly-through” of the evolution of the solar system

• Access data distributed across multiple administration domains

• Gigabyte files, total data size was 7 TBytes• Very tight production schedule - 3 months

Page 37: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 37

Page 38: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 38

Hayden Data Flow

NCSA

SDSC

AMNHNYC

GPFS7.5 TB

IBM SP2

SGI

Production parameters, movies, images

data simulation

visualization

HPSS 7.5 TB

2.5 TB UniTree

UVa

NY

CalTech

BIRN

Page 39: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 39

Mappings on Name Space

• Define logical resource name• List of physical resources

• Replication• Write to logical resource completes when all physical

resources have a copy• Load balancing

• Write to a logical resource completes when copy exist on next physical resource in the list

• Fault tolerance• Write to a logical resource completes when copies exist on

“k” of “n” physical resources

Page 40: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 40

Hayden Conclusions

• The SRB was used as a logical central repository for all original, processed or rendered data.

• Location transparency crucial for data storage, data sharing and easy collaborations.

• SRB successfully used for a commercial project in “impossible” production deadline situation dictated by marketing department.

• Collaboration across sites made feasible with SRB

Page 41: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 41

Latency ManagementExample - ASCI - DOE

• Demonstrate the ability to load collections at terascale rates• Large number of digital entities• Terabyte sized data

• Optimize interactions with the HPSS High Performance Storage System• Server-initiated I/O• Parallel I/O

Page 42: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 42

SRB Latency Management

ReplicationServer-initiated I/O

StreamingParallel I/O

CachingClient-initiated I/O

Remote Proxies,Staging

Data AggregationContainers

SourceDestination

Prefetch

Network DestinationNetwork

Page 43: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 43

ASCI Small Files• Ingest a very large number of small files into

SRB• time consuming if the files are ingested one at a time

• Bulk ingestion to improve performance • Ingestion broken down into two parts

• the registration of files with MCAT• the I/O operations (file I/O and network data transfer)

• Multi-threading was used for both the registration and I/O operations.

• Sbload was created for this purpose.• reduced the ASCI benchmark time of ingesting ~2,100

files from ~2.5 hours to ~7 seconds.

Page 44: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 44

Latency ManagementExample - Digital Sky Project

• 2MASS (2 Micron All Sky Survey): • Bruce Berriman, IPAC, Caltech; John Good, IPAC,

Caltech, Wen-Piao Lee, IPAC, Caltech• NVO (National Virtual Observatory):

• Tom Prince, Caltech, Roy Williams CACR, Caltech, John Good, IPAC, Caltech

• SDSC – SRB :• Arcot Rajasekar, Mike Wan, George Kremenek, Reagan

Moore

Page 45: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 45

Digital Sky

Page 46: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 46

Digital Sky Data Ingestion

Informix

SUN

SRBSUN E10K

HPSS

….

800 GB

10 TB

SDSCIPAC CALTECH

input tapes from telescopes

star catalogData

Cache

Page 47: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 47

Digital Sky - 2MASS

• http://www.ipac.caltech.edu/2mass• The input data was originally written to DLT

tapes in the order seen by the telescope • 10 TBytes of data, 5 million files

• Ingestion took nearly 1.5 years - almost continuous reading of tapes retrieved from a closet, one at a time

• Images aggregated into 147,000 containers by SRB

Page 48: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 48

Containers• Images sorted by spatial location

• Retrieving one container accesses related images

• Minimizes impact on archive name space• HPSS stores 680 Tbytes in 17 million files

• Minimizes distribution of images across tapes

• Bulk unload by transport of containers

Page 49: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 49

Digital Sky – “Stars at finger tips”• Average 3000 images a day – web clients and

also as web service

InformixSUNs SRBSUN E10K

HPSS

….

800 GB

10 TB

SDSC

IPAC CALTECH

WEB

JPL

SUNs SGIsWEB

Page 50: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 50

Remote Proxies• Extract image cutout from Digital Palomar Sky

Survey• Image size 1 Gbyte• Shipped image to server for extracting cutout took 2-4

minutes (5-10 Mbytes/sec)• Remote proxy performed cutout directly on

storage repository• Extracted cutout by partial file reads• Image cutouts returned in 1-2 seconds

• Remote proxies are a mechanism to aggregate I/O commands

Page 51: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 51

Real-Time DataExample - RoadNet Project

• Manage interactions with a virtual object ring buffer

• Demonstrate federation of ORBs• Demonstrate integration of archives, VORBs and

file systems• Support queries on objects in VORBs

Page 52: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 52

VORBserver

VORB agent

VORBserver

Federated VORB Operation

VCAT

Get Sensor Data( from Boston)

VORB agent

1

2

4

5

Logical Name of the Sensorwiith Stream Characteristics

Contact VORB Catalog:1.Logical-to-Physical mapping

Physical Sensors Identified2. Identification of Replicas

ORB1 and ORB2 are identifiedas sources of reqd. data

3.Access & Audit Control

Automatically ContactORB2

Through VORB server

At Nome

Format Data and Transfer

R2

San Diego

NomeORB1

ORB2

3Check ORB1ORB1 is down

Check ORB2ORB2 is up. Get Data

6

Page 53: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 53

Information Abstraction Example - Data Assimilation Office

HSI has implemented metadata schema in SRB/MCAT

Origin: host, path, owner, uid, gid, perm_mask, [times]

Ingestion: date, user, user_email, comment

Generation: creator (name, uid, user, gid), host (name, arch, OS name & flags), compiler (name, version, flags), library, code (name, version), accounting data

Data description: title, version, discipline, project, language, measurements, keywords, sensor, source, prod. status, temporal/spatial coverage, location, resolution, quality

Fully compatible with GCMD

Page 54: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 54

DODS

SRBServer

WWWBrowser

UserApp

NativeFS DODS-

enableApp

DODS Libraries

LegacyApp

DODS/NetCDF

NetCDFApp

SRBClients

SRBClients

NativeFS

PBS Job

DMF

SRBServer

Unitree

SRB Middleware

Com

pute

Engi

ne

StorageServer

DMSServer

Des

ktop

Wor

ksta

tion

DesktopWorkstation

SRBClients

UserApp

NativeFS

StorageServer

Des

ktop

Wor

ksta

tion

DODS Access Environment Integration

Page 55: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 55

Data Grid Brick• Data grid to authenticate users, manage file

names, manage latency, federate systems• Intel Celeron 1.7 GHz CPU• SuperMicro P4SGA PCI Local bus ATX mainboard• 1 GB memory (266 MHz DDR DRAM)• 3Ware Escalade 7500-12 port PCI bus IDE RAID • 10 Western Digital Caviar 200-GB IDE disk drives• 3Com Etherlink 3C996B-T PCI bus 1000Base-T • Redstone RMC-4F2-7 4U ten bay ATX chassis• Linux operating system

• Cost is $2,200 per Tbyte plus tax• Gig-E network switch costs $500 per brick• Effective cost is about $2,700 per TByte

Page 56: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 56

Unix Shell

Java, NTBrowsers

SOAP/WSDL

GridFTP

SDSC Storage Resource Broker & Meta-data Catalog - Access Abstraction

ArchivesHPSS, ADSM,UniTree, DMF

DatabasesDB2, Oracle,

Postgres

File SystemsUnix, NT,Mac OSX

Application

HRM

AccessAPIs

Servers

Storage AbstractionCatalog Abstraction

DatabasesDB2, Oracle, Sybase, SQLServer, Informix

C, C++, Libraries

Logical Name Space

LatencyManagement

DataTransport

MetadataTransport

Consistency Management / Authorization-AuthenticationPrimeServer

Linux I/O

DLL /Python OAI

Page 57: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 57

Tutorial Part I Outline• Concepts

• Introduction to Grid Computing• Proliferation of Data Grids• Data Grid Concepts

• Practice• Real life use cases SDSC Storage Resource Broker

(SRB)• Hands on Session

• Research• Active Datagrid Collections• Data Grid Management Systems (DGMS)• Open Research Issues

Page 58: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 58

Page 59: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 59

Page 60: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 60

Page 61: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 61

Tutorial Part I Outline• Concepts

• Introduction to Grid Computing• Proliferation of Data Grids• Data Grid Concepts

• Practice• Real life use cases SDSC Storage Resource Broker

(SRB)• Hands on Session

• Research• Active Datagrid Collections• Data Grid Management Systems (DGMS)• Open Research Issues

Page 62: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 62

Datagrid Management System (DGMS)• DGMS manages

• State information of the datagrid collections (data)• Knowledge of events, rules and services (data behavior)• Collaborative communities (data users and resources)

• Differences from DBMS • Manages “community-owned” unstructured data along

with its behavior and inter-organizational resources• Logical organization has the (logical) resources where

the data be present (hidden in DBMS)• Basic unit = Active Datagrid Collection • Also uses concepts got from decades of DB Research

Page 63: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 63

DGMS Philosophy• Collective view of

• Inter-organizational data • Operations on datagrid space

• Local autonomy and global state consistency• Collaborative datagrid communities

• Multiple administrative domains or “Grid Zones”• Self-describing and self-manipulating data

• Horizontal and vertical behavior• Loose coupling between data and behavior (dynamically)• Relationships between a digital entity and its Physical

locations, Logical names, Meta-data, Access control, Behavior, “Grid Zones”.

Page 64: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 64

Active Datagrid Collections

SDSC

121.Event

Thit.xml

National Lab

getEvents()121.Event

Hits.sql

University of Gators

addEvent()

ResourcesData Sets

Behavior

Page 65: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 65

Active Datagrid Collections

Heterogeneous,distributed

physical data

SDSC

Dynamic or virtual data

121.Event

Thit.xml

National Lab

getEvents()121.Event

Hits.sql

University of Gators

addEvent()National Lab University of Gators

Page 66: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 66

Active Datagrid Collections

myHEP-Collection

SDSC

121.Event

Thit.xml

National Lab

121.EventHits.sql

University of Gators

Logical Collection gives location and

naming transparency

SDSC

Meta-data

Page 67: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 67

Active Datagrid Collections

myHEP-Collection

SDSC

121.Event

Thit.xml

National Lab

121.EventHits.sql

University of Gators

Now add behavior or services to this

logical collection

Meta-data

SDSC

Collection state and services

HorizontalServices

getEvents() addEvent()

Page 68: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 68

Active Datagrid Collections

myHEP-Collection

SDSC

121.Event

Thit.xml

National Lab

121.EventHits.sql

University of Gators

Meta-data

SDSC

Collection state and services

HorizontalServices

getEvents() addEvent()

ADC specificOperations + Model View

Controllers

ADC Logical view of data & operations

Page 69: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 69

Active Datagrid Collections

Digital entities

Meta-data

Services

State

Horizontal datagrid services and vertical domain specific services

Events, collective state, mappings to domain services to be invoked

Standardized schema with domain specific schema extensions

Physical and virtual data present in the datagrid

Page 70: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 70

Active Datagrid Collections

• Logical set consisting of related digital entities and references to their collective behavior for self-organization and manipulation of the data.

• Basic unit or data model managed in DGMS

Collections facilitate the transparencies and abstractions required to manage data in grids and inter-organizational enterprises

Page 71: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 71

DGMS• Datagrid Management System consists of a set

of services (protocols) and a hierarchical framework for:• Confluence of datagrid communities• Coordinated sharing of inter-organizational information

storage space and active datagrid collections

Page 72: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 72

Datagrid Broker

• A datagrid broker acts as an agent for an administrative domain in a DGMS framework.

• Datagrid communities • formed by confluence of datagrid brokers • Peer2peer network of brokers resulting in DGMS

• Datagrid brokers facilitate • sharing of services and data as components of active

datagrid collections in the datagrid.• Ensure the users in its domain are benefited by

participating in datagrid communities.

Page 73: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 73

DGMS and Datagrid Brokers

University of Gators - Physics

CMS GridPhysics Grid LHC

Grid

Florida Grid

Datagrid Broker

Datagrid Broker

Datagrid BrokerDGMS

Framework + Protocols for datagrid community

organization

Page 74: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 74

Datagrid Brokerage Protocols

University of Gators - Physics

Datagrid Broker

Datagrid Broker

Datagrid Broker

Datagrid Broker

Datagrid Joining Protocol

Super Broker

Florida Grid

Datagrid Broker

Datagrid Broker

Page 75: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 75

New-community member

Datagrid Brokerage Protocols

University of Gators - Physics

Datagrid Broker

Datagrid Broker

Datagrid Broker

Datagrid Broker

Super Broker

Page 76: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 76

Datagrid Brokerage Protocols (org)

• Organizing datagrid community • Managing the inter-organizational data• Datagrid Operations

• Converted into datagrid brokerage protocols • Protocols implemented as services by the datagrid brokers

• Hence, DGMS is nothing but these datagrid brokers which form these communities and the protocols (services) which operate on the collections

Page 77: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 77

Datagrid Brokerage Protocols (data)

University of Gators - Physics

Datagrid Broker

ADC

Datagrid Broker

merge()Datagrid Broker

Datagrid Broker

getEvents()

Super Broker

Active datagrid collection with

references to data and its behavior

Page 78: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 78

Need for Standard DGL

DatabaseSQL

121.Event Hits.sql

University of Gators121.Event

Thit.xml

National Lab

DGMSXML based,Invoke OperationsSubset XQuery

DGL

DDL,DML,DQL

Page 79: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 79

Data Grid Language• XML based asynchronous protocol

• Describe data sets, collections, datagrid operations, ... • Access and manage data grids, data flow pipelines• Query on data resource (based on W3C XQuery)

• Facilitates Grid Workflow• Sharing of granular state information about execution of

each datagrid operation amongst different processes or services

• Implementation Status• Reference Implementation by SDSC Matrix Project• On top of SRB protocol stack as W3C SOAP Web Service

Page 80: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 80

Data Grid Language• Datagrid Request

• Asynchronous requests for data/process-flow in datagrids• Requests are either a Transaction or a Status Query• Each Transaction consists of one or more Flows • Each Flow consists of one ore more datagrid operations• Datagrid operation = data transformation or data query• A flow can be executed sequential or parallel

• Datagrid Response• Either Transaction Acknowledgement or Status Response• Status Response contains the results of a Transaction

Page 81: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 81

Data ���� Discovery

Digital entities

Meta-data

Services

State

New data

updates relationships among data in collections

Services invoked to analyze new relationships

DGMS applications get notified of state updates

Page 82: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 82

Data ���� Discovery (Issues)• “More data; More discovery” Fermi National Lab • DGMS applications to automate knowledge

discovery• Work flow Management Systems (WfMS) subscribe to

updates in datagrid collections• Trigger like mechanisms on this large scale dynamic and

distributed data is a needed• Dynamic rule description and execution based on events• Semantic Mediation of datagrid collections • [SDSC Grid Enabled Mediation (GeMS)]

Page 83: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 83

DGMS Research Issues• Self-organization of datagrid communities

• Using knowledge relationships across the datagrids• Inter-datagrid operations based on semantics of data in

the communities (different ontologies)• High speed data transfer

• Terabyte to transfer - TCP/IP not final answer• Protocols, routers needed

• Latency Management• Data source speed >> data sink speed

• Datagrid Constraints • Data placement and scheduling

• How many replicas, where to place them…

Page 84: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 84

Other Grid Software• Legion (Avaki)

• Grid as a single virtual machine• Condor

• High Throughput Computing (HTC)• Globus

• Synonymous with Computational Grids• Data Handling Capabilities• GRAM (1000s), GridFTP (1000s), GSI, MDS, RLS, MCS

• Entropia• PC Grid, P2P

• IBM• SUN• HP

Page 85: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 85

Part I Summary• Grids are evolving

• coming soon to a domain near you• DGMS

• Coordinate collaborative management of inter-organizational information storage using Active Datagrid Collections

• Tools are available from research and academia. • Industry getting involved. • SDSC SRB provides abstraction mechanisms required to

implement data grids, digital libraries, persistent archives• Open Research issues for

• Distributed databases, Information management and Semantic web researchers

Page 86: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin

Part II:Grid Services for Structured Data

Paul WatsonUniversity of Newcastle, UK

[email protected] Paton

University of Manchester, [email protected]

Page 87: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 87

Part II a:Requirements & Grid Services

• Motivation• why is structured data important for Grid applications?• examples from Grid projects

• Requirements for Grid Data Services• Requirements generic across all services• Requirements specific to structured data services

• Grid Services• Brief history of Grid middleware• Open Grid Service Architecture

• Web Services• Grid Services

Page 88: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 88

Motivation

• Structured data is important to Grid applications• data• metadata

• Level of structure varies• Relational/Object DBs• XML• Structured files

Page 89: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 89

• Large quantities of biological data

• Different kinds of data• Data sources are scattered

and autonomous

• We will take examples from one project that is aiming to build a Grid application to support bioinformaticians: myGrid (mygrid.man.ac.uk)

Example: Bioinformatics

Page 90: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 90

Bioinformatics: myGrid project data

• Types of structured data• Biological data• Provenance

• Execution of biological workflows generates a provenance record• query to ask useful questions….

• what experiments were run on this data?• how was this data derived?• what are the top 20 bio-services used by my organisation?

• Annotation• users opinions on interpretation of data

Page 91: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 91

What does myGrid do with the data?

(Distributed)Query

processingWorkflows

Notification

Data &Metadata

• Biological databases and computations are represented as services

• (Distributed) Queries• extract information• integration of data from

distributed sources• Notification

• tell users when data/tools have changed

• Workflows• combine services

Page 92: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 92

Requirements

• A key driver for requirements is that applications combine data access with computation• Computation + Data

Page 93: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 93

myGrid Workflow example A Personal View onto the data Workflow

Metadata about

workflow

note aboutworkflow

Page 94: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 94

Categories of Requirements

• The need to integrate computation and data access means data services cannot be designed in isolation

• Therefore there are two categories of requirements…• those generic across all Grid services, allowing databases to be first-

class components• those specific to Grid-enabled data services, allowing databases to be

exploited in Grid applications• These are considered in turn…

Page 95: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 95

Generic Grid Requirements

• Allow compute and data components to be seamlessly combined, e.g.• high performance data transfer (more later)• scheduling (more later)• security• accounting• management

Page 96: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 96

Generic Grid requirements:High Performance Data Transfer

• A database may deliver huge amounts of data to a remote computation for analysis

• Requirements:• efficient communication of data

• flow control• efficient encoding• direct routing of results….

Page 97: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 97

Direct Routing of Results

Data ServiceClient

Analysis Service

1: Query

2: Result

3: ForwardResult

Data ServiceClient

Analysis Service

1: Query

2: Result

Inefficient

Efficient

Page 98: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 98

Generic Grid requirements: Scheduling

• co-scheduling data service with other components of a distributed application

• it would be helpful if data service provided cost estimates to assist scheduling, e.g.• for data-source selection (e.g. when replicas are available)• to estimate computational requirements (e.g. if we know a

query will generate R rows then can schedule space and time on consumer)

Page 99: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 99

Database Specific Requirements

• Grid applications will require at least the same functionality, tools and properties as other database applications• query, update, programming, indexing, integrity• availability, recovery, manageability, security• replication, versioning, evolution, archiving, change notification• concurrency control, transactions, bulk load

• It has taken huge amounts of effort to make these available in today’s database servers

• Conclusion: we must build on this effort by integrating existingdatabase servers into the Grid• starting from scratch isn’t an option.

Page 100: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 100

Grid-characteristic requirements

• Are there any requirements of grid-enabled databases that may require special attention?• performance

• scalability• unpredictability

• meta-data-driven access• federation

Page 101: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 101

Performance: scalability

• Data size• examples of large dbs

• Query execution time• example of large queries

• Concurrent Users• example of large number of concurrent users

Page 102: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 102

Performance: Unpredictability

• Need to support curiosity-driven access• c.f. pre-determined access in e-commerce

• Must manage unpredictable usage• avoid denial of service• share resources fairly

• Need to share db resources in a controlled way• cost estimation helpful• resource usage quotas• holistic approach required

• CPU, Memory, Disk, Network, DB Cache, Locks

Page 103: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 103

Metadata based access

• Users & applications may potentially have access to large numbers of data repositories

• Repositories can be selected via registries• Two stage access:

1. use metadata access to registries to find repository(s)• importance of metadata standards

2. access (and federate data) from repository(s)• importance of repository access standards• importance of federation…

Page 104: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 104

Federation

• Workflows allow us to manually combine data and computational services

• But, they need to be manually constructed• time consuming, error prone, inflexible

• Ideally there would be tools to federate data across the Grid• utilise dynamically acquired Grid resources for federation• distributed query processing

Page 105: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 105

Grid Services

• Grid applications are now being built within a service-based framework

• In this section we will consider• Brief history of Grid middleware• Open Grid Service Architecture

• Web Services• Grid Services

Page 106: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 106

Brief history of Grid middleware

• Until 2002, Grid middleware suffered from:• lack of standards

• Globus was a de facto standard rather than de jure standard• Also other approaches such as Unicore

• monolithic, so difficult to build and integrate components• lack of synergy with commercial middleware standards and tools

• Since 2001 there has been an attempt to address this:• Open Grid Services Architecture (OGSA)

• Key Text:• “The Physiology of the Grid, An Open Services Architecture for

Distributed Systems Integration”, I. Foster, C. Kesselman, J.M. Nick & S. Tuecke. June 2002 (available from www.globus.org)

Page 107: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 107

Why Services?

• Virtualisation of resources• computers, repositories, networks, programs, databases,

activities… can all be represented as services• Advantages

• standard interface definition• standard way to invoke service• local/remote invocation transparency• ability to hide diverse implementations & platforms behind

the interface• composition

Page 108: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 108

Web Services

• Grid Services are based on Web Services (WS)• WS allow the creation of components that offer a

set of operations• based on message exchange & XML

• Key WS standards define:• interfaces (WSDL)• message formats (SOAP)

Page 109: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 109

Transient Service Instances

• Web Services address the discovery and invocation of persistent services

• In Grids, many resources and activities will be created dynamically and may be short-lived, e.g.• a job running on a computer• a database session• data produced as the result of a query

• A key feature of OGSA is the introduction of transient service instances• created dynamically to encapsulate a transient resource or activity• created by factories• “Grid Service Instances”

Page 110: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 110

Stateful Consumer-Service Interactions

FactoryClient

1. Client asks a Factory to create a new Grid Service Instance

2. Factory creates Grid Service Instance and returns handle to client

FactoryClient

Grid Service Instance

3. Client can then call operations on the Grid Service Instance

Client

Grid Service Instance

Stateful Consumer-Service interactions can be implemented with Grid Service Instances

Grid Service InstanceGrid Service Instance

Grid Service InstanceGrid Service Instance

Factory Grid Service InstanceGrid Service Instance

Page 111: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 111

Other Grid Service Instances

• The last part of the tutorial will include examples of Grid Service Instances being created to represent• database sessions• data• computation on data

Page 112: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 112

Grid Service Instance Lifetimes

• Lifetime management provided• service, host and client (policy permitting) can kill GSI

• What if the client is in charge of lifetime but it fails, or the network fails, and kill request is never received?

• Soft-state management also provided• initial lifetime set when GSI created (can be at client

request)• client can extend this by request• when lifetime reached, GSI or host can kill GSI

Page 113: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 113

Accessing a Grid Service Instance

• Each GSI is assigned a globally unique name• the Grid Service Handle (GSH)

• protocol and network address independent• To access a service, this must be mapped onto a Grid

Service Reference (GSR)• protocol & network specific• therefore a factory returns a GSH and a GSR

• A GSR can have a finite lifetime or become invalid• re-resolution provided by a Handle-to-Reference mapper

• This 2-level naming opens the way for:• service upgrading• failover• scalability

Page 114: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 114

Exposing Service State

• Service Data Elements (SDEs)• Standard way for services to advertise and

expose state• Consumers can:

• find what state is exposed• query state• register to be notified of state changes

Page 115: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 115

Open Grid Services Infrastructure (OGSI)

• The basic interfaces and behaviours required by Grid Services are defined in the Open Grid Services Infrastructure (OGSI) specification• v1 submitted to the Global Grid Forum as a

recommendation track document (April 2003)• www.ggf.org/ogsi-wg

• The first major implementation was made available in June 2003 as Globus Toolkit 3 (GT3)• www.globus.org

Page 116: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 116

The OGSA Architecture

Page 117: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin

Part II b:The Design of Grid Data Services

Page 118: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 118

Grid Service Design/Development Status

• Many projects have developed data grid components and infrastructures.

• Relatively few projects have yet deployed OGSI in anger.• Individual projects are developing their own generic or

application specific services.• Certain of these activities are feeding into standards within

the Global Grid Forum (GGF).• The GGF OGSA Working Group and Data Area are

seeking to coordinate activity.

Page 119: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 119

Abstract Service Classification

Program ExecutionServices

Application Specific Services

Web Services

OGSI

Core Grid ServicesGrid Data Services

Global Grid Forum OGSA Working Group: www.gridforum.org

Page 120: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 120

Designing Grid Services

• Service design:• State – Service Data

Elements (SDEs).• Operations – WSDL

document.• Ancillary issues:

• Service lifetime.• Service granularity.• Operation granularity.• Extensibility.

• Core services for data management under development:• Data movement (RFT).• Data replication (GMR).• Database Access (OGSA-

DAI).• Distributed Querying (OGSA-

DQP).• There is no definitive

“wedding cake” for Data Grid Services.

Page 121: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 121

Reliable File Transfer

• RFT Features:• Built over GridFTP.• Persistent transfer state.• Error recovery.

• Service properties:• Service lifetime bounds

transfer duration. • Service handles a single job

at a time. • A job may transfer many files.

• Operations:• start JobDescription -> JobId.• cancel JobId.

• SDEs:• FileTransfer Status.• FileTransferProgress.• FileTransferRestartMarker.• Version.

R.K. Madduri, W.E. Allcock, Reliable File Transfer in Grid Environments, Proc. 27th IEEE Conference on Local Computer Networks (LCN), 737-738, 2002.

Page 122: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 122

Reliable File Transfer

RFT Service

RFT Factory DBMScreate

DBMS

Grid FTP

DBMS

Grid FTPdata

transfer

controlstorage

resource

M. Atkinson, A.L. Chervenak, P. Kunszt, I. Narang, N.W. Paton, D. Pearson, A. Shoshani, P. Watson, Data Access, Integration and Management, The Grid (2nd Edition), I. Fister, C. Kesselman (eds), 2003.

storage resource

Page 123: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 123

Grid Movement and Replication

• GMR Features:• Pluggable architecture:

scheduler, transport, adaptors.

• Service properties:• Service lifetime bounds

replication refresh. • Service handles many jobs at

a time. • A job may replicate many

items.

• Operations:• createReplicationJob,

updateReplicationJob, ...• SDEs:

• Job progress.• Job statistics.

A. Chokshi, S. Glanville, V Gogate, S. Jeffrey, C. Madsen, I. Narang, V. Raman, M Subramanian, OGSA Grid Data Movement and Replication Service, IBM Almaden, 2003.

Page 124: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 124

GMR Architecture

GMR Service

Client

Files

Application Grid Service

DBMS DB Service

Move/Control

Inform/Monitor Replica

Catalog

Request/Subscribe

Page 125: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 125

Example GMR Request

• Request specifies:• Source.• Target.• Requester.• Lifetime of replication.• Options on replication:

• Replace.• Update.• NewCopy.

• Scheduling information:• Scheduler name.• Arguments.

<GMRReplicationJobRequest><Source><DataGDR><simple_path_GDR><host>myHost1</host><path>/foo/file1</path>

</simple_path_GDR></DataGDR>

</Source><Target>...

</Target><RequestorInformation><UserName>Bob</UserName>

</RequestorInformation></GMRReplicationJobRequest>

Page 126: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 126

Grid Services for Structured Data

• General approach:• wrap database as a

service.• Aim to standardise

where possible• result format.• metadata:

• operations supported.• driver employed.• schema.

InterfaceCode

Ser

vice

Inte

rface

DBMS

Metadata

Accounting

Query

Notification

Bulk Loading

Scheduling

Services

Transaction

P. Watson, Databases and the Grid, in Grid Computing: Making the Global Infrastructure a Reality, F. Berman, G. Fox, T. Hey (eds), Wiley, 2003.

Page 127: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 127

Mapping Sessions to Services

• An important architectural issue is how to map client database sessions onto services.

• Options include:• one service per database:

• all clients connect to the same service.• natural for Web Services.• needs way to internally handle the state of multiple sessions.

• one service instance per session:• a new service instance is created for each new session.• natural for Grid Services.

Page 128: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 128

OGSA Data Access & Integration

• OGSA-DAI Features:• Wraps relational and XML

databases.• Compound requests.• Flexible delivery.

• Service properties:• Represents a user session.• User can have many concurrent

sessions.• Service actively processes one

request at a time.

• Operations:• perform Request -> Result.

• SDEs:• Schema.• DBMS.• Driver.• Stored requests.• Request progress.• ...

OGSA-DAI Software: http://www.ogsa-dai.org.uk

Page 129: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 129

Service Interactions

GFactory GDSF

G

Registry

GSGDSR

GClient

G

GDS GDSInstanceGDT DB

RegisterService

findServiceData

CreateService

perform(Query)

GGDT

Consumer

1

4

2

113

5

Page 130: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 130

Terminology

• GDSR:• Grid Database Service

Registry.• Searchable registry of

GDSFs.• GDSF:

• Grid Database Service Factory.

• Supports the creation of GDS instances.

• GS:• GridService portType.

• GDS:• Grid Database Service.• Supports a client session with

the database.• GDT:

• Grid Database Transport portType.

• Endpoint for service-to-service data pull/push requests.

Page 131: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 131

GDS PortTypes

GDS

GridServicePort

GridDataTransportPort

Client

GridDataServicePort

findServiceData

<service data>perform

<result id>

getFully

<result>

Page 132: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 132

Service Data Elements

• Sample SDEs:• dataResource.• driverType .• dataResourceManagementSystem.• database.• databaseSchema.• runningRequests.• definedRequests.• activitiesSupported.

Page 133: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 133

Example SDE

• Suppose database has a table:• Person (id int(11) primary key, …)

• Selection of Service Data Element:• <queryByServiceDataNames name=“databaseSchema”/>

• Sample of SDE document:• <databaseSchema dataResource=“mydatabase”>

<table name=“Person”><column name=“id” fullname=“Person.id” length=“11”>

<sqlType>INTEGER</sqlType></column>

… </table>

</databaseSchema>

Page 134: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 134

Perform Documents

• A GDS::perform operation takes a potentially complex document as input.

• The top-level elements indicate the action that should be taken:• Request defines a request for storage by the GDS.• Execute runs a stored request. • Terminate stops a running request.

• A Request element is a collection of linked activities, such as:• sqlQueryStatement specifies an SQL select statement.• sqlUpdateStatement specifies an SQL DML statement.• Data delivery descriptions that specify movement of data between the service

and some third party.

Page 135: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 135

Query With Direct Response

<gridDataServicePerform … ><request name="myRequest">

<sqlQueryStatementname="statement">

<dataResource>frogGenome</dataResource><expression>select * from chromFrag where len &lt 20000

</expression><webRowSetStream name="statementresponse"/>

</sqlQueryStatement>

<deliverToResponse name="d1"><fromLocal from="statementresponse"/>

</deliverToResponse>

</request></gridDataServicePerform>

Analyst GDS DB

XML Request

XML Response

Page 136: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 136

Query With Third Party Delivery

AnalystDB

XML Request

XML Response

Result File

GDS

deliverToGFTP

<gridDataServicePerform … ><request name=“myPushRequest">

<sqlQueryStatement name=“stmt"><dataResource>frogGenome</dataResource><expression>select * from chromFragwhere len &lt 20000

</expression><webRowSetStream name="statementresponse"/>

</sqlQueryStatement>

<deliverToGFTP name="d1"><fromLocal from="statementresponse"/><toGFTP host="ogsdai.org.uk"

port="8080" file="path/to/myfile.txt"/></deliverToGFTP>

</request></gridDataServicePerform>

Page 137: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 137

Update Pulling From Third Party

AnalystDB

XML Request

XML Response

OGSA Service

GDS

deliverFromGDT

<gridDataServicePerform><request name=“myPullRequest">

<deliverFromGDT name="d1"><fromGDT streamId="otherrequestasynch/d1"

mode="full">http://ogsadai.org.uk/GDTService/

my/GDT/GSH</fromGDT><toLocal name="datatoinsert"/>

</deliverFromGDT>

<sqlUpdateStatement name="statement"><sqlParameter position="1" from="datatoinsert"/><expression>insert into chromFrag values ?</expression>

</sqlUpdateStatement></request>

</gridDataServicePerform>

Page 138: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 138

Design Issues for GDS

• Service granularity:• Notions to capture:

• Installed DBMS.• Database within DBMS.• Session over database.• Request.• Query result set.

• Which should have:• PortType definitions.• Service Data Elements.

• Which should be:• Service instances.

• Operation granularity.• Atomic operations:

• Execute SQL Query.• Apply XSLT transformation.• Map directly to multi-

operation PortType.• Good for tooling.

• Compound operations:• Execute query, apply

transformation, deliver result.• Need to define orchestration

notation.• Fewer fine-grained calls.

Page 139: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 139

Standardisation

• Global Grid Forum:• Standards body for Grid Computing.• Open meetings three times a year.• www.gridforum.org.

• Database Access and Integration Services Working Group (DAIS-WG):• Sessions at each GGF meeting.• Intermediate telcons and face-to-face sessions.• Developing standard proposal.• http://www.gridforum.org/6_DATA/dais.htm

• OGSA-DAI 2.5 broadly implements the February 03 DAIS draft.

Page 140: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 140

Databases and the Grid

• Deploying databases:• Q: How are databases

made available on the Grid?

• A: Through integration with other Grid services, and provision of standard interfaces.

• DAIS GGF Working Group.

• Deploying database technologies:• Q: What database

technologies might be broadly useful as Grid services?

• A: transactions, materialised views, distributed query processing, …

Page 141: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 141

Distributed Query Processing

• DQP involves a single query referencing data stored at multiple sites.

• The locations of the data may be transparent to the author of the query.

select p.proteinId, Blast(p.sequence)from protein p, proteinTerm twhere t.termId = ‘GO:0005942’ and

p.proteinId = t.proteinId

J. Smith, A. Gounaris, P. Watson, N. Paton, A. Fernandes, R. Sakellariou, Distributed Query Processing on the Grid, 3rd Int. Workshop on Grid Computing, Springer-Verlag, 279-290, 2002.

Page 142: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 142

Service-Based DQP

• Service-based in two respects:• Queries are expressed over

services.• Grid database services.• Computational services.

• The distributed query processor it itself implemented as a collection of services.

OGSA-DAI

OGSA

Distributed Query Processor

Application/Presentation Layer

Configuration

Logging

Auditing

Policy

Security

Versioning

Accounting

OGSA-DQP Software: http://www.ogsa-dai.org.uk

Page 143: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 143

Service-based DQP

• DQP often exploits source wrappers.

• In service-based DQP, the sources are Web or Grid Services.

• A query may refer to database (GDS) and computational services.

• Workflow languages are often used for service orchestration.• Emerging standard:

BPEL4WS.• Procedural request

description.• Programmer controls order of

evaluation.

Page 144: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 144

Mutual Benefit

• The Grid needs DQP:• Declarative, high-level

resource integration with implicit parallelism.

• DQP-based solutions should in principle run raster than those manually coded.

• DQP needs the Grid:• Systematic access to remote

data and computational resources.

• Dynamic resource discovery and allocation.

Page 145: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 145

Phases in Query Processing

• Initialisation:• Grid Distributed Query Service is created.• Source descriptions imported.

• Query compilation:• Query optimisation and scheduling.

• Query evaluation:• Query evaluation services created as needed.• Query partitions allocated to evaluators for execution.

Page 146: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 146

Distributed Query Services

• Services:• GridDistributedQueryService (GDQS).• GridQueryEvaluationService (GQES).• Associated factories.

• Properties:• Both GDQS and GQES services implement the portTypes

of the Grid Database Service.• The GDQS is extended to support the importing of service

descriptions.

Page 147: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 147

Initialising a GDQS

GSH:GDQSF

findServiceData

create

importSchema(GSH:GDSF, ConfDoc)

findServiceData

GSH:GDSF

N2

N3

Create(ConfDoc)

findServiceDataDBSchemaGSH:GDS1

GSH:GDQS1

7

N1113

2

1

5

6

findServiceData ConfigDocs

4

G

Factory

GDSF

G

Registry

GSGDSR GFactory GDQSF

register

GGDQ GDQS1

G Registry GS

GDSR

GClient

GS

register

GGDS GDS1

GS

1

8

Page 148: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 148

Issues in Initialisation

• Q: When is a GDQS bound to a particular GDS?• A: When the schema of the

GDS is imported.• Q: What is the lifespan of

a GDS used by a GDQS?• A: The GDS is kept alive until

the GDQS expires.• Q: Are GDSs shared by

multiple GDQSs?• A: No.

• Q: When is a GQES created?• A: When a query is about to

be evaluated that needs it.• Q: What is the lifespan of

a GQES?• A: It lasts only as long as a

single query.• Q: Is a GQES shared

among several queries or GDQSs?• A: No.

Page 149: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 149

Query Compilation

LogicalOptimiser

PhysicalOptimiser

Partitioner Scheduler

Evaluator

OQLParser

Single-nodeOptimiser

Multi-nodeoptimiser

Page 150: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 150

Scheduling

• Partitions are allocated to Grid nodes; partitions may be merged during scheduling.

• Expressed by parallel algebra expression.

• Heuristic algorithm considers memory use, network costs.

table_scan(protein)

table_scantermID=...

(proteinTerm)

reduce reduce

hash_join

(proteinId)

op_call(Blast)

reduce

exchange exchange

exchange

3,4

1

1 2

Page 151: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 151

Query Evaluation

• Query installation:• GQESs created for partitions as required.• Partitions sent to GQESs.

• Query evaluation:• Partitions evaluated using iterator model.• Pipelined and partitioned parallelism.• Results conveyed to client.

Page 152: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 152

GFactory GQESF

GFactory GQESF

GFactory GQESF

N2

N1

N3

GClientGGDS

GGDS

GDQ

GDT

GDQS

N0GDS

GFactory GQESF

N4

perform(Query)1

createService

createService2

createService

2

2

GGDS GQES2

GDT

GGDS GQES3

GDT

GGDS GQES1

GDT

GGDS GQES1

GDT

perform(QuerySubplan)

perform(QuerySubplan)

perf orm(Q

ueryS ubpla n)

3

sequential_scan

reduce (proteinID,sequence)

sequential_scan (term=8372)

reduce (proteinID)

hash_join(p.proteinID=t.proteinID)

3

operation_callblast(p.sequence)

reduce (p.proteinID, blast)

operation_callblast(p.sequence)

reduce (p.proteinID, blast)

3

Web Services (BLAST)

results

results

results

4

1144

Page 153: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 153

Features of OGSA-DQP

• Low cost of entry:• Imports source descriptions through GDSs.• Imports service descriptions as WSDL.

• Throw-away GDQS:• Import sources on a task-specific basis.• Discard GDQS when task completed.

• Builds on parallel database technology:• Implicit parallelism.• Pipelined + partitioned parallel evaluation.

• Integrates data access with operation invocation.

Page 154: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 154

Related Work on Database Services

• Web services interfaces to databases (e.g.):• K. Mensah, Web Services Enable Your Database, Web

Services Journal, Vol 3, No 4, 2003.• S. Malaika et al., DB2 and Web Services, IBM Systems

Journal, Vol 41, No 4, 666-685, 2002.• Distributed querying using services:

• T. Malik and A. Szalay, SkyQuery: A Web Service Approach to Federate Databases, Proc. CIDR, http://www-db.cs.wisc.edu/cidr/program/p17.pdf, 2003.

• R. Braumandl, et al., ObjectGlobe: Ubiquitous query processing on the Internet, VLDB J., Vol 10, 48-71, 2001.

Page 155: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 155

Grid Data Management Services

• Data Grid applications can benefit from many lower level services:• Data movement.• Replication.• Database access and integration.

• Work is underway on designing, developing and standardising many core Grid Data Management services.

• Designing services in a dynamic and heterogeneous environment is non-trivial, and there is much still to be done.

Page 156: Grid Services for Structured data –Part II Arun ...ceick/6340/vldb03-grid.pdf · Grid Services for Structured data –Part II ... Proliferation of Data Grids ... Information Storage

VLDB 2003 Berlin 156

Outstanding Research Issues

• Adaptability.• Cost modelling.• Data encoding.• Data placement.• Caching and replication.• Glide-in databases.• Management of Grid

Resources.

• Orchestration.• Quality of service.• Scheduling.• Security.• Service description.• Service frameworks.