Upload
ledung
View
214
Download
2
Embed Size (px)
Citation preview
VLDB 2003 Berlin
Grid Data Management Systems & Services
Data Grid Management Systems – Part IArun Jagatheesan, Reagan Moore
Grid Services for Structured data –Part IIPaul Watson, Norman Paton
VLDB TutorialBerlin, 2003
VLDB 2003 Berlin
Part I:Data Grid Management Systems
Arun JagatheesanReagan Moore
{arun, moore}@sdsc.eduSan Diego Supercomputer CenterUniversity of California, San Diego
VLDB TutorialBerlin, 2003
http://www.npaci.edu/DICE/SRB/
VLDB 2003 Berlin 3
Tutorial Part I Outline• Concepts
• Introduction to Grid Computing• Proliferation of Data Grids• Data Grid Concepts
• Practice• Real life use cases SDSC Storage Resource Broker
(SRB)• Hands on Session
• Research• Active Datagrid Collections• Data Grid Management Systems (DGMS)• Open Research Issues
VLDB 2003 Berlin 4
Distributed Computing
© Images courtesy of Computer History Museum
VLDB 2003 Berlin 5
Distributed Data Management• Data collecting
• Sensor systems, object ring buffers and portals• Data organization
• Collections, manage data context• Data sharing
• Data grids, manage heterogeneity• Data publication
• Digital libraries, support discovery• Data preservation
• Persistent archives, manage technology evolution• Data analysis
• Processing pipelines, manage knowledge extraction
VLDB 2003 Berlin 6
What is a Grid?
“Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations”
Ian Foster, ANL
What is Middleware?Software that manages distributed state information for results of remote services
Reagan Moore, SDSC
VLDB 2003 Berlin 7
Data Grids• A datagrid provides the coordinated
management mechanisms for data distributed across remote resources.
• Data Grid• Coordinated sharing of information storage• Logical name space for location independent identifiers• Abstractions for storage repositories, information
repositories, and access APIs• Computing grid and the datagrid part of the Grid.
• Data generation versus data management
VLDB 2003 Berlin 8
Tutorial Part I Outline• Concepts
• Introduction to Grid Computing• Proliferation of Data Grids• Data Grid Concepts
• Practice• Real life use cases SDSC Storage Resource Broker
(SRB)• Hands on Session
• Research• Active Datagrid Collections• Data Grid Management Systems (DGMS)• Open Research Issues
Are data grids in production use?How are they
applied?
VLDB 2003 Berlin 9
Storage Resource Broker at SDSC
More features, 60 Terabytes and counting
VLDB 2003 Berlin 10
NSF Infrastructure Programs• Partnership for Advanced Computational Infrastructure -
PACI• Data grid - Storage Resource Broker
• Distributed Terascale Facility - DTF/ETF• Compute, storage, network resources
• Digital Library Initiative, Phase II - DLI2• Publication, discovery, access
• Information Technology Research projects - ITR• SCEC Southern California Earthquake Center• GEON GeoSciences Network• SEEK Science Environment for Ecological Knowledge• GriPhyN Grid Physics Network• NVO National Virtual Observatory
• National Middleware Initiative - NMI• Hardening of grid technology (security, job execution, grid services)
• National Science Digital Library - NSDL• Support for education curricula modules
VLDB 2003 Berlin 11
Federal Infrastructure Programs• NASA
• Information Power Grid - IPG• Advanced Data Grid - ADG• Data Management System - Data Assimilation Office
• Integration of DODS with Storage Resource Broker• Earth Observing Satellite EOS data pools • Consortium of Earth Observing Satellites CEOS data grid
• Library of Congress• National Digital Information Infrastructure and Preservation Program -
NDIIPP• National Archives and Records Administration (NARA) and
National Historical Public Records Commission• Prototype persistent archives
• NIH• Biomedical Informatics Research Network data grid
• DOE• Particle Physics Data Grid
VLDB 2003 Berlin 12
NSF GriPhyN/iVDGL
• Petabyte scale Virtual Data Grids • GriPhyN, iVDGL, PPDG – Trillium
• Grid Physics Network• International Virtual Data Grid Laboratory• Particle Physics Data Grid
• Distributed worldwide • Harness Petascale processing, data resources
• DataTAG – Transatlantic with European Side
VLDB 2003 Berlin 13
Tera Grid
• Launched in August 2001• SDSC, NCSA, ANL, CACR, PSC
• 20 Tera flops of computing power • One peta byte of storage• 40 Gb/sec (FASTEST network on planet)• “Building the Computational Infrastructure for
Tomorrow's Scientific Discovery”
VLDB 2003 Berlin 14
European Datagrid
• European Union• Different Communities
• High Energy Physics• Biology• Earth Science
• Collaborate and complement other European and US projects
VLDB 2003 Berlin 15
NIH BIRN
• Biomedical Informatics Research Network• Access and analyze biomedical image data• Data resources distributed throughout the country• Medical schools and research centers across the nation
• Stable high performance grid based environment• Coordinate sharing of virtual data collections and data
mining• Growing fast!
VLDB 2003 Berlin 16
Commonality in all these projects• Distributed data management
• Authenticity• Access controls• Curation
• Data sharing across administrative domains• Common name space for all registered digital entities
• Data publication • Browsing and discovery of data in collections
• Data Preservation• Management of technology evolution
VLDB 2003 Berlin 17
Data and Requirements• Mostly Unstructured, heterogeneous
• Images, Files, Semi-structured, databases, streams, …• File systems, SAN, FTP sites, Web servers, Archives
• Community-Based• Shared amongst one or more communities
• Meta-data• Different Meta-data schemas for the same data• Different Notations, ontologies
• Sensitive to Sharing• Nobel Prizes, Federal Agreements, project data
VLDB 2003 Berlin 18
Tutorial Part I Outline• Concepts
• Introduction to Grid Computing• Proliferation of Data Grids• Data Grid Concepts
• Practice• Real life use cases SDSC Storage Resource Broker
(SRB)• Hands on Session
• Research• Active Datagrid Collections• Data Grid Management Systems (DGMS)• Open Research Issues
VLDB 2003 Berlin 19
Data GridData Grid
Using a Data Grid – in Abstract
Ask for d
ata
•User asks for data from the data grid
Data delivered
•The data is found and returned•Where & how details are managed by data grid•But access controls are specified by owner
VLDB 2003 Berlin 20
Data Grid Transparencies• Find data without knowing the identifier
• Descriptive attributes• Access data without knowing the location
• Logical name space• Access data without knowing the type of storage
• Storage repository abstraction• Retrieve data using your preferred API
• Access abstraction• Provide transformations for any data collection
• Data behavior abstraction
VLDB 2003 Berlin 21
Logical Layers (bits,data,information,..)
Storage Resource Transparency
Storage Location Transparency
E:\srbVault\image.jpg /users/srbVault/image.jpg Select … from srb.mdas.td where...Data Identifier Transparency
image_0.jpg…image_100.jpgData Replica Transparency
image.sqlimage.cgi image.wsdlVirtual Data Transparency
Semantic data Organization (with behavior)patientRecordsCollectionmyActiveNeuroCollection
Inter-organizational Information
Storage Management
VLDB 2003 Berlin 22
Storage Resource Transparency (1)
• Storage repository abstraction • Archival systems, file systems, databases, FTP sites, …
• Logical resources • Combine physical resources into a logical set of resources• Hide the type and protocol of physical storage system• Load balancing – based on access patterns• Unlike DBMS, user is aware of logical resources• Flexibility to changes in mass storage technology
VLDB 2003 Berlin 23
Storage Resource Transparency (2)
• Standard operations at storage repositories • POSIX like operations on all resources
• Storage specific operations• Databases - bulk metadata access• Object ring buffers - object based access• Hierarchical resource managers - status and staging
requests
VLDB 2003 Berlin 24
Storage Location Transparency
• No fixed home or location for distributed data• Transparent of physical location and physical resource
• Virtualization of distributed data resources• Data placement managed by data grid
• Redundancy for preservation• Resource redundancy – “m of n” resources in list • Location redundancy – replicate at multiple locations
VLDB 2003 Berlin 25
Data Identifier Transparency • Four Types of Data Identifiers:• Unique name
• OID or handle • Descriptive name
• Descriptive attributes – meta data• Semantic access to data
• Collective name • Logical name space of a collection of data sets• Location independent
• Physical name• Physical location of resource and physical path of data
VLDB 2003 Berlin 26
Data Replica Transparency• Replication
• Improve access time• Improve reliability• Provide disaster backup and preservation• Physically or Semantically equivalent replicas
• Replica consistency• Synchronization across replicas on writes• Updates might use “m of n” or any other policy• Distributed locking across multiple sites
• Versions of files• Time-annotated snapshots of data
VLDB 2003 Berlin 27
Virtual Data Abstraction
• Virtual Data or “On Demand Data”• Created on demand is not already available• Recipe to create derived data• Grid based computation to create derived data product
• Object based storage (extended data operations)• Data subsetting at the remote storage repository• Data formatting at the remote storage repository• Metadata extraction at the remote storage repository• Bulk data manipulation at the remote storage repository
VLDB 2003 Berlin 28
Data Organization• Physical Organization of the data
• Distributed Data• Heterogeneous resources • Multiple formats (structured and unstructured)
• Logical Organization • Impose logical structure for data sets• Collections of semantically related data sets• Users create their own views (collections) of the data grid
• Digital Ontology• Characterization of structures in data sets and collections• Mapping of semantic labels to the structures
VLDB 2003 Berlin 29
Data Behavior Abstraction
• Loose coupling between data and behavior• Collection provides a structure to related data sets• Related data sets operated using a collective behavior• Any behavior (set of operations) is associated with a
collection• Data Grid Collections impose behavior
• Describe a generic standard behavior using WSDL• Each collection gets its specific behavior by extending the
generic behavior• Generic WSDL is extended using portType (or interface)
inheritance
VLDB 2003 Berlin 30
Tutorial Part I Outline• Concepts
• Introduction to Grid Computing• Proliferation of Data Grids• Data Grid Concepts
• Practice• Real life use cases SDSC Storage Resource Broker
(SRB)• Hands on Session
• Research• Active Datagrid Collections• Data Grid Management Systems (DGMS)• Open Research Issues
VLDB 2003 Berlin 31
SDSC SRB – The History
• Started in 1995 funded by DARPA• Massive Data Analysis System (MDAS)• PI: Reagan Moore• “Support data-intensive applications that manipulate very
large data sets by building upon object-relational database technology and archival storage technology”
• Multiple projects for many federal agencies• DoD, NSF, NARA, NIH, DoE, NLM, Library of Congress,
NASA• In production or evaluation at multiple academic and
research institutions round the world
VLDB 2003 Berlin 32
SDSC SRB Team - Data “R” Us :-)Camera-shy• Wayne Schroeder• Vicky Rowley (BIRN)• Lucas Gilbert • Marcio Faerman (SCEC)• Antoine De Torcy (IN2P3)• Students & emeritus
• Erik Vandekieft• Reena Mathew• Xi (Cynthia) Sheng• Allen Ding• Grace Lin• Qiao Xin• Daniel Moore• Ethan Chen
World’s first ‘datagrid engineer’?
VLDB 2003 Berlin 33
SDSC Collaborations• Hayden Planetarium Simulation &
Visualization • NVO -Digital Sky Project (NSF)• ASCI - Data Visualization Corridor
(DOE)• Particle Physics Data Grid (DOE)
{GrPhyN (NSF)}• Information Power Grid (NASA)• Biomedical Informatics Research
Network (NIH) • Knowledge Network for
BioComplexity (NSF)• Mol Science – JCSG, AfCS• Visual Embryo Project (NLM)
• RoadNet (NSF)• Earth System Sciences – CEED,
Bionome, SIO Explorer• Advanced Data Grid (NASA)• Hyper LTER • Grid Portal (NPACI)• Tera Scale Computing (NSF)• Long Term Archiving Project
(NARA)• Education – Transana (NPACI)• NSDL – National Science Digital
Library (NSF)• Digital Libraries – ADL, Stanford,
UMichigan, UBerkeley, CDL• … 31 additional collaborations
VLDB 2003 Berlin 34
Production Data Grid• SDSC Storage Resource Broker
• Federated client-server system, managing• Over 50 TBs of data at SDSC• Over 8.64 million files
• Manages data collections stored in• Archives (HPSS, UniTree, ADSM, DMF)• Hierarchical Resource Managers• Tapes, tape robots• File systems (Unix, Linux, Mac OS X, Windows)• FTP sites• Databases (Oracle, DB2, Postgres, SQLserver, Sybase,
Informix)• Virtual Object Ring Buffers
VLDB 2003 Berlin 35
SRBserver
SRB agent
SRBserver
Federated SRB server model
MCAT
Read Application
SRB agent
1
2
34
6
5
Logical NameOr
Attribute Condition
1.Logical-to-Physical mapping2.Identification of Replicas3.Access & Audit Control
Peer-to-peer Brokering
Server(s) SpawningData
Access
Parallel Data Access
R1 R2
5/6
VLDB 2003 Berlin 36
Logical Name SpaceExample - Hayden Planetarium
• Generate “fly-through” of the evolution of the solar system
• Access data distributed across multiple administration domains
• Gigabyte files, total data size was 7 TBytes• Very tight production schedule - 3 months
VLDB 2003 Berlin 37
VLDB 2003 Berlin 38
Hayden Data Flow
NCSA
SDSC
AMNHNYC
GPFS7.5 TB
IBM SP2
SGI
Production parameters, movies, images
data simulation
visualization
HPSS 7.5 TB
2.5 TB UniTree
UVa
NY
CalTech
BIRN
VLDB 2003 Berlin 39
Mappings on Name Space
• Define logical resource name• List of physical resources
• Replication• Write to logical resource completes when all physical
resources have a copy• Load balancing
• Write to a logical resource completes when copy exist on next physical resource in the list
• Fault tolerance• Write to a logical resource completes when copies exist on
“k” of “n” physical resources
VLDB 2003 Berlin 40
Hayden Conclusions
• The SRB was used as a logical central repository for all original, processed or rendered data.
• Location transparency crucial for data storage, data sharing and easy collaborations.
• SRB successfully used for a commercial project in “impossible” production deadline situation dictated by marketing department.
• Collaboration across sites made feasible with SRB
VLDB 2003 Berlin 41
Latency ManagementExample - ASCI - DOE
• Demonstrate the ability to load collections at terascale rates• Large number of digital entities• Terabyte sized data
• Optimize interactions with the HPSS High Performance Storage System• Server-initiated I/O• Parallel I/O
VLDB 2003 Berlin 42
SRB Latency Management
ReplicationServer-initiated I/O
StreamingParallel I/O
CachingClient-initiated I/O
Remote Proxies,Staging
Data AggregationContainers
SourceDestination
Prefetch
Network DestinationNetwork
VLDB 2003 Berlin 43
ASCI Small Files• Ingest a very large number of small files into
SRB• time consuming if the files are ingested one at a time
• Bulk ingestion to improve performance • Ingestion broken down into two parts
• the registration of files with MCAT• the I/O operations (file I/O and network data transfer)
• Multi-threading was used for both the registration and I/O operations.
• Sbload was created for this purpose.• reduced the ASCI benchmark time of ingesting ~2,100
files from ~2.5 hours to ~7 seconds.
VLDB 2003 Berlin 44
Latency ManagementExample - Digital Sky Project
• 2MASS (2 Micron All Sky Survey): • Bruce Berriman, IPAC, Caltech; John Good, IPAC,
Caltech, Wen-Piao Lee, IPAC, Caltech• NVO (National Virtual Observatory):
• Tom Prince, Caltech, Roy Williams CACR, Caltech, John Good, IPAC, Caltech
• SDSC – SRB :• Arcot Rajasekar, Mike Wan, George Kremenek, Reagan
Moore
VLDB 2003 Berlin 45
Digital Sky
VLDB 2003 Berlin 46
Digital Sky Data Ingestion
Informix
SUN
SRBSUN E10K
HPSS
….
800 GB
10 TB
SDSCIPAC CALTECH
input tapes from telescopes
star catalogData
Cache
VLDB 2003 Berlin 47
Digital Sky - 2MASS
• http://www.ipac.caltech.edu/2mass• The input data was originally written to DLT
tapes in the order seen by the telescope • 10 TBytes of data, 5 million files
• Ingestion took nearly 1.5 years - almost continuous reading of tapes retrieved from a closet, one at a time
• Images aggregated into 147,000 containers by SRB
VLDB 2003 Berlin 48
Containers• Images sorted by spatial location
• Retrieving one container accesses related images
• Minimizes impact on archive name space• HPSS stores 680 Tbytes in 17 million files
• Minimizes distribution of images across tapes
• Bulk unload by transport of containers
VLDB 2003 Berlin 49
Digital Sky – “Stars at finger tips”• Average 3000 images a day – web clients and
also as web service
InformixSUNs SRBSUN E10K
HPSS
….
800 GB
10 TB
SDSC
IPAC CALTECH
WEB
JPL
SUNs SGIsWEB
VLDB 2003 Berlin 50
Remote Proxies• Extract image cutout from Digital Palomar Sky
Survey• Image size 1 Gbyte• Shipped image to server for extracting cutout took 2-4
minutes (5-10 Mbytes/sec)• Remote proxy performed cutout directly on
storage repository• Extracted cutout by partial file reads• Image cutouts returned in 1-2 seconds
• Remote proxies are a mechanism to aggregate I/O commands
VLDB 2003 Berlin 51
Real-Time DataExample - RoadNet Project
• Manage interactions with a virtual object ring buffer
• Demonstrate federation of ORBs• Demonstrate integration of archives, VORBs and
file systems• Support queries on objects in VORBs
VLDB 2003 Berlin 52
VORBserver
VORB agent
VORBserver
Federated VORB Operation
VCAT
Get Sensor Data( from Boston)
VORB agent
1
2
4
5
Logical Name of the Sensorwiith Stream Characteristics
Contact VORB Catalog:1.Logical-to-Physical mapping
Physical Sensors Identified2. Identification of Replicas
ORB1 and ORB2 are identifiedas sources of reqd. data
3.Access & Audit Control
Automatically ContactORB2
Through VORB server
At Nome
Format Data and Transfer
R2
San Diego
NomeORB1
ORB2
3Check ORB1ORB1 is down
Check ORB2ORB2 is up. Get Data
6
VLDB 2003 Berlin 53
Information Abstraction Example - Data Assimilation Office
HSI has implemented metadata schema in SRB/MCAT
Origin: host, path, owner, uid, gid, perm_mask, [times]
Ingestion: date, user, user_email, comment
Generation: creator (name, uid, user, gid), host (name, arch, OS name & flags), compiler (name, version, flags), library, code (name, version), accounting data
Data description: title, version, discipline, project, language, measurements, keywords, sensor, source, prod. status, temporal/spatial coverage, location, resolution, quality
Fully compatible with GCMD
VLDB 2003 Berlin 54
DODS
SRBServer
WWWBrowser
UserApp
NativeFS DODS-
enableApp
DODS Libraries
LegacyApp
DODS/NetCDF
NetCDFApp
SRBClients
SRBClients
NativeFS
PBS Job
DMF
SRBServer
Unitree
SRB Middleware
Com
pute
Engi
ne
StorageServer
DMSServer
Des
ktop
Wor
ksta
tion
DesktopWorkstation
SRBClients
UserApp
NativeFS
StorageServer
Des
ktop
Wor
ksta
tion
DODS Access Environment Integration
VLDB 2003 Berlin 55
Data Grid Brick• Data grid to authenticate users, manage file
names, manage latency, federate systems• Intel Celeron 1.7 GHz CPU• SuperMicro P4SGA PCI Local bus ATX mainboard• 1 GB memory (266 MHz DDR DRAM)• 3Ware Escalade 7500-12 port PCI bus IDE RAID • 10 Western Digital Caviar 200-GB IDE disk drives• 3Com Etherlink 3C996B-T PCI bus 1000Base-T • Redstone RMC-4F2-7 4U ten bay ATX chassis• Linux operating system
• Cost is $2,200 per Tbyte plus tax• Gig-E network switch costs $500 per brick• Effective cost is about $2,700 per TByte
VLDB 2003 Berlin 56
Unix Shell
Java, NTBrowsers
SOAP/WSDL
GridFTP
SDSC Storage Resource Broker & Meta-data Catalog - Access Abstraction
ArchivesHPSS, ADSM,UniTree, DMF
DatabasesDB2, Oracle,
Postgres
File SystemsUnix, NT,Mac OSX
Application
HRM
AccessAPIs
Servers
Storage AbstractionCatalog Abstraction
DatabasesDB2, Oracle, Sybase, SQLServer, Informix
C, C++, Libraries
Logical Name Space
LatencyManagement
DataTransport
MetadataTransport
Consistency Management / Authorization-AuthenticationPrimeServer
Linux I/O
DLL /Python OAI
VLDB 2003 Berlin 57
Tutorial Part I Outline• Concepts
• Introduction to Grid Computing• Proliferation of Data Grids• Data Grid Concepts
• Practice• Real life use cases SDSC Storage Resource Broker
(SRB)• Hands on Session
• Research• Active Datagrid Collections• Data Grid Management Systems (DGMS)• Open Research Issues
VLDB 2003 Berlin 58
VLDB 2003 Berlin 59
VLDB 2003 Berlin 60
VLDB 2003 Berlin 61
Tutorial Part I Outline• Concepts
• Introduction to Grid Computing• Proliferation of Data Grids• Data Grid Concepts
• Practice• Real life use cases SDSC Storage Resource Broker
(SRB)• Hands on Session
• Research• Active Datagrid Collections• Data Grid Management Systems (DGMS)• Open Research Issues
VLDB 2003 Berlin 62
Datagrid Management System (DGMS)• DGMS manages
• State information of the datagrid collections (data)• Knowledge of events, rules and services (data behavior)• Collaborative communities (data users and resources)
• Differences from DBMS • Manages “community-owned” unstructured data along
with its behavior and inter-organizational resources• Logical organization has the (logical) resources where
the data be present (hidden in DBMS)• Basic unit = Active Datagrid Collection • Also uses concepts got from decades of DB Research
VLDB 2003 Berlin 63
DGMS Philosophy• Collective view of
• Inter-organizational data • Operations on datagrid space
• Local autonomy and global state consistency• Collaborative datagrid communities
• Multiple administrative domains or “Grid Zones”• Self-describing and self-manipulating data
• Horizontal and vertical behavior• Loose coupling between data and behavior (dynamically)• Relationships between a digital entity and its Physical
locations, Logical names, Meta-data, Access control, Behavior, “Grid Zones”.
VLDB 2003 Berlin 64
Active Datagrid Collections
SDSC
121.Event
Thit.xml
National Lab
getEvents()121.Event
Hits.sql
University of Gators
addEvent()
ResourcesData Sets
Behavior
VLDB 2003 Berlin 65
Active Datagrid Collections
Heterogeneous,distributed
physical data
SDSC
Dynamic or virtual data
121.Event
Thit.xml
National Lab
getEvents()121.Event
Hits.sql
University of Gators
addEvent()National Lab University of Gators
VLDB 2003 Berlin 66
Active Datagrid Collections
myHEP-Collection
SDSC
121.Event
Thit.xml
National Lab
121.EventHits.sql
University of Gators
Logical Collection gives location and
naming transparency
SDSC
Meta-data
VLDB 2003 Berlin 67
Active Datagrid Collections
myHEP-Collection
SDSC
121.Event
Thit.xml
National Lab
121.EventHits.sql
University of Gators
Now add behavior or services to this
logical collection
Meta-data
SDSC
Collection state and services
HorizontalServices
getEvents() addEvent()
VLDB 2003 Berlin 68
Active Datagrid Collections
myHEP-Collection
SDSC
121.Event
Thit.xml
National Lab
121.EventHits.sql
University of Gators
Meta-data
SDSC
Collection state and services
HorizontalServices
getEvents() addEvent()
ADC specificOperations + Model View
Controllers
ADC Logical view of data & operations
VLDB 2003 Berlin 69
Active Datagrid Collections
Digital entities
Meta-data
Services
State
Horizontal datagrid services and vertical domain specific services
Events, collective state, mappings to domain services to be invoked
Standardized schema with domain specific schema extensions
Physical and virtual data present in the datagrid
VLDB 2003 Berlin 70
Active Datagrid Collections
• Logical set consisting of related digital entities and references to their collective behavior for self-organization and manipulation of the data.
• Basic unit or data model managed in DGMS
Collections facilitate the transparencies and abstractions required to manage data in grids and inter-organizational enterprises
VLDB 2003 Berlin 71
DGMS• Datagrid Management System consists of a set
of services (protocols) and a hierarchical framework for:• Confluence of datagrid communities• Coordinated sharing of inter-organizational information
storage space and active datagrid collections
VLDB 2003 Berlin 72
Datagrid Broker
• A datagrid broker acts as an agent for an administrative domain in a DGMS framework.
• Datagrid communities • formed by confluence of datagrid brokers • Peer2peer network of brokers resulting in DGMS
• Datagrid brokers facilitate • sharing of services and data as components of active
datagrid collections in the datagrid.• Ensure the users in its domain are benefited by
participating in datagrid communities.
VLDB 2003 Berlin 73
DGMS and Datagrid Brokers
University of Gators - Physics
CMS GridPhysics Grid LHC
Grid
Florida Grid
Datagrid Broker
Datagrid Broker
Datagrid BrokerDGMS
Framework + Protocols for datagrid community
organization
VLDB 2003 Berlin 74
Datagrid Brokerage Protocols
University of Gators - Physics
Datagrid Broker
Datagrid Broker
Datagrid Broker
Datagrid Broker
Datagrid Joining Protocol
Super Broker
Florida Grid
Datagrid Broker
Datagrid Broker
VLDB 2003 Berlin 75
New-community member
Datagrid Brokerage Protocols
University of Gators - Physics
Datagrid Broker
Datagrid Broker
Datagrid Broker
Datagrid Broker
Super Broker
VLDB 2003 Berlin 76
Datagrid Brokerage Protocols (org)
• Organizing datagrid community • Managing the inter-organizational data• Datagrid Operations
• Converted into datagrid brokerage protocols • Protocols implemented as services by the datagrid brokers
• Hence, DGMS is nothing but these datagrid brokers which form these communities and the protocols (services) which operate on the collections
VLDB 2003 Berlin 77
Datagrid Brokerage Protocols (data)
University of Gators - Physics
Datagrid Broker
ADC
Datagrid Broker
merge()Datagrid Broker
Datagrid Broker
getEvents()
Super Broker
Active datagrid collection with
references to data and its behavior
VLDB 2003 Berlin 78
Need for Standard DGL
DatabaseSQL
121.Event Hits.sql
University of Gators121.Event
Thit.xml
National Lab
DGMSXML based,Invoke OperationsSubset XQuery
DGL
DDL,DML,DQL
VLDB 2003 Berlin 79
Data Grid Language• XML based asynchronous protocol
• Describe data sets, collections, datagrid operations, ... • Access and manage data grids, data flow pipelines• Query on data resource (based on W3C XQuery)
• Facilitates Grid Workflow• Sharing of granular state information about execution of
each datagrid operation amongst different processes or services
• Implementation Status• Reference Implementation by SDSC Matrix Project• On top of SRB protocol stack as W3C SOAP Web Service
VLDB 2003 Berlin 80
Data Grid Language• Datagrid Request
• Asynchronous requests for data/process-flow in datagrids• Requests are either a Transaction or a Status Query• Each Transaction consists of one or more Flows • Each Flow consists of one ore more datagrid operations• Datagrid operation = data transformation or data query• A flow can be executed sequential or parallel
• Datagrid Response• Either Transaction Acknowledgement or Status Response• Status Response contains the results of a Transaction
VLDB 2003 Berlin 81
Data ���� Discovery
Digital entities
Meta-data
Services
State
New data
updates relationships among data in collections
Services invoked to analyze new relationships
DGMS applications get notified of state updates
VLDB 2003 Berlin 82
Data ���� Discovery (Issues)• “More data; More discovery” Fermi National Lab • DGMS applications to automate knowledge
discovery• Work flow Management Systems (WfMS) subscribe to
updates in datagrid collections• Trigger like mechanisms on this large scale dynamic and
distributed data is a needed• Dynamic rule description and execution based on events• Semantic Mediation of datagrid collections • [SDSC Grid Enabled Mediation (GeMS)]
VLDB 2003 Berlin 83
DGMS Research Issues• Self-organization of datagrid communities
• Using knowledge relationships across the datagrids• Inter-datagrid operations based on semantics of data in
the communities (different ontologies)• High speed data transfer
• Terabyte to transfer - TCP/IP not final answer• Protocols, routers needed
• Latency Management• Data source speed >> data sink speed
• Datagrid Constraints • Data placement and scheduling
• How many replicas, where to place them…
VLDB 2003 Berlin 84
Other Grid Software• Legion (Avaki)
• Grid as a single virtual machine• Condor
• High Throughput Computing (HTC)• Globus
• Synonymous with Computational Grids• Data Handling Capabilities• GRAM (1000s), GridFTP (1000s), GSI, MDS, RLS, MCS
• Entropia• PC Grid, P2P
• IBM• SUN• HP
VLDB 2003 Berlin 85
Part I Summary• Grids are evolving
• coming soon to a domain near you• DGMS
• Coordinate collaborative management of inter-organizational information storage using Active Datagrid Collections
• Tools are available from research and academia. • Industry getting involved. • SDSC SRB provides abstraction mechanisms required to
implement data grids, digital libraries, persistent archives• Open Research issues for
• Distributed databases, Information management and Semantic web researchers
VLDB 2003 Berlin
Part II:Grid Services for Structured Data
Paul WatsonUniversity of Newcastle, UK
[email protected] Paton
University of Manchester, [email protected]
VLDB 2003 Berlin 87
Part II a:Requirements & Grid Services
• Motivation• why is structured data important for Grid applications?• examples from Grid projects
• Requirements for Grid Data Services• Requirements generic across all services• Requirements specific to structured data services
• Grid Services• Brief history of Grid middleware• Open Grid Service Architecture
• Web Services• Grid Services
VLDB 2003 Berlin 88
Motivation
• Structured data is important to Grid applications• data• metadata
• Level of structure varies• Relational/Object DBs• XML• Structured files
VLDB 2003 Berlin 89
• Large quantities of biological data
• Different kinds of data• Data sources are scattered
and autonomous
• We will take examples from one project that is aiming to build a Grid application to support bioinformaticians: myGrid (mygrid.man.ac.uk)
Example: Bioinformatics
VLDB 2003 Berlin 90
Bioinformatics: myGrid project data
• Types of structured data• Biological data• Provenance
• Execution of biological workflows generates a provenance record• query to ask useful questions….
• what experiments were run on this data?• how was this data derived?• what are the top 20 bio-services used by my organisation?
• Annotation• users opinions on interpretation of data
VLDB 2003 Berlin 91
What does myGrid do with the data?
(Distributed)Query
processingWorkflows
Notification
Data &Metadata
• Biological databases and computations are represented as services
• (Distributed) Queries• extract information• integration of data from
distributed sources• Notification
• tell users when data/tools have changed
• Workflows• combine services
VLDB 2003 Berlin 92
Requirements
• A key driver for requirements is that applications combine data access with computation• Computation + Data
VLDB 2003 Berlin 93
myGrid Workflow example A Personal View onto the data Workflow
Metadata about
workflow
note aboutworkflow
VLDB 2003 Berlin 94
Categories of Requirements
• The need to integrate computation and data access means data services cannot be designed in isolation
• Therefore there are two categories of requirements…• those generic across all Grid services, allowing databases to be first-
class components• those specific to Grid-enabled data services, allowing databases to be
exploited in Grid applications• These are considered in turn…
VLDB 2003 Berlin 95
Generic Grid Requirements
• Allow compute and data components to be seamlessly combined, e.g.• high performance data transfer (more later)• scheduling (more later)• security• accounting• management
VLDB 2003 Berlin 96
Generic Grid requirements:High Performance Data Transfer
• A database may deliver huge amounts of data to a remote computation for analysis
• Requirements:• efficient communication of data
• flow control• efficient encoding• direct routing of results….
VLDB 2003 Berlin 97
Direct Routing of Results
Data ServiceClient
Analysis Service
1: Query
2: Result
3: ForwardResult
Data ServiceClient
Analysis Service
1: Query
2: Result
Inefficient
Efficient
VLDB 2003 Berlin 98
Generic Grid requirements: Scheduling
• co-scheduling data service with other components of a distributed application
• it would be helpful if data service provided cost estimates to assist scheduling, e.g.• for data-source selection (e.g. when replicas are available)• to estimate computational requirements (e.g. if we know a
query will generate R rows then can schedule space and time on consumer)
VLDB 2003 Berlin 99
Database Specific Requirements
• Grid applications will require at least the same functionality, tools and properties as other database applications• query, update, programming, indexing, integrity• availability, recovery, manageability, security• replication, versioning, evolution, archiving, change notification• concurrency control, transactions, bulk load
• It has taken huge amounts of effort to make these available in today’s database servers
• Conclusion: we must build on this effort by integrating existingdatabase servers into the Grid• starting from scratch isn’t an option.
VLDB 2003 Berlin 100
Grid-characteristic requirements
• Are there any requirements of grid-enabled databases that may require special attention?• performance
• scalability• unpredictability
• meta-data-driven access• federation
VLDB 2003 Berlin 101
Performance: scalability
• Data size• examples of large dbs
• Query execution time• example of large queries
• Concurrent Users• example of large number of concurrent users
VLDB 2003 Berlin 102
Performance: Unpredictability
• Need to support curiosity-driven access• c.f. pre-determined access in e-commerce
• Must manage unpredictable usage• avoid denial of service• share resources fairly
• Need to share db resources in a controlled way• cost estimation helpful• resource usage quotas• holistic approach required
• CPU, Memory, Disk, Network, DB Cache, Locks
VLDB 2003 Berlin 103
Metadata based access
• Users & applications may potentially have access to large numbers of data repositories
• Repositories can be selected via registries• Two stage access:
1. use metadata access to registries to find repository(s)• importance of metadata standards
2. access (and federate data) from repository(s)• importance of repository access standards• importance of federation…
VLDB 2003 Berlin 104
Federation
• Workflows allow us to manually combine data and computational services
• But, they need to be manually constructed• time consuming, error prone, inflexible
• Ideally there would be tools to federate data across the Grid• utilise dynamically acquired Grid resources for federation• distributed query processing
VLDB 2003 Berlin 105
Grid Services
• Grid applications are now being built within a service-based framework
• In this section we will consider• Brief history of Grid middleware• Open Grid Service Architecture
• Web Services• Grid Services
VLDB 2003 Berlin 106
Brief history of Grid middleware
• Until 2002, Grid middleware suffered from:• lack of standards
• Globus was a de facto standard rather than de jure standard• Also other approaches such as Unicore
• monolithic, so difficult to build and integrate components• lack of synergy with commercial middleware standards and tools
• Since 2001 there has been an attempt to address this:• Open Grid Services Architecture (OGSA)
• Key Text:• “The Physiology of the Grid, An Open Services Architecture for
Distributed Systems Integration”, I. Foster, C. Kesselman, J.M. Nick & S. Tuecke. June 2002 (available from www.globus.org)
VLDB 2003 Berlin 107
Why Services?
• Virtualisation of resources• computers, repositories, networks, programs, databases,
activities… can all be represented as services• Advantages
• standard interface definition• standard way to invoke service• local/remote invocation transparency• ability to hide diverse implementations & platforms behind
the interface• composition
VLDB 2003 Berlin 108
Web Services
• Grid Services are based on Web Services (WS)• WS allow the creation of components that offer a
set of operations• based on message exchange & XML
• Key WS standards define:• interfaces (WSDL)• message formats (SOAP)
VLDB 2003 Berlin 109
Transient Service Instances
• Web Services address the discovery and invocation of persistent services
• In Grids, many resources and activities will be created dynamically and may be short-lived, e.g.• a job running on a computer• a database session• data produced as the result of a query
• A key feature of OGSA is the introduction of transient service instances• created dynamically to encapsulate a transient resource or activity• created by factories• “Grid Service Instances”
VLDB 2003 Berlin 110
Stateful Consumer-Service Interactions
FactoryClient
1. Client asks a Factory to create a new Grid Service Instance
2. Factory creates Grid Service Instance and returns handle to client
FactoryClient
Grid Service Instance
3. Client can then call operations on the Grid Service Instance
Client
Grid Service Instance
Stateful Consumer-Service interactions can be implemented with Grid Service Instances
Grid Service InstanceGrid Service Instance
Grid Service InstanceGrid Service Instance
Factory Grid Service InstanceGrid Service Instance
VLDB 2003 Berlin 111
Other Grid Service Instances
• The last part of the tutorial will include examples of Grid Service Instances being created to represent• database sessions• data• computation on data
VLDB 2003 Berlin 112
Grid Service Instance Lifetimes
• Lifetime management provided• service, host and client (policy permitting) can kill GSI
• What if the client is in charge of lifetime but it fails, or the network fails, and kill request is never received?
• Soft-state management also provided• initial lifetime set when GSI created (can be at client
request)• client can extend this by request• when lifetime reached, GSI or host can kill GSI
VLDB 2003 Berlin 113
Accessing a Grid Service Instance
• Each GSI is assigned a globally unique name• the Grid Service Handle (GSH)
• protocol and network address independent• To access a service, this must be mapped onto a Grid
Service Reference (GSR)• protocol & network specific• therefore a factory returns a GSH and a GSR
• A GSR can have a finite lifetime or become invalid• re-resolution provided by a Handle-to-Reference mapper
• This 2-level naming opens the way for:• service upgrading• failover• scalability
VLDB 2003 Berlin 114
Exposing Service State
• Service Data Elements (SDEs)• Standard way for services to advertise and
expose state• Consumers can:
• find what state is exposed• query state• register to be notified of state changes
VLDB 2003 Berlin 115
Open Grid Services Infrastructure (OGSI)
• The basic interfaces and behaviours required by Grid Services are defined in the Open Grid Services Infrastructure (OGSI) specification• v1 submitted to the Global Grid Forum as a
recommendation track document (April 2003)• www.ggf.org/ogsi-wg
• The first major implementation was made available in June 2003 as Globus Toolkit 3 (GT3)• www.globus.org
VLDB 2003 Berlin 116
The OGSA Architecture
VLDB 2003 Berlin
Part II b:The Design of Grid Data Services
VLDB 2003 Berlin 118
Grid Service Design/Development Status
• Many projects have developed data grid components and infrastructures.
• Relatively few projects have yet deployed OGSI in anger.• Individual projects are developing their own generic or
application specific services.• Certain of these activities are feeding into standards within
the Global Grid Forum (GGF).• The GGF OGSA Working Group and Data Area are
seeking to coordinate activity.
VLDB 2003 Berlin 119
Abstract Service Classification
Program ExecutionServices
Application Specific Services
Web Services
OGSI
Core Grid ServicesGrid Data Services
Global Grid Forum OGSA Working Group: www.gridforum.org
VLDB 2003 Berlin 120
Designing Grid Services
• Service design:• State – Service Data
Elements (SDEs).• Operations – WSDL
document.• Ancillary issues:
• Service lifetime.• Service granularity.• Operation granularity.• Extensibility.
• Core services for data management under development:• Data movement (RFT).• Data replication (GMR).• Database Access (OGSA-
DAI).• Distributed Querying (OGSA-
DQP).• There is no definitive
“wedding cake” for Data Grid Services.
VLDB 2003 Berlin 121
Reliable File Transfer
• RFT Features:• Built over GridFTP.• Persistent transfer state.• Error recovery.
• Service properties:• Service lifetime bounds
transfer duration. • Service handles a single job
at a time. • A job may transfer many files.
• Operations:• start JobDescription -> JobId.• cancel JobId.
• SDEs:• FileTransfer Status.• FileTransferProgress.• FileTransferRestartMarker.• Version.
R.K. Madduri, W.E. Allcock, Reliable File Transfer in Grid Environments, Proc. 27th IEEE Conference on Local Computer Networks (LCN), 737-738, 2002.
VLDB 2003 Berlin 122
Reliable File Transfer
RFT Service
RFT Factory DBMScreate
DBMS
Grid FTP
DBMS
Grid FTPdata
transfer
controlstorage
resource
M. Atkinson, A.L. Chervenak, P. Kunszt, I. Narang, N.W. Paton, D. Pearson, A. Shoshani, P. Watson, Data Access, Integration and Management, The Grid (2nd Edition), I. Fister, C. Kesselman (eds), 2003.
storage resource
VLDB 2003 Berlin 123
Grid Movement and Replication
• GMR Features:• Pluggable architecture:
scheduler, transport, adaptors.
• Service properties:• Service lifetime bounds
replication refresh. • Service handles many jobs at
a time. • A job may replicate many
items.
• Operations:• createReplicationJob,
updateReplicationJob, ...• SDEs:
• Job progress.• Job statistics.
A. Chokshi, S. Glanville, V Gogate, S. Jeffrey, C. Madsen, I. Narang, V. Raman, M Subramanian, OGSA Grid Data Movement and Replication Service, IBM Almaden, 2003.
VLDB 2003 Berlin 124
GMR Architecture
GMR Service
Client
Files
Application Grid Service
DBMS DB Service
Move/Control
Inform/Monitor Replica
Catalog
Request/Subscribe
VLDB 2003 Berlin 125
Example GMR Request
• Request specifies:• Source.• Target.• Requester.• Lifetime of replication.• Options on replication:
• Replace.• Update.• NewCopy.
• Scheduling information:• Scheduler name.• Arguments.
<GMRReplicationJobRequest><Source><DataGDR><simple_path_GDR><host>myHost1</host><path>/foo/file1</path>
</simple_path_GDR></DataGDR>
</Source><Target>...
</Target><RequestorInformation><UserName>Bob</UserName>
</RequestorInformation></GMRReplicationJobRequest>
VLDB 2003 Berlin 126
Grid Services for Structured Data
• General approach:• wrap database as a
service.• Aim to standardise
where possible• result format.• metadata:
• operations supported.• driver employed.• schema.
InterfaceCode
Ser
vice
Inte
rface
DBMS
Metadata
Accounting
Query
Notification
Bulk Loading
Scheduling
Services
Transaction
P. Watson, Databases and the Grid, in Grid Computing: Making the Global Infrastructure a Reality, F. Berman, G. Fox, T. Hey (eds), Wiley, 2003.
VLDB 2003 Berlin 127
Mapping Sessions to Services
• An important architectural issue is how to map client database sessions onto services.
• Options include:• one service per database:
• all clients connect to the same service.• natural for Web Services.• needs way to internally handle the state of multiple sessions.
• one service instance per session:• a new service instance is created for each new session.• natural for Grid Services.
VLDB 2003 Berlin 128
OGSA Data Access & Integration
• OGSA-DAI Features:• Wraps relational and XML
databases.• Compound requests.• Flexible delivery.
• Service properties:• Represents a user session.• User can have many concurrent
sessions.• Service actively processes one
request at a time.
• Operations:• perform Request -> Result.
• SDEs:• Schema.• DBMS.• Driver.• Stored requests.• Request progress.• ...
OGSA-DAI Software: http://www.ogsa-dai.org.uk
VLDB 2003 Berlin 129
Service Interactions
GFactory GDSF
G
Registry
GSGDSR
GClient
G
GDS GDSInstanceGDT DB
RegisterService
findServiceData
CreateService
perform(Query)
GGDT
Consumer
1
4
2
113
5
VLDB 2003 Berlin 130
Terminology
• GDSR:• Grid Database Service
Registry.• Searchable registry of
GDSFs.• GDSF:
• Grid Database Service Factory.
• Supports the creation of GDS instances.
• GS:• GridService portType.
• GDS:• Grid Database Service.• Supports a client session with
the database.• GDT:
• Grid Database Transport portType.
• Endpoint for service-to-service data pull/push requests.
VLDB 2003 Berlin 131
GDS PortTypes
GDS
GridServicePort
GridDataTransportPort
Client
GridDataServicePort
findServiceData
<service data>perform
<result id>
getFully
<result>
VLDB 2003 Berlin 132
Service Data Elements
• Sample SDEs:• dataResource.• driverType .• dataResourceManagementSystem.• database.• databaseSchema.• runningRequests.• definedRequests.• activitiesSupported.
VLDB 2003 Berlin 133
Example SDE
• Suppose database has a table:• Person (id int(11) primary key, …)
• Selection of Service Data Element:• <queryByServiceDataNames name=“databaseSchema”/>
• Sample of SDE document:• <databaseSchema dataResource=“mydatabase”>
<table name=“Person”><column name=“id” fullname=“Person.id” length=“11”>
<sqlType>INTEGER</sqlType></column>
… </table>
</databaseSchema>
VLDB 2003 Berlin 134
Perform Documents
• A GDS::perform operation takes a potentially complex document as input.
• The top-level elements indicate the action that should be taken:• Request defines a request for storage by the GDS.• Execute runs a stored request. • Terminate stops a running request.
• A Request element is a collection of linked activities, such as:• sqlQueryStatement specifies an SQL select statement.• sqlUpdateStatement specifies an SQL DML statement.• Data delivery descriptions that specify movement of data between the service
and some third party.
VLDB 2003 Berlin 135
Query With Direct Response
<gridDataServicePerform … ><request name="myRequest">
<sqlQueryStatementname="statement">
<dataResource>frogGenome</dataResource><expression>select * from chromFrag where len < 20000
</expression><webRowSetStream name="statementresponse"/>
</sqlQueryStatement>
<deliverToResponse name="d1"><fromLocal from="statementresponse"/>
</deliverToResponse>
</request></gridDataServicePerform>
Analyst GDS DB
XML Request
XML Response
VLDB 2003 Berlin 136
Query With Third Party Delivery
AnalystDB
XML Request
XML Response
Result File
GDS
deliverToGFTP
<gridDataServicePerform … ><request name=“myPushRequest">
<sqlQueryStatement name=“stmt"><dataResource>frogGenome</dataResource><expression>select * from chromFragwhere len < 20000
</expression><webRowSetStream name="statementresponse"/>
</sqlQueryStatement>
<deliverToGFTP name="d1"><fromLocal from="statementresponse"/><toGFTP host="ogsdai.org.uk"
port="8080" file="path/to/myfile.txt"/></deliverToGFTP>
</request></gridDataServicePerform>
VLDB 2003 Berlin 137
Update Pulling From Third Party
AnalystDB
XML Request
XML Response
OGSA Service
GDS
deliverFromGDT
<gridDataServicePerform><request name=“myPullRequest">
<deliverFromGDT name="d1"><fromGDT streamId="otherrequestasynch/d1"
mode="full">http://ogsadai.org.uk/GDTService/
my/GDT/GSH</fromGDT><toLocal name="datatoinsert"/>
</deliverFromGDT>
<sqlUpdateStatement name="statement"><sqlParameter position="1" from="datatoinsert"/><expression>insert into chromFrag values ?</expression>
</sqlUpdateStatement></request>
</gridDataServicePerform>
VLDB 2003 Berlin 138
Design Issues for GDS
• Service granularity:• Notions to capture:
• Installed DBMS.• Database within DBMS.• Session over database.• Request.• Query result set.
• Which should have:• PortType definitions.• Service Data Elements.
• Which should be:• Service instances.
• Operation granularity.• Atomic operations:
• Execute SQL Query.• Apply XSLT transformation.• Map directly to multi-
operation PortType.• Good for tooling.
• Compound operations:• Execute query, apply
transformation, deliver result.• Need to define orchestration
notation.• Fewer fine-grained calls.
VLDB 2003 Berlin 139
Standardisation
• Global Grid Forum:• Standards body for Grid Computing.• Open meetings three times a year.• www.gridforum.org.
• Database Access and Integration Services Working Group (DAIS-WG):• Sessions at each GGF meeting.• Intermediate telcons and face-to-face sessions.• Developing standard proposal.• http://www.gridforum.org/6_DATA/dais.htm
• OGSA-DAI 2.5 broadly implements the February 03 DAIS draft.
VLDB 2003 Berlin 140
Databases and the Grid
• Deploying databases:• Q: How are databases
made available on the Grid?
• A: Through integration with other Grid services, and provision of standard interfaces.
• DAIS GGF Working Group.
• Deploying database technologies:• Q: What database
technologies might be broadly useful as Grid services?
• A: transactions, materialised views, distributed query processing, …
VLDB 2003 Berlin 141
Distributed Query Processing
• DQP involves a single query referencing data stored at multiple sites.
• The locations of the data may be transparent to the author of the query.
select p.proteinId, Blast(p.sequence)from protein p, proteinTerm twhere t.termId = ‘GO:0005942’ and
p.proteinId = t.proteinId
J. Smith, A. Gounaris, P. Watson, N. Paton, A. Fernandes, R. Sakellariou, Distributed Query Processing on the Grid, 3rd Int. Workshop on Grid Computing, Springer-Verlag, 279-290, 2002.
VLDB 2003 Berlin 142
Service-Based DQP
• Service-based in two respects:• Queries are expressed over
services.• Grid database services.• Computational services.
• The distributed query processor it itself implemented as a collection of services.
OGSA-DAI
OGSA
Distributed Query Processor
Application/Presentation Layer
Configuration
Logging
Auditing
Policy
Security
Versioning
Accounting
OGSA-DQP Software: http://www.ogsa-dai.org.uk
VLDB 2003 Berlin 143
Service-based DQP
• DQP often exploits source wrappers.
• In service-based DQP, the sources are Web or Grid Services.
• A query may refer to database (GDS) and computational services.
• Workflow languages are often used for service orchestration.• Emerging standard:
BPEL4WS.• Procedural request
description.• Programmer controls order of
evaluation.
VLDB 2003 Berlin 144
Mutual Benefit
• The Grid needs DQP:• Declarative, high-level
resource integration with implicit parallelism.
• DQP-based solutions should in principle run raster than those manually coded.
• DQP needs the Grid:• Systematic access to remote
data and computational resources.
• Dynamic resource discovery and allocation.
VLDB 2003 Berlin 145
Phases in Query Processing
• Initialisation:• Grid Distributed Query Service is created.• Source descriptions imported.
• Query compilation:• Query optimisation and scheduling.
• Query evaluation:• Query evaluation services created as needed.• Query partitions allocated to evaluators for execution.
VLDB 2003 Berlin 146
Distributed Query Services
• Services:• GridDistributedQueryService (GDQS).• GridQueryEvaluationService (GQES).• Associated factories.
• Properties:• Both GDQS and GQES services implement the portTypes
of the Grid Database Service.• The GDQS is extended to support the importing of service
descriptions.
VLDB 2003 Berlin 147
Initialising a GDQS
GSH:GDQSF
findServiceData
create
importSchema(GSH:GDSF, ConfDoc)
findServiceData
GSH:GDSF
N2
N3
Create(ConfDoc)
findServiceDataDBSchemaGSH:GDS1
GSH:GDQS1
7
N1113
2
1
5
6
findServiceData ConfigDocs
4
G
Factory
GDSF
G
Registry
GSGDSR GFactory GDQSF
register
GGDQ GDQS1
G Registry GS
GDSR
GClient
GS
register
GGDS GDS1
GS
1
8
VLDB 2003 Berlin 148
Issues in Initialisation
• Q: When is a GDQS bound to a particular GDS?• A: When the schema of the
GDS is imported.• Q: What is the lifespan of
a GDS used by a GDQS?• A: The GDS is kept alive until
the GDQS expires.• Q: Are GDSs shared by
multiple GDQSs?• A: No.
• Q: When is a GQES created?• A: When a query is about to
be evaluated that needs it.• Q: What is the lifespan of
a GQES?• A: It lasts only as long as a
single query.• Q: Is a GQES shared
among several queries or GDQSs?• A: No.
VLDB 2003 Berlin 149
Query Compilation
LogicalOptimiser
PhysicalOptimiser
Partitioner Scheduler
Evaluator
OQLParser
Single-nodeOptimiser
Multi-nodeoptimiser
VLDB 2003 Berlin 150
Scheduling
• Partitions are allocated to Grid nodes; partitions may be merged during scheduling.
• Expressed by parallel algebra expression.
• Heuristic algorithm considers memory use, network costs.
table_scan(protein)
table_scantermID=...
(proteinTerm)
reduce reduce
hash_join
(proteinId)
op_call(Blast)
reduce
exchange exchange
exchange
3,4
1
1 2
VLDB 2003 Berlin 151
Query Evaluation
• Query installation:• GQESs created for partitions as required.• Partitions sent to GQESs.
• Query evaluation:• Partitions evaluated using iterator model.• Pipelined and partitioned parallelism.• Results conveyed to client.
VLDB 2003 Berlin 152
GFactory GQESF
GFactory GQESF
GFactory GQESF
N2
N1
N3
GClientGGDS
GGDS
GDQ
GDT
GDQS
N0GDS
GFactory GQESF
N4
perform(Query)1
createService
createService2
createService
2
2
GGDS GQES2
GDT
GGDS GQES3
GDT
GGDS GQES1
GDT
GGDS GQES1
GDT
perform(QuerySubplan)
perform(QuerySubplan)
perf orm(Q
ueryS ubpla n)
3
sequential_scan
reduce (proteinID,sequence)
sequential_scan (term=8372)
reduce (proteinID)
hash_join(p.proteinID=t.proteinID)
3
operation_callblast(p.sequence)
reduce (p.proteinID, blast)
operation_callblast(p.sequence)
reduce (p.proteinID, blast)
3
Web Services (BLAST)
results
results
results
4
1144
VLDB 2003 Berlin 153
Features of OGSA-DQP
• Low cost of entry:• Imports source descriptions through GDSs.• Imports service descriptions as WSDL.
• Throw-away GDQS:• Import sources on a task-specific basis.• Discard GDQS when task completed.
• Builds on parallel database technology:• Implicit parallelism.• Pipelined + partitioned parallel evaluation.
• Integrates data access with operation invocation.
VLDB 2003 Berlin 154
Related Work on Database Services
• Web services interfaces to databases (e.g.):• K. Mensah, Web Services Enable Your Database, Web
Services Journal, Vol 3, No 4, 2003.• S. Malaika et al., DB2 and Web Services, IBM Systems
Journal, Vol 41, No 4, 666-685, 2002.• Distributed querying using services:
• T. Malik and A. Szalay, SkyQuery: A Web Service Approach to Federate Databases, Proc. CIDR, http://www-db.cs.wisc.edu/cidr/program/p17.pdf, 2003.
• R. Braumandl, et al., ObjectGlobe: Ubiquitous query processing on the Internet, VLDB J., Vol 10, 48-71, 2001.
VLDB 2003 Berlin 155
Grid Data Management Services
• Data Grid applications can benefit from many lower level services:• Data movement.• Replication.• Database access and integration.
• Work is underway on designing, developing and standardising many core Grid Data Management services.
• Designing services in a dynamic and heterogeneous environment is non-trivial, and there is much still to be done.
VLDB 2003 Berlin 156
Outstanding Research Issues
• Adaptability.• Cost modelling.• Data encoding.• Data placement.• Caching and replication.• Glide-in databases.• Management of Grid
Resources.
• Orchestration.• Quality of service.• Scheduling.• Security.• Service description.• Service frameworks.