The Challenge of Data Integration
Data + Grid = Discovery?
Prof. Malcolm Atkinson, Director
www.nesc.ac.uk
22nd January 2003
Overview
Essentials of e-Science
  Collaboration
  Resource Sharing, Data Sharing, Mutual Dependence
Essentials of the Grid
  Distributed Virtual Machine?
Essentials of Data Sharing
  Database Research did it?
  New Challenges
  Data Access & Integration Building Bricks
Band Wagon v Research Opportunity
  Thresholds, Visions and Questions
UK e-Science
e-Science and the Grid
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’
‘e-Science will change the dynamic of the way science is undertaken.’
John Taylor, Director General of Research Councils, Office of Science and Technology
From presentation by Tony Hey
UK e-Science Investment
[Map labels: Cambridge, Newcastle, Edinburgh, Oxford, Glasgow, Manchester, Cardiff, Southampton, London, Belfast, Daresbury Lab, RAL, Hinxton]
National e-Science Centre
HPC(x)
Projects: > 60 started, > 30 proposed
+ EU Projects
UK e-Science Programme (2) 2003 - 2005
DG Research Councils
E-Science Steering Committee
Grid TAG
Director
  Director’s Management Role
  Director’s Awareness and Co-ordination Role
Generic Challenges: EPSRC (£15m), DTI (£15m)
Industrial Collaboration (£40m)
Academic Application Support Programme: Research Councils (£74m), DTI (£5m)
  PPARC (£26m), BBSRC (£8m), MRC (£8m), NERC (£7m), ESRC (£3m), EPSRC (£17m), CLRC (£5m)
£80m Collaborative projects
Collaboration Growing
Hard Problems, Multi-disciplinary, Expense
Sharing
  Ideas
  Thought processes and Stimuli
  Effort
  Resources
Requires
  Communication
  Common understanding & Framework
  Mechanisms for sharing fairly
  Organisation and Infrastructure
Scientists have done this for Centuries
Interdependence
Science has relied on experiment and theory
Simulation, Data Mining, Analysis
Theory - Greece, 400 BC
Experiment - Italy, 1500 AD
Simulation - Europe, 1980 AD
For problems which are:
- too large/small
- too fast/slow
- too complex
- too expensive, unethical, ...
- Testing Understanding
Interdependence
[Diagram labels: Theory, Experiment, Computing, Models, Data, Data]
Database Growth
PDB protein structures
Globus Toolkit® History
[Chart: downloads per month from ftp.globus.org, 1997 to 2002; vertical axis 0 to 30000. Does not include downloads from NMI, UK eScience, EU Datagrid, IBM, Platform, etc.]
Chart annotations:
DARPA, NSF, and DOE begin funding Grid work
NASA begins funding Grid work, DOE adds support
The Grid: Blueprint for a New Computing Infrastructure published
GT 1.0.0 Released
Early Application Successes Reported
NSF & European Commission Initiate Many New Grid Projects
Anatomy of the Grid Paper Released
Significant Commercial Interest in Grids
Physiology of the Grid Paper Released
GT 2.0 Released
Encompassing Vision
data archives
sensor nets
computers
software
colleagues
instruments
People & Industry
Global Grid Forum
  GGF2   260     Jul 01
  GGF3   220     Oct 01
  GGF4   400     Feb 02
  GGF5   900     Jul 02
  GGF6   450     Oct 02
  GGF7   >1000   Mar 03
UK All Hands
  AHM’02   350   Sep 02
GlobusWorld
  1   450   Jan 03
IBM, this week
  “IBM DRIVES GRID COMPUTING FOR COMMERCIAL BUSINESS WITH TEN NEW GRID OFFERINGS”
  Targets: Financial, Life Sciences, Automotive & Aerospace, Governments
  Partners: Platform, DataSynapse, Avaki, Entropia, United Devices
IBM, last 20 months
  Leaders of OGSI
  Development teams
  Grid Jamboree
  GGF
[Chart: GGF1 to GGF5, vertical axis 0 to 900]
High-Altitude Views
A Rallying Cry
  Meeting a Hard Challenge requires Many Minds
  Operating & Maintaining Infrastructure requires Many Hands & Many Companies
Another Stab at Distributed Computing
  Hard Challenge: Intellectually and Practically Important
  Dependable Ubiquity over Heterogeneity & Fallibility
An Ambitious Virtual Machine
  Consistent large scale computational environments
A Global Operating System
  Collective Resources, Common Management
An Architectural View
Grid Plumbing & Security Infrastructure
Scheduling, Accounting, Authorisation
Monitoring, Diagnosis, Logging
Application
Data & Compute Resources
Operations Teams
Distributed
Providers
Application Users
Common Application Platform for Group of Applications
Application & Platform Developers
Open Grid Services Infrastructure
Confluence of Web Services & Grid
Consistent Interface Description
  Based on WSDL 1.2 proposal
  Extend Properties
  Separate Binding from Interface Function Composition & Inheritance
Exploit WS* Investment
Grid Features
  Security
  Life-Time Management
  Service (state) Information via Data Elements
  Discovery
  Grouping
  Notification
OGSI Version 1 Proposal at GGF7 (March 03)
Open Grid Services Architecture
Ubiquitous Building Blocks
Using OGSI Platform
Open & Extensible
Encourage Refactoring Experiments
Initially
  The Globus 2 model
  Except State Information now distributed
Example New Features
  Global Name Mapping Service
  Replication and Caching Service
  Data Access & Integration
  Metering, Logging, Authorisation, Charging, …
Grid Challenge
Balancing “Direct” Access to the “Platforms” with Abstraction & Virtualisation
Developers often have exploitable application knowledge
Automation necessary & helpful
  Interface matching, operation validation, …
  Optimisation at many scales
There isn’t enough effort to develop Languages & Abstractions
Data Integration
Data Resource 1
Data Resource 2
Scientist with Idea
1) Find Data
2) Extract Data
3) Transform Data
4) Combine Data
5) Interpret Data
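The five steps on this slide can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the `DataResource` class, the function names, the gene records and the use of a mean as the "interpretation" are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class DataResource:
    """Hypothetical data resource: a name, advertised topics, and records."""
    name: str
    topics: set
    records: list  # list of dicts


def find(resources, topic):
    """Step 1: keep only the resources that advertise the topic."""
    return [r for r in resources if topic in r.topics]


def extract(resource, wanted_fields):
    """Step 2: pull only the wanted fields out of each record."""
    return [{f: rec[f] for f in wanted_fields if f in rec}
            for rec in resource.records]


def transform(rows, rename):
    """Step 3: map source-specific field names onto a shared vocabulary."""
    return [{rename.get(k, k): v for k, v in row.items()} for row in rows]


def combine(left, right, key):
    """Step 4: join two row sets on a common key (inner join)."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def interpret(rows, field):
    """Step 5: stand-in for analysis; the mean of one numeric field."""
    values = [row[field] for row in rows]
    return sum(values) / len(values) if values else None


# Made-up example data for two resources that name the same key differently.
resource_1 = DataResource("Data Resource 1", {"genes"}, [
    {"gene_id": "g1", "symbol": "ABC1"},
    {"gene_id": "g2", "symbol": "XYZ2"},
])
resource_2 = DataResource("Data Resource 2", {"genes", "expression"}, [
    {"gene": "g1", "level": 2.0},
    {"gene": "g2", "level": 4.0},
    {"gene": "g3", "level": 9.0},
])


def integrate(resources):
    """Run the five steps in order over two resources."""
    found = find(resources, "genes")
    a = extract(found[0], ["gene_id", "symbol"])
    b = transform(extract(found[1], ["gene", "level"]), {"gene": "gene_id"})
    joined = combine(a, b, "gene_id")
    return joined, interpret(joined, "level")
```

Calling `integrate([resource_1, resource_2])` returns the joined rows and their mean level. The transform step is where most of the real effort tends to sit, since the two sources rarely agree on names or units.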
Wellcome Trust: Cardiovascular Functional Genomics
[Map labels: Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands]
Shared data
Public curated data
OGSA-DAI Partners
Partners: EPCC & NeSC, IBM UK, IBM USA, Manchester e-SC, Newcastle e-SC, Oracle
£3 million, 18 months, started February 2002
[Map labels: Oxford, Glasgow, Cardiff, Southampton, London, Belfast, Daresbury Lab, RAL, EPCC & NeSC, Newcastle, IBM USA, IBM Hursley, Oracle, Manchester, Cambridge, Hinxton]
DAI Key Services
GridDataService GDS Access to data & DB operations
GridDataServiceFactory GDSF Makes GDS & GDSF
GridDataServiceRegistry GDSR Discovery of GDS(F) & Data
GridDataTranslationService GDTS Translates or Transforms Data
GridDataTransportDepot GDTD Data transport with persistence
Integrated Structured Data Transport
Relational & XML models supported
Role-based Authorisation
Binary structured files (later)
DAI Architecture
Grid Infrastructure
Scheduling Accounting
Monitoring Diagnosis
Data Intensive Applications for Science X
Compute, Data & Storage Resources
Distributed
Authorisation
Data Access Services
Data Integration Services
Structured Data
Simulation, Analysis & Integration Technology for Science X
Data Intensive X Scientists
Data Integration Architecture
GridFTP Naming Caching
Generic Virtual Data Access and Integration Technology
1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML
[Diagram components: Client, Registry, Factory, Grid Data Service, XML / Relational database. Arrow legend: SOAP/HTTP, service creation, API interactions]
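The request sequence on this slide (1a to 3c) can be mimicked in-process as a rough sketch. This is not the OGSA-DAI API: the `Registry`, `Factory` and `GridDataService` classes and their methods are hypothetical stand-ins, the SOAP/HTTP transport is replaced by plain method calls, and an in-memory SQLite table with made-up rows plays the part of the relational database.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Toy stand-in for a GDS: manages access to one database."""

    def __init__(self, connection):
        self._connection = connection

    def query(self, sql, params=()):
        # 3b: the service, not the client, interacts with the database.
        cursor = self._connection.execute(sql, params)
        columns = [d[0] for d in cursor.description]
        # 3c: results are returned to the client as XML text.
        root = ET.Element("results")
        for values in cursor.fetchall():
            row = ET.SubElement(root, "row")
            for column, value in zip(columns, values):
                ET.SubElement(row, column).text = str(value)
        return ET.tostring(root, encoding="unicode")


class Factory:
    """Toy stand-in for a GDSF: creates a GDS on request."""

    def __init__(self, connection):
        self._connection = connection

    def create_service(self):
        # 2b and 2c: create a GDS and hand it back to the client.
        return GridDataService(self._connection)


class Registry:
    """Toy stand-in for a GDSR: maps topics to factories."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories.setdefault(topic, []).append(factory)

    def find(self, topic):
        # 1a and 1b: look up factories for data about the topic.
        return list(self._factories.get(topic, []))


def demo():
    # Made-up data standing in for a real scientific database.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE protein (id TEXT, length INTEGER)")
    db.executemany("INSERT INTO protein VALUES (?, ?)",
                   [("p1", 120), ("p2", 340)])

    registry = Registry()
    registry.register("x", Factory(db))

    factory = registry.find("x")[0]          # 1a, 1b
    gds = factory.create_service()           # 2a, 2b, 2c
    return gds.query(                        # 3a, 3b, 3c
        "SELECT id, length FROM protein WHERE length > ?", (200,))
```

The point of the indirection is that the client never holds a database connection: it only ever sees handles to services, which is what lets access control and lifetime management sit in the factory and the service.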
1a. Request to Registry for sources of data about “x” & “y”
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration to databases
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits set of queries to GDS with XPath, SQL, etc
3b. Tell consumer
3c. Results of queries returned to consumer as XML or binary
[Diagram components: Client, Consumer, Registry, Factory, GDS (five instances), two XML / Relational databases. Arrow legend: SOAP/HTTP, service creation, API interactions]
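The distinctive part of this second scenario is that results go to a separate consumer rather than back to the client. A minimal sketch of that hand-off follows. All class and method names are hypothetical, the databases are in-memory SQLite tables with made-up rows, and the sketch assumes it is the client that tells the consumer what to expect, which the slide does not spell out.

```python
import sqlite3


class Consumer:
    """Hypothetical third party that receives results instead of the client."""

    def __init__(self):
        self.expected = None
        self.received = {}

    def notify(self, query_names):
        # 3b: the consumer is told which result sets to expect.
        self.expected = set(query_names)

    def deliver(self, name, rows):
        # 3c: results arrive at the consumer, not at the client.
        self.received[name] = rows

    def complete(self):
        return self.expected is not None and self.expected == set(self.received)


class GDS:
    """Toy data service wrapping one database."""

    def __init__(self, connection):
        self._connection = connection

    def run(self, sql):
        return self._connection.execute(sql).fetchall()


class CoordinatingGDS:
    """Hypothetical front service: the single handle the client gets (2c)."""

    def __init__(self, services):
        self._services = services  # source name -> GDS
        self._pending = None

    def submit(self, queries, consumer):
        # 3a: accept a set of queries, keyed by source name.
        self._pending = (dict(queries), consumer)
        return len(queries)

    def deliver_pending(self):
        # 3c: run each query on its source and push rows to the consumer.
        queries, consumer = self._pending
        for source, sql in queries.items():
            consumer.deliver(source, self._services[source].run(sql))
        self._pending = None


def make_source(rows):
    """Build a toy source with one two-column table of made-up rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (k TEXT, v INTEGER)")
    db.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return GDS(db)


def demo():
    front = CoordinatingGDS({
        "x": make_source([("a", 1), ("b", 2)]),
        "y": make_source([("a", 10)]),
    })
    consumer = Consumer()
    queries = {"x": "SELECT k, v FROM t ORDER BY k",
               "y": "SELECT k, v FROM t"}
    submitted = front.submit(queries, consumer)   # 3a
    consumer.notify(queries.keys())               # 3b
    front.deliver_pending()                       # 3c
    return submitted, consumer
```

In this sketch the client ends up holding only the count of submitted queries, so large result sets never pass through it.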
Biomedical (or ANY) Data
Opportunities
  Global Production of Published Data
  Volume, Diversity
  Combination, Analysis, Discovery
Challenges
  Data Huggers
  Meagre metadata
  Ease of Use
  Automated, optimised integration
  Traceability, Dependability
Opportunities
  Specialised Indexing
  Structurally varied replication
  Consistent Structured Universe of Discourse
  Data & Computation Integration
Challenges
  Approximate Matching
  Multi-scale optimisation
  Bad habits / industrial structures
  Safety and Multi-scale optimisation
Data Integration Challenges
High-Level Languages
  Describing the Data Extraction Recipes
  Describing the Sources & Components
Metadata that drives automation & validation
Mobility
  Code & Data
Integrating Existing DB technology
  Moving the DBMS to the Grid context
New Optimisation Challenges
  Data & Computation & Storage & Movement
Shared Distributed Annotation Systems
  How to Reference
  Provenance & Acknowledgement
Challenges
A Programming & Development Model
Dependability at this Scale
Foundations for Trust
Raising the Level of Automation
Supporting New Forms of
  Collaboration
  Data