Upload
ashlyn
View
31
Download
2
Tags:
Embed Size (px)
DESCRIPTION
Data and the Grid. The OGSA-DAI Team [email protected]. Motivation. Entering an age of data Data Explosion CERN: LHC will generate 1GB/s = 10PB/y VLBA (NRAO) generates 1GB/s today Pixar generate 100 TB/Movie Storage getting cheaper Data stored in many different ways Data resources - PowerPoint PPT Presentation
Citation preview
2http://www.ogsadai.org.uk
Motivation
Entering an age of data– Data Explosion
• CERN: LHC will generate 1GB/s = 10PB/y• VLBA (NRAO) generates 1GB/s today• Pixar generate 100 TB/Movie
– Storage getting cheaper Data stored in many different ways
– Data resources• Relational databases• XML databases• Flat files
Need ways to facilitate – Data discovery– Data access– Data integration
Empower e-Business and e-Science– The Grid is a vehicle for achieving this
3http://www.ogsadai.org.uk
Composing Observations in Astronomy
Data and images courtesy Alex Szalay, John Hopkins
No. & sizes of data sets as of mid-2002, grouped by wavelength• 12 waveband coverage of large areas of the sky• Total about 200 TB data• Doubling every 12 months• Largest catalogues near 1B objects
4http://www.ogsadai.org.uk
Cardiovascular Functional Genomics
Glasgow Edinburgh
Leicester
Oxford
LondonNetherlands
Shared data Public curateddata (US)
BRIDGESIBM
5http://www.ogsadai.org.uk
Data Grid
CustomerSupport
Web-basedDashboard
Customer Data Customer Order Information
Marketing department identifies likely buyers of new product
• Company wants real-time integrated view of customer buying behavior
• Data resides in various distributed CRM & ERP systems
• Grid allows developers and apps to access and integrate customer data sources together in real time--across many distributed databases
SAPOracle SiebelDB2
Business Intelligence and Customer Data
Thanks to Dave Berry, Andrew Grimshaw – OGSA-WG
6http://www.ogsadai.org.uk
Providing Data to Cluster-Based Analytical Application
Forward Proxy Data Caches of
Remote Data
CentralizedCompute Cluster
HeadquartersIllinois
• Company has centralized HPC cluster running compute-intensive applications
• Source data for analysis distributed among 3 global sites, one of them an external partner
• Manual data-sharing processes increase costs/errors, and hinder time-to-results
• Grid enables secure, automatic provisioning of remote data to HPC cluster—feeding CPUs more data faster
Data Grid
Data Grid
Data Grid
Data Grid
R&D West Coast
TestingIndia
Engineering East Coast
AnalyticalApplications
Thanks to Dave Berry, Andrew Grimshaw – OGSA-WG
7http://www.ogsadai.org.uk
Data Virtualisation
Client
Other DataServices
“Primitive”Data Sources
• Abstraction
• Federation
• Transformation
Data Service
API
Thanks to Dave Berry, Andrew Grimshaw – OGSA-WG
Client API
Data Service
8http://www.ogsadai.org.uk
Grand Challenges
9http://www.ogsadai.org.uk
Share and share alike!
Many challenges:– Scalability, performance, heterogeneity, ownership, economics– Common schema, data description and semantics, data formats,
process and procedure, provenance
Can be solved only through collaboration and the sharing of:– Ideas– Efforts– Resources
Perhaps most importantly: sharing of data– Beware of data huggers!
Emerging Open Grid Infrastructures will– Allow global collaboration – Change the way that we can work
10http://www.ogsadai.org.uk
Three-way Alliance
Experiment &Advanced Data
Collection→
Shared Data
Requires Much Engineering, Much Innovation
Changes Culture, New Mores, New Behaviours
New Opportunities, New Results, New Rewards
TheoryModels & Simulations
→Shared Data
Computing ScienceSystems, Notations &
Formal Foundation→ Process & Trust
Multi-national, Multi-discipline, Computer-enabledConsortia, Cultures & Societies
11http://www.ogsadai.org.uk
Emergence of Virtual Organisations
12http://www.ogsadai.org.uk
Data Requirements
What do we need for effective sharing of data?– Structured, organised, annotated & curated data– Computable data models– Visualisation of data– Data provenance– Shared distributed systems– Networked workplaces, instruments, data sources– Metadata, ontologies, standards– Authentication, authorisation, accounting, policies
13http://www.ogsadai.org.uk
Database Growth
14http://www.ogsadai.org.uk
Terabyte → Petabyte
Terabyte Petabyte
RAM time to move 15 minutes 2 months
1GB WAN move time 10 hours ($1000) 14 months ($1 million)
Disk cost 7 disks =
$5000 (SCSI)
6800 Disks + 490 units + 32 racks = $7 million
Disk power 100 Watts 100 Kilowatts
Disk weight 5.6 Kg 33 Tonnes
Disk footprint Inside machine 60 m2
Approximately Correct in May 2003 Distributed Computing EconomicsJim Gray, Microsoft Research, MSR-TR-2003-24
15http://www.ogsadai.org.uk
Mohammed & Mountains
Petabytes of Data cannot be moved– It stays where it is produced or curated
• Hospitals, observatories, European Bioinformatics Institute
– A few caches and a small proportion cached
Distributed collaborating communities– Expertise in curation, simulation & analysis
Diverse data collections– Discovery depends on insights– Unpredictable or unexpected use of data
16http://www.ogsadai.org.uk
Move computation to the data
Assumption: code size << data size– Minimise data transport
Provision combined storage & compute resources Develop the database philosophy for this?
– Enhanced stored procedures– Pre-query analysis for more concise queries– Mobile code sandbox– Robustness
Develop the storage architecture for this?– Compute closer to disk? – System on a Chip using free space in the on-disk controller
Develop experiment, sensor & simulation architectures– That take code to select and digest data as an output control
Data Cutter a step in this direction– Sub-setting and aggregation of datasets using filters executed close to data– http://www.cs.umd.edu/projects/hpsl/ResearchAreas/DataCutter.htm
17http://www.ogsadai.org.uk
Meta-data: describing data
Choosing data sources– How do you find them?– How are they described and advertised?– Is the equivalent of Google possible?
Meta-data is required describing:– Structure of data– Types of data– Operations supported/available– Access requirements– Quality of service?
No established standards for heterogeneous data sources
18http://www.ogsadai.org.uk
Cultural Challenges
Changing the way we work? Publication and sharing of results
– Increased volume and diversity = increased opportunity?– Allows independent validation of methods and derivatives– Responsibility, ownership, credit, citation
Many distributed data resources– Data collected from observation, simulation & experiment– Independently owned & managed
• No common goals or design• Work hard for agreements on foundation types and ontologies• Autonomous decisions change data, structure, policy, etc
– Requires negotiations and patience!
Diversity– No “one size fits all” solutions will work
19http://www.ogsadai.org.uk
Economic Challenges
Data production, publication & management– Many researchers contributing increments of data – Who pays for storage, transport, management and
curation?
Data longevity– Research requirements may outlive technical decisions– Data does not preserve itself indefinitely!
Costs must be shared somehow…
20http://www.ogsadai.org.uk
Security Challenges: Medical Imaging Data
Diagnosing based on sensitive patient data– Users: a (group of) doctor(s)– Retrieve an image, run algorithm, examine result and write diagnosis,
maybe re-run another algorithm.
Secure Data Retrieval– Patient data is sensitive, needs to be stored anonymously at all times– Site admins are not trustworthy – strip or encrypt patient data from
image– Replication of data not always allowed
High security needs– Strong authorization– Fine-grained access control mechanisms– Leaking patient information results in prosecution.
Thanks to Dave Berry, Andrew Grimshaw – OGSA-WG
21http://www.ogsadai.org.uk
Scientific Discovery
Obtaining access to that data– Overcoming administrative barriers– Overcoming technical barriers
Understanding that data– The parts you care about for your research
Combing them using sophisticated models– The picture of reality in your head
Analysis on scales required by statistics– Coupling data access with computation
Repeated Processes– Examining variations, covering a set of candidates– Monitoring the emerging details– Coupling with scientific workflows
22http://www.ogsadai.org.uk
DAIS Working Group
23http://www.ogsadai.org.uk
DAIS WG Goals
Provide service-based access to structured data resources as part of OGSA architecture
Specify a selection of interfaces tailored to various styles of data access starting with relational and XML
Interact well with other GGF OGSA specs
24http://www.ogsadai.org.uk
DAIS WG Non-Goals
No new common query language No schema integration or common data
model No common namespace or naming scheme No data resource management
– e.g. starting/stopping database managers
No push based delivery – Information Dissemination WG?
25http://www.ogsadai.org.uk
DAIS View Of Data Services Model
0-* 0-*
Consumer Data Service Data Resource 0- * 0-*
0-*
0-*
This structure is not exposed through the Data Service interface to the Consumer.
A Data Service presents a Consumer with an interface to a Data Resource. A Data Resource can have arbitrary complexity, for example, a file on an
NFS mounted file system or a federation of relational databases. A Consumer is not typically exposed to this complexity and operates within
the bounds and semantics of the interface provided by the Data Service
26http://www.ogsadai.org.uk
Specification Names
Web Services Data Access and Integration (WS-DAI)– The specification formerly known as the Grid Data Service
Specification– A paradigm-neutral specification of descriptive and
operational features of services for accessing data
The WS-DAI Realisations– WS-DAIR: for relational databases– WS-DAIX: for XML repositories
27http://www.ogsadai.org.uk
DAIS Specification Landscape
OGSA Data Services
WS-DAI
WS-DAIXWS-DAIR
Scenarios for MappingDAIS Concepts
Is Informed By
Extend
GWD-I
GWD-R
28http://www.ogsadai.org.uk
INFOD
GridFTP
DT
TMBoFADFBoF
GGF
Arch Data ISP SRM
OGSA
CMM GIR
GSM
GFS DAIS
CGS GRAAP
Policy
IETF
SNMP
Other Standards Bodies
????W3C
XQuery
ANSI
SQL
DMTF
CIM
OASIS
WS-DM
WS-RF
WS-N
WSPolicy
WSAddress
OREP
JDBC
JCP
DFDL
DAIS and Other Standards/Specs
29http://www.ogsadai.org.uk
DAIS Data Access
Database Data Service
Relational Database
SQLResponse
SQLDescription: Readable Writeable ConcurrentAccess TransactionInitiation TransactionIsolation Etc.
SQLAccess
Consumer
SQLExecute ( SQLExpression )
30http://www.ogsadai.org.uk
DAIS Derived Data Access Database
Data Service
SQL Response Data Service
Relational Database
Row Set
SQLExecuteFactory ( SQLExpression BehaviouralProperties )
RDBMS specific mechanism for generating result set
SQLFactory
SQLResponseDescription
SQLResponseAccess
Consumer
GetRowset ( rowsetnumber )
Reference to SQLResponse Data Service
Rowset
SQLDescription: Readable Writeable ConcurrentAccess TransactionInitiation TransactionIsolation Etc.
31http://www.ogsadai.org.uk
Usage Patterns
GA
Q
S+R
Data
Q - QueryD - DeliveryS - StatusR - ResultU - UpdateI - Data id
Q+D
A
C
GS
R
G
C
A
Q
S
D
R
A G
Q+U
S
Retrieve Update/Insert Pipeline
G2=C
G1=P
A I
Q1
S2
S1
U/R
Q2+D
Q1+D
G2=C
A
G1=P
S2
S1
Q2
U/R
Actors
- OGSI process - Non-OGSI processA - AnalystC - ConsumerG - GDSP - Producer
CallResponse
Data Flow
A
PG
U
IQ
S
A
PG
U
I
S
Q+D
32http://www.ogsadai.org.uk
The story so far
Technology enables Grids and more data Distributed systems for sharing information
– Essential, ubiquitous & challenging– Therefore share methods and technology as much as possible
Collaboration is essential– Combining approaches– Combining skills– Sharing resources
Structured Data is the language of Collaboration– Data Access & Integration a Ubiquitous Requirement– Primary data, metadata, administrative & system data
Many hard technical challenges– Scale, heterogeneity, distribution, dynamic variation– Intimate combinations of data and computation with autonomous
development of both