32
http://www.ogsadai.org.uk Data and the Grid The OGSA-DAI Team [email protected]

Data and the Grid

  • Upload
    ashlyn

  • View
    31

  • Download
    2

Embed Size (px)

DESCRIPTION

Data and the Grid. The OGSA-DAI Team [email protected]. Motivation. Entering an age of data Data Explosion CERN: LHC will generate 1GB/s = 10PB/y VLBA (NRAO) generates 1GB/s today Pixar generate 100 TB/Movie Storage getting cheaper Data stored in many different ways Data resources - PowerPoint PPT Presentation

Citation preview

Page 1: Data and the Grid

http://www.ogsadai.org.uk

Data and the Grid

The OGSA-DAI Team

[email protected]

Page 2: Data and the Grid

2http://www.ogsadai.org.uk

Motivation

Entering an age of data– Data Explosion

• CERN: LHC will generate 1GB/s = 10PB/y• VLBA (NRAO) generates 1GB/s today• Pixar generate 100 TB/Movie

– Storage getting cheaper Data stored in many different ways

– Data resources• Relational databases• XML databases• Flat files

Need ways to facilitate – Data discovery– Data access– Data integration

Empower e-Business and e-Science– The Grid is a vehicle for achieving this

Page 3: Data and the Grid

3http://www.ogsadai.org.uk

Composing Observations in Astronomy

Data and images courtesy Alex Szalay, John Hopkins

No. & sizes of data sets as of mid-2002, grouped by wavelength• 12 waveband coverage of large areas of the sky• Total about 200 TB data• Doubling every 12 months• Largest catalogues near 1B objects

Page 4: Data and the Grid

4http://www.ogsadai.org.uk

Cardiovascular Functional Genomics

Glasgow Edinburgh

Leicester

Oxford

LondonNetherlands

Shared data Public curateddata (US)

BRIDGESIBM

Page 5: Data and the Grid

5http://www.ogsadai.org.uk

Data Grid

CustomerSupport

Web-basedDashboard

Customer Data Customer Order Information

Marketing department identifies likely buyers of new product

• Company wants real-time integrated view of customer buying behavior

• Data resides in various distributed CRM & ERP systems

• Grid allows developers and apps to access and integrate customer data sources together in real time--across many distributed databases

SAPOracle SiebelDB2

Business Intelligence and Customer Data

Thanks to Dave Berry, Andrew Grimshaw – OGSA-WG

Page 6: Data and the Grid

6http://www.ogsadai.org.uk

Providing Data to Cluster-Based Analytical Application

Forward Proxy Data Caches of

Remote Data

CentralizedCompute Cluster

HeadquartersIllinois

• Company has centralized HPC cluster running compute-intensive applications

• Source data for analysis distributed among 3 global sites, one of them an external partner

• Manual data-sharing processes increase costs/errors, and hinder time-to-results

• Grid enables secure, automatic provisioning of remote data to HPC cluster—feeding CPUs more data faster

Data Grid

Data Grid

Data Grid

Data Grid

R&D West Coast

TestingIndia

Engineering East Coast

AnalyticalApplications

Thanks to Dave Berry, Andrew Grimshaw – OGSA-WG

Page 7: Data and the Grid

7http://www.ogsadai.org.uk

Data Virtualisation

Client

Other DataServices

“Primitive”Data Sources

• Abstraction

• Federation

• Transformation

Data Service

API

Thanks to Dave Berry, Andrew Grimshaw – OGSA-WG

Client API

Data Service

Page 8: Data and the Grid

8http://www.ogsadai.org.uk

Grand Challenges

Page 9: Data and the Grid

9http://www.ogsadai.org.uk

Share and share alike!

Many challenges:– Scalability, performance, heterogeneity, ownership, economics– Common schema, data description and semantics, data formats,

process and procedure, provenance

Can be solved only through collaboration and the sharing of:– Ideas– Efforts– Resources

Perhaps most importantly: sharing of data– Beware of data huggers!

Emerging Open Grid Infrastructures will– Allow global collaboration – Change the way that we can work

Page 10: Data and the Grid

10http://www.ogsadai.org.uk

Three-way Alliance

Experiment &Advanced Data

Collection→

Shared Data

Requires Much Engineering, Much Innovation

Changes Culture, New Mores, New Behaviours

New Opportunities, New Results, New Rewards

TheoryModels & Simulations

→Shared Data

Computing ScienceSystems, Notations &

Formal Foundation→ Process & Trust

Multi-national, Multi-discipline, Computer-enabledConsortia, Cultures & Societies

Page 11: Data and the Grid

11http://www.ogsadai.org.uk

Emergence of Virtual Organisations

Page 12: Data and the Grid

12http://www.ogsadai.org.uk

Data Requirements

What do we need for effective sharing of data?– Structured, organised, annotated & curated data– Computable data models– Visualisation of data– Data provenance– Shared distributed systems– Networked workplaces, instruments, data sources– Metadata, ontologies, standards– Authentication, authorisation, accounting, policies

Page 13: Data and the Grid

13http://www.ogsadai.org.uk

Database Growth

Page 14: Data and the Grid

14http://www.ogsadai.org.uk

Terabyte → Petabyte

Terabyte Petabyte

RAM time to move 15 minutes 2 months

1GB WAN move time 10 hours ($1000) 14 months ($1 million)

Disk cost 7 disks =

$5000 (SCSI)

6800 Disks + 490 units + 32 racks = $7 million

Disk power 100 Watts 100 Kilowatts

Disk weight 5.6 Kg 33 Tonnes

Disk footprint Inside machine 60 m2

Approximately Correct in May 2003 Distributed Computing EconomicsJim Gray, Microsoft Research, MSR-TR-2003-24

Page 15: Data and the Grid

15http://www.ogsadai.org.uk

Mohammed & Mountains

Petabytes of Data cannot be moved– It stays where it is produced or curated

• Hospitals, observatories, European Bioinformatics Institute

– A few caches and a small proportion cached

Distributed collaborating communities– Expertise in curation, simulation & analysis

Diverse data collections– Discovery depends on insights– Unpredictable or unexpected use of data

Page 16: Data and the Grid

16http://www.ogsadai.org.uk

Move computation to the data

Assumption: code size << data size– Minimise data transport

Provision combined storage & compute resources Develop the database philosophy for this?

– Enhanced stored procedures– Pre-query analysis for more concise queries– Mobile code sandbox– Robustness

Develop the storage architecture for this?– Compute closer to disk? – System on a Chip using free space in the on-disk controller

Develop experiment, sensor & simulation architectures– That take code to select and digest data as an output control

Data Cutter a step in this direction– Sub-setting and aggregation of datasets using filters executed close to data– http://www.cs.umd.edu/projects/hpsl/ResearchAreas/DataCutter.htm

Page 17: Data and the Grid

17http://www.ogsadai.org.uk

Meta-data: describing data

Choosing data sources– How do you find them?– How are they described and advertised?– Is the equivalent of Google possible?

Meta-data is required describing:– Structure of data– Types of data– Operations supported/available– Access requirements– Quality of service?

No established standards for heterogeneous data sources

Page 18: Data and the Grid

18http://www.ogsadai.org.uk

Cultural Challenges

Changing the way we work? Publication and sharing of results

– Increased volume and diversity = increased opportunity?– Allows independent validation of methods and derivatives– Responsibility, ownership, credit, citation

Many distributed data resources– Data collected from observation, simulation & experiment– Independently owned & managed

• No common goals or design• Work hard for agreements on foundation types and ontologies• Autonomous decisions change data, structure, policy, etc

– Requires negotiations and patience!

Diversity– No “one size fits all” solutions will work

Page 19: Data and the Grid

19http://www.ogsadai.org.uk

Economic Challenges

Data production, publication & management– Many researchers contributing increments of data – Who pays for storage, transport, management and

curation?

Data longevity– Research requirements may outlive technical decisions– Data does not preserve itself indefinitely!

Costs must be shared somehow…

Page 20: Data and the Grid

20http://www.ogsadai.org.uk

Security Challenges: Medical Imaging Data

Diagnosing based on sensitive patient data– Users: a (group of) doctor(s)– Retrieve an image, run algorithm, examine result and write diagnosis,

maybe re-run another algorithm.

Secure Data Retrieval– Patient data is sensitive, needs to be stored anonymously at all times– Site admins are not trustworthy – strip or encrypt patient data from

image– Replication of data not always allowed

High security needs– Strong authorization– Fine-grained access control mechanisms– Leaking patient information results in prosecution.

Thanks to Dave Berry, Andrew Grimshaw – OGSA-WG

Page 21: Data and the Grid

21http://www.ogsadai.org.uk

Scientific Discovery

Obtaining access to that data– Overcoming administrative barriers– Overcoming technical barriers

Understanding that data– The parts you care about for your research

Combing them using sophisticated models– The picture of reality in your head

Analysis on scales required by statistics– Coupling data access with computation

Repeated Processes– Examining variations, covering a set of candidates– Monitoring the emerging details– Coupling with scientific workflows

Page 22: Data and the Grid

22http://www.ogsadai.org.uk

DAIS Working Group

Page 23: Data and the Grid

23http://www.ogsadai.org.uk

DAIS WG Goals

Provide service-based access to structured data resources as part of OGSA architecture

Specify a selection of interfaces tailored to various styles of data access starting with relational and XML

Interact well with other GGF OGSA specs

Page 24: Data and the Grid

24http://www.ogsadai.org.uk

DAIS WG Non-Goals

No new common query language No schema integration or common data

model No common namespace or naming scheme No data resource management

– e.g. starting/stopping database managers

No push based delivery – Information Dissemination WG?

Page 25: Data and the Grid

25http://www.ogsadai.org.uk

DAIS View Of Data Services Model

0-* 0-*

Consumer Data Service Data Resource 0- * 0-*

0-*

0-*

This structure is not exposed through the Data Service interface to the Consumer.

A Data Service presents a Consumer with an interface to a Data Resource. A Data Resource can have arbitrary complexity, for example, a file on an

NFS mounted file system or a federation of relational databases. A Consumer is not typically exposed to this complexity and operates within

the bounds and semantics of the interface provided by the Data Service

Page 26: Data and the Grid

26http://www.ogsadai.org.uk

Specification Names

Web Services Data Access and Integration (WS-DAI)– The specification formerly known as the Grid Data Service

Specification– A paradigm-neutral specification of descriptive and

operational features of services for accessing data

The WS-DAI Realisations– WS-DAIR: for relational databases– WS-DAIX: for XML repositories

Page 27: Data and the Grid

27http://www.ogsadai.org.uk

DAIS Specification Landscape

OGSA Data Services

WS-DAI

WS-DAIXWS-DAIR

Scenarios for MappingDAIS Concepts

Is Informed By

Extend

GWD-I

GWD-R

Page 28: Data and the Grid

28http://www.ogsadai.org.uk

INFOD

GridFTP

DT

TMBoFADFBoF

GGF

Arch Data ISP SRM

OGSA

CMM GIR

GSM

GFS DAIS

CGS GRAAP

Policy

IETF

SNMP

Other Standards Bodies

????W3C

XQuery

ANSI

SQL

DMTF

CIM

OASIS

WS-DM

WS-RF

WS-N

WSPolicy

WSAddress

OREP

JDBC

JCP

DFDL

DAIS and Other Standards/Specs

Page 29: Data and the Grid

29http://www.ogsadai.org.uk

DAIS Data Access

Database Data Service

Relational Database

SQLResponse

SQLDescription: Readable Writeable ConcurrentAccess TransactionInitiation TransactionIsolation Etc.

SQLAccess

Consumer

SQLExecute ( SQLExpression )

Page 30: Data and the Grid

30http://www.ogsadai.org.uk

DAIS Derived Data Access Database

Data Service

SQL Response Data Service

Relational Database

Row Set

SQLExecuteFactory ( SQLExpression BehaviouralProperties )

RDBMS specific mechanism for generating result set

SQLFactory

SQLResponseDescription

SQLResponseAccess

Consumer

GetRowset ( rowsetnumber )

Reference to SQLResponse Data Service

Rowset

SQLDescription: Readable Writeable ConcurrentAccess TransactionInitiation TransactionIsolation Etc.

Page 31: Data and the Grid

31http://www.ogsadai.org.uk

Usage Patterns

GA

Q

S+R

Data

Q - QueryD - DeliveryS - StatusR - ResultU - UpdateI - Data id

Q+D

A

C

GS

R

G

C

A

Q

S

D

R

A G

Q+U

S

Retrieve Update/Insert Pipeline

G2=C

G1=P

A I

Q1

S2

S1

U/R

Q2+D

Q1+D

G2=C

A

G1=P

S2

S1

Q2

U/R

Actors

- OGSI process - Non-OGSI processA - AnalystC - ConsumerG - GDSP - Producer

CallResponse

Data Flow

A

PG

U

IQ

S

A

PG

U

I

S

Q+D

Page 32: Data and the Grid

32http://www.ogsadai.org.uk

The story so far

Technology enables Grids and more data Distributed systems for sharing information

– Essential, ubiquitous & challenging– Therefore share methods and technology as much as possible

Collaboration is essential– Combining approaches– Combining skills– Sharing resources

Structured Data is the language of Collaboration– Data Access & Integration a Ubiquitous Requirement– Primary data, metadata, administrative & system data

Many hard technical challenges– Scale, heterogeneity, distribution, dynamic variation– Intimate combinations of data and computation with autonomous

development of both