Data Challenges in e-Science. Aberdeen, 2nd December 2003. Prof. Malcolm Atkinson, Director, www.nesc.ac.uk


Page 1:

Data Challenges in e-Science

Aberdeen

Prof. Malcolm Atkinson, Director

www.nesc.ac.uk

2nd December 2003

Page 2:

What is e-Science?

Page 3:

Foundation for e-Science

The Grid links sensor nets, shared data archives, computers, software, colleagues and instruments.

e-Science methodologies will rapidly transform science, engineering, medicine and business

driven by exponential growth (×1000/decade) enabling a whole-system approach

Diagram derived from Ian Foster's slide

Page 4:

Three-way Alliance

Computing Science: Systems, Notations & Formal Foundation → Process & Trust

Theory: Models & Simulations → Shared Data

Experiment & Advanced Data Collection → Shared Data

Multi-national, Multi-discipline, Computer-enabled Consortia, Cultures & Societies

Requires Much Engineering, Much Innovation

Changes Culture, New Mores, New Behaviours

New Opportunities, New Results, New Rewards

Page 5:

Biochemical Pathway Simulator

Closing the information loop between lab and computational model.

(Computing Science, Bioinformatics, Beatson Cancer Research Labs)

DTI Bioscience Beacon Project Harnessing Genomics Programme

Slide from Professor Muffy Calder, Glasgow

Page 6:

Why is Data Important?

Page 7:

Data as Evidence – all disciplines

Hypothesis, curiosity or … → Collections of Data → Analysis & Models → Derived & Synthesised Data → Information, knowledge, decisions & designs

Driven by creativity, imagination and perspiration

Page 8:

Data as Evidence – Historically

Individual's idea (hypothesis, curiosity or …) → Personal collection (collections of data) → Personal effort (analysis & models) → Lab Notebook (derived & synthesised data) → Information, knowledge, decisions & designs

Driven by creativity, imagination, perspiration & personal resources

Page 9:

Data as Evidence – Enterprise Scale

Agreed hypothesis or goals (hypothesis, curiosity, business goals) → Enterprise databases & (archive) file stores (collections of digital data) → Data production pipelines (analysis & computational models) → Data products & results (derived & synthesised data) → Information, knowledge, decisions & designs

Driven by creativity, imagination, perspiration & company's resources

Page 10:

Data as Evidence – e-Science

Communities and challenges (shared goals, multiple hypotheses) → Multi-enterprise & public curation (collections of published & private data) → Synthesis from multiple sources: multi-enterprise models, computation & workflow (analysis, computation, annotation) → Shared data products & results (derived & synthesised data) → Information, knowledge, decisions & designs

Driven by creativity, imagination, perspiration & shared resources

Page 11:

Global in-flight engine diagnostics

In-flight data → ground station → global network (e.g. SITA) → data centre and DS&S Engine Health Center → airline and maintenance centre (via internet, e-mail, pager)

Distributed Aircraft Maintenance Environment: Universities of Leeds, Oxford, Sheffield & York

100,000 aircraft × 4 flights/day × 0.5 GB/flight ≈ 200 TB/day (see the sketch below)
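A quick back-of-envelope check of that daily volume. The arithmetic is mine; the fleet size, flights per day and per-flight volume are the figures quoted on the slide:

```python
# Illustrative only: recompute the DAME data volume from the slide's figures.
aircraft = 100_000          # monitored aircraft
flights_per_day = 4         # flights per aircraft per day
gb_per_flight = 0.5         # engine-health data captured per flight (GB)

total_gb_per_day = aircraft * flights_per_day * gb_per_flight
print(f"{total_gb_per_day / 1_000:.0f} TB/day")   # -> 200 TB/day
```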

Page 12:

LHC Distributed Simulation & Analysis

Detector: one bunch crossing per 25 ns; 100 triggers per second; each event is ~1 Mbyte. Rates are ~PBytes/sec at the detector, falling to ~100 MBytes/sec into the Offline Farm and on to the CERN Computer Centre.

Tier 0: CERN (Online System → Offline Farm ~20 TIPS → CERN Computer Centre >20 TIPS)
Tier 1: Regional Centres (RAL, US, French and Italian Regional Centres), linked to CERN at ~Gbits/sec or by air freight
Tier 2: Centres of ~1 TIPS each (e.g. ScotGRID++ ~1 TIPS), linked at ~Gbits/sec
Tier 3: Institutes of ~0.25 TIPS each, linked at 100 - 1000 Mbits/sec
Tier 4: Workstations and the physics data cache

Physicists work on analysis "channels". Each institute has ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server.

1 TIPS = 25,000 SpecInt95; a PC (1999) = ~15 SpecInt95.
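As a rough illustration of where the ~100 MBytes/sec figure comes from (my arithmetic, using only the trigger rate and event size quoted above):

```python
# Illustrative sketch: post-trigger data rate and a rough daily volume.
trigger_rate_hz = 100        # events selected per second
event_size_mb = 1.0          # ~1 MByte per event

stream_mb_per_s = trigger_rate_hz * event_size_mb
daily_tb = stream_mb_per_s * 86_400 / 1_000_000
print(f"{stream_mb_per_s:.0f} MB/s, ~{daily_tb:.1f} TB/day")   # -> 100 MB/s, ~8.6 TB/day
```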

Page 13:

DataGrid Testbed

Dubna

Moscow

RAL

Lund

Lisboa

Santander

Madrid

Valencia

Barcelona

Paris

Berlin

Lyon, Grenoble

Marseille

Brno, Prague

Torino

Milano

BO-CNAF, PD-LNL

Pisa

Roma

Catania

ESRIN

CERN

HEP sites

ESA sites

IPSL

Estec KNMI

Testbed Sites (>40)

Page 14:

Multiple overlapping communities, each with its own shared goals & multiple hypotheses, collections of published & private data, analysis, computation & annotation, and derived & synthesised data.

Supported by common standards & shared infrastructure

Page 15:

Life-science Examples

Page 16:

Database Growth: PDB Content Growth

Bases: 45,356,382,990

Page 17:

Wellcome Trust: Cardiovascular Functional Genomics

Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands

Shared data; public curated data

BRIDGES, IBM

Depends on building & maintaining security, privacy & trust

Page 18:

Comparative Functional Genomics

Large amounts of data; highly heterogeneous

Data types, data forms, community

Highly complex and inter-related; volatile

myGrid Project: Carole Goble, University of Manchester

Page 19:

UCSF

UIUC

From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign

Page 20:

Community = 1000s of home computer users + philanthropic computing vendor (Entropia) + research group (Scripps)

Common goal = advance AIDS research

Home Computers Evaluate AIDS Drugs

From Steve Tuecke 12 Oct. 01

Page 21:
Page 22:

Astronomy Examples

Page 23:
Page 24:

Global Knowledge Communities driven by Data: e.g., Astronomy

Number & sizes of data sets as of mid-2002, grouped by wavelength:
• 12-waveband coverage of large areas of the sky
• Total about 200 TB of data
• Doubling every 12 months
• Largest catalogues near 1B objects

Data and images courtesy Alex Szalay, Johns Hopkins
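As a rough illustration (my own projection, not from the slide) of what "about 200 TB, doubling every 12 months" implies over the following few years:

```python
# Illustrative only: compound the mid-2002 total at one doubling per year.
total_tb = 200                      # mid-2002 total from the slide
for year in range(2002, 2007):
    print(year, f"~{total_tb:,} TB")
    total_tb *= 2                   # doubling every 12 months
```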

Page 25:

Sloan Digital Sky Survey Production System

Slide from Ian Foster's SSDBM 03 keynote

Page 26:

Supernova Cosmology Requires Complex, Widely Distributed Workflow Management

Page 27:

Engineering Examples

Page 28:

Whole-system simulations

landing gear models: braking performance, steering capabilities, traction, damping capabilities
wing models: lift capabilities, drag capabilities, responsiveness
stabilizer and airframe models: deflection capabilities, responsiveness
human models: crew capabilities (accuracy, perception, stamina, reaction times, SOPs)
engine models: thrust performance, reverse thrust performance, responsiveness, fuel consumption

(see the coupling sketch below)

NASA Information Power Grid: coupling all sub-system simulations - slide from Bill Johnson
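A minimal sketch of the idea of coupling per-subsystem models into one whole-system step. The model names and toy numbers below are mine and purely illustrative; the real Information Power Grid couples full simulation codes running at different sites:

```python
# Illustrative sketch: sub-system models updating one shared aircraft state.
def engine_model(state):
    state["thrust"] = 0.9 * state.get("throttle", 1.0)
    return state

def wing_model(state):
    state["lift"] = 1.2 * state.get("airspeed", 100.0)
    return state

def landing_gear_model(state):
    state["braking"] = 0.5 if state.get("on_ground") else 0.0
    return state

SUBSYSTEMS = [engine_model, wing_model, landing_gear_model]

def whole_system_step(state):
    """One coupled step: every sub-system model sees the shared state."""
    for model in SUBSYSTEMS:
        state = model(state)
    return state

state = {"throttle": 0.8, "airspeed": 120.0, "on_ground": False}
for _ in range(3):                  # a few coupled time steps
    state = whole_system_step(state)
print(state)
```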

Page 29:

Mathematicians Solve NUG30

Looking for the solution to the NUG30 quadratic assignment problem: an informal collaboration of mathematicians and computer scientists. Condor-G delivered 3.46E8 CPU-seconds in 7 days (peak 1009 processors) across 8 sites in the U.S. and Italy.

Solution: 14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23

MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin. From Miron Livny, 7 Aug. 01
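For scale (my arithmetic from the figures above), 3.46E8 CPU-seconds over 7 days corresponds to roughly 570 processors busy on average, against the quoted peak of 1009:

```python
# Illustrative only: average concurrency implied by the Condor-G figures.
cpu_seconds = 3.46e8
wall_seconds = 7 * 86_400

avg_processors = cpu_seconds / wall_seconds
print(f"average ~{avg_processors:.0f} processors (peak was 1009)")   # ~572
```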

Page 30:

Network for EarthquakeEngineering Simulation

NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other. On-demand access to experiments, data streams, computing, archives, collaboration.

NEESgrid: Argonne, Michigan, NCSA, UIUC, USC. From Steve Tuecke, 12 Oct. 01

Page 31:

National Airspace Simulation Environment

NASA Information Power Grid: aircraft, flight paths, airport operations and the environment are combined to give a Virtual National Airspace (VNAS).

Sub-system models: GRC: engine models; LaRC: airframe models, landing gear models; ARC: wing models, stabilizer models, human models.

Simulation drivers: FAA ops data, weather data, airline schedule data, digital flight data, radar tracks, terrain data, surface data.

22,000 commercial US flights a day drive 50,000 engine runs, 22,000 airframe impact runs, 132,000 landing/take-off gear runs, 48,000 human crew runs, 66,000 stabilizer runs and 44,000 wing runs.

Page 32:

Data Challenges

Page 33:

Derived from Ian Foster's slide at SSDBM, July 03

It's Easy to Forget How Different 2003 is From 1993

Enormous quantities of data: Petabytes. For an increasing number of communities the gating step is not collection but analysis.

Ubiquitous Internet: >100 million hosts. Collaboration & resource sharing the norm; security and trust are crucial issues.

Ultra-high-speed networks: >10 Gb/s. Global optical networks; bottlenecks are the last kilometre & firewalls.

Huge quantities of computing: >100 Top/s. Moore's law gives us all supercomputers; organising their effective use is the challenge.

Moore's law everywhere: instruments, detectors, sensors, scanners, … Organising their effective use is the challenge.

Page 34:

Tera → Peta Bytes

                     1 Terabyte                 1 Petabyte
RAM time to move     15 minutes                 2 months
1 Gb WAN move time   10 hours ($1,000)          14 months ($1 million)
Disk cost            7 disks = $5,000 (SCSI)    6,800 disks + 490 units + 32 racks = $7 million
Disk power           100 Watts                  100 Kilowatts
Disk weight          5.6 kg                     33 tonnes
Disk footprint       inside machine             60 m²

May 2003: approximately correct

See also Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24
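An illustrative sketch, not from the slide: the quoted WAN move times imply an effective throughput well below the nominal 1 Gb/s line rate. With an assumed efficiency of about 22%, the 10-hour and 14-month figures come out roughly right:

```python
# Illustrative only: ideal move times over a 1 Gb/s WAN, derated by an assumed
# end-to-end efficiency chosen to roughly reproduce the slide's figures.
def move_time_seconds(n_bytes, line_rate_bps=1e9, efficiency=0.22):
    """Seconds to move n_bytes over a link running at a fraction of line rate."""
    return n_bytes * 8 / (line_rate_bps * efficiency)

tb, pb = 1e12, 1e15
print(f"1 TB: ~{move_time_seconds(tb) / 3600:.0f} hours")            # ~10 hours
print(f"1 PB: ~{move_time_seconds(pb) / (86_400 * 30):.0f} months")  # ~14 months
```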

Page 35:

The Story so Far

Technology enables Grids, more data & …; Information Grids will dominate; collaboration is essential.

Combining approaches, combining skills, sharing resources.

(Structured) Data is the language of Collaboration.

Data Access & Integration is a Ubiquitous Requirement.

Many hard technical challenges: scale, heterogeneity, distribution, dynamic variation.

Intimate combinations of data and computation, with unpredictable (autonomous) development of both.

Page 36:

Scientific Data

Opportunities: global production of published data; volume, diversity, combination, analysis, discovery.

Challenges: data huggers; meagre metadata; ease of use; optimised integration; dependability.

Opportunities: specialised indexing; new data organisation; new algorithms; varied replication; shared annotation; intensive data & computation.

Challenges: fundamental principles; approximate matching; multi-scale optimisation; autonomous change; legacy structures; scale and longevity; privacy and mobility.

Page 37:

UK e-Science

Page 38:

UK e-Science

e-Science and the Grid: 'e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.'

'e-Science will change the dynamic of the way science is undertaken.'

John Taylor, Director General of Research Councils, Office of Science and Technology

From a presentation by Tony Hey

Page 39:

e-Science Programme’s Vision

The UK will lead in the exploitation of e-Infrastructure.

New, faster and better research: engineering design, medical diagnosis, decision support, …; e-Business, e-Research, e-Design & e-Decision.

Depends on leading e-Infrastructure development & deployment.

Page 40:

e-Science and SR2002

Research Council          2004-06            (2001-04)
Medical                   £13.1M             (£8M)
Biological                £10.0M             (£8M)
Environmental             £8.0M              (£7M)
Eng & Phys                £18.0M             (£17M)
HPC                       £2.5M              (£9M)
Core Prog.                £16.2M + £20M      (£15M)
Particle Phys & Astro     £31.6M             (£26M)
Economic & Social         £10.6M             (£3M)
Central Labs              £5.0M              (£5M)

Page 41:

www.nesc.ac.uk

National e-Science Centre

HPC(x)

National e-Science Institute; international relationships; Engineering Task Force; Grid Support Centre; Architecture Task Force; OGSA-DAI; one of 11 Centre Projects; GridNet to support standards work; one of 6 administration projects; training team; 5 application projects; 15 "fundamental" research projects; EGEE

Globus Alliance

Page 42:

NeSI in Edinburgh

National e-Science Centre

Page 43:

NeSI Events held in the 2nd Year

(from 1 Aug 2002 to 31 Jul 2003) We have had 86 events (Year 1 figures in brackets):

11 project meetings (4); 11 research meetings (7); 25 workshops (17 + 1); 2 "summer" schools (0); 15 training sessions (8); 12 outreach events (3); 5 international meetings (1); 5 e-Science management meetings (7)

(though the definitions are fuzzy!)

> 3600 Participant Days

Suggestions always welcome

Establishing a training team

Investing in community building, skill generation & knowledge development

Page 44:

NeSI Workshops

Space for real work; crossing communities; creativity: new strategies and solutions; written reports

Workshops held: Scientific Data Mining, Integration and Visualisation; Grid Information Systems; Portals and Portlets; Virtual Observatory as a Data Grid; Imaging, Medical Analysis and Grid Environments; Grid Scheduling; Provenance & Workflow; GeoSciences & Scottish Bioinformatics Forum

http://www.nesc.ac.uk/events/

Page 45:

E-Infrastructure

Page 46:

OGSA Infrastructure Architecture (layers, top to bottom):

Data Intensive Users
Data Intensive Applications for Application area X
Simulation, Analysis & Integration Technology for Application area X
Virtual Integration Architecture: Generic Virtual Data Access and Integration Layer (Structured Data Integration; Structured Data Access; Structured Data: Relational, XML, Semi-structured)
OGSI: Interface to Grid Infrastructure (Transformation, Registry, Job Submission, Data Transport, Resource Usage, Banking, Brokering, Workflow, Authorisation)
Compute, Data & Storage Resources (Distributed)

Page 47:

Data Access & Integration Services

1a. Request to Registry for sources of data about "x"
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService (GDS) to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc.
3b. GDS interacts with the XML / relational database
3c. Results of query returned to client as XML

All interactions are SOAP/HTTP: service creation plus API interactions between Client, Registry, Factory and Grid Data Service.
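A minimal sketch of the registry → factory → Grid Data Service pattern described above. This is not the real OGSA-DAI API (which is Java and web-service based); the class and method names here are hypothetical, for illustration only:

```python
# Illustrative sketch of the slide's interaction pattern, not OGSA-DAI itself.

class Registry:
    """Knows which factories can serve data about a given topic (steps 1a/1b)."""
    def __init__(self, factories):
        self._factories = factories            # topic -> Factory

    def find_factory(self, topic):
        return self._factories[topic]

class Factory:
    """Creates a Grid Data Service bound to one database (steps 2a-2c)."""
    def __init__(self, database):
        self._database = database

    def create_gds(self):
        return GridDataService(self._database)

class GridDataService:
    """Accepts a query (SQL, XPath, ...) and returns results as XML (steps 3a-3c)."""
    def __init__(self, database):
        self._database = database

    def perform(self, query):
        rows = self._database.run(query)
        return "<results>" + "".join(f"<row>{r}</row>" for r in rows) + "</results>"

class ToyDatabase:
    """Stand-in for a real relational or XML store."""
    def run(self, query):
        return ["x=1", "x=2"]

# Client-side sequence from the slide:
registry = Registry({"x": Factory(ToyDatabase())})
factory = registry.find_factory("x")                          # 1a/1b
gds = factory.create_gds()                                    # 2a-2c
print(gds.perform("SELECT * FROM data WHERE topic = 'x'"))    # 3a-3c
```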

Page 48:

Future DAI Services?

1a. Request to Data Registry for sources of data about "x" & "y"
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration from resources Sx and Sy
2b. Factory creates a network of GridDataServices (GDS1, GDS2, GDS3, plus GDTS1 and GDTS2) spanning the XML and relational databases
2c. Factory returns handle of the Data Access & Integration master GDS to the client
3a. Client submits a sequence of scripts, each with a set of queries to the GDS (XPath, SQL, etc.)
3b. Client tells the analyst
3c. Sequences of result sets are returned to the analyst as formatted binary described in a standard XML notation

Surrounding components: Problem Solving Environment, Semantic Metadata, Application Code, "scientific" application coding and scientific insights. Interactions are SOAP/HTTP (service creation and API interactions).

Page 49:

Integration is our Focus

Supporting Collaboration: bring together disciplines; bring together people engaged in a shared challenge; inject initial energy; invent methods that work.

Supporting Collaborative Research: integrate compute, storage and communications; deliver and sustain an integrated software stack; operate a dependable infrastructure service; integrate multiple data sources; integrate data and computation; integrate experiment with simulation; integrate visualisation and analysis.

High-level tools and automation are essential; fundamental research as a foundation.

Page 50:

Take Home Message

Data is a Major Source of Challenges AND an Enabler of New Science, Engineering, Medicine, Planning, …

Information Grids: support for collaboration; support for computation and data grids; structured data is fundamental; integrated strategies & technologies needed.

E-Infrastructure is Here – More to do, technically & socio-economically.

Join in – explore the potential – develop the methods & standards.

NeSC would like to help you develop e-Science. We seek suggestions and collaboration.