
Scientific Data Management Center All Hands Meeting March 2-3, 2005 Salt Lake City


Scientific Data Management Center

Participating Institutions

Center PI: Arie Shoshani, LBNL

DOE Laboratories co-PIs:
  • Bill Gropp, Rob Ross*: ANL
  • Arie Shoshani, Doron Rotem: LBNL
  • Terence Critchlow*, Chandrika Kamath: LLNL
  • Nagiza Samatova*: ORNL

Universities co-PIs:
  • Mladen Vouk: North Carolina State
  • Alok Choudhary: Northwestern
  • Reagan Moore, Bertram Ludaescher: UC San Diego (SDSC) and UC Davis
  • Steve Parker: U of Utah

* Area Leaders

Agenda

Session 1 (morning of first day): status reports: SEA, DMA, SPA
  • 8:15 – 10:00
  • 10:00 Break
  • 10:30 – 12:00
  • 12:00 Lunch

Session 2 (afternoon of first day): Application talks
Topics: talks by application people working with us, success stories, needs, bottlenecks, imagined new uses of SDM technologies
  • 1:30 – 3:00
      • Eric Myra: Astrophysics
      • Jackie Chen: Combustion
      • Scott Klasky: Fusion
  • 3:00 – 3:30 Break
  • 3:30 – 5:00
      • Jerome Lauret: High Energy Physics
      • Elliot Peele: Astrophysics
      • Wes Bethel: Visualization

Session 3 (morning of second day): Panel with Apps people
Moderator: Doron Rotem
  • 8:30 – 10:00 panel, part 1: End-to-end use cases vs. technologies
  • 10:00 – 10:30 Break
  • 10:30 – 12:00 panel, part 2: Engaging the sciences

Session 4 (afternoon of second day)
  • 1:30 – 4:00 future planning
    Topics: discussion of future plans, including integration with other ISICs, considerations for new technology areas, the universities' role, and planning for the proposal
  • (Official end of meeting)
  • 4:00 – 4:30 Break
  • 4:30 – 6:00 informal meetings/discussions

A Typical SDM Scenario

[Figure: layered diagram of a typical SDM scenario]
  • Tiers: Flow Tier, Work Tier
  • Layers: Control Flow Layer; Applications & Software Tools Layer; I/O System Layer; Storage & Network Resources Layer
  • Tasks: Task A: Generate Time-Steps; Task B: Move TS; Task C: Analyze TS; Task D: Visualize TS
  • Components shown: Simulation Program, DataMover, Post Processing, Parallel R, Terascale Browser, Parallel NetCDF, HDF5 Libraries, PVFS, Sabul, SRM
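The four tasks in the scenario (generate, move, analyze, visualize time-steps) can be pictured as a chain in which each task consumes the output of the previous one. The following is a minimal sketch of that idea only. All function names and the data layout are hypothetical stand-ins, not the interfaces of the SDM tools named in the figure.

```python
from statistics import mean


def generate_time_steps(n_steps, n_points):
    """Task A: stand-in for a simulation that emits one time-step at a time."""
    for t in range(n_steps):
        yield {"step": t, "values": [float(t + i) for i in range(n_points)]}


def move(time_steps, archive):
    """Task B: stand-in for data movement; records each time-step in an archive."""
    for ts in time_steps:
        archive.append(ts)
        yield ts


def analyze(time_steps):
    """Task C: reduce each time-step to a small summary."""
    for ts in time_steps:
        yield {"step": ts["step"], "mean": mean(ts["values"])}


def visualize(summaries):
    """Task D: stand-in for visualization; renders one text line per summary."""
    return [f"step {s['step']}: mean={s['mean']:.1f}" for s in summaries]


def run_workflow(n_steps=3, n_points=4):
    """Chain tasks A -> B -> C -> D; each stage starts as soon as data arrives."""
    archive = []
    stream = generate_time_steps(n_steps, n_points)
    lines = visualize(analyze(move(stream, archive)))
    return archive, lines
```

Because the stages are generators, a time-step flows through all four tasks before the next one is produced, which is one simple way to model tasks that overlap in time rather than run strictly one after another.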

Technology Details by Layer

[Figure: SDM center technologies grouped by layer]
  • Scientific Process Automation (SPA) Layer: WorkFlow Management Tools, Web Wrapping Tools
  • Data Mining & Analysis (DMA) Layer: ASPECT: integration Framework, Data Analysis tools (PCA, ICA), Efficient indexing (Bitmap Index), Efficient Parallel Visualization (pVTK), Parallel R Statistical Analysis
  • Storage Efficient Access (SEA) Layer: Parallel NetCDF Software Layer, ROMIO MPI-IO System, Parallel Virtual File System, Storage Resource Manager (To HPSS)
  • Hardware, OS, and MSS (HPSS)

Applications Panel, part 1

Applications Panel, part 1: end-to-end vs. SDM technologies

Scientific Exploration Phases
  • Data Generation
  • Post-processing / summarization
  • Data Analysis

End-to-end use cases
  • For each phase
  • A combination of phases

What SDM technologies are needed/applicable
  • Workflow and dataflow
  • Efficient I/O from/to disk and tertiary storage
  • Searching and indexing
  • General analysis and visualization tools
  • Large-scale data movement
  • Metadata management
  • Missing topic?

Phases of Scientific Exploration

Data Generation
  • From large scale simulations or experiments
  • Fast data growth with computational power
  • Examples
      • HENP: 100 Teraops and 10 Petabytes by 2006
      • Climate: spatial resolution T42 (280 km) -> T85 (140 km) -> T170 (70 km); T42: about 1 TB/100 year run => factor of ~ 10-20
  • Problems
      • Can’t dump the data to storage fast enough – waste of compute resources
      • Can’t move terabytes of data over WAN robustly – waste of scientist’s time
      • Can’t steer the simulation – waste of time and resources
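One way to read the climate example is as a back-of-the-envelope scaling estimate. The sketch below assumes that output volume grows with the number of horizontal grid points only, with vertical levels and output frequency unchanged. That assumption is ours, not something the slide states, and the function names are illustrative.

```python
# From the slide: a T42 run produces about 1 TB per 100 simulated years.
T42_TB_PER_100_YEARS = 1.0
T42_SPACING_KM = 280.0


def horizontal_growth(coarse_km, fine_km):
    """Ratio of horizontal grid points when spacing shrinks from coarse_km to fine_km.

    Halving the spacing doubles the point count in each of two horizontal
    directions, so the ratio is the square of the spacing ratio.
    """
    return (coarse_km / fine_km) ** 2


def estimated_volume_tb(fine_km):
    """Estimated TB per 100-year run at the given grid spacing (assumption-based)."""
    return T42_TB_PER_100_YEARS * horizontal_growth(T42_SPACING_KM, fine_km)
```

Under this assumption T85 (140 km) comes out at 4 times the T42 volume and T170 (70 km) at 16 times, which falls inside the slide's "factor of ~ 10-20".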

Phases of Scientific Exploration

Post-processing/summarization
  • Process raw data from experiments/simulations
      • May generate as much data as original raw data
      • e.g. HENP: process raw detector data to produce “tracks”, “vertices”, etc.
      • e.g. Climate: generate vertical organization of data
          from: time-step -> space points -> all variables
          to: variable -> time point -> all space
          or: variable -> space points -> all times
  • Summarization
      • Produce high level summaries for preliminary analysis and/or efficient search
          e.g. HENP: “total_energy”, “number_of_particles” per “event”
      • Produce summarization of space/time for coarse analysis
          e.g. Climate: generate “monthly-means”
  • Problems
      • Large volume “read” -> large volume “write”
      • Summarization – needs good metadata
      • Need to produce indexes to search over large data
      • Need to reorganize and transform data – large data intensive tasks
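The climate reorganization and the "monthly-means" summary above can be illustrated on a toy in-memory data set. This is a sketch under stated assumptions: the nested-dictionary layout, the function names, and the fixed block length are all hypothetical, and real data sets at this scale would not fit in memory.

```python
from collections import defaultdict


def reorganize(time_steps):
    """Turn time-step -> space point -> variables into variable -> space point -> all times.

    `time_steps` is a list ordered by time; each entry maps a point name to a
    dictionary of variable values at that point.
    """
    by_variable = defaultdict(lambda: defaultdict(list))
    for step in time_steps:
        for point, variables in step.items():
            for name, value in variables.items():
                by_variable[name][point].append(value)
    return {name: dict(points) for name, points in by_variable.items()}


def block_means(series, block):
    """Average consecutive blocks of `block` samples, e.g. daily values into months."""
    means = []
    for start in range(0, len(series), block):
        chunk = series[start:start + block]
        means.append(sum(chunk) / len(chunk))
    return means


# Toy input: four time-steps, one space point, two variables.
steps = [
    {"p0": {"temp": 1.0, "wind": 5.0}},
    {"p0": {"temp": 3.0, "wind": 7.0}},
    {"p0": {"temp": 5.0, "wind": 9.0}},
    {"p0": {"temp": 7.0, "wind": 11.0}},
]
by_var = reorganize(steps)
temp_means = block_means(by_var["temp"]["p0"], block=2)
```

The example also shows why the slide lists "large volume read -> large volume write" as a problem: every input value is read once and written once into the new layout, so the reorganized copy is as large as the original.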

Phases of Scientific Exploration

Data Analysis
  • Analysis of large data volume
  • Can’t fit all data in memory
  • Problems
      • Find the relevant data – need efficient indexing
      • Cluster analysis – need linear scaling
      • Feature selection – efficient high-dimensional analysis
      • Data heterogeneity – combine data from diverse sources
      • Streamline analysis steps – output of one step needs to match input of next
      • Read data fast enough from disk storage
      • Pre-stage data from tertiary storage
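For the "find the relevant data" problem, the deck elsewhere names bitmap indexing as the center's approach. The sketch below shows the basic idea of a binned bitmap index on toy data, using Python integers as bit vectors. It is an illustration only: the attribute names, bin edges, and values are made up, and it omits the compression that a production bitmap index would need.

```python
def build_bitmap_index(values, bin_edges):
    """Build one bitmap per bin; bit i of a bitmap is set if values[i] falls in that bin.

    Bins are half-open intervals [bin_edges[b], bin_edges[b + 1]).
    """
    bitmaps = [0] * (len(bin_edges) - 1)
    for i, value in enumerate(values):
        for b in range(len(bin_edges) - 1):
            if bin_edges[b] <= value < bin_edges[b + 1]:
                bitmaps[b] |= 1 << i
                break
    return bitmaps


def rows_of(bitmap):
    """Return the row numbers whose bit is set in the bitmap."""
    rows, i = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(i)
        bitmap >>= 1
        i += 1
    return rows


# Hypothetical per-event summaries.
total_energy = [5, 42, 17, 88, 63]
number_of_particles = [3, 12, 9, 20, 1]

energy_index = build_bitmap_index(total_energy, [0, 50, 100])
particle_index = build_bitmap_index(number_of_particles, [0, 10, 100])

# Query: total_energy >= 50 AND number_of_particles >= 10.
# Each condition is one bitmap; the conjunction is a single bitwise AND.
hits = rows_of(energy_index[1] & particle_index[1])
```

The point of the design is that multi-attribute range conditions reduce to bitwise operations over precomputed bitmaps, so the raw data is read only for the rows that match.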

Vision: facilitating end-to-end data management

Support entire scenarios
  • e.g. Data generation, post-processing, analysis

Be willing to apply any technology necessary
  • SDM center technology
  • Adapt technology as necessary

Package technology as components
  • Integration of technologies
  • Make SDM technology components callable from workflow

Facilitate the use of scientific workflow tools
  • Manage launching of tasks
  • Manage data movement
  • Permit dynamic interaction with workflow

Application scientists must be involved in incorporating the technology into existing frameworks and infrastructures
  • Need to work closely with app scientists
  • Identify end-to-end use cases (scenarios)
  • App scientist should be funded, too
  • App scientists are the “messengers of good news”

Lessons learned – technology (1)

Scientific workflow is an important paradigm
  • Coordination of tasks AND management of data flow
  • Managing repetitive steps
  • Tracking, estimation

Efficient I/O is often the bottleneck
  • Technology essential for efficient computation
  • Mass storage needs to be seamlessly managed
  • Opportunities to interact with Math packages

Searching and indexing
  • Searching over billions of objects
  • Searching in space/time
  • Searching in multi-dimensional space

Lessons learned – technology (2)

General analysis tools are useful
  • Statistical analysis, cluster analysis
  • Feature selection and extraction
  • Parallelization is key to scaling
  • Visualization is an integral part of analysis

Data movement is complex
  • Network infrastructure is not enough – can be unreliable
  • Need robust software to manage failures
  • Need to manage space allocation
  • Managing format mismatch is part of data flow

Metadata emerging as an important need
  • Description of experiments/simulation
  • Provenance
  • Use of hints / access patterns
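The "need robust software to manage failures" lesson can be sketched as a transfer loop that retries each file a bounded number of times and reports what could not be moved. This is a generic retry-with-backoff pattern offered as an illustration, not the behavior of any SDM center tool. `TransferError`, `transfer_with_retry`, and the `transfer` callable are hypothetical names.

```python
import time


class TransferError(Exception):
    """Raised by the caller-supplied transfer function when one attempt fails."""


def transfer_with_retry(transfer, files, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try to transfer each file up to max_attempts times.

    `transfer` is any callable that moves one file and raises TransferError on
    failure. The wait between attempts doubles each time (base_delay, then
    2 * base_delay, and so on). Returns (done, failed) lists of file names so
    the caller can resume later with only the failed files.
    """
    done, failed = [], []
    for name in files:
        for attempt in range(1, max_attempts + 1):
            try:
                transfer(name)
                done.append(name)
                break
            except TransferError:
                if attempt == max_attempts:
                    failed.append(name)
                else:
                    sleep(base_delay * 2 ** (attempt - 1))
    return done, failed
```

Returning the failed list instead of aborting on the first error is the design choice that matters here: a multi-terabyte transfer should make as much progress as it can and leave a precise record of what remains.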

[Table: blank worksheet, one slide per application talk, identical layout on each]

Slides:
  • Fusion - Klasky
  • Combustion - Chen
  • HENP - Lauret
  • Astrophysics - Peele
  • Astrophysics - Myra

Columns: Data Generation | Post-processing/Summarization | Data Analysis | Comments

Rows:
  • Scientific Workflow
  • Parallel I/O
  • Searching & indexing
  • Analysis tools: Feature extraction, Statistical, etc.
  • Data-driven visualization
  • Data movement / Space management
  • Metadata management
  • other

Applications Panel, part 2

Applications Panel, part 2: engaging the sciences

Topics
  • What percentage of time do you spend on data management related tasks?
      • What are these tasks?
      • Suppose these tasks are taken care of; what other technology do you wish the SDM center would support?
      • How do you expect your software (simulation, analysis) to interoperate with SDM center software?
  • How do you see the role of the SDM center in your application domain?
      • Providing support for your SDM needs
      • Applying technology that enables new science
      • Jointly developing new technology

Applications Panel, part 2: engaging the sciences

Topics
  • Close collaborative projects
      • We believe that it is necessary to work with application scientists jointly in order to apply SDM technology
      • How do we achieve joint activities: joint funding, good will, advertising, tutorials?
      • Do you expect some SDM technology to be packaged and used directly from downloads?
  • Outreach
      • We believe that if we solve end-to-end use cases, the technology can be spread to the science communities by example
      • Do you agree?
      • Other ideas?

SDM center Panel: Plans and Opportunities

Plans (next proposal)

Organizational issues
  • Plan: same participants
      • Collaborations: take time to build teams
      • ISICs encouraged to re-apply
      • Need to think on “steering the boat”
  • What does each participant want to work on (not to discuss now)
  • Funding levels – same?

Technical issues (next slides)
  • Is the concept of end-to-end attractive?
  • Is the concept of close collaboration, where scientists are “messengers of good news”, attractive?
  • What technologies?
  • How do we work with Apps?
  • How do we evolve – labs and universities
  • Do labs and universities have different roles?

Lessons learned from success stories

What do we consider a success?
  • Successful use of SDM technology by application scientists
      • Productivity of scientist
      • Enable new science that could not be done previously

What does it take to apply SDM technology?
  • Stages of activities
      • Development – of basic technology
      • Adaptation – to a specific application domain
      • Integration – into application framework
      • Deployment – get scientists to use the technology
  • Application interaction with all these stages
      • Problems are complex – requires close collaboration
      • Requires time commitment of an application scientist
      • Grid Collector example – ½ time of app scientist paid off


Engaging application science communities

  • Close collaboration is essential for success
  • Technology adaptation to an application domain is the key to its use
  • End-to-end solutions should be developed where appropriate
  • Embed SDM solutions in other packages/frameworks
      • Math (e.g. parallel I/O for AMR in APDEC)
      • Application analysis (e.g. ROOT in HEP)
      • Experiment frameworks (e.g. STAR project)

Engaging application science communities

  • Need to fund application scientists for joint projects specialized for that application domain
      • Funds can be made available on an as-needed basis
      • Controlled by SDM center through advisory board
      • Funded application scientists can spread the word in their communities (joint effort with SDM center)
  • Be part of application proposals, so their office funds the application scientist working with SDM center, or even some center activities (e.g. current attempt with PPPL)

SDM center and the role of CS basic research

Stages in technology development
  • Research, Prototype, Product, Infrastructure

Role of CS basic research
  • Research-to-prototype
  • Can afford risky projects since application scientists are not waiting for results
  • Longer term payoff (2-3 years)
  • Very important – research technology funneled into SciDAC

Role of SDM center in SciDAC
  • Prototype-to-Product
  • Low risk, apply technology that has been prototyped
  • Shorter term payoff (1-2 years)

Role of SDM center in Application Offices
  • Product-to-infrastructure
  • Software moves from "product" to "infrastructure" when sites start installing it by default
  • Adoption of the software by key groups
  • Requires jointly-funded collaborations (including App people)

SDM center and other ISICs

Math ISICs
  • Identify joint activities
  • e.g. parallel I/O in APDEC

SC ISICs
  • CCA technology for wrapping components to be used in workflows
  • PERC technology to identify I/O bottlenecks

Summary

  • SDM technology can successfully be applied across multiple scientific applications
  • Close collaboration to establish end-to-end solutions is needed
      • Application scientists must be involved in incorporating the technology into existing frameworks and infrastructures (messengers of good news)
  • We recommend having a flexible funding structure to support application scientists on collaborative projects
      • Level of such funding structure should be in addition to the SDM center funding (about 20% of level of center’s funding)
      • Funding structure to be managed by SDM center using advisory board
  • We recommend the Center joining application side proposals when possible
      • Ability to use funds where technology is needed
      • Need help from OASCR to make such connections