Slide 1: Scientific Data Management Center
All Hands Meeting
March 2-3, 2005, Salt Lake City
Slide 2: Scientific Data Management Center
Participating Institutions

Center PI: Arie Shoshani (LBNL)

DOE laboratory co-PIs:
- Bill Gropp, Rob Ross* (ANL)
- Arie Shoshani, Doron Rotem (LBNL)
- Terence Critchlow*, Chandrika Kamath (LLNL)
- Nagiza Samatova* (ORNL)

University co-PIs:
- Mladen Vouk (North Carolina State)
- Alok Choudhary (Northwestern)
- Reagan Moore, Bertram Ludaescher (UC San Diego (SDSC) and UC Davis)
- Steve Parker (U of Utah)

* Area leaders
Slide 3: Agenda

Session 1 (morning of first day): status reports: SEA, DMA, SPA
  8:15 - 10:00
  10:00 break
  10:30 - 12:00
  12:00 lunch

Session 2 (afternoon of first day): application talks
  Topics: talks by application people working with us; success stories, needs, bottlenecks, imagined new uses of SDM technologies
  1:30 - 3:00
    Eric Myra - Astrophysics
    Jackie Chen - Combustion
    Scott Klasky - Fusion
  3:00 - 3:30 break
  3:30 - 5:00
    Jerome Lauret - High Energy Physics
    Elliot Peele - Astrophysics
    Wes Bethel - Visualization

Session 3 (morning of second day): panel with application people
  Moderator: Doron Rotem
  8:30 - 10:00 panel, part 1: end-to-end use cases vs. technologies
  10:00 - 10:30 break
  10:30 - 12:00 panel, part 2: engaging the sciences

Session 4 (afternoon of second day)
  1:30 - 4:00 future planning
  Topics: discussion of future plans, including integration with other ISICs, considerations for new technology areas, the role of universities, and planning for the proposal
  (official end of meeting)
  4:00 - 4:30 break
  4:30 - 6:00 informal meetings/discussions
Slide 4: A Typical SDM Scenario

[Diagram: a four-layer stack crossed by a left-to-right workflow (flow tier / work tier).]
- Control Flow Layer: Task A: generate time steps -> Task B: move time steps -> Task C: analyze time steps -> Task D: visualize time steps
- Applications & Software Tools Layer: simulation program, DataMover, Parallel R post-processing, terascale browser
- I/O System Layer: Parallel NetCDF, PVFS, Sabul, HDF5 libraries, SRM
- Storage & Network Resources Layer
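The four-task scenario (generate, move, analyze, visualize time steps) can be sketched as a simple sequential workflow. This is a minimal illustration only: every function here is a hypothetical stand-in, not an SDM center API, and the real scenario runs the tasks through workflow tools with SRM/DataMover underneath.

```python
# Minimal sketch of the Task A-D pipeline from the scenario above.
# All task functions are hypothetical stand-ins, not SDM center APIs.

def generate_time_steps(n):
    """Task A: the simulation emits one data object per time step."""
    return [{"step": i, "data": [i * 0.1] * 4} for i in range(n)]

def move_ts(ts, archive):
    """Task B: 'move' a time step to an archive (here, just a list)."""
    archive.append(ts)
    return ts

def analyze_ts(ts):
    """Task C: reduce a time step to a summary value."""
    return sum(ts["data"]) / len(ts["data"])

def visualize_ts(summary):
    """Task D: stand-in for rendering; returns a text label."""
    return f"frame(mean={summary:.2f})"

def run_pipeline(n_steps):
    """Control-flow layer: chain tasks A -> B -> C -> D per time step."""
    archive, frames = [], []
    for ts in generate_time_steps(n_steps):
        moved = move_ts(ts, archive)
        frames.append(visualize_ts(analyze_ts(moved)))
    return archive, frames

archive, frames = run_pipeline(3)
```

In the real deployment each arrow in the diagram crosses layer boundaries (the move goes through SRM/DataMover, the reads through Parallel NetCDF or HDF5), which is why the deck treats the control flow as its own layer.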
Slide 5: Technology Details by Layer

Scientific Process Automation (SPA) layer:
- Workflow management tools
- Web wrapping tools

Data Mining & Analysis (DMA) layer:
- Efficient parallel visualization (pVTK)
- Efficient indexing (Bitmap Index)
- Data analysis tools (PCA, ICA)
- Parallel R statistical analysis
- ASPECT integration framework

Storage Efficient Access (SEA) layer:
- Parallel NetCDF software layer
- ROMIO MPI-IO system
- Parallel Virtual File System
- Storage Resource Manager (to HPSS)

Base: hardware, OS, and MSS (HPSS)
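The "efficient indexing (Bitmap Index)" item in the DMA layer can be illustrated with a toy equality-encoded bitmap index: one bitmap per distinct attribute value, so that a multi-attribute query becomes cheap bitwise ANDs. This sketch uses Python integers as bitsets and invented sample data; production bitmap indexes add compression and range encoding.

```python
# Toy equality-encoded bitmap index: one bitmap (a Python int used as a
# bitset) per distinct value; multi-attribute queries are bitwise ANDs.

def build_bitmap_index(values):
    """Map each distinct value to a bitmap with bit r set for row r."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index

def query_and(*bitmaps):
    """Rows matching all conditions: intersect the bitmaps with AND."""
    result = ~0
    for b in bitmaps:
        result &= b
    return result

def rows_of(bitmap, n_rows):
    """Decode a bitmap back into a sorted list of row numbers."""
    return [r for r in range(n_rows) if bitmap >> r & 1]

# Hypothetical event attributes, in the spirit of the HENP use case.
energy = ["low", "high", "high", "low", "high"]
ptype  = ["e",   "e",    "mu",   "mu",  "e"]
ei = build_bitmap_index(energy)
pi_ = build_bitmap_index(ptype)
hits = rows_of(query_and(ei["high"], pi_["e"]), len(energy))
```

The appeal for scientific data is that the bitmaps are built once over append-only data and then answer searches over billions of objects without scanning the raw records.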
Slide 6: Applications Panel, part 1
Slide 7: Applications Panel, part 1: End-to-End vs. SDM Technologies

Scientific exploration phases:
- Data generation
- Post-processing / summarization
- Data analysis

End-to-end use cases:
- For each phase
- A combination of phases

What SDM technologies are needed/applicable:
- Workflow and dataflow
- Efficient I/O from/to disk and tertiary storage
- Searching and indexing
- General analysis and visualization tools
- Large-scale data movement
- Metadata management
- Missing topics?
Slide 8: Phases of Scientific Exploration

Data generation:
- From large-scale simulations or experiments
- Fast data growth with computational power; examples:
  • HENP: 100 teraops and 10 petabytes by 2006
  • Climate: spatial resolution T42 (280 km) -> T85 (140 km) -> T170 (70 km); T42 is about 1 TB per 100-year run, so refinement means a factor of ~10-20 more data

Problems:
- Can't dump the data to storage fast enough - a waste of compute resources
- Can't move terabytes of data over the WAN robustly - a waste of the scientist's time
- Can't steer the simulation - a waste of time and resources
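The factor of ~10-20 quoted for the climate example can be roughly reconstructed. This is a back-of-the-envelope sketch under stated assumptions, not the deck's own calculation: assume output volume scales with horizontal grid points (linear refinement squared) times the number of time steps (one more linear factor if the time step shrinks with the grid spacing); output frequency and variable counts push the true number higher.

```python
# Back-of-the-envelope for the climate resolution numbers above.
# Assumption (not from the slide): volume ~ (horizontal points) * (time
# steps) ~ refinement_ratio^2 * refinement_ratio.

def volume_factor(res_from_km, res_to_km):
    r = res_from_km / res_to_km   # linear refinement ratio
    return r ** 3                 # r^2 more points, r more time steps

t42_to_t85 = volume_factor(280, 140)   # 2x linear refinement
t42_to_t170 = volume_factor(280, 70)   # 4x linear refinement
```

A 2x refinement already gives ~8x the data from ~1 TB per 100-year run, consistent in order of magnitude with the slide's factor of 10-20 once extra output variables are counted.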
Slide 9: Phases of Scientific Exploration

Post-processing / summarization:
- Process raw data from experiments/simulations
  • May generate as much data as the original raw data
  • e.g. HENP: process detectors' raw data to produce "tracks", "vertices", etc.
  • e.g. Climate: reorganize the data vertically,
      from: time step -> space points -> all variables
      to:   variable -> time point -> all space
      or:   variable -> space points -> all times
- Summarization
  • Produce high-level summaries for preliminary analysis and/or efficient search
    e.g. HENP: "total_energy", "number_of_particles" per "event"
  • Produce space/time summaries for coarse analysis
    e.g. Climate: generate "monthly means"

Problems:
- Large-volume reads lead to large-volume writes
- Summarization requires good metadata
- Need to produce indexes to search over large data
- Need to reorganize and transform data - large, data-intensive tasks
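The climate "monthly means" summarization mentioned above can be sketched in a few lines: collapse a long per-time-step series into one value per month. The input layout here is hypothetical (a flat list of tuples); real runs stream this out of netCDF files at far larger scale.

```python
# Sketch of the "monthly means" summarization: collapse a daily series
# into one mean per (year, month). The input layout is hypothetical.

from collections import defaultdict

def monthly_means(daily):
    """daily: list of (year, month, value) -> {(year, month): mean}."""
    sums = defaultdict(lambda: [0.0, 0])   # (year, month) -> [total, count]
    for year, month, value in daily:
        s = sums[(year, month)]
        s[0] += value
        s[1] += 1
    return {ym: total / n for ym, (total, n) in sums.items()}

daily = [(1999, 1, 10.0), (1999, 1, 14.0), (1999, 2, 20.0)]
means = monthly_means(daily)
```

The summary is tiny compared with the raw data, which is exactly why the slide pairs summarization with good metadata: without recording how the means were produced, the cheap-to-search summary loses its link to the raw time steps.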
Slide 10: Phases of Scientific Exploration

Data analysis:
- Analysis of large data volumes; all the data can't fit in memory
- Problems:
  • Finding the relevant data - needs efficient indexing
  • Cluster analysis - needs linear scaling
  • Feature selection - needs efficient high-dimensional analysis
  • Data heterogeneity - combining data from diverse sources
  • Streamlining analysis steps - the output of one step must match the input of the next
  • Reading data fast enough from disk storage
  • Pre-staging data from tertiary storage
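When the data cannot fit in memory, one standard answer is a single streaming pass that reads chunk by chunk and keeps only running aggregates. A minimal sketch, assuming a hypothetical chunk source (in practice, reads from disk or pre-staged tertiary storage):

```python
# One-pass streaming statistics for data too large for memory: read in
# chunks, keep only constant-size running aggregates.

def streaming_stats(chunks):
    """chunks: iterable of iterables of numbers -> summary dict."""
    count, total = 0, 0.0
    lo, hi = float("inf"), float("-inf")
    for chunk in chunks:          # e.g. one file or block at a time
        for x in chunk:
            count += 1
            total += x
            lo, hi = min(lo, x), max(hi, x)
    return {"count": count, "mean": total / count, "min": lo, "max": hi}

stats = streaming_stats([[1.0, 2.0], [3.0], [4.0, 10.0]])
```

The same chunked pattern underlies the linear-scaling requirement for cluster analysis on the slide: anything that needs more than a bounded number of passes over the data stops being feasible at these volumes.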
Slide 11: Vision: Facilitating End-to-End Data Management

Support entire scenarios
  • e.g. data generation, post-processing, analysis
Be willing to apply any technology necessary
  • SDM center technology
  • Adapt technology as necessary
Package technology as components
  • Integration of technologies
  • Make SDM technology components callable from workflows
Facilitate the use of scientific workflow tools
  • Manage the launching of tasks
  • Manage data movement
  • Permit dynamic interaction with the workflow
Application scientists must be involved in incorporating the technology into existing frameworks and infrastructures
  • Need to work closely with application scientists
  • Identify end-to-end use cases (scenarios)
  • Application scientists should be funded, too
  • Application scientists are the "messengers of good news"
Slide 12: Lessons Learned - Technology (1)

Scientific workflow is an important paradigm
  • Coordination of tasks AND management of data flow
  • Managing repetitive steps
  • Tracking, estimation
Efficient I/O is often the bottleneck
  • Technology essential for efficient computation
  • Mass storage needs to be seamlessly managed
  • Opportunities to interact with math packages
Searching and indexing
  • Searching over billions of objects
  • Searching in space/time
  • Searching in multi-dimensional space
Slide 13: Lessons Learned - Technology (2)

General analysis tools are useful
  • Statistical analysis, cluster analysis
  • Feature selection and extraction
  • Parallelization is key to scaling
  • Visualization is an integral part of analysis
Data movement is complex
  • Network infrastructure alone is not enough - it can be unreliable
  • Need robust software to manage failures
  • Need to manage space allocation
  • Managing format mismatches is part of the data flow
Metadata is emerging as an important need
  • Description of experiments/simulations
  • Provenance
  • Use of hints / access patterns
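The data-movement lesson, that the network alone is unreliable and robust software must manage failures, can be illustrated with a minimal retry-with-backoff wrapper. This is a sketch of the general policy, not the SRM/DataMover implementation; the transfer callable is hypothetical, and real tools add space reservation, checksums, and partial-transfer resume on top.

```python
import time

# Minimal retry-with-exponential-backoff wrapper of the kind a robust
# data mover needs. `transfer` is a hypothetical callable.

def transfer_with_retry(transfer, attempts=4, base_delay=0.01):
    """Call transfer(), retrying on IOError with exponential backoff."""
    for i in range(attempts):
        try:
            return transfer()
        except IOError:
            if i == attempts - 1:
                raise                      # give up after the last attempt
            time.sleep(base_delay * 2 ** i)

# Simulated flaky link: fails twice, then succeeds.
calls = {"n": 0}
def flaky_transfer():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("link dropped")
    return "done"

result = transfer_with_retry(flaky_transfer)
```

Wrapping every movement step this way is what lets a workflow run unattended overnight instead of failing on the first transient outage.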
Slide 14: Fusion - Klasky

[Worksheet table: columns are the phases Data Generation, Post-processing/Summarization, and Data Analysis, plus Comments; rows are Scientific Workflow, Parallel I/O, Searching & Indexing, Analysis Tools (feature extraction, statistical, etc.), Data-Driven Visualization, Data Movement / Space Management, Metadata Management, and Other. The cells are blank in this export; they were to be filled in during the panel.]
Slides 15-18 repeat the same blank worksheet for Combustion - Chen, HENP - Lauret, Astrophysics - Peele, and Astrophysics - Myra.
Slide 19: Applications Panel, part 2
Slide 20: Applications Panel, part 2: Engaging the Sciences

Topics:
- What percentage of your time do you spend on data-management-related tasks?
  • What are these tasks?
  • Suppose these tasks were taken care of: what other technology would you wish the SDM center to support?
  • How do you expect your software (simulation, analysis) to interoperate with SDM center software?
- How do you see the role of the SDM center in your application domain?
  • Providing support for your SDM needs
  • Applying technology that enables new science
  • Jointly developing new technology
Slide 21: Applications Panel, part 2: Engaging the Sciences

Topics:
- Close collaborative projects
  • We believe it is necessary to work jointly with application scientists in order to apply SDM technology
  • How do we achieve joint activities: joint funding, good will, advertising, tutorials?
  • Do you expect some SDM technology to be packaged and used directly from downloads?
- Outreach
  • We believe that if we solve end-to-end use cases, the technology can spread to the science communities by example; do you agree?
  • Other ideas?
Slide 22: SDM Center Panel: Plans and Opportunities
Slide 23: Plans (Next Proposal)

Organizational issues
- Plan: same participants
  • Collaborations take time to build into teams
  • ISICs are encouraged to re-apply
  • Need to think about "steering the boat"
- What does each participant want to work on? (not to discuss now)
- Funding levels - the same?

Technical issues (next slides)
- Is the concept of end-to-end attractive?
- Is the concept of close collaboration, with scientists as "messengers of good news", attractive?
- What technologies?
- How do we work with the applications?
- How do we evolve - labs and universities?
- Do labs and universities have different roles?
Slide 24: Lessons Learned from Success Stories

What do we consider a success?
- Successful use of SDM technology by application scientists
  • Productivity of the scientist
  • Enabling new science that could not be done previously

What does it take to apply SDM technology? Stages of activities:
  • Development - of basic technology
  • Adaptation - to a specific application domain
  • Integration - into the application framework
  • Deployment - getting scientists to use the technology

Application interaction is needed in all these stages
  • Problems are complex - they require close collaboration
  • Requires the time commitment of an application scientist
  • Grid Collector example: funding half the time of an application scientist paid off
Slides 25-27 repeat slides 11-13 ("Vision: facilitating end-to-end data management" and "Lessons learned - technology", parts 1 and 2) for the second panel.
Slide 28: Engaging Application Science Communities

- Close collaboration is essential for success
- Technology adaptation to an application domain is the key to its use
- End-to-end solutions should be developed where appropriate
- Embed SDM solutions in other packages/frameworks:
  • Math (e.g. parallel I/O for AMR in APDEC)
  • Application analysis (e.g. ROOT in HEP)
  • Experiment frameworks (e.g. the STAR project)
Slide 29: Engaging Application Science Communities

- Need to fund application scientists for joint projects specialized to their application domains
  • Funds can be made available on an as-needed basis
  • Controlled by the SDM center through an advisory board
  • A funded application scientist can spread the word in his or her community (a joint effort with the SDM center)
- Be part of application proposals, so that their office funds the application scientist working with the SDM center, or even some center activities (e.g. a current attempt with PPPL)
Slide 30: The SDM Center and the Role of CS Basic Research

Stages in technology development: research, prototype, product, infrastructure

Role of CS basic research: research-to-prototype
- Can afford risky projects, since application scientists are not waiting for results
- Longer-term payoff (2-3 years)
- Very important - research technology is funneled into SciDAC

Role of the SDM center in SciDAC: prototype-to-product
- Low risk; apply technology that has already been prototyped
- Shorter-term payoff (1-2 years)

Role of the SDM center in application offices: product-to-infrastructure
- Software moves from "product" to "infrastructure" when sites start installing it by default
- Adoption of the software by key groups
- Requires jointly funded collaborations (including application people)
Slide 31: The SDM Center and Other ISICs

Math ISICs
- Identify joint activities, e.g. parallel I/O in APDEC

SC ISICs
- CCA technology for wrapping components to be used in workflows
- PERC technology to identify I/O bottlenecks
Slide 32: Summary

- SDM technology can successfully be applied across multiple scientific applications
- Close collaboration to establish end-to-end solutions is needed
  • Application scientists must be involved in incorporating the technology into existing frameworks and infrastructures (the "messengers of good news")
- We recommend a flexible funding structure to support application scientists on collaborative projects
  • Such funding should be in addition to the SDM center funding (about 20% of the center's funding level)
  • The funding structure would be managed by the SDM center using an advisory board
- We recommend that the center join application-side proposals when possible
  • Ability to use funds where the technology is needed
  • Need help from OASCR to make such connections