What do we need to manage end-to-end scientific workflows for efficiency and productivity?
Lavanya Ramakrishnan ([email protected])


Page 1: What do we need to manage end-to-end scientific workflows for efficiency and productivity?

Lavanya Ramakrishnan
[email protected]

Page 2: (CS-Biased) View of Workflow Challenges: LEAD North American Mesoscale (NAM) Forecast (2009)

[Figure: LEAD NAM forecast workflow. Components: Terrain PreProcessor, WRF Static, Lateral Boundary Interpolator, ADAS Interpolator, ARPS2WRF, and WRF. Per-step runtimes range from 4 secs to 4570 secs (the longest step runs on 16 processors); data sizes range from 0.2 MB to 2422 MB.]

• Mostly simple workflows
• Repetitive, iterative, parametric studies
• Provenance
• Workflow and data sharing
• Mix of single-core and multi-core jobs managed by a service-based architecture

In practice, people use ad-hoc scripts, keep notes in text files, and encode metadata in file names.
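The ad-hoc practice mentioned above, encoding run metadata in file names, can be sketched as follows. The naming scheme, field names, and example file name are hypothetical, for illustration only:

```python
import re
from datetime import datetime

# Hypothetical naming convention: <model>_<domain>_<YYYYMMDDHH>_<member>.nc
FILENAME_PATTERN = re.compile(
    r"(?P<model>[A-Za-z0-9]+)_(?P<domain>[A-Za-z0-9]+)_"
    r"(?P<cycle>\d{10})_(?P<member>\d+)\.nc$"
)

def parse_metadata(filename):
    """Recover run metadata that was encoded in a file name."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename}")
    meta = match.groupdict()
    meta["cycle"] = datetime.strptime(meta["cycle"], "%Y%m%d%H")
    meta["member"] = int(meta["member"])
    return meta

print(parse_metadata("wrf_nam_2009061500_3.nc"))
```

This works until the naming convention drifts, which is exactly why the talk argues for tools that manage metadata explicitly instead.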

Page 3:

DDM: Dula’s Data Management (2012)

• Way too many hard drives
• Multiple hours per week spent manually copying/erasing data
• EVERY BEAMLINE FOR ITSELF!

Advanced Light Source

Page 4:

But haven’t we solved the workflow and data management problems?

What about all the workflow tools out there?

Page 5: Downloading and pre-processing data can be complex

[Figure: MODIS data pipeline. A NASA FTP server (metadata and data) feeds data transfer nodes and batch queue nodes through an FTP queue and a job queue; output data and metadata land in the HPSS file system and a local metadata catalog. Steps: determine the files to download for a tile, create a file download record, create a job record, and finally access the re-projected data.]

http://modis.nersc.gov
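The record-driven pattern in this pipeline (one download record per remote file, one job record per tile, each moving through its own queue) can be sketched minimally as follows. The record fields, tile naming, and file-listing stub are hypothetical, not the actual MODIS pipeline code:

```python
from collections import deque

def files_for_tile(tile):
    """Determine which remote files a tile needs (stub for illustration)."""
    return [f"{tile}_band{b}.hdf" for b in (1, 2)]

def build_queues(tiles):
    """Create one download record per remote file and one job record per tile."""
    ftp_queue, job_queue = deque(), deque()
    for tile in tiles:
        for fname in files_for_tile(tile):
            # Download records are consumed by the data transfer nodes.
            ftp_queue.append({"tile": tile, "file": fname, "state": "pending"})
        # Job records wait until all of the tile's downloads have completed,
        # then run re-projection on the batch queue nodes.
        job_queue.append({"tile": tile, "state": "waiting"})
    return ftp_queue, job_queue

ftp_q, job_q = build_queues(["h08v05"])
print(len(ftp_q), len(job_q))  # 2 download records, 1 job record
```

Keeping persistent records (rather than in-memory state) is what lets a pipeline like this resume after transfer or job failures.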

Page 6: AmeriFlux data processing pipeline for Net Ecosystem Exchange (NEE): atmosphere-biosphere interactions in climate models

Pipeline stages:
1. Data Collection (high-frequency and meteorological data)
2. Pre-Processing and Sensor Calibration
3. Processing into Fluxes
4. Post-Processing and Product Generation: QA/QC flagging and visual checks, Ustar calculation and filtering, gapfilling, flux partitioning, generation of data products
5. Synthesis Studies, Models, Simulations

The structure of the workflow does not capture the data complexities.

http://ameriflux.lbl.gov
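The stages listed above form a linear pipeline where each stage's output feeds the next. A generic sketch of composing such stages (stage names follow the slide; the implementations are trivial stubs, since real AmeriFlux processing is far more involved):

```python
# Chain the NEE processing stages as a linear pipeline of functions.

def qa_qc_flagging(data):
    return [x for x in data if x is not None]   # drop flagged/missing values

def ustar_filtering(data):
    return data                                  # placeholder stub

def gapfilling(data):
    return data                                  # placeholder stub

def flux_partitioning(data):
    return data                                  # placeholder stub

def run_pipeline(data, stages):
    for stage in stages:
        data = stage(data)                       # output of one stage feeds the next
    return data

stages = [qa_qc_flagging, ustar_filtering, gapfilling, flux_partitioning]
result = run_pipeline([1.2, None, 3.4], stages)
print(result)  # [1.2, 3.4]
```

The slide's point is precisely that this clean linear structure hides the hard part: the data-quality decisions and versioned intermediate products inside each stage.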

Page 7: Networking paths are still pretty complex: Daya Bay networking

• Relay Path: requires daisy-chaining two data transfers (default)
• Direct Path: requires two data transfers out of the site
• DayaNet: dedicated 150 Mbps optical link
• CSTNet: Chinese national network
• GLORIAD: trans-Pacific scientific network (NSF)
• ASGC: trans-Pacific eScience network (fallback for GLORIAD)
• ESnet: US national Energy Sciences Network
• Hot-swappable disk transport between Daya Bay and Hong Kong in case of long-term failure of either DayaNet or CSTNet

http://dayabay.ihep.ac.cn/

Page 8: Failures of all magnitudes can occur

[Same Daya Bay networking figure and path descriptions as Page 7.]

Page 9: Real-time data processing presents additional challenges

[Figure: Framework for real-time beamline analysis. A beamline user's experiment sends images through a data transfer node into HPC storage and archive; a data pipeline and a prompt analysis pipeline run on HPC compute resources using metadata and reference data, with control and feedback flowing back to a remote user.]

Page 10: How are we going to manage simulation and analysis workflows? (NERSC Cori System)

[Figure: Cori system diagram (Cray Cascade, 64 cabinets). Major components: 9,304 KNL compute nodes; 1,920 Haswell compute nodes; 384 Burst Buffer nodes; 14 esLogin nodes; 2 esMS nodes; 28 MOM nodes; 32 DSL nodes; 32 DVS server nodes; 68 LNET routers; 10 RSIP network nodes; 4 network nodes; 2 boot and 2 SDB nodes; redundant core IB switches; a parallel file system with 12 scalable units (2 MGS, 4 MDS, and 96 OSS nodes on DDN WolfCreek and 5 NetApp E2700 arrays); FDR InfiniBand and 40 GigE links to the NERSC network, plus 32 FDR IB links to NGF.]

Page 11:

It takes a village to build and run pipelines …

• Application Scientists (and Students and Postdocs)
  – Know the use cases
  – Write application codes
  – Often write some scripts
  – Are the users who run the final workflow
• Workflow Developers
  – Compose workflows
• System Integrators
  – Write middleware/software pieces
  – Computational model experts
• System/Facility Experts
  – HPC center and ESnet staff

These roles are not nearly as well defined in practice.

Page 12:

Tigres: a workflow system “library”/“toolkit”

Page 13:

What do we need for end-to-end pipelines?

• Workflow tools need to be more than what runs on clusters/HPC
  – Data transfer, data processing, and storage environments
  – Integral to scientific processes
• User/project requirements are changing, and users have to play a key role in tool development
  – Scientists should focus on science/algorithms
  – It is not practical to have a workflow developer for every scientist
• We need to think beyond our individual boxes
  – Data, workflow, network, and resource management have to happen in conjunction
• Efficiencies (performance, energy, …), reliability, and productivity are only becoming more important

Even CS folks need workflow tools (e.g., Prabhat)!

Page 14: Tigres: design templates for common scientific workflow patterns

[Figure: base Tigres templates are specialized into “LightSrc” domain templates, which scale up into applications “LightSrc-1” and “LightSrc-2”; the cycle is create and debug, share, then create and debug again.]

Implement templates as a library in an existing language.

An early (friendly) release is now available! http://tigres.lbl.gov

Page 15:

Key Aspects of Tigres

• Targeted at large-scale, data-intensive workflows
  – Motivated by the “MapReduce” model
  – No centralized management model
• Library model embedded in existing languages such as Python and C
  – “Extend current scripting/programming tools”
  – API-based, embedded in code
• Lightweight execution framework
  – “As easy to run as an MPI program on an HPC resource”
  – No persistent services
• Scientist-centered design process
  – Get feedback from users before writing all the code

Page 16: Tigres Templates

[Figure: the four base templates. Sequence chains Task1 through TaskN one after another; Parallel runs Task1 through TaskN side by side; Split fans a single task out into Task1 through TaskN; Merge fans Task1 through TaskN back into a single task.]

Page 17:

Templates

• Sequence(name, task_array, input_array)
  – e.g., output[] = Sequence(“my seq”, task_array_12, input_array_12)
• Parallel(name, task_array, input_array)
  – e.g., output[] = Parallel(“abc”, task_array_12, input_array_12)
• Split(name, split_task, split_input_values, task_array, task_array_in)
  – e.g., Split(“split”, task_x1, input_value_1, spl_t_arr, spl_i_arr)
• Merge(name, task_array, input_array, merge_task, merge_input_values)
  – e.g., Merge(“merge”, syn_t_arr, syn_i_arr, task_x1, input_value_1)
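To make the template semantics concrete, here is a plain-Python sketch of what Sequence and Parallel compute. The signatures mirror the slide conceptually, but this is NOT the actual Tigres API, just an illustration of the pattern:

```python
from concurrent.futures import ThreadPoolExecutor

def sequence(name, task_array, input_array):
    """Run tasks one after another; each task's output feeds the next task."""
    value = input_array
    for task in task_array:
        value = task(value)
    return value

def parallel(name, task_array, input_array):
    """Run task[i] on input[i] concurrently; return outputs in task order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda pair: pair[0](pair[1]),
                             zip(task_array, input_array)))

double = lambda x: x * 2
inc = lambda x: x + 1
print(sequence("my seq", [double, inc], 5))    # (5 * 2) + 1 = 11
print(parallel("abc", [double, inc], [3, 4]))  # [6, 5]
```

Split and Merge combine these two: Split is one task followed by a Parallel; Merge is a Parallel followed by one task.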

Page 18:

Scientist-Centered Design Process

[Figure: the scientist-centered design process connects design, the API, implementation, optimizations, and the execution environment.]

Requirements gathering is not the same as usability studies.

Page 19:

Impact of Scientist-Centered Design

• Concept understanding by users
• Changes to nomenclature
• Support in C also important
• Priorities for the first prototype: desktop-to-NERSC execution, monitoring, and intermediate state management

[Figure: same design-process diagram as Page 18.]

It took days rather than weeks (or months) for the first stub implementation!

Page 20:

Summary

•  User-centered design processes are vital for development of next-generation tools

•  We need libraries/toolkits that can be customized for specific needs

•  Next-generation workflow/software ecosystems need to be holistic

Page 21:

Acknowledgements

• Tigres Team
  – Deb Agarwal (PI), Lavanya Ramakrishnan, Daniel Gunter
  – Gilberto Pastorello, Valerie Hendrix, Ryan Rodriguez
• NERSC/CRD
  – John Shalf, Nicholas Wright, Christopher Daley
• Science Use Cases
  – Daniel Chivers, John Kua, Michael Quinlan
  – Craig Tull, Dula Parkinson, Gilberto Pastorello
  – …and many others

[email protected]
