
AUTOMATING ENVIRONMENTAL COMPUTING APPLICATIONS WITH SCIENTIFIC WORKFLOWS

Rafael Ferreira da Silva¹, Ewa Deelman¹, Rosa Filgueira², Karan Vahi¹, Mats Rynge¹, Rajiv Mayani¹, Benjamin Mayer³

Environmental Computing Workshop (ECW'16)
Baltimore, MD – October 23rd, 2016

¹USC Information Sciences Institute
²University of Edinburgh
³Oak Ridge National Laboratory

OUTLINE

Introduction: Motivation, Scientific Computing, Experiment Execution
Scientific Workflows: Definition, Workflow Composition, Workflow Design
Workflow Management Systems: Workflow Systems, Workflow Execution
Pegasus WMS: Capabilities, Architecture, Heterogeneous Execution
Workflow Applications: CyberShake, Seismic Cross-Correlation, ACME
Summary: Conclusions, Future Research Directions

EXPERIMENT TIMELINE

[Figure: the lifecycle of a computational experiment]
Scientific Problem (Earth Science, Astronomy, Neuroinformatics, Bioinformatics, etc.)
→ Analytical Solution (Models, Quality Control, Image Analysis, etc.)
→ Computational Scripts (Shell scripts, Python, Matlab, etc.)
→ Automation (Workflows, MapReduce, etc.)
→ Distributed Computing (Clusters, HPC, Cloud, Grid, etc.)
→ Monitoring and Debug (Fault-tolerance, Provenance, etc.)
→ Scientific Result


EXPERIMENT EXECUTION

Scientific Computing: extract information from data produced by scientific instruments, e.g.:
- USArray Transportable Array
- LIGO Detectors
- SNS Detector

WHY SCIENTIFIC WORKFLOWS?

Automation
- Enables parallel, distributed computations
- Automatically executes data transfers

Reproducibility
- Reusable, aids reproducibility
- Records how data was produced (provenance)

Recover & Debug
- Handles failures to provide reliability
- Keeps track of data and files

SCIENTIFIC WORKFLOWS

[Figure: a workflow drawn as a DAG (directed acyclic graph), where nodes are jobs (command-line programs) and edges are dependencies (usually data dependencies)]

Representations
- Directed acyclic graphs (DAGs)
- Conditional statements: if-then-else, exception handling
- Loops: for, while
- Steering (human in the loop)
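As a toy illustration of the DAG representation, the sketch below (plain Python, assuming Python 3.9+ for graphlib; job names are illustrative) shows how a topological sort of the dependency graph yields a valid execution order:

import graphlib  # Python 3.9+

# map each job to the set of jobs it depends on (its data dependencies)
dag = {
    "preprocess": set(),
    "process_1": {"preprocess"},
    "process_2": {"preprocess"},
    "merge": {"process_1", "process_2"},
}

# a topological order respects every dependency; independent jobs
# (process_1, process_2) may run in parallel
order = list(graphlib.TopologicalSorter(dag).static_order())
print(order)  # e.g. ['preprocess', 'process_1', 'process_2', 'merge']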


JOBS & DEPENDENCIES

Some workflow systems require the implementation of the workflow mechanism as part of the application code.

A job often represents a program (or script) written in any programming language (black box).

Dependencies are typically based on the data flow, but they can also be expressed as conditions, exceptions, user-triggered actions, etc.


[Figure: a two-level DAG — preprocess.py produces trace_1.trace and trace_2.trace; one process.py job per trace file then produces trace_1.out and trace_2.out]

preprocess.py (splits the raw seismic stream into one pickled file per trace):

import pickle
from obspy.core import read

# read the raw stream and write each trace to its own file
stream = read('20100206_045515_012.BGLD')
for i, trace in enumerate(stream, start=1):
    with open('trace_%s.trace' % i, 'wb') as output:
        pickle.dump(trace, output)

process.py (run once per trace file; attaches instrument response and coordinates):

import pickle
import sys

from obspy.xseed import Parser

# load one pickled trace; the file name is passed as a command-line argument
with open(sys.argv[1], 'rb') as f:
    trace = pickle.load(f)

# enrich the trace with metadata from the dataless SEED volume
parser = Parser('dataless.seed.BW_BGLD')
trace.stats.paz = parser.getPAZ(trace.stats.channel)
trace.stats.coordinates = parser.getCoordinates(trace.stats.channel)

Dependencies define the order of execution.
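The same two-job pipeline can be handed to a workflow system by declaring the jobs and their data dependencies explicitly. A minimal sketch using the Pegasus DAX3 Python API of the era; the transformation names reuse the scripts above, and transfer/registration details are omitted:

from Pegasus.DAX3 import ADAG, File, Job, Link

dax = ADAG("seismic-example")

raw = File("20100206_045515_012.BGLD")
preprocess = Job(name="preprocess.py")
preprocess.uses(raw, link=Link.INPUT)
dax.addJob(preprocess)

for i in (1, 2):
    trace = File("trace_%d.trace" % i)
    preprocess.uses(trace, link=Link.OUTPUT)

    process = Job(name="process.py")
    process.addArguments(trace)
    process.uses(trace, link=Link.INPUT)
    process.uses(File("trace_%d.out" % i), link=Link.OUTPUT)
    dax.addJob(process)

    # data dependency: each process job runs only after preprocess finishes
    dax.depends(parent=preprocess, child=process)

with open("seismic.dax", "w") as f:
    dax.writeXML(f)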


WORKFLOW COMPOSITION AND DESCRIPTION

Abstract Workflow
- High-level abstraction languages
- No information about the underlying infrastructure
- Fosters reproducibility

Executable Workflow
- Workflow concretization maps the abstract workflow onto the execution infrastructure: adds stage-in/out jobs, cleanup jobs, clustering, etc.
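Concretization is performed by the planner rather than by user code. As a hedged sketch of the Pegasus 4.x command line of the era, planning the abstract workflow from the earlier example might look as follows; the site names are illustrative and must exist in the user's site catalog:

import subprocess

# pegasus-plan turns the abstract DAX into an executable workflow,
# adding stage-in/out, cleanup, and clustering for the chosen sites
subprocess.check_call([
    "pegasus-plan",
    "--dax", "seismic.dax",    # abstract workflow from the sketch above
    "--sites", "condorpool",   # execution site (illustrative)
    "--output-site", "local",  # where final outputs are staged
    "--submit",                # plan and submit in one step
])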


WORKFLOW PATTERNS AND DESIGN

GUI (drag-and-drop): easy mechanism to create workflows.
Challenge: visualizing and handling large-scale workflows.

Command-line: requires programming skills.
Advantage: easily scales to tackle large, complex problems.

WORKFLOW MANAGEMENT SYSTEMS


Taverna – https://taverna.incubator.apache.org
Kepler – https://kepler-project.org
Makeflow – http://ccl.cse.nd.edu/software/makeflow
dispel4py – http://dispel4py.org
Nextflow – https://www.nextflow.io
KNIME – https://www.knime.org
Swift – http://swift-lang.org
VisTrails – http://vistrails.org
FireWorks – https://pythonhosted.org/FireWorks
Pegasus – http://pegasus.isi.edu


WORKFLOW EXECUTION

Execution Environment
- User's environment: laptop or desktop machine
- Heterogeneous environments: clouds, cyberinfrastructures

General Behavior

Parameter sweep studies are often composed of independent small tasks that can significantly benefit from grid or cloud computing environments

Parallel applications often require specialized systems such as clusters and supercomputers (for large scale applications)

[Figure: an example multi-site workflow — split.sh feeds three parallel process.R jobs (the parameter sweep analysis, run on grid or cloud computing); their outputs are merged by an MPI application, merge.c, on an HPC cluster; visualization.py then displays the results locally; data transfers move files between the sites]
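To see why the sweep portion parallelizes so well, here is a toy local sketch (plain Python, independent of any workflow system): each parameter value is an independent task, so a worker pool, or a WMS across grid or cloud nodes, can run them concurrently.

from concurrent.futures import ProcessPoolExecutor

def simulate(param):
    # stand-in for one process.R task: a pure function of its parameter
    return param ** 2

if __name__ == "__main__":
    params = range(8)
    # no ordering constraints among the tasks, so they run in parallel;
    # a WMS distributes the same pattern across distributed resources
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, params))
    print(sum(results))  # stand-in for the merge step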


PEGASUS WMS

Automation
- Automates pipeline executions
- Parallel, distributed computations; automatically executes data transfers
- Heterogeneous resources; task-oriented model
- The application is seen as a black box

Debug
- Workflow execution and job performance metrics
- Set of debugging tools to unveil issues
- Real-time monitoring, graphs, provenance

Optimization
- Job clustering
- Data cleanup

Recovery
- Job failure detection
- Checkpoint files
- Job retry
- Rescue DAGs

PEGASUS ARCHITECTURE

HETEROGENEOUS ENVIRONMENTS

[Figure: Pegasus deployment across heterogeneous resources — a submit host (e.g., the user's laptop) coordinates two compute sites: compute site A with a shared filesystem and compute site B backed by object storage; each site is connected to its own input data site, data staging site, and output data site]

MONITORING

Pegasus dashboard: a web interface for monitoring and debugging workflows. It provides real-time monitoring of workflow executions, showing the status of workflows and jobs, job characteristics, statistics, and performance metrics. Provenance data is stored in a relational database.

Features: real-time monitoring, reporting, debugging, troubleshooting, RESTful API.
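Workflows can also be inspected from the command line. A brief sketch using standard Pegasus tools (the run directory path is hypothetical):

import subprocess

rundir = "/home/user/pegasus/runs/seismic-0001"  # hypothetical run directory

subprocess.check_call(["pegasus-status", "-l", rundir])             # live job status
subprocess.check_call(["pegasus-analyzer", rundir])                 # diagnose failed jobs
subprocess.check_call(["pegasus-statistics", "-s", "all", rundir])  # summary metrics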

FINE-GRAINED WORKFLOWS ON HPC SYSTEMS

From the submit host (e.g., the user's laptop), the workflow is wrapped as an MPI job, which allows sub-graphs of a Pegasus workflow to be submitted as monolithic jobs to remote HPC resources.

Builders ask seismologists: What will the peak ground motion be at my new building in the next 50 years?

Seismologists answer this question using Probabilistic Seismic Hazard Analysis (PSHA)


CYBERSHAKE

CyberShake is high-performance computing software that uses 3D waveform modeling to calculate PSHA estimates for populated areas of California.

CYBERSHAKE WORKFLOW

Phase 1
- Constructs and populates a 3D mesh of ~1.2 billion elements with seismic velocity data to compute Strain Green Tensors (SGTs)
- SGT simulations use parallel wave propagation (requiring ~4,000 processors)

Phase 2
- Computes individual contributions from over 400,000 different earthquakes
- Constructs a hazard map (~100 million loosely coupled jobs)

CYBERSHAKE WORKFLOW EXECUTION

Facts
- 336 CyberShake workflows, 1.4 million node-hours, 500 TB of data
- Executed on Titan and Blue Waters
- Job mix spans single-core jobs, ~4,000-core MPI jobs, ~800 GPU jobs, and ~3,800-core MPI jobs

SEISMIC AMBIENT NOISE CROSS-CORRELATION


Preprocesses and cross-correlates traces (sequences of measurements of acceleration in three dimensions) from multiple seismic stations (IRIS database).

Phase 1 (pre-process): data preparation, using statistics to extract information from the noise.

Phase 2 (cross-correlation): computes correlations, identifying the time for signals to travel between stations, from which properties of the intervening rock are inferred.
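A toy sketch of the Phase 2 computation (plain NumPy, not the workflow's actual code): the lag at which the cross-correlation of two preprocessed traces peaks estimates the travel time of noise-derived signals between the stations.

import numpy as np

def cross_correlate(a, b, dt):
    # full cross-correlation of two equal-length, equal-rate traces;
    # with numpy's convention the peak lag is negative when b is
    # delayed relative to a
    xcorr = np.correlate(a, b, mode="full")
    lags = np.arange(-(len(b) - 1), len(a)) * dt
    return lags[np.argmax(xcorr)], xcorr

# toy usage: b is a copy of a delayed by 50 samples,
# so the peak lag comes out near -0.5 s
dt = 0.01
a = np.random.randn(1000)
b = np.roll(a, 50)
lag, _ = cross_correlate(a, b, dt)
print(lag)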


SEISMIC WORKFLOW

The workflow is implemented using a hybrid approach to enable the execution of data-intensive, stream-based workflow applications across different e-Infrastructures.

Single Run
- The workflow requests one hour of data from IRIS services (USArray TA, data from 394 stations)
- Preprocess phase: 8 minutes (OpenMPI cluster)
- Cross-correlation phase: 2 hours (Apache Storm cluster)

Stream-based
- The workflow reads data from IRIS services every 2 hours


[Figure: hybrid deployment across three Docker containers — container 1 runs Pegasus and dispel4py, which take the Pegasus workflow (DAX) and automatically translate the dispel4py preprocess workflow into an MPI application on container 2 (OpenMPI cluster), and the dispel4py postprocess (cross-correlation) workflow into a Storm topology on container 3 (Storm cluster). Per-trace preprocessing steps: decimate, detrend, demean, remove instrument response, filter, whiten, compute FFT. The cross-correlation stage reads the preprocessed data for the list of stations, computes the cross-correlations, and writes the results.]

SEISMIC WORKFLOW EXECUTION

[Figure: execution across sites — from the submit host (e.g., the user's laptop), input data (~150 MB) is fetched from the IRIS database (stations) to compute site A (MPI-based) for Phase 1; the preprocessed data then moves to compute site B (Apache Storm) for Phase 2, producing ~40 GB of output data; all data transfers between sites are performed by Pegasus]

ACME WORKFLOW

Climate and Earth System Modeling: coupled climate models with ocean, land, atmosphere, and ice.

ACME Workflow

Each stage of the workflow runs the ACME model for a few time steps, which helps keep simulations within batch queue limits.

Running on Cori @ NERSC, Titan @ OLCF, and Mira @ ANL (in progress).

A simple workflow execution on Titan (OLCF) with four different parameter values, for a number of simulation time units (5 years per stage), consumes over 50,000 CPU hours.


[Figure: the staged ACME workflow — the model runs in 5-year stages (Year 1-5, Year 6-10, ..., Year 36-40), and each stage feeds a climatologies computation executed on CADES]
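A minimal sketch of this staging pattern using the Pegasus DAX3 Python API; the executable names ("acme-model", "climatologies") and argument style are illustrative, not the actual ACME workflow code:

from Pegasus.DAX3 import ADAG, Job

dax = ADAG("acme-staged")
prev = None
for start in range(1, 40, 5):          # years 1-5, 6-10, ..., 36-40
    stage = Job(name="acme-model")     # one 5-year model segment
    stage.addArguments("--years", "%d-%d" % (start, start + 4))
    dax.addJob(stage)

    climo = Job(name="climatologies")  # post-processing for this segment
    dax.addJob(climo)
    dax.depends(parent=stage, child=climo)

    if prev is not None:
        # each segment restarts from the previous one, so the chain
        # respects batch queue limits while covering 40 simulated years
        dax.depends(parent=prev, child=stage)
    prev = stage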


SUMMARY

Conclusions and Future Research Directions

Workflows have been used by several science domains to efficiently manage their computations on a number of different resources

We have presented an introduction to scientific workflows, emphasizing their main properties and how scientists may model their experiments as workflows

Although workflows have so far been used by only a small group of environmental scientists, we strongly believe that many more researchers can benefit from these technologies.


AUTOMATING ENVIRONMENTAL COMPUTING APPLICATIONS WITH SCIENTIFIC WORKFLOWS

Rafael Ferreira da Silva, Ph.D.
Research Assistant Professor
Department of Computer Science
University of Southern California
[email protected] – http://rafaelsilva.com

Thank You

Questions?

http://pegasus.isi.edu