
Page 1

Pegasus: Running Large-Scale Scientific Workflows on the TeraGrid

Ewa Deelman

USC Information Sciences Institute

http://pegasus.isi.edu www.isi.edu/~deelman

Page 2

Acknowledgements

Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi (Center for Grid Technologies, ISI)

James Blythe, Yolanda Gil (Intelligent Systems Division, ISI)

Research funded as part of the NSF GriPhyN, NVO, and SCEC projects, the NIH-funded CRCNS project, and the EU-funded GridLab project.

Thanks for the use of the TeraGrid

Page 3

Outline

Applications as workflows

Pegasus (Planning for Execution in Grids)

Montage application (Astronomy, NSF & NASA)

CyberShake (Southern California Earthquake Center)

Results from running on the TeraGrid

Conclusions

Page 4

Today’s Scientific Applications

Applications are increasing in complexity:
  Use of individual application components
  Components are supplied by various individuals
  Reuse of individual intermediate data products (files)

The execution environment is complex and very dynamic:
  Resources come and go
  Data is replicated
  Components can be found at various locations or staged in on demand

Separation between the application description and the actual execution description

Applications are being described in terms of workflows

Page 5

Scientific Analysis Workflow Evolution

[Figure: the stages of workflow evolution. Constructing the analysis produces a workflow template; selecting the input data produces an abstract workflow; mapping the workflow onto available resources produces an executable workflow; executing the workflow produces the tasks that run on Grid resources.]
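To make the successive refinements concrete, here is a small, purely illustrative sketch in Python; this is not Pegasus's actual data model, and the task and file names are made up:

```python
# Illustrative only: what a workflow template, an abstract workflow, and an
# executable workflow each pin down.

# Workflow template: logical transformations and their dependencies, no data bound.
workflow_template = {
    "extract": {"consumes": [], "produces": ["raw"]},
    "analyze": {"consumes": ["raw"], "produces": ["result"]},
}
dependencies = [("extract", "analyze")]

# Selecting the input data yields the abstract workflow: logical file names are
# fixed, but nothing yet says where the data live or where the tasks will run.
abstract_workflow = {
    "extract": {"consumes": ["survey_image.fits"], "produces": ["raw"]},
    "analyze": {"consumes": ["raw"], "produces": ["result"]},
}

# Mapping (Pegasus's role, described on the following slides) would attach
# concrete choices to each task, e.g. {"site": "ncsa", "executable": "/usr/bin/analyze"},
# plus the data-movement jobs needed to get files to and from that site.
```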

Page 6

Execution Environment

[Figure: the same workflow evolution (analysis construction → workflow template; input data selection → abstract workflow; mapping onto available resources → executable workflow; execution → tasks on Grid resources) placed in the execution environment. A library of application components supplies component characteristics, data catalogs supply data properties, and information services supply resource availability and characteristics. Each step can be user-guided or automated.]

Page 7

Executable Workflow Generation and Mapping

[Figure: several application-dependent routes produce an abstract workflow: intelligent workflow composition tools (WINGS and CAT, used for natural language processing), the Virtual Data Language (VDL, used by GTOMO, HEP, biology, and others), and application-specific abstract workflow services (LIGO, SCEC, Montage). The application-independent part of the system, Pegasus and Condor DAGMan, turns the abstract workflow into an executable workflow, runs the jobs on Grid resources, and returns the results. WINGS and CAT were developed at ISI by Y. Gil; VDL was developed at ANL and UofC by I. Foster, J. Voeckler, and M. Wilde.]

Page 8

Pegasus: Planning for Execution in Grids

Maps from an abstract to an executable workflow

Automatically locates physical locations for both workflow components and data

Finds appropriate resources to execute the components

Reuses existing data products where applicable

Publishes newly derived data products

Provides provenance information

Page 9

Information Components used by Pegasus

Globus Monitoring and Discovery Service (MDS), or a static file
  Locates available resources
  Finds resource properties
    Dynamic: load, queue length
    Static: location of GridFTP server, RLS, etc.

Globus Replica Location Service (RLS)
  Locates data that may be replicated
  Registers new data products

Transformation Catalog
  Locates installed executables
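The roles of the three catalogs can be mimicked with a small, hedged Python sketch; the dictionaries and the site-selection helper below are invented for illustration and are not the real MDS, RLS, or Transformation Catalog interfaces:

```python
# Illustrative stand-ins for the information sources Pegasus consults.
# (Real deployments query Globus MDS, the Globus RLS, and the Transformation
# Catalog; these dictionaries only mimic the kind of answers they return.)

site_info = {                          # role of MDS (or a static site file)
    "ncsa": {"load": 0.4, "queue_length": 12, "gridftp": "gsiftp://ncsa.example.org"},
    "sdsc": {"load": 0.9, "queue_length": 85, "gridftp": "gsiftp://sdsc.example.org"},
}

replica_catalog = {                    # role of the RLS: logical file -> physical replicas
    "b": ["gsiftp://siteA.example.org/data/b"],
}

transformation_catalog = {             # role of the TC: (transformation, site) -> installed binary
    ("d2", "ncsa"): "/usr/local/bin/d2",
}

def pick_site(transformation):
    """Choose a site that has the executable installed, preferring the least-loaded one."""
    candidates = [site for (name, site) in transformation_catalog if name == transformation]
    return min(candidates, key=lambda s: site_info[s]["load"]) if candidates else None

print(pick_site("d2"))   # -> 'ncsa'
```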

Page 10

Example Workflow Reduction

Original abstract workflow: a → d1 → b → d2 → c

If "b" already exists (as determined by a query to the RLS), the workflow can be reduced to b → d2 → c

Reduction is also useful in case of failures
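A minimal sketch of the reduction step, under the simplifying assumption that a job can be dropped when all of its outputs are already registered and nothing that still runs needs it to regenerate them (illustrative Python, not the actual Pegasus code):

```python
def reduce_workflow(jobs, deps, existing):
    """Return the job names that still have to run.

    jobs:     {name: {"inputs": [...], "outputs": [...]}}
    deps:     list of (parent, child) edges
    existing: set of logical files already registered (e.g. the result of an RLS query)
    """
    # Simple topological sort so children are examined before their parents.
    order, remaining = [], dict(jobs)
    while remaining:
        ready = [n for n in remaining
                 if not any(p in remaining for p, c in deps if c == n)]
        order.extend(ready)
        for n in ready:
            del remaining[n]

    keep, still_needed = set(), set()
    for name in reversed(order):                   # leaves first
        outputs = set(jobs[name]["outputs"])
        if (outputs - existing) or (outputs & still_needed):
            keep.add(name)
            still_needed |= set(jobs[name]["inputs"]) - existing
    return keep

# The example from this slide: a -> d1 -> b -> d2 -> c, with "b" already in the RLS.
jobs = {"d1": {"inputs": ["a"], "outputs": ["b"]},
        "d2": {"inputs": ["b"], "outputs": ["c"]}}
print(reduce_workflow(jobs, [("d1", "d2")], existing={"b"}))   # -> {'d2'}
```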

Page 11

Mapping from abstract to executable

Query the RLS, MDS, and TC; schedule the computation and the data movement.

For the reduced workflow (b → d2 → c), the executable workflow becomes:

Move b from A to B

Execute d2 at B

Move c from B to U

Register c in the RLS
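A rough sketch of what that scheduling step produces, assuming one compute job plus the surrounding stage-in, stage-out, and registration jobs (illustrative Python; the catalogs, site names, and paths are made up, and this is not the real Pegasus planner):

```python
# Expand a single compute task of the reduced workflow into the four concrete
# jobs described on this slide.

replica_catalog = {"b": ("A", "gsiftp://siteA.example.org/data/b")}   # logical file -> (site, URL)
transformation_catalog = {("d2", "B"): "/usr/local/bin/d2"}           # (task, site) -> binary
user_site = "U"

def map_task(task, inputs, outputs, site):
    executable = transformation_catalog[(task, site)]
    concrete_jobs = []
    for lfn in inputs:                                    # stage-in jobs
        src_site, _src_url = replica_catalog[lfn]
        if src_site != site:
            concrete_jobs.append(("transfer", lfn, f"{src_site} -> {site}"))
    concrete_jobs.append(("execute", task, f"{executable} at {site}"))
    for lfn in outputs:                                   # stage-out and registration jobs
        concrete_jobs.append(("transfer", lfn, f"{site} -> {user_site}"))
        concrete_jobs.append(("register", lfn, "RLS"))
    return concrete_jobs

for job in map_task("d2", inputs=["b"], outputs=["c"], site="B"):
    print(job)
# ('transfer', 'b', 'A -> B')
# ('execute', 'd2', '/usr/local/bin/d2 at B')
# ('transfer', 'c', 'B -> U')
# ('register', 'c', 'RLS')
```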

Page 12

Mosaic of M42 created on TeraGrid resources using Pegasus

Pegasus improved the runtime of this application by 90% over the baseline case

Workflow with 4,500 nodes

Bruce Berriman, John Good (Caltech); Joe Jacob, Dan Katz (JPL)

Gurmeet Singh, Mei Su (ISI)

Page 13

Small Montage Workflow

~1200 nodes

Page 14

Montage

[Figure: Montage portal architecture, with the components as labeled in the original diagram: a user portal that takes a region name and size in degrees; an abstract workflow service (mDAGFiles, JPL); a 2MASS image list service (m2MASSList, IPAC); a grid scheduling and execution service (mGridExec, ISI) in which Pegasus turns the abstract workflow into a concrete workflow run by Condor DAGMan; a user notification service (mNotify, IPAC); and the computational Grid itself: the TeraGrid clusters at SDSC and NCSA plus the ISI Condor pool.]

Initial prototype implemented and tested on the TeraGrid

Montage performance evaluations

Production Montage portal open to the astronomy community this year

Collaboration with JPL & IPAC

Page 15

SCEC: derive probabilistic hazard curves and maps for the Los Angeles area: 6 sites in 2005, 625 in 2006, and 10,000 in 2007

Probability of a certain ground motion during a certain period of time

Hazard Map

Page 16

SCEC workflows on the TeraGrid

[Figure: the SCEC workflow lifecycle. Using resource descriptions, the steps are: provision the resources; map the workflow onto the Grid resources, producing an executable workflow; run the executable workflow on the Grid resources as tasks; and record information about the tasks.]

Page 17

SCEC Workflows on the TeraGrid

[Figure: software used for the SCEC runs. Condor Glide-in (with the help of the TeraGrid staff) provisions the resources; Condor's DAGMan (University of Wisconsin-Madison), running on a local machine, executes the workflow's tasks; and VDS Kickstart with the Provenance Tracking Catalog (PTC) (Jens Voeckler and Mike Wilde, UofC and ANL) records information about each task. Gaurang Mehta at ISI ran the experiments.]

Page 18

SCEC computations so far

Pasadena: 33 workflows

USC: 26 workflows

Each workflow: between 11 and 1,000 jobs

23 days total runtime on the NCSA & SDSC TeraGrid clusters

Failed job recovery: retries and rescue DAGs

Total number of jobs: 261,823

[Chart: number of jobs (0 to 100,000) by type: PeakValCalc_Okaya, SeismogramGen_Li, data transfer, data registration, and failed jobs.]

Page 19

So far, 2 SCEC sites are done (Pasadena and USC)

Number of jobs per day over the 23 days (261,823 jobs total) and number of CPU hours per day (15,706 hours total, about 1.8 years):

[Chart: jobs (JOBS) and CPU hours (HRS) per day, on a log scale from 1 to 100,000, for each day from 10/19 to 11/10.]

Page 20

Distribution of seismogram jobs

[Chart: histogram of seismogram job runtimes: number of jobs (log scale, 1 to 100,000) versus runtime in minutes (10 to 4,200 minutes; the longest jobs ran for about 70 hours).]

Page 21

Observations from working with the scientists

It is a two-way street: they give us feedback on our technologies, and we show them how things run (and break) at scale

We have seen great performance improvements in the codes

[Chart: execution sites. For each site (local, NCSA, SDSC), the number of jobs (NUM JOBS) and the number of days (DAYS), on a log scale from 1 to 1,000,000.]

Page 22

Some other Pegasus Application Domains

Laser Interferometer Gravitational-Wave Observatory (LIGO)

Galaxy morphology (NVO)

Tomography for neural structure reconstruction (NIH)

High-energy physics

Gene alignment

Natural language processing

Page 23

LIGO has used Pegasus to run on the Open Science Grid at SC’05

Courtesy of David Meyers, Caltech

Page 24

Benefits of the workflow & Pegasus approach

Pegasus can run the workflow on a variety of resources

Pegasus can run a single workflow across multiple resources

Pegasus can opportunistically take advantage of available resources (through dynamic workflow mapping)

Pegasus can take advantage of pre-existing intermediate data products

Pegasus can improve the performance of the application

Page 25

Benefits of the workflow & Pegasus approach

Pegasus shields the user from the Grid details

The workflow exposes the structure of the application and its maximum parallelism

Pegasus can take advantage of that structure to:
  Set a planning horizon (how far into the workflow to plan)
  Cluster a set of workflow nodes to be executed as one, for performance (a clustering sketch follows below)
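A minimal sketch of one such clustering strategy, assuming, purely for illustration, that independent nodes at the same depth of the DAG can be merged into a single submitted job to cut per-job overhead (this is not the actual Pegasus clustering code):

```python
from collections import defaultdict

def cluster_by_level(nodes, deps, max_cluster_size=3):
    """Group independent tasks at the same DAG depth into clusters.

    nodes: iterable of task names
    deps:  list of (parent, child) edges
    Returns a list of clusters; each cluster would be submitted as one job.
    """
    depth = {}

    def level(n):                      # depth = longest path from any root
        if n not in depth:
            parents = [p for p, c in deps if c == n]
            depth[n] = 1 + max((level(p) for p in parents), default=-1)
        return depth[n]

    by_level = defaultdict(list)
    for n in nodes:
        by_level[level(n)].append(n)

    clusters = []
    for lvl in sorted(by_level):
        tasks = by_level[lvl]
        for i in range(0, len(tasks), max_cluster_size):
            clusters.append(tasks[i:i + max_cluster_size])
    return clusters

# Example: many small, independent seismogram tasks fanning out of one setup task.
nodes = ["setup"] + [f"seis{i}" for i in range(6)]
deps = [("setup", f"seis{i}") for i in range(6)]
print(cluster_by_level(nodes, deps))
# [['setup'], ['seis0', 'seis1', 'seis2'], ['seis3', 'seis4', 'seis5']]
```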

Page 26

Pegasus Research

Resource discovery and assessment

Resource selection

Resource provisioning

Workflow restructuring: tasks merged together or reordered to improve overall performance

Adaptive computing: workflow refinement adapts to the changing execution environment

Workflow debugging

Page 27

Software releases

Pegasus (http://pegasus.isi.edu) is released as part of the GriPhyN Virtual Data System (VDS)

Collaborators in VDS: Ian Foster (ANL), Mike Wilde (ANL), and Jens Voeckler (UofC)

http://vds.isi.edu