Optimizing for Time and Space in Distributed Scientific Workflows

Page 1: Optimizing for Time and Space in Distributed Scientific Workflows

Ewa Deelman, [email protected] www.isi.edu/~deelman pegasus.isi.edu

Optimizing for Time and Space in Distributed Scientific Workflows

Ewa Deelman

University of Southern California

Information Sciences Institute

Page 2: Optimizing for Time and Space in Distributed Scientific Workflows

Generating mosaics of the sky (Bruce Berriman, Caltech)

Mosaic size (deg. sq.)*   Number of jobs   Input data files   Intermediate files   Total data footprint   Approx. execution time (20 procs)
1                         232              53                 588                  1.2 GB                 40 mins
2                         1,444            212                3,906                5.5 GB                 49 mins
4                         4,856            747                13,061               20 GB                  1 hr 46 mins
6                         8,586            1,444              22,850               38 GB                  2 hrs 14 mins
10                        20,652           3,722              54,434               97 GB                  6 hrs

*The full moon is 0.5 deg. sq. when viewed from Earth; the full sky is ~400,000 deg. sq.

[Figure: Montage workflow, with Project, Diff, Fitplane, BgModel, Background, and Add tasks operating on input images Image1, Image2, Image3]

Page 3: Optimizing for Time and Space in Distributed Scientific Workflows

Issue: How to manage such applications on a variety of distributed resources?

Approach:
  Structure the application as a workflow
  Allow the scientist to describe the application at a high level
  Tie the execution resources together using Condor and/or Globus
  Provide an automated mapping and execution tool to map the high-level description onto the available resources

Page 4: Optimizing for Time and Space in Distributed Scientific Workflows

Specification: Place Y = F(x) at L

  Find where x is: {S1, S2, ...}
  Find where F can be computed: {C1, C2, ...}
  Choose c and s subject to constraints (performance, space availability, ...)
  Move x from s to c
  Move F to c
  Compute F(x) at c
  Move Y from c to L
  Register Y in the data registry
  Record the provenance of Y and the performance of F(x) at c

Every step can fail (a sketch of this protocol follows after this list):
  Error! c crashed!
  Error! x was not at s!
  Error! F(x) failed!
  Error! There is not enough space at L!
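A minimal, self-contained sketch of this specification in Python. All catalogs, sizes, and helper names (replica_catalog, compute_catalog, free_space_gb, place) are invented for illustration; this is not the Pegasus implementation or its API.

```python
"""Toy sketch of the "Place Y = F(x) at L" steps above.
All catalogs, sizes, and names are invented for illustration."""

import random

replica_catalog = {"x": ["S1", "S2"]}         # where the input data lives
compute_catalog = {"F": ["C1", "C2"]}         # where F can be computed
free_space_gb = {"C1": 50, "C2": 10, "L": 100}

def place(func, data, size_gb, dest):
    sources = replica_catalog[data]                        # {S1, S2, ...}
    sites = [c for c in compute_catalog[func]
             if free_space_gb[c] >= size_gb]               # space constraint
    if not sites or free_space_gb[dest] < size_gb:
        # Error! There is not enough space at a compute site or at L.
        raise RuntimeError(f"not enough space to place {func}({data}) at {dest}")

    s, c = random.choice(sources), sites[0]                # choose s and c
    # Each step below can also fail (c crashed, x was not at s, F(x) failed);
    # a workflow system would retry or replan instead of stopping.
    print(f"stage-in : {data} from {s} to {c}")
    print(f"compute  : Y = {func}({data}) at {c}")
    print(f"stage-out: Y from {c} to {dest}")
    print(f"register : Y at {dest}; record provenance of {func}({data}) run at {c}")

place("F", "x", size_gb=20, dest="L")
```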

Page 5: Optimizing for Time and Space in Distributed Scientific Workflows

Pegasus-Workflow Management System

Leverages abstraction for workflow description to obtain ease of use, scalability, and portability

Provides a compiler to map from high-level descriptions to executable workflows
  Correct mapping
  Performance-enhanced mapping

Provides a runtime engine to carry out the instructions
  In a scalable manner
  In a reliable manner

In collaboration with Miron Livny, UW Madison; funded under NSF OCI SDCI

Page 6: Optimizing for Time and Space in Distributed Scientific Workflows

Pegasus Workflow Management System

Condor schedd: reliable, scalable execution of independent tasks (locally or across the network), priorities, scheduling

DAGMan: reliable and scalable execution of dependent tasks

Pegasus mapper: a decision system that develops strategies for reliable and efficient execution in a variety of environments

Cyberinfrastructure: local machine, cluster, Condor pool, OSG, TeraGrid

[Architecture diagram: an abstract workflow goes in, results come out]

Page 7: Optimizing for Time and Space in Distributed Scientific Workflows

Basic Workflow Mapping

Select where to run the computations
  Apply a scheduling algorithm: HEFT, min-min, round-robin, random
  The quality of the schedule depends on the quality of the available information

Transform task nodes into nodes with executable descriptions
  Execution location
  Environment variables initialized
  Appropriate command-line parameters set

Select which data to access
  Add stage-in nodes to move data to the computations
  Add stage-out nodes to transfer data out of remote sites to storage
  Add data transfer nodes between computation nodes that execute on different resources
  (a sketch of this data-staging pass follows below)
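The data-staging part of the mapping can be pictured as a single pass over the scheduled tasks. The sketch below uses an invented task list, site assignment, and file catalogs and simply prints the stage-in, inter-site transfer, stage-out, and registration nodes it would add; the actual Pegasus mapper operates on an abstract DAG and site/replica catalogs.

```python
"""Sketch: after scheduling, walk the tasks in topological order and insert
stage-in, inter-site transfer, stage-out, and registration nodes.
The task/file structures are invented for illustration."""

# Tasks in topological order, each with a site chosen by the scheduler.
tasks = [
    {"name": "Project1", "site": "A", "in": ["img1"], "out": ["p1"]},
    {"name": "Project2", "site": "A", "in": ["img2"], "out": ["p2"]},
    {"name": "Diff12",   "site": "B", "in": ["p1", "p2"], "out": ["d12"]},
]
initial_inputs = {"img1", "img2"}     # only available at the submit host
final_outputs = {"d12"}               # must end up in long-term storage

produced_at = {}                      # file -> site where it was created
executable_workflow = []

for task in tasks:
    for f in task["in"]:
        if f in initial_inputs:
            executable_workflow.append(f"stage-in {f} -> site {task['site']}")
        elif produced_at.get(f, task["site"]) != task["site"]:
            executable_workflow.append(
                f"transfer {f}: site {produced_at[f]} -> site {task['site']}")
    executable_workflow.append(f"run {task['name']} at site {task['site']}")
    for f in task["out"]:
        produced_at[f] = task["site"]

for f in sorted(final_outputs):
    executable_workflow.append(f"stage-out {f}: site {produced_at[f]} -> storage")
    executable_workflow.append(f"register {f} in the data catalog")

print("\n".join(executable_workflow))
```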

Page 8: Optimizing for Time and Space in Distributed Scientific Workflows

Basic Workflow Mapping

Add nodes that register the newly created data products

Add data cleanup nodes to remove data from remote sites when no longer needed; this reduces the workflow data footprint

Provide provenance capture steps
  Information about the source of the data, the executables invoked, environment variables, parameters, machines used, and performance

Page 9: Optimizing for Time and Space in Distributed Scientific Workflows

Pegasus Workflow Mapping

Original workflow: 15 compute nodes, devoid of resource assignment

Resulting workflow mapped onto 3 Grid sites:
  11 compute nodes (4 removed based on available intermediate data)
  12 data stage-in nodes
  8 inter-site data transfers
  14 data stage-out nodes to long-term storage
  14 data registration nodes (data cataloging)

60 jobs to execute

Page 10: Optimizing for Time and Space in Distributed Scientific Workflows

Time Optimizations during Mapping

Node clustering for fine-grained computations
  Can obtain significant performance benefits for some applications (~80% in Montage, ~50% in SCEC)
  (a clustering sketch follows below)

Data reuse in case intermediate data products are already available
  Performance and reliability advantages: workflow-level checkpointing

Workflow partitioning to adapt to changes in the environment
  Map and execute small portions of the workflow at a time
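One common way to realize node clustering is level-based (horizontal) clustering: tasks at the same depth of the DAG are grouped into a fixed number of clustered jobs so that each remote submission carries more work. A minimal sketch, with an invented mini-DAG and an invented clusters_per_level parameter:

```python
"""Sketch of level-based (horizontal) clustering: fine-grained tasks at the
same workflow depth are grouped into a few clustered jobs. The small DAG
and the target cluster count are invented for illustration."""

from collections import defaultdict
from functools import lru_cache

# parent -> children edges of a small abstract workflow
edges = {"prep": ["proj1", "proj2", "proj3", "proj4"],
         "proj1": ["add"], "proj2": ["add"], "proj3": ["add"], "proj4": ["add"]}
tasks = set(edges) | {child for kids in edges.values() for child in kids}

@lru_cache(maxsize=None)
def level(task):
    """Depth of a task: roots are 0, otherwise 1 + deepest parent."""
    parents = [p for p, kids in edges.items() if task in kids]
    return 0 if not parents else 1 + max(level(p) for p in parents)

by_level = defaultdict(list)
for t in sorted(tasks):
    by_level[level(t)].append(t)

clusters_per_level = 2      # how many clustered jobs to emit per level
for lvl, members in sorted(by_level.items()):
    groups = [members[i::clusters_per_level] for i in range(clusters_per_level)]
    groups = [g for g in groups if g]
    print(f"level {lvl}: {len(members)} task(s) -> {len(groups)} clustered job(s) {groups}")
```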

Page 11: Optimizing for Time and Space in Distributed Scientific Workflows

LIGO: (Laser Interferometer Gravitational-Wave Observatory)

Aims to detect gravitational waves predicted by Einstein's theory of relativity.

Can be used to detect binary pulsars, mergers of black holes, and "starquakes" in neutron stars.

Two installations: in Louisiana (Livingston) and in Washington State. Other projects: Virgo (Italy), GEO (Germany), Tama (Japan).

The instruments are designed to measure the effect of gravitational waves on test masses suspended in vacuum.

The data collected during experiments are collections of multi-channel time series.

Page 12: Optimizing for Time and Space in Distributed Scientific Workflows

[Same content as Page 11, with a photo of the LIGO Livingston observatory]

Page 13: Optimizing for Time and Space in Distributed Scientific Workflows

Example of LIGO’s computations

Binary inspiral analysis

Size of analysis needed for meaningful results:
  at least 221 GB of gravitational-wave data
  approximately 70,000 computational tasks

Desired analysis: data from November 2005 to November 2006
  10 TB of input data
  approximately 185,000 computation edges
  1 TB of output data

Page 14: Optimizing for Time and Space in Distributed Scientific Workflows

LIGO’s computational resources

LIGO Data Grid: Condor clusters managed by the collaboration, ~6,000 CPUs

Open Science Grid: a US cyberinfrastructure shared by many applications
  ~20 Virtual Organizations
  ~258 GB of shared scratch disk space on OSG sites

Page 15: Optimizing for Time and Space in Distributed Scientific Workflows

Problem

How to “fit” the computations onto the OSG?
  Take into account intermediate data products
  Minimize the data footprint of the workflow
  Schedule the workflow tasks in a disk-space-aware fashion

Page 16: Optimizing for Time and Space in Distributed Scientific Workflows

Workflow Footprint

To reduce the workflow footprint, we need to determine when data are no longer needed:
  because the data have been consumed by the next component and no other component needs them
  because the data have been staged out to permanent storage
  because the data are no longer needed on a resource and have been staged out to the resource that needs them

Page 17: Optimizing for Time and Space in Distributed Scientific Workflows

Cleanup Disk Space as Workflow Progresses

For each node, add dependencies to clean up all the files used and produced by the node

If a file is being staged in from r1 to r2, add a dependency between the stage-in node and the cleanup node

If a file is being staged out, add a dependency between the stage-out node and the cleanup node

(A sketch of this cleanup-node insertion follows below.)
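A simplified sketch of that rule over an invented list of jobs at a single site: each file gets one cleanup job whose parents are all the jobs that read or write the file, including its stage-in and stage-out jobs. This mirrors the dependencies described above but is not the Pegasus code.

```python
"""Sketch: add one cleanup job per file, dependent on every job that reads or
writes that file (compute, stage-in, or stage-out). The job list is invented
for illustration."""

# Executable workflow jobs at one site and the files they touch.
jobs = {
    "stage_in_f1":  {"reads": [],     "writes": ["f1"]},   # f1 staged in from r1
    "compute_A":    {"reads": ["f1"], "writes": ["f2"]},
    "compute_B":    {"reads": ["f2"], "writes": ["f3"]},
    "stage_out_f3": {"reads": ["f3"], "writes": []},        # f3 shipped to storage
}

cleanup_parents = {}    # cleanup job -> jobs it must wait for
for job, files in jobs.items():
    for f in files["reads"] + files["writes"]:
        cleanup_parents.setdefault(f"cleanup_{f}", set()).add(job)

for cleanup, parents in sorted(cleanup_parents.items()):
    print(f"{cleanup} depends on {sorted(parents)}")
```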

Page 18: Optimizing for Time and Space in Distributed Scientific Workflows

Experiments on the Grid with LIGO and Montage

Page 19: Optimizing for Time and Space in Distributed Scientific Workflows

Cleanup on the Grid: Montage application (~1,200 nodes) on the Open Science Grid

Data footprint: 1.25 GB with cleanup versus 4.5 GB without

Page 20: Optimizing for Time and Space in Distributed Scientific Workflows

LIGO Inspiral Analysis Workflow

Small test workflow, 166 tasks, 600 GB max total storage (includes intermediate data products)

LIGO workflow running on OSG

Page 21: Optimizing for Time and Space in Distributed Scientific Workflows

Opportunities for data cleanup in the LIGO workflow

Assumes level-based scheduling: all nodes at a level must complete before the next level starts. (A sketch of the per-level accounting follows below.)
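The charts on the next slides plot, per level, the footprint with cleanup, the footprint without cleanup, and the cleanup opportunity. A toy sketch of that accounting under the level-based scheduling assumption, with invented file sizes, creation levels, and last-use levels:

```python
"""Toy per-level footprint accounting under level-based scheduling.
File sizes (MB), creation levels, and last-use levels are invented."""

# For each file: size, the level that creates it, the last level that reads it.
files = [
    {"size_mb": 400, "created": 1, "last_used": 2},
    {"size_mb": 800, "created": 2, "last_used": 3},
    {"size_mb": 300, "created": 3, "last_used": 5},
    {"size_mb": 200, "created": 4, "last_used": 5},
]

for level in range(1, 6):
    without = sum(f["size_mb"] for f in files if f["created"] <= level)
    # With cleanup, a file is deleted once its last consumer level has finished.
    with_cleanup = sum(f["size_mb"] for f in files
                       if f["created"] <= level and f["last_used"] >= level)
    print(f"level {level}: {with_cleanup} MB with cleanup, {without} MB without "
          f"(cleanup opportunity: {without - with_cleanup} MB)")
```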

Page 22: Optimizing for Time and Space in Distributed Scientific Workflows

Montage Workflow (2 degree)

[Chart: data footprint in MB per workflow level (levels 1 to 11), showing WFP with cleanup, WFP without cleanup, and the cleanup opportunity]

Page 23: Optimizing for Time and Space in Distributed Scientific Workflows

LIGO Workflows

26% improvement in disk space usage

50% slower runtime

Page 24: Optimizing for Time and Space in Distributed Scientific Workflows

LIGO Workflows

26% improvement in disk space usage

50% slower runtime

[Chart: LSC workflow with limited restructuring, data footprint in MB per level (levels 1 to 12), showing WFP with cleanup, WFP without cleanup, and the cleanup opportunity]

Page 25: Optimizing for Time and Space in Distributed Scientific Workflows

LIGO Workflows

56% improvement in space usage

3 times slower runtime

Page 26: Optimizing for Time and Space in Distributed Scientific Workflows

Challenges in implementing data space-aware scheduling

Difficult to get accurate performance estimates for tasks

Difficult to get good estimates of the sizes of the output data; errors compound through the workflow

Difficult to get accurate estimates of available data storage space
  Space is shared among many users
  Hard to make allocation estimates available to users
  Even if space is available when you schedule, it may not be there to receive all the data

Need space allocation support

Page 27: Optimizing for Time and Space in Distributed Scientific Workflows

Conclusions

Data are an important part of today's applications and need to be managed

Optimizing workflow disk space usage:
  the data workflow footprint concept is applicable within one resource
  data-aware scheduling across resources

Designed an algorithm that can clean up data as the workflow progresses
  The effectiveness of the algorithm depends on the structure of the workflow and its data characteristics
  Workflow restructuring may be needed to decrease the footprint

Page 28: Optimizing for Time and Space in Distributed Scientific Workflows

Acknowledgments

Henan Zhao, Rizos Sakellariou (University of Manchester, UK)

Kent Blackburn, Duncan Brown, Stephen Fairhurst, David Meyers (LIGO, Caltech, USA)

G. Bruce Berriman, John Good, Daniel S. Katz (Montage, Caltech and LSU, USA)

Miron Livny and Kent Wenger (DAGMan, UW Madison)

Gurmeet Singh, Karan Vahi, Arun Ramakrishnan, Gaurang Mehta (Pegasus, USC Information Sciences Institute)

Page 29: Optimizing for Time and Space in Distributed Scientific Workflows

Relevant Links

Condor: www.cs.wisc.edu/condor
Pegasus: pegasus.isi.edu
LIGO: www.ligo.caltech.edu/
Montage: montage.ipac.caltech.edu/
Open Science Grid: www.opensciencegrid.org

Workflows for e-Science, I.J. Taylor, E. Deelman, D.B. Gannon, M. Shields (Eds.), Springer, Dec. 2006

NSF Workshop on Challenges of Scientific Workflows: www.isi.edu/nsf-workflows06, E. Deelman and Y. Gil (chairs)

OGF Workflow Research Group: www.isi.edu/~deelman/wfm-rg