The GriPhyN Virtual Data System Ian Foster for the VDS team

Page 1: The GriPhyN Virtual Data System

The GriPhyN Virtual Data System

Ian Foster for the VDS team

Page 2: The GriPhyN Virtual Data System

Science as “Workflow”: E.g., Galaxy Cluster Search

[Figure: Sloan Data feeds a workflow DAG that produces the galaxy cluster size distribution, plotted on log-log axes as Number of Clusters (1 to 100,000) versus Number of Galaxies (1 to 100).]

Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago

Page 3: The GriPhyN Virtual Data System

Requirements

  • Express complex multi-step “workflows”
    Perhaps 100,000s of individual tasks
  • Operate on heterogeneous distributed data
    Different formats & access protocols
  • Harness many computing resources
    Parallel computers &/or distributed Grids
  • Execute workflows reliably
    Despite diverse failure conditions
  • Enable reuse of data & workflows
    Discovery & composition
  • Support many users, workflows, resources
    Policy specification & enforcement

Page 4: The GriPhyN Virtual Data System

Virtual Data System

  • Express complex multi-step “workflows”
    Perhaps 100,000s of individual tasks
  • Operate on heterogeneous distributed data
    Different formats & access protocols
  • Harness many computing resources
    Parallel computers &/or distributed Grids
  • Execute workflows reliably & efficiently
    Despite diverse failure conditions
  • Enable reuse of data & workflows
    Discovery & composition
  • Support many users, workflows, resources
    Policy specification & enforcement

Component labels beside the list, top to bottom: VDL, XDTM; Pegasus, DAGman, Globus; VDC; TBD

Page 5: The GriPhyN Virtual Data System

Virtual Data System

[Figure: architecture diagram in three stages: Workflow spec, Create Execution Plan, Grid Workflow Execution. Labeled components: VDL Program, Virtual Data Catalog, Virtual Data Workflow Generator, Abstract workflow, Local planner, Statically Partitioned DAG, DAGman DAG, Dynamically Planned DAG, Job Planner, Job Cleanup, DAGman & Condor-G.]

Page 6: The GriPhyN Virtual Data System

Genome Analysis & DB Update (GADU)

600-1000+ CPUs

Page 7: The GriPhyN Virtual Data System

The Rest of the Talk

  • Express complex multi-step “workflows”
    Perhaps 100,000s of individual tasks
  • Operate on heterogeneous distributed data
    Different formats & access protocols
  • Harness many computing resources
    Parallel computers &/or distributed Grids
  • Execute workflows reliably & efficiently
    Despite diverse failure conditions
  • Enable reuse of data & workflows
    Discovery & composition
  • Support many users, workflows, resources
    Policy specification & enforcement

Component labels beside the list, top to bottom: VDL, XDTM; Pegasus, DAGman, Globus; VDC; TBD

Ewa

Page 8: The GriPhyN Virtual Data System

“Messy” Scientific Data

  • Diverse storage formats & access protocols
    Logically identical dataset can be stored in text file (e.g. CSV), binary file, spreadsheet
    Data available from filesystem, database, HTTP, WebDAV, etc.
  • Metadata encoded in directory & file names
    E.g.: “fMRI volume is composed of an image file & header file with same prefix”
  • Format dependency hinders program and workflow reuse

Page 9: The GriPhyN Virtual Data System

But... Data is Often Logically Structured

Scientific data often has a hierarchical structure

A common practice is to select a set of data items and apply a transformation to each individual item

A nested approach of such iterations could scale up to millions of objects
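The nested iteration described above can be sketched in Python. This is only an illustration: the in-memory hierarchy (group, subject, run, volume) and the `transform` callback are hypothetical stand-ins, not part of VDS.

```python
from typing import Callable, Dict, List

# Hypothetical hierarchy: group -> subject -> run -> list of volume names.
Study = Dict[str, Dict[str, Dict[str, List[str]]]]


def apply_to_each_volume(study: Study, transform: Callable[[str], str]) -> List[str]:
    """Apply `transform` to every volume, walking the hierarchy level by level."""
    results = []
    for group in study.values():
        for subject in group.values():
            for run in subject.values():
                for volume in run:
                    results.append(transform(volume))
    return results


study = {
    "group1": {
        "subject1": {"run1": ["vol001", "vol002"], "run2": ["vol001"]},
        "subject2": {"run1": ["vol001"]},
    }
}
# The lambda stands in for a real per-volume tool.
reoriented = apply_to_each_volume(study, lambda v: v + ".reoriented")
```

Each extra level of the hierarchy multiplies the number of items visited, which is how a handful of nested loops can reach millions of objects.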

Page 10: The GriPhyN Virtual Data System

Introducing a Typing System

  • Describe logical data structures as types …
    … & physical representations as mappings
  • Define procedures in terms of typed datasets
    … & apply procedures to different physical data
  • Compose workflows from typed procedures
  • Benefits
    Type checking
    Dataset selection and iteration
    Discovery by types
    Dynamic binding
    Type conversion

Page 11: The GriPhyN Virtual Data System

XDTM (Moreau, Zhao, Wilde, Foster)

  • XML Dataset Typing and Mapping
  • Separates logical structure from physical representations
  • Logical structure described by XML Schema
    Primitive scalar types: int, float, string, date …
    Complex types (structs and arrays)
  • Mapping descriptor
    How logical elements map to physical
    External parameters (e.g. location)
  • XPath for dataset selection
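To make "XPath for dataset selection" concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The XML document, its element names, and the `index` attribute are assumptions made for illustration; they are not the actual XDTM representation.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical view of one run; element and attribute names are illustrative.
LOGICAL_VIEW = """
<run id="1">
  <v index="1"><img>bold1_001.img</img><hdr>bold1_001.hdr</hdr></v>
  <v index="2"><img>bold1_002.img</img><hdr>bold1_002.hdr</hdr></v>
  <v index="3"><img>bold1_003.img</img><hdr>bold1_003.hdr</hdr></v>
</run>
"""

root = ET.fromstring(LOGICAL_VIEW)

# Select the image file of every volume in the run.
all_images = [element.text for element in root.findall("./v/img")]

# Select one volume by attribute value, then read its header.
second_header = root.find("./v[@index='2']/hdr").text
```

The selection expressions refer only to the logical structure, so the same paths would keep working if the files behind `img` and `hdr` were stored elsewhere.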

Page 12: The GriPhyN Virtual Data System

Mapping

  • Define a common mapping interface
    Initialize, read, create, write, close
  • Data providers implement the interface
    Responsible for data access details
  • XView maintains cached logical datasets

[Figure: component diagram with VDS, XViewMgr, XView, Mapper, and Data Source.]
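One way to picture the common mapping interface is as an abstract base class with the five operations named on the slide. The sketch below is hypothetical: the method signatures, the `InMemoryMapper` provider, and the `location` parameter are assumptions, not the VDS API.

```python
from abc import ABC, abstractmethod
from typing import Dict


class Mapper(ABC):
    """Hypothetical mapping interface with the five operations from the slide."""

    @abstractmethod
    def initialize(self, params: Dict[str, str]) -> None: ...

    @abstractmethod
    def read(self, logical_name: str) -> bytes: ...

    @abstractmethod
    def create(self, logical_name: str) -> None: ...

    @abstractmethod
    def write(self, logical_name: str, data: bytes) -> None: ...

    @abstractmethod
    def close(self) -> None: ...


class InMemoryMapper(Mapper):
    """Toy data provider that keeps datasets in a dict instead of a real data source."""

    def initialize(self, params):
        self.store = {}
        self.open = True

    def read(self, logical_name):
        return self.store[logical_name]

    def create(self, logical_name):
        self.store[logical_name] = b""

    def write(self, logical_name, data):
        self.store[logical_name] = data

    def close(self):
        self.open = False


mapper = InMemoryMapper()
mapper.initialize({"location": "memory"})
mapper.create("volume_anat.img")
mapper.write("volume_anat.img", b"\x00\x01")
payload = mapper.read("volume_anat.img")
mapper.close()
```

A provider for a filesystem, database, or HTTP source would implement the same five methods, so callers never see the access details.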

Page 13: The GriPhyN Virtual Data System

Use Case: Functional MRI

Logical Structure

DBIC Archive
  Study #1
    Group #1
      Subject #1
        Anatomy
          high-res volume
        Functional Runs
          run #1
            volume #001
            ...
            volume #275
          ...
          run #5
            volume #001
            ...
        snrun #...
        …
    Group #5 ...
  Study #...

Physical Representation

DBIC Archive
  Study_2004.0521.hgd
    Group_1
      Subject_2004.e024
        volume_anat.img
        volume_anat.hdr
        bold1_001.img
        bold1_001.hdr
        ...
        bold1_275.img
        bold1_275.hdr
        ...
        bold5_001.img
        ...
        snrbold*_*air*
        ...
    Group_5
    ...
  Study ...
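The relationship between the two views can be illustrated with a short Python sketch that rebuilds the run and volume structure from file names alone. The naming pattern is inferred from the examples on the slide (`bold1_001.img` read as run 1, volume 1); the function name and the returned layout are hypothetical.

```python
import re
from collections import defaultdict

# Pattern for functional volumes such as "bold1_001.img" (run 1, volume 1).
BOLD_FILE = re.compile(r"^bold(\d+)_(\d+)\.(img|hdr)$")


def group_into_runs(filenames):
    """Return {run number: {volume number: {"img": name, "hdr": name}}}."""
    runs = defaultdict(lambda: defaultdict(dict))
    for name in filenames:
        match = BOLD_FILE.match(name)
        if match is None:
            continue  # e.g. volume_anat.img is not part of a functional run
        run, volume, ext = int(match.group(1)), int(match.group(2)), match.group(3)
        runs[run][volume][ext] = name
    return runs


files = [
    "volume_anat.img", "volume_anat.hdr",
    "bold1_001.img", "bold1_001.hdr",
    "bold1_002.img", "bold1_002.hdr",
    "bold5_001.img", "bold5_001.hdr",
]
runs = group_into_runs(files)
```

Keeping this parsing inside a mapper means the analysis procedures never depend on the file-naming convention.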

Page 14: The GriPhyN Virtual Data System

Type Definitions in VDL

  type Image {};

  type Header {};

  type Volume {
    Image img;
    Header hdr;
  }

  type Anat Volume;

  type Warp {};

  type NormAnat {
    Anat aVol;
    Warp aWarp;
    Volume nHires;
  }

  type Run {
    Volume v[ ];
  }

  type Subject {
    Anat anat;
    Run run[ ];
    Run snrun[ ];
  }

  type Group { Subject s[ ]; }

  type Study { Group g[ ]; }

Part of fMRI AIRSN (Spatial Normalization) Workflow

Page 15: The GriPhyN Virtual Data System

Type Definitions in XML Schema

  <xs:schema
      targetNamespace="http://www.fmri.org/schema/airsn.xsd"
      xmlns="http://www.fmri.org/schema/airsn.xsd"
      xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:simpleType name="Image"/>
    <xs:simpleType name="Header"/>

    <xs:complexType name="Volume">
      <xs:sequence>
        <xs:element name="img" type="Image"/>
        <xs:element name="hdr" type="Header"/>
      </xs:sequence>
    </xs:complexType>

    <xs:complexType name="Run">
      <xs:sequence minOccurs="0" maxOccurs="unbounded">
        <xs:element name="v" type="Volume"/>
      </xs:sequence>
    </xs:complexType>

  </xs:schema>

Page 16: The GriPhyN Virtual Data System

Procedure Definition in VDL

  (Run snr) functional( Run r, NormAnat a, Air shrink ) {
    Run yroRun = reorientRun( r , "y" );
    Run roRun = reorientRun( yroRun , "x" );
    Volume std = roRun[0];
    Run rndr = random_select( roRun, .1 ); // 10% sample
    AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, [81,3,3] );
    Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
    Volume meanRand = softmean( reslicedRndr, "y", null );
    Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, [81,3,3] );
    Volume mnQA = reslice( meanRand, mnQAAir, "o", "k" );
    Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
    Run nr = reslice_warp_run( boldNormWarp, roRun );
    Volume meanAll = strictmean( nr, "y", null );
    Volume boldMask = binarize( meanAll, "y" );
    snr = gsmoothRun( nr, boldMask, 6, 6, 6 );
  }

Page 17: The GriPhyN Virtual Data System

Dataset Iteration

  • Functional analysis expressed in typed datasets
  • Iterate over each volume in a run

[Figure: workflow graph of the functional procedure. Node labels as they appear: reorientRun, reorientRun, reslice_warpRun, random_select, alignlinearRun, resliceRun, softmean, alignlinear, combinewarp, strictmean, gsmoothRun, binarize.]
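The idea that a run-level procedure is a per-volume procedure applied to each element can be sketched as follows. This is a toy model in Python, not VDL semantics: the `Volume` class mirrors the VDL type of the same name, while `reorient` and `lift_to_run` are hypothetical placeholders that only rename files.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Volume:
    img: str
    hdr: str


def reorient(volume: Volume, direction: str) -> Volume:
    """Placeholder for the real per-volume tool; it only tags the file names."""
    return Volume(f"{direction}_{volume.img}", f"{direction}_{volume.hdr}")


def lift_to_run(per_volume: Callable[[Volume, str], Volume]):
    """Turn a per-volume procedure into a per-run procedure by iteration."""
    def per_run(run: List[Volume], direction: str) -> List[Volume]:
        return [per_volume(v, direction) for v in run]
    return per_run


reorient_run = lift_to_run(reorient)

run = [Volume(f"bold1_{i:03d}.img", f"bold1_{i:03d}.hdr") for i in range(1, 4)]
# Mirrors the first two steps of the procedure: reorient along "y", then "x".
ro_run = reorient_run(reorient_run(run, "y"), "x")
```

Because each element is processed independently, every iteration can become a separate job when the workflow is expanded.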

Page 18: The GriPhyN Virtual Data System

Expanded Execution Plan

[Figure: the expanded DAG. The stages reorient, reorient, alignlinear, reslice, softmean, alignlinear, combine_warp, reslice_warp, strictmean, binarize, and gsmooth are expanded into numbered job instances: reorient/01 through reorient/57, alignlinear/03, /07, /11, and /17, reslice/04, /08, and /12, softmean/13, combinewarp/21, reslice_warp/22 through reslice_warp/38, strictmean/39, binarize/40, and gsmooth/41 through gsmooth/50.]

Datasets dynamically instantiated from data sources by mappers
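Expansion of a compact workflow into individual jobs can be illustrated with a small dependency graph. The sketch below assumes a deliberately simplified three-stage pipeline (two per-volume reorient steps followed by one aggregate step); the job names and the `expand_plan` helper are hypothetical, and the ordering uses Python's standard `graphlib` rather than any VDS planner.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def expand_plan(num_volumes: int) -> dict:
    """Map each job to the set of jobs it depends on (simplified three-stage pipeline)."""
    deps = {}
    for i in range(1, num_volumes + 1):
        deps[f"reorient_y/{i}"] = set()
        deps[f"reorient_x/{i}"] = {f"reorient_y/{i}"}
    # One aggregate job needs every reoriented volume.
    deps["softmean"] = {f"reorient_x/{i}" for i in range(1, num_volumes + 1)}
    return deps


plan = expand_plan(3)
# A valid execution order: every job appears after all of its dependencies.
order = list(TopologicalSorter(plan).static_order())
```

The number of jobs grows with the size of the dataset, which is only known once the mappers have instantiated it.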

Page 19: The GriPhyN Virtual Data System

Functional MRI Execution

Page 20: The GriPhyN Virtual Data System

Code Size Comparison

  Workflow     Script   Generator   VDL
  GENATLAS1    49       72          6
  GENATLAS2    97       135         10
  FILM1        63       134         17
  FEAT         84       191         13
  AIRSN        215      ~400        37

Lines of code with different workflow encodings
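The size reduction implied by the table can be computed directly. The snippet below uses only the Script and VDL columns; the Generator column is left out because one of its entries is approximate (~400).

```python
# Lines of code per workflow, copied from the table: (script, VDL).
LINES = {
    "GENATLAS1": (49, 6),
    "GENATLAS2": (97, 10),
    "FILM1": (63, 17),
    "FEAT": (84, 13),
    "AIRSN": (215, 37),
}

# How many times longer the script encoding is than the VDL encoding.
reduction = {name: script / vdl for name, (script, vdl) in LINES.items()}

for name, factor in reduction.items():
    print(f"{name}: script is {factor:.1f}x longer than VDL")
```

For these five workflows the script encoding is between roughly 4 and 10 times longer than the VDL one.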

Page 21: The GriPhyN Virtual Data System

The Rest of the Talk

  • Express complex multi-step “workflows”
    Perhaps 100,000s of individual tasks
  • Operate on heterogeneous distributed data
    Different formats & access protocols
  • Harness many computing resources
    Parallel computers &/or distributed Grids
  • Execute workflows reliably & efficiently
    Despite diverse failure conditions
  • Enable reuse of data & workflows
    Discovery & composition
  • Support many users, workflows, resources
    Policy specification & enforcement

Component labels beside the list, top to bottom: VDL, XDTM; Pegasus, DAGman, Globus; VDC; TBD

Page 22: The GriPhyN Virtual Data System

Virtual Data Schema

[Figure: entity-relationship diagram. Entities and attributes:
  Invocation: dvID, host, start, duration, exitcode, stats
  Call: nmspace, name, version
  Procedure: nmspace, name, version
  FormalArg: argname, type, direction
  ActualArg: argname, value
  Workflow: wfid, fromDV, toDV
  Dataset: nmspace, name
  Annotation: object, pred, type/val, user, date
Relationship labels: passes, executes, calls, binds, references, describes, uses, includes, each with 1 or * multiplicities.]

Page 23: The GriPhyN Virtual Data System

fMRI Virtual Data Queries

Which transformations can process a “subject image”?
  Q: xsearchvdc -q tr_meta dataType subject_image input
  A: fMRIDC.AIR::align_warp

List anonymized subject-images for young subjects:
  Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
  A: 3472-4_anonymized.img

Show files that were derived from patient image 3472-3:
  Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img
  A: 3472-3_anonymized.img
     3472-3_anonymized.sliced.hdr
     atlas.hdr
     atlas.img
     …
     atlas_z.jpg
     3472-3_anonymized.sliced.img
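A "derived from" query of this kind amounts to a forward traversal of a derivation graph. The sketch below shows the idea in Python. The file names come from the answer above, but the individual edges in `DERIVED_FROM` are invented for illustration (the slide lists only the resulting set of files), and `derived_files` is a hypothetical helper, not how `xsearchvdc` is implemented.

```python
from collections import deque

# Hypothetical provenance records: each derived file lists the files it was produced from.
DERIVED_FROM = {
    "3472-3_anonymized.sliced.img": ["3472-3_anonymized.img"],
    "3472-3_anonymized.sliced.hdr": ["3472-3_anonymized.img"],
    "atlas.img": ["3472-3_anonymized.sliced.img", "3472-3_anonymized.sliced.hdr"],
    "atlas.hdr": ["3472-3_anonymized.sliced.img", "3472-3_anonymized.sliced.hdr"],
    "atlas_z.jpg": ["atlas.img", "atlas.hdr"],
}


def derived_files(source: str) -> list:
    """Return every file reachable from `source` by following derivations forward."""
    children = {}
    for output, inputs in DERIVED_FROM.items():
        for parent in inputs:
            children.setdefault(parent, []).append(output)
    seen, queue = [], deque([source])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


lineage = derived_files("3472-3_anonymized.img")
```

The traversal is breadth-first, so files closest to the source image are reported first.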

Page 24: The GriPhyN Virtual Data System

Provenance for ATLAS DC2 (High Energy Physics)

How much compute time was delivered?

  | years | mon  | year |
  +-------+------+------+
  | .45   | 6    | 2004 |
  | 20    | 7    | 2004 |
  | 34    | 8    | 2004 |
  | 40    | 9    | 2004 |
  | 15    | 10   | 2004 |
  | 15    | 11   | 2004 |
  | 8.9   | 12   | 2004 |
  +-------+------+------+

Selected statistics for one of these jobs:

  start: 2004-09-30 18:33:56
  duration: 76103.33
  pid: 6123
  exitcode: 0
  args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556 ... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
  utime: 75335.86
  stime: 28.88
  minflt: 862341
  majflt: 96386

Which Linux kernel releases were used?
How many jobs were run on a Linux 2.4.28 kernel?
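As a worked example of what such provenance records allow, the figures above can be aggregated in a few lines of Python. Two assumptions are made here and are not stated on the slide: that the `years` column is compute time in years, and that `duration`, `utime`, and `stime` are all in seconds.

```python
# (compute-years delivered, month, year), copied from the table above.
DELIVERED = [
    (0.45, 6, 2004), (20, 7, 2004), (34, 8, 2004), (40, 9, 2004),
    (15, 10, 2004), (15, 11, 2004), (8.9, 12, 2004),
]

# Total compute time delivered over the seven months.
total_years = sum(years for years, _, _ in DELIVERED)

# Month with the most compute time (tuples compare on the first field first).
peak_years, peak_month, _ = max(DELIVERED)

# CPU utilisation of the sample job: (user + system time) / wall-clock duration.
utilisation = (75335.86 + 28.88) / 76103.33
```

Under these assumptions the table sums to about 133 compute-years, with the peak in September 2004, and the sample job spent about 99% of its wall-clock time on the CPU.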

Page 25: The GriPhyN Virtual Data System

LIGO Inspiral Search Application

Describe…

Inspiral workflow application is the work of Duncan Brown, Caltech, Scott Koranda, UW Milwaukee, and the LSC Inspiral group

Page 26: The GriPhyN Virtual Data System

FOAM: Fast Ocean/Atmosphere Model

250-Member Ensemble Run on TeraGrid under VDS

[Figure: workflow for Ensemble Members 1, 2, ..., N. Each member has Remote Directory Creation, a FOAM run, and Atmos, Ocean, and Coupl Postprocessing. Results transferred to archival storage.]

Work of: Rob Jacob (FOAM), Veronica Nefedova (workflow design and execution)

Page 27: The GriPhyN Virtual Data System

FOAM and VDS

  • Climate Supercomputer and Grad student
    160 ensemble members in 75 days
  • TeraGrid and VDS
    250 ensemble members in 4 days

Visualization courtesy Pat Behling and Yun Liu, UW Madison

Page 28: The GriPhyN Virtual Data System

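Summary: Science as Workflow

[Figure: workflow nodes in four states (Executed, Executing, Executable, Not yet executable), grouped under “What I Did”, “What I Am Doing”, and “What I Want to Do”, with Query, Edit, and Schedule operations and the Execution environment.]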

Page 29: The GriPhyN Virtual Data System

Acknowledgements

  • The Virtual Data System group is:
    ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi
    U of Chicago: Ben Clifford, Ian Foster, Mike Wilde, Yong Zhao
  • GriPhyN is supported by the NSF
  • Many research efforts involved in this work are supported by the US Department of Energy, Office of Science