GriPhyN and Data Provenance The Grid Physics Network Virtual Data System DOE Data Management Workshop SLAC, 17 March 2004 Mike Wilde Argonne National Laboratory Mathematics and Computer Science Division


GriPhyN and Data Provenance

The Grid Physics Network Virtual Data System

DOE Data Management Workshop, SLAC, 17 March 2004

Mike Wilde, Argonne National Laboratory

Mathematics and Computer Science Division


DOE Data Management, www.griphyn.org/chimera, 17 Mar 2004

GriPhyN: Grid Physics Network Mission

Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation

Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance.

GriPhyN works to “cross the chasm”: application and computer scientists create and field-test paradigms and toolkits together


Virtual Data Scenario

[Figure: workflow graph. simulate -t 10 … produces file1 and file2; reformat -f fz … produces file3, file4, and file5; conv -I esd -o aod produces file6; summarize -t 10 … produces file7; psearch -t 10 … produces file8.]

On-demand data generation

Update workflow following changes

Manage workflow:

    psearch -t 10 -i file3 file4 file5 -o file8
    summarize -t 10 -i file6 -o file7
    reformat -f fz -i file2 -o file3 file4 file5
    conv -I esd -o aod -i file2 -o file6
    simulate -t 10 -o file1 file2

Explain provenance, e.g. for file8
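A minimal sketch of how a provenance explanation for one file could be computed from the five commands listed on this slide. The in-memory `COMMANDS` table and the `explain` helper are hypothetical, written only for illustration; they are not part of the Chimera software.

```python
# Hypothetical in-memory record of the five commands on the slide:
# each entry is (command label, files read, files written).
COMMANDS = [
    ("simulate -t 10", [], ["file1", "file2"]),
    ("reformat -f fz", ["file2"], ["file3", "file4", "file5"]),
    ("conv -I esd -o aod", ["file2"], ["file6"]),
    ("summarize -t 10", ["file6"], ["file7"]),
    ("psearch -t 10", ["file3", "file4", "file5"], ["file8"]),
]


def explain(target):
    """Return the commands needed to produce `target`, producers first."""
    producer = {out: cmd for cmd in COMMANDS for out in cmd[2]}
    ordered, seen = [], set()

    def visit(name):
        cmd = producer.get(name)
        if cmd is None or cmd[0] in seen:
            return  # no known producer, or command already listed
        seen.add(cmd[0])
        for needed in cmd[1]:
            visit(needed)
        ordered.append(cmd[0])

    visit(target)
    return ordered


print(explain("file8"))
```

Only the commands on the path to the requested file are reported, so the answer for file8 leaves out conv and summarize.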


Grid3 – The Laboratory

Supported by the National Science Foundation and the Department of Energy.


VDL: Virtual Data Language Describes Data Transformations

Transformation
– Abstract template of program invocation
– Similar to "function definition"

Derivation
– “Function call” to a Transformation
– Store past and future:
    > A record of how data products were generated
    > A recipe of how data products can be generated

Invocation
– Record of a Derivation execution

These XML documents reside in a “virtual data catalog” (VDC), a relational database
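The three record types can be pictured with a small data model. This is a sketch under stated assumptions: the class fields, the argument check, and the in-memory `Catalog` are all hypothetical and stand in for the XML documents and relational database that the slide describes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transformation:
    """Abstract template of a program invocation (like a function definition)."""
    name: str
    formal_args: tuple  # e.g. ("a1", "a2")


@dataclass(frozen=True)
class Derivation:
    """A 'function call' to a Transformation with actual arguments bound."""
    name: str
    transformation: Transformation
    actual_args: tuple  # pairs of (formal name, value)

    def __post_init__(self):
        bound = {formal for formal, _ in self.actual_args}
        missing = set(self.transformation.formal_args) - bound
        if missing:
            raise ValueError(f"unbound arguments: {sorted(missing)}")


@dataclass
class Invocation:
    """Record of one execution of a Derivation."""
    derivation: Derivation
    exit_code: int
    host: str


@dataclass
class Catalog:
    """Toy in-memory stand-in for a virtual data catalog."""
    invocations: list = field(default_factory=list)

    def record(self, invocation):
        self.invocations.append(invocation)

    def history(self, derivation_name):
        return [iv for iv in self.invocations
                if iv.derivation.name == derivation_name]


tr1 = Transformation("tr1", ("a1", "a2"))
x1 = Derivation("x1", tr1, (("a1", "file1"), ("a2", "file2")))
catalog = Catalog()
catalog.record(Invocation(x1, exit_code=0, host="node01"))
print(len(catalog.history("x1")))
```

A derivation that has no invocation yet is the "recipe" case; one with recorded invocations is the "record" case.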


VDL Describes Workflow via Data Dependencies

TR tr1(in a1, out a2) {
    argument stdin = ${a1};
    argument stdout = ${a2};
}

TR tr2(in a1, out a2) {
    argument stdin = ${a1};
    argument stdout = ${a2};
}

DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});

DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});

[Figure: file1 → x1 → file2 → x2 → file3]
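The ordering of x1 before x2 is never stated directly; it follows from file2 being an output of one derivation and an input of the other. The sketch below recovers that edge from the two DV statements. The regular expressions cover only the simple `in`/`out` syntax shown on this slide and are an illustrative assumption, not the real VDL parser.

```python
import re

VDL = """
DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
"""

# Matches "DV name->transformation(args);" and "@{in:file}" / "@{out:file}".
DV_RE = re.compile(r"DV\s+(\w+)->(\w+)\((.*?)\);", re.S)
FILE_RE = re.compile(r'@\{\s*(in|out)\s*:\s*"?([^"}]+)"?\s*\}')


def parse_derivations(text):
    """Map each derivation name to its sets of input and output files."""
    derivations = {}
    for name, _transformation, args in DV_RE.findall(text):
        files = {"in": set(), "out": set()}
        for direction, filename in FILE_RE.findall(args):
            files[direction].add(filename)
        derivations[name] = files
    return derivations


def dependencies(derivations):
    """Return (producer, consumer, file) triples implied by shared files."""
    edges = []
    for producer, p in derivations.items():
        for consumer, c in derivations.items():
            if producer != consumer:
                for f in sorted(p["out"] & c["in"]):
                    edges.append((producer, consumer, f))
    return edges


print(dependencies(parse_derivations(VDL)))
```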


Workflow example

Graph structure
– Fan-in
– Fan-out
– "left" and "right" can run in parallel

Needs external input file
– Located via replica catalog

Data file dependencies
– Form graph structure

[Figure: diamond-shaped workflow graph. preprocess feeds two findrange jobs, which both feed analyze.]


Complete VDL workflow

Generate appropriate derivations:

DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );

DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" );

DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );

DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
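One way to see why "left" and "right" can run in parallel is to group the four derivations into waves with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the `JOBS` table is transcribed by hand from the derivations above, and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# Inputs and outputs copied from the four derivations above.
JOBS = {
    "top": {"in": {"f.a"}, "out": {"f.b1", "f.b2"}},
    "left": {"in": {"f.b1", "f.b2"}, "out": {"f.c1"}},
    "right": {"in": {"f.b1", "f.b2"}, "out": {"f.c2"}},
    "bottom": {"in": {"f.c1", "f.c2"}, "out": {"f.d"}},
}


def waves(jobs):
    """Group jobs into waves; jobs in one wave do not depend on each other."""
    producer = {f: name for name, job in jobs.items() for f in job["out"]}
    graph = {
        name: {producer[f] for f in job["in"] if f in producer}
        for name, job in jobs.items()
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


def external_inputs(jobs):
    """Files that no job produces; these must already exist somewhere."""
    produced = set().union(*(job["out"] for job in jobs.values()))
    consumed = set().union(*(job["in"] for job in jobs.values()))
    return sorted(consumed - produced)


print(waves(JOBS))
print(external_inputs(JOBS))
```

The single external input, f.a, is the file that would have to be located through a replica catalog before the workflow can start.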


Compound Transformations Enable Functional Abstractions

Compound TR encapsulates an entire sub-graph:

TR rangeAnalysis( in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2 ) {
    call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
    call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
    call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
    call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}


Derivation scripts

Representation of virtual data provenance:

DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"}, fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" );

DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"}, fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );

...

DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"}, fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );
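Derivations like these lend themselves to being generated by a script. In the three shown, each derivation appears to own six consecutive hexadecimal file numbers (d1 uses 00000 to 00005, d2 uses 00006 to 0000B, d70 uses 0019E to 001A3). The sketch below reproduces that naming under that assumption; the function name is hypothetical, and the p1 and p2 values are passed in because the slide does not say how they were chosen.

```python
def diamond_derivation(index, p1, p2):
    """Build the DV statement for the index-th (1-based) diamond derivation.

    Assumes each derivation owns six consecutive hexadecimal file numbers,
    which matches d1, d2, and d70 on the slide.
    """
    base = (index - 1) * 6

    def name(offset):
        # Five uppercase hex digits, e.g. 414 -> "f.0019E".
        return f"f.{base + offset:05X}"

    return (
        f'DV d{index}->diamond( fd=@{{out:"{name(5)}"}}, '
        f'fc1=@{{io:"{name(4)}"}}, fc2=@{{io:"{name(3)}"}}, '
        f'fb1=@{{io:"{name(2)}"}}, fb2=@{{io:"{name(1)}"}}, '
        f'fa=@{{io:"{name(0)}"}}, p2="{p2}", p1="{p1}" );'
    )


print(diamond_derivation(70, p1="18", p2="800"))
```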


Invocation Provenance

Completion status and resource usage

Attributes of executable transformation

Attributes of input and output files


Executing VDL Workflows

[Figure: workflow execution pipeline. Labels: abstract workflow; local planner; global planner “Pegasus”; “jit” planner (research); concrete DAG; DAGman / Condor-G; Grid Info.]


GriPhyN-iVDGL Applications to date

ATLAS, BTeV, CMS – HEP event simulation

Argonne Computational Biology – sequence comparison and result capture

LIGO – Pulsar search

Sloan Digital Sky Survey – cluster finding; near-earth object search planned

Quarknet – science education – cosmic rays, HEP analysis


Genome Analysis Database Update

[Figure: GADU GServer architecture. Automatic workflows (graphs of jobs A, B, C, D) are created as per user request or project and run on Jazz/ANL, Grid3, and UofWisc Grid resources. End users, in classes labeled Hit and Run, Public, Registered Groups, and Collaborators, reach the system through an interface to the server (Jetspeed). Side labels: “Data flow and storage at various levels”; “Chimera, Condor, Globus”.]

Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS

Described in GGF10 workshop paper.


Virtual Data Example: Galaxy Cluster Search

[Figure: Sloan data, a workflow DAG, and the resulting galaxy cluster size distribution, plotted as number of clusters (log scale, 1 to 100000) against number of galaxies (log scale, 1 to 100).]

Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago. Described in SC2002 paper


Cluster Search Workflow Graph and Execution Trace

Workflow jobs vs time


Virtual Data Application: High Energy Physics Data Analysis

[Figure: graph of derived datasets, each node labeled with its parameter set, from general to specific:
mass = 200
mass = 200, decay = WW
mass = 200, decay = ZZ
mass = 200, decay = bb
mass = 200, plot = 1
mass = 200, event = 8
mass = 200, decay = WW, stability = 1
mass = 200, decay = WW, stability = 3
mass = 200, decay = WW, plot = 1
mass = 200, decay = WW, event = 8
mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
mass = 200, decay = WW, stability = 1, event = 8
mass = 200, decay = WW, stability = 1, plot = 1]

Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida. Ref: CHEP 2002 paper


Using Virtual Data for Science Education

The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education

It's an experiment to give students the means to:

– discover and apply datasets, algorithms, and data analysis methods

– collaborate by developing new ones and sharing results and observations

– learn data analysis methods that will ready and excite them for a scientific career

And in later steps, we may actually use the Grid!


Quarknet Virtual Data Project

[Figure: the Quarknet Virtual Data Portal, reached through standard Web access and backed by the Virtual Data Toolkit and the Virtual Data Catalog, holds student data, algorithms, results, notes, and communications. Three sites, each with student/teacher teams, a cosmic ray detector, and locally collected data, connect to it: Central High School (Reston, Virginia); the Yale / Middletown High Collaboration (Hartford, Connecticut); and Foothills High School (Great Falls, Montana).]

Student teacher teams sharing data, methods, programs, and knowledge

Enabling collaboration-intensive science discovery with virtual data tools and methods


Detector Performance Study


Example: BTeV Event Simulation


Search by Metadata


Deriving a new dataset

… to find mass of “z” particle:


Workflow for missing energy calculations


Virtual Provenance: list of derivations and files

<job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5"
     dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
  <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
  <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
</job>
<job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7"
     dv-namespace="Quarknet.HEPSRCH" …
  <argument><filename file="electron10GeV.event"/> <filename file="electron10GeV.sum"/></argument>
  …
</job>
<job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"…
  <argument><filename file="run1a.mis"/> <filename file="run1a.ecal"/> …
  <uses file="run1a.muon" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.total" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.ecal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.hcal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.mis" link="input" dontRegister="false" dontTransfer="false"/>
</job>

<!--list of all files used -->
<filename file="ecal.pct" link="inout"/>
<filename file="electron10GeV.avg" link="inout"/>
<filename file="electron10GeV.sum" link="inout"/>
<filename file="hcal.pct" link="inout"/>
…

(excerpted for display)


Virtual Provenance in XML: control flow graph

<child ref="ID000003"> <parent ref="ID000002"/> </child>
<child ref="ID000004"> <parent ref="ID000003"/> </child>
<child ref="ID000005"> <parent ref="ID000004"/> <parent ref="ID000001"/>…
<child ref="ID000009"> <parent ref="ID000008"/> </child>
<child ref="ID000010"> <parent ref="ID000009"/> <parent ref="ID000006"/>…
<child ref="ID000012"> <parent ref="ID000011"/> </child>
<child ref="ID000013"> <parent ref="ID000011"/> </child>
<child ref="ID000014"> <parent ref="ID000010"/> <parent ref="ID000012"/>… <parent ref="ID000013"/>… </child>…

(excerpted for display…)
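The child/parent elements can be read back into a dependency map with a standard XML parser. Because the excerpt above is elided and has no root element, the sketch below uses only the four complete entries and wraps them in a root element; the root name `adag` and both helper functions are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A well-formed subset of the excerpt, wrapped in a root element so that
# it parses; the root name "adag" is an assumption for this sketch.
DAX = """
<adag>
  <child ref="ID000003"> <parent ref="ID000002"/> </child>
  <child ref="ID000004"> <parent ref="ID000003"/> </child>
  <child ref="ID000012"> <parent ref="ID000011"/> </child>
  <child ref="ID000013"> <parent ref="ID000011"/> </child>
</adag>
"""


def parents_by_child(xml_text):
    """Map each child job ID to the list of job IDs it depends on."""
    root = ET.fromstring(xml_text)
    return {
        child.get("ref"): [p.get("ref") for p in child.findall("parent")]
        for child in root.findall("child")
    }


def ancestors(job, parents):
    """Collect every job that must finish before `job` can start."""
    found = set()
    stack = list(parents.get(job, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


deps = parents_by_child(DAX)
print(sorted(ancestors("ID000004", deps)))
```

Walking the parent links transitively gives the full set of upstream jobs for any job, which is the control-flow side of its provenance.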


And writing the results up in a “poster”


Poster describing analysis


Observations

A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity

Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation

The real world is messy – finding the right abstractions is hard, and handling “legacy” applications is even harder


Vision for Provenance in the Large

Universal knowledge management and production systems

Vendors integrate the provenance tracking protocol into data processing products

Ability to run anywhere “in the Grid”


Virtual Data Grid Vision

[Figure: Virtual Data Grid vision. Roles: Production Manager, Researcher, Grid Operations, Science Review. Activities: planning, discovery, composition, simulation, analysis, derivation, sharing. Data sources: raw data from a detector; simulation data. Data Grid components: storage elements, replica location service, data transport, storage resource management, virtual data catalogs, virtual data index. Computing Grid components: workflow planner, request planner, workflow executor (DAGman), request executor (Condor-G, GRAM), request predictor (Prophesy), Grid Monitor.]


Planned Dataset Model

<FORM <Title…>/FORM>

File

Set of files

Relational query or spreadsheet range

XML Element

Set of files with relational index

Object closure

New user-defined dataset type:

Speculative model described in CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao


Planned Dataset Type Model

[Figure: dataset type hierarchy; nonleaf types are superclasses. Type names shown: FileDataset, File, FileSet, MultiFileSet, TarFileSet, EventCollection, RawEventSet, SimulatedEventSet, MonteCarloSimulation, DiscreteEventSimulation. Dimension labels: Representational, Logical.]


Provenance Server Plans

OGSA-based Grid services
– Discovery, security, resource management

Supports code and data discovery and workflow management

Object names (TR, DS, TY, DV, IV) can be used as global cross-server links

Derivations can reference remote transformations and datasets

Structured object namespaces & object-level access control enable large VO collaboration

Generalize transforms to describe service calls, database queries and language interpreters


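Provenance Hyperlinks

[Figure: Personal VDS, Group VDS, and Collaboration VDS catalogs, each holding transformations (TR), derivations (DV), and datasets (DS), with links between objects in different catalogs.]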


Indexing Servers to Support Discovery

[Figure: Personal Index, Group Index, Collaboration-level index, and Collaboration-wide index servers placed over the Personal VDS, Group VDS, and Collaboration VDS catalogs of TR, DV, and DS objects.]


For Information and Software

Virtual Data System
– www.griphyn.org/chimera - Chimera Virtual Data System: Overview, papers, software

Grids and Grid Software
– www.ivdgl.org/grid2003 - Using Grid3
– www.griphyn.org/vdt - Virtual Data Toolkit
– www.globus.org - The Globus Toolkit
– www.cs.wisc.edu/condor - The Condor Project
– www.ppdg.net - Particle Physics Data Grid


Acknowledgements: Virtual Data is a Large Team Effort

The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao

The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi

Applications described are the work of many people, including: James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, and their wonderful teams


Acknowledgements

GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation

The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM