GriPhyN and Data Provenance
The Grid Physics Network Virtual Data System
DOE Data Management Workshop, SLAC, 17 March 2004
Mike Wilde, Argonne National Laboratory
Mathematics and Computer Science Division
2DOE Data Management www.griphyn.org/chimera 17 Mar 2004
GriPhyN: Grid Physics Network Mission
Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation
Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance.
GriPhyN works to “cross the chasm”: application and computer scientists create and field-test paradigms and toolkits together
Virtual Data Scenario
[Figure: workflow graph. simulate produces file1 and file2; reformat turns file2 into file3, file4, file5; conv turns file2 into file6; summarize turns file6 into file7; psearch produces file8.]
Manage workflow
Update workflow following changes
On-demand data generation
Explain provenance, e.g. for file8:
  psearch –t 10 –i file3 file4 file5 –o file8
  summarize –t 10 –i file6 –o file7
  reformat –f fz –i file2 –o file3 file4 file5
  conv –l esd –o aod –i file2 –o file6
  simulate –t 10 –o file1 file2
Grid3 – The Laboratory
Supported by the National Science Foundation and the Department of Energy.
VDL: Virtual Data Language Describes Data Transformations
Transformation
– Abstract template of program invocation
– Similar to "function definition"
Derivation
– “Function call” to a Transformation
– Store past and future:
  > A record of how data products were generated
  > A recipe of how data products can be generated
Invocation
– Record of a Derivation execution
These XML documents reside in a “virtual data catalog” (VDC), a relational database
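The three record types above can be sketched in a few lines of Python. This is only an illustration of the function-definition / function-call / execution-record analogy: the class names mirror the slide, but every field name and the in-memory `Catalog` class are assumptions made for the example, not the actual VDC schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Transformation:
    """Abstract template of a program invocation (like a function definition)."""
    name: str
    formal_args: List[str]


@dataclass
class Derivation:
    """A 'function call' to a Transformation: binds actual values to formal args."""
    name: str
    transformation: Transformation
    actual_args: Dict[str, str]

    def __post_init__(self) -> None:
        # Reject bindings for arguments the transformation does not declare.
        unknown = set(self.actual_args) - set(self.transformation.formal_args)
        if unknown:
            raise ValueError(f"unknown arguments: {sorted(unknown)}")


@dataclass
class Invocation:
    """Record of one execution of a Derivation (hypothetical fields)."""
    derivation: Derivation
    exit_code: int
    wall_seconds: float


@dataclass
class Catalog:
    """Toy in-memory stand-in for a virtual data catalog (an assumption)."""
    derivations: Dict[str, Derivation] = field(default_factory=dict)
    invocations: List[Invocation] = field(default_factory=list)

    def add(self, dv: Derivation) -> None:
        self.derivations[dv.name] = dv

    def record(self, inv: Invocation) -> None:
        self.invocations.append(inv)

    def status(self, dv_name: str) -> str:
        """Return 'record' if the derivation has been run, else 'recipe'."""
        ran = any(i.derivation.name == dv_name for i in self.invocations)
        return "record" if ran else "recipe"
```

A derivation that has never been executed is a recipe for data that can be generated; once an invocation is attached, the same derivation doubles as a record of how the data was generated.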
VDL Describes Workflow via Data Dependencies
TR tr1(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
[Figure: file1 → x1 → file2 → x2 → file3]
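The ordering between x1 and x2 is never declared; it follows from file2 being an output of one derivation and an input of the other. A minimal Python sketch of that inference is shown here. The tuple layout and the function name `infer_edges` are assumptions for illustration and are not part of VDL or Chimera.

```python
from typing import Dict, List, Tuple

# (derivation name, input files, output files), transcribed from the DV lines.
Derivation = Tuple[str, List[str], List[str]]

derivations: List[Derivation] = [
    ("x1", ["file1"], ["file2"]),
    ("x2", ["file2"], ["file3"]),
]


def infer_edges(dvs: List[Derivation]) -> List[Tuple[str, str, str]]:
    """Return (producer, consumer, file) for each file one derivation writes
    and another reads."""
    producer: Dict[str, str] = {}
    for name, _inputs, outputs in dvs:
        for f in outputs:
            producer[f] = name
    edges = []
    for name, inputs, _outputs in dvs:
        for f in inputs:
            if f in producer:  # files nobody produces are external inputs
                edges.append((producer[f], name, f))
    return edges


print(infer_edges(derivations))  # [('x1', 'x2', 'file2')]
```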
Workflow example
Graph structure
– Fan-in
– Fan-out
– "left" and "right" can run in parallel
Needs external input file
– Located via replica catalog
Data file dependencies
– Form graph structure
[Figure: diamond-shaped workflow: preprocess feeds two findrange steps, which feed analyze]
Complete VDL workflow
Generate appropriate derivations
DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" );
DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );
DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
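To illustrate how the four derivations imply both the execution order and the one external input, here is a small Python sketch using the standard-library `graphlib` module (Python 3.9 or later). The dictionary layout and function names are assumptions for the example; a real planner such as the ones discussed later in the talk does considerably more.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Inputs and outputs transcribed from the four DV statements.
dvs: Dict[str, Dict[str, List[str]]] = {
    "top": {"in": ["f.a"], "out": ["f.b1", "f.b2"]},
    "left": {"in": ["f.b1", "f.b2"], "out": ["f.c1"]},
    "right": {"in": ["f.b1", "f.b2"], "out": ["f.c2"]},
    "bottom": {"in": ["f.c1", "f.c2"], "out": ["f.d"]},
}


def parallel_batches(dvs: Dict[str, Dict[str, List[str]]]) -> List[List[str]]:
    """Group derivations into batches; members of a batch can run concurrently."""
    producer = {f: name for name, io in dvs.items() for f in io["out"]}
    graph = {
        name: {producer[f] for f in io["in"] if f in producer}
        for name, io in dvs.items()
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def external_inputs(dvs: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Files that are read but never produced, so they must be located elsewhere."""
    produced = {f for io in dvs.values() for f in io["out"]}
    consumed = {f for io in dvs.values() for f in io["in"]}
    return sorted(consumed - produced)


print(parallel_batches(dvs))  # [['top'], ['left', 'right'], ['bottom']]
print(external_inputs(dvs))   # ['f.a']
```

The middle batch is the fan-out where "left" and "right" run in parallel, and `f.a` is the file that would have to be found through a replica catalog.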
Compound Transformations Enable Functional Abstractions
Compound TR encapsulates an entire sub-graph:
TR rangeAnalysis (in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2) {
  call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
  call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}
Derivation scripts
Representation of virtual data provenance:
DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"}, fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" );
DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"}, fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
...
DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"}, fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );
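Derivation lists like this one are typically emitted by a script rather than typed by hand. The sketch below is a hypothetical generator, not the script used for the talk: it assumes the file-numbering scheme that the three visible derivations suggest (six consecutive hexadecimal file names per derivation), and it takes `p1` and `p2` from the caller because the rule behind the parameter sweep is not stated on the slide.

```python
def diamond_dv(index: int, p1: str, p2: str) -> str:
    """Format one DV statement for the 'diamond' transformation.

    index is 1-based. Assumption: derivation i uses files numbered
    6*(i-1) .. 6*(i-1)+5 in upper-case hex, as d1, d2 and d70 do above.
    """
    base = (index - 1) * 6
    fa, fb2, fb1, fc2, fc1, fd = (f"f.{base + k:05X}" for k in range(6))
    return (
        f'DV d{index}->diamond( fd=@{{out:"{fd}"}}, fc1=@{{io:"{fc1}"}}, '
        f'fc2=@{{io:"{fc2}"}}, fb1=@{{io:"{fb1}"}}, fb2=@{{io:"{fb2}"}}, '
        f'fa=@{{io:"{fa}"}}, p2="{p2}", p1="{p1}" );'
    )


if __name__ == "__main__":
    # Parameter values copied from the slide for the derivations it shows.
    for i, p1, p2 in [(1, "0", "100"), (2, "0", "141.42135623731"), (70, "18", "800")]:
        print(diamond_dv(i, p1, p2))
```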
Invocation Provenance
Completion status and resource usage
Attributes of executable transformation
Attributes of input and output files
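As a rough illustration of what such an invocation record might hold, the following Python sketch runs a command and captures its exit status, wall-clock time, and basic attributes of the executable and of the input and output files. The field names and the choice of size, modification time, and SHA-256 digest are assumptions for the example; the slide does not say which attributes the real system records.

```python
import hashlib
import os
import subprocess
import sys
import tempfile
import time
from typing import Dict, List


def file_attributes(path: str) -> Dict[str, object]:
    """Size, modification time and content digest of one file."""
    st = os.stat(path)
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    return {"path": path, "size": st.st_size, "mtime": st.st_mtime, "sha256": digest}


def run_and_record(argv: List[str], inputs: List[str], outputs: List[str]) -> Dict[str, object]:
    """Run argv (argv[0] must be a path to the executable) and describe the run."""
    input_attrs = [file_attributes(p) for p in inputs]  # captured before the run
    start = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True)
    elapsed = time.perf_counter() - start
    return {
        "argv": list(argv),
        "exit_code": proc.returncode,      # completion status
        "wall_seconds": elapsed,           # a simple resource-usage measure
        "executable": file_attributes(argv[0]),
        "inputs": input_attrs,
        "outputs": [file_attributes(p) for p in outputs if os.path.exists(p)],
    }


def demo() -> Dict[str, object]:
    """Upper-case a three-byte file with a child Python process and record it."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.txt")
        dst = os.path.join(tmp, "out.txt")
        with open(src, "w") as fh:
            fh.write("abc")
        code = "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
        return run_and_record([sys.executable, "-c", code, src, dst], [src], [dst])
```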
Executing VDL Workflows
[Figure: planning and execution pipeline. Labels: abstract workflow; local planner; global planner “Pegasus”; “jit” planner (research); Grid info; concrete DAG; DAGman / Condor-G]
GriPhyN-iVDGL Applications to date
ATLAS, BTeV, CMS – HEP event simulation
Argonne Computational Biology – sequence comparison and result capture
LIGO – Pulsar search
Sloan Digital Sky Survey – cluster finding; near-earth object search planned
Quarknet – science education – cosmic rays, HEP analysis
Genome Analysis Database Update
[Figure: GADU - GServer. Automatic workflows (small graphs of steps A, B, C, D) are created as per user request or project and run on Jazz/ANL, Grid3, and UofWisc. An interface to the server serves public (“hit and run”) users, registered groups, and collaborators. Side labels: data flow and storage at various levels; Chimera, Condor, Globus.]
Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS
Described in GGF10 workshop paper.
Virtual Data Example: Galaxy Cluster Search
[Figure: Sloan data; workflow DAG; galaxy cluster size distribution, number of clusters (1 to 100000) vs. number of galaxies (1 to 100)]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago. Described in SC2002 paper.
Cluster Search Workflow Graph and Execution Trace
[Figure: workflow jobs vs. time]
Virtual Data Application: High Energy Physics Data Analysis
[Figure: graph of derived datasets; node labels:]
mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
mass = 200, decay = WW, stability = 1, event = 8
mass = 200, decay = WW, stability = 1, plot = 1
mass = 200, decay = WW, plot = 1
mass = 200, decay = WW, event = 8
mass = 200, decay = WW, stability = 1
mass = 200, decay = WW, stability = 3
mass = 200
mass = 200, decay = WW
mass = 200, decay = ZZ
mass = 200, decay = bb
mass = 200, plot = 1
mass = 200, event = 8
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida. Ref: CHEP 2002 paper
Using Virtual Data for Science Education
The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education
It's an experiment to give students the means to:
– discover and apply datasets, algorithms, and data analysis methods
– collaborate by developing new ones and sharing results and observations
– learn data analysis methods that will ready and excite them for a scientific career
And in later steps, we may actually use the Grid!
Quarknet Virtual Data Project
[Figure: three sites (Central High School, Reston, Virginia; Yale / Middletown High Collaboration, Hartford, Connecticut; Foothills High School, Great Falls, Montana), each with a cosmic ray detector, locally collected data, and student/teacher teams, connect through standard Web access to the Quarknet Virtual Data Portal (student data, algorithms, results, notes, and communications), which sits on the Virtual Data Toolkit and the Virtual Data Catalog.]
Student teacher teams sharing data, methods, programs, and knowledge
Enabling collaboration-intensive science discovery with virtual data tools and methods
Detector Performance Study
Example: BTeV Event Simulation
Search by Metadata
Deriving a new dataset
…to find mass of “z” particle:
Workflow for missing energy calculations
Virtual Provenance: list of derivations and files
<job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5"
     dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
  <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
  <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
</job>
<job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7"
     dv-namespace="Quarknet.HEPSRCH" …
  <argument><filename file="electron10GeV.event"/> <filename file="electron10GeV.sum"/></argument>
  …
</job>
<job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"…
  <argument><filename file="run1a.mis"/> <filename file="run1a.ecal"/> …
  <uses file="run1a.muon" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.total" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.ecal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.hcal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.mis" link="input" dontRegister="false" dontTransfer="false"/>
</job>
<!--list of all files used -->
<filename file="ecal.pct" link="inout"/>
<filename file="electron10GeV.avg" link="inout"/>
<filename file="electron10GeV.sum" link="inout"/>
<filename file="hcal.pct" link="inout"/>
… (excerpted for display)
Virtual Provenance in XML: control flow graph
<child ref="ID000003"> <parent ref="ID000002"/> </child>
<child ref="ID000004"> <parent ref="ID000003"/> </child>
<child ref="ID000005"> <parent ref="ID000004"/> <parent ref="ID000001"/>…
<child ref="ID000009"> <parent ref="ID000008"/> </child>
<child ref="ID000010"> <parent ref="ID000009"/> <parent ref="ID000006"/>…
<child ref="ID000012"> <parent ref="ID000011"/> </child>
<child ref="ID000013"> <parent ref="ID000011"/> </child>
<child ref="ID000014"> <parent ref="ID000010"/> <parent ref="ID000012"/>… <parent ref="ID000013"/>… </child>…
(excerpted for display…)
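A child/parent listing of this kind can be turned back into a dependency graph with a few lines of Python. The sketch below uses only the `<child>` entries that appear complete in the excerpt; the `<excerpt>` wrapper element is an assumption added so the sample is well-formed, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set

# Complete <child> entries copied from the excerpt; <excerpt> is a made-up wrapper.
SAMPLE = """
<excerpt>
  <child ref="ID000003"><parent ref="ID000002"/></child>
  <child ref="ID000004"><parent ref="ID000003"/></child>
  <child ref="ID000009"><parent ref="ID000008"/></child>
  <child ref="ID000012"><parent ref="ID000011"/></child>
  <child ref="ID000013"><parent ref="ID000011"/></child>
</excerpt>
"""


def parse_parents(xml_text: str) -> Dict[str, List[str]]:
    """Map each child job id to the list of its direct parent job ids."""
    root = ET.fromstring(xml_text)
    return {
        child.get("ref"): [p.get("ref") for p in child.findall("parent")]
        for child in root.findall("child")
    }


def ancestors(job: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All jobs that must finish before `job` can run (transitive parents)."""
    seen: Set[str] = set()
    stack = list(parents.get(job, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


parents = parse_parents(SAMPLE)
print(sorted(ancestors("ID000004", parents)))  # ['ID000002', 'ID000003']
```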
And writing the results up in a “poster”
Poster describing analysis
Observations
A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation
The real world is messy – finding the right abstractions is hard, and handling “legacy” applications is even harder
Vision for Provenance in the Large
Universal knowledge management and production systems
Vendors integrate the provenance tracking protocol into data processing products
Ability to run anywhere “in the Grid”
Virtual Data Grid Vision
[Figure: virtual data grid architecture.
Data Grid: storage elements, replica location service, data transport, storage resource management, virtual data catalogs, virtual data index.
Computing Grid: workflow planner, request planner, workflow executor (DAGman), request executor (Condor-G, GRAM), request predictor (Prophesy), Grid Monitor.
Roles: Grid Operations, Science Review, Production Manager, Researcher.
Activities and data: planning, discovery, composition, simulation, analysis, sharing, derivation; simulation data; raw data from the detector.]
Planned Dataset Model
<FORM <Title…>/FORM>
File
Set of files
Relational query or spreadsheet range
XML Element
Set of files with relational index
Object closure
New user-defined dataset type:
Speculative model described in CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao
Planned Dataset Type Model
[Figure: type hierarchy (nonleaf types are superclasses).
Representational: FileDataset; File, FileSet; MultiFileSet, TarFileSet
Logical: EventCollection; RawEventSet, SimulatedEventSet; MonteCarloSimulation, DiscreteEventSimulation]
Provenance Server Plans
OGSA-based Grid services
– Discovery, security, resource management
Supports code and data discovery and workflow management
Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
Derivations can reference remote transformations and datasets
Structured object namespaces & object-level access control enable large VO collaboration
Generalize transforms to describe service calls, database queries and language interpreters
Provenance Hyperlinks
[Figure: a Collaboration VDS, a Group VDS, and two Personal VDS instances, each holding TR, DV, and DS objects that link to one another across servers]
Indexing Servers to Support Discovery
[Figure: collaboration-wide index, collaboration-level index, group index, and personal indexes layered over the Collaboration VDS, Group VDS, and Personal VDS instances and their TR, DV, and DS objects]
For Information and Software
Virtual Data System
– www.griphyn.org/chimera - Chimera Virtual Data System: Overview, papers, software
Grids and Grid Software
– www.ivdgl.org/grid2003 - Using Grid3
– www.griphyn.org/vdt - Virtual Data Toolkit
– www.globus.org - The Globus Toolkit
– www.cs.wisc.edu/condor - The Condor Project
– www.ppdg.net - Particle Physics Data Grid
Acknowledgements: Virtual Data is a Large Team Effort
The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao
The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi
Applications described are the work of many people, including: James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, and their wonderful teams
Acknowledgements
GriPhyN, iVDGL, and QuarkNet are supported (in part) by the National Science Foundation
The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM