Automatic Generation of Workflow Execution Provenance
Roger S. Barga
Database Group, Microsoft Research (MSR)
My interest in scientific workflow and provenance…
In a previous life…
Research Scientist, PNNL, DOE National Laboratory
• Machine learning, pattern recognition over large data sets
• Scientific experiment management system (EMSL)
• Electronic laboratory notebook for experiment capture
More recently…
Database Group, Microsoft Research in Redmond, WA
ImmortalDB (ICDE’06, SIGMOD’06), Event Processing, Phoenix
• Extend commercial software to support scientific research
Tailor software for the sciences, provide free of charge
Serve as a positive force in the community (Tony Hey)
Practical value, challenging information management research issues…
Objectives for this initial effort
Provenance capture that is automatic & transparent
Should persist provenance data for a fixed period of time
Support multiple levels of representation
• WF description, logical log (o & p), deviations, step-by-step trace
Version and lock the executables
Efficient representation and management
• Opportunity to significantly reduce execution provenance storage costs
An enactment engine for scientific workflows that documents all steps linking original inputs with final results so an experiment (execution) can be verified, reproduced or rerun
Issues NOT considered in our initial effort
Annotations and provenance of the workflow
How to include external provenance
Evaluate our prototype on actual scientific workflows
Provide query and analysis support over execution provenance traces…
Focus on mechanism, implement something simple but useful, consider how to manage this virtual data product
Provenance capture that is automatic & transparent
Support multiple levels of representation
Version and lock the executables
Efficient representation and management
Types of Provenance to Capture in Workflow Execution
Experiment Design
• Serialize the workflow schedule (XOML)
Invocation Record
• Invocation of specific activities, events and rules
• Deviations from the defined schedule (shims, etc.)
Interaction Provenance
• Input variables, runtime parameters, activation inputs
• External services invoked, return value(s), etc.
Job Provenance
• Start/complete time, etc.
A workflow schedule: sequential, event, rule driven
An Activity
What about internal state?
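As a rough illustration of the four kinds of provenance listed above, the sketch below models each as a plain record. It is written in Python for brevity rather than with the .NET types WinWF would use, and every class and field name is hypothetical, not taken from the prototype.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ExperimentDesign:
    """The serialized workflow schedule (XOML in the talk)."""
    xoml: str


@dataclass
class InvocationRecord:
    """One invoked activity, event or rule; deviation flags shims etc."""
    activity: str
    deviation: bool = False


@dataclass
class InteractionRecord:
    """Inputs passed to an activity and the external services it called."""
    activity: str
    inputs: Dict[str, Any] = field(default_factory=dict)
    services: List[str] = field(default_factory=list)


@dataclass
class JobRecord:
    """Start/complete times for one job (hypothetical float timestamps)."""
    job_id: str
    started: float
    completed: float

    def duration(self) -> float:
        return self.completed - self.started


@dataclass
class ExecutionProvenance:
    """All provenance gathered for a single workflow execution."""
    design: ExperimentDesign
    invocations: List[InvocationRecord] = field(default_factory=list)
    interactions: List[InteractionRecord] = field(default_factory=list)
    jobs: List[JobRecord] = field(default_factory=list)

    def deviations(self) -> List[str]:
        """Names of invoked steps that were not in the defined schedule."""
        return [r.activity for r in self.invocations if r.deviation]
```

Keeping the four kinds as separate record types is one possible design; it lets a store persist or discard each level independently.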
Architecture Overview
Query and Management Interface (QMI)
Provenance Storage Service Interface (PSI)
Workflow Execution Provenance Storage Service (built using CLFS)
Logical Logging Utility
Problem Solving Environment
Workflow Enactment Engine (WinWF)
Client Query Library
Management Routines
Provenance Services
• Trace execution
• Difference analysis
• Reload runtime state
• …
HPC Job Scheduler
HPC Job Scheduler
CreateJOB(XOML)
ExecuteTask(JID, Act)
Implementation – extending base activity classes
Activities are the basic building blocks
• They are the unit of execution, re-use and composition
• The root of the entire workflow is itself an activity
• Composite activities contain other activities
  E.g.: Sequence, Parallel, Synchronize, Exclusive Choice, Merge, …
• Basic activities are steps within a workflow
Activities are simply classes
• Properties and events are introduced to intercept and pass control to the provenance capture service at runtime…
• Each class defines provenance persistence methods that are invoked by the workflow runtime
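The interception idea above can be sketched as a base class whose entry point wraps the activity's own work with capture calls. This is a minimal Python illustration, not the WinWF activity API: the `Activity` and `Sequence` classes, the `run`/`execute` split and the `sink` callback are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

Record = Tuple[str, str]  # (activity name, event); hypothetical record shape


class Activity:
    """Hypothetical base class: run() wraps execute() with capture hooks."""

    def __init__(self, name: str, sink: Callable[[Record], None]) -> None:
        self.name = name
        self._sink = sink  # stands in for the provenance capture service

    def run(self) -> None:
        # The runtime calls run(); subclasses never see the capture calls.
        self._sink((self.name, "start"))
        self.execute()
        self._sink((self.name, "complete"))

    def execute(self) -> None:
        """Basic activities override this with the actual step."""


class Sequence(Activity):
    """Composite activity: runs its children in order."""

    def __init__(self, name: str, sink: Callable[[Record], None],
                 children: List[Activity]) -> None:
        super().__init__(name, sink)
        self.children = children

    def execute(self) -> None:
        for child in self.children:
            child.run()


# Usage: the root of the workflow is itself an activity.
log: List[Record] = []
wf = Sequence("WF1", log.append, [Activity("Invoke1", log.append)])
wf.run()
```

Because capture lives in the base class, a workflow author who subclasses `Activity` gets provenance records without writing any logging code, which is the sense of "transparent" used here.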
Workflow Execution
My Experiment
rt.StartWorkflow(typeof(WF1));
Instance Manager
Persist Provenance
1 App calls StartWorkflow(…)
WF1
Invoke1
2 Instance Manager:
• Loads workflow type
• Creates instance
• Enqueues WF1 with Scheduler
3 Scheduler dequeues WF1, serializes XOML, and calls Executor (SequentialWorkflow base), which enqueues Sequence
Activity
MyWF.dll
Persist provenance to disk
Execute until idle
Create instance
Execute
Sequence
Save
SequentialWorkflow
Execute
Sequence Execute
OnEvent1
WF1 Instance
WF1
Scheduler
Sequence, OnEvent1, WF1
4 Dequeue Sequence and call Executor, which serializes ActRec and enqueues OnEvent1
5 Dequeue OnEvent1, serialize ActRec and call Executor, which subscribes to event
6 InstanceMgr calls Flush() on WF1 (Activity base class) to flush provenance records and gets back stream
7 InstanceMgr calls Provenance service, passing serialized stream; Provenance Storage service saves to disk
Base Activity Library
Runtime Engine
Runtime Services
Transparent Interception and Logical Logging
SEQUENCE Activity
Workflow Activity 1
...
Workflow Activity N
Each activity is creating an operation history – a time serial stream of provenance records.
Each record represents a change in operational state, such as sequence advancing, a synchronize or branch being taken, activities passing data via method calls.
Replay of the log is an accurate repeated history of state changes, up to and including the “present” state
Provenance Service “weaves” these records into the workflow XOML, recording LSNs for individual activities, insertions (shims), etc.
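To make the operation-history idea concrete, here is a minimal sketch of an append-only log in which every record receives a log sequence number (LSN) and replay rebuilds state up to a chosen point. It is an in-memory Python stand-in, not the CLFS-based service; the `OperationLog` class and its record layout are hypothetical, and the weaving of records into XOML is not modeled.

```python
from typing import Dict, List, Optional, Tuple


class OperationLog:
    """Append-only log; each record gets a log sequence number (LSN)."""

    def __init__(self) -> None:
        # Each record is (lsn, activity, new_state).
        self._records: List[Tuple[int, str, str]] = []

    def append(self, activity: str, state: str) -> int:
        """Record a change in operational state and return its LSN."""
        lsn = len(self._records) + 1
        self._records.append((lsn, activity, state))
        return lsn

    def replay(self, up_to_lsn: Optional[int] = None) -> Dict[str, str]:
        """Rebuild each activity's latest state by re-reading the log."""
        state: Dict[str, str] = {}
        for lsn, activity, new_state in self._records:
            if up_to_lsn is not None and lsn > up_to_lsn:
                break
            state[activity] = new_state
        return state
```

Calling `replay()` with no argument yields the "present" state, while passing an earlier LSN reconstructs the state as of that record.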
Host Process
Workflow Foundation
Provenance Capture Integrated into Runtime Engine and Services
Base Activity Library, classes augmented with provenance capture
My Experiment
Runtime Services
• hosting flexibility - pluggable implementations (with defaults)
Provenance Storage (PSI)
Communication Tracking …
Runtime Engine
• provides intrinsic behaviors to activities
Tracking Infrastructure
State Management
Workflow Execution
Provenance Management
Query Support (initial)
Individual Workflow Execution Trace
• Display a graphical trace of the execution
• Query for skipped steps, inserted steps, etc.
• Query for the codes (activities) invoked
• Query for machine execution stati
Multiple Workflow Execution Traces
• Comparative trace (shallow versus deep compare)
Still “early days” for our query support over a workflow execution provenance trace store
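Two of the queries above, finding skipped or inserted steps and comparing traces, can be sketched over plain lists of activity names. This is an illustrative Python sketch only: the function names are hypothetical, and treating a "shallow" compare as an ordered comparison of activity names is an assumption, since the talk does not define the term.

```python
from typing import List, Tuple


def deviations(schedule: List[str], trace: List[str]) -> Tuple[List[str], List[str]]:
    """Return (skipped, inserted) steps, matching by activity name only."""
    executed = set(trace)
    planned = set(schedule)
    skipped = [step for step in schedule if step not in executed]
    inserted = [step for step in trace if step not in planned]
    return skipped, inserted


def shallow_equal(trace_a: List[str], trace_b: List[str]) -> bool:
    """Assumed meaning of shallow compare: same activities, same order."""
    return trace_a == trace_b
```

For instance, `deviations(["A", "B", "C"], ["A", "Shim", "C"])` reports `B` as skipped and `Shim` as inserted. A deep compare would presumably also look at inputs, parameters and return values, which this sketch ignores.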
An Issue to Consider…
It may not be possible to rerun an experiment, to either validate or recreate a result, because the original workflow is lost (activities have been updated).
Assign a version identifier (strong name) to the workflow assembly so it can be associated with the result; only retain if provenance is retained.
Updating any activity in the workflow will change this version number, resulting in a new version being created.
The user is able to rerun the experiment by invoking the workflow using the fully-specified reference found in the provenance record.
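The property relied on above is that changing any activity yields a new version identifier. The Python sketch below gets that property from a content hash; this is a stand-in chosen for illustration and is not how .NET strong names are produced. The `workflow_version` function and its input shape are hypothetical.

```python
import hashlib
from typing import Dict


def workflow_version(activities: Dict[str, bytes]) -> str:
    """Hash activity names and code so any update yields a new version id."""
    digest = hashlib.sha256()
    # Sort by name so the id does not depend on dictionary ordering.
    for name in sorted(activities):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(activities[name])
        digest.update(b"\0")
    return digest.hexdigest()
```

Storing this identifier in each provenance record would let a rerun look up exactly the set of executables that produced the result.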
Extended Windows Workflow Foundation
• Transparently capture execution trace leading to a result
• Towards a layered provenance model
• Initial query facility built over this provenance data
This summer, evaluation and necessary extensions, analysis support
Luciano Digiampietri (UniCamp/Brazil), project intern
Tying provenance to code versioning
In general, how to manage provenance data and code so the scientist simply doesn’t have to worry about it…
An interesting data management challenge
Provenance as a first class derived data item
To Sum Up…
Closing Comments…
Provenance presents many, many open questions, but offers so much potential…
Execution provenance (sadly) is just the tip…
Is this even provenance – where to draw the line?
Shall we revel in complexity, or focus on the low-hanging fruit? Can’t we do both?
Standards (agreements) on representation/protocols
Try to reach a “tipping point”
Welcome your feedback, suggestions and open to opportunities to collaborate on this problem…