REDUX – automatic capture, efficient storage
Roger S. Barga, Microsoft Research (MSR)
Luciano Digiampietri, University of Campinas, Sao Paulo, Brazil
What information needs to be captured? Which version of BLAST did I use? What codes (activities) did I invoke to get this result, and what were the parameters? What data transformations did I use to get this result? What machine was used to perform the alignment?
Were any steps skipped in this experiment, or were any shims inserted? Did the experiment design differ between these two results? If so, where?... Are there any branches in the workflow that have not been explored?
Additional Issues to Consider…
Result of a provenance query is an executable workflow
Provenance storage costs can quickly grow out of hand…
Considerations
Allow the user to control what is shared/exposed – one size doesn’t fit all
It may not be possible to rerun an experiment, either to validate or to recreate a result, because the original workflow is lost (activities have been updated).
Implementation
Extended enactment engine of WinOE to automatically capture steps during execution leading to a result
Provenance capture is automatic & transparent
Store provenance in an RDBMS (SQL Server), utilize previous traces to significantly reduce storage costs
Current query interface is SQL, eventually a forms-based interface.
Version and lock the executables
Updating any activity will change the workflow version number, resulting in a new version. The user is able to rerun an experiment by invoking the workflow using the fully-specified reference found in the provenance record.
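The versioning rule above can be sketched as follows. This is a minimal illustration, not WinOE's actual mechanism: the `ActivityRef` and `WorkflowRegistry` names are hypothetical, and `align_warp` is only an illustrative activity name. The sketch assumes that every workflow version is kept immutable, so a (workflow, version) pair found in a provenance record always resolves to the same activity versions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ActivityRef:
    """Fully-specified reference to one locked activity version (hypothetical)."""
    name: str
    version: int


@dataclass
class WorkflowRegistry:
    """Keeps every workflow version; old versions are never overwritten."""
    versions: dict = field(default_factory=dict)  # (workflow, version) -> activities

    def register(self, workflow, activities):
        # The next version number is one past the highest one already stored.
        version = 1 + max((v for (w, v) in self.versions if w == workflow), default=0)
        self.versions[(workflow, version)] = tuple(activities)
        return version

    def update_activity(self, workflow, name):
        """Updating one activity yields a new workflow version."""
        latest = max(v for (w, v) in self.versions if w == workflow)
        current = self.versions[(workflow, latest)]
        if all(a.name != name for a in current):
            raise KeyError(name)
        bumped = [ActivityRef(a.name, a.version + 1) if a.name == name else a
                  for a in current]
        return self.register(workflow, bumped)

    def resolve(self, workflow, version):
        """Look up the exact activity versions named in a provenance record."""
        return self.versions[(workflow, version)]


registry = WorkflowRegistry()
v1 = registry.register("atlas", [ActivityRef("align_warp", 1), ActivityRef("convert", 1)])
v2 = registry.update_activity("atlas", "convert")
old_run = registry.resolve("atlas", v1)  # still the original activity versions
```

Because `resolve` returns the stored tuple unchanged, a rerun driven by an old provenance record picks up the activity versions that were in force at the time, even after later updates.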
A multilayer model for representing result provenance
Abstract Workflow, Service Instantiation, Data Instantiation, Runtime
Abstract Workflow
Data Model for Abstract Workflow
Bound to Activities (code) and Data
Data Model for Workflow Instance
Provenance Queries – Query 1
Provenance queries 1, 4, 5, 7, 8 and 9
Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.
Returns ExecutableWorkflowId (process), ExecutionId (id of specific execution of the process), EventId (event where data was produced) and ExecutableWorkflow_ ExecutableActivityId (activity that produced the data) of the processes that generated the Atlas X Graphic
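A lineage query of this kind can be sketched with a recursive SQL query. The example below is an assumption-laden illustration, not the REDUX schema: the `Event` and `EventInput` tables, the file names, and the activity names `softmean` and `slicer` are hypothetical, and SQLite (through Python's `sqlite3` module) stands in for SQL Server. Only the four returned column names are taken from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical runtime tables: one row per event that produced a data item,
-- plus the data items each event consumed.
CREATE TABLE Event (
    EventId INTEGER PRIMARY KEY,
    ExecutableWorkflowId TEXT,
    ExecutionId INTEGER,
    ExecutableActivityId TEXT,
    OutputData TEXT
);
CREATE TABLE EventInput (EventId INTEGER, InputData TEXT);
INSERT INTO Event VALUES
    (1, 'wf-1', 10, 'softmean', 'atlas.img'),
    (2, 'wf-1', 10, 'slicer',   'atlas-x.pgm'),
    (3, 'wf-1', 10, 'convert',  'atlas-x.gif'),
    (4, 'wf-1', 10, 'convert',  'atlas-y.gif');
INSERT INTO EventInput VALUES
    (2, 'atlas.img'), (3, 'atlas-x.pgm'), (4, 'atlas-y.pgm');
""")

# Start at the event that produced the target, then repeatedly step back to
# the events that produced each input of an event already in the lineage.
LINEAGE = """
WITH RECURSIVE lineage(EventId) AS (
    SELECT EventId FROM Event WHERE OutputData = ?
    UNION
    SELECT producer.EventId
    FROM lineage
    JOIN EventInput AS used ON used.EventId = lineage.EventId
    JOIN Event AS producer ON producer.OutputData = used.InputData
)
SELECT e.ExecutableWorkflowId, e.ExecutionId, e.EventId, e.ExecutableActivityId
FROM Event AS e
JOIN lineage AS l ON l.EventId = e.EventId
ORDER BY e.EventId
"""
rows = conn.execute(LINEAGE, ("atlas-x.gif",)).fetchall()
for row in rows:
    print(row)
```

The event that produced `atlas-y.gif` is not returned, since nothing on the path to `atlas-x.gif` depends on it.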
Provenance Queries – Query 7a
Our layered model allows the detection of differences in several ways
A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.
Provenance Queries – Query 7b
Activities used by the second workflow but not the first
The Workflow Model captures information about the instances of the activities and the links among the ports (or activity interfaces). At this layer, our model allows provenance queries to ask, for example, which activities from Workflow 2 are not included in Workflow 1.
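A difference query at the workflow-model layer can be expressed as a set difference. The sketch below assumes a hypothetical `WorkflowActivity` table (not the REDUX schema) and uses SQLite through Python's `sqlite3` module in place of SQL Server; the `EXCEPT` operator is also available in SQL Server. The activity name `slicer` is illustrative, while `convert`, `pgmtoppm`, and `pnmtojpeg` come from the query statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table: which activities each workflow model contains.
CREATE TABLE WorkflowActivity (WorkflowId INTEGER, ActivityName TEXT);
INSERT INTO WorkflowActivity VALUES
    (1, 'slicer'), (1, 'convert'),
    (2, 'slicer'), (2, 'pgmtoppm'), (2, 'pnmtojpeg');
""")

# Activities of the second workflow minus activities of the first.
ONLY_IN_SECOND = """
SELECT ActivityName FROM WorkflowActivity WHERE WorkflowId = :second
EXCEPT
SELECT ActivityName FROM WorkflowActivity WHERE WorkflowId = :first
ORDER BY ActivityName
"""
added = [name for (name,) in conn.execute(ONLY_IN_SECOND, {"first": 1, "second": 2})]
print(added)  # ['pgmtoppm', 'pnmtojpeg']
```

Swapping the two parameters gives the activities that were dropped (here, `convert`).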
Provenance Queries – Query 7c
The Runtime Level contains information about the execution of the workflow (produced data, timestamps, activities invoked, etc.). Here the model allows queries about produced data, data flow (See Q2 and Q3), date/time, etc.
One example query that illustrates the difference between two workflows, at this level, is: What is the data produced by the second workflow that was not produced by the first?
Data produced by workflow 2 that was not produced by workflow 1
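At the runtime level the same comparison runs over execution records rather than over the workflow definition. The sketch below is an illustration under assumptions: the `ProducedData` table and the file names are hypothetical, data items are compared by name only, and SQLite stands in for SQL Server. It uses a `NOT EXISTS` anti-join so that the event that produced each new data item is returned alongside it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical runtime table: data items produced by each execution.
CREATE TABLE ProducedData (ExecutionId INTEGER, EventId INTEGER, DataName TEXT);
INSERT INTO ProducedData VALUES
    (1, 1, 'atlas-x.pgm'), (1, 2, 'atlas-x.gif'),
    (2, 3, 'atlas-x.pgm'), (2, 4, 'atlas-x.ppm'), (2, 5, 'atlas-x.jpg');
""")

# Keep rows of the second run that have no same-named item in the first run.
NEW_DATA = """
SELECT run2.EventId, run2.DataName
FROM ProducedData AS run2
WHERE run2.ExecutionId = :run2
  AND NOT EXISTS (
      SELECT 1 FROM ProducedData AS run1
      WHERE run1.ExecutionId = :run1 AND run1.DataName = run2.DataName
  )
ORDER BY run2.EventId
"""
new_data = conn.execute(NEW_DATA, {"run1": 1, "run2": 2}).fetchall()
print(new_data)  # [(4, 'atlas-x.ppm'), (5, 'atlas-x.jpg')]
```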
Efficiently Storing Provenance Data
For Provenance Query 7
Two workflows are sharing more than 99% of the provenance data (space) and sharing 46% of the database tuples.
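One way to picture this kind of sharing is a store that keeps each distinct provenance tuple once and lets traces reference it. The sketch below uses content-addressed deduplication, which is a stand-in technique chosen for illustration; the slides do not describe REDUX's storage scheme beyond storing provenance as a delta over existing traces. The `TraceStore` class, the record fields, and the sample records are all hypothetical.

```python
import hashlib
import json


class TraceStore:
    """Stores each distinct provenance tuple once; traces hold references."""

    def __init__(self):
        self.tuples = {}  # digest -> serialized tuple
        self.traces = {}  # trace id -> list of digests

    def add_trace(self, trace_id, records):
        """Add a trace and return how many tuples had to be newly stored."""
        digests = []
        new = 0
        for record in records:
            payload = json.dumps(record, sort_keys=True)
            digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
            if digest not in self.tuples:
                self.tuples[digest] = payload
                new += 1
            digests.append(digest)
        self.traces[trace_id] = digests
        return new

    def shared_fraction(self, trace_a, trace_b):
        """Fraction of the distinct tuples of two traces that both reference."""
        a, b = set(self.traces[trace_a]), set(self.traces[trace_b])
        return len(a & b) / len(a | b)


store = TraceStore()
new_1 = store.add_trace("run1", [
    {"activity": "slicer", "output": "atlas-x.pgm"},
    {"activity": "convert", "output": "atlas-x.gif"},
])
new_2 = store.add_trace("run2", [
    {"activity": "slicer", "output": "atlas-x.pgm"},
    {"activity": "pgmtoppm", "output": "atlas-x.ppm"},
    {"activity": "pnmtojpeg", "output": "atlas-x.jpg"},
])
```

In this toy data the second run adds only the two tuples that differ from the first run; the shared `slicer` tuple is stored once and referenced by both traces.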
To Sum Up…
Extended Windows Workflow Foundation
Transparently capture execution trace leading to a result
A layered provenance model
Relational database (SQL Server) as provenance store
Store provenance as delta/edit over existing traces
Initial query facility built over this provenance data
Unique aspects of our system
Result of a provenance query is an executable workflow
Coupled code versioning to provenance collection
An open (and interesting) data management challenge