REDUX – automatic capture, efficient storage
Roger S. Barga, Microsoft Research (MSR)
Luciano Digiampietri, University of Campinas, Sao Paolo, Brazil

Considerations

What information needs to be captured?
  Which version of BLAST did I use?
  What codes (activities) did I invoke to get this result, and what were the parameters?
  What data transformations did I use to get this result?
  What machine was used to perform the alignment?
  Were any steps skipped in this experiment, or were any shims inserted?
  Did the experiment design differ between these two results? If so, where?
  Are there any branches in the workflow that have not been explored?

Additional Issues to Consider…
  Result of a provenance query is an executable workflow
  Provenance storage costs can quickly grow out of hand…
  Allow the user to control what is shared/exposed – one size doesn’t fit all
  It may not be possible to rerun an experiment, either to validate or to recreate a result, because the original workflow is lost (activities have been updated).

Implementation

Extended enactment engine of WinOE to automatically capture steps during execution leading to a result
  Provenance capture is automatic & transparent

Store provenance in an RDBMS (SQL Server), utilize previous traces to significantly reduce storage costs
  Current query interface is SQL, eventually a forms-based interface.

Version and lock the executables
  Updating any activity will change the workflow version number, resulting in a new version.
  User is able to rerun an experiment by invoking the workflow using the fully-specified reference found in the provenance record.

A multilayer model for representing result provenance
  Abstract Workflow
  Service Instantiation
  Data Instantiation
  Runtime
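The "version and lock" idea can be sketched as follows. This is only an illustration under stated assumptions, not the REDUX implementation: `WorkflowRef`, `WorkflowRegistry`, `update_activity` and `resolve` are hypothetical names, and the sample workflow is made up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WorkflowRef:
    """Fully-specified reference: workflow name, version, pinned activity versions."""
    name: str
    version: int
    activities: tuple  # ((activity_name, activity_version), ...)


@dataclass
class WorkflowRegistry:
    """Hypothetical registry: every activity update locks a new workflow version."""
    name: str
    activities: dict = field(default_factory=dict)  # activity name -> version
    versions: list = field(default_factory=list)    # snapshots, index = version - 1

    def _snapshot(self) -> WorkflowRef:
        ref = WorkflowRef(self.name, len(self.versions) + 1,
                          tuple(sorted(self.activities.items())))
        self.versions.append(ref)
        return ref

    def update_activity(self, activity: str) -> WorkflowRef:
        # Bump (or create) the activity version, then record a new workflow version.
        self.activities[activity] = self.activities.get(activity, 0) + 1
        return self._snapshot()

    def resolve(self, version: int) -> WorkflowRef:
        # Rerun support: fetch the exact snapshot named in a provenance record.
        return self.versions[version - 1]


registry = WorkflowRegistry("alignment")
v1 = registry.update_activity("blast")   # workflow version 1
v2 = registry.update_activity("blast")   # activity updated, so workflow version 2
provenance_record = {"result": "hits.txt", "workflow": v1}
rerun_target = registry.resolve(provenance_record["workflow"].version)
```

Because the provenance record keeps the frozen reference rather than just a workflow name, a later rerun resolves to the activity versions that were in place when the result was produced, even after further updates.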

Abstract Workflow

Data Model for Abstract Workflow

Bound to Activities (code) and Data

Data Model for Workflow Instance

Provenance Queries – Query 1

Provenance queries 1, 4, 5, 7, 8 and 9

Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed, etc.

Returns ExecutableWorkflowId (process), ExecutionId (id of the specific execution of the process), EventId (event where the data was produced) and ExecutableWorkflow_ExecutableActivityId (activity that produced the data) of the processes that generated the Atlas X Graphic.
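A lineage query of this kind can be sketched with a recursive SQL query. The sketch below uses SQLite from Python rather than SQL Server, and the `Event` and `EventInput` tables, their layout, and all sample rows are assumptions made for illustration; only the four returned column names come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per event that produced a data item,
# plus one row per data item an event consumed.
conn.executescript("""
CREATE TABLE Event (
    EventId INTEGER PRIMARY KEY,
    ExecutionId INTEGER,
    ExecutableWorkflowId INTEGER,
    ExecutableWorkflow_ExecutableActivityId TEXT,
    OutputData TEXT
);
CREATE TABLE EventInput (EventId INTEGER, InputData TEXT);
INSERT INTO Event VALUES
    (1, 10, 100, 'align_warp', 'warp1'),
    (2, 10, 100, 'reslice',    'resliced1'),
    (3, 10, 100, 'softmean',   'atlas'),
    (4, 10, 100, 'slicer',     'atlas-x.pgm'),
    (5, 10, 100, 'convert',    'Atlas X Graphic'),
    (6, 10, 100, 'convert',    'Atlas Y Graphic');
INSERT INTO EventInput VALUES
    (1, 'anatomy1.img'), (2, 'warp1'), (3, 'resliced1'),
    (4, 'atlas'), (5, 'atlas-x.pgm'), (6, 'atlas-y.pgm');
""")

# Start at the event that produced the target, then repeatedly step back
# to the events that produced each consumed input.
LINEAGE = """
WITH RECURSIVE lineage(EventId) AS (
    SELECT EventId FROM Event WHERE OutputData = ?
    UNION
    SELECT producer.EventId
    FROM lineage
    JOIN EventInput ON EventInput.EventId = lineage.EventId
    JOIN Event AS producer ON producer.OutputData = EventInput.InputData
)
SELECT ExecutableWorkflowId, ExecutionId, EventId,
       ExecutableWorkflow_ExecutableActivityId
FROM Event
WHERE EventId IN (SELECT EventId FROM lineage)
ORDER BY EventId
"""
rows = conn.execute(LINEAGE, ("Atlas X Graphic",)).fetchall()
```

With this sample data, `rows` holds events 1 through 5; event 6, which produced a different graphic, is not part of the lineage.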

Provenance Queries – Query 7a

Provenance queries 1, 4, 5, 7, 8 and 9

Our layered model allows the detection of differences in several ways.

A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.

Provenance Queries – Query 7b

A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.

Activities used by the second workflow but not the first

The Workflow Model captures information about the instances of the activities and the links among the ports (or activity interfaces). At this layer, our model allows provenance queries to ask, for example, which activities from Workflow 2 are not included in Workflow 1:
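A set-difference query is one way to express this. The following is a minimal sketch using SQLite from Python; the `WorkflowActivity` table and the `shared_step` activity are hypothetical, while `convert`, `pgmtoppm` and `pnmtojpeg` come from the query statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical workflow-model table: which activities each workflow contains.
conn.executescript("""
CREATE TABLE WorkflowActivity (WorkflowId INTEGER, ActivityName TEXT);
INSERT INTO WorkflowActivity VALUES
    (1, 'shared_step'), (1, 'convert'),
    (2, 'shared_step'), (2, 'pgmtoppm'), (2, 'pnmtojpeg');
""")

# EXCEPT keeps the rows of the first SELECT that are absent from the second.
DIFF = """
SELECT ActivityName FROM WorkflowActivity WHERE WorkflowId = ?
EXCEPT
SELECT ActivityName FROM WorkflowActivity WHERE WorkflowId = ?
ORDER BY ActivityName
"""
only_in_second = [r[0] for r in conn.execute(DIFF, (2, 1))]
only_in_first = [r[0] for r in conn.execute(DIFF, (1, 2))]
```

Here `only_in_second` is `['pgmtoppm', 'pnmtojpeg']` and `only_in_first` is `['convert']`; swapping the two parameters reverses the direction of the comparison.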

Provenance Queries – Query 7c

A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.

The Runtime Level contains information about the execution of the workflow (produced data, timestamps, activities invoked, etc.). Here the model allows queries about produced data, data flow (see Q2 and Q3), date/time, etc.

One example query that illustrates the difference between two workflows at this level is: what data was produced by the second workflow that was not produced by the first?

Data produced by workflow 2 that was not produced by workflow 1:
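At the runtime level, the comparison can be made on the data itself rather than on activity names. The sketch below is an assumption-laden illustration: the `ProducedData` table, the use of SHA-256 digests to identify data items, and the file names and contents are all invented for the example.

```python
import hashlib
import sqlite3


def digest(payload: bytes) -> str:
    """Identify a data item by its content (an assumption of this sketch)."""
    return hashlib.sha256(payload).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ProducedData (ExecutionId INTEGER, Name TEXT, Digest TEXT)")

# Made-up outputs of two executions; both produce the same intermediate slice.
runs = {
    1: {"atlas-x.pgm": b"slice", "atlas-x.gif": b"gif-bytes"},
    2: {"atlas-x.pgm": b"slice", "atlas-x.ppm": b"ppm-bytes", "atlas-x.jpg": b"jpeg-bytes"},
}
for execution_id, outputs in runs.items():
    conn.executemany(
        "INSERT INTO ProducedData VALUES (?, ?, ?)",
        [(execution_id, name, digest(data)) for name, data in outputs.items()],
    )

# Anti-join: keep outputs of the second execution with no matching digest in the first.
ONLY_IN_SECOND = """
SELECT second.Name
FROM ProducedData AS second
LEFT JOIN ProducedData AS first
       ON first.Digest = second.Digest AND first.ExecutionId = ?
WHERE second.ExecutionId = ? AND first.Digest IS NULL
ORDER BY second.Name
"""
new_data = [r[0] for r in conn.execute(ONLY_IN_SECOND, (1, 2))]
```

With this sample data, `new_data` is `['atlas-x.jpg', 'atlas-x.ppm']`; the shared slice `atlas-x.pgm` is excluded because both executions produced identical content.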

Efficiently Storing Provenance Data

For Provenance Query 7

The two workflows share more than 99% of the provenance data (space) and 46% of the database tuples.
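One way such sharing could arise is if each distinct tuple is stored once and later traces only add what is new. The sketch below illustrates that idea only; `TraceStore` and its methods are hypothetical, the traces are made up, and the resulting numbers are not meant to reproduce the 99% and 46% figures above.

```python
import hashlib


class TraceStore:
    """Hypothetical store that keeps each distinct tuple once and lets traces share it."""

    def __init__(self):
        self.tuples = {}   # digest -> tuple payload
        self.traces = {}   # trace id -> list of digests

    def add_trace(self, trace_id, rows):
        """Record a trace; return how many tuples actually had to be written."""
        digests = []
        new_rows = 0
        for row in rows:
            key = hashlib.sha256(repr(row).encode()).hexdigest()
            if key not in self.tuples:   # only the delta is written
                self.tuples[key] = row
                new_rows += 1
            digests.append(key)
        self.traces[trace_id] = digests
        return new_rows

    def shared_fraction(self, a, b):
        """Fraction of trace b's tuples that also occur in trace a."""
        in_a = set(self.traces[a])
        return sum(d in in_a for d in self.traces[b]) / len(self.traces[b])


store = TraceStore()
run1 = [("align_warp", "anatomy1.img"), ("slicer", "atlas"), ("convert", "atlas-x.pgm")]
run2 = [("align_warp", "anatomy1.img"), ("slicer", "atlas"),
        ("pgmtoppm", "atlas-x.pgm"), ("pnmtojpeg", "atlas-x.ppm")]
written_first = store.add_trace(1, run1)
written_second = store.add_trace(2, run2)
shared = store.shared_fraction(1, 2)
```

In this toy run the first trace writes three tuples, the second writes only the two that changed, and half of the second trace's tuples are shared with the first.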

To Sum Up…

Extended Windows Workflow Foundation
Transparently capture execution trace leading to a result
A layered provenance model
Relational database (SQL Server) as provenance store
Store provenance as delta/edit over existing traces
Initial query facility built over this provenance data

Unique aspects of our system
Result of a provenance query is an executable workflow
Coupled code versioning to provenance collection
An open (and interesting) data management challenge
