Linking Prospective and Retrospective Provenance of Scripts
Saumen Dey, Khalid Belhajjame, David Koop, Meghan Raul, Bertram Ludäscher
TaPP'15 2
Retrospective Provenance Is Useful … But Needs Abstraction
Scientists process and analyze their datasets using scripts.
Retrospective provenance can help scientists analyze script executions, e.g., as captured by noWorkflow [Murta et al., IPAW’14].
While useful, such fine-grained traces can contain an overwhelming amount of information for end users. There is a need for abstraction techniques that focus the attention of users on the provenance information relevant for their analyses.
Related Work
Workflow-Oriented Proposals
Zoom*UserViews [Biton et al., ICDE’08]: Workflow views are used for abstracting the retrospective provenance of (potentially complex) workflows.
LabelFlow [Alper et al., IPAW’14]: Both prospective and retrospective workflow provenance are summarized using user annotations and reduction rules.
Script-Oriented Proposals
YesWorkflow [McPhillips et al., TaPP’15]: As shown in the presentation by B. Ludäscher, a URI scheme can be used to reconstruct part of the retrospective provenance without recording it at run time. This scheme is useful when the data products used and generated, including intermediate ones, are stored within the file system.
We adopt an approach similar to Zoom*UserViews and apply it to scripts: we use the workflow specifications (i.e., prospective provenance) extracted by YesWorkflow to abstract the retrospective provenance captured by noWorkflow.
Approach
[Figure: approach diagram. noWorkflow: Script, Run, Retrospective Provenance. YesWorkflow: Annotate, Extract Workflow Specification, Workflow Description.]
• variable(6, 93, row, 14, "[0.0]", 1430231173.397779).
• dependency(6, 34, 94, 93).
Approach
[Figure: approach diagram extended with Link*. noWorkflow: Script, Run, Retrospective Provenance. YesWorkflow: Annotate, Extract Workflow Specification, Workflow Description. Added: Link*, Query Provenance, User.]
Link*: links retrospective and prospective provenance.
Linking noWorkflow Data Instances to YesWorkflow Variables
Variable names are not reliable IDs: two different variables may have the same name within a script, e.g., within the scope of different functions.
noWorkflow tracks line numbers in the script to identify the variable.
YesWorkflow does not provide this information. However, it provides the line numbers of the start and end of a block, when requested.
Using these two pieces of information, we are able to connect data values in noWorkflow to their corresponding YesWorkflow variables.
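The line-number matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `VariableRecord`, `Block`, and `link` are hypothetical names, and the sketch assumes each noWorkflow variable record carries a name and a line number while each YesWorkflow block carries its start line, end line, and declared variable names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class VariableRecord:
    """Hypothetical view of one noWorkflow variable fact."""
    var_id: int
    name: str
    line: int
    value: str


@dataclass(frozen=True)
class Block:
    """Hypothetical view of one YesWorkflow block and its line extent."""
    name: str
    first_line: int
    last_line: int
    variables: frozenset


def link(record: VariableRecord, blocks: Iterable[Block]) -> Optional[Tuple[str, str]]:
    """Return (block name, variable name) for the innermost block that
    contains the record's line and declares its name, or None."""
    candidates = [
        b for b in blocks
        if b.first_line <= record.line <= b.last_line and record.name in b.variables
    ]
    if not candidates:
        return None
    # Assumption: with nested blocks, the smallest enclosing block wins.
    innermost = min(candidates, key=lambda b: b.last_line - b.first_line)
    return (innermost.name, record.name)


# Two blocks that both use a variable called "row" (hypothetical data).
blocks = [
    Block("load_data", 10, 20, frozenset({"row"})),
    Block("simulate", 30, 45, frozenset({"row"})),
]
record = VariableRecord(var_id=93, name="row", line=14, value="[0.0]")
linked = link(record, blocks)
```

Here the name `row` alone is ambiguous, but the line number 14 falls only inside the hypothetical `load_data` block, so the record is linked to that block's variable.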
Annotating data values with YesWorkflow annotations
The variables in YesWorkflow are associated with user annotations.
Such annotations can be used to provide more information about the data values within the noWorkflow retrospective provenance.
Note that YesWorkflow allows specifying variables that are not mapped to any variable within the script.
Controlflow
[Figure: workflow diagram with labels model, pastTemperatureData, pastPrecipitationData, simulatedWeather, model_1, Model_2.]
Data Dependencies
We found gaps in the noWorkflow dependency graph (when exported to Prolog):
• Function returns do not always link back to the correct variable.
• An object that is modified (e.g., via a list.append call) is not always captured in the dependency graph.
We also observed that noWorkflow does not capture the name of the script in the retrospective provenance. This is required for uniquely identifying variables.
Objects Within Objects
An object may be nested, i.e., it may have other objects as its children.
Any change to a child object is attributed to the parent object.
noWorkflow attributes any change to any child object to the ultimate parent.
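The attribution rule above can be illustrated with a short Python sketch. It is not noWorkflow's code: the `parent_of` mapping from child identifier to containing object is a hypothetical representation, chosen only to show what attributing a change to the ultimate parent means for nested objects.

```python
def ultimate_parent(obj_id, parent_of):
    """Follow containment links until reaching an object with no parent."""
    seen = set()
    current = obj_id
    while current in parent_of:
        if current in seen:
            raise ValueError("cycle in containment links")
        seen.add(current)
        current = parent_of[current]
    return current


def attribute_changes(changed_ids, parent_of):
    """Map each changed object to the outermost object that contains it."""
    return {ultimate_parent(c, parent_of) for c in changed_ids}


# Hypothetical nesting: a cell inside a row inside a table.
parent_of = {"cell": "row", "row": "table"}
attributed = attribute_changes(["cell", "standalone"], parent_of)
```

Under this rule a change to `cell` is reported against `table`, so a query cannot tell which nested element actually changed.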
Example Query
Q1: Find the temperature file used.
fileName(N) :- map(V,A), A="temperatureDataFile", iData(V,N).
Q2: Show how the temperature file was used.
fileUsedBy(X,Y) :- iDepAbs(X,Y), map(X,A), A="temperatureDataFile".
fileUsedBy(X,Y) :- fileUsedBy(X,Z), iDepAbs(Z,Y).
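For readers less familiar with Datalog, the two queries can be mirrored in Python. This is a sketch under assumptions: the facts are modelled as plain lists of pairs, the identifiers and file names are invented for illustration, and the body of the second Q2 rule is read as a conjunction. Q1 is a join between map and iData; Q2 is a reachability computation over iDepAbs edges starting from the annotated variable.

```python
from collections import deque


def file_name(map_facts, idata_facts, annotation="temperatureDataFile"):
    """Q1: values N such that map(V, A), A == annotation, iData(V, N)."""
    annotated = {v for v, a in map_facts if a == annotation}
    return {n for v, n in idata_facts if v in annotated}


def file_used_by(map_facts, idep_abs, annotation="temperatureDataFile"):
    """Q2: pairs (X, Y) where X is annotated and Y is reachable from X
    through one or more iDepAbs edges."""
    successors = {}
    for x, y in idep_abs:
        successors.setdefault(x, set()).add(y)
    result = set()
    for start in {v for v, a in map_facts if a == annotation}:
        seen = set()
        queue = deque(successors.get(start, ()))
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            result.add((start, node))
            queue.extend(successors.get(node, ()))
    return result


# Hypothetical facts, for illustration only.
map_facts = [("v1", "temperatureDataFile"), ("v2", "precipitationDataFile")]
idata_facts = [("v1", "temp.csv"), ("v2", "precip.csv")]
idep_abs = [("v1", "v3"), ("v3", "v4"), ("v2", "v5")]

names = file_name(map_facts, idata_facts)
uses = file_used_by(map_facts, idep_abs)
```

The breadth-first traversal plays the role of the recursive rule: each iDepAbs edge followed from an already reached node extends the result by one pair.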
Conclusions
We have implemented an approach for linking the retrospective provenance of scripts to a more abstract, user-defined prospective provenance.
This solution is complementary to the YesWorkflow solution for capturing retrospective provenance of data files [1].
The issues that we came across using YesWorkflow and noWorkflow were communicated to the development teams of these tools so that they can be addressed.
Linking prospective and retrospective provenance for multiple scripts, as opposed to a single script: designers may organize the implementation of their data analyses into multiple scripts.
[1] T. M. McPhillips et al. Retrospective provenance without a run-time provenance recorder. In TaPP, 2015.