Detecting Duplicate Records in Scientific Workflow Results

DESCRIPTION

This talk presents a solution whereby duplicate records in workflow results are detected using provenance traces.

1. Detecting Duplicate Records in Scientific Workflow Results
Khalid Belhajjame (1), Paolo Missier (2), and Carole A. Goble (1)
(1) University of Manchester, (2) University of Newcastle

2. Scientific Workflows
Scientific workflows are increasingly used by scientists as a means for specifying and enacting their experiments. They tend to be data intensive. The data sets obtained as a result of their enactment can be stored in public repositories to be queried, analyzed and used to feed the execution of other workflows.

3. Duplicates in Workflow Results
The datasets obtained as a result of workflow execution often contain duplicates. As a result, the analysis and interpretation of workflow results may become tedious, and the presence of duplicates unnecessarily increases the size of workflow results.

4. Duplicate Record Detection
Research in duplicate record detection has been active for more than three decades; Elmagarmid et al. (2007) conducted a comprehensive survey of the topic. We do not aim to design yet another algorithm for comparing and matching records. Rather, we investigate how provenance traces produced as a result of workflow executions can be used to guide the detection of duplicate records in workflow results.
Reference: Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1-16, 2007.

5. Outline
Data-driven workflows and provenance traces.
A method for guiding duplicate detection in workflow results based on provenance traces.
Preliminary validation using real-world workflows.

6. Preliminaries: Data-Driven Workflows
A data-driven workflow can be defined as a directed graph wf = ⟨N, E⟩. A node represents an analysis operation together with its input and output parameters: ⟨op, I_op, O_op⟩ ∈ N. The edges are dataflow dependencies connecting an output o of an operation op to an input i′ of an operation op′: ⟨op, o, op′, i′⟩ ∈ E.

7. Preliminaries: Provenance Trace
The execution of workflows gives rise to a provenance trace, which we capture using two relations.
Transformation: specifies that the execution of an operation took as input a given ordered set of records and generated another ordered set of records, written OutB_op ← InB_op, where the output binding is OutB_op = ⟨⟨op, o_1⟩, r_o1⟩, ..., ⟨⟨op, o_m⟩, r_om⟩ and the input binding is InB_op = ⟨⟨op, i_1⟩, r_i1⟩, ..., ⟨⟨op, i_n⟩, r_in⟩.
Transfer: specifies the transfer of a record r along an edge of the workflow, from the output o of op to the input i′ of op′: ⟨op′, i′, r⟩ ← ⟨op, o, r⟩.

8. Outline (repeated)
Data-driven workflows and provenance traces; a method for guiding duplicate detection in workflow results based on provenance traces; preliminary validation using real-world workflows.

9. Provenance-Guided Detection of Duplicates: Approach
To guide the detection of duplicates in workflow results we exploit the following fact: an operation that is known to be deterministic produces identical output bindings given the same input binding. In terms of the provenance trace T:
deterministic(op) ∧ (OutB_op ← InB_op) ∈ T ∧ (OutB′_op ← InB_op) ∈ T ⟹ id(OutB_op, OutB′_op)

10. Provenance-Guided Detection of Duplicates: Example
[Figure: a two-operation workflow in which IdentifyProtein (input i, output o) feeds GetGOTerm (input i′, output o′); the record sets Ri, Ro, Ri′ and Ro′ are bound to these parameters.]
Step 1. The set of records Ri that are bound to the input parameter of the starting operation are compared to identify duplicate records. The result of this phase is a partition of Ri into disjoint sets of identical records: Ri = Ri^1 ∪ ... ∪ Ri^n.

11. Provenance-Guided Detection of Duplicates: Example (continued)
Step 2. The sets of records Ro, Ri′ and Ro′ are partitioned into sets of identical records based on the partitioning of Ri. For example, Ro = Ro^1 ∪ ... ∪ Ro^n, where
Ro^k = { ro ∈ Ro | there exists ri ∈ Ri^k such that ⟨⟨IdentifyProtein, o⟩, ro⟩ ← ⟨⟨IdentifyProtein, i⟩, ri⟩ is in the trace }.
(A small code sketch of this partitioning and propagation follows below.)
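
To make the two-step partitioning on slides 10 and 11 concrete, here is a minimal sketch in Python. It is not part of the talk: the record representation (ids mapped to values), the encoding of the trace as (input record, output record) pairs, and the example values are illustrative assumptions. Step 1 compares only the records bound to the starting operation's input; step 2 propagates the resulting partition through the trace of a deterministic operation.

```python
from collections import defaultdict

# Hypothetical encoding: each record has an id and a value; two records are
# duplicates when their values are equal. The trace of a deterministic
# operation is given as (input_record_id, output_record_id) pairs.
values = {"r1": "P12345", "r2": "Q67890", "r3": "P12345",     # records in Ri
          "o1": "GO:0001", "o2": "GO:0042", "o3": "GO:0001"}  # records in Ro
trace = [("r1", "o1"), ("r2", "o2"), ("r3", "o3")]

def partition_identical(record_ids, values):
    """Step 1: compare the records bound to the starting operation's input
    and partition them into disjoint sets of identical records."""
    groups = defaultdict(list)
    for rid in record_ids:
        groups[values[rid]].append(rid)
    return list(groups.values())

def propagate_partition(input_partition, trace):
    """Step 2: partition the output records according to the groups of the
    input records they were derived from, without comparing output values."""
    group_of = {rid: k for k, grp in enumerate(input_partition) for rid in grp}
    out_groups = defaultdict(list)
    for in_id, out_id in trace:
        out_groups[group_of[in_id]].append(out_id)
    return list(out_groups.values())

ri_partition = partition_identical(["r1", "r2", "r3"], values)
print(ri_partition)                               # [['r1', 'r3'], ['r2']]
print(propagate_partition(ri_partition, trace))   # [['o1', 'o3'], ['o2']]
```

The point of the sketch is that o1 and o3 end up in the same group purely because their inputs r1 and r3 were found to be identical and the operation is deterministic; the output values themselves are never compared.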

12. Provenance-Guided Detection of Duplicates: Example (continued)
In the example just described, the operations that compose the workflow have exactly one input and one output parameter. However, the algorithm presented in the paper supports operations with multiple input and output parameters. Notice that we assume that the analysis operations that compose the workflow are deterministic. This is not always the case, which raises the question of how to determine that a given operation is deterministic.

13. Verifying the Determinism of Analysis Operations
To verify the determinism of operations, we use an approach whereby operations are probed. (A small code sketch of this probing step appears after the closing slide.)
1. Given an operation op, we select example values that can be used by the inputs of op, and invoke op using those values multiple times.
2. If op produces identical output values given identical input values, then it is likely to be deterministic; otherwise, it is not deterministic.

14. Collection-Based Workflows
To support duplicate detection in collection-based workflows we need to be able to:
Identify when two collections are identical. Two collections Ri and Rj are identical if they are of the same size and there is a bijective mapping map : Ri → Rj that maps each record ri in Ri to a record rj in Rj such that ri and rj are identical.
Identify duplicate records between two collections that are known to be identical, i.e., identify a bijective mapping that maps every ri in Ri to an identical rj in Rj.

15. Outline (repeated)
Data-driven workflows and provenance traces; a method for guiding duplicate detection in workflow results based on provenance traces; preliminary validation using real-world workflows.

16. Validation
The method that we presented in this paper can be applied when the operations are deterministic. To gain insight into the degree to which the operations that compose the workflows are deterministic, we ran an experiment.
Datasets: 15 bioinformatics workflows that cover a wide range of analyses, namely biological pathway analysis, sequence alignment and molecular interaction analysis.
Process: to identify which of these operations are deterministic, we ran each of them 3 times using example values that were found either within myExperiment or BioCatalogue.

17. Validation (continued)
After manual analysis of the results, it transpires that 5 of the 151 operations that compose the workflows are not deterministic. Note that many of the operations that we analyzed access and use underlying data sources in their computation; updates to such sources may therefore break the determinism assumption (Chirigati and Freire, 2012). This suggests that determinism holds within a window of time during which the underlying sources remain the same, and that there is a need for monitoring techniques to identify such windows.
Reference: Fernando Chirigati and Juliana Freire. Towards integrating workflow and database provenance: A practical approach. IPAW, 2012.

18. Conclusions and Future Work
We described a method that can be used to guide duplicate detection in workflow results. Future work includes monitoring the determinism of analysis operations and extending the method to support duplicate detection across the results of different workflows.

19. Detecting Duplicate Records in Scientific Workflow Results (closing slide)
Khalid Belhajjame (1), Paolo Missier (2), and Carole A. Goble (1)
(1) University of Manchester, (2) University of Newcastle
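
As a companion to the probing procedure on slide 13, here is a minimal sketch, assuming each analysis operation is available as a Python callable; the function name, the stand-in operation and the default of three runs are illustrative, not from the talk.

```python
def appears_deterministic(op, example_inputs, runs=3):
    """Probe an operation: invoke it several times on the same example
    inputs and report whether all outputs are identical.

    A positive result only suggests determinism (as noted on slide 17, the
    operation may depend on data sources that change over time); a negative
    result shows non-determinism on these inputs.
    """
    for inputs in example_inputs:
        outputs = [op(*inputs) for _ in range(runs)]
        if any(out != outputs[0] for out in outputs[1:]):
            return False
    return True

# Hypothetical usage with a stand-in operation.
def identify_protein(sequence):
    return "P12345" if "MKT" in sequence else "unknown"

print(appears_deterministic(identify_protein, example_inputs=[("MKTAYIAK",), ("GGG",)]))
# True: identical outputs on every run, so the operation is likely deterministic.
```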