Detecting Duplicate Records in Scientific Workflow Results

DESCRIPTION

This talk presents a solution whereby duplicate records in workflow results are detected using provenance traces.

1. Detecting Duplicate Records in Scientific Workflow Results
Khalid Belhajjame (1), Paolo Missier (2), and Carole A. Goble (1)
(1) University of Manchester, (2) University of Newcastle

2. Scientific Workflows
Scientific workflows are increasingly used by scientists as a means for specifying and enacting their experiments. They tend to be data intensive. The data sets obtained as a result of their enactment can be stored in public repositories to be queried, analyzed and used to feed the execution of other workflows.

3. Duplicates in Workflow Results
The datasets obtained as a result of workflow execution often contain duplicates. As a result, the analysis and interpretation of workflow results may become tedious, and the presence of duplicates unnecessarily increases the size of workflow results.

4. Duplicate Record Detection
Research in duplicate record detection has been active for more than three decades; Elmagarmid et al. (2007) conducted a comprehensive survey of the topic. We do not aim to design yet another algorithm for comparing and matching records. Rather, we investigate how provenance traces produced as a result of workflow executions can be used to guide the detection of duplicate records in workflow results.
Reference: Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1-16, 2007.

5. Outline
Data-driven workflows and provenance traces.
A method for guiding duplicate detection in workflow results based on provenance traces.
Preliminary validation using real-world workflows.

6. Preliminaries: Data-Driven Workflows
A data-driven workflow can be defined as a directed graph wf = ⟨N, E⟩. A node represents an analysis operation together with its input and output parameters: ⟨op, I_op, O_op⟩ ∈ N. The edges are dataflow dependencies connecting an output o of an operation op to an input i′ of an operation op′: ⟨op, o, op′, i′⟩ ∈ E.

7. Preliminaries: Provenance Trace
The execution of workflows gives rise to a provenance trace, which we capture using two relations.
Transformation: specifies that the execution of an operation took as input a given ordered set of records and generated another ordered set of records, written OutB_op ← InB_op, where the output binding is OutB_op = ⟨⟨op, o_1⟩, r_o1⟩, ..., ⟨⟨op, o_m⟩, r_om⟩ and the input binding is InB_op = ⟨⟨op, i_1⟩, r_i1⟩, ..., ⟨⟨op, i_n⟩, r_in⟩.
Transfer: specifies the transfer of a record r along an edge of the workflow, from the output o of op to the input i′ of op′: ⟨op′, i′, r⟩ ← ⟨op, o, r⟩.

8. Outline (repeated)
Data-driven workflows and provenance traces; a method for guiding duplicate detection in workflow results based on provenance traces; preliminary validation using real-world workflows.

9. Provenance-Guided Detection of Duplicates: Approach
To guide the detection of duplicates in workflow results we exploit the following fact: an operation that is known to be deterministic produces identical output bindings given the same input binding. In terms of the provenance trace T:
deterministic(op) ∧ (OutB_op ← InB_op) ∈ T ∧ (OutB′_op ← InB_op) ∈ T ⟹ id(OutB_op, OutB′_op)

10. Provenance-Guided Detection of Duplicates: Example
[Figure: a two-operation workflow in which IdentifyProtein (input i, output o) feeds GetGOTerm (input i′, output o′); the record sets Ri, Ro, Ri′ and Ro′ are bound to these parameters.]
Step 1. The set of records Ri that are bound to the input parameter of the starting operation are compared to identify duplicate records. The result of this phase is a partition of Ri into disjoint sets of identical records: Ri = Ri^1 ∪ ... ∪ Ri^n.

11. Provenance-Guided Detection of Duplicates: Example (continued)
Step 2. The sets of records Ro, Ri′ and Ro′ are partitioned into sets of identical records based on the partitioning of Ri. For example, Ro = Ro^1 ∪ ... ∪ Ro^n, where
Ro^k = { ro ∈ Ro | there exists ri ∈ Ri^k such that ⟨⟨IdentifyProtein, o⟩, ro⟩ ← ⟨⟨IdentifyProtein, i⟩, ri⟩ is in the trace }.
(A small code sketch of this partitioning and propagation follows below.)
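
To make the two-step partitioning on slides 10 and 11 concrete, here is a minimal sketch in Python. It is not part of the talk: the record representation (ids mapped to values), the encoding of the trace as (input record, output record) pairs, and the example values are illustrative assumptions. Step 1 compares only the records bound to the starting operation's input; step 2 propagates the resulting partition through the trace of a deterministic operation.

```python
from collections import defaultdict

# Hypothetical encoding: each record has an id and a value; two records are
# duplicates when their values are equal. The trace of a deterministic
# operation is given as (input_record_id, output_record_id) pairs.
values = {"r1": "P12345", "r2": "Q67890", "r3": "P12345",     # records in Ri
          "o1": "GO:0001", "o2": "GO:0042", "o3": "GO:0001"}  # records in Ro
trace = [("r1", "o1"), ("r2", "o2"), ("r3", "o3")]

def partition_identical(record_ids, values):
    """Step 1: compare the records bound to the starting operation's input
    and partition them into disjoint sets of identical records."""
    groups = defaultdict(list)
    for rid in record_ids:
        groups[values[rid]].append(rid)
    return list(groups.values())

def propagate_partition(input_partition, trace):
    """Step 2: partition the output records according to the groups of the
    input records they were derived from, without comparing output values."""
    group_of = {rid: k for k, grp in enumerate(input_partition) for rid in grp}
    out_groups = defaultdict(list)
    for in_id, out_id in trace:
        out_groups[group_of[in_id]].append(out_id)
    return list(out_groups.values())

ri_partition = partition_identical(["r1", "r2", "r3"], values)
print(ri_partition)                               # [['r1', 'r3'], ['r2']]
print(propagate_partition(ri_partition, trace))   # [['o1', 'o3'], ['o2']]
```

The point of the sketch is that o1 and o3 end up in the same group purely because their inputs r1 and r3 were found to be identical and the operation is deterministic; the output values themselves are never compared.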

12. Provenance-Guided Detection of Duplicates: Example (continued)
In the example just described, the operations that compose the workflow have exactly one input and one output parameter. However, the algorithm presented in the paper supports operations with multiple input and output parameters. Notice that we assume that the analysis operations that compose the workflow are deterministic. This is not always the case, which raises the question of how to determine that a given operation is deterministic.

13. Verifying the Determinism of Analysis Operations
To verify the determinism of operations, we use an approach whereby operations are probed. (A small code sketch of this probing step appears after the closing slide.)
1. Given an operation op, we select example values that can be used by the inputs of op, and invoke op using those values multiple times.
2. If op produces identical output values given identical input values, then it is likely to be deterministic; otherwise, it is not deterministic.

14. Collection-Based Workflows
To support duplicate detection in collection-based workflows we need to be able to:
Identify when two collections are identical. Two collections Ri and Rj are identical if they are of the same size and there is a bijective mapping map : Ri → Rj that maps each record ri in Ri to a record rj in Rj such that ri and rj are identical.
Identify duplicate records between two collections that are known to be identical, i.e., identify a bijective mapping that maps every ri in Ri to an identical rj in Rj.

15. Outline (repeated)
Data-driven workflows and provenance traces; a method for guiding duplicate detection in workflow results based on provenance traces; preliminary validation using real-world workflows.

16. Validation
The method that we presented in this paper can be applied when the operations are deterministic. To gain insight into the degree to which the operations that compose the workflows are deterministic, we ran an experiment.
Datasets: 15 bioinformatics workflows that cover a wide range of analyses, namely biological pathway analysis, sequence alignment and molecular interaction analysis.
Process: to identify which of these operations are deterministic, we ran each of them 3 times using example values that were found either within myExperiment or BioCatalogue.

17. Validation (continued)
After manual analysis of the results, it transpires that 5 of the 151 operations that compose the workflows are not deterministic. Note that many of the operations that we analyzed access and use underlying data sources in their computation; updates to such sources may therefore break the determinism assumption (Chirigati and Freire, 2012). This suggests that determinism holds within a window of time during which the underlying sources remain the same, and that there is a need for monitoring techniques to identify such windows.
Reference: Fernando Chirigati and Juliana Freire. Towards integrating workflow and database provenance: A practical approach. IPAW, 2012.

18. Conclusions and Future Work
We described a method that can be used to guide duplicate detection in workflow results. Future work includes monitoring the determinism of analysis operations and extending the method to support duplicate detection across the results of different workflows.

19. Detecting Duplicate Records in Scientific Workflow Results (closing slide)
Khalid Belhajjame (1), Paolo Missier (2), and Carole A. Goble (1)
(1) University of Manchester, (2) University of Newcastle
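
As a companion to the probing procedure on slide 13, here is a minimal sketch, assuming each analysis operation is available as a Python callable; the function name, the stand-in operation and the default of three runs are illustrative, not from the talk.

```python
def appears_deterministic(op, example_inputs, runs=3):
    """Probe an operation: invoke it several times on the same example
    inputs and report whether all outputs are identical.

    A positive result only suggests determinism (as noted on slide 17, the
    operation may depend on data sources that change over time); a negative
    result shows non-determinism on these inputs.
    """
    for inputs in example_inputs:
        outputs = [op(*inputs) for _ in range(runs)]
        if any(out != outputs[0] for out in outputs[1:]):
            return False
    return True

# Hypothetical usage with a stand-in operation.
def identify_protein(sequence):
    return "P12345" if "MKT" in sequence else "unknown"

print(appears_deterministic(identify_protein, example_inputs=[("MKTAYIAK",), ("GGG",)]))
# True: identical outputs on every run, so the operation is likely deterministic.
```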