Mining Usage Patterns from a Repository of Scientific Workflow

  • View
    178

  • Download
    3

Embed Size (px)

DESCRIPTION

Full paper: http://boole.diiga.univpm.it/paper/sac2012.pdf In many experimental domains, especially e-Science, workflow management systems are gaining increasing attention to design and execute in-silico experiments involving data analysis tools. As a by-product, a repository of workflows is generated, that formally describes experimental protocols and the way different tools are combined inside experiments. In this paper we describe the use of the SUBDUE graph clustering algorithm to discover sub-workflows from a repository. Since sub-workflows represent significant usage patterns of tools, the discovered knowledge can be exploited by scientists to learn by-example about design practices, or to retrieve and reuse workflows. Such a knowledge, ultimately, leverages the potential of scientific workflow repositories to become a knowledge-asset. A set of experiments is conducted on the myExperiment repository to assess the effectiveness of the approach.

Text of Mining Usage Patterns from a Repository of Scientific Workflow

  • 1. Mining Usage Patterns from a Repositoryof Scientic Workows Claudia Diamantini, Domenico Potena, Emanuele StortiDII, Universita Politecnica delle Marche, Ancona, Italy SAC 2012, Riva del Garda, March 26-30Emanuele Storti (UNIVPM)Mining Usage Patterns SAC 2012 1 / 21

2. IntroductionProcess mining techniques allow for extracting knowledge from eventlogs, i.e. traces of running processes:audit trails of a workow management systemtransaction logs of an enterprise resource planning systemMain applications:process discovery: what is really happening?conformance check: are we doing what was agreed upon?process extension: how can we redesign the process?Emanuele Storti (UNIVPM) Mining Usage PatternsSAC 2012 2 / 21 3. IntroductionProcess mining techniques allow for extracting knowledge from eventlogs, i.e. traces of running processes:audit trails of a workow management systemtransaction logs of an enterprise resource planning systemMain applications:process discovery: what is really happening?conformance check: are we doing what was agreed upon?process extension: how can we redesign the process?Emanuele Storti (UNIVPM) Mining Usage PatternsSAC 2012 2 / 21 4. IntroductionProcess mining techniques allow for extracting knowledge from eventlogs, i.e. traces of running processes:audit trails of a workow management systemtransaction logs of an enterprise resource planning systemMain applications:process discovery: what is really happening?conformance check: are we doing what was agreed upon?process extension: how can we redesign the process?Emanuele Storti (UNIVPM) Mining Usage PatternsSAC 2012 2 / 21 5. MotivationLittle work about how to extract knowledge from a set of models (i.e.:process schemas)understanding of common patterns of usage (best/worstpractices)support in process designprocess integration (from multiple sources)process optimization/re-engineeringindexing of processes and their retrievalEmanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 3 / 21 6. MethodologyProcesses schemas have an inherent graph structure: manygraph-mining tools availableApproach 1 representation of original processes into graphs 2 usage of graph-mining techniques to extract SUBsWhich representation form?Which specic technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecicity) Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21 7. MethodologyProcesses schemas have an inherent graph structure: manygraph-mining tools availableApproach 1 representation of original processes into graphs 2 usage of graph-mining techniques to extract SUBsWhich representation form?Which specic technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecicity) Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21 8. MethodologyProcesses schemas have an inherent graph structure: manygraph-mining tools availableApproach 1 representation of original processes into graphs 2 usage of graph-mining techniques to extract SUBsWhich representation form?Which specic technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecicity) Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21 9. MethodologyProcesses schemas have an inherent graph structure: manygraph-mining tools availableApproach 1 representation of original processes into graphs 2 usage of graph-mining techniques to extract SUBsWhich representation form?Which specic technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecicity) Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21 10. SUBDUESUBDUE [Joyner, 2000]graph-based hierarchical clustering algorithmsuited for discrete-valued and structured datasearches for substructures (i.e., subgraphs) that best compressthe input graph, according to MDLComments:to compress G with a SUB: to replace every occurrence of SUB inG with a single new nodeSUB best compresses G: num. of bits needed to represent G afterthe compression is minimumEmanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 5 / 21 11. SUBDUESUBDUE [Joyner, 2000]graph-based hierarchical clustering algorithmsuited for discrete-valued and structured datasearches for substructures (i.e., subgraphs) that best compressthe input graph, according to MDLComments:to compress G with a SUB: to replace every occurrence of SUBin G with a single new nodeSUB best compresses G: num. of bits needed to represent G afterthe compression is minimumEmanuele Storti (UNIVPM) Mining Usage PatternsSAC 2012 5 / 21 12. SUBDUESUBDUE [Joyner, 2000]graph-based hierarchical clustering algorithmsuited for discrete-valued and structured datasearches for substructures (i.e., subgraphs) that best compressthe input graph, according to MDLComments:to compress G with a SUB: to replace every occurrence of SUB inG with a single new nodeSUB best compresses G: num. of bits needed to represent Gafter the compression is minimumEmanuele Storti (UNIVPM) Mining Usage PatternsSAC 2012 5 / 21 13. SUBDUE (contd) 1 searches for the best SUB (i.e., that minimizes the Description Length of the graph) 2 compresses the graph by using the best SUBSubsequent SUBs may be dened in terms of previously denedSUBs. Iterations of this basic step results in a lattice Emanuele Storti (UNIVPM) Mining Usage PatternsSAC 20126 / 21 14. SUBDUE (contd) 1 searches for the best SUB (i.e., that minimizes the Description Length of the graph) 2 compresses the graph by using the best SUBSubsequent SUBs may be dened in terms of previously denedSUBs. Iterations of this basic step results in a lattice Emanuele Storti (UNIVPM) Mining Usage PatternsSAC 20126 / 21 15. SUBDUE (contd) 1 searches for the best SUB (i.e., that minimizes the Description Length of the graph) 2 compresses the graph by using the best SUBSubsequent SUBs may be dened in terms of previously denedSUBs. Iterations of this basic step results in a lattice Emanuele Storti (UNIVPM) Mining Usage PatternsSAC 20126 / 21 16. SUBDUE (contd) 1 searches for the best SUB (i.e., that minimizes the Description Length of the graph) 2 compresses the graph by using the best SUBSubsequent SUBs may be dened in terms of previously denedSUBs. Iterations of this basic step results in a lattice Emanuele Storti (UNIVPM) Mining Usage PatternsSAC 20126 / 21 17. SUBDUE (contd) 1 searches for the best SUB (i.e., that minimizes the Description Length of the graph) 2 compresses the graph by using the best SUBSubsequent SUBs may be dened in terms of previously denedSUBs. Iterations of this basic step results in a lattice Emanuele Storti (UNIVPM) Mining Usage PatternsSAC 20126 / 21 18. SUBDUE (contd) 1 searches for the best SUB (i.e., that minimizes the Description Length of the graph) 2 compresses the graph by using the best SUBSubsequent SUBs may be dened in terms of previously denedSUBs. Iterations of this basic step results in a lattice Emanuele Storti (UNIVPM) Mining Usage PatternsSAC 20126 / 21 19. SUBDUE (contd) 1 searches for the best SUB (i.e., that minimizes the Description Length of the graph) 2 compresses the graph by using the best SUBSubsequent SUBs may be dened in terms of previously denedSUBs. Iterations of this basic step results in a lattice Emanuele Storti (UNIVPM) Mining Usage PatternsSAC 20126 / 21 20. SUBDUE (contd)Traditional cluster evaluation is not applicable to hierarchic domains interClusterDistance quality = intraClusterDistanceDesirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better dened concepts)New indexescompleteness: % of original nodes/edges still present in the nallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignicance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and applicationEmanuele Storti (UNIVPM) Mining Usage PatternsSAC 2012 7 / 21 21. SUBDUE (contd)Traditional cluster evaluation is not applicable to hierarchic domains interClusterDistance quality = intraClusterDistanceDesirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better dened concepts)New indexescompleteness: % of original nodes/edges still present in the nallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignicance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and applicationEmanuele Storti (UNIVPM) Mining Usage PatternsSAC 2012 7 / 21 22. SUBDUE (contd)Traditional cluster evaluation is not applicable to hierarchic domains interClusterDistance quality = intraClusterDistanceDesirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better dened concepts)New indexescompleteness: % of original nodes/edges still present in the nallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignicance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and applicationEmanuele Storti (UNIVPM) Mining Usage PatternsSAC 2012 7 / 21 23. SUBDUE (contd)Traditional cluster evaluation is not applicable to hierarchic domains interClusterDistance quality = intraClusterDistanceDesirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better dened concepts)New indexescompleteness: % of original nodes/edges still present in the nallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignicance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and applicationEmanuele Storti (UNIVPM) Mining Usage PatternsSAC 2012 7 / 21 24. RepresentationApproach 1 represe