From Scientific Workflow Patterns to 5-star Linked Open Data Alban Gaignard Hala Skaf-Molli Audrey Bihouée Nantes Academic Hospital (CHU de Nantes), CNRS, France LINA, Nantes University, CNRS, France Institut du Thorax, Nantes University, INSERM, CNRS, France 8th USENIX workshop on Theory and Practice of Provenance (TaPP’16) Washington DC


From Scientific Workflow Patterns to 5-star Linked Open Data

Alban Gaignard, Hala Skaf-Molli, Audrey Bihouée

Nantes Academic Hospital (CHU de Nantes), CNRS, France

LINA, Nantes University, CNRS, France

Institut du Thorax, Nantes University, INSERM, CNRS, France

8th USENIX Workshop on Theory and Practice of Provenance (TaPP’16), Washington DC


Needs for linked experiment reports


TaPP ’16 - A. Gaignard - H. Skaf-Molli - A. Bihouée

Motivations: reusing (massive) RNA-seq data

TopHat: algorithm to align multiple sequence reads to a reference genome (known genes).

                1 sample      300 samples
Input data      2 x 17 Gb     10.2 Tb
1-core CPU      170 hours     5.9 years
32-core CPU     32 hours      14 months
Output data     12 Gb         3.6 Tb

Challenges

Algorithmic performance, storage, preservation, reuse (limit recomputation) and sharing.
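The 300-sample figures in the table can be approximated by scaling the single-sample figures linearly. The sketch below is only a back-of-the-envelope check: it assumes 1 Tb = 1000 Gb, 365-day years, and months of one twelfth of a year, none of which are stated on the slides. Under these assumptions the compute times come out slightly below the slide's 5.9 years and 14 months, which suggests the slide values were rounded up or used a different convention.

```python
# Linear extrapolation of the per-sample TopHat costs to 300 samples.
# Unit conventions below are assumptions, not taken from the slides.
SAMPLES = 300
GB_PER_TB = 1000
HOURS_PER_YEAR = 24 * 365
HOURS_PER_MONTH = HOURS_PER_YEAR / 12

input_tb = 2 * 17 * SAMPLES / GB_PER_TB              # two 17 Gb input files per sample
output_tb = 12 * SAMPLES / GB_PER_TB                 # 12 Gb of output per sample
one_core_years = 170 * SAMPLES / HOURS_PER_YEAR      # 170 h per sample on 1 core
multi_core_months = 32 * SAMPLES / HOURS_PER_MONTH   # 32 h per sample on 32 cores

print(f"input:   {input_tb:.1f} Tb")
print(f"output:  {output_tb:.1f} Tb")
print(f"1 core:  {one_core_years:.2f} years")
print(f"32 core: {multi_core_months:.2f} months")
```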


Motivations: reusing experiment results

Scientific experiment: RNA sequencing to quantify gene expression levels under multiple biological conditions.


Need for scientific context: metadata


Expected result: human- and machine-tractable reports

Annotated “Material & Methods”

Links to some workflow artifacts (algorithms, data)


5-star Linked Open Data


W3C standards for machine- and human-readable data on the web.

⭑⭑⭑⭑⭑: requires time and expertise!


How to ease this process?

● Workflow engines → automation

● PROV → workflow runs as linked data
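To make "workflow runs as linked data" concrete, here is a minimal sketch of one alignment step expressed with the W3C PROV-O terms `prov:Entity`, `prov:Activity`, `prov:used` and `prov:wasGeneratedBy`, serialized as N-Triples. The `http://example.org/run/` namespace and the file and step names are hypothetical; a real export would use the workflow engine's own identifiers and would normally go through an RDF library rather than string formatting.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/run/"  # hypothetical namespace for this sketch

# One workflow step: an alignment activity reads a file and produces another.
triples = [
    (EX + "reads.fastq", RDF_TYPE, PROV + "Entity"),
    (EX + "alignment", RDF_TYPE, PROV + "Activity"),
    (EX + "alignment", PROV + "used", EX + "reads.fastq"),
    (EX + "aligned.bam", RDF_TYPE, PROV + "Entity"),
    (EX + "aligned.bam", PROV + "wasGeneratedBy", EX + "alignment"),
]


def to_ntriples(statements):
    """Serialize (subject, predicate, object) IRI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in statements)


print(to_ntriples(triples))
```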


PROV only


too fine-grained

no domain concepts


Provenance as a Linked Experiment Report


few + meaningful statements


Problem statement & objectives

Problem statement

Scientific workflows produce massive raw results. Publishing them into curated, queryable linked data repositories requires a lot of time and expertise.

Can we exploit provenance traces to ease the publication of scientific results as Linked Data?


Objectives

(1) Leverage annotated workflow patterns to generate provenance mining rules.

(2) Refine provenance traces into linked experiment reports.


Rules generation


Approach


Input domain-specific annotations (❶,❷)

Workflow patterns ❶


Sequence patterns, possibly with intermediate steps

● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar

● EDAM ontology: hasFunction, RNA sequence, Genome map


Experiment report template ❷

Link scientific claims, statements, material and methods

● MicroPublication ontology: Material, Method, Claims

● Experimental factor ontology: Transcriptome, Gene expression

● NCBI taxonomy: Homo sapiens

● Open Annotation model: hasBody, hasTarget


PoeM: generating PrOvEnance Mining rules ❸

[Figure labels: SPARQL property path; SPARQL basic graph pattern; SPARQL CONSTRUCT query]
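As a rough illustration of how these three SPARQL building blocks could fit together, the sketch below assembles a CONSTRUCT query from a two-step sequence pattern. It is not PoeM's actual generator: the function `build_rule`, the `ex:` namespace (standing in for the report vocabulary), the `quantify` tool label, and the choice of matching tools on `rdfs:label` are all assumptions made for illustration. The property path `(prov:used/prov:wasGeneratedBy)+` walks back from the last step to the first, which tolerates intermediate steps.

```python
PREFIXES = (
    "PREFIX prov: <http://www.w3.org/ns/prov#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX ex:   <http://example.org/report#>\n"  # hypothetical vocabulary
)


def quote(text: str) -> str:
    """Render a Python string as a SPARQL string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_rule(first_tool: str, last_tool: str) -> str:
    """Build a CONSTRUCT query for the sequence pattern first_tool ... last_tool."""
    # <If> part: basic graph pattern plus a property path over the PROV trace.
    where = "\n".join([
        f"  ?first  rdfs:label {quote(first_tool)} ; prov:used ?input .",
        f"  ?last   rdfs:label {quote(last_tool)} .",
        "  ?output prov:wasGeneratedBy ?last .",
        "  ?last (prov:used/prov:wasGeneratedBy)+ ?first .",
    ])
    # <Then> part: the few report-level statements to produce.
    construct = "\n".join([
        "  ?input  a ex:Material .",
        "  ?last   a ex:Method .",
        "  ?output ex:producedFrom ?input .",
    ])
    return f"{PREFIXES}CONSTRUCT {{\n{construct}\n}} WHERE {{\n{where}\n}}"


rule = build_rule("TopHat", "quantify")
print(rule)
```

Matching on labels mirrors the syntactic matching between workflow patterns and PROV labels that the conclusion lists as a limitation.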


PoeM: sample generated rule ❸

[Figure: sample generated rule, with an <If> part and a <Then> part]


First experiments & results


Experiment context

SyMeTRIC: systems medicine project (2015-2017, call “Connect Talent”), funded by the French region Pays de la Loire.


Experiment

Material & methods

● Real-life RNA-seq workflow to study 3 mouse populations

● Workflow (WF) implemented in Galaxy, run on 2 biological samples

● PROV traces exported from Galaxy histories (API)


Results (for 1 biological sample)

● 60 h CPU (12 cores for genome alignment), 21 Gb storage

● 3 s to export 81 PROV triples from the Galaxy history

● 2 s to apply the rule and produce 35 MicroPublication triples
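Applying a rule to a trace, as in the last result, can be pictured with a small in-memory example. This is a sketch under strong assumptions: triples are plain tuples with shortened predicate names, the tool labels and file names are hypothetical, and the matching is hand-written Python, whereas the approach presented here relies on SPARQL rules evaluated over PROV traces. It only shows how a fine-grained trace can be condensed into a few report-level statements.

```python
def upstream(activity, used, generated_by):
    """Activities reachable through one or more used/wasGeneratedBy hops."""
    seen, stack = set(), [activity]
    while stack:
        current = stack.pop()
        for entity in used.get(current, ()):
            producer = generated_by.get(entity)
            if producer is not None and producer not in seen:
                seen.add(producer)
                stack.append(producer)
    return seen


def apply_rule(trace, first_tool, last_tool):
    """Match the sequence pattern first_tool ... last_tool and emit report triples."""
    used, generated_by, label = {}, {}, {}
    for s, p, o in trace:
        if p == "used":
            used.setdefault(s, set()).add(o)
        elif p == "wasGeneratedBy":
            generated_by[s] = o
        elif p == "label":
            label[s] = o
    report = set()
    for output, last in generated_by.items():
        if label.get(last) != last_tool:
            continue
        for first in upstream(last, used, generated_by):
            if label.get(first) != first_tool:
                continue
            for source in used.get(first, ()):
                report.add((source, "type", "Material"))
                report.add((last, "type", "Method"))
                report.add((output, "producedFrom", source))
    return report


# Hypothetical trace: alignment, an intermediate sort, then quantification.
trace = {
    ("align", "label", "TopHat"), ("align", "used", "reads"),
    ("bam", "wasGeneratedBy", "align"),
    ("sort", "label", "sort"), ("sort", "used", "bam"),
    ("sorted", "wasGeneratedBy", "sort"),
    ("quantify", "label", "quantify"), ("quantify", "used", "sorted"),
    ("counts", "wasGeneratedBy", "quantify"),
}
report = apply_rule(trace, "TopHat", "quantify")
print(len(trace), "trace triples ->", len(report), "report triples")
```

The intermediate sort step never appears in the report, which is the point of matching a sequence pattern that tolerates intermediate steps.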


Conclusion & perspectives


Semi-automated approach

(1) PoeM generates semantic web rules

(2) PoeM rules applied on PROV traces to assemble linked experiment reports (MicroPublication)

Limitations:

- Sequence workflow patterns only

- SPARQL property paths with complex WF patterns?

- Syntactic matching between WF patterns and PROV labels

Usage scenarios:

→ Query workflow datasets with domain concepts

→ Populate RDF repositories with WF results


Future work

(1) WF patterns: split-merge, “common motifs”

(2) Genericity: other domains / other reports (RO, Nanopub.)

(3) PROV heterogeneity: multi-systems PROV

(4) Evaluation: involving biologists, at larger scale


Questions?

Demo: http://poem.univ-nantes.fr

Contact: [email protected]

Acknowledgments

BiRD bioinformatics facility Connect Talent Call


Larger scale experiments (PROV traces)

1232 edges


49 edges