Paolo Missier, Norman Paton, Khalid Belhajjame
Information Management Group, School of Computer Science, University of Manchester, UK
EDBT Conference, Lausanne, Switzerland, March 2010
Fine-grained and efficient lineage querying of collection-based workflow provenance

Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance


Missier, P., Paton, N., & Belhajjame, K. (2010). Fine-grained and efficient lineage querying of collection-based workflow provenance. Procs. EDBT. Lausanne, Switzerland.



Page 2: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

EDBT, Lausanne, March 2010

Problem statement and outline

• Setting:
  – Black box provenance of workflow data products
• Fine-grained provenance:
  – tracking provenance through collections: motivation
  – functional model of collection-oriented workflow processing
• Efficient query processing:
  – leveraging the functional model to achieve efficient processing for a simple query model
• Experimental evaluation

Page 3: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Computing provenance through a graph

• Provenance graph is an unfolding of the workflow graph structure
  – large: grows with size of input
  – lineage queries involve graph traversal

[Figure] From: Z. Bao, S. Cohen-Boulakia, S. Davidson, A. Eyal, and S. Khanna, "Differencing Provenance in Scientific Workflows," Procs. ICDE, 2009.

Page 4: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Main result

• Query the provenance of individual collection elements
• But, avoid computing transitive closures on the provenance graph
  – potentially very large
• Traverse the workflow graph instead (much smaller)
• This results in substantial performance improvement for typical queries

[Figure: two panels, "Provenance graph" and "Workflow graph". Workflow graph: processors Q and R (ports X, Y) and processor P (input ports X1, X2, X3; output port Y). Provenance graph: values a1 ... an, b1 ... bm, v1 ... vn, w and y11 ... ymn, with y = [ [y11 ... y1n], ... [ym1 ... ymn] ]. Later builds of the slide add the index annotations [1][n], [] and [][1].]

Page 7: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Workflow as data integrator

[Figure: workflow diagram; later builds label three stages: QTL genomic regions, genes in QTL, metabolic pathways (KEGG).]

Page 10: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Motivation for fine-grained provenance

List-structured KEGG gene ids:

[ [ mmu:26416 ], [ mmu:328788 ] ]

[ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]

[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling, path:mmu04620 Toll-like receptor, ...] ]

[Figure labels: geneIDs, pathways. Later builds of the slide add element-level markers on the lists.]

Page 15: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Problem statement and outline

• Setting:
  – Black box provenance of workflow data products
• Fine-grained provenance:
  – tracking provenance through collection elements
  – motivation
  ➡ functional model of collection-oriented workflow processing

Page 16: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Functional model for collection processing /1

① Simple processing: service expects atomic values, receives atomic values

[Figure: processor P with input ports X1, X2, X3 receiving v1, v2, v3 and output ports Y1, Y2 producing w1, w2.]

② Simple iteration: service expects atomic values, receives input list

v = [v1 ... vn]
w = [w1 ... wn]
dd = 0, ad = 1, δ = 1
lineage(wi) = vi

[Figure: P applied to v = [v1 ... vn] on port X unfolds (➠) into instances P1 ... Pn, each taking vi on X and producing wi on Y.]

③ Extension: service expects atomic values, receives input nested list

v = [[...], ...[...]]
w = [[...], ...[...]]
dd = 0, ad = 2, δ = 2
lineage(wij) = vij
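The cases above differ only in the mismatch between the depth a port expects and the depth of the value it receives. A minimal Python sketch of that bookkeeping follows. The names `actual_depth` and `delta` are illustrative, and reading dd as the depth declared on the port and ad as the depth of the value actually received is an assumption that matches the numbers on the slide.

```python
def actual_depth(value):
    """Nesting depth of a value: 0 for an atom, 1 for a flat list, and so on.

    Assumes homogeneous nesting, so only the first element is inspected;
    an empty list is treated as depth 1.
    """
    depth = 0
    while isinstance(value, list):
        depth += 1
        if not value:
            break
        value = value[0]
    return depth


def delta(declared_depth, value):
    """Depth mismatch: actual depth of the value minus the declared depth."""
    return actual_depth(value) - declared_depth


# The three cases on the slide, all with declared depth dd = 0.
print(delta(0, "v1"))                        # atomic input: no iteration
print(delta(0, ["v1", "v2", "v3"]))          # flat list: one level of iteration
print(delta(0, [["v11", "v12"], ["v21"]]))   # nested list: two levels
```

A mismatch of 0 means the service is called once; each additional level adds one level of implicit iteration.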

Page 21: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Functional model /2

v = [[...], ...[...]]
w = [[...], ...[...]] (depth = n-m)
dd = m, ad = n, δ = n-m ≥ 0

The simple iteration model generalises by induction to a generic δ = n-m.

This leads to a recursive functional formulation for simple collection processing:

\[
v = \langle a_1 \ldots a_n \rangle
\]

\[
(\mathit{eval}_l\ P\ v) =
\begin{cases}
(P\ v) & \text{if } l = 0 \\
(\mathit{map}\ (\mathit{eval}_{l-1}\ P)\ v) & \text{if } l > 0
\end{cases}
\]
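The recursion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Taverna implementation: `eval_l` follows the two cases of the definition, while `eval_with_lineage` is a hypothetical variant that also records which input index produced each output index. Indices are 1-based tuples, matching the index notation used on the slides.

```python
def eval_l(level, processor, value):
    """Apply `processor` at nesting depth `level` of `value`.

    level == 0: call the processor on the value itself.
    level  > 0: map eval_(level-1) over the elements of the list.
    """
    if level == 0:
        return processor(value)
    return [eval_l(level - 1, processor, element) for element in value]


def eval_with_lineage(level, processor, value, index=()):
    """Same recursion, also returning (output index, input index) pairs."""
    if level == 0:
        return processor(value), [(index, index)]
    outputs, pairs = [], []
    for position, element in enumerate(value, start=1):
        out, sub_pairs = eval_with_lineage(
            level - 1, processor, element, index + (position,)
        )
        outputs.append(out)
        pairs.extend(sub_pairs)
    return outputs, pairs


# str.upper stands in for a service that expects one atomic value.
genes = [["mmu:26416"], ["mmu:328788"]]
print(eval_l(2, str.upper, genes))
print(eval_with_lineage(2, str.upper, genes)[1])
```

For a single input port the output index and the input index coincide, which is the lineage(wij) = vij rule.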

Page 23: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Functional model - multiple inputs /3

[Figure: processor P with input ports X1, X2, X3 and output port Y.]

v1 = [v11 ... v1n]
v2 = [v21 ... v2k]
v3 = [v31 ... v3m]
w = [ [w11 ... w1n], ... [wm1 ...wmn] ]

dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1

Cross-product involving v1 and v3 (but not v2, since δ2 = 0):

v1 ⊗ v3 = [ [ <v1i, v3j> | j:1..m ] | i:1..n ] // cross product

and including v2: [ [ <v1i, v2, v3j> | j:1..m ] | i:1..n ]

lineage(wij) = < v1i, v2, v3j >
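A small Python sketch of this three-port case, assuming a processor that iterates over its first and third inputs and receives the second input whole. The function name `run_with_lineage` and the stand-in processor are hypothetical; the port names X1, X2, X3 come from the slide.

```python
def run_with_lineage(processor, v1, v2, v3):
    """Iterate over v1 and v3 (mismatch 1 each); pass v2 whole (mismatch 0).

    Returns the nested output list and, for each output index (i, j),
    the index of the contributing value on every input port.
    Indices are 1-based; the empty tuple means "the whole value".
    """
    outputs, lineage = [], {}
    for i, a in enumerate(v1, start=1):
        row = []
        for j, c in enumerate(v3, start=1):
            row.append(processor(a, v2, c))
            lineage[(i, j)] = {"X1": (i,), "X2": (), "X3": (j,)}
        outputs.append(row)
    return outputs, lineage


def stand_in(a, whole_list, c):
    """Stand-in for a black-box service."""
    return f"{a}{c}/{len(whole_list)}"


w, lineage = run_with_lineage(stand_in, ["a", "b"], ["k1", "k2"], ["x", "y", "z"])
print(w[0][2])          # output at index (1, 3)
print(lineage[(1, 3)])  # which input elements it depends on
```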

Page 26: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Generalised cross product

Binary product, δ = 1:

\[
a \otimes b = [\,[\langle a_i, b_j \rangle \mid b_j \leftarrow b] \mid a_i \leftarrow a\,]
\]

\[
(\mathit{eval}_2\ P\ \langle a, b \rangle) = (\mathit{map}\ (\mathit{eval}_1\ P)\ a \otimes b)
\]

Generalized to arbitrary depths:

\[
(v, d_1) \otimes (w, d_2) =
\begin{cases}
[\,[(v_i, w_j) \mid w_j \leftarrow w] \mid v_i \leftarrow v\,] & \text{if } d_1 > 0,\ d_2 > 0 \\
[\,(v_i, w) \mid v_i \leftarrow v\,] & \text{if } d_1 > 0,\ d_2 = 0 \\
[\,(v, w_j) \mid w_j \leftarrow w\,] & \text{if } d_1 = 0,\ d_2 > 0 \\
(v, w) & \text{if } d_1 = 0,\ d_2 = 0
\end{cases}
\]

...and to n operands:

\[
\bigotimes_{i:1 \ldots n} (v_i, d_i)
\]

Finally: general functional semantics for collection-based processing

\[
(\mathit{eval}_l\ P\ \langle (v_1, d_1), \ldots, (v_n, d_n) \rangle) =
\begin{cases}
(P\ \langle v_1, \ldots, v_n \rangle) & \text{if } l = 0 \\
(\mathit{map}\ (\mathit{eval}_{l-1}\ P)\ \bigotimes_{i:1 \ldots n} \langle v_i, d_i \rangle) & \text{if } l > 0
\end{cases}
\]
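The product can be sketched in Python as follows. `cross2` mirrors the four cases of the binary definition. `cross_n` and `eval_general` are a simplified illustration that assumes every depth mismatch is 0 or 1 and flattens the recursion into "build the product, then apply P at the leaves"; all function names are hypothetical.

```python
def cross2(left, right):
    """Binary generalised cross product of two (value, mismatch) pairs."""
    (v, d1), (w, d2) = left, right
    if d1 > 0 and d2 > 0:
        return [[(vi, wj) for wj in w] for vi in v]
    if d1 > 0:
        return [(vi, w) for vi in v]
    if d2 > 0:
        return [(v, wj) for wj in w]
    return (v, w)


def cross_n(operands):
    """n-ary product: operands with mismatch > 0 are iterated, left to
    right, one nesting level each; operands with mismatch 0 pass whole."""
    def build(position, chosen):
        if position == len(operands):
            return tuple(chosen)
        value, mismatch = operands[position]
        if mismatch > 0:
            return [build(position + 1, chosen + [e]) for e in value]
        return build(position + 1, chosen + [value])
    return build(0, [])


def apply_at_depth(level, processor, product):
    """Call the processor on every argument tuple found at depth `level`."""
    if level == 0:
        return processor(*product)
    return [apply_at_depth(level - 1, processor, item) for item in product]


def eval_general(processor, operands):
    """One iteration level per operand with a positive mismatch."""
    level = sum(1 for _, mismatch in operands if mismatch > 0)
    return apply_at_depth(level, processor, cross_n(operands))


operands = [(["a", "b"], 1), (["k1", "k2"], 0), (["x", "y"], 1)]
print(cross2((["a", "b"], 1), ("k", 0)))
print(eval_general(lambda a, ks, c: f"{a}{c}{len(ks)}", operands))
```

Mismatches greater than 1 on a single operand are deliberately left out of this sketch.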

Page 29: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Static mapping of output to input values

[Figure: processor P with input ports X1, X2, X3 and output port Y; later builds annotate the ports with the pairs (0,1), (0,1), (1,1) and (0,2).]

The iteration structure can be determined statically:

dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1

This leads to a simple mapping rule:

index of an output list value ➔ {index of input values}

Y[i.j] → X1[i], X2[], X3[j]

In general, the output index [i1 . i2 . ... . ik] is split into consecutive fragments of lengths δ1, δ2, ..., δk, one for each input port X1, X2, ..., Xk.
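The mapping rule needs only the per-port mismatches, not the data. A minimal Python sketch, assuming the ports are given in their cross-product order; the function name `split_output_index` is hypothetical.

```python
def split_output_index(output_index, port_deltas):
    """Map an output index to one index fragment per input port.

    port_deltas is an ordered list of (port name, depth mismatch) pairs.
    The output index is consumed left to right, `mismatch` positions per
    port; a port with mismatch 0 receives the empty index, meaning the
    whole input value.
    """
    expected = sum(mismatch for _, mismatch in port_deltas)
    if len(output_index) != expected:
        raise ValueError(
            f"index of length {len(output_index)} given, {expected} expected"
        )
    mapping, start = {}, 0
    for port, mismatch in port_deltas:
        mapping[port] = tuple(output_index[start:start + mismatch])
        start += mismatch
    return mapping


# Y[2.5] with the mismatches from the slide: δ1 = 1, δ2 = 0, δ3 = 1
print(split_output_index((2, 5), [("X1", 1), ("X2", 0), ("X3", 1)]))
```

Because the mismatches are known from the workflow specification, this mapping can be computed before any run takes place.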

Page 37: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Workflow graph as index for query processing

1. traverse the workflow graph (small) rather than the provenance trace (large)
2. use static prediction of iterations to trace through collection elements
3. “parachute” into the actual trace only at the end

[Figure: the "Provenance graph" and "Workflow graph" panels from the main result slide, with y = [ [y11 ... y1n], ... [ym1 ... ymn] ]. Later builds of the slide add the index annotations [1][n], [] and [][1] to the workflow graph.]
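The three steps can be sketched in Python over a toy workflow. Everything in this example is an assumption made for illustration: the topology (Q feeds P on X1, R feeds P on X3), the mismatches, the dictionary layout and the contents of `TRACE` are hypothetical, and the code is not the Taverna implementation.

```python
# Ordered input ports of each processor with their depth mismatch.
WORKFLOW = {
    "Q": [("X", 1)],
    "R": [("X", 0)],
    "P": [("X1", 1), ("X2", 0), ("X3", 1)],
}
# (processor, input port) -> upstream processor feeding that port.
UPSTREAM = {("P", "X1"): "Q", ("P", "X3"): "R"}
# Toy provenance trace, keyed by (processor, port, index).
TRACE = {("Q", "X", (2,)): "a2", ("R", "X", ()): ["b1", "b2", "b3"]}


def lineage_targets(processor, output_index, focus):
    """Steps 1 and 2: walk the workflow graph upstream, splitting the
    index at every processor, and collect (processor, port, index)
    triples at the focus processors. The trace is not touched here."""
    results = []
    stack = [(processor, tuple(output_index))]
    while stack:
        name, index = stack.pop()
        start = 0
        for port, mismatch in WORKFLOW[name]:
            fragment = index[start:start + mismatch]
            start += mismatch
            if name in focus:
                results.append((name, port, fragment))
            source = UPSTREAM.get((name, port))
            if source is not None:
                stack.append((source, fragment))
        # Index positions left over are dropped: the processor is a
        # black box below its iteration level.
    return sorted(results)


def query(processor, output_index, focus):
    """Step 3: look up the actual values in the trace only at the end."""
    targets = lineage_targets(processor, output_index, focus)
    return {target: TRACE.get(target) for target in targets}


print(lineage_targets("P", (2, 3), focus={"Q", "R"}))
print(query("P", (2, 3), focus={"Q", "R"}))
```

The traversal cost depends on the number of processors, while the trace is consulted once per reported binding.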

Page 40: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

From granularity to efficient query processing

Summary so far:

• whenever iterations are involved, we can trace the provenance of individual elements of a processor’s output

• iterations are explained in terms of a functional model and based on list depth discrepancies

• The relationships between output and input indexes are derived using the workflow specification graph (statically)

How about expressivity and efficient processing of lineage queries?

Page 41: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Lineage query model /1

I - Focusing:
Not all processors are interesting:
  – report lineage only at specified nodes in the graph

Page 42: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Lineage query model /2

List-structured KEGG gene ids:

[ [ mmu:26416 ], [ mmu:328788 ] ]

[ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]

[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling, path:mmu04620 Toll-like receptor, ...] ]

II - Granularity:
Trace lineage for individual elements within collections - when possible!

Page 43: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Lineage query model and language

<pquery>
  <scope workflow="keggPathways">
    <run id="ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3"/>
  </scope>
  <select>
    <outputPort name="paths_per_gene" index="[1,2]"/>
    <outputPort name="paths_per_gene" index="[3,4]"/>
    <outputPort name="commonPathways" index="[1]"/>
    <processor name="getPathwayDescriptions">
      <outputPort name="return"/>
    </processor>
  </select>
  <focus>
    <processor name="get_pathway_by_genes" />
  </focus>
</pquery>

• scope: workflow scope; optionally specifies one or more runs for the target workflow (defaults to latest run)
• select: port values for which lineage is sought: global outputs or processor-qualified
• focus: processors where lineage is to be reported, possibly workflow-qualified
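To make the structure of such a query concrete, the sketch below reads it with Python's standard `xml.etree.ElementTree`. The element and attribute names are taken from the slide; the functions `parse_pquery` and `parse_index` and the shape of the returned dictionary are assumptions for illustration and are not part of Taverna.

```python
import json
import xml.etree.ElementTree as ET

PQUERY = """
<pquery>
  <scope workflow="keggPathways">
    <run id="ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3"/>
  </scope>
  <select>
    <outputPort name="paths_per_gene" index="[1,2]"/>
    <outputPort name="paths_per_gene" index="[3,4]"/>
    <outputPort name="commonPathways" index="[1]"/>
    <processor name="getPathwayDescriptions">
      <outputPort name="return"/>
    </processor>
  </select>
  <focus>
    <processor name="get_pathway_by_genes" />
  </focus>
</pquery>
"""


def parse_index(text):
    """Turn an index attribute such as "[1,2]" into (1, 2); a missing
    attribute becomes the empty index, meaning the whole value."""
    return tuple(json.loads(text)) if text else ()


def parse_pquery(text):
    """Extract scope, selected ports and focus processors from a query."""
    root = ET.fromstring(text)
    scope = root.find("scope")
    selected = []
    for child in root.find("select"):
        if child.tag == "outputPort":
            # Global workflow output: no owning processor.
            selected.append((None, child.get("name"), parse_index(child.get("index"))))
        elif child.tag == "processor":
            for port in child.findall("outputPort"):
                selected.append(
                    (child.get("name"), port.get("name"), parse_index(port.get("index")))
                )
    return {
        "workflow": scope.get("workflow"),
        "runs": [run.get("id") for run in scope.findall("run")],
        "select": selected,
        "focus": [p.get("name") for p in root.find("focus").findall("processor")],
    }


parsed = parse_pquery(PQUERY)
print(parsed["select"])
print(parsed["focus"])
```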

Page 44: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Fine granularity + efficient processing

• Scalability:
  – query time depends on size of workflow graph, not size of provenance graph
  – workflow graphs are small, fit in memory, can be indexed easily
• Graceful degradation:
  – worst case is a completely unfocused query
  – no worse than other approaches
• Fine-grain answers provided at the same time

Page 45: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Outline

• Assumption:
  – Black box provenance of workflow data products
✓ Fine-grained provenance:
  – tracking provenance through collection elements
  – motivation, functional model of collection-oriented workflow processing
✓ Efficient query processing:
  ✓ leveraging the functional model to achieve efficient processing for a simple query model
➡ Experimental evaluation

Page 46: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Experimental setup - I

• Performance evaluation performed on programmatically generated dataflows
  – the “T-towers”

parameters:
- size of the lists involved
- length of the paths
- includes one cross product

Page 47: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Experimental results - II

[Plot: workflow pre-processing time by graph size. x-axis: path length l (10, 28, 50, 75, 100, 150, 200); y-axis: time (ms). Point labels: 124, 256, 440, 700, 1000, 2024, 3257.]

[Plot: response times for PP on unfocused queries (l=150); performance degradation on fully unfocused queries. x-axis: % of processors in target set (1.33% to 46.67%); y-axis: time (ms).]

[Plots: two panels, d=10 (10 elements in input list) and d=150 (150 elements in input list). x-axis: path length l (10 to 150); y-axis: time (ms). Series: NI (labelled NR in the d=150 panel), NGQ, PP. Legend: naive traversal of provenance graph; single multi-join query; workflow as index (our approach).]

Page 48: Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow provenance

Summary

• A simple lineage query model for Taverna
  – grounded in the semantics of collection-oriented processing
  – combines fine-grain answers with efficient query processing
• Ongoing work:
  – space compression, indexing
  – QLP?
  – semantic provenance (initial paper submitted)
• Currently part of the Taverna 2.1 release