

Data provenance in biomedical discovery

Donald Dunbar

Queen’s Medical Research Institute

University of Edinburgh

Workshop on Principles of Provenance in Databases

May 21st 2008

Background

biomedical research

basic & clinical science

animal, cell models, patients

genes, proteins, pathways

data analysis & mining

publication

Biomedical discovery

• Looking for contribution to
  – human health and disease

• In-house experiments
  – data workflows
  – knowledge capture

• Use public databases
  – many data types
  – integration is a problem

Databases we use

sequence

structure

function

expression

domain specific

Data workflows

experiment 2

spreadsheet

raw data

calculations

publication

database

processed data

experiment 1

database

Data workflows

copy and paste

open from file

‘algorithm’

copy and paste

save to file

IN

OUT

BUT:

web services

automated tools & databases

bioinformatics workflows

Bioinformatics workflows

Is our field changing?

databases

experiments

knowledge

knowledgebase

Knowledge capture


What provenance do we need?

Example: Gene expression in a transgenic animal

gene annotation

gene expression measurements

public databases

output from machine

processing

integration

where, when

which identifiers, how

when, what, how

data mining: what genes did we select, and how

……
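The questions on this slide (where, when, what, how, which identifiers) can be read as the fields of a provenance record attached to each stage. The following is a minimal sketch of that reading, not a description of any tool used in the talk. The class name `ProvenanceStep`, the helper `trail_to_json`, and every example value are hypothetical.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List
import json


def _now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ProvenanceStep:
    stage: str        # e.g. "gene annotation", "integration", "data mining"
    what: str         # what was retrieved, measured, or selected
    how: str          # tool, query, or selection rule that was used
    where: str = ""   # source database or instrument, if there is one
    identifiers: List[str] = field(default_factory=list)  # IDs used for linking
    when: str = field(default_factory=_now)               # time of recording


def trail_to_json(steps: List[ProvenanceStep]) -> str:
    """Serialise an ordered list of steps so it can be stored with the results."""
    return json.dumps([asdict(s) for s in steps], indent=2)


# Hypothetical example values; none of these come from a real experiment.
trail = [
    ProvenanceStep("gene annotation", what="annotation for measured genes",
                   how="downloaded release file", where="public database"),
    ProvenanceStep("gene expression measurements", what="raw intensities",
                   how="output from machine", where="instrument"),
    ProvenanceStep("integration", what="joined annotation to measurements",
                   how="matched on gene identifier", identifiers=["GENE_ID_1"]),
    ProvenanceStep("data mining", what="selected genes",
                   how="fold change above a chosen threshold"),
]
print(trail_to_json(trail))
```

Keeping the trail as an ordered list means the "how did we select genes" step can always be traced back to the annotation and measurement steps that fed it.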

What provenance do we need?

Example: Curated protein database

expert data

database links

curator input

archive

contributor, date

verify, add, delete, modify

source, identifiers, dates

Curated database: versions, dates

development: schema & interface changes
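For the curated database, the items above (contributor, date, source, and the verify/add/delete/modify actions) suggest an append-only change log. The sketch below shows one way this could look using Python's standard `sqlite3` module. The table name `curation_log`, the function names, and the example entry identifier are assumptions made for illustration, not part of any database described in the talk.

```python
import sqlite3
from datetime import datetime, timezone

# The four curator actions named on the slide.
ACTIONS = ("verify", "add", "delete", "modify")


def open_log(path=":memory:"):
    """Open (or create) the log; the default is a throwaway in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS curation_log (
               id          INTEGER PRIMARY KEY,
               entry_id    TEXT NOT NULL,
               action      TEXT NOT NULL,
               contributor TEXT NOT NULL,
               source      TEXT,
               logged_at   TEXT NOT NULL
           )"""
    )
    return conn


def log_action(conn, entry_id, action, contributor, source=None):
    """Append one row; existing rows are never updated or deleted."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO curation_log "
            "(entry_id, action, contributor, source, logged_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (entry_id, action, contributor, source,
             datetime.now(timezone.utc).isoformat()),
        )


def history(conn, entry_id):
    """Return (action, contributor, source, logged_at) rows, oldest first."""
    return conn.execute(
        "SELECT action, contributor, source, logged_at "
        "FROM curation_log WHERE entry_id = ? ORDER BY id",
        (entry_id,),
    ).fetchall()


# Hypothetical usage.
conn = open_log()
log_action(conn, "PROT_0001", "add", "expert_a", source="expert data")
log_action(conn, "PROT_0001", "verify", "curator_b")
for row in history(conn, "PROT_0001"):
    print(row)
```

Because rows are only ever appended, the log doubles as an archive: the state of an entry at any date can be reconstructed by replaying its history up to that point.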

What do we do now (for provenance)?

• We trust the main data providers a lot!
  – a pragmatic approach

• We use tools and note the settings
  – rarely fully

• We put extra fields in our databases
  – source, modify date

• We deposit our data in public repositories
  – but only when we need to
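Two of the habits listed above, noting tool settings and adding source and modify-date fields, can be made routine by writing a small record next to every output file. The following is a minimal sketch under that assumption. The function names `build_record` and `write_sidecar`, the `.prov.json` suffix, and the example settings are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_record(content: bytes, source: str, tool: str, settings: dict) -> dict:
    """Describe one output: its checksum, its source, and the settings used."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),  # ties record to exact bytes
        "source": source,                               # where the input came from
        "tool": tool,                                   # what produced the output
        "settings": settings,                           # all settings, not a subset
        "modified": datetime.now(timezone.utc).isoformat(),
    }


def write_sidecar(data_path, source: str, tool: str, settings: dict) -> Path:
    """Write <file>.prov.json beside an existing data file and return its path."""
    data_path = Path(data_path)
    record = build_record(data_path.read_bytes(), source, tool, settings)
    sidecar = data_path.with_name(data_path.name + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar


# Hypothetical usage on in-memory content.
record = build_record(
    b"gene\tvalue\nGENE_ID_1\t2.5\n",
    source="public database",
    tool="normalisation script",
    settings={"method": "quantile", "log_base": 2},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing the checksum alongside the settings means a later reader can tell whether the file they hold is the one the record describes.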

What might we do next?

• Use workflow tools like Taverna
  – capture workflow provenance

• Build provenance tool & database
  – widely applicable

• Make provenance more visible to biologists
  – so they value and use it

Conclusions

• In biology we don’t do provenance well (yet)

• We use databases and manual workflows

• We implement rudimentary provenance

• We should build useful provenance tools

• We need to make provenance visible
