Personal Data Management • Why is this such an issue? Data Provenance • Representing links v Representing data • Identifying resources: Life Science Identifiers • Different types of provenance • Provenance generation • Provenance storage • Provenance retrieval


Personal Data Management

• Why is this such an issue?

Data Provenance

• Representing links v Representing data

• Identifying resources: Life Science Identifiers

• Different types of provenance

• Provenance generation

• Provenance storage

• Provenance retrieval

Problem

• Automated workflows produce lots of heterogeneous data

• These are just some of the results from one workflow run for Williams Disease

Amplification of results

One input

Many outputs

Link v Data Representation

• Data management questions refer to relationships rather than internal content
  – What are the origins of this data?
    • Which service produced this data?
    • Which data is this derived from?
    • Who was this data produced for?
    • What is this data telling me?

• Data analysis questions delegated to external services.

Representing links

• Identify each resource
  – Life Science Identifier (LSID): URI with associated data and metadata retrieval protocols.
  – Understanding that underlying data will not change

urn:lsid:taverna.sf.net:datathing:45fg6
urn:lsid:taverna.sf.net:datathing:23ty3
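The structure of these identifiers can be made concrete with a small parser. This is a minimal sketch, assuming the general LSID layout urn:lsid:authority:namespace:object with an optional trailing revision; the names Lsid and parse_lsid are illustrative and are not part of Taverna.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """The named parts of a Life Science Identifier."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.split(":")
    # The "urn:lsid" prefix is compared case-insensitively.
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not a valid LSID: {text!r}")
    if any(not p for p in parts[2:5]):
        raise ValueError(f"LSID has an empty component: {text!r}")
    return Lsid(*parts[2:])


example = parse_lsid("urn:lsid:taverna.sf.net:datathing:45fg6")
print(example.authority, example.namespace, example.object_id)
```

Here taverna.sf.net is the authority that issued the identifier, datathing is the namespace, and 45fg6 is the object within it.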

Representing links II

• Identify link type
  – Again use URI
  – Allows us to use RDF infrastructure
    • Repositories
    • Ontologies

[Figure: urn:lsid:taverna.sf.net:datathing:45fg6 and urn:lsid:taverna.sf.net:datathing:23ty3 joined by a link of type http://www.mygrid.org.uk/ontology#derived_from]
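Because both the resources and the link type are URIs, one link is simply an RDF triple. The sketch below writes it out in N-Triples syntax using plain string formatting. The direction of the link (45fg6 derived from 23ty3) is an assumption, since the slide does not state it, and to_ntriple is an illustrative helper rather than an existing API.

```python
DERIVED_FROM = "http://www.mygrid.org.uk/ontology#derived_from"


def to_ntriple(subject: str, predicate: str, obj: str) -> str:
    """Serialise one resource-to-resource link as an N-Triples statement."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# Assumed direction: the first data item was derived from the second.
link = to_ntriple(
    "urn:lsid:taverna.sf.net:datathing:45fg6",
    DERIVED_FROM,
    "urn:lsid:taverna.sf.net:datathing:23ty3",
)
print(link)
```

A statement in this form can be loaded into any RDF repository and interpreted against an ontology that defines derived_from.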

Provenance (1)

[Figure: diagram of provenance entities and the links between them]

• Entities: Workflow run, Workflow design, Experiment design, Project, Person, Organisation, Process, Service, Event, Data item

• Link types
  – data derivation, e.g. output data derived from input data
  – knowledge statements, e.g. similar protein sequence to
  – instanceOf
  – partOf
  – componentProcess, e.g. web service invocation of BLAST @ NCBI
  – componentEvent, e.g. completion of a web service invocation at 12.04pm
  – runBy, e.g. BLAST @ NCBI
  – run for

• Levels: Organisation level provenance, Process level provenance, Data/knowledge level provenance

User can add templates to each workflow process to determine links between data items.

Storing management metadata

• Automated generation of this web of links preferable

• Workflow enactor generates
  – LSIDs
  – Data derivation links
  – Knowledge links
  – Process links
  – Organisation links

As RDF
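One way to picture automated generation is a hook that the enactor calls after each service invocation. The hook mints an identifier for every output and records process and derivation links. This is a hypothetical sketch, not Taverna/Freefluo code: record_invocation, mint_lsid, the produced_by link type, the service identifier and the counter-based numbering are all assumptions made for illustration. Only derived_from appears on the slides.

```python
import itertools
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

ONTOLOGY = "http://www.mygrid.org.uk/ontology#"
DERIVED_FROM = ONTOLOGY + "derived_from"  # link type shown on the slides
PRODUCED_BY = ONTOLOGY + "produced_by"    # hypothetical process link type

_counter = itertools.count(1)


def mint_lsid(namespace: str = "datathing") -> str:
    """Return a fresh identifier; the numbering scheme is invented for this sketch."""
    return f"urn:lsid:taverna.sf.net:{namespace}:{next(_counter)}"


def record_invocation(
    service: str, inputs: Iterable[str], n_outputs: int
) -> Tuple[List[str], List[Triple]]:
    """Mint one LSID per output and link it to the service and to every input."""
    sources = list(inputs)
    outputs: List[str] = []
    links: List[Triple] = []
    for _ in range(n_outputs):
        out = mint_lsid()
        outputs.append(out)
        links.append((out, PRODUCED_BY, service))                   # process link
        links.extend((out, DERIVED_FROM, src) for src in sources)   # derivation links
    return outputs, links


# One input, many outputs: the service identifier below is made up.
outputs, links = record_invocation(
    "urn:lsid:taverna.sf.net:service:blast",
    ["urn:lsid:taverna.sf.net:datathing:23ty3"],
    n_outputs=3,
)
for triple in links:
    print(triple)
```

Knowledge and organisation links would be recorded the same way, as further triples attached to the same identifiers.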

Provenance generation

• Configuring and generating provenance within Taverna

Storage

• LSID has no protocol for storage

• Taverna/Freefluo implements its own data/metadata storage protocol

[Figure: Taverna/Freefluo passes data and metadata through a publish interface to the data store and the metadata store]

Retrieval

• LSID protocol used to retrieve data and metadata

• Query handled separately

[Figure: the metadata store and data store, accessed by an LSID-aware client through the LSID interface and by an RDF-aware client through a separate query interface]
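The split between publishing, identifier-based retrieval and querying can be sketched with an in-memory store. This is an illustrative model only: ProvenanceStore and its methods are hypothetical names, get_data and get_metadata loosely mirror the getData and getMetadata operations of LSID resolution, and the stored byte strings are invented sample data.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
DERIVED_FROM = "http://www.mygrid.org.uk/ontology#derived_from"


class ProvenanceStore:
    """Toy data store plus metadata store behind three separate interfaces."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._metadata: List[Triple] = []

    def publish(self, lsid: str, data: bytes, metadata: List[Triple]) -> None:
        """Publish interface: data behind an LSID may never change."""
        if self._data.get(lsid, data) != data:
            raise ValueError(f"data behind {lsid} must not change")
        self._data[lsid] = data
        self._metadata.extend(t for t in metadata if t not in self._metadata)

    def get_data(self, lsid: str) -> bytes:
        """LSID interface: return the bytes named by the identifier."""
        return self._data[lsid]

    def get_metadata(self, lsid: str) -> List[Triple]:
        """LSID interface: return the links whose subject is the identifier."""
        return [t for t in self._metadata if t[0] == lsid]

    def query(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> List[Triple]:
        """Query interface: match triples against a pattern; None is a wildcard."""
        pattern = (subject, predicate, obj)
        return [
            t for t in self._metadata
            if all(want is None or want == got for want, got in zip(pattern, t))
        ]


SOURCE = "urn:lsid:taverna.sf.net:datathing:23ty3"
RESULT = "urn:lsid:taverna.sf.net:datathing:45fg6"

store = ProvenanceStore()
store.publish(SOURCE, b"input sequence", [])
store.publish(RESULT, b"BLAST report", [(RESULT, DERIVED_FROM, SOURCE)])

# An RDF-aware client asks which data items were derived from SOURCE.
print(store.query(predicate=DERIVED_FROM, obj=SOURCE))
```

An LSID-aware client only needs get_data and get_metadata. Questions that span many resources, such as "what was derived from this input?", go through query instead.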

LSID launchpad

• Lightweight plug-in for Internet Explorer providing access to LSID data/metadata

• demo

Using IBM’s Haystack

[Figure: Haystack screenshot showing a GenBank record, a portion of the web of provenance, and a managed collection of sequences for review]