RightField: Semantic Enrichment of Systems Biology Data using Spreadsheets

Katy Wolstencroft
myGrid, SysMO-DB
University of Manchester

Outline

What RightField does
Origins: SysMO-DB project and data sharing in Systems Biology
How RightField works
Evaluation: how successfully it works
Extensions and future directions

RightField

A tool for embedding ranges of ontology terms into spreadsheets to allow the users of those spreadsheets to semantically annotate their data from simple drop-down lists

A tool for automatically extracting semantically annotated metadata from spreadsheets and producing RDF

RightField

Annotation benefits
  Makes annotation quicker and more efficient
  Standardises annotation
  Hides the ontology complexity from the users

RDF production benefits
  Querying over heterogeneous data files
  Semantic searching and reasoning
  Standard format for interoperability
  Hides semantic web tools from end users

Spreadsheets and web browsers

SEEK: Systems Biology Data Sharing

SysMO Consortium

Systems Biology of MicroOrganisms
Pan-European
  > 100 research groups
  > 320 scientists
Distributed, interdisciplinary projects
Expected to pool data and results and disseminate
Microbiologists, molecular biologists, biochemists, mathematicians... not many informaticians

The SEEK

A platform for Systems Biology data and models sharing
Web based environment for sharing within a consortium and disseminating to the community (an eLaboratory)
Standards compliant
Fitting in with laboratory practices

~ 1900 assets
  People: 350
  Investigations: 35
  Studies: 87
  Assays: 167
  Data sets: 930
  Models: 60
  SOPs: 140
  Publications: 165

Consortia using SEEK

SSFH
CISBIC
JenAge
SyBaCol
Rosage
YeastGlycolysis
Forsys

Types of data

Multiple omics
  genomics, transcriptomics
  proteomics, metabolomics
  fluxomics, reactomics
Images
Molecular biology
Reaction Kinetics
Models
  Metabolic, gene network, kinetic
Relationships between data sets/experiments
  Procedures, experiments, data, results and models
Analysis of data

Minimum Information Model

What is the least amount of information required to:
  Find
  Interpret
  Understand
  Reuse

Different for different data sets

CIMR: Core Information for Metabolomics Reporting
MIABE: Minimal Information About a Bioactive Entity
MIACA: Minimal Information About a Cellular Assay
MIAME: Minimum Information About a Microarray Experiment
MIAME/Env: MIAME / Environmental transcriptomic experiment
MIAME/Nutr: MIAME / Nutrigenomics
MIAME/Plant: MIAME / Plant transcriptomics
MIAME/Tox: MIAME / Toxicogenomics
MIAPA: Minimum Information About a Phylogenetic Analysis
MIAPAR: Minimum Information About a Protein Affinity Reagent
MIAPE: Minimum Information About a Proteomics Experiment
MIARE: Minimum Information About a RNAi Experiment
MIASE: Minimum Information About a Simulation Experiment

Not quite available “off the shelf”
  Loose guidelines or checklists
  Specific formats (generally in XML)
  Specific formats with associated ontologies

Remaining questions for the scientists:
  How do we generate standards compliant data?
  Which vocabularies/ontologies should I use?
  How do I know which ontology terms to use where?

Data, MIBBI Model and Ontologies

Microarray
  MIBBI model: MIAME (Minimum Information about a Microarray Experiment)
  Ontologies: MGED

Proteomics
  MIBBI model: MIAPE (Minimum Information about a Proteomics Experiment)
  Ontologies: PSI-MI, PSI-MS, PSI-MOD

Interaction experiments (Protein-Protein Interaction)
  MIBBI model: MIMIX (Minimum Information about a Molecular Interaction Experiment)
  Ontologies: PSI-MI

Systems Biology Models
  MIBBI model: MIRIAM (Minimal Information Required In the Annotation of biochemical Models)
  Ontologies: SBO (Systems Biology Ontology)

Systems Biology Model Simulation
  MIBBI model: MIASE (Minimum Information About a Simulation Experiment)
  Ontologies: KISAO (Kinetic Simulation Algorithm Ontology)

Data Templates and Vocabularies

[Diagram. Labels: Construction, Validation; SOP; Metabolomics, Mass Spec, Transcriptomics, Proteomics, Fluxomics; Investigations, Studies, Assays]

Fitting in with Laboratory practices

Scientists can continue to do what they have always done
Scientists remain in control
Embedding semantics into the tools already in use
Excel, Excel, Excel...

Ontology terms for marked-up cells in drop-down boxes

The End Result

RightField Architecture

Java: platform independent

OWL API: loading ontologies and reasoning

Apache POI HSSF libraries: loading and saving of Excel spreadsheets

Availability: open source, http://www.rightfield.org.uk

How RightField Works – part 1

[Diagram. Labels: Excel Workbook; Ontology, “Portion” of ontology terms; RightField Client; Terms Embedded into Excel Workbook; Marked-up workbook, saved in plain Excel; Informaticians/ontologists; End Users]

Loading Ontologies from BioPortal

Published ontologies

Multiple versions

You can also load local ontologies from file or URL

JERM = “Just Enough Results Model”
  What type of data is it?
    Microarray, growth curve, enzyme activity…
  What was measured?
    Gene expression, OD, metabolite concentration…
  What do the values in the datasets mean?
    Units, time series, repeats…

Excel workbook loaded into RightField with multiple worksheets

Class hierarchies of loaded ontologies. Multiple ontologies shown in separate tabs

Selected parent term from the ontology

Methods for specifying ontology terms

Term lists for selected cells

Value Type and Property

Excel workbook with marked-up cells

Marking-up Columns or Rows
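
The idea behind term lists for selected cells can be sketched in a few lines. This is only an illustration, not RightField's own code (RightField is written in Java on the OWL API and Apache POI). The toy hierarchy, the cell names and the functions direct_subclasses, all_subclasses, mark_up and is_valid are all assumptions made for this sketch.

```python
from typing import Dict, List

# Hypothetical toy class hierarchy: parent term -> direct subclasses.
# A real template would take this from a loaded ontology.
HIERARCHY: Dict[str, List[str]] = {
    "Assay type": ["Transcriptomics", "Proteomics", "Metabolomics", "Fluxomics"],
    "Transcriptomics": ["Microarray"],
    "Metabolomics": ["Mass Spec"],
}


def direct_subclasses(parent: str) -> List[str]:
    """Terms one level below the selected parent term."""
    return sorted(HIERARCHY.get(parent, []))


def all_subclasses(parent: str) -> List[str]:
    """Every term below the selected parent term, at any depth."""
    found = set()
    stack = list(HIERARCHY.get(parent, []))
    while stack:
        term = stack.pop()
        if term not in found:
            found.add(term)
            stack.extend(HIERARCHY.get(term, []))
    return sorted(found)


def mark_up(cells: List[str], terms: List[str]) -> Dict[str, List[str]]:
    """Attach the same list of allowed terms to each selected cell."""
    return {cell: list(terms) for cell in cells}


def is_valid(markup: Dict[str, List[str]], cell: str, value: str) -> bool:
    """A marked-up cell accepts only its listed terms; other cells accept anything."""
    allowed = markup.get(cell)
    return allowed is None or value in allowed


# Mark up a column of cells with the direct subclasses of "Assay type".
column = mark_up(["B2", "B3", "B4"], direct_subclasses("Assay type"))
```

The list attached to each cell plays the role of the drop-down box: the end user picks a term and never sees the hierarchy it came from.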

Ontology Languages

RDFS - RDF Schema

OBO - Open Biomedical Ontologies

OWL - Web Ontology Language

Provenance and Identifiers

Term Label: the human-readable term label

Term IRI: the (unique) term identifier

Ontology IRI: the ontology that defines the term

Ontology Version: the version of the ontology

Physical Location: the (web) location of the ontology

Ontology Information

Ontologies encapsulated
  Scientists can work offline
  Ensures same versions of ontologies used for a series of experiments
  No special macros or plugins required, just Excel or Open Office

Versions and URIs captured in hidden worksheets
  Provenance
  Comparisons between sheets
  Linking back to the vocabularies
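
As a rough illustration of the provenance fields listed above and of a comparison between sheets, the following sketch models one record per embedded term. The class name TermProvenance, the function version_mismatches and the example.org IRIs are assumptions for this example; the layout RightField actually uses in its hidden worksheets is not shown here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TermProvenance:
    """One embedded term plus where it came from (field names follow the slide)."""
    term_label: str         # the human-readable term label
    term_iri: str           # the (unique) term identifier
    ontology_iri: str       # the ontology that defines the term
    ontology_version: str   # the version of the ontology
    physical_location: str  # the (web) location of the ontology


def version_mismatches(sheet_a: Iterable[TermProvenance],
                       sheet_b: Iterable[TermProvenance]) -> List[str]:
    """Ontology IRIs used by both sheets but recorded with different versions."""
    versions_a = {p.ontology_iri: p.ontology_version for p in sheet_a}
    versions_b = {p.ontology_iri: p.ontology_version for p in sheet_b}
    shared = versions_a.keys() & versions_b.keys()
    return sorted(iri for iri in shared if versions_a[iri] != versions_b[iri])


# Hypothetical records: the same term captured from two ontology versions.
old = TermProvenance("Microarray", "http://example.org/onto#Microarray",
                     "http://example.org/onto", "1.0",
                     "http://example.org/downloads/onto-1.0.owl")
new = TermProvenance("Microarray", "http://example.org/onto#Microarray",
                     "http://example.org/onto", "1.1",
                     "http://example.org/downloads/onto-1.1.owl")
```

With records like these, version_mismatches([old], [new]) reports the ontology whose version differs between two spreadsheets in a series of experiments.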

[Diagram. Labels: Populate, Extract, RDF Graph, Store / Reuse]

Metadata Extraction and Querying

Generates RDF triples for each marked-up cell
Simple RDF, or conforming to ontology models
Storage and querying solutions
  Virtuoso triple store
Linked data compliance
  Already HTML and XML interface and REST API
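
A minimal sketch of what "RDF triples for each marked-up cell" could look like, written as N-Triples using only the Python standard library. The subject naming scheme (sheet IRI plus cell reference), the property IRI and the function names are assumptions for illustration; the slides do not specify the triples RightField emits.

```python
from typing import List

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text: str) -> str:
    """Escape backslashes, quotes and newlines for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def cell_to_triples(sheet_iri: str, cell: str, property_iri: str,
                    term_iri: str, label: str) -> List[str]:
    """Two N-Triples lines for one marked-up cell.

    Assumed scheme: the subject is the sheet IRI with the cell reference
    as a fragment; the object is the selected ontology term.
    """
    subject = f"<{sheet_iri}#{cell}>"
    return [
        f"{subject} <{property_iri}> <{term_iri}> .",
        f'<{term_iri}> <{RDFS_LABEL}> "{escape_literal(label)}" .',
    ]


# Hypothetical IRIs, for illustration only.
for line in cell_to_triples("http://example.org/data/sheet1", "B2",
                            "http://example.org/jerm#hasAssayType",
                            "http://example.org/onto#Microarray",
                            "Microarray"):
    print(line)
```

Lines in this form can be loaded into any triple store, which is what makes querying across many spreadsheets possible.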

Ontology Annotations and Properties

RightField Annotation Evaluation

Does RightField improve the quantity and consistency of data annotation?

Improvements in annotation consistency
  Assay type
  Technology type
  Experimental conditions
  Factors studied
  Organism and strains

RightField Annotation Evaluation

JERM Metadata Element Scores

Dataset ID   RightField Template   Pre-RightField Template
598          616                   244
599          319                   402
72           119                   85
868          203                   62
69           127                   88
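
The scores can be summarised with a little arithmetic. The sketch below assumes the columns read in the order the header gives them (dataset ID, RightField template score, pre-RightField template score); the function name summarise is made up for this example.

```python
from typing import Dict, List, Tuple

# dataset ID -> (RightField template score, pre-RightField template score),
# copied from the table above.
SCORES: Dict[int, Tuple[int, int]] = {
    598: (616, 244),
    599: (319, 402),
    72: (119, 85),
    868: (203, 62),
    69: (127, 88),
}


def summarise(scores: Dict[int, Tuple[int, int]]) -> Tuple[List[int], int, int]:
    """Return (datasets that scored higher with RightField, total RF, total pre-RF)."""
    improved = [ds for ds, (rf, pre) in scores.items() if rf > pre]
    total_rf = sum(rf for rf, _ in scores.values())
    total_pre = sum(pre for _, pre in scores.values())
    return improved, total_rf, total_pre


improved, total_rf, total_pre = summarise(SCORES)
print(improved)             # [598, 72, 868, 69]
print(total_rf, total_pre)  # 1384 881
```

Under that reading, four of the five datasets score higher with the RightField template, and dataset 599 scores lower (319 against 402).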

Metadata Extraction and Querying

No current ‘standard’ RDF format for MIBBI models (although it is in progress)

RDF vs ‘traditional’ relational approaches
  RDF more flexible in dealing with optional and changing metadata elements
  RDF allows aggregation between different types of experimental data, e.g. biological samples, experimental conditions
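
The flexibility argument can be illustrated with triples held as plain tuples. In this sketch a dataset simply has no triple for a metadata element it lacks, and a new element needs no schema change. The dataset names, property names and functions match and datasets_sharing are hypothetical; a real deployment would query a triple store rather than a Python list.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Two datasets of different types. Only dataset:2 records a unit.
TRIPLES: List[Triple] = [
    ("dataset:1", "hasAssayType", "Microarray"),
    ("dataset:1", "hasOrganism", "organism:A"),
    ("dataset:2", "hasAssayType", "Growth curve"),
    ("dataset:2", "hasOrganism", "organism:A"),
    ("dataset:2", "hasUnit", "OD"),
]


def match(triples: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def datasets_sharing(triples: List[Triple], prop: str, value: str) -> List[str]:
    """Aggregate across data types: all subjects with the given property value."""
    return sorted({s for s, _, _ in match(triples, p=prop, o=value)})


# Both datasets are found through the shared organism, whatever their type.
print(datasets_sharing(TRIPLES, "hasOrganism", "organism:A"))
```

In a relational design the optional unit would typically mean a nullable column or an extra table; here the missing element is just an absent triple.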

RightField LifeCycle

Future Work

Visualising nodes with large numbers of terms
Ontology label ambiguities
Linked Data output for SysMO SEEK and related resources

Other Work Using RightField

KupKB – Kidney and Urinary Pathway knowledge base (http://www.kupkb.org)

Knowledge bases for inflammatory bowel disease and Chagas disease

BioBanking sample annotation

Annotation of historical samples
  ‘Patient records’ for Egyptian mummies, Manchester Museum

RightField Extension: Populous

Generic tool for populating ontology templates
Supports validation at the point of data entry
Expressive Pattern language for OWL Ontology generation
Helps biologists with ontology design patterns

http://www.e-lico.eu/populous

Simon Jupp, Robert Stevens, University of Manchester

Summary

RightField-enabled spreadsheets show a marked increase in the consistency of annotation when compared with free text annotation or other template approaches.

Success from embedding and hiding semantics and complexity

Acknowledgements

Stuart Owen
Katy Wolstencroft
Carole Goble
Wolfgang Mueller
Olga Krebs
Matthew Horridge

http://www.sysmo-db.org