
Proteome data integration characteristics and challenges


Page 1: Proteome data integration characteristics and challenges

Proteome data integration: characteristics and challenges

K. Belhajjame (1), R. Cote (4), S.M. Embury (1), H. Fan (2), C. Goble (1), H. Hermjakob, S.J. Hubbard (1), D. Jones (3), P. Jones (4), N. Martin (2), S. Oliver (1), C. Orengo (3), N.W. Paton (1), M. Pentony (3), A. Poulovassilis (2), J. Siepen, R.D. Stevens (1), C. Taylor (4), L. Zamboulis (2), and W. Zhu (4)

(1) University of Manchester

(2) Birkbeck College

(3) University College London

(4) European Bioinformatics Institute

Page 2: Proteome data integration characteristics and challenges

All Hands Meetings, 2005 2

Outline

Experimental proteomics

ISPIDER architecture

Example use cases

Conclusion

Page 3: Proteome data integration characteristics and challenges

Experimental proteomics

The study of the set of proteins produced by an organism with the aim of understanding their behaviour under varying conditions

An essential component for elucidation of the biological functions of proteins

[Figure: identification pipeline. Separation (2D gel electrophoresis) -> Protein digestion (enzymatic digestion) -> Mass spectrometry (MALDI-TOF) -> Identification (against a protein DB) -> Protein ID]

Page 4: Proteome data integration characteristics and challenges

Experimental proteomics

Development of new technologies for:

– protein separation (2D-SDS-PAGE, HPLC, Capillary Electrophoresis)

– mass spectrometry (Multi-Dimensional protein identification)

Availability of publicly accessible protein sequence databases

Proteomics databases (PedroDB, gpmDB, PepSeeker, PRIDE, …)

Building experiments involves orchestrating analysis services and processing and integrating data

Page 5: Proteome data integration characteristics and challenges

Objectives of ISPIDER

A Grid dedicated to the creation of bioinformatics experiments for proteomics

Develop new proteome databases and services, or make existing ones Grid-enabled

Develop middleware support for developing and executing new proteome analyses, based on distributed query processing and workflow technologies

Undertake proteomic studies that demonstrate the effectiveness of the resulting infrastructure

Page 6: Proteome data integration characteristics and challenges

Outline

Experimental proteomics

ISPIDER architecture

Example use cases

Conclusion and future directions

Page 7: Proteome data integration characteristics and challenges

ISPIDER

[Figure: ISPIDER architecture. Only the labels below are recoverable from the diagram.]

Layers: Existing E-Science Infrastructure; ISPIDER Proteomics Grid Infrastructure; ISPIDER Proteomics Clients; Public Proteomics Resources

Infrastructure components: Proteome Request Handler; Instance Ident/Mapping Services; Proteomic Ontologies/Vocabularies; Source Selection Services; Data Cleaning Services; myGrid Ontology Services; myGrid DQP; AutoMed; myGrid Workflows

Clients: Vanilla Query Client; 2D Gel Visualisation Client; + Aspergil. Extensions; + Phosph. Extensions; PPI Validation + Analysis Client; Protein ID Client

Resources, each exposed through a web service (WS), shown under the labels "Existing Resources" and "ISPIDER Resources": PS, PF, TR, GS, FA, PPI, PID, PRIDE, PEDRo, Phos

KEY: WS = Web services, GS = Genome sequence, TR = transcriptomic data, PS = protein structure, PF = protein family, FA = functional annotation, PPI = protein-protein interaction data, WP = Work Package

Page 8: Proteome data integration characteristics and challenges

Outline

Experimental proteomics

ISPIDER architecture

Example use cases

Conclusion and future directions

Page 9: Proteome data integration characteristics and challenges

Value-added protein datasets

Motivation

Protein identification experiments are usually used as input into further analysis processes.

– Gathering evidence for a biological hypothesis

– Suggesting new hypotheses

Objective

Augment the identification results with additional information on the identified protein

Implementation

Taverna workflow system

Page 10: Proteome data integration characteristics and challenges

Value-added protein datasets

PepMapper Web Service

GO Services

Auxiliary Services
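
The augmentation step named in the objective can be pictured with a small sketch. This is only an illustration under stated assumptions: `IdentifiedProtein`, `lookup_go_terms`, `augment` and the in-memory `TOY_GO_ANNOTATIONS` table are hypothetical names standing in for the identification results and the GO services, and the accessions and term assignments are placeholders rather than real annotations. It does not show how the Taverna workflow itself is wired.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IdentifiedProtein:
    """One hit from a protein identification run (hypothetical shape)."""
    accession: str
    score: float
    go_terms: List[str] = field(default_factory=list)


# Hypothetical stand-in for a GO annotation service: accession -> GO term IDs.
# Accessions and term assignments are placeholders, not real annotations.
TOY_GO_ANNOTATIONS: Dict[str, List[str]] = {
    "ACC_1": ["GO:0005524", "GO:0006468"],
    "ACC_2": ["GO:0003677"],
}


def lookup_go_terms(accession: str) -> List[str]:
    """Return GO term IDs for an accession, or an empty list if unknown."""
    return list(TOY_GO_ANNOTATIONS.get(accession, []))


def augment(results: List[IdentifiedProtein]) -> List[IdentifiedProtein]:
    """Return copies of the identification results with GO terms attached."""
    return [
        IdentifiedProtein(r.accession, r.score, lookup_go_terms(r.accession))
        for r in results
    ]


if __name__ == "__main__":
    hits = [IdentifiedProtein("ACC_1", 0.92), IdentifiedProtein("ACC_9", 0.40)]
    for hit in augment(hits):
        print(hit.accession, hit.score, hit.go_terms)
```

Proteins with no known annotation keep an empty term list, so downstream steps can tell "not annotated" apart from "not identified".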

Page 11: Proteome data integration characteristics and challenges

Genome-focused protein identification

Motivation

Currently, protein identification searches are performed over large data sets. This means fewer false negatives, but false positives are also more likely.

Objective

More focused and thus more efficient protein identification

Implementation

Taverna workflow system

DQP, a service-based query processor

Page 12: Proteome data integration characteristics and challenges

Genome-focused protein identification

[Figure: workflow. Recoverable labels: DQP Web Service, IPI, PepMapper web service, GOA Web Service]

select p.Name, p.Seq
from p in db_proteinSequences
where p.OS='HomoSapiens';
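
To make the effect of the query concrete, here is a rough Python equivalent of the same selection over an in-memory list. The field names `Name`, `Seq` and `OS` and the collection name `db_proteinSequences` come from the query; the `ProteinSequence` type, the `select_by_organism` helper and the toy records are assumptions made for illustration, and nothing here reflects how DQP actually evaluates the query.

```python
from typing import List, NamedTuple, Tuple


class ProteinSequence(NamedTuple):
    """Record with the three fields used by the query (hypothetical type)."""
    Name: str
    Seq: str
    OS: str  # organism


# Toy records; names and sequences are placeholders.
db_proteinSequences: List[ProteinSequence] = [
    ProteinSequence("prot_a", "MKTAYIAK", "HomoSapiens"),
    ProteinSequence("prot_b", "MSDNELRK", "MusMusculus"),
    ProteinSequence("prot_c", "MAHHPLWR", "HomoSapiens"),
]


def select_by_organism(
    db: List[ProteinSequence], organism: str
) -> List[Tuple[str, str]]:
    """Mirror of: select p.Name, p.Seq from p in db where p.OS = organism."""
    return [(p.Name, p.Seq) for p in db if p.OS == organism]


if __name__ == "__main__":
    for name, seq in select_by_organism(db_proteinSequences, "HomoSapiens"):
        print(name, seq)
```

Restricting the search database to one organism before peptide matching is what makes the identification "genome-focused": fewer candidate sequences means fewer chances of a spurious match.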

Page 13: Proteome data integration characteristics and challenges

Integrated access to proteome databases

Motivation

The ability to analyse existing proteomics results en masse is limited by heterogeneities between the schemas of the different databases

Objective

Providing integrated access to proteome databases through a common schema

Implementation

AutoMed, a framework for mapping heterogeneous schemata

DQP, a service-based query processor

Page 14: Proteome data integration characteristics and challenges

Integrated access to proteome databases

[Figure: integrated query architecture. Only the labels below are recoverable from the diagram.]

Data sources: PRIDE, PedroDB, gpmDB

Components: AutoMed Wrappers; AutoMed Repository; OGSA-DAI Activity (three instances); OGSA Distributed Query Processor; AutoMed Query Processor; AutoMed DQP Wrapper

Data flow labels: User query; Result; OQL query; OQL result
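
The idea of querying several databases through one common schema can be sketched as follows. This is a deliberately simplified illustration, not AutoMed's transformation model or API: the source names `source_a` and `source_b`, every field name, and the helpers `to_common` and `integrated_query` are hypothetical, and the real PRIDE, PedroDB and gpmDB schemas are not reproduced here.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Hypothetical per-source mappings: local field name -> common schema field.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "source_a": {"accession": "protein_accession", "peptide_seq": "peptide_sequence"},
    "source_b": {"prot_id": "protein_accession", "pep": "peptide_sequence"},
}


def to_common(source: str, record: Record) -> Record:
    """Rename a source record's fields into the common schema."""
    mapping = MAPPINGS[source]
    return {common: record[local] for local, common in mapping.items()}


def integrated_query(
    sources: Dict[str, List[Record]], predicate: Callable[[Record], bool]
) -> List[Record]:
    """Apply one predicate, written against the common schema, to all sources."""
    results: List[Record] = []
    for name, records in sources.items():
        for rec in records:
            common = to_common(name, rec)
            if predicate(common):
                results.append(common)
    return results


if __name__ == "__main__":
    toy_sources = {
        "source_a": [{"accession": "ACC_1", "peptide_seq": "AAK"}],
        "source_b": [
            {"prot_id": "ACC_1", "pep": "GGR"},
            {"prot_id": "ACC_2", "pep": "LLK"},
        ],
    }
    for row in integrated_query(
        toy_sources, lambda r: r["protein_accession"] == "ACC_1"
    ):
        print(row)
```

The point of the design is that the user's query mentions only common-schema fields; the per-source mappings absorb the schema differences.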

Page 15: Proteome data integration characteristics and challenges

Conclusions

+ Available e-science technologies provide rapid prototyping facilities for bioinformatics analyses

+ Combining such technologies is possible and opens up more possibilities

– Taverna + DQP

– AutoMed + DQP

- Writing custom code is usually required

– Processing service output to extract inputs for following services

– Transforming results between data formats

– Dealing with mismatches between identifiers

Developing a user-guided environment for the detection and resolution of mismatches

Development of Proteomics client applications (PepMapper, PepSeeker and PRIDE)
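
The kind of custom glue code mentioned in the conclusions (extracting inputs from one service's output and reconciling identifiers) might look like the sketch below. The output format (tab-separated lines, "#" comment lines, identifier in the first column) and the identifier variants handled (a pipe-delimited form, a ".version" suffix, mixed case) are assumptions chosen for illustration; `normalise_accession` and `extract_inputs` are hypothetical helpers, not part of any ISPIDER service.

```python
from typing import List


def normalise_accession(raw: str) -> str:
    """Reduce an identifier variant to a bare upper-case accession.

    Assumed variants: 'db|ACCESSION|name', 'ACCESSION.version', mixed case.
    """
    token = raw.strip()
    if "|" in token:
        parts = token.split("|")
        token = parts[1] if len(parts) > 1 else parts[0]
    token = token.split(".")[0]  # drop an assumed version suffix
    return token.upper()


def extract_inputs(service_output: str) -> List[str]:
    """Pull unique accessions from the first column of tab-separated output."""
    seen = set()
    accessions: List[str] = []
    for line in service_output.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        accession = normalise_accession(line.split("\t")[0])
        if accession not in seen:
            seen.add(accession)
            accessions.append(accession)
    return accessions


if __name__ == "__main__":
    toy_output = "# id\tscore\nsp|abc123|X\t0.9\nABC123.1\t0.8\n\ndef456\t0.7\n"
    print(extract_inputs(toy_output))
```

Small adapters of this sort are what a user-guided mismatch detection and resolution environment would aim to reduce.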