STRING: Prediction of a functional association network for the yeast mitochondrial system

Lars Juhl Jensen, EMBL Heidelberg

Overview

• Prediction of functional associations between proteins
– What is STRING?
– Genomic context methods
– Integration of large-scale experimental data
– Combination and cross-species transfer of evidence

• (Coffee break)

• The yeast mitochondrial system
– Prediction of mitochondrial proteins
– A functional association network for mitochondria
– Mapping and correlating features of mitochondrial proteins

Part 1: Prediction of functional association between proteins


What is STRING?

Genomic neighborhood

Species co-occurrence

Gene fusions

Database imports

Exp. interaction data

Microarray expression data

Literature co-mentioning

Let the data speak for themselves ...

• Classification schemes are obviously difficult to predict if they are not supported by the data; there are no obvious features separating:
– Presidents vs. non-presidents
– Actors vs. non-actors

• Unsupervised methods may discover a more meaningful classification:
– Holding your pinky to your mouth is a clear sign of evil
– Wearing a bowtie is a sign of good
– So is consumption of alcoholic drinks

Trends in Microbiology

Inferring functional modules from gene presence/absence patterns

Figure: the “Cellulosome”. Labels: cell, cell wall, anchoring proteins, cellulosomes, cellulose, resting protuberances, protracted protuberance. © Trends Microbiol, 1999

Genomic context methods

© Nature Biotechnology, 2004

Score calibration against a common reference

• Many diverse types of evidence
– The quality of each is judged by very different raw scores
– Quality differences exist among data sets of the same type

• Solved by calibrating all scores against a common reference
– Scores are directly comparable
– Probabilistic scores allow evidence to be combined

• Requirements for the reference
– Must represent a compromise of all the types of evidence
– Broad species coverage
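The calibration and combination steps above can be sketched as follows. This is a minimal illustration, not the STRING implementation: the binning scheme, the function names `calibrate` and `combine`, and the independence assumption behind the combination formula are all assumptions made here for clarity.

```python
from bisect import bisect_right


def calibrate(raw_scores, same_pathway, bin_edges):
    """Map raw scores to probabilities via a reference set (hypothetical scheme).

    For each bin of raw scores, return the fraction of protein pairs that
    share a pathway in the reference. Empty bins yield None.
    """
    n_bins = len(bin_edges) + 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for score, positive in zip(raw_scores, same_pathway):
        b = bisect_right(bin_edges, score)  # index of the bin holding this score
        totals[b] += 1
        if positive:
            hits[b] += 1
    return [h / t if t else None for h, t in zip(hits, totals)]


def combine(probabilities):
    """Combine probabilistic scores, assuming independent evidence: 1 - prod(1 - p)."""
    remaining = 1.0
    for p in probabilities:
        remaining *= 1.0 - p
    return 1.0 - remaining
```

Because every evidence type is expressed on the same probability scale after calibration, `combine` can mix, for example, a genomic context score with an experimental one without any type-specific weighting.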

Integrating physical interaction screens

• Make binary representation of complexes
• Yeast two-hybrid data sets are inherently binary
• Calculate score from number of (co-)occurrences
• Calculate score from non-shared partners
• Calibrate against KEGG maps
• Infer associations in other species
• Combine evidence from experiments
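The first steps of this pipeline, turning purified complexes into binary pairs and scoring them from (co-)occurrence counts, might look like the sketch below. The Jaccard-style ratio in `cooccurrence_score` is a stand-in chosen for illustration; the slides do not give the actual raw score formula, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(purifications):
    """Binary representation: count how often each protein pair is co-purified."""
    counts = Counter()
    for members in purifications:
        for a, b in combinations(sorted(set(members)), 2):
            counts[(a, b)] += 1
    return counts


def protein_counts(purifications):
    """Count in how many purifications each protein occurs."""
    counts = Counter()
    for members in purifications:
        counts.update(set(members))
    return counts


def cooccurrence_score(pair, pairs, proteins):
    """Illustrative raw score: co-occurrences divided by purifications with either protein."""
    a, b = pair
    together = pairs[pair]
    return together / (proteins[a] + proteins[b] - together)
```

A raw score of this kind is not yet a probability; it would still need to be calibrated against the reference before being combined with other evidence.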

Mining microarray expression databases

• Re-normalize arrays by modern method to remove biases
• Build expression matrix
• Combine similar arrays by PCA
• Construct predictor by Gaussian kernel density estimation
• Calibrate against KEGG maps
• Infer associations in other species
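As a rough sketch of the kernel density step, the code below estimates the distribution of a similarity measure (for example an expression correlation) separately for reference pairs that share a pathway and for pairs that do not, then turns the two densities into a probability of association. The fixed bandwidth, the prior, and the function names are assumptions for illustration; the slides do not specify these details.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels centred on the samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def association_probability(x, pos_density, neg_density, prior):
    """Bayes' rule: probability that a pair with similarity x is functionally associated."""
    p = prior * pos_density(x)
    n = (1.0 - prior) * neg_density(x)
    return p / (p + n)
```

In practice the bandwidth would be chosen from the data rather than fixed by hand, and the prior would reflect how rare true associations are among all protein pairs.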

Figure: evidence transfer from a source species to a target species

Evidence transfer based on “fuzzy orthology”

• Orthology transfer is tricky
– Correct assignment of orthology is difficult for distant species
– Functional equivalence cannot be guaranteed for in-paralogs

• These problems are addressed by our “fuzzy orthology” scheme
– Confidence scores for functional equivalence are calculated from all-against-all alignment
– Evidence is distributed across possible pairs according to confidence scores in the case of many-to-many relationships
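The distribution step for many-to-many relationships could be sketched like this. It is an assumption-laden illustration: each source protein's candidate orthologs receive a share proportional to their confidence score, and the transferred evidence for a target pair is the source score multiplied by both shares. The real scheme may weight things differently, and this sketch deliberately ignores the absolute level of confidence.

```python
def transfer_evidence(source_score, orthologs_a, orthologs_b):
    """Distribute one source-species score over candidate pairs in the target species.

    orthologs_a / orthologs_b map target proteins to confidence scores for being
    functionally equivalent to source proteins a and b (hypothetical inputs).
    """
    total_a = sum(orthologs_a.values())
    total_b = sum(orthologs_b.values())
    transferred = {}
    for x, conf_x in orthologs_a.items():
        for y, conf_y in orthologs_b.items():
            if x == y:
                continue  # a protein cannot be associated with itself
            share = (conf_x / total_a) * (conf_y / total_b)
            transferred[(x, y)] = source_score * share
    return transferred
```

With a one-to-one relationship the whole score is passed on; with two candidate orthologs of unequal confidence, the more confident candidate receives the larger part.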

Multiple evidence types from several species

Image: Molecular Biology of the Cell, 3rd edition

Metabolism overview

Defined manually: cutting metabolic maps into pathways

Purine biosynthesis

Histidine biosynthesis

Predicting and defining metabolic pathways and other functional modules

Defined objectively: standard clustering of genome-scale data

Part 2: The yeast mitochondrial system


Yeast mitochondria – why it should work well

• Because it is metabolism
– STRING was developed using KEGG pathways as a reference
– This may have caused STRING to function best on metabolism

• Because it is yeast
– By far the best covered organism in terms of physical interactions
– Many microarray gene expression studies
– Literature mining works well due to standardization of gene names

• Because it is prokaryotic
– Evolutionarily, mitochondria are of bacterial origin
– The genomic context methods in STRING are very powerful, but can only provide evidence for proteins with prokaryotic orthologs

Strategy for extracting a functional association network of the mitochondrial system

• Starting point:
– Reference set of proteins known to be mitochondrial
– A large, diverse set of experiments relevant for predicting mitochondrial proteins
– The global STRING network for yeast

• Predict mitochondrial candidate genes
– Use reference set to train neural networks for predicting candidate genes based on experimental data
– Use very high-confidence STRING links to suggest additional candidates based on interactions with reference and candidate genes

• Extract network that includes lower confidence interactions and identify functional modules by clustering
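The step that uses very high-confidence links to suggest additional candidates can be pictured with the short sketch below. It assumes a single pass over the network and a hypothetical link format (a dictionary from gene pairs to confidence scores); the threshold and the function name are illustrative, not taken from the slides.

```python
def suggest_candidates(known, links, threshold):
    """Return genes linked with high confidence to reference or candidate genes.

    known: set of reference plus already predicted candidate genes.
    links: dict mapping (gene_a, gene_b) to a confidence score between 0 and 1.
    """
    suggested = set()
    for (a, b), score in links.items():
        if score < threshold:
            continue  # only very high-confidence links are trusted here
        if a in known and b not in known:
            suggested.add(b)
        if b in known and a not in known:
            suggested.add(a)
    return suggested
```

Lower-confidence links are ignored at this stage and only brought back in when the final network is extracted for clustering.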

Predicting mitochondrial proteins

• Training was done with 5-fold cross validation
– Reference set used as positive examples
– All other genes used as negative examples

• Top 800 contains more than 90% of known mitochondrial genes

• Surprising performance of the linear model
– As good as NN with 250 hidden neurons
– Better than MitoP2
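Two pieces of this evaluation are easy to sketch: splitting genes into cross-validation folds, and measuring what fraction of the reference set appears among the top-ranked predictions. The code is a generic illustration with hypothetical function names; it omits shuffling and stratification, and it does not reproduce the models or data behind the slide.

```python
def k_fold_splits(items, k):
    """Yield (train, test) lists for k-fold cross validation."""
    folds = [items[i::k] for i in range(k)]  # every k-th item goes to the same fold
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def recall_at_top(ranked_genes, reference, top_n):
    """Fraction of reference genes found among the top_n ranked predictions."""
    top = set(ranked_genes[:top_n])
    return len(top & reference) / len(reference)
```

In the setting described above, `recall_at_top` would be evaluated at `top_n=800` on scores obtained for held-out genes, so that no gene is ranked by a model that saw it during training.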

Figure: functional association network of the mitochondrial system. Cluster labels: TOM; TIM; MRPL; MRPS; Ribosome Related; Cytosolic Ribosome; Vacuolar Acidification; Fatty Acid Biosynth.; RCCI; RCCII; RCCIII; RCCIV; RCCV; RCC_Asy; Secondary RCC_Asy; HAP Complex; Arg Biosynth.; Leu/Val/Ile Biosynth.; PDH/KGD/GCV; TCA Cycle; Cell Wall & pH Reg.; DNA Repair; Replication/DNA Repair; Glucose sensing and CH remodelling; APC; Fission/Fusion; rRNA Processing; mRNA Processing; tRNA Splicing; TFIIIC Complex; m-AAA Complex; Iron Homeostasis/Chaperone Activity; GARP Complex; Actin; NUP

Figure: the cluster network again, captioned “Protobacterial orthologs”

Figure: the cluster network again, captioned “Human disease orthologs”

Composition and interconnectivity of clusters

• A network of clusters
– Most probable path between clusters used as score

• Interacting clusters are preferentially within the same compartment

• Protobacterial clusters typically localize to the mitochondria
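One way to compute a most probable path is to treat each link's confidence as a probability, take the product along a path, and search for the path that maximizes it. The sketch below does this with Dijkstra's algorithm on negative log probabilities. Treating link scores as independent probabilities, searching between single nodes rather than whole clusters, and the function name are assumptions made for illustration.

```python
import heapq
import math


def most_probable_path_score(links, source, target):
    """Return the highest product of link probabilities on any path, or 0.0 if none.

    links: dict mapping (node_a, node_b) to a probability in (0, 1].
    """
    graph = {}
    for (a, b), p in links.items():
        cost = -math.log(p)  # maximizing a product equals minimizing summed -log
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return math.exp(-cost)
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return 0.0
```

An indirect route through two strong links can outscore a single weak direct link, which is why the path score is more informative than the direct link alone.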

Correlations among gene features

• Expression data agree with data on NF specific growth defects

• Genes with detectable human orthologs are more conserved among yeasts

• Disease orthologs are often protobacterial

• Knockout of disease orthologs causes less severe growth defects

Can human disease genes be predicted?

• Mitochondrial genes are already enriched in disease genes
• Previous slide showed that mitochondrial genes of protobacterial origin are further enriched in disease gene orthologs
• Disease gene orthologs show less growth defect than other mitochondrial genes with human orthologs

Getting more specific – generally speaking

• Benchmarking against one common reference allows integration of heterogeneous data

• The different types of data do not all tell us about the same kind of functional associations

• It should be possible to assign likely interaction types from supporting evidence types

• An accurate model of the yeast mitotic cell cycle

• Approach
– High confidence set of physical interactions
– Custom analysis of cell cycle expression data

• Observations
– Dynamic assembly of cell cycle complexes
– Temporal regulation of Cdk specificity

Dynamic complex formation during the yeast cell cycle
Ulrik de Lichtenberg, Lars Juhl Jensen, Søren Brunak and Peer Bork, to appear in Science

Conclusions

• Genomic context methods are able to infer the function of many prokaryotic proteins from genome sequences alone

• New genomic context methods are still being developed

• Integration of large-scale experimental data allows similar predictions to be made for eukaryotic proteins

• Successful data integration requires benchmarking and cross-species transfer of information

• Protein networks are useful for the analysis of large, complex biological systems

Acknowledgments

• The STRING team
– Christian von Mering
– Berend Snel
– Martijn Huynen
– Daniel Jaeggi
– Steffen Schmidt
– Mathilde Foglierini
– Peer Bork

• New genomic context methods
– Jan Korbel
– Christian von Mering
– Peer Bork

• ArrayProspector web service
– Julien Lagarde
– Chris Workman

• NetView visualization tool
– Sean Hooper

• Study of yeast mitochondria
– Fabiana Perocchi
– Lars Steinmetz

• Analysis of yeast cell cycle
– Ulrik de Lichtenberg
– Thomas Skøt
– Anders Fausbøll
– Søren Brunak

• Web resources
– string.embl.de
– www.bork.embl.de/ArrayProspector
– www.bork.embl.de/synonyms

http://string.embl.de/

http://www.bork.embl.de/synonyms

Thank you!