STRING: Prediction of a functional association network for the yeast mitochondrial system
Lars Juhl Jensen, EMBL Heidelberg
Overview
• Prediction of functional associations between proteins
– What is STRING?
– Genomic context methods
– Integration of large-scale experimental data
– Combination and cross-species transfer of evidence
• (Coffee break)
• The yeast mitochondrial system
– Prediction of mitochondrial proteins
– A functional association network for mitochondria
– Mapping and correlating features of mitochondrial proteins
Part 1: Prediction of functional association between proteins
What is STRING?
Genomic neighborhood
Species co-occurrence
Gene fusions
Database imports
Exp. interaction data
Microarray expression data
Literature co-mentioning
Let the data speak for themselves ...
• Classification schemes are obviously difficult to predict if they are not supported by the data; there are no obvious features separating:
– Presidents vs. non-presidents
– Actors vs. non-actors
• Unsupervised methods may discover a more meaningful classification:
– Holding your pinky to your mouth is a clear sign of evil
– Wearing a bowtie is a sign of good
– So is consumption of alcoholic drinks
Inferring functional modules from gene presence/absence patterns
Figure: the “Cellulosome” (labels: cell, cell wall, anchoring proteins, cellulosomes, cellulose, resting protuberances, protracted protuberance). © Trends Microbiol, 1999
Genomic context methods
© Nature Biotechnology, 2004
Score calibration against a common reference
• Many diverse types of evidence
– The quality of each is judged by very different raw scores
– Quality differences exist among data sets of the same type
• Solved by calibrating all scores against a common reference
– Scores are directly comparable
– Probabilistic scores allow evidence to be combined
• Requirements for the reference
– Must represent a compromise of all types of evidence
– Broad species coverage
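The calibration and combination steps above can be sketched in a few lines. This is a minimal illustration, not the STRING implementation: the equal-sized binning, the function names, and the independence assumption behind `combine` are all assumptions made here. The `in_reference` flags stand in for "both proteins are found together in the common reference".

```python
from bisect import bisect_left


def calibrate(raw_scores, in_reference, n_bins=4):
    """Map raw scores to the fraction of reference-supported pairs per bin.

    raw_scores: raw score for each protein pair (any scale).
    in_reference: True if the pair is supported by the common reference
                  (hypothetical stand-in for a shared reference annotation).
    Returns (upper bin edges, probability per bin) for equal-sized bins.
    """
    pairs = sorted(zip(raw_scores, in_reference))
    size = max(1, len(pairs) // n_bins)
    edges, probs = [], []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        edges.append(chunk[-1][0])  # highest raw score in this bin
        probs.append(sum(flag for _, flag in chunk) / len(chunk))
    return edges, probs


def lookup(raw, edges, probs):
    """Return the calibrated probability for a new raw score."""
    index = min(bisect_left(edges, raw), len(probs) - 1)
    return probs[index]


def combine(probabilities):
    """Combine calibrated scores, assuming independent evidence:
    1 - prod(1 - p_i)."""
    remaining = 1.0
    for p in probabilities:
        remaining *= 1.0 - p
    return 1.0 - remaining
```

Once every evidence type is expressed on the same probabilistic scale, two sources that each give a probability of 0.5 combine to 0.75 under this independence assumption.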
Integrating physical interaction screens
• Make binary representation of complexes
• Yeast two-hybrid data sets are inherently binary
• Calculate score from number of (co-)occurrences
• Calculate score from non-shared partners
• Calibrate against KEGG maps
• Infer associations in other species
• Combine evidence from experiments
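The two raw scoring ideas named above can be sketched as follows. The slides give no formulas, so both log-ratio scores below are assumptions chosen for illustration. So is the choice to represent each purification as all pairs of its members, and so are the function names.

```python
from collections import Counter
from itertools import combinations
from math import log


def cooccurrence_scores(purifications):
    """Score pairs by how often they co-occur across purifications.

    Each purification is turned into a binary representation by listing
    all pairs of its members (an assumption; other representations exist).
    """
    total = len(purifications)
    single = Counter()
    paired = Counter()
    for complex_members in purifications:
        members = sorted(set(complex_members))
        single.update(members)
        paired.update(combinations(members, 2))
    return {
        (a, b): log(n_ab * total / ((single[a] + 1) * (single[b] + 1)))
        for (a, b), n_ab in paired.items()
    }


def nonshared_partner_scores(interactions):
    """Score binary (e.g. two-hybrid) interactions: pairs whose proteins
    have many partners that the other lacks get lower scores."""
    partners = {}
    for a, b in interactions:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    scores = {}
    for a, b in interactions:
        only_a = partners[a] - partners[b] - {b}
        only_b = partners[b] - partners[a] - {a}
        key = tuple(sorted((a, b)))
        scores[key] = -log((len(only_a) + 1) * (len(only_b) + 1))
    return scores
```

Raw scores of this kind are not probabilities. They only become comparable across data sets after calibration against the common reference.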
Mining microarray expression databases
• Re-normalize arrays by modern method to remove biases
• Build expression matrix
• Combine similar arrays by PCA
• Construct predictor by Gaussian kernel density estimation
• Calibrate against KEGG maps
• Infer associations in other species
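The kernel density step can be illustrated with a small sketch. It assumes the predictor takes a single expression correlation value per gene pair and compares its density among reference-supported pairs with its density among all other pairs. The input feature, the bandwidth, the prior, and the function names are assumptions, not details taken from the slides.

```python
from math import exp, pi, sqrt


def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels on the samples."""
    norm = 1.0 / (len(samples) * bandwidth * sqrt(2.0 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def make_predictor(pos_corr, neg_corr, prior, bandwidth=0.1):
    """Build P(associated | correlation) from two kernel density estimates.

    pos_corr: correlations of gene pairs supported by the reference.
    neg_corr: correlations of gene pairs not supported by the reference.
    prior: assumed fraction of associated pairs before seeing the data.
    """
    f_pos = gaussian_kde(pos_corr, bandwidth)
    f_neg = gaussian_kde(neg_corr, bandwidth)

    def probability(r):
        numerator = prior * f_pos(r)
        denominator = numerator + (1.0 - prior) * f_neg(r)
        # Fall back to the prior where both densities vanish numerically.
        return numerator / denominator if denominator > 0.0 else prior

    return probability
```

A call such as `make_predictor([0.7, 0.8, 0.9], [-0.1, 0.0, 0.1, 0.2], prior=0.5)` returns a function that gives high probabilities for correlations near the positive examples and low ones near the negatives.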
Figure: source species and target species, with the association in the target species marked “?”
Evidence transfer based on “fuzzy orthology”
• Orthology transfer is tricky
– Correct assignment of orthology is difficult for distant species
– Functional equivalence cannot be guaranteed for in-paralogs
• These problems are addressed by our “fuzzy orthology” scheme
– Confidence scores for functional equivalence are calculated from all-against-all alignment
– Evidence is distributed across possible pairs according to confidence scores in the case of many-to-many relationships
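Distributing evidence across many-to-many relationships can be sketched as below. This is only one plausible reading of the scheme: confidences are normalized per source protein and the score of a source link is split in proportion to the product of the two weights. The normalization, the summing of contributions, and the names are assumptions.

```python
def transfer_evidence(source_links, orthology):
    """Distribute link scores from a source species to a target species.

    source_links: {(protein_a, protein_b): score} in the source species.
    orthology: {source_protein: {target_protein: confidence}}, where the
               confidence reflects functional equivalence (hypothetical
               values, e.g. derived from alignment scores).
    Returns {(target_a, target_b): transferred score}.
    """
    weights = {}
    for source, targets in orthology.items():
        total = sum(targets.values())
        if total > 0:
            weights[source] = {t: c / total for t, c in targets.items()}

    transferred = {}
    for (a, b), score in source_links.items():
        for target_a, w_a in weights.get(a, {}).items():
            for target_b, w_b in weights.get(b, {}).items():
                if target_a == target_b:
                    continue  # a protein is not linked to itself
                key = tuple(sorted((target_a, target_b)))
                # Contributions from several source links are summed here.
                transferred[key] = transferred.get(key, 0.0) + score * w_a * w_b
    return transferred
```

With a one-to-one relationship the full score is transferred. With one-to-many relationships the score is shared among the candidate pairs, so no single pair inherits more evidence than the orthology assignment supports.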
Multiple evidence types from several species
Image: Molecular Biology of the Cell, 3rd edition
Metabolism overview
Defined manually: cutting metabolic maps into pathways
Figure labels: Purine biosynthesis, Histidine biosynthesis
Predicting and defining metabolic pathways and other functional modules
Defined objectively: standard clustering of genome-scale data
Part 2: The yeast mitochondrial system
Yeast mitochondria – why it should work well
• Because it is metabolism
– STRING was developed using KEGG pathways as a reference
– This may have caused STRING to function best on metabolism
• Because it is yeast
– By far the best covered organism in terms of physical interactions
– Many microarray gene expression studies
– Literature mining works well due to standardization of gene names
• Because it is prokaryotic
– Evolutionarily, mitochondria are of bacterial origin
– The genomic context methods in STRING are very powerful, but can only provide evidence for proteins with prokaryotic orthologs
Strategy for extracting a functional association network of the mitochondrial system
• Starting point:
– Reference set of proteins known to be mitochondrial
– A large, diverse set of experiments relevant for predicting mitochondrial proteins
– The global STRING network for yeast
• Predict mitochondrial candidate genes
– Use reference set to train neural networks for predicting candidate genes based on experimental data
– Use very high-confidence STRING links to suggest additional candidates based on interactions with reference and candidate genes
• Extract network that includes lower-confidence interactions and identify functional modules by clustering
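The network-related steps of this strategy can be sketched as follows. The cutoffs, the single expansion pass, and the function names are assumptions for illustration. The neural network prediction and the clustering step are left out.

```python
def expand_candidates(seeds, links, high_cutoff):
    """Add proteins joined to a seed by a very high-confidence link.

    seeds: reference proteins plus predicted candidates.
    links: {(protein_a, protein_b): confidence score}.
    A single pass is used here; whether to iterate is a design choice.
    """
    selected = set(seeds)
    for (a, b), score in links.items():
        if score < high_cutoff:
            continue
        if a in seeds:
            selected.add(b)
        if b in seeds:
            selected.add(a)
    return selected


def extract_network(proteins, links, low_cutoff):
    """Keep links among the selected proteins, down to a lower cutoff."""
    return {
        (a, b): score
        for (a, b), score in links.items()
        if score >= low_cutoff and a in proteins and b in proteins
    }
```

Using a strict cutoff to choose the proteins and a more permissive one to choose the links keeps the candidate list conservative while still giving a clustering step enough edges to work with.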
Predicting mitochondrial proteins
• Training was done with 5-fold cross validation
– Reference set used as positive examples
– All other genes used as negative examples
• Top 800 contains more than 90% of known mitochondrial genes
• Surprising performance of the linear model
– As good as NN with 250 hidden neurons
– Better than MitoP2
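The evaluation described above rests on two simple pieces: splitting the genes into five folds, and measuring what fraction of the known mitochondrial genes lands in the top-ranked predictions. The sketch below shows both. The model itself is omitted, and the function names and toy gene identifiers are hypothetical.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Split item indices into k disjoint folds after shuffling."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def recall_at_top(scores, known, top_n):
    """Fraction of known positives found among the top_n scored genes.

    scores: {gene: predicted score}, ideally taken from the fold in
            which the gene was held out.
    known: set of reference (positive) genes.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return len(known.intersection(ranked)) / len(known)


# Toy usage with made-up scores.
toy_scores = {"g1": 0.9, "g2": 0.8, "g3": 0.3, "g4": 0.1}
toy_known = {"g1", "g3"}
toy_recall = recall_at_top(toy_scores, toy_known, top_n=2)
```

Each gene should be scored by a model trained on the other four folds, so that the ranking reflects performance on genes the model has not seen.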
Figure: network of functional clusters, repeated in three panels; two of the panels are labelled “Protobacterial orthologs” and “Human disease orthologs”. Cluster labels: TOM, TIM, MRPL, MRPS, Ribosome Related, Cytosolic Ribosome, Vacuolar Acidification, Fatty Acid Biosynth., RCC_Asy, Secondary RCC_Asy, RCCI, RCCII, RCCIII, RCCIV, RCCV, HAP Complex, Arg Biosynth., Leu/Val/Ile Biosynth., PDH/KGD/GCV, TCA Cycle, Cell Wall & pH Reg., DNA Repair, Replication/DNA Repair, Glucose sensing and CH remodelling, APC, Fission/Fusion, rRNA Processing, mRNA Processing, tRNA Splicing, TFIIIC Complex, m-AAA Complex, GARP Complex, Iron Homeostasis/Chaperone Activity, Actin, NUP
Composition and interconnectivity of clusters
• A network of clusters
– Most probable path between clusters used as score
• Interacting clusters are preferentially within the same compartment
• Protobacterial clusters typically localize to the mitochondria
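A most probable path can be found with a standard shortest-path search if each link probability p is replaced by the cost -log(p), since maximizing a product of probabilities is the same as minimizing a sum of such costs. The sketch below works on individual nodes. How paths between whole clusters are summarized is not stated on the slide, so that part is left out. The function name is an assumption, and probabilities must lie in (0, 1].

```python
import heapq
from math import exp, log


def most_probable_path(links, start, goal):
    """Return (probability, path) for the most probable path.

    links: {(node_a, node_b): probability}, treated as undirected.
    Uses Dijkstra's algorithm on the costs -log(probability).
    Returns (0.0, []) when the goal cannot be reached.
    """
    graph = {}
    for (a, b), p in links.items():
        graph.setdefault(a, []).append((b, -log(p)))
        graph.setdefault(b, []).append((a, -log(p)))

    queue = [(0.0, start, [start])]
    done = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return exp(-cost), path
        if node in done:
            continue
        done.add(node)
        for neighbour, weight in graph.get(node, []):
            if neighbour not in done:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return 0.0, []
```

With this scoring, two confident links in series can outrank a single weak direct link.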
Correlations among gene features
• Expression data agree with data on NF-specific growth defects
• Genes with detectable human orthologs are more conserved among yeasts
• Disease orthologs are often protobacterial
• Knockout of disease orthologs causes less severe growth defects
Can human disease genes be predicted?
• Mitochondrial genes are already enriched in disease genes
• Previous slide showed that mitochondrial genes of protobacterial origin are further enriched in disease gene orthologs
• Disease gene orthologs show less growth defect than other mitochondrial genes with human orthologs
Getting more specific – generally speaking
• Benchmarking against one common reference allows integration of heterogeneous data
• The different types of data do not all tell us about the same kind of functional associations
• It should be possible to assign likely interaction types from supporting evidence types
• An accurate model of the yeast mitotic cell cycle
• Approach
– High confidence set of physical interactions
– Custom analysis of cell cycle expression data
• Observations
– Dynamic assembly of cell cycle complexes
– Temporal regulation of Cdk specificity
Dynamic complex formation during the yeast cell cycle
Ulrik de Lichtenberg, Lars Juhl Jensen, Søren Brunak and Peer Bork, to appear in Science
Conclusions
• Genomic context methods are able to infer the function of many prokaryotic proteins from genome sequences alone
• New genomic context methods are still being developed
• Integration of large-scale experimental data allows similar predictions to be made for eukaryotic proteins
• Successful data integration requires benchmarking and cross-species transfer of information
• Protein networks are useful for the analysis of large, complex biological systems
Acknowledgments
• The STRING team
– Christian von Mering
– Berend Snel
– Martijn Huynen
– Daniel Jaeggi
– Steffen Schmidt
– Mathilde Foglierini
– Peer Bork
• New genomic context methods
– Jan Korbel
– Christian von Mering
– Peer Bork
• ArrayProspector web service
– Julien Lagarde
– Chris Workman
• NetView visualization tool
– Sean Hooper
• Study of yeast mitochondria
– Fabiana Perocchi
– Lars Steinmetz
• Analysis of yeast cell cycle
– Ulrik de Lichtenberg
– Thomas Skøt
– Anders Fausbøll
– Søren Brunak
• Web resources
– string.embl.de
– www.bork.embl.de/ArrayProspector
– www.bork.embl.de/synonyms
Thank you!