Day 3: From Homology to Orthology:
Predicting protein function.
Fitch 1970: “Where the homology is the result of gene duplication so that both copies
have descended side by side during the history of an organism, (for example alpha and
beta hemoglobin) the genes should be called paralogous (para= in parallel). Where the
homology is the result of speciation so that the history of the gene reflects the history of
the species (for example, alpha hemoglobin in man and mouse) the genes should be
called orthologous (ortho=exact)”
Mechanism of Gene Duplication
Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206. doi:10.1371/journal.pbio.0020206
Expansion of the globin family from a single ancestral member
Olfactory receptor family in mouse (green) and human (red, ~1000). Many receptors are shared between the two species and predate the
speciation.
Acad9, an example of a gene duplication followed
by neofunctionalization
Neofunctionalisation while maintaining some original function: the
assembly factor ACAD9, which originated via gene duplication from
VLCAD at the origin of the Bilateria, has maintained its active site
and has retained catalytic activity. Altered helices likely involved
in complex I assembly have been identified with the sequence
harmony method.
Nouws et al, Cell. Metab. 2010
Subfunctionalisation + Neofunctionalisation in complex I (Zhu et al, Nature 2016)
The acyl carrier protein (SDAP/ ACP) that functions in fatty
acid synthesis in plants has acquired a role in complex I in
fungi/metazoa (neofunctionalization)
Sub-functionalization after gene duplication for the complex I
binding Acyl Carrier Protein in Y. lipolytica
[Figure: ACP phylogeny with branches labelled by function, in slide order: fatty acid synthesis; fatty acid synthesis(?) + complex I interacting with B14; complex I interacting with B22; fatty acid synthesis + complex I interacting with B22 AND B14; fatty acid synthesis; fatty acid synthesis; fatty acid synthesis (does not have complex I)]
Comparing genomes for their
genes, orthologs
[Figure: gene tree of Gene A and its duplicates A and B in Species I and Species II, marking the gene duplication and speciation events and the resulting orthologs]
“Which genes do two genomes share, which do they not share, and how does that relate
to their phenotypic similarities and differences?”
[Figure: genes of Genome I and Genome II connected by hits with sequence identities of 35%, 30%, 25% and 23%]
Orthologs are expected to have relatively high levels of sequence
identity to each other (compared to other, non-orthologous homologs)
because they diverged relatively recently and, possibly, because they
have similar functions (?).
Large scale orthology determination is often done using bidirectional
best hits: a so-called “graph based approach”.
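The bidirectional best hit criterion can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular tool: the gene names (`g1_a`, `g2_a`, ...) and the scores are made up, and a real analysis would take the scores from an all-against-all sequence search.

```python
# Hypothetical similarity scores (e.g. percent identity) between genes of
# two genomes. Keys are (gene_in_genome_1, gene_in_genome_2).
scores = {
    ("g1_a", "g2_a"): 35.0,
    ("g1_a", "g2_b"): 25.0,
    ("g1_b", "g2_a"): 23.0,
    ("g1_b", "g2_b"): 30.0,
}


def best_hits(scores, query_index):
    """Map each query gene to its highest-scoring hit in the other genome."""
    best = {}
    for pair, score in scores.items():
        query = pair[query_index]
        target = pair[1 - query_index]
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(scores):
    """Keep only pairs in which each gene is the other's best hit."""
    forward = best_hits(scores, 0)   # genome 1 -> genome 2
    backward = best_hits(scores, 1)  # genome 2 -> genome 1
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


bbh = bidirectional_best_hits(scores)
print(bbh)
```

Ties and score thresholds are ignored here; both matter in practice.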
[Figure: genes of Genome I, Genome II and Genome III connected by hits with sequence identities of 35%, 35%, 25%, 23%, 40%, 30% and 22%]
In graph based approaches multiple genomes can be used to check for
consistency of bidirectional best hits.
[Figure: hits with sequence identities of 35%, 20% and 25%]
Gene duplications are creative: they create the possibility of
developing new functions (in this case, involvement in carnitine
synthesis). But they also complicate orthology, because they make
orthology non-transitive.
Inparalogs versus outparalogs:
Inparalogs are due to relatively recent, species-specific gene duplications, e.g.
Q9V6P0 and Q9VY24.
Outparalogs are due to gene duplications that preceded speciations, e.g. Q9V6P0 vs.
Q9VDM7
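The distinction can be read off a rooted gene tree. The sketch below uses a hypothetical tree with made-up gene names (`fly_p1`, `human_h1`, ...), not the proteins named above, and assumes that leaf names start with the species name followed by an underscore. Two genes from the same species are called inparalogs when the smallest subtree containing both holds only that species, and outparalogs otherwise.

```python
def leaves(tree):
    """List the leaf names of a tree given as nested tuples."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def lca(tree, a, b):
    """Return the smallest subtree that contains both leaves a and b."""
    if not isinstance(tree, str):
        for child in tree:
            names = leaves(child)
            if a in names and b in names:
                return lca(child, a, b)
    return tree


def paralog_type(tree, a, b):
    """Classify two same-species genes relative to the speciations in the tree."""
    species = {leaf.split("_")[0] for leaf in leaves(lca(tree, a, b))}
    return "inparalogs" if len(species) == 1 else "outparalogs"


# Hypothetical rooted tree: a duplication before the fly/human speciation
# (the root) and a later, fly-specific duplication (fly_p1 / fly_p2).
tree = ((("fly_p1", "fly_p2"), "human_h1"), ("fly_p3", "human_h2"))

print(paralog_type(tree, "fly_p1", "fly_p2"))
print(paralog_type(tree, "fly_p1", "fly_p3"))
```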
Solution to the non-transitivity of the concept of orthology sensu
stricto is: “Group orthology”
Conceptually: all proteins that are directly descended from one
protein in the last common ancestor are considered orthologous to
each other
In graph based approaches: Combine all connected “best triangular
hits” into Clusters of Orthologous Groups (COGs, Tatusov et al,
1997). WWW.NCBI.NLM.GOV
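A rough sketch of the triangle-merging idea follows. It is an illustration under stated assumptions, not the COG pipeline itself: the best-hit edges are invented, and "connected" is read here as "two triangles share a side".

```python
from itertools import combinations

# Hypothetical bidirectional best hits between genes of four genomes
# (the prefix g1..g4 names the genome).
pairs = [
    ("g1_a", "g2_a"), ("g2_a", "g3_a"), ("g1_a", "g3_a"),  # triangle 1
    ("g2_a", "g4_a"), ("g3_a", "g4_a"),                    # triangle 2 (shares g2_a/g3_a)
    ("g1_b", "g2_b"),                                      # lone pair, no triangle
]
edges = {frozenset(p) for p in pairs}


def find_triangles(edges):
    """Return all gene triples in which every pair is a best-hit edge."""
    genes = sorted({gene for edge in edges for gene in edge})
    return [
        frozenset(triple)
        for triple in combinations(genes, 3)
        if all(frozenset(pair) in edges for pair in combinations(triple, 2))
    ]


def cluster_triangles(triangles):
    """Merge triangles that share a side (two genes) using union-find."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) == 2:
            parent[find(i)] = find(j)

    groups = {}
    for i, triangle in enumerate(triangles):
        groups.setdefault(find(i), set()).update(triangle)
    return sorted(sorted(group) for group in groups.values())


cogs = cluster_triangles(find_triangles(edges))
print(cogs)
```

The lone pair `g1_b`/`g2_b` ends up in no group, because it is not supported by a third genome.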
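[Figure: gene tree of Gene A and its duplicates A and B in Species I and Species II, marking the gene duplication and speciation events; the remaining genes are non-orthologs, although they are bidirectional best hits]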
Parallel non-orthologous gene-loss can lead to misidentification of orthology relations
when using best bi-directional hits as criterion.
[Figure label: Gene loss]
[Figure: phylogenetic tree with node support values and a scale bar of 0.2. Leaves, in slide order: B. subtilis DnaK; E. coli HscA; Buchnera HscA; R. prowazekii HscA; H. sapiens 199642; S. cerevisiae YHR064C; H. sapiens 59062; H. sapiens 187116; S. cerevisiae BR169C; S. cerevisiae YPL106C; S. cerevisiae YKL073W; E. coli DnaK; Buchnera DnaK; R. prowazekii DnaK; H. sapiens 23627; S. cerevisiae ECM10; S. cerevisiae SSC1; S. cerevisiae SSQ1]
Variations in the rate of evolution can lead to misidentification of
orthology relations when the latter are based on bi/multi-directional
best hits.
Because of independent loss events, and because of variable rates
of evolution, in large gene families, orthology determination using
bi/multi-directional best hits does not always resolve separate
orthologous and/or functional groups.
One solution to this is to construct phylogenies.
Prediction of orthology using phylogenies (unrooted)
Classic usage of phylogeny: inferring evolutionary history from a single,
orthologous group of proteins, e.g. the origin of hydrogenosomes. Trees
are not always perfect (even when published in Nature).
Dyall et al, Nature 2004
Hrdy et al, Nature 2004
Unrooted tree topologies
[Figure: three equivalent drawings of the same unrooted tree with leaves A, B, C and D]
((A,B),(C,D)) Bracket notation
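One way to check that two bracket notations describe the same unrooted topology is to compare the splits (bipartitions of the leaves) that their internal branches induce. The sketch below writes trees as nested Python tuples instead of parsing bracket strings; the function names are illustrative only.

```python
def leaves(tree):
    """Return the set of leaf names of a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial leaf bipartitions induced by internal branches."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        # Branches leading to a single leaf carry no topological information.
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


t1 = (("A", "B"), ("C", "D"))    # ((A,B),(C,D))
t2 = ("A", ("B", ("C", "D")))    # (A,(B,(C,D))): same tree, drawn from A
t3 = (("A", "C"), ("B", "D"))    # a different topology

print(splits(t1) == splits(t2))
print(splits(t1) == splits(t3))
```

Both `t1` and `t2` contain only the split AB|CD, so they are the same unrooted tree; `t3` contains AC|BD instead.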
- Unrooted tree topologies reflect only relative evolutionary
relationships (among the primates, humans and chimpanzees are more
closely related to each other than either is to the orangutan or the
gibbon).
- Rooted trees reflect the relative order of descent (among the
primates, the gibbon branched off first, then the orangutan, and
finally the chimpanzee and human lineages split).
[Figure: unrooted tree of Orangutan, Gibbon, Chimp and Human, next to a rooted tree of Chimp, Human, Orangutan, Gibbon and Baboon]
How we root a tree affects the orthology relationships.
[Figure: one unrooted tree with leaves Spec I A, Spec II B, Spec II C and Spec I D, shown with alternative root positions]
How we root a tree also affects the number of inferred gene losses. If the duplication from
A to A’ happened in the ancestor of Spec I, II and III, then where is
gene A’ in Spec II and III? Has it been lost?
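[Figure: rooted tree with leaves Spec I D, Spec II B, Spec II C and Spec I A; question marks for Spec II and Spec I mark genes that are expected but missing]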
Effectively, rooting the tree based on “species overlap” between the
partitions minimizes the number of gene losses and duplications.
[Figure: the tree with leaves Spec I A, Spec II B, Spec II C and Spec I D, shown with alternative root positions]
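The effect of the root position can be made concrete with a small sketch. It assumes a hypothetical unrooted topology in which Spec I A groups with Spec II B and Spec II C groups with Spec I D, labels a node as a duplication when the species sets of its two children overlap, and counts duplications only (gene losses are not counted here).

```python
from itertools import product


def species_of(leaf):
    """Leaf names are assumed to look like 'SpecI_A' (species, then gene)."""
    return leaf.split("_")[0]


def analyse(tree):
    """Return (leaves, duplication count, ortholog pairs) for a rooted binary tree."""
    if isinstance(tree, str):
        return [tree], 0, set()
    left, right = tree
    l_leaves, l_dups, l_orth = analyse(left)
    r_leaves, r_dups, r_orth = analyse(right)
    overlap = {species_of(x) for x in l_leaves} & {species_of(x) for x in r_leaves}
    dups = l_dups + r_dups
    orth = l_orth | r_orth
    if overlap:
        dups += 1  # species overlap: treat this node as a duplication
    else:
        # no overlap: speciation, so genes across the node are orthologs
        orth |= {frozenset(pair) for pair in product(l_leaves, r_leaves)}
    return l_leaves + r_leaves, dups, orth


# Two of the possible rootings of the same (hypothetical) unrooted tree.
rootings = {
    "middle branch": (("SpecI_A", "SpecII_B"), ("SpecII_C", "SpecI_D")),
    "branch to A": ("SpecI_A", ("SpecII_B", ("SpecII_C", "SpecI_D"))),
}

duplications = {name: analyse(tree)[1] for name, tree in rootings.items()}
best = min(duplications, key=duplications.get)
orthologs = analyse(rootings[best])[2]

print(duplications)
print(best, sorted(sorted(pair) for pair in orthologs))
```

Rooting on the middle branch needs one duplication and yields two ortholog pairs; rooting on the branch to A needs two duplications and leaves only C and D as orthologs.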
Beyond inparalogs and outparalogs: A numbering
system for paralogy. LOFT (Levels of Orthology
from Trees)
Prefab phylogenetic trees: TreeFam
TreeFam contains the domain composition of the
proteins, which sometimes varies between paralogous
proteins.
Substrate specificities are not necessarily monophyletic (convergent
evolution).
Convergent evolution of Trichomonas vaginalis lactate dehydrogenase from malate
dehydrogenase. Wu et al., PNAS 1999
[Figure: chemical structures of lactate, malate, oxaloacetate and pyruvate. LDH interconverts lactate and pyruvate; MDH interconverts malate and oxaloacetate.]
Lactate/Malate Dehydrogenase: different small-molecule specificity
[Figure: structures of lactate and malate at the substrate-binding site, with labels "negative", "Arg 102" and "positive"]
Hannenhalli & Russell, JMB, 303, 61-76, 2000
Another source of information that can be used for orthology
prediction is gene-order conservation.
[Figure: two hits with equal sequence identity, 35% and 35%]
(be careful with duplicated sets of genes, though)
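When two candidates score equally on sequence identity, conserved neighbours can break the tie. The sketch below is a toy illustration with invented gene names and gene orders; it simply counts how many neighbours of the query have a known ortholog next to each candidate.

```python
def neighbours(order, gene, window=1):
    """Return the genes within `window` positions of `gene` in a gene order."""
    i = order.index(gene)
    return set(order[max(0, i - window):i]) | set(order[i + 1:i + 1 + window])


def conserved_neighbours(query, candidate, order1, order2, orthologs, window=1):
    """Count neighbours of the query whose orthologs flank the candidate."""
    mapped = {
        orthologs[g] for g in neighbours(order1, query, window) if g in orthologs
    }
    return len(mapped & neighbours(order2, candidate, window))


# Hypothetical gene orders: query q in genome 1; candidates c1 and c2 in
# genome 2, both assumed to be equally similar to q in sequence.
order1 = ["x1", "q", "y1"]
order2 = ["x2", "c1", "y2", "z2", "c2", "w2"]
orthologs = {"x1": "x2", "y1": "y2"}  # already established ortholog pairs

support = {
    c: conserved_neighbours("q", c, order1, order2, orthologs)
    for c in ("c1", "c2")
}
print(support)
```

A duplicated block of genes would give both candidates the same neighbourhood support, which is the caveat noted above.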