Homology
• Sequences are homologous when they share a common ancestor
Homology
• Why are we interested?
– Function prediction
  • Homologous proteins tend to have similar functions
– Evolutionary dynamics
  • Tracing the evolution of protein families
The importance of homology
• Homologous proteins tend to have similar functions
• What does “similar functions” mean?
– What is function?
• Various levels of description of function
– Phenotypic: Protein A regulates limb formation
– Cellular function: Protein A inhibits SHH signaling
– Molecular function: Protein A phosphorylates Protein B
“Similar function”
What is function?
Various levels of description:
Sequence similarity and homology have the largest relevance for molecular function. This is the aspect of protein function that is best conserved, and protein sequence and structure can often be interpreted in terms of it.
Sequence similarity
Similar 3D structure
Functional similarity Evolutionary origin
Homology
Homologous sequences have a similar 3D structure and tend to have similar
functions
Detecting homology
• Similarity of:
– 3D structure -> the most conserved aspect, yet few structures are available. Structures are compared and classified by “eye” (A. Murzin, SCOP) and by software packages (Dali). More info on 3D in Bioinf II.
– Sequence -> less conserved, but many sequences are available. Homology determination is mainly based on theoretical models of sequence evolution and on the likelihood that, when you compare a sequence to a database, you will find a sequence of at least that similarity by chance.
– 3D structure similarity is used as a benchmark for the detection of homology by sequence similarity.
Detecting sequence similarity
• We need a model for how to compare (align) sequences
• Evolution -> sequences change over time
– We need models that describe how homologous sequences change!
• Simplest model:
– All amino acids are equal
  • Equally dissimilar, replaced at equal rates, independent of position (based on identity matrices)
– Does this accurately describe sequence evolution?
  • If not, what are we missing?
Detecting sequence similarity
• A more complicated model:
– Some amino-acids are more equal than others
• We account for basic biochemical properties such as acidity, ionic charges, size (similarity matrices)
adapted from Livingstone & Barton, CABIOS, 9, 745-756, 1993
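The difference between the two models can be made concrete with a small scoring sketch. The property groups and the score values (2, 1, -1) below are illustrative assumptions, not a published substitution matrix; the sequences are made up.

```python
# Toy comparison of the identity model and a similarity model for an
# ungapped, pre-aligned pair of sequences.
# PROPERTY_GROUPS and the score values are illustrative assumptions only.
PROPERTY_GROUPS = [
    set("DE"),      # acidic
    set("KRH"),     # basic
    set("AVLIM"),   # aliphatic / hydrophobic
    set("FWY"),     # aromatic
    set("STNQ"),    # polar, uncharged
]

def identity_score(a: str, b: str) -> int:
    """Identity model: 1 for an identical residue, 0 for any substitution."""
    return 1 if a == b else 0

def similarity_score(a: str, b: str) -> int:
    """Toy similarity model: 2 for identity, 1 within a group, -1 otherwise."""
    if a == b:
        return 2
    if any(a in group and b in group for group in PROPERTY_GROUPS):
        return 1
    return -1

def score_alignment(seq1: str, seq2: str, score) -> int:
    """Sum the per-position scores of two equal-length aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(score(a, b) for a, b in zip(seq1, seq2))

query = "DKLVF"
conservative = "EKIVY"   # D->E, L->I, F->Y: substitutions within a group
radical = "GKPVC"        # D->G, L->P, F->C: substitutions across groups

print(score_alignment(query, conservative, identity_score))    # 2
print(score_alignment(query, radical, identity_score))         # 2
print(score_alignment(query, conservative, similarity_score))  # 7
print(score_alignment(query, radical, similarity_score))       # 1
```

The identity model cannot tell the two candidates apart, while the similarity model ranks the conservative substitutions higher.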
E-values
• When do you know that what you found is significant?
– Theory based on “extreme value distributions”: comparing two random sequences with each other will not tend to give you a high similarity, but when you compare one sequence with a large set of sequences you will always find some high-scoring hits -> the extreme values. For your “hit” to be significant it has to be better than those expected extreme values.
– E-values: the expected number of hits of at least that similarity if the sequence had been compared to a database of random sequences.
E-value: how many hits of a certain quality/score (e.g. the Smith-Waterman score) do you expect if you were to compare your sequence to a random database?
E-values and how they are calculated
• E-value: expected number of hits with a score of at least S in a random sequence database
• $E = K \cdot m \cdot n \cdot e^{-\lambda S}$
– m: length of query
– n: total length of all sequences in the database
– S: similarity score of the alignment based on the substitution matrix
– K and λ are scaling parameters that depend on the scoring system (substitution matrix, gap penalties) and the amino-acid composition of the sequences
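As a minimal sketch, the formula can be evaluated directly. The values of K and λ below are placeholders of a plausible order of magnitude, chosen for illustration; in practice they are reported by the search program for the scoring system in use. The query and database lengths are also made up.

```python
import math

def e_value(score: float, query_len: int, db_len: int, k: float, lam: float) -> float:
    """Expected number of chance hits with at least this score: K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)

# Assumed, illustrative parameters (not taken from the slides).
K, LAMBDA = 0.041, 0.267
m = 250            # query length
n = 50_000_000     # total number of residues in the database

for s in (50, 100, 150):
    print(f"S = {s:3d}  E = {e_value(s, m, n, K, LAMBDA):.3g}")
```

Because the score sits in the exponent, a modest increase in S lowers the E-value by orders of magnitude, whereas m and n only scale it linearly.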
E-values and how they are calculated
• $E = K \cdot m \cdot n \cdot e^{-\lambda S}$
– A longer query sequence increases the chance that some part of it is found in a random database
– The chance of a hit increases with database size
– A low alignment score S gives a high E-value
  • Short sequence: more likely to occur by chance in a database
  • Bad alignment: more sequences will have a similar score, even though they are vastly different
– The lower the E-value, the more significant your hit!
  • E-values will change when you use different DBs! Beware!!!
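The database dependence follows directly from the formula, because n enters it linearly. A short worked relation, assuming the same query, alignment score, K and λ in both searches (the symbols $n_1$, $n_2$, $E_1$, $E_2$ and $\Delta S$ are introduced here for illustration only):

```latex
\begin{align*}
\frac{E_2}{E_1} &= \frac{K\, m\, n_2\, e^{-\lambda S}}{K\, m\, n_1\, e^{-\lambda S}} = \frac{n_2}{n_1}, \\
\Delta S &= \frac{1}{\lambda}\,\ln\frac{n_2}{n_1}
\quad\text{(extra score needed to keep the same $E$ in the larger database).}
\end{align*}
```

So the same alignment found in a database ten times larger has a ten times higher E-value, but only a small additive increase in score is needed to compensate.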
How do we know for sure that significantly similar sequences are truly homologous? (aside from the statistical argument)
• Experimental benchmarking of E-values by comparisons of 3D structures (e.g. Brenner et al., 1996), where we “know” what is homologous and what is not.
• 3D structural similarity is lost at a lower rate than sequence similarity and is being used to test the quality of the statistics
(Figure: TGFα and EGF)
How do we judge how good these methods are (“benchmark”)?
1. You take a set of sequences that you know to be homologous or not (based on their 3D structure)
2. You compare these sequences with each other using e.g. Smith-Waterman
3. You sort the results of the comparisons based on some score (e.g. the % identity) between the sequences, the highest scoring sequence pair on top
4. Now you can judge how well the score separates homologous from non-homologous sequences
Benchmarking homology detection with the Smith-Waterman algorithm, using 3D structures (PDB40) as the “gold standard” for what is homologous and what is not. Use those E-values!
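The four benchmarking steps can be sketched as follows. The scored pairs are made-up numbers, and `coverage_at_errors` is a hypothetical helper name; real benchmarks use structure-based classifications to label the pairs.

```python
# Each pair is (score, is_homolog); is_homolog comes from the 3D-structure
# classification. The values below are invented for illustration.
pairs = [
    (95, True), (80, True), (60, False), (55, True),
    (40, False), (35, True), (20, False),
]

def rank_pairs(scored_pairs):
    """Step 3: sort the comparisons, highest-scoring pair on top."""
    return sorted(scored_pairs, key=lambda p: p[0], reverse=True)

def coverage_at_errors(scored_pairs, max_false_positives):
    """Step 4: fraction of true homologous pairs recovered before more than
    max_false_positives non-homologous pairs have been accepted."""
    total_true = sum(1 for _, is_homolog in scored_pairs if is_homolog)
    found = false_pos = 0
    for _, is_homolog in rank_pairs(scored_pairs):
        if is_homolog:
            found += 1
        else:
            false_pos += 1
            if false_pos > max_false_positives:
                break
    return found / total_true if total_true else 0.0

for allowed in (0, 1, 3):
    print(allowed, coverage_at_errors(pairs, allowed))
```

The sketch assumes that a higher score is better; to rank by E-value, sort in ascending order instead (or use the negative logarithm of the E-value as the score).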
Sequence similarity vs homology
• Sequences that are not significantly similar do not have to be non-homologous
• BolA (red) and OsmC (green) have no significant similarity at the sequence level, but are significantly similar at the 3D level.
Increasing search sensitivity
• So to optimally “describe” the protein we are searching for, we may want to use
– Information on allowed aa substitutions per position
– Information on where insertions and deletions can occur
• In short: we need to make a profile of the protein we are searching for
– “Profile based searches”
– 2 to 3 fold increase in sensitivity
The level of conservation in sequence alignments varies considerably; one would like to exploit that in homology detection.
Part 2
Profile based searches
• Position Specific Iterated BLAST (PSI-BLAST)
• Hidden Markov Models
– Rather than one substitution matrix that is equal for all positions, these methods apply one for each position, as well as position-specific gap penalties
– Positions are regarded as independent, though
Profile based searches
• Building a mathematical, probabilistic model that “generates” our protein domain allows us to assess the probability that any sequence of interest has been generated by any specific model.
A very simple Hidden Markov Model (no insertions/deletions)
Pos. 1: P(A)=0.01, P(C)=0.8, P(E)=0.1, etc.
Pos. 2: P(A)=0.3, P(C)=0.01, P(E)=0.02, etc.
Pos. 3: P(A)=0.05, P(C)=0.01, P(E)=0.4, etc.
Pos. 4: P(A)=0.01, P(C)=0.01, P(E)=0.3, etc.
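With the per-position probabilities of the simple model above, the probability that the model “generates” a given sequence is just a product over positions. The sketch below assumes that the four probability blocks on the slide belong to positions 1 to 4 in order, lists only the residues A, C and E (the slide elides the rest), and uses a uniform background of 1/20 per residue as an assumption for the log-odds score.

```python
import math

# Emission probabilities per position, as given on the slide (A, C, E only).
PROFILE = [
    {"A": 0.01, "C": 0.8,  "E": 0.1},
    {"A": 0.3,  "C": 0.01, "E": 0.02},
    {"A": 0.05, "C": 0.01, "E": 0.4},
    {"A": 0.01, "C": 0.01, "E": 0.3},
]
BACKGROUND = 0.05  # assumption: uniform over the 20 amino acids

def sequence_probability(seq: str, profile=PROFILE) -> float:
    """P(seq | model) for a model without insertions or deletions."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must equal the number of positions")
    p = 1.0
    for residue, column in zip(seq, profile):
        p *= column[residue]  # KeyError for residues the slide does not list
    return p

def log_odds(seq: str, profile=PROFILE, background=BACKGROUND) -> float:
    """log2 of P(seq | model) over P(seq | random background)."""
    return math.log2(sequence_probability(seq, profile) / background ** len(seq))

print(sequence_probability("CAEE"))  # 0.8 * 0.3 * 0.4 * 0.3
print(sequence_probability("AAAA"))  # 0.01 * 0.3 * 0.05 * 0.01
print(log_odds("CAEE") > 0, log_odds("AAAA") < 0)
```

A positive log-odds score means the sequence is better explained by the model than by the random background.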
A slightly more complicated Hidden Markov Model (with insertions (I) / deletions (D))
(Figure: a row of M states, with I and D states alongside them.)
Making an HMM
• Get all obvious homologs, align them
– Use this as input for your model
Making an HMM
• Software package HMMER
– Eddy et al., 1998
– Creates the model based on an alignment and allows you to search large sequence databases for matches
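To illustrate what “creating the model based on an alignment” involves at its simplest, the sketch below turns alignment columns into per-position residue probabilities. This is not how HMMER works internally (it also weights sequences, uses more elaborate priors and models insertions and deletions); the flat pseudocount and the toy alignment are assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Per-column residue probabilities from equal-length aligned sequences.

    Gap characters ('-') are ignored. Assumes only the 20 standard residues
    and '-' occur. A flat pseudocount keeps unseen residues above zero.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment if seq[i] != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in ALPHABET})
    return profile

alignment = ["CAEE", "CAEE", "CGEE", "CAED"]  # made-up toy alignment
profile = build_profile(alignment)
print(round(profile[0]["C"], 3))  # fully conserved column
print(round(profile[1]["A"], 3), round(profile[1]["G"], 3))  # variable column
```

Conserved columns end up with one dominant residue, variable columns with a flatter distribution, which is exactly the position-specific information a single substitution matrix cannot express.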
PSI-BLAST
• Altschul et al., 1997
• Easier to use than HMMER
– Just go to NCBI BLAST page
– Relatively fast, a bit less accurate
– Alignment never exceeds the length of the seed protein
• Local alignments are used instead of global
JACKHMMER!
Jackhmmer and PSI-BLAST are iterative, sequence-profile-based search procedures. Each iteration adds new sequences to the profile, which allows more distant homologs to be detected.
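The iterative idea can be sketched with a toy, ungapped profile search. This is a strongly simplified stand-in for what PSI-BLAST and jackhmmer do: all sequences are assumed to have equal length and to be already aligned, the background is uniform, and the sequences, pseudocount and threshold are made up.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(seqs, pseudocount=0.5):
    """Column-wise residue probabilities with a flat pseudocount."""
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in ALPHABET})
    return profile

def log_odds(seq, profile):
    """Sum of log2(P(residue | column) / uniform background)."""
    return sum(math.log2(col[aa] * len(ALPHABET)) for aa, col in zip(seq, profile))

def iterative_search(query, database, threshold, max_iterations=5):
    """Search, add the hits to the profile, and repeat until no new hits appear."""
    members = [query]
    for _ in range(max_iterations):
        profile = build_profile(members)
        hits = [s for s in database if log_odds(s, profile) >= threshold]
        new_members = [query] + [s for s in hits if s != query]
        if set(new_members) == set(members):
            break  # converged: this round found nothing new
        members = new_members
    return members

query = "ACDEFGHI"
close, distant, unrelated = "ACDEFGKL", "ACDQFWKL", "WYWYWYWY"
database = [close, distant, unrelated]

print(iterative_search(query, database, threshold=7, max_iterations=1))
print(iterative_search(query, database, threshold=7))
```

In a single round only the close homolog passes the threshold. Once it is part of the profile, the distant homolog scores high enough to be picked up in the second round. The same mechanism also means that a false positive added in one round can attract further false positives later, so thresholds should be chosen conservatively.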
Comparison of various homology search techniques in terms of sensitivity (“number of homologues detected”) and selectivity (“number of non-homologues detected”)
SAM-T98 = HMM
ISS = Intermediate Sequence Search
After
1) sequence vs. sequence
2) sequence vs. profile
we could also search
3) profile vs. profile
• Compass, HHsearch
Going the extra mile: profile vs. profile
HHsearch
Combining distant homology with orthology: main issues
• We cannot really make trees
• We have a shortage of benchmarks
BCAT1 -MKDCSNG-------CSAECTGEGGSKEVVGTFKAKDLIVTPATILKEKPDPNN-LVFGT
BCAT2 -MAAAALGQIWARKLLSVPWLLCGPRRYASSSFKAADLQLEMTQKPHKKPGPGEPLVFGK
BAT1 MLQRHSLK----------LGKFSIRTLATGAPLDASKLKITRNPNP-SKPRPNEELVFGQ
BAT2 ---------------------------MTLAPLDASKVKITTTQHA-SKPKPNSELVFGK
BCAT-leishmania MLLSRRWH----------QASAARGSRAPVVSFTAAALTKTLVADPPPLP-PMKGVAFGT
.: * : * * . :.**
In contrast to orthologs, paralogs do have different subcellular locations
The branched chain aminotransferase (BCAT) loses mitochondrial localization after gene duplications
yeast  human  mitochondrial localization  Human gene description
ARG3   OTC    human only                  Ornithine carbamoyltransferase
PRO2   P5CS   human only                  Delta 1-pyrroline-5-carboxylate synthetase
CAR2   OAT    human only                  Ornithine aminotransferase
PRX1   PRDX6  yeast only                  Peroxiredoxin-6 (EC 1.11.1.15)
Relocalization of mitochondrial proteins (1-1 orthologs) between H.sapiens and
S.cerevisiae is rare.
Szklarczyk and Huynen, Genome Biology 2009
4 Relocalizations of 1-1 orthologs among 146 1-1 orthologous pairs with
experimentally confirmed cellular localization in both H.sapiens and yeast
…..Benchmarking Orthology…..
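(Flowchart: a query sequence (MLVTYC...) passes through a BLAST phase, a PSI-BLAST phase and an HMM phase, for which the HMM is retrieved. Each phase asks “hit?” and then “reciprocal?”. Yes to both gives a best bi-directional hit (putative ortholog); a no passes the query on to the next phase. If the last phase also fails: no ortholog found. Databases shown: seq db (human, yeast), seq db (nr), HMM db.)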
Sensitive orthology prediction pipeline
Precompiled human profiles provided by
Johannes Soeding. (HHsearch, Bioinf. 2005)
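The control flow of such a pipeline can be sketched as a cascade of best bi-directional hit tests. The search functions here are hypothetical stand-ins backed by lookup tables with made-up identifiers; in a real pipeline they would wrap BLAST, PSI-BLAST and an HMM search. The sketch assumes that a non-reciprocal hit is treated like a missing hit and passed on to the next phase.

```python
from typing import Callable, Optional, Sequence, Tuple

# A search function takes (query_id, target_proteome) and returns the id of
# the best hit in that proteome, or None. Hypothetical interface.
SearchFn = Callable[[str, str], Optional[str]]

def find_ortholog(query: str, source: str, target: str,
                  phases: Sequence[Tuple[str, SearchFn]]) -> Optional[Tuple[str, str]]:
    """Return (phase name, ortholog id) for the first phase that yields a
    best bi-directional hit, or None if no phase does."""
    for phase_name, search in phases:
        hit = search(query, target)
        if hit is None:
            continue                      # no hit: try the next phase
        if search(hit, source) == query:  # reciprocal search back
            return phase_name, hit        # best bi-directional hit
    return None

def table_search(table) -> SearchFn:
    """Build a toy search function from a {(query, proteome): best_hit} table."""
    return lambda query, proteome: table.get((query, proteome))

# Made-up identifiers: y* are yeast proteins, h* are human proteins.
phases = [
    ("BLAST", table_search({("y1", "human"): "h1", ("h1", "yeast"): "y1",
                            ("y2", "human"): "h9", ("h9", "yeast"): "y7"})),
    ("PSI-BLAST", table_search({("y2", "human"): "h2", ("h2", "yeast"): "y2"})),
    ("HMM", table_search({("y3", "human"): "h3", ("h3", "yeast"): "y3"})),
]

for protein in ("y1", "y2", "y3", "y4"):
    print(protein, find_ortholog(protein, "yeast", "human", phases))
```

Running the cheap, specific method first and falling back to the more sensitive ones only when needed keeps the easy cases fast and records which phase produced each putative ortholog.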
Profile based methods add significant amounts of putative orthology relations
PSI-BLAST (55 bb-hits) and HHsearch (83 bb-hits) together add 138 orthology relations to the BLAST results (460 bb-hits), about 30% on top of BLAST and 23% of the total of 598 orthologs shared between S. pombe + S. cerevisiae mitochondria and H. sapiens
Benchmark based on the fraction of proteins with a conserved mitochondrial location
between fungi (S.cerevisiae and S.pombe) and Mammals (H.sapiens and M.musculus)
…..Benchmarking Orthology…..
(as Orthology is about evolutionary history this is kind of hard)
Predicted human complex III assembly factors based on orthology with yeast proteins (one success story)
Yeast gene  Description           Phase  Human gene  Targeting signal  Mito. localization  OXPHOS co-expression
Cbp3        Complex III assembly  HMM    UQCC1       No                +                   0.93
Cbp4        Complex III assembly  HMM    UQCC3       No                +1                  0.63
Cbp6        Complex III assembly  HMM    UQCC2       No                ND                  0.63
Tucker EJ, Wanschers BF, Szklarczyk R et al., PLoS Genetics 2013
NB: Yeast complex III assembly factors Cbp1,Cbp2,
Cbp7 and Cbp8 that interact with cytochrome b
mRNA do not have human orthologs
UQCC3, the human ortholog of yeast complex III assembly
factor CBP4, was discovered by our profile-based orthology
prediction pipeline
Hildenbeutel et al, JCB 2014
CBP4 associates with Cbp3,Cbp6 and cytochrome b in S. cerevisiae
UQCC3 is mutated (homozygous) in a patient with lactic
acidosis, hypoglycemia, hypotonia and delayed development.
The mutation replaces a conserved, hydrophobic residue in the
transmembrane helix with a charged residue.
            complex I       complex II      complex III       complex IV        complex V
fibroblast  261 (163-599)a  596 (335-888)a  385 (570-1338)a   500 (288-954)a    693 (193-819)a
muscle      81 (84-273)a    438 (229-593)b  134 (1020-2530)b  1336 (520-2080)a  NDc
a) mU per U citrate synthase. b) mU per U cytochrome c oxidase. c) not determined.
Patient cells/tissue have: low complex III activity and lower presence of the holocomplex and its constituent proteins (UQCRC1,2,FS1). UQCC3 itself is absent, but assembly factors UQCC1 and UQCC2 appear unaffected
Mitochondrial translation assay shows a specific decrease in
cytochrome b stability/synthesis in patient cells, consistent
with what we would expect based on CBP4’s function.
UQCC3 depends on UQCC1 and UQCC2 but UQCC1/2 do not
depend on UQCC3, suggesting UQCC3 functions in complex III
assembly downstream of UQCC1 and UQCC2
UQCC3 is, like CBP4, involved in complex
III assembly
• A mutation in UQCC3 leads to a severe
complex III deficiency (and a mild reduction in
complex I activity)
• UQCC3 appears involved in cytochrome b
synthesis or stability
• Mitochondrial inner membrane localization
and membrane topology are consistent with
CBP4
• UQCC3 depends on assembly factors
UQCC1 and UQCC2 and likely functions
downstream of them