Homology - biosb.nl

Homology

• Sequences are homologous when they share a common ancestor

Homology

• Why are we interested?

– Function prediction

• Homologous proteins tend to have similar functions

– Evolutionary dynamics

• Tracing the evolution of protein families

The importance of homology

• Homologous proteins tend to have similar functions

• What does “similar functions” mean?

– What is function?

• Various levels of description of function

– Phenotypic: Protein A regulates limb formation

– Cellular function: Protein A inhibits SHH signaling

– Molecular function: Protein A phosphorylates Protein B

“Similar function”

What is function?

Various levels of description:

Sequence similarity, and hence homology, has the largest relevance for molecular function. This is the aspect of protein function that is best conserved; protein sequence and structure can often be interpreted in terms of function.

Sequence similarity

Similar 3D structure

Functional similarity Evolutionary origin

Homology

Homologous sequences have a similar 3D structure and tend to have similar functions

Detecting homology

• Similarity of:

– 3D structure -> the most conserved aspect, yet few structures are available. Structures are compared and classified by “eye” (A. Murzin, SCOP) and by software packages (Dali). More info on 3D in Bioinf II.

– Sequence -> less conserved, but many sequences are available. Homology determination is mainly based on theoretical models of sequence evolution and on the likelihood that, when you compare a sequence to a database, you will find a sequence of at least that similarity.

– 3D structure similarity is used as a benchmark for detection of homology by sequence similarity.

Detecting sequence similarity

• We need a model for how to compare (align) sequences

• Evolution -> sequences change over time

– We need models that describe how homologous sequences change!

• Simplest model:

– All amino acids are equal

• Equally dissimilar, replaced at equal rates, independent of position (based on identity matrices)

– Does this accurately describe sequence evolution?

• If not, what are we missing?

Detecting sequence similarity

• A more complicated model:

– Some amino-acids are more equal than others

• We account for basic biochemical properties such as acidity, ionic charges, size (similarity matrices)

adapted from Livingstone & Barton, CABIOS, 9, 745-756, 1993
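The difference between the two models can be sketched in a few lines of Python. This is only an illustration: the residue groups and the score values (2, 1, -1) are assumptions made up for this example. They are not the Livingstone & Barton classification and not a real substitution matrix such as BLOSUM or PAM.

```python
# Hypothetical, coarse biochemical groups (an assumption for illustration only).
GROUPS = [set("DE"), set("KRH"), set("AVLIM"), set("FWY"), set("STNQ")]


def identity_score(a, b):
    """Simplest model: all amino acids are equally dissimilar."""
    return 1 if a == b else 0


def similarity_score(a, b):
    """Toy similarity model: identical 2, same biochemical group 1, else -1."""
    if a == b:
        return 2
    if any(a in group and b in group for group in GROUPS):
        return 1
    return -1


def score_alignment(x, y, score):
    """Sum the per-column scores of two equal-length, ungapped sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(score(a, b) for a, b in zip(x, y))


# D/E and L/I are conservative replacements; K and V are identical.
print(score_alignment("DKLV", "EKIV", identity_score))    # 2
print(score_alignment("DKLV", "EKIV", similarity_score))  # 6
```

Under the identity model the two conservative replacements count as plain mismatches; the similarity model still rewards them.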

E-values

• When do you know that what you found is significant?

– Theory based on “extreme value distributions”: comparing two random sequences with each other will not tend to give you a high similarity, but when you compare one sequence with a large set of sequences you will always find some high-scoring hits -> the extreme values. For your “hit” to be significant, it has to be better than those expected extreme values.

– E-value: the expected number of hits of at least that similarity if the sequence had been compared to a database of random sequences.

E-value: how many hits of a certain quality/score (e.g. the Smith-Waterman score) would you expect if you were to compare your sequence to a random database?
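The extreme-value argument can be made tangible with a small simulation. The sketch below uses a deliberately crude similarity measure (the highest number of identities over all ungapped offsets) as a stand-in for a Smith-Waterman score, and uniformly random sequences as the "random database". Both are assumptions for illustration only.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_sequence(length, rng):
    """Random sequence with uniform amino acid frequencies (an assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def best_ungapped_score(x, y):
    """Toy similarity: highest number of identities over all ungapped offsets."""
    best = 0
    for offset in range(-len(y) + 1, len(x)):
        start = max(0, offset)
        end = min(len(x), offset + len(y))
        matches = sum(1 for i in range(start, end) if x[i] == y[i - offset])
        best = max(best, matches)
    return best


rng = random.Random(0)
query = random_sequence(50, rng)
database = [random_sequence(50, rng) for _ in range(500)]
scores = [best_ungapped_score(query, seq) for seq in database]

# The best score seen so far can only stay equal or go up as the database grows.
for size in (1, 10, 100, 500):
    print(size, max(scores[:size]))
```

Any single random comparison scores low, but the maximum over many comparisons creeps upward. A real hit has to beat that maximum to be called significant.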

E-values and how they are calculated

• E-value: expected number of hits with at least a given score in a random sequence database

• E-value = K × m × n × e^(−λS)

– m: length of the query

– n: total length of all sequences in the database

– S: similarity score of the alignment based on the substitution matrix

– K and λ are scaling parameters that depend on the scoring system and on the composition of the database that is used
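A minimal sketch of this formula in Python. The default values for K and λ are illustrative assumptions, of the order used for gapped protein searches. They are not taken from the lecture, and a real search tool reports its own values.

```python
import math


def e_value(score, query_len, db_len, k=0.041, lam=0.267):
    """E = K * m * n * exp(-lambda * S).

    k and lam are assumed, illustrative scaling parameters.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# Same alignment score, same query, two database sizes.
small_db = e_value(score=60, query_len=250, db_len=1_000_000)
large_db = e_value(score=60, query_len=250, db_len=100_000_000)
print(small_db, large_db)

# A higher score lowers the E-value exponentially.
print(e_value(score=80, query_len=250, db_len=100_000_000))
```

The E-value scales linearly with the query length and the database size, and drops exponentially with the score, so the same alignment becomes less significant in a larger database.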

E-values and how they are calculated

• E-value = K × m × n × e^(−λS)

– A longer query sequence increases the chance that some part of it is found in a random database

– The chance of a hit increases with database size

– A low alignment score S gives a high E-value

• Short sequence: more likely to occur by chance in a database

• Bad alignment: more sequences will reach a similar score, even though they are vastly different

– The lower the E-value, the more significant your hit!

• E-values change when you use a different database. Beware!

How do we know for sure that significantly similar sequences are truly homologous (aside from the statistical argument)?

• Experimental benchmarking of E-values by comparisons of 3D structures (e.g. Brenner et al., 1996), where we “know” what is homologous and what is not.

• 3D structural similarity evolves at a lower rate than sequence similarity and is used to test the quality of the statistics

[Figure: 3D structures, panels labelled TGAa and EGF]

How do we judge how good these methods are (“benchmark”)?

1. You take a set of sequences that you know to be homologous or not (based on their 3D structure)

2. You compare these sequences with each other using e.g. Smith-Waterman

3. You sort the results of the comparisons based on some score (e.g. the % identity) between the sequences, the highest scoring sequence pair on top

4. Now you can judge how well the score separates homologous from non-homologous sequences
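The four steps above can be sketched as follows. The scored pairs are invented toy data, and "coverage at a fixed number of errors" is one common way to summarize such a ranking. It is not necessarily the exact measure used in the benchmark cited in this lecture.

```python
def errors_vs_coverage(pairs):
    """pairs: list of (score, is_homologous).

    Walk down the list sorted by score (best on top) and record, for each
    cut-off, the number of true and false positives so far.
    """
    ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
    true_pos = false_pos = 0
    curve = []
    for score, is_homologous in ranked:
        if is_homologous:
            true_pos += 1
        else:
            false_pos += 1
        curve.append((score, true_pos, false_pos))
    return curve


def coverage_at_errors(pairs, max_false_positives):
    """Fraction of homologous pairs recovered before exceeding the error limit."""
    total = sum(1 for _, is_homologous in pairs if is_homologous)
    best_true_pos = 0
    for _, true_pos, false_pos in errors_vs_coverage(pairs):
        if false_pos > max_false_positives:
            break
        best_true_pos = true_pos
    return best_true_pos / total if total else 0.0


# Toy data: (score, homologous according to 3D structure?)
toy_pairs = [(95, True), (80, True), (40, False), (35, True), (30, False), (20, True)]
print(coverage_at_errors(toy_pairs, max_false_positives=0))  # 0.5
print(coverage_at_errors(toy_pairs, max_false_positives=1))  # 0.75
```

A score that separates homologs from non-homologs well reaches high coverage while the number of false positives is still low.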

Benchmarking homology detection with the Smith-Waterman algorithm, using 3D structures (PDB40) as the “gold standard” for what is homologous and what is not. Use those E-values!

Sequence similarity vs homology

• Sequences that are not significantly similar do not have to be non-homologous

• BolA (red) and OsmC (green) have no significant similarity at the sequence level, but are significantly similar at the 3D level.

Increasing search sensitivity

• So to optimally “describe” the protein we are searching for, we may want to use

– Information on allowed amino acid substitutions per position

– Information on where insertions and deletions can occur

• In short: we need to make a profile of the protein we are searching for

– “Profile based searches”

– 2 to 3 fold increase in sensitivity

The level of conservation in sequence alignments varies considerably; one would like to exploit that in homology detection.

Part 2

Profile based searches

• Position Specific Iterated BLAST (PSI-BLAST)

• Hidden Markov Models

– Rather than a single substitution matrix that is the same for all positions, these methods apply one for each position, as well as position-specific gap penalties

– Positions are regarded as independent, though

Profile based searches

• Building a mathematical, probabilistic model that “generates” our protein domain allows us to assess the probability that any sequence of interest has been generated by any specific model.

        Pos. 1   Pos. 2   Pos. 3   Pos. 4
P(A)    0.01     0.3      0.05     0.01
P(C)    0.8      0.01     0.01     0.01
P(E)    0.1      0.02     0.4      0.3
Etc.

A very simple Hidden Markov Model (no insertions/deletions)
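As a sketch of how such a model scores a sequence: because positions are independent and there are no insertions or deletions, the probability that the model "generates" a sequence is the product of the per-position probabilities. The code below uses the values shown on the slide, assuming the four blocks correspond to positions 1 to 4 in order. Only A, C and E are listed on the slide, so other residues are not covered. The uniform background of 1/20 is also an assumption.

```python
import math

# Emission probabilities from the slide; block-to-position order is assumed.
PROFILE = [
    {"A": 0.01, "C": 0.8, "E": 0.1},
    {"A": 0.3, "C": 0.01, "E": 0.02},
    {"A": 0.05, "C": 0.01, "E": 0.4},
    {"A": 0.01, "C": 0.01, "E": 0.3},
]
BACKGROUND = 1.0 / 20  # assumed uniform background frequency


def sequence_probability(seq, profile=PROFILE):
    """Probability that the profile generates seq (no insertions/deletions)."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must match the profile length")
    probability = 1.0
    for residue, column in zip(seq, profile):
        probability *= column[residue]  # KeyError for residues not on the slide
    return probability


def log_odds(seq, profile=PROFILE, background=BACKGROUND):
    """Log2 ratio of the profile probability to the random background."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must match the profile length")
    return sum(math.log2(column[r] / background) for r, column in zip(seq, profile))


print(sequence_probability("CAEE"))  # 0.8 * 0.3 * 0.4 * 0.3
print(log_odds("CAEE"), log_odds("AAAA"))
```

A sequence that follows the preferred residue at each position gets a positive log-odds score; one that does not scores well below zero.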

[Figure: A slightly more complicated Hidden Markov Model, with match (M), insertion (I) and deletion (D) states]

Making an HMM

• Get all obvious homologs, align them

– Use this as input for your model
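As a rough sketch of what "using the alignment as input" means: the per-position probabilities can be estimated by counting residues in each alignment column. The toy alignment and the simple add-one pseudocount below are assumptions for illustration. Real tools such as HMMER do considerably more, for example sequence weighting, more refined priors, and explicit insertion and deletion states.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Estimate per-column amino acid probabilities from aligned sequences.

    Gap characters are ignored. The pseudocount keeps unseen residues from
    getting probability zero.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment if seq[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


toy_alignment = ["CAEE", "CAEE", "CGEE", "CAED"]  # hypothetical homologs
toy_profile = build_profile(toy_alignment)
print(round(toy_profile[0]["C"], 3))  # fully conserved column
print(round(toy_profile[1]["A"], 3), round(toy_profile[1]["G"], 3))
```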

Making an HMM

• Software package HMMER

– Eddy et al., 1998

– Creates the model based on an alignment and allows you to search large sequence databases for matches

PSI-BLAST

• Altschul et al., 1997

• Easier to use than HMMER

– Just go to NCBI BLAST page

– Relatively fast, a bit less accurate

– Alignment never exceeds length of seed protein

• Local alignments are used instead of global

JACKHMMER !

Jackhmmer and PSI-BLAST are iterative, sequence-profile-based search procedures. Each iteration adds new sequences to the profile, which allows more distant homologs to be detected.
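The iteration can be sketched with a toy model. This is not how PSI-BLAST or jackhmmer are implemented: the sketch assumes ungapped sequences of equal length, a simple pseudocount profile, a uniform background and a fixed log-odds inclusion threshold, all chosen for illustration. The real programs work with alignments, gaps and E-value thresholds.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seqs, pseudocount=1.0):
    """Per-position amino acid probabilities from equal-length sequences."""
    profile = []
    for i in range(len(seqs[0])):
        counts = Counter(seq[i] for seq in seqs)
        total = len(seqs) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


def log_odds(seq, profile):
    """Log2 odds of seq under the profile versus a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    return sum(math.log2(col[r] / background) for r, col in zip(seq, profile))


def iterative_search(seed, database, threshold, max_iterations=5):
    """Rebuild the profile from all hits until no new sequences are added."""
    members = [seed]
    for _ in range(max_iterations):
        profile = build_profile(members)
        hits = [seq for seq in database if log_odds(seq, profile) >= threshold]
        new_members = [seed] + [seq for seq in hits if seq != seed]
        if set(new_members) == set(members):
            break  # converged: this iteration added nothing new
        members = new_members
    return members


toy_database = ["CAED", "CGED", "WWWW"]  # hypothetical sequences
print(iterative_search("CAEE", toy_database, threshold=2.0, max_iterations=1))
print(iterative_search("CAEE", toy_database, threshold=2.0))
```

In this toy run, CGED scores below the threshold against the seed alone, but is picked up once CAED has been added to the profile.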

[Figure: panels labelled HMM 1 and HMM 2]

Comparison of various homology search techniques in terms of sensitivity (“number of homologs detected”) and selectivity (“number of non-homologs detected”). SAM-T98 = HMM; ISS = Intermediate Sequence Search.

After 1) sequence vs. sequence and 2) sequence vs. profile, we could also search 3) profile vs. profile

• Compass, HHsearch

Going the extra mile: profile vs. profile

HHsearch

Combining distant homology with orthology: main issues

• We cannot really make trees

• We have a shortage of benchmarks

BCAT1 -MKDCSNG-------CSAECTGEGGSKEVVGTFKAKDLIVTPATILKEKPDPNN-LVFGT

BCAT2 -MAAAALGQIWARKLLSVPWLLCGPRRYASSSFKAADLQLEMTQKPHKKPGPGEPLVFGK

BAT1 MLQRHSLK----------LGKFSIRTLATGAPLDASKLKITRNPNP-SKPRPNEELVFGQ

BAT2 ---------------------------MTLAPLDASKVKITTTQHA-SKPKPNSELVFGK

BCAT-leishmania MLLSRRWH----------QASAARGSRAPVVSFTAAALTKTLVADPPPLP-PMKGVAFGT

.: * : * * . :.**

In contrast to orthologs, paralogs do have different subcellular locations

The branched chain aminotransferase (BCAT) loses mitochondrial localization after gene duplications

yeast   human   mitochondrial localization   Human gene description
ARG3    OTC     human only                   Ornithine carbamoyltransferase
PRO2    P5CS    human only                   Delta 1-pyrroline-5-carboxylate synthetase
CAR2    OAT     human only                   Ornithine aminotransferase
PRX1    PRDX6   yeast only                   Peroxiredoxin-6 (EC 1.11.1.15)

Relocalization of mitochondrial proteins (1-1 orthologs) between H. sapiens and S. cerevisiae is rare.

Szklarczyk and Huynen, Genome Biology 2009

4 relocalizations of 1-1 orthologs among 146 1-1 orthologous pairs with experimentally confirmed cellular localization in both H. sapiens and yeast

…..Benchmarking Orthology…..

[Figure: Sensitive orthology prediction pipeline. A query sequence (MLVTYC...) passes through up to three search phases: BLAST phase, PSI-BLAST phase and HMM phase (retrieve the HMM). Databases shown: seq db (human, yeast), seq db (nr) and HMM db. Each phase asks "hit?" and then "reciprocal?". If both are "yes", the result is a best bi-directional hit (putative ortholog). If either is "no", the next phase is tried. If no phase succeeds: no ortholog found.]

Precompiled human profiles provided by Johannes Soeding (HHsearch, Bioinf. 2005)
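The control flow of such a pipeline (try a fast search first, accept only reciprocal best hits, fall back to a more sensitive search otherwise) can be sketched as below. The two scoring functions are toy stand-ins invented for this example, not BLAST, PSI-BLAST or HHsearch, and the proteomes are hypothetical.

```python
def best_hit(query_seq, targets, score):
    """Name of the best-scoring target, or None if nothing scores above 0."""
    best_name, best_score = None, 0
    for name, seq in targets.items():
        current = score(query_seq, seq)
        if current > best_score:
            best_name, best_score = name, current
    return best_name


def find_ortholog(query_name, proteome_a, proteome_b, phases):
    """Return (phase name, ortholog) for the first best bi-directional hit."""
    query_seq = proteome_a[query_name]
    for phase_name, score in phases:
        hit = best_hit(query_seq, proteome_b, score)
        if hit is None:
            continue  # no hit: try the next, more sensitive phase
        if best_hit(proteome_b[hit], proteome_a, score) == query_name:
            return phase_name, hit  # reciprocal: putative ortholog
    return None  # no ortholog found


def identity(x, y):
    """Toy stand-in for a sequence-vs-sequence search."""
    return sum(a == b for a, b in zip(x, y))


REDUCED = {"D": "-", "E": "-", "K": "+", "R": "+", "I": "h", "L": "h", "V": "h"}


def reduced_identity(x, y):
    """Toy stand-in for a more sensitive search: match on residue classes."""
    return sum(REDUCED.get(a, a) == REDUCED.get(b, b) for a, b in zip(x, y))


yeast = {"Y1": "DKLV", "Y2": "MCWG"}  # hypothetical proteomes
human = {"H1": "ERIL", "H2": "MCWG"}
phases = [("identity", identity), ("reduced alphabet", reduced_identity)]
print(find_ortholog("Y2", yeast, human, phases))
print(find_ortholog("Y1", yeast, human, phases))
```

Y2 is resolved in the first phase, while Y1 only finds its reciprocal partner in the more sensitive second phase. This mirrors how the profile-based phases add orthology relations on top of the BLAST results.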

Profile-based methods add significant amounts of putative orthology relations

[Figure: best bi-directional hits per method: BLAST 460, PSI-BLAST 55, HHsearch 83]

PSI-BLAST (55 bb-hits) and HHsearch (83 bb-hits) add 138 orthology relations to the BLAST results (460 bb-hits), about 23% of the resulting total of 598 orthologs shared between S. pombe + S. cerevisiae mitochondria and H. sapiens

Benchmark based on the fraction of proteins with a conserved mitochondrial location between fungi (S. cerevisiae and S. pombe) and mammals (H. sapiens and M. musculus)

…..Benchmarking Orthology…..

(as Orthology is about evolutionary history this is kind of hard)

Predicted human complex III assembly factors based on orthology with yeast proteins (one success story)

Yeast gene   Description            Phase   Human gene   Targeting signal   Mito. localization   OXPHOS co-expression
Cbp3         Complex III assembly   HMM     UQCC1        No                 +                    0.93
Cbp4         Complex III assembly   HMM     UQCC3        No                 +1                   0.63
Cbp6         Complex III assembly   HMM     UQCC2        No                 ND                   0.63

Tucker EJ, Wanschers BF, Szklarczyk R et al., PLoS Genetics 2013


NB: Yeast complex III assembly factors Cbp1, Cbp2, Cbp7 and Cbp8 that interact with cytochrome b mRNA do not have human orthologs

UQCC3, the human ortholog of yeast complex III assembly factor CBP4, was discovered by our profile-based orthology prediction pipeline

Hildenbeutel et al., JCB 2014

CBP4 associates with Cbp3, Cbp6 and cytochrome b in S. cerevisiae

UQCC3 is mutated (homozygous) in a patient with lactic acidosis, hypoglycemia, hypotonia and delayed development. The mutation replaces a conserved, hydrophobic residue in the transmembrane helix with a charged residue.

             complex I        complex II       complex III        complex IV         complex V
fibroblast   261 (163-599)a   596 (335-888)a   385 (570-1338)a    500 (288-954)a     693 (193-819)a
muscle       81 (84-273)a     438 (229-593)b   134 (1020-2530)b   1336 (520-2080)a   NDc

a) mU per U citrate synthase. b) mU per U cytochrome c oxidase. c) not determined.

Patient cells/tissue have low complex III activity and lower presence of the holocomplex and its constituent proteins (UQCRC1,2,FS1). UQCC3 itself is absent, but assembly factors UQCC1 and UQCC2 appear unaffected

Mitochondrial translation assay shows a specific decrease in cytochrome b stability/synthesis in patient cells, consistent with what we would expect based on CBP4’s function.

UQCC3 depends on UQCC1 and UQCC2 but UQCC1/2 do not depend on UQCC3, suggesting UQCC3 functions in complex III assembly downstream of UQCC1 and UQCC2

UQCC3 is, like CBP4, involved in complex III assembly

• A mutation in UQCC3 leads to a severe complex III deficiency (and a mild reduction in complex I activity)

• UQCC3 appears involved in cytochrome b synthesis or stability

• Mitochondrial inner membrane localization and membrane topology are consistent with CBP4

• UQCC3 depends on assembly factors UQCC1 and UQCC2 and likely functions downstream of them
