Manual alignment

• Difficult for remote homologies

• Slow

• Low reproducibility

• Allows information not contained in the sequence to be taken into account

“Equivalent” amino acids

Hydrophobic: Ala (A), Val (V), Met (M), Leu (L), Ile (I), Phe (F), Trp (W), Tyr (Y)
Small: Gly (G), Ala (A), Ser (S)
Polar: Ser (S), Thr (T), Asn (N), Gln (Q), Tyr (Y)
Charged: Asp (D), Glu (E) / Lys (K), Arg (R)
Hard to replace (special function): Gly (G), Pro (P), Cys (C), His (H)

Polar and charged residues are equivalent on the surface.
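The equivalence groups above can be turned into a simple substitution check. This is a minimal sketch, not a scoring matrix: the function name `is_equivalent` is hypothetical, the acidic and basic residues are treated as two separate groups (one reading of the "/" in the table), and "hard to replace" residues are assumed to be equivalent only to themselves.

```python
# Groups copied from the table; a residue may belong to several groups.
GROUPS = {
    "hydrophobic": set("AVMLIFWY"),
    "small": set("GAS"),
    "polar": set("STNQY"),
    "acidic": set("DE"),   # assumption: D/E and K/R are separate groups
    "basic": set("KR"),
}
# Assumption: special-function residues are never treated as replaceable.
SPECIAL = set("GPCH")


def is_equivalent(a, b):
    """Return True if replacing residue a with b looks conservative."""
    if a == b:
        return True
    if a in SPECIAL or b in SPECIAL:
        return False
    return any(a in group and b in group for group in GROUPS.values())


print(is_equivalent("L", "I"))  # both hydrophobic
print(is_equivalent("D", "K"))  # opposite charges
```

A real alignment program would use a substitution matrix rather than a yes/no test, but the groups give the same intuition at a glance.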

BCL2 Human vs BCL2 Mouse

Regions marked in the figure: BH4, BH3, BH1, PhosphoSer

>sp|P10417|BCL2_MOUSE Apoptosis regulator Bcl-2
Length=236
Score = 429 bits (1103), Expect = 6e-119, Method: Composition-based stats.
Identities = 214/239 (89%), Positives = 217/239 (90%), Gaps = 3/239 (1%)

Query  MAHAGRTGYDNREIVMKYIHYKLSQRGYEWDAGDVGAAPPGAAPAPGIFSSQPGHTPHPA  60
       MA AGRTGYDNREIVMKYIHYKLSQRGYEWDAGD  AAP GAAP PGIFS QP   P PA
Sbjct  MAQAGRTGYDNREIVMKYIHYKLSQRGYEWDAGDADAAPLGAAPTPGIFSFQPESNPMPA  60

Query  ASRDPVARTSPLQTPAAPGAAAGPALSPVPPVVHLTLRQAGDDFSRRYRRDFAEMSSQLH  120
         R+  ARTSPL+   A    AGPALSPVPP VHLTLR+AGDDFSRRYRRDFAEMSSQLH
Sbjct  VHREMAARTSPLRPLVA---TAGPALSPVPPCVHLTLRRAGDDFSRRYRRDFAEMSSQLH  117

Query  LTPFTARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEY  180
       LTPFTARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEY
Sbjct  LTPFTARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEY  177

Query  LNRHLHTWIQDNGGWDAFVELYGPSMRPLFDFSWLSLKTLLSLALVGACITLGAYLGHK  239
       LNRHLHTWIQDNGGWDAFVELYGPSMRPLFDFSWLSLKTLLSLALVGACITLGAYLGHK
Sbjct  LNRHLHTWIQDNGGWDAFVELYGPSMRPLFDFSWLSLKTLLSLALVGACITLGAYLGHK  236

BCL-2 vs BCL-X

>emb|CAA57886.1| bcl-x [Rattus norvegicus]
Length=233
Score = 172 bits (435), Expect = 2e-41, Method: Composition-based stats.
Identities = 93/199 (46%), Positives = 116/199 (58%), Gaps = 13/199 (6%)

Query  NREIVMKYIHYKLSQRGYEW----DAGDVGAAPPGAAPAPGIFSSQPGHTPHPAASRDPV  66
       N+E+V+ ++ YKLSQ+GY W    D  +     P          S     P    +  P
Sbjct  NQELVVDFLSYKLSQKGYSWSQFSDVEENRTEAPEETEPERETPSAINGNPSWHLADSPA  64

Query  ARTSPLQTPAAPGAAAGPALSPVPPV--VHLTLRQAGDDFSRRYRRDFAEMSSQLHLTPF  124
                 A G ++      V P+  V   LR+AGD+F  RYRR F++++SQLH+TP
Sbjct  VN-------GATGHSSSLDAREVIPMAAVKQALREAGDEFELRYRRAFSDLTSQLHITPG  117

Query  TARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEYLNRH  184
       TA   F  VV ELFRDGVNWGRIVAFF FGG +CVESV++EM  LV  IA WM  YLN H
Sbjct  TAYQSFEQVVNELFRDGVNWGRIVAFFSFGGALCVESVDKEMQVLVSRIASWMATYLNDH  177

Query  LHTWIQDNGGWDAFVELYG  203
       L  WIQ+NGGWD FV+LYG
Sbjct  LEPWIQENGGWDTFVDLYG  196
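The "Identities" and "Gaps" figures in a BLAST report can be recomputed from the two aligned strings. The sketch below is illustrative only: `alignment_stats` is a hypothetical helper, it is applied to the last block of the BCL-2 vs BCL-X alignment, and it does not compute "Positives", which would need a substitution matrix.

```python
def alignment_stats(query, subject):
    """Count identities and gapped columns in one pairwise alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    pairs = list(zip(query, subject))
    # A column is an identity when both residues match and neither is a gap.
    identities = sum(q == s and q != "-" for q, s in pairs)
    # A column is gapped when either sequence has a gap character.
    gaps = sum(q == "-" or s == "-" for q, s in pairs)
    length = len(pairs)
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        # Percentage over the alignment length, gaps included.
        "identity_pct": round(100 * identities / length),
    }


# Last block of the BCL-2 (query) vs BCL-X (subject) alignment.
stats = alignment_stats("LHTWIQDNGGWDAFVELYG", "LEPWIQENGGWDTFVDLYG")
print(stats)
```

Summing the per-block counts over all blocks of an alignment gives the totals printed in the report header.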

Regions marked in the figure: BH4, BH3, BH1, PhosphoSer

Multiple alignment & functional prediction

Aim: Predict function from sequence

General strategies

Group sequences in a “family” (BLAST, PFAM)

Recognize short sequence fragments related to functional or structural features

Multiple alignment

• Global alignment of more than two sequences
• Makes it possible to take into account the importance of each position in the alignment
• Makes it possible to define conserved residues:
    • Residues with functional or structural importance
    • Tree determinants
    • Correlated mutations

Multiple alignments

Low similarity, only two sequences:

AVTTGLNMWTTAKRPGMDDFYTILLPGLMNCI
GLFTAIDMHFFGRKPACEEYFTLVVDGLCNCI

Low similarity, multiple sequences:

GIFTTDIDMMHFYVKKPPGLDEFFTTLVLRTLLCMAA
ALTTGIDMMWTTAKRPPDMDDYYTIIIPGLLMNCI
AVTTGLNMMWTTAKRPPGMDDFYTTILLPGLLMNCI
GVTTTGLNMMYFTARRPPGLDEFYTTLVLRTLLCMCL
GIFTTDIDMMHFYVKKPPGLDEFFTTLVLRTLLCMAA
AVTTGLNMMWTTAKRPPGMDDFYTTILLPGLLMNCI
GLFTTALNMMHFFGRKPPACEEYFTTLVVDGLLCNCI

Tree determinants

• Define subfamilies
• Relevant to phylogeny

S1 GIFTTDIDMMHFYVKKPPGLDEFFTTLVLRTLLCMAA
S2 ALTTGIDMMWTTAKRPPDMDDYYTIIIPGLLMNCI
S3 AVTTGLNMMWTTAKRPPGMDDFYTTILLPGLLMNCI
S4 GVTTTGLNLLYFTARRPP--DEFYSS-VLRTLLCMCL
S5 GIFTTDIDLLHFYVKKPP--DEFFSSLVLRTLLCMAA
S6 AVTTGLNLLWTTAKRPP--DDFYSSILLPGLLMNCI
S7 GLFTTALNLLHFFGRKPP--EEYFSSLVVDGLLCNCI
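One simple way to look for tree determinants is to find columns that are conserved inside each subfamily but carry different residues in different subfamilies. The sketch below assumes the subfamilies are already known and the sequences are aligned to equal length; the toy alignment and the name `tree_determinants` are invented for illustration.

```python
from typing import Dict, List


def tree_determinants(subfamilies: Dict[str, List[str]]) -> List[int]:
    """Return 0-based columns conserved within every subfamily but not across them."""
    all_seqs = [seq for seqs in subfamilies.values() for seq in seqs]
    length = len(all_seqs[0])
    if any(len(seq) != length for seq in all_seqs):
        raise ValueError("all sequences must be aligned to the same length")

    determinants = []
    for col in range(length):
        consensus = []
        for seqs in subfamilies.values():
            residues = {seq[col] for seq in seqs}
            if len(residues) != 1:
                break  # this column varies inside a subfamily
            consensus.append(residues.pop())
        else:
            # Conserved in every subfamily: keep it only if subfamilies differ.
            if len(set(consensus)) > 1:
                determinants.append(col)
    return determinants


# Toy alignment (invented): columns 2 and 3 are M in family A, L in family B.
toy = {
    "A": ["TDMMHF", "TGMMWT", "TLMMWT"],
    "B": ["TGLLYF", "TDLLHF", "TALLHF"],
}
print(tree_determinants(toy))
```

Column 0 is fully conserved across both families, so it is reported as a conserved position rather than as a tree determinant.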

Correlated mutations

• Concerted changes in two or more conserved positions
• Reveal positions of structural interaction

GIFTTDIDMMHFYVKKPPGL DEFFTTLVLRTLLCMAA
ALTTGIDMMWTTAKRPPDM DDYYTIIIRGLLMNCI
AVTTGLDMMWTTAKRPPGM DDFYTTILLRGLLMNCI
GVTTTGLDMMYFTARRPPGL DEFYTTLVLKTLLCMCL
GIFTTDIRMMHFYVKKPPGL DEFFTTLVLDTLLCMAA
AVTTGLRMMWTTAKRPPGM DDFYTTILLDGLLMNCI
GLFTTALRMMHFFGRKPPAC EEYFTTLVVEGLLCNCI

Correlated pairs: D-R and R-D
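Concerted changes such as the D-R / R-D swap can be detected by measuring how strongly two alignment columns co-vary. The sketch below uses mutual information as one possible measure; this is a swapped-in, commonly used statistic, not necessarily the method behind the slide. The toy alignment, the function names and the threshold are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    return sum(
        (count / n) * math.log2((count / n) / ((freq_a[a] / n) * (freq_b[b] / n)))
        for (a, b), count in freq_ab.items()
    )


def correlated_pairs(alignment, threshold=0.9):
    """Return (i, j, MI) for every pair of columns with MI >= threshold."""
    columns = list(zip(*alignment))
    pairs = []
    for i, j in combinations(range(len(columns)), 2):
        mi = mutual_information(columns[i], columns[j])
        if mi >= threshold:
            pairs.append((i, j, mi))
    return pairs


# Toy alignment (invented): column 1 is D when column 3 is R, and vice versa.
toy_alignment = ["ADGRL", "ADGRL", "ARGDL", "ARGDL"]
pairs = correlated_pairs(toy_alignment)
print(pairs)
```

Fully conserved columns carry no information about any other column, so they score zero and only the swapping pair is reported.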

Software

ClustalW
• Makes global pairwise alignments, building “clusters” of similar sequences

T-Coffee
• Slower than ClustalW but more precise for low-similarity sequences
• Combines global and local alignments

Profiles

Also known as position-specific score matrices (PSSMs).

They assign scores to amino acids or gaps that are specific to each sequence position.

A quantitative way to take the role of each position into account.

Example alignment (7 sequences, 10 positions):

F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
F K L L G N V L V C

Number of occurrences of each amino acid at each position:

   1  2  3  4  5  6  7  8  9 10
A  0  0  1  0  1  1  1  0  1  0
C  0  0  0  0  0  0  2  0  0  1
D  0  0  0  0  0  0  0  0  0  1
E  0  1  0  0  0  2  1  0  0  0
F  5  0  1  1  0  0  0  0  1  0
G  0  0  0  0  3  0  0  0  0  1
H  0  0  0  0  0  1  0  0  0  0
I  0  0  1  1  0  0  0  3  1  0
K  0  4  0  0  1  0  0  0  0  1
L  1  0  2  3  0  0  0  3  3  0
M  0  0  0  0  0  0  0  1  0  0
N  0  0  0  0  0  1  0  0  0  0
P  0  2  0  0  0  0  0  0  0  0
Q  0  0  0  0  0  2  0  0  0  2
R  0  0  0  0  0  0  0  0  0  0
S  0  0  0  0  2  0  0  0  0  0
T  0  0  0  0  0  0  1  0  0  0
V  0  0  2  2  0  0  2  0  1  1
W  0  0  0  0  0  0  0  0  0  0
Y  1  0  0  0  0  0  0  0  0  0
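The count table is obtained by tallying residues column by column. A minimal sketch, using the seven example sequences; the function name `count_matrix` is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The seven sequences of the example alignment, without spaces.
ALIGNMENT = [
    "FKLLSHCLLV",
    "FKAFGQTMFQ",
    "YPIVGQELLG",
    "FPVVKEAILK",
    "FKVLAAVIAD",
    "LEFISECIIQ",
    "FKLLGNVLVC",
]


def count_matrix(alignment):
    """Return {amino acid: [count at position 1, position 2, ...]}."""
    # zip(*alignment) turns the list of rows into a list of columns.
    column_counts = [Counter(column) for column in zip(*alignment)]
    return {aa: [counts[aa] for counts in column_counts] for aa in AMINO_ACIDS}


counts = count_matrix(ALIGNMENT)
for aa in AMINO_ACIDS:
    print(aa, *counts[aa])
```

Each column of the result sums to the number of sequences, which is a quick sanity check on any hand-built table.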

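Profiles, simplest calculation

$$M_{ij} = \log \frac{N_{ij} / N_S}{f_i}$$

N_{ij} / N_S: frequency of amino acid i at position j (N_{ij} occurrences among the N_S aligned sequences)

f_i: standard frequency of amino acid i

M_{ij} = 0: normal

M_{ij} > 0: position j is rich in amino acid i

M_{ij} < 0: position j is poor in amino acid i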
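The log-ratio score can be computed directly from the counts. In this sketch the standard frequency f_i is assumed to be uniform (1/20) and the logarithm is natural, since the slide specifies neither; a residue that never appears at a position gets minus infinity, which is one reason the simple formula needs the improvements discussed next.

```python
import math

# Assumption: uniform background; real profiles use observed frequencies.
BACKGROUND = 1 / 20


def log_odds(n_ij, n_s, f_i=BACKGROUND):
    """M_ij = log((N_ij / N_S) / f_i), using the natural logarithm."""
    frequency = n_ij / n_s
    if frequency == 0:
        return float("-inf")  # residue never observed at this position
    return math.log(frequency / f_i)


# First column of the example alignment: F F Y F F L F
first_column = "FFYFFLF"
scores = {
    aa: log_odds(first_column.count(aa), len(first_column))
    for aa in "FLYA"
}
print(scores)
```

F scores highest because it fills five of the seven rows, while A, which is absent from the column, cannot be scored at all.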

Profiles, improvements

Include amino acid frequency according to protein family

Also include classical similarity matrices. This allows for equivalent amino acids that do not appear in the available alignment.

Example alignment (7 sequences, 10 positions):

F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
F K L L G N V L V C

Resulting profile (score of each amino acid at each position):

    1   2   3   4   5   6   7   8   9  10
A -18 -10  -1  -8   8  -3   3 -10  -2  -8
C -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D -35   0 -32 -33  -7   6 -17 -34 -31   0
E -27  15 -25 -26  -9  23  -9 -24 -23  -1
F  60 -30  12  14 -26 -29 -15   4  12 -29
G -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I   3 -27  21  25 -29 -23  -8  33  19 -23
K -26  25 -25 -27  -6   4 -15 -27 -26   0
L  14 -28  19  27 -27 -20  -9  33  26 -21
M   3 -15  10  14 -17 -10  -9  25  12 -11
N -22  -6 -24 -27   1   8 -15 -24 -24  -4
P -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q -32   5 -25 -26  -9  24 -16 -17 -23   7
R -18   9 -22 -22 -10   0 -18 -23 -22  -4
S -22  -8 -16 -21  11   2  -1 -24 -19  -4
T -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V   0 -25  22  25 -19 -26   6  19  16 -16
W   9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y  34 -18  -1   1 -23 -12 -19   0   0 -18
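Once a profile exists, a sequence is scored by adding up, position by position, the profile value of the residue found there. The sketch below copies the profile values from the table and slides a 10-residue window along a query; gaps are not handled, and the names `window_score` and `best_window` are hypothetical.

```python
# Profile values copied from the table: one list of 10 position scores per residue.
PSSM = {
    "A": [-18, -10, -1, -8, 8, -3, 3, -10, -2, -8],
    "C": [-22, -33, -18, -18, -22, -26, 22, -24, -19, -7],
    "D": [-35, 0, -32, -33, -7, 6, -17, -34, -31, 0],
    "E": [-27, 15, -25, -26, -9, 23, -9, -24, -23, -1],
    "F": [60, -30, 12, 14, -26, -29, -15, 4, 12, -29],
    "G": [-30, -20, -28, -32, 28, -14, -23, -33, -27, -5],
    "H": [-13, -12, -25, -25, -16, 14, -22, -22, -23, -10],
    "I": [3, -27, 21, 25, -29, -23, -8, 33, 19, -23],
    "K": [-26, 25, -25, -27, -6, 4, -15, -27, -26, 0],
    "L": [14, -28, 19, 27, -27, -20, -9, 33, 26, -21],
    "M": [3, -15, 10, 14, -17, -10, -9, 25, 12, -11],
    "N": [-22, -6, -24, -27, 1, 8, -15, -24, -24, -4],
    "P": [-30, 24, -26, -28, -14, -10, -22, -24, -26, -18],
    "Q": [-32, 5, -25, -26, -9, 24, -16, -17, -23, 7],
    "R": [-18, 9, -22, -22, -10, 0, -18, -23, -22, -4],
    "S": [-22, -8, -16, -21, 11, 2, -1, -24, -19, -4],
    "T": [-10, -10, -6, -7, -5, -8, 2, -10, -7, -11],
    "V": [0, -25, 22, 25, -19, -26, 6, 19, 16, -16],
    "W": [9, -25, -18, -19, -25, -27, -34, -20, -17, -28],
    "Y": [34, -18, -1, 1, -23, -12, -19, 0, 0, -18],
}
WIDTH = 10


def window_score(window):
    """Sum the profile score of each residue at its position in the window."""
    return sum(PSSM[aa][pos] for pos, aa in enumerate(window))


def best_window(sequence):
    """Return (score, 0-based offset) of the best ungapped window."""
    return max(
        (window_score(sequence[i:i + WIDTH]), i)
        for i in range(len(sequence) - WIDTH + 1)
    )


print(window_score("FKLLSHCLLV"))     # first sequence of the alignment
print(best_window("AAFKLLSHCLLVAA"))  # same sequence embedded in a longer one
```

Note that P at position 2 scores almost as well as K even though it appears in only two sequences, which is the effect of weighting by a similarity matrix rather than by raw counts.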



Profiles

Using profiles increases the information available and makes it possible to extract “family” features, as opposed to the features of individual sequences.

HMM profiles

Statistical models (Hidden Markov Models) used to build profiles.

The model is “trained” on multiple alignments to determine evolutionary probabilities.

They provide a “theoretical machinery” for understanding sequence relationships on a quantitative basis.

The twilight zone

Sequence identity below 25%

Structural similarity exists, but it is difficult to identify with standard methods

PSI-BLAST

BLAST search based on profiles built dynamically:

1. Standard BLAST search
2. Build a position-specific score matrix (PSSM) from the alignment
3. New search against the profile
4. Repeat until self-consistency
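The search, build-profile, search-again loop can be illustrated with a toy stand-in. This is not BLAST: it assumes ungapped peptides of equal length, a uniform background, add-one pseudocounts and an arbitrary score threshold, and all sequences and names are invented. It only shows how a profile built from the first hits can recover a more distant sequence, and how the loop stops once the hit list no longer changes.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1 / 20  # assumption: uniform background frequency


def build_profile(seqs):
    """Per-column log-odds scores with add-one pseudocounts."""
    n = len(seqs)
    profile = []
    for column in zip(*seqs):
        profile.append({
            aa: math.log(((column.count(aa) + 1) / (n + 20)) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, seq):
    """Sum of the column scores for an ungapped, same-length sequence."""
    return sum(column[aa] for column, aa in zip(profile, seq))


def iterative_search(seed, database, threshold=2.0, max_iterations=5):
    """Repeat profile building and searching until the hit list is stable."""
    hits = [seed]
    for iteration in range(1, max_iterations + 1):
        profile = build_profile(hits)
        new_hits = [s for s in database if score(profile, s) >= threshold]
        if set(new_hits) == set(hits):
            return hits, iteration  # self-consistent: nothing changed
        hits = new_hits
    return hits, max_iterations


seed = "ACDEFGHI"
close = "ACDEFGKL"      # 6 of 8 positions identical to the seed
distant = "WYDEFPKL"    # only 3 identical to the seed, 5 to `close`
unrelated = "WWWWWWWW"
database = [seed, close, distant, unrelated]

hits, iterations = iterative_search(seed, database)
print(hits, iterations)
```

In this toy run the distant sequence scores below the threshold against the seed-only profile and is picked up only after the close homologue has been added to the profile.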

PSI-BLAST

PSI-BLAST cannot find what Blast cannot detect

E-value for inclusion/exclusion must be chosen carefully (0.01)

Sometimes the search drifts into nonsensical results

Number of iterations is usually kept small

Motifs

• Motif: a short, highly conserved sequence fragment related to some structural or functional feature
• Conserved in distant homologues because of functional or structural constraints
• Help with functional prediction and with detecting remote homology

C-x(2,4)-C-x(12)-H-x(3,5)-H

Motifs

How to express motifs?

Regular expressions, patterns

Profiles (PSSM)

HMMs (Hidden Markov Models) (PFAM)

Regular expressions

ALRDFATHDDFSMTAEATHDSIECDQAATHEAS

A-T-H-[DE]
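The pattern A-T-H-[DE] maps directly onto an ordinary regular expression. A minimal sketch using Python's standard `re` module, with the pattern translated by hand and positions reported 1-based, as is usual for protein sequences:

```python
import re

sequence = "ALRDFATHDDFSMTAEATHDSIECDQAATHEAS"

# A-T-H-[DE]: A, then T, then H, then either D or E.
pattern = re.compile(r"ATH[DE]")

# m.start() is 0-based, so add 1 to get the residue number.
matches = [(m.start() + 1, m.group()) for m in pattern.finditer(sequence)]
print(matches)
```

The pattern occurs three times in the example sequence, twice ending in D and once in E.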

Regular expressions

[AC]-x-V-x(4)-{ED}

[AC]: A or C

x: any 1 aa

V: only V

x(4): any 4 aa

{ED}: never E or D

Regular expressions

<A-x-[ST](2)-x(0,1)-V

<: N-terminal

[ST](2): two S or T

x(0,1): one or none
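The pattern elements shown on these slides can be translated mechanically into a standard regular expression. The converter below is a sketch that covers only the elements that appear here (residues, `x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` anchors); it is not a full implementation of the PROSITE syntax, and `prosite_to_regex` is a hypothetical name.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue part from an optional repeat such as (4) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                  # any amino acid
            core = "."
        elif core.startswith("{"):       # {ED}: anything except E or D
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


print(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))
print(prosite_to_regex("<A-x-[ST](2)-x(0,1)-V"))
print(prosite_to_regex("C-x(2,4)-C-x(12)-H-x(3,5)-H"))
```

The resulting strings can be passed to `re.search` or `re.finditer` to scan a sequence.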

Building regular expressions

From multiple alignments:

• Manual
• Pattern-building software: eMotif, PRATT, Teiresias,…

Manual building

Pattern databases

PROSITE

• Known sequence motifs
• Active site signatures
• Interaction sites
• Modification sites (glycosylation, phosphorylation, ...)

[GA]-x(1,2)-[DE]-x-Y-x-[STAP]-x-C-[NKR]-x-[CH]-[LIVMFYWH]

Example match: G GQ D L Y V P V C R L C Y

Reverse search

Which proteins show a given motif? This allows:

• Testing the statistical validity of a hit
• Checking new motifs
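A reverse search scans every protein in a collection for the motif, and a rough chance estimate helps judge whether a hit means anything. In this sketch the protein names and sequences are invented, the motif A-T-H-[DE] is translated by hand into a regular expression, and the chance estimate assumes independent positions with uniform amino acid frequencies, which is a strong simplification.

```python
import math
import re


def reverse_search(regex, proteins):
    """Return {protein name: [1-based match positions]} for proteins with hits."""
    compiled = re.compile(regex)
    hits = {}
    for name, sequence in proteins.items():
        positions = [m.start() + 1 for m in compiled.finditer(sequence)]
        if positions:
            hits[name] = positions
    return hits


def motif_probability(allowed_per_position):
    """Chance that a random window matches, given how many of the 20
    amino acids are allowed at each motif position (uniform background)."""
    return math.prod(k / 20 for k in allowed_per_position)


# Hypothetical mini-database.
proteins = {
    "prot_a": "ALRDFATHDDFSMTAEATHDSIECDQAATHEAS",
    "prot_b": "MKKLLPTATHQ",
    "prot_c": "GGATHEGG",
}

hits = reverse_search(r"ATH[DE]", proteins)
p_random = motif_probability([1, 1, 1, 2])  # A, T, H, then D or E
print(hits)
print(p_random)
```

Multiplying `p_random` by the number of windows scanned gives the expected number of chance matches; a short motif like this one is expected to occur by chance in any large database, so hits need further evidence.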
