Manual alignment
•Difficult for remote homologies
•Slow
•Low reproducibility
•Allows considering information not included in the sequence
“Equivalent” amino acids
Hydrophobic: Ala (A), Val (V), Met (M), Leu (L), Ile (I), Phe (F), Trp (W), Tyr (Y)
Small: Gly (G), Ala (A), Ser (S)
Polar: Ser (S), Thr (T), Asn (N), Gln (Q), Tyr (Y)
Charged: Asp (D), Glu (E) / Lys (K), Arg (R)
Polar and charged residues are equivalent on the surface
Hard to replace (special function): Gly (G), Pro (P), Cys (C), His (H)
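The residue classes above can be turned into a simple lookup, as a minimal sketch. The class names "acidic" and "basic", the function names, and the rule that "hard to replace" residues only match themselves are assumptions made for illustration; the surface rule for polar and charged residues is not modelled.

```python
# Residue classes copied from the list above; a residue may sit in several.
GROUPS = {
    "hydrophobic": set("AVMLIFWY"),
    "small": set("GAS"),
    "polar": set("STNQY"),
    "acidic": set("DE"),  # assumed name for the Asp/Glu half of "Charged"
    "basic": set("KR"),   # assumed name for the Lys/Arg half of "Charged"
}
HARD_TO_REPLACE = set("GPCH")


def shared_groups(a, b):
    """Names of the classes that contain both residues."""
    return sorted(name for name, members in GROUPS.items()
                  if a in members and b in members)


def is_equivalent(a, b):
    """Assumed rule: identical residues match; special residues match only themselves."""
    if a == b:
        return True
    if a in HARD_TO_REPLACE or b in HARD_TO_REPLACE:
        return False
    return bool(shared_groups(a, b))


print(is_equivalent("L", "I"), shared_groups("L", "I"))
print(is_equivalent("D", "K"), shared_groups("D", "K"))
```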
BCL2 Human vs BCL2 Mouse
Annotated regions: BH4, BH3, BH1, PhosphoSer
>sp|P10417|BCL2_MOUSE Apoptosis regulator Bcl-2
Length=236
Score = 429 bits (1103), Expect = 6e-119, Method: Composition-based stats.
Identities = 214/239 (89%), Positives = 217/239 (90%), Gaps = 3/239 (1%)

Query  MAHAGRTGYDNREIVMKYIHYKLSQRGYEWDAGDVGAAPPGAAPAPGIFSSQPGHTPHPA  60
       MA AGRTGYDNREIVMKYIHYKLSQRGYEWDAGD  AAP GAAP PGIFS QP   P PA
Sbjct  MAQAGRTGYDNREIVMKYIHYKLSQRGYEWDAGDADAAPLGAAPTPGIFSFQPESNPMPA  60

Query  ASRDPVARTSPLQTPAAPGAAAGPALSPVPPVVHLTLRQAGDDFSRRYRRDFAEMSSQLH  120
         R+  ARTSPL+   A    AGPALSPVPP VHLTLR+AGDDFSRRYRRDFAEMSSQLH
Sbjct  VHREMAARTSPLRPLVA---TAGPALSPVPPCVHLTLRRAGDDFSRRYRRDFAEMSSQLH  117

Query  LTPFTARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEY  180
       LTPFTARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEY
Sbjct  LTPFTARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEY  177

Query  LNRHLHTWIQDNGGWDAFVELYGPSMRPLFDFSWLSLKTLLSLALVGACITLGAYLGHK  239
       LNRHLHTWIQDNGGWDAFVELYGPSMRPLFDFSWLSLKTLLSLALVGACITLGAYLGHK
Sbjct  LNRHLHTWIQDNGGWDAFVELYGPSMRPLFDFSWLSLKTLLSLALVGACITLGAYLGHK  236
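The Identities and Gaps figures in the BLAST header can be recounted from the two aligned strings. The sketch below does this for the human and mouse BCL2 rows shown above; the function name is an assumption, and the Positives value is not reproduced because it depends on a substitution matrix.

```python
def alignment_stats(query, subject):
    """Count identical columns and gapped columns of a pairwise alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identities, gaps, len(query)


# Aligned rows copied from the BLAST output above (human query, mouse subject).
QUERY = "".join([
    "MAHAGRTGYDNREIVMKYIHYKLSQRGYEWDAGDVGAAPPGAAPAPGIFSSQPGHTPHPA",
    "ASRDPVARTSPLQTPAAPGAAAGPALSPVPPVVHLTLRQAGDDFSRRYRRDFAEMSSQLH",
    "LTPFTARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEY",
    "LNRHLHTWIQDNGGWDAFVELYGPSMRPLFDFSWLSLKTLLSLALVGACITLGAYLGHK",
])
SBJCT = "".join([
    "MAQAGRTGYDNREIVMKYIHYKLSQRGYEWDAGDADAAPLGAAPTPGIFSFQPESNPMPA",
    "VHREMAARTSPLRPLVA---TAGPALSPVPPCVHLTLRRAGDDFSRRYRRDFAEMSSQLH",
    "LTPFTARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEY",
    "LNRHLHTWIQDNGGWDAFVELYGPSMRPLFDFSWLSLKTLLSLALVGACITLGAYLGHK",
])

identities, gaps, length = alignment_stats(QUERY, SBJCT)
print(f"Identities = {identities}/{length}, Gaps = {gaps}/{length}")
```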
BCL-2 vs BCL-X
>emb|CAA57886.1| bcl-x [Rattus norvegicus]
Length=233
Score = 172 bits (435), Expect = 2e-41, Method: Composition-based stats.
Identities = 93/199 (46%), Positives = 116/199 (58%), Gaps = 13/199 (6%)

Query  NREIVMKYIHYKLSQRGYEW----DAGDVGAAPPGAAPAPGIFSSQPGHTPHPAASRDPV  66
       N+E+V+ ++ YKLSQ+GY W    D  +     P          S     P    +  P
Sbjct  NQELVVDFLSYKLSQKGYSWSQFSDVEENRTEAPEETEPERETPSAINGNPSWHLADSPA  64

Query  ARTSPLQTPAAPGAAAGPALSPVPPV--VHLTLRQAGDDFSRRYRRDFAEMSSQLHLTPF  124
                 A G ++      V P+  V   LR+AGD+F  RYRR F++++SQLH+TP
Sbjct  VN-------GATGHSSSLDAREVIPMAAVKQALREAGDEFELRYRRAFSDLTSQLHITPG  117

Query  TARGRFATVVEELFRDGVNWGRIVAFFEFGGVMCVESVNREMSPLVDNIALWMTEYLNRH  184
       TA   F  VV ELFRDGVNWGRIVAFF FGG +CVESV++EM  LV  IA WM  YLN H
Sbjct  TAYQSFEQVVNELFRDGVNWGRIVAFFSFGGALCVESVDKEMQVLVSRIASWMATYLNDH  177

Query  LHTWIQDNGGWDAFVELYG  203
       L  WIQ+NGGWD FV+LYG
Sbjct  LEPWIQENGGWDTFVDLYG  196
Annotated regions: BH4, BH3, BH1, PhosphoSer
General strategies
Group sequences in a “family” (BLAST, PFAM)
Recognize short sequence fragments related to functional or structural features
Multiple alignment
Global alignment of more than two sequences
Allows including the importance of each position in the alignment
Allows defining conserved residues:
•Residues with functional or structural importance
•Tree determinants
•Correlated mutations
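Finding conserved residues in a multiple alignment amounts to inspecting it column by column. A minimal sketch follows; the four-sequence alignment is invented for illustration, and the function names and the default threshold are assumptions.

```python
from collections import Counter


def column_conservation(alignment):
    """Most common residue and its fraction for every alignment column."""
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result


def conserved_positions(alignment, threshold=1.0):
    """1-based positions whose dominant residue reaches the threshold."""
    return [j + 1
            for j, (residue, fraction) in enumerate(column_conservation(alignment))
            if residue != "-" and fraction >= threshold]


# Hypothetical toy alignment (not taken from the slides).
ALIGNMENT = ["GDSGGP", "GDSGGP", "GDAGGP", "GDSGSP"]
print(conserved_positions(ALIGNMENT))        # strictly invariant columns
print(conserved_positions(ALIGNMENT, 0.75))  # allowing one deviation in four
```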
Multiple alignments
Low similarity, only two sequences:
AVTTGLNMWTTAKRPGMDDFYTILLPGLMNCI
GLFTAIDMHFFGRKPACEEYFTLVVDGLCNCI
Low similarity, multiple sequences:
GIFTTDIDMMHFYVKKPPGLDEFFTTLVLRTLLCMAA
ALTTGIDMMWTTAKRPPDMDDYYTIIIPGLLMNCI
AVTTGLNMMWTTAKRPPGMDDFYTTILLPGLLMNCI
GVTTTGLNMMYFTARRPPGLDEFYTTLVLRTLLCMCL
GIFTTDIDMMHFYVKKPPGLDEFFTTLVLRTLLCMAA
AVTTGLNMMWTTAKRPPGMDDFYTTILLPGLLMNCI
GLFTTALNMMHFFGRKPPACEEYFTTLVVDGLLCNCI
Tree determinants
Define subfamilies
Relevant to phylogeny
S1 GIFTTDIDMMHFYVKKPPGLDEFFTTLVLRTLLCMAA
S2 ALTTGIDMMWTTAKRPPDMDDYYTIIIPGLLMNCI
S3 AVTTGLNMMWTTAKRPPGMDDFYTTILLPGLLMNCI
S4 GVTTTGLNLLYFTARRPP--DEFYSS-VLRTLLCMCL
S5 GIFTTDIDLLHFYVKKPP--DEFFSSLVLRTLLCMAA
S6 AVTTGLNLLWTTAKRPP--DDFYSSILLPGLLMNCI
S7 GLFTTALNLLHFFGRKPP--EEYFSSLVVDGLLCNCI
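One way to read "tree determinants" is as columns that are invariant inside each subfamily but differ between subfamilies. The sketch below implements that reading as an assumption; the two subfamilies and their sequences are invented toy data, and the function name is hypothetical.

```python
def tree_determinants(subfamilies):
    """1-based columns conserved within every subfamily but not shared by all."""
    names = list(subfamilies)
    length = len(subfamilies[names[0]][0])
    positions = []
    for j in range(length):
        residues = []
        for name in names:
            column = {seq[j] for seq in subfamilies[name]}
            if len(column) != 1:
                break  # this subfamily varies here, so the column is rejected
            residues.append(column.pop())
        else:
            # every subfamily is invariant; keep the column if they disagree
            if len(set(residues)) > 1:
                positions.append(j + 1)
    return positions


# Hypothetical aligned subfamilies (not taken from the slides).
SUBFAMILIES = {
    "A": ["AKLDE", "AKLDQ"],
    "B": ["ARLEE", "ARLEK"],
}
print(tree_determinants(SUBFAMILIES))
```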
Correlated mutations
Concerted changes in two or more conserved positions
Reveal positions of structural interaction
GIFTTDIDMMHFYVKKPPGL DEFFTTLVLRTLLCMAA
ALTTGIDMMWTTAKRPPDM DDYYTIIIRGLLMNCI
AVTTGLDMMWTTAKRPPGM DDFYTTILLRGLLMNCI
GVTTTGLDMMYFTARRPPGL DEFYTTLVLKTLLCMCL
GIFTTDIRMMHFYVKKPPGL DEFFTTLVLDTLLCMAA
AVTTGLRMMWTTAKRPPGM DDFYTTILLDGLLMNCI
GLFTTALRMMHFFGRKPPAC EEYFTTLVVEGLLCNCI
D-R / R-D
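Concerted changes can be scored in several ways. The sketch below uses mutual information between alignment columns, which is a common choice but is not named in the slides; the toy alignment, built to mimic a D-R / R-D swap, and the function names are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    return sum((c / n) * math.log2((c / n) / ((freq_a[a] / n) * (freq_b[b] / n)))
               for (a, b), c in freq_ab.items())


def most_correlated_pair(alignment):
    """1-based pair of columns with the highest mutual information."""
    columns = list(zip(*alignment))
    pairs = combinations(range(len(columns)), 2)
    i, j = max(pairs, key=lambda p: mutual_information(columns[p[0]], columns[p[1]]))
    return i + 1, j + 1


# Hypothetical alignment: columns 2 and 4 swap D/R together, column 3 varies freely.
ALIGNMENT = ["GDARL", "GDSRL", "GRADL", "GRSDL"]
print(most_correlated_pair(ALIGNMENT))
```

Real alignments need many more sequences than this for the statistic to be meaningful, and shared ancestry alone can inflate it.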
Software
ClustalW
•Makes global pairwise alignments, building “clusters” of similar sequences
T-Coffee
•Slower than ClustalW but more precise for low similarity
•Combines global and local alignments
Profiles
Also known as Position-specific score matrix (PSSM).
Give scores for amino acids or gaps specific to sequence positions
Quantitative approach to include the role of positions
Aligned sequences:
    F  K  L  L  S  H  C  L  L  V
    F  K  A  F  G  Q  T  M  F  Q
    Y  P  I  V  G  Q  E  L  L  G
    F  P  V  V  K  E  A  I  L  K
    F  K  V  L  A  A  V  I  A  D
    L  E  F  I  S  E  C  I  I  Q
    F  K  L  L  G  N  V  L  V  C
Residue counts per position:
A   0  0  1  0  1  1  1  0  1  0
C   0  0  0  0  0  0  2  0  0  1
D   0  0  0  0  0  0  0  0  0  1
E   0  1  0  0  0  2  1  0  0  0
F   5  0  1  1  0  0  0  0  1  0
G   0  0  0  0  3  0  0  0  0  1
H   0  0  0  0  0  1  0  0  0  0
I   0  0  1  1  0  0  0  3  1  0
K   0  4  0  0  1  0  0  0  0  1
L   1  0  2  3  0  0  0  3  3  0
M   0  0  0  0  0  0  0  1  0  0
N   0  0  0  0  0  1  0  0  0  0
P   0  2  0  0  0  0  0  0  0  0
Q   0  0  0  0  0  2  0  0  0  2
R   0  0  0  0  0  0  0  0  0  0
S   0  0  0  0  2  0  0  0  0  0
T   0  0  0  0  0  0  1  0  0  0
V   0  0  2  2  0  0  2  0  1  1
W   0  0  0  0  0  0  0  0  0  0
Y   1  0  0  0  0  0  0  0  0  0
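The count matrix is obtained by tallying each residue column by column over the seven aligned sequences. A minimal sketch, with variable and function names chosen for illustration:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The seven aligned sequences from the slide.
SEQUENCES = [
    "FKLLSHCLLV",
    "FKAFGQTMFQ",
    "YPIVGQELLG",
    "FPVVKEAILK",
    "FKVLAAVIAD",
    "LEFISECIIQ",
    "FKLLGNVLVC",
]


def count_matrix(sequences):
    """Map each amino acid to its list of per-position counts."""
    length = len(sequences[0])
    counts = {aa: [0] * length for aa in AMINO_ACIDS}
    for seq in sequences:
        for j, aa in enumerate(seq):
            counts[aa][j] += 1
    return counts


counts = count_matrix(SEQUENCES)
for aa in AMINO_ACIDS:
    print(aa, " ".join(f"{n:2d}" for n in counts[aa]))
```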
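Profiles, simplest calculation
$$ M_{ij} = \log \frac{N_{ij}/N_S}{f_i} $$
$N_{ij}/N_S$: frequency of amino acid i at position j
$f_i$: standard frequency of amino acid i
$M_{ij} = 0$: normal
$M_{ij} > 0$: position j is rich in amino acid i
$M_{ij} < 0$: position j is poor in amino acid i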
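The log-odds score can be evaluated directly from a count, the number of sequences and a background frequency. In the sketch below the natural logarithm, the uniform background of 0.05 and the handling of zero counts are assumptions; the slide does not specify them.

```python
import math


def log_odds(n_ij, n_s, f_i):
    """M_ij = log((N_ij / N_S) / f_i); minus infinity when the residue is absent."""
    frequency = n_ij / n_s
    if frequency == 0:
        return float("-inf")  # assumed convention; real profiles use pseudocounts
    return math.log(frequency / f_i)


BACKGROUND = 0.05  # assumed uniform standard frequency (1/20)

# A residue seen 5 times in a column of 7 sequences is enriched (positive score).
print(log_odds(5, 7, BACKGROUND))
# A residue seen exactly at the background rate scores zero.
print(log_odds(1, 20, BACKGROUND))
```

The infinite score for unobserved residues is one reason the simplest calculation needs the refinements that follow.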
Profiles, improvements
Include amino acid frequencies specific to the protein family
Also include classical similarity matrices
•Allows for equivalent amino acids that do not appear in the available alignment
Aligned sequences:
     F   K   L   L   S   H   C   L   L   V
     F   K   A   F   G   Q   T   M   F   Q
     Y   P   I   V   G   Q   E   L   L   G
     F   P   V   V   K   E   A   I   L   K
     F   K   V   L   A   A   V   I   A   D
     L   E   F   I   S   E   C   I   I   Q
     F   K   L   L   G   N   V   L   V   C
Profile scores per position:
A  -18 -10  -1  -8   8  -3   3 -10  -2  -8
C  -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D  -35   0 -32 -33  -7   6 -17 -34 -31   0
E  -27  15 -25 -26  -9  23  -9 -24 -23  -1
F   60 -30  12  14 -26 -29 -15   4  12 -29
G  -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H  -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I    3 -27  21  25 -29 -23  -8  33  19 -23
K  -26  25 -25 -27  -6   4 -15 -27 -26   0
L   14 -28  19  27 -27 -20  -9  33  26 -21
M    3 -15  10  14 -17 -10  -9  25  12 -11
N  -22  -6 -24 -27   1   8 -15 -24 -24  -4
P  -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q  -32   5 -25 -26  -9  24 -16 -17 -23   7
R  -18   9 -22 -22 -10   0 -18 -23 -22  -4
S  -22  -8 -16 -21  11   2  -1 -24 -19  -4
T  -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V    0 -25  22  25 -19 -26   6  19  16 -16
W    9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y   34 -18  -1   1 -23 -12 -19   0   0 -18
Profiles
The use of profiles increases the information available and allows extracting “family” features, as opposed to individual sequence features
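Once a profile exists, a sequence can be scored against it by summing the position-specific scores, and a longer sequence can be scanned window by window. The sketch below uses the profile scores tabulated above; the function names and the 16-residue test sequence are assumptions made for illustration, and gaps are not handled.

```python
# Scores copied from the profile table above (rows: amino acids, columns: positions 1-10).
PSSM = {
    "A": [-18, -10, -1, -8, 8, -3, 3, -10, -2, -8],
    "C": [-22, -33, -18, -18, -22, -26, 22, -24, -19, -7],
    "D": [-35, 0, -32, -33, -7, 6, -17, -34, -31, 0],
    "E": [-27, 15, -25, -26, -9, 23, -9, -24, -23, -1],
    "F": [60, -30, 12, 14, -26, -29, -15, 4, 12, -29],
    "G": [-30, -20, -28, -32, 28, -14, -23, -33, -27, -5],
    "H": [-13, -12, -25, -25, -16, 14, -22, -22, -23, -10],
    "I": [3, -27, 21, 25, -29, -23, -8, 33, 19, -23],
    "K": [-26, 25, -25, -27, -6, 4, -15, -27, -26, 0],
    "L": [14, -28, 19, 27, -27, -20, -9, 33, 26, -21],
    "M": [3, -15, 10, 14, -17, -10, -9, 25, 12, -11],
    "N": [-22, -6, -24, -27, 1, 8, -15, -24, -24, -4],
    "P": [-30, 24, -26, -28, -14, -10, -22, -24, -26, -18],
    "Q": [-32, 5, -25, -26, -9, 24, -16, -17, -23, 7],
    "R": [-18, 9, -22, -22, -10, 0, -18, -23, -22, -4],
    "S": [-22, -8, -16, -21, 11, 2, -1, -24, -19, -4],
    "T": [-10, -10, -6, -7, -5, -8, 2, -10, -7, -11],
    "V": [0, -25, 22, 25, -19, -26, 6, 19, 16, -16],
    "W": [9, -25, -18, -19, -25, -27, -34, -20, -17, -28],
    "Y": [34, -18, -1, 1, -23, -12, -19, 0, 0, -18],
}
WIDTH = 10


def score_window(window):
    """Sum of position-specific scores for an ungapped window of WIDTH residues."""
    return sum(PSSM[aa][j] for j, aa in enumerate(window))


def best_window(sequence):
    """1-based start and score of the best-scoring window in a longer sequence."""
    offsets = range(len(sequence) - WIDTH + 1)
    best = max(offsets, key=lambda k: score_window(sequence[k:k + WIDTH]))
    return best + 1, score_window(sequence[best:best + WIDTH])


print(score_window("FKLLSHCLLV"))        # first family member
print(best_window("AAAFKLLSHCLLVAAA"))   # hypothetical sequence embedding it
```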
HMM profiles
Statistical models (Hidden Markov Models) to build profiles.
The model is “trained” using multiple alignments to determine evolution probabilities.
They contain a “theoretical machinery” that allows sequence relationships to be understood on a quantitative basis.
The twilight zone
Identity below 25 %
Structural similarity exists but is difficult to identify using standard methods
PSI-BLAST
BLAST search based on profiles prepared dynamically:
•Standard BLAST search
•Build a position-specific score matrix (PSSM) from the alignment
•New search against the profile
•Repeat until self-consistency
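The search, build, search again loop can be illustrated with a toy that is deliberately much simpler than PSI-BLAST itself. Everything here is an assumption made for illustration: sequences are ungapped and of equal length, inclusion uses a raw score threshold rather than an E-value, the background is uniform, and the database is invented.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(sequences, pseudocount=1.0):
    """Log-odds profile with a uniform background and a simple pseudocount."""
    n = len(sequences)
    background = 1.0 / len(ALPHABET)
    pssm = []
    for j in range(len(sequences[0])):
        column = [seq[j] for seq in sequences]
        scores = {}
        for aa in ALPHABET:
            freq = (column.count(aa) + pseudocount * background) / (n + pseudocount)
            scores[aa] = math.log(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, sequence):
    return sum(column[aa] for column, aa in zip(pssm, sequence))


def iterative_search(query, database, threshold, max_iterations=5):
    """Rebuild the profile from the current hits until the hit set stops changing."""
    hits = {query}
    for _ in range(max_iterations):
        pssm = build_pssm(sorted(hits))
        new_hits = {seq for seq in database if score(pssm, seq) >= threshold} | {query}
        if new_hits == hits:
            break  # self-consistent: another round would change nothing
        hits = new_hits
    return hits


# Hypothetical database: three relatives of decreasing similarity and one outlier.
DATABASE = ["ACDEFH", "ACDKFH", "ACNKFH", "WWWWWW"]
print(sorted(iterative_search("ACDEFG", DATABASE, threshold=6.0)))
```

In this toy, "ACNKFH" scores below the threshold against the query alone and is only recovered after the closer relatives have been folded into the profile.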
PSI-BLAST
PSI-BLAST cannot find what BLAST cannot detect
The E-value threshold for inclusion/exclusion must be chosen carefully (0.01)
Sometimes the search drifts to nonsensical results
The number of iterations is usually kept small
Motifs
Motif: short sequence fragment, highly conserved, related to some structural or functional feature
Conserved in distant homologues, due to functional or structural restrictions
Help in functional prediction and in detecting remote homology
Motifs
How to express motifs?
Regular expressions, patterns
Profiles (PSSM)
HMMs (Hidden Markov Models) (PFAM)
Building regular expressions
From multiple alignments
•Manual
•Pattern-building software: eMotif, PRATT, Teiresias, …
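Patterns written in PROSITE syntax map almost directly onto regular expressions. The sketch below converts the common elements (x wildcards, [..] allowed sets, {..} forbidden sets, (n) or (n,m) repeats, < and > anchors); the function name is an assumption, and rarer constructs such as an anchor inside square brackets are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")  # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")   # repeat counts
    regex = regex.replace("x", ".")                     # any residue
    regex = regex.replace("<", "^").replace(">", "$")   # sequence ends
    return regex


# PROSITE N-glycosylation site pattern.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)
print(re.search(regex, "MKNGSAQ").group())
```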
Pattern databases
PROSITE
•Known sequence motifs: active site signatures, interaction sites, modification sites (glycosylation, phosphorylation, ...)
Reverse search: which proteins show a given motif?
•Allows testing the statistical validity of a hit
•Allows checking new motifs
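A reverse search is, at its simplest, a scan of every sequence in a collection for a pattern. The sketch below uses Python's `re` module with a lookahead so that overlapping matches are reported; the protein names and sequences are invented, and the function name is an assumption.

```python
import re


def find_motif(regex, proteins):
    """Return {name: [(1-based start, matched text), ...]} for proteins with hits."""
    # The lookahead lets finditer report matches that overlap each other.
    compiled = re.compile(f"(?=({regex}))")
    hits = {}
    for name, sequence in proteins.items():
        matches = [(m.start() + 1, m.group(1)) for m in compiled.finditer(sequence)]
        if matches:
            hits[name] = matches
    return hits


# Hypothetical sequences (not real database entries).
PROTEINS = {
    "protA": "MKNGSAQ",
    "protB": "MKNPSAQ",
    "protC": "NNTSNGTA",
}
# Regular expression form of the N-glycosylation pattern N-{P}-[ST]-{P}.
print(find_motif("N[^P][ST][^P]", PROTEINS))
```

Short patterns like this one match frequently by chance, which is why the statistical validity of each hit needs checking.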