Multiple Alignments and Multivariate Analysis
Clustal: 1988-2006
Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
Multiple Alignments
• Phylogenetic Analysis
• Secondary Structure Prediction
• Homology Detection
• Profile Analysis
• Homology Modeling
VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP
-VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA

KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF
QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL

GKEFTPPVQAAYQKVVAGVANALAHKYH
PAEFTPAVHASLDKFLASVSTVLTSKYR
Dynamic Programming
• Needleman and Wunsch, 1970
• O(L^2) algorithm

Maximise score (or minimise distance)
• Gap penalties
• Amino acid weight matrix
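The dynamic programming recurrence behind this slide can be sketched in a few lines. This is a minimal illustration, not the Clustal code: the unit match/mismatch scores stand in for an amino acid weight matrix, and the linear gap penalty is an assumption chosen for brevity.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b).

    match/mismatch are a stand-in for a weight matrix and gap is a linear
    penalty per gap position (both are illustrative assumptions).
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Filling the (L+1) x (L+1) table is what gives the O(L^2) cost for two sequences of length L.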
Human beta --------VHLTPEEKSAVTALWGKVN–-VDEVGGEALGRLLVVYPWTQRFFESFGDLSTHorse beta --------VQLSGEEKAAVLALWDKVN–-EEEVGGEALGRLLVVYPWTQRFFDSFGDLSNHuman alpha ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-Horse alpha ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-Whale myoglobin ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTLamprey globin PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTTLupin globin --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE *: : : * . : .: * : * : . Human beta PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRLHorse beta PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRLHuman alpha ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKLHorse alpha ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKLWhale myoglobin EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEFLamprey globin ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKVLupin globin VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV . .:: *. : . : *. * . : . Human beta LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------Horse beta LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------Human alpha LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------Horse alpha LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------Whale myoglobin ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQGLamprey globin LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------Lupin globin VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA--- : : .: . .. . :
Weighted Sums of Pairs: WSP

$$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$$

Time O(L^N)
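A small sketch of how a WSP value could be computed for a given alignment. The distance function below (1 per mismatch, 2 per residue-to-gap column) is a hypothetical stand-in for D_ij, and the weights W_ij default to 1 when none are supplied; neither choice comes from the slides.

```python
def pair_distance(row_a, row_b, mismatch=1, gap=2):
    """Illustrative D_ij: count mismatches and gaps between two aligned rows."""
    d = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a shared gap column costs nothing for this pair
        if x == "-" or y == "-":
            d += gap
        elif x != y:
            d += mismatch
    return d


def wsp(rows, weights=None):
    """Sum W_ij * D_ij over all pairs with j < i, as in the formula above."""
    total = 0.0
    for i in range(1, len(rows)):
        for j in range(i):
            w = 1.0 if weights is None else weights[i][j]
            total += w * pair_distance(rows[i], rows[j])
    return total
```

With distances, a lower WSP is better; with similarity scores in place of D_ij the same double sum would be maximised.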
Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
7           2404 years

Time O(L^N)
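The table is consistent with each extra sequence multiplying the run time by the sequence length. A rough sketch of that arithmetic, assuming L = 150 and 1 second for the pairwise case (both inferred from the ratio between the first two rows, not stated on the slide):

```python
def msa_seconds(n_seqs, length=150, pair_seconds=1.0):
    """Estimated run time if cost grows as L^N, scaled so N = 2 takes pair_seconds.

    length=150 and pair_seconds=1.0 are assumptions inferred from the table.
    """
    return pair_seconds * length ** (n_seqs - 2)


SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

for n in range(2, 8):
    t = msa_seconds(n)
    print(n, t, "s =", t / SECONDS_PER_DAY, "days =", t / SECONDS_PER_YEAR, "years")
```

Under these assumptions the estimates match the table for 2 to 6 sequences and land within rounding of the 7-sequence figure.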
[Figure: guide tree with leaves Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin]

Progressive Alignment:
• Feng and Doolittle, 1987
• Barton and Sternberg, 1987
• Willie Taylor, 1987, 1988
• Hogeweg and Hesper, 1984
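Progressive alignment follows a guide tree built from pairwise distances, aligning the closest sequences first. The sketch below builds such a tree with UPGMA, which is used here purely for illustration; the slides do not say which clustering method is meant, and the input format (a dict of pairwise distances) is a hypothetical choice.

```python
def upgma(names, dist):
    """Build a guide tree as nested tuples from pairwise distances.

    names: list of leaf names.
    dist:  dict mapping (name_a, name_b) -> distance, one entry per pair.
    """
    size = {name: 1 for name in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    active = list(names)
    while len(active) > 1:
        # Find the closest pair of current clusters.
        a, b = min(
            ((x, y) for i, x in enumerate(active) for y in active[i + 1:]),
            key=lambda p: d[frozenset(p)],
        )
        merged = (a, b)
        size[merged] = size[a] + size[b]
        # Distance to the new cluster is the size-weighted average (UPGMA).
        for c in active:
            if c != a and c != b:
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / size[merged]
        active = [c for c in active if c != a and c != b] + [merged]
    return active[0]
```

The nesting of the returned tuples gives the order in which sequences and sub-alignments would be joined.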
Clustal
• 35000 citations
• Clustal1-Clustal4, 1988
  – Paul Sharp, Dublin
• Clustal V, 1992
  – EMBL Heidelberg
    • Rainer Fuchs
    • Alan Bleasby
• Clustal W 1994-2006, Clustal X 1997-2006
  – Toby Gibson, EMBL, Heidelberg
  – Julie Thompson, ICGEB, Strasbourg
• Clustal W and Clustal X 2.0, early 2007
  – University College Dublin
Since 1994?
Benchmarks

Protein structure alignments and superpositions
• Barton and Sternberg; Fitch and McClure
• Dali
• BaliBase
• Homstrad
• Oxbench
• Prefab etc.

Protein structure analysis
• APDB: O'Sullivan O, Zehnder M, Higgins D, Bucher P, Grosdidier A, Notredame C. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics 19 Suppl 1:i215-21.

RNA alignments
• Bralibase (Gardner PP, Wilm A & Washietl S (2005) NAR.)
Which Method is Best?
• Clustal W????
• MSA (Lipman, Altschul, Kececioglu)
• DCA (Stoye), PRRP (Gotoh) , SAGA (Notredame)
• Probcons (Do, Brudno, Batzoglou)
• T-Coffee (Notredame)
• 3-D Coffee M-Coffee
• MAFFT (Katoh) and MUSCLE (Edgar)
For Global Protein alignments!!!
Clustal W and X 2.0?
• Jan 2007
• Re-engineered in C++
• Aim to increase accuracy
  – Iteration (Wallace, I. M., O'Sullivan, O. and Higgins, D. G., 2005. Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics 21:1408.)
• Reduce run times
Multivariate Analysis?
ADE-4 http://pbil.univ-lyon1.fr/ADE-4/
Thioulouse J., Chessel D., Dolédec S., & Olivier J.M. (1997) ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7, 1, 75-83.
• MADE4
  – Culhane, A., Thioulouse, J., Perriere, G., Higgins, D.G. (2005) MADE4: an R package for multivariate analysis of gene expression data. Bioinformatics 21(11):2789-2790.

Between Group Analysis (BGA)
Dolédec, S. & Chessel, D. (1987) Acta Oecologica, Oecologica Generalis, 8, 3, 403-426.
Supervised Correspondence Analysis or PCA

Co-Inertia Analysis (CIA)
Dolédec, S. & Chessel, D. (1994) Freshwater Biology, 31, 277-294.
Thioulouse, J. & Lobry, J.R. (1995) CABIOS, 11, 321-329.
2 datasets; simultaneous CA or PCA
Use CA, PCA for Sequences?
PCOORD on sequence distances:
Higgins, D.G. (1992) Sequence ordinations: a multivariate analysis approach to analysing large sequence data sets. CABIOS, 8, 15-22.

PCA on dipeptide composition:
Van Heel, M. (1991) A new family of powerful multivariate statistical sequence analysis techniques. J. Mol Biol. 220(4): 877-887.

PCA on alignment columns:
Casari G, Sander C, Valencia A. (1995) A method to predict functional residues in proteins. Nat Struct Biol. 2(2):171-8.
Supervised PCA or CA?
Malate Dehydrogenases
Lactate Dehydrogenases
Between Group Analysis
[Figure: GSVD schematic; labels: samples, genes, N]

Trypsin-like serine proteases
15 Chymotrypsins, 31 Trypsins, 10 Elastases

[Figure: Between Group Analysis of the serine protease alignment. Panels: sequence plot (d = 0.05) with individual sequence labels; group plot (d = 0.05) with groups Chymotrypsin, Elastase and Trypsin; residue plot (d = 0.1); Eigenvalues.]
Residue labels in the residue plot: X3N, X7A, X10N, X14W, X16S, X18I, X54V, X66T, X70R, X82E, X82G, X87L, X92I, X93I, X93F, X95N, X98W, X98Y, X132Y, X137C, X154T, X154V, X155S, X155T, X162S, X165N, X180Q, X181A, X183L, X196Y, X204S, X228K, X229D, X229S, X232Q, X232M, X243Q, X265S, X273K, X275G
BGA With CA or PCA?
• CA:
  – Pretty pictures
  – Sequences/residues plots
  – Finds any clear/simple patterns
    • Binary aa variables
• PCA:
  – Use continuous variables
    • e.g. aa properties: size, charge, hydrophobicity etc.
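To make "binary aa variables" concrete, the sketch below codes an alignment as a 0/1 table with one column per (position, residue) combination, which is the kind of input CA would take. The function name and the label format (position followed by residue, e.g. "16S") are illustrative assumptions.

```python
def binary_code(alignment):
    """Code an alignment as binary variables.

    Returns (labels, matrix): labels[k] names a (position, residue) variable
    such as "16S"; matrix[s][k] is 1 if sequence s has that residue at that
    position, else 0. Gap characters do not get a variable of their own.
    """
    length = len(alignment[0])
    variables = []
    for pos in range(length):
        residues = sorted({seq[pos] for seq in alignment} - {"-"})
        variables.extend((pos, r) for r in residues)
    labels = [f"{pos + 1}{r}" for pos, r in variables]
    matrix = [
        [1 if seq[pos] == r else 0 for pos, r in variables]
        for seq in alignment
    ]
    return labels, matrix
```

For PCA, each residue would instead be replaced by continuous property values (size, charge, hydrophobicity and so on) rather than indicator columns.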
BGA with PCA using 5 amino acid properties (A-E)
31 Trypsins, 15 Chymotrypsins, 10 Elastases

[Figure: Between Group Analysis with PCA. Panels: Sequences (d = 10) with individual sequence labels; group plot (d = 10) with groups Chymotrypsin, Elastase and Trypsin; Residue weights (d = 0.5); Eigenvalues.]
Residue weight labels: X1C, X1D, X1E, X7B, X47D, X47E, X82B, X95B, X95C, X95E, X136A, X136B, X165A, X185B, X196C, X216A, X227A, X227B, X229A, X229B, X229D, X229E, X232A, X232C, X240D, X243A, X255A, X255C, X255E, X260C, X260D, X260E, X267D, X272A, X273A, X275A, X275B, X275C, X275E, X277B
BGA on Alignments
• Focus on any split in the data
• Binary or Property coding
  – CA or PCA
• Sequence Weighting
• Pseudocounts
BGA, CIA, MADE4
Aedín Culhane, Guy Perriere, Jean Thioulouse, Ian Jeffery, Ailís Fagan

Clustal
Toby Gibson, EMBL
Julie Thompson, ICGEB, Strasbourg

Iteration, Benchmarking, Clustal W 2.0
Gordon Blackshields, Mark Larkin, Paul McGettigan, Iain Wallace
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT

SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
Weighted Sums of Pairs

$$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$$

• MSA: Branch and Bound. Lipman, Altschul and Kececioglu, 1989
• FastMSA: Tweaked MSA. Gupta, Kececioglu and Schaeffer, 1995
• DCA: Divide and Conquer. Stoye, Moulton and Dress, 1997
• SAGA: Genetic Algorithm. Notredame and Higgins, 1996
• PRRP: Iteration. Gotoh, 1996
Genetic Algorithm
• Mutation
• Recombination (cross-overs)
• Selection (WSP)
SAGA
• Cedric Notredame
• Sequence Alignment by Genetic Algorithm
• Optimise any objective function
• Notredame, C. and Higgins, D.G. (1996) SAGA: Sequence alignment by genetic algorithm. Nucleic Acids Research, 24:1515-1524.
Structure Test Cases

                                 MSA                              SAGA
Test case    N seqs  Length  Score (WSP)  Structure  CPU-time  Score (WSP)  Structure  CPU-time
                                          match %                           match %
Cytc         6       129     1051257      74         7         1051257      74         960
GCR          8       60      371875       75         3         371650       82         75
Ac Protease  5       183     379997       80         13        379997       80         331
S Protease   6       280     574884       91         184       574884       91         3500
Chtp         6       247     111924       -          4525      111579       -          3542
Dfr secstr   4       189     171979       82.03      5         171975       82.50      411
Sbt          4       296     271747       80         7         271747       80         210
Globin       7       167     659036       94         7         659036       94         330
Plasto       5       132     236343       54.03      22        236195       54.05      510
Which method is best?
• Best score?
• Empirical tests?
  – Sets of test cases
    • Fitch and McClure
    • BaliBase
    • Homstrad
    • Oxbench
    • Prefab etc.
  – APDB: O'Sullivan O, Zehnder M, Higgins D, Bucher P, Grosdidier A, Notredame C. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics 19 Suppl 1:i215-21.
COFFEE
• Consistency based Objective Function For Evaluation of Ehhhh things
• Maximum Weight Trace (John Kececioglu)
• Maximise similarity to a LIBRARY of residue pairs
• Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.
Human beta   VHLTPEEKSAVTALWGKVN--VDEVGGEAL
Horse beta   VQLSGEEKAAVLALWDKVN--EEEVGGEAL
Human alpha  -VLSPADKTNVKAAWGKVGAHAGEYGAEAL
Horse alpha  -VLSAADKTNVKAAWSKVGGHAGEYGAEAL

Pairs of Residues, e.g.
Seq N, Residue I
Seq M, Residue J
Weight = w
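One way to picture scoring an alignment against a library of weighted residue pairs is sketched below. It is a simplified stand-in for the COFFEE objective, not the published formula: the score is just the fraction of total library weight whose residue pairs end up in the same column. The library layout (a dict keyed by ((seq, residue), (seq, residue)) with 0-based indices and the lower sequence index first) is an assumption.

```python
def aligned_pairs(rows):
    """Residue pairs ((n, i), (m, j)) with n < m that share an alignment column."""
    counters = [0] * len(rows)  # next residue index for each sequence
    pairs = set()
    for col in range(len(rows[0])):
        present = []
        for n, row in enumerate(rows):
            if row[col] != "-":
                present.append((n, counters[n]))
                counters[n] += 1
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                pairs.add((present[a], present[b]))
    return pairs


def library_score(rows, library):
    """Fraction of library weight recovered by the alignment (0.0 to 1.0)."""
    found = aligned_pairs(rows)
    total = sum(library.values())
    hit = sum(w for pair, w in library.items() if pair in found)
    return hit / total if total else 0.0
```

An optimiser such as SAGA can then search for the alignment that maximises this kind of score.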
% Match

Test Case   Avg % ID  Nseq  COFFEE  PRRP  MSA   ClustalW  PILEUP  SAM
                            SAGA          SAGA                    HMM
Ac prot     21        14    50.2    48.8  51.2  39.2      40.9    27.9
Binding     31        7     64.5    76.2  64.2  50.0      66.6    36.9
Cytc        42        6     90.7    89.4  67.3  89.1      94.6    67.3
Fniii       17        9     47.0    36.3  45.2  42.0      37.8    16.2
Gcr         36        8     83.1    92.8  80.8  80.8      80.8    85.7
Globin      24        17    85.2    87.0  78.0  86.4      72.6    67.8
Igb         24        37    78.1    74.9  70.1  74.8      52.4    67.2
Lzm         39        6     72.3    71.1  72.3  72.2      72.3    55.3
Phenyldiox  22        8     64.7    49.9  55.6  58.5      37.4    45.7
Sbt         61        7     96.9    96.7  96.0  96.9      97.4    90.6
sprot       27        15    66.6    64.3  68.5  62.5      57.9    61.7
Average                     72.6    71.5  68.1  65.5      64.5    56.4
T-Coffee
• Heuristic approximation to COFFEE
  – Uses progressive alignment (Trees)
• Heterogeneous data
  – Sequences
  – Structures
  – Genomes
  – ESTs
• Notredame, C, Higgins, DG and Heringa, J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J.Mol.Biol., 302: 205-217.
T-Coffee
• Mixed data sources
  – Primary library from
    • Lalign (SIM): 10 best local alignments
    • Clustalw: all pairwise alignments
    • SAP (Willie Taylor, Structure Superposition)
    • Multiple alignments
• Check library for CONSISTENCY
  – Upweight pairs of residues that agree with other pairs
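The consistency check can be sketched as follows: a pair of residues x and y gains weight whenever both are also paired with the same residue z in a third sequence. The sketch adds min(W(x,z), W(z,y)) for each such z, which follows the library extension described in the T-Coffee paper, but it only re-weights pairs already in the library and does not create new ones. The data layout is a hypothetical choice.

```python
from collections import defaultdict


def extend_library(library):
    """Upweight residue pairs that are supported through a third sequence.

    library: dict {frozenset({x, y}): weight}, where x and y are
             (sequence_index, residue_index) tuples from different sequences.
    Returns a new dict with the same keys and consistency-adjusted weights.
    """
    neighbours = defaultdict(dict)
    for pair, w in library.items():
        x, y = tuple(pair)
        neighbours[x][y] = w
        neighbours[y][x] = w
    extended = {}
    for pair, w in library.items():
        x, y = tuple(pair)
        support = 0.0
        for z, w_xz in neighbours[x].items():
            if z[0] in (x[0], y[0]):
                continue  # z must come from a third sequence
            w_zy = neighbours[y].get(z)
            if w_zy is not None:
                support += min(w_xz, w_zy)
        extended[pair] = w + support
    return extended
```

Pairs with no third-sequence support keep their original weight, so consistent pairs stand out during the progressive alignment step.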
[Figure: default T-Coffee scheme; labels: Local Alignment, Global Alignment, T-Coffee, Multiple Sequence Alignment]

Mixing Heterogeneous Information
[Figure: labels: Multiple Alignment, Structural, Specialist]
[Figure: Structure Superposition (e.g. SAP, Taylor and Orengo) giving Weighted Residue Pairs]
Copyright Cédric Notredame, 2000, all rights reserved
Increasing Structure Numbers
tRNA-synt_2b, 19% ID
[Figure: % accuracy (0 to 100) against number of structures (0, 2, 3, 4, 5)]

Including Structures in an Alignment
% accuracy: clustalw 35.24; T_Coffee Default 38.39; T_Coffee plus all structures 66.49

3D-Coffee
O'Sullivan, O., Suhre, K., Abergel, C., Higgins, DG and Notredame, C (2004) J.Mol.Biol.
Recent Developments
• 20-30 new programs in past 2 years
• MUSCLE
  – Bob Edgar, ISMB, 2004
  – Iteration/progressive alignment
    • FAST
    • Big Alignments
• PROBCONS
  – Tom Do, Michael Brudno, Serafim Batzoglou
  – ISMB 2004
  – “P-Coffee”
    • VERY accurate
Iteration Revisited

--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-

Remove EACH Sequence   RF
Remove BEST Sequence   RB
Random                 Random
Tree based             Tree

Iterate
Iteration on HomStrad 184

Method            Default  Remove Each  Remove Best  Random
ProbCons          64.88    64.69        65.00        64.27
Muscle v3.3       63.12    63.77***     63.70**      63.39
T-Coffee          62.87    63.24        63.38        62.70
Muscle v3.2       61.76    63.58***     63.57***     62.75**
ClustalW          59.87    61.54***     61.44***     60.99**
FFT-NSI (Mafft)   59.65    62.10***     62.05***     61.55***
Average           59.32    60.70***     60.88***     60.57***

Tree Based
Quick Tree        63.45**  63.69***     62.47**
Slow Tree         63.10*   63.27**      61.74

Wallace, O'Sullivan and Higgins, 2004, Bioinformatics, 21:1408
Combining Multiple Alignment Methods
[Figure: diagram with labels Clustal W, T-Coffee, Probcons, Muscle, Specialist, T-Coffee, Multiple Sequence Alignment]

Combining Multiple Alignment methods with T-Coffee
[Figure: bar chart, vertical axis 65.80 to 67.80; methods added in the order probcons, +muscle v6, +tcoffee, +muscle v3.52, +finsi, +pcma, +ginsi, +fftnsi, +clustalw, +fftns2, +fftns1, +dialign-t, +dialign, +poa-global, +poa-local]
The Wisdom of Crowds
James Surowiecki
• Crowds are surprisingly good at accurate decisions
• Better than “experts”
• Only if they do not form a “mob”
M-Coffee: combine 8 methods

Method added   Default  Combined
Poa-global     51.90    51.96
+Dialign-T     57.92    58.32
+ClustalW      61.15    62.75
+PCMA          63.73    65.15
+FINSI         64.22    65.94
+T-Coffee      65.37    66.73
+Muscle v6     66.04    67.38
+ProbCons      66.41    67.75
BaliBASE
Thompson, JD, Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics
• ICGEB Strasbourg
• 141 manual alignments using structures
• 5 sections
• Core alignment regions marked

1. Equidistant (82)
2. Orphan (23)
3. Two groups (12)
4. Long internal gaps (13)
5. Long terminal gaps (11)
Compare Methods
• Sam: HMM. Hughey and Krogh, 1996
• Dialign: Local multiple alignments. Morgenstern, 1999
• ClustalW: Progressive alignment. Thompson, Higgins and Gibson, 1994
• Prrp: Iterative WSP. Gotoh, 1996
• T-Coffee: Pairwise library. Notredame, Higgins and Heringa, 2000
Balibase
% alignment columns correct; core alignment blocks only

Method    1 (82)  2 (23)  3 (12)  4 (13)  5 (11)  Total
SAM       46.8    20.0    13.9    43.9    42.7    39.8
Dialign   71.0    25.2    35.1    74.7    80.4    61.5
ClustalW  78.5    32.2    42.5    65.7    74.3    66.4
PRRP      78.6    32.5    50.2    51.1    82.7    66.4
T-Coffee  80.7    37.3    52.9    83.2    88.7    72.1
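A sketch of how a "% alignment columns correct" figure could be computed against a reference alignment. A column counts as correct only if the test alignment contains a column pairing exactly the same residues. The function names and the `core` argument (indices of reference columns inside the marked core blocks) are hypothetical and not taken from the BaliBASE scoring program.

```python
def columns(rows):
    """Describe each column as a tuple of residue indices (None for a gap)."""
    counters = [0] * len(rows)
    cols = []
    for c in range(len(rows[0])):
        col = []
        for n, row in enumerate(rows):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[n])
                counters[n] += 1
        cols.append(tuple(col))
    return cols


def column_score(test_rows, ref_rows, core=None):
    """Percentage of reference columns reproduced exactly in the test alignment.

    core: optional list of reference column indices to restrict scoring to.
    Rows must list the same sequences in the same order in both alignments.
    """
    ref_cols = columns(ref_rows)
    if core is not None:
        ref_cols = [ref_cols[c] for c in core]
    test_cols = set(columns(test_rows))
    correct = sum(1 for col in ref_cols if col in test_cols)
    return 100.0 * correct / len(ref_cols)
```

Restricting the score to core columns avoids penalising a method for regions where the reference itself is unreliable.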
Clustal
• Clustal, Clustal1-4 (TCD)
  – Higgins DG, Sharp PM. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene. 73(1):237-44.
  – Higgins DG, Sharp PM. (1989) Fast and sensitive multiple sequence alignments on a microcomputer. Comput Appl Biosci. 5(2):151-3.
• ClustalV (Heidelberg)
  – Higgins DG, Bleasby AJ, Fuchs R. (1992) CLUSTAL V: improved software for multiple sequence alignment. Comput Appl Biosci. 8(2):189-91.
• ClustalW (Hinxton)
  – Thompson JD, Higgins DG, Gibson TJ. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22):4673-80.
• ClustalX (UCC)
  – Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG. (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25(24):4876-82.
Clustal re-engineering in C++
• Problems:
  – Code has become very complex.
  – 18 code files (up to 5229 lines).
  – 400 global variables.
  – 500 functions.
• Wish to:
  – Simplify the code.
  – Improve the structure of the code (modularisation).
  – Make it easier to make functional changes.
  – Make the code easier to understand.
  – Improve portability.
    • Qt: cross-platform C++ GUI toolkit.
The Local Minimum Problem: Clustal is “Greedy”
[Figure: Energy plotted against Location, with a local minimum and the global minimum marked]