Biology 224
Instructor: Tom Peavy
March 13 & 18, 2008
<Images adapted from Bioinformatics and Functional Genomics by Jonathan Pevsner>
Multiple Sequence Alignment
Multiple sequence alignment: definition
• a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned
• Homologous residues are aligned in columns across the length of the sequences
• residues are homologous in an evolutionary sense
• residues are homologous in a structural sense
Multiple sequence alignment: properties
• not necessarily one “correct” alignment of a protein family
• protein sequences evolve...
• ...the corresponding three-dimensional structures of proteins also evolve
• may be impossible to identify amino acid residues that align properly (structurally) throughout a multiple sequence alignment
• for two proteins sharing 30% amino acid identity, about 50% of the individual amino acids are superposable in the two structures
Multiple sequence alignment: features
• some aligned residues, such as cysteines that form disulfide bridges, may be highly conserved
• there may be conserved motifs such as a transmembrane domain
• there may be conserved secondary structure features
• there may be regions with consistent patterns of insertions or deletions (indels)
Multiple sequence alignment: methods
There are two main ways to make a multiple sequence alignment:
(1) Progressive alignment (Feng & Doolittle), e.g. ClustalW
(2) Iterative approaches.
Use Clustal W to do a progressive MSA
http://www2.ebi.ac.uk/clustalw/
Feng-Doolittle MSA occurs in 3 stages
[1] Do a set of global pairwise alignments (Needleman and Wunsch)
[2] Create a guide tree
[3] Progressively align the sequences
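Stage [1] relies on global pairwise alignment. The following is a minimal sketch of the Needleman and Wunsch dynamic programming method. The scoring values (match = 1, mismatch = -1, gap = -2) and the function name are assumptions chosen for illustration; real programs use substitution matrices such as BLOSUM or PAM and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("GTWYA", "GWYA")
```

Here `result` holds the score and the two gapped strings; both strings always have the same length because every column is either a residue pair or a residue opposite a gap.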
Progressive MSA stage 1 of 3: generate global pairwise alignments
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96
five closely related lipocalins
best score: sequences 2 and 3 (score 99)
Number of pairwise alignments needed
For N sequences, (N-1)(N)/2
For 5 sequences, (4)(5)/2 = 10
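The count can be checked by enumerating the pairs directly. This is a small sketch; the function name `pairwise_count` is hypothetical.

```python
from itertools import combinations


def pairwise_count(n):
    """Number of distinct pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


# Enumerate every pair for sequences numbered 1..5
pairs = list(combinations(range(1, 6), 2))
```

For five sequences, `pairs` runs from (1, 2) to (4, 5), the same ten comparisons that appear in the pairwise alignment output.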
Feng-Doolittle stage 2: guide tree
• Convert similarity scores to distance scores
• A tree shows the distance between objects
• Distance methods are used (e.g. neighbor joining)
• ClustalW provides a syntax to describe the tree
• A guide tree is not a phylogenetic tree
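The first bullet above can be sketched as follows, using the ten pairwise scores from stage 1. The conversion distance = 1 - score/100 is an assumption that treats each score as a percent identity; actual programs may apply further corrections.

```python
# Pairwise scores from stage 1, keyed by (sequence i, sequence j)
scores = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92,
    (2, 3): 99, (2, 4): 86, (2, 5): 85,
    (3, 4): 85, (3, 5): 84,
    (4, 5): 96,
}


def to_distance(score):
    """Assumed conversion: treat the score as percent identity."""
    return 1.0 - score / 100.0


distances = {pair: round(to_distance(s), 2) for pair, s in scores.items()}

# The pair with the smallest distance is the most closely related
closest = min(distances, key=distances.get)
```

A distance method such as neighbor joining would take a table like `distances` as its input when building the guide tree.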
Progressive MSA stage 2 of 3: generate guide tree
five closely related lipocalins
3 (rat RBP)
2 (murine RBP)
4 (porcine RBP)
5 (bovine RBP)
1 (human RBP)
((Human RBP:0.04284,(Mouse RBP:0.00075,Rat RBP:0.00423):0.10542):0.01900,Pig RBP:0.01924,Bovine RBP:0.01902);
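The tree description above is a parenthesized string in which each name is followed by a colon and a branch length. The sketch below pulls out only the leaf names and their branch lengths with a regular expression. It is not a full parser, and it assumes leaf names contain only letters and spaces.

```python
import re

# Guide tree string from the slide (whitespace normalized)
newick = (
    "((Human RBP:0.04284,(Mouse RBP:0.00075,Rat RBP:0.00423):0.10542):0.01900,"
    "Pig RBP:0.01924,Bovine RBP:0.01902);"
)

# A leaf is a name starting with a letter, then ':' and a number.
# Internal branch lengths follow ')' and so are not matched.
leaf_pattern = re.compile(r"([A-Za-z][A-Za-z ]*):([0-9.]+)")
leaves = {name.strip(): float(length) for name, length in leaf_pattern.findall(newick)}

# The two leaves with the shortest branches
shortest = sorted(leaves, key=leaves.get)[:2]
```

The two shortest branches belong to the mouse and rat sequences, which sit together in the innermost parentheses of the tree.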
Feng-Doolittle stage 3: progressive alignment
• Make a MSA based on the order in the guide tree
• Start with the two most closely related sequences
• Then add the next closest sequence
• Continue until all sequences are added to the MSA
• Rule: “once a gap, always a gap”
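The order in which sequences join the alignment can be sketched from the stage 1 scores. This is a deliberate simplification that adds one sequence at a time by best score to any sequence already included; a program that follows the guide tree would instead join sequences or groups of sequences in tree order. The function names are hypothetical.

```python
# Pairwise scores from stage 1, keyed by (sequence i, sequence j)
scores = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92,
    (2, 3): 99, (2, 4): 86, (2, 5): 85,
    (3, 4): 85, (3, 5): 84,
    (4, 5): 96,
}


def score_between(i, j):
    return scores[(min(i, j), max(i, j))]


def progressive_order(n):
    """Greedy joining order: closest pair first, then the next closest sequence."""
    order = list(max(scores, key=scores.get))  # the two most similar sequences
    remaining = set(range(1, n + 1)) - set(order)
    while remaining:
        # Pick the sequence most similar to anything already in the alignment
        nxt = max(sorted(remaining),
                  key=lambda s: max(score_between(s, m) for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


order = progressive_order(5)
```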
Clustal W alignment of 5 closely related lipocalins
CLUSTAL W (1.82) multiple sequence alignment
gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN  ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|     MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS  MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                  ********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|     EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS  EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                                *********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                                ****************:*******:****:*:* ****** *********
Why “once a gap, always a gap”?
• There are many possible ways to make a MSA
• Where gaps are added is a critical question
• Gaps are often added to the first two (closest) sequences
• To change the initial gap choices later on would be to give more weight to distantly related sequences
• To maintain the initial gap choices is to trust that those gaps are most believable
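The rule can be illustrated with a toy example. The sequences and function names below are hypothetical. When a new sequence needs an extra column, a gap is opened in every row of the existing alignment, and gaps already present are never removed or moved.

```python
def open_gap_column(alignment, col):
    """Insert a gap at column index col in every row of the alignment."""
    return [row[:col] + "-" + row[col:] for row in alignment]


def add_aligned_sequence(alignment, new_row, new_gap_cols):
    """Add an already-gapped sequence, opening new gap columns where needed.

    new_gap_cols are column indices in the coordinates of the final alignment.
    Existing gaps in the alignment are left untouched.
    """
    for col in sorted(new_gap_cols):
        alignment = open_gap_column(alignment, col)
    if any(len(row) != len(new_row) for row in alignment):
        raise ValueError("rows must have equal length after gap insertion")
    return alignment + [new_row]


# Toy alignment of the two closest sequences; row 2 already carries a gap
initial = ["GTWYA", "G-WYA"]

# A third toy sequence has an extra residue (F), so column 3 is opened in all rows
final = add_aligned_sequence(initial, "GTWFYA", [3])
```

In `final`, the gap from the first pairwise alignment is still in place in row 2, and the new column appears as a gap in both of the earlier rows.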
Multiple sequence alignment to profile HMMs
• Hidden Markov models (HMMs) consist of “states” that describe the probability of having a particular amino acid residue at each position (column) of a multiple sequence alignment
• HMMs are probabilistic models
• Like a hammer is more refined than a blast, an HMM gives more sensitive alignments than traditional techniques such as progressive alignments
GTWYA (hs RBP)
GLWYA (mus RBP)
GRWYE (apoD)
GTWYE (E. coli)
GEWFS (MUP4)
An HMM is constructed from a MSA
Example: five lipocalins
GTWYA
GLWYA
GRWYE
GTWYE
GEWFS

Prob.    1     2     3     4     5
p(G)    1.0
p(T)          0.4
p(L)          0.2
p(R)          0.2
p(E)          0.2               0.4
p(W)                1.0
p(Y)                      0.8
p(F)                      0.2
p(A)                            0.4
p(S)                            0.2
P(GEWYE) = (1.0)(0.2)(1.0)(0.8)(0.4) = 0.064
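log score = ln(1.0) + ln(0.2) + ln(1.0) + ln(0.8) + ln(0.4) = ln(0.064) = -2.75 (strictly a log probability; a log-odds score would also divide each term by the residue's background frequency)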
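The calculation above can be reproduced from the five aligned lipocalin fragments. This is a minimal sketch with hypothetical function names. It uses raw column frequencies only, so any residue not seen in a column gets probability 0; a full profile HMM would also model insertions and deletions and would normally add pseudocounts.

```python
import math
from collections import Counter

sequences = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]


def column_probabilities(seqs):
    """One dict per alignment column, mapping residue -> observed frequency."""
    n = len(seqs)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*seqs)
    ]


def sequence_probability(profile, seq):
    """Product of the per-column probabilities of each residue in seq."""
    p = 1.0
    for column, residue in zip(profile, seq):
        p *= column.get(residue, 0.0)
    return p


profile = column_probabilities(sequences)
p = sequence_probability(profile, "GEWYE")
log_score = math.log(p)  # natural log, equal to the sum of the per-column logs
```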
State 1: G:1.0
State 2: T:0.4 L:0.2 R:0.2 E:0.2
State 3: W:1.0
State 4: Y:0.8 F:0.2
State 5: E:0.4 A:0.4 S:0.2
Databases of multiple sequence alignments
• BLOCKS (HMM)
• CDD (HMM)
• DOMO (Gapped MSA)
• INTERPRO
• iProClass
• MetaFAM
• Pfam (profile HMM library)
• PRINTS
• PRODOM (PSI-BLAST)
• PROSITE
• SMART
CDD uses RPS-BLAST: reverse position-specific BLAST
Query = your favorite protein
Database = set of many PSSMs (position-specific scoring matrices)
CDD is related to PSI-BLAST, but distinct
CDD searches against profiles generated from pre-selected alignments
Purpose: to find conserved domains in the query sequence
You can access CDD via DART at NCBI
Multiple sequence alignment algorithms
               Local      Global
Progressive    PIMA       CLUSTAL, PileUp, other
Iterative      DIALIGN    SAGA
Multiple sequence alignment programs
AMAS, CINEMA, ClustalW, ClustalX, DIALIGN, HMMT, Match-Box, MultAlin, MSA, Musca, PileUp, SAGA, T-COFFEE
Clustal X
GCG PileUp
Assessment of alternative multiple sequence alignment algorithms
[1] As percent identity among proteins drops, performance (accuracy) declines also. This is especially severe for proteins < 25% identity.
Proteins <25% identity: 65% of residues align well
Proteins <40% identity: 80% of residues align well
[2] “Orphan” sequences are highly divergent members of a family. Surprisingly, orphans do not disrupt alignments. Also surprisingly, global alignment algorithms outperform local.