3. Analysis and alignment of sequences
• 3.1 Compositional bias in biological sequences
• 3.2 Alignment of pairs of sequences
• 3.3 Database searching for similar sequences
• 3.4 Multiple sequence alignment and domain finding
CACTAGTCTCTGTACTAGCCACTAGAAGTACTAACCTTTCACACTAATATATCTATCTCCTGCTGCATTTAGTACACAAGTTCATAAAAGCACCCTATTTCTATAAAAAAAATACGGTAAATGTAGCAACTTACTAGTACCATAAGAAATTTTGCTGATCTAGCTAACTTATTACTAGCTACTTGCTAGGTCTGAACACTATTAAAATGTAACAATACACTTACCTCCTTGATCTGTGCAGCCCTGTTCTCACGCTGGCTTCTATGGTGCGAGTAGTATTCCTAGGTTTTCGTAGGCTTTTATAGCAACAGCTTTCTTCGGACCGAATGAGACACCTGCCTTGTTTATGAGAGGGATGGATAGCTTTCACCTGCTGGACATTTATTTGTTTTTTTTTACTGGTCACTACATTCCTATCCACTGGTGCATATCTATCCTATCCCCTTTGGTCAGTAAAATATACTGCCTCCCCCATTCTCTTTCTTTCTCTATCTTTCTCTAAGCTTAACACACTTTAAGTTCACAAAATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTAGCAGGCTTCCCTCCTTTAGAAATTTCATCGTCGAAATTATTATACCTTGGTGATGGAAAAACTGAGGCTAGTTTTTTCTGGAGATCATCTTCCTTCTCCCATGTGGCCTCATCCATGGTGTGATGACTCCATTGTACCTTTAAAAATCTAATTGTTTGGTTCCTTGTTTTTAGATCTTTAATATCCAAGATACAAACAGGATATTCCTGATATGTCAAATCGTTATGCAACTCAGCCATAGGAATTTCAACTTAATCACTTGGCCTCCGAAGGCATTTACGAAGCATGGAGATGTGGAATACATCATGTACCCCGGTGAAAGCATCTGGTAGCTTTAGCATGTAAGGCACTTCTCCTATTTGCTTAACAATTGTAAATGGTCCAACATATCTGGAACTTATTTTTTTTCCAAGTCCGAATCGCTTAATTCCCTTTATAGGTGATACTTTTAAATATACCCAGTCACCTATATCAAAGTTAAGATCCCTTCTCCTATTATCTGCATAACTTTTTTTGTCTATTTTGAGCTGTTTGCAGTCGTTCCCGTATCAGTCGTATTGTTTCTTCTATCTGTTGTATTATATCCGGTCCTAACAATTTTCTTTCTCCTACTTCGTTCCAGCAAACAGGTGTTCTGCATTTCCTTCCATATAAGGCTTCATACGGAGCCATTTGTATACTAGATTGATAACTATTGTTATATGCAAATTCTGCTAATGGCATAAATTCTTTCCATGATCCTTTAAATTCTAGGATGCAAGATCGTAAAATATTTTCAATTATTTGATTCACCCTTTCAGTTTGTCCATCGGTTTGGGGGTGATACGCTGCACTGAAATCTAATGTTGTTCCCACGGGCTTGTGTAGTCTTTTCTAGAAATTGGACAGAAACTGTGTATCTCTGTCTGACACAATCCTTCTTGGAACACCATGTAAAGATACTATTTCTTTGACATATAGTTTAGCTAACCTTTCCAAAGAAAATTTGCTTTTAACGGGTATGAAATGAGCAGATTTTGTTAACCGATCCACTATTCAGATACTATCATTTCCTGGAGGTGTGGTAGGTAATCCTTGAACAAAGTCCATACTGATTTCTTCTCATTTCCATAGTGGAATACTTAAGGGTTGTAACAGTCTTGCCGGCCTTTGATGTTCAACTTTTACGCATTGGCAGATATCACATTCTGCAATGAATTTTGCAATTTCTATTTTCATGATACATTTTGGTACTTCCTGGATGTATGGTATAGGGAGAGAAATGTGATTCTTCCAATATTCTCTGTTTTAAATTAGGGTCGTTAGGCACACACAATCTATTTTTGAAACATATAGCACCATTATGATCAATTCGAAATTCAGACACCTTCCCTTCTTCAATATTTTTCTTTGCCTTTTGC
AATCCACTGTCGTCTCTTTGTTTCTCTAGAATATTTTCTTCTAAAGTAGGCTTTATTTGAAGCACGGGTAATAATACTCTGGGTTCATGGATCTTTAATTCCACATCCAATCTTTCCAAGTCTCTAAGTATATGTTGATCCTGTGTGATCTGAATAGCCATATTACAAAGAGCTTTTCGACTTAGAGCATCTTCCACAATGTTGGCTTTCAGAGGGTGATAATGAATATTCAAATCATAATCTTTCAATAATTCTAACCATCCCCTTTATCTCATATTCAATTCCTTCTGAGTAAATATGTACTTTAAACTTTTGTGGTCAGTAAATATTTCACAATGCTCACCATATAGGTAATGTCTCCAGATTTTTAAGGCAAAAATAACAGCAGCTAATTCCATATCATGGGTTGGATAATTTTGCTCGTATGGCTTTAATTGACGCGAAGCATAGGCAATTACCTTAGCTTTTTGCATGAGAACACAACCTAATCCAATTTTTGAAGCATCACAGTAAATAGTAAATTCTTCTCCCATTATAGGCAAGGCAAGAATAGTAAATTCTTTGCAATTCTGAGTCCACTCATATTTTACTCCCTTTTGTGTCAACCGGGTTAGAGGAGCTGCAATTCTAGCGAAGTTACTAATAAATCGACGGTAATATCCCGCCAACCCAAGAAAACTTCGTATCTCGGTTACCGATGAGGGCCTTTTCCACTCTGAGACGGTTTTGACCTTTTCAGGGTCCACTGATATACCTTCACCCGAAATAACATGACCAAGCAAAAATACTTTATCCATCCAGAAATCGCATTTCTTTAATTTGGCAAATAGTTTATGATCTCGCAATGTCTTGTAGTACTATTCTCAAATGATTTGCATGATCTTCCTTAGTCTTGGAATATATCAAAATATCATCTATATATAAATACAACTACAAATTAATCAAGATAAGGCTTGAATTTACGATTCATTAAATCCATAAAAGCTGCCGGTGCATTAGTCAAACCAAATGGCATTACTAGATATTCATAGTGTCCATAGCATGCACGGAAAGCAGTCTTGGGTATATCACTAGGTTTAATCTTTAGTTGATGGTAGCCTGATTGAAGATCAATTTTTGAGAAAACCCGAGCTCCTTGTAGTTGATCAAATAGATCGTCTATCCTTGGTAAAGGATATTTGTTTTTGATAGTCACCTTATTCAGTTCTCGGTAATCCGTGCATAATCGCATAGTTCCATCCTTTTTCTTGACAAATAGAACAGGAACACCCCCACGGGGAGACACTAGGACAAATGAATCCTTTATCTTCTAATTCTTTTAATTGTACATTTAGTTCCTTTAGCTCAACAGGGGCCATTATGTAGGGTGCCTAATAAATCGGAGTAGTTCCTGGTCCTATTTCAATACCAAATTCAATCTCTCGATCTAGTGCTAATCCTGGTAATTCAGCTGGAAAAACTGGAAACTCATTCACAATTGGCATTCCTTCCCAACTTGCTTCCTTTCTCATGATTTCTGCCACTAAAGGTCTTGGTAAATTGTTTTAATCTCCATGGTAAGTAATTTGGTTTTGATCCCATGGTTTAAGTGTAATTTGTTTTTCATGGCAATCAATATTTGCTTTGTTCTTACATAACCAATCCATACCAAGTATAATATCAAAATCATGCATATCCAAGGGTATGAGGTCAGCAGTTAATTCCCATCCATCAATAGTAATTGGACACAATTTGCAAATTAAATTAGTTATTTGGCTATCCAAAGGAGTTTCTATGCAAATCCTTTCTTTTAATTGACTAGTAGGGATGGTGTATTTTCTCACGAAGTTGGTGGAGATAAACGAATGTGTTGCGCCAGAATCAAATAAAACTTTACCAGGATAAGAGCACACTAAGACATTACCTGTAACCACGGTGTTGGATTTTTCGGCTGTGCTCTTAGTTAAGTTGTATACCCCAAGCGCGATTCCCACCTTGTGAATTATTCGACCGTATTCCTC
ATGTAGTATTAGTATTTGCAGGTGGCTTTCCTTGATTTGGCCCATTATTATTTGCTGAAGATGGTCCAGGTAAATAAAGCGACGGTACTGAAGTCAATACTTTAGTACTTGGCTGAGTAGTTCAATTAACTCGATTTTTACCCTTCTGTAACAGAGGACAAAGGTATCTAGTATGTCCTGCTTCTCCACACTCAAAGCACCTTCCCCACCGATTAGGACAAATTGATGGAACATGGCCACCTTGGCATATTGGACATTTTCTGTCTTGATTTTCTAAAGATTCCCTCTACATTTTTCCAGAGTAGTTTCCACGGAATCTTCCCTGGTTTTGTTGATTATTTGTCTTGAATTTCTTTTGGGGTTGTCCGTGTTCTATTCTTTGTTCATGATACCCCTTCTCAAGAAGTTGTGCTTTACTTACTACCTCCCTGAATATGGTTAATTCAAAGGCTTCGACACACCTTTTGAGAGGTTGGCGTAATCCACTTTCAAATCGTCGAGCTTTAGAGCCGTCCGTTTGTACAAATTCAGGAGCAAATCTTGCAAGTCTCGAAAATTCTATTTCATATTCTACTACAGATTTATTACCTTACTTAAGCTCTAGAAATTCCTTCTTCATTCTCTTCACACTTTCTGGAAAATATTTCTTGTAAAAAGCTTCTTTGAATATTTCCCATGTAATAGAGATACGTTCCGAATATGACTTTTTGTGAGCATCCCACCATTCAAAAGCACTAGACTGAAGCATATAGGTAGCATATGTAATCTTTTCTTTATCTGTACAACCCATAGCTTCAAATGCCTTTTCCATTGCTACTATCCAAACTTCCGCTTCAAGTGGATTGGTAGTTCCTGAAAGGAAAAAGTATGAATTACCCCCTGAACTATTGCGAGAGTATGAATTACCCCCCCCCCCCAAAACCACAAAACCAGACATATTAAACCTCAAACTATTGAAATCGGATTACCCCCCCTGATTCAATCCGGAGCGGTTTGGTCCTACGTGGCATACACGTGGCACCGCCATGGAAATCCAATCAGCAATATTAGGTGGTCCCACATGTCATGATCATGTATTTCTTCCACTTTCCCCTCTCTTCATCTCCTCCAGGGCAAATAGAAAGCGGCGCGGTGGTGGCGCTCTCCAGGGCGGCCGGGGGAAGCGGCGGCGGCGGCGTCCAGGGCGGGTGGGGGAAGCGGCGGCGTCCAGGGCGGCTGCGGAAGCGACGGCGGCGTCCAGGGTGGGCTAGGGAAGCGGCGGCTTCTAGGGCAAGCTGGGGAAGTGGCGGCGGTGGCGGCGACGGCGGCGTCCAGGGCGGGCTGGGGAAGCAGCGGCGTCCAGGGCAGGCGGGGAAGTGGCGGTGATGACGGCGCCCTCCAGGTCGAACTGGGGTGGTGGCGGGGAAGTGACGGCAGCGACGGCGCCCTCCAGGGCAGGTAGGGGAAGCGGTGGCGGCGGGTGTGGCGGGAGCGCTCGTGCGGTGGGCGCGGCGGGAGCGGGAGCGGGCGCGGCGAGGAGCAGGCGCTTGTGCTCCTCCTCCGTGGCGCCAGAGATGGAGCGGGCGCTCGTGAGCGGGTCGGCCGCCGCTGCGAGCTCGCCGTGGAGGCGGCGAGAATCGAGATCGACGGCGAGCTCCACGGAGATGGAGAGAAGAAGGGAAGGGGCAAAGAGGAGGGGGAGAAGAGGAGGGTTGGGCAGACAGTGGGCCCCACCATATTTATTTGTTGTGGCTGACAAGTGGGTCCTATATATTTTTCTTTTGTTTTAGCTGACCAGACTGCCACATGGGCATCCACGTAGGACCGAAACCACCCTATATCGATCTAGGGGGTAATTCATCCGGTTTGTAAAGTTCAGGGTTAAAAATAACTGGTATTGGAGTTCAGGGTTAAAAATCGGACGACCGTAATTGTTGAGGGGGTAATTCGTACTTTTTCCTTCTTGAAAATGTTGGTGGCTTCAATTTCTGAAATTCCCCAAGTCCATTCCGGTTAGCATC
ACTTTTAGTAGTACGTTCTAAAATCTCCATCTATCGTTGTTGGGTTTCCTGTTGCTTGCCCAATATATTCGCGAGTAAGTTAGCCCAAGGGTCTTGACTACTTGCACTAGGTATTATTGATCCAGTGGCACCATTACTAGTATTATTTCCATCCTGACTAGTACCATTGTTGTCGTTGTTTTGCTCCATCTATCATATTCAACTCATTAGCCAGAATACATAAATGATCATTGGATGGATCTCAAAATGGTAACAAAAATCAGATTTACTATAAAATATTCAATATAGGTAATATTAAAATAAAACTATTTAGTTATATTATCATCATTATACTTTTCTCTTCTTATTTTAGTCTTATCATTATTCTTAACATGCACCAGTTAAAAAATAAATAAATAAAATTAGTACAAACCACAAGCACCACAGCACTAGTGCATTACGGTCATGTTTAGATTCAAATTTTTTTCTTCAAACTTCTAACTTTTCCGTCACATCAAATGTTTGGACACATGCATGGAGCATTAAATGTGGAGAAAAAAACAATTGCACAGTTTGCATGTAAATTGTGAGACGAATCTTTTGAGCCTAATTACACCATGATTTGACAATGTGATGCTATAGTAAACATTTGTTAATGATAGATTAATTAGTCTTAATAAATTCATCTCGCAGTTTACAGGTGAAATCTGTAATTTGTTTTGTTATTAGTCTACATTTAATACTTCAAATGTATATCCATATACTTGAAAAAAAATTTGGCACACGAACTAAACACAGCCTACTTCGACGAAAAGAAAGTGCAGGAGCCTATCATGCTACACAAACACTAAGGCAAACACCTACTGGTGTACTAGTGCCACATACAGAGCTCTGGTTGTTTACACAAGATGTCTAGAAAGACATCACCATGAGTTCTGATGTTAACTCTTCAGTTCTAAAAGCTCCTTTGGCTGTCTCGTGACCCATCCACACATGCTACTAACACTAAGGGTGTGTAGGGTGTGTTTAGTTCACACCAAAATTGAAAGTTTGGTTGAAATTGAAACGATGTGACGGAAAAGTTGAAGTTTACGTGTGTAGGAGAGTTTTGATGTGATGAAAAAGTTAAAAGTTTGAAGAAAAATTTTGGAACTAAACTCAGCCTAAAGGACTTATTATAGTGGAGTACATCCCATCCCAAGGGAAAACAAAACCCATACTGACACCACTCCTACATCTCACACACTGCCACTAGAGCTGTCACTACCCCCAACCCCACTCTGCAGAACAGTAAATGGTTTCACTCAGGTAGCAGACGCGGTGGTACAGGCGATAGGTGAGGCGCTCCAGAAACATAGGCTGTGTTTAGATGGTGGAAAAGTTGGGAGGTTGGGAGAAAGTTAGTAGTTTGGAGAAAAAGTTGGTAGTTTATGTGTGTACGAAAGTTTTCGATGTGATGTGATGTGATGGAAAGTTAGGAATTTGGGGGGAACTAAACACGGCCATAACTTCATTCTCACTGGAGCGAACAATAGTCGGCAGTTATTTTTATATACATATTTGTTAAAGAAGAAATATTACTGTCCATGGATATTAATGGCCGATAAATAGTATAAAAAACATTAAATATAGTAAGTGATTTAAATACATTCTGCAGAGGTATTAAAATAATTGTCATAATCTCGTTCCTTCAATCCATTTTTTTCCAACTAGTGATACCTCATCTGAGAATCACGGCGCCGAATTCCCTACTTGTGTGAGGCATTCCTTCTCTCACACTGATATCAGCCGACCCGATATCGTTGTTTCAGGTATCGGCCGTCTCAGGCTAAGTATCAAAATCATGTTCCATGATTATGACGTTATTATTCTCACTGATAAAATCATCAATCAATTATTCGGGAGTTAATAATATTTACCGTTAGATCGTTAGTATCATCATCCCAATATATAATACAGGTAAGCGAATTTAGTTAGAGATGATTAAGTAAAATAGTTGATGGACACAGTCTTGCCTTCTC
TTTTGTTGTTCTTCCTCTGCATCCCACCTAATCAAATATACATGTCTTTGGTATTAATTTATATCTATATTTGTTATGCAGGACATTAGCTACTGGAACCAGCTACTAGGACCATAGATAGCTAGTTGATGTGACTCTACTGGAGAAAGAAAACCAACATGTAGGCCTAGTTTATTTCCCCCAAAATTTTTCCCAAAAACATCACATTGAATCTTTGGACATATGCATGGAGCATTAAATATAGATTAAAAAAACTAATTGCACAGTTAGGGGGAAAATCACGAGACGAATCTTTTGAGCCTTATTAATCCATGATTAGCCATAAGTGCTACAGTAATGCCAGCTGGGCGAGGAGAGGTGGCAGTGGTGGTGAGCCCAGCTGGGTGGATGTGTGGAGGGTGGAGAGGAGACGGGGAGGGAGGGAGGGAGGGAGAGAGGACTAGG
3.1 Compositional bias in biological sequences
An obvious first summary of a DNA sequence is just the distribution of the four base types.
Almost all empirical studies show an unequal distribution of the four bases.
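As a minimal sketch of this first summary, the Python below counts the four bases and also reports the GC fraction in a sliding window, in the spirit of the 10-bp window mentioned for the base-content plots. The function names `base_composition` and `sliding_gc` are illustrative choices and not part of the course material; characters other than A, C, G and T are ignored when computing the composition.

```python
from collections import Counter


def base_composition(seq):
    """Return the fraction of A, C, G and T among the unambiguous bases."""
    counts = Counter(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def sliding_gc(seq, window=10):
    """GC fraction in every window of the given length (step of 1 bp)."""
    seq = seq.upper()
    return [
        sum(ch in "GC" for ch in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


if __name__ == "__main__":
    print(base_composition("CTATAATCCC"))
    print(sliding_gc("GGGGGAAAAA", window=5))
```

An unequal distribution shows up directly as fractions that differ from 0.25.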
Promoter sequences
[Figure: Base content (A, T, G, C and GC) as a function of cDNA position relative to the transcription start site (TSS), averaged over all cDNAs with a 10-bp sliding window. Panels: Rice, Arabidopsis, Human. x-axis: cDNA coordinate in units of 100 bp (ticks 1, 0, -1, -2, -3, -4); y-axis: base fraction.]
Three patterns of base contents
[Figure: the Rice, Arabidopsis and Human base-content profiles around the TSS.]
Neighboring bases are not independent
Puv ≠ PuPv
Example: dinucleotide frequencies in some vertebrate sequences.
Based on 166 vertebrate sequences, totaling 136,731 bases (Nussinov, 1984)
Pair   Observed/Expected
TG     1.29
CT     1.26
CC     1.18
AG     1.16
AA     1.15
CA     1.15
GG     1.14
TT     1.07
GA     1.04
TC     1.00
GC     0.99
AT     0.85
AC     0.84
GT     0.82
TA     0.65
CG     0.42
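Neighboring base pair: observed frequency / expected frequency*
Pair   Human   Rice
CC     1.27    1.05
GG     1.22    1.03
CA     1.20    1.11
TG     1.19    1.11
AG     1.18    0.99
CT     1.15    0.99
TT     1.13    1.13
AA     1.13    1.11
GC     1.02    1.11
GA     0.99    1.05
TC     0.96    1.00
AT     0.88    1.02
GT     0.84    0.84
AC     0.83    0.86
TA     0.75    0.77
CG     0.26    0.83
* Data are from the DNA sequences of all genes currently annotated in these two species, with total lengths of 168,717,208 and 1,506,657,427 bases, respectively (Qiu Jie (邱杰), 2016)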
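The observed/expected ratios in these tables can be computed as Puv / (Pu Pv). The following is a minimal sketch, assuming the input contains only A, C, G and T and that Pu is estimated from the same sequence; the function name `dinucleotide_obs_exp` is hypothetical.

```python
from collections import Counter


def dinucleotide_obs_exp(seq):
    """Observed/expected ratio P_uv / (P_u * P_v) for every dinucleotide seen.

    Assumes seq contains only A, C, G and T (no ambiguity codes).
    """
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("need at least two bases")
    # Single-base frequencies P_u
    base_freq = {b: c / len(seq) for b, c in Counter(seq).items()}
    # Overlapping neighbouring pairs: positions (0,1), (1,2), ...
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_pairs = len(seq) - 1
    return {
        pair: (count / n_pairs) / (base_freq[pair[0]] * base_freq[pair[1]])
        for pair, count in pairs.items()
    }


if __name__ == "__main__":
    for pair, ratio in sorted(dinucleotide_obs_exp("CTATAATCCC").items()):
        print(pair, round(ratio, 2))
```

A ratio near 1 means the pair occurs about as often as independence would predict; a ratio well below 1, such as CG in the human column, means the pair is under-represented.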
3.2 Alignment of Pairs of Sequences
• The most basic sequence analysis task is to ask if two sequences are related.
• This is usually done by first aligning the sequences (or parts of them) and deciding whether that alignment is more likely to have occurred because the sequences are related, or just by chance.
• Sequence alignment is the procedure of comparing two (pairwise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that occur in the same order in the sequences.
Many potential alignments for two sequences!
a: CTGTATC
b: CTATAATCCC

CTATAATCCC   CT-ATAATCCC   CTATAATCCC
CTGTA-TC     CTG-TA-T-C    CTGTA-T--C

CTATAATCCC   CTATAATCCC    CTATAATCCC
CTGT-ATC     CTGT-AT-C     CTGT-AT--C
Three key steps to answer whether two sequences are related
(1) The scoring system used to rank alignments;
(2) The algorithm used to find optimal (or good) scoring alignments;
(3) The statistical methods used to evaluate the significance of an alignment score.
What's a sequence alignment?
Global alignment: In a global alignment, the alignment is stretched over the entire sequence length to include as many matching amino acids or nucleotides as possible, up to and including the sequence ends.
Local alignment: In a local alignment, the alignment stops at the ends of regions of identity or strong similarity, and a much higher priority is given to finding these local regions than to extending the alignment to include more neighboring amino acid or nucleotide pairs.
Three principal methods of pairwise sequence alignment
• Dot matrix pairwise sequence comparison
• The dynamic programming (DP) algorithm
  • Needleman and Wunsch (1970)
  • Smith and Waterman (1981)
• Word or k-tuple methods
  • heuristic algorithms, used by the programs FASTA and BLAST (see Section 3.3)
DOT MATRIX SEQUENCE COMPARISON
SWISS, 2002
Dot matrix analysis of the human LDL receptor against itself
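A dot matrix places one sequence along each axis and marks every position where the two agree. The sketch below is a minimal illustration, not the program used to draw the figure; the `window` and `min_matches` parameters are an assumed (though common) way of filtering out random single-base matches, and the function names are hypothetical.

```python
def dot_matrix(a, b, window=1, min_matches=1):
    """Boolean matrix: cell (i, j) is True when the windows starting at
    a[i] and b[j] share at least min_matches identical positions."""
    rows = []
    for i in range(len(a) - window + 1):
        row = []
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            row.append(matches >= min_matches)
        rows.append(row)
    return rows


def render(rows):
    """Draw the matrix as text, '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if hit else "." for hit in row) for row in rows)


if __name__ == "__main__":
    # Comparing a sequence with itself gives a full main diagonal;
    # repeats show up as additional off-diagonal runs.
    print(render(dot_matrix("CTGTATC", "CTGTATC")))
```

Comparing a sequence against itself, as in the LDL receptor example, always produces the main diagonal, so the informative features are the off-diagonal lines.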
DP: keep every step optimal.
The DP algorithm
• Needleman-Wunsch algorithm
Global searches for similarity, which take into account the total lengths of the sequences, are used to align sequences in such a way as to maximize the degree of global similarity. Alignment of sequences of unequal length necessarily requires the introduction of gaps in one sequence.
The DP algorithm
Do we get the best score
1. by aligning C with A and adding the score to the diagonal score x? OR
2. by placing a gap opposite either C or A and subtracting the gap penalty from the highest score in row y or column z?
[Figure: scoring matrix with Sequence 1 (residue A) across the top and Sequence 2 (residue C) down the side; x marks the diagonal cell, y the row and z the column.]
Mount D, 2002
A record of the path that produced the highest score to reach the matrix position is then kept.
CTATAATCCC
CTGTA-T-C
[Figure: empty alignment matrix with C T A T A A T C C C across the top and C T G T A T C down the side.]
Algebraically, the algorithm can be described with elements ai and bj
Within cell (i,j), the additions to distance for the three possible events that lead to that cell are (1) vertical movement from cell (i-1,j) to cell (i,j) by inserting a gap in sequence b; (2) diagonal movement from cell (i-1, j-1) to cell (i,j) by adding elements ai and bj; (3) horizontal movement from cell (i, j-1) to cell (i,j) by inserting a gap in sequence a.
• The distance d(ai,bj) associated with cell (i, j) is taken to be the minimum of the distances in each of the three neighboring cells plus the associated weights:
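\[
d(a_i, b_j) = \min
\begin{cases}
d(a_{i-1}, b_j) + w(a_i) \\
d(a_{i-1}, b_{j-1}) + w(a_i, b_j) \\
d(a_i, b_{j-1}) + w(b_j)
\end{cases}
\]
Here w(a_i, b_j) is the weight for aligning a_i with b_j, and w(a_i) and w(b_j) are the weights for placing a_i or b_j opposite a gap.
And there is a need for initial conditions:
\[
d(a_0, b_0) = 0, \qquad
d(a_i, b_0) = \sum_{k=1}^{i} w(a_k), \qquad
d(a_0, b_j) = \sum_{k=1}^{j} w(b_k)
\]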
For two small sequences
a: CTGTATC
b: CTATAATCCC
• Weights (distances) of 0 and 1 for each matched and mismatched element, respectively, and 3 for each element opposite a gap
• How many paths give alignments of smallest distance, that is, how many optimal alignments are possible?
Filling in the first cells:
a       C   T   A   T   A   A   T   C   C   C
b   0   3   6   9  12  15  18  21  24  27  30
C   3   0   3   6
T   6   3   0
G   9   6   3
T  12   9   6
A  15
T  18
C  21
The completed matrix:
a       C   T   A   T   A   A   T   C   C   C
b   0   3   6   9  12  15  18  21  24  27  30
C   3   0   3   6   9  12  15  18  21  24  27
T   6   3   0   3   6   9  12  15  18  21  24
G   9   6   3   1   4   7  10  13  16  19  22
T  12   9   6   4   1   4   7  10  13  16  19
A  15  12   9   6   4   1   4   7  10  13  16
T  18  15  12   9   6   4   2   4   7  10  13
C  21  18  15  12   9   7   5   3   4   7  10
Six optimal alignments

CTATAATCCC   CTATAATCCC   CTATAATCCC
CTGTA-TC--   CTGTA-T-C-   CTGTA-T--C

CTATAATCCC   CTATAATCCC   CTATAATCCC
CTGT-ATC--   CTGT-AT-C-   CTGT-AT--C
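The distance form of the dynamic programming recurrence can be sketched in a few lines of Python. This is a minimal illustration, assuming distances of 0 for a match, 1 for a mismatch and 3 for each element opposite a gap; the function names `distance_matrix` and `count_optimal_alignments` are hypothetical. The second function counts minimum-distance paths without listing them.

```python
def distance_matrix(a, b, mismatch=1, gap=3):
    """Fill d[i][j], the minimum distance between a[:i] and b[:j]."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # initial conditions: leading gaps
        d[i][0] = d[i - 1][0] + gap
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            d[i][j] = min(d[i - 1][j] + gap,      # vertical move
                          d[i - 1][j - 1] + sub,  # diagonal move
                          d[i][j - 1] + gap)      # horizontal move
    return d


def count_optimal_alignments(a, b, mismatch=1, gap=3):
    """Number of distinct minimum-distance paths through the matrix."""
    d = distance_matrix(a, b, mismatch, gap)
    m, n = len(a), len(b)
    paths = [[0] * (n + 1) for _ in range(m + 1)]
    paths[0][0] = 1
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            total = 0
            # Add the path counts of every predecessor that attains d[i][j].
            if i > 0 and d[i][j] == d[i - 1][j] + gap:
                total += paths[i - 1][j]
            if j > 0 and d[i][j] == d[i][j - 1] + gap:
                total += paths[i][j - 1]
            if i > 0 and j > 0:
                sub = 0 if a[i - 1] == b[j - 1] else mismatch
                if d[i][j] == d[i - 1][j - 1] + sub:
                    total += paths[i - 1][j - 1]
            paths[i][j] = total
    return paths[m][n]


if __name__ == "__main__":
    d = distance_matrix("CTGTATC", "CTATAATCCC")
    print(d[-1][-1], count_optimal_alignments("CTGTATC", "CTATAATCCC"))
```

For a = CTGTATC and b = CTATAATCCC the smallest distance is 10, reached by six different paths.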
• +5 for a match, -2 for a mismatch and –6 for each insertion or deletion
• Smith-Waterman algorithm
• Because distantly-related proteins may share only isolated regions of similarity, searches for local similarity may sometimes be more appropriate than global searches.
• Smith and Waterman described an algorithm to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity.
• Concepts similar to those in the Needleman-Wunsch algorithm.
• For sequences A = (a1, a2, …, am) and B = (b1, b2, …, bn), Hij is the similarity of the two segments (subsequences) ending in ai and bj.
• A segment ending in the pair ai, bj can be obtained by extending the segments ending in ai-1 and bj-1 by one aligned pair; alternatively, ai can lie at the end of a deletion of length k, or bj at the end of a deletion of length l. This gives:
\[
H_{ij} = \max \left\{ H_{i-1,j-1} + S(a_i, b_j),\; P_{ij},\; Q_{ij},\; 0 \right\}
\]
\[
P_{ij} = \max_{k \ge 1} \left( H_{i-k,j} + w_k \right), \qquad
Q_{ij} = \max_{l \ge 1} \left( H_{i,j-l} + w_l \right)
\]
• The initial values: Hi0 = 0 for 0 ≤ i ≤ m, and H0j = 0 for 0 ≤ j ≤ n
• S(ai,bj) = 1 (ai = bj) and -1/3 (ai ≠ bj)
• wk = -(1 + k/3) (i.e. -4/3 for a gap of length k = 1)
Example:
A: CTGTATC
B: CTATAATCCC
A           C     T     A     T     A     A     T     C     C     C
B     0     0     0     0     0     0     0     0     0     0     0
C     0     1     0     0     0     0     0     0     1     1     1
T     0     0     2  0.67     1     0     0     1     0  0.67  0.67
G     0     0  0.67  1.67  0.33  0.67     0     0  0.67     0  0.33
T     0     0     1  0.33  2.67  1.33  0.33     1     0  0.33     0
A     0     0     0     2  1.33  3.67  2.33     1  0.67     0     0
T     0     0     1  0.67     3  2.33  3.33  3.33     2  0.33     0
C     0     1     0  0.67  1.67  2.67     2     3  4.33     3  1.67
Scoring system: 1 for match; -0.33 for mismatch; -1.33 for a gap
Maximally similar segments
CTGTA-TC
CTATAATC
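The local-alignment recurrence can be sketched as follows, assuming a match score of 1, a mismatch score of -1/3 and a gap of length k costing 1 + k/3. The code stores the gap cost as a positive penalty and subtracts it, which is equivalent to adding a negative weight. Exact fractions are used to avoid rounding; the function name `smith_waterman` is illustrative, and the sketch only reports the best score and where it ends (no traceback).

```python
from fractions import Fraction


def smith_waterman(a, b,
                   match=Fraction(1),
                   mismatch=Fraction(-1, 3),
                   gap_penalty=lambda k: 1 + Fraction(k, 3)):
    """Return (H, best_score, (i, j)) for the local similarity matrix H."""
    m, n = len(a), len(b)
    H = [[Fraction(0)] * (n + 1) for _ in range(m + 1)]
    best, best_pos = Fraction(0), (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            diag = H[i - 1][j - 1] + s
            # Best way to end in a deletion of any length k (vertical)
            up = max(H[i - k][j] - gap_penalty(k) for k in range(1, i + 1))
            # Best way to end in a deletion of any length l (horizontal)
            left = max(H[i][j - l] - gap_penalty(l) for l in range(1, j + 1))
            H[i][j] = max(diag, up, left, Fraction(0))
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return H, best, best_pos


if __name__ == "__main__":
    H, best, pos = smith_waterman("CTGTATC", "CTATAATCCC")
    print(float(best), pos)
```

For CTGTATC against CTATAATCCC the best score is 13/3 (about 4.33), ending at the last base of the first sequence and the eighth base of the second: six matches, one mismatch and one gap of length 1.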
Scoring matrices and gap penalties in sequence alignments
• Protein chemists discovered early on that certain amino acid substitutions commonly occur in related proteins from different species. Because the protein still functions with these substitutions, the substituted amino acids are compatible with protein structure and function.
• Knowing the types of changes that are most and least common in a large number of proteins can assist with predicting alignments for any set of protein sequences
• If ancestor relationships among a group of proteins are assessed, the most likely amino acid changes that occurred during evolution can be predicted.
• In the amino acid substitution matrices, each matrix position is filled with a score that reflects how often one amino acid would have been paired with the other in an alignment of related protein sequences during evolution.
Gap penalties
• A large value is chosen for introducing a gap of size 1
• A small value is added for each increment in size
For example, in a BLAST search:
Gap opening penalty = 11
Gap extension penalty = 1
Penalty = 11, 12, 13, 14, etc.
These are called affine gap penalties: P = a + bx
These values are chosen to match the scale of the amino acid substitution scores in substitution tables such as the PAM and BLOSUM matrices.
Mount D, 2002
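A minimal sketch of the affine penalty, assuming the convention implied by the sequence 11, 12, 13, 14: a gap of length 1 costs the opening penalty, and each further position adds the extension penalty. The function name and this convention are assumptions; some programs also charge the extension penalty for the first gapped position, which shifts every value by one extension unit.

```python
def affine_gap_penalty(length, gap_open=11, gap_extend=1):
    """Penalty for a single gap of the given length (length >= 1)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    print([affine_gap_penalty(x) for x in range(1, 5)])
```

Because opening is much more expensive than extending, one long gap is penalized less than several short gaps of the same total length.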
Hypothetical example of scoring matrix values (log odds scores)
[Figure: a 20 amino acids × 20 amino acids scoring matrix; the example entries shown are C with C = 11 and W with P = -5.]
Mount D, 2002
Sequences are aligned using log odds scores
Log odds score
What is a log odds score? What is an odds score?
What does it mean to say that the odds of a horse winning a race are 8/1?
odds = chance of winning / chance of losing
What about the odds of two horses each winning, one with odds of 8/1 and the other with odds of 16/1?
odds = 8/1 times 16/1, or 128/1
Let's do this using logarithms to the base 2, or log odds scores.
log2 8 = 3 and log2 16 = 4
log odds of both horses winning is 3 + 4 = 7
odds of both horses winning is 2^7/1, or 128/1
Mount D, 2002
Using log odds scores for scoring an alignment of two sequences
  C     A     W
  C     W     A
 8/1  1/16  1/16
odds = 8/256 = 1/32
log odds = 3 - 4 - 4 = -5
odds = 1/2^5 = 1/32
odds of a horse winning a race = no. of races won / no. of races lost
Similarly, odds of correctly aligning two amino acids = no. of times they are aligned in sequences known to be related / no. of times they are aligned in sequences that are not related
Mount D, 2002
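The arithmetic of this example can be checked with a short sketch: multiplying the odds of each aligned pair is the same as adding their base-2 logarithms. The function name `alignment_log_odds` is illustrative only.

```python
import math


def alignment_log_odds(column_odds):
    """Sum of base-2 log odds over the aligned columns."""
    return sum(math.log2(odds) for odds in column_odds)


if __name__ == "__main__":
    odds = [8 / 1, 1 / 16, 1 / 16]     # C/C, A/W, W/A from the example
    score = alignment_log_odds(odds)   # 3 - 4 - 4
    print(score, 2 ** score)           # log odds and the equivalent odds
```

Working with sums of log odds rather than products of odds is what allows a scoring matrix to be used additively along an alignment.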
The scoring model
SWISS, 2002
Amino acid substitution matrices
• count matches of amino acid a with c in alignments of a large number of related sequences and divide by the number of matches in unrelated sequences to obtain the odds score
• average the number of changes of a to c and c to a
• convert to logarithms to base 2 or 10
• round log odds scores to whole numbers, multiplying by a factor if needed to deal with in-between values
• put in a scoring matrix
• examples are the PAM250 and the BLOSUM62 scoring matrices
Mount D, 2002
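The steps above can be sketched for a single matrix entry as follows. This is a toy illustration only: the function name, its parameters and the counts in the usage example are hypothetical, and real matrices such as PAM250 and BLOSUM62 are built from large curated alignments with additional normalization.

```python
import math


def log_odds_entry(n_a_to_c, n_c_to_a, n_unrelated, scale=1):
    """Rounded base-2 log odds score for the amino acid pair (a, c).

    n_a_to_c, n_c_to_a: counts of the two directions in related sequences
    n_unrelated:        count of the pair in unrelated sequences
    scale:              multiplier applied before rounding
    """
    n_related = (n_a_to_c + n_c_to_a) / 2   # average the two directions
    odds = n_related / n_unrelated          # odds score
    return round(scale * math.log2(odds))   # whole-number log odds


if __name__ == "__main__":
    # Hypothetical counts chosen so that the odds come out as 8/1.
    print(log_odds_entry(6, 10, 1))
```

A scale factor greater than 1 keeps more resolution for pairs whose log odds fall between whole numbers.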
Dayhoff Amino Acid Substitution Matrices (Percent Accepted Mutation or PAM Matrices)
• There is presently no other type of scoring matrix that is based on such sound evolutionary principles as are these matrices
• To prepare the Dayhoff PAM matrices, amino acid substitutions that occur in a group of evolving proteins were estimated using 1572 changes in 71 groups of protein sequences that were at least 85% similar.
(log odds ratio matrix)
PAM250
BLOSUM62
BLOSUM
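Henikoff and Henikoff proposed the BLOSUM (Blocks Substitution Matrices) matrices in 1992. They directly used multiple sequence alignments to analyze distantly related proteins, rather than closely related sequences. The advantage of this approach is that it agrees with actual observations; its weakness is that it cannot be tied to an evolutionary model. Extensive tests show that BLOSUM matrices are generally better suited than PAM matrices to analyzing biological relationships and to local similarity searches. The BLOSUM matrices are widely used for scoring protein sequence alignments.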
Comparison of the PAM and BLOSUM
• The PAM matrices are based on a mutational model of evolution that assumes amino acid changes occur as a Markov process, each amino acid change at a site being independent of previous changes at that site. In contrast, the BLOSUM matrices are not based on an explicit evolutionary model. They imply a starburst model, in which the proteins in each family share a common origin but closer versus more distant relationships are ignored, as if all were derived equally from the same ancestor.
• The PAM matrices are based on scoring all amino acid positions in related sequences, whereas the BLOSUM matrices are based on substitutions and conserved positions in blocks, which represent the most similar common regions in related sequences.
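BLOSUM: the larger the number after the matrix name, the closer the relationship. BLOSUM62 means that the sequence segments used are at least 62% identical at the aligned positions.
PAM: the larger the number after the matrix name, the more distant the relationship, e.g. PAM1, PAM100, PAM250.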