Sequence Comparison & Alignment
Computing in Molecular Biology
Hugues Sicotte
National Center for Biotechnology Information
C O M P A R A T I V E A N A L Y S I S
Finches of the Galápagos Islands observed by Charles Darwin on the voyage of HMS Beagle
Sequence alignment is similar to other types of comparative analysis
Involves scoring similarities and differences among a group of related entities
Homology
“Homology is the central concept for all of biology. Whenever we say that a mammalian hormone is the ‘same’ hormone as a fish hormone, that a human gene sequence is the ‘same’ as a sequence in a chimp or a mouse, that a HOX gene is the ‘same’ in a mouse, a fruit fly, a frog and a human - even when we argue that discoveries about a worm, a fruit fly, a frog, a mouse, or a chimp have relevance to the human condition - we have made a bold and direct statement about homology. The aggressive confidence of modern biomedical science implies that we know what we are talking about.”
David B. Wake
C O M P A R A T I V E A N A L Y S I S
Alignment algorithms model evolutionary processes
GATTACCA
GATGACCA GATTACCA
Derivation from a common ancestor through incremental change due to DNA replication errors, mutations, damage, or unequal crossing-over.
insertion
GATCATCA GATTGATCA
GATTACCA GATTATCA GATTACCA
deletion, substitution
GATACCA (T deleted)
Only extant sequences are known, ancestral sequences are postulated.
GATCATCA GATTGATCA
GATTACCA
GATACCA
The term homology implies a common ancestry, which may be inferred from observations of sequence similarity
Derivation from a common ancestor through incremental change. Mutations that do not kill the host may carry over to the population. Rarely are mutations kept/rejected by natural selection.
Comparative Analysis of Genes
MSH2_Human TGVIVLMAQIGCFVPCESAEVSIVDCILARVGAGDSQLKGVSTFMAEMLETASILRSATK
SPE1_DROME VGTAVLMAHIGAFVPCSLATISMVDSILGRVGASDNIIKGLSTFMVEMIETSGIIRTATD
MSH2_Yeast VGVISLMAQIGCFVPCEEAEIAIVDAILCRVGAGDSQLKGVSTFMVEILETASILKNASK
MUTS_ECOLI TALIALMAYIGSYVPAQKVEIGPIDRIFTRVGAADDLASGRSTFMVEMTETANILRNATE
           *** ** ** * * **** **** * ** * *
[Figure: phylogenetic tree of Bacteria, Yeast, Worm, Fly, Mouse, and Human, with branch points marked at 3000 Myr, 1000 Myr, and 500 Myr]
Human Colon Cancer MSH2 gene is homologous to DNA repair proteins
Align Extant Sequences
Why Align sequences?
- Finding similar sequences helps determine the properties and function of a new sequence. (Must be verified experimentally.)
- Conserved positions in homologous sequences hint at functionally important sites in proteins (active or catalytic sites, DNA binding domains, disulfide bridges, structural bends, hydrophobic pockets, protein binding domains, …).
- Conserved nucleotides can hint at regulatory elements, either pre-transcriptional or post-transcriptional.
Sound alignment methods reflect evolution.
DNA Evolution:
- Mutation: errors in DNA replication or DNA repair.
  - Substitutions: replacement of one base by another.
  - Deletions/insertions: by DNA mispairing during replication or unequal crossing over.
- Gene conversion or unequal crossing over: large segments of DNA can be inserted/deleted.
- Mutations that do not kill the host are propagated. Sometimes positive mutations are selected for.
Reference: Wen-Hsiung Li, Molecular Evolution, Sinauer Associates, 1997.
Synonymous versus non-synonymous mutations
[Figure: bar chart comparing coding regions (non-degenerate, twofold degenerate, and fourfold degenerate sites), non-coding regions (5' flank, 5' UTR, introns, 3' UTR, 3' flank), and pseudogenes; vertical axis from 0 to 5]
Substitution rate per nucleotide site per billion years.
Different regions evolve at different rates, consistent with evolutionary constraints.
Alignment definition and types:
Alignment: each base is aligned with another base or with a gap (symbol “-” or sometimes “.”). Each base is used at most once.
Local alignments: do not need to align all the bases in all sequences.
Example: align BILLGATESLIKESCHEESE and GRATEDCHEESE
Global alignment:
G-ATESLIKESCHEESE
GRATED-----CHEESE
Local alignments:
G-ATES & CHEESE
GRATED & CHEESE
C O M P A R A T I V E A N A L Y S I S
GATTATACCA
GATTA---CA
Insertions and deletions (‘indels’) are represented by gaps in alignments
gap of length 3
S E Q U E N C E A L I G N M E N T
An alignment provides a mapping of residues in one sequence onto those of another
Conserved residues are often of structural or functional importance
Figure 7.1: Alignment of trypsin sequences from mouse and crayfish
[Annotation marks above the alignment (S-S, asterisks, diamond) are not reproduced]
Mouse    IVGGYNCEENSVPYQVSLNS-----GYHFCGGSLINEQWVVSAGHCYK-------SRIQV
Crayfish IVGGTDAVLGEFPYQLSFQETFLGFSFHFCGASIYNENYAITAGHCVYGDDYENPSGLQI

Mouse    RLGEHNIEVLEGNEQFINAAKIIRHPQYDRKTLNNDIMLIKLSSRAVINARVSTISLPTA
Crayfish VAGELDMSVNEGSEQTITVSKIILHENFDYDLLDNDISLLKLSGSLTFNNNVAPIALPAQ

Mouse    PPATGTKCLISGWGNTASSGADYPDELQCLDAPVLSQAKCEASYPG-KITSNMFCVGFLE
Crayfish GHTATGNVIVTGWG-TTSEGGNTPDVLQKVTVPLVSDAECRDDYGADEIFDSMICAGVPE

Mouse    GGKDSCQGDSGGPVVCNG----QLQGVVSWGDGCAQKNKPGVYTKVYNYVKWIKNTIAAN
Crayfish GGKDSCQGDSGGPLAASDTGSTYLAGIVSWGYGCARPGYPGVYTEVSYHVDWIKANAV--
Conserved positions are often of functional importance.
Alignment of trypsin proteins of mouse (Swiss-Prot P07146) and crayfish (Swiss-Prot P00765). Identical residues are highlighted red and underlined.
Indicated above the alignment are three disulfide bonds (-S-S-) whose participating cysteine residues are conserved, amino acids whose side chains are involved in the charge relay system (asterisk), and the active site residue which governs substrate specificity (diamond). The other conserved positions have no known role. These conserved residues could be coincidentally conserved or have some unknown structural role.
S E Q U E N C E A L I G N M E N T
Figure 7.2
CLUSTAL W (1.7) multiple sequence alignment
Human-Zcr MATGQKLMRAVRVFEFGGPEVLKLRSDIAVPIPKDHQVLIKVHACGVNPVETYIRSGTYS
Ecoli-QOR ------MATRIEFHKHGGPEVLQA-VEFTPADPAENEIQVENKAIGINFIDTYIRSGLYP
          : :...:.******: ::: . * :::: :: :* *:* ::****** *.

Human-Zcr RKPLLPYTPGSDVAGVIEAVGDNASAFKKGDRVFTSSTISGGYAEYALAADHTVYKLPEK
Ecoli-QOR -PPSLPSGLGTEAAGIVSKVGSGVKHIKAGDRVVYAQSALGAYSSVHNIIADKAAILPAA
          * ** *::.**::. **.... :* ****. :.: *.*:. ... **

Human-Zcr LDFKQGAAIGIPYFTAYRALIHSACVKAGESVLVHGASGGVGLAACQIARAYGLKILGTA
Ecoli-QOR ISFEQAAASFLKGLTVYYLLRKTYEIKPDEQFLFHAAAGGVGLIACQWAKALGAKLIGTV
          :.*:*.** : :*.* * :: :*..*..*.*.*:***** *** *:* * *::**.

Human-Zcr GTEEGQKIVLQNGAHEVFNHREVNYIDKIKKYVGEKGIDIIIEMLANVNLSKDLSLLSHG
Ecoli-QOR GTAQKAQSALKAGAWQVINYREEDLVERLKEITGGKKVRVVYDSVGRDTWERSLDCLQRR
          ** : : .*: ** :*:*:** : ::::*: .* * : :: : :.. . .:.*. *.:

Human-Zcr GRVIVVG-SRGTIEINPRDTMAKES----SIIGVTLFSSTKEEFQQYAAALQAGMEIGWL
Ecoli-QOR GLMVSFGNSSGAVTGVNLGILNQKGSLYVTRPSLQGYITTREELTEASNELFSLIASGVI
          * :: .* * *:: . : ::. : .: : :*:**: : : * : : * :

Human-Zcr KPVIGSQ--YPLEKVAEAHENIIHGSGATGKMILLL
Ecoli-QOR KVDVAEQQKYPLKDAQRAHE-ILESRATQGSSLLIP
          * :..* ***:.. .*** *:.. .: *. :*:
Stars indicate identical residues and dots indicate conservative substitutions
Human zeta crystallin vs E.coli quinone oxidoreductase
Score and Statistics
G-ATESLIKESCHEESE AND/OR G-ATES & CHEESE
GRATED-----CHEESE GRATED & CHEESE
Percent Identity. Can be misleading.
Score: A simple quality measure is the “score”. The score assigns points for each aligned base (or gap) of the alignment.
identical bases : “match” score
mismatching bases: “mismatch” score
gaps: “gap opening” penalty for starting a gap
“gap extension” penalty for each gap symbol.
Example: match = +1, mismatch = -1, gap opening = -5, gap extension = -1
Score of G-ATESLIKESCHEESE / GRATED-----CHEESE:
Score = 10*(+1) + 1*(-1) + (-5 - 1) + (-5 + 5*(-1)) = -7
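The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming the alignment is given as two equal-length strings with "-" for gaps; the function name `score_alignment` and its parameter names are hypothetical, not part of any alignment package.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two equal-length gapped strings column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = None  # None, "top", or "bottom": which row holds the current gap run
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot hold two gap symbols")
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if in_gap != row:
                score += gap_open      # charged once, when a gap run starts
            score += gap_extend        # charged for every gap symbol
            in_gap = row
        else:
            score += match if a == b else mismatch
            in_gap = None
    return score


print(score_alignment("G-ATESLIKESCHEESE", "GRATED-----CHEESE"))  # -7
```

With the same parameters the local piece G-ATES / GRATED scores -3 and CHEESE / CHEESE scores +6, which shows why a local method prefers to report the two pieces separately.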
S C O R I N G S Y S T E M S
Which alignment is “better”?
GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC
3 mismatches, 1 gap

GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC
0 mismatches, 5 gaps
S C O R I N G S Y S T E M S
High penalty for “opening” a gap (e.g. G = 5)
Lower penalty for “extending” a gap (e.g. L = 1)

GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC
Penalty = 1G + 6L = 11

GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC
Penalty = 5G + 6L = 31
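The two candidate alignments can be compared mechanically. The sketch below counts mismatches, gap runs, and total gap length, then applies the G and L penalties from this slide; `gap_summary` is a hypothetical helper name, and it assumes gaps are written as "-".

```python
import re


def gap_summary(top, bottom, G=5, L=1):
    """Return (mismatches, number of gaps, gap penalty) for a gapped pair."""
    mismatches = sum(
        1 for a, b in zip(top, bottom) if "-" not in (a, b) and a != b
    )
    # Each maximal run of "-" in either row is one gap.
    runs = re.findall(r"-+", top) + re.findall(r"-+", bottom)
    gap_count = len(runs)
    gap_length = sum(len(run) for run in runs)
    return mismatches, gap_count, gap_count * G + gap_length * L


target = "GCTACTAGCTCTAGCGCGTATAGC"
print(gap_summary("GCTACTAGTT------CGCTTAGC", target))  # (3, 1, 11)
print(gap_summary("GCTACTAG-T-T--CGC-T-TAGC", target))  # (0, 5, 31)
```

Under these penalties the single long gap is cheaper, even though that alignment carries three mismatches.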
L O C A L S I M I L A R I T Y
Figure 7.3
F12:  F2 E F1 E K Catalytic
PLAT: F1 E K K Catalytic
Mix-and-match protein modules confound alignment algorithms
Protein modules in coagulation factor XII (F12) and tissue plasminogen activator (PLAT)
F1, F2: Fibronectin repeats
E: EGF similarity domain
K: Kringle domain
Catalytic: Serine protease activity
Modules in reverse order
Repeated modules
D O T P L O T S
Figure 7.4
Dot plot: Fitch, Biochem. Genet. (1969) 3, 99-108
    C G T A C C G T
A   0 0 0 1 0 0 0 0
C   1 0 0 0 1 1 0 0
G   0 1 0 0 0 0 1 0
T   0 0 1 0 0 0 0 1
Horizontal axis is coordinates for one sequence
Vertical axis is coordinates for the other
D O T P L O T S
Figure 7.4b
Dot-plot Fitch : Biochem. Genet. (1969)3,99-108
Scoring can also use a sliding window rather than one position at a time. For example, a window of 3 nucleotides, scoring 1 for identical triplets and 0 for all other combinations, yields:
      CGT GTA TAC ACC CCG CGT
ACG    0   0   0   0   0   0
CGT    1   0   0   0   0   1
Horizontal axis is coordinates for one sequence
Vertical axis is coordinates for the other
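A dot plot of this kind is easy to compute. The following is a small sketch, assuming exact matching of windows; `dot_plot` is a hypothetical function name, and real dot-plot programs usually score windows with a similarity threshold rather than strict identity.

```python
def dot_plot(vertical, horizontal, window=1):
    """Return a 0/1 matrix; entry [i][j] is 1 when the windows starting at
    vertical[i] and horizontal[j] are identical."""
    rows = len(vertical) - window + 1
    cols = len(horizontal) - window + 1
    return [
        [1 if vertical[i:i + window] == horizontal[j:j + window] else 0
         for j in range(cols)]
        for i in range(rows)
    ]


for row in dot_plot("ACGT", "CGTACCGT"):
    print(row)
for row in dot_plot("ACGT", "CGTACCGT", window=3):
    print(row)
```

With window=1 this reproduces the single-position matrix; with window=3 only the CGT triplets produce dots, which is how windowing suppresses chance matches.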
D O T P L O T S
[Figure 7.4: dot plot of tissue plasminogen activator (PLAT, vertical axis) against coagulation factor XII (F12, horizontal axis)]
[Figure 7.4, with the protein modules (F1, F2, E, K, Catalytic) marked along both axes]
Plot dots for high similarity within a short window
Adjacent dots merge to form diagonal segments
Repeated domains show a characteristic pattern
P A T H G R A P H S
Figure 7.5
PLAU  90 EPKKVKDHCSKHSPCQKGGTCVNMP--SGPH-CLCPQHLTGNHCQKEK---CFE 137
PLAT  23 ELHQVPSNCD----CLNGGTCVSNKYFSNIHWCNCPKKFGGQHCEIDKSKTCYE  72
EGF similarity domains of urokinase plasminogen activator (PLAU) and tissue plasminogen activator (PLAT)
Dot plots suggest paths through the alignment space
Path graphs are more explicit representations
Each path is a unique alignment
Routing a phone call from Washington DC to San Francisco
P A T H G R A P H S
Best-path problems are common in computer science
A best-path algorithm used for sequence alignment is called ‘dynamic programming’
D Y N A M I C P R O G R A M M I N G
Dynamic Programming Example
Construct an optimal alignment of these two sequences:
GATACTA
GATTACCA
Using these scoring rules:
Match: +1
Mismatch: -1
Gap: -1
D Y N A M I C P R O G R A M M I N G
[Figure: two-dimensional lattice with GATACTA along one axis and GATTACCA along the other]
Arrange the sequence residues along a two-dimensional lattice
Vertices of the lattice fall between letters
The goal is to find the optimal path from one corner of the lattice to the opposite corner
Each path corresponds to a unique alignment
Which one is optimal?
The score for a path is the sum of its incremental edge scores:
A aligned with A: Match = +1
A aligned with T: Mismatch = -1
T aligned with NULL, or NULL aligned with T: Gap = -1
Incrementally extend the path
Remember the best sub-path leading to each point on the lattice
[Figure: the lattice is filled in step by step, starting from 0 at the origin, with the best score reaching each vertex; the final corner holds the optimal score, +4]
Trace-back to get optimal path and alignment
Print out the alignment
GA-TACTA
GATTACCA
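The lattice procedure walked through on the preceding slides can be sketched as follows. This is a minimal global alignment under the example's scoring rules (match +1, mismatch -1, gap -1, with no separate gap-opening penalty); the function name `needleman_wunsch` is chosen here for illustration, and when several paths tie the trace-back below simply prefers the diagonal step, so it may print a different but equally good alignment.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, top, bottom)."""
    n, m = len(a), len(b)
    # score[i][j] = best score of any path reaching lattice vertex (i, j)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # a[i-1] against a gap
                              score[i][j - 1] + gap)       # b[j-1] against a gap
    # Trace back from the far corner to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = needleman_wunsch("GATACTA", "GATTACCA")
print(best)
print(top)
print(bottom)
```

For the two example sequences the optimal score is +4 (six matches, one mismatch, one gap).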
Two different types of Alignment
Global Alignment methods:
Needleman & Wunsch (J. Mol. Biol. (1970) 48, 443-453): Problem of finding the best path. Revelation: any partial sub-path that ends at a point along the true optimal path must itself be the optimal path leading to that point. This provides a method to create a matrix of path “scores”, each entry being the score of the best path leading to that point. Trace the optimal path from one end to the other of the two sequences.
Local Alignment methods:
Smith & Waterman (J. Mol. Biol. (1981) 147, 195-197): Use Needleman & Wunsch, but report all non-overlapping paths, starting at the highest scoring points in the path graph.
FASTP (Lipman & Pearson (1985), Science 227, 1435-1441)
BLAST (Altschul et al. (1990), J. Mol. Biol. 215, 403-410): don’t report all overlapping paths, but only attempt to find paths if there are words that are high-scoring. This speeds up the alignments considerably.
G L O B A L & L O C A L S I M I L A R I T Y
Implementations of dynamic programming for global and local similarities
Optimal global alignment
Needleman & Wunsch (1970)
Sequences align essentially from end to end
Optimal local alignment
Smith & Waterman (1981)
Sequences align only in small, isolated regions
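The local variant differs from the global one in two places: no cell may drop below zero, and the trace-back starts at the highest-scoring cell and stops at the first zero. The sketch below reports only the single best local alignment, not all non-overlapping paths; the simple scores (match +1, mismatch -1, gap -1) and the function name `smith_waterman` are assumptions made for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment; returns (score, top, bottom)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets a path restart anywhere instead of going negative.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score returns to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("BILLGATESLIKESCHEESE", "GRATEDCHEESE"))
```

On the BILLGATESLIKESCHEESE / GRATEDCHEESE example the best local score is 6, from the shared CHEESE region.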
Score and Statistics
Some amino acid mutations do not affect structure/function very much. Amino acids with similar physico-chemical and steric properties can often replace each other.
A scoring system should not heavily penalize mutations to similar amino acids.
PAM matrices: Point Accepted Mutations. Defined in terms of a divergence of 1 percent (1 PAM). For distant sequences use PAM250, while for closer sequences (like DNA) use PAM100. Some sites accumulate mutations and others don’t, so use of the PAM100 matrix doesn’t mean that the sequences compared were 100% mutated.
BLOSUM: BLOCKS substitution matrices. Started with the BLOCKS database of multiple alignments involving only distant sequences. BLOSUM62 means that the proteins compared were never closer than 62% identity. BLOSUM50 matrices involved alignment of more distant sequences. BLOSUM matrices (BLOSUM62) are recommended for most protein alignments.
S C O R I N G S Y S T E M S
BLOSUM62
Figure 7.8
A 4
R -1 5
N -2 0 6
D -2 -2 1 6
C 0 -3 -3 -3 9
Q -1 1 0 0 -3 5
E -1 0 0 2 -4 2 5
G 0 -2 0 -1 -3 -2 -2 6
H -2 0 1 -1 -3 0 0 -2 8
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
A R N D C Q E G H I L K M F P S T W Y V
Some amino acid substitutions are more common than others
Substitution scores come from an odds ratio based on measured substitution rates
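The odds-ratio idea can be illustrated with a short calculation. The sketch assumes the usual log-odds form, score = scale × log2(observed pair frequency / frequency expected by chance), rounded to an integer, with scale = 2 (half-bit units, as used for BLOSUM62). The function name and the frequencies in the example are made up for illustration and are not values from the BLOCKS database; pair frequencies are treated here as ordered-pair frequencies.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Integer log-odds score for aligning residue a with residue b."""
    expected = freq_a * freq_b  # chance of the pair if residues were independent
    return round(scale * math.log2(observed_pair_freq / expected))


# A pair seen twice as often as chance predicts gets a positive score ...
print(log_odds_score(0.02, 0.1, 0.1))   # 2
# ... and a pair seen half as often as chance predicts gets a negative one.
print(log_odds_score(0.005, 0.1, 0.1))  # -2
```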
Identities get positive scores, but some are better than others
Some non-identities have positive scores, but most are negative
D A T A B A S E S E A R C H I N G
Compare one query sequence against an entire database
> fasta myquery swissprot -ktup 2
search program: fasta
query sequence: myquery
sequence database: swissprot
optional parameters: -ktup 2
A typical search has four basic elements
D A T A B A S E S E A R C H I N G
With exponential database growth, searches keep taking more time
> fasta myquery swissprot -ktup 2
searching . . . . . .
D A T A B A S E S E A R C H I N G
The “hit list” gives titles and scores for matched sequences
> fasta myquery swissprot -ktup 2
The best scores are:                                initn init1 opt z-sc E(77110)
gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)-    996 996 996 1262.1 0
gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)    412 382 395 507.6 1.4e-21
gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI    238 133 316 407.4 5.4e-16
gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT-    153 98 190 253.1 2.1e-07
gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7    163 163 184 244.8 6.1e-07
gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN           164 164 170 227.2 5.8e-06
gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI    130 91 157 210.3 5.1e-05
gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1    125 125 148 199.7 0.0002
gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4     42 42 140 191.3 0.00058
gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9    128 73 139 188.7 0.00082
gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT-    76 76 133 181.0 0.0022
gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1     27 27 119 165.2 0.017
gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO    66 66 118 163.0 0.022
gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO    65 65 116 160.5 0.03
gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT-    52 52 117 160.3 0.031
gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO    66 66 115 159.3 0.035
gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO    66 66 112 155.5 0.057
gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN    73 73 112 155.4 0.058
gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN    76 76 110 153.8 0.072
gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP    58 58 104 138.5 0.51
gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE    47 47 103 137.8 0.56
gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T    63 63 98 131.3 1.3
gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA    58 58 99 129.4 1.6
gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA    70 48 91 122.9 3.7
gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE    75 50 92 121.9 4.3
gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU    36 36 85 121.3 4.7
gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC    36 36 84 120.0 5.5
gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA    45 45 90 118.9 6.3
gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA    48 48 92 117.4 7.6
gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED    59 59 89 117.0 8
gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC    48 48 97 117.0 8
gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO    38 38 83 116.8 8.3
E-value
“Hits” can be sorted according to their E-value or their score.
The E-value, also known as the EXPECT value, is a function of the score, the database size, and the query sequence length.
E-value: the number of alignments with a score >= S that you would expect to find if the database were a collection of random letters.
E.g., a score of 1 requires only 1 match, so there should be an enormous number of such alignments. One expects to find fewer alignments with a score of 5, and so on. Eventually, when the score is big enough, one expects to find an insignificant number of alignments that could be due to chance.
E-values of less than 1e-6 (1 × 10^-6 in scientific notation) are usually very good, and for proteins E < 1e-2 is usually considered significant. It is still possible for a hit with E > 1 to be biologically meaningful, but more analysis is required to confirm that.
Even for VERY good hits, it is possible that the hit is due to a biological artifact (sequencing/cloning vector, repeats, low-complexity sequence, …).
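How the E-value depends on score, database size, and query length can be sketched with the Karlin-Altschul form E = K · m · n · exp(-λS), where m is the query length and n the database length. The slides do not give this formula, so treat it as an assumption; the default λ and K below are illustrative values of the kind quoted for ungapped BLOSUM62 scoring, and real programs estimate them for the scoring system in use.

```python
import math


def expect_value(score, query_len, db_len, lam=0.318, K=0.134):
    """Expected number of chance alignments scoring at least `score`."""
    return K * query_len * db_len * math.exp(-lam * score)


# Higher scores are exponentially less likely to arise by chance,
# and a database twice as large doubles the expected number of chance hits.
for s in (30, 50, 70):
    print(s, expect_value(s, query_len=150, db_len=2_000_000))
```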
E-value
Another type of statistic is the P-value, which, given a score S for an alignment, is the probability that an alignment of the query against a database of random sequences has a score >= S. For gapless alignments the P-value can be computed from theory.
Sometimes the alignment algorithm or a biologically complex database does not allow the P-value to be computed from the statistical theory of a uniform database. In this case, one uses an alternate statistic, the Z-value (e.g. FASTA suite), which shuffles the query sequence and thus creates many compositionally identical query sequences. Each random sequence is then re-queried against the database. When done enough times, this provides a distribution of scores which is approximately normally distributed (if lucky) around some mean.
Z-value = (score - mean) / standard deviation
A Z-value of 3 or greater is good.
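The shuffling procedure can be sketched as below. This is a simplified illustration, not the FASTA implementation: it compares the query with a single subject sequence rather than a whole database, uses a plain local alignment score (match +1, mismatch -1, gap -1), and the names `local_score` and `z_value` are hypothetical.

```python
import random
import statistics


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (score only, keeping two rows in memory)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def z_value(query, subject, shuffles=50, seed=0):
    """(observed score - mean shuffled score) / standard deviation."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = local_score(query, subject)
    letters = list(query)
    shuffled_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)           # same composition, random order
        shuffled_scores.append(local_score("".join(letters), subject))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / sd


seq = "MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLV"
print(z_value(seq, seq))
```

Shuffling keeps the amino acid composition of the query fixed, so the background distribution reflects composition effects that a uniform random model would miss.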
[Figure: probability distribution of scores, marking S (the score of the alignment), the standard deviation, and the deviation of S from the mean]
D A T A B A S E S E A R C H I N G
Detailed alignments are shown farther down in the output
> fasta myquery swissprot -ktup 2
>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)
initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21
Smith-Waterman score: 395; 52.3% identity in 109 aa overlap

       10 20 30 40 50
gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF
       : X: .:.:: :.:: ::..:::::: : : : :..:: :.:..:::
gi|170 MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF
       10 20 30 40 50 60

       60 70 80 90 100 110
gi|170 QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK
       ....: :.:: : ... ....::: .::::: :::::..::: .:: .:: .: :X.:
gi|170 TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK
       70 80 90 100 110 120

       120 130 140
gi|170 HDKEDFPASWRSEEEMAAEAAALRVYFQ
       ..
gi|170 NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE
       130 140 150 160 170 180

>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)
initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16
Smith-Waterman score: 316; 37.4% identity in 131 aa overlap

       10 20 30 40
gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER
       :.. :. .v^: :.. ..:::: ::.::::::. ::X :
H A S H I N G M E T H O D S
[Figure: matrix with the query sequence along one axis and the database sequence along the other]
The simplest database search is one large dynamic programming problem.
For a query of N letters against a database of M letters, it requires MxN comparisons.
H A S H I N G M E T H O D S
Hashing is a common method for accelerating database searches
query sequence: MLIIKRDELVISWASHERE
all overlapping words of size 3: MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE
Compile “dictionary” of words from the query sequence. Put each word in a look-up table that points to the original position in the sequence. Thus given one word, you can know if it is in the query in a single operation.
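The dictionary of words can be sketched with an ordinary Python dict, whose lookups take roughly constant time. `build_word_dictionary` is a hypothetical name used only for this illustration.

```python
from collections import defaultdict


def build_word_dictionary(query, word_size=3):
    """Map each overlapping word of the query to the positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(query) - word_size + 1):
        table[query[pos:pos + word_size]].append(pos)
    return dict(table)


words = build_word_dictionary("MLIIKRDELVISWASHERE")
print(len(words))          # 17 distinct words for this 19-letter query
print(words["KRD"])        # [4]
print("XYZ" in words)      # False: one lookup answers "is this word in the query?"
```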
Index lookup
Each word is assigned a unique integer.
E.g. for a word of 3 letters made up of an alphabet of 20 letters.
1. Assign a code to each letter Code(l) (0 to 19)
2. For a word of 3 letters L1 L2 L3 the code is
index = Code(L1)*20^2 + Code(L2)*20^1 + Code(L3)
3. Have an array with a list of the positions that have that word.
[Figure: array indexed by word code; each entry lists the positions in the query sequence of that word]
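The three steps above can be sketched directly. The letter ordering in `ALPHABET` is an arbitrary assumption (here the order of the BLOSUM62 table); any fixed assignment of codes 0 to 19 works, and the function names are hypothetical.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYV"                  # assumed ordering, codes 0..19
CODE = {letter: i for i, letter in enumerate(ALPHABET)}


def word_index(word):
    """Code(L1)*20^2 + Code(L2)*20 + Code(L3), generalised to any word length."""
    index = 0
    for letter in word:
        index = index * len(ALPHABET) + CODE[letter]
    return index


def build_index_array(query, word_size=3):
    """Array with one slot per possible word; each slot lists query positions."""
    positions = [[] for _ in range(len(ALPHABET) ** word_size)]
    for pos in range(len(query) - word_size + 1):
        positions[word_index(query[pos:pos + word_size])].append(pos)
    return positions


table = build_index_array("MLIIKRDELVISWASHERE")
print(len(table))                    # 8000 slots for 3-letter words
print(table[word_index("KRD")])      # [4]
```

An array indexed by integer code trades memory (20^3 slots, most of them empty) for a lookup that needs no hashing at all.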
H A S H I N G M E T H O D S
Building the dictionary for the query sequence requires (N-2) operations.
The database contains (M-2) words, and it takes only one operation to see if the word was in the query.
H A S H I N G M E T H O D S
[Figure: matrix with the query sequence along one axis and the database sequence along the other]
Scan the database, looking up words in the dictionary
Use word hits to determine where to search for alignments
This fills the dynamic programming matrix in (N-2)+(M-2) operations instead of MxN.
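The scan can be sketched as follows: build the query dictionary once, slide along the database, and record each word hit by its diagonal (database position minus query position). Hits that pile up on one diagonal mark where an alignment is worth searching for. The function name and the toy database string are made up for illustration.

```python
from collections import Counter, defaultdict


def word_hits_by_diagonal(query, database, word_size=3):
    """Count word hits per diagonal (database position - query position)."""
    table = defaultdict(list)
    for q in range(len(query) - word_size + 1):           # about N-2 operations
        table[query[q:q + word_size]].append(q)
    diagonals = Counter()
    for d in range(len(database) - word_size + 1):        # about M-2 lookups
        for q in table.get(database[d:d + word_size], ()):
            diagonals[d - q] += 1
    return diagonals


hits = word_hits_by_diagonal("MLIIKRDELVISWASHERE", "XXXKRDELVXXX")
print(hits)   # Counter({-1: 4}): four consecutive words share one diagonal
```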
FASTA searches in a band
BLAST extends from word hits
Multiple Alignment
FHIT_HUMAN MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLV...
APH1_SCHPO MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLV...
HNT2_YEAST MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
Y866_METJA MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
A true multiple alignment method will align all the sequences together at the same time.
FHIT_HUMAN -----------MS-F RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO -----------MPKQ LYFSKFPVG-SQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST MILSKTKKPKSMNKP IYFSKFLVT-EQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA -----------MCIF CKIINGEIP-AKVVY EDEHVLAFLDINPRN KGHTLV...
Unfortunately, there is no formal computationally tractable method for more than 3 sequences.
There are many approximate methods, such as Progressive multiple alignment methods.
Progressive Multiple Alignment
Align all pairs of sequences.
Pairwise alignments: compute distance matrix
             FHIT_HUMAN  APH1_SCHPO  HNT2_YEAST  Y866_METJA
FHIT_HUMAN
APH1_SCHPO   395
HNT2_YEAST   316         380
Y866_METJA   290         300         340
Guide Tree
[Figure: guide tree built from the pairwise scores, with leaves FHIT_HUMAN, APH1_SCHPO, HNT2_YEAST, and Y866_METJA]
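One simple way to turn the pairwise scores into a joining order is greedy average-linkage clustering: repeatedly merge the two clusters with the highest average pairwise score. This is only a sketch of the idea; it treats the table entries as similarity scores (higher means closer), and it is not necessarily the tree-building method used by any particular alignment program. The function name `guide_tree` is hypothetical.

```python
from itertools import combinations

# Pairwise scores from the matrix above (higher = more similar).
SCORES = {
    frozenset(("FHIT_HUMAN", "APH1_SCHPO")): 395,
    frozenset(("FHIT_HUMAN", "HNT2_YEAST")): 316,
    frozenset(("APH1_SCHPO", "HNT2_YEAST")): 380,
    frozenset(("FHIT_HUMAN", "Y866_METJA")): 290,
    frozenset(("APH1_SCHPO", "Y866_METJA")): 300,
    frozenset(("HNT2_YEAST", "Y866_METJA")): 340,
}


def guide_tree(scores):
    """Return a nested tuple describing the order in which to join sequences."""
    names = sorted({name for pair in scores for name in pair})
    clusters = [(name, [name]) for name in names]      # (subtree, leaf names)

    def average(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(scores[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


print(guide_tree(SCORES))
# ('Y866_METJA', ('HNT2_YEAST', ('APH1_SCHPO', 'FHIT_HUMAN')))
```

The innermost pair is aligned first (FHIT_HUMAN with APH1_SCHPO, score 395), then HNT2_YEAST is added, then Y866_METJA.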
Multiple Alignment
Align two closest sequences
FHIT_HUMAN MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLV...
APH1_SCHPO MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLV...
HNT2_YEAST MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
Y866_METJA MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
FHIT_HUMAN MSFR FGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO MPKQ LYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
Y866_METJA MCIF CKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
This alignment creates a consensus sequence that is then used to align the subsequent sequences.
From the point of view of this pairwise alignment, the gap can be inserted anywhere in the green region (between the first M and residue 13 (S)).
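One simple way to picture the consensus mentioned above is a majority vote in each column of the alignment. The sketch below is a deliberate simplification: the function name `consensus`, the toy DNA rows, and the tie rule (the earliest row wins) are assumptions made for illustration, and real programs typically keep a richer per-column profile rather than a single letter.

```python
def consensus(aligned_rows):
    """Return the most common symbol in each column of an alignment.

    Ties are broken in favour of the earliest row (an arbitrary choice).
    """
    length = len(aligned_rows[0])
    if any(len(row) != length for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")
    result = []
    for i in range(length):
        column = [row[i] for row in aligned_rows]
        # max() returns the first symbol that reaches the highest count.
        result.append(max(column, key=column.count))
    return "".join(result)


print(consensus(["GATTACA", "GAT-ACA", "GCTTACA"]))
```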
Multiple Alignment
Align the next closest sequence to the consensus.
FHIT_HUMAN MS-F RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO MPKQ LYFSKFPVG-SQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
Y866_METJA MCIFCKIINGEIP-AKVVYEDEHVLAFLDINPRNKGHTLV...
FHIT_HUMAN -----------MSF RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO -----------MPK QLYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST MILSKTKKPKSMNK PIYFSKFLVTEQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA MCIF CKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
Once inserted, gap positions cannot move, because they are part of the consensus.
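The "frozen gap" behaviour can be sketched as follows: a new sequence is aligned against the columns of the existing alignment, and the only change allowed to the existing rows is the insertion of a whole new gap column. This is a minimal sketch under stated assumptions, not the method of any particular tool: the function name `align_to_profile`, the simple match/mismatch/gap scores, and the sum-over-rows column score are all chosen for illustration.

```python
def align_to_profile(rows, new_seq, match=1, mismatch=-1, gap=-2):
    """Align new_seq to an existing alignment without changing its columns."""
    n_rows, n_cols, m = len(rows), len(rows[0]), len(new_seq)
    gap_cost = gap * n_rows  # a gap is paid once per existing row

    def col_score(i, residue):
        # Sum of match/mismatch scores against column i (a '-' counts as mismatch).
        return sum(match if r[i] == residue else mismatch for r in rows)

    # Needleman-Wunsch table: existing columns versus residues of new_seq.
    score = [[0] * (m + 1) for _ in range(n_cols + 1)]
    for i in range(1, n_cols + 1):
        score[i][0] = i * gap_cost
    for j in range(1, m + 1):
        score[0][j] = j * gap_cost
    for i in range(1, n_cols + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + col_score(i - 1, new_seq[j - 1]),
                score[i - 1][j] + gap_cost,
                score[i][j - 1] + gap_cost,
            )

    # Traceback: existing columns are copied unchanged or padded with a gap column.
    old_cols, new_row = [], []
    i, j = n_cols, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + col_score(i - 1, new_seq[j - 1])):
            old_cols.append([r[i - 1] for r in rows])
            new_row.append(new_seq[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap_cost:
            old_cols.append([r[i - 1] for r in rows])
            new_row.append("-")
            i -= 1
        else:
            old_cols.append(["-"] * n_rows)  # new all-gap column in the old rows
            new_row.append(new_seq[j - 1])
            j -= 1
    old_cols.reverse()
    new_row.reverse()
    new_rows = ["".join(col[k] for col in old_cols) for k in range(n_rows)]
    return new_rows, "".join(new_row)


print(align_to_profile(["AC", "AC"], "GAC"))
```

Because existing columns are only ever copied or padded, a gap placed in an earlier step stays where it is, even if a later sequence would have suggested a better position.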
Multiple Alignment
FHIT_HUMAN -----------MS-F RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO -----------MPKQ LYFSKFPVG-SQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST MILSKTKKPKSMNKP IYFSKFLVT-EQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
FHIT_HUMAN -----------MSFR FGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO -----------MPKQ LYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST MILSKTKKPKSMNKP IYFSKFLVTEQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA -----------MCIF CKIINGEIPAKVVY EDEHVLAFLDINPRN KGHTLV...
Align the next closest sequence to the new consensus.
Because of the order of alignments, the gap position cannot be changed to align these two P residues, which would have resulted in a higher score.
Hopefully, the result is similar to what a true multiple alignment method would have yielded. As we saw, the order of alignment determines where gaps exist.
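To make "a higher score" concrete, one common way to score a multiple alignment is the sum-of-pairs score: every pair of rows is scored column by column and the results are added up. The sketch below uses simple match/mismatch/gap values and toy rows, all of which are assumptions for illustration; real tools use substitution matrices and more elaborate gap penalties.

```python
from itertools import combinations


def sum_of_pairs(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for a, b in combinations(rows, 2):
        for x, y in zip(a, b):
            if x == "-" and y == "-":
                continue            # gap against gap contributes nothing
            if x == "-" or y == "-":
                total += gap        # residue against gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


# Two P residues kept apart by frozen gaps, versus the same residues aligned.
print(sum_of_pairs(["AP-C", "A-PC"]))
print(sum_of_pairs(["APC", "APC"]))
```

In this toy case the rows with the two P residues aligned score higher than the rows where earlier gap placement keeps them apart.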
CLUSTALW
ClustalW is a progressive multiple alignment tool.
- Adaptive gap-opening and gap-extension scores make it relatively insensitive to small changes in gap parameters.
- Choice of DNA or protein gap penalties.
- Available on the web or on PC/Mac/Unix.
http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/clustalw.html
The uppercase “O” in “Options” matters: the URL is case-sensitive.
BLAST and BLAST2SEQUENCES
BLAST is a database search engine that uses hashing to accelerate the search.
- blastn (for nucleotides)
- blastp (for proteins)
- blastx (translates a nucleotide query in all 6 reading frames and compares it to a protein database)
- tblastn (compares a protein query against a nucleotide database translated in all 6 reading frames)
- tblastx (compares a nucleotide query against a nucleotide database by translating both the query and the database in all 6 reading frames)
http://www.ncbi.nlm.nih.gov/BLAST/
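"All 6 reading frames" means three codon offsets on the forward strand plus three on the reverse complement. The sketch below only splits a DNA string into codons for each frame; it does not translate them and does not handle ambiguity codes. The function names and the frame labels (+1 to -3) are illustrative and are not part of BLAST.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case A/C/G/T string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Split dna into codons for the 3 forward and 3 reverse reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Only complete codons are kept; trailing bases are dropped.
            codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


for name, codons in six_frames("ATGGCCTAA").items():
    print(name, codons)
```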
A pairwise alignment implementation of these programs is available at:
http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html
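The hashing idea can be sketched as a lookup table of short words: every word of a fixed size in the subject is stored with its positions, so each query word is matched with a single table lookup instead of a scan. This is only the exact-word seeding step under simplifying assumptions; the function names and the word size of 3 are illustrative, and the real program does considerably more (for example, extending and scoring the hits).

```python
from collections import defaultdict


def build_word_index(sequence, word_size):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - word_size + 1):
        index[sequence[pos:pos + word_size]].append(pos)
    return index


def find_seed_hits(query, index, word_size):
    """Return (query_pos, subject_pos) for every exact word match."""
    hits = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for spos in index.get(word, []):
            hits.append((qpos, spos))
    return hits


index = build_word_index("GATTACCA", 3)
print(find_seed_hits("TTAC", index, 3))
```

Hits that fall on the same diagonal (the same difference between subject and query position) are the natural starting points for a longer alignment.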
Query-Anchored Alignments (Master-Slave)
ClustalW: Is a multiple alignment program. Every sequence is aligned to every other one.
BLAST: Is NOT a multiple alignment program, but may display query-anchored multiple pairwise alignments that look like a multiple alignment. All the sequences are aligned only to the first sequence!
[Figure annotations]
Gap in subject sequence.
This column is NOT aligned together. It is displayed there for convenience.
A gap in the query means NOTHING can be aligned to it. Such gaps may optionally be shown (flat view), or the entire column omitted.
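A query-anchored display can be sketched by stacking independent pairwise alignments on the query's coordinates. The example below takes the variant in which columns with a gap in the query are omitted entirely. It is a hypothetical simplification: the function name is made up, and each pairwise alignment is assumed to cover the whole query, whereas real BLAST hits are local.

```python
def query_anchored_rows(query, pairwise):
    """Stack pairwise alignments on the query.

    pairwise: list of (aligned_query, aligned_subject) string pairs.
    Columns where the query has a gap are dropped, so every row has the
    same length as the query. Subjects are never aligned to each other.
    """
    rows = [query]
    for aligned_query, aligned_subject in pairwise:
        if len(aligned_query) != len(aligned_subject):
            raise ValueError("aligned strings must have the same length")
        if aligned_query.replace("-", "") != query:
            raise ValueError("each alignment must cover the whole query")
        row = "".join(
            s for q, s in zip(aligned_query, aligned_subject) if q != "-"
        )
        rows.append(row)
    return rows


rows = query_anchored_rows(
    "GATTACA",
    [("GATTACA", "GAT-ACA"), ("GAT-TACA", "GATCTACA")],
)
print("\n".join(rows))
```

The two subject rows end up in the same columns only because each was aligned to the query; nothing in this procedure compares the subjects with each other.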
BLAST and BLAST2SEQUENCES
Exercises: Use Entrez to find the protein sequences with the LOCUS names
FHIT_HUMAN
HNT2_YEAST
Use ClustalW to align these two sequences and, WITHOUT LOSING THAT RESULT SCREEN,
use pairwise BLAST to align these two sequences as well.
Exercise: Try to reproduce the example ClustalW alignment (the order of the input sequences is not important).
References
Textbook: "Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins", edited by Andy D. Baxevanis and B.F. Ouellette.
Readings: chapters 7, 8, 9.
http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html