How Do We Compare Biological Sequences?
Dynamic Programming
Phillip Compeau and Pavel Pevzner
Bioinformatics Algorithms: An Active Learning Approach
©2015 by Compeau and Pevzner. All rights reserved.
Outline
• Introduction to Sequence Alignment
• The Manhattan Tourist Problem
• Sequence Alignment is the Manhattan Tourist Problem in Disguise
• An Introduction to Dynamic Programming: The Change Problem
• The Manhattan Tourist Problem Revisited
• From Global to Local Alignment
• Penalizing Insertions and Deletions in Sequence Alignment
• Multiple Sequence Alignment
Recursive Manhattan Tourist
SouthOrEast(i, j)
    if i = 0 and j = 0
        return 0
    x ← -infinity, y ← -infinity
    if i > 0
        x ← SouthOrEast(i - 1, j) + weight of vertical edge into (i, j)
    if j > 0
        y ← SouthOrEast(i, j - 1) + weight of horizontal edge into (i, j)
    return max(x, y)
Exercise Break: How many times is SouthOrEast(3, 2) called in the computation of SouthOrEast(9, 7)?
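To experiment with the exercise, here is a minimal Python sketch of the recursive algorithm. The matrix layout (`down[i - 1][j]` and `right[i][j - 1]` hold the weights of the edges entering node (i, j)), the optional `calls` counter, and the sample weights are all assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter


def south_or_east(i, j, down, right, calls=None):
    """Length of a longest path from (0, 0) to (i, j), computed recursively.

    Assumed layout: down[i - 1][j] is the weight of the vertical edge into
    (i, j) and right[i][j - 1] is the weight of the horizontal edge into it.
    """
    if calls is not None:
        calls[(i, j)] += 1  # record every invocation of this subproblem
    if i == 0 and j == 0:
        return 0
    x = y = float("-inf")
    if i > 0:
        x = south_or_east(i - 1, j, down, right, calls) + down[i - 1][j]
    if j > 0:
        y = south_or_east(i, j - 1, down, right, calls) + right[i][j - 1]
    return max(x, y)


# A grid of 2 x 2 blocks (3 x 3 nodes) with made-up edge weights.
down = [[1, 0, 2],
        [4, 6, 5]]
right = [[3, 2],
         [0, 7],
         [3, 3]]

calls = Counter()
best = south_or_east(2, 2, down, right, calls)
print(best)           # 15
print(calls[(1, 1)])  # 2: the subproblem (1, 1) is solved twice
```

Passing a `Counter` for a larger grid shows how quickly the same subproblems are recomputed, which is the motivation for filling in a table instead.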
Dynamic Programming Manhattan
STOP and Think: Which element of the table should we fill in next and what should its value be?
STOP and Think: Do you see a longest path in this grid? What algorithm did you use?
Reconstructing an Optimal Path
STOP and Think: In general, how do we reconstruct this path?
Answer: Start at the ending node and follow the edges that produced each maximum backwards to the beginning node.
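The table-filling and backward walk can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the matrix layout (`down[i - 1][j]` and `right[i][j - 1]` are the weights of the edges entering node (i, j)), the `back` pointer table, and the sample weights are assumptions.

```python
def manhattan_tourist(n, m, down, right):
    """Return (length, path) of a longest path from (0, 0) to (n, m)."""
    s = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]  # "S" or "E" per node

    # First column and first row have only one incoming edge each.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i - 1][0]
        back[i][0] = "S"
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j - 1]
        back[0][j] = "E"

    # Every other node takes the better of its two incoming edges.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            south = s[i - 1][j] + down[i - 1][j]
            east = s[i][j - 1] + right[i][j - 1]
            if south >= east:
                s[i][j], back[i][j] = south, "S"
            else:
                s[i][j], back[i][j] = east, "E"

    # Walk backwards from the sink along the recorded choices.
    path = [(n, m)]
    i, j = n, m
    while (i, j) != (0, 0):
        if back[i][j] == "S":
            i -= 1
        else:
            j -= 1
        path.append((i, j))
    path.reverse()
    return s[n][m], path


down = [[1, 0, 2],
        [4, 6, 5]]
right = [[3, 2],
         [0, 7],
         [3, 3]]
length, path = manhattan_tourist(2, 2, down, right)
print(length)  # 15
print(path)    # [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
```

Storing one pointer per node keeps the reconstruction linear in the path length; ties are broken in favor of the southward edge here, which is an arbitrary choice.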
Finding an LCS
[Figure: alignment grid with ATCGTCC across the columns and ATGTTATA down the rows.]
Exercise Break: Find a longest common subsequence of ATGTTATA and ATCGTCC.
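One way to check an answer to this exercise is a small Python sketch of the LCS computation, treating it as a longest-path problem in which only matching diagonal edges have weight 1. The function name and the tie-breaking rule during backtracking are assumptions chosen for illustration.

```python
def longest_common_subsequence(v, w):
    """Return one longest common subsequence of the strings v and w."""
    n, m = len(v), len(w)
    # s[i][j] = length of an LCS of the prefixes v[:i] and w[:j]
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1  # diagonal match edge
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])

    # Backtrack from the sink, collecting matched symbols.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1]:
            out.append(v[i - 1])
            i -= 1
            j -= 1
        elif s[i - 1][j] >= s[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(longest_common_subsequence("ATGTTATA", "ATCGTCC"))
```

Other longest common subsequences may exist for other inputs; this sketch returns just one of them.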
Strengthening Alignment Scoring
0 1 2 2 3 4 5 6 7 8
 A T - G T T A T A
 A T C G T - C - C
0 1 2 3 4 5 5 6 6 7
Alignment score: Divided into three components:
• match reward (+1)
• mismatch penalty (-μ)
• insertion/deletion penalty (-σ)
Global Alignment Problem: Find a highest-scoring alignment of two strings.
• Input: Two strings.
• Output: An alignment of the strings with maximum alignment score.
STOP and Think: How can we solve this problem?
[Figure: alignment graph for the strings TGTTA (rows) and TCGT (columns). Diagonal edges for matching symbols are labeled +1; every other edge carries a negative weight.]
Answer: Slight modification to alignment graph ...
Exercise Break: Find a best alignment (σ = 2, μ = 3).
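To check an answer, the modified alignment graph can be explored with a short Python sketch of global alignment. It assumes the scoring described earlier (match +1, mismatch -μ, indel -σ) with integer penalties; the function name and the tie-breaking order in the backtrack are illustrative choices, not something the slides specify.

```python
def global_alignment(v, w, mu, sigma):
    """Return (score, aligned_v, aligned_w) for a best global alignment.

    Scoring: +1 per match, -mu per mismatch, -sigma per indel.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -sigma * i  # v[:i] aligned against nothing
    for j in range(1, m + 1):
        s[0][j] = -sigma * j

    def diag(i, j):
        """Score of reaching (i, j) along its diagonal edge."""
        return s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(diag(i, j), s[i - 1][j] - sigma, s[i][j - 1] - sigma)

    # Backtrack from the sink, re-deriving which edge gave each maximum.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == diag(i, j):
            top.append(v[i - 1]); bottom.append(w[j - 1])
            i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] - sigma:
            top.append(v[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, a, b = global_alignment("TGTTA", "TCGT", mu=3, sigma=2)
print(score)
print(a)
print(b)
```

The only change from the LCS computation is the edge weights: mismatch and indel edges now carry penalties instead of weight 0.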
Finding “Local” Similarities
Homeobox genes: regulate embryonic development and are present in a large variety of species.
Limitations of global alignment
Analysis of homeobox genes offers an example of a problem for which global alignment may fail to reveal biologically relevant similarities. These genes regulate embryonic development and are present in a large variety of species, from flies to humans. Homeobox genes are long, and they differ greatly between species, but an approximately 60 amino acid-long region in each gene, called the homeodomain, is highly conserved. For instance, consider the mouse and human homeodomains below.
Mouse  ...ARRSRTHFTKFQTDILIEAFEKNRFPGIVTREKLAQQTGIPESRIHIWFQNRRARHPDPG...
Human  ...ARQKQTFITWTQKNRLVQAFERNPFPDTATRKKLAEQTGLQESRIQMWFQKQRSLYLKKS...
The immediate question is how to find this conserved segment within the much longer genes and ignore the flanking areas, which exhibit little similarity. Global alignment seeks similarities between two strings across their entire length; however, when searching for homeodomains, we are looking for smaller, local regions of similarity and do not need to align the entire strings. For example, the global alignment below has 22 matches, 18 indels, and 2 mismatches, resulting in the score 22 - 18 - 2 = 2 (if σ = μ = 1):
GCC-C-AGTC-TATGT-CAGGGGGCACG--A-GCATGCACA-
GCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-T-CAGAT
However, these sequences can be aligned differently (with 17 matches and 32 indels) based on a highly conserved interval represented by the substrings CAGTCTATGTCAG and CAGTTATGTTCAG:
---G----C-----C--CAGTCTATG-TCAGGGGGCACGAGCATGCACA
GCCGCCGTCGTTTTCAGCAGT-TATGTTCAG-----A------T-----
This alignment has fewer matches and a lower score of 17 - 32 = -15, even though the conserved region of the alignment contributes a score of 12 - 2 = 10, which is hardly an accident.
Figure 5.19 shows the two alignment paths corresponding to these two different alignments. The upper path, corresponding to the second alignment above, loses out because it contains many heavily penalized indels on either side of the diagonal corresponding to the conserved interval. As a result, global alignment outputs the biologically irrelevant lower path.
Homeodomain: area within homeobox genes of shared “local” similarity.
STOP and Think: Will our dynamic programming algorithm find regions of local similarity?
Exercise Break: Score these alignments (σ = μ = 1). Does our scoring function make sense?
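The scores quoted for the two alignments can be checked with a small Python sketch. The function name is hypothetical; the scoring (match +1, mismatch -μ, indel -σ) follows the definition given earlier, and the alignment rows are the two pairs shown in the textbook excerpt.

```python
def score_alignment(top, bottom, mu=1, sigma=1):
    """Score an alignment given as two equal-length rows with '-' for gaps."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score -= sigma  # insertion or deletion
        elif a == b:
            score += 1      # match
        else:
            score -= mu     # mismatch
    return score


# Global alignment: 22 matches, 18 indels, 2 mismatches.
global_score = score_alignment(
    "GCC-C-AGTC-TATGT-CAGGGGGCACG--A-GCATGCACA-",
    "GCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-T-CAGAT",
)

# Alignment built around the conserved interval: 17 matches, 32 indels.
conserved_score = score_alignment(
    "---G----C-----C--CAGTCTATG-TCAGGGGGCACGAGCATGCACA",
    "GCCGCCGTCGTTTTCAGCAGT-TATGTTCAG-----A------T-----",
)

print(global_score)     # 2
print(conserved_score)  # -15
```

The biologically relevant alignment scores 17 points lower, which is why a global scoring function alone does not surface the conserved interval.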
Visualizing Local Alignments
FIGURE 5.19 Global and local alignments of two DNA strings that share a highly conserved interval. The relevant alignment that captures this interval (upper path) loses to an irrelevant alignment (lower path), since the former incurs heavy indel penalties.
When biologically significant similarities are present in some parts of sequences v and w and absent from others, biologists attempt to ignore global alignment and instead align substrings of v and w, which yields a local alignment of the two strings. The problem of finding substrings that maximize the global alignment score over all substrings of v and w is called the Local Alignment Problem.
Revisiting Global Alignment
Global Alignment Problem: Find a highest-scoring alignment of two strings.
• Input: Two strings.
• Output: An alignment of the strings with maximum alignment score.
STOP and Think: How can we reformulate the problem to find areas of “local” similarity?
Local Alignment Problem: Find a highest-scoring “local” alignment of two strings.
• Input: Two strings v and w.
• Output: Substrings of v and w whose best global alignment score is maximized over all substrings.
STOP and Think: What algorithm would you propose to solve this problem?
“Free Taxi Rides” for Local Alignment
[Figure: alignment graph for the strings GCCCAGTCTATGTCAGGGGGCACGAGCATGCACA and GCCGCCGTCGTTTTCAGCAGTTATGTTCAGAT, with two added edges of weight 0 (the “free taxi rides”).]
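The “free taxi ride” idea can be sketched in Python as follows. A zero-weight edge from the source to every node becomes an extra `0` in the recurrence, and a zero-weight edge from every node to the sink becomes taking the maximum over the whole table. The function name, the default penalties μ = σ = 1, and the tie-breaking in the backtrack are assumptions for illustration.

```python
def local_alignment(v, w, mu=1, sigma=1):
    """Return (score, aligned_v, aligned_w) for a best local alignment."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def diag(i, j):
        """Score of reaching (i, j) along its diagonal edge."""
        return s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The leading 0 is the free ride from the source to (i, j).
            s[i][j] = max(0, diag(i, j),
                          s[i - 1][j] - sigma, s[i][j - 1] - sigma)
            if s[i][j] > best:  # free ride from (i, j) to the sink
                best, best_pos = s[i][j], (i, j)

    # Backtrack from the best node until a free ride from the source is hit.
    top, bottom = [], []
    i, j = best_pos
    while s[i][j] > 0:
        if s[i][j] == diag(i, j):
            top.append(v[i - 1]); bottom.append(w[j - 1])
            i -= 1; j -= 1
        elif s[i][j] == s[i - 1][j] - sigma:
            top.append(v[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


v = "GCCCAGTCTATGTCAGGGGGCACGAGCATGCACA"
w = "GCCGCCGTCGTTTTCAGCAGTTATGTTCAGAT"
score, a, b = local_alignment(v, w)
print(score)
print(a)
print(b)
```

The running time stays proportional to the size of the table, in contrast to running global alignment separately on every pair of substrings.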
Comparing Same-Score Alignments
STOP and Think: Which of these two alignments (which have the same score) is “better”? Why?
GATCCAG        GATCCAG
GA-C-AG        GA--CAG
Affine penalty: a way of scoring contiguous gaps higher than discontiguous gaps.
• gap opening penalty (σ): charged for the first symbol of a gap.
• gap extension penalty (ε): charged for each subsequent symbol of the same gap.
If σ = 5 and ε = 1, then the alignment on the left is penalized by 2σ = 10, whereas the alignment on the right is only penalized by σ + ε = 6.
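The arithmetic above can be reproduced with a short Python sketch that totals the affine penalty over every maximal run of gap symbols. The function name is hypothetical, and the sketch assumes each run of `-` within one row counts as a single gap.

```python
import re


def affine_gap_penalty(row_v, row_w, sigma, epsilon):
    """Total affine gap penalty of an alignment given as two rows."""
    total = 0
    for row in (row_v, row_w):
        for gap in re.findall(r"-+", row):  # each maximal run of '-' is one gap
            total += sigma + epsilon * (len(gap) - 1)
    return total


left = affine_gap_penalty("GATCCAG", "GA-C-AG", sigma=5, epsilon=1)
right = affine_gap_penalty("GATCCAG", "GA--CAG", sigma=5, epsilon=1)
print(left)   # 10: two gaps of length 1, each costing sigma
print(right)  # 6: one gap of length 2, costing sigma + epsilon
```

With a linear penalty both alignments would cost the same; the affine scheme is what separates them.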
Adding Affine Gap Penalties
Alignment with Affine Gap Penalties Problem: Construct a highest-scoring global alignment of two strings (with affine gap penalties).
• Input: Two strings along with numbers σ and ε.
• Output: A highest-scoring global alignment between these strings, as defined by the gap opening and extension penalties σ and ε.
STOP and Think: How can we modify the alignment graph to solve this problem?
Adding “Long” Edges to Graph
One solution: Add a (huge) number of new edges to alignment graph to facilitate longer gaps.
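A minimal Python sketch of this “long edges” solution follows. For every node it tries a gap of every possible length k ending there, with weight -(σ + ε(k - 1)). The match reward +1 and the mismatch penalty `mu` are carried over from the earlier scoring and are assumptions here, since the problem statement names only σ and ε; the sketch also assumes σ ≥ ε, so that one long edge is never worse than two shorter ones in a row.

```python
def affine_alignment_score(v, w, mu, sigma, epsilon):
    """Best global alignment score with affine gaps, using long gap edges."""
    n, m = len(v), len(w)
    neg_inf = float("-inf")
    s = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    s[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:  # ordinary diagonal edge
                best = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            for k in range(1, i + 1):  # long vertical edge: gap of length k
                best = max(best, s[i - k][j] - (sigma + epsilon * (k - 1)))
            for k in range(1, j + 1):  # long horizontal edge: gap of length k
                best = max(best, s[i][j - k] - (sigma + epsilon * (k - 1)))
            s[i][j] = best
    return s[n][m]


print(affine_alignment_score("GATCCAG", "GACAG", mu=1, sigma=5, epsilon=1))  # -1
```

Each node now has a number of incoming edges proportional to the string lengths, so this version needs on the order of nm(n + m) steps, which illustrates why the number of added edges is described as huge.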
Moving to Multiple Sequences
Multiple Alignment Problem: Find the highest-scoring alignment between multiple strings.
• Input: A collection of t strings.
• Output: A multiple alignment of these strings having maximal score.
STOP and Think: What algorithm would you propose to solve this problem?
Moving to Multiple Dimensions
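[Figure: one block of the three-dimensional alignment graph. Node (i, j, k) has seven incoming edges, from (i – 1, j, k), (i, j – 1, k), (i, j, k – 1), (i – 1, j – 1, k), (i – 1, j, k – 1), (i, j – 1, k – 1), and (i – 1, j – 1, k – 1).]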
STOP and Think: What is the issue with the dynamic programming approach in multiple dimensions?
Answer: The number of edges into each node of a block grows like 2^t – 1 (exponentially in the number of strings t) ...
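The growth in the number of incoming edges can be seen by enumerating a node's predecessors in a t-dimensional alignment graph. This Python sketch is illustrative; the function name is hypothetical. Each predecessor is obtained by stepping back by 0 or 1 in every coordinate, excluding the all-zero step.

```python
from itertools import product


def predecessors(node):
    """List the predecessors of a node in a t-dimensional alignment graph."""
    preds = []
    for step in product((0, 1), repeat=len(node)):
        if not any(step):
            continue  # the all-zero step would stay at the node itself
        candidate = tuple(c - d for c, d in zip(node, step))
        if min(candidate) >= 0:  # stay inside the grid
            preds.append(candidate)
    return preds


print(len(predecessors((1, 1))))     # 3 for two strings
print(len(predecessors((1, 1, 1))))  # 7 for three strings
print(len(predecessors((1,) * 10)))  # 1023 for ten strings
```

On top of the per-node cost, the number of nodes itself is the product of the string lengths, so exact dynamic programming quickly becomes impractical as t grows.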
STOP and Think: What heuristic might you propose to align multiple sequences?
Greedy Heuristic for Multiple Alignment
1. Find an optimal pairwise alignment of each pair of strings.
2. Combine the set of optimal pairwise alignments into a multiple alignment.
STOP and Think: Try this approach on the strings CCCCTTTT, TTTTGGGG, and GGGGCCCC.
There is no way to combine these optimal pairwise alignments into a meaningful multiple alignment!
CCCCTTTT----    ----CCCCTTTT    TTTTGGGG----
----TTTTGGGG    GGGGCCCC----    ----GGGGCCCC