How Do We Compare Biological Sequences?

Dynamic Programming

Phillip Compeau and Pavel Pevzner Bioinformatics Algorithms: An Active Learning Approach

©2015 by Compeau and Pevzner. All rights reserved.

Outline

•  Introduction to Sequence Alignment

•  The Manhattan Tourist Problem

•  Sequence Alignment is the Manhattan Tourist Problem in Disguise

•  An Introduction to Dynamic Programming: The Change Problem

•  The Manhattan Tourist Problem Revisited

•  From Global to Local Alignment

•  Penalizing Insertions and Deletions in Sequence Alignment

•  Multiple Sequence Alignment

Recursive Manhattan Tourist

SouthOrEast(i, j)
    if i = 0 and j = 0
        return 0
    x ← -infinity, y ← -infinity
    if i > 0
        x ← SouthOrEast(i - 1, j) + weight of vertical edge into (i, j)
    if j > 0
        y ← SouthOrEast(i, j - 1) + weight of horizontal edge into (i, j)
    return max(x, y)
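
The recursion above can be sketched in Python as follows. The matrix names `down` and `right` and the small grid are assumptions made for illustration: `down[i - 1][j]` is taken to be the weight of the vertical edge into (i, j), and `right[i][j - 1]` the weight of the horizontal edge into (i, j).

```python
import math


def south_or_east(i, j, down, right):
    """Length of a longest path from (0, 0) to (i, j), by plain recursion."""
    if i == 0 and j == 0:
        return 0
    x = y = -math.inf
    if i > 0:
        # Arrive at (i, j) from the north via a vertical edge.
        x = south_or_east(i - 1, j, down, right) + down[i - 1][j]
    if j > 0:
        # Arrive at (i, j) from the west via a horizontal edge.
        y = south_or_east(i, j - 1, down, right) + right[i][j - 1]
    return max(x, y)


# Hypothetical grid with 3 x 3 nodes; the weights are made up.
down = [[1, 0, 2],
        [4, 6, 5]]
right = [[3, 2],
         [0, 7],
         [3, 3]]

print(south_or_east(2, 2, down, right))
```

This version recomputes the same subproblems many times, which is the inefficiency the exercise below asks about.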

Exercise Break: How many times is SouthOrEast(3, 2) called in the computation of SouthOrEast(9, 7)?
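
One way to explore this question is to record every recursive call. The sketch below mimics only the call pattern of SouthOrEast (edge weights do not affect how often a node is visited); the function name `count_calls` and the small example are assumptions for illustration.

```python
from collections import Counter
from math import comb


def count_calls(i, j, calls):
    """Follow the same recursive calls as SouthOrEast(i, j), tallying each one."""
    calls[(i, j)] += 1
    if i > 0:
        count_calls(i - 1, j, calls)
    if j > 0:
        count_calls(i, j - 1, calls)


calls = Counter()
count_calls(4, 3, calls)

# Each call to (a, b) corresponds to one grid path between (a, b) and (4, 3),
# so the tally should agree with a binomial coefficient.
print(calls[(1, 1)], comb((4 - 1) + (3 - 1), 4 - 1))
```

Changing the arguments to `count_calls(9, 7, calls)` and reading `calls[(3, 2)]` addresses the exercise directly.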

Dynamic Programming Manhattan

STOP and Think: Which element of the table should we fill in next and what should its value be?

STOP and Think: Do you see a longest path in this grid? What algorithm did you use?
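
A minimal sketch of filling in the table row by row, so that each node is computed once from its already-computed northern and western neighbors. The names `down` and `right` and the example weights are assumptions, not the grid shown on the slides.

```python
def manhattan_tourist(n, m, down, right):
    """Return the table s, where s[i][j] is the longest path length to (i, j)."""
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # First column: only vertical moves are possible.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    # First row: only horizontal moves are possible.
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    # Interior: take the better of arriving from the north or from the west.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s


# Hypothetical grid with 3 x 3 nodes; the weights are made up.
down = [[1, 0, 2],
        [4, 6, 5]]
right = [[3, 2],
         [0, 7],
         [3, 3]]

for row in manhattan_tourist(2, 2, down, right):
    print(row)
```

The table has (n + 1)(m + 1) entries and each takes constant time, in contrast with the repeated work of the recursive version.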

Reconstructing an Optimal Path

STOP and Think: In general, how do we reconstruct this path?

Answer: Start at the ending node and follow the edges that were used to compute each value backward to the beginning node.
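
The backward walk can be sketched by storing, for every node, which incoming edge gave its value. The names `longest_path`, `back`, `down`, and `right` and the example weights are assumptions for illustration; ties are broken arbitrarily (here in favor of the vertical edge).

```python
def longest_path(n, m, down, right):
    """Return (length, moves) of a longest path from (0, 0) to (n, m).

    moves is a string over "S" (south) and "E" (east).
    """
    s = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append((s[i - 1][j] + down[i - 1][j], "S"))
            if j > 0:
                candidates.append((s[i][j - 1] + right[i][j - 1], "E"))
            # Keep both the best value and the edge that produced it.
            s[i][j], back[i][j] = max(candidates)
    # Walk backward from the ending node to the beginning node.
    moves = []
    i, j = n, m
    while (i, j) != (0, 0):
        move = back[i][j]
        moves.append(move)
        if move == "S":
            i -= 1
        else:
            j -= 1
    return s[n][m], "".join(reversed(moves))


# Hypothetical grid with 3 x 3 nodes; the weights are made up.
down = [[1, 0, 2],
        [4, 6, 5]]
right = [[3, 2],
         [0, 7],
         [3, 3]]

print(longest_path(2, 2, down, right))
```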

Finding an LCS

[Figure: alignment grid with ATCGTCC along the top and ATGTTATA down the side.]

Exercise Break: Find a longest common subsequence of ATGTTATA and ATCGTCC.
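
A longest common subsequence corresponds to a longest path in the alignment grid when matching diagonal edges have weight 1 and all other edges have weight 0. The following is a sketch under that assumption; the function name is hypothetical, and when several longest common subsequences exist it returns only one of them.

```python
def longest_common_subsequence(v, w):
    """Return one longest common subsequence of strings v and w."""
    n, m = len(v), len(w)
    # s[i][j] = length of an LCS of v[:i] and w[:j].
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])
    # Backtrack from (n, m), collecting symbols on matching diagonal edges.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1]:
            out.append(v[i - 1])
            i -= 1
            j -= 1
        elif s[i - 1][j] >= s[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(longest_common_subsequence("ATGTTATA", "ATCGTCC"))
```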

Strengthening Alignment Scoring

0 1 2 2 3 4 5 6 7 8
 A T - G T T A T A
 A T C G T - C - C
0 1 2 3 4 5 5 6 6 7

Alignment score: Divided into three components:
•  match reward (+1)
•  mismatch penalty (-μ)
•  insertion/deletion penalty (-σ)
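
As a small illustration of the three components, the sketch below scores a given alignment column by column. The function name and the choice μ = σ = 1 in the example are assumptions.

```python
def alignment_score(top, bottom, mu, sigma):
    """Score an alignment given as two equal-length rows with '-' for gaps."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score -= sigma   # insertion/deletion penalty
        elif a == b:
            score += 1       # match reward
        else:
            score -= mu      # mismatch penalty
    return score


# The alignment above has 4 matches, 2 mismatches, and 3 indels,
# so with mu = sigma = 1 its score is 4 - 2 - 3 = -1.
print(alignment_score("AT-GTTATA", "ATCGT-C-C", mu=1, sigma=1))
```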

Strengthening Alignment Scoring

Global Alignment Problem: Find a highest-scoring alignment of two strings.
•  Input: Two strings.
•  Output: An alignment of the strings with maximum alignment score.

STOP and Think: How can we solve this problem?

Strengthening Alignment Scoring

[Figure: alignment graph of TGTTA (rows) against TCGT (columns), with weight +1 on the diagonal edges for matching symbols and negative penalties on all other edges.]

Answer: Slight modification to alignment graph ...

Exercise Break: Find a best alignment (σ = 2, μ = 3).
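
With edge weights of +1, -μ, and -σ in the alignment graph, the same dynamic programming table and backtracking used for the Manhattan Tourist Problem apply. Below is a minimal sketch assuming integer penalties; the function name is hypothetical, and when several optimal alignments exist it reports only one.

```python
def global_alignment(v, w, mu, sigma):
    """Return (score, top_row, bottom_row) of a highest-scoring global alignment."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -sigma * i
    for j in range(1, m + 1):
        s[0][j] = -sigma * j

    def diagonal(i, j):
        # Weight of the diagonal edge into (i, j): match reward or mismatch penalty.
        return 1 if v[i - 1] == w[j - 1] else -mu

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + diagonal(i, j),
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma)
    # Backtrack from the sink to the source.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + diagonal(i, j):
            top.append(v[i - 1])
            bottom.append(w[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] - sigma:
            top.append(v[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(w[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_alignment("TGTTA", "TCGT", mu=3, sigma=2)
print(score)
print(top)
print(bottom)
```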

Finding “Local” Similarities

Homeobox genes: regulate embryonic development and are present in a large variety of species.

Limitations of global alignment

Analysis of homeobox genes offers an example of a problem for which global alignment may fail to reveal biologically relevant similarities. These genes regulate embryonic development and are present in a large variety of species, from flies to humans. Homeobox genes are long, and they differ greatly between species, but an approximately 60 amino acid-long region in each gene, called the homeodomain, is highly conserved. For instance, consider the mouse and human homeodomains below.

Mouse  ...ARRSRTHFTKFQTDILIEAFEKNRFPGIVTREKLAQQTGIPESRIHIWFQNRRARHPDPG...
Human  ...ARQKQTFITWTQKNRLVQAFERNPFPDTATRKKLAEQTGLQESRIQMWFQKQRSLYLKKS...

The immediate question is how to find this conserved segment within the much longer genes and ignore the flanking areas, which exhibit little similarity. Global alignment seeks similarities between two strings across their entire length; however, when searching for homeodomains, we are looking for smaller, local regions of similarity and do not need to align the entire strings. For example, the global alignment below has 22 matches, 18 indels, and 2 mismatches, resulting in the score 22 - 18 - 2 = 2 (if σ = μ = 1):

GCC-C-AGTC-TATGT-CAGGGGGCACG--A-GCATGCACA-
GCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-T-CAGAT

However, these sequences can be aligned differently (with 17 matches and 32 indels) based on a highly conserved interval represented by the substrings CAGTCTATGTCAG and CAGTTATGTTCAG:

---G----C-----C--CAGTCTATG-TCAGGGGGCACGAGCATGCACA
GCCGCCGTCGTTTTCAGCAGT-TATGTTCAG-----A------T-----

This alignment has fewer matches and a lower score of 17 - 32 = -15, even though the conserved region of the alignment contributes a score of 12 - 2 = 10, which is hardly an accident.

Figure 5.19 shows the two alignment paths corresponding to these two different alignments. The upper path, corresponding to the second alignment above, loses out because it contains many heavily penalized indels on either side of the diagonal corresponding to the conserved interval. As a result, global alignment outputs the biologically irrelevant lower path.

Homeodomain: area within homeobox genes of shared “local” similarity.

STOP and Think: Will our dynamic programming algorithm find regions of local similarity?

Exercise Break: Score these alignments (σ = μ = 1). Does our scoring function make sense?
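
The counts quoted in the text can be checked with a short tally. In this sketch each alignment is written as two rows of equal length; the function name `tally` and the slice marking the conserved interval (columns 18 to 31 of the second alignment) are assumptions for illustration.

```python
def tally(top, bottom):
    """Return (matches, mismatches, indels) for an alignment given as two rows."""
    matches = mismatches = indels = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, indels


first = ("GCC-C-AGTC-TATGT-CAGGGGGCACG--A-GCATGCACA-",
         "GCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-T-CAGAT")
second = ("---G----C-----C--CAGTCTATG-TCAGGGGGCACGAGCATGCACA",
          "GCCGCCGTCGTTTTCAGCAGT-TATGTTCAG-----A------T-----")

for name, (top, bottom) in (("first", first), ("second", second)):
    matches, mismatches, indels = tally(top, bottom)
    # With sigma = mu = 1 the score is matches - mismatches - indels.
    print(name, matches, mismatches, indels, matches - mismatches - indels)

# Restrict the second alignment to its conserved interval.
print(tally(second[0][17:31], second[1][17:31]))
```

Comparing the score of the conserved interval alone with the score of the whole second alignment shows how much the flanking indels cost.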

Visualizing Local Alignments

FIGURE 5.19 Global and local alignments of two DNA strings that share a highly conserved interval. The relevant alignment that captures this interval (upper path) loses to an irrelevant alignment (lower path), since the former incurs heavy indel penalties.

When biologically significant similarities are present in some parts of sequences v and w and absent from others, biologists attempt to ignore global alignment and instead align substrings of v and w, which yields a local alignment of the two strings. The problem of finding substrings that maximize the global alignment score over all substrings of v and w is called the Local Alignment Problem.

Revisiting Global Alignment

Global Alignment Problem: Find a highest-scoring alignment of two strings.
•  Input: Two strings.
•  Output: An alignment of the strings with maximum alignment score.

STOP and Think: How can we reformulate the problem to find areas of “local” similarity?

Revisiting Global Alignment

Local Alignment Problem: Find a highest-scoring “local” alignment of two strings.
•  Input: Two strings v and w.
•  Output: Substrings of v and w whose best global alignment score is maximized over all substrings.

STOP and Think: What algorithm would you propose to solve this problem?

“Free Taxi Rides” for Local Alignment

[Figure: alignment graph for GCCCAGTCTATGTCAGGGGGCACGAGCATGCACA and GCCGCCGTCGTTTTCAGCAGTTATGTTCAGAT, with two edges of weight 0 (the “free taxi rides”).]
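
In the dynamic programming recurrence, a free ride from the source to a node amounts to adding 0 as one more candidate value for that node, and a free ride to the sink amounts to taking the maximum over the whole table. The sketch below assumes a match reward of +1 and integer penalties; the function name is hypothetical, and ties between equally good local alignments are broken arbitrarily.

```python
def local_alignment(v, w, mu=1, sigma=1):
    """Return (score, top_row, bottom_row) of a highest-scoring local alignment."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]

    def diagonal(i, j):
        return 1 if v[i - 1] == w[j - 1] else -mu

    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The extra 0 is the free taxi ride from the source.
            s[i][j] = max(0,
                          s[i - 1][j - 1] + diagonal(i, j),
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma)
            if s[i][j] > best:
                # Free taxi ride to the sink: remember the best node anywhere.
                best, best_pos = s[i][j], (i, j)
    # Backtrack from the best node until a node with value 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == s[i - 1][j - 1] + diagonal(i, j):
            top.append(v[i - 1])
            bottom.append(w[j - 1])
            i -= 1
            j -= 1
        elif s[i][j] == s[i - 1][j] - sigma:
            top.append(v[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(w[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


v = "GCCCAGTCTATGTCAGGGGGCACGAGCATGCACA"
w = "GCCGCCGTCGTTTTCAGCAGTTATGTTCAGAT"
print(local_alignment(v, w))
```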

Comparing Same-Score Alignments

STOP and Think: Which of these two alignments (which have the same score) is “better”? Why?

GATCCAG      GATCCAG
GA-C-AG      GA--CAG

Affine penalty: a way of scoring contiguous gaps higher than discontiguous gaps.
•  gap opening penalty (σ): assessed to the first symbol of a gap.
•  gap extension penalty (ε): assessed to each subsequent symbol of the gap.

If σ = 5 and ε = 1, then the alignment on the left (two separate gaps) is penalized by 2σ = 10, whereas the alignment on the right (one gap of length 2) is only penalized by σ + ε = 6.
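
The arithmetic can be checked with a short sketch that charges σ for the first symbol of every run of gap symbols and ε for each further symbol in the run. The function name is an assumption; it computes only the gap part of the score.

```python
from itertools import groupby


def affine_gap_penalty(top, bottom, sigma, epsilon):
    """Total affine gap penalty of an alignment given as two rows."""
    penalty = 0
    for row in (top, bottom):
        # groupby yields maximal runs of identical symbols within the row.
        for symbol, run in groupby(row):
            if symbol == "-":
                length = sum(1 for _ in run)
                penalty += sigma + epsilon * (length - 1)
    return penalty


print(affine_gap_penalty("GATCCAG", "GA-C-AG", sigma=5, epsilon=1))  # two gaps: 2 * 5 = 10
print(affine_gap_penalty("GATCCAG", "GA--CAG", sigma=5, epsilon=1))  # one gap: 5 + 1 = 6
```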

Adding Affine Gap Penalties

Alignment with Affine Gap Penalties Problem: Construct a highest-scoring global alignment of two strings (with affine gap penalties).
•  Input: Two strings along with numbers σ and ε.
•  Output: A highest-scoring global alignment between these strings, as defined by the gap opening and extension penalties σ and ε.

STOP and Think: How can we modify the alignment graph to solve this problem?

Adding “Long” Edges to Graph

One solution: Add a (huge) number of new edges to alignment graph to facilitate longer gaps.
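
A sketch of this idea, computing the score only: every node receives a long vertical edge from each node above it and a long horizontal edge from each node to its left, where an edge spanning k symbols has weight -(σ + ε(k - 1)). The function name and the mismatch penalty parameter `mu` are assumptions. The extra inner loops make the running time roughly cubic in the string length, which is why the number of edges is described as huge.

```python
def affine_alignment_score(v, w, mu, sigma, epsilon):
    """Score of a best global alignment with affine gaps, using long edges."""
    n, m = len(v), len(w)
    neg_inf = float("-inf")
    s = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    s[0][0] = 0

    def gap(k):
        # Weight of a long edge that skips k symbols in one gap.
        return sigma + epsilon * (k - 1)

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                best = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            for k in range(1, i + 1):   # long vertical edges into (i, j)
                best = max(best, s[i - k][j] - gap(k))
            for k in range(1, j + 1):   # long horizontal edges into (i, j)
                best = max(best, s[i][j - k] - gap(k))
            s[i][j] = best
    return s[n][m]


print(affine_alignment_score("GATCCAG", "GACAG", mu=1, sigma=5, epsilon=1))
```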

Moving to Multiple Sequences

Multiple Alignment Problem: Find the highest-scoring alignment between multiple strings.
•  Input: A collection of t strings.
•  Output: A multiple alignment of these strings having maximal score.

STOP and Think: What algorithm would you propose to solve this problem?

Moving to Multiple Dimensions

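[Figure: one block of the three-dimensional alignment graph, in which node (i, j, k) has seven predecessors: (i – 1, j, k), (i, j – 1, k), (i, j, k – 1), (i – 1, j – 1, k), (i – 1, j, k – 1), (i, j – 1, k – 1), and (i – 1, j – 1, k – 1).]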

STOP and Think: What is the issue with the dynamic programming approach in multiple dimensions?

Answer: The number of edges in a single block grows like 2^t – 1, where t is the number of strings...
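
The growth can be seen by enumerating the predecessors of a node: each coordinate either stays put or decreases by one, and at least one coordinate must decrease. The function name below is an assumption for illustration.

```python
from itertools import product


def predecessors(node):
    """All nodes with an edge into `node` in a t-dimensional alignment graph."""
    preds = []
    for step in product((0, 1), repeat=len(node)):
        if not any(step):
            continue  # the all-zero step would be the node itself
        pred = tuple(c - d for c, d in zip(node, step))
        if min(pred) >= 0:  # stay inside the grid
            preds.append(pred)
    return preds


print(len(predecessors((5, 5, 5))))     # t = 3: 2**3 - 1 = 7
print(len(predecessors((2, 2, 2, 2))))  # t = 4: 2**4 - 1 = 15
```

For t strings of length about n, the graph also has on the order of n^t nodes, so the table itself grows exponentially in t.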

STOP and Think: What heuristic might you propose to align multiple sequences?

Greedy Heuristic for Multiple Alignment

1.  Find an optimal pairwise alignment of each pair of strings.

2.  Combine the set of optimal pairwise alignments into a multiple alignment.

STOP and Think: Try this approach on the strings CCCCTTTT, TTTTGGGG, and GGGGCCCC.

CCCCTTTT----    ----CCCCTTTT    TTTTGGGG----
----TTTTGGGG    GGGGCCCC----    ----GGGGCCCC

There is no way to combine these optimal pairwise alignments into a meaningful multiple alignment!
