How Do We Compare Biological Sequences?compeau.cbd.cmu.edu/wp-content/uploads/2016/08/lec_38.pdf ·...

How Do We Compare Biological Sequences?

Dynamic Programming

Phillip Compeau and Pavel Pevzner
Bioinformatics Algorithms: An Active Learning Approach

Outline

•  Introduction to Sequence Alignment

•  The Manhattan Tourist Problem

•  Sequence Alignment is the Manhattan Tourist Problem in Disguise

•  An Introduction to Dynamic Programming: The Change Problem

•  The Manhattan Tourist Problem Revisited

•  From Global to Local Alignment

•  Penalizing Insertions and Deletions in Sequence Alignment

•  Multiple Sequence Alignment

Recursive Manhattan Tourist

SouthOrEast(i, j)
    if i = 0 and j = 0
        return 0
    x ← -infinity, y ← -infinity
    if i > 0
        x ← SouthOrEast(i - 1, j) + weight of vertical edge into (i, j)
    if j > 0
        y ← SouthOrEast(i, j - 1) + weight of horizontal edge into (i, j)
    return max(x, y)

Exercise Break: How many times is SouthOrEast(3, 2) called in the computation of SouthOrEast(9, 7)?
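One way to explore this exercise is to instrument the recursion and count its calls. The sketch below is an illustration rather than part of the original slides: it sets every edge weight to 0 (an assumption, since the weights do not change how often each node is visited) and tallies calls in a `Counter` named `calls`, which is a hypothetical helper.

```python
from collections import Counter
from math import comb

calls = Counter()  # calls[(i, j)] = number of times SouthOrEast(i, j) is invoked


def south_or_east(i, j):
    """Naive recursion from the slide, with all edge weights assumed to be 0."""
    calls[(i, j)] += 1
    if i == 0 and j == 0:
        return 0
    x = y = float("-inf")
    if i > 0:
        x = south_or_east(i - 1, j) + 0  # weight of vertical edge into (i, j)
    if j > 0:
        y = south_or_east(i, j - 1) + 0  # weight of horizontal edge into (i, j)
    return max(x, y)


south_or_east(9, 7)
print(calls[(3, 2)])       # number of calls made with arguments (3, 2)
print(comb(6 + 5, 5))      # number of south/east paths from (3, 2) to (9, 7)
print(sum(calls.values())) # total number of recursive calls
```

Each call to SouthOrEast(3, 2) corresponds to one path of south and east steps from (3, 2) to (9, 7), so the two printed counts agree. The repeated work is what dynamic programming avoids.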

Dynamic Programming Manhattan

STOP and Think: Which element of the table should we fill in next and what should its value be?

STOP and Think: Do you see a longest path in this grid? What algorithm did you use?
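The table-filling idea can be sketched as follows. This is a minimal illustration, assuming the grid is given as two weight matrices named `down` and `right` (hypothetical names) and using an illustrative 4 x 4 grid whose weights are not taken from the slide figure.

```python
def manhattan_tourist(n, m, down, right):
    """Length of a longest path from (0, 0) to (n, m) in a grid.

    down[i][j]  = weight of the vertical edge from (i, j) to (i + 1, j)
    right[i][j] = weight of the horizontal edge from (i, j) to (i, j + 1)
    """
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row can each be reached in only one way.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    # Every other node takes the better of its two incoming edges.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s[n][m]


# Illustrative weights (an assumption, not read from the slides).
down = [[1, 0, 2, 4, 3],
        [4, 6, 5, 2, 1],
        [4, 4, 5, 2, 1],
        [5, 6, 8, 5, 3]]
right = [[3, 2, 4, 0],
         [3, 2, 4, 2],
         [0, 7, 3, 3],
         [3, 3, 0, 2],
         [1, 3, 2, 2]]
print(manhattan_tourist(4, 4, down, right))  # 34
```

Each table entry is computed once, so the running time is proportional to the number of nodes in the grid.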

Reconstructing an Optimal Path

STOP and Think: In general, how do we reconstruct this path?

Answer: start at ending node and follow edges backwards to the beginning node.
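A minimal sketch of this backtracking step, assuming the same `down` and `right` weight matrices as an input format (hypothetical names) and a small illustrative 2 x 2 grid. At each node it checks which incoming edge explains the stored score and steps backwards along it.

```python
def longest_path(n, m, down, right):
    """Return (score, moves) for a longest path from (0, 0) to (n, m).

    moves is a string of 'S' (south) and 'E' (east) steps.
    """
    neg_inf = float("-inf")
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            from_north = s[i - 1][j] + down[i - 1][j] if i > 0 else neg_inf
            from_west = s[i][j - 1] + right[i][j - 1] if j > 0 else neg_inf
            s[i][j] = max(from_north, from_west)

    # Start at the ending node and follow edges backwards to the beginning.
    moves = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and s[i][j] == s[i - 1][j] + down[i - 1][j]:
            moves.append("S")
            i -= 1
        else:
            moves.append("E")
            j -= 1
    return s[n][m], "".join(reversed(moves))


# Illustrative weights (an assumption, not read from the slides).
down = [[1, 0, 2],
        [4, 6, 5]]
right = [[3, 2],
         [3, 2],
         [0, 7]]
print(longest_path(2, 2, down, right))  # (17, 'SESE')
```

Storing an explicit backtracking pointer per node is an equally valid design; recomputing the comparison, as here, avoids the extra table.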

Finding an LCS

[Figure: grid with columns labeled A T C G T C C]

Exercise Break: Find a longest common subsequence of ATGTTATA and ATCGTCC.
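A longest common subsequence can be found with the same table-filling and backtracking pattern. The sketch below is one standard formulation, offered as an illustration; the function name `lcs` is an assumption.

```python
def lcs(v, w):
    """Return one longest common subsequence of strings v and w."""
    n, m = len(v), len(w)
    # s[i][j] = length of an LCS of v[:i] and w[:j]
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])

    # Backtrack from (n, m), collecting matched symbols.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1]:
            out.append(v[i - 1])
            i -= 1
            j -= 1
        elif s[i - 1][j] >= s[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("ATGTTATA", "ATCGTCC"))  # ATGT
```

Because ATGTTATA contains no C, any common subsequence can use only the A, T, G, T of ATCGTCC, which is why the answer has length 4.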

Strengthening Alignment Scoring

0 1 2 2 3 4 5 6 7 8
A T - G T T A T A
A T C G T - C - C
0 1 2 3 4 5 5 6 6 7

Alignment score: Divided into three components:

•  match reward (+1)

•  mismatch penalty (-μ)

•  insertion/deletion penalty (-σ)

Global Alignment Problem: Find a highest-scoring alignment of two strings.

•  Input: Two strings.

•  Output: An alignment of the strings with maximum alignment score.

STOP and Think: How can we solve this problem?

[Figure: alignment graph with columns labeled T C G T]

Answer: Slight modification to alignment graph ...

Exercise Break: Find a best alignment (σ = 2, μ = 3).
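The modified alignment graph can be sketched in code: diagonal edges are weighted +1 (match) or -μ (mismatch), and horizontal and vertical edges are weighted -σ. The strings of the exercise were lost with the figure, so the example below uses illustrative strings (an assumption) with the exercise's parameters σ = 2 and μ = 3.

```python
def global_alignment(v, w, mu, sigma):
    """Return (score, aligned_v, aligned_w) for a highest-scoring global alignment.

    Scoring: match +1, mismatch -mu, insertion/deletion -sigma.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -sigma * i
    for j in range(1, m + 1):
        s[0][j] = -sigma * j

    def diag(i, j):
        return s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(diag(i, j), s[i - 1][j] - sigma, s[i][j - 1] - sigma)

    # Backtrack from the sink (n, m) to the source (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == diag(i, j):
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] - sigma:
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


# Illustrative strings (assumed), scored with sigma = 2 and mu = 3.
score, a, b = global_alignment("ACGT", "AGT", mu=3, sigma=2)
print(score)
print(a)
print(b)
```

Setting μ = σ = 0 reduces this scoring to the longest common subsequence problem, since only matches then contribute.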

Finding “Local” Similarities

Homeobox genes: regulate embryonic development and are present in a large variety of species.

Limitations of global alignment

Analysis of homeobox genes offers an example of a problem for which global alignment may fail to reveal biologically relevant similarities. These genes regulate embryonic development and are present in a large variety of species, from flies to humans. Homeobox genes are long, and they differ greatly between species, but an approximately 60 amino acid-long region in each gene, called the homeodomain, is highly conserved. For instance, consider the mouse and human homeodomains below.

Mouse  ...ARRSRTHFTKFQTDILIEAFEKNRFPGIVTREKLAQQTGIPESRIHIWFQNRRARHPDPG...
Human  ...ARQKQTFITWTQKNRLVQAFERNPFPDTATRKKLAEQTGLQESRIQMWFQKQRSLYLKKS...

The immediate question is how to find this conserved segment within the much longer genes and ignore the flanking areas, which exhibit little similarity. Global alignment seeks similarities between two strings across their entire length; however, when searching for homeodomains, we are looking for smaller, local regions of similarity and do not need to align the entire strings. For example, the global alignment below has 22 matches, 18 indels, and 2 mismatches, resulting in the score 22 − 18 − 2 = 2 (if σ = μ = 1):

GCC-C-AGTC-TATGT-CAGGGGGCACG--A-GCATGCACA-
GCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-T-CAGAT

However, these sequences can be aligned differently (with 17 matches and 32 indels) based on a highly conserved interval represented by the substrings CAGTCTATGTCAG and CAGTTATGTTCAG:

---G----C-----C--CAGTCTATG-TCAGGGGGCACGAGCATGCACA
GCCGCCGTCGTTTTCAGCAGT-TATGTTCAG-----A------T-----

This alignment has fewer matches and a lower score of 17 − 32 = −15, even though the conserved region of the alignment contributes a score of 12 − 2 = 10, which is hardly an accident.
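The arithmetic for both alignments can be checked mechanically. The sketch below is an illustration; the helper name `score_alignment` is an assumption. It counts matches, mismatches, and indels column by column and applies σ = μ = 1.

```python
def score_alignment(top, bottom, mu=1, sigma=1):
    """Return (matches, mismatches, indels, score) for a two-row alignment."""
    matches = mismatches = indels = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, indels, matches - mu * mismatches - sigma * indels


# The global alignment from the text.
global_top = "GCC-C-AGTC-TATGT-CAGGGGGCACG--A-GCATGCACA-"
global_bot = "GCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-T-CAGAT"

# The alignment built around the conserved interval.
local_top = "---G----C-----C--CAGTCTATG-TCAGGGGGCACGAGCATGCACA"
local_bot = "GCCGCCGTCGTTTTCAGCAGT-TATGTTCAG-----A------T-----"

print(score_alignment(global_top, global_bot))  # (22, 2, 18, 2)
print(score_alignment(local_top, local_bot))    # (17, 0, 32, -15)

# Columns 18 through 31 (1-based) hold the conserved interval.
print(score_alignment(local_top[17:31], local_bot[17:31]))  # (12, 0, 2, 10)
```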

Figure 5.19 shows the two alignment paths corresponding to these two different alignments. The upper path, corresponding to the second alignment above, loses out because it contains many heavily penalized indels on either side of the diagonal corresponding to the conserved interval. As a result, global alignment outputs the biologically irrelevant lower path.

Homeodomain: area within homeobox genes of shared “local” similarity.

STOP and Think: Will our dynamic programming algorithm find regions of local similarity?

Exercise Break: Score these alignments (σ = μ = 1). Does our scoring function make sense?

Visualizing Local Alignments

FIGURE 5.19 Global and local alignments of two DNA strings that share a highly conserved interval. The relevant alignment that captures this interval (upper path) loses to an irrelevant alignment (lower path), since the former incurs heavy indel penalties.

When biologically significant similarities are present in some parts of sequences v and w and absent from others, biologists attempt to ignore global alignment and instead align substrings of v and w, which yields a local alignment of the two strings. The problem of finding substrings that maximize the global alignment score over all substrings of v and w is called the Local Alignment Problem.

Revisiting Global Alignment

STOP and Think: How can we reformulate the problem to find areas of “local” similarity?

Local Alignment Problem: Find a highest-scoring “local” alignment of two strings.

•  Input: Two strings v and w.

•  Output: Substrings of v and w whose best global alignment score is maximized over all substrings.

STOP and Think: What algorithm would you propose to solve this problem?

“Free Taxi Rides” for Local Alignment

[Figure: alignment graph for the strings GCCCAGTCTATGTCAGGGGGCACGAGCATGCACA and GCCGCCGTCGTTTTCAGCAGTTATGTTCAGAT]
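As this idea is usually presented, "free taxi rides" are zero-weight edges from the source to every node and from every node to the sink. In the recurrence this adds 0 as a fourth option at every node, and the best score is the maximum over the whole table. The sketch below is a minimal illustration under that reading, with σ = μ = 1 as default parameters (an assumption); the function name is hypothetical.

```python
def local_alignment(v, w, mu=1, sigma=1):
    """Return (score, aligned_v_substring, aligned_w_substring)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def diag(i, j):
        return s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The 0 is the free taxi ride from the source to (i, j).
            s[i][j] = max(0, diag(i, j), s[i - 1][j] - sigma, s[i][j - 1] - sigma)
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)

    # The free ride to the sink starts at the best-scoring node;
    # backtrack from there until a node with score 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == diag(i, j):
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif s[i][j] == s[i - 1][j] - sigma:
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


v = "GCCCAGTCTATGTCAGGGGGCACGAGCATGCACA"
w = "GCCGCCGTCGTTTTCAGCAGTTATGTTCAGAT"
score, a, b = local_alignment(v, w)
print(score)
print(a)
print(b)
```

The conserved interval alone scores 12 − 2 = 10 under these parameters, so the local score printed for these two strings is at least 10.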

Comparing Same-Score Alignments

STOP and Think: Which of these two alignments (which have the same score) is “better”? Why?

GATCCAG     GATCCAG
GA-C-AG     GA--CAG

Affine penalty: a way of scoring contiguous gaps higher than discontiguous gaps.

•  gap opening penalty (σ): assessed to the first symbol of a gap.

•  gap extension penalty (ε): assessed to each subsequent symbol of the gap.

If σ = 5 and ε = 1, then the alignment on the left is penalized by 2σ = 10, whereas the alignment on the right is only penalized by σ + ε = 6.
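The two penalties can be reproduced with a short helper that charges σ for the first symbol of each gap and ε for every further symbol. This is an illustrative sketch; the function name is an assumption.

```python
def affine_gap_penalty(top, bottom, sigma, eps):
    """Total gap penalty of a two-row alignment under affine gap scoring."""
    penalty = 0
    for row in (top, bottom):
        in_gap = False
        for ch in row:
            if ch == "-":
                # Opening a gap costs sigma; extending it costs eps.
                penalty += eps if in_gap else sigma
                in_gap = True
            else:
                in_gap = False
    return penalty


print(affine_gap_penalty("GATCCAG", "GA-C-AG", sigma=5, eps=1))  # 10
print(affine_gap_penalty("GATCCAG", "GA--CAG", sigma=5, eps=1))  # 6
```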

Adding Affine Gap Penalties

Alignment with Affine Gap Penalties Problem: Construct a highest-scoring global alignment of two strings (with affine gap penalties).

•  Input: Two strings along with numbers σ and ε.

•  Output: A highest-scoring global alignment between these strings, as defined by the gap opening and extension penalties σ and ε.

STOP and Think: How can we modify the alignment graph to solve this problem?

Adding “Long” Edges to Graph

One solution: Add a (huge) number of new edges to alignment graph to facilitate longer gaps.
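A minimal sketch of the long-edge idea: every node receives an incoming edge from each node above it in the same column and each node to its left in the same row, where an edge spanning k symbols is weighted -(σ + ε(k - 1)). The mismatch penalty μ and the function name are assumptions, and the sketch assumes σ ≥ ε so that one long edge is never worse than two shorter ones chained together.

```python
def affine_alignment_score(v, w, mu, sigma, eps):
    """Score of a best global alignment with affine gaps, using long edges."""
    n, m = len(v), len(w)
    neg_inf = float("-inf")
    s = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    s[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                best = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            # Long vertical edges: a gap of length k in w.
            for k in range(1, i + 1):
                best = max(best, s[i - k][j] - (sigma + eps * (k - 1)))
            # Long horizontal edges: a gap of length k in v.
            for k in range(1, j + 1):
                best = max(best, s[i][j - k] - (sigma + eps * (k - 1)))
            s[i][j] = best
    return s[n][m]


# Five matches and one gap of length 2: 5 - (5 + 1) = -1.
print(affine_alignment_score("GATCCAG", "GACAG", mu=1, sigma=5, eps=1))  # -1
```

The extra inner loops are the cost of the "huge" number of edges: each node now inspects a whole row and column of predecessors instead of three neighbors.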

Moving to Multiple Sequences

Multiple Alignment Problem: Find the highest-scoring alignment between multiple strings.

•  Input: A collection of t strings.

•  Output: A multiple alignment of these strings having maximal score.

STOP and Think: What algorithm would you propose to solve this problem?

Moving to Multiple Dimensions

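[Figure: one cube of the three-dimensional alignment graph. Node (i, j, k) has seven incoming edges, from (i – 1, j, k), (i, j – 1, k), (i, j, k – 1), (i – 1, j – 1, k), (i – 1, j, k – 1), (i, j – 1, k – 1), and (i – 1, j – 1, k – 1).]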

STOP and Think: What is the issue with the dynamic programming approach in multiple dimensions?

Answer: The number of edges in a single block grows like 2^t – 1...
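The count can be seen by enumerating the predecessors of a node directly: each of the t coordinates either steps back by one or stays put, and the single choice where nothing moves is excluded. The sketch below is an illustration with a hypothetical function name, and it assumes an interior node whose coordinates are all positive.

```python
from itertools import product


def predecessors(node):
    """All nodes with an edge into `node` in a t-dimensional alignment graph."""
    preds = []
    for delta in product((0, 1), repeat=len(node)):
        if any(delta):  # skip the all-zero offset, which is the node itself
            preds.append(tuple(x - d for x, d in zip(node, delta)))
    return preds


print(len(predecessors((5, 5, 5))))  # 7
for t in range(2, 11):
    print(t, len(predecessors((1,) * t)))  # 2**t - 1 incoming edges
```

With t strings of length n, the table has roughly n^t nodes and each node has 2^t – 1 incoming edges, so exact dynamic programming quickly becomes impractical as t grows.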

STOP and Think: What heuristic might you propose to align multiple sequences?

Greedy Heuristic for Multiple Alignment

1.  Find an optimal pairwise alignment of each pair of strings.

2.  Combine the set of optimal pairwise alignments into a multiple alignment.

STOP and Think: Try this approach on the strings CCCCTTTT, TTTTGGGG, and GGGGCCCC.

There is no way to combine these optimal pairwise alignments into a meaningful multiple alignment!

CCCCTTTT----     ----CCCCTTTT     TTTTGGGG----
----TTTTGGGG     GGGGCCCC----     ----GGGGCCCC
