Upload
hailee-harrill
View
216
Download
3
Tags:
Embed Size (px)
Citation preview
Pairwise Sequence Alignment
Sushmita RoyBMI/CS 576
www.biostat.wisc.edu/bmi576/Sushmita Roy
[email protected] 10th, 2013
BMI/CS 576
Key concepts in today’s class
• What is pairwise sequence alignment?• How to handle gaps?
– Gap penalties
• What are the issues in sequence alignment?• What are the different types of alignment problems
– Global alignment– Local alignment
• Dynamic programming solution to finding the best alignment• Given a scoring function, find the best alignment of two
sequences– Global– Local
What is sequence alignment
The task of locating equivalent regions of two or more sequences to maximize their overall similarity
Why sequence alignment?
• Identifying similarity between two sequences
• Sequence specifies the function of a protein
• Similarity in sequence can imply similarity in function.– Assign function to uncharacterized sequences based on characterized
sequences
• Sequence from different species can be compared to estimate the evolutionary relationships between species – We will come back to this in Phylogenetic trees.
Alignment example
T H I S S E Q U E N C E
T H A T S E Q U E N C E
How to compare these two sequences?
T H I S S E Q U E N C E
T H A T I S A S E Q U E N C E
Need to incorporate gaps
_ _ _T H I S S E Q U E N C E
T H A T I S A S E Q U E N C E
T H I S _ _ _ S E Q U E N C E
T H A T I S A S E Q U E N C E
Alignment 1: 3 gaps, 8 matches Alignment 2: 3 gaps, 9 matches
Issues in sequence alignment
• What type of alignment?– Align the entire sequence or part of it?
• How to score an alignment?– the sequences we’re comparing typically differ in length– some amino acid pairs are more substitutable than others
• How to find the alignment?– Algorithms for alignment
• How to tell if the alignment is biologically meaningful?– Computing significance of an alignment
Pairwise alignment: task definition
Given– a pair of sequences (DNA or protein)– a scoring function
Do– determine the common substrings in the sequences
such that the similarity score is maximized
Sequence can change through different mutations
• substitutions: ACGA AGGA
• insertions: ACGA ACGGA
• deletions: ACGA AGAGaps
Scoring function
• Scoring function comprises– Substitution matrix s(a,b): indicates the score of aligning a with b– Gap penalty function γ(g) where g is the gap length
• DNA has a simple scoring function– Consider match and mismatch
• Proteins have a more complex scoring function– Some amino acid pairs might be more substitutable versus others– Scores might capture
• similar physical and chemical properties• evolutionary relationships• More next class
Gap penalty function
• Different gap penalty functions require somewhat different scoring algorithms
• Linear score
– γ(g)=-dg, where d is a fixed cost, g is the gap length
– Longer the gap higher the penalty
• Affine score
– γ(g)=-(d +(g-1)e), d and e are fixed penalties and e<d.
– d : gap open penalty
– e : gap extension penalty
Types of pairwise sequence alignment problems
• Global alignment (Needleman-Wunsch algorithm)– Align the entire sequence
• Local alignment (Smith-Waterman algorithm)– Align part of the sequence
Global alignment
• Given two sequences u and v and a scoring function, find the alignment with the maximal score.
• The number of possible alignments is very big.
Number of possible alignments
CAATGA ATTGATSequence 1 Sequence 2
CAATGA__ATTGAT
CAATGAATTGAT
CAATGA_AT_TGAT
Alignment 1 Alignment 2 Alignment 3
Number of possible alignments
• There are
nn
n
n
n n
π
2
2
2)!()!2(2
≈=⎟⎟⎠
⎞⎜⎜⎝
⎛
possible global alignments for 2 sequences of length n
• E.g. two sequences of length 100 have possible alignments
• But we can use dynamic programming to find an optimal alignment efficiently
7710≈
Dynamic programming (DP)
• Divide and conquer
• Solve a problem using pre-computed solutions for subparts of the problem
Dynamic programming for sequence alignment
• Two sequences: AAAC, AAG• x: AAAC
– xi: Denotes the ith letter
• y: AAG– yi denotes the jth letter in y.
• Assume cost of gap is d.• Assume we have aligned x1..xi to y1..yj
• There are three possibilities
AAxi
AAyj
AAAAAG
AA_AAG
AAAAA_
Dynamic programming idea
• Let x’s length be n• Let y’s length be m• construct an (n+1) (m+1) matrix F• F ( j, i ) = score of the best alignment of x1…xi with y1…yj
y
x
A
A
CA G
A
C
score of best alignment ofAAA to AG
DP for global alignment with linear gap penalty
€
F( j,i) = max
F( j −1,i −1) +s(x i,y j )
F( j −1,i) − d
F( j,i −1) − d
⎧
⎨ ⎪
⎩ ⎪
F(j,i)
F(j-1,i-1) F(j-1,i)
F(j,i-1)-d
-ds(xi,yj)
DP recurrence relation
Score of the best partial alignment between x1..xi and y1.. yj
DP algorithm sketch: global alignment
• Initialize first row and column of matrix
• Fill in rest of matrix from top to bottom, left to right
• For each F(i, j), save pointer(s) to cell(s) that resulted in best score
• F(m, n) holds the optimal alignment score;
• Trace back from F(m, n) to F(0, 0) to recover alignment
Global alignment example
• Suppose we choose the following scoring scheme
d = 2, where d is the gap penalty
Initializing matrix: global alignment with linear gap penalty
A -d
A -2d
CA G
A -3d
C -4d
0 -3d-d -2d
Global alignment example
A -2
A -4
CA G
A -6
C -8
0 -6-2 -4
1 -3-1
0 -2
-2 -1
-5 -4 -1
-1
-3
Trace back to retrieve the alignment
A -2
A -4
CA G
A -6
C -8
0 -6-2 -4
1 -3-1
0 -2
-2 -1
-5 -4 -1
-1
-3
x:
y:
one optimal alignment
C
C
A
_
A
G
A
A
Computational complexity
• initialization: O(m), O(n) where sequence lengths are m, n• filling in rest of matrix: O(mn)• traceback: O(m + n)• Since sequences have nearly same length, the computational
complexity is
)( 2nO
Summary of DP
• Maximize F(j,i) using F(j-1,i-1), F(j-1,i) or F(j,i-1)• works for either DNA or protein sequences, although the
substitution matrices used differ• finds an optimal alignment• the exact algorithm (and computational complexity) depends
on gap penalty function
Local alignment
• Look for local regions sequence similarity.– Aligning substrings of x and y.
• Try to find the best alignment over all possible substrings.
• This seems difficult computationally.– But can be solved by DP.
Local alignment motivation
• Comparing protein sequences that share a domain (independently folded unit) but differ elsewhere
• Comparing DNA sequences that share a similar sequence pattern but differ elsewhere
• More sensitive when comparing highly diverged sequences– Selection on the genome differs between regions– Unconstrained regions are not very alignable
Local alignment DP
Similar to global alignment with a few exceptions• DP recurrence is different
• Top and left row are initialized to 0s.
€
F( j,i) = max
F( j −1,i −1) +s(x i, y j )
F( j −1,i) − d
F( j,i −1) − d
0
⎧
⎨ ⎪ ⎪
⎩ ⎪ ⎪
New in local: starts a new alignment
Local alignment DP algorithm
• Initialization: first row and first column initialized with 0’s
• Traceback:– Find maximum value of F(j, i)
• can be anywhere in matrix
– Stop when we get to a cell with value 0
Local alignment example
0
0
00 00
00 00
0
T
T
A
A
G
0
0
0
0
0
0 0
G
0
A
0
A
0
A
1
0
1
1 2
3
1
1
x:
y:
G
G
A
A
A
A
1