Sequence Alignment
part 2
Dynamic programming with more realistic scoring scheme
• Using the same initial sequences, we’ll look at a dynamic programming example with a scoring scheme that rewards matches and penalizes both mismatches and gaps, as follows:
– Si,j = 2 if residues match
– Si,j = -1 if there is a mismatch at the current position
– w = -2 (gap penalty)
Initialization – same as before
• Example sequences:
• GAATTCAGTTA (sequence 1: M=11)
• GGATCGA (sequence 2: N=7)
        G   A   A   T   T   C   A   G   T   T   A
    0   0   0   0   0   0   0   0   0   0   0   0
G   0
G   0
A   0
T   0
C   0
G   0
A   0
Filling matrix (scoring)
• Recall the scoring scheme:
Mi,j (matrix position to be filled in) = maximum of these 3 terms:
• Mi-1, j-1 + Si,j (match/mismatch in diagonal)
– where:
o Si,j = 2 for match
o Si,j = -1 for mismatch
• Mi, j-1 + w (gap in sequence 1)
• Mi-1,j + w (gap in sequence 2)
– in either case, w = -2
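The three-term maximum above can be sketched for a single cell as follows. This is a minimal Python illustration, not part of the course program; the function name `cell_score` and its parameter names are assumptions made for this example.

```python
MATCH, MISMATCH, GAP = 2, -1, -2  # scoring scheme from the slides


def cell_score(diagonal, above, left, residues_match):
    """Return the score of one cell from its three already-filled neighbours."""
    s = MATCH if residues_match else MISMATCH
    candidates = (
        diagonal + s,  # match/mismatch: arrive from the diagonal neighbour
        above + GAP,   # gap in sequence 1 (sequence 1 runs across the top)
        left + GAP,    # gap in sequence 2
    )
    return max(candidates)


# A match with all three neighbours at 0 gives max(2, -2, -2).
print(cell_score(0, 0, 0, residues_match=True))
# A mismatch with a 2 directly above gives max(-1, 0, -2).
print(cell_score(0, 2, 0, residues_match=False))
```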
Filling matrix (scoring)
• Since both sequences start with G, the maximum value for M1,1 is 0 + 2 = 2 for the match
        G   A   A   T   T   C   A   G   T   T   A
    0   0   0   0   0   0   0   0   0   0   0   0
G   0   2
G   0
A   0
T   0
C   0
G   0
A   0
Filling matrix (scoring)
• Continuing down column 1, we see that the G in the first position of sequence 1 also matches the G in the second position of sequence 2; this means that M1,2 will be the maximum of [2, 0, -2], which is 2
• We also add a backward-pointing arrow at each position to show where the maximum score came from (see next slide)
Filling matrix (scoring)
        G   A   A   T   T   C   A   G   T   T   A
    0   0   0   0   0   0   0   0   0   0   0   0
G   0   2
G   0   2
A   0
T   0
C   0
G   0
A   0
Filling matrix (scoring)
• At M1,3 there is no match, so S1,3 = -1
• M1,3 = MAX[M0,2-1, M1,2-2,M0,3-2] = MAX[-1,0,-2] = 0
        G   A   A   T   T   C   A   G   T   T   A
    0   0   0   0   0   0   0   0   0   0   0   0
G   0   2
G   0   2
A   0   0
T   0
C   0
G   0
A   0
Filling matrix (scoring)
• We can continue filling in column 1 using the same reasoning:
        G   A   A   T   T   C   A   G   T   T   A
    0   0   0   0   0   0   0   0   0   0   0   0
G   0   2
G   0   2
A   0   0
T   0  -1
C   0  -1
G   0   2
A   0   0
Filling matrix (scoring)
• We continue into column 2:
        G   A   A   T   T   C   A   G   T   T   A
    0   0   0   0   0   0   0   0   0   0   0   0
G   0   2   0
G   0   2   1
A   0   0   4
T   0  -1   2
C   0  -1   0
G   0   2   0
A   0   0   4
Filling matrix (scoring)
• At column 3, row 2 we encounter the following situation:
– M3,2 = MAX[M2,1-1, M3,1-2, M2,2-2] = MAX[-1,-3,-1] = -1
– Since there are 2 different ways we could reach the maximum score, we provide arrows back to both cells that could get us there:
        G   A   A   T   T   C   A   G   T   T   A
    0   0   0   0   0   0   0   0   0   0   0   0
G   0   2   0  -1
G   0   2   1  -1
A   0   0   4
T   0  -1   2
C   0  -1   0
G   0   2   0
A   0   0   4
Completed matrix
        G   A   A   T   T   C   A   G   T   T   A
    0   0   0   0   0   0   0   0   0   0   0   0
G   0   2   0  -1  -1  -1  -1  -1   2   0  -1  -1
G   0   2   1  -1  -2  -2  -2  -2   1   1  -1  -2
A   0   0   4   3   1  -1  -3   0  -1   0   0   1
T   0  -1   2   3   5   3   1  -1  -1   1   2   0
C   0  -1   0   1   3   4   5   3   1  -1   0   1
G   0   2   0  -1   1   2   3   4   5   3   1  -1
A   0   0   4   2   0   0   1   5   3   4   2   3
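The completed matrix can be reproduced with a short Python sketch. It assumes the zero-initialized first row and column shown on the slides and the +2 / -1 / -2 scoring scheme. The function name `fill_matrix` is an assumption made for this example, and the list is indexed `M[row][column]`, with sequence 1 across the top.

```python
def fill_matrix(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Fill the scoring matrix; seq1 runs across the top, seq2 down the side.

    The first row and column stay at 0, as in the slides' initialization.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    M = [[0] * cols for _ in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if seq2[r - 1] == seq1[c - 1] else mismatch
            M[r][c] = max(
                M[r - 1][c - 1] + s,  # diagonal neighbour: match/mismatch
                M[r - 1][c] + gap,    # neighbour above: gap in sequence 1
                M[r][c - 1] + gap,    # neighbour to the left: gap in sequence 2
            )
    return M


M = fill_matrix("GAATTCAGTTA", "GGATCGA")
for label, row in zip(" GGATCGA", M):
    print(label, " ".join(f"{value:3d}" for value in row))
```

The bottom-right value, `M[-1][-1]`, is the overall alignment score used as the starting point of the traceback.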
Traceback
• The maximum global alignment score is 3, the value in the last row of the last column; the traceback step begins here
• The traceback step is simplified by the presence of the arrows – we can follow them to get to the predecessor at each step
• Since some locations have multiple arrows, we may find multiple alignments, but there will be fewer than under the simple scoring scheme we used before
Second possible path
• Alternate path gives the following alignment:
GAATTCAGTTA
GGAT-C-G--A
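Following every arrow, including both arrows wherever two predecessors tie, can be sketched in Python as below. The sketch records the arrows while filling the zero-initialized matrix and then walks back from the bottom-right cell. The names `all_alignments` and `walk`, and the arrow labels "diag", "up" and "left", are assumptions made for this example.

```python
def all_alignments(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return (score, alignments) for every traceback path from the last cell."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    M = [[0] * cols for _ in range(rows)]
    arrows = [[[] for _ in range(cols)] for _ in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if seq2[r - 1] == seq1[c - 1] else mismatch
            options = {
                "diag": M[r - 1][c - 1] + s,
                "up": M[r - 1][c] + gap,
                "left": M[r][c - 1] + gap,
            }
            M[r][c] = max(options.values())
            # Keep one arrow per neighbour that yields the maximum.
            arrows[r][c] = [d for d, v in options.items() if v == M[r][c]]

    results = []

    def walk(r, c, top, bottom):
        if r == 0 and c == 0:
            # The strings were built right to left, so reverse them.
            results.append((top[::-1], bottom[::-1]))
            return
        if r > 0 and c > 0:
            moves = arrows[r][c]
        else:
            # On the border only one direction leads back to the origin.
            moves = ["left"] if c > 0 else ["up"]
        for move in moves:
            if move == "diag":
                walk(r - 1, c - 1, top + seq1[c - 1], bottom + seq2[r - 1])
            elif move == "left":
                walk(r, c - 1, top + seq1[c - 1], bottom + "-")
            else:
                walk(r - 1, c, top + "-", bottom + seq2[r - 1])

    walk(rows - 1, cols - 1, "", "")
    return M[-1][-1], results


score, alignments = all_alignments("GAATTCAGTTA", "GGATCGA")
print(score)
for top, bottom in alignments:
    print(top)
    print(bottom)
    print()
```

For the example sequences the only tie on the path is at the second T of sequence 1, which is why exactly two alignments come out.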
Verifying the score(s)
• Recall our scoring scheme:
– match: +2
– mismatch: -1
– gap: -2
• Final overall score in table was 3, so alignments should add up to 3, given the above
• Calculations on next slide verify that they do
Verifying the score(s)
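• First alignment:
GAATTCAGTTA
GGA-TC-G--A
+-+-++-+--+   (sign of each column's score)
21222222222   (magnitude of each column's score)
2-1+2-2+2+2-2+2-2-2+2 = 3
• Second alignment:
GAATTCAGTTA
GGAT-C-G--A
+-++-+-+--+   (sign of each column's score)
21222222222   (magnitude of each column's score)
2-1+2+2-2+2-2+2-2-2+2 = 3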
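The same column-by-column check can be done mechanically. The Python sketch below sums +2 for each match, -1 for each mismatch and -2 for each gap column; the function name `alignment_score` is an assumption made for this example.

```python
def alignment_score(top, bottom, match=2, mismatch=-1, gap=-2):
    """Sum the per-column scores of two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # a gap in either sequence
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


print(alignment_score("GAATTCAGTTA", "GGA-TC-G--A"))
print(alignment_score("GAATTCAGTTA", "GGAT-C-G--A"))
```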
Global alignments: pros & cons
• What they’re good for:
– checking for minor differences between sequences
– comparing sequences that partly overlap
• What they’re not good for:
– discovering isolated regions of similarity between 2 otherwise dissimilar sequences (for this we do local alignment)
– exploring similarities within a family of sequences (for this we do multiple alignment)
Programming global alignment
• Global alignment (GA) algorithms are based on dynamic programming, which is a recursive technique
• Recursion can be summarized as follows:
– nature of the problem: must be divisible into smaller subproblems
– begin by solving the smallest subproblems
– solutions to smallest problems are used in solutions to larger problems
– process continues until entire (largest) problem is solved
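The subproblem structure can be made explicit with a memoized recursion: the score for two prefixes depends only on the scores for shorter prefixes. This Python sketch works top-down, the mirror image of the bottom-up matrix fill, and caches each subproblem so it is solved once. It assumes the zero border of the worked example, and the names `global_score` and `score` are assumptions made for this example.

```python
from functools import lru_cache


def global_score(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Score of the best alignment, computed by memoized recursion."""

    @lru_cache(maxsize=None)
    def score(i, j):
        """Best score for aligning seq1[:i] with seq2[:j]."""
        if i == 0 or j == 0:
            return 0  # smallest subproblems: the zero border of the matrix
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        return max(
            score(i - 1, j - 1) + s,  # align the two last residues
            score(i, j - 1) + gap,    # gap in sequence 1
            score(i - 1, j) + gap,    # gap in sequence 2
        )

    return score(len(seq1), len(seq2))


print(global_score("GAATTCAGTTA", "GGATCGA"))
```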
Detailed algorithm: step 1: build scoring matrix
1. Read in 2 sequences to be aligned: lines 7-25 of program
2. Obtain match, mismatch & gap scores (from user): lines 35-45
3. Create an (m+1) x (n+1) matrix (see next slide): lines 47-54
4. Prepend a blank character to the front of both sequences (so that position 0 of each sequence contains a blank, and sequence indices are consistent with matrix indices): lines 30 & 31
Arrays in Perl
• Array: single variable that holds multiple scalar values
– individual elements are accessible via index (subscript)
– one-dimensional array: vector
– two-dimensional array: matrix
Array declaration & notation
• To distinguish a vector or matrix from a scalar, the starting character of the array name is @:
@array = ();
# declares empty array
• We refer to the entire array by @name; we reference individual elements using $name and subscript(s):
for ($i=0; $i<10; $i++)
{
    $array[$i] = '';
}
# initializes 10-element vector to blank strings
Example program
#!/usr/bin/perl
# declare array:
@array = ();
# initialize with empty strings:
for ($i=0; $i<10; $i++){
    $array[$i]='';
}
# prompt for & read in some data:
for ($i=0; $i<10; $i++){
    print "Please enter a word or phrase:\n";
    $array[$i] = <STDIN>;
}
# print it all out backwards:
for ($i=9; $i>=0; $i--){
    print $array[$i] . "\n";
}
Back to algorithm …
5. Initialize first row & first column of matrix by adding gap penalty to each successive cell: lines 56-63
6. Fill remaining cells (lines 66-99):
– compute three candidate values for each cell by adding gap penalty or match score (as appropriate) to value in appropriate neighboring cell (lines 80-87)
– compare the three values to determine maximum score (lines 89-97)
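Steps 3 through 6 can be sketched as follows. This is a Python illustration of the steps as described, not the course's Perl program, so the line numbers above do not apply to it. The function name `build_matrix` is an assumption made for this example. Unlike the worked example, the first row and column here accumulate the gap penalty, as step 5 states.

```python
def build_matrix(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Build and fill the scoring matrix with gap-penalized borders."""
    # Step 4: prepend a blank so sequence indices line up with matrix indices.
    seq1, seq2 = " " + seq1, " " + seq2
    rows, cols = len(seq2), len(seq1)

    # Step 3: one row per character of seq2, one column per character of seq1
    # (each sequence now includes its leading blank).
    M = [[0] * cols for _ in range(rows)]

    # Step 5: each successive border cell adds one more gap penalty.
    for c in range(1, cols):
        M[0][c] = M[0][c - 1] + gap
    for r in range(1, rows):
        M[r][0] = M[r - 1][0] + gap

    # Step 6: compute the three candidates for each cell and keep the maximum.
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if seq2[r] == seq1[c] else mismatch
            diagonal = M[r - 1][c - 1] + s
            vertical = M[r - 1][c] + gap
            horizontal = M[r][c - 1] + gap
            M[r][c] = max(diagonal, vertical, horizontal)
    return M


M = build_matrix("GAATTCAGTTA", "GGATCGA")
print(M[0])       # first row: 0, -2, -4, ...
print(M[-1][-1])  # overall alignment score
```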
Algorithm continued
7. Develop directional string to facilitate traceback: lines 111-154
– unlike the human observer, a program cannot see directional arrows in the matrix
– instead, we create a string containing directional indicators to develop a traceback path:
• H indicates left neighbor (horizontal gap)
• D indicates diagonal neighbor (match/mismatch)
• V indicates above neighbor (vertical gap)
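One way to derive such a directional string is to start at the bottom-right cell and, at each step, ask which neighbour could have produced the current value. The Python sketch below does this for the completed matrix of the worked example, typed in as a literal. The function name `direction_string` and the tie-breaking order (D, then H, then V) are assumptions made for this example; the course program may resolve ties differently.

```python
SEQ1, SEQ2 = "GAATTCAGTTA", "GGATCGA"
# Completed matrix from the worked example (rows follow SEQ2, columns SEQ1).
MATRIX = [
    [0,  0, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
    [0,  2, 0, -1, -1, -1, -1, -1,  2,  0, -1, -1],
    [0,  2, 1, -1, -2, -2, -2, -2,  1,  1, -1, -2],
    [0,  0, 4,  3,  1, -1, -3,  0, -1,  0,  0,  1],
    [0, -1, 2,  3,  5,  3,  1, -1, -1,  1,  2,  0],
    [0, -1, 0,  1,  3,  4,  5,  3,  1, -1,  0,  1],
    [0,  2, 0, -1,  1,  2,  3,  4,  5,  3,  1, -1],
    [0,  0, 4,  2,  0,  0,  1,  5,  3,  4,  2,  3],
]


def direction_string(M, seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return the traceback directions, last cell first."""
    r, c = len(seq2), len(seq1)
    path = []
    while r > 0 or c > 0:
        if r > 0 and c > 0:
            s = match if seq2[r - 1] == seq1[c - 1] else mismatch
            if M[r][c] == M[r - 1][c - 1] + s:
                step = "D"  # diagonal neighbour (match/mismatch)
            elif M[r][c] == M[r][c - 1] + gap:
                step = "H"  # left neighbour (horizontal gap)
            else:
                step = "V"  # above neighbour (vertical gap)
        else:
            # On the border, keep moving toward the top-left corner.
            step = "H" if c > 0 else "V"
        path.append(step)
        if step in "DV":
            r -= 1
        if step in "DH":
            c -= 1
    return "".join(path)


print(direction_string(MATRIX, SEQ1, SEQ2))
```

The leftmost character of the result describes the move out of the bottom-right cell, so the string is read in the same order as the traceback proceeds.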
Algorithm continued
8. Perform traceback (lines 156-201):
– Starting at the “right” end of each sequence, obtain the current character
– Read the first (leftmost) character of the directional string & align the retrieved sequence characters as directed
– Continue until you run out of directional characters
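The traceback loop can be sketched in Python as below. It assumes sequence 1 runs across the top of the matrix, so that H consumes a character of sequence 1 and puts a gap in sequence 2, while V does the reverse. The function name `traceback` is an assumption made for this example, and the directional string in the usage lines was derived by hand from the first alignment of the worked example.

```python
def traceback(seq1, seq2, directions):
    """Rebuild an alignment by consuming a directional string left to right."""
    i, j = len(seq1) - 1, len(seq2) - 1  # start at the right end of each sequence
    top, bottom = [], []
    for d in directions:
        if d == "D":    # align the two current characters
            top.append(seq1[i])
            bottom.append(seq2[j])
            i -= 1
            j -= 1
        elif d == "H":  # horizontal move: gap in sequence 2
            top.append(seq1[i])
            bottom.append("-")
            i -= 1
        elif d == "V":  # vertical move: gap in sequence 1
            top.append("-")
            bottom.append(seq2[j])
            j -= 1
        else:
            raise ValueError(f"unknown direction: {d!r}")
    # Characters were collected right to left, so reverse before joining.
    return "".join(reversed(top)), "".join(reversed(bottom))


top, bottom = traceback("GAATTCAGTTA", "GGATCGA", "DHHDHDDHDDD")
print(top)
print(bottom)
```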
Terminal gaps & semiglobal alignments
• Terminal gaps occur when you align 2 sequences that differ significantly in length; our global alignment algorithm doesn’t distinguish between these gaps and internal gaps, even though an alignment with only terminal gaps actually represents the optimal alignment
• For example, the global alignment algorithm scores all three alignments below equally (3 matches and 5 gaps each), even though only the first confines its gaps to the ends:
CGCTATAG   CGCTATAG   CGCTATAG
--CTA---   C--TA---   --C--TA-
• Eliminating the gap penalty for terminal gaps produces a semiglobal alignment
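The effect of dropping the terminal gap penalty can be checked on the three alignments of CGCTATAG with CTA. The Python sketch below scores a finished alignment and, when asked, skips gap columns at either end. The function name `score_alignment` and the flag `free_terminal_gaps` are assumptions made for this example, and the +2 / -1 / -2 scheme is carried over from the earlier worked example.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, gap=-2,
                    free_terminal_gaps=False):
    """Score an alignment, optionally ignoring gap columns at both ends."""
    columns = list(zip(top, bottom))
    first, last = 0, len(columns)
    if free_terminal_gaps:
        # Skip leading and trailing columns that contain a gap.
        while first < last and "-" in columns[first]:
            first += 1
        while last > first and "-" in columns[last - 1]:
            last -= 1
    total = 0
    for a, b in columns[first:last]:
        if a == "-" or b == "-":
            total += gap       # internal gap (or any gap when not semiglobal)
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


for bottom in ("--CTA---", "C--TA---", "--C--TA-"):
    print(bottom,
          score_alignment("CGCTATAG", bottom),
          score_alignment("CGCTATAG", bottom, free_terminal_gaps=True))
```

Under global scoring the three alignments tie; with terminal gaps free, only the first keeps all of its match score, because the other two still pay for internal gaps.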
Local alignment
• Many pairs of sequences will include regions of high similarity (conserved regions) interspersed with dissimilar regions
• A global alignment algorithm on such sequences will result in
– poor scores and/or
– many equally (un)likely alignments reported as optimal
• In such situations, a local alignment algorithm is preferable
Local alignment
• Uses for local alignment:
– compare distantly-related sequences that share a few non-connected regions in common
– analysis of repeated elements within a single sequence
• Smith-Waterman: the original local alignment algorithm
Smith-Waterman algorithm
• Uses scoring system similar to Needleman-Wunsch for building matrix:
Mi,j = maximum of:
• 0 or
• Mi-1, j-1 + Si,j or
• Mi-1, j + w or
• Mi, j-1 + w
• Note that the inclusion of 0 as a possible maximum eliminates negative values from the matrix
• The FASTA file format originated with the FASTA program, a fast heuristic approximation of Smith-Waterman
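The Smith-Waterman recurrence differs from the global fill only by the extra 0 in the maximum. A minimal Python sketch, assuming the same +2 / -1 / -2 scoring scheme used earlier, is shown below. The function names and the two short test sequences are assumptions made for this example.

```python
def smith_waterman_matrix(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Fill a local-alignment matrix; no cell can drop below 0."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    M = [[0] * cols for _ in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if seq2[r - 1] == seq1[c - 1] else mismatch
            M[r][c] = max(
                0,                    # start a fresh local alignment here
                M[r - 1][c - 1] + s,  # match/mismatch
                M[r - 1][c] + gap,    # gap in sequence 1
                M[r][c - 1] + gap,    # gap in sequence 2
            )
    return M


def best_local_score(M):
    """The best local alignment ends at the largest value in the matrix."""
    return max(max(row) for row in M)


# Two sequences that share only the conserved region GATC.
M = smith_waterman_matrix("AAGATCAA", "TTGATCTT")
print(best_local_score(M))
```

Because the best score can sit anywhere in the matrix, a local traceback starts from that maximum cell and stops when it reaches a 0, rather than running corner to corner.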
Online resource for local alignment
• BLAST: bl2seq
• Lalign: slower but more accurate than BLAST
– BLAST returns only best alignment between query & target
– Lalign returns as many as specified, ranked from best to worst
http://www.ch.embnet.org/software/LALIGN_form.html
Lalign output
• % identity within conserved regions
• length of alignment
• score:
– sum of substitution scores and gap penalties
– higher score = better alignment
• E-value
– better indicator of alignment quality
– lower = better
A look under the hood at BLAST
• BLAST algorithm splits query sequence into “hot spots” consisting of:
– “words”: short subsequences
– “neighboring words”: subsequences similar to words
• Sequence database is scanned for matches to these hot spots
• Identified matches used to extend hot spots
• Uses heuristics to identify best matches
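The word-and-extend idea can be illustrated with a deliberately simplified Python sketch. It is not the BLAST implementation: it uses exact word matches only (no "neighboring words"), extends a hit only while residues are identical instead of using scored extension, and ignores statistics entirely. The function names, the word length of 3 and the two short sequences are all assumptions made for this example.

```python
def words(query, w=3):
    """List every (position, word) pair of length w in the query."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def seed_hits(query, target, w=3):
    """Scan the target for exact matches to the query's words."""
    index = {}
    for pos, word in words(query, w):
        index.setdefault(word, []).append(pos)
    hits = []
    for t in range(len(target) - w + 1):
        for q in index.get(target[t:t + w], []):
            hits.append((q, t))  # (query position, target position)
    return hits


def extend_hit(query, target, q, t, w=3):
    """Grow a word hit in both directions while the residues stay identical."""
    start_q, start_t = q, t
    while start_q > 0 and start_t > 0 and query[start_q - 1] == target[start_t - 1]:
        start_q -= 1
        start_t -= 1
    end_q, end_t = q + w, t + w
    while end_q < len(query) and end_t < len(target) and query[end_q] == target[end_t]:
        end_q += 1
        end_t += 1
    return query[start_q:end_q]


query, target = "ACGTTGCA", "TTCGTTGA"
hits = seed_hits(query, target)
print(hits)
print(extend_hit(query, target, *hits[0]))
```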
BLAST algorithm illustrated