Sequence Alignment - Kirkwood Community College

Sequence Alignment

part 2

Dynamic programming with more realistic scoring scheme

• Using the same initial sequences, we’ll look at a dynamic programming example with a scoring scheme that rewards matches and penalizes both mismatches and gaps, as follows:

– Si,j = 2 if residues match

– Si,j = -1 if there is a mismatch at the current position

– w = -2 (gap penalty)

Initialization – same as before

• Example sequences:

– GAATTCAGTTA (sequence 1: M=11)

– GGATCGA (sequence 2: N=7)

G A A T T C A G T T A

0 0 0 0 0 0 0 0 0 0 0 0

G 0

G 0

A 0

T 0

C 0

G 0

A 0

Filling matrix (scoring)

• Recall the scoring scheme:

Mi,j (matrix position to be filled in) = maximum of these 3 terms:

• Mi-1,j-1 + Si,j (match/mismatch in diagonal), where:

o Si,j = 2 for match

o Si,j = -1 for mismatch

• Mi,j-1 + w (gap in sequence 1)

• Mi-1,j + w (gap in sequence 2)

– in either case, w = -2
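The three-term maximum can be sketched in Perl roughly as follows. This is an illustrative sketch rather than the course program: the names `substitution_score` and `cell_value` are hypothetical, and the constants are the scores used in this example (match 2, mismatch -1, gap -2).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# Scoring scheme from the example (assumed constants).
my $MATCH    = 2;
my $MISMATCH = -1;
my $GAP      = -2;    # w

# S(i,j): score for aligning two residues.
sub substitution_score {
    my ($x, $y) = @_;
    return $x eq $y ? $MATCH : $MISMATCH;
}

# M(i,j) = maximum of the three candidate values.
# $diag = M(i-1,j-1), $left = M(i-1,j), $up = M(i,j-1);
# $x and $y are the residues of sequence 1 and sequence 2 at this cell.
sub cell_value {
    my ($diag, $left, $up, $x, $y) = @_;
    return max(
        $diag + substitution_score($x, $y),    # match/mismatch
        $up   + $GAP,                          # gap in sequence 1
        $left + $GAP,                          # gap in sequence 2
    );
}

# M1,1: both sequences start with G and all three neighbors are 0.
print cell_value(0, 0, 0, 'G', 'G'), "\n";
```

Keeping the substitution score in its own subroutine means a different scoring scheme can be tried without touching the maximum itself.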


• Since both sequences start with G, the maximum value for M1,1 is 0 + 2 = 2 for the match


0 0 0 0 0 0 0 0 0 0 0 0

G 0 2

G 0

A 0

T 0

C 0

G 0

A 0


• Continuing down column 1, we see that the G in the first position of sequence 1 also matches the G in the second position of sequence 2; this means that M1,2 = MAX[M0,1 + 2, M1,1 - 2, M0,2 - 2] = MAX[2, 0, -2], which is 2

• We also add a backward pointing arrow at each position to show where the maximum score came from (see next slide)



0 0 0 0 0 0 0 0 0 0 0 0

G 0 2

G 0 2

A 0

T 0

C 0

G 0

A 0


• At M1,3 there is no match, so S1,3 = -1

• M1,3 = MAX[M0,2 - 1, M1,2 - 2, M0,3 - 2] = MAX[-1, 0, -2] = 0


0 0 0 0 0 0 0 0 0 0 0 0

G 0 2

G 0 2

A 0 0

T 0

C 0

G 0

A 0


• We can continue filling in column 1 using the same reasoning:


0 0 0 0 0 0 0 0 0 0 0 0

G 0 2

G 0 2

A 0 0

T 0 -1

C 0 -1

G 0 2

A 0 0


• We continue into column 2:


0 0 0 0 0 0 0 0 0 0 0 0

G 0 2 0

G 0 2 1

A 0 0 4

T 0 -1 2

C 0 -1 0

G 0 2 0

A 0 0 4


• At column 3, row 2 we encounter the following situation:

– M3,2 = MAX[M2,1 - 1, M3,1 - 2, M2,2 - 2] = MAX[-1, -3, -1] = -1

– Since there are 2 different ways we could reach the maximum score, we provide arrows back to both cells that could get us there:


0 0 0 0 0 0 0 0 0 0 0 0

G 0 2 0 -1

G 0 2 1 -1

A 0 0 4

T 0 -1 2

C 0 -1 0

G 0 2 0

A 0 0 4
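A program can record such ties by keeping every direction that reaches the maximum, not just one. The sketch below is illustrative only; the subroutine name `best_with_arrows` and the labels D (diagonal), H (left neighbor) and V (neighbor above) are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the best score together with every direction that achieves it.
# Arguments are the three candidate values, already including the
# match/mismatch score or gap penalty.
sub best_with_arrows {
    my ($diag, $left, $up) = @_;
    my %candidate = (D => $diag, H => $left, V => $up);
    my $best = $diag;
    for my $value ($left, $up) {
        $best = $value if $value > $best;
    }
    my @arrows = grep { $candidate{$_} == $best } qw(D H V);
    return ($best, @arrows);
}

# Candidates for M3,2: diagonal M2,1 - 1 = -1, left M2,2 - 2 = -1,
# above M3,1 - 2 = -3.
my ($score, @arrows) = best_with_arrows(-1, -1, -3);
print "score $score, arrows: @arrows\n";
```

Here the diagonal and left candidates tie, so both arrows are kept and either can be followed during traceback.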

Completed matrix


Rows: sequence 2 (GGATCGA); columns: sequence 1 (GAATTCAGTTA)

        G   A   A   T   T   C   A   G   T   T   A
    0   0   0   0   0   0   0   0   0   0   0   0
G   0   2   0  -1  -1  -1  -1  -1   2   0  -1  -1
G   0   2   1  -1  -2  -2  -2  -2   1   1  -1  -2
A   0   0   4   3   1  -1  -3   0  -1   0   0   1
T   0  -1   2   3   5   3   1  -1  -1   1   2   0
C   0  -1   0   1   3   4   5   3   1  -1   0   1
G   0   2   0  -1   1   2   3   4   5   3   1  -1
A   0   0   4   2   0   0   1   5   3   4   2   3
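The whole matrix can be reproduced with two nested loops. The following is a minimal sketch, assuming the scheme of this example (match 2, mismatch -1, gap -2, first row and column fixed at 0); the subroutine name `fill_matrix` and the row/column layout are choices made for the illustration, not taken from the course program.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

my ($match, $mismatch, $gap) = (2, -1, -2);

# Fill the matrix with the first row and column fixed at 0, as in the
# worked example. Cells are stored as $m[$j][$i]: row $j follows
# sequence 2, column $i follows sequence 1.
sub fill_matrix {
    my ($seq1, $seq2) = @_;
    my ($n1, $n2) = (length($seq1), length($seq2));
    my @m = map { [ (0) x ($n1 + 1) ] } 0 .. $n2;
    for my $j (1 .. $n2) {
        for my $i (1 .. $n1) {
            my $s = substr($seq1, $i - 1, 1) eq substr($seq2, $j - 1, 1)
                  ? $match : $mismatch;
            $m[$j][$i] = max(
                $m[$j - 1][$i - 1] + $s,    # diagonal neighbor
                $m[$j][$i - 1] + $gap,      # left neighbor
                $m[$j - 1][$i] + $gap,      # neighbor above
            );
        }
    }
    return \@m;
}

my $matrix = fill_matrix('GAATTCAGTTA', 'GGATCGA');
print join(' ', map { sprintf '%2d', $_ } @$_), "\n" for @$matrix;
print "Score in last cell: $matrix->[-1][-1]\n";
```

The value printed for the last cell is the score that the traceback starts from.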

Traceback

• The maximum global alignment score is 3, the value in the last row of the last column; the traceback step begins at that cell

• The traceback step is simplified by the presence of the arrows – we can follow them to get to the predecessor at each step

• Since some locations have multiple arrows, we may find multiple alignments, but there will be fewer than under the simple scoring scheme we used before

Second possible path

• Alternate path gives the following alignment:

GAATTCAGTTA
GGAT-C-G--A

Verifying the score(s)

• Recall our scoring scheme:

– match: +2

– mismatch: -1

– gap: -2

• Final overall score in table was 3, so alignments should add up to 3, given the above

• Calculations on next slide verify that they do

Verifying the score(s)

• First alignment:

GAATTCAGTTA
GGA-TC-G--A
+-+-++-+--+
21222222222

2-1+2-2+2+2-2+2-2-2+2 = 3

• Second alignment:

GAATTCAGTTA
GGAT-C-G--A
+-++-+-+--+
21222222222

2-1+2+2-2+2-2+2-2-2+2 = 3
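The same check can be done programmatically by walking down the columns of an alignment. This is a small sketch under the scoring scheme above; the subroutine name `alignment_score` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Score two aligned strings of equal length; '-' marks a gap.
sub alignment_score {
    my ($top, $bottom, $match, $mismatch, $gap) = @_;
    die "aligned strings differ in length\n"
        unless length($top) == length($bottom);
    my $score = 0;
    for my $k (0 .. length($top) - 1) {
        my $x = substr($top, $k, 1);
        my $y = substr($bottom, $k, 1);
        if    ($x eq '-' || $y eq '-') { $score += $gap }
        elsif ($x eq $y)               { $score += $match }
        else                           { $score += $mismatch }
    }
    return $score;
}

# Both alignments found by the traceback, scored with +2 / -1 / -2.
print alignment_score('GAATTCAGTTA', 'GGA-TC-G--A', 2, -1, -2), "\n";
print alignment_score('GAATTCAGTTA', 'GGAT-C-G--A', 2, -1, -2), "\n";
```

Both calls should print the score found in the last cell of the matrix.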

Global alignments: pros & cons

• What they’re good for:

– checking for minor differences between sequences

– comparing sequences that partly overlap

• What they’re not good for:

– discovering similarities between 2 sequences

– exploring similarities within a family of sequences (for this we do multiple alignment)

Programming global alignment

• Global alignment (GA) algorithms are based on dynamic programming, which is a recursive technique

• Recursion can be summarized as follows:

– nature of the problem: must be divisible into smaller subproblems

– begin by solving the smallest subproblems

– solutions to smallest problems are used in solutions to larger problems

– process continues until entire (largest) problem is solved

Detailed algorithm: step 1: build scoring matrix

1. Read in 2 sequences to be aligned: lines 7-25 of program

2. Obtain match, mismatch & gap scores (from user): lines 35-45

3. Create (m+1) x (n+1) matrix (see next slide): lines 47-54

4. Prepend a blank character to the front of both sequences (so that position 0 of each sequence contains a blank, and sequence indices are consistent with matrix indices): lines 30 & 31
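Steps 3 and 4 might look something like the sketch below. It is not the course program (whose line numbers are cited above); the variable names are illustrative, and storing the matrix as an array of rows, with one row per residue of sequence 2, is an assumption made here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq1 = 'GAATTCAGTTA';    # length m
my $seq2 = 'GGATCGA';        # length n
my ($m, $n) = (length($seq1), length($seq2));

# Step 3: build the matrix with one extra row and one extra column.
# In Perl a matrix is an array whose elements are references to arrays,
# so $matrix[$row][$col] addresses a single cell.
my @matrix;
for my $row (0 .. $n) {
    for my $col (0 .. $m) {
        $matrix[$row][$col] = 0;
    }
}

# Step 4: prepend a blank so that residue k of a sequence sits at
# string index k, matching the matrix indices.
$seq1 = ' ' . $seq1;
$seq2 = ' ' . $seq2;

print scalar(@matrix), " rows x ", scalar(@{ $matrix[0] }), " columns\n";
print "Residue 1 of sequence 1 is ", substr($seq1, 1, 1), "\n";
```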

Arrays in Perl

• Array: single variable that holds multiple scalar values

– individual elements are accessible via index (subscript)

– one-dimensional array: vector

– two-dimensional array: matrix

Array declaration & notation

• To distinguish a vector or matrix from a scalar, the starting character of the array name is @:

@array = ();    # declares empty array

• We refer to the entire array by @name; we reference individual elements using $name and subscript(s):

for ($i=0; $i<10; $i++)
{
    $array[$i] = '';
}
# initializes 10-element vector to blank strings

Example program

#!/usr/bin/perl

# declare array:
@array = ();

# initialize with empty strings:
for ($i=0; $i<10; $i++) {
    $array[$i] = '';
}

# prompt for & read in some data:
for ($i=0; $i<10; $i++) {
    print "Please enter a word or phrase:\n";
    $array[$i] = <STDIN>;
    chomp $array[$i];    # drop the newline that comes with the input
}

# print it all out backwards:
for ($i=9; $i>=0; $i--) {
    print $array[$i] . "\n";
}

Back to algorithm …

5. Initialize first row & first column of matrix by adding gap penalty to each successive cell: lines 56-63

6. Fill remaining cells (lines 66-99):

– compute three candidate values for each cell by adding gap penalty or match score (as appropriate) to value in appropriate neighboring cell (lines 80-87)

– compare the three values to determine maximum score (lines 89-97)
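Steps 5 and 6 can be sketched as follows. Note that step 5 initializes the borders with accumulating gap penalties, which differs from the all-zero borders of the worked example earlier. The subroutine name `global_matrix` and its argument order are assumptions for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# Step 5: first row and column hold 0, w, 2w, ...
# Step 6: every other cell is the maximum of three candidates.
# Cells are stored as $m[$j][$i], with $i running along sequence 1.
sub global_matrix {
    my ($seq1, $seq2, $match, $mismatch, $gap) = @_;
    my ($n1, $n2) = (length($seq1), length($seq2));
    my @m;
    $m[0][$_] = $_ * $gap for 0 .. $n1;    # first row
    $m[$_][0] = $_ * $gap for 0 .. $n2;    # first column
    for my $j (1 .. $n2) {
        for my $i (1 .. $n1) {
            my $s = substr($seq1, $i - 1, 1) eq substr($seq2, $j - 1, 1)
                  ? $match : $mismatch;
            my $diag = $m[$j - 1][$i - 1] + $s;
            my $left = $m[$j][$i - 1] + $gap;
            my $up   = $m[$j - 1][$i] + $gap;
            $m[$j][$i] = max($diag, $left, $up);
        }
    }
    return \@m;
}

# Tiny example: aligning GA with GA using +2 / -1 / -2.
my $small = global_matrix('GA', 'GA', 2, -1, -2);
print "@$_\n" for @$small;
```

With penalized borders, leading gaps cost as much as internal ones, so the last cell holds the score of a true end-to-end alignment.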

Algorithm continued

7. Develop directional string to facilitate traceback: lines 111-154

– unlike the human observer, a program cannot see directional arrows in the matrix

– instead, we create a string containing directional indicators to develop a traceback path:

• H indicates left neighbor (horizontal gap)

• D indicates diagonal neighbor (match/mismatch)

• V indicates above neighbor (vertical gap)
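One way to build such a directional string is to walk back from the last cell and test which neighbor could have produced each value. The sketch below assumes the zero-bordered matrix and scores of the worked example; `fill_matrix`, `direction_string` and the tie-breaking order (D, then H, then V) are choices made for this illustration and may differ from the course program.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

my ($match, $mismatch, $gap) = (2, -1, -2);

sub score { return $_[0] eq $_[1] ? $match : $mismatch }

# Fill the matrix as in the worked example (first row and column are 0).
sub fill_matrix {
    my ($seq1, $seq2) = @_;
    my ($n1, $n2) = (length($seq1), length($seq2));
    my @m = map { [ (0) x ($n1 + 1) ] } 0 .. $n2;
    for my $j (1 .. $n2) {
        for my $i (1 .. $n1) {
            $m[$j][$i] = max(
                $m[$j - 1][$i - 1]
                    + score(substr($seq1, $i - 1, 1), substr($seq2, $j - 1, 1)),
                $m[$j][$i - 1] + $gap,
                $m[$j - 1][$i] + $gap,
            );
        }
    }
    return \@m;
}

# Walk back from the last cell, appending D, H or V at each step.
# The first character therefore describes the right end of the alignment.
sub direction_string {
    my ($m, $seq1, $seq2) = @_;
    my ($i, $j) = (length($seq1), length($seq2));
    my $dirs = '';
    while ($i > 0 || $j > 0) {
        my $here = $m->[$j][$i];
        if ($i > 0 && $j > 0
            && $here == $m->[$j - 1][$i - 1]
                + score(substr($seq1, $i - 1, 1), substr($seq2, $j - 1, 1))) {
            $dirs .= 'D'; $i--; $j--;
        }
        elsif ($i > 0 && ($j == 0 || $here == $m->[$j][$i - 1] + $gap)) {
            $dirs .= 'H'; $i--;
        }
        else {
            $dirs .= 'V'; $j--;
        }
    }
    return $dirs;
}

my ($seq1, $seq2) = ('GAATTCAGTTA', 'GGATCGA');
my $dirs = direction_string(fill_matrix($seq1, $seq2), $seq1, $seq2);
print "$dirs\n";
```

Because ties are resolved in a fixed order, this sketch reports only one of the possible paths.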

Algorithm continued

8. Perform traceback (lines 156-201):

– Starting at the “right” end of each sequence, obtain the current character

– Read the first (leftmost) character of the directional string & align the retrieved sequence characters as directed

– Continue until you run out of directional characters
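The traceback loop described above might be sketched like this. The subroutine name `apply_directions` is hypothetical, and the directional string used in the example, DHHDHDDHDDD, is assumed to be the one that corresponds to the first alignment of the worked example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Consume the directional string from left to right while moving from
# the right end of each sequence toward the left.
sub apply_directions {
    my ($seq1, $seq2, $dirs) = @_;
    my ($i, $j) = (length($seq1), length($seq2));
    my ($top, $bottom) = ('', '');
    for my $d (split //, $dirs) {
        if ($d eq 'D') {           # align a residue from each sequence
            $top    = substr($seq1, --$i, 1) . $top;
            $bottom = substr($seq2, --$j, 1) . $bottom;
        }
        elsif ($d eq 'H') {        # residue of sequence 1 against a gap
            $top    = substr($seq1, --$i, 1) . $top;
            $bottom = '-' . $bottom;
        }
        else {                     # 'V': residue of sequence 2 against a gap
            $top    = '-' . $top;
            $bottom = substr($seq2, --$j, 1) . $bottom;
        }
    }
    return ($top, $bottom);
}

my ($top, $bottom) = apply_directions('GAATTCAGTTA', 'GGATCGA', 'DHHDHDDHDDD');
print "$top\n$bottom\n";
```

Each new column is prepended, so the aligned strings come out in left-to-right order even though the sequences are read from the right.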

Terminal gaps & semiglobal alignments

• Terminal gaps occur when you align 2 sequences that differ significantly in length; our global alignment algorithm doesn’t distinguish between these gaps and internal gaps, even though an alignment with only terminal gaps actually represents the optimal alignment

• For example, the three alignments below represent what the global alignment would consider optimal:

CGCTATAG    CGCTATAG    CGCTATAG
--CTA---    C--TA---    --C--TA-

• Eliminating the gap penalty for terminal gaps produces a semiglobal alignment
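The effect of free terminal gaps can be checked on the three alignments above. The sketch below assumes the earlier scores (match 2, mismatch -1, gap -2); the subroutine name `score_alignment` and its `$free_ends` flag are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Score an alignment. When $free_ends is true, gaps in a leading or
# trailing run (terminal gaps) are not penalized.
sub score_alignment {
    my ($top, $bottom, $free_ends) = @_;
    my ($match, $mismatch, $gap) = (2, -1, -2);    # assumed scores
    my $len = length($top);

    # Mark the columns that belong to a leading or trailing run of gaps.
    my %terminal;
    for my $s ($top, $bottom) {
        if ($s =~ /^(-+)/) { $terminal{$_} = 1 for 0 .. length($1) - 1 }
        if ($s =~ /(-+)$/) { $terminal{$_} = 1 for $len - length($1) .. $len - 1 }
    }

    my $score = 0;
    for my $k (0 .. $len - 1) {
        my ($x, $y) = (substr($top, $k, 1), substr($bottom, $k, 1));
        if ($x eq '-' || $y eq '-') {
            $score += $gap unless $free_ends && $terminal{$k};
        }
        elsif ($x eq $y) { $score += $match }
        else             { $score += $mismatch }
    }
    return $score;
}

for my $bottom ('--CTA---', 'C--TA---', '--C--TA-') {
    printf "%s  global %3d  semiglobal %3d\n", $bottom,
        score_alignment('CGCTATAG', $bottom, 0),
        score_alignment('CGCTATAG', $bottom, 1);
}
```

Under global scoring the three alignments tie; once terminal gaps are free, only the alignment whose gaps are all terminal keeps its full match score.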

Local alignment

• Many pairs of sequences will include regions of high similarity (conserved regions) interspersed with dissimilar regions

• A global alignment algorithm on such sequences will result in

– poor scores and/or

– many equally (un)likely alignments reported as optimal

• In such situations, a local alignment algorithm is preferable

Local alignment

• Uses for local alignment:

– compare distantly-related sequences that share a few non-connected regions in common

– analysis of repeated elements within a single sequence

• Smith-Waterman: the original local alignment algorithm

Smith-Waterman algorithm

• Uses scoring system similar to Needleman-Wunsch for building matrix:

Mi,j = maximum of:

• 0 or

• Mi-1,j-1 + Si,j or

• Mi-1,j + w or

• Mi,j-1 + w

• Note that the inclusion of 0 as a possible maximum eliminates negative values from the matrix

• FASTA format originated with FASTA algorithm – a fast (or “fasta”) approximation of Smith-Waterman
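The Smith-Waterman matrix fill differs from the global one only in the extra 0 candidate. Here is a minimal sketch, assuming the same match/mismatch/gap scores as before; the subroutine name `local_matrix` and the short query TTC are illustrative choices.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# Smith-Waterman fill: the three usual candidates plus 0 as a floor.
# Returns a reference to the matrix and the highest score found in it.
sub local_matrix {
    my ($seq1, $seq2, $match, $mismatch, $gap) = @_;
    my ($n1, $n2) = (length($seq1), length($seq2));
    my @m = map { [ (0) x ($n1 + 1) ] } 0 .. $n2;
    my $best = 0;
    for my $j (1 .. $n2) {
        for my $i (1 .. $n1) {
            my $s = substr($seq1, $i - 1, 1) eq substr($seq2, $j - 1, 1)
                  ? $match : $mismatch;
            $m[$j][$i] = max(
                0,
                $m[$j - 1][$i - 1] + $s,
                $m[$j - 1][$i] + $gap,
                $m[$j][$i - 1] + $gap,
            );
            $best = max($best, $m[$j][$i]);
        }
    }
    return (\@m, $best);
}

my ($sw, $best) = local_matrix('GAATTCAGTTA', 'TTC', 2, -1, -2);
print "@$_\n" for @$sw;
print "Best local score: $best\n";
```

In a local alignment the traceback would start from the cell holding the best score, wherever it lies, and stop when it reaches a 0.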

Online resource for local alignment

• BLAST: bl2seq

• Lalign: slower but more accurate than BLAST

– BLAST returns only best alignment between query & target

– Lalign returns as many as specified, ranked from best to worst

http://www.ch.embnet.org/software/LALIGN_form.html


Lalign output

• % identity within conserved regions

• length of alignment

• score:

– sum of gap/substitution penalties

– higher score = better alignment

• E-value

– better indicator of alignment quality

– lower = better

A look under the hood at BLAST

• BLAST algorithm splits query sequence into “hot spots” consisting of:

– “words”: short subsequences

– “neighboring words”: subsequences similar to words

• Sequence database is scanned for matches to these hot spots

• Identified matches used to extend hot spots

• Uses heuristics to identify best matches
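The word-splitting and scanning steps can be illustrated with a deliberately simplified sketch. It is not the BLAST implementation: it looks only for exact word matches, ignores neighboring words and the extension step, and the names `words` and `seed_hits` and the word lengths used are assumptions for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# All overlapping words of length $w in a sequence.
sub words {
    my ($seq, $w) = @_;
    return map { substr($seq, $_, $w) } 0 .. length($seq) - $w;
}

# Positions (0-based) in $target where some query word occurs exactly.
# Each hit is returned as [position, word].
sub seed_hits {
    my ($query, $target, $w) = @_;
    my %is_word = map { ($_ => 1) } words($query, $w);
    my @hits;
    for my $pos (0 .. length($target) - $w) {
        my $word = substr($target, $pos, $w);
        push @hits, [$pos, $word] if $is_word{$word};
    }
    return @hits;
}

print join(' ', words('GGATCGA', 3)), "\n";
for my $hit (seed_hits('GGATCGA', 'GAATTCAGTTA', 2)) {
    print "word $hit->[1] found at position $hit->[0]\n";
}
```

Storing the query words in a hash makes each lookup during the scan a constant-time test, which is the basic reason word-based seeding is fast.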

BLAST algorithm illustrated
