Sequence Alignment II, Lecture #3. This class has been edited from Nir Friedman’s lecture. Changes made by Dan Geiger, then by Shlomo Moran.


Background Readings: Chapters 2.5, 2.7 in the textbook Biological Sequence Analysis, Durbin et al., 2001. Chapters 3.5.1-3.5.3, 3.6.2 in Introduction to Computational Molecular Biology, Setubal and Meidanis, 1997.

Winter 2004: 1st lecture up to 19 (BLAST), 2nd up to 35, in an easy manner. Looks like the right pace. Winter 2005: again in a very easy manner up to 34, and the time was tight. Thus all slides beyond 35 were removed.

2

Reminder

Last class we discussed dynamic programming algorithms for

• global alignment
• local alignment

(In the tutorial, affine gap scores were incorporated.) All of these assumed a scoring rule

$\sigma : (\Sigma \cup \{-\}) \times (\Sigma \cup \{-\}) \to \mathbb{R}$

that determines the quality of perfect matches, substitutions, insertions, and deletions.

3

Alignment in Real Life

One of the major uses of alignments is to find sequences in a “database.”

The current protein database contains about $10^8$ residues! So searching with a target sequence of length $10^3$ requires evaluating about $10^{11}$ matrix cells, which takes about three hours at a rate of 10 million evaluations per second.

Quite annoying when, say, one thousand target sequences need to be searched because it will take about four months to run.
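The estimates above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the numbers are the round figures quoted on this slide.

```python
# Round figures quoted on the slide (variable names are illustrative).
db_residues = 10**8          # residues in the protein database
query_length = 10**3         # length of one target sequence
rate = 10**7                 # matrix cell evaluations per second

cells = db_residues * query_length           # full DP matrix: 10**11 cells
seconds_per_query = cells / rate             # 10**4 seconds
hours_per_query = seconds_per_query / 3600   # about 2.8 hours
days_for_1000 = 1000 * seconds_per_query / 86400  # about 116 days

print(f"{cells:.0e} cells, {hours_per_query:.2f} hours per query")
print(f"{days_for_1000:.1f} days for 1000 queries")
```

About 116 days is a little under four months, which matches the estimate on the slide.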

4

Heuristic Search

Instead, most searches rely on heuristic procedures.

• These are not guaranteed to find the best match
• Sometimes, they will completely miss a high-scoring match

We now describe the main ideas used by the best known(?) of these heuristic procedures.

5

Basic Intuition

Almost all heuristic search procedures are based on the observation that real-life matches often contain long strings with gap-less matches.

These heuristics try to find significant gap-less matches and then extend them.

6

Banded DP

Suppose that we have two strings s[1..n] and t[1..m] such that n ≈ m.

If the optimal alignment of s and t has few gaps, then the path of the alignment will be close to the diagonal.

[Figure: DP matrix with axes s and t.]

7

Banded DP

To find such a path, it suffices to search in a diagonal band of the matrix.

If the diagonal band consists of k diagonals (width k), then dynamic programming takes O(kn).

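Much faster than the $O(n^2)$ of standard DP.

[Figure: DP matrix with axes s and t and a band of width k. At the edge of the band, the cells V[i, i+k/2], V[i+1, i+k/2], and V[i+1, i+k/2+1] are marked, and the neighboring cell outside the band is labeled "Out of range".]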

Note that for diagonals, i-j = constant.
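A minimal sketch of banded DP for global alignment, assuming the two strings have similar lengths. The function name `banded_global_score`, the parameter `half_width` (so the band has k = 2*half_width + 1 diagonals), and the match/mismatch/gap values are illustrative assumptions, not taken from the lecture.

```python
def banded_global_score(s, t, half_width, match=1, mismatch=-1, gap=-2):
    """Global alignment score restricted to cells with |i - j| <= half_width."""
    n, m = len(s), len(t)
    if abs(n - m) > half_width:
        raise ValueError("band too narrow to reach the corner cell (n, m)")
    NEG = float("-inf")  # stands for cells that are out of range
    # prev maps column j to the best score of aligning s[:i-1] with t[:j].
    prev = {j: j * gap for j in range(0, min(m, half_width) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - half_width), min(m, i + half_width) + 1):
            if j == 0:
                cur[j] = i * gap  # s[:i] aligned against gaps only
                continue
            sub = match if s[i - 1] == t[j - 1] else mismatch
            best = prev.get(j - 1, NEG) + sub          # diagonal move
            best = max(best, prev.get(j, NEG) + gap)   # s[i] against a gap
            best = max(best, cur.get(j - 1, NEG) + gap)  # t[j] against a gap
            cur[j] = best
        prev = cur
    return prev[m]


print(banded_global_score("ACGT", "ACGT", 1))  # four matches
print(banded_global_score("ACGT", "AGT", 1))   # three matches and one gap
```

Each row touches at most 2*half_width + 1 cells, so the running time is O(kn) rather than O(nm). Cells outside the band are simply treated as unreachable.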

8

Banded DP for local alignment

Problem: Where is the banded diagonal? It need not be the main diagonal when looking for a good local alignment (or when the lengths of s and t are different).

How do we select which subsequences to align using banded DP?

[Figure: DP matrix with axes s and t and a band of width k.]

We heuristically find potential diagonals and evaluate them using Banded DP.

This is the main idea of FASTA.

9

Overview of FASTA

Input: strings s and t, and a parameter ktup
Output: A highly scored local alignment

1. Find pairs of matching substrings s[i..i+ktup-1] = t[j..j+ktup-1]
2. Extend to ungapped diagonals
3. Extend to gapped matches using banded DP

10

Finding Potential Diagonals

Suppose there exists a relatively long gap-less local alignment

S=****AGCGCCATGGATTGAGCGA*
T=**TGCGACATTGATCGACCTA**

Each gap-less local alignment defines a potential diagonal: if the first sequence starts at location i (e.g., 5 above) and the second starts at location j (e.g., 3 above), then the potential diagonal starts at location (i,j).

Can we identify potential diagonals quickly?

Such diagonals can then be evaluated using Banded DP.

[Figure: DP matrix with axes s and t; position i is marked on s and position j on t.]

11

Identifying Potential Diagonals

Assumption: High scoring gap-less alignments contain several “seeds” of perfect matches

S=****AGCGCCATGGATTGAGCGA*

T=**TGCGACATTGATCGACCTA**

[Figure: DP matrix with axes s and t; position i is marked on s and position j on t.]

Since this is a gap-less alignment, all perfect match regions reside on the same diagonal (defined by i-j).

How do we find seeds efficiently?

12

Formalizing the task

Task at hand (identifying seeds): Find all pairs (i,j) such that s[i..i+ktup-1] = t[j..j+ktup-1]

From now on we assume that s is the database and t is the query string, i.e., |s| >> |t|.

Let ktup be a parameter denoting the seed length of interest.

Winter 05: a student mentioned that it is possible to restrict the search to a subset of the database (e.g., mammals).

13

Finding Seeds Efficiently

Prepare an index table of the database sequence S such that for any sequence of length ktup, one gets the list of its positions in S.

Index table (ktup = 2):

AA  -
AC  -
AG  5, 19
AT  11, 15
CA  10
CC  9
CG  7, 21
…
TT  16

S=****AGCGCCATGGATTGAGCGA* (the first A is at position 5)
T=**TGCGACATTGATCGACCTA**

March on the query sequence T while using the index table to list all matches with the database sequence S. For example, at positions 7, 8, and 9 of T:

7 (AC): (-,7), no match
8 (CA): (10,8), one match
9 (AT): (11,9), (15,9), two matches

In practice, these steps take linear time: O(|s|+|t|).
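The two steps (index the database, then march on the query) can be sketched as follows. This is an illustration only: the function names `build_index` and `find_seeds` are made up here, positions are 1-based as on the slide, and words containing the placeholder character `*` are skipped.

```python
from collections import defaultdict


def build_index(s, ktup):
    """Map every word of length ktup in s to its 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(s) - ktup + 1):
        word = s[i:i + ktup]
        if "*" not in word:  # '*' marks unspecified letters in the example
            index[word].append(i + 1)
    return index


def find_seeds(index, t, ktup):
    """List all pairs (i, j) where the word starting at t[j] occurs at s[i]."""
    seeds = []
    for j in range(len(t) - ktup + 1):
        for i in index.get(t[j:j + ktup], []):
            seeds.append((i, j + 1))
    return seeds


S = "****AGCGCCATGGATTGAGCGA*"
T = "**TGCGACATTGATCGACCTA**"
index = build_index(S, 2)
seeds = find_seeds(index, T, 2)
print(index["AG"], index["AT"])        # positions of AG and AT in S
print([p for p in seeds if p[1] in (7, 8, 9)])
```

A Python dict is a hash table, so this also illustrates the hashed variant in which only the words that actually occur in the database are stored.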

14

Comments

The maximal size of the index table is $|\Sigma|^{ktup}$, where $|\Sigma|$ is the alphabet size (4 or 20). For small ktup, the entire table is stored.

For large ktup values, one should keep only entries for tuples actually found in the database, so the index table size is indeed linear. In this case, hashing is needed.

Typical values of ktup are 1-2 for Proteins and 4-6 for DNA. Tradeoffs of these values to be discussed.

The index table is prepared for each database sequence ahead of users’ matching requests, at compilation time. So matching time is O(|T| · max{row_length}).

Index table (ktup = 2):

AA  -
AC  -
AG  5, 19
AT  11, 15
CA  10
CC  9
CG  7, 21
…
TT  16

15

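Identifying Potential Diagonals

S=****AGCGCCATGGATTGAGCGA*
T=**TGCGACATTGATCGACCTA**

Input: Sets of pairs. E.g., (6,4), (10,8), (14,12), (15,10), (20,4), …

Task: Locate sets of pairs that are on the same diagonal.

Method: Sort according to the difference i-j.

Example: (6,4), (10,8), and (14,12) all have i-j = 2 (6-4, 10-8, 14-12), so they lie on one diagonal; (20,4) has i-j = 20-4 = 16.

[Figure: DP matrix with axes s and t; positions 6, 10, 14, and 20 are marked on s and positions 4, 8, and 12 on t.]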
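A small sketch of the sorting step, using the example pairs from this slide. The function name `group_by_diagonal` is illustrative.

```python
from collections import defaultdict


def group_by_diagonal(pairs):
    """Group seed pairs (i, j) by their diagonal offset i - j."""
    diagonals = defaultdict(list)
    # Sort by offset first, then by position along the diagonal.
    for i, j in sorted(pairs, key=lambda p: (p[0] - p[1], p[0])):
        diagonals[i - j].append((i, j))
    return dict(diagonals)


pairs = [(6, 4), (10, 8), (14, 12), (15, 10), (20, 4)]
diagonals = group_by_diagonal(pairs)
for offset, members in diagonals.items():
    print(offset, members)
```

Diagonals with many members (here, offset 2) are the candidates worth extending.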

16

Processing Potential Diagonals

For diagonals with a high i-j offset frequency, namely diagonals with many pieces, combine the pieces into regions by extending pieces greedily along the diagonal as long as the score improves (and never drops below some score value).

[Figure: DP matrix with axes s and t.]

17

FASTA’s Final steps: using banded DP

• List the highest scoring diagonal matches
• Run banded DP on regions containing a high scoring diagonal (say with width 12)

[Figure: DP matrix with axes s and t, with three diagonals labeled 1, 2, and 3.]

Hence, the algorithm may combine some diagonals into gapped matches. (In the example above it could combine diagonals 2 and 3.)

18

FASTA - practical choices

Most applications of FASTA use very small ktup (1-2 for proteins, and 4-6 for DNA).

Higher values yield fewer potential diagonals. Hence the search around potential diagonals (DP) is faster, but the chance of missing an optimal local alignment is increased.

Some implementation choices and tricks have not been explicated herein.

March 2004: a student said that today 11 is the default value for DNA search. This probably refers to BLAST; see a few slides ahead.

19

BLAST (Basic Local Alignment Search Tool)

• Based on ideas similar to those described earlier (high scoring pairs rather than exact k-tuples as seeds).
• Uses an established statistical framework to determine thresholds.
• The new PSI-BLAST (Position Specific Iterated – BLAST) is the state of the art sequence comparison software.

Iterative procedure:

• Performs BLAST on a database
• Uses significant alignments to construct a “position specific” score matrix
• This matrix is used in the next round of database searching, until no new significant alignments are found

Can sometimes detect remote homologs.

This slide was taken from Danny's Winter 03-4 lecture 3. By discussion with any (fall05), it updates the scoring function according to the specific query string, by running BLAST on the input, then updating the score by the statistics of the returned alignments, and running it again (3-4 times).

20

BLAST Overview

Input: strings s and t, and a parameter T = threshold value
Output: A highly scored local alignment

Definition: Two strings u and v of length k are a high scoring pair (HSP) if d(u,v) > T (usually considering un-gapped alignments only).

1. Find high scoring pairs of substrings such that d(u,v) > T. These words serve as seeds for finding longer matches
2. Extend to ungapped diagonals (as in FASTA)
3. Extend to gapped matches

21

BLAST Overview (cont.)

Step 1: Find high scoring pairs of substrings such that d(u,v) > T (The seeds):

Find all strings of length k which score at least T with substrings of s in a gapless alignment (k = 4 for proteins, 11 for DNA)

(note: possibly, not all k-words must be tested, e.g. when such a word scores less than T with itself).

Find in t all exact matches with each of the above strings.
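A brute-force sketch of the first part of Step 1: listing all words of length k that score at least T against a given word in a gapless comparison. The function name `neighborhood` and the toy scoring function (+2 for a match, -1 for a mismatch) are assumptions made for illustration; real searches use substitution matrices and much faster enumeration.

```python
from itertools import product


def neighborhood(word, alphabet, score, threshold):
    """All words of len(word) over alphabet scoring >= threshold against word."""
    result = []
    for cand in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            result.append("".join(cand))
    return result


def toy_score(a, b):
    # Hypothetical scoring: +2 for a match, -1 for a mismatch.
    return 2 if a == b else -1


words = neighborhood("ACG", "ACGT", toy_score, 3)
print(len(words), words)
```

With these toy values the word itself scores 6 and any single substitution scores 3, so the neighborhood consists of the word plus its nine one-letter variants. Each of these would then be looked up in t by exact matching.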

22

Extending Potential Matches

[Figure: DP matrix with axes s and t.]

Once a seed is found, BLAST attempts to find a local alignment that extends the seed.

Seeds on the same diagonal are combined (as in FASTA), then extended as far as possible in a greedy manner.

During the extension phase, the search stops when the score passes below some lower bound computed by BLAST (to save time).
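A sketch of greedy ungapped extension along one diagonal. The stopping rule used here (stop once the running score falls more than `drop` below the best score seen so far) is one simple reading of the lower bound mentioned above, not BLAST's actual computation; the function name, the 0-based indexing, and the toy scoring are assumptions.

```python
def extend_right(s, t, i, j, score, drop):
    """Extend an ungapped match rightward from 0-based positions (i, j).

    Returns (best_score, best_length) over all extensions tried.
    """
    best = total = 0
    best_len = length = 0
    while i + length < len(s) and j + length < len(t):
        total += score(s[i + length], t[j + length])
        length += 1
        if total > best:
            best, best_len = total, length
        elif best - total > drop:
            break  # the score fell too far below the best; give up
    return best, best_len


def unit_score(a, b):
    # Hypothetical scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


print(extend_right("ACGTTTTT", "ACGTAAAA", 0, 0, unit_score, drop=2))
```

In the example the first four letters match and everything after them mismatches, so the extension stops after three mismatches and reports the four-letter region. A leftward extension would mirror this loop.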

23

Where do scoring rules come from?

We have defined an additive scoring function by specifying a function σ(·,·) such that

• σ(x,y) is the score of replacing x by y
• σ(x,-) is the score of deleting x
• σ(-,x) is the score of inserting x

But how do we come up with the “correct” score?

Answer: By encoding experience of what are similar sequences for the task at hand. Similarity depends on time, evolution trends, and sequence types.

24

Why use probability to define and/or interpret a scoring function ?

• Similarity is probabilistic in nature because biological changes like mutation, recombination, and selection are not deterministic.

• We could answer questions such as:
  • How probable is it that two sequences are similar?
  • Is the similarity found significant or random?
  • How should a similarity score change when, say, the mutation rate of a specific area on the chromosome becomes known?

25

A Probabilistic Model

• For now, we will focus on alignment without indels.
• For now, we assume each position (nucleotide / amino acid) is independent of other positions.
• We consider two options:

M: the sequences are Matched (related)

R: the sequences are Random (unrelated)

26

Unrelated Sequences

Our random model of unrelated sequences is simple:

• Each position is sampled independently from a distribution over the alphabet
• We assume there is a distribution q(·) that describes the probability of letters in such positions.

Then:

$P(s[1..n], t[1..n] \mid R) = \prod_i q(s[i])\, q(t[i])$

27

Related Sequences

We assume that each pair of aligned positions (s[i],t[i]) evolved from a common ancestor

Let p(a,b) be a distribution over pairs of letters. p(a,b) is the probability that some ancestral letter evolved into this particular pair of letters.

$P(s[1..n], t[1..n] \mid M) = \prod_i p(s[i], t[i])$

28

Odds-Ratio Test for Alignment

$Q = \frac{P(s,t \mid M)}{P(s,t \mid R)} = \frac{\prod_i p(s[i],t[i])}{\prod_i q(s[i])\, q(t[i])} = \prod_i \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])}$

If Q > 1, then the two strings s and t are more likely to be related (M) than unrelated (R).

If Q < 1, then the two strings s and t are more likely to be unrelated (R) than related (M).

Note: odds-ratio = p/(1-p), i.e., if the probability for A is p and for not A is 1-p, the odds-ratio of A is p/(1-p).

29

Log Odds-Ratio Test for Alignment

Taking logarithm of Q yields

$\log \frac{P(s,t \mid M)}{P(s,t \mid R)} = \log \prod_i \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])} = \sum_i \log \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])}$

Each summand is Score(s[i],t[i]).

If log Q > 0, then s and t are more likely to be related.
If log Q < 0, then they are more likely to be unrelated.

How can we relate this quantity to a score function ?

30

Probabilistic Interpretation of Scores

We define the scoring function via

$\sigma(a,b) = \log \frac{p(a,b)}{q(a)\, q(b)}$

Then, the score of an alignment is the log-ratio between the two models:

• Score > 0: Model is more likely
• Score < 0: Random is more likely
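A small numeric sketch of the log-odds score for a gapless alignment. The two-letter alphabet and the values of p and q below are toy numbers invented for illustration; the function names are likewise assumptions.

```python
import math


def log_odds_score(a, b, p, q):
    """Score of aligning letter a with letter b: log p(a,b) / (q(a) q(b))."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def alignment_score(s, t, p, q):
    """Sum of per-position log-odds scores for two equal-length strings."""
    return sum(log_odds_score(a, b, p, q) for a, b in zip(s, t))


# Toy distributions (assumptions): identical pairs are favored under M.
q = {"A": 0.5, "C": 0.5}
p = {("A", "A"): 0.4, ("C", "C"): 0.4, ("A", "C"): 0.1, ("C", "A"): 0.1}

related = alignment_score("AACC", "AACC", p, q)
unrelated = alignment_score("AACC", "CCAA", p, q)
print(related, unrelated)
```

With these numbers a matching pair scores log(0.4/0.25) > 0 and a mismatching pair scores log(0.1/0.25) < 0, so the first alignment gets a positive total and the second a negative one.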

31

Modeling Assumptions

It is important to note that this interpretation depends on our modeling assumption!!

For example, if we assume that the letter in each position depends on the letter in the preceding position, then the likelihood ratio will have a different form.

32

Estimating Probabilities

Suppose we are given a long string s[1..n] of letters from Σ

We want to estimate the distribution q(·) that generated the sequence

How should we go about this?

We build on the theory of parameter estimation in statistics, e.g., by using maximum likelihood.

33

Estimating q(·)

Suppose we are given a long string s[1..n] of letters from Σ

• s can be the concatenation of all sequences in our database

We want to estimate the distribution q(·)

Likelihood function:

$L(q \mid s) = \prod_{i=1}^{n} q(s[i]) = \prod_{a} q(a)^{N_a}$

where N_a is the number of occurrences of the letter a in s. That is, q is defined per letter.

34

Estimating q(·) (cont.)

How do we define q?

Likelihood function:

$L(q \mid s) = \prod_{i=1}^{n} q(s[i]) = \prod_{a} q(a)^{N_a}$

ML parameters (Maximum Likelihood):

$q(a) = \frac{N_a}{n}$

MAP parameters (Maximum A posteriori Probability):

$q(a) = \frac{N_a + 1}{n + |\Sigma|}$
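A sketch of estimating q from letter counts. With `pseudocount=0` it returns the relative frequencies (the maximum likelihood estimate); with `pseudocount=1` it adds one to every count, which is one common form of a MAP estimate and is assumed here to be what the slide intends. The function name and the example string are illustrative.

```python
from collections import Counter


def estimate_q(s, alphabet, pseudocount=0):
    """Estimate letter probabilities from s, optionally with pseudocounts."""
    counts = Counter(s)
    total = len(s) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


q_ml = estimate_q("AACG", "ACGT")
q_map = estimate_q("AACG", "ACGT", pseudocount=1)
print(q_ml)
print(q_map)
```

Note that the ML estimate assigns probability 0 to T, which never occurs in the sample, while the pseudocount variant keeps every letter's probability positive.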

35

Estimating p(·,·)

Intuition: Find a pair of aligned sequences s[1..n], t[1..n], and estimate the probability of pairs:

$p(a,b) = \frac{N_{a,b}}{n}$

where N_{a,b} is the number of times a is aligned with b in (s,t).

Again, s and t can be the concatenation of many aligned pairs from the database.
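The pair counts can be gathered in the same way as single-letter counts. This is a minimal sketch; the function name `estimate_p` and the example strings are illustrative, and pairs that never occur are simply absent from the result.

```python
from collections import Counter


def estimate_p(s, t):
    """Relative frequency of each aligned pair (s[i], t[i])."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have the same length")
    counts = Counter(zip(s, t))
    n = len(s)
    return {pair: c / n for pair, c in counts.items()}


p_hat = estimate_p("AACG", "AACT")
print(p_hat)
```

Here A is aligned with A twice out of four positions, so the estimate for (A, A) is 0.5.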
