60
Fa05 CSE 182 CSE182-L4: Keyword matching

CSE182-L4: Keyword matching

  • Upload
    mort

  • View
    37

  • Download
    1

Embed Size (px)

DESCRIPTION

CSE182-L4: Keyword matching. Backward scoring. Defin S b [i,j] : Best scoring alignment of the suffixes s[i+1..n] and t[j+1..m] Q: What is the score of the best alignment of the two strings s and t? HW: Write the recurrences for S b. Forward/Backward computations. j. m. 1. - PowerPoint PPT Presentation

Citation preview

Page 1: CSE182-L4:  Keyword matching

Fa05 CSE 182

CSE182-L4: Keyword matching

Page 2: CSE182-L4:  Keyword matching

Fa05 CSE 182

Backward scoring

• Defin Sb[i,j] : Best scoring alignment of the suffixes s[i+1..n] and t[j+1..m]

• Q: What is the score of the best alignment of the two strings s and t?

• HW: Write the recurrences for Sb

Page 3: CSE182-L4:  Keyword matching

Fa05 CSE 182

Forward/Backward computations

• F[j] = Score of the best scoring alignment of s[1..n/2] and t[1..j]– F[j] = S[n/2,j]

• B[j] = Score of the best scoring alignment of s[n/2+1..n] and t[j+1..m]

– B[j] = Sb[n/2,j]

n/2

j1 m

Page 4: CSE182-L4:  Keyword matching

Fa05 CSE 182

Forward/Backward computations

• At the optimal coordinate, j– F[j]+B[j]=S[n,m]

• In O(nm) time, and O(m) space, we can compute one of the coordinates on the optimum path.

n/2

j1 m

Page 5: CSE182-L4:  Keyword matching

Fa05 CSE 182

Forward, Backward computation

• There exists a coordinate, j– F[j]+B[j]=S[n,m]

• In O(nm) time, and O(m) space, we can compute one of the coordinates on the optimum path.

Page 6: CSE182-L4:  Keyword matching

Fa05 CSE 182

Linear Space Alignment

• Align(1..n,1..m)– For all 1<=j <= m

• Compute F[j]=S(n/2,j)

– For all 1<=j <= m• Compute B[j]=Sb(n/2,j)

– j* = maxj {F[j]+B[j] }

– X = Align(1..n/2,1..j*)– Y = Align(n/2..n,j*..m)– Return X,j*,Y

Page 7: CSE182-L4:  Keyword matching

Fa05 CSE 182

Linear Space complexity

• T(nm) = c.nm + T(nm/2) = O(nm)• Space = O(m)

Page 8: CSE182-L4:  Keyword matching

Fa05 CSE 182

Summary

• We considered the basics of sequence alignment– Opt score computation– Reconstructing alignments– Local alignments– Affine gap costs– Space saving measures

• Can we recreate Blast?

Page 9: CSE182-L4:  Keyword matching

Fa05 CSE 182

Blast and local alignment

• Concatenate all of the database sequences to form one giant sequence.

• Do local alignment computation with the query.

Page 10: CSE182-L4:  Keyword matching

Fa05 CSE 182

Large database search

Query (m)

Database (n)

Database size n=100M, Querysize m=1000.O(nm) = 1011 computations

Page 11: CSE182-L4:  Keyword matching

Fa05 CSE 182

Why is Blast Fast?

Page 12: CSE182-L4:  Keyword matching

Fa05 CSE 182

Silly Question!

• True or False: No two people in new york city have the same number of hair

Page 13: CSE182-L4:  Keyword matching

Fa05 CSE 182

Observations

• Much of the database is random from the query’s perspective

• Consider a random DNA string of length n. – Pr[A]=Pr[C] = Pr[G]=Pr[T]=0.25

• Assume for the moment that the query is all A’s (length m).

• What is the probability that an exact match to the query can be found?

Page 14: CSE182-L4:  Keyword matching

Fa05 CSE 182

Basic probability

• Probability that there is a match starting at a fixed position i = 0.25m

• What is the probability that some position i has a match.

• Dependencies confound probability estimates.

Page 15: CSE182-L4:  Keyword matching

Fa05 CSE 182

Basic Probability:Expectation

• Q: Toss a coin: each time it comes up heads, you get a dollar– What is the money you expect to get after n

tosses?

– Let Xi be the amount earned in the i-th toss

E(X i) =1.p + 0.(1− p) = p

Total money you expect to earn

E( X i) = E(X i) =i

∑ npi

Page 16: CSE182-L4:  Keyword matching

Fa05 CSE 182

Expected number of matches

• Expected number of matches can still be computed.

Let Xi=1 if there is a match starting at position i, Xi=0 otherwise

Pr(Match at Position i) = pi = 0.25m

E(X i) = pi = 0.25m

Expected number of matches =

E( X i) = E(X i)i∑

i∑ = n 1

4( )m

i

Page 17: CSE182-L4:  Keyword matching

Fa05 CSE 182

Expected number of exact Matches is small!

• Expected number of matches = n*0.25m

– If n=107, m=10, • Then, expected number of matches = 9.537

– If n=107, m=11• expected number of hits = 2.38

– n=107,m=12, • Expected number of hits = 0.5 < 1

• Bottom Line: An exact match to a substring of the query is unlikely just by chance.

Page 18: CSE182-L4:  Keyword matching

Fa05 CSE 182

Observation 2

• What is the pigeonhole principle?

Suppose we are looking for a database string with greater than 90% identity to the query (length 100) Partition the query into size 10 substrings. At least one much match the database string exactly

Page 19: CSE182-L4:  Keyword matching

Fa05 CSE 182

Why is this important?

• Suppose we are looking for sequences that are 80% identical to the query sequence of length 100.

• Assume that the mismatches are randomly distributed.• What is the probability that there is no stretch of 10 bp, where the query and the subject

match exactly?

• Rough calculations show that it is very low. Exact match of a short query substring to a truly similar subject is very high.

– The above equation does not take dependencies into account– Reality is better because the matches are not randomly distributed

≅ 1− 810( )

10 ⎛ ⎝ ⎜ ⎞

⎠ ⎟90

= 0.000036

Page 20: CSE182-L4:  Keyword matching

Fa05 CSE 182

Just the Facts

• Consider the set of all substrings of the query string of fixed length W.– Prob. of exact match to a random database

string is very low.– Prob. of exact match to a true homolog is

very high.– Keyword Search (exact matches) is MUCH

faster than sequence alignment

Page 21: CSE182-L4:  Keyword matching

Fa05 CSE 182

BLAST

• Consider all (m-W) query words of size W (Default = 11)• Scan the database for exact match to all such words• For all regions that hit, extend using a dynamic programming alignment.• Can be many orders of magnitude faster than SW over the entire string

Database (n)

Page 22: CSE182-L4:  Keyword matching

Fa05 CSE 182

Why is BLAST fast?

• Assume that keyword searching does not consume any time and that alignment computation the expensive step.

• Query m=1000, random Db n=107, no TP• SW = O(nm) = 1000*107 = 1010 computations• BLAST, W=11

• E(#11-mer hits)= 1000* (1/4)11 * 107=2384• Number of computations = 2384*100*100=2.384*107

• Ratio=1010/(2.384*107)=420• Further speed improvements are possible

Page 23: CSE182-L4:  Keyword matching

Fa05 CSE 182

Keyword Matching

• How fast can we match keywords?

• Hash table/Db index? What is the size of the hash table, for m=11

• Suffix trees? What is the size of the suffix trees?

• Trie based search. We will do this in class.

AATCA 567

Page 24: CSE182-L4:  Keyword matching

Fa05 CSE 182

Related notes

• How to choose the alignment region?– Extend greedily until the score falls below a certain

threshold• What about protein sequences?

– Default word size = 3, and mismatches are allowed.• Like sequences, BLAST has been evolving continuously

– Banded alignment– Seed selection– Scanning for exact matches, keyword search versus

database indexing

Page 25: CSE182-L4:  Keyword matching

Fa05 CSE 182

P-value computation• How significant is a score? What happens to

significance when you change the score function• A simple empirical method:

• Compute a distribution of scores against a random database.

• Use an estimate of the area under the curve to get the probability.

• OR, fit the distribution to one of the standard distributions.

Page 26: CSE182-L4:  Keyword matching

Fa05 CSE 182

Z-scores for alignment• Initial assumption was that the scores followed a

normal distribution.• Z-score computation:

– For any alignment, score S, shuffle one of the sequences many times, and recompute alignment. Get mean and standard deviation

– Look up a table to get a P-value

ZS =S − μ

σ

Page 27: CSE182-L4:  Keyword matching

Fa05 CSE 182

Blast E-value

• Initial (and natural) assumption was that scores followed a Normal distribution• 1990, Karlin and Altschul showed that ungapped local alignment scores follow

an exponential distribution• Practical consequence:

– Longer tail. – Previously significant hits now not so significant

Page 28: CSE182-L4:  Keyword matching

Fa05 CSE 182

Exponential distribution

• Random Database, Pr(1) = p • What is the expected number of hits to a sequence of k 1’s

• Instead, consider a random binary Matrix. Expected # of diagonals of k 1s

(n − k)pk ≅ nek ln p = ne−k ln

1

p

⎝ ⎜

⎠ ⎟

Λ=(n − k)(m − k) pk ≅ nmek ln p = nme−k ln

1

p

⎝ ⎜

⎠ ⎟

Page 29: CSE182-L4:  Keyword matching

Fa05 CSE 182

• As you increase k, the number decreases exponentially.• The number of diagonals of k runs can be approximated by a Poisson

process

• In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin & Altschul).

• As the score increases, the number of alignments that achieve the score decreases exponentially €

Pr[u] =Λue−Λ

u!

Pr[u > 0] =1− e−Λ

Page 30: CSE182-L4:  Keyword matching

Fa05 CSE 182

Blast E-value• Choose a score such that the expected score between a pair of

residues < 0• Expected number of alignments with a particular score

• For small values, E-value and P-value are the same

E = Kmne−λS = mn2−

λS−ln K

ln 2

⎝ ⎜

⎠ ⎟

Pr(S ≥ x) =1− e−Kmne −λx

Page 31: CSE182-L4:  Keyword matching

Fa05 CSE 182

Blast Variants

1. What is mega-blast?2. What is discontiguous

mega-blast?3. Phi-Blast/Psi-Blast?4. BLAT?5. PatternHunter?

Longer seeds.

Seeds with don’t care values

Later

Database pre-processing

Seeds with don’t care values

Page 32: CSE182-L4:  Keyword matching

Fa05 CSE 182

Silly Quiz

• Name a famous Bioinformatics Researcher

• Name a famous Bioinformatics Researcher who is a woman

Page 33: CSE182-L4:  Keyword matching

Fa05 CSE 182

Scoring DNA

• DNA has structure.

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 34: CSE182-L4:  Keyword matching

Fa05 CSE 182

DNA scoring matrices

• So far, we considered a simple match/mismatch criterion.

• The nucleotides can be grouped into Purines (A,G) and Pyrimidines.

• Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions)

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 35: CSE182-L4:  Keyword matching

Fa05 CSE 182

Scoring proteins• Scoring protein sequence alignments is

a much more complex task than scoring DNA– Not all substitutions are equal

• Problem was first worked on by Pauling and collaborators

• In the 1970s, Margaret Dayhoff created the first similarity matrices.– “One size does not fit all”– Homologous proteins which are

evolutionarily close should be scored differently than proteins that are evolutionarily distant

– Different proteins might evolve at different rates and we need to normalize for that

Page 36: CSE182-L4:  Keyword matching

Fa05 CSE 182

PAM 1 distance

• Two sequences are 1 PAM apart if they differ in 1 % of the residues.

• PAM1(a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart]

1% mismatch

Page 37: CSE182-L4:  Keyword matching

Fa05 CSE 182

PAM1 matrix

• Align many proteins that are very similar– Is this a problem?

• PAM1 distance is the probability of a substitution when 1% of the residues have changed

• Estimate the frequency Pb|a of residue a being substituted by residue b.

• S(a,b) = log10(Pab/PaPb) = log10(Pb|a/Pb)

Page 38: CSE182-L4:  Keyword matching

Fa05 CSE 182

PAM 1

Page 39: CSE182-L4:  Keyword matching

Fa05 CSE 182

PAM distance

• Two sequences are 1 PAM apart when they differ in 1% of the residues.

• When are 2 sequences 2 PAMs apart?

1 PAM

1 PAM

2 PAM

Page 40: CSE182-L4:  Keyword matching

Fa05 CSE 182

Higher PAMs

• PAM2(a,b) = ∑c PAM1(a,c). PAM1 (c,b)

• PAM2 = PAM1 * PAM1 (Matrix multiplication)

• PAM250

– = PAM1*PAM249

– = PAM1250

Page 41: CSE182-L4:  Keyword matching

Fa05 CSE 182

Note: This is not the score matrix: What happens as you keep increasing the power?

Page 42: CSE182-L4:  Keyword matching

Fa05 CSE 182

Scoring using PAM matrices

• Suppose we know that two sequences are 250 PAMs apart.

• S(a,b) = log10(Pab/PaPb)= log10(Pb|a/Pb) = log10(PAM250(a,b)/Pb)

Page 43: CSE182-L4:  Keyword matching

Fa05 CSE 182

BLOSUM series of Matrices

• Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions

• A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.

• BLOSUM60 Merge all proteins that have greater than 60%. Then, compute the substitution probability.– In practice BLOSUM62 seems to work very well.

Page 44: CSE182-L4:  Keyword matching

Fa05 CSE 182

PAM vs. BLOSUM

• What is the correspondence?

• PAM1 Blosum1• PAM2 Blosum2

• Blosum62

• PAM250 Blosum100

Page 45: CSE182-L4:  Keyword matching

Fa05 CSE 182

Dictionary Matching, R.E. matching, and position specific scoring

Page 46: CSE182-L4:  Keyword matching

Fa05 CSE 182

Keyword search

• Recall: In BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to the keyword.

• Question: Given a collection of strings (keywords), find all occrrences in a database string where they keyword might match.

Page 47: CSE182-L4:  Keyword matching

Fa05 CSE 182

Dictionary Matching

• Q: Given k words (si has length li), and a database of size n, find all matches to these words in the database string.

• How fast can this be done?

1:POTATO2:POTASSIUM3:TASTE

P O T A S T P O T A T O

dictionary

database

Page 48: CSE182-L4:  Keyword matching

Fa05 CSE 182

Dict. Matching & string matching

• How fast can you do it, if you only had one word of length m?– Trivial algorithm O(nm) time– Pre-processing O(m), Search O(n) time.

• Dictionary matching

– Trivial algorithm (l1+l2+l3…)n

– Using a keyword tree, lpn (lp is the length of the longest pattern)

– Aho-Corasick: O(n) after preprocessing O(l1+l2..)

• We will consider the most general case

Page 49: CSE182-L4:  Keyword matching

Fa05 CSE 182

Direct Algorithm

P O P O P O T A S T P O T A T OP O T A T OP O T A T OP O T A T OP O T A T O P O T A T O

Observations:• When we mismatch, we (should) know something about

where the next match will be.• When there is a mismatch, we (should) know something

about other patterns in the dictionary as well.

Page 50: CSE182-L4:  Keyword matching

Fa05 CSE 182

P O T A T O

T UIS M

S ETA

The Trie Automaton

• Construct an automaton A from the dictionary– A[v,x] describes the transition from node v to a node w upon

reading x.– A[u,’T’] = v, and A[u,’S’] = w– Special root node r– Some nodes are terminal, and labeled with the index of the

dictionary word.

1:POTATO2:POTASSIUM3:TASTE

1

2

3

w

vu

S

r

Page 51: CSE182-L4:  Keyword matching

Fa05 CSE 182

An O(lpn) algorithm for keyword matching

• Start with the first position in the db, and the root node.

• If successful transition– Increment current

pointer– Move to a new node– If terminal node

“success”• Else

– Retract ‘current’ pointer– Increment ‘start’ pointer– Move to root & repeat

Page 52: CSE182-L4:  Keyword matching

Fa05 CSE 182

Illustration:

P O T A T O

T UIS M

S ETA

P O T A S T P O T A T Ol c

v

S

1

Page 53: CSE182-L4:  Keyword matching

Fa05 CSE 182

Idea for improving the time

P O T A S T P O T A T O

• Suppose we have partially matched pattern i (indicated by l, and c), but fail subsequently. If some other pattern j is to match– Then prefix(pattern j) = suffix [ first c-l characters of

pattern(i))

l c

1:POTATO2:POTASSIUM3:TASTE

P O T A S S I U MT A S T E

Pattern i

Pattern j

Page 54: CSE182-L4:  Keyword matching

Fa05 CSE 182

Improving speed of dictionary matching

• Every node v corresponds to a string sv that is a prefix of some pattern.

• Define F[v] to be the node u such that su is the longest suffix of sv

• If we fail to match at v, we should jump to F[v], and commence matching from there

• Let lp[v] = |su|

P O T A T O

T UIS M

S ETA

1 2 3 4 5

67

89 10

11S

Page 55: CSE182-L4:  Keyword matching

Fa05 CSE 182

An O(n) alg. For keyword matching

• Start with the first position in the db, and the root node.

• If successful transition– Increment current pointer– Move to a new node– If terminal node “success”

• Else (if at root)– Increment ‘current’ pointer– Mv ‘start’ pointer– Move to root

• Else – Move ‘start’ pointer forward– Move to failure node

Page 56: CSE182-L4:  Keyword matching

Fa05 CSE 182

Illustration

P O T A S T P O T A T O

P O T A T O

T UIS M

S ETA

lc

v S

1

Page 57: CSE182-L4:  Keyword matching

Fa05 CSE 182

Time analysis

• In each step, either c is incremented, or l is incremented

• Neither pointer is ever decremented (lp[v] < c-l).

• l and c do not exceed n• Total time <= 2n

P O T A S T P O T A T Ol c

Page 58: CSE182-L4:  Keyword matching

Fa05 CSE 182

Blast: Putting it all together

• Input: Query of length m, database of size n

• Select word-size, scoring matrix, gap penalties, E-value cutoff

Page 59: CSE182-L4:  Keyword matching

Fa05 CSE 182

Blast Steps

1. Generate an automaton of all query keywords.2. Scan database using a “Dictionary Matching” algorithm

(O(n) time). Identify all hits.3. Extend each hit using a variant of “local alignment”

algorithm. Use the scoring matrix and gap penalties.4. For each alignment with score S, compute the bit-

score, E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached.

5. Output results.

Page 60: CSE182-L4:  Keyword matching

Fa05 CSE 182

Protein Sequence Analysis

• What can you do if BLAST does not return a hit?– Sometimes, homology (evolutionary similarity) exists at

very low levels of sequence similarity.

• A: Accept hits at higher P-value. – This increases the probability that the sequence similarity

is a chance event.– How can we get around this paradox?– Reformulated Q: suppose two sequences B,C have the

same level of sequence similarity to sequence A. If A& B are related in function, can we assume that A& C are? If not, how can we distinguish?