L4: Blast: Alignment Scores etc. · 2004. 10. 11. · Z-scores for alignment ßInitial assumption was that the scores followed a normal distribution. ßZ-score computation: ßFor

L4: Blast: Alignment Scores etc.L4: Blast: Alignment Scores etc.

Why is Blast Fast?Why is Blast Fast?

Silly QuestionSilly Question

ßß Prove or Disprove:Prove or Disprove:ßß There are two people in New York City withThere are two people in New York City with

exactly the same number of hairs.exactly the same number of hairs.

Large database searchLarge database search

Query (m)

Database (n)

Database size n=10M, Querysize m=300.O(nm) = 3. 109 computations

ObservationsObservationsßß Much of the database is random from theMuch of the database is random from the

queryquery’’s perspectives perspectiveßß Consider a random DNA string of length n.Consider a random DNA string of length n.ßß Pr[A]=Pr[C] = Pr[G]=Pr[T]=0.25Pr[A]=Pr[C] = Pr[G]=Pr[T]=0.25

ßß What is the probability that an exact matchWhat is the probability that an exact matchto the query can be found?to the query can be found?

Basic probabilityBasic probability

ßß Probability that there is a match starting at aProbability that there is a match starting at afixed position i = 0.25fixed position i = 0.25mm

ßß What is the probability that some position iWhat is the probability that some position ihas a match.has a match.ßß Dependencies confound probability estimates.Dependencies confound probability estimates.

Basic Probability:ExpectationBasic Probability:Expectation

ßß Q: Toss a coin: each time it comes up heads,Q: Toss a coin: each time it comes up heads,you get a dollaryou get a dollarßß What is the money you expect to get after nWhat is the money you expect to get after n

tosses?tosses?ßß Let XLet Xii be the amount earned in the i- be the amount earned in the i-th th tosstoss

†

E(Xi) =1.p + 0.(1- p) = p

ß Total money you expect to earn

†

E( Xi) = E(Xi) =iÂ np

iÂ

Expected number of matchesExpected number of matches

ßß Expected number of matches can still be computed.Expected number of matches can still be computed.

ß Let Xi=1 if there is a match starting at position i, Xi=0 otherwise

†

Pr(Match at Position i) = pi = 0.25m

E(Xi) = pi = 0.25m

ß Expected number of matches =

†

E( Xi) = E(Xi)iÂiÂ = n 14( )

m

i

Expected number of exactExpected number of exactMatches is small!Matches is small!

ßß Expected number of matches = n*0.25Expected number of matches = n*0.25mm

ßß If n=10If n=1077, m=10,, m=10,ßß Then, expected number of matches = 9.537Then, expected number of matches = 9.537

ßß If n=10If n=1077, m=11, m=11ßß expected number of hits = 2.38expected number of hits = 2.38

ßß n=10n=1077,m=12,,m=12,ßß Expected number of hits = 0.5 < 1Expected number of hits = 0.5 < 1

ßß Bottom Line: An exact match to a substring of theBottom Line: An exact match to a substring of thequery is unlikely just by chance.query is unlikely just by chance.

Observation 2Observation 2

ßß What is the pigeonhole principle?What is the pigeonhole principle?

ß Suppose we are looking for a database string withgreater than 90% identity to the query (length 100)ß Partition the query into size 10 substrings. Atleast one much match the database string exactly

Why is this important?Why is this important?ßß Suppose we are looking for sequences that are 80% identical to the querySuppose we are looking for sequences that are 80% identical to the query

sequence of length 100.sequence of length 100.ßß Assume that the mismatches are randomly distributed.Assume that the mismatches are randomly distributed.ßß What is the probability that there is no stretch of 10 What is the probability that there is no stretch of 10 bpbp, where the query and, where the query and

the subject match exactly?the subject match exactly?

ßß Rough calculations show that it is very low. Exact match of a short queryRough calculations show that it is very low. Exact match of a short querysubstring to a truly similar subject is very high.substring to a truly similar subject is very high.ßß The above equation does not take dependencies into accountThe above equation does not take dependencies into accountßß Reality is better because the matches are not randomly distributedReality is better because the matches are not randomly distributed

†

†

@ 1- 810( )

10Ê Ë Á ˆ

¯ ˜

90

= 0.000036

Just the FactsJust the Facts

ßß Consider the set of all substrings of theConsider the set of all substrings of thequery string of fixed length W.query string of fixed length W.ßß ProbProb. of exact match to a random database string. of exact match to a random database string

is very low.is very low.ßß ProbProb. of exact match to a true homolog is very. of exact match to a true homolog is very

high.high.ßß Keyword Search (exact matches) is MUCH fasterKeyword Search (exact matches) is MUCH faster

than sequence alignmentthan sequence alignment

BLASTBLAST

•• Consider all (m-W) query words of size W (Default = 11)Consider all (m-W) query words of size W (Default = 11)•• Scan the database for exact match to all such wordsScan the database for exact match to all such words•• For all regions that hit, extend using a dynamic programming alignment.For all regions that hit, extend using a dynamic programming alignment.•• Can be many orders of magnitude faster than SW over the entire stringCan be many orders of magnitude faster than SW over the entire string

Database (n)

Why is BLAST fast?Why is BLAST fast?•• Assume that keyword searching does not consume anyAssume that keyword searching does not consume any

time and that alignment computation the expensivetime and that alignment computation the expensivestep.step.

•• Query m=1000, random Db n=10Query m=1000, random Db n=1077,, no TPno TP•• SW = O(nm) = 1000*10SW = O(nm) = 1000*107 7 = 10= 1010 10 computationscomputations•• BLAST, W=11BLAST, W=11

•• E(#11-E(#11-mer mer hits)= 1000* (1/4)hits)= 1000* (1/4)1111 * 10 * 1077=2384=2384•• Number of computations = 2384*100*10=2.384*10Number of computations = 2384*100*10=2.384*1066

•• Ratio=10Ratio=101010/(2.384*10/(2.384*1066)=4200)=4200

•• Further speed improvements are possibleFurther speed improvements are possible

Keyword MatchingKeyword Matchingßß How fast can we matchHow fast can we match

keywords?keywords?ßß Hash table/Db index?Hash table/Db index?

What is the size of theWhat is the size of thehash table, for m=11hash table, for m=11ßß Suffix trees? What isSuffix trees? What is

the size of the suffixthe size of the suffixtrees?trees?ßß Trie Trie based search. Webased search. We

will do this in class.will do this in class.

AATCA 567

Related notesRelated notesßß How to choose the alignment region?How to choose the alignment region?ßß Extend greedily until the score falls below a certainExtend greedily until the score falls below a certain

thresholdthreshold

ßß What about protein sequences?What about protein sequences?ßß Default word size = 3, and mismatches are allowed.Default word size = 3, and mismatches are allowed.

ßß Like sequences, BLAST has been evolving continuouslyLike sequences, BLAST has been evolving continuouslyßß Banded alignmentBanded alignmentßß Seed selectionSeed selectionßß Scanning for exact matches, keyword search versus databaseScanning for exact matches, keyword search versus database

indexingindexing

P-value computationP-value computation•• How significant is a score? What happens toHow significant is a score? What happens to

significance when you change the score functionsignificance when you change the score function•• A simple empirical method:A simple empirical method:

•• Compute a distribution of scores against a random database.Compute a distribution of scores against a random database.•• Use an estimate of the area under the curve to get theUse an estimate of the area under the curve to get the

probability.probability.•• OR, fit the distribution to one of the standard distributions.OR, fit the distribution to one of the standard distributions.

Z-scores for alignmentZ-scores for alignmentßß Initial assumption was that the scoresInitial assumption was that the scores

followed a normal distribution.followed a normal distribution.ßß Z-score computation:Z-score computation:ßß For any alignment, score S, shuffle one of theFor any alignment, score S, shuffle one of the

sequences many times, and sequences many times, and recompute recompute alignment.alignment.Get mean and standard deviationGet mean and standard deviation

ßß Look up a table to get a P-valueLook up a table to get a P-value

†

ZS =S - m

s

Blast E-valueBlast E-valueßß Initial (and natural) assumption was that scores followed a Normal distributionInitial (and natural) assumption was that scores followed a Normal distributionßß 1990,1990, Karlin Karlin andand Altschul Altschul showed thatshowed that ungapped ungapped local alignment scores follow anlocal alignment scores follow an

exponential distributionexponential distributionßß Practical consequence:Practical consequence:

ßß Longer tail.Longer tail.ßß Previously significant hits now not so significantPreviously significant hits now not so significant

Exponential distributionExponential distributionßß Random Database, Pr(1) = pRandom Database, Pr(1) = pßß What is the expected number of hits to a sequence of k 1What is the expected number of hits to a sequence of k 1’’ss

ßß Instead, consider a random binary Matrix. Expected # ofInstead, consider a random binary Matrix. Expected # ofdiagonals of k 1sdiagonals of k 1s

†

(n - k)pk @ nek ln p = ne-k ln 1

pÊ

Ë Á

ˆ

¯ ˜

†

L = (n - k)(m - k)pk @ nmek ln p = nme-k ln 1

pÊ

Ë Á

ˆ

¯ ˜

ßß As you increase k, the number decreases exponentially.As you increase k, the number decreases exponentially.ßß The number of diagonals of k runs can be approximated by a PoissonThe number of diagonals of k runs can be approximated by a Poisson

processprocess

ßß In In ungapped ungapped alignments, we replace the coin tosses by column scores,alignments, we replace the coin tosses by column scores,but the but the behaviour behaviour does not change (does not change (Karlin Karlin & & AltschulAltschul).).

ßß As the score increases, the number of alignments that achieve theAs the score increases, the number of alignments that achieve thescore decreases exponentiallyscore decreases exponentially

†

Pr[u] =Lue-L

u!

Pr[u > 0] =1- e-L

Blast E-valueBlast E-valueßß Choose a score such that the expected score between a pair ofChoose a score such that the expected score between a pair of

residues < 0residues < 0ßß Expected number of alignments with a particular scoreExpected number of alignments with a particular score

ßß For small values, E-value and P-value are the sameFor small values, E-value and P-value are the same

†

E = Kmne-lS = mn2-

lS- ln Kln 2

Ê

Ë Á

ˆ

¯ ˜

Pr(S ≥ x) =1- e-Kmne - lx

ConclusionConclusion

ßß Blast is fast becauseBlast is fast becauseßß Keyword search is fast (To be shown).Keyword search is fast (To be shown).ßß Keywords form good filters.Keywords form good filters.ßß The Poisson approximation can be used to showThe Poisson approximation can be used to show

that the keywords form a good filter, as well as tothat the keywords form a good filter, as well as tocompute P-values for scorescompute P-values for scores

Blast VariantsBlast Variants

0.0. What Are What AreBlastnBlastn//BlastpBlastp//BlastxBlastx//TTblastnblastn//TBlastxTBlastx

§§ What is mega-blast?What is mega-blast?§§ What is What is discontiguousdiscontiguous

mega-blast?mega-blast?§§ Phi-Blast/Phi-Blast/PsiPsi-Blast?-Blast?§§ BLAT?BLAT?§§ PatternHunterPatternHunter??

1.1. Longer seeds.Longer seeds.2.2. Seeds with donSeeds with don’’t caret care

valuesvalues3.3. LaterLater4.4. Database pre-Database pre-

processingprocessing5.5. Seeds with donSeeds with don’’t caret care

valuesvalues

Documents

L4: Blast: Alignment Scores etc. · 2004. 10. 11. · Z-scores for alignment ßInitial assumption was that the scores followed a normal distribution. ßZ-score computation: ßFor