BLAST, gapped BLAST and PSI-BLASTweb.math.ku.dk/~richard/courses/binf_project/Stinus-BLAST.pdf · Gapped BLAST New version faster. Relative times: BLAST Gapped BLAST Overhead 8 (8%)

BLAST, gapped BLAST and PSI-BLAST

Overview

• Quick review of pairwise alignment
• Statistical foundations of BLAST
• The basic BLAST algorithm
• Gapped BLAST
• PSI-BLAST

Pairwise alignment

• Global alignment
  – Needleman-Wunsch
• Local alignment
  – Smith-Waterman
• Dynamic programming, fill out matrix
• Find optimal solution
• Time complexity: O(mn)
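
The matrix fill for local alignment can be sketched as below. This is a minimal illustration, not the slides' own code: the function name and the match/mismatch/linear-gap values are assumptions chosen for the example, and only the best score (no traceback) is returned.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions (linear gap cost).
    """
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):          # O(m) rows
        for j in range(1, len(b) + 1):      # O(n) columns -> O(mn) total
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a new local alignment
                H[i - 1][j - 1] + s,    # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman("TTACGTTT", "GGACGGG"))  # shared "ACG" gives 3 * 2 = 6
```

Every cell is visited, which is the O(mn) cost that BLAST avoids.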

BLAST

• Heuristic method
• Fast search through large databases
• Good but not necessarily best solution
• Skip explicit search of the entire matrix
• Extensions:
  – Faster
  – Include gaps
  – Use position-specific scoring matrices

Statistical foundations

Recall random walks and extreme value distributions
Assumption: Independent background distribution of amino acids, P_i
s_ij denotes the score of aligning amino acids i and j
The expected score must be negative (cf. random walks; otherwise the score drifts to infinity):

\[ \sum_{i,j} P_i P_j s_{ij} < 0 \]

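Statistical foundations

Given P_i and s_ij, derive parameters λ and K
Let S be the nominal score of a sequence pair. The normalized score in bits is:

\[ S' = \frac{\lambda S - \ln K}{\ln 2} \]

The number E of chance occurrences of pairs scoring at least S' is approximated by (m and n are the lengths of the two sequences compared):

\[ E = \frac{mn}{2^{S'}} \quad\text{or}\quad S' = \lg\left(\frac{mn}{E}\right) \]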
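
The two formulas translate directly into code. The sketch below is illustrative only: the function names are made up for this example, and the values of λ, K, m and n in the usage lines are arbitrary assumptions, not parameters from the slides.

```python
import math


def bit_score(S, lam, K):
    """Normalized score S' in bits from nominal score S."""
    return (lam * S - math.log(K)) / math.log(2)


def expected_hits(S_bits, m, n):
    """E: expected number of chance pairs scoring at least S_bits."""
    return m * n / 2.0 ** S_bits


def bits_needed(E, m, n):
    """Inverse relation: the bit score corresponding to a given E."""
    return math.log2(m * n / E)


# Hypothetical numbers: query of length 250 against 1,000,000 residues.
S_bits = bit_score(100, lam=0.25, K=0.03)
print(S_bits, expected_hits(S_bits, 250, 1_000_000))
```

Substituting S' into E shows that the bit-score form is equivalent to E = K m n e^{-λS}, so the scoring-system parameters are absorbed into the normalized score.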

Statistical foundations

The target frequency of aligned pairs:

\[ q_{ij} = P_i P_j e^{\lambda_u s_{ij}} \quad\text{or}\quad s_{ij} = \frac{\ln(q_{ij} / P_i P_j)}{\lambda_u} \]

Note: Makes it possible to scale scores to fit desired frequencies q_ij
All this holds only when not using gaps
Expected to hold for gaps as well when costs are sufficiently large
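
As a worked example of these relations, the sketch below finds the ungapped scale λ as the positive root of Σ P_i P_j e^{λ s_ij} = 1 (the condition that makes the q_ij sum to 1) and then builds the target frequencies. The four-letter alphabet, the +1/-1 scores and all function names are assumptions for illustration; bisection is just one simple way to find the root.

```python
import math


def expected_score(P, s):
    return sum(P[i] * P[j] * s[i][j] for i in P for j in P)


def solve_lambda(P, s, tol=1e-12):
    """Positive root of sum_ij P_i P_j exp(lam * s_ij) = 1, by bisection."""
    if expected_score(P, s) >= 0:
        raise ValueError("expected score must be negative")
    if max(s[i][j] for i in P for j in P) <= 0:
        raise ValueError("at least one score must be positive")

    def f(lam):
        return sum(P[i] * P[j] * math.exp(lam * s[i][j]) for i in P for j in P) - 1.0

    hi = 1.0
    while f(hi) <= 0:        # f < 0 between 0 and the root, > 0 beyond it
        hi *= 2.0
    lo = hi
    while f(lo) > 0:
        lo /= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


def target_frequencies(P, s, lam):
    """q_ij = P_i P_j exp(lam * s_ij)."""
    return {i: {j: P[i] * P[j] * math.exp(lam * s[i][j]) for j in P} for i in P}


# Toy system (assumption): uniform background, match +1, mismatch -1.
P = {c: 0.25 for c in "ACGT"}
s = {a: {b: (1 if a == b else -1) for b in "ACGT"} for a in "ACGT"}
lam = solve_lambda(P, s)           # analytically ln 3 for this toy system
q = target_frequencies(P, s, lam)
print(lam, q["A"]["A"])
```

For the toy system the equation reduces to 0.25x + 0.75/x = 1 with x = e^λ, giving λ = ln 3; the expected score is -0.5, so the negativity condition holds.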

BLAST

Basic Local Alignment Search Tool
Brief review of basic BLAST:
Fast search for local ungapped alignments
W: Word size – find W-mers in target/query
T: Threshold – focus on pairs scoring >T
X: Drop-off – stop extending when loss >X
S: Score – the final score of segment pair

BLAST

Look for high-scoring words of length W
Compile list L of all W-mers that score >T with some word in query sequence
Scan database for words in L
When some word is found: Extend alignment
When score drops more than X below hitherto best score, stop extension
Report all segment pairs with large score S
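
The steps above can be sketched as a toy ungapped search. Everything here is an illustrative assumption rather than the real BLAST implementation: a four-letter alphabet with match +5 / mismatch -4 stands in for a substitution matrix, the parameter defaults are arbitrary, and the function names are invented for the example.

```python
from itertools import product


def score(a, b, match=5, mismatch=-4):
    # Toy scoring (assumption); real BLAST uses a substitution matrix.
    return match if a == b else mismatch


def neighborhood(query, W, T, alphabet):
    """List L: every W-mer scoring > T against some query word, with query offsets."""
    words = {}
    for q in range(len(query) - W + 1):
        qword = query[q:q + W]
        for cand in product(alphabet, repeat=W):
            if sum(score(x, y) for x, y in zip(qword, cand)) > T:
                words.setdefault("".join(cand), []).append(q)
    return words


def _extend_one_way(query, target, i, j, step, X):
    """Walk along the diagonal from (i, j); return (best gain, length at best)."""
    best = run = best_len = length = 0
    while 0 <= i < len(query) and 0 <= j < len(target):
        run += score(query[i], target[j])
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > X:      # dropped more than X below the best so far
            break
        i += step
        j += step
    return best, best_len


def extend(query, target, q, t, W, X):
    """Ungapped extension of a word hit; returns (score, q_start, t_start, length)."""
    seed = sum(score(query[q + k], target[t + k]) for k in range(W))
    right, rlen = _extend_one_way(query, target, q + W, t + W, +1, X)
    left, llen = _extend_one_way(query, target, q - 1, t - 1, -1, X)
    return seed + left + right, q - llen, t - llen, W + llen + rlen


def blast_ungapped(query, target, W=3, T=5, X=10, S=20, alphabet="ACGT"):
    words = neighborhood(query, W, T, alphabet)
    hsps = set()
    for t in range(len(target) - W + 1):          # scan the database sequence
        for q in words.get(target[t:t + W], ()):  # every hit is extended
            hsp = extend(query, target, q, t, W, X)
            if hsp[0] >= S:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)


print(blast_ungapped("GATTACAGT", "CCCCGATTACAGTCCCC")[0])  # (45, 0, 4, 9)
```

Note that every word hit triggers an extension here, which is where the original algorithm spends most of its time.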

BLAST

If W too large: Too many words in L
  – or too few
If T too large: Too restrictive search
  – or too many extensions
Choose cut-off for relevant hits
High-scoring Segment Pairs (HSPs)
It turns out that >90% of computation time is spent extending hits!

The Two-Hit Method

The goal: Faster algorithm
Reduce number of extensions
Observation:
• HSP much longer than W
• often contains more than one word-pair
Idea: Focus on two or more words on same diagonal

The Two-Hit Method

How to do it:
• For each hit, remember diagonal position
  – If overlapping: Ignore
  – If distance to previous hit < A: Extend
• Must lower T to get same sensitivity
  – Many more single hits
  – Only a few are extended due to diagonal constraint
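
The diagonal bookkeeping can be sketched as follows. This is a minimal illustration under stated assumptions: hits arrive as (query position, target position) pairs sorted by target position, the diagonal is identified by target minus query position, and the function name is invented for the example.

```python
def two_hit_triggers(hits, W, A):
    """Return the hits that trigger an extension under the two-hit rule.

    hits: (query_pos, target_pos) word hits, sorted by target_pos.
    W:    word size, used to detect overlap with the previous hit.
    A:    window; a second hit closer than A on the same diagonal triggers.
    """
    last = {}        # diagonal -> target position of the most recent kept hit
    triggers = []
    for q, t in hits:
        d = t - q
        prev = last.get(d)
        if prev is not None:
            dist = t - prev
            if dist < W:
                continue             # overlaps the previous hit: ignore it
            if dist < A:
                triggers.append((q, t))
        last[d] = t                  # remember this hit for its diagonal
    return triggers


hits = [(0, 10), (1, 11), (5, 15), (2, 30), (60, 70)]
print(two_hit_triggers(hits, W=3, A=40))  # [(5, 15)]
```

In the example, (1, 11) overlaps (0, 10) and is ignored, (5, 15) is a second non-overlapping hit on the same diagonal within A, and (60, 70) is on that diagonal but too far away.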

The Two-Hit Method

Evaluation: Generate 100,000 model HSPs
Check for hits > T
One-hit: W=3, T=13
Two-hit: W=3, T=11, A=40
For S≥33, two-hit is more sensitive

The Two-Hit Method

Test on real data:
15 hits with T≥13 (+)
Additional: 22 hits with T≥11 (•)
One-hit extends all 15
Two-hit extends 2 pairs
More hits, fewer extensions

Gapped BLAST

How to let BLAST find gapped alignments?
Already implicit 'gapped' alignment:
When several HSPs are found in the same sequence → assess combined result
If one HSP is missed, the combined result might be missed, too
Lower T needed – large execution time

Gapped BLAST

New idea: Introduce a moderate score Sg
If HSP exceeds Sg, start gapped extension
Choose Sg to trigger ~1 extension per 50 sequences in database (Sg≈22 bits)
Costly operation but few of them
Gapped extension based on a single HSP – we may tolerate missing more HSPs
Raise T

Gapped BLAST

New algorithm:
Two-hit method: Two words of score ≥T trigger ungapped extension
If HSP scores ≥ Sg, start gapped extension
Report final alignment if significant (low E-value)

Gapped BLAST

How to construct gapped local alignment?
Standard way: Limit search to a banded matrix
• Gapped extension may stray outside band
Instead use standard BLAST procedure:
Look in cells where the score drops no more than Xg
But how to begin…

Gapped BLAST

Use central aligned pair as seed
Heuristic: Find length-11 segment with highest score. Use central pair
Extend forward and backward
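
Seed selection and the drop-off-limited gapped extension can be sketched as below. This is a simplified illustration, not the production algorithm: it assumes a linear gap cost (one penalty per gap character), takes the scoring function as a parameter, and computes whole rows while pruning cells that fall more than Xg below the best score, whereas a real implementation would track only the live region. All names and default values are hypothetical.

```python
NEG = float("-inf")


def choose_seed(query, target, qs, ts, length, score, win=11):
    """Central pair of the best-scoring length-`win` window of an ungapped HSP."""
    if length <= win:
        off = length // 2                  # short HSP: use its central pair
    else:
        sc = [score(query[qs + k], target[ts + k]) for k in range(length)]
        cur = best = sum(sc[:win])
        best_start = 0
        for k in range(1, length - win + 1):
            cur += sc[k + win - 1] - sc[k - 1]   # slide the window by one
            if cur > best:
                best, best_start = cur, k
        off = best_start + win // 2
    return qs + off, ts + off


def xdrop_extend(a, b, score, gap, Xg):
    """Best gapped score for prefixes of a and b, anchored at their start."""
    best = 0
    prev = [0] + [NEG] * len(b)
    for j in range(1, len(b) + 1):         # row 0: leading gap characters
        v = prev[j - 1] + gap
        if v < best - Xg:
            break
        prev[j] = v
    for i in range(1, len(a) + 1):
        cur = [NEG] * (len(b) + 1)
        if prev[0] + gap >= best - Xg:
            cur[0] = prev[0] + gap
        for j in range(1, len(b) + 1):
            v = max(prev[j - 1] + score(a[i - 1], b[j - 1]),
                    prev[j] + gap, cur[j - 1] + gap)
            if v >= best - Xg:             # keep only cells within Xg of best
                cur[j] = v
                best = max(best, v)
        if all(c == NEG for c in cur):     # nothing left to extend
            break
        prev = cur
    return best


def gapped_score(query, target, qi, tj, score, gap=-3, Xg=10):
    """Seed pair score plus forward and backward gapped extensions."""
    fwd = xdrop_extend(query[qi + 1:], target[tj + 1:], score, gap, Xg)
    bwd = xdrop_extend(query[:qi][::-1], target[:tj][::-1], score, gap, Xg)
    return score(query[qi], target[tj]) + fwd + bwd
```

For instance, with match +2 / mismatch -3 and an ungapped HSP covering the first eight positions of "AAAACCCCGGGG" and "AAAACCCCTGGGG", `choose_seed` picks the pair (4, 4) and `gapped_score` bridges the inserted T with a single gap.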

Gapped BLAST

This result is not found by standard BLAST
Combined result of first and last HSP gives E-value 31
Gapped BLAST: S=32.4 bits, E=0.54

Gapped BLAST

What about spurious hits? Does that give extra work?
No: Score decays fast
Small part of matrix explored

Gapped BLAST

New version is faster. Relative times:

            BLAST       Gapped BLAST
Overhead    8 (8%)      8 (24%)
Extend?     -           12 (37%)
Ungapped    92 (92%)    5 (15%)
Gapped      -           8 (24%)
Total       100         33

Gapped BLAST

What about the parameters λ and K?
Cannot be estimated during execution since BLAST looks at only some sequences
No theory covers gapped alignments
Use estimations made in advance
Drawback: Cannot use arbitrary scoring systems

PSI-BLAST

Position-Specific Iterated BLAST
Use sequence information to build position-specific scoring matrices
Readers of X-Men will know that psy blasts are something else entirely …

PSI-BLAST

More sensitive procedure
Each iteration a little slower
Issues:
i) Architecture of score matrix
ii) Construction of multiple alignment
iii) Sequence weights
iv) Target frequencies and scores
v) Applying BLAST to scoring matrices

PSI-BLAST: Architecture

Automated generation is difficult
Boundaries, many motifs, subsets…
1) Length of query determines dimensions
2) No position-specific gap cost
  – No theory for deriving gap costs from M
  – Estimate statistical significance
So they build an L × 20 scoring matrix

PSI-BLAST: Constructing M

Collect BLAST output with E<0.01
Remove similar sequences
  – Sequences identical to query segments
  – Only one copy of sequences >98% similarity
Local alignment → varying number of sequences per column
No true multiple alignment methods

PSI-BLAST: Constructing M

Reduce M to MC:
Treat columns independently
For each column C: Let R be the set of sequences with a residue in C
The columns of MC: Columns from M with all sequences in R
Now: Characters in all positions

PSI-BLAST: Weights

Weighting needed to avoid bias
Many methods, roughly same results
  – Voronoi, maximum entropy, position based
Information content
NC: Number of independent observations
Simple estimate: Mean number of different residues in each column
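
The reduction of M to MC and the simple NC estimate can be sketched on a toy alignment. The representation is an assumption made for this example: each row is a string of equal length, a space marks columns the local alignment does not reach, and "-" marks a gap inside an alignment. Treating a gap as a character present in a column, but not as a residue that puts a row into R, is likewise an assumption here.

```python
UNCOVERED = " "   # hypothetical marker: local alignment does not reach this column
GAP = "-"


def reduce_alignment(M, C):
    """Rows with a residue in column C, restricted to columns they all cover."""
    R = [row for row in M if row[C] not in (UNCOVERED, GAP)]
    cols = [c for c in range(len(M[0]))
            if all(row[c] != UNCOVERED for row in R)]
    return ["".join(row[c] for c in cols) for row in R]


def effective_observations(MC):
    """Simple N_C estimate: mean number of different characters per column."""
    ncols = len(MC[0])
    return sum(len({row[c] for row in MC}) for c in range(ncols)) / ncols


M = ["ACDEF",
     "ACD  ",
     " CDEY",
     "  NE "]
MC = reduce_alignment(M, 4)          # rows with a residue in the last column
print(MC, effective_observations(MC))  # ['CDEF', 'CDEY'] 1.25
```

Because R depends on the column, each column C gets its own reduced alignment and its own NC.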

PSI-BLAST: Target frequencies

Many methods for creating scoring matrices
Good theoretical foundation: log(Qi/Pi)
Pi: Background. How to estimate Qi?
Pseudocount frequencies gi for column C
fj: Observed frequency, qij: Implicit target frequency (the target frequency of aligned pairs defined earlier)

\[ g_i = \sum_j \frac{f_j}{P_j} q_{ij} \]

PSI-BLAST: Target frequencies

Weight observed and pseudocount frequencies
α=NC-1, β=10 (empirical)
Now Qi is given as:

\[ Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta} \]

This makes it possible to build a matrix
But how to use it with BLAST…
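
For a single column, the pseudocount and weighting formulas can be worked through as below. The two-letter alphabet, the target frequencies q and the function name are assumptions for illustration, and the score is left as the plain log-odds log(Qi/Pi) without any further scaling.

```python
import math


def column_scores(f, P, q, N_C, beta=10.0):
    """Pseudocounts g, estimated frequencies Q and log-odds scores for one column.

    f: observed (weighted) residue frequencies in the column
    P: background frequencies
    q: target frequencies q[i][j] of aligned pairs
    """
    alpha = N_C - 1
    g = {i: sum(f[j] / P[j] * q[i][j] for j in P) for i in P}
    Q = {i: (alpha * f[i] + beta * g[i]) / (alpha + beta) for i in P}
    scores = {i: math.log(Q[i] / P[i]) for i in P}
    return g, Q, scores


# Toy data (assumption): two residues, column in which only "A" was observed.
P = {"A": 0.5, "B": 0.5}
q = {"A": {"A": 0.4, "B": 0.1}, "B": {"A": 0.1, "B": 0.4}}
f = {"A": 1.0, "B": 0.0}
g, Q, scores = column_scores(f, P, q, N_C=11)
print(g, Q, scores)
```

Here α = β = 10, so Q is the plain average of f and g: the unobserved residue "B" still gets a non-zero frequency and therefore a finite negative score. With NC = 1, α = 0 and Q falls back entirely on the pseudocounts.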

PSI-BLAST: Application

Minor modifications to
  – find words in query matrix
  – find hits
  – extend hits (gapped and ungapped)
But what about the parameters T and Xg?
Test whether the scale λu of the matrix corresponds to the scale of sij
If similar: Probably the same scale λg

PSI-BLAST: Application

Test the hypothesis:
Construct matrix by BLASTing length-567 influenza A virus hemagglutinin precursor
Compare to 10,000 random sequences
Plot local alignment score versus count
Fit best extreme value distribution
λg=0.251, Kg=0.031
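
The fitting step can be illustrated with a method-of-moments estimate. This is a swapped-in technique chosen for brevity, not necessarily the fitting procedure used in the experiment, and the scores are simulated from an extreme value distribution rather than obtained by aligning real sequences. The random-sequence length n = 300 and all function names are assumptions; the relation P(S ≥ x) ≈ 1 − exp(−K m n e^{−λx}) is used to connect λ and K to the distribution's location and scale.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def simulate_scores(lam, K, m, n, count, seed=0):
    """Draw optimal-alignment scores from the extreme value distribution."""
    rng = random.Random(seed)
    mu = math.log(K * m * n) / lam               # location parameter
    return [mu - math.log(-math.log(rng.uniform(1e-12, 1 - 1e-12))) / lam
            for _ in range(count)]


def fit_evd(scores, m, n):
    """Method-of-moments estimates of lambda and K."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6))          # variance = pi^2 / (6 lam^2)
    mu = mean - EULER_GAMMA / lam                # mean = mu + gamma / lam
    K = math.exp(lam * mu) / (m * n)
    return lam, K


# Simulate 10,000 scores with the slide's parameter values, then re-estimate.
scores = simulate_scores(0.251, 0.031, m=567, n=300, count=10_000)
print(fit_evd(scores, m=567, n=300))
```

The estimate of K is noticeably less precise than that of λ, because K depends exponentially on the product of the fitted scale and location.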

PSI-BLAST: Application

Good fit to data
Supported by other experiments
Generally: Less than 2% deviation when using precomputed parameters

Testing PSI-BLAST

Create scoring matrix for 11 families
Compare to shuffled SWISS-PROT
Record:
  – Lowest E-value
  – No. of sequences with E≤1 and E≤10

Testing PSI-BLAST

Within uncertainties of the theory
PSI-BLAST can automate the procedure
Beware of including used sequences

Testing PSI-BLAST

Compare sensitivity and speed of
• Smith-Waterman
• Original BLAST
• Gapped BLAST
• PSI-BLAST (1 iteration)

Testing PSI-BLAST

All but one are true homologs
PSI-BLAST is faster and more sensitive
Other BLAST algorithms good as well

Conclusions

• The two-hit method improves speed
• Gapped BLAST is fast
• PSI-BLAST finds weak homologs fast
• The theory can be extended

Future work

• Gap costs: Generalized affine gap cost
• Input scoring matrices to PSI-BLAST
  – Problems with parameters
• More refined multiple alignment
  – Use most significant hits
  – Rescore and realign sequences
  – Iterate