8/13/2019 BLAST Slides BLAST ALGORITHM
1/38
MPI for Developmental Biology, Tubingen
logo
BLAST and FASTAHeuristics in pairwise sequence alignment
Christoph Dieterich
Department of Evolutionary BiologyMax Planck Institute for Developmental Biology
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
2/38
MPI for Developmental Biology, Tubingen
logo
Introduction BLAST Statistical analysis FASTA
Introduction
1 Pairwise alignment is used to detect homologies between
different protein or DNA sequences, e.g. as global or local
alignments.
2 Problem solved using dynamic programming inO(nm)time andO(n)space.
3 This istoo slowfor searching current databases.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
3/38
MPI for Developmental Biology, Tubingen
logo
Introduction BLAST Statistical analysis FASTA
Heuristics for large-scale database searching
In practice algorithms are used that run much faster, at the
expense of possibly missing some significant hits due to the
heuristics employed.
Such algorithms are usually seed-and-extendapproaches in
which first small exact matches are found, which are then
extended to obtain long inexact ones.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
4/38
MPI for Developmental Biology, Tubingen
logo
Introduction BLAST Statistical analysis FASTA
Basic idea: Preprocessing
After preprocessing, a large part of the computation is already
finished before we search for similarities.For biological sequences of lengthnthere are 4n (for the
DNA-alphabet ={A,G,C,T}) and/or 20n (for the amino acidalphabet) different strings.
For small||andn, it is possible to store all in a hash table.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
5/38
MPI for Developmental Biology, Tubingen
logo
Introduction BLAST Statistical analysis FASTA
Basic idea: Preprocessing
1 A certain consistency of the database entries is assumed.
2 Large sequence databases are split into two parts: oneconsists of the constant part, containing all the sequences
that were used for hash table generation, and of a dynamic
part, containing all new entries.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
http://find/http://goback/8/13/2019 BLAST Slides BLAST ALGORITHM
6/38
MPI for Developmental Biology, Tubingen
logo
Introduction BLAST Statistical analysis FASTA
BLAST
BLAST, the Basic Local Alignment Search Tool (Altschul et al.,
1990), is perhaps the most widely used bioinformatics tool ever
written. It is an alignment heuristic that determines localalignments between aqueryand adatabase. It uses an
approximation of the Smith-Waterman algorithm.
BLAST consists of two components: a search algorithm and
computation of the statistical significance of solutions.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
S S S
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
7/38
MPI for Developmental Biology, Tubingen
logo
Introduction BLAST Statistical analysis FASTA
BLAST terminology
Definition
Letqbe the query anddthe database. Asegmentis simply asubstringsofqord.
Asegment-pair(s, t)(or hit) consists of two segments, one inqand oned, of the same length.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
I t d ti BLAST St ti ti l l i FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
8/38
MPI for Developmental Biology, Tubingen
logo
Introduction BLAST Statistical analysis FASTA
BLAST terminology
Example
V A L L A R
P A M M A R
We think ofsandtas being aligned without gaps andscore
this alignment using a substitution score matrix, e.g. BLOSUM
or PAM in the case of protein sequences.
The alignment score for(s, t)is denoted by(s, t).
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
9/38
MPI for Developmental Biology, Tubingen
logo
Introduction BLAST Statistical analysis FASTA
BLAST terminology
Alocally maximal segment pair (LMSP)is any segment
pair(s, t)whose score cannot be improved by shorteningor extending the segment pair.
Amaximum segment pair (MSP)is any segment pair(s, t)of maximal alignment score(s, t).
Given a cutoff scoreS, a segment pair(s, t)is called ahigh-scoring segment pair (HSP), if it is locally maximal
and(s, t) S.Finally, awordis simply a short substring of fixed length w.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
10/38
MPI for Developmental Biology, Tubingen
logo
Introduction BLAST Statistical analysis FASTA
The BLAST algorithm
Goal: Find all HSPs for a given cut-off score.
Given three parameters, i.e. a word size w, a word similarity
thresholdTand a minimum cut-off score S. Then we are
looking fora segment pair with a score of at least S that
contains at least one word pair of length w with score at least T.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
11/38
MPI for Developmental Biology, Tubingen
logo
Introduction BLAST Statistical analysis FASTA
The BLAST algorithm - Preprocessing
Preprocessing: Of the query sequenceqfirst all words of
lengthware generated. Then a list of all w-mers of lengthw
over the alphabetthat have similarity>Tto some word inthe query sequenceqis generated.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
12/38
MPI for Developmental Biology, Tubingen
logo
Introduction BLAST Statistical analysis FASTA
The BLAST algorithm - Preprocessing
Example
For the query sequence RQCSAGWthe list of words of length
w=2 with a scoreT>8 using the BLOSUM62 matrix are:
word 2 merwith score> 8RQ RQQC QC, RC, EC, NC, DC, KC, MC, SCCS CS,CA,CN,CD,CQ,CE,CG,CK,CT
SA -AG AGGW GW,AW,RW,NW,DW,QW,EW,HW,KW,PW,SW,TW,WW
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
13/38
MPI for Developmental Biology, Tubingen
logo
Introduction BLAST Statistical analysis FASTA
The BLAST algorithm - Searching
1 Localization of the hits:The database sequenced is
scanned for all hitstofw-merssin the list, and the
position of the hit is saved.
2 Detection of hits: First all pairs of hits are searched that
have a distance of at most A (think of them lying on thesame diagonal in the matrix of the SW-algorithm).
3 Extension to HSPs:Each suchseed(s, t)is extended inboth directions until its score(s, t)cannot be enlarged
(LMSP). Then all best extensions are reported that havescore S, these are the HSPs. Originally the extensiondid not include gaps, the modern BLAST2 algorithm allows
insertion of gaps.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/http://goback/8/13/2019 BLAST Slides BLAST ALGORITHM
14/38
MPI for Developmental Biology, Tubingen
logo
y
The BLAST algorithm - Searching
The listLof all words of length wthat have similarity> Tto some word in the query sequence qcan be produced in
O(|L|)time.
These are placed in a keyword tree and then, for each
word in the tree, all exact locations of the word in the
databasedare detected in time linear to the length of d.
As an alternative to storing the words in a tree, a
finite-state machine can be used, which Altschul et al.found to have the faster implementation.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
15/38
MPI for Developmental Biology, Tubingen
logo
y
The BLAST algorithm
Use of seeds of lengthwand the termination of extensions with
fading scores (score dropoff threshold X) are both steps that
speed up the algorithm.Recent improvements (BLAST 2.0):
Two word hits must be found within a window ofA residues.
Explicit treatment of gaps.
Position-specific iterative BLAST (PSI-BLAST).
Christoph Dieterich Max Planck Institute for Developmental BiologyBLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/http://goback/8/13/2019 BLAST Slides BLAST ALGORITHM
16/38
MPI for Developmental Biology, Tubingen
logo
The BLAST algorithm - DNA
ForDNAsequences, BLAST operates as follows:
The list of all words of lengthwin the query sequence q is
generated. In practice,w=12 for DNA.The databasedis scanned for all hits of words in this list.
Blast uses a two-bit encoding for DNA. This saves space
and also search time, as four bases are encoded per byte.
Note that the T parameter dictates the speed and sensitivity ofthe search.
Christoph Dieterich Max Planck Institute for Developmental BiologyBLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
17/38
MPI for Developmental Biology, Tubingen
logo
All flavors of BLAST
BLASTN: compares a DNA query sequence to a DNA
sequence database; qDNAsDNA
BLASTP: compares a protein query sequence to a proteinsequence database; qproteinsprotein
TBLASTN: compares a protein query sequence to a DNA
sequence database (6 frames translation); qproteinst{+1
,
+2,
+3,
1,
2,
3}(DNA)
Christoph Dieterich Max Planck Institute for Developmental BiologyBLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
18/38
MPI for Developmental Biology, Tubingen
logo
All flavors of BLAST - continued
BLASTX: compares a DNA query sequence (6 frames
translation) to a protein sequence database.TBLASTX: compares a DNA query sequence (6 frames
translation) to a DNA sequence database (6 frames
translation).
Christoph Dieterich Max Planck Institute for Developmental BiologyBLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
19/38
MPI for Developmental Biology, Tubingen
logo
Statistical analysis
1 Thenull hypothesis H0 states that the two sequences (s, t)arenothomologous. Then thealternative hypothesis
states that the two sequences are homologous.
2 Choose an experiment to find the pair(s, t): use BLAST to
detect HSPs.3 Compute the probability of the result under the hypothesis
H0, P(Score(s, t)| H0)by generating a probabilitydistribution with random sequences.
4 Fix a rejection level forH0.5 Perform the experiment, compute the probability of
achieving the result or higher and compare with the
rejection level.
Christoph Dieterich Max Planck Institute for Developmental BiologyBLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
20/38
MPI for Developmental Biology, Tubingen
logo
Poisson and extreme value distributions
The Karlin and Altschul theory for local alignments (without
gaps) is based on Poisson and extreme value distributions. Thedetails of that theory are beyond the scope of this lecture, but
basics are sketched in the following.
Christoph Dieterich Max Planck Institute for Developmental BiologyBLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
21/38
MPI for Developmental Biology, Tubingen
logo
Poisson distribution
Definition
The Poisson distribution with parameterv is given by
P(X =x) =vx
x!
ev
Note thatv is theexpected valueas well as the variance. From
the equation we follow that the probability that a variable X will
have a value at least x is
P(X x) =1 x1
i=0
vi
i!ev
Christoph Dieterich Max Planck Institute for Developmental BiologyBLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
22/38
MPI for Developmental Biology, Tubingen
logo
Statistical significance of an HSP
Problem
Given an HSP(s, t)with score(s, t). How significant is thismatch(i.e., local alignment)?
Given the scoring matrixS(a, b), the expected score foraligning a random pair of amino acid is required to be negative:
E=
a,b
papbS(a,b)
8/13/2019 BLAST Slides BLAST ALGORITHM
23/38
MPI for Developmental Biology, Tubingen
logo
Statistical significance of an HSP
HSP scores are characterized by two parameters,K and.The parametersK and depend on the backgroundprobabilities of the symbols and on the employed scoring
matrix. is the unique value for ythat satisfies the equation
a,b
papbeS(a,b)y =1
K and are scaling-factors for the search space and for thescoring scheme, respectively.
Christoph Dieterich Max Planck Institute for Developmental BiologyBLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://goforward/http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
24/38
MPI for Developmental Biology, Tubingen
logo
Statistical significance of an HSP
The number of random HSPs(s, t)with(s, t) Scan bedescribed by a Poisson distribution with parameter
v=KmneS
. The number of HSPs with score Sthat weexpectto see due to chance is then the parameter v, alsocalled theE-value:
E(HSPs with scoreS) =KmneS
Christoph Dieterich Max Planck Institute for Developmental BiologyBLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
25/38
MPI for Developmental Biology, Tubingen
logo
Hence, the probability of finding exactlyxHSPs with a score
Sis given by
P(X=x) =eEEx
x!,
whereE is theE-value forC.The probability of finding at least one HSP by chance is
P(S) =1 P(X =0) =1 eE.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/http://goback/8/13/2019 BLAST Slides BLAST ALGORITHM
26/38
MPI for Developmental Biology, Tubingen
logo
We would like to hide the parametersK and to make iteasier to compare results from different BLAST searches.
For a given HSP(s, t)we transform therawscoreSinto abit-score:
S := S ln Kln 2
.
Such bit-scores can be compared between different BLAST
searches, as the parameters of the given scoring systems are
subsumed in them.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
27/38
MPI for Developmental Biology, Tubingen
logo
We would like to hide the parametersK and to make iteasier to compare results from different BLAST searches.
For a given HSP(s, t)we transform therawscoreSinto abit-score:
S := S ln Kln 2
.
Such bit-scores can be compared between different BLAST
searches, as the parameters of the given scoring systems are
subsumed in them.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
28/38
MPI for Developmental Biology, Tubingen
logo
Significance of a bit-score
To determine the significance of a given bit-score S the only
additional value required is the size of the search space. Since
S= (S ln 2 + ln K)/, we can express theE-value in terms ofthe bit-score as follows:
E=KmneS =Kmne(S ln 2+ln K) =mn2S
.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
29/38
MPI for Developmental Biology, Tubingen
logo
FASTA
FASTA (pronounced fast-ay) is a heuristic for finding significant
matches between a query string qand a database stringd. It is
the older of the two heuristics introduced in the lecture.
FASTAs general strategy is to find the most significant
diagonals in the dot-plot or dynamic programming matrix.
The algorithm consists of four phases: Hashing, 1st
scoring, 2nd scoring, alignment.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/http://goback/8/13/2019 BLAST Slides BLAST ALGORITHM
30/38
MPI for Developmental Biology, Tubingen
logo
FASTA
FASTA (pronounced fast-ay) is a heuristic for finding significant
matches between a query string qand a database stringd. It is
the older of the two heuristics introduced in the lecture.
FASTAs general strategy is to find the most significant
diagonals in the dot-plot or dynamic programming matrix.
The algorithm consists of four phases: Hashing, 1st
scoring, 2nd scoring, alignment.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
31/38
MPI for Developmental Biology, Tubingen
logo
Phase 1: Hashing
The first step of the algorithm is to determine all exact
matches of lengthk(word-size) between the two
sequences, calledhot-spots.
Ahot-spotis given by(i,j), whereiandjare thelocations(i.e., start positions) of an exact match of length k in the
query and database sequence respectively.
Any such hot-spot(i,j)lies on the diagonal(ij)of the
dot-plot or dynamic programming matrix.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
32/38
MPI for Developmental Biology, Tubingen
logo
Phase 1: Hashing
Using this scheme, the main diagonal has number 0 (i=j),whereas diagonals above the main one have positive
numbers (i
8/13/2019 BLAST Slides BLAST ALGORITHM
33/38
MPI for Developmental Biology, Tubingen
logo
Phase 2+3:
Each of the ten diagonal runs with highest score are further
processed. Within each of these scores an optimal local
alignment is computed using the match score substitution
matrix. These alignments are called initial regions.
The score of the best sub-alignment found in this phase is
reported asinit1.
The next step is to combine high scoring sub-alignments
into a single larger alignment, allowing the introduction of
gaps into the alignment. The score of this alignment isreported asinitn
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/http://goback/8/13/2019 BLAST Slides BLAST ALGORITHM
34/38
MPI for Developmental Biology, Tubingen
logo
Phase 4:
Finally, abandedSmith-Waterman dynamic program is used toproduce an optimal local alignment along the best matched
regions. The center of the band is determined by the region
with the scoreinit1, and the band has width 8 for ktup=2. The
score of the resulting alignment is reported asopt.
In this way, FASTA determines a highest scoring region, not all
high scoring alignments between two sequences. Hence,
FASTA may miss instances of repeats or multiple domains
shared by two proteins.
After all sequences of the databases have thus been searched
a statistical significance similar to the BLAST statistics is
computed and reported.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://goforward/http://find/http://goback/8/13/2019 BLAST Slides BLAST ALGORITHM
35/38
MPI for Developmental Biology, Tubingen
logo
FASTA
Example
Two sequences ACTGACandTACCGA: The hot spots fork=2 are
marked as pairs of black bullets, a diagonal run is shaded in dark
grey. An optimal sub-alignment in this case coincides with the
diagonal run. The light grey shaded band of width 3 around thesub-alignment denotes the area in which the optimal local alignment
is searched.
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
36/38
MPI for Developmental Biology, Tubingen
logo
Comparing BLAST and FASTA
BLAST FASTA
Query
Data ase
Query
ata ase
Qu
ery
Data ase
Query
Data ase
(a) (b)
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://goforward/http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
37/38
MPI for Developmental Biology, Tubingen
logo
BLAT- BLAST-Like Alignment Tool
BLAT (Kent et al. 2002), is supposedly more accurate and 500times faster than popular existing tools for mRNA/DNA
alignments and 50 times faster for protein alignments at
sensitivity settings typically used when comparing vertebrate
sequences.
BLATs speed stems from an index of all non-overlapping
K-mers in the genome. The program has several stages: It
uses the index to find regions in the genome that are possibly
homologous to the query sequence. It performs an alignment
between such regions. It stitches together the aligned regions
(often exons) into larger alignments (typically genes). Finally,
BLAT revisits small internal exons and adjusts large gap
boundaries that have canonical splice siteswherefeasible.Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
Introduction BLAST Statistical analysis FASTA
http://find/8/13/2019 BLAST Slides BLAST ALGORITHM
38/38
MPI for Developmental Biology, Tubingen
logo
Job Advertisement
I encourage applications for
Student research projects
Diploma thesis projects
HiWi jobs
http://www2.tuebingen.mpg.de/abt4/plone/BT
Christoph Dieterich Max Planck Institute for Developmental Biology
BLAST and FASTA
http://find/