BLAST 06/01/2012
Introduction:
BLAST is an acronym for Basic Local Alignment Search Tool
The BLAST program was developed by Stephen Altschul et al. at NCBI in 1990
Like FASTA, it is a heuristic method
It is one of the most popular programs for sequence
analysis
It enables a researcher to compare a query sequence with a library or database of sequences and to identify library sequences that resemble the query sequence above a certain threshold
The objective is to find high-scoring ungapped
segments among related sequences
Using BLAST
http://www.ncbi.nlm.nih.gov/BLAST
1. Select the BLAST program to use (blastn, blastp, blastx, tblastn, tblastx)
2. Select the database to search (different BLAST programs have different databases)
3. Enter the query sequence
4. Submit the search
Steps in BLAST
First, the query sequence is optionally filtered to remove low-complexity regions (AGAGAG…)
The next step is to create a list of words from the
query sequence.
Each word is typically 3 residues for protein
sequences and 11 residues for DNA sequences.
The list includes every possible word extracted from
the query sequence.
This step is also called seeding.
PROTEIN WORDS
Query: GTQITVEDLFYNIATRRKALKN
Word Size = 3 (word size can be 2 or 3; default = 3)
Words:
GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...
Neighborhood words (of ITV): LTV, MTV, ISV, LSV, etc.
Make a lookup table of words
NUCLEOTIDE WORDS
Query: GTACTGGACATGGACCCTACAGGAA
Word Size = 11 (minimum word size = 7; blastn default = 11; megablast default = 28)
Words:
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
...
Make a lookup table of words
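The word lists above can be reproduced with a few lines of Python. This is a minimal sketch rather than BLAST's actual implementation; the function names extract_words and build_lookup are hypothetical and chosen only for illustration.

```python
def extract_words(query: str, word_size: int) -> list[str]:
    """Return every overlapping word of length word_size, in query order."""
    if word_size <= 0:
        raise ValueError("word_size must be positive")
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def build_lookup(query: str, word_size: int) -> dict[str, list[int]]:
    """Map each word to the list of query positions (0-based) where it starts."""
    table: dict[str, list[int]] = {}
    for position, word in enumerate(extract_words(query, word_size)):
        table.setdefault(word, []).append(position)
    return table


# The two example queries from the slides.
protein_words = extract_words("GTQITVEDLFYNIATRRKALKN", 3)
dna_words = extract_words("GTACTGGACATGGACCCTACAGGAA", 11)
protein_lookup = build_lookup("GTQITVEDLFYNIATRRKALKN", 3)
```

A query of length L yields L - w + 1 words of size w, so the 22-residue protein query gives 20 words and the 25-base nucleotide query gives 15.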
The third step is to search a sequence database for occurrences of these words, in order to identify database sequences containing matching words
Using a substitution score matrix, the query sequence words are evaluated for matches with words in any database sequence, and the (log-odds) scores of the residue pairs are added
A cut-off score (T) is selected to reduce the number of matches to the most significant ones
The above procedure is repeated for each word in the query sequence
The remaining high-scoring words are organised into an efficient search tree and rapidly compared to the database sequences
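How the cut-off score T trims the list of matching words can be sketched as follows. The scoring function here is a toy stand-in, not BLOSUM62 or any real substitution matrix: identical residues score 5, residues in the same hypothetical similarity group score 2, and everything else scores -2. The names toy_score and neighborhood_words are likewise assumptions made for this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical groups of "similar" residues, for illustration only.
SIMILAR_GROUPS = [set("ILMV"), set("ST")]


def toy_score(a: str, b: str) -> int:
    """Toy substitution score (an assumption, not a real matrix)."""
    if a == b:
        return 5
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 2
    return -2


def neighborhood_words(word: str, threshold: int) -> dict[str, int]:
    """Return all words whose summed score against `word` is >= threshold."""
    neighbors = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        # Add the per-position scores, as described in the text.
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            neighbors["".join(candidate)] = score
    return neighbors


neighbors = neighborhood_words("ITV", threshold=9)
```

Under this toy scoring, ITV itself, LTV, MTV, ISV, and LSV all pass the threshold, while a word such as ATV does not. Raising T shrinks the list and speeds up the search at the cost of sensitivity.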
If a good match is found, an alignment is extended from the match area in both directions for as long as the score continues to grow
The extension continues until the score of the alignment drops below a threshold due to mismatches (the drop threshold is twenty-two for proteins and twenty for DNA)
The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP)
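The ungapped extension can be sketched as below. This is an illustrative reading of the step, not BLAST's source code: the extension in each direction stops once the running score falls more than x_drop below the best score seen so far, and the segment is trimmed back to that best point. The match and mismatch scores (+5 and -4) and the function names are assumptions made for this example.

```python
def _extend_one_way(query, subject, i, j, direction, x_drop, score_pair):
    """Walk from (i, j) in `direction` (+1 or -1).

    Returns (best_gain, best_length): the largest score gain found and the
    number of positions needed to reach it.
    """
    running = best = best_len = steps = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        running += score_pair(query[i], subject[j])
        steps += 1
        if running > best:
            best, best_len = running, steps
        elif best - running > x_drop:
            break  # score has dropped too far below the best seen
        i += direction
        j += direction
    return best, best_len


def extend_seed(query, subject, q_start, s_start, word_size, x_drop,
                match=5, mismatch=-4):
    """Extend a word hit without gaps; return (q_start, s_start, length, score)."""
    def score_pair(a, b):
        return match if a == b else mismatch

    seed_score = sum(score_pair(query[q_start + k], subject[s_start + k])
                     for k in range(word_size))
    right_gain, right_len = _extend_one_way(
        query, subject, q_start + word_size, s_start + word_size, +1,
        x_drop, score_pair)
    left_gain, left_len = _extend_one_way(
        query, subject, q_start - 1, s_start - 1, -1, x_drop, score_pair)
    return (q_start - left_len, s_start - left_len,
            left_len + word_size + right_len,
            seed_score + left_gain + right_gain)


# Word hit "GTAC" at position 4 in both sequences, DNA drop threshold of 20.
hsp = extend_seed("CCACGTACGTGG", "TTACGTACGTAA", 4, 4, 4, x_drop=20)
```

In this example the four-base hit grows by two matching bases on each side, so the HSP covers ACGTACGT in both sequences; the mismatching flanks are not included because they only lower the score.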
In the original version of BLAST, the highest-scoring HSPs are presented as the final report
A later improvement in the implementation of BLAST is the ability to provide gapped alignments
In gapped BLAST, the highest-scoring segment is chosen to be extended in both directions using dynamic programming, where gaps may be introduced
The extension continues while the alignment score is above a certain threshold; otherwise it is terminated
BLAST Output
1. an introduction that tells where the search occurred and what database and query were compared
2. a list of the sequences in the database containing segment pairs whose scores were least likely to occur by chance
3. alignments of the high-scoring segment pairs showing identical and similar residues
4. a complete list of the parameter settings used for the search.
BLAST Variants
Program   Query sequence             Database sequence
BLASTP    protein                    protein
BLASTN    nucleic acid               nucleic acid
BLASTX    translated nucleic acid    protein
TBLASTN   protein                    translated nucleic acid
TBLASTX   translated nucleic acid    translated nucleic acid
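The table can be read as a lookup from query type and database type to program name. A minimal sketch follows; the function name choose_program and its translated flag are hypothetical and not part of any BLAST interface.

```python
def choose_program(query_type: str, database_type: str,
                   translated: bool = False) -> str:
    """Pick a BLAST program from the table of variants.

    query_type and database_type are "protein" or "nucleotide".
    `translated` only matters for nucleotide-vs-nucleotide searches: when
    True, both query and database are compared as translations (tblastx).
    """
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # query is translated
        ("protein", "nucleotide"): "tblastn",  # database is translated
    }
    key = (query_type, database_type)
    if key == ("nucleotide", "nucleotide"):
        return "tblastx" if translated else "blastn"
    if key not in table:
        raise ValueError(f"unsupported combination: {key}")
    return table[key]


program = choose_program("nucleotide", "protein")
```

For example, a nucleotide query searched against a protein database selects blastx, which translates the query before comparison.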
Databases available on BLAST Web server
Database - Description
A. Peptide sequence databases
1. nr - translations of GenBank DNA sequences with redundancies removed, PDB, SwissProt, PIR, and PRF
2. month - new or revised entries or updates to nr in the previous 30 days
3. swissprot - latest release of the SwissProt protein sequence database
4. Drosophila genome - provided by Celera and Berkeley Drosophila genome project
5. yeast - yeast (Saccharomyces cerevisiae) genomic sequences
6. E. coli - E. coli genomic sequences
7. pdb - sequences of proteins of known three-dimensional structure from the Brookhaven Protein Data Bank
8. yeast - yeast (S. cerevisiae) protein sequences
9. E. coli - E. coli genomic coding sequence translations
10. kabat [kabatpro] - Kabat’s database of sequences of immunological interest
11. alu - translations of select Alu repeats from REPBASE, a database of sequence repeats
B. Nucleotide sequence databases
1. nr - GenBank, EMBL, DDBJ, and PDB sequences with redundancies removed (EST, STS, GSS, and HTGS sequences excluded)
2. month - new or revised entries or updates to nr in the previous 30 days
3. dbest - EST sequences from GenBank, EMBL, and DDBJ with redundancies removed
4. dbsts - STS sequences from GenBank, EMBL, and DDBJ with redundancies removed
5. htgs - high-throughput genomic sequences
6. kabat [kabatnuc] - Kabat’s database of sequences of immunological interest
7. vector - vector subset of GenBank
8. mito - database of mitochondrial sequences
9. alu - select Alu repeats from REPBASE, a database of sequence repeats; suitable for masking Alu repeats from query sequences
10. epd - eukaryotic promoter database
11. gss - genome survey sequences, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences
Difference between BLAST and FASTA
BLAST                                                          FASTA
Uses a substitution matrix to find matching words              Uses the hashing procedure
Word size: protein = 3; DNA = 11                               K-tuple: protein = 2; DNA = 4-6
Faster than FASTA                                              Slower than BLAST
Higher specificity than FASTA, due to low-complexity masking   Lower specificity
E-value (expectation value)
An important statistical indicator in sequence alignment
It indicates how many alignments with a score at least as good as the one observed are expected to arise by random chance in a database search
The E-value therefore provides information about the likelihood that a given sequence match is purely due to chance
The lower the E-value, the less likely the database match is a result of random chance, and therefore the more significant the match is
Formula
E-value is determined by the equation
E = m × n × P
Where
m is the total number of residues in a database
n is the number of residues in the query sequence
and
P is the probability that an HSP alignment is a result
of random chance.
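A short worked example of the formula, using made-up numbers chosen only to show how database size affects the result. The values of m, n, and P below are assumptions, not figures from any real search.

```python
def e_value(m: int, n: int, p: float) -> float:
    """E = m x n x P, with m = residues in the database,
    n = residues in the query, P = probability of a chance HSP."""
    return m * n * p


# Same query (n = 100) and same P, searched against two database sizes.
small_db = e_value(m=1_000_000, n=100, p=1e-10)
large_db = e_value(m=100_000_000, n=100, p=1e-10)

print(f"small database: E = {small_db:.2g}")
print(f"large database: E = {large_db:.2g}")
```

Because E grows in proportion to m, the same alignment is less significant in a larger database: here a hundredfold larger database gives a hundredfold larger E-value.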
Bit Score
A bit score is another prominent statistical indicator
used in addition to the E value in a BLAST output.
The bit score measures sequence similarity
independent of query sequence length and
database size and is normalized based on the raw
pairwise alignment score.
Formula
The bit score (S) is determined by the following formula:
S = (λ × s − ln K) / ln 2
Where
λ is the Gumbel distribution constant,
s is the raw alignment score, and
K is a constant associated with the scoring matrix used.
Thus, the bit score (S) is linearly related to the raw
alignment score (s).
Hence, the higher the bit score, the more significant the match is.
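The formula can be checked numerically with a small sketch. The values λ = 0.318 and K = 0.134 used below are illustrative assumptions of the order commonly quoted for ungapped protein scoring; the actual constants depend on the scoring matrix and gap penalties, and BLAST reports them in the parameter section of its output.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """S = (lambda * s - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed constants for illustration only.
LAMBDA = 0.318
K = 0.134

scores = {s: bit_score(s, LAMBDA, K) for s in (50, 100, 150)}
for raw, bits in scores.items():
    print(f"raw score {raw:>3} -> {bits:.1f} bits")
```

Equal steps in the raw score give equal steps in the bit score, which is the linear relationship stated above.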