Using BLAST options to refine a search


1) Address the question “how many of the Phytophthora/tomato interaction ESTs are tomato?”

A: It depends on the conditions. With E-value < 1 x 10^-8, match length > 200 bp, identities > 95%, and match overlap > 50%: ~2,100 (54%) show a match, to 1,622 unique ESTs.

2) Can the question be addressed more easily by refining the BLAST search?

3) Other BLAST options.

$ ./blastall.exe

Run with no arguments, blastall prints its full option list. Two of those options:

-e Expectation value (E) [Real], default = 10.0

-m alignment view options:
0 = pairwise
1 = query-anchored showing identities
...
7 = XML Blast output
8 = tabular
9 = tabular with comment lines

Run nucleotide BLAST (blastn)

$ /cygdrive/c/Blast/bin/blastall -p blastn -d ./TA496Seq1.txt -i ./tomatosequence.txt -o OUTE2.txt -e 0.01

$ grep -c "Strand =" OUTE2.txt

3 (with the default E-value of 10 this was 82)

$ /cygdrive/c/Blast/bin/blastall -p blastn -d ./TA496Seq1.txt -i ./PhytophSeq1.txt -o PhytOUTE1.txt -e 1e-8

$ grep -c "Strand =" PhytOUTE1.txt

108,787 (with the default E-value of 10 this was 292,568)
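The counts above can be collected for several output files in one pass. This is a minimal sketch, assuming the files are in the default pairwise format, where each alignment carries one "Strand =" line; the function name hsp_count is made up for illustration.

```shell
# hsp_count FILE...: for each pairwise-format blastn output file, print the
# file name and the number of lines containing "Strand =" (one per alignment).
hsp_count() {
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(grep -c 'Strand =' "$f")"
    done
}
```

For example, "hsp_count OUTE2.txt PhytOUTE1.txt" prints one line per file, so runs made with different -e values can be compared side by side.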

NOTE: a BLAST run that compares 3,921 query sequences against a database of 116,711 sequences takes some time (about 15 minutes on my laptop).

Searching..................................................done

                                                                    Score    E
Sequences producing significant alignments:                         (bits) Value

gi|9292199|gb|BE354223.1|BE354223 EST355566 tomato flower buds, ...  1237   0.0
gi|16248018|gb|BI933546.1|BI933546 EST553435 tomato flower, anth...  1017   0.0
gi|4384985|gb|AI489614.1|AI489614 EST247953 tomato ovary, TAMU S...   908   0.0

>gi|9292199|gb|BE354223.1|BE354223 EST355566 tomato flower buds, anthesis, Cornell University Solanum lycopersicum cDNA clone cTOD9L3, mRNA sequence
          Length = 632

 Score = 1237 bits (624), Expect = 0.0
 Identities = 630/632 (99%)
 Strand = Plus / Plus

Query: 1504 gactggctagaatggctgcaatcatggcatctacttacaaggcttatcttggcgtcggac 1563
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1    gactggctagaatggctgcaatcatggcatctacttacaaggcttatcttggcgtcggac 60

Query: 1564 ttggtccactatcatttttgacgcagtatagaataccacatcctggaagagttggtggaa 1623
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 61   ttggtccactatcatttttgacgcagtatagaataccacatcctggaagagttggtggaa 120

Run nucleotide BLAST (blastn)

$ /cygdrive/c/Blast/bin/blastall -p blastn -d ./TA496Seq1.txt -i ./tomatosequence.txt -o OUTE2.txt -m 8

-m = alignment view options; 8 = tabular format

Slycopersicum.sequence gi|9292199|gb|BE354223.1|BE354223 99.68 632 2 0 1504 2135 1 632 0.0 1237

Slycopersicum.sequence gi|16248018|gb|BI933546.1|BI933546 99.62 521 2 0 1668 2188 1 521 0.0 1017

Slycopersicum.sequence gi|4384985|gb|AI489614.1|AI489614 99.57 466 2 0 1818 2283 1 466 0.0 908

Columns, left to right: query ID, subject ID, % identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score.
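Tabular output makes it possible to apply the thresholds from question 1 after the search, without rerunning BLAST. The sketch below is one hedged way to do that with awk, assuming the 12-column -m 8 layout shown in the rows above. The function names filter_hits and count_unique are hypothetical. The "% match overlap > 50%" criterion is left out because it needs the query length, which the tabular rows do not carry.

```shell
# filter_hits FILE: keep tabular (-m 8) rows with E-value < 1e-8,
# alignment length > 200 and percent identity > 95.
# Columns used: 3 = % identity, 4 = alignment length, 11 = E-value.
filter_hits() {
    awk '($11 + 0) < 1e-8 && ($4 + 0) > 200 && ($3 + 0) > 95' "$1"
}

# count_unique FILE COLUMN: number of distinct values in one column
# (column 1 = query ID, column 2 = subject ID).
count_unique() {
    awk -v col="$2" '{ seen[$col] = 1 }
        END { n = 0; for (k in seen) n++; print n }' "$1"
}
```

With a tabular result file (say PhytOUT8.txt, an assumed name), "filter_hits PhytOUT8.txt > filtered.txt" followed by "count_unique filtered.txt 1" and "count_unique filtered.txt 2" gives the number of query ESTs with a qualifying match and the number of distinct database ESTs they match.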

tblastn

Running BLAST with a protein or peptide query against nucleotide data (tblastn translates the nucleotide database in all six reading frames)

$ /cygdrive/c/Blast/bin/blastall -p tblastn -d ./TA496Seq1.txt -i ./SB7-15-13.txt -o PEPTIDEOUT.txt (-e #)

Try:

$ /cygdrive/c/Blast/bin/blastall -p tblastn -d ./TA496Seq1.txt -i ./SB7-15-13-Pep4A.txt -o PEPTIDEOUT.txt

Then try a relaxed E-value cutoff (hits to short peptides often have E-values above the default cutoff of 10):

$ /cygdrive/c/Blast/bin/blastall -p tblastn -d ./TA496Seq1.txt -i ./SB7-15-13-Pep4A.txt -o PEPTIDEOUT.txt -e 50

From Xiaodong

Other useful BLAST options

(1) “-b integer”: number of database sequences to show alignments for. The default value is 250. A smaller number reduces the size of the output file and makes BLAST searches faster.

(2) “-v integer”: number of database sequences to show one-line descriptions for. The default value is 500. A smaller number for “-v” has a similar effect to a smaller “-b”.

(3) “-a integer”: number of processors to use. Most laptops have only one processor, but on a Linux workstation with multiple processors, using all of them will drastically reduce the execution time.

(4) “-m 7”: gives results in XML format, which is useful if the BLAST output will be imported into Blast2GO for GO assignment and metabolic pathway prediction.

(5) “-l string”: restrict the database search to a list of GIs (GenInfo Identifiers), the specific numeric identifier given to each sequence in GenBank. The string is the name of a file containing the GIs of the subset of sequences you want to search against. Use this option to search subsets of a large database without creating multiple databases. The advantage is that the E-values from all the subset searches are comparable. If the subsets were separate databases, their different sizes would make the E-values incomparable between searches.
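A GI list file for option (5) can be built from the FASTA headers of the database. The sketch below assumes headers of the form ">gi|NUMBER|gb|..." as seen in the output earlier, and assumes the list file holds one GI per line. The function name extract_gis, the keyword filter, and the file name flower_gis.txt are illustrative, not part of BLAST.

```shell
# extract_gis FASTA [KEYWORD]: print the GI number from each ">gi|NUMBER|..."
# header line, one per line. If KEYWORD is given, only headers containing it
# are used; with no KEYWORD, every header is used.
extract_gis() {
    grep '^>gi|' "$1" | grep -- "${2:-}" | awk -F'|' '{ print $2 }'
}
```

A possible use, assuming ./TA496Seq1.txt is the FASTA file the database was built from: run "extract_gis ./TA496Seq1.txt 'tomato flower' > flower_gis.txt", then add "-l flower_gis.txt" to the blastall command line. The options from (1) to (3) can be combined on the same command line, for example "-b 10 -v 10 -a 2".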
