2014 - BMMB 852: Applied Bioinformatics
Week 6, Lecture 12
István Albert
Bioinformatics Consulting Center
Penn State
blastdbcmd – an unsung hero
• a useful tool with an unfortunate name
• and unfortunate parameters
• and unfortunate documentation
makeblastdb -parse_seqids option
• Use the -parse_seqids flag when invoking makeblastdb; this allows the retrieval of sequences based upon sequence identifiers.
• In that case, each sequence must have a unique identifier, and that identifier must have a specific format. See also section 5.14, Limiting a Search with a List of Identifiers, in the BLAST+ handbook.
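As a minimal sketch of the flag in use, the commands below build a nucleotide database with -parse_seqids from a made-up two-record FASTA file. The file name toy.fa, the database name toydb, and the sequences themselves are placeholders, not course data. The BLAST+ step is skipped if makeblastdb is not on the PATH.

```shell
# Toy FASTA file; identifiers and sequences are invented for illustration.
cat > toy.fa <<'EOF'
>seq1 first toy record
ACGTACGTACGTACGTACGT
>seq2 second toy record
TTTTGGGGCCCCAAAATTTT
EOF

# Build a nucleotide database that also indexes the sequence identifiers.
if command -v makeblastdb >/dev/null 2>&1; then
    makeblastdb -in toy.fa -dbtype nucl -parse_seqids -out toydb
else
    echo "makeblastdb not found; skipping database build"
fi
```

Each identifier (the first word after the > sign) has to be unique, otherwise the database build is expected to fail.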
FASTA sequence ID format values
Accession number prefixes
Some prefixes have additional meaning. Others may only indicate a database or molecule type.
One of the most common questions: how to extract a small sub-sequence from a genome?
There are a number of answers – blastdbcmd could be the simplest but it is not all that well documented
Get the Ebola genome for the 1999 outbreak: BioProject: PRJNA14703
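A hedged sketch of the sub-sequence extraction follows. It uses an invented record named toygenome in place of a real genome; the downloaded genome FASTA file and its accession number would be substituted. The file and database names are placeholders, and the database must have been built with -parse_seqids for -entry to find the record.

```shell
# Made-up sequence standing in for a real genome.
cat > genome.fa <<'EOF'
>toygenome made-up sequence standing in for a real genome
ACGTACGTACGTACGTACGTTTTTGGGGCCCCAAAATTTT
EOF

if command -v makeblastdb >/dev/null 2>&1 && command -v blastdbcmd >/dev/null 2>&1; then
    makeblastdb -in genome.fa -dbtype nucl -parse_seqids -out genome
    # Bases 5 through 14 (1-based, inclusive) of the record named toygenome.
    blastdbcmd -db genome -entry toygenome -range 5-14
    # The same interval, reverse complemented.
    blastdbcmd -db genome -entry toygenome -range 5-14 -strand minus
else
    echo "BLAST+ not found; skipping extraction"
fi
```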
blastdbcmd – format and extract sequences in the blast database
More formatting options
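One way the formatting options might be used is sketched below: -outfmt takes a format string in which %a stands for the accession and %l for the sequence length. The toy file lens.fa and database lensdb are invented for this illustration.

```shell
# Toy records of different lengths; invented for illustration.
cat > lens.fa <<'EOF'
>short1 toy record
ACGTACGT
>long1 toy record
ACGTACGTACGTACGTACGTACGTACGTACGT
EOF

if command -v makeblastdb >/dev/null 2>&1 && command -v blastdbcmd >/dev/null 2>&1; then
    makeblastdb -in lens.fa -dbtype nucl -parse_seqids -out lensdb
    # One line per sequence: accession, then length.
    blastdbcmd -db lensdb -entry all -outfmt '%a %l'
    # The same listing sorted numerically by length (second column).
    blastdbcmd -db lensdb -entry all -outfmt '%a %l' | sort -k2,2n
else
    echo "BLAST+ not found; skipping listing"
fi
```

Piping the listing through standard tools such as sort, head, and tail is usually enough to answer questions about sequence lengths without writing a separate script.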
List of BLAST+ programs
What I think programs should be called
Official Name   Query        Subject      What should it be called
blastn          Nucleotide   Nucleotide   blast NN
blastp          Protein      Protein      blast PP
blastx          Nucleotide   Protein      blast NP
tblastn         Protein      Nucleotide   blast PN
tblastx         Nucleotide   Nucleotide   tblast NN
Running tools in the blast family: blastp
• Think it through: What? Where? How?
protein vs protein
blastx and tblastn
• blastx: nucleotide vs protein
• tblastn: protein vs nucleotide
It is very easy to list the query and database in the wrong order or to use the wrong sequence types. BLAST often will not report the mistake and simply produces no hits.
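The pairing of query type and database type can be sketched as follows. The two toy files and their sequences are invented, the database names protdb and nucldb are placeholders, and such short made-up sequences may well produce few or no hits. The point is only which file type goes with which program.

```shell
# Toy protein and nucleotide files; sequences are invented for illustration.
cat > prot.fa <<'EOF'
>prot1 toy protein
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
EOF
cat > nucl.fa <<'EOF'
>nucl1 toy nucleotide
ATGAAAACCGCGTATATTGCGAAACAGCGTCAGATTAGC
EOF

if command -v makeblastdb >/dev/null 2>&1 && command -v blastp >/dev/null 2>&1; then
    # The -dbtype must match the content of the input file.
    makeblastdb -in prot.fa -dbtype prot -out protdb
    makeblastdb -in nucl.fa -dbtype nucl -out nucldb
    # blastp: protein query against a protein database.
    blastp -query prot.fa -db protdb -outfmt 6
    # blastx: nucleotide query (translated) against a protein database.
    blastx -query nucl.fa -db protdb -outfmt 6
    # tblastn: protein query against a (translated) nucleotide database.
    tblastn -query prot.fa -db nucldb -outfmt 6
else
    echo "BLAST+ not found; skipping searches"
fi
```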
Homework 12
Create a BLAST database from all proteins found in the 2014 Ebola paper (you’ll have at least 891).
• Find the shortest and longest protein among these.
• Compare these proteins to the NP_066243 nucleoprotein identified during the 1999 Ebola outbreak. What are the best and worst matches?