Database indexing
Laurent Falquet, EPFL March, 2005
Swiss Institute of Bioinformatics, Swiss EMBnet node
Overview
Data access concept: sequential, direct
Indexing: EMBOSS, fetch, other
BLAST: why indexing? formatdb, parsing output
Excel import/export: tab delimited, comma delimited
Data access: sequential vs direct
Sequential access: access times vary from very short to very long
Direct access: very small variations in access time
[Figure: disk with track, sector and head]
Similar concept for databases
Flat files = sequential
Indexing = simulated direct
>seq1
cgatgtcatgtg
>seq2
cgatcgtagctgtagctgtag
>seq3
catgtgcatgcgacgt
ID      Position (byte)   Length (byte)
SEQ1    0                 19
SEQ2    19                28
SEQ3    47                23
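The index table above can be reproduced with a few lines of code. The following is a minimal sketch in Python, not the way any of the tools in this course build their indices: it scans the flat file once, records the byte position and byte length of each record, and then uses `seek` to jump directly to one entry. The function names `build_index` and `fetch` are illustrative only.

```python
import io


def build_index(handle):
    """Map each FASTA ID to the (byte position, byte length) of its record."""
    index = {}
    current_id, start, pos = None, 0, 0
    for line in handle:  # binary lines, so len(line) counts bytes
        if line.startswith(b">"):
            if current_id is not None:
                index[current_id] = (start, pos - start)
            current_id = line[1:].split()[0].decode().upper()
            start = pos
        pos += len(line)
    if current_id is not None:  # close the last record
        index[current_id] = (start, pos - start)
    return index


def fetch(handle, index, seq_id):
    """Simulated direct access: jump to the record instead of scanning the file."""
    position, length = index[seq_id.upper()]
    handle.seek(position)
    return handle.read(length).decode()


# The three-sequence flat file from the slide, held in memory for the example.
flat_file = io.BytesIO(
    b">seq1\ncgatgtcatgtg\n>seq2\ncgatcgtagctgtagctgtag\n>seq3\ncatgtgcatgcgacgt\n"
)
index = build_index(flat_file)
print(index)  # {'SEQ1': (0, 19), 'SEQ2': (19, 28), 'SEQ3': (47, 23)}
print(fetch(flat_file, index, "seq2"), end="")
```

Each length counts the header line, the sequence line and both newline characters, which is why SEQ1 occupies 19 bytes for a 12-letter sequence.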
Tools
EMBOSS
  dbiflat, dbifasta, dbiblast
  seqret, seqretsplit, entret
Other examples
  SRS (Icarus language): http://srs.ebi.ac.uk http://www.lionbioscience.com/
  indexer & fetch (warning: local SIB tool)
  Relational (MySQL, Oracle…)
EMBOSS how to index?
Where is your file?
What is the format?
Where should the indices be?
Where is the emboss.default file? (.embossrc)
Other EMBOSS tools: textsearch, whichdb
EMBOSS example
Input file and directory:
  ~/embossidx/ECOLI.dat
  cd embossidx
Index creation:
  dbiflat -idformat swiss -dbname ECOLI.dat -directory . -release 1.0 -date 12/02/05 -fields AC
Generates 4 files: acnum.hit acnum.trg division.lkp entrynam.idx
Don’t forget to modify ~/.embossrc
Example of queries:
  seqret ecoli:thio_ecoli
  seqret ecoli:P00274
  entret ecoli:thio_ecoli
  and even: seqret 'ecoli:*_ECOLI'
.embossrc:
set emboss_filter 1
# Ecoli
DB ecoli [
type: P
comment: "E.coli proteome"
method: emblcd
format: swiss
dir: "~/embossidx"
file: "ECOLI.dat"
release: "1.0"
indexdir: "~/embossidx"
]
Indexer & fetch
Warning: this is a local SIB tool!
Input file and directory:
  ~/embossidx/ECOLI.dat
  cd embossidx
Index creation:
  indexer -h '^ID' -t '^//' -i -p '^ID\s+(\S+)' ECOLI.dat ecoli.idx
Generates 1 file: ecoli.idx
Don’t forget to modify the config file: fetch.conf
Example of queries:
  fetch -c fetch.conf ecoli:thio_ecoli
  fetch -c fetch.conf -f 'ecoli:thio_ecoli[20..50]'
fetch.conf:
#dbkey format indexfile datafile
ecoli sp ~/embossidx/ecoli.idx ~/embossidx/ECOLI.dat
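To make the role of the config file concrete, here is a small Python sketch of how a query such as `ecoli:thio_ecoli[20..50]` could be resolved against `fetch.conf`. It is not the SIB tool's code: the column meanings are taken from the `#dbkey format indexfile datafile` comment line, and reading `[20..50]` as a start..end range is an assumption. `parse_conf`, `parse_query` and `DbEntry` are hypothetical names.

```python
import re
from typing import NamedTuple


class DbEntry(NamedTuple):
    format: str
    indexfile: str
    datafile: str


def parse_conf(text):
    """Read 'dbkey format indexfile datafile' lines, skipping comments and blanks."""
    databases = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        dbkey, fmt, indexfile, datafile = line.split()
        databases[dbkey] = DbEntry(fmt, indexfile, datafile)
    return databases


# dbkey:entry, optionally followed by [start..end] (assumed to be a range)
QUERY = re.compile(
    r"^(?P<db>[^:]+):(?P<id>[^\[]+)(?:\[(?P<start>\d+)\.\.(?P<end>\d+)\])?$"
)


def parse_query(query):
    """Split a query into (dbkey, entry id, start, end); start/end may be None."""
    match = QUERY.match(query)
    if match is None:
        raise ValueError(f"not a dbkey:entry query: {query!r}")
    start = int(match.group("start")) if match.group("start") else None
    end = int(match.group("end")) if match.group("end") else None
    return match.group("db"), match.group("id"), start, end


conf = parse_conf(
    "#dbkey format indexfile datafile\n"
    "ecoli sp ~/embossidx/ecoli.idx ~/embossidx/ECOLI.dat\n"
)
db, entry, start, end = parse_query("ecoli:thio_ecoli[20..50]")
print(conf[db].indexfile, entry, start, end)
```

The dbkey selects which index file and data file to open; the entry name is then looked up in that index.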
BLAST
Maintained at NCBI
Source distributed freely with several accessory tools:
  ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
Requires compilation to install on your local computer
blastall contains: blastp, blastn, blastx, tblastn, tblastx
Other tools: blastpgp, megablast, formatdb
Available Blast programs
Program   Query                    Database
blastp    protein                  protein
blastn    nucleotide               nucleotide
blastx    nucleotide -> protein    protein
tblastn   protein                  nucleotide -> protein
tblastx   nucleotide -> protein    nucleotide -> protein
(-> protein = translated before comparison)
What makes BLAST so fast?
Indexing all words of 3 aa or 11 bp in the sequence database
Listing all words that match the query with a score > T
Searching the indexed database for all perfect matches to these words
Trying to align matches that are on the same diagonal
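The indexing and lookup steps can be sketched in Python as follows. This is a simplified illustration, not NCBI's implementation: it only looks up the query's own words (the score > T step is left out), and the two database sequences are made up for the example. The diagonal of a match is the database position minus the query position.

```python
from collections import defaultdict


def index_words(database, k=3):
    """Map every k-letter word to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index


def find_hits(query, index, k=3):
    """Return (seq_id, query_pos, db_pos, diagonal) for every exact word match."""
    hits = []
    for q in range(len(query) - k + 1):
        for seq_id, d in index.get(query[q:q + k], []):
            hits.append((seq_id, q, d, d - q))
    return hits


# Toy protein database (made-up sequences).
database = {"s1": "MKRSLTVF", "s2": "AAACTRSL"}
word_index = index_words(database)

for hit in find_hits("RSLTVF", word_index):
    print(hit)
# ('s1', 0, 2, 2)
# ('s2', 0, 5, 5)
# ('s1', 1, 3, 2)
# ('s1', 2, 4, 2)
# ('s1', 3, 5, 2)
```

The four hits in s1 share diagonal 2, which is the signal that an alignment is worth attempting there; the single hit in s2 stands alone.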
Indexing for Blast (1)
[Figure: query word REL compared with the list of all possible words with 3 amino acid residues (8000: AAA, AAC, AAD, ..., YYY), giving the list of words matching the query with a score > T (ACT, RSL, TVF, ...); words such as LKP have a score < T.]
A substitution matrix is used to compute the word scores
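A short Python sketch of this step: enumerate all 8000 three-residue words and keep those scoring above T against a query word. The scoring function below is a toy stand-in invented for the example (5 for identity, 1 within a crude residue group, -2 otherwise); a real search would use a substitution matrix such as BLOSUM62, and the threshold value 10 is likewise arbitrary.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residues, so 20**3 = 8000 words

# Toy residue groups, an assumption made for this example only.
GROUPS = ["ILVM", "KR", "DE", "ST", "NQ", "FYW"]


def toy_score(a, b):
    """Stand-in for a substitution matrix lookup."""
    if a == b:
        return 5
    if any(a in group and b in group for group in GROUPS):
        return 1
    return -2


def neighborhood(word, threshold):
    """List (candidate, score) for all words scoring > threshold against `word`."""
    result = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score > threshold:
            result.append(("".join(candidate), score))
    return sorted(result, key=lambda pair: (-pair[1], pair[0]))


print(neighborhood("RSL", 10))
# [('RSL', 15), ('KSL', 11), ('RSI', 11), ('RSM', 11), ('RSV', 11), ('RTL', 11)]
```

Raising T shrinks the list towards the exact word only; lowering it admits more distant words and makes the search more sensitive but slower.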
Indexing for Blast (2)
[Figure: the list of words matching the query with a score > T (ACT, RSL, TVF, ...) is used to search the database sequences for exact matches, giving the list of sequences containing words similar to the query (hits).]
Indexing for Blast (3)
[Figure: two dot plots, query vs database sequence, with the distance A marked]
Ungapped extension if 2 "hits" are on the same diagonal but at a distance less than A
Extension using dynamic programming, limited to a restricted region through a score drop-off threshold
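The two-hit condition can be written down in a few lines of Python. This is only a sketch of the rule as stated on the slide; the hit coordinates and the value of A are invented, and `two_hit_seeds` is a hypothetical name. Hits are (query position, database position) pairs, and two hits are on the same diagonal when database position minus query position is equal.

```python
from collections import defaultdict


def two_hit_seeds(hits, max_distance):
    """Return (diagonal, first, second) for neighbouring hits on one diagonal
    that are less than max_distance apart (positions are query positions)."""
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[db_pos - query_pos].append(query_pos)

    seeds = []
    for diagonal, positions in sorted(by_diagonal.items()):
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if 0 < second - first < max_distance:
                seeds.append((diagonal, first, second))
    return seeds


# Made-up hits: three on diagonal 2, one on diagonal 37.
hits = [(0, 2), (10, 12), (3, 40), (50, 52)]
print(two_hit_seeds(hits, max_distance=40))  # [(2, 0, 10)]
```

The pair at query positions 10 and 50 is rejected because its distance is exactly 40, not less than A, and the lone hit on diagonal 37 never triggers an extension.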
BLAST indexing with formatdb
formatdb: mydb.seq must contain sequences in FASTA format
  formatdb -i mydb.seq -p T -n mydb
Generates 3 files: mydb.psq mydb.pin mydb.phr
Then start a Blast:
  blastall -p blastp -d mydb -i myseq (-optional parameters)
Blast local vs remote
blastall: executed locally; slow; no need to transfer the db
blastall.remote: executed remotely; fast; requires special privileges and db transfer
Multiple Blasts?
1 seq vs db seq: 1 FASTA seq as input
db seq vs db seq: several single FASTA seq files as input, or 1 multiple FASTA seq file as input
Possibility to export results as XML
Use Perl to automate the queries and parse the output
Parsing Blast output
BLASTP 2.2.10 [Oct-19-2004]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= ACCA_BACSU O34847 Acetyl-coenzyme A carboxylase carboxyltransferase subunit alpha (EC 6.4.1.2). (325 letters)
Database: ecoli_blast 4339 sequences; 1,373,039 total letters
Searching.........done
                                                                   Score    E
Sequences producing significant alignments:                        (bits) Value
ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transfe...   266   1e-72
Parsing Blast output (2)
>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha (EC 6.4.1.2).
Length = 318
Score = 266 bits (681), Expect = 1e-72
Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)

Query: 5   LEFEKPVIELQTKIAELKKFTQDS---DMDLSAEIERLEDRLAKLQDDIYKNLKPWDRVQ 61
           L+FE+P+ EL+ KI L ++ D+++ E+ RL ++ +L I+ +L W Q
Sbjct: 5   LDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQ 64

Query: 62  IARLADRPTTLDYIEHLFTDFFECHGDRAYGDDEAIVGGIAKFHGLPVTVIGHQRGKDTK 121
           +AR RP TLDY+ F +F E GDRAY DD+AIVGGIA+ G PV +IGHQ+G++TK
Sbjct: 65  LARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETK 124

Query: 122 ENLVRNFGMPHPEGYRKALRLMKQADKFNRPIICFIDTKGAYPGRAAEERGQSEAIAKNL 181
           E + RNFGMP PEGYRKALRLM+ A++F PII FIDT GAYPG AEERGQSEAIA+NL
Sbjct: 125 EKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNL 184

Query: 182 FEMAGLRVPXXXXXXXXXXXXXXXXXXXXXXXHMLENSTYSVISPEGAAALLWKDSSLAK 241
           EM+ L VP +ML+ STYSVISPEG A++LWK + A
Sbjct: 185 REMSRLGVPVVCTVIGEGGSGGALAIGVGDKVNMLQYSTYSVISPEGCASILWKSADKAP 244

Query: 242 KAAETMKITAPDLKELGIIDHMIKEVKGGAHHDVKLQASYMDXXXXXXXXXXXXXXXXXX 301
           AAE M I AP LKEL +ID +I E GGAH + + A+ +
Sbjct: 245 LAAEAMGIIAPRLKELKLIDSIIPEPLGGAHRNPEAMAASLKAQLLADLADLDVLSTEDL 304

Query: 302 VQQRYEKYKAIG 313
           +RY++ + G
Sbjct: 305 KNRRYQRLMSYG 316
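The statistics line of each alignment is regular enough to extract with a regular expression. The Python sketch below is one possible way to do it for the text format shown above, with `parse_hsp_stats` as a hypothetical helper name; it is not a general BLAST parser, and other BLAST versions may lay the line out differently.

```python
import re

# Matches "Score = 266 bits (681), Expect = 1e-72 Identities = 143/312 ..."
# whether the Identities part is on the same line or on the next one.
STATS = re.compile(
    r"Score\s*=\s*(?P<bits>[\d.]+)\s+bits\s+\((?P<raw>\d+)\),\s*"
    r"Expect\s*=\s*(?P<evalue>\S+)\s+"
    r"Identities\s*=\s*(?P<ident>\d+)/(?P<length>\d+)"
)


def parse_hsp_stats(text):
    """Return the score, E-value and identity counts of the first alignment, or None."""
    match = STATS.search(text)
    if match is None:
        return None
    evalue = match.group("evalue").rstrip(",")
    if evalue.startswith("e"):  # some reports print e.g. "e-100" for 1e-100
        evalue = "1" + evalue
    return {
        "bits": float(match.group("bits")),
        "raw_score": int(match.group("raw")),
        "evalue": float(evalue),
        "identities": int(match.group("ident")),
        "align_length": int(match.group("length")),
    }


line = (
    "Score = 266 bits (681), Expect = 1e-72 Identities = 143/312 (45%), "
    "Positives = 188/312 (60%), Gaps = 3/312 (0%)"
)
stats = parse_hsp_stats(line)
print(stats["bits"], stats["evalue"], stats["identities"], stats["align_length"])
# 266.0 1e-72 143 312
```

Hand-written patterns like this break easily when the report layout changes, which is the main argument for a parsing library or for the XML export.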
Parsing Blast output (3)
With BioPerl:
#!/usr/local/bin/perl
use strict;
use warnings;
use Bio::SearchIO;

# The BLAST report file is given as the first command-line argument
my $blast_report = Bio::SearchIO->new(-format => 'blast', -file => $ARGV[0]);

print "Query name:\tQuery description:\tHit name:\tHit description:\tE-value\tScore\n";
while (my $result = $blast_report->next_result) {
    print $result->query_name(), "\t", $result->query_description(), "\n";
    while (my $hit = $result->next_hit()) {
        # One line per hit, followed by the E-value and score of each HSP
        print "\t\t", $hit->name(), "\t", $hit->description();
        while (my $hsp = $hit->next_hsp()) {
            print "\t", $hsp->evalue(), "\t", $hsp->score();
        }
        print "\n";
    }
}
exit 0;
MS-Excel import/export
Excel can import: tab delimited, comma delimited
Excel can export: tab delimited, space delimited

AC/ID        desc                           score   e-value
THIO_ECOLI   thioredoxin Escherichia coli   234     2.1e-5
THIO_HUMAN   thioredoxin Homo sapiens       120     0.001
MS-Excel import/export
Tab delimited file:
  \t delimits the columns
  \n delimits the lines
  Optional first line contains the column titles
Example:
AC/ID\tdesc\tscore\te-value\n
THIO_ECOLI\tthioredoxin Escherichia coli\t234\t2.1e-5\n
THIO_HUMAN\tthioredoxin Homo sapiens\t120\t0.001\n
MS-Excel import/export
Comma delimited file:
  , delimits the columns, each value is surrounded by ' '
  \n delimits the lines
  Optional first line contains the column titles
Example:
'AC/ID','desc','score','e-value'\n
'THIO_ECOLI','thioredoxin Escherichia coli','234','2.1e-5'\n
'THIO_HUMAN','thioredoxin Homo sapiens','120','0.001'\n
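Both delimited formats can be written and read back with Python's standard `csv` module, as sketched below using the table from the slides. The helper names `to_delimited` and `from_delimited` are illustrative. The single quote around each value follows the example above; Excel's own comma-delimited export normally uses double quotes, so the quote character may need to be set explicitly when importing.

```python
import csv
import io

ROWS = [
    ["AC/ID", "desc", "score", "e-value"],
    ["THIO_ECOLI", "thioredoxin Escherichia coli", "234", "2.1e-5"],
    ["THIO_HUMAN", "thioredoxin Homo sapiens", "120", "0.001"],
]


def to_delimited(rows, delimiter, quotechar=None):
    """Write rows as text; quote every value only when a quotechar is given."""
    buffer = io.StringIO()
    if quotechar is None:
        writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    else:
        writer = csv.writer(buffer, delimiter=delimiter, quotechar=quotechar,
                            quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def from_delimited(text, delimiter, quotechar='"'):
    """Read delimited text back into a list of rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    return list(reader)


tab_text = to_delimited(ROWS, "\t")
comma_text = to_delimited(ROWS, ",", quotechar="'")
print(comma_text.splitlines()[0])  # 'AC/ID','desc','score','e-value'
```

Quoting matters as soon as a description itself contains the delimiter, for example a protein name with a comma in it; the tab delimited form avoids the problem as long as no value contains a tab.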