ENSEMBL Project- Browsing Tools Human Genome

Agro-Informatics Assignment-ADAC&RI (TNAU)

Page 1: ENSEMBL Project- Browsing Tools Human Genome

Glossary

• Agroinformatics / Agricultural informatics: Agroinformatics concentrates on the aspects of bioinformatics dealing with plant genomes.

• Alignment: The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. An arrangement of two or more nucleotide or protein sequences that maximizes the number of matching monomers.

• Alignment score: A numerical value that describes the quality of a sequence alignment.

• Algorithm: A fixed procedure embodied in a computer program. A set of rules for calculating or problem solving carried out by a computer program.

• Bioinformatics: The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology.

• Annotation: (1) Finding genes and other important elements in raw sequence data (structural annotation). (2) Determining the function of genes and proteins (functional annotation).

• BLAST: Basic Local Alignment Search Tool (Altschul et al.). A sequence comparison algorithm optimized for speed, used to search sequence databases for optimal local alignments to a query. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S". The "T" parameter dictates the speed and sensitivity of the search.

• Database: On a computer, a collection of data records stored either in a single file or in multiple files. The central component of a database management system.

• Database management system (DBMS): A software suite including a database and utilities required to organize, search, and update it, maintain data security and control access.

• Domain: Usually used to describe part of a protein that can fold and carry out a function independently, but sometimes used more generally to indicate part of a protein sequence, for instance a ‘glycine-rich domain’, or a geometrically distinct part of a protein structure.

• E value: Expectation value, used to test the significance of a sequence similarity score. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

• FASTA: The first widely used algorithm (a sequence alignment algorithm) for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt". The sensitivity and speed of the search are inversely related and controlled by the "ktup" variable, which specifies the size of a "word" (Pearson and Lipman).

• Global Alignment: The alignment of two nucleic acid or protein sequences over their entire length.

• Heuristic: Of a computer program, making guesses to obtain approximate results much faster than is possible with exhaustive searching.

• Homology: Similarity attributed to descent from a common ancestor. An evolutionary relationship between two molecules that derive from a common ancestor.

• Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.

• K: A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale for search space size. The value K is used in converting a raw score (S) to a bit score (S').

• lambda: A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale for the scoring system. The value lambda is used in converting a raw score (S) to a bit score (S').

• Local Alignment: The alignment of some portion of two nucleic acid or protein sequences.
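
The K and lambda entries above can be made concrete with the standard Karlin-Altschul relations, S' = (lambda·S − ln K) / ln 2 and E = m·n·2^(−S'), where m is the query length and n the database length. The following is a minimal sketch under those textbook formulas; the function names and the numeric values of lambda, K, m and n are illustrative assumptions, not values taken from these slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, lam, k, query_len, db_len):
    """Expected number of chance alignments scoring at least raw_score."""
    return query_len * db_len * 2.0 ** (-bit_score(raw_score, lam, k))


if __name__ == "__main__":
    # Illustrative (assumed) parameters; real values depend on the scoring system.
    lam, k = 0.267, 0.041
    for s in (40, 80, 120):
        print(s, round(bit_score(s, lam, k), 1), e_value(s, lam, k, 250, 5_000_000))
```

Because the bit score already absorbs lambda and K, bit scores from searches with different scoring systems can be compared directly, which raw scores cannot.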

• Motif: A short conserved region in a protein sequence. Motifs are frequently highly conserved parts of domains.

• Multiple Sequence Alignment: An alignment of three or more sequences with gaps inserted in the sequences such that residues with common structural positions and/or ancestral residues are aligned in the same column. Clustal W is one of the most widely used multiple sequence alignment programs.

• Optimal Alignment: An alignment of two sequences with the highest possible score.

• Orthologous: Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.

• P value: The probability of an alignment occurring with the score in question or better. The P value is calculated by relating the observed alignment score, S, to the expected distribution of HSP scores from comparisons of random sequences of the same length and composition as the query to the database. The most highly significant P values will be those close to 0. P values and E values are different ways of representing the significance of the alignment.
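
The relationship between P values and E values mentioned above can be sketched numerically. Assuming the number of chance hits follows a Poisson distribution (the usual assumption in BLAST statistics, not something stated on these slides), the probability of at least one hit is P = 1 − e^(−E). The function name below is hypothetical.

```python
import math


def p_value_from_e(e_value):
    """P = 1 - exp(-E), computed with expm1 for accuracy when E is tiny."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (1e-10, 0.01, 1.0, 10.0):
        print(e, p_value_from_e(e))
```

For small E the two measures are nearly identical, which is why search reports can show E values alone without losing information.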

• Paralogous: Homologous sequences within a single species that arose by gene duplication.

• Profile: A table that lists the frequencies of each amino acid in each position of a protein sequence. Frequencies are calculated from multiple alignments of sequences containing a domain of interest.

• Proteome: The entire complement of proteins produced by a particular genome, including variants of the same basic protein generated by post-translational modification etc. The study of the proteome is known as proteomics.

• Proteomics: Systematic analysis of protein expression of normal and diseased tissues that involves the separation, identification and characterization of all of the proteins in an organism.

• PSI-BLAST: Position-Specific Iterative BLAST. An iterative search using the BLAST algorithm. A profile is built after the initial search, which is then used in subsequent searches. The process may be repeated, if desired, with new sequences found in each cycle used to refine the profile (Altschul et al.).

• PSSM: Position-specific scoring matrix; see profile. The PSSM gives the log-odds score for finding a particular matching amino acid in a target sequence.

• Query: The input sequence (or other type of search term) with which all of the entries in a database are to be compared.
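
The Profile and PSSM entries above describe a frequency table and its log-odds form. The sketch below shows one simple way to derive both from an ungapped alignment. It is not the procedure PSI-BLAST itself uses: the uniform background frequency (1/20), the pseudocount of 1, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Per-column amino acid frequencies from equal-length, ungapped sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


def build_pssm(profile, background=1.0 / len(AMINO_ACIDS)):
    """Log-odds (base 2) of observed frequency versus an assumed uniform background."""
    return [{aa: math.log2(freq / background) for aa, freq in column.items()}
            for column in profile]


def score(pssm, segment):
    """Sum the position-specific scores of a segment as long as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    pssm = build_pssm(build_profile(["ACDW", "ACEW", "ACDW"]))
    print(score(pssm, "ACDW"), score(pssm, "WWWW"))
```

Residues that are conserved in a column receive positive scores and rarely seen residues receive negative ones, so a segment resembling the alignment scores higher than an unrelated one.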

• Primary database: A database for primary sequence data. The primary nucleotide databases are NCBI GenBank, the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database, and the DNA Data Bank of Japan (DDBJ). The primary protein databases are SWISS-PROT and TrEMBL.

• Secondary database: A database of sequence information derived from the data in primary databases. Examples include PROSITE, BLOCKS, Pfam and PRINTS.

• Relational database: A database in which data records are organized as tables, allowing the data from tables containing similar fields to be linked together.

• SWISS-PROT: Database of confirmed protein sequences with extensive annotations. Maintained by the Swiss Institute of Bioinformatics.

• TrEMBL: Translated EMBL. Database of protein sequences translated from the EMBL nucleotide sequence database. Not as extensively annotated as SWISS-PROT.

• SQL: Structured Query Language. The industry-standard language used to interrogate and process data in relational databases.
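
To illustrate how a relational database links tables through a shared field and how SQL queries them, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and records are made-up toy values, not entries from any real sequence database.

```python
import sqlite3


def demo_join():
    """Link a toy protein table to a toy domain table on a shared accession field."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE domain_hit (accession TEXT, domain TEXT);
        INSERT INTO protein VALUES ('TOY001', 'toy kinase'), ('TOY002', 'toy transporter');
        INSERT INTO domain_hit VALUES
            ('TOY001', 'kinase domain'), ('TOY001', 'SH2 domain'), ('TOY002', 'ABC domain');
    """)
    rows = conn.execute("""
        SELECT p.name, d.domain
        FROM protein AS p
        JOIN domain_hit AS d ON d.accession = p.accession
        ORDER BY p.accession
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name, domain in demo_join():
        print(name, "->", domain)
```

The JOIN clause is what the glossary means by linking tables that contain similar fields: each protein row is paired with every domain row carrying the same accession.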

ENSEMBL project- Browsing tools human genome

Ensembl

(From Wikipedia, the free encyclopedia)

• Ensembl is a bioinformatics research project aiming to "develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes". It is run in a collaboration between the Wellcome Trust Sanger Institute and the European Bioinformatics Institute, an outstation of the European Molecular Biology Laboratory.

Goals of Ensembl

The Ensembl project aims to provide:

• Accurate, automatic analysis of genome data

• Analysis and annotation maintained on the current data

• Presentation of the analysis to all via the Web

• Distribution of the analysis to other bioinformatics laboratories

Software and data

• The project is open source - all data and all software that is produced in the project can be freely accessed and used.

• Most of the software produced and used is written in the language Perl and is based on the BioPerl infrastructure. The Perl API can be easily employed in other genomic projects e.g. for the annotation of gene or clone lists. The website code uses an extensible plugins system which allows groups to modify the website for their own data sets, e.g. Vega which stores and displays manual annotation.

• Also available is an API in Java.

Current species

• The annotated genomes include most finished vertebrates and selected model organisms. Currently this includes:

• Chordates

– Mammals: Human, Mouse, Rat, Chimp, Macaque, Dog, Cow, Elephant (pre), Opossum, Rabbit (preliminary data), Armadillo (pre), Tenrec (pre)

– Birds: Chicken

– Fish: Takifugu rubripes (Fugu), Tetraodon nigroviridis, Danio rerio (Zebrafish)

– Frog: Xenopus tropicalis

– Ancient relatives: Ciona intestinalis, Ciona savignyi (pre)

• Invertebrates

– Insects: Anopheles gambiae (Mosquito), Honeybee, Drosophila melanogaster (Fruitfly), Aedes aegypti (Mosquito)

– Worm: Caenorhabditis elegans

• Yeast: Saccharomyces cerevisiae (Baker's yeast)

Usage

• The service is used by molecular biologists and bioinformaticians around the world working with genome data of the above organisms. The predictions of coding, controlling and other elements in the genomes can be compared with primary research data and with common repositories of current genomic knowledge (Biological Databases).

• The comparison of organisms (comparative genomics or also intergenomics) with respect to their gene structures and the coded proteins is of special interest. The synteny view can be useful educational material for school classes.

Human genome

• The human genome project is the result of an international consortium of many different sequencing and bioinformatics centers. A wealth of data is available, including the annotated assembled genomic sequence, transcript sequence, library resources, expression data, map data, disease and functional information, and more. The result is an unprecedented amount of knowledge concerning human genetics that will eventually result in breakthroughs in understanding human biology as well as significant medical advances.

• A challenge facing researchers today is that of analyzing and integrating the plethora of data available. The human genome sequence provides a critical foundation for continued advances in medicine, basic research, and clinical diagnostic technologies.

Map of the human X chromosome (from the NCBI website). Assembly of the human genome is one of the greatest achievements of bioinformatics.

Usage of SWISS-PROT, EMBL and BLAST software for similarity searches: comparing two sequences and building a multiple sequence alignment

Swiss-Prot

• Swiss-Prot is a manually curated biological database of protein sequences. Swiss-Prot was created in 1986 by Amos Bairoch during his PhD and developed by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute. Swiss-Prot strives to provide reliable protein sequences associated with a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.

• In 2002, the UniProt consortium was created: it is a collaboration between the Swiss Institute of Bioinformatics, the European Bioinformatics Institute and the Protein Information Resource (PIR), funded by the National Institutes of Health. Swiss-Prot and its automatically curated supplement TrEMBL have joined with the Protein Information Resource protein database to produce the UniProt Knowledgebase, the world's most comprehensive catalogue of information on proteins. The UniProtKB/Swiss-Prot release 51.3 from 12 December 2006 contains 250,296 entries.

• The UniProt consortium produced three database components, each optimised for different uses: the UniProt Knowledgebase (UniProtKB, comprising Swiss-Prot and TrEMBL); the UniProt Non-redundant Reference (UniRef) databases, which combine closely related sequences into a single record to speed similarity searches; and the UniProt Archive (UniParc), a comprehensive repository of protein sequences that reflects the history of all protein sequences.

European Molecular Biology Laboratory

• The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 19 European countries. The EMBL was created in 1974 and has laboratories in Heidelberg, Germany; Hamburg, Germany; Grenoble, France; and Hinxton, UK, and an external Research Programme in Monterotondo, Italy.

• Cell biology and biophysics, developmental biology, gene expression, structural biology and computational biology are the major fields of research at EMBL Heidelberg.

• Many scientific breakthroughs have been made at EMBL Heidelberg, most notably the first systematic genetic analysis of embryonic development in the fruit fly by Christiane Nüsslein-Volhard and Erich Wieschaus, for which they were awarded the Nobel Prize for Medicine in 1995.

• Heidelberg is the largest centre for biomedical research in Germany and home to the oldest German university, the Ruprecht-Karls-Universität Heidelberg.

BLAST

• Developer: Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J.; NCBI

• Latest release: 2.2.15

• OS: UNIX, Linux, Mac, MS-Windows

• Use: Bioinformatics tool

• In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.

• A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if human beings carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

Agroinformatics

• BLAST is one of the most widely used bioinformatics programs, probably because it addresses a fundamental problem and the algorithm emphasizes speed over sensitivity. This emphasis on speed is vital to making the algorithm practical on the huge genome databases currently available, although subsequent algorithms can be even faster.

• Examples of other questions that researchers use BLAST to answer are:

• Which bacterial species have a protein that is related in lineage to a certain protein whose amino-acid sequence I know?

• Where does the DNA that I've just sequenced come from?

• What other genes encode proteins that exhibit structures or motifs such as the one I've just determined?

• BLAST is also often used as part of other algorithms that require approximate sequence matching.

• The BLAST algorithm and the computer program that implements it were developed by Stephen Altschul, Warren Gish and David Lipman at the U.S. National Center for Biotechnology Information (NCBI), Webb Miller at The Pennsylvania State University, and Gene Myers at the University of Arizona.

• Input and output comply with the FASTA format.

Algorithm

• To run, BLAST requires two inputs: a query sequence (also called the target sequence) and a sequence database. BLAST will find subsequences in the query that are similar to subsequences in the database. In typical usage, the query sequence is much smaller than the database, e.g., the query may be one thousand nucleotides while the database is several billion nucleotides.

BLAST searches for high scoring sequence alignments between the query sequence and sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm. The exhaustive Smith-Waterman approach is too slow for searching large genomic databases such as GenBank. Therefore, the BLAST algorithm uses a heuristic approach that is slightly less accurate than Smith-Waterman but over 50 times faster. The speed and relatively good accuracy of BLAST are the key technical innovation of the BLAST programs and arguably why the tool is the most popular bioinformatics search tool.
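
For comparison, the exhaustive Smith-Waterman approach that BLAST approximates can be sketched in a few lines. This is a minimal score-only version with a linear gap penalty; the scoring values (match = 2, mismatch = −1, gap = −2) and the function name are assumptions for illustration. Its cost grows with the product of the two sequence lengths, which is why it is too slow for scanning a large database.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("AGTTAC", "ACTTAG"))
```

Every cell of the dynamic-programming table is filled, so nothing is missed; BLAST gives up that guarantee in exchange for examining only the regions around word hits.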

• The BLAST algorithm can be conceptually divided into three stages.

• In the first stage, BLAST searches for exact matches of a small fixed length W between the query and sequences in the database. For example, given the sequences AGTTAC and ACTTAG and a word length W = 3, BLAST would identify the matching substring TTA that is common to both sequences. By default, W = 11 for nucleotide seeds.

• In the second stage, BLAST tries to extend the match in both directions, starting at the seed. The ungapped alignment process extends the initial seed match of length W in each direction in an attempt to boost the alignment score. Insertions and deletions are not considered during this stage. For our example, the ungapped alignment between the sequences AGTTAC and ACTTAG centered around the common word TTA would be:

    ..A G T T A C..
      |   | | |
    ..A C T T A G..

If a high-scoring ungapped alignment is found, the database sequence is passed on to the third stage.

• In the third stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm. Statistically significant alignments are then displayed to the user.
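
The first two stages can be sketched as follows, using the AGTTAC / ACTTAG example with W = 3. This is a simplified illustration, not the NCBI implementation: the scoring values (match = 1, mismatch = −1), the drop-off cutoff `x_drop`, and the function names are all assumptions.

```python
def find_seeds(query, subject, w=3):
    """Stage 1: (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], [])]


def extend_ungapped(query, subject, qpos, spos, w=3, match=1, mismatch=-1, x_drop=2):
    """Stage 2: extend a seed along its diagonal without gaps.

    Extension in each direction stops once the running score falls more than
    x_drop below the best score seen. Returns (score, query_start, query_end).
    """
    best = w * match
    left = right = 0
    running, k = best, 0
    while qpos + w + k < len(query) and spos + w + k < len(subject):
        running += match if query[qpos + w + k] == subject[spos + w + k] else mismatch
        k += 1
        if running > best:
            best, right = running, k
        elif best - running > x_drop:
            break
    running, k = best, 0
    while qpos - 1 - k >= 0 and spos - 1 - k >= 0:
        running += match if query[qpos - 1 - k] == subject[spos - 1 - k] else mismatch
        k += 1
        if running > best:
            best, left = running, k
        elif best - running > x_drop:
            break
    return best, qpos - left, qpos + w + right


if __name__ == "__main__":
    for q, s in find_seeds("AGTTAC", "ACTTAG"):
        print((q, s), extend_ungapped("AGTTAC", "ACTTAG", q, s))
```

Only diagonals that contain a seed are ever examined, which is where the speed advantage over exhaustive alignment comes from.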

An extremely fast but considerably less sensitive alternative to BLAST that compares nucleotide sequences to the genome is BLAT (Blast Like Alignment Tool). A version designed for comparing multiple large genomes or chromosomes is BLASTZ. There is also another well-known program, PatternHunter, which produces significantly better sensitivity than BLAST at the same speed, or very similar sensitivity at a much faster speed.

Parallel BLAST

• Parallel BLAST versions are implemented using MPI and Pthreads and have been ported to various platforms, including Windows, Linux, Solaris, OS X, and AIX. Popular approaches to parallelizing BLAST include query distribution, hash table segmentation, computation parallelization, and database segmentation (partitioning).
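
Database segmentation, one of the approaches listed above, can be sketched structurally: split the database into segments, search each segment independently, then merge the hits. The sketch below uses a thread pool from the Python standard library rather than MPI or Pthreads, and a toy shared-word count in place of a real BLAST search; all names and the toy database are assumptions. It shows the partition-and-merge pattern only and is not expected to give a real speed-up in CPython.

```python
from concurrent.futures import ThreadPoolExecutor


def shared_words(query, subject, w=3):
    """Toy similarity: number of distinct length-w words the sequences share."""
    def words(seq):
        return {seq[i:i + w] for i in range(len(seq) - w + 1)}
    return len(words(query) & words(subject))


def search_segment(query, segment):
    """Search one database segment; returns (name, score) for every entry."""
    return [(name, shared_words(query, seq)) for name, seq in segment]


def segmented_search(query, database, n_segments=2):
    """Partition the database, search the segments concurrently, merge the hits."""
    items = list(database.items())
    segments = [items[i::n_segments] for i in range(n_segments)]
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        partial = list(pool.map(lambda seg: search_segment(query, seg), segments))
    hits = [hit for part in partial for hit in part if hit[1] > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    toy_db = {"s1": "ACTTAG", "s2": "GGGGGG", "s3": "AGTTAC"}
    print(segmented_search("AGTTAC", toy_db))
```

Because each segment is searched independently, the merged result does not depend on how many segments are used; only the final sort has to see all hits together.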

Program

• The BLAST program can either be downloaded and run as a command-line utility "blastall" or accessed for free over the web. The BLAST web server, hosted by the NCBI, allows anyone with a web browser to perform similarity searches against constantly updated databases of proteins and DNA that include most of the newly sequenced organisms.

• BLAST is actually a family of programs (all included in the blastall executable). The following are some of the programs, ranked mostly in order of importance:

• Nucleotide-nucleotide BLAST (blastn): This program, given a DNA query, returns the most similar DNA sequences from the DNA database that the user specifies.

• Protein-protein BLAST (blastp): This program, given a protein query, returns the most similar protein sequences from the protein database that the user specifies.

• Position-Specific Iterative BLAST (PSI-BLAST): One of the more recent BLAST programs, this program is used for finding distant relatives of a protein. First, a list of all closely related proteins is created. Then these proteins are combined into a "profile" that is a sort of average sequence. A query against the protein database is then run using this profile, and a larger group of proteins is found. This larger group is used to construct another profile, and the process is repeated. By including related proteins in the search, PSI-BLAST is much more sensitive in picking up distant evolutionary relationships than the standard protein-protein BLAST.

• Nucleotide 6-frame translation-protein (blastx): This program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

• Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx): This program is the slowest of the BLAST family. It translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database. The purpose of tblastx is to find very distant relationships between nucleotide sequences.

• Protein-nucleotide 6-frame translation (tblastn): This program compares a protein query against the six-frame translations of a nucleotide sequence database.

• Large numbers of query sequences (megablast): When comparing large numbers of input sequences via the command-line BLAST, "megablast" is much faster than running BLAST multiple times. It basically concatenates many input sequences together to form a large sequence before searching the BLAST database, then post-analyzes the search results to glean individual alignments and statistical values.
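
Several of the programs above (blastx, tblastn, tblastx) rely on six-frame conceptual translation: three reading frames on the given strand and three on its reverse complement. The following is a minimal sketch assuming the standard genetic code; the function names and the frame labels "+1" to "-3" are illustrative choices, and unknown codons are shown as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the six conceptual translations keyed by frame label."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```

Each nucleotide sequence therefore yields six protein sequences to compare, which is why the translated searches, and tblastx in particular, are the slowest members of the family.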

• The core of NCBI's BLAST services is BLAST 2.0, otherwise known as "Gapped BLAST". This service is designed to take protein and nucleic acid sequences and compare them against a selection of NCBI databases.

• The BLAST algorithm was written balancing speed and increased sensitivity for distant sequence relationships. Instead of relying on global alignments (commonly seen in multiple sequence alignment programs), BLAST emphasizes regions of local alignment to detect relationships among sequences which share only isolated regions of similarity (Altschul et al., 1990). Therefore, BLAST is more than a tool to view sequences aligned with each other or to find homology; it is a program to locate regions of sequence similarity with a view to comparing structure and function.

Selecting the BLAST Program

The BLAST search pages allow you to select from several different programs. Below is a table of these programs.

Program: Description

• Blastp: Compares an amino acid query sequence against a protein sequence database.

• Blastn: Compares a nucleotide query sequence against a nucleotide sequence database.

• Blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.

• Tblastn: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.

• Tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive.
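
The table above amounts to a lookup from the query type and the database type to a program name. A minimal sketch of that lookup is given below; the function name, the type labels "protein" and "nucleotide", and the `translate_both` flag are hypothetical and do not correspond to any option of the BLAST web pages or command-line tools.

```python
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program name from the query and database sequence types.

    translate_both selects tblastx, which compares six-frame translations of a
    nucleotide query against six-frame translations of a nucleotide database.
    """
    key = (query_type, database_type)
    if translate_both:
        if key != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    if key not in PROGRAMS:
        raise ValueError(f"unsupported combination: {key}")
    return PROGRAMS[key]


if __name__ == "__main__":
    print(choose_program("nucleotide", "protein"))
    print(choose_program("nucleotide", "nucleotide", translate_both=True))
```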

To select a BLAST program for your search

1. Open the Basic BLAST search page.

2. From the "Program" Pull Down Menu select the appropriate program.

Figure 1. Using the pull down menu to select a BLAST program.

Proteins

Database: Description

• Nr: All non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF.

• month: All new or revised GenBank CDS translations + PDB + SwissProt + PIR released in the last 30 days.

• Swissprot: The last major release of the SWISS-PROT protein sequence database (no updates). These are uploaded to our system when they are received from EMBL.

• patents: Protein sequences derived from the Patent division of GenBank.

• Yeast: Yeast (Saccharomyces cerevisiae) protein sequences. This database is not to be confused with a listing of all Yeast protein sequences. It is a database of the protein translations of the Yeast complete genome.

• E. coli: E. coli (Escherichia coli) genomic CDS translations.

• pdb: Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank.

• kabat [kabatpro]: Kabat's database of sequences of immunological interest. For more information see http://immuno.bme.nwu.edu/

• Alu: Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available at ftp://ncbi.nlm.nih.gov/pub/jmc/alu. See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).

Nucleotides

Database: Description

• Nr: All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences).

• Month: All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.

• Dbest: Non-redundant database of GenBank+EMBL+DDBJ EST Divisions.

• Dbsts: Non-redundant database of GenBank+EMBL+DDBJ STS Divisions.

• mouse ests: The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the organism mouse.

• human ests: The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the organism human.

• other ests: The non-redundant database of GenBank+EMBL+DDBJ EST Divisions for all organisms except mouse and human.

• Yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences. Not a collection of all Yeast nucleotide sequences, but the sequence fragments from the Yeast complete genome.

• E. coli: E. coli (Escherichia coli) genomic nucleotide sequences.

• Pdb: Sequences derived from the 3-dimensional structure of proteins.

• kabat [kabatnuc]: Kabat's database of sequences of immunological interest. For more information see http://immuno.bme.nwu.edu/

• Patents: Nucleotide sequences derived from the Patent division of GenBank.

• vector: Vector subset of GenBank(R), NCBI (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory).

• Mito: Database of mitochondrial sequences (Rel. 1.0, July 1995).

• Alu: Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available at ftp://ncbi.nlm.nih.gov/pub/jmc/alu. See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).

• epd: Eukaryotic Promoter Database, ISREC, Epalinges s/Lausanne (Switzerland).

• Gss: Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.

• Htgs: High Throughput Genomic Sequences.

Figure 2. Using the Pull Down Menu to select the BLAST database.

Entering your Sequence

• The BLAST web pages accept input sequences in three formats: FASTA sequence format, NCBI Accession numbers, or GIs.

• FASTA Format

• A description of the FASTA format is located on the Basic BLAST search pages.
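
As a brief reminder of what a FASTA record looks like: a definition line beginning with ">" is followed by one or more lines of sequence. The sketch below parses such text into (header, sequence) pairs; the function name and the toy records are assumptions for illustration, and the parser ignores details such as comment lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    toy = ">seq1 toy record\nAGTT\nAC\n\n>seq2\nACTTAG\n"
    for name, seq in parse_fasta(toy):
        print(name, seq, len(seq))
```

Checking that pasted text parses cleanly in this way is a quick sanity test before submitting it as a query.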

1. Open your FASTA formatted sequence in a text editor as plain text.

2. Use your mouse to highlight the entire sequence.

3. Select Edit/Copy from the menu in your text editor.

4. Go to the BLAST search page in your web browser.

5. Use your mouse to select the main input field titled "Enter your input data here", by clicking it once.

6. Select Edit/Paste from the browser's menu.

7. You should now see your FASTA sequence in this field.

8. Set the pull down menu to "Sequence in FASTA format".

Figure 3. Example of a FASTA sequence in the input field.

Accession or GI number

• If you know the Accession number or the GI of a sequence in GenBank, you can use this as the query sequence in a BLAST search.

1. Go to the BLAST search page in your web browser.

2. Use your mouse to select the main input field titled "Enter your input data here", by clicking it once.

3. Using the keyboard enter the GenBank Accession number or the GI number.

4. Set the Pull Down Menu to "Accession or GI".

Submitting your Search

1. Make sure you have selected the correct BLAST program and BLAST database.

2. If you have entered your FASTA sequence or an Accession or GI number, click the "Submit Query" button.

3. BLAST will now open a new window and tell you it is working on your search.

4. Once your results are computed they will be presented in the window.

Introduction to a BLAST Query

Open a new browser window so that the BLAST program can be compared to the tutorial. Notice that the tutorial page resembles the Query form for an ADVANCED BLAST search; however, the elements of the Query form have been reorganized on the tutorial page to make them easier to describe. Explanatory notes have been added in light grey boxes. Additional details about BLAST are available through the buttons.

The BLAST browser window may be left open and used in parallel, or it may be closed while browsing through this tutorial. Scroll down the tutorial page to learn how to submit a BLAST search, step by step. When you are ready, the button will take you to the BLAST output page where the results of this search can be examined.