ENSEMBL Project- Browsing Tools Human Genome

Agro-Informatics Assignment-ADAC&RI (TNAU)

Glossary

• Agroinformatics / Agricultural informatics: Agroinformatics concentrates on the aspects of bioinformatics dealing with plant genomes.

• Alignment :The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Equivalently, an arrangement of two or more nucleotide or protein sequences that maximizes the number of matching monomers.

• Alignment score: A numerical value that describes the quality of a sequence alignment.

• Algorithm :A fixed procedure embodied in a computer program. A set of rules for calculating or problem solving carried out by a computer program.

• Bioinformatics :The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology.

• Annotation: 1-Finding genes and other important elements in raw sequence data (structural annotation).2-Determining the function of genes and proteins (functional annotation)

• BLAST Basic Local Alignment Search Tool. (Altschul et al.) A sequence comparison algorithm optimized for speed used to search sequence databases for optimal local alignments to a query. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S". The "T" parameter dictates the speed and sensitivity of the search.

• Database: On a computer, a collection of data records either in a single file or as a multiple files. The central component of a database management system.

• Database management system (DBMS): A software suite including a database and utilities required to organize, search, and update it, maintain data security and control access.

• Domain: Usually used to describe part of a protein that can fold and carry out a function independently, but sometimes used more generally to indicate part of a protein sequence, for instance a ‘glycine-rich domain’, or a geometrically distinct part of a protein structure.

• E value: Expectation value used to test the significance of a sequence similarity score. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

• FASTA The first widely used algorithm (a sequence alignment algorithm) for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt". The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable which specifies the size of a "word". (Pearson and Lipman)

• Global Alignment:The alignment of two nucleic acid or protein sequences over their entire length

• Heuristic: Of a computer program, making guesses to obtain approximate results but much faster than possible with exhaustive searching.

• Homology :Similarity attributed to descent from a common ancestor. An evolutionary relationship between two molecules that derive from a common ancestor.

• Identity :The extent to which two (nucleotide or amino acid) sequences are invariant.

• K :A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale for search space size. The value K is used in converting a raw score (S) to a bit score (S').

• lambda :A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale for scoring system. The value lambda is used in converting a raw score (S) to a bit score (S').

• Local Alignment :The alignment of some portion of two nucleic acid or protein sequences

• Motif :A short conserved region in a protein sequence. Motifs are frequently highly conserved parts of domains.

• Multiple Sequence Alignment :An alignment of three or more sequences with gaps inserted in the sequences such that residues with common structural positions and/or ancestral residues are aligned in the same column. Clustal W is one of the most widely used multiple sequence alignment programs

• Optimal Alignment :An alignment of two sequences with the highest possible score.

• Orthologous :Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.

• P value :The probability of an alignment occurring with the score in question or better. The p value is calculated by relating the observed alignment score, S, to the expected distribution of HSP scores from comparisons of random sequences of the same length and composition as the query to the database. The most highly significant P values will be those close to 0. P values and E values are different ways of representing the significance of the alignment.

• Paralogous :Homologous sequences within a single species that arose by gene duplication.

• Profile :A table that lists the frequencies of each amino acid in each position of protein sequence. Frequencies are calculated from multiple alignments of sequences containing a domain of interest.

• Proteome: The entire complement of proteins produced by a particular genome, including variants of the same basic protein generated by post-translational modification etc. The study of the proteome is known as proteomics.

• Proteomics :Systematic analysis of protein expression of normal and diseased tissues that involves the separation, identification and characterization of all of the proteins in an organism.

• PSI-BLAST :Position-Specific Iterative BLAST. An iterative search using the BLAST algorithm. A profile is built after the initial search, which is then used in subsequent searches. The process may be repeated, if desired with new sequences found in each cycle used to refine the profile. Details can be found in this discussion of PSI-BLAST. (Altschul et al.)

• PSSM :Position-specific scoring matrix; see profile. The PSSM gives the log-odds score for finding a particular matching amino acid in a target sequence.

• Query :The input sequence (or other type of search term) with which all of the entries in a database are to be compared.

• Primary database: A database for primary sequence data. The primary nucleotide databases are NCBI GenBank, the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database, and the DNA Database of Japan. The primary protein databases are SWISS-PROT and TrEMBL.

• Secondary database: A database of sequence information derived from the data in primary databases. Examples include PROSITE, BLOCKS, Pfam and PRINTS.

• Relational database: A database in which data records are organized as tables, allowing the data from tables containing similar fields to be linked together.

• SWISS-PROT: Database of confirmed protein sequences with extensive annotations. Maintained by the Swiss Institute of Bioinformatics.

• TrEMBL: Translated EMBL. Database of protein sequences translated from the EMBL nucleotide sequence database. Not as extensively annotated as SWISS-PROT

• SQL: Structured Query Language. The industry-standard language used to interrogate and process data in relational databases.
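The glossary entries for K, lambda, E value and P value fit together through the standard Karlin-Altschul relations: S' = (lambda × S − ln K) / ln 2, E = m × n × 2^(−S'), and P = 1 − e^(−E). The following is a minimal sketch of those conversions in Python. The function names are hypothetical, and the values chosen for lambda, K and the search-space sizes are illustrative placeholders, not parameters taken from any particular BLAST run.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw score S to a bit score S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least this well:
    E = m * n * 2**(-S'), with m the query length and n the database length."""
    return query_len * db_len * 2.0 ** (-bits)

def p_value(e):
    """Probability of at least one such chance alignment: P = 1 - exp(-E)."""
    return -math.expm1(-e)

# Illustrative placeholder parameters (assumptions, not from a real search).
lam, k = 0.267, 0.041
bits = bit_score(100, lam, k)
expect = e_value(bits, query_len=250, db_len=1_000_000)
print(f"bit score = {bits:.1f}, E = {expect:.3g}, P = {p_value(expect):.3g}")
```

For small E the P value and the E value are nearly identical, which is why the two are described above as different ways of representing the same significance.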

ENSEMBL project- Browsing tools human genome

Ensembl (from Wikipedia, the free encyclopedia)

• Ensembl is a bioinformatics research project aiming to "develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes". It is run in a collaboration between the Wellcome Trust Sanger Institute and the European Bioinformatics Institute, an outstation of the European Molecular Biology Laboratory.

Goals of Ensembl

The Ensembl project aims to provide:

• Accurate, automatic analysis of genome data

• Analysis and annotation maintained on the current data

• Presentation of the analysis to all via the Web

• Distribution of the analysis to other bioinformatics laboratories

Software and data

• The project is open source - all data and all software that is produced in the project can be freely accessed and used.

• Most of the software produced and used is written in the language Perl and is based on the BioPerl infrastructure. The Perl API can be easily employed in other genomic projects e.g. for the annotation of gene or clone lists. The website code uses an extensible plugins system which allows groups to modify the website for their own data sets, e.g. Vega which stores and displays manual annotation.

• Also available is an API in Java.

Current species

• The annotated genomes include most finished vertebrates and selected model organisms. Currently this includes:

• Chordates

– Mammals: Human, Mouse, Rat, Chimp, Macaque, Dog, Cow, Elephant (pre), Opossum, Rabbit (preliminary data), Armadillo (pre), Tenrec (pre)

– Birds: Chicken

– Fish: Takifugu rubripes (Fugu), Tetraodon nigroviridis, Danio rerio (Zebrafish)

– Frog: Xenopus tropicalis

– Ancient relatives: Ciona intestinalis, Ciona savignyi (pre)

• Invertebrates

– Insects: Anopheles gambiae (Mosquito), Honeybee, Drosophila melanogaster (Fruitfly), Aedes aegypti (Mosquito)

– Worm: Caenorhabditis elegans

• Yeast: Saccharomyces cerevisiae (Baker's yeast)

Usage

• The service is used by molecular biologists and bioinformaticians around the world working with genome data of the above organisms. The predictions of coding, controlling and other elements in the genomes can be compared with primary research data and with common repositories of current genomic knowledge (Biological Databases).

• The comparison of organisms (comparative genomics or also intergenomics) with respect to their gene structures and the coded proteins is of special interest. The synteny view can be useful educational material for school classes.

Human genome

• The human genome project is the result of an international consortium among many different sequencing and bioinformatics centers. A wealth of data is available, including the annotated assembled genomic sequence, transcript sequence, library resources, expression data, map data, disease and functional information, and more. The result is an unprecedented amount of knowledge concerning human genetics that will eventually result in breakthroughs in understanding human biology as well as significant medical advances.

• A challenge facing researchers today is that of analyzing and integrating the plethora of data available. The human genome sequence provides a critical foundation for continued advances in medicine, basic research, and clinical diagnostic technologies.

Map of the human X chromosome (from the NCBI website). Assembly of the human genome is one of the greatest achievements of bioinformatics.

Usage of SWISSPROT, EMBL, BLAST software for similarity searches - Comparing two sequences - Building a multiple sequence alignment

Swiss-Prot

• Swiss-Prot is a manually curated biological database of protein sequences. Swiss-Prot was created in 1986 by Amos Bairoch during his PhD and developed by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute. Swiss-Prot strives to provide reliable protein sequences associated with a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases.

• In 2002, the UniProt consortium was created: it is a collaboration between the Swiss Institute of Bioinformatics, the European Bioinformatics Institute and the Protein Information Resource (PIR), funded by the National Institutes of Health. Swiss-Prot and its automatically curated supplement TrEMBL have joined with the Protein Information Resource protein database to produce the UniProt Knowledgebase, the world's most comprehensive catalogue of information on proteins. The UniProtKB/Swiss-Prot release 51.3 from 12 December 2006 contains 250,296 entries.

• The UniProt consortium produced 3 database components, each optimised for different uses. The UniProt Knowledgebase (UniProtKB (Swiss-Prot + TrEMBL)), the UniProt Non-redundant Reference (UniRef) databases, which combine closely related sequences into a single record to speed similarity searches and the UniProt Archive (UniParc), which is a comprehensive repository of protein sequences, reflecting the history of all protein sequences.

European Molecular Biology Laboratory

• The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 19 European countries. The EMBL was created in 1974 and has laboratories in Heidelberg, Germany; Hamburg, Germany; Grenoble, France; and Hinxton, UK, and an external Research Programme in Monterotondo, Italy.

• Cell biology and biophysics, developmental biology, gene expression, structural biology and computational biology are the major fields of research at EMBL Heidelberg.

• Many scientific breakthroughs have been made at EMBL Heidelberg, most notably the first systematic genetic analysis of embryonic development in the fruit fly by Christiane Nüsslein-Volhard and Eric Wieschaus, for which they were awarded the Nobel Prize in Physiology or Medicine in 1995.

• Heidelberg is the largest centre for biomedical research in Germany and home to the oldest German university, the Ruprecht-Karls-Universität Heidelberg.

BLAST

• Developer: Altschul S.F., Gish W., Miller E.W., Lipman D.J., NCBI

• Latest release: 2.2.15

• OS: UNIX, Linux, Mac, MS-Windows

• Use: Bioinformatics tool

• In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.

• A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if human beings carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

• BLAST is one of the most widely used bioinformatics programs, probably because it addresses a fundamental problem and the algorithm emphasizes speed over sensitivity. This emphasis on speed is vital to making the algorithm practical on the huge genome databases currently available, although subsequent algorithms can be even faster.

• Examples of other questions that researchers use BLAST to answer are

• Which bacterial species have a protein that is related in lineage to a certain protein whose amino-acid sequence I know?

• Where does the DNA that I've just sequenced come from?

• What other genes encode proteins that exhibit structures or motifs such as the one I've just determined?

• BLAST is also often used as part of other algorithms that require approximate sequence matching.

• The BLAST algorithm and the computer program that implements it were developed by Stephen Altschul, Warren Gish, and David Lipman at the U.S. National Center for Biotechnology Information (NCBI), Webb Miller at The Pennsylvania State University, and Gene Myers at the University of Arizona.

• Input and output comply with the FASTA format.
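Since BLAST input follows the FASTA format (a header line starting with ">" followed by one or more lines of sequence), a small reader can make the layout concrete. This is a minimal sketch; the function name `parse_fasta` and the two example records are made up for illustration and do not come from any database.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Hypothetical example input.
example = """>seq1 hypothetical example record
AGTTAC
AGTTAC
>seq2 another made-up record
ACTTAG
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```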

Algorithm

• To run, BLAST requires two sequences as input: a query sequence (also called the target sequence) and a sequence database. BLAST will find subsequences in the query that are similar to subsequences in the database. In typical usage, the query sequence is much smaller than the database, e.g., the query may be one thousand nucleotides while the database is several billion nucleotides.

BLAST searches for high scoring sequence alignments between the query sequence and sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm. The exhaustive Smith-Waterman approach is too slow for searching large genomic databases such as GenBank. Therefore, the BLAST algorithm uses a heuristic approach that is slightly less accurate than Smith-Waterman but over 50 times faster. The speed and relatively good accuracy of BLAST are the key technical innovation of the BLAST programs and arguably why the tool is the most popular bioinformatics search tool.
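To make the cost of the exhaustive approach concrete, here is a minimal sketch of the Smith-Waterman local alignment score in Python. It fills a table with one cell per pair of positions, so the work grows with the product of the two sequence lengths, which is what becomes impractical against a database of billions of nucleotides. The scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration, not BLAST defaults.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the table.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("AGTTAC", "ACTTAG"))
```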

• The BLAST algorithm can be conceptually divided into three stages.

• In the first stage, BLAST searches for exact matches of a small fixed length W between the query and sequences in the database. For example, given the sequences AGTTAC and ACTTAG and a word length W = 3, BLAST would identify the matching substring TTA that is common to both sequences. By default, W = 11 for nucleic seeds.

• In the second stage, BLAST tries to extend the match in both directions, starting at the seed. The ungapped alignment process extends the initial seed match of length W in each direction in an attempt to boost the alignment score. Insertions and deletions are not considered during this stage. For our example, the ungapped alignment between the sequences AGTTAC and ACTTAG centered around the common word TTA would be:

..A G T T A C..
  |   | | |
..A C T T A G..

If a high-scoring ungapped alignment is found, the database sequence is passed on to the third stage.

• In the third stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm. Statistically significant alignments are then displayed to the user.
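The first two stages can be sketched in a few lines of Python using the AGTTAC / ACTTAG example from the text. This is a simplified illustration, not NCBI's implementation: the function names are hypothetical, the scoring (match +1, mismatch -1) and the drop-off cutoff `xdrop` are assumed values, and the third (gapped) stage is omitted.

```python
def find_seeds(query, subject, w):
    """Stage 1: return (query_pos, subject_pos) for every exact word match of length w."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds

def _best_extension(pairs, match, mismatch, xdrop):
    """Return (gain, length) of the best-scoring prefix of aligned character pairs."""
    gain = best_gain = best_len = 0
    for n, (x, y) in enumerate(pairs, start=1):
        gain += match if x == y else mismatch
        if gain > best_gain:
            best_gain, best_len = gain, n
        elif best_gain - gain >= xdrop:
            break  # score has dropped too far below the best seen so far
    return best_gain, best_len

def extend_ungapped(query, subject, qi, sj, w, match=1, mismatch=-1, xdrop=2):
    """Stage 2: extend a seed in both directions without insertions or deletions."""
    right = zip(query[qi + w:], subject[sj + w:])
    left = zip(reversed(query[:qi]), reversed(subject[:sj]))
    r_gain, r_len = _best_extension(right, match, mismatch, xdrop)
    l_gain, l_len = _best_extension(left, match, mismatch, xdrop)
    score = w * match + l_gain + r_gain
    return score, (qi - l_len, qi + w + r_len), (sj - l_len, sj + w + r_len)

seeds = find_seeds("AGTTAC", "ACTTAG", w=3)
print(seeds)
print(extend_ungapped("AGTTAC", "ACTTAG", *seeds[0], w=3))
```

In this toy case the seed TTA cannot be improved by extension, because the flanking positions either mismatch or only recover the penalty of an earlier mismatch.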

An extremely fast but considerably less sensitive alternative to BLAST that compares nucleotide sequences to the genome is BLAT (BLAST-Like Alignment Tool). A version designed for comparing multiple large genomes or chromosomes is BLASTZ. There is also another well-known program called PatternHunter, which produces significantly better sensitivity than BLAST at the same speed, or very similar sensitivity at a much faster speed.

Parallel BLAST

• Parallel BLAST versions are implemented using MPI and Pthreads and have been ported to various platforms including Windows, Linux, Solaris, OSX, and AIX. Popular approaches to parallelizing BLAST include query distribution, hash table segmentation, computation parallelization, and database segmentation (partitioning).
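Database segmentation is the easiest of these approaches to picture: the database is split into parts, each part is searched independently, and the hits are merged. The sketch below shows only that structure. It uses Python's standard thread pool rather than MPI or Pthreads, and a plain substring test stands in for a real BLAST search; the function names and the toy database are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def search_segment(args):
    """Search one database segment; substring matching stands in for BLAST."""
    query, segment = args
    return [name for name, seq in segment if query in seq]

def segmented_search(query, database, n_segments=2):
    """Split the database into segments, search them concurrently, merge the hits."""
    segments = [database[i::n_segments] for i in range(n_segments)]
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        results = pool.map(search_segment, [(query, seg) for seg in segments])
    return sorted(hit for hits in results for hit in hits)

# Toy database of (name, sequence) pairs, invented for this example.
database = [("s1", "AGTTAC"), ("s2", "CCCC"), ("s3", "GGTTAG"), ("s4", "ACTTAG")]
print(segmented_search("TTA", database))
```

Query distribution would be the mirror image: keep the database whole and hand different queries to different workers.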

Program

• The BLAST program can either be downloaded and run as a command-line utility "blastall" or accessed for free over the web. The BLAST web server, hosted by the NCBI, allows anyone with a web browser to perform similarity searches against constantly updated databases of proteins and DNA that include most of the newly sequenced organisms.

• BLAST is actually a family of programs (all included in the blastall executable). The following are some of the programs, ranked mostly in order of importance:

• Nucleotide-nucleotide BLAST (blastn): This program, given a DNA query, returns the most similar DNA sequences from the DNA database that the user specifies.

• Protein-protein BLAST (blastp): This program, given a protein query, returns the most similar protein sequences from the protein database that the user specifies.

• Position-Specific Iterative BLAST (PSI-BLAST): One of the more recent BLAST programs, this program is used for finding distant relatives of a protein. First, a list of all closely related proteins is created. Then these proteins are combined into a "profile" that is a sort of average sequence. A query against the protein database is then run using this profile, and a larger group of proteins is found. This larger group is used to construct another profile, and the process is repeated. By including related proteins in the search, PSI-BLAST is much more sensitive in picking up distant evolutionary relationships than the standard protein-protein BLAST.

• Nucleotide 6-frame translation-protein (blastx): This program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

• Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx): This program is the slowest of the BLAST family. It translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database. The purpose of tblastx is to find very distant relationships between nucleotide sequences.

• Protein-nucleotide 6-frame translation (tblastn): This program compares a protein query against the six-frame translations of a nucleotide sequence database.

• Large numbers of query sequences (megablast): When comparing large numbers of input sequences via the command-line BLAST, "megablast" is much faster than running BLAST multiple times. It concatenates many input sequences together to form one large sequence before searching the BLAST database, then post-analyzes the search results to glean individual alignments and statistical values.
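Several of the programs above (blastx, tblastn, tblastx) rely on six-frame conceptual translation: three reading frames on the given strand and three on its reverse complement. The following is a minimal sketch assuming the standard genetic code and plain A/C/G/T input. The function names and frame labels ("+1" to "-3") are illustrative choices, and codons containing other characters are reported as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become 'X', stops are '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Return a dict mapping frame labels to the translation in that frame."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

print(six_frame_translations("ATGGCC"))
```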

• The core of NCBI's BLAST services is BLAST 2.0, otherwise known as "Gapped BLAST". This service is designed to take protein and nucleic acid sequences and compare them against a selection of NCBI databases.

• The BLAST algorithm was written balancing speed and increased sensitivity for distant sequence relationships. Instead of relying on global alignments (commonly seen in multiple sequence alignment programs), BLAST emphasizes regions of local alignment to detect relationships among sequences which share only isolated regions of similarity (Altschul et al., 1990). Therefore, BLAST is more than a tool to view sequences aligned with each other or to find homology; it is a program to locate regions of sequence similarity with a view to comparing structure and function.

Selecting the BLAST Program

The BLAST search pages allow you to select from several different programs. Below is a table of these programs.

• Program: Description

• Blastp: Compares an amino acid query sequence against a protein sequence database.

• Blastn: Compares a nucleotide query sequence against a nucleotide sequence database.

• Blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.

• Tblastn: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.

• Tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive.
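The table reduces to a lookup on two things: the type of the query and the type of the database, plus whether both sides should be compared as translations. A minimal sketch of that decision follows. The helper `choose_program` and its argument names are hypothetical and are not part of any BLAST distribution.

```python
# (query type, database type) -> program, as described in the table above.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}

def choose_program(query_type, database_type, compare_translations=False):
    """Pick a BLAST program name from the query and database sequence types.

    Set compare_translations=True for a nucleotide query against a nucleotide
    database when both should be compared as six-frame translations (tblastx).
    """
    if compare_translations:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type}, {database_type}")

print(choose_program("nucleotide", "protein"))
```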

To select a BLAST program for your search

1. Open the Basic BLAST search page.

2. From the "Program" Pull Down Menu select the appropriate program.

Figure 1. Using the pull down menu to select a BLAST program.

Proteins

• Database: Description

• Nr: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF.

• Month: All new or revised GenBank CDS translations+PDB+SwissProt+PIR released in the last 30 days.

• Swissprot: The last major release of the SWISS-PROT protein sequence database (no updates). These are uploaded to our system when they are received from EMBL.

• Patents: Protein sequences derived from the Patent division of GenBank.

• Yeast: Yeast (Saccharomyces cerevisiae) protein sequences. This database is not to be confused with a listing of all Yeast protein sequences. It is a database of the protein translations of the Yeast complete genome.

• E. coli: E. coli (Escherichia coli) genomic CDS translations.

• Pdb: Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank.

• Kabat [kabatpro]: Kabat's database of sequences of immunological interest. For more information see http://immuno.bme.nwu.edu/

• Alu: Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available at ftp://ncbi.nlm.nih.gov/pub/jmc/alu. See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).

Nucleotides

• Database: Description

• Nr: All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences).

• Month: All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.

• Dbest: Non-redundant database of GenBank+EMBL+DDBJ EST Divisions.

• Dbsts: Non-redundant database of GenBank+EMBL+DDBJ STS Divisions.

• Mouse ests: The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the organism mouse.

• Human ests: The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the organism human.

• Other ests: The non-redundant database of GenBank+EMBL+DDBJ EST Divisions for all organisms except mouse and human.

• Yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences. Not a collection of all Yeast nucleotide sequences, but the sequence fragments from the Yeast complete genome.

• E. coli: E. coli (Escherichia coli) genomic nucleotide sequences.

• Pdb: Sequences derived from the 3-dimensional structure of proteins.

• Kabat [kabatnuc]: Kabat's database of sequences of immunological interest. For more information see http://immuno.bme.nwu.edu/

• Patents: Nucleotide sequences derived from the Patent division of GenBank.

• Vector: Vector subset of GenBank(R), NCBI (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory).

• Mito: Database of mitochondrial sequences (Rel. 1.0, July 1995).

• Alu: Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available at ftp://ncbi.nlm.nih.gov/pub/jmc/alu. See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).

• Epd: Eukaryotic Promoter Database, ISREC in Epalinges s/Lausanne (Switzerland).

• Gss: Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.

• Htgs: High Throughput Genomic Sequences.

Figure 2. Using the Pull Down Menu to select the BLAST database.

Entering your Sequence

• The BLAST web pages accept input sequences in three formats: FASTA sequence format, NCBI Accession numbers, or GIs.

• FASTA Format

• A description of the FASTA format is located on the Basic BLAST search pages.

1. Open your FASTA formatted sequence in a text editor as plain text.

2. Use your mouse to highlight the entire sequence.

3. Select Edit/Copy from the menu in your text editor.

4. Go to the BLAST search page in your web browser.

5. Use your mouse to select the main input field titled "Enter your input data here", by clicking it once.

6. Select Edit/Paste from the browser's menu.

7. You should now see your FASTA sequence in this field.

8. Set the pull down menu to "Sequence in FASTA format".

Figure 3. Example of a FASTA sequence in the input field.

Accession or GI number

• If you know the Accession number or the GI of a sequence in GenBank, you can use this as the query sequence in a BLAST search.

1. Go to the BLAST search page in your web browser.

2. Use your mouse to select the main input field titled "Enter your input data here", by clicking it once.

3. Using the keyboard enter the GenBank Accession number or the GI number.

4. Set the Pull Down Menu to "Accession or GI".

Submitting your Search

1. Make sure you have selected the correct BLAST program and BLAST database.

2. If you have entered your FASTA sequence or an Accession or GI number, click the "Submit Query" button.

3. BLAST will now open a new window and tell you it is working on your search.

4. Once your results are computed they will be presented in the window.

Introduction to a BLAST Query

Open a new browser window so that the BLAST program can be compared to the tutorial. Notice that the tutorial page resembles the Query form for an ADVANCED BLAST search, however, the elements of the Query form have been reorganized on the tutorial page to facilitate describing them. Explanatory notes have been added in light grey boxes. Additional details about BLAST are available through the buttons.

The BLAST browser window may be left open and used in parallel, or it may be closed while browsing through this tutorial. Scroll down the tutorial page to learn how to submit a BLAST search, step by step. When you are ready, the button will take you to the BLAST output page where the results of this search can be examined.
