Lab 1 - Electronic Molecular Databases FNH 100912. College Molecular Biology

  • 7/30/2019 Lab 1 - Electronic Molecular Databases FNH 100912. College Molecular Biology

    Electronic Molecular Biology

    Dr. Fazeeda N. Hosein

    BIOL30612012

    Learning Objectives

    What is a database?

    Why do we need databases and what databases are available to us?

    What information can be obtained from databases?

    What is BLAST?

    What is a sequence alignment?

    Which software can we use to compare sequences?

    Which software can we use to obtain phylogenetic data?

    What is a database?

    A database (db) is designed to offer an organized mechanism for storing, managing and retrieving information.

    A collection of data that is:

    structured

    searchable (index) -> table of contents

    updated periodically (release) -> new edition

    cross-referenced (hyperlinks) -> links with other db data

    It also includes associated tools (software) necessary for db access/query, db updating, db information insertion and db information deletion.

    Why do we need databases

    Biology has turned into data-rich science

    High-throughput genomics, proteomics, metabolomics, ...

    Vast amounts of data are generated in experiments (such as MS peptide fragments and whole genome sequencing)

    The need for storing and communicating large datasets has grown tremendously

    Archiving, curation, analysis and interpretation of all of these datasets are a challenge

    Convenient methods for proper storing, searching and retrieving are necessary

    Databases are the means to handle this data overload

    What can databases do?

    Make biological data available ...

    1. to scientists

    2. in computer-readable form.

    analysis (computer based)

    handle and share large volumes of data

    interface for computer-based systems (algorithms, Web interfaces)

    Store data

    defined formats

    automated storage and retrieval of experimental data

    Link knowledge with external resources

    What databases are available for us?

    Sequences submitted directly by scientists and genome sequencing groups, and sequences taken from literature and patents

    Entries in the EMBL, GenBank and DDBJ databases are synchronized on a daily basis

    Accession numbers are managed in a consistent manner

    Comparatively little error checking and fair amount of redundancy

    http://www.ddbj.nig.ac.jp

    www.ncbi.nlm.nih.gov/

    www.ebi.ac.uk/embl/

    What databases are available for us?

    Database            URL                    Content
    GenBank/DDBJ/EMBL   www.ncbi.nlm.nih.gov   Nucleotide sequences
    Ensembl             www.ensembl.org        Human/mouse genome
    PubMed              www.ncbi.nlm.nih.gov   Literature references
    NR                  www.ncbi.nlm.nih.gov   Protein sequences
    Swiss-Prot          www.expasy.org         Protein sequences
    InterPro            www.ebi.ac.uk          Protein domains
    OMIM                www.ncbi.nlm.nih.gov   Genetic diseases
    Enzymes             www.expasy.org         Enzymes
    PDB                 www.rcsb.org/pdb/      Protein structures
    KEGG                www.genome.ad.jp       Metabolic pathways

    Minimal content of an entry in a sequence database

    Sequence

    Accession number (AC) (never changes)

    Taxonomic data

    References

    Annotation/Curation

    Keywords

    Cross-references

    Documentation
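
    The minimal entry listed above can be sketched as a simple record type. This is only an illustration: the class name SequenceEntry, the field names and the placeholder sequence "ACGT" are assumptions, not the schema of any real database.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceEntry:
    """Hypothetical record holding the minimal content of a database entry."""
    accession: str                      # stable identifier (never changes)
    sequence: str                       # nucleotide or amino acid sequence
    taxonomy: str = ""                  # source organism / lineage
    references: list = field(default_factory=list)
    annotation: str = ""                # free-text annotation/curation notes
    keywords: list = field(default_factory=list)
    cross_references: dict = field(default_factory=dict)  # db name -> identifier
    documentation: str = ""


# Placeholder values for illustration only; "ACGT" is not a real record sequence.
entry = SequenceEntry(
    accession="Z92910",
    sequence="ACGT",
    taxonomy="Homo sapiens",
    keywords=["haemochromatosis", "HFE gene"],
)
print(entry.accession, entry.keywords)
```

    Keeping the accession as a required field mirrors the point that it is the one stable handle on the record.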

    The Perfect Database

    1. Comprehensive, but easy to search.

    2. Annotated, but not too annotated.

    3. A simple, easy to understand structure.

    4. Cross-referenced.

    5. Minimum redundancy.

    6. Easy retrieval of data

    How do you read an entry in GenBank

    LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.

    ACCESSION: A unique identifier for that record, a citable entity; does not change when the record is updated. A good record identifier, ideal for citation in publications.

    VERSION: New system where the accession and version play the same function as the accession and gi number.

    Nucleotide gi: Geninfo identifier (gi), a unique integer which will change every time the sequence changes.

    PID: Protein Identifier: g, e or d prefix to gi number. Can have one or two on one CDS.

    Protein gi: Geninfo identifier (gi), a unique integer which will change every time the sequence changes.

    Protein_id: Identifier which has the same structure and function as the nucleotide accession.version numbers, but a slightly different format.

    1 The LOCUS field consists of five different subfields:

    1a Locus Name (HSHFE) - The locus name is a tag for grouping similar sequences. The first two or three letters usually designate the organism. In this case HS stands for Homo sapiens. The last several characters are associated with another group designation, such as gene product. In this example, the last three characters represent the gene symbol, HFE. Currently, the only requirement for assigning a locus name to a record is that it is unique.

    1b Sequence Length (12146 bp) - The total number of nucleotide base pairs (or amino acid residues) in the sequence record.

    1c Molecule Type (DNA) - Type of molecule that was sequenced. All sequence data in an entry must be of the same type.

    1d GenBank Division (PRI) - There are different GenBank divisions. In this example, PRI stands for primate sequences. Some other divisions include ROD (rodent sequences), MAM (other mammal sequences), PLN (plant, fungal, and algal sequences), and BCT (bacterial sequences).

    1e Modification Date (23-July-1999) - Date of most recent modification made to the record. The date of first public release is not available in the sequence record. This information can be obtained only by contacting NCBI by e-mail.
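
    The five LOCUS subfields (1a to 1e) can be pulled apart with a few lines of Python. This is a minimal sketch: the example line is assembled from the values on the slides, and the parser assumes exactly this simple whitespace-separated layout. Real GenBank records may carry extra tokens (for example a topology field), which this sketch does not handle. The names Locus and parse_locus are illustrative.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str               # 1a locus name
    length: int             # 1b sequence length
    unit: str               # "bp" for nucleotides, "aa" for proteins
    molecule_type: str      # 1c molecule type
    division: str           # 1d GenBank division
    modification_date: str  # 1e modification date


def parse_locus(line: str) -> Locus:
    """Split a simple LOCUS line into its subfields (assumed 7-token layout)."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS":
        raise ValueError(f"unexpected LOCUS line layout: {line!r}")
    _, name, length, unit, molecule, division, date = fields
    return Locus(name, int(length), unit, molecule, division, date)


# Example line built from the slide values (an assumption, not copied from NCBI).
example = "LOCUS       HSHFE      12146 bp    DNA     PRI     23-JUL-1999"
locus = parse_locus(example)
print(locus.name, locus.length, locus.division)
```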

    2 DEFINITION - Brief description of the sequence. The description may include source organism name, gene or protein name, or designation as untranscribed or untranslated sequences (e.g., a promoter region). For sequences containing a coding region (CDS), the definition field may also contain a completeness qualifier such as "complete CDS" or "exon 1."

    3 ACCESSION (Z92910) - Unique identifier assigned to a complete sequence record. This number never changes, even if the record is modified. An accession number is a combination of letters and numbers, usually in the format of one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456).
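
    The two usual accession formats described above translate directly into a regular expression. The sketch below checks only those two patterns; the slide says "usually", so other valid formats exist and would be rejected here. The name looks_like_accession is illustrative.

```python
import re

# One letter + five digits (M12345) or two letters + six digits (AC123456).
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def looks_like_accession(text: str) -> bool:
    """Return True if text matches one of the two usual accession formats."""
    return bool(ACCESSION_RE.match(text))


for candidate in ("Z92910", "AC123456", "Z92910.1", "HSHFE"):
    print(candidate, looks_like_accession(candidate))
```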

    4 VERSION (Z92910.1) - Identification number assigned to a single, specific sequence in the database. This number is in the format accession.version. If any changes are made to the sequence data, the version part of the number will increase by one. For example, U12345.1 becomes U12345.2. A version number of Z92910.1 for this HFE sequence indicates that the sequence data has not been altered since its original submission.
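
    Because the identifier has the form accession.version, comparing two identifiers shows whether the sequence data changed. A minimal sketch, with illustrative function names split_version and sequence_changed:

```python
def split_version(identifier: str) -> tuple:
    """Split 'accession.version' into (accession, version number)."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not in accession.version format: {identifier!r}")
    return accession, int(version)


def sequence_changed(old: str, new: str) -> bool:
    """True if 'new' is a later version of the same accession as 'old'."""
    old_acc, old_ver = split_version(old)
    new_acc, new_ver = split_version(new)
    if old_acc != new_acc:
        raise ValueError("identifiers belong to different records")
    return new_ver > old_ver


print(split_version("Z92910.1"))
print(sequence_changed("U12345.1", "U12345.2"))
```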

    5 GI (1890179) - Also a sequence identification number. Whenever a sequence is changed, the version number is increased and a new GI is assigned. If a nucleotide sequence record contains a protein translation of the sequence, the translation will have its own GI number.

    6 KEYWORDS (haemochromatosis; HFE gene) - A keyword can be any word or phrase used to describe the sequence. Keywords are not taken from a controlled vocabulary. Notice that in this record the keyword, "haemochromatosis," employs British spelling, rather than the American "hemochromatosis." Many records have no keywords. A period is placed in this field for records without keywords.

    7 SOURCE (human) - Usually contains an abbreviated or common name of the source organism.

    8 ORGANISM (Homo sapiens) - The scientific name (usually genus and species) and phylogenetic lineage. See the NCBI Taxonomy Homepage for more information about the classification scheme used to construct taxonomic lineages.

    9 REFERENCE - Citations of publications by sequence authors that support information presented in the sequence record. Several references may be included in one record. References are automatically sorted from the oldest to the newest. Cited publications are searchable by author, article or publication title, journal title, or MEDLINE unique identifier (UID). The UID links the sequence record to the MEDLINE record.

    The FEATURES table

    A feature is simply an annotation that describes a portion of the sequence.

    Each feature includes a location (sequence location or interval) and one or several qualifiers.

    Clicking on the feature name will open a record for the sequence interval identified in the feature location.

    A list of features can be found at http://www.ncbi.nlm.nih.gov/collab/FT/

    The FEATURES table

    source - An obligatory feature. The source gives the length of the entire sequence, the scientific name of the source organism, and the Taxon ID number.

    Other types of information that the submitter may include in this field are chromosome number, map location, clone, and strain identification.

    The FEATURES table

    gene - Sequence portion that delineates the beginning and end of a gene.

    exon - Sequence segment that contains an exon. Exons may contain portions of 5' and 3' UTRs (untranslated regions). The name of the gene to which the exon belongs and the exon number are provided.

    CDS - Sequence of nucleotides that code for amino acids of the protein product (coding sequence).

    This feature includes the translation into amino acids and may also contain gene name, gene product function, link to protein sequence record, and cross-references to other database entries.

    intron - Transcribed but spliced-out parts. Intron number is shown.

    polyA_signal - Identifies the sequence portion required for endonuclease cleavage of an mRNA transcript. Consensus sequence for the polyA signal is AATAAA.
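
    As an illustration of the polyA_signal feature, the consensus AATAAA can be located in a sequence with a plain substring search. This is only a sketch of exact-match scanning on one strand; the function name find_polya_signals and the test sequence are made up, and real annotation also considers variant signals and context.

```python
def find_polya_signals(sequence: str, motif: str = "AATAAA") -> list:
    """Return 1-based start positions of every exact motif occurrence."""
    seq = sequence.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)         # convert 0-based index to 1-based
        start = seq.find(motif, start + 1)  # allow overlapping matches
    return positions


# Made-up sequence for illustration.
print(find_polya_signals("ggcAATAAAttcaataaa"))
```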

    BASE COUNT - Base Count gives the total number of adenine (A), cytosine (C), guanine (G), and thymine (T) bases in the sequence.

    ORIGIN - Origin contains the sequence data, which begins on the line immediately below the field title.
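
    The BASE COUNT values can be recomputed from the sequence data under ORIGIN. A minimal sketch, assuming the sequence is available as a string; position numbers and spaces, as found in ORIGIN blocks, are simply ignored. The function name base_count and the input string are illustrative.

```python
from collections import Counter


def base_count(sequence: str) -> dict:
    """Count A, C, G and T, ignoring case and any other characters."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


# Made-up fragment with a position number and a space, as in an ORIGIN line.
print(base_count("1 gatt aca"))
```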

    Basic Local Alignment Search Tool (BLAST)

    BLAST is an algorithm for comparing primary biological sequence information (amino-acid or nucleotide sequences).

    It enables comparison of a query sequence with a library or database of sequences and identifies sequences that resemble the query sequence above a certain threshold.

    BLAST is one of the most widely used bioinformatics programs.

    It addresses a fundamental problem: finding sequences similar to a query in a large database.

    The algorithm emphasizes speed over sensitivity (practical on the huge genome databases currently available).

    Variants:

    Nucleotide-nucleotide BLAST (blastn)

    Protein-protein BLAST (blastp)

    Nucleotide 6-frame translation-protein (blastx)
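
    The "6-frame translation" used by blastx means translating the nucleotide query in three reading frames on each strand. The sketch below shows only that translation step with the standard genetic code. It is not BLAST itself, and the function names and the short test sequence are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```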

    BLAST

    To run, BLAST requires a query sequence to search for, and a sequence to search against (also called the target sequence) or a sequence database containing multiple such sequences.

    Input: sequence in FASTA or GenBank format

    Output: a graphical view of the hits found, a table of sequence identifiers for the hits with scoring data, and alignments between the sequence of interest and each hit, with the corresponding BLAST scores.

    NCBI: http://blast.ncbi.nlm.nih.gov/Blast.cgi
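
    FASTA input is plain text: a header line starting with ">" followed by one or more lines of sequence. The sketch below reads such text into a dictionary. It is a minimal parser for illustration only; the function name parse_fasta and the record names seq1 and seq2 are made up.

```python
def parse_fasta(text: str) -> dict:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up records for illustration.
example = ">seq1 example\nATGG\nCCTA\n>seq2\nGATTACA\n"
print(parse_fasta(example))
```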

    Now what?

    What other genes encode proteins that exhibit structures similar to your sequence? (Gene families)

    Do you find proteins that are related in lineage over a range of species? (Evolutionary biology)

    A phylogenetic tree shows the inferred evolutionary relationships among a set of species or sequences.

    To build one, we use other programs that are freely available online:

    MEGA

    ClustalW, ClustalX

    PHYLIP

    Data entered into MEGA

    Data analysed using MEGA

    A sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
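
    To make "arranging the sequences" concrete, here is a small pairwise global alignment (Needleman-Wunsch dynamic programming) sketch. It is a teaching illustration, not the method used by MEGA or Clustal; the scoring values (match +1, mismatch -1, gap -1) and the function name global_align are arbitrary assumptions.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (aligned_a, aligned_b, score) for a global pairwise alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = global_align("ACGT", "AGT")
print(aligned_a)
print(aligned_b)
print(total)
```

    Gaps ("-") mark positions where one sequence has no counterpart in the other; the aligned columns are what tree-building programs compare.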

    Data analyses in Clustal W

    Tree generated using MEGA (tree figure not reproduced; node support values and the 0.05 scale bar are omitted). Sequences included in the tree:

    V. vinifera BAD18977 VvMYBA1
    V. vinifera AB097924 VvmybA2
    A. thaliana ABB03879 PAP1
    L. esculentum AAQ55181 LeANT1
    Petunia x hybrida AAF66727 An2 protein
    Gh CAD87010 MYB10
    A. majus ABB83828 VENOSA
    A. majus ABB83826 ROSEA1
    A. majus ABB83827 ROSEA2
    M. d DQ886415 MYB1-1
    V. vinifera AAS68190 Myb transcription factor
    A. thaliana Q9FJA2 TT2
    Z. mays AAA33482 c1 locus myb homologue
    Z. mays AAA19821 transcriptional activator
    A. andreanum MYB1 AAO92352.1
    Fragaria x ananassa AAK84064 transcription factor MYB1
    A. majus CAA55725 mixta
    A. thaliana ABB03913 MYB12
    Petunia x hybrida AAV98200 MYB-like protein ODORANT1
    D. carota BAE54312 transcription factor DcMYB1 ...
    N. tabacum BAA88222 myb-related transcription factor

    Conclusions

    A database is designed to offer an organized mechanism for storing, managing and retrieving information.

    BLAST is an algorithm for comparing primary biological sequence information (amino-acid or nucleotide sequences).

    Nucleotide-nucleotide BLAST (blastn)

    Protein-protein BLAST (blastp)

    Nucleotide 6-frame translation-protein (blastx)

    Programs used to perform multiple alignments and generate phylogenetic trees: MEGA, ClustalW, ClustalX, PHYLIP