
genetic databases ppt


INTRODUCTION

• A database is a collection of data that is organized so that its contents can easily be accessed, managed, and modified by a computer. In the biosciences, a database is a curated repository of raw data containing annotations, further analysis, and links to other databases.


• Currently, a lot of bioinformatics work is concerned with the technology of databases.

• These databases include both "public" repositories of gene data, like GenBank or the Protein Data Bank (PDB), and private databases, like those used by research groups involved in gene mapping projects or those held by biotech companies.


THE “UNITS OF INFORMATION”

DNA RNA PROTEIN SEQUENCE STRUCTURE EVOLUTION PATHWAYS STRUCTURE MUTATION


BASIC STRUCTURE

CORE DATA: the data that the database was created to organize.

ANNOTATION: extra information that rounds out our picture of the core data.


THE MAIN SEQUENCE DATABASES

• There are three main nucleic acid sequence databases and one main protein sequence database in widespread general use.

• For nucleic acids these are EMBL, GenBank and DDBJ.

• For proteins this is SWISS-PROT.


THE DNA DATABASES

• DATA SOURCES FOR DNA DATABASES

• Direct scientific submission

• Genome sequencing labs and groups

• Scientific literature

• Patent applications


DIFFERENT DNA DATABASES

Important DNA databases are:

• GenBank at NCBI (http://www.ncbi.nlm.nih.gov/)

• EMBL at EBI (http://www.ebi.ac.uk/embl/)

• DDBJ in Japan (http://www.ddbj.nig.ac.jp/)


GenBank (Genetic Sequence Databank)

• One of the fastest growing repositories of known genetic sequences.

• Its records use a flat-file format: plain ASCII text that is readable by both humans and computers.

• In addition to sequence data, GenBank files contain information like accession numbers and gene names, phylogenetic classification and references to published literature.
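Because a flat file is plain text with keyword-led lines, a record can be read with very little code. The following is a minimal sketch in Python, not a full GenBank parser: the sample record is made up for illustration (its accession, definition and sequence are hypothetical), and the function name `parse_flat_file` is an assumption of this example. It ignores continuation lines and the feature table.

```python
# Illustrative record only: the layout imitates the GenBank flat-file style,
# but the accession, definition and sequence are invented for this sketch.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   UNA
DEFINITION  Hypothetical example sequence.
ACCESSION   EXAMPLE01
SOURCE      synthetic construct
ORIGIN
        1 atggccaaat aagctagcta gcta
//
"""


def parse_flat_file(text):
    """Collect top-level keyword fields and the sequence from one record."""
    fields = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_sequence:
            # Keep letters only: drops the position numbers and spaces.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):    # sequence lines follow this keyword
            in_sequence = True
        elif line[:1].isalpha():           # keyword starts in the first column
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["sequence"] = "".join(sequence_parts).upper()
    return fields


record = parse_flat_file(SAMPLE_RECORD)
print(record["ACCESSION"], len(record["sequence"]))
```

For real records, an established library is a safer choice than a hand-written parser, since the actual format has multi-line fields and a nested feature table.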


The EMBL Nucleotide Sequence Database

• The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications, and directly submitted by researchers and sequencing groups.


RefSeq from NCBI (Reference Sequence database)

• The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.


Entrez Gene

• Entrez Gene is a database for gene-specific information.

• It does not include all known or predicted genes.

• Instead, Entrez Gene focuses on genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis.


  HOW THE DATA IS ENTERED

• Sequences are placed in the databases from published papers describing them, or

• More commonly, they are submitted directly by their authors to the database organisations.

• The sequence is then deemed to belong to the submitting author, and only they can update or amend it.


Contd.

Webin is the WWW site for submitting nucleotide sequence data and associated biological information to the EMBL database at the EBI.

BankIt is the equivalent NCBI WWW site for submitting to GenBank.

Sequin is a programme that can be downloaded and run on the author's local computer to prepare a sequence for submission. The result is then sent by e-mail to the NCBI or the EBI.


DATA RELIABILITY IN DATABASES

• The huge amount of data collected in DNA databases presents several problems:

• Data accuracy

• Redundancy

• Inconsistent nomenclature

• Inaccurate annotation

• Sequence contamination (vector or bacterial sequence)
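Of the problems listed above, exact redundancy is the easiest to illustrate. The following is a minimal Python sketch that groups entries carrying the same sequence; the identifiers and sequences are invented, and the function name `group_identical` is an assumption of this example. It only catches exact duplicates after normalising case and whitespace, so near-identical sequences would need a similarity search instead.

```python
from collections import defaultdict


def group_identical(entries):
    """Map each distinct sequence to the list of identifiers that carry it."""
    groups = defaultdict(list)
    for identifier, sequence in entries:
        # Normalise so that case and spacing differences do not hide duplicates.
        normalised = "".join(sequence.split()).upper()
        groups[normalised].append(identifier)
    return dict(groups)


# Hypothetical entries: entry_a and entry_b hold the same sequence.
entries = [
    ("entry_a", "ATGGCC"),
    ("entry_b", "atg gcc"),
    ("entry_c", "ATGTTT"),
]
groups = group_identical(entries)
redundant = {seq: ids for seq, ids in groups.items() if len(ids) > 1}
print(redundant)
```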


THE MAIN PROTEIN DATABASES

• SOURCES OF PROTEIN SEQUENCES

• Proteins that have been worked on experimentally

• mRNA whose product has been worked on experimentally (no actual protein sequencing done)

• Translated DNA (mRNA) sequences


INFORMATION CONTAINED IN PROTEIN DATABASES

• Primary amino acid sequences

• Secondary structure

• 3D structure

• Protein family domains

• Consensus active sites


PROTEIN PRIMARY SEQUENCE DATABASES

• SWISS-PROT

• SWISS-PROT is the main protein sequence database.

• Produced collaboratively by Amos Bairoch (University of Geneva) and the EBI.

• The data in SWISS-PROT are derived from translations of DNA sequences from the EMBL Nucleotide Sequence Database.

• SWISS-PROT is a curated protein database, which strives to provide a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.


• The core data consist of the sequence, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein).


The annotation consists of the description of:

• Function(s) of the protein

• Post-translational modification(s)

• Domains and sites

• Secondary structure

• Quaternary structure

• Similarities to other proteins

• Disease(s) associated with deficiency(s) of/in the protein

• Sequence conflicts, variants, etc.
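The split between core data and annotation can be pictured as a simple data structure. The following Python sketch is only an illustration of that split: the class and field names are hypothetical and do not reflect the actual SWISS-PROT entry format, and the example values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class CoreData:
    """The data the database exists to organize."""
    sequence: str
    references: list      # citation information
    taxonomy: str         # biological source of the protein


@dataclass
class Annotation:
    """Extra information that rounds out the core data (all optional)."""
    functions: list = field(default_factory=list)
    modifications: list = field(default_factory=list)   # post-translational
    domains_and_sites: list = field(default_factory=list)
    diseases: list = field(default_factory=list)
    variants: list = field(default_factory=list)


@dataclass
class ProteinEntry:
    accession: str
    core: CoreData
    annotation: Annotation = field(default_factory=Annotation)


# Hypothetical entry: every value here is made up for illustration.
entry = ProteinEntry(
    accession="X00001",
    core=CoreData(sequence="MAK", references=["Example et al."], taxonomy="Example organism"),
)
entry.annotation.functions.append("hypothetical enzyme")
print(entry.accession, entry.annotation.functions)
```

Keeping the annotation separate and optional mirrors the idea that an entry is valid with core data alone and is enriched as curation proceeds.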


UniProt (Universal Protein Resource)

• UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR. The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed up searches. The UniProt Archive (UniParc) is a comprehensive repository reflecting the history of all protein sequences.


TrEMBL

• TrEMBL is a computer-annotated protein sequence database supplementing the SWISS-PROT database. It contains translations of all coding sequences present in the EMBL database that are not yet integrated into SWISS-PROT. TrEMBL can be considered a preliminary section of SWISS-PROT. It is split into two sections: SP-TrEMBL and REM-TrEMBL.

• SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries that should eventually be incorporated into SWISS-PROT. REM-TrEMBL (REMaining TrEMBL) contains the entries that are not to be incorporated into SWISS-PROT. It includes immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, small sequences, and coding sequence translations where there is strong evidence that the proteins are not real.
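The translation of a coding sequence into a protein sequence, on which TrEMBL is built, can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline TrEMBL actually uses: it assumes the standard genetic code, reads from the first base in a single frame, stops at the first stop codon, and writes "X" for any codon it does not recognise. The function name `translate` is an assumption of this example.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence (DNA or mRNA) up to the first stop codon."""
    cds = cds.upper().replace("U", "T")    # accept mRNA as well as DNA
    protein = []
    for start in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE.get(cds[start:start + 3], "X")
        if residue == "*":                 # stop codon ends the protein
            break
        protein.append(residue)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))  # prints MAK
```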


NR DATABASE (Primary Database from NCBI)

• The NR protein database contains sequence data from the translated coding regions of DNA sequences in GenBank, EMBL and DDBJ, as well as protein sequences submitted to PIR, SWISS-PROT, PRF and PDB (sequences from solved structures).

 


HOW THE DATA IS ENTERED

• The entries in SWISS-PROT are derived from much the same sources as the nucleotide database entries, with addition of translations of the coding sequences in EMBL entries.

• Submissions to SWISS-PROT of directly sequenced peptides should be made via the site at the EBI. The EBI does not provide accession numbers in advance for protein sequences that result from the translation of nucleic acid sequences. These translations are automatically forwarded to SWISS-PROT from the EMBL nucleotide database and are assigned SWISS-PROT accession numbers on incorporation into TrEMBL.


OTHER PROTEIN DATABASES

• GenPept is GenBank’s equivalent of TrEMBL. It is an automatic translation of all coding sequences present in the GenBank database.

• OWL is a non-redundant protein sequence database produced from SWISS-PROT, PIR, NRL-3D and GenPept.

• The International Protein Sequence Database (PIR) is a collaborative database from PIR, MIPS and JIPID that contains much the same sequence information as SWISS-PROT. However, it has a substantial number of duplicated sequence entries, is hard to read and is not well annotated. In particular, it lacks SWISS-PROT’s superb cross-referencing to other databases.

• NRL-3D is produced by PIR from sequence and annotation information extracted from the Brookhaven Protein Data Bank (PDB) of crystallographic 3D protein structures. It is useful for similarity searches.

• Kabat Database of Sequences of Proteins of Immunological Interest is a database of sequences involved in the immune system.


DATA RELIABILITY IN PROTEIN DATABASES

• About 30% of the proteins in the databases have erroneous sequences due to:

– Missing exons in the DNA translation.

– Introns mistakenly translated.


ACCESSING THE DATABASE

Following are the publicly available WWW sites for keyword and similarity searches:

• (1)   Entrez provides good cross-linking between the nucleic acid and protein databases and the Medline bibliographic database. A very powerful feature is the ability to find other entries similar to one already found.

• (2)   SRS, the Sequence Retrieval System, provides a powerful means of finding entries in related sets of databases.

• (3)   BLAST and FASTA are popular sequence similarity searching programmes, and both can be run through publicly available sites.
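Besides the web pages, entries can also be retrieved by program. The following Python sketch builds a retrieval URL for one nucleotide record; it assumes NCBI's E-utilities `efetch` service, which is not mentioned in these slides, and the helper name `build_efetch_url` is hypothetical. The accession U49845 is used purely as an example. The sketch only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities efetch endpoint (not named in the slides).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, database="nuccore"):
    """Return a URL requesting one record as a GenBank-style flat file."""
    params = {
        "db": database,       # which Entrez database to query
        "id": accession,      # accession number of the wanted entry
        "rettype": "gb",      # GenBank flat-file format
        "retmode": "text",    # plain text rather than XML
    }
    return EFETCH_URL + "?" + urlencode(params)


url = build_efetch_url("U49845")
print(url)
```

The resulting URL can be opened in a browser or passed to `urllib.request.urlopen`; anyone scripting many requests should check NCBI's current usage guidelines first.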


MISCELLANEOUS OTHER DATABASES


GENOMES

• GOLD: Genomes OnLine Database

• KEGG: Kyoto Encyclopedia of Genes and Genomes

• FlyBase