Introduction to databases
part 2
Shifra Ben-Dor
Irit Orr
And now, for the molecules
and databases...
• DNA
• RNA
• Protein
DNA sequences
• Genes are encoded in genomic sequences.
• Genes are transcribed into mRNAs
(including coding, intronic, 5’ and 3’
untranslated regions).
• mRNAs are spliced (introns removed) and
translated into proteins.
• mRNAs are copied to cDNAs
[Figure: gene structure at three levels. Genomic DNA: promoter, TSS, exons 1-4 with ATG and Stop, PolyA site, TTS. Pre-mRNA: exons 1-4 with ATG, Stop and PolyA site. mRNA: Cap, 5’ UTR, CDS (exons 1-4, ATG to Stop), 3’ UTR, PolyA. Modified from Zhang MQ, Nat Rev Genet. 2002 Sep;3(9):698-709.]
International DNA databases
GenBank at NCBI
http://www.ncbi.nlm.nih.gov/
EMBL at EBI
http://www.ebi.ac.uk/embl/
DDBJ in Japan
http://www.ddbj.nig.ac.jp/
DATA sources for DNA databases
• Direct scientist submission
• Genome sequencing labs and groups
• Scientific literature
• Patent applications
• EMBL, GenBank and DDBJ collaborate to collect all sequence data reported around the world.
International DNA databases
All of these databases:
• Are updated with a new release every 2-3 months.
• Have weekly (or daily) updates between releases.
• Are divided into sublibraries for easier searching.
DNA database divisions
• PRI - primate (human, monkey)
• ROD - rodent (mouse, rat)
• MAM - other mammalian (bovine, cat)
• VRT - other vertebrate (chicken)
• INV - invertebrate
• PLN - plant, fungal, and algal
• BCT - bacteria
• VRL - viruses
• PHG - bacteriophage
• SYN - synthetic (plasmids, vectors)
• UNA - unannotated sequences
• PAT - patent sequences
• EST - Expressed Sequence Tags
• STS - Sequence Tagged Sites
• GSS - Genome Survey Sequences
• HTG - High Throughput Genomic Sequences
• HTC - High Throughput cDNA Sequences
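The division codes above lend themselves to a simple lookup table. The following is a minimal sketch in Python; the dictionary contents are copied from the list, while the names `DIVISIONS` and `describe_division` are hypothetical helpers introduced only for illustration.

```python
# Division codes and descriptions, copied from the list above.
DIVISIONS = {
    "PRI": "primate (human, monkey)",
    "ROD": "rodent (mouse, rat)",
    "MAM": "other mammalian (bovine, cat)",
    "VRT": "other vertebrate (chicken)",
    "INV": "invertebrate",
    "PLN": "plant, fungal, and algal",
    "BCT": "bacteria",
    "VRL": "viruses",
    "PHG": "bacteriophage",
    "SYN": "synthetic (plasmids, vectors)",
    "UNA": "unannotated sequences",
    "PAT": "patent sequences",
    "EST": "Expressed Sequence Tags",
    "STS": "Sequence Tagged Sites",
    "GSS": "Genome Survey Sequences",
    "HTG": "High Throughput Genomic Sequences",
    "HTC": "High Throughput cDNA Sequences",
}


def describe_division(code: str) -> str:
    """Return the description for a three-letter division code.

    Hypothetical helper: unknown codes give a fixed fallback string.
    """
    return DIVISIONS.get(code.strip().upper(), "unknown division")


print(describe_division("rod"))
```

Restricting a search to one division (for example ROD when only mouse or rat sequences matter) is what makes the sublibraries useful in practice.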
Genomic databases
• Specialized resources that are:
– Species specific
– Sequencing technique specific
• Display whole chromosomes (not a
specific sequence).
Sources of mRNAs
• Experimental
– Clone new gene
– Clone gene from database
– 2 hybrid system
• Database
– “Typical” cDNA
– Full length cDNA
– EST
[Figure: an mRNA (5’ mG cap, AAAA tail) compared with a full length cDNA and a typical cDNA, showing the primers and the TTTT/AAAA ends.]
Sources of mRNAs
Source                              Accession Numbers
• Individual Labs                   various
• RefSeq                            NM
• Kazusa (KIAA)                     D, AB
Full Length Sequencing projects:
• Riken                             AK; ends - AV or BB
• NEDO (FLJ)                        AK
• German (DKFZ)
• MGC                               BC
REFSEQ from NCBI
(Reference sequence database)
• Definition
The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
REFSEQ from NCBI
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current knowledge of sequence data and biology
• data validation and format consistency
• distinct accession series
• ongoing curation by NCBI staff and collaborators, with reviewed records indicated
RefSeq
Status Codes:
RefSeq records are provided with a status code which provides an indication of the level of review a RefSeq record has undergone.
• Reviewed
• Provisional
• Predicted
• Genome Annotation
• Validated
• Model
• Inferred
• WGS
Accession Format    Molecule Type
NC_123456           Complete Genome, Complete Chromosome, or Complete Sequence
NG_123456           Genomic Region
NM_123456           mRNA
NP_123456           Protein
NT_123456           Genomic Contig (from BACs)
NW_123456           Genomic Contig (from WGS)
XM_123456           mRNA (taken from genomic seq)
XR_123456           RNA (taken from genomic seq)
XP_123456           Protein (taken from genomic seq)
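Because the molecule type is encoded in the accession prefix, it can be read off programmatically. The sketch below is illustrative only: the prefix table is taken from the slide, while the function name `refseq_molecule_type` and the assumed accession shape (two letters, an underscore, digits, and an optional ".version" suffix) are assumptions rather than an official NCBI parser.

```python
import re

# Prefix meanings as listed in the table above.
REFSEQ_PREFIXES = {
    "NC": "complete genome, chromosome, or sequence",
    "NG": "genomic region",
    "NM": "mRNA",
    "NP": "protein",
    "NT": "genomic contig (from BACs)",
    "NW": "genomic contig (from WGS)",
    "XM": "mRNA (taken from genomic seq)",
    "XR": "RNA (taken from genomic seq)",
    "XP": "protein (taken from genomic seq)",
}

# Assumed shape: two-letter prefix, underscore, digits, optional ".version".
_ACCESSION = re.compile(r"([A-Z]{2})_(\d+)(?:\.(\d+))?")


def refseq_molecule_type(accession: str):
    """Return the molecule type for a RefSeq-style accession, or None."""
    match = _ACCESSION.fullmatch(accession.strip().upper())
    if match is None:
        return None  # not in the assumed RefSeq format
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ["NM_123456", "XP_123456.2", "AK123456"]:
    print(acc, "->", refseq_molecule_type(acc))
```

An accession without the underscore, such as a plain GenBank record, returns `None`, which is a quick way to separate RefSeq records from primary submissions in a mixed list.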
NEDO Full Length mRNA Sequencing
About 160,000 clones were isolated from more than 20 full-length enriched human cDNA libraries made by the "Oligo-capping" method, and their 5' end sequences were determined.
About 10,000 putatively full-length cDNAs were selected using these sequence data, and the entire sequence of each selected clone was determined. The NEDO project aims to determine the sequence of an additional 20,000 full-length cDNA clones.
RIKEN
Mouse Genome Encyclopedia
A project to sequence full length mouse cDNAs.
Over 21,000 genes sequenced from oligo-capped libraries from about 200 tissues and cell types.
Set standards for annotation with FANTOM (Functional Annotation Of Mouse).
Kazusa
Large cDNA inserts (> 4 kb).
Determined the complete base sequences of approximately 2000 previously unidentified cDNAs from KG-1 cells and brain tissue, with an average length of 5 kb.
Database: HUGE (Human Unidentified Gene-Encoded
protein database)
http://www.kazusa.or.jp/huge
MGC - Mammalian Gene Collection
The NIH Mammalian Gene Collection (MGC) seeks
to identify and sequence a representative full open
reading frame (ORF) clone for each human and
mouse gene.
MGC has produced over 80 cDNA libraries enriched
for full-length cDNAs derived from human tissue
and cell lines, and mouse tissue.
5' EST reads are generated from each library.
Several algorithms are applied to select putative full
ORF clones.
RNA
RNA, cDNA, and ESTs
[Figure: an mRNA (exon 1, exon 2, exon 3), the cDNA clone derived from it, and the ESTs read from that cDNA clone.]
GenBank ESTs (Expressed Sequence Tags):
~ 6,000,000 human ESTs
~ 4,300,000 mouse ESTs
Adapted with permission from Adam Sartiel
Problems with ESTs
- low copy number genes
- rare tissues
- mistakes
- enrichment of 3’ ends of genes
- incomplete coverage of genes
Uses of ESTs
- prediction of coding regions
- detection of alternative splicing
- clustering to form “genes”
Problems with clustering:
- incomplete coverage breaks genes up
- gene families
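The first clustering problem can be illustrated with a toy example. The sketch below assumes, purely for illustration, that each EST is already placed as a (start, end) interval on its transcript; real EST clustering works from sequence similarity, not known coordinates, and the function name `cluster_ests` and the coordinates are hypothetical.

```python
def cluster_ests(ests):
    """Merge overlapping (start, end) intervals into clusters.

    Toy stand-in for EST clustering: two ESTs end up in the same
    cluster only if a chain of overlaps connects them.
    """
    clusters = []
    for start, end in sorted(ests):
        if clusters and start <= clusters[-1][1]:
            # Overlaps the current cluster, so extend it.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            # No overlap with anything seen so far: start a new cluster.
            clusters.append([start, end])
    return [tuple(cluster) for cluster in clusters]


# Hypothetical ESTs from one gene: reads near both ends, none in the middle.
ests = [(0, 400), (300, 700), (1500, 1900), (1800, 2200)]
print(cluster_ests(ests))  # [(0, 700), (1500, 2200)]
```

With no EST covering positions 700 to 1500, the single gene is reported as two clusters. Gene families cause the opposite error: similar sequences from different genes may be merged into one cluster.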
• With the increasing sequencing and annotation of key genomes, having a gene-based view of the resultant information is useful. Entrez Gene has therefore been implemented to supply key connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information. These gene identifiers are tracked, and information is added when available. Entrez Gene can be considered as the successor to LocusLink, with the major differences being in greater scope (more of the genomes represented by NCBI Reference Sequences or RefSeqs) and in being integrated for indexing and query in NCBI's Entrez system.
Data reliability in databases
• The huge amount of data collected in databases presents a lot of problems:
– Data accuracy
– Sequence redundancy
– Inconsistent nomenclature
– Inaccurate annotation
– Sequence contamination (vectors, bacterial)
Data reliability in databases
• The database staff notify the authors that an error (or contamination) was detected in their sequence entry.
• However, it takes time to correct the data.
• Meanwhile the error propagates, because many of the proteins in the protein databases are translated from the DNA sequence databases.
Data reliability in databases
• A lot of the sequences in the databases are quite “old”. They have not been updated since they were submitted, even though technology and data have advanced considerably since then.
HUGO Gene Nomenclature Committee
• This committee is responsible for the approval of a unique symbol for each gene.
• It also designs a longer and more descriptive name.
• The committee makes considerable efforts to use symbols acceptable to workers in the field, but sometimes it is not possible to use exactly what has previously appeared in the literature.
• However, wherever the committee is aware of such symbols, they are listed as aliases in the Genew database (http://www.gene.ucl.ac.uk/cgibin/nomenclature/searchgenes.pl).
Gene symbols
Gene symbols are designated by upper case Latin letters or
by a combination of upper-case letters and Arabic numbers.
Symbols should be short in order to be useful, and should
not attempt to represent all known information about a
gene.
Ideally symbols should be no longer than six characters in
length.
Based on classical genetic guidelines, it is recommended that
gene symbols are either underlined or italicized when
referring to genotypic information (phenotypic information
is represented in standard fonts).
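The symbol rules above can be turned into a simple check. This is a minimal sketch, not the committee's own validation: the function name `check_symbol` is hypothetical, and requiring the first character to be a letter is an added assumption that the slide does not state.

```python
import re

# Upper-case Latin letters and Arabic numerals only.
# Assumption (not on the slide): the symbol starts with a letter.
_SYMBOL = re.compile(r"[A-Z][A-Z0-9]*")

RECOMMENDED_MAX_LENGTH = 6  # "ideally no longer than six characters"


def check_symbol(symbol: str) -> list:
    """Return a list of guideline problems; an empty list means none found."""
    problems = []
    if not _SYMBOL.fullmatch(symbol):
        problems.append("use only upper-case Latin letters and Arabic numerals")
    if len(symbol) > RECOMMENDED_MAX_LENGTH:
        problems.append("longer than the recommended six characters")
    return problems


for candidate in ["ABCA1", "abca1", "ABCDEFG1"]:
    print(candidate, check_symbol(candidate))
```

The length rule is only a recommendation ("ideally"), so it is reported as a problem to review rather than treated as a hard failure.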
Gene Symbols
Symbol: ABCA1
Full name: ATP-binding cassette, sub-family A (ABC1), member 1
Cytogenetic Location: 9q31
MIM Number: 600046
PubMed ID: 8088782
Protein databases
• There are many different protein databases containing different types of information:
– Primary amino acid sequence
– Secondary structure
– 3D structure
– Protein family domains
– Consensus active sites
Sources of Protein
• Proteins that have been worked on
experimentally
• mRNA whose product has been
worked on experimentally (no actual
protein sequencing done)
• Translated DNA (mRNA) sequences
Protein Primary Sequence Databases
• Usually contain a description of the protein entry (annotation), the amino acid sequence and sometimes links to other related databases.
• Swiss-Prot, from the University of Geneva (now the Swiss Institute of Bioinformatics), is a curated protein database which strives to provide a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.
UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
• The UniProt Knowledgebase (UniProt) is the central access point for extensive curated protein information, including function, classification, and cross-reference.
• The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed searches.
• The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.
Swiss-Prot Database (primary database)
• Swiss-Prot annotation includes:
– Description of protein function
– Protein domain structure
– Post-translational modifications
– Protein variants
• Sequence entries are composed of different line-types, each with their own format. For standardization purposes the format of Swiss-Prot follows as closely as possible that of the EMBL (DNA) Database.
Swiss-Prot Database
Swiss-Prot differs from other protein databases by the following criteria:
• Annotation
• Minimal Redundancy
• Integration with other databases
Swiss-Prot Database
• Annotation
In Swiss-Prot, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation.
The core data consists of the sequence, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein).
The annotation consists of the description of:
• Function(s) of the protein
• Post-translational modification(s). For
example carbohydrates, phosphorylation,
acetylation, GPI-anchor, etc.
• Domains and sites. For example calcium
binding regions, ATP-binding sites, zinc
fingers, etc.
• Secondary structure
• Quaternary structure. For example homodimer, heterotrimer, etc.
• Similarities to other proteins
• Disease(s) associated with deficiency(s) of/in the protein
• Sequence conflicts, variants, etc.
Swiss-Prot Database
To obtain this information, Swiss-Prot uses, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins.
Swiss-Prot also makes use of external experts, who have been recruited to send their comments and updates concerning specific groups of proteins.
Swiss-Prot Database
• Minimal Redundancy
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In SWISS-PROT, they try as much as possible to merge all these data so as to minimize the redundancy of the database.
If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
Swiss-Prot Database
• Integration with other databases
It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections.
SWISS-PROT is currently cross-referenced with ~100 different databases. Cross-references are provided in the form of pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT.
TrEMBL database
• TrEMBL is a computer-annotated
supplement of SWISS-PROT that
contains all the translations of the
EMBL (DNA) database.
• TrEMBL contains entries not yet
integrated in SWISS-PROT.
NR database
(primary databases from NCBI)
• The NR Protein database contains
sequence data from the translated
coding regions from DNA sequences in
GenBank, EMBL and DDBJ as well as
protein sequences submitted to PIR,
SWISSPROT, PRF, PDB (sequences from
solved structures).
Data reliability in Protein databases
• About 30% of the proteins in the databases have erroneous sequences due to:
– Missing exons in the DNA translation.
– Introns mistakenly translated.
• Another common problem is the assigning of functions to “new” proteins, based on sequence similarity.
Data reliability in Protein
databases
• For example:
– Protein A is similar to protein B.
– Protein B annotation is based on Protein A
annotation (which has an error).
– Annotation of Protein A is corrected by the
group working on it. This correction is not
reflected in the annotation of Protein B.
– When Protein C and D are also based on the
erroneous annotation on B, the problem…...
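The propagation described in this example can be sketched as a toy model. Everything here is hypothetical: the protein names A to D follow the slide, the annotation strings "kinase" and "phosphatase" are made up, and real databases do not store annotation transfer this simply.

```python
# Current annotation of each protein. "kinase" stands in for the
# erroneous annotation that Protein A starts with (made-up value).
annotations = {"A": "kinase"}
# Which protein each annotation was copied from.
copied_from = {}


def annotate_by_similarity(new: str, source: str) -> None:
    """Copy the current annotation of `source` onto `new`."""
    annotations[new] = annotations[source]
    copied_from[new] = source


annotate_by_similarity("B", "A")  # B is annotated from A
annotate_by_similarity("C", "B")  # C and D are annotated from B
annotate_by_similarity("D", "B")

# The group working on A corrects its annotation; nothing else changes.
annotations["A"] = "phosphatase"


def stale_entries(root: str) -> list:
    """List proteins whose annotation traces back to `root` but differs from it."""
    stale = []
    for protein in sorted(copied_from):
        origin = protein
        while origin in copied_from:  # walk back to the original source
            origin = copied_from[origin]
        if origin == root and annotations[protein] != annotations[root]:
            stale.append(protein)
    return stale


print(stale_entries("A"))  # ['B', 'C', 'D']
```

Finding the stale entries is only possible here because `copied_from` records where each annotation came from. When that provenance is not recorded, the corrected and uncorrected entries cannot be linked.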
http://www.geneontology.org/
Text searching pitfalls
• It finds exactly what you type
• Older records may have different
annotation, from gene names on…
• human vs Homo sapiens
• Most sites use boolean operators (AND, OR, BUT NOT)
• Can do (or add) a field specific tag - but each site has a different way of adding it to a search - for example, NCBI uses square brackets []
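As a rough illustration of the last two points, the sketch below builds an NCBI-style query string with square-bracket field tags and boolean operators. The helper names `field_term` and `combine` are hypothetical, and the field names `Organism` and `Gene Name` are used here as assumed examples; check the search help of each site for the tags it actually accepts. NCBI's Entrez writes the negation operator as NOT rather than BUT NOT.

```python
def field_term(value: str, field: str) -> str:
    """Format one search term with a field tag in square brackets."""
    # Quote multi-word values so they are searched as a phrase.
    text = f'"{value}"' if " " in value else value
    return f"{text}[{field}]"


def combine(operator: str, *terms: str) -> str:
    """Join terms with a boolean operator and wrap the group in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


# Cover both spellings of the organism, since text search matches what you type.
organism = combine(
    "OR",
    field_term("human", "Organism"),
    field_term("homo sapiens", "Organism"),
)
query = combine("AND", organism, field_term("ABCA1", "Gene Name"))
print(query)
# ((human[Organism] OR "homo sapiens"[Organism]) AND ABCA1[Gene Name])
```

Grouping the alternative spellings with OR before adding the gene term with AND addresses the "human vs Homo sapiens" pitfall without changing the rest of the search.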
Remember:
Text searching is NOT sequence
similarity searching! You may not
find all related sequences by text
searching!