
Introduction to databases

part 2

Shifra Ben-Dor

Irit Orr

And now, for the molecules

and databases...

• DNA

• RNA

• Protein

DNA sequences

• Genes are encoded in genomic sequences.

• Genes are transcribed into mRNAs

(including coding, intronic, 5’ and 3’

untranslated regions).

• mRNAs are spliced (introns removed) and translated into proteins.

• mRNAs are copied to cDNAs

[Figure: structure of a gene at three levels. Genomic DNA: promoter, TSS, exons 1-4, ATG, Stop, PolyA site, TTS. Pre-mRNA: exons 1-4 with ATG, Stop, and PolyA site. mRNA: Cap, 5' UTR, CDS (ATG to Stop, exons 1-4), 3' UTR, PolyA. Modified from Zhang MQ, Nat Rev Genet. 2002 Sep;3(9):698-709.]
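
The path from pre-mRNA to protein described above can be sketched in a few lines of Python. This is a toy illustration only: the sequence and exon coordinates are invented, the function names are hypothetical, and the sequence is written in the DNA alphabet (T rather than U), as it would appear in a database record.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def splice(pre_mrna, exons):
    """Join the exon intervals (0-based, end-exclusive), dropping the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(cds):
    """Translate a coding sequence from its first codon up to the first stop."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented toy transcript: exon 1, one intron, exon 2.
pre_mrna = "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA"
exons = [(0, 6), (18, 24)]

mrna = splice(pre_mrna, exons)
print(mrna)             # ATGGCCAAATGA
print(translate(mrna))  # MAK
```

A real transcript also carries 5' and 3' untranslated regions, so translation would start at the ATG rather than at position 0.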

International DNA databases

GenBank at NCBI

http://www.ncbi.nlm.nih.gov/

EMBL at EBI

http://www.ebi.ac.uk/embl/

DDBJ in Japan

http://www.ddbj.nig.ac.jp/

DATA sources for DNA databases

• Direct scientist submission

• Genome sequencing labs and groups

• Scientific literature

• Patent applications

• EMBL, GenBank and DDBJ collaborate to collect all sequence data reported around the world.

International DNA databases

All of these databases:

• Are updated every 2-3 months.

• Have weekly (or daily) updates.

• Are divided into sublibraries for easier searching.

DNA database divisions

• PRI - primate (human, monkey)

• ROD - rodent (mouse, rat)

• MAM - other mammalian (bovine, cat)

• VRT - other vertebrate (chicken)

• INV - invertebrate

• PLN - plant, fungal, and algal

• BCT - bacteria

• VRL - viruses

• PHG - bacteriophage

• SYN - synthetic (plasmids, vectors)

• UNA - unannotated sequences

• PAT - patent sequences

• EST - Expressed Sequence Tags

• STS - Sequence Tagged Sites

• GSS - Genome Survey Sequences

• HTG - High Throughput Genomic Sequences

• HTC - High Throughput cDNA Sequences
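
In a GenBank flat-file record the division code appears on the LOCUS line. The sketch below looks the code up in a table built from the list above. It assumes the division is the second-to-last whitespace-separated field of the LOCUS line (just before the date); the function name is hypothetical and the example line is modelled on a typical record, so treat the exact layout as an assumption.

```python
# Division codes and descriptions, taken from the list above.
DIVISIONS = {
    "PRI": "primate",
    "ROD": "rodent",
    "MAM": "other mammalian",
    "VRT": "other vertebrate",
    "INV": "invertebrate",
    "PLN": "plant, fungal, and algal",
    "BCT": "bacteria",
    "VRL": "viruses",
    "PHG": "bacteriophage",
    "SYN": "synthetic",
    "UNA": "unannotated sequences",
    "PAT": "patent sequences",
    "EST": "Expressed Sequence Tags",
    "STS": "Sequence Tagged Sites",
    "GSS": "Genome Survey Sequences",
    "HTG": "High Throughput Genomic Sequences",
    "HTC": "High Throughput cDNA Sequences",
}


def division_of(locus_line):
    """Return (code, description) for a GenBank LOCUS line.

    Assumption: the division code is the second-to-last field,
    followed only by the date.
    """
    fields = locus_line.split()
    if len(fields) < 3 or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    code = fields[-2]
    return code, DIVISIONS.get(code, "unknown division")


example = "LOCUS       SCU49845                5028 bp    DNA     linear   PLN 21-JUN-1999"
print(division_of(example))  # ('PLN', 'plant, fungal, and algal')
```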

Genomic databases

• Specialized resources that are:

– Species specific

– Sequencing technique specific

• Display whole chromosomes (not a specific sequence).

Sources of mRNAs

• Experimental

– Clone new gene

– Clone gene from database

– 2 hybrid system

• Database

– “Typical” cDNA

– Full length cDNA

– EST

[Figure: an mRNA (5' mG cap, AAAA tail) shown above a full length cDNA and a typical cDNA, each drawn with a primer at the poly-A end (TTTT primer).]

Sources of mRNAs

Source                               Accession Numbers

• Individual labs                    various

• RefSeq                             NM

• Kazusa (KIAA)                      D, AB

Full-length sequencing projects:

• RIKEN                              AK; ends: AV or BB

• NEDO (FLJ)                         AK

• German (DKFZ)

• MGC                                BC

RefSeq from NCBI

(Reference Sequence database)

• Definition

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

RefSeq from NCBI

• Non-redundancy

• Explicitly linked nucleotide and protein sequences

• Updates to reflect current knowledge of sequence data and biology

• Data validation and format consistency

• Distinct accession series

• Ongoing curation by NCBI staff and collaborators, with reviewed records indicated

RefSeq

• Status Codes:

RefSeq records are provided with a status code, which indicates the level of review a RefSeq record has undergone.

• Reviewed

• Provisional

• Predicted

• Genome Annotation

• Validated

• Model

• Inferred

• WGS

Accession Format    Molecule Type

NC_123456           Complete Genome, Complete Chromosome, Complete Sequence

NG_123456           Genomic Region

NM_123456           mRNA

NP_123456           Protein

NT_123456           Genomic Contig (from BACs)

NW_123456           Genomic Contig (from WGS)

XM_123456           mRNA (taken from genomic seq)

XR_123456           RNA (taken from genomic seq)

XP_123456           Protein (taken from genomic seq)
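
Because the prefix encodes the molecule type, a RefSeq accession can be classified with a simple lookup. The sketch below uses only the prefixes in the table above. The function name is hypothetical, and accepting an optional ".version" suffix is an assumption that goes beyond what the table shows.

```python
import re

# Prefix -> molecule type, taken from the table above.
REFSEQ_PREFIXES = {
    "NC": "complete genome, chromosome, or sequence",
    "NG": "genomic region",
    "NM": "mRNA",
    "NP": "protein",
    "NT": "genomic contig (from BACs)",
    "NW": "genomic contig (from WGS)",
    "XM": "mRNA (taken from genomic sequence)",
    "XR": "RNA (taken from genomic sequence)",
    "XP": "protein (taken from genomic sequence)",
}

# Two-letter prefix, underscore, digits, optional ".version" (assumption).
ACCESSION_RE = re.compile(r"([A-Z]{2})_(\d+)(?:\.(\d+))?")


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    match = ACCESSION_RE.fullmatch(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


print(classify_refseq("NM_123456"))  # mRNA
print(classify_refseq("XP_123456"))  # protein (taken from genomic sequence)
print(classify_refseq("AK123456"))   # None (no underscore, so not RefSeq-style)
```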

NEDO Full Length mRNA Sequencing

~ 160,000 clones were isolated from more than 20 full-length enriched human cDNA libraries made by the "oligo-capping" method. Their 5'-end sequences were determined.

We selected about 10,000 putatively full-length cDNAs using these sequence data and determined the entire sequence of the selected clones. This NEDO project aims to determine the sequence of an additional 20,000 full-length cDNA clones.

RIKEN

Mouse Genome Encyclopedia

A project to sequence full-length mouse cDNAs.

Over 21,000 genes sequenced from oligo-capped libraries from about 200 tissues and cell types.

Set standards for annotation with FANTOM (Functional Annotation of Mouse).

Kazusa

Large cDNA inserts (> 4 kb).

Determined the complete base sequences of approximately 2000 species of previously undiscovered cDNA from KG-1 cells and brain tissue, with an average length of 5 kb.

Database: HUGE (Human Unidentified Gene-Encoded protein database)

http://www.kazusa.or.jp/huge

MGC - Mammalian Gene Collection

The NIH Mammalian Gene Collection (MGC) seeks to identify and sequence a representative full open reading frame (ORF) clone for each human and mouse gene.

MGC has produced over 80 cDNA libraries enriched for full-length cDNAs derived from human tissue and cell lines, and mouse tissue.

5' EST reads are generated from each library. Several algorithms are applied to select putative full ORF clones.

Sources of mRNAs

• Experimental

– Clone new gene

– Clone gene from database

– 2 hybrid system

• Database

– “Typical” cDNA

– Full length cDNA

– EST

RNA

RNA, cDNA, and ESTs

[Figure: an mRNA (exon 1, exon 2, exon 3), the corresponding cDNA clone, and two ESTs derived from it.]

GenBank ESTs (Expressed Sequence Tags):

~ 6,000,000 human ESTs

~ 4,300,000 mouse ESTs

Adapted with permission from Adam Sartiel

Problems with ESTs

- low copy number genes

- rare tissues

- mistakes

- enrichment of 3’ ends of genes

- incomplete coverage of genes

Uses of ESTs

- prediction of coding regions

- detection of alternative splicing

- clustering to form “genes”

Problems with clustering:

- incomplete coverage breaks genes up

- gene families

• With the increasing sequencing and annotation of key genomes, having a gene-based view of the resultant information is useful. Entrez Gene has therefore been implemented to supply key connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information. These gene identifiers are tracked, and information is added when available. Entrez Gene can be considered as the successor to LocusLink, with the major differences being in greater scope (more of the genomes represented by NCBI Reference Sequences or RefSeqs) and in being integrated for indexing and query in NCBI's Entrez system.

Data reliability in databases

• The huge amount of data collected in databases presents a lot of problems:

– Data accuracy

– Sequence redundancy

– Inconsistent nomenclature

– Inaccurate annotation

– Sequence contamination (vectors, bacterial)


• The database staff notify the authors that an error (or contamination) was detected in their sequence entry.

• However, it takes time to correct the data.

• Meanwhile the error propagates, because many of the proteins in the protein database are translated from the DNA sequence database.


• Many of the sequences in the database are quite "old". They have not been updated since they were submitted, even though technology and data have advanced considerably.

HUGO Gene Nomenclature Committee

• This committee is responsible for approving a unique symbol for each gene.

• It also assigns a longer and more descriptive name.

• The committee makes considerable efforts to use symbols acceptable to workers in the field, but sometimes it is not possible to use exactly what has previously appeared in the literature.

• However, wherever the committee is aware of such symbols, they are listed as aliases in the Genew database (http://www.gene.ucl.ac.uk/cgibin/nomenclature/searchgenes.pl).

Gene symbols

Gene symbols are designated by upper-case Latin letters or by a combination of upper-case letters and Arabic numbers.

Symbols should be short in order to be useful, and should not attempt to represent all known information about a gene.

Ideally symbols should be no longer than six characters in length.

Based on classical genetic guidelines, it is recommended that gene symbols are either underlined or italicized when referring to genotypic information (phenotypic information is represented in standard fonts).
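
The two mechanical guidelines above (character set and length) can be checked automatically. The following is a minimal sketch, not the committee's own validation: the function name is hypothetical, and requiring the first character to be a letter is an assumption. Approved symbols that contain other characters, such as hyphens, would be flagged by this simplified check.

```python
import re

# Upper-case Latin letters, optionally mixed with Arabic numerals.
# Assumption: the symbol starts with a letter.
SYMBOL_RE = re.compile(r"[A-Z][A-Z0-9]*")


def check_symbol(symbol, max_len=6):
    """Return a list of guideline problems; an empty list means none found."""
    problems = []
    if SYMBOL_RE.fullmatch(symbol) is None:
        problems.append("should use only upper-case Latin letters and Arabic numerals")
    if len(symbol) > max_len:
        problems.append(f"longer than the recommended {max_len} characters")
    return problems


print(check_symbol("ABCA1"))     # []
print(check_symbol("abca1"))     # one problem: character set
print(check_symbol("ABCDEFG1"))  # one problem: length
```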

Gene Symbols

Symbol:                 ABCA1

Full name:              ATP-binding cassette, sub-family A (ABC1), member 1

Cytogenetic Location:   9q31

MIM Number:             600046

PubMed ID:              8088782

Protein databases

• There are many different protein databases containing different types of information:

– Primary amino acid sequence

– Secondary structure

– 3D structure

– Protein family domains

– Consensus active sites

Sources of Protein

• Proteins that have been worked on experimentally

• mRNA whose product has been worked on experimentally (no actual protein sequencing done)

• Translated DNA (mRNA) sequences

Protein Primary Sequence Databases

• Usually contain a description of the protein entry (annotation), the amino acid sequence and sometimes links to other related databases.

• Swiss-Prot, from the University of Geneva (now the Swiss Institute of Bioinformatics), is a curated protein database which strives to provide a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.

UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

• The UniProt Knowledgebase (UniProt) is the central access point for extensive curated protein information, including function, classification, and cross-reference.

• The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed searches.

• The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.

Swiss-Prot Database (primary database)

• Swiss-Prot annotation includes:

– Description of protein function

– Protein domain structure

– Post-translational modifications

– Protein variants

• Sequence entries are composed of different line types, each with its own format. For standardization purposes the format of Swiss-Prot follows as closely as possible that of the EMBL (DNA) Database.
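
To illustrate what "line types" means in practice, here is a minimal sketch of a reader for this flat-file layout. It assumes each line starts with a two-letter code followed by three spaces, that sequence lines start with blanks, and that `//` ends the entry. The entry itself is invented for illustration (it is not a real record), and the function name is hypothetical.

```python
from collections import defaultdict

# Invented toy entry in the flat-file layout (not a real record).
ENTRY = """\
ID   TEST_HUMAN              Reviewed;          10 AA.
AC   P00000;
DE   RecName: Full=Hypothetical example protein;
OS   Homo sapiens (Human).
SQ   SEQUENCE   10 AA;
     MAKTESTSEQ
//
"""


def parse_entry(text):
    """Group the lines of one entry by line-type code; return (fields, sequence)."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, data = line[:2], line[5:].strip()
        if code.strip() == "":
            # Lines without a code carry the sequence itself.
            sequence.append(data.replace(" ", ""))
        else:
            fields[code].append(data)
    return dict(fields), "".join(sequence)


fields, sequence = parse_entry(ENTRY)
print(fields["AC"])  # ['P00000;']
print(fields["OS"])  # ['Homo sapiens (Human).']
print(sequence)      # MAKTESTSEQ
```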

Swiss-Prot Database

Swiss-Prot differs from other protein databases by the following criteria:

• Annotation

• Minimal Redundancy

• Integration with other databases

Swiss-Prot Database

• Annotation

In Swiss-Prot, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation.

The core data consists of the sequence, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein).

The annotation consists of the description of:

• Function(s) of the protein

• Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.

• Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, etc.

• Secondary structure

• Quaternary structure. For example homodimer, heterotrimer, etc.

• Similarities to other proteins

• Disease(s) associated with deficiency(s) of/in the protein

• Sequence conflicts, variants, etc.

Swiss-Prot Database

To obtain this information, Swiss-Prot uses, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins.

Swiss-Prot also makes use of external experts, who have been recruited to send their comments and updates concerning specific groups of proteins.

Swiss-Prot Database

• Minimal Redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot, the curators try as much as possible to merge all these data so as to minimize the redundancy of the database.

If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

Swiss-Prot Database

• Integration with other databases

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections.

Swiss-Prot is currently cross-referenced with ~100 different databases. Cross-references are provided in the form of pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot.

TrEMBL database

• TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all the translations of the EMBL (DNA) database.

• TrEMBL contains entries not yet integrated in Swiss-Prot.

NR database

(primary database from NCBI)

• The NR Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL and DDBJ as well as protein sequences submitted to PIR, Swiss-Prot, PRF, PDB (sequences from solved structures).

Data reliability in Protein databases

• About 30% of the proteins in the databases have erroneous sequences due to:

– Missing exons in the DNA translation.

– Introns mistakenly translated.

• Another common problem is the assigning of functions to "new" proteins, based on sequence similarity.

Data reliability in Protein databases

• For example:

– Protein A is similar to Protein B.

– The annotation of Protein B is based on the annotation of Protein A (which has an error).

– The annotation of Protein A is corrected by the group working on it. This correction is not reflected in the annotation of Protein B.

– When Proteins C and D are also based on the erroneous annotation of B, the problem...
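
The propagation in this example can be traced as a small graph walk. The sketch below is purely illustrative: the "annotated from" relationships are the hypothetical ones in the example above, and real databases do not necessarily record this provenance.

```python
from collections import deque

# Hypothetical "annotation copied from" links, following the example:
# B was annotated from A; C and D were annotated from B.
ANNOTATED_FROM = {"B": "A", "C": "B", "D": "B"}


def inherits_from(source, annotated_from):
    """Return every protein whose annotation descends from `source`."""
    children = {}
    for child, parent in annotated_from.items():
        children.setdefault(parent, []).append(child)

    affected, seen, queue = [], {source}, deque([source])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:  # guard against circular links
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


# Correcting A alone leaves all of these entries carrying the old error.
print(inherits_from("A", ANNOTATED_FROM))  # ['B', 'C', 'D']
```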

http://www.geneontology.org/

Text searching pitfalls

• It finds exactly what you type

• Older records may have different annotation, from gene names on…

• human vs. Homo sapiens

• Most sites use Boolean operators (AND, OR, BUT NOT)

• You can add a field-specific tag, but each site has a different way of adding it to a search. For example, NCBI uses square brackets [].
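
As a sketch of how these points combine, the helper below builds a query string in the NCBI square-bracket style: synonyms are ORed together so that differently annotated records still match, and each term carries a field tag. The helper names are hypothetical, and the field names used (Gene Name, Organism, Keyword) are assumptions; check the search help of the site you are using for its exact tags and operators.

```python
def tagged(term, field=None):
    """Quote multi-word terms and append a [Field] tag if one is given."""
    quoted = f'"{term}"' if " " in term else term
    return f"{quoted}[{field}]" if field else quoted


def any_of(terms, field=None):
    """OR together synonyms, e.g. 'human' and 'Homo sapiens'."""
    return "(" + " OR ".join(tagged(t, field) for t in terms) + ")"


query = (
    tagged("ABCA1", "Gene Name")
    + " AND "
    + any_of(["human", "Homo sapiens"], "Organism")
    + " NOT "
    + tagged("EST", "Keyword")
)
print(query)
# ABCA1[Gene Name] AND (human[Organism] OR "Homo sapiens"[Organism]) NOT EST[Keyword]
```

The code only assembles the string; it does not contact any server.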

Remember:

Text searching is NOT sequence similarity searching! You may not find all related sequences by text searching!
