Introduction to biological databases


What is 'Bioinformatics'?

Bioinformatics is the application of computer science to biology

… an interdisciplinary science

… strives to solve the problems of the life sciences with theoretical, computer-assisted methods

… indispensable for modern biology and medicine

… uses techniques from applied mathematics and statistics


Some major research areas in bioinformatics

• Sequence analysis and function prediction

• Analysis and prediction of protein structure

• Computational evolutionary biology

• Comparative genomics

• Gene and protein expression

• Protein-Protein Interaction (PPI) analysis

• Systems biology

• Image analysis

• Visualization


Indispensable for bioinformatic studies:

1. Databases

2. Software tools

3. Servers



Outline

• Introduction

• Selected categories of life sciences databases

1. Nucleotide sequences

2. Genomics

3. Mutation/polymorphism

4. Protein sequences

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)

7. 3D structure

8. Metabolism/Pathways

9. Bibliography

10. Others

• Concluding remarks

• Practicals


Introduction

What is a database (db) ?

• A collection of related data, which are:
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db

• Also includes the associated tools (software) needed for db access, db updating, and insertion or deletion of db information.

• Data storage formats: flat files (plain text, FASTA), relational tables, structured markup (XML, RDF)



Why biological databases (db)?

• Exponential growth in biological data

• Data are no longer published in a conventional manner, but directly submitted to databases (nucleotide and amino acid sequences, 3D structures, 2D gel analyses, MS analyses, microarrays, publications, protein-protein interactions, …)

• Essential tools for biological research


P. Gaudet, 'A community of Biocurators'


Science cover, February 2011


Some statistics and remarks

• More than 1000 different "biological" databases

• Variable size: <100 Kb to >100 Gb (ENA > 728 Gb!)
– DNA: > 100 Gb
– Protein: 2 Gb
– 3D structure: 5 Gb
– Other: smaller

• Update frequency: daily to annually

• Generally accessible through the web (free!?)


Where can we find…

• a video -> YouTube

• info on S. Hawking -> Wikipedia

• a book -> Amazon

• a friend -> Facebook, Google Plus

• DNA sequence -> EMBL

• protein sequence -> UniProtKB, RefSeq

• 3D data -> PDB

• Microarray data -> ArrayExpress, GEO

• Publications -> PubMed


10 most important bioinformatics databases *

* according to "Bioinformatics for Dummies"

Name        URL                          Data type
GenBank     www.ncbi.nlm.nih.gov         Nucleotide sequences
Ensembl     www.ensembl.org              Genomes
PubMed      www.ncbi.nlm.nih.gov         Literature references
NCBI nr     www.ncbi.nlm.nih.gov         Protein sequences
UniProtKB   www.uniprot.org              Protein sequences
InterPro    www.ebi.ac.uk                Protein domains
OMIM        www.omim.org/                Genetic diseases
ENZYME      http://enzyme.expasy.org/    Enzymes
PDB         www.rcsb.org/pdb/            Protein structures
KEGG        www.genome.ad.jp             Metabolic pathways


Databases / Servers

• A server is a computer (from a given institute) that provides services (stores databases and associated tools) to other computers

• Main biological servers:
– ExPASy (www.expasy.org/)
– UniProt (www.uniprot.org)
– NCBI (www.ncbi.nlm.nih.gov/)
– EBI (www.ebi.ac.uk/)
– Japanese GenomeNet (www.genome.jp/)

• Not all servers give access to the same databases and the same search tools! Even when servers give access to the same databases, the 'look' is different, and beware of the date of the latest release!


The same data on different servers (UniProt, NCBI)


How to find a database ?

• The Nucleic Acids Research (NAR) Online Molecular Biology Database Collection 2011: a total of 1'330 databases
http://www.oxfordjournals.org/nar/database/a/

• Expasy Life Science Directory: http://www.expasy.org/links.html (no longer updated)

• Google: http://www.google.com/



Awareness of the content and usage of knowledge resources is a pre-requisite to do any type of "serious" research in the field of molecular life sciences

(Amos Bairoch, 2007)


Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db


Deluge of sequence data

• ~3200 genomes sequenced (single organism, varying sizes, including viruses)

• ~5'000 ongoing genome sequencing projects

• cDNA sequencing projects (ESTs or cDNAs)

• metagenome sequencing projects (~300) (environmental samples: multiple 'unknown' organisms, varying sizes)
– Ecological metagenomics: beach sand, Sargasso Sea, New York air, …
– Organismal metagenomics: human fluids, mouse gut, …

• Personal human genomes

http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html



• Personal human genomes!

http://www.youtube.com/watch?v=mVZI7NBgcWM




But… we now know that his apoE allele is the one associated with an increased risk of Alzheimer's disease and that he has the 'blue eye' allele…


Ensembl genome browser



http://www.personalgenomes.org



DNA sequence of the telomeric region of human chromosome X


1. Nucleotide sequences db

• The main DNA sequence db are:

EMBL/ENA (Europe)/GenBank (USA) /DDBJ (Japan)

-> INSDC collaboration

• There are also specialized databases for the different types of RNAs (e.g. tRNA, rRNA, tmRNA, uRNA, etc.)

• Others:
Eukaryotic Promoter Database (EPD); RNA editing sites, …


1. EMBL-ENA/GenBank/DDBJ http://www.insdc.org/

Archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.


1. Same data on different servers

EBI (EMBL/ENA), NCBI (GenBank), NIG (DDBJ)


1. EMBL-ENA/GenBank/DDBJ

• Serve as archives: 'nothing goes out'

• Contain all public sequences derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers (cDNAs, ESTs…)
– Individual scientists (15 % of entries)
– Patent offices (e.g. European Patent Office, EPO)

• Currently: ~150 x 10^6 sequences, ~200 x 10^9 bp;

• Sequences from > 500'000 different species;


1. Ideal content of a "sequence" db

• Sequences !!

• Unique Accession number (AC)

• References

• Taxonomic data

• ANNOTATION/CURATION

• Keywords

• Cross-references

• Documentation

Minimal requirements !


1. EMBL-ENA entry

Fields highlighted in the example entry: accession number, cross-references, taxonomy, references


1. EMBL-ENA entry (cont.)

Fields highlighted in the example entry: annotation (predicted or experimentally determined), CDS = Coding Sequence (proposed by submitters), sequence


1. The hectic life of a sequence

cDNAs, ESTs, genes, genomes, … -> EMBL/ENA, GenBank, DDBJ (with or without an annotated CDS provided by the authors)

Data not submitted to public databases: delayed or cancelled…

CDS (coding sequence): portion of DNA/RNA translated into protein (from Met to 'STOP'). Experimentally proven or derived from gene prediction. Not so well documented!
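The definition of a CDS as the stretch translated "from Met to STOP" can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: it uses the standard genetic code, starts at the first ATG it finds, and ignores introns, alternative start sites and reverse strands. The function name `translate_cds` and the input sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna: str) -> str:
    """Translate from the first ATG (Met) up to the first in-frame stop codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon ends the CDS
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy sequence: two leading bases, then ATG GCT TGG TAA.
print(translate_cds("CCATGGCTTGGTAACC"))
```

Real CDS annotations carry explicit coordinates (and often several exons), so a production tool would translate the annotated location rather than search for the first ATG.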


Coding Sequence (CDS): alignment between an mRNA and a genomic sequence

1. EMBL-ENA vs GenBank format


1. FASTA format
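As a rough illustration of the FASTA format, each record consists of a header line starting with ">" followed by one or more sequence lines. The Python sketch below reads such text into a dictionary; the function name `parse_fasta` and the two records are hypothetical examples, not taken from a real database.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


# Hypothetical records, for illustration only.
example = """>seq1 hypothetical example
ATGGCTTGG
TAA
>seq2 another hypothetical example
MAW
"""
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```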



• Heterogeneous sequence length and quality:
– ESTs, genomes, variants, fragments…

• Sequence sizes:
– max 350'000 bp/entry (! genomic sequences, overlapping)
– min 10 bp/entry

• Archive: nothing goes out -> highly redundant!

• Full of errors: in sequences, in annotations, in CDS attribution; no consistency of annotations:
– most annotations are done by the submitters;
– the quality, completeness and updating of the information are heterogeneous



• Unexpected information you can find in these db:

FT source 1..124

FT /db_xref="taxon:4097"

FT /organelle="plastid:chloroplast"

FT /organism="Nicotiana tabacum"

FT /isolate="Cuban cahibo cigar, gift from

FT President Fidel Castro"

• Or:

FT source 1..17084

FT /chromosome="complete mitochondrial genome"

FT /db_xref="taxon:9267"

FT /organelle="mitochondrion"

FT /organism="Didelphis virginiana"

FT /dev_stage="adult"

FT /isolate="fresh road killed individual"

FT /tissue_type="liver"
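Feature-table lines such as those above follow a regular `/key="value"` layout, so the qualifiers can be collected with a few lines of Python. This is a minimal sketch, assuming one feature per call and that a value without a closing quote continues on the next FT line; the function name `parse_ft_qualifiers` is hypothetical and this is not an official EMBL parser.

```python
import re


def parse_ft_qualifiers(lines):
    """Collect /key="value" qualifiers from the FT lines of one feature."""
    qualifiers = {}
    key = None
    open_value = False  # True while a quoted value is still unterminated
    for line in lines:
        body = line[2:].strip() if line.startswith("FT") else line.strip()
        match = re.match(r"/(\w+)=(.*)", body)
        if match:
            key, raw = match.group(1), match.group(2)
            qualifiers[key] = raw.strip('"')
            closed = len(raw) > 1 and raw.endswith('"')
            open_value = raw.startswith('"') and not closed
        elif open_value:
            # continuation line of a wrapped qualifier value
            qualifiers[key] += " " + body.strip('"')
            open_value = not body.endswith('"')
    return qualifiers


entry = [
    "FT source 1..124",
    'FT /db_xref="taxon:4097"',
    'FT /organelle="plastid:chloroplast"',
    'FT /organism="Nicotiana tabacum"',
    'FT /isolate="Cuban cahibo cigar, gift from',
    'FT President Fidel Castro"',
]
for name, value in parse_ft_qualifiers(entry).items():
    print(name, "=", value)
```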


1. Other nucleotide sequences databases

http://www.rnaiweb.com/RNAi/RNAi_Web_Resources/siRNA_Collections___Databases/



1. Other nucleotide sequences databases

• EPD is a rigorously selected database. In order to be included in EPD, a promoter must be:
– recognized by eukaryotic RNA polymerase II,
– active in a higher eukaryote,
– experimentally defined, or homologous and sufficiently similar to an experimentally defined promoter,
– biologically functional,
– available in the current ENA release,
– distinct from other promoters in the database.

http://www.epd.isb-sib.ch/







2. 'Genomics databases'

• Contain information on gene chromosomal location (mapping) and nomenclature, and provide links to sequence databases; usually contain no sequence!

• Exist for most model organisms; usually species-specific.

• Examples: MIM (human), MGD (mouse), FlyBase (Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B. subtilis), TAIR (Arabidopsis), etc.


2. TAIR

http://www.arabidopsis.org/



2. OMIM: Online Mendelian Inheritance in Man

http://www.omim.org/

• ~20'300 human protein-coding genes

• 2850 protein-coding genes with mutations causing human disorders

• ~1800 more to be discovered

• ~1100 loci affecting more than 165 polygenic diseases have been identified (PMID:21307931)


OMIM: 'gene' entry


OMIM: 'disease' entry


2. Genome browser: Ensembl

• Ensembl provides a bioinformatics framework to organize biology around the sequences of large genomes.

http://www.ensembl.org/


Ensembl genome browser


Genome browser: UCSC

http://genome.ucsc.edu/cgi-bin/hgGateway





A eukaryotic gene (UCSC), from 5' to 3': 5' untranslated region, initial exon (Met), introns and internal exons, final exon (Stop)












Single nucleotide polymorphisms (SNPs) are unique genetic differences between individuals that contribute in significant ways to the determination of human variation including physical characteristics like height and appearance as well as less obvious traits such as personality, behaviour and disease susceptibility. SNPs can also significantly influence responses to pharmacotherapy and whether drugs will produce adverse reactions.

SNP Technologies for Drug Discovery: A Current Review. DOI: 10.2174/157016308785739811

Each human genome contains ~3'000'000 single nucleotide polymorphism (SNP) variants (1 per 1000 bp).

S.E. Antonarakis


3. Mutation/polymorphism db

• Contain information on sequence variations, whether or not they are linked to genetic diseases;

• General db:
– dbSNP: human single nucleotide polymorphism (SNP) db (variants with frequency > 1 %;
!!! a disease mutation is rare -> dbSNP contains few 'disease-linked mutations')

• Disease-specific db: most of these databases are linked either to a single gene or to a single disease;

– p53 mutation db

– ADB - Albinism db (Mutations in human genes causing albinism)

– Asthma and Allergy gene db

– ….


http://www.ncbi.nlm.nih.gov/SNP/






Blue eye allele… dbSNP: rs12913832 -> link to the ALFRED database

Blue eyes / Brown eyes




4. Protein sequences -> Primary db


4. Protein sequences – Eukaryotic cell

Cell elemental composition

Cells are made of 90% water.

The remainder is approximately:

• 50% protein (3.5kg)

• 15% carbohydrate

• 15% nucleic acid (1.3kg)

• 10% lipid

• 10% miscellaneous


Amino acid sequence (1-letter code) of human titin


4. Protein sequence origin

• About 180 billion proteins (?)

• > 15 million 'known' protein sequences in 2011

• More than 99 % of the protein sequences are derived from the translation of nucleotide sequences

• Less than 1 % come from direct protein sequencing (Edman, MS/MS…)

-> It is important that users know where a protein sequence comes from (sequencing & gene prediction quality)!


http://www.nature.com/news/2010/100922/full/467380a.html

(US$30 million per year)




From nucleic acid databases (ENA, GenBank, DDBJ) to protein sequence databases:
– directly, if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries)
– no CDS: gene prediction (RefSeq, Ensembl)



ENA, GenBank, DDBJ -> protein sequence resources:
– from coding sequences provided by submitters: TrEMBL, GenPept
– from coding sequences provided by submitters and gene prediction: RefSeq, Ensembl, CCDS
– from sequences derived from scientific publications: PRF
– others shown: Swiss-Prot, UniProtKB, UniParc, UniMES, PDB, (PIR), (IPI)
+ all 'species' specific databases (EcoGene, TAIR, …)


4. Major protein sequence db 'sources'

1. UniProtKB: Swiss-Prot + TrEMBL

2. NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq

UniProtKB/Swiss-Prot: manually annotated protein sequences (12'500 species)

UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (380'000 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (380'000 species ?)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank: 3D data and associated sequences

PRF: journal scan of 'published' peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (16'000 species)

Figure legend: integrated resources (PIR, PDB, PRF), 'cross-references', separated resources


Look for toll-like receptor 4 (Homo sapiens) at www.uniprot.org: the results come from Swiss-Prot and TrEMBL.

Look for toll-like receptor 4 (Homo sapiens) at http://www.ncbi.nlm.nih.gov/: the results come from GenPept (many entries), Swiss-Prot and RefSeq.


4. UniProt - The Universal Protein Resource

is maintained by the UniProt consortium: SIB + EBI + PIR

UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01 HG02712-04. Additional support for the EBI's involvement in UniProt comes from the European Commission (EC)'s FELICS grant (021902RII3) and from the NIH grant 1R01HGO2273-01. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the European Commission contracts FELICS (021902RII3) and SLING (226073). PIR activities are also supported by the NIH grants and contracts HHSN266200400061C, NCI-caBIG, and 5R01GM080646-04, and the Department of Defense grant W81XWH0720112.


4. UniProtKB: from ENA to TrEMBL

ENA (DNA) entry: translated CDS, product name, tissue, reference

-> TrEMBL entry: translated CDS, protein name, reference + tissue

Automated extraction of protein sequence (translated CDS), gene name and references. Automated annotation.


UniProtKB/TrEMBL

Automatic annotation

Protein sequence

- The quality of the protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS) or by the gene prediction pipeline (e.g. Ensembl).

- 100% identical sequences (same length, same organism) are merged automatically.

Biological information: sources of annotation

- Provided by the submitter (EMBL, PDB, TAIR…)

- From automated annotation (automatically generated annotation rules (e.g. SAAS) and/or manually generated annotation rules (e.g. UniRule))
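The merging rule for 100% identical sequences from the same organism can be sketched in Python as a grouping step. This is a simplified illustration only, not the actual UniProt pipeline; the function name `merge_identical`, the accessions and the sequences are hypothetical.

```python
def merge_identical(entries):
    """Group accessions whose organism and sequence are exactly identical.

    `entries` is an iterable of (accession, organism, sequence) tuples.
    """
    merged = {}
    for accession, organism, sequence in entries:
        # Identical sequences necessarily have the same length,
        # so (organism, sequence) is a sufficient grouping key.
        merged.setdefault((organism, sequence), []).append(accession)
    return merged


# Hypothetical entries, for illustration only.
entries = [
    ("ACC1", "Homo sapiens", "MKTAYIAK"),
    ("ACC2", "Homo sapiens", "MKTAYIAK"),  # identical to ACC1 -> merged
    ("ACC3", "Mus musculus", "MKTAYIAK"),  # same sequence, other organism
    ("ACC4", "Homo sapiens", "MKTAYIAR"),  # one residue differs
]
for (organism, sequence), accessions in merge_identical(entries).items():
    print(organism, sequence, accessions)
```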


4. UniProtKB: from TrEMBL to Swiss-Prot

TrEMBL entry: translated CDS, reference, protein name

-> Swiss-Prot entry: translated CDS + polymorphisms + isoforms + …, protein names, many more references, full annotation

Manual annotation of the sequence and manual review of associated biological information

Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL -> minimal redundancy


UniProtKB/Swiss-Prot

Manual annotation

1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)

2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)


Protein and gene names

General annotation (Comments)

… enables researchers to obtain a summary of what is known about a protein…

www.uniprot.org


Human protein manual annotation: some statistics (June 2011)


Sequence annotation (Features)



Non-experimental qualifiers

UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two

Type of evidence                                  Qualifier
Strong experimental evidence                      None or Ref.X
Light experimental evidence                       Probable
Inferred by similarity with homologous protein    By similarity
Inferred by prediction                            Potential


'Protein existence' tag

• The 'Protein existence' tag indicates the evidence for the existence of a given protein;

• Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location), …)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58%)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)

http://www.uniprot.org/docs/pe_criteria



The UniProt web site

www.uniprot.org

• Powerful search engine, Google-like and easy to use, but also supports very directed field searches

• Scoring mechanism presenting relevant matches first

• Entry views, search result views and downloads are customizable

• The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access

• Search, Blast, Align, Retrieve, ID mapping


Search

A very powerful text search tool with autocompletion and refinement options, for finding UniProt entries and documentation by biological information


The search interface guides users with helpful suggestions and hints


Advanced Search

A very powerful search tool, to be used when you know in which entry section the information is stored


Find all proteins localized in the cytoplasm (experimentally proven) that are phosphorylated on a serine (experimentally proven)


Result pages: highly customizable


Result pages: downloadable




4. NCBI nr - Entrez 'protein'

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF

• GenPept: contains all CDS annotated in GenBank/ENA/DDBJ sequences ('translations from annotated coding regions in GenBank'); equivalent to TrEMBL, except that it is redundant with other databases (Swiss-Prot, RefSeq, PIR…)

• PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)

• PDB: 3D structure database: all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)

• PRF: sequences derived from scientific publications ("journal scan") (integrated into TrEMBL)


4. RefSeq

RefSeq: The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

Tightly linked to Entrez Gene ("interdependent curated resources")


4. RefSeq

Accession prefixes: Protein: NP_; mRNA: NM_; DNA: NC_

Fields highlighted in the example entry: AC, taxonomy, references, status and GenBank source, annotation (automated, derived from Swiss-Prot, in-house), sequence, cross-references

4. RefSeq

http://www.ncbi.nlm.nih.gov/RefSeq/

Curation status                  Manual annotation
GENOME ANNOTATION                No
INFERRED                         No
MODEL                            No
PREDICTED                        No
PROVISIONAL                      No
REVIEWED                         Yes (sequence + functional information and features)
VALIDATED                        Yes (initial sequence)
Whole Genome Sequencing (WGS)    No


4. Accession number (AC) mapping

These identifiers all point to a TP53 (p53) protein sequence!

P04637, NP_000537, NP_001119584.1, NP_001119585.1, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, etc.


http://www.uniprot.org/mapping/
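Because each resource uses its own identifier style, a first step before mapping is often to guess which database an identifier comes from. The Python sketch below does this with regular expressions. The UniProtKB pattern follows the accession format published by UniProt; the other patterns are assumptions inferred from the example identifiers in this section, and the function name `classify_identifier` is hypothetical.

```python
import re

# (label, pattern) pairs; patterns other than UniProtKB are simplified guesses.
ID_PATTERNS = [
    ("UniProtKB accession",
     r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"),
    ("RefSeq protein", r"NP_\d+(?:\.\d+)?"),
    ("RefSeq mRNA", r"NM_\d+(?:\.\d+)?"),
    ("RefSeq genomic DNA", r"NC_\d+(?:\.\d+)?"),
    ("Ensembl gene", r"ENSG\d+"),
    ("CCDS", r"CCDS\d+(?:\.\d+)?"),
    ("UniParc", r"UPI[0-9A-F]+"),
    ("IPI", r"IPI\d+(?:\.\d+)?"),
]


def classify_identifier(identifier: str) -> str:
    """Return the label of the first pattern matching the whole identifier."""
    for label, pattern in ID_PATTERNS:
        if re.fullmatch(pattern, identifier):
            return label
    return "unknown"


for identifier in ["P04637", "NP_000537", "NP_001119584.1",
                   "ENSG00000141510", "CCDS11118",
                   "UPI000002ED67", "IPI00025087"]:
    print(identifier, "->", classify_identifier(identifier))
```

Recognizing the identifier type does not replace a mapping service: only the databases themselves know which identifiers refer to the same sequence.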








5. Protein domain/family: some definitions

• Most proteins have "modular" conserved structures

• Estimation: ~3 domains / protein

• Estimation: ~6000 'known' domains

• Estimation: ~80% of proteins have at least one 'known' domain

-> Predicting the domain content of an unknown protein sequence may help to find a 'function'

Example of conserved regions (PPID family)

- 1 CSA_PPIASE (cyclophilin-type peptidyl-prolyl cis-trans isomerase) (domain)

- 3 TPR repeats (tetratricopeptide repeat)

- 1 active site (Cys 181: active site residue)

- Binding cleft (motif)


Domain signatures methods:

derived from 'modelled' multiple sequence alignments (MSA)

• Pattern

• Fingerprint

• Sequence clustering

• Profile

• HMM


How to build a PROSITE pattern ?

• Start with a multiple sequence alignment (MSA)

Information lost: 4D 1E
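Once a pattern has been derived from the alignment, it can be used to scan other sequences. The Python sketch below converts a pattern written in PROSITE syntax into a regular expression. It covers only the basic syntax elements (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for sequence ends); the function name `prosite_to_regex`, the pattern and the test sequence are hypothetical and do not correspond to a real PROSITE entry.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."  # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        repeat = ""
        if low and high:
            repeat = "{" + low + "," + high + "}"
        elif low:
            repeat = "{" + low + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Hypothetical pattern: D or E, any two residues, C, anything but P, then H.
regex = prosite_to_regex("[DE]-x(2)-C-{P}-H")
print(regex)
print(bool(re.search(regex, "AADKLCGHKK")))
```

A pattern is a strict yes/no rule, which is one reason profiles and HMMs, which score each position, are listed as alternative signature methods.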


5. Protein domain/family db

Database      Signature method
PROSITE       Patterns / Profiles
ProDom        Aligned motifs (PSI-BLAST) (Pfam B)
PRINTS        Aligned motifs
Pfam          HMM (Hidden Markov Models)
SMART         HMM
TIGRfam       HMM
DOMO          Aligned motifs
BLOCKS        Aligned motifs (PSI-BLAST)
CDD (CDART)   PSI-BLAST (PSSM) of Pfam and SMART

InterPro integrates several of these databases.


InterPro scan results

Part of the protein sequence which has been 'recognized' by different modelled MSAs

What makes Bee special?








6. Proteomics db

• Mass Spectrometry (MS) database: PRIDE

• SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.
– Contain information obtained by 2D-PAGE: images of master gels and descriptions of identified proteins
– Composed of image and text files


6. PRIDE

http://www.ebi.ac.uk/pride/

http://www.ebi.ac.uk/interpro







7. 3D structure



• Only one database: PDB (Protein Data Bank), but several servers…

• Contains the spatial coordinates of macromolecule atoms whose 3D structure has been experimentally obtained by X-ray or NMR studies; also a few models.

• Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes…)


7. PDB: Protein Data Bank

• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).

• Specialized programs allow the visualization of the corresponding 3D structure (e.g. Swiss-PdbViewer, Chime, RasMol).

• Currently (September 28, 2011) there are 75'000 structures for about 20'000 different proteins (highly redundant)!

http://www.pdb.org/ (RCSB)

http://www.ebi.ac.uk/pdbe/

http://www.pdbj.org/



7. PDB: example

HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2

COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3

COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4

SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5

AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6

REVDAT 1 15-OCT-92 12CA 0 12CA 7

JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8

JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9

JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10

JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11

JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12

JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13

REMARK 1 12CA 14

REMARK 2 12CA 15

REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16

REMARK 3 12CA 17

REMARK 3 REFINEMENT. 12CA 18

REMARK 3 PROGRAM PROLSQ 12CA 19

REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20

REMARK 3 R VALUE 0.170 12CA 21

REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22

REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23

REMARK 4 12CA 24

REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25

REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26

REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27

………


7. PDB (cont.)

SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68

SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69

SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70

SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71

SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72

SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73

SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74

SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75

TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76

TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77

TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78

TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79

TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80

TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81

CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82

ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83

ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84

ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85

SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86

SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87

SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88

ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89

ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90

ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91

ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92

ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93

ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94

ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95

ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96

ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97

ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98

ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99

ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100

ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101

ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102

…….

Coordinates <x; y; z> of each atom
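The ATOM records above can be read programmatically to recover the coordinates. The official PDB format is column-based; the Python sketch below instead splits on whitespace, which is an assumption that only suits the simplified listing shown here (old-style records without chain identifiers). The names `Atom`, `parse_atom_line` and `distance` are hypothetical.

```python
import math
from typing import NamedTuple, Optional


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Optional[Atom]:
    """Parse one whitespace-separated ATOM line; return None for other records."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "ATOM":
        return None
    return Atom(int(fields[1]), fields[2], fields[3], int(fields[4]),
                float(fields[5]), float(fields[6]), float(fields[7]))


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in angstroms."""
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)


n = parse_atom_line("ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89")
ca = parse_atom_line("ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90")
print(round(distance(n, ca), 2))  # N to C-alpha bond length of Trp 5
```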

The same PDB entry "visualized" with Chime



8. Metabolism/Pathways


8. Databases: metabolic

• Contain information that describes enzymes, biochemical reactions and metabolic pathways;

• Nomenclature databases store information on enzyme names and reactions: ENZYME, BRENDA, IntEnz

• Metabolic databases: MetaCyc, KEGG, UniPathway, Rhea;

• Usually these databases are tightly coupled with query software that allows the user to visualise reaction schemes;

• Ligands and chemicals: ChEBI, KEGG LIGAND;


8. BRENDA

http://www.brenda-enzymes.org/

Useful for preparing lab experiments!


8. KEGG

http://www.genome.ad.jp/kegg




9. Bibliography

• Bibliographic reference databases contain citations and abstract information of published life science articles;

• Examples: PubMed, PubMed Central

• Other more specialized databases also exist:
Agricola (http://agricola.nal.usda.gov/)
EMBASE (not free)
…


9. PubMed / Medline

• Established in 1950;

• Database of citations and abstracts of biomedical and other life science journal literature;

• Encompasses MEDLINE;

• Gives access to:
– > 21 million papers (dating back to the 1860s),
– > 20'400 life science journals,
– ~55 languages (17'751 journals in English, 2'000 in French, 372 in Chinese, 29 in Latin, 1 in Azerbaijani, etc.).

PMID: 10923642 (PubMed ID)

UI: 20378145 (Medline ID)

DOI: 10.1016/S0960-9822(03)00148-9 (Digital Object Identifier)

http://www.ncbi.nlm.nih.gov/pubmed/


PubMed Central

• Free digital archive of freely accessible full-text articles (since 2000)

• ~700 journals (list: http://www.ncbi.nlm.nih.gov/pmc/journals/), most of which have a corresponding entry in PubMed

• Free access to the full text either immediately after publication or within a 12-month period.

http://www.ncbi.nlm.nih.gov/pmc/




10. Others

• There are many databases that cannot be classified in the categories listed previously;

• Examples:
– REBASE (restriction enzymes)
– TRANSFAC (transcription factors)
– CarbBank
– GlycoSuiteDB (linked sugars)
– Protein-protein interaction db (DIP, ProNet, BIND, MINT, STRING),
– Protease db (MEROPS), biotechnology patents db, microarrays, etc.;

• As well as many other resources concerning any aspect of macromolecules and molecular biology.


10. Interactome

Protein-protein interactions: from 1 to more than 20'000 interactions described per publication

Several databases: IntAct, BIND, DIP, STRING

Estimation: 10'000 fundamental interaction types


10. IntAct

http://www.ebi.ac.uk/intact/


10. Gene Ontology

• The Gene Ontology is a controlled vocabulary, a set of standard terms (words and phrases) used for indexing and retrieving information.

• In addition to defining terms, GO also defines the relationships between the terms, making it a structured vocabulary.

• The Gene Ontology ensures that the flood of information produced can be effectively utilized, through standardization of biological data/information

http://www.geneontology.org
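The idea of a structured vocabulary, in which terms are linked by relationships, can be sketched in Python as a small graph. The relationships below are a toy illustration and are not extracted from the actual Gene Ontology files; the names `IS_A` and `ancestors` are hypothetical.

```python
# Toy "is a" relationships: each term maps to its more general parent terms.
IS_A = {
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "binding": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents=IS_A):
    """Return every more general term reachable from `term`."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


print(sorted(ancestors("kinase activity")))
```

Thanks to these relationships, a search for a general term can also retrieve gene products annotated only with a more specific term.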



10. Gene Ontology

About 30'000 terms (with definition and hierarchy)

biological process
• broad biological phenomena, e.g. mitosis, growth, digestion (including PTMs)

molecular function
• molecular role, e.g. catalytic activity, binding

cellular component
• subcellular location, e.g. nucleus, ribosome, origin recognition complex


http://www.ebi.ac.uk/QuickGO/


10. Gene Ontology annotation

Annotation is the process of assigning/mapping GO terms to gene products…

!!! Electronic vs Manual annotation…


Example with EPO


Histone H4

!!! Large-scale derived data ('proteome')


10. Gene Ontology

Essential link between biological knowledge and high-throughput genomic and proteomic datasets…

'summary of the gene ontology classifications for all mapped ESTs…'


Genome-Wide RNAi screens identify genes required for ricin and Pseudomonas exotoxin intoxications

DOI 10.1016/j.devcel.2011.06.014

Gene Ontology analysis of the 2038-gene hit list.




• Concluding remarks


Proliferation of databases

• Which contains the highest-quality data?

• Which is the most comprehensive?

• Which is the most up-to-date?

• Which is the least redundant?

• Which is the best indexed (allows complex queries)?

• Which Web server responds most quickly?

• …?


Some important practical remarks

• Databases: many errors (automated annotation)!

• Not all db are available on all servers

• The update frequency is not the same for all servers;

• Some servers automatically add cross-references to an entry (implicit links) in addition to the already existing links (explicit links)… different looks…


Before the introduction to databases…

After the introduction to databases…


Marie-Claude Blatter

Swiss-Prot, Geneva

SIB Swiss Institute of Bioinformatics


Credits
