Introduction to biological databases
2, Introduction to biological databases
What is 'Bioinformatics'?
Bioinformatics is the application of computer sciences to
biology
… interdisciplinary science
… strives to solve the problems of the life sciences with
theoretical computer-assisted methods
… indispensable for modern biology and medicine
… uses techniques such as applied mathematics and
statistics
3, Introduction to biological databases
Some major research areas in bioinformatics
• Sequence analysis and function prediction
• Analysis and prediction of protein structure
• Computational evolutionary biology
• Comparative genomics
• Gene and protein expression
• Protein-Protein Interaction (PPI) analysis
• Systems biology
• Image analysis
• Visualization
4, Introduction to biological databases
Indispensable for bioinformatic studies:
1. Databases
2. Software tools
3. Servers
Introduction
5, Introduction to biological databases
Outline
• Introduction
• Selected categories of life sciences databases
1. Nucleotide sequences
2. Genomics
3. Mutation/polymorphism
4. Protein sequences
5. Protein domain/family
6. Proteomics (2D gel, Mass Spectrometry)
7. 3D structure
8. Metabolism/Pathways
9. Bibliography
10. Others
• Concluding remarks
• Practicals
6, Introduction to biological databases
Introduction
What is a database (db) ?
• A collection of related data, which are:
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db
• Also includes the associated tools (software) needed for db access, db updating, and insertion or deletion of db information.
• Data storage formats: flat files (text, FASTA), relational tables, structured formats (XML, RDF)
7, Introduction to biological databases
Introduction
Why biological databases (db) ?
• Exponential growth in biological data
• Data are no longer published in a conventional manner, but submitted directly to databases (nucleotide & amino acid sequences, 3D structures, 2D gel analysis, MS analysis, microarrays, publications, protein-protein interactions,…)
• Essential tools for biological research
8, Introduction to biological databases
P. Gaudet, 'A community of Biocurators'
9, Introduction to biological databases
Science cover, February 2011
10, Introduction to biological databases
Some statistics and remarks
• More than 1000 different "biological" databases
• Variable size: <100Kb to >100Gb (ENA > 728Gb !)
– DNA: > 100 Gb
– Protein: 2 Gb
– 3D structure: 5 Gb
– Other: smaller
• Update frequency: daily to annually
• Generally accessible through the web (free!?)
11, Introduction to biological databases
Where can we find…
• a video -> Youtube
• info on S. Hawking-> wikipedia
• a book -> Amazon
• a friend -> Facebook, Google plus
• DNA sequence -> EMBL
• protein sequence -> UniProtKB, RefSeq
• 3D data -> PDB
• Microarrays data -> ArrayExpress, GEO
• Publications -> PubMed
12, Introduction to biological databases
10 most important bioinformatics databases *
* according to "Bioinformatics for Dummies"
Name        URL                          Data type
GenBank     www.ncbi.nlm.nih.gov         Nucleotide sequences
Ensembl     www.ensembl.org              Genomes
PubMed      www.ncbi.nlm.nih.gov         Literature references
NCBI nr     www.ncbi.nlm.nih.gov         Protein sequences
UniProtKB   www.uniprot.org              Protein sequences
InterPro    www.ebi.ac.uk                Protein domains
OMIM        www.omim.org/                Genetic diseases
ENZYME      http://enzyme.expasy.org/    Enzymes
PDB         www.rcsb.org/pdb/            Protein structures
KEGG        www.genome.ad.jp             Metabolic pathways
14, Introduction to biological databases
Databases / Servers
• A server is a computer (from a given institute) that
provides services (stores databases and associated
tools) to other computers
• Main biological servers:
– ExPASy (www.expasy.org/)
– UniProt (www.uniprot.org)
– NCBI (www.ncbi.nlm.nih.gov/)
– EBI (www.ebi.ac.uk/)
– Japanese GenomeNet (www.genome.jp/)
• Not all servers give access to the same databases and to the same search tools! Even when servers give access to the same databases, the 'look' is different, and beware of the date of the latest release!
[Figure: the same data on different servers (UniProt, NCBI)]
16, Introduction to biological databases
How to find a database ?
• The Nucleic Acids Research (NAR) Online Molecular Biology Database collection 2011:
a total of 1'330 databases
http://www.oxfordjournals.org/nar/database/a/
• Expasy Life Science Directory: http://www.expasy.org/links.html (no longer updated)
• Google: http://www.google.com/
17, Introduction to biological databases
http://www.expasy.org/links.html Expasy Life Science Directory
18, Introduction to biological databases
Awareness of the content
and usage of knowledge resources
is a pre-requisite to do any type of "serious" research
in the field of molecular life sciences
(Amos Bairoch, 2007)
19, Introduction to biological databases
Outline
• Selected categories of life sciences db
1. Nucleotide sequences -> Primary db
20, Introduction to biological databases
Deluge of sequence data
• ~3200 genomes sequenced (single organism, varying sizes, including viruses)
• ~5'000 ongoing genome sequencing projects
• cDNAs sequencing projects (ESTs or cDNAs)
• metagenome sequencing projects (~300) (environmental samples: multiple ‘unknown’ organisms, varying sizes)
– Ecological metagenomics: beach sand, Sargasso Sea, New-York air, …
– Organismal metagenomics: human fluids, mouse gut, …
• Personal Human genomes
21, Introduction to biological databases
Deluge of sequence data
• Personal human genomes!
http://www.youtube.com/watch?v=mVZI7NBgcWM
22, Introduction to biological databases
Deluge of sequence data
But… we now know that his apoE allele is the one associated with increased risk for Alzheimer's disease and that he has the 'blue eye' allele…
23, Introduction to biological databases
Ensembl genome browser
24, Introduction to biological databases
Deluge of sequence data
http://www.personalgenomes.org
25, Introduction to biological databases
DNA sequence of the telomeric region of human chromosome X
26, Introduction to biological databases
1. Nucleotide sequences db
• The main DNA sequence db are:
EMBL/ENA (Europe) / GenBank (USA) / DDBJ (Japan)
-> INSDC collaboration
• There are also specialized databases for the different types of RNAs (e.g. tRNA, rRNA, tmRNA, uRNA, etc.)
• Others:
Eukaryotic Promoter Database (EPD); RNA editing sites,...
27, Introduction to biological databases
1. EMBL-ENA/GenBank/DDBJ http://www.insdc.org/
Archive of primary sequence data and corresponding annotation
submitted by the laboratories that did the sequencing.
28, Introduction to biological databases
1. Same data on different servers
EBI (EMBL/ENA) NCBI (GenBank)
NIG (DDBJ)
29, Introduction to biological databases
1. EMBL-ENA/GenBank/DDBJ
• Serve as archives: 'nothing goes out'
• Contain all public sequences derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers (cDNAs, ESTs…)
– Individual scientists (15 % of entries)
– Patent offices (e.g. European Patent Office, EPO)
• Currently: ~150 × 10^6 sequences, ~200 × 10^9 bp
• Sequences from > 500'000 different species
30, Introduction to biological databases
1. Ideal content of a "sequence" db
• Sequences !!
• Unique Accession number (AC)
• References
• Taxonomic data
• ANNOTATION/CURATION
• Keywords
• Cross-references
• Documentation
Minimal requirements !
31, Introduction to biological databases
1. EMBL-ENA entry
Cross-references
accession number
taxonomy
references
32, Introduction to biological databases
1. EMBL-ENA entry (cont.)
Annotation
(Prediction or
experimentally determined)
sequence
CDS
Coding Sequence
(proposed by submitters)
33, Introduction to biological databases
1. The hectic life of a sequence
cDNAs, ESTs, genes, genomes, … -> EMBL/ENA, GenBank, DDBJ (with or without an annotated CDS provided by the authors)
Data not submitted to public databases: delayed or cancelled…
CDS (Coding Sequence): portion of DNA/RNA translated into protein (from Met to 'STOP'); experimentally proved or derived from gene prediction. Not so well documented!
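To make the "from Met to 'STOP'" definition of a CDS concrete, here is a minimal sketch in Python. It assumes the standard genetic code and looks only at the first ATG on the given strand; the names `CODON_TABLE` and `translate_cds` are hypothetical and do not come from any database tool. Real CDS annotation also has to handle introns, alternative start codons and both strands.

```python
from typing import Optional

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(sequence: str) -> Optional[str]:
    """Translate from the first ATG (Met) to the first in-frame stop codon.

    Returns None if there is no start codon, no in-frame stop codon,
    or an ambiguous base inside the reading frame.
    """
    dna = sequence.upper().replace("U", "T")  # accept RNA as well as DNA
    start = dna.find("ATG")
    if start == -1:
        return None
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3])
        if amino_acid is None:   # codon contains N or another ambiguity code
            return None
        if amino_acid == "*":    # stop codon closes the CDS
            return "".join(protein)
        protein.append(amino_acid)
    return None                  # ran off the end without a stop codon


print(translate_cds("ccATGGCCTAAgg"))
```

The sketch returns `None` rather than a partial protein for an open-ended frame, which mirrors the slide's point that a CDS is only defined between a start and a stop.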
34, Introduction to biological databases
Coding Sequence (CDS): Alignments between a mRNA and a genomic sequence
1. EMBL-ENA vs GenBank format
36, Introduction to biological databases
1. FASTA format
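As an illustration of the format, a FASTA record is a header line starting with ">" followed by one or more lines of sequence. The sketch below reads such text into a dictionary; the two records are invented examples and `parse_fasta` is a hypothetical helper, not part of any database client.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                  # skip blank lines
        if line.startswith(">"):
            header = line[1:]         # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line   # sequence may span several lines
    return records


# Invented example records, for illustration only.
example = """>seq1 hypothetical example record
ATGGCC
TAA
>seq2 another hypothetical record
GGGTTT
""".splitlines()

fasta_records = parse_fasta(example)
for name, seq in fasta_records.items():
    print(name, len(seq), seq)
```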
37, Introduction to biological databases
1. EMBL-ENA/GenBank/DDBJ
• Heterogeneous sequence length and quality:
– ESTs, genomes, variants, fragments…
• Sequence sizes:
– max 350'000 bp/entry (! genomic sequences, overlapping)
– min 10 bp/entry
• Archive: nothing goes out -> highly redundant!
• Full of errors (in sequences, in annotations, in CDS attribution) and no consistency of annotations:
– most annotations are done by the submitters;
– the quality, completeness and updating of the information are heterogeneous
38, Introduction to biological databases
1. EMBL-ENA/GenBank/DDBJ
• Unexpected information you can find in these db:
FT source 1..124
FT /db_xref="taxon:4097"
FT /organelle="plastid:chloroplast"
FT /organism="Nicotiana tabacum"
FT /isolate="Cuban cahibo cigar, gift from
FT President Fidel Castro"
• Or:
FT source 1..17084
FT /chromosome="complete mitochondrial genome"
FT /db_xref="taxon:9267"
FT /organelle="mitochondrion"
FT /organism="Didelphis virginiana"
FT /dev_stage="adult"
FT /isolate="fresh road killed individual"
FT /tissue_type="liver"
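The FT lines above follow a simple `/qualifier="value"` convention, where a value may continue on the next FT line. A minimal sketch for collecting the qualifiers of one `source` feature is shown below. It assumes whitespace-separated FT lines like those on the slide; `parse_source_qualifiers` is a hypothetical helper, not a full EMBL flat-file parser.

```python
import re
from typing import Dict, Iterable


def parse_source_qualifiers(lines: Iterable[str]) -> Dict[str, str]:
    """Collect /key="value" qualifiers from EMBL feature-table (FT) lines."""
    qualifiers: Dict[str, str] = {}
    key = None
    for line in lines:
        if not line.startswith("FT"):
            continue
        content = line[2:].strip()
        match = re.match(r'/(\w+)="?(.*)', content)
        if match:
            key = match.group(1)
            qualifiers[key] = match.group(2)
        elif key is not None and not qualifiers[key].endswith('"'):
            # The previous value is still open, so this is a continuation line.
            qualifiers[key] += " " + content
    # Drop the closing quote of each value.
    return {k: v.rstrip('"') for k, v in qualifiers.items()}


entry_lines = [
    'FT   source          1..124',
    'FT                   /db_xref="taxon:4097"',
    'FT                   /organelle="plastid:chloroplast"',
    'FT                   /organism="Nicotiana tabacum"',
    'FT                   /isolate="Cuban cahibo cigar, gift from',
    'FT                   President Fidel Castro"',
]

source_qualifiers = parse_source_qualifiers(entry_lines)
for name, value in source_qualifiers.items():
    print(f"{name}: {value}")
```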
39, Introduction to biological databases
1. Other nucleotide sequences databases
http://www.rnaiweb.com/RNAi/RNAi_Web_Resources/siRNA_Collections___Databases/
40, Introduction to biological databases
1. Other nucleotide sequences databases
• EPD is a rigorously selected database. In order to be included in EPD, a promoter must be:
– recognized by eukaryotic RNA Pol II,
– active in a higher eukaryote,
– experimentally defined, or homologous and sufficiently similar to an experimentally defined promoter,
– biologically functional,
– available in the current ENA release,
– distinct from other promoters in the database.
http://www.epd.isb-sib.ch/
41, Introduction to biological databases
Outline
• Selected categories of life sciences db
1. Nucleotide sequences -> Primary db
2. Genomics
42, Introduction to biological databases
2. 'Genomics databases'
• Contain information on gene chromosomal location
(mapping) and nomenclature, and provide links to
sequence databases; contain usually no sequence!
• Exist for most model organisms; usually species specific.
• Examples: MIM (human), MGD (mouse), FlyBase
(Drosophila), SGD (yeast), MaizeDB (maize), SubtiList
(B.subtilis), TAIR (arabidopsis) etc.;
43, Introduction to biological databases
2. TAIR
http://www.arabidopsis.org/
44, Introduction to biological databases
2. OMIM: Online Mendelian Inheritance in Man
• ~20'300 human protein-coding genes
• 2850 protein-coding genes with mutations causing human disorders
• ~1800 more to be discovered
• ~1100 loci affecting more than 165 polygenic diseases have been identified (PMID:21307931)
45, Introduction to biological databases
2. OMIM: Online Mendelian Inheritance in Man
http://www.omim.org/
46, Introduction to biological databases
OMIM: 'gene' entry
47, Introduction to biological databases
OMIM: 'disease' entry
48, Introduction to biological databases
2. Genome browser: Ensembl
• Ensembl provides a bioinformatics framework to
organize biology around the sequences of large
genomes.
http://www.ensembl.org/
49, Introduction to biological databases
Ensembl genome browser
50, Introduction to biological databases
Genome browser: UCSC
http://genome.ucsc.edu/cgi-bin/hgGateway
51, Introduction to biological databases
A eukaryotic gene (UCSC)
[Figure: structure of a eukaryotic gene, 5' to 3': 5' untranslated region, initial exon (Met), introns, internal exons, final exon (Stop)]
52, Introduction to biological databases
Genome browser: UCSC
http://genome.ucsc.edu/cgi-bin/hgGateway
53, Introduction to biological databases
Outline
• Selected categories of life sciences db
1. Nucleotide sequences -> Primary db
2. Genomics
3. Mutation/polymorphism
54, Introduction to biological databases
3. Mutation/polymorphism
Single nucleotide polymorphisms (SNPs) are unique genetic
differences between individuals that contribute in significant ways to
the determination of human variation including physical characteristics
like height and appearance as well as less obvious traits such as
personality, behaviour and disease susceptibility. SNPs can also
significantly influence responses to pharmacotherapy and whether
drugs will produce adverse reactions.
DOI: 10.2174/157016308785739811
SNP Technologies for Drug Discovery: A Current Review.
Each human genome contains ~3'000'000 single nucleotide polymorphism (SNP) variants (1 per 1000 bp).
55, Introduction to biological databases
S.E. Antonarakis
56, Introduction to biological databases
3. Mutation/polymorphism db
• Contain information on sequence variations, whether or not they are linked to genetic diseases;
• General db:
– dbSNP - Human single nucleotide polymorphism (SNP) db (variants with frequency > 1 %)
!!! a disease mutation is rare -> dbSNP contains few 'disease-linked' mutations
• Disease-specific db: most of these databases are linked either to a single gene or to a single disease;
– p53 mutation db
– ADB - Albinism db (Mutations in human genes causing albinism)
– Asthma and Allergy gene db
– ….
57, Introduction to biological databases
http://www.ncbi.nlm.nih.gov/SNP/
58, Introduction to biological databases
59, Introduction to biological databases
60, Introduction to biological databases
Blue eye allele… dbSNP: rs12913832 -> link to the ALFRED database
[Figure labels: blue eyes, brown eyes]
61, Introduction to biological databases
Outline
• Selected categories of life sciences db
1. Nucleotide sequences -> Primary db
2. Genomics
3. Mutation/polymorphism
4. Protein sequences -> Primary db
62, Introduction to biological databases
4. Protein sequences – Eukaryotic cell
Cell elemental composition
Cells are made of 90% water.
The remainder is approximately:
• 50% protein (3.5kg)
• 15% carbohydrate
• 15% nucleic acid (1.3kg)
• 10% lipid
• 10% miscellaneous
63, Introduction to biological databases
Amino acid sequence
(1 letter code)
of human titin
64, Introduction to biological databases
4. Protein sequence origin
• About 180 billion proteins (?)
• > 15 million 'known' protein sequences in 2011
• More than 99 % of protein sequences are derived from the translation of nucleotide sequences
• Less than 1 % come from direct protein sequencing (Edman, MS/MS…)
-> It is important that users know where the protein sequence comes from (sequencing & gene prediction quality)!
65, Introduction to biological databases
http://www.nature.com/news/2010/100922/full/467380a.html
(US$30 million per year)
66, Introduction to biological databases
4. The hectic life of a sequence
cDNAs, ESTs, genes, genomes, … -> nucleic acid databases (ENA, GenBank, DDBJ)
Data not submitted to public databases: delayed or cancelled…
Nucleic acid databases -> protein sequence databases, if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries)
No CDS -> gene prediction (RefSeq, Ensembl)
67, Introduction to biological databases
4. The hectic life of a sequence
cDNAs, ESTs, genes, genomes, … -> ENA, GenBank, DDBJ
Data not submitted to public databases: delayed or cancelled…
[Figure: flow of sequences from the nucleotide archives into protein sequence resources. Labels: coding sequences provided by submitters; coding sequences provided by submitters and gene prediction; sequences derived from scientific publications; TrEMBL, GenPept, RefSeq, PRF, Swiss-Prot, UniProtKB, Ensembl, CCDS, UniParc, PDB, (PIR), (IPI), UniMES, plus all 'species'-specific databases (EcoGene, TAIR, …)]
68, Introduction to biological databases
4. Major protein sequence db 'sources'
1. UniProtKB: Swiss-Prot + TrEMBL
2. NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq
UniProtKB/Swiss-Prot: manually annotated protein sequences (12'500 species)
UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (380'000 species)
GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (380'000 species ?)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank: 3D data and associated sequences
PRF: journal scan of 'published' peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (16'000 species)
[Figure labels: PIR, PDB, PRF; integrated resources; 'cross-references'; separated resources]
69, Introduction to biological databases
[Figure: www.uniprot.org search results for toll-like receptor 4 (Homo sapiens), showing Swiss-Prot and TrEMBL entries]
70, Introduction to biological databases
[Figure: http://www.ncbi.nlm.nih.gov/ search results for toll-like receptor 4 (Homo sapiens): mostly GenPept entries, plus one Swiss-Prot and one RefSeq entry]
71, Introduction to biological databases
4. UniProt - The Universal Protein resource
is maintained by the UniProt consortium: SIB + EBI + PIR
http://www.uniprot.org/
UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01 HG02712-04. Additional support for the EBI's involvement in
UniProt comes from the European Commission (EC)'s FELICS grant (021902RII3) and from the NIH grant 1R01HGO2273-01. Swiss-Prot
activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the
European Commission contracts FELICS (021902RII3) and SLING (226073). PIR activities are also supported by the NIH grants and
contracts HHSN266200400061C, NCI-caBIG, and 5R01GM080646-04, and the Department of Defense grant W81XWH0720112.
4. UniProtKB: from ENA to TrEMBL
ENA (DNA)
TrEMBL
Translated CDS
Reference + tissue
Protein name
Translated CDS
Product name
Tissue
Reference
Automated extraction of
protein sequence
(translated CDS), gene
name and references.
Automated annotation.
73, Introduction to biological databases
UniProtKB/TrEMBL
Automatic annotation
Protein sequence
- The quality of the protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS) or on the gene prediction pipeline (e.g. Ensembl).
- 100% identical sequences (same length, same organism) are merged automatically.
Biological information: sources of annotation
- Provided by the submitter (EMBL, PDB, TAIR…)
- From automated annotation (automatically generated annotation rules (e.g. SAAS) and/or manually generated annotation rules (e.g. UniRule))
74, Introduction to biological databases
4. UniProtKB: from TrEMBL to Swiss-Prot
TrEMBL
Translated CDS
Reference
Protein name
Swiss-Prot
Manual annotation of
the sequence and
manual review of
associated biological
information
Protein names
Many more references
Translated CDS
+ polymorphisms
+ isoforms
+ …
Full annotation
Once manually annotated and integrated into Swiss-Prot,
the entry is deleted from TrEMBL
-> minimal redundancy
76, Introduction to biological databases
UniProtKB/Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate sequence
discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract literature
information, ortholog data propagation, …)
78, Introduction to biological databases
Protein and gene names
79, Introduction to biological databases
…enable researchers to obtain a summary of what is known about a protein…
General annotation
(Comments)
www.uniprot.org
80, Introduction to biological databases
Human protein manual annotation:
some statistics (June 2011)
81, Introduction to biological databases
Sequence annotation
(Features)
…enable researchers to obtain a summary of what is known about a protein…
www.uniprot.org
82, Introduction to biological databases
Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and predicted data
and makes a clear distinction between both
Type of evidence Qualifier
Strong experimental evidence None or Ref.X
Light experimental evidence Probable
Inferred by similarity with homologous protein By similarity
Inferred by prediction Potential
83, Introduction to biological databases
'Protein existence' tag
• The 'Protein existence' tag indicates the evidence for the existence of a given protein;
• Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location),…)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58 %)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)
http://www.uniprot.org/docs/pe_criteria
84, Introduction to biological databases
85, Introduction to biological databases
The UniProt web site
www.uniprot.org
• Powerful search engine, google-like and easy-to-use, but also supports very directed field searches
• Scoring mechanism presenting relevant matches first
• Entry views, search result views and downloads are customizable
• The URL of a result page reflects the query; all pages and queries are
bookmarkable, supporting programmatic access
• Search, Blast, Align, Retrieve, ID mapping
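Because the URL of a result page reflects the query, a search can be built programmatically. The sketch below only constructs a URL and does not contact any server. It assumes the current UniProt REST endpoint (`https://rest.uniprot.org/uniprotkb/search`) and its `query`, `format` and `size` parameters, which postdate these slides and may change; `build_search_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: UniProt's REST search endpoint at the time of writing.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query: str, fmt: str = "fasta", size: int = 10) -> str:
    """Return a bookmarkable search URL; no network request is made."""
    params = {"query": query, "format": fmt, "size": size}
    return BASE_URL + "?" + urlencode(params)


# Field-directed query: gene name TLR4 restricted to human (taxon 9606).
search_url = build_search_url("gene:TLR4 AND organism_id:9606")
print(search_url)

# The query can be recovered from the URL, which is what makes it bookmarkable.
recovered = parse_qs(urlsplit(search_url).query)
print(recovered["query"][0])
```

The resulting string can be pasted into a browser or passed to any HTTP client.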
86, Introduction to biological databases
Search
A very powerful text search tool with autocompletion and refinement options, for finding UniProt entries and documentation by biological information
87, Introduction to biological databases
The search interface guides users with helpful suggestions and hints
88, Introduction to biological databases
89, Introduction to biological databases
Advanced Search
A very powerful search tool
To be used when you know in which
entry section the information is stored
90, Introduction to biological databases
Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)
91, Introduction to biological databases
Result pages: highly customizable
92, Introduction to biological databases
Result pages: downloadable
93, Introduction to biological databases
94, Introduction to biological databases
4. NCBI nr - Entrez 'protein'
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
96, Introduction to biological databases
4. Protein sequences: NCBI nr
NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF
• GenPept: contains all CDS annotated in GenBank/ENA/DDBJ sequences ('translations from annotated coding regions in GenBank'); equivalent to TrEMBL, except that it is redundant with other databases (Swiss-Prot, RefSeq, PIR…)
• PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)
• PDB: 3D structure database: all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)
• PRF: sequences derived from scientific publications ('journal scan'); integrated into TrEMBL
97, Introduction to biological databases
4. RefSeq
RefSeq: The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
Tightly linked to Entrez Gene ("interdependent curated resources")
98, Introduction to biological databases
AC
Taxonomy
References
4. RefSeq
Protein: NP_
mRNA: NM_
DNA: NC_
99, Introduction to biological databases
Status and GenBank source
Annotation
- automated,
- derived from Swiss-Prot
- in-house
100, Introduction to biological databases
Annotation
- automated,
- derived from Swiss-Prot
- in-house
Sequence
Cross-references
101, Introduction to biological databases
4. RefSeq
http://www.ncbi.nlm.nih.gov/RefSeq/
Curation status                Manual annotation
GENOME ANNOTATION              No
INFERRED                       No
MODEL                          No
PREDICTED                      No
PROVISIONAL                    No
REVIEWED                       Yes (sequence + functional information and features)
VALIDATED                      Yes (initial sequence)
WGS (Whole Genome Shotgun)     No
102, Introduction to biological databases
4. Accession number (AC) mapping
These identifiers all point to a TP53 (p53) protein sequence!
P04637, NP_000537, NP_001119584.1, NP_001119585.1,
NP_001119584.1, NP_001119584.1, NP_001119584.1,
NP_001119584.1, ENSG00000141510, CCDS11118,
UPI000002ED67, IPI00025087, etc.
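Before mapping identifiers between resources, it helps to recognize which resource an accession number comes from. The following sketch classifies the identifiers listed above by their shape alone. The patterns are simplified assumptions: the UniProtKB pattern follows the published accession format, while the others cover only the prefixes seen on these slides. `classify_accession` is a hypothetical helper and performs no actual mapping.

```python
import re

# Simplified, illustrative patterns; real resources accept more prefixes.
ACCESSION_PATTERNS = [
    ("UniProtKB", r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"),
    ("RefSeq protein", r"NP_\d+(?:\.\d+)?"),
    ("RefSeq mRNA", r"NM_\d+(?:\.\d+)?"),
    ("RefSeq genomic", r"NC_\d+(?:\.\d+)?"),
    ("Ensembl gene", r"ENSG\d{11}"),
    ("CCDS", r"CCDS\d+"),
    ("UniParc", r"UPI[0-9A-F]{10}"),
    ("IPI", r"IPI\d{8}"),
]


def classify_accession(accession: str) -> str:
    """Return the name of the first resource whose pattern matches fully."""
    for resource, pattern in ACCESSION_PATTERNS:
        if re.fullmatch(pattern, accession):
            return resource
    return "unknown"


tp53_identifiers = [
    "P04637", "NP_000537", "NP_001119584.1",
    "ENSG00000141510", "CCDS11118", "UPI000002ED67", "IPI00025087",
]
for identifier in tp53_identifiers:
    print(identifier, "->", classify_accession(identifier))
```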
103, Introduction to biological databases
http://www.uniprot.org/mapping/
104, Introduction to biological databases
Outline
• Selected categories of life sciences db
1. Nucleotide sequences -> Primary db
2. Genomics
3. Mutation/polymorphism
4. Protein sequences -> Primary db
5. Protein domain/family
105, Introduction to biological databases
5. Protein domain/family: some definitions
• Most proteins have 'modular' conserved structures
• Estimation: ~3 domains / protein
• Estimation: ~6000 'known' domains
• Estimation: ~80% of proteins have at least one 'known' domain
-> Predicting the domain content of an unknown protein sequence may help to find a 'function'
Example of conserved regions (PPID family)
- 1 CSA_PPIASE (cyclophilin-type peptidyl-prolyl cis-trans isomerase) (domain)
- 3 TPR repeats (tetratricopeptide repeat)
- 1 active site (Cys 181: active site residue)
- Binding cleft (motif)
107, Introduction to biological databases
Domain signatures methods:
derived from 'modelled' multiple sequence alignments (MSA)
• Pattern
• Fingerprint
• Sequence clustering
• Profile
• HMM
108, Introduction to biological databases
How to build a PROSITE pattern ?
• Start with a multiple sequence alignment (MSA)
Information lost: 4D 1E
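Once a pattern has been derived from the alignment, it can be scanned against sequences. As a sketch, the function below converts PROSITE pattern syntax into a Python regular expression. It handles the common elements (`x`, `[..]`, `{..}`, repeat counts, `<` and `>` anchors) and is not a complete implementation; `prosite_to_regex` is a hypothetical helper. The example uses the well-known N-glycosylation site pattern `N-{P}-[ST]-{P}` and an invented peptide.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")  # terminus anchors
        element = element.replace("x", ".")                    # any residue
        element = element.replace("{", "[^").replace("}", "]") # forbidden residues
        element = element.replace("(", "{").replace(")", "}")  # repeat counts
        parts.append(element)
    return "".join(parts)


glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(glyco_regex)

# Invented peptide for illustration; report the 1-based start of each match.
peptide = "AANGSAKNPTL"
hits = [match.start() + 1 for match in re.finditer(glyco_regex, peptide)]
print(hits)
```

The second candidate site in the peptide (N-P-T) is rejected because proline is excluded at the position after asparagine, which illustrates how a pattern loses no information about forbidden residues but says nothing about residue frequencies.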
109, Introduction to biological databases
5. Protein domain/family db
Database       Method
PROSITE        Patterns / Profiles
ProDom         Aligned motifs (PSI-BLAST) (Pfam B)
PRINTS         Aligned motifs
Pfam           HMM (Hidden Markov Models)
SMART          HMM
TIGRfam        HMM
DOMO           Aligned motifs
BLOCKS         Aligned motifs (PSI-BLAST)
CDD (CDART)    PSI-BLAST (PSSM) of Pfam and SMART
InterPro
110, Introduction to biological databases
InterPro scan results
?
Part of the protein sequence which has been 'recognized' by different modelled MSA
What makes Bee special?
112, Introduction to biological databases
Outline
• Selected categories of life sciences db
1. Nucleotide sequences -> Primary db
2. Genomics
3. Mutation/polymorphism
4. Protein sequences -> Primary db
5. Protein domain/family
6. Proteomics (2D gel, Mass Spectrometry)
113, Introduction to biological databases
6. Proteomics db
• Mass Spectrometry (MS) database: PRIDE
• SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.
– Contain information obtained by 2D-PAGE: images of master gels and description of identified proteins
– Composed of image and text files
115, Introduction to biological databases
6. PRIDE
http://www.ebi.ac.uk/pride/
116, Introduction to biological databases
Outline
• Selected categories of life sciences db
1. Nucleotide sequences -> Primary db
2. Genomics
3. Mutation/polymorphism
4. Protein sequences -> Primary db
5. Protein domain/family
6. Proteomics (2D gel, Mass Spectrometry)
7. 3D structure
117, Introduction to biological databases
3D structure
• Only one database : PDB (Protein Data Bank)
but several servers….
• Contains the spatial coordinates of macromolecule
atoms whose 3D structure has been experimentally
obtained by X-ray or NMR studies; also a few models.
• Proteins represent more than 90% of available structures
(others are DNA, RNA, sugars, viruses, protein/DNA
complexes…)
118, Introduction to biological databases
7. PDB: Protein Data Bank
• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
• Associated specialized programs allow visualization of the corresponding 3D structure (e.g., Swiss-PdbViewer, Chime, RasMol).
• Currently (September 28, 2011) there are 75'000 structures for about 20'000 different proteins (highly redundant)!
http://www.pdb.org/ (RCSB)
http://www.ebi.ac.uk/pdbe/
http://www.pdbj.org/
119, Introduction to biological databases
7. PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………
120, Introduction to biological databases
7. PDB (cont.)
SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87
SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
…….
Coordinates <x; y; z> of each atom
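As a sketch of how the coordinates can be used, the code below reads two of the ATOM lines shown above and computes the distance between the backbone N and CA atoms of Trp 5. It assumes whitespace-separated fields in the order they appear on this slide; real PDB files are fixed-column and newer entries include a chain identifier, so this is not a general PDB parser. `Atom` and `parse_atom_lines` are hypothetical names.

```python
import math
from typing import Iterable, List, NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_lines(lines: Iterable[str]) -> List[Atom]:
    """Parse whitespace-separated ATOM lines (old-style layout, no chain ID)."""
    atoms = []
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "ATOM":
            continue  # skip HEADER, REMARK, SHEET and other record types
        atoms.append(Atom(
            serial=int(fields[1]),
            name=fields[2],
            residue=fields[3],
            residue_number=int(fields[4]),
            x=float(fields[5]),
            y=float(fields[6]),
            z=float(fields[7]),
        ))
    return atoms


pdb_lines = [
    "ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89",
    "ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90",
]
atoms = parse_atom_lines(pdb_lines)

# Euclidean distance in angstroms between the N and CA atoms.
n_ca_distance = math.dist(atoms[0][4:], atoms[1][4:])
print(f"N-CA distance: {n_ca_distance:.2f} A")
```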
The same PDB entry 'visualized' with Chime
122, Introduction to biological databases
Outline
• Selected categories of life sciences db
1. Nucleotide sequences -> Primary db
2. Genomics
3. Mutation/polymorphism
4. Protein sequences -> Primary db
5. Protein domain/family
6. Proteomics (2D gel, Mass Spectrometry)
7. 3D structure
8. Metabolism/Pathways
123, Introduction to biological databases
8. Databases: metabolic
• Contain information that describes enzymes, biochemical reactions and metabolic pathways;
• Nomenclature databases store information on enzyme names and reactions: ENZYME, BRENDA, IntEnz
• Metabolic databases: MetaCyc, KEGG, UniPathway, Rhea;
• Usually these databases are tightly coupled with query software that allows the user to visualise reaction schemes;
• Ligands and chemicals: ChEBI, KEGG ligand;
124, Introduction to biological databases
8. BRENDA
Useful for preparing lab experiments! http://www.brenda-enzymes.org/
127, Introduction to biological databases
Outline
• Selected categories of life sciences db
1. Nucleotide sequences -> Primary db
2. Genomics
3. Mutation/polymorphism
4. Protein sequences -> Primary db
5. Protein domain/family
6. Proteomics (2D gel, Mass Spectrometry)
7. 3D structure
9. Bibliography
128, Introduction to biological databases
9. Bibliography
• Bibliographic reference databases contain citations and
abstract information of published life science articles;
• Examples: PubMed, PubMed Central
• Other more specialized databases also exist:
Agricola (http://agricola.nal.usda.gov/)
EMBASE - not free
…
129, Introduction to biological databases
9. PubMed / Medline
• PubMed online since 1996; Medline citations go back to about 1950;
• Database of citations and abstracts to biomedical and
other life science journal literature;
• Encompasses Medline;
• Gives access to:
– > 21 million papers (dating back to the 1860s),
– > 20,400 life science journals,
– ~ 55 languages (17,751 journals in English, 2,000 in French, 372 in
Chinese, 29 in Latin, 1 in Azerbaijani, etc.).
PMID: 10923642 (PubMed ID)
UI: 20378145 (Medline ID)
DOI : 10.1016/S0960-9822(03)00148-9 (Digital Object Identifier)
http://www.ncbi.nlm.nih.gov/pubmed/
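The three identifier types above can be told apart partly by their shape. The following Python sketch is one possible way to do it, under stated assumptions: a DOI is taken to start with "10." followed by a 4 to 9 digit registrant code and a slash (a common heuristic, not the full DOI syntax), and a PubMed record is assumed to be reachable by appending the PMID to the URL shown on this slide. PMIDs and Medline UIs are both plain numbers, so format alone cannot distinguish them. The function names are illustrative.

```python
import re

PUBMED_BASE = "http://www.ncbi.nlm.nih.gov/pubmed/"  # URL given on the slide

# Heuristic DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.[0-9]{4,9}/\S+")
NUMERIC_PATTERN = re.compile(r"[0-9]+")


def classify_identifier(text: str) -> str:
    """Return 'doi', 'numeric' (PMID or Medline UI) or 'unknown'."""
    text = text.strip()
    if DOI_PATTERN.fullmatch(text):
        return "doi"
    if NUMERIC_PATTERN.fullmatch(text):
        return "numeric"
    return "unknown"


def pubmed_url(pmid: str) -> str:
    """Build a PubMed record URL, assuming the base URL accepts a PMID as path."""
    pmid = pmid.strip()
    if not NUMERIC_PATTERN.fullmatch(pmid):
        raise ValueError("a PMID must consist of digits only")
    return PUBMED_BASE + pmid


for identifier in ["10923642", "20378145", "10.1016/S0960-9822(03)00148-9"]:
    print(identifier, "->", classify_identifier(identifier))
print(pubmed_url("10923642"))
```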
130, Introduction to biological databases
PubMed Central
• Free digital archive of open-access full-text articles (since 2000)
• ~700 journals (list: http://www.ncbi.nlm.nih.gov/pmc/journals/), most of which have a corresponding entry in PubMed
• Free access to the full text either immediately after publication or within a 12-month period.
http://www.ncbi.nlm.nih.gov/pmc/
131, Introduction to biological databases
10. Others
• There are many databases that cannot be classified in
the categories listed previously;
• Examples:
– REBASE (restriction enzymes)
– TRANSFAC (transcription factors)
– CarbBank
– GlycoSuiteDB (linked sugars)
– Protein-protein interaction db (DIP, ProNet, BIND, MINT, STRING),
– Protease db (MEROPS), biotechnology patents db, microarrays, etc.;
• As well as many other resources covering all aspects
of macromolecules and molecular biology.
132, Introduction to biological databases
10. Interactome
Protein-protein interactions: a single publication may describe
from 1 to more than 20,000 interactions
Several databases: IntAct, BIND, DIP, STRING
Estimate: about 10,000 fundamental interaction types
134, Introduction to biological databases
10. IntAct
http://www.ebi.ac.uk/intact/
135, Introduction to biological databases
10. Gene Ontology
• The Gene Ontology is a controlled vocabulary, a set of
standard terms—words and phrases—used for indexing
and retrieving information.
• In addition to defining terms, GO also defines the
relationships between the terms, making it a structured
vocabulary.
• By standardizing how biological data are described, the
Gene Ontology helps ensure that the flood of information
being produced can be used effectively
http://www.geneontology.org
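To make the idea of a structured vocabulary more concrete, the sketch below stores a handful of terms together with their parent terms and walks up the graph to collect every ancestor of a term. The term names are taken from this presentation, but the parent links are a simplified toy graph written for illustration, not the actual Gene Ontology relationships, and the names `PARENTS` and `ancestors` are hypothetical.

```python
# Toy graph: term -> list of parent terms (illustrative, NOT the real GO edges).
# A term may have several parents, so the structure is a graph, not a tree.
PARENTS = {
    "mitosis": ["cell cycle process"],
    "cell cycle process": ["biological process"],
    "catalytic activity": ["molecular function"],
    "binding": ["molecular function"],
    "nucleus": ["organelle"],
    "organelle": ["cellular component"],
    "origin recognition complex": ["protein complex", "nucleus"],
    "protein complex": ["cellular component"],
}


def ancestors(term, parents):
    """Return the set of all terms reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("mitosis", PARENTS)))
print(sorted(ancestors("origin recognition complex", PARENTS)))
```

Because the relationships are stored explicitly, a query for a broad term such as "cellular component" can also retrieve everything annotated with its more specific descendants.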
137, Introduction to biological databases
10. Gene Ontology
About 30,000 terms (with definition and hierarchy)
biological process
• broad biological phenomena, e.g. mitosis, growth, digestion
(including PTMs).
molecular function
• molecular role, e.g. catalytic activity, binding
cellular component
• subcellular location, e.g. nucleus, ribosome, origin recognition
complex
138, Introduction to biological databases
http://www.ebi.ac.uk/QuickGO/
139, Introduction to biological databases
10. Gene Ontology annotation
Annotation is the process of assigning/mapping
GO terms to gene products…
!!! Electronic vs Manual annotation…
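One practical way to keep electronic and manual annotations apart is to look at the evidence code attached to each annotation. In GO, the code IEA ("Inferred from Electronic Annotation") marks annotations that were not reviewed by a curator, while codes such as IDA or TAS are assigned manually. The Python sketch below splits a list of annotations on that basis. The gene products and annotations are invented placeholders, and treating every non-IEA code as manual is a simplifying assumption.

```python
# Evidence code used for automatically assigned annotations.
ELECTRONIC_CODES = {"IEA"}

# Hypothetical annotations: (gene product, GO term name, evidence code).
annotations = [
    ("protein_A", "catalytic activity", "IDA"),
    ("protein_A", "nucleus", "IEA"),
    ("protein_B", "binding", "TAS"),
    ("protein_B", "mitosis", "IEA"),
]


def split_by_evidence(records):
    """Separate annotations into (manual, electronic) lists by evidence code."""
    manual = []
    electronic = []
    for record in records:
        evidence = record[2]
        if evidence in ELECTRONIC_CODES:
            electronic.append(record)
        else:
            manual.append(record)  # assumption: anything not IEA is manual
    return manual, electronic


manual, electronic = split_by_evidence(annotations)
print("manual:", manual)
print("electronic:", electronic)
```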
140, Introduction to biological databases
Example with EPO
141, Introduction to biological databases
Histone H4
!!! Large-scale derived data ('proteome')
142, Introduction to biological databases
10. Gene Ontology
Essential link between biological knowledge and high-throughput
genomic and proteomic datasets…
‘summary of the gene ontology classifications for all mapped ESTs…’
143, Introduction to biological databases
Genome-Wide RNAi screens identify genes
required for ricin and Pseudomonas
exotoxin intoxications
DOI 10.1016/j.devcel.2011.06.014
Gene Ontology analysis of the hit list of 2038 genes.
144, Introduction to biological databases
• Selected categories of life sciences db
1. Nucleotide sequences -> Primary db
2. Genomics
3. Mutation/polymorphism
4. Protein sequences -> Primary db
5. Protein domain/family
6. Proteomics (2D gel, Mass Spectrometry)
7. 3D structure
9. Bibliography
10. Others
• Concluding remarks
145, Introduction to biological databases
Proliferation of databases
• Which contains the highest-quality data?
• Which is the most comprehensive?
• Which is the most up-to-date?
• Which is the least redundant?
• Which is the best indexed (allows complex queries)?
• Which Web server responds most quickly?
• …….??????
146, Introduction to biological databases
Some important practical remarks
• Databases: many errors (automated annotation) !
• Not all db are available on all servers
• The update frequency is not the same for all servers;
• Some servers automatically add cross-references to an
entry (implicit links) in addition to the existing links
(explicit links), so the same entry can look different from one server to another.
147, Introduction to biological databases
Before the introduction to databases…
After the introduction to databases…
148, Introduction to biological databases
Marie-Claude Blatter
Swiss-Prot, Geneva
SIB Swiss Institute of Bioinformatics
Credits