Salvador Martínez de Bartolomé [email protected] support – ProteoRed Proteomics Facility, National Center for Biotechnology, Madrid
1. Proteomics database contents
Protein sequence databases
Menu
Introduction : bioinformatics and sequence databases
Nucleic acid sequence databases
Protein sequences databases (sources)
Protein sequences databases (other)
Biology of the XXI century
Three major developments:
High throughput technique analysis: DNA sequencing, mass spectrometry, micro-
Numerous biological databases available through the Web
Bioinformatics tools available through the Web
An overwhelming number of unordered resources
[Figure: a cloud of unlabelled "Tool" and "database" boxes, followed by the same boxes grouped by category: Protein sequence; Tertiary structure; Protein 2D PAGE & MS; PTM; Protein identification & characterization; PTM prediction tools; Primary structure analysis; Tertiary structure prediction; Nucleotide to amino acid translation; Sequence alignment; Similarity search; Gene expression; Protein interactions; Species / genomic; Functional; Secondary structure prediction; Subcellular localization; Polymorphism / mutation / disease; Topology prediction; Pattern & profile search; Domains & classification; Secondary structure; Phylogenetics & taxonomy; References / nomenclature; Nucleotide sequence repository]
Example resources named in the figure: UniProtKB (Swiss-Prot/TrEMBL), TargetP, EcoGene, Ensembl, FlyBase, MGD, SGD, SubtiList, TIGR CMR, HIV, TAIR, MEROPS, ENZYME, TRANSFAC, KEGG, HAMAP, PROSITE, InterPro, Pfam, ProDom, BLOCKS, TIGRFAM, ProtoMap, CATH, SCOP, PDB, SWISS-MODEL, ScanProsite, MotifScan, HSSP, Jpred, GOR, DIP, IntAct, ProtScale, ProtParam, BLAST, FASTA, dbSNP, GeneCards, OMIM, CleanEx, DDBJ, GenBank, EMBL, TreeBase, NEWT, Taxonomy, PSORT, Glycosuite, PhosphBase, NetOGlyc, ChloroP, PeptideMass, Mascot, Phenyx, ECO2DBASE, Siena-2D PAGE, SWISS-2D PAGE, TMHMM, SOSUI, PubMed, HUGO, GO, ClustalW, DIALIGN, Translate
Molecular bioinformatics: an operational definition
The application of computer science to molecular biology…
…in particular for the study of macromolecules such as proteins, nucleic acids and oligosaccharides
Protein sequence databases
- Identification of proteins by proteomics --> completeness, sequence quality
- Similarity searches (functional prediction) --> sequence quality (non-redundancy)
- Training datasets (prediction tools) --> sequence and annotation quality
- Genome annotation…
(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).
Proteome complexity
Not predictable at the genome level!
Avalanche of sequence data…
… ~ 1630 genomes sequenced (single organism, varying sizes)
… ~ 952 ongoing genome sequencing projects
… ~ 200 metagenome sequencing projects (environmental samples: multiple 'unknown' organisms, varying sizes)
Ecological metagenomes: beach sand, Sargasso Sea…
Organismal metagenomes: mouse gut
~ 17 million sequences being processed at Venter Institute
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
How many 'protein' sequences at the end?
For fun, an estimate: ~30 million species (1.5 million named)
20 million bacteria/archaea x 4'000 genes (182-8500)
5 million protists x 6'000 genes
3 million insects x 14'000 genes
1 million fungi x 6'000 genes
0.6 million plants x 20'000 genes
0.2 million molluscs, worms, arachnids, etc. x 20'000 genes
0.2 million vertebrates x 25'000 genes
The calculation:
$$2\times10^{7}\times4000 + 5\times10^{6}\times6000 + 3\times10^{6}\times14000 + 10^{6}\times6000 + 6\times10^{5}\times20000 + 2\times10^{5}\times20000 + 2\times10^{5}\times25000$$
= 179'000'000'000
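The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic on the slide. The dictionary name and layout are illustrative, and the gene counts are the rough per-species averages quoted above, not measured values.

```python
# Species counts and assumed average gene counts per species, as quoted on the slide.
ESTIMATES = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (5_000_000, 6_000),
    "insects": (3_000_000, 14_000),
    "fungi": (1_000_000, 6_000),
    "plants": (600_000, 20_000),
    "molluscs, worms, arachnids, etc.": (200_000, 20_000),
    "vertebrates": (200_000, 25_000),
}

# Total number of species and total number of genes (one protein per gene assumed).
total_species = sum(n_species for n_species, _ in ESTIMATES.values())
total_genes = sum(n_species * genes for n_species, genes in ESTIMATES.values())

for group, (n_species, genes) in ESTIMATES.items():
    print(f"{group:35s} {n_species * genes:>18,d}")
print(f"{'total':35s} {total_genes:>18,d}")
```

The total ignores alternative splicing and post-translational modification, so it counts genes rather than distinct protein forms.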
AMB, SP20
Protein sequence origin
About 4.5 million 'known' protein sequences (in 2007)
More than 99 % of the protein sequences are derived from the translation of nucleotide sequences
Less than 1 %: direct protein sequencing (Edman, MS/MS…)
-> It is important that users know where the protein sequence comes from…
(sequencing & gene prediction quality) !
Nucleic acid sequence databases
cDNAs, ESTs (expressed sequence tags), genes, genomes, …
EMBL, GenBank, DDBJ
Data not submitted to public databases*, delayed or cancelled…
http://www.insdc.org/
The hectic life of a sequence …
EMBL: http://www.ebi.ac.uk/embl/
GenBank: http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html
DDBJ: http://www.ddbj.nig.ac.jp/
Contribution: EMBL 10 %; GenBank 75 %; DDBJ 15 %
Goal
- to accept, process and make freely available sequence data from individual researchers, research groups and patent offices
- available via SRS/Entrez, ftp, web services and similarity search tools.
The tremendous increase in nucleotide sequences
1980: 80 genes fully sequenced !
http://www3.ebi.ac.uk/Services/DBStats/
• Serve as archives: 'nothing goes out'
• Contain all public sequences derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers (cDNAs, ESTs…)
– Individual scientists (15 % of entries)
– Patent offices (e.g. European Patent Office, EPO)
• Currently: ~152 x 10^6 sequences, ~242 x 10^9 bp;
• Sequences from > 260'000 different species;
EMBL/GenBank/DDBJ
human
mouse
rat
http://www3.ebi.ac.uk/Services/DBStats/
More than 260'000 species, but…
Human/Mouse/Rat: organisms with the highest redundancy !
Where was the sequenced specimen collected?
Geographical origin of sequenced samples (since 2005) (lat_lon: latitude_longitude qualifier)
http://www3.ebi.ac.uk/Services/EMBLWorld/EMBLWorld.pl
A very important annotation for proteomics: the CoDing Sequence (CDS)
(in particular for eukaryotes)
EMBL/GenBank/DDBJ
cDNAs, ESTs, genes, genomes, …
EMBL, GenBank, DDBJ
Data not submitted to public databases*, delayed or cancelled…
with or without annotated CDS
provided by authors
CDS: CoDing Sequence
Portion of DNA/RNA translated into protein (from Met to STOP)
Experimentally proved or derived from gene prediction
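To make the "from Met to STOP" definition concrete, here is a minimal Python sketch that translates a nucleotide sequence from its first ATG to the first in-frame stop codon. It is a simplification under stated assumptions: it uses the standard genetic code, reads only the forward strand, takes the first ATG it finds, and ignores introns. The function name `translate_cds` and the example sequence are made up for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna: str) -> str:
    """Translate from the first ATG (Met) up to, but excluding, the first stop codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy sequence: two leading bases, then ATG GCT TGG TAA, then trailing bases.
print(translate_cds("CCATGGCTTGGTAAGG"))
```

Real CDS annotations carry explicit coordinates (and exon joins for eukaryotes), so a database entry should be translated from its annotated CDS feature rather than by scanning for ATG as done here.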
Problem 1: complete genome (submitted)
only ~ 2,015 CDS available!
http://www3.ebi.ac.uk/Services/DBStats/
At the protein level (example with UniProtKB/TrEMBL): the CDS of viruses and bacteria are easy to obtain!
human
mouse
rat
At the nucleic acid level
At the protein level
http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
Problem 2: Variable level of sequence quality
- Sequencing quality
- Gene prediction quality
Authors can specify the nature of the CDS by using the qualifier: "/evidence=experimental" or "/evidence=not_experimental".
Very rarely done…
Protein existence (PE): %
1: Evidence at protein level 15.3%
2: Evidence at transcript level 15.8%
3: Inferred from homology 65.2%
4: Predicted 3.4%
5: Uncertain 0.3%
http://www.expasy.org/sprot/relnotes/relstat.html
UniProtKB/Swiss-Prot protein knowledgebase
release 56.6 statistics (16-Dec-08)
Problem 3: highly redundant
A sort of sequence museum, where sequences are preserved for eternity as they were originally determined, interpreted and published by their authors (primary sequence repository)
-> Similarity searches are not obvious…
Problem 4: author authority
--> variable quality of the annotation (CDS and other), e.g. gene/protein name attribution…
EMBL/GenBank/DDBJ
The authors have full authority over the content of the entries they submit!
(editorial control of the content belongs to the authors)
(exception: TPA (Third Party Annotation), since January 2003)
'Problem' 5: environmental samples…
Environmental sequences (ENV)
Aim: to sequence all DNA present in a given sample, without knowing from which species the DNA is derived
- Sargasso Sea (Craig Venter)
- human fluids
- soil
No idea of the species… (microbial population…)
No idea of the gene prediction program to be used…
No idea of the genetic code to be used for translation!
Not always associated with CDS. If so, the protein sequences are present in protein sequence databases
Protein sequences databases (sources)
cDNAs, ESTs, genomes, …
EMBL, GenBank, DDBJ
Data not submitted to public databases, delayed or cancelled…
…if the submitters provide an annotated Coding Sequence (CDS)
(1/10 EMBL entries)
Protein sequence databases
Nucleic acid databases
Gene prediction
no CDS
Major protein sequence database 'sources'
UniProtKB: Swiss-Prot + TrEMBL
NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq
UniProtKB/Swiss-Prot: manually annotated protein sequences (11'612 species)
UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation; non-redundant with Swiss-Prot (184'698 species)
GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (130'000 species)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank: 3D data and associated sequences
PRF: journal scan of 'published' peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction (4'000 species)
Integrated resources
'cross-references'
Separated resources
UniProt, the Universal protein resource
is maintained by
the UniProt consortium SIB + EBI + PIR
SIB = Swiss Institute Bioinformatics
EBI = European Bioinformatics Institute
PIR = Protein Information Resource
www.uniprot.org
UniProtKB/TrEMBL: 6'964'485 entries (184'698 species)
UniProtKB/Swiss-Prot: 405'506 entries (11'612 species)
The UniProt KnowledgeBase
(UniProtKB)
an encyclopedia on proteins
biweekly released
EMBL -> TrEMBL
Automated extraction of protein sequence (translated CDS), gene name and references + automated annotation
!!! The quality of UniProtKB/TrEMBL data, including the protein sequence, is directly dependent on the information provided by the submitter of the original nucleotide entry.
Automated annotation
• using rules derived from Swiss-Prot manually annotated entries but with no manual oversight – RuleBase
• using automatically generated rules – Spearmint
UniProtKB: from TrEMBL to Swiss-Prot
EMBL -> TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references + automated annotation
TrEMBL -> Swiss-Prot: sequence check and manual annotation of the sequence and associated biological information
UniProtKB/Swiss-Prot
1 entry <-> 1 gene (1 species)
i) Merge of all known protein sequences (CDS) derived from the same gene
-> avoid redundancy and improve sequence reliability
(for human: ~ 6 different sequence reports per entry)
ii) Annotation of the sequence differences
(including conflicts, polymorphisms, splice variants etc..)
-> annotation of protein diversity
Righting the wrongs
“Sequences are rarely deposited in a “mature” state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.”
“Sequencing error rates: ~1 base in 10'000”
evidence exists that prove the existence of a protein;
Different qualifiers:
1. Evidence at protein level (~15.3%)
2. Evidence at transcript level (~15.8%)
3. Inferred from homology (~65.2%)
4. Predicted (~3.4%)
5. Unassigned (mainly in TrEMBL) (0.3%)
• Focal point of our efforts to maintain and develop UniProtKB/Swiss-Prot;
• Enables individual researchers to obtain a summary of what is known about a protein
Annotation
In a UniProtKB/Swiss-Prot entry, you can expect to find:
• An (often corrected) protein sequence and the description of various isoforms/variants;
• Its biological origin with links to the taxonomic databases;
• All the names of a given protein (and of its gene);
• A summary of what is known about the protein: function, alternative products, PTM, tissue expression, disease, 3D data, etc.;
• A description of important sequence features: domains, PTMs, variations, etc.;
• A selection of references;
• Selected keywords;
• Numerous cross-references (central hub).
An easy way to access the history of a protein sequence entry…
UniSave homepage: http://www.ebi.ac.uk/uniprot/unisave/
UniRef: useful for comprehensive BLAST searches by providing sets of representative sequences («Collapsing BLAST results»)
= Three collections of sequence clusters from the UniProt knowledgebase and EnsEMBL, IPI, EMBL_WGS:
One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments with 11 or more residues are placed into a single record) -> reduction of 12 %
One UniRef90 entry -> sequences that have at least 90 % identity -> reduction of 40 %
One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %
Independently of the species !
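The UniRef100 rule quoted above (identical sequences and sub-fragments of 11 or more residues share one record) can be illustrated with a toy Python sketch. This is not the actual UniRef procedure, which is not described in these slides. It is a naive greedy grouping written only to show the idea. The function name `collapse_identical` and the example sequences are hypothetical.

```python
def collapse_identical(seqs: dict[str, str], min_fragment: int = 11) -> dict[str, list[str]]:
    """Group identical sequences, and fragments of at least `min_fragment`
    residues, under the longest sequence that contains them."""
    clusters: dict[str, list[str]] = {}  # representative sequence -> member ids
    # Visit the longest sequences first so they become the representatives.
    for seq_id, seq in sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        for rep in clusters:
            if seq == rep or (len(seq) >= min_fragment and seq in rep):
                clusters[rep].append(seq_id)
                break
        else:
            clusters[seq] = [seq_id]  # no match: start a new record
    return clusters


# Made-up example: A and B are identical, C is a 12-residue fragment of A,
# D is a 5-residue fragment (too short to be merged), E is unrelated.
example = {
    "A": "MKTAYIAKQRQISFVKSHFSRQ",
    "B": "MKTAYIAKQRQISFVKSHFSRQ",
    "C": "AKQRQISFVKSH",
    "D": "MKTAY",
    "E": "GGGGGGGGGGGG",
}
records = collapse_identical(example)
for representative, members in records.items():
    print(representative, members)
```

UniRef90 and UniRef50 additionally require a sequence identity measure and a clustering algorithm, which this sketch deliberately leaves out.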
UniProt Archive (UniParc) is part of the UniProt project.
It is a non-redundant archive of protein sequences extracted from public databases: UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, PIR-PSD, EMBL, EMBL WGS, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, H-Invitational Database, TROME database, European Patent Office proteins, United States Patent and Trademark Office proteins (USPTO) and Japan Patent Office proteins.
UniParc contains only protein sequences. All other information about the protein must be retrieved from the source databases using the database cross-references.
Each unique sequence is stored only once with a stable identifier. The format of the identifier is UPI followed by ten hexadecimal digits, e.g. UPI000000000A.
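The identifier format just described (UPI followed by ten hexadecimal digits) is easy to check with a regular expression. The snippet below is a minimal sketch. The helper name `is_uniparc_id` is hypothetical, and it assumes the hexadecimal digits are written in upper case, as in the example above.

```python
import re

# "UPI" followed by exactly ten hexadecimal digits (upper case assumed).
UNIPARC_ID = re.compile(r"UPI[0-9A-F]{10}")


def is_uniparc_id(identifier: str) -> bool:
    """Return True if the string has the shape of a UniParc identifier."""
    return UNIPARC_ID.fullmatch(identifier) is not None


print(is_uniparc_id("UPI000000000A"))  # example identifier from the slide
print(is_uniparc_id("UPI00000000G1"))  # 'G' is not a hexadecimal digit
```

A matching shape only says that the string could be a UniParc identifier; whether the record exists has to be checked against UniParc itself.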
UniParc
Use with extreme caution: also contains pseudogenes, incorrect CDS predictions, etc.!
Also contains patent office database data (EPO, ESPO…).
The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data.
UniMES is available in FASTA format on the UniProt ftp servers, in the new subdirectory current_release/unimes:
• ftp.uniprot.org/pub/databases/uniprot
• ftp.ebi.ac.uk/pub/databases/uniprot
• ftp.expasy.org/databases/uniprot
http://www.uniprot.org/downloads
NCBInr (Entrez protein)
Protein sequences: « NR database »
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF
GenPept: derived from GenBank/EMBL/DDBJ sequences which have a CDS annotated on them; equivalent to TrEMBL, except that it is redundant with Swiss-Prot
PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)
PDB: 3D structure database: all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)
PRF: sequences derived from scientific publications, « Journal scan » (integrated into TrEMBL)
RefSeq: http://www.ncbi.nlm.nih.gov/RefSeq/
The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins.
Accession numbers:
- for RNA (NM_)
- for genomic (NT_)
- for protein (NP_)
- for predicted protein (XP_)
3,648,590 entries (22-May-2007); 4,300 species.
5,590,364 entries (11-July-2008); 5,395 species.
6,042,750 entries (20-November-2008); 5,726 species.
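The RefSeq accession prefixes listed above can be used to tell at a glance what kind of record an accession refers to. The following Python sketch maps only the four prefixes named on the slide. The function name `refseq_kind` and the sample accessions are illustrative, and other RefSeq prefixes exist that are not covered here.

```python
# Prefix -> record type, restricted to the four prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NM_": "RNA",
    "NT_": "genomic",
    "NP_": "protein",
    "XP_": "predicted protein",
}


def refseq_kind(accession: str) -> str:
    """Classify a RefSeq accession by its three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


# Sample accessions, used here only to show the expected shape.
for acc in ("NM_000546.5", "NP_000537.3", "XP_001234567.1", "P04637"):
    print(acc, "->", refseq_kind(acc))
```

An accession that returns "unknown" is not necessarily invalid; it may simply come from another database (for example a UniProtKB accession) or use a RefSeq prefix outside this short list.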
PIR: the Protein Information Resource
PIR-PSD is no longer updated, but exists as an archive
PDB (Protein Data Bank), 3D structure
Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies
Also contains the corresponding protein sequences*
* The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools
Includes protein sequences which are mutated (effect of a mutation on the 3D structure)
PDB: Protein Data Bank
www.rcsb.org/pdb/
Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
Associated with specialized programs that allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).
Currently there are structural data for about different proteins, but far fewer protein families (highly redundant)!
Coordinates of each atom
Sequence
http://www.genome.jp/dbget-bin/www_bfind?prf
Looks for peptide sequences described in publications (and which have not been submitted to databases!)
Query at Entrez protein (NCBInr)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Typical result of a query at
« Entrez protein »
RefSeq
Swiss-Prot
Genpept(gb/embl/ddbj)
PIR
PDB
GI number: 'GenInfo identifier' number
- In addition to an AC number specific to the original database, each protein sequence in the NCBInr database has a GI number.
- If the sequence changes in any way, a new GI number will be assigned -> not a stable identifier
- A separate GI number is also assigned to each protein translation within a nucleotide sequence record (alternative products)
- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record: http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
Protein sequences databases (other)
EnsEMBL
http://www.ensembl.org/
not only for proteins….
EnsEMBL
Automated genome annotation and subsequent visualisation of annotated genomes.
Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant and fungal genomes.
http://www.ensembl.org/info/about/index.html
http://www.ensembl.org/info/data/docs/genome_annotation.html
- EnsEMBL aligns the genomic sequences with all the sequences found in EMBL, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)
- It also performs gene prediction (-> novel genes)
- DNA, RNA and protein sequences available for ~30 species
- Browsing tool
http://www.ensembl.org/index.html
Browsing tool available for 49 species…
CCDS: Consensus CDS protein set
http://www.ncbi.nlm.nih.gov/CCDS/
CCDS (human)
Combining different approaches – ab initio, by similarity - and taking
advantage of the expertise acquired by different institutes, including
manual annotation…
Consensus between 4 institutions…
IPI: International Protein Index
http://www.ebi.ac.uk/IPI/IPIhelp.html
IPI (International Protein Index)
Provides a guide to the main databases that describe the human, mouse, rat, zebrafish, Arabidopsis, chicken, and cow proteomes: Swiss-Prot, TrEMBL, RefSeq and Ensembl (and H-InvDB, TAIR and VEGA).
IPI is built to provide maximum coverage of the major publicly available protein (and gene) databases for the same protein.
For each protein in IPI, an entry from one of the constituent databases is selected as the master entry, and supplies the IPI entry with its sequence and annotation.
Stable identifiers (with incremental versioning) are maintained to allow the tracking of sequences between IPI releases.
IMGT (international ImMunoGeneTics information)
Is a collection of high-quality integrated databases specialising
in immunoglobulins, T cell receptors and the Major
Histocompatibility Complex (MHC) of all vertebrate species.
http://www.ebi.ac.uk/imgt/
Protein sequence databases for proteomics
Phenyx: UniProtKB
PROWL: NCBInr, Swiss-Prot, dbEST
Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL*.
Peptident (Aldente): UniProtKB.
Mascot: NCBInr, Swiss-Prot, dbEST, OWL*, MSDB
* OWL is obsolete since 1999
dbEST: translation of EST sequences in the 6 reading frames (ESTs are not associated with annotated CDSs!)
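Because an EST carries no annotated CDS, a search engine has to translate it in all six reading frames (three on each strand). A minimal sketch of that step, assuming the standard genetic code and plain A/C/G/T input; the function names are illustrative and not taken from any of the tools listed above:

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse-complement strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```

Stop codons are reported as `*`, so a downstream step can split each frame into candidate peptides at those positions.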
OWL: Non-redundant protein database, including: Swiss-Prot, PIR, NRL3-D* and GenPept.
*The PIR-NRL3D database makes the sequence information in PDB available for similarity searches
-> Accession / version number jungle!
Depending on the database, an AC number can be associated with an entry (gene product: stable even if the sequence
changes) or with a sequence (it changes as soon as the sequence changes)
In summary
For the same protein sequence, you can find:
- A UniProtKB/Swiss-Prot entry
- A RefSeq entry (or GenPept)
- An EnsEMBL entry
- A CCDS entry
- A UniParc entry (archive)
- An IPI…
Type of record            Sample accession format
GenBank/EMBL/DDBJ         One letter followed by five digits: e.g. U12345
                          Two letters followed by six digits: e.g. AF123456
Swiss-Prot/TrEMBL         One letter and five digits/letters: e.g. P12345, A0B533
RefSeq nucleotide         Two letters, an underscore and six digits:
                          e.g. mRNA NM_000492
                          e.g. genomic NT_000907
RefSeq protein            e.g. NP_000483
RefSeq prediction         e.g. XM_000483
                          e.g. XP_000467
PDB (protein structure)   One digit followed by three letters: e.g. 1TUP
The AC number jungle
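The accession formats listed for each type of record can be turned into simple patterns. The sketch below is only a rough guess at the source of an identifier: the regular expressions follow the simplified descriptions on this slide, not the official format specifications, and the PDB pattern additionally allows digits after the first character (an assumption).

```python
import re

# Patterns derived from the slide's descriptions (simplified, not official).
ACCESSION_PATTERNS = {
    "GenBank/EMBL/DDBJ": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "Swiss-Prot/TrEMBL": re.compile(r"[A-Z][A-Z0-9]{5}"),
    "RefSeq": re.compile(r"[A-Z]{2}_\d{6}"),
    "PDB": re.compile(r"\d[A-Z0-9]{3}"),
}


def candidate_sources(accession):
    """Return every record type whose format matches the accession."""
    acc = accession.strip().upper()
    return [
        name
        for name, pattern in ACCESSION_PATTERNS.items()
        if pattern.fullmatch(acc)
    ]


for acc in ("U12345", "AF123456", "P12345", "A0B533", "NM_000492", "1TUP"):
    print(acc, candidate_sources(acc))
```

The function returns a list because the formats overlap: an identifier made of one letter and five digits, such as P12345 or U12345, fits both the GenBank/EMBL/DDBJ and the Swiss-Prot/TrEMBL descriptions, so the format alone cannot always identify the source database.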
uniprot.org
(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).
Proteome complexity
Not predictable at the genome level!
Chemical aspects
• Post-translational modifications (PTMs) consist of the breaking and/or making of covalent bonds, catalyzed by enzymes
• PTMs modify both the protein mass and the isoelectric point (pI)
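The effect of PTMs on mass can be illustrated with a few monoisotopic mass shifts. This is a minimal sketch: the four delta masses are the commonly tabulated values (the same kind of figures that modification databases report as "delta mass"), while the function name and the 1000.0 Da starting mass are purely illustrative.

```python
# Monoisotopic mass shifts in Da for a few common modifications.
PTM_DELTA_MASS = {
    "phosphorylation": 79.96633,  # adds HPO3
    "acetylation": 42.01057,      # adds C2H2O
    "methylation": 14.01565,      # adds CH2
    "oxidation": 15.99491,        # adds O
}


def modified_mass(unmodified_mass, ptms):
    """Add the mass shift of each listed PTM to an unmodified mass."""
    return unmodified_mass + sum(PTM_DELTA_MASS[ptm] for ptm in ptms)


# Hypothetical 1000.0 Da peptide carrying one phosphate and one acetyl group.
shifted = modified_mass(1000.0, ["phosphorylation", "acetylation"])
```

A search engine that ignores these shifts will fail to match the modified peptide, which is one reason PTM annotation matters for mass spectrometry identification.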
The PTM variety
Figure: modifications mapped onto the 20 amino acids (Gly Ala Val Leu Ile Lys Arg His Asp Glu Asn Gln Cys Ser Thr Met Pro Phe Tyr Trp), grouped as N-terminal, side-chain and C-terminal modifications.
Modifications shown: acetylation, methylation, acylation, phosphorylation, oxidation, crosslinks, hydroxylation, cofactor binding, sulfation, C-linked sugar, N-linked sugar, O-linked sugar, S-linked sugar, GPI, amidation.
Legend: black = cytoplasmic modifications; dark grey = both cytoplasmic and extracellular modifications, depending on the exact type; light grey = extracellular modifications.
PTM distribution among kingdoms
Figure: distribution of PTMs across bacteria, eukaryotes and archaea.
Modifications shown: acetylation, amidation, FMN binding, FAD binding, GPI-anchor, lanthionine crosslink, bacterial lipid anchor, myristoylation, archaean lipid anchor, palmitoylation, methylation, phosphorylation, sulfation, diphthamide, pyrrolysine, archaea-specific methylation, eukaryote-specific methylation, bacteria-specific methylation.
PTM annotation in UniProtKB entries
PTMs are annotated in the feature table (‘sequence annotation’) when they can be assigned a position on the protein sequence, and in the comments when they cannot.
PTM-dedicated FT keys
FT key                        Usage
CARBOHYD (Glycosylation)      sugars
DISULFID (Disulfide bond)     disulfide bonds
CROSSLNK (Cross-link)         other crosslinks
LIPID                         lipids
MOD_RES (Modified residue)    other modifications
PTMs are grouped by type, are specifically and uniquely annotated by the use of a controlled vocabulary and a
set of specific FT keys
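Since PTMs use a small set of dedicated FT keys, they can be pulled out of an entry by filtering on those keys. The sketch below assumes the older one-line-per-feature flat-file layout (`FT  KEY  start  end  description`); the sample lines are invented for illustration and do not come from a real entry.

```python
# The five PTM-dedicated feature keys and what they are used for.
PTM_FT_KEYS = {
    "CARBOHYD": "sugars",
    "DISULFID": "disulfide bonds",
    "CROSSLNK": "other crosslinks",
    "LIPID": "lipids",
    "MOD_RES": "other modifications",
}


def ptm_features(flat_file_lines):
    """Collect (key, start, end, description) for PTM feature lines."""
    features = []
    for line in flat_file_lines:
        parts = line.split(None, 4)
        if len(parts) < 4 or parts[0] != "FT" or parts[1] not in PTM_FT_KEYS:
            continue
        description = parts[4].strip() if len(parts) > 4 else ""
        # Positions stay as strings because they may be uncertain ("?", "<1").
        features.append((parts[1], parts[2], parts[3], description))
    return features


# Invented example lines, for illustration only.
example_entry = [
    "ID   EXAMPLE_MOUSE",
    "FT   SIGNAL        1     22",
    "FT   CARBOHYD     45     45       N-linked (GlcNAc...) (Potential).",
    "FT   DISULFID     30    115",
    "FT   MOD_RES      77     77       Phosphoserine.",
]
found = ptm_features(example_entry)
```

The SIGNAL line is skipped because it is not one of the PTM-dedicated keys, even though signal peptide cleavage is counted among the processing events in the statistics.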
PTM annotation in UniProtKB entries
PTMs are annotated in the feature table when they can be assigned a position on
the protein sequence, and in the comments when they cannot.
Associated keywords
Find all mouse proteins which are phosphorylated
UniProtKB/Swiss-Prot
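Such a keyword query can also be written programmatically. The sketch below only builds the URL and does not contact any server; it assumes the present-day UniProt REST search endpoint and its `keyword`, `organism_id` and `reviewed` query fields, none of which appear in the original slides. 10090 is the NCBI taxonomy identifier for mouse, and "Phosphoprotein" is the associated keyword.

```python
from urllib.parse import urlencode

# Assumption: current UniProt REST endpoint, not part of the original slides.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_query_url(keyword, taxon_id, reviewed=True, fmt="tsv"):
    """Build a search URL for entries with a keyword in one organism."""
    clauses = [f"(keyword:{keyword})", f"(organism_id:{taxon_id})"]
    if reviewed:
        # Restrict to the manually reviewed section (Swiss-Prot).
        clauses.append("(reviewed:true)")
    params = {
        "query": " AND ".join(clauses),
        "format": fmt,
        "fields": "accession,id,protein_name",
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


url = build_query_url("Phosphoprotein", 10090)
print(url)
```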
Number of PTMs in Swiss-Prot release 51 (241242 entries), all organisms
                  Potential  By similarity  Experimental & probable  Total
signal peptide        15235           2850                     4996  23081
N-GlcNAc              66264            826                     3161  70251
O-GalNAc                306            354                      628   1288
O-GlcNAc                  1            167                       78    246
phosphorylation        1110          14288                     7760  23158
sulfation               252            294                      171    717
myristate               129            535                      131    795
GPI-anchor              478            115                       68    661
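A short calculation on these release 51 counts shows how much of each annotation type rests on experimental evidence. The numbers are copied from the table; the function name is illustrative.

```python
# (potential, by similarity, experimental & probable) from the table above.
PTM_COUNTS = {
    "signal peptide": (15235, 2850, 4996),
    "N-GlcNAc": (66264, 826, 3161),
    "O-GalNAc": (306, 354, 628),
    "O-GlcNAc": (1, 167, 78),
    "phosphorylation": (1110, 14288, 7760),
    "sulfation": (252, 294, 171),
    "myristate": (129, 535, 131),
    "GPI-anchor": (478, 115, 68),
}


def experimental_share(ptm):
    """Fraction of annotations in the 'experimental & probable' column."""
    potential, by_similarity, experimental = PTM_COUNTS[ptm]
    return experimental / (potential + by_similarity + experimental)


for ptm in PTM_COUNTS:
    print(f"{ptm:16s} {experimental_share(ptm):6.1%}")
```

For N-GlcNAc, for example, only about 4.5% of the annotated sites fall in the experimental column, against roughly a third for phosphorylation: most glycosylation sites in this release are predictions.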
RESID
RESID is a database of 473 natural modifications (Rel. 56.00) with chemical and structural annotations such as recommended name and synonyms, delta mass, 3D structure, UniProt annotations, etc.
FTP sites:
ftp://ftp.ebi.ac.uk/pub/databases/RESID/
ftp://ftp.ncifcrf.gov/pub/users/residues
Web sites:
http://www.ebi.ac.uk/RESID
http://www.ncifcrf.gov/RESID/
http://home.earthlink.net/~jsgaravelli/RESIDInfo.HTML
http://hpc.cs.tsinghua.edu.cn/bioinfo/database/index.html
Other PTM databases
UNIMOD: http://www.unimod.org/
PSI-MOD (ontology): http://psidev.sourceforge.net/mod/
Delta Mass: http://www.abrf.org/index.cfm/dm.home
http://www.geneontology.org
GO scope
Three disjoint axes:
• cellular component: sub-cellular location, e.g. nucleus, ribosome, origin recognition complex
• molecular function: molecular role, e.g. catalytic activity, binding
• biological process: broad biological phenomena, e.g. mitosis, growth, digestion
GO structure
• Terms are related within a hierarchy
• Terms are linked by two relationships
– is-a
– part-of
Figure: example graph with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relationships.
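Because a term can have several parents, the GO hierarchy is a graph and not a simple tree. A minimal sketch of walking it upwards, using the example terms from the slide; the specific edges are an assumption reconstructed from those term names, not copied from the ontology files.

```python
# child -> list of (relationship, parent); edges are assumed for illustration.
GO_EDGES = {
    "membrane": [("part-of", "cell")],
    "chloroplast": [("part-of", "cell")],
    "mitochondrial membrane": [("is-a", "membrane")],
    "chloroplast membrane": [("is-a", "membrane"), ("part-of", "chloroplast")],
}


def ancestors(term):
    """Return every term reachable through is-a or part-of links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in GO_EDGES.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("chloroplast membrane")))
```

A protein annotated with "chloroplast membrane" is therefore implicitly annotated with "membrane", "chloroplast" and "cell" as well, which is what makes queries at a general term return the specific annotations too.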
GOA: Gene Ontology Annotation
GOA aims to provide high-quality electronic and manual annotations to the
UniProt Knowledgebase and International Protein Index, using GO terms.
The GOA project is run by EBI and has been a member of the GO consortium since 2001
http://www.ebi.ac.uk/GOA/
What is GOA ?
In 2001, the first phase of the GOA project involved the large-scale assignment
of GO terms to Swiss-Prot and TrEMBL entries using electronic methods,
namely the mappings spkw2go, ec2go and Interpro2go.
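The principle behind such mappings can be sketched as a table lookup: every keyword an entry carries is translated into the corresponding GO terms, and the result is tagged as an electronic annotation (evidence code IEA). The three keyword-to-GO pairs and the accessions below are illustrative assumptions; they are not taken from the actual spkw2go file.

```python
# Illustrative keyword -> GO mapping (assumed, not copied from spkw2go).
KEYWORD_TO_GO = {
    "Nucleus": ["GO:0005634"],      # nucleus
    "ATP-binding": ["GO:0005524"],  # ATP binding
    "Kinase": ["GO:0016301"],       # kinase activity
}


def electronic_annotations(entry_keywords, mapping=KEYWORD_TO_GO):
    """Turn {accession: [keywords]} into (accession, GO id, evidence) tuples."""
    annotations = []
    for accession, keywords in entry_keywords.items():
        for keyword in keywords:
            for go_id in mapping.get(keyword, []):
                annotations.append((accession, go_id, "IEA"))
    return annotations


# Hypothetical entries; keywords without a mapping are simply ignored.
entries = {
    "X00001": ["Kinase", "ATP-binding", "Phosphoprotein"],
    "X00002": ["Nucleus"],
}
annotated = electronic_annotations(entries)
```

The same lookup approach applies to EC numbers (ec2go) and InterPro signatures (Interpro2go), with only the mapping table changing.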
e-proxemis: http://e-proxemis.expasy.org