
Salvador Martínez de Bartolomé, support – ProteoRed, Proteomics Facility, National Center for Biotechnology, Madrid

1. Proteomics database contents

Protein sequence databases

http://www.eupa.org/

http://www.proteored.org/

http://www.cbm.uam.es/seprot

Menu

Introduction : bioinformatics and sequence databases

Nucleic acid sequence databases

Protein sequence databases (sources)

Protein sequence databases (other)




Biology of the XXI century

Three major developments:

High throughput technique analysis: DNA sequencing, mass spectrometry, micro-

Numerous biological databases available through the Web

Bioinformatics tools available through the Web




[Figure: a cloud of unlabelled "database" and "Tool" boxes]

An overwhelming number of unordered resources

[Figure: the same databases and tools, grouped by category and then labelled with the names of representative resources]

Categories shown: Protein Sequence; 3° Structure; Protein 2D PAGE & MS; PTM; Protein identification & characterization; PTM Prediction tool; 1° Structure Analysis; 3° Structure Prediction; Nucleotide Amino Acid Translator; Sequence Alignment; Similarity Search; Gene Expression; Protein Interactions; Species / Genomic; Functional; 2° Structure Prediction; Subcellular localization; Polymorphism / Mutation / Disease; Topology Prediction; Pattern & Profile search; Domains & classification; 2° Structure; Phylogenetics & Taxonomy; References / nomenclature; Nucleotide sequence repository.

Resources named: UniProtKB (Swiss-Prot/TrEMBL), TargetP, EcoGene, Ensembl, FlyBase, MGD, SGD, SubtiList, TIGR CMR, HIV, TAIR, MEROPS, ENZYME, TRANSFAC, KEGG, HAMAP, PROSITE, InterPro, Pfam, ProDom, BLOCKS, TIGRFAM, ProtoMap, CATH, SCOP, PDB, SWISS-MODEL, ScanProsite, MotifScan, HSSP, Jpred, GOR, DIP, IntAct, ProtScale, ProtParam, BLAST, FASTA, dbSNP, GeneCards, OMIM, CleanEx, DDBJ, GenBank, EMBL, TreeBase, NEWT, Taxonomy, PSORT, Glycosuite, PhosphBase, NetOGlyc, ChloroP, PeptideMass, Mascot, Phenyx, ECO2DBASE, Siena-2D PAGE, SWISS-2D PAGE, TMHMM, SOSUI, PubMed, HUGO, GO, ClustalW, DIALIGN, Translate.

Molecular bioinformatics: an operational definition

The applications of computer sciences to molecular biology…

…in particular for the study of macromolecules such as proteins, nucleic acids and oligosaccharides


- Identification of proteins by proteomics -> completeness, sequence quality

- Similarity searches (functional prediction) -> sequence quality (non-redundancy)

- Training datasets (prediction tools) -> sequence and annotation quality

- Genome annotation…




(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).

Proteome complexity

Not predictable at the genome level!




Avalanche of sequence data…







… ~ 1630 genomes sequenced (single organism, varying sizes)

… ~ 952 ongoing genome sequencing projects

… ~ 200 metagenome sequencing projects (environmental samples: multiple 'unknown' organisms, varying sizes)

Ecological metagenomes: beach sand, Sargasso Sea…

Organismal metagenomes: mouse gut

~ 17 million sequences being processed at the Venter Institute

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj

http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html

How many 'protein' sequences in the end?

For fun, an estimate: ~30 million species (1.5 million named)

20 million bacteria/archaea x 4'000 genes (182-8500)

5 million protists x 6'000 genes

3 million insects x 14'000 genes

1 million fungi x 6'000 genes

0.6 million plants x 20'000 genes

0.2 million molluscs, worms, arachnids, etc. x 20'000 genes

0.2 million vertebrates x 25'000 genes

The calculation:

$$2\times10^{7}\times4000 + 5\times10^{6}\times6000 + 3\times10^{6}\times14000 + 10^{6}\times6000 + 6\times10^{5}\times20000 + 2\times10^{5}\times20000 + 2\times10^{5}\times25000 = 1.79\times10^{11}$$

= 179'000'000'000

AMB, SP20
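The back-of-the-envelope estimate above is easy to reproduce. The following is a minimal sketch in Python; the dictionary name and function name are illustrative choices, and the figures are simply those quoted on the slide.

```python
# Species counts and average gene counts per group, as quoted on the slide.
GROUPS = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (5_000_000, 6_000),
    "insects": (3_000_000, 14_000),
    "fungi": (1_000_000, 6_000),
    "plants": (600_000, 20_000),
    "molluscs, worms, arachnids, etc.": (200_000, 20_000),
    "vertebrates": (200_000, 25_000),
}


def estimate_total_genes(groups):
    """Sum species x genes-per-species over all groups."""
    return sum(species * genes for species, genes in groups.values())


total_species = sum(species for species, _ in GROUPS.values())
total_genes = estimate_total_genes(GROUPS)

if __name__ == "__main__":
    print(f"{total_species:,} species")   # 30,000,000 species
    print(f"{total_genes:,} genes")       # 179,000,000,000 genes
```

The per-group species counts add up to the ~30 million species mentioned, and the product sum gives the 179'000'000'000 figure.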

Protein sequence origin

About 4.5 million 'known' protein sequences (in 2007)

More than 99 % of the protein sequences are derived from the translation of nucleotide sequences

Less than 1 %: direct protein sequencing (Edman, MS/MS…)

-> It is important that users know where a protein sequence comes from (sequencing and gene prediction quality)!











The hectic life of a sequence…

cDNAs, ESTs (expressed sequence tags), genes, genomes, …

EMBL, GenBank, DDBJ

Data not submitted to public databases*, delayed or cancelled…

http://www.insdc.org/

EMBL: http://www.ebi.ac.uk/embl/

GenBank: http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html

DDBJ: http://www.ddbj.nig.ac.jp/

Contribution: EMBL 10 %; GenBank 75 %; DDBJ 15 %




Goal

- to accept, process and make freely available sequence data from individual researchers, research groups and patent offices

- available via SRS/Entrez, ftp, web services and similarity search tools.




The tremendous increase in nucleotide sequences

1980: 80 genes fully sequenced !

http://www3.ebi.ac.uk/Services/DBStats/




EMBL/GenBank/DDBJ

• Serve as archives: 'nothing goes out'

• Contain all public sequences derived from:

– Genome projects (> 80 % of entries)

– Sequencing centers (cDNAs, ESTs…)

– Individual scientists (15 % of entries)

– Patent offices (e.g. European Patent Office, EPO)

• Currently: ~152 x 10^6 sequences, ~242 x 10^9 bp

• Sequences from > 260'000 different species




[Figure: species distribution, with human, mouse and rat labelled]

More than 260'000 species, but…

Human/Mouse/Rat: organisms with the highest redundancy!




Where was the sequenced specimen collected?

Geographical origin of sequenced samples (since 2005) (lat_lon: latitude_longitude qualifier)

http://www3.ebi.ac.uk/Services/EMBLWorld/EMBLWorld.pl

EMBL/GenBank/DDBJ

A very important annotation for proteomics: the CoDing Sequence (CDS) (in particular for eukaryotes)




cDNAs, ESTs, genes, genomes, …

EMBL, GenBank, DDBJ, with or without an annotated CDS provided by the authors

Data not submitted to public databases*, delayed or cancelled…

CDS: CoDing Sequence

The portion of DNA/RNA translated into protein (from Met to STOP)

Experimentally proven or derived from gene prediction
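To make the "from Met to STOP" definition concrete, here is a minimal sketch that translates an annotated CDS with the standard genetic code. The function name `translate_cds` and the toy sequence are illustrative assumptions; real entries may use other genetic codes or contain ambiguous bases, which this sketch does not handle.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a CDS from its initial Met codon to the first in-frame STOP."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    if not cds.startswith("ATG"):
        raise ValueError("CDS does not start with a Met (ATG) codon")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = STANDARD_CODE[cds[i:i + 3]]
        if aa == "*":  # STOP codon: end of the coding sequence
            return "".join(protein)
        protein.append(aa)
    raise ValueError("no in-frame STOP codon found")


if __name__ == "__main__":
    print(translate_cds("ATGGCCAAATAA"))  # MAK
```

A CDS that lacks the initial ATG or an in-frame STOP raises an error here, which mirrors why an incomplete or wrongly predicted CDS yields an unreliable protein sequence.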




5 Problems




Problem 1: Complete genome (submitted), only ~ 2,015 CDS available!

[Figure: species distribution at the nucleic acid level (human, mouse, rat) and at the protein level]

At the protein level (example with UniProtKB/TrEMBL): the CDS of viruses and bacteria are easy to obtain!

http://www.ebi.ac.uk/swissprot/sptr_stats/index.html




Problem 2: Variable level of sequence quality

- Sequencing quality

- Gene prediction quality

Authors can specify the nature of the CDS by using the qualifier "/evidence=experimental" or "/evidence=not_experimental".

Very rarely done…




UniProtKB/Swiss-Prot protein knowledgebase, release 56.6 statistics (16-Dec-08)

Protein existence (PE)              %
1: Evidence at protein level        15.3
2: Evidence at transcript level     15.8
3: Inferred from homology           65.2
4: Predicted                         3.4
5: Uncertain                         0.3

http://www.expasy.org/sprot/relnotes/relstat.html




Problem 3: Highly redundant

A sort of sequence museum, where sequences are preserved for eternity as they were originally determined, interpreted and published by their authors (primary sequence repository)

-> Similarity searches are not straightforward…




Problem 4: Author authority

-> variable quality of the annotation (CDS and other), e.g. gene/protein name attribution…

EMBL/GenBank/DDBJ

The authors have full authority over the content of the entries they submit!

(editorial control of the content belongs to the authors)

(exception: TPA (Third Party Annotation), since January 2003)




'Problem' 5: Environmental samples…




Environmental sequences (ENV)

Aim: to sequence all the DNA present in a given sample, without knowing which species the DNA comes from

- Sargasso Sea (Craig Venter)

- human fluids

- earth

No idea of the species… (microbial population…)

No idea of the gene prediction program to be used…

No idea of the genetic code to be used for translation!

Not always associated with a CDS. If they are, the protein sequences are present in protein sequence databases.
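The genetic code issue raised above can be illustrated with a short sketch: the same DNA fragment gives different proteins depending on the translation table. Here the standard code is compared with NCBI translation table 4, in which TGA codes for Trp instead of STOP. The helper names and the toy sequence are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# NCBI translation table 1 (standard code).
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# NCBI translation table 4 (e.g. Mycoplasma): identical, except that TGA = Trp.
TABLE_4 = dict(STANDARD_CODE, TGA="W")


def translate(dna, code):
    """Translate every complete codon of the frame; '*' marks a STOP codon."""
    dna = dna.upper()
    return "".join(code[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


if __name__ == "__main__":
    fragment = "ATGTGAGGATAA"  # toy sequence, not taken from a real entry
    print(translate(fragment, STANDARD_CODE))  # M*G*
    print(translate(fragment, TABLE_4))        # MWG*
```

With the standard code the reading frame stops after the first residue; with table 4 it continues. For an environmental sample of unknown origin there is no way to know in advance which table applies.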












cDNAs, ESTs, genomes, … -> EMBL, GenBank, DDBJ (nucleic acid databases)

Data not submitted to public databases, delayed or cancelled…

-> protein sequence databases, if the submitters provide an annotated Coding Sequence (CDS) (1/10 EMBL entries)

-> gene prediction, if there is no CDS

Major protein sequence database 'sources'

UniProtKB: Swiss-Prot + TrEMBL

NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq

UniProtKB/Swiss-Prot: manually annotated protein sequences (11'612 species)

UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation; non-redundant with Swiss-Prot (184'698 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (130'000 species)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank: 3D data and associated sequences

PRF: journal scan of 'published' peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction (4'000 species)

Integrated resources

'cross-references'

Separated resources

UniProt, the Universal Protein Resource, is maintained by the UniProt consortium: SIB + EBI + PIR

SIB = Swiss Institute of Bioinformatics

EBI = European Bioinformatics Institute

PIR = Protein Information Resource

www.uniprot.org

The UniProt KnowledgeBase (UniProtKB): an encyclopedia on proteins, released biweekly

UniProtKB/TrEMBL: 6'964'485 entries (184'698 species)

UniProtKB/Swiss-Prot: 405'506 entries (11'612 species)




EMBL -> TrEMBL

Automated extraction of the protein sequence (translated CDS), gene name and references, plus automated annotation.

Important: the quality of UniProtKB/TrEMBL data, including the protein sequence, depends directly on the information provided by the submitter of the original nucleotide entry.

Automated annotation:

• using rules derived from manually annotated Swiss-Prot entries, but with no manual oversight (RuleBase)

• using automatically generated rules (Spearmint)




UniProtKB: from TrEMBL to Swiss-Prot

EMBL -> TrEMBL: automated extraction of the protein sequence (translated CDS), gene name and references, plus automated annotation

TrEMBL -> Swiss-Prot: sequence check, plus manual annotation of the sequence and associated biological information




UniProtKB/Swiss-Prot

1 entry <-> 1 gene (1 species)

i) Merging of all known protein sequences (CDS) derived from the same gene

-> avoids redundancy and improves sequence reliability

(for human: ~ 6 different sequence reports per entry)

ii) Annotation of the sequence differences

(including conflicts, polymorphisms, splice variants, etc.)

-> annotation of protein diversity
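The "1 entry <-> 1 gene (1 species)" principle can be sketched in a few lines. This is only a toy illustration, not the Swiss-Prot procedure (which relies on manual curation): the function name `merge_reports`, the choice of the longest report as reference, and the position-by-position comparison of ungapped sequences are all simplifying assumptions, and the sequences are toy strings.

```python
from collections import defaultdict


def merge_reports(reports):
    """Group sequence reports by (gene, species) and list their differences.

    reports: iterable of (gene, species, sequence) tuples.
    """
    grouped = defaultdict(list)
    for gene, species, sequence in reports:
        grouped[(gene, species)].append(sequence)

    entries = {}
    for key, sequences in grouped.items():
        # Hypothetical choice: the longest report serves as the reference sequence.
        reference = max(sequences, key=len)
        differences = set()
        for sequence in sequences:
            # Naive ungapped comparison; real curation aligns the sequences first.
            for position, (ref_aa, aa) in enumerate(zip(reference, sequence), start=1):
                if ref_aa != aa:
                    differences.add((position, ref_aa, aa))
        entries[key] = {
            "sequence": reference,
            "reports": len(sequences),
            "differences": sorted(differences),
        }
    return entries


if __name__ == "__main__":
    toy_reports = [
        ("INS", "Homo sapiens", "MALWMRLLPL"),
        ("INS", "Homo sapiens", "MALWMRLLPL"),
        ("INS", "Homo sapiens", "MALWTRLLPL"),
        ("Ins2", "Mus musculus", "MALWMRFLPL"),
    ]
    for key, entry in merge_reports(toy_reports).items():
        print(key, entry)
```

Three reports for the same gene and species collapse into one entry, and the single differing position is kept as an annotation rather than as a separate record.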




Righting the wrongs

"Sequences are rarely deposited in a 'mature' state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections."

"Sequencing error rates: ~1 base in 10'000"




Protein existence (PE): the type of evidence that supports the existence of a protein

Different qualifiers:

1. Evidence at protein level (~15.3 %)

2. Evidence at transcript level (~15.8 %)

3. Inferred from homology (~65.2 %)

4. Predicted (~3.4 %)

5. Unassigned (mainly in TrEMBL) (0.3 %)




Annotation

• Focal point of our efforts to maintain and develop UniProtKB/Swiss-Prot;

• Enables individual researchers to obtain a summary of what is known about a protein.

In a UniProtKB/Swiss-Prot entry, you can expect to find:

• An (often corrected) protein sequence and the description of various isoforms/variants;

• Its biological origin, with links to the taxonomic databases;

• All the names of a given protein (and of its gene);

• A summary of what is known about the protein: function, alternative products, PTM, tissue expression, disease, 3D data, etc.;

• A description of important sequence features: domains, PTMs, variations, etc.;

• A selection of references;

• Selected keywords;

• Numerous cross-references (central hub).

An easy way to access the history of a protein sequence entry…

UniSave homepage: http://www.ebi.ac.uk/uniprot/unisave/










Other UniProt databases




UniRef

UniRef is useful for comprehensive BLAST searches because it provides sets of representative sequences ("collapsing BLAST results").

Three collections of sequence clusters from the UniProt Knowledgebase and EnsEMBL, IPI, EMBL_WGS:

One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments with 11 or more residues are placed into a single record) -> reduction of 12 %

One UniRef90 entry -> sequences that are at least 90 % identical -> reduction of 40 %

One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %

Independently of the species!
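The UniRef100 rule described above (identical sequences and sub-fragments of 11 or more residues go into one record) can be sketched as follows. This is a simplified illustration, not the UniRef implementation: the function name `collapse_identical`, the use of the longest sequence as representative, and the toy sequences are assumptions. The 90 % and 50 % levels require alignment-based identity and are not covered.

```python
def collapse_identical(sequences, min_fragment=11):
    """Cluster identical sequences and sub-fragments (>= min_fragment residues).

    sequences: dict mapping an identifier to a protein sequence.
    Returns a dict mapping each representative id to the list of member ids.
    """
    clusters = {}          # representative id -> member ids
    representatives = []   # (id, sequence) pairs, longest first
    # Process longest sequences first so that fragments find their parent.
    ordered = sorted(sequences.items(), key=lambda item: (-len(item[1]), item[0]))
    for seq_id, seq in ordered:
        for rep_id, rep_seq in representatives:
            is_identical = seq == rep_seq
            is_fragment = len(seq) >= min_fragment and seq in rep_seq
            if is_identical or is_fragment:
                clusters[rep_id].append(seq_id)
                break
        else:
            # No existing record matches: this sequence starts a new cluster.
            representatives.append((seq_id, seq))
            clusters[seq_id] = [seq_id]
    return clusters


if __name__ == "__main__":
    toy = {
        "A": "MKTAYIAKQRQISFVKSHFSRQ",
        "B": "MKTAYIAKQRQISFVKSHFSRQ",  # identical to A
        "C": "AYIAKQRQISFVK",           # 13-residue fragment of A
        "D": "MKTAY",                   # fragment of A, but shorter than 11 residues
        "E": "MSDNELLQ",                # unrelated
    }
    print(collapse_identical(toy))
    # {'A': ['A', 'B', 'C'], 'E': ['E'], 'D': ['D']}
```

Because species is never consulted, identical sequences from different organisms end up in the same cluster, which is the point of the "independently of the species" remark.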

UniParc

The UniProt Archive (UniParc) is part of the UniProt project. It is a non-redundant archive of protein sequences extracted from public databases: UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, PIR-PSD, EMBL, EMBL WGS, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, H-Invitational Database, TROME database, European Patent Office proteins, United States Patent and Trademark Office (USPTO) proteins and Japan Patent Office proteins.

UniParc contains only protein sequences. All other information about the protein must be retrieved from the source databases using the database cross-references.

Each unique sequence is stored only once, with a stable identifier. The identifier consists of UPI followed by ten hexadecimal digits, e.g. UPI000000000A.
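The identifier format just described is simple enough to check with a regular expression. A minimal sketch, assuming upper-case hexadecimal digits as in the example identifier; the function name `is_uniparc_id` is an illustrative choice.

```python
import re

# "UPI" followed by exactly ten hexadecimal digits (upper case assumed).
UPI_PATTERN = re.compile(r"UPI[0-9A-F]{10}")


def is_uniparc_id(identifier):
    """Return True if the string has the UniParc identifier format."""
    return UPI_PATTERN.fullmatch(identifier) is not None


if __name__ == "__main__":
    print(is_uniparc_id("UPI000000000A"))  # True
    print(is_uniparc_id("UPI00000000A"))   # False: only nine digits
    print(is_uniparc_id("P12345"))         # False: not a UniParc identifier
```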

Use with extreme caution: UniParc also contains pseudogenes, incorrect CDS predictions, etc.!

It also contains patent office data (EPO, ESPO…).

Not downloadable…




UniMES




The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data.

UniMES is available in FASTA format on the UniProt ftp servers, in the new subdirectory current_release/unimes:

• ftp.uniprot.org/pub/databases/uniprot

• ftp.ebi.ac.uk/pub/databases/uniprot

• ftp.expasy.org/databases/uniprot




http://www.uniprot.org/downloads
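Since UniMES (like most protein sequence databases) is distributed in FASTA format, a small reader is often the first thing needed after a download. The following is a minimal sketch; the function name `read_fasta` and the example records are made up for illustration and no particular header layout is assumed.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    example = """>seq1 sample A
MKT
AYI

>seq2
GGS
"""
    for header, sequence in read_fasta(example.splitlines()):
        print(header, len(sequence), sequence)
```

For a downloaded file, the same generator can be fed an open file handle, for example `read_fasta(open(path))`, where `path` points to the local copy.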




NCBI-nr (Entrez protein)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Protein sequences: "NR database" (Entrez protein)


NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF

GenPept: derived from GenBank/EMBL/DDBJ sequences which have a CDS annotated on them; equivalent to TrEMBL, except that it is redundant with Swiss-Prot

PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)

PDB: 3D structure database: all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)

PRF: sequences derived from scientific publications, "journal scan" (integrated into TrEMBL)




RefSeq




RefSeq: The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins.

http://www.ncbi.nlm.nih.gov/RefSeq/

Accession numbers:

- for RNA (NM_)

- for genomic (NT_)

- for protein (NP_)

- for predicted protein (XP_)

3,648,590 entries (22-May-2007); 4,300 species.

5,590,364 entries (11-July-2008); 5,395 species.

6,042,750 entries (20-November-2008); 5,726 species.
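The accession prefixes listed above make it possible to tell the molecule type from the accession alone. A minimal sketch, limited to the four prefixes given on the slide; the function name and the example accessions are made up for illustration.

```python
# Prefix -> molecule type, restricted to the prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NM_": "RNA",
    "NT_": "genomic",
    "NP_": "protein",
    "XP_": "predicted protein",
}


def refseq_molecule_type(accession):
    """Return the molecule type for a RefSeq accession, or None if unknown."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


if __name__ == "__main__":
    # Made-up accessions, used only to exercise the lookup.
    for accession in ("NP_000000.1", "XP_123456", "NM_000001.2", "P12345"):
        print(accession, "->", refseq_molecule_type(accession))
```

In a proteomics context this is mostly useful for separating curated proteins (NP_) from predicted ones (XP_) in a search result.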

[Figure: example entry, with the accession number (AC), keywords (KW), taxonomy and references highlighted]








PIR




PIR: the Protein Information Resource

PIR-PSD is no longer updated, but exists as an archive




PDB




PDBPDB (Protein Data Bank), 3D structure

Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies

Contains also the corresponding protein sequences*The PIR-NRL3D database makes the sequence information in PDB available for similarity

searches and other tools

Includes protein sequences which are mutated (e.g. to study the effect of a mutation on the 3D structure)




PDB: Protein Data Bank

www.rcsb.org/pdb/

Managed by Research Collaboratory for Structural Bioinformatics (RCSB) (USA).

Associated specialized programs allow the visualization of the corresponding 3D structure (e.g., Swiss-PdbViewer, Chime, RasMol).

Currently there are structural data for about [number missing] different proteins, but for far fewer protein families (highly redundant)!




PDB: example




Coordinates of each atom

Sequence
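As an illustration of what "coordinates of each atom" means in practice, here is a minimal sketch that reads one ATOM record from a PDB-format file using the fixed column positions of that format. The sample line is a made-up record, not taken from a real entry, and the function name is hypothetical.

```python
# Minimal sketch: extract atom coordinates from a PDB-format ATOM record.
# PDB files use fixed-width columns, so fields are sliced by position.

# Made-up sample record (not from a real PDB entry).
SAMPLE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"


def parse_atom_line(line):
    """Return the main fields of one ATOM/HETATM record as a dict."""
    return {
        "serial": int(line[6:11]),        # atom serial number
        "name": line[12:16].strip(),      # atom name, e.g. N, CA, C, O
        "res_name": line[17:20].strip(),  # three-letter residue name
        "chain": line[21],                # chain identifier
        "res_seq": int(line[22:26]),      # residue sequence number
        "x": float(line[30:38]),          # orthogonal coordinates in angstroms
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def atom_records(pdb_text):
    """Yield parsed records for every ATOM line in a PDB file's text."""
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            yield parse_atom_line(line)


atom = parse_atom_line(SAMPLE)
print(atom["res_name"], atom["name"], atom["x"], atom["y"], atom["z"])
```

Slicing by column rather than splitting on whitespace matters here because adjacent fields in real PDB files can touch each other without a separating space.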




Visualisation with Jmol




PRF




http://www.genome.jp/dbget-bin/www_bfind?prf

Looks for peptide sequences described in publications (and which have not been submitted to databases!)







Query at Entrez protein (NCBInr)


Typical result of a query at « Entrez protein »

RefSeq

Swiss-Prot

GenPept (gb/embl/ddbj)

PIR

PDB

AC

GenInfo identifier number




GI number: ‘GenInfo identifier’ number

- In addition to an AC number specific to the original database, each protein sequence in the NCBInr database has a GI number.

- If the sequence changes in any way, a new GI number is assigned, so the GI number is not a stable identifier for a protein

- A separate GI number is also assigned to each protein translation within a nucleotide sequence record (alternative products)

- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record: http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
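The relationship between the GI number and the source-database accession can be seen in the FASTA header lines of NCBInr. The sketch below assumes the classic "gi|number|db|accession.version|" header layout; the GI number and version in the sample are invented for illustration, and the function name is hypothetical.

```python
# Minimal sketch: split an NCBI-style "gi|...|db|accession.version|" FASTA
# header into its identifier fields. The sample header is invented.

def parse_gi_defline(defline):
    """Return GI number, source database tag, accession and version."""
    ident, _, description = defline.lstrip(">").partition(" ")
    fields = ident.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("not a gi|...|db|accession| style header")
    accession, _, version = fields[3].partition(".")
    return {
        "gi": int(fields[1]),          # changes whenever the sequence changes
        "source_db": fields[2],        # e.g. ref, sp, gb, emb, dbj, pir, pdb
        "accession": accession,        # AC from the original database
        "version": int(version) if version else None,
        "description": description.strip(),
    }


header = ">gi|123456789|ref|NP_000483.3| example protein [Homo sapiens]"
print(parse_gi_defline(header))
```

Keeping the accession and the GI number as separate fields makes it possible to track an entry by its AC while still detecting sequence changes through the GI number or version.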












EnsEMBL

http://www.ensembl.org/

not only for proteins…




http://www.ensembl.org/

EnsEMBL

Automated genome annotation and subsequent visualisation of annotated genomes.

Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant and fungal genomes.

http://www.ensembl.org/info/about/index.html




http://www.ensembl.org/info/data/docs/genome_annotation.html

- EnsEMBL aligns the genomic sequences with all the sequences found in EMBL, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)

- It also performs gene prediction (-> novel genes)

- DNA, RNA and protein sequences available for ~30 species

- Browsing tool







http://www.ensembl.org/index.html

Browsing tool available for 49 species…




CCDS: Consensus CDS protein set




http://www.ncbi.nlm.nih.gov/CCDS/




CCDS (human)

Combining different approaches (ab initio, by similarity) and taking advantage of the expertise acquired by different institutes, including manual annotation…

Consensus between 4 institutions…







IPI: International Protein Index

http://www.ebi.ac.uk/IPI/IPIhelp.html







IPI (International Protein Index)

Provides a guide to the main databases that describe the human, mouse, rat, zebrafish, Arabidopsis, chicken and cow proteomes: Swiss-Prot, TrEMBL, RefSeq and Ensembl (and H-InvDB, TAIR and VEGA).

IPI is built to provide maximum coverage of the major publicly available protein (and gene) databases for any given protein.

For each protein in IPI, an entry from one of the constituent databases is selected as the master entry, and supplies the IPI entry with its sequence and annotation.

Stable identifiers (with incremental versioning) are maintained to allow the tracking of sequences between IPI releases.




IMGT (international ImMunoGeneTics information system)

A collection of high-quality integrated databases specialising in immunoglobulins, T cell receptors and the Major Histocompatibility Complex (MHC) of all vertebrate species.




http://www.ebi.ac.uk/imgt/








Protein sequence databases for proteomics




Phenyx: UniProtKB

PROWL: NCBInr, Swiss-Prot, dbEST

Protein Prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL*

PeptIdent (Aldente): UniProtKB

Mascot: NCBInr, Swiss-Prot, dbEST, OWL*, MSDB

* OWL has been obsolete since 1999

dbEST: translation of EST sequences in the 6 reading frames (ESTs are not associated with annotated CDSs!)
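Because ESTs carry no annotated CDS, a search engine has to translate each EST in all six reading frames (three on each strand) before matching peptides against it. The following is a minimal sketch of that translation using the standard genetic code; the function names are illustrative and it is not the implementation used by any of the tools listed above.

```python
# Minimal sketch: six-frame translation of a nucleotide sequence with the
# standard genetic code. Function names are illustrative only.
from itertools import product

BASES = "TCAG"
# Amino acids for codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translate("ATGGCC").items():
    print(frame, protein)
```

Stop codons are kept as `*` so that a downstream step can split each frame into candidate open reading frames.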




OWL: non-redundant protein database, including Swiss-Prot, PIR, NRL-3D* and GenPept.

*The PIR-NRL3D database makes the sequence information in PDB available for similarity searches











ID/AC mapping




-> Accession / version number jungle!

Depending on the database, an AC number can be associated with an entry (gene product: stable even if the sequence changes) or with a sequence (it changes as soon as the sequence changes)




In summary

For the same protein sequence, you can find:

- a UniProtKB/Swiss-Prot entry
- a RefSeq entry (or GenPept)
- an EnsEMBL entry
- a CCDS entry
- a UniParc entry (archive)
- an IPI entry…




The AC number jungle

Type of record             Sample accession format
GenBank/EMBL/DDBJ          One letter followed by five digits, e.g. U12345
                           Two letters followed by six digits, e.g. AF123456
Swiss-Prot/TrEMBL          One letter and five digits/letters, e.g. P12345, A0B533
RefSeq nucleotide          Two letters, an underscore and six digits:
                           mRNA, e.g. NM_000492
                           genomic, e.g. NT_000907
RefSeq protein             e.g. NP_000483
RefSeq prediction          e.g. XM_000483 (mRNA), XP_000467 (protein)
PDB (protein structure)    One digit followed by three alphanumeric characters, e.g. 1TUP
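The formats in this table can be turned into a rough classifier. The sketch below is only a heuristic under stated assumptions: the UniProtKB pattern follows the accession format documented by UniProt, the RefSeq pattern accepts six to nine digits and an optional version suffix, and the patterns are tested in a fixed order because some formats overlap (P12345 also fits the one-letter-plus-five-digits GenBank shape).

```python
# Minimal sketch: guess the source database of an accession number from its
# format. Heuristic only; overlapping formats are resolved by pattern order.
import re

REFSEQ_KINDS = {
    "NM": "mRNA",
    "NT": "genomic",
    "NP": "protein",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
}

REFSEQ = re.compile(r"^(NM|NT|NP|XM|XP)_\d{6,9}(\.\d+)?$")
PDB = re.compile(r"^\d[A-Za-z0-9]{3}$")
UNIPROT = re.compile(
    r"^([OPQ]\d[A-Z0-9]{3}\d|[A-NR-Z]\d([A-Z][A-Z0-9]{2}\d){1,2})$"
)
GENBANK = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})$")


def classify_accession(accession):
    """Return a best-guess label for the database an accession comes from."""
    match = REFSEQ.match(accession)
    if match:
        return "RefSeq " + REFSEQ_KINDS[match.group(1)]
    if PDB.match(accession):
        return "PDB"
    if UNIPROT.match(accession):
        return "Swiss-Prot/TrEMBL"
    if GENBANK.match(accession):
        return "GenBank/EMBL/DDBJ"
    return "unknown"


for ac in ["U12345", "AF123456", "P12345", "A0B533", "NM_000492", "XP_000467", "1TUP"]:
    print(ac, "->", classify_accession(ac))
```

A format check like this can only suggest where an identifier comes from; confirming it requires a lookup in the database itself or in an ID mapping service.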

uniprot.org




UniProtKB and PTMs




Proteome complexity

Not predictable at the genome level!

(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).




Chemical aspects

• Post-translational modifications (PTMs) consist of the breaking and/or the making of covalent bonds, catalyzed by enzymes

• PTMs modify both protein mass and isoelectric point (pI)
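The mass effect of a PTM can be made concrete with a small calculation. The sketch below computes a monoisotopic peptide mass from standard residue masses and adds the usual monoisotopic mass shifts for three common modifications; the peptide sequence and the function name are hypothetical, and the pI shift is not modelled.

```python
# Minimal sketch: monoisotopic mass of a peptide with and without PTMs.
# Residue masses and delta masses are standard monoisotopic values (Da).

RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # one water is added for the free N- and C-termini

PTM_DELTA = {
    "phosphorylation": 79.966331,  # +HPO3
    "acetylation": 42.010565,      # +C2H2O
    "methylation": 14.015650,      # +CH2
}


def peptide_mass(sequence, ptms=()):
    """Return the monoisotopic mass of a peptide carrying the given PTMs."""
    base = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return base + sum(PTM_DELTA[name] for name in ptms)


peptide = "SAMPLER"  # hypothetical peptide used only for illustration
unmodified = peptide_mass(peptide)
phospho = peptide_mass(peptide, ["phosphorylation"])
print(round(unmodified, 4), round(phospho, 4), round(phospho - unmodified, 4))
```

This kind of delta-mass lookup is what allows a search engine to consider a modified peptide as a candidate when the measured mass does not fit the unmodified sequence.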




The PTM variety

Amino acids: Gly Ala Val Leu Ile Lys Arg His Asp Glu Asn Gln Cys Ser Thr Met Pro Phe Tyr Trp

Side-chain modifications: acetylation, methylation, acylation, phosphorylation, oxidation, crosslinks, hydroxylation, cofactor binding, sulfation, C-linked sugar, N-linked sugar, O-linked sugar, S-linked sugar

N-terminal modifications: acetylation, methylation, acylation, crosslinks

C-terminal modifications: GPI, amidation, crosslinks, methylation

Figure shading: black = cytoplasmic modifications; dark grey = both cytoplasmic and extracellular modifications, depending on the exact type; light grey = extracellular modifications




PTM distribution among kingdoms

Figure: distribution across bacteria, eukaryotes and archaea of the following PTMs: acetylation, amidation, FMN binding, FAD binding, GPI-anchor, lanthionine crosslink, bacterial lipid anchor, myristoylation, archaean lipid anchor, palmitoylation, methylation, phosphorylation, sulfation, diphthamide, pyrrolysine, archaea-specific methylation, eukaryote-specific methylation, bacteria-specific methylation




PTM annotation in UniProtKB entries

PTMs are annotated in the feature table ('sequence annotation') when they can be assigned a position on the protein sequence, and in the comments when they cannot.

PTM-dedicated FT keys

FT key                        Usage
CARBOHYD (Glycosylation)      sugars
DISULFID (Disulfide bond)     disulfide bonds
CROSSLNK (Cross-link)         other crosslinks
LIPID                         lipids
MOD_RES (Modified residue)    other modifications

PTMs are grouped by type, and are specifically and uniquely annotated by the use of a controlled vocabulary and a set of specific FT keys
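To show how these FT keys can be used programmatically, here is a minimal sketch that pulls the PTM features out of a UniProtKB flat-file entry. It assumes the older flat-file layout in which key, start, end and description sit on one FT line; the sample entry is invented, and the function name is hypothetical.

```python
# Minimal sketch: collect PTM features from the FT lines of a UniProtKB
# flat-file entry. Assumes the older one-line layout:
#   FT   KEY   FROM   TO   Description.
# The sample entry below is invented for illustration.

PTM_KEYS = {"CARBOHYD", "DISULFID", "CROSSLNK", "LIPID", "MOD_RES"}

SAMPLE_ENTRY = """\
ID   TEST_MOUSE              Reviewed;         100 AA.
FT   SIGNAL        1     20       Potential.
FT   MOD_RES      15     15       Phosphoserine.
FT   CARBOHYD     42     42       N-linked (GlcNAc...) (Potential).
FT   DISULFID     30     75       By similarity.
"""


def ptm_features(entry_text):
    """Return (key, start, end, description) tuples for PTM-dedicated keys."""
    features = []
    for line in entry_text.splitlines():
        if not line.startswith("FT"):
            continue
        parts = line.split(None, 4)
        # Continuation lines and non-PTM keys (e.g. SIGNAL) are skipped.
        if len(parts) >= 4 and parts[1] in PTM_KEYS:
            description = parts[4].rstrip(".") if len(parts) == 5 else ""
            # Positions stay as strings because they may be uncertain ("?").
            features.append((parts[1], parts[2], parts[3], description))
    return features


for feature in ptm_features(SAMPLE_ENTRY):
    print(feature)
```

Because the descriptions use a controlled vocabulary, a filter such as "description starts with Phospho" is enough to select phosphorylation sites from the MOD_RES features.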




PTM annotation in UniProtKB entries

PTMs are annotated in the feature table when they can be assigned a position on the protein sequence, and in the comments when they cannot.

Associated keywords




Find all mouse proteins which are phosphorylated




UniProtKB/Swiss-Prot

Number of PTMs in Swiss-Prot release 51 (241242 entries), all organisms

PTM               Pot.     By sim.   Exp. & Prob.   Total
signal peptide    15235    2850      4996           23081
N-GlcNAc          66264    826       3161           70251
O-GalNAc          306      354       628            1288
O-GlcNAc          1        167       78             246
phosphorylation   1110     14288     7760           23158
sulfation         252      294       171            717
myristate         129      535       131            795
GPI-anchor        478      115       68             661

Pot. = potential; By sim. = by similarity; Exp. & Prob. = experimental and probable




Resid




RESID

RESID is a database of 473 natural modifications (Rel. 56.00) with chemical and structural annotations such as recommended name and synonyms, delta mass, 3D structure, UniProt annotations, etc.

FTP sites:

ftp://ftp.ebi.ac.uk/pub/databases/RESID/

ftp://ftp.ncifcrf.gov/pub/users/residues

Web sites:

http://www.ebi.ac.uk/RESID

http://www.ncifcrf.gov/RESID/

http://home.earthlink.net/~jsgaravelli/RESIDInfo.HTML

http://hpc.cs.tsinghua.edu.cn/bioinfo/database/index.html








Other PTM databases

UNIMOD: http://www.unimod.org/

PSI-MOD (ontology): http://psidev.sourceforge.net/mod/

Delta Mass: http://www.abrf.org/index.cfm/dm.home





GO




GO scope

http://www.geneontology.org

Three disjoint axes:

- cellular component: sub-cellular location, e.g. nucleus, ribosome, origin recognition complex
- molecular function: molecular role, e.g. catalytic activity, binding
- biological process: broad biological phenomena, e.g. mitosis, growth, digestion




GO structure

Terms are related within a hierarchy.

Terms are linked by two relationships:

- is-a
- part-of

Example (figure): cell; membrane, chloroplast; mitochondrial membrane, chloroplast membrane
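The hierarchy can be represented as a directed graph in which each term points to its parents through an is-a or part-of link. The sketch below uses the terms from the example figure; the exact edges are an assumed reading of that figure, not taken from the GO files, and the function name is illustrative.

```python
# Minimal sketch: GO-like terms linked by "is-a" and "part-of" relationships,
# with a traversal that collects every ancestor of a term.
# The edges are an assumed reading of the example figure.

GO_EDGES = {
    "mitochondrial membrane": [("is-a", "membrane")],
    "chloroplast membrane": [("is-a", "membrane"), ("part-of", "chloroplast")],
    "membrane": [("part-of", "cell")],
    "chloroplast": [("part-of", "cell")],
    "cell": [],
}


def ancestors(term, edges=GO_EDGES):
    """Return all terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("chloroplast membrane")))
```

Because a term can have more than one parent, the structure is a directed acyclic graph rather than a simple tree, which is why the traversal tracks the terms it has already visited.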




What is GOA?

GOA: Gene Ontology Annotation

http://www.ebi.ac.uk/GOA/

GOA aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase and International Protein Index, using GO terms.

The GOA project is run by EBI and has been a member of the GO consortium since 2001.

In 2001, the first phase of the GOA project involved the large-scale assignment of GO terms to Swiss-Prot and TrEMBL entries using electronic methods, namely the mappings spkw2go, ec2go and interpro2go.







e-proxemis: http://e-proxemis.expasy.org



