UniProtKB/Swiss-Prot: central hub for biological data

UniProtKB/Swiss-Prot is a central hub for biological data: over 120 databases are cross-referenced (EMBL/DDBJ/GenBank, PDB, 2D-PAGE, OMIM, TAIR, FlyBase, InterPro, PROSITE, etc.).

To avoid redundancy and improve sequence reliability, all protein sequences encoded by a given gene are merged into a single entry (on average, one human entry carries more than 6 cross-references to EMBL). Differences found between merged entries are documented. Evidence on protein existence is provided.

Our main sources of data are publications (~1’900 journals cited), external scientific expertise and high-performance bioinformatics tools.

Swiss-Prot (release 55.5, June 2008): 389’046 entries / 11’419 species
Bacteria/Archaea: 777 proteomes
Homo sapiens: 19’804 entries
Other mammals: 42’674 entries
Plants: 22’919 entries
Viruses: 12’283 entries

TrEMBL (release 38.5, June 2008): 5’906’286 entries / 165’662 species

Swiss-Prot + TrEMBL give access to all publicly available protein sequences. Once an entry is in Swiss-Prot, it is no longer in TrEMBL.
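
Because an entry is never in both sections at once, the two counts above can simply be added. The following is a minimal arithmetic sketch of that reasoning; the function and variable names are illustrative and not part of any UniProt tool.

```python
# Entry counts quoted above (June 2008 releases).
SWISS_PROT_ENTRIES = 389_046   # UniProtKB/Swiss-Prot release 55.5
TREMBL_ENTRIES = 5_906_286     # UniProtKB/TrEMBL release 38.5


def reviewed_fraction(reviewed: int, unreviewed: int) -> float:
    """Return the share of entries that are reviewed (0.0 to 1.0).

    Assumes the two sections are disjoint, as stated in the text.
    """
    return reviewed / (reviewed + unreviewed)


total_entries = SWISS_PROT_ENTRIES + TREMBL_ENTRIES
fraction = reviewed_fraction(SWISS_PROT_ENTRIES, TREMBL_ENTRIES)

print(f"Total UniProtKB entries: {total_entries}")
print(f"Reviewed (Swiss-Prot) share: {fraction:.1%}")
```

With these figures, the manually reviewed section is a small fraction of the whole knowledgebase.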

Highlights of a UniProtKB/Swiss-Prot entry in the UniProt view format

UniProtKB/Swiss-Prot is the manually annotated section of the UniProt Knowledgebase. Manual annotation consists of a critical review of experimentally proven or predicted data about each protein, including the protein sequence. Data are continuously updated by an expert team of biologists.

A special emphasis is laid on the annotation of biological events which generate protein diversity but are not always predictable at the genomic level. Alternative products (alternative splicing, RNA editing…) and post-translational modifications are extensively annotated. In mammals, polymorphisms (SAPs) and strain differences are also integrated.

GenBank/DDBJ/EMBL, Ensembl and other protein resources

UniProt Knowledgebase (UniProtKB)

Annotation priorities: complete microbial proteomes, plastid-encoded proteins, human and mammalian orthologous proteins, plant proteins (A. thaliana and rice), fungal proteomes, proteomes of representative subsets of virus strains, toxins and anti-microbial peptides, and the Drosophila, zebrafish, Xenopus and C. elegans proteomes…

UniProtKB/Swiss-Prot, the manually annotated section of the UniProt Knowledgebase, provides a link between protein sequences and state-of-the-art knowledge.
www.uniprot.org

We need your feedback! [email protected]

UniProtKB/Swiss-Prot provides a link between protein sequences and state-of-the-art knowledge

UniProt Consortium: Swiss Institute of Bioinformatics, European Bioinformatics Institute, Protein Information Resource
www.uniprot.org

UniProtKB/TrEMBL: unreviewed protein sequences. Automatic annotation.

UniProtKB/Swiss-Prot: reviewed protein sequences. Manual annotation: sequence accuracy, no redundancy, high-quality annotation, numerous cross-references.

UniParc

Gives access to archived protein sequences found in publicly accessible databases (UniProtKB, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, patent offices…).

UniParc allows the tracking of a protein sequence and its integration into various databases.

UniRef

Three collections of sequence clusters (UniRef100, UniRef90, UniRef50) based on UniProtKB and selected UniParc records. Clustering is done across species.

One UniRef100 entry groups identical sequences (including fragments).
One UniRef90 entry groups sequences that are at least 90% identical: database size reduction of ~40%.
One UniRef50 entry groups sequences that are at least 50% identical: database size reduction of ~65%.

UniRef is useful for comprehensive BLAST similarity searches by providing sets of representative sequences.
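
The idea of grouping sequences under a representative at a given identity threshold can be illustrated with a toy example. This is only a sketch under loud assumptions: the identity measure is a naive ungapped position-by-position comparison, the greedy strategy is a generic one chosen for illustration, and neither is claimed to be the procedure used to build UniRef. All sequences and function names are made up.

```python
def identity(a: str, b: str) -> float:
    """Naive ungapped identity: matching positions divided by the longer length.

    Real tools align sequences first; this shortcut is an assumption made
    to keep the sketch short, and it does not treat fragments as identical.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(sequences: list[str], threshold: float) -> list[list[str]]:
    """Greedy clustering: each sequence joins the first cluster whose
    representative (the cluster's first member) meets the threshold,
    otherwise it starts a new cluster. Longest sequences are seen first."""
    clusters: list[list[str]] = []
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def size_reduction(n_sequences: int, n_clusters: int) -> float:
    """Fraction by which the set shrinks when only representatives are kept."""
    return 1 - n_clusters / n_sequences


# Hypothetical toy sequences, not real proteins.
toy = [
    "MKTAYIAKQR",
    "MKTAYIAKQR",  # identical to the first
    "MKTAYIAKQK",  # 9 of 10 positions match the first
    "MKTAYLLLLL",  # 5 of 10 positions match the first
    "GGGGGGGGGG",  # unrelated
]

for threshold in (1.0, 0.9, 0.5):
    clusters = greedy_cluster(toy, threshold)
    reduction = size_reduction(len(toy), len(clusters))
    print(f"threshold {threshold:.0%}: {len(clusters)} clusters, "
          f"size reduction {reduction:.0%}")
```

Lowering the threshold merges more sequences per representative, which is why the lower-identity collections are smaller and faster to search.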

Use with caution: also contains pseudogenes, incorrect CDS predictions, etc.

UniProt Knowledgebase (UniProtKB)

Gives access to publicly available protein sequences with a maximum of biological information.

UniProtKB is composed of two sections: UniProtKB/TrEMBL and UniProtKB/Swiss-Prot.

UniProtKB/TrEMBL: unreviewed protein sequences (computer-annotated entries)

5’906’286 entries (Rel. 38.5, June 2008). Available protein sequences are automatically integrated into TrEMBL with:

Merge of 100% identical sequences derived from the same organism;

Protein family and domain attribution (InterPro);

Automated annotation.

UniProtKB/Swiss-Prot: reviewed protein sequences (manually annotated entries)

389’046 entries (Rel. 55.5, June 2008)

TrEMBL sequences are manually integrated into Swiss-Prot. This process involves:

Merging all variant sequences derived from the same gene in a single species (polymorphisms, alternative splicing, RNA editing, etc.), which gives low redundancy and high accuracy of the protein sequence;

Integrating biological and medical data derived from publications, external expertise and high-performance bioinformatics tools, which gives high-quality manual annotation;

Adding cross-references to relevant databases (links to about 100 databases are available), which makes Swiss-Prot a central hub for biological data.
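
To make the merging step more concrete, here is a minimal sketch of how differences between one retained sequence and other reports of the same protein could be recorded. It is an illustration only: the data layout, the report names and the sequences are hypothetical, only same-length substitutions are handled, and it does not represent the curators' actual workflow or the Swiss-Prot entry format.

```python
def substitutions(canonical: str, variant: str) -> list[tuple[int, str, str]]:
    """List (1-based position, canonical residue, variant residue) for each
    mismatch between two equal-length sequences.

    Insertions, deletions and splice isoforms are out of scope for this sketch.
    """
    if len(canonical) != len(variant):
        raise ValueError("this sketch only compares sequences of equal length")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(canonical, variant), start=1)
        if ref != alt
    ]


# Hypothetical reports of the same gene product from two sources.
canonical = "MKTAYIAKQR"
reports = {
    "report_A": "MKTAYIAKQR",  # agrees with the retained sequence
    "report_B": "MKTAYVAKQR",  # differs at one position
}

# One merged record: a single sequence plus the documented differences.
merged_entry = {
    "sequence": canonical,
    "differences": {
        name: substitutions(canonical, seq) for name, seq in reports.items()
    },
}

for name, diffs in merged_entry["differences"].items():
    print(name, diffs)
```

Keeping one sequence per gene while listing the differences is what lets redundancy drop without losing the information carried by the individual reports.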

UniProt: The Universal Protein Resource

One UniParc entry groups identical sequences across species. Each entry contains a protein sequence, taxonomic data and cross-references to source databases.
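
The entry structure described above (one sequence, its taxa, its source cross-references) can be sketched as a small data structure keyed by the sequence itself. This is a hedged illustration: the class name, field names, accessions and sequences are all hypothetical and do not reflect the real UniParc schema.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    """One archive entry: a sequence plus where it was seen (hypothetical layout)."""
    sequence: str
    taxa: set[str] = field(default_factory=set)
    cross_references: set[tuple[str, str]] = field(default_factory=set)


def build_archive(
    records: list[tuple[str, str, str, str]],
) -> dict[str, ArchiveEntry]:
    """Group (database, accession, taxon, sequence) records by identical
    sequence, regardless of species."""
    archive: dict[str, ArchiveEntry] = {}
    for database, accession, taxon, sequence in records:
        entry = archive.setdefault(sequence, ArchiveEntry(sequence))
        entry.taxa.add(taxon)
        entry.cross_references.add((database, accession))
    return archive


# Made-up source records; the accessions are placeholders, not real identifiers.
records = [
    ("EMBL", "ACC_1", "Homo sapiens", "MKTAYIAKQR"),
    ("RefSeq", "ACC_2", "Pan troglodytes", "MKTAYIAKQR"),
    ("PDB", "ACC_3", "Homo sapiens", "GGGGGGGGGG"),
]

archive = build_archive(records)
for entry in archive.values():
    print(entry.sequence, sorted(entry.taxa), sorted(entry.cross_references))
```

Because the key is the sequence alone, the same protein sequence reported in two species ends up in one entry with both taxa attached.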

Swiss Institute of Bioinformatics (SIB)

European Bioinformatics Institute (EMBL-EBI)

Protein Information Resource (PIR)

UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01 HG02712-04. Additional support for the EBI's involvement in UniProt comes from the European Commission (EC)'s FELICS grant (021902RII3) and from the NIH grant 1R01HGO2273-01. UniProtKB/Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science. PIR activities are also supported by the NIH grants and contracts HHSN266200400061C, NCI-caBIG, and 1R01GM080646-01, and the National Science Foundation (NSF) grant IIS-0430743.

UniMES: UniProt Metagenomic and Environmental Sequences

Currently the database contains only data from the Global Ocean Sampling Expedition (GOS). UniMES is released in FASTA format together with a file of UniMES matches to InterPro methods.
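
Since the sequences are distributed in FASTA format, a minimal reader may help readers unfamiliar with it. The sketch below handles the general FASTA convention (a header line starting with ">" followed by one or more sequence lines); the example records are invented, and nothing here assumes the specific header layout of the UniMES files.

```python
import io
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream.

    The header is returned without its leading ">"; sequence lines are
    concatenated. Lines before the first header are ignored.
    """
    header = None
    chunks: list[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Invented example data standing in for a downloaded file.
example = io.StringIO(
    ">seq1 hypothetical environmental read\n"
    "MKTAYI\n"
    "AKQR\n"
    ">seq2\n"
    "GGGG\n"
)

fasta_records = list(read_fasta(example))
for header, sequence in fasta_records:
    print(header, len(sequence))
```

For a real file, the same function can be given an open file handle instead of the in-memory example.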

The UniProt Consortium

The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

UniProt provides four databases, each optimized for different uses: UniProtKB, UniRef, UniParc and UniMES.

UniProt is produced by SIB, EBI and PIR.

[Diagram: the UniProt databases]
Sources: EMBL/GenBank/DDBJ, Ensembl, VEGA, RefSeq, other protein resources
UniParc: sequence archive
UniProtKB: protein sequence knowledgebase
UniProtKB/TrEMBL: unreviewed, automated annotation
UniProtKB/Swiss-Prot: reviewed, expert manual annotation
UniRef: sequence clusters
UniMES: metagenomic

Contact: [email protected]
Web site: www.uniprot.org