EBI is an Outstation of the European Molecular Biology Laboratory.
UniProtKB
Sandra Orchard
Importance of reference protein sequence databases
• Completeness and minimal redundancy
A non-redundant protein sequence database with maximal coverage, including splice isoforms, disease variants and PTMs.
Low degree of redundancy for facilitating peptide assignments
• Stability and consistency
Stable identifiers and consistent nomenclature
Databases change constantly because substantial work goes into improving their completeness and the quality of sequence annotation
• High quality protein annotation
Detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources
Summary of protein sequence databases

Database: UniProtKB
Description: Expertly curated section (UniProtKB/Swiss-Prot) and computer-annotated section (UniProtKB/TrEMBL); minimum level of redundancy; high level of integration with other databases; stable identifiers; diversity of sources including large scale genomics, small scale cloning and sequencing, protein sequencing, PDB, predicted sequences from Ensembl and RefSeq
Species: Many

Database: UniRef100
Description: Assembled from UniProtKB, Ensembl and RefSeq; merges 100% identical sequences; stable identifiers
Species: Many

Database: Ensembl
Description: Predictions using automated genome annotation pipeline; explicitly linked to nucleotide and protein sequences; stable reference; merge their annotations with Vega annotations at transcript level; extensive quality checks to remove erroneous gene models; high level of integration with other databases
Species: Over 50 eukaryotic genomes. Ensembl Genomes: Metazoa, Plants and Fungi, Protists, Bacteria and Archaea

Database: RefSeq
Description: NCBI creates from existing data; ongoing curation; non-redundant; explicitly linked nucleotide and protein sequences; stable reference; high level of integration with other databases
Species: Limited to fully sequenced organisms

Database: Entrez protein (NCBInr)
Description: Assembled from GenBank and RefSeq coding sequence translations and UniProtKB; annotations extracted from source curated databases; high degree of sequence redundancy
Species: Many

Updated from Nesvizhskii, A. I., and Aebersold, R. (2005) Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics 4, 1419–1440.
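The idea of merging 100% identical sequences, as listed for UniRef100, can be illustrated with a minimal Python sketch. It is not the UniRef pipeline: the function name `merge_identical`, the record identifiers and the toy sequences are all assumptions made for illustration, and only exact, full-length identity is handled.

```python
from collections import defaultdict


def merge_identical(records):
    """Group (identifier, sequence) pairs whose sequences are 100% identical."""
    clusters = defaultdict(list)
    for identifier, sequence in records:
        # Normalise case and whitespace so formatting differences
        # do not split otherwise identical sequences.
        key = "".join(sequence.split()).upper()
        clusters[key].append(identifier)
    return dict(clusters)


# Hypothetical records; identifiers and sequences are toy values.
records = [
    ("entry_A", "MKTAYIAKQR"),
    ("entry_B", "mktayiakqr"),
    ("entry_C", "MKTAYIAKQL"),
]
clusters = merge_identical(records)
for sequence, members in clusters.items():
    print(sequence, members)
```

Here `entry_A` and `entry_B` collapse into one cluster, while `entry_C`, which differs by a single residue, stays separate. This is the behaviour that keeps peptide assignment simpler in a low-redundancy database.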
UniProtKB
UniProt Knowledgebase: 2 sections
1. UniProtKB/Swiss-Prot Non-redundant, high-quality manual annotation - reviewed
2. UniProtKB/TrEMBL Redundant, automatically annotated - unreviewed
www.uniprot.org
Sequence Sequence features
Ontologies
References
Nomenclature
Splice variants
Annotations
UniProtKB
Manual annotation of UniProtKB/Swiss-Prot
Sequence curation, stable identifiers, versioning and archiving
For example: erroneous gene model predictions, frameshifts, premature stop codons, read-throughs, erroneous initiator methionines…
Splice variants
Identification of amino acid variants and of PTMs, and also:
Domain annotation
Binding sites
Protein nomenclature
Controlled vocabularies used whenever possible…
Annotation: more than 30 defined fields
… and also imported from external resources
Binary interactions taken from the IntAct database
Interactors of human p53
Controlled vocabulary usage increasing – for example from the Gene Ontology
Annotation for human Rhodopsin
Sequence evidence
Type of evidence that supports the existence of a protein
1 Evidence at protein level: There is experimental evidence of the existence of a protein (e.g. Edman sequencing, MS, X-ray/NMR structure, good quality protein-protein interaction, detection by antibodies)
2 Evidence at transcript level: The existence of a protein has not been proven but there is expression data (e.g. existence of cDNAs, RT-PCR or Northern blots) that indicates the existence of a transcript
3 Inferred from homology: The existence of a protein is likely because orthologs exist in closely related species
4 Predicted
5 Uncertain
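The five protein existence levels can be read programmatically. The sketch below assumes the FASTA header carries a `PE=<n>` field, as UniProtKB FASTA downloads do; the function name `protein_existence` and the example header are illustrative rather than taken from the slides.

```python
import re

# Labels for the five protein existence (PE) levels listed above.
PE_LEVELS = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Uncertain",
}


def protein_existence(header):
    """Return (level, label) from a header containing 'PE=<n>', else None."""
    match = re.search(r"\bPE=([1-5])\b", header)
    if match is None:
        return None
    level = int(match.group(1))
    return level, PE_LEVELS[level]


# Illustrative header in the UniProtKB FASTA style (assumed layout).
example_header = (
    ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 "
    "OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4"
)
print(protein_existence(example_header))
```

A filter of this kind could, for instance, restrict a search database to entries with level 1 or 2 evidence, though whether that is appropriate depends on the experiment.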
Manual annotation of the human proteome (UniProtKB/Swiss-Prot)
• A draft of the complete human proteome has been available in UniProtKB/Swiss-Prot since 2008
• Manually annotated representation of 20,231 protein coding genes with 36,865 protein sequences; together with an additional 33,243 UniProtKB/TrEMBL entries these form the complete proteome set
• Approximately 67,600 single amino acid polymorphisms (SAPs), mostly disease-linked
• ~75,500 post-translational modifications (PTMs)
• Close collaboration with NCBI, Ensembl, Sanger Institute and UCSC to provide the authoritative set to the user community
Searching UniProt – Simple Search
• Text-based searching
• Logical operators ‘&’ (and), ‘|’ (or)
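To make the operator semantics concrete, here is a small, self-contained Python sketch that evaluates '&' and '|' against in-memory text. It is not the UniProt search engine: the function `matches`, the toy entries and the precedence rule ('&' binds more tightly than '|') are all assumptions chosen for illustration.

```python
def matches(query, text):
    """Evaluate a query such as 'a & b | c' against text.

    '|' separates alternatives; within an alternative, '&' joins terms
    that must all be present. Matching is a case-insensitive substring test.
    """
    haystack = text.lower()
    for alternative in query.split("|"):
        terms = [term.strip().lower() for term in alternative.split("&")]
        terms = [term for term in terms if term]
        if terms and all(term in haystack for term in terms):
            return True
    return False


# Hypothetical entry descriptions used only to exercise the function.
entries = {
    "entry_1": "Rhodopsin, Homo sapiens, photoreceptor",
    "entry_2": "Cellular tumor antigen p53, Homo sapiens",
    "entry_3": "Rhodopsin, Bos taurus, photoreceptor",
}
query = "rhodopsin & homo sapiens | p53"
hits = [name for name, text in entries.items() if matches(query, text)]
print(hits)
```

With this query, `entry_1` satisfies the '&' clause and `entry_2` satisfies the alternative after '|', while `entry_3` matches neither.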
Searching UniProt – Advanced Search
Searching UniProt – Search Results
Each linked to the UniProt entry
Searching UniProt – Blast Search
Searching UniProt – Blast Results
Alignment with query sequence
UniProtKB/TrEMBL
Multiple entries for the same protein (redundancy) can arise in UniProtKB/TrEMBL due to:
o Erroneous gene model predictions
o Sequence errors (frameshifts)
o Polymorphisms
o Alternative start sites
o Isoforms
Apart from 100% identical sequences, all merged sequences are analysed by a curator so they can be annotated accordingly.
Why do we need predictive annotation tools?
[Figure: number of sequences (y-axis, 0 to 14,000,000) against date (x-axis) for UniProtKB and UniProtKB/Swiss-Prot]
• Automated clean-up of annotation from original nucleotide sequence entry
• Additional value added by using automatic annotation
• Recognises common annotation belonging to a closely related family within UniProtKB/Swiss-Prot
• Identifies all members of this family using pattern/motif/HMMs in InterPro
• Transfers common annotation to related family members in TrEMBL
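The three steps above (recognise annotation common to a reviewed family, identify family members by signature, transfer the annotation) can be sketched roughly as follows. The data model is hypothetical: family signatures are plain strings standing in for InterPro matches, annotations are plain strings, and `transfer_common_annotation` is an illustrative name, not part of any UniProt tool.

```python
def transfer_common_annotation(reviewed, unreviewed):
    """Propagate annotation shared by every reviewed member of a family.

    reviewed:   {accession: (family_signature, set_of_annotations)}
    unreviewed: {accession: family_signature}
    Returns {unreviewed_accession: set_of_transferred_annotations}.
    """
    common = {}
    for signature, annotations in reviewed.values():
        if signature in common:
            # Keep only annotation present in all reviewed members so far.
            common[signature] &= annotations
        else:
            common[signature] = set(annotations)
    return {
        accession: set(common.get(signature, set()))
        for accession, signature in unreviewed.items()
    }


# Hypothetical accessions, signatures and annotation strings.
reviewed = {
    "REV1": ("FAM_A", {"G-protein coupled receptor", "Membrane", "Retinal binding"}),
    "REV2": ("FAM_A", {"G-protein coupled receptor", "Membrane"}),
    "REV3": ("FAM_B", {"Kinase"}),
}
unreviewed = {"UNREV1": "FAM_A", "UNREV2": "FAM_C"}
transferred = transfer_common_annotation(reviewed, unreviewed)
print(transferred)
```

Only annotation shared by all reviewed members of `FAM_A` reaches `UNREV1`; `UNREV2` belongs to a family with no reviewed members and so receives nothing. A production system would also need conditions and exceptions that this sketch leaves out.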
Automatic Annotation
← Name (non-standard)
← Taxonomy
← Publication
← Sequence
InterPro
Finding a complete proteome in UniProtKB
Complete Proteomes
MS Proteomics
• Require each sequence (including isoforms) to be present in the dataset as a separate entity for search engines to access
• For higher organisms, with isoforms, an expanded set is made available on the FTP site
• FASTA files by FTP
• One file per species containing canonical + isoform sequences
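A file that holds canonical and isoform sequences as separate entries can be split with a few lines of Python. This is a sketch under stated assumptions: headers follow a `db|ACCESSION|NAME` layout, isoform accessions end in `-<number>`, and the accessions and sequences shown are placeholders, not real entries.

```python
import io
import re


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession(header):
    """Extract the accession from a 'db|ACCESSION|NAME ...' style header."""
    return header.split("|")[1]


def is_isoform(acc):
    """Treat accessions ending in '-<number>' as isoform sequences."""
    return re.search(r"-\d+$", acc) is not None


# Placeholder entries: accessions and sequences are invented for illustration.
fasta_text = """>sp|X00001|EXAMPLE_HUMAN Example protein
MKTAYIAKQR
QISFVK
>sp|X00001-2|EXAMPLE_HUMAN Isoform 2 of Example protein
MKTAYIAKQR
"""
entries = list(read_fasta(io.StringIO(fasta_text)))
canonical = [accession(h) for h, _ in entries if not is_isoform(accession(h))]
isoforms = [accession(h) for h, _ in entries if is_isoform(accession(h))]
print(canonical, isoforms)
```

Because each isoform is its own FASTA record, a search engine can match peptides to it directly instead of reconstructing it from the canonical sequence.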