State of Genome Annotation
• Most prokaryotic genomes are auto-annotated.
• Sequence and function are inferred with comparative genomics; validation is sparse.
• Difficulties with novel or HGT genes
• Mature protein features: localization, PTM, cleavage
Salzberg 2007
Proteogenomics
• Definition: using proteomics data to do genome annotation
• Goals:
  – Find all coding regions of the genome, annotated and unannotated
  – Submit improved annotation to NCBI
  – Identify “mature protein” features
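The first goal can be illustrated with a minimal sketch: translate the genome in all six reading frames and look for each observed peptide, whether or not a gene is annotated there. This assumes the standard codon-to-amino-acid assignments and exact string matching; real pipelines score spectra against a translated database, and the function names below are hypothetical.

```python
from itertools import product

# Standard codon assignments, built in TCAG order (assumed here; bacterial
# translation tables differ mainly in which codons may act as starts).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from position 0; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Yield (strand, frame offset, protein) for all six reading frames."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            yield strand, offset, translate(seq[offset:])


def map_peptide(peptide, dna):
    """Return (strand, frame offset, protein position) for every exact hit."""
    hits = []
    for strand, offset, protein in six_frames(dna):
        pos = protein.find(peptide)
        while pos != -1:
            hits.append((strand, offset, pos))
            pos = protein.find(peptide, pos + 1)
    return hits
```

A peptide that maps only outside annotated genes is a candidate novel coding region; one that maps upstream of an annotated start, in frame, suggests a new start site.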
Proteogenomics Protocol
• Data sources
  – Yersinia pestis: Pieper et al., 2008, 2009
  – Bacillus anthracis: PRC/NIAID
Exceptions to Rules
• Non-canonical start codons
  – infC: ATT (Sacerdot 1982, Payne 2010) in enterobacteria; ATA in Shewanella (Gupta 2007)
  – Deinococcus (Baudet 2009) suggests new non-standard starts
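A small sketch of how a pipeline might flag such exceptions. The canonical set (ATG, GTG, TTG) is an assumption about what a typical prokaryotic gene predictor accepts; the non-canonical set contains only the two codons named above, and the function name is hypothetical.

```python
# Assumed set of start codons accepted by a typical prokaryotic gene predictor.
CANONICAL_STARTS = {"ATG", "GTG", "TTG"}
# Non-canonical starts named on this slide (ATT for infC, ATA in Shewanella).
REPORTED_NONCANONICAL = {"ATT", "ATA"}


def classify_start(cds):
    """Classify the first codon of a coding sequence given as a DNA string."""
    codon = cds[:3].upper()
    if codon in CANONICAL_STARTS:
        return "canonical"
    if codon in REPORTED_NONCANONICAL:
        return "reported non-canonical"
    return "other"
```

A predictor restricted to the canonical set cannot place a start at ATT or ATA, so peptide evidence upstream of the predicted start is one way such cases surface.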
Pseudo?genes
• Expression of ABC transporter N-terminus. Missing critical motif elements.
• 5 peptides (with splicing) map to a transposable element gene. Sequence alignment to an Arabidopsis Ulp1 (Castellana 2008)
Signal Peptide
• N-terminal motif that targets a protein for export
• 1983, Perlman & Halvorson
  – Early basic residue, hydrophobic patch, AxB motif
  – A = [I,V,L,A,G,S], B = [A,G,S]
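The three features above can be sketched as a simple filter. The residue classes for A and B come from the slide; the window sizes, the hydrophobic alphabet, and the minimum run length are illustrative assumptions, and the function name is hypothetical.

```python
import re

# AxB motif from the slide: A = [IVLAGS], x = any residue, B = [AGS].
# The lookahead lets overlapping candidates be reported.
AXB = re.compile(r"(?=([IVLAGS].[AGS]))")
# Assumed definition of a "hydrophobic patch": six or more hydrophobic residues.
HYDROPHOBIC_RUN = re.compile(r"[AILMFVW]{6,}")


def candidate_cleavage_sites(protein, max_len=40):
    """Return 0-based indices where the mature protein could begin.

    A site is reported when (1) a basic residue (K or R) occurs among
    residues 2-6, (2) a hydrophobic run lies upstream of the motif, and
    (3) an AxB motif ends immediately before the site.
    """
    head = protein[:max_len].upper()
    if not re.search(r"[KR]", head[1:6]):
        return []
    sites = []
    for match in AXB.finditer(head):
        upstream = head[:match.start()]
        if HYDROPHOBIC_RUN.search(upstream):
            sites.append(match.start() + 3)  # cleavage follows residue B
    return sites
```

The motif alone is permissive and usually yields several candidates per protein, so in practice it serves as a filter; an observed peptide whose N-terminus starts exactly at a candidate site is what supports a cleavage call.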
Future
• Rinse and repeat
• 30 proteomes in 3 years
• Stable, robust pipeline for general use
  – Hosted at TeraGrid
Organism          Novel   New Start
Y. pestis             4           5
B. anthracis          4           6
D. radiodurans      225         117
D. vulgaris          55          89
L. interrogans       20          23
When Gene Predictors Fail
• Are GC extremes difficult?
  – 50% (Y. pestis): 4 missed
  – 30s (B. anthracis, L. interrogans): 4, 20 missed
  – 60s (D. vulgaris, D. radiodurans): 55, 225 missed
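For reference, the GC percentages quoted above are a simple base-composition statistic. A minimal sketch, assuming the genome is available as a plain DNA string and that ambiguous bases such as N are ignored (the function name is hypothetical):

```python
def gc_fraction(dna):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    dna = dna.upper()
    gc = dna.count("G") + dna.count("C")
    at = dna.count("A") + dna.count("T")
    total = gc + at
    return gc / total if total else 0.0
```

Note that B. anthracis (GC in the 30s) has as few missed genes as Y. pestis, which is why the slide poses GC content as a question.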
We See What We Know
• Proximity to Model Organism
Yersinia/Bacillus errors: 4/4
‘Remote species’ errors: 20, 55, >200
We See What We Know
• Hypothetical vs. Named
  – Compare novel genes to observed proteome
  – Hypergeometric test, where the null probability is taken from the observed proteome
Organism          Hypothetical   Named   p-value
B. anthracis                 3       1   0.018
L. interrogans              12       8   0.018
D. radiodurans              31       8   10^-10
D. vulgaris                 39      16   10^-14
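A sketch of the kind of test described above: given how many proteins in the observed proteome are hypothetical, how surprising is the hypothetical share among the novel genes? The slides do not give the proteome counts, so the numbers in the usage example are placeholders and will not reproduce the p-values in the table; the function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for a hypergeometric draw without replacement.

    population: proteins in the observed proteome
    successes:  how many of those are annotated as hypothetical
    draws:      number of novel genes compared
    k:          how many of the novel genes match hypothetical proteins
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# Usage with PLACEHOLDER proteome counts (not taken from the slides):
# 1000 observed proteins, 150 of them hypothetical; 3 of 4 novel genes
# hypothetical, as in the B. anthracis row.
p_value = hypergeom_upper_tail(k=3, population=1000, successes=150, draws=4)
```

A small p-value indicates that novel genes are enriched for hypothetical proteins relative to the observed proteome, which is the "we see what we know" effect.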
Expressed Protein Resource
• Protein Sequences
  – >30 M sequences
  – Sources: nr, UniProt, JCVI metagenomics, JGI genomes
• 40,000 clusters
• Cross-referenced with proteomics, for validated proteins