Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Was T. rex Just a Big Chicken?
Computational Proteomics
Phillip Compeau and Pavel Pevzner
adjusted by Jovana Kovačević
Bioinformatics Algorithms: an Active Learning Approach
© 2015 by Compeau and Pevzner. All rights reserved.
Was T. rex Just a Big Chicken?
• Paleontology Meets Computing
• Decoding an Ideal Spectrum
• From Ideal to Real Spectra
• Peptide Sequencing
• Peptide Identification
• Spectral Dictionaries
T. Rex and Chicken Collagens Are Nearly
Identical!
T. rex and chicken
collagens are nearly identical!
Scientists Sequence Collagens from T. Rex!
Jack Horner
discovers a T. rex
femur fossil in
Montana (2000)
Mary
Schweitzer
demineralizes it
John Asara
generates
spectra and
decodes them
(2007)
Frederick Sanger’s Two Nobel Prizes
GIVEECCA!
GIVEECCASV!
GIVEECCASVC!
GIVEECCASVCSL!
GIVEECCASVCSLY!
SLYELEDYC!
ELEDY!
ELEDYCD!
LEDYCD!
EDYCD!
FVDEHLCG!
FVDEHLCGSHL!
HLCGSHL!
SHLVEA !
VEALY!
YLVCG!
LVCGERGF!
LVCGERGFF!
GFFYTPK!
YTPKA!
GIVECCASVCSLYELEDYCDFVDEHLCGSHLVEALYLVCGERGFFFYTPKA!
1958: protein
sequencing
1977: DNA
sequencing
1958: protein sequencing difficult, DNA sequencing
impossible
Today: protein sequencing difficult, DNA sequencing trivial
Multiple identical
copies of a genome
AGAATATCASequence the reads
Shatter the genome
into reads
Assemble the
genome using
overlapping reads
...TGAGAATATCA...
AGAATATCA
GAGAATATC
TGAGAATAT
GAGAATATCTGAGAATAT
Sequencing Proteins Today
• Putative proteome– If we know a genome, we can predict all the genes
that the genome encodes
– Translating the predicted genes leads us to putative proteome (set of all proteins encoded by genome)
– But how can we determine whether a protein is syntetized in a specific tissue?
• Peptide identification– In practice, merely confirming that 10aa long peptide
from a known protein is present in a sample confirms that the sample contains this protein
• Peptide sequencing– Inferring amino acid sequence of a peptide without
relying on a proteome
– Used in situations when proteome is unknown
Sequencing Proteins with Mass
Spectrometry• Most mass spectrometers can only
measure masses of rather short peptides
(e.g. < 30-40 amino acids). To bypass this
limitation:
– Proteases (e.g., trypsin) break proteins
into short peptides.
– A mass spectrometer breaks these
peptides into charged fragment ions
and measures the mass/charge ratio*
and intensity of each ion.
How do we reconstruct the peptide from the
collection of mass/charge ratios?
* For simplicity, we assume that all masses are integers and all charges
are 1
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
200 1200400 600 800 1000 m/z
Which Peptide Generated This Spectrum?
200 400 600 800 1000 120000
Intensity
100
mass/charge
Was T. rex Just a Big Chicken?
• Paleontology Meets Computing
• Decoding an Ideal Spectrum
• From Ideal to Real Spectra
• Peptide Sequencing
• Peptide Identification
• Spectral Dictionaries
• The Ostrich Hemoglobin Riddle
• Searching for Post-Translational Modifications
• Spectral Alignment Algorithm
Prefix and Suffix Peptides
503
574
400
285
156
prefix
masses
suffix
masses
71
174
289
418
0
129 156 115 103 71
Reconstructing a Peptide from Prefix/Suffix
Masses
503
574
400
285
156
prefix
masses
suffix
masses
71
174
289
418
0
503
574
400
285
156
71
174
289
418
0
Reconstructing a Peptide from Prefix/Suffix
Masses
Ideal Spectrum: Collection of all prefix
and suffix masses of a peptide.
Note: we don’t know which masses
correspond to prefixes and which
masses correspond to suffixes.
Peptide explains Spectrum if
IdealSpectrum(Peptide) = Spectrum.
IdealSpectrum(REDCA):
0 71 156 174 285 289 400 418 503 574
Spectrum 0 71 156 174 285 289 400 418 503 574
Graph(Spectrum)
Decoding an Ideal Spectrum Problem:
Reconstruct a peptide from its ideal spectrum.
• Input: A collection of integers Spectrum.
• Output: An amino acid string Peptide that
explains Spectrum.
Reconstructing a Peptide from an Ideal Spectrum
0 71 156 574503418400289285174
Graph(Spectrum)
Decoding an Ideal Spectrum Problem:
Reconstruct a peptide from its ideal spectrum.
• Input: A collection of integers Spectrum.
• Output: An amino acid string Peptide that
explains Spectrum.
0 71 156 574503418400289285174
D
Nodes: masses in the spectrum
Edges: connect node i to node j if j - i is the mass of an
amino acid a. Label this edge by a.
Reconstructing a Peptide from an Ideal Spectrum
Graph(Spectrum)
Decoding an Ideal Spectrum Problem:
Reconstruct a peptide from its ideal spectrum.
• Input: A collection of integers Spectrum.
• Output: An amino acid string Peptide that
explains Spectrum.
0 71 156 574503418400289285174
R E D C
C D E R
A
A
Nodes: masses in the spectrum
Edges: connect node i to node j if j - i is the mass of an
amino acid a. Label this edge by a.
Reconstructing a Peptide from an Ideal Spectrum
DecodingIdealSpectrum Algorithm
DecodingIdealSpectrum(Spectrum)
construct Graph(Spectrum)
find a path Path from source to sink in Graph(Spectrum)
return amino acid string spelled by labels of Path
Spectrum 0 71 156 174 285 289 400 418 503 574
Graph(Spectrum) 0 71 156 574503418400289285174
R E D C
C D E R
A
A
Does This Approach Work for All Spectra?
DecodingIdealSpectrum(Spectrum)
construct Graph(Spectrum)
find a path Path from source to sink in Graph(Spectrum)
return amino acid string spelled by labels of Path
Spectrum 0 57 114 128 215 229 316 330 387 444
Graph(Spectrum) 0 57 114 444387330316229215128
G
N
G S S G G
K/Q
A
D D K/Q
T T T T A N
Does This Approach Work for All Spectra?
DecodingIdealSpectrum(Spectrum)
construct Graph(Spectrum)
find a path Path from source to sink in Graph(Spectrum)
return amino acid string spelled by labels of Path
Spectrum 0 57 114 128 215 229 316 330 387 444
Graph(Spectrum) 0 57 114 444387330316229215128
G
N
G S S G G
K/Q
A
D D K/Q
T T T T A N
IdealSpectrum(NTTAG) ≠ Spectrum!
Correcting DecodingIdealSpectrum
Graph(Spectrum)
Spectrum 0 57 114 128 215 229 316 330 387 444
0 57 114 444387330316229215128
G
N
G S S G G
K/Q
A
D D K/Q
T T T T A N
IdealSpectrum(GGDTN) = Spectrum
DecodingIdealSpectrum(Spectrum)
construct Graph(Spectrum)
for each path Path from source to sink in
Graph(Spectrum)
Peptide ← amino acid string spelled by labels of Path
if IdealSpectrum(Peptide) = Spectrum
return Peptide
Correcting DecodingIdealSpectrum
IdealSpectrum(GGDTN) = Spectrum
DecodingIdealSpectrum(Spectrum)
construct Graph(Spectrum)
for each path Path from source to sink in
Graph(Spectrum)
Peptide ← amino acid string spelled by labels of Path
if IdealSpectrum(Peptide) = Spectrum
return Peptide
• Not efficient algorithm, may be exponential in the number of nodes (= number of masses in the spectrum)
Was T. rex Just a Big Chicken?
• Paleontology Meets Computing
• Decoding an Ideal Spectrum
• From Ideal to Real Spectra
• Peptide Sequencing
• Peptide Identification
• Spectral Dictionariesy4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z
From Ideal to Real Spectra
Decoding a (Real) Spectrum Problem:
Reconstruct a peptide from its spectrum.
• Input: A collection of integers Spectrum.
• Output: An amino acid string Peptide that
explains Spectrum the best (among all possible
a.a. strings).
0 71 99 156 180 196 228 285 289 320 400 421 503 574
Real spectra have both false and missing masses.
0 71 156 174 285 289 400 418 503 574
Ideal Spectrum of REDCA
Real
Spectrum
From Ideal to Real Spectra
Decoding a (Real) Spectrum Problem:
Reconstruct a peptide from its spectrum.
• Input: A collection of integers Spectrum.
• Output: An amino acid string Peptide that
explains Spectrum the best (among all possible
a.a. strings).
0 71 99 156 180 196 228 285 289 320 400 421 503 574
Real spectra have both false and missing masses.
0 71 156 174 285 289 400 418 503 574
Ideal Spectrum of REDCA
Real
Spectrum
Which Peptide Generated This Spectrum?
Intensity
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
200 1200400 600 800 1000 m/z200 400 600 800 1000 12000
0
100
mass/charge
DinosaurSpectru
m
• Once the peptide is known, how can we measure how well a peptide explains a spectrum?
y4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z
Annotating a Spectrum
DinosaurSpectru
m
Intensity
200 400 600 800 1000 12000
0
100
Suffix peptide of length
3 (denoted as y3)
Prefix peptide of length
10 (denoted as b10)
• Once we infer the peptide that generated a given spectrum,we can annotate the spectrum by establishing correspondencebetween peaks in the spectrum and prefixes/suffixes of thepeptide
y4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z
Shared Peak Count
DinosaurSpectru
m
Intensity
200 400 600 800 1000 12000
0
100
GLVGAPCLRGLPGK annotates b10, b11, b13, y3, y4, y12 (Shared Peak Count =
6)
• Shared Peak Count – the number of peaks annotated bypeptide
y4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z
Another Candidate Peptide
DinosaurSpectru
m
Intensity
200 400 600 800 1000 12000
0
100
GLVGAPCLRGLPGK annotates b10, b11, b13, y3, y4, y12 (Shared Peak Count =
6)
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
200 1200400 600 800 1000 m/z
y6y5
b3
b6
b9
y7
y8
y2
y4
y3
A T K I V D C F M T Y
b1
y10
b2
y9
b3
y8
b4
y7
b5
y6
b6
y5
b7
y4
b8
y3
b9
y2
b10
y1
0
100
Intensity
200 400 600 800 1000 12000
ATKIVDCFMTY annotates b3, b6, b9, y2, y3, y4, y5, y6, y7, y8 (Shared Peak Count = 10)
y4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z
Another Candidate Peptide
DinosaurSpectru
m
Intensity
200 400 600 800 1000 12000
0
100
GLVGAPCLRGLPGK annotates b10, b11, b13, y3, y4, y12 (Shared Peak Count =
6)
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
200 1200400 600 800 1000 m/z
y6y5
b3
b6
b9
y7
y8
y2
y4
y3
A T K I V D C F M T Y
b1
y10
b2
y9
b3
y8
b4
y7
b5
y6
b6
y5
b7
y4
b8
y3
b9
y2
b10
y1
0
100
Intensity
200 400 600 800 1000 12000
ATKIVDCFMTY annotates b3, b6, b9, y2, y3, y4, y5, y6, y7, y8 (Shared Peak Count = 10)
y4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z
DinosaurSpectru
m
Intensity
200 400 600 800 1000 12000
0
100
How Should We Score an Annotated Spectrum?
Shared Peak
Count?
Sum of intensities
of explained peaks?
ignores
intensities
large peaks may
dominate the score
Idea: probabilistic model of spectra so that large peaks
contribute to the score but do not dominate it.
y4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z
DinosaurSpectru
m
Intensity
200 400 600 800 1000 12000
0
100
Transform the spectrum of mass m into a spectral
vector
s1, …,si, …, sm
The value si (amplitude) approximates the likelihood
that mass i is the prefix mass of an (unknown!)
peptide that generated the spectrum.
Spectral Vectors
R
71
Peptid
e
00…0100…0100…0100…0100…01156 bits 71 bits103 bits115 bits129 bits
peptide vector
Peptide
mass 156 129 115 103
E D C A
From a Peptide to a Peptide Vector
Converting a Peptide into a Peptide Vector
Problem. Convert a peptide into a peptide vector.
• Input: A string of amino acids Peptide.
• Output: The peptide vector of Peptide.
From a Peptide to a Peptide Vector
Converting a Peptide into a Peptide Vector
Problem. Convert a peptide into a peptide vector.
• Input: A string of amino acids Peptide.
• Output: The peptide vector of Peptide.
From a Peptide Vector to a Peptide
Converting a Peptide Vector into a Peptide
Problem. Convert a binary vector into a peptide.
• Input: A binary vector P.
• Output: A peptide whose peptide vector is equal
to P (if such a peptide exists).
From a Spectrum to a Spectral Vector
y4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z
DinosaurSpectru
m
(mass m)Intensity
200 400 600 800 1000 12000
0
100
+9 (amplitude) is not the intensity of this peak!
It is a likelihood that this peak will be annotated by a prefix
of an (unknown!) peptide that generated the spectrum.
+9amplitude
From a Spectrum to a Spectral Vector
y4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z
DinosaurSpectru
m
(mass m)Intensity
200 400 600 800 1000 12000
0
100
The larger the peak at mass i,
the larger the value (amplitude) si of the spectral
vector
s1........-5.........+3..........................+9...+7..............sm
an integer-valued vector with m
coordinates
spectral
vector
+7+9+3-5amplitude
Was T. rex Just a Big Chicken?
• Paleontology Meets Computing
• Decoding an Ideal Spectrum
• From Ideal to Real Spectra
• Peptide Sequencing
• Peptide Identification
• Spectral Dictionaries
Scoring Peptide against Spectrum
Score of Peptide against Spectrum is the dot
product of Peptide and Spectrum:
score(Peptide, Spectrum) = p1*s1+p2*s2+ …+pm*sm.
000…001000…001000…001000…001000…001Peptide
y4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++In
tensity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z
DinosaurSpectru
m
Intensity
200 400 600 800 1000 12000
0
100
s1..........-5…....+3..........................+9...+7..............sm
Spectrum ******************************************
Peptide Sequencing Problem
Peptide Sequencing Problem: Given a spectral
vector, find a peptide vector with maximum score
against this spectral vector.
• Input: A spectral vector Spectrum.
• Output: An amino acid string Peptide that
maximizes
score(Peptide, Spectrum)
among all possible peptides.
Building a DAG from a Spectral Vector
1. For a spectral vector Spectrum=s1, … ,sm, construct
DAG(Spectrum) on nodes {0,1, …, m}
Building a DAG from a Spectral Vector
1. For a spectral vector Spectrum=s1, … ,sm, construct
DAG(Spectrum) on nodes {0,1, …, m}
2. Assign weight si to node i
33 2 10 0 0 -2 -3 -1 -7 5 -8 0 1 2 10 4 6 9 3 0
Building a DAG from a Spectral Vector
1. For a spectral vector Spectrum=s1, … ,sm, construct
DAG(Spectrum) on nodes {0,1, …, m}
2. Assign weight si to node i
3. Connect node i to node j if j - i is equal to the mass
of an amino acid
Toy alphabet: amino acids X and Z with masses 4 and 5
33 2 10 0 0 -2 -3 -1 -7 5 -8 0 1 2 10 4 6 9 3 0
X
Building a DAG from a Spectral Vector
1. For a spectral vector Spectrum=s1, … ,sm, construct
DAG(Spectrum) on nodes {0,1, …, m}
2. Assign weight si to node i
3. Connect node i to node j if j - i is equal to the mass
of an amino acid
Toy alphabet: amino acids X and Z with masses 4 and 5
33 2 10 0 0 -2 -3 -1 -7 5 -8 0 1 2 10 4 6 9 3 0
X
Z
Building a DAG from a Spectral Vector
1. For a spectral vector Spectrum=s1, … ,sm, construct
DAG(Spectrum) on nodes {0,1, …, m}
2. Assign weight si to node i
3. Connect node i to node j if j - i is equal to the mass
of an amino acid
Toy alphabet: amino acids X and Z with masses 4 and 5
X
Z
0 33 2 1 90 0 0 4 -2 -3 -1 -7 6 5 -8 0 3 1 2 1 0
Z
X X
Score(XZZXX, Spectrum) = 0 + 4 + 6 + 9 + 3 + 0 =
22
Peptides = Paths in DAG(Spectrum)
X
Z
0 33 2 1 90 0 0 4 -2 -3 -1 -7 6 5 -8 0 3 1 2 1 0
Z
X X
• Peptide: any path from source to sink in DAG(Spectrum).
• score(Peptide, Spectrum): sum of scores of nodes it visits.
• Peptide Sequencing Problem: finding a maximum-weight
path in a node-weighted DAG.
Score(XZZXX, Spectrum) = 0 + 4 + 6 + 9 + 3 + 0 =
22
Peptide Sequencing = Finding a Path in
DAG(Spectrum)
X
Z
0 33 2 1 90 0 0 4 -2 -3 -1 -7 6 5 -8 0 3 1 2 1 0
Z
X X
Peptide Sequencing Problem: Given a spectral
vector, find a peptide vector with maximum score
against this spectral vector.
• Input: A spectral vector Spectrum.
• Output: A maximum-weight path in DAG(Spectrum).
Score(XZZXX, Spectrum) = 0 + 4 + 6 + 9 + 3 + 0 =
22
STOP and Think: How do we find a maximum-weight path in
a node-weighted DAG?
Intensity
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
200 1200400 600 800 1000 m/z200 400 600 800 1000 12000
0
100
mass/charge
DinosaurSpectru
m
???????????
Generating Spectrum
from an (Unknown) Peptide
Intensity
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
200 1200400 600 800 1000 m/z200 400 600 800 1000 12000
0
100
mass/charge
DinosaurSpectru
m
???????????
Reconstructing Peptide from Spectrum
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
200 1200400 600 800 1000 m/z
y6y5
b3
b6
b9
y7
y8
y2
y4
y3
A T K I V D C F M T Y
b1
y10
b2
y9
b3
y8
b4
y7
b5
y6
b6
y5
b7
y4
b8
y3
b9
y2
b10
y1
De novo Reconstruction!
mass/charge
DinosaurSpectru
m
ATKIVDCFMTY
Intensity
0
100
200 400 600 800 1000 12000
But this highest scoring peptide is biologically
incorrect!
Scoring functions that reliably assign the highest
score to the biologically correct peptide remain
unknown...
Intensity
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
200 1200400 600 800 1000 m/z200 400 600 800 1000 12000
0
100
mass/charge
DinosaurSpectru
m
y4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z
…HKMPRSTATPKRMGGCTFSPCFTKRLMATSGLVGAPGLRGLPGKMGGCTFGTRACFGH…
The correct peptide may not score highest among all peptides,
but it typically scores highest among all peptides in the
proteome* * If the resulting score is sufficiently high
The highest-scoring peptide in Proteome
Imagine that You Know the Proteome…
Intensity
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
200 1200400 600 800 1000 m/z200 400 600 800 1000 12000
0
100
mass/charge
DinosaurSpectru
m
y4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z
…HKMPRSTATPKRMGGCTFSPCFTKRLMATSGLVGAPGLRGLPGKMGGCTFGTRACFGH…
The highest-scoring peptide in Proteome
Imagine that You Know the Proteome…
Peptide identification: reconstructing a peptide as
the highest-scoring peptide occurring in a
proteome.
All peptides from
Proteome
MDERHILNM, KLQWVCSDL,
PTYWASDL, ENQIKRSACVM,
TLACHGGEM, NGALPQWRT,
HLLERTKMNVV, GGPASSDA,
GGLITGMQSD, MQPLMNWE,
ALKIIMNVRT, AVGELTK,
HEWAILF, GHNLWAMNAC,
GVFGSVLRA, EKLNKAATYIN
WR
A
C
VG
E
K
DW
LP
T
L T
WR
A
C
VG
E
K
DW
LP
T
L T
AVGELTK
Peptide
Identificatio
nAll possible peptides (20n)
AAAAAAAA,AAAAAAAC,AAAAAAAD,AAAAAAAE,AA
AAAAAG,AAAAAAAF,AAAAAAAH,AAAAAAI,
AVGELTI, AVGELTK , AVGELTL, AVGELTM,
YYYYYYYS,YYYYYYYT,YYYYYYYV,YYYYYYYY
Peptide Sequencing vs. Peptide
Identification
Which approach is
faster?
Peptide
Sequencing
AVGELTK
Peptide
Sequencing
Peptide
Identificatio
nThe set of all peptides in Proteome is much smaller than the set of of all possible peptides.
However, peptide sequencing algorithms are much faster, even though their search space is much larger.
Peptide sequencing eliminates the time-consuming scan of Proteomeby modeling the problem as the Longest Path in a DAG Problem.
However, since the scoring function is imperfect, peptide sequencing remains inaccurate: state-of-the-art tools correctly reconstruct only 30% of spectra.
Peptide Sequencing vs. Peptide
Identification
Was T. rex Just a Big Chicken?
• Paleontology Meets Computing
• Decoding an Ideal Spectrum
• From Ideal to Real Spectra
• Peptide Sequencing
• Peptide Identification
• Spectral Dictionaries
Peptide Identification Problem: Find a peptide
from a proteome with maximum score against a
spectrum.
• Input: A spectral vector Spectrum and an amino
acid string Proteome.
• Output: An a.a. string Peptide that maximizes
score(Peptide, Spectrum)
among all substrings of Proteome.
STOP and Think: How can we possibly construct
the T. rex proteome?
The Peptide Identification Problem
• 90% of proteins making up
animal bones are collagens.
• Since collagens are often conserved across
species, collagens in T. rex were likely similar to
collagens in some present-day species.
Approximating the T. rex Proteome
• As a sanity check, Asara
compared the T. rex spectra
against the UniProt database
(≈ 200 million amino acids
from hundreds of species).
• Asara also included some mutated versions of
collagens from present-day species; we will call
the augmented database UniProt+.*
Approximating the T. rex Proteome
*concatenate all proteins in UniProt+ into a string Proteome for
simplicity
Most of the high-scoring peptides identified in
UniProt+ were chicken collagens, supporting the
hypothesis that birds evolved from dinosaurs.
Searching T. rex Spectra Against UniProt+
DinosaurPeptide = GLVGAPGLRGLPGK is only
one mutation away from a chicken collagen
peptide.
Searching T. rex Spectra Against UniProt+
But how can we be sure that DinosaurPeptide is
the correct interpretation of DinosaurSpectrum?
y4
y6
y10
V N V A D C G A E A L A R
b1
y12
b2
y11
b3
y10
b4
y9
b5
y8
b6
y7
b7
y6
b8
y5
b9
y4
b10
y3
b11
y2
b12
y1
[M+2H]2+ = 673.46
y3
y5
y11
y12
b3
b4
b5
b6
b7
b8b9
b10
b11
b12
y7
y8
y9
b2
y2
Inte
nsity (
%)
100
0
200 1200400 600 800 1000 m/z
y4
y12++
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
y3
y12
b10
b11
b13
G L V G A P G L R G L P G K
b1
y13
b2
y12
b3
y11
b4
y10
b5
y9
b6
y8
b7
y7
b8
y6
b9
y5
b10
y4
b11
y3
b12
y2
b13
y1
200 1200400 600 800 1000 m/z200 400 600 800 1000 1200 m/z00
Intensity
100
But billions of peptides not occurring in UniProt+
outscore DinosaurPeptide.
Statistical Significance of DinosaurPeptide
DinosaurPeptide is the highest scoring peptide for
DinosaurSpectrum among all peptides in UniProt+.
But billions of peptides not occurring in UniProt+
outscore DinosaurPeptide.
We need to develop a method for evaluating the
statistical significance of identified peptides.
STOP and Think: Does this concern you?
Statistical Significance of DinosaurPeptide
DinosaurPeptide is the highest scoring peptide for
DinosaurSpectrum among all peptides in UniProt+.
* If the resulting score is sufficiently high
Given a parameter threshold, a peptide Peptide and
a spectral vector Spectrum form a Peptide-
Spectrum Match (PSM) if:
• Peptide is a highest-scoring peptide against
Spectrum among all peptides in Proteome
• Score(Peptide, Spectrum) ≥ threshold
Peptide-Spectrum Matches (PSMs)
Given a parameter threshold, a peptide Peptide and
a spectral vector Spectrum form a Peptide-
Spectrum Match (PSM) if:
• Peptide is a highest-scoring peptide against
Spectrum among all peptides in Proteome
• Score(Peptide, Spectrum) ≥ threshold
PSMthreshold(Proteome, SpectralVectors): the set of
Peptide-Spectrum Matches (PSMs) resulting from a
set of SpectralVectors (for a given Proteome and
threshold).
Peptide-Spectrum Matches (PSMs)
PSM Search Problem: Identify all Peptide-
Spectrum Matches scoring above a threshold for a
set of spectra and a proteome.
• Input: A set SpectralVectors, an amino acid
string Proteome, and a score threshold
threshold.
• Output: The set of Peptide-Spectrum Matches
PSMthreshold(Proteome, SpectralVectors).
PSM Search Problem
Was T. rex Just a Big Chicken?
• Paleontology Meets Computing
• Decoding an Ideal Spectrum
• From Ideal to Real Spectra
• Peptide Sequencing
• Peptide Identification
• Spectral Dictionaries
STOP and Think: A PSM search of 1,000 spectra
from a human sample against the human proteome
results in 100 PSMs whose score surpassed a
threshold.
• What is the fraction of erroneous PSMs among
them?
Hint: Repeat the same experiment for a randomly
generated DecoyProteome of the same size as the
human proteome.
Decoy Proteome
If you identify 5 PSMs in DecoyProteome, then 5/100
of PSMs identified in the human proteome are
estimated to be correct.
False Discovery Rate
For the T. rex spectra, there are 27 PSMs in UniProt+
and only 1 PSM in DecoyProteome with score ≥ 100
(FDR =1/27= 3.7%)
STOP and Think: Have we found ≈27* T. rex
peptides?!
False discovery rate (FDR):
|PSMthreshold(DecoyProteome,SpectralVectors)|
|PSMthreshold(Proteome, SpectralVectors)|
Many of these PSM correspond to contaminants, e.g., keratin from human skin
How can we estimate the statistical significance of
an individual PSM?
The Monkey and the Typewriter
abagytegertoyhktyhkyrzaxujhotgemamaghtkmjytrabagytegertozhkoghk
yrzacatxujhotgemamaghtkdhairytdgbikemjytrcgtyyghjotfghtsybdkkpw
kfffldogjfiegbebgncnslkcfscnnclnscnscnsnovcsnovslvnsnvnvnsnvsvv
slnlnsvlnsnvnslnvnlsvnsnnsvnslvnscatlvslvslvlmbgjgaggeyjllfghlh
mhlhjjlhjlhabracadabraghytnlkprstyrhketryabcnccowcnchairmtdgwom
bikedmdppdtyhtgftxcjabcjwqbcoewbvcoewvbexovervhhddwdwqdhgyusjff
fgfghhhhhy…
The Monkey Can Spell!
abagytegertoyhktyhkyrzaxujhotgemamaghtkmjytrabagytegertozhkoghk
yrzacatxujhotgemamaghtkdhairytdgbikemjytrcgtyyghjotfghtsybdkkpw
kfffldogjfiegbebgncnslkcfscnnclnscnscnsnovcsnovslvnsnvnvnsnvsvv
slnlnsvlnsnvnslnvnlsvnsnnsvnslvnscatlvslvslvlmbgjgaggeyjllfghlh
mhlhjjlhjlhabracadabraghytnlkprstyrhketryabcnccowcnchairmtdgwom
bikedmdppdtyhtgftxcjabcjwqbcoewbvcoewvbexovervhhddwdwqdhgyusjff
fgfghhhhhy…
The
MonkeyDictionary
NEW EDITION
2,000 new words
even more nonsense
Expected Number of Strings from Dictionary
The Monkey and the Typewriter Problem: Find the expected
number of strings from dictionary appearing in a randomly
generated text.
• Input: A set of strings Dictionary and an integer n.
• Output: The expected number of strings from Dictionary that
appear in a randomly generated string of length n.
Expected Number of High-Scoring Peptides Problem: Find
the expected number of high-scoring peptides (against a given
spectrum) in a decoy proteome.
• Input: A Spectrum, an integer n, and a score threshold.
• Output: The expected number of peptides in a decoy
proteome of length n that score a least threshold against
Spectrum.
Expected Number of High-Scoring Peptides
The Monkey and the Typewriter Problem: Find the expected
number of strings from dictionary appearing in a randomly
generated text.
• Input: A set of strings Dictionary and an integer n.
• Output: The expected number of strings from Dictionary that
appear in a randomly generated string of length n.
STOP and Think: Are these problems equivalent?
Spectral DictionaryIn
tensity (
%)
100
0
[M+2H]2+ = 646.20
200 1200400 600 800 1000 m/z
Dictionarythreshold(Spectrum): the set of all peptides with score
at least threshold against Spectrum.
Expected Number of High-Scoring Peptides Problem: Find
the expected number of high-scoring peptides (against a given
spectrum) in a decoy proteome.
• Input: A Spectrum, an integer n, and a score threshold.
• Output: The expected number of peptides in a decoy
proteome of length n that score a least threshold against
Spectrum.
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
200 1200400 600 800 1000 m/z
Expected Number of High-Scoring Peptides Problem:
Find the expected number of high-scoring peptides (against a
given spectrum) in a decoy proteome.
• Input: A Spectrum, an integer n, and a score threshold.
• Output: The expected number of peptides from
Dictionarythreshold(Spectrum) occurring in a decoy proteome
of length n.
Spectral DictionaryDictionarythreshold(Spectrum): the set of all peptides with score
at least threshold against Spectrum.
Inte
nsity (
%)
100
0
[M+2H]2+ = 646.20
200 1200400 600 800 1000 m/z
Spectral DictionaryDictionarythreshold(Spectrum): the set of all peptides with score
at least threshold against Spectrum.
Expected Number of High-Scoring Peptides Problem:
Find the expected number of high-scoring peptides (against a
given spectrum) in a decoy proteome.
• Input: Peptides Dictionarythreshold(Spectrum) and an integer
n.
• Output: The expected number of strings from
Dictionarythreshold(Spectrum) occurring in a decoy proteome
of length n.
Expected Number of High-Scoring Peptides Problem:
Find the expected number of high-scoring peptides (against a
given spectrum) in a decoy proteome.
• Input: Peptides Dictionarythreshold(Spectrum) and an integer
n.
• Output: The expected number of strings from
Dictionarythreshold(Spectrum) occurring in a decoy proteome
of length n.
Spectral DictionaryDictionarythreshold(Spectrum): the set of all peptides with score
at least threshold against Spectrum.
Expected Number of Occurrences of Peptides
from Dictionary in DecoyProteome
• Probability that a string Peptide matches a string starting at
a given position in DecoyProteome:
Pr(Peptide) =1/20|Peptide|
Expected Number of Occurrences of Peptides
from Dictionary in DecoyProteome
• Probability that a string Peptide matches a string starting at
a given position in DecoyProteome:
Pr(Peptide) =1/20|Peptide|
• Exp. #times Peptide appears in DecoyProteome of length
n:
E(Peptide, n) ≈ n * Pr(Peptide) = n * 1/20|Peptide|
Expected Number of Occurrences of Peptides
from Dictionary in DecoyProteome
• Probability that a string Peptide matches a string starting at
a given position in DecoyProteome:
Pr(Peptide) =1/20|Peptide|
• Exp. #times Peptide appears in DecoyProteome of length
n:
E(Peptide, n) ≈ n * Pr(Peptide) = n * 1/20|Peptide|
• Exp. #times peptides from Dictionary appear in
DecoyProteome of length n:
E(Dictionary, n) ≈ n * (∑each Peptide in Dictionary 1/20|Peptide|)
= n * Pr(Dictionary)
How many peptides in DecoyUniprot+ are expected to score
at least -19 against DinosaurSpectrum, i.e., what is
E(Dictionary-19(DinosaurSpectrum, |UniProt+|)?
Probability of Spectral Dictionary
Probability of Spectral Dictionary Problem: Find the
probability of a spectral dictionary for a given spectrum and
score threshold.
• Input: A spectral vector Spectrum and a score threshold
threshold.
• Output: The probability of Dictionarythreshold(Spectrum).
Probability and Size of Spectral Dictionary
Size of Spectral Dictionary Problem: Find the size of a
spectral dictionary for a given spectrum and score threshold.
• Input: A spectral vector Spectrum and a score threshold
threshold.
• Output: The size of Dictionarythreshold(Spectrum).
Probability of Spectral Dictionary Problem: Find the
probability of a spectral dictionary for a given spectrum and
score threshold.
• Input: A spectral vector Spectrum and a score threshold
threshold.
• Output: The probability of Dictionarythreshold(Spectrum).
• Given a spectral vector s = s1…si…sn
• size(i, t): #peptides matching i-prefix s1…si with
score t
• sizea(i, t): #peptides matching i-prefix s1…si with score
t and ending in amino acid a:
• Removing the last amino acid a from a peptide results in
a shorter peptide with mass i - |a| and score t - si:
• Initialization: size(0, 0) = 1, size(i, t) = 0 for i < 0
Computing the Size of a Spectral
Dictionary
size(i, t) = Σ all amino acids a sizea(i, t)
size(i, t) = Σ all amino acids a sizea(i, t)
= Σ all amino acids a size(i - |a|,t - si)
• Given Spectrum=s1…sm, construct DAG(Spectrum)
on nodes 0,…, m with weight of node i equal to si .
Computing the Size of a Spectral
Dictionary
Amino acids X and Z with respective masses 4 and 5.
X
Z
00001100010002Spectrum
00001100010002
Computing the Size of a Spectral
Dictionary
X
Z
• Given Spectrum=s1…sm, construct DAG(Spectrum)
on nodes 0,…, m with weight of node i equal to si .
• a path from source to sink spells out a peptide.
XXZ
00001100010002
Computing the Size of a Spectral Dictionary
Score(XXZ,Spectrum) = 0 + 1 + 0 + 2 = 3
X
Z
• Given Spectrum=s1…sm, construct DAG(Spectrum)
on nodes 0,…, m with weight of node i equal to si .
• a path from source to sink corresponds to a
peptide.
• sum of weights of nodes on path = score of
PSM.
00001100010002
Computing the Size of a Spectral
Dictionary
X
Z
Score(XZX,Spectrum) = 0 + 1 + 1 + 2 = 4
• Given Spectrum=s1…sm, construct DAG(Spectrum)
on nodes 0,…, m with weight of node i equal to si .
• a path from source to sink corresponds to a
peptide.
• sum of weights of nodes on path = score of
PSM.
00001100010002
Computing size(i, t)
t=0 1 0 0 0
t=1 0 0 0 0
t=2 0 0 0 0
t=3 0 0 0 0
t=4 0 0 0 0
X
Z
00001100010002
size(i, t)=Σ all amino acids a size(i - |a|,t - si)
t=0 1 0 0 0 0
t=1 0 0 0 0 1
t=2 0 0 0 0 0
t=3 0 0 0 0 0
t=4 0 0 0 0 0
X
Z
00001100010002
size(i, t)=Σ all amino acids a size(i - |a|,t - 1)
t=0 1 0 0 0 0
t=1 0 0 0 0 1
t=2 0 0 0 0 0
t=3 0 0 0 0 0
t=4 0 0 0 0 0
X
Z
00001100010002
t=0 1 0 0 0 0 0
t=1 0 0 0 0 1 1
t=2 0 0 0 0 0 0
t=3 0 0 0 0 0 0
t=4 0 0 0 0 0 0
X
Z
size(i, t)=Σ all amino acids a size(i - |a|,t - si)
00001100010002
t=0 1 0 0 0 0 0 0
t=1 0 0 0 0 1 1 0
t=2 0 0 0 0 0 0 0
t=3 0 0 0 0 0 0 0
t=4 0 0 0 0 0 0 0
X
Z
size(i, t)=Σ all amino acids a size(i - |a|,t - si)
00001100010002
t=0 1 0 0 0 0 0 0
t=1 0 0 0 0 1 1 0
t=2 0 0 0 0 0 0 0
t=3 0 0 0 0 0 0 0
t=4 0 0 0 0 0 0 0
X
Z
size(i, t)=Σ all amino acids a size(i - |a|,t - 0)
00001100010002
t=0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
t=1 0 0 0 0 1 1 0 0 1 0 1 0 1 0
t=2 0 0 0 0 0 0 0 0 0 2 0 0 0 0
t=3 0 0 0 0 0 0 0 0 0 0 0 0 0 1
t=4 0 0 0 0 0 0 0 0 0 0 0 0 0 2
X
Z
size(i, t)=Σ all amino acids a size(i - |a|,t - si)
00001100010002
t=0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
t=1 0 0 0 0 1 1 0 0 1 0 1 0 1 0
t=2 0 0 0 0 0 0 0 0 0 2 0 0 0 0
t=3 0 0 0 0 0 0 0 0 0 0 0 0 0 1
t=4 0 0 0 0 0 0 0 0 0 0 0 0 0 2
X
Z
size(i, t)=Σ all amino acids a size(i - |a|,t - 2)
• Given a spectral vector s = s1…si…sn
• Pr(i, t): sum of probabilities of all peptides matching i-
prefix s1…si with score t
• Pra(i, t): sum of probabilities of all peptides matching i-
prefix s1…si with score t and ending in amino acid a:
• Removing the last amino acid a from results in a shorter
peptide with mass i – |a|, score t – si , and 20 times larger
probability:
Computing the Probability of a Spectral
Dictionary
Pr(i, t) = Σ all amino acids a Pra(i, t)
Pr(i, t) = Σ all amino acids a Pra(i, t)
= Σ all amino acids a Pr (i - |a|,t - si) / 20
00001100010002
size(i, t)=Σ all amino acids a size(i - |a|,t - si)
t=0 1 0 0 0 0
t=1 0 0 0 0 1
t=2 0 0 0 0 0
t=3 0 0 0 0 0
t=4 0 0 0 0 0
X
Z
00001100010002
Pr(i, t)=Σ all amino acids a Pr(i - |a|,t - si)/20
t=0 1 0 0 0 0
t=1 0 0 0 0 1
t=2 0 0 0 0 0
t=3 0 0 0 0 0
t=4 0 0 0 0 0
X
Z
1/20
Hint: Dictionary-19(DinosaurSpectrum) contains
219,136,251,374 peptides (!) and has probability
0.00018
STOP and Think: What is the statistical significance of
the PSM
(DinosaurPeptide, DinosaurSpectrum)
found in searches against the UniProt+ database of
length n ≈ 200 million amino acids?
Statistical Significance of the PSM
(DinosaurPeptide, DinosaurSpectrum)
Reminder: PSM (DinosaurPeptide, DinosaurSpectrum)
has score -19.
STOP and Think: How many PSMs with score at
least -19 do we expect to find in a decoy proteome
of the same size as UniProt+?
n * Pr(Dictionary-19(DinosaurSpectrum)) = 35,311
Statistical Significance of the PSM
(DinosaurPeptide, DinosaurSpectrum)
Finding DinosaurPeptide as an
interpretation of DinosaurSpectrum is
no more surprising than the monkey
typing “THE” after 200 million
attempts...
The
MonkeyDictionary
NEW EDITION
2,000 new words
even more nonsense
STOP and Think: How many PSMs with score at
least -19 do we expect to find in a decoy proteome
of the same size as UniProt+?
n * Pr(Dictionary-19(DinosaurSpectrum)) = 35,311
Statistical Significance of the PSM
(DinosaurPeptide, DinosaurSpectrum)
Finding DinosaurPeptide as an
interpretation of DinosaurSpectrum is
no more surprising than the monkey
typing “THE” after 200 million
attempts...
...which is not surprising at all!
The
MonkeyDictionary
NEW EDITION
2,000 new words
even more nonsense