Was T. rex Just a Big Chicken?

Computational Proteomics

Phillip Compeau and Pavel Pevzner

adjusted by Jovana Kovačević

Bioinformatics Algorithms: an Active Learning Approach

• Paleontology Meets Computing

• Decoding an Ideal Spectrum

• From Ideal to Real Spectra

• Peptide Sequencing

• Peptide Identification

• Spectral Dictionaries

T. rex and Chicken Collagens Are Nearly Identical!

Scientists Sequence Collagens from T. rex!

• Jack Horner discovers a T. rex femur fossil in Montana (2000)
• Schweitzer demineralizes it
• John Asara generates spectra and decodes them (2007)

Frederick Sanger’s Two Nobel Prizes

GIVEECCA
GIVEECCASV
GIVEECCASVC
GIVEECCASVCSL
GIVEECCASVCSLY
SLYELEDYC
ELEDY
ELEDYCD
LEDYCD
EDYCD
FVDEHLCG
FVDEHLCGSHL
HLCGSHL
SHLVEA
VEALY
YLVCG
LVCGERGF
LVCGERGFF
GFFYTPK
YTPKA

Assembled from the overlapping fragments:
GIVEECCASVCSLYELEDYCDFVDEHLCGSHLVEALYLVCGERGFFYTPKA

1958: protein sequencing
1977: DNA sequencing

1958: protein sequencing difficult, DNA sequencing impossible
Today: protein sequencing difficult, DNA sequencing trivial

• Multiple identical copies of a genome
• Shatter the genome into reads
• Sequence the reads: AGAATATCA, GAGAATATC, TGAGAATAT
• Assemble the genome using overlapping reads: ...TGAGAATATCA...

Sequencing Proteins Today

• Putative proteome
– If we know a genome, we can predict all the genes that the genome encodes
– Translating the predicted genes gives us the putative proteome (the set of all proteins encoded by the genome)
– But how can we determine whether a protein is synthesized in a specific tissue?

• Peptide identification
– In practice, merely confirming that a 10 aa long peptide from a known protein is present in a sample confirms that the sample contains this protein

• Peptide sequencing
– Inferring the amino acid sequence of a peptide without relying on a proteome
– Used in situations when the proteome is unknown

Sequencing Proteins with Mass Spectrometry

• Most mass spectrometers can only measure masses of rather short peptides (e.g., fewer than 30-40 amino acids). To bypass this limitation:
– Proteases (e.g., trypsin) break proteins into short peptides.
– A mass spectrometer breaks these peptides into charged fragment ions and measures the mass/charge ratio* and intensity of each ion.

How do we reconstruct the peptide from the collection of mass/charge ratios?

* For simplicity, we assume that all masses are integers and all charges

Which Peptide Generated This Spectrum?

[Figure: spectrum with [M+2H]2+ = 646.20; x-axis: mass/charge (m/z), y-axis: intensity]

• The Ostrich Hemoglobin Riddle

• Searching for Post-Translational Modifications

• Spectral Alignment Algorithm

Prefix and Suffix Peptides

[Figure: amino acid masses 156 129 115 103 71 (R E D C A), annotated with the prefix masses and suffix masses of the peptide]

Reconstructing a Peptide from Prefix/Suffix Masses

Ideal Spectrum: the collection of all prefix and suffix masses of a peptide.

Note: we don’t know which masses correspond to prefixes and which masses correspond to suffixes.

Peptide explains Spectrum if IdealSpectrum(Peptide) = Spectrum.

IdealSpectrum(REDCA): 0 71 156 174 285 289 400 418 503 574
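The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the integer amino acid masses used throughout the deck; the names `MASS` and `ideal_spectrum` are introduced here for illustration and are not from the slides.

```python
# Integer masses of the 20 standard amino acids (I/L and K/Q coincide).
MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def ideal_spectrum(peptide):
    """Return 0, every proper prefix and suffix mass, and the total mass, sorted."""
    masses = [MASS[aa] for aa in peptide]
    total = sum(masses)
    spectrum = [0, total]
    prefix = 0
    for mass in masses[:-1]:
        prefix += mass
        spectrum.append(prefix)          # mass of the prefix ending here
        spectrum.append(total - prefix)  # mass of the complementary suffix
    return sorted(spectrum)


print(ideal_spectrum("REDCA"))
```

The result is kept as a sorted list rather than a set, so a mass that arises both as a prefix and as a suffix appears twice.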

Decoding an Ideal Spectrum Problem: Reconstruct a peptide from its ideal spectrum.
• Input: A collection of integers Spectrum.
• Output: An amino acid string Peptide that explains Spectrum.

Reconstructing a Peptide from an Ideal Spectrum

Spectrum: 0 71 156 174 285 289 400 418 503 574

Graph(Spectrum)
• Nodes: masses in the spectrum.
• Edges: connect node i to node j if j - i is the mass of an amino acid a. Label this edge by a.

[Figure: Graph(Spectrum) on nodes 0 71 156 174 285 289 400 418 503 574, with edge labels R E D C along one path and C D E R along the other]

DecodingIdealSpectrum Algorithm

DecodingIdealSpectrum(Spectrum)
    construct Graph(Spectrum)
    find a path Path from source to sink in Graph(Spectrum)
    return amino acid string spelled by labels of Path

Spectrum: 0 71 156 174 285 289 400 418 503 574

[Figure: Graph(Spectrum) with edge labels R E D C along one path and C D E R along the other]

Does This Approach Work for All Spectra?

Spectrum: 0 57 114 128 215 229 316 330 387 444

[Figure: Graph(Spectrum) with edges labeled G, S, D, K/Q, T, A, N]

The path spelling GGDTN leads from source to sink, but IdealSpectrum(GGDTN) ≠ Spectrum! (IdealSpectrum(GGDTN) contains 114 and 330 twice and lacks 128 and 316.)

Correcting DecodingIdealSpectrum

for each path Path from source to sink in Graph(Spectrum)
    Peptide ← amino acid string spelled by labels of Path
    if IdealSpectrum(Peptide) = Spectrum
        return Peptide

IdealSpectrum(NTTAG) = Spectrum

• This algorithm is not efficient: the number of paths may be exponential in the number of nodes (= number of masses in the spectrum).
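The corrected procedure can be sketched as a depth-first enumeration of source-to-sink paths that returns the first peptide whose ideal spectrum equals the input. This is a minimal sketch under the integer-mass assumption; `MASS`, `ideal_spectrum`, and `decode_ideal_spectrum` are illustrative names, and amino acids with equal mass (I/L, K/Q) are represented by a single letter.

```python
MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def ideal_spectrum(peptide):
    """Return 0, every proper prefix and suffix mass, and the total mass, sorted."""
    masses = [MASS[aa] for aa in peptide]
    total = sum(masses)
    spectrum = [0, total]
    prefix = 0
    for mass in masses[:-1]:
        prefix += mass
        spectrum.extend([prefix, total - prefix])
    return sorted(spectrum)


def decode_ideal_spectrum(spectrum):
    """Try every source-to-sink path; return a peptide that explains spectrum, or None."""
    target = sorted(spectrum)
    nodes = set(spectrum)
    sink = max(nodes)
    # One representative letter per distinct mass (first letter wins).
    mass_to_aa = {}
    for aa, mass in MASS.items():
        mass_to_aa.setdefault(mass, aa)

    def extend(node, peptide):
        if node == sink:
            return peptide if ideal_spectrum(peptide) == target else None
        for mass, aa in mass_to_aa.items():
            if node + mass in nodes:  # edge node -> node + mass, labeled aa
                found = extend(node + mass, peptide + aa)
                if found is not None:
                    return found
        return None

    return extend(0, "")


print(decode_ideal_spectrum([0, 57, 114, 128, 215, 229, 316, 330, 387, 444]))
```

Because every path may have to be examined, the running time can grow exponentially with the number of masses, which is the inefficiency noted above.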

[Figure: two annotated spectra, for VNVADCGAEALAR ([M+2H]2+ = 673.46) and GLVGAPGLRGLPGK ([M+2H]2+ = 646.20); x-axis: m/z, y-axis: intensity]

From Ideal to Real Spectra

Real spectra have both false and missing masses.

Ideal Spectrum of REDCA: 0 71 156 174 285 289 400 418 503 574
Spectrum: 0 71 99 156 180 196 228 285 289 320 400 421 503 574

Decoding a (Real) Spectrum Problem: Reconstruct a peptide from its spectrum.
• Input: A collection of integers Spectrum.
• Output: An amino acid string Peptide that explains Spectrum the best (among all possible a.a. strings).
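To make "false" and "missing" concrete, the two spectra on this slide can be compared with set differences. This is a small sketch; `compare_spectra` is an illustrative name, not something defined in the slides.

```python
def compare_spectra(real, ideal):
    """Return (false masses, missing masses) of a real spectrum relative to an ideal one."""
    real_set, ideal_set = set(real), set(ideal)
    false_masses = sorted(real_set - ideal_set)    # observed but not predicted
    missing_masses = sorted(ideal_set - real_set)  # predicted but not observed
    return false_masses, missing_masses


IDEAL_REDCA = [0, 71, 156, 174, 285, 289, 400, 418, 503, 574]
REAL = [0, 71, 99, 156, 180, 196, 228, 285, 289, 320, 400, 421, 503, 574]

false_masses, missing_masses = compare_spectra(REAL, IDEAL_REDCA)
print("false:", false_masses)
print("missing:", missing_masses)
```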

Which Peptide Generated This Spectrum?

[Figure: DinosaurSpectrum, [M+2H]2+ = 646.20; x-axis: mass/charge (m/z), y-axis: intensity]

• Once the peptide is known, how can we measure how well the peptide explains the spectrum?

Annotating a Spectrum

• Once we infer the peptide that generated a given spectrum, we can annotate the spectrum by establishing a correspondence between peaks in the spectrum and prefixes/suffixes of the peptide.
• A prefix peptide of length 10 is denoted b10; a suffix peptide of length 3 is denoted y3.

[Figure: DinosaurSpectrum with peaks labeled b10 and y3]

Shared Peak Count

• Shared Peak Count: the number of peaks annotated by a peptide.
• GLVGAPGLRGLPGK annotates b10, b11, b13, y3, y4, y12 (Shared Peak Count = 6).
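As a rough illustration of annotation and the shared peak count, the sketch below works in the simplified integer-mass model of the earlier slides, where a peak is annotated when its mass equals a prefix (b) or suffix (y) mass of the peptide. Real b/y ions carry additional mass offsets that are ignored here, and the names `MASS` and `annotate` are assumptions made for the example. It uses the REDCA spectrum from the earlier slide rather than DinosaurSpectrum, whose peak list is not given in the text.

```python
MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def annotate(peptide, spectrum):
    """Return (sorted b/y labels, number of distinct annotated peaks)."""
    peaks = set(spectrum)
    masses = [MASS[aa] for aa in peptide]
    total = sum(masses)
    labels, matched = [], set()
    prefix = 0
    for k, mass in enumerate(masses[:-1], start=1):
        prefix += mass
        suffix = total - prefix
        if prefix in peaks:   # prefix of length k
            labels.append(f"b{k}")
            matched.add(prefix)
        if suffix in peaks:   # suffix of length len(peptide) - k
            labels.append(f"y{len(masses) - k}")
            matched.add(suffix)
    return sorted(labels), len(matched)


REAL = [0, 71, 99, 156, 180, 196, 228, 285, 289, 320, 400, 421, 503, 574]
print(annotate("REDCA", REAL))
```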

Another Candidate Peptide

• ATKIVDCFMTY annotates b3, b6, b9, y2, y3, y4, y5, y6, y7, y8 (Shared Peak Count = 10).

How Should We Score an Annotated Spectrum?

• Shared Peak Count? It ignores intensities.
• Sum of intensities of explained peaks? Large peaks may dominate the score.

Idea: a probabilistic model of spectra so that large peaks contribute to the score but do not dominate it.

Spectral Vectors

Transform the spectrum of mass m into a spectral vector s1, …, si, …, sm.

The value si (amplitude) approximates the likelihood that mass i is the prefix mass of an (unknown!) peptide that generated the spectrum.

Peptide:        R E D C A
Mass:           156 129 115 103 71
Peptide vector: 00…01 (156 bits) 00…01 (129 bits) 00…01 (115 bits) 00…01 (103 bits) 00…01 (71 bits)

From a Peptide to a Peptide Vector

Converting a Peptide into a Peptide Vector Problem: Convert a peptide into a peptide vector.
• Input: A string of amino acids Peptide.
• Output: The peptide vector of Peptide.

From a Peptide Vector to a Peptide

Converting a Peptide Vector into a Peptide Problem: Convert a binary vector into a peptide.
• Input: A binary vector P.
• Output: A peptide whose peptide vector is equal to P (if such a peptide exists).
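Both conversions can be sketched directly from the picture of the peptide vector: each amino acid of mass k contributes k - 1 zeros followed by a 1. The function names below are illustrative, integer masses are assumed, and amino acids of equal mass (I/L, K/Q) are decoded to a single representative letter.

```python
MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def peptide_to_vector(peptide):
    """Binary vector with a 1 at every prefix mass of peptide (1-based positions)."""
    vector = []
    for aa in peptide:
        vector.extend([0] * (MASS[aa] - 1) + [1])
    return vector


def vector_to_peptide(vector):
    """Inverse conversion; returns None if no peptide has this peptide vector."""
    mass_to_aa = {}
    for aa, mass in MASS.items():
        mass_to_aa.setdefault(mass, aa)  # first letter wins for equal masses
    peptide = []
    previous = 0
    for position, bit in enumerate(vector, start=1):
        if bit == 1:
            aa = mass_to_aa.get(position - previous)
            if aa is None:       # gap between 1s is not an amino acid mass
                return None
            peptide.append(aa)
            previous = position
    if previous != len(vector):  # the last coordinate must be a 1
        return None
    return "".join(peptide)


vector = peptide_to_vector("REDCA")
print(len(vector), sum(vector), vector_to_peptide(vector))
```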

From a Spectrum to a Spectral Vector

[Figure: DinosaurSpectrum (mass m) above its spectral vector; peaks are labeled with amplitudes -5, +3, +9, +7]

• The larger the peak at mass i, the larger the value (amplitude) si of the spectral vector.
• The amplitude (e.g., +9) is not the intensity of the peak! It is a likelihood that this peak will be annotated by a prefix of an (unknown!) peptide that generated the spectrum.
• The spectral vector is an integer-valued vector with m coordinates:

s1 ... -5 ... +3 ... +9 ... +7 ... sm

Scoring Peptide against Spectrum

The score of Peptide against Spectrum is the dot product of the peptide vector (p1, …, pm) of Peptide and the spectral vector (s1, …, sm) of Spectrum:

score(Peptide, Spectrum) = p1*s1 + p2*s2 + … + pm*sm

Peptide:  000…001000…001000…001000…001000…001
Spectrum: s1 ... -5 ... +3 ... +9 ... +7 ... sm
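The dot-product score is short enough to state in code. The two vectors below are hypothetical values chosen only to show the arithmetic; they do not come from DinosaurSpectrum.

```python
def score(peptide_vector, spectral_vector):
    """Dot product of a binary peptide vector and an integer spectral vector."""
    if len(peptide_vector) != len(spectral_vector):
        raise ValueError("vectors must have the same length")
    return sum(p * s for p, s in zip(peptide_vector, spectral_vector))


peptide_vector = [0, 0, 1, 0, 1]      # hypothetical: prefix masses 3 and 5
spectral_vector = [4, -2, 3, -5, 9]   # hypothetical amplitudes s1..s5
print(score(peptide_vector, spectral_vector))  # 3 + 9 = 12
```

Only the amplitudes at the prefix masses of the peptide contribute, so the score is the sum of si over the positions where pi = 1.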

Peptide Sequencing Problem

Peptide Sequencing Problem: Given a spectral vector, find a peptide vector with maximum score against this spectral vector.
• Input: A spectral vector Spectrum.
• Output: An amino acid string Peptide that maximizes score(Peptide, Spectrum) among all possible peptides.

Building a DAG from a Spectral Vector

1. For a spectral vector Spectrum = s1, …, sm, construct DAG(Spectrum) on nodes {0, 1, …, m}.
2. Assign weight si to node i (the source, node 0, has weight 0).
3. Connect node i to node j if j - i is equal to the mass of an amino acid.

Toy alphabet: amino acids X and Z with masses 4 and 5.

Spectrum = 0 0 0 4 -2 -3 -1 -7 6 5 3 2 1 9 3 -8 0 3 1 2 1 0

Score(XZZXX, Spectrum) = 0 + 4 + 6 + 9 + 3 + 0 = 22

Peptides = Paths in DAG(Spectrum)

• Peptide: any path from source to sink in DAG(Spectrum).
• score(Peptide, Spectrum): sum of the weights of the nodes the path visits.
• Peptide Sequencing Problem: finding a maximum-weight path in a node-weighted DAG.

Peptide Sequencing = Finding a Path in DAG(Spectrum)

Peptide Sequencing Problem: Given a spectral vector, find a peptide vector with maximum score against this spectral vector.
• Input: A spectral vector Spectrum.
• Output: A maximum-weight path in DAG(Spectrum).

STOP and Think: How do we find a maximum-weight path in a node-weighted DAG?
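One standard answer is dynamic programming over the nodes in increasing order, since every edge goes from a smaller mass to a larger one. The sketch below uses the toy alphabet (X = 4, Z = 5) and the 22-coordinate toy spectral vector as read from the deck's figure; `sequence_peptide` and the variable names are illustrative.

```python
def sequence_peptide(spectral_vector, masses):
    """Maximum-weight source-to-sink path in DAG(Spectrum) by dynamic programming.

    spectral_vector holds s1..sm; masses maps amino acid letters to integer masses.
    Returns (peptide, score), or None if no peptide has total mass m.
    """
    m = len(spectral_vector)
    unreachable = float("-inf")
    best = [unreachable] * (m + 1)   # best[i]: best score of a path from 0 to i
    back = [None] * (m + 1)          # back[i]: (previous node, amino acid)
    best[0] = 0
    for i in range(1, m + 1):
        for aa, mass in masses.items():
            j = i - mass
            if j < 0 or best[j] == unreachable:
                continue
            candidate = best[j] + spectral_vector[i - 1]
            if candidate > best[i]:
                best[i] = candidate
                back[i] = (j, aa)
    if best[m] == unreachable:
        return None
    letters = []
    i = m
    while i > 0:                     # walk the backpointers from sink to source
        i, aa = back[i]
        letters.append(aa)
    return "".join(reversed(letters)), best[m]


TOY_MASSES = {"X": 4, "Z": 5}
TOY_SPECTRUM = [0, 0, 0, 4, -2, -3, -1, -7, 6, 5, 3, 2, 1, 9, 3, -8, 0, 3, 1, 2, 1, 0]
print(sequence_peptide(TOY_SPECTRUM, TOY_MASSES))
```

The running time is proportional to m times the number of amino acids, in contrast to the exponential path enumeration used for ideal spectra.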

Generating Spectrum from an (Unknown) Peptide

[Figure: an unknown peptide ??????????? generates DinosaurSpectrum; x-axis: mass/charge, y-axis: intensity]

Reconstructing Peptide from Spectrum

De novo Reconstruction! The highest-scoring peptide for DinosaurSpectrum is ATKIVDCFMTY.

But this highest scoring peptide is biologically incorrect!

Scoring functions that reliably assign the highest score to the biologically correct peptide remain unknown...

Imagine that You Know the Proteome…

…HKMPRSTATPKRMGGCTFSPCFTKRLMATSGLVGAPGLRGLPGKMGGCTFGTRACFGH…

The correct peptide may not score highest among all peptides, but it typically scores highest among all peptides in the proteome.*

* If the resulting score is sufficiently high

Peptide identification: reconstructing a peptide as the highest-scoring peptide occurring in a proteome.

Peptide Sequencing vs. Peptide Identification

Peptide Identification: all peptides from Proteome
MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN
→ AVGELTK

Peptide Sequencing: all possible peptides (20^n)
AAAAAAAA, AAAAAAAC, AAAAAAAD, AAAAAAAE, AAAAAAAG, AAAAAAAF, AAAAAAAH, AAAAAAAI, … AVGELTI, AVGELTK, AVGELTL, AVGELTM, … YYYYYYYS, YYYYYYYT, YYYYYYYV, YYYYYYYY
→ AVGELTK

Which approach is faster?

• The set of all peptides in Proteome is much smaller than the set of all possible peptides.
• However, peptide sequencing algorithms are much faster, even though their search space is much larger.
• Peptide sequencing eliminates the time-consuming scan of Proteome by modeling the problem as the Longest Path in a DAG Problem.
• However, since the scoring function is imperfect, peptide sequencing remains inaccurate: state-of-the-art tools correctly reconstruct only 30% of spectra.

Peptide Identification Problem: Find a peptide

from a proteome with maximum score against a

spectrum.

• Input: A spectral vector Spectrum and an amino

acid string Proteome.

• Output: An a.a. string Peptide that maximizes

score(Peptide, Spectrum)

among all substrings of Proteome.
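A brute-force sketch of peptide identification scans every start position in Proteome, extends the substring until its mass reaches the length m of the spectral vector, and scores the substrings whose mass is exactly m. The toy alphabet (X = 4, Z = 5) and toy spectral vector are taken from the peptide sequencing example; the proteome string "ZXZZXXZ" and all function names are hypothetical.

```python
def score(peptide, spectral_vector, masses):
    """Sum of the amplitudes at the prefix masses of peptide."""
    total = 0
    prefix = 0
    for aa in peptide:
        prefix += masses[aa]
        total += spectral_vector[prefix - 1]  # s_i is stored at index i - 1
    return total


def identify_peptide(spectral_vector, proteome, masses):
    """Highest-scoring substring of proteome with mass len(spectral_vector), or None."""
    m = len(spectral_vector)
    best = None
    for start in range(len(proteome)):
        mass = 0
        for end in range(start, len(proteome)):
            mass += masses[proteome[end]]
            if mass > m:
                break                 # longer substrings are even heavier
            if mass == m:
                candidate = proteome[start:end + 1]
                value = score(candidate, spectral_vector, masses)
                if best is None or value > best[1]:
                    best = (candidate, value)
                break
    return best


TOY_MASSES = {"X": 4, "Z": 5}
TOY_SPECTRUM = [0, 0, 0, 4, -2, -3, -1, -7, 6, 5, 3, 2, 1, 9, 3, -8, 0, 3, 1, 2, 1, 0]
print(identify_peptide(TOY_SPECTRUM, "ZXZZXXZ", TOY_MASSES))
```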

STOP and Think: How can we possibly construct

the T. rex proteome?

The Peptide Identification Problem

• 90% of proteins making up

animal bones are collagens.

• Since collagens are often conserved across

species, collagens in T. rex were likely similar to

collagens in some present-day species.

Approximating the T. rex Proteome

• As a sanity check, Asara

compared the T. rex spectra

against the UniProt database

(≈ 200 million amino acids

from hundreds of species).

• Asara also included some mutated versions of

collagens from present-day species; we will call

the augmented database UniProt+.*

Approximating the T. rex Proteome

*concatenate all proteins in UniProt+ into a string Proteome for

simplicity

Most of the high-scoring peptides identified in

UniProt+ were chicken collagens, supporting the

hypothesis that birds evolved from dinosaurs.

Searching T. rex Spectra Against UniProt+

DinosaurPeptide = GLVGAPGLRGLPGK is only

one mutation away from a chicken collagen

peptide.

Searching T. rex Spectra Against UniProt+

But how can we be sure that DinosaurPeptide is

the correct interpretation of DinosaurSpectrum?


Statistical Significance of DinosaurPeptide

DinosaurPeptide is the highest scoring peptide for

DinosaurSpectrum among all peptides in UniProt+.

But billions of peptides not occurring in UniProt+

outscore DinosaurPeptide.

We need to develop a method for evaluating the

statistical significance of identified peptides.

STOP and Think: Does this concern you?



Peptide-Spectrum Matches (PSMs)

Given a parameter threshold, a peptide Peptide and

a spectral vector Spectrum form a Peptide-

Spectrum Match (PSM) if:

• Peptide is a highest-scoring peptide against

Spectrum among all peptides in Proteome

• Score(Peptide, Spectrum) ≥ threshold

PSM_threshold(Proteome, SpectralVectors): the set of Peptide-Spectrum Matches (PSMs) resulting from a set of SpectralVectors (for a given Proteome and threshold).

PSM Search Problem: Identify all Peptide-Spectrum Matches scoring above a threshold for a set of spectra and a proteome.
• Input: A set SpectralVectors, an amino acid string Proteome, and a score threshold threshold.
• Output: The set of Peptide-Spectrum Matches PSM_threshold(Proteome, SpectralVectors).

PSM Search Problem

STOP and Think: A PSM search of 1,000 spectra

from a human sample against the human proteome

results in 100 PSMs whose score surpassed a

threshold.

• What is the fraction of erroneous PSMs among

Hint: Repeat the same experiment for a randomly

generated DecoyProteome of the same size as the

human proteome.

Decoy Proteome

If you identify 5 PSMs in DecoyProteome, then 5/100 = 5% of the PSMs identified in the human proteome are estimated to be erroneous.

False Discovery Rate

False discovery rate (FDR):

$$\text{FDR} = \frac{|\text{PSM}_{\text{threshold}}(\text{DecoyProteome}, \text{SpectralVectors})|}{|\text{PSM}_{\text{threshold}}(\text{Proteome}, \text{SpectralVectors})|}$$

For the T. rex spectra, there are 27 PSMs in UniProt+ and only 1 PSM in DecoyProteome with score ≥ 100 (FDR = 1/27 ≈ 3.7%).

STOP and Think: Have we found ≈27* T. rex peptides?!

* Many of these PSMs correspond to contaminants, e.g., keratin from human skin.
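The FDR estimate is a single ratio of PSM counts. The helper below is an illustrative sketch (the function name is an assumption) that reproduces the two numerical examples from the slides.

```python
def false_discovery_rate(decoy_psm_count, target_psm_count):
    """Estimated FDR: PSMs found in the decoy proteome divided by PSMs in the real one."""
    if target_psm_count <= 0:
        raise ValueError("need at least one PSM in the target proteome")
    return decoy_psm_count / target_psm_count


# Human example from the slides: 5 decoy PSMs vs. 100 target PSMs.
print(f"{false_discovery_rate(5, 100):.1%}")
# T. rex example from the slides: 1 decoy PSM vs. 27 PSMs in UniProt+.
print(f"{false_discovery_rate(1, 27):.1%}")
```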

How can we estimate the statistical significance of

an individual PSM?

The Monkey and the Typewriter

abagytegertoyhktyhkyrzaxujhotgemamaghtkmjytrabagytegertozhkoghk

yrzacatxujhotgemamaghtkdhairytdgbikemjytrcgtyyghjotfghtsybdkkpw

kfffldogjfiegbebgncnslkcfscnnclnscnscnsnovcsnovslvnsnvnvnsnvsvv

slnlnsvlnsnvnslnvnlsvnsnnsvnslvnscatlvslvslvlmbgjgaggeyjllfghlh

mhlhjjlhjlhabracadabraghytnlkprstyrhketryabcnccowcnchairmtdgwom

bikedmdppdtyhtgftxcjabcjwqbcoewbvcoewvbexovervhhddwdwqdhgyusjff

fgfghhhhhy…

The Monkey Can Spell!


MonkeyDictionary

NEW EDITION

2,000 new words

even more nonsense

Expected Number of Strings from Dictionary

The Monkey and the Typewriter Problem: Find the expected number of strings from a dictionary appearing in a randomly generated text.
• Input: A set of strings Dictionary and an integer n.
• Output: The expected number of strings from Dictionary that appear in a randomly generated string of length n.

Expected Number of High-Scoring Peptides

Expected Number of High-Scoring Peptides Problem: Find the expected number of high-scoring peptides (against a given spectrum) in a decoy proteome.
• Input: A Spectrum, an integer n, and a score threshold.
• Output: The expected number of peptides in a decoy proteome of length n that score at least threshold against Spectrum.

STOP and Think: Are these problems equivalent?

Spectral Dictionary

Dictionary_threshold(Spectrum): the set of all peptides with score at least threshold against Spectrum.

With this definition, the Expected Number of High-Scoring Peptides Problem becomes an instance of the Monkey and the Typewriter Problem:
• Input: Peptides Dictionary_threshold(Spectrum) and an integer n.
• Output: The expected number of peptides from Dictionary_threshold(Spectrum) occurring in a decoy proteome of length n.

Expected Number of Occurrences of Peptides from Dictionary in DecoyProteome

• Probability that a string Peptide matches a string starting at a given position in DecoyProteome:

$$\Pr(\text{Peptide}) = \frac{1}{20^{|\text{Peptide}|}}$$

• Expected number of times Peptide appears in DecoyProteome of length n:

$$E(\text{Peptide}, n) \approx n \cdot \Pr(\text{Peptide}) = n \cdot \frac{1}{20^{|\text{Peptide}|}}$$

• Expected number of times peptides from Dictionary appear in DecoyProteome of length n:

$$E(\text{Dictionary}, n) \approx n \cdot \sum_{\text{Peptide} \in \text{Dictionary}} \frac{1}{20^{|\text{Peptide}|}} = n \cdot \Pr(\text{Dictionary})$$
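The approximation above translates into a one-line sum. The sketch below assumes, as the slides do, that every position of the decoy proteome is drawn uniformly from an alphabet of 20 amino acids; the function name and the `alphabet_size` parameter are illustrative.

```python
def expected_occurrences(dictionary, n, alphabet_size=20):
    """Approximate expected number of dictionary strings in a random text of length n."""
    probability = sum(alphabet_size ** -len(peptide) for peptide in dictionary)
    return n * probability


# A single 7-residue peptide in a decoy proteome of 200 million amino acids.
print(expected_occurrences(["AVGELTK"], 200_000_000))
# The monkey typing "THE" on a 26-letter typewriter, 200 million keystrokes.
print(expected_occurrences(["THE"], 200_000_000, alphabet_size=26))
```

The approximation ignores the fact that a string of length k cannot start in the last k - 1 positions of the text, which is negligible when n is large.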

How many peptides in DecoyUniProt+ are expected to score at least -19 against DinosaurSpectrum, i.e., what is $E(\text{Dictionary}_{-19}(\text{DinosaurSpectrum}), |\text{UniProt+}|)$?

Probability and Size of Spectral Dictionary

Probability of Spectral Dictionary Problem: Find the probability of a spectral dictionary for a given spectrum and score threshold.
• Input: A spectral vector Spectrum and a score threshold threshold.
• Output: The probability of Dictionary_threshold(Spectrum).

Size of Spectral Dictionary Problem: Find the size of a spectral dictionary for a given spectrum and score threshold.
• Input: A spectral vector Spectrum and a score threshold threshold.
• Output: The size of Dictionary_threshold(Spectrum).

Computing the Size of a Spectral Dictionary

• Given a spectral vector s = s1…si…sm
• size(i, t): the number of peptides of mass i that score t against the i-prefix s1…si
• size_a(i, t): the number of such peptides ending in amino acid a
• Removing the last amino acid a (of mass |a|) from a peptide results in a shorter peptide with mass i - |a| and score t - si:

$$\text{size}(i, t) = \sum_{\text{all amino acids } a} \text{size}_a(i, t) = \sum_{\text{all amino acids } a} \text{size}(i - |a|,\; t - s_i)$$

• Initialization: size(0, 0) = 1, size(0, t) = 0 for t ≠ 0, and size(i, t) = 0 for i < 0

• Given Spectrum = s1…sm, construct DAG(Spectrum) on nodes 0, …, m with weight of node i equal to si.
• A path from source to sink spells out a peptide.
• Sum of weights of nodes on the path = score of the peptide.

Amino acids X and Z with respective masses 4 and 5.

Spectrum (weights of nodes 0, …, 13): 0 0 0 0 1 1 0 0 0 1 0 0 0 2

Score(XXZ, Spectrum) = 0 + 1 + 0 + 2 = 3
Score(XZX, Spectrum) = 0 + 1 + 1 + 2 = 4

Computing size(i, t)

Spectrum (weights of nodes i = 0, …, 13): 0 0 0 0 1 1 0 0 0 1 0 0 0 2

size(i, t) = Σ all amino acids a size(i - |a|, t - si)

i     0  1  2  3  4  5  6  7  8  9 10 11 12 13
t=0   1  0  0  0  0  0  0  0  0  0  0  0  0  0
t=1   0  0  0  0  1  1  0  0  1  0  1  0  1  0
t=2   0  0  0  0  0  0  0  0  0  2  0  0  0  0
t=3   0  0  0  0  0  0  0  0  0  0  0  0  0  1
t=4   0  0  0  0  0  0  0  0  0  0  0  0  0  2
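The table can be reproduced with a short dynamic program. This sketch stores, for every node i, a counter mapping score t to size(i, t), which also handles negative amplitudes; it uses the toy alphabet and toy spectrum from the slides, and the function names are illustrative. The size of the dictionary for a given threshold is then the sum of size(m, t) over all t ≥ threshold.

```python
from collections import Counter


def size_table(weights, masses):
    """size[i][t] = number of peptides of mass i with score t.

    weights[i] is the weight of node i in DAG(Spectrum); weights[0] is the source.
    """
    m = len(weights) - 1
    size = [Counter() for _ in range(m + 1)]
    size[0][0] = 1                       # the empty peptide
    for i in range(1, m + 1):
        for mass in masses.values():     # one term per amino acid a
            j = i - mass
            if j < 0:
                continue
            for t, count in size[j].items():
                size[i][t + weights[i]] += count
    return size


def dictionary_size(weights, masses, threshold):
    """Number of peptides of mass m scoring at least threshold."""
    last = size_table(weights, masses)[-1]
    return sum(count for t, count in last.items() if t >= threshold)


TOY_MASSES = {"X": 4, "Z": 5}
TOY_WEIGHTS = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 2]
print(dict(size_table(TOY_WEIGHTS, TOY_MASSES)[13]))
print(dictionary_size(TOY_WEIGHTS, TOY_MASSES, threshold=3))
```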

Computing the Probability of a Spectral Dictionary

• Given a spectral vector s = s1…si…sm
• Pr(i, t): sum of probabilities of all peptides of mass i that score t against the i-prefix s1…si
• Pr_a(i, t): sum of probabilities of all such peptides ending in amino acid a
• Removing the last amino acid a from a peptide results in a shorter peptide with mass i - |a|, score t - si, and 20 times larger probability:

$$\Pr(i, t) = \sum_{\text{all amino acids } a} \Pr{}_a(i, t) = \sum_{\text{all amino acids } a} \frac{1}{20} \cdot \Pr(i - |a|,\; t - s_i)$$
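The probability recurrence has the same shape as the size recurrence, with each step dividing by the alphabet size. In the sketch below the divisor is a parameter: the slides use 20 for real amino acids, while for the two-letter toy alphabet the analogous choice (an assumption made here) is 2. Function names are illustrative.

```python
from collections import defaultdict


def probability_table(weights, masses, alphabet_size=None):
    """prob[i][t] = total probability of peptides of mass i with score t.

    weights[i] is the weight of node i; weights[0] is the source.
    alphabet_size defaults to the number of amino acids in masses.
    """
    if alphabet_size is None:
        alphabet_size = len(masses)
    m = len(weights) - 1
    prob = [defaultdict(float) for _ in range(m + 1)]
    prob[0][0] = 1.0                     # the empty peptide has probability 1
    for i in range(1, m + 1):
        for mass in masses.values():
            j = i - mass
            if j < 0:
                continue
            for t, p in prob[j].items():
                prob[i][t + weights[i]] += p / alphabet_size
    return prob


def dictionary_probability(weights, masses, threshold, alphabet_size=None):
    """Probability of the set of peptides of mass m scoring at least threshold."""
    last = probability_table(weights, masses, alphabet_size)[-1]
    return sum(p for t, p in last.items() if t >= threshold)


TOY_MASSES = {"X": 4, "Z": 5}
TOY_WEIGHTS = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 2]
print(dictionary_probability(TOY_WEIGHTS, TOY_MASSES, threshold=3))
```

For the toy spectrum, the three peptides XXZ, XZX, and ZXX each have length 3, so with a two-letter alphabet each contributes 1/8.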

Hint: $\text{Dictionary}_{-19}(\text{DinosaurSpectrum})$ contains 219,136,251,374 peptides (!) and has probability 0.00018.

STOP and Think: What is the statistical significance of

the PSM

(DinosaurPeptide, DinosaurSpectrum)

found in searches against the UniProt+ database of

length n ≈ 200 million amino acids?

Statistical Significance of the PSM

Reminder: the PSM (DinosaurPeptide, DinosaurSpectrum) has score -19.

STOP and Think: How many PSMs with score at least -19 do we expect to find in a decoy proteome of the same size as UniProt+?

$$n \cdot \Pr(\text{Dictionary}_{-19}(\text{DinosaurSpectrum})) = 35{,}311$$

Finding DinosaurPeptide as an interpretation of DinosaurSpectrum is no more surprising than the monkey typing “THE” after 200 million attempts...

...which is not surprising at all!
