
Was T. rex Just a Big Chicken?

Computational Proteomics

Phillip Compeau and Pavel Pevzner

adjusted by Jovana Kovačević

Bioinformatics Algorithms: an Active Learning Approach

© 2015 by Compeau and Pevzner. All rights reserved.


• Paleontology Meets Computing

• Decoding an Ideal Spectrum

• From Ideal to Real Spectra

• Peptide Sequencing

• Peptide Identification

• Spectral Dictionaries

http://www.feelguide.com/2012/08/29/renowned-paleontologist-inspiration-for-jurassic-park-says-well-having-living-dinos-in-5-years/


T. rex and Chicken Collagens Are Nearly Identical!

http://commons.wikimedia.org/wiki/File:Tyrannosaurus_Rey.jpg


Scientists Sequence Collagens from T. rex!

• Jack Horner discovers a T. rex femur fossil in Montana (2000)

• Mary Schweitzer demineralizes it

• John Asara generates spectra and decodes them (2007)

http://www.nbcnews.com/id/18075420/ns/technology_and_science-science/t/t-rex-analysis-supports-dino-bird-link/

http://www.separationsnow.com/details/ezine/sepspec16289ezine/Bringing-out-dinosaurs-inner-chicken.html?

http://serc.carleton.edu/research_education/paleontology/inthenews.html

Frederick Sanger’s Two Nobel Prizes

Overlapping peptide fragments:

GIVEECCA, GIVEECCASV, GIVEECCASVC, GIVEECCASVCSL, GIVEECCASVCSLY

SLYELEDYC, ELEDY, ELEDYCD, LEDYCD, EDYCD

FVDEHLCG, FVDEHLCGSHL, HLCGSHL, SHLVEA, VEALY

YLVCG, LVCGERGF, LVCGERGFF, GFFYTPK, YTPKA

Assembled sequence:

GIVECCASVCSLYELEDYCDFVDEHLCGSHLVEALYLVCGERGFFFYTPKA

1958: protein sequencing

1977: DNA sequencing

1958: protein sequencing difficult, DNA sequencing impossible

Today: protein sequencing difficult, DNA sequencing trivial

• Multiple identical copies of a genome

• Shatter the genome into reads

• Sequence the reads: AGAATATCA, GAGAATATC, TGAGAATAT

• Assemble the genome using overlapping reads: ...TGAGAATATCA...

http://www.express.co.uk/news/obituaries/444612/Obituary-Fred-Sanger-the-scientist-and-twice-Nobel-Prize-winner-dies-aged-95

http://www2.mrc-lmb.cam.ac.uk/about-lmb/archive-and-alumni/alumni/fred-sanger/

Sequencing Proteins Today

• Putative proteome

– If we know a genome, we can predict all the genes that the genome encodes

– Translating the predicted genes leads us to the putative proteome (the set of all proteins encoded by the genome)

– But how can we determine whether a protein is synthesized in a specific tissue?

• Peptide identification

– In practice, merely confirming that a 10 aa long peptide from a known protein is present in a sample confirms that the sample contains this protein

• Peptide sequencing

– Inferring the amino acid sequence of a peptide without relying on a proteome

– Used in situations when the proteome is unknown

Sequencing Proteins with Mass Spectrometry

• Most mass spectrometers can only measure masses of rather short peptides (e.g., < 30-40 amino acids). To bypass this limitation:

– Proteases (e.g., trypsin) break proteins into short peptides.

– A mass spectrometer breaks these peptides into charged fragment ions and measures the mass/charge ratio* and intensity of each ion.

How do we reconstruct the peptide from the collection of mass/charge ratios?

* For simplicity, we assume that all masses are integers and all charges are 1

http://www.directindustry.com/prod/thermo-scientific-scientific-instruments-aut/fourier-transform-mass-spectrometers-7217-1040731.html


Which Peptide Generated This Spectrum?

[Figure: mass spectrum, intensity (%) versus mass/charge (m/z) from 200 to 1200; [M+2H]2+ = 646.20]








• The Ostrich Hemoglobin Riddle

• Searching for Post-Translational Modifications

• Spectral Alignment Algorithm

Prefix and Suffix Peptides

Peptide REDCA, with amino acid masses R = 156, E = 129, D = 115, C = 103, A = 71:

prefix masses: 0 156 285 400 503 574

suffix masses: 0 71 174 289 418 574


Reconstructing a Peptide from Prefix/Suffix

Masses

Ideal Spectrum: Collection of all prefix

and suffix masses of a peptide.

Note: we don’t know which masses

correspond to prefixes and which

masses correspond to suffixes.

Peptide explains Spectrum if

IdealSpectrum(Peptide) = Spectrum.

IdealSpectrum(REDCA):

0 71 156 174 285 289 400 418 503 574
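The definition of an ideal spectrum can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `ideal_spectrum` and the `MASS` table are assumptions introduced here (the slides do not list the integer mass table; the values below are the standard integer amino acid masses).

```python
# Integer masses of the 20 standard amino acids (assumed table, not given in the slides).
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
        "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
        "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}

def ideal_spectrum(peptide):
    """Return the sorted prefix and suffix masses of peptide, including 0 and the total mass."""
    masses = [MASS[a] for a in peptide]
    total = sum(masses)
    spectrum = [0]
    running = 0
    for m in masses[:-1]:
        running += m
        spectrum.append(running)          # mass of a proper prefix
        spectrum.append(total - running)  # mass of the complementary suffix
    spectrum.append(total)
    return sorted(spectrum)

print(ideal_spectrum("REDCA"))
```

For REDCA this reproduces the ten masses listed on the slide.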

Spectrum 0 71 156 174 285 289 400 418 503 574

Graph(Spectrum)

Decoding an Ideal Spectrum Problem:

Reconstruct a peptide from its ideal spectrum.

• Input: A collection of integers Spectrum.

• Output: An amino acid string Peptide that

explains Spectrum.

Reconstructing a Peptide from an Ideal Spectrum

Graph(Spectrum)

Nodes: masses in the spectrum

Edges: connect node i to node j if j - i is the mass of an amino acid a. Label this edge by a.

[Figure: Graph(Spectrum) on nodes 0 71 156 174 285 289 400 418 503 574. One source-to-sink path spells REDCA (0 → 156 → 285 → 400 → 503 → 574); the other spells ACDER (0 → 71 → 174 → 289 → 418 → 574).]


DecodingIdealSpectrum Algorithm

DecodingIdealSpectrum(Spectrum)

construct Graph(Spectrum)

find a path Path from source to sink in Graph(Spectrum)

return amino acid string spelled by labels of Path
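The pseudocode above might be realized as follows. This is a sketch under stated assumptions: the names `spectrum_graph` and `any_path_peptide` and the `MASS` table are introduced here for illustration, the spectrum is assumed to contain 0 (the source), and because I/L and K/Q share an integer mass only one letter per mass is used as an edge label.

```python
# Integer masses of the 20 standard amino acids (assumed table, not given in the slides).
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
        "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
        "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}

# One representative letter per mass (I stands for I/L, K for K/Q).
MASS_TO_AA = {}
for aa, m in MASS.items():
    MASS_TO_AA.setdefault(m, aa)

def spectrum_graph(spectrum):
    """Adjacency lists: node i -> list of (node j, amino acid with mass j - i)."""
    nodes = sorted(set(spectrum))
    graph = {i: [] for i in nodes}
    for i in nodes:
        for j in nodes:
            if j - i in MASS_TO_AA:
                graph[i].append((j, MASS_TO_AA[j - i]))
    return graph

def any_path_peptide(spectrum):
    """Spell the labels of one source-to-sink path (depth-first), or return None."""
    graph = spectrum_graph(spectrum)
    sink = max(graph)

    def dfs(node):
        if node == sink:
            return ""
        for nxt, aa in graph[node]:
            rest = dfs(nxt)
            if rest is not None:
                return aa + rest
        return None

    return dfs(0)

print(any_path_peptide([0, 71, 156, 174, 285, 289, 400, 418, 503, 574]))
```

Either of the two paths in the figure is an acceptable answer here, since a peptide and its reverse have the same ideal spectrum.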


Does This Approach Work for All Spectra?

Spectrum 0 57 114 128 215 229 316 330 387 444

[Figure: Graph(Spectrum) on these ten nodes, with edges labeled G, A, S, T, N, D, and K/Q.]

The path 0 → 57 → 114 → 229 → 330 → 444 spells GGDTN, but IdealSpectrum(GGDTN) ≠ Spectrum! (The ideal spectrum of GGDTN lacks the masses 128 and 316.)

Correcting DecodingIdealSpectrum

The path 0 → 114 → 215 → 316 → 387 → 444 spells NTTAG, and IdealSpectrum(NTTAG) = Spectrum.

DecodingIdealSpectrum(Spectrum)

    construct Graph(Spectrum)

    for each path Path from source to sink in Graph(Spectrum)

        Peptide ← amino acid string spelled by labels of Path

        if IdealSpectrum(Peptide) = Spectrum

            return Peptide

• This algorithm is not efficient: the number of paths may be exponential in the number of nodes (= the number of masses in the spectrum)
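The corrected algorithm can be sketched as an exhaustive path enumeration. This is an illustrative sketch, not the authors' implementation: `decode_ideal_spectrum`, `ideal_spectrum`, and the `MASS` table are names assumed here, and the function simply returns the first explaining peptide it finds (several peptides may explain the same spectrum).

```python
# Integer masses of the 20 standard amino acids (assumed table, not given in the slides).
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
        "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
        "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}
MASS_TO_AA = {}
for aa, m in MASS.items():
    MASS_TO_AA.setdefault(m, aa)  # one letter per mass (I for I/L, K for K/Q)

def ideal_spectrum(peptide):
    """Sorted prefix and suffix masses, including 0 and the total mass."""
    total = sum(MASS[a] for a in peptide)
    prefix, out = 0, [0, total]
    for a in peptide[:-1]:
        prefix += MASS[a]
        out += [prefix, total - prefix]
    return sorted(out)

def decode_ideal_spectrum(spectrum):
    """Try every source-to-sink path; return the first peptide that explains spectrum."""
    target = sorted(spectrum)
    nodes = sorted(set(spectrum))
    sink = nodes[-1]

    def paths(node, peptide):
        if node == sink:
            yield peptide
        for nxt in nodes:
            aa = MASS_TO_AA.get(nxt - node)
            if aa is not None:
                yield from paths(nxt, peptide + aa)

    for peptide in paths(0, ""):
        if ideal_spectrum(peptide) == target:
            return peptide
    return None

spectrum = [0, 57, 114, 128, 215, 229, 316, 330, 387, 444]
print(decode_ideal_spectrum(spectrum))
```

The generator walks paths lazily, so the search stops at the first match, but the worst case is still exponential in the number of nodes, as the bullet above notes.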







[Figure: two annotated spectra, intensity (%) versus m/z from 200 to 1200. Top: peptide VNVADCGAEALAR, [M+2H]2+ = 673.46, with peaks annotated by b-ions (prefixes) and y-ions (suffixes). Bottom: peptide GLVGAPGLRGLPGK, [M+2H]2+ = 646.20, with annotated peaks b10, b11, b13, y3, y12.]

From Ideal to Real Spectra

Real spectra have both false and missing masses.

Ideal spectrum of REDCA: 0 71 156 174 285 289 400 418 503 574

Real spectrum: 0 71 99 156 180 196 228 285 289 320 400 421 503 574

Decoding a (Real) Spectrum Problem: Reconstruct a peptide from its spectrum.

• Input: A collection of integers Spectrum.

• Output: An amino acid string Peptide that explains Spectrum the best (among all possible amino acid strings).

Which Peptide Generated This Spectrum?

[Figure: DinosaurSpectrum, intensity (%) versus mass/charge from 200 to 1200; [M+2H]2+ = 646.20]

• Once the peptide is known, how can we measure how well a peptide explains a spectrum?


Annotating a Spectrum

[Figure: DinosaurSpectrum, intensity versus mass/charge, with annotated peaks.]

• Once we infer the peptide that generated a given spectrum, we can annotate the spectrum by establishing a correspondence between peaks in the spectrum and prefixes/suffixes of the peptide

• A prefix peptide of length 10 is denoted b10; a suffix peptide of length 3 is denoted y3


Shared Peak Count

• Shared Peak Count: the number of peaks annotated by a peptide

[Figure: DinosaurSpectrum annotated by the peptide GLVGAPGLRGLPGK.]

GLVGAPGLRGLPGK annotates b10, b11, b13, y3, y4, y12 (Shared Peak Count = 6)


Another Candidate Peptide

[Figure: DinosaurSpectrum annotated by the peptide ATKIVDCFMTY.]

ATKIVDCFMTY annotates b3, b6, b9, y2, y3, y4, y5, y6, y7, y8 (Shared Peak Count = 10)


How Should We Score an Annotated Spectrum?

• Shared Peak Count? It ignores intensities.

• Sum of intensities of explained peaks? Large peaks may dominate the score.

Idea: a probabilistic model of spectra in which large peaks contribute to the score but do not dominate it.


Spectral Vectors

Transform the spectrum of mass m into a spectral vector s1, …, si, …, sm.

The value si (amplitude) approximates the likelihood that mass i is the prefix mass of an (unknown!) peptide that generated the spectrum.

From a Peptide to a Peptide Vector

Peptide: R E D C A

mass: 156 129 115 103 71

peptide vector: 00…01 (156 bits) 00…01 (129 bits) 00…01 (115 bits) 00…01 (103 bits) 00…01 (71 bits)

Converting a Peptide into a Peptide Vector Problem. Convert a peptide into a peptide vector.

• Input: A string of amino acids Peptide.

• Output: The peptide vector of Peptide.

From a Peptide Vector to a Peptide

Converting a Peptide Vector into a Peptide

Problem. Convert a binary vector into a peptide.

• Input: A binary vector P.

• Output: A peptide whose peptide vector is equal

to P (if such a peptide exists).
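Both conversions can be sketched directly from the definition: the peptide vector has a 1 at every prefix mass and 0 elsewhere. The names `peptide_to_vector`, `vector_to_peptide`, and the `MASS` table are assumptions made for this sketch; because I/L and K/Q share an integer mass, the reverse conversion returns one representative letter per mass.

```python
# Integer masses of the 20 standard amino acids (assumed table, not given in the slides).
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
        "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
        "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}
MASS_TO_AA = {}
for aa, m in MASS.items():
    MASS_TO_AA.setdefault(m, aa)  # one letter per mass (I for I/L, K for K/Q)

def peptide_to_vector(peptide):
    """Binary vector of length mass(peptide) with a 1 at each prefix mass."""
    vector = [0] * sum(MASS[a] for a in peptide)
    prefix = 0
    for a in peptide:
        prefix += MASS[a]
        vector[prefix - 1] = 1  # coordinate i (1-based) corresponds to mass i
    return vector

def vector_to_peptide(vector):
    """Peptide whose peptide vector equals vector, or None if no such peptide exists."""
    peptide, last = "", 0
    for i, bit in enumerate(vector, start=1):
        if bit == 1:
            aa = MASS_TO_AA.get(i - last)
            if aa is None:
                return None  # gap between consecutive 1s is not an amino acid mass
            peptide += aa
            last = i
    return peptide if last == len(vector) else None

v = peptide_to_vector("REDCA")
print(len(v), sum(v), vector_to_peptide(v))
```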

From a Spectrum to a Spectral Vector

[Figure: DinosaurSpectrum (mass m), intensity versus mass/charge, with one peak labeled with amplitude +9.]

+9 (amplitude) is not the intensity of this peak! It is a likelihood that this peak will be annotated by a prefix of an (unknown!) peptide that generated the spectrum.

The larger the peak at mass i, the larger the value (amplitude) si of the spectral vector.

[Figure: DinosaurSpectrum (mass m) with four peaks labeled with amplitudes -5, +3, +9, +7.]

spectral vector: s1 … -5 … +3 … +9 … +7 … sm (an integer-valued vector with m coordinates)








Scoring Peptide against Spectrum

The score of Peptide against Spectrum is the dot product of the peptide vector (p1, …, pm) of Peptide and the spectral vector (s1, …, sm) of Spectrum:

score(Peptide, Spectrum) = p1·s1 + p2·s2 + … + pm·sm.
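As a small illustration of the dot-product score, here is a sketch using a toy alphabet with amino acids X and Z of masses 4 and 5. The function name `score` mirrors the slide's notation; the toy vectors `TOY_SPECTRUM` and `XZX_VECTOR` are illustrative values chosen for this example.

```python
def score(peptide_vector, spectral_vector):
    """Dot product of a binary peptide vector and an integer spectral vector."""
    if len(peptide_vector) != len(spectral_vector):
        raise ValueError("peptide and spectral vectors must have the same length")
    return sum(p * s for p, s in zip(peptide_vector, spectral_vector))

# Toy spectral vector of mass 13 (illustrative values).
TOY_SPECTRUM = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 2]
# Peptide vector of XZX: prefix masses 4, 9, 13 (X = 4, Z = 5).
XZX_VECTOR = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]

print(score(XZX_VECTOR, TOY_SPECTRUM))  # s4 + s9 + s13 = 1 + 1 + 2
```

Because the peptide vector is binary, the score is just the sum of the amplitudes at the peptide's prefix masses.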

Peptide (peptide vector): 000…001000…001000…001000…001000…001

Spectrum (spectral vector): s1 … -5 … +3 … +9 … +7 … sm

[Figure: DinosaurSpectrum aligned with the peptide vector and the spectral vector.]

Peptide Sequencing Problem

Peptide Sequencing Problem: Given a spectral

vector, find a peptide vector with maximum score

against this spectral vector.

• Input: A spectral vector Spectrum.

• Output: An amino acid string Peptide that maximizes score(Peptide, Spectrum) among all possible peptides.

Building a DAG from a Spectral Vector

1. For a spectral vector Spectrum = s1, …, sm, construct DAG(Spectrum) on nodes {0, 1, …, m}

2. Assign weight si to node i

3. Connect node i to node j if j - i is equal to the mass of an amino acid

Toy alphabet: amino acids X and Z with masses 4 and 5

[Figure: DAG(Spectrum) for a toy spectral vector with m = 22. The path spelling XZZXX visits nodes 0, 4, 9, 14, 18, 22, whose weights are 0, 4, 6, 9, 3, 0.]

Score(XZZXX, Spectrum) = 0 + 4 + 6 + 9 + 3 + 0 = 22

Peptides = Paths in DAG(Spectrum)

• Peptide: any path from source to sink in DAG(Spectrum).

• score(Peptide, Spectrum): sum of the weights of the nodes the path visits.

• Peptide Sequencing Problem: finding a maximum-weight path in a node-weighted DAG.

Peptide Sequencing = Finding a Path in DAG(Spectrum)

Peptide Sequencing Problem: Given a spectral vector, find a peptide vector with maximum score against this spectral vector.

• Input: A spectral vector Spectrum.

• Output: A maximum-weight path in DAG(Spectrum).

STOP and Think: How do we find a maximum-weight path in

a node-weighted DAG?
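One standard answer is dynamic programming over the nodes in increasing order, since every edge goes from a smaller node to a larger one. The sketch below assumes the toy alphabet X = 4, Z = 5; the function name `peptide_sequencing` is introduced here, and the spectral vector `TOY_VECTOR` is a reconstruction of the garbled slide figure (the entries at positions 11, 12, 13 and 15 in particular are an assumption).

```python
def peptide_sequencing(spectral_vector, masses):
    """Maximum-weight source-to-sink path in DAG(Spectrum); returns (peptide, score)."""
    m = len(spectral_vector)
    weight = [0] + list(spectral_vector)  # node 0 has weight 0
    NEG = float("-inf")
    best = [NEG] * (m + 1)                # best[j]: max score of a path from 0 to j
    back = [None] * (m + 1)               # back[j]: (previous node, edge label)
    best[0] = 0
    for j in range(1, m + 1):
        for aa, mass in masses.items():
            i = j - mass
            if i >= 0 and best[i] != NEG and best[i] + weight[j] > best[j]:
                best[j] = best[i] + weight[j]
                back[j] = (i, aa)
    if best[m] == NEG:
        return None, None                 # the sink is unreachable
    peptide, node = [], m
    while node != 0:
        node, aa = back[node]
        peptide.append(aa)
    return "".join(reversed(peptide)), best[m]

TOY_MASSES = {"X": 4, "Z": 5}
# Assumed reconstruction of the slide's toy spectral vector (m = 22).
TOY_VECTOR = [0, 0, 0, 4, -2, -3, -1, -7, 6, 5, 3, 2, 1, 9, 3, -8, 0, 3, 1, 2, 1, 0]

print(peptide_sequencing(TOY_VECTOR, TOY_MASSES))
```

The running time is proportional to m times the alphabet size, which is why sequencing is fast even though the space of all peptides is huge.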

Generating Spectrum from an (Unknown) Peptide

[Figure: an unknown peptide (???????????) generates DinosaurSpectrum, intensity (%) versus mass/charge; [M+2H]2+ = 646.20]

Reconstructing Peptide from Spectrum

[Figure: de novo reconstruction of DinosaurSpectrum yields ATKIVDCFMTY, which annotates peaks b3, b6, b9 and y2 through y8.]

De novo Reconstruction!

But this highest scoring peptide is biologically

incorrect!

Scoring functions that reliably assign the highest

score to the biologically correct peptide remain

unknown...

Imagine that You Know the Proteome…

…HKMPRSTATPKRMGGCTFSPCFTKRLMATSGLVGAPGLRGLPGKMGGCTFGTRACFGH…

[Figure: DinosaurSpectrum and its annotation by the highest-scoring peptide in Proteome.]

The correct peptide may not score highest among all peptides, but it typically scores highest among all peptides in the proteome.*

* If the resulting score is sufficiently high

Peptide identification: reconstructing a peptide as the highest-scoring peptide occurring in a proteome.

Peptide Sequencing vs. Peptide Identification

Peptide identification searches all peptides from Proteome:

MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN

Peptide sequencing searches all possible peptides (20^n):

AAAAAAAA, AAAAAAAC, AAAAAAAD, AAAAAAAE, AAAAAAAG, AAAAAAAF, AAAAAAAH, AAAAAAI, …, AVGELTI, AVGELTK, AVGELTL, AVGELTM, …, YYYYYYYS, YYYYYYYT, YYYYYYYV, YYYYYYYY

In this example, both approaches arrive at AVGELTK.

Which approach is faster?

• The set of all peptides in Proteome is much smaller than the set of all possible peptides.

• However, peptide sequencing algorithms are much faster, even though their search space is much larger.

• Peptide sequencing eliminates the time-consuming scan of Proteome by modeling the problem as the Longest Path in a DAG Problem.

• However, since the scoring function is imperfect, peptide sequencing remains inaccurate: state-of-the-art tools correctly reconstruct only 30% of spectra.









http://en.wikipedia.org/wiki/Collagen_helix


Peptide Identification Problem: Find a peptide

from a proteome with maximum score against a

spectrum.

• Input: A spectral vector Spectrum and an amino

acid string Proteome.

• Output: An a.a. string Peptide that maximizes

score(Peptide, Spectrum)

among all substrings of Proteome.
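A brute-force solution scans every substring of Proteome whose mass equals the length of the spectral vector. The following is a sketch, not the authors' implementation: the name `peptide_identification`, the toy alphabet (X = 4, Z = 5), the toy spectral vector, and the toy proteome string are all illustrative assumptions.

```python
def peptide_identification(spectral_vector, proteome, masses):
    """Highest-scoring substring of proteome whose mass equals len(spectral_vector)."""
    m = len(spectral_vector)
    best_peptide, best_score = None, float("-inf")
    for start in range(len(proteome)):
        prefix, score = 0, 0
        for end in range(start, len(proteome)):
            prefix += masses[proteome[end]]
            if prefix > m:
                break                              # substring is already too heavy
            score += spectral_vector[prefix - 1]   # amplitude at this prefix mass
            if prefix == m and score > best_score:
                best_peptide, best_score = proteome[start:end + 1], score
    return best_peptide, best_score

TOY_MASSES = {"X": 4, "Z": 5}
TOY_SPECTRUM = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 2]  # illustrative values, m = 13
TOY_PROTEOME = "XZZXXZX"                                 # hypothetical proteome string

print(peptide_identification(TOY_SPECTRUM, TOY_PROTEOME, TOY_MASSES))
```

Ties are broken in favor of the leftmost substring, because a later candidate replaces the current best only if its score is strictly higher.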

STOP and Think: How can we possibly construct

the T. rex proteome?

The Peptide Identification Problem

• 90% of proteins making up

animal bones are collagens.

• Since collagens are often conserved across

species, collagens in T. rex were likely similar to

collagens in some present-day species.

Approximating the T. rex Proteome



• As a sanity check, Asara

compared the T. rex spectra

against the UniProt database

(≈ 200 million amino acids

from hundreds of species).

• Asara also included some mutated versions of

collagens from present-day species; we will call

the augmented database UniProt+.*

Approximating the T. rex Proteome

*concatenate all proteins in UniProt+ into a string Proteome for

simplicity

Most of the high-scoring peptides identified in

UniProt+ were chicken collagens, supporting the

hypothesis that birds evolved from dinosaurs.

Searching T. rex Spectra Against UniProt+



DinosaurPeptide = GLVGAPGLRGLPGK is only

one mutation away from a chicken collagen

peptide.

Searching T. rex Spectra Against UniProt+

But how can we be sure that DinosaurPeptide is

the correct interpretation of DinosaurSpectrum?


Statistical Significance of DinosaurPeptide

DinosaurPeptide is the highest scoring peptide for

DinosaurSpectrum among all peptides in UniProt+.

But billions of peptides not occurring in UniProt+

outscore DinosaurPeptide.

We need to develop a method for evaluating the

statistical significance of identified peptides.

STOP and Think: Does this concern you?


Peptide-Spectrum Matches (PSMs)

Given a parameter threshold, a peptide Peptide and a spectral vector Spectrum form a Peptide-Spectrum Match (PSM) if:

• Peptide is a highest-scoring peptide against Spectrum among all peptides in Proteome

• Score(Peptide, Spectrum) ≥ threshold

PSMthreshold(Proteome, SpectralVectors): the set of Peptide-Spectrum Matches (PSMs) resulting from a set of SpectralVectors (for a given Proteome and threshold).

PSM Search Problem: Identify all Peptide-

Spectrum Matches scoring above a threshold for a

set of spectra and a proteome.

• Input: A set SpectralVectors, an amino acid

string Proteome, and a score threshold

threshold.

• Output: The set of Peptide-Spectrum Matches

PSMthreshold(Proteome, SpectralVectors).

PSM Search Problem








STOP and Think: A PSM search of 1,000 spectra

from a human sample against the human proteome

results in 100 PSMs whose score surpassed a

threshold.

• What is the fraction of erroneous PSMs among

them?

Hint: Repeat the same experiment for a randomly

generated DecoyProteome of the same size as the

human proteome.

Decoy Proteome

If you identify 5 PSMs in DecoyProteome, then 5/100 of the PSMs identified in the human proteome are estimated to be incorrect.

False Discovery Rate

False discovery rate (FDR):

$\mathrm{FDR} = \dfrac{|PSM_{threshold}(DecoyProteome, SpectralVectors)|}{|PSM_{threshold}(Proteome, SpectralVectors)|}$

For the T. rex spectra, there are 27 PSMs in UniProt+ and only 1 PSM in DecoyProteome with score ≥ 100 (FDR = 1/27 ≈ 3.7%)

STOP and Think: Have we found ≈27* T. rex peptides?!

* Many of these PSMs correspond to contaminants, e.g., keratin from human skin
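The PSM search and the decoy-based FDR estimate can be sketched together. Everything named below (`best_match`, `psm_search`, `false_discovery_rate`, the toy alphabet X = 4 and Z = 5, the toy spectra, and the two proteome strings) is an assumption made for illustration; only the ratio of decoy PSMs to target PSMs comes from the slides.

```python
def best_match(spectral_vector, proteome, masses):
    """Highest-scoring substring of proteome with mass len(spectral_vector)."""
    m = len(spectral_vector)
    best_peptide, best_score = None, float("-inf")
    for start in range(len(proteome)):
        prefix, score = 0, 0
        for end in range(start, len(proteome)):
            prefix += masses[proteome[end]]
            if prefix > m:
                break
            score += spectral_vector[prefix - 1]
            if prefix == m and score > best_score:
                best_peptide, best_score = proteome[start:end + 1], score
    return best_peptide, best_score

def psm_search(spectral_vectors, proteome, threshold, masses):
    """List of (spectrum index, peptide, score) for every PSM scoring >= threshold."""
    psms = []
    for index, vector in enumerate(spectral_vectors):
        peptide, score = best_match(vector, proteome, masses)
        if peptide is not None and score >= threshold:
            psms.append((index, peptide, score))
    return psms

def false_discovery_rate(num_decoy_psms, num_target_psms):
    """Ratio of decoy PSMs to target PSMs; 0.0 when there are no target PSMs."""
    return num_decoy_psms / num_target_psms if num_target_psms else 0.0

TOY_MASSES = {"X": 4, "Z": 5}
SPECTRA = [[0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 2],  # illustrative spectral vectors
           [0, 0, 0, 3, 0, 0, 0, 1]]
target = psm_search(SPECTRA, "XZZXXZX", 4, TOY_MASSES)  # hypothetical proteome
decoy = psm_search(SPECTRA, "XXZZZ", 4, TOY_MASSES)     # hypothetical decoy proteome
print(false_discovery_rate(len(decoy), len(target)))
print(round(false_discovery_rate(1, 27), 3))             # the T. rex numbers from the slide
```

In practice the decoy proteome would be randomly generated and as long as the real proteome; the fixed decoy string here only keeps the example deterministic.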

How can we estimate the statistical significance of

an individual PSM?

The Monkey and the Typewriter

abagytegertoyhktyhkyrzaxujhotgemamaghtkmjytrabagytegertozhkoghk

yrzacatxujhotgemamaghtkdhairytdgbikemjytrcgtyyghjotfghtsybdkkpw

kfffldogjfiegbebgncnslkcfscnnclnscnscnsnovcsnovslvnsnvnvnsnvsvv

slnlnsvlnsnvnslnvnlsvnsnnsvnslvnscatlvslvslvlmbgjgaggeyjllfghlh

mhlhjjlhjlhabracadabraghytnlkprstyrhketryabcnccowcnchairmtdgwom

bikedmdppdtyhtgftxcjabcjwqbcoewbvcoewvbexovervhhddwdwqdhgyusjff

fgfghhhhhy…

The Monkey Can Spell!

[The same text as above, with the dictionary words it happens to contain highlighted.]

[Figure: cover of "The Monkey Dictionary", New Edition: 2,000 new words, even more nonsense.]

Expected Number of Strings from Dictionary

The Monkey and the Typewriter Problem: Find the expected number of strings from a dictionary appearing in a randomly generated text.

• Input: A set of strings Dictionary and an integer n.

• Output: The expected number of strings from Dictionary that appear in a randomly generated string of length n.

Expected Number of High-Scoring Peptides

Expected Number of High-Scoring Peptides Problem: Find the expected number of high-scoring peptides (against a given spectrum) in a decoy proteome.

• Input: A Spectrum, an integer n, and a score threshold.

• Output: The expected number of peptides in a decoy proteome of length n that score at least threshold against Spectrum.

STOP and Think: Are these problems equivalent?

Spectral Dictionary

Dictionarythreshold(Spectrum): the set of all peptides with score at least threshold against Spectrum.

[Figure: DinosaurSpectrum, intensity (%) versus m/z from 200 to 1200; [M+2H]2+ = 646.20]

http://genius.com/2228302/Rob-thomas-matchbox-twenty-our-song/Your-like-a-little-piece-of-kerosene

Expected Number of High-Scoring Peptides Problem: Find the expected number of high-scoring peptides (against a given spectrum) in a decoy proteome.

• Input: Peptides Dictionarythreshold(Spectrum) and an integer n.

• Output: The expected number of peptides from Dictionarythreshold(Spectrum) occurring in a decoy proteome of length n.



Expected Number of Occurrences of Peptides from Dictionary in DecoyProteome

• Probability that a string Peptide matches a string starting at a given position in DecoyProteome:

$\Pr(Peptide) = 1/20^{|Peptide|}$

• Expected number of times Peptide appears in DecoyProteome of length n:

$E(Peptide, n) \approx n \cdot \Pr(Peptide) = n \cdot 1/20^{|Peptide|}$

• Expected number of times peptides from Dictionary appear in DecoyProteome of length n:

$E(Dictionary, n) \approx n \cdot \sum_{Peptide \in Dictionary} 1/20^{|Peptide|} = n \cdot \Pr(Dictionary)$

How many peptides in DecoyUniProt+ are expected to score at least -19 against DinosaurSpectrum, i.e., what is E(Dictionary-19(DinosaurSpectrum), |UniProt+|)?
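The estimate above translates into a short computation. This is a sketch under the slide's assumption that each of the 20 amino acids is equally likely at every position; the function names `dictionary_probability` and `expected_occurrences` are introduced here, and the one-peptide dictionary is only an example.

```python
def dictionary_probability(dictionary, alphabet_size=20):
    """Pr(Dictionary): sum of 1 / alphabet_size^|Peptide| over all peptides."""
    return sum(alphabet_size ** -len(peptide) for peptide in dictionary)

def expected_occurrences(dictionary, n, alphabet_size=20):
    """Approximate expected number of dictionary peptides in a random string of length n."""
    return n * dictionary_probability(dictionary, alphabet_size)

# Example: a one-peptide dictionary and a decoy proteome of 200 million amino acids.
print(expected_occurrences(["GLVGAPGLRGLPGK"], 200_000_000))
```

The approximation ignores boundary effects (a peptide of length k can start at only n - k + 1 positions), which is negligible when n is much larger than the peptide length.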

Probability of Spectral Dictionary

Probability of Spectral Dictionary Problem: Find the

probability of a spectral dictionary for a given spectrum and

score threshold.

• Input: A spectral vector Spectrum and a score threshold

threshold.

• Output: The probability of Dictionary_{threshold}(Spectrum).

Probability and Size of Spectral Dictionary

Size of Spectral Dictionary Problem: Find the size of a spectral dictionary for a given spectrum and score threshold.

• Input: A spectral vector Spectrum and a score threshold threshold.

• Output: The size of Dictionary_{threshold}(Spectrum).


Computing the Size of a Spectral Dictionary

• Given a spectral vector s = s_1…s_i…s_n

• size(i, t): number of peptides matching the i-prefix s_1…s_i with score t

• size_a(i, t): number of peptides matching the i-prefix s_1…s_i with score t and ending in amino acid a

• Removing the last amino acid a (of mass |a|) from such a peptide results in a shorter peptide with mass i - |a| and score t - s_i:

size_a(i, t) = size(i - |a|, t - s_i)

• Summing over the last amino acid:

size(i, t) = Σ_{all amino acids a} size_a(i, t) = Σ_{all amino acids a} size(i - |a|, t - s_i)

• Initialization: size(0, 0) = 1, size(0, t) = 0 for t ≠ 0, and size(i, t) = 0 for i < 0

Computing the Size of a Spectral Dictionary

• Given Spectrum = s_1…s_m, construct DAG(Spectrum) on nodes 0, …, m with weight of node i equal to s_i.

• Example: amino acids X and Z with respective masses 4 and 5.

• Node weights (nodes 0 to 13): 0 0 0 0 1 1 0 0 0 1 0 0 0 2

[Figure: DAG(Spectrum) with an edge labeled X from node i to node i + 4 and an edge labeled Z from node i to node i + 5]

• A path from source to sink spells out a peptide.

• The sum of weights of nodes on the path equals the score of the PSM.

Score(XXZ, Spectrum) = 0 + 1 + 0 + 2 = 3

Score(XZX, Spectrum) = 0 + 1 + 1 + 2 = 4
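The path-scoring rule can be checked with a short Python sketch. The function name `score_peptide` is hypothetical, and the sketch assumes that the 14 digits on the slide are the weights of nodes 0 to 13 (node 0 has weight 0), which is what the two score sums on the slide suggest.

```python
from typing import Dict, List

# Toy example from the slides; nodes 0..13 (assumed indexing).
NODE_WEIGHTS: List[int] = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 2]
MASSES: Dict[str, int] = {"X": 4, "Z": 5}


def score_peptide(peptide: str, node_weights: List[int], masses: Dict[str, int]) -> int:
    """Sum the weights of the nodes visited by the path that spells `peptide`."""
    sink = len(node_weights) - 1
    position = 0
    score = node_weights[0]  # the source node is part of the path
    for amino_acid in peptide:
        position += masses[amino_acid]
        if position > sink:
            raise ValueError("peptide mass exceeds the sink node")
        score += node_weights[position]
    if position != sink:
        raise ValueError("peptide does not end at the sink node")
    return score


print(score_peptide("XXZ", NODE_WEIGHTS, MASSES))  # 3
print(score_peptide("XZX", NODE_WEIGHTS, MASSES))  # 4
```

The path for XXZ visits nodes 0, 4, 8, 13 and the path for XZX visits nodes 0, 4, 9, 13, matching the two sums on the slide.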

Computing size(i, t)

size(i, t) = Σ_{all amino acids a} size(i - |a|, t - s_i)

For example, node 4 has weight 1, so size(4, t) = Σ_{all amino acids a} size(4 - |a|, t - 1), which gives size(4, 1) = size(0, 0) = 1.

Node weights (nodes 0 to 13): 0 0 0 0 1 1 0 0 0 1 0 0 0 2, with amino acids X (mass 4) and Z (mass 5).

Completed table of size(i, t), one column per node i:

i    0  1  2  3  4  5  6  7  8  9 10 11 12 13

t=0  1  0  0  0  0  0  0  0  0  0  0  0  0  0

t=1  0  0  0  0  1  1  0  0  1  0  1  0  1  0

t=2  0  0  0  0  0  0  0  0  0  2  0  0  0  0

t=3  0  0  0  0  0  0  0  0  0  0  0  0  0  1

t=4  0  0  0  0  0  0  0  0  0  0  0  0  0  2

Computing the Probability of a Spectral Dictionary

• Given a spectral vector s = s_1…s_i…s_n

• Pr(i, t): sum of probabilities of all peptides matching the i-prefix s_1…s_i with score t

• Pr_a(i, t): sum of probabilities of all peptides matching the i-prefix s_1…s_i with score t and ending in amino acid a

• Removing the last amino acid a from such a peptide results in a shorter peptide with mass i - |a|, score t - s_i, and 20 times larger probability:

Pr_a(i, t) = Pr(i - |a|, t - s_i) / 20

• Summing over the last amino acid:

Pr(i, t) = Σ_{all amino acids a} Pr_a(i, t) = Σ_{all amino acids a} Pr(i - |a|, t - s_i) / 20

[Figure: DAG(Spectrum) for node weights 0 0 0 0 1 1 0 0 0 1 0 0 0 2 with edges X and Z, and the table of Pr(i, t) for i = 0 to 4 and t = 0 to 4, in which Pr(0, 0) = 1, Pr(4, 1) = Pr(0, 0)/20 = 1/20, and all other entries are 0.]
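The probability recurrence differs from the size recurrence only by the division by 20, as this hedged Python sketch shows. The function name is hypothetical, exact fractions are used so the toy values can be compared by eye, and the sketch assumes non-negative node weights and the initialization Pr(0, 0) = 1 with Pr(i, t) = 0 for i < 0. As on the slide, the factor 1/20 is kept even though the toy alphabet has only two amino acids.

```python
from fractions import Fraction
from typing import Dict, List


def spectral_dictionary_probabilities(
    node_weights: List[int],
    masses: Dict[str, int],
    max_score: int,
    alphabet_size: int = 20,
) -> List[List[Fraction]]:
    """Return pr[i][t] for nodes i = 0..m and scores t = 0..max_score.

    Assumes non-negative node weights, so scores never drop below 0.
    """
    m = len(node_weights) - 1
    pr = [[Fraction(0)] * (max_score + 1) for _ in range(m + 1)]
    pr[0][0] = Fraction(1)  # the empty peptide has probability 1
    for i in range(1, m + 1):
        for t in range(max_score + 1):
            previous_score = t - node_weights[i]
            if previous_score < 0:
                continue
            # Each added amino acid divides the probability by alphabet_size.
            pr[i][t] = sum(
                (pr[i - mass][previous_score]
                 for mass in masses.values() if i - mass >= 0),
                Fraction(0),
            ) / alphabet_size
    return pr


node_weights = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 2]  # nodes 0..13
masses = {"X": 4, "Z": 5}
pr = spectral_dictionary_probabilities(node_weights, masses, max_score=4)
print(pr[4][1])   # 1/20
print(pr[13][4])  # 1/4000

# Probability of the dictionary for threshold 3: sum over scores >= 3.
print(sum(pr[13][t] for t in range(3, 5)))  # 3/8000
```

Each of the three peptides of mass 13 has length 3 and therefore probability 1/8000, which is why the dictionary for threshold 3 has probability 3/8000.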

Hint: Dictionary_{-19}(DinosaurSpectrum) contains 219,136,251,374 peptides (!) and has probability 0.00018.

STOP and Think: What is the statistical significance of

the PSM

(DinosaurPeptide, DinosaurSpectrum)

found in searches against the UniProt+ database of

length n ≈ 200 million amino acids?

Statistical Significance of the PSM


Reminder: PSM (DinosaurPeptide, DinosaurSpectrum)

has score -19.

STOP and Think: How many PSMs with score at

least -19 do we expect to find in a decoy proteome

of the same size as UniProt+?

n * Pr(Dictionary_{-19}(DinosaurSpectrum)) = 35,311
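As a rough check of this figure, the product can be recomputed from the rounded values quoted on the slides (n ≈ 200 million, probability 0.00018). The function name below is hypothetical, and the small gap to 35,311 is assumed to come from the slide using unrounded inputs.

```python
def expected_decoy_matches(proteome_length: int, dictionary_probability: float) -> float:
    """E(Dictionary, n) ≈ n * Pr(Dictionary)."""
    return proteome_length * dictionary_probability


# Rounded values from the slides; the unrounded inputs are not given.
n = 200_000_000
pr_dictionary = 0.00018
print(round(expected_decoy_matches(n, pr_dictionary)))  # 36000
```

Either way, tens of thousands of decoy peptides are expected to score at least -19, so a single PSM with that score is not statistically significant.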



Finding DinosaurPeptide as an

interpretation of DinosaurSpectrum is

no more surprising than the monkey

typing “THE” after 200 million

attempts...

[Figure: book cover, "The Monkey Dictionary, New Edition: 2,000 new words, even more nonsense"]


...which is not surprising at all!

