De novo glycan structure search with CID MS/MS spectra of native N-glycopeptides 18.12.2008 Hannu Peltoniemi [email protected]

De novo glycan structure search with CID MS/MS spectra of

native N-glycopeptides

18.12.2008Hannu Peltoniemi

[email protected]

De novo vs database matching

MS2 spectrum

Unknown glycan

glycandatabase

Database matching

matching

Best scoring glycan(s) in the DB

• Only those structures that are in the DB can be found• OK if comprehensive DB• If glycan not in the DB the result may be closest matching (wrong) structure or no result at all

MS2 spectrum

Unknown glycan

De novo

Best scoring glycans

• No database -> also new structures can be found !• Computational intensive, requires high quality spectra• Typically no definite answer, but a set of high scoring structures.

On the fly structure generation and matching

De novo structure search

Part of the N-glycopeptide workflow:Joenväärä et al., N-Glycoproteomics

- An automated workflow approach., Glycobiology 2008,18(4):339-349.

Input: Protonated, deconvoluted MS2 spectra

Steps:1) identification of peptides 2) identification of N-glycan compositions 3) identification of de novo N-glycan structures (branching, no linkage)

Input data

Spectrum with annotated glycopeptide and glycan composition fragments.

Example data

Peptide: QDQCIYNTTYLNVQRGlycan composition: 6 Hex 5 HexNac 3 NeuAc

Same data, different view:

O O OOOO

OO O

OO O

O O

OO O

O OO

OO O

O O

OOO O

O

O

OO O

OO O

O

Hex

Hex

NeuAc=0 NeuAc=1 NeuAc=2 NeuAc=3

6

6

5 5 5 5

0

0

0 0 0 0

composition: 6 Hex 5 HexNac 3 NeuAc

Glycan fragments attached to peptide

Free glycans

HexNAc HexNAc HexNAc HexNAc

The puzzle

• All the measured fragment compositions of a unknown structure with the given total composition are known• Some theoretical fragments may be missing• Some measured fragments may be false

O O OOOO

OO O

OO O

O

What is the structure that explains best the data?

?

Solution

The problem is split to two phases

1)Generation of possible structures: Structures are grown starting from N-glycan core. The population size is limited by removing structures with lowest fit with peptide+glycan fragments

2) Scoring: The set of structures are scored with full data. The final glycopeptide score is set to sum of peptide and glycan structure scores.

measured

theoretical

Initialization

The missfit (cost) between theoretical structure and measured data is defined as the number of not matching theoretical and measured fragments.

Example data: peptide + 5 Hex 4 HexNAc

Growing structuresStart (core)

End (final composition)

add unit

add unit

add unit

add unit

If population grows too large structures with highest cost are removed.

Scoring

...

Score is calculated as –log10(P), where P is the probability (binomial) that a random set of fragments would match as well or better as the ranked structure. The final glycopeptide score is sum of peptide and structure scores.

highest scoring

lowest scoring

Options

• All glycosidig bonds can be broken• Unlimited number of cuts

Assumptions

• Monosaccharide names• Number of possible connections with each monosaccharide• Accepted connections between monosaccharides• Start structures (N-glycan cores)• Max population size when growing structures

Testing with in silico generated data

structure theoretical spectrum

fragmentation

randomly removing and adding noise fragments

x x xxxx

xxx

xxxx

xxxx

xx

xxx

xxx

xx x

x x

xxx

xxxxx

xxxxxx

xxxxx

xxxxx

xxxx

xx x

xxx

xxxx

xxxx

xxxx

xxx

xxx

xxx

xxx

xx x x x


Hex

Hex


peptide+glycan

glycan

x x xxxx

xx

x

xx

x

x

x x

xx

xxxx x

xx

xx

x

xxx

x

xxx

xx x

x xxx

x

x

x

x

x

x

xx

xxxx

x

xx x

input to the de novo algoritm

randomized spectrum

no noise2 noise fragm ents4 noise fragm ents

20 30 40 50 60 70 80

02

04

06

08

01

00

Correct structure w ith rank 3

Removed reducing end fragments (%)

Re

sults

ma

tch

ing

th

e c

rite

ria (

%)

20 30 40 50 60 70 80

02

04

06

08

01

00

Correct structure w ith rank 1

Removed reducing end fragments (%)

Re

sults

ma

tch

ing

th

e c

rite

ria (

%)

Percentage of runs (% )

(20,40) (40,60) (60,80) (80,100) (20,40) (40,60) (60,80) (80,100)

Removed reducing, non reducing end fragments (% )

Removed reducing, non reducing end fragments (% )

Results of the in silico tests

If about ½ of the theoretical fragments present => The correct structure is among the few highest scoring ones.

Each mark is a result of a 100 runs.

Testing with serum sample

• Very complex wet lab data set, i.e. a human serum specimen• Removal the high abundance proteins prior to LC-MS/MS • 80 spectra with identified peptide and glycan compositions• 62 spectra with putative structures• Mostly typical structures• Mostly small structures, large ones seems to be hard to catch


Hex

Hex


Reducing end fragm ents(attached to peptide).

Non reducing end fragm ents(free glycans).

0

0

6

6

0 0 0 05 5 5 5X : theoretical O : m easured

x x xxxx

xxx

xxxx

xxxx

O O OOOO

OO O

OO O

O

xx

xxx

xxxO

OO O

O OO

xx x

x

OO O

O xO

xxx

xxxxx

xxxxxx

xxxxx

xxxxx

xxxx

OOO O

O

O

xx x

xxx

xxxx

xxxx

xxxx

xxx

OO O

OO x

xx

xxx

xxx

xx

O

Ox x x

G lyca n is a tta che d to pe ptideQ D Q C IY N T T Y L N V Q R (A lpha -1 -a c id g lyco pro te in 1 ).

S e rum , m /z=1194.93, z=4

T hree best sco ring s truc tu res.

73 .2 72 .8 72 .6S co re

M e a s ure d a nd the o re tica l fra gm e nts fo r the be s t s co ring s truc tu re .

Example serum spectrum

ANT3(224,187), FIBG(78), THRB(121), A1AG1(56), FETUA(156), HPT(241), HRG(344), FIBB(394), TRFE(630), IGHA1(144), A1AT(70,107,271), { VINEX(102), HPTR(126) }

FIBG(78), HRG(344), IGHA1(144) VTNC(169)

IGHG1(180), IGHG2(176) IGHA1(144) A1AG1(93)

IGHG2(176) IGHA1(144) CO2(621), CO3(85)

IGHG2(176) IGHA1(144) CO3(85)

Structures found from the serum sample

Conclusions

• De novo glycan structure identification of intact glycopeptides is possible

• High quality spectra is necessary

• Typically no definite answer but a few structures matching equally well => biological insight still needed if one identified structure needs to be picked

Documents

De novo glycan structure search with CID MS/MS spectra of native N-glycopeptides 18.12.2008 Hannu Peltoniemi [email protected]