PepHMM: A Hidden Markov Model Based Scoring Function for Mass Spectrometry Database Search

Laxman Yetukuri

T-61.6070: Modeling of Proteomics Data

Outline

Motivation

Basics: MS and MS/MS for Protein Identification

Computational Framework of Database Search

Scoring Algorithms: PepHMM and MOWSE

Results

Summary

Motivation

Proteomics studies are dynamic and context-sensitive

Speed and accuracy of omics-driven methods

High-throughput MS-based approaches

Real analysis starts with protein identification

Protein identification is challenging

The heart of a protein identification algorithm is its scoring function

Protein Identification Is Challenging

Sample Contamination

Imperfect Fragmentation

Post-translational Modifications

Low signal-to-noise ratio

Machine errors

Basics: MS and MS/MS for Protein Identification

[Figure: MS/MS workflow. Labels: Trypsin Digest; Liquid Chromatography; Mass Spectrometry; Precursor selection + collision-induced dissociation (CID); MS/MS]

Nesvizhskii and Aebersold, Drug Discovery Today, 2004, 9, 173-181

Computational Problem

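Peptide Fragmentation: b & y ions

[Figure: peptide backbone -HN-CH-CO-NH-CH-CO-NH- at residues i and i+1, with side-chain labels R_i, CH-R' and R''. Cleavage sites along the backbone are marked b_i and b_{i+1}, with the complementary fragments y_{n-i} and y_{n-i-1}]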

Peptide Fragmentation: b & y ions …

Residue   b ion (m/z)   y ion (m/z)
S              88          1166
G             145          1080
F             292          1022
L             405           875
E             534           762
E             663           633
D             778           504
E             907           389
L            1020           260
K            1166           147

[Figure: MS/MS spectrum, % Intensity (0 to 100) versus m/z (0 to 1000), with labelled peaks b3 to b9 and y2 to y9]
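The b and y series for the example peptide on this slide (residues S, G, F, L, E, E, D, E, L, K) can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names are hypothetical, and it assumes singly charged ions and standard monoisotopic residue masses. The computed values agree with the slide's numbers to within about 1 Da. The last value in each series on the slide (1166) corresponds to the intact protonated peptide, so the sketch's b10 (without the terminal water) comes out lower.

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
RESIDUE_MASS = {
    "G": 57.02146, "S": 87.03203, "L": 113.08406, "D": 115.02694,
    "K": 128.09496, "E": 129.04259, "F": 147.06841,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def b_ions(peptide):
    """Singly charged b-ion m/z values b1..bn (N-terminal prefixes)."""
    masses, total = [], PROTON
    for residue in peptide:
        total += RESIDUE_MASS[residue]
        masses.append(total)
    return masses


def y_ions(peptide):
    """Singly charged y-ion m/z values y1..yn (C-terminal suffixes)."""
    masses, total = [], WATER + PROTON
    for residue in reversed(peptide):
        total += RESIDUE_MASS[residue]
        masses.append(total)
    return masses


if __name__ == "__main__":
    for k, (b, y) in enumerate(zip(b_ions("SGFLEEDELK"), y_ions("SGFLEEDELK")), 1):
        print(f"b{k} = {b:.2f}   y{k} = {y:.2f}")
```

The additional ion types used later for the hypothetical spectrum (water losses, doubly charged ions) would follow from these values by subtracting the mass of water or adjusting for charge.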

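Peptide Fragmentation with other ions

[Figure: peptide backbone -HN-CH-CO-NH-CH-CO-NH- at residues i and i+1, with side-chain labels R_i, CH-R' and R''. In addition to b_i, b_{i+1}, y_{n-i} and y_{n-i-1}, the cleavage sites for a_i, c_i, x_{n-i} and z_{n-i} ions are marked]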

Peptide Identification

Two main methods for tandem MS:

De novo interpretation

Sequence database search

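De Novo Interpretation

[Figure: MS/MS spectrum, % Intensity (0 to 100) versus m/z (0 to 1000), with residue letters (S, G, F, L, E, D, K) annotating the mass differences between neighbouring peaks]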

Sequence Database Search

Widely used approach

Compares peptides from a protein sequence database with experimental spectra

The scoring function summarises the comparison

Critical for any search engine

Score each peptide against spectrum

Cross correlation (SEQUEST)

MOWSE scoring and its extensions (MASCOT)

Probabilistic scoring systems (OMSSA, OLAV, ProbID, …)

PepHMM is an HMM-based probabilistic scoring function

Computational Framework for pepHMM

MSDB based peptide extraction

Hypothetical spectrum generation

b, y, y-H2O, b-H2O, b2+ and y2+ ions

Computing probabilistic scores

Initial classification: match, missing or noise

Compute pepHMM scores (discussed later)

Compute Z-score

Compute E-score
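The initial classification step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the PepHMM implementation: the function name `classify_peaks`, the 0.5 Da default tolerance and the nearest-peak matching rule are all hypothetical. Each theoretical ion is either matched to an observed peak (giving a mass error and an intensity) or reported as missing, and observed peaks that match nothing are treated as noise.

```python
def classify_peaks(theoretical, observed, tol=0.5):
    """Split ions and peaks into matches, missing ions and noise peaks.

    theoretical: dict mapping ion label -> theoretical m/z
    observed:    list of (m/z, intensity) peaks
    tol:         assumed match tolerance in Da (hypothetical default)
    """
    matches, missing, used = {}, [], set()
    for label, mz in theoretical.items():
        # Unused observed peaks that fall inside the tolerance window.
        candidates = [
            (abs(obs_mz - mz), k)
            for k, (obs_mz, _) in enumerate(observed)
            if k not in used and abs(obs_mz - mz) <= tol
        ]
        if candidates:
            _, k = min(candidates)          # closest peak wins
            used.add(k)
            obs_mz, intensity = observed[k]
            matches[label] = (obs_mz - mz, intensity)  # (mass error, intensity)
        else:
            missing.append(label)
    noise = [peak for k, peak in enumerate(observed) if k not in used]
    return matches, missing, noise


if __name__ == "__main__":
    ions = {"b1": 88.04, "y1": 147.11, "b2": 145.06}
    peaks = [(88.10, 30.0), (147.00, 80.0), (500.0, 5.0)]
    print(classify_peaks(ions, peaks))
```

The three returned collections correspond to the match, missing and noise sets that the scoring model takes as input.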

Contents of pepHMM Model

PepHMM combines information on the correlation among ions, peak intensity and match tolerance

Input: sets of matches, missing ions and noise

Model is based on b and y ions

Each match is associated with an observation (T, I): match tolerance T and peak intensity I

Observation state = observed (T, I)

Hidden state = true assignment of the observations

Model Structure

Four possible assignments corresponding to four hidden states

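Model Computation

Goal: find the highest-scoring peptide in the database,

$$\max_{p \in D} \Pr(s \mid p, \theta).$$

Let a path in the HMM, $\pi = \pi_1 \pi_2 \ldots \pi_n$, represent a configuration of states. The probability of the path is

$$\Pr(s, \pi \mid p, \theta) = \Pr(M, \pi \mid \theta) = \prod_{i=0}^{n-1} a_{\pi_i, \pi_{i+1}} \, e_{\pi_{i+1}}^{(i+1)} \times (\text{noise factor})^{\#\text{noise}}$$

where $a_{u,v}$ is the transition probability from state $u$ to state $v$, $e_v^{(i)}$ is the emission probability of state $v$ at position $i$, $\pi_0$ is the start state, and $\#\text{noise}$ is the number of noise peaks. $\theta$ denotes the model parameters: the transition probabilities $a$ together with the distribution parameters for the b- and y-ion observations.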

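Model Computation…

Considering all possible paths:

$$\Pr(s \mid p, \theta) = \Pr(M \mid \theta).$$

Forward algorithm: $f_v^{(i)}$ is the probability of all possible paths from the first position to state $v$ at position $i$,

$$f_v^{(i)} = \Pr(M_1, \ldots, M_i, \pi_i = v \mid \theta),$$

computed recursively as

$$f_v^{(i+1)} = e_v^{(i+1)} \sum_{u} f_u^{(i)} \, a_{u,v},$$

where $a_{u,v}$ is the transition probability from state $u$ to state $v$ and $e_v^{(i)}$ is the emission probability of state $v$ at position $i$. Summing over the final states gives

$$\Pr(M \mid \theta) = \sum_{v} f_v^{(n)}.$$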
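The forward recursion can be written compactly in Python. This is a generic sketch of the standard forward algorithm rather than the PepHMM code: the function name and argument layout are assumptions, and the `init` vector plays the role of the transitions out of the start state. Emission probabilities are passed in as precomputed numbers, one row per position.

```python
def forward(init, trans, emit):
    """Total probability of the observations, summed over all state paths.

    init[v]     : probability of entering state v at the first position
    trans[u][v] : transition probability a_{u,v}
    emit[i][v]  : emission probability of state v at position i
    """
    n_states = len(init)
    # f[v]: probability of all paths that end in state v at the current position
    f = [init[v] * emit[0][v] for v in range(n_states)]
    for i in range(1, len(emit)):
        f = [
            emit[i][v] * sum(f[u] * trans[u][v] for u in range(n_states))
            for v in range(n_states)
        ]
    return sum(f)


if __name__ == "__main__":
    # Four hidden states, three positions; all numbers are made up.
    init = [0.4, 0.3, 0.2, 0.1]
    trans = [[0.25, 0.25, 0.25, 0.25]] * 4
    emit = [[0.5, 0.1, 0.2, 0.9], [0.3, 0.3, 0.6, 0.2], [0.8, 0.4, 0.1, 0.5]]
    print(forward(init, trans, emit))
```

The cost grows linearly with the number of positions, whereas enumerating every path explicitly grows exponentially. For long peptides a practical implementation would typically work in log space or rescale `f` at each step to avoid underflow.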

Emission Probabilities

Probability of observing $(T_b, I_b)$ and $(T_y, I_y)$ for state 1 at position $i$:

$$e_1^{(i)} = \Pr\big((T_b, I_b), (T_y, I_y)\big) = \Pr\big((T_b, I_b)\big)\,\Pr\big((T_y, I_y)\big) = \Pr(T_b)\,\Pr(I_b)\,\Pr(T_y)\,\Pr(I_y)$$

$\Pr(T_b)$, $\Pr(T_y)$: normal distribution

$\Pr(I_b)$, $\Pr(I_y)$: exponential distribution
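As an illustration of the factorised emission probability, the sketch below multiplies a normal density for each match tolerance by an exponential density for each intensity. The parameter names (`mu`, `sigma`, `rate_b`, `rate_y`) are hypothetical, and sharing one normal distribution between b and y ions is an assumption made only for this example. The slides state only the distribution families.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def exponential_pdf(x, rate):
    """Density of an exponential distribution with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def emission_both_observed(t_b, i_b, t_y, i_y, mu, sigma, rate_b, rate_y):
    """Emission for the state where both the b and the y ion are observed:
    Pr(T_b) * Pr(I_b) * Pr(T_y) * Pr(I_y), treating the four terms as independent."""
    return (normal_pdf(t_b, mu, sigma) * exponential_pdf(i_b, rate_b)
            * normal_pdf(t_y, mu, sigma) * exponential_pdf(i_y, rate_y))


if __name__ == "__main__":
    # Made-up mass errors (Da), intensities and parameters.
    print(emission_both_observed(0.05, 0.8, -0.10, 1.5,
                                 mu=0.0, sigma=0.2, rate_b=1.0, rate_y=0.5))
```

Because these are densities rather than probabilities, individual values can exceed 1. Only their relative sizes across candidate peptides matter for ranking.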

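MOWSE Scoring System

$$m_{i,j} = \frac{f_{i,j}}{\max\limits_{\text{in column } j} f_{i,j}}$$

$$\text{Score} = \frac{50{,}000}{M_{\text{Prot}} \times \prod_{n} m_{i,j}}$$

where $m_{i,j}$ are the elements of the MOWSE frequency matrix, $f_{i,j}$ are the raw frequencies, $M_{\text{Prot}}$ is the molecular mass of the candidate protein, and the product runs over the $n$ matched peptides.

The MOWSE algorithm, with probability-based extensions, is implemented in the MASCOT software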
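The two MOWSE formulas translate directly into code. This is a minimal sketch with hypothetical function names and made-up numbers. It assumes the frequency matrix is given as a list of rows, that every column has at least one non-zero entry, and that the matched cells (i, j) have already been determined.

```python
def normalise_columns(freq):
    """Divide each entry f[i][j] by the largest entry in its column j."""
    n_cols = len(freq[0])
    col_max = [max(row[j] for row in freq) for j in range(n_cols)]
    return [[row[j] / col_max[j] for j in range(n_cols)] for row in freq]


def mowse_score(m, matched_cells, protein_mass):
    """Score = 50,000 / (M_Prot * product of m[i][j] over the matched cells)."""
    product = 1.0
    for i, j in matched_cells:
        product *= m[i][j]
    return 50_000 / (protein_mass * product)


if __name__ == "__main__":
    freq = [[2, 1], [4, 4]]            # toy frequency counts
    m = normalise_columns(freq)        # [[0.5, 0.25], [1.0, 1.0]]
    print(mowse_score(m, [(0, 0), (0, 1)], protein_mass=25_000))
```

Rare matches (small normalised frequencies) shrink the denominator and so raise the score, which is the intended behaviour of the formula.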

Data Sets

ISB data set:

1. A,B mixtures of 18 different proteins with modifications/relative amounts

2. Analysed using SEQUEST and other in-house Software

3. Data set is curated

4. Final data set with charge 2+ for trypsin digestion contains 857 spectra

5. 5-fold cross validation by random selection: training set of 687 spectra, testing set of 170 spectra

6. EM algorithm is used for estimating parameters
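The random training/testing selection in step 5 can be sketched as below. The function name and the fixed seed are assumptions for illustration, and the sketch draws a single random hold-out of 170 spectra rather than reproducing the exact selection procedure used in the study.

```python
import random


def random_split(n_spectra, n_test, seed=0):
    """Randomly assign spectrum indices to a training and a testing set."""
    indices = list(range(n_spectra))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test


if __name__ == "__main__":
    train, test = random_split(857, 170)
    print(len(train), len(test))
```

Repeating the call with different seeds gives independent random selections of training and testing sets.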

Results: Distributions of Ions

[Figure panels: b and y ions; Noise; Match Tolerance; Parameter estimates]

Comparative Studies

Data set selection is repeated 10 times to select both training and test data sets

For each group, the parameters have similar values

A prediction is considered correct if the peptide has the highest score

Independent Data Set

A.Y's Lab: an independent data set for comparison with other tools such as SEQUEST and MASCOT

Size of data set = 20,980 spectra

False/True Positive Rates

Summary

Developed a probabilistic scoring function called pepHMM for improving protein identification

PepHMM outperforms other tools such as MASCOT, with a low false positive rate (always?)

Can it handle ion types other than b and y ions?

Need to handle post-translational modifications
