PepHMM: A Hidden Markov Model Based Scoring Function for Mass Spectrometry Database Search Laxman Yetukuri T-61.6070: Modeling of Proteomics Data

Outline

Motivation

Basics: MS and MS/MS for Protein Identification

Computational Framework of Database Search

Scoring Algorithms: PepHMM and MOWSE

Results

Summary

Motivation

Proteomics studies are dynamic and context-sensitive

Speed and accuracy of omics-driven methods

High-throughput MS-based approaches

Real analysis starts with protein identification

Protein identification is challenging

The heart of a protein identification algorithm is its scoring function

Protein Identification Is Challenging

Sample Contamination

Imperfect Fragmentation

Post-translational Modifications

Low signal-to-noise ratio

Machine errors

Basics: MS and MS/MS for protein Identification

Trypsin Digest

Mass Spectrometry

Liquid Chromatography

Precursor selection + collision-induced dissociation (CID)

MS/MS

Computational Problem

Nesvizhskii and Aebersold, Drug Discovery Today, 2004, 9, 173-181

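Peptide Fragmentation: b & y ions

[Figure: peptide backbone -HN-CH(R')-CO-NH-CH(R'')-CO-NH- with side chains R' (residue i) and R'' (residue i+1). Cleavage of the peptide bond after residue i gives the fragments b_i and y_{n-i}; cleavage after residue i+1 gives b_{i+1} and y_{n-i-1}.]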

Peptide Fragmentation: b & y ions …

Example peptide SGFLEEDELK, fragment m/z values:

Residue   b ion   y ion
S            88    1166
G           145    1080
F           292    1022
L           405     875
E           534     762
E           663     633
D           778     504
E           907     389
L          1020     260
K          1166     147

[Figure: MS/MS spectrum, % intensity (0 to 100) against m/z (250 to 1000), with peaks labelled y2 to y9 and b3 to b9]
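The b and y values listed for SGFLEEDELK can be approximated from residue masses. The following is a minimal sketch, assuming standard monoisotopic residue masses and singly charged fragments; the function names `b_ions` and `y_ions` are illustrative and not taken from the slides. The slide values appear to be nominal (integer) masses, so agreement is only expected to within about 1 Da.

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.007276   # mass of the charge-carrying proton
WATER = 18.010565   # H2O, present on every y ion


def b_ions(peptide):
    """Singly charged b-ion m/z values b1..bn (N-terminal prefix + proton)."""
    masses, total = [], 0.0
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        masses.append(total + PROTON)
    return masses


def y_ions(peptide):
    """Singly charged y-ion m/z values y1..yn (C-terminal suffix + water + proton)."""
    masses, total = [], 0.0
    for aa in reversed(peptide):
        total += RESIDUE_MASS[aa]
        masses.append(total + WATER + PROTON)
    return masses


if __name__ == "__main__":
    pep = "SGFLEEDELK"
    for i, (b, y) in enumerate(zip(b_ions(pep), y_ions(pep)), start=1):
        print(f"b{i} = {b:8.2f}    y{i} = {y:8.2f}")
```

Note that the value 1166 listed next to K in the b series matches the full protonated peptide (equal to y10 here), whereas the b10 computed by this sketch omits the water and is about 18 Da lower.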

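Peptide Fragmentation with other ions

[Figure: peptide backbone -HN-CH(R')-CO-NH-CH(R'')-CO-NH- with side chains R' (residue i) and R'' (residue i+1), showing the cleavage sites for the fragment pairs a_i / x_{n-i}, b_i / y_{n-i} and c_i / z_{n-i}, together with b_{i+1} / y_{n-i-1}.]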

Peptide Identification

Two main methods for tandem MS:

De novo interpretation

Sequence database search

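De Novo Interpretation

[Figure: MS/MS spectrum, % intensity (0 to 100) against m/z (250 to 1000), with the gaps between neighbouring peaks labelled by amino acid residues (S, G, F, L, E, E, D, E, L, K)]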
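De novo interpretation reads residues off the mass differences between neighbouring peaks of one ion series. The sketch below illustrates this on an idealised y-ion ladder; the peak list, the 0.02 Da tolerance and the function name `read_ladder` are assumptions made for illustration, and real spectra would also contain noise peaks and gaps.

```python
# Monoisotopic residue masses (Da); I and L are isobaric and share one entry.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "I/L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tol=0.02):
    """Label each gap between consecutive peaks with the residue(s) it matches."""
    labels = []
    ordered = sorted(peaks)
    for low, high in zip(ordered, ordered[1:]):
        gap = high - low
        hits = [aa for aa, m in RESIDUE_MASS.items() if abs(gap - m) <= tol]
        labels.append("/".join(hits) if hits else "?")  # "?" marks an unexplained gap
    return labels


if __name__ == "__main__":
    # Idealised y1..y10 ladder for a 10-residue peptide (hypothetical input).
    y_ladder = [147.11, 260.20, 389.24, 504.27, 633.31,
                762.35, 875.44, 1022.50, 1079.53, 1166.56]
    print(read_ladder(y_ladder))
```

Because a y ladder grows from the C-terminus, the labels come out in reverse sequence order; the lowest peak itself (y1) identifies the C-terminal residue.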

Sequence Database Search

Widely used approach

Compares peptides from a protein sequence database with experimental spectra

A scoring function summarises the comparison

Critical for any search engine

Score each peptide against spectrum

Cross correlation (SEQUEST)

MOWSE scoring and its extensions (MASCOT)

Probabilistic scoring systems (OMSSA, OLAV, ProbID, ...)

PepHMM is an HMM-based probabilistic scoring function

Computational Framework for pepHMM

MSDB based peptide extraction

Hypothetical spectrum generation

b, y, y-H2O, b-H2O, b2+ and y2+ ions

Computing probabilistic scores

Initial classification: match, missing or noise

Compute pepHMM scores (discussed later)

Compute Z-score

Compute E-score
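The first two steps of this framework, generating a hypothetical spectrum and sorting peaks into matches, missing ions and noise, can be sketched as follows. This is an illustration under stated assumptions rather than the pepHMM implementation: the residue masses are standard monoisotopic values, while the 0.5 Da tolerance, the nearest-peak rule and the names `theoretical_ions` and `classify` are hypothetical.

```python
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def theoretical_ions(peptide):
    """Return {label: m/z} for b, y, b-H2O, y-H2O, b2+ and y2+ ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total, n = sum(masses), len(peptide)
    ions, prefix = {}, 0.0
    for i in range(1, n):                      # one entry per backbone cleavage
        prefix += masses[i - 1]
        b = prefix + PROTON                    # b_i, charge 1+
        y = total - prefix + WATER + PROTON    # y_(n-i), charge 1+
        j = n - i
        ions[f"b{i}"] = b
        ions[f"y{j}"] = y
        ions[f"b{i}-H2O"] = b - WATER
        ions[f"y{j}-H2O"] = y - WATER
        ions[f"b{i}(2+)"] = (b + PROTON) / 2   # doubly charged: add a proton, halve
        ions[f"y{j}(2+)"] = (y + PROTON) / 2
    return ions


def classify(observed_mz, ions, tol=0.5):
    """Split into matches (ion -> nearest peak), missing ions and noise peaks."""
    matches, used = {}, set()
    for label, mz in ions.items():
        close = [p for p in observed_mz if abs(p - mz) <= tol]
        if close:
            best = min(close, key=lambda p: abs(p - mz))
            matches[label] = best
            used.add(best)
    missing = [label for label in ions if label not in matches]
    noise = [p for p in observed_mz if p not in used]
    return matches, missing, noise
```

For each match, the difference between the observed and theoretical m/z and the peak intensity are the kinds of quantities a scoring function can then model.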

Contents of pepHMM Model

PepHMM combines information on correlation among the ions, peak intensity and match tolerance

Input: sets of matches, missing ions and noise peaks

Model is based on b and y ions

Each match is associated with an observation (T, I): match tolerance T and peak intensity I

Observation state = observed (T, I)

Hidden state = true assignment of the observations

Model Structure

Four possible assignments corresponding to four hidden states

Model Computation

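Goal: find the highest-scoring peptide p in the database D for spectrum s

$$\max_{p \in D} \Pr(p, s)$$

Let a path $\pi = \pi_1 \pi_2 \dots \pi_n$ in the HMM represent a configuration of states. The probability of the path is

$$\Pr(s, p, \pi) = \Pr(\pi, M) = \Pr(\text{noise})^{\#\text{noise}} \prod_{i=0}^{n-1} a_{\pi_i, \pi_{i+1}} \, e_{\pi_{i+1}}(i+1)$$

where M is the set of matches, $\pi_0$ is the begin state, $a_{u,v}$ are the transition probabilities and $e_v(i)$ is the emission probability of state v at position i

Model parameters: the normal-distribution parameters for match tolerance, the exponential-distribution parameters for b-ion and y-ion intensity, and the transition probabilities a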

Model Computation…

Considering all possible paths:

$$\Pr(s, p) = \sum_{\pi} \Pr(\pi, M)$$

Forward algorithm: $f_v(i)$ is the probability of all possible paths from the first position to state v at position i. With transition probabilities $a_{u,v}$ and emission probabilities $e_v(i)$:

$$f_v(i+1) = e_v(i+1) \sum_u f_u(i) \, a_{u,v}$$

$$\Pr(M) = \sum_v f_v(n)$$
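The forward algorithm replaces the sum over all paths with a recursion over positions. The sketch below shows the generic HMM version on toy numbers and checks it against explicit path enumeration. It is not the pepHMM code: the two-state model, the probabilities and the names `forward` and `brute_force` are illustrative assumptions, and any separate noise factor is left out.

```python
from itertools import product


def forward(start, trans, emit):
    """Sum the probability of every state path through an HMM.

    start[v]    : probability of entering state v from the begin state
    trans[u][v] : transition probability a_{u,v}
    emit[i][v]  : emission probability of state v for the observation at
                  position i (0-based), precomputed from the data
    """
    states = range(len(start))
    f = [start[v] * emit[0][v] for v in states]              # first position
    for i in range(1, len(emit)):
        f = [emit[i][v] * sum(f[u] * trans[u][v] for u in states)
             for v in states]                                # recursion step
    return sum(f)                                            # sum over end states


def brute_force(start, trans, emit):
    """Same quantity by enumerating every path explicitly (for checking only)."""
    n, k = len(emit), len(start)
    total = 0.0
    for path in product(range(k), repeat=n):
        p = start[path[0]] * emit[0][path[0]]
        for i in range(1, n):
            p *= trans[path[i - 1]][path[i]] * emit[i][path[i]]
        total += p
    return total


if __name__ == "__main__":
    start = [0.6, 0.4]
    trans = [[0.7, 0.3], [0.4, 0.6]]
    emit = [[0.5, 0.1], [0.4, 0.3], [0.2, 0.9]]
    print(forward(start, trans, emit), brute_force(start, trans, emit))
```

Enumeration costs k^n path evaluations for k states and n positions, while the recursion needs on the order of n·k² multiplications, which is why the forward algorithm is used for scoring.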

Emission Probabilities

Probability of observing $(T_b, I_b)$ and $(T_y, I_y)$ for state 1 at position i:

$$e_1(i) = \Pr((T_b, I_b), (T_y, I_y)) = \Pr((T_b, I_b)) \Pr((T_y, I_y)) = \Pr(T_b) \Pr(I_b) \Pr(T_y) \Pr(I_y)$$

$\Pr(T_b), \Pr(T_y)$: normal distribution

$\Pr(I_b), \Pr(I_y)$: exponential distribution
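The factorised emission for state 1 can be written as a product of four densities. The following is a minimal sketch, assuming one shared normal distribution for both match tolerances and separate exponential rates for b-ion and y-ion intensity; the parameter names (`mu`, `sigma`, `rate_b`, `rate_y`) and the function name `emission_state1` are hypothetical, and the values returned are densities rather than probabilities in the strict sense.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution, used here for match tolerance T."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def exponential_pdf(x, rate):
    """Density of an exponential distribution, used here for intensity I."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def emission_state1(t_b, i_b, t_y, i_y, mu, sigma, rate_b, rate_y):
    """Product Pr(T_b) Pr(I_b) Pr(T_y) Pr(I_y) for the state with both ions observed."""
    return (normal_pdf(t_b, mu, sigma) * exponential_pdf(i_b, rate_b)
            * normal_pdf(t_y, mu, sigma) * exponential_pdf(i_y, rate_y))


if __name__ == "__main__":
    # Toy observation: small mass errors, moderate intensities (made-up values).
    print(emission_state1(0.1, 0.8, -0.05, 1.2,
                          mu=0.0, sigma=0.2, rate_b=1.5, rate_y=1.0))
```

The product form relies on the independence assumption in the equation above: tolerance and intensity are treated as independent, as are the b and y observations within the state.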

MOWSE Scoring System

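$$m_{i,j} = \frac{f_{i,j}}{\max(f_{i,j} \text{ in column } j)}$$

$$\text{Score} = \frac{50{,}000}{M_{\text{Prot}} \times \prod_n m_{i,j}}$$

where $m_{i,j}$ are the elements of the MOWSE frequency matrix (the raw frequencies $f_{i,j}$ normalised by their column maximum) and $M_{\text{Prot}}$ is the molecular weight of the protein

The MOWSE algorithm is implemented in the MASCOT software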
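The two MOWSE formulas translate directly into a few lines of code. The sketch below uses a made-up 3x2 frequency matrix and assumes all entries are non-zero; the function names `normalise_columns` and `mowse_score` and the toy values are illustrative, not taken from MOWSE or MASCOT.

```python
from math import prod


def normalise_columns(freq):
    """m[i][j] = f[i][j] / (largest f in column j)."""
    col_max = [max(col) for col in zip(*freq)]
    return [[f / col_max[j] for j, f in enumerate(row)] for row in freq]


def mowse_score(m, matched_cells, protein_mass):
    """Score = 50,000 / (M_Prot * product of m[i][j] over the matched peptides)."""
    return 50_000 / (protein_mass * prod(m[i][j] for i, j in matched_cells))


if __name__ == "__main__":
    # Rows: peptide mass bins i; columns: protein mass bins j (toy counts).
    freq = [[2, 1],
            [4, 3],
            [1, 6]]
    m = normalise_columns(freq)
    # Two matched peptides falling in rows 0 and 1 of protein column 0.
    print(mowse_score(m, [(0, 0), (1, 0)], protein_mass=50_000))
```

Rare peptide masses have small $m_{i,j}$ values, so matching them shrinks the denominator and raises the score more than matching common ones.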

Data Sets

ISB data set:

1. Mixtures A and B of 18 different proteins with modifications/relative amounts

2. Analysed using SEQUEST and other in-house software

3. Data set is curated

4. Final data set with charge 2+ for trypsin digestion contains 857 spectra

5. 5-fold cross-validation by random selection
   - Training set: 687 spectra
   - Testing set: 170 spectra

6. EM algorithm is used for estimating parameters
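The random selection of training and testing spectra can be illustrated with a simple shuffled split. This is a sketch only: the slides do not say how the selection was implemented, and the function name `random_split` and the fixed seed are assumptions.

```python
import random


def random_split(n_spectra, n_test, seed=0):
    """Shuffle spectrum indices and hold out n_test of them for testing."""
    indices = list(range(n_spectra))
    random.Random(seed).shuffle(indices)       # seeded for reproducibility
    return indices[n_test:], indices[:n_test]  # (training, testing)


if __name__ == "__main__":
    train, test = random_split(857, 170)
    print(len(train), len(test))  # 687 170
```

Repeating the call with different seeds gives different random training/testing partitions of the same 857 spectra.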

Results: Distributions of Ions

[Figure panels: b and y ions; noise; match tolerance; parameter estimates]

Comparative Studies

Data set selection is repeated 10 times to select both training and test data sets

For each group, the parameter estimates have similar values

A prediction is considered correct if the correct peptide has the highest score

Independent Data Set

A.Y's lab: another independent data set, used for comparison with other tools such as SEQUEST and MASCOT

Size of data set = 20,980 spectra

False/True Positive Rates

Summary

Developed a probabilistic scoring function called pepHMM for improving protein identification

PepHMM outperforms other tools like MASCOT, with a low false positive rate (always?)

Can it handle ion types other than b and y ions?

Needs to handle post-translational modifications