24
Independent Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics Data 10.04.2008

Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Embed Size (px)

Citation preview

Page 1: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Independent Component Analysis (ICA)for the extraction of protein profiles fromMALDI-TOF MS spectra

Gopal PeddintiT.61-6070 Modeling of Proteomics Data

10.04.2008

Page 2: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Overview

� MALDI-TOF Mass Spectrometer

� Independent Component Analysis

� MALDI-TOF MS Data� Simulated Data

� Data from Inflammatory auto-immune disease patients

� Analysis of Spectra� Independent Components

� Characterization of ICs

� Smoothing� Baseline subtraction

� Removal of residual noise

� Peak picking� Biomarker identification

� Summary and Comments

Page 3: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

MALDI-TOF Mass Spectrometer

Page 4: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

ICA: Introduction

Page 5: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

ICA Model

� Generative model� Describes how observed data are generated by a process of

mixing underlying signals si

� si must be independent� si are called independent components

� When m ≤ n and mixing matrix (A) is full-column rank, one can determine unmixing matrix W such that S = WX

mnij

Tm

Tn

aA

ssS

xxX

ASX

×==

=

=

][matrix Mixing

],...,[ :signals underlying ofMatrix

],...,[ :signals observed ofMatrix

:ModelICA

1

1

Page 6: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

ICA: Preprocessing for ICA

�� CenteringCentering and WhiteningWhitening� Centering: Make each observation zero-mean� Whitening: Linear transformation which makes observations

uncorrelated and with unit variance� Common approach: Eigen value decomposition of the covariance matrix

T

T

n

T

i

n

T

TT

EEDWW

W

XWXEDWEWXS

ddddiagD

XEEDX

d

ddddiagD

XXEE

EDEXXE

21

21

21

21

21

21

21

:matrix unmixing required eFinally th

freedom) of degrees of no.(lower orthogonal is

,

),...,,( where

:matrix Whitened

eseigen valu :

),...,,( :

}{ of rseigenvecto ofmatrix orthogonal : where

,}{

21

21

−−−−

=

===⇒

=

=

=

Page 7: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

FastICA Algorithm

� ICA model can be estimated iff ICs are non-Gaussian� Estimation principle: ICs are maximally non-Gaussian components� Kurtosis (4th order cumulant) is a measure for non-gaussianity

224 }){(3}{)(

variablerandommean -zero :

yEyEykurt

y

−=

� Basic optimization technique: Gradient method� Fixed-point algorithm for optimization

� Find maxima of non-gaussianity using the absolute value of kurtosis

Page 8: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

ICA: Post processing

� After ICA decomposition A can be obtained from W as follows:TT WWWA 1)( −=

� Power of ith IC can be computed as follows:

∑=

=n

jiji ap

1

2

Page 9: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Simulated Data

( )

τσ

σ

στ

στ

πττσ

τ

p

p

p

p

zz

p

p

hpp

h

z

A

z

dtzzzAzx

−=

←←

−−=

∞−2

2

asymmetry of measure :

]18000,6000[ axis m/zon peak theofposition thedetermines :

0189.0peak theof tailing thecontrols :

0172.0decay lexponentia theofconstant time:

]700,180[peak theof area :

valuem/z :

2exp

2

12exp)(

0

20

40 theoretical protein profiles:

30: single peaks6: sum of 2 peaks4: sum of 3 peaks

matrix 60 kX syn ×=Linear mixing

Fixed relative abundance

Decreasing baseline+ non-unform background noise

add

Page 10: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Simulated MALDI-TOF MS Data

Page 11: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

MALDI-TOF MS Experiments

30 :spectra of No

matrix

=×=

n

knX ser

30 :spectra of No

matrix

=

×=

n

knX plas

30 patients inflammatory auto-immune disease

Plasma SamplesSerum Samples

1. Sample PreparationSample Preparation2. Data acquisitionData acquisition (MALDI-TOF MS)

1. Sample PreparationSample Preparation2. Data acquisitionData acquisition (MALDI-TOF MS)

Signal calibrationSignal calibration using internalcalibrants Peaks derived from mono-and bi- charged myoglobin andhemoglobin ions

Signal calibrationSignal calibration using internalcalibrants Peaks derived from mono-and bi- charged myoglobin andhemoglobin ions

Spectra normalizedSpectra normalized by peak intensity ofmono-charged myoglobin ion

Spectra normalizedSpectra normalized by peak intensity ofmono-charged myoglobin ion

matrix Combined

2 knX tot ×=

Page 12: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

MALDI-TOF MS Data from Experiments

Page 13: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Analysis of Spectra

ICA Decomposition:ICA Decomposition:FastICA, Fixed-point algorithmtotplassersyn XXXX or ,,,

:Input

Peak detection:Peak detection:LIMPIC (noise threshold 10σ)1. Smoothing2. Baseline subtraction3. Measurement of Noise

4. Peak picking(Local maxima)

ICs andMixing matrix (Amplitudes)

ICs with Signals,ICs showing artifacts, and Peaks

Biomarker identificationBiomarker identificationMann Whitney U-test,Benjamini-Hochberg multipletesting correction

Page 14: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Peak detection: Characterization of ICs.1. Smoothing� Signal enhancement� Reduction of chemical and electronic noise� Smoothing performed using Kaiser filter with smoothing

factor p set to cover a range of 5 Da.

Page 15: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Peak detection: Characterization of ICs.2. Baseline subtraction� Baseline drift c locally estimated from signal blocks

having width of 150 Da� For each of them, average intensity (a.i.) was calculated so that

a vector w of amplitude values was generated� w was associated to the vector b of m/z values corresponding to

the central point of each interval� Components of w with rapid intensity variations were considered

to be out of the baseline. They were discarded� Baseline drift calculated from the remaining (bi,wi) by linear

interpolation. Then removed from the spectrum

Page 16: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Peak detection: Characterization of ICs.3. Removal of residual noise� Residual noise level σ

� Calculate SD of the values included in the blocks (width 150 Da). Call them gk

� Now calculate σ by polynomial interpolation of the points (bk,gk)

Page 17: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Peak detection: Characterization of ICs4. Peak picking� Local maxima: point of highest intensity among the ±f

nearest points is the peak in that neighbourhood� f = 2. Covers a range of 0.5 Da� Peaks with intensity lower than 10σ are eliminated from

the peak list

Page 18: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

IC waveforms of simulated data

Page 19: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

IC waveforms of experiment data

IC #1:Component with the largest power

IC #4:Signal. Outlier peak

IC #19:Myoglobin protein(16952.25 Da)

IC #35, #45:• Double peak components• Differentially expressed between

Plasma and serum (P < 0.05)

IC #53:• Biological artifact (no peak above the

noise level detected)• Amplitudes significantly different

(P < 0.001)

Page 20: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Biomarker identification

Page 21: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Performance comparisonof peak identification algorithms

HitHit--raterate: Ratio between number of peaks using multi-subject data and the average number of peaks detected in the single spectraHit-rate = 1 means no false positives

Page 22: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Summary

� MALDI-TOF Mass spectra are contaminated by biological and physical artifacts

� ICA extracted protein signals from calibrated and normalized spectra

� Background noise and outlier peaks could be identified� Real protein signals showed same peaks contained in

mass spectra with increased signal-to-noise ratio� Can be integrated with existing peak detection methods

to enhance their effectiveness� ICA does not need any parameter tuning for separating

protein peaks from noise

Page 23: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

Comments

� Optimal number of independent signals is unknown� ≤ number of mass spectra according to typical ICA model� Prior dimensionality reduction can perhaps help

� False positives (hit-rate: does that make sense?)� They indeed assume the absence of false positives… and the

paper states it!!!� Why not directly count them for synthetic data?

� Biomarkers: as such are they meaningful here?

Page 24: Independent Component Analysis (ICA) for the … Component Analysis (ICA) for the extraction of protein profiles from MALDI-TOF MS spectra Gopal Peddinti T.61-6070 Modeling of Proteomics

References

� Mantini D, Petrucci F, Boccio PD, Pieragostino D, Nicola MD, Lugaresi A, Federici G, Sacchetta P, Ilio CD, UrbaniA (2007) Independent component analysis for the extraction of reliable protein signal profiles from MALDI-TOF mass spectra. Bioinformatics: btm533.

� Mantini D, Petrucci F, Pieragostino D, Del Boccio P, Di Nicola M, Di Ilio C, Federici G, Sacchetta P, Comani S, Urbani A (2007) LIMPIC: a computational method for the separation of protein MALDI-TOF-MS signals from noise. BMC Bioinformatics 8: 101.