Min Zhang, MD PhD Purdue University Joint work with Yanzhu Lin, Dabao Zhang


Outline

Data Summary

Methods

Data Analysis Procedure

Preliminary Results

Preprocessing GC x GC-MS Data

Methods

CCE Data Summary

Phenotype summary of the currently available data for the CCE project:

Data set            Healthy  Colon Cancer  Rectal Cancer  Polyp  NA  Total
Lipidomics (Lipid)       22            10              2     12   0     46
GProteomics (GP)         33             9              2     20   1     65
NMR                      25             2              1     23   2     53
Teac                     54            17              5     41   2    119
Comet                    27            12              4     12   0     55

Summary of Data Set Overlap

Overlap between any 2 data sets:

        Lipid   GP  NMR  Teac  Comet
Lipid      46   41    0    46     41
GP         41   65   17    63     43
NMR         0   17   53    52      2
Teac       46   63   52   119     55
Comet      41   43    2    55     55

Overlap among any 3 data sets:

Lipid & GP & Teac       41
Lipid & GP & Comet      37
Lipid & Teac & Comet    41
GP & NMR & Teac         16
GP & Teac & Comet       43
NMR & Teac & Comet       2

Overlap among any 4 data sets:

Lipid & GP & Teac & Comet  37
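Overlap counts of this kind can be obtained by intersecting the sets of sample IDs measured on each platform. The following is a minimal sketch only: the sample IDs and the helper name `overlap_counts` are hypothetical and are not taken from the CCE data.

```python
from itertools import combinations

# Hypothetical sample IDs per data set (illustrative only, not the CCE data).
samples = {
    "Lipid": {"S01", "S02", "S03"},
    "GP": {"S02", "S03", "S04"},
    "NMR": {"S04", "S05"},
}


def overlap_counts(sample_sets, k):
    """Count shared sample IDs for every k-way combination of data sets."""
    counts = {}
    for names in combinations(sorted(sample_sets), k):
        shared = set.intersection(*(sample_sets[name] for name in names))
        counts[names] = len(shared)
    return counts


pairwise = overlap_counts(samples, 2)   # overlap between any 2 data sets
three_way = overlap_counts(samples, 3)  # overlap among any 3 data sets
```

The keys of each result are tuples of data set names in alphabetical order, so the same function covers the 2-, 3- and 4-way summaries by changing `k`.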

Overlap of Different Omics Data

Methods for Integrating Omics

Common methods:

- Principal Component Analysis (Jolliffe, I., 1986)

- Co-Inertia Analysis (Doledec, S. and Chessel, D., 1994)

- Partial Least Squares (Wold, H., 1966)

- Bayesian Analysis method (Webb-Robertson et al., 2009)

Our methods:

We use the iteratively weighted partial least squares (IWPLS) method to fit a model to each individual data set, and then use a Bayesian method to integrate the results from the individual data sets.
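The slides do not spell out the Bayesian integration step. As an illustration of the general idea only, and not of the authors' procedure, the sketch below combines per-data-set class probabilities under two stated assumptions: the data sets are conditionally independent given the class, and they share the same prior. The function name `integrate_probabilities` and the example values are hypothetical.

```python
def integrate_probabilities(probs, prior=0.5, eps=1e-12):
    """Combine P(class = 1 | data set i) across data sets.

    Assumes (not from the slides) conditional independence of the data sets
    given the class and a common prior probability `prior`.
    """
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for p in probs:
        # Clip to avoid division by zero when a model returns exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        # Each data set contributes its likelihood ratio: posterior odds / prior odds.
        odds *= (p / (1.0 - p)) / prior_odds
    return odds / (1.0 + odds)


# Hypothetical per-subject probabilities from two platform-specific models.
p_nmr, p_gp = 0.8, 0.6
p_integrated = integrate_probabilities([p_nmr, p_gp])
```

Under these assumptions, two models that both lean toward the same class reinforce each other, while an uninformative model (probability equal to the prior) leaves the combined probability unchanged.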

Overlap Between NMR and G-Proteomics

NMR: 53 samples

Global Proteomics: 65 samples

Overlap: 17 samples

- One sample: without phenotype information

- One sample: from blood draw 2

- 15 samples: all from blood draw 1, with phenotype either "Healthy Control" or "Polyp"

Data Analysis Procedure

Metabolomics (NMR) -> Data preprocessing (ending with 1824 variables) -> IWPLS method

Global Proteomics -> Data preprocessing (ending with 5407 variables) -> IWPLS method

Both branches -> Integrate results

Analysis Results

Our method:

[Figure: probability (0.0 to 1.0) for each subject (1 to 14); series: True, GProteomics, NMR, Integrate]

Analysis Results (cont.)

Summary:

Data                              Classification rate
GProteomics                       100%
NMR                               85.7%
Integrated NMR and GProteomics    100%

Other methods tried:

- PLS: ending with 0 components

- Univariate t-test: no variable is significant

Example: Overlap of Three Data Sets

For the overlap among three data sets, we focus on the overlap among Lipidomics, Teac and Comet.

Data summary:

- Phenotype summary:

Phenotype     Healthy  Polyp  Colon  Rectal  Total
Sample size        20     10      9       2     41

- Variable summary:

                     Lipidomics  Teac  Comet
Number of variables          52     1      2

Data analysis: we group the colon cancer and rectal cancer patients together as one cancer group, while keeping the other two groups (Healthy and Polyp) separate. Then we try the following methods:

Method 1: POCRE

Method 2: ANOVA test
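For the ANOVA test, each variable is compared across the three groups (Healthy, Polyp, Cancer) with a one-way F statistic. The sketch below computes that statistic for a single variable in plain Python; the measurements and the function name `one_way_anova_f` are hypothetical and serve only to show the calculation.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical values of one variable in the three phenotype groups.
healthy = [1.0, 2.0, 3.0]
polyp = [2.0, 3.0, 4.0]
cancer = [5.0, 6.0, 7.0]
f_stat = one_way_anova_f([healthy, polyp, cancer])
```

The F statistic is compared against an F distribution with (k - 1, n - k) degrees of freedom to obtain a p-value; with 55 variables tested, a multiple-testing correction would normally be applied as well.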

Results

Misclassification rate:

POCRE  ANOVA
17%    39%

Variables identified:

POCRE
  Lipids: 18:1 LPG, LPE18:1, 20:4 LPI, SPC
  Teac: TEAC_mM

ANOVA
  Lipids: LPE18:1
  Teac: TEAC_mM

Preprocessing GC x GC-MS: Methods

How to choose the reference sample for alignment?

- Choose the chromatogram in the middle of the run sequence, or the chromatogram containing the highest number of common chemical constituents (i.e. peaks).

- Choose the chromatogram that is most similar to the loading of the first principal component in a PCA model on the unaligned data, or simply to the mean of all chromatograms.

Similarity index method for choosing the reference sample: for a given chromatogram $x_t$, the similarity index is defined as

\[
\text{Similarity Index} = \prod_{i=1}^{I} \left| r(x_t, x_i) \right|,
\]

where

\[
r(x_t, x_i) = \frac{\sum_{j=1}^{J} \bigl(x_t(j) - \bar{x}_t\bigr)\bigl(x_i(j) - \bar{x}_i\bigr)}{\Bigl(\sum_{j=1}^{J} \bigl(x_t(j) - \bar{x}_t\bigr)^2\Bigr)^{1/2} \Bigl(\sum_{j=1}^{J} \bigl(x_i(j) - \bar{x}_i\bigr)^2\Bigr)^{1/2}},
\]

with $I$ the number of chromatograms and $J$ the number of data points per chromatogram.

The chromatogram with the maximum similarity index is chosen as the reference sample.

Ref: Skov, T. et al., Automated Alignment of Chromatographic Data, Journal of Chemometrics, Vol. 20, Issue 11-12, pp. 484-497, 2007.
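The similarity index selection can be sketched in a few lines of Python. This is a minimal illustration, assuming the index of a candidate chromatogram is the product, over all chromatograms, of the absolute Pearson correlation with the candidate; the toy signals and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length signals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * math.sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    return num / den


def similarity_index(t, chromatograms):
    """Product of |r(x_t, x_i)| over all chromatograms x_i (assumed form)."""
    return math.prod(abs(pearson_r(chromatograms[t], x)) for x in chromatograms)


def choose_reference(chromatograms):
    """Return the index with the largest similarity index, plus all scores."""
    scores = [similarity_index(t, chromatograms) for t in range(len(chromatograms))]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy chromatograms (hypothetical): two similar peaks and one shifted peak.
chromatograms = [
    [0.0, 1.0, 4.0, 1.0, 0.0],
    [0.0, 1.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 4.0, 1.0],
]
reference, scores = choose_reference(chromatograms)
```

Because the index multiplies the correlations, a chromatogram that correlates poorly with even one other sample is penalised heavily, which favours a reference that resembles all samples reasonably well.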

Results
