Min Zhang, MD PhD
Purdue University
Joint work with Yanzhu Lin, Dabao Zhang
Outline
Data Summary
Methods
Data Analysis Procedure
Preliminary Results
Preprocessing GC x GC-MS Data
Methods
CCE Data Summary
Phenotype summary of the data currently available for the CCE project:

Data set            Healthy  Colon Cancer  Rectal Cancer  Polyp  NA  Total
Lipidomics (Lipid)       22            10              2     12   0     46
GProteomics (GP)         33             9              2     20   1     65
NMR                      25             2              1     23   2     53
Teac                     54            17              5     41   2    119
Comet                    27            12              4     12   0     55
Summary of Overlap
Overlap between any 2 data sets:

Dataset  Lipid   GP  NMR  Teac  Comet
Lipid       46   41    0    46     41
GP          41   65   17    63     43
NMR          0   17   53    52      2
Teac        46   63   52   119     55
Comet       41   43    2    55     55

Overlap among any 3 data sets:

Lipid & GP & Teac       41
Lipid & GP & Comet      37
Lipid & Teac & Comet    41
GP & NMR & Teac         16
GP & Teac & Comet       43
NMR & Teac & Comet       2

Overlap among any 4 data sets:

Lipid & GP & Teac & Comet    37
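Overlap counts of this kind can be obtained by intersecting the sample-ID sets of the data sets. The sketch below is only an illustration: the function name `overlap_counts` and the sample IDs are hypothetical, not the CCE data.

```python
from itertools import combinations


def overlap_counts(datasets, k):
    """Count the samples shared by every combination of k data sets.

    datasets maps a data set name to the set of its sample IDs.
    Returns a dict keyed by the (alphabetically sorted) tuple of names.
    """
    counts = {}
    for names in combinations(sorted(datasets), k):
        shared = set.intersection(*(set(datasets[n]) for n in names))
        counts[names] = len(shared)
    return counts


# Hypothetical sample IDs, for illustration only (not the CCE data).
datasets = {
    "Lipid": {"s1", "s2", "s3"},
    "GP": {"s2", "s3", "s4"},
    "Teac": {"s1", "s2", "s3", "s4", "s5"},
}

pairwise = overlap_counts(datasets, 2)  # overlap between any 2 data sets
triple = overlap_counts(datasets, 3)    # overlap among any 3 data sets
```

Calling `overlap_counts(datasets, 4)` on five real data sets would give the 4-way overlaps in the same way.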
Overlap of Different Omics Data
Methods for Integrating Omics
Common methods:
- Principal Component Analysis (Jolliffe, I., 1986)
- Co-Inertia Analysis (Doledec, S. and Chessel, D., 1994)
- Partial Least Squares (Wold, H., 1966)
- Bayesian Analysis method (Webb-Robertson et al., 2009)
Our methods:
We fit a model to each individual data set with the iteratively weighted partial least squares (IWPLS) method, then use a Bayesian method to integrate the results from the individual data sets.
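The slides do not spell out the Bayesian integration step. As a rough sketch only, one common way to combine per-data-set class probabilities is to multiply their odds under the assumption that the data sets are conditionally independent given the class. The function `integrate_probabilities`, the `prior` parameter and the example values are all assumptions made for illustration, not the authors' method.

```python
def integrate_probabilities(probs, prior=0.5, eps=1e-12):
    """Combine per-data-set probabilities P(class = 1 | data set k).

    Assumption (not from the slides): the data sets are conditionally
    independent given the class, so each one contributes its posterior
    odds divided by the prior odds.
    """
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)  # keep the odds finite
        odds *= (p / (1.0 - p)) / prior_odds
    return odds / (1.0 + odds)


# Hypothetical per-subject probabilities from two fitted models.
p_proteomics = 0.8
p_nmr = 0.6
p_integrated = integrate_probabilities([p_proteomics, p_nmr])
```

With a single input the function returns that probability unchanged, so the integration step only matters when two or more data sets cover the same subject.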
Overlap B/W NMR and G-Proteomics
NMR: 53 samples
Global Proteomics: 65 samples
Overlap: 17 samples
One sample: without phenotype information
One sample: from blood draw 2
15 samples: all from blood draw 1, with phenotype either “Healthy Control” or “Polyp”
Data Analysis Procedure
Metabolomics (NMR)
Data Preprocessing
Ending with 1824 Variables
IWPLS method
Global Proteomics
Data Preprocessing
Ending with 5407 Variables
IWPLS method
Integrate Results
Analysis Results
Our method:
[Figure: Probability (y-axis, 0.0 to 1.0) by Subject (x-axis, 1 to 14); series: True, GProteomics, NMR, Integrate]
Analysis Results (cont.)
Summary:
Other Methods Tried:
- PLS: ended with 0 components
- Univariate t-test: no variable is significant
Data                             Classification Rate
GProteomics                      100%
NMR                              85.7%
Integrated NMR and GProteomics   100%
Example: Overlap of Three Data Sets
For the overlap among three data sets, we focus on the overlap among Lipidomics, Teac and Comet.
Data summary:
- Phenotype summary:

Phenotype     Healthy  Polyp  Colon  Rectal  Total
Sample size        20     10      9       2     41

- Variable summary:

                     Lipidomics  Teac  Comet
Number of variables          52     1      2

Data analysis: we group the colon cancer and rectal cancer patients together as one cancer group, while keeping the other two groups. Then we try the following methods:
Method 1: POCRE
Method 2: ANOVA test
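For the ANOVA test on one variable across the three groups (healthy, polyp, cancer), the F statistic is the ratio of the between-group to the within-group mean square. A minimal sketch, assuming a plain one-way ANOVA; the function name and the measurements are hypothetical, not values from the study.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups.

    Each group is a list of measurements of one variable.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-group sum of squares around each group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical measurements of one variable, for illustration only.
healthy = [1.0, 2.0, 3.0]
polyp = [2.0, 3.0, 4.0]
cancer = [5.0, 6.0, 7.0]
f_stat = one_way_anova_f([healthy, polyp, cancer])
```

The p-value would then come from the F distribution with (k - 1, n - k) degrees of freedom, which is not computed here.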
Results
Misclassification rate:

POCRE  ANOVA
17%    39%

Variables identified:
POCRE
Lipids: LPG 18:1, LPE18:1, LPI 20:4, SPC
Teac: TEAC_mM
ANOVA
Lipids: LPE18:1
Teac: TEAC_mM
Preprocessing GC x GC-MS
Methods
How to choose the reference sample for alignment?
- Choose the chromatogram in the middle of the run sequence, or the chromatogram containing the highest number of common chemical constituents (i.e. peaks).
- Choose the chromatogram that is most similar to the loading of the first principal component in a PCA model on the unaligned data, or simply to the mean of all chromatograms.
Similarity index method for choosing the reference sample: for a given chromatogram x_t, the similarity index is defined as:

\[ \text{Similarity Index}(x_t) = \prod_{i=1}^{I} \left| r(x_t, x_i) \right| \]

where

\[ r(x_t, x_i) = \frac{\sum_{j=1}^{J} \left( x_t(j) - \bar{x}_t \right) \left( x_i(j) - \bar{x}_i \right)}{\sqrt{\sum_{j=1}^{J} \left( x_t(j) - \bar{x}_t \right)^2 \, \sum_{j=1}^{J} \left( x_i(j) - \bar{x}_i \right)^2}} \]

The chromatogram with the maximum similarity index is chosen as the reference sample.
Ref: Skov, T. et al., Automated Alignment of Chromatographic Data, Journal of Chemometrics, Vol. 20, Issue 11-12, pp. 484-497, 2007.
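The reference choice can be sketched in code by taking, for each chromatogram, the product of its absolute Pearson correlations with all chromatograms and picking the one with the largest product. This is a minimal sketch under that reading of the similarity index; the function names and the intensity values are hypothetical, and each chromatogram is assumed to be a plain list of intensities of equal length.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length intensity vectors.

    Assumes neither vector is constant (otherwise the denominator is 0).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def similarity_index(chromatograms, t):
    """Product of |r(x_t, x_i)| over all chromatograms x_i.

    The term i = t equals 1 and does not change the product.
    """
    product = 1.0
    for x in chromatograms:
        product *= abs(pearson_r(chromatograms[t], x))
    return product


def choose_reference(chromatograms):
    """Index of the chromatogram with the maximum similarity index."""
    return max(range(len(chromatograms)),
               key=lambda t: similarity_index(chromatograms, t))


# Hypothetical intensity profiles, for illustration only.
chromatograms = [
    [0.0, 1.0, 3.0, 1.0, 0.0],
    [0.0, 2.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 3.0, 1.0],  # peak shifted relative to the others
]
reference = choose_reference(chromatograms)
```

Because the index is a product, a single chromatogram that correlates poorly with a candidate pulls that candidate's index toward zero, which favours a reference that resembles every sample reasonably well.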
Results