SeqAhead, Barcelona, February 2013. Johan A. Westerhuis, Swammerdam Institute for Life Sciences, University of Amsterdam; Business Mathematics and Information, North-West University, Potchefstroom, South Africa.

Metabolomics Data Analysis

Page 1: Metabolomics Data Analysis

SeqAhead, Barcelona February 2013

Johan A. Westerhuis

Swammerdam Institute for Life Sciences, University of Amsterdam

Business Mathematics and Information,

North-West University, Potchefstroom, South Africa

Metabolomics Data Analysis


Page 2: Metabolomics Data Analysis
Page 3: Metabolomics Data Analysis

Metabolomics pipeline: Issues for biostatistics

Pipeline stages: Biological question; Experimental design; Data acquisition; Data pre-processing; Metabolite identification; Statistical data analysis; Biological interpretation

Issues for biostatistics:

• Power analysis, Treatment design

• Explorative, Predictive, Hypothetical biomarkers

• QC strategy, Measurement design

• Normalisation, Quantification

• Spectral matching, De novo identification

• Network inference, MSEA, Pathway analysis

Page 4: Metabolomics Data Analysis

Data Analysis special issue Metabolomics

• Data preprocessing methods (make samples more comparable)

• How to treat non-detects

• Variable importance in multivariate models

• Metabolic network analysis

• Data fusion methods

• Individual responses

• Between-metabolite ratios

Guest Editors: Jeroen J. Jansen, Johan A. Westerhuis

Page 5: Metabolomics Data Analysis

Multivariate metabolomics data

Correlation types in the figure: technical correlations, biological correlations

TARGETED ANALYSIS (example data, columns are named metabolites)

hipp  fum  urea  allant  TMAO  citrat
3     1    8     7       13    4
5     2    12    6       15    2

NONTARGETED PROFILING (example data, columns are unlabelled features)

1  67  45  6  3   31  10  44  32  10
3  24  12  4  33  23  0   0   99  76

Page 6: Metabolomics Data Analysis

Multivariate Metabolomics Data analysis

• Explorative
– Find groups, cluster structure / outliers in metabolites and in samples

• Supervised
– Discriminate two or more groups to make a predictive model and to find biomarkers

• Biological Interpretation
– Metabolite set enrichment, Pathway analysis
– Metabolic network inference

• Special topics
– Between-metabolite ratios
– Metabolomics Data Fusion

Page 7: Metabolomics Data Analysis

Metabolomics Data preprocessing

• Optimize biological content of data

• Correct for incorrect sampling, sample workup issues, batch effects

• What is the noise level in the data? Variance stabilization.

• High peaks more important than low peaks?

• Multivariate methods love large values! Generalized log transform
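The generalized log transform named in the last bullet can be sketched as follows. This is one common parametrisation, not necessarily the one used in the talk; the function name `glog` and the tuning parameter `lam` are assumptions for illustration. For large peaks it behaves like an ordinary log, while small peaks (including zero) stay finite.

```python
import math

def glog(x, lam=1.0):
    """Generalized log: log((x + sqrt(x^2 + lam)) / 2).

    `lam` is an assumed tuning parameter; it controls where the transform
    switches from roughly linear (small x) to roughly logarithmic (large x).
    """
    return math.log((x + math.sqrt(x * x + lam)) / 2.0)

# Hypothetical peak intensities, including a zero that a plain log cannot handle.
peaks = [0.0, 1.0, 10.0, 1000.0]
transformed = [glog(p, lam=4.0) for p in peaks]
```

With `lam=4.0` a zero intensity maps to zero, and the largest peak maps to almost exactly `log(1000)`, so large values no longer dominate a multivariate model.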

Page 8: Metabolomics Data Analysis
Page 9: Metabolomics Data Analysis

Metabolic changes during E. coli culture growth using k-means clustering.

Szymanski, Jedrzej et al. PLoS ONE (2009), vol. 4, issue 10

(A) Growth curve (optical density) of unperturbed E. coli culture. Numbers of respective sampling time points are marked in the curve. Time point 0 minutes marks the application of the respective stress condition.

(B) Relative changes of metabolite pools normalized to time point 1. Fold change is presented on log10 scale. To reveal main trends of metabolic changes, 10 k-means clusters are color coded.

[Figure: panel axes are metabolites (vertical) and time (horizontal)]
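The normalization described in panel (B), fold change relative to time point 1 on a log10 scale, can be sketched as below. The metabolite names and values are made up for illustration, and the helper `log10_fold_change` is a hypothetical name; the clustering step itself is not shown.

```python
import math

def log10_fold_change(series):
    """Return log10(x_t / x_1) for every time point of one metabolite."""
    reference = series[0]
    if reference <= 0:
        raise ValueError("reference time point must be positive")
    return [math.log10(value / reference) for value in series]

# Hypothetical metabolite pools (one list per metabolite, one value per time point).
pools = {
    "metabolite_a": [2.0, 20.0, 200.0, 2.0],
    "metabolite_b": [5.0, 0.5, 0.05, 5.0],
}
fold_changes = {name: log10_fold_change(values) for name, values in pools.items()}
```

After this step every profile starts at 0, so a clustering method compares the shape of the change over time rather than the absolute pool size.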

Page 10: Metabolomics Data Analysis

Self Organising Map of Metabolites in serum

• 1H NMR spectra of 613 patients with type I diabetes and a diverse spread of complications

• Nonlinear mapping method for large number of samples

• Relate position on the map to diagnostic responses

• Can be made supervised

VP Mäkinen et al., 1H NMR metabonomics approach to the disease continuum of diabetic complications and premature death, Molecular Systems Biology 4:167, 2008

Page 11: Metabolomics Data Analysis

Multivariate Metabolomics Data analysis

• Explorative
– Find groups, cluster structure / outliers in metabolites and in samples

• Supervised (Differentially expressed)
– Discriminate two or more groups to make a predictive model and to find biomarkers

• Biological Interpretation
– Metabolite set enrichment, Pathway analysis
– Metabolic network inference

• Special topics
– Between-metabolite ratios
– Metabolomics Data Fusion

Page 12: Metabolomics Data Analysis

Supervised Metabolomics Data analysis Case – Control (PLSDA)

Y = (0, 0, 0, 1, 1, 1): class membership vector

• Is there really a difference between the groups? Statistical validation issues

• Which are the most important peaks for discrimination? Variable importance

[Figure: score plot, PC1 versus PC2, samples labelled Men and Women]

[Figure: PLS regression coefficients (bPLS) versus chemical shift (ppm)]
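The first question on this slide, whether the groups really differ, is often approached with a label permutation test. The sketch below illustrates the idea on a single score per sample; it swaps in a plain difference of class means as the statistic instead of a PLSDA model, and all names and numbers are hypothetical.

```python
import random

def mean_difference(values, labels):
    """Absolute difference between the class means (labels are 0 or 1)."""
    group0 = [v for v, y in zip(values, labels) if y == 0]
    group1 = [v for v, y in zip(values, labels) if y == 1]
    return abs(sum(group1) / len(group1) - sum(group0) / len(group0))

def permutation_p_value(values, labels, statistic, n_permutations=999, seed=0):
    """Share of label permutations whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(values, shuffled) >= observed:
            hits += 1
    # Count the observed labelling itself so the p-value is never zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical data: class vector Y and one score per sample.
y = [0] * 8 + [1] * 8
separated = [0.1, 0.3, 0.2, 0.4, 0.0, 0.2, 0.3, 0.1,
             2.1, 2.4, 2.0, 2.2, 2.3, 1.9, 2.5, 2.2]
mixed = [0.1, 2.2, 0.3, 2.1, 0.2, 2.4, 0.0, 2.0,
         2.3, 0.4, 1.9, 0.2, 2.5, 0.3, 2.2, 0.1]
p_separated = permutation_p_value(separated, y, mean_difference)
p_mixed = permutation_p_value(mixed, y, mean_difference)
```

In a real analysis the statistic would be a cross-validated quality measure of the classifier, refitted for every permutation; the permutation logic stays the same.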

Page 13: Metabolomics Data Analysis

• Explain the Psyhogios example using examples from the paper and MetaboAnalyst examples

Proton NMR spectra of the urine samples were obtained on a 500 MHz 1H NMR machine.

Page 14: Metabolomics Data Analysis

NMR spectra of urine samples


Page 15: Metabolomics Data Analysis


Nonsupervised

Supervised

Page 16: Metabolomics Data Analysis

Experimental Design Example

Experiment: Rats are given bromobenzene (BB), which affects the liver

Measurements: NMR spectroscopy of urine

Experimental Design:

• Time: 6, 24 and 48 hours

• Groups: 3 doses of BB, vehicle group, control group

• Animals: 3 rats per dose per time point

[Figure: rats sampled at 6 hours, 24 hours and 48 hours; NMR spectrum over chemical shift (ppm) from 0 to 10 with labelled peaks at 2.055, 2.075, 2.7175, 2.93, 3.0275, 3.0475, 3.285, 3.675, 3.7525 and 5.38 ppm]

Page 17: Metabolomics Data Analysis

Different contributions

[Figure: four panels of metabolite concentration versus time, titled Time, Animal, Dose and Trajectories; overall label: Experimental Design]

Page 18: Metabolomics Data Analysis

ANOVA decomposition of each variable

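$$x_{hki_{hk}} = \mu + \alpha_k + (\alpha\beta)_{hk} + (\alpha\beta\gamma)_{hki_{hk}}$$

MATRICES:

$$\mathbf{X} = \mathbf{1}\mathbf{m}^T + \mathbf{X}_\alpha + \mathbf{X}_{\alpha\beta} + \mathbf{X}_{\alpha\beta\gamma}$$

[Figure: four panels of the decomposed variable versus time]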
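A minimal sketch of this decomposition for a single variable follows, assuming that k indexes time, h the dose group and i the animal within a dose-time cell (the slide does not spell the indices out). The function name, the group labels and the values are hypothetical.

```python
def anova_decompose(data):
    """Split each value into mu + alpha_k + (alpha beta)_hk + (alpha beta gamma)_hki.

    `data` maps (dose, time) to the list of values measured on the animals
    in that cell.
    """
    all_values = [x for cell in data.values() for x in cell]
    mu = sum(all_values) / len(all_values)

    # alpha_k: mean of everything measured at time k, minus the overall mean.
    alpha = {}
    for time in sorted({t for _, t in data}):
        values = [x for (_, t), cell in data.items() if t == time for x in cell]
        alpha[time] = sum(values) / len(values) - mu

    # (alpha beta)_hk: cell mean minus overall mean and time effect.
    # (alpha beta gamma)_hki: what is left for each individual animal.
    alpha_beta, residual = {}, {}
    for (dose, time), cell in data.items():
        cell_mean = sum(cell) / len(cell)
        alpha_beta[(dose, time)] = cell_mean - mu - alpha[time]
        residual[(dose, time)] = [x - cell_mean for x in cell]
    return mu, alpha, alpha_beta, residual

# Hypothetical values: 2 dose groups x 2 time points x 2 animals.
data = {
    ("low", 6): [1.0, 3.0], ("low", 24): [2.0, 4.0],
    ("high", 6): [5.0, 7.0], ("high", 24): [10.0, 12.0],
}
mu, alpha, alpha_beta, residual = anova_decompose(data)
```

Adding the four parts back together reproduces every original value exactly, which is the property the matrix equation expresses for all variables at once.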

Page 19: Metabolomics Data Analysis

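ANOVA and PCA: ASCA

$$\mathbf{X} = \mathbf{1}\mathbf{m}^T + \mathbf{X}_\alpha + \mathbf{X}_{\alpha\beta} + \mathbf{X}_{\alpha\beta\gamma}$$

$$\mathbf{X} = \mathbf{1}\mathbf{m}^T + \mathbf{T}_\alpha\mathbf{P}_\alpha^T + \mathbf{T}_{\alpha\beta}\mathbf{P}_{\alpha\beta}^T + \mathbf{T}_{\alpha\beta\gamma}\mathbf{P}_{\alpha\beta\gamma}^T + \mathbf{E}$$

E: parts of the data not explained by the component models

[Figure: X decomposed into score matrices Tα, Tαβ, Tαβγ and loading matrices Pα, Pαβ, Pαβγ]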

Page 20: Metabolomics Data Analysis

Results

40 %

Submodels: Xα, Xαβ, Xαβγ

[Figure: αβ-scores versus Time (Hours) at 6, 24 and 48 hours for the control, vehicle, low, medium and high groups]

Page 21: Metabolomics Data Analysis

Results biomarkers

[Figure: three panels, one per submodel (αβγ, αβ, α), over chemical shift (ppm) from 0 to 10, with labelled peaks at 2.055, 2.075, 2.5425, 2.5825, 2.6975, 2.735, 2.91, 2.93, 3.0275, 3.0475, 3.2625, 3.285, 3.675, 3.73, 3.7525, 3.8875, 3.9675 and 5.38 ppm]

Differences between submodels

Interesting for Biology

Interesting for Statistics / Diagnostics

Unique to the α submodel

Page 22: Metabolomics Data Analysis

Multivariate Metabolomics Data analysis

• Explorative
– Find groups, cluster structure / outliers in metabolites and in samples

• Supervised
– Discriminate two or more groups to make a predictive model and to find biomarkers
– Method comparison

• Biological Interpretation
– Metabolite set enrichment
– Pathway analysis
– Metabolic network inference

• Special topics
– Between-metabolite ratios
– Metabolomics Data Fusion

Page 23: Metabolomics Data Analysis

SELDI measurements of serum samples of 20 Gaucher patients and 20 healthy controls. Gaucher is a genetic disease in which a fatty substance (lipid) accumulates in cells and certain organs

NONTARGETED

Page 24: Metabolomics Data Analysis

• human urine and porcine cerebrospinal fluid samples spiked with a range of peptides.

• Variation in #samples, within and between group variation

Page 25: Metabolomics Data Analysis

Gaucher Spiked

Page 26: Metabolomics Data Analysis

Feature selection methods RESULTS

• Complex nontargeted Gaucher profiling data with highly variable background and varying difference between case and control: Multivariate methods perform best.

• Spiked LCMS targeted data with less variation in effect size: univariate and semi-univariate methods are best in selecting biomarkers.

Page 27: Metabolomics Data Analysis


Page 28: Metabolomics Data Analysis

Biomarkers: A: Univariate B: Multivariate C: Change in group correlation

Page 29: Metabolomics Data Analysis

BMR (between-metabolite ratios) of green tea intervention study: 186 human subjects with abdominal obesity

Validation shows significant changes in BMR between placebo and green tea treatment; the most important triacylglycerols are TG28-29 and TG41-42.
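As a rough illustration of what a between-metabolite ratio analysis starts from, the sketch below computes all pairwise log ratios for one sample. The metabolite names and concentrations are placeholders, and taking the logarithm of the ratio is an assumption made here for symmetry, not something stated on the slide.

```python
import math
from itertools import combinations

def log_ratios(sample):
    """All pairwise log ratios log(a / b) for one sample (name -> concentration)."""
    return {
        (a, b): math.log(sample[a] / sample[b])
        for a, b in combinations(sorted(sample), 2)
    }

# Hypothetical concentrations; the metabolite names are placeholders.
sample = {"m1": 4.0, "m2": 2.0, "m3": 1.0}
ratios = log_ratios(sample)
```

For J metabolites this yields J(J-1)/2 ratios per sample, so the number of candidate variables grows quickly and validation becomes correspondingly more important.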

Page 30: Metabolomics Data Analysis


Page 31: Metabolomics Data Analysis

Plasma

Page 32: Metabolomics Data Analysis

Differences in blood metabolites due to aging

Page 33: Metabolomics Data Analysis

Aging biomarker metabolites in liver

Page 34: Metabolomics Data Analysis
Page 35: Metabolomics Data Analysis
Page 36: Metabolomics Data Analysis


Page 37: Metabolomics Data Analysis

Special topic: Metabolic networks

Biochemical Network vs Association Network

Margriet M.W.B. Hendriks, Data-processing strategies for metabolomics studies, Trends in Analytical Chemistry, 2011

Figure 7 Marginal correlation network for a set of metabolites in tomato. Volatiles in red, derivatized metabolites in yellow. Solid lines represent positive correlations, dashed lines negative ones. Thickness of line corresponds to magnitude of ...

Page 38: Metabolomics Data Analysis

Metabolomics, 2005

Metabolic neighbors Do not participate in common reactions

High correlation due to e.g. chemical equilibrium, mass conservation, ...

Data from potato tubers

“a systematic relationship between observed correlation networks and the underlying biochemical pathways.” (Ralf Steuer, Observing and interpreting correlations in metabolomic networks, Bioinformatics, 2003)

Page 39: Metabolomics Data Analysis

Metabolic Network Inference

Search for the link between metabolome data and underlying metabolic networks.

As an example: can we distinguish healthy from diseased networks?

[Figure: network among metabolites A to F with unknown links (??); example pathway diagrams from Glucose through metabolites A to G, labelled HEALTHY and DISEASE]

Page 40: Metabolomics Data Analysis

From data to network

Goal: ?

Problems:

• Noise

• Missing metabolites

• Network topology

• Directions

• Huge amount of possible network structures

Page 41: Metabolomics Data Analysis

Inference from static data

1. DATA COLLECTION

2. SIMILARITY SCORE CALCULATION (all possible pairwise interactions)

2a. Relevance Networks

• Pearson Correlation (PC) (linear)

• Mutual Information (MI) (non-linear)

2b. Conditioned Networks

• Partial Pearson Correlation (PPC) (linear)

• Conditional Mutual Information (CMI) (non-linear)

[Figure: example networks among metabolites A to F for relevance and conditioned networks; data panels titled A. Enzymatic Variability, B. Intrinsic Variability, C. Environmental Variability]
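The difference between a relevance network (2a) and a conditioned network (2b) can be seen in a small sketch for the linear measures. The toy data are constructed so that metabolites a and b are both driven by c and have no direct link; the variable names and values are hypothetical, and only a first-order partial correlation is shown.

```python
import math

def pearson(x, y):
    """Pearson correlation (PC) between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_pearson(x, y, z):
    """First-order partial Pearson correlation (PPC) of x and y given z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data: c is a common driver, e1 and e2 are independent variation.
c = [1, -1, 1, -1, 1, -1, 1, -1]
e1 = [1, 1, -1, -1, 1, 1, -1, -1]
e2 = [1, 1, 1, 1, -1, -1, -1, -1]
a = [ci + ei for ci, ei in zip(c, e1)]
b = [ci + ei for ci, ei in zip(c, e2)]

r_ab = pearson(a, b)                   # relevance network score for a-b
r_ab_given_c = partial_pearson(a, b, c)  # conditioned network score for a-b
```

Here the plain correlation between a and b is 0.5, so a relevance network would draw an a-b edge, while the partial correlation given c is zero and the conditioned network would drop it.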

Page 42: Metabolomics Data Analysis

ESTIMATION OF CORRELATION NETWORKS

Real Pathway: 1. ASPP, 2. ASA, 3. HS, 4. HSP

[Figure: networks over ASPP, ASA, HS and HSP estimated with PC, MI, PPC1, PPCn and CMI1 under Vmax Variability, Intrinsic Variability and Environmental Variability; edge legend: 100%, > 90%, 10% … 90%, < 10%]

PC: Pearson Correlation (linear measure)

MI: Entropy-based Mutual Information (non-linear measure)

PPC: Partial Pearson Correlation (linear conditioning measure)

CMI: Conditional Mutual Information (nonlinear conditioning measure)

Cakir, Metabolomics 2009

Page 43: Metabolomics Data Analysis


Page 44: Metabolomics Data Analysis

Metabolomics data fusion

• Account for between-block difference in quality of measurements to improve data fusion

• For example, multi-platform data fusion, with differences in quantification, (non) targeted, error structure

• How to quantify the quality of measurements with many metabolites, and many samples?

[Figure: Amino acids block and Lipids block combined into Fused data]

Page 45: Metabolomics Data Analysis

Error model for 1 metabolite

• Error models:
- RSD using 1 QC sample
- 2-component using study samples

• Good error description
- sufficient # samples
- large I-range (intensity range)

[Figure: Standard Deviation (St.D) versus Mean Intensity (I); annotations: QC sample -> RSD, study samples]
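The two error descriptions on this slide can be sketched in a few lines. The RSD part follows directly from its definition; the two-component form below (an additive part plus a part proportional to intensity) is an assumption about what "2-component" refers to, and all parameter names and numbers are hypothetical.

```python
import math
import statistics

def relative_standard_deviation(replicates):
    """RSD = sample standard deviation / mean of repeated QC measurements."""
    return statistics.stdev(replicates) / statistics.mean(replicates)

def two_component_sd(intensity, additive_sd, proportional_rsd):
    """Assumed two-component model: sd(I) = sqrt(a^2 + (b * I)^2)."""
    return math.sqrt(additive_sd ** 2 + (proportional_rsd * intensity) ** 2)

# Hypothetical repeated measurements of one metabolite in one QC sample.
qc_intensities = [98.0, 100.0, 102.0]
rsd = relative_standard_deviation(qc_intensities)

# Predicted standard deviation at a low and a high intensity.
sd_low = two_component_sd(0.0, additive_sd=3.0, proportional_rsd=0.04)
sd_high = two_component_sd(100.0, additive_sd=3.0, proportional_rsd=0.04)
```

A single QC sample gives the error at one intensity only, whereas a model of this kind describes how the standard deviation changes over the intensity range, which is why the slide asks for a large range and enough samples.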

Page 46: Metabolomics Data Analysis

Figure of merit for data from 1 platform

Median: F-50 = 0.1

90th-percentile: F-90 = 0.35

[Figure: distribution of the number of peaks over St.D with F-50 and F-90 marked; error-model panels (St.D versus I) for Var. 15, Var. 118, Var. 213 and Var. 365]

(Van Batenburg et al., Analytical Chemistry, 2011)
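Assuming each peak on a platform has one error estimate, the two figures of merit can be read as the median and the 90th percentile of those estimates. The sketch below uses a nearest-rank percentile and made-up per-peak values chosen only so that the result echoes the numbers on the slide; the exact definition in the cited paper may differ.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical per-peak error estimates for one platform.
peak_errors = [0.05, 0.08, 0.09, 0.10, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60]
f50 = percentile(peak_errors, 50)
f90 = percentile(peak_errors, 90)
```

Two numbers per platform are easier to compare across platforms than a full set of per-peak error models, at the cost of hiding which peaks are the poorly measured ones.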

Page 47: Metabolomics Data Analysis

Two-step data fusion

• Step 1: Compute figures of merit for each platform

[Figure: GC/MS and LC/MS data blocks; J1 = 82 peaks, J2 = 49 peaks; error model per peak j against intensity Ij]

Page 48: Metabolomics Data Analysis

Two-step data fusion: MB-MLPCA

• Step 2: Multi-block PCA with weighting by figures of merit

• Method needs good estimation of error variance by
– Repeats
– QC samples

[Figure: blocks X1 and X2 (Amino acids, Lipids) with estimated error variances $\hat{s}_j^2$ combined into a fused error covariance]
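The weighting idea in step 2 can be illustrated with a simplified sketch: each peak is divided by its estimated error standard deviation before the blocks are joined. This is only the scaling and concatenation step, under the assumption that the weights enter as inverse standard deviations; it is not an implementation of MB-MLPCA, and the block contents and names are hypothetical.

```python
def weight_block(block, error_sd):
    """Divide every column j of a block (samples x peaks) by its error s.d. s_j."""
    return [[value / sd for value, sd in zip(row, error_sd)] for row in block]

def fuse(blocks):
    """Concatenate blocks that share the same samples (rows) column-wise."""
    return [sum(rows, []) for rows in zip(*blocks)]

# Hypothetical blocks measured on the same two samples.
x1 = [[10.0, 4.0], [20.0, 8.0]]  # block 1: 2 samples x 2 peaks
x2 = [[3.0], [9.0]]              # block 2: 2 samples x 1 peak
s1 = [2.0, 4.0]                  # estimated error s.d. per peak, block 1
s2 = [3.0]                       # estimated error s.d. per peak, block 2

fused = fuse([weight_block(x1, s1), weight_block(x2, s2)])
```

After scaling, every column is expressed in units of its own measurement error, so a noisy platform cannot dominate the fused model simply because its raw values are large.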

Page 49: Metabolomics Data Analysis

Realistic simulations using GCMS and LCMS data

• Error variance estimated from duplicates

• True error variance

• Estimating variance from duplicates is problematic.

• Use a mix of QC samples and repeats.
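For reference, the usual pooled estimate of an error variance from duplicate measurements is the sum of squared differences divided by twice the number of pairs. The sketch below applies it to made-up duplicate pairs; the function name is hypothetical.

```python
def variance_from_duplicates(pairs):
    """Pooled error variance from duplicate measurements: sum(d^2) / (2 n)."""
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))

# Hypothetical duplicate measurements of one peak in four samples.
duplicates = [(1.0, 3.0), (2.0, 2.0), (5.0, 7.0), (4.0, 4.0)]
error_variance = variance_from_duplicates(duplicates)
```

Each pair contributes a single degree of freedom, so with few duplicates the estimate is very uncertain, which is consistent with the slide's advice to combine QC samples and repeats.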

Page 50: Metabolomics Data Analysis

Multivariate Metabolomics Data analysis

• Explorative
– Find groups, cluster structure / outliers in metabolites and in samples

• Supervised
– Discriminate two or more groups to make a predictive model and to find biomarkers

• Biological Interpretation
– Metabolite set enrichment, Pathway analysis
– Metabolic network inference

• Special topics
– Between-metabolite ratios
– Metabolomics Data Fusion