Data Analysis in Metabolomics

Tim Ebbels, Imperial College London

Data Analysis in Metabolomics - Ecetoc •Overview of metabolomics data processing workflow •Differences between metabolomics and transcriptomics data •Approaches to improving



Themes

• Overview of metabolomics data processing workflow

• Differences between metabolomics and transcriptomics data

• Approaches to improving reproducibility & data quality

• Key challenges and bottlenecks


Metabolomics

The study of the complement of small molecules within biological systems

[Figure labels: metabolic profile; cell, tissue, biofluid; hormones]

Untargeted: No prior hypothesis of specific metabolites involved

Metabolomics workflow

• Steps: Biological question → Experimental design → Sampling → Sample preparation → Data acquisition → Data pre-processing → Data analysis → Biological interpretation

• Outputs along the way: Protocol; Samples; Raw data; Data table; Metabolites; Relevant metabolites, connectivities, models

• Also shown: Metabolite identification

Transcriptomic vs. Metabolomic Data

Each row compares Transcriptomics (microarrays / sequencing) with Metabolomics.

• Number of genes / metabolites known?
– Transcriptomics: Yes
– Metabolomics: No

• Identity of genes / metabolites known?
– Transcriptomics: Yes (sequence/locus)
– Metabolomics: No (a priori)

• Coverage
– Transcriptomics: Whole genome
– Metabolomics: Very low (few %) of metabolome

• Number of platforms
– Transcriptomics: Single
– Metabolomics: Multiple (in same experiment)

• Standardisation – analytical technology
– Transcriptomics: High
– Metabolomics: Constantly changing

• Standardisation – data analysis
– Transcriptomics: Relatively high
– Metabolomics: Low

• Correlation between variables
– Transcriptomics: Medium
– Metabolomics: Very high

LC‐MS Metabolomics Data


LC-MS Metabolic Profiles

~10,000s signals, 100-1000s (?) metabolites

LC-MS preprocessing

• Input: Raw data

• Steps shown:
– Peak detection
– Peak matching
– Retention time alignment
– Peak integration
– Peak filling

• Output: Peak table

XCMS – Smith et al. Anal Chem 78, 779 (2006)

Quality Control Samples

• Representative biological sample, e.g. pool of study samples

• Repeated analysis throughout analytical run

[Figure: study samples and pooled QC sample injections along the run order…]

Gika, H. G., Theodoridis, G. A., Wingate, J. E., and Wilson, I. D., J. Proteome Res. 6 (8), 3291 (2007).

Quality Control and Data Filtering

• Repeatability filter: e.g. filter out all features with CV > 30% in QC samples

• Linearity filter: e.g. filter out all features with correlation to dilution < 0.8

• Normalisation: correct global intensity drift

• Drift correction: correct feature-specific drift within a batch

• Batch correction: correct drift across batches

Drift Correction

• Instrument response changes smoothly over the run

• Use QC samples to estimate changes

• Typically local regression (e.g. LOESS) with cross‐validation

• Requires frequent QC injections

Dunn, W. B. et al. Nat Protoc 6, 1060 (2011).
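
The QC-based idea above can be sketched for a single feature as follows. This is a minimal illustration under stated assumptions, not the procedure of Dunn et al.: it fits a tricube-weighted local straight line (a LOESS-style smoother) to the QC intensities against run order, then divides every injection by the fitted curve and rescales to the QC median. The function names, the fixed `span` and the toy data are hypothetical, and cross-validation of the span is left out. At least two QC injections are needed.

```python
from statistics import median


def local_linear_fit(x_qc, y_qc, x0, span=0.75):
    """LOESS-style estimate at x0: tricube-weighted straight-line fit to the QCs."""
    n = len(x_qc)
    k = min(n, max(2, round(span * n)))        # neighbours inside the window
    # Bandwidth just beyond the k-th nearest QC, so that QC keeps a non-zero weight
    h = sorted(abs(x - x0) for x in x_qc)[k - 1] * 1.01 + 1e-12
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in x_qc]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x_qc)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y_qc)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x_qc))
    if sxx == 0:                                # all weight on one run position
        return my
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x_qc, y_qc))
    return my + (sxy / sxx) * (x0 - mx)


def drift_correct(run_order, intensity, is_qc, span=0.75):
    """Divide each injection by the QC-estimated drift, rescaled to the QC median."""
    x_qc = [x for x, q in zip(run_order, is_qc) if q]
    y_qc = [y for y, q in zip(intensity, is_qc) if q]
    target = median(y_qc)
    return [y * target / local_linear_fit(x_qc, y_qc, x, span)
            for x, y in zip(run_order, intensity)]


# Toy run (assumed): a QC every third injection, response falling 1% per injection
run_order = list(range(13))
is_qc = [i % 3 == 0 for i in run_order]
intensity = [(100.0 if q else 50.0) * (1 - 0.01 * i)
             for i, q in zip(run_order, is_qc)]
corrected = drift_correct(run_order, intensity, is_qc)
```

In this toy run the drift is exactly linear, so the corrected QC values all collapse onto the QC median and the study samples onto a single value. With real data the span controls how closely the curve follows the QCs, which is why it is usually chosen by cross-validation.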


Filtering for Repeatability

• Remove features with low repeatability in QC samples (e.g. coefficient of variation, CV > 30%)

[Figure: scores plots (t[1] vs. t[2]) for Lab C positive ESI data at a 100% CV threshold and at a 10% CV threshold. Credit: COMET2 / Rob Whiffin]
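
A repeatability filter of this kind might look like the sketch below. It assumes the QC intensities are held as a mapping from feature name to one value per QC injection, and that intensities are positive so the mean is non-zero. The names `cv_percent`, `repeatability_filter` and the toy values are hypothetical.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation in percent: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)


def repeatability_filter(qc_table, max_cv=30.0):
    """Keep only features whose CV across the QC injections is at most max_cv."""
    return {name: vals for name, vals in qc_table.items()
            if cv_percent(vals) <= max_cv}


# Hypothetical QC intensities: feature name -> one value per QC injection
qc_table = {
    "feature_A": [100.0, 105.0, 98.0, 102.0],   # stable across QCs
    "feature_B": [100.0, 20.0, 180.0, 60.0],    # poorly repeatable
}
kept = repeatability_filter(qc_table)
```

Here `feature_A` has a CV of about 3% and is kept, while `feature_B` has a CV of about 76% and is removed.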


Filtering for Linearity

• Some features will not respond linearly because they are
– Metabolites at concentrations outside the linear range of the instrument, or
– Contaminants, solvent artefacts etc…

• Use a dilution series to select features which respond linearly

[Figure axes: intensity vs. dilution factor; R² and CV (%)]
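
One way to apply the dilution-series idea is to correlate each feature's intensity with the relative concentration of the diluted sample and keep features above a threshold. The sketch below assumes a Pearson correlation and the 0.8 cut-off mentioned earlier in the deck. The function names and the toy dilution series are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0                     # a constant signal has no linear response
    return sxy / sqrt(sxx * syy)


def linearity_filter(concentration, table, min_r=0.8):
    """Keep features whose intensity correlates with relative concentration."""
    return {name: vals for name, vals in table.items()
            if pearson(concentration, vals) >= min_r}


# Hypothetical dilution series: relative concentration of each diluted sample
concentration = [1.0, 0.5, 0.25, 0.125]
table = {
    "feature_A": [800.0, 410.0, 195.0, 105.0],   # scales with concentration
    "feature_B": [300.0, 310.0, 295.0, 305.0],   # flat, e.g. a solvent artefact
}
linear = linearity_filter(concentration, table)
```

A feature that saturates the detector would also fail this test if the series extends beyond the linear range, because its intensity stops tracking concentration.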


NMR Metabolomics Data


NMR Metabolic Profiles

~100s signals, 10-100s (?) metabolites

NMR Metabolic Profiles: Problems

• Assignment
– Knowns
– Unknowns

• Peak overlap

• Peak shift

Peak shifts

• Caused primarily by pH & ionic strength variations

• Some peaks more susceptible than others

• Peaks for same molecule generally do NOT shift
– In same direction
– By same amount

• Restrict pH variation using buffer

• Try to keep in physiological range (~7‐8)

• pH shift may be the effect you’re looking for!

[Figure: urine titration series, pH 2 to pH 12]

Binning

• Integrate spectral intensity in each region → one variable

• Benefits: reduces problems of
– Peak shift
– Large number of data points

• Drawbacks
– Bins not easily assigned – can be one or several compounds
– Statistical models not easily interpreted

[Figure: raw spectrum and binned spectrum]
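
Binning can be sketched in a few lines. The version below assumes a uniformly sampled spectrum given as parallel lists of chemical shift (ppm) and intensity, and sums the intensities in fixed-width regions (on a uniform grid the sum is proportional to the integral). The function name and the default bin width of 0.04 ppm are assumptions.

```python
from math import floor


def bin_spectrum(ppm, intensity, width=0.04):
    """Sum intensities within fixed-width ppm regions.

    Returns a list of (bin_start_ppm, summed_intensity), ordered by ppm.
    """
    totals = {}
    for p, y in zip(ppm, intensity):
        idx = floor(p / width)                 # index of the region containing p
        totals[idx] = totals.get(idx, 0.0) + y
    return [(round(idx * width, 6), total)
            for idx, total in sorted(totals.items())]


# Toy spectrum (assumed): eight points spaced 0.01 ppm apart
ppm = [1.005, 1.015, 1.025, 1.035, 1.045, 1.055, 1.065, 1.075]
intensity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
binned = bin_spectrum(ppm, intensity)
```

The eight points collapse into two variables. A peak that moves by less than the bin width usually stays within one bin, which is how binning reduces the effect of peak shift, and it is also why a bin cannot easily be assigned to a single compound.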


Full resolution spectra

• Benefits:
– Reduces difficulty of assignment (still manual)

• Drawbacks: does not overcome
– Overlap
– Shift
– Large number of data points

Full resolution + alignment

• Move peaks until positions in different spectra match

• Difficult task, usually requires manual validation

• Can produce artefacts
– Misassignment
– Artificial signal
– Warping of peak shape and/or area

[Figure: spectra stacked by sample number over roughly 2.6–3.1 ppm, intensity in a.u.; panels: non-aligned data and RSPA corrected data]

Veselkov et al. Anal. Chem. 2009
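
To make "move peaks until positions match" concrete, here is a deliberately simple sketch. It is not RSPA: it applies one rigid shift to a whole spectral segment, chosen to maximise the overlap (dot product) with a reference segment. The function names, the `max_shift` limit and the toy segment are hypothetical.

```python
def best_shift(reference, spectrum, max_shift=5):
    """Integer shift (in data points) that best overlays spectrum on reference."""
    n = len(reference)

    def overlap(s):
        # Dot product of the reference with the spectrum moved by s points
        return sum(reference[i] * spectrum[i - s]
                   for i in range(n) if 0 <= i - s < n)

    return max(range(-max_shift, max_shift + 1), key=overlap)


def apply_shift(spectrum, s, fill=0.0):
    """Move the spectrum by s points, padding the exposed end with fill."""
    n = len(spectrum)
    return [spectrum[i - s] if 0 <= i - s < n else fill for i in range(n)]


# Toy segment (assumed): the same peak, displaced by two points
reference = [0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 0.0]
spectrum = [0.0, 0.0, 0.0, 0.0, 1.0, 5.0, 1.0, 0.0]
shift = best_shift(reference, spectrum)
aligned = apply_shift(spectrum, shift)
```

A rigid shift cannot handle peaks that move by different amounts or in different directions, which is the usual situation in NMR, and the padding at the segment edge is an example of the artificial signal that alignment can introduce.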


Peak fitting

• E.g. Chenomx NMR suite

• Manual process, requiring manual validation

[Figure: fitted peaks labelled succinate, glutamine, malate and glutamate]

Normalisation

• Transformation on each sample
– Removing unwanted variation
– Making samples more comparable

• What variation is unwanted? Examples:
– Changes in detector response
– Differences in urine volume/dilution

• Classically achieved by setting total signal to a constant (Σx = 1)
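
Constant sum normalisation divides every variable in a sample by that sample's total signal, so each normalised spectrum sums to the chosen constant. A minimal sketch, with a hypothetical function name and toy spectra:

```python
def normalise_constant_sum(spectrum, total=1.0):
    """Scale one sample so that its intensities sum to `total`."""
    s = sum(spectrum)
    if s == 0:
        raise ValueError("cannot normalise an all-zero spectrum")
    return [total * x / s for x in spectrum]


# Toy samples (assumed): the same profile at two dilutions
sample = [2.0, 6.0, 12.0]
diluted = [1.0, 3.0, 6.0]
norm_sample = normalise_constant_sum(sample)
norm_diluted = normalise_constant_sum(diluted)
```

The two dilutions give the same normalised profile. The caveat is that a large change in one metabolite alters the total and therefore shifts every other normalised variable in that sample.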

Normalisation to constant sum

[Figure: the same spectra shown as raw data and after constant sum normalisation]

Normalisation

• Account for gross sample to sample changes

• Global, e.g.
– Median fold change
– Total intensity

• Intensity dependent, e.g.
– LOESS
– Quantile

Veselkov, K. A. et al. Anal. Chem. 83, 5864 (2011).

[Figure: median fold change normalisation]
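
Median fold change normalisation can be sketched as follows, assuming the reference profile is the variable-wise median across samples and that only positive values enter the fold-change calculation. The function name and toy data are hypothetical, and this is an illustration of the general idea rather than the exact implementation in the cited paper.

```python
from statistics import median


def median_fold_change_normalise(samples, reference=None):
    """Divide each sample by its median fold change relative to a reference."""
    if reference is None:
        # Reference profile: median of each variable across all samples
        reference = [median(col) for col in zip(*samples)]
    normalised = []
    for row in samples:
        folds = [x / r for x, r in zip(row, reference) if x > 0 and r > 0]
        factor = median(folds)                 # estimated dilution of this sample
        normalised.append([x / factor for x in row])
    return normalised


# Toy data (assumed): rows are samples, columns are variables
samples = [
    [10.0, 20.0, 30.0, 40.0],
    [20.0, 40.0, 60.0, 80.0],      # same profile, twice as concentrated
    [5.0, 10.0, 15.0, 100.0],      # half as concentrated, one genuine change
]
normalised = median_fold_change_normalise(samples)
```

In the third sample the large change in the last variable barely affects the estimated dilution factor, because the median ignores it. A total intensity normalisation would be pulled by that one variable.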


Comparison of Normalisation Methods

• Simulated data

• 4 normalisations:
– Total area
– Median fold change
– Minimum entropy
– PCA scores

• Minimum entropy
– Difference (test‐ref) is constant for dilution variables → low entropy

• Other methods:
– Histogram
– Quantile
– Robust regression

Hector Keun / Jake Pearce

Summary

• Metabolomic data share many characteristics with other omics
– But fundamentally different: cannot copy data analysis pipeline

• Current bottlenecks/challenges:
– Metabolite identification
– Standardisation of sample collection, analytical procedure and data analysis