Statistical methods in metabolomics

David Moriña (a, b)
[email protected]

(a) Centre for Research in Environmental Epidemiology (CREAL)
(b) Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona

May 08 2014, Reus


Contents

1 Introduction

2 Basic statistics

3 Available tools

4 R basics

5 LC/MS example

6 Further reading


Introduction

Where does data come from?

Metabolomics

• Metabolomics is the analysis and study of the set of metabolites in a cell, organ, or tissue.

• To detect and quantify metabolites, separation techniques such as gas or liquid chromatography, followed by quantification by mass spectrometry (GC/MS or LC/MS), are often used.

• Nuclear magnetic resonance (NMR) spectroscopy is also frequently employed and has some appealing properties:

  • It is non-destructive, in the sense that it does not “destroy” the samples during the analysis.

  • It is useful when analyzing tissues or when sequential analysis of samples is required.


Basic statistics

What do people say about statistics?

• There are three kinds of lies: lies, damned lies, and statistics. (B. Disraeli)

• Statistics are like bikinis: what they reveal is suggestive, but what they hide is vital. (A. Levenstein)

• About 93% of all statistics are made up. (Any newspaper)


Distributions and hypothesis testing

Normal distribution


Central Limit Theorem

Under fairly mild conditions, the distribution of the sum of independent and identically distributed random variables tends to a normal distribution as the number of observations increases. In practice the approximation is usually adequate when the number of observations is not too small.


The following example shows the distribution of the sum of the scores obtained when rolling 1, 2, 3, 5 and 10 dice:

[Figure: distribution of the sum of the scores for 1, 2, 3, 5 and 10 dice.]
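The dice experiment can also be reproduced by simulation. The sketch below is only an illustration: the helper `dice_sums`, the number of simulated throws and the seed are assumptions, not taken from the slides. It draws the sum of the scores for each number of dice and plots one histogram per case.

```r
# Simulate the sum of the scores when rolling k fair dice (illustrative sketch).
set.seed(2014)                      # arbitrary seed, for reproducibility only
n_rep <- 10000                      # number of simulated throws per setting

dice_sums <- function(k, n_rep) {
  # each column is one throw of k dice; colSums gives the total score
  rolls <- matrix(sample(1:6, k * n_rep, replace = TRUE), nrow = k)
  colSums(rolls)
}

n_dice <- c(1, 2, 3, 5, 10)
sums <- lapply(n_dice, dice_sums, n_rep = n_rep)
names(sums) <- n_dice

# One histogram per number of dice
op <- par(mfrow = c(2, 3))
for (k in names(sums)) {
  hist(sums[[k]],
       breaks = seq(min(sums[[k]]) - 0.5, max(sums[[k]]) + 0.5, by = 1),
       main = paste(k, "dice"), xlab = "Sum of scores", freq = FALSE)
}
par(op)
```

With a single die the histogram is flat; as more dice are added, the histogram of the sum should look increasingly bell-shaped.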


Testing normality

There are several methods to assess whether a random variable follows a normal distribution. Some of them are:

• Kolmogorov-Smirnov test
• Shapiro-Wilk test
• Graphical methods (QQ-plot, ...)
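These checks are all available in base R. The following sketch applies them to simulated data; the sample sizes, distributions and seed are assumptions chosen for illustration.

```r
set.seed(1)
x_norm <- rnorm(200, mean = 10, sd = 2)   # simulated normal data
x_skew <- rexp(200, rate = 1)             # simulated right-skewed data

# Shapiro-Wilk test: H0 = the sample comes from a normal distribution
sw_norm <- shapiro.test(x_norm)
sw_skew <- shapiro.test(x_skew)

# Kolmogorov-Smirnov test against a normal with the sample mean and sd.
# Estimating the parameters from the same data makes this test only approximate.
ks_skew <- ks.test(x_skew, "pnorm", mean = mean(x_skew), sd = sd(x_skew))

c(shapiro_normal = sw_norm$p.value,
  shapiro_skewed = sw_skew$p.value,
  ks_skewed      = ks_skew$p.value)

# QQ-plot: points close to the line suggest approximate normality
qqnorm(x_skew)
qqline(x_skew)
```

A small p-value is evidence against normality; a large one does not prove normality, so the QQ-plot is worth inspecting in either case.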


Some mathematical transformations can be applied to stabilize the variance of a random variable (and often to bring its distribution closer to normality):

• log transformation
• logit transformation
• square root transformation
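A minimal sketch of these three transformations in R, using simulated data (the distributions, the seed and the small `skewness` helper are assumptions made for this example):

```r
set.seed(42)
x <- rlnorm(500, meanlog = 0, sdlog = 1)   # positive, right-skewed values
p <- runif(500, min = 0.01, max = 0.99)    # proportions strictly inside (0, 1)

x_log   <- log(x)             # log transformation (requires x > 0)
x_sqrt  <- sqrt(x)            # square root transformation (requires x >= 0)
p_logit <- log(p / (1 - p))   # logit transformation; equivalent to qlogis(p)

# Sample skewness before and after the log transformation
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
c(original = skewness(x), log = skewness(x_log))
```

The log and square root transformations suit positive, right-skewed measurements such as intensities, while the logit is meant for proportions.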


[Figure: density of the original variable (X scale) and density of the log-transformed variable (log(X) scale).]


Significance

The interpretation of the p-value is well known... Sure?

• Probability of a difference at least as large as the one observed if H0 is true (by chance)
• Probability of making a mistake when rejecting H0
• Evidence against H0 provided by the sample. If the p-value is small, it is unlikely to observe the sample differences by chance
• Probability that the observed differences are false
• 1 - p-value = probability that the observed differences are real


• The p-value is computed under the assumption that H0 is true, and therefore it cannot provide direct information about whether H0 is true.

• The scientist must decide on H0 based on the evidence against it that the sample provides, without knowing the true state of nature.
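A small simulation can make this concrete. In the sketch below (sample sizes, number of runs and seed are assumptions), both groups always come from the same distribution, so H0 is true in every run, and yet a fraction of the tests still yields p < 0.05.

```r
set.seed(123)
n_sim <- 5000

# Both groups are drawn from the same distribution, so H0 is true in every run
p_null <- replicate(n_sim, t.test(rnorm(20), rnorm(20))$p.value)

# Proportion of runs declared "significant" at alpha = 0.05
mean(p_null < 0.05)

# Under H0 the p-values are spread roughly evenly over (0, 1)
hist(p_null, breaks = 20, main = "p-values when H0 is true", xlab = "p-value")
```

The proportion should be close to 0.05: that is the error rate accepted when rejecting at that level, not the probability that any particular finding is false.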


• If a confidence level 1 − α is fixed:

  p < α → statistically significant differences
  p ≥ α → no statistically significant differences


Comparing two populations

• Student's t test was designed to compare two means.
• A t-test can also be used to determine whether 2 clusters are different.

[Figure: scatter plot of Value against Time.]
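As a minimal sketch of a two-sample comparison in R, the example below simulates two groups (means, standard deviations, sample sizes and seed are assumptions) and compares their means with `t.test()`.

```r
set.seed(7)
group_a <- rnorm(30, mean = 10, sd = 3)   # simulated measurements, group A
group_b <- rnorm(30, mean = 20, sd = 3)   # simulated measurements, group B

# t.test() performs Welch's test by default (no equal-variance assumption);
# var.equal = TRUE gives the classical Student's t test
welch   <- t.test(group_a, group_b)
student <- t.test(group_a, group_b, var.equal = TRUE)

welch$p.value
student$p.value
```

When the two groups have similar variances and sizes, both versions give nearly the same answer; Welch's version is the safer default otherwise.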


Comparing two populations (non-gaussian)

• If the distribution of the variable of interest is not Gaussian, we can still compare two populations by means of the Mann-Whitney U test (for independent samples) or the Wilcoxon signed-rank test (for paired samples).

• These non-parametric tests are based on ranks. They are usually described as comparing two medians, which holds strictly only when the two distributions have the same shape.
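Both tests are provided by `wilcox.test()` in base R. The sketch below uses simulated skewed data; the distributions, shifts and seed are assumptions for illustration.

```r
set.seed(11)
x <- rexp(25, rate = 1)        # skewed sample, group 1
y <- rexp(25, rate = 1) + 2    # same shape, shifted by 2 (group 2)

# Independent samples: Mann-Whitney U test (called Wilcoxon rank-sum test in R)
mw <- wilcox.test(x, y)

# Paired samples: Wilcoxon signed-rank test on before/after values
before <- rexp(25, rate = 1)
after  <- before + rnorm(25, mean = 1, sd = 0.2)
sr <- wilcox.test(before, after, paired = TRUE)

c(mann_whitney = mw$p.value, signed_rank = sr$p.value)
```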


Comparing three (or more) populations

• If we want to compare more than two groups we can use the ANOVA technique.

• Essentially, it is a generalization of Student's t test.
• The within-group variances should be similar.
• Normality is not crucial.
• It only tells whether at least one of the compared groups differs from the others, not which one.
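A one-way ANOVA can be sketched in R as follows. The three simulated groups, their means and the seed are assumptions; `bartlett.test()` is used here as one possible check of the similar-variance condition.

```r
set.seed(3)
dat <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  value = c(rnorm(20, mean = 10, sd = 2),
            rnorm(20, mean = 10, sd = 2),
            rnorm(20, mean = 14, sd = 2))
)

# Check that the within-group variances are of similar size
tapply(dat$value, dat$group, var)
bartlett.test(value ~ group, data = dat)   # H0: equal variances

# One-way ANOVA: H0 = all group means are equal
fit <- aov(value ~ group, data = dat)
summary(fit)
```

A small p-value in the ANOVA table only says that the means are not all equal; it does not identify the group that differs.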


• ANOVA can also be used to determine whether 3 or more clusters are different.

[Figure: scatter plot of Value against Time.]


If H0 can be rejected, which group is the different one?

• We need to perform a posteriori (post hoc) tests on the means.
• They compare each pair of means.
• They are more conservative, in order to control the overall type I error αT.


Multiple comparisons

There are several methods to control the type I error:

• Bonferroni
• Holm
• Tukey
• Scheffé
• Dunnett (comparisons against a control)
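Some of these procedures are available in base R. The sketch below, on simulated data (group means and seed are assumptions), shows Tukey's method through `TukeyHSD()` and the Bonferroni and Holm adjustments through `pairwise.t.test()`; the Scheffé and Dunnett procedures are not covered by this example.

```r
set.seed(3)
dat <- data.frame(
  group = factor(rep(c("control", "B", "C"), each = 20),
                 levels = c("control", "B", "C")),
  value = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2))
)
fit <- aov(value ~ group, data = dat)

# Tukey's honest significant differences for every pair of means
tukey <- TukeyHSD(fit)
tukey

# Pairwise t tests with Bonferroni and Holm adjusted p-values
pw_bonf <- pairwise.t.test(dat$value, dat$group, p.adjust.method = "bonferroni")
pw_holm <- pairwise.t.test(dat$value, dat$group, p.adjust.method = "holm")
pw_holm$p.value
```

Holm's adjusted p-values are never larger than Bonferroni's, which is why Holm is often preferred when a simple correction is wanted.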


False Discovery Rate

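Suppose you performed 100 different t-tests and found 20 results with a p-value < 0.05.

• How many of these 20 results are likely to be false positives?
• If every H0 were true, we would expect 100 · 0.05 = 5 tests to fall below 0.05 by chance alone, so up to about 5 of the 20 results may be false positives.
• To correct for this, we can consider as significant only the results with a p-value < 0.05/100, that is, p < 0.0005 (Bonferroni correction).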
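Corrections of this kind are available in R through `p.adjust()`. The sketch below uses simulated p-values (the 80/20 split between true and false null hypotheses, the effect size and the seed are assumptions) and compares the Bonferroni adjustment with the Benjamini-Hochberg procedure, which controls the false discovery rate.

```r
set.seed(99)
m <- 100
# Simulated p-values: 80 tests where H0 is true, 20 with a real difference
p_null <- replicate(80, t.test(rnorm(15), rnorm(15))$p.value)
p_alt  <- replicate(20, t.test(rnorm(15), rnorm(15, mean = 2))$p.value)
p_raw  <- c(p_null, p_alt)

p_bonf <- p.adjust(p_raw, method = "bonferroni")  # controls family-wise error
p_bh   <- p.adjust(p_raw, method = "BH")          # controls false discovery rate

# Number of results declared significant at 0.05 under each approach
c(raw = sum(p_raw < 0.05),
  bonferroni = sum(p_bonf < 0.05),
  BH = sum(p_bh < 0.05))
```

The Benjamini-Hochberg adjustment never rejects fewer hypotheses than Bonferroni, at the price of tolerating a controlled proportion of false discoveries.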


Correlation

• If there is some dependency between two variables, a relationship between the predicted and the observed variable, or an effect of a treatment on “before” and “after” measurements, then clear patterns can be seen in the scatter plot.

• This pattern or relationship is called correlation.


[Figure: three scatter plots showing positive correlation (y against x), negative correlation (z against x) and no correlation (t against x).]


The correlation coefficient (Pearson coefficient) is computed by means of

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$
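The coefficient can be computed directly from its definition (the sum of cross-products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations) and checked against the built-in `cor()`. The simulated data and the seed below are assumptions for illustration.

```r
set.seed(5)
x <- rnorm(50)
y <- 5 + 0.8 * x + rnorm(50, sd = 0.5)   # simulated positively correlated pair

# Pearson coefficient computed directly from the formula
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)          # same quantity with the built-in function
ct <- cor.test(x, y)            # test of H0: true correlation equals 0

c(manual = r_manual, builtin = r_builtin, p_value = ct$p.value)
```

`cor.test()` adds a p-value and a confidence interval, which helps to judge whether an observed coefficient could plausibly arise by chance.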


Correlation and significance

[Figure: two scatter plots, one with r=0.98 and one with r=0.22.]


Clustering

[Figure: scatter plot illustrating clusters of points.]


[Figure: CLUSPLOT(mydata), Component 2 against Component 1. These two components explain 100 % of the point variability.]


• Clustering is a process by which objects with similar characteristics are grouped together.
• It is a step prior to classification.
• It requires a way to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects.
• It uses a threshold value to decide whether an object belongs to a cluster.
• There are several clustering methods, which differ in how they start the clustering process.


• K-means algorithm: divides a set of N objects into M clusters, with or without overlap. M must be specified by the analyst.

• Hierarchical clustering: produces a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains.


K-means algorithm

1. Make the first object the centroid of the first cluster.
2. For the next object, calculate the similarity to each existing centroid.
3. If the similarity is greater than a threshold, add the object to the existing cluster and recompute the centroid; otherwise, use the object to start a new cluster.
4. Return to step 2 and repeat until done.
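The steps above can be sketched in a few lines of R. This is an illustration under stated assumptions, not the algorithm behind R's `kmeans()`, which takes a fixed number of clusters and refines them iteratively. Here "similarity greater than a threshold" is read as "Euclidean distance to the nearest centroid not larger than a threshold"; the function name `threshold_cluster`, the toy data and the threshold value are hypothetical.

```r
# Sequential, threshold-based clustering (illustrative sketch, not stats::kmeans)
threshold_cluster <- function(x, threshold) {
  x <- as.matrix(x)
  centroids <- x[1, , drop = FALSE]        # step 1: first object starts cluster 1
  cluster <- integer(nrow(x))
  cluster[1] <- 1L
  for (i in seq_len(nrow(x))[-1]) {
    # step 2: Euclidean distance from object i to every existing centroid
    d <- sqrt(rowSums(sweep(centroids, 2, x[i, ])^2))
    k <- which.min(d)
    if (d[k] <= threshold) {
      # step 3a: close enough, join cluster k and recompute its centroid
      cluster[i] <- k
      centroids[k, ] <- colMeans(x[cluster == k, , drop = FALSE])
    } else {
      # step 3b: too far from every centroid, start a new cluster
      centroids <- rbind(centroids, x[i, ])
      cluster[i] <- nrow(centroids)
    }
  }
  list(cluster = cluster, centers = centroids)
}

set.seed(8)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
res <- threshold_cluster(toy, threshold = 2)
table(res$cluster)
```

In this procedure the number of clusters is a consequence of the threshold and of the order of the objects, which is one reason why methods with a fixed number of clusters are often preferred in practice.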


K-means algorithm example

# Read data
> st1 <- read.table("Data/global_afegits.csv", sep=";", dec=",", header=TRUE)
# Determine number of clusters
# (st2.ado is a data frame that is not defined in these slides)
> n <- nrow(st2.ado)
> wss <- numeric(10)
> wss[1] <- (n-1)*sum(apply(st2.ado[,2:8], 2, var))
> for (i in 2:10) {
+   wss[i] <- sum(kmeans(na.omit(st2.ado[,2:8]), centers=i)$withinss)
+ }
> plot(1:10, wss, type="b", xlab="Number of groups",
+      ylab="Within groups sum of squares")


[Figure: within-groups sum of squares against the number of groups (1 to 10).]


If we choose 5 clusters, then

> fit <- kmeans(st2.ado, 5)

will classify the observations into the 5 groups.
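Since the data set used in the slides is not available, the same workflow can be tried on data shipped with R. The sketch below uses the built-in `USArrests` data; the standardisation, the use of `nstart`, the seed and the choice of 4 clusters are assumptions made for this example.

```r
# Illustrative data: the built-in USArrests set, standardised column by column
dat <- scale(USArrests)

set.seed(2014)
max_k <- 10
wss <- numeric(max_k)
for (k in 1:max_k) {
  # tot.withinss is the total within-cluster sum of squares for k clusters
  wss[k] <- kmeans(dat, centers = k, nstart = 20)$tot.withinss
}
plot(1:max_k, wss, type = "b",
     xlab = "Number of groups", ylab = "Within groups sum of squares")

# Fit the chosen solution and inspect the cluster sizes
fit <- kmeans(dat, centers = 4, nstart = 20)
table(fit$cluster)
```

The number of groups is usually chosen near the "elbow" of the curve, where adding one more cluster no longer reduces the within-groups sum of squares by much. Using several random starts (`nstart`) makes the result less dependent on the initial centroids.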


Hierarchical clustering

1. Find the two closest objects and merge them into a cluster.
2. Find and merge the next two closest objects (or an object and a cluster, or two clusters) using some similarity measure and a predefined threshold.
3. If more than one cluster remains, return to step 2 until finished.


Hierarchical clustering example

# Ward Hierarchical Clustering
> d <- dist(st2.ado, method = "euclidean")
> fit <- hclust(d, method="ward.D")
> plot(fit) # display dendrogram
> groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
> rect.hclust(fit, k=5, border="red")
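The same calls can be run on data shipped with R. This sketch replaces the slides' data set with the built-in `USArrests` data (an assumption for illustration), standardised so that no single variable dominates the distances.

```r
dat <- scale(USArrests)                       # illustrative data, standardised
d <- dist(dat, method = "euclidean")          # pairwise Euclidean distances

fit_ward <- hclust(d, method = "ward.D")      # Ward clustering, as in the slides
plot(fit_ward, cex = 0.6)                     # dendrogram
groups <- cutree(fit_ward, k = 5)             # cut the tree into 5 clusters
rect.hclust(fit_ward, k = 5, border = "red")  # outline the 5 clusters

table(groups)
```

Unlike k-means, the tree is built once and can then be cut at any number of clusters without refitting.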


[Figure: cluster dendrogram of d, hclust (*, "ward.D"); vertical axis: Height.]


Validating cluster solutions

There are several methods to compare different clustering solutions to the same problem:

• Hubert's gamma coefficient
• Dunn index
• Corrected Rand index

Some of them are implemented in the R package fpc.


Multivariate statistics

• Multivariate statistics means dealing with several variables at the same time.
• Multivariate problems require more complex, multidimensional analyses or dimension reduction methods.
• Metabolomics experiments typically measure many metabolites at once; in other words, the instruments measure multiple variables, so metabolomic data are inherently multivariate.
• The key trick in multivariate statistics is to find a way to effectively reduce the multivariate data to univariate data.
• Then we can apply the same univariate concepts, such as p-values, t-tests and ANOVA, to the data.


Principal Component Analysis

• PCA is a process that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components.
• PCA captures what should be visually detectable.
• If you can't see it, PCA probably won't help.


> data(USArrests)
> pc.cr <- princomp(USArrests, cor = TRUE)
> biplot(pc.cr)
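Beyond the biplot, the fitted object can be inspected numerically. The sketch below repeats the same `princomp()` call and extracts the proportion of variance explained, the loadings and the scores; the object names are chosen for this example.

```r
pc <- princomp(USArrests, cor = TRUE)   # PCA on the correlation matrix

# Proportion of the total variance explained by each component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(cumsum(var_explained), 3)

# Loadings: weight of each original variable in each component
unclass(pc$loadings)

# Scores: coordinates of each state on the first two components
head(pc$scores[, 1:2])
```

The cumulative proportions indicate how many components are needed to retain most of the variability, which is the sense in which PCA reduces the dimension of the data.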


[Figure: biplot of Comp.2 against Comp.1 for the USArrests data; the US states are plotted as points and the variables Murder, Assault, UrbanPop and Rape as arrows.]


Other methods

Several other multivariate methods are increasingly used in metabolomics and related fields:

• Discriminant Analysis (DA, PLS-DA, OPLS-DA)
• Factor Analysis
• Structural Equation Modeling


Available tools

How to analyze data?

R

• R is a freely available language and environment for statistical computing and graphics.
• It provides a wide variety of statistical and graphical techniques.
• It is constantly expanding thanks to user-contributed packages.
• It can be downloaded from http://cran.r-project.org.

Bioconductor

• Bioconductor is a repository of user-contributed R packages.
• It is accessible from http://www.bioconductor.org.
• It provides tools for the analysis and comprehension of high-throughput genomic data.
• It has mailing lists and a very active community of users and developers.


[Figure: Bioconductor submitted packages.]


Installation of Bioconductor packages

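Bioconductor can be installed from within an R session with

> source("http://bioconductor.org/biocLite.R")
> biocLite()

In Bioconductor 3.8 and later, biocLite() has been replaced by the BiocManager package: install.packages("BiocManager") followed by BiocManager::install().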


R basics

Getting help

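• ?mean: open the help page for mean
• help(mean): same as ?mean
• help.search("mean"): search the help system for "mean"
• apropos("mean"): list the objects whose name contains "mean"
• example(mean): run the examples from the help page of mean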


R packages for metabolomics

Useful packages

There are a number of useful packages in Bioconductor for metabolomics data analysis:

• flagme: analysis of metabolomics GC/MS data
• xcms: analysis of metabolomics LC/MS and GC/MS data


LC/MS example

xcms

• Can read data stored in several formats, such as NetCDF, mzXML, mzData and mzML.
• Provides methods for feature detection, non-linear retention time alignment, visualization, relative quantitation and statistics.
• Is capable of simultaneously preprocessing, analyzing and visualizing the raw data from hundreds of samples.
• Is available as an R package or as an online platform accessible through https://xcmsonline.scripps.edu/.


[Figure: typical xcms workflow.]


Reading the data

# use biocLite to install a Bioconductor package
> source("http://bioconductor.org/biocLite.R")
# Install the xcms package
> biocLite("xcms")
# Install dataset package used in this session
> biocLite("faahKO")


The data in faahKO consist of LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH knockout mice. The data are a subset of the original data, from 200-600 m/z and 2500-4500 seconds, collected in positive ionization mode.

# Load libraries
> library("xcms")
> library("faahKO")


> cdfpath <- system.file("cdf", package="faahKO")
> files <- list.files(cdfpath, recursive=TRUE, full.names=TRUE)
> data <- xcmsSet(files)

Some important parameters:

• scanrange = c(lower, upper): scan only part of the spectra
• fwhm = seconds: full width at half maximum of the peaks (default 30 s), chosen according to the type of chromatography
• method = "centWave": use a wavelet-based algorithm for peak detection, suitable for high-resolution spectra


Peak alignment and retention time correction

> xsg <- group(data)   # peak alignment
> xsg <- retcor(xsg)   # retention time correction
> xsg <- group(xsg)    # re-align

• Match peaks across samples
• Use the peak groups to correct retention time drift
• Re-do the alignment
• These steps can be repeated iteratively until no further change occurs


[Figure: retention time deviation vs. retention time for samples ko15, ko16, ko18, ko19, ko21, ko22, wt15, wt16, wt18, wt19, wt21 and wt22 (retention time from 2500 to 4500), with a second panel showing peak density (All, Correction).]


Filling in missing peaks

> xsg <- fillPeaks(xsg)

• A significant number of potential peaks can be missed during peak detection
• Missing values are problematic for robust statistical analysis
• We now have a better idea about where to expect real peaks and their boundaries
• Re-scan the raw spectra and integrate peaks in the regions of the missing peaks
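The re-scan step can be illustrated with a small base R sketch. The chromatogram below is simulated, the boundaries rtmin and rtmax are made-up values standing in for those learned from the samples where the peak was detected, and the trapezoidal rule is an assumption chosen for simplicity rather than a description of how fillPeaks() integrates.

```r
set.seed(2)

# Simulated raw chromatogram for one sample:
# a small Gaussian peak on a noisy baseline
rt <- seq(3300, 3400, by = 0.5)
intensity <- 50 * exp(-(rt - 3350)^2 / (2 * 4^2)) +
  abs(rnorm(length(rt), sd = 2))

# Expected boundaries of the peak group (hypothetical values)
rtmin <- 3340
rtmax <- 3360

# Integrate the raw signal inside the expected region (trapezoidal rule)
inside <- rt >= rtmin & rt <= rtmax
x <- rt[inside]
y <- intensity[inside]
filled_area <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
```

The value filled_area would replace the missing entry for this sample, so that every peak group has an intensity in every sample.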


Results of peak detection

> peaks(xsg)

The peaks() function gives a list of peaks with:
• mz
• mzmin
• mzmax
• rt
• rtmin
• rtmax
• peak intensities/areas (raw data)


Statistical analysis

> report <- diffreport(xsg, "WT", "KO")

• The diffreport() function computes Welch’s two-sample t-statistic for each analyte and ranks the analytes by p-value
• It returns a summary report
• Multivariate analysis and visualization can be performed using MetaboAnalyst
• The report generated by diffreport() can be directly uploaded to MetaboAnalyst
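The per-analyte test and ranking described above can be sketched with base R alone. The intensity matrix is simulated, and the names intens, class_label and ranked are hypothetical. The sketch only illustrates Welch's test applied feature by feature; it does not reproduce the full output of diffreport().

```r
set.seed(3)
n_features <- 20

# Rows = features (peak groups), columns = samples: 6 WT followed by 6 KO
intens <- matrix(rnorm(n_features * 12, mean = 1000, sd = 100),
                 nrow = n_features)
colnames(intens) <- c(paste0("wt", 1:6), paste0("ko", 1:6))
class_label <- factor(rep(c("WT", "KO"), each = 6))

# Make the first feature clearly higher in the KO group
intens[1, class_label == "KO"] <- intens[1, class_label == "KO"] + 600

# Welch's two-sample t-test per feature
# (t.test() uses var.equal = FALSE by default, i.e. the Welch version)
res <- t(apply(intens, 1, function(x) {
  tt <- t.test(x[class_label == "WT"], x[class_label == "KO"])
  c(tstat = unname(tt$statistic), pvalue = tt$p.value)
}))

# Rank features by p-value, smallest first
ranked <- data.frame(feature = seq_len(n_features), res)
ranked <- ranked[order(ranked$pvalue), ]
```

With many features tested at once, the raw p-values would normally be adjusted for multiple testing, for instance with p.adjust(ranked$pvalue, method = "BH").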


Visualizing important peaks

# Select peaks with median retention time
# between 3300 and 3400 and detected in
# at least 8 samples
> gr <- groups(xsg)
> groupidx <- which(gr[,"rtmed"] > 3300 &
                    gr[,"rtmed"] < 3400 &
                    gr[,"npeaks"] >= 8)[1]

> eiccor <- getEIC(xsg, groupidx=groupidx)
> plot(eiccor, col=as.numeric(phenoData(xsg)$class))

• When significant peaks are identified, it is critical to visualize these peaks to assess quality
• This is done using the Extracted Ion Chromatogram (EIC)
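The selection logic used to pick a peak group can be tried on a toy matrix without loading any data. The matrix below is a made-up stand-in for the output of groups(); only the column names rtmed and npeaks are taken from the slide, and all values are invented for illustration.

```r
# Toy stand-in for the matrix returned by groups(); values are made up
gr <- cbind(
  mzmed  = c(300.15, 412.30, 279.00, 508.20),
  rtmed  = c(3250,   3355,   3380,   3390),
  npeaks = c(12,     5,      9,      12)
)

# Rows meeting all three conditions
hits <- which(gr[, "rtmed"] > 3300 &
              gr[, "rtmed"] < 3400 &
              gr[, "npeaks"] >= 8)

# Keep only the first matching group, as in the slide
groupidx <- hits[1]
```

Here rows 3 and 4 satisfy the conditions, so groupidx is 3. Dropping the trailing [1] would keep every matching group instead of only the first.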


[Figure: Extracted Ion Chromatogram: 300.1 − 300.2 m/z. x-axis: Retention Time (seconds), 3300 to 3450; y-axis: Intensity, 0 to 250000.]


Further reading

Some references

• Broadhurst, D. I., Kell, D. B. (2007). Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics, 2 (4), 171–196.

• Worley, B., Powers, R. (2013). Multivariate Analysis in Metabolomics. Current Metabolomics, 1 (1), 92–107.

• Issaq, H. J., Van, Q. N., Waybright, T. J., Muschik, G. M., Veenstra, T. D. (2009). Analytical and statistical approaches to metabolomics research. Journal of Separation Science, 32, 2183–2199.

• Smith, C. A. (2014). LC/MS Preprocessing and Analysis with xcms. R package documentation.

• Korman, A., Oh, A., Raskind, A., Banks, D. (2012). Statistical methods in metabolomics. Methods in Molecular Biology, 856 (Evolutionary genomics), 381–413. Springer.


Centre for Research in Environmental Epidemiology

Parc de Recerca Biomèdica de Barcelona
Doctor Aiguader, 88
08003 Barcelona (Spain)
Tel. (+34) 93 214 70 00
Fax (+34) 93 214 73 02

[email protected]