Statistical methods in Metabolomics

David Moriña (a, b)

[email protected]

(a) Centre for Research in Environmental Epidemiology (CREAL)
(b) Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona

May 08 2014, Reus


Contents

1 Introduction

2 Basic statistics

3 Available tools

4 R basics

5 LC/MS example

6 Further reading


Introduction

Where does data come from?

Metabolomics

• Metabolomics is the analysis and study of the set of metabolites in a cell, organ, or tissue.

• To detect and quantify metabolites, separation techniques such as gas or liquid chromatography, followed by quantification by mass spectrometry (GC/MS or LC/MS), are often used.

• Nuclear magnetic resonance (NMR) spectroscopy is also frequently employed and has some appealing properties:

  • It is non-destructive, in the sense that it does not "destroy" the samples during the analysis.

  • It is useful when analyzing tissues or when sequential analysis of samples is required.


Basic statistics

What do people say about statistics?

• There are three kinds of lies: lies, damned lies, and statistics. (B. Disraeli)

• Statistics are like bikinis: What they reveal is suggestive, but what they hide is vital. (A. Levenstein)

• About 93% of all statistics are made up. (Any newspaper)

Distributions and hypothesis testing

Normal distribution

Central Limit Theorem

Under fairly mild conditions, the distribution of the sum of independent and identically distributed random variables approaches a normal distribution, provided the number of observations is not too small.

The following example shows the distribution of the sum of the scores obtained when rolling 1, 2, 3, 5 and 10 dice:

[Figure: distribution of the sum of the scores for each number of dice.]
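The dice example can be reproduced by simulation. The following is a minimal sketch in base R; the seed, the number of simulated rolls and the helper name `dice_sum` are assumptions made for illustration, not part of the original slides.

```r
# Simulate the sum of the scores of k fair dice (base R only).
set.seed(1)        # arbitrary seed, only for reproducibility
n_rolls <- 10000   # number of simulated experiments (assumed value)

# Hypothetical helper: returns n sums, each obtained by rolling k dice
dice_sum <- function(k, n = n_rolls) {
  rolls <- matrix(sample(1:6, n * k, replace = TRUE), nrow = n)
  rowSums(rolls)   # one sum per simulated experiment
}

n_dice <- c(1, 2, 3, 5, 10)
sums <- lapply(n_dice, dice_sum)
names(sums) <- as.character(n_dice)

# One bar plot of relative frequencies per number of dice
op <- par(mfrow = c(2, 3))
for (k in names(sums)) {
  barplot(table(sums[[k]]) / n_rolls,
          main = paste("Number of dice:", k),
          xlab = "Sum", ylab = "Relative frequency")
}
par(op)
```

With one die the bars are flat; as the number of dice grows, the shape of the distribution of the sum gets closer to the bell shape of the normal distribution.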

Testing normality

There are several methods to test whether a random variable follows a normal distribution. Some of them are

• Kolmogorov-Smirnov test
• Shapiro-Wilk test
• Graphical methods (QQ-plot, ...)
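The methods listed above are all available in base R. The sketch below applies them to simulated data; the two samples (one normal, one log-normal) and all parameter values are assumptions chosen only to illustrate the calls.

```r
set.seed(2)                                     # arbitrary seed
x_norm <- rnorm(200, mean = 10, sd = 2)         # simulated normal sample
x_skew <- rlnorm(200, meanlog = 0, sdlog = 1)   # simulated right-skewed sample

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
sw_norm <- shapiro.test(x_norm)
sw_skew <- shapiro.test(x_skew)

# Kolmogorov-Smirnov test against a normal distribution whose parameters are
# estimated from the sample (this makes the p-value only approximate)
ks_skew <- ks.test(x_skew, "pnorm", mean = mean(x_skew), sd = sd(x_skew))

# Graphical check: points far from the line suggest non-normality
qqnorm(x_skew)
qqline(x_skew)

c(shapiro_normal = sw_norm$p.value,
  shapiro_skewed = sw_skew$p.value,
  ks_skewed      = ks_skew$p.value)
```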

There are some mathematical functions that can be applied in order to stabilize the variance of a random variable:

• log transformation
• logit transformation
• Square root transformation
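As a rough illustration of these transformations, the sketch below applies each of them to simulated data. The choice of a log-normal variable, proportions in (0, 1) and Poisson counts is an assumption made here to match the usual use of each transformation.

```r
set.seed(3)                                # arbitrary seed
x      <- rlnorm(500, meanlog = 0, sdlog = 1)   # positive, right-skewed values
p      <- runif(500, min = 0.01, max = 0.99)    # proportions strictly in (0, 1)
counts <- rpois(500, lambda = 4)                # count data

log_x       <- log(x)        # log transformation
logit_p     <- qlogis(p)     # logit transformation: log(p / (1 - p))
sqrt_counts <- sqrt(counts)  # square root transformation

# Densities before and after the log transformation
op <- par(mfrow = c(1, 2))
plot(density(x), main = "Original variable", xlab = "X scale")
plot(density(log_x), main = "log-transformed variable", xlab = "log(X) scale")
par(op)
```

Here `qlogis()` is the base R quantile function of the logistic distribution, which coincides with the logit.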

[Figure: density of the original variable (X scale) and density of the log-transformed variable (log(X) scale).]

Significance

The interpretation of the p-value is well known... Are you sure?

• Probability of a difference at least as large as the one observed if H0 is true (by chance)

• Probability of making a mistake when rejecting H0

• Evidence against H0 provided by the sample. If the p-value is small, the observed differences are unlikely to arise by chance

• Probability that the observed differences are false

• 1 - p-value = Probability that the observed differences are real

• The p-value is computed under the assumption that H0 is true, and therefore it cannot provide direct information about whether H0 is true

• Scientists should decide on H0 based on the evidence against it that the sample provides, without knowing the true state of reality

• If a confidence level 1 − α is fixed:

p < α → Statistically significant differences
p ≥ α → No statistically significant differences

Comparing two populations

• Student's t test was designed to compare two means.
• A t-test can also be used to determine whether 2 clusters are different.
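A two-sample comparison of means can be run with `t.test()` from base R. This is a minimal sketch on simulated data; the group sizes, means and standard deviations are assumptions for illustration only.

```r
set.seed(4)                                # arbitrary seed
group_a <- rnorm(30, mean = 10, sd = 2)    # simulated measurements, group A
group_b <- rnorm(30, mean = 13, sd = 2)    # simulated measurements, group B

# By default t.test() does not assume equal variances (Welch's version)
res_welch <- t.test(group_a, group_b)

# Classical Student's t test, assuming equal variances in both groups
res_student <- t.test(group_a, group_b, var.equal = TRUE)

res_welch$p.value
res_student$p.value
```

If the p-value is below the chosen α, the difference between the two means is declared statistically significant.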

[Figure: scatter plot of Value against Time.]

Comparing two populations (non-gaussian)

• If the distribution of the variable of interest is not Gaussian, we can still compare two populations by means of the Mann-Whitney U test (for independent samples) or the Wilcoxon signed-rank test (for paired samples).

• Formally, these non-parametric tests compare two medians (strictly, this reading assumes that the distributions being compared have the same shape).
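In base R both tests are provided by `wilcox.test()`. The sketch below uses simulated, clearly non-Gaussian data; the exponential samples and the size of the shift are assumptions for illustration.

```r
set.seed(5)                      # arbitrary seed

# Independent samples: Mann-Whitney U test
x <- rexp(25, rate = 1)
y <- rexp(25, rate = 1) + 2      # same shape, shifted by 2
mw <- wilcox.test(x, y)

# Paired samples: Wilcoxon signed-rank test
before <- rexp(25, rate = 1)
after  <- before + rnorm(25, mean = 1, sd = 0.5)
sr <- wilcox.test(before, after, paired = TRUE)

c(mann_whitney = mw$p.value, signed_rank = sr$p.value)
```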

Comparing three (or more) populations

• If we want to compare more than two groups, we can use the ANOVA technique.

• Essentially, it is a generalization of Student's t test.
• Intra-group variance should be similar.
• Normality is not crucial.
• It only tells us whether at least one of the compared groups differs from the others.
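A one-way ANOVA can be sketched with `aov()` from base R. The three simulated groups below, and the use of Bartlett's test to look at the intra-group variances, are assumptions made for illustration.

```r
set.seed(6)                                # arbitrary seed
dat <- data.frame(
  value = c(rnorm(20, mean = 10, sd = 2),  # group A
            rnorm(20, mean = 10, sd = 2),  # group B
            rnorm(20, mean = 14, sd = 2)), # group C, shifted by construction
  group = factor(rep(c("A", "B", "C"), each = 20))
)

fit <- aov(value ~ group, data = dat)
summary(fit)

# Global p-value: is at least one group mean different?
p_anova <- summary(fit)[[1]][["Pr(>F)"]][1]

# One possible check of the "similar intra-group variance" condition
bart <- bartlett.test(value ~ group, data = dat)
```

A small `p_anova` only says that some group differs; it does not say which one.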

• ANOVA can also be used to determine whether 3 or more clusters are different.

[Figure: scatter plot of Value against Time.]

If H0 can be rejected, which group is the different one?

• We need to perform a posteriori (post hoc) tests on the means.
• They compare each pair of means.
• They are more conservative, in order to control the overall type I error αT.

Multiple comparisons

There are several methods to control type I error:

• Bonferroni
• Holm
• Tukey
• Scheffé
• Dunnett (control)
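Three of these corrections (Bonferroni, Holm and Tukey) are available in base R; Scheffé and Dunnett require additional packages and are not shown. The sketch below is run on simulated data, and the group structure is an assumption for illustration.

```r
set.seed(7)                                # arbitrary seed
dat <- data.frame(
  value = c(rnorm(20, mean = 10, sd = 2),
            rnorm(20, mean = 10, sd = 2),
            rnorm(20, mean = 14, sd = 2)),
  group = factor(rep(c("A", "B", "C"), each = 20))
)
fit <- aov(value ~ group, data = dat)

# Tukey's honest significant differences for every pair of means
tuk <- TukeyHSD(fit)

# Pairwise t tests with Bonferroni and Holm adjusted p-values
pw_bonf <- pairwise.t.test(dat$value, dat$group, p.adjust.method = "bonferroni")
pw_holm <- pairwise.t.test(dat$value, dat$group, p.adjust.method = "holm")

tuk$group        # columns: diff, lwr, upr, p adj
pw_holm$p.value  # lower-triangular matrix of adjusted p-values
```

Holm's adjusted p-values are never larger than Bonferroni's, so Holm is the less conservative of the two.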

False Discovery Rate

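Suppose you performed 100 different t-tests, and found 20 results with a p-value < 0.05.

• How many of these 20 results are likely to be false positives?

• A rough estimate is 20 · 0.05 = 1; note that if all 100 null hypotheses were true, about 100 · 0.05 = 5 false positives would be expected.

• To correct for this, we can consider as significant only the results with a p-value < 0.05/20, that is, p < 0.0025. A Bonferroni correction over all 100 tests would require p < 0.05/100 = 0.0005.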
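In practice, p-values from many tests are usually adjusted with `p.adjust()` from base R, which implements both the Bonferroni correction and the Benjamini-Hochberg false discovery rate procedure. The sketch below simulates 100 t-tests; the split into 80 true null hypotheses and 20 real effects, and the effect size, are assumptions for illustration.

```r
set.seed(8)   # arbitrary seed

# 80 tests where H0 is true and 20 tests with a real difference (assumed)
p_null <- replicate(80, t.test(rnorm(10), rnorm(10))$p.value)
p_alt  <- replicate(20, t.test(rnorm(10), rnorm(10, mean = 2))$p.value)
p <- c(p_null, p_alt)

p_bonf <- p.adjust(p, method = "bonferroni")  # controls the family-wise error
p_bh   <- p.adjust(p, method = "BH")          # controls the false discovery rate

n_raw  <- sum(p < 0.05)       # significant results without any correction
n_bonf <- sum(p_bonf < 0.05)  # significant after Bonferroni
n_bh   <- sum(p_bh < 0.05)    # significant after Benjamini-Hochberg

c(raw = n_raw, bonferroni = n_bonf, BH = n_bh)
```

The Benjamini-Hochberg procedure never rejects fewer hypotheses than Bonferroni, which is why it is popular when many metabolites are tested at once.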

Correlation

• If there is some dependency between two variables, or a relationship between the predicted and the observed variable, or if the "before" and "after" treatments led to some effect, then a clear pattern can be seen in the scatter plot

• This pattern or relationship is called correlation

[Figure: three scatter plots showing positive correlation (y against x), negative correlation (z against x) and no correlation (t against x).]

The correlation coefficient (Pearson coefficient) is computed by means of

\[
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}
\]
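The Pearson coefficient can be computed directly from its definition and compared with the built-in `cor()`. This is a minimal sketch on simulated data; the linear relationship and noise level are assumptions for illustration.

```r
set.seed(9)                                # arbitrary seed
x <- rnorm(50)
y <- 5 + 0.8 * x + rnorm(50, sd = 0.5)     # y depends linearly on x plus noise

# Pearson coefficient written out from its definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent
r_builtin <- cor(x, y, method = "pearson")

# Test of H0: the true correlation is zero
ct <- cor.test(x, y)

c(manual = r_manual, builtin = r_builtin, p_value = ct$p.value)
```

`cor.test()` also reports a confidence interval for the correlation, which is useful because a large r computed from very few points may still not be significant.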

Correlation and significance

[Figure: two scatter plots, one with r = 0.98 and one with r = 0.22.]

Clustering

[Figure: scatter plot of the data, followed by a cluster plot (CLUSPLOT) of mydata on Component 1 and Component 2; these two components explain 100 % of the point variability.]

• Clustering is a process by which objects that are logically similar in characteristics are grouped together

• It is a step that precedes classification.
• It requires a method to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects
• It uses a threshold value to decide whether an object belongs to a cluster
• There are several clustering methods, differing in how they start the clustering process

• K-means algorithm: divides a set of N objects into M clusters, with or without overlap. M must be specified by the analyst

• Hierarchical clustering: produces a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains

K-means algorithm

• Make the first object the centroid for the first cluster
• For the next object, calculate the similarity to each existing centroid
• If the similarity is greater than a threshold, add the object to the existing cluster and redetermine the centroid; otherwise, use the object to start a new cluster
• Return to step 2 and repeat until done
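The steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the function name `threshold_cluster` is hypothetical, Euclidean distance is used as the dissimilarity (so "similarity greater than a threshold" becomes "distance smaller than a threshold"), and objects are processed in row order. Note that R's built-in `kmeans()` works differently: it takes the number of clusters as input and iteratively reassigns all objects.

```r
# Hypothetical helper implementing the sequential, threshold-based steps.
# x: numeric matrix or data frame (one object per row)
# threshold: maximum distance to a centroid for joining its cluster
threshold_cluster <- function(x, threshold) {
  x <- as.matrix(x)
  centroids <- x[1, , drop = FALSE]   # step 1: first object is the first centroid
  labels <- integer(nrow(x))
  labels[1] <- 1L
  for (i in seq_len(nrow(x))[-1]) {
    # step 2: distance from object i to each existing centroid
    obj <- matrix(x[i, ], nrow = nrow(centroids), ncol = ncol(x), byrow = TRUE)
    d <- sqrt(rowSums((centroids - obj)^2))
    k <- which.min(d)
    if (d[k] < threshold) {
      # step 3a: join the closest cluster and recompute its centroid
      labels[i] <- k
      centroids[k, ] <- colMeans(x[labels == k, , drop = FALSE])
    } else {
      # step 3b: start a new cluster with this object as centroid
      centroids <- rbind(centroids, x[i, ])
      labels[i] <- nrow(centroids)
    }
  }
  list(cluster = labels, centers = centroids)
}

# Usage on two simulated, well-separated groups of 20 points each
set.seed(10)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
res <- threshold_cluster(x, threshold = 3)
table(res$cluster)
```

The result of this kind of procedure depends on the order of the objects and on the threshold, which is one reason why several runs or other algorithms are often compared.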

K-means algorithm example

# Read data
> st1 <- read.table("Data/global_afegits.csv", sep=";", dec=",", header=TRUE)
# Determine number of clusters
# (st2.ado is the data frame analysed here; its construction is not shown)
> n <- nrow(st2.ado)
> wss <- numeric(10)
> wss[1] <- (n-1)*sum(apply(st2.ado[,2:8],2,var))
> for (i in 2:10)
    {
      wss[i] <- sum(kmeans(na.omit(st2.ado[,2:8]), centers=i)$withinss)
    }
> plot(1:10,wss,type="b",xlab="Number of groups",ylab="Within groups sum of squares")

[Figure: within groups sum of squares against number of groups (1 to 10).]

If we choose 5 clusters, then

> fit <- kmeans(st2.ado, 5)

will classify the observations into the 5 groups.
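Since the data set used in the example is not distributed with the slides, the same workflow can be tried on a data set that ships with R. The sketch below uses the built-in `iris` measurements; the choice of data set, the scaling step, `nstart = 20` and the final choice of 3 clusters are assumptions for illustration.

```r
set.seed(11)                     # k-means starts from random centers
x <- scale(iris[, 1:4])          # standardize the four numeric variables
n <- nrow(x)

# Within-groups sum of squares for 1 to 10 clusters
wss <- numeric(10)
wss[1] <- (n - 1) * sum(apply(x, 2, var))
for (k in 2:10) {
  wss[k] <- kmeans(x, centers = k, nstart = 20)$tot.withinss
}

plot(1:10, wss, type = "b",
     xlab = "Number of groups", ylab = "Within groups sum of squares")

# Fit the chosen solution and compare it with the known species
fit <- kmeans(x, centers = 3, nstart = 20)
table(cluster = fit$cluster, species = iris$Species)
```

The number of groups is usually chosen where the curve stops dropping sharply (the "elbow").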

Hierarchical clustering

• Find the two closest objects and merge them into a cluster
• Find and merge the next two closest objects (or an object and a cluster, or two clusters) using some similarity measure and a predefined threshold
• If more than one cluster remains, return to step 2 until finished

Hierarchical clustering example

# Ward Hierarchical Clustering
> d <- dist(st2.ado, method = "euclidean")
> fit <- hclust(d, method="ward.D")
> plot(fit) # display dendrogram
> groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
> rect.hclust(fit, k=5, border="red")
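A self-contained variant of this example can be run on the built-in `USArrests` data. The data set, the scaling step, the choice of 4 clusters and the use of `"ward.D2"` (which, per the `hclust` documentation, applies Ward's criterion to unsquared distances) are assumptions made here for illustration.

```r
x <- scale(USArrests)                  # standardize the variables
d <- dist(x, method = "euclidean")     # pairwise Euclidean distances
fit <- hclust(d, method = "ward.D2")   # Ward hierarchical clustering

plot(fit)                                # display dendrogram
groups <- cutree(fit, k = 4)             # cut tree into 4 clusters
rect.hclust(fit, k = 4, border = "red")  # red borders around the 4 clusters

table(groups)                            # cluster sizes
```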

[Figure: cluster dendrogram, hclust (*, "ward.D") applied to d; vertical axis: Height.]

Validating cluster solutions

There are several methods to compare different clustering solutions to the same problem.

• Hubert's gamma coefficient
• Dunn index
• Corrected Rand index

Some of them are implemented in R package fpc
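To make the corrected (adjusted) Rand index concrete, the sketch below computes it in base R from the contingency table of two partitions and uses it to compare a k-means and a hierarchical solution. The helper name `adjusted_rand`, the `USArrests` data and the choice of 4 clusters are assumptions for illustration; in practice a packaged implementation such as the one in fpc would normally be used.

```r
# Hypothetical helper: corrected Rand index between two partitions a and b
adjusted_rand <- function(a, b) {
  tab <- table(a, b)                       # contingency table of the two labelings
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))            # pairs placed together in both partitions
  sum_a  <- sum(choose(rowSums(tab), 2))   # pairs placed together in a
  sum_b  <- sum(choose(colSums(tab), 2))   # pairs placed together in b
  expected <- sum_a * sum_b / choose(n, 2) # value expected by chance
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

x <- scale(USArrests)
set.seed(13)                               # k-means starts from random centers
km <- kmeans(x, centers = 4, nstart = 20)$cluster
hc <- cutree(hclust(dist(x), method = "ward.D2"), k = 4)

ari <- adjusted_rand(km, hc)
ari
```

A value of 1 means that the two solutions are identical up to relabeling of the clusters, while values close to 0 indicate agreement no better than chance.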

Multivariate statistics

• Multivariate statistics means dealing with several variables at the same time
• Multivariate problems require more complex, multidimensional analyses or dimension reduction methods
• Metabolomics experiments typically measure many metabolites at once; in other words, the instruments measure multiple variables, and so metabolomic data are inherently multivariate
• The key trick in multivariate statistics is to find a way to effectively reduce the multivariate data to univariate data
• Then we can apply the same univariate concepts, such as p-values, t-tests and ANOVA, to the data

Principal Component Analysis

• PCA is a process that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components

• PCA captures what should be visually detectable
• If you can't see it, PCA probably won't help
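How much a small number of components summarizes the data can be checked from the variance explained by each of them. The sketch below uses `prcomp()` on the built-in `USArrests` data; the choice of `prcomp()` (an alternative to `princomp()`), of this data set and of scaling the variables are assumptions for illustration.

```r
# PCA on standardized variables
pc <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(cumsum(var_explained), 3)    # cumulative proportion

screeplot(pc, type = "lines")      # variance of each component
biplot(pc)                         # observations and variables on PC1 and PC2

# The components are uncorrelated by construction
round(cor(pc$x), 3)
```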

> data(USArrests)
> pc.cr <- princomp(USArrests, cor = TRUE)
> biplot(pc.cr)

[Figure: biplot of the first two principal components (Comp.1 and Comp.2) of USArrests, showing the US states and the variables Murder, Assault, UrbanPop and Rape.]

Other methods

There are several other multivariate methods that are increasingly used in metabolomics and related fields:

• Discriminant Analysis (DA, PLS-DA, OPLS-DA)
• Factor Analysis
• Structural Equation Modeling


Available tools

How to analyze data?

R

• R is a freely available language and environment for statistical computing and graphics.

• It provides a wide variety of statistical and graphical techniques.
• It is constantly expanding thanks to user-contributed packages.
• It can be downloaded from http://cran.r-project.org.

Bioconductor

• Bioconductor is a repository of user-contributed R packages.
• It is accessible from http://www.bioconductor.org.
• It provides tools for the analysis and comprehension of high-throughput genomic data.
• It has mailing lists and a very active user and developer community.

[Figure: Bioconductor submitted packages.]

Installation of Bioconductor packages

Bioconductor can be installed from within an R session with

source("http://bioconductor.org/biocLite.R")
biocLite()
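On recent Bioconductor releases the biocLite.R script has been replaced by the BiocManager package from CRAN. The sketch below wraps the newer installation route in a small helper; the helper name `install_bioc` is hypothetical, and the function is only defined here, not called, because installing requires internet access.

```r
# Hypothetical helper: install Bioconductor packages through BiocManager
install_bioc <- function(pkgs) {
  # Install BiocManager itself from CRAN if it is not available yet
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkgs)
}

# Usage (requires internet access):
# install_bioc(c("xcms", "faahKO"))
```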


R basics

Getting help

• ?mean

• help(mean)

• help.search("mean")

• apropos("mean")

• example(mean)

R packages for metabolomics

Useful packages

There are a number of useful packages in Bioconductor for metabolomics data analysis.

• flagme: Analysis of metabolomics GC/MS data

• xcms: Analysis of metabolomics LC/MS data

LC/MS example

xcms

• It can read data stored in several formats, such as NetCDF, mzXML, mzData and mzML.

• It provides methods for feature detection, non-linear retention time alignment, visualization, relative quantitation and statistics.

• It is capable of simultaneously preprocessing, analyzing, and visualizing the raw data from hundreds of samples.

• It is available as an R package or as an online platform accessible through https://xcmsonline.scripps.edu/.

[Figure: typical xcms workflow.]

Reading the data

# use biocLite to install a Bioconductor package
> source("http://bioconductor.org/biocLite.R")
# Install the xcms package
> biocLite("xcms")
# Install dataset package used in this session
> biocLite("faahKO")

The data in faahKO consist of LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH knockout mice. The data are a subset of the original data, from 200-600 m/z and 2500-4500 seconds. They were collected in positive ionization mode.

# Load libraries
> library("xcms")
> library("faahKO")

> cdfpath <- system.file("cdf", package="faahKO")
> files <- list.files(cdfpath, recursive=TRUE, full.names=TRUE)
> data <- xcmsSet(files)

Some important parameters

• scanrange=c(lower, upper): to scan part of the spectra
• fwhm = seconds: specify full width at half maximum (default 30s) based on the type of chromatography
• method = "centWave": use wavelet algorithm for peak detection, suitable for high resolution spectra

Peak alignment and retention time correction

> xsg <- group(data) # peak alignment
> xsg <- retcor(xsg) # retention time correction
> xsg <- group(xsg) # re-align

• Matching peaks across samples
• Using the peak groups to correct drift
• Re-do the alignment
• Can be performed iteratively until no further change

[Figure: retention time deviation vs. retention time for samples ko15, ko16, ko18, ko19, ko21, ko22, wt15, wt16, wt18, wt19, wt21 and wt22, together with the peak density against retention time for all peaks and for the peaks used for correction.]

Filling in missing peaks

> xsg <- fillPeaks(xsg)

• A significant number of potential peaks can be missed during peak detection

• Missing values are problematic for robust statistical analysis
• We now have a better idea about where to expect real peaks and their boundaries
• Re-scan the raw spectra and integrate peaks in the regions of the missing peaks

Results of peak detection

> peaks(xsg)

The peaks() function gives a list of peaks with

• mz
• mzmin
• mzmax
• rt
• rtmin
• rtmax
• peak intensities/areas (raw data)

Statistical analysis

> report <- diffreport(xsg, "WT", "KO")

• The diffreport() function computes Welch's two-sample t-statistic for each analyte and ranks them by p-value.

• It returns a summary report
• Multivariate analysis and visualization can be performed using MetaboAnalyst
• The report generated by diffreport() can be directly uploaded to MetaboAnalyst
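The statistic behind this kind of report can be illustrated without xcms. The sketch below runs Welch's two-sample t test feature by feature on a simulated intensity matrix and ranks the features by p-value; the matrix, the 6 WT and 6 KO columns and the shifted features are assumptions for illustration, and this is not the implementation used by diffreport().

```r
set.seed(16)          # arbitrary seed
n_feat <- 50          # number of simulated features (assumed)

# Simulated intensities: rows are features, columns are samples
wt <- matrix(rnorm(n_feat * 6, mean = 10), nrow = n_feat)   # 6 wild-type samples
ko <- matrix(rnorm(n_feat * 6, mean = 10), nrow = n_feat)   # 6 knockout samples
ko[1:5, ] <- ko[1:5, ] + 4    # the first five features differ by construction

# Welch's t test for each feature (t.test() does not assume equal variances)
welch <- t(sapply(seq_len(n_feat), function(i) {
  tt <- t.test(ko[i, ], wt[i, ])
  c(feature = i, tstat = unname(tt$statistic), pvalue = tt$p.value)
}))

# Rank features by p-value, smallest first
ranked <- welch[order(welch[, "pvalue"]), ]
head(ranked)
```

With many features, the ranked p-values would normally be adjusted for multiple testing before declaring any feature significant.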

Visualizing important peaks

# Select the first peak group with median retention time
# between 3300 and 3400 and detected in
# at least 8 samples
> gr <- groups(xsg)
> groupidx <- which(gr[,"rtmed"]>3300 &
                    gr[,"rtmed"]<3400 &
                    gr[,"npeaks"]>=8)[1]
> eiccor <- getEIC(xsg, groupidx=groupidx)
> plot(eiccor, col=as.numeric(phenoData(xsg)$class))

• When significant peaks are identified, it is critical to visualize these peaks to assess quality

• This is done using the Extracted Ion Chromatogram (EIC)

[Figure: extracted ion chromatogram, 300.1 - 300.2 m/z; intensity against retention time (seconds).]

Further reading

Some references

• Broadhurst, D. I., Kell, D. B. (2007). Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics, 2 (4), 171–196.

• Worley, B., Powers, R. (2013). Multivariate analysis in metabolomics. Current Metabolomics, 1 (1), 92–107.

• Issaq, H. J., Van, Q. N., Waybright, T. J., Muschik, G. M., Veenstra, T. D. (2009). Analytical and statistical approaches to metabolomics research. Journal of Separation Science, 32, 2183–2199.

• Smith, C. A. (2014). LC/MS preprocessing and analysis with xcms. R package documentation.

• Korman, A., Oh, A., Raskind, A., Banks, D. (2012). Statistical methods in metabolomics. Methods in Molecular Biology, 856 (Evolutionary genomics), 381–413. Springer.

Centre for Research in Environmental Epidemiology

Parc de Recerca Biomèdica de Barcelona
Doctor Aiguader, 88
08003 Barcelona (Spain)
Tel. (+34) 93 214 70 00
Fax (+34) 93 214 73 02

[email protected]
