
http://workflow4metabolomics.org

HOW TO PERFORM UNIVARIATE ANALYSES?

W4M Core Team


The "Univariate" module

The "Univariate" module on W4M allows you to perform:

• The Student t-test, to compare two population means
• The Wilcoxon test, to compare two population medians (non-parametric)
• The Analysis of Variance (followed by pairwise comparisons with Student t-tests)
• The Kruskal-Wallis test, to compare more than two population medians (non-parametric; followed by pairwise comparisons with Wilcoxon tests)
• The correlation test with the Pearson or the Spearman (non-parametric) methods

The Univariate module is a wrapper of the corresponding tests from the R software.


Chaining the statistical modules

The Univariate module can be chained with the Multivariate module, and also with the Filters module (either to filter out pool or blank samples before the statistics, or to filter out variables according to a statistical threshold after the analysis).


Your data must be split into 3 files:

• dataMatrix.tsv
• sampleMetadata.tsv
• variableMetadata.tsv


Each file can be prepared in Excel and saved in the tab-delimited text format.


You can then rename your file with the .tsv extension (instead of .txt) by right-clicking on the file.

.tsv files (i.e. tab-separated values) are handled correctly by both Excel and Galaxy.


Decimal separator must be "."

Missing values must be indicated as "NA"


Note: you can switch your default language in Excel to English in order to have your decimal separator automatically set to "."


The dataMatrix.tsv file must contain:

• the names of your samples in the first row

• the names of your variables in the first column

• numbers (or NA) in all the other cells

Note: the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel compatibility
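The layout rules above (sample names in the first row, variable names in the first column, numbers or NA elsewhere, "." as decimal separator) can be checked before uploading. The following is a minimal Python sketch and is not part of W4M; the function name `check_data_matrix` and the example content are made up for illustration, and the number check is deliberately rough (it accepts anything Python's `float` accepts).

```python
import csv
import io


def check_data_matrix(text):
    """Return a list of problems found in a tab-separated dataMatrix."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    problems = []
    for row_number, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_number}: expected {len(header)} cells, found {len(row)}"
            )
            continue
        # Skip the first cell (variable name); the other cells must be numbers or NA.
        for sample, cell in zip(header[1:], row[1:]):
            if cell == "NA":
                continue
            try:
                float(cell)  # fails on "2,5" (decimal comma) and on empty cells
            except ValueError:
                problems.append(
                    f"row {row_number}, sample {sample}: {cell!r} is not a number or NA"
                )
    return problems


# Hypothetical content: variable v2 has a decimal comma and an empty cell.
example = "dataMatrix\ts1\ts2\nv1\t1.5\tNA\nv2\t2,5\t\n"
for problem in check_data_matrix(example):
    print(problem)
```

On this example the sketch reports two problems, both on the row of variable v2.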


The sampleMetadata.tsv file must contain:

• the names of the factors to be used in statistical analyses in the first row
• columns of characters for qualitative factors, and columns of numbers for quantitative factors
• the names of your samples in the first column, which must exactly match those of the dataMatrix.tsv file

Note:

• 1) the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel compatibility
• 2) [...] though it is not used in your Galaxy analysis
• 3) results from statistical analyses (e.g. scores) will be added as supplementary columns in this file


The variableMetadata.tsv file must contain:

• the names of the metadata (e.g. mzmed, rtmed) in the first row (there must be at least one column in addition to the variable names)
• the names of your variables in the first column, which must exactly match those of the dataMatrix.tsv file

Note:

• 1) the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel compatibility
• 2) [...] though it is not used in your Galaxy analysis
• 3) [...] new columns in this file
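The requirement that sample and variable names exactly match those of the dataMatrix.tsv file can be verified with a short script. This is a hedged Python sketch, not a W4M tool: the function names and the example content are hypothetical, and the comparison only looks at which names are present, not at their order.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into a list of rows (lists of cells)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def name_mismatches(data_matrix, sample_metadata, variable_metadata):
    """Return the sample and variable names that appear in only one file."""
    matrix = read_tsv(data_matrix)
    matrix_samples = set(matrix[0][1:])                # first row, A1 cell skipped
    matrix_variables = {row[0] for row in matrix[1:]}  # first column
    samples = {row[0] for row in read_tsv(sample_metadata)[1:]}
    variables = {row[0] for row in read_tsv(variable_metadata)[1:]}
    return sorted(matrix_samples ^ samples), sorted(matrix_variables ^ variables)


# Hypothetical content: "V2" in the variable metadata does not match "v2".
data_matrix = "dataMatrix\ts1\ts2\nv1\t1.5\tNA\nv2\t2.5\t3.0\n"
sample_metadata = "sampleMetadata\tage\ns1\tjunior\ns2\tsenior\n"
variable_metadata = "variableMetadata\tmzmed\nv1\t100.1\nV2\t200.2\n"

print(name_mismatches(data_matrix, sample_metadata, variable_metadata))
# ([], ['V2', 'v2'])
```

The example shows that matching is case-sensitive: a name that differs only by letter case is reported on both sides.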


Sample and variable names:

• should contain only:
   • a b c d e f g h i j k l m n o p q r s t u v w x y z
   • A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
   • 0 1 2 3 4 5 6 7 8 9
   • , [comma]
   • - [dash]
   • _ [underscore]
   •   [blank]
• should not use other punctuation marks or accents
• should not contain any duplicates
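The naming rules above translate directly into a regular expression and a duplicate count. The sketch below is an illustration in Python, not W4M code; the function names and the example names are made up.

```python
import re
from collections import Counter

# Letters, digits, comma, dash, underscore and blank, as listed above.
ALLOWED = re.compile(r"[A-Za-z0-9,\-_ ]+")


def invalid_names(names):
    """Names that are empty or contain a character outside the allowed set."""
    return [name for name in names if not ALLOWED.fullmatch(name)]


def duplicated_names(names):
    """Names that occur more than once."""
    return [name for name, count in Counter(names).items() if count > 1]


# Hypothetical sample names: one with an accent, one duplicated.
names = ["QC_1", "blank-2", "échantillon3", "QC_1", "s 4"]
print(invalid_names(names))     # ['échantillon3']
print(duplicated_names(names))  # ['QC_1']
```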


Upload your files to Galaxy:

• either by using the upload icon and dragging and dropping the file
• or with Get Data / Upload File


Check that your data have been


Open the "Univariate" module and select your 3 files of interest.


Select:

• the factor of interest (name of the corresponding column of the sampleMetadata.tsv file)
• the test to be performed
• the correction for multiple testing
• the significance threshold

and launch the computation.


Notes: tests available

The choice of the test depends on:

• whether your factor of interest is quantitative or qualitative (and, in the latter case, whether the number of levels is 2 or > 2)
• whether you wish to perform parametric or non-parametric testing: non-parametric tests do not assume that the values are normally distributed; they can be useful for skewed distributions or small numbers of samples; their power is, however, lower than that of their parametric counterparts

Factor of interest         Parametric              Non-parametric
Qualitative, 2 levels      Student's t test        Wilcoxon test
Qualitative, > 2 levels    Analysis of Variance    Kruskal-Wallis test
Quantitative               Pearson correlation     Spearman correlation
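The decision table can also be written as a small function. This is only a Python illustration of the table; the function name `choose_test` and its parameters are hypothetical and do not correspond to options of the W4M module.

```python
def choose_test(factor_is_quantitative, n_levels=None, parametric=True):
    """Return the name of the test suggested by the table above."""
    if factor_is_quantitative:
        return "Pearson correlation" if parametric else "Spearman correlation"
    if n_levels is None or n_levels < 2:
        raise ValueError("a qualitative factor needs at least 2 levels")
    if n_levels == 2:
        return "Student's t test" if parametric else "Wilcoxon test"
    return "Analysis of Variance" if parametric else "Kruskal-Wallis test"


# A qualitative factor with 3 levels, analysed without assuming normality:
print(choose_test(False, n_levels=3, parametric=False))  # Kruskal-Wallis test
```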


Notes: correction for multiple testing

The 7 methods implemented in the 'p.adjust' R function are available. The R documentation describes the methods as follows:

• Bonferroni correction ("bonferroni"), in which the p-values are multiplied by the number of comparisons.
• Less conservative corrections are also included by Holm (1979) ("holm"), Hochberg (1988) ("hochberg"), Hommel (1988) ("hommel"), Benjamini and Hochberg (1995) ("BH" or its alias "fdr"), and Benjamini and Yekutieli (2001) ("BY"), respectively.
• A pass-through option ("none") is also included.
• The first four methods are designed to give strong control of the family-wise error rate. There seems no reason to use the unmodified Bonferroni correction because it is dominated by Holm's method, which is also valid under arbitrary assumptions. Hochberg's and Hommel's methods are valid when the hypothesis tests are independent or when they are non-negatively associated (Sarkar, 1998; Sarkar and Chang, 1997). Hommel's method is more powerful than Hochberg's, but the difference is usually small and the Hochberg p-values are faster to compute. The "BH" (aka "fdr") and "BY" methods of Benjamini, Hochberg, and Yekutieli control the false discovery rate, the expected proportion of false discoveries amongst the rejected hypotheses. The false discovery rate is a less stringent condition than the family-wise error rate, so these methods are more powerful than the others.

The p-values of the test for each variable will be given after correction by the selected method.
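To make the difference between the corrections concrete, here is a minimal Python sketch of three of the methods ("bonferroni", "holm" and "BH"). It is a re-implementation written for illustration, following the formulas these methods are defined by; it is not the R code that the module calls, the function name `p_adjust` is chosen here by analogy, and the p-values are invented.

```python
def p_adjust(p_values, method="BH"):
    """Adjust p-values for multiple testing (bonferroni, holm, BH or none)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by increasing p
    adjusted = [0.0] * n
    if method == "none":
        return list(p_values)
    if method == "bonferroni":
        return [min(1.0, p * n) for p in p_values]
    if method == "holm":
        # Step-down: multiply the k-th smallest p-value by (n - k + 1),
        # then enforce monotonicity with a running maximum.
        running_max = 0.0
        for rank, i in enumerate(order):
            running_max = max(running_max, (n - rank) * p_values[i])
            adjusted[i] = min(1.0, running_max)
        return adjusted
    if method == "BH":
        # Step-up: multiply the k-th smallest p-value by n / k,
        # then enforce monotonicity with a running minimum from the largest.
        running_min = 1.0
        for rank in range(n - 1, -1, -1):
            i = order[rank]
            running_min = min(running_min, p_values[i] * n / (rank + 1))
            adjusted[i] = running_min
        return adjusted
    raise ValueError(f"unsupported method: {method}")


raw = [0.01, 0.04, 0.03, 0.005]  # invented p-values for 4 variables
for method in ("bonferroni", "holm", "BH"):
    print(method, [round(p, 4) for p in p_adjust(raw, method)])
# bonferroni [0.04, 0.16, 0.12, 0.02]
# holm [0.03, 0.06, 0.06, 0.02]
# BH [0.02, 0.04, 0.04, 0.02]
```

On these values the ordering described above is visible: Bonferroni gives the largest corrected p-values, Holm is never larger than Bonferroni, and BH (which controls the false discovery rate rather than the family-wise error rate) gives the smallest.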


Notes: significance threshold

The selected threshold will not modify the (corrected) p-values, which are returned in any case as an additional column of the variableMetadata.tsv file.

It is merely used to add another column with 0/1 values indicating which variables are below the threshold (encoded as 1), to facilitate their subsequent filtering.
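The 0/1 encoding amounts to a single comparison per variable. In the Python sketch below, the function name, the default threshold of 0.05 and the strict "<" comparison are assumptions made for illustration; the module's exact handling of p-values equal to the threshold is not stated here.

```python
def significance_flags(corrected_p_values, threshold=0.05):
    """Encode each corrected p-value as 1 if below the threshold, else 0."""
    return [1 if p < threshold else 0 for p in corrected_p_values]


corrected = [0.02, 0.04, 0.20]  # invented corrected p-values
print(significance_flags(corrected))        # [1, 1, 0]
print(significance_flags(corrected, 0.03))  # [1, 0, 0]
```

Changing the threshold changes only the flags; the list of corrected p-values is left untouched.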


Results (1/2)

The dataMatrix.tsv and sampleMetadata.tsv files have not been modified.

The variableMetadata.tsv file now contains the p-values of the Kruskal-Wallis test (corrected with the False Discovery Rate approach) and a 0/1 encoding corresponding to the threshold given as argument.


Results (2/2)

Since the p-value of the first variable is below the threshold, a pairwise Wilcoxon test has been performed to compare the three groups:

• junior vs experienced
• experienced vs senior
• junior vs senior

The corresponding p-values have been corrected for the number of pairwise tests (n = 3).

After correction, the junior vs senior p-value is above the threshold and the comparison is consequently displayed in the last column.
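The number of pairwise tests follows from the number of levels: k levels give k(k-1)/2 pairs, hence n = 3 for the three groups above. A short Python sketch (the function name is hypothetical) enumerates them:

```python
from itertools import combinations


def pairwise_comparisons(levels):
    """List every unordered pair of factor levels."""
    return list(combinations(levels, 2))


levels = ["junior", "experienced", "senior"]
pairs = pairwise_comparisons(levels)
for first, second in pairs:
    print(f"{first} vs {second}")
print(len(pairs))  # 3, i.e. k * (k - 1) / 2 with k = 3
```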

