1 statistical analysis

Biology

Chemistry

Informatics

Evaluation of sample processing protocols for the analysis of pumpkin leaf metabolites

Statistics

Goals: Compare different extraction and drying protocols to identify the “optimal” sample processing approach

Topics:
1. Data quality overview
2. Statistical comparisons
3. Power analysis

Data Quality Overview

Goal: Calculate and visualize the summary statistics for each metabolite/treatment (Use DATA: Pumpkin data 1.csv)

Calculate:
1. Mean and standard deviation (sd)
2. The percent relative standard deviation, %RSD = (sd/mean)*100

Visualize:
3. The relationship between mean and sd, and between mean and %RSD
4. Compare mean metabolite values for all treatments

Exercises:
5. Describe the relationship between analyte mean and sd, and between mean and %RSD.
6. Describe what constitutes an “optimal” method.
7. Which extraction/treatment should be chosen to process further samples?
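The calculation steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The `records` list and the metabolite and treatment names are made-up placeholders, not values from Pumpkin data 1.csv, and the long-format layout (one row per measurement) is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical long-format records: (metabolite, treatment, measured value).
# These numbers are illustrative only; they do not come from the course data.
records = [
    ("glucose", "method 1", 100.0),
    ("glucose", "method 1", 110.0),
    ("glucose", "method 1", 90.0),
    ("glucose", "method 2", 100.0),
    ("glucose", "method 2", 130.0),
    ("glucose", "method 2", 70.0),
]


def summarize(rows):
    """Return mean, sd and %RSD for every (metabolite, treatment) pair."""
    groups = defaultdict(list)
    for metabolite, treatment, value in rows:
        groups[(metabolite, treatment)].append(value)

    summary = {}
    for key, values in groups.items():
        m = mean(values)
        sd = stdev(values)  # sample sd (n - 1 in the denominator)
        summary[key] = {"mean": m, "sd": sd, "rsd": sd / m * 100}
    return summary


summary = summarize(records)
for (metabolite, treatment), stats in sorted(summary.items()):
    print(metabolite, treatment, stats)
```

Both groups in this toy data have the same mean, so the difference in %RSD reflects only the difference in spread.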

Summary statistics

Mean vs. SD

• Mean and sd are highly correlated
• Larger means have larger sd
• This effect is also called heteroscedasticity

[Figure: mean vs. SD]

Mean vs. %RSD

• %RSD is minimally correlated with the mean

Can be used as a criterion for:
• Comparing method reproducibility
• Identifying data quality

[Figure: mean vs. %RSD]

Qualities of %RSD

• %RSD (also called the coefficient of variation or CV) is the sd (variation) scaled by the mean (magnitude)
• Removes the relationship between variation and magnitude
• Provides a single value which can be used to compare the variation of a measurement among different treatments/samples
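A small sketch of why scaling by the mean removes the dependence on magnitude: if every value is multiplied by a constant, the sd grows by that constant while %RSD stays the same. The numbers below are invented for illustration.

```python
from statistics import mean, stdev


def percent_rsd(values):
    """Percent relative standard deviation: (sd / mean) * 100."""
    return stdev(values) / mean(values) * 100


# Two hypothetical analytes with the same relative spread but a
# 1000-fold difference in signal magnitude.
low_signal = [9.0, 10.0, 11.0]
high_signal = [v * 1000 for v in low_signal]

print("sd:  ", stdev(low_signal), stdev(high_signal))
print("%RSD:", percent_rsd(low_signal), percent_rsd(high_signal))
```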

Showing the mean and sd of the %RSD for all metabolites for a given treatment

Data quality

[Figure: mean vs. %RSD with regions labeled Good, Moderate, Bad, and Below LOQ (sensitivity); annotations at ~40% and ~10,000]

Selecting the “optimal” method

Optimal can be:
1. Lowest average %RSD for all measurements
2. Lowest %RSD for measurements of interest
3. Largest number of metabolites passing %RSD cutoff
4. Lowest average %RSD for all measurements passing %RSD cutoff
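As a rough sketch of strategy #4, the snippet below counts the metabolites that pass a %RSD cutoff for each method and averages the %RSD of only those metabolites. The method names and %RSD values are hypothetical, and the cutoff of 40 is just a default argument.

```python
from statistics import mean

# Hypothetical %RSD values, one per metabolite, for two methods.
rsd_by_method = {
    "method 1": [12.0, 35.0, 55.0, 80.0],
    "method 2": [10.0, 20.0, 30.0, 60.0],
}


def score_method(rsds, cutoff=40.0):
    """Return (count passing the cutoff, mean %RSD of the passing values)."""
    passing = [r for r in rsds if r <= cutoff]
    # If nothing passes, use infinity so the method ranks last.
    average = mean(passing) if passing else float("inf")
    return len(passing), average


scores = {name: score_method(rsds) for name, rsds in rsd_by_method.items()}

# Strategy #4: choose the lowest average %RSD among passing metabolites.
best = min(scores, key=lambda name: scores[name][1])

for name, (count, average) in scores.items():
    print(f"{name}: {count} passing, mean %RSD = {average:.1f}")
print("lowest average %RSD among passing metabolites:", best)
```

Reporting the count next to the average matters because a method could reach a low average with very few metabolites passing the cutoff.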

[Table: count and %RSD (mean ± sd), using strategy #4 for metabolites with %RSD ≤ 40]

Method #2 (ACN/IPA/water 3:3:2) looks optimal…

Based on Method #2

[Figures: axes Log Mean, Mean, and %RSD; annotation: %RSD ≤ 40]

Analytes with high signal and high %RSD should be further interrogated for explanations of low reproducibility

Statistical comparison of the effects of sample drying

Goal: Identify the effect of treatment (fresh/lyophilized) on the performance of Methods #3-4 (Use DATA: Pumpkin data 2.csv)

Steps:
1. Use a t-test to compare metabolite means for each treatment
2. Correct for multiple testing by calculating the false discovery rate (FDR) adjusted p-value
3. Estimate the FDR (q-value)

Visualize:
4. Relationship between p-value and FDR adjusted p-value
5. Relationship between FDR adjusted p-value and q-value
6. Box plots for highest and lowest p-value metabolites

Questions:
7. When should you use a one-sample, two-sample, or paired t-test, or ANOVA?

[Table: count and %RSD (mean ± sd)]

*return to 0-introduction

Hypothesis Testing Strategies

• One-sample t-test compares a sample mean to a known or hypothesized population mean
• Two-sample t-test compares 2 independent populations
• Paired t-test compares two sets of measurements on the same population (e.g., an intervention or repeated measures)
• One-way ANOVA (analysis of variance) compares n populations for one factor
• Two-way ANOVA compares n populations for 2 factors
• ANCOVA (analysis of covariance) adjusts n populations for a covariate (typically continuous) prior to testing for n factors
• Mixed effects models are a versatile analogue to the linear model or ANOVA/ANCOVA, typically used to adjust for covariates or for variance due to repeated measures

*All of the above are parametric tests, some of which have non-parametric analogues
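To make the difference between a two-sample and a paired test concrete, here is a sketch that computes both t statistics on the same invented data. The two-sample version uses the Welch form (no equal-variance assumption), which is a choice made for this example rather than something the slides specify. Only the test statistics are computed; turning them into p-values requires the t distribution, for which a library such as SciPy (`scipy.stats.ttest_ind`, `scipy.stats.ttest_rel`) is one option.

```python
from math import sqrt
from statistics import mean, stdev, variance


def welch_t(a, b):
    """Two-sample t statistic for independent groups (Welch form)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def paired_t(a, b):
    """Paired t statistic: a one-sample test on the per-pair differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical values for one metabolite measured on the same four
# samples before (fresh) and after drying. Illustrative numbers only.
fresh = [10.0, 20.0, 30.0, 40.0]
dried = [11.0, 22.0, 32.0, 41.0]

print("two-sample (Welch) t:", round(welch_t(fresh, dried), 3))
print("paired t:            ", round(paired_t(fresh, dried), 3))
```

In this toy data the variation between samples is large compared with the shift caused by drying, so the paired statistic is much larger in magnitude: pairing removes the between-sample variation from the comparison.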

p-value vs. FDR adjusted p-value

[Figure: FDR adjusted p-value vs. p-value]

Benjamini & Hochberg (1995) (“BH”)
• Accepted standard

Bonferroni
• Very conservative
• adjusted p-value = p-value * # of tests (e.g. 0.005 * 148 = 0.74)
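Both corrections are short enough to write out. The sketch below implements the Bonferroni adjustment and the Benjamini-Hochberg step-up adjustment in plain Python; the list `pvals` is invented for illustration.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR adjusted p-values, in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
print("Bonferroni:", bonferroni(pvals))
print("BH:        ", benjamini_hochberg(pvals))

# The slide's example: a p-value of 0.005 among 148 tests.
print("0.005 with 148 tests:", min(1.0, 0.005 * 148))
```

With these four values the BH adjusted p-values all stay at or below 0.04, while Bonferroni pushes two of them above 0.05, which illustrates why Bonferroni is described as very conservative.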

p-value vs. q-value

[Figure: FDR adjusted p-value vs. q-value]

• The q-value can be used to select an appropriate p-value cutoff for an acceptable FDR when multiple hypotheses are tested
• q = 0.05 closely matches the assumptions of p = 0.05 when multiple hypotheses are tested
• A q-value ≤ 0.2 can be acceptable

Change in metabolites due to treatment

[Figure: panels labeled by effect size: small, large]

Effect of drying is minimal

[Figure: -log p-value vs. fold change (relative to fresh); threshold marked at FDR p-value = 0.05]

7 significantly different metabolites out of 148 (5%)

Power analysis

Goal: Use power analysis to plan a follow-up experiment to detect differences in metabolites due to treatment

Steps:
1. Calculate effect size and power for three metabolites
2. Given the observed effect size, calculate the number of samples needed to reach 80% power

Questions:
3. How would you take FDR into account?

Power analysis

Effect size: scaled difference in means between treatments

Power: ability to detect a difference when it exists (controls the false negative rate)

Significance level (α): probability of being wrong when spotting a difference (controls the false positive rate)
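The three quantities above can be tied together in a short sketch. It computes Cohen's d as one common definition of effect size, then estimates the samples needed per group for a two-sided, two-sample comparison using the normal approximation. Both choices are assumptions made for this example; the slides do not say which effect size definition or power formula was used, and an exact t-based calculation gives slightly larger sample sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, stdev


def cohens_d(a, b):
    """Difference in means scaled by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def samples_per_group(d, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided two-sample test.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / abs(d)) ** 2)


# Hypothetical measurements of one metabolite under two treatments.
treated = [12.0, 14.0, 16.0]
control = [10.0, 12.0, 14.0]

d = cohens_d(treated, control)
print("effect size (Cohen's d):", d)
print("samples per group for 80% power:", samples_per_group(d))
```

One simple way to reflect multiple testing in this sketch is to pass a smaller `alpha`, which raises the required sample size; this is only one possible approach to the FDR question.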

Power analysis

The minimum fold change (FC) in means that the study can detect can be calculated from the RSD and the estimated effect size needed to reach 0.8 (80%) power at the given sample size

RSD = 0.21 and effect size (EF) = 1.2

We can observe a minimum of a 38% change in means at 0.8 power (p = 0.05).
