dmitry-grapov
Biology
Chemistry
Informatics
Evaluation of sample processing protocols for the analysis of pumpkin leaf metabolites
Statistics
Goals: Compare different extraction and drying protocols to identify the “optimal” sample processing approach
Topics:
1. Data quality overview
2. Statistical comparisons
3. Power analysis
Data Quality Overview
Goal: Calculate and visualize the summary statistics for each metabolite/treatment (Use DATA: Pumpkin data 1.csv)
Calculate:
1. Mean and standard deviation (sd)
2. The percent relative standard deviation, %RSD = (sd/mean)*100
Visualize:
3. The relationships between mean and sd, and between mean and %RSD
4. Compare mean metabolite values for all treatments
Exercises:
5. Describe the relationship between analyte mean and sd, and between mean and %RSD.
6. Describe what constitutes an “optimal” method.
7. Which extraction/treatment should be chosen to process further samples?
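The calculations in steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the workflow used to produce the slides: the function name `summarize` and the replicate values are made up, and the sample (n - 1) standard deviation is assumed.

```python
import statistics

def summarize(values):
    """Return mean, sample sd, and %RSD for one metabolite in one treatment."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1), an assumption
    pct_rsd = sd / mean * 100      # %RSD = (sd/mean)*100
    return mean, sd, pct_rsd

# Hypothetical peak intensities for one metabolite across three replicates
replicates = [90.0, 100.0, 110.0]
mean, sd, pct_rsd = summarize(replicates)
print(f"mean={mean:.1f} sd={sd:.1f} %RSD={pct_rsd:.1f}")
```

Applying the same function to every metabolite/treatment combination gives the table of summary statistics that the visualizations are built on.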
Summary statistics
Mean vs. SD
• Mean and sd are highly correlated
• Larger means have larger sd
• This effect is also called heteroscedasticity
[Figure: plot with axes Mean and SD]
Mean vs. %RSD
• %RSD is minimally correlated with the mean
%RSD can be used as a criterion for:
• Comparing method reproducibility
• Identifying data quality
[Figure: plot with axes Mean and %RSD]
Qualities of %RSD
• %RSD (also called the coefficient of variation or CV) is the sd (variation) scaled by the mean (magnitude).
• Removes the relationship between variation and magnitude
• Provides a single value which can be used to compare the variation of a measurement among different treatments/samples
Showing the mean and sd of the %RSD for all metabolites for a given treatment
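A small sketch of why scaling by the mean removes the dependence on magnitude: two made-up analytes with the same relative spread but a 1000-fold difference in abundance have very different sd values and the same %RSD. The numbers and the helper `percent_rsd` are illustrative assumptions.

```python
import statistics

def percent_rsd(values):
    """%RSD (coefficient of variation) = sd / mean * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

low = [9.0, 10.0, 11.0]          # hypothetical low-abundance analyte
high = [v * 1000 for v in low]   # same relative spread at 1000x the magnitude

# sd grows with magnitude, %RSD does not
sd_ratio = statistics.stdev(high) / statistics.stdev(low)
print(f"sd ratio (high/low): {sd_ratio:.0f}")
print(f"%RSD low: {percent_rsd(low):.1f}, %RSD high: {percent_rsd(high):.1f}")
```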
Data quality
[Figure: plot with axes Mean and %RSD; regions labeled Good, Moderate, Bad, and Below LOQ (sensitivity); annotations at ~40% and ~10,000]
Selecting the “optimal” method
Optimal can be:
1. Lowest average %RSD for all measurements
2. Lowest %RSD for measurements of interest
3. Largest number of metabolites passing %RSD cutoff
4. Lowest average %RSD for all measurements passing %RSD cutoff
[Table: Count and %RSD (mean ± sd) for each method]
Using strategy #4 for metabolites %RSD ≤ 40
Method #2 (ACN/IPA/water 3:3:2) looks optimal…
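Strategy #4 with a cutoff of %RSD ≤ 40 can be sketched as follows. The method names and %RSD values are invented for illustration and do not come from the pumpkin data; the helper `summarize_passing` is a hypothetical name.

```python
import statistics

def summarize_passing(rsd_values, cutoff=40.0):
    """Count metabolites with %RSD <= cutoff and summarize their %RSD."""
    passing = [r for r in rsd_values if r <= cutoff]
    mean_rsd = statistics.mean(passing) if passing else float("nan")
    sd_rsd = statistics.stdev(passing) if len(passing) > 1 else float("nan")
    return len(passing), mean_rsd, sd_rsd

# Hypothetical per-metabolite %RSD values for two made-up methods
methods = {
    "method_A": [12.0, 35.0, 55.0, 80.0],
    "method_B": [10.0, 20.0, 30.0, 90.0],
}
summary = {name: summarize_passing(vals) for name, vals in methods.items()}

# Strategy #4: lowest average %RSD among measurements passing the cutoff.
# This assumes every method has at least one passing metabolite (no NaN means).
best = min(summary, key=lambda name: summary[name][1])

for name, (count, mean_rsd, sd_rsd) in summary.items():
    print(f"{name}: count={count} %RSD={mean_rsd:.1f} ± {sd_rsd:.1f}")
print("lowest mean %RSD among passing:", best)
```

Reporting the count alongside the mean matters: a method could score a low average %RSD simply because few metabolites pass the cutoff.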
Based on Method #2
[Figure: plots with axes Mean, Log Mean, and %RSD; annotation: %RSD ≤ 40]
Analytes with high signal and high %RSD should be further interrogated for explanations of low reproducibility
Statistical comparison of the effects of sample drying
Steps:
1. Use a t-test to compare metabolite means for each treatment
2. Correct for multiple hypothesis testing by calculating the false discovery rate (FDR) adjusted p-value
3. Estimate FDR (q-value)
Visualize:
4. Relationship between p-value and FDR adjusted p-value
5. Relationship between FDR adjusted p-value and q-value
6. Box plots for highest and lowest p-value metabolites
Questions:
7. When should you use a one-sample, two-sample or paired t-test, or ANOVA?
Goals: Identify the effect of treatment (fresh/lyophilized) on the performance of Methods #3-4 (Use DATA: Pumpkin data 2.csv)
[Table: Count and %RSD (mean ± sd)]
Hypothesis Testing Strategies
• One-sample t-test is used to compare a sample mean to a known or hypothesized value
• Two-sample t-test is used to compare 2 independent populations
• Paired t-test is used to compare the same population measured more than once (intervention, repeated measures)
• One-way ANOVA (analysis of variance) is used to compare n populations for one factor
• Two-way ANOVA is used to compare n populations for 2 factors
• ANCOVA (analysis of covariance) is used to adjust n populations for a covariate (typically continuous) prior to testing for n factors
• Mixed effects models are a versatile analogue to the linear model or ANOVA/ANCOVA and are typically used to adjust for covariates or variance due to repeated measures
*All of the above are parametric tests, some of which have non-parametric analogues
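To make the difference between the two-sample and paired t-tests concrete, the sketch below computes both test statistics for the same made-up numbers. Two assumptions are worth flagging: the two-sample version uses the Welch (unequal variance) form, which the slides do not specify, and only the t statistic and degrees of freedom are computed here. Turning them into a p-value needs a t-distribution CDF, for example from `scipy.stats`.

```python
import math
import statistics

def welch_t(a, b):
    """Two-sample t statistic (Welch, unequal variances) and degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def paired_t(a, b):
    """Paired t statistic computed on within-pair differences."""
    diffs = [x - y for x, y in zip(a, b)]
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Hypothetical intensities for one metabolite, same three samples before/after drying
fresh = [10.0, 12.0, 14.0]
dried = [11.0, 14.0, 17.0]

t_ind, df_ind = welch_t(fresh, dried)
t_pair, df_pair = paired_t(fresh, dried)
print(f"two-sample: t={t_ind:.3f}, df={df_ind:.2f}")
print(f"paired:     t={t_pair:.3f}, df={df_pair}")
```

With these numbers the paired statistic is much larger in magnitude, because pairing removes the sample-to-sample variation that the two-sample test treats as noise.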
p-value vs. FDR adjusted p-value
[Figure: plot with axes p-value and FDR adjusted p-value]
Benjamini & Hochberg (1995) (“BH”)
• Accepted standard
Bonferroni
• Very conservative
• adjusted p-value = p-value * # of tests (e.g. 0.005 * 148 = 0.74)
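Both adjustments are short enough to write out directly. The sketch below is a plain-Python illustration of the standard Bonferroni and Benjamini-Hochberg step-up procedures, with made-up p-values; capping adjusted values at 1 is a common convention that the slide does not mention.

```python
def bonferroni(pvalues):
    """Bonferroni: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """BH adjusted p-values: p * m / rank, made monotone from the largest p down."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four tests
pvals = [0.01, 0.04, 0.03, 0.005]
print("Bonferroni:", bonferroni(pvals))
print("BH:        ", benjamini_hochberg(pvals))
```

In this example BH leaves all four tests at or below 0.05, while Bonferroni keeps only two, which illustrates why Bonferroni is described as very conservative.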
p-value vs. q-value
[Figure: plot with axes FDR adjusted p-value and q-value]
• q-value can be used to select appropriate p-value cut off for an acceptable FDR for multiple hypotheses tested
• q=0.05 nicely matches assumptions of p=0.05 for multiple hypotheses tested
• q-value≤0.2 can be acceptable
Change in metabolites due to treatment
[Figure: panels arranged by effect size, from small to large]
Effect of drying is minimal
[Figure: -Log p-value plotted against fold change (relative to fresh); reference line at FDR p-value = 0.05]
7 significantly different metabolites out of 148 (5%)
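The two coordinates of each point in a plot like this can be computed as below. This is a sketch under assumptions: fold change is taken as the ratio of treatment means (dried over fresh), the logarithm is assumed to be base 10, and the intensities and p-value are made up.

```python
import math
import statistics

def volcano_point(fresh, dried, p_value):
    """Return (fold change relative to fresh, -log10 p-value) for one metabolite."""
    fold_change = statistics.mean(dried) / statistics.mean(fresh)
    return fold_change, -math.log10(p_value)  # base 10 is an assumption

# Hypothetical intensities and p-value for a single metabolite
fc, neg_log_p = volcano_point([100.0, 110.0, 90.0], [150.0, 160.0, 140.0], 0.01)
print(f"fold change={fc:.2f}, -log10(p)={neg_log_p:.1f}")

# Fraction of metabolites reported as significantly different on the slide
pct_significant = 7 / 148 * 100
print(f"{pct_significant:.1f}% of metabolites")
```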
Power analysis
Steps:
1. Calculate effect size and power for three metabolites
2. Given the observed effect size, calculate the number of samples needed to reach 80% power
Questions:
3. How would you take FDR into account?
Goals: Use power analysis to plan a follow-up experiment to detect differences in metabolites due to treatment
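Step 2 can be approximated without specialized software. The sketch below uses the normal approximation for a two-sided, two-sample comparison, n per group ≈ 2((z_{1-α/2} + z_{power})/d)², which slightly underestimates the sample size an exact t-based calculation would give. The function name and the effect sizes are illustrative assumptions, not values from the pumpkin data.

```python
import math
from statistics import NormalDist

def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate samples per group for a two-sided, two-sample comparison."""
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # critical value for significance level
    z_power = z.inv_cdf(power)             # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Hypothetical standardized effect sizes
for d in (0.5, 1.0):
    print(f"effect size {d}: {samples_per_group(d)} samples per group")

# A stricter significance level requires more samples
print("alpha=0.01:", samples_per_group(0.5, alpha=0.01))
```

One possible way to approach question 3 is to rerun the calculation with a smaller `alpha` that reflects the p-value cutoff implied by the chosen FDR, since many metabolites are tested at once.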
Power analysis
• Effect size: scaled difference in means between treatments
• Power: ability to detect a difference when it exists (control false negative rate)
• Significance level (α): probability of being wrong when spotting a difference (control false positive rate)
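One common way to express a scaled difference in means is Cohen's d, the difference in means divided by the pooled standard deviation. The slides do not say which effect size definition was used, so treat the choice of Cohen's d and the numbers below as assumptions.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(b) - statistics.mean(a)) / math.sqrt(pooled_var)

# Hypothetical intensities for one metabolite under two treatments
fresh = [10.0, 12.0, 14.0]
dried = [13.0, 15.0, 17.0]
d = cohens_d(fresh, dried)
print(f"Cohen's d = {d:.2f}")
```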
Power analysis
The minimum fold change (FC) in means observable by the study can be calculated from the RSD and the estimated effect size needed to reach 0.8 (80%) power given the sample size
RSD = 0.21 and effect size (EF) = 1.2
We can observe a minimum of a 38% change in means at 0.8 power (p = 0.05).