
Statistics for non-statisticians

Marco Pavesi Lead Statistician Liver Unit – Hospital Clínic i Provincial

Ferran Torres Statistics and Methodology Support Unit. Hospital Clínic Barcelona Biostatistics Unit. School of Medicine. Universitat Autònoma Barcelona (UAB)

Outline

• Why Statistics?

• Descriptive Statistics
  Populations and Samples
  Types of errors

• Inferential Statistics
  Hypothesis testing
  Statistical errors
  p-value
  Confidence Intervals

• Multiplicity issues. Types of tests. Sample size

• Multivariate analysis. More on p-values

• Conclusion: “little shop of horrors”

Intro. Why should we learn statistics ?

Induction and Truth. Bertrand Russell presents…

The inductivist turkey

Troubles for plain researchers:

  Induction and statistics are NOT a way to obtain a mathematical demonstration of Truth

  The results observed for a population sample are not necessarily true for the whole population

Smart turkeys / researchers…

1)  …are aware that the relevance (weight) of statistical inferences always depends on the sample size


Smart turkeys / researchers…

2)  …know that we can only model/estimate the real world with a specific approximation error.

Smart turkeys / researchers…

3)  …understand that hypotheses can never be proven true; we can only reject or retain a hypothesis based on the available evidence

What is statistics ?

•  “I know (I’m making the assumption) that these dice are fair: what is the probability of always getting a 1 in 15 runs?“

==> Probability mathematics

•  “I have got always a 1 in 15 runs. Are these dice fair ?” ==> Inferential STATISTICS

So, why statistics? To account for chance & variability!

Why is Statistics needed?

• Statistics tells us whether events are likely to have happened simply by chance

• Statistics is needed because we always work with sample observations (and their variability), never with whole populations

• Statistics is the only means of predicting what is likely to happen in new situations, and it helps us make decisions

Introduction to descriptive statistics


Population and Samples

Target Population

Study Population

Sample

[Figure: estimates from five samples (01-05) spread around the True Value]

Random vs. Systematic error

[Figure: random error scatters the sample estimates (01-05) symmetrically around the True Value, while systematic error (bias) shifts them away from it. Example: Systolic Blood Pressure (mm Hg)]

What Statistics?

• Descriptive Statistics

  Position statistics (central tendency measures): mean, median

  Dispersion statistics: variance, standard deviation, standard error

  Shape statistics: symmetry, skewness and kurtosis measures.

The mean and the median

Arithmetic mean (average): x̄ = (x1 + x2 + … + xn) / n

Median: the middle value (50% of sample individuals have a value higher than or equal to the median)

Example: 1, 3, 3, 4, 6, 13, 14, 14, 18 → Median = 6
With an extra value: 1, 3, 3, 4, 6, 13, 14, 14, 17, 18 → the two middle values are 6 and 13, so Median = (6+13)/2 = 9.5

[Figure: when a new outlier is added, the mean shifts (Mean 1 → Mean 2) while the median barely moves (Median 1 ≈ Median 2)]

•  Unlike the median, the mean is affected by outliers

•  This is especially relevant for skewed distributions (e.g., survival times)
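The slide's example can be checked with Python's statistics module (a minimal sketch; the extra value 180 is a made-up outlier):

```python
import statistics

# The slide's data: 9 values, then the same sample with 17 added
values_9 = [1, 3, 3, 4, 6, 13, 14, 14, 18]
values_10 = [1, 3, 3, 4, 6, 13, 14, 14, 17, 18]

print(statistics.median(values_9))   # middle value: 6
print(statistics.median(values_10))  # mean of the two middle values: (6+13)/2 = 9.5

# Unlike the median, the mean chases outliers:
with_outlier = values_9[:-1] + [180]  # replace 18 with an extreme value
print(statistics.mean(values_9), statistics.mean(with_outlier))      # mean jumps
print(statistics.median(values_9), statistics.median(with_outlier))  # median stays 6
```

This is why the median is preferred for skewed data such as survival times.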

Dispersion measures

•  The Variance is the mean of squared differences from the distribution mean:
   σ² = Σ(xᵢ − μ)² / N

•  The Standard Deviation is the square root of the Variance:
   σ = √σ²

•  The Standard Error of the mean is the Standard Deviation divided by the square root of the sample size:
   SE = σ / √N   (equivalently, SE² = σ² / N)

•  It estimates the standard deviation of the sample mean (or of another estimated parameter) across repeated samples
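These definitions can be verified numerically (hypothetical systolic blood pressure readings; note that SE = SD/√N, so σ²/N is the squared standard error):

```python
import math
import statistics

sbp = [130, 145, 150, 155, 170]  # hypothetical systolic BP readings (mm Hg)
n = len(sbp)

var = statistics.variance(sbp)  # sample variance (divides by n - 1)
sd = statistics.stdev(sbp)      # standard deviation = sqrt(variance)
se = sd / math.sqrt(n)          # standard error of the mean = SD / sqrt(N)

print(var, sd, se)  # the SE is always smaller than the SD
```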

Inference & tests

•  Inferential Statistics
   Draw conclusions (inferences) from incomplete (sample) data
   Allow us to make predictions about the target population based on the results observed in the sample
   Are computed in hypothesis testing

•  Examples: 95% CI, t-test, chi-square test, ANOVA, regression


Basic pattern of statistical tests

• Based on the total number of observations and the size of the test statistic, one can determine the P value.

• A test statistic expresses the signal in the data in units of noise (“how many noise units?”).

• The test statistic and the sample size (degrees of freedom) convert to a probability, the P value.

Overall hypothesis testing flow chart

Test statistic value
  ↓
Corresponding P-value (from a known distribution)
  ↓
Comparison with the significance level α (defined in advance)
  ↓
P < α → reject the null hypothesis
P >= α → keep the null hypothesis

Introduction to inferential statistics

Extrapolation

[Diagram: study results observed in the Sample are extrapolated to the Population (“conclusions”) through inferential analysis: statistical tests and confidence intervals]


Statistical Inference

•  Statistical tests => p-value
•  Confidence intervals

[Diagram: conclusions are valid only for samples likely to be drawn from the population; an unlikely (unrepresentative) sample invalidates the sample and its conclusions]


P-value

• The p-value is a “tool” to answer the question: could the observed results have occurred by chance*?

• Remember:
  –  the decision is based on the results observed in a SAMPLE…
  –  …and is extrapolated to the POPULATION

*: it accounts exclusively for random error, not bias

p < .05 → “statistically significant”

An intuitive definition

• The p-value is the probability of having observed our data when the null hypothesis is true

• Steps:
  1)  Calculate the treatment difference in the sample (A-B)
  2)  Assume that both treatments are equal (A=B), and then…
  3)  …calculate the probability of obtaining a difference of at least the observed magnitude, given assumption 2)
  4)  Conclude according to that probability:
      a.  p<0.05: the difference is unlikely to be explained by chance, so we assume the treatment explains it
      b.  p>0.05: the difference could be explained by chance, so we assume chance explains it
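The steps above can be sketched as a permutation test on hypothetical data: under step 2's assumption (A=B) the treatment labels are exchangeable, so shuffling them shows how often chance alone produces a difference at least as large as the observed one.

```python
import random
import statistics

random.seed(1)

# Hypothetical outcomes under treatments A and B
a = [5.1, 6.3, 7.0, 6.8, 5.9, 7.4]
b = [4.2, 5.0, 5.5, 6.1, 4.8, 5.2]

observed = statistics.mean(a) - statistics.mean(b)  # step 1: difference in the sample

pooled = a + b
count = 0
n_perm = 10_000
for _ in range(n_perm):
    random.shuffle(pooled)  # step 2: if A = B, the labels are arbitrary
    diff = statistics.mean(pooled[:6]) - statistics.mean(pooled[6:])
    if abs(diff) >= abs(observed):  # step 3: at least the observed magnitude
        count += 1

p_value = count / n_perm  # step 4: conclude from this probability
print(observed, p_value)
```

With these made-up numbers the estimated p-value comes out below 0.05, so step 4a would apply.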

HYPOTHESIS TESTING

• Testing two hypotheses   H0: A=B (Null hypothesis – no difference)   H1: A≠B (Alternative hypothesis)

• Calculate test statistic based on the assumption that H0 is true (i.e. there is no real difference)

• Test will give us a p-value: how likely are the collected data if H0 is true

• If this is unlikely (small p-value), we reject H0

RCT from a statistical point of view

[Diagram: Randomisation splits 1 homogeneous population into Treatment A and Treatment B (control); the statistical question is whether the data after treatment reflect 2 distinct populations]


RCT

[Diagram: results from the RCT sample are extrapolated to the population]

Statistical significance/Confidence

•  A>B p<0.05 means:

•  “I can conclude that the higher values observed with treatment A vs. treatment B are linked to the treatment rather than to chance, with a risk of error of less than 5%” ?

Factors influencing statistical significance

•  Signal → the Difference (effect size)

•  Noise (background) → the Variance (SD)

•  Quantity → the Quantity of data (sample size)

P-value

• A “very low” p-value does NOT imply:

 Clinical relevance (NO!!!)

 Magnitude of the treatment effect (NO!!)

With ↑n or ↓variability ⇒ ↓p

• Please never compare p-values!! (NO!!!)

P-value

• A “statistically significant” result (p<.05) tells us NOTHING about clinical or scientific importance; only that the results were unlikely to be due to chance.

• A p-value accounts only for random error; it does NOT account for bias.

THE BASIC IDEA

• Statistics can never PROVE anything beyond any doubt, just beyond reasonable doubt!!

• … because of working with samples and random error


Type I & II Error & Power

• Type I Error (α)
  False positive
  Rejecting the null hypothesis when in fact it is true
  Standard: α=0.05
  In words: the chance of finding statistical significance when in fact there truly was no effect

• Type II Error (β)
  False negative
  Accepting the null hypothesis when in fact the alternative is true
  Standard: β=0.20 or 0.10
  In words: the chance of not finding statistical significance when in fact there was an effect

Type I & II Error & Power

• Power
  = 1 - Type II Error (β)
  Usually expressed as a percentage: 80% or 90% (for β=0.2 or β=0.1, respectively)
  In words: the chance of finding statistical significance when in fact there is an effect

95%CI

• Better than p-values…
  …use the data collected in the trial to give an estimate of the treatment effect size, together with a measure of how certain we are of our estimate

• A CI is a range of values within which the “true” treatment effect is believed to lie, with a given level of confidence.
  A 95% CI is produced by a procedure that captures the “true” treatment effect in 95% of repeated studies

• Generally, the 95% CI is calculated as
  Sample Estimate ± 1.96 × Standard Error

Interval Estimation

Superiority study

[Figure: possible positions of the 95% CI for the difference d, relative to d = 0 (no difference); d > 0 means the Test arm is better, d < 0 means the Control arm is better]



Multiplicity

Lancet 2005; 365: 1591–95

Multiplicity can arise in a trial’s design, conduct, and results.


Interim Analyses in the CDP

(Month 0 = March 1966, Month 100 = July 1974) Coronary Drug Project Mortality Surveillance Circulation. 1973;47:I-1 http://clinicaltrials.gov/ct/show/NCT00000483;jsessionid=C4EA2EA9C3351138F8CAB6AFB723820A?order=23

Lancet 2005; 365: 1657–61

Sample Size


Sample Size

•  The planned number of participants is calculated on the basis of:

  Expected effect of treatment(s)

  Variability of the chosen endpoint

  Accepted risks in conclusion

↗  effect  ↘  number  

↗  variability  ↗  number  

↗  risk  ↘  number  


Normal vs. Skewed Distributions

• Parametric statistical tests can be used to assess variables with a “normal”, symmetrical bell-shaped distribution (histogram).

• Nonparametric statistical tests can be used to assess variables that are skewed or non-normal.

• Formal inferential tests of normality exist, but simply looking at a histogram is usually enough to decide.

Examples of Normal and Skewed


Parametric vs. Nonparametric

  Parametric                              Nonparametric counterpart
  Student’s t-test                        Mann-Whitney U test
  One-way ANOVA                           Kruskal-Wallis test
  Paired t-test                           Wilcoxon signed-rank test
  Pearson correlation                     Spearman’s r
  Repeated-measures ANOVA (F ratio)       Friedman ANOVA

The type of inferential test depends on the data

•  Repeated measures?
   Unmatched groups (different subjects of the population in each condition) → independent (unpaired) data
   Matched groups (the same individuals in each condition) → dependent (paired) data

•  Type of data
   Continuous Gaussian (metric) → mean, SD, …
   Continuous non-Gaussian or ordinal (ranks: 1, 2, 3, …, 10) → median, interquartile range
   Nominal categories (e.g., 49% “yes”, 33% “no”, 18% “no opinion”) → frequencies and percentages

[Test-selection charts: qualitative dependent variable; quantitative variable with independent (unpaired) data; quantitative variable with dependent (paired) data]


• http://statpages.org/
• http://www.microsiris.com/Statistical%20Decision%20Tree/
• http://www.socialresearchmethods.net/selstat/ssstart.htm
• http://www.wadsworth.com/psychology_d/templates/student_resources/workshops/stat_workshp/chose_stat/chose_stat_01.html
• http://www.graphpad.com/www/Book/Choose.htm

A Good Rule to Follow

• Always check your results with a nonparametric test (a sensitivity analysis)

• If you test your null hypothesis with a Student’s t-test, also check it with a Mann-Whitney U test.

• It will only take an extra 25 seconds.

• Use common sense and prior knowledge!!
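A sketch of this rule with made-up data; it uses SciPy (not in the standard library), which provides both tests named above:

```python
# Run the parametric test and its nonparametric counterpart on the same data
from scipy import stats

# Hypothetical measurements in two independent groups
a = [12.1, 14.3, 13.8, 15.0, 12.9, 14.6, 13.2]
b = [11.0, 12.2, 11.8, 13.1, 10.9, 12.5, 11.4]

t_stat, p_t = stats.ttest_ind(a, b)  # Student's t-test (parametric)
u_stat, p_u = stats.mannwhitneyu(a, b, alternative="two-sided")  # Mann-Whitney U

print(f"t-test p = {p_t:.4f}, Mann-Whitney p = {p_u:.4f}")
```

If the two p-values lead to different conclusions, look at the distributions (histogram, outliers) before trusting either result.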

Multivariate statistics: why and when ?

Marco Pavesi Lead Statistician

Liver Unit – Hospital Clínic i Provincial Barcelona

2 or 3 more things on p-values

• P-values only depend on the magnitude of the test statistic computed based on observed (sample) data.

• They are related to the evidence against the null hypothesis and tell us how comfortable we should feel when we reject it.

• They are not related in any way to the clinical relevance of the “signal” (or effect, or difference, or whatever result) observed !!

Clinical study design chart

Any intervention applied & studied?

  YES → EXPERIMENTAL STUDY (e.g., Randomized Clinical Trial)

  NO → observational study: repeated measurements taken?

    YES → PROSPECTIVE/LONGITUDINAL STUDY (e.g., cohort and case-control designs)

    NO → CROSS-SECTIONAL STUDY

Randomization

1.  Eliminates assignment bias

2.  Tends to produce comparable groups for known and unknown, recorded and unrecorded factors

  Design                                      Sources of imbalance
  Randomized                                  Chance
  Concurrent, non-randomized (prospective)    Chance & selection bias
  Historical, non-randomized (retrospective)  Chance, selection bias & time bias

3.  Adds validity (extrapolability) to the results of statistical tests

Reference: Byar et al (1976) NEJM


Confounding

• Without randomization there may be a lack of homogeneity between groups in the distribution of risk (or protective) factors

• A potential confounder is:
  Associated with the outcome
  Associated with the main factor studied
  Not an intermediate step in the causal pathway between factor and outcome

EXPOSURE (coffee intake) → OUTCOME (stroke)
Both are associated with the CONFOUNDING FACTOR (smoking)
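The coffee/smoking/stroke triangle can be made concrete with hypothetical counts: within each smoking stratum coffee has no effect, yet the crude (unstratified) odds ratio looks large. The Mantel-Haenszel estimator adjusts for the stratification:

```python
# Hypothetical 2x2 tables (coffee vs. stroke), stratified by smoking.
# Within each stratum the odds ratio is 1.0 (no real coffee effect),
# but smokers both drink more coffee AND have more strokes.
# Each stratum: (stroke & coffee, no stroke & coffee, stroke & no coffee, no stroke & no coffee)
strata = {
    "smokers":     (45, 45, 5, 5),   # 50% stroke risk regardless of coffee
    "non-smokers": (1, 9, 9, 81),    # 10% stroke risk regardless of coffee
}

# Crude odds ratio: collapse the table over smoking
a = sum(s[0] for s in strata.values())
b = sum(s[1] for s in strata.values())
c = sum(s[2] for s in strata.values())
d = sum(s[3] for s in strata.values())
crude_or = (a * d) / (b * c)

# Mantel-Haenszel odds ratio: adjusted for the confounder
num = sum(s[0] * s[3] / sum(s) for s in strata.values())
den = sum(s[1] * s[2] / sum(s) for s in strata.values())
mh_or = num / den

print(crude_or, mh_or)  # crude OR is far above 1; adjusted OR is exactly 1
```

The crude analysis suggests coffee multiplies the odds of stroke about five-fold; stratifying by smoking shows the association is entirely due to the confounder.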

Interactions

• Interaction = effect modification

• Different risk (effect) estimates are associated with different strata of a specific factor.

[Figure: rates of an outcome (e.g., death) associated with factor A (e.g., sex: male/female) differ across strata of factor B (e.g., age < 65 vs. age ≥ 65); the plotted rates of 7%, 10% and 20% show the effect of A changing with the stratum of B]

Multivariate analysis and statistical models

• A model is “a simplified representation (usually mathematical) used to explain the workings of a real world system or event” (Wikipedia)

• Two types of statistical models are used in clinical research/epidemiology:

  Predictive models   Explanatory models

• Both are fitted by means of multivariate analysis techniques

Predictive models

•  Used when we are interested in predicting the probability of a specific outcome or the value of a specific dependent variable

•  Focused on selection of the best subset of predictors and highest precision of estimates

•  The selection of predictors is based on their contribution to the predictive ability of the model (i.e., on p-values)

•  Ex. Framingham equations to predict the probability of developing coronary events at 10 years (http://www.framinghamheartstudy.org/risk/index.html)

Framingham predictive equation for CHD: estimated coefficients underlying the CHD prediction sheets using total cholesterol categories

  Variable                                  Men        Women
  Age, y                                    0.04826    0.33766
  Age squared, y                            n/a        -0.00268
  TC, mg/dL
    <160                                    -0.65945   -0.26138
    160-199                                 Referent   Referent
    200-239                                 0.17692    0.20771
    240-279                                 0.50539    0.24385
    >=280                                   0.65713    0.53513
  HDL-C, mg/dL
    <35                                     0.49744    0.84312
    35-44                                   0.2431     0.37796
    45-49                                   Referent   0.19785
    50-59                                   -0.05107   Referent
    >=60                                    -0.4866    -0.42951
  Blood pressure
    Optimal                                 -0.00226   -0.53363
    Normal                                  Referent   Referent
    High-normal                             0.2832     -0.06773
    Stage I hypertension                    0.52168    0.26288
    Stage II-IV hypertension                0.61859    0.46573
  Diabetes                                  0.42839    0.59626
  Smoker                                    0.52337    0.29246
  Baseline survival at 10 years, S0(10)     0.90015    0.96246
  Linear predictor at risk factor means     3.09750    9.92545

Explanatory models

• Study objective: to assess (estimate) the effect of a specific factor on the study outcome

• Multivariate analysis is aimed at getting the best (most valid) estimate of the studied effect

• Confounders must be accounted for in the model

• Evaluation of confounding variables is based on the change in model estimates, NOT ON STATISTICAL SIGNIFICANCE.

• Rule of thumb: add each potential confounder into the model one by one, and keep only those that change the estimate of the main factor by more than 10%


Adjusting for confounders: an example

Outcome variables and statistical models: a summary table

•  Continuous (normally distributed) outcome: ANOVA, ANCOVA or Linear Regression

•  Binary (YES/NO) outcome: Logistic regression

•  Categorical outcome (with a reference group): Multinomial logistic regression

•  Time-to-event outcome (different follow-up times & censored cases): Survival models (e.g., Cox PH)

•  Counts: Poisson or Negative Binomial regression

Some “take home” hints

Marco Pavesi Lead Statistician

Liver Unit – Hospital Clínic i Provincial Barcelona

The p-value…

… is the probability of a result like that observed in our sample when the null hypothesis is true in the population (i.e., simply due to chance)

…is related to the evidence against the null hypothesis and to the reliability of the observed result

…IT DOES NOT TELL US ANYTHING ABOUT THE CLINICAL RELEVANCE OF THE RESULT WE HAVE OBSERVED !!

Interpretation of a p-value

• The higher the p-value, the higher the probability of observing such a result simply by chance when H0 is true:

p = 0.75   a result like this would arise in 75% of studies (3 out of 4) where H0 is true

p = 0.015  a result like this would arise in 1.5% of studies (15 out of 1,000) where H0 is true

•  A “small” p-value threshold (the significance level) is established conventionally as the highest rate of false-positive results that we consider acceptable (for instance, the common 5% rate)

Evidence and p-value: an example (1)

Drug A. Efficacy rate: 22% Drug B. Efficacy rate: 11%

…observed results:

Drug A. Efficacy rate: 2 / 9 Drug B. Efficacy rate: 1 / 9

P-value = 0.98


Evidence and p-value: an example (2)

Drug A. Efficacy rate: 22% Drug B. Efficacy rate: 11%

…observed results:

Drug A. Efficacy rate: 35 / 154 Drug B. Efficacy rate: 18 / 158

P-value = 0.008
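The two examples can be approximated with a two-proportion z-test (standard library only). The slides' p-values presumably come from a different test, but the qualitative message is the same: identical rates, very different evidence depending on sample size.

```python
import math

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value for comparing two proportions (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))  # 2 * (1 - Phi(z))

p_small = two_prop_p(2, 9, 1, 9)        # 22% vs 11% with 9 patients per arm
p_large = two_prop_p(35, 154, 18, 158)  # similar rates with ~150 per arm
print(p_small, p_large)
```

The small trial gives no evidence against H0; the large trial, with the same observed rates, gives strong evidence.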

Evidence and p-value: an example (3)

….on the other hand…

Drug A. Known efficacy rate: 50% Drug B. Expected efficacy rate: 52%

Δ=2%; Type I error: 0.05; Type II error: 0.20 → N (per arm): 9,806
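The N per arm figure can be reproduced with the standard two-proportion sample-size formula (a stdlib-only sketch; `NormalDist.inv_cdf` supplies the normal quantiles, and the exact result may differ from the slide by a unit or two depending on rounding):

```python
import math
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for comparing two proportions (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

n_tiny_effect = n_per_arm(0.50, 0.52)  # the slide's 2% difference
n_big_effect = n_per_arm(0.50, 0.70)   # a 20% difference, for contrast
print(n_tiny_effect, n_big_effect)
```

Detecting a 2% difference needs nearly ten thousand patients per arm; a 20% difference needs fewer than one hundred.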

Conclusion: little shop of horrors (1)

•  “No significant difference is observed between the treatment arms. Conclusion: the treatments are equally effective…”

•  “Absence of evidence is not evidence of absence” (Altman DG, Bland JM. BMJ 1995;311:485)

…AAAAAARGH !!!

Conclusion: little shop of horrors (2)

• “The p-value of the comparison A vs. Placebo is lower than the p-value for the comparison B vs. Placebo. Conclusion: treatment A is better than B…”

• The p-value gives us a measure of the evidence against that specific null hypothesis in that specific hypothesis test.

…AAAAAARGH !!!

Conclusion: little shop of horrors (3)

• A clinician speaking to the poor, helpless statistician: “Can we just test variable A vs. the rest of variables and check if some difference is significant…?”

• Type I error increases exponentially together with the number of hypothesis tests performed:

1 test: Type I error = 5% … 5 tests: Type I error > 20%

…AAAAAARGH !!!
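The “5 tests → Type I error > 20%” figure assumes independent tests, each at α = 0.05: the familywise error rate is 1 − (1 − α)^k.

```python
# Familywise Type I error for k independent tests, each at alpha = 0.05
alpha = 0.05
rates = {k: 1 - (1 - alpha) ** k for k in (1, 2, 5, 10, 20)}
for k, fw in rates.items():
    print(f"{k:2d} tests -> chance of at least 1 false positive: {fw:.1%}")
```

This is why uncorrected fishing expeditions across many variables almost guarantee a “significant” finding somewhere.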