In collaboration with Clinical Translational Science Center (CTSC) and
the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC).
Introduction to Biostatistics
Jie Yang, Ph.D.
Associate Professor
Department of Family, Population and Preventive Medicine
Director
Biostatistical Consulting Core
What is Biostatistics
What does a biostatistician do
• Experiment design, clinical trial design
• Descriptive and Inferential analysis
• Result interpretation
What you should bring while consulting
with a biostatistician
OUTLINE
http://www.youtube.com/watch?v=PbODigCZqL8
WHAT IS BIOSTATISTICS
• The science of biostatistics encompasses the design of
biological/clinical
experiments
the collection,
summarization, and
analysis of data from
those experiments
the interpretation of,
and inference from, the
results How to Lie with Statistics (1954) by Darrell Huff.
GOAL OF STATISTICS
Sampling
Inference
Probability
TheoryPOPULATION SAMPLE
Descriptive
Statistics
Descriptive
Statistics
Inferential StatisticsSample
Statistics: 𝑿 , 𝒔, 𝒑 ,…
Population
Parameters:
𝝁, 𝝈, 𝝅…
• Adequate sample size (statistical power)
• Random selection (representative)
PROPERTIES OF A “GOOD” SAMPLE
Sampling Techniques:
1.Simple random sampling
2.Stratified sampling
3.Systematic sampling
4.Cluster sampling
5.Convenience sampling
STUDY DESIGN
EXPERIEMENT DESIGN
Completely Randomized Design (CRD)- Randomly assign the experiment units to the treatments
Design with Blocking – dealing with nuisance factor which has
some effect on the response, but of no interest to the experimenter; Without
blocking, large unexplained error leads to less detection power.
1. Randomized Complete Block Design (RCBD) - One single blocking factor
2. Latin Square
Design (two
blocking factor)
3. Cross over Design
(each subject=blocking factor) 4. Balanced Incomplete
Block Design
EXPERIMENT DESIGN
Factorial Design: similar to randomized block design, but
allowing to test the interaction between two treatment
effects. A significant interaction between A and B tells:
• the effect of A is different at each level of B. Or the
effect of B differs at each level of A.
• it is not very sensible to even be talking about the main
effect of A and B
Experiment with random factors: randomly select n of
the possible levels of the factor of interest. Typically
random factors are categorical.
Split-plot Design: confounding a main effect with blocks
EVIDENCE PYRAMID
IMPACT Observatory: tracking the evolution of clinical trial data sharing and research integrity - Scientific Figure on
ResearchGate. Available from: https://www.researchgate.net/figure/Evidence-pyramid_fig1_309019368 [accessed 14 Jan, 2019]
WHAT CAN A STATISTICIAN HELP
DURING STUDY DESIGN PHASE
Blinding/masking and randomization
The number and combination of experimental inventions
The timing of measurements or visits
Collect information on a larger sample or on the same
sample over time
Ways to maximize the efficient use of the available
resources
Even for data management – how to code measures
and what to computerize directly affect the ease even
the feasibility of subsequent analysis
How a biostatistician analyzes data
1. Nominal data: unordered categories or classes
e.g. gender, blood type, transplant type
2. Ordinal data: order among categories is important
e.g. disease severity, AE level
3. Discrete data: both ordering and magnitude are important;
often integers or counts, no intermediate values are possible
e.g. # of accidents within a month, # of kids in a family
4. Continuous data: difference between two possible data
values can be arbitrarily small
e.g. height, weight, body temperature, serum level, BP
5. Time to event data: censoring presents
e.g. overall survival
TYPE OF DATA
DESCRIPTIVE STATISTICS
General goal is to describe the distribution of a single
variable (center, spread, shape, functional form)
Helpful for checking data and assumptions
Stratified (by group) analysis can be done for groups of
interest
Values and comparisons can be visualized and
“estimated” but descriptive statistics alone will provide no
information about our level of confidence in conclusions
• Mean: average
• Median: the 50th percentile point (median
value);
• Mode: value that occurs most frequently;
unimodal and multimodal
DESCRIPTIVE STATISTICS
1. Measure of central tendency
• Reporting a measure of center gives only partial
information about a data set.
– Example: Consider the following three datasets:
Dataset 1: 4 5 5 5 6
Dataset 2: 1 3 5 7 9
Dataset 3: 1 5 5 5 9
All the three datasets have identical means and medians.
Datasets 2&3 are more variable than the 1st one.
• It is also important to describe the spread of values
about the center.
DESCRIPTIVE STATISTICS
• Range= Max –Min
• Inter-Quartile Range (IQR)=Q3-Q1
• Variance, Sample Variance
• Standard Deviation, Sample Standard
Deviation
DESCRIPTIVE STATISTICS
2. Measure of variability
• An Outlying Value is a value, X, such that
X> Q3+ 1.5(IQR) or
X< Q1–1.5(IQR)
• An Extreme Outlying Value is a value, X, such
that
X> Q3+ 3(IQR) or
X< Q1–3(IQR)
IDENTIFYING POTENTIAL OUTLIERS
• Median and IQR are generally unaffected by the removal
of outliers but minor changes are possible.
• Mean and Standard Deviation will be affected by the
outlying values.
• Apparent shape of the distribution can also be affected by
outlying values.
• One should never simply remove data values from a
dataset.
• In practice, if the outliers are not errors, sensitivity
analysis will often be conducted or robust statistical
methods will be used.
EFFECTS OF OUTLIERS
• Summary table
• Bar/Pie chart
• Histogram
• Scatter plot
• Boxplot
1. Outlier
2. Extreme Outlier
3. Modified Boxplot
WAYS OF PRESENTING DATA
SUMMARY TABLE
side N Mean Stderr Median Min Max
left 14 18.83 6.04 18.25 8.00 30.10
right 14 18.61 5.48 17.75 8.80 28.21
1. By one variable
2. By multiple categorical variables
Location of
Tumor
Radiation
Sequence with
Surgery
Before 2002 After 2002
Lower
(n = 972)
Preoperative 107(19.21%) 65(15.66%)
Postoperative 450(80.79%) 350(84.34%)
Upper
(n = 283)
Preoperative 20(13.16%) 21(16.03%)
Postoperative 132(86.84%) 110(83.97%)
BAR CHART AND PIE CHART
8.12%
10.35%
12.76%
2.88%3.59%
4.46%
5.75%
7.44%
9.32%
0.00%
2.00%
4.00%
6.00%
8.00%
10.00%
12.00%
14.00%
AGB LSG RYGB
Bariatric surgeries, 2010-2013
ED revisit Admitted from ED Discharged from ED
Diagnosis for Cholecystectomy patients,
2006-2013
HISTOGRAM AND SCATTER PLOT
0 10 20 30 40 50 60 700
10
20
30
40
50
60
70
DISC measurement by group
Le
ft
Right Control No Surgery Gamma-knife Resection
BOX-PLOT
One continuous variable and one categorical variable
OTHER
THEORETICAL DISTRIBUTIONS
Variable Type of Outcome Theoretical Distribution
Continuous numeric Normal, Log-normal, Exponential,…
Discrete numeric Poisson, Negative Binomial,…
Binary Bernoulli, Binomial,….
Categorical with multiple categories
Multinomial,…
CONFIDENCE INTERVALS
A point estimate alone is not enough: it gives us no way to judge how accurate it is as an estimator.
A confidence interval provides a better estimate by combining the point estimate with its standard error to define a range of values that are likely to cover the true value of the parameter.
A confidence intervals starts with the point estimate and adds a “margin of error.” A confidence interval is defined as: point estimate +/- margin of error.
Population Parameters:𝝁, 𝝈, 𝝅…
Sample Statistics: 𝑿 , 𝒔, 𝒑 ,…
95% CI for μ: P(-??<µ<??)=0.95
CONFIDENCE INTERVALS
Since by central limit theorem, ),(~ 2
xxNx
95.0)96.1/
96.1(
n
xP
95.0)*96.1*96.1( n
xn
xP
95% Confidence Interval (CI) for µ:
Interpretation 1: You can be 95% sure that the true mean
(μ) will fall within the upper and lower bounds.
Interpretation 2: 95% of the intervals constructed using
sample means (x) will contain the true population mean
(μ).
100(1-α)% CI:
A good link for simulation of CI: http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html
CONFIDENCE INTERVALS
nx
*96.1
nZx
2/1
The cartoon guide to Statistics by Gonick
and Smith
• Using data to test specific hypotheses
• Making decisions based on probability
(instead of subjective impressions)
• Distribution is usually assumed
• Methods that require no distributional
assumptions are called non-parametric or
distribution free
HYPOTHESIS TESTING
• A well formulated hypothesis will be both
quantifiable and testable, that is, involve
measurable quantities or refer to items that may
be assigned to mutually exclusive categories.
• Takes one of two forms:
“ Some measurable characteristic of a population takes
one of a specific set of values”
“ Some measurable characteristic takes different values in
different populations, the difference has a specific pattern
or a specific set of values”
WHAT IS A HYPOTHESIS
EXAMPLES
This new drug will lower diastolic blood pressure.
For males over 40 suffering from chronic
hypertension, a 100mg daily dose of this new drug
will lower diastolic blood pressure an average of
10 mm Hg.
• The Null hypothesis describes some aspect of the
statistical behavior of a set of data and is denoted
H0
• This description is treated as valid unless the
actual behavior of the data contradicts this
assumption
• The Alternative Hypothesis is generally the
“opposite” of the null hypothesis and is denoted
H1
BASIC DEFINITIONS AND
NOTATION
• The null hypothesis is usually of the form
HYPOTHESIS
• The alternative will take on one of the following forms
REJECTION REGION IN
HYPOTHESIS TESTING
ERRORS IN HYPOTHESIS TESTING
The facts Decisions
No Difference Drug is Better
No Difference Correct Type I error: Manufacturer wastes money
developing an ineffective drugs
Drug is better Type II error: Manufacturer misses
opportunity for profit; Public denied access to
effective treatment
Correct
There is a trade off between type I error, α, and type II error, β
P-VALUE
Definition: the probability of obtaining a test statistic as extreme as or
more extreme than the actual test statistic obtained, given that the null
hypothesis is true.
Other explanation: the α level at which we would be indifferent between
accepting or rejecting H0 given the sample data at hand or the α level at
which the given value of the test statistic is on the borderline between the
acceptance and rejection region
1. State null (H0) and alternative (H1) hypotheses
2. Choose a significance level, α (usually 0.05 or
0.01)
3. Based on the sample, calculate the test statistic
and calculate p-value based on theoretical
distribution
4. Compare p-value with the significance level
5. Make a decision, and state the conclusion
BASIC STEPS IN HYPOTHESIS TESTING
• If we wish to conduct a two-sided test of a hypothesis regarding a population parameter with significance level α,
we can do this by constructing a 100(1-α)% confidence
interval and checking to see if the hypothesized value is in
the interval
• In this manner, CIs can be used to conduct Two-Sided
Hypothesis Tests
RELATIONSHIP BETWEEN HT
AND CI
Variable Group Estimate95% Confidence
IntervalP-value
absolute
difference
Control vs. No surgery 0.421 0.239-0.741 0.0036
Control vs. Surgery 0.276 0.158-0.482 <0.0001
No surgery vs. Surgery 0.655 -0.372-1.156 0.1400
Gamma-knife vs. Resection 0.420 0.190-0.925 0.0322
• Usually we assume an underlying distribution and
the methods used have been based on these
assumptions
• Such methods are parametric statistical methods
since the parametric form of the distribution is
assumed to be known
• If these assumptions are not reasonable and/or
the central limit theorem cannot be applied,
nonparametric procedures should be used
NONPARAMETRIC TESTS
• These tests can be used in situations where the
data are ordinal or even binary (Yes/No)
• For quantitative data, if there are extreme values,
ranking mediates these values
• In these situations, assumptions may be violated
for parametric tests such as t-test, especially for
small samples and non-parametric methods are
more appropriate
NONPARAMETRIC TESTS
SUMMARY OF COMMON
NONPARAMETRIC TESTS
Purpose of test
Normal theory-based
test
Corresponding
nonparametric test
To study the central tendency
of a single sample One sample t-test
sign test; Wilcoxon
signed rank test
To compare central
tendencies of two
independent samples
t-test for two independent
samples
Wilcoxon rank-sum test
(Mann-Whitney U test)
To examine a set of
differences paired t-test
Wilcoxon signed-rank
test
To assess the linear
association between two
variables
Pearson correlation
coefficient
Spearman rank
correlation coefficient
To compare three or more
groups
one way analysis of
variance (ANOVA) Kruskal-Wallis test
The cartoon guide to Statistics by Gonick
and Smith
43
• Sample size: N– ↑ N → power ↑
• Significance level: α– ↑ α → power ↑
• Effect size: δ– ↑ δ → power ↑
• Variation in the continuous outcome: σ2
– ↓ σ2 → power ↑• One-tailed vs. two-tailed tests
– Power is greater in one-tailed tests than in comparable two-tailed tests
POWER IS AFFECTED BY
44
• Variables of interest – type of data e.g. continuous, categorical
• Desired power
• Desired significance level
• Effect/difference of clinical importance
• Standard deviations of continuous outcome variables
• One or two-sided tests
SAMPLE SIZE FORMULA BASICS
45
Depends on study designNot hard, but can be VERY algebra intensive
WHILE CONSULTING WITH A
BIOSTATISTICIAN
Written summary materials are preferred: readable & organized
o Background information about the problem
o A proposal, protocol, or statement of work
o Schematics, such as diagram or flow-chart
o Information about any existing database
o prior studies, pilot data, tests, published and in-house reports
relating to the problem
Make sure the biostatistician understand your needs - Avoid having a
good solution to a wrong question
Any issues in data collection (e.g. missing data) or deviations from the
study protocol (e. g. randomization before baseline tests verified
eligibility).
The biostatistician should be a coauthor given substantive input from the biostatistician. But always get explicit consent first.
WHEN TO CONSULT A
BIOSTATISTICIAN
Most effective way: to include a biostatistician from the
very beginning of a research project.
After data have been collected: • Bring complete, detailed description of the study design and conduct
• Bring clear exposition of the questions to be addressed
• Note: research questions may not be answered by collected data
Once the data are analyzed:• Checking if conclusions fit the analysis results
• Suggesting best ways to describe and display the data
• Assuring no erroneous or incomplete statements about the findings
• Sometimes data re-analysis may be needed
When You Consult a Statistician... What to Expect (2007) Berman N, Gansky S, Guillon C, Loughin T, Sanchez M (2003)
Please check our website for future lectures
https://osa.stonybrookmedicine.edu/research-core-facilities/bcc/education
THANK YOU!