
Introduction to Basic Statistical Methodology

(with an emphasis on biomedical applications, using R)

– Lecture Notes, Spring 2014 –

Ismor Fischer, Ph.D.

“Statistics” plays a role in determining whether sources of variation between groups and/or within groups are simply the effects of random

chance, or are attributable to genuine, nonrandom differences.

Cover figures, illustrating natural variation:

• Variation in littermates
• Orchids of the genus Phalaenopsis
• “Darwin’s finches”
• Highly variable Kallima paralekta (Malayan Dead Leaf Butterfly)
• Biodiversity in Homo sapiens
• Variability in female forms of Papilio dardanus (Mocker Swallowtail), Madagascar. The tailed specimen is male, but the tailless female morphs are mimics of different poisonous butterfly species.
• Australian boulder opals
• X ~ N(µ, σ)


To the memory of my late wife,

~ Carla Michele Blum ~

the sweetest and wisest person I ever met,

taken far too young...

-

Mar 21, 1956 – July 6, 2006


Introduction to Basic Statistical Methods

Note: Underlined headings are active webpage links!

0. Course Preliminaries
   Course Description
   A Brief Overview of Statistics

1. Introduction
   1.1 Motivation: Examples and Applications
   1.2 The Classical Scientific Method and Statistical Inference
   1.3 Definitions and Examples
   1.4 Some Important Study Designs in Medical Research
   1.5 Problems

2. Exploratory Data Analysis and Descriptive Statistics
   2.1 Examples of Random Variables and Associated Data Types
   2.2 Graphical Displays of Sample Data
      • Dotplots, Stemplots, …
      • Histograms: Absolute Frequency, Relative Frequency, Density
   2.3 Summary Statistics
      • Measures of Center: Mode, Median, Mean, … (+ Shapes of Distributions)
      • Measures of Spread: Range, Quartiles, Variance, Standard Deviation, …
   2.4 Summary: Parameters vs. Statistics, Expected Values, Bias, Chebyshev’s Inequality
   2.5 Problems

3. Theory of Probability
   3.1 Basic Ideas, Definitions, and Properties
   3.2 Conditional Probability and Independent Events (with Applications)
   3.3 Bayes’ Formula
   3.4 Applications
      • Diagnostic: Sensitivity, Specificity, Predictive Power, ROC curves
      • Epidemiological: Odds Ratios, Relative Risk
   3.5 Problems

4. Classical Probability Distributions
   4.1 Discrete Models: Binomial Distribution, Poisson Distribution, …
   4.2 Continuous Models: Normal Distribution, …
   4.3 Problems

5. Sampling Distributions and the Central Limit Theorem
   5.1 Motivation
   5.2 Formal Statement and Examples
   5.3 Problems

6. Statistical Inference and Hypothesis Testing
   6.1 One Sample
      6.1.1 Mean (Z- and t-tests, Type I and II Error, Power & Sample Size)
      6.1.2 Variance (Chi-squared Test)
      6.1.3 Proportion (Z-test)
   6.2 Two Samples
      6.2.1 Means (Independent vs. Paired Samples, Nonparametric Tests)
      6.2.2 Variances (F-test, Levene Test)
      6.2.3 Proportions (Z-test, Chi-squared Test, McNemar Test)
         • Applications: Case-Control Studies, Test of Association and Test of Homogeneity of Odds Ratios, Mantel-Haenszel Estimate of Summary Odds Ratio
   6.3 Several Samples
      6.3.1 Proportions (Chi-squared Test)
      6.3.2 Variances (Bartlett’s Test, etc.)
      6.3.3 Means (ANOVA, F-test, Multiple Comparisons)
   6.4 Problems



7. Correlation and Regression
   7.1 Motivation
   7.2 Linear Correlation and Regression (+ Least Squares Approximation)
   7.3 Extensions of Simple Linear Regression
      • Transformations (Power, Logarithmic, …)
      • Multilinear Regression (ANOVA, Model Selection, Drug-Drug Interaction)
      • Logistic Regression (Dose-Response Curves)
   7.4 Problems

8. Survival Analysis
   8.1 Survival Functions and Hazard Functions
   8.2 Estimation: Kaplan-Meier Product-Limit Formula
   8.3 Statistical Inference: Log-Rank Test
   8.4 Linear Regression: Cox Proportional Hazards Model
   8.5 Problems

APPENDIX
   A1. Basic Reviews: Logarithms; Perms & Combos
   A2. Geometric Viewpoint: Mean and Variance; ANOVA; Least Squares Approximation
   A3. Statistical Inference: Mean, One Sample; Means & Proportions, One & Two Samples; General Parameters & FORMULA TABLES
   A4. Regression Models: Power Law Growth; Exponential Growth; Multilinear Regression; Logistic Regression; Example: Newton’s Law of Cooling
   A5. Statistical Tables: Z-distribution; t-distribution; Chi-squared distribution; F-distribution (in progress...)

Even genetically identical organisms, such as these inbred mice, can exhibit a considerable amount of variation in physical and/or behavioral characteristics, due to random epigenetic differences in their development. But statistically, how large must such differences be in order to reject random chance as their sole cause, and accept that an alternative mechanism is responsible? Source: Nature Genetics, November 2, 1999.


Ismor Fischer, 7/20/2010 i

Course Description for Introduction to Basic Statistical Methodology

Ismor Fischer, UW Dept of Statistics, UW Dept of Biostatistics and Medical Informatics

Objective: The overall goal of this course is to provide students with an overview of fundamental statistical concepts, and a practical working knowledge of the basic statistical techniques they are likely to encounter in applied research and literature review contexts, with some basic programming in R. An asterisk (*) indicates a topic only relevant to Biostatistics courses. Lecture topics include:

I. Introduction. General ideas, interpretation, and terminology: population, random sample, random variable, empirical data, etc. Describing the formal steps of the classical scientific method – hypothesis, experiment, observation, analysis and conclusion – to determine if sources of variation in a system are genuinely significant or due to random chance effects. General study design considerations: prospective (e.g., randomized clinical trials, cohort studies) versus retrospective (e.g., case-control studies).*

II. Exploratory Data Analysis and Descriptive Statistics. Classification of data: numerical (continuous, discrete) and categorical (nominal – including binary – and ordinal). Graphical displays of data: tables, histograms, stemplots, boxplots, etc. Summary Statistics: measures of center (sample mean, median, mode), measures of spread (sample range, variance, standard deviation, quantiles), etc., of both grouped and ungrouped data. Distributional summary using Chebyshev’s Inequality.

III. Probability Theory. Basic definitions: experiment, outcomes, sample space, events, probability. Basic operations on events and their probabilities, including conditional probability, independent events. Specialized concepts include diagnostic tests (sensitivity and specificity, Bayes’ Theorem, ROC curves), relative risk and odds ratios in case-control studies.*

IV. Probability Distributions and Densities. Probability tables, probability histograms and probability distributions corresponding to discrete random variables, with emphasis on the classical Binomial and Poisson models. Probability densities and probability distributions corresponding to continuous random variables, with emphasis on the classical Normal (a.k.a. Gaussian) model.

V. Sampling Distributions and the Central Limit Theorem. Motivation, formal statement, and examples.

VI. Statistical Inference. Formulation of null and alternative hypotheses, and associated Type I and Type II errors. One- and two-sided hypothesis testing methods for population parameters – mostly, means and proportions – for one sample or two samples (independent or dependent), large (Z-test) or small (t-test). Light treatment of hypothesis testing for population variances (χ2-test for one, F-test for two). Specifically, for a specified significance level, calculation of confidence intervals, acceptance/rejection regions, and p-values, and their application and interpretation. Power and sample size calculations. Brief discussion of nonparametric (Wilcoxon) tests. Multiple comparisons: ANOVA tables for means, χ2 and McNemar tests on contingency tables for proportions. Mantel-Haenszel Method for multiple 2 × 2 tables (i.e., Test of Homogeneity → Summary Odds Ratio → Test of Association).*

VII. Linear Regression. Plots of scattergrams of bivariate numerical data, computation of sample correlation coefficient r, and associated inference. Calculation and applications of corresponding least squares regression line, and associated inferences. Evaluation of fit via coefficient of determination r2 and residual plot. Additional topics include: transformations (logarithmic and others), logistic regression (e.g., dose-response curves), multilinear regression (including a brief discussion of drug-drug interaction*, ANOVA formulation and model selection techniques).

VIII. Survival Analysis.* Survival curves, hazard functions, Kaplan-Meier Product-Limit Estimator, Log-Rank Test, Cox Proportional Hazards Regression Model.


Ismor Fischer, 1/4/2011 ii

In complex dynamic systems such as biological organisms, how is it possible to distinguish genuine – or “statistically significant” – sources of variation, from purely “random chance” effects? Why is it important to do so? Consider the following three experimental scenarios…

• In a clinical trial designed to test the efficacy of a new drug, participants are randomized to either a control arm (e.g., a standard drug or placebo) or a treatment arm, and carefully monitored over time. After the study ends, the two groups are then compared to determine if the differences between them are “statistically significant” or not.

• In a longitudinal study of a cohort of individuals, the strength of association between a disease such as COPD (Chronic Obstructive Pulmonary Disease) or lung cancer, and exposure to a potential risk factor such as smoking, is estimated and determined to be “statistically significant.”

• By formulating an explicit mathematical model, an investigator wishes to describe how much variation in a response variable, such as mean survival time after disease diagnosis in a group of individuals, can be deterministically explained in terms of one or more “statistically significant” predictor variables with which it is correlated.

This first course is an introduction to the basic but powerful techniques of statistical analysis – techniques which formally implement the fundamental principles of the classical scientific method – in the general context of biomedical applications. How to:

1. formulate a hypothesis about some characteristic of a variable quantity measured on a population (e.g., mean cholesterol level, proportion of treated patients who improve),

2. classify different designs of experiment that generate appropriate sample data (e.g., randomized clinical trials, cohort studies, case-control studies),

3. investigate ways to explore, describe and summarize the resulting empirical observations (e.g., visual displays, numerical statistics),

4. conduct a rigorous statistical analysis (e.g., by comparing the empirical results with a known reference obtained from Probability Theory), and finally,

5. infer a conclusion (i.e., whether or not the original hypothesis is rejected) and corresponding interpretation (e.g., whether or not there exists a genuine “treatment effect”).

These important biostatistical techniques form a major component in much of the currently active research that is conducted in the health sciences, such as the design of safe and effective pharmaceuticals and medical devices, epidemiological studies, patient surveys, and many other applications. Lecture topics and exams will include material on:

Exploratory Data Analysis of Random Samples

Probability Theory and Classical Population Distributions

Statistical Inference and Hypothesis Testing

Regression Models

Survival Analysis


Ismor Fischer, 1/4/2011 iii

A Brief Overview of Statistics

Statistics is a quantitative discipline that allows objective general statements to be made about a population of units (e.g., people from Wisconsin), from specific data, either numerical (e.g., weight in pounds) or categorical (e.g., overweight / normal weight / underweight), taken from a random sample. It parallels and implements the fundamental steps of the classical scientific method: (1) the formulation of a testable null hypothesis for the population, (2) the design of an experiment specifically designed to test this hypothesis, (3) the performance of which results in empirical observations, (4) subsequent analysis and interpretation of the generated data set, and finally, (5) conclusion about the hypothesis.

Specifically, a reproducible scientific study requires an explicit measurable quantity, known as a random variable (e.g., IQ, annual income, cholesterol level, etc.), for the population. This variable has some ideal probability distribution of values in the population, for example, a bell curve (see figure), which in turn has certain population characteristics, a.k.a. parameters, such as a numerical “center” and “spread.” A null hypothesis typically conjectures a fixed numerical value (or sometimes, just a largest or smallest numerical bound) for a specific parameter of that distribution. (In this example, its “center” – as measured by the population mean IQ – is hypothesized to be 100.)

After being visually displayed by any of several methods (e.g., a histogram; see figure), empirical data can then be numerically “summarized” via sample characteristics, a.k.a. statistics, that estimate these parameters without bias. (Here, the sample mean IQ is calculated to be 117.)

Finally, in a process known as statistical inference, the original null hypothesis is either rejected or retained, based on whether or not the difference between these two values (117 − 100 = 17) is statistically significant at some pre-specified significance level (say, a 5% Type I error rate). If this difference is “not significant” – i.e., is due to random chance variation alone – then the data tend to support the null hypothesis. However, if the difference is “significant” – i.e., genuine, not due to random chance variation alone – then the data tend to refute the null hypothesis, and it is rejected in favor of a complementary alternative hypothesis.

Formally, this decision is reached via the computation of any or all of three closely related quantities:

1) Confidence Interval = the observed sample statistic (117), plus or minus a margin of error. This interval is so constructed as to contain the hypothesized parameter value (100) with a pre-specified high probability (say, 95%), the confidence level. If it does not, then the null is rejected.

2) Acceptance Region = the hypothesized parameter value (100), plus or minus a margin of error. This is constructed to contain the sample statistic (117), again at a pre-specified confidence level (say, 95%). If it does not, then the null hypothesis is rejected.

3) p-value = a measure of how probable it is to obtain the observed sample statistic (117) or worse, assuming that the null hypothesis is true, i.e., that the conjectured value (100) is really the true value of the parameter. (Thus, the smaller the p-value, the less probable that the sample data support the null hypothesis.) This “tail probability” (0%-100%) is formally calculated using a test statistic, and compared with the significance level (see above) to arrive at a decision about the null hypothesis.
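To make these three quantities concrete, here is a minimal R sketch of a one-sample test of H0: µ = 100 for the IQ example above. The sample size n = 50, the standard deviation, and the simulated data are illustrative assumptions, not values taken from these notes.

set.seed(1)
iq <- rnorm(50, mean = 117, sd = 15)                 # hypothetical sample of 50 IQ scores (assumed values)

result <- t.test(iq, mu = 100, conf.level = 0.95)    # one-sample t-test of H0: mu = 100

result$conf.int    # 1) confidence interval: reject H0 if it does not contain 100
result$p.value     # 3) p-value: reject H0 if it falls below the significance level (say, 0.05)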

Moreover, an attempt is sometimes made to formulate a mathematical model of a desired population response variable (e.g., lung cancer) in terms of one or more predictor (or explanatory) variables (e.g., smoking) with which it has some nonzero correlation, using sample data. Regression techniques can be used to calculate such a model, as well as to test its validity. This course will introduce the fundamental statistical methods that are used in all quantitative fields. Material will include the different types of variable data and their descriptions, working the appropriate statistical tests for a given hypothesis, and how to interpret the results accordingly in order to formulate a valid conclusion for the population of interest. This will provide sufficient background to conduct basic statistical analyses, understand the basic statistical content of published journal articles and other scientific literature, and investigate more specialized statistical techniques if necessary.
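As a preview of the regression ideas just mentioned, here is a hedged R sketch that fits a least-squares line with lm() and reports the coefficient of determination r². The predictor and response data are simulated; the names x and y are placeholders, not variables from any study described in these notes.

set.seed(2)
x <- 1:20                                # hypothetical predictor values
y <- 3 + 0.5 * x + rnorm(20, sd = 1)     # hypothetical linear response plus random noise

model <- lm(y ~ x)                       # least squares regression line
coef(model)                              # estimated intercept and slope
summary(model)$r.squared                 # coefficient of determination r^2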


Ismor Fischer, 1/4/2011 iv

[Figure: schematic of statistical inference]

POPULATION — Random Variable: X = IQ score, having an ideal distribution of values; Null Hypothesis: mean µ = 100 (about a parameter)

RANDOM SAMPLE — Experiment to test hypothesis → Observations

Analysis of empirically generated data (e.g., via a histogram) — Statistic: mean x̄ = 117 (estimate of parameter)

Statistical Inference — Conclusion: Does the experimental evidence tend to support or refute the null hypothesis?


1. Introduction

1.1 Motivation
1.2 Classical Scientific Method
1.3 Definitions and Examples
1.4 Medical Study Designs
1.5 Problems


Ismor Fischer, 5/29/2012 1.1-1

[Figure: distribution of X = Survival (months), plotted on a 10–40 month axis, with the population mean survival time marked]

1. Introduction

1.1 Motivation: Examples and Applications

Is there a “statistically significant” difference in survival time between cancer patients on a new drug treatment, and the “standard treatment” population?

An experimenter may have suspicions, but how are they formally tested? Select a random sample of cancer patients, and calculate their “mean survival time.”

Design issues ~

How do we randomize? For that matter, why do we randomize? (Bias)

What is a “statistically significant” difference, and how do we detect it?

How large should the sample be, in order to detect such a difference if there is one?

Sample mean survival time = 27.0

Analysis issues ~

Is a mean difference of 2 months “statistically significant,” or possibly just due to random variation? Can we formally test this, and if so, how?

Interpretation in context? Similar problems arise in all fields where quantitative data analysis is required.


Ismor Fischer, 5/29/2012 1.1-2

t = 0

α = angle of elevation

v0 = initial speed

x = (v0 cos α) t

y = (v0 sin α) t − (1/2) g t²

At time t, P(x, y):

DETERMINISTIC OUTCOMES

RANDOM OUTCOME

Heads or Tails? Probability!

[Figure: water-jet fountain, DTW Terminal A; http://www.metroairport.com]

Question: How can we “prove” objective statements about the behavior of a given system, when random variation is present?

Example: Toss a small mass into space.

1. Hypothesis H0: “The coin is fair (i.e., unbiased).”

# tosses:  1  2  3  4  5  6  7  …  n  …
outcome = (H  T  H  H  T  T  H  …  T  …)
# Heads:   1  1  2  3  3  3  4  …  X  …

Definition: P(Heads) = lim (n→∞) X/n   (= 0.5, if the coin is fair)

However, it is not possible to apply this formal definition in practice, because we cannot toss the coin an infinite number of times. So....

Answer: In principle, the result of an individual random outcome may be unpredictable, but long-term statistical

patterns and trends can often be determined.
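The limiting relative-frequency definition of P(Heads) given above can be illustrated (though of course not proved) by simulation. A minimal R sketch, assuming a fair coin:

set.seed(3)
tosses <- rbinom(10000, 1, 0.5)               # 1 = Heads, 0 = Tails, for a fair coin
running.prop <- cumsum(tosses) / (1:10000)    # X/n, the proportion of Heads after each toss
running.prop[c(10, 100, 1000, 10000)]         # these proportions drift toward 0.5 as n grows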

(For the projectile: t_final = 2 v0 sin α / g,   x_final = v0² sin 2α / g,   y_max = v0² sin² α / (2g).)


Ismor Fischer, 5/29/2012 1.1-3

2. Experiment: Generate a random sample of n = 100 independent tosses.

3. Observation:
   toss #:    1  2  3  4  5  6  7  …  100
   outcome = (T  T  H  T  H  H  H  …  T)

Exercise: How many such possible outcomes are there? Let the random variable “X = # Heads” in this experiment: {0, 1, 2, …, 100}.

[Comment: A nonrandom variable is one whose value is determined, thus free of any experimental measurement variation, such as the solution of the algebraic equation 3X + 7 = 11, or X = # eggs in a standard one-dozen carton, or # wheels on a bicycle.]

4. Analysis: Compare the observed empirical data with the theoretical prediction for X (using probability), assuming the hypothesis is true. That is, …

Expected # Heads: E[X] = 50   versus   Observed # Heads: some value in {0, 1, 2, …, 100}

Future Issue: If the hypothesis is indeed false, then how large must the sample size n be, in order to detect a genuine difference the vast majority of the time? This relates to the power of the experiment.

Is the difference statistically significant?
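As a preview of this “Future Issue,” the power of the experiment can be sketched in R under an assumed alternative. The true probability of Heads, 0.6, is an arbitrary illustrative choice, and the rejection region X ≤ 39 or X ≥ 61 is the one derived in the Conclusion (step 5) below.

p.true <- 0.6                # assumed true probability of Heads (coin not fair); illustrative only
# Probability that X lands in the rejection region X <= 39 or X >= 61 when n = 100:
power <- pbinom(39, 100, p.true) + (1 - pbinom(60, 100, p.true))
power                        # the power: the chance this experiment detects that the coin is biased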


Ismor Fischer, 5/29/2012 1.1-4

[Figure: two-sided p-values for X = # Heads in 100 tosses, plotted against X, with significance level α = 0.05, confidence level 1 − α = 0.95, and the corresponding Reject H0 / Accept H0 / Reject H0 regions marked]

Again, assuming the hypothesis is true, P(Observed, given Expected) = p-value. For X = 50: P(X = 50) = 0.0796.* ⇔ P(X ≤ 49 or X ≥ 51) = 0.9204   Accept H0

Likewise… P(X ≤ 48 or X ≥ 52) = 0.7644

P(X ≤ 47 or X ≥ 53) = 0.6173

P(X ≤ 46 or X ≥ 54) = 0.4841

P(X ≤ 45 or X ≥ 55) = 0.3682

P(X ≤ 44 or X ≥ 56) = 0.2713

P(X ≤ 43 or X ≥ 57) = 0.1933

P(X ≤ 42 or X ≥ 58) = 0.1332

P(X ≤ 41 or X ≥ 59) = 0.0886

P(X ≤ 40 or X ≥ 60) = 0.0569

P(X ≤ 39 or X ≥ 61) = 0.0352

P(X ≤ 38 or X ≥ 62) = 0.0210

P(X ≤ 37 or X ≥ 63) = 0.0120

P(X ≤ 36 or X ≥ 64) = 0.0066

. . . . . . . .

P(X = 0 or X = 100) = 0.0000 Reject H0

5. Conclusion:

Suppose H0 is true, i.e., the coin is fair, and we wish to guarantee that there is... at worst, 5% probability of erroneously concluding that the coin is unfair, or equivalently, 95% probability of correctly concluding that the coin is indeed fair. This will be the case if we do not reject H0 unless X ≤ 39 or X ≥ 61.

* Of the 2^100 possible outcomes of this experiment, only C(100, 50) of them have exactly 50 Heads; the ratio is 0.0796, or about 8%.
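The probabilities in the chart above all come from the Binomial(100, 0.5) model, so they can be reproduced (up to rounding) with a few lines of R; a minimal sketch:

dbinom(50, 100, 0.5)     # P(X = 50), approximately 0.0796

# Two-sided tail probabilities P(X <= 50 - k  or  X >= 50 + k), for k = 1, 2, ..., 14:
k <- 1:14
round(pbinom(50 - k, 100, 0.5) + (1 - pbinom(50 + k - 1, 100, 0.5)), 4)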


Ismor Fischer, 5/29/2012 1.2-1

1.2 The Classical Scientific Method and Statistical Inference

“The whole of science is nothing more than a refinement of everyday thinking.”

– Albert Einstein

[Figure: the classical scientific method as statistical inference]

Population of units — Random Variable X — Hypothesis (about X)

THEORY (Mathematical Theorem, formal proof): “What ideally must follow, if hypothesis is true.”
   Proof: If Hypothesis (about X), then Conclusion (about X). QED

EXPERIMENT (Random Sample, empirical data; n = # observations: x1, x2, x3, …, xn): “What actually happens this time, regardless of hypothesis.”

Analysis: Observed vs. Expected, under Hypothesis — “Is the difference statistically significant? Or just due to random, chance variation alone?”

Decision: Accept or Reject Hypothesis


Ismor Fischer, 5/29/2012 1.2-2

Example: Population of individuals. Hypothesis: “The prevalence (proportion) of a certain disease is 10%.”

THEORY (Mathematical Theorem, formal proof) — “What ideally must follow, if hypothesis is true”: If the Hypothesis of 10% prevalence is true, then the “expected value” of X would be 250 out of a random sample of 2500.

EXPERIMENT (Random Sample, empirical data; n = 2500 individuals, each recorded as Yes/No) — “What actually happens this time, regardless of hypothesis”: Suppose random variable X = “# Yes” = 300, i.e., estimated prevalence = 300/2500 = 0.12, or 12%.

Moreover, under these conditions, it can (and later will) be mathematically proved that the probability of obtaining a sample result that is as, or more, extreme than 12% is only .00043 (the “p-value”), or less than one-twentieth of one percent. EXTREMELY RARE!!! Thus, our sample evidence is indeed statistically significant; it tends to strongly refute the original Hypothesis.

Decision: Reject Hypothesis. Based on our sample, the prevalence of this disease in the population is significantly higher than 10%, around 12%.
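A minimal R sketch of the tail-probability calculation behind this example. The quoted .00043 corresponds to the upper tail under a Normal approximation to the Binomial(2500, 0.10) model; the exact Binomial tail is of the same order of magnitude.

# Upper-tail probability of observing 300 or more "Yes" responses out of 2500,
# if the true prevalence is 10%:
1 - pbinom(299, 2500, 0.10)             # exact Binomial model

1 - pnorm(300, mean = 250, sd = 15)     # Normal approximation; sd = sqrt(2500 * 0.10 * 0.90) = 15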


Ismor Fischer, 5/17/2013 1.3-1

POPULATION = swimming pool
Random Variable X = Water Temperature (°F)

(Informal) Null Hypothesis H0: “(The mean of) X is okay for swimming.” (e.g., µ = 80°F)

(Informal) Experiment Select a random sample by sticking in foot and swishing water around.

(Informal) Analysis Determine if the difference between the observed temperature and expected temperature under H0 is significant.

Conclusion If not, then accept H0… Jump in! If so, then reject H0… Go jogging instead.

1.3 Definitions and Examples

Definition: A random variable, usually denoted by X, Y, Z,…, is a rule that assigns a number to each outcome of an experiment. (Examples: X = mass, pulse rate, gender)

Definition: Statistics is a collection of formal computational techniques that are designed to test and derive a (reject or “accept”) conclusion about a null hypothesis for a random variable defined on a population, based on experimental data taken from a random sample.

Example: Blood sample taken from a patient for medical testing purposes, and results compared with ideal reference values, to see if differences are significant.

Example: “Goldilocks Principle”

The following example illustrates the general approach used in formal hypothesis testing.

Example: United States criminal justice system

Null Hypothesis H0: “Defendant is innocent.”

The burden of proof is on the prosecution to collect enough empirical evidence to try to reject this hypothesis, “beyond a reasonable doubt” (i.e., at some significance level).

[Figure: temperature scale — Too Cold (Reject H0) | OK (Accept H0) | Too Hot (Reject H0)]

Casey Anthony ACQUITTED H0 “accepted” July 5, 2011

Jodi Arias CONVICTED H0 rejected May 8, 2013


Ismor Fischer, 5/17/2013 1.3-2

Example: Pharmaceutical Application

Phase III Randomized Clinical Trial (RCT)

• Used to compare “drug vs. placebo,” “new treatment vs. standard treatment,” etc., via randomization (to eliminate bias) of participants to either a treatment arm or control arm. Moreover, randomization is often “blind” (i.e., “masked”), and implemented by computer, especially in multicenter collaborative studies. Increasing use of the Internet!

• Standard procedure used by FDA to approve pharmaceuticals and other medical treatments for national consumer population.

[Figure: POPULATION, with Random Variable “X = cholesterol level (mg/dL)”; two RANDOM SAMPLES — a Drug group of size n1 (population mean µ1) and a Placebo group of size n2 (population mean µ2)]

Null Hypothesis H0: There is no difference in population mean cholesterol levels between the two groups, i.e.,

µ1 − µ2 = 0.

Is the mean difference statistically significant (e.g., at the α = .05 level)? If so, then reject H0. There is evidence of a genuine treatment difference! If not, then “accept” H0. There is not enough evidence of a genuine treatment difference. More study needed?

x̄1 = 225,   x̄2 = 240,   x̄1 − x̄2 = −15
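A hedged R sketch of how the comparison of the two sample means might be carried out. The data are simulated; the group sizes, standard deviation, and random seed are illustrative assumptions, not values from the trial described above.

set.seed(7)
drug    <- rnorm(50, mean = 225, sd = 30)    # hypothetical cholesterol levels, treatment arm
placebo <- rnorm(50, mean = 240, sd = 30)    # hypothetical cholesterol levels, control arm

t.test(drug, placebo)    # two-sample t-test of H0: mu1 - mu2 = 0; reject H0 if p-value < .05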


Ismor Fischer, 5/29/2012 1.4-1

1.4 Some Important Study Designs in Medical Research

I. OBSERVATIONAL (no intervention)

   A. LONGITUDINAL (over some period of time)

      1. Retrospective (backward-looking)

         Case-Control Study: Identifies present disease with past exposure to risk factors.

      2. Prospective (forward-looking)

         Cohort Study: Classically, follows a cohort of subjects forward in time.

         Example: Framingham Heart Study to identify CVD risk factors, ongoing since 1948.

   B. CROSS-SECTIONAL (at some fixed time)

      Survey: Acquires self-reported information from a group of participants.

      Prevalence Study: Determines the proportion of a specific disease in a given population.

II. EXPERIMENTAL (intervention)

Randomized Clinical Trial (RCT): Randomly assigns patients to either a treatment group (e.g., new drug) or control group (e.g., standard drug or placebo), and follows each through time.

[Timelines:
   Cohort study — TIME: PRESENT → FUTURE. Given: Exposed (E+) and Unexposed (E−); Investigate: association with D+ and D−.
   Case-Control study — TIME: PRESENT → PAST. Given: Cases (D+) and Controls (D−); Investigate: association with E+ and E−.]

[RCT schematic: Patients satisfying inclusion criteria → RANDOMIZE → Treatment Arm / Control Arm → at end of study, compare via statistical analysis]


Ismor Fischer, 5/29/2012 1.4-2

Phases of a Clinical Trial

In vitro biochemical and pharmacological research, including any computer simulations.

Pre-clinical testing of in vivo animal models to determine safety and potential to fight a specific disease. Typically takes 3-4 years. Successful pass rate is only ≈ 0.1%, i.e., one in a thousand compounds.

PHASE I. First stage of human testing, contingent upon FDA approval, including protocol evaluation by an Institutional Review Board (IRB) ethics committee. Determines safety and side effects as dosage is incrementally increased to “maximum tolerated dose” (MTD) that can be administered without serious toxicity. Typically involves very few (≈ 12, but sometimes more) healthy volunteers, lasting several months to a year. Phase I pass rate is approximately 70%.

PHASE II. Determines possible effectiveness of treatment. Typically involves several (≈ 14-30, but sometimes more) afflicted patients who have either received previous treatment, or are untreatable otherwise. Lasts from several months to two years. Only approximately 30% of all experimental drugs tested successfully pass both Phases I and II.

PHASE III. Classical randomized clinical trial (although most Phase II are randomized as well) that compares patients randomly assigned to a new treatment versus those treated with a control (standard treatment or placebo). Large-scale experiment involving several hundred to several thousand patients, lasting several years. Seventy to 90 percent of drugs that enter Phase III studies successfully complete testing. FDA review and approval for public marketing can take from six months to two years.

PHASE IV. Post-marketing monitoring. Randomized controlled studies often designed with several objectives: 1) to evaluate long term safety, efficacy and quality of life after the treatment is licensed or in common use, 2) to investigate special patient populations not previously studied (e.g., pediatric or geriatric), 3) to determine the cost-effectiveness of a drug therapy relative to other traditional and new therapies.

Total time from lab development to marketing: 10-15 years


Ismor Fischer, 2/8/2014 Solutions / 1.5-1

1.5 Solutions

1. X = 38 Heads in n = 100 tosses corresponds to a p-value = .021, which is less than α = .05; hence in this case we are able to reject the null hypothesis, and conclude that the coin is not fair, at this significance level. However, p = .021 is greater than α = .01; hence we are unable to reject the null hypothesis of fairness at this significance level. We tentatively “accept” – or at least do not outright reject – the claim that the coin is fair, at this level. (The coin may indeed be biased, but this empirical evidence is not sufficient to show it.) Thus, lowering the significance level α at the outset means that, based on the sample data, we will be able to reject the null hypothesis less often on average, resulting in a more conservative test.
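Rather than reading the chart, the p-value quoted in this solution can be computed directly in R; a minimal sketch:

# Two-sided tail probability P(X <= 38 or X >= 62), for X ~ Binomial(100, 0.5):
pbinom(38, 100, 0.5) + (1 - pbinom(61, 100, 0.5))    # approximately 0.021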

2.

(a) If the coin is known to be fair, then all 2^10 outcomes are equally likely; the probability of any one of them occurring is the same (namely, 1/2^10)!

(b) However, if the coin is not known to be fair, then Outcomes 1, 2, and 3 – each with X = 5 Heads and n – X = 5 Tails, regardless of the order in which they occur – all provide the best possible evidence in support of the hypothesis that the coin is unbiased. Outcome 4, with X = 7 Heads, is next. And finally, Outcome 5, with all X = 10 Heads, provides the worst possible evidence that the coin is fair.

3. The issue here is one of sample size, and statistical power – the ability to detect a significant difference from the expected value, if one exists. In this case, a total of X = 18 Heads out of n = 50 tosses yields a p-value = 0.0649, which is just above the α = .05 significance level. Hence, the evidence in support of the hypothesis that the coin is fair is somewhat borderline. This suggests that the sample size of n = 50 may not be large enough to detect a genuine difference, even if there is one. If so, then a larger sample size might generate more statistical power. In this experiment, obtaining X = 36 Heads out of n = 100 tosses is indeed sufficient evidence to reject the hypothesis that the coin is fair.
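Both p-values cited in this solution can be checked directly in R; a minimal sketch:

# n = 50 tosses, X = 18 Heads:  P(X <= 18 or X >= 32), for X ~ Binomial(50, 0.5)
pbinom(18, 50, 0.5) + (1 - pbinom(31, 50, 0.5))      # approximately 0.0649

# n = 100 tosses, X = 36 Heads:  P(X <= 36 or X >= 64), for X ~ Binomial(100, 0.5)
pbinom(36, 100, 0.5) + (1 - pbinom(63, 100, 0.5))    # approximately 0.0066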


Ismor Fischer, 2/8/2014 Solutions / 1.5-2

4. R exercise

(a) If the population ages are uniformly distributed between 0 and 100 years, then via symmetry, the mean age would correspond to the midpoint, or 50 years.

(b) The provided R code generates a random sample of n = 500 ages from a population between 0 and 100 years old. The R command mean(my.sample) should typically give a value fairly close to the population mean of 50 (but see part (d)).

(c) The histogram below is typical. The frequencies indicate the number of individuals in each age group of the sample, and correspond to the heights of the rectangles. In this sample, there are:

• 94 individuals between 0 and 20 years old, i.e., 18.8%,

• 98 individuals between 20 and 40 years old, i.e., 19.6%,

• 105 individuals between 40 and 60 years old, i.e., 21.0%,

• 100 individuals between 60 and 80 years old, i.e., 20.0%,

• 103 individuals between 80 and 100 years old, i.e. 20.6%.

If the population is uniformly distributed, we would expect the sample frequencies to be about the same in each of the five intervals, and indeed, that is the case; we can see that each interval contains about one-hundred individuals (i.e., 20%).


Ismor Fischer, 2/8/2014 Solutions / 1.5-3

(d) Most results should be generally similar to (b) and (c) – in particular, the sample means fairly close to the population mean of 50 – but there is a certain nontrivial amount of variability, due to the presence of “outliers.” For example, if by chance a particular sample should consist of unusually many older individuals, it is quite possible that the mean age would be shifted to a value that is noticeably larger than 50. This is known as “skewed to the right” or “positive skew.” Similarly, a sample containing many younger individuals might be “skewed to the left” or “negatively skewed.”

(e) The histogram below displays a simulated distribution of the means of many (in this

case, 2000) samples, each sample having n = 500 ages. Notice how much “tighter” (i.e., less variability) the graph is around 50, than any of those in (c). The reason is that it is much more common for a random sample to contain a relatively small number of outliers – whose contribution is “damped out” when all the ages are averaged – than for a random sample to contain a relatively large number of outliers – whose contribution is sizeable enough to skew the average. Thus, the histogram is rather “bell-shaped”; highly peaked around 50, but with “tails” that taper off left and right.

Very rarely will a random sample have mostly low values, resulting in its average << 50.

Very rarely will a random sample have mostly high values, resulting in its average >> 50.


Ismor Fischer, 2/8/2014 Solutions / 1.5-4

5. The following is typical output (“copy-and-paste”) directly from R. Comments are in blue.

(a)
> prob = 0.5
> tosses = rbinom(100, 1, prob)    This returns a random sequence of 100 single tosses.*
> tosses    # view the sequence
  [1] 1 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 1 0
 [38] 1 1 1 1 0 1 1 0 1 0 0 0 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 0
 [75] 1 1 1 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 1 0 1 0 1 1 0

> sum(tosses)    # count the number of Heads
[1] 58

* Note: rbinom(1, 100, prob) just generates the number of Heads (not the actual sequence) in 1 run of 100 random tosses, in this case, 58. This simulation of 100 random tosses of a fair coin produced 58 Heads. According to the chart on page 1-4, the corresponding p-value = 0.1332. That is, if the coin is fair (as here), then in 100 tosses, there is an expected 13.32% probability of obtaining 8 (or more) Heads away from 50. This is above the 5% significance level, hence consistent with the coin being fair. Had it been below (i.e., rarer than) 5%, it would have been inconsistent with the coin being fair, and we would be forced to conclude that the coin is indeed biased. Alas, in multiple runs, this would eventually happen just by chance!

(See the outliers in the graphs below.)

(b)
> X = rbinom(500, 100, prob)    This command generates the number of Heads in each of 500 runs of 100 tosses, as stated.
> sort(X)    This command sorts the 500 numbers just found in increasing order (not shown).
> table(X)    Produces a frequency table for X = # Heads, i.e., 35 Heads occurred twice, 36 twice, etc.
X
35 36 37 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 66
 2  2  2  8  7  9 13 15 24 23 30 38 41 35 41 41 33 27 21 31 16  8 10  9  1  6  3  1  2  1

> summary(X)    This is often referred to as the “five number summary”:
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 35.00   46.00   50.00   49.53   53.00   66.00

Notice that the mean ≈ median (suggesting that this may be close to a more-or-less symmetric distribution; see page 2-14 in the notes) ≈ 50, both of which you might expect to see in 100 tosses of an unbiased coin, as confirmed in the three graphs below.


Ismor Fischer, 2/8/2014 Solutions / 1.5-5

Histogram

Stemplot Dotplot 35 | 00 36 | 00 37 | 00 38 | 39 | 00000000 40 | 0000000 41 | 000000000 42 | 0000000000000 43 | 000000000000000 44 | 000000000000000000000000 45 | 00000000000000000000000 46 | 000000000000000000000000000000 47 | 00000000000000000000000000000000000000 48 | 00000000000000000000000000000000000000000 49 | 00000000000000000000000000000000000 50 | 00000000000000000000000000000000000000000 51 | 00000000000000000000000000000000000000000 52 | 000000000000000000000000000000000 53 | 000000000000000000000000000 54 | 000000000000000000000 55 | 0000000000000000000000000000000 56 | 0000000000000000 57 | 00000000 58 | 0000000000 59 | 000000000 60 | 0 61 | 000000 62 | 000 63 | 0 64 | 00 65 | 66 | 0

(c) The sample proportions obtained from this experiment are quite close to the theoretical p-values we expect to see, if the coin is fair. Since these values are comparable, it seems that we have reasonably strong confirmation that the coin is indeed unbiased.

      lower upper  prop    p-values (from chart)
 [1,]    49    51 0.918    0.9204
 [2,]    48    52 0.766    0.7644
 [3,]    47    53 0.618    0.6173
 [4,]    46    54 0.488    0.4841
 [5,]    45    55 0.386    0.3682
 [6,]    44    56 0.278    0.2713
 [7,]    43    57 0.198    0.1933
 [8,]    42    58 0.152    0.1332
 [9,]    41    59 0.106    0.0886
[10,]    40    60 0.070    0.0569
[11,]    39    61 0.054    0.0352
[12,]    38    62 0.026    0.0210
[13,]    37    63 0.020    0.0120
[14,]    36    64 0.014    0.0066
[15,]    35    65 0.006
[16,]    34    66 0.002
etc. From this point on, all proportions are 0.

Note the outliers!


Ismor Fischer, 2/8/2014 Solutions / 1.5-6

(d)
> prob = runif(1, min = 0, max = 1)    This selects a random probability for Heads.
> tosses <- rbinom(100, 1, prob)

> sum(tosses)    # count the number of Heads
[1] 62

This simulation of 100 random tosses of the coin (whose bias is now unknown) produced 62 Heads, which corresponds to a p-value = .021 < .05. Hence, based on this sample evidence, we may reject the hypothesis that the coin is fair; the result is statistically significant at the α = .05 level. Graphs are similar to above, centered about the mean (see below).

> table(X)
X
46 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 77 78 79 80
 1  2  1  1  6  6  6  5 12 16 22 31 27 28 44 50 42 45 39 29 18 14 16 15  4  4 11  2  1  1  1

> summary(X)
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 46.00   61.00   64.00   64.31   67.00   80.00

According to these data, the mean number of Heads is 64.31 out of 100 tosses; hence the estimated probability of Heads is 0.6431. The actual probability that R used here is

> prob [1] 0.6412175


Ismor Fischer, 2/8/2014 Solutions / 1.5-7

6. (a)

X = Sum   Probability
    2     1/36 = 0.02778
    3     2/36 = 0.05556
    4     3/36 = 0.08333
    5     4/36 = 0.11111
    6     5/36 = 0.13889
    7     6/36 = 0.16667
    8     5/36 = 0.13889
    9     4/36 = 0.11111
   10     3/36 = 0.08333
   11     2/36 = 0.05556
   12     1/36 = 0.02778

(b) • P(2 ≤ X ≤ 12) = 1, because the event 2 ≤ X ≤ 12 comprises the entire sample space.

• P(2 ≤ X ≤ 6 or 8 ≤ X ≤ 12) = P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6) + P(X = 8) + P(X = 9) + P(X = 10) + P(X = 11) + P(X = 12), or, 1 − P(X = 7) = 1 − 6/36 = 30/36 = 0.83333

Likewise,

• P(2 ≤ X ≤ 5 or 9 ≤ X ≤ 12) = 20/36 = 0.55556

• P(2 ≤ X ≤ 4 or 10 ≤ X ≤ 12) = 12/36 = 0.33333

• P(2 ≤ X ≤ 3 or 11 ≤ X ≤ 12) = 6/36 = 0.16667

• P(X ≤ 2 or X ≥ 12) = 2/36 = 0.05556

• P(X ≤ 1 or X ≥ 13) = 0, because neither the event X ≤ 1 nor X ≥ 13 can occur.
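The completed table and the tail probabilities in this solution can be generated in R; a minimal sketch using the 36 equally likely ordered outcomes:

sums <- outer(1:6, 1:6, "+")     # all 36 equally likely ordered outcomes of the two dice
table(sums) / 36                 # the probability distribution of X = Sum

mean(sums <= 5 | sums >= 9)      # P(2 <= X <= 5 or 9 <= X <= 12) = 20/36, approximately 0.55556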


Ismor Fischer, 2/8/2014 Solutions / 1.5-8

7. Absolutely not. That both sets of measurements average to 50.0 grams indicates that they have the same accuracy, but Scale A has much less variability in its readings than Scale B, so it has much greater precision. This experiment suggests that if many more measurements were taken, those of A would show a much higher density centered around 50 g than those of B, whose distribution of values would show much more spread around 50 g. Variability determines reliability, a major factor in quality control of services and manufactured products.

[Figure: two distributions centered at 50 g. Measurements obtained from the A distribution are much more tightly clustered around their center than those of the B distribution.]
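The difference in precision can be quantified with the sample standard deviations; a minimal R sketch using the measurements given in the problem:

scale.A <- c(49.8, 50.0, 50.2)
scale.B <- c(49.0, 50.0, 51.0)

mean(scale.A); mean(scale.B)     # both 50.0 grams: the two scales have the same accuracy
sd(scale.A); sd(scale.B)         # 0.2 g vs. 1.0 g: Scale A has far greater precision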


Ismor Fischer, 1/9/2014 1.5-1

1.5 Problems

In this section, we use some of the terminology that was introduced in this chapter, most of which will be formally defined and discussed in later sections of these notes.

1. Suppose that n = 100 tosses of a coin result in X = 38 Heads. What can we conclude about the “fairness” of the coin at the α = .05 significance level? At the α = .01 level? (Use the chart given on page 1.1-4.)

2.

(a) Suppose that a given coin is known to be “fair” or “unbiased” (i.e., the probability of Heads is 0.5 per toss). In an experiment, the coin is to be given n = 10 independent tosses, resulting in exactly one out of 2^10 possible outcomes. Rank the following five outcomes in order of which has the highest probability of occurrence, to which has the lowest.

Outcome 1: (H H T H T T T H T H)

Outcome 2: (H T H T H T H T H T)

Outcome 3: (H H H H H T T T T T)

Outcome 4: (H T H H H T H T H H)

Outcome 5: (H H H H H H H H H H)

(b) Suppose now that the bias of the coin is not known. Rank these outcomes in order of which provides the best evidence in support of the hypothesis that the coin is “fair,” to which provides the best evidence against it.

3. Let X = “Number of Heads in n = 50 random, independent tosses of a fair coin.” Then the expected value is E[X] = 25, and the corresponding p-values for this experiment can be obtained by the following probability calculations (for which you are not yet responsible).

P(X ≤ 24 or X ≥ 26) = 0.8877

P(X ≤ 23 or X ≥ 27) = 0.6718

P(X ≤ 22 or X ≥ 28) = 0.4799

P(X ≤ 21 or X ≥ 29) = 0.3222

P(X ≤ 20 or X ≥ 30) = 0.2026

P(X ≤ 19 or X ≥ 31) = 0.1189

P(X ≤ 18 or X ≥ 32) = 0.0649

P(X ≤ 17 or X ≥ 33) = 0.0328

P(X ≤ 16 or X ≥ 34) = 0.0153

P(X ≤ 15 or X ≥ 35) = 0.0066

P(X ≤ 14 or X ≥ 36) = 0.0026

P(X ≤ 13 or X ≥ 37) = 0.0009

P(X ≤ 12 or X ≥ 38) = 0.0003

P(X ≤ 11 or X ≥ 39) = 0.0001

P(X ≤ 10 or X ≥ 40) = 0.0000

……

P(X ≤ 0 or X ≥ 50) = 0.0000

Now suppose that this experiment is conducted twice, and X = 18 Heads are obtained both times. According to this chart, the p-value = 0.0649 each time, which is above the α = .05 significance level; hence, both times, we conclude that the sample evidence seems to support the hypothesis that the coin is fair. However, the two experiments taken together imply that in this random sequence of n = 100 independent tosses, X = 36 Heads are obtained. According to the chart on page 1.1-4, the corresponding p-value = 0.0066, which is much less than α = .05, suggesting that the combined sample evidence tends to refute the hypothesis that the coin is fair. Give a brief explanation for this apparent discrepancy.


Ismor Fischer, 1/9/2014 1.5-2

NOTE: Please read the bottom of “Getting Started with R” regarding its use in HW problems, such as 1.5/4 below. Answer questions in all parts, especially those involving the output, and indicate!

4. In this problem, we will gain some more fundamental practice with the R programming language. Some of the terms and concepts may appear unfamiliar, but we will formally define them later. For now, just use basic intuition. [R Tip: At the prompt (>), repeatedly pressing the “up arrow” ↑ on your keyboard will step through your previous commands in reverse order.]

(a) First, consider a “uniformly distributed” (i.e., evenly scattered) population of ages between 0 and 100 years. What is the mean age of this population? (Use intuition.)

Let us simulate such a population, by generating an arbitrarily large (say one million) vector of random numbers between 0 and 100 years. Type, or copy and paste

population = runif(1000000, 0, 100)

at the prompt (>) in the R console, and hit Enter.

Let us now select a single random sample of n = 500 values from this population via

rand = sample(population, 500)

then sort them from lowest to highest, and round them to two decimal places:

my.sample = round(sort(rand), 2)

Type my.sample to view the sample you just generated. (You do not need to turn this in.)

(b) Compute the mean age of my.sample. How does it compare with the mean found in (a)?

(c) The R command hist graphs a “frequency histogram” of your data. Moreover, ?hist gives many options under Usage for this command. As an example, graph:

hist(my.sample, breaks = 5, xlab = "Ages", border = "blue", labels = T)

Include and interpret the resulting graph. Does it reasonably reflect the uniformly-distributed population? Explain.

(d) Repeat (b) and (c) several more times using different samples of n = 500 data values. How do these sample mean ages compare with the population mean age in (a)?

(e) Suppose many random samples of size n = 500 values are averaged, as in (d). Graph their histogram via the R code below, and offer a reasonable explanation for the resulting shape.

vec.means = NULL
for (i in 1:2000) {vec.means[i] = mean(sample(population, 500))}
hist(vec.means, xlab = "Mean Ages", border = "darkgreen")

The idea behind this problem will be important in Chapter 5.


Ismor Fischer, 1/9/2014 1.5-3

5. In this problem, we will use the R programming language to simulate n = 100 random tosses of a coin. (Remember that most such problems are linked to the Rcode folder.)

(a) First, assume the coin is fair or unbiased (i.e., the probability of Heads is 0.5 per toss), and use the Binomial distribution to generate a random sequence of n = 100 independent tosses; each outcome is coded as “Heads = 1” and “Tails = 0.”

prob = 0.5                       (∗)
tosses = rbinom(100, 1, prob)
tosses                           # view the sequence
sum(tosses)                      # count the number of Heads

From the chart on page 1.1-4, calculate the p-value of this experiment. At the α = 0.05 significance level, does the outcome of this experiment tend to support or reject the hypothesis that the coin is fair? Repeat the experiment several times.

(b) Suppose we run this experiment 500 times, and count the number of Heads each time. Let us view the results, and display some summary statistics,

X = rbinom(500, 100, prob)
sort(X)
table(X)
summary(X)

as well as graph them, using each of the following methods, one at a time.

stripchart(X, method = "stack", ylim = range(0, 100), pch = 19)    # Dotplot
stem(X, scale = 2)                                                 # Stemplot
hist(X)                                                            # Histogram

Comment on how these graphs compare to what you would expect to see from a fair coin.

(c) How do the sample proportions obtained compare with the theoretical probabilities on page 1.1-4?

lower = 49:0
upper = 51:100
prop = NULL
for (k in 1:50) {
  less.eq <- which(X <= lower[k])
  greater.eq <- which(X >= upper[k])
  v <- c(less.eq, greater.eq)
  prop <- c(prop, length(v)/500)
}
cbind(lower, upper, prop)

(d) Suppose now that the coin may be biased. Replace line (∗) above with the following R code, and repeat parts (a) and (b).

prob = runif(1, min = 0, max = 1)

Also, estimate the probability of Heads from the data. Check against the true value of prob.


Ismor Fischer, 1/9/2014 1.5-4

6. Suppose we roll two distinct dice, each die having 6 faces. Thus there are 6² = 36 possible combinations of outcomes for the pair. For any given roll, define the random variable X = “Sum,” so X can only take on the integer values in the set S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.

(a) How many ways can each of these values theoretically occur on a single roll? For example, there are three possible ways of rolling X = 4. (What are they?) Hence, if the dice are fair, we may express the “probability” of this event occurring as being equal to 3/36. (The mathematical notion of probability will be formalized later.) Making this assumption that the dice are indeed fair, and using the same logic as just outlined, complete the following table. (The case X = 4 has been done for you.)

X = Sum   Probability
    2
    3
    4     3/36
    5
    6
    7
    8
    9
   10
   11
   12

(b) It should be reasonable that, by inspection, the “expected value” of X – written E[X] – is 7.

(Again, this will be formally shown later.) Using your table, calculate the probabilities of each of the following events, again assuming that the dice are fair. Please show all work.

• Probability of rolling 0 or more away from 7, i.e., P(2 ≤ X ≤ 12) = 1 (Why?)

• Probability of rolling 1 or more away from 7, i.e., P(2 ≤ X ≤ 6 or 8 ≤ X ≤ 12 ) =

• Probability of rolling 2 or more away from 7, i.e., P(2 ≤ X ≤ 5 or 9 ≤ X ≤ 12) =

• Probability of rolling 3 or more away from 7, i.e., P(2 ≤ X ≤ 4 or 10 ≤ X ≤ 12) =

• Probability of rolling 4 or more away from 7, i.e., P(2 ≤ X ≤ 3 or 11 ≤ X ≤ 12) =

• Probability of rolling 5 or more away from 7, i.e., P(X ≤ 2 or X ≥ 12) =

• Probability of rolling 6 or more away from 7, i.e., P(X ≤ 1 or X ≥ 13) = 0 (Why?)


Ismor Fischer, 1/9/2014 1.5-5

7. Suppose an experimenter wishes to evaluate the reliability of two weighing scales, by measuring a known 50-gram mass three times on each scale, and comparing the results. Scale A gives measurements of 49.8, 50.0 and 50.2 grams. Scale B gives measurements of 49.0, 50.0, and 51.0 grams. The average in both cases is 50.0 grams, so should the experimenter conclude that both scales are equally precise, on the basis of these data? Explain.



Synopsis of § 1.1

At the exact center of Concourse A of Detroit Metro airport lies a striking work of kinetic art. An enormous black granite slab is covered with a continuously flowing layer of water; recessed in the slab are numerous jets, each of which is capable of projecting a stream of water into the air at an exact speed and angle. When all of the jets are activated simultaneously, the result is a beautiful array of intersecting parabolic arcs at different positions, heights, and widths. But there is more... the jets are also set on different timers that are activated intermittently, shooting small bursts of water in rapid succession. The effect of these lifelike streamlets of water, leaping and porpoising over and under each other as they follow their individual parabolic paths, resembles an elegantly choreographed ballet, and is quite a sight.

Besides being aesthetically pleasing to watch, the piece is important for another reason. It is a tribute to classical physics; the equations of motion of a projectile (neglecting air resistance) have been well known for centuries. Classical Newtonian mechanics predicts that such an object will follow the path of a specific parabola which depends on the initial speed and angle, and finding its equation is a standard first-year calculus exercise. The equation also allows us to determine other quantities as well, such as the total time taken to traverse this parabola, the maximum height the projectile reaches, and the speed and downrange distance at the time of impact. In other words, this system has no (or very little) variability; it can be mathematically modeled in a very precise way. Despite the apparent complexity of the artwork, everything about its motion is completely determined from initial conditions. In fact, one could argue that it is precisely this predictability that makes the final structure possible to construct, and so visually appealing. There would probably be nothing special about watching jets of water spouting randomly.

However, for most complex dynamical systems, this is the exception rather than the rule. Random variability lurks everywhere. If, as in the previous scenario, the projectile is a coin rather than a stream of water, then none of this mathematical analysis predicts whether the coin will land on heads or tails. That outcome is not determined solely by initial conditions; in fact, even if it were possible to “rewind the tape” and start from exactly the same initial conditions, we would not necessarily obtain the same outcome of this experiment. The culprit responsible for this strange phenomenon is random chance (or random variation), an intrinsic property of the universe that can never be entirely eliminated. Analyzing systems that yield random outcomes requires an entirely different approach, one that involves an understanding of the nature of the “probability” of an event occurring over time.

If every individual in a given population were exactly the same age, there would be zero variation in that variable. But, of course, in reality, this is rarely the case, and there is usually a substantial amount of variation in ages. To say that the mean age of the population is 45 years old, for example, offers no clue about this variation; everyone could (theoretically) be 45 years old, or ages could range widely, from infants to geriatrics. Random variation in a formal experiment can be introduced through biases, measurement errors, etc.
The phrase “practice makes perfect” can be more formally interpreted as “practice reduces the amount of variability in the desired outcome,” making it increasingly precise. It’s what turns a good athlete into a great athlete (or an average student into an excellent student). One definition of the field of statistics is a formal way to detect, model, and interpret “genuine” information in a system, in the presence of random variation. A classic example (in the biological sciences, especially) is to compare the amount of variation between the “treatment arm” of a study – say, individuals on a new, investigational drug – with the corresponding “control arm” – say, individuals on a standard treatment or placebo (…while simultaneously adjusting for the amount of variation in individual responses within each group).


But exactly how is this accomplished? As a simple illustration, suppose we wish to test the claim that half of all the individuals in a certain population are male, and half are female. This scenario can be modeled by the “coin toss” example in the notes, which introduces some of the main concepts and terminology used in statistical methodology, and covered in much more detail later. Specifically, we can translate this problem into the equivalent one of trying to determine if a particular coin is fair (i.e., unbiased), that is, if the probability of obtaining heads or tails on a single toss is the same, namely 50%. To test this hypothesis, we can design a simple experiment: generate a random sample of outcomes by tossing the coin a large number of times, say 100, and count the resulting number of heads. If the hypothesis is indeed true, then the expected value of this number would be 50. Of course, if our experiment were to result in exactly 50 heads, then that would constitute the strongest possible evidence we could hope to obtain in support of the original hypothesis. However, because random variation plays a role here, it is certainly possible to obtain a number of heads close to – but not exactly – 50, with the coin still being fair, i.e., the hypothesis still true. The question now is “How far from 50 do we have to be, before the evidence suggests that the coin is indeed not fair, and thus the hypothesis should be rejected?” To answer this, we turn to the formal mathematics of Probability Theory, which can be used to predict, via a formula, the theoretical probability of obtaining any specified number of heads in 100 tosses of a fair coin. For instance, if the coin is fair1, then it can be mathematically shown that there is a 92% theoretical probability of obtaining an experimental sample that is at least 1 head away from the expected value of 50. Likewise, it can be shown that if the coin is fair2, then there is a 76.4% mathematical probability of obtaining a sample that is at least 2 heads away from 50. At least 3 heads away from 50, and the probability drops to 61.7%. In a similar fashion, we can compute the probability of obtaining a sample that is at least any prescribed number of heads away from 50, and observe that this probability (called the p-value of the sample) decreases rapidly as we drift farther and farther away from 50. At some point, the probability becomes so low that it suggests the assumption that the coin is fair is indeed not true, and should be rejected, based on the sample evidence. But where should we draw the line? Typically, this significance level – denoted by α, the Greek symbol “alpha” – is set between 1% and 10% for many applications, with 5% being a very common choice. At the 5% significance level, it turns out that if the coin is fair, the sample can be as much as 10 heads away from 50, but no more than that. That is, in 100 random tosses of a fair coin, the probability (i.e., p-value) of obtaining a sample with more than 10 heads away from 50 is less (i.e., rarer) than 5%, a result that would be considered statistically significant. Consequently, this evidence would tend to refute the hypothesis that the coin is fair.
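These binomial probabilities are easy to verify numerically. Below is a minimal R sketch (illustrative code, not part of the original notes) using the built-in dbinom and pbinom functions; it reproduces the 92%, 76.4%, and 61.7% figures quoted above, as well as the roughly 3.5% (< 5%) probability of landing more than 10 heads away from 50.

    # P(sample is at least k heads away from 50) in 100 tosses of a fair coin
    away <- function(k) 1 - sum(dbinom((50 - k + 1):(50 + k - 1), size = 100, prob = 0.5))
    away(1)    # about 0.920  (at least 1 head away)
    away(2)    # about 0.764  (at least 2 heads away)
    away(3)    # about 0.617  (at least 3 heads away)
    # P(sample is MORE than 10 heads away from 50), i.e., 39 or fewer heads, or 61 or more
    pbinom(39, size = 100, prob = 0.5) + (1 - pbinom(60, size = 100, prob = 0.5))   # about 0.035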

1 … and we are not saying for certain that it is or isn’t…

2 … and again, we are not saying for certain that it is or isn’t…


Synopsis of §1.2, 1.3

In a criminal court trial, the claim that “the defendant is presumed innocent” (unless proven guilty) can be taken as an example of a hypothesis. In other words, innocence is to be assumed; it is (supposedly) not the defense attorney’s job to prove it. Rather, the burden of proof lies with the prosecution, who must provide strong enough evidence – i.e., it must have sufficient power – to convince a jury that the hypothesis should be rejected, beyond a reasonable doubt – i.e., with a high level of confidence. (In a context such as this, we can never be 100% certain.)3

This approach is also used in the sciences. A hypothesis is either rejected or retained, based upon whether empirical evidence tends to refute or support it, respectively, via a formal procedure known as the “classical scientific method,” outlined below. We first define a population of interest (usually thought of as being arbitrarily large or infinite, for simplicity, and consisting of distinct, individual units, such as the residents of a particular area), and a specific measurable quantity that we wish to consider in that population, e.g., “Age in years.” Naturally, such a quantity varies randomly from individual to individual, hence is referred to as a random variable, and usually denoted by a generic capital letter such as X, Y, Z, etc.

This overall conservative approach to jurisprudence is deliberately intended to err on the side of caution; it is generally considered more serious to jail an innocent man – i.e., reject the hypothesis if it is true, what we will eventually come to know as a Type 1 error – than to set free a guilty one – i.e., retain the hypothesis if it is false, what we will eventually come to know as a Type 2 error.

In practice, it is usually not possible to measure the variable for everyone in the population, but we can imagine that, in principle, its values would take on some true “distribution” (such as a “bell curve” for example), which includes a population mean (i.e., average), although its exact value would most probably be unknown to us. But it is precisely this population mean value – e.g., mean age – that we wish to say something about, via the scientific method referred to as “statistical inference”:

1. Formulate a hypothesis, such as “The population mean age is 40 years old.”

2. Design an experiment to test this hypothesis. In a case like this, for example, select a random sample of individuals from the population. The number of individuals in the sample is known as the “sample size,” and usually denoted by the generic symbol n. (How to choose the value of n judiciously is the subject of a future topic.)

3. Measure the variable X (in this case, age) on all individuals in the sample, resulting in n empirical observations, denoted generically by the sample data values {x1, x2, x3, …, xn}.

4. Average these n values, and conduct a formal analysis to see whether this “sample mean” suggests a statistically significant difference5 from the hypothesized value of 40 years.

5. Infer a “reject” or “accept” conclusion about the original population hypothesis.

3 Arguably, absolute “certainty” can be achieved only in an abstract context, such as when formally proving a mathematical theorem in an axiomatic logical framework. For example, it is not feasible to verify definitively the statement that “the sum of any two odd numbers is even” by checking a large finite number of examples, no matter how many (which would technically only provide empirical evidence, albeit overwhelming), but it can be formally proved using simple laws of algebra, once formal terms are defined.

4 Note that not all variables in a population are necessarily random, such as the variable X = the number of eggs in an individual carton selected from a huge truckload of standard “one dozen” cartons, namely X = 12.

5 That is, a difference greater than what one would expect just from random chance. This analysis is the critical step.


Synopsis of § 1.4

This is a highly simplified chart of some of the main study designs used in biomedical research. As a general rule, experimental designs that test the efficacy of some investigational treatment for patients are of primary interest to physicians and other health care professionals. The gold standard of such tests is the randomized clinical trial which, classically, randomly6 assigns each participant (of which there may be thousands) to one of two “arms” of the study – either treatment or control (i.e., standard treatment or placebo) – and later compares the results of the two groups. Though expensive and time-consuming, clinical trials are the most important tool used by the FDA to approve medical treatments for the public consumer market. However, they are clearly not suitable for epidemiological investigations into such issues as disease prevalence in populations, or the association between diseases and their potential risk factors, such as lung cancer and smoking. These questions can be addressed with surveys and longitudinal studies, specifically, case-control – where previous exposure status of participants currently with disease (cases) and without disease (controls) is determined from medical records, tumor registries, etc. – and cohort – where currently exposed and unexposed groups are followed over time, and their disease status compared at the end of the study. Two of the largest and best-known cohort studies are the Framingham Heart Study (ongoing since 1948) and the Nurses’ Health Study.

6 This is usually done with mathematical algorithms that allow computers to generate “pseudorandom” numbers. Advanced schemes exist for more complex scenarios, such as “adaptive randomization” for ongoing patient recruitment during the study, “block randomization” for multicenter collaborative studies among several institutions, etc. The entire purpose of randomizing is to minimize any source of systematic bias in the selection process.

[Chart: Study designs. Intervention? Yes → Experimental → Randomized Clinical Trials (RCT). No → Observational → either Cross-sectional (at a fixed time: surveys, prevalence studies, etc.) or Longitudinal (over time) → Retrospective (looking backward: Case-Control studies) or Prospective (looking forward: Cohort studies).]


2. Exploratory Data Analysis and Descriptive Statistics

2.1 Random Variables and Data Types
2.2 Graphical Displays of Sample Data
2.3 “Summary Statistics”
2.4 Summary
2.5 Problems


Ismor Fischer, 5/29/2012 2.1-1


2. Exploratory Data Analysis & Descriptive Statistics

2.1 Examples of Random Variables & Associated Data Types

NUMERICAL (Quantitative measurements)

• Continuous: X = Length, Area, Volume, Temp, Time elapsed, pH, Mass of tumor

• Discrete: X = Shoe size, # weeks till death, Time displayed, Rx dose, # tumors

CATEGORICAL (Qualitative “attributes”)

• Nominal: X = Color (1 = Red, 2 = Green, 3 = Blue), ID #, Zip Code, Type of tumor

• Ordinal: X = Dosage (1 = Low, 2 = Med, 3 = High),

Year (2000, 2001, 2002, …), Stage of tumor (I, II, III, IV), Alphabet (01 = A, 02 = B, …, 26 = Z)

Random variables are important in experiments because they ensure objective reproducibility (i.e., verifiability, replicability) of results.

Example:

In any given study, the researcher must first decide what percentage of replicated experiments should, in principle, obtain results that correctly agree (specifically, accept a true hypothesis), and what percentage may incorrectly disagree (specifically, reject a true hypothesis), allowing for random variation.

Confidence Level: 1 − α = 0.90, 0.95, 0.99 are common choices…
Significance Level: α = 0.10, 0.05, 0.01 are the corresponding error rates.

[Figures: number-line sketches for each data type – continuous (an interval of values), discrete (isolated steps), ordinal (values 1 < 2 < 3, ranked), nominal (values 1, 2, 3, unranked), and a 0/1 scale for the binary case.]

Special Case: Binary – X = 1, “Success”; X = 0, “Failure.”


Ismor Fischer, 1/7/2013 2.2-1

2.2 Graphical Displays of Sample Data

Dotplots, Stem-and-Leaf Diagrams (Stemplots), Histograms, Boxplots, Bar Charts, Pie Charts, Pareto Diagrams, …

Example: Random variable X = “Age (years) of individuals at Memorial Union.” Consider the following sorted random sample of n = 20 ages:

{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

Dotplot

Comment: Uses all of the values. Simple, but crude; does not summarize the data.

Stemplot

Stem Leaves

Tens Ones

1 8 9 9 9

2 0 1 1 3 4 4 6 7

3 1 5 5 7 8

4 2 6

5 9

Comment: Uses all of the values more effectively. Grouping summarizes the data better.

[Dotplot: one dot per observation, plotted above an age axis X running from 18 to 59.]
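For readers following along in R, here is a minimal sketch (illustrative code, not from the original notes) that reproduces these graphical displays for the Memorial Union age sample:

    ages <- c(18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
              26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
    stripchart(ages, method = "stack", pch = 16)               # dotplot
    stem(ages)                                                 # stem-and-leaf diagram
    hist(ages, breaks = seq(10, 60, by = 10), right = FALSE)   # frequency histogram over [10,20), ..., [50,60)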


Ismor Fischer, 1/7/2013 2.2-2

Histograms

Class Interval Frequency (# occurrences)

[10, 20) 4

[20, 30) 8

[30, 40) 5

[40, 50) 2

[50, 60) 1

n = 20

Frequency Histogram

[Bars over the five class intervals with heights 4, 8, 5, 2, 1.]


Ismor Fischer, 1/7/2013 2.2-3

Class Interval   Absolute Frequency (# occurrences)   Relative Frequency (Frequency ÷ n)
[10, 20)                4                              4/20 = 0.20
[20, 30)                8                              8/20 = 0.40
[30, 40)                5                              5/20 = 0.25
[40, 50)                2                              2/20 = 0.10
[50, 60)                1                              1/20 = 0.05
                        n = 20                         20/20 = 1.00

Relative Frequency Histogram

[Bars over the five class intervals with heights 0.20, 0.40, 0.25, 0.10, 0.05; vertical axis from 0.00 to 0.40.]

Relative frequencies are always between 0 and 1, and their sum is always = 1!


Ismor Fischer, 1/7/2013 2.2-4

Often, it is of interest to determine the total relative frequency, up to a certain value. For example, we see here that 0.60 of the age data are under 30 years, 0.85 are under 40 years, etc. The resulting cumulative distribution, which always increases monotonically from 0 to 1, can be represented by the discontinuous “step function” or “staircase function” in the first graph below. By connecting the right endpoints of the steps, we obtain a continuous polygonal graph called the ogive (pronounced “o-jive”), shown in the second graph. This has the advantage of approximating the rate at which the cumulative distribution increases within the intervals. For example, suppose we wish to know the median age, i.e., the age that divides the values into equal halves, above and below. It is clear from the original data that 25 does this job, but if data are unavailable, we can still estimate it from the ogive. Imagine drawing a flat line from 0.5 on the vertical axis until it hits the graph, then straight down to the horizontal “Age” axis somewhere in the interval [20, 30); it is this value we seek. But the cumulative distribution up to 20 years is 0.2, and up to 30 years is 0.6… a rise of 0.4 in 10 years, or 0.04 per year, on average. To reach 0.5 from 0.2 – an increase of 0.3 – would thus require a ratio of 0.3 / 0.04 = 7.5 years from 20 years, or 27.5 years. Medians and other percentiles will be addressed in the next section.

Class Interval Absolute Frequency (# occurrences)

Relative Frequency (Frequency ÷ n)

Cumulative Relative Frequency

[0, 10) 0 0.00 0.00

[10, 20) 4 0.20 0.20 = 0.00 + 0.20

[20, 30) 8 0.40 0.60 = 0.20 + 0.40

[30, 40) 5 0.25 0.85 = 0.60 + 0.25

[40, 50) 2 0.10 0.95 = 0.85 + 0.10

[50, 60) 1 0.05 1.00 = 0.95 + 0.05

n = 20 1.00
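As a check of this interpolation, a small R sketch (illustrative only, not part of the notes) builds the cumulative relative frequencies, draws the ogive, and recovers the estimated median of 27.5 years:

    rel.freq <- c(0.20, 0.40, 0.25, 0.10, 0.05)
    breaks   <- c(10, 20, 30, 40, 50, 60)
    cum.rf   <- c(0, cumsum(rel.freq))              # 0, 0.20, 0.60, 0.85, 0.95, 1.00
    plot(breaks, cum.rf, type = "l")                # the ogive (polygonal graph)
    approx(x = cum.rf, y = breaks, xout = 0.5)$y    # linear interpolation at 0.5: 27.5 years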


Ismor Fischer, 1/7/2013 2.2-5

Problem! Suppose that all ages 30 and older are “lumped” into a single class interval:

{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

Class Interval   Absolute Frequency (# occurrences)   Relative Frequency (Frequency ÷ n)
[10, 20)                4                              4/20 = 0.20
[20, 30)                8                              8/20 = 0.40
[30, 60)                8                              8/20 = 0.40
                        n = 20                         20/20 = 1.00

Relative Frequency Histogram

[Bars over [10, 20), [20, 30), and [30, 60) with heights 0.20, 0.40, 0.40; vertical axis from 0.00 to 0.40.]

If this outlier (59) were larger, the histogram would be even more distorted!


Ismor Fischer, 1/7/2013 2.2-6

Remedy: Let… Area of each class rectangle = Relative Frequency, i.e., Height of rectangle × Class Width = Relative Frequency. Therefore…

Density = Relative Frequency ÷ Class Width

Class Interval          Absolute Frequency   Relative Frequency   Density (Rel Freq ÷ Class Width)
[10, 20); width = 10           4              4/20 = 0.20          0.20/10 = 0.02
[20, 30); width = 10           8              8/20 = 0.40          0.40/10 = 0.04
[30, 60); width = 30           8              8/20 = 0.40          0.40/30 = 0.0133…
                               n = 20         20/20 = 1.00

Density Histogram

[Bars over [10, 20), [20, 30), [30, 60) with heights (densities) 0.02, 0.04, 0.0133…; the areas of the bars are the relative frequencies 0.20, 0.40, 0.40. Total Area = 1!]
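In R, a density histogram with these unequal class widths can be obtained as follows (a sketch, not from the notes); note that the bar areas sum to 1:

    ages <- c(18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
              26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
    h <- hist(ages, breaks = c(10, 20, 30, 60), right = FALSE, freq = FALSE)
    h$density                          # 0.02, 0.04, 0.0133...
    sum(h$density * diff(h$breaks))    # total area = 1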


Ismor Fischer, 8/5/2012 2.3-1

2.3 Summary Statistics – Measures of Center and Spread

[Schematic: a POPULATION with a numerical random variable X, whose distribution (X discrete or X continuous) has a true “center” = ? and a true “spread” = ?. These are parameters (“population characteristics”): unknown, fixed numerical values, usually denoted by Greek letters, e.g., θ (“theta”). From the population we draw a SAMPLE of size n and compute measures of center (median, mode, mean) and measures of spread (range, variance, standard deviation). These are statistics (“sample characteristics”): known (or computable) numerical values obtained from sample data, which serve as estimators of the parameters (e.g., of θ) and are usually denoted by corresponding Roman letters. Statistical Inference runs from the sample back to the population.]


Ismor Fischer, 8/5/2012 2.3-2

Measures of Center

For a given numerical random variable X, assume that a random sample {x1, x2, …, xn} has been selected, and sorted from lowest to highest values, i.e., x1 ≤ x2 ≤ … ≤ xn−1 ≤ xn.

• sample median = the numerical “middle” value, in the sense that half the data values are smaller, half are larger.

If n is odd, take the value in position # (n + 1)/2.

If n is even, take the average of the two closest neighboring data values, left (position # n/2) and right (position # n/2 + 1).

Comments:

The sample median is robust (insensitive) with respect to the presence of outliers.

More generally, can also define quartiles (Q1 = 25% cutoff, Q2 = 50% cutoff = median, Q3 = 75% cutoff), or percentiles (a.k.a. quantiles), which divide the data values into any given p% vs. (100 − p)% split. Example: SAT scores

• sample mode = the data value with the largest frequency (fmax)

Comment: The sample mode is robust to outliers.

If present, repeated sample data values can be neatly consolidated in a frequency table, vis-à-vis the corresponding dotplot. (If a value xi is not repeated, then its fi = 1.)

xi (the k distinct data values of X)   fi (absolute frequency of xi)   f(xi) = fi / n (relative frequency of xi)
x1        f1        f(x1)
x2        f2        f(x2)
⋮
xk        fk        f(xk)
(totals)  n         1

[Figures: a dotplot split 50% – 50% on either side of the median along the X axis, and a frequency plot of the distinct values x1, …, xk with heights f1, f2, …, fk; the value attaining the largest height fmax is the mode, and the mean is the balance point.]


Ismor Fischer, 8/5/2012 2.3-3

Example: n = 12 random sample values of X = “Body Temperature (°F)”:

{98.5, 98.6, 98.6, 98.6, 98.6, 98.6, 98.9, 98.9, 98.9, 99.1, 99.1, 99.2}

xi     fi     f(xi)
98.5   1      1/12
98.6   5      5/12
98.9   3      3/12
99.1   2      2/12
99.2   1      1/12
       n = 12   1

sample median = (98.6 + 98.9) / 2 = 98.75°F (six data values on either side)

sample mode = 98.6°F

sample mean = (1/12)[(98.5)(1) + (98.6)(5) + (98.9)(3) + (99.1)(2) + (99.2)(1)]

or, equivalently, = (98.5)(1/12) + (98.6)(5/12) + (98.9)(3/12) + (99.1)(2/12) + (99.2)(1/12) = 98.8°F

• sample mean = the “weighted average” of all the data values

Comments:

The sample mean is the center of mass, or “balance point,” of the data values.

The sample mean is sensitive to outliers. One common remedy for this…

Trimmed mean: Compute the sample mean after deleting a predetermined number or percentage of outliers from each end of the data set, e.g., “10% trimmed mean.” Robust to outliers by construction.

x̄ = (1/n) ∑_{i=1}^{k} xi fi , where fi is the absolute frequency of xi

   = ∑_{i=1}^{k} xi f(xi) , where f(xi) = fi / n is the relative frequency of xi

[Figures: the “10% trimmed mean” deletes 10% of the values from each end of the sorted data; dotplot of the body-temperature data over 98.5–99.2 with frequencies 1, 5, 3, 2, 1.]
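A quick R check of these three measures of center for the body-temperature frequency table (illustrative code and variable names, not from the notes):

    x <- c(98.5, 98.6, 98.9, 99.1, 99.2)     # distinct values x_i
    f <- c(1, 5, 3, 2, 1)                    # absolute frequencies f_i
    n <- sum(f)                              # 12
    sum(x * f) / n                           # weighted sample mean = 98.8
    sum(x * (f / n))                         # same, via relative frequencies f(x_i)
    median(rep(x, f))                        # sample median = 98.75
    x[which.max(f)]                          # sample mode = 98.6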


Ismor Fischer, 8/5/2012 2.3-4


Grouped Data – Suppose the original values had been “lumped” into categories.

Example: Recall the grouped “Memorial Union age” data set…

xi (midpoint)   Class Interval   Frequency fi   Relative Frequency fi/n   Density (Rel Freq ÷ Class Width)

15 [10, 20) 4 0.20 0.02

25 [20, 30) 8 0.40 0.04

45 [30, 60) 8 0.40 0.013

n = 20 1.00

• group mean: Same formula as above, with xi = midpoint of ith class interval.

x̄group = (1/20)[(15)(4) + (25)(8) + (45)(8)] = 31.0 years

Exercise: Compare this value with the ungrouped sample mean x = 29.2 years.

• group median (& other quantiles):

[Density Histogram: the 0.40 area of the class rectangle over [20, 30) is split by the median Q into 0.30 (left) and 0.10 (right).]

By definition, the median Q divides the data set into equal halves, i.e., 0.50 above and below. In this example, it must therefore lie in the class interval [20, 30), and divide the 0.40 area of the corresponding class rectangle as shown. Since the 0.10 “strip” is ¼ of that area, it proportionally follows that Q must lie at ¼ of the class width 30 – 20 = 10, or 2.5, from the right endpoint of 30. That is, Q = 30 – 2.5, or Q = 27.5 years. (Check that the ungrouped median = 25 years.)


Ismor Fischer, 8/5/2012 2.3-5

Formal approach ~

First, identify which class interval [a, b) contains the desired quantile Q (e.g., median, quartile, etc.), and determine the respective left and right areas A and B into which it divides the corresponding class rectangle, so that Density = (A + B)/(b − a). Equating proportions, we obtain

Density = A/(Q − a) = B/(b − Q),

from which it follows that

Q = a + A/Density   or   Q = b − B/Density   or   Q = (Ab + Ba)/(A + B).

For example, in the grouped “Memorial Union age” data, we have a = 20, b = 30, and A = 0.30, B = 0.10. Substituting these values into any of the equivalent formulas above yields the median Q2 = 27.5. Exercise: Now that Q2 is found, use the formula again to find the first and third quartiles Q1 and Q3, respectively. Note also from above, we obtain the useful formulas

A = (Q − a) × Density   and   B = (b − Q) × Density

for calculating the areas A and B, when a value of Q is given! This can be used when finding the area between two quantiles Q1 and Q2. (See next page for another way.)

[Figure: class rectangle over [a, b), with the quantile Q splitting its area into A (left) and B (right); vertical axis = Density.]
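For those using R, the relation Q = a + A/Density can be wrapped in a small helper function. The sketch below is illustrative only (the function name and arguments are not from the notes); it assumes the grouped-data setup above, with class boundaries and frequencies supplied by the user.

    group.quantile <- function(p, breaks, freq) {
      rel.freq <- freq / sum(freq)
      cum.rf   <- c(0, cumsum(rel.freq))
      i <- findInterval(p, cum.rf, rightmost.closed = TRUE)   # class [a, b) containing Q
      a <- breaks[i]; b <- breaks[i + 1]
      density <- rel.freq[i] / (b - a)
      A <- p - cum.rf[i]              # area to the left of Q within this class
      a + A / density                 # Q = a + A / Density
    }
    group.quantile(0.50, breaks = c(10, 20, 30, 60), freq = c(4, 8, 8))   # median Q2 = 27.5
    # the quartiles Q1 and Q3 of the Exercise can be checked with p = 0.25 and p = 0.75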


Ismor Fischer, 8/5/2012 2.3-6

Alternative approach ~ First, form this column:

Class Interval       Frequency fi   Relative Frequency fi/n   Cumulative Relative Frequency Fi = f1/n + f2/n + … + fi/n
I0                   0              0                         0
I1                   f1             f1/n                      F1
I2                   f2             f2/n                      F2
⋮
Ii                   fi             fi/n                      Flow < 0.5
[a, b)  (Q = ? here) fi+1           fi+1/n                    Fhigh > 0.5
⋮
Ik                   fk             fk/n                      1
                     n              1

Next, identify Flow and Fhigh, which bracket 0.5, and let [a, b) be the class interval of the latter. Then

Q = a + [(0.5 − Flow)/(Fhigh − Flow)](b − a)   or   Q = b − [(Fhigh − 0.5)/(Fhigh − Flow)](b − a).

Again, in the grouped “Memorial Union age” data, we have a = 20, b = 30, Flow = 0.2, and Fhigh = 0.6 (why?). Substituting these values into either formula yields the median Q2 = 27.5. To find Q1, replace the 0.5 in the formula by 0.25; to find Q3, replace the 0.5 in the formula by 0.75, etc. Conversely, if a quantile Q in an interval [a, b) is given, then we can solve for the cumulative relative frequency F(Q) up to that quantile value:

F(Q) = F(a) + [(F(b) − F(a))/(b − a)](Q − a).

It follows that the relative frequency (i.e., area) between two quantiles Q1 and Q2 is equal to the difference between their cumulative relative frequencies: F(Q2) − F(Q1).


Ismor Fischer, 8/5/2012 2.3-7

Shapes of Distributions

Symmetric distributions correspond to values that are spread equally about a “center.”

In this case, mean = median. Examples (drawn for “smoothed histograms” of a random variable X):

Note: An important special case of the “bell-shaped” curve is the normal distribution,

a.k.a. Gaussian distribution. Example: X = IQ score.

Otherwise, if more outliers of X occur on one side of the median than the other, the corresponding distribution will be skewed in that direction, forming a tail. Examples: X = “calcium level (mg)” (skewed left), X = “serum cholesterol level (mg/dL)” (skewed right). Furthermore, distributions can also be classified according to the number of “peaks”:

[Figures: symmetric shapes – uniform, triangular, bell-shaped. Skewed to the left (negatively skewed): mean < median; skewed to the right (positively skewed): median < mean (in each case, 0.5 of the area lies on either side of the median). By number of peaks: unimodal, bimodal, multimodal.]


Ismor Fischer, 8/5/2012 2.3-8

Measures of Spread

Again assume that a numerical random sample {x1, x2, …, xn} has been selected, and sorted from lowest to highest values, i.e., x1 ≤ x2 ≤ … ≤ xn−1 ≤ xn.

• sample range = xn − x1 (highest value − lowest value)

Comments: Uses only the two most extreme values. Very crude estimator of spread.

The sample range is extremely sensitive to outliers. One common remedy …

Interquartile range (IQR) = Q3 – Q1. Robust to outliers by construction.

If the original data are grouped into k class intervals [a1, a2), [a2, a3),…, [ak, ak+1),

then the group range = ak+1 − a1. A similar calculation holds for group IQR.

Example: The “Body Temperature” data set has a sample range = 99.2 − 98.5 = 0.7°F.

{98.5, 98.6, 98.6, 98.6, 98.6, 98.6, 98.9, 98.9, 98.9, 99.1, 99.1, 99.2}

xi     fi
98.5   1
98.6   5
98.9   3
99.1   2
99.2   1
n = 12

[Figure: the quartiles Q1, Q2, Q3 divide the distribution of X into four groups of 25% each.]


Ismor Fischer, 8/5/2012 2.3-9

For a much less crude measure of spread that uses all the data, first consider the following…

Definition: xi − x̄ = individual deviation of the ith sample data value from the sample mean

xi     xi − x̄     fi
98.5   −0.3       1
98.6   −0.2       5
98.9   +0.1       3
99.1   +0.3       2
99.2   +0.4       1
(x̄ = 98.8)        n = 12

Naively, an estimate of the spread of the data values might be calculated as the average of these n = 12 individual deviations from the mean. However, this will always yield zero!

FACT:  ∑_{i=1}^{k} (xi − x̄) fi = 0, i.e., the sum of the deviations is always zero.

Check: In this example, the sum = (−0.3)(1) + (−0.2)(5) + (0.1)(3) + (0.3)(2) + (0.4)(1) = 0. Exercise: Prove this general fact algebraically. Interpretation: The sample mean is the center of mass, or “balance point,” of the data values.

[Figure: dotplot of the body-temperature data over 98.5–99.2 with frequencies 1, 5, 3, 2, 1, balanced at the mean x̄ = 98.8.]


Ismor Fischer, 8/5/2012 2.3-10

Best remedy: To make them non-negative, square the deviations before summing.

• sample variance   s² = [1/(n − 1)] ∑_{i=1}^{k} (xi − x̄)² fi   (Note: s² is not on the same scale as the data values!)

• sample standard deviation   s = +√(s²)   (s is on the same scale as the data values.)

Example:

xi     xi − x̄     (xi − x̄)²     fi
98.5   −0.3       +0.09         1
98.6   −0.2       +0.04         5
98.9   +0.1       +0.01         3
99.1   +0.3       +0.09         2
99.2   +0.4       +0.16         1
                                n = 12

Then… s² = (1/11)[(0.09)(1) + (0.04)(5) + (0.01)(3) + (0.09)(2) + (0.16)(1)] = 0.06 (°F)², so that… s = √0.06 = 0.245°F. Body Temp has a small amount of variance.

Comments:

s² = [∑ (xi − x̄)² fi] / (n − 1) has the important, frequently-recurring form SS/df, where SS = “Sum of Squares” (sometimes also denoted Sxx) and df = “degrees of freedom” = n − 1, since the n individual deviations have a single constraint. (Namely, their sum must equal zero.)

Same formulas are used for grouped data, with x̄group, and xi = class interval midpoint.

Exercise: Compute s for the grouped and ungrouped Memorial Union age data.

A related measure of spread is the absolute deviation, defined as (1/n) ∑ |xi − x̄| fi , but its statistical properties are not as well-behaved as the standard deviation. Also, see Appendix > Geometric Viewpoint > Mean and Variance, for a way to understand the “sum of squares” formula via the Pythagorean Theorem (!), as well as a useful alternate computational formula for the sample variance.
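A short R check of this variance calculation (a sketch, not part of the notes):

    x <- c(98.5, 98.6, 98.9, 99.1, 99.2); f <- c(1, 5, 3, 2, 1); n <- sum(f)
    xbar <- sum(x * f) / n              # 98.8
    SS   <- sum((x - xbar)^2 * f)       # "Sum of Squares" = 0.66
    SS / (n - 1)                        # sample variance = 0.06
    sqrt(SS / (n - 1))                  # sample standard deviation = 0.245
    var(rep(x, f)); sd(rep(x, f))       # same results, computed from the 12 raw values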


Ismor Fischer, 8/5/2012 2.3-11

Typical “Grouped Data” Exam Problem

Given the sample frequency table of age intervals shown below, answer the following.

1. Sketch the density histogram. (See Lecture Notes, page 2.2-6)

2. Sketch the graph of the cumulative distribution. (page 2.2-4)

3. What proportion of the sample is under 36 yrs old? (pages 2.3-5 bottom, 2.3-6 bottom)

4. What proportion of the sample is under 45 yrs old? (same)

5. What proportion of the sample is between 36 and 45 yrs old? (same)

6. Calculate the values of the following grouped summary statistics.

Quartiles Q1, Q2, Q3 and IQR (pages 2.3-4 to 2.3-6)

Mean (page 2.3-4)

Variance (page 2.3-10, second comment on bottom)

Standard deviation (same)

Solutions at http://www.stat.wisc.edu/~ifischer/Grouped_Data_Sols.pdf

Age Intervals   Frequencies
[0, 18)         –
[18, 24)        208
[24, 30)        156
[30, 40)        104
[40, 60)        52
                520


Ismor Fischer, 5/29/2012 2.4-1

2.4 Summary (Compare with first page of §2.3.)

[Schematic: a POPULATION with a numerical random variable X, whose distribution (X discrete or X continuous) has Parameters: Mean µ (“mu”), Variance σ², Standard Deviation σ (“sigma”). From a SAMPLE of size n, a density histogram of the data (relative frequency of each xi) is constructed, and the parameters are estimated via the following statistics: Mean x̄, Variance s², Standard Deviation s. Statistical Inference runs from the sample back to the population.]

Comments:

The population mean µ and variance σ 2 are defined in terms of expected value:

µ = E[X] = ∑_{all x} x f(x),    σ² = E[(X − µ)²] = ∑_{all x} (x − µ)² f(x)    if X is discrete

(with corresponding “integration formulas” if X is continuous), where f(x) is the probability of value x occurring in the population, i.e., P(X = x). Later…

If n is used instead of n − 1 in the denominator of s2, the expected value is always less than σ 2. Consistent under- (or over-) estimation of a parameter by a statistic is called bias. The formulas given for the sample mean and variance are unbiased estimators.
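This bias is easy to see in a small simulation. The R sketch below is an illustration only (the population and sample size are arbitrary choices, not from the notes); it compares the n-denominator and (n − 1)-denominator versions of the sample variance when the true variance is σ² = 4:

    set.seed(1)
    n <- 5
    v.n  <- replicate(10000, { x <- rnorm(n, mean = 0, sd = 2); sum((x - mean(x))^2) / n })
    v.n1 <- replicate(10000, var(rnorm(n, mean = 0, sd = 2)))   # var() divides by n - 1
    c(mean(v.n), mean(v.n1))   # roughly 3.2 and 4.0 - only the second is centered on sigma^2 = 4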



Ismor Fischer, 5/29/2012 2.4-2

Chebyshev’s Inequality

Whatever the shape of the distribution, at least 75% of the values lie within ±2 standard deviations of the mean, at least 89% lie within ±3 standard deviations, etc.

More generally, at least (1 − 1/k²) × 100% of the values lie within ±k standard deviations of the mean. (Note that k > 1, but it need not be an integer!)

Exercise: Suppose that a population of individuals has a mean age of µ = 40 years, and standard deviation of σ = 10 years. At least how much of the population is between 20 and 60 years old? Between 15 and 65 years old? What symmetric age interval about the mean is guaranteed to contain at least half the population? Note: If the distribution is bell-shaped, then approximately 68% lie within ±1σ, approximately 95% lie within ±2σ, approximately 99.7% lie within ±3σ. For other multiples of σ, percentages can be obtained via software or tables. Much sharper than Chebyshev’s general result, which can be overly conservative, this can be used to check if a distribution is reasonably bell-shaped for use in subsequent testing procedures. (Later...)

[Figure: a distribution with µ and µ ± 1σ, µ ± 2σ, µ ± 3σ marked; at least 75% of the values lie within ±2σ, and at least 89% within ±3σ.]

Pafnuty Chebyshev (1821-1894)
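A few lines of R (illustrative, not from the notes) evaluate the Chebyshev bound for any multiple k of the standard deviation; they reproduce the 75% and 89% figures above, and the Exercise can be checked the same way with k = (interval half-width) / σ:

    cheb <- function(k) 1 - 1 / k^2   # at least this fraction lies within k standard deviations
    cheb(2)                           # 0.75     (at least 75% within 2 standard deviations)
    cheb(3)                           # 0.888... (at least about 89% within 3 standard deviations)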


Ismor Fischer, 2/8/2014 Solutions / 2.5-1

2.5 Solutions

1. Implementing various R commands reproduces exact (or similar) results to those in the notes, plus more. Remember that you can always type help(command) for detailed online information. In particular, help(par) yields much information on options used in R plotting commands. For a general HTML browser interface, type help.start().

Full solution in progress…

2. Data Types

(a) Amount of zinc: It is given that the coins are composed of a 5% alloy of tin and zinc. Therefore, the exact proportion of zinc must correspond to some value in the interval (0, 0.05); hence this variable is numerical, and in particular, continuous.

Image on reverse: There are only two possibilities, either wheat stalks or the Lincoln Memorial, thus categorical: nominal: binary. Year minted: Any of {1946, 1947, 1948, …, 1961, 1962}. As these numbers purely represent an ordered sequence of labels, this variable would typically be classified as categorical: ordinal. (Although, an argument can technically be made that each of these numbers represents the quantity of years passed since “year 0,” and hence is numerical: discrete. However, this interpretation of years as measurements is not the way they are normally used in most practical applications, including this one.) City minted: Denver, San Francisco, or Philadelphia. Hence, this variable is categorical: nominal: not binary, as there are more than two unordered categories. Condition: Clearly, these classifications are not quantities, but a list of labels with a definite order, so categorical: ordinal.

(b) Out of 1000 coins dropped, the number of heads face-up can be any integer from 0 to 1000 – i.e., {0, 1, 2, …, 999, 1000} – hence numerical: discrete. It follows that the proportion of heads face-up can be any fraction from {0/1000, 1/1000, 2/1000, …, 999/1000, 1000/1000} – i.e., {0, .001, .002, …, .999, 1} in decimal format – hence is also numerical: discrete. (However, for certain practical applications, this may be approximately modeled by the continuous interval [0, 1], for convenience.)

Ismor Fischer, 2/8/2014 Solutions / 2.5-2

3. Dotplots

All of these samples have the same mean value = 4. However, the sample variances (28/6, 10/6, 2/6, and 0, respectively), and hence the standard deviations (2.160, 1.291, 0.577, and 0, respectively) become progressively smaller, until there is literally no variation at all in the last sample of equal data values. This behavior is consistent with the dotplots, whose shapes exhibit progressively smaller “spread” – and hence progressively higher “peak” concentrations about the mean – as the standard deviation decreases.


Ismor Fischer, 2/8/2014 Solutions / 2.5-3 4. The following properties can be formally proved using algebra. (Exercise)

(a) If the same constant b is added to every value of a data set {x1, x2, x3, ..., xn}, then the entire distribution is shifted by exactly that amount b, i.e., {x1 + b, x2 + b, ..., xn + b}. Therefore, the mean also changes by b (i.e., from x to x + b), but the amount of “spread” does not change. That is, the variance and standard deviation are unchanged (as a simple calculation will verify). In general, for any dataset “x” and constant b, it follows that

mean(x + b) = mean(x) + b and var(x + b) = var(x), so that sd(x + b) = sd(x), i.e.,

x̄_{x+b} = x̄ + b and s²_{x+b} = s²_x, so that s_{x+b} = s_x.

(b) If every data value of {x1, x2, x3, ..., xn} is multiplied by a nonzero constant a, then the

distribution becomes {ax1, ax2, ax3, ..., axn}. Therefore, the mean is multiplied by this amount a as well (i.e., mean = a x̄), but the variance (which is on the scale of the square of the data) is multiplied by a², which is positive, no matter what the sign of a. Its square root, the standard deviation, is therefore multiplied by √(a²) = |a|, the absolute value of a, always positive. In general, if a is any constant, then

mean(a x) = a mean(x) and var(a x) = a2 var(x), so that sd(a x) = |a| sd(x), i.e.,

x̄_{ax} = a x̄ and s²_{ax} = a² s²_x, so that s_{ax} = |a| s_x.

In particular, if a = −1, then the mean changes sign, but the variance, and hence the standard deviation, remain the same positive values that they were before. That is,

x̄_{−x} = −x̄ and s²_{−x} = s²_x, so that s_{−x} = s_x.

5.

          Sun   Mon   Tues   Wed   Thurs   Fri   Sat
Week 1    +8    +8    +8     +5    +3      +3     0
Week 2     0    –3    –3     –5    –8      –8    –8

(a) For Week 1, the mean temperature is x̄ = [3(8) + 1(5) + 2(3) + 1(0)] / 7 = 5, and the variance is s² = [3(8 − 5)² + 1(5 − 5)² + 2(3 − 5)² + 1(0 − 5)²] / (7 − 1) = 60/6 = 10. (s = √10)

(b) Note that the Week 2 temperatures are the negative values of the Week 1 temperatures. Therefore, via the result in 4(b), the Week 2 mean temperature is −5, while the variance is exactly the same, 10. (s = √10)

Check: x̄ = [3(−8) + 1(−5) + 2(−3) + 1(0)] / 7 = −5

s² = [3(−8 + 5)² + 1(−5 + 5)² + 2(−3 + 5)² + 1(0 + 5)²] / (7 − 1) = 60/6 = 10


Ismor Fischer, 2/8/2014 Solutions / 2.5-4 6. (a) self-explanatory

(b) sum(x.vals)/5 and mean(x.vals) will both yield identical results, xbar.

(c) sum((x.vals – xbar)^2)/4 and var(x.vals) will both yield identical results, s.sqrd.

(d) sqrt(s.sqrd) and sd(x.vals) will both yield identical results.

7. The numerators of the z-values are simply the deviations of the original x-values from their mean x̄; hence their sum = 0 (even after dividing each of them by the same standard deviation sx of the x-values), so it follows that z̄ = 0. Moreover, since the denominators of the z-values are all the same constant sx, it follows that the new standard deviation is equal to sx divided by sx, i.e., sz = 1. In other words, subtracting the sample mean x̄ from each xi results in deviations xi − x̄ that are “centered” around a new mean of 0. Dividing them by their own standard deviation sx results in “standardized” deviations zi that have a new standard deviation of sx / sx = 1. (This is informal. See Problem 2.5/4 above for the formal mathematical details.)

8. (a) If the two classes are pooled together to form n1 + n2 = 50 students, then the first class of n1 = 20 students contributes a relative frequency of 2/5 toward the combined score, while the second class of n2 = 30 students contributes a relative frequency of 3/5 toward the combined score. Hence, the “weighted average” is equal to (2/5)(90) + (3/5)(80) = 84. More generally, the formula for the grand mean of two groups having means x̄ and ȳ respectively, is (n1 x̄ + n2 ȳ) / (n1 + n2).

(b) If two classes have the same mean score, then so will the combined classes, regardless of their respective sizes. (You may have to think about why this is true, if it is not apparent to you. For example, what happens to the grand mean formula above if x̄ = ȳ?) However, calculating the combined standard deviation is a bit more subtle. Recall that the sample variance is given by s² = SS/df, where the “sum of squares” SS = ∑ (xi − x̄)² fi , and “degrees of freedom” df = n − 1. For the first class, we are told that s1 = 7 and n1 = 24, so that 49 = SS1/23, or SS1 = 1127. Similarly, for the second class, we are told that s2 = 10 and n2 = 44, so that 100 = SS2/43, or SS2 = 4300. Combining the two, we would have a large sample of size n1 + n2 = 68, whose values, say zi, consist of both xi and yi scores, with a mean value equal to the same value of x̄ and ȳ (via the comments above). Denoting this common value by c, we obtain

SSboth = ∑ (zi − c)² fi = ∑ (xi − c)² fi + ∑ (yi − c)² fi = SS1 + SS2 = 1127 + 4300,

i.e., SSboth = 5427, and dfboth = 67. Thus, s²both = SSboth/dfboth = 5427/67 = 81, so that sboth = 9.

Ismor Fischer, 2/8/2014 Solutions / 2.5-5


9. Note: Because the data are given in grouped form, all numerical calculations and

resulting answers are necessarily approximations of the true ungrouped values.

Midpoint   Age Group              Frequency   Relative Frequency   Density
5          [0, 10); width = 10       9         9/90 = 0.1           0.1/10 = 0.010
17.5       [10, 25); width = 15     27        27/90 = 0.3           0.3/15 = 0.020
45         [25, 65]; width = 40     54        54/90 = 0.6           0.6/40 = 0.015
                                    90        90/90 = 1.0

(a) group mean = (1/90)[(5)(9) + (17.5)(27) + (45)(54)] = 32.75 years, or 32 years, 9 months

group variance = (1/89)[(5 – 32.75)²(9) + (17.5 – 32.75)²(27) + (45 – 32.75)²(54)] = 239.4733,

∴ group standard deviation = √239.4733 = 15.475 years, or 15 years, 5.7 months

(b) Relative Frequency Histogram [bars over the three age groups with heights 0.1, 0.3, 0.6]   (c) Density Histogram [bars with heights 0.010, 0.020, 0.015]

Ismor Fischer, 2/8/2014 Solutions / 2.5-6


(d) The age interval [15, 25) contains 2/3 of the 30%, or 20%, of the sample values found

in the interval [10, 25). Likewise, the remaining interval [25, 35) contains 1/4 of the 60%, or 15%, found in the interval [25, 65). Therefore, the interval [15, 35) contains 20% + 15%, or 35%, of the sample.

(e) Quartiles are computed similarly. The median Q2 divides the total area into equal

halves, and so must be one-sixth of the way inside the last interval [25, 65), i.e., Q2 = 31 years, 8 months. After that, the remaining areas are halved, so Q1 coincides with the midpoint of [10, 25), i.e., Q1 = 17 years, 6 months, and Q3 with the midpoint of [Q2, 65), i.e., Q3 = 48 years, 4 months.

(f) Range = 65 – 0 = 65 years, IQR = Q3 – Q1 = 48 yrs, 4 mos – 17 yrs, 6 mos =

30 years, 10 months


Ismor Fischer, 2/8/2014 Solutions / 2.5-7

10.

(a) The number of possible combinations is equal to the number of possible rearrangements of x objects (the ones) among n objects. This is the well-known combinatorial symbol “n-choose-x”:

(n choose x) = n! / [x! (n − x)!].

(See the Basic Reviews section of the Appendix.)

(b) Relative frequency table: Each yi = 1 or 0; the former occurs with a frequency of x times, the latter with a frequency of (n – x) times. Therefore, yi = 1 corresponds to a relative frequency of x/n = p, so that yi = 0 corresponds to a relative frequency of 1 – p.

yi   frequency fi   relative frequency f(yi)
1    x              p
0    n – x          1 – p
     n              1

(c) Clearly, the sum of all the yi values is equal to x (the number of ones), so the mean is ȳ = x/n = p. Or, from the table, ȳ = (1)(p) + (0)(1 – p) = p.

(d) We have s²_y = [1/(n − 1)][(1 − p)²(x) + (0 − p)²(n − x)] = [1/(n − 1)][(1 − p)²(np) + (0 − p)²(n − np)] (recall that x = np), = [n/(n − 1)] p(1 − p).

Ismor Fischer, 2/8/2014 Solutions / 2.5-8

11.

(a) Given {10, 10, 10, …, 10, 60, 60, 60, …, 60}, where half the values are 10 and half the values are 60, it clearly follows that...

sample mean x̄ = (10)(0.5) + (60)(0.5) = 35 and

sample median = (10 + 60)/2 = 35 as well.

(This is an example of a symmetric distribution.)

(b) Given only grouped data however, we have...

sample mean x̄group = (10)(0.5) + (60)(0.5) = 35 as above, and sample group median = 20, since it is that value which divides the grouped data into equal halves, clearly very different from the true median found in (a).

Because the density histogram is so constructed that its total area = 1, it can be interpreted as a physical system of aligned rectangular “weights,” whose total mass = 1. The fact that the deviations from the mean sum to 0 can be interpreted as saying that from the mean, all of the negative (i.e., to the left) and positive (i.e., to the right) horizontal forces cancel out exactly, and the system is at perfect equilibrium there. That is, the mean is the “balance point” or “center of mass” of the system. (This is the reason it is called a density histogram, for by definition, density of physical matter = amount of mass per unit volume, area, or in this case, width.) This property is not true of the other histogram, whose rectangular heights – not areas – measure the relative frequencies, and therefore sum to 1; hence there is no analogous physical interpretation for the mean.

Completed tables:

xi   f(xi)
10   0.5
60   0.5

Class Interval                      Relative Frequency
[0, 20); midpt = 10, width = 20     0.5
[20, 100); midpt = 60, width = 80   0.5
                                    1

[Figures: relative frequency histogram and density histogram of the grouped data, each with the group mean (35) and group median (20) marked.]

Ismor Fischer, 2/8/2014 Solutions / 2.5-9

12. The easiest (and most efficient) way to solve this is to first choose the notation judiciously. Recall that we define di = xi − x̄ to be the ith deviation di of a value xi from the mean x̄, and that they must sum to zero. As the mean is given as x̄ = 80, and three of the four quiz scores are equal, we may therefore represent them as {x1, x2, x3, x4}, where...

x1 = 80 + d,  x2 = 80 + d,  x3 = 80 + d,  and  x4 = 80 − 3d.

Hence, the variance would be given by s² = [d² + d² + d² + (3d)²] / (4 − 1) = 12d²/3 = 4d², so the standard deviation is s = 2|d|. Because s = 10 (given), it follows that d = ±5, whereby the quiz scores can be either {85, 85, 85, 65} or {75, 75, 75, 95}. Both sets satisfy the conditions that x̄ = 80 and s = 10. [Note: Other notation would still yield the same answers (if solved correctly, of course), but the subsequent calculations might be much messier.]

13. Straightforward algebra.


Ismor Fischer, 9/21/2014 2.5-1

2.5 Problems

1. Follow the instructions in the posted R code folder (http://www.stat.wisc.edu/~ifischer/Intro_Stat/Lecture_Notes/Rcode/) for this problem, to reproduce the results that appear in the lecture notes for the “Memorial Union age” data.

2. A numismatist (coin collector) has a large collection of pennies minted between the years

1946-1962, when they were made of bronze: 95% copper, and 5% tin and zinc. (Today, pennies have a 97.5% zinc core; the remaining 2.5% is a very thin layer of copper plating.) The year the coin was minted appears on the obverse side (i.e., “heads”), sometimes with a letter below it, indicating the city where it was minted: D (Denver), S (San Francisco), or none (Philadelphia). Before 1959, a pair of wheat stalks was depicted on the reverse side (i.e., “tails”); starting from that year, this image was changed to the Lincoln Memorial. The overall condition of the coin follows a standard grading scale – Poor (PO or PR), Fair (FA or FR), About Good (AG), Good (G), Very Good (VG), Fine (F), Very Fine (VF), Extremely Fine (EF or XF), Almost Uncirculated (AU), and Uncirculated or Mint State (MS) – which determines the coin’s value.

(a) Using this information, classify each of the following variables as either numerical

(specify continuous or discrete) or categorical (specify nominal: binary, nominal: not binary, or ordinal).

Amount of zinc Image on reverse Year minted City minted Condition

(b) Suppose the collector accidentally drops 1000 pennies. Repeat the instructions in (a) for the variables

Number of heads face-up Proportion of heads face-up 3. Sketch a dotplot (by hand) of the distribution of values for each of the data sets below, and

calculate the mean, variance, and standard deviation of each.

U: 1, 2, 3, 4, 5, 6, 7

X: 2, 3, 4, 4, 4, 5, 6

Y: 3, 4, 4, 4, 4, 4, 5

Z: 4, 4, 4, 4, 4, 4, 4 What happens to the mean, variance, and standard deviation, as we progress from one data set to the next? What general observations can you make about the relationship between the standard deviation, and the overall shape of the corresponding distribution? In simple terms, why should this be so?

4. Useful Properties of Mean, Variance, Standard Deviation

(a) Suppose that a constant b is added to every value of a data set {x1, x2, x3, ..., xn}, to produce a new data set {x1 + b, x2 + b, x3 + b, ..., xn + b}. Exactly how are the mean, variance, and standard deviation affected, and why? (Hint: Think of the dotplot.)

(b) Suppose that every value in a data set {x1, x2, x3, ..., xn} is multiplied by a nonzero constant a to produce a new data set {ax1, ax2, ax3, ..., axn}. Exactly how are the mean, variance, and standard deviation affected, and why? Don’t forget that a (and for that matter, b above) can be negative! (Hint: Think of the dotplot.)

Ismor Fischer, 9/21/2014 2.5-2

5. During a certain winter in Madison, the variable X = “Temperature at noon (°F)” is measured every day over two consecutive weeks, as shown below.

          Sun   Mon   Tues   Wed   Thurs   Fri   Sat
Week 1    +8    +8    +8     +5    +3      +3     0
Week 2     0    –3    –3     –5    –8      –8    –8

(a) Calculate the sample mean temperature x̄ and sample variance s² for Week 1.

(b) Without performing any further calculations, determine the mean temperature x̄ and sample variance s² for Week 2. [Hint: Compare the Week 2 temperatures with those of Week 1, and use the result found in 4(b).] Confirm by explicitly calculating.

6. A little practice using R: First, type the command pop = 1:100 to generate a simulated “population” of integers from 1 to 100, and view them (read the intro to R to see how).

(a) Next, type the command x.vals = sample(pop, 5, replace = T) to generate a random sample of n = 5 values from this population, and view them. Calculate, without R, their sample mean x̄, variance s², and standard deviation s. Show all work!

(b) Use R to calculate the sample mean in two ways: first, via the sum command, then via the mean command. Do the two answers agree with each other? Do they agree with (a)? If so, label this value xbar. Include a copy of the R output in your work.

(c) Use R to calculate the sample variance in two ways: first, via the sum command, then via the var command. Do the two answers agree with each other? Do they agree with (a)? If so, label this value s.sqrd. Include a copy of the R output in your work.

(d) Use R to calculate the sample standard deviation in two ways: first, via the sqrt command, then via the sd command. Do the two answers agree with each other? Do they agree with (a)? Include a copy of the R output in your work.

7. (You may want to refer to the Rcode folder for this problem.) First pick n = 5 numbers at random

{x1, x2, x3, x4, x5}, and calculate their sample mean x̄ and standard deviation sx.

(a) Compute the deviations from the mean xi − x̄ for i = 1, 2, 3, 4, 5, and confirm that their sum = 0.

(b) Now divide each of these individual deviations by the standard deviation sx. These new values {z1, z2, z3, z4, z5} are called “standardized” values, i.e., zi = (xi − x̄) / sx, for i = 1, 2, 3, 4, 5. Calculate their mean z̄ and standard deviation sz. Repeat several times. What do you notice?

Calculate their mean z and standard deviation zs . Repeat several times. What do you notice?

(c) Why are these results not surprising? (Hint: See problem 4.)

The idea behind this problem will be important in Chapter 4.


Ismor Fischer, 9/21/2014 2.5-3

8. (a) The average score of a class of n1 = 20 students on an exam is x̄1 = 90.0, while the average score of another class of n2 = 30 students on the same exam is x̄2 = 80.0. If the two classes are pooled together, what is their combined average score on the exam?

(b) Suppose two other classes – one with n1 = 24 students, the other with n2 = 44 students – have the same mean score, but with standard deviations s1 = 7.0 and s2 = 10.0, respectively. If these two classes are pooled together, what is their combined standard deviation on the exam? (Hint: Think about how sample standard deviation is calculated.)

9. (Hint: See page 2.3-11) A random sample of n = 90 people is grouped according to age in

the frequency table below:

Age Group Frequency

[0, 10) 9

[10, 25) 27

[25, 65] 54

(a) Calculate the group mean age and group standard deviation. Express in years and months.

(b) Construct a relative frequency histogram.

(c) Construct a density histogram.

(d) What percentage of the sample falls between 15 and 35 years old?

(e) Calculate the group quartile ages Q1, Q2, Q3. Express in terms of years and months.

(f) Calculate the range and the interquartile range. Express in terms of years and months.

10. For any x = 0, 1, 2, …, n, consider a data set {y1, y2, y3, …, yn} consisting entirely of x ones and (n − x) zeros, in any order. For example, {1, 1, …, 1, 0, 0, …, 0} with x ones and (n − x) zeros. Also denote the sample proportion of ones by p = x/n.

(a) How many such possible data sets can there be?

(b) Construct a relative frequency table for such a data set.

(c) Show that the sample mean ȳ = p.

(d) Show that the sample variance s²_y = [n / (n − 1)] p (1 − p).

Ismor Fischer, 9/21/2014 2.5-4

11.

(a) Consider the sample data {10, 10, 10, …, 10, 60, 60, 60, …, 60}, where half the values are 10 and half the values are 60. Complete the following relative frequency table for this sample, and calculate the sample mean x̄ and sample median.

xi    f(xi)
10
60

(b) Suppose the original dataset is unknown, and only given in grouped form, with each of the two class intervals shown below containing half the values.

Class Interval    Relative Frequency
[0, 20)
[20, 100)

Complete this relative frequency table, and calculate the group sample mean x̄group and group sample median. How do these compare with the values found in (a)?

Sketch the relative frequency histogram.

Sketch the density histogram.

Label the group sample mean and median in each of the two histograms. In which histogram does the mean more accurately represent the “balance point” of the data, and why?

12. By the end of the semester, Merriman forgets the scores he received on the four quizzes (each worth 100 points) he took in a certain course. He only remembers that their average score was 80 points, standard deviation 10 points, and that 3 out of the 4 scores were the same. From this information, compute all four missing quiz scores. [Hint: Recall that the ith deviation of a value xi from the mean x̄ is defined as di = xi − x̄, so that xi = x̄ + di for i = 1, 2, 3, 4. Then use the given information.]

Note: There are two possible solutions to this problem. Find them both.


Ismor Fischer, 9/21/2014 2.5-5

13. Linear Interpolation (A generalization of the method used on page 2.3-6.)

If software is unavailable for computations, this is an old technique to estimate values which are “in-between” tabulated entries. It is based on the idea that over a small interval, a continuous function can be approximated by a linear one, i.e., constant slope. Suppose we are given two successive entries a1 and a2 in the first column of a table, with corresponding successive entries b1 and b2, respectively, in the second column. For a given x value between a1 and a2, we wish to approximate the corresponding y value between b1 and b2, or vice versa. Then, assuming equal proportions, we have

(y − b1) / (x − a1) = (b2 − b1) / (a2 − a1).

Show that this relation implies that y can be written as a weighted average of b1 and b2. In particular,

y = (v1 b2 + v2 b1) / (v1 + v2),

where the weights are given by the differences v1 = x − a1 and v2 = a2 − x. Similarly,

x = (w1 a2 + w2 a1) / (w1 + w2),

where the weights are given by the differences w1 = y − b1 and w2 = b2 − y.

Column A    Column B
a1          b1
x           y
a2          b2

(with v1 = x − a1, v2 = a2 − x; w1 = y − b1, w2 = b2 − y)

3. Theory of Probability

3.1 Basic Ideas, Definitions, and Properties
3.2 Conditional Probability and Independent Events
3.3 Bayes’ Formula
3.4 Applications
3.5 Problems

Ismor Fischer, 5/29/2012 3.1-1


3. Probability Theory

3.1 Basic Ideas, Definitions, and Properties

POPULATION = Unlimited supply of five types of fruit, in equal proportions. O1 = Macintosh apple O2 = Golden Delicious apple O3 = Granny Smith apple

O4 = Cavendish (supermarket) banana O5 = Plantain banana

Experiment 1: Randomly select one fruit from this population, and record its type.

Sample Space: The set S of all possible elementary outcomes of an experiment.

S = {O1, O2, O3, O4, O5} #(S) = 5

Event: Any subset of a sample space S. (“Elementary outcomes” = simple events.)

A = “Select an apple.” = {O1, O2, O3} #(A) = 3

B = “Select a banana.” = {O4, O5} #(B) = 2

Event   P(Event)
A       3/5 = 0.6
B       2/5 = 0.4
        5/5 = 1.0

P(A) = 0.6  “The probability of randomly selecting an apple is 0.6.”
P(B) = 0.4  “The probability of randomly selecting a banana is 0.4.”

[Figure: running relative frequency #(Event) / #(trials) over successive trials of the experiment (e.g., A B B A A A …); as # trials → ∞, the relative frequencies of A and B settle down to 0.6 and 0.4, respectively.]
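The long-run behavior in the figure can be simulated directly. A minimal R sketch (not from the notes), assuming the stated population proportions of 0.6 apples and 0.4 bananas:

    set.seed(1)
    trials   <- sample(c("A", "B"), size = 10000, replace = TRUE, prob = c(0.6, 0.4))
    rel.freq <- cumsum(trials == "A") / seq_along(trials)    # running relative frequency of A
    plot(rel.freq, type = "l", ylim = c(0, 1))               # settles near P(A) = 0.6
    abline(h = 0.6, lty = 2)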


Ismor Fischer, 5/29/2012 3.1-2

General formulation may be facilitated with the use of a Venn diagram:

Event A = {O1, O2, …, Om} ⊆ S #(A) = m ≤ k

Definition: The probability of event A, denoted P(A), is the long-run relative frequency with which A is expected to occur, as the experiment is repeated indefinitely.

Fundamental Properties of Probability

For any event A = {O1, O2, …, Om} in a sample space S,

1. 0 ≤ P(A) ≤ 1

2. P(A) = ∑_{i=1}^{m} P(Oi) = P(O1) + P(O2) + P(O3) + … + P(Om)

Special Cases:

• P(∅) = 0

• P(S) = ∑_{i=1}^{k} P(Oi) = 1   (“certainty”)

3. If all the elementary outcomes of S are equally likely, i.e., P(O1) = P(O2) = … = P(Ok) = 1/k, then… P(A) = #(A)/#(S) = m/k.

Example: P(A) = 3/5 = 0.6, P(B) = 2/5 = 0.4

Sample Space: S = {O1, O2, …, Ok}   #(S) = k

[Venn diagram: the sample space S containing the elementary outcomes O1, O2, …, Ok produced by the experiment, with event A = {O1, O2, …, Om} enclosing the first m of them.]

Ismor Fischer, 5/29/2012 3.1-3

New Events from Old Events

Experiment 2: Select a card at random from a standard deck (and replace).

Sample Space: S = {A♠, …, K♦} #(S) = 52

Events: A = “Select a 2.” = {2♠, 2♣, 2♥, 2♦} #(A) = 4

B = “Select a ♣.” = {A♣, 2♣, …, K♣} #(B) = 13

Probabilities: Since all elementary outcomes are equally likely, it follows that

P(A) = #(A) / #(S) = 4/52   and   P(B) = #(B) / #(S) = 13/52.

(1) Ac = “not A” = {All outcomes that are in S, but not in A.}

Example: Ac = “Select either A, 3, 4, …, or K.”   P(Ac) = 1 − 4/52 = 48/52.

Example: Experiment = Toss a coin once.

Events: A = {Heads} Ac = {Tails}

Probabilities: Fair coin… P(A) = 0.5 ⇒ P(Ac) = 1 − 0.5 = 0.5 Biased coin… P(A) = 0.7 ⇒ P(Ac) = 1 − 0.7 = 0.3

P(Ac) = 1 − P(A)

complement

[Figure: the 52-card deck laid out by suit (♠, ♣, ♥, ♦); event A = the four 2s and event B = the thirteen ♣s are outlined, overlapping at 2♣.]

Ismor Fischer, 5/29/2012 3.1-4 (2) A ∩ B = “A and B” = {All outcomes in S that A and B share in common.}

= {All outcomes that result when events A and B occur simultaneously.}

Example: A ∩ B = “Select a 2 and a ♣” = {2♣} ⇒ P(A ∩ B) = 1/52.

Definition: Two events A and B are said to be disjoint, or mutually exclusive, if they cannot occur simultaneously, i.e., A ∩ B = ∅, hence P(A ∩ B) = 0.

Example: A = “Select a 2” and C = “Select a 3” are disjoint events.

Exercise: Are 4 4 4 4{2 , 3 , 4 , 5 ,...}A= and 6 6 6 6{2 , 3 , 4 , 5 ,...}B = disjoint? If not, find A ∩ B.

(3) A ∪ B = “A or B” (union) = {All outcomes in S that are either in A or B, inclusive.}

Example: A ∪ B = “Select either a 2 or a ♣” has probability

P(A ∪ B) = 4/52 + 13/52 − 1/52 = 16/52.

Example: A ∪ C = “Select either a 2 or a 3” has probability

P(A ∪ C) = 4/52 + 4/52 − 0 = 8/52.

[Venn diagram: S containing overlapping events A and B.]

P(A ∪ B) = P(A) + P(B) − P(A ∩ B), where the last term P(A ∩ B) = 0 if A and B are disjoint.


Ismor Fischer, 5/29/2012 3.1-5

[Venn diagram: three overlapping events A, B, C within S.]

Note: Formula (3) extends to n ≥ 3 disjoint events in a straightforward manner:

(4) P(A1 ∪ A2 ∪ … ∪ An) = P(A1) + P(A2) + … + P(An). Question: How is this formula modified if the n events are not necessarily disjoint?

Example: Take n = 3 events… Then P(A ∪ B ∪ C) =

P(A) + P(B) + P(C)

− P(A ∩ B) − P(A ∩ C) − P(B ∩ C)

+ P(A ∩ B ∩ C). Exercise: For S = {January,…, December}, verify this formula for the three events A = “Has 31 days,” B = “Name ends in r,” and C = “Name begins with a vowel.”

Exercise: A single tooth is to be randomly selected for a certain dental procedure. Draw a Venn diagram to illustrate the relationships between the three following events: A = “upper jaw,” B = “left side,” and C = “molar,” and indicate all corresponding probabilities. Calculate the probability that all of these three events, A and B and C, occur. Calculate the probability that none of these three events occur. Calculate the probability that exactly one of these three events occurs. Calculate the probability that exactly two of these three events occur. (Think carefully.) Assume equal likelihood in all cases.

The three “set operations” – union, intersection, and complement – can be unified via DeMorgan’s Laws:

(A ∪ B)c = Ac ∩ Bc          (A ∩ B)c = Ac ∪ Bc

Exercise: Using a Venn diagram, convince yourself that these statements are true in general. Then verify them for a specific example, e.g., A = “Pick a picture card” and B = “Pick a black card.”

[Figure for the tooth exercise above: diagram of the upper and lower jaws, labeling the incisors, canines, premolars, and molars.]


Slight Detour…

Suppose that out of the last n = 40 races, a certain racing horse won x = 25, and lost the remaining n – x = 15. Based on these statistics, we can calculate the following probability estimates for future races:

P(Win) ≈ x/n = 25/40 = 5/8 = 0.625 = p

P(Lose) ≈ 1 − x/n = 15/40 = 3/8 = 0.375 = 1 − p = q

Odds of winning = P(Win)/P(Lose) = (5/8)/(3/8) = 5/3,  i.e., “5 to 3”

Definition: For any event A, let P(A) = p, thus P(Ac) = q = 1 – p. The odds of event A = p/q = p/(1 − p), i.e., “the probability that A does occur, divided by the probability that it does not occur.” (In the preceding example, A = “Win” with probability p = 5/8.)

Note that if odds = 1, then A and Ac are equally likely to occur. If odds > 1 (likewise, < 1), then the probability that A occurs is greater (likewise, less) than the probability that it does not occur.

Example: Suppose the probability of contracting a certain disease in a particular group of “high risk” individuals is P(D+) = 0.75, so that the probability of being disease-free is P(D–) = 0.25. Then the odds of contracting the disease in this group is equal to 0.75/0.25 = 3 (or “3 to 1”).*

Likewise, if in a reference group of “low risk” individuals, the prevalence of the same disease is only P(D+) = 0.02, so that P(D–) = 0.98, then their odds = 0.02/0.98 = 1/49 (≈ 0.0204). As its name suggests, the corresponding “odds ratio” between the two groups is defined as the ratio of their respective odds, i.e., 3 ÷ (1/49) = 147. That is, the odds of the high-risk group contracting the disease are 147 times larger than the odds of the low-risk reference group. (Odds ratios have nice properties, and are used extensively in epidemiological studies.)

* That is, within this group, the probability of disease is three times larger than the probability of no disease.

Out of every 8 races, the horse wins 5 and loses 3, on average.
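In R, these odds and odds-ratio calculations reduce to a few lines; the sketch below (illustrative only) defines a small helper function odds() and reproduces the horse-racing and high-/low-risk figures.

odds <- function(p) p / (1 - p)          # odds of an event with probability p

odds(5/8)                 # horse example: 5/3, i.e., "5 to 3"
odds(0.75)                # high-risk group: 3   ("3 to 1")
odds(0.02)                # low-risk group: 1/49 (about 0.0204)
odds(0.75) / odds(0.02)   # odds ratio between the two groups: 147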


3.2 Conditional Probability and Independent Events

Population-based health studies are used to estimate probabilities relating potential risk factors to a particular disease, to evaluate the efficacy of medical diagnostic and screening tests, etc.

Example: Events: A = “lung cancer” B = “smoker”

                              Disease Status
                      Lung cancer (A)   No lung cancer (Ac)
Smoker   Yes (B)           0.12               0.04             0.16
         No  (Bc)          0.03               0.81             0.84
                           0.15               0.85             1.00

Probabilities: P(A) = 0.15    P(B) = 0.16    P(A ∩ B) = 0.12

Definition: Conditional Probability of Event A, given Event B (where P(B) ≠ 0):

P(A | B) = P(A ∩ B) / P(B)

Here, P(A | B) = 0.12/0.16 = 0.75 >> 0.15 = P(A).

Comments:

P(B | A) = P(B ∩ A) / P(A) = 0.12/0.15 = 0.80, so P(A | B) ≠ P(B | A) in general.

The general formula can be rewritten:  P(A ∩ B) = P(A | B) × P(B)   ← IMPORTANT

Example: P(Angel barks) = 0.1,  P(Brutus barks) = 0.2,  P(Angel barks | Brutus barks) = 0.3.

Therefore… P(Angel and Brutus bark) = P(Angel barks | Brutus barks) × P(Brutus barks) = (0.3)(0.2) = 0.06.

[Venn diagram: A only = 0.03, A ∩ B = 0.12, B only = 0.04, neither = 0.81.]
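As a quick R illustration (a sketch, using the joint probabilities from the table above), the conditional probabilities P(A | B) and P(B | A) can be computed directly from the 2 × 2 table:

# Joint probabilities for the smoking / lung-cancer example as a matrix.
joint <- matrix(c(0.12, 0.04,
                  0.03, 0.81), nrow = 2, byrow = TRUE,
                dimnames = list(Smoker = c("Yes", "No"),
                                LungCancer = c("Yes", "No")))
P_A  <- sum(joint[, "Yes"])    # P(A)  = 0.15
P_B  <- sum(joint["Yes", ])    # P(B)  = 0.16
P_AB <- joint["Yes", "Yes"]    # P(A and B) = 0.12
P_AB / P_B                     # P(A | B) = 0.75
P_AB / P_A                     # P(B | A) = 0.80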


Example: Suppose that two balls are to be randomly drawn, one after another, from a container holding four red balls and two green balls. Under the scenario of sampling without replacement, calculate the probabilities of the events A = “First ball is red”, B = “Second ball is red”, and A ∩ B = “First ball is red AND second ball is red”. (As an exercise, list the 6 × 5 = 30 outcomes in the sample space of this experiment, and use “brute force” to solve this problem.)

This type of problem – known as an “urn model” – can be solved with the use of a tree diagram, where each branch of the “tree” represents a specific event, conditioned on a preceding event. The product of the probabilities of all such events along a particular sequence of branches is equal to the corresponding intersection probability, via the previous formula. In this example, we obtain the following values:

Tree diagram (1st draw → 2nd draw):

P(A) = 4/6:   P(B | A) = 3/5  ⇒  P(A ∩ B) = 12/30        P(Bc | A) = 2/5  ⇒  P(A ∩ Bc) = 8/30
P(Ac) = 2/6:  P(B | Ac) = 4/5  ⇒  P(Ac ∩ B) = 8/30        P(Bc | Ac) = 1/5  ⇒  P(Ac ∩ Bc) = 2/30

We can calculate the probability P(B) by adding the two intersection probabilities that involve B, i.e., P(B) = P(A ∩ B) + P(Ac ∩ B) = 12/30 + 8/30 = 20/30, or P(B) = 2/3.

This last formula – which can be written as P(B) = P(B | A) P(A) + P(B | Ac) P(Ac) – can be extended to more general situations, where it is known as the Law of Total Probability, and is a useful tool in Bayes’ Theorem (next section).
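A short R simulation (an illustrative sketch; the seed is arbitrary) can be used to check the urn-model values above:

# Urn model: 4 red, 2 green, two balls drawn without replacement.
set.seed(1)
urn   <- c(rep("R", 4), rep("G", 2))
draws <- replicate(100000, sample(urn, 2, replace = FALSE))
A <- draws[1, ] == "R"     # first ball red
B <- draws[2, ] == "R"     # second ball red
mean(A)       # approx 4/6  = 0.667
mean(B)       # approx 2/3  = 0.667  (Law of Total Probability)
mean(A & B)   # approx 12/30 = 0.4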


Two events A and B are said to be statistically independent if either:

(1) P(A | B) = P(A), i.e., P(B | A) = P(B),

or equivalently,

(2) P(A ∩ B) = P(A) × P(B).

Suppose event C = “coffee drinker.”

Probabilities: P(A) = 0.15 P(C) = 0.40 P(A ∩ C) = 0.06

Therefore, P(A | C) = P(A ∩ C) / P(C) = 0.06/0.40 = 0.15 = P(A),

i.e., the occurrence of event C gives no information about the probability of event A.

Exercise: Prove that if events B and C are statistically independent, then so are each of the following:

• B and “Not C”
• “Not B” and C
• “Not B” and “Not C”

Hint: Let P(B) = b, P(C) = c, and construct a 2 × 2 probability table.

Summary

A, B disjoint ⇔ If either event occurs, then the other cannot occur: P(A ∩ B) = 0.

A, B independent ⇔ If either event occurs, this gives no information about the other: P(A ∩ B) = P(A) × P(B).

Example: A = “Select a 2” and B = “Select a ♣” are not disjoint events, because A ∩ B = {2♣} ≠ ∅. However, P(A ∩ B) = 1/52 = 1/13 × 1/4 = P(A) × P(B); hence they are independent events. Can two disjoint events ever be independent? Why?

Data for this example (“Lung cancer” vs. “Coffee Drinker”):

                               Disease Status
                       Lung cancer (A)   No lung cancer (Ac)
Coffee Drinker  Yes (C)      0.06               0.34             0.40
                No  (Cc)     0.09               0.51             0.60
                             0.15               0.85             1.00

[Venn diagram: A only = 0.09, A ∩ C = 0.06, C only = 0.34, neither = 0.51.]
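A minimal R check of independence for this example (values taken from the table above):

P_A  <- 0.15
P_C  <- 0.40
P_AC <- 0.06
P_AC / P_C                    # P(A | C) = 0.15, equal to P(A)
all.equal(P_AC, P_A * P_C)    # TRUE: P(A and C) = P(A) P(C), so A and C are independent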


A VERY IMPORTANT AND USEFUL FACT: It can be shown that, for any event A, all of the elementary properties of “probability” P(A) covered in the notes extend to “conditional probability” P(A | B), for any other event B. For example, since we know that

P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2)

for any two events A1 and A2, it is also true that

P(A1 ∪ A2 | B) = P(A1 | B) + P(A2 | B) − P(A1 ∩ A2 | B) for any other event B.

As another example, since we know that P(Ac) = 1 − P(A), it therefore also follows that P(Ac | B) = 1 − P(A | B).

Exercise: Prove these two statements. (Hint: Sketch a Venn diagram.)

HOWEVER, there is one important exception! We know that if A and B are two independent events, then P(A ∩ B) = P(A) P(B). But this does not extend to conditional probabilities! In particular, if C is any other event, then

P(A ∩ B | C) ≠ P(A | C) P(B | C) in general. The following example illustrates this, for three events A, B, and C. Exercise: Confirm that P(A ∩ B) = P(A) P(B), but P(A ∩ B | C) ≠ P(A | C) P(B | C).

In other words, two events that may be independent in a general population, may not necessarily be independent in a particular subgroup of that population.

[Venn diagram for the exercise: three overlapping events A, B, and C, with the eight region probabilities .10, .20, .15, .05, .05, .05, .20, .20.]


More on Conditional Probability and Independent Events

Another example from epidemiology

Suppose that, in a certain study population, we wish to investigate the prevalence of lung cancer (A), and its associations with obesity (B) and cigarette smoking (C), respectively. From the first of the two stylized Venn diagrams above, by comparing the scales drawn, observe that the proportion of the size of the intersection A ∩ B (green) relative to event B (blue + green), is about equal to the proportion of the size of event A (yellow + green) relative to the entire population S. That is,

P(A ∩ B) / P(B) = P(A) / P(S).

(As an exercise, verify this equality for the following probabilities: yellow = .09, green = .07, blue = .37, white = .47, to two decimals, before reading on.) In other words, the probability that a randomly chosen person from the obese subpopulation has lung cancer, is equal to the probability that a randomly chosen person from the general population has lung cancer (.16). This equation can be equivalently expressed as

P(A | B) = P(A), since the left side is conditional probability by definition, and P(S) = 1 in the denominator of the right side. In this form, the equation clearly conveys the interpretation that knowledge of event B (obesity) yields no information about event A (lung cancer). In this example, lung cancer is equally probable (.16) among the obese as it is among the general population, so knowing that a person is obese is completely unrevealing with respect to having lung cancer. Events A and B that are related in this way are said to be independent. Note that they are not disjoint! In the second diagram however, the relative size of A ∩ C (orange) to C (red + orange), is larger than the relative size of A (yellow + orange) to the whole population S, so P(A | C) ≠ P(A), i.e., events A and C are dependent. Here, as is true in general, the probability of lung cancer is indeed influenced by whether a person is randomly selected from among the general population or the smoking subset, where it is much higher. Statistically, lung cancer would be a rare disease in the U.S., if not for cigarettes (although it is on the rise among nonsmokers).

[Two stylized Venn diagrams: each has S = POPULATION and A = lung cancer; the first shows B = obese with intersection A ∩ B, the second shows C = smoker with intersection A ∩ C.]


Application: “Are Blood Antibodies Independent?” An example of conditional probability in human genetics

(Adapted from Rick Chappell, Ph.D., UW Dept. of Biostatistics & Medical Informatics) Background: The surfaces of human red blood cells (“erythrocytes”) are coated with antigens that are classified into four disjoint blood types: O, A, B, and AB. Each type is associated with blood serum antibodies for the other types, that is, • Type O blood contains both A and B antibodies. (This makes Type O the “universal donor”, but capable of receiving only Type O.) • Type A blood contains only B antibodies. • Type B blood contains only A antibodies. • Type AB blood contains neither A nor B antibodies. (This makes Type AB the “universal recipient”, but capable of donating only to Type AB.) In addition, blood is also classified according to the presence (+) or absence (−) of Rh factor (found predominantly in rhesus monkeys, and to varying degree in human populations; they are important in obstetrics). Hence there are eight distinct blood groups corresponding to this joint classification system: O+, O−, A+, A−, B+, B−, AB+, AB−. According to the American Red Cross, the U.S. population has the following blood group relative frequencies:

                        Rh factor
Blood Type         +        −       Totals
     O           .384     .077       .461
     A           .323     .065       .388
     B           .094     .017       .111
     AB          .032     .007       .039
  Totals         .833     .166       .999

From these values (and from the background information above), we can calculate the following probabilities:

P(A antibodies) = P(Type O or B) = P(O) + P(B) = .461 + .111 = .572
P(B antibodies) = P(Type O or A) = P(O) + P(A) = .461 + .388 = .849

P(B antibodies and Rh+) = P(Type O+ or A+) = P(O+) + P(A+) = .384 + .323 = .707



Using these calculations, we can answer the following.

Question: Is having “A antibodies” independent of having “B antibodies”?

Solution: We must check whether or not P(A and B antibodies) = P(A antibodies) × P(B antibodies), i.e., whether P(Type O) = .572 × .849, that is, .461 vs. .486. This indicates near independence of the two events; there does exist a slight dependence. The dependence would be much stronger if America were composed of two disjoint (i.e., non-interbreeding) groups: Type A (with B antibodies only) and Type B (with A antibodies only), and no Type O (with both A and B antibodies). Since this is evidently not the case, the implication is that either these traits evolved before humans spread out geographically, or they evolved later but the populations became mixed in America.

Question: Is having “B antibodies” independent of “Rh+”?

Solution: We must check whether or not

P(B antibodies and Rh+) = P(B antibodies) × P(Rh+), that is, .707 = .849 × .833, which is true, so we have exact independence of these events. These traits probably predate diversification in humans (and were not differentially selected for since).

Exercises:
• Is having “A antibodies” independent of “Rh+”?
• Find P(A antibodies | B antibodies) and P(B antibodies | A antibodies). Conclusions?
• Is “Blood Type” independent of “Rh factor”? (Do a separate calculation for each blood type: O, A, B, AB, and each Rh factor: +, −.)
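The two worked checks above can be reproduced in R from the Red Cross table; the following is an illustrative sketch (the object name bt is arbitrary):

bt <- matrix(c(.384, .077,
               .323, .065,
               .094, .017,
               .032, .007), nrow = 4, byrow = TRUE,
             dimnames = list(Type = c("O", "A", "B", "AB"), Rh = c("+", "-")))
P_Aab <- sum(bt[c("O", "B"), ])    # P(A antibodies) = .572
P_Bab <- sum(bt[c("O", "A"), ])    # P(B antibodies) = .849
sum(bt["O", ])                     # P(A and B antibodies) = P(Type O) = .461
P_Aab * P_Bab                      # .486 -- close to, but not equal to, .461
sum(bt[c("O", "A"), "+"])          # P(B antibodies and Rh+) = .707
P_Bab * sum(bt[, "+"])             # .849 x .833 = .707 -- independence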


3.3 Bayes’ Formula

Suppose that, for a certain population of individuals, we are interested in comparing sleep disorders – in particular, the occurrence of event A = “Apnea” – between M = Males and F = Females.

Also assume that we know the following information:

P(M) = 0.4    P(A | M) = 0.8  (80% of males have apnea)
P(F) = 0.6    P(A | F) = 0.3  (30% of females have apnea)

P(M) and P(F) are the prior probabilities. Given here are the conditional probabilities of having apnea within each respective gender, but these are not necessarily the probabilities of interest. We actually wish to calculate the probability of each gender, given A. That is, the posterior probabilities P(M | A) and P(F | A). To do this, we first need to reconstruct P(A) itself from the given information.

[Venn diagram: S = Adults under 50, partitioned into M and F, with event A straddling both (regions A ∩ M and A ∩ F).]

Tree diagram:

P(M):   P(A | M)  ⇒  P(A ∩ M) = P(A | M) P(M)         P(Ac | M)  ⇒  P(Ac ∩ M) = P(Ac | M) P(M)
P(F):   P(A | F)  ⇒  P(A ∩ F) = P(A | F) P(F)          P(Ac | F)  ⇒  P(Ac ∩ F) = P(Ac | F) P(F)

P(A) = P(A | M) P(M) + P(A | F) P(F)


[Venn diagram: P(A ∩ M) = 0.32, P(A ∩ F) = 0.18, P(Ac ∩ M) = 0.08, P(Ac ∩ F) = 0.42.]

So, given A… (posterior probabilities)

P(M | A) = P(M ∩ A)/P(A) = P(A | M) P(M) / [P(A | M) P(M) + P(A | F) P(F)]
         = (0.8)(0.4) / [(0.8)(0.4) + (0.3)(0.6)] = 0.32/0.50 = 0.64

and

P(F | A) = P(F ∩ A)/P(A) = P(A | F) P(F) / [P(A | M) P(M) + P(A | F) P(F)]
         = (0.3)(0.6) / [(0.8)(0.4) + (0.3)(0.6)] = 0.18/0.50 = 0.36

Thus, the additional information that a randomly selected individual has apnea (an event with probability 50% – why?) increases the likelihood of being male from a prior probability of 40% to a posterior probability of 64%, and likewise, decreases the likelihood of being female from a prior probability of 60% to a posterior probability of 36%. That is, knowledge of event A can alter a prior probability P(B) to a posterior probability P(B | A), of some other event B.
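In R, the apnea example above amounts to one application of the Law of Total Probability followed by Bayes’ Formula; a minimal sketch:

prior     <- c(M = 0.4, F = 0.6)
p_A_given <- c(M = 0.8, F = 0.3)        # P(A | M), P(A | F)
P_A       <- sum(p_A_given * prior)     # Law of Total Probability: 0.50
posterior <- p_A_given * prior / P_A    # Bayes' Formula
posterior                               # P(M | A) = 0.64, P(F | A) = 0.36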

Exercise: Calculate and interpret the posterior probabilities P(M | Ac) and P(F | Ac) as above, using the prior probabilities (and conditional probabilities) given.

More formally, consider any event A, and two complementary events B1 and B2 (e.g., M and F) in a sample space S. How do we express the posterior probabilities P(B1 | A) and P(B2 | A) in terms of the conditional probabilities P(A | B1) and P(A | B2), and the prior probabilities P(B1) and P(B2)?

Bayes’ Formula for posterior probabilities P(Bi | A) in terms of prior probabilities P(Bi), i = 1, 2:

P(Bi | A) = P(Bi ∩ A) / P(A) = P(A | Bi) P(Bi) / [P(A | B1) P(B1) + P(A | B2) P(B2)]


In general, consider an event A, and events B1, B2, …, Bn, disjoint and exhaustive.

[Portrait: Reverend Thomas Bayes, 1702–1761]

[Venn diagram: S partitioned into B1, B2, …, Bn, with A intersecting each piece: A ∩ B1, A ∩ B2, …, A ∩ Bn.]

Tree diagram (prior probabilities P(Bi), followed by conditional probabilities P(A | Bi)):

P(B1):   P(A | B1)  ⇒  P(A ∩ B1)        P(Ac | B1)  ⇒  P(Ac ∩ B1)
P(B2):   P(A | B2)  ⇒  P(A ∩ B2)        P(Ac | B2)  ⇒  P(Ac ∩ B2)
P(B3):   P(A | B3)  ⇒  P(A ∩ B3)        P(Ac | B3)  ⇒  P(Ac ∩ B3)
  ⋮
P(Bn):   P(A | Bn)  ⇒  P(A ∩ Bn)        P(Ac | Bn)  ⇒  P(Ac ∩ Bn)

Law of Total Probability:

P(A) = Σ (j = 1 to n) P(A ∩ Bj) = Σ (j = 1 to n) P(A | Bj) P(Bj)

Bayes’ Formula (general version)

For i = 1, 2, …, n, the posterior probabilities are

P(Bi | A) = P(Bi ∩ A) / P(A) = P(A | Bi) P(Bi) / Σ (j = 1 to n) P(A | Bj) P(Bj).
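The general version translates directly into a short R function; the sketch below (the function name and example numbers are hypothetical, not from the notes) returns all n posterior probabilities at once.

# Given prior probabilities P(B_i) and conditional probabilities P(A | B_i),
# return the posterior probabilities P(B_i | A).
bayes_posterior <- function(prior, likelihood) {
  joint <- likelihood * prior      # P(A and B_i) = P(A | B_i) P(B_i)
  joint / sum(joint)               # divide by P(A), via the Law of Total Probability
}

# Hypothetical three-group example:
bayes_posterior(prior = c(0.5, 0.3, 0.2), likelihood = c(0.1, 0.2, 0.4))
# returns approximately 0.263, 0.316, 0.421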


3.4 Applications

“Evidence-Based Medicine”: Screening Tests and Disease Diagnosis

Clinical tests are frequently used in medicine and epidemiology to diagnose or screen for the presence (T+) or absence (T−) of a particular condition, such as pregnancy or disease. Definitive disease status (either D+ or D−) is often subsequently determined by means of a “gold standard,” such as data resulting from follow-up, invasive radiographic or surgical procedures, or autopsy. Different measures of the test’s merit can then be estimated via various conditional probabilities. For instance, the sensitivity or true positive rate of the test is defined as the probability that a randomly selected individual has a positive test result, given that he/she actually has the disease. Other terms are defined similarly; the following example, using a random sample of n = 200 patients, shows how they are estimated from the data.

                               Disease Status
                         Diseased (D+)   Nondiseased (D−)
Test     Positive (T+)      16 (= TP)        9 (= FP)         25
Result   Negative (T−)       4 (= FN)      171 (= TN)        175
                                 20             180           200

True Positive rate = P(T+ | D+)                 False Positive rate = P(T+ | D−)
   “Sensitivity” = 16/20 = .80                     1 − specificity = 9/180 = .05

False Negative rate = P(T− | D+)                True Negative rate = P(T− | D−)
   1 − sensitivity = 4/20 = .20                    “Specificity” = 171/180 = .95

[Venn diagram: the sample is split by disease status D+ / D− and by test result T+ / T−, giving the four regions T+ ∩ D+, T+ ∩ D−, T− ∩ D+, and T− ∩ D−.]


In order to be able to apply this test to the general population, we need accurate estimates of its predictive values of a positive and negative test, PV+ = P(D+ | T+) and PV− = P(D− | T−), respectively. We can do this via the basic definition

P(B | A) = P(B ∩ A) / P(A)

which, when applied to our context, becomes

P(D+ | T+) = P(D+ ∩ T+) / P(T+)    and    P(D− | T−) = P(D− ∩ T−) / P(T−),

often written  PV+ = TP / (TP + FP)  and  PV− = TN / (FN + TN).

Here, PV+ = 16/25 = 0.64 and PV− = 171/175 = 0.977.

However, a more accurate determination is possible, with the use of…

Bayes’ Formula:   P(B | A) = P(A | B) P(B) / [P(A | B) P(B) + P(A | Bc) P(Bc)]

which, when applied to our context, becomes

P(D+ | T+) = P(T+ | D+) P(D+) / [P(T+ | D+) P(D+) + P(T+ | D−) P(D−)],

i.e.,  PV+ = (Sensitivity)(Prevalence) / [(Sensitivity)(Prevalence) + (False Positive rate)(1 − Prevalence)]

and

P(D− | T−) = P(T− | D−) P(D−) / [P(T− | D−) P(D−) + P(T− | D+) P(D+)],

i.e.,  PV− = (Specificity)(1 − Prevalence) / [(Specificity)(1 − Prevalence) + (False Negative rate)(Prevalence)].

All the ingredients are obtainable from the table calculations, except for the baseline prevalence of the disease in the population, P(D+), which is usually grossly overestimated by the corresponding sample-based value, in this case, 20/200 = .10. We must look to outside published sources and references for a more accurate estimate of this figure.


Suppose that we are able to determine the prior probabilities:

P(D+) = .04 and therefore, P(D−) = .96.

Then, substituting, we obtain the following posterior probabilities:

PV+ = (.80)(.04) / [(.80)(.04) + (.05)(.96)] = .40    and    PV− = (.95)(.96) / [(.95)(.96) + (.20)(.04)] = .99.

Therefore, a positive test result increases the probability of having this disease from 4% to 40%; a negative test result increases the probability of not having the disease from 96% to 99%. Hence, this test is extremely specific for the disease (i.e., low false positive rate), but is not very sensitive to its presence (i.e., high false negative rate). A physician may wish to use a screening test with higher sensitivity (i.e., low false negative rate). However, such tests also sometimes have low specificity (i.e., high false positive rate), e.g., MRI screening for breast cancer. An ideal test generally has both high sensitivity and high specificity (e.g., mammography), but such tests are often expensive. Typically, health insurance companies favor tests with three criteria: cheap, fast, and easy, e.g., Fecal Occult Blood Test (FOBT) vs. colonoscopy.
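As an R sketch (the function name predictive_values is made up for illustration), the predictive values above follow directly from sensitivity, specificity, and prevalence:

predictive_values <- function(sens, spec, prev) {
  PVpos <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
  PVneg <- spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
  c(PVpos = PVpos, PVneg = PVneg)
}
predictive_values(sens = 0.80, spec = 0.95, prev = 0.04)   # 0.40 and 0.99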

[Cartoon: “FUITA Procedure”* – contrasting high-cost, low-cost, and no-cost screening options. *Overwhelmingly preferred by most insurance companies.]

FOBT: patient-obtained fecal smears are analyzed for the presence of blood in stool, a possible sign of colorectal cancer. High false positive rate (e.g., bleeding hemorrhoid).


“Evidence-Based Medicine”: Receiver Operating Characteristic (ROC) Curves

Originally developed in the electronic communications field for displaying “Signal-to-Noise Ratio” (SNR), these graphical objects are used when numerical cutoff values are used to determine T+ versus T−.

Example: Using blood serum markers in a screening test (T) for detecting fetal Down’s syndrome (D) and other abnormalities, as maternal age changes.

Triple Test: Uses three maternal serum markers (alpha-fetoprotein, unconjugated oestriol, and human chorionic gonadotrophin) to calculate a woman’s individual risk of having a Down syndrome pregnancy.

[Figure: ROC curve for the Triple Test, plotting True Positive rate against False Positive rate as the cutoff (maternal age 20, 25, 30, 35, 40) varies. The diagonal, where True + = False + and True − = False −, is a nondiscriminatory test with AUC = 0.5; the IDEAL TEST reaches the upper-left corner, with AUC = 1. Cutoffs toward one end of the curve are sensitive but not specific, those toward the other end are specific but not sensitive, and the optimal cutoff lies between.]


The True Positive rate (from 0 to 1) of the test is graphed against its False Positive rate (from 0 to 1), for a range of age levels, and approximated by a curve contained in the unit square. The farther this graph lies above the diagonal – i.e., the closer it comes to the ideal level of 1 – the better the test. This is often measured by the Area Under Curve (AUC), which has a maximum value of 1, the total area of the unit square. Often in practice, the “curve” is simply the corresponding polygonal graph (as shown), and AUC can be numerically estimated by the Trapezoidal Rule. (It can also be shown that this value corresponds to the probability that a random pregnancy can be correctly classified as Down, using this screening test.) Illustrated below are the ROC curves corresponding to three different Down syndrome screening tests; although their relative superiorities are visually suggestive, formal comparison is commonly performed by a modified version of the Wilcoxon Rank Sum Test (covered later).

[Figure: ROC curves for three Down syndrome screening tests, including the Triple Test + dimeric inhibin A (DIA).]
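A sketch of the Trapezoidal Rule calculation of AUC in R; the (False Positive rate, True Positive rate) points below are invented for illustration and are not taken from the figures.

fpr <- c(0, 0.05, 0.15, 0.30, 0.60, 1)    # hypothetical False Positive rates
tpr <- c(0, 0.40, 0.65, 0.80, 0.95, 1)    # hypothetical True Positive rates
# Trapezoidal Rule: sum of (base width) x (average height) over each segment.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc   # area under the polygonal ROC "curve" (about 0.82 for these points)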



Further Applications: Relative Risk and Odds Ratios

Measuring degrees of association between disease (D) and exposure (E) to a potential risk (or protective) factor, using a prospective cohort study:

From the resulting data, various probabilities can be estimated. Approximately,

                                Disease Status
                          Diseased (D+)   Nondiseased (D−)
Risk     Exposed (E+)          p11              p12            p11 + p12
Factor   Unexposed (E−)        p21              p22            p21 + p22
                            p11 + p21        p12 + p22             1

P(D+ | E+) = P(D+ ∩ E+)/P(E+) = p11/(p11 + p12)        P(D− | E+) = P(D− ∩ E+)/P(E+) = p12/(p11 + p12)

P(D+ | E−) = P(D+ ∩ E−)/P(E−) = p21/(p21 + p22)        P(D− | E−) = P(D− ∩ E−)/P(E−) = p22/(p21 + p22)

Odds of disease, given exposure = P(D+ | E+)/P(D− | E+) = [p11/(p11 + p12)] / [p12/(p11 + p12)] = p11/p12

Odds of disease, given no exposure = P(D+ | E−)/P(D− | E−) = [p21/(p21 + p22)] / [p22/(p21 + p22)] = p21/p22

Odds Ratio:  OR = [P(D+ | E+)/P(D− | E+)] ÷ [P(D+ | E−)/P(D− | E−)] = (p11/p12) ÷ (p21/p22) = (p11 p22)/(p12 p21), the “cross product ratio”

Comment: If OR = 1, then “odds, given exposure” = “odds, given no exposure,” i.e., no association exists between disease D and exposure E. What if OR > 1 or OR < 1?

Relative Risk:  RR = P(D+ | E+)/P(D+ | E−) = [p11/(p11 + p12)] / [p21/(p21 + p22)] = p11(p21 + p22) / [p21(p11 + p12)]

Comment: RR directly measures the effect of exposure on disease, but OR has better statistical properties. However, if the disease is rare in the population, i.e., if p11 ≈ 0 and p21 ≈ 0, then

RR = p11(p21 + p22) / [p21(p11 + p12)] ≈ (p11 p22)/(p12 p21) = OR.

[Timeline for a prospective cohort study: PRESENT → FUTURE. Given: Exposed (E+) and Unexposed (E−); Investigate: association with D+ and D−.]


Recall our earlier example of investigating associations between lung cancer and the potential risk factors of smoking and coffee drinking. First consider the former:

                                Lung Cancer
                         Diseased (D+)   Nondiseased (D−)
Smoking  Exposed (E+)         .12              .04             .16
         Not Exposed (E−)     .03              .81             .84
                              .15              .85            1.00

P(D+ | E+) = P(D+ ∩ E+)/P(E+) = .12/.16 = 3/4;  therefore, P(D− | E+) = .04/.16 = 1/4.

A random smoker has a 3 out of 4 (i.e., 75%) probability of having lung cancer; a random smoker has a 1 out of 4 (i.e., 25%) probability of not having lung cancer.

Therefore, the odds of the disease, given exposure, = P(D+ | E+)/P(D− | E+) = (3/4)/(1/4), or .12/.04, = 3.

The probability that a random smoker has lung cancer is 3 times greater than the probability that he/she does not have it.

P(D+ | E−) = P(D+ ∩ E−)/P(E−) = .03/.84 = 1/28;  therefore, P(D− | E−) = .81/.84 = 27/28.

A random nonsmoker has a 1 out of 28 (i.e., 3.6%) probability of having lung cancer; a random nonsmoker has a 27 out of 28 (i.e., 96.4%) probability of not having lung cancer.

Therefore, the odds of the disease, given no exposure, = P(D+ | E−)/P(D− | E−) = (1/28)/(27/28), or .03/.81, = 1/27.

The probability that a random nonsmoker has lung cancer is 1/27 (= .037) times the probability that he/she does not have it. Or equivalently, The probability that a random nonsmoker does not have lung cancer is 27 times greater than the probability that he/she does have it.

Odds Ratio:  OR = odds(D± | E+)/odds(D± | E−) = 3/(1/27), or, via the “cross product ratio”, (.12)(.81) / [(.04)(.03)] = 81.

The odds of having lung cancer among smokers are 81 times greater than the odds of having lung cancer among nonsmokers.

Relative Risk:  RR = P(D+ | E+)/P(D+ | E−) = (3/4)/(1/28), or, via the “cross product ratio”, (.12)(.84) / [(.16)(.03)] = 21.

The probability of having lung cancer among smokers is 21 times greater than the probability of having lung cancer among nonsmokers. The findings that OR >> 1 and RR >> 1 suggest a strong association between lung cancer and smoking. (But how do we formally show that this is significant? Later…)
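A quick R check of these two calculations, using the joint probabilities from the smoking table:

p11 <- 0.12; p12 <- 0.04    # exposed:   D+, D-
p21 <- 0.03; p22 <- 0.81    # unexposed: D+, D-
OR <- (p11 * p22) / (p12 * p21)                      # cross product ratio: 81
RR <- (p11 / (p11 + p12)) / (p21 / (p21 + p22))      # 0.75 / (1/28) = 21
c(OR = OR, RR = RR)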


Now consider measures of association between lung cancer and caffeine consumption.

                                 Lung Cancer
                          Diseased (D+)   Nondiseased (D−)
Caffeine  Exposed (E+)         .06              .34             .40
          Not Exposed (E−)     .09              .51             .60
                               .15              .85            1.00

P(D+ | E+) = P(D+ ∩ E+)/P(E+) = .06/.40 = .15;  therefore, P(D− | E+) = .34/.40 = .85.

A random caffeine consumer has a 15% probability of having lung cancer; a random caffeine consumer has an 85% probability of not having lung cancer.

NOTE: P(D+ | E+) = .15 = P(D+), so D+ and E+ are independent events!

Therefore, the odds of the disease, given exposure, = P(D+ | E+)/P(D− | E+) = .15/.85, or .06/.34, = .176.

The probability that a random caffeine consumer has lung cancer is .176 times the probability that he/she does not have it.

P(D+ | E−) = P(D+ ∩ E−)/P(E−) = .09/.60 = .15;  therefore, P(D− | E−) = .51/.60 = .85.

A random caffeine non-consumer has a 15% probability of having lung cancer; a random caffeine non-consumer has an 85% probability of not having lung cancer.

Therefore, the odds of the disease, given no exposure, = P(D+ | E−)/P(D− | E−) = .15/.85, or .09/.51, = .176.

The probability that a random caffeine non-consumer has lung cancer is .176 times the probability that he/she does not have it.

Odds Ratio:  OR = odds(D± | E+)/odds(D± | E−) = .176/.176, or, via the “cross product ratio”, (.06)(.51) / [(.34)(.09)] = 1.

The odds of having lung cancer among caffeine consumers are equal to the odds of having lung cancer among caffeine non-consumers.

Relative Risk:  RR = P(D+ | E+)/P(D+ | E−) = .15/.15, or, via the “cross product ratio”, (.06)(.60) / [(.40)(.09)] = 1.

The probability of having lung cancer among caffeine consumers is equal to the probability of having lung cancer among caffeine non-consumers. NOTE: The findings that OR = 1 and RR = 1 are to be expected, since D+ and E+ are independent! Thus, no association exists between lung cancer and caffeine consumption. (In truth, there actually is a spurious association, since many coffee drinkers also smoke, which commonly leads to lung cancer. In this context, smoking is a variable that confounds the association between lung cancer and caffeine, and should be adjusted for. For a well-known example of a study where this was not done carefully enough, with substantial consequences, see MacMahon B., Yen S., Trichopoulos D., et. al., Coffee and Cancer of the Pancreas, New England Journal of Medicine, March 12, 1981; 304: 630-33.)


Adjusting for Age (and other confounders)

Once again, consider the association between lung cancer and smoking in the earlier example. A legitimate argument can be made that the reason for such a high relative risk (RR = 21) is that age is a confounder that was not adequately taken into account in the study. That is, there is a naturally higher risk of many cancers as age increases, regardless of smoking status, so “How do you tease apart the effects of age versus smoking, on the disease?” The answer is to adjust, or standardize,

for age. First, recall that relative risk RR = P(D+ | E+) / P(D+ | E−) by definition, i.e., we are

confining our attention only to individuals with disease (D+), and measuring the effect of exposure (E+ vs. E–). Therefore, we can restrict our analysis to the two cells in the first column of the previous 2 × 2 table. However, suppose now that the probability estimates are stratified on age, as shown:

D+, exposed (E+) stratum:
        Age       ni+ = #(E+)     xi+ = #(D+ ∩ E+)     pi+ = P(D+ | E+) = xi+ / ni+
        50-59          250               5                  5/250 = .02
        60-69          150              15                 15/150 = .10
        70-79          100              40                 40/100 = .40
        Total       n+ = 500          x+ = 60              p+ = 60/500 = .12   (as before)

D+, unexposed (E–) stratum:
        Age       ni– = #(E–)     xi– = #(D+ ∩ E–)     pi– = P(D+ | E–) = xi– / ni–
        50-59          300               3                  3/300 = .01
        60-69          200               8                  8/200 = .04
        70-79          100               7                  7/100 = .07
        Total       n– = 600          x– = 18              p– = 18/600 = .03   (as before)

Total n– = 600 x– = 18 p– = 18/600 = .03 (as before) For each age stratum (i = 1, 2, 3),

ni+ = # individuals in the study who were exposed (E+), regardless of disease status
ni– = # individuals in the study who were not exposed (E–), regardless of disease status
xi+ = # of exposed individuals (E+), with disease (D+)
xi– = # of unexposed individuals (E–), with disease (D+)

Therefore,

pi+ = xi+ / ni+ = proportion of exposed individuals (E+), with disease (D+)
pi– = xi– / ni– = proportion of unexposed individuals (E–), with disease (D+)


From this information, we can imagine a combined table of age strata for D+:

        Age      ni = ni+ + ni–      pi+     pi–
E±      50-59         550            .02     .01
        60-69         350            .10     .04
        70-79         200            .40     .07
        Total      n = 1100

Now, to estimate the “age-adjusted” numerator P(D+ | E+) of RR, we calculate the weighted average of the proportions pi+, using their corresponding combined sample sizes ni as the weights. That is,

P(D+ | E+) ≈ Σ ni pi+ / Σ ni = [(550)(.02) + (350)(.10) + (200)(.40)] / (550 + 350 + 200) = 126/1100 = 0.1145

and similarly, the “age-adjusted” denominator P(D+ | E–) of RR is estimated by the weighted average of the proportions pi–, again using the same combined sample sizes ni as the weights:

P(D+ | E–) ≈ Σ ni pi– / Σ ni = [(550)(.01) + (350)(.04) + (200)(.07)] / (550 + 350 + 200) = 33.5/1100 = 0.0305

whereby we obtain

RRadj = P(D+ | E+) / P(D+ | E−) = 126/33.5 = 3.76.

Note that in this example, there is a substantial difference between the adjusted and unadjusted risks. The same ideas extend to the “age-adjusted” odds ratio ORadj.
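A short R sketch of the age-adjustment calculation above (weighted averages with the combined stratum sizes as weights):

n       <- c(550, 350, 200)      # combined sample size per age stratum
p_exp   <- c(.02, .10, .40)      # P(D+ | E+) within each stratum
p_unexp <- c(.01, .04, .07)      # P(D+ | E-) within each stratum
num <- sum(n * p_exp)   / sum(n)     # age-adjusted P(D+ | E+) = 0.1145
den <- sum(n * p_unexp) / sum(n)     # age-adjusted P(D+ | E-) = 0.0305
num / den                            # RR_adj = 3.76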


3.5 Problems – Solutions

1. Let events A = “Live to age 60,” B = “Live to age 70,” C = “Live to age 80”; note that event C is a subset of B, and that B is a subset of A, i.e., they are nested: C ⊂ B ⊂ A. We are given that P(A) = 0.90, P(B | A) = 0.80, and P(C | B) = 0.75. Therefore, by the general formula P(E ∩ F) = P(E | F) × P(F), we have (see Note below)

P(B) = P(B ∩ A) = P(B | A) × P(A) = (0.80)(0.90) = 0.72

P(C) = P(C ∩ B) = P(C | B) × P(B) = (0.75)(0.72) = 0.54

P(C | A) = P(C ∩ A)/P(A) = P(C)/P(A) = 0.54/0.90 = 0.60

2. A = “Angel barks” B = “Brutus barks” P(A) = 0.1, P(B) = 0.2, P(A | B) = 0.3 ⇒ P(A ∩ B) = 0.06

(a) Because P(A) = 0.1 is not equal to P(A | B) = 0.3, the events A and B are not independent!

Or, equivalently, P(A ∩ B) = 0.06 is not equal to P(A) × P(B) = (0.1)(0.2) = 0.02.

(b) P(A ∪ B) = P(A) + P(B) – P(A ∩ B) = 0.1 + 0.2 – 0.06 = 0.24

Via DeMorgan’s Law: P( Ac ∩ Bc ) = 1 – P(A ∪ B) = 1 – 0.24 = 0.76

P(A ∩ Bc ) = P(A) – P(A ∩ B) = 0.1 – 0.06 = 0.04

P( Ac ∩ B) = P(B) – P(A ∩ B) = 0.2 – 0.06 = 0.14

P(A ∩ Bc ) + P( Ac ∩ B) = 0.04 + 0.14 = 0.18, or, P(A ∪ B) – P(A ∩ B) = 0.24 – 0.06 = 0.18

P(B | A) = P(B ∩ A)/P(A) = 0.06/0.1 = 0.6

P(Bc | A) = P(Bc ∩ A)/P(A) = 0.04/0.1 = 0.4, or more simply, 1 – P(B | A) = 1 – 0.6 = 0.4

P(A | Bc) = P(A ∩ Bc)/P(Bc) = P(A ∩ Bc)/[1 – P(B)] = 0.04/0.8 = 0.05

              A          Ac
B           0.06        0.14       0.20 = P(B)
Bc          0.04        0.76       0.80 = P(Bc)
            0.10        0.90       1.00
          = P(A)      = P(Ac)

Note: If event C occurs, then event B must have occurred. If event B occurs, then event A must have occurred. Thus, the event A in the intersection of “B and A” is redundant, etc.

[Venn diagram: A only = 0.04, A ∩ B = 0.06, B only = 0.14, neither = 0.76.]


3. Urn Model: Events A = “First ball is red” and B = “Second ball is red.” In the “sampling

without replacement” case illustrated, it was calculated that, reduced to lowest terms, P(A) = 4/6 = 2/3, P(B) = 2/3, and P(A ∩ B) = 12/30 = 2/5. Since P(A ∩ B) = 2/5 ≠ 4/9 = 2/3 × 2/3 = P(A) × P(B), it follows that the two events A and B are not statistically independent. This should be intuitively consistent; as this “population” is small, the probability that event A occurs nontrivially affects that of event B, if the unit is not replaced after the first draw. However, in the “sampling with replacement” scenario, this is not the case. For, as illustrated below, P(A) = 4/6 = 2/3, P(B) = 24/36 = 2/3, and P(A ∩ B) = 16/36 = 4/9. Hence, P(A ∩ B) = 4/9 = 2/3 × 2/3 = P(A) × P(B), and so it follows that events A and B are indeed statistically independent.

4. First note that, in this case, A ⊂ B (event A is a subset of event B), that is, if A occurs, then B occurs! (See Venn diagram.) In addition, the given information provides us with the following conditional probabilities: P(A | B) = 0.75, P(Bc | Ac) = 0.80. Expanding these out via the usual formulas, we obtain, respectively,

0.75 = P(A | B) = P(A ∩ B)/P(B) = P(A)/P(B),  i.e.,  P(A) = 0.75 P(B)

and

0.80 = P(Bc | Ac) = P(Bc ∩ Ac)/P(Ac) = P(Bc)/P(Ac) = [1 − P(B)]/[1 − P(A)],  i.e.,  P(A) = 1.25 P(B) − 0.25

upon simplification. Since the left-hand sides of these two equations are identical, it follows that the right-hand sides are equal, i.e., 1.25 P(B) – 0.25 = 0.75 P(B), and solving yields P(B) = 0.5. Hence, there is a 50% probability that any students come to the office hour. Plugging this value back into either one of these equations yields P(A) = 0.375. Hence, there is a 37.5% probability that any students arrive within the first fifteen minutes of the office hour.

[Tree diagram for Problem 3, sampling with replacement:
P(A) = 4/6:   P(B | A) = 4/6  ⇒  P(A ∩ B) = 16/36        P(Bc | A) = 2/6  ⇒  P(A ∩ Bc) = 8/36
P(Ac) = 2/6:  P(B | Ac) = 4/6  ⇒  P(Ac ∩ B) = 8/36        P(Bc | Ac) = 2/6  ⇒  P(Ac ∩ Bc) = 4/36
P(B) = P(A ∩ B) + P(Ac ∩ B) = 24/36.]

[Venn diagram for Problem 4: Student Population, with A ⊂ B; P(A) = 0.375, P(B) = 0.5, so the region of B outside A has probability 0.125.]


2 × 2 table for part (d) of Problem 5 below (“Lung Cancer” vs. “Coffee Drinker”):

                         Lung Cancer
                       Yes        No
Coffee      Yes       0.06       0.34      0.40
Drinker     No        0.09       0.51      0.60
                      0.15       0.85      1.00

5.
                                 Cancer stage (A)
                            1       2       3       4
Income      Low (1)       0.05    0.10    0.15    0.20      0.5
Level (B)   Middle (2)    0.03    0.06    0.09    0.12      0.3
            High (3)      0.02    0.04    0.06    0.08      0.2
                           0.1     0.2     0.3     0.4      1.0

(a) Recall that one definition of statistical independence of A and B is P(A ∩ B) = P(A) P(B). In particular then, the first cell entry P(“A = 1” ∩ “B = 1”) = P(A = 1) × P(B = 1) = (0.1)(0.5) = 0.05, i.e., the product of the 1st column marginal times 1st first row marginal. In a similar fashion, the cell value in the intersection of the ith row (i = 1, 2, 3) and jth column (j = 1, 2, 3, 4) is equal to the product of the ith row marginal probability, times the jth column marginal probability, which allows us to complete the entire table easily, as shown. By definition, this property is only true for independent events (!!!), and is fundamental to the derivation of the “expected value” formulas used in the “Chi-squared Test” (sections 6.2.3 and 6.3.1).
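In R, this outer-product structure can be seen directly; the following sketch reproduces the table in part (a) from its marginals:

row_marg <- c(Low = 0.5, Middle = 0.3, High = 0.2)          # income level marginals
col_marg <- c("1" = 0.1, "2" = 0.2, "3" = 0.3, "4" = 0.4)   # cancer stage marginals
outer(row_marg, col_marg)    # under independence, each cell = row marginal x column marginal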

(b) By construction, we have

π1|1 = 0.05/0.1 = 0.5,   π1|2 = 0.10/0.2 = 0.5,   π1|3 = 0.15/0.3 = 0.5,   π1|4 = 0.20/0.4 = 0.5   … and P(Low) = 0.5
π2|1 = 0.03/0.1 = 0.3,   π2|2 = 0.06/0.2 = 0.3,   π2|3 = 0.09/0.3 = 0.3,   π2|4 = 0.12/0.4 = 0.3   … and P(Mid) = 0.3
π3|1 = 0.02/0.1 = 0.2,   π3|2 = 0.04/0.2 = 0.2,   π3|3 = 0.06/0.3 = 0.2,   π3|4 = 0.08/0.4 = 0.2   … and P(High) = 0.2

(c) Also,

π1|1 = 0.05/0.5 = 0.1    π2|1 = 0.10/0.5 = 0.2    π3|1 = 0.15/0.5 = 0.3    π4|1 = 0.20/0.5 = 0.4
π1|2 = 0.03/0.3 = 0.1    π2|2 = 0.06/0.3 = 0.2    π3|2 = 0.09/0.3 = 0.3    π4|2 = 0.12/0.3 = 0.4
π1|3 = 0.02/0.2 = 0.1    π2|3 = 0.04/0.2 = 0.2    π3|3 = 0.06/0.2 = 0.3    π4|3 = 0.08/0.2 = 0.4

… and P(Stage 1) = 0.1, P(Stage 2) = 0.2, P(Stage 3) = 0.3, P(Stage 4) = 0.4

(d) It was shown in the “Lung cancer” versus “Coffee drinker” example that these two events are independent in the study population; the 2 × 2 table is reproduced above.

The probability in the first cell (“Yes” for both events), 0.06, is indeed equal to (0.40)(0.15), the product of its row and column marginal sums (i.e., “Yes” for one event, times “Yes” for the other event), and likewise for the probabilities in all the other cells.

Note that this is not true of the 2 × 2 “Lung Cancer” versus “Smoking” table.


[Venn diagrams for Problem 6: generic labeling A only = a – c, A ∩ B = c, B only = b – c, neither = 1 – a – b + c; solved values A only = 0.04, A ∩ B = 0.36, B only = 0.09, neither = 0.51.]

6. The given information can be written as conditional probabilities:

P(A | B) = 0.8,   P(B | A) = 0.9,   P(Bc | Ac) = 0.85

We are asked to find the value of P(Ac | Bc). First, let P(A) = a, P(B) = b, and P(A ∩ B) = c. Then all of the events in the Venn diagram can be labeled as shown. Using the definition of conditional probability P(E | F) = P(E ∩ F)/P(F), we have

c/b = 0.8,   c/a = 0.9,   (1 − a − b + c)/(1 − a) = 0.85.

Algebraically solving these three equations with three unknowns yields a = 0.40, b = 0.45, c = 0.36, as shown.

Therefore, P(Ac | Bc) = P(Ac ∩ Bc)/P(Bc) = 0.51/0.55 = 0.927.

7. Let events A, B, and C represent the occurrence of each symptom, respectively. The given information can be written as:

P(A) = P(B) = P(C) = 0.6
P(A ∩ B | C) = 0.45, and similarly, P(A ∩ C | B) = 0.45, P(B ∩ C | A) = 0.45 as well.
P(A | B ∩ C) = 0.75, and similarly, P(B | A ∩ C) = 0.75, P(C | A ∩ B) = 0.75 as well.

(a) We are asked to find P(A ∩ B ∩ C). It follows from the definition of conditional probability that P(A ∩ B ∩ C) = P(A ∩ B | C) × P(C) which, via the first two statements above, = (0.45)(0.6) = 0.27. (The two other equations yield the same value.)

(b) Again, via conditional probability, we have P(A ∩ B ∩ C) = P(A | B ∩ C) × P(B ∩ C) which, via the third statement above and part (a), can be written as 0.27 = 0.75 × P(B ∩ C), so that P(B ∩ C) = 0.36. So P(Ac ∩ B ∩ C) = 0.36 − 0.27 = 0.09, and likewise for the others, P(A ∩ Bc ∩ C) and P(A ∩ B ∩ Cc). (See Venn diagram.) Hence, P(Two or three) = (3 × 0.09) + 0.27 = 0.54.

(c) From (b), P(Exactly two) = 3 × 0.09 = 0.27.

(d) From (a) and (c), it follows that P(A ∩ Bc ∩ Cc) = 0.6 − (0.27 + 0.09 + 0.09) = 0.15, and likewise for the others, P(Ac ∩ B ∩ Cc) and P(Ac ∩ Bc ∩ C). Hence P(Exactly one) = 3 × 0.15 = 0.45.

(e) From (b), (c), and (d), we see that P(A ∪ B ∪ C) = 0.27 + 3(0.09) + 3(0.15) = 0.99, so that P(Ac ∩ Bc ∩ Cc) = 1 − 0.99 = 0.01. (See Venn diagram.)

(f) Working with A and B for example, we have P(A) = P(B) = 0.6 from the given, and P(A ∩ B) = 0.36 from part (b). Since it is true that 0.36 = 0.6 × 0.6, it does indeed follow that P(A ∩ B) = P(A) × P(B), i.e., events A and B are statistically independent.

[Venn diagram for Problem 7: triple intersection A ∩ B ∩ C = 0.27; each of the three “exactly two” regions = 0.09; each of the three “exactly one” regions = 0.15; outside all three events = 0.01.]


[Venn diagram for Problem 8: A ∩ B ∩ C = .54, Ac ∩ B ∩ C = .06, A ∩ B ∩ Cc = .07, Ac ∩ B ∩ Cc = .13, A ∩ Bc ∩ C = .03, Ac ∩ Bc ∩ C = .12, A ∩ Bc ∩ Cc = .001, Ac ∩ Bc ∩ Cc = .049.]

8. With events A = Accident, B = Berkeley visited, and C = Chelsea visited, the given statements can

be translated into mathematical notation as follows:

i. P(B ⋂ C) = P(B) P(C)

ii. P(B) = .80

iii. P(C) = .75

Therefore, substituting ii and iii into i yields P(B ⋂ C) = (.8)(.75), i.e., P(B ⋂ C) = .60. (purple + gray)

Furthermore, it also follows from statistical independence that

P(B only) = P(B ⋂ C c) = (.8)(1 – .75), i.e., P(B ⋂ C c) = .20 (blue + green)

P(C only) = P(Bc ⋂ C) = (1 – .8)(.75), i.e., P(Bc ⋂ C) = .15 (red + orange)

P(Neither B nor C) = P(Bc ⋂ C c) = (1 – .8)(1 – .75), i.e., P(Bc ⋂ C c) = .05 (yellow + white)

iv. P(A | B ⋂ C) = .90, which implies P(A ⋂ B ⋂ C) = P(A | B ⋂ C) P(B ⋂ C) = (.9)(.6), i.e., P(A ⋂ B ⋂ C) = .54, hence P(Ac ⋂ B ⋂ C) = .06.

v. P(A | B ⋂ C c) = .35, which implies P(A ⋂ B ⋂ C c) = P(A | B ⋂ C c) P(B ⋂ C c) = (.35)(.2), i.e., P(A ⋂ B ⋂ C c) = .07, hence P(Ac ⋂ B ⋂ C c) = .13.

vi. P(A | Bc ⋂ C) = .20, which implies P(A ⋂ Bc ⋂ C) = P(A | Bc ⋂ C) P(Bc ⋂ C) = (.2)(.15), i.e., P(A ⋂ Bc ⋂ C) = .03, hence P(Ac ⋂ Bc ⋂ C) = .12.

vii. P(A | Bc ⋂ C c) = .02, which implies P(A ⋂ Bc ⋂ C c) = P(A | Bc ⋂ C c) P(Bc ⋂ C c) = (.02)(.05), i.e., P(A ⋂ Bc ⋂ C c) = .001, hence P(Ac ⋂ Bc ⋂ C c) = .049.


[Venn diagram for Problem 9: A only = .33, A ∩ B = .495, B only = .165, neither = .01.]

9. The given information tells us the following.

(i) P(A ⋃ B) = .99

(ii) P(B | A) = .60, which implies that P(B ⋂ A) = .6 P(A)

(iii) P(A | B) = .75, which implies that P(A ⋂ B) = .75 P(B)

Because the left-hand sides of (ii) and (iii) are the same, it follows that .6 P(A) = .75 P(B), or (iv) P(B) = .8 P(A).

Now, substituting (ii) and (iv) into the general relation P(A ⋃ B) = P(A) + P(B) – P(A ⋂ B) gives

.99 = P(A) + .8 P(A) – .6 P(A),

or .99 = 1.2 P(A), i.e., P(A) = .825. Thus, P(B) = .66 via (iv), and P(B ⋂ A) = .495 via (ii). The

two events A and B are certainly not independent, which can be seen any one of three ways:

P(A | B) = .75 from (iii), is not equal to P(A) = .825 just found;

P(B | A) = .60 from (ii), is not equal to P(B) = .66 just found;

P(A ⋂ B) = .495 is not equal to P(A) × P(B) = .825 × .66 = .5445.


10. Switch! It is tempting to believe that it makes no difference, since once a zonk door has been

opened and supposedly ruled out, the probability of winning the car should then be equally likely (i.e., 1/2) between each of the two doors remaining. However, it is important to remember that the host does not eliminate one of the original three doors at random, but always – i.e., “with probability 1” – a door other than the one chosen, and known (to him) to contain a zonk. Rather than discarding it, this nonrandom choice conveys useful information, namely, if indeed that had been the door originally chosen, then not switching would certainly have resulted in losing. As exactly one of the other doors also contains a zonk, the same argument can be applied to that door as well, whichever it is. Thus, as it would only succeed if the winning door was chosen, the strategy of not switching would result in losing two out of three times, on average.

This very surprising and counterintuitive result can be represented via the following table. Suppose that, for the sake of argument, Door 1 contains the car, and Doors 2 and 3 contain goats, as shown.

If contestant chooses:     Door 1                    Door 2                   Door 3
then host reveals:         Door 2 or Door 3          Door 3                   Door 2
                           (at random)               (not at random)          (not at random)

Switch?   Yes              LOSE                      WIN                      WIN        P(Win | Switch) = 2/3,  P(Lose | Switch) = 1/3
          No               WIN                       LOSE                     LOSE       P(Win | Stay) = 1/3,    P(Lose | Stay) = 2/3

Much mathematical literature has been devoted to the Monty Hall Problem – which has a colorful history – and its numerous variations. In addition, many computer programs exist on the Internet (e.g., using Java applets), that numerically simulate the Monty Hall Problem, and in so doing, verify that the values above are indeed correct. Despite this however, many people (including more than a few professional mathematicians and statisticians) heatedly debate the solution in favor of the powerfully intuitive, but incorrect, “switching doesn’t matter” answer. Strange but true...
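One such simulation, written as a brief R sketch (the function play() and the seed are illustrative), is shown below; it numerically confirms the 2/3 vs. 1/3 values in the table.

set.seed(1)
play <- function(switch) {
  car    <- sample(1:3, 1)                       # door hiding the car
  choice <- sample(1:3, 1)                       # contestant's initial pick
  zonks  <- setdiff(1:3, c(car, choice))         # zonk doors the host may open
  shown  <- if (length(zonks) == 1) zonks else sample(zonks, 1)
  if (switch) choice <- setdiff(1:3, c(choice, shown))
  choice == car
}
mean(replicate(10000, play(TRUE)))    # approx 2/3
mean(replicate(10000, play(FALSE)))   # approx 1/3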


[Venn diagram for Problem 11: A ∩ B ∩ C = 0.09; A ∩ B only = 0.10; A ∩ C only = 0.11; B ∩ C only = 0.12; A only = 0.13; B only = 0.14; C only = 0.15; outside all three = 0.16.]

11. (a) We know that for any two events E and F, P(E ∪ F) = P(E) + P(F) − P(E ∩ F). Hence,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B), i.e., 0.69 = P(A) + P(B) − 0.19, or 0.88 = P(A) + P(B).

Likewise, P(A ∪ C) = P(A) + P(C) − P(A ∩ C), i.e., 0.70 = P(A) + P(C) − 0.20, or 0.90 = P(A) + P(C),

and P(B ∪ C) = P(B) + P(C) − P(B ∩ C), i.e., 0.71 = P(B) + P(C) − 0.21, or 0.92 = P(B) + P(C).

Solving these three simultaneous equations yields P(A) = 0.43, P(B) = 0.45, P(C) = 0.47.

(b) Events E and F are statistically independent if P(E ∩ F) = P(E) P(F). Hence,

P(A ∩ C ∩ B) = P(A ∩ C) P(B) = (0.20)(0.45), i.e., P(A ∩ B ∩ C) = 0.09, from which the entire Venn diagram can be reconstructed from the triple intersection out, using the information above.

Page 107: FischerLecNotesIntroStatistics.pdf

Ismor Fischer, 9/29/2014 Solutions / 3.5-9

(a) Sensitivity = P(T+ | D+) = P(T+ ∩ D+) / P(D+) = 302/481 = 0.628

    Specificity = P(T− | D−) = P(T− ∩ D−) / P(D−) = 372/452 = 0.823

(b) If the prior probability is P(D+) = 0.10, then via Bayes’ Law, the posterior probability is

P(D+ | T+) = P(T+ | D+) P(D+) / [P(T+ | D+) P(D+) + P(T+ | D−) P(D−)] = (0.628)(0.10) / [(0.628)(0.10) + (0.177)(0.90)] = 0.283

(c) P(D− | T−) = P(T− | D−) P(D−) / [P(T− | D−) P(D−) + P(T− | D+) P(D+)] = (0.823)(0.90) / [(0.823)(0.90) + (0.372)(0.10)] = 0.952

Comment: There are many potential reasons for low predictive value of a positive test, despite high sensitivity (i.e., true positive rate). One possibility is very low prevalence of the disease in the population (i.e., if P(D+) ≈ 0 in the numerator, then P(D+ | T+) will consequently be small, in general), as in the previous two problems. Other possibilities include health conditions other than the intended one that might also result in a positive test, or that the test might be inaccurate in a large subgroup of the population for some reason. Often, two or more different tests are combined (such as biopsy) in order to obtain a more accurate diagnosis.

12. Odds Ratio and Relative Risk

                               Disease Status
                         Diseased (D+)   Nondiseased (D−)
Risk     Exposed (E+)         p11              p12            p11 + p12
Factor   Unexposed (E−)       p21              p22            p21 + p22
                           p11 + p21        p12 + p22             1

In a cohort study design…

OR = (odds of disease, given exposure) / (odds of disease, given no exposure)
   = [P(D+ | E+) ÷ P(D− | E+)] / [P(D+ | E−) ÷ P(D− | E−)] = (p11/p12) / (p21/p22) = (p11 p22)/(p12 p21).

In a case-control study design…

OR = (odds of exposure, given disease) / (odds of exposure, given no disease)
   = [P(E+ | D+) ÷ P(E− | D+)] / [P(E+ | D−) ÷ P(E− | D−)] = (p11/p21) / (p12/p22) = (p11 p22)/(p21 p12).

Both of these quantities agree, so the odds ratio can be used in either type of longitudinal study, although the interpretation must be adjusted accordingly. This is not true of the relative risk, which is only defined for cohort studies. (However, it is possible to estimate it using Bayes’ Law, provided one has an accurate estimate of the disease prevalence.)


13. OR = (273)(7260) / [(2641)(716)] = 1.048.  The odds of previous use of oral contraceptives given breast cancer, are 1.048 times the odds of previous use of oral contraceptives given no breast cancer. That is, the odds of previous use of oral contraceptives are approximately 5% greater among breast cancer cases than cancer-free controls. (Note: Whether or not this odds ratio of 1.048 is significantly different from 1 is the subject of statistical inference and hypothesis testing…)

14. OR = (31)(4475) / [(1594)(65)] = 1.339.  The odds of breast cancer given age at first birth ≥ 25 years old, are 1.339 times the odds of breast cancer given age at first birth < 25 years old. That is, the odds of breast cancer among women who first gave birth when they were 25 or older, are approximately 1/3 greater than those who first gave birth when they were under 25. (Again, whether or not this odds ratio of 1.339 is significantly different from 1 is to be tested.)

RR = (31)(4540) / [(1625)(65)] = 1.332.  The probability of breast cancer given age at first birth ≥ 25 years old, is 1.332 times the probability of breast cancer given age at first birth < 25 years old. That is, the probability of breast cancer among women who first gave birth when they were 25 or older, is approximately 1/3 greater than those who first gave birth when they were under 25.


15. Events: A = Aspirin use, B1 = GI bleeding, B2 = Primary stroke, B3 = CVD

Prior probabilities: P(B1) = 0.2, P(B2) = 0.3 P(B3) = 0.5 Conditional probabilities: P(A | B1) = 0.09, P(A | B2) = 0.04 P(A | B3) = 0.02

(a) Therefore, the posterior probabilities are, respectively,

P(B1 | A) = (0.09)(0.2) / [(0.09)(0.2) + (0.04)(0.3) + (0.02)(0.5)] = 0.018/0.040 = 0.45

P(B2 | A) = (0.04)(0.3) / [(0.09)(0.2) + (0.04)(0.3) + (0.02)(0.5)] = 0.012/0.040 = 0.30

P(B3 | A) = (0.02)(0.5) / [(0.09)(0.2) + (0.04)(0.3) + (0.02)(0.5)] = 0.010/0.040 = 0.25

(b) The probability of gastrointestinal bleeding (B1) increases from 20% to 45%, in the

event of aspirin use (A); similarly, the probability of primary stroke (B2) remains constant at 30%, and the probability of cardiovascular disease (B3) decreases from 50% to 25%, in the event of aspirin use. Therefore, although it occurs the least often among the three given vascular conditions, gastrointestinal bleeding occurs in the highest overall proportion among the patients who used aspirin in this study. Furthermore, although it occurs the most often among the three conditions, cardiovascular disease occurs in the lowest overall proportion among the patients who used aspirin in this study, suggesting a protective effect. Lastly, as the prior probability P(B2) and posterior probability P(B2 | A) are equal (0.30), the two corresponding events “Aspirin use” and “Primary stroke” are statistically independent. Hence, the event that a patient has a primary stroke conveys no information about aspirin use, and vice versa (although aspirin does have a protective effect against secondary stroke). The following Venn diagram shows the relations among these events, drawn approximately to scale.

[Venn diagram, approximately to scale: within B1, A ∩ B1 = 0.018 and the remainder = 0.182; within B2, A ∩ B2 = 0.012 and the remainder = 0.288; within B3, A ∩ B3 = 0.010 and the remainder = 0.490.]


                          S = “Survive”    S c = “Not Survive”
T = “Treatment”               0.32                 0.18
T c = “No Treatment”          0.08                 0.42

16. Events: S = “Five year survival” ⇒ S c = “Death within five years”;  T = “Treatment” ⇒ T c = “No Treatment”

Given:
Prior probability P(S) = 0.4  ⇒  P(S c) = 1 – P(S) = 0.6
Conditional probability P(T | S) = 0.8  ⇒  P(T c | S) = 1 – P(T | S) = 0.2
Conditional probability P(T | S c) = 0.3  ⇒  P(T c | S c) = 1 – P(T | S c) = 0.7

Posterior probabilities (via Bayes’ Formula):

(a) P(S | T) = P(T | S) P(S) / [P(T | S) P(S) + P(T | S c) P(S c)] = (0.8)(0.4) / [(0.8)(0.4) + (0.3)(0.6)] = 0.32/0.50 = 0.64

P(S | T c) = P(T c | S) P(S) / [P(T c | S) P(S) + P(T c | S c) P(S c)] = (0.2)(0.4) / [(0.2)(0.4) + (0.7)(0.6)] = 0.08/0.50 = 0.16

Given treatment (T), the probability of five-year survival (S) increases from a prior of 0.40 to a posterior of 0.64. Moreover, given no treatment (T c ), the probability of five-year survival (S) decreases from a prior of 0.40 to a posterior of 0.16. Hence, in this population, treatment is associated with a four-fold increase in the probability of five-year survival. (This is the relative risk.) Note, however, that this alone may not be enough to recommend treatment. Other factors, such as adverse side effects and quality of life issues, are legitimate patient concerns to be decided individually.

(b) Odds of survival, given treatment = P(S | T) / P(S c | T) = 0.64 / (1 – 0.64) = 1.778

Odds of survival, given no treatment = P(S | T c) / P(S c | T c) = 0.16 / (1 – 0.16) = 0.190

∴ Odds Ratio = 1.778 / 0.190 = 9.33


17. Let P(A) = a, P(B) = b, P(A ∩ B) = c, as shown. Then it follows that

(1) 0 ≤ c ≤ a ≤ 1 and 0 ≤ c ≤ b ≤ 1

as well as

(2) 0 ≤ a + b – c ≤ 1.

Therefore, ∆ = | c – ab | in this notation. It thus suffices to show that −1/4 ≤ ab − c ≤ +1/4. From (2), we see that

ab – c ≤ a(1 – a + c) – c = a – a² – (1 – a)c ≤ a – a² ≤ 1/4.

Clearly, this inequality is sharp when c = 0 and a = 1/2,

i.e., when P(A ∩ B) = 0 (e.g., A and B are disjoint) and

P(A) = 1/2. Moreover, because the definition of ∆ is

symmetric in A and B, it must also follow that P(B) = 1/2.

(See first figure below.) Furthermore, from (1),

ab – c ≥ (c)(c) – c = c2 – c ≥ – 14

.

This inequality is sharp when a = b = c = 1/2, i.e., when P(A) = P(B) = P(A ∩ B) = 1/2,

which implies that A = B, both having probability 1/2. (See second figure.)

[Figures: a generic Venn diagram with regions a – c, c, and b – c; the first sharp case, with A and B disjoint and P(A) = P(B) = 1/2; the second sharp case, with A = B and common probability 1/2.]


Table for part (a):
                        Treatment A
                       Yes      No
Treatment B   Yes     0.14     0.26     0.40
              No      0.21     0.39     0.60
                      0.35     0.65     1.00

Table for part (b):
                        Treatment A
                       Yes      No
Treatment B   Yes     0.14     0.40     0.54
              No      0.35     0.11     0.46
                      0.49     0.51     1.00

18.

(a) Given: P(A) = .35, P(B) = .40, P(A ∩ B) = .14. Then P(A only) = P(A ∩ Bc) = .35 – .14 = .21, P(B only) = P(Ac ∩ B) = .40 – .14 = .26, and P(Neither) = P(Ac ∩ Bc) = 1 – (.21 + .14 + .26) = 1 – .61 = .39, as shown in the first Venn diagram below. Since P(A ∩ B) = .14 and P(A) P(B) = (.35)(.40) = .14 as well, it follows that the two treatments are indeed statistically independent in this population. P(A or B) = .61 (calculated above); P(A xor B) = .21 + .26, or .61 – .14, = .47.

(b) Given: P(A only) = .35, P(B only) = .40, P(A ∩ B) = .14

Then P(Neither) = P(Ac ∩ Bc) = 1 – (.35 + .14 + .40) = 1 – .89 = .11, as shown in the second Venn diagram below. Since P(A ∩ B) = .14 and P(A) P(B) = (.49)(.54) = .2646, it follows that the two treatments are not statistically independent in this population. P(A or B) = .89 (calculated above); P(A xor B) = .35 + .40, or .89 – .14, = .75.

[Venn diagrams for Problem 18: (a) A only = .21, A ∩ B = .14, B only = .26, neither = .39; (b) A only = .35, A ∩ B = .14, B only = .40, neither = .11.]


19. Let events A = Adult, B = Male, C = White. We are told that

(1) P(A ∩ B | C) = 0.3, i.e., P(A ∩ B ∩ C)/P(C) = 0.3, so that P(A ∩ B ∩ C) = 0.3 P(C),

(2) P(A ∩ C | B) = 0.4, i.e., P(A ∩ B ∩ C)/P(B) = 0.4, so that P(A ∩ B ∩ C) = 0.4 P(B),

and finally,

(3) P(A | B ∩ C) = 0.5, i.e., P(A ∩ B ∩ C)/P(B ∩ C) = 0.5, so that P(A ∩ B ∩ C) = 0.5 P(B ∩ C).

Since the left-hand sides of all three equations are the same, it follows that all the right-hand sides are equal as well.

(a) Therefore, equating (1) and (3) yields

0.5 P(B ∩ C) = 0.3 P(C), i.e., P(B ∩ C) / P(C) = 0.3 / 0.5, or by definition, P(B | C) = 0.6, i.e., 60%,

and

(b) equating (2) and (3) yields

0.5 P(B ∩ C) = 0.4 P(B), i.e., P(B ∩ C) / P(B) = 0.4 / 0.5, or by definition, P(C | B) = 0.8, i.e., 80%.

20. Again, let events A = Adult, B = Male, C = White. We are here told that

• P(B | A) = .1, P(C | B) = .2, P(A | C) = .3

• P(A | B) = .4, P(B | C) = .5, P(C | A) = ?

However, it is true that

P(A | B) × P(B | C) × P(C | A) = P(B | A) × P(C | B) × P(A | C)

because [P(A ∩ B) / P(B)] × [P(B ∩ C) / P(C)] × [P(C ∩ A) / P(A)] = [P(B ∩ A) / P(A)] × [P(C ∩ B) / P(B)] × [P(A ∩ C) / P(C)], since

the numerators of each side are simply rearrangements of one another, as likewise are the denominators. Therefore,

.4 × .5 × P(C | A) = .1 × .2 × .3, i.e., P(C | A) = .03, or 3%.


Ismor Fischer, 9/29/2014 Solutions / 3.5-16

21. The Shell Game

(a) With 20 shells, the probability of winning exactly one game is 1/20, or .05; therefore, the probability of losing exactly one game is .95. Thus (reasonably assuming independence between game outcomes), the probability of losing all n games is equal to (.95)n, from which it follows that the probability of not losing all n games – i.e., P(winning at least one game) – is equal to 1 – (.95)n.

In order for this probability to be greater than .5 – i.e., 1 – (.95)^n > .5 – it must be true that (.95)^n < .5, or n > log(.5) / log(.95) = 13.51, so n ≥ 14 games. As n → ∞, it follows that (.95)^n → 0, so that P(win at least one game) = 1 – (.95)^n → 1 (“certainty”).

(b) Using the same logic as above with n shells, the probability of winning any single game is 1/n; therefore, the probability of losing any single game is 1 – 1/n. Thus (again, tacitly assuming independence between game outcomes), the probability of losing all n games is equal to (1 – 1/n)^n, from which it follows that the probability of not losing all n games – i.e., P(win at least one game) – is equal to 1 – (1 – 1/n)^n, which approaches 1 – e^(–1) = .632 as n → ∞.
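An optional R sketch that simulates part (a) – assuming, as above, 20 shells and independent games – is:

# Simulation sketch of the Shell Game, part (a): 20 shells, n games per play
set.seed(1)                                   # for reproducibility
win_at_least_once <- function(n, shells = 20, reps = 100000) {
  # each rep: play n games; a single game is won with probability 1/shells
  wins <- replicate(reps, any(runif(n) < 1 / shells))
  mean(wins)                                  # estimated P(win at least one game)
}
win_at_least_once(13)   # just under 0.5
win_at_least_once(14)   # just over 0.5, matching 1 - (.95)^14 = 0.512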

22. In progress…

23. Recall that RR = p/q and OR = [p / (1 – p)] / [q / (1 – q)] = (p/q) × (1 – q)/(1 – p), with p = P(D+ | E+) and q = P(D+ | E–).

The case RR = 1 is trivial, for then p = q, hence OR = 1 as well; this corresponds to the case of no association.

Suppose RR > 1. Then p > q, which implies (1 – q)/(1 – p) > 1, or p/q < (p/q) × (1 – q)/(1 – p), i.e., RR < OR.

Thus we have 1 < RR < OR. For the case RR < 1, simply reverse all the inequalities.

24. Let event A = “perfect square” = {1², 2², 3²,…, (10³)²}; then P(A) = 10³/10⁶ = 0.001. Likewise, let B = “perfect cube” = {1³, 2³, 3³,…, (10²)³}; then P(B) = 10²/10⁶ = 0.0001. Thus, A ⋂ B = “perfect sixth power” = {1⁶, 2⁶, 3⁶,…, 10⁶}; hence P(A ⋂ B) = 10/10⁶ = 0.00001. Therefore, P(A ⋃ B) = P(A) + P(B) – P(A ⋂ B) = 0.001 + 0.0001 – 0.00001 = 0.00109.
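This count can be verified with a brute-force R sketch over the integers 1 to 10^6:

# R check of problem 24: squares, cubes, and sixth powers among 1, ..., 10^6
N <- 10^6
x <- 1:N
mean(x %in% (1:1000)^2 | x %in% (1:100)^3)   # 0.00109, matching P(A or B)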


Ismor Fischer, 9/29/2014 Solutions / 3.5-17

25. We construct a counterexample as follows. Let 0 < ε < 1 be a fixed but arbitrary number. Also, with no loss of generality, assume the first toss results in 1 (i.e., Heads), as shown. Now suppose that the next n₀ − 1 tosses all result in 0 (i.e., Tails), where n₀ > 1/ε. It then follows that the proportion of Heads in the first n₀ tosses is 1/n₀ < ε, i.e., arbitrarily close to 0. Now suppose that the next n₁ − n₀ tosses all result in 1 (i.e., Heads), where n₁ > (n₀ − 1)/ε. It then follows that the proportion of Heads in the first n₁ tosses is (n₁ − n₀ + 1)/n₁ > 1 − ε, i.e., arbitrarily close to 1. By continuing to attach sufficiently large blocks of zeros and ones in this manner – i.e., n₂ > (n₁ − n₀ + 1)/ε, n₃ > (n₂ − n₁ + n₀ − 1)/ε, … – an infinite sequence is generated that does not converge, but forever oscillates between values which come arbitrarily close to 0 and 1, respectively.

Blocks:  1 toss;  n₀ − 1 Tails;  n₁ − n₀ Heads;  n₂ − n₁ Tails;  n₃ − n₂ Heads;  etc.
Tosses:  1  000…0  111………1  0000000……0  11111………1  etc.

# Heads:  X₀ = 1,  X₁ = n₁ − n₀ + 1,  X₂ = n₁ − n₀ + 1,  X₃ = n₃ − n₂ + n₁ − n₀ + 1, …

Proportion of Heads X/n:  1/n₀ < ε,  (n₁ − n₀ + 1)/n₁ > 1 − ε,  (n₁ − n₀ + 1)/n₂ < ε,  (n₃ − n₂ + n₁ − n₀ + 1)/n₃ > 1 − ε, …

Exercise: Prove that nₖ₊₁ > max{ nₖ, (1 − ε)nₖ/ε } for k = 0, 1, 2,…

Hint: By construction, nₖ₊₁ > [nₖ − nₖ₋₁ + nₖ₋₂ − ⋯ ± (n₀ − 1)] / ε. From this, show that nₖ₊₁ > (1 − ε)nₖ/ε.


Ismor Fischer, 9/29/2014 Solutions / 3.5-18

[Venn diagram for problem 29 in the case d = abc: A only = a(1 – b)(1 – c), B only = (1 – a)b(1 – c), C only = (1 – a)(1 – b)c; A ⋂ B only = ab(1 – c), A ⋂ C only = a(1 – b)c, B ⋂ C only = (1 – a)bc; A ⋂ B ⋂ C = abc; exterior = (1 – a)(1 – b)(1 – c).]

26. In progress…

27. Label the empty cells as shown.

.01    x    .02
 y     ?     z     .50
.03    w    .04
      .60          1.00

It then follows that:

(1) .01 + x + .02 + y + ? + z + .03 + w + .04 = 1, i.e., x + y + z + w + ? = .90
(2) x + ? + w = .60
(3) y + ? + z = .50

Adding equations (2) and (3) together yields x + y + z + w + 2(?) = 1.10. Subtracting equation (1) from this yields ? = .20.

28. In progress…

29. Careful calculation shows that P(A) = a, P(B) = b, P(C) = c, and P(A⋂B) = ab, P(A⋂C) = ac, P(B⋂C) = bc, so that the events are indeed pairwise independent. However, the triple intersection P(A⋂B⋂C) = d, an arbitrary value. Thus P(A ⋂ B ⋂ C) ≠ P(A) P(B) P(C), unless d = abc. In that case, the Venn diagram simplifies to the unsurprising form shown above.


Ismor Fischer, 9/29/2014 Solutions / 3.5-19

30. Bar Bet

(a) Absolutely not! To see why, let us start with the simpler scenario of drawing four cards from a fair deck, with replacement. In this case, all cards have an equal likelihood of being selected (namely, 1/52). This being the case, and the fact that there are 12 face cards in a standard deck, it follows that the probability of selecting a face card is 12/52, and the outcome of any selection is statistically independent of any other selection. To calculate the probability of at least one face card, we can subtract the probability of the complement – no face cards – from 1. That is, 1 – the probability of picking 4 non-face cards: 1 – (40/52)⁴ = 0.65.

Now suppose we modify the scenario to selecting n = 4 cards, without replacement. Unlike the above, the probability of selecting a face card now changes with every draw, making the outcomes statistically dependent. Since the number of cards decreases by one with each draw, the probability of picking all 4 non-face cards is no longer simply (40/52)⁴ = (40/52)(40/52)(40/52)(40/52), but (40/52)(39/51)(38/50)(37/49).

Therefore, the probability of picking ≥ 1 face card = 1 – (40/52)(39/51)(38/50)(37/49) = 0.6624. This means that I will win the bet approximately 2 out of 3 times! Counterintuitive perhaps, but true nonetheless.

(b) No, you should still not take the bet. Using the same logic with n = 3 draws, the probability of picking at least one face card = 1 – (40/52)(39/51)(38/50) = 0.5529. Thus, I still enjoy about a 5+% advantage over “even money” (i.e., 50%). On average, I will win about 11 out of every 20 games played, and come out ahead.

(c) The R simulation should be consistent with the result found in part (a), namely, that the proportion of wins ≈ 0.6624, and therefore the proportion of losses ≈ 0.3376.

Note: For those who remember “combinatorics,” another way to arrive at this value is the following: There are C(52, 4) ways of randomly selecting 4 cards from the deck of 52. Of this number, there are C(40, 4) ways of randomly selecting 4 non-face cards. The ratio of the two, C(40, 4) ÷ C(52, 4), yields the same value as the four-factor product above.
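For part (c), a minimal simulation sketch along these lines (this is only an illustration, not the posted Rcode itself) might look like:

# A minimal simulation sketch for the Bar Bet
set.seed(42)
deck <- rep(c("face", "other"), times = c(12, 40))       # 12 face cards, 40 non-face cards
reps <- 100000
wins <- replicate(reps, any(sample(deck, 4) == "face"))  # draw 4 cards without replacement
mean(wins)          # proportion of wins, approximately 0.6624
1 - mean(wins)      # proportion of losses, approximately 0.3376

# Exact value via the combinatorial note above:
1 - choose(40, 4) / choose(52, 4)                        # 0.6624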


Ismor Fischer, 9/21/2014 3.5-1

3.5 Problems

1. In a certain population of males, the following longevity probabilities are determined.

P(Live to age 60) = 0.90

P(Live to age 70, given live to age 60) = 0.80

P(Live to age 80, given live to age 70) = 0.75

From this information, calculate the following probabilities.

P(Live to age 70)

P(Live to age 80)

P(Live to age 80, given live to age 60)

2. Refer to the “barking dogs” problem in section 3.2.

(a) Are the events “Angel barks” and “Brutus barks” statistically independent?

(b) Calculate each of the following probabilities.

P(Angel barks OR Brutus barks)

P(NEITHER Angel barks NOR Brutus barks), i.e., P(Angel does not bark AND Brutus does not bark)

P(Only Angel barks) i.e., P(Angel barks AND Brutus does not bark)

P(Only Brutus barks) i.e., P(Angel does not bark AND Brutus barks)

P(Exactly one dog barks)

P(Brutus barks | Angel barks)

P(Brutus does not bark | Angel barks)

P(Angel barks | Brutus does not bark)

Also construct a Venn diagram, and a 2 × 2 probability table, including marginal sums. 3. Referring to the “urn model” in section 3.2, are the events A = “First ball is red” and

B = “Second ball is red” independent in this sampling without replacement scenario? Does this agree with your intuition? Rework this problem in the sampling with replacement scenario.

4. After much teaching experience, Professor F has come up with a conjecture about office hours:

“There is a 75% probability that a random student arrives to a scheduled office hour within the first 15 minutes (event A), from among those students who come at all (event B). Furthermore, there is an 80% probability that no students will come to the office hour, given that no students arrive within the first 15 mins.” Answer the following. (Note: Some algebra may be involved.)

(a) Calculate P(B), the probability that any students come to the office hour. (b) Calculate P(A), the probability that any students arrive in the first 15 mins of the office hour.

(c) Sketch a Venn diagram, and label all probabilities in it.


Ismor Fischer, 9/21/2014 3.5-2

5. Suppose that, in a certain population of cancer patients having similar ages, lifestyles, etc., two

categorical variables – I = Income (Low, Middle, High) and J = Disease stage (1, 2, 3, 4) – have probabilities corresponding to the column and row marginal sums in the 3 × 4 table shown.

                       Cancer stage
                    1      2      3      4
Income    Low                                     0.5
Level     Middle                                  0.3
          High                                    0.2
                   0.1    0.2    0.3    0.4      1.0

(a) Suppose I and J are statistically independent.∗

Complete all entries in the table.

(b) For each row i = 1, 2, 3, calculate the following conditional probabilities, across the columns j = 1, 2, 3, 4:

P(Low Inc | Stage 1), P(Low Inc | Stage 2), P(Low Inc | Stage 3), P(Low Inc | Stage 4) P(Mid Inc | Stage 1), P(Mid Inc | Stage 2), P(Mid Inc | Stage 3), P(Mid Inc | Stage 4) P(High Inc | Stage 1), P(High Inc | Stage 2), P(High Inc | Stage 3), P(High Inc | Stage 4)

Confirm that, for j = 1, 2, 3, 4:

P(Low Income | Stage j) are all equal to the unconditional row probability P(Low Income).

P(Mid Income | Stage j) are all equal to the unconditional row probability P(Mid Income).

P(High Income | Stage j) are all equal to the unconditional row probability P(High Income).

That is, P(Income i | Stage j) = P(Income i). Is this consistent with the information in (a)? Why?

(c) Now for each column j = 1, 2, 3, 4, compute the following conditional probabilities, down the rows i = 1, 2, 3:

P(Stage 1 | Low Inc), P(Stage 2 | Low Inc), P(Stage 3 | Low Inc), P(Stage 4 | Low Inc), P(Stage 1 | Mid Inc), P(Stage 2 | Mid Inc), P(Stage 3 | Mid Inc), P(Stage 4 | Mid Inc), P(Stage 1 | High Inc). P(Stage 2 | High Inc). P(Stage 3 | High Inc). P(Stage 4 | High Inc).

Likewise confirm that, for i = 1, 2, 3:

P(Stage 1 | Income i) are all equal to the unconditional column probability P(Stage 1).

P(Stage 2 | Income i) are all equal to the unconditional column probability P(Stage 2).

P(Stage 3 | Income i) are all equal to the unconditional column probability P(Stage 3).

P(Stage 4 | Income i) are all equal to the unconditional column probability P(Stage 4).

That is, P(Stage j | Income i) = P(Stage j). Is this consistent with the information in (a)? Why?

∗ Technically, we have only defined statistical independence for events, but it can be formally extended to general random variables in a natural way. For categorical variables such as these, every category (viewed as an event) in I, is statistically independent of every category (viewed as an event) in J, and vice versa.



Ismor Fischer, 9/21/2014 3.5-3

6. A certain medical syndrome is usually associated with two overlapping sets of symptoms, A and B.

Suppose it is known that: If B occurs, then A occurs with probability 0.80 . If A occurs, then B occurs with probability 0.90 . If A does not occur, then B does not occur with probability 0.85 . Find the probability that A does not occur if B does not occur. (Hint: Use a Venn diagram; some algebra may also be involved.)

7. The progression of a certain disease is typically characterized by the onset of up to three distinct

symptoms, with the following properties:

Each symptom occurs with 60% probability.

If a single symptom occurs, there is a 45% probability that the two other symptoms will also occur.

If any two symptoms occur, there is a 75% probability that the remaining symptom will also occur.

Answer each of the following. (Hint: Use a Venn diagram.) (a) What is the probability that all three symptoms will occur?

(b) What is the probability that at least two symptoms occur?

(c) What is the probability that exactly two symptoms occur?

(d) What is the probability that exactly one symptom occurs?

(e) What is the probability that none of the symptoms occurs?

(f) Is the event that a symptom occurs statistically independent of the event that any other

symptom occurs?

8. I have a nephew Berkeley and niece Chelsea (true) who, when very young, would occasionally visit their Uncle Ismor on weekends (also true). Furthermore,

i. Berkeley and Chelsea visited independently of one another.

ii. Berkeley visited with probability 80%.

iii. Chelsea visited with probability 75%.

However, it often happened that some object in his house – especially if it was fragile – accidentally broke during such visits (not true). Furthermore,

iv. The probability of such an accident occurring, given that both children visited, was 90%.

v. The probability of such an accident occurring, given that only Berkeley visited, was 35%.

vi. The probability of such an accident occurring, given that only Chelsea visited, was 20%.

vii. The probability of such an accident occurring, given that neither child visited, was 2%.

Sketch and label a Venn diagram for events A = Accident, B = Berkeley visited, and C = Chelsea visited. (Hint: The Exercise on page 3.2-3 might be useful.)


Ismor Fischer, 9/21/2014 3.5-4

9. At a certain meteorological station, data are being collected about the behavior of

thunderstorms, using two lightning rods A and B. It is determined that, during a typical storm, there is a 99% probability that lightning will strike at least one of the rods. Moreover, if A is struck, there is a 60% probability that B will also be struck, whereas if B is struck, there is a 75% probability that A will also be struck. Calculate the probability of each of the following events. (Hint: See PowerPoint section 3.2, slide 28.)

• Both rods A and B are struck by lightning

• Rod A is struck by lightning

• Rod B is struck by lightning

Are the two events “A is struck” and “B is struck” statistically independent? Explain.

10. The Monty Hall Problem (simplest version)

Between 1963 and 1976, a popular game show called “Let’s Make A Deal” aired on network television, starring charismatic host Monty Hall, who would engage in “deals” – small games of chance – with randomly chosen studio audience members (usually dressed in outrageous costumes) for cash and prizes. One of these games consisted of first having a contestant pick one of three closed doors, behind one of which was a big prize (such as a car), and behind the other

two were “zonk” prizes (often a goat, or some other farm animal). Once a selection was made, Hall – who knew what was behind each door – would open one of the other doors that contained a zonk. At this point, Hall would then offer the contestant a chance to switch their choice to the other closed door, or stay with their original choice, before finally revealing the contestant’s chosen prize.

Question: In order to avoid “getting zonked,” should the

optimal strategy for the contestant be to switch, stay, or does it not make a difference?


Ismor Fischer, 9/21/2014 3.5-5

11. (a) Given the following information about three events A, B, and C:

P(A ∪ B) = 0.69    P(A ∩ B) = 0.19
P(A ∪ C) = 0.70    P(A ∩ C) = 0.20
P(B ∪ C) = 0.71    P(B ∩ C) = 0.21

Find the values of P(A), P(B), and P(C).

(b) Suppose it is also known that the two events A ∩ C and B are statistically independent. Sketch a Venn diagram for events A, B, and C.

12. Recall that in a prospective cohort study, exposure (E+ or E−) is given, so that the odds ratio is defined as

OR = (odds of disease, given exposure) / (odds of disease, given no exposure) = [P(D+ | E+) ÷ P(D− | E+)] / [P(D+ | E−) ÷ P(D− | E−)].

Recall that in a retrospective case-control study, disease status (D+ or D−) is given; in this case, the corresponding odds ratio is defined as

OR = (odds of exposure, given disease) / (odds of exposure, given no disease) = [P(E+ | D+) ÷ P(E− | D+)] / [P(E+ | D−) ÷ P(E− | D−)].

Show algebraically that these two definitions are mathematically equivalent, so that the same “cross product ratio” calculation can be used in either a cohort or case-control study, as the following two problems demonstrate. (Recall the definition of conditional probability.)

13. Under construction…


Ismor Fischer, 9/21/2014 3.5-6

14. Under construction…

15. An observational study investigates the connection between aspirin use and three vascular

conditions – gastrointestinal bleeding, primary stroke, and cardiovascular disease – using a group of patients exhibiting these disjoint conditions with the following prior probabilities: P(GI bleeding) = 0.2, P(Stroke) = 0.3, and P(CVD) = 0.5, as well as with the following conditional probabilities: P(Aspirin | GI bleeding) = 0.09, P(Aspirin | Stroke) = 0.04, and P(Aspirin | CVD) = 0.02.

(a) Calculate the following posterior probabilities: P(GI bleeding | Aspirin), P(Stroke | Aspirin),

and P(CVD | Aspirin). (b) Interpret: Compare the prior probability of each category with its corresponding posterior

probability. What conclusions can you draw? Be as specific as possible. 16. On the basis of a retrospective study, it is determined (from hospital records, tumor registries, and

death certificates) that the overall five-year survival (event S) of a particular form of cancer in a population has a prior probability of P(S) = 0.4. Furthermore, the conditional probability of having received a certain treatment (event T) among the survivors is given by P(T | S) = 0.8, while the conditional probability of treatment among the non-survivors is only P(T | Sc) = 0.3.

(a) A cancer patient is uncertain about whether or not to undergo this treatment, and consults with

her oncologist, who is familiar with this study. Compare the prior probability of overall survival given above with each of the following posterior probabilities, and interpret in context.

Survival among treated individuals, P(S | T)

Survival among untreated individuals, P(S | Tc),

(b) Also calculate the following.

Odds of survival, given treatment

Odds of survival, given no treatment

Odds ratio of survival for this disease

17. Recall that two events A and B are statistically independent if P(A ∩ B) = P(A) P(B). It therefore follows that the difference

∆ = | P(A ∩ B) − P(A) P(B) |

is a measure of “how far” from statistical independence any two arbitrary events A and B are. Prove that ∆ ≤ 1/4. When is the inequality sharp? (That is, when is equality achieved?)

[Timeline sketch for problem 16 – PAST to PRESENT, spanning 5 years. Given: Survivors (S) vs. Non-survivors (Sc), with P(S) = 0.4; Treatment (T): P(T | S) = 0.8, P(T | Sc) = 0.3.]

WARNING! This problem is not for the mathematically timid.


Ismor Fischer, 9/21/2014 3.5-7

18. First, recall that, for any two events A and B, the union A ∪ B defines the “inclusive or” – i.e.,

“Either A occurs, or B occurs, or both.”

Now, consider the event “Only A” – i.e., “Event A occurs, and event B does not occur” – defined as the intersection A ∩ Bc, also denoted as the difference A – B. Likewise, “Only B” = “B and not A” = B ∩ Ac = B – A. Using these, we can define “xor” – the so-called “exclusive or” – i.e., “Either A occurs, or B occurs, but not both” – as the union (A – B) ∪ (B – A), or equivalently, (A ∪ B) – (A ∩ B). This is also sometimes referred to as the symmetric difference between A and B, denoted A ∆ B. (See the two regions corresponding to the formulas labeled below.)

(a) Suppose that two treatment regimens A and B exist for a certain medical condition. It is

reported that 35% of the total patient population receives Treatment A, 40% receives Treatment B, and 14% receives both treatments. Construct the corresponding Venn diagram and 2 × 2 probability table. Are the two treatments A and B statistically independent of one another?

Calculate P(A or B), and P(A xor B).

(b) Suppose it is discovered that an error was made in the original medical report, and it is

actually the case that 35% of the population receives only Treatment A, 40% receives only Treatment B, and 14% receives both treatments. Construct the corresponding Venn diagram and 2 × 2 probability table. Are the two treatments A and B statistically independent of one another?

Calculate P(A or B), and P(A xor B).

[Venn diagram: overlapping events A and B, with the regions A – B = A ∩ Bc, A ∩ B, and B – A = Ac ∩ B labeled.]


Ismor Fischer, 9/21/2014 3.5-8

19. Three of the most common demographic variables used in epidemiological studies are age, sex, and race. Suppose it is known that, in a certain population,

• 30% of whites are men, 40% of males are white men, 50% of white males are men.

(a) What percentage of whites are male? Formally justify your answer!

(b) What percentage of males are white? Formally justify your answer!

Hint: Follow the same notation as the example in section 3.2, slide 24, of the PowerPoint slides.

20. In another epidemiological study, it is known that, for a certain population,

• 10% of adults are men, 20% of males are white, 30% of whites are adults

• 40% of males are men, 50% of whites are male.

What percentage of adults are white?

Hint: Find a connection between the products P(A | B) P(B | C) P(C | A) and P(B | A) P(C | B) P(A | C).

21. The Shell Game. In the traditional version, a single pea is placed under one of three walnut

half-shells in full view of an observer. The shells are then quickly shuffled into a new random arrangement, and the observer then guesses which shell contains the pea. If the guess is correct, the observer wins.

(a) For the sake of argument, suppose there are 20 half-shells instead of three,

and the observer plays the game a total of n times. What is the probability that he/she will guess correctly at least once out of those n times? How large must n be, in order to guarantee that the probability of winning is over 50%? What happens to the probability as n → ∞ ?

(b) Now suppose there are n half-shells, and the observer plays the game a total of n times. What is the probability that he/she will guess correctly at least once out of those n times? What happens to this probability as n → ∞ ?

Hint (for both parts): First calculate the probability of losing all n times.

22. (a) By definition, two events A and B are statistically independent if and only if P(A | B) = P(A). Prove mathematically that two events A and B are independent if and only if P(A | B) = P(A | Bc).

[Hint: Let P(A) = a, P(B) = b, P(A ⋂ B) = c, and use either a Venn diagram or a 2 × 2 table.]

(b) More generally, let events A, B1, B2, …, Bn be defined as in Bayes’ Theorem. Prove that:

A and B1 are independent, A and B2 are independent, …, A and Bn are independent

if and only if P(A | B1) = P(A | B2) = … = P(A | Bn).

[Hint: Use the Law of Total Probability.]


Ismor Fischer, 9/21/2014 3.5-9

23. Prove that the relative risk RR is always between 1 and the odds ratio OR. (Note there are three

possible cases to consider: RR < 1, RR = 1, and RR > 1.) 24. Consider the following experiment. Pick a random integer from 1 to one million (106). What is

the probability that it is either a perfect square (1, 4, 9, 16, …) or a perfect cube (1, 8, 27, 64,…)?

25. As defined at the beginning of this chapter, the probability of Heads of a coin is formally identified with lim (as n → ∞) of X(n)/n – when that limiting value exists – where n = # tosses, and X(n) = # Heads in those n tosses. Show by a mathematical counterexample that in fact, this limit need not necessarily exist. That is, provide an explicit sequence of Heads and Tails (or ones and zeros) for which the ratio X(n)/n does not converge to a unique finite value, as n increases.

26. Warning: These may not be quite as simple as they look.

(a) Consider two independent events A and B. Suppose A occurs with probability 60%, while “B only” occurs with probability 30%. Calculate the probability that B occurs, i.e., P(B).

(b) Consider two independent events C and D. Suppose they both occur together with

probability 72%, while there is a 2% probability that neither event occurs. Calculate the probabilities P(C) and P(D).

27. Solve for the middle cell probability (“?”) in the following partially-filled probability table.

.01 .02

? .50

.03 .04

.60 28. How far away can a prior probability be from its posterior probabilities?

Consider two events A and B, and let P(A | B) = p and P(A | Bc) = q be fixed probabilities. If p = q, then A and B are statistically independent (see problem 22 above), and thus the prior probability P(B) coincides with its corresponding posterior probabilities P(B | A) and P(B | Ac) exactly, yielding a minimum value of 0 for the absolute differences

| P(B) − P(B | A) | and | P(B) − P(B | Ac) |.

In terms of p and q (with p ≠ q), what must P(B) be for the maximum absolute differences to occur, and what are their respective values?


Ismor Fischer, 9/21/2014 3.5-10

29. Let A, B, and C be three pairwise-independent events, that is, A and B are independent, B and C

are independent, and A and C are independent. It does not necessarily follow that P(A ⋂ B ⋂ C) = P(A) P(B) P(C), as the following Venn diagram illustrates. Provide the details.

30. Bar Bet

(a) Suppose I ask you to pick any four cards at random from a deck of 52, without replacement, and bet you one dollar that at least one of the four is a face card (i.e., Jack, Queen, or King). Should you take the bet? Why? (Hint: See how the probability of this event compares to 50%. If this is too hard, try it with replacement first.)

(b) What if the bet involves picking three cards at random instead of four? Should you take the bet then? Why?

(c) Refer to the posted Rcode folder for this part. Please answer all questions.

[Venn diagram for problem 29: A only = a(1 – b – c) + d, B only = b(1 – a – c) + d, C only = c(1 – a – b) + d; A ⋂ B only = ab – d, A ⋂ C only = ac – d, B ⋂ C only = bc – d; A ⋂ B ⋂ C = d; exterior = 1 – a – b – c + ab + ac + bc – d.]


4. Classical Probability Distributions

4.1 Discrete Models
4.2 Continuous Models
4.3 Summary Chart
4.4 Problems


Ismor Fischer, 5/29/2012 4.1-1

4. Classical Probability Distributions

4.1 Discrete Models

FACT: Random variables can be used to define events that involve measurement!

Experiment 3a: Roll one fair die... Discrete random variable X = “value obtained”

Sample Space: S = {1, 2, 3, 4, 5, 6} #(S) = 6

Because the die is fair, each of the six faces has an equally likely probability of occurring, i.e., 1/6. The probability distribution for X can be defined by a so-called probability mass function (pmf) f(x), organized in a probability table, and displayed via a corresponding probability histogram, as shown.

Comment on notation:

Event: P(X = 4) = 1/6

Translation: “The probability of rolling 4 is 1/6.”

Likewise for the other probabilities P(X = 1), P(X = 2),…, P(X = 6) in this example. A mathematically succinct way to write such probabilities is by the notation P(X = x), where x = 1, 2, 3, 4, 5, 6. In general therefore, since this depends on the value of x, we can also express it as a mathematical function of x (specifically, the pmf; see above), written f(x). Thus the two notations are synonymous and interchangeable. The previous example could just as well have been written f(4) = 1/6.

x    f(x) = P(X = x)
1    1/6
2    1/6
3    1/6
4    1/6
5    1/6
6    1/6
     1


[Probability histogram: the “Uniform Distribution” on X – six bars, each of height 1/6.]


Ismor Fischer, 5/29/2012 4.1-2

Experiment 3b: Roll two distinct, fair dice. ⇒ Outcome = (Die 1, Die 2)

Sample Space: S = {(1, 1), …, (6, 6)} #(S) = 62 = 36

Discrete random variable X = “Sum of the two dice (2, 3, 4, …, 12).” Events: “X = 2” = {(1, 1)} #(X = 2) = 1

“X = 3” = {(1, 2), (2, 1)} #(X = 3) = 2

“X = 4” = {(1, 3), (2, 2), (3, 1)} #(X = 4) = 3

“X = 5” = {(1, 4), (2, 3), (3, 2), (4, 1)} #(X = 5) = 4

“X = 6” = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} #(X = 6) = 5

“X = 7” = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)} #(X = 7) = 6

“X = 8” = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)} #(X = 8) = 5

“X = 9” = {(3, 6), (4, 5), (5, 4), (6, 3)} #(X = 9) = 4

“X = 10” = {(4, 6), (5, 5), (6, 4)} #(X = 10) = 3

“X = 11” = {(5, 6), (6, 5)} #(X = 11) = 2

“X = 12” = {(6, 6)} #(X = 12) = 1

Recall that, by definition, each event “X = x” (where x = 2, 3, 4,…, 12) corresponds to a specific subset of outcomes from the sample space (of ordered pairs, in this case). Because we are still assuming equal likelihood of each die face appearing, the probabilities of these events can be easily calculated by the “shortcut” formula P(A) = #(A) / #(S). Question for later: What if the dice are “loaded” (i.e., biased)?

(1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)

(2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6)

(3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6)

(4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6)

(5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6)

(6, 1) (6, 2) (6, 3) (6, 4) (6, 5) (6, 6)


Ismor Fischer, 5/29/2012 4.1-3

[Probability histogram for X over x = 2, 3, …, 12, with bar heights 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36.]

Again, the probability distribution for X can be organized in a probability table, and displayed via a probability histogram, both of which enable calculations to be done easily:

x     f(x) = P(X = x)
2     1/36
3     2/36
4     3/36
5     4/36
6     5/36
7     6/36
8     5/36
9     4/36
10    3/36
11    2/36
12    1/36
      1

P(X = 7 or X = 11) Note that “X = 7” and “X = 11” are disjoint!

= P(X = 7) + P(X = 11) via Formula (3) above

= 6/36 + 2/36 = 8/36

P(5 ≤ X ≤ 8)

= P(X = 5 or X = 6 or X = 7 or X = 8)

= P(X = 5) + P(X = 6) + P(X = 7) + P(X = 8)

= 4/36 + 5/36 + 6/36 + 5/36

= 20/36

P(X < 10) = 1 − P(X ≥ 10) via Formula (1) above

= 1 − [P(X = 10) + P(X = 11) + P(X = 12)]

= 1 − [3/36 + 2/36 + 1/36] = 1 − 6/36 = 30/36

Exercise: How could event E = “Roll doubles” be characterized in terms of a random variable? (Hint: Let Y = “Difference between the two dice.”)
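A short, optional R sketch that builds this probability table by brute force and reproduces the three calculations above:

# Sketch in R: distribution of X = sum of two fair dice
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)          # the 36 equally likely outcomes
X <- rolls$die1 + rolls$die2
f <- table(X) / 36                                    # probability table f(x) = P(X = x)
f

sum(f[c("7", "11")])                                  # P(X = 7 or X = 11) = 8/36
sum(f[c("5", "6", "7", "8")])                         # P(5 <= X <= 8)     = 20/36
1 - sum(f[c("10", "11", "12")])                       # P(X < 10)          = 30/36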


Ismor Fischer, 5/29/2012 4.1-4

The previous example motivates the important topic of...

Discrete Probability Distributions

In general, suppose that all of the distinct population values of a discrete random variable X are sorted in increasing order: x1 < x2 < x3 < …, with corresponding probabilities of occurrence f(x1), f(x2), f(x3), … Formally then, we have the following.

Definition: f(x) is a probability distribution function for the discrete random variable X if, for all x,

f(x) ≥ 0   AND   ∑ f(x) over all x = 1.

In this case, f(x) = P(X = x), the probability that the value x occurs in the population.

The cumulative distribution function (cdf) is defined as, for all x,

F(x) = P(X ≤ x) = ∑ f(xᵢ) over all xᵢ ≤ x = f(x₁) + f(x₂) + … + f(x).

Therefore, F is piecewise constant, increasing from 0 to 1.

Furthermore, for any two population values a < b, it follows that

P(a ≤ X ≤ b) = ∑ f(x) from x = a to x = b = F(b) – F(a−),

where a− is the value just preceding a in the sorted population.

Exercise: Sketch the cdf F(x) for Experiments 3a and 3b above.

[Figures: the probability histogram of f(x) over x₁, x₂, x₃, … with total area = 1, and the corresponding step-function cdf F(x), increasing from 0 to 1.]


Ismor Fischer, 5/29/2012 4.1-5

Population Parameters μ and σ² (vs. Sample Statistics x̄ and s²)

• population mean = the “expected value” of the random variable X

= the “arithmetic average” of all the population values

Compare this with the relative frequency definition of sample mean given in §2.3.

• population variance = the “expected value” of the squared deviation of the

random variable X from its mean (μ)

Compare the first with the definition of sample variance given in §2.3. (The second is the analogue of the alternate computational formula.) Of course, the population standard deviation σ is defined as the square root of the variance.

−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− *Exercise: Algebraically expand the expression (X − µ)2, and use the properties of expectation given above.

If X is a discrete numerical random variable, then…

μ = E[X] = ∑ x f(x), where f(x) = P(X = x), the probability of x.

If X is a discrete numerical random variable, then…

σ 2 = E[(X − µ)2] = ∑ (x − µ)2 f(x).

Equivalently,*

σ 2 = E[X 2] − µ 2 = ∑ x2 f(x) − µ 2 ,

where f(x) = P(X = x), the probability of x.

Properties of Mathematical Expectation

1. For any constant c, it follows that E[cX] = c E[X].

2. For any two random variables X and Y, it follows that

E[X + Y] = E[X] + E[Y] and, via Property 1,

E[X − Y] = E[X] − E[Y]. Any “operator” on variables satisfying 1 and 2 is said to be linear.


Ismor Fischer, 5/29/2012 4.1-6

Experiment 4: Two populations, where the daily number of calories consumed is designated by X1 and X2, respectively.

Population 1 – Probability Table

x      f1(x)
2300   0.1
2400   0.2
2500   0.3
2600   0.4

Mean(X1) = µ1 = (2300)(0.1) + (2400)(0.2) + (2500)(0.3) + (2600)(0.4) = 2500 cals

Var(X1) = σ1² = (–200)²(0.1) + (–100)²(0.2) + (0)²(0.3) + (+100)²(0.4) = 10000 cals²

Population 2 – Probability Table

x      f2(x)
2200   0.2
2300   0.3
2400   0.5

Mean(X2) = µ2 = (2200)(0.2) + (2300)(0.3) + (2400)(0.5) = 2330 cals

Var(X2) = σ2² = (–130)²(0.2) + (–30)²(0.3) + (70)²(0.5) = 6100 cals²

[Probability histograms: bar heights 0.1, 0.2, 0.3, 0.4 over 2300–2600 cals for Population 1, and 0.2, 0.3, 0.5 over 2200–2400 cals for Population 2.]
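These population parameters can be checked with a few lines of R (a sketch using the probability tables above):

# Sketch in R: population mean and variance from a probability table
x1 <- c(2300, 2400, 2500, 2600); f1 <- c(0.1, 0.2, 0.3, 0.4)
x2 <- c(2200, 2300, 2400);       f2 <- c(0.2, 0.3, 0.5)

mu1 <- sum(x1 * f1); mu1                      # 2500 cals
sum((x1 - mu1)^2 * f1)                        # 10000 cals^2

mu2 <- sum(x2 * f2); mu2                      # 2330 cals
sum((x2 - mu2)^2 * f2)                        # 6100 cals^2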


Ismor Fischer, 5/29/2012 4.1-7

Summary (Also refer back to 2.4 - Summary)

POPULATION (Parameters)

Discrete random variable X. Probability Table → Probability Histogram: x, f(x) = P(X = x) for x1, x2, …, with the f(x) summing to 1.

µ = E[X] = Σ x f(x)

σ² = E[(X − µ)²] = Σ (x − µ)² f(x), or equivalently, σ² = E[X²] − µ² = Σ x² f(x) − µ²

SAMPLE, size n (Statistics)

Relative Frequency Table → Density Histogram: x, f(x) = freq(x)/n for x1, x2, …, xk, with the f(x) summing to 1.

x̄ = Σ x f(x)

s² = [n/(n − 1)] Σ (x − x̄)² f(x), or equivalently, s² = [n/(n − 1)] [Σ x² f(x) − x̄²]

X̄ and S² can be shown to be unbiased estimators of µ and σ², respectively. That is, E[X̄] = µ, and E[S²] = σ². (In fact, they are MVUE.)


Ismor Fischer, 5/29/2012 4.1-8

~ Some Advanced Notes on General Parameter Estimation ~

Suppose that θ is a fixed population parameter (e.g., µ), and θ̂ is a sample-based estimator (e.g., X̄). Consider all the random samples of a given size n, and the resulting “sampling distribution” of θ̂ values. Formally define the following:

Mean (of θ̂) = E[θ̂], the expected value of θ̂.

Bias = E[θ̂] − θ, the difference between the expected value of θ̂, and the “target” parameter θ.

Variance (of θ̂) = E[(θ̂ − E[θ̂])²], the expected value of the squared deviation of θ̂ from its mean E[θ̂], or equivalently,* E[θ̂²] − (E[θ̂])².

Mean Squared Error (MSE) = E[(θ̂ − θ)²], the expected value of the squared difference between estimator θ̂ and the “target” parameter θ.

Exercise: Prove* that MSE = Variance + Bias².

Comment: A parameter estimator θ̂ is defined to be unbiased if E[θ̂] = θ, i.e., Bias = 0. In this case, MSE = Variance, so that if θ̂ minimizes MSE, it then follows that it has the smallest variance of any estimator. Such a highly desirable estimator is called MVUE (Minimum Variance Unbiased Estimator). It can be shown that the estimators X̄ and S² (of µ and σ², respectively) are MVUE, but finding such an estimator θ̂ for a general parameter θ can be quite difficult in practice. Often, one must settle for either not having minimum variance or having a small amount of bias.

* using the basic properties of mathematical expectation given earlier

[Figure – vector interpretation, from SAMPLE Statistic θ̂ to POPULATION Parameter θ: with a = θ̂ − E[θ̂], b = E[θ̂] − θ, and c = θ̂ − θ, we have c = a + b and E[c²] = E[a²] + E[b²], i.e., MSE = Variance + Bias².]


Ismor Fischer, 5/29/2012 4.1-9

Related (but not identical) to this is the idea that, of all linear combinations c₁x₁ + c₂x₂ + ⋯ + cₙxₙ of the data {x₁, x₂, …, xₙ} (such as X̄, with c₁ = c₂ = ⋯ = cₙ = 1/n) which are also unbiased, the one that minimizes MSE is called BLUE (Best Linear Unbiased Estimator). It can be shown that, in addition to being MVUE (as stated above), X̄ is also BLUE. To summarize, MVUE gives:

Min Variance among all unbiased estimators
≤ Min Variance among linear unbiased estimators
= Min MSE among linear unbiased estimators (since MSE = Var + Bias²),
given by BLUE (by def).

The Venn diagram below depicts these various relationships.

Comment: If MSE → 0 as n → ∞, then θ̂ is said to have mean square convergence to θ. This in turn implies “convergence in probability” (via “Markov's Inequality,” also used in proving Chebyshev’s Inequality), i.e., θ̂ is a consistent estimator of θ.

[Venn diagram: the classes of Unbiased, Linear, Minimum MSE, and Minimum Variance estimators; BLUE = minimum variance among linear unbiased estimators, MVUE = minimum variance among all unbiased estimators (e.g., X̄ and S²).]


Ismor Fischer, 5/29/2012 4.1-10


Experiment 4 - revisited: Recall the previous example, where X1 and X2 represent the daily number of calories consumed in two populations, respectively.

Population 1 Population 2

Case 1: First suppose that X1 and X2 are statistically independent, as shown in the joint probability distribution given in the table below. That is, each cell probability is equal to the product of the corresponding row and column marginal probabilities. For example, P(X1 = 2300 ∩ X2 = 2200) = .02, but this is equal to the product of the column marginal P(X1 = 2300) = .1 with the row marginal P(X2 = 2200) = .2. Note that the marginal distributions for X1 and X2 remain the same as above, as can be seen from the single-underlined values for X1, and respectively, the double-underlined values for X2.

                                  X1 = # calories for Pop 1
                                 2300    2400    2500    2600
X2 = # calories for Pop 2  2200   .02     .04     .06     .08     .20
                           2300   .03     .06     .09     .12     .30
                           2400   .05     .10     .15     .20     .50
                                  .10     .20     .30     .40    1.00

x      f1(x)
2300   0.1
2400   0.2
2500   0.3
2600   0.4

Mean(X1) = µ1 = 2500 cals;  Var(X1) = σ1² = 10000 cals²

x      f2(x)
2200   0.2
2300   0.3
2400   0.5

Mean(X2) = µ2 = 2330 cals;  Var(X2) = σ2² = 6100 cals²


Ismor Fischer, 5/29/2012 4.1-11

Now imagine that we wish to compare the two populations, by considering the probability distribution of the calorie difference D = X1 – X2 between them. (The sum S = X1 + X2 is similar, and left as an exercise.)

As an example, there are two possible ways that D = 300 can occur, i.e., two possible outcomes corresponding to the event D = 300: Either A = “X1 = 2500 and X2 = 2200” or B = “X1 = 2600 and X2 = 2300,” that is, A ⋃ B. For its probability, recall that

P(A ∪ B) = P(A) + P(B) − P(A ∩ B). However, events A and B are disjoint, for they cannot both occur simultaneously, so that the last term is P(A ⋂ B) = 0. Thus,

P(A ∪ B) = P(A) + P(B), with P(A) = .06 and P(B) = .12 from the joint distribution.

Mean(D) = µD = (–100)(.05) + (0)(.13) + (100)(.23) + (200)(.33) + (300)(.18) + (400)(.08) = 170 cals, i.e., µD = µ1 – µ2 (Check this!)

Var(D) = σD² = (–270)²(.05) + (–170)²(.13) + (–70)²(.23) + (30)²(.33) + (130)²(.18) + (230)²(.08) = 16100 cals², i.e., σD² = σ1² + σ2² (Check this!)

Events D = d

Sample Space Outcomes in the form of ordered pairs (X1, X2)

Probabilities from joint distribution

D = –100: (2300, 2400) .05

D = 0: (2300, 2300), (2400, 2400) .13 = .03 + .10

D = +100: (2300, 2200), (2400, 2300), (2500, 2400) .23 = .02 + .06 + .15

D = +200: (2400, 2200), (2500, 2300), (2600, 2400) .33 = .04 + .09 + .20

D = +300: (2500, 2200), (2600, 2300) .18 = .06 + .12

D = +400: (2600, 2200) .08

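An optional R sketch that rebuilds the distribution of D from the two marginal tables under independence (Case 1) and confirms µD and σD²:

# Sketch in R: distribution of D = X1 - X2 under independence (Case 1)
x1 <- c(2300, 2400, 2500, 2600); f1 <- c(0.1, 0.2, 0.3, 0.4)
x2 <- c(2200, 2300, 2400);       f2 <- c(0.2, 0.3, 0.5)

joint <- outer(f2, f1)                        # 3 x 4 joint table: P(X2 = x2) * P(X1 = x1)
d     <- outer(x2, x1, function(a, b) b - a)  # corresponding values of D = X1 - X2
fD <- tapply(as.vector(joint), as.vector(d), sum)
fD                                            # .05 .13 .23 .33 .18 .08

muD <- sum(as.numeric(names(fD)) * fD); muD   # 170 cals = mu1 - mu2
sum((as.numeric(names(fD)) - muD)^2 * fD)     # 16100 cals^2 = var1 + var2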


Ismor Fischer, 5/29/2012 4.1-12

Case 2: Now assume that X1 and X2 are not statistically independent, as given in the joint probability distribution table below.

                                  X1 = # calories for Pop 1
                                 2300    2400    2500    2600
X2 = # calories for Pop 2  2200   .01     .03     .07     .09     .20
                           2300   .02     .05     .10     .13     .30
                           2400   .07     .12     .13     .18     .50
                                  .10     .20     .30     .40    1.00

The events “D = d” and the corresponding sample space of outcomes remain unchanged, but the last column of probabilities has to be recalculated, as shown. This results in a slightly different probability histogram (Exercise) and parameter values.

Mean(D) = µD = (–100)(.07) + (0)(.14) + (100)(.19) + (200)(.31) + (300)(.20) + (400)(.09) = 170 cals, i.e., µD = µ1 – µ2.

Var(D) = σD² = (–270)²(.07) + (–170)²(.14) + (–70)²(.19) + (30)²(.31) + (130)²(.20) + (230)²(.09) = 18500 cals²

It seems that “the mean of the difference is equal to the difference in the means” still holds, even when the two populations are dependent. But the variance of the difference is no longer necessarily equal to the sum of the variances, as with independent populations.

Events D = d

Sample Space Outcomes in the form of ordered pairs (X1, X2)

Probabilities from joint distribution

D = –100: (2300, 2400) .07

D = 0: (2300, 2300), (2400, 2400) .14 = .02 + .12

D = +100: (2300, 2200), (2400, 2300), (2500, 2400) .19 = .01 + .05 + .13

D = +200: (2400, 2200), (2500, 2300), (2600, 2400) .31 = .03 + .10 + .18

D = +300: (2500, 2200), (2600, 2300) .20 = .07 + .13

D = +400: (2600, 2200) .09


Ismor Fischer, 5/29/2012 4.1-13

These examples illustrate a general principle that can be rigorously proved with mathematics.

GENERAL FACT ~

Mean(X + Y) = Mean(X) + Mean(Y) and Mean(X – Y) = Mean(X) – Mean(Y). In addition, if X and Y are independent random variables,

Var(X + Y) = Var(X) + Var(Y) and Var(X – Y) = Var(X) + Var(Y).

Comments: These formulas actually apply to both discrete and continuous variables (next section).

The difference relations will play a crucial role in 6.2 - Two Samples inference.

If X and Y are dependent, then the two bottom relations regarding the variance also involve an additional term, Cov(X, Y), the population covariance between X and Y. See problems 4.3/29 and 4.3/30 for details.

The variance relation can be interpreted visually via the Pythagorean Theorem, which illustrates an important geometric connection, expanded in the Appendix.

Certain discrete distributions (or discrete models) occur so frequently in practice, that their properties have been well-studied and applied in many different scenarios. For instance, suppose it is known that a certain population consists of 45% males (and thus 55% females). If a random sample of 250 individuals is to be selected, then what is the probability of obtaining exactly 100 males? At most 100 males? At least 100 males? What is the “expected” number of males? This is the subject of the next topic:



Ismor Fischer, 5/29/2012 4.1-14

POPULATION = Women diagnosed with breast cancer in Dane County, 1996-2000

Among other things, this study estimated that the rate of “breast cancer in situ (BCIS),” which is diagnosed almost exclusively via mammogram, is approximately 12-13%. That is, for any individual randomly selected from this population, we have a binary variable

BCIS = 1, with probability 0.12;  BCIS = 0, with probability 0.88.

In a random sample of n = 100 breast cancer diagnoses, let

X = # BCIS cases (0, 1, 2, …, 100).

Questions:

How can we model the probability distribution of X, and under what assumptions?

Probabilities of events, such as P(X = 0), P(X = 20), P(X ≤ 20), etc.?

Mean # BCIS cases = ?

Standard deviation of # BCIS cases = ?

Full article available online at this link.


Ismor Fischer, 5/29/2012 4.1-15

Binomial Distribution (Paradigm model = coin tosses)

(H H H H H) (H H T H H) (H T H H H) (H T T H H) (T H H H H) (T H T H H) (T T H H H) (T T T H H) (H H H H T) (H H T H T) (H T H H T) (H T T H T) (T H H H T) (T H T H T) (T T H H T) (T T T H T) (H H H T H) (H H T T H) (H T H T H) (H T T T H) (T H H T H) (T H T T H) (T T H T H) (T T T T H) (H H H T T) (H H T T T) (H T H T T) (H T T T T) (T H H T T) (T H T T T) (T T H T T) (T T T T T)

Random Variable: X = “# Heads in n = 5 independent tosses (0, 1, 2, 3, 4, 5)”

Events: “X = 0” = Exercise   #(X = 0) = C(5, 0) = 1

“X = 1” = Exercise   #(X = 1) = C(5, 1) = 5

“X = 2” = Exercise   #(X = 2) = C(5, 2) = 10

“X = 3” = see above   #(X = 3) = C(5, 3) = 10

“X = 4” = Exercise   #(X = 4) = C(5, 4) = 5

“X = 5” = Exercise   #(X = 5) = C(5, 5) = 1

Recall: For x = 0, 1, 2, …, n, the combinatorial symbol C(n, x) – read “n-choose-x” – is defined as the value n! / [x! (n − x)!], and counts the number of ways of rearranging x objects among n objects. See Appendix > Basic Reviews > Perms & Combos for details.

Note: C(n, r) is computed via the mathematical function “nCr” on most calculators.

Binary random variable: Y = 1, Success (Heads), with P(Success) = π; Y = 0, Failure (Tails), with P(Failure) = 1 − π.

Experiment: n = 5 independent coin tosses

Sample Space: S = {(H H H H H), …, (T T T T T)}   #(S) = 2⁵ = 32


Ismor Fischer, 5/29/2012 4.1-16

Total Area = 1

Probabilities:

First assume the coin is fair (π = 0.5 ⇒ 1 − π = 0.5), i.e., equally likely elementary outcomes H and T on a single trial. In this case, the probability of any event A above can thus be easily calculated via P(A) = #(A) / #(S).

x    P(X = x) = (1/2⁵) C(5, x)
0    1/32 = 0.03125
1    5/32 = 0.15625
2    10/32 = 0.31250
3    10/32 = 0.31250
4    5/32 = 0.15625
5    1/32 = 0.03125

Now consider the case where the coin is biased (e.g., π = 0.7 ⇒ 1 − π = 0.3). Calculating P(X = x) for x = 0, 1, 2, 3, 4, 5 means summing P(all its outcomes). Example:

P(X = 3) = P(H H H T T) + P(H H T H T) + P(H H T T H) + P(H T H H T) + P(H T H T H) + P(H T T H H) + P(T H H H T) + P(T H H T H) + P(T H T H H) + P(T T H H H), via disjoint outcomes.

Via independence of H and T on each toss, every one of these ten outcomes has the same probability – e.g., P(H H H T T) = (0.7)(0.7)(0.7)(0.3)(0.3) = (0.7)³ (0.3)² – so that

P(X = 3) = C(5, 3) (0.7)³ (0.3)².


Ismor Fischer, 5/29/2012 4.1-17

Hence, we similarly have…

x    P(X = x) = C(5, x) (0.7)^x (0.3)^(5 − x)
0    C(5, 0) (0.7)⁰ (0.3)⁵ = 0.00243
1    C(5, 1) (0.7)¹ (0.3)⁴ = 0.02835
2    C(5, 2) (0.7)² (0.3)³ = 0.13230
3    C(5, 3) (0.7)³ (0.3)² = 0.30870
4    C(5, 4) (0.7)⁴ (0.3)¹ = 0.36015
5    C(5, 5) (0.7)⁵ (0.3)⁰ = 0.16807

Example: Suppose that a certain medical procedure is known to have a 70% successful recovery rate (assuming independence). In a random sample of n = 5 patients, the probability that three or fewer patients will recover is: Method 1: P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)

= 0.00243 + 0.02835 + 0.13230 + 0.30870 = 0.47178 Method 2: P(X ≤ 3) = 1 − [ P(X = 4) + P(X = 5) ]

= 1 − [0.36015 + 0.16807 ] = 1 – 0.52822 = 0.47178 Example: The mean number of patients expected to recover is: µ = E[X] = 0 (0.00243) + 1 (0.02835) + 2 (0.13230) + 3 (0.30870) + 4 (0.36015) + 5 (0.16807)

= 3.5 patients This makes perfect sense for n = 5 patients with a π = 0.7 recovery probability, i.e., their product. In the probability histogram above, the “balance point” fulcrum indicates the mean value of 3.5.
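In R, the built-in functions dbinom and pbinom reproduce these values directly (a quick optional check):

# R check of the recovery example: X ~ Bin(5, 0.7)
dbinom(0:5, size = 5, prob = 0.7)    # 0.00243 0.02835 0.13230 0.30870 0.36015 0.16807
pbinom(3, size = 5, prob = 0.7)      # P(X <= 3) = 0.47178
sum(0:5 * dbinom(0:5, 5, 0.7))       # mean = n * pi = 3.5 patients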

[Probability histogram for X ~ Bin(5, 0.7): P(X = x) = C(5, x) (0.7)^x (0.3)^(5 − x); total area = 1, with the balance point at 3.5.]


Ismor Fischer, 5/29/2012 4.1-18

General formulation:

The Binomial Distribution

Let the discrete random variable X = “# Successes in n independent Bernoulli trials (0, 1, 2, …, n),” each having constant probability P(Success) = π, and hence P(Failure) = 1 − π. Then the probability of obtaining any specified number of successes x = 0, 1, 2, …, n, is given by:

P(X = x) = C(n, x) π^x (1 − π)^(n − x).

We say that X has a Binomial Distribution, denoted X ~ Bin(n, π). Furthermore, the mean is µ = nπ, and the standard deviation is σ = √[nπ(1 − π)].

Example: Suppose that a certain spontaneous medical condition affects 1% (i.e., π = 0.01) of the population. Let X = “number of affected individuals in a random sample of n = 300.” Then X ~ Bin(300, 0.01), i.e., the probability of obtaining any specified number x = 0, 1, 2, …, 300 of affected individuals is:

P(X = x) = C(300, x) (0.01)^x (0.99)^(300 − x).

The mean number of affected individuals is µ = nπ = (300)(0.01) = 3 expected cases, with a standard deviation of σ = √[(300)(0.01)(0.99)] = 1.723 cases.

Probability Table for Binomial Dist.

x      f(x) = C(n, x) π^x (1 − π)^(n − x)
0      C(n, 0) π⁰ (1 − π)^n
1      C(n, 1) π¹ (1 − π)^(n − 1)
2      C(n, 2) π² (1 − π)^(n − 2)
etc.   etc.
n      C(n, n) π^n (1 − π)⁰
       (these sum to 1)

Exercise: In order to be a valid distribution, the sum of these probabilities must = 1. Prove it.

Hint: First recall the Binomial Theorem: How do you expand the algebraic expression (a + b)^n for any n = 0, 1, 2, 3, …? Then replace a with π, and b with 1 – π. Voilà!


Ismor Fischer, 5/29/2012 4.1-19

Comments:

The assumption of independence of the trials is absolutely critical! If not satisfied – i.e., if the “success” probability of one trial influences that of another – then the Binomial Distribution model can fail miserably. (Example: X = “number of children in a particular school infected with the flu”) The investigator must decide whether or not independence is appropriate, which is often problematic. If violated, then the correlation structure between the trials may have to be considered in the model.

As in the preceding example, if the sample size n is very large, then the computation of C(n, x) for x = 0, 1, 2, …, n, can be intensive and impractical. An approximation to the Binomial Distribution exists, when n is large and π is small, via the Poisson Distribution (coming up…).

Note that the standard deviation σ = √[nπ(1 − π)] depends on the value of π. (Later…)


Ismor Fischer, 5/29/2012 4.1-20

How can we estimate the parameter π, using a sample-based statistic π̂?

Example: If, in a sample of n = 50 randomly selected individuals, X = 36 are female, then the statistic π̂ = X/n = 36/50 = 0.72 is an estimate of the true probability π that a randomly selected individual from the population is female. The probability of selecting a male is therefore estimated by 1 − π̂ = 0.28.

POPULATION: Binary random variable Y = 1, Success with probability π; Y = 0, Failure with probability 1 − π.

SAMPLE: Experiment: n independent trials, recorded as 0/1 outcomes (y1, y2, y3, y4, y5, y6, …, yn).

Let X = y1 + y2 + y3 + y4 + y5 + … + yn = # Successes in n trials ~ Bin(n, π) (so n − X = # Failures in n trials). Therefore, dividing by n…

X/n = proportion of Successes in n trials = π̂ = p ( = ȳ, as well), and hence…

q = 1 − p = proportion of Failures in n trials.


Ismor Fischer, 5/29/2012 4.1-21

Poisson Distribution (Models rare events)

Assume:

1. All the occurrences of E are independent in the interval.

2. The mean number µ of expected occurrences of E in the interval is proportional to T, i.e., µ = α T. This constant of proportionality α is called the rate of the resulting Poisson process.

Then…

Examples: # bee-sting fatalities per year, # spontaneous cancer remissions per year, # accidental needle-stick HIV cases per year, hemocytometer cell counts

Discrete Random Variable:

X = # occurrences of a (rare) event E, in a given interval of time or space, of size T. (0, 1, 2, 3, …)


The Poisson Distribution The probability of obtaining any specified number x = 0, 1, 2, … of occurrences of event E is given by:

P(X = x) = e^(−µ) µ^x / x!

where e = 2.71828… (“Euler’s constant”). We say that X has a Poisson Distribution, denoted X ~ Poisson(µ). Furthermore, the mean is µ = α T, and the variance is σ² = α T also.


Ismor Fischer, 5/29/2012 4.1-22


Example (see above): Again suppose that a certain spontaneous medical condition E affects 1% (i.e., α = 0.01) of the population. Let X = “number of affected individuals in a random sample of T = 300.” As before, the mean number of expected occurrences of E in the sample is µ = α T = (0.01)(300) = 3 cases. Hence X ~ Poisson(3), and the probability that any number x = 0, 1, 2, … of individuals are affected is given by:

P(X = x) = e^(−3) 3^x / x!

which is a much easier formula to work with than the previous one. This fact is sometimes referred to as the Poisson approximation to the Binomial Distribution, when T (respectively, n) is large, and α (respectively, π) is small. Note that in this example, the variance is also σ² = 3, so that the standard deviation is σ = √3 = 1.732, very close to the exact Binomial value.

x      Binomial: P(X = x) = C(300, x)(0.01)^x(0.99)^(300 − x)      Poisson: P(X = x) = e^(−3) 3^x / x!
0      0.04904      0.04979
1      0.14861      0.14936
2      0.22441      0.22404
3      0.22517      0.22404
4      0.16888      0.16803
5      0.10099      0.10082
6      0.05015      0.05041
7      0.02128      0.02160
8      0.00787      0.00810
9      0.00258      0.00270
10     0.00076      0.00081
etc.   → 0           → 0
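This comparison table can be reproduced in R with dbinom and dpois (an optional sketch):

# Reproducing the comparison table: Bin(300, 0.01) vs. Poisson(3)
x <- 0:10
round(cbind(x,
            Binomial = dbinom(x, size = 300, prob = 0.01),
            Poisson  = dpois(x, lambda = 3)), 5)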


Ismor Fischer, 5/29/2012 4.1-23

Why is the Poisson Distribution a good approximation to the Binomial Distribution, for large n and small π?

Rule of Thumb: n ≥ 20 and π ≤ 0.05; excellent if n ≥ 100 and π ≤ 0.1.

Let fBin(x) = C(n, x) π^x (1 − π)^(n − x) and fPoisson(x) = e^(−λ) λ^x / x!, where λ = nπ.

We wish to show formally that, for fixed λ, and x = 0, 1, 2, …, we have lim (as n → ∞, π → 0) fBin(x) = fPoisson(x).

Proof: By elementary algebra, it follows that…

fBin(x) = C(n, x) π^x (1 − π)^(n − x)
        = [n! / (x! (n − x)!)] π^x (1 − π)^n (1 − π)^(−x)
        = (1/x!) n(n − 1)(n − 2)⋯(n − x + 1) π^x (1 − λ/n)^n (1 − π)^(−x)
        = (1/x!) [n(n − 1)(n − 2)⋯(n − x + 1) / n^x] (nπ)^x (1 − λ/n)^n (1 − π)^(−x)
        = (1/x!) (n/n)((n − 1)/n)((n − 2)/n)⋯((n − x + 1)/n) λ^x (1 − λ/n)^n (1 − π)^(−x)
        = (1/x!) (1)(1 − 1/n)(1 − 2/n)⋯(1 − (x − 1)/n) λ^x (1 − λ/n)^n (1 − π)^(−x).

As n → ∞ and π → 0 (with λ = nπ fixed), the successive factors converge to 1/x!, (1)(1)⋯(1) = 1, λ^x, e^(−λ), and 1^(−x) = 1, respectively, so that

lim fBin(x) = e^(−λ) λ^x / x! = fPoisson(x).   QED

Siméon Poisson (1781 - 1840)


Ismor Fischer, 5/29/2012 4.1-24

Classical Discrete Probability Distributions

Binomial (probability of finding x “successes” and n – x “failures” in n independent trials)
X = # successes (each with probability π) in n independent Bernoulli trials, n = 1, 2, 3, …
f(x) = P(X = x) = C(n, x) π^x (1 − π)^(n − x),  x = 0, 1, 2, …, n

Negative Binomial (probability of needing x independent trials to find k successes)
X = # independent Bernoulli trials needed for k successes (each with probability π), k = 1, 2, 3, …
f(x) = P(X = x) = C(x − 1, k − 1) π^k (1 − π)^(x − k),  x = k, k + 1, k + 2, …
Geometric: X = # independent Bernoulli trials needed for k = 1 success
f(x) = P(X = x) = π (1 − π)^(x − 1),  x = 1, 2, 3, …

Hypergeometric (modification of Binomial to sampling without replacement from “small” finite populations, relative to n)
X = # successes in n random trials taken from a population of size N containing d successes, n > N/10
f(x) = P(X = x) = C(d, x) C(N − d, n − x) / C(N, n),  x = 0, 1, 2, …, d

Multinomial (generalization of Binomial to k categories, rather than just two)
For i = 1, 2, 3, …, k, Xi = # outcomes in category i (each with probability πi), in n independent trials, n = 1, 2, 3, …, with π1 + π2 + π3 + ⋯ + πk = 1
f(x1, x2, …, xk) = P(X1 = x1, X2 = x2, …, Xk = xk) = [n! / (x1! x2! … xk!)] π1^x1 π2^x2 ⋯ πk^xk,  xi = 0, 1, 2, …, n with x1 + x2 + … + xk = n

Poisson (“limiting case” of Binomial, with n → ∞ and π → 0, such that nπ = λ, fixed)
X = # occurrences of a rare event (i.e., π ≈ 0) among many (i.e., n large), with fixed mean λ = nπ
f(x) = P(X = x) = e^(−λ) λ^x / x!,  x = 0, 1, 2, …


4.2 Continuous Models

Horseshoe Crab (Limulus polyphemus)

• Not true crabs, but closely related to spiders and scorpions.

• “Living fossils” – existed since Carboniferous Period, ≈ 350 mya.

• Found primarily on Atlantic coast, with the highest concentration in Delaware Bay, where males and the much larger females congregate in large numbers on the beaches for mating, and subsequent egg-laying.

• Pharmaceutical (and many other scientific) contributions! Blue hemolymph (due to copper-based hemocyanin molecule) contains amebocytes, which produce a clotting agent that reacts with endotoxins found in the outer membrane of Gram-negative bacteria. Several East Coast companies have developed the Limulus Amebocyte Lysate (LAL) assay, used to detect bacterial contamination of drugs and medical implant devices, etc. Equal amounts of LAL reagent and test solution are mixed together, incubated at 37°C for one hour, then checked to see if gelling has occurred. Simple, fast, cheap, sensitive, uses very small amounts, and does not harm the animals… probably. (Currently, a moratorium exists on their harvesting, while population studies are ongoing…)

Photo courtesy of Bill Hall, [email protected]. Used with permission.


Continuous Random Variable: X = "Length (inches) of adult horseshoe crabs"

[Two density histograms of sample data appear here.]

Sample 1: n = 25; lengths measured to nearest inch. E.g., 10 crabs in [12, 16)″, 6 in [16, 20)″, 9 in [20, 24)″. Example: P(16 ≤ X < 20) = 0.24.

Sample 2: n = 1000; lengths measured to nearest ½ inch. E.g., 180 crabs in [12, 14)″, 240 in [14, 16)″, etc. Example: P(16 ≤ X < 20) = 0.16 + 0.12 = 0.28.

In the limit as n → ∞, the population distribution of X can be characterized by a continuous density curve, and formally described by a density function f(x) ≥ 0. Thus,

    P(a ≤ X < b) = ∫ from a to b of f(x) dx = area under the density curve from a to b,

and the Total Area = ∫ from −∞ to +∞ of f(x) dx = 1.

[Figure: density curve f(x) for X over roughly 12 to 24 inches; males are smaller on average, females are larger on average.]


Definition: f(x) is a probability density function for the continuous random variable X if, for all x,

    f(x) ≥ 0   AND   ∫ from −∞ to +∞ of f(x) dx = 1.

The cumulative distribution function (cdf) is defined as, for all x,

    F(x) = P(X ≤ x) = ∫ from −∞ to x of f(t) dt.

Therefore, F increases monotonically and continuously from 0 to 1.

Furthermore, by the Fundamental Theorem of Calculus,

    P(a ≤ X ≤ b) = ∫ from a to b of f(x) dx = F(b) − F(a).

The cumulative probability that X is less than or equal to some value x, i.e., P(X ≤ x), is characterized by: (1) the area under the graph of f up to x, or (2) the height of the graph of F at x. But note: f(x) NO LONGER corresponds to the probability P(X = x) [which = 0, since X is here continuous], as it does for discrete X.
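These two characterizations can be checked numerically. The sketch below is purely illustrative; it uses R's integrate function, with the standard normal density dnorm standing in for a generic f(x):

    # Total area under a density, and P(a <= X <= b) as F(b) - F(a)
    integrate(dnorm, -Inf, Inf)      # total area = 1
    integrate(dnorm, -1, 1)$value    # P(-1 <= X <= 1), area under f from -1 to 1
    pnorm(1) - pnorm(-1)             # same value, via the cdf F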

Example 1: Uniform density

This is the trivial "constant function" over some fixed interval [a, b]. That is, f(x) = 1/(b − a) for a ≤ x ≤ b (and f(x) = 0 otherwise). Clearly, the two criteria for being a valid density function are met: it is non-negative, and the (rectangular) area under its graph is equal to its base (b − a) times its height 1/(b − a), which is indeed 1. Moreover, for any value of x in the interval [a, b], the (rectangular) area under the graph up to x is equal to its base (x − a) times its height 1/(b − a). That is, the cumulative distribution function (cdf) is given by

    F(x) = (x − a)/(b − a),

the graph of which is a straight line connecting the left endpoint (a, 0) to the right endpoint (b, 1). [Note: Since f(x) = 0 outside the interval [a, b], the area beneath it contributes nothing to F(x) there; hence F(x) = 0 if x < a, and F(x) = 1 if x > b. Observe that, indeed, F increases monotonically and continuously from 0 to 1; the graphs show f(x) and F(x) over the interval [1, 6], i.e., a = 1, b = 6. Compare this example with the discrete version in section 3.1.]

Thus, for example, the probability P(2.6 ≤ X ≤ 3.8) is equal to the (rectangular) area under f(x) over that interval, or in terms of F(x), simply equal to the difference between the heights F(3.8) − F(2.6) = (3.8 − 1)/5 − (2.6 − 1)/5 = 0.56 − 0.32 = 0.24.

[Figure: graphs of f(x) = 1/5 and F(x) = (x − 1)/5 on [1, 6].]
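As a quick illustrative check in R (using the built-in uniform cdf punif; not part of the original example):

    # P(2.6 <= X <= 3.8) for X uniform on [1, 6]
    punif(3.8, min = 1, max = 6) - punif(2.6, min = 1, max = 6)   # 0.24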


Example 2: Power density (A special case of the Beta density: β = 1)

For any fixed p > 0, let f(x) = p x^(p − 1) for 0 < x < 1 (and f(x) = 0 otherwise). This is a valid density function, since f(x) ≥ 0 and

    ∫ from −∞ to +∞ of f(x) dx = ∫ from 0 to 1 of p x^(p − 1) dx = [x^p] from 0 to 1 = 1.

The corresponding cdf is therefore

    F(x) = ∫ from −∞ to x of f(t) dt = ∫ from 0 to x of p t^(p − 1) dt = [t^p] from 0 to x = x^p   on [0, 1].

(And, as above, F(x) = 0 if x < 0, and F(x) = 1 if x > 1.) Again observe that F indeed increases monotonically and continuously from 0 to 1, regardless of f; see graphs for p = 1/2, 3/2, 3. (Note: p = 1 corresponds to the uniform density on [0, 1].)

[Figures: densities f(x) = (1/2)x^(−1/2), (3/2)x^(1/2), 3x², with corresponding cdfs F(x) = x^(1/2), x^(3/2), x³.]


Example 3: Cauchy density

The function f(x) = (1/π) · 1/(1 + x²) for −∞ < x < +∞ is a legitimate density function, since it satisfies the two criteria above: f(x) ≥ 0 AND ∫ from −∞ to +∞ of f(x) dx = 1. (Verify it!) The cdf is therefore

    F(x) = ∫ from −∞ to x of f(t) dt = (1/π) ∫ from −∞ to x of dt/(1 + t²) = (1/π) arctan(x) + 1/2   for −∞ < x < +∞.

Thus, for instance, P(0 ≤ X ≤ 1) = F(1) − F(0) = [(1/π)(π/4) + 1/2] − [(1/π)(0) + 1/2] = 1/4.
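As an illustrative aside, R's built-in Cauchy cdf pcauchy (with its default location 0 and scale 1) reproduces this value:

    # P(0 <= X <= 1) = F(1) - F(0) for the standard Cauchy density
    pcauchy(1) - pcauchy(0)   # 0.25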

Example 4: Exponential density

For any fixed a > 0, f(x) = a e^(−ax) for x ≥ 0 (and = 0 for x < 0) is a valid density function, since it satisfies the two criteria. (Details are left as an exercise.) The corresponding cdf is given by

    F(x) = ∫ from −∞ to x of f(t) dt = ∫ from 0 to x of a e^(−at) dt = 1 − e^(−ax),   for x ≥ 0 (and = 0 otherwise).

The case a = 1 is shown below [figure: the curves e^(−x) and 1 − e^(−x)]. Thus, for instance, P(X ≤ 2) = F(2) = 1 − e^(−2) = 0.8647, and P(0.5 ≤ X ≤ 2) = F(2) − F(0.5) = (1 − e^(−2)) − (1 − e^(−0.5)) = 0.8647 − 0.3935 = 0.4712.
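A brief illustrative check in R, using the built-in exponential cdf pexp (whose rate argument plays the role of a here, with a = 1):

    # P(X <= 2) and P(0.5 <= X <= 2) for the a = 1 exponential density
    pexp(2, rate = 1)                         # 1 - e^(-2) = 0.8647
    pexp(2, rate = 1) - pexp(0.5, rate = 1)   # 0.4712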


Exercise: (Another special case of the Beta density.) Sketch the graph of f(x) = 6x(1 − x) for 0 ≤ x ≤ 1 (and = 0 elsewhere); show that it is a valid density function. Find the cdf F(x), and sketch its graph. Calculate P(¼ ≤ X ≤ ¾).

Exercise: Sketch the graph of f(x) = e^x / (e^x + 1)² for −∞ < x < +∞, and show that it is a valid density function. Find the cdf F(x), and sketch its graph. Find the quartiles. Calculate P(0 ≤ X ≤ 1).

Thus, using the definition of the mean given below, for the exponential density we have µ = ∫ from 0 to ∞ of x · a e^(−ax) dx = 1/a, via integration by parts. The calculation of σ² is left as an exercise.

Exercise: Sketch the graph of f(x) = 2 / (π √(1 − x²)) for 0 ≤ x < 1 (and 0 elsewhere); show that it is a valid density function. Find the cdf F(x), and sketch its graph. Calculate P(X ≤ ½), and find the mean.

Exercise: What are the mean and variance of the power density?

Exercise: What is the mean of the Cauchy density?

If X is a continuous numerical random variable with density function f(x), then the population mean is given by the "first moment"

    µ = E[X] = ∫ from −∞ to +∞ of x f(x) dx

and the population variance is given by the "second moment about the mean"

    σ² = E[(X − µ)²] = ∫ from −∞ to +∞ of (x − µ)² f(x) dx,

or equivalently,

    σ² = E[X²] − µ² = ∫ from −∞ to +∞ of x² f(x) dx − µ².

(Compare these continuous formulas with those for discrete X.)
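These integrals can also be evaluated numerically. The sketch below is illustrative only; it uses R's integrate function on the exponential density with a = 1, for which the mean should be 1/a = 1 (and the variance 1/a² = 1):

    # Numerical check of the mean and variance formulas for f(x) = a*exp(-a*x), a = 1
    a  <- 1
    f  <- function(x) a * exp(-a * x)
    mu <- integrate(function(x) x * f(x), 0, Inf)$value            # ~ 1
    s2 <- integrate(function(x) (x - mu)^2 * f(x), 0, Inf)$value   # ~ 1
    c(mean = mu, variance = s2)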

Augustin-Louis Cauchy (1789-1857)

"Be careful! It is not as easy as it appears..."



Example: Crawling Ants and Jumping Fleas

Consider two insects on a (six-inch) ruler: a flea, who makes only discrete integer jumps (X), and an ant, who crawls along continuously and can stop anywhere (Y).

1. Let the discrete random variable X = “length jumped (0, 1, 2, 3, 4, 5, or 6 inches) by the flea”. Suppose that the flea is tired, so is less likely to make a large jump than a small (or no) jump, according to the following probability distribution (or mass) function f(x) = P(X = x), and corresponding probability histogram.

• The total probability is P(0 ≤ X ≤ 6) = 1, as it should be.

• P(3 ≤ X ≤ 6) = 4/28 + 3/28 + 2/28 + 1/28 = 10/28

• P(0 ≤ X < 3) = 7/28 + 6/28 + 5/28 = 18/28, or equivalently, = 1 − P(3 ≤ X ≤ 6) = 1 − 10/28 = 18/28

• P(0 ≤ X ≤ 3) = 18/28 + 4/28 = 22/28, because P(X = 3) = 4/28. (Note that P(0 ≤ X < 3) and P(0 ≤ X ≤ 3) are NOT equal here, since X is discrete; compare with the continuous case below.)

• Exercise: Confirm that the flea jumps a mean length of µ = 2 inches.

• Exercise: Sketch a graph of the cumulative distribution function F(x) = P(X ≤ x), similar to that of §2.2 in these notes.

Probability Table

    x     f(x) = P(X = x)
    0     7/28
    1     6/28
    2     5/28
    3     4/28
    4     3/28
    5     2/28
    6     1/28
          Total = 1

2. Let the continuous random variable Y = "length crawled (any value in the interval [0, 6] inches) by the ant". Suppose that the ant is tired, so is less likely to crawl a long distance than a short (or no) distance, according to the following probability density function f(y), and its corresponding graph, the probability density curve. (Assume that f = 0 outside of the given interval.)

    f(y) = (6 − y)/18,   0 ≤ y ≤ 6

• The total probability is P(0 ≤ Y ≤ 6) = ½ (6)(1/3) = 1, as it should be.

• P(3 ≤ Y ≤ 6) = ½ (3)(1/6) = 1/4 (Could also use calculus.)

• P(0 ≤ Y < 3) = 1 − P(3 ≤ Y ≤ 6) = 1 − 1/4 = 3/4

• P(0 ≤ Y ≤ 3) = 3/4 also, because P(Y = 3) = 0. (Why?) Note that P(0 ≤ Y < 3) and P(0 ≤ Y ≤ 3) ARE equal here, since Y is continuous.

• Exercise: Confirm that the ant crawls a mean length of µ = 2 inches.

• Exercise: Find the cumulative distribution function F(y), and sketch its graph.



Johann Carl Friedrich Gauss (1777 - 1855)

An extremely important bell-shaped continuous population distribution…

Normal Distribution (a.k.a. Gaussian Distribution): X ~ N(µ, σ)

    f(x) = 1/(√(2π) σ) · e^(−½ ((x − µ)/σ)²),   −∞ < x < +∞

where π = 3.14159… and e = 2.71828…. The curve is centered at the mean µ, its spread is governed by the standard deviation σ, and the total area under it (split between a left tail and a right tail) is ∫ from −∞ to +∞ of f(x) dx = 1.

Examples: X = Body Temp (°F), e.g., mean µ = 98.6 with small σ; X = IQ score (discrete!), e.g., mean µ = 100 with large σ.


Example: Two exams are given in a statistics course, both resulting in class scores that are normally distributed. The first exam distribution has a mean of 80.7 and a standard deviation of 3.5 points, i.e., X1 ~ N(80.7, 3.5). The second exam distribution has a mean of 82.8 and a standard deviation of 4.5 points, i.e., X2 ~ N(82.8, 4.5). Carla receives a score of 87 on the first exam, and a score of 90 on the second exam. Which of her two exam scores represents the better effort, relative to the rest of the class?

The Z-score tells how many standard deviations σ the X-score lies from the mean µ:

    x-score = 87  ⇔  z-score = (87 − 80.7)/3.5 = 1.8
    x-score = 90  ⇔  z-score = (90 − 82.8)/4.5 = 1.6

so the first exam score (z = 1.8) is the higher relative score.

Z-score Transformation

    X ~ N(µ, σ)   ⇔   Z = (X − µ)/σ ~ N(0, 1)

Standard Normal Distribution
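As an illustrative aside, the same comparison can be made in R, either through the z-scores or through the class percentiles given by pnorm:

    # Carla's z-scores and percentiles on the two exams
    (87 - 80.7) / 3.5                  # z = 1.8
    (90 - 82.8) / 4.5                  # z = 1.6
    pnorm(87, mean = 80.7, sd = 3.5)   # proportion of Exam 1 scores below 87
    pnorm(90, mean = 82.8, sd = 4.5)   # proportion of Exam 2 scores below 90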


Example: X = "Age (years) of UW-Madison third-year undergraduate population"

Assume: X ~ N(20, 1.25), i.e., X is normally distributed with mean µ = 20 yrs, and s.d. σ = 1.25 yrs. (How do we check this assumption? And what do we do if it's not true, or we can't tell? Later...)

Suppose that an individual from this population is randomly selected. Then…

• P(X < 20) = 0.5 (via symmetry)

• P(X < 19) = P(Z < (19 − 20)/1.25) = P(Z < −0.8) = 0.2119 (via table or software)

Therefore…

• P(19 ≤ X < 20) = P(X < 20) − P(X < 19) = 0.5000 − 0.2119 = 0.2881

Likewise,

• P(19 ≤ X < 19.5) = 0.3446 − 0.2119 = 0.1327

• P(19 ≤ X < 19.05) = 0.2236 − 0.2119 = 0.0118

• P(19 ≤ X < 19.005) = 0.2130 − 0.2119 = 0.0012

• P(19 ≤ X < 19.0005) = 0.2120 − 0.2119 = 0.0001

• P(X = 19.00000…) = 0, since X is continuous!
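An illustrative sketch of these calculations in R, using pnorm:

    # P(X < 19), P(19 <= X < 20), P(19 <= X < 19.5) for X ~ N(20, 1.25)
    pnorm(19, mean = 20, sd = 1.25)                                       # 0.2119
    pnorm(20, mean = 20, sd = 1.25) - pnorm(19, mean = 20, sd = 1.25)     # 0.2881
    pnorm(19.5, mean = 20, sd = 1.25) - pnorm(19, mean = 20, sd = 1.25)   # 0.1327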


Two Related Questions…

1. Given X ~ N(µ, σ). What is the probability that a randomly selected individual from the population falls within one standard deviation (i.e., ±1σ) of the mean µ? Within two standard deviations (±2σ)? Within three (±3σ)?

Solution: We solve this by transforming to the tabulated standard normal distribution Z ~ N(0, 1), via the formula Z = (X − µ)/σ, i.e., X = µ + Zσ.

    P(µ − 1σ ≤ X ≤ µ + 1σ) = P(−1 ≤ Z ≤ +1) = P(Z ≤ +1) − P(Z ≤ −1) = 0.8413 − 0.1587 = 0.6827

    P(µ − 2σ ≤ X ≤ µ + 2σ) = P(−2 ≤ Z ≤ +2) = P(Z ≤ +2) − P(Z ≤ −2) = 0.9772 − 0.0228 = 0.9545

Likewise, P(µ − 3σ ≤ X ≤ µ + 3σ) = P(−3 ≤ Z ≤ +3) = 0.9973.

These so-called empirical guidelines can be used as an informal check to see if sample-generated data derive from a population that is normally distributed. For if so, then 68%, or approximately 2/3, of the data should lie within one standard deviation s of the mean x̄; approximately 95% should lie within two standard deviations 2s of the mean x̄, etc. Other quantiles can be checked similarly. Superior methods also exist…
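An illustrative one-line check of these three probabilities in R:

    # P(-k <= Z <= +k) for k = 1, 2, 3 standard deviations
    pnorm(1) - pnorm(-1)   # 0.6827
    pnorm(2) - pnorm(-2)   # 0.9545
    pnorm(3) - pnorm(-3)   # 0.9973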

See my homepage to view a “ball drop” computer simulation of the normal distribution: (requires Java)

http://www.stat.wisc.edu/~ifischer


2. Given X ~ N(µ, σ). What symmetric interval about the mean µ contains 90% of the population distribution? 95%? 99%? General formulation?

Solution: Again, we can answer this question for the standard normal distribution Z ~ N(0, 1), and transform back to X ~ N(µ, σ), via the formula Z = (X − µ)/σ, i.e., X = µ + Zσ.

The value z.05 = 1.645 satisfies P(−z.05 ≤ Z ≤ z.05) = 0.90, or equivalently, P(Z ≤ −z.05) = P(Z ≥ z.05) = 0.05. Hence, the required interval is µ − 1.645σ ≤ X ≤ µ + 1.645σ.

The value z.025 = 1.960 satisfies P(−z.025 ≤ Z ≤ z.025) = 0.95, or equivalently, P(Z ≤ −z.025) = P(Z ≥ z.025) = 0.025. Hence, the required interval is µ − 1.960σ ≤ X ≤ µ + 1.960σ.

The value z.005 = 2.575 satisfies P(−z.005 ≤ Z ≤ z.005) = 0.99, or equivalently, P(Z ≤ −z.005) = P(Z ≥ z.005) = 0.005. Hence, the required interval is µ − 2.575σ ≤ X ≤ µ + 2.575σ.

In general…

Def: The critical value zα/2 satisfies P(−zα/2 ≤ Z ≤ zα/2) = 1 − α, or equivalently, the "tail probabilities" P(Z ≤ −zα/2) = P(Z ≥ zα/2) = α/2. Hence, the required interval satisfies

    P(µ − zα/2 σ ≤ X ≤ µ + zα/2 σ) = 1 − α.
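As an illustrative aside, these critical values come directly from the standard normal quantile function, qnorm in R:

    # z_{alpha/2} = qnorm(1 - alpha/2) for the 90%, 95%, and 99% intervals
    qnorm(1 - 0.10/2)   # 1.645
    qnorm(1 - 0.05/2)   # 1.960
    qnorm(1 - 0.01/2)   # 2.576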


Normal Approximation to the Binomial Distribution (approximating a discrete distribution by a continuous one)

Example: Suppose that it is estimated that 20% (i.e., π = 0.2) of a certain population has diabetes. Out of n = 100 randomly selected individuals, what is the probability that… (a) exactly X = 10 are diabetics? X = 15? X = 20? X = 25? X = 30?

Assuming that the occurrence of diabetes is independent among the individuals in the population, we have X ~ Bin(100, 0.2). Thus, the values of P(X = x) are calculated in the following probability table and histogram.

    x     P(X = x) = C(100, x) (0.2)^x (0.8)^(100 − x)
    10    C(100, 10) (0.2)^10 (0.8)^90 = 0.00336
    15    C(100, 15) (0.2)^15 (0.8)^85 = 0.04806
    20    C(100, 20) (0.2)^20 (0.8)^80 = 0.09930
    25    C(100, 25) (0.2)^25 (0.8)^75 = 0.04388
    30    C(100, 30) (0.2)^30 (0.8)^70 = 0.00519

(b) X ≤ 10 are diabetics? X ≤ 15? X ≤ 20? X ≤ 25? X ≤ 30?

Method 1: Directly sum the exact binomial probabilities to obtain P(X ≤ x). For instance, the cumulative probability

    P(X ≤ 10) = C(100, 0)(0.2)^0(0.8)^100 + C(100, 1)(0.2)^1(0.8)^99 + C(100, 2)(0.2)^2(0.8)^98 + … + C(100, 10)(0.2)^10(0.8)^90 = 0.00570

[Figure: probability histogram of X ~ Bin(100, 0.2), centered near µ = 20.]


Method 2: Despite the skew, X ~ N(µ, σ) approximately (a consequence of the Central Limit Theorem, §5.2), with mean µ = nπ and standard deviation σ = √(nπ(1 − π)). Hence,

    Z = (X − µ)/σ ~ N(0, 1)   becomes   Z = (X − nπ) / √(nπ(1 − π)) ~ N(0, 1).

In this example, µ = nπ = (100)(0.2) = 20, and σ = √(nπ(1 − π)) = √(100(0.2)(0.8)) = 4. So, approximately, X ~ N(20, 4); thus Z = (X − 20)/4 ~ N(0, 1). For instance,

    P(X ≤ 10) ≈ P(Z ≤ (10 − 20)/4) = P(Z ≤ −2.5) = 0.00621.

The following table compares the two methods for finding P(X ≤ x).

    x     Binomial (exact)     Normal (approximation)     Normal (with correction)
    10    0.00570              0.00621                    0.00877
    15    0.12851              0.10565                    0.13029
    20    0.55946              0.50000                    0.54974
    25    0.91252              0.89435                    0.91543
    30    0.99394              0.99379                    0.99567

Comment: The normal approximation to the binomial generally works well, provided nπ ≥ 15 and n(1 − π) ≥ 15. A modification exists, which adjusts for the difference between the discrete and continuous distributions:

    Z = (X − nπ ± 0.5) / √(nπ(1 − π)) ~ N(0, 1),

where the continuity correction factor is equal to +0.5 for P(X ≤ x), and −0.5 for P(X ≥ x). In this example, the "corrected" formula becomes Z = (X − 20 + 0.5)/4 ~ N(0, 1).
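The comparison table above can be reproduced in R; the sketch below is illustrative only, using the exact binomial cdf pbinom together with the normal approximation, with and without the continuity correction:

    # Exact vs. approximate P(X <= x) for X ~ Bin(100, 0.2), approximated by N(20, 4)
    x <- c(10, 15, 20, 25, 30)
    cbind(x,
          exact     = pbinom(x, size = 100, prob = 0.2),
          normal    = pnorm((x - 20) / 4),
          corrected = pnorm((x - 20 + 0.5) / 4))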


Exercise: Recall the preceding section, where a spontaneous medical condition affects 1% (i.e., π = 0.01) of the population, and X = “number of affected individuals in a random sample of n = 300.” Previously, we calculated the probability P(X = x) for x = 0, 1, …, 300. We now ask for the more meaningful cumulative probability P(X ≤ x), for x = 0, 1, 2, 3, 4, ... Rather than summing the exact binomial (or the approximate Poisson) probabilities as in Method 1 above, adopt the technique in Method 2, both with continuity correction and without. Compare these values with the exact binomial sums.

A Word about “Probability Zero” Events

(Much Ado About Nothing?)

Exactly what does it mean to say that an event E has zero probability of occurrence, i.e. P(E) = 0? A common, informal interpretation of this statement is that the event “cannot happen” and, in many cases, this is indeed true. For example, if X = “Sum of two dice,” then “X = –4,” “X = 5.7,” and “X = 13” all have probability zero because they are impossible outcomes of this experiment, i.e., they are not in the sample space {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. However, in a formal mathematical sense, this interpretation is too restrictive. For example, consider the following scenario: Suppose that k people participate in a lottery; each individual holds one ticket with a unique integer from the sample space {1, 2, 3, …, k}. The winner is determined by a computer that randomly selects one of these k integers with equal likelihood. Hence, the probability that a randomly selected individual wins is equal to 1/k. The larger the number k of participants, the smaller the probability 1/k that any particular person will win. Now, for the sake of argument, suppose that there is an infinite number of participants; a computer randomly selects one integer from the sample space {1, 2, 3, …}. The probability that a randomly selected individual wins is therefore less than 1/k for any k, i.e., arbitrarily small, hence = 0.* But by design, someone must win the lottery, so “probability zero” does not necessarily translate into “the event cannot happen.” So what does it mean? Recall that the formal, classical definition of the probability P(E) of any event E is the

mathematical "limiting value" of the ratio #(E occurs) / #(trials), as # trials → ∞. That is, the fraction of "the number of times that the event occurs" to "the total number of experimental trials," as the experiment is repeated indefinitely. If, in principle, this ratio becomes arbitrarily small after sufficiently many trials, then such an ever-increasingly rare event E is formally identified with having "probability zero" (such as, perhaps, the random toss of a coin under ordinary conditions resulting in it landing on edge, rather than on heads or tails).

* Similarly, any event consisting of a finite subset of an infinite sample space of possible outcomes (such as the event of randomly selecting a single particular value from a continuous interval), has a mathematical probability of zero.


Classical Continuous Probability Densities (The t and F distributions will be handled separately.)

Uniform: f(x) = 1/(b − a), a ≤ x ≤ b. Consequently, F(x) = (x − a)/(b − a).

Normal: For σ > 0, f(x) = 1/(√(2π) σ) e^(−½ ((x − µ)/σ)²),   −∞ < x < +∞.

Log-Normal: For β > 0, f(x) = 1/(√(2π) β) x^(−1) e^(−½ ((ln x)/β)²),   x ≥ 0.

Exponential: f(x) = (1/β) e^(−x/β), x ≥ 0. Thus, F(x) = 1 − e^(−x/β).

Gamma: For α > 0, β > 0, f(x) = 1/(β^α Γ(α)) x^(α − 1) e^(−x/β),   x ≥ 0.

Chi-Squared: For ν = 1, 2, …, f(x) = 1/(2^(ν/2) Γ(ν/2)) x^(ν/2 − 1) e^(−x/2),   x ≥ 0.

Weibull: For α > 0, β > 0, f(x) = α β x^(β − 1) e^(−α x^β),   x ≥ 0. Thus, F(x) = 1 − e^(−α x^β).

Beta: For α > 0, β > 0, f(x) = 1/Β(α, β) x^(α − 1) (1 − x)^(β − 1),   0 ≤ x ≤ 1.

Notes on the Gamma and Beta Functions

Def: Γ(α) = ∫ from 0 to ∞ of x^(α − 1) e^(−x) dx
Thm: Γ(α) = (α − 1) Γ(α − 1); therefore, Γ(α) = (α − 1)!, if α = 1, 2, 3, …
Thm: Γ(1/2) = √π

Def: Β(α, β) = ∫ from 0 to 1 of x^(α − 1) (1 − x)^(β − 1) dx
Thm: Β(α, β) = Γ(α) Γ(β) / Γ(α + β)


4.3 Summary Chart

POPULATION RANDOM VARIABLES (NUMERICAL)

Discrete
• Type: All distinct population values can be ordered and individually listed: {x1, x2, x3, …}; the x-values have gaps.
• Examples: X = Sum of two dice (2, 3, 4, …, 12); X = Shoe size (…, 8, 8½, 9, 9½, …); X = # "Successes" in n Bernoulli trials (0, 1, 2, …, n) ~ Binomial distribution.
• Table: Probability Table of the probability mass function f(x) ≥ 0, listing f(x1), f(x2), … for the values x1, x2, …, with total 1.
• Graphical displays: Probability Histogram, with bar heights f(x1), f(x2), f(x3), … and Total Area = 1; Cumulative Distribution F(x) = P(X ≤ x) = Σ f(x), a "step" graph rising from 0 to 1 through F(x1), F(x2), F(x3), ….
• Parameters: Mean µ = Σ over all x of x f(x); Variance σ² = Σ over all x of (x − µ)² f(x).
• Probability: P(X = c) = f(c); area from a to b: P(a ≤ X ≤ b) = Σ from a to b of f(x).

Continuous
• Type: Population is interval-valued, thus all of the values cannot be so listed; the x-values run along a continuous scale of real numbers.
• Examples: X = pH, Length, Area, Volume, Mass, Temp, etc.; X ~ Normal distribution.
• Table: None*
• Graphical displays: Density Curve of the density function f(x) ≥ 0, with mean µ, standard deviation σ, and Total Area = 1; Cumulative Distribution F(x) = P(X ≤ x) = ∫ f(x) dx, rising continuously from 0 to 1.
• Parameters: Mean µ = ∫ from −∞ to +∞ of x f(x) dx; Variance σ² = ∫ from −∞ to +∞ of (x − µ)² f(x) dx.
• Probability: P(X = c) = 0; area from a to b: P(a ≤ X ≤ b) = ∫ from a to b of f(x) dx.


* If X is a discrete random variable, then for any value x in the population, f(x) corresponds to the probability that x occurs, i.e., P(X = x). However, if X is a continuous population variable, this is false, since P(X = x) = 0. In this case, f(x) is its density function, and probability corresponds to the area under its graph up to x. Formally, this is defined in terms of the cumulative distribution: F(x) = P(X ≤ x), which rises continuously and monotonically from 0 to 1, as x increases. It is the values of this function that are often tabulated for selected (i.e., "discretized") values of x, as in the case of the standard normal distribution. F(x) is defined the same way for discrete variables, but it is only piecewise continuous, i.e., its graph is a "step" or "staircase" function.

Similarly, in a random sample, f(x) measures the relative frequency of x, and the cumulative distribution is defined the same way, F(x) = P̂(X ≤ x), where P̂ denotes the sample proportion. But since it is data-based, F(x) is known as the empirical distribution, and likewise has a stepwise graph from 0 to 1.

RANDOM SAMPLE

Either n observed data values {x1, x2, …, xn} selected from either type of population above, individually ordered and listed; or, if some values occur multiple times, only the k distinct values are listed, together with each of their corresponding frequencies {f1, f2, …, fk}, where f1 + f2 + ⋯ + fk = n.

• Table: Relative Frequency Table of the relative frequencies f(x) ≥ 0, listing f(x1), f(x2), …, f(xk) for the distinct values x1, x2, …, xk, with total 1.
• Graphical displays: Density Histogram, with Total Area = 1; Empirical Distribution F(x) = P̂(X ≤ x) = Σ f(x), a step graph rising from 0 to 1 through F(x1), F(x2), F(x3), ….
• Statistics: Mean x̄ = Σ over all x of x f(x); Variance s² = [n/(n − 1)] Σ over all x of (x − x̄)² f(x); Proportion P̂(X = c) = f(c); area from a to b: P̂(a ≤ X ≤ b) = Σ from a to b of f(x).



4.4 Problems 1. Patient noncompliance is one of many potential sources of bias in medical studies. Consider a

study where patients are asked to take 2 tablets of a certain medication in the morning, and 2 tablets at bedtime. Suppose however, that patients do not always fully comply and take both tablets at both times; it can also occur that only 1 tablet, or even none, are taken at either of these times.

(a) Explicitly construct the sample space S of all possible daily outcomes for a randomly selected

patient.

(b) Explicitly list the outcomes in the event that a patient takes at least one tablet at both times, and calculate its probability, assuming that the outcomes are equally likely.

(c) Construct a probability table and corresponding probability histogram for the random variable X = “the daily total number of tablets taken by a random patient.”

(d) Calculate the daily mean number of tablets taken.

(e) Suppose that the outcomes are not equally likely, but vary as follows:

    # tablets     AM probability     PM probability
        0             0.1                0.2
        1             0.3                0.3
        2             0.6                0.5

Rework parts (b)-(d) using these probabilities. Assume independence between AM and PM.

2. A statistician’s teenage daughter withdraws a certain amount of money X from an ATM every so

often, using a method that is unknown to him: she randomly spins a circular wheel that is equally divided among four regions, each containing a specific dollar amount, as shown.

Bank statements reveal that over the past n = 80 ATM transactions, $10 was withdrawn thirteen times, $20 sixteen times, $30 nineteen times, and $40 thirty-two times. For this sample, construct a relative frequency table, and calculate the average amount x̄ withdrawn per transaction, and the variance s².

Suppose this process continues indefinitely. Construct a probability table, and calculate the expected amount µ withdrawn per transaction, and the variance σ². (Verify that, for this sample, s² and σ² happen to be equal.)


3. A youngster finds a broken clock, on which the hour and minute hands can be randomly spun at

the same time, independently of one another. Each hand can land in any one of the twelve equal areas below, resulting in elementary outcomes in the form of ordered pairs (hour hand, minute hand), e.g., (7, 11), as shown.

Let the simple events A = “hour hand lands on 7” and B = “minute hand lands on 11.”

(a) Calculate each of the following probabilities. Show all work!

P(A and B)

P(A or B)

(b) Let the discrete random variable X = “the product of the two numbers spun”. List all the elementary outcomes that belong to the event C = “X = 36” and calculate its probability P(C).

(c) After playing for a little while, some of the numbers fall off, creating new areas, as shown. For example, the configuration below corresponds to the ordered pair (9, 12). Now calculate P(C).


4. An amateur game player throws darts at the dartboard shown below, with each target area worth the number of points indicated. However, because of the player’s inexperience, all of the darts hit random points that are uniformly distributed on the dartboard.

(a) Let X = “points obtained per throw.” What is the sample space S of this experiment?

(b) Calculate the probability of each outcome in S. (Hint: The area of a circle is πr².)

(c) What is the expected value of X, as darts are repeatedly thrown at the dartboard at random?

(d) What is the standard deviation of X?

Suppose that, if the total number of points in three independent random throws is exactly 100, the player wins a prize. With what probability does this occur? (Hint: For the random variable T = "total points in three throws," calculate the probability of each "ordered triple" outcome (X1, X2, X3) in the event "T = 100.")

5. Compare this problem with 2.5/10!

Consider the binary population variable Y, where Y = 1 with probability π and Y = 0 with probability 1 − π (see figure).

(a) Construct a probability table for this random variable.

(b) Show that the population mean µ_Y = π.

(c) Show that the population variance σ²_Y = π(1 − π). Note that π controls both the mean and the variance!

[Figure for Problem 4: dartboard of concentric rings (each of width 1) worth 10, 20, 30, 40, 50 points. Figure for Problem 5: a population of symbols, where a filled symbol = 1 and ◦ = 0.]


6. SLOT MACHINE

Wheel 1 Wheel 2 Wheel 3

A casino slot machine consists of three wheels, each with images of three types of fruit: apples, bananas, and cherries. When a player pulls the handle, the wheels spin independently of one another, until each one stops at a random image displayed in its window, as shown above. Thus, the sample space S of possible outcomes consists of the 27 ordered triples shown below, where events A = “Apple,” B = “Banana,” and C = “Cherries.”

(a) Complete the individual tables above, and use them to construct the probability table

(including the outcomes) for the discrete random variable X = “# Apples” that are displayed when the handle is pulled. Show all work. (Hint: To make calculations easier, express probabilities as fractions reduced to lowest terms, instead of as decimals.)

Wheel 1, Wheel 2, Wheel 3 (one table per wheel, to be completed):   Outcome (A, B, C)  |  Probability

Probability table for X = # Apples (to be completed):   X  |  Outcomes  |  Probability f(x)

The 27 ordered triples in S are:

(A A A), (A A B), (A A C), (A B A), (A B B), (A B C), (A C A), (A C B), (A C C)

(B A A), (B A B), (B A C), (B B A), (B B B), (B B C), (B C A), (B C B), (B C C)

(C A A), (C A B), (C A C), (C B A), (C B B), (C B C), (C C A), (C C B), (C C C)


(b) Sketch the corresponding probability histogram of X. Label all relevant features.

(c) Calculate the mean µ and variance σ2 of X. Show all work.

(d) Similar to X = “# Apples,” define random variables Y = “# Bananas” and Z = “# Cherries”

displayed in one play. The player wins if all three displayed images are of the same fruit. Using these variables, calculate the probability of a win. Show all work.

(e) Suppose it costs one dollar to play this game once. The result is that either the player loses the

dollar, or if the player wins, the slot machine pays out ten dollars in coins. If the player continues to play this game indefinitely, should he/she expect to win money, lose money, or neither, in the long run? If win or lose money, how much per play? Show all work.

7. Formally prove that each of the following is a valid density function. [Note: This is a rigorous

mathematical exercise.]

(a) Binomial: f(x) = C(n, x) π^x (1 − π)^(n − x),   x = 0, 1, 2, ..., n

(b) Poisson: f(x) = e^(−λ) λ^x / x!,   x = 0, 1, 2, ...

(c) Normal: f(x) = 1/(√(2π) σ) e^(−½ ((x − µ)/σ)²),   −∞ < x < +∞

8. Formally prove each of the following, using the appropriate “expected value” definitions.

[Note: As the preceding problem, this is a rigorous mathematical exercise.]

(a) If X ~ Bin(n, π), then µ = nπ and σ² = nπ(1 − π).

(b) If X ~ Poisson(λ), then µ = λ and σ² = λ.

(c) If X ~ N(α, β), then µ = α and σ² = β².

9. For any p > 0, sketch the graph of f(x) = p x^(−p − 1) for x ≥ 1 (and f(x) = 0 for x < 1), and formally show that it is a valid density function. Then show the following.

If p > 2, then f(x) has finite mean µ and finite variance σ².

If 1 < p ≤ 2, then f(x) has finite mean µ but infinite (i.e., undefined) variance.

If 0 < p ≤ 1, then f(x) has infinite (i.e., undefined) mean (and hence undefined variance). [Note: As with the preceding problems, this is a rigorous mathematical exercise.]


10. This is a subtle problem that illustrates an important difference between the normal distribution and many other distributions, the binomial in particular. Consider a large group of populations of males and females, such as all Wisconsin counties, and suppose that the random variable Y = "Age (years)" is normally distributed in all of them, each with some mean µ, and some variance σ². Clearly, there is no direct relationship between any µ and its corresponding σ², as we range continuously from county to county. (In fact, it is not unreasonable to assume that although the means may be different, the variances – which, recall, are measures of "spread" – might all be the same (or similar) throughout the counties. This is known as equivariance, a concept that we will revisit in Chapter 6.)

Suppose that, instead of age, we are now concerned with the different proportion of males from one county to another, i.e., π = P(Male). If we intend to select a random sample of n = 100 individuals from each county, then the random variable X = "Number of males" in each sample is binomially distributed, i.e., X ~ Bin(100, π), for 0 ≤ π ≤ 1. Answer each of the following.

If a county has no males, compute the mean µ and variance σ².

If a county has all males, compute the mean µ and variance σ².

If a county has males and females in equal proportions, compute the mean µ and variance σ².

Sketch an accurate graph of σ² on the vertical axis, versus π on the horizontal axis, for n = 100 and 0 ≤ π ≤ 1, as we range continuously from county to county. Conclusions?

Note: Also see related problem 4.4/5.


11. Imagine that a certain disease occurs in a large population in such a way that the probability of a

randomly selected individual having the disease remains constant at π = .008, independent of any other randomly selected individual having the disease. Suppose now that a sample of n = 500 individuals is to be randomly selected from this population. Define the discrete random variable X = “the number of diseased individuals,” capable of assuming any value in the set {0, 1, 2, …, 500} for this sample.

(a) Calculate the probability distribution function f(x) = P(X = x) – “the probability that the number of diseased individuals equals x” – for x = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Do these computations two ways: first, using the Binomial Distribution and second, using the Poisson Distribution, and arrange these values into a probability table. (For the sake of comparison, record at least five decimal places.) Tip: Use the functions dbinom and dpois in R.

    x      Binomial     Poisson
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    etc.   etc.         etc.

(b) Using either the Binomial or Poisson Distribution, what is the mean number of diseased individuals to be expected in the sample, and what is its probability? How does this probability compare with the probabilities of other numbers of diseased individuals?

(c) Suppose that, after sampling n = 500 individuals, you find that X = 10 of them actually have this disease. Before performing any formal statistical tests, what assumptions – if any – might you suspect have been violated in this scenario? What is the estimate of the probability π of disease, based on the data of this sample?

12. The uniform density function given in the notes has median and mean = 3.5, by inspection. Calculate the variance.


13.

(a) Let f(x) = x/8 for 0 ≤ x ≤ 4, and = 0 elsewhere, as shown below left.

Confirm that f(x) is indeed a density function.

Determine the formula for the cumulative distribution function ( ) ( ),F x P X x= ≤ and sketch its graph. Recall that F(x) corresponds to the area under the density curve f(x) up to and including the value x, and therefore must increase monotonically and continuously from 0 to 1, as x increases.

Using F(x), calculate the probabilities ( 1),P X ≤ ( 3),P X ≤ and (1 3).P X≤ ≤

Using F(x), calculate the quartiles Q1, Q2, and Q3.

(b) Repeat (a) for the function f(x) = x/6 for 0 ≤ x ≤ 2, and f(x) = 1/3 for 2 ≤ x ≤ 4 (and = 0 elsewhere), as shown below right.

14. Define the piecewise uniform function f(x) = 1/8 for 1 ≤ x < 3, and f(x) = 1/4 for 3 ≤ x ≤ 6 (and = 0 elsewhere). Prove that this is a valid density function, sketch the cdf F(x), and find the median, mean, and variance.



15. Suppose that the continuous random variable X = “age of juniors at the UW-Iwanagoeechees campus” is symmetrically distributed about its mean, but piecewise linear as illustrated, rather than being a normally distributed bell curve.

For an individual selected at random from this population, calculate each of the following.

(a) Verify by direct computation that P(18 ≤ X ≤ 22) = 1, as it should be. [Hint: Recall that the area of a triangle = ½ (base × height).]

(b) P(18 ≤ X < 18.5)

(c) P(18.5 < X ≤ 19)

(d) P(19.5 < X < 20.5)

(e) What symmetric interval about the mean contains exactly half the population values? Express in terms of years and months.

16. Suppose that in a certain population of adult males, the variable X = “total serum cholesterol level

(mg/dL)” is found to be normally distributed, with mean µ = 220 and standard deviation σ = 40. For an individual selected at random, what is the probability that his cholesterol level is…

(a) under 190? under 210? under 230? under 250?

(b) over 240? over 270? over 300? over 330?

(c) Using the R command pnorm, redo parts (a) and (b). [Type ?pnorm for syntax help.

Ex: pnorm(q=190, mean=220, sd=40), or more simply, pnorm(190, 220, 40)]

(d) over 250, given that it is over 240? [Tip: See the last question in (a), and the first in (b).]

(e) between 214 and 276?

(f) between 202 and 238?

(g) Eighty-eight percent of men have a cholesterol level below what value? Hint: First find the approximate critical value of z that satisfies P(Z ≤ +z) = 0.88, then change back to X.

(h) Using the R command qnorm, redo (g). [Type ?qnorm for syntax help.]

(i) What symmetric interval about the mean contains exactly half the population values?

Hint: First find the approximate critical value of z that satisfies P(−z ≤ Z ≤ +z) = 0.5, then change back to X.

[Figure for Problem 15: graph of the piecewise linear density f(x), peaking at height 1/3 at µ = 20, over the interval 18 ≤ X ≤ 22.]

Submit a copy of the output, and clearly show agreement of your answers!

Submit a copy of the output, and clearly show agreement of your answer!


M ~ N(10, 2.5)

F ~ N(16, 5)

17. A population biologist is studying a certain species of lizard, whose sexes appear alike, except for size. It is known that in the adult male population, length M is normally distributed with mean µ_M = 10.0 cm and standard deviation σ_M = 2.5 cm, while in the adult female population, length F is normally distributed with mean µ_F = 16.0 cm and standard deviation σ_F = 5.0 cm.

(a) Suppose that a single adult specimen of length 11 cm is captured at random, and its sex identified as either a larger-than-average male, or a smaller-than-average female.

Calculate the probability that a randomly selected adult male is as large as, or larger than, this specimen.

Calculate the probability that a randomly selected adult female is as small as, or smaller than, this specimen.

Based on this information, which of these two events is more likely?

(b) Repeat part (a) for a second captured adult specimen, of length 12 cm.

(c) Repeat part (a) for a third captured adult specimen, of length 13 cm.


18. Consider again the male and female lizard populations in the previous problem.

(a) Answer the following. Calculate the probability that the length of a randomly selected adult male falls between

the two population means, i.e., between 10 cm and 16 cm. Calculate the probability that the length of a randomly selected adult female falls between

the two population means, i.e., between 10 cm and 16 cm.

(b) Suppose it is known that males are slightly less common than females; in particular, males comprise 40% of the lizard population, and females 60%. Further suppose that the length of a randomly selected adult specimen of unknown sex falls between the two population means, i.e., between 10 cm and 16 cm. Calculate the probability that it is a male. Calculate the probability that it is a female.

Hint: Use Bayes’ Theorem.

19. Bob spends the majority of a certain evening in his favorite drinking establishment.

Eventually, he decides to spend the rest of the night at the house of one of his two friends, each of whom lives ten blocks away in opposite directions. However, being a bit intoxicated, he engages in a so-called “random walk” of n = 10 blocks where, at the start of each block, he first either turns and faces due west with probability 0.4, or independently, turns and faces due east with probability 0.6, before continuing. Using this information, answer the following.

Hint: Let the discrete random variable X = “number of east turns in n = 10 blocks.”

(0, 1, 2, 3, …, 10)

(a) Calculate the probability that he ends up at Al’s house.

(b) Calculate the probability that he ends up at Carl’s house.

(c) Calculate the probability that he ends up back where he started.

(d) How far, and in which direction, from where he started is he expected to end up, on average?

(Hint: Combine the expected number of east and west turns.) With what probability does this occur?

West East

Al’s house

Carl’s house

ECK’S BAR


20. (a) Let “X = # Heads” in n = 100 tosses of a fair coin (i.e., π = 0.5). Write but DO NOT

EVALUATE an expression to calculate the probability P(X ≤ 45 or X ≥ 55).

(b) In R, type ?dbinom, and scroll down to Examples, where P(45 < X < 55) is computed for X ~ Binomial(100, 0.5). Copy, paste, and run the single line of code given, and use it to calculate the probability in (a).

(c) How does this compare with the corresponding probability on page 1.1-4?

21.

(a) How much overlap is there between the bell curves Z ~ N(0, 1) and X ~ N(2, 1)? (Take µ = 2 in the figure below.) That is, calculate the probability that a randomly selected population value is either in the upper tail of N(0, 1), or in the lower tail of N(2, 1). Hint: Where on the horizontal axis do the two curves cross in this case?

(b) Suppose X ~ N(µ, 1) for a general µ; see figure. How close to 0 does the mean µ have to be, in order for the overlap between the two distributions to be equal to 20%? 50%? 80%?

[Figure: two unit-variance normal curves, Z centered at 0 and X centered at µ.]


22. Consider the two following modified Cauchy distributions.

(a) "Truncated" Cauchy: f(x) = (2/π) · 1/(1 + x²) for −1 ≤ x ≤ +1 (and f(x) = 0 otherwise). Show that this is a valid density function, and sketch its graph. Find the cdf F(x), and sketch its graph. Find the mean and variance.

(b) "One-sided" Cauchy: f(x) = (2/π) · 1/(1 + x²) for x ≥ 0 (and f(x) = 0 otherwise). Show that this is a valid density function, and sketch its graph. Find the cdf F(x), and sketch its graph. Find the median. Does the mean exist?

23. Suppose that the random variable X = "time-to-failure (yrs)" of a standard model of a medical implant device is known to follow a uniform distribution over ten years, and therefore corresponds to the density function f1(x) = 0.1 for 0 ≤ x ≤ 10 (and zero otherwise). A new model of the same implant device is tested, and determined to correspond to a time-to-failure density function f2(x) = 0.009x² − 0.08x + 0.2 for 0 ≤ x ≤ 10 (and zero otherwise). See figure.

(a) Verify that f1(x) and f2(x) are indeed legitimate density functions.

(b) Determine and graph the corresponding cumulative distribution functions F1(x) and F2(x).

(c) Calculate the probability that each model fails within the first five years of operation.

(d) Calculate the median failure time of each model.

(e) How do F1(x) and F2(x) compare? In particular, is one model always superior during the entire ten years, or is there a time in 0 < x < 10 when a switch occurs in which model outperforms the other, and if so, when (and which model) is it? Be as specific as possible.


24. Suppose that a certain random variable X follows a Poisson distribution with mean λ cases – i.e., X1 ~ Poisson(λ) – in the first year, then independently, follows a Poisson distribution with mean µ cases – i.e., X2 ~ Poisson(µ) – in the second year. Then it should seem intuitively correct that the sum X1 + X2 follows a Poisson distribution with mean λ + µ cases – i.e., X1 + X2 ~ Poisson(λ + µ) – over the entire two-year period. Formally prove that this is indeed true. (In other words, the sum of two Poisson variables is also a Poisson variable.)

25. [Note: The result of the previous problem might be useful for part (e).] Suppose the occurrence of a rare disease in a certain population is known to follow a Poisson distribution, with an average of λ = 2.3 cases per year. In a typical year, what is the probability that…

(a) no cases occur? (b) exactly one case occurs? (c) exactly two cases occur? (d) three or more cases occur? (e) Answer (a)-(d) for a typical two-year period. (Assume independence from year to year.) (f) Use the function dpois in R to redo (a), (b), and (c), and include the output as part of your

submitted assignment, clearly showing agreement of your answers. (g) Use the function ppois in R to redo (d), and include the output as part of your submitted

assignment, clearly showing agreement of your answer.

26. (a) Population 1 consists of individuals whose ages are uniformly distributed from 0 to 50 years old.

• What is the mean age of the population?

• What proportion of the population is between 30 and 50 years old?

(b) Population 2 consists of individuals whose ages are uniformly distributed from 50 to 90 years old.

• What is the mean age of the population?

• What proportion of the population is between 50 and 80 years old?

(c) Suppose the two populations are combined into a single population.

• What is the mean age of the population?

• What proportion of the population is between 30 and 80 years old?


27. Let X be a discrete random variable on a population, with corresponding probability mass function f(x), i.e., P(X = x). Then recall that the population mean, or expectation, of X is defined as

    µ = Mean(X) = E[X] = Σ over all x of x f(x),

and the population variance of X is defined as

    σ² = Var(X) = E[(X − µ)²] = Σ over all x of (x − µ)² f(x).

(NOTE: Also recall that if X is a continuous random variable with density function f(x), all of the definitions above, as well as those that follow, can be modified simply by replacing the summation sign Σ by an integral symbol ∫ over all population values x. For example, µ = Mean(X) = E[X] = ∫ from −∞ to +∞ of x f(x) dx, etc.)

Now suppose we have two such random variables X and Y, with corresponding joint distribution function f(x, y), i.e., P(X = x, Y = y). Then in addition to the individual means µ_X, µ_Y and variances σ²_X, σ²_Y above,* we can also define the population covariance between X and Y:

    σ_XY = Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = Σ over all x Σ over all y of (x − µ_X)(y − µ_Y) f(x, y).

Example: A sociological study investigates a certain population of married couples, with random variables X = “number of husband’s former marriages (0, 1, or 2)” and Y = “number of wife’s former marriages (0 or 1).” Suppose that the joint probability table is given below.

                              X = # former marriages (Husbands)
                                    0       1       2
    Y = # former marriages    0    .19     .20     .01     .40
          (Wives)             1    .01     .10     .49     .60
                                   .20     .30     .50    1.00

For instance, the probability f(0, 0) = P(X = 0, Y = 0) = .19, i.e., neither spouse was previously married in 19% of this population of married couples. Similarly, f(2, 1) = P(X = 2, Y = 1) = .49, i.e., in 49% of this population, the husband was married twice before, and the wife once before, etc.

* The individual distribution functions f_X(x) for X, and f_Y(y) for Y, correspond to the so-called marginal distributions of the joint distribution f(x, y), as will be seen in the upcoming example.


From their joint distribution above, we can read off the marginal distributions of X and Y:

    X    f_X(x)          Y    f_Y(y)
    0    0.2             0    0.4
    1    0.3             1    0.6
    2    0.5                  1.0
         1.0

from which we can compute the corresponding population means and population variances:

    µ_X = (0)(0.2) + (1)(0.3) + (2)(0.5),  i.e., µ_X = 1.3
    µ_Y = (0)(0.4) + (1)(0.6),  i.e., µ_Y = 0.6

    σ²_X = (0 − 1.3)²(0.2) + (1 − 1.3)²(0.3) + (2 − 1.3)²(0.5),  i.e., σ²_X = 0.61
    σ²_Y = (0 − 0.6)²(0.4) + (1 − 0.6)²(0.6),  i.e., σ²_Y = 0.24.

But now, we can also compute the population covariance between X and Y, using their joint distribution:

    σ_XY = (0 − 1.3)(0 − 0.6) f(0,0) + (1 − 1.3)(0 − 0.6) f(1,0) + (2 − 1.3)(0 − 0.6) f(2,0)
         + (0 − 1.3)(1 − 0.6) f(0,1) + (1 − 1.3)(1 − 0.6) f(1,1) + (2 − 1.3)(1 − 0.6) f(2,1)
         = (0 − 1.3)(0 − 0.6)(.19) + (1 − 1.3)(0 − 0.6)(.20) + (2 − 1.3)(0 − 0.6)(.01)
         + (0 − 1.3)(1 − 0.6)(.01) + (1 − 1.3)(1 − 0.6)(.10) + (2 − 1.3)(1 − 0.6)(.49),

    i.e., σ_XY = 0.30.

(A more meaningful context for the covariance will be discussed in Chapter 7.)

(a) Recall that two events A and B are statistically independent if P(A ∩ B) = P(A) P(B). Therefore, in this context, two discrete random variables X and Y are statistically independent if, for all population values x and y, we have P(X = x, Y = y) = P(X = x) P(Y = y). That is, f(x, y) = f_X(x) f_Y(y), i.e., the joint probability distribution is equal to the product of the marginal distributions. However, it then follows from the covariance definition above that

    σ_XY = Σ over all x Σ over all y of (x − µ_X)(y − µ_Y) f_X(x) f_Y(y) = [Σ over all x of (x − µ_X) f_X(x)] [Σ over all y of (y − µ_Y) f_Y(y)] = 0,

since each of the two factors in this product is the sum of the deviations of the variable from its respective mean, hence = 0. Consequently, we have the important property that

    If X and Y are statistically independent, then Cov(X, Y) = 0.


Verify that this statement is true for the joint probability table below.

                              X = # former marriages (Husbands)
                                    0       1       2
    Y = # former marriages    0    .08     .12     .20     .40
          (Wives)             1    .12     .18     .30     .60
                                   .20     .30     .50    1.00

That is, first confirm that X and Y are statistically independent, by showing that each cell probability is equal to the product of the corresponding row marginal and column marginal probabilities (as in Chapter 3). Then, using the previous example as a guide, compute the covariance, and show that it is equal to zero.

(b) The converse of the statement in (a), however, is not necessarily true! For the table below, show that Cov(X, Y) = 0, but X and Y are not statistically independent.

                              X = # former marriages (Husbands)
                                    0       1       2
    Y = # former marriages    0    .13     .02     .25     .40
          (Wives)             1    .07     .28     .25     .60
                                   .20     .30     .50    1.00


28. Using the joint distribution f(x, y), we can also define the sum X + Y and difference X − Y of two discrete random variables in a natural way, as follows.

    X + Y = {x + y | x ∈ X, y ∈ Y}        X − Y = {x − y | x ∈ X, y ∈ Y}

That is, the variable X + Y consists of all possible sums x + y, where x comes from the population distribution of X, and y comes from the population distribution of Y. Likewise, the variable X − Y consists of all possible differences x − y, where x comes from the population distribution of X, and y comes from the population distribution of Y. The following important statements (I and II below) can then be easily proved, from the algebraic properties of mathematical expectation given in the notes. (Exercise)

Example (cont’d): Again consider the first joint probability table in the previous problem:

                              X = # former marriages (Husbands)
                                    0       1       2
    Y = # former marriages    0    .19     .20     .01     .40
          (Wives)             1    .01     .10     .49     .60
                                   .20     .30     .50    1.00

We are particularly interested in studying D = X − Y, the difference between these two variables. As before, we reproduce their respective marginal distributions below. In order to construct a probability table for D, we must first list all the possible (x, y) ordered-pair outcomes in the sample space, but use the joint probability table to calculate the corresponding probability values:

    X    f_X(x)      Y    f_Y(y)        D = X − Y    Outcomes           f(d)
    0    0.2         0    0.4           −1           (0, 1)             .01
    1    0.3         1    0.6            0           (0, 0), (1, 1)     .29 = .19 + .10
    2    0.5              1.0            1           (1, 0), (2, 1)     .69 = .20 + .49
         1.0                             2           (2, 0)             .01
                                                                        1.00

I.  (A) Mean(X + Y) = Mean(X) + Mean(Y)

    (B) Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

II. (A) Mean(X − Y) = Mean(X) − Mean(Y)

    (B) Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y)


We are now able to compute the population mean and variance of the variable D:

    µ_D = (−1)(.01) + (0)(.29) + (1)(.69) + (2)(.01),  i.e., µ_D = 0.7

    σ²_D = (−1 − 0.7)²(.01) + (0 − 0.7)²(.29) + (1 − 0.7)²(.69) + (2 − 0.7)²(.01),  i.e., σ²_D = 0.25

To verify properties II(A) and II(B) above, we can use the calculations already done in the previous problem, i.e., µ_X = 1.3, µ_Y = 0.6, σ²_X = 0.61, σ²_Y = 0.24, and σ_XY = 0.30:

    Mean(X − Y) = 0.7 = 1.3 − 0.6 = Mean(X) − Mean(Y)

    Var(X − Y) = 0.25 = 0.61 + 0.24 − 2(0.30) = Var(X) + Var(Y) − 2 Cov(X, Y)

Using this example as a guide, verify properties II(A) and II(B) for the tables in part (a) and part (b) of the previous problem. These properties are extremely important, and will be used in §6.2.

29. On his way to work every morning, Bob first takes the bus from his house, exits near his

workplace, and walks the remaining distance. His time spent on the bus (X) is a random variable that follows a normal distribution, with mean µ = 20 minutes, and standard deviation σ = 2 minutes, i.e., X ~ N(20, 2). Likewise, his walking time (Y) is also a random variable that follows a normal distribution, with mean µ = 10 minutes, and standard deviation σ = 1.5 minutes, i.e., Y ~ N(10, 1.5). Find the probability that Bob arrives at his workplace in 35 minutes or less. [Hint: Total time = X + Y ~ N(?, ?). Recall the “General Fact” on page 4.1-13, which is true for both discrete and continuous random variables.]

30. The arrival time of my usual morning bus (B) is normally distributed, with a mean ETA at 8:00

AM, and a standard deviation of 4 minutes. My arrival time (A) at the bus stop is also normally distributed, with a mean ETA at 7:50 AM, and a standard deviation of 3 minutes.

(a) With what probability can I expect to catch the bus? (Hint: What is the distribution of the random variable X = A – B, and what must be true about X in the event that I catch the bus?)

(b) On average, how much earlier should I arrive, if I expect to catch the bus with 99%

probability?



31. Discrete vs. Continuous

(a) Discrete: General. Imagine a flea starting from initial position X = 0, only able to move by making integer jumps X = 1, X = 2, X = 3, X = 4, X = 5, or X = 6, according to the following probability table and corresponding probability histogram.

x       0     1     2     3     4     5     6
f(x)   .05   .10   .20   .30   .20   .10   .05

Confirm that P(0 ≤ X ≤ 6) = 1, i.e., this is indeed a legitimate probability distribution.

Calculate the probability P(2 ≤ X ≤ 4).

Determine the mean µ and standard deviation σ of this distribution.

(b) Discrete: Binomial. Now imagine a flea starting from initial position X = 0, only able to move by making integer jumps X = 1, X = 2, …, X = 6, according to a binomial distribution, with π = 0.5. That is, X ~ Bin(6, 0.5).

x       0     1     2     3     4     5     6
f(x)

Complete the probability table above, and confirm that P(0 ≤ X ≤ 6) = 1.

Calculate the probability P(2 ≤ X ≤ 4).

Determine the mean µ and standard deviation σ of this distribution.


(c) Continuous: General. Next imagine an ant starting from initial position X = 0, able to move by crawling to any position in the interval [0, 6], according to the following probability density curve.

f(x) = x/9,         if 0 ≤ x ≤ 3
     = (6 – x)/9,   if 3 < x ≤ 6

Confirm that P(0 ≤ X ≤ 6) = 1, i.e., this is indeed a legitimate probability density.

Calculate the probability P(2 ≤ X ≤ 4).

What distance is the ant able to pass only 2% of the time? That is, P(X ≥ ?) = .02.

(d) Continuous: Normal. Finally, imagine an ant

starting from initial position X = 0, able to move by crawling to any position in the interval [0, 6], according to the normal probability curve, with mean µ = 3, and standard deviation σ = 1. That is, X ~ N(3, 1).

Calculate the probability P(2 ≤ X ≤ 4).

What distance is the ant able to pass only 2% of the time? That is, P(X ≥ ?) = .02. 32. Temporary place-holder during SIBS – to be deleted


33.

(a) The ages of employees in a certain workplace are normally distributed. It is known that 80% of the workers are under 65 years old, and 67% are under 55 years old. What percentage of the workers are under 45 years old? (Hint: First find µ and σ by calculating the z-scores.)

(b) Suppose it is known that the wingspan X of the males of a certain bat species is normally distributed with some mean µ and standard deviation σ, i.e., X ~ N(µ, σ), while the wingspan Y of the females is normally distributed with the same mean µ, but standard deviation twice that of the males, i.e., Y ~ N(µ, 2σ). It is also known that 80% of the males have a wingspan less than a certain amount m. What percentage of the females have a wingspan less than this same amount m? (Hint: Calculate the z-scores.)


5. Sampling Distributions and the Central Limit Theorem

5.1 Motivation 5.2 Formal Statement and Examples 5.3 Problems


5. Sampling Distributions and the Central Limit Theorem

5.1 Motivation

[Figure: POPULATION = U.S. Adult Males; random variable X = Height (inches). The population distribution of X has mean µ_X = 70 and standard deviation σ_X = 4: typical individuals are near 70 inches, with a few rare short and tall outliers (x << 70 or x >> 70). Random samples, all of size n, have means x̄ ≈ 70 in the vast majority of cases; sample means far below or far above 70 are extremely rare. The sampling distribution of X̄ is therefore centered at µ = 70, but with the much smaller standard deviation σ_X̄ = 4/√n.]


5.2 Formal Statement and Examples

Comments:

σ/√n is called the “standard error of the mean,” denoted SEM, or more simply, s.e.

The corresponding Z-score transformation formula is Z = (X̄ − µ) / (σ/√n) ~ N(0, 1).

Example: Suppose that the ages X of a certain population are normally distributed, with mean µ = 27.0 years, and standard deviation σ = 12.0 years, i.e., X ~ N(27, 12).

Sampling Distribution of a Normal Variable: Given a random variable X. Suppose that the population distribution of X is known to be normal, with mean µ and variance σ², that is, X ~ N(µ, σ). Then, for any sample size n, it follows that the sampling distribution of X̄ is normal, with mean µ and variance σ²/n; that is, X̄ ~ N(µ, σ/√n).

The probability that the age of a single randomly selected individual is less than 30 years is

P(X < 30) = P(Z < (30 − 27)/12) = P(Z < 0.25) = 0.5987.

Now consider all random samples of size n = 36 taken from this population. By the above, their mean ages X̄ are also normally distributed, with mean µ = 27 yrs as before, but with standard error σ/√n = 12 yrs/√36 = 2 yrs. That is, X̄ ~ N(27, 2).

The probability that the mean age of a single sample of n = 36 randomly selected individuals is less than 30 years is

P(X̄ < 30) = P(Z < (30 − 27)/2) = P(Z < 1.5) = 0.9332.

In this population, the probability that the average age of 36 random people is under 30 years old, is much greater than the probability that the age of one random person is under 30 years old.
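Both probabilities can be verified directly in R with pnorm(); a minimal sketch, using the same N(27, 12) population and n = 36 as above (the same calls can be adapted for the exercises below):

# Single individual: X ~ N(27, 12)
pnorm(30, mean = 27, sd = 12)   # P(X < 30) = 0.5987
# Sample mean of n = 36: Xbar ~ N(27, 12/sqrt(36)) = N(27, 2)
pnorm(30, mean = 27, sd = 2)    # P(Xbar < 30) = 0.9332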

Exercise: Compare the two probabilities of being under 24 years old.

Exercise: Compare the two probabilities of being between 24 and 30 years old.


The Central Limit Theorem: Given any random variable X, discrete or continuous, with finite mean µ and finite variance σ². Then, regardless of the shape of the population distribution of X, as the sample size n gets larger, the sampling distribution of X̄ becomes increasingly closer to normal, with mean µ and variance σ²/n; that is, X̄ ~ N(µ, σ/√n), approximately.

More formally,   Z = (X̄ − µ) / (σ/√n) → N(0, 1)   as   n → ∞.

If X ~ N(µ, σ) approximately, then X̄ ~ N(µ, σ/√n) approximately. (The larger the value of n, the better the approximation.) In fact, more is true...

IMPORTANT GENERALIZATION: Intuitively perhaps, there is less variation between different sample mean values, than

there is between different population values. This formal result states that, under very general conditions, the sampling variability is usually much smaller than the population variability, as well as gives the precise form of the “limiting distribution” of the statistic.

What if the population standard deviation σ is unknown? Then it can be replaced by the sample standard deviation s, provided n is large. That is, X̄ ~ N(µ, s/√n) approximately, if n ≥ 30 or so, for “most” distributions (... but see example below). Since the value s/√n is a sample-based estimate of the true standard error s.e., it is commonly denoted ŝ.e.

Because the mean µ_X̄ of the sampling distribution is equal to the mean µ_X of the population distribution – i.e., E[X̄] = µ_X – we say that X̄ is an unbiased estimator of µ_X. In other words, the sample mean is an unbiased estimator of the population mean.

A biased sample estimator is a statistic θ̂ whose “expected value” either consistently overestimates or underestimates its intended population parameter θ.

Many other versions of CLT exist, related to so-called Laws of Large Numbers.

Example: Consider a(n infinite) population of paper notes, 50% of which are blank, 30% are ten-dollar bills, and the remaining 20% are twenty-dollar bills.

Experiment 1: Randomly select a single note from the population.

Random variable: X = $ amount obtained

Mean µ_X = E[X] = (.5)(0) + (.3)(10) + (.2)(20) = $7.00

Variance σ²_X = E[(X – µ_X)²] = (.5)(−7)² + (.3)(3)² + (.2)(13)² = 61

Standard deviation σ_X = $7.81

x      f(x) = P(X = x)
0      .5
10     .3
20     .2


Experiment 2: Each of n = 2 people randomly selects a note, and split the winnings.

Random variable: X̄ = $ sample mean amount obtained per person

(x1, x2)      x̄      Probability
(0, 0)         0      .5 × .5 = 0.25
(0, 10)        5      .5 × .3 = 0.15
(0, 20)       10      .5 × .2 = 0.10
(10, 0)        5      .3 × .5 = 0.15
(10, 10)      10      .3 × .3 = 0.09
(10, 20)      15      .3 × .2 = 0.06
(20, 0)       10      .2 × .5 = 0.10
(20, 10)      15      .2 × .3 = 0.06
(20, 20)      20      .2 × .2 = 0.04

x̄      f(x̄) = P(X̄ = x̄)
0       .25
5       .30 = .15 + .15
10      .29 = .10 + .09 + .10
15      .12 = .06 + .06
20      .04

Mean µ_X̄ = (.25)(0) + (.30)(5) + (.29)(10) + (.12)(15) + (.04)(20) = $7.00 = µ_X !!

Variance σ²_X̄ = (.25)(−7)² + (.30)(−2)² + (.29)(3)² + (.12)(8)² + (.04)(13)² = 30.5 = 61/2 = σ²_X / n !!

Standard deviation σ_X̄ = $5.52 = σ_X / √n !!


Experiment 3: Each of n = 3 people randomly selects a note, and split the winnings.

Random variable: X̄ = $ sample mean amount obtained per person

(x1, x2, x3)       x̄        Probability
(0, 0, 0)           0.00     .5 × .5 × .5 = 0.125
(0, 0, 10)          3.33     .5 × .5 × .3 = 0.075
(0, 0, 20)          6.67     .5 × .5 × .2 = 0.050
(0, 10, 0)          3.33     .5 × .3 × .5 = 0.075
(0, 10, 10)         6.67     .5 × .3 × .3 = 0.045
(0, 10, 20)        10.00     .5 × .3 × .2 = 0.030
(0, 20, 0)          6.67     .5 × .2 × .5 = 0.050
(0, 20, 10)        10.00     .5 × .2 × .3 = 0.030
(0, 20, 20)        13.33     .5 × .2 × .2 = 0.020
(10, 0, 0)          3.33     .3 × .5 × .5 = 0.075
(10, 0, 10)         6.67     .3 × .5 × .3 = 0.045
(10, 0, 20)        10.00     .3 × .5 × .2 = 0.030
(10, 10, 0)         6.67     .3 × .3 × .5 = 0.045
(10, 10, 10)       10.00     .3 × .3 × .3 = 0.027
(10, 10, 20)       13.33     .3 × .3 × .2 = 0.018
(10, 20, 0)        10.00     .3 × .2 × .5 = 0.030
(10, 20, 10)       13.33     .3 × .2 × .3 = 0.018
(10, 20, 20)       16.67     .3 × .2 × .2 = 0.012
(20, 0, 0)          6.67     .2 × .5 × .5 = 0.050
(20, 0, 10)        10.00     .2 × .5 × .3 = 0.030
(20, 0, 20)        13.33     .2 × .5 × .2 = 0.020
(20, 10, 0)        10.00     .2 × .3 × .5 = 0.030
(20, 10, 10)       13.33     .2 × .3 × .3 = 0.018
(20, 10, 20)       16.67     .2 × .3 × .2 = 0.012
(20, 20, 0)        13.33     .2 × .2 × .5 = 0.020
(20, 20, 10)       16.67     .2 × .2 × .3 = 0.012
(20, 20, 20)       20.00     .2 × .2 × .2 = 0.008

x̄          f(x̄) = P(X̄ = x̄)
0.00       .125
3.33       .225 = .075 + .075 + .075
6.67       .285 = .050 + .045 + .050 + .045 + .045 + .050
10.00      .207 = .030 + .030 + .030 + .027 + .030 + .030 + .030
13.33      .114 = .020 + .018 + .018 + .020 + .018 + .020
16.67      .036 = .012 + .012 + .012
20.00      .008

Mean µ_X̄ = Exercise = $7.00 = µ_X !!!

Variance σ²_X̄ = Exercise = 20.333 = 61/3 = σ²_X / n !!!

Standard deviation σ_X̄ = $4.51 = σ_X / √n !!!


The tendency toward a normal distribution becomes stronger as the sample size n gets larger, despite the mild skew in the original population values. This is an empirical consequence of the Central Limit Theorem. For most such distributions, n ≥ 30 or so is sufficient for a reasonable normal approximation to the sampling distribution. In fact, if the distribution is symmetric, then convergence to a bell curve can often be seen for much lower n, say only n = 5 or 6. Recall also, from the first result in this section, that if the population is normally distributed (with known σ), then so will be the sampling distribution, for any n. BUT BEWARE....
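This empirical behavior can also be checked by simulation in R, in the same spirit as the simulations in §5.3; a minimal sketch for the paper-note population above (the sample size n and the number of replications are arbitrary choices):

# Population of notes: $0, $10, $20 with probabilities .5, .3, .2
n <- 30    # try n = 2, 3, 30, ...
xbars <- replicate(10000, mean(sample(c(0, 10, 20), n, replace = TRUE, prob = c(.5, .3, .2))))
mean(xbars)                 # close to mu = 7
sd(xbars)                   # close to sigma/sqrt(n) = 7.81/sqrt(n)
hist(xbars, freq = FALSE)   # increasingly bell-shaped as n grows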


However, if the population distribution of X is highly skewed, then the sampling distribution of X can be highly skewed as well (especially if n is not very large), i.e., relying on CLT can be risky! (Although, sometimes using a transformation, such as ln(X) or X, can restore a bell shape to the values. Later…)

Example: The two graphs on the bottom of this page are simulated sampling distributions for the highly skewed population shown below. Both are density histograms based on the means of 1000 random samples; the first corresponds to samples of size n = 30, the second to n = 100. Note that skew is still present!

[Figures: the highly skewed population distribution, together with the two simulated density histograms of sample means described above (n = 30 and n = 100).]


5.3 Problems

1. Computer Simulation of Distributions

(a) In Problem 4.4/8(c), it was formally proved that if Z follows a standard normal distribution – i.e., has density function ϕ(z) = (1/√(2π)) e^(−z²/2) – then its expected value is given by the mean µ = 0. A practical understanding of this interpretation can be achieved via empirical computer simulation. For concreteness, suppose the random variable Z = “Temperature (°C)” ~ N(0°, 1°). Let us consider a single sample of n = 400 randomly generated z-value temperature measurements from a frozen lake, and calculate its mean temperature z̄, via the following R code.

# Generate and display one random sample.
sample <- rnorm(400)
sort(sample)

Upon inspection, it should be apparent that there is some variation among these z-values.

# Compare density histogram of sample against population distribution Z ~ N(0, 1).
hist(sample, freq = F)
curve(dnorm(x), lwd = 2, col = "darkgreen", add = T)

# Calculate and display sample mean.
mean(sample)

This sample mean z̄ should be fairly close to the actual expected value in the population, µ = 0° (likewise, sd(sample) should be fairly close to σ = 1°), but it is only generated from a single sample. To obtain an even better estimate of µ, consider say, 500 samples, each containing n = 400 randomly generated z-values. Then average each sample to find its mean temperature, and obtain {z̄₁, z̄₂, z̄₃, ..., z̄₅₀₀}.

# Generate and display 500 random sample means.
zbars <- NULL
for (s in 1:500) {
  sample <- rnorm(400)
  zbars <- c(zbars, mean(sample))
}
sort(zbars)

Upon inspection, it should be apparent that there is little variation among these z̄-values.

# Compare density histogram of sample means against sampling distribution Zbar ~ N(0, 0.05).
hist(zbars, freq = F)
curve(dnorm(x, 0, 0.05), lwd = 2, col = "darkgreen", add = T)

# Calculate and display mean of the sample means.
mean(zbars)

This value should be extremely close to the mean µ = 0°, because there is much less variation about µ in the sampling distribution, than in the population distribution. (In fact, via the Central Limit Theorem, the standard deviation is now only σ/√n = 1/√400 = 0.05°. Check this value against sd(zbars).)

(b) Contrast the preceding example with the following. A random variable X is said to

follow a standard Cauchy (pronounced “ko-shee”) distribution if it has the density function

f_Cauchy(x) = (1/π) · 1/(1 + x²),   for −∞ < x < +∞,

as illustrated.

First, as in Problem 4.4/7, formally

prove that this is indeed a valid density function.

However, as in Problem 4.4/8,

formally prove – using the appropriate “expected value” definition – that the mean µ in fact does not exist!

Informally, there are too many outliers in both tails to allow convergence to a single mean value µ. To obtain a better appreciation of this subtle point, we once again rely on computer simulation.

# Generate and display one random sample.
sample <- rcauchy(400)
sort(sample)

Upon inspection, it should be apparent that there is much variation among these x-values.

# Compare density histogram of sample against population distribution X ~ Cauchy.
hist(sample, freq = F)
curve(dcauchy(x), lwd = 2, col = "darkgreen", add = T)

# Calculate and display sample mean.
mean(sample)

This sample mean x̄ is not necessarily close to an expected value µ in the population, nor are the means {x̄₁, x̄₂, x̄₃, ..., x̄₅₀₀} of even 500 random samples:

# Generate and display 500 random sample means.
xbars <- NULL
for (s in 1:500) {
  sample <- rcauchy(400)
  xbars <- c(xbars, mean(sample))
}
sort(xbars)

Upon inspection, it should be apparent that there is still much variation among these x̄-values.

# Compare density histogram of sample means against sampling distribution Xbar ~ Cauchy.
hist(xbars, freq = F)
curve(dcauchy(x), lwd = 2, col = "darkgreen", add = T)

# Calculate and display mean of the sample means.
mean(xbars)

Indeed, it can be shown that X̄ follows a Cauchy distribution as well, i.e., the Central Limit Theorem fails! Gathering more data does not yield convergence to a mean µ.

2. For which functions in Problem 4.4/9 does the Central Limit Theorem hold / fail?

3. Refer to Problem 4.4/16.

(a) Suppose that a random sample of n = 36 males is to be selected from this population,

and the sample mean cholesterol level calculated. As in part (f), what is the probability that this sample mean value is between 202 and 238?

(b) How large a sample size n is necessary to guarantee that 80% of the sample mean values are within 5 mg/dL of the mean of their distribution? (Hint: First find the value of z that satisfies P(−z ≤ Z ≤ +z) = 0.8, then change back to X, and solve for n.)

4. Suppose that each of the four experiments in Problem 4.4/31 is to be performed n = 9

times, and the nine resulting distances averaged. Estimate the probability (2 4)P X≤ ≤ for each of (a), (b), (c), and (d). [Note: Use the Central Limit Theorem, and the fact that, for (c), the mean and variance are µ = 3 and σ 2 = 1.5, respectively.]

5. Bob suddenly remembers that today is Valentine’s Day, and rushes into a nearby florist to

buy his girlfriend some (overpriced) flowers. There he finds a large urn containing a population of differently colored roses in roughly equal numbers, but with different prices: yellow roses cost $1 each, pink roses cost $2 each, and red roses cost $6 each. As he is in a hurry, he simply selects a dozen roses at random, and brings them up to the counter.

(a) Lowest and highest costs = ? How much money can Bob expect to pay, on average?

(b) What is the approximate probability that he will have to pay no more than $45?

Assume the Central Limit Theorem holds.

(c) Simulate this in R: From many random samples (each with a dozen values) selected from a population of the prices listed above, calculate the proportion whose totals are no more than $45. How does this compare with your answer in (b)?
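For part (c), one possible way to set up the simulation in R is sketched below (a minimal sketch, assuming the three colors are equally likely, as stated; the number of replications is an arbitrary choice):

# Prices: yellow $1, pink $2, red $6, in roughly equal numbers
totals <- replicate(10000, sum(sample(c(1, 2, 6), 12, replace = TRUE)))
mean(totals <= 45)   # estimated P(a dozen roses costs no more than $45)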

6. A geologist manages a large museum collection of minerals, whose mass (in grams) is

known to be normally distributed, with some mean µ and standard deviation σ. She knows that 60% of the minerals have mass less than a certain amount m, and needs to select a random sample of n = 16 specimens for an experiment. With what probability will their average mass be less than the same amount m? (Hint: Calculate the z-scores.)

7. Refer to Prob 4.4/5. Here, we sketch how formally applying the Central Limit Theorem to a binary variable yields the “normal approximation to the binomial distribution” (section 4.2). First, define the binary variable

Y = 1, with probability π
  = 0, with probability 1 − π

and the discrete variable X = “#(Y = 1) in a random sample of n Bernoulli trials” ~ Bin(n, π).

(a) Using the results of Problem 4.4/5 for µ_Y and σ²_Y, apply the Central Limit Theorem to the variable Y.

(b) Why is it true that Y̅ = X/n? [Hint: Why is “#(Y = 1)” the same as “Σ Y ”?]

Use this fact along with (a) to conclude that, indeed, X ≈ N(µ_X, σ_X).*

* Recall what the mean µ_X and standard deviation σ_X of the Binomial distribution are.


8. Imagine performing the following experiment in principle. We are conducting a socioeconomic survey of an arbitrarily large population of households, each of which owns a certain number of cars X = 0, 1, 2, or 3, as illustrated below. For simplicity, let us assume that the proportions of these four types of household are all equal (although this restriction can be relaxed).

Select n = 1 household at random from this population, and record its corresponding value X = 0, 1, 2, or 3. By the “equal likelihood” assumption above, each of these four elementary outcomes has the same probability of being selected (1/4), therefore the resulting uniform distribution of population values is given by:

x      f(x) = P(X = x)
0      1/4
1      1/4
2      1/4
3      1/4
       1

From this, construct the probability histogram of the “population distribution of X” values, on the graph paper below. Remember that the total area of a probability histogram = 1! Next, draw an ordered random sample of n = 2 households, and compute the mean number of cars X . (For example, if the first household has 2 cars, and the second household has 3 cars, then the mean for this sample is 2.5 cars.) There are 42 = 16 possible samples of size n = 2; they are listed below. For each such sample, calculate and record its corresponding mean X ; the first two have been done for you. As above, construct the corresponding probability table and probability histogram of these “sampling distribution of X ” values, on the graph paper below. Remember that the total area of a probability histogram = 1; this fact must be reflected in your graph! Repeat this process for the 43 = 64 samples of size n = 3, and answer the following questions.



How do the X̄ values behave?

Compare X̄ versus X.

Put it together.

1. Comparing these three distributions, what can generally be observed about their

overall shapes, as the sample size n increases?

2. Using the expected value formula µ = Σ_all x  x f(x), calculate the mean µ_X of the population distribution of X. Similarly, calculate the mean µ_X̄ of the sampling distribution of X̄, for n = 2. Similarly, calculate the mean µ_X̄ of the sampling distribution of X̄, for n = 3. Conclusions?

3. Using the expected value formula σ² = Σ_all x  (x − µ)² f(x), calculate the variance σ²_X of the population distribution of X. Similarly, calculate the variance σ²_X̄ of the sampling distribution of X̄, for n = 2. Similarly, calculate the variance σ²_X̄ of the sampling distribution of X̄, for n = 3. Conclusions?

4. Suppose now that we have some arbitrarily large study population, and a general random variable X having an approximately symmetric distribution, with some mean µ_X and standard deviation σ_X. As you did above, imagine selecting all random samples of a moderately large, fixed size n from this population, and calculate all of their sample means X̄. Based partly on your observations in questions 1-3, answer the following.

(a) In general, how would the means X̄ of most “typical” random samples be expected to behave, even if some of them do contain a few outliers, especially if the size n of the samples is large? Why? Explain briefly and clearly.

(b) In general, how then would these two large collections – the set of all sample mean values X̄, and the set of all the original population values X – compare with each other, especially if the size n of the samples is large? Why? Explain briefly and clearly.

(c) What effect would this have on the overall shape, mean µ_X̄, and standard deviation σ_X̄, of the sampling distribution of X̄, as compared with the shape, mean µ_X and standard deviation σ_X, of the population distribution of X? Why? Explain briefly and clearly.


SAMPLES, n = 2
      Draw: 1st  2nd    Means X̄
 1.         0    0      0.0
 2.         0    1      0.5
 3.         0    2
 4.         0    3
 5.         1    0
 6.         1    1
 7.         1    2
 8.         1    3
 9.         2    0
10.         2    1
11.         2    2
12.         2    3
13.         3    0
14.         3    1
15.         3    2
16.         3    3

SAMPLES, n = 3 Means X

Draw: 1st 2nd 3rd

1. 0 0 0 0.00 2. 0 0 1 0.33 3. 0 0 2 4. 0 0 3 5. 0 1 0 6. 0 1 1 7. 0 1 2 8. 0 1 3 9. 0 2 0 10. 0 2 1 11. 0 2 2 12. 0 2 3 13. 0 3 0 14. 0 3 1 15. 0 3 2 16. 0 3 3 17. 1 0 0 18. 1 0 1 19. 1 0 2 20. 1 0 3 21. 1 1 0 22. 1 1 1 23. 1 1 2 24. 1 1 3 25. 1 2 0 26. 1 2 1 27. 1 2 2 28. 1 2 3 29. 1 3 0 30. 1 3 1 31. 1 3 2 32. 1 3 3 33. 2 0 0 34. 2 0 1 35. 2 0 2 36. 2 0 3 37. 2 1 0 38. 2 1 1 39. 2 1 2 40. 2 1 3 41. 2 2 0 42. 2 2 1 43. 2 2 2 44. 2 2 3 45. 2 3 0 46. 2 3 1 47. 2 3 2 48. 2 3 3 49. 3 0 0 50. 3 0 1 51. 3 0 2 52. 3 0 3 53. 3 1 0 54. 3 1 1 55. 3 1 2 56. 3 1 3 57. 3 2 0 58. 3 2 1 59. 3 2 2 60. 3 2 3 61. 3 3 0 62. 3 3 1 63. 3 3 2 64. 3 3 3


[Graph paper: Probability Histogram of Population Distribution of X. Vertical axis marked 1/4, 2/4, 3/4; horizontal axis marked 0, 1, 2, 3.]


[Graph paper: Probability Histogram of Sampling Distribution of X̄, n = 2. Vertical axis marked 2/16, 4/16, 6/16, 8/16, 10/16, 12/16; horizontal axis marked 0, 0.5, 1, 1.5, 2, 2.5, 3.]


[Graph paper: Probability Histogram of Sampling Distribution of X̄, n = 3. Vertical axis marked 8/64, 16/64, 24/64, 32/64, 40/64, 48/64; horizontal axis marked 0, 0.33, 0.67, 1, 1.33, 1.67, 2, 2.33, 2.67, 3.]


6. Statistical Inference and Hypothesis Testing

6.1 One Sample

6.1.1 Mean

6.1.2 Variance

6.1.3 Proportion 6.2 Two Samples

6.2.1 Means

6.2.2 Variances

6.2.3 Proportions 6.3 Several Samples

6.3.1 Proportions

6.3.2 Variances

6.3.3 Means

6.4 Problems


6. Statistical Inference and Hypothesis Testing

6.1 One Sample

§ 6.1.1 Mean

STUDY POPULATION = Cancer patients on new drug treatment

Random Variable: X = “Survival time” (months)

Assume X ≈ N(µ, σ), with unknown mean µ, but known (?) σ = 6 months.

[Figure: Population Distribution of X, with σ = 6 and unknown mean µ. What can be said about the mean µ of this study population? From a RANDOM SAMPLE of size n = 64, {x1, x2, x3, ..., x64}, we compute the sample mean x̄, a “point estimate” of µ. The Sampling Distribution of X̄ is centered at µ, with standard error σ/√n = 6/√64 = 0.75.]


[Figure: N(0, 1) density, with area 0.95 between −z.025 = −1.960 and +1.960 = z.025, and area 0.025 in each tail.]

Objective 1: Parameter Estimation ~ Calculate an interval estimate of µ, centered at the point estimate x̄, that contains µ with a high probability, say 95%. (Hence, 1 − α = 0.95, so that α = 0.05.)

That is, for any random sample, solve for d:

P(X̄ − d ≤ µ ≤ X̄ + d) = 0.95,   i.e., via some algebra,   P(µ − d ≤ X̄ ≤ µ + d) = 0.95.

But recall that Z = (X̄ − µ) / (σ/√n) ~ N(0, 1). Therefore,

P( −d/(σ/√n) ≤ Z ≤ +d/(σ/√n) ) = 0.95.

Hence +d/(σ/√n) = z.025   ⇒   d = z.025 × σ/√n = (1.96)(0.75 months) = 1.47 months, the 95% margin of error. For future reference, call this equation (⋆).

[Figure: the sampling distribution of X̄, centered at µ with σ/√n = 0.75 mos; a sample mean x̄ and the interval from x̄ − d to x̄ + d.]


[Figure: X̄ ~ N(µ, 0.75). For each random sample, the interval (x̄ − 1.47, x̄ + 1.47) either does or does not capture µ; e.g., (24.53, 27.47) for x̄ = 26, and (25.53, 28.47) for x̄ = 27.]

95% Confidence Interval for µ:

( x̄ − z.025 σ/√n ,  x̄ + z.025 σ/√n )     ← 95% Confidence Limits

where the critical value z.025 = 1.96. Therefore, the margin of error (and thus, the size of the confidence interval) remains the same, from sample to sample. Example:

Sample    Mean x̄       95% CI
1         26.0 mos      (26 − 1.47, 26 + 1.47) = (24.53, 27.47)
2         27.0 mos      (27 − 1.47, 27 + 1.47) = (25.53, 28.47)
⋮         ⋮             ⋮
etc.

Interpretation: Based on Sample 1, the true mean µ of the “new treatment” population is between 24.53 and 27.47 months, with 95% “confidence.” Based on Sample 2, the true mean µ is between 25.53 and 28.47 months, with 95% “confidence,” etc. The ratio

(# CI’s that contain µ) / (Total # CI’s)  →  0.95,

as more and more samples are chosen, i.e., “The probability that a random CI contains the population mean µ is equal to 0.95.” In practice however, the common (but technically incorrect) interpretation is that “the probability that a fixed CI (such as the ones found above) contains µ is 95%.” In reality, the parameter µ is constant; once calculated, a single fixed confidence interval either contains it or not.
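These confidence limits are easy to reproduce in R; a minimal sketch, using the known σ = 6 and n = 64 of this example:

sigma <- 6; n <- 64
se <- sigma / sqrt(n)       # standard error = 0.75
d <- qnorm(0.975) * se      # 95% margin of error = 1.47
xbar <- 26                  # Sample 1 (use 27 for Sample 2)
c(xbar - d, xbar + d)       # 95% CI: (24.53, 27.47)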


[Figure: N(0, 1) density with area 1 − α between −zα/2 and +zα/2, and area α/2 in each tail.]

For any significance level α (and hence confidence level 1 − α), we similarly define the…

(1 − α) × 100% Confidence Interval for µ:   ( x̄ − zα/2 σ/√n ,  x̄ + zα/2 σ/√n )

where zα/2 is the critical value that divides the area under the standard normal distribution N(0, 1) as shown. Recall that for α = 0.10, 0.05, 0.01 (i.e., 1 − α = 0.90, 0.95, 0.99), the corresponding critical values are z.05 = 1.645, z.025 = 1.960, and z.005 = 2.576, respectively. The quantity zα/2 σ/√n is the two-sided margin of error.

Therefore, as the significance level α decreases (i.e., as the confidence level 1 − α increases), it follows that the margin of error increases, and thus the corresponding confidence interval widens. Likewise, as the significance level α increases (i.e., as the confidence level 1 − α decreases), it follows that the margin of error decreases, and thus the corresponding confidence interval narrows.

Exercise: Why is it not realistic to ask for a 100% confidence interval (i.e., “certainty”)?

Exercise: Calculate the 90% and 99% confidence intervals for Samples 1 and 2 in the preceding example, and compare with the 95% confidence intervals.

[Figure: for the same sample mean x̄, the 90%, 95%, and 99% confidence intervals are nested, widening as the confidence level increases.]

We are now in a position to be able to conduct Statistical Inference on the population, via a formal process known as

Objective 2a: Hypothesis Testing ~ “How does this new treatment compare with a ‘control’ treatment?” In particular, how can we use a confidence interval to decide this?

STANDARD POPULATION = Cancer patients on standard drug treatment

Technical Notes: Although this is drawn as a bell curve, we don’t really care how the variable X is distributed in this population, as long as it is normally distributed in the study population of interest, an assumption we will learn how to check later, from the data. Likewise, we don’t really care about the value of the standard deviation σ of this population, only of the study population. However, in the absence of other information, it is sometimes assumed (not altogether unreasonably) that the two are at least comparable in value. And if this is indeed a standard treatment, it has presumably been around for a while and given to many patients, during which time much data has been collected, and thus very accurate parameter estimates have been calculated. Nevertheless, for the vast majority of studies, it is still relatively uncommon that this is the case; in practice, very little if any information is known about any population standard deviation σ. In lieu of this value then, σ is usually well-estimated by the sample standard deviation s with little change, if the sample is sufficiently “large,” but small samples present special problems. These issues will be dealt with later; for now, we will simply assume that the value of σ is known.

Random Variable: X = “Survival time” (months)

Suppose X is known to have mean = 25 months.

[Figure: Population Distribution of X for the standard treatment, with mean 25 and σ = 6. How does this compare with the mean µ of the study population?]


Hence, let us consider the situation where, before any sampling is done, it is actually the experimenter’s intention to see if there is a statistically significant difference between the unknown mean survival time µ of the “new treatment” population, and the known mean survival time of 25 months of the “standard treatment” population. (See page 1-1!) That is, the sample data will be used to determine whether or not to reject the formal…

Null Hypothesis H0: µ = 25 versus the Alternative Hypothesis HA: µ ≠ 25

at the α = 0.05 significance level (i.e., the 95% confidence level).

Sample 1: 95% CI does contain µ = 25. Therefore, the data support H0, and we cannot reject it at the α = .05 level. Based on this sample, the new drug does not result in a mean survival time that is significantly different from 25 months. Further study?

Sample 2: 95% CI does not contain µ = 25. Therefore, the data do not support H0, and we can reject it at the α = .05 level. Based on this sample, the new drug does result in a mean survival time that is significantly different from 25 months. A genuine treatment effect.

In general… Null Hypothesis H0: µ = µ0

versus the Alternative Hypothesis HA: µ ≠ µ0

Decision Rule: If the (1 − α) × 100% confidence interval contains the value µ0, then the difference is not statistically significant; “accept” the null hypothesis at the α level of significance. If it does not contain the value µ0, then the difference is statistically significant; reject the null hypothesis in favor of the alternative at the α significance level.

[Two-sided Alternative: either µ < 25 or µ > 25 (in general, either µ < µ0 or µ > µ0); H0 says “No significant difference exists.” Figure: under the null distribution X̄ ~ N(25, 0.75), Sample 1’s CI (24.53, 27.47) contains 25, while Sample 2’s CI (25.53, 28.47) does not.]


[Figure: Null Distribution X̄ ~ N(25, 0.75). The Acceptance Region for H0 (area 0.95) runs from 23.53 to 26.47, with Rejection Regions of area 0.025 in each tail; Sample 1 (x̄ = 26) falls inside the acceptance region, Sample 2 (x̄ = 27) falls in the rejection region.]

Objective 2b: Calculate which sample mean values will lead to rejecting or not rejecting (i.e., “accepting” or “retaining”) the null hypothesis.

From equation (⋆) above, and the calculated margin of error = 1.47, we have…

P(µ − 1.47 ≤ X̄ ≤ µ + 1.47) = 0.95.

Now, IF the null hypothesis H0: µ = 25 is indeed true, then substituting this value gives…

P(23.53 ≤ X̄ ≤ 26.47) = 0.95.

Interpretation: If the mean survival time x̄ of a random sample of n = 64 patients is between 23.53 and 26.47, then the difference from 25 is “not statistically significant” (at the α = .05 significance level), and we retain the null hypothesis. However, if x̄ is either less than 23.53, or greater than 26.47, then the difference from 25 will be “statistically significant” (at α = .05), and we reject the null hypothesis in favor of the alternative. More specifically, if the former, then the result is significantly lower than the standard treatment average (i.e., new treatment is detrimental!); if the latter, then the result is significantly higher than the standard treatment average (i.e., new treatment is beneficial).

In general… Decision Rule: If the (1 − α) × 100% acceptance region contains the value x , then the difference is not statistically significant; “accept” the null hypothesis at the α significance level. If it does not contain the value x , then the difference is statistically significant; reject the null hypothesis in favor of the alternative at the α significance level.

(1 − α) × 100% Acceptance Region for H0: µ = µ0:   ( µ0 − zα/2 σ/√n ,  µ0 + zα/2 σ/√n )


[Figure: Null Distribution X̄ ~ N(µ0, σ/√n), with Acceptance Region for H0 of area 1 − α centered at µ0, and Rejection Regions of area α/2 in each tail.]

Error Rates Associated with Accepting / Rejecting a Null Hypothesis

(vis-à-vis Neyman-Pearson)

If µ = µ0:   P(Accept H0 | H0 true) = 1 − α   – Confidence Level –
             P(Reject H0 | H0 true) = α       – Significance Level – (Type I Error)

Likewise, if µ = µ1:   P(Accept H0 | H0 false) = β              (Type II Error)
                       P(Reject H0 | HA: µ = µ1) = 1 − β        – Power –

[Figure: Null Distribution X̄ ~ N(µ0, σ/√n) under H0: µ = µ0, and Alternative Distribution X̄ ~ N(µ1, σ/√n) under HA: µ = µ1, with the areas β and 1 – β indicated.]


Objective 2c: “How probable is my experimental result, if the null hypothesis is true?” Consider a sample mean value x̄. Again assuming that the null hypothesis H0: µ = µ0 is indeed true, calculate the p-value of the sample = the probability that any random sample mean is this far away or farther, in the direction of the alternative hypothesis. That is, how significant is the decision about H0, at level α?

Sample 1 (x̄ = 26):
p-value = P(X̄ ≤ 24 or X̄ ≥ 26) = P(X̄ ≤ 24) + P(X̄ ≥ 26) = 2 × P(X̄ ≥ 26)
        = 2 × P(Z ≥ (26 − 25)/0.75) = 2 × P(Z ≥ 1.333) = 2 × 0.0912 = 0.1824 > 0.05 = α

Sample 2 (x̄ = 27):
p-value = P(X̄ ≤ 23 or X̄ ≥ 27) = P(X̄ ≤ 23) + P(X̄ ≥ 27) = 2 × P(X̄ ≥ 27)
        = 2 × P(Z ≥ (27 − 25)/0.75) = 2 × P(Z ≥ 2.667) = 2 × 0.0038 = 0.0076 < 0.05 = α

Decision Rule: If the p-value of the sample is greater than the significance level α, then the difference is not statistically significant; “accept” the null hypothesis at this level. If the p-value is less than α, then the difference is statistically significant

; reject the null hypothesis in favor of the alternative at this level.

Guide to statistical significance of p-values for α = .05:

Reject H0:    0 ≤ p ≤ .001    extremely strong
              p ≈ .005        strong
              p ≈ .01         moderate
              p ≈ .05         borderline
Accept H0:    .10 ≤ p ≤ 1     not significant

Recall that Z = 1.96 is the α = .05 cutoff z-score!

[Figures: the null distribution X̄ ~ N(25, 0.75) with acceptance region (23.53, 26.47) of area 0.95 and tails of area 0.025. For Sample 1, the areas beyond 24 and 26 are 0.0912 each; for Sample 2, the areas beyond 23 and 27 are 0.0038 each.]

Test Statistic:   Z = (X̄ − µ0) / (σ/√n) ~ N(0, 1)
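In R, the same two-sided p-values follow directly from this test statistic; a minimal sketch for Samples 1 and 2 (µ0 = 25, σ = 6, n = 64):

se <- 6 / sqrt(64)     # standard error = 0.75
z1 <- (26 - 25) / se   # Sample 1: z = 1.333
z2 <- (27 - 25) / se   # Sample 2: z = 2.667
2 * (1 - pnorm(z1))    # two-sided p-value = 0.1824
2 * (1 - pnorm(z2))    # two-sided p-value = 0.0076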


Summary of findings: Even though the data from both samples suggest a generally longer “mean survival time” among the “new treatment” population over the “standard treatment” population, the formal conclusions and interpretations are different. Based on Sample 1 patients (x̄ = 26), the difference between the mean survival time µ of the study population, and the mean survival time of 25 months of the standard population, is not statistically significant, and may in fact simply be due to random chance. Based on Sample 2 patients (x̄ = 27) however, the difference between the mean survival time µ of the study population, and the mean survival time of 25 months of the standard population, is indeed statistically significant, on the longer side. Here, the increased survival times serve as empirical evidence of a genuine, beneficial “treatment effect” of the new drug.

Comment: For the sake of argument, suppose that a third sample of patients is selected, and to the experimenter’s surprise, the sample mean survival time is calculated to be only x = 23 months. Note that the p-value of this sample is the same as Sample 2, with x = 27 months, namely, 0.0076 < 0.05 = α. Therefore, as far as inference is concerned, the formal conclusion is the same, namely, reject H0: µ = 25 months. However, the practical interpretation is very different! While we do have statistical significance as before, these patients survived considerably shorter than the standard average, i.e., the treatment had an unexpected effect of decreasing survival times, rather than increasing them. (This kind of unanticipated result is more common than you might think, especially with investigational drugs, which is one reason for formal hypothesis testing, before drawing a conclusion.)


α = .05

If p-value < α, then reject H0; significance!

... But interpret it correctly!

higher α ⇒ easier to reject, less conservative

lower α ⇒ harder to reject, more conservative


Modification: Consider now the (unlikely?) situation where the experimenter knows that the new drug will not result in a “mean survival time” µ that is significantly less than 25 months, and would specifically like to determine if there is a statistically significant increase. That is, he/she formulates the following one-sided null hypothesis to be rejected, and complementary alternative:

Null Hypothesis H0: µ ≤ 25 versus the Alternative Hypothesis HA: µ > 25

at the α = 0.05 significance level (i.e., the 95% confidence level).

In this case, the acceptance region for H0 consists of sample mean values x̄ that are less than the null-value of µ0 = 25, plus the one-sided margin of error = zα σ/√n = z.05 × 6/√64 = (1.645)(0.75) = 1.234, hence x̄ < 26.234. Note that α replaces α/2 here!

Sample 1 (x̄ = 26): p-value = P(X̄ ≥ 26) = P(Z ≥ 1.333) = 0.0912 > 0.05 = α   (accept)

Sample 2 (x̄ = 27): p-value = P(X̄ ≥ 27) = P(Z ≥ 2.667) = 0.0038 < 0.05 = α   (fairly strong rejection)

Note that these one-sided p-values are exactly half of their corresponding two-sided p-values found above, potentially making the null hypothesis easier to reject. However, there are subtleties that arise in one-sided tests that do not arise in two-sided tests…

[Figures: Right-tailed Alternative. Under the null distribution centered at µ = 25, the one-sided rejection region is x̄ ≥ 26.234 (area 0.05); Sample 1 (x̄ = 26) has right-tail area 0.0912, Sample 2 (x̄ = 27) has right-tail area 0.0038.]

Here, Z = 1.645 is the α = .05 cutoff z-score! Why?


Consider again the third sample of patients, whose sample mean is unexpectedly calculated to be only x = 23 months. Unlike the previous two samples, this evidence is in strong agreement with the null hypothesis H0: µ ≤ 25 that the “mean survival time” is 25 months or less. This is confirmed by the p-value of the sample, whose definition (recall above) is “the probability that any random sample mean is this far away or farther, in the direction of the alternative hypothesis” which, in this case, is the right-sided HA: µ > 25. Hence,

p-value = P( X ≥ 23) = P(Z ≥ –2.667) = 1 – 0.0038 = 0.9962 >> 0.05 = α

which, as just observed informally, indicates a strong “acceptance” of the null hypothesis.

Exercise: What is the one-sided p-value if the sample mean x = 24 mos? Conclusions?

A word of caution: One-sided tests are less conservative than two-sided tests, and should be used sparingly, especially when it is a priori unknown if the mean response µ is likely to be significantly larger or smaller than the null-value µ0, e.g., testing the effect of a new drug. More appropriate to use when it can be clearly assumed from the circumstances that the conclusion would only be of practical significance if µ is either higher or lower (but not both) than some tolerance or threshold level µ0, e.g., toxicity testing, where only higher levels are of concern.


SUMMARY: To test any null hypothesis for one mean µ, via the p-value of a sample...

• Step I: Draw a picture of a bell curve, centered at the “null value” µ0.

• Step II: Calculate your sample mean x , and plot it on the horizontal X axis.

• Step III: From x̄, find the area(s) in the direction(s) of HA (<, >, or both tails), by first transforming x̄ to a z-score, and using the z-table. This is your p-value. SEE NEXT PAGE!

• Step IV: Compare p with the significance level α. If <, reject H0. If >, retain H0.

• Step V: Interpret your conclusion in the context of the given situation!


[Figures: N(0, 1) density with the shaded area(s) corresponding to the p-value: one left tail, one right tail, or two tails of area p/2 each.]

P-VALUES MADE EASY

Def: Suppose a null hypothesis H0 about a population mean µ is to be tested, at a significance level α (= .05, usually), using a known sample mean x̄ from an experiment. The p-value of the sample is the probability that a general random sample yields a mean X̄ that differs from the hypothesized “null value” µ0, by an amount which is as large as – or larger than – the difference between our known x̄ value and µ0. Thus, a small p-value (i.e., < α) indicates that our sample provides evidence against the null hypothesis, and we may reject it; the smaller the p-value, the stronger the rejection, and the more “statistically significant” the finding. A p-value > α indicates that our sample does not provide evidence against the null hypothesis, and so we may not reject it. Moreover, a large p-value (i.e., ≈ 1) indicates empirical evidence in support of the null hypothesis, and we may retain, or even “accept” it. Follow these simple steps:

STEP 1. From your sample mean x̄, calculate the standardized z-score = (x̄ − µ0) / (σ/√n).

STEP 2. What form is your alternative hypothesis?

HA: µ < µ0 (1-sided, left)......... p-value = tabulated entry corresponding to z-score = left shaded area, whether z < 0 or z > 0 (illustrated)

HA: µ > µ0 (1-sided, right)...... p-value = 1 – tabulated entry corresponding to z-score = right shaded area, whether z < 0 or z > 0 (illustrated)

HA: µ ≠ µ0 (2-sided)
  If z-score is negative....... p-value = 2 × tabulated entry corresponding to z-score = 2 × left-tailed shaded area
  If z-score is positive........ p-value = 2 × (1 – tabulated entry corresponding to z-score) = 2 × right-tailed shaded area

STEP 3.

If the p-value is less than α (= .05, usually), then REJECT NULL HYPOTHESIS – EXPERIMENTAL RESULT IS STATISTICALLY SIGNIFICANT AT THIS LEVEL!

If the p-value is greater than α (= .05, usually), then RETAIN NULL HYPOTHESIS – EXPERIMENTAL RESULT IS NOT

STATISTICALLY SIGNIFICANT AT THIS LEVEL!

STEP 4. IMPORTANT - Interpret results in context. (Note: For many, this is the hardest step of all!)

Example: Toxic levels of arsenic in drinking water? Test H0: µ < 10 ppb (safe) vs. HA: µ ≥ 10 ppb (unsafe), at α = .05. Assume N(µ, σ), with σ = 1.6 ppb. A sample of n = 64 readings that average to x̄ = 10.1 ppb would have a z-score = 0.1 / 0.2 = 0.5 (the “standard error” is σ/√n = 1.6/√64 = 0.2 ppb), which corresponds to a p-value = 1 – 0.69146 = 0.30854 > .05, hence not significant; toxicity has not been formally shown. (Unsafe levels are x̄ ≥ 10.33 ppb. Why?)
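A quick check of this example in R (right-tailed alternative):

z <- (10.1 - 10) / (1.6 / sqrt(64))   # z-score = 0.5
1 - pnorm(z)                          # one-sided p-value = 0.3085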


P-VALUES MADE SUPER EASY

STATBOT 301, MODEL Z
Subject: basic calculation of p-values for Z-TEST

CALCULATE the Test Statistic “z-score” = (x̄ − µ0) / (σ/√n) from H0, then check the direction of the alternative hypothesis!

  HA: μ < μ0 .................... p-value = table entry
  HA: μ > μ0 .................... p-value = 1 – table entry
  HA: μ ≠ μ0, z-score − ......... p-value = 2 × (table entry)
  HA: μ ≠ μ0, z-score + ......... p-value = 2 × (1 – table entry)

Remember that the Z-table corresponds to the “cumulative” area to the left of any z-score.


Power and Sample Size Calculations

Recall: X = survival time (mos) ~ N(μ, σ), with σ = 6 (given). Testing null hypothesis H0: μ = 25 (versus the 2-sided alternative HA: μ ≠ 25), at the α = .05 significance level. Also recall that, by definition, power = 1 – β = P(Reject H0 | H0 is false, i.e., μ ≠ 25). Indeed, suppose that the mean survival time of “new treatment” patients is actually suspected to be HA: μ = 28 mos. In this case, what is the resulting power to distinguish the difference and reject H0, using a sample of n = 64 patients (as in the previous examples)?

[Figure: the null distribution X̄ ~ N(25, 0.75), with acceptance region (23.53, 26.47) for H0: μ = 25 (margin of error ±1.47) and rejection regions of area .025 in each tail, alongside the alternative distribution X̄ ~ N(28, 0.75); the area β lies to the left of 28 − 0.75 z_β, and 1 − β to its right.]

These diagrams compare the null distribution for μ = 25, with the alternative distribution corresponding to μ = 28 in the rejection region of the null hypothesis. By definition, β = P(Accept H0 | HA: μ = 28), and its complement – the power to distinguish these two distributions from one another – is 1 – β = P(Reject H0 | HA: μ = 28), as shown by the gold-shaded areas in the figure. However, the “left-tail” component of this area is negligible, leaving the remaining “right-tail” area equal to 1 – β by itself, approximately. Hence, this corresponds to the critical value −z_β in the standard normal Z-distribution, which transforms back to 28 − 0.75 z_β in the X̄-distribution. Comparing this boundary value in both diagrams, we see that

(♦)   28 − 0.75 z_β = 26.47,

and solving yields –z_β = –2.04. Thus, β = 0.0207, and the associated power = 1 – β = 0.9793, or 98%. Hence, we would expect to be able to detect significance 98% of the time, using 64 patients.


General Formulation:

Procurement of drug samples for testing purposes, or patient recruitment for clinical trials, can be extremely time-consuming and expensive. How to determine the minimum sample size n required to reject the null hypothesis H0: µ = µ0, in favor of an alternative value HA: µ = µ1, with a desired power 1 − β , at a specified significance level α ? (And conversely, how to determine the power 1 − β for a given sample size n, as above?)

                  H0 true                                  H0 false
Reject H0         Type I error, probability = α            probability = 1 − β
                  (significance level)                     (power)
Accept H0         probability = 1 − α                      Type II error, probability = β
                  (confidence level)                       (1 − power)

That is, confidence level = 1 − α = P(Accept H0: µ = µ0 | H0 is true),

and power = 1 − β = P(Reject H0: µ = µ0 | HA: µ = µ1).

[Figure: Null Distribution X̄ ~ N(µ0, σ/√n) and Alternative Distribution X̄ ~ N(µ1, σ/√n). The rejection cutoff µ0 + zα/2 σ/√n of the null distribution coincides with the point µ1 − zβ σ/√n of the alternative distribution; the areas α/2, 1 − α, β, and 1 − β are indicated.]


[Figure: two pairs of population distributions with means µ0 and µ1. It is easier to distinguish these two distributions from each other when µ0 and µ1 are far apart than when they are close together, based solely on sample data.]

Hence (compare with (♦) above),   µ1 − zβ σ/√n = µ0 + zα/2 σ/√n.   Solving for n yields the following.

In order to be able to detect a statistically significant difference (at level α) between the null population distribution having mean µ0, and an alternative population distribution having mean µ1, with a power of 1 − β, we require a minimum sample size of

n = [ (zα/2 + zβ) / ∆ ]²,   where ∆ = |µ1 − µ0| / σ is the “scaled difference” between µ0 and µ1.

Note: Remember that, as we defined it, zβ is always ≥ 0, and has β area to its right.

Comments: This formula corresponds to a two-sided hypothesis test. For a one-sided test, simply replace α/2 by α. Recall that if α = .05, then z.025 = 1.960 and z.05 = 1.645. If σ is not known, then it can be replaced above by s, the sample standard deviation, provided the resulting sample size turns out to be n ≥ 30, to be consistent with CLT. However, if the result is n < 30, then add 2 to compensate. [Modified from: Lachin, J. M. (1981), Introduction to sample size determination and power analysis for clinical trials. Controlled Clinical Trials, 2(2), 93-113.]

What affects sample size, and how? With all other values being equal…

As power 1 − β increases, n increases; as 1 − β decreases, n decreases.

As the difference ∆ decreases, n increases; as ∆ increases, n decreases.

Exercise: Also show that n increases...
  as σ increases. [Hint: It may be useful to draw a picture, similar to above.]
  as α decreases. [Hint: It may be useful to recall that α is the Type I Error rate, or equivalently, that 1 – α is the confidence level.]
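A minimal R sketch of this sample-size formula (the function name n.required is ours, and the “add 2” small-sample adjustment described in the Comments is not included):

# Minimum n for a two-sided test at level alpha, scaled difference delta, desired power
n.required <- function(delta, power, alpha = 0.05) {
  ceiling(((qnorm(1 - alpha/2) + qnorm(power)) / delta)^2)
}
n.required(0.5, 0.90)   # 43
n.required(0.5, 0.95)   # 52  (cf. the worked examples that follow)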


Examples: Recall that in our study, µ0= 25 months, σ = 6 months.

Suppose we wish to detect a statistically significant difference (at level α = .05 ⇒ z.025 = 1.960) between this null distribution, and an alternative distribution having…

µ1 = 28 months, with 90% power (1 − β = .90 ⇒ β = .10 ⇒ z.10 = 1.282). Then the scaled difference ∆ = |28 − 25| / 6 = 0.5, and

n = [ (1.960 + 1.282) / 0.5 ]² = 42.04,   so n ≥ 43 patients.

µ1 = 28 months, with 95% power (1 − β = .95 ⇒ β = .05 ⇒ z.05 = 1.645). Then,

n = [ (1.960 + 1.645) / 0.5 ]² = 51.98,   so n ≥ 52 patients.

µ1 = 27 months, with 95% power (so again, z.05 = 1.645). Then ∆ = |27 − 25| / 6 = 0.333, and

n = [ (1.960 + 1.645) / 0.333 ]² = 116.96,   so n ≥ 117 patients.

Table of Sample Sizes* for Two-Sided Tests (α = .05)

  ∆        Power:  80%    85%    90%    95%    99%
 0.1               785    898   1051   1300   1838
 0.125             503    575    673    832   1176
 0.15              349    400    467    578    817
 0.175             257    294    344    425    600
 0.2               197    225    263    325    460
 0.25              126    144    169    208    294
 0.3                88    100    117    145    205
 0.35               65     74     86    107    150
 0.4                50     57     66     82    115
 0.45               39     45     52     65     91
 0.5                32     36     43     52     74
 0.6                24     27     30     37     52
 0.7                19     21     24     29     38
 0.8                15     17     19     23     31
 0.9                12     14     15     19     25
 1.0                10     11     13     15     21

* Shaded cells indicate that 2 was added to compensate for small n.


Power Curves – A visual way to relate power and sample size.

[Figure: power 1 − β plotted against the scaled difference ∆ = |µ1 − µ0| / σ, with one curve for each sample size n = 10, 20, 30, 100; the curves rise toward 1 as ∆ or n increases, and all start at height α/2 = .025 when ∆ = 0.]

Question: Why is power not equal to 0 if Δ = 0?


Comments:

Due to time and/or budget constraints for example, a study may end before optimal sample size is reached. Given the current value of n, the corresponding power can then be determined by the graph above, or computed exactly via the following formula:

Power = 1 − β = P(Z ≤ −zα/2 + ∆√n),   where the z-score −zα/2 + ∆√n can be +, –, or 0.

Example: As in the original study, let α = .05, ∆ = |28 − 25| / 6 = 0.5, and n = 64. Then the z-score = –1.96 + 0.5√64 = 2.04, so power = 1 − β = P(Z ≤ 2.04) = 0.9793, or 98%. The probability of committing a Type 2 error = β = 0.0207, or 2%. See page 6.1-15.

Exercise: How much power exists if the sample size is n = 25? 16? 9? 4? 1?

Generally, a minimum of 80% power is acceptable for reporting purposes.

Note: Larger sample size ⇒ longer study time ⇒ longer wait for results. In clinical trials and other medical studies, formal protocols exist for early study termination.

Also, to achieve a target sample size, practical issues must be considered (e.g., parking, meals, bed space,…). Moreover, may have to recruit many more individuals due to eventual censoring (e.g., move-aways, noncompliance,…) or death. $$$$$$$ issues…

Research proposals must have power and sample size calculations in their “methods” section, in order to receive institutional approval, support, and eventual journal publication.

[Figure: N(0, 1) density with area 0.9793 to the left of z = 2.04 and 0.0207 to the right.]
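The same power calculation in R, for the n = 64 case above (a minimal sketch; other sample sizes from the exercise can be substituted for n):

alpha <- 0.05; delta <- 0.5; n <- 64
pnorm(-qnorm(1 - alpha/2) + delta * sqrt(n))   # power = 0.9793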


Small Samples: Student’s t-distribution

Recall that, vis-à-vis the Central Limit Theorem: X ~ N(µ, σ) ⇒ X̄ ~ N(µ, σ/√n), for any n.

Test statistic…

• σ known:            Z = (X̄ − µ) / (σ/√n) ~ N(0, 1).          [Recall: s.e. = σ/√n]
• σ unknown, n ≥ 30:  Z = (X̄ − µ) / (s/√n) ~ N(0, 1) approximately          [ŝ.e. = s/√n]
• σ unknown, n < 30:  T = (X̄ − µ) / (s/√n) ~ t_{n−1}   ← Note: Can use for n ≥ 30 as well.

Student’s t-distribution, with ν = n − 1 degrees of freedom df = 1, 2, 3,… (Due to William S. Gossett (1876 - 1937), Guinness Brewery, Ireland, anonymously publishing under the pseudonym “Student” in 1908.)

df = 1 is also known as the Cauchy distribution.

As df → ∞, it follows that T ~ t_df → Z ~ N(0, 1).

Densities:   N(0, 1): ϕ(z) = (1/√(2π)) e^(−z²/2);    t_{n−1}: f_n(t) = Γ(n/2) / [ √((n−1)π) Γ((n−1)/2) ] · (1 + t²/(n−1))^(−n/2).

[Figure: the t_{n−1} curve is lower at the center and heavier in the tails than N(0, 1), so its critical values satisfy t_{n−1, α/2} > z_{α/2}.]
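In R, t critical values come from qt(); a minimal sketch comparing them with the corresponding normal critical value:

qnorm(0.975)           # 1.960
qt(0.975, df = 24)     # 2.064 -- heavier tails than N(0, 1)
qt(0.975, df = 1000)   # about 1.962 -- t approaches N(0, 1) as df grows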



Example: Again recall that in our study, the variable X = “survival time” was assumed to be normally distributed among cancer patients, with σ = 6 months. The null hypothesis H0: µ = 25 months was tested with a random sample of n = 64 patients; a sample mean of x = 27.0 months was shown to be statistically significant (p = .0076), i.e., sufficient evidence to reject the null hypothesis, suggesting a genuine difference, at the α = .05 level.

Now suppose that σ is unknown and, like µ, must also be estimated from sample data. Further suppose that the sample size is small, say n = 25 patients, with which to test the same null hypothesis H0: µ = 25, versus the two-sided alternative HA: µ ≠ 25, at the α = .05 significance level. Imagine that a sample mean x = 27.4 months, and a sample standard deviation s = 6.25 months, are obtained. The greater mean survival time appears promising. However…

ŝ.e. = s/√n = 6.25 mos / √25 = 1.25 months   (> s.e. = 0.75 months)

Therefore, critical value = t24, .025 = 2.064, and Margin of Error = (2.064)(1.25 mos) = 2.58 months   (> 1.47 months, previously). So…

95% Confidence Interval for µ = (27.4 − 2.58, 27.4 + 2.58) = (24.82, 29.98) months, which does contain the null value µ = 25 ⇒ Accept H0… No significance shown!

95% Acceptance Region for H0 = (25 − 2.58, 25 + 2.58) = (22.42, 27.58) months, which does contain the sample mean x = 27.4 ⇒ Accept H0… No significance shown!

p-value = 2 P(X̄ ≥ 27.4) = 2 P(T24 ≥ (27.4 − 25)/1.25) = 2 P(T24 ≥ 1.92) = 2(0.0334) = 0.0668, which is greater than α = .05 ⇒ Accept H0… No significance shown!

Why? The inability to reject is a typical consequence of small sample size, thus low power! Also see Appendix > Statistical Inference > Mean, One Sample for more info and many more examples on this material.
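The entire small-sample analysis can be reproduced in R; a minimal sketch with x̄ = 27.4, s = 6.25, n = 25:

xbar <- 27.4; s <- 6.25; n <- 25; mu0 <- 25
se <- s / sqrt(n)                         # 1.25
tcrit <- qt(0.975, df = n - 1)            # 2.064
c(xbar - tcrit * se, xbar + tcrit * se)   # 95% CI: (24.82, 29.98)
tstat <- (xbar - mu0) / se                # 1.92
2 * (1 - pt(tstat, df = n - 1))           # two-sided p-value = 0.0668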



Example: A very simplified explanation of how fMRI works

Functional Magnetic Resonance Imaging (fMRI) is one technique of visually mapping areas of the human cerebral cortex in real time. First, a three-dimensional computer-generated image of the brain is divided into cube-shaped voxels (i.e., “volume elements” – analogous to square “picture elements,” or pixels, in a two-dimensional image), about 2-4 mm on a side, each voxel containing thousands of neurons. While the patient is asked to concentrate on a specific mental task, increased cerebral blood flow releases oxygen to activated neurons at a greater rate than to inactive ones (the so-called “hemodynamic response”), and the resulting magnetic resonance signal can be detected. In one version, each voxel signal is compared with the mean of its neighboring voxels; if there is a statistically significant difference in the measurements, then the original voxel is assigned one of several colors, depending on the intensity of the signal (e.g., as determined by the p-value); see figures.

Suppose the variable X = “Cerebral Blood Flow (CBF)” typically follows a normal distribution with mean µ = 0.5 ml/g/min at baseline. Further, suppose that the n = 6 neighbors surrounding a particular voxel (i.e., front and back, left and right, top and bottom) yield a sample mean of x̄ = 0.767 ml/g/min, and sample standard deviation of s = 0.082 ml/g/min. Calculate the two-sided p-value of this sample (using baseline as the null hypothesis for simplicity), and determine what color should be assigned to the central voxel, using the scale shown.

Solution: X = “Cerebral Blood Flow (CBF)” is normally distributed, H0: µ = 0.5 ml/g/min n = 6 x = 0.767 ml/g/min s = 0.082 ml/g/min

As the population standard deviation σ is unknown, and the sample size n is small, the t-test on df = 6 – 1 = 5 degrees of freedom is appropriate.

Using standard error estimate s.e. = s / √n = 0.082 ml/g/min / √6 = 0.03348 ml/g/min yields

p-value = 2 P(X̄ ≥ 0.767) = 2 P( T5 ≥ (0.767 − 0.5) / 0.03348 ) = 2 P(T5 ≥ 7.976) = 2(.00025) = .0005

This is strongly significant at any reasonable level α. According to the scale, the voxel should be assigned the color RED.

p ≥ .05 gray

.01 ≤ p < .05 green

.005 ≤ p < .01 yellow

.001 ≤ p < .005 orange

p < .001 red
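A minimal R sketch of this calculation (variable names are our own; the color scale is encoded with cut() purely for illustration):

n <- 6; xbar <- 0.767; s <- 0.082; mu0 <- 0.5
t.stat <- (xbar - mu0) / (s / sqrt(n))                   # ≈ 7.976
p <- 2 * pt(t.stat, df = n - 1, lower.tail = FALSE)      # ≈ .0005
# assign a color according to the scale above
color <- cut(p, breaks = c(0, .001, .005, .01, .05, 1), right = FALSE,
             include.lowest = TRUE,
             labels = c("red", "orange", "yellow", "green", "gray"))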


STATBOT 301, MODEL T
Subject: basic calculation of p-values for T-TEST

CALCULATE the Test Statistic “t-score” = (x̄ − µ0) / (s / √n) from H0. Then:

                        ALTERNATIVE HYPOTHESIS
  t-score    HA: μ < μ0                       HA: μ ≠ μ0                         HA: μ > μ0
     +       1 – table entry                  2 × table entry                    table entry
     –       table entry for |t-score|        2 × table entry for |t-score|      1 – table entry for |t-score|

Remember that the T-table corresponds to the area to the right of a positive t-score.


[Figure: the 24 z-scores listed below divide the total area under N(0, 1) into 25 equal areas; each of these 25 areas represents .04 of the total.]

Checks for normality ~ Is the ongoing assumption that the sample data come from a normally-distributed population reasonable?

Quantiles: As we have already seen, ≈ 68% within ±1 s.d. of mean, ≈ 95% within ±2 s.d. of mean, ≈ 99.7% within ±3 s.d. of mean, etc. Other percentiles can also be checked informally, or more formally via...

Normal Scores Plot: The graph of the quantiles of the n ordered (low-to-high) observations, versus the n known z-scores that divide the total area under N(0, 1) equally (representing an ideal sample from the standard normal distribution), should resemble a straight line. Highly skewed data would generate a curved plot. Also known as a probability plot or Q-Q plot (for “Quantile-Quantile”), this is a popular method.

Example: Suppose n = 24 ages (years). Calculate the .04 quantiles of the sample, and plot them against the 24 known (i.e., “theoretical”) .04 quantiles of the standard normal distribution (below).

{–1.750, –1.405, –1.175, –0.994, –0.842, –0.706, –0.583, –0.468, –0.358, –0.253, –0.151, –0.050, +0.050, +0.151, +0.253, +0.358, +0.468, +0.583, +0.706, +0.842, +0.994, +1.175, +1.405, +1.750}


Sample 1:

{6, 8, 11, 12, 15, 17, 20, 20, 21, 23, 24, 24, 26, 28, 29, 30, 31, 32, 34, 37, 40, 41, 42, 45} The Q-Q plot of this sample (see first graph, below) reveals a more or less linear trend between the quantiles, which indicates that it is not unreasonable to assume that these data are derived from a population whose ages are indeed normally distributed. Sample 2:

{6, 6, 8, 8, 9, 10, 10, 10, 11, 11, 13, 16, 20, 21, 23, 28, 31, 32, 36, 38, 40, 44, 47, 50} The Q-Q plot of this sample (see second graph, below) reveals an obvious deviation from normality. Moreover, the general “concave up” nonlinearity seems to suggest that the data are positively skewed (i.e., skewed to the right), and in fact, this is the case. Applying statistical tests that rely on the normality assumption to data sets that are not so distributed could very well yield erroneous results!

Formal tests for normality include:

Anderson-Darling

Shapiro-Wilk

Lilliefors (a special case of Kolmogorov-Smirnov)
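For instance, a minimal R sketch of these checks for the two age samples above (qqnorm/qqline for the Q-Q plots, and shapiro.test as one of the formal tests listed):

sample1 <- c(6, 8, 11, 12, 15, 17, 20, 20, 21, 23, 24, 24,
             26, 28, 29, 30, 31, 32, 34, 37, 40, 41, 42, 45)
sample2 <- c(6, 6, 8, 8, 9, 10, 10, 10, 11, 11, 13, 16,
             20, 21, 23, 28, 31, 32, 36, 38, 40, 44, 47, 50)
par(mfrow = c(1, 2))
qqnorm(sample1, main = "Sample 1"); qqline(sample1)    # roughly linear
qqnorm(sample2, main = "Sample 2"); qqline(sample2)    # "concave up" suggests right-skew
shapiro.test(sample1)                                  # formal test of normality
shapiro.test(sample2)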


Remedies for non-normality ~ What can be done if the normality assumption is violated, or difficult to verify (as in a very small sample)?

Transformations: Functions such as Y = √X or Y = ln(X) can transform a positively-skewed variable X into a normally distributed variable Y. (These functions “spread out” small values, and “squeeze together” large values. In the latter case, the original variable X is said to be log-normal.)

Exercise: Sketch separately the dotplot of X, and the dotplot of Y = ln(X) (to two decimal places), and compare.

  X:          1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20
  Y = ln(X):  (compute each value to two decimal places)
  Frequency:  1  2  3  4  5  5  4  4  3   3   3   2   2   2   2   1   1   1   1   1
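A minimal R sketch of this exercise (variable names are our own), using stripchart() to draw the two dotplots:

x    <- 1:20
freq <- c(1, 2, 3, 4, 5, 5, 4, 4, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1)
data <- rep(x, freq)               # expand the frequency table into raw observations
y    <- round(log(data), 2)        # Y = ln(X), to two decimal places
par(mfrow = c(2, 1))
stripchart(data, method = "stack", main = "Dotplot of X")            # right-skewed
stripchart(y,    method = "stack", main = "Dotplot of Y = ln(X)")    # more symmetric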

Nonparametric Tests: Statistical tests (on the median, rather than the mean) that are free of any assumptions on the underlying distribution of the population random variable. Slightly less powerful than the corresponding parametric tests, tedious to carry out by hand, but their generality makes them very useful, especially for small samples where normality can be difficult to verify.

Sign Test (crude), Wilcoxon Signed Rank Test (preferred)


GENERAL SUMMARY…

Step-by-Step Hypothesis Testing
One Sample Mean, H0: µ vs. µ0

Is the random variable approximately normally distributed (or mildly skewed)?
  • No → Use a transformation, or a nonparametric test, e.g., Wilcoxon Signed Rank Test.
  • Yes → Is σ known?
      - Yes → Use Z-test (with σ):  Z = (X̄ − µ0) / (σ / √n) ~ N(0, 1)
      - No, or don't know → Is n ≥ 30?
          - Yes → Use Z-test or t-test (with σ̂ = s):  Z = (X̄ − µ0) / (s / √n) ~ N(0, 1), approximately
          - No → Use t-test (with σ̂ = s):  T = (X̄ − µ0) / (s / √n) ~ tn−1   (used most often in practice)


p-value: “How do I know in which direction to move, to find the p-value?” See STATBOT, page 6.1-14 (Z) and page 6.1-24 (T), or…

[Figure: for each alternative hypothesis, 1-sided left (HA: <), 2-sided (HA: ≠), and 1-sided right (HA: >), the direction(s) along the Z- or Tdf-score axis, starting from the value of the test statistic, in which the p-value area is accumulated.]

The p-value of an experiment is the probability (hence always between 0 and 1) of

obtaining a random sample with an outcome that is as, or more, extreme than the one actually obtained, if the null hypothesis is true.

Starting from the value of the test statistic (i.e., z-score or t-score), the p-value is computed in the direction of the alternative hypothesis (either <, >, or both), which usually reflects the investigator’s belief or suspicion, if any.

If the p-value is “small,” then the sample data provides evidence that tends to refute the null hypothesis; in particular, if the p-value is less than the significance level α, then the null hypothesis can be rejected, and the result is statistically significant at that level. However, if the p-value is greater than α, then the null hypothesis is retained; the result is not statistically significant at that level. Furthermore, if the p-value is “large” (i.e., close to 1), then the sample data actually provides evidence that tends to support the null hypothesis.



§ 6.1.2 Variance

Given: Null Hypothesis H0: σ² = σ0² (constant value)
versus Alternative Hypothesis HA: σ² ≠ σ0²

Test statistic:  Χ² = (n − 1) s² / σ0²  ~  χ²n−1

Sampling Distribution of Χ 2:

Chi-Squared Distribution, with ν = n − 1 degrees of freedom df = 1, 2, 3,…

Note that the chi-squared distribution is not symmetric, but skewed to the right. We will not pursue the details for finding an acceptance region and confidence intervals for σ 2 here. But this distribution will appear again, in the context of hypothesis testing for equal proportions.

Two-sided Alternative: Either σ² < σ0² or σ² > σ0².

[Figure: chi-squared density curves fν(x) = x^(ν/2 − 1) e^(−x/2) / (2^(ν/2) Γ(ν/2)) for ν = 1, 2, …, 7, together with a schematic of drawing a sample of size n from a population distribution ~ N(µ, σ) and calculating s².]


← Illustration of the bell curves N( π, √[π(1 − π)/n] ) for n = 100, as the proportion π ranges from 0 to 1. Note how, rather than being fixed at a constant value, the “spread” s.e. is smallest when π is close to 0 or 1 (i.e., when success in the population is either very rare or very common), and is maximum when π = 0.5 (i.e., when both success and failure are equally likely). Also see Problem 4.4/10. This property of nonconstant variance has further implications; see “Logistic Regression” in section 7.3.

[Figure: bell curves centered at π = 0.1, 0.2, …, 0.9 (between π = 0 and π = 1), with s.e. values .03, .04, .046, .049, .05, .049, .046, .04, .03, respectively.]

§ 6.1.3 Proportion

Problem! The expression for the standard error involves the very parameter π upon which we are performing statistical inference. (This did not happen with inference on the mean µ, where the standard error is s.e. = σ / √n, which does not depend on µ.)

Binary random variable
Y = 1, Success, with probability π
Y = 0, Failure, with probability 1 − π

POPULATION

Experiment: n independent trials

SAMPLE Random Variable: X = # Successes ~ Bin(n, π)

Recall: Assuming n ≥ 30, nπ ≥ 15, and n(1 − π) ≥ 15,

X ~ N( nπ, √[nπ(1 − π)] ), approximately.  (see §4.2)

Therefore, dividing by n…

π̂ = X/n ~ N( π, √[π(1 − π)/n] ), approximately.
                        ↑ standard error s.e.


Example: Refer back to the coin toss example of section 1.1, where a random sample of n = 100 independent trials is performed in order to acquire information about the probability P(Heads) = π. Suppose that X = 64 Heads are obtained. Then the sample-based point estimate of π is calculated as π̂ = X/n = 64/100 = 0.64. To improve this to an interval estimate, we can compute the…

95% Confidence Interval for π

95% limits = 0.64 ± z.025 √[(0.64)(0.36)/100] = 0.64 ± 1.96(.048)

∴ 95% CI = (0.546, 0.734) contains the true value of π, with 95% confidence. As the 95% CI does not contain the null-value π = 0.5, H0 can be rejected at the α = .05 level, i.e., the coin is not fair.

95% Acceptance Region for H0: π = 0.50

95% limits = 0.50 ± z.025 √[(0.50)(0.50)/100] = 0.50 ± 1.96(.050)

∴ 95% AR = (0.402, 0.598)

As the 95% AR does not contain the sample proportion π̂ = 0.64, H0 can be rejected at the α = .05 level, i.e., the coin is not fair.

Is the coin fair at the α = .05 level?

Null Hypothesis H0: π = 0.5

vs. Alternative Hypothesis HA: π ≠ 0.5

(1 − α) × 100% Acceptance Region for H0: π = π0

( π0 − zα/2 √[π0(1 − π0)/n] ,  π0 + zα/2 √[π0(1 − π0)/n] )

(1 − α) × 100% Confidence Interval for π

( π̂ − zα/2 √[π̂(1 − π̂)/n] ,  π̂ + zα/2 √[π̂(1 − π̂)/n] )

s.e.0 = .050

s.e. = .048


[Figure: null distribution of π̂, centered at π = 0.5, with central area 0.95, 95% acceptance region (0.402, 0.598), observed sample proportion 0.64, 95% confidence interval (0.546, 0.734), and two-tailed areas 0.0026 each.]

0.5 is not in the 95% Confidence Interval = (0.546, 0.734).
0.64 is not in the 95% Acceptance Region = (0.402, 0.598).

p-value = 2 P(π̂ ≥ 0.64) = 2 P( Z ≥ (0.64 − 0.50) / .050 ) = 2 P(Z ≥ 2.8) = 2(.0026) = .0052

As p << α = .05, H0 can be strongly rejected at this level, i.e., the coin is not fair.
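A minimal check in R, assuming the built-in prop.test() (which reports the equivalent chi-squared statistic Z² rather than Z, and a Wilson-type confidence interval rather than the Wald interval above):

prop.test(x = 64, n = 100, p = 0.5, correct = FALSE)
# X-squared = 7.84 (= 2.8^2), df = 1, p-value ≈ .005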

Test Statistic

Z = (π̂ − π0) / √[π0(1 − π0)/n]  ~  N(0, 1)

Null Distribution

π̂ ~ N(0.5, .05)


Comments:

A continuity correction factor of ± 0.5/n may be added to the numerator of the Z test statistic above, in accordance with the “normal approximation to the binomial distribution” – see 4.2 of these Lecture Notes. (The “n” in the denominator is there because we are here dealing with proportion of success π̂ = X/n, rather than just number of successes X.)

Power and sample size calculations are similar to those of inference for the mean, and will not be pursued here.

IMPORTANT: See Appendix > Statistical Inference > General Parameters and FORMULA TABLES, and Appendix > Statistical Inference > Means and Proportions, One and Two Samples.


6.2 Two Samples

§ 6.2.1 Means

First assume that the samples are randomly selected from two populations that are independent, i.e., no relation exists between individuals of one population and the other, relative to the random variable, or any lurking or confounding variables that might have an effect on this variable. Model: Phase III Randomized Clinical Trial (RCT)

Measuring the effect of treatment (e.g., drug) versus control (e.g., placebo) on a response variable X, to determine if there is any significant difference between them.

Then… ↓ CLT ↓

So…  X̄1 − X̄2 ~ N( µ1 − µ2 , √[σ1²/n1 + σ2²/n2] )

Comments:

Recall from 4.1: If Y1 and Y2 are independent, then Var(Y1 − Y2) = Var(Y1) + Var(Y2).

If n1 = n2, the samples are said to be (numerically) balanced.

The null hypothesis H0: µ1 − µ2 = 0 can be replaced by H0: µ1 − µ2 = µ0 if necessary, in order to compare against a specific constant difference µ0 (e.g., 10 cholesterol points), with the corresponding modifications below.

s.e. = √[σ1²/n1 + σ2²/n2] can be replaced by s.e. = √[s1²/n1 + s2²/n2], provided n1 ≥ 30, n2 ≥ 30.

[Figure: schematic of the two study arms, whose samples are either Independent or Dependent (Paired, Matched). Control Arm: assume X1 ~ N(µ1, σ1); a sample of size n1 gives X̄1 ~ N(µ1, σ1/√n1). Treatment Arm: assume X2 ~ N(µ2, σ2); a sample of size n2 gives X̄2 ~ N(µ2, σ2/√n2).]

H0: µ1 − µ2 = 0 (There is no difference in mean response between the two populations.)


Test Statistic

Z = [ (X̄1 − X̄2) − µ0 ] / √[s1²/n1 + s2²/n2]  ~  N(0, 1)

Example: X = “cholesterol level (mg/dL)” Test H0: µ1 − µ2 = 0 vs. HA: µ1 − µ2 ≠ 0 for significance at the α = .05 level.

s1²/n1 = 1200/80 = 15,   s2²/n2 = 600/60 = 10   →   s.e. = √[s1²/n1 + s2²/n2] = √25 = 5

95% Confidence Interval for µ1 − µ2

95% limits = 11 ± (1.96)(5) = 11 ± 9.8 ← margin of error ∴ 95% CI = (1.2, 20.8), which does not contain 0 ⇒ Reject H0. Drug works!

95% Acceptance Region for H0: µ1 − µ2 = 0

95% limits = 0 ± (1.96)(5) = ± 9.8 ← margin of error ∴ 95% AR = (−9.8, +9.8), which does not contain 11 ⇒ Reject H0. Drug works!

p-value = 2 P(X̄1 − X̄2 ≥ 11) = 2 P( Z ≥ (11 − 0) / 5 ) = 2 P(Z ≥ 2.2) = 2(.0139) = .0278 < .05 = α ⇒ Reject H0. Drug works!

             Placebo         Drug
  n          n1 = 80         n2 = 60
  mean       x̄1 = 240        x̄2 = 229
  variance   s1² = 1200      s2² = 600

  →  x̄1 − x̄2 = 11
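A minimal R sketch of this large-sample comparison, computed directly from the summary statistics (variable names are our own):

n1 <- 80; xbar1 <- 240; s1.sq <- 1200      # placebo
n2 <- 60; xbar2 <- 229; s2.sq <- 600       # drug
se <- sqrt(s1.sq/n1 + s2.sq/n2)                          # = 5
z  <- (xbar1 - xbar2 - 0) / se                           # = 2.2
p.value <- 2 * pnorm(z, lower.tail = FALSE)              # ≈ .028
ci <- (xbar1 - xbar2) + c(-1, 1) * qnorm(.975) * se      # ≈ (1.2, 20.8)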

(1 − α) × 100% Acceptance Region for H0: µ1 − µ2 = µ0

( µ0 − zα/2 √[s1²/n1 + s2²/n2] ,  µ0 + zα/2 √[s1²/n1 + s2²/n2] )

(1 − α) × 100% Confidence Interval for µ1 − µ2

( (x̄1 − x̄2) − zα/2 √[s1²/n1 + s2²/n2] ,  (x̄1 − x̄2) + zα/2 √[s1²/n1 + s2²/n2] )


[Figure: null distribution X̄1 − X̄2 ~ N(0, 5), with central area 0.95, 95% acceptance region (−9.8, +9.8), observed difference 11, 95% confidence interval (1.2, 20.8), and two-tailed areas 0.0139 each.]

0 is not in the 95% Confidence Interval = (1.2, 20.8).
11 is not in the 95% Acceptance Region = (−9.8, 9.8).


Small samples: What if n1 < 30 and/or n2 < 30? Then use the t-distribution, provided…

H0: σ1² = σ2² (equivariance, homoscedasticity)

Technically, this requires a formal test using the F-distribution; see next section (§ 6.2.2). However, an informal criterion is often used:

  1/4 < F = s1²/s2² < 4.

If equivariance is accepted, then the common value of σ1² and σ2² can be estimated by the weighted mean of s1² and s2², the pooled sample variance:

  s²pooled = (df1 s1² + df2 s2²) / (df1 + df2), where df1 = n1 − 1 and df2 = n2 − 1,

i.e.,

  s²pooled = [ (n1 − 1) s1² + (n2 − 1) s2² ] / (n1 + n2 − 2) = SS/df.

Therefore, in this case, we have s.e. = √[σ1²/n1 + σ2²/n2] estimated by

  s.e. = √[ s²pooled/n1 + s²pooled/n2 ],

i.e.,

  s.e. = √[ s²pooled (1/n1 + 1/n2) ] = spooled √(1/n1 + 1/n2).

If equivariance (but not normality) is rejected, then an approximate t-test can be used, with the approximate degrees of freedom df given by

  df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ].

This is known as the Smith-Satterthwaite Test. (Also used is the Welch Test.)


Example: X = “cholesterol level (mg/dL)” Test H0: µ1 − µ2 = 0 vs. HA: µ1 − µ2 ≠ 0 for significance at the α = .05 level.

Pooled Variance

  s²pooled = [ (8 − 1)(775) + (10 − 1)(1175) ] / (8 + 10 − 2) = 16000/16 = 1000
                                                        ↑ df

Note that s²pooled = 1000 is indeed between the variances s1² = 775 and s2² = 1175.

Standard Error

  s.e. = √[ 1000 (1/8 + 1/10) ] = 15

Critical Value t16, .025 = 2.120        Margin of Error = (2.120)(15) = 31.8

             Placebo        Drug
  n          n1 = 8         n2 = 10
  mean       x̄1 = 230       x̄2 = 200
  variance   s1² = 775      s2² = 1175

  →  x̄1 − x̄2 = 30
  →  F = s1²/s2² = 0.66, which is between 0.25 and 4. Equivariance accepted ⇒ t-test


Test Statistic

T = [ (X̄1 − X̄2) − µ0 ] / √[ s²pooled (1/n1 + 1/n2) ]  ~  tdf , where df = n1 + n2 − 2

95% Confidence Interval for µ1 − µ2

95% limits = 30 ± 31.8 ← margin of error ∴ 95% CI = (−1.8, 61.8), which contains 0 ⇒ Accept H0.

95% Acceptance Region for H0: µ1 − µ2 = 0

95% limits = 0 ± 31.8 ← margin of error ∴ 95% AR = (−31.8, +31.8), which contains 30 ⇒ Accept H0.

p-value = 2 P(X̄1 − X̄2 ≥ 30) = 2 P( T16 ≥ (30 − 0) / 15 ) = 2 P(T16 ≥ 2.0) = 2(.0314) = .0628 > .05 = α ⇒ Accept H0.

Once again, low sample size implies low power to reject the null hypothesis. The tests do not show significance, and we cannot conclude that the drug works, based on the data from these small samples. Perhaps a larger study is indicated…
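A minimal R sketch of this pooled t-test, computed from the summary statistics above (with raw data, t.test(x, y, var.equal = TRUE) gives the same result directly):

n1 <- 8;  xbar1 <- 230; s1.sq <- 775        # placebo
n2 <- 10; xbar2 <- 200; s2.sq <- 1175       # drug
df <- n1 + n2 - 2
sp.sq <- ((n1 - 1) * s1.sq + (n2 - 1) * s2.sq) / df      # pooled variance = 1000
se <- sqrt(sp.sq * (1/n1 + 1/n2))                        # = 15
t.stat <- (xbar1 - xbar2 - 0) / se                       # = 2.0
p.value <- 2 * pt(t.stat, df, lower.tail = FALSE)        # ≈ .063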

(1 − α) × 100% Confidence Interval for µ1 − µ2

( (x̄1 − x̄2) − tdf, α/2 √[ s²pooled (1/n1 + 1/n2) ] ,  (x̄1 − x̄2) + tdf, α/2 √[ s²pooled (1/n1 + 1/n2) ] ), where df = n1 + n2 − 2

(1 − α) × 100% Acceptance Region for H0: µ1 − µ2 = µ0

( µ0 − tdf, α/2 √[ s²pooled (1/n1 + 1/n2) ] ,  µ0 + tdf, α/2 √[ s²pooled (1/n1 + 1/n2) ] ), where df = n1 + n2 − 2


Now consider the case where the two samples are dependent. That is, each observation in the first sample is paired, or matched, in a natural way on a corresponding observation in the second sample.

Examples:

Individuals may be matched on characteristics such as age, sex, race, and/or other variables that might confound the intended response.

Individuals may be matched on personal relations such as siblings (similar genetics, e.g., twin studies), spouses (similar environment), etc.

Observations may be connected physically (e.g., left arm vs. right arm), or connected in time (e.g., before treatment vs. after treatment).

Calculate the difference di = xi – yi of each matched pair of observations, thereby forming a single collapsed sample {d1, d2, d3, …, dn}, and apply the appropriate one-sample Z- or t- test to the equivalent null hypothesis H0: µD = 0.
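For example, a minimal R sketch with made-up before/after measurements (the vectors x and y below are purely illustrative), showing that the paired test is just a one-sample test on the differences:

x <- c(142, 150, 138, 161, 155, 147, 144, 159)    # hypothetical "before" values
y <- c(138, 141, 139, 150, 148, 146, 137, 153)    # hypothetical "after" values
d <- x - y                                        # collapsed sample of differences
t.test(d, mu = 0)                                 # one-sample t-test of H0: mu_D = 0
t.test(x, y, paired = TRUE)                       # equivalent paired t-test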

[Figure: schematic of paired samples. Assume X ~ N(µ1, σ1) and Y ~ N(µ2, σ2). The two samples {x1, x2, x3, …, xn} and {y1, y2, y3, …, yn}, each of size n, are matched pair-by-pair (# 1, 2, 3, …, n). Subtracting gives a single sample of differences {d1 = x1 − y1, d2 = x2 − y2, d3 = x3 − y3, …, dn = xn − yn}, where D = X − Y ~ N(µD, σD) with µD = µ1 − µ2, so that H0: µ1 − µ2 = 0 becomes H0: µD = 0.]


Checks for normality include normal scores plot (probability plot, Q-Q plot), etc., just as with one sample. Remedies for non-normality include transformations (e.g., logarithmic or square root), or nonparametric tests.

Independent Samples: Wilcoxon Rank Sum Test (= Mann-Whitney U Test)

Dependent Samples: Sign Test, Wilcoxon Signed Rank Test (just as with one sample)


Step-by-Step Hypothesis Testing
Two Sample Means, H0: µ1 – µ2 vs. µ0 (usually 0)

Are X1 and X2 approximately normally distributed (or mildly skewed)?
  • No → Use a transformation, or a nonparametric test, e.g., Wilcoxon Rank Sum Test.
  • Yes → Independent or Paired?
      - Paired → Compute D = X1 – X2, i.e., di = x1i − x2i, for each i = 1, 2, …, n. Then calculate the
          • sample mean  d̄ = (1/n) Σ di
          • sample variance  sd² = [1/(n − 1)] Σ (di − d̄)²
        … and GO TO “One Sample Mean” testing of H0: µD = 0, section 6.1.1 (page 6.1-28).
      - Independent → Are σ1, σ2 known?
          - Yes → Use Z-test (with σ1, σ2):
              Z = [ (X̄1 − X̄2) − µ0 ] / √[σ1²/n1 + σ2²/n2] ~ N(0, 1)
          - No, or don't know → Are n1 ≥ 30 and n2 ≥ 30?
              - Yes → Use Z-test or t-test (with σ̂1 = s1, σ̂2 = s2):
                  Z ≅ Tn1+n2−2 = [ (X̄1 − X̄2) − µ0 ] / √[s1²/n1 + s2²/n2]
              - No → Equivariance: σ1² = σ2²? Compute F = s1²/s2². Is 1/4 < F < 4?
                  - Yes → Use t-test (with σ̂1² = σ̂2² = s²pooled):
                      Tn1+n2−2 = [ (X̄1 − X̄2) − µ0 ] / √[ s²pooled (1/n1 + 1/n2) ],
                      where s²pooled = [ (n1 − 1) s1² + (n2 − 1) s2² ] / (n1 + n2 − 2)
                  - No → Use an approximate t-test, e.g., Satterthwaite Test.


[Figure: F-distribution density curves

  f(x) = [ 1 / Β(ν1/2, ν2/2) ] (ν1/ν2)^(ν1/2) x^(ν1/2 − 1) [ 1 + (ν1/ν2) x ]^(−(ν1 + ν2)/2)

for ν1 = 20 and ν2 = 5, 10, 20, 30, 40; together with a schematic of two independent groups with standard deviations σ1 and σ2, from which samples of sizes n1 and n2 are drawn and s1², s2² calculated.]

§ 6.2.2 Variances

Suppose X1 ~ N(µ1, σ1) and X2 ~ N(µ2, σ2).

Null Hypothesis H0: σ1² = σ2²
versus
Alternative Hypothesis HA: σ1² ≠ σ2²

Formal test: Reject H0 if the F-statistic is significantly different from 1. Informal criterion: Accept H0 if the F-statistic is between 0.25 and 4.

Comment: Another test, more robust to departures from the normality assumption than the F-test, is Levene’s Test, a t-test of the absolute deviations of each sample. It can be generalized to more than two samples (see section 6.3.2).

Test Statistic

F = s1² / s2²  ~  Fν1, ν2

where ν1 = n1 − 1 and ν2 = n2 − 1 are the corresponding numerator and denominator degrees of freedom, respectively.
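A minimal R sketch of the formal test, applied to the summary statistics of the cholesterol example in § 6.2.1 (var.test() does the same thing when raw data are available):

s1.sq <- 775;  n1 <- 8        # placebo
s2.sq <- 1175; n2 <- 10       # drug
F.stat <- s1.sq / s2.sq                               # ≈ 0.66
p.lower <- pf(F.stat, df1 = n1 - 1, df2 = n2 - 1)     # lower tail, since F < 1 here
p.value <- 2 * p.lower                                # two-sided p-value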


§ 6.2.3 Proportions

POPULATION

Binary random variable I1 = 1 or 0, with P(I1 = 1) = π1, P(I1 = 0) = 1 − π1.
Binary random variable I2 = 1 or 0, with P(I2 = 1) = π2, P(I2 = 0) = 1 − π2.

INDEPENDENT SAMPLES, n1 ≥ 30, n2 ≥ 30

Random Variable X1 = #(I1 = 1) ~ Bin(n1, π1). Recall (assuming n1π1 ≥ 15, n1(1 − π1) ≥ 15):
  π̂1 = X1/n1 ~ N( π1, √[π1(1 − π1)/n1] ), approx.

Random Variable X2 = #(I2 = 1) ~ Bin(n2, π2). Recall (assuming n2π2 ≥ 15, n2(1 − π2) ≥ 15):
  π̂2 = X2/n2 ~ N( π2, √[π2(1 − π2)/n2] ), approx.

Therefore, approximately…

  π̂1 − π̂2 ~ N( π1 − π2 , √[ π1(1 − π1)/n1 + π2(1 − π2)/n2 ] )
                                ↑ standard error s.e.

Confidence intervals are computed in the usual way, using the estimate

  s.e. = √[ π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2 ],

as follows:


(1 − α) × 100% Confidence Interval for π1 − π2

( (π̂1 − π̂2) − zα/2 √[ π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2 ] ,  (π̂1 − π̂2) + zα/2 √[ π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2 ] )

Test Statistic for H0: π1 − π2 = 0

Z = [ (π̂1 − π̂2) − 0 ] / √[ π̂pooled (1 − π̂pooled) (1/n1 + 1/n2) ]  ~  N(0, 1)

Unlike the one-sample case, the same estimate for the standard error can also be used in computing the acceptance region for the null hypothesis H0: π1 − π2 = π0, as well as the test statistic for the p-value, provided the null value π0 ≠ 0. HOWEVER, if testing for equality between two proportions via the null hypothesis H0: π1 − π2 = 0, then their common value should be estimated by the more stable weighted mean of π̂1 and π̂2, the pooled sample proportion:

  π̂pooled = (X1 + X2) / (n1 + n2) = (n1 π̂1 + n2 π̂2) / (n1 + n2).

Substituting yields…

  s.e.0 = √[ π̂pooled (1 − π̂pooled)/n1 + π̂pooled (1 − π̂pooled)/n2 ],

i.e.,

  s.e.0 = √[ π̂pooled (1 − π̂pooled) (1/n1 + 1/n2) ].  Hence…

(1 − α) × 100% Acceptance Region for H0: π1 − π2 = 0

( 0 − zα/2 √[ π̂pooled (1 − π̂pooled) (1/n1 + 1/n2) ] ,  0 + zα/2 √[ π̂pooled (1 − π̂pooled) (1/n1 + 1/n2) ] )


              PT + Supplement      PT only
  n            n1 = 400             n2 = 320
  improved     X1 = 332             X2 = 244

H0: π1 − π2 = 0

[Figure 1: null distribution of π̂1 − π̂2 ~ N(0, 0.03), with the observed differences ±0.0675 marked; equivalently, the standard normal distribution N(0, 1) with the corresponding Z-scores ±2.25 and tail areas .0122 each.]

Example: Consider a group of 720 patients who undergo physical therapy for arthritis. A daily supplement of glucosamine and chondroitin is given to n1 = 400 of them in addition to the physical therapy; after four weeks of treatment, X1 = 332 show measurable signs of improvement (increased ROM, etc.). The remaining n2 = 320 patients receive physical therapy only; after four weeks, X2 = 244 show improvement. Does this difference represent a statistically significant treatment effect? Calculate the p-value, and form a conclusion at the α = .05 significance level.

H0: π1 − π2 = 0 vs. HA: π1 − π2 ≠ 0, at α = .05

  π̂1 = 332/400 = 0.83,   π̂2 = 244/320 = 0.7625   ⇒   π̂1 − π̂2 = 0.0675

  π̂pooled = (332 + 244) / (400 + 320) = 576/720 = 0.8, and thus 1 − π̂pooled = 144/720 = 0.2

Therefore, p-value = 2 P(π̂1 − π̂2 ≥ 0.0675) = 2 P( Z ≥ (0.0675 − 0) / 0.03 ) = 2 P(Z ≥ 2.25) = 2(.0122) = .0244.

Conclusion: As this value is smaller than α = .05, we can reject the null hypothesis that the two proportions are equal. There does indeed seem to be a moderately significant treatment difference between the two groups.

s.e.0 = √[ (0.8)(0.2) (1/400 + 1/320) ] = 0.03
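A minimal check in R, assuming the built-in prop.test() (which reports the equivalent chi-squared statistic Z² = 2.25² = 5.0625; compare the Chi-Squared Test below):

prop.test(x = c(332, 244), n = c(400, 320), correct = FALSE)
# X-squared = 5.0625, df = 1, p-value ≈ .0244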


Exercise: Instead of H0: π1 − π2 = 0 vs. HA: π1 − π2 ≠ 0, test the null hypothesis for a 5% difference, i.e., H0: π1 − π2 = .05 vs. HA: π1 − π2 ≠ .05, at α = .05. [Note that the pooled proportion π̂pooled is no longer appropriate to use in the expression for the standard error under the null hypothesis, since H0 is not claiming that the two proportions π1 and π2 are equal (to a common value); see notes above.] Conclusion?

Exercise: Instead of H0: π1 − π2 = 0 vs. HA: π1 − π2 ≠ 0, test the one-sided null hypothesis H0: π1 − π2 ≤ 0 vs. HA: π1 − π2 > 0 at α = .05. Conclusion?

Exercise: Suppose that in a second experiment, n1 = 400 patients receive a new drug that targets B-lymphocytes, while the remaining n2 = 320 receive a placebo, both in addition to physical therapy. After four weeks, X1 = 376 and X2 = 272 show improvement, respectively. Formally test the null hypothesis of equal proportions at the α = .05 level. Conclusion?

Exercise: Finally suppose that in a third experiment, n1 = 400 patients receive “magnet therapy,” while the remaining n2 = 320 do not, both in addition to physical therapy. After four weeks, X1 = 300 and X2 = 240 show improvement, respectively. Formally test the null hypothesis of equal proportions at the α = .05 level. Conclusion?

See…

Appendix > Statistical Inference > General Parameters and FORMULA TABLES.

IMPORTANT!


Alternate Method: Chi-Squared (χ²) Test

As before, let the binary variable I = 1 for improvement, I = 0 for no improvement, with probability π and 1 − π, respectively. Now define a second binary variable J = 1 for the “PT + Drug” group, and J = 0 for the “PT only” group. Thus, there are four possible disjoint events: “I = 0 and J = 0,” “I = 0 and J = 1,” “I = 1 and J = 0,” and “I = 1 and J = 1.” The number of times these events occur in the random sample can be arranged in a 2 × 2 contingency table that consists of four cells (NW, NE, SW, and SE) as demonstrated below, and compared with their corresponding expected values based on the null hypothesis.

Observed Values

                             Group (J)
  Status (I)           PT + Drug    PT only
  Improvement             332          244       576   Row marginal totals
  No Improvement           68           76       144
                          400          320       720   Column marginal totals

versus…

Expected Values = (Column total × Row total) / Total Sample Size n, under H0: π1 = π2, with π̂pooled = 576/720 = 0.8

                             Group (J)
  Status (I)           PT + Drug                   PT only
  Improvement          400 × 576/720 = 320.0       320 × 576/720 = 256.0       576
  No Improvement       400 × 144/720 =  80.0       320 × 144/720 =  64.0       144
                       400.0                       320.0                       720

Note: “Chi” is pronounced “kye.”

Informal reasoning: Consider the first cell, improvement in the 400 patients of the “PT + Drug” group. The null hypothesis conjectures that the probability of improvement is equal in both groups, and this common value is estimated by the pooled proportion 576/720. Hence, the expected number (under H0) of improved patients in the “PT + Drug” group is 400 × 576/720, etc.

Note that, by construction, H0: 320/400 = 256/320 = 576/720, the pooled proportion.


[Figure 2: χ²1 density curve, with the observed statistic 5.0625 marked and upper-tail area .0244.]

Ideally, if all the observed values = all the expected values, then this statistic would = 0, and the corresponding p-value = 1. As it is,

  Χ² = (332 − 320)²/320 + (244 − 256)²/256 + (68 − 80)²/80 + (76 − 64)²/64 = 5.0625 on 1 df.

Therefore, the p-value = P(χ²1 ≥ 5.0625) = .0244, as before. Reject H0.

Comments:

Chi-squared Test is valid, provided Expected Values ≥ 5. (Otherwise, the score is inflated.) For small expected values in a 2 × 2 table, defer to Fisher’s Exact Test.

Chi-squared statistic with Yates continuity correction to reduce spurious significance:

  Χ² = Σ all cells (|Obs − Exp| − 0.5)² / Exp

Chi-squared Test is strictly for the two-sided H0: π1 − π2 = 0 vs. HA: π1 − π2 ≠ 0. It cannot be modified to a one-sided test, or to H0: π1 − π2 = π0 vs. HA: π1 − π2 ≠ π0.

Test Statistic for H0: π1 − π2 = 0

  Χ² = Σ all cells (Obs − Exp)² / Exp  ~  χ²1

Note that 5.0625 = (±2.25)², i.e., χ²1 = Z². The two test statistics are mathematically equivalent! (Compare Figures 1 and 2.)


How could we solve this problem using R? The code (which can be shortened a bit):

# Lines preceded by the pound sign are read as comments, and ignored by R.

# The following set of commands builds the 2-by-2 contingency table,
# column by column (with optional headings), and displays it as output.

Tx.vs.Control = matrix(c(332, 68, 244, 76), ncol = 2, nrow = 2,
                       dimnames = list("Status" = c("Improvement", "No Improvement"),
                                       "Group" = c("PT + Drug", "PT")))
Tx.vs.Control

                Group
Status           PT + Drug  PT
  Improvement          332 244
  No Improvement        68  76

# A shorter alternative that outputs a simpler table:

Improvement = c(332, 244)
No_Improvement = c(68, 76)
Tx.vs.Control = rbind(Improvement, No_Improvement)
Tx.vs.Control

               [,1] [,2]
Improvement     332  244
No_Improvement   68   76

# The actual Chi-squared Test itself. Since using a correction
# factor is the default, the F option specifies that no such
# factor is to be used in this example.

chisq.test(Tx.vs.Control, correct = F)

        Pearson's Chi-squared test
data:  Tx.vs.Control
X-squared = 5.0625, df = 1, p-value = 0.02445

Note how the output includes the Chi-squared test statistic, degrees of freedom, and p-value, all of which agree with our previous manual calculations.


Application: Case-Control Study Design

Determines if an association exists between disease D and risk factor exposure E. Given cases (D+) and controls (D−) in the PRESENT, investigate their relation with PAST exposure status (E+ and E−).

Chi-Squared Test

H0: π E+ | D+ = π E+ | D−

Randomly select a sample of n1 cases (D+) and n2 controls (D−), and categorize each member according to whether or not he/she was exposed to the risk factor. For each case (D+), there are 2 disjoint possibilities for exposure: E+ or E−; likewise, for each control (D−), there are 2 disjoint possibilities for exposure: E+ or E−.

          D+     D−
  E+       a      b
  E−       c      d

Calculate the χ²1 statistic:

  (a + b + c + d) (ad − bc)² / [ (a + c) (b + d) (a + b) (c + d) ]

McNemar’s Test

H0: π E+ | D+ = π E+ | D−

Match each case with a corresponding control on age, sex, race, and any other confounding variables that may affect the outcome. Note that this requires a balanced sample: n1 = n2 (= n matched pairs). For each matched case-control ordered pair (D+, D−), there are 4 disjoint possibilities for exposure: E+ and E+ (concordant pair), E+ and E− (discordant pair), E− and E+ (discordant pair), or E− and E− (concordant pair).

                      Control (D−)
                       E+      E−
  Case (D+)    E+       a       b
               E−       c       d

Calculate the χ²1 statistic:

  (b − c)² / (b + c)
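A minimal R sketch of McNemar’s Test (the pair counts below are entirely made-up, purely to illustrate the call):

pairs <- matrix(c(30, 25,        # a = both exposed,         b = case E+ / control E-
                  10, 35),       # c = case E- / control E+, d = neither exposed
                nrow = 2, byrow = TRUE,
                dimnames = list(Case = c("E+", "E-"), Control = c("E+", "E-")))
mcnemar.test(pairs, correct = FALSE)    # chi-squared = (b - c)^2 / (b + c) on 1 df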

See Appendix > Statistical Inference > Means and Proportions, One and Two Samples.


To quantify the strength of association between D and E, we turn to the notion of…

Odds Ratios – Revisited

Recall: Alas, the probability distribution of the odds ratio OR is distinctly skewed to the right. However, its natural logarithm, ln(OR), is approximately normally distributed, which makes it more useful for conducting the Test of Association above. Namely…

(1 − α) × 100% Confidence Limits for OR:

  e^[ ln(OR̂) ± zα/2 · s.e. ] , where s.e. = √( 1/a + 1/b + 1/c + 1/d )

POPULATION

Case-Control Studies:
  OR = odds(Exposure | Disease) / odds(Exposure | No Disease) = [ P(E+ | D+) / P(E− | D+) ] / [ P(E+ | D−) / P(E− | D−) ]

Cohort Studies:
  OR = odds(Disease | Exposure) / odds(Disease | No Exposure) = [ P(D+ | E+) / P(D− | E+) ] / [ P(D+ | E−) / P(D− | E−) ]

H0: OR = 1 ⇔ No association exists between D, E.
versus…
HA: OR ≠ 1 ⇔ An association exists between D, E.

SAMPLE, size n

          D+     D−
  E+       a      b
  E−       c      d

  OR̂ = ad / bc


[Figure: number lines marking the two 95% confidence intervals for OR computed below, (0.79, 8.30) and (1.52, 4.32), each with the point estimate 2.56 and the null value 1.]

          D+     D−                              D+     D−
  E+       8     10                      E+      40     50
  E−      10     32                      E−      50    160

  OR̂ = (8)(32) / [(10)(10)] = 2.56       OR̂ = (40)(160) / [(50)(50)] = 2.56

Examples: Test H0: OR = 1 versus HA: OR ≠ 1 at the α = .05 significance level.

ln(2.56) = 0.94
s.e. = √( 1/8 + 1/10 + 1/10 + 1/32 ) = 0.6   ⇒   95% Margin of Error = (1.96)(0.6) = 1.176

95% Confidence Interval for ln(OR) = ( 0.94 − 1.176, 0.94 + 1.176 ) = ( −0.236, 2.116 )
and so… 95% Confidence Interval for OR = ( e^−0.236, e^2.116 ) = (0.79, 8.30)

Conclusion: As this interval does contain the null value OR = 1, we cannot reject the hypothesis of non-association at the 5% significance level.

ln(2.56) = 0.94
s.e. = √( 1/40 + 1/50 + 1/50 + 1/160 ) = 0.267   ⇒   95% Margin of Error = (1.96)(0.267) = 0.523

95% Confidence Interval for ln(OR) = ( 0.94 − 0.523, 0.94 + 0.523 ) = ( 0.417, 1.463 )
and so… 95% Confidence Interval for OR = ( e^0.417, e^1.463 ) = (1.52, 4.32)

Conclusion: As this interval does not contain the null value OR = 1, we can reject the hypothesis of non-association at the 5% level. With 95% confidence, the odds of disease are between 1.52 and 4.32 times higher among the exposed than the unexposed. Comments:
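A minimal R sketch of the second calculation (variable names are our own):

a <- 40; b <- 50; c <- 50; d <- 160
OR <- (a * d) / (b * c)                               # 2.56
se <- sqrt(1/a + 1/b + 1/c + 1/d)                     # ≈ 0.267
ci <- exp(log(OR) + c(-1, 1) * qnorm(.975) * se)      # ≈ (1.52, 4.32)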

If any of a, b, c, or d = 0, then use s.e. = √[ 1/(a + 0.5) + 1/(b + 0.5) + 1/(c + 0.5) + 1/(d + 0.5) ].

If OR < 1, this suggests that exposure might have a protective effect, e.g., daily calcium supplements (yes/no) and osteoporosis (yes/no).


Summary Odds Ratio

Combining 2 × 2 tables corresponding to distinct strata. Examples:

  Males              Females            All
       D+    D−           D+    D−           D+    D−
  E+   10    50      E+   10    10      E+   20    60
  E−   10   150      E−   60    60      E−   70   210
  OR̂1 = 3            OR̂2 = 1            OR̂ = 1

  Males              Females            All
       D+    D−           D+    D−           D+    D−
  E+   80    20      E+   10    20      E+   90    40
  E−   20    10      E−   20    80      E−   40    90
  OR̂1 = 2            OR̂2 = 2            OR̂ = 5.0625

  Males              Females            All
       D+    D−           D+    D−           D+    D−
  E+   60   100      E+   50    10      E+  110   110
  E−   10    50      E−  100    60      E−  110   110
  OR̂1 = 3            OR̂2 = 3            OR̂ = 1

These examples illustrate the phenomenon known as Simpson’s Paradox. Ignoring a confounding variable (e.g., gender) may obscure an association that exists within each stratum but is not observed in the pooled data, and thus must be adjusted for. When is it acceptable to combine data from two or more such strata? How is the summary odds ratio ORsummary estimated? And how is it tested for association?



In general…

  Stratum 1            Stratum 2
       D+    D−             D+    D−
  E+   a1    b1        E+   a2    b2
  E−   c1    d1        E−   c2    d2

  OR̂1 = a1 d1 / (b1 c1)        OR̂2 = a2 d2 / (b2 c2)

Example:

  Males                Females
       D+    D−             D+    D−
  E+   10    20        E+   40    50
  E−   30    90        E−   60    90

  OR̂1 = 1.5            OR̂2 = 1.2

Assuming that the Test of Homogeneity H0: OR1 = OR2 is conducted and accepted,

  OR̂MH = [ (10)(90)/150 + (40)(90)/240 ] / [ (20)(30)/150 + (50)(60)/240 ] = (6 + 15) / (4 + 12.5) = 21 / 16.5 = 1.273.

Exercise: Show algebraically that MHOR is a weighted average of 1OR and 2OR .

I. Calculate the estimates OR̂1 and OR̂2 of OR1 and OR2 for each stratum, as shown.

II. Can the strata be combined? Conduct a “Breslow-Day” (Chi-squared) Test of Homogeneity for H0: OR1 = OR2.

III. If accepted, calculate the Mantel-Haenszel Estimate of ORsummary:

  OR̂MH = [ a1 d1/n1 + a2 d2/n2 ] / [ b1 c1/n1 + b2 c2/n2 ].

IV. Finally, conduct a Test of Association for the combined strata, H0: ORsummary = 1, either via confidence interval, or special χ²-test (shown below).


To conduct a formal Chi-squared Test of Association H0: ORsummary = 1, we calculate, for the 2 × 2 contingency table in each stratum i = 1, 2, …, s:

  Observed # diseased             Expected # diseased               Variance
        D+    D−
  E+    ai    bi    R1i     →     E1i = R1i C1i / ni                Vi = R1i R2i C1i C2i / [ ni² (ni − 1) ]
  E−    ci    di    R2i     →     E2i = R2i C1i / ni
        C1i   C2i   ni

Therefore, summing over all strata i = 1, 2, …, s, we obtain the following:

  Observed total, Diseased:    Exposed: O1 = Σ ai      Not Exposed: O2 = Σ ci
  Expected total, Diseased:    Exposed: E1 = Σ E1i     Not Exposed: E2 = Σ E2i
  Total Variance:              V = Σ Vi

and the formal test statistic for significance is given by

  Χ² = (O1 − E1)² / V  ~  χ²1.

This formulation will appear again in the context of the Log-Rank Test in the area of Survival Analysis (section 8.3).

Example (cont’d):

For stratum 1 (males), E11 = (30)(40)/150 = 8 and V1 = (30)(120)(40)(110) / [ 150² (149) ] = 4.725.

For stratum 2 (females), E12 = (90)(100)/240 = 37.5 and V2 = (90)(150)(100)(140) / [ 240² (239) ] = 13.729.

Therefore, O1 = 50, E1 = 45.5, and V = 18.454, so that Χ² = (4.5)² / 18.454 = 1.097 on 1 degree of freedom, from which it follows that the null hypothesis H0: ORsummary = 1 cannot be rejected at the α = .05 significance level, i.e., there is not enough empirical evidence to conclude that an association exists between disease D and exposure E.
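A minimal R check of this example, assuming the built-in mantelhaen.test() on a 2 × 2 × 2 array of the two strata (without continuity correction, so that the statistic matches the hand calculation above):

strata <- array(c(10, 30, 20, 90,       # males:   rows E+/E-, columns D+/D-
                  40, 60, 50, 90),      # females
                dim = c(2, 2, 2),
                dimnames = list(Exposure = c("E+", "E-"),
                                Disease  = c("D+", "D-"),
                                Sex      = c("Male", "Female")))
mantelhaen.test(strata, correct = FALSE)
# Mantel-Haenszel X-squared ≈ 1.097 on 1 df; common odds ratio estimate ≈ 1.273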

Comment: This entire discussion on Odds Ratios OR can be modified to Relative Risk RR (defined only for a cohort study), with the following changes: s.e. = √( 1/a − 1/R1 + 1/c − 1/R2 ), as well as b replaced with row marginal R1, and d replaced with row marginal R2, in all other formulas. [Recall, for instance, that OR̂ = ad/bc, whereas RR̂ = a R2 / (R1 c), etc.]



6.3 Several Samples

§ 6.3.1 Proportions

General formulation: Consider several fixed (i.e., nonrandom) populations, say j = 1, 2, 3, …, c, where every individual in each population can have one of several random responses, i = 1, 2, 3, …, r (e.g., the previous example had c = 2 treatment groups and r = 2 possible improvement responses: “Yes” or “No”). Formally, let I and J be two general categorical variables, with r and c categories, respectively. Thus, there is a total of r × c possible disjoint outcomes – namely, “an individual in population j (= 1, 2, …, c) corresponds to some response i (= 1, 2, …, r).” With this in mind, let πij = the probability of this outcome. We wish to test the null hypothesis that, for each response category i, the probabilities πij are equal, over all the population categories j. That is, the populations are homogeneous, with respect to the proportions of individuals having the same responses:

H0: π11 = π12 = π13 = … = π1c   and
    π21 = π22 = π23 = … = π2c   and
    …   and
    πr1 = πr2 = πr3 = … = πrc
⇔ “There is no association between (categories of) I and (categories of) J.”

versus…

HA: At least one of these equalities is false, i.e., πij ≠ πik for some i
⇔ “There is an association between (categories of) I and (categories of) J.”

Much as before, we can construct an r × c contingency table of n observed values, where r = # rows, and c = # columns.

                        Categories of J
  Categories of I        1      2      3     …     c
        1               O11    O12    O13    …    O1c     R1
        2               O21    O22    O23    …    O2c     R2
        3               O31    O32    O33    …    O3c     R3
        ⋮                ⋮      ⋮      ⋮            ⋮      ⋮
        r               Or1    Or2    Or3    …    Orc     Rr
                        C1     C2     C3     …    Cc      n


[Figure: χ²ν distribution density curves for ν = 1, 2, 3, 4, 5, 6, 7.]

For i = 1, 2, …, r and j = 1, 2, …, c, the following are obtained:

  Observed Values  Oij = #(I = i, J = j),  whole numbers ≥ 0
  Expected Values  Eij = Ri Cj / n,  real numbers (i.e., with decimals) ≥ 0

where the row marginals Ri = Oi1 + Oi2 + Oi3 + … + Oic, and the column marginals Cj = O1j + O2j + O3j + … + Orj.

Comments:

Chi-squared Test is valid, provided 80% or more of the Eij ≥ 5. For small expected values, lumping categories together increases the numbers in the corresponding cells. Example: The five age categories “18-24,” “25-39,” “40-49,” “50-64,” and “65+” might be lumped into three categories “18-39,” “40-64,” and “65+” if appropriate. Caution: Categories should be deemed contextually meaningful before using χ².

Remarkably, the same Chi-squared statistic can be applied in different scenarios, including tests of different null hypotheses H0 on the same contingency table, as shown in the following examples.

If Z1, Z2, …, Zd are independent N(0, 1) random variables, then Z1² + Z2² + … + Zd² ~ χ²d.

Test Statistic

  Χ² = Σ all i, j (Oij − Eij)² / Eij  ~  χ²df , where ν = df = (r − 1)(c − 1)


Example: Suppose that a study, similar to the previous one, compares r = 4 improvement responses of c = 3 fixed groups of n = 600 patients: one group of 250 receives physical therapy alone, a second group of 200 receives an over-the-counter supplement in addition to physical therapy, and a third group of 150 receives a prescription medication in addition to physical therapy. The 4 × 3 contingency table of observed values is generated below.

                                 Treatment Group (J)
  Improvement Status (I)       PT + Rx    PT + OTC    PT only
  None                             6           14         40       60   ← random row marginal totals
  Minor                            9           30         81      120
  Moderate                        15           60        105      180
  Major                          120           96         24      240
                                 150          200        250      600   ← fixed column marginal totals

Upon inspection, it seems obvious that there are clear differences, but determining whether or not these differences are statistically significant requires a formal test. For instance, consider the null hypothesis that “there is no significant difference in each improvement response rate, across the treatment populations” – i.e., for each improvement category i (= 1, 2, 3, 4) in I, the probabilities πij over all treatment categories j (= 1, 2, 3) in J, are equal. That is, explicitly,

H0: “Treatment populations are homogeneous with respect to each response.”

  π None in “PT + Rx” = π None in “PT + OTC” = π None in “PT only”   and
  π Minor in “PT + Rx” = π Minor in “PT + OTC” = π Minor in “PT only”   and
  π Mod in “PT + Rx” = π Mod in “PT + OTC” = π Mod in “PT only”   and
  π Major in “PT + Rx” = π Major in “PT + OTC” = π Major in “PT only”

If the null hypothesis is true, then the expected table would consist of the values below,

                                 Treatment Group (J)
  Improvement Status (I)       PT + Rx    PT + OTC    PT only
  None                            15           20         25       60
  Minor                           30           40         50      120
  Moderate                        45           60         75      180
  Major                           60           80        100      240
                                 150          200        250      600


because in this case….

  15/150 = 20/200 = 25/250 = pooled proportion π̂None = 60/600,    true, since all = 0.1
  30/150 = 40/200 = 50/250 = pooled proportion π̂Minor = 120/600,  true, since all = 0.2
  45/150 = 60/200 = 75/250 = pooled proportion π̂Mod = 180/600,    true, since all = 0.3
  60/150 = 80/200 = 100/250 = pooled proportion π̂Major = 240/600, true, since all = 0.4.

If the null hypothesis is rejected based on the data, then the alternative is that at least one of its four statements is false. For that corresponding improvement category, one of the three treatment populations is significantly different from the others. This is referred to as a Chi-squared Test of Homogeneity, and is performed in the usual way. (Exercise)

Let us consider a slightly different scenario which, for the sake of simplicity, has the same observed values as above. Suppose now we start with a single population, where every individual can have one of several random responses i = 1, 2, 3, …, r corresponding to one categorical variable I (such as improvement status, as before), AND one of several random responses j = 1, 2, 3, …, c corresponding to another categorical variable J (such as, perhaps, the baseline symptoms of their arthritis):

                                 Baseline Disease Status (J)
  Improvement Status (I)        Mild    Moderate    Severe
  None                             6          14        40       60   ← random row marginal totals
  Minor                            9          30        81      120
  Moderate                        15          60       105      180
  Major                          120          96        24      240
                                 150         200       250      600   ← random column marginal totals

In other words, unlike the previous scenario where there was only one random response for each individual per population, here there are two random responses for each individual in a single fixed population. With this in mind, the probability πij (see first page of this section) is defined differently – namely, as the conditional probability that “an individual corresponds to a response i (= 1, 2, …, r), given that he/she corresponds to a response j (= 1, 2, …, c).” Hence, in this scenario, the null hypothesis translates to:


  π None | Mild = π None | Moderate = π None | Severe   and
  π Minor | Mild = π Minor | Moderate = π Minor | Severe   and
  π Mod | Mild = π Mod | Moderate = π Mod | Severe   and
  π Major | Mild = π Major | Moderate = π Major | Severe

However, interpreting this in context, each row now states that “the improvement status variable I (= None, Minor, Mod, Major) is not affected by the baseline disease status variable J (= Mild, Moderate, Severe).” This implies that for each i = 1, 2, 3, 4, the events “I = i” and “J = j” (j = 1, 2, 3) are statistically independent, and hence, by definition, the common value of the conditional probabilities P(I = i | J = j) in each row, is equal to the corresponding unconditional probability P(I = i) for that row, namely, π None, π Minor, π Mod, and π Major, respectively. It then also follows that P(I = i ⋂ J = j) = P(I = i) P(J = j).*

The left-hand intersection probability in this equation is simply the “expected value Eij” / n; the right-hand side is the product of (“Row marginal Ri” / n) × (“Column marginal Cj” / n), and so we obtain the familiar formula Eij = Ri Cj / n. Thus, the previous table of expected values and subsequent calculations are exactly the same for this so-called Chi-squared Test of Independence:

H0: “The two responses are statistically independent in this population.”

Furthermore, because both responses I and J are independent, we can also characterize this null hypothesis by the “symmetric” statement that “the baseline disease status variable J (= Mild, Moderate, Severe) is not affected by the improvement status variable I (= None, Minor, Mod, Major).” That is, the common value of the conditional probabilities P(J = j | I = i) in each column, is equal to the corresponding unconditional probability P(J = j) for that column, i.e., π Mild, π Moderate, and π Severe, respectively:

  π Mild | None = π Mild | Minor = π Mild | Mod = π Mild | Major   (= π Mild)   and
  π Moderate | None = π Moderate | Minor = π Moderate | Mod = π Moderate | Major   (= π Moderate)   and
  π Severe | None = π Severe | Minor = π Severe | Mod = π Severe | Major   (= π Severe)

* We have used several results here. Recall that, by definition, two events A and B are said to be statistically independent if P(A | B) = P(A), or equivalently, P(A ⋂ B) = P(A) P(B). Also see Problems 3-5 and 3-22(b) for related ideas.


In particular, this would yield the following:

  15/60 = 30/120 = 45/180 = 60/240    (= π̂Mild = 150/600),      true, since all = 1/4;   and
  20/60 = 40/120 = 60/180 = 80/240    (= π̂Moderate = 200/600),  true, since all = 1/3;   and
  25/60 = 50/120 = 75/180 = 100/240   (= π̂Severe = 250/600),    true, since all = 5/12.

That is, the independence between I and J can also be interpreted in this equivalent form.

Hence, the same Chi-squared statistic Χ² = Σ all cells (Obs − Exp)²/Exp on df = (r – 1)(c – 1) is used for both types of hypothesis test! The exact interpretation depends on the design of the experiment, i.e., whether two or more populations are being compared for homogeneity with respect to a single response, or whether any two responses are independent of one another in a single population. However, as the application of the Chi-squared test is equally valid in either scenario, the subtle distinction between them is often blurred in practice.

MORAL: In general, if the null hypothesis is rejected in either scenario, then there is an association between the two categorical variables I and J.

Exercise: Conduct (both versions of) the Chi-squared Test for this 4 × 3 table.

One way to code this in R:

# Input
None = c(6, 14, 40)
Minor = c(9, 30, 81)
Moderate = c(15, 60, 105)
Major = c(120, 96, 24)
Improvement = rbind(None, Minor, Moderate, Major)

# Output
Improvement
chisq.test(Improvement, correct = F)



Goodness-of-Fit Test

H0: π1 = π1⁰, π2 = π2⁰, π3 = π3⁰, …, πk = πk⁰

For i = 1, 2, 3, …, k = # groups, n = sample size:
  Observed Values Oi;  Expected Values Ei = n πi⁰

Test Statistic

  Χ² = Σ (i = 1 to k) (Oi − Ei)² / Ei  ~  χ²df , where ν = df = k − 1

As a final application, consider one of the treatment categories alone, say “PT + Rx,” written below as a row, for convenience.

  PT + Rx Observed Values
     None    Minor    Moderate    Major
        6        9          15      120        n = 150

Suppose we wish to test the null hypothesis that there is “no significant difference in improvement responses,” i.e., the probabilities of all the improvement categories are equal. That is, H0: π None = π Minor = π Moderate = π Major (thus, = 0.25 each). Therefore, under this null hypothesis (and changing notation slightly), these n = 150 patients should be equally divided into the k = 4 response categories, i.e., H0: “For this treatment category, the responses follow a uniform distribution (= n/k)” as illustrated.

  PT + Rx Expected Values
     None    Minor    Moderate    Major
     37.5     37.5        37.5     37.5        n = 150

Of course, even a cursory comparison of these two distributions strongly suggests that there is indeed a significant difference. Remarkably, the same basic test statistic can be used in this Chi-squared Goodness-of-Fit Test. The “degrees of freedom” is equal to one less than k, the number of response categories being compared; in this case, df = 3. In general, this test can be applied to determine if data follow other probability distributions as well. For example, suppose it is more realistic to believe that the null distribution is not uniform, but skewed, i.e., H0: π None = .10, π Minor = .20, π Moderate = .30, π Major = .40. Then the observed values above would instead be compared with…

  PT + Rx Expected Values
     None    Minor    Moderate    Major
       15       30          45       60        n = 150


Exercise: Conduct this test for the “PT + Rx” data given, under both null hypotheses.


The Birds and the Bees An Application of the Chi-squared Test to Basic Genetics

Inherited biological traits among humans (e.g., right- or left-handedness) and other organisms are transmitted from parents to offspring via “unit factors” called genes, discrete regions of DNA that are located on chromosomes, which are tightly coiled within the nucleus of a cell. Most human cells normally contain 46 chromosomes, arranged in 23 pairs (“diploid”); hence, two copies of each gene. Each copy can be either dominant (say, A = right-handedness) or recessive (a = left-handedness) for a given trait. The trait that is physically expressed in the organism – i.e., its phenotype – is determined by which of the three possible combinations of pairs AA, Aa, aa of these two “alleles” A and a occurs in its genes – i.e., its genotype – and its interactions with environmental factors: AA is “homozygous dominant” for right-handedness, Aa is “heterozygous dominant” (or “hybrid”) for right-handedness, and aa is “homozygous recessive” for left-handedness. However, reproductive cells (“gametes”: egg and sperm cells) only have 23 chromosomes, thus a single copy of each gene (“haploid”). When male and female parents reproduce, the “zygote” receives one gene copy – either A or a – from each parental gamete, restoring diploidy in the offspring. With two traits, say handedness and eye color (B = brown, b = blue), there are nine possible genotypes: AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb, resulting in four possible phenotypes. (AaBb is known as a “dihybrid.”)

According to Mendel’s Law of Independent Assortment, segregation of the alleles of one allelic pair during gamete formation is independent of the segregation of the alleles of another allelic pair. Therefore, a homozygous dominant parent AABB has gametes AB, and a homozygous recessive parent aabb has gametes ab; crossing them consequently results in all dihybrid AaBb offspring in the so-called F1 (or “first filial”) generation, having gametes AB, Ab, aB, and ab, as shown below.

[Figure: Parental Genotypes AABB × aabb → Parental Gametes AB and ab → F1 Genotype AaBb → F1 Gametes AB, Ab, aB, ab.]


It follows that further crossing two such AaBb genotypes results in expected genotype frequencies in the F2 (“second filial”) generation that follow a 9:3:3:1 ratio, shown in the 4 × 4 Punnett square below.

  Phenotypes                          Expected Frequencies
  1 = Right-handed, Brown-eyed        9/16 = 0.5625
  2 = Right-handed, Blue-eyed         3/16 = 0.1875
  3 = Left-handed, Brown-eyed         3/16 = 0.1875
  4 = Left-handed, Blue-eyed          1/16 = 0.0625

For example, in a random sample of n = 400 such individuals, the expected phenotypic values under the null hypothesis H0: π1 = 0.5625, π2 = 0.1875, π3 = 0.1875, π4 = 0.0625 are as follows.

  Expected Values
      1       2       3       4
    225      75      75      25        n = 400

These would be compared with the observed values, say

  Observed Values
      1       2       3       4
    234      67      81      18        n = 400

via the Chi-squared Goodness-of-Fit Test:

  Χ² = (+9)²/225 + (−8)²/75 + (+6)²/75 + (−7)²/25 = 3.653 on df = 3.

Because this is less than the .05 Chi-squared score of 7.815, the p-value is greater than .05 (its exact value = 0.301), and hence the data provide evidence in support of the 9:3:3:1 ratio in the null hypothesis, at the α = .05 significance level. If this model had been rejected however, then this would suggest a possible violation of the original assumption of independent assortment of allelic pairs. This is indeed the case in genetic linkage, where the two genes are located in close proximity to one another on the same chromosome. If two alleles A and a occur with respective frequencies p and q (= 1 − p) in a population, then observed genotype frequencies can be compared with those expected from the Hardy-Weinberg Law (namely p² for AA, 2pq for Aa, and q² for aa) via a similar Chi-squared Test.
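A minimal check in R, using the built-in chisq.test() with the hypothesized 9:3:3:1 probabilities:

obs <- c(234, 67, 81, 18)
chisq.test(obs, p = c(9, 3, 3, 1) / 16)
# X-squared ≈ 3.653, df = 3, p-value ≈ 0.301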

  F2 Genotypes                              Female Gametes
  Male Gametes          AB          Ab          aB          ab
       AB            AABB (1)    AABb (1)    AaBB (1)    AaBb (1)
       Ab            AABb (1)    AAbb (2)    AaBb (1)    Aabb (2)
       aB            AaBB (1)    AaBb (1)    aaBB (3)    aaBb (3)
       ab            AaBb (1)    Aabb (2)    aaBb (3)    aabb (4)

(The number in each cell indicates the resulting phenotype, 1–4, as listed above.)


[Figure: F-distribution density curves for ν1 = 20 and ν2 = 5, 10, 20, 30, 40.]

§ 6.3.2 Variances

Consider k independent, normally-distributed groups X1 ~ N(µ1, σ1), X2 ~ N(µ2, σ2), …, Xk ~ N(µk, σk). We wish to conduct a formal test for equivariance, or homogeneity of variances.

Null Hypothesis H0: σ1² = σ2² = σ3² = … = σk²
versus
Alternative Hypothesis HA: At least one of the σi² is different from the others.

Formal test: Reject H0 if the F-statistic is significantly > 1.

Comments:
Other tests: Levene (see § 6.2.2), Hartley, Cochran, Bartlett, and Scheffé.
For what follows (ANOVA), moderate heterogeneity of variances is permissible, especially with large, approximately equal sample sizes n1, n2, …, nk. Hence this test is often not even performed in practice, unless the sample variances s1², s2², ..., sk² appear to be greatly unequal.

Test Statistic

F = s²max / s²min  ~  Fν1, ν2

where ν1 and ν2 are the corresponding numerator and denominator degrees of freedom, respectively.


[Figure: k treatment groups X1, X2, …, Xk, with standard deviations σ1, σ2, …, σk, under H0: µ1 = µ2 = … = µk.]

The “total variation” in this system can be decomposed into two disjoint sources:

 variation between the groups (via a “treatment” s² measure)

 variation within the groups (as measured by s²pooled).

If the former is significantly larger than the latter (i.e., if the ratio is significantly > 1), then there must be a genuine treatment effect, and the null hypothesis can be rejected.

§ 6.3.3 Means

Assume we have k independent, equivariant, normally-distributed groups X1 ~ N(µ1, σ1), X2 ~ N(µ2, σ2), …, Xk ~ N(µk, σk), e.g., corresponding to different treatments. We wish to compare the treatment means with each other in order to determine if there is a significant difference among any of the groups. Hence…

Null Hypothesis H0: µ1 = µ2 = … = µk ⇔ “There is no difference in treatment means, i.e., no treatment effect.”

vs.

Alternative Hypothesis HA: µi ≠ µj for some i ≠ j ⇔ “There is at least one treatment mean µi that is different from the others.”

Key Strategy

Recall (from the comment at the end of 2.3) that sample variance has the general form

    s² = Σ(xi − x̄)² / (n − 1) = Sum of Squares / degrees of freedom = SS / df.

That is, SS = (n − 1) s². Using this fact, the powerful technique of Analysis of Variance (ANOVA) separates the total variation of the system into its two disjoint sources (known as “partitioning sums of squares”), so that a formal test statistic can then be formulated, and a decision regarding the null hypothesis ultimately reached. However, in order to apply this, it is necessary to make the additional assumption of equivariance, i.e., σ1² = σ2² = … = σk², testable using the methods of the preceding section.


[Figure 1: t4 density curve, showing the two-sided .05 critical region (central area 0.95, tail areas 0.025 each) and the tail areas of 0.0043 beyond ±4.8.]

Example: For simplicity, take k = 2 balanced samples, say of size n1 = 3 and n2 = 3, from two independent, normally distributed populations:

    X1: {x11, x12, x13} = {50, 53, 71}        X2: {x21, x22, x23} = {1, 4, 25}

The null hypothesis H0: µ1 = µ2 is to be tested against the alternative HA: µ1 ≠ µ2 at the α = .05 level of significance, as usual. In this case, the difference in magnitudes between the two samples appears to be sufficiently substantial that significance seems evident, despite the small sample sizes.

The following summary statistics are an elementary exercise:

    x̄1 = 58        x̄2 = 10

    s1² = 129      s2² = 171

Therefore,

    s²pooled = [(3 − 1)(129) + (3 − 1)(171)] / [(3 − 1) + (3 − 1)] = 600/4 = 150.   ← SSError / dfError

We are now in a position to carry out formal testing of the null hypothesis.

Method 1. (Old way: two-sample t-test) In order to use the t-test, we must first verify equivariance σ1² = σ2². The computed sample variances of 129 and 171 are certainly sufficiently close that this condition is reasonably satisfied. (Or, check that the ratio 129/171 is between 0.25 and 4.) Now, recall from the formula for standard error that:

    s.e. = √[150 (1/3 + 1/3)] = 10.   Hence,

    p-value = 2 P(X̄1 – X̄2 ≥ 48) = 2 P(T4 ≥ (48 − 0)/10) = 2 P(T4 ≥ 4.8) = 2(.0043) = .0086 < .05

so the null hypothesis is (strongly) rejected; a significant difference exists at this level.

Also, the grand mean is calculated as:

    x̄ = (50 + 53 + 71 + 1 + 4 + 25) / (3 + 3) = [3(58) + 3(10)] / 6 = 34.


Method 2. (New way: ANOVA F-test) We first calculate three “Sums of Squares (SS)” that measure the variation of the system and its two component sources, along with their associated degrees of freedom (df).

1. Total Sum of Squares = sum of the squared deviations of each observation xij from the grand mean x̄.

    SSTotal = (50 – 34)² + (53 – 34)² + (71 – 34)² + (1 – 34)² + (4 – 34)² + (25 – 34)² = 4056

    dfTotal = (3 + 3) – 1 = 5

2. Treatment Sum of Squares = sum of the squared deviations of each group mean x̄i from the grand mean x̄.

Motivation: In order to measure pure treatment effect, imagine two ideal groups with no “within group” variation, i.e., replace each sample value by its sample mean x̄i:

    X1′: {58, 58, 58}        X2′: {10, 10, 10}

    SSTrt = (58 – 34)² + (58 – 34)² + (58 – 34)² + (10 – 34)² + (10 – 34)² + (10 – 34)²
          = 3 (58 – 34)² + 3 (10 – 34)² = 3456

    dfTrt = 1

Reason: As with any deviations, these must satisfy a single constraint: namely, their sum = 3(58 – 34) + 3(10 – 34) = 0. Hence their degrees of freedom = one less than the number of treatment groups (k = 2).

3. Error Sum of Squares = sum of the squared deviations of each observation xij from its group mean x̄i.

    SSError = (50 – 58)² + (53 – 58)² + (71 – 58)² + (1 – 10)² + (4 – 10)² + (25 – 10)² = 600

    dfError = (3 – 1) + (3 – 1) = 4

These quantities partition the total variation: SSTotal = SSTrt + SSError, and dfTotal = dfTrt + dfError.


[Figure 2: F1, 4 density curve, showing the test statistic 23.04 and its p-value .0086.]

ANOVA Table (F1, 4 distribution)

    Source       df    SS      MS = SS/df             F = MSTrt/MSErr    p-value
    Treatment     1    3456    3456 ( = s²between)    23.04              .0086
    Error         4     600     150 ( = s²within)
    Total         5    4056      −

The F1, 4-score of 23.04 is certainly much greater than 1 (the expected value under the null hypothesis of no treatment difference), and is in fact greater than 7.71, the F1, 4 critical value for α = .05. Hence the small p-value, and significance is established.

In fact, the ratio SSTrt / SSTotal = 3456/4056 = 0.852 indicates that 85.2% of the total variation in response is due to the treatment effect!

Comment: Note that 23.04 = (± 4.8)², i.e., F1, 4 = t4². In general, F1, df = tdf² for any df. Hence the two tests are mathematically equivalent to each other. Compare Figs 1 and 2.
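A minimal R sketch of this example, showing both the pooled two-sample t-test and the equivalent one-way ANOVA, is given below; both return p = .0086.

    x1 <- c(50, 53, 71)
    x2 <- c(1, 4, 25)
    t.test(x1, x2, var.equal = TRUE)            # pooled t-test: t = 4.8 on df = 4
    dat <- data.frame(y = c(x1, x2),
                      group = factor(rep(c("X1", "X2"), each = 3)))
    summary(aov(y ~ group, data = dat))         # ANOVA: F = 23.04 on (1, 4) df
    # Note: 4.8^2 = 23.04, illustrating F(1, df) = t(df)^2.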


[Figure: Fk − 1, n − k density curve, with the observed F-statistic and its p-value (right-tail area).]

General ANOVA formulation

Consider now the general case of k independent, normally-distributed, equivariant groups.

    Treatment Groups     X1 ~ N(µ1, σ1)    X2 ~ N(µ2, σ2)    …    Xk ~ N(µk, σk)
    Sample Sizes         n1 + n2 + … + nk = n
    Group Means          x̄1    x̄2    …    x̄k
    Group Variances      s1²    s2²    …    sk²

    Grand Mean           x̄ = (n1 x̄1 + n2 x̄2 + … + nk x̄k) / n

    Pooled Variance      s²within = [(n1 − 1) s1² + (n2 − 1) s2² + … + (nk − 1) sk²] / (n − k)

    Source       df       SS                              MS           F-statistic      p-value
    Treatment    k − 1    Σ (i = 1..k) ni (x̄i − x̄)²       s²between    Fk − 1, n − k    0 ≤ p ≤ 1
    Error        n − k    Σ (i = 1..k) (ni − 1) si²        s²within
    Total        n − 1    Σ (all i, j) (xij − x̄)²           −

Comments:

This is referred to as the overall F-test of significance. If the null hypothesis is rejected, then (the mean value of at least) one of the treatment groups is different from the others. But which one(s)?

 Nonparametric form of ANOVA: Kruskal-Wallis Test (a brief R sketch appears below)

Appendix > Geometric Viewpoint > ANOVA

Null Hypothesis H0: µ1 = µ2 = … = µk ⇔ “No treatment difference exists.”

Alternative Hyp. HA: µi ≠ µj for some i ≠ j ⇔ “A treatment difference exists.”
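A rough sketch (not from the notes) of the Kruskal-Wallis test in R, applied to the small two-group data set from the preceding example:

    y     <- c(50, 53, 71, 1, 4, 25)
    group <- factor(rep(c("X1", "X2"), each = 3))
    kruskal.test(y ~ group)                     # nonparametric alternative to the overall F-test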


Multiple Comparison Procedures

How do we decide which groups (if any) are significantly different from the others? Pairwise comparisons between the means of any two individual groups can be t-tested. But how do we decide which pairs to test, and why should it matter?

A Priori Analysis (Planned Comparisons – before any data are collected)

Investigator wishes to perform a fixed number m of specific pairwise t-test comparisons between groups of interest, chosen for scientific or other theoretical reasons.

Example: Group 1 = control, and each experimental group 2, …, k is to be compared with it separately (e.g., testing mean annual crop yields of different seed types, against a standard seed type). Then there are m = k – 1 pairwise comparisons, with corresponding null hypotheses H0: µ1 = µ2, H0: µ1 = µ3, H0: µ1 = µ4, …, H0: µ1 = µk.

A Posteriori (or Post Hoc) Analysis (Unplanned Comparisons – after data are collected)

“Data-mining,” “data-dredging,” “fishing expedition,” etc. Unlike above, should be used only if the ANOVA overall F-test is significant.

Example: Suppose it is decided to compare all possible pairs among Groups 1, …, k, i.e., H0: µi = µj for all i ≠ j. Then there will be m = (k choose 2) = k (k − 1)/2 such t-tests. For example, if k = 5 groups, then m = (5 choose 2) = 10 pairwise comparisons.

Though computationally intensive perhaps, these t-tests pose no problem for a computer.

However…

[Figure: k treatment groups X1, X2, …, Xk (standard deviations σ1, σ2, …, σk) under H0: µ1 = µ2 = … = µk, with pairwise t-tests drawn between the groups.]


With a large number m of such comparisons, there is an increased probability of finding a spurious significance (i.e., making a Type 1 error) between two groups, just by chance.

Exercise: Show that this probability = 1 − (1 − α)^m, which goes to 1 as m gets large. The graph for α = .05 is shown below. (Also see Problem 3-21, The Shell Game.)

In m t-test comparisons, the probability of finding at least one significant p-value at the α = .05 level is 1 – (.95)^m, which approaches certainty as m grows. Note that if m = 14, this probability is already greater than 50%.
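A minimal R sketch of this calculation (and of the graph described above) for α = .05:

    m <- 1:100
    familywise <- 1 - (1 - 0.05)^m              # P(at least one spurious significance in m tests)
    plot(m, familywise, type = "l",
         xlab = "number of comparisons m", ylab = "P(at least one Type 1 error)")
    familywise[14]                              # = 0.512, already greater than 50% at m = 14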

How do we reduce this risk? Various methods exist…

 Bonferroni correction – Lower the significance level of each t-test from α to α* = α/m.

   (But use the overall ANOVA MSError term for s²pooled.)

   Example: As above, if α = .05 and m = 10 t-tests, then make α* = .05/10 = .005 for each.

The overall Type 1 error rate α remains unchanged.

Each individual t-test is more conservative, hence less chance of spurious rejection.

However, Bonferroni correction can be overly conservative, failing to reject differences known to be statistically significant, e.g., via the ANOVA overall F-test. A common remedy for this is the Holm-Bonferroni correction, in which the α* values are allowed to become slightly larger (i.e., less conservative) with each successive t-test.
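A rough sketch (not from the notes) of applying these corrections in R; the raw p-values below are hypothetical, purely for illustration.

    raw.p <- c(0.001, 0.004, 0.012, 0.020, 0.031, 0.040, 0.055, 0.130, 0.440, 0.810)   # m = 10 made-up p-values
    p.adjust(raw.p, method = "bonferroni")      # multiplies each p-value by m (capped at 1)
    p.adjust(raw.p, method = "holm")            # Holm-Bonferroni step-down adjustment
    # Equivalently, under plain Bonferroni, compare each raw p-value against alpha* = .05/10 = .005.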

Other methods include: • Fisher’s Least Significant Difference (LSD) Test • Tukey’s Honest Significant Difference (HSD) Test • Newman-Keuls Test


Without being aware of this phenomenon, a researcher might be tempted to report a random finding as being evidence of a genuine statistical significance, when in fact it might simply be an artifact of conducting a large number of individual experiments. Such a result should be regarded as the starting point of more rigorous investigation… Famous example ~ Case-control study involving coffee and pancreatic cancer

[Photo caption: Former chair, Harvard School of Public Health]


The findings make it to the media…

First public reaction:

PANIC?!! Do we have to stop drinking coffee??

Second public reaction: Hold on… Coffee has been around for a long time, and so have cancer studies. This is the first time any connection like this has ever been reported. I’ll keep it in mind, but let’s just wait and see…


Scientific doubts are very quickly raised….

Many sources of BIAS exist, including (but not limited to):

• For convenience, cases (with pancreatic cancer) were chosen from a group of patients hospitalized by the same physicians who had diagnosed and hospitalized the controls (with non-cancerous diseases of the digestive system). Therefore, investigators who interviewed patients about their coffee consumption history knew in advance who did or did not have pancreatic cancer, possibly introducing unintentional selection bias.

• Also, either on their own, or on advice from their physicians, patients with noncancerous gastrointestinal illness frequently stop drinking coffee, thereby lowering the proportion of coffee drinkers in the control group relative to the cancer cases with whom they are compared.

• Investigators were “fishing” for any association between pancreatic cancer and multiple possible risk factors – including coffee, tea, alcohol, pipe smoking, and cigar smoking (while adjusting for cigarette smoking history, since this is a known confounding variable for pancreatic cancer) – but they did not Bonferroni correct!

• Publication bias: Many professional research journals prefer only to publish articles that result in “positive” (i.e., statistically significant) study outcomes, rather than “negative” ones. (This may be changing, somewhat.)

For more info, see http://www.stat.wisc.edu/~ifischer/Intro_Stat/Lecture_Notes/6_-_Statistical_Inference/BIAS.pdf and on its source website http://www.medicine.mcgill.ca/epidemiology/pai/teaching.htm.


Results could not be replicated by others, including the original investigators, in subsequent studies. Eventual consensus: No association. Moral: You can’t paint a bull’s-eye around an arrow, after it’s been fired at a target.

To date, no association has been found between coffee and pancreatic cancer, or any other life-threatening medical illness.

“Coffee is a substance in search of a disease.”

– Old adage


6.4 Problems

NOTE: Before starting these problems, it might be useful to review pages 1.3-1 and 2.1-1.

1. Suppose that a random sample of n = 102 children is selected from the population of newborn infants in Mexico. The probability that a child in this population weighs at most 2500 grams is presumed to be π = 0.15. Calculate the probability that thirteen or fewer of the infants weigh at most 2500 grams, using… (a) the exact binomial distribution (Tip: Use the function pbinom in R),

(b) the normal approximation to the binomial distribution (with continuity correction).

Suppose we wish to test the null hypothesis H0: π = 0.15 versus the alternative HA: π ≠ 0.15, and that in this random sample of n = 102 children, we find thirteen whose weights are under 2500 grams. Use this information to decide whether or not to reject H0 at the α = .05 significance level, and interpret your conclusion in context.

(c) Calculate the p-value, using the “normal approximation to the binomial” with continuity correction. (Hint: See (b).) Also compute the 95% confidence interval.

(d) Calculate the exact p-value, via the function binom.test in R.

2. A new “smart pill” is tested on n = 36 individuals randomly sampled from a certain population

whose IQ scores are known to be normally distributed, with mean µ = 100 and standard deviation σ = 27. After treatment, the sample mean IQ score is calculated to be x = 109.9, and a two-sided test of the null hypothesis H0: µ = 100 versus the alternative hypothesis HA: µ ≠ 100 is performed, to see if there is any statistically significant difference from the mean IQ score of the original population. Using this information, answer the following.

(a) Calculate the p-value of the sample.

(b) Fill in the following table, concluding with the decision either to reject or not reject the null

hypothesis H0 at the given significance level α.

    Significance Level α    Confidence Level 1 − α    Confidence Interval    Decision about H0
    .10
    .05
    .01

(c) Extend these observations to more general circumstances. Namely, as the significance level decreases, what happens to the ability to reject a null hypothesis? Explain why this is so, in terms of the p-value and generated confidence intervals.


3. Consider the distribution of serum cholesterol levels for all 20- to 74-year-old males living in the United States. The mean of this population is 211 mg/dL, and the standard deviation is 46.0 mg/dL. In a study of a subpopulation of such males who smoke and are hypertensive, it is assumed (not unreasonably) that the distribution of serum cholesterol levels is normally distributed, with unknown mean µ, but with the same standard deviation σ as the original population.

(a) Formulate the null hypothesis and complementary alternative hypothesis, for testing whether the unknown mean serum cholesterol level µ of the subpopulation of hypertensive male smokers is equal to the known mean serum cholesterol level of 211 mg/dL of the general population of 20- to 74-year-old males.

(b) In the study, a random sample of size n = 12 hypertensive smokers was selected, and found to have a sample mean cholesterol level of x = 217 mg/dL. Construct a 95% confidence interval for the true mean cholesterol level of this subpopulation.

(c) Calculate the p-value of this sample, at the α = .05 significance level.

(d) Based on your answers in parts (b) and (c), is the null hypothesis rejected in favor of the alternative hypothesis, at the α = .05 significance level? Interpret your conclusion: What exactly has been demonstrated, based on the empirical evidence?

(e) Determine the 95% acceptance region and complementary rejection region for the null hypothesis. Is this consistent with your findings in part (d)? Why?

4. Consider a random sample of ten children selected from a population of infants receiving antacids that contain aluminum, in order to treat peptic or digestive disorders. The distribution of plasma aluminum levels is known to be approximately normal; however its mean µ and standard deviation σ are not known. The mean aluminum level for the sample of n = 10 infants is found to be x = 37.20 µg/l and the sample standard deviation is s = 7.13 µg/l. Furthermore, the mean plasma aluminum level for the population of infants not receiving antacids is known to be only 4.13 µg/l.

(a) Formulate the null hypothesis and complementary alternative hypothesis, for a two-sided test of whether the mean plasma aluminum level of the population of infants receiving antacids is equal to the mean plasma aluminum level of the population of infants not receiving antacids.

(b) Construct a 95% confidence interval for the true mean plasma aluminum level of the population of infants receiving antacids.

(c) Calculate the p-value of this sample (as best as possible), at the α = .05 significance level.

(d) Based on your answers in parts (b) and (c), is the null hypothesis rejected in favor of the alternative hypothesis, at the α = .05 significance level? Interpret your conclusion: What exactly has been demonstrated, based on the empirical evidence?

(e) With the knowledge that significantly elevated plasma aluminum levels are toxic to human beings, reformulate the null hypothesis and complementary alternative hypothesis, for the appropriate one-sided test of the mean plasma aluminum levels. With the same sample data as above, how does the new p-value compare with that found in part (c), and what is the resulting conclusion and interpretation?


5. Refer to Problem 4.4/2.

(a) Suppose we wish to formally test the null hypothesis H0: µ = 25 against the alternative HA: µ ≠ 25, at the .05α = significance level, by using the random sample of n = 80 given.

Calculate the p-value, and verify that in fact, this sample leads to an incorrect conclusion.

[[Hint: Use the Central Limit Theorem to approximate the sampling distribution of X̄ with the normal distribution N(µ, σ/√n).]] Which type of error (Type I or Type II) is committed here, and why?

(b) Now suppose we wish to formally test the null hypothesis H0: µ = 27 against the specific alternative HA: µ = 25, at the .05α = significance level, using the same random sample of n = 80 trials.

How much power exists (i.e., what is the probability) of inferring the correct conclusion?

Calculate the p-value, and verify that, once again, this sample in fact leads to an incorrect

conclusion. [[Use the same hint as in part (a).]] Which type of error (Type I or Type II) is committed here, and why?

6. Two physicians are having a disagreement about the effectiveness of chicken soup in relieving

common cold symptoms. While both agree that the number of symptomatic days generally follows a normal distribution, physician A claims that most colds last about a week; chicken soup makes no difference, whereas physician B argues that it does. They decide to settle the matter by performing a formal two-sided test of the null hypothesis H0: µ = 7 days, versus the alternative HA: µ ≠ 7 days.

(a) After treating a random sample of n = 16 cold patients with chicken soup, they calculate a mean number of symptomatic days x = 5.5, and standard deviation s = 3.0 days. Using either the 95% confidence interval or the p-value (or both), verify that the null hypothesis cannot be rejected at the α = .05 significance level.

(b) Physician A is delighted, but can predict physician B’s rebuttal: “The sample size was too small! There wasn’t enough power to detect a statistically significant difference between µ = 7 days, and say µ = 5 days, even if there was one present!” Calculate the minimum sample size required in order to achieve at least 99% power of detecting such a genuine difference, if indeed one actually exists. (Note: Use s to estimate σ.)

(c) Suppose that, after treating a random sample of n = 49 patients, they calculate the mean number of symptomatic days x = 5.5 (as before), and standard deviation s = 2.8 days. Using either the 95% confidence interval or the p-value (or both), verify that the null hypothesis can now be rejected at the α = .05 significance level.

FYI: The long-claimed ability of chicken soup – sometimes referred to as “Jewish penicillin” – to combat colds has actually been the subject of several well-known published studies, starting with a 1978 seminal paper written by researchers at Mount Sinai Hospital in NYC. The heat does serve to break up chest congestion, but it turns out that there are many other surprising cold-fighting benefits, far beyond just that. “Who knew?” Evidently… Mama. See http://well.blogs.nytimes.com/2007/10/12/the-science-of-chicken-soup/.


7. Toxicity Testing. [Tip: See page 6.1-28] According to the EPA (Environmental Protection Agency), drinking water can contain no more than 10 ppb (parts per billion) of arsenic, in order to be considered safe for human consumption. This is known as the Maximum Contaminant Level (MCL).

Suppose that the concentration X of arsenic in a typical water source is known to be normally distributed, with an unknown mean µ and standard deviation σ. A random sample of n = 121 independent measurements is to be taken, from which the sample mean x̄ and sample standard deviation s are calculated, and used in formal hypothesis testing. The following sample data for four water sources are obtained:

• Source 1: x̄ = 11.43 ppb, s = 5.5 ppb

• Source 2: x̄ = 8.57 ppb, s = 5.5 ppb

• Source 3: x̄ = 9.10 ppb, s = 5.5 ppb

• Source 4: x̄ = 10.90 ppb, s = 5.5 ppb

(a) For each water source, answer the following questions to test the null hypothesis H0: µ = 10 ppb, vs. the two-sided alternative hypothesis HA: µ ≠ 10 ppb, at the α = .05 significance level.

(i) Just by intuitive inspection, i.e., without first conducting any formal calculations, does this sample mean suggest that the water might be safe, or unsafe, to drink? Why??

(ii) Calculate the p-value of this sample (to the closest entries of the appropriate table), and use it to draw a formal conclusion about whether or not the null hypothesis can be rejected in favor of the alternative, at the α = .05 significance level.

(iii) Interpret: According to your findings, is the result statistically significant? That is… Is the water unsafe to drink? Does this agree with your informal reasoning in (i)?

(b) For the hypothesis test in (a), what is the two-sided 5% rejection region for this H0? Is it consistent with your findings?

(c) One-sided hypothesis tests can be justifiably used in some contexts, such as situations where one direction (either ≤ or ≥) is impossible (for example, a human knee cannot flex backwards), or irrelevant, as in “toxicity testing” here. We are really not concerned if the mean is significantly below 10 ppb, only above. With this in mind, repeat the instructions in (a) above, to test the left-sided null hypothesis H0: µ ≤ 10 ppb (i.e., safe) versus the right-sided alternative HA: µ > 10 ppb (i.e., unsafe) at the α = .05 significance level.

(d) Suppose a fifth water source yields x̄ = 10.6445 ppb and s = 5.5 ppb. Repeat part (c).

(e) For the hypothesis test in (c), what is the exact cutoff ppb level for x̄, above which we can conclude that the water is unsafe? (Compare Sources 4 and 5, for example.) That is, what is the one-sided 5% rejection region for this H0? Is it consistent with your findings?

(f) Summarize these results, and make some general conclusions regarding advantages and disadvantages of using a one-sided test, versus a two-sided test, in this context. [Hint: Compare the practical results in (a) and (c) for Source 4, for example.]


8. Do the Exercise on page 6.1-20.

9.

(a) In R, type the following command to generate a data set called “x” of 1000 random values.

x = rf(1000, 5, 20)

Obtain a graph of its frequency histogram by typing hist(x). Include this graph as part of your submitted homework assignment. (Do not include the 1000 data values!)

Next construct a “normal q-q plot” by typing qqnorm(x, pch = 19). Include this plot as part of your submitted homework assignment.

(b) Now define a new data set called “y” by taking the (natural) logarithm of x.

y = log(x)

Obtain a graph of its frequency histogram by typing hist(y). Include this graph as part of your submitted homework assignment. (Do not include the 1000 data values!)

Then construct a “normal q-q plot” by typing qqnorm(y, pch = 19). Include this plot as part of your submitted homework assignment.

(c) Summarize the results in (a) and (b). In particular, from their respective histograms and q-q plots, what general observation can be made regarding the distributions of x and y = log(x)? (Hint: See pages 6.1-25 through 6.1-27.)


10. In this problem, assume that population cholesterol level is normally distributed.

(a) Consider a small clinical trial, designed to measure the efficacy of a new cholesterol-lowering drug against a placebo. A group of six high-cholesterol patients is randomized to either a treatment arm or a control arm, resulting in two numerically balanced samples of n1 = n2 = 3 patients each, in order to test the null hypothesis H0: µ1 = µ2 vs. the alternative HA: µ1 ≠ µ2. Suppose that the data below are obtained.

    Placebo    Drug
      220       180
      240       200
      290       220

Obtain the 95% confidence interval for µ1 − µ2, and the p-value of the data, and use each to decide whether or not to reject H0 at the α = .05 significance level. Conclusion?

(b) Now imagine that the same drug is tested using another pilot study, with a different design.

Serum cholesterol levels of n = 3 patients are measured at the beginning of the study, then re-measured after a six month treatment period on the drug, in order to test the null hypothesis H0: µ1 = µ2 versus the alternative HA: µ1 ≠ µ2. Suppose that the data below are obtained.

Obtain the 95% confidence interval for µ1 − µ2, and the p-value of the data, and use each to decide whether or not to reject H0 at the α = .05 significance level. Conclusion?

(c) Compare and contrast these two study designs and their results. (d) Redo (a) and (b) using R (see hint). Show agreement between your answers and the output.

(Data for part (b))

    Baseline    End of Study
      220           180
      240           200
      290           220


11. In order to determine whether children with cystic fibrosis have a normal level of iron in their blood on average, a study is performed to detect any significant difference in mean serum iron levels between this population and the population of healthy children, both of which are approximately normally distributed with unknown standard deviations. A random sample of n1 = 9 healthy children has mean serum iron level x̄1 = 18.9 µmol/l and standard deviation s1 = 5.9 µmol/l; a sample of n2 = 13 children with cystic fibrosis has mean serum iron level x̄2 = 11.9 µmol/l and standard deviation s2 = 6.3 µmol/l.

(a) Formulate the null hypothesis and complementary alternative hypothesis, for testing whether the mean serum iron level µ1 of the population of healthy children is equal to the mean serum iron level µ2 of children with cystic fibrosis.

(b) Construct the 95% confidence interval for the mean serum iron level difference µ1 − µ2.

(c) Calculate the p-value for this experiment, under the null hypothesis.

(d) Based on your answers in parts (b) and (c), is the null hypothesis rejected in favor of the alternative hypothesis, at the α = .05 significance level? Interpret your conclusion: What exactly has been demonstrated, based on the sample evidence?

12. Methylphenidate is a drug that is widely used in the treatment of attention deficit disorder (ADD). As part of a crossover study, ten children between the ages of 7 and 12 who suffered from this disorder were assigned to receive the drug and ten were given a placebo. After a fixed period of time, treatment was withdrawn from all 20 children and, after a “washout period” of no treatment for either group, subsequently resumed after switching the treatments between the two groups. Measures of each child’s attention and behavioral status, both on the drug and on the placebo, were obtained using an instrument called the Parent Rating Scale. Distributions of these scores are approximately normal with unknown means and standard deviations. In general, lower scores indicate an increase in attention. It is found that the random sample of n = 20 children enrolled in the study has a sample mean attention rating score of x̄methyl = 10.8 and standard deviation smethyl = 2.9 when taking methylphenidate, and mean rating score x̄placebo = 14.0 and standard deviation splacebo = 4.8 when taking the placebo.

(a) Calculate the 95% confidence interval for µplacebo, the mean attention rating score of the population of children taking the placebo.

(b) Calculate the 95% confidence interval for µmethyl, the mean attention rating score of the population of children taking the drug.

(c) Comparing these two confidence intervals side-by-side, develop an informal conclusion about the efficacy of methylphenidate, based on this experiment. Why can this not be used as a formal test of the hypothesis H0: µplacebo = µmethyl, vs. the alternative HA: µplacebo ≠ µmethyl, at the α = .05 significance level? (Hint: See next problem.)


13. A formal hypothesis test for two-sample means using the confidence interval for µ1 − µ2 is generally NOT equivalent to an informal side-by-side comparison of the individual confidence intervals for µ1 and µ2 for detecting overlap between them.

(a) Suppose that two population random variables X1 and X2 are normally distributed, each with standard deviation σ = 50. We wish to test the null hypothesis H0: µ1 = µ2 versus the alternative HA: µ1 ≠ µ2, at the α = .05 significance level. Two independent, random samples are selected, each of size n = 100, and it is found that the corresponding means are x̄1 = 215 and x̄2 = 200, respectively. Show that even though the two individual 95% confidence intervals for µ1 and µ2 overlap, the formal 95% confidence interval for the mean difference µ1 − µ2 does not contain the value 0, and hence the null hypothesis can be rejected. (See middle figure below.)

(b) In general, suppose that X1 ~ N(µ1, σ) and X2 ~ N(µ2, σ), with equal σ (for simplicity). In order to test the null hypothesis H0: µ1 = µ2 versus the two-sided alternative HA: µ1 ≠ µ2 at the α significance level, two random samples are selected, each of the same size n (for simplicity), resulting in corresponding means x̄1 and x̄2, respectively. Let CIµ1 and CIµ2 be the respective 100(1 − α)% confidence intervals, and let

    d = |x̄1 − x̄2| / (zα/2 σ/√n).

(Note that the denominator is simply the margin of error for the individual confidence intervals.) Also let CIµ1−µ2 be the 100(1 − α)% confidence interval for the true mean difference µ1 − µ2. Prove:

 If d < √2, then 0 ∈ CIµ1−µ2 (i.e., “accept” H0), and CIµ1 ∩ CIµ2 ≠ ∅ (i.e., overlap).

 If √2 < d < 2, then 0 ∉ CIµ1−µ2 (i.e., reject H0), but CIµ1 ∩ CIµ2 ≠ ∅ (i.e., overlap)!

 If d > 2, then 0 ∉ CIµ1−µ2 (i.e., reject H0), and CIµ1 ∩ CIµ2 = ∅ (i.e., no overlap).

[Figure: three panels illustrating the three cases above; each shows the individual confidence intervals centered at x̄1 and x̄2, and the confidence interval for x̄1 − x̄2 relative to 0.]


14. Z-tests and Chi-squared Tests

(a) Test of Independence (1 population, 2 random responses). Imagine that a marketing research study surveys a random sample of n = 2000 consumers about their responses regarding two brands (A and B) of a certain product, with the following observed results.

                                    Do You Like Brand B?
                                      Yes        No
    Do You Like Brand A?    Yes       335        915      1250
                            No        165        585       750
                                      500       1500      2000

First consider the null hypothesis H0: πA|B = πA|Bᶜ, that is, in this consumer population,

“The probability of liking A, given that B is liked, is equal to the probability of liking A, given that B is not liked.” ⇔ “There is no association between liking A and liking B.” ⇔ “Liking A and liking B are independent of each other.”

[Why? See Problem 3.5/22(a).]

Calculate the point estimate π̂A|B − π̂A|Bᶜ. Determine the Z-score of this sample (and thus whether or not H0 is rejected at α = .05). Conclusion?

Now consider the null hypothesis H0: πB|A = πB|Aᶜ, that is, in this consumer population,

“The probability of liking B, given that A is liked, is equal to the probability of liking B, given that A is not liked.”

⇔ “There is no association between liking B and liking A.”

⇔ “Liking B and liking A are independent of each other.”

Calculate the point estimate π̂B|A − π̂B|Aᶜ. Determine the Z-score of this sample (and thus whether or not H0 is rejected at α = .05). How does it compare with the previous Z-score? Conclusion?

Compute the Chi-squared score. How does it compare with the preceding Z-scores? Conclusion?


(b) Test of Homogeneity (2 populations, 1 random response). Suppose that, for the sake of simplicity, the same data are obtained in a survey that compares the probability π of liking Brand A between two populations.

                                    City 1     City 2
    Do You Like Brand A?    Yes       335        915      1250
                            No        165        585       750
                                      500       1500      2000

Here, the null hypothesis is H0: πA | City 1 = πA | City 2, that is,

“The probability of liking A in the City 1 population is equal to the probability of liking A in the City 2 population.”

⇔ “City 1 and City 2 populations are homogeneous with respect to liking A.”

⇔ “There is no association between city and liking A.”

How do the corresponding Z and Chi-squared test statistics compare with those in (a)? Conclusion?


15. Consider the following 2 × 2 contingency table taken from a retrospective case-control study that investigates the proportion of diabetes sufferers among acute myocardial infarction (heart attack) victims in the Navajo population residing in the United States.

                             MI
                      Yes        No       Total
    Diabetes   Yes     46        25         71
               No      98       119        217
    Total             144       144        288

(a) Conduct a Chi-squared Test for the null hypothesis H0: π Diabetes | MI = π Diabetes | No MI versus the alternative HA: π Diabetes | MI ≠ π Diabetes | No MI. Determine whether or not we can reject the null hypothesis at the α = .01 significance level. Interpret your conclusion: At the α = .01 significance level, what exactly has been demonstrated about the proportion of diabetics among the two categories of heart disease in this population?

(b) In the study design above, the 144 victims of myocardial infarction (cases) and the 144

individuals free of heart disease (controls) were actually age- and gender-matched. The members of each case-control pair were then asked whether they had ever been diagnosed with diabetes. Of the 46 individuals who had experienced MI and who were diabetic, it turned out that 9 were paired with diabetics and 37 with non-diabetics. Of the 98 individuals who had experienced MI but who were not diabetic, it turned out that 16 were paired with diabetics and 82 with non-diabetics. Therefore, each cell in the resulting 2 × 2 contingency table below corresponds to the combination of responses for age- and gender- matched case-control pairs, rather than individuals.

                                    MI
                         Diabetes    No Diabetes    Totals
    No MI   Diabetes         9            16           25
            No Diabetes     37            82          119
    Totals                  46            98          144

Conduct a McNemar Test for the null hypothesis H0: “The number of ‘diabetic, MI case’ -‘non-diabetic, non-MI control’ pairs, is equal to the number of ‘non-diabetic, MI case’ - ‘diabetic, non-MI control’ pairs, who have been matched on age and gender,” or more succinctly, H0: “There is no association between diabetes and myocardial infarction in the Navajo population, adjusting for age and gender.” Determine whether or not we can reject the null hypothesis at the α = .01 significance level. Interpret your conclusion: At the α = .01 significance level, what exactly has been demonstrated about the association between diabetes and myocardial infarction in this population?

(c) Why does the McNemar Test only consider discordant case-control pairs? Hint: What, if anything, would a concordant pair (i.e., either both individuals in a ‘MI case - No MI control’ pair are diabetic, or both are non-diabetic) reveal about a diabetes-MI association, and why?

(d) Redo this problem with R, using chisq.test and mcnemar.test.


16. The following data are taken from a study that attempts to determine whether the use of electronic fetal monitoring (“exposure”) during labor affects the frequency of caesarian section deliveries (“disease”). Of the 5824 infants included in the study, 2850 were electronically monitored during labor and 2974 were not. Results are displayed in the 2 × 2 contingency table below.

                              Caesarian Delivery
                          Yes        No        Totals
    EFM          Yes      358       2492        2850
    Exposure     No       229       2745        2974
    Totals                587       5237        5824

(a) Calculate a point estimate for the population odds ratio OR, and interpret.

(b) Compute a 95% confidence interval for the population odds ratio OR.

(c) Based on your answer in part (b), show that the null hypothesis H0: OR = 1 can be rejected in favor of the alternative HA: OR ≠ 1, at the α = .05 significance level. Interpret this conclusion: What exactly has been demonstrated about the association between electronic fetal monitoring and caesarian section delivery? Be precise.

(d) Does this imply that electronic monitoring somehow causes a caesarian delivery? Can the

association possibly be explained any other way? If so, how?


17. The following data come from two separate studies, both conducted in San Francisco, that investigate various risk factors for epithelial ovarian cancer.

    Study 1 — Disease Status
                                Cancer    No Cancer    Total
    Term          None             31          93        124
    Pregnancies   One or More      80         379        459
                  Total           111         472        583

    Study 2 — Disease Status
                                Cancer    No Cancer    Total
    Term          None             39          74        113
    Pregnancies   One or More     149         465        614
                  Total           188         539        727

(a) Compute point estimates ÔR1 and ÔR2 of the respective odds ratios OR1 and OR2 of the two studies, and interpret.

(b) In order to determine whether or not we may combine information from the two tables, it is

first necessary to conduct a Test of Homogeneity on the null hypothesis H0: OR1 = OR2, vs. the alternative HA: OR1 ≠ OR2, by performing the following steps.

Step 1: First, calculate l1 = ln(ÔR1) and l2 = ln(ÔR2), in the usual way.

Step 2: Next, using the definition of s.e. given in the notes, calculate the weights

    w1 = 1 / (s.e.1)²    and    w2 = 1 / (s.e.2)².

Step 3: Compute the weighted mean of l1 and l2:

    L = (w1 l1 + w2 l2) / (w1 + w2).

Step 4: Finally, calculate the test statistic

    Χ² = w1 (l1 − L)² + w2 (l2 − L)²,

which follows an approximate χ² distribution, with 1 degree of freedom.

Step 5: Use this information to show that the null hypothesis cannot be rejected at the α = .05 significance level, and that the information from the two tables may therefore be combined.

(c) Hence, calculate the Mantel-Haenszel estimate of the summary odds ratio:

    ÔRsummary = [(a1 d1 / n1) + (a2 d2 / n2)] / [(b1 c1 / n1) + (b2 c2 / n2)].


(d) To compute a 95% confidence interval for the summary odds ratio ORsummary, we must first verify that the sample sizes in the two studies are large enough to ensure that the method used is valid.

Step 1: Verify that the expected number of observations of the (i, j)th cell in the first table, plus

the expected number of observations of the corresponding (i, j)th cell in the second table, is greater than or equal to 5, for i = 1, 2 and j = 1, 2. Recall that the expected number of the (i, j)th cell is given by Ei j = Ri Cj / n.

Step 2: By its definition, the quantity L computed in part (b) is a weighted mean of log-odds ratios, and already represents a point estimate of ln(ORsummary). The estimated standard error of L is given by

    s.e.(L) = 1 / √(w1 + w2).

Step 3: From these two values in Step 2, construct a 95% confidence interval for ln(ORsummary), and exponentiate it to derive a 95% confidence interval for ORsummary itself.

(e) Also compute the value of the Chi-squared test statistic for ORsummary given at the end of § 6.2.3.

(f) Use the confidence interval in (d), and/or the χ1² statistic in (e), to perform a Test of Association of the null hypothesis H0: ORsummary = 1, versus the alternative HA: ORsummary ≠ 1, at the α = .05 significance level. Interpret your conclusion: What exactly has been demonstrated about the association between the number of term pregnancies and the odds of developing epithelial ovarian cancer? Be precise.

(g) Redo this problem in R, using the code found in the link below, and compare results.

http://www.stat.wisc.edu/~ifischer/Intro_Stat/Lecture_Notes/Rcode/


18. (a) Suppose a survey determines the political orientation of 60 men in a certain community:

Among these men, calculate the proportion belonging to each political category. Then show that a Chi-squared Test of the null hypothesis of equal proportions

H0: π Left | Men = π Mid | Men = π Right | Men

leads to its rejection at the α = .05 significance level. Conclusion?

(b) Suppose the survey also determines the political orientation of 540 women in the same community:

Among these women, calculate the proportion belonging to each political category. How do these proportions compare with those in (a)? Show that a Chi-squared Test of the null hypothesis of equal proportions

H0: π Left | Women = π Mid | Women = π Right | Women

leads to its rejection at the α = .05 significance level. Conclusion?

(c) Suppose the two survey results are combined:

Among the individuals in each gender (i.e., row), the proportion belonging to each political category (i.e., column) of course match those found in (a) and (b), respectively. Therefore, show that a Chi-squared Test of the null hypothesis of equal proportions

H0: π Left | Men = π Left | Women AND π Mid | Men = π Mid | Women AND π Right | Men = π Right | Women

leads to a 100% acceptance at the α = .05 significance level. Conclusion?

NOTE: The closely-resembling null hypothesis

H0: π Men | Left = π Women | Left AND π Men | Mid = π Women | Mid AND π Men | Right = π Women | Right

tests for equal proportions of men and women within each political category, which is very different from the above. Based on sample proportions (0.1 vs. 0.9), it is likely to be rejected, but each column would need to be formally tested by a separate Goodness-of-Fit.

    Table for (a):             Left    Middle    Right
                      Men       12       18        30       60

    Table for (b):             Left    Middle    Right
                      Women    108      162       270      540

    Table for (c):             Left    Middle    Right
                      Men       12       18        30       60
                      Women    108      162       270      540
                               120      180       300      600


(d) Among the individuals in each political category (i.e., column), calculate the proportion of men, and show that they are all equal to each other. Among the individuals in each political category (i.e., column), calculate the proportion of women, and show that they are all equal to each other. Therefore, show that a Chi-squared Test of the null hypothesis of equal proportions

H0: π Men | Left = π Men | Mid = π Men | Right AND π Women | Left = π Women | Mid = π Women | Right

also leads to a 100% acceptance at the α = .05 significance level. Conclusion?

MORAL: There is more than one type of null hypothesis on proportions to which the Chi-squared Test can be applied.

19. In a random sample of n = 1200 consumers who are surveyed about their ice cream flavor preferences, 416 indicate that they prefer vanilla, 419 prefer chocolate, and 365 prefer strawberry.

(a) Conduct a Chi-squared “Goodness-of-Fit” Test of the null hypothesis of equal proportions H0: πVanilla = πChocolate = πStrawberry of flavor preferences, at the α = .05 significance level.

    Vanilla    Chocolate    Strawberry
      416         419           365        1200

(b) Suppose that the sample of n = 1200 consumers is equally divided between males and

females, yielding the results shown below. Conduct a Chi-squared Test of the null hypothesis that flavor preference is not associated with gender, at the α = .05 level.

               Vanilla    Chocolate    Strawberry    Totals
    Males         200         190           210        600
    Females       216         229           155        600
    Totals        416         419           365       1200

(c) Redo (a) and (b) with R, using chisq.test. Show agreement with your calculations!


20. In the late 1980s, the pharmaceutical company Upjohn received approval from the Food and Drug Administration to market RogaineTM, a 2% minoxidil solution, for the treatment of androgenetic alopecia (male pattern hair loss). Upjohn’s advertising campaign for Rogaine included the results of a double-blind randomized clinical trial, conducted with 1431 patients in 27 centers across the United States. The results of this study at the end of four months are summarized in the 2 × 5 contingency table below, where the two row categories represent the treatment arm and control arm respectively, and each column represents a response category, the degree of hair growth reported. [Source: Ronald L. Iman, A Data-Based Approach to Statistics, Duxbury Press]

                                        Degree of Hair Growth
               No Growth    New Vellus    Minimal Growth    Moderate Growth    Dense Growth    Total
    Rogaine       301           172             178                58                5           714
    Placebo       423           150             114                29                1           717
    Total         724           322             292                87                6          1431

(a) Conduct a Chi-squared Test of the null hypothesis H0: πRogaine = πPlacebo versus the alternative hypothesis HA: πRogaine ≠ πPlacebo across the five hair growth categories. (That is, H0: πNo Growth | Rogaine = πNo Growth | Placebo and πNew Vellus | Rogaine = πNew Vellus | Placebo and … and πDense Growth | Rogaine = πDense Growth | Placebo.) Infer whether or not we can reject the null hypothesis at the α = .01 significance level. Interpret in context: At the α = .01 significance level, what exactly has been demonstrated about the efficacy of Rogaine versus placebo?

(b) Form a 2 × 2 contingency table by combining the last four columns into a single column labeled Growth. Conduct a Chi-squared Test for the null hypothesis H0: πRogaine = πPlacebo versus the alternative HA: πRogaine ≠ πPlacebo between the resulting No Growth versus Growth binary response categories. (That is, H0: πGrowth | Rogaine = πGrowth | Placebo.) Infer whether or not we can reject the null hypothesis at the α = .01 significance level. Interpret in context: At the α = .01 significance level, what exactly has been demonstrated about the efficacy of Rogaine versus placebo?

(c) Calculate the p-value using a two-sample Z-test of the null hypothesis in part (b), and show

that the square of the corresponding z-score is equal to the Chi-squared test statistic found in (b). Verify that the same conclusion about H0 is reached, at the α = .01 significance level.

(d) Redo this problem with R, using chisq.test. Show agreement with your calculations!


21. Male patients with coronary artery disease were recruited from three different medical centers – the Johns Hopkins University School of Medicine, The Rancho Los Amigos Medical Center, and the St. Louis University School of Medicine – to investigate the effects of carbon monoxide exposure. One of the baseline characteristics considered in the study was pulmonary lung function, as measured by X = “Forced Expiratory Volume in one second,” or FEV1. The data are summarized below.

    Johns Hopkins           Rancho Los Amigos       St. Louis
    n1 = 21                 n2 = 16                 n3 = 23
    x̄1 = 2.63 liters        x̄2 = 3.03 liters        x̄3 = 2.88 liters
    s1² = 0.246 liters²     s2² = 0.274 liters²     s3² = 0.248 liters²

Based on histograms of the raw data (not shown), it is reasonable to assume that the FEV1 measurements of the three populations from which these samples were obtained are each approximately normally distributed, i.e., X1 ~ N(µ1, σ1), X2 ~ N(µ2, σ2), and X3 ~ N(µ3, σ3). Furthermore, because the three sample variances are so close in value, it is reasonable to assume equivariance of the three populations, that is, σ1² = σ2² = σ3². With these assumptions, answer the following.

(a) Compute the pooled estimate of the common variance σ² “within groups” via the formula

    s²within = MSError = SSError / dfError = [(n1 − 1) s1² + (n2 − 1) s2² + … + (nk − 1) sk²] / (n − k).

(b) Compute the grand mean of the k = 3 groups via the formula

    x̄ = (n1 x̄1 + n2 x̄2 + … + nk x̄k) / n,    where the combined sample size n = n1 + n2 + … + nk.

From this, calculate the estimate of the variance “between groups” via the formula

    s²between = MSTreatment = SSTreatment / dfTreatment = [n1 (x̄1 − x̄)² + n2 (x̄2 − x̄)² + … + nk (x̄k − x̄)²] / (k − 1).

(c) Using this information, construct a complete ANOVA table, including the F-statistic, and

corresponding p-value, relative to .05 (i.e., < .05, > .05, or = .05). Infer whether or not we can reject H0: µ1 = µ2 = µ3, at the α = .05 level of significance. Interpret in context: Exactly what has been demonstrated about the baseline FEV1 levels of the three groups?


22. Generalization of Problem 2.5/8

(a) Suppose a random sample of size n1 has a mean x̄1 and variance s1², and a second random sample of size n2 has a mean x̄2 and variance s2². If the two samples are combined into a single sample, then algebraically express its mean x̄Total and variance s²Total in terms of the preceding variables. (Hint: If you think of this in the right way, it’s easier than it looks.)

(b) In a study of the medical expenses at a particular hospital, it is determined from a sample of 4000 patients that a certain laboratory procedure incurs a mean cost of $30, with a standard deviation of $10. It is realized, however, that these values inadvertently excluded 1000 patients for whom the cost was $0. When these patients are included in the study, what are the adjusted mean and standard deviation of the cost?

23.

(a) For a generic 2 × 2 contingency table such as the one shown below, prove that the Chi-squared test statistic reduces to

    χ1² = n (ad − bc)² / (R1 R2 C1 C2).

(b) Suppose that a z-test of two equal proportions results in the generic sample values shown

in this table. Prove that the square of the z-score is equal to the Chi-squared score in (a).

24. Problem 5.3/1 illustrates one way that the normal and t distributions differ, as similar as their graphs may appear (drawn to scale, below). Essentially, any t-curve has heavier tails than the bell curve, indicating a higher density of outliers in the distribution. (For the t-distribution with 1 degree of freedom, the tails are so heavy that the mean does not even exist!) Another way to see this is to check the t-distribution for normality, via a Q-Q plot. The posted R code for this problem graphs such a plot for a standard normal distribution (with predictable results), and for a t-distribution with 1 degree of freedom (a.k.a. the Cauchy distribution). Run this code five times each, and comment on the results!

(Generic 2 × 2 table for Problem 23)

     a     b      R1
     c     d      R2
    C1    C2       n

    curve(dnorm(x), -3, 3, lwd = 2, col = "darkgreen")                          # N(0, 1)

    curve(dt(x, 1), -3, 3, ylim = range(0, .4), lwd = 2, col = "darkgreen")     # t with 1 df


25.

(a) In R, type the following command to generate a data set called “x” of 1000 random values.

x = rf(1000, 5, 20)

Obtain a graph of its frequency histogram by typing hist(x). Include this graph as part of your submitted homework assignment. (Do not include the 1000 data values!)

(b) Next construct a “normal q-q plot” by typing the following.

qqnorm(x, pch = 19)

qqline(x)

Include this plot as part of your submitted homework assignment.

Now define a new data set called “y” by taking the (natural) logarithm of x.

y = log(x)

Obtain a graph of its frequency histogram by typing hist(y). Include this graph as part of your submitted homework assignment. (Do not include the 1000 data values!)

Then construct a “normal q-q plot” by typing the following.

qqnorm(y, pch = 19)

qqline(y)

Include this plot as part of your submitted homework assignment.

(c) Summarize the results in (a) and (b). In particular, from their respective histograms and q-q plots, what general observation can be made regarding the distributions of x and y = log(x)? (Hint: See pages 6.1-25 through 6.1-27.)

26. Refer to the posted Rcode folder for this problem. Please answer all questions.

27. Refer to the posted Rcode folder for this problem. Please answer all questions.


7. Correlation and Regression

7.1 Motivation 7.2 Linear Correlation and Regression 7.3 Extensions of Simple Linear Regression

7.4 Problems


7. Correlation and Regression

7.1 Motivation

POPULATION

Random Variables X, Y: numerical (Contrast with § 6.3.1.) How can the association between X and Y (if any exists) be

1) characterized and measured?

2) mathematically modeled via an equation, i.e., Y = f(X)?

Recall:

    µX = Mean(X) = E[X]              µY = Mean(Y) = E[Y]

    σX² = Var(X) = E[(X – µX)²]      σY² = Var(Y) = E[(Y – µY)²]

Definition: Population Covariance of X, Y

    σXY = Cov(X, Y) = E[(X – µX)(Y – µY)]

    Equivalently,* σXY = E[XY] – µX µY

*Exercise: Algebraically expand the expression (X − µX)(Y − µY), and use the properties of mathematical expectation given in 3.1. This motivates an alternate formula for sxy.

SAMPLE, size n

Recall:

    x̄ = (1/n) Σ (i = 1..n) xi                        ȳ = (1/n) Σ (i = 1..n) yi

    sx² = [1/(n − 1)] Σ (i = 1..n) (xi − x̄)²         sy² = [1/(n − 1)] Σ (i = 1..n) (yi − ȳ)²

Definition: Sample Covariance of X, Y

    sxy = [1/(n − 1)] Σ (i = 1..n) (xi − x̄)(yi − ȳ)

Note: Whereas sx² ≥ 0 and sy² ≥ 0, sxy is unrestricted in sign.


For the sake of simplicity, let us assume that the predictor variable X is nonrandom (i.e., deterministic), and that the response variable Y is random. (Although, the subsequent techniques can be extended to random X as well.)

Example: X = fat (grams), Y = cholesterol level (mg/dL)

Suppose the following sample of n = 5 data pairs (i.e., points) is obtained and graphed in a scatterplot, along with some accompanying summary statistics:

      X      Y
     60     210
     70     200        x̄ = 80       sx² = 250
     80     220
     90     280        ȳ = 240      sy² = 1750
    100     290

Sample Covariance

    sxy = [1/(5 − 1)] [ (60 − 80)(210 − 240) + (70 − 80)(200 − 240) + (80 − 80)(220 − 240) +
                        (90 − 80)(280 − 240) + (100 − 80)(290 − 240) ] = 600

As the name implies, the variance measures the extent to which a single variable varies (about its mean). Similarly, the covariance measures the extent to which two variables vary (about their individual means), with respect to each other.
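A minimal R sketch verifying these summary statistics:

    x <- c(60, 70, 80, 90, 100)                 # fat (grams)
    y <- c(210, 200, 220, 280, 290)             # cholesterol level (mg/dL)
    mean(x); mean(y)                            # 80 and 240
    var(x); var(y)                              # 250 and 1750
    cov(x, y)                                   # sample covariance = 600
    plot(x, y)                                  # scatterplot of the five data points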


Ideally, if there is no association of any kind between two variables X and Y (as in the case where they are independent), then a scatterplot would reveal no organized structure, and covariance = 0; e.g., X = adult head size, Y = IQ. Clearly, in a case such as this, the variable X is not a good predictor of the response Y. Likewise, if the variables X = age, Y = body temperature (°F) are measured in a group of healthy individuals, then the resulting scatterplot would consist of data points that are very nearly lined up horizontally (i.e., zero slope), reflecting a constant mean response value of Y = 98.6°F, regardless of age X. Here again, covariance = 0 (or nearly so); X is not a good predictor of the response Y. See figures.∗

However, in the preceding “fat vs. cholesterol” example, there is a clear “positive trend” exhibited in the scatterplot. Overall, it seems that as X increases, Y increases, and inversely, as X decreases, Y decreases. The simplest mathematical object that has this property is a straight line with positive slope, and so a linear description can be used to capture such “first-order” properties of the association between X and Y. The two questions we now ask are…

1) How can we measure the strength of the linear association between X and Y?
Answer: Linear Correlation Coefficient

2) How can we model the linear association between X and Y, essentially via an equation of the form Y = mX + b?
Answer: Simple Linear Regression

∗ Caution: The covariance can equal zero under other conditions as well; see Exercise in the next section.

[Figures: scatterplot of Y = IQ score vs. X = Head Circumference, showing no organized structure; scatterplot of Y = Body Temp (°F) vs. X = Age, with the points lined up nearly horizontally at 98.6.]


[Figure: five normal curves for cholesterol level Y, one above each of X = 60, 70, 80, 90, 100 fat grams, with conditional means µY|X=60, µY|X=70, µY|X=80, µY|X=90, µY|X=100 and a common standard deviation σ.]

We can consider n = 5 subpopulations, each of whose cholesterol levels Y are normally distributed, and whose means are conditioned on X = 60, 70, 80, 90, 100 fat grams, respectively.

Before moving on to the next section, some important details are necessary in order to provide a more formal context for this type of problem. In our example, the response variable of interest is cholesterol level Y, which presumably has some overall probability distribution in the study population. The mean cholesterol level of this population can therefore be denoted µY – or, recall, expectation E[Y] – and estimated by the “grand mean” ȳ = 240. Note that no information about X is used.

Now we seek to characterize the relation (if any) between cholesterol level Y and fat intake X in this population, based on a random sample using n = 5 fat intake values (i.e., x1 = 60, x2 = 70, x3 = 80, x4 = 90, x5 = 100). Each of these fixed xi values can be regarded as representing a different amount of fat grams consumed by a subpopulation of individuals, whose cholesterol levels Y, conditioned on that value of X = xi, are assumed to be normally distributed. The conditional mean cholesterol level of each of these distributions could therefore be denoted µY|X=xi – equivalently, conditional expectation E[Y | X = xi] – for i = 1, 2, 3, 4, 5. (See figure; note that, in addition, we will assume that the variances “within groups” are all equal (to σ²), and that they are independent of one another.) If no relation between X and Y exists, we would expect to see no organized variation in Y as X changes, and all of these conditional means would either be uniformly “scattered” around – or exactly equal to – the unconditional mean µY; recall the preceding discussion. But if there is a true relation between X and Y, then it becomes important to characterize and model the resulting (nonzero) variation.


7.2 Linear Correlation and Regression

POPULATION

Random Variables X, Y: numerical

Definition: Population Linear Correlation Coefficient of X, Y

ρ = σXY / (σX σY)

FACT: −1 ≤ ρ ≤ +1

SAMPLE, size n

Definition: Sample Linear Correlation Coefficient of X, Y

ρ̂ = r = sxy / (sx sy)

FACT: −1 ≤ r ≤ +1

Example:   r = 600 / √((250)(1750)) = 0.907   ⇒   strong, positive linear correlation

FACT: Any set of data points (xi, yi), i = 1, 2, …, n, having r > 0 (likewise, r < 0) is said to have a positive linear correlation (likewise, negative linear correlation). The linear correlation can be strong, moderate, or weak, depending on the magnitude. The closer r is to +1 (likewise, −1), the more strongly the points follow a straight line having some positive (likewise, negative) slope. The closer r is to 0, the weaker the linear correlation; if r = 0, then EITHER the points are uncorrelated (see 7.1), OR they are correlated, but nonlinearly (e.g., Y = X²).

Exercise: Draw a scatterplot of the following n = 7 data points, and compute r.

(−3, 9), (−2, 4), (−1, 1), (0, 0), (1, 1), (2, 4), (3, 9)


(Pearson’s) Sample Linear Correlation Coefficient   r = sxy / (sx sy)

[Figure: the scale of r from −1 to +1 — values near ±1 (roughly |r| between 0.8 and 1) indicate strong linear correlation, |r| roughly between 0.5 and 0.8 indicates moderate correlation, |r| below about 0.5 indicates weak correlation, and r = 0 corresponds to uncorrelated data. Negative linear correlation: as X increases, Y decreases; as X decreases, Y increases. Positive linear correlation: as X increases, Y increases; as X decreases, Y decreases.]

Some important exceptions to the “typical” cases above:

• r = 0, but X and Y are correlated, nonlinearly
• r > 0 in each of the two individual subgroups, but r < 0 when combined
• r > 0, only due to the effect of one influential outlier; if removed, then data are uncorrelated (r = 0)


Statistical Inference for ρ

Suppose we now wish to conduct a formal test of…

Null Hypothesis H0: ρ = 0 ⇔ “There is no linear correlation between X and Y.”
vs.
Alternative Hyp. HA: ρ ≠ 0 ⇔ “There is a linear correlation between X and Y.”

Example:   p-value = 2 P( T3 ≥ (.907 √3) / √(1 − (.907)²) ) = 2 P(T3 ≥ 3.733) = 2(.017) = .034

As p < α = .05, the null hypothesis of no linear correlation can be rejected at this level.
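In R, this test of H0: ρ = 0 can be carried out directly with cor.test; a minimal sketch for the fat vs. cholesterol data (the reported t-statistic and two-sided p-value should agree with the hand calculation above, up to rounding):

x <- c(60, 70, 80, 90, 100)       # fat (g)
y <- c(210, 200, 220, 280, 290)   # cholesterol (mg/dL)
cor(x, y)        # r = 0.907 (approximately)
cor.test(x, y)   # t = 3.73 on 3 df, two-sided p-value about .034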

Comments:

Defining the numerator “sums of squares” Sxx = (n − 1) sx², Syy = (n − 1) sy², and Sxy = (n − 1) sxy, the correlation coefficient can also be written as r = Sxy / √(Sxx Syy).

The general null hypothesis H0: ρ = ρ0 requires a more complicated Z-test, which first applies the so-called Fisher transformation, and will not be presented here.

The assumption on X and Y is that their joint distribution is bivariate normal, which is difficult to check fully in practice. However, a consequence of this assumption is that X and Y are linearly uncorrelated (i.e., ρ = 0) if and only if X and Y are independent. That is, it overlooks the possibility that X and Y might have a nonlinear correlation. The moral: ρ – and therefore the Pearson sample linear correlation coefficient r calculated above – only captures the strength of linear correlation. A more sophisticated measure, the multiple correlation coefficient, can detect nonlinear correlation, or correlation in several variables. Also, the nonparametric Spearman rank-correlation coefficient can be used as a substitute.

Correlation does not imply causation! (E.g., X = “children’s foot size” is indeed positively correlated with Y = “IQ score,” but is this really cause-and-effect????) The ideal way to establish causality is via a well-designed randomized clinical trial, but this is not always possible, or even desirable. (E.g., X = smoking vs. Y = lung cancer)

Test Statistic

T = r √(n − 2) / √(1 − r²)  ~  t_{n−2}


Simple Linear Regression and the Method of Least Squares

If a linear association exists between variables X and Y, then it can be written as

Y = β0 + β1 X + ε        “Response = (Linear) Model + Error”

where X is the predictor variable (explanatory variable), and β0 and β1 are the k = 2 parameters, or “regression coefficients.” The sample-based estimator of the response is

Ŷ = β̂0 + β̂1 X,   with intercept estimate b0 = β̂0 and slope estimate b1 = β̂1.

That is, given the “response vector” Y, we wish to find the linear estimate Ŷ that makes the magnitude of the difference ε̂ = Y − Ŷ as small as possible.


Y = β0 + β1 X + ε   ⇒   Ŷ = β̂0 + β̂1 X

How should we define the line that “best” fits the data, and obtain its coefficients β̂0 and β̂1?

For any line, errors εi, i = 1, 2, …, n, can be estimated by the residuals ε̂i = ei = yi − ŷi.

Example (cont’d):   Slope b1 = 600 / 250 = 2.4      Intercept b0 = 240 − (2.4)(80) = 48

Therefore, the least squares regression line is given by the equation Y = 48 + 2.4 X.
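The same least squares coefficients can be obtained in R with lm; a minimal sketch (the object name regline is simply a choice, echoing the code used in Problem 7.4/2):

x <- c(60, 70, 80, 90, 100)
y <- c(210, 200, 220, 280, 290)
regline <- lm(y ~ x)      # least squares fit
coef(regline)             # intercept = 48, slope = 2.4
summary(regline)          # standard errors, t-tests, r-squared, etc.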

[Figure: scatterplot of the data (xi, yi) with the least squares line passing through the center of mass (x̄, ȳ); for each point, the residual ei = yi − ŷi (e1, e2, e3, …, en) is the vertical distance from the observed point (xi, yi) to the fitted point (xi, ŷi) on the line. Residual = observed response − fitted response.]

The least squares regression line is the unique line that minimizes the Error (or Residual) Sum of Squares

SSError = Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi − ŷi)².

Slope:   β̂1 = b1 = sxy / sx²        Intercept:   β̂0 = b0 = ȳ − b1 x̄        Ŷ = b0 + b1 X


[Figure: Scatterplot, Least Squares Regression Line Ŷ = 48 + 2.4 X, and Residuals +18, −16, −20, +16, +2.]

predictor values xi                     60     70     80     90     100
observed responses yi                   210    200    220    280    290
fitted (predicted) responses ŷi         192    216    240    264    288
residuals ei = yi − ŷi                  +18    −16    −20    +16    +2

Note that the sum of the residuals is equal to zero. But the sum of their squares,

Σ ε̂i² = SSError = (+18)² + (−16)² + (−20)² + (+16)² + (+2)² = 1240

is, by construction, the smallest such value of all possible regression lines that could have been used to estimate the data. Note also that the center of mass (80, 240) lies on the least squares regression line.

Example: The population cholesterol level corresponding to x* = 75 fat grams is estimated by ŷ = 48 + 2.4(75) = 228 mg/dL. But how precise is this value? (Later...)


Statistical Inference for β0 and β1

It is possible to test for significance of the intercept parameter β0 and slope parameter β1 of the least squares regression line, using the following:

where se² = SSError / (n − 2) (se is the so-called standard error of estimate), and Sxx = (n − 1) sx². (Note: se² is also written as MSE or MSError, the “mean square error” of the regression; see ANOVA below.)

Example: Calculate the p-value of the slope parameter β1, under…

First, se² = 1240 / 3 = 413.333, so se = 20.331. And Sxx = (4)(250) = 1000. So…

p-value = 2 P( T3 ≥ (2.4 − 0) / (20.331 / √1000) ) = 2 P(T3 ≥ 3.733) = 2(.017) = .034

As p < α = .05, the null hypothesis of no linear association can be rejected at this level.

Note that the T-statistic (3.733), and hence the resulting p-value (.034), is identical to the test of significance of the linear correlation coefficient H0: ρ = 0 conducted above!

Exercise: Calculate the 95% confidence interval for β1, and use it to test H0: β1 = 0.

Null Hypothesis H0: β1 = 0 ⇔ “There is no linear association between X and Y.”

vs.

Alternative Hyp. HA: β1 ≠ 0 ⇔ “There is a linear association between X and Y.”

Test Statistic

For β0:   T = (b0 − β0) / [ se √(1/n + x̄²/Sxx) ]   ~   t_{n−2}

For β1:   T = (b1 − β1) / [ se / √Sxx ]   ~   t_{n−2}

(1 − α) × 100% Confidence Limits

For β0:   b0 ± t_{n−2, α/2} · se √(1/n + x̄²/Sxx)

For β1:   b1 ± t_{n−2, α/2} · se / √Sxx


Confidence and Prediction Intervals

Recall that, from the discussion in the previous section, a regression problem such as this may be viewed in the formal context of starting with n normally-distributed populations, each having a conditional mean µY|X=xi, i = 1, 2, ..., n. From this, we then obtain a linear model that allows us to derive an estimate of the response variable via Ŷ = b0 + b1 X, for any value X = x* (with certain restrictions to be discussed later), i.e., ŷ = b0 + b1 x*. There are two standard possible interpretations for this fitted value. First, ŷ can be regarded simply as a “predicted value” of the response variable Y, for a randomly selected individual from the specific normally-distributed population corresponding to X = x*, and can be improved via a so-called prediction interval.

Exercise: Confirm that the 95% prediction interval for ŷ = 228 (when x* = 75) is (156.3977, 299.6023).

Example (α = .05): 95% Prediction Bounds

X      fit    Lower      Upper
60     192    110.1589   273.8411
70     216    142.2294   289.7706
80     240    169.1235   310.8765
90     264    190.2294   337.7706
100    288    206.1589   369.8411
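These bounds can be reproduced in R with predict and interval = "prediction"; a minimal self-contained sketch:

x <- c(60, 70, 80, 90, 100);  y <- c(210, 200, 220, 280, 290)
regline <- lm(y ~ x)
predict(regline, data.frame(x = c(60, 70, 80, 90, 100)), interval = "prediction")   # the table above
predict(regline, data.frame(x = 75), interval = "prediction")                       # for x* = 75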

This diagram illustrates the associated 95% prediction interval around ŷ = b0 + b1 x*, which contains the true response value Y with 95% probability.

[Figure: normal curve of the response Y at X = x*, centered near the fitted value b0 + b1 x* (which estimates µY|X=x*), with the middle .95 probability region and .025 in each tail.]

(1 − α) × 100% Prediction Limits for Y at X = x*

(b0 + b1 x*) ± t_{n−2, α/2} · se √( 1 + 1/n + (x* − x̄)²/Sxx )


The second interpretation is that ŷ can be regarded as a point estimate of the conditional mean µY|X=x* of this population, and can be improved via a confidence interval.

Exercise: Confirm that the 95% confidence interval for ŷ = 228 (when x* = 75) is (197.2133, 258.6867).

Note: Both approaches are based on the fact that there is, in principle, variability in the coefficients b0 and b1 themselves, from one sample of n data points to another. Thus, for fixed x*, the object ŷ = b0 + b1 x* can actually be treated as a random variable in its own right, with a computable sampling distribution. Also, we define the general conditional mean µY|X – i.e., conditional expectation E[Y | X] – as µY|X=x* – i.e., E[Y | X = x*] – for all appropriate x*, rather than a specific one.

(1 − α) × 100% Confidence Limits for µY|X=x*

(b0 + b1 x*) ± t_{n−2, α/2} · se √( 1/n + (x* − x̄)²/Sxx )

This diagram illustrates the associated 95% confidence interval around ŷ = b0 + b1 x*, which contains the true conditional mean µY|X=x* with 95% probability. Note that it is narrower than the corresponding prediction interval above.

[Figure: normal curve of the response Y at X = x*, centered near b0 + b1 x* (estimating µY|X=x*), with the middle .95 probability region and .025 in each tail.]


[Figure: 95% Confidence Intervals — scatterplot with the least squares regression line and the upper and lower 95% confidence bands.]

Example (α = .05): 95% Confidence Bounds

X      fit    Lower      Upper
60     192    141.8827   242.1173
70     216    180.5617   251.4383
80     240    211.0648   268.9352
90     264    228.5617   299.4383
100    288    237.8827   338.1173
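Analogously, interval = "confidence" in predict produces the confidence bounds for the conditional mean; a minimal sketch:

x <- c(60, 70, 80, 90, 100);  y <- c(210, 200, 220, 280, 290)
regline <- lm(y ~ x)
predict(regline, data.frame(x = c(60, 70, 80, 90, 100)), interval = "confidence")   # the table above
predict(regline, data.frame(x = 75), interval = "confidence")                       # for x* = 75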

Comments: Note that, because individual responses have greater variability than mean responses (recall the Central Limit Theorem, for example), we expect prediction intervals to be wider than the corresponding confidence intervals, and indeed, this is the case. The two formulas differ by a term of “1 +” in the standard error of the former, resulting in a larger margin of error.

Note also from the formulas that both types of interval are narrowest when x* = x̄, and grow steadily wider as x* moves farther away from x̄. (This is evident in the graph of the 95% confidence intervals above.) Great care should be taken if x* is outside the domain of sample values! For example, when fat grams x = 0, the linear model predicts an unrealistic cholesterol level of ŷ = 48, and the margin of error is uselessly large. The linear model is not a good predictor there.


ANOVA Formulation

As with comparison of multiple treatment means (§6.3.3), regression can also be interpreted in the general context of analysis of variance. That is, because

Response = Model + Error,

it follows that the total variation in the original response data can be partitioned into a source of variation due to the model, plus a source of variation for whatever remains. We now calculate the three “Sums of Squares (SS)” that measure the variation of the system and its two component sources, and their associated degrees of freedom (df).

1. Total Sum of Squares = sum of the squared deviations of each observed response value yi from the mean response value ȳ.

SSTotal = (210 − 240)² + (200 − 240)² + (220 − 240)² + (280 − 240)² + (290 − 240)² = 7000

dfTotal = 5 − 1 = 4     Reason: n data values − 1

Note that, by definition, sy² = SSTotal / dfTotal = 7000 / 4 = 1750, as given in the beginning of this example in 7.1.

2. Regression Sum of Squares = sum of the squared deviations of each fitted response value ŷi from the mean response value ȳ.

SSReg = (192 − 240)² + (216 − 240)² + (240 − 240)² + (264 − 240)² + (288 − 240)² = 5760

dfReg = 1     Reason: As the regression model is linear, its degrees of freedom = one less than the k = 2 parameters we are trying to estimate (β0 and β1).

3. Error Sum of Squares = sum of the squared deviations of each observed response yi from its corresponding fitted response ŷi (i.e., the sum of the squared residuals).

SSError = (210 − 192)² + (200 − 216)² + (220 − 240)² + (280 − 264)² + (290 − 288)² = 1240

dfError = 5 − 2 = 3     Reason: n data values − k regression parameters in model

SSTotal = SSReg + SSError
dfTotal = dfReg + dfError

[Diagram: the total variation partitioned into Regression (Model) + Error.]


[Figure: F1,3 distribution, with the observed test statistic 13.94 and corresponding upper tail area (p-value) .034.]

Null Hypothesis H0: β1 = 0 ⇔ “There is no linear association between X and Y.”

vs.

Alternative Hyp. HA: β1 ≠ 0 ⇔ “There is a linear association between X and Y.”

ANOVA Table   (Test Statistic: F1,3 distribution)

Source        df    SS (“Sum of Squares”)    MS = SS/df (“Mean Squares”)    F = MSReg/MSErr    p-value
Regression    1     5760                     5760                           13.94              .034
Error         3     1240                     413.333
Total         4     7000                     −

According to this F-test, we can reject the null hypothesis H0: β1 = 0 at the α = .05 significance level, which is consistent with our earlier findings.

Comment: Again, note that 13.94 = (±3.733)², i.e., F1,3 = t3², ⇒ equivalent tests.
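This ANOVA table can be generated in R by applying anova to the fitted lm object; a minimal sketch:

x <- c(60, 70, 80, 90, 100);  y <- c(210, 200, 220, 280, 290)
regline <- lm(y ~ x)
anova(regline)   # Regression: df 1, SS 5760; Error: df 3, SS 1240; F = 13.94, p = .034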


How well does the model fit? Out of a total response variation of 7000, the linear regression model accounts for 5760, with the remaining 1240 unaccounted for (perhaps explainable by a better model, or simply due to random chance). We can therefore assess how well the model fits the data by calculating the ratio SSReg / SSTotal = 5760 / 7000 = 0.823. That is, 82.3% of the total response variation is due to the linear association between the variables, as determined by the least squares regression line, with the remaining 17.7% unaccounted for. (Note: This does NOT mean that 82.3% of the original data points lie on the line. This is clearly false; from the scatterplot, it is clear that none of the points lies on the regression line!)

Moreover, note that 0.823 = (0.907)² = r², the square of the correlation coefficient calculated before! This relation is true in general…

Comment: In practice, it is tempting to over-rely on the coefficient of determination as the sole indicator of linear fit to a data set. As with the correlation coefficient r itself, a reasonably high r² value is suggestive of a linear trend, or a strong linear component, but should not be used as the definitive measure.

Exercise: Sketch the n = 5 data points (X, Y)

(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)

in a scatterplot, and calculate the coefficient of determination r² in two ways:

1. By squaring the linear correlation coefficient r.

2. By explicitly calculating the ratio SSReg / SSTotal from the regression line.

Show agreement of your answers, and that, despite a value of r² very close to 1, the exact association between X and Y is actually a nonlinear one. Compare the linear estimate of Y when X = 5, with its exact value.

Also see Appendix > Geometric Viewpoint > Least Squares Approximation.

Coefficient of Determination

r² = SSReg / SSTotal = 1 − SSErr / SSTotal

This value (always between 0 and 1) indicates the proportion of total response variation that is accounted for by the least squares regression model.


[Figure: at each fixed predictor value x1, x2, …, xn, the errors are normally distributed about 0 with a common standard deviation σ: ε1 ~ N(0, σ), ε2 ~ N(0, σ), …, εn ~ N(0, σ).]

Regression Diagnostics – Checking the Assumptions

True Responses:     Y = β0 + β1 X + ε   (“Response = Model + Error”)   ⇔   yi = β0 + β1 xi + εi,   i = 1, 2, ..., n

Fitted Responses:   Ŷ = b0 + b1 X   ⇔   ŷi = b0 + b1 xi,   i = 1, 2, ..., n

Residuals:          ε̂ = Y − Ŷ   ⇔   ε̂i = ei = yi − ŷi,   i = 1, 2, ..., n

1. The model is “correct.”

Perhaps a better word is “useful,” since correctness is difficult to establish without a theoretical justification, based on known mathematical and scientific principles.

Check: Scatterplot(s) for general behavior, r2 ≈ 1, overall balance of simplicity vs. complexity of model, and robustness of response variable explanation.

2. Errors εi are independent of each other, i = 1, 2, …, n.

This condition is equivalent to the assumption that the responses yi are independent of one other. Alas, it is somewhat problematic to check in practice; formal statistical tests are limited. Often, but not always, it is implicit in the design of the experiment. Other times, errors (and hence, responses) may be autocorrelated with each other. Example: Y = “systolic blood pressure (mm Hg)” at times t = 0 and t = 1 minute later. Specialized time-series techniques exist for these cases, but are not pursued here.

3. Errors εi are normally distributed with mean 0, and equal variances σ1² = σ2² = … = σn² (= σ²), i.e., εi ~ N(0, σ), i = 1, 2, …, n.

This condition is equivalent to the original normality assumption on the responses yi. Informally, if for each fixed xi, the true response yi is normally distributed with mean µY|X=xi and variance σ² – i.e., yi ~ N(µY|X=xi, σ) – then the error εi that remains upon “subtracting out” the true model value β0 + β1 xi (see boxed equation above) turns out also to be normally distributed, with mean 0 and the same variance σ² – i.e., εi ~ N(0, σ). Formal details are left to the mathematically brave to complete.



Check: Residual plot (residuals ei vs. fitted values ŷi) for a general random appearance, evenly distributed about zero. (Can also check the normal probability plot.) Typical residual plots that violate Assumptions 1-3:

[Figure: four residual plots showing nonlinearity, dependent errors, increasing variance, and an omitted predictor.]

Nonlinear trend can often be described with a polynomial regression model, e.g., Y = β0 + β1 X + β2 X² + ε. If a residual plot resembles the last figure, this is a possible indication that more than one predictor variable may be necessary to explain the response, e.g., Y = β0 + β1 X1 + β2 X2 + ε, multiple linear regression. Nonconstant variance can be handled by Weighted Least Squares (WLS) – versus Ordinary Least Squares (OLS) above – or by using a transformation of the data, which can also alleviate nonlinearity, as well as violations of the third assumption that the errors are normally distributed.
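In R, the residual plot and a normal probability plot of the residuals can be checked with a few lines; a minimal sketch for the fat vs. cholesterol fit used throughout this section:

x <- c(60, 70, 80, 90, 100);  y <- c(210, 200, 220, 280, 290)
regline <- lm(y ~ x)
plot(fitted(regline), resid(regline), xlab = "Fitted values", ylab = "Residuals", pch = 19)
abline(h = 0, lty = 2)                                      # residuals should scatter randomly about zero
qqnorm(resid(regline), pch = 19); qqline(resid(regline))    # normal probability (q-q) plot of the residuals
# plot(regline, which = 1) produces a standard residuals-vs-fitted plot as well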



[Figure: scatterplot with fitted line Ŷ = 12.1 + 4.7 X. Photo caption: Sadie.]

Example: Regress Y = “human age (years)” on X = “dog age (years),” based on the following n = 20 data points, for adult dogs 23-34 lbs.:

X    1   2   3   4   5   6   7   8   9   10   11   12   13   14   15   16   17   18   19   20
Y    15  21  27  32  37  42  46  51  55  59   63   67   71   76   80   85   91   97   103  111

Residuals:
     Min       1Q   Median       3Q      Max
-2.61353 -1.57124  0.08947  1.16654  4.87143

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.06842    0.87794   13.75  5.5e-11 ***
X            4.70301    0.07329   64.17  < 2e-16 ***

Multiple R-Squared: 0.9956, Adjusted R-squared: 0.9954
F-statistic: 4118 on 1 and 18 degrees of freedom, p-value: 0
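A sketch of R code that reproduces the fit above and displays the residual plot discussed next (dogage and humanage are simply assumed variable names):

dogage   <- 1:20
humanage <- c(15, 21, 27, 32, 37, 42, 46, 51, 55, 59,
              63, 67, 71, 76, 80, 85, 91, 97, 103, 111)
fit <- lm(humanage ~ dogage)
summary(fit)                               # intercept about 12.07, slope about 4.70
plot(fitted(fit), resid(fit), pch = 19)    # residual plot: note the clear nonlinear trend
abline(h = 0, lty = 2)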


The residual plot exhibits a clear nonlinear trend, despite the excellent fit of the linear model. It is possible to take this into account using, say, a cubic (i.e., third-degree) polynomial, but this then begs the question: How complicated should we make the regression model?

My assistant and I, thinking hard about regression models.


7.3 Extensions of Simple Linear Regression

Transformations

Power Laws: Y = α X^β

[Figure: graphs of the power functions X^β — β > 1: X², X³, X⁴; 0 < β < 1: X^(1/2), X^(1/3), X^(1/4); β < 0: X^(−1), X^(−2), X^(−1/2).]


If a scatterplot exhibits evidence of a monotonic nonlinear trend, then it may be possible to improve the regression model by first transforming the data according to one of the power functions above, depending on its overall shape.

If Y = α Xβ, then it follows that log(Y) = log(α) + β log(X)

i.e., V = β0 + β1 U.

[See Appendix > Basic Reviews > Logarithms for properties of logarithms.]

That is, if X and Y have a power law association, then log(X) and log(Y) have a linear association. Therefore, such (X, Y) data are often replotted on a log-log (U, V) scale in order to bring out the linear trend. The linear regression coefficients of the transformed data are then computed, and backsolved for the original parameters α and β. Algebraically, any logarithmic base can be used, but it is customary to use natural logarithms “ln” – that is, base e = 2.71828… Thus, if Y = α X^β, then V = β0 + β1 U, where V = ln(Y), U = ln(X), and the parameters β0 = ln(α) and β1 = β, so that the scale parameter α = e^β0, and the shape parameter β = β1. However…

Comment: This description of the retransformation is not quite complete. For, recall that linear regression assumes the true form of the response as V = β0 + β1 U + ε. (The random error term ε is estimated by the least squares minimum SSError = Σ_{i=1}^{n} ei².) Therefore, exponentiating both sides, the actual relationship between X and Y is given by Y = α X^β e^ε. Hence (see section 7.2), the conditional expectation is E[Y | X] = α X^β E[e^ε], where E[e^ε] is the mean of the exponentiated errors εi, and is thus estimated by the sample mean of the exponentiated residuals ei. Consequently, the estimate of the original scale parameter α is more accurately given by

α̂ = e^β̂0 × (1/n) Σ_{i=1}^{n} e^(ei).

(The estimate of the original shape parameter β remains β̂ = β̂1.)

In this context, the expression (1/n) Σ_{i=1}^{n} e^(ei) is called a smearing factor, introduced to reduce bias during the retransformation process. Note that, ideally, if all the residuals ei = 0 – i.e., the model fits exactly – then (because e⁰ = 1) it follows that the smearing factor = 1. This will be the case in most of the “rigged” examples in this section, for the sake of simplicity. The often-cited reference below contains information on smearing estimators for other transformations.

Duan, N. (1983) Smearing estimate: a nonparametric retransformation method. Journal of the American Statistical Association, 78, 605-610.


Example: This example is modified from a pharmaceutical research paper, Allometric Scaling of Xenobiotic Clearance: Uncertainty versus Universality by Teh-Min Hu and William L. Hayton, that can be found at the URL http://www.aapsj.org/view.asp?art=ps030429, and which deals with different rates of metabolic clearance of various substances in mammals. (A xenobiotic is any organic compound that is foreign to the organism under study. In some situations, this is loosely defined to include naturally present compounds administered by alternate routes or at unnatural concentrations.) In one part of this particular study, n = 6 mammals were considered: mouse, rat, rabbit, monkey, dog and human. Let X = “body weight (kg)” and the response Y = “clearance rate of some specific compound.” Suppose the following “ideal” data were generated (consistent with the spirit of the article’s conclusions):

X    .02     .25      2.5      5       14      70
Y    5.318   35.355   198.82   334.4   723.8   2420.0


Solving for the least squares regression line yields the following standard output.

Residuals:
      1       2      3      4       5       6
-102.15  -79.83   8.20  59.96  147.60  -33.78

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  106.800     49.898    2.14    0.099
X             33.528      1.707   19.64 3.96e-05

Residual standard error: 104.2 on 4 degrees of freedom
Multiple R-Squared: 0.9897, Adjusted R-squared: 0.9872
F-statistic: 385.8 on 1 and 4 degrees of freedom, p-value: 3.962e-05


The residual plot, as well as a visual inspection of the linear fit, would seem to indicate that model improvement is possible, despite the high r² value. The overall shape is suggestive of a power law relation Y = α X^β with 0 < β < 1. Transforming to a log-log scale produces the following data and regression line V̂ = 4.605 + 0.75 U.

U = ln X    −3.912   −1.386   0.916   1.609   2.639   4.248
V = ln Y     1.671    3.565   5.292   5.812   6.585   7.792

Residuals:
         1          2          3          4          5          6
-2.469e-05 -1.944e-06 -1.938e-06  6.927e-05  2.244e-05 -6.313e-05

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.605e+00  2.097e-05  219568   <2e-16 ***
U           7.500e-01  7.602e-06   98657   <2e-16 ***

Residual standard error: 4.976e-05 on 4 degrees of freedom
Multiple R-Squared: 1, Adjusted R-squared: 1
F-statistic: 9.733e+09 on 1 and 4 degrees of freedom, p-value: 0


[Figure: original-scale scatterplot with the fitted power curve Ŷ = 100 X^(3/4).]

The residuals are all within 10⁻⁴ of 0; this is clearly a much better fit to the data. Transforming back to the original X, Y variables from the regression line

ln(Y) = 4.605 + 0.75 ln(X),

we obtain… Ŷ = e^(4.605 + 0.75 ln(X)) = e^4.605 · e^(0.75 ln(X)) = 100 X^0.75. That is, the variables follow a power law relation with exponent ¾, illustrating a result known as Kleiber’s Law of “quarter power scaling.” See Appendix > Regression Models > Power Law Growth for more examples and information.
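A sketch of the corresponding log-log fit in R, using the clearance data above; the coefficients recover ln(α) ≈ 4.605 and β ≈ 0.75, hence α ≈ 100:

X <- c(0.02, 0.25, 2.5, 5, 14, 70)
Y <- c(5.318, 35.355, 198.82, 334.4, 723.8, 2420.0)
U <- log(X);  V <- log(Y)          # natural logarithms
loglog <- lm(V ~ U)
coef(loglog)                        # intercept about 4.605, slope about 0.75
alpha <- exp(coef(loglog)[1])       # scale parameter estimate, about 100
beta  <- coef(loglog)[2]            # shape parameter estimate, about 0.75
# (with residuals essentially zero, the smearing factor is essentially 1 here)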


[Figure: exponential curves e^(0.5X), e^X, e^(2X) for β > 0, and e^(−0.5X), e^(−X), e^(−2X) for β < 0.]

Logarithmic Transformation: Y = α e^(βX)   (Assume α > 0.)

In some systems, the response variable Y grows (β > 0) or decays (β < 0) exponentially in X. That is, each unit increase in X results in a new response value Y that is a constant multiple (either > 1 or < 1, respectively) of the previous response value. A typical example is unrestricted cell division where, under ideal conditions, the number of cells Y at the end of every time period X is twice the number at the previous period. (The resulting explosion in the number of cells helps explain why patients with bacterial infections need to remain on their full ten-day regimen of antibiotics, even if they feel recovered sooner.) The half-life of a radioactive isotope is a typical example of exponential decay.

In general, if Y = α e^(βX), then ln(Y) = ln(α) + β X,

i.e., V = β0 + β1 X.

That is, X and ln(Y) have a linear association, and the model itself is said to be log-linear. Therefore, the responses are often replotted on a semilog scale – i.e., ln(Y) versus X – in order to bring out the linear trend. As before, the linear regression coefficients of the transformed data are then computed, and backsolved for estimates of the scale parameter α = e^β0 and shape parameter β = β1. Also see Appendix > Regression Models > Exponential Growth and Appendix > Regression Models > Example - Newton's Law of Cooling.

Comment: Recall that the square root and logarithm functions also serve to transform positively skewed data closer to being normally distributed. Caution: If any of the values are ≤ 0, then add a constant value (e.g., +1) uniformly to all of the values, before attempting to take their square root or logarithm!!!
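A minimal sketch of a semilog (log-linear) fit in R, using hypothetical simulated data; the data, parameter values, and object names below are assumptions for illustration only:

set.seed(3)                                     # hypothetical exponential-decay data
x <- 0:10
y <- 500 * exp(-0.3 * x) * exp(rnorm(11, sd = 0.05))
semilog <- lm(log(y) ~ x)                       # ln(Y) = ln(alpha) + beta * X
exp(coef(semilog)[1])                           # scale parameter estimate (true value 500 here)
coef(semilog)[2]                                # shape parameter estimate (true value -0.3: decay)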


Multiple Linear Regression

Suppose we now have k – 1 independent explanatory variables X1, X2, …, Xk – 1 (numerical or categorical) to predict a single continuous response variable Y. Then the regression setup “Response = Model + Error” becomes:

Y = β0 + β1 X1 + β2 X2 + β3 X3 + … + βk−1 Xk−1          ← main effect terms
    + β11 X1² + β22 X2² + … + βk−1,k−1 Xk−1²            ← quadratic terms (if any)
    + β25 X2 X5 + β68 X6 X8 + …                          ← two-way interaction terms (if any)
    + β147 X1 X4 X7 + …                                  ← three-way interaction terms (if any)
    + ε

For simplicity, first consider the general additive model, i.e., main effects only.

Question 1: How are the estimates of the regression coefficients obtained?

Answer: Least Squares Approximation (LS), which follows the same principle of minimizing the residual sum of squares SSError. However, this leads to a set of complicated normal equations, best formulated via matrix algebra, and solved numerically by a computer. See figure below for two predictors.

[Figure: for two predictors, the fitted regression plane Ŷ = β̂0 + β̂1 X1 + β̂2 X2 in (X1, X2, Y) space, passing near the point (x̄1, x̄2, ȳ); for each predictor pair (x1i, x2i), the residual ei = yi − ŷi is the vertical distance between the true response yi and the fitted response ŷi on the plane.]
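In R, lm fits a model with any number of predictors using the same least squares machinery; a minimal sketch with hypothetical simulated data (the variable names and coefficient values below are assumptions for illustration only):

set.seed(1)                                         # hypothetical data
x1 <- runif(30, 0, 10);  x2 <- runif(30, 0, 20)
y  <- 120 + 0.5 * x1 + 0.25 * x2 + rnorm(30, sd = 2)
fit2 <- lm(y ~ x1 + x2)          # additive (main effects only) model
summary(fit2)                    # coefficient estimates, t-tests, multiple R-squared
anova(fit2)                      # sequential sums of squares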


Question 2: Which predictor variables among X1, X2, …, Xk – 1 are the most important for modeling the response variable? That is, which regression coefficients βj are statistically significant?

Answer: This raises the issue of model selection, one of the most important problems in the sciences. There are two basic stepwise procedures: forward selection (FS) and backward elimination (BE) (as well as widely used hybrids of these methods (FB)). The latter is a bit easier to conceptualize, and the steps are outlined below.

Model Selection: Backward Elimination (BE)

Step 0. In a procedure that is extremely similar to that for multiple comparison of k treatment means (§6.3.3), first conduct an overall F-test of the full model β0 + β1 X1 + β2 X2 + … + βk−1 Xk−1, by constructing an ANOVA table:

Source        df       SS (“Sum of Squares”)       MS = SS/df (“Mean Squares”)    F = MSReg/MSErr    p-value
Regression    k − 1    Σ_{i=1}^{n} (ŷi − ȳ)²        MSReg                          F ~ F_{k−1, n−k}   0 ≤ p ≤ 1
Error         n − k    Σ_{i=1}^{n} (yi − ŷi)²       MSErr
Total         n − 1    Σ_{i=1}^{n} (yi − ȳ)²        −

If – and only if – the null hypothesis is (hopefully) rejected, it then becomes necessary to determine which of the predictor variables correspond to statistically significant regression coefficients. (Note that this is analogous to determining which of the k treatment group means are significantly different from the others, in multiple comparisons.)

Null Hypothesis H0: β1 = β2 = … = βk−1 = 0 ⇔ “There is no linear association between the response Y and any of the predictors X1, …, Xk−1.”

Alternative Hyp. HA: βj ≠ 0 for some j ⇔ “There is a linear association between the response Y and at least one predictor Xj.”


Example ~

Step 0. Conduct an overall F-test of significance (via ANOVA) of the full model.

Step 1. t-tests:   H0: β1 = 0     H0: β2 = 0     H0: β3 = 0     H0: β4 = 0     …
        p-values:  p1 < .05       p2 < .05       p3 > .05       p4 < .05
                   Reject H0      Reject H0      Accept H0      Reject H0

Step 2. Are all the p-values significant (i.e., < .05 = α)? If not, then...

Step 3. Delete the predictor with the largest p-value, and recompute new coefficients.

Repeat Steps 1-3 as necessary, until all p-values are significant.

Step 4. Check feasibility of the final reduced model, and interpret.

[Diagram: the full fitted model β̂1 X1 + β̂2 X2 + β̂3 X3 + β̂4 X4 + … has its least significant term (here β̂3 X3) deleted; the remaining coefficients are then recomputed from scratch, giving the reduced model β̂1′ X1 + β̂2′ X2 + β̂4′ X4 + … .]


Comment: The steps outlined above extend to much more general models, including interaction terms, binary predictors (e.g., in women’s breast cancer risk assessment, let X = 1 if a first-order relative – mother, sister, daughter – was ever affected, X = 0 if not), binary response (e.g., Y = 1 if cancer occurs, Y = 0 if not), multiple responses, etc. The overall goal is to construct a parsimonious model based on the given data, i.e., one that achieves a balance between the level of explanation of the response, and the number of predictor variables. A good model will not have so few variables that it is overly simplistic, yet not too many that its complexity makes it difficult to interpret and form general conclusions. There is a voluminous amount of literature on regression methods for specialized applications; some of these topics are discussed below, but a thorough treatment is far beyond the scope of this basic introduction.

Step 1. For each coefficient (j = 1, 2, …, k − 1), calculate the associated p-value from the test statistic t-ratio = (β̂j − 0) / s.e.(β̂j) ~ t_{n−k}, corresponding to the null hypothesis H0: βj = 0, versus the alternative HA: βj ≠ 0. (Note: The mathematical expression for the standard error s.e.(β̂j) is quite complicated, and best left to statistical software for evaluation.)

Step 2. Are all the p-values corresponding to the regression coefficients significant at (i.e., less than) level α? If yes, go to Step 4; if no, go to Step 3.

Step 3. Select the single least significant coefficient at level α (i.e., the largest p-value, indicating strongest acceptance of the null hypothesis βj = 0), and delete only that corresponding term βj Xj from the model. Refit the original data to the “new” model without the deleted term. That is, recompute the remaining regression coefficients from scratch. Repeat Steps 1-3 until all surviving coefficients are significant (i.e., all p-values < α), then go to Step 4.

Step 4. Evaluate how well the final reduced model fits; check multiple r² value, residual plots, “reality check,” etc. It is also possible to conduct a formal Lack-of-Fit Test, which involves repeated observations yij at predictor value xi; the minimized residual sum of squares SSError can then be further partitioned into SSPure + SSLack-of-Fit, and a formal F-test of significance conducted for the appropriateness of the linear model.


Interaction Terms

Consider the following example. We wish to study the effect of two continuous predictor variables, say X1 = “Drug 1 dosage (0-10 mg)” and X2 = “Drug 2 dosage (0-20 mg),” on a response variable Y = “systolic blood pressure (mm Hg).” Suppose that, based on empirical data using different dose levels, we obtain the following additive multilinear regression model, consisting of main effects only:

Y = 120 + 0.5 X1 + 0.25 X2,    0 ≤ X1 ≤ 10, 0 ≤ X2 ≤ 20.

Rather than attempting to visualize this planar response surface in three dimensions, we can better develop intuition into the relationships between the three variables by projecting it into a two-dimensional interaction diagram, and seeing how the response varies as each predictor is tuned from “low” to “high.” First consider the effect of Drug 1 alone on systolic blood pressure, i.e., X2 = 0. As Drug 1 dosage is increased from a low level of X1 = 0 mg to a high level of X1 = 10 mg, the blood pressure increases linearly, from Y = 120 mm Hg to Y = 125 mm Hg. Now consider the effect of adding Drug 2, eventually at X2 = 20 mg. Again, as Drug 1 dosage is increased from a low level of X1 = 0 mg to a high level of X1 = 10 mg, blood pressure increases linearly, from Y = 125 mm Hg to Y = 130 mm Hg. The change in blood pressure remains constant, thereby resulting in two parallel lines, indicating no interaction between the two drugs on the response.

[Interaction diagram: X1 from LOW to HIGH; at X2 = 0 (LOW), Y = 120 + 0.5 X1; at X2 = 20 (HIGH), Y = 125 + 0.5 X1. The change in response with respect to X2 is constant (5 mm Hg), independent of X1 — no interaction between X1 and X2 on Y; the two lines are parallel.]


However, suppose instead that the model includes a statistically significant (i.e., p-value < α) interaction term:

Y = 120 + 0.5 X1 + 0.25 X2 + 0.1 X1 X2,    0 ≤ X1 ≤ 10, 0 ≤ X2 ≤ 20.

This has the effect of changing the response surface from a plane to a “hyperbolic paraboloid,” shaped somewhat like a saddle.

[Interaction diagram: X1 from LOW to HIGH; at X2 = 0 (LOW), Y = 120 + 0.5 X1; at X2 = 20 (HIGH), Y = 125 + 2.5 X1. The change in response with respect to X2 depends on X1: at X1 = 0, ∆Y = 5 mm Hg; at X1 = 10, ∆Y = 25 mm Hg — interaction between X1 and X2 on Y; the two lines are not parallel.]

Again, at the Drug 2 low dosage level X2 = 0, systolic blood pressure linearly increases by 5 mm Hg as Drug 1 is increased from X1 = 0 to X1 = 10, exactly as before. But now, at the Drug 2 high dosage level X2 = 20, a different picture emerges. For as Drug 1 dosage is increased from a low level of X1 = 0 mg to a high level of X1 = 10 mg, blood pressure linearly increases from Y = 125 mm Hg to a hypertensive Y = 150 mm Hg, a much larger difference of 25 mm Hg! Hence the two resulting lines are not parallel, indicating a significant drug-drug interaction on the response.

Exercise: Draw the interaction diagram corresponding to the model Y = 120 + 0.5 X1 + 0.25 X2 − 0.1 X1 X2.

Comment: As a rule, if an explanatory variable Xj is not significant as a main effect, but is a factor in a statistically significant interaction term, it is nevertheless retained as a main effect in the final model. This convention is known as the Hierarchical Principle.


These ideas also appear in another form. Consider the example of constructing a simple linear regression model for the response variable “Y = height (in.)” on the single predictor variable “X = weight (lbs.)” for individuals of a particular age group. A reasonably positive correlation might be expected, and after obtaining sample observations, the following scatterplot may result, with accompanying least squares regression line.

However, suppose it is the case that the sample is actually composed of two distinct subgroups, which are more satisfactorily modeled by separate, but parallel, regression lines, as in the examples shown below.

[Figures: left — a single scatterplot of Y vs. X with one least squares line fitted to all the data; right — the same data separated into Males and Females, fitted by the separate parallel lines Ŷ2 = 52 + 0.1 X and Ŷ1 = 48 + 0.1 X.]


It is possible to fit both parallel lines to a single multiple linear model simultaneously, by introducing a binary variable that, in this case, codes for gender. Let M = 1 if Male, and M = 0 if Female. Then the model

Y = 48 + 0.1 X + 4 M

incorporates both the (continuous) numerical variable X, as well as the (binary) categorical variable M, as predictors for the response.

However, if the simple linear regression lines are not parallel, then it becomes necessary to include an interaction term, just as before. For example, the model

Y = 48 + 0.1 X + 4 M + 0.2 M X

becomes Ŷ1 = 48 + 0.1 X if M = 0, and Ŷ2 = 52 + 0.3 X if M = 1. These lines have unequal slopes (0.1 and 0.3), hence are not parallel.
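A minimal R sketch of fitting the parallel-lines and unequal-slopes models with the binary indicator M; the data below are hypothetical, simulated from the parallel-lines model above for illustration only:

set.seed(2)                                        # hypothetical data
M      <- rep(c(0, 1), each = 25)                  # 0 = Female, 1 = Male
weight <- runif(50, 100, 200)
height <- 48 + 0.1 * weight + 4 * M + rnorm(50, sd = 1)
fit.add <- lm(height ~ weight + M)     # parallel lines: common slope, different intercepts
fit.int <- lm(height ~ weight * M)     # adds the weight:M interaction (allows unequal slopes)
summary(fit.int)
anova(fit.add, fit.int)                # does the interaction term significantly improve the fit?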

More generally then, categorical data can also be used as predictors of response, by introducing dummy, or indicator variables in the model. Specifically, for each disjoint category i = 1, 2, 3,…, k, let Ii = 1 if category i, and 0 otherwise. For example, for the k = 4 categories of blood type, we have

I1 = 1 if Type A, 0 otherwise        I2 = 1 if Type B, 0 otherwise
I3 = 1 if Type AB, 0 otherwise       I4 = 1 if Type O, 0 otherwise.

0, otherwise. Note that I1 + I2 + … + Ik = 1, so there is collinearity among these k variables; hence – just as in multiple comparisons – there are k – 1 degrees of freedom. (Therefore, only this many indicator variables should be retained in the model; adding the last does not supply new information.) As before, a numerical response Yi for each of the categories can then be modeled by combining main effects and possible interactions of numerical and/or indicator variables.

But what if the response Y itself is categorical, e.g., binary?


Logistic Regression

Suppose we wish to model a binary response variable Y, i.e., Y = 1 (“Success”) with probability π, and Y = 0 (“Failure”) with probability 1 − π, in terms of a predictor variable X. This problem gives rise to several difficulties, as the following example demonstrates.

Example: “If you live long enough, you will need surgery.” Imagine that we wish to use the continuous variable “X = Age” as a predictor for the binary variable “Y = Ever had major surgery (1 = Yes, 0 = No).” If we naively attempt to use simple linear regression however, the resulting model contains relatively little predictive value for the response (either 0 or 1), since it attains all continuous values from −∞ to +∞; see figure below.

This is even more problematic if there are several people of the same age X, with some having had major surgery (i.e., Y = 1), but the others not (i.e., Y = 0). Possibly, a better approach might be to replace the response Y (either 0 or 1), with its probability π, in the model. This would convert the binary variable to a continuous variable, but we still have two problems. First, we are restricted to the finite interval 0 ≤ π ≤ 1. And second, although π is approximately normally distributed, its variance is not constant (see §6.1.3), in violation of one of the assumptions on least squares regression models stated in 7.2.

[Figures: left — the binary responses Y (0 or 1) plotted against X = Age (10 to 90), with the naive linear fit Ŷ = β̂0 + β̂1 X; right — the probability π = P(Y = 1) plotted against X = Age, with the naive linear model π = β̂0 + β̂1 X.]


One solution to the first problem is to transform the probability π, using a continuous link function g(π), which takes on values from −∞ to +∞, as π ranges from 0 to 1. The function usually chosen for this purpose is the log-odds, or logit (pronounced “low-jit”): g(π) = ln[π / (1 − π)]. Thus, the model is given by…

ln[π / (1 − π)] = b0 + b1 X   ⇔   π = 1 / (1 + e^(−b0 − b1 X))

This reformulation does indeed put the estimate π between 0 and 1, but with the constant variance assumption violated, the technique of least squares approximation does not give the best fit here. For example, consider the following artificial data:

X    0      1      2      3      4
π    0.01   0.01   0.50   0.99   0.99

Least squares approximation gives the regression parameter estimates b0 = –5.514 and b1 = 2.757, resulting in the dotted graph shown. However, a closer fit is obtained by using the technique of Maximum Likelihood Estimation (MLE) – actually, a generalization of least squares approximation – and best solved by computer software. The MLE coefficients are b0 = –7.072 and b1 = 3.536, resulting in the solid graph shown.

[Figure: π = P(Y = 1) vs. X, showing the least squares logistic curve (dotted) and the MLE logistic curve (solid).]


Comments: This is known as the “S-shaped,” “sigmoid,” or logistic curve, and appears in a wide variety of applications. See Appendix > Regression Models > Logistic Growth for an example involving restricted population growth. (Compare with unrestricted exponential growth, discussed earlier.)

It is often of interest to determine the median response level, that is, the value of the predictor variable X for which a 50% response level is achieved. Hence, if π = 0.5, then b0 + b1 X = ln[0.5 / (1 − 0.5)] = 0, so X = −b0 / b1.

Exercise: Prove that the median response corresponds to the point of inflection (i.e., change in concavity) of any general logistic curve.

Other link functions sometimes used for binary responses are the probit (pronounced “pro-bit”) and tobit (pronounced “toe-bit”) functions, which have similar properties to the logit. The technique of using link functions is part of a larger regression theory called Generalized Linear Models.

Since the method of least squares is not used for the best fit, the traditional “coefficient of determination” r² as a measure of model fitness does not exist! However, several analogous pseudo-r² formulas have been defined (Efron, McFadden, Cox & Snell, others…), but must be interpreted differently.

Another way to deal with the nonconstant variance of proportions, which does not require logistic regression, is to work with the variance-stabilizing transformation arcsin √π, a technique that we do not pursue here.

To compare regression models: Wald Test, Likelihood Ratio Test, Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC). Polytomous regression is used if the response Y has more than two categories.


The logit function can be modeled by more than one predictor variable via multilinear logistic regression, using selection techniques as described above (except that MLE for the coefficients must be used instead of LS). For instance,

ln[π / (1 − π)] = b0 + b1 X1 + b2 X2 + … + bk−1 Xk−1.

In particular, suppose that one of these variables, say X1, is binary (e.g., Exposure Status: 1 or 0, with the response indicating Disease Status: 1 or 0). Then, as its category level changes from X1 = 0 to X1 = 1, the right-hand amount changes exactly by its coefficient b1. The corresponding amount of change of the left side is equal to the difference in the two log-odds which, via a basic property of logarithms, is equal to the logarithm of the odds ratio between the two categories. Thus, the odds ratio itself can be estimated by e^(b1).

Example: Suppose that, in a certain population of individuals 50+ years old, it is found that the probability π = P(Lung cancer) is modeled by

ln[π / (1 − π)] = −6 + .05 X1 + 4.3 X2

where the predictors are X1 = Age (years), and X2 = Smoker (1 = Yes, 0 = No); note that X1 is numerical, but X2 is binary. Thus, for example, a 50-year-old nonsmoker would correspond to X1 = 50 and X2 = 0 in the model, which yields −3.5 on the right hand side for the “log-odds” of this group. (Solving for the actual probability itself gives π̂ = 1 / (1 + e^(+3.5)) = .03.) We can take this value as a baseline for the population. Likewise, for a 50-year-old smoker, the only difference would be to have X2 = 1 in the model, to indicate the change in smoking status from baseline. This would yield +0.8 for the “log-odds” (corresponding to π̂ = 1 / (1 + e^(−0.8)) = 0.69). Thus, taking the difference gives

log-odds(Smokers) − log-odds(Nonsmokers) = 0.8 − (−3.5)

i.e., log[ odds(Smokers) / odds(Nonsmokers) ] = 0.8 + 3.5

or log(OR) = 4.3, so that OR = e^4.3 = 73.7, quite large. That the exponent 4.3 is also the coefficient of X2 in the model is not a coincidence, as stated above. Moreover, OR = 73.7 here, for any age X1!

Recall that…   log A − log B = log(A / B)




Pharmaceutical Application: Dose-Response Curves

Example: Suppose that, in order to determine its efficacy, a certain drug is administered in subtoxic dosages X (mg) of 90 mg increments to a large group of patients. For each patient, let the binary variable Y = 1 if improvement is observed, Y = 0 if there is no improvement. The proportion π of improved patients is recorded at each dosage level, and the following data are obtained.

X    90     180    270    360    450
π    0.10   0.20   0.60   0.80   0.90

The logistic regression model (as computed via MLE) is

ln[π / (1 − π)] = −3.46662 + 0.01333 X   ⇔   π̂ = 1 / (1 + e^(3.46662 − 0.01333 X)),

and the following graph is obtained.

[Figure: fitted logistic dose-response curve π̂ vs. dosage X (mg), crossing π = 0.5 at the median dosage of 260 mg.]

The median dosage is X = 3.46662 / 0.01333 = 260.0 mg. That is, above this dosage level, more patients are improving than not improving.
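A sketch of this fit in R using glm with the binomial family; since only the observed proportions are given, an assumed common number of patients per dose level is supplied as weights (the fitted coefficients do not depend on that common value):

dose <- c(90, 180, 270, 360, 450)
p    <- c(0.10, 0.20, 0.60, 0.80, 0.90)
n    <- rep(100, 5)                        # assumed number of patients per dose level
fit  <- glm(p ~ dose, family = binomial, weights = n)
coef(fit)                                  # intercept about -3.467, slope about 0.0133
-coef(fit)[1] / coef(fit)[2]               # median (50% response) dosage, about 260 mg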


7.4 Problems

1. In Problem 4.4/29, it was shown that important relations exist between population means, variances, and covariance. Specifically, we have the formulas that appear below.

In this problem, we verify that these properties are also true for sample means, variances, and covariance in two examples. For data values {x1, x2, …, xn} and {y1, y2, …, yn}, recall that:

x̄ = (1/n) Σ_{i=1}^{n} xi        sx² = 1/(n − 1) Σ_{i=1}^{n} (xi − x̄)²

ȳ = (1/n) Σ_{i=1}^{n} yi        sy² = 1/(n − 1) Σ_{i=1}^{n} (yi − ȳ)².

Now suppose that each value xi from the first sample is paired with exactly one corresponding value yi from the second sample. That is, we have the set of n ordered pairs of data {(x1, y1), (x2, y2), …, (xn, yn)}, with sample covariance given by

sxy = 1/(n − 1) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ).

Furthermore, we can label the pairwise sum “x + y” as the dataset (x1 + y1, x2 + y2, …, xn + yn), and likewise for the pairwise difference “x − y.” It can be shown (via basic algebra, or Appendix A2), that for any such dataset of ordered pairs, the formulas that appear below hold. (Note that these formulas generalize the properties found in Problem 2.5/4.)

For the following ordered data pairs, verify that the formulas in I and II hold. (In R, use mean, var, and cov.) Also, sketch the scatterplot.

x    0   6   12   18
y    3   3   5    9

Repeat for the following dataset. Notice that the values of xi and yi are the same as before, but the correspondence between them is different!

x    0   6   12   18
y    3   9   3    5

Sample:
I.  (A) mean(x + y) = x̄ + ȳ        (B) s_{x+y}² = sx² + sy² + 2 sxy
II. (A) mean(x − y) = x̄ − ȳ        (B) s_{x−y}² = sx² + sy² − 2 sxy

Population:
I.  (A) µ_{X+Y} = µX + µY           (B) σ_{X+Y}² = σX² + σY² + 2 σXY
II. (A) µ_{X−Y} = µX − µY           (B) σ_{X−Y}² = σX² + σY² − 2 σXY


2. Expiration dates that establish the shelf lives of pharmaceutical products are determined from stability data in drug formulation studies. In order to measure the rate of decomposition of a particular drug, it is stored under various conditions of temperature, humidity, light intensity, etc., and assayed for intact drug potency at FDA-recommended time intervals of every three months during the first year. In this example, the assay Y (mg) of a certain 500 mg tablet formulation is determined at time X (months) under ambient storage conditions.

X    0     3     6     9     12
Y    500   490   470   430   350

(a) Graph these data points (xi, yi) in a scatterplot, and calculate the sample correlation coefficient r = sxy / (sx sy). Classify the correlation as positive or negative, and as weak, moderate, or strong.

(b) Determine the equation of the least squares regression line for these data points, and include a 95% confidence interval for the slope β1.

(c) Sketch a graph of this line on the same set of axes as part (a); also calculate and plot the fitted response values ŷi and the residuals ei = yi − ŷi on this graph.

(d) Complete an ANOVA table for this linear regression, including the F-ratio and corresponding p-value.

(e) Calculate the value of the coefficient of determination r², using the two following equivalent ways (and showing agreement of your answers), and interpret this quantity as a measure of fit of the regression line to the data, in a brief, clear explanation.

• via squaring the correlation coefficient r = sxy / (sx sy) found in (a),
• via the ratio r² = SSRegression / SSTotal of sums of squares found in (d).

(f) Test the null hypothesis of no linear association between X and Y, either by using your answer in (a) on H0: ρ = 0, or equivalently, by using your answers in (b) and/or (d) on H0: β1 = 0.

(g) Calculate a point estimate of the mean potency when X = 6 months. Judging from the data, is this realistic? Determine a 95% confidence interval for this value.

(h) The FDA recommends that the expiration date should be defined as that time when a drug contains 90% of the labeled potency. Using this definition, calculate the expiration date for this tablet formulation. Judging from the data, is this realistic?

(i) The residual plot of this model shows evidence of a nonlinear trend. (Check this!) In order to obtain a better regression model, first apply the linear transformations X̃ = X / 3 and Ỹ = 510 − Y, then try fitting an exponential curve Ỹ = α e^(βX̃). Use this model to determine the expiration date. Judging from the data, is this realistic?


(j) Redo this problem using the following R code:

# See help(lm) or help(lsfit), and help(plot.lm) for details.

# Compute Correlation Coefficient and Scatterplot
X <- c(0, 3, 6, 9, 12)
Y <- c(500, 490, 470, 430, 350)
cor(X, Y)
plot(X, Y, xlab = "X = Months", ylab = "Y = Assay (mg)", pch = 19)

# Least Squares Fit, Regression Line Plot, ANOVA F-test
regline <- lm(Y ~ X)
summary(regline)
abline(regline, col = "blue")
# Exercise (answer this): Why does the p-value of 0.02049 appear twice?

# Estimate Mean Potency at 6 Months
new <- data.frame(X = 6)
predict(regline, new, interval = "confidence")

# Residual Plot
resids <- round(resid(regline), 2)
plot(regline, which = 1, id.n = 5, labels.id = resids, pch = 19)

# Log-Transformed Linear Regression
Xtilde <- X / 3
Ytilde <- 510 - Y
V <- log(Ytilde)
plot(Xtilde, V, xlab = "Xtilde", ylab = "ln(Ytilde)", pch = 19)
regline.transf <- lm(V ~ Xtilde)
summary(regline.transf)
abline(regline.transf, col = "red")

# Plot Transformed Model
coeffs <- coefficients(regline.transf)
scale <- exp(coeffs[1])
shape <- coeffs[2]
Yhat <- function(X) (510 - scale * exp(shape * X / 3))
plot(X, Y, xlab = "X = Months", ylab = "Y = Assay (mg)", pch = 19)
curve(Yhat, col = "red", add = TRUE)



3. A Third Transformation. Suppose that two continuous variables X and Y are negatively correlated via the nonlinear relation Y = 1 / (α X + β), for some parameters α and β. This is algebraically equivalent to the relation 1/Y = α X + β, which can then be solved via simple linear regression. Use this reciprocal transformation on the data and corresponding scatterplot below, to sketch a new scatterplot, and solve for sample-based estimates of the parameters α and β. (Hint: Finding the parameter values in this example should be straightforward, and not require any “least squares” regression formulas.) Express the original response Y in terms of X.

 X |  0   1   2   3   4   5                 X  |  0   1   2   3   4   5
 Y | 60  30  20  15  12  10       →        1/Y |
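For checking the hand computation, a hedged R sketch of this reciprocal fit (not part of the original problem; the object name recip.fit is purely illustrative):

X <- 0:5
Y <- c(60, 30, 20, 15, 12, 10)
plot(X, 1/Y, pch = 19)           # transformed scatterplot: should be exactly linear
recip.fit <- lm(I(1/Y) ~ X)      # fits 1/Y = alpha*X + beta
coef(recip.fit)                  # slope estimates alpha, intercept estimates beta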

4. For this problem, recall that in simple linear regression, we have the following definitions:

 b1 = sxy / sx² ,  MSErr = SSErr / (n − 2) ,  r² = SSReg / SSTot = 1 − SSErr / SSTot ,  SSTot = (n − 1) sy² ,  and  Sxx = (n − 1) sx² .

(a) Formally prove that the T-score = r √(n − 2) / √(1 − r²) for testing the null hypothesis H0: ρ = 0 is equal to the T-score = (b1 − β1) / √(MSErr / Sxx) for testing the null hypothesis H0: β1 = 0.

(b) Formally prove that, in simple linear regression (where dfReg = 1), the square of the T-score = (b1 − β1) / √(MSErr / Sxx) is equal to the F-ratio = MSReg / MSErr for testing the null hypothesis H0: β1 = 0.

5. In a study of “binge eating” disorders among dieters, the average weights (Y) of a group of

overweight women of similar ages and lifestyles are measured at the end of every two months (X) over an eight month period. The resulting data values, some accompanying summary statistics, and the corresponding scatterplot, are shown below.

 X |   0    2    4    6    8       x̄ = 4       sx² = 10
 Y | 200  190  210  180  220       ȳ = 200     sy² = 250

(a) Compute the sample covariance sxy between the variables X and Y.

(b) Compute the sample correlation coefficient r between the variables X and Y. Use it to classify the linear correlation as positive or negative, and as strong, moderate, or weak.

(c) Determine the equation of the least squares regression line for these data. Sketch a graph of this line on the scatterplot provided above. Please label clearly!

(d) Also calculate the fitted response values ŷi, and plot the residuals ei = yi − ŷi, on this same graph. Please label clearly!

(e) Calculate the coefficient of determination r2, and interpret its value in the context of evaluating the fit of this linear model to the sample data. Be as clear as possible.

(f) Interpretation: Evaluate the overall adequacy of the linear model to these data, using as

much evidence as possible. In particular, refer to at least two formal “linear regression” assumptions which may or may not be satisfied here, and why.
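A hedged R sketch for checking parts (a)–(e) by computer (the hand calculations are still expected):

X <- c(0, 2, 4, 6, 8)
Y <- c(200, 190, 210, 180, 220)
cov(X, Y); cor(X, Y)            # parts (a) and (b)
fit <- lm(Y ~ X)
coef(fit)                       # part (c): intercept and slope
fitted(fit); resid(fit)         # part (d)
summary(fit)$r.squared          # part (e)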

6. A pharmaceutical company wishes to evaluate the results Y of a new drug assay procedure,

performed on n = 5 drug samples of different, but known potency X. In a perfect error-free assay, the two sets of values would be identical, thus resulting in the ideal calibration line Y = X, i.e., Y = 0 + 1X. However, experimental variability generates the results shown below, along with some accompanying summary statistics: the sample means, variances, and covariance, respectively.

 X (mg) | 30  40  50  60  70       x̄ = 50     sx² = 250
 Y (mg) | 32  39  53  65  71       ȳ = 52     sy² = 275       sxy = 260

(a) Graph these data points (xi, yi) in a scatterplot. (b) Compute the sample correlation coefficient r. Use it to determine whether or not X and Y

are linearly correlated; if so, classify as positive or negative, and as weak, moderate, or strong.

(c) Determine the equation of the least squares regression line for these data. Sketch a graph of this line on the same set of axes as part (a). Also calculate and plot the fitted response values ŷi and the residuals ei = yi − ŷi, on this same graph.

(d) Using all of this information, complete the following ANOVA table for this simple linear

regression model. (Hints: SSTotal and dfTotal can be obtained from sy2 given above; SSError =

“residual sum of squares,” and dfError = n – 2.) Show all work.

 Source      | df | SS | MS | F-ratio | p-value
 Regression  |    |    |    |         |
 Error       |    |    |    |         |
 Total       |    |    |    |         |

(e) Construct a 95% confidence interval for the slope β1.

(f) Use the p-value in (d) and the 95% confidence interval in (e) to test whether the null hypothesis H0: β1 = 0 can be rejected in favor of the alternative HA: β1 ≠ 0, at the α = .05 significance level. Interpret your answer: What exactly has been demonstrated about any association that might exist between X and Y? Be precise.

(g) Use the 95% confidence interval in (e) to test whether the null hypothesis H0: β1 = 1 can be rejected in favor of the alternative HA: β1 ≠ 1, at the α = .05 significance level. Interpret your answer in context: What exactly has been demonstrated about the new drug assay procedure? Be precise.
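A hedged R sketch for checking parts (b)–(g) (illustrative only):

X <- c(30, 40, 50, 60, 70)
Y <- c(32, 39, 53, 65, 71)
cor(X, Y)                              # part (b)
calib <- lm(Y ~ X)
summary(calib); anova(calib)           # parts (c) and (d)
confint(calib, level = 0.95)           # part (e): 95% CI for intercept and slope
# Part (g): H0: beta1 = 1 can also be tested directly via t = (b1 - 1) / SE(b1), df = n - 2 = 3
b1  <- coef(summary(calib))["X", "Estimate"]
se1 <- coef(summary(calib))["X", "Std. Error"]
2 * pt(abs(b1 - 1) / se1, df = 3, lower.tail = FALSE)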

7. Refer to the posted Rcode folder for this problem. Please answer all questions.


8. Survival Analysis

8.1 Survival Functions and Hazard Functions
8.2 Estimation: Kaplan-Meier Formula
8.3 Inference: Log-Rank Test

8.4 Regression: Cox Proportional Hazards Model

8.5 Problems

8. Survival Analysis

8.1 Definition: Survival Function

Survival Analysis is also known as Time-to-Event Analysis, Time-to-Failure Analysis, or Reliability Analysis (especially in the engineering disciplines), and requires specialized techniques. Examples:

Cancer surgery, radiotherapy, chemotherapy → Death

Cancer remission → Cancer recurrence

Coronary artery bypass surgery → Heart attack or death, whichever comes first

Topical application of skin rash ointment → Disappearance of symptoms

However, such longitudinal data can be censored, i.e., the “event” may not occur before the end of the study. Patients can be “lost to follow-up” (e.g., moved away, non-compliant, choose a different treatment, etc.), as shown in the diagram below.

[Figure: timeline diagram of Patients 1–3 from “Study begins” to “Study ends,” each timeline ending in either a Death or a Censored observation.]


POPULATION

Define a continuous random variable T = “time-to-event,” or in this context, “survival time (until death).” From this, construct the survival function

S(t) = P(T > t) = Probability of surviving beyond time t.

The graph of the survival function is the survival curve.

Properties: For all t ≥ 0,

0 ≤ S(t) ≤ 1

S(0) = 1, and S(t) monotonically decreases to 0 as t gets larger.

S(t) is continuous.

Examples: S(t) = e^(−ct) (c > 0),  S(t) = 1 / (1 + t),  …

Note that the probability of death occurring in the interval [a, b] is P(a ≤ T ≤ b) = P(T > a) − P(T > b) = S(a) − S(b).

[Figure: a survival curve S(t), starting at S(0) = 1 and decreasing toward 0; a second panel marks the values S(a) and S(b), whose difference is P(a ≤ T ≤ b).]

SAMPLE: How can we estimate S(t), using a cohort of n individuals? For simplicity, assume no censoring for now.

Life Table Method: Suppose that, at the end of every month (week, year, etc.), we record the current number of deaths dt so far, or equivalently, the current number of survivors nt = n − dt, over the duration of the study. At these values t = 1, 2, 3,..., define

 Ŝ(t) = nt / n = 1 − dt / n ,

and linear in between.

Example: Twelve-month cohort study of n = 10 patients

 Patient | Survival Time (months)           Month | dt | nt | Ŝ(t)
    1    |  3*                                 1  |  0 | 10 | 1.0
    2    |  5                                  2  |  0 | 10 | 1.0
    3    |  6                                  3  |  0 | 10 | 1.0
    4    |  6                                  4  |  1 |  9 | 0.9
    5    |  7                                  5  |  1 |  9 | 0.9
    6    |  8                                  6  |  2 |  8 | 0.8
    7    |  8                                  7  |  4 |  6 | 0.6
    8    |  8                                  8  |  5 |  5 | 0.5
    9    | 10                                  9  |  8 |  2 | 0.2
   10    | 12                                 10  |  8 |  2 | 0.2
                                              11  |  9 |  1 | 0.1
 * Patient 1 died in month 4, etc.            12  |  9 |  1 | 0.1

Disadvantage: This method is based on calendar times, not cohort times of death, thereby wasting much information. A more efficient method can be developed that is based on the observed times of death of the patients.

[Figure: the life-table estimate Ŝ(t) plotted against time (months 0–12), decreasing from 1.0 through 0.9, 0.8, 0.6, 0.5, and 0.2 down to 0.1, with linear interpolation between months.]

8.2 Estimation: Kaplan-Meier Product-Limit Formula

Let t1, t2, t3, … denote the actual times of death of the n individuals in the cohort. Also let d1, d2, d3, … denote the number of deaths that occur at each of these times, and let n1, n2, n3, … be the corresponding number of patients remaining in the cohort. Note that n2 = n1 − d1, n3 = n2 − d2, etc. Then, loosely speaking, S(t2) = P(T > t2) = “Probability of surviving beyond time t2” depends conditionally on S(t1) = P(T > t1) = “Probability of surviving beyond time t1.” Likewise, S(t3) = P(T > t3) = “Probability of surviving beyond time t3” depends conditionally on S(t2) = P(T > t2) = “Probability of surviving beyond time t2,” etc. By using this recursive idea, we can iteratively build a numerical estimate Ŝ(t) of the true survival function S(t). Specifically,

For any time t ∈ [0, t1), we have S(t) = P(T > t) = “Probability of surviving beyond time t” = 1, because no deaths have as yet occurred. Therefore, for all t in this interval, let Ŝ(t) = 1.

Recall (see 3.2): For any two events A and B, P(A and B) = P(A) × P(B | A).

Let A = “survive to time t1” and B = “survive from time t1 to beyond some time t before t2.” Having both events occur is therefore equivalent to the event “A and B” = “survive to beyond time t before t2,” i.e., “T > t.” Hence, the following holds.

For any time t ∈ [t1, t2), we have…

S(t) = P(T > t) = P(survive in [0, t1)) × P(survive in [t1, t] | survive in [0, t1)),

i.e.,

 Ŝ(t) = 1 × (n1 − d1) / n1 ,  or  Ŝ(t) = 1 − d1/n1 .  Similarly,

For any time t ∈ [t2, t3), we have…

 S(t) = P(T > t) = P(survive in [t1, t2)) × P(survive in [t2, t] | survive in [t1, t2)),

i.e.,

 Ŝ(t) = (1 − d1/n1) × (n2 − d2) / n2 ,  or  Ŝ(t) = (1 − d1/n1)(1 − d2/n2) ,  etc.

[Timeline: 0, t1, t2, t3, . . .]

In general, for t ∈ [tj, tj+1), j = 1, 2, 3,…, we have

 Ŝ(t) = (1 − d1/n1)(1 − d2/n2) ⋯ (1 − dj/nj) = Πᵢ₌₁ʲ (1 − di/ni) .

This is known as the Kaplan-Meier estimator of the survival function S(t). (Theory developed in 1950s, but first implemented with computers in 1970s.) Note that it is not continuous, but only piecewise-continuous (actually, piecewise-constant, or “step function”).

Comment: The Kaplan-Meier estimator ˆ( )S t can be regarded as a point estimate of the survival function S(t) at any time t. In a manner similar to that discussed in 7.2, we can construct 95% confidence intervals around each of these estimates, resulting in a pair of confidence bands that brackets the graph. To compute the confidence intervals, Greenwood’s Formula gives an asymptotic estimate of the standard error of ˆ( )S t for large groups.

[Figure: graph of the Kaplan-Meier step function Ŝ(t): equal to 1 on [0, t1), dropping to 1 − d1/n1 at t1, then to (1 − d1/n1)(1 − d2/n2) at t2, and so on. Beneath the time axis: times of death 0, t1, t2, t3, …; # deaths 0, d1, d2, d3, …; # survivors n1 = n − 0, n2 = n1 − d1, n3 = n2 − d2, ….]


Example (cont’d): Twelve-month cohort study of n = 10 patients

 Patient:      1    2    3    4    5    6    7    8     9    10
 ti (months): 3.2  5.5  6.7  6.7  7.9  8.4  8.4  8.4  10.3  alive

 Interval [ti, ti+1) | ni = # patients at risk at ti⁻ | di = # deaths at ti | 1 − di/ni | Ŝ(t)
 [0, 3.2)            | 10                             | 0                   | 1.00      | 1.0
 [3.2, 5.5)          | 10 − 0 = 10                    | 1                   | 0.90      | 0.9
 [5.5, 6.7)          | 10 − 1 = 9                     | 1                   | 0.89      | 0.8
 [6.7, 7.9)          |  9 − 1 = 8                     | 2                   | 0.75      | 0.6
 [7.9, 8.4)          |  8 − 2 = 6                     | 1                   | 0.83      | 0.5
 [8.4, 10.3)         |  6 − 1 = 5                     | 3                   | 0.40      | 0.2
 [10.3, 12)          |  5 − 3 = 2                     | 1                   | 0.50      | 0.1
 Study Ends          |  2 − 1 = 1                     | 0                   | 1.00      | 0.1

 (ti⁻ denotes a time just prior to ti)

[Figure: Kaplan-Meier step-function estimate Ŝ(t) for this example, starting at 1.0 and stepping down to 0.9, 0.8, 0.6, 0.5, 0.2, and 0.1 at the death times 3.2, 5.5, 6.7, 7.9, 8.4, and 10.3 months.]

[Figure: Kaplan-Meier curve for the censored version of this example (below), with Ŝ(t) = 1.000, 0.900, 0.675, 0.270, and 0.135 over successive intervals; censored observations are marked with ×.]

Exercise: Prove algebraically that, assuming no censored observations (as in the preceding example), the Kaplan-Meier estimator can be written simply as Ŝ(t) = ni+1 / n for t ∈ [ti, ti+1), i = 0, 1, 2,… Hint: Use mathematical induction; recall that ni+1 = ni − di. In light of this, now assume that the data consists of censored observations as well, so that ni+1 = ni − di − ci.

Example (cont’d):

 Patient:      1    2     3    4    5     6    7    8     9    10
 ti (months): 3.2  5.5*  6.7  6.7  7.9*  8.4  8.4  8.4  10.3  alive       (*censored)

 Interval [ti, ti+1) | ni = # at risk at ti⁻    | di = # deaths | ci = # censored | 1 − di/ni | Ŝ(t)
 [0, 3.2)            | 10                       | 0             | 0               | 1.00      | 1.000
 [3.2, 6.7)          | 10 − 0 − 0 = 10          | 1             | 1               | 0.90      | 0.900
 [6.7, 8.4)          | 10 − 1 − 1 = 8           | 2             | 1               | 0.75      | 0.675
 [8.4, 10.3)         |  8 − 2 − 1 = 5           | 3             | 0               | 0.40      | 0.270
 [10.3, 12)          |  5 − 3 − 0 = 2           | 1             | 0               | 0.50      | 0.135
 Study Ends          |  2 − 1 − 0 = 1           | 0             | 0               | 1.00      | 0.135

Exercise: What would the corresponding changes be to the Kaplan-Meier estimator if Patient 10 died at the very end of the study?
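The two tables above can be checked by computer; a hedged sketch using the survival package (assumed installed; not part of the original notes), whose survfit function implements the Kaplan-Meier product-limit estimator:

library(survival)
months <- c(3.2, 5.5, 6.7, 6.7, 7.9, 8.4, 8.4, 8.4, 10.3, 12)
status <- c(1, 0, 1, 1, 0, 1, 1, 1, 1, 0)   # 1 = death, 0 = censored (Patient 10 alive at 12 mos.)
KM <- survfit(Surv(months, status) ~ 1)
summary(KM)                                 # compare with the hand-computed table above
plot(KM, mark.time = TRUE, xlab = "Time (months)", ylab = "Estimated S(t)")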

Hazard Functions

Suppose we have a survival function S(t) = P(T > t), where T = survival time, and some ∆t > 0. We wish to calculate the conditional probability of failure (death) in the interval [t, t + ∆t), given survival to time t:

 P(t ≤ T < t + ∆t | T > t) = P(t ≤ T < t + ∆t) / P(T > t) = [S(t) − S(t + ∆t)] / S(t) .

Therefore, dividing by ∆t,

 P(t ≤ T < t + ∆t | T > t) / ∆t = [ −1 / S(t) ] × [ S(t + ∆t) − S(t) ] / ∆t .

Now, take the limit of both sides as ∆t → 0:

 h(t) = [ −1 / S(t) ] S′(t) = − d[ln S(t)] / dt   ⇔   S(t) = e^( −∫₀ᵗ h(u) du )

This is the hazard function (or hazard rate, failure rate), and roughly characterizes the “instantaneous probability” of dying at time t, in the above mathematical “limiting” sense. It is always ≥ 0 (Why? Hint: What signs are S(t) and S′(t), respectively?), but can be > 1, hence is not a true probability in a mathematically rigorous sense. Exercise: Suppose two hazard functions are linearly combined to form a third hazard function: c1 h1(t) + c2 h2(t) = h3(t), for any constants c1, c2 ≥ 0. What is the relationship between their corresponding log-survival functions ln S1(t), ln S2(t), and ln S3(t)?

Its integral, ∫₀ᵗ h(u) du, is the cumulative hazard rate – denoted H(t) – and increases (since H′(t) = h(t) ≥ 0). Note also that H(t) = − ln S(t), and so Ĥ(t) = − ln Ŝ(t).

[Figure: a survival curve S(t), with the values S(t) and S(t + ∆t) marked at times t and t + ∆t.]

Examples: (Also see last page of 4.2!)

If the hazard function is constant for t ≥ 0, i.e., h(t) ≡ α > 0, then it follows that the survival function is S(t) = e^(−αt), i.e., the exponential model. Shown here is α = 1.

More realistically perhaps, suppose the hazard takes the form of a more general power function, i.e., h(t) = αβ t^(β−1), for “scale parameter” α > 0 and “shape parameter” β > 0, for t ≥ 0. Then S(t) = e^(−α t^β), i.e., the Weibull model, an extremely versatile and useful model with broad applications to many fields. The case α = 1, β = 2 is illustrated below.
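A hedged R sketch (not in the original notes) for reproducing these plots, i.e., the constant-hazard exponential case with α = 1 and the Weibull case with α = 1, β = 2:

S.exp  <- function(t) exp(-t)               # exponential survival, alpha = 1
h.weib <- function(t) 2 * t                 # Weibull hazard, alpha = 1, beta = 2
S.weib <- function(t) exp(-t^2)             # Weibull survival, alpha = 1, beta = 2
plot(S.exp,  from = 0, to = 5, lwd = 2, xlab = "t", ylab = "S(t)")
plot(h.weib, from = 0, to = 3, lwd = 2, xlab = "t", ylab = "h(t)")
plot(S.weib, from = 0, to = 3, lwd = 2, xlab = "t", ylab = "S(t)")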

Exercise: Suppose that, for argument’s sake, a population is modeled by the decreasing hazard function h(t) = 1 / (t + c) for t ≥ 0, where c > 0 is some constant. Sketch the graph of the survival function S(t), and find the median survival time.

8.3 Statistical Inference: Log-Rank Test

Suppose that we wish to compare the survival curves S1(t) and S2(t) of two groups, e.g., breast cancer patients with chemotherapy versus without.

[Figure: POPULATION — the two survival curves S1(t) and S2(t); SAMPLE — their Kaplan-Meier estimates Ŝ1(t) and Ŝ2(t), plotted against time.]

Null Hypothesis

H0: S1(t) = S2(t) for all t

“Survival probability is equal in both groups.”


To conduct a formal test of the null hypothesis, we construct a 2 × 2 contingency table for each interval [ti, ti+1), where i = 0, 1, 2,…

Observed # deaths vs. Expected # deaths, and Variance:

              Dead    Alive
 Group 1       ai      bi      R1i   →   E1i = R1i C1i / ni
 Group 2       ci      di      R2i   →   E2i = R2i C1i / ni
               C1i     C2i     ni

 Vi = R1i R2i C1i C2i / [ ni² (ni − 1) ]

Therefore, summing over all intervals i = 0, 1, 2,…, we obtain

Observed total deaths:  Group 1: O1 = Σ ai ;  Group 2: O2 = Σ ci .
Expected total deaths:  Group 1: E1 = Σ E1i ;  Group 2: E2 = Σ E2i .
Total Variance:  V = Σ Vi .

In effect, the contingency tables are combined in the same way as in any cohort study. In particular, an estimate of the summary odds ratio can be calculated via the general Mantel-Haenszel formula OR = Σ (ai di / ni) / Σ (bi ci / ni) (see §6.2.3), with an analogous interpretation in terms of group survival. The formal test for significance relies on the corresponding log-rank statistic:

 Χ² = (O1 − E1)² / V ~ χ²₁ ,

although a slightly less cumbersome alternative is the (approximate) test statistic

 Χ² = (O1 − E1)² / E1 + (O2 − E2)² / E2 ~ χ²₁ .

[Figure: illustration of two Kaplan-Meier survival curves Ŝ1(t) and Ŝ2(t) that are not significantly different from one another.]
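As a hedged illustration (not part of the original notes), the log-rank test is available in R via survdiff in the survival package; the times, statuses, and group labels below are invented purely for demonstration:

library(survival)
time   <- c(3, 5, 7, 9, 11, 2, 4, 4, 6, 8)
status <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1)        # 1 = death, 0 = censored
group  <- rep(c("Treatment", "Control"), each = 5)
survdiff(Surv(time, status) ~ group)              # log-rank chi-squared statistic and p-value
plot(survfit(Surv(time, status) ~ group), lty = 1:2, xlab = "Time", ylab = "Estimated S(t)")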

8.4 Regression: Cox Proportional Hazards Model

Suppose we wish to model the hazard function h(t) for a population, in terms of explanatory variables – or covariates – X1, X2, X3, …, Xm. That is,

h(t) = h(t; X1, X2, X3,… , Xm),

so that all the individuals corresponding to one set of covariate values have a different hazard function from all the individuals corresponding to some other set of covariate values.

Assume initially that h has the general form h(t) = h0(t) C(X1, X2, X3,… , Xm). Example: In a population of 50-year-old males, X1 = smoking status (0 = No, 1 = Yes), X2 = # pounds overweight, X3 = # hours of exercise per week. Consider

 h(t) = .02 t e^(X1 + 0.3 X2 − 0.5 X3) .

If X1 = 0, X2 = 0, X3 = 0, then h0(t) = .02 t. This is the baseline hazard. (Therefore, the corresponding survival function is S0(t) = e^(−.01 t²). Why?)

If X1 = 1, X2 = 10 lbs, X3 = 2 hrs/wk, then h(t) = .02 t e³ = .02 t (20.1) = .402 t. (Therefore, the corresponding survival function is S(t) = e^(−.201 t²). Why?)

Thus, the proportion of hazards h(t) / h0(t) = e³ (= 20.1), i.e., constant for all time t.

[Figure: the baseline hazard h0(t) and the proportional hazard h(t) = 20.1 h0(t), plotted against t.]


Furthermore, notice that this hazard function can be written as…

 h(t) = .02 t (e^X1) (e^0.3 X2) (e^−0.5 X3) .

Hence, with all other covariates being equal, we have the following properties.

If X1 is changed from 0 to 1, then the net effect is that of multiplying the hazard function by a constant factor of e¹ ≈ 2.72. Similarly,

If X2 is increased to X2 + 1, then the net effect is that of multiplying the hazard function by a constant factor of e^0.3 ≈ 1.35. And finally,

If X3 is increased to X3 + 1, then the net effect is that of multiplying the hazard function by a constant factor of e^−0.5 ≈ 0.61. (Note that this is less than 1, i.e., beneficial to survival.)

In general, the hazard function given by the form

 h(t) = h0(t) e^(β1 X1 + β2 X2 + … + βm Xm) ,

where h0(t) is the baseline hazard function, is called the Cox Proportional Hazards Model, and can be rewritten as the equivalent linear regression problem:

 ln[ h(t) / h0(t) ] = β1 X1 + β2 X2 + … + βm Xm .

The “constant proportions” assumption is empirically verifiable. Once again, the regression coefficients are computationally intensive, and best left to a computer.

Comment: There are many practical extensions of the methods in this section, including techniques for hazards modeling when the “constant proportions” assumption is violated, when the covariates X1, X2, X3, …, Xm are time-dependent, i.e.,

 ln[ h(t) / h0(t) ] = β1 X1(t) + β2 X2(t) + … + βm Xm(t),

when patients continue to be recruited after the study begins, etc. Survival Analysis remains a very open area of active research.
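As a hedged sketch (not from the original notes), a Cox proportional hazards model is fit in R with coxph from the survival package; the covariates mirror the smoking / overweight / exercise example above, and the data values are invented for illustration:

library(survival)
d <- data.frame(
  time     = c(2.1, 3.5, 4.0, 5.2, 6.8, 7.4, 8.9, 10.0),
  status   = c(1, 1, 0, 1, 1, 0, 1, 0),     # 1 = event, 0 = censored
  smoke    = c(1, 1, 0, 1, 0, 0, 1, 0),
  overwt   = c(20, 10, 5, 15, 0, 8, 25, 2),
  exercise = c(0, 2, 5, 1, 4, 3, 0, 6)
)
fit <- coxph(Surv(time, status) ~ smoke + overwt + exercise, data = d)
summary(fit)      # estimated beta's and hazard ratios exp(beta)
cox.zph(fit)      # empirical check of the "constant proportions" assumption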

8.5 Problems

1. Displayed below are the survival times in months since diagnosis for 10 AIDS patients

suffering from concomitant esophageal candidiasis, an infection due to Candida yeast, and cytomegalovirus, a herpes infection that can cause serious illness.

 Patient | ti (months)
    1    |  0.5*
    2    |  1.0
    3    |  1.0
    4    |  1.0
    5    |  2.0
    6    |  5.0*
    7    |  8.0*
    8    |  9.0
    9    | 10.0*
   10    | 12.0*
 *censored

(a) Construct the Kaplan-Meier product-limit estimator of the survival function S(t), and sketch its graph.

(b) Calculate the estimated 1-month and 2-month survival probabilities, respectively.

(c) Redo part (a) with R, using survfit.

2. For any constants a > 1, b > 0, graph the hazard function h(t) = a − b / (t + b), for t ≥ 0. Find and graph the corresponding survival function S(t). What happens to each function as b → 0? As b → ∞?

3. In a cancer research journal article, authors conduct a six-year trial involving a small sample of n = 5 patients who are on a certain aggressive treatment, and present the survival data in the form of a computer-generated table shown below, at left. (The last patient is alive at the end of the trial.)

 Patient | Survival time (mos.)
   001   |  36.0
   002   |  48.0*
   003   |  60.0
   004   |  60.0
   005   |  72.0 (alive)
 *censored

 Time interval | ni | ci | di | 1 − di/ni | Ŝ(t)
 (to be completed in part (a))

(a) Using the Kaplan-Meier product-limit formula, complete the table of estimated survival probabilities shown at right.

(b) From part (a), sketch the Kaplan-Meier survival curve Ŝ(t) corresponding to this sample. Label all relevant features.

Suppose that, from subsequently larger studies, it is determined that the true survival curve corresponding to this population can be modeled by the function

 S(t) = e^(−.00004 t^2.5)  for t ≥ 0, as shown.

Use this Weibull model to answer each of the following.

(c) Calculate the probability of surviving beyond three years.

(d) Compute the median survival time for this population.

(e) Determine the hazard function h(t), and sketch its graph below.

(f) Calculate the hazard rate at three years.

[Figure: graphs of the hazard functions h1(t) and h2(t) described in Problem 4 below.]

4.

(a) Suppose that a certain population of individuals has a constant hazard function h1(t) = 0.03

for all time t > 0, as shown in the first graph above. For the variable T = “survival time (years),” determine the survival function S1(t), and sketch its graph on the set of axes below.

(b) Suppose that another population of individuals has a piecewise constant hazard function given by

 h2(t) = 0.02 for 0 ≤ t ≤ 5,  and  0.04 for t > 5,

as shown in the second graph above. For the variable T = “survival time (years),” determine the survival function S2(t), and sketch its graph on the same set of axes below.

(c) For each population, use the corresponding survival function S(t) = P(T > t) to calculate each of the following. Show all work.

                                     Population 1    Population 2
 P(T > 4)
 P(T > 5)
 P(T > 6)
 Odds of survival after 5 years
 Median survival time tmed,
   i.e., when P(T > tmed) = 0.5

(d) Is there a finite time t* > 0 when the two populations have equal survival probabilities P(T > t*)? If so, calculate its value, and the value of this common survival probability.

5. (Hint: See page 8.2-6 in the notes.) A population of children having a certain disease suffers from a high but rapidly decreasing infant mortality rate during the first year of life, followed by death due to random causes between the ages of one and six years old, and finally, steadily increasing mortality as individuals approach adolescence and beyond. Suppose that the associated hazard function h(t) is known to be well-modeled by a so-called “bathtub curve,” whose definition and graph are given below.

 h(t) = (3 − 2t)/20   for 0 ≤ t < 1,
        1/20          for 1 ≤ t < 6,
        t/120         for t ≥ 6.

[Figure: graph of this bathtub-shaped hazard function h(t).]

(a) Find and use R to sketch the graph of the corresponding survival function S(t) = P(T > t) = e^(−H(t)), where the cumulative hazard function is given by H(t) = ∫₀ᵗ h(s) ds.

(b) Calculate each of the following:

 P(T > 1),  P(T > 6),  P(T > 12),  P(T > 6 | T > 1),  and the median survival time.

(c) From the cumulative distribution function F(t) = P(T ≤ t), find and use R to sketch the graph of the corresponding density function f(t).

R tip: To graph a function f(x) in the interval [a, b], first define “foo = function(x)(expression in terms of x)”, then use the command “plot(foo, from = a, to = b,...)” with optional graphical parameters col = "…" (for color), lty = 1 (for line type), lwd = 2 (for line width), etc.; type help(par) for more details. To add the graph of a function g(x) to an existing graph, type “plot(goo, from = b, to = c,…, add = T)”
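For instance, a hedged sketch of this plotting idea applied to the three pieces of the bathtub hazard in Problem 5 (the object names are arbitrary):

h1 <- function(t) (3 - 2*t)/20
h2 <- function(t) rep(1/20, length(t))
h3 <- function(t) t/120
plot(h1, from = 0, to = 1, lwd = 2, xlim = c(0, 12), ylim = c(0, 0.11),
     xlab = "t (years)", ylab = "h(t)")
plot(h2, from = 1, to = 6,  lwd = 2, add = TRUE)
plot(h3, from = 6, to = 12, lwd = 2, add = TRUE)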

Appendix

A1. Basic Reviews
A2. Geometric Viewpoint
A3. Statistical Inference
A4. Regression Models
A5. Statistical Tables

A1. Basic Reviews

A1.1 Logarithms
A1.2 Permutations and Combinations

A1. Basic Reviews

Logarithms

What are they? In a word, exponents. The “logarithm (base 10) of a specified positive number” is the exponent to which the base 10 needs to be raised, in order to obtain that specified positive number. In effect, it is the reverse (or more correctly, “inverse”) process of raising 10 to an exponent. Example: The “logarithm (base 10) of 100” is equal to 2, because 10² = 100, or, in shorthand notation, log10 100 = 2. Likewise,

 log10 10000 = 4, because 10⁴ = 10000
 log10 1000 = 3, because 10³ = 1000
 log10 100 = 2, because 10² = 100
 log10 10 = 1, because 10¹ = 10
 log10 1 = 0, because 10⁰ = 1
 log10 0.1 = −1, because 10⁻¹ = 1/10¹ = 0.1
 log10 0.01 = −2, because 10⁻² = 1/10² = 0.01
 log10 0.001 = −3, because 10⁻³ = 1/10³ = 0.001

etc.

How do you take the logarithm of a specified number that is “between” powers of 10? In the old days, this would be done with the aid of a lookup table or slide rule (for those of us who are old enough to remember slide rules). Today, scientific calculators are equipped with a button labeled “log”, “log 10”, or “INV 10x ”. Examples: To five decimal places,

 log10 3 = 0.47712, because (check this) 10^0.47712 = 3.
 log10 5 = 0.69897, because (check this) 10^0.69897 = 5.
 log10 9 = 0.95424, because (check this) 10^0.95424 = 9.
 log10 15 = 1.17609, because (check this) 10^1.17609 = 15.

There are several relations we can observe here that extend to general properties of logarithms.

First, notice that the values for log10 3 and log10 5 add up to the value for log10 15. This is not an accident; it is a direct consequence of 3 × 5 = 15, together with the algebraic law of exponents 10^s × 10^t = 10^(s + t), and the fact that logarithms are exponents by definition. (Exercise: Fill in the details.) In general, we have

Property 1:  log10 (AB) = log10 A + log10 B

that is, the sum of the logarithms of two numbers is equal to the logarithm of their product. For example, taking A = 3 and B = 5 yields log10 (15) = log10 3 + log10 5.

Another relation to notice from these examples is that the value for log10 9 is exactly double the value for log10 3. Again, not a coincidence, but a direct consequence of 3² = 9, together with the algebraic law of exponents (10^s)^t = 10^(s t), and the fact that logarithms are exponents by definition. (Exercise: Fill in the details.) In general, we have

Property 2:  log10 (A^B) = B × log10 A

that is, the logarithm of a number raised to a power is equal to the power times the logarithm of the original number. For example, taking A = 3 and B = 2 yields log10 (3²) = 2 (log10 3).

There are other properties of logarithms, but these are the most important for our purposes. In particular, we can combine these properties in the following way. Suppose that two variables X and Y are related by the general form

 Y = α X^β

for some constants α and β. Then, taking “log10” of both sides, log10 Y = log10 (α X^β), or, by Property 1, log10 Y = log10 α + log10 (X^β), and by Property 2, log10 Y = log10 α + β log10 X.

Relabeling, V = β 0 + β 1 U . In other words, if there exists a power law relation between two variables X and Y, then there exists a simple linear relation between their logarithms. For this reason, scatterplots of two such related variables X and Y are often replotted on a log-log scale. More on this later…
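A hedged R sketch of this log-log idea (the constants 2 and 1.5 below are arbitrary choices for illustration):

X <- 1:50
Y <- 2 * X^1.5                                  # power law Y = alpha * X^beta
par(mfrow = c(1, 2))
plot(X, Y, pch = 19, main = "Original scale")
plot(log10(X), log10(Y), pch = 19, main = "Log-log scale: a straight line")
coef(lm(log10(Y) ~ log10(X)))                   # intercept = log10(alpha), slope = beta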

Additional comments:

• “log 10” is an operation on positive numbers – you must have the logarithm of something. (This is analogous to square roots; you must have the square root of something in order to have a value. The disembodied symbol √ is meaningless without a number inside; similarly with log 10.)

• There is nothing special about using “base 10”. In principle, we

could use any positive base b (provided b ≠ 1, which causes a problem). Popular choices are b = 10 (resulting in the so-called “common logarithms” above), b = 2 (sometimes denoted by “lg”), and finally, b = 2.71828… (resulting in “natural logarithms”, denoted by “ln”). This last peculiar choice is sometimes referred to as “e” and is known as Euler’s constant. (Leonhard Euler, pronounced “oiler,” was a Swiss mathematician. This constant e arises in a variety of applications, including the formula for the density function of a normal distribution, described in a previous lecture.) There is a special formula for converting logarithms (using any base b) back to common logarithms (i.e., base 10), for calculator use. For any positive number a, and base b as described above,

 log_b a = log10 a / log10 b

• Logarithms are particularly useful in calculating physical processes that grow or decay

exponentially. For example, suppose that at time t = 0, we have N = 1 cell in a culture, and that it continually divides in two in such a way that the entire population doubles its size every hour. At the end of t = 1 hour, there are N = 2 cells; at time t = 2 hours, there are N = 2² = 4 cells; at time t = 3 hours, there are N = 2³ = 8 cells, etc. Clearly, at time t, there will be N = 2^t cells in culture (exponential growth). Question: At what time t will there be 500000 (half a million) cells in the culture? The solution can be written as t = log2 500000, which can be rewritten via the “change of base” formula above (for calculator use) as t = log10 500000 ÷ log10 2 = 5.69897 ÷ 0.30103 = 18.93 hours, or about 18 hours, 56 minutes. (Check: 2^18.93 = 499456.67, which represents an error of only about 0.1% from 500000; the discrepancy is due to roundoff error.) A short R check of this computation appears after these comments.

Other applications where logarithms are used include the radioactive isotope dating of fossils and artifacts (exponential decay), determining the acidity or alkalinity of chemical solutions (pH = −log 10 H+, the “power of hydrogen”), and the Richter scale – a measure of earthquake intensity as defined by the log 10 of the quake’s seismic wave amplitude. (Hence an earthquake of magnitude 6 is ten times more powerful than a magnitude 5 quake, which in turn is ten times more powerful than one of magnitude 4, etc.)
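A short hedged R check of the change-of-base computation in the cell-culture example above:

log(500000, base = 2)     # same as log10(500000) / log10(2); about 18.93 hours
2^18.93                   # roughly 500000, up to roundoff error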


Supplement: What Is This Number Called e, Anyway?

The symbol e stands for “Euler’s constant,” and is a fundamental mathematical constant (like π), extremely important for various calculus applications. It is usually defined as

 e = lim (n → ∞) of (1 + 1/n)^n .

Exercise: Evaluate this expression for n = 1, 10, 100, 1000, …, 10⁶. It can be shown, via rigorous mathematical proof, that the “limiting value” formally exists, and converges to the value 2.718281828459… Another common expression for e is the “infinite series”

 e = 1 + 1/1! + 1/2! + 1/3! + … + 1/n! + …

Exercise: Add a few terms of this series. How do the convergence rates of the two expressions compare? The reason for its importance: Of all possible bases b, it is this constant e = 2.71828… that has the most natural calculus properties. Specifically, if f(x) = b^x, then it can be mathematically proved that its derivative is f′(x) = b^x (ln b). (Remember that ln b = log_e b.) For example, the function f(x) = 10^x has as its derivative f′(x) = 10^x (ln 10) = 10^x (2.3026); see Figure 1. The constant multiple of 2.3026, though necessary, is something of a nuisance. On the other hand, if b = e, that is, if f(x) = e^x, then f′(x) = e^x (ln e) = e^x (1) = e^x, i.e., itself! See Figure 2. This property makes calculations involving base e much easier. Figure 1. y = 10^x   Figure 2. y = e^x

[Figures 1 and 2: graphs of y = 10^x and y = e^x, with tangent-line slopes m_tan marked at several points; for y = e^x, the tangent slope at each point equals the value of the function itself.]
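A hedged R sketch for the two exercises above (the limit expression and the partial sums of the series):

n <- 10^(0:6)
(1 + 1/n)^n                    # converges (slowly) to e = 2.718281828...
cumsum(1 / factorial(0:10))    # partial sums of the series converge much faster
exp(1)                         # e itself, for comparison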


A1. Basic Reviews

PERMUTATIONS and COMBINATIONS... or “HOW TO COUNT”

Question 1: Suppose we wish to arrange n = 5 people {a, b, c, d, e}, standing side by side, for a portrait. How many such distinct portraits (“permutations”) are possible?

Example:  a b c d e

Solution: There are 5 possible choices for which person stands in the first position (either a, b, c, d, or e). For each of these five possibilities, there are 4 possible choices left for who is in the next position. For each of these four possibilities, there are 3 possible choices left for the next position, and so on. Therefore, there are 5 × 4 × 3 × 2 × 1 = 120 distinct permutations. See Table 1.

This number, 5 × 4 × 3 × 2 × 1 (or equivalently, 1 × 2 × 3 × 4 × 5), is denoted by the symbol “5!” and read “5 factorial”, so we can write the answer succinctly as 5! = 120. In general,

FACT 1: The number of distinct PERMUTATIONS of n objects is “n factorial”, denoted by

 n! = 1 × 2 × 3 × ... × n,  or equivalently,  = n × (n − 1) × (n − 2) × ... × 2 × 1.

Examples:
 6! = 6 × 5 × 4 × 3 × 2 × 1 = 6 × 5! = 6 × 120 (by previous calculation) = 720
 3! = 3 × 2 × 1 = 6
 2! = 2 × 1 = 2
 1! = 1
 0! = 1, BY CONVENTION (It may not be obvious why, but there are good mathematical reasons for it.)

Here, every different ordering counts as a distinct permutation. For instance, the ordering (a,b,c,d,e) is distinct from (c,e,a,d,b), etc.

Question 2: Now suppose we start with the same n = 5 people {a, b, c, d, e}, but we wish to make portraits of only k = 3 of them at a time. How many such distinct portraits are possible?

Example:  a b c

Solution: By using exactly the same reasoning as before, there are 5 × 4 × 3 = 60 permutations. See Table 2 for the explicit list! Note that this is technically NOT considered a factorial (since we don’t go all the way down to 1), but we can express it as a ratio of factorials:

 5 × 4 × 3 = [5 × 4 × 3 × (2 × 1)] / (2 × 1) = 5! / 2! .

In general,

FACT 2: The number of distinct PERMUTATIONS of n objects, taken k at a time, is given by the ratio

 n! / (n − k)! = n × (n − 1) × (n − 2) × ... × (n − k + 1).

Question 3: Finally suppose that instead of portraits (“permutations”), we wish to form committees (“combinations”) of k = 3 people from the original n = 5. How many such distinct committees are possible?

Example:  the committee {a, b, c}

Again, as above, every different ordering counts as a distinct permutation. For instance, the ordering (a,b,c) is distinct from (c,a,b), etc.

Now, every different ordering does NOT count as a distinct combination. For instance, the committee {a,b,c} is the same as the committee {c,a,b}, etc.


Solution: This time the reasoning is a little subtler. From the previous calculation, we know that

# of permutations of k = 3 from n = 5 is equal to 5! / 2! = 60.

But now, all the ordered permutations of any three people (and there are 3! = 6 of them, by FACT 1), will “collapse” into one single unordered combination, e.g., {a, b, c}, as illustrated. So...

# of combinations of k = 3 from n = 5 is equal to 5! / 2!, divided by 3!, i.e., 60 ÷ 6 = 10.

See Table 3 for the explicit list!

This number, 5! / (3! 2!), is given the compact notation C(5, 3), read “5 choose 3”, and corresponds to the number of ways of selecting 3 objects from 5 objects, regardless of their order. Hence C(5, 3) = 10.

In general,

FACT 3: The number of distinct COMBINATIONS of n objects, taken k at a time, is given by the ratio

 n! / [k! (n − k)!] = [n × (n − 1) × (n − 2) × ... × (n − k + 1)] / k! .

This quantity is usually written as C(n, k), and read “n choose k”.

Examples:

 C(5, 3) = 5! / (3! 2!) = 10, just done. Note that this is also equal to C(5, 2) = 5! / (2! 3!) = 10.

 C(8, 2) = 8! / (2! 6!) = (8 × 7 × 6!) / (2! × 6!) = (8 × 7) / 2 = 28. Note that this is equal to C(8, 6) = 8! / (6! 2!) = 28.

 C(15, 1) = 15! / (1! 14!) = (15 × 14!) / (1! × 14!) = 15. Note that this is equal to C(15, 14) = 15. Why?

 C(7, 7) = 7! / (7! 0!) = 1. (Recall that 0! = 1.) Note that this is equal to C(7, 0) = 1. Why?

Observe that it is neither necessary nor advisable to compute the factorials of large numbers directly. For instance, 8! = 40320, but by writing it instead as 8 × 7 × 6!, we can cancel 6!, leaving only 8 × 7 above. Likewise, 14! cancels out of 15!, leaving only 15, so we avoid having to compute 15! , etc.
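These counts are built into R; a hedged one-line check of the examples above:

factorial(5)                                  # 5! = 120
factorial(5) / factorial(2)                   # permutations of 5 objects taken 3 at a time = 60
choose(5, 3)                                  # "5 choose 3" = 10
c(choose(8, 2), choose(15, 1), choose(7, 7))  # 28, 15, 1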

Remark: C(n, k) is sometimes called a “combinatorial symbol” or “binomial coefficient” (in connection with a fundamental mathematical result called the Binomial Theorem; you may also recall the related “Pascal’s Triangle”). The previous examples also show that binomial coefficients possess a useful symmetry, namely, C(n, k) = C(n, n − k). For example, C(5, 3) = 5! / (3! 2!), but this is clearly the same as C(5, 2) = 5! / (2! 3!). In other words, the number of ways of choosing 3-person committees from 5 people is equal to the number of ways of choosing 2-person committees from 5 people. A quick way to see this without any calculating is through the insight that every choice of a 3-person committee from a collection of 5 people leaves behind a 2-person committee, so the total number of both types of committee must be equal (10). Exercise: List all the ways of choosing 2 objects from 5, say {a, b, c, d, e}, and check these claims explicitly. That is, match each pair with its complementary triple in the list of Table 3.

A Simple Combinatorial Application

Suppose you toss a coin n = 5 times in a row. How many ways can you end up with k = 3 heads?

Solution: The answer can be obtained by calculating the number of ways of rearranging 3 objects among 5; it only remains to determine whether we need to use permutations or combinations. Suppose, for example, that the 3 heads occur in the first three tosses, say a, b, and c, as shown below. Clearly, rearranging these three letters in a different order would not result in a different outcome. Therefore, different orderings of the letters a, b, and c should not count as distinct permutations, and likewise for any other choice of three letters among {a, b, c, d, e}. Hence, there are C(5, 3) = 10 ways of obtaining k = 3 heads in n = 5 independent successive tosses.

Exercise: Let “H” denote heads, and “T” denote tails. Using these symbols, construct the explicit list of 10 combinations. (Suggestion: Arrange this list of H/T sequences in alphabetical order. You should see that in each case, the three H positions match up exactly with each ordered triple in the list of Table 3. Why?) a b c d e


Table 1 – Permutations of {a, b, c, d, e} These are the 5! = 120 ways of arranging 5 objects, in such a way that all the different orders count as being distinct. a b c d e b a c d e c a b d e d a b c e e a b c d a b c e d b a c e d c a b e d d a b e c e a b d c a b d c e b a d c e c a d b e d a c b e e a c b d a b d e c b a d e c c a d e b d a c e b e a c d b a b e c d b a e c d c a e b d d a e b c e a d b c a b e d c b a e d c c a e d b d a e c b e a d c b a c b d e b c a d e c b a d e d b a c e e b a c d a c b e d b c a e d c b a e d d b a e c e b a d c a c d b e b c d a e c b d a e d b c a e e b c a d a c d e b b c d e a c b d e a d b c e a e b c d a a c e b d b c e a d c b e a d d b e a c e b d a c a c e d b b c e d a c b e d a d b e c a e b d c a a d b c e b d a c e c d a b e d c a b e e c a b d a d b e c b d a e c c d a e b d c a e b e c a d b a d c b e b d c a e c d b a e d c b a e e c b a d a d c e b b d c e a c d b e a d c b e a e c b d a a d e b c b d e a c c d e a d d c e a b e c d a b a d e c b b d e c a c d e d a d c e b a e c d b a a e b c d b e a c d c e a b d d e a b c e d a b c a e b d c b e a d c c e a d b d e a c b e d a c b a e c b d b e c a d c e b a d d e b a c e d b a c a e c d b b e c d a c e b d a d e b c a e d b c a a e d b c b e d a c c e d a b d e c a b e d c a b a e d c b b e d c a c e d b a d e c b a e d c b a


Table 2 – Permutations of {a, b, c, d, e}, taken 3 at a time

These are the 5! / 2! = 60 ways of arranging 3 objects among 5, in such a way that different orders of

any triple count as being distinct, e.g., the 3! = 6 permutations of (a, b, c), shown below . a b c b a c c a b d a b e a b a b d b a d c a d d a c e a c a b e b a e c a e d a e e a d a c b b c a c b a d b a e b a a c d b c d c b d d b c e b c a c e b c e c b e d b e e b d a d b b d a c d a d c a e c a a d c b d c c d b d c b e c b a d e b d e c d e d c e e c d a e b b e a c e a d e a e d a a e c b e c c e b d e b e d b a e d b e d c e d d e c e d c Table 3 – Combinations of {a, b, c, d, e}, taken 3 at a time If different orders of the same triple are not counted as being distinct, then their six permutations are

lumped as one, e.g., {a, b, c}. Therefore, the total number of combinations is 1/6 of the original 60, or 10. Notationally, we express this as 1/3! of the original 5!/2!, i.e., 5! / (3! 2!), or more neatly, as C(5, 3).

These C(5, 3) = 10 combinations are listed below.

a b c a b d a b e a c d a c e a d e b c d b c e b d e c d e

A2. Geometric Viewpoint

A2.1 Mean and Variance
A2.2 ANOVA
A2.3 Least Squares Approximation

A2. Statistics from a Geometric Viewpoint

Mean and Variance

Many of the concepts we will encounter can be unified in a very elegant geometric way, which yields additional insight and understanding. If you relate to visual ideas, then you might benefit from reading this. First, recall some basic facts from elementary vector analysis: For any two column vectors v = (v1, v2, …, vn)ᵀ and w = (w1, w2, …, wn)ᵀ in ℝⁿ, the standard Euclidean dot product “v ⋅ w” is defined as vᵀw = Σᵢ₌₁ⁿ vᵢ wᵢ, hence is a scalar. Technically, the dot product is a special case of a more general mathematical object known as an inner product, denoted by ⟨v, w⟩, and these notations are often used interchangeably. The length, or norm, of a vector v can therefore be characterized as ‖v‖ = √⟨v, v⟩ = √(Σᵢ₌₁ⁿ vᵢ²), and the included angle θ between two vectors v and w can be calculated via the formula

 cos θ = ⟨v, w⟩ / (‖v‖ ‖w‖),  0 ≤ θ ≤ π.

From this relation, it is easily seen that two vectors v and w are orthogonal (i.e., θ = π /2), written v ⊥ w, if and only if their dot product is equal to zero, i.e., ⟨v, w⟩ = 0.

Now suppose we have n random sample observations {x1, x2, x3, …, xn}, with mean x̄. As shown below, let x be the vector consisting of these n data values, and x̄ be the vector composed solely of x̄. Note that x̄ is simply a scalar multiple of the vector 1 = (1, 1, 1, …, 1)ᵀ. Finally, let x − x̄ be the vector difference; therefore its components are the individual deviations between the observations and the overall mean. (It’s useful to think of x̄ as a sample taken from an ideal population that responds exactly the same way to some treatment, hence there is no variation; x is the sample of actual responses, and x − x̄ measures the error between them.)

 x = (x1, x2, x3, …, xn)ᵀ ,   x̄ = (x̄, x̄, x̄, …, x̄)ᵀ ,   x − x̄ = (x1 − x̄, x2 − x̄, x3 − x̄, …, xn − x̄)ᵀ .


Recall that the sum of the individual deviations is equal to zero, i.e., Σᵢ₌₁ⁿ (xᵢ − x̄) = 0, or in vector notation, the dot product 1 ⋅ (x − x̄) = 0. Therefore, 1 ⊥ (x − x̄), and the three vectors above form a right triangle. Let the scalars a, b, and c represent the lengths of the corresponding vectors, respectively. That is,

 a = ‖x − x̄‖ = √[ Σᵢ₌₁ⁿ (xᵢ − x̄)² ] ,   b = ‖x̄‖ = √[ Σᵢ₌₁ⁿ x̄² ] = √( n x̄² ) ,   c = ‖x‖ = √[ Σᵢ₌₁ⁿ xᵢ² ] .

Therefore, a², b², and c² are all “sums of squares,” denoted by

 SSError = a² = Σᵢ₌₁ⁿ (xᵢ − x̄)² ,   SSTreatment = b² = n x̄²  (via algebra, = (1/n)(Σᵢ₌₁ⁿ xᵢ)² ),   SSTotal = c² = Σᵢ₌₁ⁿ xᵢ² .

Now via the Pythagorean Theorem, we have c2 = b2 + a2, referred to in this context as a “partitioning of sums of squares”:

SSTotal = SSTreatment + SSError . Note also that, by definition, the sample variance is

 s² = SSError / (n − 1) ,

and that combining both of these boxed equations yields the equivalent “alternate formula”:

 s² = [1/(n − 1)] [ SSTotal − SSTreatment ] ,

i.e.,

 s² = [1/(n − 1)] [ Σᵢ₌₁ⁿ xᵢ² − (1/n)(Σᵢ₌₁ⁿ xᵢ)² ] .

This formula, because it only requires one subtraction rather than n, is computationally more stable than the original; however, it is less enlightening. Exercise: Verify that SSTotal = SSTreatment + SSError for the sample data values {3, 8, 17, 20, 32}, and calculate s2 both ways, showing equality. Be especially careful about roundoff error!
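A hedged R sketch of this verification (it computes both sides of the partition and s² both ways):

x <- c(3, 8, 17, 20, 32)
n <- length(x)
SS.total <- sum(x^2)
SS.trt   <- n * mean(x)^2
SS.err   <- sum((x - mean(x))^2)
SS.total - (SS.trt + SS.err)                                  # 0, up to roundoff
c(SS.err / (n - 1), (SS.total - SS.trt) / (n - 1), var(x))    # three matching values of s^2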


A2. Statistics from a Geometric Viewpoint

Analysis of Variance

The technique of multiple comparison of treatment means via ANOVA can be viewed very elegantly, from a purely geometric perspective. Again, recall some basic facts from elementary vector analysis: For any two column vectors v = (v1, v2, …, vn)ᵀ and w = (w1, w2, …, wn)ᵀ in ℝⁿ, the standard Euclidean dot product “v ⋅ w” is defined as vᵀw = Σᵢ₌₁ⁿ vᵢ wᵢ, hence is a scalar. Technically, the dot product is a special case of a more general mathematical object known as an inner product, denoted by ⟨v, w⟩, and these notations are often used interchangeably. The length, or norm, of a vector v can therefore be characterized as ‖v‖ = √⟨v, v⟩ = √(Σᵢ₌₁ⁿ vᵢ²), and the included angle θ between two vectors v and w can be calculated via the formula

 cos θ = ⟨v, w⟩ / (‖v‖ ‖w‖),  0 ≤ θ ≤ π.

From this relation, it is easily seen that two vectors v and w are orthogonal (i.e., θ = π /2), written v ⊥ w, if and only if their dot product is equal to zero, i.e., ⟨v, w⟩ = 0. Now suppose we have sample data from k treatment groups of sizes n1, n2, …, nk, respectively, which we organize in vector form as follows:

 Treatment 1:  y1 = (y11, y12, y13, …, y1n1)ᵀ
 Treatment 2:  y2 = (y21, y22, y23, …, y2n2)ᵀ
   ⋮
 Treatment k:  yk = (yk1, yk2, yk3, …, yknk)ᵀ

 Group Means:      ȳ1, ȳ2, …, ȳk
 Group Variances:  s1², s2², …, sk²

 Grand Mean:  ȳ = (n1 ȳ1 + n2 ȳ2 + … + nk ȳk) / n ,  where n = n1 + n2 + … + nk is the combined sample size.

 Pooled Variance:  s²within groups = [ (n1 − 1) s1² + (n2 − 1) s2² + … + (nk − 1) sk² ] / (n − k)

Now, for Treatment column i = 1, 2, …, k and row j = 1, 2, …, ni, it is clear from simple algebra that

 yij − ȳ = (ȳi − ȳ) + (yij − ȳi). Therefore, for each Treatment i = 1, 2, …, k, we have the ni-dimensional column vector identity

 yi − ȳ 1 = (ȳi − ȳ) 1 + (yi − ȳi 1), where the ni-dimensional vector 1 = (1, 1, …, 1)ᵀ. Hence, vertically stacking these k columns produces a vector identity in ℝⁿ:

 [ y1 − ȳ 1 ]     [ (ȳ1 − ȳ) 1 ]     [ y1 − ȳ1 1 ]
 [ y2 − ȳ 1 ]     [ (ȳ2 − ȳ) 1 ]     [ y2 − ȳ2 1 ]
 [ y3 − ȳ 1 ]  =  [ (ȳ3 − ȳ) 1 ]  +  [ y3 − ȳ3 1 ]
 [     ⋮     ]     [      ⋮      ]     [     ⋮      ]
 [ yk − ȳ 1 ]     [ (ȳk − ȳ) 1 ]     [ yk − ȳk 1 ]

or, more succinctly… u = v + w. But the two vectors v and w are orthogonal, since they have a zero dot product:

 vᵀw = Σᵢ₌₁ᵏ (ȳᵢ − ȳ) [ Σⱼ₌₁ⁿⁱ (yᵢⱼ − ȳᵢ) ] = 0,

because this is the sum of the deviations of the yᵢⱼ values in Treatment i from their group mean ȳᵢ. Therefore, the three vectors u, v and w form a right triangle, as shown. So by the Pythagorean Theorem,

[Figure: right triangle formed by the “Total vector” u (hypotenuse), the “Treatment vector” v, and the “Error vector” w.]

 ‖u‖² = ‖v‖² + ‖w‖² ,  or, in statistical notation…

 SSTotal = SSTrt + SSError ,  where

 SSTotal is the sum of the squared deviations of each observation from the grand mean;
 SSTrt is the sum of the squared deviations of each group mean from the grand mean;
 SSError is the sum of the squared deviations of each observation from its group mean.

 SSTotal = ‖u‖² = Σᵢ₌₁ᵏ ‖yᵢ − ȳ 1‖² = Σᵢ₌₁ᵏ Σⱼ₌₁ⁿⁱ (yᵢⱼ − ȳ)² = Σ over all i, j of (yᵢⱼ − ȳ)²

 SSTrt = ‖v‖² = Σᵢ₌₁ᵏ ‖(ȳᵢ − ȳ) 1‖² = Σᵢ₌₁ᵏ Σⱼ₌₁ⁿⁱ (ȳᵢ − ȳ)² = Σᵢ₌₁ᵏ nᵢ (ȳᵢ − ȳ)²

 SSError = ‖w‖² = Σᵢ₌₁ᵏ ‖yᵢ − ȳᵢ 1‖² = Σᵢ₌₁ᵏ Σⱼ₌₁ⁿⁱ (yᵢⱼ − ȳᵢ)² = Σᵢ₌₁ᵏ (nᵢ − 1) sᵢ²

The resulting ANOVA table for the null hypothesis H0: μ1 = μ2 = … = μk is given by:

 Source     | df    | SS                          | MS                 | F-statistic                               | p-value
 Treatment  | k − 1 | Σᵢ₌₁ᵏ nᵢ (ȳᵢ − ȳ)²          | s²between groups   | F = s²between groups / s²within groups    | 0 ≤ p ≤ 1
 Error      | n − k | Σᵢ₌₁ᵏ (nᵢ − 1) sᵢ²          | s²within groups    |     ~ F(k − 1, n − k)                     |
 Total      | n − 1 | Σ over all i, j (yᵢⱼ − ȳ)²  |                    |                                           |
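A hedged R sketch (not in the original notes) that produces such a table for a small invented data set and exhibits the same partition of sums of squares:

y     <- c(5, 7, 6, 9, 11, 10, 12, 4, 3, 5)          # toy responses
group <- factor(c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3))     # k = 3 treatment groups
summary(aov(y ~ group))                              # df, SS, MS, F, p-value
SS.trt <- sum(tapply(y, group, length) * (tapply(y, group, mean) - mean(y))^2)
SS.err <- sum((y - ave(y, group))^2)
SS.tot <- sum((y - mean(y))^2)
c(SS.trt + SS.err, SS.tot)                           # the two totals agree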

One final note about multiple treatment comparisons… We may also express the problem via the following equivalent formulation: For each Treatment column i = 1, 2, …, k and row j = 1, 2, …, ni, the (i, j)th response yij differs from its true group mean μi by a random error amount εij. At the same time however, the true group mean μi itself differs from the true grand mean μ by a random amount αi, appropriately called the ith treatment effect. That is,

 Model:  yᵢⱼ = μᵢ + εᵢⱼ ,  i.e.,  yᵢⱼ = μ + αᵢ + εᵢⱼ        Null Hypothesis:  H0: μ1 = μ2 = … = μk ,  i.e.,  H0: α1 = α2 = … = αk = 0 .

(Here μᵢ is estimated by ȳᵢ, and μ is estimated by ȳ.)

In words, this so-called model equation says that each individual response can be formulated as the sum of the grand mean plus its group treatment effect (the two of these together sum to its group mean), and an individual error term. The null hypothesis that “all of the group means are equal to each other” translates to the equivalent null hypothesis that “all of the group treatment effects are equal to zero.” This expression of the problem as “response = model + error” is extremely useful, and will appear again, in the context of regression models.


A2. Statistics from a Geometric Viewpoint

Least Squares Approximation

The concepts of linear correlation and least squares regression can be viewed very elegantly, from a pure geometric perspective. Again, recall some basic background facts from elementary vector analysis: For any two column vectors v = (v1, v2, …, vn)ᵀ and w = (w1, w2, …, wn)ᵀ in ℝⁿ, the standard Euclidean dot product “v ⋅ w” is defined as vᵀw = Σᵢ₌₁ⁿ vᵢ wᵢ, hence is a scalar. Technically, the dot product is a special case of a more general mathematical object known as an inner product, denoted by ⟨v, w⟩, and these notations are often used interchangeably. The length, or norm, of a vector v can therefore be characterized as ‖v‖ = √⟨v, v⟩ = √(Σᵢ₌₁ⁿ vᵢ²), and the included angle θ between two vectors v and w can be calculated via the formula

 cos θ = ⟨v, w⟩ / (‖v‖ ‖w‖),  0 ≤ θ ≤ π.

From this relation, it is easily seen that two vectors v and w are orthogonal (i.e., θ = π /2), written v ⊥ w, if and only if their dot product is equal to zero, i.e., ⟨v, w⟩ = 0. More generally, the orthogonal projection of the vector v onto the vector w is given by the formula shown in the figure below. (Think of it informally as the “shadow vector” that v casts in the direction of w.)

Why are orthogonal projections so important? Suppose we are given any vector v (in a general inner product space), and a plane (or more precisely, a linear subspace) not containing v. Of all the vectors u in this plane, we wish to find a vector v̂ that comes “closest” to v, in some formal mathematical sense. The Best Approximation Theorem asserts that, under such very general conditions, such a vector does indeed exist, and is uniquely determined by the orthogonal projection of v onto this plane. Moreover, the resulting error e = v − v̂ is smallest possible, with ‖e‖² = ‖v‖² − ‖v̂‖², via the Pythagorean Theorem.

2v , via the Pythagorean Theorem.

 proj_w v = [ ⟨v, w⟩ / ‖w‖² ] w   (a scalar multiple of w)

[Figure: the vectors v and w, the included angle θ, and the orthogonal projection proj_w v along w.]

Of all the vectors u in the plane, the one that minimizes the length ‖v − u‖ (thin dashed line) is the orthogonal projection v̂. Therefore, v̂ is the least squares approximation to v, yielding the least squares error ‖e‖² = ‖v‖² − ‖v̂‖².

Now suppose we are given n data points (xi, yi), i = 1, 2, …, n, obtained from two variables X and Y. Define the following vectors in n-dimensional Euclidean space ℝⁿ: 0 = (0, 0, 0, …, 0)ᵀ, 1 = (1, 1, 1, …, 1)ᵀ,

 x = (x1, x2, x3, …, xn)ᵀ,  x̄ = (x̄, x̄, x̄, …, x̄)ᵀ,  so that x − x̄ = (x1 − x̄, x2 − x̄, x3 − x̄, …, xn − x̄)ᵀ,

 y = (y1, y2, y3, …, yn)ᵀ,  ȳ = (ȳ, ȳ, ȳ, …, ȳ)ᵀ,  so that y − ȳ = (y1 − ȳ, y2 − ȳ, y3 − ȳ, …, yn − ȳ)ᵀ.

The “centered” data vectors x − x̄ and y − ȳ are crucial to our analysis. For observe that, by definition,

 ‖x − x̄‖² = (n − 1) sx² ,  ‖y − ȳ‖² = (n − 1) sy² ,  and  ⟨x − x̄, y − ȳ⟩ = (n − 1) sxy .

Now, note that ⟨1, x − x̄⟩ = Σᵢ₌₁ⁿ (xᵢ − x̄) = 0, therefore 1 ⊥ (x − x̄); likewise, 1 ⊥ (y − ȳ) as well. See the figure below, showing the geometric relationships between the vector y − ȳ and the plane spanned by the orthogonal basis vectors 1 and x − x̄.

Also, from a previous formula, we see that the general angle θ between these two vectors is given by

 cos θ = ⟨x − x̄, y − ȳ⟩ / (‖x − x̄‖ ‖y − ȳ‖) = (from above) (n − 1) sxy / [ √((n − 1) sx²) √((n − 1) sy²) ] = sxy / (sx sy) = r,

i.e., the sample linear correlation coefficient! Therefore, this ratio r measures the cosine of the angle θ between the vectors x − x̄ and y − ȳ, and hence is always between −1 and +1. But what is its exact connection with the original vectors x and y?

[Figure: the plane spanned by the orthogonal basis vectors 1 and x − x̄; the vector y − ȳ (not in the plane); its orthogonal projection ŷ − ȳ onto the plane; the angle θ; and the error e = y − ŷ.]

IF the vectors x and y are exactly linearly correlated, then by definition, it must hold that y = b0 1 + b1 x for some constants b0 and b1, and conversely. A little elementary algebra (take the mean of both sides, then subtract the two equations from one another) shows that this is equivalent to the statement

 y − ȳ = b1 (x − x̄),  with  b0 = ȳ − b1 x̄ .

That is, the vector y − ȳ is a scalar multiple of the vector x − x̄, and therefore must lie not only in the plane, but along the line spanned by x − x̄ itself. If the scalar multiple b1 > 0, then y − ȳ must point in the same direction as x − x̄; hence r = cos 0 = +1, and the linear correlation is positive. If b1 < 0, then these two vectors point in opposite directions, hence r = cos π = −1, and the linear correlation is negative. However, if these two vectors are orthogonal, then r = cos(π/2) = 0, and there is no linear correlation between x and y.

More generally, if the original vectors x and y are not exactly linearly correlated (that is, −1 < r < +1), then the vector y − ȳ does not lie in the plane. The unique vector ŷ − ȳ that does lie in the plane which best approximates it in the “least squares” sense is its orthogonal projection onto the vector x − x̄, computed by the formula given above:

 ŷ − ȳ = [ ⟨y − ȳ, x − x̄⟩ / ‖x − x̄‖² ] (x − x̄) = [ (n − 1) sxy / ((n − 1) sx²) ] (x − x̄),

i.e.,

 Linear Model:  ŷ − ȳ = b1 (x − x̄),  with  b1 = sxy / sx² .

Furthermore, via the Pythagorean Theorem,

‖y − ȳ‖² = ‖ŷ − ȳ‖² + ‖y − ŷ‖²

or, in statistical notation… SSTotal = SSReg + SSError . Finally, from this, we also see that the ratio

SSReg / SSTotal = ‖ŷ − ȳ‖² / ‖y − ȳ‖² = cos²θ ,

i.e., the coefficient of determination is SSReg / SSTotal = r² , where r is the correlation coefficient.

Exercise: Derive the previous formulas sx±y² = sx² + sy² ± 2 sxy. (Hint: Use the Law of Cosines.)

Remark: In this analysis, we have seen how the familiar formulas of linear regression follow easily and immediately from "orthogonal approximation" on vectors. With slightly more generality, interpreting vectors abstractly as functions f(x), it is possible to develop the formulas that are used in Fourier series.
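As a quick numerical check, the identities above are easy to verify in R; the following minimal sketch uses small made-up data vectors (the numbers are purely illustrative):

x <- c(1, 3, 4, 6, 8)
y <- c(2, 3, 6, 7, 9)
xc <- x - mean(x); yc <- y - mean(y)          # centered data vectors x - xbar, y - ybar
sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))   # cosine of the angle between them...
cor(x, y)                                     # ...equals the sample correlation r
cov(x, y) / var(x)                            # slope b1 = sxy / sx^2 ...
coef(lm(y ~ x))["x"]                          # ...matches the least squares slope
summary(lm(y ~ x))$r.squared                  # coefficient of determination...
cor(x, y)^2                                   # ...equals r^2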


A3. Statistical Inference

A3.1 Mean, One Sample
A3.2 Means and Proportions, One and Two Samples
A3.3 General Parameters and FORMULA TABLES


A3. Statistical Inference

Population Mean μ of a Random Variable X … with known standard deviation σ, and random sample of size n 1

Before selecting a random sample, the experimenter first decides on each of the following…

• Null Hypothesis H0: μ = μ0 (the conjectured value of the true mean)
• Alternative Hypothesis 2 HA: μ ≠ μ0 (that is, either μ < μ0 or μ > μ0)
• Significance Level α = P(Reject H0 | H0 true) = .05, usually − the probability of a "Type I error"; therefore,
• Confidence Level 1 − α = P(Accept H0 | H0 true) = .95, usually

… and calculates each of the following:

• Standard Error σ/√n , the standard deviation of X̄; this is then used to calculate…
• Margin of Error zα/2 × σ/√n , where the critical value zα/2 is computed via its definition:
  Z ∼ N(0, 1), P(−zα/2 < Z < zα/2) = 1 − α, i.e., by tail-area symmetry,
  P(Z < −zα/2) = P(Z > zα/2) = α/2. Note: If α = .05, then z.025 = 1.96.

Figure 1: Illustration of the sample mean x̄ in the rejection region; note that p < α. The sampling distribution of X̄ under H0 is N(μ0, σ/√n); the acceptance region for H0 is the interval centered at μ0 with half-width equal to the margin of error, and the two-sided rejection region consists of the two tails beyond it (with area p/2 beyond x̄ and p/2 beyond its reflection through μ0).

1 If σ is unknown, but n ≥ 30, then estimate σ by the sample standard deviation s. If n < 30, then use the t-distribution instead of the standard normal Z-distribution. 2 The two-sided alternative is illustrated here. Some formulas may have to be modified slightly if a one-sided alternative is used instead. (Discussed later…)

After selecting a random sample, the experimenter next calculates the statistic…

• Sample Mean x̄ = "point estimate" of μ

…then calculates any or all of the following:

• (1 − α) × 100% Confidence Interval − the interval centered at x̄, such that P(μ inside) = 1 − α

  C.I. = (x̄ − margin of error, x̄ + margin of error) = "interval estimate" of μ

  Decision Rule: At the (1 − α) × 100% confidence level… If μ0 is contained in the C.I., then accept H0. If μ0 is not in the C.I., then reject H0 in favor of HA.

• (1 − α) × 100% Acceptance Region − the interval centered at μ0, such that P(x̄ inside) = 1 − α

  A.R. = (μ0 − margin of error, μ0 + margin of error)

  Decision Rule: If x̄ is in the acceptance region, then accept H0. If x̄ is not in the acceptance region (i.e., is in the rejection region), then reject H0 in favor of HA. SEE FIGURE 1!

• p-value − a measure of how "significantly" our sample mean differs from the null hypothesis

  p = the probability of obtaining a random sample mean that is AS, or MORE, extreme than the value of x̄ actually obtained, assuming the null hypothesis H0: μ = μ0 is true
    = P(obtaining a sample mean on either side of μ0, as far away as, or farther than, x̄ is)
    = P(Z ≤ −|x̄ − μ0| / (σ/√n)) + P(Z ≥ +|x̄ − μ0| / (σ/√n)),

  i.e., the left-sided area plus the right-sided area cut off by x̄ and its symmetrically reflected value through μ0. (NOTE: By symmetry, can simply multiply the amount of area in one tail by 2.)

  Decision Rule: If p < α, then reject H0 in favor of HA; the difference between x̄ and μ0 is "statistically significant". If p > α, then accept H0; the difference between x̄ and μ0 is "not statistically significant". SEE FIGURE 1!
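For reference, all three equivalent decision rules can be computed directly in R; the following sketch uses hypothetical values of x̄, μ0, σ, and n (σ assumed known):

xbar <- 52.3; mu0 <- 50; sigma <- 8; n <- 64; alpha <- 0.05
se  <- sigma / sqrt(n)                      # standard error of the sample mean
moe <- qnorm(1 - alpha/2) * se              # margin of error = critical value x SE
c(xbar - moe, xbar + moe)                   # confidence interval, centered at xbar
c(mu0 - moe, mu0 + moe)                     # acceptance region, centered at mu0
2 * pnorm(-abs(xbar - mu0) / se)            # two-sided p-value
# Reject H0 exactly when mu0 lies outside the C.I., equivalently when xbar lies
# outside the A.R., equivalently when the p-value is less than alpha.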

For a one-sided hypothesis test, the preceding formulas must be modified. The decision to reject H0 in favor of HA depends on the probability of a sample mean being either significantly larger, or significantly smaller, than the value μ0 (always following the direction of the alternative HA), but not both, as in a two-sided test. Previous remarks about σ and s, as well as z and t, still apply.

• Hypotheses (Case 1) H0: μ ≤ μ0 vs. HA: μ > μ0 (right-sided alternative)

• Confidence Interval = ( x̄ − zα (σ/√n), +∞ )

• Acceptance Region = ( −∞, μ0 + zα (σ/√n) )

• p-value = P(obtaining a sample mean that is equal to, or larger than, x̄)
    = P( Z ≥ (x̄ − μ0) / (σ/√n) ), the right-sided area cut off by x̄

  Decision Rule: If p < α, then x̄ is in the rejection region for H0; x̄ is "significantly" larger than μ0. If p > α, then x̄ is in the acceptance region for H0; x̄ is not "significantly" larger than μ0.

  Figure 2: Illustration of the sample mean x̄ in the (right-tail) rejection region; note that p < α.

• Hypotheses (Case 2) H0: μ ≥ μ0 vs. HA: μ < μ0 (left-sided alternative)

• Confidence Interval = ( −∞, x̄ + zα (σ/√n) )

• Acceptance Region = ( μ0 − zα (σ/√n), +∞ )

• p-value = P(obtaining a sample mean that is equal to, or smaller than, x̄)
    = P( Z ≤ (x̄ − μ0) / (σ/√n) ), the left-sided area cut off by x̄

  Decision Rule: If p < α, then x̄ is in the rejection region for H0; x̄ is "significantly" smaller than μ0. If p > α, then x̄ is in the acceptance region for H0; x̄ is not "significantly" smaller than μ0.

  Figure 3: Illustration of the sample mean x̄ in the (left-tail) rejection region; note that p < α.


Examples

Given: Assume that the random variable "X = IQ score" is normally distributed in a certain study population, with standard deviation σ = 30.0, but with unknown mean μ. Conjecture a null hypothesis H0: μ = 100 vs. the (two-sided) alternative hypothesis HA: μ ≠ 100.

[Diagram: THEORY vs. EXPERIMENT − compare the hypothesized mean value (100) with the mean x̄ of the random sample data.]

Question: Do we accept or reject H0 at the 5% (i.e., α = .05) significance level, and how strong is our decision, relative to this 5%?

Suppose statistical inference is to be based on random sample data of size n = 400 individuals.

Figure 4: Normal distribution N(100, 30) of X = IQ score, under the conjectured null hypothesis H0: μ = 100.

Procedure: The Decision Rule will depend on calculation of the following quantities. First, the

Margin of Error = Critical Value × Standard Error = zα/2 × σ/√n = 1.96 × 30/√400 = 1.96 × 1.5 = 2.94,

using the critical value z.025 = 1.96 for the given α = .05. (For the standard normal distribution N(0, 1), the area between −1.96 and +1.96 is .95, with .025 in each tail.) Then, the…


• Acceptance Region for x̄: All values between 100 ± 2.94, i.e., (97.06, 102.94).

Figure 5: Null Distribution − the "sampling distribution" of X̄ under the null hypothesis H0: μ = 100 is N(100, 1.5); compare with Figure 4 above. The acceptance region (97.06, 102.94) contains .95 of the area, with .025 in each tail.

Sample # 1: Suppose it is found that x̄ = 105 (or 95). As shown in Figure 5, this value lies far inside the α = .05 rejection region for the null hypothesis H0 (i.e., true mean μ = 100). In particular, we can measure exactly how significantly our sample evidence differs from the null hypothesis, by calculating the probability (that is, the area under the curve) that a random sample mean X̄ will be as far away, or farther, from μ = 100 as x̄ = 105 is, on either side. Hence, this corresponds to the combined total area contained in the tails to the left and right of 95 and 105 respectively, and it is clear from Figure 5 that this value will be much smaller than the combined shaded area of .05 shown. This can be checked by a formal computation:

• p-value = P(X̄ ≤ 95) + P(X̄ ≥ 105), by definition
    = 2 × P(X̄ ≤ 95), by symmetry
    = 2 × P(Z ≤ −3.33), since (95 − 100) / 1.5 = −3.33, to two places
    = 2 × .0004, via tabulated entry for N(0,1) tail areas
    = .0008 << .05, a statistically significant difference.

As observed, our p-value of .08% is much smaller than the accepted 5% probability of committing a Type I error (i.e., the α = .05 significance level) initially specified. Therefore, as suggested above, this sample evidence indicates a strong rejection of the null hypothesis at the .05 level. As a final method of verifying this decision, we may also calculate the sample-based…

• 95% Confidence Interval for μ: All values between 105 ± 2.94, i.e., (102.06, 107.94).

By construction, this interval should contain the true value of the mean μ, with 95% confidence. Because μ = 100 is clearly outside the interval, this shows a reasonably strong rejection at α = .05.


Similarly, we can experiment with sample means x̄ that result from other random samples.

Sample # 2: Suppose it is now found that x̄ = 103 (or likewise, 97). This sample mean is closer to the hypothesized mean μ = 100 than that of the previous sample, hence it is somewhat stronger evidence in support of H0. However, Figure 5 illustrates that 103 is only very slightly larger than the rejection region endpoint 102.94, thus we technically have a borderline rejection of H0 at α = .05. In addition, we can see that the combined left and right tail areas will total only slightly less than the .05 significance level. Proceeding as above,

• p-value = P(X̄ ≤ 97) + P(X̄ ≥ 103), by definition
    = 2 × P(X̄ ≤ 97), by symmetry
    = 2 × P(Z ≤ −2), since (97 − 100) / 1.5 = −2
    = 2 × .0228, via tabulated entry of N(0,1) tail areas
    = .0456 ≈ .05, a borderline significant difference.

Finally, additional insight may be gained via the…

• 95% Confidence Interval for μ: All values between 103 ± 2.94, i.e., (100.06, 105.94).

As before, this interval should contain the true value of the mean μ with 95% confidence, by definition. Because μ = 100 is just outside the interval, this shows a borderline rejection at α = .05.

Sample # 3: Suppose now x̄ = 101 (or 99). The difference between this sample mean and the hypothesized mean μ = 100 is much less significant, hence this is quite strong evidence in support of H0. Figure 5 illustrates that 101 is clearly in the acceptance region of H0 at α = .05. Furthermore,

• p-value = P(X̄ ≤ 99) + P(X̄ ≥ 101), by definition
    = 2 × P(X̄ ≤ 99), by symmetry
    = 2 × P(Z ≤ −0.67), since (99 − 100) / 1.5 = −0.67, to two places
    = 2 × .2514, via tabulated entry of N(0,1) tail areas
    = .5028 >> .05, not a statistically significant difference.

• 95% Confidence Interval for μ: All values between 101 ± 2.94, i.e., (98.06, 103.94). As before, this interval should contain the true value of μ with 95% confidence, by definition. Because μ = 100 is clearly inside, this too indicates acceptance of H0 at the α = .05 level.

Other Samples: As an exercise, show that if x̄ = 100.3, then p = .8414; if x̄ = 100.1, then p = .9442, etc. From these examples, it is clear that the closer the random sample mean x̄ gets to the hypothesized value of the true mean μ, the stronger the empirical evidence is for that hypothesis, and the higher the p-value. (Of course, the maximum value of any probability is 1.)
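These table-based calculations are easily reproduced in R (small discrepancies come only from rounding the z-scores to two decimal places above):

se <- 30 / sqrt(400)                                 # = 1.5
p.two.sided <- function(xbar) 2 * pnorm(-abs(xbar - 100) / se)
p.two.sided(c(105, 103, 101, 100.3, 100.1))          # compare with the p-values above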

Next suppose that, as before, "X = IQ score" is normally distributed, with σ = 30.0, and that statistical inference for μ is to be based on random samples of size n = 400, at the α = .05 significance level. But perhaps we now wish to test specifically for significantly "higher than average" IQ in our population, by seeing if we can reject the null hypothesis H0: μ ≤ 100, in favor of the (right-sided) alternative hypothesis HA: μ > 100, via sample data. Proceeding as before (with the appropriate modifications), we have…

Margin of Error = Critical Value × Standard Error = zα × σ/√n = 1.645 × 1.5 = 2.4675,

using the one-sided critical value z.05 = 1.645 for α = .05. (For the standard normal distribution N(0, 1), the area to the left of 1.645 is .95 and the right-tail area beyond it is .05.)

Figure 6: Null Distribution N(100, 1.5), with acceptance region of area .95 to the left of 102.4675 and right-tail rejection region of area .05 beyond it.

• Acceptance Region for x̄: All values below 100 + 2.4675, i.e., x̄ < 102.4675.

Samples: As in the first example, suppose that x̄ = 105, which is clearly in the rejection region. The corresponding p-value is P(X̄ ≥ 105), i.e., the single right-tailed area only, or .0004 – exactly half the two-sided p-value calculated before. (Of course, this leads to an even stronger rejection of H0 at the α = .05 level than before.) Likewise, if, as in the second sample, x̄ = 103, the corresponding p-value = .0228 < .05, a moderate rejection. The sample mean x̄ = 101 is in the acceptance region, with a right-sided p-value = .2514 > .05. Clearly, x̄ = 100 corresponds to p = .5 exactly; x̄ = 99 corresponds to p = .7486 >> .05, and as sample means continue to decrease to the left, the corresponding p-values continue to increase toward 1, as empirical evidence in support of the null hypothesis H0: μ ≤ 100 continues to grow stronger.
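Again, these one-sided (right-tailed) p-values can be checked in R:

se <- 30 / sqrt(400)
p.right <- function(xbar) pnorm((xbar - 100) / se, lower.tail = FALSE)
p.right(c(105, 103, 101, 100, 99))                   # compare with the values above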


A3. Statistical Inference

Hypothesis Testing for One Mean μ

POPULATION: Assume random variable X ∼ N(μ, σ). 1a

ONE SAMPLE: Testing H0: μ = μ0 vs. HA: μ ≠ μ0

Test Statistic (with s replacing σ in the standard error σ/√n):

(X̄ − μ0) / (s/√n)  ~  Z, if n ≥ 30;   ~  t n−1, if n < 30.

1a Normality can be verified empirically by checking quantiles (such as 68%, 95%, 99.7%),

stemplot, normal scores plot, and/or “Lilliefors Test.” If the data turn out not to be normally distributed, things might still be OK due to the Central Limit Theorem, provided n ≥ 30. Otherwise, a transformation of the data can sometimes restore a normal distribution.

1b When X1 and X2 are not close to being normally distributed (or more to the point, when their

difference X1 – X2 is not), or not known, a common alternative approach in hypothesis testing is to use a “nonparametric” test, such as a Wilcoxon Test. There are two types: the “Rank Sum Test” (or “Mann-Whitney Test”) for independent samples, and the “Signed Rank Test” for paired sample data. Both use test statistics based on an ordered ranking of the data, and are free of distributional assumptions on the random variables.

2 If the sample sizes are large, the test statistic follows a standard normal Z-distribution (via the Central Limit Theorem), with standard error = √(σ1²/n1 + σ2²/n2). If the sample sizes are small, the test statistic does not follow an exact t-distribution, as in the single sample case, unless the two population variances σ1² and σ2² are equal. (Formally, this requires a separate test of how significantly the sample statistic s1²/s2², which follows an F-distribution, differs from 1. An informal rule of thumb is to accept equivariance if this ratio is between 0.25 and 4. Other, formal tests, such as "Levene's Test", can also be used.) In this case, the two samples can be pooled together to increase the power of the t-test, and the common value of their equal variances estimated. However, if the two variances cannot be assumed to be equal, then approximate t-tests – such as Satterthwaite's Test – should be used. Alternatively, a Wilcoxon Test is frequently used instead; see footnote 1b above.


Hypothesis Testing for Two Means μ1 vs. μ2

POPULATION: Random variable X defined on two groups ("arms"): Assume X1 ~ N(μ1, σ1), X2 ~ N(μ2, σ2). 1a, 1b

Testing H0: μ1 − μ2 = μ0, where μ0 is the conjectured mean of X1 − X2 (frequently μ0 = 0).

TWO SAMPLES: Independent 2 vs. Paired

Independent samples:

• n1 ≥ 30, n2 ≥ 30 − Test Statistic (σ1², σ2² replaced by s1², s2² in the standard error):

  Z = [(X̄1 − X̄2) − μ0] / √(s1²/n1 + s2²/n2)  ~  N(0, 1)

• n1 < 30, n2 < 30, with σ1² = σ2² − Test Statistic (σ1², σ2² replaced by spooled² in the standard error):

  T = [(X̄1 − X̄2) − μ0] / √(spooled² (1/n1 + 1/n2))  ~  t df ,  df = n1 + n2 − 2,

  where spooled² = [(n1 − 1) s1² + (n2 − 1) s2²] / df.

• n1 < 30, n2 < 30, with σ1² ≠ σ2² − Must use an approximate t-test, such as Satterthwaite's.

Note that the Wilcoxon (= Mann-Whitney) Rank Sum Test may be used as an alternative.

Paired samples:

Since the data are naturally "matched" by design, the pairwise differences constitute a single collapsed sample. Therefore, apply the appropriate one-sample test to the random variable D = X1 − X2 (hence D̄ = X̄1 − X̄2), having mean μ = μ1 − μ2; s = sample standard deviation of the D-values. Note that the Wilcoxon Signed Rank Test may be used as an alternative.
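In R, these two-sample procedures correspond to a few standard functions. A sketch with hypothetical data (the measurements below are invented purely for illustration):

x1 <- c(12.1, 14.3, 13.5, 15.0, 12.8, 14.1)
x2 <- c(13.9, 15.2, 14.8, 16.1, 15.5, 14.7)
var.test(x1, x2)                     # F-test of equal variances (equivariance)
t.test(x1, x2, var.equal = TRUE)     # pooled two-sample t-test (assumes equal variances)
t.test(x1, x2)                       # Welch/Satterthwaite approximate t-test (default)
t.test(x1, x2, paired = TRUE)        # paired t-test = one-sample test on D = X1 - X2
wilcox.test(x1, x2)                  # Wilcoxon Rank Sum (Mann-Whitney) alternative
wilcox.test(x1, x2, paired = TRUE)   # Wilcoxon Signed Rank alternative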


Hypothesis Testing for One Proportion π

POPULATION: Binary random variable Y, with P(Y = 1) = π. Testing H0: π = π0 vs. HA: π ≠ π0

ONE SAMPLE:

If n is large 3, then the standard error ≈ √(π(1 − π)/n), with N(0, 1) distribution.

• For confidence intervals, replace π by its point estimate π̂ = X/n, where X = Σ(Y = 1) = # "successes" in the sample.

• For acceptance regions and p-values, replace π by π0, i.e.,

  Test Statistic: Z = (π̂ − π0) / √(π0(1 − π0)/n)  ~  N(0, 1)

If n is small, then the above approximation does not apply, and computations are performed directly on X, using the fact that it is binomially distributed. That is, X ∼ Bin(n, π). Messy by hand...

3 In this context, "large" is somewhat subjective and open to interpretation. A typical criterion is to require that the mean number of "successes" nπ, and the mean number of "failures" n(1 − π), in the sample(s) be sufficiently large, say greater than or equal to 10 or 15. (Other, less common, criteria are also used.)
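A sketch of the corresponding calculations in R, with hypothetical counts (the numbers are invented for illustration):

x <- 23; n <- 80; pi0 <- 0.20
pihat <- x / n
z <- (pihat - pi0) / sqrt(pi0 * (1 - pi0) / n)     # large-sample test statistic
2 * pnorm(-abs(z))                                 # two-sided p-value
prop.test(x, n, p = pi0, correct = FALSE)          # chi-squared version of the same test
binom.test(x, n, p = pi0)                          # exact binomial test (small n)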


Hypothesis Testing for Two Proportions π1 vs. π2

POPULATION: Binary random variable Y defined on two groups ("arms"), with P(Y1 = 1) = π1, P(Y2 = 1) = π2.

Testing H0: π1 − π2 = 0 vs. HA: π1 − π2 ≠ 0

TWO SAMPLES: Independent vs. Paired

Independent samples, large 3:

Standard error = √(π1(1 − π1)/n1 + π2(1 − π2)/n2), with N(0, 1) distribution.

• For confidence intervals, replace π1, π2 by the point estimates π̂1, π̂2.

• For acceptance regions and p-values, replace π1, π2 by the pooled estimate of their common value under the null hypothesis, π̂pooled = (X1 + X2)/(n1 + n2), i.e.,

  Test Statistic: Z = [(π̂1 − π̂2) − 0] / √(π̂pooled (1 − π̂pooled)(1/n1 + 1/n2))  ~  N(0, 1)

Alternatively, can use a Chi-squared (χ²) Test.

Paired samples, large: McNemar's Test (a "matched" form of the χ² Test).

Small samples: Independent − Fisher's Exact Test (messy; based on the "hypergeometric distribution" of X). Paired − ad hoc techniques; not covered.
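A sketch of these two-sample procedures in R, again with hypothetical counts (invented for illustration only):

x <- c(34, 21); n <- c(120, 115)                     # "successes" and sample sizes
prop.test(x, n, correct = FALSE)                     # large-sample chi-squared / Z test
fisher.test(matrix(c(34, 86, 21, 94), nrow = 2))     # Fisher's Exact Test (small samples)
tab <- matrix(c(40, 8, 17, 35), nrow = 2)            # hypothetical 2 x 2 table of paired outcomes
mcnemar.test(tab)                                    # McNemar's Test for paired proportions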


A3. Statistical Inference

Hypothesis Testing for General Population Parameters

POPULATION: Null Hypothesis H0: θ = θ0

SAMPLE: Once a suitable random sample (or two or more, depending on the application) has been selected, the observed data can be used to compute a point estimate θ̂ that approximates the parameter θ above. For example, for single sample estimates, we take μ̂ = x̄, π̂ = p, σ̂² = s²; for two samples, take μ̂1 − μ̂2 = x̄1 − x̄2, π̂1 − π̂2 = p1 − p2, σ̂1²/σ̂2² = s1²/s2². This sample-based statistic is then used to test the null hypothesis in a procedure known as statistical inference. The fundamental question: "At some pre-determined significance level α, does the sample estimator θ̂ provide sufficient experimental evidence to reject the null hypothesis that the parameter value is equal to θ0, i.e., is there a statistically significant difference between the two?" If not, then this can be interpreted as having evidence in support of the null hypothesis, and we can tentatively accept it, pending further empirical evidence; see THE BIG PICTURE.

In order to arrive at the correct decision rule for the mean(s) and proportion(s) [subtleties exist in the case of the variance(s)], we need to calculate the following object(s):

• Confidence Interval endpoints = θ̂ ± critical value × standard error, where the product critical value × standard error is the margin of error.
  (If θ0 is inside, then accept the null hypothesis. If θ0 is outside, then reject the null hypothesis.)

• Acceptance Region endpoints = θ0 ± critical value × standard error
  (If θ̂ is inside, then accept the null hypothesis. If θ̂ is outside, then reject the null hypothesis.)

• Test Statistic = (θ̂ − θ0) / standard error, which is used to calculate the p-value of the experiment.
  (If p-value > α, then accept the null hypothesis. If p-value < α, then reject the null hypothesis.)

The appropriate critical values and standard errors can be computed from the following tables, assuming that the variable X is normally distributed. (Details can be found in previous notes.)

"θ" is a generic parameter of interest (e.g., μ, π, σ² in the one sample case; μ1 − μ2, π1 − π2, σ1²/σ2² in the two sample case) of a random variable X.

"θ0" is a conjectured value of the parameter θ in the null hypothesis. In the two sample case for means and proportions, this value is often chosen to be zero if, as in a clinical trial, we are attempting to detect any statistically significant difference between the two groups (at some predetermined significance level α). For the ratio of variances between two groups, this value is usually one, to test for equivariance.


MARGIN OF ERROR (a sample statistic) = CRITICAL VALUE (2-sided) 1 × STANDARD ERROR (estimate) 2

One Sample − Null Hypothesis H0: θ = θ0 on the population parameter; Point Estimate θ̂ = f(x1, …, xn).

• Mean 2 − H0: μ = μ0; point estimate μ̂ = x̄ = (Σ xi)/n
  Critical value: n ≥ 30: t n−1, α/2 or zα/2;  n < 30: t n−1, α/2 only
  Standard error (any n): s/√n

• Proportion − H0: π = π0; point estimate π̂ (= p) = X/n, where X = # successes
  Critical value (n ≥ 30, with nπ ≥ 15 and n(1 − π) ≥ 15): zα/2 ~ N(0, 1)
  Standard error: √(π̂(1 − π̂)/n) for a confidence interval; √(π0(1 − π0)/n) for an acceptance region or p-value
  n < 30: use X ~ Bin(n, π) (not explicitly covered)

Two Independent Samples − Null Hypothesis H0: θ1 − θ2 = 0; Point Estimate θ̂1 − θ̂2. (For Two Paired Samples, see footnote 3.)

• Means 2 − H0: μ1 − μ2 = 0; point estimate x̄1 − x̄2
  n1, n2 ≥ 30: critical value t n1+n2−2, α/2 or zα/2; standard error √(s1²/n1 + s2²/n2)
  n1, n2 < 30: Is σ1² = σ2²? (Informal check: 1/4 < s1²/s2² < 4?)
    Yes → critical value t n1+n2−2, α/2; standard error spooled √(1/n1 + 1/n2),
      where spooled² = [(n1 − 1) s1² + (n2 − 1) s2²] / (n1 + n2 − 2)
    No → Satterthwaite's (approximate) Test

• Proportions − H0: π1 − π2 = 0; point estimate π̂1 − π̂2
  n1, n2 ≥ 30 (also see the criteria above): critical value zα/2 (or use a Chi-squared Test)
  Standard error: √(π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2) for a confidence interval;
    √(π̂pooled(1 − π̂pooled)(1/n1 + 1/n2)) for an acceptance region or p-value, where π̂pooled = (X1 + X2)/(n1 + n2)
  n1, n2 < 30: Fisher's Exact Test (not explicitly covered)

k Samples (k ≥ 2) − Null Hypothesis H0: θ1 = θ2 = … = θk

• Means − H0: μ1 = μ2 = … = μk: independent samples − F-test (ANOVA); dependent samples − Repeated Measures, "Blocks" (not covered)
• Proportions − H0: π1 = π2 = … = πk: independent samples − Chi-squared Test; dependent samples − other techniques (not covered)

1 For 1-sided hypothesis tests, replace α/2 by α.
2 For Mean(s): If normality is established, use the true standard error if known – either σ/√n or √(σ1²/n1 + σ2²/n2) – with the Z-distribution. If normality is not established, then use a transformation, or a nonparametric Wilcoxon Test on the median(s).
3 For Paired Means: Apply the appropriate one sample test to the pairwise differences D = X – Y. For Paired Proportions: Apply McNemar's Test, a "matched" version of the 2 × 2 Chi-squared Test.


HOW TO USE THESE TABLES

The preceding page consists of three tables that are divided into general statistical inference formulas for hypothesis tests of means and proportions, for one sample, two samples, and k ≥ 2 samples, respectively. The first two tables for 2-sided Z- and t- tests can be used to calculate the margin of error = critical value × standard error for acceptance/rejection regions and confidence intervals. Column 1 indicates the general form of the null hypothesis H0 for the relevant parameter value, Column 2 shows the form of the sample-based parameter estimate (a.k.a., statistic), Column 3 shows the appropriate distribution and corresponding critical value, and Column 4 shows the corresponding standard error estimate (if the exact standard error is unknown).

Pay close attention to the footnotes in the tables, and refer back to previous notes for details and examples!

Two-sided alternative: H0: θ = θ0 vs. HA: θ ≠ θ0

To calculate…                                                To reject H0, ask…
Confidence Limits: Column 2 ± (Column 3)(Column 4)           Is Column 1 outside?
Acceptance Region: Column 1 ± (Column 3)(Column 4)           Is Column 2 outside?
Test Statistic: (Column 2 − Column 1) / Column 4             Is the p-value < α?
  (Z-score for large samples, T-score for small samples)

p-value = 2 × P(Z > |Z-score|), or equivalently, 2 × P(Z < −|Z-score|), for large samples;
        = 2 × P(Tdf > |T-score|), or equivalently, 2 × P(Tdf < −|T-score|), for small samples.

Example (α = .05). An informal strength-of-evidence scale for the p-value, on the interval from 0 to 1 (Reject H0 when p < α; Accept H0 when p > α):

p ≤ .001  extremely significant
p ≈ .005  strongly significant
p ≈ .01   moderately significant
p ≈ .05   borderline significant
p ≥ .10   not significant


One-sided test*, Right-tailed alternative: H0: θ ≤ θ0 vs. HA: θ > θ0

To calculate…                                                To reject H0, ask…
Confidence Interval: ≥ Column 2 − (Column 3)(Column 4)       Is Column 1 outside?
Acceptance Region: ≤ Column 1 + (Column 3)(Column 4)         Is Column 2 outside?
Test Statistic: (Column 2 − Column 1) / Column 4             Is the p-value < α?
  (Z-score for large samples, T-score for small samples)

p-value = P(Z > Z-score), for large samples; = P(Tdf > T-score), for small samples.

One-sided test*, Left-tailed alternative: H0: θ ≥ θ0 vs. HA: θ < θ0

To calculate…                                                To reject H0, ask…
Confidence Interval: ≤ Column 2 + (Column 3)(Column 4)       Is Column 1 outside?
Acceptance Region: ≥ Column 1 − (Column 3)(Column 4)         Is Column 2 outside?
Test Statistic: (Column 2 − Column 1) / Column 4             Is the p-value < α?
  (Z-score for large samples, T-score for small samples)

p-value = P(Z < Z-score), for large samples; = P(Tdf < T-score), for small samples.

* The formulas in the tables are written for 2-sided tests only, and must be modified for 1-sided tests, by changing α/2 to α. Also, recall that the p-value is always determined by the direction of the corresponding alternative hypothesis (either < or > in a 1-sided test, both in a 2-sided test).


THE BIG PICTURE

STATISTICS AND THE SCIENTIFIC METHOD

If, over time, a particular null hypothesis is continually “accepted” (as in a statistical meta-analysis of numerous studies, for example), then it may eventually become formally recognized as an established scientific fact. When sufficiently many such interrelated facts are collected and the connections between them understood in a coherently structured way, the resulting organized body of truths is often referred to as a scientific theory – such as the Theory of Relativity, the Theory of Plate Tectonics, or the Theory of Natural Selection. It is the ultimate goal of a scientific theory to provide an objective description of some aspect, or natural law, of the physical universe, such as the Law of Gravitation, Laws of Thermodynamics, Mendel’s Laws of Genetic Inheritance, etc.

[Image credit: http://www.nasa.gov/vision/universe/starsgalaxies/hubble_UDF.html]


A4. Regression Models

A4.1 Power Law Growth
A4.2 Exponential Growth
A4.3 Logistic Growth
A4.4 Example – Newton’s Law of Cooling


A4. Regression Models

Power Law Growth

The technique of transforming data, especially using logarithms, is extremely valuable. Many physical systems involve two variables X and Y that are known (or suspected) to obey a "power law" relation, where Y is proportional to X raised to a power, i.e., Y = α X^β

for some fixed constants α and β. Examples include the relation L = 1.4 A^0.6 that exists between river length L and the area A that it drains, "inverse square" laws such as the gravitational attraction F = G m1 m2 r^−2 between two masses separated by a distance r, earthquake frequency versus intensity, the frequency of global mass extinction events over geologic time, comet brightness vs. distance to the sun, economic trends, language patterns, and numerous others. As mentioned before, in these cases, both variables – X and Y – are often transformed by means of a logarithm. The resulting data are replotted on a "log-log" scale, where a linear model is then fit (the algebraic details were presented in the basic review of logarithms):

log10 Y = β0 + β1 log10 X ,

and the original power law relation can be recovered via the formulas α = 10^β0 and β = β1.

As a simple example, suppose we are examining the relation between V = Volume (cm³) and A = Surface Area (cm²) of various physical objects. For the sake of simplicity, let us confine our investigation to sample data of solid cubes of n = 10 different sizes:

V: 1, 8, 27, 64, 125, 216, 343, 512, 729, 1000
A: 6, 24, 54, 96, 150, 216, 294, 384, 486, 600

Note the nonlinear scatterplot in Figure 1. If we take the common logarithm of both variables, the rescaled "log-log" plot reveals a strong linear correlation; see Figure 2. This is strong evidence that there is a power law relation between the original variables, i.e., A = α V^β.

log10 V: 0.000, 0.903, 1.431, 1.806, 2.097, 2.334, 2.535, 2.709, 2.863, 3.000
log10 A: 0.778, 1.380, 1.732, 1.982, 2.176, 2.334, 2.468, 2.584, 2.687, 2.778

Therefore, a linear model will be a much better fit for these transformed data points than for the original data points. Solving for the regression coefficients in the usual way (Exercise), we find that the least squares regression line is given by

log10 A = 0.778 + 0.667 log10 V ,

with intercept β0 = 0.778 and slope β1 = 0.667.
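(These coefficients can be reproduced directly in R; the following short sketch uses the cube data above.)

V <- c(1, 8, 27, 64, 125, 216, 343, 512, 729, 1000)
A <- c(6, 24, 54, 96, 150, 216, 294, 384, 486, 600)
fit <- lm(log10(A) ~ log10(V))     # linear regression on the log-log scale
coef(fit)                          # intercept 0.778..., slope 0.667 (= 2/3)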


We can now estimate the original coefficients: α = 10^0.778 = 6, and β = 0.667 = 2/3, approximately. Therefore, the required power law relation is A = 6 V^(2/3). This should come as no surprise, because the surface area of a cube (which has six square faces) is given by A = 6s², and the volume is given by V = s³, where s is the length of one side of the cube. Hence, eliminating the s, we see that A = 6 V^(2/3) for solid cubes. If we had chosen to work with spheres instead, only the constant of proportionality α would have changed slightly (to (36π)^(1/3) ≈ 4.836); the power would remain unchanged at β = 2/3. (Here, V = (4/3)π r³ and A = 4π r², where r is the radius.) This illustrates a basic principle of mechanics: since the volume of any object is roughly proportional to the cube of its "length" (say, V ∝ L³), and the surface area is proportional to its square (say, A ∝ L²), what follows is the general power relation that A ∝ V^(2/3).

Comment. In a biomechanical application of power law scaling, consider the relation between the metabolic rate Y of organisms (as measured by the amount of surface area heat dissipation per unit time), and their body mass M (generally proportional to the volume). From the preceding argument, one might naively expect that, as a general rule, Y ∝ M^(2/3). However, this has been shown not to be the case. From systematic measurements of the correlation between these two variables (first done in 1932 by Max Kleiber), it was shown that a more accurate power relation is given by Y ∝ M^(3/4), known as Kleiber's Law. Since that time, "quarter-power scaling" has been shown to exist everywhere in biology, from respiratory rates (∝ M^(−1/4)), to tree trunk and human aorta diameters (∝ M^(3/8)). Exactly why this is so universal is something of a major mystery, but seems related to an area of mathematics known as "fractal geometry." Since 1997, much research has been devoted to describing general models that explain the origin and prevalence of quarter-power scaling in nature, which is considered by some to be "perhaps the single most pervasive theme underlying all biological diversity." (Santa Fe Institute Bulletin, Volume 12, No. 2.)


[Figure 1: scatterplot of A vs. V, showing a nonlinear relation. Figure 2: the rescaled "log-log" plot, showing a strong linear relation.]


A4. Regression Models

Exponential Growth

Consider a (somewhat idealized) example of how to use a logarithm transformation on exponential growth data. Assume we start with an initial population of 100 cells in culture, and they grow under ideal conditions, exactly doubling their numbers once every hour. Let X = time (hours), Y = population size; suppose we obtain the following data.

X: 0, 1, 2, 3, 4
Y: 100, 200, 400, 800, 1600

A scatterplot reveals typical exponential (a.k.a. geometric) growth; see Figure 1. A linear fit of these data points (X, Y) will not be a particularly good model for it, but there is nothing to prevent us, either statistically or mathematically, from proceeding this way. Their least squares regression line (also shown in Figure 1) is given by the equation

Y = −100 + 360 X, with a coefficient of determination r² = 0.871. (Exercise: Verify these claims.) Although r² is fairly close to 1, there is nothing scientifically compelling about this model; there is certainly nothing natural or enlightening about the regression coefficients −100 and 360 in the context of this particular application. This illustrates the drawback of relying on r² as the sole indicator of the fit of the linear model. One alternative approach is to take the logarithm (we will use "common logarithm," base 10) of the response variable Y – which is possible to do, since Y takes positive values – in an attempt to put the population size on the same scale as the time variable X. This gives

log10(Y): 2.0, 2.3, 2.6, 2.9, 3.2

Notice that the transformed response variable increases by a constant amount (+0.3) for every one-hour increase in time, the hallmark of linear behavior. Therefore, since the points (X, log10 Y) are collinear, their least squares regression line is given by the equation

log10(Ŷ) = 2 + 0.3 X. (Verify this by computing the regression coefficients.)

Given this, we can now solve for the population size directly. Inverting the logarithm,

Ŷ = 10^(2 + 0.3X) = 10² × 10^(0.3X) (via a law of exponents),

i.e., Ŷ = 100 × 2^X. This exponential growth model is a much better fit to the data; see Figure 2. In fact, it's exact (check it for X = 0, 1, 2, 3, 4), and makes intuitively reasonable sense for this application. The population size Y at any time X is equal to the initial population size of 100, times 2 raised to the X power, since it doubles in size every hour. This is an example of unrestricted exponential growth. The technique of logistic regression applies to restricted exponential growth models, among other things.
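The same two fits can be reproduced in R with the data above:

X <- 0:4
Y <- c(100, 200, 400, 800, 1600)
summary(lm(Y ~ X))              # naive linear fit: Y = -100 + 360 X, r^2 = 0.871
fit <- lm(log10(Y) ~ X)         # linear fit to the log-transformed response
coef(fit)                       # intercept 2, slope log10(2) = 0.301 (~ 0.3)
10^coef(fit)                    # back-transformed: initial size 100, hourly factor 2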

[Figure 1: scatterplot of the population data with the least squares line Y = −100 + 360 X. Figure 2: the exponential model Y = 100 × 2^X fit to the same data.]


A4. Regression Models

Logistic Growth

Consider the more realistic situation of restricted population growth. As with the previous unrestricted case, the population initially grows exponentially, as resources are plentiful. Eventually, however, various factors (such as competition for diminishing resources, stress due to overcrowding, disease, predation, etc.) act to reduce the population growth rate. The population size continues to increase, but at an ever-slower rate. Ultimately it approaches (but may never actually reach) an asymptotically stable value, the "carrying capacity," that represents a theoretical maximum limit to the population size under these conditions. Consider the following (idealized) data set, for a population with a carrying capacity of 900 organisms. Note how the growth slows down and levels off with time.

X: 0, 1, 2, 3, 4
Y: 100, 300, 600, 800, 873

We wish to model this growth rate via regression, taking into account the carrying capacity. We first convert the Y-values to proportions (π) that survive out of 900, by dividing.∗

π: 0.11, 0.33, 0.67, 0.89, 0.97

Next we transform π to the peculiar-looking "link function" log10(π / (1 − π)). (Note: In practice, the "natural logarithm" – base e = 2.71828… – is normally used, for good reasons, but here we use the "common logarithm" – base 10. The final model for π will not depend on the particular base used.)

log10(π / (1 − π)): −0.9, −0.3, 0.3, 0.9, 1.5

Notice how for every +1 increase in X, there is a corresponding constant increase (+0.6) in the transformed variable log10(π / (1 − π)), indicating linear behavior. Hence, the fitted linear model is exact:

log10(π̂ / (1 − π̂)) = −0.9 + 0.6 X .

Solving algebraically (details omitted) yields

π̂ = (0.125)(4^X) / [1 + (0.125)(4^X)] ,

which simplifies to π̂ = 4^X / (8 + 4^X), i.e., π̂ = 1 / (1 + 8 (4^−X)). (Multiply by 900 to get the fitted Ŷ.)

Exercise: Calculate the fitted values of this model for X = 0, 1, 2, 3, 4, and compare with the original data values.

∗ Formally, π is the probability P(S = 1), where the binary variable S = 1 indicates survival, and S = 0 indicates death.

The S-shaped graph of this relation is the classical logistic curve, or logit (pronounced "low-jit"); see figure. Besides restricted population growth, it also describes many other phenomena that behave similarly, such as "dose - response" in pharmacokinetics, and the "learning curve" in psychology.
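The logistic fit above can likewise be reproduced in R:

X <- 0:4
Y <- c(100, 300, 600, 800, 873)
p <- Y / 900                                   # proportions of the carrying capacity
fit <- lm(log10(p / (1 - p)) ~ X)              # linear fit on the logit (link) scale
coef(fit)                                      # intercept ~ -0.9, slope ~ 0.6
p.hat <- 1 / (1 + 8 * 4^(-X))                  # the fitted logistic model from above
round(900 * p.hat)                             # ~ 100, 300, 600, 800, 873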

A4. Regression Models

A Modeling Example: Newton’s Law of Cooling

Suppose that a beaker containing hot liquid is placed in a room of ambient temperature 70°F, and allowed to cool. Its temperature Y (°F) is recorded every ten minutes over a period of time (X), yielding the n = 5 measurements shown below, along with some accompanying summary statistics:

X: 0, 10, 20, 30, 40
Y: 150, 110, 90, 80, 75

x̄ = 20,  ȳ = 101,  sx² = 250,  sy² = 930,  sxy = −450

Using simple linear regression, the least-squares regression line is given by Ŷ = 137 − 1.8 X, which has a reasonably high coefficient of determination r² = 0.871, indicating an acceptable fit.

(You should check these claims on your own; in fact, this is one of my old exam problems!) However, it is clear from the scatterplot that the linear model does not capture the curved nature of the relationship between time X and temperature Y, as the latter decreases very rapidly in the early minutes, then more slowly later on. Therefore, curvilinear regression might produce better models. In particular, using polynomial regression, we may fit a quadratic model (i.e., parabola), or a higher degree polynomial, to the data. Using elementary functions other than polynomials can also produce suitable alternative least-squares regression models, as shown in the figures below. All these models are reasonably good fits, and are potentially useful within the limits of the data, especially if we have no additional information about the theoretical dynamics between X and Y. However, in certain instances, it is possible to derive a formal mathematical relationship between the variables of interest, starting from known fundamental scientific principles. For example, the behavior of this system is governed by a principle known as Newton’s Law of Cooling, which states that “at any given time, the rate of change of the temperature of the liquid is proportional to the difference between the temperature of the liquid and the ambient temperature.” In calculus notation, this statement translates to a first-order ordinary differential equation, and corresponding initial condition at time zero:

dY/dX = k (Y − a),   Y(0) = Y0 .

Here, k < 0 is the constant of proportionality (negative, because the temperature Y is decreasing), a is the known ambient temperature, and Y0 is the given initial temperature of the liquid. The unique solution of this initial value problem (IVP) is given by the following formula:

Y = a + (Y0 − a) e^(kX) .

In this particular example, we have Y0 = 150°F and a = 70°F. Furthermore, with the given data, the constant of proportionality turns out to be precisely k = −(ln 2)/10, so there is exact agreement with the model

Y = 70 + 80 (2^(−X/10)) .

More importantly, note that as the time variable X grows large, the temperature variable Y in this decaying exponential model asymptotically approaches the ambient temperature of a = 70°F at equilibrium, as expected in practice. The other models do not share this physically realistic property.
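A short R check of these claims, using the five measurements above:

X <- c(0, 10, 20, 30, 40)
Y <- c(150, 110, 90, 80, 75)
coef(lm(Y ~ X)); summary(lm(Y ~ X))$r.squared   # Y = 137 - 1.8 X, r^2 = 0.871
70 + 80 * 2^(-X / 10)                           # cooling-law model: reproduces Y exactly
-log(2) / 10                                    # the constant of proportionality k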


[Figure: scatterplots of the cooling data with six fitted regression models and their coefficients of determination:
Y = 137 − 1.8 X (r² = 0.871);
Y = 148.429 − 4.086 X + 0.057 X² (r² = 0.994);
Y = 152.086 − 20.288 ln(X + 1) (r² = 0.985);
Y = 148.733 − 12.279 √X (r² = 0.991);
Y = 58.269 + 935.718/(X + 10) (r² = 0.991);
Y = 70 + 80 (2^(−X/10)) (r² = 1.000), the decaying exponential model, which levels off at the ambient temperature 70°F.]


A5. Statistical Tables

A5.1 Z-distribution
A5.2 T-distribution
A5.3 Chi-squared distribution


Cumulative Probabilities of the Standard Normal Distribution N(0, 1). Each column pair below lists a z-score together with its left-sided area P(Z ≤ z-score).

–4.26 0.00001 –4.25 0.00001 –4.24 0.00001 –4.23 0.00001 –4.22 0.00001 –4.21 0.00001 –4.20 0.00001 –4.19 0.00001 –4.18 0.00001 –4.17 0.00002 –4.16 0.00002 –4.15 0.00002 –4.14 0.00002 –4.13 0.00002 –4.12 0.00002 –4.11 0.00002 –4.10 0.00002 –4.09 0.00002 –4.08 0.00002 –4.07 0.00002 –4.06 0.00002 –4.05 0.00003 –4.04 0.00003 –4.03 0.00003 –4.02 0.00003 –4.01 0.00003 –4.00 0.00003 –3.99 0.00003 –3.98 0.00003 –3.97 0.00004 –3.96 0.00004 –3.95 0.00004 –3.94 0.00004 –3.93 0.00004 –3.92 0.00004 –3.91 0.00005 –3.90 0.00005 –3.89 0.00005 –3.88 0.00005 –3.87 0.00005 –3.86 0.00006 –3.85 0.00006 –3.84 0.00006 –3.83 0.00006 –3.82 0.00007 –3.81 0.00007 –3.80 0.00007 –3.79 0.00008 –3.78 0.00008 –3.77 0.00008 –3.76 0.00008 –3.75 0.00009 –3.74 0.00009 –3.73 0.00010 –3.72 0.00010 –3.71 0.00010 –3.70 0.00011 –3.69 0.00011 –3.68 0.00012 –3.67 0.00012 –3.66 0.00013 –3.65 0.00013 –3.64 0.00014 –3.63 0.00014 –3.62 0.00015 –3.61 0.00015 –3.60 0.00016 –3.59 0.00017 –3.58 0.00017 –3.57 0.00018 –3.56 0.00019

–3.55 0.00019 –3.54 0.00020 –3.53 0.00021 –3.52 0.00022 –3.51 0.00022 –3.50 0.00023 –3.49 0.00024 –3.48 0.00025 –3.47 0.00026 –3.46 0.00027 –3.45 0.00028 –3.44 0.00029 –3.43 0.00030 –3.42 0.00031 –3.41 0.00032 –3.40 0.00034 –3.39 0.00035 –3.38 0.00036 –3.37 0.00038 –3.36 0.00039 –3.35 0.00040 –3.34 0.00042 –3.33 0.00043 –3.32 0.00045 –3.31 0.00047 –3.30 0.00048 –3.29 0.00050 –3.28 0.00052 –3.27 0.00054 –3.26 0.00056 –3.25 0.00058 –3.24 0.00060 –3.23 0.00062 –3.22 0.00064 –3.21 0.00066 –3.20 0.00069 –3.19 0.00071 –3.18 0.00074 –3.17 0.00076 –3.16 0.00079 –3.15 0.00082 –3.14 0.00084 –3.13 0.00087 –3.12 0.00090 –3.11 0.00094 –3.10 0.00097 –3.09 0.00100 –3.08 0.00104 –3.07 0.00107 –3.06 0.00111 –3.05 0.00114 –3.04 0.00118 –3.03 0.00122 –3.02 0.00126 –3.01 0.00131 –3.00 0.00135 –2.99 0.00139 –2.98 0.00144 –2.97 0.00149 –2.96 0.00154 –2.95 0.00159 –2.94 0.00164 –2.93 0.00169 –2.92 0.00175 –2.91 0.00181 –2.90 0.00187 –2.89 0.00193 –2.88 0.00199 –2.87 0.00205 –2.86 0.00212 –2.85 0.00219

–2.84 0.00226 –2.83 0.00233 –2.82 0.00240 –2.81 0.00248 –2.80 0.00256 –2.79 0.00264 –2.78 0.00272 –2.77 0.00280 –2.76 0.00289 –2.75 0.00298 –2.74 0.00307 –2.73 0.00317 –2.72 0.00326 –2.71 0.00336 –2.70 0.00347 –2.69 0.00357 –2.68 0.00368 –2.67 0.00379 –2.66 0.00391 –2.65 0.00402 –2.64 0.00415 –2.63 0.00427 –2.62 0.00440 –2.61 0.00453 –2.60 0.00466 –2.59 0.00480 –2.58 0.00494 –2.57 0.00508 –2.56 0.00523 –2.55 0.00539 –2.54 0.00554 –2.53 0.00570 –2.52 0.00587 –2.51 0.00604 –2.50 0.00621 –2.49 0.00639 –2.48 0.00657 –2.47 0.00676 –2.46 0.00695 –2.45 0.00714 –2.44 0.00734 –2.43 0.00755 –2.42 0.00776 –2.41 0.00798 –2.40 0.00820 –2.39 0.00842 –2.38 0.00866 –2.37 0.00889 –2.36 0.00914 –2.35 0.00939 –2.34 0.00964 –2.33 0.00990 –2.32 0.01017 –2.31 0.01044 –2.30 0.01072 –2.29 0.01101 –2.28 0.01130 –2.27 0.01160 –2.26 0.01191 –2.25 0.01222 –2.24 0.01255 –2.23 0.01287 –2.22 0.01321 –2.21 0.01355 –2.20 0.01390 –2.19 0.01426 –2.18 0.01463 –2.17 0.01500 –2.16 0.01539 –2.15 0.01578 –2.14 0.01618

–2.13 0.01659 –2.12 0.01700 –2.11 0.01743 –2.10 0.01786 –2.09 0.01831 –2.08 0.01876 –2.07 0.01923 –2.06 0.01970 –2.05 0.02018 –2.04 0.02068 –2.03 0.02118 –2.02 0.02169 –2.01 0.02222 –2.00 0.02275 –1.99 0.02330 –1.98 0.02385 –1.97 0.02442 –1.96 0.02500 –1.95 0.02559 –1.94 0.02619 –1.93 0.02680 –1.92 0.02743 –1.91 0.02807 –1.90 0.02872 –1.89 0.02938 –1.88 0.03005 –1.87 0.03074 –1.86 0.03144 –1.85 0.03216 –1.84 0.03288 –1.83 0.03362 –1.82 0.03438 –1.81 0.03515 –1.80 0.03593 –1.79 0.03673 –1.78 0.03754 –1.77 0.03836 –1.76 0.03920 –1.75 0.04006 –1.74 0.04093 –1.73 0.04182 –1.72 0.04272 –1.71 0.04363 –1.70 0.04457 –1.69 0.04551 –1.68 0.04648 –1.67 0.04746 –1.66 0.04846 –1.65 0.04947 –1.64 0.05050 –1.63 0.05155 –1.62 0.05262 –1.61 0.05370 –1.60 0.05480 –1.59 0.05592 –1.58 0.05705 –1.57 0.05821 –1.56 0.05938 –1.55 0.06057 –1.54 0.06178 –1.53 0.06301 –1.52 0.06426 –1.51 0.06552 –1.50 0.06681 –1.49 0.06811 –1.48 0.06944 –1.47 0.07078 –1.46 0.07215 –1.45 0.07353 –1.44 0.07493 –1.43 0.07636

–1.42 0.07780 –1.41 0.07927 –1.40 0.08076 –1.39 0.08226 –1.38 0.08379 –1.37 0.08534 –1.36 0.08691 –1.35 0.08851 –1.34 0.09012 –1.33 0.09176 –1.32 0.09342 –1.31 0.09510 –1.30 0.09680 –1.29 0.09853 –1.28 0.10027 –1.27 0.10204 –1.26 0.10383 –1.25 0.10565 –1.24 0.10749 –1.23 0.10935 –1.22 0.11123 –1.21 0.11314 –1.20 0.11507 –1.19 0.11702 –1.18 0.11900 –1.17 0.12100 –1.16 0.12302 –1.15 0.12507 –1.14 0.12714 –1.13 0.12924 –1.12 0.13136 –1.11 0.13350 –1.10 0.13567 –1.09 0.13786 –1.08 0.14007 –1.07 0.14231 –1.06 0.14457 –1.05 0.14686 –1.04 0.14917 –1.03 0.15151 –1.02 0.15386 –1.01 0.15625 –1.00 0.15866 –0.99 0.16109 –0.98 0.16354 –0.97 0.16602 –0.96 0.16853 –0.95 0.17106 –0.94 0.17361 –0.93 0.17619 –0.92 0.17879 –0.91 0.18141 –0.90 0.18406 –0.89 0.18673 –0.88 0.18943 –0.87 0.19215 –0.86 0.19489 –0.85 0.19766 –0.84 0.20045 –0.83 0.20327 –0.82 0.20611 –0.81 0.20897 –0.80 0.21186 –0.79 0.21476 –0.78 0.21770 –0.77 0.22065 –0.76 0.22363 –0.75 0.22663 –0.74 0.22965 –0.73 0.23270 –0.72 0.23576

–0.71 0.23885 –0.70 0.24196 –0.69 0.24510 –0.68 0.24825 –0.67 0.25143 –0.66 0.25463 –0.65 0.25785 –0.64 0.26109 –0.63 0.26435 –0.62 0.26763 –0.61 0.27093 –0.60 0.27425 –0.59 0.27760 –0.58 0.28096 –0.57 0.28434 –0.56 0.28774 –0.55 0.29116 –0.54 0.29460 –0.53 0.29806 –0.52 0.30153 –0.51 0.30503 –0.50 0.30854 –0.49 0.31207 –0.48 0.31561 –0.47 0.31918 –0.46 0.32276 –0.45 0.32636 –0.44 0.32997 –0.43 0.33360 –0.42 0.33724 –0.41 0.34090 –0.40 0.34458 –0.39 0.34827 –0.38 0.35197 –0.37 0.35569 –0.36 0.35942 –0.35 0.36317 –0.34 0.36693 –0.33 0.37070 –0.32 0.37448 –0.31 0.37828 –0.30 0.38209 –0.29 0.38591 –0.28 0.38974 –0.27 0.39358 –0.26 0.39743 –0.25 0.40129 –0.24 0.40517 –0.23 0.40905 –0.22 0.41294 –0.21 0.41683 –0.20 0.42074 –0.19 0.42465 –0.18 0.42858 –0.17 0.43251 –0.16 0.43644 –0.15 0.44038 –0.14 0.44433 –0.13 0.44828 –0.12 0.45224 –0.11 0.45620 –0.10 0.46017 –0.09 0.46414 –0.08 0.46812 –0.07 0.47210 –0.06 0.47608 –0.05 0.48006 –0.04 0.48405 –0.03 0.48803 –0.02 0.49202 –0.01 0.49601



0.00 0.50000 +0.01 0.50399 +0.02 0.50798 +0.03 0.51197 +0.04 0.51595 +0.05 0.51994 +0.06 0.52392 +0.07 0.52790 +0.08 0.53188 +0.09 0.53586 +0.10 0.53983 +0.11 0.54380 +0.12 0.54776 +0.13 0.55172 +0.14 0.55567 +0.15 0.55962 +0.16 0.56356 +0.17 0.56749 +0.18 0.57142 +0.19 0.57535 +0.20 0.57926 +0.21 0.58317 +0.22 0.58706 +0.23 0.59095 +0.24 0.59483 +0.25 0.59871 +0.26 0.60257 +0.27 0.60642 +0.28 0.61026 +0.29 0.61409 +0.30 0.61791 +0.31 0.62172 +0.32 0.62552 +0.33 0.62930 +0.34 0.63307 +0.35 0.63683 +0.36 0.64058 +0.37 0.64431 +0.38 0.64803 +0.39 0.65173 +0.40 0.65542 +0.41 0.65910 +0.42 0.66276 +0.43 0.66640 +0.44 0.67003 +0.45 0.67364 +0.46 0.67724 +0.47 0.68082 +0.48 0.68439 +0.49 0.68793 +0.50 0.69146 +0.51 0.69497 +0.52 0.69847 +0.53 0.70194 +0.54 0.70540 +0.55 0.70884 +0.56 0.71226 +0.57 0.71566 +0.58 0.71904 +0.59 0.72240 +0.60 0.72575 +0.61 0.72907 +0.62 0.73237 +0.63 0.73565 +0.64 0.73891 +0.65 0.74215 +0.66 0.74537 +0.67 0.74857 +0.68 0.75175 +0.69 0.75490 +0.70 0.75804 +0.71 0.76115 +0.72 0.76424

+0.73 0.76730 +0.74 0.77035 +0.75 0.77337 +0.76 0.77637 +0.77 0.77935 +0.78 0.78230 +0.79 0.78524 +0.80 0.78814 +0.81 0.79103 +0.82 0.79389 +0.83 0.79673 +0.84 0.79955 +0.85 0.80234 +0.86 0.80511 +0.87 0.80785 +0.88 0.81057 +0.89 0.81327 +0.90 0.81594 +0.91 0.81859 +0.92 0.82121 +0.93 0.82381 +0.94 0.82639 +0.95 0.82894 +0.96 0.83147 +0.97 0.83398 +0.98 0.83646 +0.99 0.83891 +1.00 0.84134 +1.01 0.84375 +1.02 0.84614 +1.03 0.84849 +1.04 0.85083 +1.05 0.85314 +1.06 0.85543 +1.07 0.85769 +1.08 0.85993 +1.09 0.86214 +1.10 0.86433 +1.11 0.86650 +1.12 0.86864 +1.13 0.87076 +1.14 0.87286 +1.15 0.87493 +1.16 0.87698 +1.17 0.87900 +1.18 0.88100 +1.19 0.88298 +1.20 0.88493 +1.21 0.88686 +1.22 0.88877 +1.23 0.89065 +1.24 0.89251 +1.25 0.89435 +1.26 0.89617 +1.27 0.89796 +1.28 0.89973 +1.29 0.90147 +1.30 0.90320 +1.31 0.90490 +1.32 0.90658 +1.33 0.90824 +1.34 0.90988 +1.35 0.91149 +1.36 0.91309 +1.37 0.91466 +1.38 0.91621 +1.39 0.91774 +1.40 0.91924 +1.41 0.92073 +1.42 0.92220 +1.43 0.92364 +1.44 0.92507 +1.45 0.92647

+1.46 0.92785 +1.47 0.92922 +1.48 0.93056 +1.49 0.93189 +1.50 0.93319 +1.51 0.93448 +1.52 0.93574 +1.53 0.93699 +1.54 0.93822 +1.55 0.93943 +1.56 0.94062 +1.57 0.94179 +1.58 0.94295 +1.59 0.94408 +1.60 0.94520 +1.61 0.94630 +1.62 0.94738 +1.63 0.94845 +1.64 0.94950 +1.65 0.95053 +1.66 0.95154 +1.67 0.95254 +1.68 0.95352 +1.69 0.95449 +1.70 0.95543 +1.71 0.95637 +1.72 0.95728 +1.73 0.95818 +1.74 0.95907 +1.75 0.95994 +1.76 0.96080 +1.77 0.96164 +1.78 0.96246 +1.79 0.96327 +1.80 0.96407 +1.81 0.96485 +1.82 0.96562 +1.83 0.96638 +1.84 0.96712 +1.85 0.96784 +1.86 0.96856 +1.87 0.96926 +1.88 0.96995 +1.89 0.97062 +1.90 0.97128 +1.91 0.97193 +1.92 0.97257 +1.93 0.97320 +1.94 0.97381 +1.95 0.97441 +1.96 0.97500 +1.97 0.97558 +1.98 0.97615 +1.99 0.97670 +2.00 0.97725 +2.01 0.97778 +2.02 0.97831 +2.03 0.97882 +2.04 0.97932 +2.05 0.97982 +2.06 0.98030 +2.07 0.98077 +2.08 0.98124 +2.09 0.98169 +2.10 0.98214 +2.11 0.98257 +2.12 0.98300 +2.13 0.98341 +2.14 0.98382 +2.15 0.98422 +2.16 0.98461 +2.17 0.98500 +2.18 0.98537

+2.19 0.98574 +2.20 0.98610 +2.21 0.98645 +2.22 0.98679 +2.23 0.98713 +2.24 0.98745 +2.25 0.98778 +2.26 0.98809 +2.27 0.98840 +2.28 0.98870 +2.29 0.98899 +2.30 0.98928 +2.31 0.98956 +2.32 0.98983 +2.33 0.99010 +2.34 0.99036 +2.35 0.99061 +2.36 0.99086 +2.37 0.99111 +2.38 0.99134 +2.39 0.99158 +2.40 0.99180 +2.41 0.99202 +2.42 0.99224 +2.43 0.99245 +2.44 0.99266 +2.45 0.99286 +2.46 0.99305 +2.47 0.99324 +2.48 0.99343 +2.49 0.99361 +2.50 0.99379 +2.51 0.99396 +2.52 0.99413 +2.53 0.99430 +2.54 0.99446 +2.55 0.99461 +2.56 0.99477 +2.57 0.99492 +2.58 0.99506 +2.59 0.99520 +2.60 0.99534 +2.61 0.99547 +2.62 0.99560 +2.63 0.99573 +2.64 0.99585 +2.65 0.99598 +2.66 0.99609 +2.67 0.99621 +2.68 0.99632 +2.69 0.99643 +2.70 0.99653 +2.71 0.99664 +2.72 0.99674 +2.73 0.99683 +2.74 0.99693 +2.75 0.99702 +2.76 0.99711 +2.77 0.99720 +2.78 0.99728 +2.79 0.99736 +2.80 0.99744 +2.81 0.99752 +2.82 0.99760 +2.83 0.99767 +2.84 0.99774 +2.85 0.99781 +2.86 0.99788 +2.87 0.99795 +2.88 0.99801 +2.89 0.99807 +2.90 0.99813 +2.91 0.99819

+2.92 0.99825 +2.93 0.99831 +2.94 0.99836 +2.95 0.99841 +2.96 0.99846 +2.97 0.99851 +2.98 0.99856 +2.99 0.99861 +3.00 0.99865 +3.01 0.99869 +3.02 0.99874 +3.03 0.99878 +3.04 0.99882 +3.05 0.99886 +3.06 0.99889 +3.07 0.99893 +3.08 0.99896 +3.09 0.99900 +3.10 0.99903 +3.11 0.99906 +3.12 0.99910 +3.13 0.99913 +3.14 0.99916 +3.15 0.99918 +3.16 0.99921 +3.17 0.99924 +3.18 0.99926 +3.19 0.99929 +3.20 0.99931 +3.21 0.99934 +3.22 0.99936 +3.23 0.99938 +3.24 0.99940 +3.25 0.99942 +3.26 0.99944 +3.27 0.99946 +3.28 0.99948 +3.29 0.99950 +3.30 0.99952 +3.31 0.99953 +3.32 0.99955 +3.33 0.99957 +3.34 0.99958 +3.35 0.99960 +3.36 0.99961 +3.37 0.99962 +3.38 0.99964 +3.39 0.99965 +3.40 0.99966 +3.41 0.99968 +3.42 0.99969 +3.43 0.99970 +3.44 0.99971 +3.45 0.99972 +3.46 0.99973 +3.47 0.99974 +3.48 0.99975 +3.49 0.99976 +3.50 0.99977 +3.51 0.99978 +3.52 0.99978 +3.53 0.99979 +3.54 0.99980 +3.55 0.99981 +3.56 0.99981 +3.57 0.99982 +3.58 0.99983 +3.59 0.99983 +3.60 0.99984 +3.61 0.99985 +3.62 0.99985 +3.63 0.99986 +3.64 0.99986

+3.65 0.99987 +3.66 0.99987 +3.67 0.99988 +3.68 0.99988 +3.69 0.99989 +3.70 0.99989 +3.71 0.99990 +3.72 0.99990 +3.73 0.99990 +3.74 0.99991 +3.75 0.99991 +3.76 0.99992 +3.77 0.99992 +3.78 0.99992 +3.79 0.99992 +3.80 0.99993 +3.81 0.99993 +3.82 0.99993 +3.83 0.99994 +3.84 0.99994 +3.85 0.99994 +3.86 0.99994 +3.87 0.99995 +3.88 0.99995 +3.89 0.99995 +3.90 0.99995 +3.91 0.99995 +3.92 0.99996 +3.93 0.99996 +3.94 0.99996 +3.95 0.99996 +3.96 0.99996 +3.97 0.99996 +3.98 0.99997 +3.99 0.99997 +4.00 0.99997 +4.01 0.99997 +4.02 0.99997 +4.03 0.99997 +4.04 0.99997 +4.05 0.99997 +4.06 0.99998 +4.07 0.99998 +4.08 0.99998 +4.09 0.99998 +4.10 0.99998 +4.11 0.99998 +4.12 0.99998 +4.13 0.99998 +4.14 0.99998 +4.15 0.99998 +4.16 0.99998 +4.17 0.99998 +4.18 0.99999 +4.19 0.99999 +4.20 0.99999 +4.18 0.99999 +4.19 0.99999 +4.20 0.99999 +4.21 0.99999 +4.22 0.99999 +4.23 0.99999 +4.24 0.99999 +4.25 0.99999 +4.26 0.99999 +4.27 0.99999 +4.28 0.99999 +4.29 0.99999 +4.30 0.99999 +4.31 0.99999 +4.32 0.99999 +4.33 0.99999 +4.34 0.99999

Right-sided area: P(Z ≥ z-score) = 1 – Left-sided area.
Interval area: P(a ≤ Z ≤ b) = P(Z ≤ b) – P(Z ≤ a).

Note: To linearly interpolate for “in-between” values, solve

(zhigh – zlow)(Pbetween – Plow) = (zbetween – zlow)(Phigh – Plow)

for either zbetween or Pbetween, whichever required, given the other.


Right-tailed area

T-scores corresponding to selected right-tailed probabilities of the tdf-distribution [Note that, for any fixed df, t-scores > z-scores. As df → ∞, t-scores → z-scores (i.e., last row).] df 0.5 0.25 0.10 0.05 0.025 0.010 0.005 0.0025 0.001 0.0005 0.00025 1 0 1.000 3.078 6.314 12.706 31.821 63.657 127.321 318.309 636.619 1273.239 2 0 0.816 1.886 2.920 4.303 6.965 9.925 14.089 22.327 31.599 44.705 3 0 0.765 1.638 2.353 3.182 4.541 5.841 7.453 10.215 12.924 16.326 4 0 0.741 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.610 10.306 5 0 0.727 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869 7.976 6 0 0.718 1.440 1.943 2.447 3.143 3.707 4.317 5.208 5.959 6.788 7 0 0.711 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408 6.082 8 0 0.706 1.397 1.860 2.306 2.896 3.355 3.833 4.501 5.041 5.617 9 0 0.703 1.383 1.833 2.262 2.821 3.250 3.690 4.297 4.781 5.291 10 0 0.700 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587 5.049 11 0 0.697 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437 4.863 12 0 0.695 1.356 1.782 2.179 2.681 3.055 3.428 3.930 4.318 4.716 13 0 0.694 1.350 1.771 2.160 2.650 3.012 3.372 3.852 4.221 4.597 14 0 0.692 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.140 4.499 15 0 0.691 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073 4.417 16 0 0.690 1.337 1.746 2.120 2.583 2.921 3.252 3.686 4.015 4.346 17 0 0.689 1.333 1.740 2.110 2.567 2.898 3.222 3.646 3.965 4.286 18 0 0.688 1.330 1.734 2.101 2.552 2.878 3.197 3.610 3.922 4.233 19 0 0.688 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883 4.187 20 0 0.687 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.850 4.146 21 0 0.686 1.323 1.721 2.080 2.518 2.831 3.135 3.527 3.819 4.110 22 0 0.686 1.321 1.717 2.074 2.508 2.819 3.119 3.505 3.792 4.077 23 0 0.685 1.319 1.714 2.069 2.500 2.807 3.104 3.485 3.768 4.047 24 0 0.685 1.318 1.711 2.064 2.492 2.797 3.091 3.467 3.745 4.021 25 0 0.684 1.316 1.708 2.060 2.485 2.787 3.078 3.450 3.725 3.996 26 0 0.684 1.315 1.706 2.056 2.479 2.779 3.067 3.435 3.707 3.974 27 0 0.684 1.314 1.703 2.052 2.473 2.771 3.057 3.421 3.690 3.954 28 0 0.683 1.313 1.701 2.048 2.467 2.763 3.047 3.408 3.674 3.935 29 0 0.683 1.311 1.699 2.045 2.462 2.756 3.038 3.396 3.659 3.918 30 0 0.683 1.310 1.697 2.042 2.457 2.750 3.030 3.385 3.646 3.902 31 0 0.682 1.309 1.696 2.040 2.453 2.744 3.022 3.375 3.633 3.887 32 0 0.682 1.309 1.694 2.037 2.449 2.738 3.015 3.365 3.622 3.873 33 0 0.682 1.308 1.692 2.035 2.445 2.733 3.008 3.356 3.611 3.860 34 0 0.682 1.307 1.691 2.032 2.441 2.728 3.002 3.348 3.601 3.848 35 0 0.682 1.306 1.690 2.030 2.438 2.724 2.996 3.340 3.591 3.836 36 0 0.681 1.306 1.688 2.028 2.434 2.719 2.990 3.333 3.582 3.826 37 0 0.681 1.305 1.687 2.026 2.431 2.715 2.985 3.326 3.574 3.815 38 0 0.681 1.304 1.686 2.024 2.429 2.712 2.980 3.319 3.566 3.806 39 0 0.681 1.304 1.685 2.023 2.426 2.708 2.976 3.313 3.558 3.797 40 0 0.681 1.303 1.684 2.021 2.423 2.704 2.971 3.307 3.551 3.788 41 0 0.681 1.303 1.683 2.020 2.421 2.701 2.967 3.301 3.544 3.780 42 0 0.680 1.302 1.682 2.018 2.418 2.698 2.963 3.296 3.538 3.773 43 0 0.680 1.302 1.681 2.017 2.416 2.695 2.959 3.291 3.532 3.765 44 0 0.680 1.301 1.680 2.015 2.414 2.692 2.956 3.286 3.526 3.758 45 0 0.680 1.301 1.679 2.014 2.412 2.690 2.952 3.281 3.520 3.752 46 0 0.680 1.300 1.679 2.013 2.410 2.687 2.949 3.277 3.515 3.746 47 0 0.680 1.300 1.678 2.012 2.408 2.685 2.946 3.273 3.510 3.740 48 0 0.680 1.299 1.677 2.011 2.407 2.682 2.943 3.269 3.505 3.734 49 0 0.680 1.299 1.677 2.010 2.405 2.680 2.940 3.265 3.500 3.728 50 0 0.679 1.299 1.676 2.009 2.403 2.678 2.937 3.261 3.496 3.723


df 0.5 0.25 0.10 0.05 0.025 0.01 0.005 0.0025 0.001 0.0005 0.00025
51 0 0.679 1.298 1.675 2.008 2.402 2.676 2.934 3.258 3.492 3.718
52 0 0.679 1.298 1.675 2.007 2.400 2.674 2.932 3.255 3.488 3.713
53 0 0.679 1.298 1.674 2.006 2.399 2.672 2.929 3.251 3.484 3.709
54 0 0.679 1.297 1.674 2.005 2.397 2.670 2.927 3.248 3.480 3.704
55 0 0.679 1.297 1.673 2.004 2.396 2.668 2.925 3.245 3.476 3.700
56 0 0.679 1.297 1.673 2.003 2.395 2.667 2.923 3.242 3.473 3.696
57 0 0.679 1.297 1.672 2.002 2.394 2.665 2.920 3.239 3.470 3.692
58 0 0.679 1.296 1.672 2.002 2.392 2.663 2.918 3.237 3.466 3.688
59 0 0.679 1.296 1.671 2.001 2.391 2.662 2.916 3.234 3.463 3.684
60 0 0.679 1.296 1.671 2.000 2.390 2.660 2.915 3.232 3.460 3.681
61 0 0.679 1.296 1.670 2.000 2.389 2.659 2.913 3.229 3.457 3.677
62 0 0.678 1.295 1.670 1.999 2.388 2.657 2.911 3.227 3.454 3.674
63 0 0.678 1.295 1.669 1.998 2.387 2.656 2.909 3.225 3.452 3.671
64 0 0.678 1.295 1.669 1.998 2.386 2.655 2.908 3.223 3.449 3.668
65 0 0.678 1.295 1.669 1.997 2.385 2.654 2.906 3.220 3.447 3.665
66 0 0.678 1.295 1.668 1.997 2.384 2.652 2.904 3.218 3.444 3.662
67 0 0.678 1.294 1.668 1.996 2.383 2.651 2.903 3.216 3.442 3.659
68 0 0.678 1.294 1.668 1.995 2.382 2.650 2.902 3.214 3.439 3.656
69 0 0.678 1.294 1.667 1.995 2.382 2.649 2.900 3.213 3.437 3.653
70 0 0.678 1.294 1.667 1.994 2.381 2.648 2.899 3.211 3.435 3.651
71 0 0.678 1.294 1.667 1.994 2.380 2.647 2.897 3.209 3.433 3.648
72 0 0.678 1.293 1.666 1.993 2.379 2.646 2.896 3.207 3.431 3.646
73 0 0.678 1.293 1.666 1.993 2.379 2.645 2.895 3.206 3.429 3.644
74 0 0.678 1.293 1.666 1.993 2.378 2.644 2.894 3.204 3.427 3.641
75 0 0.678 1.293 1.665 1.992 2.377 2.643 2.892 3.202 3.425 3.639
76 0 0.678 1.293 1.665 1.992 2.376 2.642 2.891 3.201 3.423 3.637
77 0 0.678 1.293 1.665 1.991 2.376 2.641 2.890 3.199 3.421 3.635
78 0 0.678 1.292 1.665 1.991 2.375 2.640 2.889 3.198 3.420 3.633
79 0 0.678 1.292 1.664 1.990 2.374 2.640 2.888 3.197 3.418 3.631
80 0 0.678 1.292 1.664 1.990 2.374 2.639 2.887 3.195 3.416 3.629
81 0 0.678 1.292 1.664 1.990 2.373 2.638 2.886 3.194 3.415 3.627
82 0 0.677 1.292 1.664 1.989 2.373 2.637 2.885 3.193 3.413 3.625
83 0 0.677 1.292 1.663 1.989 2.372 2.636 2.884 3.191 3.412 3.623
84 0 0.677 1.292 1.663 1.989 2.372 2.636 2.883 3.190 3.410 3.622
85 0 0.677 1.292 1.663 1.988 2.371 2.635 2.882 3.189 3.409 3.620
86 0 0.677 1.291 1.663 1.988 2.370 2.634 2.881 3.188 3.407 3.618
87 0 0.677 1.291 1.663 1.988 2.370 2.634 2.880 3.187 3.406 3.617
88 0 0.677 1.291 1.662 1.987 2.369 2.633 2.880 3.185 3.405 3.615
89 0 0.677 1.291 1.662 1.987 2.369 2.632 2.879 3.184 3.403 3.613
90 0 0.677 1.291 1.662 1.987 2.368 2.632 2.878 3.183 3.402 3.612
91 0 0.677 1.291 1.662 1.986 2.368 2.631 2.877 3.182 3.401 3.610
92 0 0.677 1.291 1.662 1.986 2.368 2.630 2.876 3.181 3.399 3.609
93 0 0.677 1.291 1.661 1.986 2.367 2.630 2.876 3.180 3.398 3.607
94 0 0.677 1.291 1.661 1.986 2.367 2.629 2.875 3.179 3.397 3.606
95 0 0.677 1.291 1.661 1.985 2.366 2.629 2.874 3.178 3.396 3.605
96 0 0.677 1.290 1.661 1.985 2.366 2.628 2.873 3.177 3.395 3.603
97 0 0.677 1.290 1.661 1.985 2.365 2.627 2.873 3.176 3.394 3.602
98 0 0.677 1.290 1.661 1.984 2.365 2.627 2.872 3.175 3.393 3.601
99 0 0.677 1.290 1.660 1.984 2.365 2.626 2.871 3.175 3.392 3.600

100 0 0.677 1.290 1.660 1.984 2.364 2.626 2.871 3.174 3.390 3.598
120 0 0.677 1.289 1.658 1.980 2.358 2.617 2.860 3.160 3.373 3.578
140 0 0.676 1.288 1.656 1.977 2.353 2.611 2.852 3.149 3.361 3.564
160 0 0.676 1.287 1.654 1.975 2.350 2.607 2.846 3.142 3.352 3.553
180 0 0.676 1.286 1.653 1.973 2.347 2.603 2.842 3.136 3.345 3.545
200 0 0.676 1.286 1.653 1.972 2.345 2.601 2.839 3.131 3.340 3.539
∞ 0 0.674 1.282 1.645 1.960 2.326 2.576 2.807 3.090 3.291 3.481
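For readers working in R (as these notes do), the entries in this table can be reproduced directly with qt(); a right-tailed probability p corresponds to the (1 − p) quantile. A minimal sketch, assuming only base R:

qt(1 - 0.05, df = 10)    # 1.812  (df = 10, right-tailed area 0.05)
qt(1 - 0.025, df = 20)   # 2.086  (df = 20, right-tailed area 0.025)
qt(1 - c(0.10, 0.05, 0.025, 0.010, 0.005), df = 5)   # part of the df = 5 row
qt(1 - 0.025, df = Inf)  # 1.960 = qnorm(0.975); the t-score approaches the z-score as df grows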


[Figure: χ²-distribution density curve with shaded right-tailed area; horizontal axis: χ²-score]

Chi-squared scores corresponding to selected right-tailed probabilities of the χ²-distribution with df degrees of freedom

df 1 0.5 0.25 0.10 0.05 0.025 0.010 0.005 0.0025 0.0010 0.0005 0.00025
1 0 0.455 1.323 2.706 3.841 5.024 6.635 7.879 9.141 10.828 12.116 13.412
2 0 1.386 2.773 4.605 5.991 7.378 9.210 10.597 11.983 13.816 15.202 16.588
3 0 2.366 4.108 6.251 7.815 9.348 11.345 12.838 14.320 16.266 17.730 19.188
4 0 3.357 5.385 7.779 9.488 11.143 13.277 14.860 16.424 18.467 19.997 21.517
5 0 4.351 6.626 9.236 11.070 12.833 15.086 16.750 18.386 20.515 22.105 23.681
6 0 5.348 7.841 10.645 12.592 14.449 16.812 18.548 20.249 22.458 24.103 25.730
7 0 6.346 9.037 12.017 14.067 16.013 18.475 20.278 22.040 24.322 26.018 27.692
8 0 7.344 10.219 13.362 15.507 17.535 20.090 21.955 23.774 26.124 27.868 29.587
9 0 8.343 11.389 14.684 16.919 19.023 21.666 23.589 25.462 27.877 29.666 31.427
10 0 9.342 12.549 15.987 18.307 20.483 23.209 25.188 27.112 29.588 31.420 33.221
11 0 10.341 13.701 17.275 19.675 21.920 24.725 26.757 28.729 31.264 33.137 34.977
12 0 11.340 14.845 18.549 21.026 23.337 26.217 28.300 30.318 32.909 34.821 36.698
13 0 12.340 15.984 19.812 22.362 24.736 27.688 29.819 31.883 34.528 36.478 38.390
14 0 13.339 17.117 21.064 23.685 26.119 29.141 31.319 33.426 36.123 38.109 40.056
15 0 14.339 18.245 22.307 24.996 27.488 30.578 32.801 34.950 37.697 39.719 41.699
16 0 15.338 19.369 23.542 26.296 28.845 32.000 34.267 36.456 39.252 41.308 43.321
17 0 16.338 20.489 24.769 27.587 30.191 33.409 35.718 37.946 40.790 42.879 44.923
18 0 17.338 21.605 25.989 28.869 31.526 34.805 37.156 39.422 42.312 44.434 46.508
19 0 18.338 22.718 27.204 30.144 32.852 36.191 38.582 40.885 43.820 45.973 48.077
20 0 19.337 23.828 28.412 31.410 34.170 37.566 39.997 42.336 45.315 47.498 49.632
21 0 20.337 24.935 29.615 32.671 35.479 38.932 41.401 43.775 46.797 49.011 51.173
22 0 21.337 26.039 30.813 33.924 36.781 40.289 42.796 45.204 48.268 50.511 52.701
23 0 22.337 27.141 32.007 35.172 38.076 41.638 44.181 46.623 49.728 52.000 54.217
24 0 23.337 28.241 33.196 36.415 39.364 42.980 45.559 48.034 51.179 53.479 55.722
25 0 24.337 29.339 34.382 37.652 40.646 44.314 46.928 49.435 52.620 54.947 57.217
26 0 25.336 30.435 35.563 38.885 41.923 45.642 48.290 50.829 54.052 56.407 58.702
27 0 26.336 31.528 36.741 40.113 43.195 46.963 49.645 52.215 55.476 57.858 60.178
28 0 27.336 32.620 37.916 41.337 44.461 48.278 50.993 53.594 56.892 59.300 61.645
29 0 28.336 33.711 39.087 42.557 45.722 49.588 52.336 54.967 58.301 60.735 63.104
30 0 29.336 34.800 40.256 43.773 46.979 50.892 53.672 56.332 59.703 62.162 64.555
31 0 30.336 35.887 41.422 44.985 48.232 52.191 55.003 57.692 61.098 63.582 65.999
32 0 31.336 36.973 42.585 46.194 49.480 53.486 56.328 59.046 62.487 64.995 67.435
33 0 32.336 38.058 43.745 47.400 50.725 54.776 57.648 60.395 63.870 66.403 68.865
34 0 33.336 39.141 44.903 48.602 51.966 56.061 58.964 61.738 65.247 67.803 70.289
35 0 34.336 40.223 46.059 49.802 53.203 57.342 60.275 63.076 66.619 69.199 71.706
36 0 35.336 41.304 47.212 50.998 54.437 58.619 61.581 64.410 67.985 70.588 73.118
37 0 36.336 42.383 48.363 52.192 55.668 59.893 62.883 65.739 69.346 71.972 74.523
38 0 37.335 43.462 49.513 53.384 56.896 61.162 64.181 67.063 70.703 73.351 75.924
39 0 38.335 44.539 50.660 54.572 58.120 62.428 65.476 68.383 72.055 74.725 77.319
40 0 39.335 45.616 51.805 55.758 59.342 63.691 66.766 69.699 73.402 76.095 78.709
41 0 40.335 46.692 52.949 56.942 60.561 64.950 68.053 71.011 74.745 77.459 80.094
42 0 41.335 47.766 54.090 58.124 61.777 66.206 69.336 72.320 76.084 78.820 81.475
43 0 42.335 48.840 55.230 59.304 62.990 67.459 70.616 73.624 77.419 80.176 82.851
44 0 43.335 49.913 56.369 60.481 64.201 68.710 71.893 74.925 78.750 81.528 84.223
45 0 44.335 50.985 57.505 61.656 65.410 69.957 73.166 76.223 80.077 82.876 85.591
46 0 45.335 52.056 58.641 62.830 66.617 71.201 74.437 77.517 81.400 84.220 86.954
47 0 46.335 53.127 59.774 64.001 67.821 72.443 75.704 78.809 82.720 85.560 88.314
48 0 47.335 54.196 60.907 65.171 69.023 73.683 76.969 80.097 84.037 86.897 89.670
49 0 48.335 55.265 62.038 66.339 70.222 74.919 78.231 81.382 85.351 88.231 91.022
50 0 49.335 56.334 63.167 67.505 71.420 76.154 79.490 82.664 86.661 89.561 92.371
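Similarly, the chi-squared critical values above can be reproduced with qchisq() in base R; a brief sketch:

qchisq(1 - 0.05, df = 1)    # 3.841
qchisq(1 - 0.05, df = 10)   # 18.307
qchisq(1 - 0.010, df = 20)  # 37.566
qchisq(1 - 0.5, df = 3)     # 2.366  (the 0.5 column gives the median of the chi-squared distribution)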


Identifying Geographic Disparities in the Early Detection of Breast Cancer Using a Geographic Information System

Jane A. McElroy, PhD, Patrick L. Remington, MD, Ronald E. Gangnon, PhD, Luxme Hariharan, LeAnn D. Andersen, MS

ORIGINAL RESEARCH

Suggested citation for this article: McElroy JA, Remington PL, Gangnon RE, Hariharan L, Andersen LD. Identifying geographic disparities in the early detection of breast cancer using a geographic information system. Prev Chronic Dis [serial online] 2006 Jan [date cited]. Available from: URL: http://www.cdc.gov/pcd/issues/2006/jan/05_0065.htm.

PEER REVIEWED

Abstract

Introduction
Identifying communities with lower rates of mammography screening is a critical step to providing targeted screening programs; however, population-based data necessary for identifying these geographic areas are limited. This study presents methods to identify geographic disparities in the early detection of breast cancer.

Methods
Data for all women residing in Dane County, Wisconsin, at the time of their breast cancer diagnosis from 1981 through 2000 (N = 4769) were obtained from the Wisconsin Cancer Reporting System (Wisconsin’s tumor registry) by ZIP code of residence. Hierarchical logistic regression models for disease mapping were used to identify geographic differences in the early detection of breast cancer.

Results
The percentage of breast cancer cases diagnosed in situ (excluding lobular carcinoma in situ) increased from 1.3% in 1981 to 11.9% in 2000. This increase, reflecting increasing mammography use, occurred sooner in Dane County than in Wisconsin as a whole. From 1981 through 1985, the proportion of breast cancer diagnosed in situ in Dane County was universally low (2%–3%). From 1986 through 1990, urban and suburban ZIP codes had significantly higher rates (10%) compared with rural ZIP codes (5%). From 1991 through 1995, mammography screening had increased in rural ZIP codes (7% of breast cancer diagnosed in situ). From 1996 through 2000, mammography use was fairly homogeneous across the entire county (13%–14% of breast cancer diagnosed in situ).

Conclusion
The percentage of breast cancer cases diagnosed in situ increased in the state and in all areas of Dane County from 1981 through 2000. Visual display of the geographic differences in the early detection of breast cancer demonstrates the diffusion of mammography use across the county over the 20-year period.

Introduction

Geographic differences in health status and use of health services have been reported in the United States and internationally (1), including stage of breast cancer incidence and mammography screening practices (2). Early diagnosis of breast cancer through mammography screening improves breast cancer treatment options and may reduce mortality (3,4), yet many women in the United States are not routinely screened according to recommended guidelines (5).




Needs assessment to account for noncompliance with breast cancer screening recommendations has focused on personal factors related to participation, including the barriers women perceive (6), the role of physicians (7), and the role of services such as mobile vans (8) and insurance coverage (9). Evaluations of the effectiveness of interventions directed at patients, communities, and special populations have also provided important information about mammography use (10). However, little attention has been paid to geographic location, except to focus on inner-city and rural disparities in mammography use (11,12).

The purpose of this study was to identify geographic disparities in the early detection of breast cancer using cancer registry data. This information can be used to identify areas where increased mammography screening is needed and to understand the diffusion of innovation in an urban or a rural setting.

Cancer registry data were used for these analyses. Validity of the use of these data rests on the correlation between the percentage of breast cancer diagnosed in situ and mammography screening rates; breast cancer in situ (BCIS) (excluding lobular carcinoma in situ [13-15]) is the earliest stage of localized breast cancer and is diagnosed almost exclusively by mammography (16). In the 1970s, before widespread use of mammography, BCIS represented less than 2% of breast cancer cases in the United States (15). A nationwide community-based breast cancer screening program showed that among populations of women screened regularly, the stage distribution of diagnosed cases was skewed to earlier stages, with BCIS accounting for more than 35% (17). Trends in the relative frequency of BCIS are closely correlated with trends in mammography use (reflected in data from surveys of mammography providers in Wisconsin) and with trends in self-reported mammography use (reflected in data from the Behavioral Risk Factor Surveillance System) (18-20).

In Wisconsin, either a physician can refer a patient for screening or a woman can self-refer. More than 60% of the mammography imaging facilities in the state accept self-referrals (21). Since 1989, Wisconsin state law has mandated health insurance coverage for women aged 45 to 65 years, and Medicare covers mammography screening for eligible women (22). In Wisconsin, the Department of Health and Family Services provides a toll-free number through which women can contact more than 400 service providers (22). Finally, several programs such as the Wisconsin Well Woman Program, which is funded by the Centers for Disease Control and Prevention, provide free or low-cost screening to underserved women.

Methods

Study population

All female breast cancer cases diagnosed from 1981 through 2000 were identified by the Wisconsin Cancer Reporting System (WCRS). The WCRS was established in 1976 as mandated by Wisconsin state statute to collect cancer incidence data on Wisconsin residents. In compliance with state law, hospitals and physicians are required to report cancer cases to the WCRS (within 6 months of initial diagnosis for hospitals and within 3 months for physicians, through their clinics). Variables obtained from the WCRS included histology (International Classification of Diseases for Oncology, 2nd Edition [ICD-O-2] codes), stage (0 = in situ, 1 = localized, 2–5 = regional, 7 = distant, and 9 = unstaged), year of diagnosis, county of residence at time of diagnosis, and number of incident cases in 5-year age groups by ZIP code for all breast cancer cases among women. ZIP codes and counties of residence, self-reported by the women with diagnosed breast cancer, are provided to the WCRS. Only ZIP codes verified for Dane County by the U.S. Postal Service were included in the data set (n = 37). The ZIP code was the smallest area unit available for WCRS incidence data.

Study location and characteristics

Dane County is located in south central Wisconsin. The population of the county in 1990 was 367,085, with 20% of the population living in rural areas (23); approximately 190,000 people lived in Madison, the second largest city in Wisconsin and home to the University of Wisconsin. The 37 unique ZIP codes in Dane County incorporate 60 cities, villages, and towns (Figure 1).

Data analysis

We determined the percentage of breast cancer cases diagnosed as BCIS in Wisconsin and Dane County over time and by ZIP codes for Dane County. For ZIP codes that encompassed areas beyond the borders of Dane County, only women who reported their county of residence as Dane were included in the analysis.


The percentage of BCIS by ZIP code was mapped using 1996 ZIP code boundary files. For 17 breast cancer cases in which the women’s ZIP codes no longer existed, each ZIP code was reassigned to the ZIP code in the same location.

We used analytic methods to estimate rates of early breast cancer detection by ZIP code. Because of small numbers of BCIS cases in each ZIP code, a well-characterized statistical method was used to stabilize the prediction of rates by borrowing information from neighboring ZIP codes (24). This is done by using Bayesian hierarchical logistic regression models to estimate ZIP-code–specific effects on percentage of breast cancer cases diagnosed in situ (excluding lobular carcinoma in situ). ZIP-code–specific effects (log odds ratios) were modeled as a Gaussian conditional autoregression (CAR) (25). Using the CAR model, one assumes that the log odds ratio for one ZIP code is influenced by the average log odds ratio for its neighbors. The conditional standard deviation of the CAR model, the free parameter which controls the smoothness of the map, was given a uniform prior (24).

For each time period, two CAR models were fitted. The first model included age group as the only covariate. Age group effects were modeled using an exchangeable normal prior. The standard deviation of this distribution was given a uniform prior. The second model included additional ZIP-code–level covariates. Potential covariates were urban or rural status, education, median household income, marital status, employment status, and commuting time from the Summary Tape File 3 of the 1990 U.S. Census of Population and Housing (23). Census data from 1990 were used because 1990 is the midpoint of the years included in these analyses (1981–2000). Urban or rural status was defined as percentage of women living in each of the four census classifications: urban inside urbanized area, urban outside of urbanized area, rural farm, and rural nonfarm for each ZIP code. Education was defined as percentage of women in each ZIP code aged 25 years and older with less than a high school diploma. Median household income for each ZIP code was based on self-reported income. Marital status was defined as women aged 25 years and older in each ZIP code who had never been married. Employment status was defined as percentage of women aged 16 years and older in each ZIP code who worked in 1989. Full-time employment variable was defined as percentage of women 25 years and older in each ZIP code who worked at least 40 hours per week. Commuting time was divided into five categories of percentage of female workers in each ZIP code: worked at home, commuted 1 to 14 minutes, commuted 15 to 29 minutes, commuted 30 to 44 minutes, and commuted 45 minutes or more. Age was defined as age at diagnosis. These potential covariates were initially screened using forward stepwise logistic regression models, which included ZIP code as an exchangeable (nonspatially structured) random effect. Covariates included in the best model selected using Schwarz’s Bayesian Information Criterion (BIC) (26) were used in the second covariate-adjusted model. The covariate effects and the intercept were given posterior priors.
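[Editor's note for readers of these lecture notes: the covariate-screening idea just described (forward stepwise logistic regression with a BIC penalty) can be sketched in R. The authors used SAS and WinBUGS; the lines below are only an illustration of the technique, they omit the ZIP-code random effect, and the data frame and variable names (zip_data, bcis, cases, and the covariates) are hypothetical.]

# Illustration only: BIC-penalized forward stepwise selection for a
# ZIP-code-level logistic model; all names are hypothetical.
null_fit <- glm(cbind(bcis, cases - bcis) ~ 1, family = binomial, data = zip_data)
step(null_fit,
     scope = ~ urban_pct + low_edu_pct + median_income + never_married_pct +
               employed_pct + commute_cat,
     direction = "forward",
     k = log(nrow(zip_data)))   # k = log(n) makes step() use the BIC penalty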

Posterior estimates of the age-adjusted percentage of BCIS for each ZIP code in each time period were obtained from the CAR model. Posterior medians were used as point estimates of the parameters, and 95% posterior credible intervals were obtained. Analyses were performed using WinBUGS software (27).


Figure 1. Map of Dane County, Wisconsin, showing capital city of Madison, major lakes, active mammogram facilities, and percentage of area classified as urban by ZIP code, using 1996 ZIP code boundaries and 1990 census data. Inset map shows location of Dane County within the state.


Covariate screening was performed using SAS software, version 8 (SAS Institute Inc, Cary, NC). ZIP-code–specific estimates were mapped using ArcView 3.2 software (Environmental Systems Research Institute, Redlands, Calif) and 1996 ZIP code boundary files to display the data.

As an empirical check on our mapping, we fitted a regression model to the BCIS rates by ZIP code. The dependent variable was BCIS rates (using the posterior estimates of age-adjusted percentage of BCIS), and the independent variable in the model was linear distance from the University of Wisconsin Comprehensive Cancer Center (UWCCC), located in Madison, to the centroid of each ZIP code.
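[Editor's note: this empirical check is an ordinary simple linear regression; in R it might look like the following sketch. The data frame zip_estimates and its variables are hypothetical and are not the authors' code.]

# Hypothetical sketch: regress each ZIP code's smoothed, age-adjusted %BCIS
# on the straight-line distance (km) from its centroid to the UWCCC.
fit <- lm(bcis_pct ~ distance_km, data = zip_estimates)
summary(fit)   # a significantly negative slope corresponds to the inverse relationship reported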

Results

A total of 4769 breast cancer cases were reported in Dane County from 1981 through 2000: 825 from 1981 through 1985, 1119 from 1986 through 1990, 1239 from 1991 through 1995, and 1586 from 1996 through 2000. Percentage of cases in situ varied by age group from a high of 18% among women aged 45 to 49 years to a low of 0% among women aged 20 to 24 years. From the mid 1980s, the age group most frequently diagnosed with BCIS was women aged 45 to 49. In contrast, women aged 20 to 34 and older than 84 were the least often (<2%) diagnosed with BCIS (data not shown). Based on the 1990 U.S. census, the total female population (aged 18 years and older) in Dane County was 145,974; 60% of the female population had more than a high school degree, and 15% of the female population aged 25 and older had never married.

In Dane County, the percentage of BCIS increased from 1.3% in 1981 to 11.9% in 2000. For the state, the percentage of BCIS increased from 1.5% in 1981 to 12.8% in 2000. From 1981 to 1993, Dane County had a higher percentage of BCIS diagnosis than the state as a whole. By the mid-1990s, the percentage of BCIS among breast cancer cases in Dane County was similar to the percentage in the state (Figure 2). Similar results are seen when mapping the observed data (maps not shown).

Figure 3 shows model-based estimates of the age-adjusted percentage of BCIS diagnosis by ZIP code in Dane County during four 5-year periods. These maps demonstrate the increase in the percentage of cases diagnosed as BCIS noted in Figure 2. These maps also show that the increase in the percentage of BCIS was not uniform across Dane County. From 1981 through 1985, the entire county had uniformly low rates of BCIS (2%–3%). From 1986 through 1990, urban ZIP codes had markedly higher rates of BCIS (approximately 12%) compared with rural ZIP codes (approximately 5%).


Figure 2. Smoothed trends in percentage of breast cancer cases diagnosed in situ (excluding lobular carcinoma in situ), Dane County, Wisconsin, and Wisconsin, 1981–2000. Data point for Dane County, 1980, was estimated from Andersen et al (28).

Figure 3. Model-based estimates of age-adjusted percentage of breast cancer cases diagnosed in situ during four 5-year periods, by ZIP code, Dane County, Wisconsin, 1981–2000. BCIS indicates breast cancer in situ.


From 1991 through 1995, use of mammography screening had begun to increase in the rural ZIP codes (with a 7% rate of BCIS), although the rates of BCIS remained higher in urban ZIP codes (12%). From 1996 through 2000, mammography screening was fairly universal across the county, with BCIS rates of 13% to 14%. Similar patterns were observed from models that adjusted for additional covariates of marital status and education (data not shown).

From 1981 through 1985, there was no significant relationship between distance from UWCCC and the rate of BCIS (P = .27). From 1986 through 1990 and from 1991 through 1995, there was strong evidence of an inverse relationship between distance from UWCCC and the rate of BCIS (i.e., the closer to UWCCC, the higher the BCIS rate [P < .001] for both periods). From 1996 through 2000, there was a nonsignificant inverse relationship between distance from UWCCC and the rate of BCIS (P = .07).

Discussion

The frequency of BCIS diagnosis increased substantially in Wisconsin and in Dane County from 1981 through 2000. This increase in percentage of BCIS among diagnosed breast cancer cases is consistent with increases in self-reported mammography use, Wisconsin Medicare claims for mammography, and the number of medical imaging centers in Wisconsin (21). However, progress in mammography screening was not uniform across Dane County, and this lack of uniformity represents a classic case of diffusion of innovation. Early adopters of mammography use lived in and near the city of Madison. We can speculate that Madison embodies one characteristic that accelerates the diffusion process: namely, a more highly educated population living in a university community with a strong medical presence. One predictor of mammography use is education: women who are more educated are more likely to ask their physician for a referral or to self-refer (29), and the strongest predictor of mammography use is physician referral (30). Furthermore, physicians are more likely to have chosen to live in the Madison area instead of a more rural location because they value the opportunity for regular contact with the medical school and the medical community (31). Consequently, a greater number of interpersonal networks and more information exchange among physicians about adoption of this innovation might have occurred earlier in the Madison medical community than in the more rural areas of the county (32).

Although median household income by ZIP code was not a predictor of mammography use in our study, the amount of disposable income by individuals, which is not captured by this variable, might also have been an important factor for early adopters (33,34). In a national study of mammography use, income was a significant predictor of repeat screening in 1987 but not in 1990 (35). In the mid-1980s, few insurance plans covered mammography screening. Therefore, women of higher socioeconomic status (SES) would have been more likely to be able to pay the cost of the mammogram. Efforts to reduce costs, such as a 1987 statewide promotional campaign sponsored by the American Cancer Society, still required a $50 copay from women who were able to self-refer for a mammogram (36).

As the use of this technology diffused outward, increasing numbers of women living in suburban and rural areas surrounding Madison elected to get a mammogram. From 1996 through 2000, the geographic disparity in mammography use was muted, although the eastern corridor of Dane County still had slightly lower rates of BCIS than other parts of the county. The reasons for persistent disparity in this region of Dane County are unclear: it is unlikely to be because of proximity to mammography screening facilities, nor are the ZIP-code–level SES measures such as percentage unemployed, household income, percentage below poverty level, or education level statistically different from the western corridor of Dane County.

Differences in the trends of early detection of breast cancer within Dane County suggest that progress in mammography screening has not been uniform across the county. From 1996 through 2000, while more than 14% of age-adjusted breast cancer cases were diagnosed as BCIS in Madison, fewer than 6% of age-adjusted breast cancer cases were diagnosed as BCIS in a few outlying and more rural areas of Dane County, reflecting lower mammography use by residents in this area. The results of an earlier analysis of these data were shared with local health department staff in rural Dane County who were working to increase early detection efforts through outreach education and referrals to providers. As suggested by Andersen et al, strategies to improve mammography use include improving access to primary care physicians, increasing the number of mammography facilities located in rural areas, and increasing outreach efforts by a network of public health professionals promoting screening in their community (28).


In addition, pointing out the variations in care may lead to improvements, since the first step toward change is identifying a problem. With identification of particular areas of need, resources can be garnered toward alleviating the disparity.

Persistent disparities in mammography use after adjusting for community level of educational attainment and marital status were found. Other studies have found that patients with cancer living in census tracts with lower median levels of education attainment are diagnosed in later disease stages than are patients in tracts with higher median levels of education (29). Studies have also shown that one predictor for getting a mammogram is being married (37).

This study demonstrates the use of percentage of BCIS as a tool for comparing population-based mammography screening rates in different geographic areas. Using cancer incidence data to monitor population-based rates of breast cancer screening is possible throughout the nation, because data from population-based cancer registries are now widely available, often by ZIP code or census tract. This method permits comparison of mammography screening rates among geographic areas smaller than areas used in many previous studies of geographic variation in the early detection of breast cancer (2).

The method described in this article can be used to complement other ways to assess the quality of health care in communities, such as the Health Plan Employer Data and Information Set (HEDIS), created by the National Committee for Quality Assurance. HEDIS addresses overall rates in managed care but does not include the underinsured or fee-for-service populations particularly at risk for inadequate screening (34). Cancer registry data are population based; therefore, using cancer registry data is not only effective but also economical and efficient for outreach specialists and health providers.

A potential weakness in this method is the representativeness of the statewide tumor registry. However, the WCRS has been evaluated by the North American Association of Central Cancer Registries and was given its gold standard for quality, completeness, and timeliness in 1995 and 1996, the first 2 years this standard was recognized (38). Completeness estimates are a general measure of accuracy. The WCRS participated in national audits that measured completeness in 1987, 1992, and 1996 as well as one formal study in 1982. Overall, the quality of the data improved slightly after 1994 when clinics and neighboring state data-sharing agreements were implemented (oral communication, Laura Stephenson, WCRS, July 2005). In addition, the tumor registry has used standard methods for classifying tumor stage (e.g., in situ) throughout the entire period of the study. Incidence data from data sources of lesser quality or completeness than the WCRS would need to be carefully evaluated for use in this type of analysis.

Another limitation of this type of analysis is our use of BCIS as a proxy for mammography screening practices. Undoubtedly, some diagnoses of BCIS result from diagnostic mammograms, but reported use of screening mammograms by individuals and medical facilities correlates strongly with percentage of BCIS over time, particularly ductal carcinoma in situ (18-20). Furthermore, we chose to exclude lobular carcinoma in situ from our BCIS category because this condition is often opportunistic (13-15).

A third limitation, which would be found in any type of geographic analysis, rests on the accuracy of the assignment of participants to the proper location. For area analysis (e.g., ZIP code, county), this legitimate concern is ameliorated by using tools to check ZIP codes and county assignments for correctness. For this study, women diagnosed with breast cancer provided their addresses, including county of residence, to their medical facilities. These addresses were forwarded to the WCRS, where quality-control checks were implemented to validate ZIP code and county assignments. For example, lists of ZIP codes and their county codes were cross-referenced to the ZIP codes and county codes of the addresses provided by the women diagnosed with breast cancer. Inaccuracies were corrected by the WCRS (oral communication, Laura Stephenson, WCRS, January 2005).

Although there has been significant improvement in breast cancer screening across the state and county, this study demonstrates that the improvement has not been uniform. The maps clearly indicate for program directors and policy makers the areas where further outreach and research should be conducted. More specifically, this type of analysis can be used to identify specific areas (such as ZIP codes) within a community (such as a county) with varying rates of early-stage breast cancer.


Using this method, public health professionals can provide population-level data to all health care providers to target interventions to improve the early detection of breast cancer in other counties in Wisconsin and other states. Finally, this type of analysis is useful for comprehensive cancer control efforts and can be conducted for other cancers with effective screening methods, such as colorectal cancer.

Acknowledgments

The authors are grateful to Dr Larry Hanrahan and Mark Bunge for advice and Laura Stephenson of the WCRS for assistance with data.

This study was supported by National Cancer Institute grant U01CA82004.

Author Information

Corresponding Author: Jane A. McElroy, PhD, Comprehensive Cancer Center, 610 Walnut St, 307 WARF, Madison, WI 53726. Telephone: 608-265-8780. E-mail: [email protected].

Author Affiliations: Patrick L. Remington, MD, Comprehensive Cancer Center and Department of Population Health Sciences, University of Wisconsin, Madison, Wis; Ronald E. Gangnon, PhD, Department of Population Health Sciences and Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wis; Luxme Hariharan, Department of Molecular Biology, University of Wisconsin, Madison, Wis; LeAnn D. Andersen, MS, Department of Population Health Sciences, University of Wisconsin, Madison, Wis.

References

1. Coughlin SS, Uhler RJ, Bobo JK, Caplan L. Breast cancer screening practices among women in the United States, 2000. Cancer Causes Control 2004;15:159-70.
2. Roche LM, Skinner R, Weinstein RB. Use of a geographic information system to identify and characterize areas with high proportions of distant stage breast cancer. J Public Health Manag Pract 2002;8:26-32.
3. Lee CH. Screening mammography: proven benefit, continued controversy. Radiol Clin North Am 2002;40:395-407.
4. Nystrom L, Andersson I, Bjurstam N, Frisell J, Nordenskjold B, Rutqvist LE. Long-term effects of mammography screening: updated overview of the Swedish randomised trials. Lancet 2002;359:909-19.
5. Mandelblatt J, Saha S, Teutsch S, Hoerger T, Siu AL, Atkins D, et al. The cost-effectiveness of screening mammography beyond age 65 years: a systematic review for the U.S. Preventive Services Task Force. Ann Intern Med 2003;139:835-42.
6. Rimer BK. Understanding the acceptance of mammography by women. Annals of Behavioral Medicine 1992;14:197-203.
7. Fox SA, Stein JA. The effect of physician-patient communication on mammography utilization by different ethnic groups. Med Care 1991;29:1065-82.
8. Haiart DC, McKenzie L, Henderson J, Pollock W, McQueen DV, Roberts MM, et al. Mobile breast screening: factors affecting uptake, efforts to increase response and acceptability. Public Health 1990;104:239-47.
9. Thompson GB, Kessler LG, Boss LP. Breast cancer screening legislation in the United States. Am J Public Health 1989;79:1541-3.
10. Zapka JG, Harris DR, Hosmer D, Costanza ME, Mas E, Barth R. Effect of a community health center intervention on breast cancer screening among Hispanic American women. Health Serv Res 1993;28:223-35.
11. Breen N, Figueroa JB. Stage of breast and cervical cancer diagnosis in disadvantaged neighborhoods: a prevention policy perspective. Am J Prev Med 1996;12:319-26.
12. Andersen MR, Yasui Y, Meischke H, Kuniyuki A, Etzioni R, Urban N. The effectiveness of mammography promotion by volunteers in rural communities. Am J Prev Med 2000;18:199-207.
13. Li CI, Anderson BO, Daling JR, Moe RE. Changing incidence of lobular carcinoma in situ of the breast. Breast Cancer Res Treat 2002;75:259-68.
14. Millikan R, Dressler L, Geradts J, Graham M. The need for epidemiologic studies of in-situ carcinoma of the breast. Breast Cancer Res Treat 1995;35:65-77.
15. Ernster VL, Barclay J, Kerlikowske K, Grady D, Henderson C. Incidence of and treatment for ductal carcinoma in situ of the breast. JAMA 1996;275:913-8.
16. Claus EB, Stowe M, Carter D. Breast carcinoma in situ: risk factors and screening patterns. J Natl Cancer Inst 2001;93:1811-7.
17. May DS, Lee NC, Richardson LC, Giustozzi AG, Bobo JK. Mammography and breast cancer detection by race and Hispanic ethnicity: results from a national program (United States). Cancer Causes Control 2000;11:697-705.


18. Lantz P, Bunge M, Remington PL. Trends in mammography in Wisconsin. Wis Med J 1990;89:281-2.
19. Lantz PM, Remington PL, Newcomb PA. Mammography screening and increased incidence of breast cancer in Wisconsin. J Natl Cancer Inst 1991;83:1540-6.

20. Bush DS, Remington PL, Reeves M, Phillips JL. In situ breast cancer correlates with mammography use, Wisconsin: 1980-1992. Wis Med J 1994;93:483-4.
21. Propeck PA, Scanlan KA. Breast imaging trends in Wisconsin. WMJ 2000;99:42-6.
22. Fowler BA. Variability in mammography screening legislation across the states. J Womens Health Gend Based Med 2000;9:175-84.
23. U.S. Census Bureau. 1990 Summary Tape File 3 (STF 3) - sample data. Census of Population and Housing. Washington (DC): American Factfinder; 1990. Available from: URL: http://www.census.gov/main/www/cen1990.html.
24. Gelman A. Prior distributions for variance parameters in hierarchical models. New York: Columbia University; 2004.
25. Besag J, York J, Mollié A. Bayesian image restoration with applications in spatial statistics. Annals of the Institute of Mathematical Statistics 1991;43:1-20.
26. Schwarz G. Estimating the dimension of a model. Annals of Statistics 1978;6:461-4.
27. Spiegelhalter D, Thomas A, Best N, Lunn D. WinBUGS: Bayesian inference using Gibbs Sampling. London (UK): MRC Biostatistics Unit; 2004.
28. Andersen LD, Remington PL, Trentham-Dietz A, Robert S. Community trends in the early detection of breast cancer in Wisconsin, 1980-1998. Am J Prev Med 2004;26:51-5.
29. Wells BL, Horm JW. Stage at diagnosis in breast cancer: race and socioeconomic factors. Am J Public Health 1992;82:1383-5.
30. Simon MS, Gimotty PA, Coombs J, McBride S, Moncrease A, Burack RC. Factors affecting participation in a mammography screening program among members of an urban Detroit health maintenance organization. Cancer Detect Prev 1998;22:30-8.
31. Cooper JK, Heald K, Samuels M, Coleman S. Rural or urban practice: factors influencing the location decision of primary care physicians. Inquiry 1975;12:18-25.
32. Rogers EM. Diffusion of innovations. New York (NY): Free Press; 2003.
33. Calle EE, Flanders WD, Thun MJ, Martin LM. Demographic predictors of mammography and Pap smear screening in US women. Am J Public Health 1993;83:53-60.
34. Bradley CJ, Given CW, Roberts C. Disparities in cancer diagnosis and survival. Cancer 2001;91:178-88.
35. Zapka JG, Hosmer D, Costanza ME, Harris DR, Stoddard A. Changes in mammography use: economic, need, and service factors. Am J Public Health 1992;82:1345-51.
36. Remington PL, Lantz PM. Using a population-based cancer reporting system to evaluate a breast cancer detection and awareness program. CA Cancer J Clin 1992;42:367-71.
37. Lannin DR, Mathews HF, Mitchell J, Swanson MS, Swanson FH, Edwards MS. Influence of socioeconomic and cultural factors on racial differences in late-stage presentation of breast cancer. JAMA 1998;279:1801-7.
38. Chen VW, Wu XC, Andrews PA. Cancer in North America, 1991-1995, Vol I: Incidence. Sacramento (CA): North American Association of Central Cancer Registries; 1999.



Case studies of bias in real life epidemiologic studies

Bias File 2. Should we stop drinking coffee? The story of coffee and pancreatic cancer

Compiled by

Madhukar Pai, MD, PhD

Jay S Kaufman, PhD

Department of Epidemiology, Biostatistics & Occupational Health

McGill University, Montreal, Canada

[email protected] & [email protected]

THIS CASE STUDY CAN BE FREELY USED FOR EDUCATIONAL PURPOSES WITH DUE CREDIT


The story

Brian MacMahon (1923–2007) was a British-American epidemiologist who chaired the Department of Epidemiology at Harvard from 1958 until 1988. In 1981, he published a paper in the New England Journal of Medicine, a case-control study on coffee drinking and pancreatic cancer [MacMahon B, et al., 1981]. The study concluded that "coffee use might account for a substantial proportion of the cases of this disease in the United States." According to some reports, after this study came out, MacMahon stopped drinking coffee and replaced coffee with tea in his office. This publication provoked a storm of protest from coffee drinkers and industry groups, with coverage in the New York Times, Time magazine and Newsweek. Subsequent studies, including one by MacMahon's group, failed to confirm the association. So, what went wrong and why?

The study

From the original abstract:

We questioned 369 patients with histologically proved cancer of the pancreas and 644 control patients about their use of tobacco, alcohol, tea, and coffee. There was a weak positive association between pancreatic cancer and cigarette smoking, but we found no association with use of cigars, pipe tobacco, alcoholic beverages, or tea. A strong association between coffee consumption and pancreatic cancer was evident in both sexes. The association was not affected by controlling for cigarette use. For the sexes combined, there was a significant dose-response relation (P approximately 0.001); after adjustment for cigarette smoking, the relative risk associated with drinking up to two cups of coffee per day was 1.8 (95% confidence limits, 1.0 to 3.0), and that with three or more cups per day was 2.7 (1.6 to 4.7). This association should be evaluated with other data; if it reflects a causal relation between coffee drinking and pancreatic cancer, coffee use might account for a substantial proportion of the cases of this disease in the United States.
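[Editor's note for readers of these lecture notes: the kind of quantity reported in this abstract (an exposure odds ratio with an approximate 95% confidence interval) can be computed from a 2 × 2 table in a few lines of R. The counts below are made up purely for illustration; they are not the MacMahon data.]

# Hypothetical 2 x 2 table: coffee drinking (exposure) by pancreatic cancer (outcome).
cases_exp <- 300; cases_unexp <- 69     # exposed / unexposed cases (made-up counts)
ctrl_exp  <- 450; ctrl_unexp  <- 194    # exposed / unexposed controls (made-up counts)
or <- (cases_exp * ctrl_unexp) / (cases_unexp * ctrl_exp)           # exposure odds ratio
se <- sqrt(1/cases_exp + 1/cases_unexp + 1/ctrl_exp + 1/ctrl_unexp) # SE of log(OR), Woolf method
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se)                   # approximate 95% CI
round(c(OR = or, lower = ci[1], upper = ci[2]), 2)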

The bias

The MacMahon study had several problems, and several experts have debated these in various journals, but a widely recognized bias was related to control selection. A nice, easy-to-read explanation can be found in the Gordis text [Gordis L, 2009] (but a 1981 paper by Feinstein drew attention to this problem). Controls in the MacMahon study were selected from a group of patients hospitalized by the same physicians who had diagnosed and hospitalized the cases. The idea was to make the selection process of cases and controls similar. It was also logistically easier to get controls using this method. However, as the exposure factor was coffee drinking, it turned out that patients seen by the physicians who diagnosed pancreatic cancer often had gastrointestinal disorders and were thus advised not to drink coffee (or had chosen to reduce coffee drinking by themselves). So, this led to the selection of controls with a higher prevalence of gastrointestinal disorders, and these controls had an unusually low odds of exposure (coffee intake). This in turn may have led to a spurious positive association between coffee intake and pancreatic cancer that could not be subsequently confirmed.


This problem can be explored further using causal diagrams. Since the study used a case-control design, cases were sampled from the source population with higher frequency than the controls, which is represented by the directed arrow between “pancreatic cancer” and “recruitment into study” in Figure 1. However, controls were selected by being hospitalized by the same doctors who treated the cases. If they were not hospitalized for pancreatic cancer, they must have been hospitalized for some other disease, which gave them a higher representation of GI tract disease than observed in the source population. Patients with GI tract disease may have been discouraged from drinking coffee, which gave controls a lower prevalence of exposure than seen in the source population. This is shown in Figure 1 as a directed arc from “GI tract disease” to coffee and to “recruitment into study”.

Collider stratification bias occurs when one conditions (in the design or the analysis) on a common child of two parents. In this case, restricting the observations to people recruited into the study (Figure 2) changes the correlation structure so that it is no longer the same as in the source population. Specifically, pancreatic cancer and GI tract diseases may be uncorrelated in the general population. However, among patients hospitalized by the doctors who had admitted patients with pancreatic cancer, the ones who didn’t have pancreatic disease were more likely to have something else: a GI tract disease. Therefore, restriction to the population of the doctors who hospitalized the cases induces a negative correlation between these two diseases in the data set.

Figure 3 shows a graph of the data for the study population, as opposed to the source population. Restriction to the subjects recruited from the hospital has created a correlation between GI tract disease and pancreatic cancer. Since GI tract disease lowers exposure, an unblocked backdoor path is now opened, which leads to confounding of the estimated exposure effect (shown with a dashed line and a question mark). Specifically, since the induced correlation is negative, and the effect of GI tract disease on coffee is negative, the exposure estimate for coffee on pancreatic cancer will be biased upward (Vander Stoep et al 1999).
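[Editor's note: the mechanism just described is easy to demonstrate by simulation. The R sketch below uses entirely hypothetical prevalences and recruitment probabilities (none of these numbers come from the MacMahon study); coffee and pancreatic cancer are generated independently, yet hospital-based recruitment through GI disease produces a spuriously elevated odds ratio.]

# Illustrative simulation of selection (collider) bias; all numbers hypothetical.
set.seed(1)
n <- 1e6
gi     <- rbinom(n, 1, 0.10)                          # GI tract disease in the source population
coffee <- rbinom(n, 1, ifelse(gi == 1, 0.30, 0.60))   # GI disease discourages coffee drinking
cancer <- rbinom(n, 1, 0.002)                         # pancreatic cancer, independent of coffee and GI

# Hospital-based recruitment: all cases, plus controls drawn mainly from
# patients hospitalized for other (often GI) conditions by the same doctors.
recruited <- (cancer == 1) | (runif(n) < ifelse(gi == 1, 0.02, 0.001))
d <- data.frame(coffee = coffee, cancer = cancer)[recruited, ]

tab <- table(coffee = d$coffee, cancer = d$cancer)
(tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])   # OR roughly 2, despite no true effect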


The lesson

Control selection is a critical element of case-control studies, and even the best among us can make erroneous choices. Considerable thought needs to go into this critical step in study design. As Rothman et al. emphasize in their textbook (Modern Epidemiology, 2008), the two important rules for control selection are:

1. Controls should be selected from the same population - the source population (i.e. study base) - that gives rise to the study cases. If this rule cannot be followed, there needs to be solid evidence that the population supplying controls has an exposure distribution identical to that of the population that is the source of cases, which is a very stringent demand that is rarely demonstrable.

2. Within strata of factors that will be used for stratification in the analysis, controls should be selected independently of their exposure status, in that the sampling rate for controls should not vary with exposure.

A more general concern than the issue of control selection in case-control studies is the problem of selection bias (Hernán et al 2004). Whenever the epidemiologist conditions statistically (e.g. by stratification, exclusion or adjustment) on a factor affected by exposure and affected by outcome, a spurious correlation will occur in the study data-set that does not reflect an association in the real world from which the data were drawn. If there is already a non-null association between exposure and outcome, it can be shifted upwards or downwards by this form of bias.

Sources and suggested readings*

1. MacMahon B, Yen S, Trichopoulos D et al. Coffee and cancer of the pancreas. N Engl J Med 1981;304: 630–633.

2. Schmeck HM. Critics say coffee study was flawed. New York Times, June 30, 1981.

3. Gordis L. Epidemiology. Saunders, 2008.

4. Feinstein A et al. Coffee and Pancreatic Cancer. The Problems of Etiologic Science and Epidemiologic Case-Control Research. JAMA 1981;246:957-961.

5. Rothman K, Greenland S, Lash T. Modern epidemiology. Lippincott Williams & Wilkins, 3rd edition, 2008.

6. Coffee and Pancreatic Cancer. An Interview With Brian MacMahon. EpiMonitor, April/May, 1981.

7. Vander Stoep A, Beresford SA, Weiss NS. A didactic device for teaching epidemiology students how to anticipate the effect of a third factor on an exposure-outcome relation. Am J Epidemiol. 1999 Jul 15;150(2):221.

8. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004 Sep;15(5):615-25.

Image credit: Epidemiology: July 2004 - Volume 15 - Issue 4 - pp 504-508

*From this reading list, the most relevant papers are enclosed.


CRITICS SAY COFFEE STUDY WAS FLAWED
By HAROLD M. SCHMECK JR.
Published: June 30, 1981

THERE were flaws in a study showing links between coffee drinking and a common form of cancer, several medical scientists and physicians said in letters published in the latest issue of The New England Journal of Medicine.

In March, the journal carried a report showing statistical links between coffee drinking and cancer of the pancreas, the fourth most common cause of cancer deaths among Americans.

"This otherwise excellent paper may be flawed in one critical way," said a letter from Dr. Steven Shedlofsky of the Veterans Administration Hospital in White River Junction, Vt. He questioned the comparison of pancreatic cancer patients with persons hospitalized for noncancerous diseases of the digestive system.

Such patients, he noted, might be expected to give up coffee drinking because of their illness. This, he argued, would tilt the proportion of coffee drinkers away from the "control" group who were being compared with the cancer patients. Amplifying the letter in an interview, Dr. Shedlofsky said many patients with digestive diseases give up coffee because they believe it aggravates their discomfort, and others do so because their doctors have advised them to.

Dr. Thomas C. Chalmers, president of the Mount Sinai Medical Center and dean of its medical school, commented that the investigators who questioned patients on their prehospitalization coffee habits knew in advance which ones had cancer. This could have introduced unintentional bias in the results, Dr. Chalmers asserted.

Among the comments from other physicians were these: the question of whether noncancerous illness might have kept the control patients from drinking coffee was raised; a correspondent pointed out the problem inherent in trying to judge coffee consumption simply by asking about typical daily consumption before hospitalization; and another noted the possible role of other health habits that are closely related to coffee drinking. These habits included cigarette smoking and the use of sugar, milk, cream or nondairy "creamers" with the coffee.

The authors of the original report, led by Dr. Brian MacMahon of the Harvard School of Public Health, defended their study against all of the comments. They agreed that concern was "reasonable" over the large number of patients in their control group who had gastrointestinal disorders. But they said the association between coffee drinking and cancer of the pancreas was present in all the control groups.

The introduction of unintentional bias was unlikely, they said, because the study team had no hypotheses about coffee when it began the study. Coffee drinking only emerged as statistically important when most of the data had already been gathered, they said.

Differences Between Sexes


The study showed no difference in risk between men who said they drank only about two

cups of coffee a day and those who drank much more. Among women, however, the risk

seemed to be related to the amount consumed. Some of the physicians who commented

on the study considered the lack of a dose effect in men puzzling and a cause of doubt

concerning the overall implications of the study.

In their original report, Dr. MacMahon and his colleagues treated their evidence

cautiously, asserting that further studies were needed to determine whether coffee

drinking was actually a factor in causing the cancers. If it is a matter of cause and effect,

they said, and if the findings apply to the nation as a whole, coffee drinking might be a

factor in slightly more than half of the roughly 20,000 cases a year of that form of cancer

in the United States.

Coffee industry spokesmen, who were critical of the report when it was published in

March, estimate that more than half of Americans over the age of 10 drink coffee.


Coffee and Pancreatic Cancer: The Problems of Etiologic Science and Epidemiologic Case-Control Research

THE RECENT report that coffee may cause pancreatic cancer [1] was presented in a pattern that has become distressingly familiar. The alleged carcinogen is a commonly used product. The report was given widespread publicity before the supporting evidence was available for appraisal by the scientific community, and the public received renewed fear and uncertainty about the cancerous hazards lurking in everyday life.

The research on coffee and pancreatic cancer was done with the case-control technique that has regularly been used in epidemiologic circumstances where the more scientifically desirable forms [2] of clinical investigation (a randomized controlled trial or a suitably performed observational cohort study) are either impossible or unfeasible. In case-control studies, the investigators begin at the end, rather than at the beginning, of the cause-effect pathway. The cases are selected from persons in whom the target disease has already developed. The controls are selected from persons in whom that disease has not been noted. The cases and controls are then investigated in a backward temporal direction, with inquiries intended to determine antecedent exposure to agents that may have caused the disease. If the ratio of antecedent exposure to a particular agent is higher in the cases than in the controls, and if the associated mathematical calculations are "statistically significant," the agent is suspected of having caused the disease.

In the recently reported study [1] of coffee and pancreatic cancer, the investigators began by assembling records for 578 cases of patients with "histologic diagnoses of cancer of the exocrine pancreas." The investigators next created two "control" groups, having other diagnoses. The cases and controls were then interviewed regarding antecedent exposure to tobacco, alcohol, tea, and coffee. When the data were analyzed for groups demarcated according to gender and quantity of coffee consumption, the calculated relative-risk ratios for pancreatic cancer were the values shown in Table 1.

From these and other features of the statistical analysis, the investigators concluded that "a strong association between coffee consumption and pancreatic cancer was evident in both sexes." The conclusions were presented with the customary caveats about the need for more research and with the customary restraints shown in such expressions as "coffee use might [our italics] account for a substantial proportion" of pancreatic cancers. Nevertheless, the impression was strongly conveyed that coffee had been indicted as a carcinogen.

Although the major public attention has been given to the "Results" and "Discussion" sections of the published report, readers concerned with scientific standards of evidence will want to focus on the "Methods." The rest of this commentary contains a review of pertinent principles of case-control methodology, together with a critique of the way these principles were applied in the coffee-pancreatic cancer study to formulate a hypothesis, assemble the case and control groups, collect the individual data, and interpret the results.

Scientific Hypotheses and 'Fishing Expeditions'

Most case-control studies are done to check the hypothesis that the target disease has been caused by a specified suspected agent, but after the cases and controls are assembled the investigators can also collect data about many other possible etiologic agents. The process of getting and analyzing data for these other agents is sometimes called a "fishing expedition," but the process seems entirely reasonable. If we do not know what causes a disease, we might as well check many different possibilities. On the other hand, when an unsuspected agent yields a positive result, so that the causal hypothesis is generated by the data rather than by the investigator, the results of the fishing expedition require cautious interpretation. Many scientists would not even call the positive association a "hypothesis" until the work has been reproduced in another investigation.

The investigators who found a positive association between coffee consumption and pancreatic cancer have been commendably forthright in acknowledging that they were looking for something else. When the original analyses showed nothing substantial to incriminate the two principal suspects (tobacco and alcohol), the exploration of alternative agents began.

Table 1. Relative-Risk Ratios According to Gender and Quantity of Coffee Consumption

                    Coffee Consumption, Cups per Day
                0        1-2      3-4      ≥5
    Men         1.0      2.6      2.3      2.6
    Women       1.0      1.6      3.3      3.1

From the Robert Wood Johnson Clinical Scholars Program, Yale University School of Medicine, New Haven, Conn (Drs Feinstein and Horwitz); the Cooperative Studies Program Support Center, Veterans Administration Hospital, West Haven, Conn (Dr Feinstein); the McGill Cancer Center, McGill University (Dr Spitzer); and the Kellogg Center for Advanced Studies in Primary Care, Montreal General Hospital (Drs Spitzer and Battista), Montreal.

Reprint requests to Robert Wood Johnson Clinical Scholar Program, Yale University School of Medicine, 333 Cedar St, Box 3333, New Haven, CT 06510 (Dr Feinstein).


The investigators do not state how many additional agents were examined besides tea and coffee, but tea was exonerated in the subsequent analyses, while coffee yielded a positive result.

The investigators suggest that this result is consistent with coffee-as-carcinogen evidence that had appeared in a previous case-control study [3] of pancreatic cancer. In fact, however, coffee was not indicted in that previous study. The previous investigators found an elevated risk ratio for only decaffeinated coffee, and they drew no conclusion about it, having found elevated risks for several other phenomena that led to the decision that pancreatic cancer had a nonspecific multifactorial etiology. Thus, the new hypothesis that coffee may cause pancreatic cancer not only arises from a "fishing expedition," but also contradicts the results found in previous research.

Selection and Retention of Cases and Controls

Because the investigators begin at the end of the causal pathway and must explore it with a reversal of customary scientific logic, the selection of cases and controls is a crucial feature of case-control studies. Both groups are chosen according to judgments made by the investigators. The decisions about the cases are relatively easy. They are commonly picked from a registry or some other listing that will provide the names of persons with the target disease. For the controls, who do not have the target disease, no standard method of selection is available, and they have come from an extraordinarily diverse array of sources. The sources include death certificates, tumor registries, hospitalized patients, patients with specific categories of disease, patients hospitalized on specific clinical services, other patients of the same physicians, random samples of geographically defined communities, people living in "retirement" communities, neighbors of the cases, or personal friends of the cases.

One useful way of making these decisions less arbitrary is to choose cases and controls according to the same principles of eligibility and observation that would be used in a randomized controlled trial of the effects of the alleged etiologic agent. In such a trial, a set of admission criteria would be used for demarcating persons to be included (or excluded) in the group who are randomly assigned to be exposed or non-exposed to the agent. Special methods would then be used to follow the members of the exposed and non-exposed groups thereafter, and to examine them for occurrence of the target disease. Those in whom this disease develops would become the cases, and all other people would be the controls.

When cases and controls are chosen for a case-control study, the selection can be made from persons who would have been accepted for admission to such a randomized trial and who have been examined with reasonably similar methods of observation. As a scientific set of guidelines for choosing eligible patients, the randomized-trial principles could also help avoid or reduce many of the different forms of bias that beset case-control studies. Among these difficulties are several biases to be discussed later, as well as other problems such as clinical susceptibility bias, surveillance bias, detection bias, and "early death" bias, which are beyond the scope of this discussion and have been described elsewhere [4-9].

The randomized-trial principles can also help illuminate the problems created and encountered by the investigators in the study of coffee and pancreatic cancer. In a randomized trial, people without pancreatic cancer would be assigned either to drink or not to drink coffee. Anyone with clinical contraindications against coffee drinking or indications for it (whatever they might be) would be regarded as ineligible and not admitted. Everyone who did enter the trial, however, would thereafter be included in the results as the equivalent of either a case, if later found to have pancreatic cancer, or a control. The cases would be "incidence cases," with newly detected pancreatic cancer, whose diagnoses would be verified by a separate panel of histological reviewers. All of the other admitted persons would eventually be classified as unaffected "controls," no matter what ailments they acquired, as long as they did not have pancreatic cancer. If large proportions of the potential cases and controls were lost to follow-up, the investigators would perform detailed analyses to show that the remaining patients resembled those who were lost, thus providing reasonable assurance that the results were free from migration bias [2].

In the coffee-pancreatic cancer study, the source of the cases was a list of 578 patients with "histologic diagnoses of cancer of the exocrine pancreas." The histologic material was apparently not obtained and reviewed; and the authors do not indicate whether the patients were newly diagnosed "incidence cases," or "prevalence cases" who had been diagnosed at previous admissions. Regardless of the incidence-prevalence distinction, however, the published data are based on only 369 (64%) of the 578 patients who were identified as potential cases. Most of the "lost" patients were not interviewed, with 98 potential cases being too sick or already dead when the interviewer arrived. The investigators report no data to indicate whether the "lost" cases were otherwise similar to those who were retained.

In choosing the control group, the investigators made several arbitrary decisions about whom to admit or exclude. The source of the controls was "all other patients who were under the care of the same physician in the same hospital at the time of an interview with a patient with pancreatic cancer." From this group, the investigators then excluded anyone with any of the following diagnoses: diseases of the pancreas; diseases of the hepatobiliary tract; cardiovascular disease; diabetes mellitus; respiratory cancer; bladder cancer; or peptic ulcer. Since none of these patients would have been excluded as nonpancreatic-cancer controls if they acquired these diseases after entry into a randomized trial of coffee drinking, their rejection in this case-control study is puzzling. The investigators give no reasons for excluding patients with "diseases of the pancreas or hepatobiliary tract." The reason offered for the other rejections is that the patients had "diseases known to be associated with smoking or alcohol consumption." The pertinence of this stipulation for a study of coffee is not readily apparent.

Since the investigators do not state how many potential controls were eliminated, the proportionate impact of the exclusions cannot be estimated. The remaining list of eligible control patients, however, contained 1,118 people, of whom only a little more than half (644 patients) became the actual control group used for analyses. Most of the "lost" controls were not interviewed because of death, early discharge, severity of illness, refusal to participate, and language problems. Of the 700 interviewed controls, 56 were subsequently excluded because they were nonwhite, foreign, older than 79 years, or "unreliable."


No data are offered to demonstrate that the 644 actual controls were similar to the 474 "eligible" controls who were not included.

The many missing controls and missing interviews could have led to exclusion biases [10, 11] whose effects cannot be evaluated in this study. The investigators have also given no attention to the impact of selective hospitalization bias, perceived by Berkson [4] and empirically demonstrated by Roberts et al [6], that can sometimes falsely elevate relative-risk ratios in a hospital population to as high as 17 times their true value in the general population. For example, in a hospitalized population, Roberts et al [6] found a value of 5.0 for the relative-risk ratio of arthritic and rheumatic complaints in relation to laxative usage; but in the general population that contained the hospitalized patients, the true value was 1.5. Whatever may have been the effects of selective hospitalization in the current study (including the possibility of having masked real effects of tobacco and alcohol), the way that the cases and controls were chosen made the study particularly vulnerable to the type of bias described in the next section.

Protopathic Bias in Cases and Controls

"Protopathic" refers to early disease. A protopathicproblem occurs if a person's exposure to a suspectedetiologic agent is altered because of the early manifesta¬tions of a disease, and if the altered (rather than theoriginal) level of exposure is later associated with thatdisease. By producing changes in a person's life-style or

medication, the early manifestations of a disease can

create a bias unique to case-control studies.12 In arandomized trial or observational cohort study, the inves¬tigator begins with each person's baseline state andfollows it to the subsequent outcome. If exposure to a

suspected etiologic agent is started, stopped, or alteredduring this pathway, the investigator can readily deter¬mine whether the change in exposure took place before orafter occurrence of the outcome. In a case-control study,however, the investigator beginning with an outcomecannot be sure whether it preceded or followed changes inexposure to the suspected agent. If the exposure was

altered because the outcome had already occurred and ifthe timing of this change is not recognized by theinvestigator, the later level of exposure (or non-exposure)may be erroneously linked to the outcome event.For example, in circumstances of ordinary medical care,

women found to have benign breast disease might be toldby their physicians to avoid or stop any form of estrogentherapy. If such women are later included as cases in a

case-control study of etiologic factors in benign breastdisease, the antecedent exposure to estrogens will havebeen artifactually reduced in the case group. Oral contra¬ceptives or other forms of estrogen therapy may then befound to exert a fallacious "protection" against thedevelopment of benign breast disease.

The problem of protopathic bias will occur in a case-control study if the amount of previous exposure to thesuspected etiologic agent was preferentially altered—either upward or downward—because of clinical manifes¬tations that represented early effects of the same diseasethat later led to the patient's selection as either a case orcontrol. The bias is particularly likely to arise if thepreferential decisions about exposure were made in oppo¬site directions in the cases and controls. The coffee-

pancreatic cancer study was particularly susceptible tothis type of bi-directional bias. The customary intake ofcoffee may have been increased by members of thepancreatic-cancer case group who were anxious aboutvague abdominal symptoms that had not yet becomediagnosed or even regarded as "illness." Conversely,control patients with such gastrointestinal ailments as

regional enteritis or dyspepsia may have been medicallyadvised to stop or reduce their drinking of coffee. With astrict set of admission criteria, none of these patientswould be chosen as cases or controls, because the use ofthe alleged etiologic agent would have been previouslyaltered by the same ailment that led to the patient'sselection for the study.This problem of protopathic bias is a compelling

concern in the investigation under review here. Because so

many potential control patients were excluded, theremaining control group contained many people withgastrointestinal diseases for which coffee drinking mayhave been previously reduced or eliminated. Of the 644controls, 249 (39%) had one of the following diagnoses:cancer of the stomach, bowel, or rectum; colitis, enteritis,or diverticulitis; bowel obstruction, adhesions, or fistula;gastritis; or "other gastroenterologic conditions." If coffeedrinking is really unrelated to pancreatic cancer, but ifmany of these 249 patients had premonitory symptomsthat led to a cessation or reduction in coffee drinking"before the current illness was evident," the subsequentdistortions could easily produce a false-positive associa¬tion.

The existence of this type of bias could have beenrevealed or prevented if the investigators had obtainedsuitable data. All that was needed during the interviewwith each case or control patient was to ask aboutduration of coffee drinking, changes in customary patternof consumption, and reasons for any changes. Unfortu¬nately, since coffee was not a major etiologic suspect inthe research, this additional information was not solicited.After the available data were analyzed, when the investi¬gators became aware of a possible problem, they tried tominimize its potential importance by asserting that"although the majority of control patients in our serieshad chronic disease, pancreatic cancer is itself a chronicdisease, and in theory it would seem as likely as any otherdisorder to induce a change in coffee [consumption]." Thisassertion does not address the point at issue. The biasunder discussion arises from changes in exposure statusbecause of the early clinical manifestations of a disease,not from the chronic (or acute) characteristics of theconditions under comparison.

The investigators also claimed that "it is inconceivablethat this bias would account for the total differencebetween cases and controls." The conception is actuallyquite easy. To make the demonstration clear, let useliminate gender distinctions and coffee quantification inthe investigators' Table 4, which can then be convertedinto a simple fourfold table (Table 2). In this table, theodds ratio, which estimates the relative-risk ratio, is(347/20)/(555/88)=2.75, which is the same magnitude asthe relative risks cited by the investigators.Let us now assume that 5% of the coffee-drinker cases

were formerly non-coffee-drinkers. If so, 17 people in thecase group would be transferred downward from thecoffee drinkers to the nondrinkers. Although 249 members


Table 2. Status of Study Subjects According to Coffee Consumption

                              Cases     Controls
    Coffee-drinkers            347        555
    Non-coffee-drinkers         20         88
    Total                      367        643

Table 3. Hypothetical* Status of Study Subjects Shown in Table 2

                              Cases     Controls
    Coffee-drinkers            330        573
    Non-coffee-drinkers         37         70
    Total                      367        643

    *Based on the estimate that 5% of coffee-drinkers in the case group were previously non-coffee-drinkers and that 20% of non-coffee-drinkers in the control group ceased coffee consumption because of symptoms.

Although 249 members of the control group had gastrointestinal conditions that might have led to a cessation of coffee consumption, let us conservatively estimate that only 20% of the 88 controls listed in the non-coffee-drinkers category were previous coffee-drinkers who had stopped because of symptoms. If so, 18 of the non-coffee-drinking controls should move upward into the coffee-drinking group. With these reclassifications, the adjusted fourfold table would be as presented in Table 3. For this new table, the odds ratio is (330/37)/(573/70) = 1.09, and the entire positive association vanishes.
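The arithmetic of this two-step reclassification can be checked in R with the counts taken directly from Tables 2 and 3 (this short check is illustrative only and is not part of the original commentary):

  # a = coffee-drinking cases, b = coffee-drinking controls,
  # c = non-drinking cases,    d = non-drinking controls
  odds.ratio <- function(a, b, c, d) (a / c) / (b / d)

  odds.ratio(347, 555, 20, 88)                      # Table 2: (347/20)/(555/88) = 2.75

  # Hypothetical reclassification (Table 3): 5% of the 347 coffee-drinking cases
  # (17 people) move down to non-drinkers; 20% of the 88 non-drinking controls
  # (18 people) move up to coffee-drinkers.
  odds.ratio(347 - 17, 555 + 18, 20 + 17, 88 - 18)  # (330/37)/(573/70) = 1.09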

Acquisition of Basic Data

All of the difficulties just described arise as consequences of basic decisions made in choosing cases and controls. After these decisions are completed, the case-control investigator acquires information about each person's antecedent exposure. This information becomes the basic research data, analogous to the description of each patient's outcome in a randomized controlled trial. The information about exposure should therefore be collected with thorough scientific care, using impeccable criteria to achieve accuracy, and, when necessary, using objective (or "blinded") methods to prevent biased observations.

These scientific requirements are seldom fulfilled in epidemiologic research. The primary data about exposure are verified so infrequently in case-control studies that prominent epidemiologists [13] have begun to make public pleas for improved scientific standards and methods. In the few instances where efforts have been made to confirm recorded data [14, 15], to repeat interviews at a later date [16], or to check the agreement of data obtained from different sources [11], the investigators have encountered discrepancies of major magnitude. In one of these studies [17], when the agent of exposure (occupation as a fisherman) was confirmed, the original numbers of exposed people were reduced by 17%. Had these numbers not been corrected, the study would have produced misleading conclusions.

Although errors of similar magnitude could easily have occurred in the coffee-pancreatic cancer investigation, the investigators did not publish even a brief text of the actual questions used for the interviews, and no efforts are mentioned to check the quality of the data that were obtained in the single interview with each patient. Family members or friends were not asked to confirm the patients' answers; the information was not checked against previous records; and no patients were reinterviewed after the original interrogation to see whether subsequent responses agreed with what was said previously. Although a verification of each interview is difficult to achieve in a large study, the scientific quality of the data could have been checked in a selected sample.

Because of the high likelihood of the protopathic bias noted earlier, the quality of the coffee-drinking data is a major problem in the study under review. The investigators state that "the questions on tea and coffee were limited to the number of cups consumed in a typical day before the current illness was evident." This approach would not produce reliable data, since it does not indicate what and when was a "typical day," who decided what was the "time before the current illness was evident," or who determined which of the patient's symptoms were the first manifestation of "illness" either for pancreatic cancer or for the diverse diseases contained in the control group.

Although the investigators acknowledge the possibility that "patients reduced their coffee consumption because of illness," nothing was done to check this possibility or to check the alternative possibility that other patients may have increased their customary amounts of coffee drinking. In addition to no questions about changes in coffee consumption, the patients were also asked nothing about duration. Thus, a patient who had started drinking four cups a day in the past year would have been classified as having exactly the same exposure as a patient who had been drinking four cups a day for 30 years.

The Problem of Multiple Contrasts

When multiple features of two groups are tested for "statistically significant" differences, one or more of those features may seem "significant" purely by chance. This multiple-contrast problem is particularly likely to arise during a "fishing expedition." In the customary test of statistical significance, the investigator contrasts the results for a single feature in two groups. The result of this single-feature two-group contrast is declared significant if the P value falls below a selected boundary, which is called the α level. Because α is commonly set at .05, medical literature has become replete with statements that say "the results are statistically significant at P < .05." For a single two-group contrast at an α level of .05, the investigator has one chance in 20 (which can also be expressed as contrary odds of 19 to 1) of finding a false-positive result if the contrasted groups are really similar.

For the large series of features that receive two-group contrasts during a "fishing expedition," however, statistical significance cannot be decided according to the same α level used for a single contrast. For example, in the coffee-pancreatic cancer study, the cases and controls were divided for two-group contrasts of such individual exposures (or non-exposures) as cigars, pipes, cigarettes, alcohol, tea, and coffee. (If other agents were also checked, the results are not mentioned.) With at least six such two-group contrasts, the random chance of finding a single false-positive association where none really exists is no longer .05. If the characteristics are mutually independent, the chance is at least .26 [= 1 − (.95)^6].


Consequently, when six different agents are checked in the same study, the odds against finding a spurious positive result are reduced from 19 to 1 and become less than 3 to 1 [= .74/.26].

To guard against such spurious conclusions during multiple contrasts, the customary statistical strategy is to make stringent demands on the size of the P value required for "significance." Instead of being set at the customary value of .05, the α level is substantially lowered. Statisticians do not agree on the most desirable formula for determining this lowered boundary, but a frequent procedure is to divide the customary α level by k, where k is the number of comparisons [18]. Thus, in the current study, containing at least six comparisons, the decisive level of α would be set at no higher than .05/6 = .008.

In the published report, the investigators make no comment about this multiple-contrast problem and they do not seem to have considered it in their analyses. In one of the results, a P value is cited as "<.001," but most of the cogent data for relative risks are expressed in "95% confidence intervals," which were calculated with α = .05. Many of those intervals would become expanded to include the value of 1, thereby losing "statistical significance," if α were re-set at the appropriate level of .008 or lower.
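A quick numerical check of these two calculations in R (illustrative only; not part of the original commentary):

  alpha <- 0.05
  k     <- 6            # at least six contrasts: cigars, pipes, cigarettes, alcohol, tea, coffee

  1 - (1 - alpha)^k     # chance of at least one false-positive among k independent contrasts: about 0.26
  alpha / k             # Bonferroni-style per-contrast significance level: about 0.008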

Comment

The foregoing discussion has been confined to the main reasons for doubting the reported association between coffee and pancreatic cancer. Readers who are interested in evaluating other features of the study can check its constituent methods by referring to the criteria listed in several published proposals [8-10] of scientific standards for case-control research.

A separate problem, to be mentioned only in passing, is the appropriateness of forming conclusions and extensively diffusing results from a study in which the hypothesis develops as an analytic surprise in the data. Scientists and practitioners in the field of human health face difficult dilemmas about the risks and benefits of their activities. The old principle of avoiding harm whenever possible holds true whether a person or a population is at risk. Whether to shout "Fire!" in a crowded theater is a difficult decision, even if a fire is clearly evident. The risk of harm seems especially likely if such shouts are raised when the evidence of a blaze is inconclusive or meager. Aside from puzzled medical practitioners and a confused lay public, another possible victim is the developing science of chronic disease epidemiology. Its credibility can withstand only a limited number of false alarms.

Because the epidemiologic case-control study is a necessary, currently irreplaceable research mechanism in etiologic science, its procedures and operating paradigms need major improvements in scientific quality. In the evaluation of cause-effect relationships for therapeutic agents, the experimental scientific principles of a randomized trial have sometimes required huge sample sizes and massive efforts that have made the trials become an "indispensable ordeal" [19]. In the evaluation of cause-effect relationships for etiologic agents, the case-control technique has eliminated the "ordeal" of a randomized controlled trial by allowing smaller sample sizes, the analysis of natural events and data, and a reversed observational direction. Since the use of scientific principles remains "indispensable," however, the development and application of suitable scientific standards in case-control research is a prime challenge in chronic disease epidemiology today.

The current methodologic difficulties arise because case-control investigators, having recognized that etiologic agents cannot be assigned with experimental designs, and having necessarily abandoned the randomization principle in order to work with naturally occurring events and data, have also abandoned many other scientific principles that are part of the experimental method and that could be employed in observational research. The verification and suitably unbiased acquisition of basic raw data regarding diagnoses and exposures do not require randomized trials; and the patients admitted to an observational study can be selected in accordance with the same eligibility criteria and the same subsequent diagnostic procedures that would have been used in a randomized trial [20]. These scientific experimental principles, however, are still frequently disregarded in case-control research, despite the celebrated warning of the distinguished British statistician, Sir Austin Bradford Hill [21]. In discussing the use of observational substitutes for experimental trials, he said that the investigator "must have the experimental approach firmly in mind" and must work "in such a way as to fulfill, as far as possible, experimental requirements."

Alvan R. Feinstein, MD
Ralph I. Horwitz, MD
Walter O. Spitzer, MD
Renaldo N. Battista, MD

1. MacMahon B, Yen S, Trichopoulos D, et al: Coffee and cancer of the pancreas. N Engl J Med 1981;304:630-633.
2. Feinstein AR: Clinical biostatistics: XLVIII. Efficacy of different research structures in preventing bias in the analysis of causation. Clin Pharmacol Ther 1979;26:129-141.
3. Lin RS, Kessler II: A multifactorial model for pancreatic cancer in man. JAMA 1981;245:147-152.
4. Berkson J: Limitations of the application of four-fold tables to hospital data. Biometrics Bull 1946;2:47-53.
5. Neyman J: Statistics: Servant of all sciences. Science 1955;122:401.
6. Roberts RS, Spitzer WO, Delmore T, et al: An empirical demonstration of Berkson's bias. J Chronic Dis 1978;31:119-128.
7. Horwitz RI, Feinstein AR: Methodologic standards and contradictory results in case-control research. Am J Med 1979;66:556-564.
8. Feinstein AR: Methodologic problems and standards in case-control research. J Chronic Dis 1979;32:35-41.
9. Sackett DL: Bias in analytic research. J Chronic Dis 1979;32:51-63.
10. Horwitz RI, Feinstein AR, Stewart KR: Exclusion bias and the false relationship of reserpine/breast cancer, abstracted. Clin Res 1981;29:563.
11. Horwitz RI, Feinstein AR, Stremlau JR: Alternative data sources and discrepant results in case-control studies of estrogens and endometrial cancer. Am J Epidemiol 1980;111:389-394.
12. Horwitz RI, Feinstein AR: The problem of 'protopathic bias' in case-control studies. Am J Med 1980;68:255-258.
13. Gordis L: Assuring the quality of questionnaire data in epidemiologic research. Am J Epidemiol 1979;109:21-24.
14. Chambers LW, Spitzer WO, Hill GB, et al: Underreporting of cancer in medical surveys: A source of systematic error in cancer research. Am J Epidemiol 1976;104:141-145.
15. Chambers LW, Spitzer WO: A method of estimating risk for occupational factors using multiple data sources: The Newfoundland lip cancer study. Am J Public Health 1977;67:176-179.
16. Klemetti A, Saxen L: Prospective versus retrospective approach in the search for environmental causes of malformations. Am J Public Health 1967;57:2071-2075.
17. Spitzer WO, Hill GB, Chambers LW, et al: The occupation of fishing as a risk factor in cancer of the lip. N Engl J Med 1975;293:419-424.
18. Brown BW Jr, Hollander M: Statistics: A Biomedical Introduction. New York, John Wiley & Sons Inc, 1977, pp 231-234.
19. Fredrickson DS: The field trial: Some thoughts on the indispensable ordeal. Bull NY Acad Med 1968;44:985-993.
20. Horwitz RI, Feinstein AR: A new research method, suggesting that anticoagulants reduce mortality in patients with myocardial infarction. Clin Pharmacol Ther 1980;27:258.
21. Hill AB: Observation and experiment. N Engl J Med 1953;248:995-1001.


ORIGINAL ARTICLE

A Structural Approach to Selection Bias

Miguel A. Hernán,* Sonia Hernández-Díaz,† and James M. Robins*

Abstract: The term "selection bias" encompasses various biases in epidemiology. We describe examples of selection bias in case-control studies (eg, inappropriate selection of controls) and cohort studies (eg, informative censoring). We argue that the causal structure underlying the bias in each example is essentially the same: conditioning on a common effect of 2 variables, one of which is either exposure or a cause of exposure and the other is either the outcome or a cause of the outcome. This structure is shared by other biases (eg, adjustment for variables affected by prior exposure). A structural classification of bias distinguishes between biases resulting from conditioning on common effects ("selection bias") and those resulting from the existence of common causes of exposure and outcome ("confounding"). This classification also leads to a unified approach to adjust for selection bias.

(Epidemiology 2004;15: 615–625)

Epidemiologists apply the term "selection bias" to many biases, including bias resulting from inappropriate selection of controls in case-control studies, bias resulting from differential loss to follow-up, incidence-prevalence bias, volunteer bias, healthy-worker bias, and nonresponse bias.

As discussed in numerous textbooks [1-5], the common consequence of selection bias is that the association between exposure and outcome among those selected for analysis differs from the association among those eligible. In this article, we consider whether all these seemingly heterogeneous types of selection bias share a common underlying causal structure that justifies classifying them together. We use causal diagrams to propose a common structure and show how this structure leads to a unified statistical approach to adjust for selection bias. We also show that causal diagrams can be used to differentiate selection bias from what epidemiologists generally consider confounding.

CAUSAL DIAGRAMS AND ASSOCIATION

Directed acyclic graphs (DAGs) are useful for depicting causal structure in epidemiologic settings [6-12]. In fact, the structure of bias resulting from selection was first described in the DAG literature by Pearl [13] and by Spirtes et al [14]. A DAG is composed of variables (nodes), both measured and unmeasured, and arrows (directed edges). A causal DAG is one in which 1) the arrows can be interpreted as direct causal effects (as defined in Appendix A.1), and 2) all common causes of any pair of variables are included on the graph. Causal DAGs are acyclic because a variable cannot cause itself, either directly or through other variables. The causal DAG in Figure 1 represents the dichotomous variables L (being a smoker), E (carrying matches in the pocket), and D (diagnosis of lung cancer). The lack of an arrow between E and D indicates that carrying matches does not have a causal effect (causative or preventive) on lung cancer, ie, the risk of D would be the same if one intervened to change the value of E.

Besides representing causal relations, causal DAGs also encode the causal determinants of statistical associations. In fact, the theory of causal DAGs specifies that an association between an exposure and an outcome can be produced by the following 3 causal structures [13, 14]:

1. Cause and effect: If the exposure E causes the outcome D, or vice versa, then they will in general be associated. Figure 2 represents a randomized trial in which E (antiretroviral treatment) prevents D (AIDS) among HIV-infected subjects. The (associational) risk ratio ARR_ED differs from 1.0, and this association is entirely attributable to the causal effect of E on D.

2. Common causes: If the exposure and the outcome share a common cause, then they will in general be associated even if neither is a cause of the other. In Figure 1, the common cause L (smoking) results in E (carrying matches) and D (lung cancer) being associated, ie, again, ARR_ED ≠ 1.0.

Submitted 21 March 2003; final version accepted 24 May 2004.

From the *Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts; and the †Slone Epidemiology Center, Boston University School of Public Health, Brookline, Massachusetts.

Miguel Hernán was supported by NIH grant K08-AI-49392 and James Robins by NIH grant R01-AI-32475.

Correspondence: Miguel Hernán, Department of Epidemiology, Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115. E-mail: [email protected]

Copyright © 2004 by Lippincott Williams & Wilkins. ISSN: 1044-3983/04/1505-0615. DOI: 10.1097/01.ede.0000135174.63482.43


3. Common effects: An exposure E and an outcome D that have a common effect C will be conditionally associated if the association measure is computed within levels of the common effect C, ie, the stratum-specific ARR_ED|C will differ from 1.0, regardless of whether the crude (equivalently, marginal, or unconditional) ARR_ED is 1.0. More generally, a conditional association between E and D will occur within strata of a common effect C of 2 other variables, one of which is either exposure or a cause of exposure and the other is either the outcome or a cause of the outcome. Note that E and D need not be unconditionally associated simply because they have a common effect. In the Appendix we describe additional, more complex, structural causes of statistical associations.

That causal structures (1) and (2) imply a crude association accords with the intuition of most epidemiologists. We now provide intuition for why structure (3) induces a conditional association. (For a formal justification, see references 13 and 14.) In Figure 3, the genetic haplotype E and smoking D both cause coronary heart disease C. Nonetheless, E and D are marginally unassociated (ARR_ED = 1.0) because neither causes the other and they share no common cause. We now argue heuristically that, in general, they will be conditionally associated within levels of their common effect C.

Suppose that the investigators, who are interested in estimating the effect of haplotype E on smoking status D, restricted the study population to subjects with heart disease (C = 1). The square around C in Figure 3 indicates that they are conditioning on a particular value of C. Knowing that a subject with heart disease lacks haplotype E provides some information about her smoking status because, in the absence of E, it is more likely that another cause of C such as D is present. That is, among people with heart disease, the proportion of smokers is increased among those without the haplotype E. Therefore, E and D are inversely associated conditionally on C = 1, and the conditional risk ratio ARR_ED|C=1 is less than 1.0. In the extreme, if E and D were the only causes of C, then among people with heart disease, the absence of one of them would perfectly predict the presence of the other.
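This heuristic can be checked with a small R simulation (illustrative and not part of the original article; the probability model for C below is an arbitrary assumption): generate E and D independently, let C depend on both, and compare the crude risk ratio with the risk ratio within the C = 1 stratum.

  set.seed(1)
  n <- 1e6
  E <- rbinom(n, 1, 0.2)                         # haplotype
  D <- rbinom(n, 1, 0.3)                         # smoking, generated independently of E
  C <- rbinom(n, 1, plogis(-3 + 2*E + 2*D))      # heart disease, caused by both E and D

  arr <- function(e, d) mean(d[e == 1]) / mean(d[e == 0])   # associational risk ratio
  arr(E, D)                    # crude ARR_ED: approximately 1 (E and D are independent)
  arr(E[C == 1], D[C == 1])    # ARR_ED|C=1: below 1, the association induced by conditioning on C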

As another example, the DAG in Figure 4 adds to the DAG in Figure 3 a diuretic medication M whose use is a consequence of a diagnosis of heart disease. E and D are also associated within levels of M because M is a common effect of E and D.

There is another possible source of association between 2 variables that we have not discussed yet. As a result of sampling variability, 2 variables could be associated by chance even in the absence of structures (1), (2), or (3). Chance is not a structural source of association because chance associations become smaller with increased sample size. In contrast, structural associations remain unchanged. To focus our discussion on structural rather than chance associations, we assume we have recorded data in every subject in a very large (perhaps hypothetical) population of interest. We also assume that all variables are perfectly measured.

A CLASSIFICATION OF BIASES ACCORDING TO THEIR STRUCTURE

We will say that bias is present when the association between exposure and outcome is not in its entirety the result of the causal effect of exposure on outcome, or more precisely when the causal risk ratio (CRR_ED), defined in Appendix A.1, differs from the associational risk ratio (ARR_ED). In an ideal randomized trial (ie, no confounding, full adherence to treatment, perfect blinding, no losses to follow up) such as the one represented in Figure 2, there is no bias and the association measure equals the causal effect measure.

Because nonchance associations are generated by structures (1), (2), and (3), it follows that biases could be classified on the basis of these structures:

1. Cause and effect could create bias as a result of reverse causation. For example, in many case-control studies, the outcome precedes the exposure measurement. Thus, the association of the outcome with measured exposure could in part reflect bias attributable to the outcome's effect on measured exposure [7, 8]. Examples of reverse causation bias include not only recall bias in case-control studies, but also more general forms of information bias like, for example, when a blood parameter affected by the presence of cancer is measured after the cancer is present.

FIGURE 1. Common cause L of exposure E and outcome D.

FIGURE 2. Causal effect of exposure E on outcome D.

FIGURE 3. Conditioning on a common effect C of exposure E and outcome D.

FIGURE 4. Conditioning on a common effect M of exposure E and outcome D.


2. Common causes: In general, when the exposure and outcome share a common cause, the association measure differs from the effect measure. Epidemiologists tend to use the term confounding to refer to this bias.

3. Conditioning on common effects: We propose that this structure is the source of those biases that epidemiologists refer to as selection bias. We argue by way of example.

EXAMPLES OF SELECTION BIAS

Inappropriate Selection of Controls in a Case-Control Study

Figure 5 represents a case-control study of the effect of postmenopausal estrogens (E) on the risk of myocardial infarction (D). The variable C indicates whether a woman in the population cohort is selected for the case-control study (yes = 1, no = 0). The arrow from disease status D to selection C indicates that cases in the cohort are more likely to be selected than noncases, which is the defining feature of a case-control study. In this particular case-control study, investigators selected controls preferentially among women with a hip fracture (F), which is represented by an arrow from F to C. There is an arrow from E to F to represent the protective effect of estrogens on hip fracture. Note Figure 5 is essentially the same as Figure 3, except we have now elaborated the causal pathway from E to C.

In a case-control study, the associational exposure-disease odds ratio (AOR_ED|C=1) is by definition conditional on having been selected into the study (C = 1). If subjects with hip fracture F are oversampled as controls, then the probability of control selection depends on a consequence F of the exposure (as represented by the path from E to C through F) and "inappropriate control selection" bias will occur (eg, AOR_ED|C=1 will differ from 1.0, even when, like in Figure 5, the exposure has no effect on the disease). This bias arises because we are conditioning on a common effect C of exposure and disease. A heuristic explanation of this bias follows. Among subjects selected for the study, controls are more likely than cases to have had a hip fracture. Therefore, because estrogens lower the incidence of hip fractures, a control is less likely to be on estrogens than a case, and hence AOR_ED|C=1 is greater than 1.0, even though the exposure does not cause the outcome. Identical reasoning would explain that the expected AOR_ED|C=1 would be greater than the causal OR_ED even had the causal OR_ED differed from 1.0.
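A sketch of this mechanism in R (illustrative only; the variable names HF and the selection probabilities are arbitrary assumptions, not taken from the article):

  set.seed(2)
  n  <- 1e6
  E  <- rbinom(n, 1, 0.3)                          # postmenopausal estrogens
  D  <- rbinom(n, 1, 0.05)                         # myocardial infarction; E has no effect on D
  HF <- rbinom(n, 1, ifelse(E == 1, 0.02, 0.10))   # hip fracture; estrogens are protective
  # Selection: every case enters; controls are sampled preferentially among women with hip fracture
  C  <- rbinom(n, 1, ifelse(D == 1, 1, ifelse(HF == 1, 0.20, 0.02)))

  or2x2 <- function(e, d) (sum(e == 1 & d == 1) * sum(e == 0 & d == 0)) /
                          (sum(e == 1 & d == 0) * sum(e == 0 & d == 1))
  or2x2(E, D)                    # population odds ratio: approximately 1
  or2x2(E[C == 1], D[C == 1])    # odds ratio among those selected (C = 1): above 1, ie spurious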

Berkson's Bias

Berkson [15] pointed out that 2 diseases (E and D) that are unassociated in the population could be associated among hospitalized patients when both diseases affect the probability of hospital admission. By taking C in Figure 3 to be the indicator variable for hospitalization, we recognize that Berkson's bias comes from conditioning on the common effect C of diseases E and D. As a consequence, in a case-control study in which the cases were hospitalized patients with disease D and controls were hospitalized patients with disease E, an exposure R that causes disease E would appear to be a risk factor for disease D (ie, Fig. 3 is modified by adding factor R and an arrow from R to E). That is, AOR_RD|C=1 would differ from 1.0 even if R does not cause D.

FIGURE 5. Selection bias in a case-control study. See text for details.

FIGURE 6. Selection bias in a cohort study. See text for details.

Differential Loss to Follow Up in Longitudinal Studies


Figure 6a represents a follow-up study of the effect of antiretroviral therapy (E) on AIDS (D) risk among HIV-infected patients. The greater the true level of immunosuppression (U), the greater the risk of AIDS. U is unmeasured. If a patient drops out from the study, his AIDS status cannot be assessed and we say that he is censored (C = 1). Patients with greater values of U are more likely to be lost to follow up because the severity of their disease prevents them from attending future study visits. The effect of U on censoring is mediated by presence of symptoms (fever, weight loss, diarrhea, and so on), CD4 count, and viral load in plasma, all summarized in the (vector) variable L, which could or could not be measured. The role of L, when measured, in data analysis is discussed in the next section; in this section, we take L to be unmeasured. Patients receiving treatment are at a greater risk of experiencing side effects, which could lead them to drop out, as represented by the arrow from E to C. For simplicity, assume that treatment E does not cause D and so there is no arrow from E to D (CRR_ED = 1.0). The square around C indicates that the analysis is restricted to those patients who did not drop out (C = 0). The associational risk (or rate) ratio ARR_ED|C=0 differs from 1.0. This "differential loss to follow-up" bias is an example of bias resulting from structure (3) because it arises from conditioning on the censoring variable C, which is a common effect of exposure E and a cause U of the outcome.

An intuitive explanation of the bias follows. If a treated subject with treatment-induced side effects (and thereby at a greater risk of dropping out) did in fact not drop out (C = 0), then it is generally less likely that a second cause of dropping out (eg, a large value of U) was present. Therefore, an inverse association between E and U would be expected. However, U is positively associated with the outcome D. Therefore, restricting the analysis to subjects who did not drop out of this study induces an inverse association (mediated by U) between exposure and outcome, ie, ARR_ED|C=0 is not equal to 1.0.
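The same point can be illustrated with a small R simulation (not part of the original article; the probabilities below are arbitrary assumptions):

  set.seed(3)
  n <- 1e6
  U <- rbinom(n, 1, 0.4)                              # unmeasured immunosuppression
  E <- rbinom(n, 1, 0.5)                              # randomized treatment; no effect on D
  D <- rbinom(n, 1, ifelse(U == 1, 0.30, 0.05))       # AIDS risk is driven by U only
  C <- rbinom(n, 1, plogis(-2.5 + 1.5*E + 1.5*U))     # censoring: more likely with side effects (E) or severe disease (U)

  arr <- function(e, d) mean(d[e == 1]) / mean(d[e == 0])
  arr(E, D)                      # full cohort: approximately 1 (treatment truly has no effect)
  arr(E[C == 0], D[C == 0])      # restricted to the uncensored: below 1, a spurious apparent benefit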

Figure 6a is a simple transformation of Figure 3 that also represents bias resulting from structure (3): the association between D and C resulting from a direct effect of D on C in Figure 3 is now the result of U, a common cause of D and C. We now present 3 additional structures (Figs. 6b-d), which could lead to selection bias by differential loss to follow up.

Figure 6b is a variation of Figure 6a. If prior treatment has a direct effect on symptoms, then restricting the study to the uncensored individuals again implies conditioning on the common effect C of the exposure and U, thereby introducing a spurious association between treatment and outcome. Figures 6a and 6b could depict either an observational study or an experiment in which treatment E is randomly assigned, because there are no common causes of E and any other variable. Thus, our results demonstrate that randomized trials are not free of selection bias as a result of differential loss to follow up because such selection occurs after the randomization.

Figures 6c and d are variations of Figures 6a and b, respectively, in which there is a common cause U* of E and another measured variable. U* indicates unmeasured lifestyle/personality/educational variables that determine both treatment (through the arrow from U* to E) and either attitudes toward attending study visits (through the arrow from U* to C in Fig. 6c) or threshold for reporting symptoms (through the arrow from U* to L in Fig. 6d). Again, these 2 are examples of bias resulting from structure (3) because the bias arises from conditioning on the common effect C of both a cause U* of E and a cause U of D. This particular bias has been referred to as M bias [12]. The bias caused by differential loss to follow up in Figures 6a-d is also referred to as bias due to informative censoring.

Nonresponse Bias/Missing Data Bias

The variable C in Figures 6a-d can represent missing data on the outcome for any reason, not just as a result of loss to follow up. For example, subjects could have missing data because they are reluctant to provide information or because they miss study visits. Regardless of the reasons why data on D are missing, standard analyses restricted to subjects with complete data (C = 0) will be biased.

Volunteer Bias/Self-selection Bias

Figures 6a-d can also represent a study in which C is agreement to participate (yes = 1, no = 0), E is cigarette smoking, D is coronary heart disease, U is family history of heart disease, and U* is healthy lifestyle. (L is any mediator between U and C such as heart disease awareness.) Under any of these structures, there would be no bias if the study population was a representative (ie, random) sample of the target population. However, bias will be present if the study is restricted to those who volunteered or elected to participate (C = 1). Volunteer bias cannot occur in a randomized study in which subjects are randomized (ie, exposed) only after agreeing to participate, because none of Figures 6a-d can represent such a trial. Figures 6a and b are eliminated because exposure cannot cause C. Figures 6c and d are eliminated because, as a result of the random exposure assignment, there cannot exist a common cause of exposure and any other variable.

Healthy Worker Bias

Figures 6a-d can also describe a bias that could arise when estimating the effect of a chemical E (an occupational exposure) on mortality D in a cohort of factory workers. The underlying unmeasured true health status U is a determinant of both death (D) and of being at work (C). The study is restricted to individuals who are at work (C = 1) at the time of outcome ascertainment. (L could be the result of blood tests and a physical examination.)


in Figures 6a and b, or through a common cause U* (eg,certain exposed jobs are eliminated for economic reasons andthe workers laid off) like in Figures 6c and d.

This “healthy worker” bias is an example of biasresulting from structure (3) because it arises from condition-ing on the censoring variable C, which is a common effect of(a cause of) exposure and (a cause of) the outcome. However,the term “healthy worker” bias is also used to describe thebias that occurs when comparing the risk in certain group ofworkers with that in a group of subjects from the generalpopulation. This second bias can be depicted by the DAG inFigure 1 in which L represents health status, E representsmembership in the group of workers, and D represents theoutcome of interest. There are arrows from L to E and Dbecause being healthy affects job type and risk of subsequentoutcome, respectively. In this case, the bias is caused bystructure (1) and would therefore generally be considered tobe the result of confounding.

These examples lead us to propose that the term selec-tion bias in causal inference settings be used to refer to anybias that arises from conditioning on a common effect as inFigure 3 or its variations (Figs. 4–6).

In addition to the examples given here, DAGs havebeen used to characterize various other selection biases. Forexample, Robins7 explained how certain attempts to elimi-nate ascertainment bias in studies of estrogens and endome-trial cancer could themselves induce bias16; Hernan et al.8

discussed incidence–prevalence bias in case-control studiesof birth defects; and Cole and Hernan9 discussed the bias thatcould be introduced by standard methods to estimate directeffects.17,18 In Appendix A.2, we provide a final example: thebias that results from the use of the hazard ratio as an effectmeasure. We deferred this example to the appendix becauseof its greater technical complexity. (Note that standard DAGsdo not represent “effect modification” or “interactions” be-tween variables, but this does not affect their ability torepresent the causal structures that produce bias, as morefully explained in Appendix A.3).

To demonstrate the generality of our approach to selection bias, we now show that a bias that arises in longitudinal studies with time-varying exposures19 can also be understood as a form of selection bias.

Adjustment for Variables Affected by Previous Exposure (or its causes)

Consider a follow-up study of the effect of antiretroviral therapy (E) on viral load at the end of follow-up (D = 1 if detectable, D = 0 otherwise) in HIV-infected subjects. The greater a subject’s unmeasured true immunosuppression level (U), the greater her viral load D and the lower the CD4 count L (low = 1, high = 0). Treatment increases CD4 count, and the presence of low CD4 count (a proxy for the true level of immunosuppression) increases the probability of receiving treatment. We assume that, in truth but unknown to the data analyst, treatment has no causal effect on the outcome D. The DAGs in Figures 7a and b represent the first 2 time points of the study. At time 1, treatment E1 is decided after observing the subject’s risk factor profile L1. (E0 could be decided after observing L0, but the inclusion of L0 in the DAG would not essentially alter our main point.) Let E be the sum of E0 and E1. The cumulative exposure variable E can therefore take 3 values: 0 (if the subject is not treated at any time), 1 (if the subject is treated at time 1 only or at time 2 only), and 2 (if the subject is treated at both times). Suppose the analyst’s interest lies in comparing the risk had all subjects been always treated (E = 2) with that had all subjects never been treated (E = 0), and that the causal risk ratio is 1.0 (CRR_ED = 1, when comparing E = 2 vs. E = 0).

To estimate the effect of E without bias, the analyst needs to be able to estimate the effect of each of its components E0 and E1 simultaneously and without bias.17 As we will see, this is not possible using standard methods, even when data on L1 are available, because lack of adjustment for L1 precludes unbiased estimation of the causal effect of E1, whereas adjustment for L1 by stratification (or, equivalently, by conditioning, matching, or regression adjustment) precludes unbiased estimation of the causal effect of E0.

Unlike previous structures, Figures 7a and 7b contain a common cause of the (component E1 of) exposure E and the outcome D, so one needs to adjust for L1 to eliminate confounding.

FIGURE 7. Adjustment for a variable affected by previous exposure.


The standard approach to confounder control is stratification: the associational risk ratio is computed in each level of the variable L1. The square around the node L1 denotes that the associational risk ratios (ARR_ED|L=0 and ARR_ED|L=1) are conditional on L1. Examples of stratification-based methods are a Mantel-Haenszel stratified analysis or regression models (linear, logistic, Poisson, Cox, and so on) that include the covariate L1. (Not including interaction terms between L1 and the exposure in a regression model is equivalent to assuming homogeneity of ARR_ED|L=0 and ARR_ED|L=1.) To calculate ARR_ED|L=l, the data analyst has to select (ie, condition on) the subset of the population with value L1 = l. However, in this example, the process of choosing this subset results in selection on a variable L1 affected by (a component E0 of) exposure E, and thus can result in bias, as we now describe.

Although stratification is commonly used to adjust for confounding, it can have unintended effects when the association measure is computed within levels of L1 and, in addition, L1 is caused by or shares causes with a component E0 of E. Among those with low CD4 count (L1 = 1), being on treatment (E0 = 1) makes it more likely that the person is severely immunodepressed; among those with a high level of CD4 (L1 = 0), being off treatment (E0 = 0) makes it more likely that the person is not severely immunodepressed. Thus, the side effect of stratification is to induce an association between prior exposure E0 and U, and therefore between E0 and the outcome D. Stratification eliminates confounding for E1 at the cost of introducing selection bias for E0. The net bias for any particular summary of the time-varying exposure that is used in the analysis (cumulative exposure, average exposure, and so on) depends on the relative magnitude of the confounding that is eliminated and the selection bias that is created. In summary, the associational (conditional) risk ratio ARR_ED|L1 could be different from 1.0 even if the exposure history has no effect on the outcome of any subjects.

Conditioning on confounders L1 which are affected by previous exposure can create selection bias even if the confounder is not on a causal pathway between exposure and outcome. In fact, no such causal pathway exists in Figures 7a and 7b. On the other hand, in Figure 7c the confounder L1 for subsequent exposure E1 lies on a causal pathway from earlier exposure E0 to an outcome D. Nonetheless, conditioning on L1 still results in selection bias. Were the potential for selection bias not present in Figure 7c (eg, were U not a common cause of L1 and D), the association of cumulative exposure E with the outcome D within strata of L1 could be an unbiased estimate of the direct effect18 of E not through L1, but still would not be an unbiased estimate of the overall effect of E on D, because the effect of E0 mediated through L1 is not included.

ADJUSTING FOR SELECTION BIAS

Selection bias can sometimes be avoided by an adequate design, such as by sampling controls in a manner that ensures they will represent the exposure distribution in the population. Other times, selection bias can be avoided by appropriately adjusting for confounding by using alternatives to stratification-based methods (see subsequently) in the presence of time-dependent confounders affected by previous exposure.

However, appropriate design and confounding adjustment cannot immunize studies against selection bias. For example, loss to follow-up, self-selection, and, in general, missing data leading to bias can occur no matter how careful the investigator. In those cases, the selection bias needs to be explicitly corrected in the analysis, when possible.

Selection bias correction, as we briefly describe, could sometimes be accomplished by a generalization of inverse probability weighting20–23 estimators for longitudinal studies. Consider again Figures 6a–d and assume that L is measured. Inverse probability weighting is based on assigning a weight to each selected subject so that she accounts in the analysis not only for herself, but also for those with similar characteristics (ie, those with the same values of L and E) who were not selected. The weight is the inverse of the probability of her selection. For example, if there are 4 untreated women, age 40–45 years, with CD4 count < 500, in our cohort study, and 3 of them are lost to follow-up, then these 3 subjects do not contribute to the analysis (ie, they receive a zero weight), whereas the remaining woman receives a weight of 4. In other words, the (estimated) conditional probability of remaining uncensored is 1/4 = 0.25, and therefore the (estimated) weight for the uncensored subject is 1/0.25 = 4. Inverse probability weighting creates a pseudopopulation in which the 4 subjects of the original population are replaced by 4 copies of the uncensored subject.
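As a concrete illustration of the weighting arithmetic just described, the following minimal R sketch (added for these lecture notes; the data frame and variable names are invented) computes inverse-probability-of-censoring weights within strata of E and L:

# Illustrative only: IP-of-censoring weights for the 4-women example above.
d <- data.frame(
  E          = c(0, 0, 0, 0, 1, 1),   # treatment
  L          = c(1, 1, 1, 1, 0, 0),   # eg, low CD4 count indicator
  uncensored = c(1, 0, 0, 0, 1, 1)    # 3 of the 4 untreated women are lost to follow-up
)
# estimated Pr(remaining uncensored | E, L): observed proportion within each (E, L) stratum
d$p_uncens <- ave(d$uncensored, d$E, d$L, FUN = mean)
# censored subjects contribute nothing; uncensored subjects are up-weighted by 1/p
d$w <- ifelse(d$uncensored == 1, 1 / d$p_uncens, 0)
d   # the remaining untreated woman gets weight 1/0.25 = 4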

The effect measure based on the pseudopopulation, in contrast to that based on the original population, is unaffected by selection bias provided that the outcome in the uncensored subjects truly represents the unobserved outcomes of the censored subjects (with the same values of E and L). This provision will be satisfied if the probability of selection (the denominator of the weight) is calculated conditional on E and on all additional factors that independently predict both selection and the outcome. Unfortunately, one can never be sure that these additional factors were identified and recorded in L, and thus the causal interpretation of the resulting adjustment for selection bias depends on this untestable assumption.

One might attempt to remove selection bias by stratification (ie, by estimating the effect measure conditional on the L variables) rather than by weighting. Stratification could yield unbiased conditional effect measures within levels of L


under the assumptions that all relevant L variables were measured and that the exposure does not cause or share a common cause with any variable in L. Thus, stratification would work (ie, it would provide an unbiased conditional effect measure) under the causal structures depicted in Figures 6a and c, but not under those in Figures 6b and d. Inverse probability weighting appropriately adjusts for selection bias under all these situations because this approach is not based on estimating effect measures conditional on the covariates L, but rather on estimating unconditional effect measures after reweighting the subjects according to their exposure and their values of L.

Inverse probability weighting can also be used to adjust for the confounding of later exposure E1 by L1, even when exposure E0 either causes L1 or shares a common cause with L1 (Figs. 7a–7c), a situation in which stratification fails. When using inverse probability weighting to adjust for confounding, we model the probability of exposure or treatment given past exposure and past L, so that the denominator of a subject’s weight is, informally, the subject’s conditional probability of receiving her treatment history. We therefore refer to this method as inverse-probability-of-treatment weighting.22

One limitation of inverse probability weighting is that all conditional probabilities (of receiving certain treatment or censoring history) must be different from zero. This would not be true, for example, in occupational studies in which the probability of being exposed to a chemical is zero for those not working. In these cases, g-estimation19 rather than inverse probability weighting can often be used to adjust for selection bias and confounding.

The use of inverse probability weighting can provide unbiased estimates of causal effects even in the presence of selection bias because the method works by creating a pseudopopulation in which censoring (or missing data) has been abolished and in which the effect of the exposure is the same as in the original population. Thus, the pseudopopulation effect measure is equal to the effect measure had nobody been censored. For example, Figure 8 represents the pseudopopulation corresponding to the population of Figure 6a when the weights were estimated conditional on L and E. The censoring node is now lower-case because it does not correspond to a random variable but to a constant (everybody is uncensored in the pseudopopulation). This interpretation is desirable when censoring is the result of loss to follow-up or nonresponse, but questionably helpful when censoring is the result of competing risks. For example, in a study aimed at estimating the effect of certain exposure on the risk of Alzheimer’s disease, we might not wish to base our effect estimates on a pseudopopulation in which all other causes of death (cancer, heart disease, stroke, and so on) have been removed, because it is unclear even conceptually what sort of medical intervention would produce such a population. Another more pragmatic reason is that no feasible intervention could possibly remove just one cause of death without affecting the others as well.24

DISCUSSION

The terms “confounding” and “selection bias” are used in multiple ways. For instance, the same phenomenon is sometimes named “confounding by indication” by epidemiologists and “selection bias” by statisticians/econometricians. Others use the term “selection bias” when “confounders” are unmeasured. Sometimes the distinction between confounding and selection bias is blurred in the term “selection confounding.”

We elected to refer to the presence of common causes as “confounding” and to refer to conditioning on common effects as “selection bias.” This structural definition provides a clear-cut classification of confounding and selection bias, even though it might not coincide perfectly with the traditional, often discipline-specific, terminologies. Our goal, however, was not to be normative about terminology, but rather to emphasize that, regardless of the particular terms chosen, there are 2 distinct causal structures that lead to these biases. The magnitude of both biases depends on the strength of the causal arrows involved.12,25 (When 2 or more common effects have been conditioned on, an even more general formulation of selection bias is useful. For a brief discussion, see Appendix A.4.)

The end result of both structures is the same: noncomparability (also referred to as lack of exchangeability) between the exposed and the unexposed. For example, consider a cohort study restricted to firefighters that aims to estimate the effect of being physically active (E) on the risk of heart disease (D) (as represented in Fig. 9). For simplicity, we have assumed that, although unknown to the data analyst, E does not cause D.

FIGURE 8. Causal diagram in the pseudopopulation created by inverse probability weighting.
FIGURE 9. The firefighters’ study.


Parental socioeconomic status (L) affects the risk of becoming a firefighter (C) and, through childhood diet, of heart disease (D). Attraction toward activities that involve physical activity (an unmeasured variable U) affects the risk of becoming a firefighter and of being physically active (E). U does not affect D, and L does not affect E. According to our terminology, there is no confounding because there are no common causes of E and D. Thus, if our study population had been a random sample of the target population, the crude associational risk ratio ARR_ED would have been equal to the causal risk ratio CRR_ED of 1.0.

However, in a study restricted to firefighters, the crude ARR_ED and CRR_ED would differ because conditioning on a common effect C of causes of exposure and outcome induces selection bias, resulting in noncomparability of the exposed and unexposed firefighters. To the study investigators, the distinction between confounding and selection bias is moot because, regardless of nomenclature, they must stratify on L to make the exposed and the unexposed firefighters comparable. This example demonstrates that a structural classification of bias does not always have consequences for either the analysis or interpretation of a study. Indeed, for this reason, many epidemiologists use the term “confounder” for any variable L on which one has to stratify to create comparability, regardless of whether the (crude) noncomparability was the result of conditioning on a common effect or the result of a common cause of exposure and disease.

There are, however, advantages of adopting a structural or causal approach to the classification of biases. First, the structure of the problem frequently guides the choice of analytical methods to reduce or avoid the bias. For example, in longitudinal studies with time-dependent confounding, identifying the structure allows us to detect situations in which stratification-based methods would adjust for confounding at the expense of introducing selection bias. In those cases, inverse probability weighting or g-estimation are better alternatives. Second, even when understanding the structure of bias does not have implications for data analysis (like in the firefighters’ study), it could still help study design. For example, investigators running a study restricted to firefighters should make sure that they collect information on joint risk factors for the outcome and for becoming a firefighter. Third, selection bias resulting from conditioning on preexposure variables (eg, being a firefighter) could explain why certain variables behave as “confounders” in some studies but not others. In our example, parental socioeconomic status would not necessarily need to be adjusted for in studies not restricted to firefighters. Finally, causal diagrams enhance communication among investigators because they can be used to provide a rigorous, formal definition of terms such as “selection bias.”

ACKNOWLEDGMENTS

We thank Stephen Cole and Sander Greenland for their helpful comments.

REFERENCES
1. Rothman KJ, Greenland S. Modern Epidemiology, 2nd ed. Philadelphia: Lippincott-Raven; 1998.
2. Szklo M, Nieto FJ. Epidemiology: Beyond the Basics. Gaithersburg, MD: Aspen; 2000.
3. MacMahon B, Trichopoulos D. Epidemiology: Principles & Methods, 2nd ed. Boston: Little, Brown and Co; 1996.
4. Hennekens CH, Buring JE. Epidemiology in Medicine. Boston: Little, Brown and Co; 1987.
5. Gordis L. Epidemiology. Philadelphia: WB Saunders Co; 1996.
6. Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10:37–48.
7. Robins JM. Data, design, and background knowledge in etiologic inference. Epidemiology. 2001;11:313–320.
8. Hernan MA, Hernandez-Diaz S, Werler MM, et al. Causal knowledge as a prerequisite for confounding evaluation: an application to birth defects epidemiology. Am J Epidemiol. 2002;155:176–184.
9. Cole SR, Hernan MA. Fallibility in the estimation of direct effects. Int J Epidemiol. 2002;31:163–165.
10. Maclure M, Schneeweiss S. Causation of bias: the episcope. Epidemiology. 2001;12:114–122.
11. Greenland S, Brumback BA. An overview of relations among causal modeling methods. Int J Epidemiol. 2002;31:1030–1037.
12. Greenland S. Quantifying biases in causal models: classical confounding versus collider-stratification bias. Epidemiology. 2003;14:300–306.
13. Pearl J. Causal diagrams for empirical research. Biometrika. 1995;82:669–710.
14. Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search. Lecture Notes in Statistics 81. New York: Springer-Verlag; 1993.
15. Berkson J. Limitations of the application of fourfold table analysis to hospital data. Biometrics. 1946;2:47–53.
16. Greenland S, Neutra RR. An analysis of detection bias and proposed corrections in the study of estrogens and endometrial cancer. J Chronic Dis. 1981;34:433–438.
17. Robins JM. A new approach to causal inference in mortality studies with a sustained exposure period—application to the healthy worker survivor effect [published errata appear in Mathematical Modelling. 1987;14:917–921]. Mathematical Modelling. 1986;7:1393–1512.
18. Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992;3:143–155.
19. Robins JM. Causal inference from complex longitudinal data. In: Berkane M, ed. Latent Variable Modeling and Applications to Causality. Lecture Notes in Statistics 120. New York: Springer-Verlag; 1997:69–117.
20. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47:663–685.
21. Robins JM, Finkelstein DM. Correcting for noncompliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics. 2000;56:779–788.
22. Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11:561–570.
23. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560.
24. Greenland S. Causality theory for policy uses of epidemiologic measures. In: Murray CJL, Salomon JA, Mathers CD, et al, eds. Summary Measures of Population Health. Cambridge, MA: Harvard University Press/WHO; 2002.
25. Walker AM. Observation and Inference: An Introduction to the Methods of Epidemiology. Newton Lower Falls: Epidemiology Resources Inc; 1991.
26. Greenland S. Absence of confounding does not correspond to collapsibility of the rate ratio or rate difference. Epidemiology. 1996;7:498–501.


APPENDIX

A.1. Causal and Associational Risk Ratio

For a given subject, E has a causal effect on D if the subject’s value of D had she been exposed differs from the value of D had she remained unexposed. Formally, letting D_i,e=1 and D_i,e=0 be subject i’s (counterfactual or potential) outcomes when exposed and unexposed, respectively, we say there is a causal effect for subject i if D_i,e=1 ≠ D_i,e=0. Only one of the counterfactual outcomes can be observed for each subject (the one corresponding to his observed exposure), ie, D_i,e = D_i if E_i = e, where D_i and E_i represent subject i’s observed outcome and exposure. For a population, we say that there is no average causal effect (preventive or causative) of E on D if the average of D would remain unchanged whether the whole population had been treated or untreated, ie, when Pr(D_e=1 = 1) = Pr(D_e=0 = 1) for a dichotomous D. Equivalently, we say that E does not have a causal effect on D if the causal risk ratio is one, ie, CRR_ED = Pr(D_e=1 = 1)/Pr(D_e=0 = 1) = 1.0. For an extension of counterfactual theory and methods to complex longitudinal data, see reference 19.

In a DAG, CRR_ED = 1.0 is represented by the lack of a directed path of arrows originating from E and ending on D as, for example, in Figure 5. We shall refer to a directed path of arrows as a causal path. On the other hand, in Figure 5, CRR_EC ≠ 1.0 because there is a causal path from E to C through F. The lack of a direct arrow from E to C implies that E does not have a direct effect on C (relative to the other variables on the DAG), ie, the effect is wholly mediated through other variables on the DAG (ie, F).

For a population, we say that there is no association between E and D if the average of D is the same in the subset of the population that was exposed as in the subset that was unexposed, ie, when Pr(D = 1 | E = 1) = Pr(D = 1 | E = 0) for a dichotomous D. Equivalently, we say that E and D are unassociated if the associational risk ratio is 1.0, ie, ARR_ED = Pr(D = 1 | E = 1)/Pr(D = 1 | E = 0) = 1.0. The associational risk ratio can always be estimated from observational data. We say that there is bias when the causal risk ratio in the population differs from the associational risk ratio, ie, CRR_ED ≠ ARR_ED.
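For illustration (this example is not part of the original article, and the counts are invented), the associational risk ratio can be computed directly from observed counts in R:

# Illustrative only: ARR_ED from made-up 2 x 2 counts.
exposed   <- c(D1 = 30, D0 = 70)    # 30 of 100 exposed subjects developed D
unexposed <- c(D1 = 20, D0 = 80)    # 20 of 100 unexposed subjects developed D
ARR <- (exposed["D1"] / sum(exposed)) / (unexposed["D1"] / sum(unexposed))
ARR   # 0.30 / 0.20 = 1.5; equals the causal risk ratio only in the absence of bias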

A.2. Hazard Ratios as Effect Measures

The causal DAG in Appendix Figure 1a describes a randomized study of the effect of surgery E on death at times 1 (D1) and 2 (D2). Suppose the effect of exposure on D1 is protective. Then the lack of an arrow from E to D2 indicates that, although the exposure E has a direct protective effect (decreases the risk of death) at time 1, it has no direct effect on death at time 2. That is, the exposure does not influence the survival status at time 2 of any subject who would survive past time 1 when unexposed (and thus when exposed). Suppose further that U is an unmeasured haplotype

that decreases the subject’s risk of death at all times. The associational risk ratios ARR_ED1 and ARR_ED2 are unbiased measures of the effect of E on death at times 1 and 2, respectively. (Because of the absence of confounding, ARR_ED1 and ARR_ED2 equal the causal risk ratios CRR_ED1 and CRR_ED2, respectively.) Note that, even though E has no direct effect on D2, ARR_ED2 (or, equivalently, CRR_ED2) will be less than 1.0 because it is a measure of the effect of E on total mortality through time 2.

Consider now the time-specific associational hazard (rate) ratio as an effect measure. In discrete time, the hazard of death at time 1 is the probability of dying at time 1 and thus is the same as ARR_ED1. However, the hazard at time 2 is the probability of dying at time 2 among those who survived past time 1. Thus, the associational hazard ratio at time 2 is then ARR_ED2|D1=0. The square around D1 in Appendix Figure 1a indicates this conditioning. Exposed survivors of time 1 are less likely than unexposed survivors of time 1 to have the protective haplotype U (because exposure can explain their survival) and therefore are more likely to die at time 2. That is, conditional on D1 = 0, exposure is associated with a higher mortality at time 2. Thus, the hazard ratio at time 1 is less than 1.0, whereas the hazard ratio at time 2 is greater than 1.0, ie, the hazards have crossed. We conclude that the hazard ratio at time 2 is a biased estimate of the direct effect of exposure on mortality at time 2. The bias is selection bias arising from conditioning on a common effect D1 of exposure and of U, which is a cause of D2, that opens the noncausal (ie, associational) path E → D1 ← U → D2 between E and D2.13 In the survival analysis literature, an unmeasured cause of death that is marginally unassociated with exposure, such as U, is often referred to as a frailty.
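The crossing of hazards described above can be reproduced numerically. The following R simulation is an added illustration with arbitrary parameter values; it assumes a randomized binary exposure E, a protective frailty U, and no effect of E on death at time 2:

# Illustrative simulation (not from the paper): frailty U makes the time-2 hazard
# ratio exceed 1 even though E has no effect on death at time 2.
set.seed(1)
n  <- 1e6
U  <- rbinom(n, 1, 0.5)                          # protective haplotype (frailty)
E  <- rbinom(n, 1, 0.5)                          # randomized exposure
D1 <- rbinom(n, 1, plogis(-0.5 - 3*E - 3*U))     # death at time 1: E and U protective
D2 <- rbinom(n, 1, plogis(-0.5 - 3*U))           # death at time 2: depends on U only
HR1 <- mean(D1[E == 1]) / mean(D1[E == 0])                        # well below 1
HR2 <- mean(D2[E == 1 & D1 == 0]) / mean(D2[E == 0 & D1 == 0])    # above 1: selection bias
c(HR1 = HR1, HR2 = HR2)                                           # the hazards have crossed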

In contrast to this, the conditional hazard ratio ARR_ED2|D1=0,U at D2 given U is equal to 1.0 within each stratum of U because the path E → D1 ← U → D2 between E and D2 is now blocked by conditioning on the noncollider U. Thus, the conditional hazard ratio correctly indicates the absence of a direct effect of E on D2. The fact that the unconditional hazard ratio ARR_ED2|D1=0 differs from the common-stratum-specific hazard ratios of 1.0, even though U is independent of E, shows the noncollapsibility of the hazard ratio.26

Appendix Figure 1. Effect of exposure on survival.


Unfortunately, the unbiased measure ARR_ED2|D1=0,U of the direct effect of E on D2 cannot be computed because U is unobserved. In the absence of data on U, it is impossible to know whether exposure has a direct effect on D2. That is, the data cannot determine whether the true causal DAG generating the data was that in Appendix Figure 1a versus that in Appendix Figure 1b.

A.3. Effect Modification and Common Effects in DAGs

Although an arrow on a causal DAG represents a direct effect, a standard causal DAG does not distinguish a harmful effect from a protective effect. Similarly, a standard DAG does not indicate the presence of effect modification. For example, although Appendix Figure 1a implies that both E and U affect death D1, the DAG does not distinguish among the following 3 qualitatively distinct ways that U could modify the effect of E on D1:

1. The causal effect of exposure E on mortality D1 is in the same direction (ie, harmful or beneficial) in both stratum U = 1 and stratum U = 0.

2. The direction of the causal effect of exposure E on mortality D1 in stratum U = 1 is the opposite of that in stratum U = 0 (ie, there is a qualitative interaction between U and E).

3. Exposure E has a causal effect on D1 in one stratum of U but no causal effect in the other stratum, eg, E only kills subjects with U = 0.

Because standard DAGs do not represent interaction, it follows that it is not possible to infer from a DAG the direction of the conditional association between 2 marginally independent causes (E and U) within strata of their common effect D1. For example, suppose that, in the presence of an undiscovered background factor V that is unassociated with E or U, having either E = 1 or U = 1 is sufficient and necessary to cause death (an “or” mechanism), but that neither E nor U causes death in the absence of V. Then, among those who died by time 1 (D1 = 1), E and U will be negatively associated, because it is more likely that an unexposed subject (E = 0) had U = 1, since the absence of exposure increases the chance that U was the cause of death. (Indeed, the logarithm of the conditional odds ratio OR_UE|D1=1 will approach minus infinity as the population prevalence of V approaches 1.0.) Although this “or” mechanism was the only explanation given in the main text for the conditional association of independent causes within strata of a common effect, other possibilities exist. For example, suppose that, in the presence of the undiscovered background factor V, having both E = 1 and U = 1 is sufficient and necessary to cause death (an “and” mechanism) and that neither E nor U causes death in the absence of V. Then, among those who die by time 1, those who had been exposed (E = 1) are more likely to have the haplotype (U = 1), ie, E and U are positively correlated. A standard DAG such as that in Appendix Figure 1a fails to distinguish between the case of E and U interacting through an “or” mechanism and the case of an “and” mechanism.
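A small simulation makes the “or” mechanism concrete. The R sketch below is an added illustration (the prevalences are arbitrary); it shows that E and U, although marginally independent, become negatively associated once we condition on the common effect D1 = 1:

# Illustrative only: collider bias under an "or" mechanism.
set.seed(1)
n <- 1e6
V <- rbinom(n, 1, 0.6)                          # undiscovered background factor
E <- rbinom(n, 1, 0.5)                          # exposure, independent of U and V
U <- rbinom(n, 1, 0.5)                          # haplotype, independent of E and V
D1 <- as.integer(V == 1 & (E == 1 | U == 1))    # death iff V present and (E or U)
cor(E, U)                                       # ~0: marginally independent
cor(E[D1 == 1], U[D1 == 1])                     # negative: associated among the dead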

Although conditioning on common effect D1 always induces a conditional association between independent causes E and U in at least one of the 2 strata of D1 (say, D1 = 1), there is a special situation under which E and U remain conditionally independent within the other stratum (say, D1 = 0). This situation occurs when the data follow a multiplicative survival model, that is, when the probability Pr(D1 = 0 | U = u, E = e) of survival (ie, D1 = 0) given E and U is equal to a product g(u) h(e) of functions of u and e. The multiplicative model Pr(D1 = 0 | U = u, E = e) = g(u) h(e) is equivalent to the model that assumes the survival ratio Pr(D1 = 0 | U = u, E = e)/Pr(D1 = 0 | U = u, E = 0) does not depend on u and is equal to h(e). (Note that if Pr(D1 = 0 | U = u, E = e) = g(u) h(e), then Pr(D1 = 1 | U = u, E = e) = 1 – g(u) h(e) does not follow a multiplicative mortality model. Hence, when E and U are conditionally independent given D1 = 0, they will be conditionally dependent given D1 = 1.)

Biologically, this multiplicative survival model will

hold when E and U affect survival through totally independent mechanisms, in such a way that U cannot possibly modify the effect of E on D1, and vice versa. For example, suppose that the surgery E affects survival through the removal of a tumor, whereas the haplotype U affects survival through increasing levels of low-density lipoprotein-cholesterol that result in an increased risk of heart attack (whether or not a tumor is present), and that death by tumor and death by heart attack are independent in the sense that they do not share a common cause. In this scenario, we can consider 2 cause-specific mortality variables: death from tumor D1A and death from heart attack D1B. The observed mortality variable D1 is equal to 1 (death) when either D1A or D1B is equal to 1, and D1 is equal to 0 (survival) when both D1A and D1B equal 0. We assume the measured variables are those in Appendix Figure 1a, so data on underlying cause of death are not recorded. Appendix Figure 2 is an expansion of Appendix Figure 1a that represents this scenario (variable D2 is not represented because it is not essential to the current discussion).

Appendix Figure 2. Multiplicative survival model.


Because D1 = 0 implies both D1A = 0 and D1B = 0, conditioning on observed survival (D1 = 0) is equivalent to simultaneously conditioning on D1A = 0 and D1B = 0 as well. As a consequence, we find by applying d-separation13 to Appendix Figure 2 that E and U are conditionally independent given D1 = 0, ie, the path between E and U through the conditioned-on collider D1 is blocked by conditioning on the noncolliders D1A and D1B.8 On the other hand, conditioning on D1 = 1 does not imply conditioning on any specific values of D1A and D1B, as the event D1 = 1 is compatible with 3 possible unmeasured events: D1A = 1 and D1B = 1, D1A = 1 and D1B = 0, and D1A = 0 and D1B = 1. Thus, the path between E and U through the conditioned-on collider D1 is not blocked, and thus E and U are associated given D1 = 1.

What is interesting about Appendix Figure 2 is that, by adding the unmeasured variables D1A and D1B, which functionally determine the observed variable D1, we have created an annotated DAG that succeeds in representing both the conditional independence between E and U given D1 = 0 and their conditional dependence given D1 = 1. As far as we are aware, this is the first time such a conditional independence structure has been represented on a DAG.

If E and U affect survival through a common mechanism, then there will exist an arrow either from E to D1B or from U to D1A, as shown in Appendix Figure 3a. In that case, the multiplicative survival model will not hold, and E and U will be dependent within both strata of D1. Similarly, if the causes D1A and D1B are not independent because of a common cause V, as shown in Appendix Figure 3b, the multiplicative survival model will not hold, and E and U will be dependent within both strata of D1.

In summary, conditioning on a common effect always induces an association between its causes, but this association could be restricted to certain levels of the common effect.

A.4. Generalizations of Structure (3)

Consider Appendix Figure 4a, representing a study restricted to firefighters (F = 1). E and D are unassociated among firefighters because the path E–F–A–C–D is blocked by C. If we then stratify on the covariate C, as in Appendix Figure 4b, E and D are conditionally associated among firefighters in a given stratum of C; yet C is neither caused by E nor by a cause of E. This example demonstrates that our previous formulation of structure (3) is insufficiently general to cover examples in which we have already conditioned on another variable F before conditioning on C. Note that one could try to argue that our previous formulation works by insisting that the set (F, C) of all variables conditioned on be regarded as a single supervariable, and then apply our previous formulation with this supervariable in place of C. This fix-up fails because it would require E and D to be conditionally associated within joint levels of the supervariable (C, F) in Appendix Figure 4c as well, which is not the case.

However, a general formulation that works in all settings is the following: a conditional association between E and D will occur within strata of a common effect C of 2 other variables, one of which is either the exposure or statistically associated with the exposure, and the other of which is either the outcome or statistically associated with the outcome.

Clearly, our earlier formulation is implied by the new formulation and, furthermore, the new formulation gives the correct results for both Appendix Figures 4b and 4c. A drawback of this new formulation is that it is not stated purely in terms of causal structures, because it makes reference to (possibly noncausal) statistical associations. It actually is possible to provide a fully general formulation in terms of causal structures, but it is not simple, so we do not give it here; see references 13 and 14.

Appendix Figure 3. Multiplicative survival model does not hold.
Appendix Figure 4. Conditioning on 2 variables.


Chapter 1 ~ Things You Should Know…

1. The difference between population and sample.
2. The informal notion of a random variable X and its distribution of values in a population.

3. The difference between parameters and statistics.
4. “Statistics” follows the “Classical Scientific Method” of correctly designing, conducting, and analyzing the random sample outcomes of an experiment, in order to test a specific null hypothesis on a population, and infer a formal conclusion.

5. The informal notion of significance level (say, α = .05) and confidence level (then 1 – α = .95).

6. The informal notion that the corresponding confidence interval gives the lowest and highest estimates for a population mean µ, based on a sample mean x̄. It can then be used to test any null hypothesis for µ.

7. The informal notion that if some null hypothesis for a population mean µ is true, then its corresponding acceptance region gives the expected lowest and highest estimates for a random sample mean x̄. (Outside of that is the corresponding rejection region.)

8. The informal notion that if the null hypothesis is true, then the “p-value of a particular sample outcome” measures the probability of obtaining that outcome (or one farther from the null value). Hence, a low p-value indicates that the sample provides evidence against the null hypothesis. (Formally, if the p-value is less than the predetermined significance level (say, α = .05), then the null hypothesis can be rejected in favor of a complementary alternative hypothesis.)

9. The different types of medical study design.

For Biostatistics courses only (i.e., not Stat 301).

Recall: a population is usually “arbitrarily large” or “infinite” and is described by parameters (a.k.a. “population characteristics,” e.g., mean µ); a sample is always finite and is described by statistics (a.k.a. “sample characteristics,” e.g., mean x̄).


Chapter 2 ~ Things You Should Know…
1. The different types of random variable on a population; how to classify data.

2. Graph numerical sample data, especially histograms (frequency, relative frequency, density) and cumulative distribution function (cdf).

3. Calculate “Summary Statistics” for grouped and ungrouped sample data:

• Measures of Center: mode, median, and mean x̄
• Quartiles: Q1, Q2 (= median), Q3, and other percentiles
• Measures of Spread: range, Interquartile Range (IQR), variance s², and standard deviation s
• Understand the behaviors these have with respect to outliers, skew, etc. (see the R sketch below)

4. Calculate proportion of sample (especially grouped data) between two given values a and b.
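A minimal R illustration of the summary statistics in item 3, using a small made-up sample (the data values are arbitrary):

x <- c(2.3, 3.1, 3.8, 4.0, 4.4, 5.2, 5.9, 6.5, 7.1, 12.0)   # note the outlier 12.0
mean(x); median(x)                    # measures of center
quantile(x, c(0.25, 0.50, 0.75))      # quartiles Q1, Q2, Q3
range(x); IQR(x); var(x); sd(x)       # measures of spread
# the mean and sd are pulled up by the outlier; the median and IQR are more resistant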

Classification of data types:
• Numerical (Quantitative)
  o Continuous — Ex: Foot length
  o Discrete — Ex: Shoe size
• Categorical (Qualitative)
  o Nominal (unranked) — Ex: Zip code
    - Non-binary (Non-dichotomous): ≥ 3 categories — Ex: Race (White = 1, Hisp = 2, ...)
    - Binary (Dichotomous): 2 categories — Ex: Sex (M = 0, F = 1)
  o Ordinal (ranked) — Ex: Alphabet (A = 1, ..., Z = 26)



Chapter 3 ~ Things You Should Know…

1. Basic Definitions

• How to represent the sample space S of all experimental outcomes via a Venn Diagram, and two or more events (E, F,…) as subsets of S.

Example: Experiment = “Randomly pick a single playing card from a standard deck (and replace).” S = {A♣, 2♣, …, K♣, A♠, 2♠, …, K♠, A♦, 2♦, …, K♦, A♥, 2♥, …, K♥} A = “Pick an Ace” = {A♣, A♠, A♦, A♥}

B = “Pick a Black card” = {A♣, 2♣, …, K♣, A♠, 2♠, …, K♠} C = “Pick a Clubs card” = {A♣, 2♣, …, K♣} D = “Pick a Diamonds card” = {A♦, 2♦, …, K♦}

2. Basic Definition and Properties of Probability (for any two events E and F)

• The general notion of probability P(E) of an event E as the “limiting value” of its “long-run” relative frequency (# times event E occurs) / (# experimental trials), as the experimental trials are repeated indefinitely.

• 0 ≤ P(E) ≤ 1

• P(E) = (# outcomes in E) / (# outcomes in S) … ONLY IF the outcomes are equally likely.

Example (cont’d): P(A) = 4/52 and P(B) = 26/52… IF the deck is “fair,” i.e., P(each card) = 1/52.

• Complement E c = “Not E” P(E c) = 1 – P(E) “Complement Rule”

• Intersection E ⋂ F = “E and F” Example (cont’d): A ⋂ B = {A♣, A♠}, so that P(A ⋂ B) = 2/52.

Special Case: E and F are disjoint or mutually exclusive if E ⋂ F = ∅, i.e., P(E ⋂ F) = 0.

Example (cont’d): With C = “Clubs” and D = “Diamonds” above, C ⋂ D = ∅, so that P(C ⋂ D) = 0.

• Union E ⋃ F = “E or F” P(E ⋃ F) = P(E) + P(F) – P(E ⋂ F) “Addition Rule” Example (cont’d): P(A ⋃ B) = 4/52 + 26/52 – 2/52 = 28/52

• How to construct and use a 2 × 2 probability table for two events E and F:

Events E E c

F P(E ⋂ F) P(E c ⋂ F) P(F)

F c P(E ⋂ F c) P(E c ⋂ F c) P(F c)

P(E) P(E c) 1
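As a quick numerical check of the Addition Rule in the playing-card example above (an added illustration, not part of the original notes), one can simulate repeated draws with replacement in R:

set.seed(1)
deck_rank <- rep(c("A", 2:10, "J", "Q", "K"), times = 4)
deck_suit <- rep(c("clubs", "spades", "diamonds", "hearts"), each = 13)
n <- 1e5
i <- sample(52, n, replace = TRUE)                     # draw a card (and replace) n times
is_ace   <- deck_rank[i] == "A"
is_black <- deck_suit[i] %in% c("clubs", "spades")
mean(is_ace | is_black)                                # approx 28/52 = 0.538
mean(is_ace) + mean(is_black) - mean(is_ace & is_black)   # Addition Rule gives the same value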


3. Conditional Probability (of any event E, given any event F)

• P(E | F) = P(E ⋂ F) / P(F), which can be rewritten as P(E ⋂ F) = P(E | F) P(F) “Multiplication Rule”

This latter formula can be expanded into a full tree diagram, where successive “branch probabilities” are multiplied together to yield intersection probabilities.

• Special Case: E and F are statistically independent if either of the following conditions holds:
  o P(E | F) = P(E) … or likewise, P(F | E) = P(F)
  o P(E ⋂ F) = P(E) P(F) (from above)

Example (cont’d): P(A ⋂ B) = 1/26 is indeed equal to the product of P(A) = 1/13 times P(B) = 1/2, so events A = “Pick an Ace” and B = “Pick a Black card” are statistically independent (… but not disjoint, since A ⋂ B = {A♣, A♠})!

Example: E and F below are statistically independent because each cell probability is equal to the product of its corresponding row and column marginal probabilities (e.g., 0.28 = 0.7 × 0.4, etc.), but events G and H are not, i.e., they are statistically dependent.

4. Bayes’ Rule

Bi are disjoint and exhaustive, i =1, 2, …, n B1 B2 …… Bn

A …… P(A)

Ac …… P(Ac)

Given: Prior Probabilities P(B1) P(B2) …… P(Bn) 1

Given: Conditional Probabilities P(A | B1) P(A | B2) …… P(A | Bn)

Then… Posterior Probabilities P(B1 | A) P(B2 | A) …… P(Bn | A) 1

are obtained via the formula:

P(Bi | A) = P(A | Bi) P(Bi) / P(A),   i = 1, 2, …, n,

where P(A) = P(A | B1) P(B1) + P(A | B2) P(B2) + … + P(A | Bn) P(Bn)   (“Law of Total Probability”).

Finally, compare each prior to its corresponding posterior. INTERPRET IN CONTEXT!
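A small R illustration of this procedure (added for these notes; the prior and conditional probabilities are invented, with n = 3 events Bi):

prior <- c(B1 = 0.50, B2 = 0.30, B3 = 0.20)   # P(Bi), must sum to 1
cond  <- c(B1 = 0.10, B2 = 0.40, B3 = 0.80)   # P(A | Bi)
pA    <- sum(cond * prior)                    # Law of Total Probability: P(A)
posterior <- cond * prior / pA                # P(Bi | A)
pA; posterior                                 # posteriors sum to 1; compare with the priors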

Tables for the independence example above:

        E       E c
F       0.28    0.42    0.70
F c     0.12    0.18    0.30
        0.40    0.60    1.00

        G       G c
H       0.15    0.55    0.70
H c     0.25    0.05    0.30
        0.40    0.60    1.00
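As a quick R check of the two tables above (an added illustration), one can verify that every cell of the first table equals the product of its marginal probabilities, whereas this fails for the second:

EF <- matrix(c(0.28, 0.12, 0.42, 0.18), nrow = 2,
             dimnames = list(c("F", "Fc"), c("E", "Ec")))
GH <- matrix(c(0.15, 0.25, 0.55, 0.05), nrow = 2,
             dimnames = list(c("H", "Hc"), c("G", "Gc")))
indep <- function(tab) all(abs(tab - outer(rowSums(tab), colSums(tab))) < 1e-12)
indep(EF)   # TRUE: every cell = row margin x column margin (eg, 0.28 = 0.70 x 0.40)
indep(GH)   # FALSE: G and H are statistically dependent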



Chebyshev’s Inequality

Now that the mean µ and standard deviation σ have been formally defined for any population distribution (with the exception of a very few special cases), is it possible to relate them in any way? At first glance, it may appear that the answer is no. For example, if the mean age of a certain population is known to be µ = 40 years, that tells us nothing about the standard deviation σ. At one extreme, it could be the case that nearly everyone is very close to 40 (i.e., σ is small), or at the other extreme, the ages could vary widely from infants to the elderly (i.e., σ is large). Likewise, knowing that the standard deviation of a population is, say σ = 10 years, tells us absolutely nothing about the mean age of that population. So it is perhaps somewhat surprising that there is in fact a relation of sorts between the two. While it may not be possible to derive one directly from the other, there is still something we can say, albeit very general.

A well-known theorem proved by the Russian mathematician Chebyshev (pronounced just as it appears, although there are numerous spelling variations of his name) loosely states that no matter what the shape of the population distribution (e.g., bell, skewed, bimodal, etc.), at least 3/4 (or 0.75) of the population values lie within “plus or minus” two standard deviations σ of the mean µ. That is, the event that the population value (say, age) of a randomly selected individual lies between the lower bound of µ − 2σ, and the upper bound of µ + 2σ, has probability ≥ 75%. Furthermore, at least 8/9 (or 0.89) of the population values lie within “plus or minus” three standard deviations σ of the mean µ. That is, the event that the population value of a randomly selected individual lies between the lower bound of µ − 3σ, and the upper bound of µ + 3σ, has probability ≥ 89%. In fact, in general, for any number k > 1, at least (1 – 1/k²) of the population values lie within “plus or minus” k standard deviations σ of the mean µ. That is, the interval between the lower bound of µ − kσ, and the upper bound of µ + kσ, captures (1 – 1/k²) or more of the population values. (So, for instance, 1 – 1/k² is equal to 3/4 when k = 2, and is equal to 8/9 when k = 3, thus confirming the claims above. At least how much of the population is captured within k = 2.5 standard deviations σ of the mean µ? Answer: 0.84)

However, the generality of Chebyshev’s Inequality (i.e., no assumptions are made on the shape of the distribution) is also something of a drawback, for, although true, it is far too general to be of practical use, and is therefore mainly of theoretical interest. The probabilities considered above for most “realistic” distributions correspond to values which are much higher than the very general ones provided by Chebyshev. For example, we will see that any bell curve captures exactly 68.3% of the population values within one standard deviation σ of its mean µ. (Note that Chebyshev’s Inequality states nothing useful for the case k = 1.) Similarly, any bell curve captures exactly 95.4% of the population values within two standard deviations σ of its mean µ. (For k = 2, Chebyshev’s Inequality states only that this probability is ≥ 75%... true, but very conservative, when compared with the actual value.) Likewise, any bell curve captures exactly 99.7% of the population values within three standard deviations σ of its mean µ. (For k = 3, Chebyshev’s Inequality states only that this probability is ≥ 89%... again, true, but conservative.)
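The comparison can be tabulated directly in R (an added illustration; it uses the standard normal distribution as the “bell curve”):

k <- c(1, 2, 2.5, 3)
chebyshev <- ifelse(k > 1, 1 - 1/k^2, 0)    # Chebyshev bound (vacuous for k = 1)
normal    <- pnorm(k) - pnorm(-k)           # exact coverage for any bell curve
round(data.frame(k, chebyshev, normal), 3)
# k = 2: bound 0.75 vs 0.954;  k = 2.5: bound 0.84 vs 0.988;  k = 3: bound 0.889 vs 0.997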


Intro Stat HW – LECTURE NOTES Problem Sets

Each HW assignment consists of at least one set of required problems from the textbook, AND at least one set of problems from the Lecture Notes (numbered sets I, II, … are shown below in BLUE). The “Suggested” problems are not to be turned in, but are there for additional practice. Solutions will be posted here. 0. READ: Getting Started with R

I. 1.5 - Problems Introduction Required: 1, 2, 3, 4, 7 Suggested: 5, 6

II. 2.5 - Problems Exploratory Data Analysis Required: 2, 3, 4, 6, 7, 8, 9 Suggested: 1, 11, 13

III. 3.5 - Problems Probability Theory Required: 1, 2, 7, 8, 11, 15, 16(a), 19, 30 – DO ANY FIVE PROBLEMS Suggested: 3, 6, 9, 10, 18, 20, 21(a), 24, 27

IVa. 4.4 - Problems Discrete Models Required: 1, 2, 19, 25 Suggested: 3, 11

IVb. 4.4 - Problems Continuous Models Required: 13(a), 15, 16, 17, 18, 21, 29, 30, 31, 33 – DO ANY FIVE PROBLEMS Suggested: 11, 13(b), 26, 32

V. 5.3 - Problems Sampling Distributions, Central Limit Theorem Required: 3, 4, 5, 6 Suggested: 1, 8

VIa. 6.4 - Problems Hypothesis Testing: One Mean (Large Samples) Required: 2, 3, 5, 8, 25

VIb. 6.4 - Problems Hypothesis Testing: One Mean (Small Samples) Required: 4, 6, 26

VIc. 6.4 - Problems Hypothesis Testing: One Proportion Required: 1

VId. 6.4 - Problems Hypothesis Testing: Two Means Required: 10 [see hint for (d)], 11, 27

VIe. 6.4 - Problems Hypothesis Testing: Proportions Required: 14, 19 Suggested: 18, 20

VIf. 6.4 - Problems Hypothesis Testing: ANOVA Required: 21

VII. 7.4 - Problems Linear Correlation and Regression Required: 5, 6, 7 Suggested: 2, 3


AAPS PharmSci 2001; 3 (4) article 29 (http://www.pharmsci.org/).


Allometric Scaling of Xenobiotic Clearance: Uncertainty versus Universality
Submitted: February 21, 2001; Accepted: November 7, 2001; Published: November 21, 2001
Teh-Min Hu and William L. Hayton, Division of Pharmaceutics, College of Pharmacy, The Ohio State University, 500 W. 12th Ave., Columbus, OH 43210-1291

ABSTRACT: Statistical analysis and Monte Carlo simulation were used to characterize uncertainty in the allometric exponent (b) of xenobiotic clearance (CL). CL values for 115 xenobiotics were from published studies in which at least 3 species were used for the purpose of interspecies comparison of pharmacokinetics. The b value for each xenobiotic was calculated along with its confidence interval (CI). For 24 xenobiotics (21%), there was no correlation between log CL and log body weight. For the other 91 cases, the mean ± standard deviation of the b values was 0.74 ± 0.16; range: 0.29 to 1.2. Most (81%) of these individual b values did not differ from either 0.67 or 0.75 at P = 0.05. When CL values for the subset of 91 substances were normalized to a common body weight coefficient (a), the b value for the 460 adjusted CL values was 0.74; the 99% CI was 0.71 to 0.76, which excluded 0.67. Monte Carlo simulation indicated that the wide range of observed b values could have resulted from random variability in CL values determined in a limited number of species, even though the underlying b value was 0.75. From the normalized CL values, 4 xenobiotic subgroups were examined: those that were (i) protein, and those that were (ii) eliminated mainly by renal excretion, (iii) by metabolism, or (iv) by renal excretion and metabolism combined. All subgroups except (ii) showed a b value not different from 0.75. The b value for the renal excretion subgroup (21 xenobiotics, 105 CL values) was 0.65, which differed from 0.75 but not from 0.67.

KEYWORDS: allometric scaling, body-weight exponent, clearance, metabolism, metabolic rate, pharmacokinetics, Monte Carlo simulation, power law

Corresponding Author: William L. Hayton; Division of Pharmaceutics, College of Pharmacy, The Ohio State University, 500 W. 12th Ave. Columbus, OH 43210-1291;Telephone: 614-292-1288; Facsimile: 614-292-7766; E-mail: [email protected]

INTRODUCTION

Biological structures and processes ranging from cellular metabolism to population dynamics are affected by the size of the organism.1,2 Although the sizes of mammalian species span 7 orders of magnitude, interspecies similarities in structural, physiological, and biochemical attributes result in an empirical power law (the allometric equation) that characterizes the dependency of biological variables on body mass: Y = a BW^b, where Y is the dependent biological variable of interest, a is a normalization constant known as the allometric coefficient, BW is the body weight, and b is the allometric exponent. The exponential form can be transformed into a linear function: Log Y = Log a + b (Log BW), and a and b can be estimated from the intercept and slope of a linear regression analysis. The magnitude of b characterizes the rate of change of a biological variable subjected to a change of body mass and reflects the geometric and dynamic constraints of the body.3,4 Although allometric scaling of physiological parameters has been a century-long endeavor, no consensus has been reached as to whether a universal scaling exponent exists. In particular, discussion has centered on whether the basal metabolic rate scales as the 2/3 or 3/4 power of the body mass.1,2,3-9 Allometric scaling has been applied in pharmacokinetics for approximately 2 decades. The major interest has been prediction of pharmacokinetic parameters in man from parameter values determined in animals.10-15 Clearance has been the most studied parameter, as it determines the drug-dosing rate. In most cases, the pharmacokinetics of a new drug was studied in several animal species, and the allometric relationship between pharmacokinetic parameters


and body weight was determined using linear regression of the log-transformed data. One or more of the following observations apply to most such studies: (i) Little attention was given to uncertainty in the a and b values; although the correlation coefficient was frequently reported, the confidence intervals of the a and b values were infrequently addressed. (ii) The a and b values were used for interspecies extrapolation of pharmacokinetics without analysis of the uncertainty in the predicted parameter values. (iii) The b value of clearance was compared with either the value 2/3 from "surface law" or 3/4 from "Kleiber's law" and the allometric scaling of basal metabolic rate. This paper addresses the possible impact of the uncertainty in allometric scaling parameters on predicted pharmacokinetic parameter values. We combined a statistical analysis of the allometric exponent of clearance from 115 xenobiotics and a Monte Carlo simulation to characterize the uncertainty in the allometric exponent for clearance and to investigate whether a universal exponent may exist for the scaling of xenobiotic clearance.

MATERIALS AND METHODS

Data collection and statistical analysis

Clearance (CL) and BW data for 115 substances were collected from published studies in which at least 3 animal species were used for the purpose of interspecies comparison of pharmacokinetics.16-90 A total of 18 species (16 mammals, 2 birds) with body weights spanning 10^4 were involved (Table 1). Previously published studies generally did not control or standardize across species the (i) dosage, (ii) numbers of individuals studied per species, (iii) principal investigator, (iv) blood sampling regime, or (v) gender. Table 1. Allometric Scaling Parameters Obtained from Linear Regressions of the Log-Log-Transformed CL versus BW Data of 115 Xenobiotics (a: allometric coefficient; b: allometric exponent) (Table located at the end of article).

Linear regression was performed on the log-transformed data according to the equation Log CL = Log a + b (Log BW). Values for a and b were obtained from the intercept and the slope of the regression, along with the coefficient of determination (r²). Statistical inferences about b were performed in the following form:

H0: b = ßi versus H1: b ≠ ßi, for i = 0, 1, 2, where ß0 = 0, ß1 = 2/3, and ß2 = 3/4, respectively. The 95% and 99% confidence intervals (CI) were also calculated for each b value. In addition, the CL values for each individual xenobiotic were normalized so that all compounds had the same a value. Linear regression analysis was applied to the pooled, normalized CL versus BW data for the 91 xenobiotics that showed statistically significant correlation between log CL and log BW in Table 1.
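As an illustration of this regression (added for these notes; the code is not from the original article, and the clearance values below are hypothetical), the a and b estimates and the confidence interval for b can be obtained in R as follows:

BW <- c(0.02, 0.25, 2.5, 5, 14, 70)       # kg: mouse, rat, rabbit, monkey, dog, human
CL <- c(4.8, 38, 210, 310, 700, 2600)     # hypothetical clearance values
fit <- lm(log10(CL) ~ log10(BW))          # Log CL = Log a + b (Log BW)
coef(fit)                                 # intercept = log10(a), slope = b
confint(fit, level = 0.95)                # does the CI for b exclude 2/3 or 3/4?
summary(fit)$r.squared                    # coefficient of determination r^2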

Monte Carlo simulation

The power function CL = a BW^b was used to generate a set of error-free CL versus BW data. The values for BW were 0.02, 0.25, 2.5, 5, 14, and 70 kg, which represented the body weights of mouse, rat, rabbit, monkey, dog, and human, respectively. The values of a and b used in the simulation were 100 and 0.75, respectively. Random error was added to the calculated CL values, assuming a normal distribution of error with either a 20% or a 30% coefficient of variation (CV), using the function RANDOM in Mathematica 4.0 (Wolfram Research, Champaign, IL). The b and r values were obtained by applying linear regression analyses on the log-log-transformed, error-containing CL versus BW data using the Mathematica function REGRESS. Ten scenarios with a variety of sampling regimens that covered different numbers of animal species (3-6) with various body weight ranges (varying 5.6- to 3500-fold) were simulated (n = 100 per scenario). The simulations mimicked the sampling patterns commonly adopted in the published interspecies pharmacokinetics studies.
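A similar simulation can be sketched in R (an added illustration; the original authors used Mathematica, and treating the CV as multiplicative normal error is an assumption made here):

# One scenario: all 6 species, true a = 100, b = 0.75, ~20% CV error, 100 replicates.
set.seed(1)
BW <- c(0.02, 0.25, 2.5, 5, 14, 70)
a <- 100; b <- 0.75; cv <- 0.20
b_hat <- replicate(100, {
  CL <- a * BW^b * (1 + rnorm(length(BW), 0, cv))   # add random error to clearance
  coef(lm(log10(CL) ~ log10(BW)))[2]                # slope of the log-log regression
})
mean(b_hat); quantile(b_hat, c(0.025, 0.975))       # compare the spread with Table 2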

RESULTS

The allometric scaling parameters and their statistics are listed in Table 1. Of 115 compounds, 24 (21%) showed no correlation between clearance and body weight; in other words, there was a lack of statistical significance for the regression (P > 0.05). This generally occurred when only 3 species were used. Among the remaining 91 cases, the mean ± standard deviation of the b values was 0.74 ± 0.16, with a wide range from 0.29 to 1.2 (Figure 1). The frequency distribution of the b values appeared to be Gaussian. The mean significantly differed from 0.67 (P < 0.001) but not from 0.75. When the b value of each substance was tested


statistically against both 0.67 and 0.75, the majority of the cases (81% and 98% at the level of significance equal to 0.05 and 0.01, respectively) failed to reject the null hypotheses raised against both values (Table 1); in other words, individual b values did not differ from 0.67 and 0.75. The wide ranges of the 95% and 99% CIs for b highlighted the uncertainty associated with the determination of b values in most studies. The 10 animal groups studied by Monte Carlo simulation had mean b values (n = 100 per simulation) close to the assigned true value, 0.75 (Table 2). However, the 95% CI in the majority of the scenarios failed to distinguish the expected value 0.75 from 0.67. Only Scenario 3 at the level of 20% CV excluded the possibility that b was 0.67 with 95% confidence. When the experimental error was set at 30% CV, none of the simulations distinguished between b values of 0.67 and 0.75 with 95% confidence. The mean r values ranged from 0.925 to 0.996, suggesting that the simulated experiments with a 20% and a 30% CV in experimental error were not particularly noisy. The frequency distributions of b values are shown in Figure 2.

Figure 1. The frequency distribution of the b values for the 91 xenobiotics that showed a statistically significant correlation between log clearance (CL) and log body weight (BW) in Table 1. The frequency of the b values from 0.2 to 1.2, at an interval of 0.1, was plotted against the midpoint of each interval of b values. The dotted line represents a fitted Gaussian distribution curve. SD = standard deviation.

[Figure 1: histogram of the allometric exponent (x-axis, 0.0 to 1.5) versus frequency (y-axis, 0 to 30); legend: Mean = 0.74, SD = 0.16, with fitted normal distribution curve.]

Table 2. Simulated b Values in Different Scenarios with Varied Body Weight Ranges

Scenario*  Species included*       Range**  b† (20% CV)       b† (30% CV)       r†† (20% CV)  r†† (30% CV)
1          ms, rt, rb              125      0.75 (0.63−0.87)  0.74 (0.53−0.95)  0.996         0.986
2          ms, rt, rb, mk          250      0.74 (0.64−0.84)  0.74 (0.58−0.91)  0.994         0.988
3          ms, rt, rb, mk, dg      700      0.75 (0.67−0.83)  0.75 (0.62−0.88)  0.996         0.990
4          ms, rt, rb, mk, dg, hm  3500     0.75 (0.69−0.81)  0.75 (0.62−0.88)  0.996         0.989
5          rt, rb, mk              20       0.76 (0.57−0.94)  0.72 (0.29−1.2)   0.992         0.954
6          rt, rb, mk, dg          56       0.75 (0.60−0.88)  0.73 (0.50−0.95)  0.990         0.968
7          rt, rb, mk, dg, hm      280      0.75 (0.65−0.85)  0.76 (0.58−0.93)  0.992         0.980
8          rb, mk, dg              5.6      0.80 (0.50−1.1)   0.74 (0.23−1.3)   0.974         0.925
9          rb, mk, dg, hm          28       0.74 (0.58−0.90)  0.75 (0.47−1.0)   0.987         0.971
10         mk, dg, hm              14       0.74 (0.50−0.98)  0.73 (0.44−1.0)   0.988         0.969

* ms: mouse, 0.02 kg; rt: rat, 0.25 kg; rb: rabbit, 2.5 kg; mk: monkey, 5 kg; dg: dog, 14 kg; hm: human, 70 kg; ** Range = maximum body weight/minimum body weight in each scenario; † The mean b value with 95% confidence interval (in parentheses) was obtained from 100 simulations where linear regression analyses were applied to the log-log-transformed CL versus BW data with either a 20% or a 30% coefficient of variation (CV) in clearance; †† The mean correlation coefficient (r) of linear regression from 100 simulated experiments per scenario.


[Figure 2: ten histograms of the simulated b values, one panel per scenario: (mouse, rat, rabbit); (mouse, rat, rabbit, monkey); (mouse, rat, rabbit, monkey, dog); (mouse, rat, rabbit, monkey, dog, human); (rat, rabbit, monkey); (rat, rabbit, monkey, dog); (rat, rabbit, monkey, dog, human); (rabbit, monkey, dog); (rabbit, monkey, dog, human); (monkey, dog, human). Each panel plots the exponent (0.35 to 1.35) on the horizontal axis against frequency (0 to 100) on the vertical axis, with separate distributions for 20% CV (gray) and 30% CV (black).]


Figure 2. The frequency distribution of the simulated b values in the 10 scenarios where the number of animal species and the range of body weight were varied. The b values were obtained by applying linear regression analyses on the log-log-transformed, error-containing clearance (CL) versus body weight (BW) data with either a 20% (gray) or a 30% (black) coefficient of variation (CV) in CL.

Figure 3. The relationship between normalized clearances (CLnormalized) and body weights (BW) for the 91 xenobiotics (n = 460) that showed a statistically significant correlation between log CL and log BW in Table 1. The relationship follows the equation: log CLnormalized = 0.74 log BW + 0.015, r² = 0.917. The 99% confidence interval of the regression slope was 0.71 to 0.76. The different colors represent different subgroups of xenobiotics: red, proteins; blue, xenobiotics that were eliminated mainly (> 70%) by renal excretion; green, xenobiotics that were eliminated mainly (> 70%) by metabolism; black, xenobiotics that were eliminated by both renal excretion and metabolism. The result of each subgroup can be viewed in the Web version by moving the cursor to each symbol legend.

Table 3. Summary of the Statistical Results in Figure 3.

Group*    No. of Xenobiotics  No. of Data Points  Slope, b  95% CI      99% CI
1         9                   41                  0.78      0.73–0.83   0.72–0.84
2         21                  105                 0.65      0.62–0.69   0.61–0.70
3         39                  203                 0.75      0.72–0.78   0.70–0.79
4         22                  111                 0.76      0.71–0.81   0.70–0.82
Overall   91                  460                 0.74      0.72–0.76   0.71–0.76

Note: CI = confidence interval. * Group 1 = protein; Group 2 = xenobiotics that were eliminated mainly by renal excretion; Group 3 = xenobiotics that were eliminated mainly by extensive metabolism; Group 4 = xenobiotics that were eliminated by both renal excretion and nonrenal metabolism.

Figure 3 shows the relationship between normalized clearances and body weights (n = 460) for the 91 xenobiotics that showed a statistically significant correlation in Table 1. The regression slope was 0.74, and the 99% CI was 0.71 to 0.76. The normalized clearances were divided into four groups: 9 proteins (Group 1, n = 41), 21 compounds eliminated mainly via renal excretion (Group 2, n = 105), 39 compounds eliminated mainly via extensive metabolism (Group 3, n = 203), and 22 compounds eliminated by both renal excretion and metabolism (Group 4, n = 111) (Figure 3). The summary of the regression results appears in Table 3. While Groups 1, 3, and 4 had b values close to 0.75 and significantly different from 0.67 (P < 0.001), Group 2 had a b value close to 0.67 and significantly different from 0.75 (P < 0.001).

DISCUSSION

Successful prediction of human clearance values using allometric scaling and clearance values measured in animals depends heavily on the accuracy of the b value. Retrospective analysis of published results for 115 substances indicated that the commonly used experimental designs result in considerable uncertainty in this parameter (Table 1). CL values for 24 of the substances listed in Table 1 failed to follow the allometric equation at the 95% confidence level. The failures appeared to result from the following factors: (i) only 3 species were studied in 16 cases, which severely limited the robustness of the statistics. In the remaining 8 failed cases, 1 or more of the following occurred: (ii) the species were studied in different labs in 3 cases; (iii) small (n = 2) or unequal (n = 2-10) numbers of animals per species were studied in 4 cases; (iv) different dosages among species were used in 2 cases; and (v) high interspecies variability in UDP-glucuronosyltransferase activity was proposed in 1 case.75 The failure of these 24 cases to follow the allometric equation appeared, for the most part, to result from deficiencies in experimental design; in other words, from failure of detection rather than failure of the particular substance's CL to follow the allometric relationship.

[Figure 3: log-log plot of normalized CL (0.001 to 1000) versus body weight (0.001 to 10,000 kg).]


How well did allometry applied to animal CL values predict the human CL value? One indication is how close the human CL value fell to the fitted line. Of the 91 substances that followed the allometric equation, 68 included human as 1 of the species. In 41 cases, the human CL value fell below the line, and in 27 cases it fell above (Figure 4). The mean deviation was only 0.62%, and the majority of deviations were less than 50%. It therefore appeared that, for most of the 68 substances studied with human as one of the species, the human CL value did not deviate systematically or extraordinarily from the fitted allometric equation. The tendency, noted by others,10,12 for the CL value for human to be lower than that predicted from animal CL values was therefore not apparent in this large data set.

Figure 4. The deviation between the fitted and the observed human clearance (CL) for 68 xenobiotics. The fitted human CL of each xenobiotic was obtained by applying linear regression on the log-log-transformed CL versus BW data from different animal species including human. The deviation was calculated as 100 × (CLobserved − CLfitted)/CLfitted. The mean deviation was 0.62%.

The b values for the 91 substances that followed the allometric equation appeared to be normally distributed around a mean value of 0.74, but the range of values was quite broad (Figure 1). Although impossible to answer definitively with these data, the question of whether there is a "universal" b value is of interest. Does the distribution shown in Figure 1 reflect a universal value, with deviation from the mean due to measurement errors, or are there different b values for the various mechanisms involved in clearance? The Monte Carlo simulations indicated that introduction of modest amounts of random error in CL determinations (Figure 2) resulted in a distribution of b values not unlike that shown in Figure 1. This result supported the possibility that a universal b value operates and that the range of values seen in Table 1 resulted from random error in CL determination, coupled with the uncertainty that accrued from use of a limited number of species. However, examination of subsets of the 91 substances segregated by elimination pathway showed a b value around 0.75, except for substances cleared primarily by the kidneys; the b value for this subgroup was 0.65 (see below), and the CI excluded a value larger than 0.70.

The central tendency of the b values is of interest, particularly given the recent interest in the question of whether basal metabolic rate scales with a b value of 0.67 or 0.75.3,4,8,9 When examined individually, the 95% CI of the b values for most of the 91 substances included both values, although the mean of all the b values tended toward 0.75. So that all CL values could be viewed together, a normalization process was used that assumed a common a value for all 91 substances, and CL values were adjusted accordingly (Figure 3). Fit of the allometric equation to this data set gave a b value of 0.74, and its CI included 0.75 and excluded 0.67. Normalized CL values were randomly scattered about the line, with one exception: in the body weight range 20 to 50 kg (dog, minipig, sheep, and goat), the normalized CL values generally fell above the line.

The 91 substances were segregated by molecular size (protein) and by major elimination pathway (renal excretion, metabolism, or a combination of both) (Figure 3). With the exception of the renal excretion subgroup, the normalized CL values for the subgroups showed b values similar to the combined group, and their CIs included 0.75 and excluded 0.67 (Table 3). The renal excretion subgroup (21 substances and 105 CL values), however, showed a b value of 0.65 with a CI that excluded 0.75.

[Figure 4: histogram of the % deviation, plotted from -150 to +150.]


This result was surprising, as it appeared to contradict b values of 0.77 reported for both mammalian glomerular filtration rate and effective renal plasma flow,91-93 although it was consistent with a b value of 0.66 reported for intraspecies scaling of inulin-based glomerular filtration rate in humans94 and with a b value of 0.69 for scaling creatinine clearance.95

Whether the metabolic rate scales to the 2/3 or the 3/4 power of body weight has been the subject of debate for many years, and no consensus has been reached. The surface law, which suggested a proportional relationship between the metabolic rate and the body surface area, was first conceptualized in the 19th century. It has gained support from empirical data6,96 as well as statistical6,9 and theoretical6,97 results. In 1932, Kleiber's empirical analysis led to the 3/4-power law, which has recently been generalized as the quarter-power law by West et al.3,4 Different theoretical analyses based on nutrient-supply networks3,8 and 4-dimensional biology4 all suggested that the quarter-power law is the universal scaling law in biology.98 However, the claim of universality was challenged by Dodds et al.,9 whose statistical and theoretical reanalyses could not exclude 0.67 as the scaling exponent of the basal metabolic rate. The logic behind the pursuit of a universal law for the scaling of energy metabolism across animal species is based mainly on the assumption that an optimal design of structure and function operates across animal species.3,4,8,99-101 Given the fact that all mammals use the same energy source (oxygen) and energy transport systems (cardiovascular, pulmonary), and given the possibility that evolutionary forces may have produced a design principle that optimizes energy metabolism across species, the existence of such a law might be possible. However, available data and analyses have not led to a conclusion.

A large body of literature data has indicated that the allometric scaling relationship applies to the clearance of a variety of xenobiotics. It has been speculated that xenobiotic clearance is related to metabolic rate, and clearance b values have frequently been compared with either 0.67 or 0.75. The b values obtained from the scaling of clearance for a variety of xenobiotics tended to be scattered. Our analysis indicated that the b value generally fell within a broad range between 0 and 1, or even higher. The scatter of b values may have resulted from the uncertainty that accrued from regression analysis of a limited number of data points, as discussed above. In addition, the scatter may have involved the variability in pharmacokinetic properties among different xenobiotics. This variability rendered the prediction of the b value extremely difficult and made any discussion of the "universality" of the b value less meaningful in this regard. From the pharmacokinetic point of view, the lack of a unique b value for all drugs may be considered the norm; in this regard, uncertainty and variability are themselves the universal phenomenon. To determine whether a unique b value exists for the scaling of CL, a more rigorous experimental design would have to be used to control the uncertainty that may obscure the conclusion. Although a study that included CL data for a variety of drugs, covering animal species with a scope similar to that of its counterpart in the scaling of basal metabolic rate, might be sufficient, it would also be extremely unrealistic. Therefore, from the perspective of pharmacokinetics, where the drug is the center of discussion, it is almost impossible to address whether the b value of CL tends to be dominated by 1 or 2 values. However, from the perspective of physiology, where the function of the body is of interest, systematic analysis of currently available data on interspecies scaling of CL may provide some insight into the interspecies scaling of energy metabolism. The rationale behind this line of reasoning is that the elimination of a xenobiotic from the body is a manifestation of physiological processes such as blood flow and oxygen consumption. Interestingly, the two competing exponent values proposed in theories of the scaling of energy metabolism, but no others, reappeared in our analysis. The value 0.75 appeared to be the central tendency of the b values for the CL of most compounds, except for drugs whose elimination was mainly via the kidney.

CONCLUSION

Whether allometric scaling could be used for the prediction of the first-time-in-man dose has been debated.102,103 Figure 4 shows that a reasonable error range can be achieved when human CL is predicted from the animal data for some drugs.


However, the success shown in the retrospective analysis does not necessarily warrant success in prospective applications. As indicated by our analyses of the uncertainty of b values, and as illustrated in Bonate and Howard's commentary,102 caution is needed when allometric scaling is applied in a prospective manner. In addition, the use of a deterministic equation in predicting individual CL data may be questionable because intersubject variability cannot be accounted for. Nevertheless, allometric scaling could be an alternative tool if the mean CL for a population is to be estimated and if the uncertainty is adequately addressed. When the uncertainty in the determination of a b value is relatively large, a fixed-exponent approach might be feasible. In this regard, 0.75 might be used for substances that are eliminated mainly by metabolism or by metabolism and excretion combined, whereas 0.67 might apply for drugs that are eliminated mainly by renal excretion.

ACKNOWLEDGEMENTS

Teh-Min Hu is supported by a fellowship from the National Defense Medical Center, Taipei, Taiwan.

REFERENCES

1. Schmidt-Nielsen K. Scaling: Why Is Animal Size So Important? Princeton, NJ: Cambridge University Press, 1983.
2. Calder WA III. Size, Function and Life History. Cambridge, MA: Harvard University Press, 1984.
3. West GB, Brown JH, Enquist BJ. A general model for the origin of allometric scaling laws in biology. Science. 1997;276:122-126.
4. West GB, Brown JH, Enquist BJ. The fourth dimension of life: Fractal geometry and allometric scaling of organisms. Science. 1999;284:1677-1679.
5. Kleiber M. Body size and metabolism. Hilgardia. 1932;6:315-353.
6. Heusner AA. Energy metabolism and body size. I. Is the 0.75 mass exponent of Kleiber’s equation a statistical artifact? Respir Physiol. 1982;48:1-12.
7. Feldman HA, McMahon TA. The 3/4 mass exponent for energy metabolism is not a statistical artifact. Respir Physiol. 1983;52:149-163.
8. Banavar JR, Maritan A, Rinaldo A. Size and form in efficient transportation networks. Nature. 1999;399:130-132.
9. Dodds PS, Rothman DH, Weitz JS. Re-examination of the “3/4-law” of metabolism. J Theor Biol. 2001;209:9-27.
10. Boxenbaum H. Interspecies scaling, allometry, physiological time, and the ground plan of pharmacokinetics. J Pharmacokin Biopharm. 1982;10:201-227.
11. Sawada Y, Hanano M, Sugiyama Y, Iga T. Prediction of disposition of beta-lactam antibiotics in humans from pharmacokinetic parameters in animals. J Pharmacokin Biopharm. 1984;12:241-261.

12. Mordenti J. Man versus beast: Pharmacokinetic scaling in mammals. J Pharm Sci. 1986;75:1028-1040.
13. Mahmood I, Balian JD. Interspecies scaling: Predicting clearance of drugs in humans. Three different approaches. Xenobiotica. 1996;26:887-895.
14. Feng MR, Lou X, Brown RR, Hutchaleelaha A. Allometric pharmacokinetic scaling: Towards the prediction of human oral pharmacokinetics. Pharm Res. 2000;17:410-418.
15. Mahmood I. Interspecies scaling of renally secreted drugs. Life Sci. 1998;63:2365-2371.
16. McGovren SP, Williams MG, Stewart JC. Interspecies comparison of acivicin pharmacokinetics. Drug Metab Dispos. 1988;16:18-22.
17. Brazzell RK, Park YH, Wooldridge CB, et al. Interspecies comparison of the pharmacokinetics of aldose reductase inhibitors. Drug Metab Dispos. 1990;18:435-440.
18. Bjorkman S, Redke F. Clearance of fentanyl, alfentanil, methohexitone, thiopentone and ketamine in relation to estimated hepatic blood flow in several animal species: Application to prediction of clearance in man. J Pharm Pharmacol. 2000;52:1065-1074.
19. Cherkofsky SC. 1-Aminocyclopropanecarboxylic acid: Mouse to man interspecies pharmacokinetic comparisons and allometric relationships. J Pharm Sci. 1995;84:1231-1235.
20. Robbie G, Chiou WL. Elucidation of human amphotericin B pharmacokinetics: Identification of a new potential factor affecting interspecies pharmacokinetic scaling. Pharm Res. 1998;15:1630-1636.
21. Paxton JW, Kim SN, Whitfield LR. Pharmacokinetic and toxicity scaling of the antitumor agents amsacrine and CI-921, a new analogue, in mice, rats, rabbits, dogs, and humans. Cancer Res. 1990;50:2692-2697.
22. GreneLerouge NAM, Bazin-Redureau MI, Debray M, Schermann JM. Interspecies scaling of clearance and volume of distribution for digoxin-specific Fab. Toxicol Appl Pharmacol. 1996;138:84-89.
23. Lave T, Dupin S, Schmidt C, Chou RC, Jaeck D, Coassolo PH. Integration of in vitro data into allometric scaling to predict hepatic metabolic clearance in man: Application to 10 extensively metabolized drugs. J Pharm Sci. 1997;86:584-590.
24. Bazin-Redureau M, Pepin S, Hong G, Debray M, Scherrmann JM. Interspecies scaling of clearance and volume of distribution for horse antivenom F(ab’)2. Toxicol Appl Pharmacol. 1998;150:295-300.
25. Lashev LD, Pashov DA, Marinkov TN. Interspecies differences in the pharmacokinetics of kanamycin and apramycin. Vet Res Comm. 1992;16:293-300.
26. Patel BA, Boudinot FD, Schinazi RF, Gallo JM, Chu CK. Comparative pharmacokinetics and interspecies scaling of 3’-azido-3’-deoxythymidine (AZT) in several mammalian species. J Pharmacobio-Dyn. 1990;13:206-211.
27. Kurihara A, Naganuma H, Hisaoka M, Tokiwa H, Kawahara Y. Prediction of human pharmacokinetics of panipenem-betamipron, a new carbapenem, from animal data. Antimicrob Ag Chemother. 1992;36:1810-1816.
28. Mehta SC, Lu DR. Interspecies pharmacokinetic scaling of BSH in mice, rats, rabbits, and humans. Biopharm Drug Dispos. 1995;16:735-744.
29. Bonati M, Latini R, Tognoni G. Interspecies comparison of in vivo caffeine pharmacokinetics in man, monkey, rabbit, rat, and mouse. Drug Metab Rev. 1984-85;15:1355-1383.
30. Kaye B, Brearley CJ, Cussans NJ, Herron M, Humphrey MJ, Mollatt AR. Formation and pharmacokinetics of the active drug candoxatrilat in mouse, rat, rabbit, dog and man following administration of the prodrug candoxatril. Xenobiotica. 1997;27:1091-1102.


31. Mordenti J, Chen SA, Moore JA, Ferraiolo BL, Green JD. Interspecies scaling of clearance and volume of distribution data for five therapeutic proteins. Pharm Res. 1991;8:1351-1359.
32. Sawada Y, Hanano M, Sugiyama Y, Iga T. Prediction of the disposition of β-lactam antibiotics in humans from pharmacokinetic parameters in animals. J Pharmacokinet Biopharm. 1984;12:241-261.
33. Matsushita H, Suzuki H, Sugiyama Y, et al. Prediction of the pharmacokinetics of cefodizime and cefotetan in humans from pharmacokinetic parameters in animals. J Pharmacobio-Dyn. 1990;13:602-611.
34. Mordenti J. Pharmacokinetic scale-up: Accurate prediction of human pharmacokinetic profiles from animal data. J Pharm Sci. 1985;74:1097-1099.
35. Feng MR, Loo J, Wright J. Disposition of the antipsychotic agent CI-1007 in rats, monkeys, dogs, and human cytochrome p450 2D6 extensive metabolizers: Species comparison and allometric scaling. Drug Metab Dispos. 1998;26:982-988.
36. Hildebrand M. Inter-species extrapolation of pharmacokinetic data of three prostacyclin-mimetics. Prostaglandins. 1994;48:297-312.
37. Ericsson H, Tholander B, Bjorkman JA, Nordlander M, Regardh CG. Pharmacokinetics of the new calcium channel antagonist clevidipine in the rat, rabbit, and dog and pharmacokinetic/pharmacodynamic relationship in anesthetized dogs. Drug Metab Dispos. 1999;27:558-564.
38. Sangalli L, Bortolotti A, Jiritano L, Bonati M. Cyclosporine pharmacokinetics in rats and interspecies comparison in dogs, rabbits, rats, and humans. Drug Metab Dispos. 1998;16:749-753.
39. Kim SH, Kim WB, Lee MG. Interspecies pharmacokinetic scaling of a new carbapenem, DA-1131, in mice, rats, rabbits and dogs, and prediction of human pharmacokinetics. Biopharm Drug Dispos. 1998;19:231-235.
40. Klotz U, Antonin K-H, Bieck PR. Pharmacokinetics and plasma binding of diazepam in man, dog, rabbit, guinea pig and rat. J Pharmacol Exp Ther. 1976;199:67-73.
41. Kaul S, Daudekar KA, Schilling BE, Barbhaiya RH. Toxicokinetics of 2’,3’-deoxythymidine, stavudine (D4T). Drug Metab Dispos. 1999;27:1-12.
42. Sanwald-Ducray P, Dow J. Prediction of the pharmacokinetic parameters of reduced-dolasetron in man using in vitro-in vivo and interspecies allometric scaling. Xenobiotica. 1997;27:189-201.
43. Kawakami J, Yamamoto K, Sawada Y, Iga T. Prediction of brain delivery of ofloxacin, a new quinolone, in the human from animal data. J Pharmacokinet Biopharm. 1994;22:207-227.
44. Tsunekawa Y, Hasegawa T, Nadai M, Takagi K, Nabeshima T. Interspecies differences and scaling for the pharmacokinetics of xanthine derivatives. J Pharm Pharmacol. 1992;44:594-599.
45. Bregante MA, Saez P, Aramayona JJ, et al. Comparative pharmacokinetics of enrofloxacin in mice, rats, rabbits, sheep, and cows. Am J Vet Res. 1999;60:1111-1116.
46. Duthu GS. Interspecies correlation of the pharmacokinetics of erythromycin, oleandomycin, and tylosin. J Pharm Sci. 1995;74:943-946.
47. Efthymiopoulos C, Battaglia R, Strolin Benedetti M. Animal pharmacokinetics and interspecies scaling of FCE 22101, a penem antibiotic. J Antimicrob Chemother. 1991;27:517-526.
48. Jezequel SG. Fluconazole: Interspecies scaling and allometric relationships of pharmacokinetic properties. J Pharm Pharmacol. 1994;46:196-199.

49. Segre G, Bianchi E, Zanolo G. Pharmacokinetics of flunoxaprofen in rats, dogs, and monkeys. J Pharm Sci. 1988;77:670-673.
50. Khor SP, Amyx H, Davis ST, Nelson D, Baccanari DP, Spector T. Dihydropyrimidine dehydrogenase inactivation and 5-fluorouracil pharmacokinetics: Allometric scaling of animal data, pharmacokinetics and toxicodynamics of 5-fluorouracil in humans. Cancer Chemother Pharmacol. 1997;39:233-238.
51. Clark B, Smith DA. Metabolism and excretion of a chromone carboxylic acid (FPL 52757) in various animal species. Xenobiotica. 1982;12:147-153.
52. Nakajima Y, Hattori K, Shinsei M, et al. Physiologically-based pharmacokinetic analysis of grepafloxacin. Biol Pharm Bull. 2000;23:1077-1083.
53. Baggot JD. Application of interspecies scaling to the bispyridinium oxime HI-6. Am J Vet Res. 1994;55:689-691.
54. Lave T, Levet-Trafit B, Schmitt-Hoffmann AH, et al. Interspecies scaling of interferon disposition and comparison of allometric scaling with concentration-time transformations. J Pharm Sci. 1995;84:1285-1290.
55. Sakai T, Hamada T, Awata N, Watanabe J. Pharmacokinetics of an antiallergic agent, 1-(2-ethoxyethyl)-2-(hexahydro-4-methyl-1H-1,4-diazepin-1-yl)-1H-benzimidazole difumarate (KG-2413) after oral administration: Interspecies differences in rats, guinea pigs and dogs. J Pharmacobio-Dyn. 1989;12:530-536.
56. Lave T, Saner A, Coassolo P, Brandt R, Schmitt-Hoffman AH, Chou RC. Animal pharmacokinetics and interspecies scaling from animals to man of lamifiban, a new platelet aggregation inhibitor. J Pharm Pharmacol. 1996;48:573-577.
57. Richter WF, Gallati H, Schiller CD. Animal pharmacokinetics of the tumor necrosis factor receptor-immunoglobulin fusion protein lenercept and their extrapolation to humans. Drug Metab Dispos. 1999;27:21-25.
58. Lapka R, Rejholec V, Sechser T, Peterkova M, Smid M. Interspecies pharmacokinetic scaling of metazosin, a novel alpha-adrenergic antagonist. Biopharm Drug Dispos. 1989;10:581-589.
59. Ahr H-J, Boberg M, Brendel E, Krause HP, Steinke W. Pharmacokinetics of miglitol: Absorption, distribution, metabolism, and excretion following administration to rats, dogs, and man. Arzneim Forsch. 1997;47:734-745.
60. Siefert HM, Domdey-Bette A, Henninger K, Hucke F, Kohlsdorfer C, Stass HH. Pharmacokinetics of the 8-methoxyquinolone, moxifloxacin: A comparison in humans and other mammalian species. J Antimicrob Chemother. 1999;43(Suppl. B):69-76.
61. Lave T, Portmann R, Schenker G, et al. Interspecies pharmacokinetic comparisons and allometric scaling of napsagatran, a low molecular weight thrombin inhibitor. J Pharm Pharmacol. 1999;51:85-91.
62. Higuchi S, Shiobara Y. Comparative pharmacokinetics of nicardipine hydrochloride, a new vasodilator, in various species. Xenobiotica. 1980;10:447-454.
63. Mitsuhashi Y, Sugiyama Y, Ozawa S, et al. Prediction of ACNU plasma concentration-time profiles in humans by animal scale-up. Cancer Chemother Pharmacol. 1990;27:20-26.
64. Yoshimura M, Kojima J, Ito T, Suzuki J. Pharmacokinetics of nipradilol (K-351), a new antihypertensive agent. I. Studies on interspecies variation in laboratory animals. J Pharmacobio-Dyn. 1985;8:738-750.
65. Gombar CT, Harrington GW, Pylypiw HM Jr, et al. Interspecies scaling of the pharmacokinetics of N-nitrosodimethylamine. Cancer Res. 1990;50:4366-4370.


66. Mukai H, Watanabe S, Tsuchida K, Morino A. Pharmacokinetics of NS-49, a phenethylamine class α1A-adrenoceptor agonist, at therapeutic doses in several animal species and interspecies scaling of its pharmacokinetic parameters. Int J Pharm. 1999;186:215-222.
67. Owens SM, Hardwick WC, Blackall D. Phencyclidine pharmacokinetic scaling among species. J Pharmacol Exp Ther. 1987;242:96-101.
68. Ishigami M, Saburomaru K, Niino K, et al. Pharmacokinetics of procaterol in the rat, rabbit, and beagle dog. Arzneim Forsch. 1979;29:266-270.
69. Khor AP, McCarthy K, DuPont M, Murray K, Timony G. Pharmacokinetics, pharmacodynamics, allometry, and dose selection of rPSGL-Ig for phase I trial. J Pharmacol Exp Ther. 2000;293:618-624.
70. Mordenti J, Osaka G, Garcia K, Thomsen K, Licko V, Meng G. Pharmacokinetics and interspecies scaling of recombinant human factor VIII. Toxicol Appl Pharmacol. 1996;136:75-78.
71. Coassolo P, Fischli W, Clozel J-P, Chou RC. Pharmacokinetics of remikiren, a potent orally active inhibitor of human renin, in rat, dog, and primates. Xenobiotica. 1996;26:333-345.
72. Widman M, Nilsson LB, Bryske B, Lundstrom J. Disposition of remoxipride in different species. Arzneim Forsch. 1993;43:287-297.
73. Lashev L, Pashov D, Kanelov I. Species specific pharmacokinetics of rolitetracycline. J Vet Med A. 1995;42:201-208.
74. Herault JP, Donat F, Barzu T, et al. Pharmacokinetic study of three synthetic AT-binding pentasaccharides in various animal species-extrapolation to humans. Blood Coagul Fibrinol. 1997;8:161-167.
75. Ward KW, Azzarano LM, Bondinell WE, et al. Preclinical pharmacokinetics and interspecies scaling of a novel vitronectin receptor antagonist. Drug Metab Dispos. 1999;27:1232-1241.
76. Lin C, Gupta S, Loebenberg D, Cayen MN. Pharmacokinetics of an everninomicin (SCH 27899) in mice, rats, rabbits, and cynomolgus monkeys following intravenous administration. Antimicrob Ag Chemother. 2000;44:916-919.
77. Chung M, Radwanski E, Loebenberg D, et al. Interspecies pharmacokinetic scaling of Sch 34343. J Antimicrob Chemother. 1985;15(Suppl. C):227-233.
78. Hinderling PH, Dilea C, Koziol T, Millington G. Comparative kinetics of sematilide in four species. Drug Metab Dispos. 1993;21:662-669.
79. Walker DK, Ackland MJ, James GC, et al. Pharmacokinetics and metabolism of sildenafil in mouse, rat, rabbit, dog, and man. Xenobiotica. 1999;29:297-310.
80. Brocks DR, Freed MI, Martin DE, et al. Interspecies pharmacokinetics of a novel hematoregulatory peptide (SK&F 107647) in rats, dogs, and oncologic patients. Pharm Res. 1996;13:794-797.
81. Cosson VF, Fuseau E, Efthymiopoulos C, Bye A. Mixed effect modeling of sumatriptan pharmacokinetics during drug development. I: Interspecies allometric scaling. J Pharmacokin Biopharm. 1997;25:149-167.
82. Leusch A, Troger W, Greischel A, Roth W. Pharmacokinetics of the M1-agonist talsaclidine in mouse, rat, rabbit, and monkey, and extrapolation to man. Xenobiotica. 2000;30:797-813.
83. van Hoogdalem EJ, Soeishi Y, Matsushima H, Higuchi S. Disposition of the selective α1A-adrenoceptor antagonist tamsulosin in humans: Comparison with data from interspecies scaling. J Pharm Sci. 1997;86:1156-1161.
84. Cruze CA, Kelm GR, Meredith MP. Interspecies scaling of tebufelone pharmacokinetic data and application to preclinical toxicology. Pharm Res. 1995;12:895-901.

85. Gaspari F, Bonati M. Interspecies metabolism and pharmacokinetic scaling of theophylline disposition. Drug Metab Rev. 1990;22:179-207.
86. Davi H, Tronquet C, Calx J, et al. Disposition of tiludronate (Skelid) in animals. Xenobiotica. 1999;29:1017-1031.
87. Pahlman I, Kankaanranta S, Palmer L. Pharmacokinetics of tolterodine, a muscarinic receptor antagonist, in mouse, rat and dog. Arzneim Forsch. 2001;51:134-144.
88. Tanaka E, Ishikawa A, Horie T. In vivo and in vitro trimethadione oxidation activity of the liver from various animal species including mouse, hamster, rat, rabbit, dog, monkey and human. Human Exp Toxicol. 1999;18:12-16.
89. Izumi T, Enomoto S, Hosiyama K, et al. Prediction of the human pharmacokinetics of troglitazone, a new and extensively metabolized antidiabetic agent, after oral administration, with an animal scale-up approach. J Pharmacol Exp Ther. 1996;277:1630-1641.
90. Grindel JM, O’Neil PG, Yorgey KA, et al. The metabolism of zomepirac sodium I. Disposition in laboratory animals and man. Drug Metab Dispos. 1980;8:343-348.
91. Singer MA, Morton AR. Mouse to elephant: Biological scaling and Kt/V. Am J Kidney Dis. 2000;35:306-309.
92. Singer MA. Of mice and men and elephants: Metabolic rate sets glomerular filtration rate. Am J Kidney Dis. 2001;37:164-178.
93. Edwards NA. Scaling of renal functions in mammals. Comp Biochem Physiol. 1975;52A:63-66.
94. Hayton WL. Maturation and growth of renal function: Dosing renally cleared drugs in children. AAPS PharmSci. 2000;2(1), article 3.
95. Adolph EF. Quantitative relations in the physiological constituents of mammals. Science. 1949;109:579-585.
96. Rubner M. Über den Einfluss der Körpergrösse auf Stoff- und Kraftwechsel. Z Biol. 1883;19:535-562.
97. Heusner A. Energy metabolism and body size. II. Dimensional analysis and energetic non-similarity. Resp Physiol. 1982;48:13-25.
98. West GB. The origin of universal scaling laws in biology. Physica A. 1999;263:104-113.
99. Murray CD. The physiological principle of minimum work. I. The vascular system and the cost of blood volume. Proc Natl Acad Sci U S A. 1926;12:207-214.
100. Cohn DL. Optimal systems: I. The vascular system. Bull Math Biophys. 1954;16:59-74.
101. Cohn DL. Optimal systems: II. The vascular system. Bull Math Biophys. 1955;17:219-227.
102. Bonate PL, Howard D. Prospective allometric scaling: Does the emperor have clothes? J Clin Pharmacol. 2000;40:665-670.
103. Mahmood I. Critique of prospective allometric scaling: Does the emperor have clothes? J Clin Pharmacol. 2000;40:671-674.


Table 1. Allometric Scaling Parameters Obtained from Linear Regressions of the Log-Log-Transformed CL versus BW Data of 115 Xenobiotics (a: allometric coefficient; b: allometric exponent)

Compounds  a  b  r² (i)  P (ii)  95% CI of b  99% CI of b  Species (vii)  Ref
Acivin 3.9 0.57 0.976 *** 0.45 - 0.70 (iii) 0.37 - 0.78 ms, rt, mk, dg, hm 16
AL01567 0.41 0.93 0.834 * 0.17 - 1.7 n.d. rt, mk, dg, cz, hm 17
AL01576 0.36 1.1 0.955 ** 0.75 - 1.4 (iv) 0.54 - 1.6 rt, mk, cz, hm 17
AL01750 0.39 0.98 0.829 * 0.16 - 1.8 n.d. rt, dg, mk, cz 17
Alfentanil 47 0.75 0.975 *** 0.59 - 0.92 0.48 - 1.0 rt, rb, dg, sh 18
1-Aminocyclopropanecarboxylate 2.6 0.72 0.902 * 0.28 - 1.2 n.d. ms, rt, mk, hm 19
Amphotericin B 0.94 0.84 0.988 *** 0.77 - 0.91 (v) 0.74 - 0.94 (iv) ms, rt, rb, dg, hm 20
Amsacrine 38 0.46 0.906 * 0.19 - 0.73 n.d. ms, rt, rb, dg, hm 21
Anti-digoxin Fab 1.0 0.67 0.992 0.06 n.d. (vi) n.d. ms, rt, rb 22
Antipyrine 6.9 0.57 0.716 0.15 n.d. n.d. rt, rb, dg, hm 23
Antivenom Fab2 0.033 0.53 0.990 0.06 n.d. n.d. ms, rt, rb 24
Apramycin 2.8 0.80 0.924 ** 0.38 - 1.2 0.028 - 1.6 sh, rb, ck, pn 25
AZT 26 0.96 0.982 ** 0.72 - 1.2 (iv) 0.52 - 1.4 ms, rt, mk, dg, hm 26
Betamipron 16 0.69 0.975 *** 0.53 - 0.84 0.43 - 0.94 ms, gp, rt, rb, mk, dg 27
Bosentan 25 0.56 0.663 * 0.006 - 1.1 n.d. ms, mt, rt, rb, hm 23
BSH 2.1 0.68 0.945 * 0.028 - 0.18 n.d. ms, rt, rb, hm 28
Caffeine 6.3 0.74 0.981 ** 0.55 - 0.93 0.39 - 1.1 ms, rt, rb, mk, hm 29
Candoxatrilat 9.6 0.66 0.986 *** 0.52 - 0.81 0.39 - 0.93 ms, rt, rb, dg, hm 30
CD4-IgG 0.10 0.74 0.959 * 0.27 - 1.2 n.d. rt, rb, mk, hm 31
Cefazolin 4.5 0.68 0.975 *** 0.52 - 0.83 0.43 - 0.93 ms, rt, rb, dg, mk, hm 32
Cefmetazole 12 0.59 0.917 ** 0.35 - 0.84 0.18 - 1.0 ms, rt, rb, dg, mk, hm 32
Cefodizime 1.5 1.0 0.926 ** 0.48 - 1.5 0.047 - 1.9 ms, rt, rb, dg, mk 33
Cefoperazone 6.7 0.57 0.823 * 0.20 - 0.94 n.d. ms, rt, rb, dg, mk, hm 32
Cefotetan 6.3 0.53 0.849 ** 0.22 - 0.84 0.016 - 1.0 ms, rt, rb, dg, mk, hm 32
Cefpiramide 4.1 0.40 0.589 0.07 n.d. n.d. ms, rt, rb, dg, mk, hm 32
Ceftizoxime 11 0.57 0.986 ** 0.37 - 0.78 0.10 - 1.1 ms, rt, mk, dg 34
CI-1007 35 0.90 0.998 * 0.44 - 1.4 n.d. rt, mk, dg 35
CI-921 15 0.51 0.830 * 0.085 - 0.93 n.d. ms, rt, rb, dg, hm 21
Cicaprost 37 0.83 0.956 *** 0.59 - 1.1 0.42 - 1.2 ms, rt, rb, mk, pg, hm 36
Clevidipine 288 0.84 0.985 0.07 n.d. n.d. rt, rb, dg 37


Table 1. (continued)

Compounds  a  b  r² (i)  P (ii)  95% CI of b  99% CI of b  Species (vii)  Ref.
Cyclosporin 5.8 0.99 0.931 * 0.17 - 1.8 n.d. rt, rb, dg, hm 38
DA-1131 11 0.81 0.995 *** 0.71 - 0.93 (iv) 0.61 - 1.0 ms, rt, rb, dg, hm 39
Diazepam 89 0.2 0.135 0.5 n.d. n.d. rt, gp, rb, dg, hm 40
Didanosine 33 0.76 0.971 ** 0.52 - 1.0 0.32 - 1.2 ms, rt, mk, dg, hm 41
Dolasetron 74 0.73 0.950 * 0.22 - 1.2 n.d. rt, mk, dg, hm 42
Enoxacin 36 0.43 0.874 * 0.13 - 0.73 (iii) n.d. ms, rt, mk, dg, hm 43
Enprofylline 6.0 0.72 0.852 ** 0.30 - 1.1 0.028 - 1.4 ms, rt, gp, rb, dg, hm 44
Enrofloxacin 23 0.77 0.972 ** 0.53 - 1.0 0.33 - 1.2 ms, rt, rb, sh, cw 45
Eptaloprost 115 0.83 0.985 0.08 n.d. n.d. rt, mk, hm 36
Erythromycin 37 0.66 0.966 *** 0.49 - 0.83 0.37 - 0.94 ms, rt, rb, dg, hm, cw 46
FCE22101 11 0.76 0.909 * 0.027 - 1.5 n.d. rt, rb, mk, dg 47
Fentanyl 60 0.88 0.990 0.06 n.d. n.d. rt, dg, pg 18
Fluconazole 1.2 0.70 0.992 *** 0.63 - 0.77 0.58 - 0.82 ms, rt, gp, rb, ct, dg, hm 48
Flunoxaprofen 0.98 1.0 0.925 0.2 n.d. n.d. rt, dg, mk 49
5-Fluorouracil 7.6 0.74 0.991 ** 0.52 - 0.95 0.24 - 1.2 ms, rt, dg, hm 50
FPL-52757 0.91 0.62 0.973 ** 0.43 - 0.81 0.28 - 0.97 rt, rb, mk, dg, hm 51
Grepafloxacin 15 0.64 0.886 0.06 n.d. n.d. rt, rb, mk, dg 52
HI-6 9.8 0.76 0.972 *** 0.61 - 0.91 0.53 - 0.99 ms, rt, rb, mk, dg, sh, hm 53
Iloprost 48 0.85 0.970 *** 0.64 - 1.1 0.51 - 1.2 ms, rt, rb, dg, pg, hm 36
Interferon α 3.7 0.71 0.980 ** 0.52 - 0.90 0.36 - 1.1 ms, rt, rb, dg, mk 54
Kanamycin 2.9 0.81 0.970 *** 0.61 - 1.0 0.48 - 1.1 sh, gt, rb, ck, pn 25
Ketamine 119 0.56 0.632 0.1 n.d. n.d. rt, rb, pg 18
KG-2413 610 1.1 0.741 0.3 n.d. n.d. rt, gp, dg 55
Lamifiban 6.1 0.88 0.887 0.2 n.d. n.d. rt, dg, mk 56
Lamivudine 15 0.75 0.991 ** 0.53 - 0.97 0.24 - 1.3 rt, mk, dg, hm 41
Lenercept 0.0079 1.1 0.998 ** 0.90 - 1.2 (v) 0.71 - 1.4 (iv) rt, rb, mk, dg 57
Lomefloxacin 10 0.79 0.992 *** 0.66 - 0.92 0.56 - 1.0 ms, rt, mk, dg, hm 46
Metazocin 11 0.29 0.973 * 0.15 - 0.44 n.d. ms, rt, rb, hm 58
Methohexitone 73 0.86 0.997 * 0.26 - 1.5 n.d. rt, rb, dg 18
Mibefradil 62 0.62 0.923 ** 0.29 - 0.95 0.018 - 1.2 rt, mt, rb, dg, hm 23
Midazolam 67 0.68 0.850 * 0.15 - 1.2 n.d. rt, rb, dg, pg, hm 23


Table 1. (continued)

Compounds  a  b  r² (i)  P (ii)  95% CI of b  99% CI of b  Species (vii)  Ref.
Miglitol 7.4 0.64 0.998 * 0.31 - 0.97 n.d. rt, dg, hm 59
Mofarotene 14 0.84 0.983 ** 0.51 - 1.2 n.d. ms, rt, dg, hm 23
Moxalactam 5.0 0.66 0.992 *** 0.58 - 0.74 (iii) 0.53 - 0.79 ms, rt, rb, dg, mk, hm 32
Moxifloxacin 20 0.56 0.949 *** 0.38 - 0.74 (iii) 0.26 - 0.86 ms, rt, mk, dg 60
Napsagatran 50 0.74 0.842 0.08 n.d. n.d. rt, rb, dg, mk 61
Nicardipine 69 0.55 0.962 *** 0.40 - 0.70 (iii) 0.30 - 0.80 rt, dg, mk, hm 62
Nimustine 42 0.83 0.968 ** 0.55 - 1.1 0.32 - 1.3 ms, rt, rb, dg, hm 63
Nipradilol 59 0.66 0.796 * 0.047 - 1.3 n.d. rt, rb, mk, dg 64
N-Nitrosodimethylamine 59 0.93 0.972 *** 0.75 - 1.1 (iv) 0.65 - 1.2 ms, hr, rt, rb, mk, dg, pg 65
Norfloxacin 81 0.77 0.893 * 0.28 - 1.3 n.d. ms, rt, mk, dg, hm 43
NS-49 14 0.64 0.994 0.05 n.d. n.d. rt, rb, dg 66
Ofloxacin 7.5 0.64 0.946 * 0.17 - 1.1 n.d. rt, mk, dg, hm 43
Oleandomycin 30 0.69 0.996 ** 0.55 - 0.83 0.36 - 1.0 ms, rt, dg, hm 46
Panipenem 12 0.61 0.977 *** 0.48 - 0.74 (iii) 0.39 - 0.82 ms, gp, rt, rb, mk, dg 27
Pefloxacin 13 0.57 0.910 * 0.24 - 0.90 n.d. ms, rt, mk, dg, hm 43
Phencyclidine 52 0.64 0.891 ** 0.33 - 0.95 0.12 - 1.1 ms, rt, pn, mk, dg, hm 67
Procaterol 29 0.80 0.992 0.06 n.d. n.d. rt, rb, dg 68
Propranolol 98 0.64 0.81 0.10 n.d. n.d. rt, rb, dg, hm 23
P-selectin glycoprotein ligand-1 0.0060 0.93 0.939 ** 0.49 - 1.4 0.13 - 1.7 ms, rt, mk, pg 69
Recombinant CD4 3.4 0.65 0.995 ** 0.50 - 0.79 0.31 - 0.98 rt, rb, mk, hm 31
Recombinant growth hormone 6.8 0.71 0.995 ** 0.55 - 0.87 0.34 - 1.1 ms, rt, mk, hm 31
Recombinant human factor VIII 0.16 0.71 0.999 * 0.45 - 0.97 n.d. ms, rt, hm 70
Relaxin 6.0 0.80 0.992 *** 0.66 - 0.93 0.55 - 1.0 ms, rt, rb, mk, hm 31
Remikiren 50 0.67 0.898 * 0.26 - 1.1 n.d. rt, dg, mt, mk 71
Remoxipride 29 0.42 0.710 0.07 n.d. n.d. ms, rt, hs, dg, hm 72
Ro 24-6173 69 0.64 0.976 * 0.33 - 0.95 n.d. rt, rb, dg, hm 23
Rolitetracycline 11 0.89 0.989 *** 0.72 - 1.1 (iv) 0.58 - 1.2 rb, pg, pn, ck 73
Sanorg 32701 0.35 0.87 0.979 0.09 n.d. n.d. rt, rb, bb 74
SB-265123 15 0.80 0.812 0.1 n.d. n.d. ms, rt, mk, dg 75
Sch 27899 0.78 0.62 0.966 * 0.27 - 0.98 n.d. ms, rt, rb, mk 76
Sch 34343 13 0.77 0.924 *** 0.51 - 1.0 0.37 - 1.2 ms, rt, mk, rb, dg, hm 77


Table 1. (continued)

Compounds  a  b  r² (i)  P (ii)  95% CI of b  99% CI of b  Species (vii)  Ref.
Sematilide 20 0.66 0.982 ** 0.39 - 0.94 0.034 - 1.3 rt, rb, dg, hm 78
Sildenafil 28 0.66 0.999 *** 0.59 - 0.73 (iii) 0.51 - 0.81 ms, rt, dg, hm 79
SK&F107647 7.2 0.63 0.964 0.1 n.d. n.d. rt, dg, hm 80
SR 80027 0.10 0.53 0.990 0.06 n.d. n.d. rt, rb, bb 74
SR90107A 0.68 0.55 0.978 * 0.30 - 0.79 n.d. rt, rb, bb, hm 74
Stavudine 19 0.84 0.993 *** 0.71 - 0.97 (iv) 0.60 - 1.1 ms, rt, mk, rb, hm 41
Sumatriptan 32 0.84 0.973 * 0.42 - 1.3 n.d. rt, rb, dg, hm 81
Talsaclidine 37 0.63 0.971 * 0.30 - 0.97 n.d. ms, rt, mk, hm 82
Tamsulosin 61 0.59 0.993 0.05 n.d. n.d. rt, rb, dg 83
Tebufelone 31 0.79 0.963 * 0.32 - 1.3 n.d. rt, mk, dg, hm 84
Theophylline 1.9 0.81 0.950 *** 0.64 - 0.98 0.57 - 1.1 rt, gp, rb, ct, pg, hs, hm 85
Thiopentone 3.5 1.0 0.874 ** 0.57 - 1.4 0.32 - 1.7 rt, rb, dg, sh 18
Tiludronate 1.5 0.56 0.977 ** 0.40 - 0.71 0.27 - 0.84 ms, rt, rb, dg, bb 86
Tissue-plasminogen activator 17 0.84 0.986 *** 0.72 - 0.95 (iv) 0.66 - 1.0 ms, hs, rt, rb, mk, dg, hm 23
Tolcapone 12 0.65 0.927 * 0.095 - 1.2 n.d. rt, rb, dg, hm 27
Tolterodine 62 0.62 0.978 * 0.34 - 0.90 n.d. ms, rt, dg, hm 87
Tosufloxacin 64 0.80 0.919 * 0.36 - 1.24 n.d. ms, rt, mk, dg, hm 43
Trimethadione 4.1 0.70 0.942 *** 0.50 - 0.90 0.39 - 1.0 ms, hs, rt, rb, dg, mk, hm 88
Troglitazone 12 0.81 0.988 ** 0.54 - 1.1 0.19 - 1.4 ms, rt, mk, dg 89
Tylosin 54 0.69 0.993 0.053 n.d. n.d. rt, dg, cw 48
Zalcitabine 15 0.82 0.983 *** 0.62 - 1.0 0.45 - 1.2 ms, rt, ct, mk, hm 41
Zidovudine 26 0.95 0.981 ** 0.71 - 1.2 (iv) 0.51 - 1.4 ms, rt, mk, dg, hm 41
Zomepirac 1.6 1.2 0.902 ** 0.63 - 1.7 0.28 - 2.0 ms, rt, rb, hs, mk, hm 90

Note: CL = clearance, BW = body weight, CI = confidence interval. (i) Coefficient of determination. (ii) Statistical testing against b = 0: P < 0.05 (*); P < 0.01 (**); P < 0.001 (***). (iii) Excluding b = 0.75. (iv) Excluding b = 0.67. (v) Excluding both b = 0.75 and b = 0.67. (vi) n.d.: not determined because of a lack of correlation between CL and BW at the significance level = 0.05 (column 6) and = 0.01 (column 7). (vii) rt, rat; rb, rabbit; bb, baboon; mk, monkey; dg, dog; hm, human; ms, mouse; cz, chimpanzee; sh, sheep; ck, chicken; pn, pigeon; gp, guinea pig; pg, pig; ct, cat; cw, cow; gt, goat; mt, marmoset; hs, hamster.


Measures of Center

If X is a random variable (e.g., age) defined on a specific population, it has a certain theoretical “distribution” of values in that population, with definite characteristics. One of these is a “center” around which the distribution is located; another is a “spread” which would correspond to the amount of its variability. (Although there are some distributions – such as the Cauchy distribution – for which this is not true, they are infrequently encountered in practice.) Also, these two objects are often independent of one another; knowing one gives no information about the other. (Although, again, there are some important exceptions.) Of course, “center” and “spread” are vague, informal terms that require clarification. Furthermore, even with precise definitions (later), as it is usually impossible to measure every individual in the population, these so-called population characteristics – or parameters – are typically unknown quantities, even though they may exist. However, we can begin to approach an understanding of their meanings, using various estimates based on random sample data. These parameter estimators are so-called sample characteristics – or statistics – and are entirely computable from the data values, hence known. (Although they will differ from sample to sample, let us assume a single random sample for now.)

Suppose the collection $\{x_1, x_2, x_3, \ldots, x_n\}$ represents a random sample of n measurements of the variable X. For the sake of simplicity, we will also assume that these data values are sorted from low to high. (Duplicates may also occur; two individuals could be the same age, for example.) There are three main “measures of center” that are used in practice, representing what might be thought of as an estimate of a “typical” value in the population. These are listed below with some of their most basic properties. (The most common “measure of spread,” the sample standard deviation, will be discussed in lecture.)

sample mode

This is simply that data value which occurs most often, i.e., has the largest frequency. It gives some information, but is rather crude. A distribution with exactly one mode is called unimodal (such as “the bell curve”); a bimodal distribution has two modes, at least locally, and could be thought of as two unimodal distributions (which could have unequal “heights”) that are blended together. This suggests that the sample consists mostly of two distinct subgroups which differ in the variable being measured, e.g., the ages of infants and geriatrics.

sample median

This is the value that divides the dataset into two equally-sized halves. That is, half the data values are below the median, and half are above. As the data has been sorted, this is particularly easy to find. If the sample size n is odd, there will be exactly one data value that is located at the exact middle (in position number $(n+1)/2$). However, this will not be the case if n is even, so here the median is defined as the average of the two data values that bracket the exact middle (position $n/2$ to its immediate left, and position $n/2 + 1$ to its immediate right).

The median is most useful as a measure of center if there are so-called outliers in the data – i.e., “extreme” values. (Again, there is a formal definition, which we will not pursue here.) For instance, in a dataset of company employee salaries that happens to include the CEO, the median would be a more accurate representative of a “typical” salary, than say, the average.

sample mean

The calculation and properties of this most common “measure of center” will be discussed in detail in lecture. In a perfectly symmetric distribution (such as a “bell curve”), the mean and median will be exactly equal to each other. However, the presence of many outliers on either end of the distribution will tend to pull the mean in that direction, while having no effect on the median. Hence, an asymmetric distribution has a “negatively skewed” tail (or is “skewed to the left”) if the mean < median, or likewise, a “positively skewed” tail (or is “skewed to the right”) if the mean > median.
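For example, a small R illustration (with arbitrary made-up salaries, in $1000s) of how a single outlier pulls the mean but leaves the median alone:

salaries <- c(31, 33, 35, 36, 38, 40, 42, 45, 48, 300)   # the last value is the CEO
mean(salaries)                                 # 64.8, pulled toward the outlier
median(salaries)                               # 39, unaffected by the CEO's salary
# A crude way to get the sample mode (the most frequent value):
x <- c(2, 3, 3, 5, 7, 3, 2)
as.numeric(names(which.max(table(x))))         # 3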


A Few Words on Mathematical Notation

Every grade school pupil knows that the average (i.e., sample mean) of a set of values is computed by “adding them, and then dividing this sum by the number of values in the set.” As mathematical procedures such as this become more complex however, it becomes more necessary to devise a succinct way to express them, rather than writing them out in words. Proper mathematical notation allows us to do this. First, as above, if we express a generic set of n values as $\{x_1, x_2, x_3, \ldots, x_n\}$, then the sum can be written as $x_1 + x_2 + x_3 + \cdots + x_n$. [Note: The ellipsis (...) indicates the x-values that are in between, but not explicitly written.] But even this shorthand will eventually become too cumbersome. So mathematicians have created a standard way to rewrite an expression like this, using so-called “sigma notation.” The sum $x_1 + x_2 + x_3 + \cdots + x_n$ can be abstractly, but more conveniently, expressed as

$$\sum_{i=1}^{n} x_i ,$$

which we now dissect. The symbol $\sum$ is the uppercase Greek letter “sigma” – equivalent to the letter “S” in English – and stands for summation (i.e., addition). The objects being summed are the values $x_1, x_2, x_3, \ldots, x_n$ – or for short, the generic symbol $x_i$ – as the index i ranges from 1, 2, 3, ..., n. Thus, the first term of the sum is $x_i$ when i = 1, or simply $x_1$. Likewise, the second term of the sum is $x_i$ when i = 2, or $x_2$, and so on. This pattern continues until the last value of the summation, which is $x_i$ when i = n, or $x_n$. Hence, the formula written above would literally be read as “the sum of x-sub-i, as i ranges from 1 to n.” (If the context is clear, sometimes symbols are dropped for convenience, as in $\sum_{1}^{n} x_i$, or even just $\sum x$.)

Therefore, the mean would be equal to this sum divided by n, or $\dfrac{\sum_{i=1}^{n} x_i}{n}$. And since division by n is equivalent to multiplication by its reciprocal $\frac{1}{n}$ (e.g., dividing by 3 is the same as multiplying by 1/3), this last expression can also be written as $\frac{1}{n}\sum_{i=1}^{n} x_i$. This is the quantity to be calculated for the sample mean, $\bar{x}$.

Finally, because repeated values can arise, each $x_i$ comes naturally equipped with a frequency, labeled $f_i$, equal to the number of times it occurs in the original dataset of n values. Thus, for example, if the value $x_3$ actually occurs 5 times, then its corresponding frequency is $f_3 = 5$. (If, say, $x_3$ is not repeated, then its frequency is $f_3 = 1$, for it occurs only once.) A related concept is the relative frequency of $x_i$, which is defined as the ratio $\frac{f_i}{n}$. In order to emphasize that this ratio explicitly depends on (or, to say it mathematically, “is a function of”) the value $x_i$, it is often customary to symbolize $\frac{f_i}{n}$ with an alternative notation, $f(x_i)$, read “f of x-sub-i.” So, to summarize, $f_i$ is the absolute frequency of $x_i$, but $f(x_i) = \frac{f_i}{n}$ is the relative frequency of $x_i$. They look very similar, but they are not the same, so try not to confuse them. [Peeking ahead: Later, $f(x_i)$ will denote the “probability” that the value $x_i$ occurs in a population, which is a direct extension of the concept of “relative frequency” for a sample data value $x_i$.]
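In R, absolute and relative frequencies can be tabulated directly, and the sample mean can be recovered as a frequency-weighted sum; a short sketch with made-up ages:

ages <- c(18, 19, 19, 20, 20, 20, 21, 24)
f    <- table(ages)                    # absolute frequencies f_i
relf <- f / length(ages)               # relative frequencies f(x_i) = f_i / n
sum(as.numeric(names(f)) * relf)       # 20.125, identical to mean(ages)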


Sample Quartiles

We have seen that the sample median of a data set {x1, x2, x3,…, xn}, sorted in increasing order, is a value that divides it in such a way, that exactly half (i.e., 50%) of the sample observations fall below the median, and exactly half (50%) are above it.

• If the sample size n is odd, then precisely one of the data values will lie at the exact center; this value is located at position (n + 1)/2 in the data set, and corresponds to the median.

• If the sample size n is even however, then the exact middle of the data set will fall between two values, located at positions n/2 and n/2 + 1. In this case, it is customary to define the median as the average of the two values, which lies midway between them.

Sample quartiles are defined similarly: they divide the data set into quarters (i.e., 25% each). The first quartile, designated Q1, marks the cutoff below which lies the lowest 25% of the sample. Likewise, the second quartile Q2 marks the cutoff between the second lowest 25% and second highest 25% of the sample; note that this coincides with the sample median! Finally, the third quartile Q3 marks the cutoff above which lies the highest 25% of the sample.

This procedure of ranking data is not limited to quartiles. For example, if we wanted to divide a sample into ten intervals of 10% each, the cutoff points would be known as sample deciles. In general, the cutoff values that divide a data set into any given proportion(s) are known as sample quantiles or sample percentiles. For example, receiving an exam score in the “90th percentile” means that 90% of the scores are below it, and 10% are above.

For technical reasons, the strict definitions of quartiles and other percentiles follow rigorous mathematical formulas; however, these formulas can differ slightly from one reference to another. As a consequence, different statistical computing packages frequently output slightly different values from one another. On the other hand, these differences are usually very minor, especially for very large data sets.
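For instance, R's quantile() function implements several such formulas through its type argument (the default is type = 7); on a small sample the conventions can disagree slightly:

x <- c(1, 2, 4, 7, 11, 16, 22)
quantile(x, probs = c(.25, .50, .75))             # default convention (type = 7)
quantile(x, probs = c(.25, .50, .75), type = 6)   # another common convention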

Exercise 1 (not required): Using the R code given below, generate and view a random sample of n = 40 positive values, and find the quartiles via the so-called “five number summary” that is output.

# Create and view a sorted sample, rounded to 3 decimal places.
x = round(sort(rchisq(40, 1)), 3)
print(x)
y = rep(0, 40)

# Plot it along the real number line.
plot.new()
plot(x, y, pch = 19, cex = .5, xlim = range(0, 1 + max(x)),
     ylim = range(0, 0.01), ylab = "", axes = F)
axis(1)

# Identify the quartiles.
summary(x)

# Plot the median Q2 (with a filled red circle).
Q2 = summary(x)[3]
points(Q2, 0, col = "red", pch = 19)

# Plot the first quartile Q1 (with a filled blue circle).
Q1 = summary(x)[2]
points(Q1, 0, col = "blue", pch = 19)

# Plot the third quartile Q3 (with a filled green circle).
Q3 = summary(x)[5]
points(Q3, 0, col = "green", pch = 19)


Exercise 2 (not required): Using the same sample, sketch and interpret

boxplot(x, pch = 19)

identifying all relevant features. From the results of these two exercises, what can you conclude about the general “shape” of this distribution?

NOTE: Finding the approximate quartiles (or other percentiles) of grouped data can be a little more challenging. Refer to the Lecture Notes, pages 2.3-4 to 2.3-6, and especially 2.3-11.


To illustrate the idea of estimating quartiles from the density histogram of grouped data, let us consider a previous, posted exam problem (Fall 2013).

• First, we find the median Q2, i.e., the value on the X-axis that divides the total area of 1 into .50 area on either side of it. By inspection, the cumulative area below the left endpoint 4 is equal to .10 + .20 = .30, too small. Likewise, the cumulative area below the right endpoint 12 is .10 + .20 + .30 = .60, too big. Therefore, in order to have .50 area both below and above it, Q2 must lie in the third interval [4, 12), in such a way that its corresponding rectangle of .30 area is split into left and right sub-areas of .20 + .10, respectively:

[Density histogram: class rectangles with areas .10, .20, .30, .25, .15. Q2 splits the third rectangle (area .30) so that Total Area below Q2 = .10 + .20 + .20 = .50 and Total Area above Q2 = .10 + .25 + .15 = .50.]


Now just focus on this particular rectangle…

… and use any of the three boxed formulas on page 2.3-5 of the Lecture Notes, with the quantities labeled above. For example, the third formula (which I think is easiest) yields

$Q_2 = \dfrac{Ab + Ba}{A + B} = \dfrac{(.2)(12) + (.1)(4)}{.2 + .1} = \dfrac{2.8}{.3} = 9.333.$

• The other quartiles are computed similarly. For example, the first quartile Q1 is the cutoff for the lowest 25% of the area. By the same logic, this value must lie in the second interval [2, 4), and split its corresponding rectangle of .20 area into left and right sub-areas of .15 + .05, respectively, as shown in the diagrams below:

[Diagram: the rectangle over [4, 12), with a = 4, b = 12, and density .0375, split at Q2 into left area A = .20 and right area B = .10.]

[Diagram: the rectangle over [2, 4), with a = 2, b = 4, and density .10, split at Q1 into left area A = .15 and right area B = .05.]

Therefore,

$Q_1 = \dfrac{Ab + Ba}{A + B} = \dfrac{(.15)(4) + (.05)(2)}{.15 + .05} = \dfrac{.7}{.2} = 3.5.$


• Likewise, the third quartile Q3 is the cutoff for the highest 25% of the area. By the same logic as before, this value must lie in the fourth interval [12, 22), and split its corresponding rectangle of .25 area into left and right sub-areas of .15 + .10, respectively (here a = 12, b = 22, A = .15, B = .10, and the density is .025). Therefore,

$Q_3 = \dfrac{Ab + Ba}{A + B} = \dfrac{(.15)(22) + (.10)(12)}{.15 + .10} = \dfrac{4.5}{.25} = 18.$

Estimating a sample proportion between two known quantile values is done pretty much the same way, except in reverse, using the formulas on the bottom of the same page, 2.3-5. For example, the same problem asks to estimate the sample proportion in the interval [9, 30). This interval consists of the disjoint union of the subintervals [9, 12), [12, 22), and [22, 30).

• The first subinterval [9, 12) splits the corresponding rectangle of area .30 over the class interval [4, 12) into unknown left and right subareas A and B, respectively, as shown below. Since it is the right subarea B we want, we use the formula B = (b – Q) × Density = (12 – 9) × .0375 = .1125.

• The next subinterval [12, 22) contains the entire corresponding rectangular area of .25.

• The last subinterval [22, 30) splits the corresponding rectangle of area .15 over the class

interval [22, 34) into unknown left and right subareas A and B, respectively, as shown below. In this case, it is the left subarea A that we want, so we use A = (Q – a) × Density = (30 – 22) × .0125 = .10.

Adding these three areas together yields our answer, .1125 + .25 + .10 = .4625.

Page 2.3-6 gives a way to calculate quartiles, etc., from the cumulative distribution function (cdf) table, without using the density histogram.

[Diagram: the rectangle over [4, 12), with a = 4, b = 12, and density .0375, split at Q = 9 into a left subarea A and a right subarea B = ?.]

[Diagram: the rectangle over [22, 34), with a = 22, b = 34, and density .0125, split at Q = 30 into a left subarea A = ? and a right subarea B.]
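A small R function (not part of the original notes) implementing the interpolation formula Q = (Ab + Ba)/(A + B) used above; the calls reproduce the three quartiles of this example, and the last line reproduces the proportion calculation for [9, 30):

grouped.quantile <- function(a, b, A, B) (A * b + B * a) / (A + B)
grouped.quantile(a = 4,  b = 12, A = .20, B = .10)    # Q2 = 9.333
grouped.quantile(a = 2,  b = 4,  A = .15, B = .05)    # Q1 = 3.5
grouped.quantile(a = 12, b = 22, A = .15, B = .10)    # Q3 = 18
(12 - 9) * .0375 + .25 + (30 - 22) * .0125            # proportion in [9, 30) = .4625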