BIOL2300 Biostatistics Chapter 7 - Boston College

Point estimates, confidence intervals and minimum sample size for proportion, mean and variance

Inferential statistics

• estimate a population parameter (proportion, mean, variance)
– confidence interval for point estimate
• hypothesis testing
– H1: patients taking Vioxx (Merck) have MORE heart attacks and strokes than those not taking Vioxx
– H0: NULL hypothesis, EQUAL number of heart attacks and strokes in control and treatment groups

Merck

• The arthritis medication Vioxx was removed from the US market on Sep 30, 2004, after data from a clinical trial showed an increased risk of heart attack, stroke, blood clots and other cardiovascular illnesses.
• The FDA announced in 2004 that patients taking Vioxx have a 50 percent greater chance of heart attacks and sudden cardiac death, and that patients taking the highest recommended daily dosage of Vioxx had a 300 percent greater chance of heart attack and sudden cardiac death.
– Source of the information in this paragraph: http://www.yourlawyer.com/topics/overview/vioxx

ESTIMATING PROPORTIONS

Requirements to check before approximating population proportion

• Simple random sample
• Sampling with replacement (binomial)
– fixed number of trials
– trials independent
– 2 outcomes: success, failure
• Conditions allowing approximation of binomial by normal are satisfied:

where p, q are estimated by p-hat (resp. q-hat), the proportion of successes (resp. failures) in the sample

• point estimate: a single value used as approximation for a population parameter

• confidence interval: a real interval such that we are 1-α confident that p lies in that interval

Rationale for confidence interval

Warning: Take care with proper interpretation of confidence intervals

• It does NOT mean that the probability that p lies in a given interval (a,b) is 95% (that is, 1 - α with α = 5%), since p is a fixed constant and either belongs to (a,b) or does not.

• Proper interpretation: if one were to take ALL samples of size n and compute the corresponding p-hat and the interval (p-hat - E, p-hat + E) for each sample, where E is the margin of error, then p would belong to about 95% of those intervals.
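The repeated-sampling interpretation can be checked by simulation. The sketch below is only an illustration: the true proportion p = 0.3, the sample size n = 200, the 2,000 repetitions and the function name `coverage` are all hypothetical choices, and the interval is assumed to be the usual normal-approximation interval p-hat ± z·sqrt(p-hat·q-hat/n).

```python
import random
from statistics import NormalDist

def coverage(p=0.3, n=200, conf=0.95, reps=2000, seed=1):
    """Fraction of simulated samples whose interval contains the true p."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # critical value z_(alpha/2)
    hits = 0
    for _ in range(reps):
        # One sample of n Bernoulli trials with success probability p.
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        e = z * (p_hat * (1 - p_hat) / n) ** 0.5   # margin of error
        if p_hat - e < p < p_hat + e:
            hits += 1
    return hits / reps

print(coverage())  # close to 0.95; the exact value depends on the seed
```

Each individual interval either contains p or does not; the 95% describes how often the procedure succeeds over many samples.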

Critical value

• Critical value for 1-α confidence interval is the z-score z_α/2 such that P(Z > z_α/2) = α/2, where Z follows the standard normal distribution
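As a quick check of the definition, critical values can be computed with Python's standard library instead of a z-table. This is a minimal sketch; the function name `z_critical` is an illustrative choice, not something from the slides.

```python
from statistics import NormalDist

def z_critical(conf):
    """Critical value z_(alpha/2) for a two-sided interval with confidence conf."""
    alpha = 1 - conf
    # Area alpha/2 lies to the right, so area 1 - alpha/2 lies to the left.
    return NormalDist().inv_cdf(1 - alpha / 2)

for conf in (0.90, 0.95, 0.99):
    print(conf, round(z_critical(conf), 3))
```

For 90%, 95% and 99% confidence this gives approximately 1.645, 1.96 and 2.576.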

How to find 95% confidence interval for point estimator of proportion: Margin of Error

• Recall that for the proportion of successes p-hat in n independent Bernoulli trials, we have mean p and standard deviation √(pq/n)

Margin of error E for estimate of proportion for 1-α confidence interval

Why is E so defined?
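One way to see where E comes from is sketched below. It assumes that p-hat is approximately normal with mean p and standard deviation √(pq/n); the slide's own formulas are not reproduced in this text, so the notation z_{α/2} for the critical value is an assumption.

```latex
\begin{align*}
1-\alpha &\approx P\!\left(-z_{\alpha/2} \le \frac{\hat p - p}{\sqrt{pq/n}} \le z_{\alpha/2}\right) \\
 &= P\!\left(|\hat p - p| \le z_{\alpha/2}\sqrt{\frac{pq}{n}}\right) \\
 &\approx P\!\left(\hat p - E \le p \le \hat p + E\right),
 \qquad E = z_{\alpha/2}\sqrt{\frac{\hat p\,\hat q}{n}} .
\end{align*}
```

The last step replaces the unknown p and q under the square root by their estimates p-hat and q-hat.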

Sample size for estimating proportion p with confidence 1-α

Minimum sample size is independent of population size

• Previous computation of minimum sample size depends on probability p of success, q of failure, margin of error E, and population standard deviation, but NOT on population size.
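A minimal sketch of the sample-size computation, assuming the standard formula n = z²·p·q / E² rounded up (the slide's own formula is not reproduced in this text). The function name and the example margin of 0.03 are hypothetical. Using p = q = 0.5 when no estimate is available is the usual conservative choice, since it maximizes p·q.

```python
import math
from statistics import NormalDist

def min_sample_size_proportion(margin, conf=0.95, p_hat=None):
    """Smallest integer n with n >= z^2 * p * q / E^2."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = 0.5 if p_hat is None else p_hat      # p = q = 0.5 if p is unknown
    return math.ceil(z * z * p * (1 - p) / margin ** 2)

print(min_sample_size_proportion(0.03))             # no prior estimate of p
print(min_sample_size_proportion(0.03, p_hat=0.2))  # prior estimate p-hat = 0.2
```

The population size does not appear among the arguments, which is the point made above.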

Example

Problem on proportion estimation

• The drug Eliquis (apixaban) is used to help prevent blood clots in certain patients. In clinical trials, among 5924 patients treated with Eliquis, 153 developed the adverse reaction of nausea (based on data from Bristol-Myers Squibb Co.). Construct a 99% confidence interval for the proportion of adverse reactions.
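The Eliquis problem can be worked numerically as follows. This is a sketch that assumes the normal-approximation interval p-hat ± z·sqrt(p-hat·q-hat/n); the counts 153 and 5924 come from the problem statement, and the function name `proportion_ci` is illustrative.

```python
from statistics import NormalDist

def proportion_ci(successes, n, conf):
    """Point estimate, margin of error and confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # critical value z_(alpha/2)
    e = z * (p_hat * (1 - p_hat) / n) ** 0.5       # margin of error E
    return p_hat, e, (p_hat - e, p_hat + e)

p_hat, e, (low, high) = proportion_ci(153, 5924, 0.99)
print(f"p-hat = {p_hat:.4f}, E = {e:.4f}, CI = ({low:.4f}, {high:.4f})")
```

The interval comes out at roughly 2.05% to 3.11% for the proportion of patients who develop nausea.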

ESTIMATING THE MEAN

Estimate of mean

Requirements to estimate µ when σ is known

1) Simple random sample (all samples of same size have same probability of being selected)

2) Value of population standard deviation is known

3) Population is normally distributed or n>30

Problems

Answer: yes

1) Simple random sample (all samples of same size have same probability of being selected)

2) Value of population standard deviation is known: sigma = 13

3) n=239>30
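Once the three requirements hold, the interval for µ is x-bar ± z·σ/√n. The sketch below uses σ = 13 and n = 239 from the problem above. The sample mean is not given in this text, so `x_bar = 100.0` is a made-up placeholder, and the 95% confidence level is likewise an assumption.

```python
from statistics import NormalDist

def mean_ci_known_sigma(x_bar, sigma, n, conf=0.95):
    """Margin of error and confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    e = z * sigma / n ** 0.5
    return e, (x_bar - e, x_bar + e)

# sigma = 13 and n = 239 are from the problem; x_bar = 100.0 is hypothetical.
e, (low, high) = mean_ci_known_sigma(x_bar=100.0, sigma=13, n=239)
print(round(e, 3), round(low, 3), round(high, 3))
```

With these inputs the margin of error is about 1.65, whatever the sample mean turns out to be.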

Sample size for estimating mean µ with confidence 1-α

This is not quite right, since the result must be rounded UP to the next whole number, for example with the Excel function ROUNDUP(real value, 0) or CEILING(real value, 1).
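The same rounding-up rule in code, as a sketch that assumes the standard formula n = (z·σ/E)². The value σ = 13 is reused from the earlier problem, while the margin E = 2 and the function name are hypothetical.

```python
import math
from statistics import NormalDist

def min_sample_size_mean(sigma, margin, conf=0.95):
    """Smallest integer n with n >= (z * sigma / E)^2."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil((z * sigma / margin) ** 2)   # ceil rounds UP

print(min_sample_size_mean(sigma=13, margin=2))
```

Here the unrounded value is about 162.3, so ordinary rounding would give 162 and fall just short of the required precision; rounding up gives 163.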

If σ is not known, or if n<30, then replace the critical value for the normal distribution by the critical value for the t-distribution

t-distribution is symmetric, but wider than normal distribution

Superimposition of normal and Student T-distributions

Standard normal distribution has a HIGHER y-intercept (~0.4) than all T-distributions, but then has SLIMMER tails.
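The comparison in the plot can be reproduced numerically. This sketch evaluates both densities at 0 (the y-intercept) and at 3 (in the tail). It assumes that "n = 3…12" on the next slide refers to sample sizes, so the degrees of freedom are n - 1; the helper `t_pdf` is written here from the standard density formula.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

normal = NormalDist()
print("normal", round(normal.pdf(0), 4), round(normal.pdf(3), 4))
for n in range(3, 13):        # sample sizes n = 3..12, so df = n - 1 (assumed)
    df = n - 1
    print(n, round(t_pdf(0, df), 4), round(t_pdf(3, df), 4))
```

In every row the t-density is lower than the normal density at 0 and higher at 3, and the gap shrinks as n grows.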

T-distribution for n=3…12

2-tailed T-test

Right-tailed T-test

Left-tailed T-test

Excel analogue of normdist for the t-distribution, when n<30 or population is approximately normal but σ not known

TDIST
Returns the Percentage Points (probability) for the Student t-distribution where a numeric value (x) is a calculated value of t for which the Percentage Points are to be computed. The t-distribution is used in the hypothesis testing of small sample data sets. Use this function in place of a table of critical values for the t-distribution.

Syntax
TDIST(x,degrees_freedom,tails)
X is the numeric value at which to evaluate the distribution.
Degrees_freedom is an integer indicating the number of degrees of freedom.
Tails specifies the number of distribution tails to return. If tails = 1, TDIST returns the one-tailed distribution. If tails = 2, TDIST returns the two-tailed distribution.

Excel analogue of norminv for computing critical values for the t-distribution, when n<30 or population is approximately normal but σ not known

TINV
Returns the t-value of the Student's t-distribution as a function of the probability and the degrees of freedom.

Syntax
TINV(probability,degrees_freedom)
Probability is the probability associated with the two-tailed Student's t-distribution.
Degrees_freedom is the number of degrees of freedom to characterize the distribution.

Remarks
• TINV returns that value t, such that P(|X| > t) = probability where X is a random variable that follows the t-distribution and P(|X| > t) = P(X < -t or X > t).
• A one-tailed t-value can be returned by replacing probability with 2*probability. For a probability of 0.05 and degrees of freedom of 10, the two-tailed value is calculated with TINV(0.05,10), which returns 2.228139. The one-tailed value for the same probability and degrees of freedom can be calculated with TINV(2*0.05,10), which returns 1.812462.
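For readers without Excel, the behaviour of TDIST and TINV can be imitated with the standard library alone. The sketch below is not how Excel computes these values: it integrates the t-density with Simpson's rule and inverts by bisection, and the names `tdist` and `tinv` are chosen only to mirror the Excel functions. It is intended for moderate degrees of freedom and probabilities.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=1000):
    """P(T <= x): Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                 # area between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area

def tdist(x, df, tails):
    """Like TDIST for x >= 0: area in one or both tails beyond x."""
    return tails * (1 - t_cdf(x, df))

def tinv(prob, df):
    """Like TINV: the t with P(|T| > t) = prob, found by bisection."""
    lo, hi = 0.0, 1.0
    while tdist(hi, df, 2) > prob:       # expand until the root is bracketed
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if tdist(mid, df, 2) > prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(tinv(0.05, 10), 4))      # two-tailed, probability 0.05
print(round(tinv(2 * 0.05, 10), 4))  # one-tailed, probability 0.05
```

The two printed values agree with the TINV examples in the remarks above to four decimal places.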

Answer

• The t-distribution is wider than the normal distribution, so to capture 95% of area under the curve, when centered at the mean, one must go out further; that is, the 95% confidence interval when using the t-distribution is LARGER than the 95% confidence interval when using the normal distribution.

• Thus the answer to the question is that the confidence is LESS than 95%.

ESTIMATING THE VARIANCE

Variance estimation

• point estimate of variance
• confidence interval
• minimum sample size

• estimation of variance (stdev) of population is used in quality control

Chi square distribution χ2

Chi square distribution is NOT symmetric, unlike the normal and t-distributions.

Approximation of Chi-Square distribution with 10 df

Let Z = X1 + … + X10 be the sum of 10 values, each value Xi = Yi^2, where Yi is sampled from the standard normal distribution. Obtain 10,000 values for Z and create a histogram.
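The construction just described can be run directly. This is a minimal sketch using only the standard library: the text histogram, the bin width of 2 and the seed are arbitrary presentation choices, and the function name `simulate_chi_square` is illustrative.

```python
import random
from collections import Counter

def simulate_chi_square(df=10, draws=10_000, seed=0):
    """Each value is a sum of df squared standard-normal samples."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(draws)]

z_values = simulate_chi_square()

# Text histogram with bins of width 2; one '#' per 50 observations.
bins = Counter(int(z // 2) * 2 for z in z_values)
for left in sorted(bins):
    print(f"{left:>3}-{left + 2:<3} {'#' * (bins[left] // 50)}")

print("mean:", round(sum(z_values) / len(z_values), 2))  # theory: mean = df = 10
```

The histogram is skewed to the right, with a long upper tail, and the sample mean lands near the number of degrees of freedom.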

Definition of χ2 distribution

Using χ2 distribution

Excel function chidist

CHIDIST
Returns the one-tailed probability of the χ2 distribution. The χ2 distribution is associated with a χ2 test. Use the χ2 test to compare observed and expected values. For example, a genetic experiment might hypothesize that the next generation of plants will exhibit a certain set of colors. By comparing the observed results with the expected ones, you can decide whether your original hypothesis is valid.

Syntax
CHIDIST(x,degrees_freedom)
X is the value at which you want to evaluate the distribution.
Degrees_freedom is the number of degrees of freedom.

Remarks
• CHIDIST is calculated as CHIDIST = P(X>x)

Excel function chiinv

CHIINV
Returns the inverse of the one-tailed probability of the chi-squared distribution. If probability = CHIDIST(x,df), then CHIINV(probability,df) = x. Use this function to compare observed results with expected ones in order to decide whether your original hypothesis is valid.

Syntax
CHIINV(probability,degrees_freedom)
Probability is a probability associated with the chi-squared distribution.
Degrees_freedom is the number of degrees of freedom.

Remarks
• Given a value for probability, CHIINV seeks that value x such that CHIDIST(x, degrees_freedom) = probability. Thus, precision of CHIINV depends on precision of CHIDIST. CHIINV uses an iterative search technique. If the search has not converged after 100 iterations, the function returns the #N/A error value.
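The relationship between CHIDIST and CHIINV, and their use for a variance interval, can be sketched with the standard library. This is not Excel's algorithm: the right-tail probability is computed from a series for the incomplete gamma function and inverted by bisection. The interval ((n-1)s²/χ²_R, (n-1)s²/χ²_L) is the standard textbook form, assumed here because the slide's formula is not reproduced in this text; the sample values s² = 4 and n = 11 are hypothetical.

```python
import math

def chi2_cdf(x, df):
    """P(X <= x) via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, half = df / 2, x / 2
    term = 1 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= half / (a + n)
        total += term
        n += 1
    return total * math.exp(-half + a * math.log(half) - math.lgamma(a))

def chidist(x, df):
    """Like CHIDIST: right-tail probability P(X > x)."""
    return 1 - chi2_cdf(x, df)

def chiinv(prob, df):
    """Like CHIINV: the x with P(X > x) = prob, found by bisection."""
    lo, hi = 0.0, float(df)
    while chidist(hi, df) > prob:        # expand until the root is bracketed
        hi *= 2
    for _ in range(80):
        mid = (lo + hi) / 2
        if chidist(mid, df) > prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def variance_ci(s2, n, conf=0.95):
    """Confidence interval for the population variance (normal population assumed)."""
    alpha, df = 1 - conf, n - 1
    chi_right = chiinv(alpha / 2, df)      # area alpha/2 to its right
    chi_left = chiinv(1 - alpha / 2, df)   # area 1 - alpha/2 to its right
    return df * s2 / chi_right, df * s2 / chi_left

print(round(chiinv(0.05, 10), 3), round(chidist(18.307, 10), 4))
print(variance_ci(s2=4.0, n=11))         # hypothetical sample: s^2 = 4, n = 11
```

Because the χ2 distribution is not symmetric, the interval is not centered on s²: the upper limit lies much further from the point estimate than the lower one. Taking square roots of both limits gives the interval for the standard deviation.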

Using Excel to compute minimum sample size for variance or stdev estimations

BOOTSTRAPPING

Bootstrapping allows one to create confidence intervals (CI) for proportions, means, variances and standard deviations in the case that the requirements for parametric methods are not satisfied. NEVERTHELESS, the initial sample must NOT be biased; it must be a simple random sample.

Bootstrapping construction of a 90% confidence interval for proportion
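A minimal sketch of one common bootstrap construction, the percentile method, is given below. The slides' own procedure (a Mathematica demo) is not reproduced in this text, so this may differ from it in detail. The sample of 30 successes in 100 trials, the 2,000 resamples and all function names are hypothetical.

```python
import random

def bootstrap_ci(data, statistic, conf=0.90, resamples=2000, seed=0):
    """Percentile bootstrap: resample with replacement, then take quantiles."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(statistic(rng.choices(data, k=n)) for _ in range(resamples))
    alpha = 1 - conf
    low = stats[int(resamples * alpha / 2)]             # about the 5th percentile
    high = stats[int(resamples * (1 - alpha / 2)) - 1]  # about the 95th percentile
    return low, high

def proportion(sample):
    """Proportion of successes in a sample coded as 1 = success, 0 = failure."""
    return sum(sample) / len(sample)

data = [1] * 30 + [0] * 70      # hypothetical sample: 30 successes in 100 trials
low, high = bootstrap_ci(data, proportion)
print(low, high)
```

Passing a different statistic, such as `statistics.mean`, `statistics.variance` or `statistics.stdev`, to the same function gives the corresponding interval for a mean, variance or standard deviation.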

Mathematica demo

Similar constructions for mean, variance and standard deviation
