60
EPIDEMIOLOGY AND BIOSTATISTICS REVIEW, PART I Tommy Byrd MSII

biostatsepi_pathreview-part1

Embed Size (px)

DESCRIPTION

high yield review for public health

Citation preview

Page 1: biostatsepi_pathreview-part1

EPIDEMIOLOGY AND BIOSTATISTICS REVIEW, PART I Tommy Byrd MSII

Page 2: biostatsepi_pathreview-part1

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

Page 3: biostatsepi_pathreview-part1

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

Page 4: biostatsepi_pathreview-part1

Know the 4 scales of data measurement

• Nominal • Ordinal • Interval • Ratio

Page 5: biostatsepi_pathreview-part1

Nominal scale data are divided into qualitative categories or groups

Male

Black

Suburban

Female

White

Rural

Page 6: biostatsepi_pathreview-part1

Ordinal scale data has an order • Class rankings data (1st / 2nd / 3rd…)

• Answers to these types of questions:

**But it does not describe the size of the interval (eg. it cannot tell by how many percentage points Tommy is ranked 1st in his class)

Page 7: biostatsepi_pathreview-part1

Interval scale data has order and a set interval • Celsius (and Fahrenheit) temperatures

• Anno Domini years (1990, 1991, 1992, etc.)

**But ratios of this kind of data are not meaningful •  100°C is not twice as hot as 50°C because 0°C does not

indicate a complete absence of heat

Page 8: biostatsepi_pathreview-part1

Ratio scale data has order, a set interval, and is based on an absolute zero • Kelvin temperatures • MOST BIOMEDICAL VARIABLES

•  Weight (grams, pounds) •  Time (seconds, days à ‘zero’ is the starting point of measurement) •  Age (years) •  Blood pressure (mmHg) •  Pulse (beats per minute)

• With these types of data ratios are valid: •  300K is twice as hot as 150K •  A pulse rate of 120 beats/min is twice as fast as a pulse rate of 60

beats/min

Page 9: biostatsepi_pathreview-part1

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

Page 10: biostatsepi_pathreview-part1

Many naturally occurring phenomena are distributed in the bell-shaped normal or Gaussian distribution

Score

(Blood pressure, cholesterol, etc.)

Page 11: biostatsepi_pathreview-part1

Skewed distributions are described by the location of the tail of the curve, not the location of the hump

a.k.a. “Left skew” a.k.a. “Right skew”

Page 12: biostatsepi_pathreview-part1

Know the measures of central tendency

Score

• Mode • Median • Mean

Page 13: biostatsepi_pathreview-part1

Mode is the value that occurs with the greatest frequency

2 4 5 7 4 2 3 6 8 9 7 5 4 4 2 4 6 7 7 7

Bimodal distribution!

2 3 4 5 6 7 8 9

Page 14: biostatsepi_pathreview-part1

Median is the value that divides the distribution in half • Odd # total elements: the median is the middle one • Even # total elements: the median is the average of the

two middle ones

**Very useful measure of central tendency for highly skewed distributions

Page 15: biostatsepi_pathreview-part1

Mean (the average) is the sum of all values divided by the total # of values

• Unlike median and mode, it is very sensitive to extreme scores •  Therefore NOT good for measuring skewed distributions

• Repeated samples drawn from the same population will tend to have very similar means •  Therefore the mean is the measure of central tendency that BEST

resists the influence of fluctuation between different samples

Page 16: biostatsepi_pathreview-part1

Match the mean, median, and mode each with its corresponding hash mark

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.

Glaser, Anthony N. High-yield Biostatistics, Epidemiology, & Public Health. N.p.: n.p., n.d. 9. Print.

Page 17: biostatsepi_pathreview-part1

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

✔ ✔

Page 18: biostatsepi_pathreview-part1

Normal distributions with identical measures of central tendency can have different variabilities • Variability = the extent to which their scores are clustered

together or scattered about The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.

How do we measure this variability???

Page 19: biostatsepi_pathreview-part1

Standard deviation (σ) measures how far away, on average, that values lay away from the mean of the population • Remember the last infectious disease quiz?

•  Let’s assume the mean (average) grade was a 70% with a normal distribution •  If the σ was really HIGH, there was probably a bunch of A’s and a bunch

of F’s in addition to B’s and C’s and D’s •  If the σ was really LOW, most people probably got a high D or low C

So, since we gun hard, how can we use standard deviation to tell exactly how we did in comparison to everybody else?

Page 20: biostatsepi_pathreview-part1

By MEMORIZING these numbers! •  Approx. 68% of the distribution falls within ±1 standard deviations •  Approx. 95% of the distribution falls within ±2 standard deviations •  Approx. 99.7% of the distribution falls within ±3 standard deviations

Generated by CamScanner

Therefore, assuming the σ of the test scores was 10 points, we can assume the following:

So, out of a class of 100, about how many people got an A? (assume extra credit was possible)

A)  9-11 B)  2-3 C)  14-16 D)  4-6 E)  19-21

Grade (%)

Page 21: biostatsepi_pathreview-part1

The z score is simply how many standard deviations the element lies above or below the mean

Generated by CamScanner

Grade (%)

85

z = + 1.5

65 z = − 0.5

A table of z scores compares the z score to the “Area beyond Z”…

Page 22: biostatsepi_pathreview-part1

The z score is simply how many standard deviations the element lies above or below the mean

A table of z scores compares the z score to the “Area beyond Z”…

Generated by CamScanner

6.7% got ‘beyond’ an 85% on our startlingly realistic, made-up test

~7 people here

Page 23: biostatsepi_pathreview-part1

Therefore the z score can be used to specify probability

We know that 6.7% of the class has a grade above 85%, so the probability of one randomly selected person from this population having a grade above 85% is 6.7%, or 0.067

Generated by CamScanner

Page 24: biostatsepi_pathreview-part1

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

✔ ✔ ✔ ✔

Page 25: biostatsepi_pathreview-part1

What if we don’t know every single person’s score on the test? • But, through some stealthy looking-over-shoulders while

people check their online test scores, we can get a sample of random scores •  How close to the actual class average will our sample be?

0% 70% 100%

n = the size of each sample

The # of times that the average of a sample of 4 scores is ~80%

One sample representing one score

Page 26: biostatsepi_pathreview-part1

The standard error of the mean (SEM) is the standard deviation over the square root of the sample size

SEM = σ/√n

0% 70% 100%

Recall that the standard deviation (σ) of this test was 10 percentage points

SEM = 10/√1 = 10

SEM = 10/√4 = 5

SEM = 10/√7 = 3.8

SEM = 10/√10 = 3.2

Page 27: biostatsepi_pathreview-part1

Standard error (SEM) can be used in the same way as standard deviation • But remember that SEM decreases as n é

• Now we have gathered a sample of 10 random scores from our classmates, so:

SEM = σ/√n

SEM = 10/√10 = 3.2 **Do you remember how much of the population falls within 2 standard deviations (or SEMs) of the mean?

Page 28: biostatsepi_pathreview-part1

95% confidence limits are approximately equal to the sample mean plus or minus 2 standard errors

Generated by CamScanner

µ − 3 SEM µ − 2 SEM µ − 1 SEM µ + 1 SEM µ + 2 SEM µ + 3 SEM

Practically, the 95% confidence interval is the range in which the means 95% of samples would be expected to fall

In other words, there is a 95% chance that the average of our random sample would be in this range

Page 29: biostatsepi_pathreview-part1

95% confidence limits are approximately equal to the sample mean plus or minus 2 standard errors • Remember, the σ on our test was 10%, and the mean was

a 70%. We are randomly sampling 10 scores (n=10) •  So the standard error (SEM) = σ/√n = 10/√10 = 3.2%

• We just decided that our sample has a 95% chance of falling within 2 SEMs of the average

• So our 95% confidence interval is 70% ± 2(SEM) = 70% ± 2(3.2%) = 70% ± 6.4% = 63.6% - 76.4%

A random sample of 10 people’s scores on this test has a 95% chance of averaging between 63.6% and 76.4% The width of the confidence interval reflects precision

Page 30: biostatsepi_pathreview-part1

How would we double the precision of an estimate?

• Double the sample size?

• We need to quadruple the sample size!

SEM = σ/√n

Page 31: biostatsepi_pathreview-part1

If we do not know the σ of our population, can we still calculate SEM? • Pretend we don’t have any fancy ExamSoft statistics from

our test, only our sample of 10 scores • We can calculate the standard deviation of the 10 scores

in our sample (S), and substitute it in for σ in the SEM equation to come up with the estimated standard error of the mean

Estimated standard error = S / √n

Page 32: biostatsepi_pathreview-part1

The t score is to the z score as the estimated standard error is to the σ

Generated by CamScanner

For USMLE purposes, consider degrees of freedom (df) to equal n-1

Generated by CamScanner

t = the number of estimated standard errors away from the sample mean

Similar to P – values !

So what do we do with all this?

Page 33: biostatsepi_pathreview-part1

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

✔ ✔ ✔ ✔

Page 34: biostatsepi_pathreview-part1

There are 7 steps in hypothesis testing •  1) State the null and alternative hypothesis, H0 and HA

•  H0 = no difference •  HA = there is a difference

•  2) Select the decision criterion α (“level of significance”) •  3) Establish the critical values of t •  4) Draw a random sample, find its mean •  5) Calculate the standard deviation of the sample (S) and

find the estimated standard error of the sample •  6) Calculate the value of the test statistic t that

corresponds to the mean of the sample (tcalc) •  7) Compare the calculated value of t with the critical

values of t, then accept or reject the null hypothesis

Page 35: biostatsepi_pathreview-part1

Step 1: State the null and alternative hypotheses • We want to test Julia Silva’s claim: “Because of Tommy

and Danielle’s amazing biostats presentation, the average Step 1 score of our class will be 260” •  Null hypothesis = The mean score is 260 •  Alternative hypothesis = The mean score is not 260

• We could ask for the score of every student, but we would rather take a random representative sample so we can save time •  Again, our sample size will be 10 randomly selected students

Page 36: biostatsepi_pathreview-part1

Step 2: Select the decision criteria α • Random sampling error (this is normal) will always cause

our sample mean to deviate slightly from the true mean •  We have to decide what an acceptable level of this chance

deviation is

• α is conventionally set at 0.05 •  If the probability of obtaining the sample mean is greater than 0.05,

H0 is accepted: •  The class indeed scored an average of 260

•  If the probability of obtaining the sample mean less than 0.05, H0 is rejected: •  The class average is either above or below 260

Page 37: biostatsepi_pathreview-part1

Step 3: Establish the critical values of t

Generated by CamScanner

Sample size (n) = 10 students, so df = 9

α = 0.05

So tcrit = ±2.262

Page 38: biostatsepi_pathreview-part1

Step 4: Draw a random sample and calculate the mean of the sample

284   234   268   254   246   264   266   265   245   244  

Average = 257

Page 39: biostatsepi_pathreview-part1

Step 5: Calculate standard deviation and estimated standard error of the sample

•  In our sample, standard deviation (S) = 15 •  (You don’t have to know the equation for standard deviation on the

USMLE)

• Estimated standard error = S / √n = 15 / √10 = 4.747

Page 40: biostatsepi_pathreview-part1

Step 6: Calculate t from the data • Remember, similar to a z-value, the t-score represents the

# of estimated standard means that the sample mean lays away from the hypothesized mean

• Our average score was 257, which is 3 points away from our hypothesized average of 260 •  Therefore, our t-value is “the # of estimated standard errors

contained in 3 points” •  Our estimated standard error from the last slide is 4.747 •  This gives a t-score (tcalc) of:

−3 / 4.747 = −0.632

Page 41: biostatsepi_pathreview-part1

Step 7: Compare t-values and be very concerned that Julia Silva is a psychic • Our calculated t-value (same thing as t-score) is −0.632 • Our critical t-value is ±2.262

• Clearly our calculated t lies between +2.2 and −2.2, therefore: •  H0 is accepted and reported as follows: “The hypothesis that the

mean Step 1 score of the medschool class is 260 was accepted, t = 0.632, df = 9, p ≤ 0.5”

Generated by CamScanner

+2.262 −2.262

t=0 ★

Page 42: biostatsepi_pathreview-part1

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

✔ ✔ ✔ ✔

Page 43: biostatsepi_pathreview-part1

Error types indicate that you accepted the wrong hypothesis

Type I Error

•  “False-positive” error •  You accept the alternative

hypothesis when there is no difference

• Also known as alpha (α) error à yes, this is referring to the α we just talked about

•  The p-value is the probability of making a type I error

Type II Error

•  “False-negative” error •  You fail to reject the null

hypothesis when there actually is a difference

• Also known as β error •  β is the probability of

making a type II error

Page 44: biostatsepi_pathreview-part1

A study with greater power has less type II (β) error •  The power of a statistical test = 1 − β •  The power represents the probability of rejecting the null

hypothesis when it is in fact false (vs. accepting it in β error); we want this to happen!

• Conventionally, a study is required to have a power of 0.8 (or a β of 0.2) to be acceptable

• Power increases as α increases à trade off • High-yield point: Increasing the sample size is the

most practical and important way of increasing the power of a statistical test

Page 45: biostatsepi_pathreview-part1

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

✔ ✔ ✔ ✔

✔ ✔

Page 46: biostatsepi_pathreview-part1

Nonexperimental (descriptive or analytic) study designs – Cohort studies • Group without disease are selected and followed for an

extended period • Some members may have already been exposed to risk

factor • Exception: “Inception Cohorts” follow those recently

diagnosed to track progression • Can estimate incidence • Not good for rare diseases • Historical cohort study = retrospective cohort study

Page 47: biostatsepi_pathreview-part1

• All are retrospective • Compare people who do have the disease (the cases) w/

otherwise similar people who do not have the disease • Start w/ outcome then LOOK BACK into the past for

possible independent variables that may have caused the disease

• Cheap, good for rare or that take a long time to develop

Nonexperimental (descriptive or analytic) study designs – Case-control studies

Page 48: biostatsepi_pathreview-part1

• Essentially a series of case reports that may link disease to exposure, but NOT controlled, as in case-control (no group w/o the disease compared to)

• Eg. Kaposis’s sarcoma

Nonexperimental (descriptive or analytic) study designs – Case-series studies

Page 49: biostatsepi_pathreview-part1

• Survey (“snap shot”) of a whole population, also asks about risk factors individually

• Prevalence ratio = the prevalence of a disease in people who have and have not been exposed to a risk factor

•  Likely to overrepresent chronic diseases and underrepresent acute diseases

Nonexperimental (descriptive or analytic) study designs – Prevalence survey

Page 50: biostatsepi_pathreview-part1

• Check non-individual info (eg. study of the rate of diabetes in countries with different levels of automobile ownership)

• May be experimental: •  Community intervention trials

•  Experimental group consists of an entire community, while the control group is an otherwise similar community that is not subject to any kind of intervention

Nonexperimental (descriptive or analytic) study designs – Ecological studies

Page 51: biostatsepi_pathreview-part1

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

✔ ✔ ✔ ✔

✔ ✔

Page 52: biostatsepi_pathreview-part1

Bias occurs from systemic (rather than random) errors when one outcome is systematically favored over another

Generated by CamScanner

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.

What is the

difference

between selection bias

and sampling

bias?

(Magazine subscribers in great depression)

(Referral bias)

Page 53: biostatsepi_pathreview-part1

Bias occurs from systemic (rather than random) errors when one outcome is systematically favored over another

Generated by CamScanner

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again. (Putting all whites in drug

group and blacks in control group for treating a racially selective disease) Race = confounding variable

Page 54: biostatsepi_pathreview-part1

Bias occurs from systemic (rather than random) errors when one outcome is systematically favored over another

Generated by CamScanner

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.

Page 55: biostatsepi_pathreview-part1

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

✔ ✔ ✔ ✔

✔ ✔

♯ ♯

Page 56: biostatsepi_pathreview-part1
Page 57: biostatsepi_pathreview-part1
Page 58: biostatsepi_pathreview-part1
Page 59: biostatsepi_pathreview-part1
Page 60: biostatsepi_pathreview-part1