Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Final Review
Chapters 1, 2
Controlled experiment: one in which the investigators get to determine who is in
the treatment group and who is in the control group. The existence of a control
group does not necessarily mean we have a controlled experiment. For example,
in an observational study on smoking it is not uncommon to refer to nonsmokers
as the “controls”.
Randomized Controlled experiment: one in which the investigators use an
objective chance procedure to assign participants to the two groups. This is
preferable to using human judgment, because humans are almost inevitably
biased and this can affect the results (or at least, their credibility).
Placebo: A sugar pill or other “fake” treatment that resembles the real treatment
but lacks the active ingredient. Used to make the study “blind” so that the subjects
do not know which group they are in. This reduces psychosomatic effects and
differences in behavior between the groups. Good placebos are hard to find –
sometimes people figure out which group they are in and we say they “broke the
blind”.
Observational study: one in which the subjects have determined which group they
will be in. There may be a “control group” – these are the subjects who did not
choose the treatment. These studies always have confounding factors.
Confounding factor: Ideally, the groups differ only with respect to the treatment. If
they differ with respect to some other factor, and if this factor could account for the
observed association, then the factor is a confounding factor. Need to explain how
the confounding factor relates to both factors in the association. For example, in a
study on diet and lifespan, we might see an association between good diet and long
life, but there are possible confounding factors such as exercise. People who eat a
healthy diet are likely to look after themselves in other ways, such as exercising
more, and maybe it’s the exercise (not the diet) that causes them to live longer.
Note that there are other factors that may be related to one of the factors in the
association, but if they are not related to the other, then they are just contributing
factors, not confounding. In the previous example, genetics is a contributing factor to
long life, but genetics is unlikely to be related to healthy diet, so it would be difficult to
make the case that it is a confounding factor.
Crosstabs: break down the subjects into smaller groups with similar values of known
confounding factors to try to “adjust for” the confounding factor. The problem is, you
don’t always know what the confounding factors are, and even if you did, you
eventually run out of data (the groups get very small).
Simpson’s paradox: patterns for subgroups can differ from overall patterns. You
should always look at the data on the level at which the decisions are being made.
Chapter 3
Histograms
• area represents percentages
• total area is 100%
• to draw a histogram, first calculate the width of each interval, then calculate the
percentage of cases in each interval (if this has not already been provided), last
calculate the height using
height = percentage/width
• both axes must have proper scales and the vertical axis must start at 0
• label for y axis is “Percentage per” unit on the x-axis
• area represents percentages
• to approximate percentages, assume the data are evenly spread in the intervals
and work out areas, using
area = base x height
Chapter 4
Average, median, SD
• what they measure
• ave and median measure the “middle”
• SD measures variability or “spread”
• how they relate to histograms
• ave = balance point
• median = half the area on each side
• SD: if we go from 2 SD’s below the average to 2 SDs above the average, we
catch 95% of the area
• how to calculate them
• median: sort them and pick out the middle one, or average the middle two
• calculator for ave, SD
• how they are affected by including more data, removing data, adding a constant,
multiplying by a constant, combining two groups, etc.
• longitudinal versus cross-sectional studies – if you want to see what happens as
people age, you need to watch them age.
Chapter 5
Normal Curves
• standard units say how many SDs above average:
z = x – average
SD
• use tables to approximate areas
• the 30th percentile has 30% of the data to the left. For the normal curve, work out
the corresponding middle area, get the z from the tables, and use
x = average + z(SD)
• in this chapter all methods that use the tables are not valid if the histograms do not
follow the normal curve
• if the data follow the normal curve, approximately
• 68% of the data will be between
ave – SD and ave + SD
• 95% of the data will be between
ave – 2(SD) and ave + 2(SD)
Chapter 6
Measurement error
• chance error
• bias
• outliers
• estimate the true value using the average
• the SD gives the likely size of the chance error in a new measurement
• typically, 95% of the measurements will be between
true value – 2(SD) and true value + 2(SD)
Chapter 8
Correlation
• scatter diagrams
• positive, negative, strong, weak
• the correlation coefficient, r
• the 5-number summary:
ave and SD of x
ave and SD of y
correlation coefficient, r
• you should know how to approximate the 5 number summary from a plot
• you should know how to sketch a plot from the 5 number summary
Chapter 9
Interpreting Correlation
• r measures ASSOCIATION
• r measures LINEAR association
• r is badly affected by outliers – always look at the scatterplot if you can
• ecological correlations are artificially strong
• association is not causation
• correlation is not causation
• the SD line and how to draw it – the SD line says that when r is positive, if someone
is so many SDs above or below average in x, they will be the same number of SDs
above or below average in y. If r is negative, the ebove and below get switched.
• calculating y from x if a point is on the SD line
Chapter 10
Regression
• the regression line and how to draw it
• predicting y from x using regression
• the regression line moves if we flip the axes (2 lines)
• the regression effect and fallacy – the regression effect says that in test/retest
situations, students who did very well on the first test should expect to do somewhat
less well on the second, while students who did not do well on the first test should
expect to do somewhat better on the second, just due to chance error.
Chapter 11
r.m.s. error
• r.m.s. error = √(1 – r2) (SDY)
• for homoscedastic scatter diagrams
• the rms error is like an SD for the line:
• 68% of dots are within 1 r.m.s. error of the line
• 95% of dots are within 2 r.m.s. errors of the line
• also works at a particular value of x
Chapter 12
The Regression Line
• y = (slope)(x) + intercept
• slope = r SDY
SDX
• intercept = aveY – slope(aveX)
• cannot interpret the slope as how much y would increase if we increased x by 1 unit
unless we did an experiment.
• generally, do not draw the line using the formula because the intercept may not be
on the scatter diagram. Instead, go to the point of averages, go across one SD in x
and up r times one SD in y.
Chapter 13, 14
Probability
• multiplication rule
• independence
• multiplication rule for independent events
• mutually exclusive
• addition rule for mutually exclusive events
•the chance something DOES NOT happen is 1 minus the chance it does happen
•the chance something happens AT LEAST ONCE is 1 minus the chance it NEVER
happens
Chapter 16
The Law of Averages
• as we increase the number of times, the chance error gets bigger and the
percentage error gets smaller
• to answer questions, draw the picture of what is happening to the percentage or the
chance error (the percentage gets closer and closer to what we would expect, while
the chance error drifts further and further away from 0). The HARD bit is knowing
which picture to draw – if we are comparing the percentage, we need the percentage
plot, if we are comparing the chance error itself, we need the chance error plot.
• we are more likely to get exactly ZERO chance error (or percentage error) for a few
than a lot
• does not work by compensation – chance has no memory
Box Models
• Box models say that some quantity of interest is like the sum of draws from a box.
Later we looked at the average of the draws and the percentage of 1’s in the draws.
• Need to be able to figure out how many draws, what numbers go on the tickets and
how many of each ticket.
• In some situations, we don’t know the tickets in the box, but we are told their
average and SD. It’s still a good idea to draw a box and write down what the tickets
represent (e.g. ages, weights, etc).
• If we are classifying or counting how many times something happens, the box
contains 0’s and 1’s.
Chapter 17
EVsum and SEsum
• What do they mean? The EVsum tells you how big the sum of the draws is likely
to be, on average, if you repeat the draws many, many times. The SEsum is like a
standard deviation for the sum of the draws – it tells you how much variability you
can expect from the sum of the draws.
• If the number of draws is large, the normal curve can be used to find chances that
the sum of the draws will be in a given interval. See the handwritten review sheet
for details.
• If the number of draws is large, there is a
68% chance the sum of the draws will be between
EVsum – SEsum and EVsum + SEsum
95% chance the sum of the draws will be between
EVsum – 2(SEsum) and EVsum + 2(SEsum)
Chapter 18
Probability Histograms
This chapter gives a bit of theory behind using the normal curve to calculate chances
for the sum of the draws (and later, for the average and the percentage of 1’s in the
draws).
• probability histograms represent chances
• if we increase the number of draws, the probability histogram for the SUM gets
closer and closer to the normal curve
• same for the average and the percentage of 1’s
• IT DOES NOT MATTER WHAT IS IN THE BOX – if the number of draws is large
enough, the normal curve can be used to give approximate chances for the
• sum of the draws
• average of the draws
• percentage of 1’s in the draws
But note that the draws themselves will not necessarily follow the normal curve – to
use the normal curve for the draws themselves, we have to be told that they follow
the normal curve (see chapter 5).
Chapter 19
Sample Surveys
• representative sample
• selection bias (not everyone is invited to participate and bias comes in if the ones
who are invited differ from the ones who are not invited, with respect to the thing of
interest)
• nonresponse bias (not everyone who is invited will actually respond and bias comes
in if the ones who respond differ from the ones who do not, with respect to the thing
of interest)
• quota sampling (human choice allows bias to creep in, even though the sample is
designed to be representative with respect to a few important variables)
• simple random sample (objective chance procedure that prevents selection bias but
can’t avoid nonresponse bias)
• cluster sample (randomly select groups and sample everyone in those groups)
• all our formulas are worked out assuming we have simple random samples and
ignoring nonresponse.
• big simple random samples are more accurate than small ones
• a small, well-designed sample can be much more accurate than a large, poorly-
designed sample
Chapter 20
Chance Errors in Sampling
• EV% SE% The EV% tells you how big the percentage of 1’s in the draws is likely
to be, on average, if you repeat the draws many, many times. The SE% is like a
standard deviation for the percentage of 1’s in the draws – it tells you how much
variability you can expect from the percentage of 1’s in the draws
• If the number of draws is large, the normal curve can be used to find chances that
the percentage of 1’s in the draws will be in a given interval. See the handwritten
review sheet for details.
• If the number of draws is large, there is a
68% chance the percentage of 1’s in the draws will be between
EV% – SE% and EV% + SE%
95% chance the percentage of 1’s in the draws will be between
EV% – 2(SE%) and EV% + 2(SE%)
• The actual sample size determines the accuracy, not the proportion of the
population we sample. So for equal accuracy, we should sample the same number of
people, even if the populations are very different in size.
Chapter 21
Accuracy of Percentages
• the bootstrap
• An approiximate 95% confidence interval is: sample% ± 2 (SE%)
• valid if the number of draws is large enough, even though the tickets in the box do
not follow the normal curve (they can’t follow the normal curve – they are 0’s and 1’s)
Chapter 23
Accuracy of Averages
• The EVave tells you how big the average of the draws is likely to be, on average,
if you repeat the draws many, many times. The SEave is like a standard deviation for
the average of the draws – it tells you how much variability you can expect from the
average of the draws.
• If the number of draws is large, the normal curve can be used to find chances that
the average of the draws will be in a given interval. See the handwritten review
sheet for details.
• If the number of draws is large, there is a
68% chance the average of the draws will be between
EVave – SEave and EVave + SEave
95% chance the average of the draws will be between
EVave – 2(SEave) and EVave + 2(SEave)
• using the normal curve for the sample average
• An approiximate 95% confidence interval is: sample ave ± 2 (SEave)
• valid if the number of draws is large enough
• does not say that 95% of the data are in the interval!
Chapter 26
The z test
• null, alternative
• test statistic z = (observed – EV) / SE
• P-value (see the handwritten review sheet for a reminder of how to calculate this)
• We reject the null hypothesis if the P-value is less than 5%, otherwise we fail to
reject the null.
• Conclusions:
• If you reject the null, you conclude that the alternative is true.
• If you fail to reject the null, you conclude there is insufficient evidence to say
that the null is not true.
Your conclusions should be given in the language of the problem, not using the
words “null hypothesis” etc. See old finals for examples.
• Statistical significance
• If the P-value is less than 5%, the result is statistically significant
• If the P-value is less than 1%, the result is highly statistically significant
• One-tailed versus two-tailed tests (for a 2-tailed test, the P-value is twice as big)
the t test
• used when the number of draws is small (less than 25) and the tickets in the box
are known to follow the normal curve
• the null and the alternative are just the same as those for the z test
• SD+ is used to estimate the SD of the box
• test statistic t = (sample ave – EVave) / SEave
• degrees of freedom = number of draws – 1
• t tables are used to give the P-value
• once you have the P-value, follow the same procedure as the z test.
Chapter 27
Two sample tests
In some ways these are easier than the 1-sample tests.
• need 2 independent simple random samples OR a randomized, controlled
experiment
• null: the average for box A is the same as the average for box B
Or: the percentage for box A is the same as the percentage for box B
Or: there is no difference between the averages for the treatment group and
the control group
Or: there is no difference between the percentages for the treatment group
and the control group
• alt: can be 1-tailed (e.g. “the average for box A is bigger than the average for box
B”) or two-tailed (e.g. “there is a difference”). Background information
should tell you which one to use.
• SEdiff (given on formula sheet)
• test statistic: z = (sample ave for box A – sample average for box B)/SEdiff ave
Or: z = (sample % for box A – sample % for box B)/SEdiff %
Chapter 28
The Chi-Square Test
“Goodness of fit”
Have a collection of probabilities or percentages that represent a chance model and
we want to know whether observed counts could have arisen from that chance
model.
• null: the chance model holds
• alt: the chance model does not hold
• chi-square = formula is given on the formula sheet
• df = number of rows – 1
• the right-hand tail of the chi-square table gives the P-value
• as for the z test, if the P-value is small, we reject the null hypothesis etc etc.
Two classic examples of this type of test are the one on dice (in class) and M&M’s (in
recitation).
The Chi-Square Test
Independence
• two categorical variables with a table, asking whether they are independent or
whether there is a relationship between them
• null: these two things are independent
• alt: these two things are not independent
• chi-square = formula given on formula sheet
• df = (number of rows – 1)(number of columns – 1)
• the right-hand tail of the chi-square table gives the P-value
• as for the z test, if the P-value is small, we reject the null hypothesis etc etc.