Impact of a simulation/ randomization-based curriculum on student understanding of p-values and confidence intervals
Beth Chance, Karen McGaughey, Jimmy Wong
Cal Poly – San Luis Obispo
ICOTS9
Outline
• About the curriculum (Karen)
• Evaluating the curriculum (Beth)
• Benefits/Cautions/Suggestions (Karen)
• Next Steps (Beth)
Background
• Randomization-based introductory statistics courses (Saturday workshop)
• Introducing all inferential techniques through simulation and randomization-based methods
  • e.g., permutation tests, bootstrapping
• Tintle et al. (2015) text (Roy, Session 4A)
• Focus on overall statistical process via genuine research studies
• Normal-based methods presented as alternative approximation to simulation results
Background
• Spiraled just-in-time curriculum:
  • Brief introduction to probability through simulation
    • e.g., Monty Hall problem, coin tossing
    • Develop understanding of probability as a long-run proportion
  • Statistical Inference (Ch. 1)
    • Process probability/one proportion
    • One mean, two proportions, two means, matched pairs, multiple proportions, multiple means, regression
  • Deeper dive in each iteration
  • Interspersed as needed: discussions of random sampling, random assignment, graphical displays, scope of conclusions, etc.
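The idea of probability as a long-run proportion can be illustrated with a short simulation of the Monty Hall problem. This is a minimal sketch, not the curriculum's own applet; the function name, trial count, and seed are assumptions chosen for illustration.

```python
import random

def monty_hall_win_rate(switch, trials=100_000, seed=1):
    """Long-run proportion of wins for a 'switch' or 'stay' strategy.

    Hypothetical helper for illustration; not taken from the slides.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)   # door hiding the prize
        pick = rng.randrange(3)    # contestant's first pick
        # Host opens a door that is neither the pick nor the prize.
        opened = rng.choice([d for d in range(3) if d != pick and d != prize])
        if switch:
            # Move to the one remaining closed door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == prize)
    return wins / trials

if __name__ == "__main__":
    print("stay:  ", monty_hall_win_rate(switch=False))
    print("switch:", monty_hall_win_rate(switch=True))
```

Over many repetitions the switching strategy should settle near 2/3 and staying near 1/3, which is the long-run-proportion reading of probability.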
Background
• Ch 1: Test of significance
  • One proportion
  • Facial Prototyping – “Bob & Tim” (Lea, Thomas, Lamkin, & Bell, 2007)
  • Binary response
  • Overwhelmingly name left picture “Tim” (e.g. ~ 80%)
• Two competing explanations for the study outcome:
  • “Random chance alone”
  • Research conjecture
• Could the observed statistic plausibly have happened by random chance alone?
• Design the simulation:
  • What does “by random chance alone” look like?
  • Coin tossing model
  • Tactile & via computer
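The coin tossing model can be sketched in a few lines. The slide reports only that roughly 80% of respondents chose "Tim", so the class size of 30 and the count of 24 used below are hypothetical, as are the function name and defaults.

```python
import random

def simulate_one_proportion(n, observed_count, null_prob=0.5, reps=10_000, seed=1):
    """Estimate how often chance alone gives a count at least as large as observed.

    Each repetition tosses n fair 'coins' (null_prob = 0.5) and counts heads.
    """
    rng = random.Random(seed)
    as_extreme = 0
    for _ in range(reps):
        heads = sum(rng.random() < null_prob for _ in range(n))
        if heads >= observed_count:
            as_extreme += 1
    return as_extreme / reps

if __name__ == "__main__":
    # Hypothetical data: 24 of 30 students (80%) assign the name "Tim" to the left face.
    print(simulate_one_proportion(n=30, observed_count=24))
```

A very small proportion here means the observed result would rarely arise from coin tossing alone, which is the logic the slide asks students to articulate.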
Background
• Ch. 3: Confidence Intervals = Interval of plausible values
  • Example: Reese’s Pieces (n = 40, p̂ = 16/40 = 0.40)
  • Test (via simulation) for plausible values of population proportion given observed sample proportion

Test          Two-sided p-value  Decision at 0.05 significance level  Plausible?
Ho: π = 0.26  0.0430             Reject Ho                            No
Ho: π = 0.27  0.0800             Fail to reject Ho                    Yes
:             :                  Fail to reject Ho                    Yes
:             :                  Fail to reject Ho                    Yes
Ho: π = 0.55  0.0770             Fail to reject Ho                    Yes
Ho: π = 0.56  0.0450             Reject Ho                            No
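The "interval of plausible values" idea in the table can be sketched by testing a grid of candidate values of π against the observed 16 of 40. This sketch swaps in exact binomial tail probabilities for the simulation used in the course, and it assumes the "twice the smaller tail" convention for a two-sided p-value, so its endpoints need not match the simulated p-values in the table exactly. All function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(successes, n, null_pi):
    """Twice the smaller tail probability, capped at 1 (an assumed convention)."""
    lower = sum(binom_pmf(k, n, null_pi) for k in range(0, successes + 1))
    upper = sum(binom_pmf(k, n, null_pi) for k in range(successes, n + 1))
    return min(1.0, 2 * min(lower, upper))

def plausible_values(successes, n, alpha=0.05):
    """Candidate values of pi (0.01 to 0.99) that are not rejected at level alpha."""
    candidates = [i / 100 for i in range(1, 100)]
    return [pi for pi in candidates if two_sided_p_value(successes, n, pi) > alpha]

values = plausible_values(16, 40)
print("plausible values run from", min(values), "to", max(values))
```

The set of values that are not rejected plays the role of a 95% confidence interval for π.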
Background
• Ch 5: Two proportions
  • Dolphin Therapy (Antonioli & Reveley, 2005)
  • Binary response
  • Designed experiment
• Two competing explanations:
  • Ho: “random chance alone”
  • Ha: research conjecture
• Could the observed statistic plausibly have happened by random chance alone?
• Design the simulation:
  • Card shuffling
  • Tactile & via computer

                 Therapy group
                 Dolphin   Control
Improved         10        3
Did not improve  5         12
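The card shuffling simulation for the dolphin therapy table can be sketched as follows. The counts come from the table above; the function name, number of repetitions, and seed are assumptions.

```python
import random

def shuffle_test(improved_a, n_a, improved_b, n_b, reps=10_000, seed=1):
    """Re-randomize outcome 'cards' to two groups and estimate a one-sided p-value.

    With the total number of improvers fixed, 'group A has at least as many
    improvers as observed' is equivalent to 'the difference in proportions is
    at least as large as observed'.
    """
    rng = random.Random(seed)
    total_improved = improved_a + improved_b
    cards = [1] * total_improved + [0] * (n_a + n_b - total_improved)
    as_extreme = 0
    for _ in range(reps):
        rng.shuffle(cards)
        if sum(cards[:n_a]) >= improved_a:   # first n_a cards form group A
            as_extreme += 1
    return as_extreme / reps

if __name__ == "__main__":
    # Dolphin group: 10 of 15 improved; control group: 3 of 15 improved.
    print(shuffle_test(10, 15, 3, 15))
```

Each shuffle mimics random assignment under the "random chance alone" explanation, so the returned proportion estimates how often chance alone produces a result at least as favorable to dolphin therapy as the one observed.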
2013-2014 Evaluation
• New and experienced teachers
  • 15 institutions (HS, community college, university)
  • 15 instructors (fall) and 23 instructors (spring, 12 new)
  • Over 1500 students
• Assessment
  • (Modified) CAOS pre and post tests (Tintle, Session 8A)
  • SATS attitudes pre and post tests (Swanson, Session 1F)
  • Set of common multiple choice exam questions
    • 25 instructors, 774-826 students
  • Final exam transfer question
One Proportion (Exam 1)
• Research question: Are city residents more likely to watch a movie at home rather than in the theater?
Q1: Picking the correct null hypothesis (overall percentages)
Adult residents of the city are equally likely to choose to watch the movie at home as to watch the movie at the theater. 92.9%
Adult residents of the city are more likely to choose to watch the movie at home than to watch the movie at the theater. 5.8%
Adult residents of the city are less likely to choose to watch the movie at home than to watch at the theater. 0.6%
Other 0.6%
One Proportion (Exam 1)
• Research question: Are city residents more likely to watch a movie at home rather than in the theater?
Q2: Picking the correct alternative hypothesis
Adult residents of the city are equally likely to choose to watch the movie at home as to watch the movie at the theater. 1.7%
Adult residents of the city are more likely to choose to watch the movie at home than to watch the movie at the theater. 90.1%
Adult residents of the city are less likely to choose to watch the movie at home than to watch at the theater. 5.6%
Other 2.7%
One Proportion (Exam 1)
• Research question: Are city residents more likely to watch a movie at home rather than in the theater?
Q3: Result is statistically significant (p = 0.012). Which explanation is more plausible?
(a) More than half of the adult residents in her city prefer to watch the movie at home. 65.6%
(b) There is no overall preference for movie-watching-at-home in her city, but by pure chance her sample just happened to have an unusually high number of people choose to watch the movie at home. 6.0%
(c) (a) and (b) are equally plausible explanations. 29.7%
Substantial section-to-section variability!
One Proportion (Exam 1)
• Research question: Are city residents more likely to watch a movie at home rather than in the theater?
Q4: Most valid interpretation of p-value?
A sample proportion as large as or larger than hers would rarely occur. 14.0%
A sample proportion as large as or larger than hers would rarely occur if the study had been conducted properly. 6.9%
A sample proportion as large as or larger than hers would rarely occur if 50% of adults in the population prefer to watch the movie at home. 59.9%
A sample proportion as large as or larger than hers would rarely occur if more than 50% of adults in the population prefer to watch the movie at home. 20.3%
Higher for experienced instructors
One Proportion (Exam 1)
• Research question: Are city residents more likely to watch a movie at home rather than in the theater?
Q5: Would 95% confidence interval contain 0.5?
Yes 25.3%
No 43.8%
Not enough information 31.0%
Two Proportions (Exam 2)
• Research question: Are women more likely to dream in color than men?
Q1: Best conclusion from not significant (not small p-value) result?
You have found strong evidence that there is no difference between the proportions of men and women in your community that dream in color. 14.5%
You have not found enough evidence to conclude that there is a difference between the proportions of men and women in your community that dream in color. 72.8%
You have found strong evidence against the claim that there is a difference between the proportions of men and women that dream in color. 10.7%
Because the result is not significant, we can’t conclude anything from this study. 4.1%
Higher for new instructors
Two Proportions (Exam 2)
• Research question: Are women more likely to dream in color than men?
Q2: Best interpretation from small p-value?
It would not be very surprising to obtain the observed sample results if there is really no difference between the proportions of men and women in your community that dream in color. 5.0%
It would be very surprising to obtain the observed sample results if there is really no difference between the proportions of men and women in your community that dream in color. 56.5%
It would be very surprising to obtain the observed sample results if there is really a difference between the proportion of men and women in your community that dream in color. 7.9%
The probability is very small that there is no difference between the proportions of men and women in your community that dream in color. 22.6%
The probability is very small that there is a difference between the proportions of men and women in your community that dream in color. 8.4%
Two Proportions (Exam 2)
• Research question: Are women more likely to dream in color than men?
Q3: If there really is a difference, why might you get a large p-value?
Something went wrong with the analysis, and the results of this study cannot be trusted. 6.1%
There must not be a difference after all and the other research studies were flawed. 3.8%
The sample size might have been too small to detect a difference even if there is one. 90.1%
Two Proportions (Exam 2)
• Research question: Are women more likely to dream in color than men?
Q4: Which has stronger evidence of a difference: Study A vs. Study B?
Study A: 40/100 vs. 20/100 80.3%
Study B: 35/100 vs. 25/100 4.4%
The strength of evidence would be similar for these two studies 15.3%
Two Proportions (Exam 2)
• Research question: Are women more likely to dream in color than men?
Q5: Which has stronger evidence of a difference: Study C vs. Study D (30% vs. 20%)?
Study C: sample sizes of 100 and 100 83.0%
Study D: sample sizes of 40 and 40 6.0%
The strength of evidence would be similar for these two studies 10.8%
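The reasoning behind Q4 and Q5 can be checked numerically. The sketch below uses a pooled two-proportion z statistic as a normal-based stand-in for the simulation approach; this is an assumed choice, not necessarily how the exam item was framed. The counts for Studies C and D are derived from 30% and 20% of the stated sample sizes.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for the difference between two sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

studies = {
    "A": (40, 100, 20, 100),   # larger difference, same sample sizes as B
    "B": (35, 100, 25, 100),
    "C": (30, 100, 20, 100),   # same difference as D, larger samples
    "D": (12, 40, 8, 40),
}

for name, counts in studies.items():
    print(name, round(two_proportion_z(*counts), 2))
```

A larger statistic means the observed difference sits further out in the tail of the null distribution: Study A beats Study B because its difference is larger, and Study C beats Study D because its samples are larger.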
Two Proportions (Exam 2)
• Research question: Are women more likely to dream in color than men?
Q6: Small p-value, which explanation is more plausible?
(a) Men and women in your community do not differ on this issue but by chance alone the random sampling led to the difference we observed between the two groups. 13.6%
(b) Men and women in your community differ on this issue. 58.1%
(c) (a) and (b) are equally plausible explanations. 28.2%
36% correct with draft curriculum four years ago
Two Proportions (Exam 2)
• n = 404 students (8 instructors)
Q7: Main purpose of the randomness in the simulation?
To allow me to draw a cause-and-effect conclusion from the study. 19.1%
To allow me to generalize my results to a larger population. 11.4%
To simulate values of the statistic under the null hypothesis. 58.8%
To replicate the study and increase the accuracy of the results. 8.2%
Two Means (Exam 2/Final)
• 717 students, 14 instructors
• Want to compare mean score on video game with and without monetary incentive
• Simulation process is described and given null distribution
Two Means (Exam 2/Final)
Q1: Main motivation for this process?
This process allows her to compare her actual result to what could have happened by chance if gamers’ performances were not affected by whether they were asked to do their best or offered an incentive. 83.0%
This process allows her to determine the percentage of time the $5 incentive strategy would outperform the “do your best” strategy for all possible scenarios. 12.0%
This process allows her to determine how many times she needs to replicate the experiment for valid results. 2.2%
This process allows her to determine whether the normal distribution fits the data. 2.8%
Two Means (Exam 2/Final)
Q2: What’s assumed in carrying out the simulation?
(a) The $5 incentive is more effective than the “do your best” incentive for improving performance. 25.8%
(b) The $5 incentive and the “do your best” incentive are equally effective at improving performance. 60.9%
(c) The “do your best” incentive is more effective than a $5 incentive for improving performance. 6.0%
(d) Both (a) and (b) but not (c). 7.3%
Two Means (Exam 2/Final)
Q3: Approximate p-value from graph
0.501 (using null value) 14.0%
0.047 (two-sided) 16.9%
0.022 52.5%
0.001 (small) 16.2%
Two Means (Exam 2/Final)
Q4: What does histogram tell us about research question?
The $5 incentive is not effective because the distribution of differences generated is centered at zero. 16.3%
The $5 incentive is effective because distribution of differences generated is centered at zero. 14.8%
The $5 incentive is not effective because the p-value is greater than 0.05. 5.1%
The $5 incentive is effective because the p-value is less than 0.05. 63.4%
Two Means (Exam 2/Final)
Q5: Appropriate interpretation of p-value?
The p-value is the probability that the $5 incentive is not really helpful. 3.7%
The p-value is the probability that the $5 incentive is really helpful. 12.9%
The p-value is the probability that she would get a result at least as extreme as the one she actually found, if the $5 incentive is really not helpful. 82.3%
The p-value is the probability that a student wins on the video game. 0.9%
CAOS Significance questions (n ≈ 2,000 pre, 1,500 post)
• Valid/invalid interpretations

                                                              Pre   Post
                                                                    Exp   New   Non     CAOS
Large or small p-value, no impact                             50%   89%   85%   62%en   68%
Probability of results at least as extreme under null: valid  50%   65%   66%   52%     57%
Probability of alternative: invalid                           40%   53%   58%   48%     54%
Probability of null: invalid                                  53%   72%   67%   58%     60%
CAOS Conf interval questions
• Valid/invalid interpretations

                                                              Pre   Post
                                                                    Exp   New   Non     CAOS
95% of all observations in population in interval: invalid    57%   63%   64%   56%     65%
95% confident an observational unit is in interval: invalid   27%   41%   37%   21%en   49%
95% of sample means from population are in interval: invalid  51%   60%   60%   64%     48%
95% confident population mean is in interval: valid           71%   80%   80%   82%     76%
CAOS Sampling variability questions

                                                          Pre   Post
                                                                Exp   New   Non     CAOS
Small sample (n = 60) may fail to detect difference       71%   58%   57%   49%     67%
Necessary sample size for all 310 million U.S. residents  10%   19%   22%   11%
“Hospital problem”                                        33%   39%   38%   34%     33%
Values of 10 sample proportions                           42%   44%   52%e  43%     52%
Simulation design                                         24%   40%   35%   24%e    22%
Topic areas – Summary
• Auth = author team member
• Mid = non-author but have used materials more than once

                      Pre                       Post
                      Auth  Mid   New   Non     Auth  Mid   New   Non
Significance          52%   43%   47%   46%     72%   67%   69%   55%*
Confidence            55%   51%   51%   49%     63%   60%   60%   56%
Sampling variability  35%   36%   36%   35%     41%   40%   41%   32%
Transfer Question (Final exam)
• A constant theme of course: Could the statistic have happened by chance alone?
  • Applicable in any situation vs. statistical test applicable in only one specific situation
• Can students apply the same logic to a novel problem?
• Spring 2014: Two Cal Poly instructors (169 students)
  • Final exam: mean/median as a measure of skewness to make inference about population shape (adapted from 2009 AP Statistics exam)
  • Earlier midterm: Ratio of standard deviations or relative risk
Transfer Question (Final exam)
• Do the sample data provide convincing evidence the population is right skewed?
• Calculate statistic: mean/median = 1.05
• What values would you expect for the statistic with a normally distributed population? With a skewed right population?
  • 39% answered both questions correctly
• Common errors:
  • Mean/median > 1.05 if right skewed
  • Wrong direction: mean/median < 1 if right skewed
Transfer Question (Final exam)
• Do the sample data provide evidence the population is right skewed?
• Calculate statistic: mean/median = 1.05
• Given a simulated null distribution from a symmetric population (centered at 1)
• Evidence against the null hypothesis?
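A null distribution for the mean/median statistic can be sketched by repeatedly sampling from a symmetric population. The slides do not give the sample size or the population used, so the normal population (mean 50, SD 10), the sample size of 100, and the function names below are all hypothetical.

```python
import random
from statistics import median

def simulate_null_ratios(n=100, mu=50.0, sigma=10.0, reps=2_000, seed=1):
    """Simulate mean/median for samples drawn from a symmetric (normal) population."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ratios.append((sum(sample) / n) / median(sample))
    return ratios

def p_value(ratios, observed):
    """Proportion of simulated ratios at least as large as the observed ratio."""
    return sum(r >= observed for r in ratios) / len(ratios)

if __name__ == "__main__":
    null_ratios = simulate_null_ratios()
    print("center of null distribution:", sum(null_ratios) / len(null_ratios))
    print("estimated p-value for 1.05:", p_value(null_ratios, 1.05))
```

The evidence comes from where the observed 1.05 falls relative to the spread of the simulated ratios. The center at 1 and the symmetric shape are consequences of the null model, not evidence for or against it.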
Transfer Question (Final exam)
• Multiple choice version based on common responses from open-ended version:
  • Answer choices focus on 3 characteristics of the null distribution:
  • There is strong evidence (or not) to suggest the actual population distribution is right skewed…
    • Due to symmetric shape
    • Because the center is at 1
    • Because most values vary between 0.96 and 1.04
Transfer Question (Final exam)
Two instructors (5 sections/169 students) from Cal Poly
does not provide strong evidence … because this null distribution is symmetric. 11%
provides strong evidence … because this null distribution is symmetric. 12%
does not provide strong evidence … because this null distribution is centered around one. 20%
provides strong evidence … because this null distribution is centered around one. 26%
does not provide strong evidence … because most of the values in this null distribution vary between 0.96 to 1.04. 10%
provides strong evidence … because most of the values in this null distribution vary between 0.96 to 1.04. 18%
Other: provided correct reasoning 7%
* 25% answered correctly and an additional 8% showed work indicating correct reasoning
Benefits
• Little to no confusion that small p-values indicate statistical significance
  • Students very comfortable (even initially) with idea of “could this have happened by chance alone”
  • Idea of large z-score or t-score (beyond 2SE) also clicks
• Address difficult inferential reasoning earlier in course
  • Repeated exposures allow a synthesis of the ideas
• Understanding “Inference process” as statistical method, rather than stand-alone methods for testing means, proportions, etc.
• Efficiency gains:
  • Still possible to do both simulation and normal-based methods
  • Exploration of other statistics (e.g. MAD for multiple means)
• Instructors enjoy approach, research study focus, richer student questions
Cautions
• Inferential reasoning is difficult and initially, little carry-over of learning:
  • Non 50/50 cases
  • Comparing groups
  • Need several repeated exposures
• May introduce a misconception of “repeating the study”
• Possible increase in misconception that we are “providing evidence for the null hypothesis”
• Continue to struggle with identifying & defining parameters
• Balance inferential with descriptive statistics (less as Common Core comes on line?)
Main Suggestions
• Emphasize the ideas of model and simulation
  • Repeatedly test their ability to design a simulation
  • Ask students to predict simulation results (where will it be centered, why)
  • Focus on variability in null distribution as the key
• Clearly delineate observed data from simulation
• Explicitly discuss roles of randomness in the study design vs. randomness in simulation
• Use early experiential examples that give students ownership of the data (“observed” statistic)
Future Steps
• Three year NSF grant (DUE/TUES – 1323210) to continue data collection across institutions
  • More “non-users” and other randomization-based curricula (e.g., Lock5, CATALST)
  • More studies of student retention of concepts
  • Next theme of common exam questions: Confidence intervals
• Email Nathan Tintle ([email protected]) or Beth Chance ([email protected]) if you would like to participate
Questions?