Impact of a simulation/ randomization-based curriculum on student understanding of p-values and confidence intervals

Beth Chance, Karen McGaughey, Jimmy Wong
Cal Poly – San Luis Obispo

ICOTS9

Outline
• About the curriculum (Karen)
• Evaluating the curriculum (Beth)
• Benefits/Cautions/Suggestions (Karen)
• Next Steps (Beth)

Background
• Randomization-based introductory statistics courses (Saturday workshop)
• Introducing all inferential techniques through simulation and randomization-based methods
  • e.g., permutation tests, bootstrapping
• Tintle et al. (2015) text (Roy, Session 4A)
• Focus on overall statistical process via genuine research studies
• Normal-based methods presented as alternative approximation to simulation results

Background
• Spiraled just-in-time curriculum:
  • Brief introduction to probability through simulation
    • e.g., Monty Hall problem, coin tossing
    • Develop understanding of probability as a long-run proportion
  • Statistical Inference (Ch. 1)
    • Process probability/one proportion
    • One mean, two proportions, two means, matched pairs, multiple proportions, multiple means, regression
  • Deeper dive in each iteration
  • Interspersed as needed: discussions of random sampling, random assignment, graphical displays, scope of conclusions, etc.

Background
• Ch. 1: Test of significance
  • One proportion
  • Facial Prototyping – “Bob & Tim” (Lea, Thomas, Lamkin, & Bell, 2007)
  • Binary response
  • Overwhelmingly name left picture “Tim” (e.g. ~ 80%)
• Two competing explanations for the study outcome:
  • “Random chance alone”
  • Research conjecture
• Could the observed statistic plausibly have happened by random chance alone?
• Design the simulation:
  • What does “by random chance alone” look like?
  • Coin tossing model
  • Tactile & via computer
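
The coin tossing model can be sketched in a few lines of Python. This is only an illustration, not the applet or materials used in the course: the slide gives the result as roughly 80% but no sample size, so the 30 participants and 24 matches below are hypothetical, as are the function name and its parameters.

```python
import random

def simulate_one_proportion(n, observed_count, null_prob=0.5, reps=10_000, seed=1):
    """Estimate a one-sided p-value by tossing a fair 'coin' n times, reps times over."""
    rng = random.Random(seed)
    as_extreme = 0
    for _ in range(reps):
        # One simulated study under "random chance alone".
        heads = sum(rng.random() < null_prob for _ in range(n))
        if heads >= observed_count:
            as_extreme += 1
    return as_extreme / reps

# Hypothetical numbers: 30 participants, 24 (80%) naming the left picture "Tim".
p_value = simulate_one_proportion(n=30, observed_count=24)
print(f"Approximate p-value: {p_value:.4f}")
```

A small p-value here means that 24 or more matches rarely happen when each participant is simply guessing, so chance alone is not a plausible explanation.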

Background
• Ch. 3: Confidence Intervals = Interval of plausible values
  • Example: Reese’s Pieces (n = 40, p̂ = 16/40 = 0.40)
  • Test (via simulation) for plausible values of population proportion given observed sample proportion

Test            Two-sided p-value   Decision at 0.05 significance level   Plausible?
Ho: π = 0.26    0.0430              Reject Ho                             No
Ho: π = 0.27    0.0800              Fail to reject Ho                     Yes
:               :                   Fail to reject Ho                     Yes
:               :                   Fail to reject Ho                     Yes
Ho: π = 0.55    0.0770              Fail to reject Ho                     Yes
Ho: π = 0.56    0.0450              Reject Ho                             No
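
The "interval of plausible values" idea can be sketched as a loop over candidate null values. The slide's p-values come from simulation; the sketch below swaps in exact binomial probabilities so the result is reproducible, and it assumes one particular two-sided rule (summing the probabilities of all outcomes no more likely than the observed one). Its p-values therefore need not match the table exactly. The function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(k_obs, n, p_null):
    """Sum the probabilities of every outcome no more likely than the observed count."""
    p_obs = binom_pmf(k_obs, n, p_null)
    return sum(
        binom_pmf(k, n, p_null)
        for k in range(n + 1)
        if binom_pmf(k, n, p_null) <= p_obs * (1 + 1e-9)  # tolerance for float ties
    )

n, k_obs = 40, 16  # Reese's Pieces example: 16 of 40
candidates = [i / 100 for i in range(1, 100)]
# A null value is "plausible" if it is not rejected at the 0.05 level.
plausible = [p for p in candidates if two_sided_p_value(k_obs, n, p) > 0.05]
print(f"Plausible values run from {min(plausible):.2f} to {max(plausible):.2f}")
```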

Background
• Ch. 5: Two proportions
  • Dolphin Therapy (Antonioli & Reveley, 2005)
  • Binary response
  • Designed experiment
• Two competing explanations:
  • Ho: “random chance alone”
  • Ha: research conjecture
• Could the observed statistic plausibly have happened by random chance alone?
• Design the simulation:
  • Card shuffling
  • Tactile & via computer

                   Therapy group
                   Dolphin   Control
Improved           10        3
Did not improve    5         12
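
The card shuffling simulation for the dolphin data can be sketched as follows. This is a minimal illustration rather than the course's applet: the function name, the number of repetitions, and the seed are assumptions, while the counts come from the table above.

```python
import random

def shuffle_test(group_a, group_b, reps=10_000, seed=1):
    """One-sided randomization test for a difference in proportions (A minus B)."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    cards = group_a + group_b  # one "card" per subject: 1 = improved, 0 = did not
    count = 0
    for _ in range(reps):
        rng.shuffle(cards)  # re-deal the cards to the two groups at random
        new_a, new_b = cards[:len(group_a)], cards[len(group_a):]
        diff = sum(new_a) / len(new_a) - sum(new_b) / len(new_b)
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            count += 1
    return observed, count / reps

dolphin = [1] * 10 + [0] * 5   # 10 improved, 5 did not
control = [1] * 3 + [0] * 12   # 3 improved, 12 did not
observed_diff, p_value = shuffle_test(dolphin, control)
print(f"Observed difference: {observed_diff:.3f}, approximate p-value: {p_value:.4f}")
```

Shuffling breaks any link between therapy group and outcome, so the shuffled differences show what "random chance alone" produces.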

2013-2014 Evaluation
• New and experienced teachers
  • 15 institutions (HS, community college, university)
  • 15 instructors (fall) and 23 instructors (spring, 12 new)
  • Over 1500 students
• Assessment
  • (Modified) CAOS pre and post tests (Tintle, Session 8A)
  • SATS attitudes pre and post tests (Swanson, Session 1F)
  • Set of common multiple choice exam questions
    • 25 instructors, 774-826 students
  • Final exam transfer question

One Proportion (Exam 1)

• Research question: Are city residents more likely to watch a movie at home rather than in the theater?

Q1: Picking the correct null hypothesis (overall percentages)
• Adult residents of the city are equally likely to choose to watch the movie at home as to watch the movie at the theater. 92.9%
• Adult residents of the city are more likely to choose to watch the movie at home than to watch the movie at the theater. 5.8%
• Adult residents of the city are less likely to choose to watch the movie at home than to watch at the theater. 0.6%
• Other 0.6%



Q2: Picking the correct alternative hypothesis
• Adult residents of the city are equally likely to choose to watch the movie at home as to watch the movie at the theater. 1.7%
• Adult residents of the city are more likely to choose to watch the movie at home than to watch the movie at the theater. 90.1%
• Adult residents of the city are less likely to choose to watch the movie at home than to watch at the theater. 5.6%
• Other 2.7%



Q3: Result is statistically significant (p = 0.012), which explanation is more plausible?
• More than half of the adult residents in her city prefer to watch the movie at home. 65.6%
• There is no overall preference for movie-watching-at-home in her city, but by pure chance her sample just happened to have an unusually high number of people choose to watch the movie at home. 6.0%
• (a) and (b) are equally plausible explanations. 29.7%
Substantial section-to-section variability!



Q4: Most valid interpretation of p-value?
• A sample proportion as large as or larger than hers would rarely occur. 14.0%
• A sample proportion as large as or larger than hers would rarely occur if the study had been conducted properly. 6.9%
• A sample proportion as large as or larger than hers would rarely occur if 50% of adults in the population prefer to watch the movie at home. 59.9%
• A sample proportion as large as or larger than hers would rarely occur if more than 50% of adults in the population prefer to watch the movie at home. 20.3%
Higher for experienced instructors



Q5: Would 95% confidence interval contain 0.5?
• Yes 25.3%
• No 43.8%
• Not enough information 31.0%

Two Proportions (Exam 2)

• Research question: Are women more likely to dream in color than men?

Q1: Best conclusion from not significant (not small p-value) result?
• You have found strong evidence that there is no difference between the proportions of men and women in your community that dream in color. 14.5%
• You have not found enough evidence to conclude that there is a difference between the proportions of men and women in your community that dream in color. 72.8%
• You have found strong evidence against the claim that there is a difference between the proportions of men and women that dream in color. 10.7%
• Because the result is not significant, we can’t conclude anything from this study. 4.1%
Higher for new instructors



Q2: Best interpretation from small p-value?
• It would not be very surprising to obtain the observed sample results if there is really no difference between the proportions of men and women in your community that dream in color. 5.0%
• It would be very surprising to obtain the observed sample results if there is really no difference between the proportions of men and women in your community that dream in color. 56.5%
• It would be very surprising to obtain the observed sample results if there is really a difference between the proportion of men and women in your community that dream in color. 7.9%
• The probability is very small that there is no difference between the proportions of men and women in your community that dream in color. 22.6%
• The probability is very small that there is a difference between the proportions of men and women in your community that dream in color. 8.4%



Q3: If there really is a difference, why might you get a large p-value?
• Something went wrong with the analysis, and the results of this study cannot be trusted. 6.1%
• There must not be a difference after all and the other research studies were flawed. 3.8%
• The sample size might have been too small to detect a difference even if there is one. 90.1%



Q4: Which has stronger evidence of a difference: Study A vs. Study B?
• Study A: 40/100 vs. 20/100 80.3%
• Study B: 35/100 vs. 25/100 4.4%
• The strength of evidence would be similar for these two studies 15.3%



Q5: Which has stronger evidence of a difference: Study C vs. Study D (30% vs. 20%)?
• Study C: sample sizes of 100 and 100 83.0%
• Study D: sample sizes of 40 and 40 6.0%
• The strength of evidence would be similar for these two studies 10.8%
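
The reasoning behind Q4 and Q5 can be checked with a standardized statistic. The sketch below uses the usual pooled two-proportion z-statistic as one way to quantify strength of evidence; the exam questions themselves do not prescribe this calculation, and the function name is illustrative. For Study D, 30% and 20% of 40 are taken as counts of 12 and 8.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z-statistic for the difference between two sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z_a = two_proportion_z(40, 100, 20, 100)  # Study A: larger difference
z_b = two_proportion_z(35, 100, 25, 100)  # Study B: smaller difference, same n
z_c = two_proportion_z(30, 100, 20, 100)  # Study C: 30% vs. 20%, n = 100 each
z_d = two_proportion_z(12, 40, 8, 40)     # Study D: 30% vs. 20%, n = 40 each

for name, z in [("A", z_a), ("B", z_b), ("C", z_c), ("D", z_d)]:
    print(f"Study {name}: z = {z:.2f}")
```

With equal sample sizes, the larger observed difference gives the larger z (A over B); with equal differences, the larger samples give the larger z (C over D).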



Q6: Small p-value, which explanation is more plausible?
• Men and women in your community do not differ on this issue but by chance alone the random sampling led to the difference we observed between the two groups. 13.6%
• Men and women in your community differ on this issue. 58.1%
• (a) and (b) are equally plausible explanations. 28.2%
36% correct with draft curriculum four years ago

• n = 404 students (8 instructors)

Q7: Main purpose of the randomness in the simulation?
• To allow me to draw a cause-and-effect conclusion from the study. 19.1%
• To allow me to generalize my results to a larger population. 11.4%
• To simulate values of the statistic under the null hypothesis. 58.8%
• To replicate the study and increase the accuracy of the results. 8.2%

Two Means (Exam 2/Final)

• 717 students, 14 instructors
• Want to compare mean score on video game with and without monetary incentive
• Simulation process is described and given null distribution


Q1: Main motivation for this process?
• This process allows her to compare her actual result to what could have happened by chance if gamers’ performances were not affected by whether they were asked to do their best or offered an incentive. 83.0%
• This process allows her to determine the percentage of time the $5 incentive strategy would outperform the “do your best” strategy for all possible scenarios. 12.0%
• This process allows her to determine how many times she needs to replicate the experiment for valid results. 2.2%
• This process allows her to determine whether the normal distribution fits the data. 2.8%


Q2: What’s assumed in carrying out the simulation?
• The $5 incentive is more effective than the “do your best” incentive for improving performance. 25.8%
• The $5 incentive and the “do your best” incentive are equally effective at improving performance. 60.9%
• The “do your best” incentive is more effective than a $5 incentive for improving performance. 6.0%
• Both (a) and (b) but not (c). 7.3%


Q3: Approximate p-value from graph
• 0.501 (using null value) 14.0%
• 0.047 (two-sided) 16.9%
• 0.022 52.5%
• 0.001 (small) 16.2%


Q4: What does histogram tell us about research question?
• The $5 incentive is not effective because the distribution of differences generated is centered at zero. 16.3%
• The $5 incentive is effective because distribution of differences generated is centered at zero. 14.8%
• The $5 incentive is not effective because the p-value is greater than 0.05. 5.1%
• The $5 incentive is effective because the p-value is less than 0.05. 63.4%


Q5: Appropriate interpretation of p-value?
• The p-value is the probability that the $5 incentive is not really helpful. 3.7%
• The p-value is the probability that the $5 incentive is really helpful. 12.9%
• The p-value is the probability that she would get a result at least as extreme as the one she actually found, if the $5 incentive is really not helpful. 82.3%
• The p-value is the probability that a student wins on the video game. 0.9%

CAOS Significance questions (n ≈ 2,000 pre, 1,500 post)
• Valid/invalid interpretations

Pre  Post
CAOS  Exp  New  Non

Large or small p-value, no impact    50% 89% 85% 62%en 68%
Probability of results at least as extreme under null: valid    50% 65% 66% 52% 57%
Probability of alternative: invalid    40% 53% 58% 48% 54%
Probability of null: invalid    53% 72% 67% 58% 60%

CAOS Confidence interval questions
• Valid/invalid interpretations

Pre  Post
CAOS  Exp  New  Non

95% of all observations in population in interval: invalid    57% 63% 64% 56% 65%
95% confident an observational unit is in interval: invalid    27% 41% 37% 21%en 49%
95% of sample means from population are in interval: invalid    51% 60% 60% 64% 48%
95% confident population mean is in interval: valid    71% 80% 80% 82% 76%

CAOS Sampling variability questions

Pre  Post
CAOS  Exp  New  Non

Small sample (n = 60) may fail to detect difference    71% 58% 57% 49% 67%
Necessary sample size for all 310 million U.S. residents    10% 19% 22% 11%
“Hospital problem”    33% 39% 38% 34% 33%
Values of 10 sample proportions    42% 44% 52%e 43% 52%
Simulation design    24% 40% 35% 24%e 22%

Topic areas – Summary
• Auth = author team member
• Mid = non-author but have used materials more than once

                        Pre                         Post
                        Auth  Mid   New   Non       Auth  Mid   New   Non
Significance            52%   43%   47%   46%       72%   67%   69%   55%*
Confidence              55%   51%   51%   49%       63%   60%   60%   56%
Sampling variability    35%   36%   36%   35%       41%   40%   41%   32%

Transfer Question (Final exam)
• A constant theme of course: Could the statistic have happened by chance alone?
  • Applicable in any situation vs. statistical test applicable in only one specific situation
• Can students apply the same logic to a novel problem?
• Spring 2014: Two Cal Poly instructors (169 students)
• Final exam: mean/median as a measure of skewness to make inference about population shape (adapted from 2009 AP Statistics exam)
• Earlier midterm: Ratio of standard deviations or relative risk


• Do the sample data provide convincing evidence the population is right skewed?
• Calculate statistic: mean/median = 1.05
• What values would you expect for the statistic with a normally distributed population? With a skewed right population?
• 39% answered both questions correctly
• Common errors:
  • Mean/median > 1.05 if right skewed
  • Wrong direction: mean/median < 1 if right skewed


• Do the sample data provide evidence the population is right skewed?
• Calculate statistic: mean/median = 1.05
• Given a simulated null distribution from a symmetric population (centered at 1)
• Evidence against the null hypothesis?
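
A null distribution for the mean/median statistic could be simulated along these lines. This is a sketch under stated assumptions, not the exam's actual setup: the slides do not give the sample size or the population used, so the sample size of 50 and the normal population with mean 100 and SD 15 are hypothetical, and the spread of the resulting distribution will differ from the one students were shown.

```python
import random
import statistics

def null_ratios(n, reps=2_000, mu=100.0, sigma=15.0, seed=1):
    """Simulate mean/median for repeated samples from a symmetric (normal) population."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ratios.append(statistics.mean(sample) / statistics.median(sample))
    return ratios

ratios = null_ratios(n=50)  # hypothetical sample size
observed = 1.05             # statistic reported on the slide
# Proportion of simulated ratios at least as large as the observed one.
p_value = sum(r >= observed for r in ratios) / len(ratios)
print(f"Center of null distribution: {statistics.mean(ratios):.3f}")
print(f"Approximate p-value: {p_value:.4f}")
```

The key comparison is where 1.05 falls relative to the spread of the simulated ratios, not the symmetry or center of the null distribution.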


• Multiple choice version based on common responses from open-ended version:
• Answer choices focus on 3 characteristics of the null distribution:
• There is strong evidence (or not) to suggest the actual population distribution is right skewed…
  • Due to symmetric shape
  • Because the center is at 1
  • Because most values vary between 0.96 and 1.04

Transfer Question (Final exam)
Two instructors (5 sections/169 students) from Cal Poly
• does not provide strong evidence … because this null distribution is symmetric. 11%
• provides strong evidence … because this null distribution is symmetric. 12%
• does not provide strong evidence … because this null distribution is centered around one. 20%
• provides strong evidence … because this null distribution is centered around one. 26%
• does not provide strong evidence … because most of the values in this null distribution vary between 0.96 to 1.04. 10%
• provides strong evidence … because most of the values in this null distribution vary between 0.96 to 1.04. 18%
• Other: provided correct reasoning 7%

* 25% answered correctly and an additional 8% showed work indicating correct reasoning

Benefits
• Little to no confusion that small p-values ⇒ statistical significance
  • Students very comfortable (even initially) with idea of “could this have happened by chance alone”
  • Idea of large z-score or t-score (beyond 2SE) also clicks
• Address difficult inferential reasoning earlier in course
  • Repeated exposures allow a synthesis of the ideas
• Understanding “Inference process” as statistical method, rather than stand-alone methods for testing means, proportions, etc.
• Efficiency gains:
  • Still possible to do both simulation and normal-based methods
  • Exploration of other statistics (e.g. MAD for multiple means)
• Instructors enjoy approach, research study focus, richer student questions

Cautions
• Inferential reasoning is difficult and initially, little carry-over of learning:
  • Non 50/50 cases
  • Comparing groups
  • Need several repeated exposures
• May introduce a misconception of “repeating the study”
• Possible increase in misconception that we are “providing evidence for the null hypothesis”
• Continue to struggle with identifying & defining parameters
• Balance inferential with descriptive statistics (less as Common Core comes on line?)

Main Suggestions
• Emphasize the ideas of model and simulation
  • Repeatedly test their ability to design a simulation
  • Ask students to predict simulation results (where will it be centered, why)
• Focus on variability in null distribution as the key
• Clearly delineate observed data from simulation
• Explicitly discuss roles of randomness in the study design vs. randomness in simulation
• Use early experiential examples that give students ownership of the data (“observed” statistic)

Future Steps
• Three year NSF grant (DUE/TUES – 1323210) to continue data collection across institutions
  • More “non-users” and other randomization-based curricula (e.g., Lock5, CATALST)
  • More studies of student retention of concepts
• Next theme of common exam questions: Confidence intervals
• Email Nathan Tintle ([email protected]) or Beth Chance ([email protected]) if you would like to participate


Questions?
