MATH 1040 Final Project
Skittles Study Hillary Bowler
Contains:
Data Collection
Organization of Data
Confidence Interval Estimates and Hypothesis Tests
Reflection
Project Introduction:
The first step in any statistics study is the data collection. For this project, each student in our class
purchased one small packet of Skittles at the grocery store, giving us the closest thing to a random sample we could manage.
We each counted and recorded the frequency of each of the five colors within our bags and submitted that
information to the professor. Below you will find the table I personally submitted, followed by the full
data set for the class. From there, my group was able to organize the data (in Part II), calculate estimates
and run tests (in Part III). My personal reflection on the assignment can be found in Part IV.
PART I: DATA COLLECTION
Colors in My Skittles Bag
Color Frequency
Red 12
Purple 9
Green 17
Yellow 12
Orange 10
Candies Database for the Final Project MATH 1040 Spring 2016
No. Name Red Orange Yellow Green Purple Total Count
1 Anderson, Jennifer M. 11 10 14 13 11 59
2 Bowen, Sydney 10 14 9 13 14 60
3 Bowler, Hillary A. 12 10 12 17 9 60
4 Chaires, Carmen 10 11 13 9 18 61
5 Clark, Ashley 15 14 11 11 8 59
6 Cordova, Amanda E. 9 10 16 15 5 55
7 Darger, Virginia N. 10 13 13 7 19 62
8 Dumas, Cynthia V. 12 10 11 18 8 59
9 Ferre, Bethany 13 15 12 13 8 61
10 Hanes, Joel H. 12 14 13 12 10 61
11 Hoth, Blake 10 11 16 13 9 59
12 King, Jasmine A. 13 14 4 13 14 58
13 Macey, George H. 12 8 12 12 16 60
14 Ouro-Gneni, Alamissi 11 9 14 14 13 61
15 Page, Cynthia 11 14 7 16 11 59
16 Piela, Alicia M. 12 17 10 11 8 58
17 Pratt, Eric D. 8 14 17 11 11 61
18 Ratliff, Lily M. 12 8 6 14 18 58
19 Schofield, Victoria A. 11 12 15 7 15 60
20 Silva, Joshua D. 13 18 5 9 15 60
21 Smithwick, Jillian C. 6 15 14 8 11 54
22 Sorensen, Jared L. 11 11 11 9 18 60
23 Udy, Dustin J. 16 15 11 9 7 58
24 Veroni Fioramonti, Elias F. 17 7 7 15 12 58
25 Walsh, Sarena K. 13 11 13 10 12 59
26 Wilstead, Tanea C. 14 11 9 12 10 56
Total: 304 316 295 311 310 1536
PART II: DATA ORGANIZATION
Introduction
The first step in our process was to collect the data, which came in the form of a report from each student
on their respective bag of Skittles. We reported the number of each color we found in the bags we
purchased. Our objective is to study how the total number of candies per bag and the count of each color
are distributed. To better understand this, we
compared class totals and our own personal totals. We detail that comparison below.
-Categorical Data
-Discussion: Both graphs for the full class data show what we expected to see. The totals for each color
were similar: the range was only 21 for totals that all sit near 300 (295 to 316). The pie chart and Pareto chart show a very
small difference between the frequencies of each color. The story is a little different for our individual
data. Hillary clearly had more greens than anything else, Ashley had more reds, Amanda had the most
yellows, and Victoria's purples tied with her yellows for her highest count. Each of our Pareto charts shows a
very clear descending pattern and a clear difference between the frequencies of colors. This is probably
because the total number of Skittles in each bag is small compared to the class totals, so even a
difference of a few candies looks large.
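The comparison above can be checked with a short calculation. This is only a sketch in Python (not the StatCrunch workflow the group used); the variable names are made up for illustration, and the counts are copied from the class table and from Hillary's row.

```python
# Class totals per color (from the class table) and one student's bag (Hillary's row).
class_totals = {"Red": 304, "Orange": 316, "Yellow": 295, "Green": 311, "Purple": 310}
my_bag = {"Red": 12, "Orange": 10, "Yellow": 12, "Green": 17, "Purple": 9}


def proportions(counts):
    """Return each color's share of the total count."""
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}


class_props = proportions(class_totals)
bag_props = proportions(my_bag)

# Gap between the most and least common color, in raw counts and as a share.
class_range = max(class_totals.values()) - min(class_totals.values())
class_spread = max(class_props.values()) - min(class_props.values())
bag_spread = max(bag_props.values()) - min(bag_props.values())

for color in class_totals:
    print(f"{color:<7} class {class_props[color]:.3f}   bag {bag_props[color]:.3f}")
print(f"class range in counts: {class_range}")
print(f"spread in share: class {class_spread:.3f}, bag {bag_spread:.3f}")
```

The spread in share is much wider for a single bag than for the class totals, which matches the steeper Pareto charts we saw for individual bags.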
-Mean/Standard deviation/5-Number Summary
Mean: 59.08
Standard Deviation: 1.9
5-Number Summary
Minimum: 54
Maximum: 62
Q1: 58
Q2 (Median): 59
Q3: 60
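These summary values can be reproduced from the Total Count column of the class table. The sketch below uses Python's standard library rather than StatCrunch. Quartile conventions differ between tools, so the median-of-halves method here is an assumption that happens to agree with the values reported above.

```python
import statistics

# Total candies per bag, copied from the Total Count column of the class table.
bag_totals = [59, 60, 60, 61, 59, 55, 62, 59, 61, 61, 59, 58, 60,
              61, 59, 58, 61, 58, 60, 60, 54, 60, 58, 58, 59, 56]

mean = statistics.mean(bag_totals)
stdev = statistics.stdev(bag_totals)  # sample standard deviation (divides by n - 1)


def quartiles_by_halves(values):
    """Q1, median, Q3 using the median of the lower and upper halves."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return statistics.median(lower), statistics.median(s), statistics.median(upper)


q1, q2, q3 = quartiles_by_halves(bag_totals)

print(f"mean = {mean:.2f}, standard deviation = {stdev:.1f}")
print(f"min = {min(bag_totals)}, Q1 = {q1}, median = {q2}, Q3 = {q3}, max = {max(bag_totals)}")
```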
- Histogram and Box plot
-Discussion: The graphs show a roughly bell-shaped distribution of data, apart from one outlier and one gap. The
histogram has no data for 57 Skittles, because none of the 26 students had a bag with that exact
count. The box plot shows one outlier at 54; this student had the lowest count in the class. Again, we
expected a fairly normal distribution because the bag totals vary so little from student to student (the
standard deviation is only 1.9). All of our individual totals (Hillary 60, Ashley 59, Amanda 55, Victoria 60; see a box plot of
our data below) fall within range, though Amanda's is on the low end and more than 4 Skittles away from
the mean.
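The outlier and the gap can both be confirmed numerically. The sketch below assumes the usual box plot rule (a value is an outlier if it lies more than 1.5 times the interquartile range beyond Q1 or Q3) and takes Q1 = 58 and Q3 = 60 from the 5-number summary. Whether StatCrunch applies exactly this rule is an assumption.

```python
# Total candies per bag, copied from the Total Count column of the class table.
bag_totals = [59, 60, 60, 61, 59, 55, 62, 59, 61, 61, 59, 58, 60,
              61, 59, 58, 61, 58, 60, 60, 54, 60, 58, 58, 59, 56]

q1, q3 = 58, 60          # quartiles from the 5-number summary
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the quartiles is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sorted(x for x in bag_totals if x < lower_fence or x > upper_fence)

# Counts between the minimum and maximum that no bag actually had.
observed = set(bag_totals)
gaps = [v for v in range(min(bag_totals), max(bag_totals) + 1) if v not in observed]

print(f"fences: {lower_fence} to {upper_fence}")
print(f"outliers: {outliers}")
print(f"gaps in the histogram: {gaps}")
```

Under this rule the bag with 55 candies sits exactly on the lower fence, so only the bag with 54 is flagged.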
Reflection: Put simply, quantitative data consists of numerical measurements or counts, and categorical data sorts
information into categories. The first set of graphs, pie charts and Pareto charts, is better suited to categorical data.
Categorical data is not always numerical. While we did count the quantity of each color,
color itself is not numeric. The histogram and box plot work better for quantitative data. They were a
visual representation of the number of Skittles in each bag in the class sample and their distribution. You
can't calculate much for categorical data beyond counts and proportions for each category. With quantitative data,
you can calculate everything from mean to standard deviation and variance. The categorical data was fun
to know, but the quantitative data provided more calculations to help us really understand
Skittles packaging patterns.
PART III: ESTIMATES AND TESTS
Introduction
For the purposes of this class, we are treating the class data as a workable sample to help us make
calculations and hypotheses for the population (or all manufactured skittles). We’ve included the
StatCrunch calculations for confidence intervals and hypothesis tests for the skittles population below.
Although the class data only includes information for 26 bags of skittles (making n < 30), we will treat
the data as if our sample is greater than 30 and assume that the data is normally distributed.
-Confidence Interval Estimates
Explain in general the purpose and meaning of a confidence interval.
The book defines a confidence interval as a range of values used to estimate the true value of a population
parameter. The parameters we're estimating for this assignment are the true proportion, true mean and
standard deviation. In general, the confidence level describes how reliable the method is: it's the
proportion of intervals, built the same way from repeated samples, that would actually contain the true
parameter.
Confidence Interval Estimate: True Proportion for Purple Candies
95% confidence interval results: 0.182 < p < 0.222
p : Proportion of successes
Method: Standard-Wald
Proportion Count Total Sample Prop. Std. Err. L. Limit U. Limit
p 310 1536 0.20182292 0.010240927 0.18175107 0.22189476
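The Standard-Wald limits above follow from the usual formula: sample proportion plus or minus z times the standard error. The sketch below reproduces them with Python's standard library as a hand check. It is not the StatCrunch procedure itself, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

count, total = 310, 1536      # purple candies and all candies in the class sample
confidence = 0.95

p_hat = count / total
std_err = sqrt(p_hat * (1 - p_hat) / total)

# Two-sided critical value from the standard normal distribution (about 1.96).
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

lower = p_hat - z_crit * std_err
upper = p_hat + z_crit * std_err

print(f"sample proportion = {p_hat:.4f}, standard error = {std_err:.4f}")
print(f"{confidence:.0%} interval: {lower:.3f} < p < {upper:.3f}")
```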
Confidence Interval Estimate: True Mean of Candies Per Bag
99% confidence interval results: 58.041 < μ < 60.113
μ : Mean of variable
Variable Sample Mean Std. Err. DF L. Limit U. Limit
Total Candies 59.076923 0.37178603 25 58.040593 60.113253
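The interval for the mean uses the t distribution with 25 degrees of freedom. Python's standard library has no t distribution, so the sketch below hard-codes the critical value 2.787 from a standard t table; that constant is an assumption, not something reported in the StatCrunch output.

```python
import statistics
from math import sqrt

# Total candies per bag, copied from the Total Count column of the class table.
bag_totals = [59, 60, 60, 61, 59, 55, 62, 59, 61, 61, 59, 58, 60,
              61, 59, 58, 61, 58, 60, 60, 54, 60, 58, 58, 59, 56]

n = len(bag_totals)
mean = statistics.mean(bag_totals)
std_err = statistics.stdev(bag_totals) / sqrt(n)

# Assumed table value: two-sided 99% critical value of t with n - 1 = 25 degrees of freedom.
T_CRIT_99_DF25 = 2.787

margin = T_CRIT_99_DF25 * std_err
lower, upper = mean - margin, mean + margin

print(f"mean = {mean:.4f}, standard error = {std_err:.4f}, df = {n - 1}")
print(f"99% interval: {lower:.3f} < mu < {upper:.3f}")
```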
Confidence Interval Estimate: Standard Deviation of Candies Per Bag
98% confidence interval results: 1.424 < σ < 2.792
σ² : Variance of variable
σ : Standard deviation, equal to the square root of the variance limits below
Variable Sample Var. DF L. Limit U. Limit
Total Candies 3.5938462 25 2.0274843 7.7964549
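The variance limits come from the chi-square distribution, and the standard deviation limits are their square roots. The sketch below hard-codes the two chi-square critical values for 25 degrees of freedom (44.314 and 11.524) from a standard table, since the Python standard library does not provide them. Those constants are assumptions, not part of the StatCrunch output.

```python
import statistics
from math import sqrt

# Total candies per bag, copied from the Total Count column of the class table.
bag_totals = [59, 60, 60, 61, 59, 55, 62, 59, 61, 61, 59, 58, 60,
              61, 59, 58, 61, 58, 60, 60, 54, 60, 58, 58, 59, 56]

df = len(bag_totals) - 1
sample_var = statistics.variance(bag_totals)  # sample variance (divides by n - 1)

# Assumed table values for a 98% interval with 25 degrees of freedom:
# 0.01 of the area lies to the right of CHI2_RIGHT and to the left of CHI2_LEFT.
CHI2_RIGHT = 44.314
CHI2_LEFT = 11.524

var_lower = df * sample_var / CHI2_RIGHT
var_upper = df * sample_var / CHI2_LEFT
sd_lower, sd_upper = sqrt(var_lower), sqrt(var_upper)

print(f"sample variance = {sample_var:.4f}, df = {df}")
print(f"98% interval for the variance: {var_lower:.3f} to {var_upper:.3f}")
print(f"98% interval for the standard deviation: {sd_lower:.3f} to {sd_upper:.3f}")
```

Note that the larger critical value gives the lower limit, because it sits in the denominator.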
-Interpretation
Each of these answers, written out, is as follows:
- We can be 95% confident that the interval between 0.182 and 0.222 actually does contain the
value of the true proportion of the population p.
- We can be 99% confident that the interval between 58.041 and 60.113 does in fact contain the
value of the true population mean μ.
- We can be 98% confident that the interval between 1.424 and 2.792 does contain the standard
deviation σ of the population (equivalently, that the interval between 2.027 and 7.796 contains the variance σ², the standard deviation squared).
All of this means that if we were to select many different samples of the same size and construct a
confidence interval from each one, about 95% (or 99%, or 98%) of those intervals would contain the true parameter.
-Hypothesis Tests
Explain in general the purpose and meaning of a hypothesis test.
A hypothesis is a claim about a property of a population, which makes a hypothesis test a procedure for
testing such claims. Hypothesis tests are used to decide whether sample data are consistent with a claim about a population. In this
assignment, for example, we test the claim that 20% of all Skittles are green and the claim that the mean number of
candies per bag of Skittles is 56. The null hypothesis states that the parameter equals the claimed value, and the alternative hypothesis is the
competing statement (here, that the parameter differs from that value) that the test supports when the null hypothesis is rejected.
Hypothesis test results: proportion of green candies
p : Proportion of successes
H0 : p = 0.2
HA : p ≠ 0.2
Proportion Count Total Sample Prop. Std. Err. Z-Stat P-value
p 311 1536 0.20247396 0.010206207 0.24239742 0.8085
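The Z-Stat and P-value above can be reproduced by hand. In a one-proportion z-test the standard error is computed from the hypothesized proportion (0.2), not from the sample proportion. The sketch below uses Python's standard library and illustrative variable names. It is a hand check, not the StatCrunch procedure.

```python
from math import sqrt
from statistics import NormalDist

count, total = 311, 1536   # green candies and all candies in the class sample
p0 = 0.2                   # proportion claimed under the null hypothesis

p_hat = count / total
std_err = sqrt(p0 * (1 - p0) / total)   # uses p0 because the null is assumed true
z_stat = (p_hat - p0) / std_err

# Two-sided P-value: area in both tails beyond |z|.
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"sample proportion = {p_hat:.4f}, standard error = {std_err:.4f}")
print(f"z = {z_stat:.4f}, P-value = {p_value:.4f}")
```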
Hypothesis test results: mean number of candies per bag
μ : Mean of variable
H0 : μ = 56
HA : μ ≠ 56
Variable Sample Mean Std. Err. DF T-Stat P-value
Total Candies 59.076923 0.37178603 25 8.2760589 <0.0001
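The T-Stat above is the distance between the sample mean and the claimed mean, measured in standard errors. The sketch below recomputes it and makes the decision with the critical-value method instead of a P-value, because Python's standard library has no t distribution. The critical value 2.060 (two-sided, 0.05 level, 25 degrees of freedom) is an assumed table value.

```python
import statistics
from math import sqrt

# Total candies per bag, copied from the Total Count column of the class table.
bag_totals = [59, 60, 60, 61, 59, 55, 62, 59, 61, 61, 59, 58, 60,
              61, 59, 58, 61, 58, 60, 60, 54, 60, 58, 58, 59, 56]

mu0 = 56                   # mean claimed under the null hypothesis
n = len(bag_totals)
mean = statistics.mean(bag_totals)
std_err = statistics.stdev(bag_totals) / sqrt(n)

t_stat = (mean - mu0) / std_err

# Assumed table value: two-sided critical t at the 0.05 level with 25 degrees of freedom.
T_CRIT_05_DF25 = 2.060
reject_null = abs(t_stat) > T_CRIT_05_DF25

print(f"t = {t_stat:.4f} with {n - 1} degrees of freedom")
print("reject H0" if reject_null else "fail to reject H0")
```

A test statistic this far beyond the critical value is consistent with the very small P-value that StatCrunch reports.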
-Interpretation
For the claim that 20% of all Skittles are green, we have a P-value of 0.809. Because 0.809 > 0.01 (level
of significance), we fail to reject the null hypothesis: the sample does not give sufficient evidence
against the claim that 20% of all Skittles are green.
For the claim that the mean number of candies per bag of Skittles is 56, our P-value is < 0.0001,
extremely small and far lower than the significance level of 0.05. Because the P-value is below 0.05, we reject the
null hypothesis that the mean number of candies per bag is 56.
Small P-values indicate strong evidence against a null hypothesis and larger P-values indicate weak
evidence against it, hence the failure to reject the first hypothesis and the rejection of the second.
PART IV: REFLECTIVE ESSAY
“Thank goodness for technology!” is by far the most repeated thought I’ve had during this class,
and I’m sure that rings true even for professional statisticians. I was grateful for the practical applications
in the project. They helped me grow more comfortable with using software like StatCrunch, and I was
even able to utilize my calculator (a TI-84) to compare results on some of the calculations. I found that
the use of technology not only saved time, but also helped me see the “big picture” of the data more
easily. I expect that the things I’ve learned about math technologies will help me get to conclusions faster
and go to greater depths with data in future research projects.
The parts of the project I found most helpful and applicable were the discussions and
interpretations. Since I have never planned on being any sort of mathematician, I have always struggled to
find real-world applications for math (beyond the basic arithmetic, of course, that I used for my taxes last
month). But I discovered that statistics is as much about interpretation and conclusion as it is about the numbers.
For example, your hypothesis test isn't finished until you have stated a conclusion: either you reject the
null hypothesis or you fail to reject it.
The ability to collect and interpret data would be helpful in most any field, but I’m looking
forward to how helpful it will be to me personally. I will be pursuing studies in psychotherapy and will be
called upon to collect my own data, build my own research and draw conclusions. This project helped me
imagine doing that. It’s easy to feel like I am just plugging in numbers on a lot of the assignments—but
for this, I actually physically counted my personal sample and had a better vision of what the numbers
really meant.
On the one hand, skittles doesn’t feel very “real-world,” but I understand why we would need to
keep things fairly simple for the class. I clearly can’t say that knowing the approximate mean number of
candies per skittles bag is very applicable knowledge. But conducting my own statistics research from
beginning to end made the math real enough to me that I can now more easily envision applying these
skills to real-life studies.