FINAL PROJECT
Probability and Statistics for Engineers
Jack Connolly
Math 2100 Section 4
Fall 2018
Submitted: 3 December 2018
Final Project | Probability and Statistics for Engineers 2
1.
a.
Section 1 (1.a.)
R-Code
> fall=read.table("Project_Data_Fall.txt",header=T)    #reads the text document and acknowledges header
> attach(fall)                                          #attaches the data
> mean(section1$NUM_GR)                                 #finds the mean of the sample
[1] 77.04
> median(section1$NUM_GR)                               #finds the median of the sample
[1] 80
> freqfall=table(section1$NUM_GR)                       #creates a frequency table for each grade
> names(freqfall)[which(freqfall==max(freqfall))]       #outputs which values have the highest frequency (mode)
[1] "72" "82" "95"
> var(section1$NUM_GR)                                  #calculates sample variance
[1] 170.9567
> sd(section1$NUM_GR)                                   #calculates sample standard deviation
[1] 13.07504
Sample Statistic          Value
Mean (x̄)                  77.04
Median (x̃)                80
Mode (Mo)                 72, 82, 95
Variance (s²)             170.96
Standard Deviation (s)    13.08
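Because R has no built-in function for the statistical mode, the table lookup used above can be wrapped in a small helper. This is only a minimal sketch: the function name `sample_mode` and the `grades` vector are hypothetical illustrations, not the project data.

```r
# Hypothetical helper: returns every value tied for the highest frequency.
sample_mode <- function(x) {
  freq <- table(x)                              # frequency of each distinct value
  as.numeric(names(freq)[freq == max(freq)])    # values with the highest count
}

grades <- c(72, 72, 82, 82, 95, 95, 60, 88)     # illustrative data only
sample_mode(grades)
```

Returning all tied values, rather than only the first, matters for data like Section 1, where three grades share the highest frequency.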
Section 2 (1.a.)
R-Code
> fall=read.table("Project_Data_Fall.txt",header=T)    #reads the text document and acknowledges header
> attach(fall)                                          #attaches the data
> mean(section2$NUM_GR)                                 #finds the mean of the sample
[1] 79.36
> median(section2$NUM_GR)                               #finds the median of the sample
[1] 83
> freqfall2=table(section2$NUM_GR)                      #creates a frequency table for each grade
> names(freqfall2)[which(freqfall2==max(freqfall2))]    #outputs which values have the highest frequency (mode)
[1] "88"
> var(section2$NUM_GR)                                  #calculates sample variance
[1] 183.0733
> sd(section2$NUM_GR)                                   #calculates sample standard deviation
[1] 13.53046
Sample Statistic          Value
Mean (x̄)                  79.36
Median (x̃)                83
Mode (Mo)                 88
Variance (s²)             183.07
Standard Deviation (s)    13.53
b.
Section 1 (1.b.)
R-Code [1]
> freqtable=hist(section1$NUM_GR, breaks=10, xlab="Grades", ylab="Frequency", main="Section 1 Grade Distribution")    #draws the histogram with the parameters set for 10 bins for the grades data set
> freqtable    #prints a frequency table with the histogram parameters
$`breaks`    #bin start points
[1] 50 55 60 65 70 75 80 85 90 95
$counts    #frequency counts
[1] 1 3 2 1 5 1 4 3 5
$density
[1] 0.008 0.024 0.016 0.008 0.040 0.008 0.032 0.024 0.040
$mids
[1] 52.5 57.5 62.5 67.5 72.5 77.5 82.5 87.5 92.5
$xname
[1] "section1$NUM_GR"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
Bin (Grade Range)    Frequency
50 – 55              1
56 – 60              3
61 – 65              2
66 – 70              1
71 – 75              5
76 – 80              1
81 – 85              4
86 – 90              3
91 – 95              5
96 – 100             0
Note: The histogram shows only 9 bins because no grade falls above 95, so the bin above 95 is empty and is not drawn. The distribution for section one is odd in that it appears almost bimodal, with peaks around 57 and 85. A more nearly normal shape would be expected; the irregular shape is likely due to the small sample size.
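In `hist()`, a single number passed as `breaks` is only a suggestion: R chooses "pretty" break points that cover the range of the data, which is one way to end up with 9 bins when 10 were requested. The sketch below, which uses a hypothetical `grades` vector rather than the project data, fixes the bins explicitly and rebuilds the frequency table with `cut()`.

```r
grades <- c(52, 58, 63, 71, 74, 74, 82, 88, 91, 95)   # illustrative data only

# Explicit break points from 50 to 100 in steps of 5 give exactly 10 bins.
brks <- seq(50, 100, by = 5)

# plot = FALSE returns the histogram parameters without drawing.
h <- hist(grades, breaks = brks, plot = FALSE)
h$counts

# The same counts as a labeled frequency table. Bins are right-closed,
# so a grade of 95 is counted in (90,95] and not in (95,100].
bins <- cut(grades, breaks = brks, include.lowest = TRUE)
table(bins)
```

With explicit breaks, both sections would be tabulated over the same 10 bins, which makes their frequency tables directly comparable.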
Section 2 (1.b.)
R-Code
> freqtable2=hist(section2$NUM_GR, breaks=10, xlab="Grades", ylab="Frequency", main="Section 2 Grade Distribution")    #draws the histogram with the parameters set for 10 bins for the grades data set
> freqtable2    #prints a frequency table with the histogram parameters
$`breaks`    #bin start points
[1] 50 55 60 65 70 75 80 85 90 95 100
$counts    #frequency counts
[1] 1 1 3 2 4 1 2 5 4 2
$density
[1] 0.008 0.008 0.024 0.016 0.032 0.008 0.016 0.040 0.032 0.016
$mids
[1] 52.5 57.5 62.5 67.5 72.5 77.5 82.5 87.5 92.5 97.5
$xname
[1] "section2$NUM_GR"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
Bin (Grade Range)    Frequency
50 – 55              1
56 – 60              1
61 – 65              3
66 – 70              2
71 – 75              4
76 – 80              1
81 – 85              2
86 – 90              5
91 – 95              4
96 – 100             2
The shape of this distribution differs from that of section one and appears closer to normal. The spread is very similar, with both distributions spanning about 50 grade points.
c.
Boxplot (1.c.)
R-Code
> boxplot(section1$NUM_GR,section2$NUM_GR, names=c("Section 1","Section 2"),ylab="Grades",main="Grade Distribution")    #outputs two boxplots (one for each section), and labels the axes/title
The ranges of the two sections are very similar despite the different distribution shapes. However, section two performs slightly better, with slightly higher first and third quartiles as well as a higher median.
d.
Proportion of Good Academic Standing (1.d.)
R-Code
> goodGr=subset(fall,NUM_GR>69.5)       #creates a data frame with all of the students with grades of 70 or higher
> pgoodGr=(nrow(goodGr)/nrow(fall))     #calculates the proportion of good grades by counting the good grades and dividing by the total number of students
> pgoodGr                               #prints the proportion of students in good academic standing
[1] 0.74
When considering 70% as good academic standing, the proportion of students in good academic standing during the fall is 0.74.
e.
Percentage with Low Courseload and Good Academic Standing (1.e.)
R-Code
> lazy=100*nrow(subset(fall,ENR_CR<16 & NUM_GR>69.5))/nrow(goodGr)    #counts the students with good grades and fewer than 16 credits and divides by the total number of students with good grades, giving the percentage of students in good academic standing who took fewer than 16 credits
> lazy    #prints the percentage
[1] 70.27027
Among students in good academic standing (70%+), the percentage who took fewer than 16 credits is 70.27%.
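The proportions in 1.d and 1.e can also be computed directly from logical vectors, since the mean of a logical vector is the proportion of TRUE values. The sketch below uses a small hypothetical data frame, `fall_demo`, in place of Project_Data_Fall.txt; only the column names NUM_GR and ENR_CR come from the report.

```r
# Illustrative data only; column names follow the report.
fall_demo <- data.frame(
  NUM_GR = c(55, 68, 70, 75, 82, 90, 95, 64),
  ENR_CR = c(17, 16, 12, 15, 16, 13, 14, 18)
)

good <- fall_demo$NUM_GR >= 70      # TRUE for students in good standing
p_good <- mean(good)                # proportion in good standing

# Among students in good standing, the percentage with fewer than 16 credits.
pct_light <- 100 * mean(fall_demo$ENR_CR[good] < 16)

p_good
pct_light
```

Indexing `ENR_CR` by `good` makes the conditioning explicit: the denominator is the number of students in good standing, not the whole class.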
2. a.
Grade Statistics (2.a.)
R-Code
> full=read.table("Project_Data_Full.txt",header=T)     #reads the text document and acknowledges header
> attach(full)                                          #attaches the data
> mean(NUM_GR)                                          #finds the mean of the population
[1] 77.21
> median(NUM_GR)                                        #finds the median of the population
[1] 77.5
> freqfull=table(NUM_GR)                                #creates a frequency table for each grade
> names(freqfull)[which(freqfull==max(freqfull))]       #outputs which values have the highest frequency (mode)
[1] "74" "95"
> variance=sum((NUM_GR-mean(NUM_GR))^2)/(nrow(full))    #calculates population variance
> variance
[1] 150.0259
> sigma=sqrt(variance)                                  #calculates population standard deviation
> sigma
[1] 12.24851
Population Statistic      Value
Mean (μ)                  77.21
Median (x̃)                77.5
Mode (Mo)                 74, 95
Variance (σ²)             150.03
Standard Deviation (σ)    12.25
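R's `var()` and `sd()` divide by n − 1, which is why the population variance is computed by hand in this section. An equivalent route is to rescale `var()` by (n − 1)/n. A minimal sketch with a hypothetical vector `x`:

```r
x <- c(70, 74, 74, 81, 95, 95, 62, 88)       # illustrative data only
n <- length(x)

pop_var_manual <- sum((x - mean(x))^2) / n   # definition used in the report
pop_var_scaled <- var(x) * (n - 1) / n       # sample variance rescaled to divisor n

pop_sd <- sqrt(pop_var_manual)               # population standard deviation
all.equal(pop_var_manual, pop_var_scaled)
```

The population standard deviation is always slightly smaller than `sd(x)`, and the gap shrinks as n grows.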
Students’ Registered Credits Statistics (2.a.)
R-Code
> full=read.table("Project_Data_Full.txt",header=T)     #reads the text document and acknowledges header
> attach(full)                                          #attaches the data
> mean(ENR_CR)                                          #finds the mean of the population
[1] 13.68
> median(ENR_CR)                                        #finds the median of the population
[1] 14
> freqfull2=table(ENR_CR)                               #creates a frequency table for each credit value
> names(freqfull2)[which(freqfull2==max(freqfull2))]    #outputs which values have the highest frequency (mode)
[1] "16"
> variance=sum((ENR_CR-mean(ENR_CR))^2)/(nrow(full))    #calculates population variance
> variance
[1] 9.0176
> sigma=sqrt(variance)                                  #calculates population standard deviation
> sigma
[1] 3.002932
Population Statistic      Value
Mean (μ)                  13.68
Median (x̃)                14
Mode (Mo)                 16
Variance (σ²)             9.02
Standard Deviation (σ)    3.00
b.
Registered Credit’s Impact on Grade (2.b.)
R-Code
> plot(ENR_CR,NUM_GR, xlab="Registered Course Credits",ylab="Grade",main="Courseload's Impact on Grade")    #creates a plot with credits on the x, and grades on the y. Adds data labels for the axes/title
c.
Registered Credit’s Impact on Grade (2.c.)
R-Code [2]
> plot(ENR_CR,NUM_GR, xlab="Registered Course Credits",ylab="Grade",main="Courseload's Impact on Grade")    #creates a plot with credits on the x, and grades on the y. Adds data labels for the axes/title
> lm(NUM_GR~ENR_CR)    #creates a linear model between credits and grade
Call:
lm(formula = NUM_GR ~ ENR_CR)
Coefficients:
(Intercept)       ENR_CR
    101.077       -1.745
> abline(lm(NUM_GR~ENR_CR))    #adds the line of the model to the scatter-plot
i. Approximating Grade with 10 Credits (2.c.i.)
R-Code
> exp10CR=101.077-1.745*10    #creates and prints a variable to store the expected value of the grade with 10 credits using the linear model from 2.c.
> exp10CR
[1] 83.627
The expected score for a student registered for 10 course credits is 83.6, a B.
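Rather than retyping rounded coefficients, the fitted model object can be queried with `predict()`. The sketch below uses a hypothetical data frame `demo`, not the project data; the column names follow the report.

```r
# Illustrative data only; column names follow the report.
demo <- data.frame(
  ENR_CR = c(10, 12, 13, 14, 15, 16, 17, 18),
  NUM_GR = c(88, 84, 80, 79, 74, 76, 70, 68)
)

fit <- lm(NUM_GR ~ ENR_CR, data = demo)

# Expected grade at 10 credits, from the stored (unrounded) coefficients.
pred <- predict(fit, newdata = data.frame(ENR_CR = 10))

# The same value computed by hand from the intercept and slope.
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["ENR_CR"]] * 10
```

Passing `data =` to `lm()` also avoids relying on `attach()`, so the model does not depend on which data frame happens to be attached.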
ii. Approximating Credits for an A (2.c.ii.)
R-Code
> lm(ENR_CR~NUM_GR)    #creates a linear model relating credits to grade (response and predictor swapped relative to 2.c.)
Call:
lm(formula = ENR_CR ~ NUM_GR)
Coefficients:
(Intercept)       NUM_GR
    21.7768      -0.1049
> expAGr=21.7768-0.1049*89.5    #creates and prints the expected course load given a student wants to get an A.
> expAGr
[1] 12.38825
Using 89.5, the lowest grade that rounds to an A, the expected courseload for a student who wants to get an A (90%+) is 12.39 credits.
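Regressing credits on grade, as done in 2.c.ii, is not the same as solving the grade-on-credits line from 2.c for credits; the two lines coincide only when the correlation is exactly ±1. As a comparison, the sketch below uses the rounded coefficients printed in the report to evaluate both at a grade of 89.5. The variable names are illustrative.

```r
# Coefficients as printed in 2.c and 2.c.ii (rounded).
b0_grade <- 101.077; b1_grade <- -1.745     # NUM_GR ~ ENR_CR
b0_cred  <- 21.7768; b1_cred  <- -0.1049    # ENR_CR ~ NUM_GR

target <- 89.5                              # lowest grade that rounds to an A

# Predicted credits from the credits-on-grade model (the report's approach).
credits_direct <- b0_cred + b1_cred * target

# Credits at which the grade-on-credits line reaches the target grade.
credits_inverted <- (target - b0_grade) / b1_grade

# The product of the two slopes is, up to rounding, the R-squared from 2.e.
slope_product <- b1_grade * b1_cred
```

The first approach gives about 12.39 credits and the second about 6.63, so the answer depends on which variable is treated as the response.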
d.
Calculating Correlation Coefficient R (2.d.)
R-Code
> cor(full$ENR_CR, full$NUM_GR)    #calculates the correlation coefficient (R) of the linear model
[1] -0.4277376
The correlation coefficient R = -0.4277 measures the strength and direction of the linear relationship between registered credits and grade. Its magnitude of about 0.43 represents a fairly weak correlation. The negative sign indicates that the regression line slopes downward, which is confirmed by the plot in 2.c.
e.
Calculating Coefficient of Determination R² (2.e.)
R-Code
> rsquared=(cor(full$ENR_CR, full$NUM_GR))^2    #calculates the correlation coefficient (R) of the linear model and squares it to calculate the coefficient of determination
> rsquared
[1] 0.1829594
The coefficient of determination R² = 0.183 is the proportion of the variation in grades that is accounted for by the linear model. About 18.3% of the variation in grades is explained by registered credits, leaving about 81.7% unexplained.
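Squaring the correlation works here because the model has a single predictor. The same quantity can be obtained from the sums of squares, R² = 1 − SSE/SST, which makes its reading as a proportion of variation explicit. A minimal sketch on a hypothetical data frame `demo`:

```r
# Illustrative data only; column names follow the report.
demo <- data.frame(
  ENR_CR = c(10, 12, 13, 14, 15, 16, 17, 18),
  NUM_GR = c(88, 84, 80, 79, 74, 76, 70, 68)
)
fit <- lm(NUM_GR ~ ENR_CR, data = demo)

sse <- sum(residuals(fit)^2)                       # variation left unexplained
sst <- sum((demo$NUM_GR - mean(demo$NUM_GR))^2)    # total variation in grades

r2_from_ss  <- 1 - sse / sst
r2_from_cor <- cor(demo$ENR_CR, demo$NUM_GR)^2
r2_from_lm  <- summary(fit)$r.squared
```

All three values agree for a simple linear regression with an intercept.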
f.
Normality Plot of Grades (2.f.)
R-Code
> qqnorm(NUM_GR, main="Normality of Grades: Q-Q Plot")    #creates a qqplot of the grades with a title
> qqline(NUM_GR)                                          #adds the normal line to the plot
The plot shows us that the data is approximately normal with some outliers at each end of the range. The data between the first and third quartiles appears to be very normal.
3. a.
Margin of Error (3.a.)
R-Code
> n=nrow(full)                            #calculates the size of the class
> df=n-1                                  #calculates degrees of freedom
> P=(((100-95)/2)+95)/100                 #calculates 1 – alpha/2 (P)
> MOE=qt(P,df)*(sd(NUM_GR))/(sqrt(n))     #calculates margin of error through the t-distribution and the standard deviation and sample size
> MOE
[1] 2.442613
The margin of error is 2.443.
b.
95 % Confidence Interval for Population Mean (3.b.)
R-Code [3]
> CI_lower=mean(NUM_GR)-MOE    #calculates the lower/upper bounds of the confidence interval by subtracting/adding the margin of error to the mean
> CI_upper=mean(NUM_GR)+MOE
> CI_lower
[1] 74.76739
> CI_upper
[1] 79.65261
The confidence interval is (74.77, 79.65).
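The interval can be cross-checked with `t.test()`, which returns the same t-based confidence interval in its `conf.int` component. A minimal sketch, with a hypothetical vector `x` standing in for NUM_GR:

```r
x <- c(70, 74, 74, 81, 95, 95, 62, 88, 77, 83)   # illustrative data only
n <- length(x)

# Manual interval, following the steps in 3.a and 3.b.
moe <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
ci_manual <- c(mean(x) - moe, mean(x) + moe)

# Built-in interval at the same confidence level.
ci_builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)
```

Agreement between the two confirms that the margin of error uses the sample standard deviation and n − 1 degrees of freedom.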
c. We are 95% confident that the true population mean score lies in 74.77 < μ < 79.65,
meaning the population mean grade is likely to be a C.
4.
a.
H0: μ ≤ 77.21 (Null Hypothesis)
HA: μ > 77.21 (Alternative Hypothesis)
b.
i.
Hypothesis Test (4.b.i.)
R-Code [4]
> xbar=80.34                          #establishes sample mean
> mu0=77.21                           #establishes the hypothesized value
> sigma=11.19                         #establishes sample st. dev.
> n=50                                #establishes sample size
> TH0=(xbar-mu0)/(sigma/sqrt(n))      #calculates the test statistic with the hypothesis and sample values
> TH0
[1] 1.977877
> tdfalpha=qt(P,df)                   #calculates the talpha for the rejection region comparison
> tdfalpha
[1] 1.984217
The rejection rule for this hypothesis is TH0 > tdf, α. Because TH0 = 1.978 < tdf, α = 1.984, we fail to reject the null hypothesis at this critical value.
ii.
P-value (4.b.ii.)
R-Code [5]
> 1-pt(TH0,df)    #calculates the p-value by reverse look up of the t-table with the test statistic and degrees of freedom
[1] 0.02536118
The p-value is 0.0254, which is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as large as TH0.
iii.
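The comparison in 4.b.i gives TH0 = 1.978 < tdf, α = 1.984, which does not reject the null, while the p-value satisfies p-value = 0.0254 ≤ α = 0.10, which supports the alternative. The two results differ because they are not measured against the same significance level. The critical value 1.984 was computed with P = 0.975 and df = 99 carried over from 3.a, which corresponds to a one-sided α of 0.025, whereas the p-value is compared with α = 0.10. When both methods use the same α and the same degrees of freedom (n − 1 = 49 for this sample), they always lead to the same decision.
c.
At α = 0.10 the null hypothesis is rejected in favor of the alternative: the p-value remains below 0.10 with 49 degrees of freedom, and TH0 = 1.978 exceeds the corresponding one-sided critical value of about 1.30. At the stricter cutoff used in 4.b.i the null would narrowly not be rejected, so the conclusion depends on the significance level chosen.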
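The critical-value and p-value approaches give the same decision whenever both use the same α and the same degrees of freedom. The sketch below reruns the test in 4.b under two assumptions that should be checked against the assignment: a one-sided significance level of α = 0.10 (the value quoted in 4.b.iii) and df = n − 1 = 49 for the sample of 50.

```r
xbar  <- 80.34    # sample mean (from 4.b.i)
mu0   <- 77.21    # hypothesized mean
s     <- 11.19    # sample standard deviation
n     <- 50       # sample size
alpha <- 0.10     # assumed significance level

t_stat  <- (xbar - mu0) / (s / sqrt(n))                 # test statistic
t_crit  <- qt(1 - alpha, df = n - 1)                    # upper-tail critical value
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)   # upper-tail p-value

reject_by_crit <- t_stat > t_crit
reject_by_p    <- p_value < alpha
```

Recomputing the degrees of freedom from this sample's n, instead of reusing `df` and `P` from section 3, keeps the two decisions consistent.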
5.
Probability and Statistics for Engineers (MATH 2100) taught me a variety of computational methods to complete this project. I used a multitude of R commands, on subjects ranging from simple statistics and plotting charts to correlation and regression, confidence intervals, and test statistics.
To expand on this study, the data could be analyzed in several different ways. The data set could be analyzed for the impact of students’ majors on grades or for differences due to gender. It could also be expanded by collecting data on additional variables, such as the study habits or weekly study hours of each student. These variables could be compared to grades, credits, or any other variable of the data set.
References
[1] How to generate bin frequency table in R? (n.d.). Retrieved November 30, 2018, from https://stackoverflow.com/questions/27839432/how-to-generate-bin-frequency-table-in-r
[2] Correlation Test Between Two Variables in R. (n.d.). Retrieved November 30, 2018, from http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r
[3] Calculating Confidence Intervals. (n.d.). Retrieved November 30, 2018, from https://www.cyclismo.org/tutorial/R/confidence.html
[4] Two-Tailed Test of Population Mean with Known Variance. (n.d.). Retrieved November 30, 2018, from http://www.r-tutor.com/elementary-statistics/hypothesis-testing/two-tailed-test-population-mean-known-variance
[5] Calculating p Values. (n.d.). Retrieved November 30, 2018, from https://www.cyclismo.org/tutorial/R/pValues.html