FINAL PROJECT

Probability and Statistics for Engineers

Jack Connolly

Math 2100 Section 4

Fall 2018

Submitted: 3 December 2018

1.

a.

Section 1 (1.a.)

R-Code

> fall=read.table("Project_Data_Fall.txt",header=T)   #reads the text document and acknowledges header
> attach(fall)   #attaches the data
> mean(section1$NUM_GR)   #finds the mean of the sample
[1] 77.04
> median(section1$NUM_GR)   #finds the median of the sample
[1] 80
> freqfall=table(section1$NUM_GR)   #creates a frequency table for each grade
> names(freqfall)[which(freqfall==max(freqfall))]   #outputs which values have the highest frequency (mode)
[1] "72" "82" "95"
> var(section1$NUM_GR)   #calculates sample variance
[1] 170.9567
> sd(section1$NUM_GR)   #calculates sample standard deviation
[1] 13.07504

Sample Statistic          Value
Mean (x̄)                  77.04
Median (x̃)                80
Mode (Mo)                 72, 82, 95
Variance (s²)             170.96
Standard Deviation (s)    13.08

Section 2 (1.a.)

R-Code

> fall=read.table("Project_Data_Fall.txt",header=T)   #reads the text document and acknowledges header
> attach(fall)   #attaches the data
> mean(section2$NUM_GR)   #finds the mean of the sample
[1] 79.36
> median(section2$NUM_GR)   #finds the median of the sample
[1] 83
> freqfall2=table(section2$NUM_GR)   #creates a frequency table for each grade
> names(freqfall2)[which(freqfall2==max(freqfall2))]   #outputs which values have the highest frequency (mode)
[1] "88"
> var(section2$NUM_GR)   #calculates sample variance
[1] 183.0733
> sd(section2$NUM_GR)   #calculates sample standard deviation
[1] 13.53046


Sample Statistic          Value
Mean (x̄)                  79.36
Median (x̃)                83
Mode (Mo)                 88
Variance (s²)             183.07
Standard Deviation (s)    13.53
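The mode lookup used in 1.a. can be wrapped in a small helper so that ties are returned together. This is only a minimal sketch: the function name sample_modes and the grades vector are hypothetical and are not taken from the project data.

```r
# Hypothetical helper: returns every value tied for the highest frequency.
sample_modes <- function(x) {
  freq <- table(x)                            # frequency of each distinct value
  as.numeric(names(freq)[freq == max(freq)])  # values whose count equals the maximum
}

# Made-up grades for illustration only (not the project data).
grades <- c(72, 72, 82, 82, 95, 95, 60, 88)

modes <- sample_modes(grades)   # all values that occur most often
s2    <- var(grades)            # sample variance (divides by n - 1)
s     <- sd(grades)             # sample standard deviation, the square root of var()
```

Returning numeric values rather than the character names from table() makes the result easier to reuse in later calculations.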

b.

Section 1 (1.b.)

R-Code [1]

> freqtable=hist(section1$NUM_GR, breaks=10, xlab="Grades", ylab="Frequency", main="Section 1 Grade Distribution")
> freqtable   #prints a frequency table with the histogram parameters set for 10 bins for the grades data set
$`breaks`   #bin start points
[1] 50 55 60 65 70 75 80 85 90 95
$counts   #frequency counts
[1] 1 3 2 1 5 1 4 3 5
$density
[1] 0.008 0.024 0.016 0.008 0.040 0.008 0.032 0.024 0.040
$mids
[1] 52.5 57.5 62.5 67.5 72.5 77.5 82.5 87.5 92.5
$xname
[1] "section1$NUM_GR"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"

#bin start points

#frequency counts

Bin (Grade Range)    Frequency
50 – 55              1
56 – 60              3
61 – 65              2
66 – 70              1
71 – 75              5
76 – 80              1
81 – 85              4
86 – 90              3
91 – 95              5
96 – 100             0

Note: The histogram shows only 9 bins because no grade falls above 95, so the 10th bin (96 – 100) is empty and R ends the breaks at 95. The distribution for Section 1 is irregular and appears almost bimodal, with peaks around 57 and 85. A more normal shape would be expected; the irregularity is likely due to the small sample size (25 students).
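To show the empty 10th bin explicitly, the bin edges can be fixed instead of letting hist() choose them. The following is a sketch under assumptions: the grades vector is made up, and the edges 50, 55, ..., 100 are chosen to match the table above.

```r
# Made-up grades for illustration only (not the project data).
grades <- c(52, 58, 58, 63, 71, 74, 83, 88, 91, 95)

# Fixed bin edges every 5 points from 50 to 100.
edges <- seq(50, 100, by = 5)

# right = TRUE and include.lowest = TRUE match the defaults of hist():
# the first bin is [50,55], the later bins are (55,60], (60,65], ...
bins <- cut(grades, breaks = edges, right = TRUE, include.lowest = TRUE)

# table() keeps empty factor levels, so the (95,100] bin is listed with count 0.
freq <- table(bins)

# hist() with the same edges gives the same counts (plot = FALSE skips drawing).
h <- hist(grades, breaks = edges, plot = FALSE)
```

With integer grades, the interval (95,100] corresponds to the 96 – 100 row of the frequency table.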

Section 2 (1.b.)

R-Code

> freqtable2=hist(section2$NUM_GR, breaks=10, xlab="Grades", ylab="Frequency", main="Section 2 Grade Distribution")
> freqtable2   #prints a frequency table with the histogram parameters set for 10 bins for the grades data set
$`breaks`   #bin start points
[1] 50 55 60 65 70 75 80 85 90 95 100
$counts   #frequency counts
[1] 1 1 3 2 4 1 2 5 4 2
$density
[1] 0.008 0.008 0.024 0.016 0.032 0.008 0.016 0.040 0.032 0.016
$mids
[1] 52.5 57.5 62.5 67.5 72.5 77.5 82.5 87.5 92.5 97.5
$xname
[1] "section2$NUM_GR"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"

Bin (Grade Range)    Frequency
50 – 55              1
56 – 60              1
61 – 65              3
66 – 70              2
71 – 75              4
76 – 80              1
81 – 85              2
86 – 90              5
91 – 95              4
96 – 100             2

The shape of this distribution differs from that of Section 1 and appears somewhat closer to normal. The spread is very similar, with both histograms spanning about 50 grade points.


c.

Boxplot (1.c.)

R-Code

> boxplot(section1$NUM_GR,section2$NUM_GR, names=c("Section 1","Section 2"),ylab="Grades",main="Grade Distribution")   #outputs two boxplots (one for each section), and labels the axes/title

The range of grades in each section is very similar, even though the two distributions have different shapes. However, Section 2 performs slightly better, with a slightly higher first quartile, median, and third quartile.
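The quartile comparison read from the boxplots can also be checked numerically. This is a sketch only: the vectors s1 and s2 are made-up grades, not the project data, and fivenum() is used because boxplot() builds its box from the same hinges.

```r
# Made-up grades for two sections, for illustration only.
s1 <- c(55, 60, 72, 72, 80, 82, 90, 95)
s2 <- c(58, 65, 75, 83, 85, 88, 88, 97)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum.
five <- rbind(section1 = fivenum(s1), section2 = fivenum(s2))
colnames(five) <- c("min", "lower", "median", "upper", "max")
five
```

Placing the two rows side by side makes it easy to see which section has the higher quartiles and median.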

d.

Proportion of Good Academic Standing (1.d.)

R-Code

> goodGr=subset(fall,NUM_GR>69.5)   #keeps the rows (students) with grades of 70 or higher
> pgoodGr=(nrow(goodGr)/nrow(fall))   #calculates the proportion of good grades by counting the good grades and dividing by the total number of students
> pgoodGr   #prints the proportion of students in good academic standing
[1] 0.74


When considering 70% as good academic standing, the proportion of students in good academic standing during the fall is 0.74.

e.

Percentage with Low Courseload and Good Academic Standing (1.e.)

R-Code

> lazy=100*nrow(subset(fall,ENR_CR<16 & NUM_GR>69.5))/nrow(goodGr)   #counts the students with good grades and fewer than 16 credits and divides by the total number of students with good grades to calculate the percentage
> lazy   #prints the percentage
[1] 70.27027

Among students in good academic standing (70%+), the percentage who took fewer than 16 credits is 70.27%.
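The proportions in 1.d. and 1.e. can also be computed with logical vectors, since mean() of a logical vector is the proportion of TRUE values. This is a minimal sketch with a made-up data frame; only the column names ENR_CR and NUM_GR come from the project.

```r
# Made-up data for illustration only (not the project data).
students <- data.frame(
  ENR_CR = c(12, 15, 16, 18, 14, 17),
  NUM_GR = c(85, 72, 90, 65, 68, 75)
)

good <- students$NUM_GR > 69.5   # TRUE for students in good academic standing
low  <- students$ENR_CR < 16     # TRUE for students with fewer than 16 credits

p_good <- mean(good)                          # proportion in good standing
pct_low_given_good <- 100 * mean(low[good])   # % with < 16 credits among good standing

# Same percentage using the subset()/nrow() approach from 1.e.
pct_check <- 100 * nrow(subset(students, ENR_CR < 16 & NUM_GR > 69.5)) /
  nrow(subset(students, NUM_GR > 69.5))
```

Indexing low by good restricts the calculation to students in good standing, which is what makes the result a conditional percentage.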

2. a.

Grade Statistics (2.a.)

R-Code

> full=read.table("Project_Data_Full.txt",header=T)   #reads the text document and acknowledges header
> attach(full)   #attaches the data
> mean(NUM_GR)   #finds the mean of the population
[1] 77.21
> median(NUM_GR)   #finds the median of the population
[1] 77.5
> freqfull=table(NUM_GR)   #creates a frequency table for each grade
> names(freqfull)[which(freqfull==max(freqfull))]   #outputs which values have the highest frequency (mode)
[1] "74" "95"
> variance=sum((NUM_GR-mean(NUM_GR))^2)/(nrow(full))   #calculates population variance
> variance
[1] 150.0259
> sigma=sqrt(variance)   #calculates population standard deviation
> sigma
[1] 12.24851

Population Parameter      Value
Mean (μ)                  77.21
Median                    77.5
Mode (Mo)                 74, 95
Variance (σ²)             150.03
Standard Deviation (σ)    12.25

Students’ Registered Credits Statistics (2.a.)

R-Code

> full=read.table("Project_Data_Full.txt",header=T)   #reads the text document and acknowledges header
> attach(full)   #attaches the data
> mean(ENR_CR)   #finds the mean of the population
[1] 13.68
> median(ENR_CR)   #finds the median of the population
[1] 14
> freqfull2=table(ENR_CR)   #creates a frequency table for each credit value
> names(freqfull2)[which(freqfull2==max(freqfull2))]   #outputs which values have the highest frequency (mode)
[1] "16"
> variance=sum((ENR_CR-mean(ENR_CR))^2)/(nrow(full))   #calculates population variance
> variance
[1] 9.0176
> sigma=sqrt(variance)   #calculates population standard deviation
> sigma
[1] 3.002932

Population Parameter      Value
Mean (μ)                  13.68
Median                    14
Mode (Mo)                 16
Variance (σ²)             9.02
Standard Deviation (σ)    3.00
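The population variance in 2.a. is computed by hand because var() in R divides by n - 1 rather than n. The sketch below, using a made-up credits vector rather than the project data, shows how the two are related.

```r
# Made-up credit loads for illustration only (not the project data).
credits <- c(12, 14, 16, 16, 10, 18)
n <- length(credits)

pop_var    <- sum((credits - mean(credits))^2) / n   # population variance, divides by n
sample_var <- var(credits)                           # sample variance, divides by n - 1

# The two differ only by the factor (n - 1) / n.
pop_var_from_var <- sample_var * (n - 1) / n

pop_sd <- sqrt(pop_var)                              # population standard deviation
```

For a large n the two values are close, but the population form is always slightly smaller.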


b.

Registered Credits' Impact on Grade (2.b.)

R-Code

> plot(ENR_CR,NUM_GR, xlab="Registered Course Credits",ylab="Grade",main="Courseload's Impact on Grade")   #creates a plot with credits on the x, and grades on the y. Adds data labels for the axes/title

c.

Registered Credits' Impact on Grade (2.c.)

R-Code [2]

> plot(ENR_CR,NUM_GR, xlab="Registered Course Credits",ylab="Grade",main="Courseload's Impact on Grade")   #creates a plot with credits on the x, and grades on the y. Adds data labels for the axes/title
> lm(NUM_GR~ENR_CR)   #creates a linear model between credits and grade
Call:
lm(formula = NUM_GR ~ ENR_CR)
Coefficients:
(Intercept)       ENR_CR
    101.077       -1.745
> abline(lm(NUM_GR~ENR_CR))   #adds the line of the model to the scatter-plot

i. Approximating Grade with 10 Credits (2.c.i.)

R-Code

> exp10CR=101.077-1.745*10   #creates a variable to store the expected value of the grade with 10 credits using the linear model from 2.c.
> exp10CR   #prints the expected grade
[1] 83.627

The expected score for a student registered for 10 course credits is 83.6, a B.
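Instead of retyping the rounded coefficients, the fitted model can be stored and passed to predict(). This is a sketch under assumptions: the data frame d is made up, and only the column names ENR_CR and NUM_GR come from the project.

```r
# Made-up data for illustration only (not the project data).
d <- data.frame(
  ENR_CR = c(10, 12, 14, 16, 18),
  NUM_GR = c(88, 80, 79, 70, 71)
)

fit <- lm(NUM_GR ~ ENR_CR, data = d)   # grade as a linear function of credits

# predict() applies the unrounded fitted coefficients to new credit values.
pred <- predict(fit, newdata = data.frame(ENR_CR = 10))

# Same calculation by hand from the stored coefficients.
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["ENR_CR"]] * 10
```

Using the stored model avoids small rounding differences that come from copying the printed coefficients.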

ii. Approximating Credits for an A (2.c.ii.)

R-Code

> lm(ENR_CR~NUM_GR)   #creates a linear model relating credits to grade (inverse of above)
Call:
lm(formula = ENR_CR ~ NUM_GR)
Coefficients:
(Intercept)       NUM_GR
    21.7768      -0.1049
> expAGr=21.7768-0.1049*89.5   #creates the expected course load given a student wants to get an A
> expAGr   #prints the expected course load
[1] 12.38825

The expected courseload for a student who wants to get an A (90%+) is 12.39 credits.

d.

Calculating Correlation Coefficient R (2.d.)

R-Code

> cor(full$ENR_CR, full$NUM_GR)   #calculates the correlation coefficient (R) between credits and grade
[1] -0.4277376

The correlation coefficient R = -0.4277 measures the strength and direction of the linear relationship between registered credits and grade. Its magnitude of about 0.43 represents a weak-to-moderate correlation. The negative sign indicates that the regression line slopes downward, which agrees with the plot in 2.c.

e.

Calculating Coefficient of Determination R² (2.e.)

R-Code

> rsquared=(cor(full$ENR_CR, full$NUM_GR))^2   #calculates the correlation coefficient (R) and squares it to calculate the coefficient of determination
> rsquared
[1] 0.1829594

The coefficient of determination R² = 0.183 is the proportion of the total variation in grades that is explained by the linear model. About 18.3% of the variation in grade is accounted for by registered credits, which leaves about 81.7% unexplained.
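For a simple linear regression, squaring the correlation gives the same value as the "explained variation" definition of R². The sketch below checks this on made-up data; the data frame d and its columns x and y are hypothetical.

```r
# Made-up data for illustration only (not the project data).
d <- data.frame(x = c(10, 12, 14, 16, 18),
                y = c(88, 80, 79, 70, 71))

fit <- lm(y ~ x, data = d)

sse <- sum(residuals(fit)^2)       # variation left unexplained by the model
sst <- sum((d$y - mean(d$y))^2)    # total variation in y

r2_from_sums <- 1 - sse / sst             # proportion of variation explained
r2_from_cor  <- cor(d$x, d$y)^2           # squared correlation coefficient
r2_from_lm   <- summary(fit)$r.squared    # value reported by summary()
```

All three values agree for a model with one predictor, which is why squaring R in 2.e. is a valid shortcut.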

f.

Normality Plot of Grades (2.f.)

R-Code

> qqnorm(NUM_GR, main="Normality of Grades: Q-Q Plot")   #creates a qqplot of the grades with a title
> qqline(NUM_GR)   #adds the normal line to the plot


The plot shows that the data are approximately normal, with some departures from the normal line at each end of the range. The data between the first and third quartiles follow the normal line closely.

3. a.

Margin of Error (3.a.)

R-Code

> n=nrow(full)   #calculates the size of the class
> df=n-1   #calculates degrees of freedom
> P=(((100-95)/2)+95)/100   #calculates 1 - alpha/2 (P = 0.975)
> MOE=qt(P,df)*(sd(NUM_GR))/(sqrt(n))   #calculates margin of error through the t-distribution, the standard deviation, and the sample size
> MOE
[1] 2.442613

The margin of error is 2.443.


b.

95 % Confidence Interval for Population Mean (3.b.)

R-Code [3]

> CI_lower=mean(NUM_GR)-MOE   #calculates the lower bound of the confidence interval by subtracting the margin of error from the mean
> CI_upper=mean(NUM_GR)+MOE   #calculates the upper bound of the confidence interval by adding the margin of error to the mean
> CI_lower
[1] 74.76739
> CI_upper
[1] 79.65261

The confidence interval is (74.77, 79.65).
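As a cross-check, the built-in t.test() function reports the same kind of interval as the manual calculation. This is a sketch on a made-up grades vector, not the project data.

```r
# Made-up grades for illustration only (not the project data).
grades <- c(77, 85, 62, 91, 74, 80, 69, 88, 95, 71)
n <- length(grades)

# Manual interval: mean +/- t(0.975, n - 1) * s / sqrt(n)
moe <- qt(0.975, df = n - 1) * sd(grades) / sqrt(n)
manual_ci <- c(mean(grades) - moe, mean(grades) + moe)

# t.test() reports the same two-sided 95% interval in its conf.int element.
builtin_ci <- as.numeric(t.test(grades, conf.level = 0.95)$conf.int)
```

Agreement between the two results is a quick way to catch a mistake in the quantile or the degrees of freedom.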

c. We are 95% confident that the true population mean score lies in the interval 74.77 < μ < 79.65, meaning the population mean grade is likely to be a C.

4.

a.

H0: μ ≤ 77.21 (Null Hypothesis)

HA: μ > 77.21 (Alternative Hypothesis)

b.

i.

Hypothesis Test (4.b.i.)

R-Code [4]

> xbar=80.34   #establishes sample mean
> mu0=77.21   #establishes the hypothesized value
> sigma=11.19   #establishes sample st. dev.
> n=50   #establishes sample size
> TH0=(xbar-mu0)/(sigma/sqrt(n))   #calculates the test statistic with the hypothesis and sample values
> TH0
[1] 1.977877
> tdfalpha=qt(P,df)   #calculates the talpha for the rejection region comparison (P and df carry over from 3.a.)
> tdfalpha
[1] 1.984217

The rejection rule for this hypothesis is TH0 > tdf, α. Because TH0 = 1.978 < tdf, α = 1.984, we fail to reject the null hypothesis at this critical value.


ii.

P-value (4.b.ii.)

R-Code [5]

> 1-pt(TH0,df)   #calculates the p-value (upper-tail area) from the t-distribution with the test statistic and degrees of freedom
[1] 0.02536118

The p-value is 0.0254, which is the probability of obtaining a test statistic at least as large as TH0 = 1.978 when the null hypothesis is true.

iii.

Our test statistic shows TH0 = 1.978 < tdf, α = 1.984, which does not allow us to reject the null at that critical value. The p-value is 0.0254, which makes the following statement true: p-value = 0.0254 ≤ α = 0.10. This statement supports the alternative. The two results do not contradict each other, because they use different significance levels: the critical value 1.984 came from qt(P,df) with P = 0.975, which leaves an upper-tail area of 0.025, and the p-value of 0.0254 is just above 0.025 but well below 0.10. When both methods use the same α, they always give the same decision.

c.

At the α = 0.10 significance level, we reject the null hypothesis in favor of the alternative, because the p-value of 0.0254 is less than α. The result is borderline at stricter levels (the null would not be rejected with an upper-tail area of 0.025), so collecting more data would give a firmer conclusion.
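The critical-value method and the p-value method give the same decision whenever they use the same α and the same degrees of freedom. The sketch below reruns the test in 4.b. under two assumptions that are labeled in the code: α = 0.10 (the level quoted in 4.b.iii.) and df = n - 1 = 49 for the sample of 50.

```r
# Summary values from 4.b.i.
xbar <- 80.34   # sample mean
mu0  <- 77.21   # hypothesized mean
s    <- 11.19   # sample standard deviation
n    <- 50      # sample size

alpha <- 0.10   # ASSUMPTION: significance level quoted in 4.b.iii.
df    <- n - 1  # ASSUMPTION: degrees of freedom from the sample of 50

t_stat  <- (xbar - mu0) / (s / sqrt(n))          # test statistic
t_crit  <- qt(1 - alpha, df)                     # upper-tail critical value
p_value <- pt(t_stat, df, lower.tail = FALSE)    # upper-tail p-value

reject_by_critical <- t_stat > t_crit    # rejection-region decision
reject_by_p        <- p_value < alpha    # p-value decision
```

Because t_crit and p_value come from the same t-distribution and the same α, the two logical results cannot disagree.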

5.

Probability and Statistics for Engineers (MATH 2100) taught me a variety of computational methods that I used to complete this project. I used many R commands, covering topics that ranged from simple statistics and plotting charts to correlation and regression, confidence intervals, and test statistics.

To expand on this study, the data could be analyzed in several different ways. The data set could be analyzed for the impact of students' majors on grades, or for differences due to gender. The study could also be expanded by collecting data on additional variables, such as students' study habits or weekly study hours. These variables could then be compared to grades, credits, or any other variable in the data set.


References

[1] How to generate bin frequency table in R? (n.d.). Retrieved November 30, 2018, from https://stackoverflow.com/questions/27839432/how-to-generate-bin-frequency-table-in-r

[2] Correlation Test Between Two Variables in R. (n.d.). Retrieved November 30, 2018, from http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r

[3] Calculating Confidence Intervals. (n.d.). Retrieved November 30, 2018, from https://www.cyclismo.org/tutorial/R/confidence.html

[4] Two-Tailed Test of Population Mean with Known Variance. (n.d.). Retrieved November 30, 2018, from http://www.r-tutor.com/elementary-statistics/hypothesis-testing/two-tailed-test-population-mean-known-variance

[5] Calculating p Values. (n.d.). Retrieved November 30, 2018, from https://www.cyclismo.org/tutorial/R/pValues.html
