
Project #3 Answers
STAT 873, Fall 2013

Complete the following problems below. Within each part, include your R program output with code inside of it and any additional information needed to explain your answer. Note that you will need to edit your output and code in order to make it look nice after you copy and paste it into your Word document.

Background

Wheat producers want to identify kernels that are in poor condition after being harvested. To facilitate this identification process, categorization systems have been developed to partition kernels into different categories. For this example, we will look at the categories of “healthy”, “sprout”, or “scab”. In summary,

i) Healthy is the preferred condition because these kernels have not been damaged
ii) Sprout is less preferred than healthy because these kernels have reduced weight and poorer flour quality
iii) Scab is less preferred than healthy because these kernels come from plants that have been infected by a disease and have undesirable qualities in their appearance

Ideally, these categorizations would be made for each kernel using an automated process.

Data

To test a new automated system, 276 wheat kernels were classified by human examination (assumed to be perfect) and through the automated system. The automated system uses information about the class of the wheat kernel (soft red winter or hard red winter) and measurements of density, hardness, size, weight, and moisture for the kernel. The data is stored in my wheat_all.csv file available on my website. Below is how I read in the data:

> wheat <- read.csv(file = "C:\\data\\wheat_all.csv")
> head(wheat, n = 3)
  class density hardness size weight moisture    type
1   hrw    1.35    60.33 2.30  24.65    12.02 Healthy
2   hrw    1.29    56.09 2.73  33.30    12.17 Healthy
3   hrw    1.23    43.99 2.51  31.76    11.88 Healthy

> tail(wheat, n = 3)
    class density hardness size weight moisture type
274   srw    0.85    34.07 1.41  12.09    11.93 Scab
275   srw    1.18    60.98 1.06   9.48    12.24 Scab
276   srw    1.03    -9.57 2.06  23.82    12.65 Scab
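As a quick check that the file was read in correctly (hypothetical code, not part of the original solution), str() should report 276 observations of the seven variables listed above:

> #Should report 276 obs. of 7 variables: class, density, hardness, size, weight, moisture, type
> str(wheat)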

The focus here is to develop methods that best differentiate between the kernel types using the physical characteristics of the kernel and the wheat class.


1) Perform an initial investigation into the data as follows.

a) (10 points) Examine the data using the appropriate graphical methods discussed earlier in the course. In your plots, determine if there may be ways to differentiate among kernel types. Also, examine observation #31 and compare it to the other observations.

> wheat <- read.csv(file = "C:\\chris\\wheat_all.csv")
> wheat[31,]
   class density hardness size weight moisture type
31   hrw    2.03   121.84 0.99   9.36    10.28 Scab

> table(wheat$type)

Healthy    Scab  Sprout
     96      84      96

> stars(x = wheat[order(wheat$type),-1], ncol = 20, key.loc = c(-2, 1),
    draw.segments = TRUE, label = NULL, cex = 0.75,
    main = "Wheat data ordered by type")

[Star plot: "Wheat data ordered by type", one glyph per kernel with segments for density, hardness, size, weight, moisture, and type]

> wheat2 <- data.frame(kernel = 1:nrow(wheat), wheat[,2:6],
    class.new = ifelse(test = wheat$class == "hrw", yes = 0, no = 1))
> head(wheat2)
  kernel density hardness size weight moisture class.new
1      1    1.35    60.33 2.30  24.65    12.02         0
2      2    1.29    56.09 2.73  33.30    12.17         0
3      3    1.23    43.99 2.51  31.76    11.88         0
4      4    1.34    53.82 2.27  32.71    12.11         0
5      5    1.26    44.39 2.35  26.07    12.06         0
6      6    1.30    48.12 2.49  33.30    12.19         0

> library(MASS)  #needed for parcoord()
> #Colors by condition:
> wheat.colors <- ifelse(test = wheat$type == "Healthy", yes = "black",
    no = ifelse(test = wheat$type == "Sprout", yes = "red", no = "green"))
> #Line type by condition:
> wheat.lty <- ifelse(test = wheat$type == "Healthy", yes = "solid",
    no = ifelse(test = wheat$type == "Sprout", yes = "longdash", no = "dotdash"))
> kernel31 <- ifelse(test = wheat2$kernel == 31, yes = 3, no = 1)
> parcoord(x = wheat2, col = wheat.colors, lty = wheat.lty, lwd = kernel31,
    main = "Parallel coordinate plot \n Kernel #31 is represented by a large line width")
> legend(locator(1), legend = c("Healthy", "Sprout", "Scab"),
    lty = c("solid", "longdash", "dotdash"), col = c("black", "red", "green"),
    cex = 0.8, bty = "n")

[Parallel coordinate plot: "Parallel coordinate plot - Kernel #31 is represented by a large line width"; axes: kernel, density, hardness, size, weight, moisture, class.new; legend: Healthy, Sprout, Scab]

Note that I created a new variable named class.new which is 0 for hard red winter wheat and 1 for soft red winter wheat. This needs to be done because class consists of characters.
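As a quick sanity check on this recoding (hypothetical code, not part of the original solution), class.new can be cross-tabulated against class; all hrw kernels should fall in the 0 column and all srw kernels in the 1 column:

> #Hypothetical check of the 0/1 recoding of class
> table(wheat$class, wheat2$class.new)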

All plots show some separation of scab from the other kernel types (although not complete separation) indicating we may have some success differentiating scab kernels from the others. There is not as much separation between the healthy and sprout kernels. For example, there is a lot of overlap of the lines in the parallel coordinate plot for these two kernel types. This indicates that we may have difficulty differentiating between them.

From examining a number of plots, one can see that observation #31 is quite different from the rest! It has the largest density value by far. It also has the largest hardness.
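A small hypothetical check (not part of the original solution) confirms this numerically; both calls below should return 31:

> #Row numbers of the kernels with the largest density and hardness
> which.max(wheat$density)
> which.max(wheat$hardness)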

Note that this data set is an example where the order in which observations are plotted in a parallel coordinate plot can make a difference in how you interpret it. Below is an example where the kernels are first sorted by kernel type (green is plotted last):

> #Sort by wheat type
> wheat.colors2 <- ifelse(test = wheat$type == "Healthy", yes = 1,
    no = ifelse(test = wheat$type == "Sprout", yes = 2, no = 3))
> wheat3 <- data.frame(wheat.colors2, wheat2)
> parcoord(x = wheat3[order(wheat.colors2),], col = wheat.colors2[order(wheat.colors2)],
    lty = wheat.lty[order(wheat.colors2)],
    main = "Parallel coordinate plot for wheat data - sort by Type")
> legend(locator(1), legend = c("Healthy", "Sprout", "Scab"),
    lty = c("solid", "longdash", "dotdash"), col = c("black", "red", "green"),
    cex = 0.8, bty = "n")

[Parallel coordinate plot: "Parallel coordinate plot for wheat data - sort by Type"; axes: wheat.colors2, kernel, density, hardness, size, weight, moisture, class.new; legend: Healthy, Sprout, Scab]

One could have missed where all of the green observations (especially for hardness) were located in the first parallel coordinate plot due to their being plotted underneath some of the red and black observations.

Using these plots, one can develop general ideas of what variables may be the best for classifying observations. For example, we see the density, size, and weight of the kernels help to somewhat differentiate between scab and the other two classes. Also, we see some variables, like hardness, may be of little help.
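For individual variables, a quick hypothetical supplement to these plots (not part of the original solution) is a set of boxplots by kernel type. For example, using density, one of the variables that appears useful:

> #Hypothetical: boxplots of density by kernel type; scab kernels should sit lower
> boxplot(density ~ type, data = wheat, main = "Density by kernel type")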

PCA could also be used to examine the data. I will save the PCA part until later in the project.

b) (1 point) This data comes from an actual consulting problem that I worked on in the past. I pointed out observation #31 to the subject-matter researcher, and he concluded that this observation must be a result of a measurement error. For this reason, we decided to remove it from the data set. For the remainder of this project, use an altered version of the data set that has this observation removed. Show how this observation is removed and show that the number of observations is now 275.

> wheat.no31 <- wheat[-31,]
> nrow(wheat.no31)
[1] 275


2) This portion of the project applies DA methods to differentiate between the kernel types.

a) (10 points) Fill in the table below using the appropriate DA methods:

                                             Proportion correct
DA method  Priors        Accuracy method     Healthy  Sprout  Scab  Overall
Linear     Proportional  Cross-validation
Quadratic  Proportional  Cross-validation

Also, provide the 3×3 classification tables. Which DA method performs the best? For what type of classifications do the methods perform the worst? Fully explain all of your answers.

                                             Proportion correct
DA method  Priors        Accuracy method     Healthy  Sprout  Scab    Overall
Linear     Proportional  Cross-validation    0.7292   0.5417  0.7470  0.6691
Quadratic  Proportional  Cross-validation    0.6875   0.5938  0.7831  0.6836

Below are my classification tables:

> #Helper function: classification table, within-class proportions, and overall accuracy
> summarize.class <- function(original, classify) {
    class.table <- table(original, classify)
    numb <- rowSums(class.table)
    prop <- round(class.table/numb, 4)  #each row divided by its row total
    overall <- round(sum(diag(class.table))/sum(class.table), 4)
    list(class.table = class.table, prop = prop, overall.correct = overall)
  }

> library(MASS)
> DA2 <- lda(formula = type ~ class + density + hardness + size + weight + moisture,
    data = wheat.no31, CV = TRUE)
> DA4 <- qda(formula = type ~ class + density + hardness + size + weight + moisture,
    data = wheat.no31, CV = TRUE)
> lda.accuracy <- summarize.class(original = wheat.no31$type, classify = DA2$class)
> qda.accuracy <- summarize.class(original = wheat.no31$type, classify = DA4$class)
> lda.accuracy
$class.table
         classify
original  Healthy Scab Sprout
  Healthy      70    5     21
  Scab         11   62     10
  Sprout       24   20     52

$prop
         classify
original  Healthy   Scab Sprout
  Healthy  0.7292 0.0521 0.2188
  Scab     0.1325 0.7470 0.1205
  Sprout   0.2500 0.2083 0.5417

$overall.correct
[1] 0.6691


> qda.accuracy
$class.table
         classify
original  Healthy Scab Sprout
  Healthy      66   10     20
  Scab          7   65     11
  Sprout       26   13     57

$prop
         classify
original  Healthy   Scab Sprout
  Healthy  0.6875 0.1042 0.2083
  Scab     0.0843 0.7831 0.1325
  Sprout   0.2708 0.1354 0.5938

$overall.correct
[1] 0.6836

Overall, we see that LDA and QDA have some ability to differentiate between the different kernel types. The most errors occur with the sprout and healthy classifications, both with sprout being misclassified as healthy and vice versa. The fewest errors occur with healthy kernels being misclassified as scab.
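To make these statements concrete (a small sketch using the summarize.class() output from above, not part of the original solution), the off-diagonal cells of the LDA classification table are exactly the error counts:

> err <- lda.accuracy$class.table
> diag(err) <- 0  #zero out the correct classifications, leaving only the errors
> err  #sprout-as-healthy (24) and healthy-as-sprout (21) are largest; healthy-as-scab (5) is smallest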

QDA leads to very similar results to LDA. Because QDA estimates a separate covariance matrix for each kernel type while LDA pools them, LDA is the more parsimonious model, so I would likely use LDA. However, if the extra amount of accuracy potentially provided by QDA is important to the subject-matter researcher, I would use QDA.

b) (8 points) The DA homework shows a scatter plot where there are two plotting points for each observation. The smaller point denotes the original population for the observation and the larger point denotes the classification. Construct a similar plot here, but now plot the first two PCs for it. Interpret the plot in the context of what the 3×3 classification table in part a) gives as the correct and incorrect classification rates. Use the LDA cross-validation classifications found in part a) for the plot.

> #class.new (0 = hrw, 1 = srw) must exist as column 8 of wheat.no31 for the formula
> #  and the column indexing below; this setup step is implied by the original code
> wheat.no31$class.new <- ifelse(test = wheat.no31$class == "hrw", yes = 0, no = 1)
> save <- princomp(formula = ~ density + hardness + size + weight + moisture + class.new,
    data = wheat.no31, cor = TRUE, scores = TRUE)
> summary(save, loadings = TRUE, cutoff = 0.0)
Importance of components:
                         Comp.1    Comp.2    Comp.3    Comp.4     Comp.5     Comp.6
Standard deviation     1.471919 1.3133732 0.9591963 0.8444315 0.53339093 0.43689493
Proportion of Variance 0.361091 0.2874915 0.1533429 0.1188441 0.04741765 0.03181286
Cumulative Proportion  0.361091 0.6485825 0.8019254 0.9207695 0.96818714 1.00000000

Loadings:
          Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
density   -0.287  0.308  0.618  0.653 -0.043  0.115
hardness   0.361  0.238  0.662 -0.522  0.186 -0.260
size      -0.441  0.459 -0.086 -0.419  0.237  0.597
weight    -0.559  0.325 -0.156 -0.135 -0.159 -0.717
moisture  -0.359 -0.494  0.352 -0.328 -0.604  0.175
class.new -0.390 -0.537  0.156  0.006  0.719 -0.134


> save$scale <- apply(X = wheat.no31[,c(2:6,8)], MARGIN = 2, FUN = sd)
> score.cor <- predict(save, newdata = wheat.no31[,c(2:6,8)])
> head(score.cor)
      Comp.1    Comp.2    Comp.3      Comp.4     Comp.5      Comp.6
1  0.4498091 0.9670625 1.6347757 -0.03223047 -0.6494602  0.38456134
2 -0.4987891 1.5106507 1.0292678 -0.78907466 -0.6696128  0.12537595
3 -0.1694543 1.0656393 0.4728126 -0.59576666 -0.7210953  0.03432551
4 -0.1722994 1.1679671 1.2920498 -0.08442183 -0.8937361 -0.32100839
5  0.2841531 0.7124044 0.7953254 -0.24934765 -0.7444673  0.39286627
6 -0.4135285 1.2352344 0.9291961 -0.38554676 -0.8487791 -0.08053882

> par(pty = "s")
> original.pch <- ifelse(test = wheat.no31$type == "Healthy", yes = 1,
    no = ifelse(test = wheat.no31$type == "Sprout", yes = 2, no = 5))
> original.color <- ifelse(test = wheat.no31$type == "Healthy", yes = "black",
    no = ifelse(test = wheat.no31$type == "Sprout", yes = "red", no = "green"))
> plot(x = score.cor[,1], y = score.cor[,2], pch = original.pch, col = original.color,
    cex = 0.75, xlab = "Principal component 1", ylab = "Principal component 2",
    main = "PC score plot \n Classified (large points) overlaid on the original (small points)")
> abline(h = 0, lty = 1, lwd = 2)
> abline(v = 0, lty = 1, lwd = 2)
> classify.pch <- ifelse(test = DA2$class == "Healthy", yes = 1,
    no = ifelse(test = DA2$class == "Sprout", yes = 2, no = 5))
> classify.color <- ifelse(test = DA2$class == "Healthy", yes = "black",
    no = ifelse(test = DA2$class == "Sprout", yes = "red", no = "green"))
> points(x = score.cor[,1], y = score.cor[,2], pch = classify.pch,
    col = classify.color, cex = 1.5)
> legend(locator(1), legend = c("Healthy", "Sprout", "Scab"), pch = c(1,2,5),
    col = c("black", "red", "green"), cex = 1, bty = "n")


[Scatter plot: "PC score plot - LDA classified (large points) overlaid on original (small points)"; x-axis: Principal component 1, y-axis: Principal component 2; legend: Healthy, Sprout, Scab]

Overall, the first two PCs account for 65% of the total variation in the data. While this value is not extremely large, it does provide a decent proportion of the information available in the data. Plots like this can also be done using a bubble plot in order to include the third PC. I decided to keep it to two PCs in the project to correspond to what was done in the homework.
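A minimal sketch of that bubble-plot idea, assuming the base R symbols() function (the radius scaling below is arbitrary and purely illustrative):

> #Hypothetical bubble plot: PC #3 scores shifted to be positive and mapped to circle size
> radius <- (score.cor[,3] - min(score.cor[,3]) + 0.1)/10
> symbols(x = score.cor[,1], y = score.cor[,2], circles = radius, inches = FALSE,
    fg = original.color, xlab = "Principal component 1",
    ylab = "Principal component 2", main = "PC score plot with PC #3 as bubble size")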

We can immediately see why LDA performs better when classifying the scab kernels than when classifying the healthy and sprout kernels. The scab kernels are more separated from healthy and sprout than those two kernel types are from each other. Thus, kernels with positive PC #1 scores and negative PC #2 scores tend to be correctly classified as scab. The loadings for the PCs provide some insight for why this is true. In 1) of this project, we saw that scab kernels tend to have a smaller density, size, and weight than the other kernel types. The coefficients on density, size, and weight are negative for PC #1 and positive for PC #2. This means that small density, size, and weight values lead to larger PC #1 scores (given the other loadings for this PC) and smaller PC #2 scores (given the other loadings for this PC) for scab kernels in comparison to the other kernel types.
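One hypothetical way to back this up numerically (not part of the original solution) is to average the first two PC scores by kernel type; the scab means should stand apart from the other two types:

> #Mean PC #1 and PC #2 scores by kernel type
> aggregate(as.data.frame(score.cor[,1:2]), by = list(type = wheat.no31$type), FUN = mean)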


3) This portion of the project applies NNC methods to differentiate between the kernel types.

a) (8 points) Determine an appropriate value for K using cross-validation. Set a seed number of 7771 before using NNC so that I can duplicate your results.

> library(class)
> class.new <- ifelse(test = wheat.no31$class == "hrw", yes = 0, no = 1)
> Z <- scale(cbind(wheat.no31[,2:6], class.new))
> head(Z)
    density  hardness      size     weight  moisture  class.new
1 1.2299602 1.2708682 0.1939024 -0.3602040 0.4071616 -0.9590205
2 0.7725714 1.1158760 1.0710624  0.7326285 0.4809390 -0.9590205
3 0.3151825 0.6735634 0.6222829  0.5380664 0.3383028 -0.9590205
4 1.1537287 1.0328967 0.1327052  0.6580885 0.4514281 -0.9590205
5 0.5438769 0.6881853 0.2958978 -0.1808026 0.4268356 -0.9590205
6 0.8488028 0.8245346 0.5814847  0.7326285 0.4907760 -0.9590205

> set.seed(7771)
> save.results.cv <- matrix(data = NA, nrow = 40, ncol = 5)
> for (K in 1:40) {
    NNC.cv <- knn.cv(train = Z, cl = wheat.no31$type, k = K, prob = TRUE)
    NNC.cv.accuracy <- summarize.class(original = wheat.no31$type, classify = NNC.cv)
    save.results.cv[K,] <- c(K, NNC.cv.accuracy$prop[1,1], NNC.cv.accuracy$prop[2,2],
      NNC.cv.accuracy$prop[3,3], NNC.cv.accuracy$overall.correct)
  }
> head(save.results.cv)
     [,1]   [,2]   [,3]   [,4]   [,5]
[1,]    1 0.5833 0.6386 0.5000 0.5709
[2,]    2 0.5625 0.5904 0.4896 0.5455
[3,]    3 0.5521 0.6386 0.4792 0.5527
[4,]    4 0.5312 0.6627 0.5000 0.5600
[5,]    5 0.5833 0.6627 0.5104 0.5818
[6,]    6 0.6562 0.6386 0.5521 0.6145

> plot(x = save.results.cv[,1], y = save.results.cv[,2], ylim = c(0, 1),
    main = "Cross-validation", panel.first = grid(), type = "o", col = "red",
    xlab = "K", ylab = "Accuracy")
> points(x = save.results.cv[,1], y = save.results.cv[,3], type = "o", col = "blue")
> points(x = save.results.cv[,1], y = save.results.cv[,4], type = "o", col = "green")
> points(x = save.results.cv[,1], y = save.results.cv[,5], type = "o", col = "black")
> legend(x = 5, y = 1, legend = c("Correct healthy", "Correct scab", "Correct sprout",
    "Correct overall"), col = c("red", "blue", "green", "black"), bty = "n",
    cex = 0.75, lty = c(1,1,1,1), pch = c(1,1,1,1))


[Line plot: "Cross-validation"; x-axis: K, y-axis: Accuracy; one line each for correct healthy, correct scab, correct sprout, and correct overall]

> max(save.results.cv[,5])
[1] 0.6255
> save.results.cv[save.results.cv[,5] == max(save.results.cv[,5]),]
     [,1]   [,2]   [,3]   [,4]   [,5]
[1,]   11 0.6562 0.7108 0.5208 0.6255
[2,]   13 0.6875 0.6747 0.5208 0.6255
[3,]   15 0.6979 0.6506 0.5312 0.6255
[4,]   19 0.6667 0.6747 0.5417 0.6255

> data.frame(K = 1:15, overall = save.results.cv[1:15,5])
    K overall
1   1  0.5709
2   2  0.5455
3   3  0.5527
4   4  0.5600
5   5  0.5818
6   6  0.6145
7   7  0.6073
8   8  0.6145
9   9  0.6073
10 10  0.6073
11 11  0.6255
12 12  0.6145
13 13  0.6255
14 14  0.6073
15 15  0.6255

The proportion of correct classifications is approximately the same once K ≥ 6. For this reason, I will choose K = 6.
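One hypothetical way to formalize "approximately the same" is to take the smallest K whose overall accuracy is within a small tolerance, say 0.02, of the maximum; applied to the results above, this rule also selects K = 6 (0.6145 versus a maximum of 0.6255):

> #Smallest K within 0.02 of the best cross-validated overall accuracy
> close.enough <- save.results.cv[,5] >= max(save.results.cv[,5]) - 0.02
> min(save.results.cv[close.enough, 1])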


b) (6 points) With the value of K chosen in a), perform NNC with cross-validation and provide the 3×3 classification table. For what type of classifications does the method perform the worst? Set a seed number of 6126 before using knn.cv() so that I can duplicate your results.

> set.seed(6126)
> NNC6 <- knn.cv(train = Z, cl = wheat.no31$type, k = 6, prob = TRUE)
> NNC6.accuracy <- summarize.class(original = wheat.no31$type, classify = NNC6)
> NNC6.accuracy
$class.table
         classify
original  Healthy Scab Sprout
  Healthy      61    7     28
  Scab          7   59     17
  Sprout       34    8     54

$prop
         classify
original  Healthy   Scab Sprout
  Healthy  0.6354 0.0729 0.2917
  Scab     0.0843 0.7108 0.2048
  Sprout   0.3542 0.0833 0.5625

$overall.correct
[1] 0.6327

Overall, we see that NNC has some ability to differentiate between the different kernel types. The most errors occur with the sprout and healthy classifications, both with sprout being misclassified as healthy and vice versa. The smallest proportion of classification errors occurs with healthy kernels being misclassified as scab.

c) (6 points) Construct a similar plot as done in part 2b with the NNC classifications obtained in b). Interpret the plot in the context of what the 3×3 classification table in b) gives as the correct and incorrect classification rates. Compare the plot for this problem to the one found in part 2b.

> plot(x = score.cor[,1], y = score.cor[,2], pch = original.pch, col = original.color,
    cex = 0.75, xlab = "Principal component 1", ylab = "Principal component 2",
    main = "PC score plot \n NNC classified (large points) overlaid on original (small points)")
> abline(h = 0, lty = 1, lwd = 2)
> abline(v = 0, lty = 1, lwd = 2)
> classify.pch <- ifelse(test = NNC6 == "Healthy", yes = 1,
    no = ifelse(test = NNC6 == "Sprout", yes = 2, no = 5))
> classify.color <- ifelse(test = NNC6 == "Healthy", yes = "black",
    no = ifelse(test = NNC6 == "Sprout", yes = "red", no = "green"))
> points(x = score.cor[,1], y = score.cor[,2], pch = classify.pch,
    col = classify.color, cex = 1.5)
> legend(locator(1), legend = c("Healthy", "Sprout", "Scab"), pch = c(1,2,5),
    col = c("black", "red", "green"), cex = 1, bty = "n")


[Scatter plot: "PC score plot - NNC classified (large points) overlaid on original (small points)"; x-axis: Principal component 1, y-axis: Principal component 2; legend: Healthy, Sprout, Scab; two sprout kernels in the upper left are circled in blue]

The overall pattern of scab classifications doing better than the healthy and sprout classifications is similar here to what we saw for LDA.

Overall, most classifications appear to be the same for NNC and LDA. We do see a few differences here when compared to the results from LDA. For example, the upper left portion of the plot (circled in blue) contains two sprout kernels, but the nearest other kernels are healthy. This leads to a healthy classification by NNC; however, these kernels were correctly classified by LDA. Of course, one needs to be careful with this judgment of "nearest" here because PCA is used, with only 65% of the total variation in the data being accounted for.
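To quantify how often the two methods agree (a hypothetical follow-up using DA2$class from part 2a and NNC6 from part 3b, not part of the original solution), the two sets of cross-validated classifications can be cross-tabulated:

> #Agreement between the LDA and NNC cross-validated classifications
> table(LDA = DA2$class, NNC = NNC6)
> mean(DA2$class == NNC6)  #proportion of kernels classified the same way by both methods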
