
Running Head: A COMPARISON OF STATISTICAL LEARNING METHODS IN DIFFERENT SAMPLE SIZES

A Comparison of Statistical Learning Methods in Different Sample Sizes

Katlynn Kennedy

Saint Mary’s College


A Comparison of Statistical Learning Methods in Different Sample Sizes

In this modern age of science and technology, researchers can examine and collect

massive, diverse sets of data that have previously been impossible to examine. This is largely

due to advancements in computational sciences and digital storage capacities, in conjunction

with the continual development of statistical analyses (Caruana, Karampatziakis & Yessenalina,

2008; Iniesta, Stahl & McGuffin, 2016). Yet even with these large datasets being increasingly

available, many researchers in the social sciences are ill-equipped with the training to use

methods specifically designed to analyze them (Sinharay, 2016). Providing researchers with

knowledge on how different statistical learning methods can be applied to these datasets can

facilitate the improvement of data analytics (Iniesta et al., 2016; Sinharay, 2016; Kleinberg,

Ludwig, Mullainathan & Obermeyer, 2015). Such methods can be used specifically for research

designed to study prediction problems where previous methods of variable selection may be

inappropriate or outdated (Sinharay, 2016; Iniesta et al., 2016; Kleinberg et al., 2015). Such

methods may also be applied to large datasets containing a wide variety of sample sizes.

Consider this example used by Kleinberg et al. (2015) when thinking about prediction

versus causal problems:

A policy-maker facing a drought is deciding whether to invest in a rain dance to increase rain. Another is seeing clouds and must decide whether to take an umbrella to work to avoid getting wet. Both characters require something different and have different questions. One is a causal problem asking “Do rain dances cause rain?” The other is a prediction problem: “Is the chance of rain high enough to bring an umbrella?”

Researchers often look at questions like the rain dance, a causal question; but prediction

problems are becoming increasingly important, common, and interesting (Kleinberg et al., 2015).

Researchers are now able to ask and assess such prediction questions using large and substantial


datasets to answer them. For example, one could use a large educational dataset to understand

which high school students have the highest risk of dropping out. This prediction problem could

be instrumental in providing intervention plans earlier and more effectively for at-risk students

(Burrus & Roberts, 2012; Sinharay, 2016; Iniesta et al., 2016).

Once a researcher gets into these larger, more complex datasets, they are often left to

their own devices to decide what methods may be best to use for their analyses (Sinharay, 2016).

In doing so, many continue to use methods learned from their undergraduate or graduate training

(e.g. multiple linear regression). However, these traditional methods are not designed to handle

the complexity of data that comes with large datasets that result from scenarios such as

longitudinal studies (Iniesta et al., 2016; Caruana et al., 2008). One reason is the high-dimensional data problem (Iniesta et al., 2016; Caruana et al., 2008). The high-dimensional data problem arises when a data set has a number of variables (p) close to or greater than the number of observations (N) (Iniesta et al., 2016). This problem violates the assumptions/requirements of

many traditional methods, making the results of analyses done with these methods high in error.

It is also simply not realistic for researchers to use a traditional method of comparing thousands

of single association tests and sorting by significance to determine which predictors are

important in the dataset. In addition to the high-dimensional data problem, traditional methods

are also at a high risk of overfitting a model, meaning they may include variables that have no real significance to the outcome variable (Iniesta et al., 2016). For example, within a given sample an incidental characteristic of a parent may be found to be a significant predictor of high school dropout; however, this result may not generalize to the population. In addition, traditional

prediction methods depend on assumptions such as linearity and homogeneity of variance

(Sinharay, 2016). These assumptions are often violated in data sets that are larger in size. Thus,


the use of traditional methods has been outgrown by our data, and researchers need to begin

using more advanced methodologies that are well-equipped to answer prediction problems more

accurately.

The increase in computational power has raised the question of whether more

computationally-intensive prediction methods are more practical and could provide more

accurate predictions than traditional methods in important problems (Sinharay, 2016). These

types of computation-intensive prediction methods are called statistical learning methods

(Sinharay, 2016; Iniesta et al., 2016). There are five specific statistical learning methods that this

study intends to better understand: 1. Logistic Least Absolute Shrinkage and Selection Operator

(LASSO), 2. Classification Trees, 3. Random Forest, 4. Support Vector Machines, and 5. K-

Nearest Neighbors.

LASSO

The Least Absolute Shrinkage and Selection Operator (LASSO) method penalizes the sum of the absolute values of the regression coefficients. By doing this it shrinks the regression coefficients of variables that do not contribute to the model to exactly zero, thereby removing them from the model. This makes LASSO good at selecting important variables for the model. It also

makes for a more interpretable model. The LASSO requires a meta-parameter lambda to operate.

The choice of this tuning parameter greatly influences the results of the estimation process and is

usually calibrated using a cross-validation procedure.
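
To make this concrete, the sketch below shows how a logistic LASSO with a cross-validated lambda might be fit using the glmnet package cited in the Method section; the objects train_x, train_y, and test_x are hypothetical placeholders for a predictor matrix and dropout indicator, not variables defined in this study.

```r
# Minimal sketch: logistic LASSO with lambda chosen by cross-validation (glmnet).
# 'train_x' (numeric predictor matrix), 'train_y' (0/1 dropout indicator), and
# 'test_x' are hypothetical placeholders for the data described in the Method.
library(glmnet)

cv_fit <- cv.glmnet(x = train_x, y = train_y,
                    family = "binomial",  # logistic LASSO
                    alpha = 1,            # alpha = 1 gives the LASSO penalty
                    nfolds = 10)          # 10-fold cross-validation to calibrate lambda

# Coefficients at the lambda that minimizes cross-validated deviance;
# coefficients shrunk exactly to zero are effectively removed from the model.
coef(cv_fit, s = "lambda.min")

# Predicted dropout probabilities for new data.
p_hat <- predict(cv_fit, newx = test_x, s = "lambda.min", type = "response")
```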

Single Classification Trees

Classification trees are one of the two types of prediction trees, the other being regression trees.

Their goal is to predict an outcome variable, Y, from a variety of input variables, X1, X2, …, Xp.

Classification trees use recursive partitioning to determine predictors that influence the


categorization of an individual into a class. A classification tree might be used in a situation

where there are many variables that interact in a nonlinear and complicated way. A classification

tree begins with a single node and then selects the binary split that is most optimal. In other words, it selects a predictor that is important in determining the class of the individual based on their response at the previous node (see Figure 1). A new node is then created, and this process repeats for each node until a stopping criterion is reached.

A classification tree produces an individual model for each node in addition to a model for the

entire tree.

Figure 1: The classification tree for predicting the risk of recurrent falling in community-dwelling older persons at 3-year follow-up.
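
As an illustration, the following minimal sketch grows a single classification tree with the rpart package cited in the Method section; train_data and test_data are hypothetical placeholders for a training sample and a held-out test sample with a factor outcome named dropout.

```r
# Minimal sketch: a single classification tree via recursive partitioning (rpart).
# 'train_data' and 'test_data' are hypothetical placeholder data frames with a
# factor outcome 'dropout' and predictor columns from the preprocessed data.
library(rpart)

tree_fit <- rpart(dropout ~ .,                        # predict dropout from all other columns
                  data = train_data,
                  method = "class",                   # classification (not regression) tree
                  control = rpart.control(cp = 0.01)) # default complexity-based stopping rule

# Predicted class probabilities for the held-out sample (one column per class).
p_hat <- predict(tree_fit, newdata = test_data, type = "prob")
```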


Random Forests

A potential issue with classification trees is that they are highly sample dependent (Breiman, 2001). The algorithm used to create the trees is a greedy algorithm that selects splits that

maximize the amount of variance accounted for in the outcome variable. This makes single trees

highly susceptible to overfitting and causes potential for single trees to not generalize to the

population.

The random forest method accounts for this by growing many classification trees and

each tree gives a classification and ‘votes’ for the class it would assign to an individual. The forest then chooses the classification with the most votes. One characteristic of random forests that makes them very good candidates for analyzing large data sets is that they handle non-linear relationships within the data set well.
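
The sketch below illustrates this voting scheme using the randomForest package cited in the Method section; as before, train_data and test_data are hypothetical placeholders rather than objects from this study.

```r
# Minimal sketch: a random forest of classification trees (randomForest; Liaw & Wiener, 2002).
# 'dropout' must be a factor so that classification (voting) trees, not regression trees, are grown.
library(randomForest)

rf_fit <- randomForest(dropout ~ .,
                       data = train_data,
                       ntree = 500,        # number of trees whose votes are aggregated
                       importance = TRUE)  # track variable importance measures

# Each tree votes on a class; the forest reports vote fractions and the majority class.
votes  <- predict(rf_fit, newdata = test_data, type = "prob")
ranked <- importance(rf_fit)  # importance of each predictor (e.g., mean decrease in accuracy)
```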

Support Vector Machines

Support vector machine (SVM) is a data mining algorithm primarily used for

classification problems. The underlying mechanism behind SVM (Boser, Guyon, &

Vapnik, 1992) is an attempt to find a line (in two-dimensional space), plane (in three-dimensional

space), or hyper-plane (in n-dimensional space) that linearly separates two distinct classes. The

algorithm works by finding “support vectors”, i.e., points in the dataset, that form the previously

discussed separator (i.e., hyperplane). The algorithm intelligently chooses support vectors that

define a boundary between the two classes to create a wide “margin”. The margin is defined as

the space on either side of the boundary but within the points defined by the support vectors. The

goal is for this margin to be as wide as possible; the widest margin is believed to be the most generalizable to a new sample. One consideration for SVMs, however, is their limited ability to separate classes when many irrelevant predictors are present (Friedman, Hastie, & Tibshirani, 2001). Given the large


number of predictors in this sample, it is likely that not all the potential predictors will contribute

positively to the model. Therefore, it is likely that SVM will not perform well in this study.

Figure 2: Example of a Two-Dimensional Support Vector Machine Line and Margin
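
A minimal sketch of fitting an SVM classifier with the svm() function from the e1071 package cited in the Method section is shown below; the linear kernel and cost setting are illustrative choices, and train_data and test_data remain hypothetical placeholders.

```r
# Minimal sketch: a support vector machine classifier (svm() from e1071).
# 'train_data' and 'test_data' are hypothetical placeholders for the training and test samples.
library(e1071)

svm_fit <- svm(dropout ~ .,
               data = train_data,
               kernel = "linear",     # a linear separating hyperplane, as described above
               cost = 1,              # penalty for points that fall inside the margin
               probability = TRUE)    # request class probabilities for later ROC analysis

pred  <- predict(svm_fit, newdata = test_data, probability = TRUE)
p_hat <- attr(pred, "probabilities")  # matrix of per-class probabilities
```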

K-Nearest Neighbor

K-Nearest Neighbor (KNN) classifies a point based on a similarity measure to other points in the data set. Each new point is classified by a majority vote of its neighbors and is assigned to the class most common amongst its K nearest neighbors, as measured by a distance metric. The distance metric most commonly used is the Euclidean distance, which measures the

straight-line distance between points.


Figure 3: Example of a K-Nearest Neighbor Classification
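
The sketch below shows a KNN classification using the knn() function from the class package cited in the Method section; train_x, test_x, train_y, and the choice of k = 5 are hypothetical placeholders rather than settings used in this study.

```r
# Minimal sketch: k-nearest-neighbor classification (knn() from the class package).
# Predictors should be numeric (and typically scaled), since knn() uses Euclidean distance.
library(class)

knn_pred <- knn(train = train_x,  # numeric matrix of training predictors
                test  = test_x,   # numeric matrix of test predictors
                cl    = train_y,  # class labels (dropout / no dropout) for the training rows
                k     = 5)        # each test point takes the majority vote of its 5 nearest neighbors
```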

These methods are quite complex and can be difficult for even an experienced

investigator to choose and apply (Sinharay, 2016). Research has provided empirical evidence

that these methods work better for prediction questions than traditional methods in the case of large

data sets (Caruana, Karampatziakis & Yessenalina, 2008). Although this research can be used to

guide researchers to learn and utilize these advanced methods, it still leaves the question as to

how these methods behave in smaller and lower-dimensional data sets.

Therefore, the purpose of the current study is to examine how five of these different

methods behave in relation to a variety of sample sizes, spanning from considerably small to

moderately large.

In conjunction with previous research, it is hypothesized that the advanced learning

statistical methods will perform better than the traditional model of multiple linear regression in

producing models of prediction that have lower errors associated with them (Caruana et al.,

2008; Kleinberg, et al., 2015; Sinharay, 2016; Iniesta et al., 2016). It is also hypothesized that

different models will be produced from each learning method solely due to differences in sample


sizes. In other words, models created by the same method may be different for each of the

sample sizes tested on the method. Finally, it is hypothesized that the models and errors

produced will differ between learning methods; each learning method may produce a different

prediction model. The results of this last hypothesis may be due to how each learning method

acts in relation to the size of the sample being assessed.

Along with conducting these comparisons, one major purpose of this study is to outline

how each method is used in a manner that is easily interpreted by researchers who may not be

familiar with them. Rather than simply go through the results, this paper should allow social

science researchers to grasp the basic principles of these methods without being bogged down by

the computational complexities. It should also encourage researchers to apply these methods to

their own research; it will provide researchers with the conceptual ideas, pros, and cons of these

methods to allow them to know when each method is appropriate.

Method

The Education Longitudinal Study of 2002 (ELS:2002), collected by the National Center for Education Statistics, contains the data sample that was utilized in the analyses. This

longitudinal study followed over 15,000 students from around 750 schools through their

secondary and post-secondary school years. The sample also includes students who transferred to

schools being surveyed, dropped out, and completed school early for both secondary and post-

secondary years. This data set is created from a nationally representative group of 10th graders in

2002 and 12th graders in 2004. There were four data collections over the span of 10 years, with

surveys given to students, parents, school administrators, teachers, and facilities. These data collections gathered thousands of variables about a student’s educational success, home life,

environmental influences and more. There are over 74 variables within this data set that could


be utilized as the ‘outcome’ variable for our analyses, but the one selected is the dichotomous

variable of whether a student will drop out of high school.

Once the data was selected, the process of data preprocessing began. Data was cleaned by

removing impossible data (e.g. future variables) and restricted data. Variables with over 25%

missingness were removed from the dataset, leaving approximately 1,200 variables to use in the

prediction models. This concluded the preprocessing for the data and allowed for the following

pseudo-simulation study.

Cases were randomly sampled from the remaining observations (a student with

observations on all 1,200 variables represents a single case) in sample sizes of 200, 500, 1000,

and 2000. The previously described statistical learning models were fit to these samples, with the

binary dropout classifier as the outcome variable. This method is called training a model with a

training set of data. The open source statistical software R (R Core Team, 2013) was utilized to run

analyses, along with a variety of packages (Meyer & Wien, 2015; Liaw & Wiener, 2002;

Therneau, Atkinson, & Ripley, 2010; Friedman, Hastie, & Tibshirani, 2010; Ripley, Venables, &

Ripley, 2015) within the software that were each programmed by statisticians who developed the

specified methods. The five statistical learning methods being tested are: 1. Logistic Least

Absolute Shrinkage and Selection Operator (LASSO), 2. Classification and Regression Trees, 3.

Random Forest, 4. Support Vector Machines, and 5. K-Nearest Neighbor. The samples drawn from

the training set are the data sets that were used to fit and train the model.

Once a model was trained, it needed to be tested to estimate how well that model

performs on new data. This was done by taking the test set of data (i.e., the remaining data from the dataset) and predicting the class of each remaining observation using the trained model.
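
A minimal sketch of this train/test scheme is given below, assuming a preprocessed data frame named els with a factor outcome dropout; the object names and the use of random forest as the example method are illustrative assumptions, not the exact code used in this study.

```r
# Minimal sketch of the train/test scheme: sample n cases for training, hold out the rest.
set.seed(1)                                     # for a reproducible random sample
n_train   <- 2000                               # one of the studied sample sizes: 200, 500, 1000, 2000
train_idx <- sample(seq_len(nrow(els)), n_train)

train_data <- els[train_idx, ]                  # cases used to fit ("train") the model
test_data  <- els[-train_idx, ]                 # remaining cases used to evaluate the model

# Any of the methods above can be fit to train_data; for example, a random forest:
library(randomForest)
fit   <- randomForest(dropout ~ ., data = train_data)
# Predicted dropout probability; column order follows the factor levels of 'dropout'.
p_hat <- predict(fit, newdata = test_data, type = "prob")[, 2]
```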


To assess a model’s accuracy, we utilized a receiver operating characteristic (ROC) curve

to create a visual representation of the performance of the model. In a ROC curve the true

positive rate (Sensitivity) is plotted against the false positive rate (1-Specificity) for different cut-

off thresholds. Each point on the ROC curve represents a sensitivity/specificity pair

corresponding to a chosen decision threshold. Youden’s Index was used to select the optimal cut-off threshold for each individual model, thus allowing a specific value of sensitivity, specificity, and accuracy to be calculated. Note that each of these values depends on the prevalence of

dropout in the population, and therefore none of them can be interpreted in isolation.
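
For readers unfamiliar with these quantities, the following base-R sketch shows how an ROC curve can be traced and a cut-off chosen with Youden’s Index; p_hat (predicted dropout probabilities) and truth (observed 0/1 dropout) are hypothetical placeholders for the test-set quantities described above.

```r
# Minimal sketch: ROC curve and Youden's Index in base R.
# 'p_hat' = predicted dropout probabilities, 'truth' = observed 0/1 dropout (placeholders).
thresholds  <- sort(unique(p_hat))
sensitivity <- sapply(thresholds, function(t) mean(p_hat[truth == 1] >= t))  # true positive rate
specificity <- sapply(thresholds, function(t) mean(p_hat[truth == 0] <  t))  # true negative rate

# Youden's Index J = sensitivity + specificity - 1; the optimal cut-off maximizes J.
youden_j <- sensitivity + specificity - 1
best_cut <- thresholds[which.max(youden_j)]

# Accuracy at the chosen cut-off.
accuracy <- mean((p_hat >= best_cut) == (truth == 1))

# The ROC curve: 1 - specificity on the x-axis, sensitivity on the y-axis.
plot(1 - specificity, sensitivity, type = "l",
     xlab = "1 - Specificity (false positive rate)",
     ylab = "Sensitivity (true positive rate)")
```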

For example, an accuracy of 93% could be obtained by simply predicting there would be

no dropouts. Given that 93% of students do not drop out, for 93% of observations, the model

would be correct, yielding 93% accuracy. However, this “model” is not very useful. It provides

zero information over and above predicting the mean. It is equivalent to predicting a woman’s

height to be 5 feet 4 inches tall. This is roughly the median height of women in the population, but based on covariates a better prediction model could be developed. However, if we simultaneously

interpret accuracy, sensitivity, and specificity we can develop a more thorough understanding of

the performance of the classifier. If everyone is predicted to not drop out, accuracy will be 93%,

specificity will be 100%, but sensitivity will be 0%. Therefore, a goal of a good classification

model would be to maximize each of these.

This whole process was repeated 1,000 times, using a loop in R, to further understand

how well each model is predicting and creating a model from different random samples with the

same population values.
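
Conceptually, the replication scheme amounts to a loop of the form sketched below, where fit_and_evaluate is a hypothetical wrapper around the sampling, fitting, and evaluation steps described above and is assumed to return a named vector of performance statistics.

```r
# Minimal sketch of the replication scheme; 'fit_and_evaluate' is a hypothetical wrapper
# that draws a training sample of size n_train, fits the chosen method, evaluates it on the
# remaining data, and returns a named vector such as c(auc, accuracy, sensitivity, specificity).
n_reps  <- 1000
results <- vector("list", n_reps)

for (r in seq_len(n_reps)) {
  results[[r]] <- fit_and_evaluate(els, n_train = 2000, method = "randomForest")
}

# Average performance across the 1,000 random training samples.
colMeans(do.call(rbind, results))
```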

Results

Appendix A is a complete table of the performance statistics from the four

algorithms applied to the four different sample sizes. A higher value of area under the curve


(AUC) indicates a better classifier. An AUC of 0.5 indicates a classifier that is no better than

random guessing and an AUC of 1.0 indicates a perfect classifier. Due to the nature of

classification and prediction there are no set cut-off values between those points; it is entirely context dependent. One interpretation of the AUC is the expected proportion of condition positives that receive a higher predicted probability than a randomly drawn condition negative (Kurczab, Smusz, & Bojarski, 2014). The analyses for the K-Nearest Neighbors method were not completed because the computation time was prohibitively large, owing to the method's inability to work well with high-dimensional data. We also were unable to run a traditional multiple linear

regression on this data set using R due to the violations of assumptions of the method.

Not surprisingly, the general pattern found in the data was that the area under the curve

went up with increases in sample size for each method. It was also found that accuracy generally goes down as sample size increases. It might also be noted that sensitivity and specificity generally grew closer together as sample size increased.

Appendices B-E are the figures of the average ROC curve for each method at the sample

size of 2000. It is important to note that these are only illustrations of the performance of the

model. A model that is an excellent classifier will have an ROC curve that hugs the upper left

corner of the axis, which represents high sensitivity and specificity. In this study an ROC curve

like this would indicate an individual model that can predict most accurately the probability of

whether a student will drop out of school.

A confusion matrix is used to describe the performance of a classifier on a set of data for

which the true values are known. By reviewing a confusion matrix, you can better understand

how many cases were correctly classified and how many were incorrectly classified. A confusion

matrix is also used to further evaluate specificity and sensitivity of a model. Table 2 is an


example to explain how to read a confusion matrix. Appendices F-I are confusion matrices for

each method at n = 2000.

Table 2: Example Confusion Matrix (rows = predicted classification; columns = true classification)

                          True: No Dropout        True: Dropout
Predicted: No Dropout     True Negative (TN)      False Negative (FN)
Predicted: Dropout        False Positive (FP)     True Positive (TP)

True Negative (TN): the model predicted no dropout and the student did not drop out.
False Negative (FN): the model predicted no dropout but the student did drop out.
False Positive (FP): the model predicted dropout but the student did not drop out.
True Positive (TP): the model predicted dropout and the student did drop out.
The margins of such a matrix give the accuracy percentage of the model and the true percentage of each class in the population.
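
The sketch below shows how such a confusion matrix, along with sensitivity, specificity, and accuracy, could be computed in base R from the quantities introduced in the ROC discussion; p_hat, truth, and best_cut are the same hypothetical placeholders used there.

```r
# Minimal sketch: building a confusion matrix and deriving performance statistics in base R.
# 'p_hat', 'truth', and 'best_cut' are hypothetical placeholders from the ROC sketch above.
pred_class <- ifelse(p_hat >= best_cut, "Dropout", "No Dropout")
true_class <- ifelse(truth == 1,        "Dropout", "No Dropout")

conf <- table(Predicted = pred_class, True = true_class)
prop.table(conf)  # cell proportions (multiply by 100 for percentages as in Appendices F-I)

tp <- conf["Dropout", "Dropout"];       fn <- conf["No Dropout", "Dropout"]
tn <- conf["No Dropout", "No Dropout"]; fp <- conf["Dropout", "No Dropout"]

sensitivity <- tp / (tp + fn)        # correctly identified dropouts
specificity <- tn / (tn + fp)        # correctly identified non-dropouts
accuracy    <- (tp + tn) / sum(conf)
```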

Appendix J can be interpreted as a table listing the predictors most important in

predicting drop out by each method. The single classification tree and random forest methods

were restricted to 20 variables, and therefore do not include all the predictors in their models.

Discussion

Due to the increase in computational power and storage capacities, data sets are

becoming more readily available to researchers (Caruana et al., 2008; Iniesta et al., 2016).

While this is a positive for the field of data analytics and future research, it comes with

challenges. The methods previously used by social scientists are ill-equipped to handle the

complexity of the data being collected (Sinharay, 2016). Therefore, researchers must look to

other resources to create interpretable and accurate models from their large data sets. Statistical

learning methods are computationally intense methods that have been designed to combat the

high-dimensionality problem (Sinharay, 2016; Iniesta et al., 2016). The purpose of this study was


to better understand how these methods behave across a variety of sample sizes on a high-dimensional data set. All three of the initial hypotheses were supported by the results of our study, and our results supported previous research.

We were unable to quantify the performance of a traditional method on this data due to

the high dimensionality of the data. This is especially important to note because no previous research reported being unable to apply a traditional method, such as multiple linear regression, to their study. One hypothesis as to why this might be is the nested nature of our data set, i.e., a student within a classroom, a classroom within a school, and so on. This makes our data set

particularly complex. This should also be considered when discussing the remainder of the

results. While unfortunately we were unable to directly compare the performance statistics of a

traditional method to the statistical learning methods, our first hypothesis was supported. It can

be determined that, because four of the five statistical learning methods were able to create a model at all, they are a better choice of analysis than a traditional method.

It was found that, generally, within each method an increase in sample size led to an increase in the predictive capability of the model. It was also observed that the sensitivity and

specificity generally grew closer together, a positive sign for the accuracy of classification of the

model, as the sample sizes increased. This supports our hypothesis that within each method there

will be variation because of changing sample sizes.

Not only to support our third hypothesis but also to contribute to the overall goal of this

paper, results show that the models each performed differently, some better than others. For the

purpose of this study, we are then able to discuss why some of the methods are better choices than others for use on future high-dimensional data sets.


The K-Nearest Neighbor method was unable to be applied to the data set because the computation time of the method was prohibitively large. This was due to the fact that K-Nearest Neighbor does not scale well to data of this size and dimensionality. The computational time for applying this method grows as O(nd + kn), where n is the sample size for the training data, d is

the dimensionality (number of predictor variables), and k is the chosen number of “nearest

neighbors”. Thus, with a sample size as small as 200 but with 1200 predictor variables, it could

take quite a long time to fit this model to our dataset, let alone repeating the analysis a large

number of times. Social scientists should take this into account when considering this method for

their own research.

Support Vector Machine (SVM) was similar to K-Nearest Neighbor in the complexity of

its algorithm and the time it would take to run 1,000 replications on large samples from a complex data set like the ELS. While K-Nearest Neighbor would have taken 8+ months to run on a single-core computer system, SVM was initially estimated to take 3-4 months. Through the

use of the University of Notre Dame Research Computing Center supercomputer and clusters we

were able to spread out the simulations across multiple nodes and decrease the computation time

of SVM. For a social scientist who does not have access to a computing system like this, applying the SVM method to their data set might be unrealistic and time consuming. In addition to the computation time of the SVM method, we found that it was the least effective in creating an accurate model. Even at the largest sample size the models performed, on average, hardly better than random guessing. We can attribute this to the fact that support vector machines cannot handle outliers or deal with irrelevant predictors very well.

The single classification tree was an average-performing method on this dataset. It was observed that classification tree performance improves with increases in


sample size. From n = 200 to n = 2000, the model went from nearly random guessing to an

averagely performing classifier. This does not make single classification trees a bad choice when

analyzing large data sets, but it was certainly not the best of our methods.

LASSO was one of the higher performing methods from our results. One of the

advantages of the LASSO method is that, because of the penalization on the regression coefficients, the model created by LASSO is very easily interpreted. This is very beneficial to social scientists studying a large number of variables because they are able to limit their model to the

most important predictors and get rid of the ones that don’t contribute significantly to the model.

This simplifies the model and makes it easier to understand. One drawback of statistical learning

methods can be the lack of interpretability of the resulting models, and LASSO is very attractive when choosing a method for this reason.

Within this study we found that random forest had the highest accuracy in creating a high-performing model. This is because random forest does a very good job of modeling the

effects of different predictors within a data set and knowing which are important. It also has the

capacity to handle nonlinear effects of the predictors on the outcome variable. Through the use of

optimal splits at each node, random forest is able to narrowly examine how each predictor affects

other variables and the outcome variable. For large and highly complex datasets this is crucial as

a method needs to be able to decipher what is important and what is not important to include in a

model. One of the drawbacks that should be understood when considering random forest is that

you are unable to explain the relationship between a predictor and the outcome even though the predictor is included in the

model. With that being said, it was still the best classifier of the methods tested.

Being able to better understand these methods of statistical learning and their individual

differences may have crucial implications for future research on important prediction questions.


If we continue the specific application from this study, knowing the predictors of a student dropping out of high school might help educational leaders to provide intervention earlier and

more effectively for students who display those predictor characteristics. This could create a

rippling effect in our education system by increasing the number of students staying in school

and potentially pursuing higher education. An interesting observation was made when reviewing

the predictors selected by the various methods as important to predicting drop out of high school.

Among the three methods with chosen predictors listed there were strong similarities and many

overlapping predictor variables. This is believed to be a positive when comparing methods in

their accuracy because it represents consistency among models in predicting the same outcome

variable. It would be concerning to a social scientist to see high variability in each of the models

applied to a particular dataset.

There were several limitations within this study. Because of the smaller sample sizes drawn from this set of data and the limited amount of time to run the analyses, the models were compared with minimal tuning. We were attempting to compare these models at various sample sizes in an out-of-the-box format, meaning using the available default settings in currently implemented software. Tuning the models would have required additional time for adjusting the meta-parameters. Some packages provide implementations that accomplish this task automatically, e.g.,

logistic LASSO and Random Forest. While SVM performed the worst of all the methods, it could

have been improved with adjustments. Another limitation we faced when interpreting our results

and looking towards future research is that we don’t know the “truth” of the variable interactions.

This means that we don’t know the actual relationship between the variables and the outcome

variable. Therefore, our prediction capabilities are only moderate, and this limits how fully we can understand how our models are performing. With real data, there are no perfect classifiers;


for many models the variables included can only account for so much of the variance, meaning the models we trained might actually be performing optimally.

In future research, it would be helpful to run a Monte Carlo simulation on this set of data

to quantify the risk of students dropping out of high school using a probabilistic simulation. This would increase our ability to truly understand what contributes to at-risk populations and further

the amount of information gained from a study like this. Future research might also include the

addition of more statistical learning methods to be applied and studied on the data set. In this

study, we examined five of the most common and readily available methods, but there are potentially many more that would work well on a large dataset much like the one used. Doing this

would expand our knowledge of these highly computational methods and improve data analytics

on “big data”.

This study intended to further the understanding of statistical learning methods and how

they can be applied to large, high-dimensional data sets. We collected data that supported our hypotheses and previous research indicating that these methods perform better than traditional methods,

such as multiple linear regression. It was discussed how each method generally improves with an

increase in sample size. To better facilitate the understanding of these methods we discussed why

some of the methods did poorly or why they were more accurate. In this study, the methods of

random forest and LASSO were found to be the most effective and accurate in creating a

prediction model for whether or not students will drop out of high school. It is the hope of this

study that social scientists will take interest in understanding and applying statistical learning

methods to their large, high-dimensional data and that continued research will further the

understanding of these methods.


References

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992, July). A training algorithm for optimal

margin classifiers. In Proceedings of the fifth annual workshop on Computational

learning theory (pp. 144-152). ACM.

Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.

Burrus, J., & Roberts, R. D. (2012, February). Dropping Out of High School: Prevalence, Risk

Factors, and Remediation Strategies. ETS: R & D Connections, (18). Retrieved from

https://www.ets.org/Media/Research/pdf/RD_Connections18.pdf.

Caruana, R., Karampatziakis, N., & Yessenalina, A. (2008). An empirical evaluation of

supervised learning in high dimensions. Proceedings of the 25th International

Conference on Machine Learning - ICML '08, 96-103. doi:10.1145/1390156.1390169

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1).

Springer, Berlin: Springer series in statistics.

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear

models via coordinate descent. Journal of statistical software, 33(1), 1.

Iniesta, R., Stahl, D., & McGuffin, P. (2016). Machine learning, statistical learning and the future

of biological research in psychiatry. Psychological Medicine, 46(12), 2455-2465.

doi:10.1017/s0033291716001367

Kleinberg, J., Ludwig, J., Mullainathan, S., & Obermeyer, Z. (2015). Prediction Policy

Problems. American Economic Review, 105(5), 491-495. doi:10.1257/aer.p20151023

Kurczab, R., Smusz, S., & Bojarski, A. J. (2014). The influence of negative training set size on

machine learning-based virtual screening. Journal of cheminformatics, 6(1), 1.


Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22.

Messenger, R., & Mandell, L. (1972). A modal search technique for predictive nominal scale

multivariate analysis. Journal of the American Statistical Association, 67(340), 768-772.

Meyer, D., & Wien, F. T. (2015). Support vector machines. The Interface to libsvm in package

e1071.

Ripley, B., Venables, W., & Ripley, M. B. (2015). Package ‘class’.

Sinharay, S. (2016). An NCME Instructional Module on Data Mining Methods for

Classification and Regression. Educational Measurement: Issues and Practice.

doi:10.1111/emip.12115

R Core Team. (2013). R: A language and environment for statistical computing.

Therneau, T. M., Atkinson, B., & Ripley, B. (2010). rpart: Recursive Partitioning. R package

version 3.1-42. Computer software program retrieved from http://CRAN.R-project.org/package=rpart.


Appendix A

Performance Statistics Across Sample Sizes and Models

n    AUC    Accuracy    Sensitivity (TPR)    Specificity (TNR)

LASSO

200 0.512 0.912 0.053 0.971

500 0.574 0.828 0.281 0.870

1000 0.681 0.717 0.640 0.723

2000 0.729 0.717 0.744 0.720

Single Classification Tree

200 0.537 0.886 0.136 0.938

500 0.597 0.831 0.328 0.866

1000 0.640 0.771 0.490 0.790

2000 0.667 0.717 0.609 0.724

Random Forest

200 0.672 0.629 0.721 0.623

500 0.699 0.649 0.756 0.642

1000 0.717 0.662 0.781 0.654

2000 0.733 0.673 0.802 0.664

Support Vector Machines

200 0.548 0.596 0.492 0.603

500 0.543 0.634 0.435 0.648

1000 0.550 0.652 0.433 0.668

2000 0.570 0.588 0.550 0.591


Appendix B

Average ROC Curve for LASSO at n = 2000


Appendix C

Average ROC Curve for Single Classification Trees at n = 2000


Appendix D

Average ROC Curve for Random Forest at n = 2000


Appendix E

Average ROC Curve for Support Vector Machine at n = 2000


Appendix F

Confusion Matrix for LASSO

                          True: No Dropout    True: Dropout    Row Total
Predicted: No Dropout     66.83%              1.67%            68.50%
Predicted: Dropout        26.65%              4.85%            31.50%
Column Total              93.48%              6.52%


Appendix G

Confusion Matrix for Random Forest

                          True: No Dropout    True: Dropout    Row Total
Predicted: No Dropout     62.03%              1.29%            63.33%
Predicted: Dropout        31.45%              5.23%            36.67%
Column Total              93.48%              6.52%


Appendix H

Confusion Matrix for Support Vector Machines

                          True: No Dropout    True: Dropout    Row Total
Predicted: No Dropout     55.22%              2.93%            58.15%
Predicted: Dropout        38.26%              3.59%            41.85%
Column Total              93.48%              6.52%


Appendix I

Confusion Matrix for Single Classification Trees

                          True: No Dropout    True: Dropout    Row Total
Predicted: No Dropout     67.70%              2.55%            70.25%
Predicted: Dropout        25.78%              3.97%            29.75%
Column Total              93.48%              6.52%


Appendix J

Variable importance measures for Single Trees, Random Forests, and LASSO

Single Classification Trees: Random Forests: LASSO: % 10th-graders receive remedial math

Socio-economic status composite, Version 2

Total family income from all sources 2001-composite

% 10th graders are LEP or non-English proficient

Number of grades repeated (K-10)

Socio-economic status composite, Version 2

Student's year and month of birth

Standardized test composite score-math/reading

Standardized test composite score-math/reading

Number of grades repeated (K-10)

ELS-NELS 1992 scale equated sophomore math score

Mathematics proficiency probability at level 2

Has interlibrary loan program with public libraries

ELS-NELS 1992 scale equated sophomore reading score

Recognized for good grades

How often library used for classes at same time

ELS:2002-PISA:2003 concordance math score

How many times cut/skip classes

ELS-NELS 1990 scale equated sophomore math score

ELS:2002-PISA:2000 concordance reading score

How many times put on in-school suspension

ELS-NELS 1992 scale equated sophomore math score

Math estimated number right 1st friend's grade level at school

School contacted parent about problem behavior

Math test standardized score 10th-grader ever had behavior problem at school

How far in school parent expects 10th-grader will go

Mathematics proficiency probability at level 1

Parent's satisfaction with 10th-grader's education up to now

ELS:2002-PISA:2003 concordance math score

Mathematics proficiency probability at level 2

How far in school parent wants 10th-grader to go

How often uses computer at friend's house

Mathematics proficiency probability at level 3

How often student completes homework (math)

Ever worked for pay not around house

Reading estimated number right How often student is absent (math)

Socio-economic status composite, Version 1

Reading test standardized score How many people congregated in area around school

Socio-economic status composite, Version 2

Reading proficiency probability at level 1

-

How often student is absent (math)

Reading proficiency probability at level 2

-

How often use computer for administrative records (math)

How many times cut/skip classes -

Teacher-Student relations (Positive or Negative)

How far in school student thinks will get

-

Standardized test composite score-math/reading

How often student is absent (math)

-