Running Head: A COMPARISON OF STATISTICAL LEARNING METHODS IN DIFFERENT SAMPLE SIZES
A Comparison of Statistical Learning Methods in Different Sample Sizes
Katlynn Kennedy
Saint Mary’s College
STATISTICAL METHODS 1
A Comparison of Statistical Learning Methods in Different Sample Sizes
In this modern age of science and technology, researchers can examine and collect
massive, diverse sets of data that have previously been impossible to examine. This is largely
due to advancements in computational sciences and digital storage capacities, in conjunction
with the continual development of statistical analyses (Caruana, Karampatziakis & Yessenalina,
2008; Iniesta, Stahl & McGuffin, 2016). Yet even with these large datasets being increasingly
available, many researchers in the social sciences are ill-equipped with the training to use
methods specifically designed to analyze them (Sinharay, 2016). Providing researchers with
knowledge on how different statistical learning methods can be applied to these datasets can
facilitate the improvement of data analytics (Iniesta et al., 2016; Sinharay, 2016; Kleinberg,
Ludwig, Mullainathan & Obermeyer, 2015). Such methods can be used specifically for research
designed to study prediction problems where previous methods of variable selection may be
inappropriate or outdated (Sinharay, 2016; Iniesta et al., 2016; Kleinberg et al., 2015). Such
methods may also be applied to large datasets containing a wide variety of sample sizes.
Consider this example used by Kleinberg et al. (2015) when thinking about prediction
versus causal problems:
A policy-maker facing a drought is deciding whether to invest in a rain dance to increase rain. Another is seeing clouds and must decide whether to take an umbrella to work to avoid getting wet. Both characters require something different and have different questions. One is a causal problem asking, "Do rain dances cause rain?" The other is a prediction problem: "Is the chance of rain high enough to bring an umbrella?"
Researchers often look at questions like the rain dance, a causal question; but prediction
problems are becoming increasingly important, common, and interesting (Kleinberg et al., 2015).
Researchers are now able to ask and assess such prediction questions using large and substantial
datasets to answer them. For example, one could use a large educational dataset to understand
which high school students have the highest risk of dropping out. This prediction problem could
be instrumental in providing intervention plans earlier and more effectively for at-risk students
(Burrus & Roberts, 2012; Sinharay, 2016; Iniesta et al., 2016).
Once a researcher gets into these larger, more complex datasets, they are often left to
their own devices to decide what methods may be best to use for their analyses (Sinharay, 2016).
In doing so, many continue to use methods learned in their undergraduate or graduate training
(e.g., multiple linear regression). However, these traditional methods are not designed to handle
the complexity of data that comes with large datasets, such as those resulting from longitudinal
studies (Iniesta et al., 2016; Caruana et al., 2008). One reason is the high-dimensional data
problem (Iniesta et al., 2016; Caruana et al., 2008), which arises when a data set has a number of
variables (p) close to or greater than the number of observations (N) (Iniesta et al., 2016). This
violates the assumptions of many traditional methods, making the results of analyses done with
them high in error.
It is also simply not realistic for researchers to use a traditional method of comparing thousands
of single association tests and sorting by significance to determine which predictors are
important in the dataset. In addition to the high-dimensional data problem, traditional methods
are also at high risk of overfitting a model, meaning they may include variables that have no
real relationship to the outcome variable (Iniesta et al., 2016). For example, within a given
sample an incidental variable may appear to be a significant predictor of high school
dropout, yet this result may not generalize to the population. In addition, traditional
prediction methods depend on assumptions such as linearity and homogeneity of variance
(Sinharay, 2016). These assumptions are often violated in data sets that are larger in size. Thus,
the use of traditional methods has been outgrown by our data, and researchers need to begin
using more advanced methodologies that are well-equipped to answer prediction problems more
accurately.
This increase in computational power has raised the question of whether more
computationally intensive prediction methods are practical and could provide more
accurate predictions than traditional methods in important problems (Sinharay, 2016). These
types of computation-intensive prediction methods are called statistical learning methods
(Sinharay, 2016; Iniesta et al., 2016). There are five specific statistical learning methods that this
study intends to better understand: 1. Logistic Least Absolute Shrinkage and Selection Operator
(LASSO), 2. Classification Trees, 3. Random Forest, 4. Support Vector Machines, and 5. K-
Nearest Neighbors.
LASSO
The Least Absolute Shrinkage and Selection Operator (LASSO) method penalizes the
absolute sum of the regression coefficients. In doing so, it shrinks the coefficients of variables
that do not contribute to the model to exactly zero, thereby removing them from the model.
This makes LASSO effective at selecting important variables, and it also yields a more
interpretable model. The LASSO requires a tuning parameter, lambda, to operate. The choice
of this tuning parameter greatly influences the results of the estimation process and is
usually calibrated using a cross-validation procedure.
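The shrinkage-to-zero behavior described above can be illustrated with the soft-thresholding operator that underlies LASSO estimation. The analyses in this study were run in R; the Python sketch below, with hypothetical coefficient values, is purely illustrative:

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator at the core of LASSO estimation:
    shrink z toward zero by lam, and set it exactly to zero when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical least-squares coefficients for three predictors.
estimates = {"x1": 0.80, "x2": 0.55, "x3": 0.10}
lam = 0.25  # the tuning parameter lambda, normally chosen by cross-validation

# Small coefficients are shrunk all the way to zero, removing x3 from the model.
lasso_coefs = {name: soft_threshold(b, lam) for name, b in estimates.items()}
```

Larger values of lambda shrink more coefficients to zero, which is why the cross-validated choice of lambda so strongly shapes the final model.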
Single Classification Trees
Classification trees are one of the two types of trees that are considered prediction trees
(the other being regression trees). Their goal is to predict an outcome variable, Y, from a set
of input variables, X1, X2, ..., Xp. Classification trees use recursive partitioning to determine
the predictors that influence the
categorization of an individual into a class. A classification tree might be used in a situation
where many variables interact in nonlinear and complicated ways. A classification tree begins
with a single node and then selects the binary split that best separates the classes. In other
words, it selects a predictor that is important in determining the class of the individual, given
the response at the previous node (see Figure 1). A new node is then created, and this process
repeats for each node until a stopping criterion is reached. A classification tree produces a
predicted class at each node in addition to a model for the entire tree.
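The split-selection step can be sketched with a Gini-impurity criterion, a measure commonly used for classification trees. This is an illustrative Python sketch on toy data, not the implementation used in the study:

```python
def gini(labels):
    """Gini impurity of a set of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n          # proportion of class 1 (e.g., dropout)
    return 2 * p1 * (1 - p1)

def best_split(x, y):
    """Return the threshold on a single predictor x that minimizes the
    weighted Gini impurity of the two child nodes."""
    best = (None, float("inf"))
    for t in sorted(set(x)):
        left  = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

x = [1, 2, 3, 10, 11, 12]      # e.g., number of absences
y = [0, 0, 0, 1, 1, 1]         # 1 = dropped out
threshold, impurity = best_split(x, y)
```

Growing a full tree simply repeats this search at each new node over all candidate predictors until the stopping criterion is reached.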
Figure 1: The classification tree for predicting the risk of recurrent falling in community-dwelling older persons at 3-year follow-up.
Random Forests
A potential issue with classification trees is that they are highly sample dependent
(Breiman, 2001). The algorithm used to grow a tree is a greedy algorithm that selects splits
maximizing the amount of variance accounted for in the outcome variable. This makes single
trees highly susceptible to overfitting and means a single tree may not generalize to the
population.
The random forest method accounts for this by growing many classification trees; each
tree produces a classification and 'votes' for the class it would assign to an individual. The
forest then chooses the classification with the most votes. One characteristic of random forests
that makes them strong candidates for analyzing large data sets is that they handle non-linear
relationships within the data, and they do so well.
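The voting step can be sketched as follows. This is an illustrative Python fragment with made-up tree votes, not the internals of the R randomForest package:

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Aggregate the class 'votes' of many trees: the forest predicts the
    class chosen by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Seven hypothetical trees classify one student; most vote "no dropout".
votes = ["no dropout", "dropout", "no dropout", "no dropout",
         "dropout", "no dropout", "no dropout"]
prediction = forest_vote(votes)
```

Because each tree is grown on a different bootstrap sample, the averaged vote is far less sample dependent than any single tree.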
Support Vector Machines
Support vector machine (SVM) is a data mining algorithm primarily used for
classification problems. The underlying mechanism behind SVM (Boser, Guyon, &
Vapnik, 1992) is to find a line (in two-dimensional space), plane (in three-dimensional
space), or hyperplane (in n-dimensional space) that linearly separates two distinct classes. The
algorithm works by finding "support vectors," i.e., points in the dataset that define the
separating hyperplane. The algorithm chooses support vectors that define a boundary between
the two classes so as to create a wide "margin," the space on either side of the boundary but
within the points defined by the support vectors. The margin should be as wide as possible;
the widest margin is believed to be the most generalizable to a new sample. One limitation of
SVMs, however, is their reduced ability to separate classes in the presence of many irrelevant
predictors (Friedman, Hastie, & Tibshirani, 2001). Given the large
number of predictors in this sample, it is likely that not all the potential predictors will contribute
positively to the model. Therefore, it is likely that SVM will not perform well in this study.
Figure 2: Example of a Two-Dimensional Support Vector Machine Line and Margin
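A small numeric sketch of the margin idea on a toy two-dimensional problem follows. Real SVM solvers find the weight vector w and intercept b by quadratic optimization, which is omitted here; the candidate separator below is chosen by hand for illustration:

```python
import math

def margin_width(w):
    """For a separating hyperplane w.x + b = 0 with the classes lying on or
    outside the planes w.x + b = +1 and -1, the margin width is 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))

def separates(w, b, points, labels):
    """Check that every labeled point (+1/-1) lies on its own side of the margin."""
    return all(label * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
               for x, label in zip(points, labels))

points = [(0, 0), (0, 1), (3, 3), (4, 3)]
labels = [-1, -1, 1, 1]
w, b = (1.0, 0.0), -1.5          # a hand-picked vertical separator at x1 = 1.5
```

Among all (w, b) that satisfy the separation check, the SVM objective picks the one with the smallest ||w||, i.e., the widest margin.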
K-Nearest Neighbor
K-Nearest Neighbor (KNN) classifies a point based on its similarity to other points in the
data set. Each new observation is assigned to the class most common among its K nearest
neighbors, determined by a majority vote using a distance metric. The metric most commonly
used is Euclidean distance, which measures the straight-line distance between points.
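A minimal sketch of KNN classification on toy points follows; this is illustrative Python, not the R implementation used in the study:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pl: euclidean(pl[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["stay", "stay", "stay", "drop", "drop", "drop"]
pred = knn_classify(train, labels, (5, 4), k=3)
```

Note that every classification requires computing a distance to every training point over every predictor, which foreshadows the computational cost discussed later.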
Figure 3: Example of a K-Nearest Neighbor Classification
These methods are quite complex and can be difficult for even an experienced
investigator to choose and apply (Sinharay, 2016). Research has provided empirical evidence
that these methods outperform traditional methods on prediction questions in the case of large
data sets (Caruana, Karampatziakis & Yessenalina, 2008). Although this research can guide
researchers in learning and utilizing these advanced methods, it leaves open the question of
how these methods behave in smaller and lower-dimensional data sets.
Therefore, the purpose of the current study is to examine how five of these different
methods behave in relation to a variety of sample sizes, spanning from considerably small to
moderately large.
In conjunction with previous research, it is hypothesized that the advanced statistical
learning methods will perform better than the traditional model of multiple linear regression,
producing prediction models with lower associated errors (Caruana et al.,
2008; Kleinberg et al., 2015; Sinharay, 2016; Iniesta et al., 2016). It is also hypothesized that
different models will be produced from each learning method solely due to differences in sample
sizes. In other words, models created by the same method may be different for each of the
sample sizes tested on the method. Finally, it is hypothesized that the models and errors
produced will differ between learning methods; each learning method may produce a different
prediction model. The results of this last hypothesis may be due to how each learning method
acts in relation to the size of the sample being assessed.
Along with conducting these comparisons, one major purpose of this study is to outline
how each method is used in a manner that is easily interpreted by researchers who may not be
familiar with them. Rather than simply go through the results, this paper should allow social
science researchers to grasp the basic principles of these methods without being bogged down by
the computational complexities. It should also encourage researchers to apply these methods to
their own research; it will provide researchers with the conceptual ideas, pros, and cons of these
methods to allow them to know when each method is appropriate.
Method
The Education Longitudinal Study of 2002 (ELS:2002), collected by the National Center for
Education Statistics, provided the data sample utilized in the analyses. This
longitudinal study followed over 15,000 students from around 750 schools through their
secondary and post-secondary school years. The sample also includes students who transferred to
schools being surveyed, dropped out, and completed school early for both secondary and post-
secondary years. This data set is created from a nationally representative group of 10th graders in
2002 and 12th graders in 2004. There were four data collections over the span of 10 years, with
surveys given to students, parents, school administrators, teachers, and facilities. These data
collections collected thousands of variables about a student’s educational success, home life,
environmental influences and more. There are over 74 variables within this data set that could
be utilized as the ‘outcome’ variable for our analyses, but the one selected is the dichotomous
variable of whether a student will drop out of high school.
Once the data were selected, the process of data preprocessing began. The data were cleaned
by removing impossible data (e.g., variables from future waves) and restricted data. Variables
with over 25% missingness were removed from the dataset, leaving approximately 1,200
variables for use in the prediction models. This concluded the preprocessing and allowed for
the following pseudo-simulation study.
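The missingness screen can be sketched as follows. This is an illustrative Python fragment with hypothetical variables; the actual preprocessing was done in R:

```python
def drop_high_missingness(columns, threshold=0.25):
    """Keep only variables whose fraction of missing values (None) is
    at or below the threshold."""
    kept = {}
    for name, values in columns.items():
        missing = sum(v is None for v in values) / len(values)
        if missing <= threshold:
            kept[name] = values
    return kept

# Hypothetical variables: one at the 25% boundary, one far above it.
data = {
    "gpa":      [3.1, 2.4, None, 3.8],      # 25% missing -> kept
    "income":   [None, None, 50000, None],  # 75% missing -> dropped
    "absences": [2, 0, 5, 1],               # 0% missing  -> kept
}
cleaned = drop_high_missingness(data)
```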
Cases were randomly sampled from the remaining observations (a student with
observations on all 1,200 variables represents a single case) in sample sizes of 200, 500, 1,000,
and 2,000. The previously described statistical learning models were fit to these samples, with
the binary dropout indicator as the outcome variable; this is called training a model on a
training set of data. The open-source statistical software R (R Core Team, 2013) was utilized to
run the analyses, along with a variety of packages (Meyer & Wien, 2015; Liaw & Wiener, 2002;
Therneau, Atkinson, & Ripley, 2010; Friedman, Hastie, & Tibshirani, 2010; Ripley, Venables, &
Ripley, 2015), each programmed by statisticians who developed the specified methods. The five
statistical learning methods tested are: 1. Logistic Least Absolute Shrinkage and Selection
Operator (LASSO), 2. Classification Trees, 3. Random Forest, 4. Support Vector Machines, and
5. K-Nearest Neighbor. The samples drawn were the data sets used to fit and train each model.
Once a model was trained, it was tested to estimate how well it performs on new data.
This was done by taking the test set of data (i.e., the remaining data from the dataset) and
predicting the class of each remaining observation using the trained model.
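The sampling scheme above can be sketched as follows. This is illustrative Python with stand-in records; the study itself drew the samples in R from the ELS data:

```python
import random

def train_test_split(cases, n_train, seed=0):
    """Randomly sample n_train cases for the training set; the remaining
    cases form the test set used to estimate out-of-sample performance."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = cases[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

cases = list(range(10_000))                  # stand-ins for student records
train, test = train_test_split(cases, n_train=2000)
```

Repeating this split with a fresh random draw each time is what makes the 1,000-replication loop described below a pseudo-simulation.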
To assess a model’s accuracy, we utilized a receiver operating characteristic (ROC) curve
to create a visual representation of the performance of the model. In a ROC curve the true
positive rate (Sensitivity) is plotted against the false positive rate (1-Specificity) for different cut-
off thresholds. Each point on the ROC curve represents a sensitivity/specificity pair
corresponding to a chosen decision threshold. Youden's Index was used to select the optimal
cutoff threshold for each individual model, allowing us to calculate specific values of
sensitivity, specificity, and accuracy. Note that each of these values depends on the prevalence
of dropout in the population, and therefore none of them can be interpreted in isolation.
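A minimal sketch of choosing a cutoff via Youden's Index follows, using made-up predicted probabilities; the actual analyses were run in R:

```python
def youden_optimal_threshold(scores, labels, thresholds):
    """Pick the cutoff maximizing Youden's J = sensitivity + specificity - 1,
    where a score above the cutoff predicts the positive class (dropout)."""
    best_t, best_j = None, -1.0
    for t in thresholds:
        tp = sum(s > t and y == 1 for s, y in zip(scores, labels))
        fn = sum(s <= t and y == 1 for s, y in zip(scores, labels))
        tn = sum(s <= t and y == 0 for s, y in zip(scores, labels))
        fp = sum(s > t and y == 0 for s, y in zip(scores, labels))
        sens = tp / (tp + fn)   # true positive rate
        spec = tn / (tn + fp)   # true negative rate
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]   # hypothetical dropout probabilities
labels = [0,   0,   0,   1,   0,   1]
t, j = youden_optimal_threshold(scores, labels, [0.25, 0.5, 0.75])
```

Each candidate threshold corresponds to one point on the ROC curve; Youden's Index selects the point farthest above the chance diagonal.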
For example, an accuracy of 93% could be obtained by simply predicting there would be
no dropouts. Given that 93% of students do not drop out, for 93% of observations, the model
would be correct, yielding 93% accuracy. However, this "model" is not very useful; it provides
zero information over and above predicting the mean. It is equivalent to predicting every
woman's height to be 5 feet 4 inches: approximately the median height in the population, but
a prediction that a model based on covariates could improve upon. If, however, we
simultaneously interpret accuracy, sensitivity, and specificity, we can develop a more thorough
understanding of the performance of the classifier. If everyone is predicted to not drop out,
accuracy will be 93%, specificity will be 100%, but sensitivity will be 0%. Therefore, a goal of
a good classification model is to maximize all three.
This whole process was repeated 1,000 times, using a loop in R, to further understand
how well each method predicts, and what model it creates, across different random samples
from the same population.
Results

Appendix A is a complete table of the performance statistics from the four
algorithms applied to the four different sample sizes. A higher value of area under the curve
(AUC) indicates a better classifier. An AUC of 0.5 indicates a classifier that is no better than
random guessing, and an AUC of 1.0 indicates a perfect classifier. Due to the nature of
classification and prediction there are no set cutoff values between those points; interpretation
is entirely context dependent. One interpretation of AUC is the expected proportion of condition
positives with a higher predicted probability than a randomly drawn condition negative
(Kurczab, Smusz, & Bojarski, 2014). The analyses for the K-Nearest Neighbors method were
not completed because the computation time was prohibitively large, owing to the method's
inability to work well with high-dimensional data. We were also unable to run a traditional
multiple linear regression on this data set in R due to violations of the method's assumptions.
Not surprisingly, the general pattern found in the data was that the area under the curve
increased with sample size for each method. It was also found that accuracy generally
decreased as sample size increased. It might also be noted that sensitivity and specificity
generally grew closer together as sample size increased.
Appendices B-E contain figures of the average ROC curve for each method at the sample
size of 2,000. It is important to note that these are only illustrations of the performance of the
model. An excellent classifier has an ROC curve that hugs the upper-left corner of the plot,
representing high sensitivity and specificity. In this study, such an ROC curve would indicate
a model that most accurately predicts the probability that a student will drop out of school.
A confusion matrix is used to describe the performance of a classifier on a set of data for
which the true values are known. By reviewing a confusion matrix, you can better understand
how many cases were correctly classified and how many were incorrectly classified. A confusion
matrix is also used to further evaluate specificity and sensitivity of a model. Table 2 is an
example to explain how to read a confusion matrix. Appendices F-I are confusion matrices for
each method at n = 2000.
Table 2: Example Confusion Matrix

                       Predicted: no dropout          Predicted: dropout
Did not drop out       True Negative (TN):            False Positive (FP):
                       predicted no dropout,          predicted dropout,
                       and did not drop out           but did not drop out
Dropped out            False Negative (FN):           True Positive (TP):
                       predicted no dropout,          predicted dropout,
                       but did drop out               and did drop out

The diagonal cells, (TN + TP) divided by the total, give the accuracy percentage of the model;
the row totals give the true percentage of each outcome in the population.
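The three metrics discussed above follow directly from the four cells of the matrix. The illustrative Python sketch below reproduces the degenerate "predict nobody drops out" example from the Method section:

```python
def classifier_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (TPR), and specificity (TNR) from the four
    cells of a confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "accuracy":    (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Predicting "no dropout" for all 100 students when 7 actually drop out:
# high accuracy and perfect specificity, but zero sensitivity.
m = classifier_metrics(tp=0, fp=0, tn=93, fn=7)
```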
Appendix J can be interpreted as a table listing the predictors most important in
predicting dropout for each method. The single classification tree and random forest methods
were restricted to 20 variables, and therefore do not list all the predictors included in their models.
Discussion
Due to increases in computational power and storage capacity, large data sets are
becoming more readily available to researchers (Caruana et al., 2008; Iniesta et al., 2016).
While this is a positive development for the field of data analytics and future research, it comes
with challenges. The methods previously used by social scientists are ill-equipped to handle the
complexity of the data being collected (Sinharay, 2016). Therefore, researchers must look to
other resources to create interpretable and accurate models from their large data sets. Statistical
learning methods are computationally intensive methods that have been designed to combat the
high-dimensionality problem (Sinharay, 2016; Iniesta et al., 2016). The purpose of this study was
to better understand how these methods behave across a variety of sample sizes on a highly
dimensional data set. All three of the initial hypotheses were supported by the results of our
study, and our results supported previous research.
We were unable to quantify the performance of a traditional method on these data due to
their high dimensionality. This is especially important to note because none of the previous
research reviewed here reported being unable to apply a traditional method, such as multiple
linear regression. One hypothesis as to why is the nested nature of our data set (i.e., a student
within a classroom, a classroom within a school, and so on), which makes it particularly
complex. This should be kept in mind when discussing the remainder of the results. While we
were unfortunately unable to directly compare the performance statistics of a traditional method
to those of the statistical learning methods, our first hypothesis was supported: because four of
the five statistical learning methods were able to create a model at all, they represent a better
choice of analysis than a traditional method.
It was found that, within each method, an increase in sample size generally increased the
predictive capability of the model. It was also observed that sensitivity and specificity generally
grew closer together as sample size increased, a positive sign for the classification accuracy of
the model. This supports our hypothesis that each method's results will vary with changing
sample sizes.
Supporting our third hypothesis, and contributing to the overall goal of this paper, the
results show that the models each performed differently, some better than others. We are
therefore able to discuss why some of the methods are better choices than others for use on
future high-dimensional data sets.
The K-Nearest Neighbor method could not be applied to the data set because its
computation time was prohibitively large; K-Nearest Neighbor simply cannot handle data of
this size and dimensionality. The computational time for applying this method grows as
O(nd + kn) for each prediction, where n is the sample size of the training data, d is the
dimensionality (number of predictor variables), and k is the chosen number of "nearest
neighbors". Thus, with a sample size as small as 200 but with 1,200 predictor variables,
fitting this model to our dataset could take quite a long time, let alone repeating the analysis a
large number of times. Social scientists should take this into account when considering this
method for their own research.
Support Vector Machine (SVM) was similar to K-Nearest Neighbor in the complexity of
its algorithm and the time it would take to simulate 1,000 replications on such large samples of
a complex data set like the ELS. While K-Nearest Neighbor would have taken more than eight
months to run on a single-core computer system, SVM was initially going to take three to four
months. Through the use of the University of Notre Dame Research Computing Center
supercomputer and clusters, we were able to spread the simulations across multiple nodes and
decrease the computation time of SVM. For a social scientist without access to a computing
system like this, applying the SVM method to such a data set might be unrealistic and
time-consuming. In addition to the computation time of the SVM method, we found that it was
the least effective at creating an accurate model; even at the largest sample size the models did,
on average, hardly better than random guessing. We attribute this to the fact that support vector
machines do not handle outliers or irrelevant predictors very well.
The single classification tree was an average-performing method on this dataset. It was
observed that classification trees improve in performance with increases in
sample size: from n = 200 to n = 2000, the model went from nearly random guessing to an
average-performing classifier. This does not make single classification trees a bad choice when
analyzing large data sets, but they were certainly not the best of our methods.
LASSO was one of the higher-performing methods in our results. One advantage of the
LASSO method is that, because of the penalization of the regression coefficients, the resulting
model is very easily interpreted. This is beneficial to social scientists studying a large number
of variables, because they can limit their model to the most important predictors and discard
those that do not contribute meaningfully. This simplifies the model and makes it easier to
understand. A common drawback of statistical learning methods is the lack of interpretability
of their models, which makes LASSO very attractive when choosing a method.
Within this study we found that random forest had the highest accuracy in creating a
high-performing model. This is because random forest does a very good job of modeling the
effects of different predictors within a data set and identifying which are important. It also has
the capacity to handle nonlinear effects of the predictors on the outcome variable. Through the
use of optimal splits at each node, random forest is able to closely examine how each predictor
affects other variables and the outcome variable. For large and highly complex datasets this is
crucial, as a method needs to be able to decipher what is and is not important to include in a
model. One drawback to understand when considering random forest is that you are unable to
explain the relationship between a variable and the outcome even though it is included in the
model. That said, it was still the best classifier of the methods tested.
Being able to better understand these methods of statistical learning and their individual
differences may have crucial implications for future research on important prediction questions.
If we continue the specific application from this study, knowing the predictors of a student
dropping out of high school might help educational leaders provide interventions earlier and
more effectively for students who display those predictor characteristics. This could create a
rippling effect in our education system by increasing the number of students staying in school
and potentially pursuing higher education. An interesting observation emerged when reviewing
the predictors selected as important by the various methods. Among the three methods with
chosen predictors listed, there were strong similarities and many overlapping predictor
variables. This is a positive sign when comparing methods for accuracy, because it represents
consistency among models predicting the same outcome variable; it would be concerning to a
social scientist to see high variability among the models applied to a particular dataset.
There were several limitations within this study. Because of the smaller sample sizes
drawn from this data set and the limited time available to run the analyses, the models were
compared with minimal tuning. We compared these models at various sample sizes in an
out-of-the-box format, meaning with the default settings available in currently implemented
software. Tuning the models would require spending additional time adjusting the
meta-parameters; some packages provide implementations that accomplish this task
automatically (e.g., logistic LASSO and random forest). While SVM performed the worst of all
the methods, it could have been improved with such adjustments. Another limitation we faced
when interpreting our results
and looking towards future research is that we don’t know the “truth” of the variable interactions.
This means that we do not know the actual relationships between the variables and the
outcome variable. Our prediction capabilities are therefore only moderate, which limits our
ability to fully understand how our models are performing. With real data, there are no perfect
classifiers;
for many models the variables included can only account for so much of the variance, meaning
the models we trained might actually be performing optimally.
In future research, it would be helpful to run a Monte Carlo simulation on this set of data
to quantify the risk of students dropping out of high school using a probabilistic simulation.
This would increase our ability to truly understand what contributes to at-risk populations and
extend the information gained from a study like this. Future research might also apply
additional statistical learning methods to the data set. In this study we examined five of the
most common and readily available methods, but there are potentially many more that would
work on a large dataset like the one used. Doing so would expand our knowledge of these
computationally intensive methods and improve data analytics on "big data".
This study intended to further the understanding of statistical learning methods and how
they can be applied to large, high-dimensional data sets. Our results supported our hypotheses
and previous research indicating that these methods perform better than traditional methods,
such as multiple linear regression. Each method generally improved with an increase in sample
size. To better facilitate the understanding of these methods, we discussed why some methods
did poorly and why others were more accurate. In this study, the methods of random forest and
LASSO were found to be the most effective and accurate in creating a prediction model for
whether or not students will drop out of high school. It is the hope of this study that social
scientists will take interest in understanding and applying statistical learning methods to their
large, high-dimensional data, and that continued research will further the understanding of
these methods.
References
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992, July). A training algorithm for optimal
margin classifiers. In Proceedings of the fifth annual workshop on Computational
learning theory (pp. 144-152). ACM.
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
Burrus, J., & Roberts, R. D. (2012, February). Dropping Out of High School: Prevalence, Risk
Factors, and Remediation Strategies. ETS: R & D Connections, (18). Retrieved from
https://www.ets.org/Media/Research/pdf/RD_Connections18.pdf.
Caruana, R., Karampatziakis, N., & Yessenalina, A. (2008). An empirical evaluation of
supervised learning in high dimensions. Proceedings of the 25th International
Conference on Machine Learning - ICML '08, 96-103. doi:10.1145/1390156.1390169
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1).
Springer, Berlin: Springer series in statistics.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear
models via coordinate descent. Journal of statistical software, 33(1), 1.
Iniesta, R., Stahl, D., & McGuffin, P. (2016). Machine learning, statistical learning and the future
of biological research in psychiatry. Psychological Medicine, 46(12), 2455-2465.
doi:10.1017/s0033291716001367
Kleinberg, J., Ludwig, J., Mullainathan, S., & Obermeyer, Z. (2015). Prediction Policy
Problems. American Economic Review, 105(5), 491-495. doi:10.1257/aer.p20151023
Kurczab, R., Smusz, S., & Bojarski, A. J. (2014). The influence of negative training set size on
machine learning-based virtual screening. Journal of cheminformatics, 6(1), 1.
STATISTICAL METHODS 19
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22. Messenger, R., & Mandell, L. (1972). A modal search technique for predictive nominal scale
multivariate analysis. Journal of the American Statistical Association, 67(340), 768-772.
Meyer, D., & Wien, F. T. (2015). Support vector machines. The Interface to libsvm in package
e1071.
Ripley, B., Venables, W., & Ripley, M. B. (2015). Package ‘class’.
Sinharay, S. (2016). An NCME Instructional Module on Data Mining Methods for
Classification and Regression. Educational Measurement: Issues and Practice.
doi:10.1111/emip.12115
Team, R. C. (2013). R: A language and environment for statistical computing.
Therneau, T. M., Atkinson, B., & Ripley, B. (2010). rpart: Recursive Partitioning. R package
version 3.1–42. Computer software program retrieved from http://CRAN. R-project.
org/package= rpart.
Appendix A
Performance Statistics Across Sample Sizes and Models
Model                        n     AUC    Accuracy  Sensitivity (TPR)  Specificity (TNR)
LASSO                        200   0.512  0.912     0.053              0.971
                             500   0.574  0.828     0.281              0.870
                             1000  0.681  0.717     0.640              0.723
                             2000  0.729  0.717     0.744              0.720
Single Classification Tree   200   0.537  0.886     0.136              0.938
                             500   0.597  0.831     0.328              0.866
                             1000  0.640  0.771     0.490              0.790
                             2000  0.667  0.717     0.609              0.724
Random Forest                200   0.672  0.629     0.721              0.623
                             500   0.699  0.649     0.756              0.642
                             1000  0.717  0.662     0.781              0.654
                             2000  0.733  0.673     0.802              0.664
Support Vector Machines      200   0.548  0.596     0.492              0.603
                             500   0.543  0.634     0.435              0.648
                             1000  0.550  0.652     0.433              0.668
                             2000  0.570  0.588     0.550              0.591
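The AUC column reports the area under the ROC curve, which equals the probability that a randomly chosen positive (dropout) case receives a higher predicted score than a randomly chosen negative case. As a quick illustration of what this statistic measures, the sketch below computes AUC from that pairwise identity on toy scores (hypothetical data, not the study's):

```python
# Toy illustration of the AUC column: AUC equals the probability that a
# randomly drawn positive case outscores a randomly drawn negative case
# (the Mann-Whitney identity), with ties counted as half.

def auc(scores, labels):
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    pairs = 0
    wins = 0.0
    for p in positives:
        for n in negatives:
            pairs += 1
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tie: half credit
    return wins / pairs

# Hypothetical predicted dropout probabilities and true labels (1 = dropout).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1,   1,   0,    1,   0,   0]
print(auc(scores, labels))  # perfect separation here -> 1.0
```

An AUC of 0.5 corresponds to chance-level ranking, which is why values such as 0.512 (LASSO at n = 200) indicate little discriminative ability at small samples.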
Appendix B
Average ROC Curve for LASSO at n = 2000
Appendix C
Average ROC Curve for Single Classification Trees at n = 2000
Appendix D
Average ROC Curve for Random Forest at n = 2000
Appendix E
Average ROC Curve for Support Vector Machine at n = 2000
Appendix F
Confusion Matrix for LASSO
LASSO                         True Classification
Predicted Classification      No Dropout   Dropout   Total
  No Dropout                  66.83%       1.67%     68.50%
  Dropout                     26.65%       4.85%     31.50%
  Total                       93.48%       6.52%     100.00%
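The per-class rates reported in Appendix A can be recovered from these cell percentages. As a minimal stdlib sketch (an illustration, not the study's R code), the computation below reproduces the LASSO n = 2000 accuracy (0.717) and sensitivity (0.744) from Appendix A; the specificity comes out slightly lower (0.715 vs. 0.720), presumably because Appendix A averages across replications:

```python
# Recover accuracy, sensitivity, and specificity from the LASSO confusion
# matrix cells (population percentages; rows = predicted, columns = true).

def metrics(tn, fn, fp, tp):
    """tn/fn/fp/tp are the four confusion-matrix cells, with 'dropout'
    as the positive class. Returns (accuracy, sensitivity, specificity)."""
    total = tn + fn + fp + tp
    accuracy = (tn + tp) / total
    sensitivity = tp / (tp + fn)  # true dropouts correctly flagged (TPR)
    specificity = tn / (tn + fp)  # non-dropouts correctly cleared (TNR)
    return accuracy, sensitivity, specificity

# LASSO cells: predicted no-dropout & true no-dropout = 66.83,
# predicted no-dropout & true dropout = 1.67 (missed dropouts),
# predicted dropout & true no-dropout = 26.65 (false alarms),
# predicted dropout & true dropout = 4.85.
acc, tpr, tnr = metrics(tn=66.83, fn=1.67, fp=26.65, tp=4.85)
print(round(acc, 3), round(tpr, 3), round(tnr, 3))  # 0.717 0.744 0.715
```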
Appendix G
Confusion Matrix for Random Forest
Random Forest                 True Classification
Predicted Classification      No Dropout   Dropout   Total
  No Dropout                  62.03%       1.29%     63.33%
  Dropout                     31.45%       5.23%     36.67%
  Total                       93.48%       6.52%     100.00%
Appendix H
Confusion Matrix for Support Vector Machines
Support Vector Machines       True Classification
Predicted Classification      No Dropout   Dropout   Total
  No Dropout                  55.22%       2.93%     58.15%
  Dropout                     38.26%       3.59%     41.85%
  Total                       93.48%       6.52%     100.00%
Appendix I
Confusion Matrix for Single Classification Trees
Single Classification Tree    True Classification
Predicted Classification      No Dropout   Dropout   Total
  No Dropout                  67.70%       2.55%     70.25%
  Dropout                     25.78%       3.97%     29.75%
  Total                       93.48%       6.52%     100.00%
Appendix J
Variable importance measures for Single Trees, Random Forests, and LASSO
Single Classification Trees:
% 10th-graders receive remedial math
% 10th graders are LEP or non-English proficient
Student's year and month of birth
Number of grades repeated (K-10)
Has interlibrary loan program with public libraries
How often library used for classes at same time
ELS-NELS 1990 scale equated sophomore math score
ELS-NELS 1992 scale equated sophomore math score
School contacted parent about problem behavior
How far in school parent expects 10th-grader will go
ELS:2002-PISA:2003 concordance math score
How often uses computer at friend's house
Ever worked for pay not around house
Socio-economic status composite, Version 1
Socio-economic status composite, Version 2
How often student is absent (math)
How often use computer for administrative records (math)
Teacher-Student relations (Positive or Negative)
Standardized test composite score-math/reading

Random Forests:
Socio-economic status composite, Version 2
Number of grades repeated (K-10)
Standardized test composite score-math/reading
ELS-NELS 1992 scale equated sophomore math score
ELS-NELS 1992 scale equated sophomore reading score
ELS:2002-PISA:2003 concordance math score
ELS:2002-PISA:2000 concordance reading score
Math estimated number right
Math test standardized score
Mathematics proficiency probability at level 1
Mathematics proficiency probability at level 2
Mathematics proficiency probability at level 3
Reading estimated number right
Reading test standardized score
Reading proficiency probability at level 1
Reading proficiency probability at level 2
How many times cut/skip classes
How far in school student thinks will get
How often student is absent (math)

LASSO:
Total family income from all sources 2001-composite
Socio-economic status composite, Version 2
Standardized test composite score-math/reading
Mathematics proficiency probability at level 2
Recognized for good grades
How many times cut/skip classes
How many times put on in-school suspension
1st friend's grade level at school
10th-grader ever had behavior problem at school
Parent's satisfaction with 10th-grader's education up to now
How far in school parent wants 10th-grader to go
How often student completes homework (math)
How often student is absent (math)
How many people congregated in area around school