Running head: CLASSIFICATION AND PREDICTION ANALYSIS OF STUDENT ALCOHOL CONSUMPTION 1
CLASSIFICATION AND PREDICTION ANALYSIS OF
STUDENT ALCOHOL CONSUMPTION
by
Venkadesan Gandhi Rajan
CS 7720 Data Mining Course Project
TABLE OF CONTENTS
ABSTRACT
INTRODUCTION
BACKGROUND OF THE PROBLEM
STATEMENT OF THE PROBLEM
PURPOSE OF THE STUDY
SIGNIFICANCE OF THE STUDY
NATURE OF THE STUDY
RESEARCH QUESTIONS
HYPOTHESES
THEORETICAL FRAMEWORK
DATA ANALYSIS
CONCLUSION
SUMMARY
REFERENCES
APPENDIX A
APPENDIX B
APPENDIX C
ABSTRACT
This project applies Data Mining (DM) techniques in WEKA (Srivastava, 2014) to the problem
of alcohol consumption among secondary school students. The results show that good
predictive accuracy can be achieved, given that alcohol consumption can influence student
performance. Several attributes are associated with a student's drinking, and they are
classified using algorithms such as Bagging, k-nearest neighbor (IBk), Random Forest and
Decision Tree (Kabakchieva, 2013), which are then compared on predictive accuracy. Other
attributes are associated with positive outcomes, and students can be encouraged to follow
them in order to do well in life. Based on the project's outcome, tools for classification
and prediction analysis (Cortez & Silva, 2008) can be developed to give students more
attention and to show how seriously alcohol affects a student's life.
Keywords: student alcohol consumption, decision tree, prediction, accuracy,
classification
INTRODUCTION
Is consuming alcohol good or bad? Some people drink to get away from stress, and some drink
excessively while dwelling on the problems they face in life. Alcohol has many harmful
effects. Drinking heavily on a single occasion or over a longer period is injurious to
health (Benton, Benton & Downey, 2006). If a student drinks regularly, the damage to health
accumulates day by day and can shorten the life span. Alcohol has many short-term and
long-term health effects. Drinking during the teenage years lessens a young person's
psychological and physical capabilities, which in turn affects decision-making and
self-management skills and can later lead to anxiety. As the level of drinking rises, the
mind is affected more drastically, and so are bodily functions such as breathing (Engs,
1977). Drinkers are more likely than non-drinkers to be hurt or to have accidents, and a
drinker can also be injured as a passenger in a drink-driving incident.
Alcohol consumption is restricted by age limits that differ from country to country.
According to published sources (Toomey, Rosenfeld & Wagenaar, 1996), the minimum age is 21
years in the USA, 18 years in China and 20 years in Japan. Drinking alcohol is illegal in
Saudi Arabia and can be punished with imprisonment. Despite these restrictions, alcohol
consumption remains a threat to students in every country. A student's development is
affected by factors such as past failures and free time spent with friends. Male students
are the group most affected by alcohol consumption (Schaeffer & Parsons, 1986). When a
child drinks alcohol, their judgement is impaired. While heavy drinking by teenagers is a
problem in its own right, it is at times related to other harmful behaviors such as taking
illegal drugs.
In this project, a student alcohol dataset is taken and classification algorithms are
tested to predict the behavior of a student consuming alcohol based on several attributes
(Cortez & Silva, 2008).
BACKGROUND OF THE PROBLEM
Data Mining is widely used in education to improve the understanding of the educational
process and to focus on identifying, extracting and evaluating variables related to the
learning process of students. A recent study of alcohol consumption (Pham, Clough, Nguyen,
Kim & Buettner, 2009) suggests that many of the students involved are affected in their
personal development and suffer short-term and long-term health effects. A child may be
drawn to alcohol during puberty, when there are problems with family or school friends, or
out of a wish to prove that he or she is grown up. Even for a child who is strongly focused
on studies, alcohol can look like an attractive way to relieve stress. Underage drinkers
(Bernstein, Wood & Colby, 2016) often suffer from a variety of health problems, including
weight gain or loss, bad skin, disturbed sleep and pains. In childhood and the teenage
years, the brain is still developing. Adding alcohol to that process is asking for trouble
and can affect memory function, reactions, learning ability and attention span, all of
which are especially important during the school years. Drinking could affect children's
performance at school and prevent them from reaching their full potential. Young people who
drink heavily are also more likely to have disturbed mental health, and even to self-harm.
Every parent wants their child to do their best, and doing well at school plays a big part
in that. The statistics show that underage drinking makes that less likely.
STATEMENT OF THE PROBLEM
According to the research (Pham, Clough, Nguyen, Kim & Buettner, 2009), alcohol
consumption leads secondary school students to poorer grades, to missed classes and, in the
worst case, to expulsion. With less self-control, the brain struggles to recognize warning
signs, which in turn leads to violence and fights. The risk of being involved in violence
and serious vandalism rises in line with alcohol consumption and can lead to arrest and a
criminal record. In this project, student alcohol consumption is predicted from the alcohol
and absence attributes in the dataset. Several popular Data Mining algorithms are applied
for classification and prediction, and the best-suited algorithm is the one that yields the
highest accuracy (Fielding, 2007). In the dataset used (Lichman, 2013), the attributes play
a major role in predicting student behavior.
PURPOSE OF THE STUDY
The main purpose of the study is to predict alcohol consumption among secondary school
students by adding an attribute "Alco" to the dataset, and to find the classification
algorithm (Kumar, Ghosh & Crawford, 2001) with the best accuracy. A further purpose is to
help reduce the number of students who drink by acting on the predictions. The predictions
are made from the attributes available in the dataset. The study could be extended to list
the students who consume alcohol according to certain factors. Specifically, student
behaviour is analysed through attributes such as age, sex, free time, absences and
failures. Data Mining approaches are applied to this dataset using the WEKA tool
(Srivastava, 2014) to pre-process, predict, classify, cluster and visualize the data.
SIGNIFICANCE OF THE STUDY
The research study (Biesiada & Duch, 2007) provides Data Mining techniques for computing
correlations and applying filters to a dataset. A supervised filter, one of the
pre-processing methods, is applied to the attributes and instances of the dataset. In
supervised learning (Rokach & Maimon, 2009), both normal objects and anomalies are
modelled, so that any outlier that is detected can be collected. The normal objects are
modelled by training a classifier. The work then proceeds through data pre-processing, data
classification and data clustering to analyse student behaviour related to alcohol
consumption. The clustering-based approach (Berkhin, 2006) is also an outlier detection
method: it finds large clusters and small clusters and collects them according to the
requirements of the dataset.
NATURE OF THE STUDY
This project studies a student dataset with several attributes, using Data Mining
techniques to measure how student performance differs with and without alcohol
consumption. The WEKA tool (Srivastava, 2014) contains several classification algorithms,
and the prediction accuracy of each can differ widely depending on the dataset. The
attributes are cleaned and then used for classification and clustering in order to find the
best predictive model for the dataset with respect to school-related factors and student
performance. The study is ultimately intended to help reduce the number of secondary
school students who drink.
RESEARCH QUESTIONS
Students' drinking behavior has some inherently undesirable effects on the educational and
social outcomes of college and university. For this study, students are divided into two
groups: those who drink and those who abstain. The research questions are:
(1) How accurately do prediction and classification algorithms work on student alcohol
consumption data?
(2) Are the difficulties more noticeable for students from a particular group? How do the
alcohol consumption results of male and female students compare? Do abstainers suffer more
than drinkers?
(3) Does exposure to the effects of drinking have the same consequences for grades and
achievement for all students, or do the consequences differ between abstainers and drinkers?
HYPOTHESES
The grades of students who drink alcohol are affected by factors such as free time and
absence from school.
(1) Drinking behavior will diminish the satisfaction of attending a university or college to a
greater extent for students who abstain from alcohol than for those who drink.
(2) Students who abstain from alcohol are more likely to be anxious about their academic
grades than students who drink.
THEORETICAL FRAMEWORK
Data model
Three types of data model are commonly used: the star schema, the snowflake schema and the
fact constellation (Cabibbo & Torlone, 1998). The star schema is used for this dataset,
which means that it consists of a fact table with multiple dimension tables. The student
dataset has several attributes, some of which serve as key attributes.
OLAP Operations
OLAP (Online Analytical Processing) works on a multi-dimensional model (Codd & Salley,
1993) in which each dimension contains multiple levels of abstraction organized as concept
levels. Its usual operations include drill-down, roll-up, slice and dice, pivot and
drill-through. These operations make it possible to view the data along multiple
dimensions. Bitmap indexing is also used, as it is the most popular indexing method in OLAP
products for searching values in a data cube. The base table holds the attribute values,
and the index table holds a vector of 0s and 1s for each distinct value, which is used to
find the matching records in the data cube.
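The bitmap indexing idea can be illustrated with a minimal Python sketch. It is not taken from the project or from any OLAP product: the function names build_bitmap_index and rows_matching, and the small sex and school columns, are made up for illustration. Each distinct value of a base column gets one vector of 0s and 1s, and a query that combines two conditions reduces to a bitwise AND of two vectors.

```python
def build_bitmap_index(column):
    """Map each distinct value to a list of 0/1 flags, one flag per row."""
    index = {}
    for row, value in enumerate(column):
        bits = index.setdefault(value, [0] * len(column))
        bits[row] = 1
    return index


def rows_matching(bitmap):
    """Return the row numbers whose flag is set in the bitmap."""
    return [row for row, bit in enumerate(bitmap) if bit]


# Hypothetical base table columns (illustrative values only).
sex = ["F", "M", "M", "F", "M"]
school = ["GP", "GP", "MS", "MS", "GP"]

sex_index = build_bitmap_index(sex)
school_index = build_bitmap_index(school)

# "sex = M AND school = GP" becomes a bitwise AND of two bit vectors.
both = [a & b for a, b in zip(sex_index["M"], school_index["GP"])]
matching_rows = rows_matching(both)  # rows 1 and 4
```

The index costs one bit per row for every distinct value, which is why this structure suits attributes with few distinct values.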
Classification methods
Classification (Bauer & Kohavi, 1999) is one of the most important and most frequently used
Data Mining and Machine Learning methods. It assigns the instances of a given dataset to
classes and predicts the value of one attribute from the other attributes in the dataset.
Numerous classification methods can be used to build a predictive model, including Bayes
classification, Random Forest, Decision Tree, Support Vector Machine (SVM) and many more.
In this project, the accuracy of each algorithm is measured when predicting the alcohol and
absence attributes from the other attributes. The algorithm that best fits the dataset is
the one whose output has the highest proportion of correctly classified instances.
Bayes classification (Jordan, 2002) is based on conditional probability: it estimates the
likelihood of a class under the assumption that the attributes in the dataset are
independent of one another. Random Forest is one of the popular data mining classifiers. It
is a group of unpruned decision trees. Each tree is built using a random selection of
attributes from the training samples, and the forest's prediction is obtained by combining
the outputs of the trees, for example by taking their mean. The Decision Tree (DT) is a
branching model that represents a set of rules, distinguishing values in a hierarchical
form. This representation can be translated into a set of IF-THEN rules, which are easy for
people to understand. Among decision tree learners, the C4.5 algorithm (Kretschmann,
Fleischmann & Apweiler, 2001) is an effective choice for building a classification model;
it is also known as a statistical classifier. The Support Vector Machine (SVM) is a method
that maps the original data into a higher-dimensional space. In this project, these
algorithms are run using the WEKA tool (Srivastava, 2014) and the results show good
predictive accuracy.
Linear Regression and Outliers
For regression, the plan was to represent the instances at the leaves of the decision tree
using linear functions. This topic seemed especially interesting given the kind of data
being modeled.
Statisticians (Witten & Frank, 2005) address the problems caused by noisy data in linear
regression by checking the data for outliers and removing them manually. In linear
regression, outliers can be identified visually. Outliers dramatically affect ordinary
least-squares regression because the squared distance measure accentuates the influence of
points far from the regression line. Statistical methods (Witten & Frank, 2005) that
address the problem of outliers are called robust. Regression can be made robust in the
following ways:
1. Use an absolute-value distance measure instead of the usual squared one. This weakens
the effect of outliers.
2. Identify outliers automatically and remove them from consideration. Example: form a
regression line and remove from consideration the 10% of points that lie furthest from the
line.
3. Minimize the median (rather than the mean) of the squares of the divergences from the
regression line. This technique copes with outliers in both the X-direction and the
Y-direction.
There is a serious drawback to these techniques: their high computational cost, which can
make them infeasible in practice. In this project, unfortunately, there is no linear
regression at the leaves. Later in this report the numeric attributes are converted to
nominal ones.
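As an illustration of the second approach in the list above, the following Python sketch fits an ordinary least-squares line, drops the 10% of points furthest from it, and refits. It was not part of the project's WEKA workflow; the function names fit_line and trimmed_fit and the ten data points are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def trimmed_fit(points, trim_fraction=0.10):
    """Fit once, discard the points furthest from the line, then refit."""
    a, b = fit_line(points)
    # Rank points by absolute residual from the first fit, closest first.
    ranked = sorted(points, key=lambda p: abs(p[1] - (a + b * p[0])))
    keep = len(points) - int(round(trim_fraction * len(points)))
    return fit_line(ranked[:keep])


# Made-up data: nine points on y = 2x + 1 plus one gross outlier at x = 9.
points = [(x, 2.0 * x + 1.0) for x in range(9)] + [(9, 60.0)]

plain_a, plain_b = fit_line(points)          # pulled towards the outlier
robust_a, robust_b = trimmed_fit(points)     # recovers intercept 1.0, slope 2.0
```

Trimming needs two fits instead of one, which gives a small taste of the extra computational cost mentioned above; the least-median-of-squares approach is considerably more expensive.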
Learning and Prediction
In data mining (Han, Kamber & Pei, 2011), classification methods play an important role in
building a predictive model. The WEKA tool (Srivastava, 2014) learns from training data and
is evaluated on test data. The two are obtained with a percentage split, in which one part
of the data is used for training and the remainder for testing. Several attributes are
considered, each is treated in turn as the class attribute, and the different
classification algorithms are applied to predict it.
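A percentage split can be sketched in a few lines of Python. This is an illustration of the idea only, not WEKA's own implementation; the function name percentage_split, the 66% training share and the seed are assumptions chosen for the example, and the instances are stand-in integers.

```python
import random


def percentage_split(instances, train_percent=66.0, seed=1):
    """Shuffle the instances, then cut them into a training and a test part."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_percent / 100.0)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 1044 student records: one integer id per instance.
instances = list(range(1044))
train, test = percentage_split(instances)
```

Shuffling before cutting matters when the file is ordered, for example when the records of one class or one school come first.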
DATA ANALYSIS
Dataset description
The data set (Lichman, 2013) concerns Portuguese students and was collected by Paulo Cortez
and Alice Silva of the University of Minho, Portugal. In Portugal, secondary education
consists of 3 years of schooling, preceded by 9 years of basic education and followed by
higher education. Most students join the public and free education system. The study
(Cortez & Silva, 2008) collected data during the 2005-2006 school year from two public
schools located in the Alentejo region of Portugal. The questionnaires given to the
students covered several attributes, such as sex, age, mother's education, family income,
social/emotional activities and class failures, that were expected to affect student
performance. Finally, the data was integrated (Ceruti & Kamel, 1999) into two data sets,
one for the Mathematics class (395 instances) and one for the Portuguese language class
(649 instances).
Data to be mined
In this project, the Student Alcohol Consumption dataset is taken from the UCI Machine
Learning Repository (Lichman, 2013). Data mining can work on several kinds of data. Here a
flat file is used: the data comes as a CSV file, which is converted to an ARFF file for use
in WEKA. The flat file is then mined using various classification algorithms.
Data Preprocessing
Real-life databases are highly susceptible to inconsistent, missing and noisy data
(García, Luengo & Herrera, 2015) because of the abundance of information they now hold
(often several gigabytes or more) and their likely origin in multiple, heterogeneous
sources. The quality of the data is very important for the mining results. Low-quality data
should be filtered out efficiently using a supervised filter (Rokach & Maimon, 2009). Data
pre-processing methods are used to clean the data, removing noise and correcting
inconsistencies, and to combine the two data sets through data integration.
There are many possible reasons for inaccurate data (i.e., data having incorrect attribute
values). The data collection instruments may be faulty. There may be human or system errors
at data entry. Sometimes students may deliberately submit incorrect values for mandatory
fields when they do not wish to give personal information (e.g., by leaving the default
value "January 1" for their birthday). This is known as disguised missing data. In this
project, the data set (Lichman, 2013) was collected by a Portuguese professor; it was
checked for missing values and none were found.
During this stage (García, Luengo & Herrera, 2015), some attributes were discarded because
they lacked discriminative value. For example, few students answered the question about
family income (probably for privacy reasons), whereas nearly 100% of the students live with
their parents and have a personal computer at home.
Attribute selection
The aim is to identify secondary school students who drink alcohol. The dataset contains
two alcohol-related attributes: the first is alcohol consumption on weekdays (Dalco) and
the second is alcohol consumption at the weekend (Walco). The predictive analysis, however,
is limited to a single attribute (Alco) describing a student's alcohol consumption over the
whole week. Merging the two attributes accordingly yields the following equation:
Alco = (Walco × 2 + Dalco × 5) / 7
The other important attribute is the student's absence rate at school. The assumption is
that a student who is frequently absent from school drinks more alcohol. This attribute
(Fabs) is therefore made binary: its value is 0 if the student is frequently absent (more
than 10 days) and 1 otherwise. Prediction of the absence rate is done using WEKA. For data
reduction, a concatenation component merges the two data sets (Lichman, 2013), student-mat
and student-por, into a new dataset, and the binary values of Alco and Fabs are converted
to nominal values so that correlations can be computed.
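The two derived attributes can be sketched in Python as follows. The sketch reads the merging equation as a weighted weekly average (two weekend days, five weekdays, divided by seven), which is an interpretation of the formula rather than something the text spells out. The function names weekly_alcohol and frequent_absence_flag and the sample records are hypothetical; only the attribute names Dalco, Walco, Alco and Fabs and the 10-day threshold come from the text.

```python
def weekly_alcohol(dalco, walco):
    """Weighted weekly average: 5 weekdays (Dalco) and 2 weekend days (Walco)."""
    return (walco * 2 + dalco * 5) / 7


def frequent_absence_flag(absences, threshold=10):
    """Fabs as described in the text: 0 if absent more than 10 days, else 1."""
    return 0 if absences > threshold else 1


# Made-up records for illustration: (Dalco, Walco, absences).
records = [(1, 1, 0), (1, 3, 4), (5, 5, 16)]

derived = [
    {"Alco": weekly_alcohol(dalco, walco), "Fabs": frequent_absence_flag(absences)}
    for dalco, walco, absences in records
]
```

Because weekdays carry five sevenths of the weight, a student who drinks heavily only at the weekend still ends up with a fairly low Alco value.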
Functionalities of data
Data mining has several important functionalities (Han, Kamber & Pei, 2011), such as
characterization, classification, prediction, clustering and outlier analysis. Although
many concepts and classes could be derived for characterization and discrimination, in this
dataset the classes are the schools and the concepts are Drinkers and Non-drinkers (Wiers,
Hoogeveen, Sergeant & Gunning, 1997). Characterization summarizes the data of the class
under study, whereas discrimination compares the data of different classes.
Linear correlation is computed, and attributes whose correlation value is below the
threshold of 0.35 are filtered out. This removed four attributes (Fed; G1; G2; G3) that are
not correlated with Alco. Backward feature elimination is then performed in a loop with
cross-validation (Kohavi, 1995). Cross-validation is used to test the data with Random
Forest. The same procedure is used again to perform prediction and to test the result. The
last data pre-processing step (Ceruti & Kamel, 1999) applies the backward feature
elimination filter to reduce the original data. Here 11 attributes that are not correlated
with the Alco attribute are filtered out.
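The correlation filter can be illustrated with a short Python sketch. It is a simplified stand-in for the tool used in the project, not a reproduction of it: the function names pearson and filter_by_correlation are hypothetical, the toy columns are made up, and the only value taken from the text is the 0.35 threshold. Attributes are kept when the absolute Pearson correlation with the target reaches the threshold.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def filter_by_correlation(columns, target, threshold=0.35):
    """Keep the attribute names whose |r| with the target is at least threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Toy data for illustration only; these are not values from the dataset.
alco = [1, 2, 3, 4, 5]
columns = {
    "freetime": [1, 2, 2, 4, 5],   # rises with alco, so it is kept
    "G1": [10, 12, 9, 12, 10],     # no linear relation to alco, so it is dropped
}
kept = filter_by_correlation(columns, alco)
```

Pearson correlation only captures linear relationships, so an attribute dropped by this filter may still carry information that a tree-based classifier could use.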
Statistical presentation
The statistical description of the data covers measures of central tendency (Wilcox &
Keselman, 2003), the range, the interquartile range (IQR) and the variance. The mean is the
average of the values in the dataset. The median is the middle value, which separates the
data into two halves. The mode is the most frequently occurring value in the dataset. The
range is the difference between the maximum and minimum values in the dataset. The IQR is
the difference between the third quartile and the first quartile. The variance is a measure
of data dispersion; it is the square of the standard deviation. The dataset (Lichman, 2013)
has several attributes, and each attribute is either numeric or nominal. The statistical
description of the numeric attributes, as computed in WEKA, is shown in the Appendix.
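The measures defined above can be computed with Python's standard statistics module, as in the following sketch. The function name describe and the sample of ages are made up for illustration and are not the values reported in the Appendix. Quartile conventions differ between tools, so the IQR from this sketch may differ slightly from the value another package reports for the same data.

```python
import statistics


def describe(values):
    """Return the summary measures discussed above for one numeric attribute."""
    ordered = sorted(values)
    q1, _, q3 = statistics.quantiles(ordered, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
        "variance": statistics.variance(ordered),  # sample variance (n - 1)
        "stdev": statistics.stdev(ordered),        # sample standard deviation
    }


# Made-up sample of student ages, for illustration only.
ages = [15, 16, 16, 17, 18, 18, 18, 19, 22]
summary = describe(ages)
```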
List of attributes
An attribute represents a characteristic of an object. There are several types of
attributes (Han, Kamber & Pei, 2011), depending on the values they take: nominal, numeric,
binary and ordinal. Nominal attributes take the names of things, or symbols, as values.
Numeric attributes take integer or real values. Binary attributes have only two possible
states (0 or 1). Ordinal attributes have values with a meaningful order. In this project,
the student alcohol consumption dataset has numeric, nominal and binary attributes. In the
WEKA tool (Srivastava, 2014), a class attribute is selected and the algorithms are then run
to classify and predict that attribute with good accuracy.
Visualization of data
The dataset (Lichman, 2013) has 1044 instances and 33 attributes. The data has no redundant
information. The data is visualized in WEKA, where the 33 attributes are normalized,
clusters are formed over the attributes, and the classifier is applied for prediction. Box
plots and histograms drawn in R show that the data is of good quality, with no missing
values and no duplicate information. Data Mining (Han, Kamber & Pei, 2011) is the process
of discovering new patterns and insights in data. As the volume of data collected and
stored in files grows, there is a growing need to summarize data, identify important
patterns and trends, and act upon the findings.
CONCLUSION
Education is a vital component of society. Business Intelligence (BI)/Data Mining (DM)
techniques, which allow a high-level extraction of knowledge from raw data, offer
interesting possibilities for the education domain. In particular, several studies have
used BI/DM methods (Han, Kamber & Pei, 2011) to reduce the rate of alcohol addiction among
young people and to improve children's quality of life. In this work, the prediction of
alcohol addiction among young people using past school records, demographic, family and
other related data is studied. Several DM goals and DM methods were tested. There are some
limitations: the number of folds used was only 10 for all classification algorithms, since
a larger number of folds takes more time and needs a more powerful computer to process the
data. In addition, only 25 models were taken for testing, since the value of some models is
very low. The results obtained show that it is possible to keep children away from alcohol.
This confirms the conclusion that a child's behavior is strongly influenced by friends or
peer groups. Nevertheless, an analysis of the knowledge provided by the best predictive
models has shown that, in some cases, there are other relevant factors, such as
school-related, demographic and social variables. This work is based on an offline study:
all of the methods were applied to a data set collected by other people. However, there is
potential for an automatic online learning environment, which by using the Internet could
discover more knowledge about underage alcohol addiction. This is left for anyone who wants
to work more deeply on this topic and overcome the limitations of this work. More research
and a larger data set are also needed in order to understand why and how certain variables
influence children and make them dependent on alcohol.
SUMMARY
Finally, WEKA (Srivastava, 2014) was run on the student data set and results were obtained.
The results show that the classification algorithms work effectively and with good
accuracy, and the prediction is also analyzed with regression algorithms. Prediction works
well when the Alco attribute is used as the class attribute. In the dataset, the "Alco"
attribute is an important factor in the predictive model. The accuracy of all the
classification algorithms (Cortez & Silva, 2008) was measured: the k-nearest neighbor
algorithm performs best, with an accuracy of 87.3563%. The other algorithms, J48, Random
Forest and SMO, yield 74.0421%, 72.1264% and 74.3295% respectively. The regression
algorithms Logistic and Simple Logistic achieve 64.6552% and 73.3716% respectively.
Gender is one of the most significant attributes for a drinker. Males are the most involved
with alcohol, which agrees with a 2014 alcohol report showing that men consume more alcohol
than women. The other significant characteristic of drinkers is social activity. An
individual who often goes out with friends and cousins consumes more alcohol. This is
because drinking has become a way to celebrate good news or to fit into society, and also a
way to stand out, as people are typically more outgoing when they are drunk. Other
important features are more free time and less study time, because such students usually
find in alcohol a way to relax and escape from problems.
REFERENCES
Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms:
Bagging, boosting, and variants. Machine Learning, 36(1-2), 105-139.
Benton, S. L., Benton, S. A., & Downey, R. G. (2006). College Student Drinking, Attitudes Toward
Risks, and Drinking Consequences. Journal of Studies on Alcohol, 67(4),
543-551. doi:10.15288/jsa.2006.67.543
Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping multidimensional
data (pp. 25-71). Springer Berlin Heidelberg.
Bernstein, M. H., Wood, M. D., & Colby, S. M. (2016). A new paradigm for credibly administering
placebo alcohol to underage drinkers. Addictive Behaviors, 52, 22-27.
doi:10.1016/j.addbeh.2015.08.004
Biesiada, J., & Duch, W. (2007). Feature Selection for High-Dimensional Data — A Pearson
Redundancy Based Filter. Advances in Soft Computing Computer Recognition Systems 2,
242-249. doi:10.1007/978-3-540-75175-5_30
Blaheta, D. (2002, July). Handling noisy training and testing data. In Proceedings of the ACL-02
conference on Empirical methods in natural language processing-Volume 10 (pp. 111-116).
Association for Computational Linguistics.
Cabibbo, L., & Torlone, R. (1998). A logical approach to multidimensional databases. In Advances in
Database Technology—EDBT'98 (pp. 183-197). Springer Berlin Heidelberg.
Ceruti, M. G., & Kamel, M. N. (1999). Preprocessing And Integration Of Data From Multiple
Sources For Knowledge Discovery. International Journal on Artificial Intelligence Tools,
8(2), 157-177. doi:10.1142/s0218213099000117
Cortez, P., & Silva, A. (2008). Using Data Mining to Predict Secondary School Student Performance.
In Brito and Teixeira Eds., Proceedings of 5th Future Business Technology Conference
(FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7
Engs, R. C. (1977). Drinking patterns and drinking problems of college students.
Journal of Studies on Alcohol, 38(11), 2144-2156. doi:10.15288/jsa.1977.38.2144
Fielding, A. H. (2007). Classification accuracy. Cluster and Classification Techniques for the
Biosciences, 179-199. doi:10.1017/cbo9780511607493.008
García, S., Luengo, J., & Herrera, F. (2015). Data Preprocessing in Data Mining. Intelligent Systems
Reference Library. doi:10.1007/978-3-319-10247-4
Hall, M. A., & Holmes, G. (2003). Benchmarking attribute selection techniques for discrete class data
mining. Knowledge and Data Engineering, IEEE Transactions on, 15(6), 1437-1447.
Han, J., Kamber, M., & Pei, J. (2011). Data mining: concepts and techniques. Elsevier.
Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of
logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14, 841.
Kabakchieva, D. (2013). Predicting Student Performance by Using Data Mining Methods for
Classification. Cybernetics and Information Technologies, 13(1). doi:10.2478/cait-2013-0006
Kohavi, R. (1995, August). A study of cross-validation and bootstrap for accuracy estimation and
model selection. In IJCAI (Vol. 14, No. 2, pp. 1137-1145).
Kretschmann, E., Fleischmann, W., & Apweiler, R. (2001). Automatic rule generation for protein
annotation with the C4.5 data mining algorithm applied on SWISS-PROT.
Bioinformatics, 17(10), 920-926.
Kumar, S., Ghosh, J., & Crawford, M. M. (2001). Best-bases feature extraction algorithms for
classification of hyperspectral data. Geoscience and Remote Sensing, IEEE Transactions
on, 39(7), 1368-1379.
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA:
University of California, School of Information and Computer Science
Pham, D. B., Clough, A. R., Nguyen, H. V., Kim, G. B., & Buettner, P. G. (2009). Alcohol
consumption and alcohol-related problems among Vietnamese medical students. Drug and
Alcohol Review, 29(2), 219-226. doi:10.1111/j.1465-3362.2009.00094.x
Rokach, L., & Maimon, O. (2009). Supervised Learning. Data Mining and Knowledge Discovery
Handbook, 133-147. doi:10.1007/978-0-387-09823-4_8
Schaeffer, K. W., & Parsons, O. A. (1986). Drinking practices and neuropsychological test
performance in sober male alcoholics and social drinkers. Alcohol, 3(3), 175-179.
doi:10.1016/0741-8329(86)90041-8
Srivastava, S. (2014). Weka: A Tool for Data preprocessing, Classification, Ensemble, Clustering and
Association Rule Mining. International Journal of Computer Applications, 88(10), 26-29.
doi:10.5120/15389-3809
Tong, S., & Koller, D. (2002). Support vector machine active learning with applications to text
classification. The Journal of Machine Learning Research, 2, 45-66.
Toomey, T. L., Rosenfeld, C., & Wagenaar, A. C. (1996). The minimum legal drinking age: History,
effectiveness, and ongoing debate. Alcohol Research and Health, 20(4), 213.
Underage Drinking. (n.d.). Retrieved April 26, 2016, from
http://www.niaaa.nih.gov/alcohol-health/special-populations-co-occurring-disorders/
underage-drinking
Wiers, R. W., Hoogeveen, K. J., Sergeant, J. A., & Gunning, W. B. (1997). High- and low-dose
alcohol-related expectancies and the differential associations with drinking in male and
female adolescents and young adults. Addiction, 92(7), 871-888.
Wilcox, R. R., & Keselman, H. J. (2003). Modern robust data analysis methods: measures of central
tendency. Psychological methods, 8(3), 254.
Witten, I. H., & Frank, E. (2005). Data Mining: Practical machine learning tools and techniques.
Morgan Kaufmann.
APPENDIX A:
Figure 1 Architecture of the student alcohol prediction
[Flow diagram: Student Maths dataset + Student Portuguese dataset -> Student Dataset (arff file) -> Attribute selection (Supervised filter) -> Classification by WEKA (SVM, Decision tree and Random forest) -> Prediction analysis and results]
In this project, the two datasets (Lichman, 2013), from Mathematics and Portuguese
students, are integrated into a single student dataset, which is converted from a CSV file to
the ARFF file format supported by WEKA. The dataset contains several attributes. Using
attribute selection (Hall & Holmes, 2003) as a supervised filter, the alcohol attribute is selected
and designated as the class attribute. The dataset is then classified with several classification
methods, namely Bagging, the K-nearest neighbor algorithm, Support Vector Machine (SVM),
Decision tree and Random Forest, to predict the alcohol consumption behavior of a student.
Each step of this process is executed in WEKA (Srivastava, 2014), and each algorithm
classifies the data in its own way. The Decision tree classifies the data with IF-THEN rules,
from which the attributes are predicted and analyzed. The Support Vector Machine (SVM)
normalizes each attribute value before classification, and its results are assessed with
Precision and Recall (Tong & Koller, 2002).
APPENDIX B:
Data preprocessing results
The dataset is preprocessed with the Attribute Selection filter to remove unwanted
noise. The evaluator is “CorrelationAttributeEval” and the search method is “Ranker”. The
expected outcome is a set of attributes without noise, so that the classification algorithms
predict with good accuracy. The attributes are treated as nominal so that the correctly and
incorrectly classified instances of each class can be reported.
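The idea behind ranking attributes by their correlation with the class can be sketched as follows. This is a simplified illustration, not WEKA's implementation. It assumes purely numeric attributes and ranks them by the absolute value of the Pearson correlation, whereas WEKA's evaluator also handles nominal attributes. The attribute names goout, age and absences and all values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no linear relationship with the class.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_attributes(columns, class_values):
    """Return (name, score) pairs sorted by absolute correlation, highest first."""
    scores = {
        name: abs(pearson(values, class_values))
        for name, values in columns.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical attribute columns and class values for five students.
columns = {
    "age": [15, 18, 16, 17, 15],
    "goout": [1, 2, 3, 4, 5],
    "absences": [0, 0, 0, 0, 0],
}
alcohol = [1, 2, 3, 4, 5]

ranking = rank_attributes(columns, alcohol)
for name, score in ranking:
    print("%-10s %.3f" % (name, score))
```

Attributes at the bottom of such a ranking are candidates for removal before classification. Where to cut the list is a judgment call that this sketch does not make.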
Figure 2 Preprocessing Results
Figure 3 Attribute Selection method
Figure 4 Mean value of Alco attribute
Figure 5 Classification algorithm of IBk (K-nearest neighbor) results
Figure 6 Classification algorithm of SMO (Sequential Minimal Optimization) results
Figure 7 Classification algorithm of J48 results
Figure 8 Classification algorithm of Random Forest results
Figure 9 Regression algorithm of Logistic results
Figure 10 Regression algorithm of Simple Logistic results
APPENDIX C:
Graph 1 of Attribute distribution