KDD Cup 2014: Predicting Excitement at DonorsChoose.org
{100277E, 100131D, 100381R, 100393F} Department of Computer Science and Engineering,
University of Moratuwa, Sri Lanka.
Introduction
DonorsChoose.org is an online charity that lets teachers publish classroom project requests and gives willing donors an opportunity to support schools in need. DonorsChoose.org wants to know which of the projects proposed by teachers are exciting, so the task is to train a model that predicts whether a given project is exciting. The guidelines for determining whether a project is exciting are given in [1]. A project is considered exciting if it meets all of the following criteria:
● was fully funded (fully_funded)
● had at least one teacher-acquired donor (at_least_1_teacher_referred_donor)
● has a higher than average percentage of donors leaving an original message (great_chat)
● has at least one "green" donation (at_least_1_green_donation)
● has one or more of:
    ○ donations from three or more non-teacher-acquired donors (three_or_more_non_teacher_referred_donors)
    ○ one non-teacher-acquired donor gave more than $100 (one_non_teacher_referred_donor_giving_100_plus)
    ○ the project received a donation from a "thoughtful donor" (donation_from_thoughtful_donor)
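The criteria above combine into a single boolean rule. The following is a minimal sketch of that rule, assuming the outcome flags are available as plain booleans; the Outcome class and the is_exciting function are illustrative names, not part of the competition data or of our pipeline.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Hypothetical container for the outcome flags of one project."""
    fully_funded: bool
    at_least_1_teacher_referred_donor: bool
    great_chat: bool
    at_least_1_green_donation: bool
    three_or_more_non_teacher_referred_donors: bool
    one_non_teacher_referred_donor_giving_100_plus: bool
    donation_from_thoughtful_donor: bool


def is_exciting(o: Outcome) -> bool:
    # All four mandatory criteria must hold.
    required = (
        o.fully_funded
        and o.at_least_1_teacher_referred_donor
        and o.great_chat
        and o.at_least_1_green_donation
    )
    # At least one of the three optional criteria must hold.
    one_or_more = (
        o.three_or_more_non_teacher_referred_donors
        or o.one_non_teacher_referred_donor_giving_100_plus
        or o.donation_from_thoughtful_donor
    )
    return required and one_or_more
```

A project that meets the four mandatory criteria but none of the three optional ones is therefore not exciting under this rule.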
Data
The data provided by Kaggle is in relational format and is split by date. It contains details of the projects, the donations, the resources requested by the projects, and the essay statements. Kaggle treats every project posted after 2014-01-01 as test data; donation and outcome details are not available for those projects.
The following data files are provided for training and testing the models.
● donations.csv: information about the donations made to the projects in the training set.
● projects.csv: information about the projects in both the training and test sets.
● sampleSubmission.csv: project ids for the test set, and the submission format as a guide to the competitors.
● resources.csv: information about the resources requested for each project. This file contains a large set of attributes that describe the resources requested by teachers.
● essays.csv: essays written by teachers for the proposed projects. These are the statements in which teachers explain the need for the resources and the importance of receiving them.
● outcomes.csv: information about the outcomes of the projects in the training set.
Data integration
For data integration we used the Pandas Python library. It provides R-like data frames that support SQL-like relational operations (joins, grouping, etc.) and convenient handling of CSV files. We combined essays.csv, projects.csv, outcomes.csv and resources.csv to obtain the necessary data. To merge resources with projects, we first grouped the resources by project id. To get the total resource cost of a project, we multiplied each item's unit price by its quantity and summed the products within each group.
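The grouping and merging step described above can be sketched as follows. This is an illustration on toy data frames rather than our actual script; the column names projectid, item_unit_price and item_quantity are assumptions about the CSV layout, and total_resource_cost is a name introduced here.

```python
import pandas as pd

# Toy stand-ins for resources.csv and projects.csv (column names assumed).
resources = pd.DataFrame({
    "projectid": ["p1", "p1", "p2"],
    "item_unit_price": [10.0, 2.5, 100.0],
    "item_quantity": [3, 4, 1],
})
projects = pd.DataFrame({
    "projectid": ["p1", "p2", "p3"],
    "students_reached": [30, 25, 12],
})

# Cost of each requested item line: unit price times quantity.
resources["item_cost"] = resources["item_unit_price"] * resources["item_quantity"]

# One row per project with the summed cost of all its resources.
total_cost = (
    resources.groupby("projectid", as_index=False)["item_cost"]
    .sum()
    .rename(columns={"item_cost": "total_resource_cost"})
)

# Left join keeps projects that have no resource rows; their cost becomes 0.
merged = projects.merge(total_cost, on="projectid", how="left")
merged["total_resource_cost"] = merged["total_resource_cost"].fillna(0.0)
```

A left join is used so that projects without any resource rows are kept rather than silently dropped.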
Data Preprocessing
Data preprocessing was mainly needed for TF-IDF vector generation from the essay data. We first cleaned the essays to remove illegal characters such as '\r', which cause errors in TF-IDF vectorization if they are left in.
The following techniques were used to improve results from vectorization of essays.
● stop word removal
● lower casing
● stemming and lemmatization
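The cleaning steps listed above can be sketched with the standard library alone. This is a simplified illustration: the stop-word list is a tiny hypothetical sample, clean_essay is a name introduced here, and stemming and lemmatization are left out because they need an external library such as NLTK.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list we used).
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "for", "is", "are", "my", "our"}


def clean_essay(text: str) -> str:
    # Replace control characters such as \r and \n with spaces.
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    # Lower-case the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Drop stop words and rebuild a single space-separated string.
    return " ".join(t for t in tokens if t not in STOP_WORDS)
```

For example, clean_essay("My students need\r\nTHE new books, and pencils!") returns "students need new books pencils".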
Approach 1. TF-IDF Based Classifier (Essays only)
In this solution we used only the essays.csv file to determine whether a given project is_exciting. TF-IDF stands for Term Frequency-Inverse Document Frequency, a technique that gives higher weight to the terms that are significant in a text. The scikit-learn library has built-in functions that return TF-IDF vectors for a given input text. We tried both the essays and the need statements. In all cases we removed stop words and lower-cased the text.
● Essays with a maximum of 2000 features: 0.56531 area under the ROC curve (AUC)
● Essays with a maximum of 20000 features: 0.56724 AUC
We then tried to improve the model using stemming and lemmatization. In linguistic morphology, stemming is the process of stripping affixes from words to obtain a common base form, e.g. "stemming" → "stem". This reduces the number of distinct terms that share the same semantic base. We also used the lemmatizer provided by the NLTK Python library. NLTK has a WordNet-based lemmatizer that removes affixes only if the resulting word is in its dictionary. A lemmatizer is more advanced than a stemmer in the sense that it detects nontrivial base forms (women → woman, children → child). Results improved after applying lemmatization followed by a regular stemming step.
An n-gram (e.g. bigram) based technique for the essays would likely have improved the accuracy considerably, but computing n-gram sequences consumes more RAM than our systems had. We did manage to apply n-grams to the need statements, which lifted the accuracy from 0.51 to 0.53. The text-based classification model was trained using logistic regression.
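The essay-only classifier described above can be sketched with scikit-learn as a TF-IDF vectorizer feeding a logistic regression. The essays and labels below are made-up toy data, and the pipeline layout is an assumption about how the pieces fit together rather than our exact script; only max_features=2000, stop word removal and lower casing come from the description.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy essays and is_exciting labels, invented purely for illustration.
essays = [
    "My students need new books for our reading corner",
    "We need tablets so students can explore science apps",
    "Our class needs art supplies for a community mural",
    "Please help us buy basic paper and pencils",
    "Students need microscopes for hands-on biology labs",
    "We need printer ink and folders",
]
labels = [1, 1, 1, 0, 1, 0]

# Lower-case, drop English stop words, keep at most 2000 terms.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english", max_features=2000),
    LogisticRegression(),
)
model.fit(essays, labels)

# Probability of the positive (is_exciting) class for a new essay.
proba = model.predict_proba(["my students need books"])[:, 1]
```

Passing ngram_range=(1, 2) to TfidfVectorizer would add bigrams to the vocabulary, at the memory cost noted above.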
Approach 2. Regular features with different algorithms
In a later attempt we tried a different model, leaving out the essay details and learning from the other attributes, especially those in projects.csv. We trained several models using different sets of attributes, and used the Pandas Python library to manipulate the CSV files and feature vectors. One_Hot_DataFrame is a convenient way of converting categorical data into numerical attributes and producing feature vectors. For example, take the attribute poverty level = {highest poverty, high poverty, moderate poverty, low poverty}. The one-hot data frame creates one column per level and puts 1 in the column whose value is given for a project, while the other columns are zero. E.g., for a project with high poverty:
highest, high, moderate, low
0, 1, 0, 0
The essay length benchmark is 0.54531.
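The one-hot conversion of the poverty level attribute can be sketched in a few lines of plain Python. This illustrates the encoding itself and is not the One_Hot_DataFrame helper; the one_hot function is a name introduced here.

```python
POVERTY_LEVELS = ["highest poverty", "high poverty", "moderate poverty", "low poverty"]


def one_hot(value, categories):
    # One column per category: 1 where the value matches, 0 elsewhere.
    # A value outside the known categories yields an all-zero row.
    return [1 if value == category else 0 for category in categories]


row = one_hot("high poverty", POVERTY_LEVELS)
```

Here row is [0, 1, 0, 0], matching the example above.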
Approach 3. Hybrid approach (Regular Classifier + TF-IDF Classifier)
We built several models that combine the two approaches above. First, the TF-IDF vectors obtained from the essay data were appended to the regular feature vector. However, simply concatenating the two vectors degraded the accuracy: the higher accuracy obtained from the project features is diluted by the high-dimensional, less accurate TF-IDF vectors, so the overall accuracy fell below expectations. The problem was the way we merged the two vectors, so in our next attempt we devised a different way to combine them. We trained two separate models, one on the TF-IDF vectors and one on the project features. Each model outputs a probability. We then trained a third model that takes these two-valued tuples as input and outputs a real number that acts as the overall probability for a given project. The following figure illustrates this model. This method worked exceptionally well, and it gave us our highest place on the Kaggle leaderboard.
Figure 1. Two-tier hierarchical classification model
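A two-tier model of this kind can be sketched with scikit-learn as follows. The data is a made-up toy set, all three models are assumed to be logistic regressions, and the use of out-of-fold probabilities (cross_val_predict) to train the third model is a common precaution against leakage that we add here as an assumption; predict_excitement is a name introduced for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy essays, already-scaled numeric project features, and labels (all invented).
essays = [
    "students need books for reading", "we need printer ink",
    "science kits for curious students", "paper and folders please",
    "tablets for reading and science", "basic pencils and paper",
    "books and science kits for students", "ink and folders for the office",
]
X = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.2, 0.7],
              [0.7, 0.1], [0.3, 0.9], [0.9, 0.4], [0.1, 0.6]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Tier 1: one model on essay text, one on project features.
text_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
feature_model = LogisticRegression()

# Out-of-fold probabilities, so tier 2 never sees predictions made on training rows.
p_text = cross_val_predict(text_model, essays, y, cv=2, method="predict_proba")[:, 1]
p_feat = cross_val_predict(feature_model, X, y, cv=2, method="predict_proba")[:, 1]

# Tier 2: a third model trained on the two-valued probability tuples.
meta_X = np.column_stack([p_text, p_feat])
meta_model = LogisticRegression().fit(meta_X, y)

# Refit the tier-1 models on all training data for use at prediction time.
text_model.fit(essays, y)
feature_model.fit(X, y)


def predict_excitement(new_essays, new_X):
    pairs = np.column_stack([
        text_model.predict_proba(new_essays)[:, 1],
        feature_model.predict_proba(new_X)[:, 1],
    ])
    return meta_model.predict_proba(pairs)[:, 1]
```

Because the third model sees only two inputs per project, the high-dimensional TF-IDF vector can no longer swamp the project features.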
Failed efforts
In TF-IDF vectorization we assumed that nouns would be more important than other words, so we applied POS (part-of-speech) tagging with NLTK and built a dictionary containing only nouns (1000 nouns formed the TF-IDF feature vector). However, this strategy did not improve the results.
Simple concatenation of the TF-IDF vectors with the regular feature vector did not produce better results either, so we devised the two-level hierarchical classification model to combine the effect of the two types of vectors.
We thought that the derived feature per_student_cost = (total_cost / students_reached) might increase the accuracy, because it indicates the importance of a project from a student's perspective. It did not improve the results.
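The per_student_cost feature is a plain ratio, but it needs a guard for projects whose students_reached value is missing or zero. The sketch below returns None in that case, which is an assumption made here; how such rows were treated is not something the text above specifies.

```python
def per_student_cost(total_cost, students_reached):
    # Undefined when the number of students is missing or not positive.
    if total_cost is None or not students_reached or students_reached <= 0:
        return None
    return total_cost / students_reached
```

For example, a project costing 450.0 that reaches 30 students has a per-student cost of 15.0.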
Milestone Submissions
This section describes the milestone submissions we made to Kaggle. Our team, Sapients, made a total of 36 submissions; the ones below are the critical submissions that led to the final results.
Submission 1
In our initial submission we used all the data before 2014-01-01 as the training set, with the following selected attributes. The model was trained using logistic regression.
'poverty_level', 'primary_focus_area', 'fulfillment_labor_materials', 'total_price_excluding_optional_support', 'students_reached', 'school_year_round', 'secondary_focus_area', 'grade_level', 'eligible_double_your_impact_match', 'teacher_teach_for_america', 'teacher_ny_teaching_fellow', 'eligible_almost_home_match', 'school_magnet', 'resource_type', 'school_charter',
'essay_length'  # A derived feature with high importance
ROC Score = 0.59447
Figure 2. ROC accuracy
Submission 2
After studying the time series analysis given in [2], we removed all the data prior to 2010-01-01, because most of the older data is obsolete and hinders the mining of interesting patterns. Further, the older data consists mostly of negative examples, so the algorithms fail to find strong patterns for identifying positive examples. We also added 'month' as a separate feature. These changes improved the results considerably: submission 2 achieved a ROC score of 0.60128.
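The date filter and the month feature can be sketched with the standard library. The rows and the date_posted field name are assumptions for illustration, and recent_with_month is a name introduced here.

```python
from datetime import date

CUTOFF = date(2010, 1, 1)

# Toy project rows; the date_posted field name is an assumption.
projects = [
    {"projectid": "p1", "date_posted": "2009-11-30"},
    {"projectid": "p2", "date_posted": "2010-01-01"},
    {"projectid": "p3", "date_posted": "2013-09-15"},
]


def recent_with_month(rows, cutoff=CUTOFF):
    kept = []
    for row in rows:
        posted = date.fromisoformat(row["date_posted"])
        # Drop everything posted before the cutoff date.
        if posted >= cutoff:
            # Add the posting month (1-12) as a separate feature.
            kept.append({**row, "month": posted.month})
    return kept


recent = recent_with_month(projects)
```

With the toy rows above, p1 is dropped and the remaining rows carry month values 1 and 9.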
Submission 3
Changing the algorithm from logistic regression to gradient tree boosting improved the result to 0.61190. We used gradient tree boosting with n_estimators=100 and a maximum depth of 5. The following graph shows the improvement in accuracy for each major submission. The vertical axis represents the percentage improvement with respect to the "essay-length benchmark".
Figure 3. ROC accuracy of the models
As an alternative experiment, we developed a classifier based on a bitmap feature representation. Bitmap feature vectors provide a lightweight encoding for nominal attributes. The models were trained using the logistic regression algorithm. The following features were bitmap-encoded, and the outcome was converted into float values.
● primary_focus_area
● school_year_round
● school_charter
● fulfillment_labor_materials
● teacher_teach_for_america
● school_magnet
● school_kipp
● grade_level
● primary_focus_subject
● poverty_level
● school_state
● secondary_focus_area
● school_charter_ready_promise
● teacher_prefix
● eligible_double_your_impact_match
● teacher_ny_teaching_fellow
● secondary_focus_subject
● eligible_almost_home_match
● school_nlns
● school_metro
● resource_type
● date_posted_month
Additionally, the following features were used which are floats by default.
● total_price_including_optional_support
● total_price_excluding_optional_support
● support_price = (total_price_including_optional_support) - (total_price_excluding_optional_support)
● students_reached
The accuracy of the model for the attributes selected in each training iteration is given in the following table. As can be seen, the results are not impressive, but the effort was helpful in determining which attributes to choose. One reason for the low results may be the variance of the attribute values. Normalization would probably have increased the accuracy, but since this was an experiment we decided to keep things simple. The table lists the attributes in the feature vector for each iteration and the ROC value obtained by splitting the training data, using the projects posted before 2013-01-01 for training and the projects posted after 2013-01-01 and before 2014-01-01 for testing.
Iteration | Attributes | Accuracy (ROC) | Remarks
1 | all the features in the bitmap-encoded list | 0.508950 |
2 | primary_focus_area, month, teacher_prefix | 0.518087 | adding resource_type degrades accuracy
3 | primary_focus_area, month, teacher_prefix, school_year_round | 0.519026 | removing month improved the value
4 | primary_focus_area, teacher_prefix, school_year_round | 0.528135 | removing primary_focus_area improved the value
5 | teacher_prefix, school_year_round | 0.535579 | removing school_year_round improved the value
6 | teacher_prefix | 0.538610 | adding grade_level degrades the accuracy
7 | teacher_prefix, fulfillment_labor_materials, teacher_teach_for_america | 0.546426 | fulfillment_labor_materials has no effect; adding poverty_level degrades, school_state degrades the accuracy
8 | teacher_prefix, teacher_teach_for_america, school_charter_ready_promise | 0.546784 | school_charter_ready_promise slightly increased the accuracy
9 | teacher_prefix, teacher_teach_for_america, teacher_ny_teaching_fellow | 0.547151 | eligible_almost_home_match degrades, school_metro degrades the accuracy
10 | teacher_prefix, teacher_teach_for_america, teacher_ny_teaching_fellow, school_charter_ready_promise, total_price_including_optional_support | 0.552683 | total_price_excluding_optional_support and support_cost degrade the accuracy
Table 1. Selected attributes and corresponding model accuracies
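The iterations in Table 1 were carried out by hand. A search of this kind could in principle be automated; the sketch below shows a generic greedy forward selection and is not the procedure we followed. The feature names and the scoring function are hypothetical placeholders: in practice score_fn would train a model on the given attributes and return its validation ROC value.

```python
def greedy_forward_selection(candidates, score_fn):
    """Repeatedly add the single feature that improves the score the most."""
    selected = []
    best_score = score_fn(selected)
    while True:
        round_best, round_feature = best_score, None
        for feature in candidates:
            if feature in selected:
                continue
            score = score_fn(selected + [feature])
            if score > round_best:
                round_best, round_feature = score, feature
        # Stop when no remaining feature improves the score.
        if round_feature is None:
            break
        selected.append(round_feature)
        best_score = round_best
    return selected, best_score


# Hypothetical per-feature gains, used only to make the sketch runnable.
TOY_GAINS = {"feature_a": 0.04, "feature_b": 0.01, "feature_c": -0.01}


def toy_score(features):
    return 0.5 + sum(TOY_GAINS[f] for f in features)


chosen, final_score = greedy_forward_selection(list(TOY_GAINS), toy_score)
```

With the toy gains above, the search picks feature_a, then feature_b, and stops because feature_c would lower the score.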
We then plotted the results to check whether there is a correlation between these attribute sets, which we had arrived at by trial and error, and the accuracy achieved.
Figure 4. Attribute sets vs accuracy achieved by the attribute set
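The ROC values in Table 1 come from a date-based hold-out split of the labelled data. A minimal sketch of such a split is given below; the date_posted field name and the time_split function are assumptions, and a project posted exactly on 2013-01-01 is placed in the validation part here, which is also an assumption.

```python
from datetime import date


def time_split(rows, valid_start=date(2013, 1, 1), valid_end=date(2014, 1, 1)):
    train, valid = [], []
    for row in rows:
        posted = date.fromisoformat(row["date_posted"])
        if posted < valid_start:
            train.append(row)      # older projects are used for training
        elif posted < valid_end:
            valid.append(row)      # the most recent labelled year is held out
        # Rows on or after valid_end are unlabelled test data and are skipped.
    return train, valid


rows = [
    {"projectid": "p1", "date_posted": "2012-06-01"},
    {"projectid": "p2", "date_posted": "2013-03-10"},
    {"projectid": "p3", "date_posted": "2014-02-01"},
]
train_rows, valid_rows = time_split(rows)
```

Splitting by date rather than at random keeps the validation set closer to the real test set, which also lies in the future relative to the training data.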
Different Algorithms
In this project we used a number of classification algorithms to build the is_exciting classifier, selected by trial and error. Since we had only 5 submissions per day, we subdivided the training data set again into a training set and a test set so that we could compare the results of different algorithms locally. We mainly tried:
● Support Vector Machine Classifier
● Logistic Regression Classifier
● Gradient Tree Boosting Classifier
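Comparing algorithms on a local hold-out set requires the same metric that Kaggle reports, the area under the ROC curve. The sketch below computes it directly from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. It is a quadratic-time illustration for small samples, not the implementation we used; the roc_auc function is a name introduced here.

```python
def roc_auc(labels, scores):
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

In this example three of the four positive-negative pairs are ordered correctly, so auc is 0.75. A model that assigns the same score to every project gets 0.5.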
SVM
Support Vector Machines are popular as large-margin classifiers because of their ability to find an optimal decision boundary that separates two classes. From an efficiency point of view, however, the model takes a long time to train. After some initial trials we decided not to use SVM in this project.
Logistic Regression
We used logistic regression as one of our main algorithms in this project. A logistic regression model was used to train the TF-IDF based classifier. A scikit-learn logistic regression model can be obtained as follows.
sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None)
A detailed explanation of the parameters can be found in [4].
Gradient Tree Boosting
Gradient Tree Boosting is an ensemble learning method that we used in the project, and it produced a very good model. Gradient tree boosting is based on decision trees. The reason for using an ensemble method is that it builds several base estimators and combines them to produce results that generalize better, so ensemble methods often produce better results than a single estimator. Scikit-learn provides a very good implementation of the Gradient Tree Boosting algorithm [5], which can be invoked as follows.
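from sklearn.ensemble import GradientBoostingClassifier

# Fit 100 boosted trees of depth 1, then report mean accuracy on the held-out set.
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)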
The tunable parameters used here are n_estimators, learning_rate, max_depth and random_state. In general, smaller learning rates produce more accurate models at the expense of training time. The number of base learners is set by n_estimators, and the size of each tree by max_depth. The values chosen for these parameters are always a trade-off between accuracy and available resources.
Challenges
As in any data mining project, there were certain challenges that we needed to address. The data files contain so many attributes that we had to eliminate unnecessary data and use a minimal set of important attributes, because we did not have enough computational resources to process all the given features. Further, using all the given features without examining their effect would reduce the quality of the model; the models sometimes tend to overfit in the presence of certain feature sets. Finding adequate computing resources for running the algorithms was another challenge.
Initially we used the entire data set provided by Kaggle to train our models, but the accuracy did not reach the expected levels. Later, while looking for patterns in the data set, we observed that the percentage of is_exciting projects is very low, so the algorithms fail to extract patterns strong enough to decide which projects are is_exciting. We removed all data prior to 2010-01-01 and trained on the reduced data set, which lifted the accuracy by a considerable amount, as expected.
Improvements
A better dimensionality reduction algorithm could be used to identify a reduced set of attributes. In this project, feature selection was based on simple visualization techniques and intuition. Ensemble learning is a powerful technique for achieving improved accuracy. We have already exploited one ensemble method available in scikit-learn (gradient tree boosting), but there is room to improve the accuracy with more advanced ensemble techniques at the expense of resources and time. Parameter configuration is also an important aspect of the optimal use of a learning algorithm. We could train with a lower learning rate for higher accuracy and tune the other parameters of the algorithms for better results.
References
[1] KDD Cup 2014 on Kaggle, [online] http://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data
[2] Time Series analysis on KDD Cup 2014 data sets, [online] http://rpubs.com/wacax/21669
[3] Introduction to ROC curves, [online] http://gim.unmc.edu/dxtests/ROC1.htm
[4] Logistic Regression, scikit-learn official documentation, [online] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
[5] Gradient Tree Boosting, scikit-learn official documentation, [online] http://scikit-learn.org/stable/modules/ensemble.html#gradientboosting
[6] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (The Morgan Kaufmann Series in Data Management Systems), 08 September 2000.