
KDDCup 2014 - Predicting excitement at DonorsChoose.org

{100277E, 100131D, 100381R, 100393F} Department of Computer Science and Engineering,  

University of Moratuwa,  Sri Lanka. 

 

Introduction

DonorsChoose.org is an online charity that lets teachers publish their classroom projects online and gives willing donors an opportunity to support schools in need. DonorsChoose.org is interested in knowing which of the projects proposed by teachers are exciting, so the task is to train a model that predicts whether a given project is exciting or not. The guidelines for determining the level of excitement of a project are given in [1].

DonorsChoose.org determines how exciting a given project is by means of the evaluation criteria described below. An exciting project:

● was fully funded (fully_funded)
● had at least one teacher-acquired donor (at_least_1_teacher_referred_donor)
● has a higher than average percentage of donors leaving an original message (great_chat)
● has at least one "green" donation (at_least_1_green_donation)
● has one or more of:
  ○ donations from three or more non-teacher-acquired donors (three_or_more_non_teacher_referred_donors)
  ○ one non-teacher-acquired donor gave more than $100 (one_non_teacher_referred_donor_giving_100_plus)
  ○ the project received a donation from a "thoughtful donor" (donation_from_thoughtful_donor)

Data

The data provided by Kaggle is in relational format and split by dates. It contains details on the projects, donations, resources needed by the projects, and essay statements. Kaggle treats any project posted after 2014-01-01 as test data; donation and outcome details are not available for those projects.


The following data files are provided for training and testing the models.

● donations.csv - information about the donations made to the projects in the training set.
● projects.csv - information about the projects in both the training and test sets.
● sampleSubmission.csv - the project ids for the test set and the submission format, as a guide to competitors.
● resources.csv - information about the requested resources for each project. This file contains a large set of attributes describing the resources requested by teachers.
● essays.csv - essays written by teachers for the proposed projects. These are the statements written by teachers explaining the requirement and the importance of receiving the resources.
● outcomes.csv - information about the outcomes of the projects in the training set.

 

Data integration

For data integration we used the Pandas Python library. It provides many utilities for handling csv files, and its R-like data frames support the SQL-like operations of relational databases (joins, grouping, etc.). We combined the essays.csv, projects.csv, outcomes.csv and resources.csv files to obtain the necessary data. Extracting data from resources.csv was done as follows: to merge resources with projects, we first grouped the resources by project id; to get the total resource cost, we multiplied each item's unit price by its quantity and summed within the group. A minimal sketch of this step follows.
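The sketch below assumes the Kaggle csv files sit in the working directory; the column names (projectid, item_unit_price, item_quantity) follow the competition's data dictionary.

import pandas as pd

# Load the relational csv files provided by Kaggle.
projects = pd.read_csv('projects.csv')
essays = pd.read_csv('essays.csv')
outcomes = pd.read_csv('outcomes.csv')
resources = pd.read_csv('resources.csv')

# Total resource cost per project: unit price times quantity, summed per group.
resources['item_cost'] = resources['item_unit_price'] * resources['item_quantity']
resource_cost = (resources.groupby('projectid')['item_cost']
                 .sum()
                 .reset_index(name='total_resource_cost'))

# SQL-like joins on projectid; outcomes exist only for training projects,
# so a left join leaves the test rows with missing outcome values.
data = (projects.merge(essays, on='projectid')
                .merge(resource_cost, on='projectid', how='left')
                .merge(outcomes, on='projectid', how='left'))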

Data Preprocessing 

Data pre-processing was mainly needed for TF-IDF vector generation from the essay data. First, the essays must be cleaned to remove illegal characters such as '\r', which cause errors in tf-idf vectorization if left in place.

The following techniques were used to improve the results of essay vectorization; a cleaning sketch follows the list.

● stop word removal
● lower casing
● stemming and lemmatization
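A short sketch of this cleaning step is given below, under the assumption that NLTK's English stop word list and Snowball stemmer are used (the exact stemmer is not named above).

import re
from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

def clean_essay(text):
    # Remove illegal characters such as '\r' that break tf-idf vectorization.
    text = re.sub(r'[\r\n\t]', ' ', text)
    # Lower-case, keep alphabetic tokens only, drop stop words, then stem.
    tokens = re.findall(r'[a-z]+', text.lower())
    return ' '.join(stemmer.stem(t) for t in tokens if t not in stop_words)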

 


 

Approach 1. TF-IDF Based Classifier (Essays only)

In this solution we used only the essays.csv file to determine whether a given project is_exciting or not. TF-IDF stands for Term Frequency-Inverse Document Frequency, a technique used to maximize the weight of significant terms in textual data.

The scikit-learn library has built-in functions that return tf-idf vectors for a given input text. We tried using both the essays and the need statements; in all cases we removed stop words and used lower casing.

Essays with a maximum of 2000 features - 0.56531 area under the ROC curve (AUC)
Essays with a maximum of 20000 features - 0.56724 AUC

We then tried to improve the model using stemming and lemmatization. In linguistic morphology, stemming is the process of stripping affixes from words in order to obtain a base form, e.g. "stemming" → "stem". This helps reduce the number of different terms sharing the same semantic base. We further utilized the lemmatizer provided in the NLTK Python library; NLTK's WordNet-based lemmatizer removes affixes only if the resulting word is in its dictionary. A lemmatizer is more advanced than a stemmer in the sense that it detects non-trivial semantic bases (women → woman, children → child). Results showed an improvement after applying lemmatization followed by a regular stemming step.

Adopting an n-gram (e.g. bi-gram) technique for the essays would likely have improved the accuracy considerably, but computing n-gram sequences consumes more RAM than our systems had. We did, however, manage to apply n-grams to the need statements, which lifted the accuracy from 0.51 to 0.53.

The text-based classification model was trained using logistic regression.
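A hedged sketch of this classifier follows; train_essays, test_essays (cleaned essay strings) and y_train (the is_exciting labels) are assumed inputs from the steps above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stop word removal and lower casing are handled by the vectorizer itself;
# max_features caps the vocabulary (we tried 2000 and 20000).
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True,
                             max_features=20000)
X_train = vectorizer.fit_transform(train_essays)   # assumed: cleaned essays
X_test = vectorizer.transform(test_essays)

clf = LogisticRegression()
clf.fit(X_train, y_train)                          # assumed: is_exciting labels
probabilities = clf.predict_proba(X_test)[:, 1]    # probability of is_exciting

# For the need statements, passing ngram_range=(1, 2) to TfidfVectorizer
# gives the bi-gram variant mentioned above.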

Approach 2. Regular features with different algorithms

In a later attempt we tried a different model, eliminating the essay details and focusing on learning from the other attributes, especially those in projects.csv. We trained several models using different sets of attributes, using the Pandas Python library to manipulate the csv files and feature vectors. One_Hot_DataFrame is a convenient way of converting categorical data into numerical attributes and producing feature vectors.

For example, for the attribute poverty_level = {highest poverty, high poverty, moderate poverty, low poverty}, a one-hot dataframe creates a column for each level and puts 1 in the column matching a project's value, while the other columns are zero. E.g. for a 'high poverty' project:

highest  high  moderate  low
   0      1       0       0

The essay-length benchmark is 0.54531.
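A minimal sketch of this conversion, using pandas get_dummies in place of the One_Hot_DataFrame helper:

import pandas as pd

df = pd.DataFrame({'poverty_level': ['high poverty', 'low poverty']})
one_hot = pd.get_dummies(df['poverty_level'], prefix='poverty')
# Row 0 ('high poverty') gets 1 in its matching column and 0 elsewhere.
print(one_hot)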

Approach 3. Hybrid approach (Regular Classifier + TF-IDF Classifier)

We built several models using a combination of the two approaches above. The TF-IDF vectors obtained from the essay data were appended to the regular feature vector. However, simply concatenating the two vectors degraded the accuracy: the higher accuracy obtained from the project features is diluted by the high-dimensional, lower-accuracy TF-IDF vectors, so the overall accuracy falls below expectations.

The problem was the way we merged the two vectors, so in our next attempt we devised a different way to combine the effects of the TF-IDF vectors and the other feature vectors. We trained two separate models, one for the TF-IDF features and one for the project features. These two models separately output two probability values. We then trained a third model on these 2-valued tuples to output a real number that acts as the overall probability for a given project, as shown in Figure 1 and sketched below. This method worked exceptionally well, and we recorded our highest place on the Kaggle leaderboard with this approach.
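The sketch below assumes X_tfidf and X_features are the tf-idf and project-feature matrices for the same projects, and y the is_exciting labels. (In a careful implementation the tier-1 probabilities fed to the combiner should come from held-out predictions to avoid leakage; that refinement is omitted here.)

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Tier 1: one model per feature type, each outputting a probability.
text_model = LogisticRegression().fit(X_tfidf, y)
feature_model = GradientBoostingClassifier().fit(X_features, y)

p_text = text_model.predict_proba(X_tfidf)[:, 1]
p_feat = feature_model.predict_proba(X_features)[:, 1]

# Tier 2: a combiner trained on the 2-valued tuples of probabilities.
tier2_input = np.column_stack([p_text, p_feat])
combiner = LogisticRegression().fit(tier2_input, y)
overall_probability = combiner.predict_proba(tier2_input)[:, 1]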

Figure 1. Two-tier hierarchical classification model

Failed efforts 


In TF-IDF vectorization we assumed nouns would be more important than other words, so we applied POS (Part of Speech) tagging using NLTK and created a dictionary using only nouns (1000 nouns formed the tf-idf feature vector). However, this strategy did not improve the results; a sketch of the noun extraction is given below.

Simple concatenation of the TF-IDF vectors with the regular feature vector did not produce better results either, which is why we devised the two-level hierarchical classification model to combine the effect of the two types of vectors.

We also thought the derived feature per_student_cost = total_cost / students_reached might increase the accuracy, since it is an indicator of a project's importance from a student's perspective, but it did not improve the results.
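The noun extraction sketched here assumes NLTK's default tagger; tags starting with 'NN' mark nouns in the Penn Treebank tagset.

import nltk   # requires the 'punkt' and 'averaged_perceptron_tagger' downloads

def extract_nouns(text):
    # Tag each token and keep only the nouns (NN, NNS, NNP, NNPS).
    tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    return [word for word, tag in tagged if tag.startswith('NN')]

# The 1000 nouns kept from the essays then formed the tf-idf vocabulary.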

Milestone Submissions

In this section we describe the milestone submissions we made to Kaggle. Our team, Sapients, made a total of 36 submissions; listed here are the critical ones that led to the final results.

Submission 1

In our initial submission we used all the data before 2014-01-01 as the training set, with the following selected attributes. The model was trained using logistic regression.

'poverty_level', 'primary_focus_area', 'fulfillment_labor_materials',
'total_price_excluding_optional_support', 'students_reached',
'school_year_round', 'secondary_focus_area', 'grade_level',
'eligible_double_your_impact_match', 'teacher_teach_for_america',
'teacher_ny_teaching_fellow', 'eligible_almost_home_match',
'school_magnet', 'resource_type', 'school_charter',
'essay_length'  # a derived feature with high importance

ROC Score = 0.59447

 Figure 2. ROC accuracy  

Submission 2

After studying the time series analysis given in [2], we removed all data prior to 2010-01-01, because most of the older data is obsolete and hinders the mining of the most interesting patterns. In particular, most of the old data consists of negative examples, so the algorithms fail to see strong patterns for identifying positive examples. We also added 'month' as a separate feature (sketched below). These changes improved the results by a considerable amount: with Submission 2 we achieved a ROC score of 0.60128.
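A sketch of the trimming and the month feature, assuming the projects frame from the integration step:

import pandas as pd

projects['date_posted'] = pd.to_datetime(projects['date_posted'])
# Drop obsolete history: keep only projects posted from 2010 onwards.
projects = projects[projects['date_posted'] >= '2010-01-01']
# Expose the posting month as its own feature.
projects['month'] = projects['date_posted'].dt.month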

Submission 3 


Changing the algorithm from logistic regression to gradient tree boosting improved the result to 0.61190. We used gradient tree boosting with n_estimators=100 and a maximum depth of 5; a configuration sketch follows Figure 3.

The following graph shows the improvement in accuracy for each major submission. The vertical axis represents the percentage improvement with respect to the "essay-length benchmark".

 Figure 3. ROC accuracy of the models 
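For reference, a minimal sketch of the Submission 3 configuration, with X_train, y_train and X_test assumed from the feature pipeline above:

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100, max_depth=5)
clf.fit(X_train, y_train)
predictions = clf.predict_proba(X_test)[:, 1]   # submitted probabilities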

As an alternative experiment, we developed a classifier based on a bitmap feature representation. Bitmap feature vectors are lightweight encodings of nominal attributes. These models were trained using the logistic regression algorithm.

The following features were used for bitmap encoding, with the outcome converted into float values.

● primary_focus_area
● school_year_round
● school_charter
● fulfillment_labor_materials
● teacher_teach_for_america
● school_magnet
● school_kipp
● grade_level
● primary_focus_subject
● poverty_level
● school_state
● secondary_focus_area
● school_charter_ready_promise
● teacher_prefix
● eligible_double_your_impact_match
● teacher_ny_teaching_fellow
● secondary_focus_subject
● eligible_almost_home_match
● school_nlns
● school_metro
● resource_type
● date_posted_month

Additionally, the following features, which are floats by default, were used.

● total_price_including_optional_support
● total_price_excluding_optional_support
● support_price = total_price_including_optional_support - total_price_excluding_optional_support
● students_reached

The accuracy of the model in each training iteration, with respect to the attributes selected, is listed in Table 1. As can be seen, the results are not especially impressive, but this effort was helpful in determining which attributes to choose. One reason for the low results may be the variance of the attribute values; normalization would likely have increased the accuracy, but since this was experimental we decided to keep things simple.

Table 1 lists the attributes we added to the feature vector and the ROC value obtained by using the projects posted before 2013-01-01 as training data and the projects posted after 2013-01-01 and before 2014-01-01 as test data; the split is sketched below.
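The split itself is a simple date filter, assuming the merged frame from the integration step with a parsed date_posted column:

# Local validation: train on pre-2013 projects, test on the 2013 projects.
train = data[data['date_posted'] < '2013-01-01']
test = data[(data['date_posted'] >= '2013-01-01') &
            (data['date_posted'] < '2014-01-01')]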

Iteration | Attributes | Accuracy (ROC) | Remarks
1 | all the features in the bitmap-encoded list | 0.508950 |
2 | primary_focus_area, month, teacher_prefix | 0.518087 | adding resource_type degrades accuracy
3 | primary_focus_area, month, teacher_prefix, school_year_round | 0.519026 | removing month improved the value
4 | primary_focus_area, teacher_prefix, school_year_round | 0.528135 | removing primary_focus_area improved the value
5 | teacher_prefix, school_year_round | 0.535579 | removing school_year_round improved the value
6 | teacher_prefix | 0.538610 | adding grade_level degrades the accuracy
7 | teacher_prefix, fulfillment_labor_materials, teacher_teach_for_america | 0.546426 | fulfillment_labor_materials has no effect; adding poverty_level or school_state degrades the accuracy
8 | teacher_prefix, teacher_teach_for_america, school_charter_ready_promise | 0.546784 | school_charter_ready_promise slightly increased the accuracy
9 | teacher_prefix, teacher_teach_for_america, teacher_ny_teaching_fellow | 0.547151 | adding eligible_almost_home_match or school_metro degrades the accuracy
10 | teacher_prefix, teacher_teach_for_america, teacher_ny_teaching_fellow, school_charter_ready_promise, total_price_including_optional_support | 0.552683 | total_price_excluding_optional_support and support_cost degrade the accuracy

Table 1. Selected attributes and corresponding model accuracies

 


We then drew diagrams to check whether there were correlations among the parameters found through this trial-and-error search.

Figure 4. Attribute sets vs. the accuracy achieved by each set

 

Different Algorithms

In this project we used a number of classification algorithms to build the is_exciting classifier. The selection of algorithms was based on trial and error. Since we were limited to five submissions per day, we subdivided the training data into our own training and test sets to compare the results of the different algorithms. We tried:

● Support Vector Machine Classifier
● Logistic Regression Classifier
● Gradient Tree Boosting Classifier

SVM

Support Vector Machines are popular as large-margin classifiers because of their ability to find an optimal decision boundary separating two classes. From an efficiency point of view, however, they take a long time to train. After some initial trials we decided not to use SVM in this project.

Logistic Regression



We used logistic regression as one of our main algorithms in this project; in particular, the TF-IDF based classifier was trained with a logistic regression model. A scikit-learn logistic model can be obtained as follows.

sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0,
    fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None)

A detailed explanation of the parameters can be found in [4].

Gradient Tree Boosting

Gradient Tree Boosting is an ensemble learning method based on decision trees, and it produced a very good model in this project. The reason for using an ensemble method is that it builds several base estimators and merges them to produce more generalized results, so ensemble methods often give better accuracy.

Scikit-learn provides a good implementation of the Gradient Tree Boosting algorithm [5], which can be invoked as follows.

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)

The tunable parameters of this algorithm are n_estimators, learning_rate, max_depth and random_state. As in the general case, smaller learning rates produce more accurate models at the expense of training time. The number of base learners is set by n_estimators, and the size of each tree by max_depth. The values chosen for these parameters are always a trade-off between accuracy and available resources.

 

 

 

Challenges

As in any data mining project, there were certain challenges to address. The data files contain so many attributes that we had to eliminate unnecessary data and work with a minimal set of important attributes: we did not have enough computational resources to process all the given features, and using all of them without examining their effect would reduce the quality of the model, since models tend to overfit in the presence of certain feature sets. Finding adequate resources for running the algorithms was another challenge.

Initially we used the entire data set provided by Kaggle to train our models, but the accuracy did not reach the expected levels. Later, while looking for patterns in the dataset, we observed that the percentage of is_exciting projects is very low, so the algorithms fail to extract sufficiently strong patterns for deciding which projects are exciting (a quick check is sketched below). Removing all data prior to 2010-01-01 and training on the reduced data set lifted the accuracy considerably, as expected.
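The imbalance is easy to confirm, assuming outcomes.csv loaded as in the integration step (is_exciting holds 't'/'f' flags):

# Share of each class among the labelled training projects.
class_share = outcomes['is_exciting'].value_counts(normalize=True)
print(class_share)   # the 't' (exciting) share is only a small fraction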

Improvements

A better dimensionality reduction algorithm could be used to identify a reduced set of attributes; in this project, feature selection was based on simple visualization techniques and intuition.

Ensemble learning is a powerful technique for achieving improved accuracy. We have already exploited one ensemble method available in scikit-learn (gradient tree boosting), but there is room to improve the accuracy further with more advanced ensemble techniques, at the expense of resources and time.

Parameter configuration is also an important aspect of the optimal use of a learning algorithm. We could train with a lower learning rate for higher accuracy and perform further parameter tuning for better results.

 

 

References

[1] KDD Cup 2014 on Kaggle, [online] http://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data

[2] Time series analysis on the KDD Cup 2014 data sets, [online] http://rpubs.com/wacax/21669



[3] Introduction to ROC curves, [online] http://gim.unmc.edu/dxtests/ROC1.htm

[4] Logistic Regression, scikit-learn official documentation, [online] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

[5] Gradient Tree Boosting, scikit-learn official documentation, [online] http://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting

 

[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Series in Data Management Systems, September 2000.

  
