Learning to Rank Comments Within Subreddit Submissions - Alex Jin, Don Mai, Jeff Setter
Problem Statement
Given a Reddit® submission s with n top-level comments c, predict the ranking R in R^n, where R[i] is the rank of c[i] for all i in {1, 2, …, n} (i.e., predict the order of the scores assigned to the top-level replies in that submission).
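As a concrete illustration of the prediction target, the following sketch derives the ranking R from a list of observed comment scores (1 = highest-scored; tie-breaking is arbitrary here, and the function name is ours, not from the poster):

```python
# Hypothetical helper: turn comment scores into the ranking R, where
# R[i] is the rank of comment c[i] (1 = highest score).
def ranking_from_scores(scores):
    # Sort comment indices by descending score, then invert that order
    # so ranks[i] gives the rank of comment i.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

print(ranking_from_scores([12, 305, 7]))  # [2, 1, 3]
```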
Ask Don for the ranking challenge!
Naïve Bayes
Naïve Bayes classification was our first stab at the problem. The model performed increasingly poorly as the number of training examples grew. In a small data set, certain words can serve as distinct identifiers for a particular post, but this nice property disappears as more and more posts share the same words. This motivated us to move on to linear regression.
Linear Regression
Our regression model takes a post with a varying number of comments and outputs the predicted ranking of all of its comments. The prediction is made by extracting features (listed bottom right) from the comments, normalizing them with respect to all comments in the post, and then training either a linear regression model or an SVR (shown pictorially above).

To measure the quality of a ranking, we use a normalized Spearman's Footrule: the sum over comments of the absolute difference between predicted and actual rank, divided by the maximum possible Spearman's Footrule distance (shown below right).

We tested our model on several subreddits but show only data from /r/murica. The learning curve for /r/murica shows that our model has high bias and underfits the data, as expected given that we do not have an enormous number of features (shown below). Despite this, the average error still decreases from 0.616 to 0.354 for /r/murica, and most errors shift below 0.5 (shown bottom left). Feature effectiveness varied across subreddits, including the most significant feature, time; adding feature categories one at a time shows the effect of each for /r/murica (shown below right).
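The error metric described above can be sketched as follows. This assumes both rankings are permutations of 1..n and uses the standard result that the maximum footrule distance between two permutations of n items is floor(n²/2):

```python
# Sketch of the normalized Spearman's Footrule: sum of absolute rank
# differences, divided by the maximum possible footrule distance.
def normalized_footrule(predicted, actual):
    n = len(predicted)
    if n < 2:
        return 0.0  # a single comment is trivially ranked correctly
    dist = sum(abs(p - a) for p, a in zip(predicted, actual))
    # Maximum footrule distance between two permutations of n items
    # is floor(n^2 / 2), achieved by reversing the order.
    return dist / (n * n // 2)

print(normalized_footrule([1, 2, 3, 4], [4, 3, 2, 1]))  # 1.0 (worst case)
print(normalized_footrule([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0 (perfect)
```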
Clustering
The Meme-Clustering algorithm is our attempt to use machine learning tools to provide intelligent features that are difficult to generate with a single formula. The algorithm proceeds in four steps:
1. Generate prior frequencies for words.
2. Identify (word, time) pairs of significant importance.
3. Run k-means on the results of step 2.
4. Run a filter to remove undesired outputs.
The output is a collection of centroids representing phrases that had a significant impact during a time period. Some human-readable sample output is shown below.
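Steps 1–3 above can be sketched roughly as follows. This is an illustrative reconstruction, not the poster's actual implementation: the function names, the 3x significance threshold, and the use of 1-D k-means over the hour axis are all our assumptions.

```python
from collections import Counter
import random

def significant_pairs(comments, prior, threshold=3.0):
    # comments: list of (hour, text); prior: word -> baseline frequency.
    # Flag (word, hour) pairs whose observed frequency greatly exceeds
    # the word's prior (threshold is illustrative).
    counts = Counter()
    total = 0
    for hour, text in comments:
        for word in text.lower().split():
            counts[(word, hour)] += 1
            total += 1
    return [(w, h) for (w, h), c in counts.items()
            if c / total > threshold * prior.get(w, 1e-6)]

def kmeans_1d(points, k, iters=20):
    # Plain 1-D k-means over the time coordinate of the flagged pairs.
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

prior = {"murica": 0.5, "freedom": 0.01}
comments = [(0, "freedom freedom"), (0, "murica")]
print(significant_pairs(comments, prior))  # [('freedom', 0)]
```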
Future Work
We plan to improve our k-means clustering algorithm, or to research alternative means of more accurately grouping words that convey a single idea, so that we can use features that check for the presence of these multi-word phrases. This would allow better comment-ranking predictions within subreddits where users up-vote commenters simply because they included a particular grouping of keywords. For more complex subreddits, we would need to leverage natural language understanding, as the semantics of a comment would play a much larger, possibly dominant role in its overall score.
[Figure: Histogram of Error for Random Guess — count vs. error (norm. Footrule), 0 to 1]
[Figure: SVR Learning Curve — error (norm. Footrule) vs. dataset size (1 to 10,000), testing and training curves]
[Panels: Feature vs Error; Feature List; Error Metrics; Learning Rate]
[Figure: Error Rate for different training sizes — error rate vs. error cutoff, for sizes 50, 200, 1000, 5000]
[Figure: Histogram of Error for SVR — count vs. error (norm. Footrule), 0 to 1]