Learning to Rank Comments Within Subreddit Submissions - Alex Jin, Don Mai, Jeff Setter

Problem Statement Given a Reddit® submission s with n comments c[1], …, c[n]: predict the ranking R ∈ ℝⁿ, where R[i] is the rank of c[i] for all i in {1, 2, …, n} (i.e., predict the order of the top-level replies' assigned scores in that submission). For example, a submission whose three top-level comments score 40, 12, and 77 points has true ranking R = [2, 3, 1].

Ask Don for the ranking challenge!

Naïve Bayes Naive Bayes classification was our first attempt at the problem. The model performed very poorly as the number of training examples increased: in a small dataset, certain words act as distinctive identifiers for a particular post, but this useful property disappears as more and more posts share the same words. This motivated us to move on to linear regression.
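The poster does not spell out the Naive Bayes setup; the following is a minimal sketch under two assumptions of ours, a bag-of-words representation and a binary "ranked in the top half of its thread" label, using scikit-learn. The toy comments and labels are made up for illustration.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical toy data: comment texts and whether each comment
    # ended up in the top half of its thread by score.
    comments = [
        "freedom intensifies",
        "this is just a repost",
        "murica eagle freedom",
        "nobody cares about this",
    ]
    top_half = [1, 0, 1, 0]

    # Bag-of-words features for each comment.
    X = CountVectorizer().fit_transform(comments)
    clf = MultinomialNB().fit(X, top_half)

    # Rank a thread's comments by P(top half | words):
    # higher probability -> better predicted rank.
    probs = clf.predict_proba(X)[:, 1]
    order = np.argsort(-probs)               # indices from best to worst
    ranks = np.empty(len(probs), dtype=int)
    ranks[order] = np.arange(1, len(probs) + 1)
    print(ranks)

With only a few posts, distinctive tokens dominate the class-conditional word probabilities; once many posts share the same vocabulary, those probabilities flatten out, which matches the degradation described above.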

Linear Regression

Our regression model takes a post with a varying number of comments and outputs the predicted ranking of all of its comments. The prediction is made by extracting features (listed bottom right) from the comments, normalizing them with respect to all comments in the post, and then training either a linear regression model or an SVR (shown pictorially above). To measure the quality of a ranking, we used a normalized Spearman's Footrule: the sum of the absolute differences between predicted and true ranks within a post, divided by the maximum possible Footrule error, i.e. F = Σᵢ |R[i] − R̂[i]| / ⌊n²/2⌋ (shown below right). We tested our model on several subreddits but show only data from /r/murica. The learning curve for /r/murica shows that our model has high bias and underfits the data, as expected since we do not have an enormous number of features (shown below). Despite this lack of features, the average error still decreases from 0.616 to 0.354 for /r/murica, and most errors shift below 0.5 (shown bottom left). Feature effectiveness varied across subreddits, with time being the most significant feature; adding feature categories one at a time shows the effect of each for /r/murica (shown below right).
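The poster gives the pipeline and metric only at a high level; below is a minimal sketch of both, with made-up features and labels, scikit-learn's SVR standing in for the authors' implementation, and per-post z-scoring as one plausible reading of "normalizing with respect to all comments in a post".

    import numpy as np
    from sklearn.svm import SVR

    def normalized_footrule(pred_rank, true_rank):
        """Spearman's Footrule sum |R[i] - Rhat[i]|, divided by its
        maximum value floor(n^2 / 2), giving an error in [0, 1]."""
        pred_rank = np.asarray(pred_rank)
        true_rank = np.asarray(true_rank)
        n = len(true_rank)
        if n < 2:
            return 0.0
        return np.abs(pred_rank - true_rank).sum() / (n * n // 2)

    def normalize_within_post(features):
        """Z-score each feature column against the other comments in
        the same post, so posts of different sizes are comparable."""
        mu = features.mean(axis=0)
        sd = features.std(axis=0)
        sd[sd == 0] = 1.0
        return (features - mu) / sd

    # Hypothetical post: 4 comments x 3 features (e.g. time, length, counts).
    X = normalize_within_post(np.array([[1.0, 120.0, 3.0],
                                        [5.0,  40.0, 0.0],
                                        [2.0, 300.0, 7.0],
                                        [9.0,  15.0, 1.0]]))
    true_rank = np.array([2, 3, 1, 4])

    # Regress directly on rank, then read the predicted ranking off the
    # sorted regression outputs.
    scores = SVR(kernel="rbf").fit(X, true_rank).predict(X)
    order = np.argsort(scores)                 # best predicted rank first
    pred_rank = np.empty_like(order)
    pred_rank[order] = np.arange(1, len(scores) + 1)
    print(normalized_footrule(pred_rank, true_rank))

As a sanity check on the normalization: a fully reversed ranking of n = 4 comments has raw Footrule 3 + 1 + 1 + 3 = 8, which equals ⌊n²/2⌋ = 8, so the worst case maps to a normalized error of exactly 1.0.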

Clustering

The Meme-Clustering algorithm is our attempt to use machine learning tools to provide intelligent features that are difficult to generate with a single formula. The algorithm can be described in four steps (see the sketch after this list):
1. Generate prior frequencies for words.
2. Identify (word, time) pairs of significant importance.
3. Run k-means on the results of step 2.
4. Run a filter to remove undesired outputs.
The output of this algorithm is a collection of centroids representing phrases that had a significant impact during a time period. Below are some human-readable sample outputs.
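The poster lists only the four steps, not the data representation; the sketch below fills the gaps with assumptions of ours: hourly time buckets, a frequency-lift test for step 2, k-means over the 1-D spike times for step 3 (grouping words that surge together), and a singleton filter for step 4. All parameter names and thresholds are hypothetical.

    import numpy as np
    from collections import Counter, defaultdict
    from sklearn.cluster import KMeans

    def meme_clusters(word_time_pairs, k=5, lift=5.0, min_count=3):
        """Group words whose usage spikes at similar times.
        word_time_pairs: iterable of (word, unix_timestamp)."""
        pairs = list(word_time_pairs)

        # Step 1: prior frequency of each word over the whole corpus.
        totals = Counter(w for w, _ in pairs)
        n = sum(totals.values())
        prior = {w: c / n for w, c in totals.items()}

        # Step 2: keep (word, hour) pairs whose local frequency greatly
        # exceeds the word's prior (a crude "lift" significance test).
        buckets = defaultdict(Counter)
        for w, t in pairs:
            buckets[int(t // 3600)][w] += 1
        words, times = [], []
        for hour, counts in buckets.items():
            bucket_n = sum(counts.values())
            for w, c in counts.items():
                if c >= min_count and (c / bucket_n) / prior[w] >= lift:
                    words.append(w)
                    times.append(hour)
        if not times:
            return []

        # Step 3: k-means over spike times groups words that surge together.
        pts = np.array(times, dtype=float).reshape(-1, 1)
        km = KMeans(n_clusters=min(k, len(pts)), n_init=10).fit(pts)

        # Step 4: filter out singleton clusters (no multi-word phrase).
        groups = defaultdict(set)
        for w, label in zip(words, km.labels_):
            groups[label].add(w)
        return [sorted(g) for g in groups.values() if len(g) > 1]

Each surviving cluster plays the role of a centroid in the description above: a set of words that spiked together in one time period, usable as a phrase-presence feature for ranking.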

References
1. Bottou, Léon. "Stochastic Gradient Descent Tricks." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 421–436.
2. Kumar, Ravi, and Sergei Vassilvitskii. "Generalized Distances Between Rankings." Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.

Future Work We plan to improve our k-means clustering algorithm, or to research alternative means of more accurately grouping words that convey a single idea, so that we can use features that check for the presence of these multi-word phrases. This would let us make better comment-ranking predictions within subreddits where users upvote commenters simply for including a particular grouping of keywords. For more complex subreddits, we would need to leverage natural language understanding, as the semantics of a comment would play a much larger, possibly dominant, role in its overall score.

[Figure: Histogram of Error for Random Guess; count vs. error (norm. Footrule), 0 to 1]
[Figure: SVR Learning Curve; error (norm. Footrule) vs. dataset size (1 to 10000), testing and training curves]
[Figure: Error Rate for different training sizes; error rate vs. error cutoff, for sizes 50, 200, 1000, 5000]
[Figure: Histogram of Error for SVR; count vs. error (norm. Footrule), 0 to 1]
[Poster panels: Feature vs Error, Feature List, Error Metrics, Learning Rate]