It’s all in the Content: State of the art Best
Answer Prediction based on Discretisation
of Shallow Linguistic Features
George Gkotsis, Karen Stepanyan, Carlos
Pedrinaci, John Domingue, Maria Liakata*
Knowledge Media Institute, The Open University
*Department of Computer Science, University of Warwick
Outline
• Motivation
• Problem description
• Proposed solution
• Evaluation
• Discussion & Conclusion
23-26 June 2014 ACM Web Science Conference 2014 (WebSci14)
Questions on social networking sites
• Recommendations & opinions
• Authoritative responses
• Expert & empirical knowledge
Why best answer prediction?
• Information overload
• Increase awareness in the community
• Answer questions more efficiently
• One way to study social media reception
• Plus:
• Finding experts in communities
• Study of language use
• Trend analysis
• …
• Visit
Best answer prediction in Social Q&A
• Binary classification problem
• Is it solved?
• Yes, partially
• Current solutions depend on:
Answer Ratings (knowledge is future & unknown)
• Score, #comments
User Ratings (knowledge is past & not always available)
• User Reputation
• UpVotes etc.
• Preferential attachment
State of the art solutions
“…we observe significant assortativity in the reputations of
co-answerers, relationships between reputation and
answer speed, and that the probability of an answer
being chosen as the best one strongly depends on
temporal characteristics of answer arrivals.”
Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, Jure Leskovec
Discovering Value from Community Activity on Focused Question
Answering Sites: A Case Study of Stack Overflow.
KDD 2012
State of the art solutions (cont.)
“When available, scoring (or rating) features improve
prediction results significantly, which demonstrates the
value of community feedback and reputation for identifying
valuable answers.”
Grégoire Burel, Yulan He, Harith Alani.
Automatic Identification of Best Answers in Online Enquiry
Communities
ESWC 2012
State of the art solutions: Summary
[Figure: bar chart of average precision (0% to 100%) by feature type: linguistic, user ratings, answer ratings. The linguistic bar is labelled "Our solution".]
StackExchange network
SE “is all about getting answers, it’s not a
discussion forum, there’s no chit-chat”
As of 20 June 2014:
• 123 Q&A sites
• 5,622,330 users
• 9.5 million questions
• 16.3 million answers
• 9.3 million visits per day
Training Dataset
September 2013 dump
StackOverflow & 20 of the most active SE websites
Questions with Accepted Answers
• 4,366,662 Non Accepted Answers
• 3,939,224 Accepted Answers
[Pie chart: Accepted Answers 47%, Non Accepted Answers …]
[Figure: SE websites. Bar chart of Non Accepted vs. Accepted answer counts per website, axis from 0 to 200,000.]
[Pie chart: StackOverflow 91%, the rest 9%.]
[Figure: stackoverflow. Bar chart of Non Accepted Answers vs. Accepted Answers, axis from 0 to 8,000,000, with data labels 3,375,817 and 3,795,276.]
Shallow Linguistic features
• Long history, coming from studies on readability
1. Average number of characters per word
2. Average number of words per sentence
3. Number of words in the longest sentence
4. Answer length
5. Log Likelihood (Pitler and Nenkova, 2008)
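A minimal sketch of how the five features listed above could be computed. The tokenizer, the sentence splitter, measuring answer length in characters, and the add-one-smoothed unigram form of the log likelihood are all assumptions made for illustration; the slides do not specify them. `vocab_counts` is a hypothetical input holding word frequencies for the whole community.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple, assumed tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption, not the authors' tool)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def shallow_features(answer, vocab_counts):
    """Compute the five shallow features for one answer.

    vocab_counts: Counter of word frequencies over the community corpus
    (hypothetical input used to estimate unigram probabilities).
    """
    words = tokenize(answer)
    sentences = [tokenize(s) for s in split_sentences(answer)]
    sentences = [s for s in sentences if s]
    total = sum(vocab_counts.values())
    vocab_size = len(vocab_counts)
    # Assumed form: sum over distinct words of count * log P(word),
    # with add-one smoothing so unseen words do not produce log(0).
    log_likelihood = sum(
        count * math.log((vocab_counts.get(word, 0) + 1) / (total + vocab_size + 1))
        for word, count in Counter(words).items()
    )
    return {
        "CharactersPerWord": sum(len(w) for w in words) / len(words) if words else 0.0,
        "WordsPerSentence": len(words) / len(sentences) if sentences else 0.0,
        "LongestSentence": max((len(s) for s in sentences), default=0),
        "Length": len(answer),  # assumed unit: characters
        "LL": log_likelihood,
    }
```

Under this form, a more negative LL means the answer's wording is less typical of the community vocabulary.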
StackOverflow: Overview of shallow features’ evolution
Shallow features: Observations
• Accepted answers tend to:
• Be longer
• Differ more from the community vocabulary
• Contain shorter words
• Have longer longest sentences
• Have more words per sentence
But how good are shallow features?
• 58% macro precision (our baseline)
• Possible reasons
1. Evolution of language characteristics
• Language becomes more eloquent
2. Variance is huge
3. Universal classifier looks unreachable, e.g.:
• SuperUser average length is 577
• Skeptics average length is 2,154
Objectives
• Build a classifier which is:
1. Based on linguistic features solely
2. Robust
• Performs as well as classifiers that use user ratings (past knowledge) or answer ratings (future knowledge)
3. Universal
• Same classifier applicable to as many SE websites as possible (domain agnostic)
Feature discretisation: Example for Length
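Group by question:
Question Id   Answer Id   Length
1             2           200
1             3           150
1             4           250
5             6           150
5             7           100
Sort by Length in descending order within each question; the rank is LengthD:
Question Id   Answer Id   Length   LengthD
1             4           250      1
1             2           200      2
1             3           150      3
5             6           150      1
5             7           100      2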
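The grouping and ranking step for Length can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the function name, the dictionary keys, and the handling of ties (left to the stable sort order) are assumptions.

```python
from collections import defaultdict


def discretise_by_rank(answers, feature):
    """Return {answer_id: rank}, where rank 1 is the largest value of
    `feature` among the answers to the same question."""
    by_question = defaultdict(list)
    for answer in answers:
        by_question[answer["question_id"]].append(answer)

    ranks = {}
    for group in by_question.values():
        # Sort each question's answers by the feature, largest first.
        ordered = sorted(group, key=lambda a: a[feature], reverse=True)
        for rank, answer in enumerate(ordered, start=1):
            ranks[answer["answer_id"]] = rank
    return ranks


# The example from the slide: two questions with three and two answers.
answers = [
    {"question_id": 1, "answer_id": 2, "Length": 200},
    {"question_id": 1, "answer_id": 3, "Length": 150},
    {"question_id": 1, "answer_id": 4, "Length": 250},
    {"question_id": 5, "answer_id": 6, "Length": 150},
    {"question_id": 5, "answer_id": 7, "Length": 100},
]
length_d = discretise_by_rank(answers, "Length")
```

Answers 3 and 6 have the same raw length (150) but different ranks (3 and 1), which is the point of discretising per question: the value is judged against its competing answers rather than against the whole site.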
Feature discretisation
Category                    Name                 Information Gain
Linguistic                  Length               0.0226
Linguistic                  LongestSentence      0.0121
Linguistic                  LL                   0.0053
Linguistic                  WordsPerSentence     0.0048
Linguistic                  CharactersPerWord    0.0052
Linguistic Discretisation   LengthD              0.2168
Linguistic Discretisation   LongestSentenceD     0.1750
Linguistic Discretisation   LLD                  0.1180
Linguistic Discretisation   WordsPerSentenceD    0.1404
Linguistic Discretisation   CharactersPerWordD   0.1162
20x increase in information gain after discretisation
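For reference, information gain for a discrete feature is the entropy of the class label minus the entropy that remains once the feature value is known. The sketch below assumes this standard definition with base-2 logarithms; the accepted/non-accepted labels in the example are made up for illustration and do not come from the dataset.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """H(label) - H(label | feature) for a discrete feature."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: LengthD ranks and whether each answer was accepted.
ranks = [1, 2, 3, 1, 2]
accepted = [1, 0, 0, 1, 0]
gain = information_gain(ranks, accepted)
```

In this toy case the rank separates the classes perfectly, so the gain equals the full label entropy; a feature that takes the same value for every answer would have a gain of zero.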
User and answer rating features
Category        Name              Information Gain
Other           Age               0.0539
Other           CreationDateD     0.1575
Other           AnswerCount       0.3270
User Rating     UserReputation    0.0836
User Rating     UserUpVotes       0.0535
User Rating     UserDownVotes     0.0412
User Rating     UserViews         0.0528
User Rating     UserUpDownVotes   0.0508
Answer rating   Score             0.0792
Answer rating   CommentCount      0.0286
Answer rating   ScoreRatio        0.4539
What are we evaluating?
1. Prediction
2. How good is it compared with the SOTA?
3. Generality
1. Prediction – Features used
• Linguistic
• Linguistic Discretisation
• Other
• User Rating (past knowledge)
• Answer Rating (future knowledge)
1. Prediction
• Classifier: Alternating Decision Trees (ADT)
• Binary, boosting, numerical data
• Weka
• 10-fold cross-validation
• Features used: Linguistic, Linguistic Discretisation, Other
1. Prediction
(P: precision, R: recall, FM: F-measure, AUC: area under the ROC curve)
SE Website P R FM AUC
stackoverflow.com 0.82 0.66 0.73 0.85
apple.stackexchange.com 0.84 0.68 0.75 0.86
askubuntu.com 0.84 0.74 0.79 0.88
drupal.stackexchange.com 0.87 0.79 0.83 0.89
electronics.stackexchange.com 0.79 0.65 0.71 0.84
english.stackexchange.com 0.77 0.52 0.62 0.83
gamedev.stackexchange.com 0.82 0.71 0.76 0.87
gaming.stackexchange.com 0.87 0.79 0.83 0.91
gis.stackexchange.com 0.85 0.73 0.78 0.87
math.stackexchange.com 0.85 0.74 0.79 0.87
mathoverflow.net 0.83 0.7 0.76 0.87
meta.stackoverflow.com 0.87 0.69 0.77 0.87
physics.stackexchange.com 0.86 0.71 0.78 0.88
programmers.stackexchange.com 0.76 0.4 0.52 0.84
serverfault.com 0.83 0.66 0.74 0.85
skeptics.stackexchange.com 0.87 0.83 0.85 0.91
stats.stackexchange.com 0.85 0.79 0.82 0.89
superuser.com 0.84 0.65 0.73 0.85
tex.stackexchange.com 0.87 0.77 0.82 0.88
unix.stackexchange.com 0.81 0.68 0.74 0.85
wordpress.stackexchange.com 0.88 0.8 0.84 0.89
Average 0.84 0.7 0.76 0.87
SE Website P R FM AUC
stackoverflow.com 0.82 0.66 0.73 0.85
Macro Average 0.84 0.7 0.76 0.87
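As a quick check on how the columns relate, the sketch below assumes FM is the usual harmonic mean of precision and recall and that the macro average is the unweighted mean over websites. Both are standard definitions, but the slides do not state them explicitly; the function names are illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean: every website counts equally, regardless of size."""
    return sum(values) / len(values)


# Precision per website, copied from the results table above.
precisions = [
    0.82, 0.84, 0.84, 0.87, 0.79, 0.77, 0.82, 0.87, 0.85, 0.85, 0.83,
    0.87, 0.86, 0.76, 0.83, 0.87, 0.85, 0.84, 0.87, 0.81, 0.88,
]

stackoverflow_fm = round(f_measure(0.82, 0.66), 2)
macro_precision = round(macro_average(precisions), 2)
```

Macro averaging matters here because StackOverflow dominates the data; an unweighted mean keeps the smaller websites from being drowned out.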
2. Comparison with other solutions
Feature groups: Linguistic, Linguistic Discretisation, Other, User Rating, Answer Rating
Case   Features Used
1      Linguistic
2      Linguistic & Discretisation
3      Linguistic & Discretisation & Other
4      Linguistic & Other & User Rating (no discretisation)
5      Linguistic & Other & User Rating (with discretisation)
6      All features (Answer and User Rating with discretisation)
Comparison
Case   Features Used                                                P      R      FM     AUC
1      Linguistic                                                   0.58   0.60   0.56   0.60
2      Linguistic & Discretisation                                  0.81   0.70   0.74   0.84
3      Linguistic & Discretisation & Other                          0.84   0.70   0.76   0.87
4      Linguistic & Other & User Rating (no discretisation)         0.82   0.69   0.75   0.86
5      Linguistic & Other & User Rating (with discretisation)       0.82   0.72   0.77   0.88
6      All features (Answer and User Rating with discretisation)    0.88   0.85   0.86   0.94
3. Generality
• Leave-one-out
• Trained a classifier for each SE website using the data from all other SE websites
(StackOverflow was evaluated but was excluded from training due to its size)
Setting                                                                            P      R      FM     AUC
Macro average based on self-training (results from the first part of evaluation)   0.84   0.70   0.76   0.87
Leave-one-out                                                                      0.83   0.70   0.76   0.87
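The leave-one-out setup described above can be sketched as a generator of train/test splits over websites. The function name and the `excluded_from_training` parameter are hypothetical; they only illustrate the stated rule that StackOverflow is tested on but never used for training.

```python
def leave_one_site_out(sites, excluded_from_training=()):
    """Yield (test_site, training_sites) pairs, one per website.

    Each website is held out in turn; the classifier for it is trained on
    all remaining websites except those in `excluded_from_training`.
    """
    for test_site in sites:
        training_sites = [
            site for site in sites
            if site != test_site and site not in excluded_from_training
        ]
        yield test_site, training_sites


# Three websites from the results table, used here only as an example.
sites = ["stackoverflow.com", "askubuntu.com", "superuser.com"]
splits = dict(leave_one_site_out(sites, excluded_from_training={"stackoverflow.com"}))
```

Because the held-out website contributes nothing to training, matching the self-training scores in this setting is what supports the claim that the classifier is domain agnostic.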
Best Answer prediction
• Community feedback on the answers remains the best way to determine the best answer, but:
• Discretisation reveals a lot more information
• Content features, even shallow ones, can be very informative
• Independent of past (not always available) knowledge
• Independent of future knowledge
• Web application/service is under development
[Diagram: Best Answer Prediction. Existing approaches rely on user & answer rating; linguistic features are marked "?" and are addressed by the proposed solution.]