Machine Learning at Quora
Nikhil Dandekar (@nikhilbd)
2/26/2016
Our Mission
“To share and grow the world’s knowledge”
● Millions of questions & answers
● Millions of users
● Over a million topics
● ...
What we care about
● Demand
● Quality
● Relevance
Agenda
● The core data
● Feed ranking
● Other Machine Learning
● Data Science @Quora
● Personalization
The Core Data
Lots of data relations
Complex network propagation effects
Importance of topics & semantics
Feed ranking @Quora
Ranking - Feed
• Goal: Present the most interesting stories for a user at a given time
  • Interesting = topical relevance + social relevance + timeliness
  • Stories = questions + answers
• Relevance-ordered vs time-ordered = big gains in engagement
• Challenges:
  • potentially many candidate stories
  • real-time ranking
  • optimize for relevance
• Use Machine Learning for feed ranking
Feed dataset: impression logs
[Figure: impression log entries labeled with user actions: click, upvote, downvote, expand, share, answer pass, follow]
What is relevance?
● Value of showing a story to a user, e.g. weighted sum of actions: $v = \sum_a v_a \, \mathbf{1}\{y_a = 1\}$
● Goal: predict this value for new stories. 2 possible approaches:
○ predict value directly: $v_{\text{pred}} = f(x)$
■ pros: single regression model
■ cons: can be ambiguous, coupled
○ predict probabilities for each action, then compute expected value: $v_{\text{pred}} = E[V \mid x] = \sum_a v_a \, p(a \mid x)$
■ pros: better use of supervised signal, decouples action models from action values
■ cons: more costly, one classifier per action
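The second approach (per-action probabilities combined into an expected value) can be sketched as follows. This is a minimal illustration, not Quora's implementation: the action weights in `ACTION_VALUES`, the function name `expected_value`, and the hard-coded probabilities are all assumptions made for the example.

```python
# Hypothetical action values v_a; the weights are illustrative, not from the talk.
ACTION_VALUES = {"click": 1.0, "upvote": 5.0, "share": 10.0, "downvote": -10.0}


def expected_value(action_probs, action_values=ACTION_VALUES):
    """Return sum_a v_a * p(a | x) over the actions present in action_values."""
    return sum(v * action_probs.get(a, 0.0) for a, v in action_values.items())


# p(a | x) would come from one classifier per action; hard-coded here.
probs = {"click": 0.30, "upvote": 0.05, "share": 0.01, "downvote": 0.02}
score = expected_value(probs)
```

Because the values live in a separate table, they can be re-weighted without retraining the per-action classifiers, which is the decoupling the slide lists as an advantage.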
Feature engineering
● Essential for getting good ranking
● Better if updated in real-time (more reactive)
● Main sets of features:
○ user (e.g. age, country, recent activity)
○ story (e.g. popularity, trendiness, quality)
○ interactions between the two (e.g. topic or author affinity)
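The three feature sets listed above (user, story, and user-story interactions) might be assembled into a single feature dictionary along these lines. The field names (`country`, `recent_actions`, `popularity`, `followed_topics`, `topics`) and the affinity definition are hypothetical choices for this sketch; the talk does not specify them.

```python
def topic_affinity(followed_topics, story_topics):
    """Fraction of the story's topics that the user follows (0.0 if no topics)."""
    story_set = set(story_topics)
    if not story_set:
        return 0.0
    return len(set(followed_topics) & story_set) / len(story_set)


def build_features(user, story):
    """Combine user, story, and user-story interaction features into one dict."""
    return {
        # user features
        "user_country=" + user["country"]: 1.0,
        "user_recent_actions": float(user["recent_actions"]),
        # story features
        "story_popularity": float(story["popularity"]),
        # interaction feature between the two
        "topic_affinity": topic_affinity(user["followed_topics"], story["topics"]),
    }


user = {"country": "US", "recent_actions": 12, "followed_topics": ["ml", "python"]}
story = {"popularity": 340, "topics": ["ml", "cooking"]}
features = build_features(user, story)
```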
Models
● Linear
○ simple, fast to train
○ manual, non-linear transforms for richer representation (buckets, ngrams)
● Decision trees
○ learn non-linear representations
● Tree ensembles
○ Random forests
○ Gradient boosted decision trees
● In-house C++ training code, third-party libraries for prototyping new models
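As an illustration of the manual non-linear transforms mentioned for linear models, the sketch below bucketizes a numeric feature into a one-hot vector so that a linear model can learn a separate weight per range. The helper name `bucketize` and the boundaries for a "story age in hours" feature are assumptions for the example.

```python
import bisect


def bucketize(value, boundaries):
    """One-hot encode value by the bucket it falls into.

    boundaries must be sorted; k boundaries give k + 1 buckets.
    """
    index = bisect.bisect_right(boundaries, value)
    one_hot = [0.0] * (len(boundaries) + 1)
    one_hot[index] = 1.0
    return one_hot


# Hypothetical boundaries for a "story age in hours" feature.
AGE_BOUNDARIES = [1, 6, 24, 168]
encoded = bucketize(30, AGE_BOUNDARIES)
```

Tree-based models learn such thresholds themselves, which is why the slide lists this transform only under linear models.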
Scalability: feed backend system
[Diagram: Requests from Web (python); Aggregator 1, Aggregator 2, Aggregator 3, ...; Leaf 1, Leaf 2, Leaf 3, ...; labels: user_id, object_id]
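The slide names only the components (aggregators, leaves, requests keyed by user_id and object_id). One common way such a two-tier system works is scatter-gather: an aggregator fans a request out to leaves and merges their partial top-k results. The sketch below assumes that pattern; the class interfaces and the data layout are hypothetical and are not taken from the talk.

```python
import heapq


class Leaf:
    """Holds scored candidates for a subset of objects (assumed sharding)."""

    def __init__(self, scores_by_user):
        # scores_by_user: {user_id: [(score, object_id), ...]}
        self.scores_by_user = scores_by_user

    def top_k(self, user_id, k):
        """Return this leaf's k highest-scoring (score, object_id) pairs."""
        return heapq.nlargest(k, self.scores_by_user.get(user_id, []))


class Aggregator:
    """Fans a request out to every leaf and merges the partial top-k lists."""

    def __init__(self, leaves):
        self.leaves = leaves

    def top_k(self, user_id, k):
        partial = [item for leaf in self.leaves for item in leaf.top_k(user_id, k)]
        return heapq.nlargest(k, partial)


leaves = [
    Leaf({1: [(0.9, "a"), (0.1, "b")]}),
    Leaf({1: [(0.5, "c")]}),
]
ranked = Aggregator(leaves).top_k(user_id=1, k=2)
```

Each leaf returns at most k items, so the aggregator merges a bounded amount of data regardless of how many candidates each leaf holds.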
Machine Learning @Quora
Ranking - Answer ranking
What is a good Quora answer?
• truthful
• reusable
• provides explanation
• well formatted
• ...
Ranking - Answer ranking
How are those dimensions translated into features?
• Features that relate to the text quality itself
• Interaction features (upvotes/downvotes, clicks, comments…)
• User features (e.g. expertise in topic)
How we think of search
Ranking - Search ranking
● Match user queries to Quora entities
● Corpus: Quora questions, answers, topics, users, blogs etc.
● Ranking: Traditional IR scores (e.g. BM25), hand-tuned or ML-ranking
● Focus on long-term satisfaction
○ If a question exists, but the answer is unsatisfactory, let the user “Re-Ask” the question
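For reference, the BM25 score named above as a traditional IR signal can be computed as in the sketch below. It uses the standard Okapi BM25 formula with a smoothed IDF; the parameter defaults `k1=1.5` and `b=0.75` are conventional choices, not values from the talk, and the inputs are assumed to be non-empty lists of pre-tokenized documents.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in docs against the tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            tf = term_freq[term]
            # Length normalization: long documents are penalized via b.
            norm = tf + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    ["how", "to", "learn", "python"],
    ["best", "pizza", "in", "rome"],
]
scores = bm25_scores(["learn", "python"], docs)
```

In an ML-ranking setup, a score like this would typically be one input feature among many instead of the final ranking function.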
Question Asking
Goal: Find the best people to answer a question
● Understand the question
● Find people who can best answer the question
● “Ask to Answer”: Route the question to these people
● Either manual or automated A2A
Recommendations - Topics
Goal: Recommend new topics for the user to follow
• Based on:
  • Other topics followed
  • Users followed
  • User interactions
  • Topic-related features
  • ...
Recommendations - Users
Goal: Recommend new users to follow
• Based on:
  • Other users followed
  • Topics followed
  • User interactions
  • User-related features
  • ...
Related Questions
• Given interest in question A (source) what other questions will be interesting?
• Not only about similarity, but also “interestingness”
• Features such as:
  • Textual
  • Co-visit
  • Topics
  • …
• Important for logged-out use case
Duplicate Questions
• Important issue for Quora
  • Want to make sure knowledge about the same question is not dispersed across duplicates
• Solution: binary classifier trained with labelled data
• Features
  • Textual vector space models
  • Usage-based features
  • ...
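One simple textual vector space feature for a duplicate-question classifier is the cosine similarity between bag-of-words vectors of the two questions. The sketch below is an assumed example of such a feature (whitespace tokenization, raw term counts); the talk does not say which vector space models are used.

```python
import math
from collections import Counter


def cosine_similarity(question_a, question_b):
    """Cosine similarity between bag-of-words count vectors of two questions."""
    vec_a = Counter(question_a.lower().split())
    vec_b = Counter(question_b.lower().split())
    # Counter returns 0 for terms missing from vec_b.
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


similarity = cosine_similarity(
    "How do I learn Python quickly",
    "How can I learn Python fast",
)
```

The resulting number would be one input to the binary classifier, alongside usage-based features, and not a duplicate decision on its own.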
User Trust/Expertise Inference
Goal: Infer user’s trustworthiness in relation to a given topic
• We take into account:
  • Answers written on topic
  • Upvotes/downvotes received
  • Endorsements
  • ...
• Trust/expertise propagates through the network
• Must be taken into account by other algorithms
Spam Detection/Moderation
• Very important for Quora to keep quality of content
• Pure manual approaches do not scale
• Hard to get algorithms 100% right
• ML algorithms detect content/user issues
  • Output of the algorithms feeds manually curated moderation queues
Trending Topics
Goal: Highlight current events that are interesting for the user
• We take into account:
  • Global “Trendiness”
  • Social “Trendiness”
  • User’s interest
  • ...
• Trending topics are a great discovery mechanism
Models
● Logistic Regression
● Elastic Nets
● Gradient Boosted Decision Trees
● Random Forests
● (Deep) Neural Networks
● LambdaMART
● Matrix Factorization
● LDA
● ...
Data Science @Quora
Data Science at Quora
● Both ML engineers and data scientists are involved in machine learning
● ML engineers build, implement, and maintain production machine learning systems.
● Data scientists conduct research to generate ideas about machine learning projects, and perform analysis to understand the metrics impact of machine learning systems.
Data Science at Quora
Experimentation
● Extensive A/B testing, data-driven decision-making
● Separate, orthogonal “layers” for different parts of the system
● Experiment framework showing comparisons for various metrics
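A common way to get separate, orthogonal experiment layers is to hash the user ID together with a per-layer salt, so that a user's bucket in one layer is approximately independent of their bucket in another. The sketch below assumes that approach; the function name `assign_bucket` and the layer names are hypothetical, and the talk does not describe how Quora's framework assigns users.

```python
import hashlib


def assign_bucket(user_id, layer, num_buckets=2):
    """Deterministically map a user to a bucket within one experiment layer.

    Salting the hash with the layer name makes assignments in different
    layers approximately independent of each other.
    """
    key = f"{layer}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % num_buckets


# The same user gets a stable bucket per layer, chosen independently per layer.
feed_bucket = assign_bucket(user_id=42, layer="feed_ranking")
search_bucket = assign_bucket(user_id=42, layer="search_ranking")
```

Because the assignment is a pure function of the user ID and layer name, no assignment table needs to be stored, and a feed experiment and a search experiment can run on the same users without confounding each other.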
Personalization
Importance of Personalization
The importance of personalization is inversely proportional to how specific the user intent is.
The importance of personalization is directly proportional to the number of “right answers”.
Other contexts
● At a high level, personalization is adding a “user” context to relevance tasks
● Other contexts:
○ Location
○ Time
○ etc.
● Previous learnings generalize to these other contexts
The Search-Recommendation-Notification Spectrum
Questions?