Machine Learning at Quora (2/26/2016)

Machine Learning at Quora

Nikhil Dandekar (@nikhilbd)

2/26/2016

Our Mission

“To share and grow the world’s knowledge”

● Millions of questions & answers
● Millions of users
● Over a million topics
● ...

What we care about

● Demand
● Quality
● Relevance

Agenda

● The core data
● Feed ranking
● Other Machine Learning
● Data Science @Quora
● Personalization

The Core Data

Lots of data relations

Complex network propagation effects

Importance of topics & semantics

Feed ranking @Quora

Ranking - Feed

• Goal: Present most interesting stories for a user at a given time
• Interesting = topical relevance + social relevance + timeliness
• Stories = questions + answers
• Relevance-ordered vs time-ordered = big gains in engagement
• Challenges:
  • potentially many candidate stories
  • real-time ranking
  • optimize for relevance
• Use Machine Learning for feed ranking

Feed dataset: impression logs

[Figure: feed impressions annotated with logged user actions: click, upvote, downvote, expand, share, answer, pass, follow]

What is relevance?

● Value of showing a story to a user, e.g. weighted sum of actions: $v = \sum_a v_a \, \mathbb{1}\{y_a = 1\}$
● Goal: predict this value for new stories. 2 possible approaches:
  ○ predict value directly: $v_{\text{pred}} = f(x)$
    ■ pros: single regression model
    ■ cons: can be ambiguous, coupled
  ○ predict probabilities for each action, then compute expected value: $v_{\text{pred}} = \mathbb{E}[V \mid x] = \sum_a v_a \, p(a \mid x)$
    ■ pros: better use of supervised signal, decouples action models from action values
    ■ cons: more costly, one classifier per action
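A minimal sketch of the value computation, under stated assumptions: the action names and the weights in `ACTION_VALUES` are made-up illustrative numbers (the slides give no actual values), and the probabilities stand in for the output of one classifier per action.

```python
# Hypothetical action values v_a; the real weights are not given in the slides.
ACTION_VALUES = {"click": 1.0, "upvote": 5.0, "share": 10.0, "downvote": -8.0}


def observed_value(actions_taken):
    """v = sum_a v_a * 1{y_a = 1}: value of one logged impression."""
    return sum(ACTION_VALUES[a] for a in actions_taken)


def expected_value(action_probs):
    """v_pred = E[V | x] = sum_a v_a * p(a | x)."""
    return sum(ACTION_VALUES[a] * p for a, p in action_probs.items())


# p(a | x) for one (user, story) pair, as if produced by per-action classifiers.
probs = {"click": 0.30, "upvote": 0.05, "share": 0.01, "downvote": 0.02}

print(observed_value({"click", "upvote"}))
print(round(expected_value(probs), 4))
```

Because the action values live outside the classifiers, they can be re-tuned without retraining any model, which is the decoupling named in the pros.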

Feature engineering

● Essential for getting good ranking
● Better if updated in real-time (more reactive)
● Main sets of features:
  ○ user (e.g. age, country, recent activity)
  ○ story (e.g. popularity, trendiness, quality)
  ○ interactions between the two (e.g. topic or author affinity)
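As a rough illustration of the three feature sets, the sketch below assembles user, story, and interaction features into one dictionary. All field names (`followed_topics`, `recent_actions`, `upvotes`, and so on) and the definition of topic affinity are hypothetical, chosen only for this example.

```python
def build_features(user, story):
    """Combine user, story, and user-story interaction features into one dict."""
    followed = set(user["followed_topics"])
    story_topics = set(story["topics"])
    # Interaction feature: fraction of the story's topics that the user follows.
    topic_affinity = len(followed & story_topics) / len(story_topics) if story_topics else 0.0
    return {
        # User features
        "user_country": user["country"],
        "user_recent_actions": user["recent_actions"],
        # Story features
        "story_popularity": story["upvotes"],
        # Interaction features
        "topic_affinity": topic_affinity,
        "follows_author": float(story["author_id"] in user["followed_users"]),
    }


user = {"country": "US", "recent_actions": 12,
        "followed_topics": ["ML", "Python"], "followed_users": [7, 42]}
story = {"upvotes": 250, "topics": ["ML", "Statistics"], "author_id": 42}
print(build_features(user, story))
```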

Models

● Linear
  ○ simple, fast to train
  ○ manual, non-linear transforms for richer representation (buckets, ngrams)
● Decision trees
  ○ learn non-linear representations
● Tree ensembles
  ○ Random forests
  ○ Gradient boosted decision trees
● In-house C++ training code, third-party libraries for prototyping new models
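To illustrate the "buckets" transform for linear models, here is a small sketch that one-hot encodes a numeric feature by range, so a linear model can learn a separate weight per range. The boundaries in `UPVOTE_BOUNDARIES` are invented for the example.

```python
from bisect import bisect_right


def bucketize(value, boundaries):
    """One-hot encode `value` by the bucket it falls into.

    `boundaries` must be sorted; there are len(boundaries) + 1 buckets.
    """
    index = bisect_right(boundaries, value)
    return [1.0 if i == index else 0.0 for i in range(len(boundaries) + 1)]


# Hypothetical bucket boundaries for a story's upvote count.
UPVOTE_BOUNDARIES = [10, 100, 1000]

print(bucketize(5, UPVOTE_BOUNDARIES))    # [1.0, 0.0, 0.0, 0.0]
print(bucketize(250, UPVOTE_BOUNDARIES))  # [0.0, 0.0, 1.0, 0.0]
```

Tree-based models learn such thresholds on their own, which is why the manual transform is listed only under linear models.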

Scalability: feed backend system

[Diagram: Requests from Web (python); Aggregator 1, Aggregator 2, Aggregator 3, ...; Leaf 1, Leaf 2, Leaf 3, ...; labels: user_id, object_id]

Machine Learning @Quora

Ranking - Answer ranking

What is a good Quora answer?

• truthful
• reusable
• provides explanation
• well formatted
• ...

Ranking - Answer ranking

How are those dimensions translated into features?

• Features that relate to the text quality itself

• Interaction features (upvotes/downvotes, clicks, comments…)

• User features (e.g. expertise in topic)

How we think of search

Ranking - Search ranking

● Match user queries to Quora entities

● Corpus: Quora questions, answers, topics, users, blogs etc.

● Ranking: Traditional IR scores (e.g. BM25), hand-tuned or ML-ranking

● Focus on long-term satisfaction
  ○ If a question exists, but the answer is unsatisfactory, let the user “Re-Ask” the question
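For reference, a self-contained sketch of Okapi BM25, the traditional IR score named above. The defaults `k1=1.5` and `b=0.75` are common textbook choices, not values from the talk, and whitespace tokenization is a simplification.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "how do i learn machine learning".split(),
    "what is the best pizza in new york".split(),
    "machine learning at quora".split(),
]
scores = bm25_scores("machine learning".split(), docs)
print(scores)
```

In this toy corpus the shorter matching document scores highest and the unrelated one scores zero. An ML ranker could use such a score as one input feature among others.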

Question Asking

Goal: Find the best people to answer a question

● Understand the question
● Find people who can best answer the question
● “Ask to Answer”: Route the question to these people
● Either manual or automated A2A

Recommendations - Topics

Goal: Recommend new topics for the user to follow

• Based on:
  • Other topics followed
  • Users followed
  • User interactions
  • Topic-related features
  • ...

Recommendations - Users

Goal: Recommend new users to follow

• Based on:
  • Other users followed
  • Topics followed
  • User interactions
  • User-related features
  • ...

Related Questions

• Given interest in question A (source) what other questions will be interesting?

• Not only about similarity, but also “interestingness”

• Features such as:
  • Textual
  • Co-visit
  • Topics
  • …

• Important for logged-out use case

Duplicate Questions

• Important issue for Quora
• Want to make sure we don’t disperse knowledge across multiple copies of the same question
• Solution: binary classifier trained with labelled data
• Features:
  • Textual vector space models
  • Usage-based features
  • ...
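One simple textual vector-space feature is the cosine similarity between bag-of-words vectors of two questions. The sketch below is an illustration only: lowercasing and whitespace tokenization are simplifying assumptions, and the binary classifier that would consume this feature is not shown.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between term-count vectors of two questions."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


print(cosine_similarity("How do I learn Python", "How can I learn Python"))
print(cosine_similarity("How do I learn Python", "Best pizza in Brooklyn"))
```

The first pair shares four of five words and scores close to 0.8; the second pair shares none and scores 0.0.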

User Trust/Expertise Inference

Goal: Infer user’s trustworthiness in relation to a given topic

• We take into account:
  • Answers written on topic
  • Upvotes/downvotes received
  • Endorsements
  • ...
• Trust/expertise propagates through the network
• Must be taken into account by other algorithms

Spam Detection/Moderation

• Very important for Quora to keep quality of content
• Pure manual approaches do not scale
• Hard to get algorithms 100% right
• ML algorithms detect content/user issues
• Output of the algorithms feeds manually curated moderation queues

Trending Topics

Goal: Highlight current events that are interesting for the user

• We take into account:
  • Global “Trendiness”
  • Social “Trendiness”
  • User’s interest
  • ...
• Trending topics are a great discovery mechanism

Models

● Logistic Regression
● Elastic Nets
● Gradient Boosted Decision Trees
● Random Forests
● (Deep) Neural Networks
● LambdaMART
● Matrix Factorization
● LDA
● ...

Data Science @Quora

Data Science at Quora

● Both ML engineers and data scientists are involved in machine learning

● ML engineers build, implement, and maintain production machine learning systems.

● Data scientists conduct research to generate ideas about machine learning projects, and perform analysis to understand the metrics impact of machine learning systems.

Data Science at Quora

Experimentation

● Extensive A/B testing, data-driven decision-making
● Separate, orthogonal “layers” for different parts of the system
● Experiment framework showing comparisons for various metrics
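One common way to get orthogonal layers is to hash the user ID together with the layer name, so a user's bucket in one layer is effectively independent of their bucket in another. This is an assumption for illustration; the slides do not say how Quora's framework assigns users. The layer and variant names are hypothetical.

```python
import hashlib


def assign_variant(user_id, layer, variants):
    """Deterministically map a user to one variant within a layer."""
    # Salting the hash with the layer name decorrelates assignments across layers.
    digest = hashlib.sha256(f"{layer}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# Hypothetical layers: one for feed ranking, one for search ranking.
feed_variant = assign_variant(12345, "feed_ranking", ["control", "treatment"])
search_variant = assign_variant(12345, "search_ranking", ["control", "treatment"])
print(feed_variant, search_variant)
```

The assignment is stable across calls, so a user sees the same variant for the life of the experiment without any stored state.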

Personalization

Importance of Personalization

The importance of personalization is inversely proportional to how specific the user intent is.

The importance of personalization is directly proportional to the number of “right answers”.

Importance of Personalization

Other contexts

● At a high level, personalization adds a “user” context to relevance tasks
● Other contexts:
  ○ Location
  ○ Time
  ○ etc.
● Previous learnings generalize to these other contexts

The Search-Recommendation-Notification Spectrum

Questions?

