Intern Presentation: A/B Testing by Interleaving (Sida Wang)


Page 1

Intern Presentation

A/B Testing by Interleaving

Sida Wang

Page 2

My Project

• Evaluating search relevance by interleaving results and collecting user data

– Interleaving Framework

• Generic, Extensible

– Experiments to evaluate relevance by interleaving

• Based on the paper "How Does Clickthrough Data Reflect Retrieval Quality?" by F. Radlinski et al.

Page 3

Evaluating Search Relevance

• Without Interleaving

– Full-time human judges -> precision, recall, NDCG (see the NDCG sketch below)

– Compare Search
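For reference, a minimal sketch of how one of those judge-based metrics (NDCG) is computed; the graded judgments and the cutoff k are hypothetical, and this uses one common DCG formulation.

import math

def dcg(relevances, k):
    # Discounted cumulative gain over the top k graded judgments.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical judgments for one query: 2 = very relevant, 1 = somewhat, 0 = not relevant.
print(ndcg([2, 0, 1, 0, 2], k=5))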

Page 4

Compare Search

[Side-by-side comparison: results S1, S2, S3 from one engine shown next to results G1, G2, G3 from the other]

Page 5

Page 6

Issues

• Aas

• But do Microsoft people pick O14 Search or Google Mini?

• Maybe people tend to pick the left?

• Alters the search experience

– Can never collect a lot of data using this method

Page 7

By Interleaving

[Diagram: ranking A returns A1, A2, A3 (all relevant) and ranking B returns B1, B2, B3 (all useless); the two lists are interleaved into the single result list shown to the user]

Page 8

By Interleaving

[Diagram: the same two rankings, A1–A3 and B1–B3, interleaved into one list without the relevance labels]
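Reading the two diagrams as code: each click on the interleaved list is credited to the ranking that contributed the clicked result, and the ranking with more credited clicks wins that query. A minimal sketch, with hypothetical function names and the A/B example from the slides:

def credit_clicks(interleaved, source, clicked):
    # interleaved: result ids in display order
    # source: result id -> "A" or "B" (which ranking contributed it)
    # clicked: set of result ids the user clicked
    wins = {"A": 0, "B": 0}
    for doc in interleaved:
        if doc in clicked:
            wins[source[doc]] += 1
    if wins["A"] > wins["B"]:
        return "A", wins
    if wins["B"] > wins["A"]:
        return "B", wins
    return "tie", wins

# Mirrors the slide: A's results are relevant, B's are useless, so clicks land on A.
interleaved = ["A1", "B1", "A2", "B2", "A3", "B3"]
source = {d: d[0] for d in interleaved}
print(credit_clicks(interleaved, source, clicked={"A1", "A2"}))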

Page 9

Considerations

• Minimize impact to UX

– So no demo; it looks exactly like normal search

• Minimize Bias

– Summary normalization

– Interleaving algorithms

• Reliability, performance, and the usual concerns

Page 10

Experiments I did

• Automated random clicks

• Automated clicks according to relevance judgments

• Clicks from real people

Page 11

Random Clicks

[Chart: Control Using Automated Random Clicks; % of votes received vs. number of clicks (0 to 4,500); series: Betaa, MSW, Ties]
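The control in this chart can be approximated with a quick simulation: if clicks fall uniformly at random, the two sides should split the non-tied votes roughly evenly. This is a hypothetical re-creation, not the harness behind the chart; the click probability and query count are made up.

import random

def simulate_random_clicks(queries=4000, results_per_side=3, seed=0):
    rng = random.Random(seed)
    tally = {"A": 0, "B": 0, "tie": 0}
    for _ in range(queries):
        # Interleaved list of six results, half contributed by each ranking.
        sources = ["A", "B"] * results_per_side
        rng.shuffle(sources)
        # A random user clicks each result independently with probability 0.2.
        clicks = [s for s in sources if rng.random() < 0.2]
        a, b = clicks.count("A"), clicks.count("B")
        tally["A" if a > b else "B" if b > a else "tie"] += 1
    return tally

print(simulate_random_clicks())  # A and B should receive roughly equal votes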

Page 12

A Lot of Random Clicks

[Chart: Control Using Automated Random Clicks; % of votes received vs. number of clicks (0 to 30,000); series: ac11, a86f, ties]

Page 13

Experiments I did

• Automated random clicks

• Automated clicks according to relevance judgments

• Clicks from real people

Page 14

O12 vs. O14

[Chart: Automated Clicks Using Relevance Judgments; % of votes received vs. number of clicks (0 to 4,000); series: Acing05 Degraded, Acing05, Ties]
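The second experiment can be sketched the same way: clicks are driven by relevance judgments rather than coin flips, so the side contributing more relevant results should collect most of the votes. The judgment rates and click rule below are illustrative only.

import random

def simulate_judged_clicks(queries=1000, seed=0):
    rng = random.Random(seed)
    tally = {"good": 0, "degraded": 0, "tie": 0}
    for _ in range(queries):
        # The "good" ranking contributes mostly relevant results,
        # the "degraded" ranking mostly irrelevant ones (hypothetical rates).
        results = [("good", rng.random() < 0.7) for _ in range(3)] + \
                  [("degraded", rng.random() < 0.3) for _ in range(3)]
        rng.shuffle(results)
        # The simulated user clicks relevant results only.
        clicks = [src for src, relevant in results if relevant]
        g, d = clicks.count("good"), clicks.count("degraded")
        tally["good" if g > d else "degraded" if d > g else "tie"] += 1
    return tally

print(simulate_judged_clicks())  # the good ranking should collect most of the votes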

Page 15

Experiments I did

• Automated random clicks

• Automated clicks according to relevance judgments

• Clicks from real people

Page 16

O12 vs. O14

[Chart: O12 vs. O14 in BSG ALL; % of votes received vs. number of clicks (0 to 90); series: O12, O14, Tie]

Page 17

Method of Analysis (election)

• Vote by query, by user, by session, etc.

• query = person, user = state
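A sketch of the three counting schemes, assuming a hypothetical record format of (user, session, winner) per interleaved query: tally ballots directly, or roll them up so that each user or each session casts a single majority vote.

from collections import Counter, defaultdict

def tally(records, key=None):
    if key is None:  # direct election: one vote per query
        return Counter(w for _, _, w in records if w != "tie")
    # Electoral style: group ballots, then each group casts one vote for its majority winner.
    groups = defaultdict(Counter)
    for user, session, winner in records:
        if winner != "tie":
            groups[key((user, session))][winner] += 1
    return Counter(c.most_common(1)[0][0] for c in groups.values())

records = [("u1", "s1", "O14"), ("u1", "s1", "O12"), ("u1", "s2", "O14"),
           ("u2", "s3", "O14"), ("u2", "s3", "tie")]
print(tally(records))                                 # by query
print(tally(records, key=lambda us: us[0]))           # by user
print(tally(records, key=lambda us: (us[0], us[1])))  # by session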

Page 18

Summary of Results

Method of voting, O12 vs. O14:

– by queries (direct election): 12 vs. 24

– by users (1 vote per state): 4 vs. 9

– by sessions (~electoral votes): 5 vs. 11

• The system does not seem to matter much, but there are too few clicks (85) to draw a significant conclusion (a quick sign-test sketch follows)
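That caveat can be quantified with a two-sided sign test (an exact binomial test against a fair coin) over the non-tied votes; the helper below is a sketch, with ties dropped as in the counts above.

from math import comb

def sign_test_p(wins_a, wins_b):
    # Two-sided exact binomial test of H0: each non-tied vote is a fair coin flip.
    n, k = wins_a + wins_b, min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p(12, 24))  # by queries
print(sign_test_p(5, 11))   # by sessions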

Page 19

What Logically Follows

• Google Mini vs. O14 (after fixing Google Mini)

• FAST vs. O14 (after fixing RSS in fssearchoffice)

• I’d love to see the results

Page 20

What can interleaving do?

• Give relevance team more confidence

• Use interleaving for displaying results

• Use interleaving to automatically tune the search engine

[Slide annotation: an "Ambition" arrow runs alongside the three bullets]

Page 21

Add Confidence

• In addition to very traditional measures like NDCG, precision, and recall, it is nice to have another independent metric.

• Automatic

– Does not require human judgments

• Scalable

– Small impact to UX

Page 22

What can interleaving do?

• Give relevance team more confidence

• Use interleaving for displaying results

• Use interleaving to automatically tune the search engine

[Slide annotation: an "Ambition" arrow runs alongside the three bullets]

Page 23

Display

Page 24

Display

Page 25

What can we do?

• Give relevance team more confidence

• Use interleaving for displaying results

• Use interleaving to automatically tune the search engine

[Slide annotation: an "Ambition" arrow runs alongside the three bullets]

Page 26

Automatic Tuning

• Many relevance models, each good for a particular type of corpus (specs, user data, academic articles, product catalog, websites)

• Use interleaving in 10% of searches

• Use user click data to:

– Automatically and dynamically decide on the best model, or tweak model parameters
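As a sketch of what "automatically decide on the best model" could look like (the names and routing policy are assumptions, not the deck's design): keep a running interleaving win count per model, serve the current leader, and let the 10% experiment slice keep comparing models.

import random
from collections import Counter

class ModelSelector:
    """Route most traffic to the best-scoring relevance model; keep exploring on a small slice."""
    def __init__(self, models, explore_rate=0.10, seed=0):
        self.models = list(models)
        self.explore_rate = explore_rate
        self.wins = Counter()                  # interleaving votes won per model
        self.rng = random.Random(seed)

    def record_win(self, model):
        self.wins[model] += 1                  # one interleaved query voted for this model

    def choose(self):
        if self.rng.random() < self.explore_rate:
            return tuple(self.rng.sample(self.models, 2))    # interleave two models for this search
        best = max(self.models, key=lambda m: self.wins[m])  # current leader serves the query alone
        return (best,)

selector = ModelSelector(["specs", "user-data", "articles", "catalog", "web"])
selector.record_win("web")
print(selector.choose())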

Page 27

Thank you!

• Dmitriy, Eugene, Puneet

• Jamie, Jessica, Ping, Victor, Relevance Team

• Russ, Jon

• Search Team

• Hope to see you again in the future!

Page 28

Extra Slides

Page 29

Automatic Tuning – Pairwise?

• Pairwise comparisons scale poorly

• But there seems to be "strong stochastic transitivity"

– Given locations A, B, C

– If A > B > C, then ΔAC > max(ΔAB, ΔBC)
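A small check of that property on made-up pairwise margins; delta[(X, Y)] is how much X beats Y by in an interleaving comparison, and the candidate names are hypothetical.

def satisfies_sst(delta, order):
    # order lists the candidates from best to worst, e.g. ["A", "B", "C"].
    for i, a in enumerate(order):
        for j in range(i + 1, len(order)):
            for k in range(j + 1, len(order)):
                b, c = order[j], order[k]
                # SST: the outer margin should be at least the larger of the two inner margins.
                if delta[(a, c)] < max(delta[(a, b)], delta[(b, c)]):
                    return False
    return True

# Made-up margins consistent with A > B > C.
delta = {("A", "B"): 0.05, ("B", "C"): 0.08, ("A", "C"): 0.12}
print(satisfies_sst(delta, ["A", "B", "C"]))  # True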

Page 30

How to Interleave

• Balanced

• Team Draft

Page 31

Balanced Interleaving
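A sketch of the usual balanced-interleaving construction (following the formulation in the Radlinski et al. paper): a single coin flip decides which ranking leads on ties, and the merged list draws nearly equally from the top of both inputs. The function name and example documents are mine.

import random

def balanced_interleave(ranking_a, ranking_b, length, rng=random):
    a_first = rng.random() < 0.5   # one coin flip per query decides who leads on ties
    ka = kb = 0                    # how far we have walked down each input ranking
    interleaved = []
    while len(interleaved) < length and (ka < len(ranking_a) or kb < len(ranking_b)):
        take_a = ka < kb or (ka == kb and a_first)
        if take_a and ka < len(ranking_a):
            doc, ka = ranking_a[ka], ka + 1
        elif kb < len(ranking_b):
            doc, kb = ranking_b[kb], kb + 1
        else:
            doc, ka = ranking_a[ka], ka + 1
        if doc not in interleaved:  # a result already placed by the other ranking is skipped
            interleaved.append(doc)
    return interleaved

print(balanced_interleave(["d1", "d2", "d3"], ["d2", "d4", "d5"],
                          length=5, rng=random.Random(0)))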

Page 32

Team Draft

[Diagram: pickup-game analogy; two captains take turns picking players (LeBron James, Kobe Bryant, Tim Duncan, John Smith), each taking the best player still available; two possible pick orders are shown]
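The draft analogy maps directly to code. Below is a sketch of team-draft interleaving: the two rankings alternate picks, a coin flip breaks ties on who picks next, and each result remembers which team drafted it so clicks can be credited later. Function names and example documents are hypothetical.

import random

def team_draft_interleave(ranking_a, ranking_b, length, rng=random):
    interleaved, team_of = [], {}
    picks_a = picks_b = 0
    while len(interleaved) < length:
        # The team with fewer picks goes next; a coin flip breaks ties each round.
        a_turn = picks_a < picks_b or (picks_a == picks_b and rng.random() < 0.5)
        ranking = ranking_a if a_turn else ranking_b
        pick = next((d for d in ranking if d not in interleaved), None)
        if pick is None:
            # This team has nothing new left; let the other team pick instead.
            a_turn = not a_turn
            other = ranking_a if a_turn else ranking_b
            pick = next((d for d in other if d not in interleaved), None)
            if pick is None:
                break  # both rankings exhausted
        interleaved.append(pick)
        team_of[pick] = "A" if a_turn else "B"
        if a_turn:
            picks_a += 1
        else:
            picks_b += 1
    return interleaved, team_of

print(team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d5"],
                            length=4, rng=random.Random(0)))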

Page 33

A/B Testing By Interleaving

[Diagram: as on page 7, ranking A's relevant results (A1–A3) and ranking B's useless results (B1–B3) are interleaved into a single result list]