Brad Klingenberg, Director of Styling Algorithms, Stitch Fix at MLconf SF - 11/13/15

Page 1

Combining Statistics and Expert Human Judgment for Better Recommendations

Brad Klingenberg, Stitch Fix

MLconf San Francisco 2015

Three lessons

Page 6

Lessons from having humans in the loop

Humans in the loop:

It works really well, but it’s complicated

Lesson 1: There’s more than one way to measure success

Lesson 2: You have to think carefully about what you’re predicting

Lesson 3: Humans can say “no”, and this complicates experiments

Page 7

Humans in the loop at Stitch Fix

Page 8

Stitch Fix

Page 12

Styling at Stitch Fix

Personal styling

Inventory

Page 13

Styling at Stitch Fix: personalized recommendations

Inventory Algorithmic recommendations

Statistics

Page 14

Styling at Stitch Fix: expert human curation

Human curation

Algorithmic recommendations

Page 15

Lesson 1: There’s more than one way to measure success

Page 16

Traditional recommenders

Learning through feedback

Page 17

Humans in the loop

Learning through feedback

Page 18

Measuring success

In the end, you are usually interested in optimizing [expression missing from the extracted slide], and this may make sense for the combined system.

But when optimizing an algorithm, it is important to consider selection.

Page 22

Optimizing interaction

For a set of algorithms with the same marginal performance, we generally prefer the algorithms that

● increase agreement and reduce needed searching (credible and useful recommendations)

● make the humans more efficient (effortless curation)

● have a better user experience (fewer bad or annoying recommendations)

Page 23

Logging selection

This means logging and analyzing selection data
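
A minimal sketch of what logging and analyzing selection data could look like, assuming each log record stores the ranked list the algorithm showed and the items the human curator actually picked. The record layout and the names `SelectionEvent`, `agreement_rate`, and `mean_search_depth` are hypothetical and only illustrate the "agreement" and "needed searching" ideas from Lesson 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SelectionEvent:
    """One curation session: what was recommended and what the human chose."""
    recommended: List[str]   # item ids in the order the algorithm ranked them
    selected: List[str]      # item ids the human curator actually picked


def agreement_rate(events: List[SelectionEvent], k: int) -> float:
    """Fraction of selected items that appeared in the top-k recommendations."""
    hits = 0
    total = 0
    for event in events:
        top_k = set(event.recommended[:k])
        hits += sum(1 for item in event.selected if item in top_k)
        total += len(event.selected)
    return hits / total if total else 0.0


def mean_search_depth(events: List[SelectionEvent]) -> Optional[float]:
    """Average 1-based rank of selected items, a proxy for needed searching.

    Items picked from outside the recommended list are skipped here.
    """
    ranks = []
    for event in events:
        position = {item: i + 1 for i, item in enumerate(event.recommended)}
        ranks.extend(position[item] for item in event.selected if item in position)
    return sum(ranks) / len(ranks) if ranks else None


# Made-up log records for illustration.
events = [
    SelectionEvent(recommended=["a", "b", "c", "d"], selected=["a", "c"]),
    SelectionEvent(recommended=["e", "f", "g", "h"], selected=["h", "z"]),
]
print(agreement_rate(events, k=2))   # share of picks found in the top 2
print(mean_search_depth(events))     # average rank of picks that were recommended
```

Two algorithms with the same end-to-end success rate can then be compared on these interaction metrics.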

Page 24

Lesson 2: You have to think carefully about what you’re predicting

Page 25

Training a model

What should you predict?

Naive approach: ignore selection and train on success data

Advantages

● “traditional” supervised problem

● simple historical data

Page 28

Censoring through selection

Problem: selection can censor your data

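[Figure: a 2×2 table crossing “Arms flaunted” (Yes / No) with “Success” (Yes / No). Outcomes are observed in only one part of the table, with probabilities p and 1-p; the remaining cells are unknown (“?”).]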
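
The censoring problem can be illustrated with a small simulation. This is a sketch under made-up assumptions: the success probabilities, the 50/50 split of client preferences, and the rule that a stylist ships a sleeveless item only to clients who like showing their arms are all hypothetical.

```python
import random

random.seed(0)

# Hypothetical ground truth: probability that a sleeveless item succeeds,
# keyed by whether the client likes showing her arms.
P_SUCCESS = {True: 0.8, False: 0.1}

shipped_outcomes = []   # outcomes we actually get to observe
all_outcomes = []       # outcomes we would see if every client got the item

for _ in range(10_000):
    likes_sleeveless = random.random() < 0.5
    success = random.random() < P_SUCCESS[likes_sleeveless]
    all_outcomes.append(success)
    # The stylist only ships the item to clients who like sleeveless tops,
    # so the outcome for everyone else is never observed (censored).
    if likes_sleeveless:
        shipped_outcomes.append(success)

naive_rate = sum(shipped_outcomes) / len(shipped_outcomes)
true_rate = sum(all_outcomes) / len(all_outcomes)
print(f"success rate among shipped items: {naive_rate:.2f}")
print(f"success rate if shipped to everyone: {true_rate:.2f}")
```

A model trained only on the shipped outcomes learns the first number and has no data at all for the clients the stylist screened out.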

Page 29

Predicting selection

What about predicting selection?

Page 30

Predicting selection

● Simple, but selection is not really success

● There is a much more direct feedback loop

Page 31

Training a model

You should probably consider both.

It is most interesting when they disagree: selection model vs. success model.
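
One way to surface such disagreements is to score each candidate item with both models and flag large gaps. The sketch below assumes the two models already produce a probability per item; the function name `find_disagreements`, the threshold, and the example scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def find_disagreements(
    p_selection: Dict[str, float],
    p_success: Dict[str, float],
    threshold: float = 0.4,
) -> List[Tuple[str, float, float]]:
    """Return (item, p_selection, p_success) where the two scores differ a lot.

    Results are sorted by the size of the gap, largest first.
    """
    flagged = []
    for item in p_selection.keys() & p_success.keys():
        gap = abs(p_selection[item] - p_success[item])
        if gap >= threshold:
            flagged.append((item, p_selection[item], p_success[item]))
    flagged.sort(key=lambda row: abs(row[1] - row[2]), reverse=True)
    return flagged


# Made-up scores for three items.
p_selection = {"sequin dress": 0.9, "grey hoodie": 0.1, "polka dot top": 0.5}
p_success = {"sequin dress": 0.85, "grey hoodie": 0.85, "polka dot top": 0.45}

for item, sel, suc in find_disagreements(p_selection, p_success):
    print(f"{item}: P(selected)={sel:.2f}, P(success)={suc:.2f}")
```

Whether a flagged item is a good or a bad disagreement still needs a human look; the scores only point to where to look.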

Page 32

Good disagreement

Ignoring an inappropriate recommendation

Client request: “I need an outfit for a glamorous night out!”

Page 35

Bad disagreement

Stylist not choosing something that would be successful

Could lack trust in the recommendation: importance of transparency

Predicted probability of success = 85%

?

Based on her recent purchase

Page 36

Lesson 3: Humans can say “no”, and this complicates experiments

-or-

“the downside of free will”

Page 38

Testing with humans in the loop

Toy example: Suppose we want to test a (bad) new policy

New rule: all fixes must contain polka dots!

Page 39

An experiment

Control Test (Polka Dots Rule)

Page 40

Selective non-compliance

Humans may not comply, or they may comply only selectively.

Test (Polka Dots Rule)

“Please don’t send me any polka dots” - client X

“Hmm, no”

Page 41

Selective non-compliance

Control Test (Polka Dots Rule)

Page 44

Selective non-compliance

Humans help avoid bad choices - this is great for the client!

But this can obscure the effect you are trying to measure.

Helpful analogy: non-compliance in clinical trials. This has been intensively studied.
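
A toy simulation can show how selective non-compliance shrinks the measured effect. It compares the test and control arms by assignment, which clinical trials call an intention-to-treat comparison. Every number here is made up: the base success rate, the share of clients who dislike polka dots, and the assumption that stylists refuse the rule exactly for those clients.

```python
import random

random.seed(1)

N = 20_000
BASE = 0.6                              # hypothetical success rate without polka dots
WITH_DOTS = {True: 0.1, False: 0.55}    # success with dots, keyed by "dislikes dots"


def simulate(stylist_can_refuse: bool) -> float:
    """Return test-arm minus control-arm success rate, compared by assignment."""
    outcomes = {"control": [], "test": []}
    for _ in range(N):
        arm = random.choice(["control", "test"])
        dislikes_dots = random.random() < 0.3
        # In the test arm the rule applies, unless the stylist overrides it
        # for a client who has asked not to receive polka dots.
        gets_dots = arm == "test" and not (stylist_can_refuse and dislikes_dots)
        p = WITH_DOTS[dislikes_dots] if gets_dots else BASE
        outcomes[arm].append(random.random() < p)
    rate = {arm: sum(v) / len(v) for arm, v in outcomes.items()}
    return rate["test"] - rate["control"]


full_compliance = simulate(stylist_can_refuse=False)
selective = simulate(stylist_can_refuse=True)
print(f"effect if the rule were always followed: {full_compliance:+.3f}")
print(f"effect measured with selective non-compliance: {selective:+.3f}")
```

In this sketch the stylists protect the clients who would have been hurt most, so the experiment reports a much smaller harm than the policy would cause if it were enforced without a human in the loop.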

Page 45

Lessons from having humans in the loop

Humans in the loop

It works really well, but it’s complicated

Lesson 1: There’s more than one way to measure success

Lesson 2: You have to think carefully about what you’re predicting

Lesson 3: Humans can say “no”, and this complicates experiments

Page 46

Thanks!

Questions? (we’re hiring!)