Brad Klingenberg, Director of Styling Algorithms, Stitch Fix at MLconf SF - 11/13/15


Combining Statistics and Expert Human Judgment for Better Recommendations

Brad Klingenberg, Stitch Fix (brad@stitchfix.com), MLconf San Francisco 2015

Three lessons


Lessons from having humans in the loop

Humans in the loop: it works really well, but it's complicated

Lesson 1: There’s more than one way to measure success

Lesson 2: You have to think carefully about what you’re predicting

Lesson 3: Humans can say “no”, and this complicates experiments

Humans in the loop at Stitch Fix

Stitch Fix


Styling at Stitch Fix

[Slide: personal styling from the inventory]

Styling at Stitch Fix: personalized recommendations

[Diagram: inventory → algorithmic recommendations, powered by statistics]

Styling at Stitch Fix: expert human curation

[Diagram: algorithmic recommendations → human curation]

Lesson 1: There’s more than one way to measure success

Traditional recommenders: learning through feedback

Humans in the loop: learning through feedback

Measuring success

In the end, you are usually interested in optimizing overall success, and this may make sense for the combined system. But when optimizing an algorithm, it is important to consider selection.


Optimizing interaction

For a set of algorithms with the same marginal performance, we generally prefer the algorithms that:

● increase agreement and reduce needed searching (credible and useful recommendations)

● make the humans more efficient (effortless curation)

● have a better user experience (fewer bad or annoying recommendations)

Logging selection

This means logging and analyzing selection data
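To make this concrete, here is a minimal sketch of what such logging might look like (not Stitch Fix's actual pipeline; the function and field names are hypothetical): record what the algorithm recommended, what the stylist selected, and the eventual outcomes, so selection effects can be analyzed later.

```python
import json
import time

def log_styling_event(client_id, recommended_items, selected_items, outcomes=None):
    """Record algorithmic recommendations, the stylist's selections, and
    (later) outcomes for the selected items. All names are illustrative."""
    event = {
        "timestamp": time.time(),
        "client_id": client_id,
        "recommended": recommended_items,  # ranked algorithmic recommendations
        "selected": selected_items,        # subset chosen by the stylist
        "skipped": [i for i in recommended_items if i not in selected_items],
        "outcomes": outcomes or {},        # item_id -> kept / returned, filled in later
    }
    print(json.dumps(event))  # stand-in for writing to a real event store
    return event

# Example: the stylist selects 2 of 3 recommended items; outcomes arrive later.
log_styling_event(
    client_id="client_42",
    recommended_items=["dress_sleeveless", "blouse_polka_dot", "jeans_skinny"],
    selected_items=["dress_sleeveless", "jeans_skinny"],
    outcomes={"dress_sleeveless": "kept", "jeans_skinny": "returned"},
)
```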

Lesson 2: You have to think carefully about what you’re predicting

Training a model

What should you predict?

Naive approach: ignore selection and train on success data

Advantages:

● a "traditional" supervised problem

● simple historical data

Censoring through selection

Problem: selection can censor your data


[Figure: toy example with an "arms flaunted" attribute. Success (yes / no) is only observed for items the stylist selects (probability p); for unselected items (probability 1-p) the outcome is unobserved, shown as "?"]
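A small synthetic illustration of this censoring problem (toy data and numbers, not Stitch Fix's model): success is only observed for selected items, so a naive estimate from the observed rows is biased. One common mitigation, sketched here purely for illustration, is inverse-propensity weighting, which reweights observed rows by one over the selection probability.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One binary feature, e.g. "arms flaunted" (sleeveless) -- purely illustrative.
sleeveless = rng.integers(0, 2, size=n)

# True success rates: sleeveless items succeed less often for this client segment.
p_success = np.where(sleeveless == 1, 0.30, 0.60)
success = rng.random(n) < p_success

# Stylists rarely select sleeveless items, so their outcomes are mostly censored.
p_select = np.where(sleeveless == 1, 0.05, 0.80)
selected = rng.random(n) < p_select

# Naive estimate: success rate among *selected* items only.
naive = success[selected].mean()

# IPW estimate: reweight observed rows by 1 / P(selected).
# (Here the selection probability is known; in practice it would be estimated.)
weights = 1.0 / p_select[selected]
ipw = np.average(success[selected], weights=weights)

print(f"true overall success rate:  {p_success.mean():.3f}")
print(f"naive (selected-only) rate: {naive:.3f}")  # biased upward
print(f"IPW-corrected rate:         {ipw:.3f}")
```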

Predicting selection

What about predicting selection?


● Simple, but selection is not really success

● There is a much more direct feedback loop

Training a model

You should probably consider both. It is most interesting when they disagree: selection model vs. success model.
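One way to act on this, sketched with synthetic data (illustrative only, not the production system): fit a selection model on all recommendations, fit a success model only where outcomes are observed, and surface the items where the two predictions disagree most.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, d = 5_000, 5

# Synthetic item/client features (illustrative only).
X = rng.normal(size=(n, d))

# Simulated labels: selection and success depend on overlapping but different features.
selected = rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))
success = rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + 0.8 * X[:, 2])))

# Selection model: trained on all recommendations (selection is always observed).
selection_model = LogisticRegression().fit(X, selected)

# Success model: trained only where success is observed, i.e. on selected items.
success_model = LogisticRegression().fit(X[selected], success[selected])

p_select = selection_model.predict_proba(X)[:, 1]
p_success = success_model.predict_proba(X)[:, 1]

# "Bad disagreement" candidates: high predicted success but low predicted selection.
disagreement = p_success - p_select
worth_reviewing = np.argsort(-disagreement)[:10]
for i in worth_reviewing:
    print(f"item {i}: P(success)={p_success[i]:.2f}, P(selected)={p_select[i]:.2f}")
```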

Good disagreement

Ignoring an inappropriate recommendation

Client request: “I need an outfit for a glamorous night out!”


Bad disagreement

Stylist not choosing something that would be successful (predicted probability of success = 85%)

Could lack trust in the recommendation: importance of transparency (e.g. showing the stylist "Based on her recent purchase")

Lesson 3: Humans can say “no”, and this complicates experiments

-or-

“the downside of free will”

Testing with humans in the loop

Toy example: Suppose we want to test a (bad) new policy

New rule: all fixes must contain polka dots!

An experiment

Control vs. Test (Polka Dots Rule)

Selective non-compliance

Humans may not comply, or they may comply only selectively.

[Figure: in the test group (Polka Dots Rule), a stylist thinks "Hmm, no" because client X has asked "Please don't send me any polka dots"]

Humans help avoid bad choices - this is great for the client! But this can obscure the effect you are trying to measure.

Helpful analogy: non-compliance in clinical trials. This has been studied intensively.
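A toy simulation of that analogy (hypothetical numbers, not real experiment data): the intention-to-treat estimate compares clients as randomized and is diluted by stylists' selective non-compliance, while a simple compliance-adjusted (Wald / instrumental-variable) estimate rescales it by the compliance rate.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Randomize clients to control (0) or the polka-dot rule (1).
assigned = rng.integers(0, 2, size=n)

# Stylists comply selectively: clients who hate polka dots never get them.
hates_polka_dots = rng.random(n) < 0.30
treated = (assigned == 1) & ~hates_polka_dots

# Toy outcome: the rule hurts success by 10 points when actually applied.
base = 0.50
outcome = rng.random(n) < np.where(treated, base - 0.10, base)

# Intention-to-treat: compare by assignment, regardless of what stylists did.
itt = outcome[assigned == 1].mean() - outcome[assigned == 0].mean()

# Compliance-adjusted (Wald / IV) estimate: scale ITT by the compliance rate.
compliance = treated[assigned == 1].mean() - treated[assigned == 0].mean()
cace = itt / compliance

print(f"ITT effect (diluted by non-compliance): {itt:+.3f}")
print(f"Compliance-adjusted effect:             {cace:+.3f}")  # close to -0.10
```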

Lessons from having humans in the loop

Humans in the loop: it works really well, but it's complicated

Lesson 1: There’s more than one way to measure success

Lesson 2: You have to think carefully about what you’re predicting

Lesson 3: Humans can say “no”, and this complicates experiments

Thanks!

Questions? (We're hiring!)