Ronny Kohavi, Distinguished Engineer, General Manager, Analysis and Experimentation, Microsoft
Joint work with Thomas Crook, Brian Frasca, and Roger Longbotham, A&E Team
A/B Testing Pitfalls
Slides at http://bit.ly/ABPitfalls
A/B Tests in One Slide
- Concept is trivial
  - Randomly split traffic between two (or more) versions: A (Control) and B (Treatment)
  - Collect metrics of interest
  - Analyze
- An A/B test is the simplest controlled experiment
  - A/B/n refers to multiple treatments (often used and encouraged: try control + two or three treatments)
  - MVT refers to multivariable designs (rarely used by our teams)
- Must run statistical tests to confirm differences are not due to chance
- Best scientific way to prove causality, i.e., that the changes in metrics are caused by changes introduced in the treatment(s)
ConversionXL Audience Statistics
[Bar chart] Experiments per year run by attendees, based on a pre-conference survey (N=118):
None: 13%; Few: 30%; 12 (one per month): 20%; 15-30: 20%; Lots and lots: 17%

83% of attendees ran fewer than 30 experiments last year. Experimenters at Microsoft use our ExP platform to start ~30 experiments per day.
Experimentation at Scale
- I've been fortunate to work at an organization that values being data-driven (video)
- We finish about 300 experiment treatments per week, mostly on Bing and MSN, but also on Office, OneNote, Xbox, Cortana, Skype, Exchange, OneDrive. (These are "real" useful treatments, not a 3x10x10 MVT = 300.)
- Each variant is exposed to between 100K and millions of users, sometimes tens of millions
- At Bing, 90% of eligible users are in experiments (10% are a global holdout, changed once a year)
- There is no single Bing. Since a user is exposed to over 15 concurrent experiments, they get one of 5^15 = 30 billion variants (debugging takes on a new meaning)
- Until 2014, the system was limiting usage as it scaled. Now the limits come from engineers' ability to code new ideas
Two Valuable Real Experiments
- What is a valuable experiment?
  - The absolute value of the delta between expected outcome and actual outcome is large
  - If you thought something was going to win and it wins, you have not learned much
  - If you thought it was going to win and it loses, it's valuable (learning)
  - If you thought it was "meh" and it was a breakthrough, it's HIGHLY valuable
  - See http://bit.ly/expRulesOfThumb for some examples of breakthroughs
- The experiments were run at Microsoft's Bing with millions of users in each
- For each experiment, we provide the OEC, the Overall Evaluation Criterion
- Can you guess the winner correctly? The three choices are:
  - A wins (the difference is statistically significant)
  - Flat: A and B are approximately the same (no stat-sig difference)
  - B wins
Example: Bing Ads with Site Links
- Should Bing add "site links" to ads, which allow advertisers to offer several destinations on ads?
- OEC: Revenue, with ads constrained to the same vertical pixels on average
- Pro of adding: richer ads; users are better informed about where they will land
- Cons: the constraint means on average 4 "A" ads vs. 3 "B" ads; variant B is 5 msec slower (compute + higher page weight)
[Side-by-side screenshots: A (left), B (right)]
- Raise your left hand if you think A wins (left)
- Raise your right hand if you think B wins (right)
- Don't raise your hand if you think they are about the same
Bing Ads with Site Links
- If you raised your left hand, you were wrong
- If you did not raise a hand, you were wrong
- Site links generate incremental revenue on the order of tens of millions of dollars annually for Bing
- The above change was costly to implement. We made two small changes to Bing, each of which took days to develop and increased annual revenues by over $100 million
Example: Underlining Links
- Does underlining increase or decrease clickthrough rate?
- OEC: Clickthrough rate on the search engine result page (SERP) for a query
[Side-by-side screenshots: A (with underlines), B (no underlines)]
- Raise your left hand if you think A wins (left, with underlines)
- Raise your right hand if you think B wins (right, without underlines)
- Don't raise your hand if you think they are about the same
Underlines
- If you raised your right hand, you were wrong
- If you did not raise a hand, you were wrong
- Underlines improve clickthrough rate for both algorithmic results and ads (so more revenue) and improve time to successful click
- Modern web designs do away with underlines, and most sites have adopted this design, despite data showing that users click less and take more time to click
- For search engines (Google, Bing, Yahoo), this is a very questionable industry direction
Pitfall 1: Misinterpreting P-values
- NHST = Null Hypothesis Statistical Testing, the "standard" model commonly used
- P-value <= 0.05 is the "standard" for rejecting the Null hypothesis
- The p-value is often misinterpreted. Here are some incorrect statements from Steve Goodman's A Dirty Dozen:
  1. If P = .05, the null hypothesis has only a 5% chance of being true
  2. A non-significant difference (e.g., P > .05) means there is no difference between groups
  3. P = .05 means that we have observed data that would occur only 5% of the time under the null hypothesis
  4. P = .05 means that if you reject the null hypothesis, the probability of a Type I error (false positive) is only 5%
- The problem is that the p-value gives us Prob(X >= x | H_0), whereas what we want is Prob(H_0 | X = x)
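The conditional direction in the last bullet can be made concrete in code. Below is a minimal sketch (mine, not from the slides) of a pooled two-proportion z-test using only the standard library; the traffic numbers are invented:

```python
from math import erf, sqrt

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variants share one conversion rate.

    The result is Prob(data at least this extreme | H0) -- NOT the
    probability that H0 is true given the data.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # common rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided tail of the standard normal, via erf
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Invented example: 2.0% vs. 2.2% conversion, 100K users per variant
p = two_proportion_p_value(2000, 100_000, 2200, 100_000)
print(round(p, 4))  # well below 0.05, so stat-sig under NHST
```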
Pitfall 2: Expecting Breakthroughs
- Breakthroughs are rare after initial optimizations
  - At Bing (well optimized), 80% of ideas fail to show value
  - At other products across Microsoft, about 2/3 of ideas fail
- Take Sessions/User, a key metric at Bing. Historically, it improves 0.02% of the time: that's one in 5,000 treatments we try!
- Most of the time, we invoke Twyman's law (http://bit.ly/twymanLaw)
- Note the relationship to the prior pitfall
  - With standard p-value computations, 5% of experiments will show a stat-sig movement in Sessions/User when there is no real movement (i.e., if the Null Hypothesis is true), half of those positive
  - 99.6% of the time, a stat-sig positive movement with p-value = 0.05 will be a false positive
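The 99.6% figure follows from Bayes' rule. A sketch of the arithmetic, where the 1-in-5,000 prior is from the slide and the ~50% power is my assumption, chosen to reproduce the stated number:

```python
# Fraction of stat-sig *positive* results that are false positives.
prior_true = 1 / 5000       # from the slide: Sessions/User truly improves
alpha_positive = 0.05 / 2   # only the positive half of the 5% false alarms
power = 0.5                 # assumed probability of detecting a true move

p_sig_positive = prior_true * power + (1 - prior_true) * alpha_positive
p_false_given_sig = (1 - prior_true) * alpha_positive / p_sig_positive
print(f"{p_false_given_sig:.1%}")  # 99.6%
```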
"Any figure that looks interesting or different is usually wrong"
Pitfall 3: Not Checking for SRM
- SRM = Sample Ratio Mismatch
- If you run an experiment with equal percentages assigned to Control/Treatment (A/B), you should have approximately the same number of users in each
- Real example from an experiment alert I received this week:
  - Control: 821,588 users; Treatment: 815,482 users
  - Ratio: 50.2% (should have been 50%)
  - Should I be worried?
- Absolutely
  - The p-value is 1.8e-6, so the probability of this split (or one more extreme) happening by chance is less than 1 in 500,000
  - Note that the above statement is not a violation of pitfall #1 because, by the experiment design, there should be an equal number of users in control/treatment, so we want the conditional probability
    P(actual split = 50.2% | designed split = 50%)
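The SRM check itself is cheap: under the designed 50/50 split, the control count is binomial, and the normal approximation is excellent at these sample sizes. A sketch (my code, not the ExP platform's) that reproduces the alert's p-value from the slide's counts:

```python
from math import erf, sqrt

def srm_p_value(n_control, n_treatment, expected_ratio=0.5):
    """Two-sided p-value for a sample ratio mismatch (normal
    approximation to the binomial; fine for counts this large)."""
    n = n_control + n_treatment
    expected = n * expected_ratio
    sd = sqrt(n * expected_ratio * (1 - expected_ratio))
    z = (n_control - expected) / sd
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

p = srm_p_value(821_588, 815_482)   # the counts from the slide
print(f"{p:.1e}")  # 1.8e-06: investigate before trusting any metric
```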
Pitfall 4: Wrong Success Metric (OEC)
- Office Online tested a new design for its homepage
- Objective: increase sales of Office products
- The Overall Evaluation Criterion (OEC) was clicks on the Buy button [shown in red boxes]
- Which one was better?

[Screenshot: Control]
[Screenshot: Treatment]
Pitfall: Wrong OEC
- Treatment had a drop in the OEC (clicks on Buy) of 64%!
- Not having the price shown in the Control led more people to click to determine the price
- Lesson: measure what you really need to measure: actual sales (it is at times more difficult)
- Lesson 2: focus on long-term customer lifetime value
- Peep, in the keynote here, said (he was OK with me mentioning this):
  - What's the goal? More money right now
  - Common pitfall: you want to optimize long-term money, NOT right now. Raising prices gets you short-term money, but long-term abandonment
- Coming up with a good OEC using short-term metrics is REALLY hard
Example: OEC for Search
- KDD 2012 paper: Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
- Search engines (Bing, Google) are evaluated on query share (distinct queries) and revenue as long-term goals
- Puzzle
  - A ranking bug in an experiment resulted in very poor search results
  - Degraded (algorithmic) search results cause users to search more to complete their task, and ads appear more relevant
  - Distinct queries went up over 10%, and revenue went up over 30%
- This problem is now in the book Data Science Interviews Exposed
- What metrics should be in the OEC for a search engine?
Puzzle Explained
- Analyzing queries per month, we have

  Queries/Month = (Queries/Session) × (Sessions/User) × (Users/Month)

  where a session begins with a query and ends with 30 minutes of inactivity. (Ideally, we would look at tasks, not sessions.)
- Key observation: we want users to find answers and complete tasks quickly, so queries/session should be smaller
- In a controlled experiment, the variants get (approximately) the same number of users by design, so the last term is about equal
- The OEC should therefore include the middle term: sessions/user
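A toy calculation (numbers invented, not from the paper) shows why the middle term is the one to watch: a degraded ranker that forces extra queries per session raises Queries/Month even though users are worse off:

```python
# Decomposition: Queries/Month = Queries/Session * Sessions/User * Users/Month
users_per_month = 1_000_000   # ~equal across variants by design
sessions_per_user = 10        # unchanged in this toy example

control_q_per_session = 2.0
treatment_q_per_session = 2.3   # degraded results force ~15% extra queries

control_qpm = control_q_per_session * sessions_per_user * users_per_month
treatment_qpm = treatment_q_per_session * sessions_per_user * users_per_month
print(treatment_qpm > control_qpm)  # True: query share up, experience worse
```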
Bad OEC Example
- Your data scientist makes an observation: 2% of queries end up with "No results"
- Manager: must reduce. Assigns a team to minimize the "no results" metric
- The metric improves, but the results for the query "brochure paper" are crap (or in this case, paper to clean crap)
- Sometimes it *is* better to show "No results"

Real example from my Amazon Prime Now search, 3/26/2016: https://twitter.com/ronnyk/status/713949552823263234
Pitfall 5: Combining Data when Treatment Percent Varies with Time
- Simplified example: 1,000,000 users per day
- For each individual day, the Treatment is much better
- However, the cumulative result for the Treatment is worse (Simpson's paradox)

Conversion rate for two days:

             Friday (C/T split: 99/1)    Saturday (C/T split: 50/50)   Total
  Control    20,000/990,000 = 2.02%      5,000/500,000 = 1.00%         25,000/1,490,000 = 1.68%
  Treatment  230/10,000 = 2.30%          6,000/500,000 = 1.20%         6,230/510,000 = 1.22%
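The table can be verified directly. A sketch using exact fractions, so no rounding hides the reversal:

```python
from fractions import Fraction

# (conversions, users) per day, from the slide's table
control = {"Fri": (20_000, 990_000), "Sat": (5_000, 500_000)}
treatment = {"Fri": (230, 10_000), "Sat": (6_000, 500_000)}

def rate(conv, users):
    return Fraction(conv, users)

def pooled(d):
    return rate(sum(c for c, _ in d.values()), sum(u for _, u in d.values()))

# Treatment wins on each individual day...
for day in ("Fri", "Sat"):
    assert rate(*treatment[day]) > rate(*control[day])

# ...but loses on the pooled data: the 99/1 Friday split gave Control
# almost all of the high-converting Friday traffic (Simpson's paradox).
print(float(pooled(control)), float(pooled(treatment)))
```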
Pitfall 6: Get the Stats Right
- Two very good books on A/B testing (A/B Testing by Optimizely founders Dan Siroker and Peter Koomen, and You Should Test That by WiderFunnel's CEO Chris Goward) get the stats wrong (see the Amazon reviews)
- Optimizely recently updated the stats in their product to correct for this
- Best technique to find issues: run A/A tests
  - Like an A/B test, but both variants are exactly the same
  - Are users split according to the planned percentages?
  - Does the data collected match the system of record?
  - Are the results non-significant 95% of the time?
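The 95% expectation in the last bullet can be sanity-checked by simulation. A sketch (my code, with invented traffic numbers): both "variants" draw from the same conversion rate, so a correct stats pipeline should flag roughly 5% of runs at alpha = 0.05:

```python
import random
from math import erf, sqrt

def two_prop_p(c_a, c_b, n):
    """Two-sided pooled z-test p-value for two equal-sized groups."""
    p_pool = (c_a + c_b) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    if se == 0:
        return 1.0
    z = (c_b - c_a) / n / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(0)
n, true_rate, runs = 1_000, 0.02, 1_000
sig = sum(
    two_prop_p(sum(random.random() < true_rate for _ in range(n)),
               sum(random.random() < true_rate for _ in range(n)), n) <= 0.05
    for _ in range(runs)
)
print(f"significant in {sig / runs:.1%} of A/A runs")  # should be near 5%
```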
More Pitfalls
- See the KDD paper: Seven Pitfalls to Avoid when Running Controlled Experiments on the Web (http://bit.ly/expPitfalls)
- Incorrectly computing confidence intervals for percent change
- Using standard statistical formulas for computations of variance and power
- Neglecting to filter robots/bots (a lucrative business, as shown in a photo I took ->)
- Instrumentation issues
The HiPPO
- HiPPO = Highest Paid Person's Opinion
- We made thousands of toy HiPPOs and handed them out at Microsoft to help change the culture
- Grab one here at ConversionXL
- Change the culture at your company
- Fact: hippos kill more humans than any other (non-human) mammal
- Listen to the customers and don't let the HiPPO kill good ideas
Getting numbers is easy; getting numbers you can trust is hard

Slides at http://bit.ly/ABPitfalls
See http://exp-platform.com for papers. Plane-reading booklets with selected papers are available outside the room.
Remember this