Ronny Kohavi, Distinguished Engineer, General Manager, Analysis and Experimentation, Microsoft
Joint work with Thomas Crook, Brian Frasca, and Roger Longbotham, A&E Team
A/B Testing Pitfalls
Slides at http://bit.ly/ABPitfalls
A/B Tests in One Slide
- Concept is trivial
  - Randomly split traffic between two (or more) versions: A (Control) and B (Treatment)
  - Collect metrics of interest
  - Analyze
- An A/B test is the simplest controlled experiment
  - A/B/n refers to multiple treatments (often used and encouraged: try control + two or three treatments)
  - MVT refers to multivariable designs (rarely used by our teams)
- Must run statistical tests to confirm differences are not due to chance
- Best scientific way to prove causality, i.e., that the changes in metrics are caused by changes introduced in the treatment(s)
ConversionXL Audience Statistics
[Bar chart] Experiments per year run by attendees, based on a pre-conference survey (N=118):
None: 13%; Few: 30%; 12 (one per month): 20%; 15-30: 20%; Lots and lots: 17%

83% of attendees ran fewer than 30 experiments last year. Experimenters at Microsoft use our ExP platform to start ~30 experiments per day.
Experimentation at Scale
- I've been fortunate to work at an organization that values being data-driven (video)
- We finish about 300 experiment treatments per week, mostly on Bing and MSN, but also on Office, OneNote, Xbox, Cortana, Skype, Exchange, OneDrive. (These are "real" useful treatments, not a 3x10x10 MVT = 300.)
- Each variant is exposed to between 100K and millions of users, sometimes tens of millions
- At Bing, 90% of eligible users are in experiments (10% are a global holdout, changed once a year)
- There is no single Bing. Since a user is exposed to over 15 concurrent experiments, they get one of 5^15 = 30 billion variants (debugging takes on a new meaning)
- Until 2014, the system was limiting usage as it scaled. Now the limits come from engineers' ability to code new ideas
Two Valuable Real Experiments
- What is a valuable experiment?
  - The absolute value of the delta between expected outcome and actual outcome is large
  - If you thought something was going to win and it wins, you have not learned much
  - If you thought it was going to win and it loses, it's valuable (learning)
  - If you thought it was "meh" and it was a breakthrough, it's HIGHLY valuable
  - See http://bit.ly/expRulesOfThumb for some examples of breakthroughs
- The experiments were run at Microsoft's Bing with millions of users in each
- For each experiment, we provide the OEC, the Overall Evaluation Criterion
- Can you guess the winner correctly? The three choices are:
  - A wins (the difference is statistically significant)
  - Flat: A and B are approximately the same (no stat-sig difference)
  - B wins
Example: Bing Ads with Site Links
- Should Bing add "site links" to ads, which allow advertisers to offer several destinations on ads?
- OEC: Revenue, with ads constrained to the same vertical pixels on average
- Pro of adding: richer ads; users are better informed about where they will land
- Cons: the constraint means on average 4 "A" ads vs. 3 "B" ads; variant B is 5 msec slower (compute + higher page weight)
[Side-by-side screenshots: A (left), B (right)]
- Raise your left hand if you think A wins (left)
- Raise your right hand if you think B wins (right)
- Don't raise your hand if you think they are about the same
Bing Ads with Site Links
- If you raised your left hand, you were wrong
- If you did not raise a hand, you were wrong
- Site links generate incremental revenue on the order of tens of millions of dollars annually for Bing
- The above change was costly to implement. We made two small changes to Bing, each of which took days to develop and increased annual revenues by over $100 million
Example: Underlining Links
- Does underlining increase or decrease clickthrough rate?
- OEC: Clickthrough rate on the search engine result page (SERP) for a query
[Side-by-side screenshots: A (with underlines), B (no underlines)]
- Raise your left hand if you think A wins (left, with underlines)
- Raise your right hand if you think B wins (right, without underlines)
- Don't raise your hand if you think they are about the same
Underlines
- If you raised your right hand, you were wrong
- If you did not raise a hand, you were wrong
- Underlines improve clickthrough rate for both algorithmic results and ads (so more revenue) and improve time to successful click
- Modern web designs do away with underlines, and most sites have adopted this design, despite data showing that users click less and take more time to click
- For search engines (Google, Bing, Yahoo), this is a very questionable industry direction
Pitfall 1: Misinterpreting P-values
- NHST = Null Hypothesis Statistical Testing, the "standard" model commonly used
- P-value <= 0.05 is the "standard" for rejecting the Null hypothesis
- The p-value is often misinterpreted. Here are some incorrect statements from Steve Goodman's A Dirty Dozen:
  1. If P = .05, the null hypothesis has only a 5% chance of being true
  2. A non-significant difference (e.g., P > .05) means there is no difference between groups
  3. P = .05 means that we have observed data that would occur only 5% of the time under the null hypothesis
  4. P = .05 means that if you reject the null hypothesis, the probability of a Type I error (false positive) is only 5%
- The problem is that the p-value gives us Prob(X >= x | H_0), whereas what we want is Prob(H_0 | X = x)
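The conditional direction in the last bullet can be made concrete in code. Below is a minimal sketch (mine, not from the slides) of a pooled two-proportion z-test using only the standard library; the traffic numbers are invented:

```python
from math import erf, sqrt

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variants share one conversion rate.

    The result is Prob(data at least this extreme | H0) -- NOT the
    probability that H0 is true given the data.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # common rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided tail of the standard normal, via erf
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Invented example: 2.0% vs. 2.2% conversion, 100K users per variant
p = two_proportion_p_value(2000, 100_000, 2200, 100_000)
print(round(p, 4))  # well below 0.05, so stat-sig under NHST
```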
Pitfall 2: Expecting Breakthroughs
- Breakthroughs are rare after initial optimizations
  - At Bing (well optimized), 80% of ideas fail to show value
  - At other products across Microsoft, about 2/3 of ideas fail
- Take Sessions/User, a key metric at Bing. Historically, it improves 0.02% of the time: that's one in 5,000 treatments we try!
- Most of the time, we invoke Twyman's law (http://bit.ly/twymanLaw)
- Note the relationship to the prior pitfall
  - With standard p-value computations, 5% of experiments will show a stat-sig movement in Sessions/User when there is no real movement (i.e., if the Null Hypothesis is true), half of those positive
  - 99.6% of the time, a stat-sig positive movement with p-value = 0.05 will be a false positive
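The 99.6% figure follows from Bayes' rule. A sketch of the arithmetic, where the 1-in-5,000 prior is from the slide and the ~50% power is my assumption, chosen to reproduce the stated number:

```python
# Fraction of stat-sig *positive* results that are false positives.
prior_true = 1 / 5000       # from the slide: Sessions/User truly improves
alpha_positive = 0.05 / 2   # only the positive half of the 5% false alarms
power = 0.5                 # assumed probability of detecting a true move

p_sig_positive = prior_true * power + (1 - prior_true) * alpha_positive
p_false_given_sig = (1 - prior_true) * alpha_positive / p_sig_positive
print(f"{p_false_given_sig:.1%}")  # 99.6%
```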
"Any figure that looks interesting or different is usually wrong"
Pitfall 3: Not Checking for SRM
- SRM = Sample Ratio Mismatch
- If you run an experiment with equal percentages assigned to Control/Treatment (A/B), you should have approximately the same number of users in each
- Real example from an experiment alert I received this week:
  - Control: 821,588 users; Treatment: 815,482 users
  - Ratio: 50.2% (should have been 50%)
  - Should I be worried?
- Absolutely
  - The p-value is 1.8e-6, so the probability of this split (or one more extreme) happening by chance is less than 1 in 500,000
  - Note that the above statement is not a violation of pitfall #1 because, by the experiment design, there should be an equal number of users in control/treatment, so we want the conditional probability
    P(actual split = 50.2% | designed split = 50%)
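The SRM check itself is cheap: under the designed 50/50 split, the control count is binomial, and the normal approximation is excellent at these sample sizes. A sketch (my code, not the ExP platform's) that reproduces the alert's p-value from the slide's counts:

```python
from math import erf, sqrt

def srm_p_value(n_control, n_treatment, expected_ratio=0.5):
    """Two-sided p-value for a sample ratio mismatch (normal
    approximation to the binomial; fine for counts this large)."""
    n = n_control + n_treatment
    expected = n * expected_ratio
    sd = sqrt(n * expected_ratio * (1 - expected_ratio))
    z = (n_control - expected) / sd
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

p = srm_p_value(821_588, 815_482)   # the counts from the slide
print(f"{p:.1e}")  # 1.8e-06: investigate before trusting any metric
```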
Pitfall 4: Wrong Success Metric (OEC)
- Office Online tested a new design for its homepage
- Objective: increase sales of Office products
- The Overall Evaluation Criterion (OEC) was clicks on the Buy button [shown in red boxes]
- Which one was better?

[Screenshot: Control]
[Screenshot: Treatment]
Pitfall: Wrong OEC
- Treatment had a drop in the OEC (clicks on Buy) of 64%!
- Not having the price shown in the Control led more people to click to determine the price
- Lesson: measure what you really need to measure: actual sales (it is at times more difficult)
- Lesson 2: focus on long-term customer lifetime value
- Peep, in the keynote here, said (he was OK with me mentioning this):
  - What's the goal? More money right now
  - Common pitfall: you want to optimize long-term money, NOT right now. Raising prices gets you short-term money, but long-term abandonment
- Coming up with a good OEC using short-term metrics is REALLY hard
Example: OEC for Search
- KDD 2012 paper: Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
- Search engines (Bing, Google) are evaluated on query share (distinct queries) and revenue as long-term goals
- Puzzle
  - A ranking bug in an experiment resulted in very poor search results
  - Degraded (algorithmic) search results cause users to search more to complete their task, and ads appear more relevant
  - Distinct queries went up over 10%, and revenue went up over 30%
- This problem is now in the book Data Science Interviews Exposed
- What metrics should be in the OEC for a search engine?
Puzzle Explained
- Analyzing queries per month, we have

  Queries/Month = (Queries/Session) × (Sessions/User) × (Users/Month)

  where a session begins with a query and ends with 30 minutes of inactivity. (Ideally, we would look at tasks, not sessions.)
- Key observation: we want users to find answers and complete tasks quickly, so queries/session should be smaller
- In a controlled experiment, the variants get (approximately) the same number of users by design, so the last term is about equal
- The OEC should therefore include the middle term: sessions/user
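A toy calculation (numbers invented, not from the paper) shows why the middle term is the one to watch: a degraded ranker that forces extra queries per session raises Queries/Month even though users are worse off:

```python
# Decomposition: Queries/Month = Queries/Session * Sessions/User * Users/Month
users_per_month = 1_000_000   # ~equal across variants by design
sessions_per_user = 10        # unchanged in this toy example

control_q_per_session = 2.0
treatment_q_per_session = 2.3   # degraded results force ~15% extra queries

control_qpm = control_q_per_session * sessions_per_user * users_per_month
treatment_qpm = treatment_q_per_session * sessions_per_user * users_per_month
print(treatment_qpm > control_qpm)  # True: query share up, experience worse
```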
Bad OEC Example
- Your data scientist makes an observation: 2% of queries end up with "No results"
- Manager: must reduce. Assigns a team to minimize the "no results" metric
- The metric improves, but the results for the query "brochure paper" are crap (or in this case, paper to clean crap)
- Sometimes it *is* better to show "No results"

Real example from my Amazon Prime Now search, 3/26/2016: https://twitter.com/ronnyk/status/713949552823263234
Pitfall 5: Combining Data when Treatment Percent Varies with Time
- Simplified example: 1,000,000 users per day
- For each individual day, the Treatment is much better
- However, the cumulative result for the Treatment is worse (Simpson's paradox)

Conversion rate for two days:

             Friday (C/T split: 99/1)    Saturday (C/T split: 50/50)   Total
  Control    20,000/990,000 = 2.02%      5,000/500,000 = 1.00%         25,000/1,490,000 = 1.68%
  Treatment  230/10,000 = 2.30%          6,000/500,000 = 1.20%         6,230/510,000 = 1.22%
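The table can be verified directly. A sketch using exact fractions, so no rounding hides the reversal:

```python
from fractions import Fraction

# (conversions, users) per day, from the slide's table
control = {"Fri": (20_000, 990_000), "Sat": (5_000, 500_000)}
treatment = {"Fri": (230, 10_000), "Sat": (6_000, 500_000)}

def rate(conv, users):
    return Fraction(conv, users)

def pooled(d):
    return rate(sum(c for c, _ in d.values()), sum(u for _, u in d.values()))

# Treatment wins on each individual day...
for day in ("Fri", "Sat"):
    assert rate(*treatment[day]) > rate(*control[day])

# ...but loses on the pooled data: the 99/1 Friday split gave Control
# almost all of the high-converting Friday traffic (Simpson's paradox).
print(float(pooled(control)), float(pooled(treatment)))
```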
Pitfall 6: Get the Stats Right
- Two very good books on A/B testing (A/B Testing by Optimizely founders Dan Siroker and Peter Koomen, and You Should Test That by WiderFunnel's CEO Chris Goward) get the stats wrong (see the Amazon reviews)
- Optimizely recently updated the stats in their product to correct for this
- Best technique to find issues: run A/A tests
  - Like an A/B test, but both variants are exactly the same
  - Are users split according to the planned percentages?
  - Does the data collected match the system of record?
  - Are the results non-significant 95% of the time?
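The 95% expectation in the last bullet can be sanity-checked by simulation. A sketch (my code, with invented traffic numbers): both "variants" draw from the same conversion rate, so a correct stats pipeline should flag roughly 5% of runs at alpha = 0.05:

```python
import random
from math import erf, sqrt

def two_prop_p(c_a, c_b, n):
    """Two-sided pooled z-test p-value for two equal-sized groups."""
    p_pool = (c_a + c_b) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    if se == 0:
        return 1.0
    z = (c_b - c_a) / n / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(0)
n, true_rate, runs = 1_000, 0.02, 1_000
sig = sum(
    two_prop_p(sum(random.random() < true_rate for _ in range(n)),
               sum(random.random() < true_rate for _ in range(n)), n) <= 0.05
    for _ in range(runs)
)
print(f"significant in {sig / runs:.1%} of A/A runs")  # should be near 5%
```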
More Pitfalls
- See the KDD paper: Seven Pitfalls to Avoid when Running Controlled Experiments on the Web (http://bit.ly/expPitfalls)
- Incorrectly computing confidence intervals for percent change
- Using standard statistical formulas for computations of variance and power
- Neglecting to filter robots/bots (a lucrative business, as shown in a photo I took ->)
- Instrumentation issues
The HiPPO
- HiPPO = Highest Paid Person's Opinion
- We made thousands of toy HiPPOs and handed them out at Microsoft to help change the culture
- Grab one here at ConversionXL
- Change the culture at your company
- Fact: hippos kill more humans than any other (non-human) mammal
- Listen to the customers and don't let the HiPPO kill good ideas
Getting numbers is easy; getting numbers you can trust is hard

Slides at http://bit.ly/ABPitfalls
See http://exp-platform.com for papers. Plane-reading booklets with selected papers are available outside the room.
Remember this