AB Testing
Revolution through constant evolution
Paul Lucas, Sr Director, Technology. Want to visit next? Greece
Jeff Madynski, Director, Technology. Want to visit next? Croatia
Anuj Gupta, Sr Software Dev Engineer. Want to visit next? Peru
Technology Evolution
V0 – Batch processing from Abacus exposure logs, Omniture, and booking datamart; Tableau visualization
V1 – Storm, Kestrel, DynamoDB / PostgreSQL reading UIS messages and client log data (Nov 2014 - Dec 2015)
V2 – Introduce Kafka and Cassandra (May 2016)
TNL – original solution
• Batch processing
• Tableau visualization
• Merged data from OMS/Omniture
• Problems:
– 1-2 day feedback loop: what if there were mistakes in the test implementation (bucketing not as anticipated)?
– Fixing data import errors meant starting over again
TNL Dashboard v0
[Architecture diagram: Omniture click data, booking datamart, Abacus exposures, Hadoop ETL, Tableau]
TNL v0 -> v1
TNL v1 Problems
• Database size 420 GB; queries took 3-5 minutes
• Data drop (Kestrel)
• Increase in data (multi-brand, more customers)
TNL v1 -> v1.1, v2
• Fighting fires, borrowing more time
• POC next
Fighting fires – borrowing more time
User Interaction Service(UIS) Traffic
Scaling messaging system
Kafka
• Publish-subscribe based messaging system
• Distributed and reliable
• Longer retention and persistence
• Monitoring dashboard and alerts
• Buffer for system downtime
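The retention and downtime-buffer points can be illustrated with a toy in-memory model. This is a sketch only and not the Kafka client API: `RetainedLog`, `publish`, and `poll` are hypothetical names, and a real broker adds partitions, replication, and time-based retention.

```python
class RetainedLog:
    """Toy append-only log: messages are kept after being read,
    and each consumer tracks its own read position (offset)."""

    def __init__(self):
        self._messages = []   # retained messages, in publish order
        self._offsets = {}    # consumer name -> next index to read

    def publish(self, message):
        self._messages.append(message)

    def poll(self, consumer, max_messages=10):
        # Resume from wherever this consumer last stopped.
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


log = RetainedLog()

# The consumer is "down" while these messages arrive.
for event in ["exposure-1", "click-1", "booking-1"]:
    log.publish(event)

# When it comes back, it catches up from its stored offset.
caught_up = log.poll("dashboard")
# A second, independent subscriber sees the same retained messages.
replayed = log.poll("archiver")
```

Because reading does not remove messages, a consumer outage delays processing instead of dropping data, which is the property the Kestrel-based pipeline lacked according to the slides.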
Kestrel limitation
• Message durability is not available
• Reaching potential scalability issues
• Inactive open source project
Scaling database performance
• Database views for caching
– Views created every 6 hours
– UI only loads data from views
– Read-only replicas for select queries
• Archive data
– Moved old and completed experiment data to separate tables
– DB cleanup using vacuum and re-indexing
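The caching approach can be sketched as a precomputed summary that the UI reads instead of the raw data. The example below uses Python's built-in `sqlite3` as a stand-in for the production database; the table names (`events`, `experiment_summary`), columns, and sample rows are assumptions for illustration, and the slides do not show the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (experiment_id TEXT, bucket TEXT, converted INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("exp1", "control", 0),
        ("exp1", "control", 1),
        ("exp1", "variant", 1),
        ("exp1", "variant", 1),
    ],
)


def refresh_summary(conn):
    """Rebuild the cached aggregates; run on a schedule (e.g. every 6 hours)."""
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS experiment_summary")
        conn.execute(
            """
            CREATE TABLE experiment_summary AS
            SELECT experiment_id, bucket,
                   COUNT(*) AS visitors,
                   SUM(converted) AS conversions
            FROM events
            GROUP BY experiment_id, bucket
            """
        )


def load_dashboard(conn, experiment_id):
    """The UI path: reads only the small summary table, never raw events."""
    return conn.execute(
        "SELECT bucket, visitors, conversions FROM experiment_summary "
        "WHERE experiment_id = ? ORDER BY bucket",
        (experiment_id,),
    ).fetchall()


refresh_summary(conn)
rows = load_dashboard(conn, "exp1")
```

The trade-off is staleness: events written after a refresh stay invisible to the dashboard until the next refresh runs.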
TNL Dashboard v2
Product Demo
Streaming
• Column-oriented, time series schema
• Time-to-live (TTL) on data
• Only store most popular aggregates
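A time series schema with TTL can be pictured as rows keyed by aggregate, ordered by timestamp, each carrying its own expiry. The class below is an in-memory sketch under those assumptions; `TimeSeriesStore` and its methods are hypothetical names, not the Cassandra API, and the clock is passed in as `now` so the behaviour is easy to follow.

```python
from collections import defaultdict


class TimeSeriesStore:
    """Rows are grouped by key (e.g. experiment + metric) and expire after a TTL."""

    def __init__(self):
        # key -> list of (timestamp, value, expires_at)
        self._rows = defaultdict(list)

    def write(self, key, timestamp, value, ttl_seconds, now):
        self._rows[key].append((timestamp, value, now + ttl_seconds))

    def read(self, key, start, end, now):
        # Expired rows are skipped, as if the database had dropped them.
        live = [
            (ts, value)
            for ts, value, expires_at in self._rows[key]
            if expires_at > now and start <= ts <= end
        ]
        return sorted(live)


store = TimeSeriesStore()
key = ("exp1", "conversions")
store.write(key, timestamp=100, value=7, ttl_seconds=60, now=100)
store.write(key, timestamp=110, value=9, ttl_seconds=60, now=110)

fresh = store.read(key, start=0, end=200, now=120)    # both rows still live
partial = store.read(key, start=0, end=200, now=165)  # first row has expired
```

Letting old rows expire keeps storage bounded without the manual archiving and cleanup that the v1 database needed.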
v1 vs v2
• New architecture
– More scalable
– More responsive
– Less prone to data loss
• Lessons learnt
– System is as fast as the slowest component
– Fault-tolerance and resilience
– Partition data
– Pre-production environment
Questions/discussion
APPENDIX
Apply statistical power to test results
Using a 90% confidence level, about 1 out of 10 tests will be a false positive or false negative
             Heads   Tails
Right hand     51      49
Left hand      49      51
Right hand is superior at getting heads!
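The coin-flip example can be checked with a two-proportion z-test. The sketch below assumes 100 flips per hand (51 + 49), as the table implies, and uses the standard normal approximation; the function name is illustrative.

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Heads for the right hand vs. heads for the left hand.
z, p_value = two_proportion_z(51, 100, 49, 100)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

With z around 0.28 and a p-value near 0.78, the difference is far from significant at the 90% level, so "the right hand is superior" is noise and not a finding.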
Do’s and Don’ts when concluding tests
Don’t call a test too early; this increases false positives or negatives
Don’t call tests as soon as you see positive results, because test results frequently go up and down
To claim a test Winner/Loser, the positive/negative effect has to hold for at least 5 consecutive days and the trend has to be stable
Please note this type of chart is not currently available in the Test and Learn dashboard or SiteSpect UI. The shape of the confidence interval lines varies test by test
Define one success metric and run tests for a pre-determined duration (for hotel/flight tests in the US, we suggest running until the confidence interval of the conversion change is within +/- 1%); tests should run at least 10 days
Don’t assume the midpoint (observed % change during the test period) will hold true after the feature is rolled out: a 4.0% +/- 4.0% test may have zero impact and may not be much better than a 1.0% +/- 1.0% test
Don’t call an inconclusive test “trending positive” or “trending negative”, as test results fluctuate
Contact ARM testing team for questions
Using 90% confidence level
Winner: Lower bound of % change >= 0 (or probability of test being positive >= 95%)
Loser: Upper bound of % change <= 0 (or probability of test being negative >= 95%)
Else: Inconclusive or Neutral
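The Winner/Loser/Inconclusive rule can be written as a small function over the bounds of the 90% confidence interval for the % change. This is a minimal sketch; `classify_test` and the sample intervals are illustrative and not taken from the dashboard.

```python
def classify_test(lower_bound, upper_bound):
    """Classify a test from the 90% confidence interval of its % change."""
    if lower_bound >= 0:
        return "winner"        # whole interval at or above zero
    if upper_bound <= 0:
        return "loser"         # whole interval at or below zero
    return "inconclusive"      # interval straddles zero


# Hypothetical results, expressed as midpoint +/- margin in percent.
results = {
    "3.0 +/- 1.0": classify_test(3.0 - 1.0, 3.0 + 1.0),
    "-2.0 +/- 1.0": classify_test(-2.0 - 1.0, -2.0 + 1.0),
    "1.0 +/- 4.0": classify_test(1.0 - 4.0, 1.0 + 4.0),
}
for label, verdict in results.items():
    print(label, "->", verdict)
```

The 95% figure pairs with the 90% interval because a two-sided 90% interval leaves 5% in each tail.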