AB Testing at Expedia

AB Testing

Revolution through constant evolution

Expedia SF, 114 Sansome

www.expedia.com | @expediaeng

Work with us: [email protected]

http://www.expedia.com/

Paul Lucas, Sr Director, Technology. Want to visit next? Greece

Jeff Madynski, Director, Technology. Want to visit next? Croatia

Anuj Gupta, Sr Software Dev Engineer. Want to visit next? Peru

Revolution through constant evolution

Technology Evolution

V0: Batch processing from Abacus exposure logs, Omniture, and booking datamart; Tableau visualization

V1: Storm, Kestrel, DynamoDB / PostgreSQL reading UIS messages and client log data (Nov 2014 - Dec 2015)

V2: Introduce Kafka and Cassandra (May 2016)

TNL – original solution

• Batch processing
• Tableau visualization
• Merged data from OMS/Omniture
• Problems:
– 1-2 day feedback loop: what if we had mistakes in the test implementation (bucketing not what was anticipated)?
– To fix data import errors, start over again

TNL Dashboard v0

[Diagram: Omniture click data, booking datamart, and Abacus exposures -> Hadoop ETL -> Tableau]

TNL v0 -> v1

TNL v1 Problems

• Database size 420GB; queries took 3-5 minutes
• Data drop (Kestrel)
• Increase in data (multi-brand, more customers)

TNL v1 -> v1.1, v2

• Fighting fires, borrowing more time
• POC next

Fighting fires – borrowing more time

User Interaction Service (UIS) Traffic

Scaling messaging system

Kafka

• Publish-subscribe based messaging system
• Distributed and reliable
• Longer retention and persistence
• Monitoring dashboard and alerts
• Buffer for system downtime
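
The "buffer for system downtime" point follows from how a retained log differs from a transient queue: messages stay available after they are read, and each subscriber tracks its own position. A minimal in-memory sketch in Python, assuming nothing about the real pipeline; this is a toy model, not the Kafka client API, and `RetainedLog`, `publish`, `poll`, and the message strings are hypothetical names chosen for illustration:

```python
from collections import defaultdict


class RetainedLog:
    """Toy append-only log: messages are kept after being read (illustrative only)."""

    def __init__(self):
        self._messages = []               # retained messages, in publish order
        self._offsets = defaultdict(int)  # next offset to read, per consumer group

    def publish(self, message):
        self._messages.append(message)

    def poll(self, group, max_messages=100):
        """Return unread messages for `group` and advance that group's offset."""
        start = self._offsets[group]
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


log = RetainedLog()
log.publish("exposure:user1:variantA")  # hypothetical message payloads
log.publish("click:user1")

# The dashboard consumer is "down" while this message arrives...
log.publish("booking:user1")

# ...and catches up from its stored offset once it is back.
caught_up = log.poll("dashboard")

# A second, independent subscriber reads the same retained messages.
archived = log.poll("archiver")
```

Because reading does not delete anything, a consumer outage delays processing rather than dropping data, which is the property the Kestrel-based setup lacked according to the slides.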

Kestrel limitations

• Message durability is not available
• Approaching potential scalability limits
• Inactive open source project

Scaling database performance

• Database views for caching
– Views created every 6 hours
– UI only loads data from views
– Read-only replicas for select queries
• Archive data
– Moved old and completed experiment data to separate tables
– DB cleanup using vacuum and re-indexing

TNL Dashboard v2

Product Demo

Streaming

• Column-oriented, time series schema
• Time-to-live (TTL) on data
• Only store most popular aggregates
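
One way to picture these three points together is a store with one row per aggregate and one column per time bucket, where cells expire after a TTL. The sketch below is an in-memory Python toy under stated assumptions: hourly buckets, expiry based on the last write to a cell, and hypothetical names (`AggregateStore`, `increment`, `series`, `"exp42"`). It is not the Cassandra schema used in the talk.

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # assumption: hourly time buckets


class AggregateStore:
    """Toy time-series store: one row per (experiment, variant, metric),
    one column per time bucket, each cell expiring after a TTL."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        # row key -> {bucket start -> (count, time of last write)}
        self._rows = defaultdict(dict)

    def increment(self, experiment, variant, metric, ts, amount=1):
        bucket = ts - ts % BUCKET_SECONDS
        row = self._rows[(experiment, variant, metric)]
        count, _ = row.get(bucket, (0, ts))
        row[bucket] = (count + amount, ts)

    def series(self, experiment, variant, metric, now):
        """Return {bucket start: count} for cells whose TTL has not elapsed."""
        row = self._rows[(experiment, variant, metric)]
        return {
            bucket: count
            for bucket, (count, written) in sorted(row.items())
            if now - written < self.ttl_seconds
        }


WEEK = 7 * 24 * 3600
store = AggregateStore(ttl_seconds=WEEK)
store.increment("exp42", "control", "bookings", ts=3600)
store.increment("exp42", "control", "bookings", ts=3700)
store.increment("exp42", "control", "bookings", ts=7300)

recent = store.series("exp42", "control", "bookings", now=8000)
later = store.series("exp42", "control", "bookings", now=3700 + WEEK)
```

Storing pre-computed counts per bucket keeps reads cheap (the dashboard never scans raw events), and the TTL bounds storage growth without a separate archive job.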

v1 vs v2

• New architecture
– More scalable
– More responsive
– Less prone to data loss
• Lessons learnt
– A system is only as fast as its slowest component
– Fault-tolerance and resilience
– Partition data
– Pre-production environment

Questions/discussion

APPENDIX

Apply statistical power to test results

Using a 90% confidence level, 1 out of 10 tests will be a false positive or false negative

             Heads   Tails
Right hand   51      49
Left hand    49      51

Right hand is superior at getting heads!
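
The coin-flip example is a joke about reading noise as signal. A quick check with a standard two-proportion z-test, sketched in Python with the standard library only, shows why; the function name `two_proportion_z_test` is hypothetical, and the 100 flips per hand are read off the table (51 + 49).

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Heads out of 100 flips per hand, taken from the table above.
z, p_value = two_proportion_z_test(51, 100, 49, 100)
significant_at_90 = p_value < 0.10
print(f"z = {z:.2f}, p = {p_value:.2f}, significant at 90%: {significant_at_90}")
```

The p-value is far above 0.10, so at a 90% confidence level the data gives no support for a "superior" hand.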

Do’s and Don’ts when concluding tests

Don’t call a test too early; this increases false positives or negatives

Don’t call tests as soon as you see positive results, because test results frequently go up and down

To claim a test Winner/Loser, the positive/negative effect has to hold for at least 5 consecutive days and the trend has to be stable

Please note this type of chart is not currently available in the Test and Learn dashboard or SiteSpect UI. The shape of the confidence interval lines varies test by test
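
The 5-consecutive-day rule can be expressed as a small check over a list of daily readings. This is a sketch only: `stable_call` and the status strings are hypothetical, and the "trend is stable" condition is a judgment call that the code does not attempt to capture.

```python
def stable_call(daily_calls, min_days=5):
    """Return 'winner' or 'loser' only if the most recent `min_days`
    daily readings all agree; otherwise return 'inconclusive'."""
    if len(daily_calls) < min_days:
        return "inconclusive"
    recent = daily_calls[-min_days:]
    first = recent[0]
    if first in ("winner", "loser") and all(call == first for call in recent):
        return first
    return "inconclusive"


# Hypothetical daily readings, oldest first.
flapping = ["winner", "inconclusive", "winner", "winner", "winner", "winner"]
steady = ["inconclusive", "winner", "winner", "winner", "winner", "winner"]

flapping_result = stable_call(flapping)  # only 4 consecutive positive days
steady_result = stable_call(steady)      # 5 consecutive positive days
```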

Define one success metric and run tests for a pre-determined duration (for hotel/flight tests in the US, we suggest running until the confidence interval of the conversion change is within +/- 1%); tests should run at least 10 days
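
To get a feel for how much traffic a +/- 1% interval implies, here is a rough sample-size sketch in Python. It rests on several assumptions not stated in the slides: the +/- 1% is read as a relative change in conversion, both arms get equal traffic, the true effect is near zero, and a normal approximation holds. The 3% baseline conversion rate is a made-up illustration, not an Expedia figure, and `visitors_per_arm` is a hypothetical name.

```python
import math
from statistics import NormalDist


def visitors_per_arm(baseline_rate, half_width, confidence=0.90):
    """Approximate visitors needed per arm so that the confidence interval
    of the relative conversion change has the given half-width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of the relative change is about
    # sqrt(2 * (1 - p) / (p * n)); solve z * SE = half_width for n.
    n = 2 * z ** 2 * (1 - baseline_rate) / (baseline_rate * half_width ** 2)
    return math.ceil(n)


# Hypothetical 3% baseline conversion, +/- 1% relative half-width.
needed = visitors_per_arm(baseline_rate=0.03, half_width=0.01)
looser = visitors_per_arm(baseline_rate=0.03, half_width=0.02)
print(f"+/- 1%: {needed:,} per arm; +/- 2%: {looser:,} per arm")
```

Under these assumptions the requirement runs into the millions of visitors per arm, and halving the interval width roughly quadruples it, which is why a pre-determined duration matters more than an early peek.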

Don’t assume the midpoint (observed % change during the test period) will hold true after the feature is rolled out: a 4.0% +/- 4.0% test may have zero impact and may not be much better than a 1.0% +/- 1.0% test

Don’t call an inconclusive test “trending positive” or “trending negative”, as test results fluctuate

Contact ARM testing team for questions

[email protected]

Using a 90% confidence level:

Winner: lower bound of % change >= 0 (or probability of the test being positive >= 95%)
Loser: upper bound of % change <= 0 (or probability of the test being negative >= 95%)
Else: Inconclusive or Neutral
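
The decision rule above maps directly onto the bounds of the 90% confidence interval. A minimal Python sketch, where `classify` is a hypothetical helper and inputs are the observed % change and the interval half-width:

```python
def classify(midpoint_pct, half_width_pct):
    """Apply the Winner / Loser / Inconclusive rule to a % change
    reported as midpoint +/- half-width of a 90% confidence interval."""
    lower = midpoint_pct - half_width_pct
    upper = midpoint_pct + half_width_pct
    if lower >= 0:
        return "winner"
    if upper <= 0:
        return "loser"
    return "inconclusive"


# The two examples quoted earlier in the deck, plus two made-up readings.
wide = classify(4.0, 4.0)     # interval [0.0, 8.0]
narrow = classify(1.0, 1.0)   # interval [0.0, 2.0]
mixed = classify(2.0, 3.0)    # interval [-1.0, 5.0]
losing = classify(-2.5, 1.5)  # interval [-4.0, -1.0]
```

Both the 4.0% +/- 4.0% and 1.0% +/- 1.0% results sit exactly on the Winner boundary with a lower bound of zero, which illustrates the earlier caution: the larger midpoint does not make the first test a safer bet.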
