Building better products through Experimentation - SDForum Business Intelligence SIG

Building better products through Experimentation

Deepak Nadig, eBay Principal Architect

SDForum Business Intelligence SIG, March 27, 2008

What we’re up against

• eBay manages …
– Over 276,000,000 registered users
– Over 1 Billion photos
– eBay users worldwide trade more than $2039 worth of goods every second
– eBay averages well over 1 billion page views per day
– At any given time, there are over 113 million items for sale on the site
– eBay stores over 2 Petabytes of data – over 200 times the size of the Library of Congress!
– eBay analytics processes over 25 Petabytes of data on any day
– The eBay platform handles 4.4 billion API calls per month

• In a dynamic environment
– 300+ features per quarter
– We roll 100,000+ lines of code every two weeks

• In 39 countries, in seven languages, 24x7

>44 Billion SQL executions/day!

An SUV is sold every 5 minutes

A sporting good sells every 2 seconds

Over ½ Million pounds of Kimchi are sold every year!

Site Statistics: in a typical day…

Metric                     June 1999     Q1 2007       Growth
API Calls                  0             150 M         N/A
Peak Network Utilization   268 Mbps      16 Gbps       59x
Availability               ~97%          99.94%        50x
  (downtime)               43 mins/day   50 sec/day
Total Page Views           54 M          >1 B          19x
Outbound Emails            1 M           41 M          41x
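The availability figures on this slide (~97% and 99.94%) line up with the "43 mins/day" and "50 sec/day" figures if those are read as daily downtime. The short calculation below is a sketch of that arithmetic; the function name is illustrative, and the downtime reading is an interpretation rather than something the slide states.

```python
# Sketch: convert an availability percentage into expected downtime per day.
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_seconds_per_day(availability_pct: float) -> float:
    """Expected seconds of downtime per day at the given availability."""
    return (1.0 - availability_pct / 100.0) * SECONDS_PER_DAY


down_1999 = downtime_seconds_per_day(97.0)    # roughly 43 minutes
down_2007 = downtime_seconds_per_day(99.94)   # roughly 52 seconds
improvement = down_1999 / down_2007           # roughly 50x less downtime

print(f"June 1999: {down_1999 / 60:.1f} min/day")
print(f"Q1 2007:   {down_2007:.1f} sec/day")
print(f"Improvement: {improvement:.0f}x")
```

Read this way, the 50x growth figure for availability is the ratio of the two downtime values, not of the two percentages.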

Velocity of eBay -- Software Development Process

• Our site is our product. We change it incrementally by implementing new features.

• Very predictable development process – trains leave on time at regular intervals (weekly).

• Parallel development process with significant output -- 100,000 LOC per release.

• Always on – over 99.94% available.

6M LOC

100K LOC/Wk

99.94%

276M Users

300+ Features Per Quarter

All while supporting a 24x7 environment

James Lind and cure for scurvy

• cider
• elixir of vitriol
• sea water
• garlic, mustard, horseradish
• vinegar
• orange, lemon

Reminder for data/analytics driven decisions

• Auction vs. Stores

• Combined search results
– Return a broader mix of inventory
– Listings of core + stores were combined
– More exposure to store listings

• Results
– Business metrics were down – bids, average sales price, etc.
– Latency in discovering this

• Analysis
– Overall cost of a store listing is less than that of auction listing
– Sellers shifted inventory to save on fees

• Rolled back in 03/2006
– Higher fees for store listings

Many Insights Methods (By Data Source vs. Approach)

[Chart: research methods plotted by data source and approach]

APPROACH: Qualitative (direct) to Quantitative (indirect)

DATA SOURCE: Self-reported (stated), mixture, Observed Behavior

KEY – Context of data collection with respect to product use: Natural use of product; Scripted or lab-based use of product; Combination / hybrid; De-contextualized / not using product

Methods shown:
• Focus Groups / “Voices”
• Phone Interviews
• “Visits” / Ethnographic Field Studies (onsite interviews, extended observation)
• Cardsorting
• Diary/Camera Study
• Exit Surveys
• Usability Lab Studies (task-based)
• Eyetracking
• Usability benchmarking (in lab)
• Quantitative user experience assessments
• Clickstreams
• Desirability studies
• Data mining
• Product Tracker
• Message Board Mining
• Intent Discovery
• Experimentation

Concepts

• Unit (of experimentation, analysis)
– Entity on which the experiment or analysis is performed, e.g. user, seller, buyer, item

• Factor (or variable)
– Something that can have multiple values
– Independent or controlled (cause), Dependent or response (effect)

• Treatment (or experience)
– A variation of information (e.g. page flow, page, module) served to the unit. The variation is characterized by a change in one or more factors or variables

• Sample
– A group of users who are served the same treatment

• Evaluation Metric
– A metric used to compare the response to different treatments

• Experimentation
– A method of comparing two or more treatments based on a measurable metric. One variant, the status quo, is referred to as the ‘control’.

Treatment (or experience)

• Module
– Strict subset of the page
– User is treated to changes to a module
– E.g. zebra vs. integrated vs. distinct ads

• Page
– User is treated to different variations of the page
– E.g. 2L1R (left column is twice as wide as right) vs. 1L2R

• Page Flow or Use Case
– User is treated to different variations of a use case
– E.g. different flows for listing an item for sale

Sampling

• Population
– Group you want to generalize to

• Sample
– Units selected from the population

• Sampling
– Process of selecting units from a population of interest
– By studying the sample you can fairly generalize the results to the population

• External validity (Generalizability)

• Mechanisms
– Random
– Stratified random
– …

• What matters is the number of samples

[Diagram labels: People, Place, Time, Setting]
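The two sampling mechanisms named on this slide, random and stratified random, can be sketched as follows. This is a minimal illustration, not eBay's implementation: the user records, the `country` stratum, and the 10% fraction are hypothetical.

```python
import random
from collections import defaultdict


def simple_random_sample(units, fraction, rng):
    """Pick a fraction of the units uniformly at random."""
    k = round(len(units) * fraction)
    return rng.sample(units, k)


def stratified_random_sample(units, stratum_of, fraction, rng):
    """Pick the same fraction at random from within each stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 750 US users and 250 DE users.
population = [{"id": i, "country": "US" if i % 4 else "DE"} for i in range(1000)]
rng = random.Random(2008)  # fixed seed so the sketch is repeatable

simple = simple_random_sample(population, 0.10, rng)
stratified = stratified_random_sample(
    population, lambda u: u["country"], 0.10, rng
)
```

Both samples contain 100 users, but only the stratified one is guaranteed to hold exactly 75 US and 25 DE users, mirroring the population mix.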

Experiments

• A/B testing
– A form of testing in which two treatments, a control (‘A’) and a variant (‘B’), are compared
– No emphasis on cause (factor)

• Single-factor testing
– A form of testing in which treatments corresponding to values of a single factor are compared
– E.g. Ad – Yes/No

• Multi-factorial testing (DOE)
– A method of testing in which treatments corresponding to multiple values of multiple factors are compared
– E.g. Ad – Yes/No, Location – Top/Bottom
– Manual vs. Automated
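To make the difference between single-factor and multi-factorial testing concrete, the sketch below enumerates the treatments of a full factorial design using the slide's example factors (Ad and Location). The function name `full_factorial` is illustrative, and a full factorial is only one possible design of experiments.

```python
from itertools import product

# Factors and their values, taken from the slide's example.
factors = {
    "ad": ["yes", "no"],
    "location": ["top", "bottom"],
}


def full_factorial(factors):
    """Return one treatment (a dict of factor -> value) per combination."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


treatments = full_factorial(factors)
for number, treatment in enumerate(treatments, start=1):
    print(f"T{number}: {treatment}")

# Single-factor testing is the special case with one factor.
single_factor = full_factorial({"ad": ["yes", "no"]})
```

Two factors with two values each yield four treatments; the count multiplies with every added factor, which is why sample size and concurrency become constraints.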

Objective

• To explore relationships between factors

• Relationships
– None
– Correlational, synchronized
  • Positive vs. negative
  • Third-variable problem
– Causal relationship
  • Establishing a causal relationship
    – If X, then Y
    – If not X, then not Y

• Distinguish significant factors and interactions

• Measure impact on the metric
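As an illustration of separating factor effects from interactions, the sketch below computes main effects and the interaction for a two-factor, two-level experiment. The metric values are hypothetical, and the factor names (ad, layout) are borrowed from examples elsewhere in the deck. Judging whether an effect is significant would additionally need the variance of each measurement, which this sketch leaves out.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical mean metric per treatment, keyed by (ad, layout).
response = {
    ("no", "2L1R"): 10.0,
    ("yes", "2L1R"): 12.0,
    ("no", "1L2R"): 11.0,
    ("yes", "1L2R"): 15.0,
}


def main_effect(response, index, high, low):
    """Average response at the 'high' value minus average at the 'low' value."""
    at_high = mean(v for k, v in response.items() if k[index] == high)
    at_low = mean(v for k, v in response.items() if k[index] == low)
    return at_high - at_low


ad_effect = main_effect(response, 0, "yes", "no")
layout_effect = main_effect(response, 1, "1L2R", "2L1R")

# Interaction: how much the ad effect changes between the two layouts.
ad_effect_1l2r = response[("yes", "1L2R")] - response[("no", "1L2R")]
ad_effect_2l1r = response[("yes", "2L1R")] - response[("no", "2L1R")]
interaction = (ad_effect_1l2r - ad_effect_2l1r) / 2

print(ad_effect, layout_effect, interaction)
```

A non-zero interaction means the two factors cannot be tuned independently, which separate single-factor tests would not reveal.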

Experiment Lifecycle

1. Hypothesis
• Idea (!)
• Learning

2. Experimental Design
• DOE
• Define Samples, Treatments, Factors

3. Setup Experiment
• Setup Experiment Samples, Treatments, Factors
• Implementation

4. Launch Experiment
• User (Experiment, Treatment)
• Serve Treatment

5. Measurement
• Tracking
• Monitoring

7. Analysis & Results
• Metrics
• Reporting

eBay Experimentation Platform

Reduce Email Guessing

• Purpose
– Measure decline in registrations from introduction of blocking message
– Users cannot create a username that equals their email address
– E.g. Username: cooky1 Email: [email protected]

• Metrics
– Number of registrations
– Reduction in phishing

• Samples
– 3% US

• Treatments
– Classic, Blocked

• Outcome
– No difference in registrations
– Improved security
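One common way to check an outcome such as "no difference in registrations" is a two-proportion z-test on the Classic and Blocked samples. The slides do not say which test was used, and the counts below are hypothetical, so treat this as a sketch of the general technique only.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: completed registrations out of users who started.
z, p_value = two_proportion_z_test(
    success_a=5000, n_a=100_000,   # Classic (control)
    success_b=4950, n_b=100_000,   # Blocked (variant)
)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

With these made-up counts the p-value is well above 0.05, so the observed drop would not count as a detectable difference at that conventional threshold.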

Text Ads on SRP

• Purpose
– Determine whether to use text ads on search result pages

• Metrics
– Overall revenue

• Samples
– 1% US, International

• Treatments
– Ad, No-ad

• Outcome
– Overall revenue increased in certain markets

Home Page

• Purpose
– Optimal construction of page
– Per user segment?

• Metrics
– Overall revenue

• Samples
– Varied per treatment

• Treatments
– 100s of variations
– Ads, Merchandising, P13N, Navigation, Layout

• Outcome
– Page structures different for different user segments

What we think about

Fidelity of Experiments: The quality of the model and its testing conditions in representing the final feature or product under actual use conditions

Cost of Experiments: The total cost of designing, building, running, and analyzing an experiment

Iteration time: The time from planning experiments to when the analyzed results are available and used for planning another iteration

Concurrency: The number of experiments that can be run at the same time

Signal/Noise Ratio: The extent to which the signal (response) of interest is obscured by noise

Type/Level of Experiment: Types and levels of experiment that can be carried out

Experimentation Platform

[Architecture diagram: labels recovered from the slide]

• ebay.com: Finding, Selling, Buying
• Experimentation Service
• Message Bus, Alert Listener
• File Log, DataCube
• Experiment Lifecycle Management
• Experiment Metadata
• Experience: Page, Module
• Experience Response
• Analysis: Metrics / Experience
• Flows: Experiences, Responses; Results, Observations
• Actors: Experimenter (Design; Access Results, Observations), eBay user (Access)

Implementation Considerations

• User identification

• User Sample
– No bias towards any experiment or treatment
– Sticky-ness between activities (and sessions)
– No interaction between experiments
– Enabling a user to try out a specific treatment
– Ramping-up to understand generalization effects

• Sample Treatment
– No bias

• Splitting traffic
– Inline
– Application server
– Load balancer
– Browser

• Factor-driven development
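Several of the user-sample requirements above (no bias, sticky-ness, no interaction between experiments, letting a user try a specific treatment, ramping up) are often met with deterministic hash-based bucketing. The sketch below shows that general technique; the slides do not describe eBay's actual mechanism, and every name here (`bucket`, `assign_treatment`, `overrides`, the experiment and user IDs) is hypothetical.

```python
import hashlib

BUCKETS = 10_000


def bucket(user_id: str, experiment_id: str) -> int:
    """Map a user to a stable bucket; salting with the experiment ID keeps
    assignments in different experiments approximately independent."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def assign_treatment(user_id, experiment_id, treatments, traffic_fraction,
                     overrides=None):
    """Return the user's treatment, or None if the user is not sampled."""
    if overrides and user_id in overrides:
        return overrides[user_id]          # force a specific treatment
    b = bucket(user_id, experiment_id)
    if b >= round(traffic_fraction * BUCKETS):
        return None                        # outside the experiment sample
    return treatments[b % len(treatments)]


treatments = ["Classic", "Blocked"]
print(assign_treatment("user42", "exp-1", treatments, 0.03))
print(assign_treatment("qa-user", "exp-1", treatments, 0.03,
                       overrides={"qa-user": "Blocked"}))
```

Because the bucket depends only on the user and experiment IDs, a user sees the same treatment across sessions, and raising `traffic_fraction` from 3% to 10% adds new users without reassigning those already sampled.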

Measurement – A case of traveling shoppers

               Sunday  Monday  Tuesday  Wednesday  Thursday  Friday   Saturday  Unique users
Page-1         Alice   Alice   Bob      Charlie    Bob       Charlie  Alice     3
Page-2         Bob     Alice   Bob      Alice      Bob       Bob      Alice     2
Unique users   2       1       1        2          1         2        1         3
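The point of the traveling-shoppers table is that unique-user counts are not additive across days or pages. The sketch below recomputes the counts from the visits shown on the slide, listed Sunday through Saturday; the variable names are illustrative.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

# Visitor seen on each page per day, as read from the slide's table.
visits = {
    "Page-1": ["Alice", "Alice", "Bob", "Charlie", "Bob", "Charlie", "Alice"],
    "Page-2": ["Bob", "Alice", "Bob", "Alice", "Bob", "Bob", "Alice"],
}

# Distinct users per page over the week.
unique_per_page = {page: len(set(users)) for page, users in visits.items()}

# Distinct users per day across both pages.
unique_per_day = [
    len({users[day] for users in visits.values()})
    for day in range(len(DAYS))
]

# Distinct users over the whole week and both pages.
unique_overall = len({user for users in visits.values() for user in users})

print(unique_per_page)        # per-page totals
print(unique_per_day)         # per-day totals
print(unique_overall)         # overall total
```

Summing the per-day counts gives 10 and summing the per-page counts gives 5, yet only 3 distinct users visited, so distinct counts have to be computed at the level at which they are reported.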

Limitations and ways to overcome them …

• Sticky-ness to user
– Session-level analysis

• What, not why
– When qualitative research complements

• Short-term vs. Long-term effects
– Think about the duration of the experiment

• Newness effect
– Consider burn-in periods

• Minor vs. major differences
– Think about amount of effort being committed

• Anonymity of tests
– When qualitative research spills the beans

Key takeaways

• Experimentation is one of the most effective approaches for gaining quantitative insights

• Enables businesses to quickly understand and establish relationships between product changes and their impact on business metrics

• Different types and levels of experiments can be used to gain different amounts of insights

• Experimentation has limitations, but they can be overcome

• Think about “experiment-ability” as one more “-ability” in product design

Experimentation Confirms Innovation

[email protected]

Technology
