Building better products through Experimentation
Deepak Nadig, eBay Principal Architect
SDForum Business Intelligence SIG, March 27, 2008
2
What we’re up against
• eBay manages …
– Over 276,000,000 registered users
– Over 1 Billion photos
– eBay users worldwide trade more than $2,039 worth of goods every second
– eBay averages well over 1 billion page views per day
– At any given time, there are over 113 million items for sale on the site
– eBay stores over 2 Petabytes of data, over 200 times the size of the Library of Congress!
– eBay analytics processes over 25 Petabytes of data on any day
– The eBay platform handles 4.4 billion API calls per month
• In a dynamic environment
– 300+ features per quarter
– We roll 100,000+ lines of code every two weeks
• In 39 countries, in seven languages, 24x7
>44 Billion SQL executions/day!
An SUV is sold every 5 minutes
A sporting good sells every 2 seconds
Over ½ Million pounds of Kimchi are sold every year!
3
Site Statistics: in a typical day…
Metric                     June 1999                        Q1 2007                           Growth
Outbound Emails            1 M                              41 M                              41x
Total Page Views           54 M                             >1 B                              19x
Availability               ~97% (43 mins/day of downtime)   99.94% (50 sec/day of downtime)   50x
Peak Network Utilization   268 Mbps                         16 Gbps                           59x
API Calls                  0                                150 M                             N/A
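The downtime figures next to the availability percentages follow from simple arithmetic. The sketch below is only an illustration; the function name `downtime_seconds_per_day` is made up here and is not from the talk.

```python
# Convert an availability percentage into expected downtime per day.
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_seconds_per_day(availability_pct: float) -> float:
    """Seconds per day the site is unavailable at the given availability."""
    return (100.0 - availability_pct) / 100.0 * SECONDS_PER_DAY


june_1999 = downtime_seconds_per_day(97.0)   # roughly 43 minutes
q1_2007 = downtime_seconds_per_day(99.94)    # roughly 52 seconds

print(round(june_1999 / 60, 1), "minutes/day")
print(round(q1_2007, 1), "seconds/day")
print(round(june_1999 / q1_2007), "x less downtime")
```

The computed values are close to the slide's rounded "43 mins/day" and "50 sec/day", and the ratio between them matches the 50x figure in the Growth column.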
4
Velocity of eBay -- Software Development Process
• Our site is our product. We change it incrementally by implementing new features.
• Very predictable development process: trains leave on time at regular intervals (weekly).
• Parallel development process with significant output -- 100,000 LOC per release.
• Always on – over 99.94% available.
6M LOC
100K LOC/Wk
99.94%
276M Users
300+ Features Per Quarter
All while supporting a 24x7 environment
5
James Lind and cure for scurvy
• cider
• elixir of vitriol
• sea water
• garlic, mustard, horseradish
• vinegar
• orange, lemon
6
Reminder for data/analytics driven decisions
• Auction vs. Stores
• Combined search results
– Return a broader mix of inventory
– Listings of core + stores were combined
– More exposure to store listings
• Results
– Business metrics were down: bids, average sales price, etc.
– Latency in discovering this
• Analysis
– Overall cost of a store listing is less than that of auction listing
– Sellers shifted inventory to save on fees
• Rolled back in 03/2006
– Higher fees for store listings
7
Many Insights Methods (By Data Source vs. Approach)
[Figure: insights methods plotted by approach and data source; the position of each method on the chart is not recoverable from the extracted text]
• Axis "Approach": Qualitative (direct), Quantitative (indirect)
• Axis "Data Source": Self-reported (stated), Observed Behavior
• Intermediate position on the chart: mixture
• Key: context of data collection with respect to product use
– Scripted or lab-based use of product
– Natural use of product
– Combination / hybrid
– De-contextualized / not using product
• Methods shown
– Focus Groups / “Voices”
– Phone Interviews
– “Visits” (Onsite interviews) / Ethnographic Field Studies (Extended observation)
– Cardsorting
– Diary/Camera Study
– Exit Surveys
– Usability Lab Studies (task-based)
– Eyetracking
– Usability benchmarking (in lab)
– Quantitative user experience assessments
– Clickstreams
– Desirability studies
– Data mining
– Product Tracker / Message Board Mining
– Intent Discovery
– Experimentation
8
Concepts
• Unit (of experimentation, analysis)
– Entity on which the experiment or analysis is performed, e.g. user, seller, buyer, item
• Factor (or variable)
– Something that can have multiple values
– Independent or controlled (cause), Dependent or response (effect)
• Treatment (or experience)
– A variation of information (e.g. page flow, page, module) served to the unit. The variation is characterized by a change in one or more factors or variables
• Sample
– A group of users who are served the same treatment
• Evaluation Metric
– A metric used to compare the responses to different treatments
• Experimentation
– A method of comparing 2 or more treatments based on a measurable metric. One variant, the status quo, is referred to as the ‘control’.
9
Treatment (or experience)
• Module
– Strict subset of the page
– User is treated to changes to a module
– E.g. zebra vs. integrated vs. distinct ads
• Page
– User is treated to different variations of the page
– E.g. 2L1R (left column is twice as wide as right) vs. 1L2R
• Page Flow or Use Case
– User is treated to different variations of a use case
– E.g. different flows for listing an item for sale
10
Sampling
• Population
– Group you want to generalize to
• Sample
– Units from the population selected
• Sampling
– Process of selecting units from a population of interest
– By studying the sample you can fairly generalize the results to the population
• External validity (Generalizability)
• Mechanisms
– Random
– Stratified random
– …
• What matters is number of samples
[Figure labels: People, Place, Time, Setting]
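As an illustration of the "stratified random" mechanism named above, here is a minimal sketch. The population, the `stratum_of` callback, and the 10% fraction are hypothetical and chosen only for the example; the talk does not describe how eBay draws its samples.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from every stratum of the population."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population of (user_id, country) pairs: 900 US users, 100 DE users.
population = [(i, "US") for i in range(900)] + [(i, "DE") for i in range(900, 1000)]
sample = stratified_sample(population, stratum_of=lambda u: u[1], fraction=0.1)
print(len(sample), "units sampled")
```

Compared with plain random sampling, sampling each stratum separately guarantees that a small segment (here the DE users) is represented in proportion to its size.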
11
Experiments
• A/B testing
– A form of testing in which two treatments, a control (‘A’) and a variant (‘B’), are compared
– No emphasis on cause (factor)
• Single-factor testing
– A form of testing in which treatments corresponding to values of a single factor are compared
– E.g. Ad: Yes/No
• Multi-factorial testing (DOE)
– A method of testing in which treatments corresponding to multiple values of multiple factors are compared
– E.g. Ad: Yes/No, Location: Top/Bottom
– Manual vs. Automated
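A full factorial design enumerates every combination of factor values as one treatment. The sketch below uses the Ad and Location factors from the slide; the function name `full_factorial` and the dictionary layout are assumptions made for illustration, not the platform's actual representation.

```python
from itertools import product

# Factors and their values, taken from the slide's example.
factors = {
    "ad": ["yes", "no"],
    "location": ["top", "bottom"],
}


def full_factorial(factors):
    """Return one treatment (a dict of factor -> value) per combination."""
    names = list(factors)
    value_lists = [factors[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


treatments = full_factorial(factors)
for t in treatments:
    print(t)
```

With two factors of two values each this yields four treatments. The count grows multiplicatively with each added factor, which is one reason the number of concurrent experiments and the available traffic matter.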
12
Objective
• To explore relationships between factors
• Relationships
– None
– Correlational, synchronized
  • Positive vs. negative
  • Third-variable problem
– Causal relationship
  • Establishing causal relationship
    – If X, then Y
    – If not X, then not Y
• Distinguish significant factors and interactions
• Measure impact on the metric
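One common way to measure the impact of a treatment on a conversion-style metric is a two-proportion z-test. The talk does not say which statistical test eBay uses, so the following is a generic sketch with made-up counts, not the platform's method.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and variant (B).

    Returns (difference in rates, z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return p_b - p_a, z, p_value


# Hypothetical counts: 10,000 users per treatment.
diff, z, p = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f"lift={diff:.4f} z={z:.2f} p={p:.4f}")
```

A small p-value suggests the observed difference is unlikely to be noise alone. It says nothing about why the difference occurred, which is where the causal reasoning above and qualitative research come in.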
13
Experiment Lifecycle
[Figure: lifecycle steps arranged around the eBay Experimentation Platform]
1. Hypothesis
• Idea (!)
• Learning
2. Experimental Design
• DOE
• Define Samples, Treatments, Factors
3. Setup Experiment
• Setup Experiment Samples, Treatments, Factors
• Implementation
4. Launch Experiment
• User (Experiment, Treatment)
• Serve Treatment
5. Measurement
• Tracking
• Monitoring
7. Analysis & Results
• Metrics
• Reporting
14
Reduce Email Guessing
• Purpose
– Measure decline in registrations from introduction of blocking message
– Users cannot create username which equals email address
– E.g. Username: cooky1 Email: [email protected]
• Metrics
– Number of registrations
– Reduction in phishing
• Samples
– 3% US
• Treatments
– Classic, Blocked
• Outcome
– No difference in registrations
– Improved security
15
Text Ads on SRP
• Purpose
– Evaluate the use of text ads on search result pages
• Metrics
– Overall revenue
• Samples
– 1% US, International
• Treatments
– Ad, No-ad
• Outcome
– Overall revenue increased in certain markets
16
Home Page
• Purpose
– Optimal construction of page
– Per user segment?
• Metrics
– Overall revenue
• Samples
– Varied per treatment
• Treatments
– 100s of variations
– Ads, Merchandising, P13N (personalization), Navigation, Layout
• Outcome
– Page structures different for different user segments
17
What we think about
• Fidelity of Experiments
– The quality of the model and its testing conditions in representing the final feature or product under actual use conditions
• Cost of Experiments
– The total cost of designing, building, running, and analyzing an experiment
• Iteration time
– The time from planning experiments to when the analyzed results are available and used for planning another iteration
• Concurrency
– The number of experiments that can be run at the same time
• Signal/Noise Ratio
– The extent to which the signal (response) of interest is obscured by noise
• Type/Level of Experiment
– Types and levels of experiment that can be carried out
18
Experimentation Platform
[Figure: Experimentation Platform architecture; the connections between components are not recoverable from the extracted text]
• Actors: Experimenter (Design, Access), eBay user (Access)
• ebay.com applications: Finding, Selling, Buying
• Components: Experimentation Service, Message Bus, Alert Listener, File Log, Data Cube
• Functions: Experiment Lifecycle Management, Experience (Page, Module), Analysis (Metrics / Experience)
• Data: Experiment Metadata, Experiences, Responses, Experience Response, Results, Observations
19
Implementation Considerations
• User identification
• User Sample
– No bias towards any experiment or treatment
– Stickiness between activities (and sessions)
– No interaction between experiments
– Enabling a user to try out a specific treatment
– Ramping up to understand generalization effects
• Sample Treatment
– No bias
• Splitting traffic
– Inline
– Application server
– Load balancer
– Browser
• Factor-driven development
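Several of the user-sample considerations above (stickiness, no bias, forcing a specific treatment, ramping up) are often addressed with deterministic hashing of the user identifier. The sketch below shows that general technique. The function names, the `overrides` parameter, and the bucket count are assumptions for illustration and do not describe eBay's implementation.

```python
import hashlib

BUCKETS = 10_000  # resolution of the traffic split (0.01% steps)


def bucket(user_id: str, experiment: str) -> int:
    """Map (experiment, user) to a stable bucket in [0, BUCKETS)."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % BUCKETS


def assign(user_id, experiment, treatments, traffic_pct, overrides=None):
    """Return the user's treatment, or None if the user is outside the sample."""
    # Let a tester force a specific treatment for a given user.
    if overrides and user_id in overrides:
        return overrides[user_id]
    b = bucket(user_id, experiment)
    # Only the lowest buckets take part; raising traffic_pct only adds users.
    if b >= int(traffic_pct * BUCKETS / 100):
        return None
    return treatments[b % len(treatments)]


print(assign("user-42", "reduce-email-guessing", ["Classic", "Blocked"], 3))
```

Because the bucket depends only on the experiment name and the user identifier, a user sees the same treatment across sessions, and ramping from 3% to a larger share keeps existing participants in their original treatment. Salting the hash with the experiment name makes assignments in different experiments statistically independent of each other; it does not by itself stop one user from being in two experiments at once.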
20
Measurement – A case of traveling shoppers
               Sunday   Monday   Tuesday   Wednesday   Thursday   Friday    Saturday   Unique users
Page-1         Alice    Alice    Bob       Charlie     Bob        Charlie   Alice      3
Page-2         Bob      Alice    Bob       Alice       Bob        Bob       Alice      2
Unique users   2        1        1         2           1          2         1          3
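The table illustrates that unique-user counts are not additive across days or pages. The sketch below reproduces the counts, assuming the slide's table reads as one visitor per page per day, Sunday through Saturday; the variable names are made up for the example.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

# Visitor per page per day, as read from the slide (assumed layout).
visits = {
    "Page-1": ["Alice", "Alice", "Bob", "Charlie", "Bob", "Charlie", "Alice"],
    "Page-2": ["Bob", "Alice", "Bob", "Alice", "Bob", "Bob", "Alice"],
}

# Unique users per day, across both pages.
daily_uniques = [len({visits[page][i] for page in visits}) for i in range(len(DAYS))]

# Unique users per page, across the week.
page_uniques = {page: len(set(users)) for page, users in visits.items()}

# Unique users overall.
total_uniques = len({user for users in visits.values() for user in users})

print("daily:", daily_uniques, "sum:", sum(daily_uniques))
print("per page:", page_uniques, "sum:", sum(page_uniques.values()))
print("overall:", total_uniques)
```

Summing the daily counts gives 10 and summing the per-page counts gives 5, yet only 3 distinct users visited during the week. Metrics computed per unit (user) therefore have to be deduplicated over the whole analysis window rather than added up from daily or per-page reports.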
21
Limitations and ways to overcome them …
• Stickiness to user
– Session-level analysis
• What, not why
– When qualitative research complements
• Short-term vs. long-term effects
– Think about the duration of the experiment
• Newness effect
– Consider burn-in periods
• Minor vs. major differences
– Think about amount of effort being committed
• Anonymity of tests
– When qualitative research spills the beans
22
Key takeaways
• Experimentation is one of the most effective approaches for gaining quantitative insights
• Enables businesses to quickly understand and establish relationships between product changes and their impact on business metrics
• Different types and levels of experiments can be used to gain different amounts of insight
• Experimentation has limitations, but they can be overcome
• Think about “experiment-ability” as one more “-ability” in product design