Mobile App Feature Configuration and A/B Experiments

Feature Configuration and A/B Experiments in native and hybrid mobile apps

Hello

http://bdconf.com/2013/nashville/schedule#lacyrhoades


Feature Configuration and A/B Experiments

Breaking Development Nashville, October 21 [email protected]

in native and hybrid mobile apps

Mostly at Etsy we work on greenfield development: new products or features from scratch. However, as our mobile platform and our apps mature, we find ourselves wanting to take the principles we find most fruitful in other, older parts of Etsy and apply them in the mobile realm. TL;DR: this is about the fun stuff.





When I talk to people about working at Etsy I expect they want to talk about deploying web applications on tons of servers..

Or working with industry leaders like Rasmus Lerdorf..

But in fact typically people want to talk about the unique and sometimes strange things you can find on Etsy.

Often these people don’t know that those interesting oddities can be purchased on their phone or tablet.

One might wonder what kind of business it is, selling unique items over the internet, but as you can see we do pretty well. In recent months we’ve seen as much as half of our business happen on mobile devices. This makes mobile experimentation a really exciting place to be.

Feature Flags

http://www.flickr.com/photos/mig/15964697/

The first of the concepts that we are trying to adapt from the old web world to mobile is the notion of feature flags. We find wrapping things up as features incredibly useful. Another of these engineering principles is something we call Continuous Experimentation.



Continuous Experimentation

When we talk about Experimentation though we’re talking about A/B experiments. If you’ve ever been using an app or a site and noticed that someone sitting RIGHT NEXT to you has a different looking app or site, this is probably part of an experiment.


• make small changes
• stay honest
• don't break the [product]

The name we give it comes from what we call “Continuous Deployment”. By analogy, “Continuous Experimentation” describes how we keep developing apps and maturing app features after they’ve been out in the world and have largely proven to be a good idea.

Real People

https://www.etsy.com/shop/CattailsWoodwork

Brenda

Why is it important to be honest and to not break the product? Because we’re talking about the livelihood of real people.



Real People

https://www.etsy.com/shop/PretentiousBeerGlass

Matthew

People who make cool stuff and depend on Etsy to let them sell that cool stuff.





We use this to develop our products iteratively, using real-world feedback and data about our users and their experience.



I love building things, making things. For lack of a better analogy, you can’t build an automobile or a vacuum cleaner, sell it, have someone use it every day, and then expect to change how it works and watch from back at the factory how your changes affect the experience or the performance of that thing.

http://www.flickr.com/photos/usfws_pacificsw/6391185103/

Steve Jobs had this quote about how the computer was a bicycle for the mind. I feel like software today allows us to use computers as a bicycle for industry.



Disclaimer

• Analytics / Big Data
• Experimental Analysis
• Exploratory Analysis

I’m going to talk mostly about design and development that will keep us from making bad data or picking poor audiences. There are other elements here but let’s assume we have some way to gather analytics and some way to analyze the data we have gathered.


• Everyone does experiments
• But not everything works this way
• Rarely seller tools, usually public stuff
• The develop / release / cheer cycle
• Mobile Apps

Everyone at Etsy dabbles in experiments, at least a little. But really not every line and every product gets measured by experiment. Sometimes products are just made because they’re needed. Like making a mobile website.

Launch Day

Story time, we have a big announcement to make one day at Etsy..

Introducing the New Thing

We are introducing a new thing, a new feature or whatever..

The new product exists in a ton of places. These are all DIFFERENT teams mind you!

• Web
• iPhone
• Android
• iPad
• Email

This is all to say that this New Thing exists in an ecosystem across all these different mediums. This is kind of the mindset you have to be in with cross-platform features and their a/b tests. A feature now becomes a dimension in and of itself. It is not just part of your website. Features can transcend an individual app.

Launch Day

Fortunately in this story, on launch day, QA has already seen the new product in operation, ahead of the world. The App Store review process is already done. We had a release go out last week. The code that will show this New Thing lies “dormant” out in the wild, on the website, in the web views in the app and even in the app binary itself.

if ($cfg['cool-dog-pics']) { ...

index.php index2.php

in the PHP world..


When we flip the switch, the code can branch to the new code.

if (coolDogsIsEnabled()) { ...

base.js base_new.js

in the DOM..

if ([Config isEnabled:@"CoolDogs"]) { ...

SomeViewController.h SomeNewViewController.h

and in native apps..

One Line!

So the launch starts on an engineer’s MacBook. We push, for the sake of illustration, as little as one line of web code out to our web servers and API servers. This one line mentions The New Thing by name, saying simply “turn on The New Thing”, and within 20-30 minutes the Newness is live throughout the product for millions of users, on every kind of device across the ecosystem, all over the world.

Launch Day

So this story is really just like a parlor trick. It’s to get your attention. It’s fun to talk about. The real value here is the illustration of this unique dimension, these “features” which we can use to divide up our product that is spanning out in so many different mediums.

Measure Everything

• Feature Configuration
• Benefits of Features
• Configuration as Tests
• How Flags Get to the Code
• Examples of Experiments
• Making Sensible Tests

The roadmap to making our mobile a/b experiments a reality..

Feature Configuration

So first, some of the things we take for granted. Four years ago (and four years is a really long time) there was a blog article from Flickr floating around on the internet. It was about what they called feature flags. We all thought it was pretty cool stuff, though I had no idea why. Then at some point I started working at Etsy, and now I have to preface any technical discussion of interest with how feature configuration shapes the way we make software.

Flags, Flippers, Flickr

http://code.flickr.net/2009/12/02/flipping-out/

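$cfg = array(); // one array holds every feature flag

$cfg['cool-dog-pics']; // each feature gets a name

$cfg['cool-dog-pics'] = array('enabled' => 'off'); // shipped, but switched off

$cfg['cool-dog-pics'] = array('enabled' => 'on'); // flipped on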

This idea starts with a configuration array that names things on your site. Maybe cool dog pics.

Flags, Flippers, Flickr

if ($cfg['cool-dog-pics']) { echo Dogs::getCoolPics(); }

You basically push out all the code all the time. Even if it doesn’t work, you just keep it turned off.

Feature Configuration

The original benefit here, the exciting thing about this from the Flickr article is that you don’t have to merge code or do gigantic atomic (and painful) releases. We push code 30, 40, 50 times a day. We push code that doesn’t even work, on purpose, because pushing tons of little pieces of resilient code is a lot more predictable than trying to do a big release.

More Cool Stuff

$cfg = array('enabled' => array('users' => 'lacyrhoades'));

$cfg = array('enabled' => array('admin' => true));

$cfg = array('enabled' => array('groups' => 54321));

You could enable that branch of code for just one person, for just some people you work with, or for a group of users you know have a particular interest in the feature.
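Here is a minimal sketch of how targeting rules like these might be evaluated, written in Python rather than PHP. The `User` type, the `is_enabled` function, and the way the rule keys are interpreted are all assumptions made for illustration; the real config format and lookup code may well differ.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    # Hypothetical user record, just enough to evaluate the rules below.
    name: str
    is_admin: bool = False
    groups: set = field(default_factory=set)


def _as_list(value):
    """Accept a single value, a collection, or None."""
    if value is None:
        return []
    return value if isinstance(value, (list, tuple, set)) else [value]


def is_enabled(rule, user):
    """Return True if the flag's 'enabled' rule matches this user."""
    enabled = rule.get("enabled", "off")  # a missing rule means "off"
    if enabled in ("on", "off"):
        return enabled == "on"
    # Otherwise 'enabled' is a dict of targeting rules; any match turns it on.
    if user.name in _as_list(enabled.get("users")):
        return True
    if enabled.get("admin") and user.is_admin:
        return True
    if user.groups & set(_as_list(enabled.get("groups"))):
        return True
    return False


cfg = {"cool-dog-pics": {"enabled": {"users": "lacyrhoades"}}}
print(is_enabled(cfg["cool-dog-pics"], User("lacyrhoades")))   # True
print(is_enabled(cfg["cool-dog-pics"], User("someone-else")))  # False
```

Treating a missing rule as "off" keeps a brand-new or misspelled flag from switching anything on by accident.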

Measuring Things

Another principle is measuring things in real time.

Like here we’ve got what we call dashboards, so as we flip these switches on or off we can look for anomalies and make sure nothing is blowing up.

Benefits of Features

Another trick we have is what we call “Slow Rampups”. Things don’t have to be just “on” or “off”.

Benefits of Features

Launches can now happen on a schedule. There’s no sense in making a software deadline coincide with the release of that software. The development is going to take too long, and you are going to need time to QA it, change it and make it right.

Requisites

• No backwards incompatible changes
• Uniquely identified users

There are some limitations or requirements. You’ll need to make sure the branches in the code are not drastically different. The data schema, for example, has to be backwards compatible. You’re also going to need a way to uniquely identify users in order to put them into groups and give them consistent experiences.

Features in the New World

Feature flags were made for websites. But your website is not a website anymore. It hasn’t been a website for a while now. Your website probably employs at least one mobile developer who doesn’t even look at your “website” every day. Similarly feature flags have to adapt to this new world.

Features as Tests

A/B tests put people in buckets, giving them different values for one flag. You watch what they do over time, which means attaching information about their test buckets to analytics events. Then you make your app better based on the data you get back.

An Example Feature

Take the main screen. The individual facets you see there are powered by different parts of the infrastructure.

The personal activity feed is fed from a map-reduce stack in PHP. The curated panels are fed from a Java-based search stack. One of these might mess up and you want the app to go on without it. This is a great place to start for dividing up features in your design.
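One way to picture that kind of division, as a rough Python sketch: each panel sits behind its own flag and its own failure boundary, so a broken backend costs you one panel rather than the whole screen. The `build_home_screen` function and the panel names are hypothetical, not taken from the app.

```python
def build_home_screen(panels, is_enabled):
    """Collect content for each enabled panel, skipping any that fail.

    `panels` is a list of (name, fetch) pairs, where fetch() returns the
    panel's content. `is_enabled` is a callable taking a panel name.
    """
    sections = []
    for name, fetch in panels:
        if not is_enabled(name):
            continue  # flag is off: the panel's code stays dormant
        try:
            sections.append((name, fetch()))
        except Exception:
            # A failing backend drops one panel, not the whole screen.
            continue
    return sections


def broken_search_stack():
    raise RuntimeError("search stack is down")


panels = [
    ("activity-feed", lambda: ["a friend favorited an item"]),
    ("curated-panels", broken_search_stack),
]
print(build_home_screen(panels, lambda name: True))
```

In a real app you would log the failure rather than swallow it silently; the sketch leaves that out to stay short.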

An Example Prototype

For a prototype we might want to gather feedback from a select group of users. For example, the photo editor in our app started out with a prototype group and grew with the feedback we received.

This was a photo editing interface in the Etsy app. We wanted to know some particularly qualitative things about it, like was it easy to use and understand. It’s not exactly like we can study really dry analytics to get at this sort of answer. So we invited some interested sellers to join a group on Etsy. Those sellers could see the new tools and we could at that point gather feedback from them.

This was a redesign of the activity feed on our mobile web. We used a feedback group to preview the features and make sure we got it right.

An Example Experiment

One of our recent experiments.. mobile templates vs. desktop templates. It turns out from the measurements we could make that users were quantifiably more satisfied with the desktop “look” of Etsy on tablet devices.

We wanted to really go down the road of experiments as the web side of Etsy has before us. We took this listing view as a place to start. Here’s a reasonably priced shadow puppet. Here it is (right) with the experiment enabled.

Here was the flow before. The last three steps were all web views on the server side.

Here’s the flow afterwards. The idea was that reducing the number of steps would significantly reduce friction in the checkout flow.

Experiment Steps

• Set up a feature flag

First we needed to look at eligibility, or who can take part in this experiment. If the eligible audience is VERY small compared to the general audience of all users, our total number of "people who have used this feature" is going to be small, and so the numerator of "people who bought an item with this feature" will also be small. Then there’s the feature flag. We make this and start to write code against it, so we’re pushing code from day 1.

Experiment Steps

$config['mobile']['iphone']['BuyItNow'] = array(
    'enabled' => 0,
    'group' => 54321,
    'admin' => 0,
);

The config might look something like this. This goes into the one config file that’s at the center of everything.

Experiment Steps

• Set up a feature flag
• Determine eligibility programmatically

Next we need to determine eligibility programmatically. We’ve already determined that our eligible audience is of considerable size at this point. We did that as part of exploratory analysis. This is more about being able to say in the code, something like..

Eligibility

$eligible = isEligible($user, $listing);

if ($eligible) { ... }

We want to answer questions like: do we have this user’s billing information on file? Can this listing even be bought using that credit card? Can the seller who’s selling this item ship it to the country the user lives in? If the answer to any of these questions is no, we must not include this buyer in the experiment. Otherwise we’ll dilute our results quickly, since we know there are a bunch of combinations of items and buyers that can’t take part in the experiment.
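Those questions translate fairly directly into a check. Here is a minimal Python sketch, where the `Buyer` and `Listing` types and their fields are invented for illustration (the real `isEligible` surely consults more than this):

```python
from dataclasses import dataclass


@dataclass
class Buyer:
    # Hypothetical fields standing in for what we know about the user.
    country: str
    has_billing_info: bool
    card_type: str = ""


@dataclass
class Listing:
    # Hypothetical fields standing in for what we know about the item.
    accepted_cards: frozenset
    ships_to: frozenset


def is_eligible(buyer, listing):
    """Every condition must hold for this buyer/listing pair to enter the experiment."""
    return (
        buyer.has_billing_info
        and buyer.card_type in listing.accepted_cards
        and buyer.country in listing.ships_to
    )


listing = Listing(accepted_cards=frozenset({"visa"}), ships_to=frozenset({"US", "CA"}))
print(is_eligible(Buyer("US", True, "visa"), listing))  # True
print(is_eligible(Buyer("DE", True, "visa"), listing))  # False: seller can't ship there
```

Note that eligibility depends on the pair, not just the user: the same buyer can be eligible for one listing and not for the next.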

Experiment Steps

• Set up a feature flag
• Determine eligibility
• Start hacking away

At this point we start coding up the native elements and the web elements, all the while hiding them behind the feature configuration we chose before. We are generally careful whenever we need to add code to shared libraries or shared files.

Experiment Steps

• Set up a feature flag
• Determine eligibility
• Start hacking away
• Begin testing

When things are working the way we expect, we begin testing with a small internal group, usually people in QA or just staff members. Also we begin to QA the app as a whole for release.

Experiment Steps

• Set up a feature flag
• Determine eligibility
• Start hacking away
• Begin testing
• Put the product on the shelf

Once QA approves the way it works, most of the coding is done and we put the feature on the shelf for a while. We’ll probably try to begin the app store review process as soon as possible.

Eligibility

[EtsyConfig isEnabled:@"BuyItNow"];

The code in Objective-C will look something like this. The EtsyConfig class here is going to be responsible for remembering “yes I did see this experiment, someone asked about it, and the answer was: x”. That answer, and the specific question we came asking, need to be attached to the analytics events the user is firing.
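To make that bookkeeping concrete, here is a small Python sketch of a config object that remembers every question it was asked and stamps the answers onto analytics events. The `FeatureConfig` class, its method names, and the event shape are assumptions for illustration; this is not the actual EtsyConfig implementation.

```python
class FeatureConfig:
    """Answers flag questions and remembers which ones were asked."""

    def __init__(self, values):
        self._values = dict(values)  # flag name -> bool
        self._seen = {}              # flag name -> the answer we gave

    def is_enabled(self, name):
        answer = bool(self._values.get(name, False))  # unknown flags default to off
        self._seen[name] = answer                     # record the exposure
        return answer

    def tag_event(self, event):
        """Return a copy of an analytics event with the flags seen so far attached."""
        tagged = dict(event)
        tagged["experiments"] = dict(self._seen)
        return tagged


config = FeatureConfig({"BuyItNow": True})
if config.is_enabled("BuyItNow"):
    pass  # show the new checkout flow here
print(config.tag_event({"name": "listing_viewed"}))
```

Only flags that were actually asked about get attached, so events from users who never reached the experimental code path carry no bucket for it.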

Experiment Steps

• Experiment group
• Up to a certain percentage
• Analytics events

When the app is live, we can implement an experiment group. We can ramp up slowly so that we can kick the tires and know things are okay. When things are looking good we’ll take the experiment up to a percentage we established beforehand. Analytics events are capturing the state of this test as people see it. Typically you can ignore the state of this feature on analytics events for people who are completely ineligible.

Looking at results

• Self Selection
• Refunds / Returns
• Visit-level vs. User-level

Our initial results were actually pretty good. There are other things to consider, though: drawbacks, and perhaps bias in the design of the experiment.

How the Config Gets into the Device

plist

The code starts as the plist you probably recognize from any Xcode project.

plist + server

We download a set of config values, things that are enabled or disabled, from the server.

plist + server = runtime

At runtime we merge these values into a single dictionary.

Configuration Steps

• App launch
• Periodically later, login
• Merge downloaded config
• Post notification

The config is downloaded and merged whenever the app launches. It also is downloaded when the user logs in or out, as their experiments might change. As a final step in this merging we’ll post a notification in the app code so that UI elements in the app which need to update based on any experimental code, can do so.
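The merge-then-notify steps might look roughly like this. It is a Python sketch rather than the Objective-C in the app, and the `RuntimeConfig` class, its methods, and the observer callbacks (standing in for a posted notification) are hypothetical.

```python
class RuntimeConfig:
    """Bundled defaults overlaid with whatever the server last sent."""

    def __init__(self, bundled_defaults):
        self._defaults = dict(bundled_defaults)  # stands in for the bundled plist
        self._merged = dict(bundled_defaults)
        self._observers = []

    def add_observer(self, callback):
        """Register a callback to run after every merge (the 'notification')."""
        self._observers.append(callback)

    def merge_server_config(self, server_values):
        """Overlay server values on the bundled defaults, then notify observers."""
        merged = dict(self._defaults)
        merged.update(server_values)  # server values win over bundled defaults
        self._merged = merged
        for callback in self._observers:
            callback(dict(self._merged))

    def is_enabled(self, name):
        return bool(self._merged.get(name, False))  # unknown flags stay off


config = RuntimeConfig({"BuyItNow": False, "CoolDogs": False})
config.add_observer(lambda values: print("config changed:", values))
config.merge_server_config({"BuyItNow": True})  # e.g. on app launch or login
```

In this sketch each download starts again from the bundled defaults instead of piling onto the previous download, so a user who logs out drops back to the defaults plus whatever the logged-out config says.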

Bucketing Users

You’re going to need to know who individual users are if you want to put 20% into one bucket, ensure that they stay there, and ensure that no one from the 80% control group makes it into that experiment bucket.

Bucketing Users

• Persistent Cookies
• Device UDID
• user_id (where available)

The best thing to use here would be the user_id. This is not always available, for example when the user is logged out. Websites typically use persistent cookies to bucket logged-out users. Unfortunately, an API for mobile apps doesn't have cookies to fall back on, and you've still got to have some way to bucket users. One approach is to make up a sort of UDID: something that is specific to a device and is stored locally with the app. Strictly speaking this is only unique to each install of the app, but it seems to work fairly well. You need to pass that UDID to any webviews so that those webviews can identify themselves as part of the same app sessions.
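A common way to get sticky buckets out of an identifier like this is to hash it. The Python sketch below shows that general technique; it is an assumption about how bucketing could be done, not a description of Etsy's code, and the function names are made up.

```python
import hashlib
import uuid


def new_install_id():
    """Generate an ID once per install; the app would store it locally."""
    return str(uuid.uuid4())


def bucket_percent(experiment, identifier):
    """Map (experiment, identifier) to a stable number in [0, 100)."""
    key = f"{experiment}:{identifier}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest[:8], 16) % 10000 / 100.0


def in_experiment(experiment, identifier, percent):
    """True if this identifier falls within the first `percent` of buckets."""
    return bucket_percent(experiment, identifier) < percent


install_id = new_install_id()
# The same ID always gets the same answer, so a user stays in their bucket.
print(in_experiment("BuyItNow", install_id, 20) == in_experiment("BuyItNow", install_id, 20))
```

Because the hash is fixed per identifier, ramping from 20% to 50% only adds people to the experiment; nobody who already saw it gets bounced back into control. Including the experiment name in the hash keeps the same users from landing in the treatment bucket of every experiment.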

Sensible Testing

We mentioned before that the number of users, the percentage, was key to obtaining significant results. If you don’t run the experiment long enough, you’re not going to prove anything. If you take too long measuring something, you’re wasting time, and there are deadlines to meet for this and other things. To plan our tests we follow the equation used here:

ExperimentCalculator.com via @mcfunley

• How many eligible visits per day?
• What percentage of visits will see the change?
• What is your current conversion rate?
• How will you change conversion?
• How confident do you want to be?
• How likely should you be to detect the change?

experimentcalculator.com: this is something one of our engineers made from a paper on statistics. I don’t pretend to understand it exactly, but the idea is that it keeps you from choosing random numbers for the length of the experiment. You have to weigh the cost of developing a new feature against the feature’s potential value.
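To give a feel for how those six questions turn into a duration, here is a Python sketch using the standard textbook sample-size approximation for comparing two conversion rates. This is an assumption on my part: it is not necessarily the calculation experimentcalculator.com performs, and the function names and example numbers are made up.

```python
import math
from statistics import NormalDist


def visits_per_group(base_rate, relative_change, confidence=0.95, power=0.80):
    """Approximate visits needed in each group (two-sided test of two proportions)."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_change)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


def days_needed(eligible_visits_per_day, percent_seeing_change,
                base_rate, relative_change, confidence=0.95, power=0.80):
    """Days until the smaller of the two groups has collected enough visits."""
    if not 0 < percent_seeing_change < 100:
        raise ValueError("percent_seeing_change must be between 0 and 100")
    n = visits_per_group(base_rate, relative_change, confidence, power)
    treatment_per_day = eligible_visits_per_day * percent_seeing_change / 100
    control_per_day = eligible_visits_per_day - treatment_per_day
    return math.ceil(n / min(treatment_per_day, control_per_day))


# Hypothetical inputs: 10,000 eligible visits a day, 20% see the change,
# a 5% conversion rate today, and a hoped-for 10% relative improvement.
print(days_needed(10_000, 20, 0.05, 0.10))
```

The useful intuition is in the denominator: halving the change you hope to detect roughly quadruples the visits you need, which is why subtle effects take so long to measure.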

Sensible Testing

Some shortcomings of these approaches: lots more code, and more permutations for QA. For every test variant you add, you essentially add the need for another QA user story. You risk introducing an unpredictable user experience. Changing minor interactions is probably okay; changing the main navigation scheme in your application probably isn’t. Then there's analysis paralysis: at some point you’ve just got to make decisions. A/B testing only goes so far in helping you choose direction for your product. You can’t be creative in your product decisions by just piling A/B tests back to back.

Things to Watch Out For

Make sure the default is "off", for predictable, stable experiences and for the sanity of future support. You don’t want a bunch of old flags that you need to keep around for un-updated apps. If you've got a good app and a good idea going, you're probably not going to discover a breakaway victory by running an A/B test. The effects are too subtle, and oftentimes the test only proves that your intuition about your users is heavily biased.

The Future

So get out there, do some experiments. Time is precious. If you're going to make heads or tails of the numbers you see in an experiment, you'll probably need all the time you can get.





