Using machine learning to determine drivers of bounce and conversion


Velocity 2016, Santa Clara

Pat Meenan (@patmeenan)

Tammy Everts (@tameverts)

What we did (and why we did it)

Get the code: https://github.com/WPO-Foundation/beacon-ml

Deep learning

[Slide: neural-network diagram, layers connected by weights]

Random forest
• Lots of random decision trees

Vectorizing the data
• Everything needs to be numeric
• Strings converted to several inputs as yes/no (1/0)
• e.g. Device manufacturer: “Apple” would be a discrete input
• Watch out for input explosion (UA String)
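A sketch of that vectorizing step in plain Python — the field and column names here are illustrative, not from the talk's beacon-ml code:

```python
# Sketch: expand a categorical string field into one yes/no (1/0)
# input per distinct value (one-hot encoding).

def one_hot(records, field):
    """Turn records[field] into a binary column per distinct value."""
    values = sorted({r[field] for r in records})
    columns = ["%s=%s" % (field, v) for v in values]
    rows = [[1 if r[field] == v else 0 for v in values] for r in records]
    return columns, rows

records = [{"manufacturer": "Apple"},
           {"manufacturer": "Samsung"},
           {"manufacturer": "Apple"}]
columns, rows = one_hot(records, "manufacturer")
# columns -> ['manufacturer=Apple', 'manufacturer=Samsung']
# rows    -> [[1, 0], [0, 1], [1, 0]]
```

This is where the input explosion warning bites: a field with a handful of distinct values adds a handful of columns, but a raw UA string with thousands of distinct values would add thousands.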

Balancing the data
• 3% conversion rate
• 97% accurate by always guessing “no”
• Subsample the data for a 50/50 mix
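A minimal sketch of that 50/50 subsampling, assuming conversions are the rare class (labeled 1) and non-conversions the majority (labeled 0):

```python
import random

def balance(rows, labels, seed=42):
    """Subsample the majority class so both labels appear 50/50."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)
    keep = pos + rng.sample(neg, len(pos))  # assumes 0s are the majority
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# With a ~3% conversion rate, most of the "no" rows get thrown away:
rows = [[i] for i in range(100)]
labels = [1] * 3 + [0] * 97
b_rows, b_labels = balance(rows, labels)
# 6 rows survive: 3 converted, 3 not
```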

Validation data
• Train on 80% of the data
• Validate on 20% to prevent overfitting
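A hand-rolled sketch of the 80/20 split; in practice scikit-learn's train_test_split does the same job in one call:

```python
import random

def split(rows, labels, train_frac=0.8, seed=0):
    """Shuffle, then hold out the tail for validation."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    cut = int(len(order) * train_frac)
    train, val = order[:cut], order[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in val], [labels[i] for i in val])

rows = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]
x_train, y_train, x_val, y_val = split(rows, labels)
# 8 training rows, 2 validation rows
```

The validation rows never touch training; a model that scores well on them is less likely to have simply memorized the training set.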

Smoothing the data
• ML works best on normally distributed data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # fit on training data only
x_val = scaler.transform(x_val)          # reuse the training mean/variance

Input/output relationships
• SSL highly correlated with conversions
• Long sessions highly correlated with not bouncing
• Remove correlated features from training
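One way to catch and drop such leaky inputs before training — a sketch of the idea, not the talk's actual code, with made-up column names:

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def drop_leaky(columns, rows, labels, threshold=0.9):
    """Keep only columns whose |correlation| with the label is below threshold."""
    keep = []
    for j, name in enumerate(columns):
        col = [r[j] for r in rows]
        if abs(pearson(col, labels)) < threshold:
            keep.append(name)
    return keep

columns = ["ssl", "timer_ms"]
rows = [[1, 5], [0, 7], [1, 7], [0, 6]]
labels = [1, 0, 1, 0]
keep = drop_leaky(columns, rows, labels)
# "ssl" mirrors the label exactly (correlation 1.0), so it is dropped
```

A feature that almost equals the answer (SSL for conversions, session length for bounces) makes the model look brilliant while teaching it nothing useful about the remaining inputs.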

Training deep learning

model = Sequential()
model.add(...)
model.compile(optimizer='adagrad',
              loss='binary_crossentropy',
              metrics=["accuracy"])
model.fit(x_train, y_train,
          nb_epoch=EPOCH_COUNT,
          batch_size=32,
          validation_data=(x_val, y_val),
          verbose=2,
          shuffle=True)

Training random forest

clf = RandomForestClassifier(n_estimators=FOREST_SIZE,
                             criterion='gini',
                             max_depth=None,
                             min_samples_split=2,
                             min_samples_leaf=1,
                             min_weight_fraction_leaf=0.0,
                             max_features='auto',
                             max_leaf_nodes=None,
                             bootstrap=True,
                             oob_score=False,
                             n_jobs=12,
                             random_state=None,
                             verbose=2,
                             warm_start=False,
                             class_weight=None)
clf.fit(x_train, y_train)

Feature importances
clf.feature_importances_
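A small sketch of turning clf.feature_importances_ into a ranked list. The names and scores below are invented for illustration — they are not the talk's measured results:

```python
def rank_features(names, importances, top=5):
    """Pair each input with its importance score and sort, biggest first."""
    ranked = sorted(zip(names, importances), key=lambda p: p[1], reverse=True)
    return ranked[:top]

# With a trained forest this would be:
#   rank_features(feature_names, clf.feature_importances_)
names = ["dom_ready", "num_scripts", "dns_lookup", "start_render"]
scores = [0.31, 0.22, 0.04, 0.05]  # hypothetical values
top2 = rank_features(names, scores, top=2)
# [('dom_ready', 0.31), ('num_scripts', 0.22)]
```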

What we learned

What’s in our beacon?
• Top-level – domain, timestamp, SSL
• Session – start time, length (in pages), total load time
• User agent – browser, OS, mobile ISP
• Geo – country, city, organization, ISP, network speed
• Bandwidth
• Timers – base, custom, user-defined
• Custom metrics
• HTTP headers
• Etc.

Conversion rate


Bounce rate


Finding 1: Number of scripts was a predictor… but not in the way we expected

[Chart: Number of scripts per page (median)]

Finding 2: When entire sessions were more complex, they converted less

Finding 3: Sessions that converted had 38% fewer images than sessions that didn’t

[Chart: Number of images per page (median)]

Finding 4: DOM ready was the greatest indicator of bounce rate

[Chart: DOM ready (median)]

Finding 5: Full load time was the second greatest indicator of bounce rate

[Chart: timers_loaded (median)]

Finding 6: Mobile-related measurements weren’t meaningful predictors of conversions

Finding 7: Some conventional metrics were (almost) meaningless, too

Feature importance (rank out of 93):
• DNS lookup – 79
• Start render – 69

Takeaways

1. YMMV
2. Do this with your own data
3. Gather your RUM data
4. Run the machine learning against it

Thanks!