Using machine learning to determine drivers of bounce and conversion
Velocity 2016, Santa Clara

Pat Meenan (@patmeenan)
Tammy Everts (@tameverts)
What we did (and why we did it)
Get the code: https://github.com/WPO-Foundation/beacon-ml
Deep learning
A neural network of layers; training adjusts the connection weights.
Random forest
Lots of random decision trees
Vectorizing the data
• Everything needs to be numeric
• Strings converted to several yes/no (1/0) inputs
• e.g. device manufacturer: "Apple" becomes a discrete input
• Watch out for input explosion (e.g. the full UA string)
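As a sketch of the string-to-numeric conversion above, here is minimal one-hot encoding in plain Python. The field name `mfg` and the sample values are illustrative, not the real beacon schema:

```python
# One-hot encoding sketch: each distinct string value of a categorical
# field becomes its own 0/1 input column. Field/value names are made up.

def one_hot(rows, field):
    """Expand rows[i][field] into yes/no (1/0) columns, one per value."""
    values = sorted({row[field] for row in rows})
    encoded = [[1 if row[field] == v else 0 for v in values] for row in rows]
    return values, encoded

rows = [{"mfg": "Apple"}, {"mfg": "Samsung"}, {"mfg": "Apple"}]
cols, enc = one_hot(rows, "mfg")
# cols == ["Apple", "Samsung"]; "Apple" is now a discrete 0/1 input
```

A high-cardinality field like the raw UA string would explode into thousands of such columns, which is the "input explosion" the slide warns about.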
Balancing the data
• 3% conversion rate
• 97% accurate by always guessing "no"
• Subsample the data for a 50/50 mix
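The rebalancing step can be sketched as downsampling the majority class until it matches the minority class. Variable names and the toy data are illustrative:

```python
import random

# With ~3% positives (conversions), downsample the negatives so the
# training set is a 50/50 mix. Toy data; names are illustrative.

def balance(rows, labels, seed=0):
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    rng = random.Random(seed)
    neg_sample = rng.sample(neg, len(pos))  # keep as many negatives as positives
    mixed = [(r, 1) for r in pos] + [(r, 0) for r in neg_sample]
    rng.shuffle(mixed)
    return mixed

data = list(range(100))
labels = [1 if i < 3 else 0 for i in data]  # 3% positive, like the beacon data
mixed = balance(data, labels)
# len(mixed) == 6, half positive, half negative
```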
Validation data
• Train on 80% of the data
• Validate on the remaining 20% to guard against overfitting
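The holdout split can be sketched in a few lines (pure Python; the talk's pipeline has its own version, so names here are illustrative):

```python
import random

# 80/20 train/validation split: shuffle indices, cut at 80%.

def split(rows, train_frac=0.8, seed=0):
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(rows) * train_frac)
    train = [rows[i] for i in idx[:cut]]
    val = [rows[i] for i in idx[cut:]]
    return train, val

train, val = split(list(range(100)))
# 80 rows to train on, 20 held out to watch for overfitting
```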
Smoothing the data
ML works best on normally distributed data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_val = scaler.transform(x_val)
Input/output relationships
• SSL highly correlated with conversions
• Long sessions highly correlated with not bouncing
• Remove correlated features from training
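One way to prune such leaky features is to drop any input column whose correlation with the label exceeds a threshold. This is a sketch, not the talk's actual code; the column names, data, and 0.9 threshold are all illustrative:

```python
# Drop inputs that nearly mirror the label (e.g. SSL vs. conversion),
# since they dominate training without explaining behavior.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def drop_correlated(columns, labels, threshold=0.9):
    return {name: col for name, col in columns.items()
            if abs(pearson(col, labels)) < threshold}

labels = [1, 0, 1, 0, 1, 0]
columns = {
    "ssl": [1, 0, 1, 0, 1, 0],         # perfectly tracks the label: dropped
    "num_scripts": [3, 5, 2, 6, 4, 4],  # weakly related: kept
}
kept = drop_correlated(columns, labels)
# kept == {"num_scripts": [3, 5, 2, 6, 4, 4]}
```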
Training deep learning
from keras.models import Sequential

model = Sequential()
model.add(...)
model.compile(optimizer='adagrad',
              loss='binary_crossentropy',
              metrics=["accuracy"])
model.fit(x_train, y_train,
          nb_epoch=EPOCH_COUNT,
          batch_size=32,
          validation_data=(x_val, y_val),
          verbose=2,
          shuffle=True)
Training random forest
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=FOREST_SIZE,
                             criterion='gini',
                             max_depth=None,
                             min_samples_split=2,
                             min_samples_leaf=1,
                             min_weight_fraction_leaf=0.0,
                             max_features='auto',
                             max_leaf_nodes=None,
                             bootstrap=True,
                             oob_score=False,
                             n_jobs=12,
                             random_state=None,
                             verbose=2,
                             warm_start=False,
                             class_weight=None)
clf.fit(x_train, y_train)
Feature importances
clf.feature_importances_
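Turning that importances array into a ranked report is just a sort over (name, score) pairs. The feature names and scores below are made up for illustration; in the real pipeline they come from the trained classifier:

```python
# Rank features by importance, highest first.
# Names and scores are illustrative, not the talk's actual results.

def rank_features(names, importances):
    return sorted(zip(names, importances), key=lambda p: p[1], reverse=True)

names = ["dom_ready", "full_load", "dns_lookup", "start_render"]
importances = [0.31, 0.24, 0.08, 0.05]  # stand-in for clf.feature_importances_
for name, score in rank_features(names, importances):
    print(f"{name}: {score:.2f}")
```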
What we learned
What’s in our beacon?
• Top-level: domain, timestamp, SSL
• Session: start time, length (in pages), total load time
• User agent: browser, OS, mobile ISP
• Geo: country, city, organization, ISP, network speed
• Bandwidth
• Timers: base, custom, user-defined
• Custom metrics
• HTTP headers
• Etc.
Conversion rate
Bounce rate
Finding 1
Number of scripts was a predictor… but not in the way we expected
Number of scripts per page (median)
Finding 2
When entire sessions were more complex, they converted less
Finding 3
Sessions that converted had 38% fewer images than sessions that didn’t
Number of images per page (median)
Finding 4
DOM ready was the greatest indicator of bounce rate
DOM ready (median)
Finding 5
Full load time was the second greatest indicator of bounce rate
timers_loaded (median)
Finding 6
Mobile-related measurements weren’t meaningful predictors of conversions
Conversions
Finding 7
Some conventional metrics were (almost) meaningless, too
Feature         Importance (out of 93)
DNS lookup      79
Start render    69
Takeaways
1. YMMV
2. Do this with your own data
3. Gather your RUM data
4. Run the machine learning against it
Thanks!