Optimizing Online Yield via Predictive Modeling of Individual Site Visitors

Magnify360 Liaisons: Olivier Chaine, Jim Healy, Nate Pool, Gilles ?????

David Lapayowker, Marissa Quitt, Elaine Shaver (PM), Devin Smith

HMC Advisor: Zachary Dodds


Magnify360

Designs multiple websites for clients with each site customized to meet the needs of different types of users.

Analyzes clickstream data from site visitors in order to provide the website that will best suit each one.

The result: a set of tailored sites converts more users than any single page would.

[Figure: old Facebook vs. new Facebook]

System Overview

[Diagram: a user navigates to a site and generates clickstream data. Our system's online classifier classifies the user (e.g. "Musician"), chooses a page, and serves it; the tailored interactions produce "conversion" results. Offline analysis clusters the user data, pages served, and conversion data into user groups (e.g. Musician, Pachyphile, Bioengineer, Pasadena resident, Insomniac).]

Problem Statement

[Diagram: same system diagram as on the System Overview slide.]

Detailed problem statement here

Clickstream Data

example columns…

Database: 80 tables, 110,000,000 rows, 13 GB

ethics ~ anonymous ~ no purchased data!

User profiles

A profile is a binary attribute that captures a specific combination of data values.

Currently 42 of them, hand-specified

insomniac something something

Tradeoffs:

+ captures experienced intuition about what is important

+ takes advantage of Magnify360's site-design expertise

- binary attributes

- may miss patterns not captured by the user profiles

from Mag360's site
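
To illustrate how a binary profile could be evaluated, here is a minimal Python sketch. The field names (`visit_hour`, `referrer`), both profile definitions, and the helper `profile_vector` are hypothetical; the 42 real profiles are hand-specified by Magnify360 and are not reproduced here.

```python
# Hypothetical profiles: each one is a yes/no test over a single visitor record.
PROFILES = {
    "insomniac": lambda v: 1 <= v["visit_hour"] < 5,        # assumed rule, for illustration
    "search_arrival": lambda v: v["referrer"] == "search",  # assumed rule, for illustration
}

def profile_vector(visitor):
    """Return the visitor as a list of 0/1 profile attributes, in PROFILES order."""
    return [int(test(visitor)) for test in PROFILES.values()]

visitor = {"visit_hour": 3, "referrer": "email"}
print(profile_vector(visitor))  # [1, 0]
```

Each visitor then becomes a fixed-length vector of 0/1 attributes, which is the representation the clustering slides work with.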

Conversion data

The site yield, or conversion, is client-specified:

Amount of transaction(s)

Time spent on (a part of) the site

Contact information: presence and/or time of an email address

3% conversion

table

Goal: to determine which clusters of visitors will be best served by (that is, will convert on) a particular version of a client site

Offline analysis ~ user clustering

Visitors ~ vectors of profile attributes

hand-tuned clusters

decision-tree clustering

fuzzy k-means clustering

support vector machines

one big cluster ~ "best page"

growing neural gas

hierarchical clustering


Support vector machine example

Can we get one of the real data pages?

From clusters to sites

Training data from each cluster determines the best site:

Visits in one cluster (page, yield): (A, 7), (A, 1), (A, 1), (B, 3), (B, 8), (B, 7)

$$\text{page A score} = \frac{7 + 1 + 1 \;\text{(yield)}}{3 \;\text{(visits)}} = 3.0$$

$$\text{page B score} = \frac{7 + 8 + 3 \;\text{(yield)}}{3 \;\text{(visits)}} = 6.0$$

This cluster of six people responds better to site B.
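
The per-cluster scoring on this slide can be sketched in a few lines of Python. The function names `average_yields` and `best_page` are illustrative, not taken from the project's code; the visit data is the six-visit example from the slide.

```python
from collections import defaultdict

def average_yields(visits):
    """Map each page to its mean yield over (page, yield) visit records."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for page, y in visits:
        totals[page] += y
        counts[page] += 1
    return {page: totals[page] / counts[page] for page in totals}

def best_page(visits):
    """Return the page with the highest mean yield for this cluster."""
    scores = average_yields(visits)
    return max(scores, key=scores.get)

cluster_visits = [("A", 7), ("A", 1), ("A", 1), ("B", 3), ("B", 8), ("B", 7)]
print(average_yields(cluster_visits))  # {'A': 3.0, 'B': 6.0}
print(best_page(cluster_visits))       # B
```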

Time-based site choice

Magnify360 wants to adapt quickly to new preferences:

Site B has the better overall average, but site A has had better recent performance.

Visits (page, yield, t), where t ~ age of data: (A, 7, 0), (A, 1, 3), (A, 1, 4), (B, 3, 1), (B, 8, 5), (B, 7, 4)

Time-weighted average yields:

$$\text{page A score} = \frac{2^{0}\cdot 7 + 2^{-3}\cdot 1 + 2^{-4}\cdot 1}{2^{0} + 2^{-3} + 2^{-4}} \approx 6.05$$

$$\text{page B score} = \frac{2^{-4}\cdot 7 + 2^{-5}\cdot 8 + 2^{-1}\cdot 3}{2^{-4} + 2^{-5} + 2^{-1}} \approx 3.68$$
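
The time-weighted average can be checked with a short Python sketch. It assumes each visit is weighted by 2 to the power of minus its age t, as the slide's numbers suggest; the `half_life` parameter is an added assumption (the slide's weights correspond to `half_life=1`), and the function name is illustrative.

```python
def time_weighted_yield(visits, half_life=1.0):
    """Weighted mean of yields, with weight 2 ** (-t / half_life) for a visit of age t."""
    weights = [2.0 ** (-t / half_life) for _, t in visits]
    weighted_sum = sum(w * y for w, (y, _) in zip(weights, visits))
    return weighted_sum / sum(weights)

page_a = [(7, 0), (1, 3), (1, 4)]  # (yield, t) pairs from the slide
page_b = [(3, 1), (8, 5), (7, 4)]
print(round(time_weighted_yield(page_a), 2))  # 6.05
print(round(time_weighted_yield(page_b), 2))  # 3.68
```

With these weights the most recent visit dominates, which is why page A overtakes page B even though its plain average is lower.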


procedure

Online classification

Possible results…


all on one graph

Results ~ Packet 8

comments

what about hand-tuned system results?


talk about SVM parameters here?

A closer look…

comments


Sensitivity to scoring parameters?

comments

David's charts


Software structure

comments

Diagram

What's done and not done…


Perspective

Concluding comments

Questions?

Clickstream Data

The Good: We have DATA!

The Bad: Too much?

The Ugly: What is this data!?

~ 80 tables

~ 13 GB


One of our tables…


ID, anyone?


Fun Statistics


Data: To do

Understand the purpose of each table / column

Understand relationships between tables

Create a single table (or file) of relevant information in order to test and evaluate our clustering algorithms.

(table denormalization, against all design principles)
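
As a rough sketch of the "single table" step, the Python below joins three tiny in-memory tables into one flat table with `sqlite3`. Every table and column name here (`visits`, `pages_served`, `conversions`, `profile_bits`, `amount`, `yield_value`) is hypothetical; the real database has about 80 tables whose schema is not shown on these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for three of the ~80 real tables.
    CREATE TABLE visits (visit_id INTEGER PRIMARY KEY, profile_bits TEXT);
    CREATE TABLE pages_served (visit_id INTEGER, page TEXT);
    CREATE TABLE conversions (visit_id INTEGER, amount REAL);
    INSERT INTO visits VALUES (1, '101'), (2, '010');
    INSERT INTO pages_served VALUES (1, 'A'), (2, 'B');
    INSERT INTO conversions VALUES (1, 7.0);
""")

# One row per visit: profile attributes, page served, and yield (0 if no conversion).
conn.execute("""
    CREATE TABLE flat AS
    SELECT v.visit_id, v.profile_bits, p.page,
           COALESCE(c.amount, 0.0) AS yield_value
    FROM visits v
    JOIN pages_served p ON p.visit_id = v.visit_id
    LEFT JOIN conversions c ON c.visit_id = v.visit_id
""")

rows = conn.execute("SELECT * FROM flat ORDER BY visit_id").fetchall()
print(rows)  # [(1, '101', 'A', 7.0), (2, '010', 'B', 0.0)]
```

The `LEFT JOIN` keeps visits that never converted, which matters because non-conversions are most of the training signal.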

Clustering Algorithms

k-Means: Choose centroids at random, then assign each point to its nearest centroid so that distances inside clusters are minimized. Recalculate the centroids and repeat until a steady state is reached.

Fuzzy k-Means: Similar, but every datapoint is in a cluster to some degree, not just in or out.

Hierarchical Clustering: Uses a bottom-up approach to bring together points and clusters that are close together.

Bottom line: These clustering algorithms are simple and effective techniques for categorizing data, but they cannot exist in a vacuum; we are investigating other techniques that may be used in parallel or instead.
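
The k-means loop described above can be sketched in plain Python as follows. This is a minimal illustration, not the project's implementation; the function names, the `seed` and `max_iter` parameters, and the toy data are assumptions.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # choose centroids at random
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # recalculate centroids; an empty cluster keeps its old centroid
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:         # steady state reached
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```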

FuzME's best 10-cluster results ~ synthetic data
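
To make "in a cluster to some degree" concrete, the sketch below computes fuzzy memberships for one point using the standard fuzzy c-means formula with fuzzifier `m`. That FuzME uses exactly this formula and this default `m` is an assumption; the function name is illustrative.

```python
import math

def memberships(point, centroids, m=2.0):
    """Degree of membership of `point` in each cluster; the values sum to 1."""
    dists = [math.dist(point, c) for c in centroids]
    zeros = dists.count(0.0)
    if zeros:  # the point sits exactly on one or more centroids
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

print(memberships((1, 0), [(0, 0), (3, 0)]))  # [0.8, 0.2]
```

A point twice as far from the second centroid as from the first belongs 80% to the first cluster and 20% to the second, rather than being assigned to the first outright.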


Growing Neural Gas

A clustering algorithm masquerading as a neural network

Given a data distribution, dynamically determines nodes or “centroids” to represent the data

“Dynamic” because it adds or deletes nodes as necessary, as well as adapting nodes toward changes in the data.

[Figure: User Profiles and their Representative Nodes]

How it works…

Given some input x:

• Find the closest node, s, and the next closest, t.
• Update the error of s by εw|s – x|.
• Shift s and its neighbors toward x, and increment the age of all those edges.
• If s and t are adjacent, set the age of that edge to 0. Otherwise, create that edge.
• Remove edges that are too old, and decrease the error of all nodes by a small amount.
• Add a node every λ generations, putting it between the node with the largest error and its largest-error neighbor.
• Repeat!
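
The update loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the team's implementation: it follows the commonly published Growing Neural Gas formulation, in which the winner's error grows by the squared distance to x while εw and εn scale how far the winner and its neighbors move. All class, method, and parameter names and the default values are assumptions.

```python
import random

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

class GrowingNeuralGas:
    def __init__(self, dim, lam=100, max_age=50, eps_w=0.05, eps_n=0.006,
                 alpha=0.5, beta=0.995, seed=0):
        rng = random.Random(seed)
        self.nodes = {i: [rng.random() for _ in range(dim)] for i in (0, 1)}
        self.error = {0: 0.0, 1: 0.0}
        self.edges = {}                      # frozenset({i, j}) -> age
        self.lam, self.max_age = lam, max_age
        self.eps_w, self.eps_n = eps_w, eps_n
        self.alpha, self.beta = alpha, beta
        self.next_id, self.steps = 2, 0

    def neighbors(self, i):
        return [j for e in self.edges if i in e for j in e if j != i]

    def _move(self, i, x, eps):
        self.nodes[i] = [w + eps * (xi - w) for w, xi in zip(self.nodes[i], x)]

    def step(self, x):
        self.steps += 1
        # closest node s and next closest node t
        s, t = sorted(self.nodes, key=lambda i: dist2(self.nodes[i], x))[:2]
        self.error[s] += dist2(self.nodes[s], x)
        # shift s and its neighbors toward x, aging the edges that leave s
        self._move(s, x, self.eps_w)
        for n in self.neighbors(s):
            self._move(n, x, self.eps_n)
            self.edges[frozenset((s, n))] += 1
        self.edges[frozenset((s, t))] = 0    # refresh or create the s-t edge
        # drop edges that are too old, then nodes left without any edge
        self.edges = {e: a for e, a in self.edges.items() if a <= self.max_age}
        connected = {i for e in self.edges for i in e}
        for i in [i for i in self.nodes if i not in connected]:
            del self.nodes[i], self.error[i]
        if self.steps % self.lam == 0:
            self._insert()
        for i in self.error:                 # decay every node's error
            self.error[i] *= self.beta

    def _insert(self):
        q = max(self.error, key=self.error.get)             # largest-error node
        f = max(self.neighbors(q), key=self.error.get)      # its largest-error neighbor
        r, self.next_id = self.next_id, self.next_id + 1
        self.nodes[r] = [(a + b) / 2 for a, b in zip(self.nodes[q], self.nodes[f])]
        del self.edges[frozenset((q, f))]
        self.edges[frozenset((q, r))] = 0
        self.edges[frozenset((r, f))] = 0
        self.error[q] *= self.alpha
        self.error[f] *= self.alpha
        self.error[r] = self.error[q]

rng = random.Random(1)
data = [(rng.gauss(c, 0.3), rng.gauss(c, 0.3)) for _ in range(1000) for c in (0.0, 5.0)]
gng = GrowingNeuralGas(dim=2, lam=50)
for x in data:
    gng.step(x)
print(len(gng.nodes), "nodes,", len(gng.edges), "edges")
```

Edges are stored as a dictionary keyed by unordered node pairs, so resetting, aging, and deleting an edge are all single dictionary operations.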

A Few Parameters…

(Making sense of the GUI)

• λ: Controls how frequently new nodes are inserted
• Max Edge Age: Dictates how often old edges are deleted
• εw: Factor to scale the value of the “winning” node
• εn: Factor to scale the value of the next nearest node
• α: Scale factor for decreasing the error of parent nodes
• β: Scale factor for decreasing error of all nodes

… and the difference they make.

λ = 1000:

• Larger λ, nodes inserted less often
• Takes longer, but yields more accurate placement of nodes

λ = 100:

• Smaller λ, nodes inserted more often
• Leaves straggler nodes that don’t accurately match data


Support Vector Machines


Clearly planar


Planar in feature space

Support Vector Regression (Machine?)

Goal: Minimize error between hyperplane and data points.

SVM: Maximize cluster separation

SVR: Minimize plane-to-data distance

Getting the correct page…

What do we want from a technique?

CLASSIFICATION: Input: User data. Output: Page to serve.

REGRESSION: Input: User data and possible page. Output: Predicted success.

Both require multiple SVMs.

Using Classification via SVMs

[Diagram: DATA is fed to several SVM classifiers, which output C, B, and C. Predicted Page: C]
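
One way to combine several classifiers into a single page choice is a majority vote, sketched below in Python. The slide does not say how the individual outputs are combined, so the voting rule is an assumption, and the three lambdas are hypothetical stand-ins for trained SVMs.

```python
from collections import Counter

def predict_page(visitor, classifiers):
    """Ask every classifier for a page and return the most common answer."""
    votes = Counter(clf(visitor) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for three trained SVMs; each just returns a fixed page.
classifiers = [lambda v: "C", lambda v: "B", lambda v: "C"]
print(predict_page({"profile": [1, 0, 1]}, classifiers))  # C
```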

Using Regression via SVRs

[Diagram: DATA is fed to one predictor per page. Page A Predictor: 0.42, Page B Predictor: 0.24, Page C Predictor: 0.78. Predicted Page: C]
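
The regression variant reduces to picking the page whose predictor returns the highest score. The Python sketch below shows that selection step only; the lambdas are hypothetical stand-ins for trained per-page SVR models and simply return the scores shown on the slide.

```python
def best_predicted_page(visitor, predictors):
    """Return the page with the highest predicted success, plus all scores."""
    scores = {page: predict(visitor) for page, predict in predictors.items()}
    return max(scores, key=scores.get), scores

# Hypothetical stand-ins for one trained regressor per page.
predictors = {"A": lambda v: 0.42, "B": lambda v: 0.24, "C": lambda v: 0.78}
page, scores = best_predicted_page({"profile": [1, 0, 1]}, predictors)
print(page)  # C
```

Unlike the classification setup, this also exposes how close the runner-up was, which could be useful when comparing pages with similar predicted success.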


Goal Breakdown


Short-term Plan


Plan for Algorithm Comparison


Schedule and Conclusion

Friday November 14: Prototype algorithm comparison method

Friday November 21: Initial testing on real data; meeting with Magnify360

Friday December 5: Initial composition of classification algorithms

Friday December 12: Midyear Report

Questions?


SVM vs SVR

SVM: Maximize Distance

SVR: Minimize Distance


Data

The Bad, or, The Challenges:

Lots of SQL data


Some Data Tables

80 tables total…


Data Size

Problem Statement

Officially: Develop an innovative predictive modeling system to predict shopping cart abandonment based on profiles, clusters, shopping cart contents

Most importantly: GRAB from email ! Research and implement various AI techniques to optimize the process of matching users with websites

Individualized Online Experiences


Classifying Users

Unsupervised clustering: points are clustered without knowledge of the results

Supervised clustering: clusters are built using prior knowledge of the results

Ethical concerns?


Recap: What Magnify360 Does

Individualize a website for different types of users

Collect data on users from their clickstream, and give each one the site that will appeal to them most

Appeal to a larger base of users by making the site more interesting to a larger group

serving both!

old Facebook