31
Identifying Returning Users

Identifying Returning Users with Spark

Embed Size (px)

Citation preview

Page 1: Identifying Returning Users with Spark

Identifying Returning Users

Page 2: Identifying Returning Users with Spark

Simply Business

•  Largest UK business insurer provider

•  More than 370,000 policy holders

•  Using BML, tech and data to disrupt

the business insurance market

Page 3: Identifying Returning Users with Spark

Data ’n’ Analytics

•  4 Data Engineers

•  3 Business Intelligence Developers

•  1 Data Scientist

•  1 Director of Data Science

•  And hiring! :-)

Page 4: Identifying Returning Users with Spark

Our Infrastructure

Event  Enrichment  (Spark  Streaming)  

Event  Collector  (REST  Endpoint)   Kinesis  

MongoDB  

Kinesis  

Redshi>  

Downstream  NRT  (Spark  Streaming)  

Page 5: Identifying Returning Users with Spark

Returning Visitors

Page 6: Identifying Returning Users with Spark

The Challenge

Accurately identify users over time as they interact with us

Page 7: Identifying Returning Users with Spark

Why? Knowledge

To understand how our visitors behave: •  They have long buying cycles •  They use multiple devices •  They engage with us through multiple channels

Page 8: Identifying Returning Users with Spark

Why? Cost Analysis

To know how much it costs to acquire new customers: •  They visit us from different marketing channels •  Each of them have different costs •  We remunerate our partners depending on that

Page 9: Identifying Returning Users with Spark

Why? Personalization

To adapt their visit while they interact with us: •  We can trigger offers depending on the channel they come from •  We can prefill fields if they’re recognized •  We can change the look and feel of the website

Page 10: Identifying Returning Users with Spark

Cookies? Their work great when you want to identify people coming back to your website.

Challenges: •  Different devices •  Cleared cookies •  Private sessions •  What about telephone calls…?

Page 11: Identifying Returning Users with Spark

Use More IDs! We can use other IDs provided by visitors to link sessions: •  Login details •  Email addresses •  Telephone numbers •  Credit card numbers •  Fingerprints •  …

Page 12: Identifying Returning Users with Spark

Streaming Solution

Page 13: Identifying Returning Users with Spark

How many visitors do we have here?

Example

Timestamp   Event   Cookie  ID   Email  

1   page_view   111  

2   details_submiFed   111   [email protected]  

3   page_view   555  

4   page_view   888  

5   policy_sold   555   [email protected]  

6   details_submiFed   888   [email protected]  

Page 14: Identifying Returning Users with Spark

How many visitors do we have here? Only 2

Example

Timestamp   Event   Cookie  ID   Email  

1   page_view   111  

2   details_submiFed   111   [email protected]  

3   page_view   555  

4   page_view   888  

5   policy_sold   555   [email protected]  

6   details_submiFed   888   [email protected]  

Page 15: Identifying Returning Users with Spark

Example: Streaming Timestamp   Event   Cookie  ID   Email  

1   page_view   111  

2   details_submiFed   111   [email protected]  

3   page_view   555  

4   page_view   888  

5   policy_sold   555   [email protected]  

6   details_submiFed   888   [email protected]  

Visitor  A:  111  

Page 16: Identifying Returning Users with Spark

Example: Streaming Timestamp   Event   Cookie  ID   Email  

1   page_view   111  

2   details_submiFed   111   [email protected]  

3   page_view   555  

4   page_view   888  

5   policy_sold   555   [email protected]  

6   details_submiFed   888   [email protected]  

Visitor  A:  111,  [email protected]  

Page 17: Identifying Returning Users with Spark

Example: Streaming Timestamp   Event   Cookie  ID   Email  

1   page_view   111  

2   details_submiFed   111   [email protected]  

3   page_view   555  

4   page_view   888  

5   policy_sold   555   [email protected]  

6   details_submiFed   888   [email protected]  

Visitor  A:  111,  [email protected]  Visitor  B:  555    

Page 18: Identifying Returning Users with Spark

Example: Streaming Timestamp   Event   Cookie  ID   Email  

1   page_view   111  

2   details_submiFed   111   [email protected]  

3   page_view   555  

4   page_view   888  

5   policy_sold   555   [email protected]  

6   details_submiFed   888   [email protected]  

Visitor  A:  111,  [email protected]  Visitor  B:  555  Visitor  C:  888    

Page 19: Identifying Returning Users with Spark

Example: Streaming Timestamp   Event   Cookie  ID   Email  

1   page_view   111  

2   details_submiFed   111   [email protected]  

3   page_view   555  

4   page_view   888  

5   policy_sold   555   [email protected]  

6   details_submiFed   888   [email protected]  

Visitor  A:  111,  555,  [email protected]  Visitor  C:  888    

Visitors  A  and  B  were  merged!  

Page 20: Identifying Returning Users with Spark

Example: Streaming Timestamp   Event   Cookie  ID   Email  

1   page_view   111  

2   details_submiFed   111   [email protected]  

3   page_view   555  

4   page_view   888  

5   policy_sold   555   [email protected]  

6   details_submiFed   888   [email protected]  

Visitor  A:  111,  555,  [email protected]  Visitor  C:  888,  [email protected]    

Page 21: Identifying Returning Users with Spark

Streaming Algorithm for event in incoming_events: prev_visitors = [get_visitor(id) for id in ids(event)] if len(prev_visitors) == 0: # First time that we see that visitor create_visitor(event) elif len(prev_visitors) == 1: # Visitor coming back using a known ID update_visitor(prev_visitors[0], event) else: # Event IDs were thought to belong to different visitors

merge_and_update_visitor(prev_visitors, event)  

Store mapping: event_ids [1..N] ---> [1] visitor

Page 22: Identifying Returning Users with Spark

Batch Solution

Page 23: Identifying Returning Users with Spark

Batch

We also need to solve this problem in batch mode because every now and then we reprocess all our data.

We cannot do it with Spark Streaming because:

•  Kinesis only stores up to 7 days of data

•  It would take too long

•  We risk overwhelming our MongoDB

Page 24: Identifying Returning Users with Spark

Example: Batch Timestamp   Event   Cookie  ID   Email  

1   page_view   111  

2   details_submiFed   111   [email protected]  

3   page_view   555  

4   page_view   888  

5   policy_sold   555   [email protected]  

6   details_submiFed   888   [email protected]  

Page 25: Identifying Returning Users with Spark

Batch Algorithm

•  Find the connected components in the graph

•  Already implemented in GraphX! •  Vertices: visitor IDs

•  Edges: pairs of IDs occurring in the same event

Page 26: Identifying Returning Users with Spark

Example: Batch Timestamp   Event   Cookie  ID   Email  

1   page_view   111  

2   details_submiFed   111   [email protected]  

3   page_view   555  

4   page_view   888  

5   policy_sold   555   [email protected]  

6   details_submiFed   888   [email protected]  

111   [email protected]  

555  

888   [email protected]  

Page 27: Identifying Returning Users with Spark

GraphX: Issues

•  There is no Python API for GraphX, and Graphframes performance is worse than GraphX’s

•  GraphX requires vertices to have numeric IDs

•  You’ll have to create a mapping between the real IDs and the

numeric IDs

•  Perform the connected components calculation

•  Map back the numeric IDs to the real IDs

•  Documentation is not as good as other Spark projects

Page 28: Identifying Returning Users with Spark

Sum Up

Page 29: Identifying Returning Users with Spark

NRT:  Spark  Streaming   Batch:  GraphX  Latency   Seconds   Minutes/Hours  Throughput   Low   High  Good  for   Online   Analysis  Development   Your  own   Algorithm  available  

Sum Up

Page 30: Identifying Returning Users with Spark

Beware That two events share an ID does not necessarily mean that were generated by the same person. Watch out for:

•  Clashing IDs

•  Fake IDs: [email protected], 07123456789, etc.

•  Shared devices: couples, public computers, etc.

•  People using your system on behalf of someone else

•  Bugs

Analyze your data first!

Page 31: Identifying Returning Users with Spark

Questions?

@dani_sola [email protected]