JOE HSY OPE BANWO...Lookalike Modeling JOE HSY @LIVERAMP Head of Engineering - Data Science and Innovation LiveRamp OPE BANWO @LIVERAMP Software Engineer, Data Science LiveRamp LiveRamp

#RampUp20

RAMPUP FOR DEVELOPERS

The Systems Behind Lookalike Modeling

JOE HSY@LIVERAMP

Head of Engineering - Data Science and

Innovation

LiveRamp

OPE BANWO@LIVERAMP

Software Engineer, Data Science

LiveRamp

LiveRamp Lookalike Modeling API (LLAMA)Joe Hsy, Opeoluwa Banwo

What is Lookalike Modeling?

Display

Social

Programmitic

Mobile

Video/TV

Start with a known seed audience segment1

Expand the segment with Look-Alike audiences2

Activate to multiple destinations3

Goals of LLAMA

Self-service controls facilitate optimization of modeled audiences for Reach or Similarity.

Adjustable

Seed and modeled audiences are matched deterministically to IdentityLinks, to provide more precise targeting.

Accurate

With IdentityLink, both online and offline data can be ingested and modeled out to different channeltypes for activation.

Flexible

Use LiveRamp’s licensed reference data set from the LiveRamp Data Store, or bring your own data (BYOD).

Customizable

LiveRamp’s Approach to Building Data Science Products

Early Days of Data Science

Data Science Application Developers

LiveRamp’s Approach

Data Science+ Software Engineering

Application DevelopersRest API

Four Pillars of Data Science Productization

Create a plug-and-play ecosystem with reliable and secure APIs.

APIs, APIs, APIs

Automate the data engineering layer with scalable data pipelines of data onboarding and model building.

Automate the Data Engineering Process

Design the architecture to accommodates a wide range of machine learning model classes so we can continually evaluate new approaches

Plug-n-play Data Science Models

Enforce a software development life cycle of continuous improvement through reliable code and a robust CI/CD process

Built-in Reliability Engineering

LLAMA Data Science

Challenges of Merging Reference Datasets

Dataset A Overlapping columnsmust standardize data types and values

Must impute missing values here

Overlapping recordscan use these for missingdata imputation models

Must impute missing values here

Dataset B

Must resolve valueconflicts here

Model Building Approach

We currently use

ridge regression with cross

validation for hyperparameter

tuning

• This enables variable selection as we often handle 1000+ variables.

• We also sample the data as ridge regression has a fairly limited capacity - we found 30k positive examples to be the best number.

We continuously evaluate

other approaches:

• Random forests (too slow)

• Boosted trees (overfit easily)

• Neural networks (didn’t offer meaningful improvements over ridge)

We aim for the simplest

model class that performs

well

LLAMA Architecture

LLAMA System Workflow

Advice Sheet

Seeded Dataset

Sample Transformation Pipeline

Sample for Train/Test

FeatureEngineering

Model Training

Predict Pipeline

Transformed Training / Test Set

Model Artefacts

Scores

Percentiles

Raw Dataset 2 Advice Sheet

Advice SheetProfile

Pipeline

Onboarding Pipeline

Prepare Connected Components

Resolve

Onboarded Data

Deduplication By-Pass

Raw Dataset 1

Profile Pipeline

Profile Pipeline

Transformation Artefacts

(optional other data sources)

Automatically determine unique, continuous,

categorical, dummy, etc.

Automatically detect data types including generic data types as

well as IDL-s, etc.

Automatically detect one-hot-encoded

features and sparse features

Generate automated, editable transform

policies for onboarding

ProfilingLLAMA uses a robust profiler that provides full insight into the features, content, and quality of a datasets

Report distributions and levels depending on the type of feature

Onboarding

Process a dataset based on editable policy produced at the profiling stage

Onboarding can be based on a single or multiple profiled datasets

This crucial capability enables Llama to compile multiple spines into the most rich-featured reference datasets in the market

• Reverse one-hot-encode features that were one-hot-encoded

• Drop sparse or irrelevant features

• Resolve many-to-many relationships between row ID-s and IDL-s

Sample & Transform

Computes Training & Test

sets based on client’s seed dataset

Transforms Training & Test

sets based on custom defined transform logic

Persists transformed Training & Test

sets andtransform logic

Predict

Computes the predicted probability for instances

of the onboarded dataset

Computes percentile limits based on a sample of the predicted scores

Co-locates instances belonging to the same percentile

Technology Stack

Cloud dataflow • Highly scalable mapreduce solutions with near zero infrastructure setup

Tensorflow [transform]• Ecosystem of tools to build and train ml models efficiently

• Seamless integration with dataflow for doing feature engineering

Cloud Composer• Management and monitoring of end to end workflow

BigQuery• Efficient and cheap solution for doing CCPA/GDPR compliance operations

Example Use Case

Create a new campaign.

Request Lookalike

Choose Reach vs Accuracy Using Slider

Activate Lookalike Audiences

Future Directions

External API

• Enable partners and customers to programmatically iterate quickly and create many audiences at different cuts of reach/similarity

• Looking for beta customers!

Enhance Audience Expansion Flexibility

• Current lookalike support binary classes only - Audience members are ranked on their similarity to the seed

• Multiclass classifiers allows for audience expansion across different attributes, such as low, medium, high loyalty

2020 LiveRamp. All rights reserved.

Questions?

Documents

JOE HSY OPE BANWO...Lookalike Modeling JOE HSY @LIVERAMP Head of Engineering - Data Science and Innovation LiveRamp OPE BANWO @LIVERAMP Software Engineer, Data Science LiveRamp LiveRamp