Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
#RampUp20
RAMPUP FOR DEVELOPERS
The Systems Behind Lookalike Modeling
JOE HSY@LIVERAMP
Head of Engineering - Data Science and
Innovation
LiveRamp
OPE BANWO@LIVERAMP
Software Engineer, Data Science
LiveRamp
LiveRamp Lookalike Modeling API (LLAMA)Joe Hsy, Opeoluwa Banwo
What is Lookalike Modeling?
Display
Social
Programmitic
Mobile
Video/TV
Start with a known seed audience segment1
Expand the segment with Look-Alike audiences2
Activate to multiple destinations3
Goals of LLAMA
Self-service controls facilitate optimization of modeled audiences for Reach or Similarity.
Adjustable
Seed and modeled audiences are matched deterministically to IdentityLinks, to provide more precise targeting.
Accurate
With IdentityLink, both online and offline data can be ingested and modeled out to different channeltypes for activation.
Flexible
Use LiveRamp’s licensed reference data set from the LiveRamp Data Store, or bring your own data (BYOD).
Customizable
LiveRamp’s Approach to Building Data Science Products
Early Days of Data Science
Data Science Application Developers
LiveRamp’s Approach
Data Science+ Software Engineering
Application DevelopersRest API
Four Pillars of Data Science Productization
Create a plug-and-play ecosystem with reliable and secure APIs.
APIs, APIs, APIs
Automate the data engineering layer with scalable data pipelines of data onboarding and model building.
Automate the Data Engineering Process
Design the architecture to accommodates a wide range of machine learning model classes so we can continually evaluate new approaches
Plug-n-play Data Science Models
Enforce a software development life cycle of continuous improvement through reliable code and a robust CI/CD process
Built-in Reliability Engineering
LLAMA Data Science
Challenges of Merging Reference Datasets
Dataset A Overlapping columnsmust standardize data types and values
Must impute missing values here
Overlapping recordscan use these for missingdata imputation models
Must impute missing values here
Dataset B
Must resolve valueconflicts here
Model Building Approach
We currently use
ridge regression with cross
validation for hyperparameter
tuning
• This enables variable selection as we often handle 1000+ variables.
• We also sample the data as ridge regression has a fairly limited capacity - we found 30k positive examples to be the best number.
We continuously evaluate
other approaches:
• Random forests (too slow)
• Boosted trees (overfit easily)
• Neural networks (didn’t offer meaningful improvements over ridge)
We aim for the simplest
model class that performs
well
LLAMA Architecture
LLAMA System Workflow
Advice Sheet
Seeded Dataset
Sample Transformation Pipeline
Sample for Train/Test
FeatureEngineering
Model Training
Predict Pipeline
Transformed Training / Test Set
Model Artefacts
Scores
Percentiles
Raw Dataset 2 Advice Sheet
Advice SheetProfile
Pipeline
Onboarding Pipeline
Prepare Connected Components
Resolve
Onboarded Data
Deduplication By-Pass
Raw Dataset 1
Profile Pipeline
Profile Pipeline
Transformation Artefacts
(optional other data sources)
Automatically determine unique, continuous,
categorical, dummy, etc.
Automatically detect data types including generic data types as
well as IDL-s, etc.
Automatically detect one-hot-encoded
features and sparse features
Generate automated, editable transform
policies for onboarding
ProfilingLLAMA uses a robust profiler that provides full insight into the features, content, and quality of a datasets
Report distributions and levels depending on the type of feature
Onboarding
Process a dataset based on editable policy produced at the profiling stage
Onboarding can be based on a single or multiple profiled datasets
This crucial capability enables Llama to compile multiple spines into the most rich-featured reference datasets in the market
• Reverse one-hot-encode features that were one-hot-encoded
• Drop sparse or irrelevant features
• Resolve many-to-many relationships between row ID-s and IDL-s
Sample & Transform
Computes Training & Test
sets based on client’s seed dataset
Transforms Training & Test
sets based on custom defined transform logic
Persists transformed Training & Test
sets andtransform logic
Predict
Computes the predicted probability for instances
of the onboarded dataset
Computes percentile limits based on a sample of the predicted scores
Co-locates instances belonging to the same percentile
Technology Stack
Cloud dataflow • Highly scalable mapreduce solutions with near zero infrastructure setup
Tensorflow [transform]• Ecosystem of tools to build and train ml models efficiently
• Seamless integration with dataflow for doing feature engineering
Cloud Composer• Management and monitoring of end to end workflow
BigQuery• Efficient and cheap solution for doing CCPA/GDPR compliance operations
Example Use Case
Create a new campaign.
Request Lookalike
Choose Reach vs Accuracy Using Slider
Activate Lookalike Audiences
Future Directions
External API
• Enable partners and customers to programmatically iterate quickly and create many audiences at different cuts of reach/similarity
• Looking for beta customers!
Enhance Audience Expansion Flexibility
• Current lookalike support binary classes only - Audience members are ranked on their similarity to the seed
• Multiclass classifiers allows for audience expansion across different attributes, such as low, medium, high loyalty
2020 LiveRamp. All rights reserved.
Questions?