
Big Data and Machine Learning: These Lessons were Written in Clicks

Gil Chamiel, Director of Data Science and Algorithms Engineering

You’ve Seen Us Before

Enabling people to discover information at that moment when they’re likely to engage

• 750M monthly unique users
• 500K+ requests/sec
• 15B+ recommendations/day
• 17TB+ of data daily

Reach by property (US desktop users reached, 12/2015):
• Google Ad Network: 95.5%
• Taboola: 87.8%
• Google Sites: 86.2%
• Facebook: 61.5%
• Yahoo Sites: 60.3%
• Outbrain: 56.6%

Traffic split: 52% mobile, 48% desktop.

Taboola in Numbers

A typical US user sees a Taboola widget at least twice a day

Taboola’s Discovery Platform

• Traffic Acquisition
• Business Dev.: Sponsored Content
• Editorial: Newsroom
• Sales: Native Ads
• Audience Dev.
• Product: Personalization
• Data & Insights

• Context: metadata
• Region-based location information
• User behavior data
• User consumption groups
• Social: Facebook / Twitter API

The Recommendation Engine


Tools We Recommend

The Taboola Data Culture

A one-stop shop for all data needs, supporting our constant offensive battle.

One sea of data feeding:
• Data for Machine Learning
• User Behavior Analysis
• System Behavior Analysis
• Business Analysis
• Data-Driven Ops

Machine Learning: The Basics


Predict User Engagement with Recommended Content

Offline and online model families (a linear-model sketch follows):
• Bayesian Inference
• Linear Models
• Gradient Boosted Trees
• Factorization Machines
• Deep Neural Networks
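To make the linear end of this spectrum concrete, here is a minimal sketch of engagement (click) prediction with hashed categorical features and logistic regression. The feature names and sample events are invented for illustration, not Taboola's actual features.

```python
# A minimal sketch of click prediction with a linear model.
# Feature names and sample events are invented for illustration.
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

# Each impression is a bag of categorical features plus a click label.
events = [
    ({"user_region": "US", "item_category": "sports", "platform": "mobile"}, 1),
    ({"user_region": "US", "item_category": "finance", "platform": "desktop"}, 0),
    ({"user_region": "UK", "item_category": "sports", "platform": "desktop"}, 1),
    ({"user_region": "UK", "item_category": "finance", "platform": "mobile"}, 0),
]

# Hash sparse categorical features into a fixed-width vector.
hasher = FeatureHasher(n_features=2**20, input_type="dict")
X = hasher.transform([feats for feats, _ in events])
y = [label for _, label in events]

model = LogisticRegression().fit(X, y)          # offline training
p_click = model.predict_proba(X[:1])[0, 1]      # online scoring
print(f"predicted engagement probability: {p_click:.3f}")
```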

Machine Learning: Circular Data Pipeline


• A “regular” program: Input + Program → Output
• Machine learning: Input + Output → Train → Model; then Input + Model → Predict → Output

Offline vs. Online


• Efficient research can only be done offline
• Real effect can only be validated online (and we A/B test like crazy)
• Flexibility and ease of use => fast validation of new ideas

"Deep Neural Networks for YouTube Recommendations", RecSys ’16

"Wide & Deep Learning for Recommender Systems". CoRR abs/1606.07792 (2016)

     

11

Maintaining Data for Online Predictions

• Cookies (sketched below)
– Easy and super distributed
– Difficult to maintain (sustainability)
– Updates are online only (and bootstrapping is hard)
– Cannot be reached offline
– Limited storage
– Increase network latency and costs
– Not so great with out-of-order events
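To make the storage and update constraints concrete, here is a minimal sketch of keeping a per-user view counter in a cookie. The cookie name, expiry, and layout are assumptions for illustration, not Taboola's actual format.

```python
# A minimal sketch of client-side user data kept in a cookie.
# The cookie name and expiry are illustrative assumptions.
from http.cookies import SimpleCookie

def bump_view_counter(cookie_header: str) -> str:
    """Read a per-user view counter from the request cookie, bump it, re-emit it."""
    cookie = SimpleCookie(cookie_header)
    views = int(cookie["t_views"].value) if "t_views" in cookie else 0
    out = SimpleCookie()
    out["t_views"] = str(views + 1)              # update happens online only
    out["t_views"]["max-age"] = 30 * 24 * 3600   # re-set expiry on every request
    return out["t_views"].OutputString()         # value for the Set-Cookie header

print(bump_view_counter("t_views=6"))  # -> t_views=7; Max-Age=2592000
```

Note how everything lives in the request/response path: there is no offline access to the data, and the whole state must fit in a ~4 KB cookie.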


• Server-Side Data Counters
– Require high-performance NoSQL database technology (Cassandra, HBase, ScyllaDB, etc.; see the sketch below)
– Easy to bootstrap with data calculated offline or uploaded from other sources
– Less limited on storage (up to $$$)
– Easy to read online (usually not a lot of data)
– Read-before-write (counter implementations are dodgy)
– Fixed set of counters and aggregations (early commitment)
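For concreteness, a minimal sketch of such counters on Cassandra using the cassandra-driver package; the keyspace, table, and column names are invented for the example.

```python
# A minimal sketch of server-side counters in Cassandra (cassandra-driver).
# Keyspace, table, and column names are invented for illustration.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS reco WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.execute("""
    CREATE TABLE IF NOT EXISTS reco.item_counters (
        item_id text PRIMARY KEY,
        impressions counter,
        clicks counter
    )
""")

# Counter updates look like pure increments from the client side, but the
# database reads before writing internally -- hence "dodgy".
session.execute(
    "UPDATE reco.item_counters SET impressions = impressions + 1 "
    "WHERE item_id = %s", ("article-42",)
)

row = session.execute(
    "SELECT impressions, clicks FROM reco.item_counters WHERE item_id = %s",
    ("article-42",),
).one()
```

The early commitment is visible in the schema: each new aggregation means a new counter column, decided before the data arrives.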

• Saving Individual Events
– Let the “future you” decide how to aggregate
– De-normalize to your liking (trade-off between computation time and latency/storage)
– No read-before-write (and non-blocking)
– Reads are extremely expensive

• Time Series Data Modeling
– Control over read latency
– Useful for time-dependent modeling (e.g. decay counters; a sketch follows)
– May still be a challenge (mastering DB internals is a must)
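A decay counter lets old events fade out smoothly instead of being dropped at a window boundary. A minimal sketch, assuming exponential decay with a configurable half-life (the class and field names are illustrative):

```python
# A minimal sketch of an exponentially decayed counter.
# Half-life and field names are illustrative assumptions.
import math
import time

class DecayCounter:
    """Counter whose value halves every `half_life_s` seconds of inactivity."""

    def __init__(self, half_life_s: float = 7 * 24 * 3600):
        self.rate = math.log(2) / half_life_s
        self.value = 0.0
        self.last_ts = time.time()

    def _decay_to(self, ts: float) -> None:
        # Apply the decay accumulated since the last touch.
        self.value *= math.exp(-self.rate * (ts - self.last_ts))
        self.last_ts = ts

    def add(self, ts: float, weight: float = 1.0) -> None:
        self._decay_to(ts)
        self.value += weight

    def read(self, ts: float) -> float:
        self._decay_to(ts)
        return self.value

clicks = DecayCounter(half_life_s=3600)
clicks.add(time.time())
print(clicks.read(time.time() + 1800))  # ~0.71 after half of a half-life
```

Persisting this in a database means storing only the (value, last_ts) pair per key, which is where the read-latency control and the DB-internals challenges mentioned above come in.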


Is this enough for offline analysis and research?


Offline: Data for Machine Learning Pipelining and Research

Data for ML Pipelining and Research: The Challenge

• Objective: a complete picture of the user and context on every impression!
• Challenges:
– Events occur at different times
– Historic user data must be true to the time of the impression (a point-in-time join; sketched below)
– Fast querying by hundreds of analysts and engineers
– Machine learning programs like their data flat
• What is the real issue?
– Joins between various events to form a logical entity (user, session, page view)
– Joins between historic user data and current impression data
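Keeping history “true to the time of impression” is an as-of join: each impression may only see user data recorded at or before its own timestamp. A hedged sketch with pandas.merge_asof, on invented column names:

```python
# A minimal sketch of a point-in-time join: each impression is matched with
# the latest user snapshot at or before its timestamp. Column names invented.
import pandas as pd

impressions = pd.DataFrame({
    "ts": pd.to_datetime(["2016-01-01 10:00", "2016-01-01 12:00"]),
    "user_id": ["u1", "u1"],
    "item_id": ["a", "b"],
})
user_history = pd.DataFrame({
    "ts": pd.to_datetime(["2016-01-01 09:00", "2016-01-01 11:00"]),
    "user_id": ["u1", "u1"],
    "clicks_so_far": [3, 4],
})

# merge_asof requires sorting on the join key; "by" restricts to the same user.
training_rows = pd.merge_asof(
    impressions.sort_values("ts"),
    user_history.sort_values("ts"),
    on="ts", by="user_id",
)
print(training_rows)  # the 10:00 impression sees clicks_so_far=3, not 4
```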


Maintain a Dedicated Data Store

How did we go about solving these challenges?

• Starting point: pre-aggregated counters over raw data
– Every query requires a rerun (parsing and joins over the raw data)
– Many additional disadvantages

• When in trouble: de-normalize!
– Use an efficient and extensible serialization schema (e.g. Protobuf)
– De-normalize until you run out of space (or money)
– Useful for pipelining historic user data

• Join multiple events at write time (short term)
– Maintain a mutual key (user id, session id, page view id)
– Use a strong and scalable key-value database (e.g. C*)

• Use columnar storage (long term)
– Drives machine learning and research
– Many tools out there (Parquet, BigQuery, etc.)
– Use a scalable and rich query mechanism (Spark SQL, BigQuery, Impala, etc.)
– Machine learning programs like flat data (easy with FLATTEN, explode, user-defined functions, etc.; see the sketch below)
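To illustrate the flattening step, a hedged PySpark sketch that explodes nested page views into one row per recommendation; the Parquet paths and field names are invented for the example.

```python
# A minimal sketch of flattening nested columnar data with Spark SQL's explode.
# Parquet paths and field names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten-for-ml").getOrCreate()

# One row per page view, each carrying an array of recommendation structs.
page_views = spark.read.parquet("/data/page_views")

# Explode to one flat row per (page view, recommendation) -- the shape most
# ML training programs want.
flat = (
    page_views
    .select("user_id", "view_ts", explode("recommendations").alias("rec"))
    .select("user_id", "view_ts", col("rec.item_id"), col("rec.was_clicked"))
)
flat.write.parquet("/data/flat_training_rows")
```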

Users / Sessions / Views / Clicks / History / Post-click events

Because We Recommend…

• Data is king!
• Online and offline pose different challenges -> different solutions
• Storage is cheap: rewrite your data for convenience
• Still worried about storage? You don't have to keep everything for every user:
– Sub-sampling is a requirement when learning models
– Be extremely verbose for small parts of the data
– For fast research: save it again for a sample of the users, views, etc. (see the sketch below)
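One simple way to keep a consistent sample is a deterministic hash of the user id, so the same users stay in the verbose slice across all pipelines. A minimal sketch; the 1% rate and the user_id field are assumptions:

```python
# A minimal sketch of consistent user sub-sampling via hashing.
# The 1% rate and the user_id field are illustrative assumptions.
import hashlib

def in_verbose_sample(user_id: str, rate_pct: int = 1) -> bool:
    """Deterministically keep ~rate_pct% of users, the same ones every run."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rate_pct

events = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u3"}]
verbose_log = [e for e in events if in_verbose_sample(e["user_id"])]
```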


Thank You! Questions?