Big Data and Machine Learning: These Lessons Were Written in Clicks
Gil Chamiel, Director of Data Science and Algorithms Engineering
You’ve Seen Us Before
Enabling people to discover information at that moment when they’re likely to engage
750M monthly unique users
500K+ requests/sec
15B+ recommendations/day
17TB+ daily data
Reach by property (US desktop users reached, 12/2015):
• Google Ad Network: 95.5%
• Taboola: 87.8%
• Google Sites: 86.2%
• Facebook: 61.5%
• Yahoo Sites: 60.3%
• Outbrain: 56.6%

52% mobile traffic / 48% desktop traffic
Taboola in Numbers
A typical US user sees a Taboola widget at least twice a day
Taboola’s Discovery Platform
Traffic Acquisition
• Business Dev.: Sponsored Content
• Editorial: Newsroom
• Sales: Native Ads
• Audience Dev.
• Product: Personalization

Data & Insights
• Context: Metadata
• Region-based Location Information
• User Behavior Data
• User Consumption Groups
• Social: Facebook / Twitter API
The Recommendation Engine
The Taboola Data Culture
One-stop shop for all data needs, supporting our constant offensive battle. A single sea of data feeds:
• Data for Machine Learning
• User Behavior Analysis
• System Behavior Analysis
• Business Analysis
• Data-Driven Ops
Machine Learning: The Basics
Goal: predict user engagement with recommended content, both offline and online.

Model families:
• Bayesian Inference
• Linear Models
• Gradient Boosted Trees
• Factorization Machines
• Deep Neural Networks
Machine Learning: Circular Data Pipeline
A "regular" program: Input -> Program -> Output.
Machine learning: Input -> Train -> Model, then Input + Model -> Predict -> Output, and the outputs feed back as new training input (hence "circular").
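The circular train/predict loop above can be sketched in a few lines. This is a toy illustration, not Taboola's actual engine: the "model" is just a Laplace-smoothed per-item click rate, and all names and the event log are made up for the example.

```python
# Toy sketch of the Input -> Train -> Model -> Predict loop, using a
# Laplace-smoothed per-item CTR estimate as a stand-in for a real model.

def train(events):
    """Input -> Train -> Model: aggregate logged (item, clicked) events."""
    stats = {}
    for item, clicked in events:
        views, clicks = stats.get(item, (0, 0))
        stats[item] = (views + 1, clicks + int(clicked))
    return stats  # the "model" is just view/click counts per item

def predict(model, item):
    """Input + Model -> Predict -> Output: estimated click probability."""
    views, clicks = model.get(item, (0, 0))
    return (clicks + 1) / (views + 2)  # Laplace smoothing handles unseen items

log = [("a", True), ("a", False), ("a", False), ("b", True)]
model = train(log)
print(predict(model, "a"))    # (1+1)/(3+2) = 0.4
print(predict(model, "new"))  # unseen item falls back to the 0.5 prior
```

New predictions produce new impressions and clicks, which land back in the log for the next training run: that feedback is what makes the pipeline circular.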
Offline vs. Online
• Efficient research can only be done offline
• Real effect can only be validated online (and we A/B test like crazy)
• Flexibility and ease of use => fast validation of new ideas
"Deep Neural Networks for YouTube Recommendations", RecSys ’16
"Wide & Deep Learning for Recommender Systems". CoRR abs/1606.07792 (2016)
Maintaining Data for Online Predictions
• Cookies
  – Easy and super distributed
  – Difficult to maintain (sustainability)
  – Updates are online only (and bootstrapping is hard)
  – Cannot be reached offline
  – Limited storage
  – Increases network latency and costs
  – Not so great with out-of-order events
• Server-Side Data Counters
  – Requires high-performance NoSQL database technology (Cassandra, HBase, ScyllaDB, etc.)
  – Easy to bootstrap with data calculated offline, or to upload data from other sources
  – Less limited on storage (up to $$$)
  – Easy reads online (usually not a lot of data)
  – Read-before-write (counter implementations are dodgy)
  – Fixed set of counters and aggregations (early commitment)
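The read-before-write caveat is worth a sketch. The in-memory store below is a toy stand-in for a NoSQL counter table; the point is that a server-side atomic increment needs no client read, which is what counter columns in stores like Cassandra aim to provide.

```python
# Toy counter store: increment() mutates the value atomically on the
# "server" side, so clients never read-then-write (a pattern that loses
# updates under concurrency). The lock stands in for the database's own
# atomicity; key names are illustrative.
import threading

class CounterStore:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def increment(self, key, delta=1):
        """Atomic server-side increment: no client read required."""
        with self._lock:
            self._data[key] = self._data.get(key, 0) + delta

    def get(self, key):
        return self._data.get(key, 0)

store = CounterStore()
for _ in range(1000):
    store.increment("item-1:clicks")
print(store.get("item-1:clicks"))  # 1000, no increments lost
```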
• Saving Individual Events
  – Let the "future you" decide how to aggregate
  – De-normalize to your liking (trade-off between computation time and latency/storage)
  – No read-before-write (and non-blocking)
  – Reads are extremely expensive
• Time-Series Data Modeling
  – Control over read latency
  – Useful for time-dependent modeling (e.g. decay counters)
  – May still be a challenge (mastering DB internals is a must)
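The decay counters mentioned above can be implemented without storing one row per time bucket: keep a single (value, last_update) pair and decay it lazily on access. A minimal sketch, where the half-life constant and data shapes are illustrative assumptions:

```python
# Exponentially time-decayed counter: recent events count fully, older
# events fade with a configurable half-life. State is just (value, ts).

HALF_LIFE = 3600.0  # seconds; illustrative choice

def decayed(value, last_ts, now):
    """Decay `value` from last_ts to now with the configured half-life."""
    return value * 0.5 ** ((now - last_ts) / HALF_LIFE)

def add(counter, amount, now):
    """counter is a (value, last_ts) pair; returns the updated pair."""
    value, last_ts = counter
    return (decayed(value, last_ts, now) + amount, now)

c = (0.0, 0.0)
c = add(c, 1.0, 0.0)     # one event at t=0
c = add(c, 1.0, 3600.0)  # one event exactly one half-life later
print(c[0])              # 0.5 (decayed first event) + 1.0 = 1.5
```

Because the decay is applied lazily at read/write time, this fits a single key-value row per counter, at the cost of a read-modify-write cycle, which is exactly the trade-off the bullets above describe.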
Is this enough for offline analysis and research?
Data for ML Pipelining and Research: The Challenge
• Objective: a complete picture of the user and context on every impression!
• Challenges:
  – Events occur at different times
  – Historic user data must be true to the time of impression
  – Fast querying by hundreds of analysts and engineers
  – Machine learning programs like their data flat
• What is the real issue?
  – Joins between various events to form a logical entity (user, session, page view)
  – Joins between historic user data and current impression data
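"True to the time of impression" is the subtle one: when joining historic user data onto an impression for training, only events that happened before that impression may be used, or the features leak the future. A minimal sketch (data shapes are illustrative):

```python
# Point-in-time filtering: reconstruct the user's history as it existed
# at the moment of the impression, never after it.

def user_history_at(events, impression_ts):
    """Return the user's events as they existed at impression time."""
    return [e for e in events if e["ts"] < impression_ts]

events = [
    {"ts": 10, "action": "view"},
    {"ts": 20, "action": "click"},
    {"ts": 30, "action": "view"},
]
# For an impression at ts=25, the later view must be excluded:
print(user_history_at(events, 25))  # only the ts=10 and ts=20 events
```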
Maintain a Dedicated Data Store
How we went about solving these challenges:
• Starting point: pre-aggregated counters over raw data
  – Every new query requires a rerun (parsing and joins over the raw data)
  – Many additional disadvantages
• When in trouble: de-normalize!
  – Use an efficient and extensible serialization schema (e.g. Protobuf)
  – De-normalize until you run out of space (or money)
  – Useful for pipelining historic user data
• Join multiple events at write time (short term)
  – Maintain a mutual key (user id, session id, page view id)
  – Use a strong and scalable key-value database (e.g. C*)
• Use columnar storage (long term)
  – Drives machine learning and research
  – Many tools out there (Parquet, BigQuery, etc.)
  – Use a scalable and rich query mechanism (Spark SQL, BigQuery, Impala, etc.)
  – Machine learning programs like flat data (easy with FLATTEN, explode, user-defined functions, etc.)
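The write-time join can be sketched as a merge keyed on the mutual id: each partial event (view, click, post-click) carries the same page-view id and is folded into one record as it arrives. The dict below stands in for a key-value store such as C* (Cassandra); field names are illustrative.

```python
# Write-time join: partial events sharing a mutual key (here a page-view
# id) are merged into one logical record on arrival, so no read-side
# join is needed to reassemble the page view later.

store = {}

def write_event(page_view_id, fragment):
    """Merge a partial event into the record keyed by page_view_id."""
    record = store.setdefault(page_view_id, {})
    record.update(fragment)

write_event("pv-1", {"user": "u-7", "url": "example.com/a"})
write_event("pv-1", {"clicked_item": "item-3"})  # click arrives later
print(store["pv-1"])  # one logical record for the whole page view
```

Note this inherits the out-of-order-events concern from earlier: a real implementation must decide how to merge fragments that arrive late or twice.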
Logical entities to join: Users, Sessions, Views, Clicks, History, Post-click events.
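Flattening is what makes these nested entities ML-friendly; this is what `explode` in Spark SQL or FLATTEN in BigQuery does. A pure-Python sketch of the idea (the record schema is an illustrative assumption):

```python
# Flattening a nested record for ML: one nested page view with N
# recommendations becomes N flat rows, each carrying the page-view
# context alongside a single recommendation.

def explode(page_view):
    """Emit one flat row per recommendation in a nested page view."""
    for rec in page_view["recommendations"]:
        row = {k: v for k, v in page_view.items() if k != "recommendations"}
        row.update(rec)
        yield row

pv = {
    "user": "u-7",
    "url": "example.com/a",
    "recommendations": [
        {"item": "item-1", "clicked": False},
        {"item": "item-2", "clicked": True},
    ],
}
rows = list(explode(pv))
print(len(rows))  # 2 flat rows, one per recommendation
```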
Because We Recommend…
• Data is king!
• Online and offline pose different challenges -> different solutions
• Storage is cheap: rewrite your data for convenience
• Still worried about storage? You don’t have to keep everything for every user:
  – Sub-sampling is a requirement when learning models
  – Be extremely verbose for small parts of the data
  – For fast research: save it again for a sample of the users, views, etc.
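One common way to sample "for a sample of the users" is to hash the user id rather than flip a coin per event, so the same users stay in the verbose sample every day and their histories remain complete. A sketch under that assumption (the 1% rate and hash choice are illustrative):

```python
# Consistent user-level sub-sampling: hashing the user id keeps the SAME
# users in the sample across days and datasets, preserving complete
# per-user histories in the verbose slice.
import hashlib

def in_sample(user_id, rate_percent=1):
    """Deterministically keep ~rate_percent% of users in the sample."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % 100 < rate_percent

users = [f"user-{i}" for i in range(10_000)]
sampled = [u for u in users if in_sample(u)]
print(len(sampled))  # roughly 1% of the users
```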