DESCRIPTION
At MediaMath, we deal with billions of records every day. One of our biggest challenges is hourly reporting of attribution data - the joining of billions of records to millions of events. How did we solve this hourly attribution reporting issue? We will walk through our evaluation, testing, and fine tuning of a variety of tools including Netezza, Hive, and Pig, and how we ultimately chose Cloudera's Impala.
How MediaMath Solved a Critical Reporting Problem with Impala
©2014 MEDIAMATH INC.
The Cloudera Sessions
June 18, 2014. Ram Narayanan, Senior Director of Database Architecture & Operations
Digital Marketing Pioneer
• Founded in 2007
• Global technology company
• Invented first Demand Side Platform (DSP) for online ads
• Conducts online advertising through real-time bidding & programmatic buying
About MediaMath
About MediaMath: Overview of Real-Time Bidding
[Diagram: in a real-time auction, the advertiser (client) bids within <30 ms to show the user an ad on www.cnn.com]
About MediaMath: Overview of Real-Time Bidding
[Diagram: the user later purchases on www.shoes.com after seeing its ad on www.cnn.com; the purchase ($$) is recorded in the event logs]
• Ad opportunities: 80-100 billion per day → 1.2 million opportunities per second at peak
• We bid on 30-40 billion ads per day
• We serve 1-2 billion ads per day
• 15-20 million events (click, sale, online sign-up) per hour
• 2 TB of data daily (compressed)
Note: this counts only our wins. If we count losses, we easily reach petabytes.
About MediaMath
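The per-second peak rate quoted above follows from the daily volume. A quick sanity check (illustrative arithmetic only, not from the deck):

```python
# Sanity check of the slide's rates: 80-100 billion ad opportunities per
# day implies an *average* of roughly 0.9-1.2 million per second, which is
# consistent with the quoted 1.2 million opportunities/second at peak.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

low_avg = 80e9 / SECONDS_PER_DAY    # ~0.93 million/second
high_avg = 100e9 / SECONDS_PER_DAY  # ~1.16 million/second
print(f"avg rate: {low_avg / 1e6:.2f}M - {high_avg / 1e6:.2f}M per second")
```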
Which ad (impression) led to which action, like a sale or online sign-up?
• 35-40 billion recorded impressions served every 30 days
• 15-20 million events per hour
• Need to join events with impressions 2x per hour
→ Find matching records
→ Perform complex sequencing & allocation logic
→ Run aggregations on results
→ Send data to data marts
→ Provide hourly reporting to clients
The Reporting Attribution Problem
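The event-to-impression join above can be sketched in miniature. This is a hedged illustration only: SQLite stands in for a distributed engine like Impala, and the table names, columns, and last-touch rule are assumptions, not MediaMath's actual schema or allocation logic.

```python
import sqlite3

# Tiny stand-in for the attribution join: impressions (ads shown) joined
# to events (clicks/sales), crediting each event to an earlier impression.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE impressions (user_id TEXT, ad_id TEXT, shown_at INTEGER);
    CREATE TABLE events (user_id TEXT, event_type TEXT, occurred_at INTEGER);
    INSERT INTO impressions VALUES ('u1', 'ad_shoes', 100), ('u1', 'ad_hats', 200);
    INSERT INTO events VALUES ('u1', 'sale', 250);
""")

# "Find matching records" + sequencing logic: credit each event to the
# most recent impression for that user that preceded it (last-touch).
rows = con.execute("""
    SELECT e.user_id, e.event_type,
           (SELECT i.ad_id FROM impressions i
             WHERE i.user_id = e.user_id AND i.shown_at < e.occurred_at
             ORDER BY i.shown_at DESC LIMIT 1) AS credited_ad
      FROM events e
""").fetchall()
print(rows)  # -> [('u1', 'sale', 'ad_hats')]
```

At MediaMath's scale the same logical join runs over billions of impression rows and millions of event rows, which is why engine choice and tuning dominate the rest of the deck.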
Incumbent Architecture: Appliance-Based (Netezza)
• Cost: expensive
• Scale: non-incremental scalability
• Performance: reporting lag
• Reporting inflexibility
• Product feature constrained
To build a data warehouse architecture that can perform hourly reporting of attribution data at scale, while remaining affordable and easy to manage.
Our goal
• Scalability: handle 10-50x scale
• Capability: ability to perform big-data joins at scale
• Performance: complete aggregation in <60 minutes
• Cost-effective: cheaper than appliance-based solutions
Evaluation Criteria:
• Hive — run time: 5-6 hours; stability: high
• Pig — run time: 4-5 hours; stability: high
• Impala Beta (0.6) — run time: 2-3 hours; stability: low
Evaluated Options: Round 1
• Hive, post-tuning (map joins, bucketing, split size, etc.) — run time: 2-3 hours; stability: high
• Impala GA (1.0) (LZO compression, slicing, tuning, hardware upgrade) — run time: 30 minutes; stability: high
Evaluated Options: Round 2
Data Warehouse Architecture: 2011 (Netezza)
[Diagram: bid logs, pixel logs, and metadata are loaded via ELT into Netezza, which performs attribution and aggregation and feeds reports and several reporting data marts]
Data Warehouse Architecture: 2013 (Netezza + Hadoop)
[Diagram: the same inputs (bid logs, pixel logs, metadata) are loaded via ELT, but attribution now runs in Hadoop alongside Netezza; both systems perform aggregation and produce reports feeding the reporting data marts]
• December 2013, peak season → the new architecture accommodated 2x data volume with unprecedented scalability & stability
• Present: we are planning to add more features → considering moving some part of aggregation into Hadoop
Proof:
• Process ONLY the required data
• Compress your data
• “Divide & conquer” your data (i.e., slice and dice)
Lessons Learned & Best Practices
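The "divide & conquer" lesson above means slicing a large join into independent partitions (for example, by hashing the join key) so each slice can be processed separately and in parallel. A minimal sketch, where the slice count and the choice of `user_id` as the key are illustrative assumptions:

```python
import hashlib

# "Divide & conquer": partition records into independent slices by hashing
# the join key, so each slice can be joined and aggregated separately.
N_SLICES = 4

def slice_of(user_id: str) -> int:
    """Stable hash so the same user always lands in the same slice."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % N_SLICES

records = [{"user_id": f"u{i}"} for i in range(10)]
slices = [[] for _ in range(N_SLICES)]
for rec in records:
    slices[slice_of(rec["user_id"])].append(rec)

# Every record lands in exactly one slice, so one big join becomes N
# smaller per-slice joins whose inputs contain only matching keys.
print([len(s) for s in slices])
```

Because both sides of a join are sliced by the same key, matching rows always fall in the same slice, which shrinks each join's working set and lets slices run concurrently.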
THANK YOU