
Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase - NoSQL matters Dublin 2015


Relevance - Deal Personalization and Real Time Big Data Analytics

Prassnitha Sampath ([email protected])

About Me

• Lead Engineer working on Real Time Data Infrastructure @ Groupon

• Graduate of Portland State and Madras University

What are Groupon Deals?

Our Relevance Scenario

Scaling: Keeping Up With a Changing Business

[Charts: growing number of deals and growing users, 2011-2014]

• 100 Million+ subscribers

• We need to store data such as user click history, email records, and service logs. This is billions of data points and terabytes of data.

Changing Business: Shift from Email to Mobile

• Growth in mobile business

• Reducing dependence on email marketing

100 Million+ App Downloads

Deal Personalization Infrastructure Use Cases

Offline System – Deliver Personalized Emails: personalize billions of emails for hundreds of millions of users

Online System – Deliver a Personalized Website & Mobile Experience: personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users and page views

Deal Personalization Infrastructure Use Cases

Deliver a Personalized Website, Mobile and Email Experience

Deal Performance + Understand User Behavior → Deliver Relevant Experience with High Quality Deals

Earlier System

[Diagram: Data Pipeline (user logs, email records, user history etc.) → Offline Personalization Map/Reduce → MySQL Store → Online Deal Personalization API and Email]

Earlier System

[Same diagram: Data Pipeline → Offline Personalization Map/Reduce → MySQL Store → Online Deal Personalization API and Email]

• Scaling MySQL for data such as user click history and email records was painful unless we sharded the data

• The data pipeline is not "real time"

[Diagram: a Real Time Data Pipeline and an Ideal Data Store serving both the Offline Personalization Map/Reduce and the Online Deal Personalization API, which drive Email]

• A common data store that serves data to both online and offline systems

• A data store that scales to hundreds of millions of records

• A data store that works well with our existing Hadoop based systems

• A real time pipeline that scales and can process about 100,000 messages/second

Ideal System

[Diagram: the same system with HBase as the common data store, fed by Web Site Logs and serving the Offline Personalization Map/Reduce, the Online Deal Personalization API, and Email]

Final Design

[Diagram: Web Site and Mobile Logs → Kafka Message Broker → Storm → HBase]
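As a rough sketch (not Groupon's actual code), a topology for this design might be wired up as follows with the 0.9.x-era storm-kafka spout; the topic name, Zookeeper address, and the writer bolt are illustrative assumptions:

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class PersonalizationTopology {

    // Hypothetical terminal bolt: the real system would parse the event and
    // write user history into HBase here.
    public static class HBaseWriterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String rawEvent = tuple.getString(0); // raw log line from Kafka
            // ... parse event, build an HBase Put, write it ...
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: nothing emitted downstream
        }
    }

    public static void main(String[] args) throws Exception {
        // Zookeeper address and topic name are assumptions for illustration.
        SpoutConfig spoutConfig = new SpoutConfig(
                new ZkHosts("zk1:2181"), "user-events", "/kafka-spout", "personalization");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new KafkaSpout(spoutConfig), 4);
        builder.setBolt("hbase-writer", new HBaseWriterBolt(), 8).shuffleGrouping("events");

        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("deal-personalization", conf, builder.createTopology());
    }
}
```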

Two Challenges With HBase

How to scale to 100,000 writes/second?

• How to run Map Reduce programs over HBase without affecting read latency?

• How to batch load data into HBase without affecting read latencies?

Final HBase Design

[Diagram: a Real Time HBase cluster and a Batch HBase cluster kept in sync by replication; data is bulk loaded via HFiles, and Map Reduce runs over the Batch HBase cluster so online read latencies are unaffected]
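The bulk-load path on the batch cluster might look like the following sketch against the HBase 1.x client API; the table name and HFile directory are assumed placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            TableName name = TableName.valueOf("user_history"); // assumed table
            Table table = conn.getTable(name);
            // The HFiles under /tmp/hfiles would have been written earlier by a
            // Map/Reduce job set up with HFileOutputFormat2.configureIncrementalLoad(...).
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(new Path("/tmp/hfiles"), conn.getAdmin(),
                              table, conn.getRegionLocator(name));
        }
    }
}
```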

Leveraging System for Real Time Analytics

Various requirements from relevance algorithms to pre-compute real time analytics for better targeting

Category Level Multidimensional Performance Metrics: How do women in Dublin convert for pizza deals?

Deal Level Performance Metrics: How do women in Dublin convert for a particular pizza deal?

Leveraging System for Real Time Analytics - More Complex Examples

Category Level Multidimensional Performance Metrics: How do women in Dublin from the Dundrum area aged 30-35 convert for New York Style Pizza, when the deal is located within 2 miles and priced between €10-€20?

Deal Level Performance Metrics: How do women in Dublin from the Dundrum area aged 30-35 convert for a particular deal?

Leveraging System for Real Time Analytics - Even More Complex Examples

How do women in Dublin from the Dundrum area aged 30-35, who also like activities like biking and are active customers on our mobile platform, convert when the deal is located within 2 miles and priced between €10-€20?

How do women in Dublin from the Dundrum area aged 30-35, who also like activities such as biking and are active customers of Groupon deals on the mobile platform, convert for this particular deal?

Power of Simple Counting

It turns out that all of the earlier questions can be answered if we can count the appropriate events in the appropriate buckets.

Conversion rate for pizza deals for women in Dublin = (no. of purchases by women in Dublin for pizza deals) / (no. of deal impressions by women in Dublin for pizza deals)
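A minimal sketch of the idea, using a hypothetical key scheme (the production bucket format is not shown in the talk): a single event increments a counter in every bucket it belongs to, and any conversion rate is just one counter divided by another.

```java
import java.util.ArrayList;
import java.util.List;

public class BucketKeys {
    // Hypothetical "metric:city:gender:..." key scheme, for illustration only.
    static List<String> bucketsFor(String metric, String city, String gender,
                                   String category, String dealId) {
        List<String> keys = new ArrayList<>();
        keys.add(metric + ":" + city + ":" + gender + ":" + category);    // category level
        keys.add(metric + ":" + city + ":" + gender + ":deal:" + dealId); // deal level
        return keys;
    }

    public static void main(String[] args) {
        // A purchase of pizza deal 42 by a woman in Dublin increments:
        //   purchases:dublin:female:pizza
        //   purchases:dublin:female:deal:42
        System.out.println(bucketsFor("purchases", "dublin", "female", "pizza", "42"));
        // Conversion rate = counter(purchases bucket) / counter(impressions bucket).
    }
}
```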

Real Time Analytics Infrastructure

Kafka Topic – with real time user events

Storm – running the analytics topology

Real time infrastructure processing 100,000 requests/second

[Diagram: Redis 1 … Redis N]

The Storm topology calculates the various dimensions/buckets and updates the appropriate Redis bucket. Redis is sharded from the client side.

The Redis cluster handles over 3 million events per second and stores over 14 billion unique keys.
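Client-side sharding of this kind could be done with Jedis's ShardedJedis, which consistent-hashes each key onto one shard; the hosts and the Jedis 2.x API choice here are assumptions, a sketch rather than the production setup:

```java
import java.util.Arrays;
import java.util.List;
import redis.clients.jedis.JedisPoolConfig;
import redis.clients.jedis.JedisShardInfo;
import redis.clients.jedis.ShardedJedis;
import redis.clients.jedis.ShardedJedisPool;

public class ShardedCounter {
    public static void main(String[] args) {
        // Placeholder shard hosts; production would list the real Redis nodes.
        List<JedisShardInfo> shards = Arrays.asList(
                new JedisShardInfo("redis1", 6379),
                new JedisShardInfo("redis2", 6379),
                new JedisShardInfo("redisN", 6379));
        ShardedJedisPool pool = new ShardedJedisPool(new JedisPoolConfig(), shards);

        try (ShardedJedis jedis = pool.getResource()) {
            // The client hashes the key and routes this INCR to exactly one shard.
            jedis.incr("impressions:dublin:female:pizza");
        }
        pool.close();
    }
}
```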

Real Time Analytics Infrastructure - Explained

Kafka Topic – with real time user events

Storm: read user event data from Kafka → determine which buckets this event falls into → increment the event counter for the appropriate buckets in the Redis shards
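Putting the steps together, the counting stage might look like this hedged sketch of a Storm bolt; the tuple field names, Redis host, and key scheme are illustrative assumptions, and client-side sharding is omitted for brevity:

```java
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import java.util.Map;
import redis.clients.jedis.Jedis;

public class CountingBolt extends BaseBasicBolt {
    private transient Jedis jedis; // one connection per executor

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        jedis = new Jedis("redis1", 6379); // placeholder host
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Assumed fields emitted by an upstream parsing bolt.
        String eventType = tuple.getStringByField("eventType"); // e.g. "impressions"
        String city = tuple.getStringByField("city");
        String gender = tuple.getStringByField("gender");
        String category = tuple.getStringByField("category");
        // Increment one counter per bucket this event falls into.
        jedis.incr(eventType + ":" + city + ":" + gender + ":" + category);
        jedis.incr(eventType + ":" + city + ":" + gender);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing emitted downstream
    }
}
```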

Scaling Challenges - Kafka - Storm

• Storm was hard to scale. We had to try many combinations to settle on how many bolts of each type are required for steady state operation, and how many workers are needed overall.

• Use the "topology.max.spout.pending" setting in Storm topologies. We found it very useful for shielding topologies from sudden surges in traffic (see the sketch after this list).

• Build your entire infrastructure so that duplicate data is allowed.
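A minimal sketch of that setting; the numbers are illustrative, not the values used in production:

```java
import backtype.storm.Config;

public class TopologyConfig {
    public static Config build() {
        Config conf = new Config();
        conf.setNumWorkers(8); // worker count found empirically, per the slide
        // topology.max.spout.pending: max un-acked tuples per spout task.
        // Once reached, the spout stops emitting, so a traffic surge on the
        // Kafka topic backs up in Kafka instead of overwhelming the bolts.
        conf.setMaxSpoutPending(5000);
        return conf;
    }
}
```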

Scaling Challenges - Redis

• Reduce the memory footprint: use hashes, which are very memory efficient compared to normal Redis keys (see the sketch below)

• To support a high rate of write operations, we turned off AOF and turned on RDB backups

Redis was the easiest of all the infrastructure pieces: Kafka, Storm, and HBase were harder.
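The hash trick groups many counters into one small hash, which Redis can store as a compact ziplist (governed by hash-max-ziplist-entries) instead of one full top-level key per counter. A sketch with an assumed bucketing scheme:

```java
import redis.clients.jedis.Jedis;

public class HashCounters {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("redis1", 6379); // placeholder host

        String fullKey = "impressions:dublin:female:pizza";
        // Split the logical key: everything up to the last ':' names the hash,
        // the remainder becomes the field inside it (an illustrative scheme).
        int cut = fullKey.lastIndexOf(':');
        String hash = fullKey.substring(0, cut);   // "impressions:dublin:female"
        String field = fullKey.substring(cut + 1); // "pizza"

        // One HINCRBY into a small hash instead of INCR on a top-level key.
        jedis.hincrBy(hash, field, 1);
        jedis.close();
    }
}
```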

When Small is Big – Bloom Filters

• Since both Kafka and Storm can send the same data twice, especially at scale, it was important to build downstream infrastructure that can handle duplicate data.

• However, by its very nature, the analytics (counting) topology cannot handle duplicates.

• Storing individual messages for billions of messages is far too expensive and would take a lot more memory.

• So we used Bloom filters. At a very small error rate, we could effectively de-dupe data with a very small memory footprint.
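A sketch of the de-dupe check using Guava's BloomFilter; the sizing numbers are assumptions. The trade-off matches the slide: a small false positive rate means occasionally dropping a genuinely new event, in exchange for a tiny memory footprint.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class Deduper {
    private final BloomFilter<CharSequence> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
                               200_000_000, // expected insertions (assumed)
                               0.01);       // acceptable false positive rate

    /** Returns true if the event should be counted, false if it is a likely duplicate. */
    public boolean firstTime(String messageId) {
        if (seen.mightContain(messageId)) {
            return false; // probably seen before; drop to avoid double counting
        }
        seen.put(messageId);
        return true;
    }
}
```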

Avoiding Errors – Backup/Recovery Strategy

For a high volume system that also drives so much revenue for the company, a good backup/recovery strategy is necessary.

Redis: RDB backups every few hours. The RDB backups are stored in HDFS for later use.

HBase: HBase snapshot functionality is used; snapshots are taken every few hours.

Kafka/Storm: all input into the Kafka topic is stored in HDFS for 30 days, so any hour or day can be replayed from HDFS if necessary.
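Taking a periodic snapshot is a short call against the HBase 1.x Admin API; the table name here is a placeholder:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SnapshotJob {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            String name = "user_history-" + System.currentTimeMillis();
            // Snapshots are cheap metadata operations; schedule this every few hours.
            admin.snapshot(name, TableName.valueOf("user_history"));
        }
    }
}
```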

Monitoring

Overall end-to-end monitoring tests the complete flow of data through the Kafka → Storm → HBase pipeline.

A crawler crawls the page, and monitoring looks for the corresponding data in HBase.

[email protected]/techjobs

Questions?

Thank you!

Slides prepared in collaboration with Ameya Kanitkar