USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION
Tools to build your big data application
Ameya Kanitkar


Page 1: Using Hadoop & HBase to build content relevance & personalization

USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION
Tools to build your big data application

Ameya Kanitkar

Page 2: Using Hadoop & HBase to build content relevance & personalization

Ameya Kanitkar – That’s me!

• Big Data Infrastructure Engineer @ Groupon, Palo Alto USA (Working on Deal Relevance & Personalization Systems)

[email protected]
http://www.linkedin.com/in/ameyakanitkar
@aktwits

Page 3: Using Hadoop & HBase to build content relevance & personalization

Agenda

Basics of Hadoop & HBase

How you can use Hadoop & HBase for a big data application

Case Study: Deal Relevance and Personalization Systems at Groupon with Hadoop & HBase

Page 4: Using Hadoop & HBase to build content relevance & personalization

Big Data Application Examples

Recommendation Systems

Ad targeting

Personalization Systems

BI / Data Warehousing

Log Analysis

Natural Language Processing

Page 5: Using Hadoop & HBase to build content relevance & personalization

So what is Hadoop?

General-purpose framework for processing huge amounts of data.

Open Source

Batch / Offline Oriented

Page 6: Using Hadoop & HBase to build content relevance & personalization

Hadoop - HDFS

Open Source Distributed File System

Stores large files, which can easily be accessed by applications built on top of HDFS

Data is distributed and replicated over multiple machines

Linux-style commands, e.g. ls, cp, mv, touchz

Page 7: Using Hadoop & HBase to build content relevance & personalization

Hadoop – HDFS

Example:

hadoop fs -dus /data/

185453399927478 bytes =~ 168 TB

(One of the folders from one of our Hadoop clusters)

Page 8: Using Hadoop & HBase to build content relevance & personalization

Hadoop – Map Reduce

Application framework built on top of HDFS to process your big data

Operates on key-value pairs

Mappers filter and transform input data

Reducers aggregate mapper output

Page 9: Using Hadoop & HBase to build content relevance & personalization

Example

• Given web logs, calculate the landing page conversion rate for each product

• So basically we need to see how many impressions each product received and then calculate the conversion rate for each product

Page 10: Using Hadoop & HBase to build content relevance & personalization

Map Reduce Example

Map Phase:

Map 1: Process log file. Output: Key (Product ID), Value (Impression Count)
Map 2: Process log file. Output: Key (Product ID), Value (Impression Count)
…
Map N: Process log file. Output: Key (Product ID), Value (Impression Count)

Reduce Phase:

Reducer: Here we receive all data for a given product. Just run a simple for loop to calculate the conversion rate. Output: (Product ID, Conversion Rate)
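The flow above can be sketched in plain Python as a toy simulation of the map, shuffle, and reduce phases (this is not actual Hadoop code; the log format, and the assumption that mappers emit both impression and purchase counts so the reducer can compute a rate, are illustrative):

```python
from collections import defaultdict

# Toy web-log records: (product_id, event) pairs standing in for raw log lines.
LOGS = [
    ("p1", "impression"), ("p1", "impression"), ("p1", "purchase"),
    ("p2", "impression"), ("p2", "impression"),
    ("p2", "impression"), ("p2", "purchase"),
]

def mapper(record):
    # Filter/transform: emit key-value pairs (product_id, (impressions, purchases)).
    product_id, event = record
    if event == "impression":
        yield product_id, (1, 0)
    else:
        yield product_id, (0, 1)

def reducer(product_id, values):
    # All values for one product arrive at one reducer; a simple loop
    # aggregates them and computes the conversion rate.
    impressions = sum(i for i, _ in values)
    purchases = sum(p for _, p in values)
    return product_id, purchases / impressions

# Shuffle phase: group mapper output by key before reducing.
grouped = defaultdict(list)
for record in LOGS:
    for key, value in mapper(record):
        grouped[key].append(value)

rates = dict(reducer(k, v) for k, v in grouped.items())
print(rates)  # conversion rate per product
```

On a real cluster the mappers run in parallel over HDFS blocks and the framework performs the shuffle; the per-key logic is the same.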

Page 11: Using Hadoop & HBase to build content relevance & personalization

Recap

We just processed terabytes of data and calculated conversion rates across millions of products.

Note: This is a batch process only. It takes time. You cannot start this process after someone visits your website.

How about we generate recommendations in a batch process and serve them in real time?

Page 12: Using Hadoop & HBase to build content relevance & personalization

HBase

Provides real-time random read/write access over HDFS

Built on Google's 'Bigtable' design

Open source

This is not an RDBMS, so no joins. Access patterns are generally simple, like get(key), put(key, value), etc.

Page 13: Using Hadoop & HBase to build content relevance & personalization

Row      Cf:<qual>    Cf:<qual>     ....   Cf:<qual>
Row 1    Cf1:qual1    Cf1:qual2
Row 11   Cf1:qual2    Cf1:qual22    Cf1:qual3
Row 2    Cf2:qual1
Row N

Dynamic column names: no need to define columns upfront.

Both rows and columns are sorted lexicographically.

Page 14: Using Hadoop & HBase to build content relevance & personalization

Row      Cf:<qual> ....
user1    Cf1:click_history:{actual_clicks_data}    Cf1:purchases:{actual_purchases}
user11   Cf1:purchases:{actual_purchases}
user20   Cf1:mobile_impressions:{actual mobile impressions}    Cf1:purchases:{actual_purchases}

Note: Each row has different columns, so think about this as a hash map rather than a table with rows and columns
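The hash-map analogy can be made concrete with a small sketch (an in-memory stand-in for an HBase table, not the real client API; the user IDs and column names follow the example above):

```python
# Each row maps a row key to its own dict of "family:qualifier" -> value,
# so different rows can carry completely different columns.
table = {}

def put(row_key, column, value):
    # HBase-style put(key, value): create the row on first write.
    table.setdefault(row_key, {})[column] = value

def get(row_key):
    # HBase-style get(key): return whatever columns this row happens to have.
    return table.get(row_key, {})

put("user1", "cf1:click_history", "{actual_clicks_data}")
put("user1", "cf1:purchases", "{actual_purchases}")
put("user11", "cf1:purchases", "{actual_purchases}")
put("user20", "cf1:mobile_impressions", "{actual mobile impressions}")

# user1 has two columns, user11 only one -- there is no fixed schema.
print(sorted(get("user1")))
print(sorted(get("user11")))
```

Real HBase adds persistence on HDFS, sorted row keys, and column families defined at table creation, but the access pattern is exactly this get/put shape.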

Page 15: Using Hadoop & HBase to build content relevance & personalization

Putting it all together

Store data in HDFS → Analyze data (Map Reduce) → Generate recommendations (Map Reduce) → Serve real-time requests (HBase) → Web / Mobile

Do offline analysis in Hadoop, and serve real time requests with HBase
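A minimal sketch of this batch/online split (the recommendation logic, key names, and data are placeholders; in production the batch job would be MapReduce writing into HBase, and serving would be an HBase get(key)):

```python
# --- Offline: batch job over historical data (stands in for MapReduce on HDFS) ---
purchase_history = {
    "user1": ["spa", "sushi"],
    "user2": ["oil_change"],
}

def batch_generate_recommendations(history):
    # Placeholder "model": recommend a deal related to the user's last purchase.
    related = {"spa": "massage", "sushi": "ramen", "oil_change": "car_wash"}
    return {user: related[items[-1]] for user, items in history.items()}

recommendations = batch_generate_recommendations(purchase_history)

# --- Online: real-time serving is just a key lookup (stands in for HBase get) ---
def serve(user_id):
    return recommendations.get(user_id, "default_deal")

print(serve("user1"))    # precomputed in batch, served instantly
print(serve("user999"))  # fallback for users the batch job never saw
```

The expensive work happens offline on the full data set; the request path does no computation beyond one lookup, which is what makes real-time SLAs achievable.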

Page 16: Using Hadoop & HBase to build content relevance & personalization

Use Case: Deal Relevance & Personalization @ Groupon

Page 17: Using Hadoop & HBase to build content relevance & personalization

What are Groupon Deals?

Page 18: Using Hadoop & HBase to build content relevance & personalization

Our Relevance Scenario

Users

Page 19: Using Hadoop & HBase to build content relevance & personalization

Our Relevance Scenario

Users

How do we surface relevant deals?

Deals are perishable (Deals expire or are sold out)

No direct user intent (As in traditional search advertising)

Relatively Limited User Information

Deals are highly local

Page 20: Using Hadoop & HBase to build content relevance & personalization

Two Sides to the Relevance Problem

Algorithmic Issues: how to find relevant deals for individual users given a set of optimization criteria

Scaling Issues: how to handle relevance for all users across multiple delivery platforms

Page 21: Using Hadoop & HBase to build content relevance & personalization

Developing Deal Ranking Algorithms

• Exploring Data
  • Understanding signals, finding patterns

• Building Models/Heuristics
  • Employ both classical machine learning techniques and heuristic adjustments to estimate user purchasing behavior

• Conduct Experiments
  • Try out ideas on real users and evaluate their effect

Page 22: Using Hadoop & HBase to build content relevance & personalization

Data Infrastructure

Growing Deals: 20+ (2011), 400+ (2012), 2000+ (2013)

Growing Users: 100 Million+ subscribers

We need to store data like user click history, email records, service logs, etc. This comes to billions of data points and TBs of data.

Page 23: Using Hadoop & HBase to build content relevance & personalization

Deal Personalization Infrastructure Use Cases

• Deliver Personalized Emails (Offline System): personalize billions of emails for hundreds of millions of users

• Deliver Personalized Website & Mobile Experience (Online System): personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users & page views

Page 24: Using Hadoop & HBase to build content relevance & personalization

Architecture

Data Pipeline → Relevance Map/Reduce → HBase (Offline System) → Replication → HBase (Online System) → Real-Time Relevance / Email

• We can now maintain different SLAs on online and offline systems

• We can tune the HBase cluster differently for online and offline systems

Page 25: Using Hadoop & HBase to build content relevance & personalization

HBase Schema Design

Row key: User ID (unique identifier for users)

Column Family 1: user history and profile information (overwritten on each update)

Column Family 2: email history for users (appended for each day as a separate column; on average each row has over 200 columns)

• Most of our data access patterns are via "User Key"
• This makes it easy to design the HBase schema
• The actual data is kept in JSON
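The overwrite-vs-append pattern can be sketched like this (again an in-memory stand-in for one HBase row; the JSON payloads and date-suffixed qualifiers are illustrative, not the production schema):

```python
import json

row = {}  # one HBase row for one user, keyed by "family:qualifier"

def put_profile(profile):
    # Column family 1: a single qualifier, overwritten on each update,
    # so only the latest profile is kept.
    row["cf1:profile"] = json.dumps(profile)

def append_email_history(date, emails):
    # Column family 2: one new qualifier per day, so history accumulates
    # as extra columns instead of replacing old data.
    row["cf2:email_" + date] = json.dumps(emails)

put_profile({"city": "Palo Alto"})
put_profile({"city": "Chicago"})          # overwrites the previous profile
append_email_history("2013-05-01", ["deal_a"])
append_email_history("2013-05-02", ["deal_b", "deal_c"])

print(json.loads(row["cf1:profile"]))                # only the latest profile
print(len([c for c in row if c.startswith("cf2:email_")]))  # daily columns pile up
```

Because HBase columns are dynamic, "append a column per day" needs no schema change, which is how a row grows to 200+ columns.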

Page 26: Using Hadoop & HBase to build content relevance & personalization

Cluster Sizing

Hadoop + HBase Cluster: 100+ machine Hadoop cluster; this runs heavy map reduce jobs. The same cluster also hosts a 15-node HBase cluster.

Online HBase Cluster: 10-machine dedicated HBase cluster to serve the real-time SLA, fed via HBase replication.

• Machine Profile
  • 96 GB RAM (25 GB for HBase)
  • 24 virtual CPU cores
  • 8 x 2TB disks

• Data Profile
  • 100 Million+ records
  • 2TB+ data
  • Over 4.2 billion data points

Page 27: Using Hadoop & HBase to build content relevance & personalization

Questions?

Thank You!

(We are hiring!) www.groupon.com/techjobs