38
© 2013 RichRelevance, Inc. All Rights Reserved. Confidential. Murtaza Doctor Principal Architect Giang Nguyen Sr. Software Engineer How we solved Real-Time User Segmentation using HBase ?

How we solved Real-time User Segmentation using HBase

Embed Size (px)

DESCRIPTION

At RichRelevance, we service 10 of the top 20 Internet retailer chains and deliver more than $5.5 billions in attributable sales. Every 21 milliseconds a shopper clicks on a recommendation that we have delivered, and we serve over 850 million product recommendations daily. Our Hadoop infrastructure has a capacity to handle upwards of 1.5+ PB. Behavioral Targeting, specifically user segmentation and building personas, is critical for us in generating triggers when a user is added to a segment or switches from a segment. In this presentation, we intend to demonstrate not only how the events are captured, but also how they are stored in HBase in real-time. It is critical to design the system so it can handle thousands of writes per second and, at the same time, be able to query any combination of behavioral attributes in HBase through real-time APIs. This session will walk attendees through the entire design & architecture starting from data Ingestion, schema design, and access patterns, as well as some major problems like sharing & hot spotting. Furthermore, performance metrics will be presented, including the number of read/write per second and details around cluster configuration.

Citation preview

Page 1: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Murtaza Doctor Principal ArchitectGiang Nguyen Sr. Software Engineer

How we solved Real-Time User Segmentation using HBase ?

Page 2: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Outline

• {rr} Story• {rr} Personalization Platform• User Segmentation: Problem Statement• Design & Architecture• Performance Metrics• Q&A

Page 3: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Page 4: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Multiple Personalization Placements

Personalized Product Recommendations Targeted Promotions Relevant Brand Advertising

Inventory andMargin Data

CatalogAttribute Data

Real time user behavior

Multi-Channelpurchase history

3rd Party Data Sources

Future Data Sources

Input Data

100+ algorithms dynamically targeted by page, context, and user behavior, across

all retail channels.Targeted content optimization to

customer segmentsMonetization through relevant

brand advertising

Page 5: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

One place where all data about your customer lives (e.g. loyalty, customer service, offline, catalogue)

Direct access to data for your entire enterprise via API

Real-time actionable data and business intelligence

Link customer activity across any channel

Leverage 3rd party data (e.g. weather, geography, census)

Delivering a Customer-centric

Omni-channelData model

Rich

Relevan

ce DataM

esh

Real-time Segmentation

DMP Integration

RichRelevance DataMesh Cloud PlatformDelivering a Single View of your Customer

Event-based Triggers

Ad Hoc Reporting

Omni-channel Personalization

Loyalty Integration

Page 6: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Our cloud-based platform supports both real-time processes and analytical use cases, utilizing technologies to name a few: Crunch, Hive, HBase, Avro, Azkaban, Voldemort, Kafka

Someone clicks on a {rr} recommendation every 21 milliseconds

Did You Know?

Our data capacity includes a 1.5 PB Hadoop infrastructure, which enables us to employ 100+ algorithms in real-time

In the US, we serve 7000 requests per second with an average response time of 50 ms

Page 7: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Real-Time User SegmentationFinding and Targeting the Right Audience

Page 8: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Meet Amanda and Jessica

Page 9: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Because We Know What They Like!

Page 10: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

• Utilizes valuable targeting data such as event logs, off-line data, DMA, demographics, and etc.

• Finds highly qualified consumers and buckets them into segments

What is the {rr} Segment Builder?

Example: for Retail, Segment Builder supports view, click, purchase on products, categories, and brands

Page 11: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Segment BuilderSegment’s List page

Page 12: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Segment BuilderAdd/Edit Segment page

Page 13: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

• Create segments to capture the audience via UI• Each behavior is captured by a rule• Each rule corresponds to a row key in HBase• Each rule returns the users• Rules are joined using set union or intersection• Segment definition consists of one or more rules

Design: Segment Evaluation Engine

Page 14: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Our choice Avro and Kafka

Real-Time Data Ingestion

Page 15: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

• User interacts with a retail site• Events are triggered in real-

time, each event is converted into an Avro record

• Events are ingested using our real-time framework built using Apache Kafka

Real-time data ingestion

Page 16: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Design Principles: Real-Time Solution

• Streaming: Support for streaming versus batch• Reliability: no loss of events and at least once

delivery of messages• Scalable: add more brokers, add more consumers,

partition data • Distributed system with central coordinator like

zookeeper• Persistence of messages• Push & pull mechanism• Support for compression• Low operational complexity & easy to monitor

Page 17: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Our Decision – Apache Kafka

• Distributed pub-sub messaging system• More generic concept (Ingest & Publish)• Can support both offline & online use-cases• Designed for persistent messages as the common

case• Guarantee ordering of events• Supports gzip & snappy (0.8) compression protocols• Supports rewind offset & re-consumption of data • No concept of master node, brokers are peers

Page 18: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Kafka Architecture

Page 19: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Common Consumer Framework

Page 20: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Volume Facts?

Daily clickstream event data 150 GB

Average size of message 2 KB

Batch Size 5000 messages

Producer throughput 5000 messages/sec

Real time HBase Consumer 7000 messages/sec

Page 21: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

End to End real-time story…

• User exhibits a behavior• Events are generated at front-end data-center• Events are streamed to backend data center

via Kafka• Events are consumed and indexed in HBase

tables• Takes seconds from event origination to

indexing in HBase

Page 22: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

End to End real-time story…

Page 23: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

User Segmentation Engines

Features {rr} Engine Other Engines

User’s behavior ingestion Real-time Not Real-time

Batch style processing is done Immediately At end of a day

When segment membership is changednotifications will be

Event driven N/A

Technologies used Scalable and open source

Unscalable and proprietary

Page 24: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Use Cases

Use Case #1

Users exhibit behaviors

Behaviors ingested and indexed in real time

Users are now in corresponding segments

Retrieving users takes seconds

Use Case #2

Users exhibit behaviors

Segment membership calculated in real time

Notifications are sent on segment membership change

Page 25: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Our choice HBase?

• Real time segment evaluation• Optimized for read and scan• Column cell supports frequency use case• Eventual consistency does not work• Seamless integration with Hadoop• Possible with good row key design

Page 26: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

HBase Row Key Design

• Took a few attempts• Design considerations

– Timestamp in row or columns– Partition behavior by date– Optimized for read or write– Hot spotting issues– Uniform key distribution

Wasn’t easy

Page 27: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Design: First Attempt

• Row key represents behavior• Columns store the user id• Cell stores behavior time and capture frequency• One column family U

RowKey Columns

338VBChanel 23b93laddf82:1370377143973Hd92jslahd0a:1313323414192

338CCElectronic z3be3la2dfa2:1370477142970kd9zjsla3d01:1313323414192

Page 28: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Design: First Attempt Issues

• Row too wide• May exceed HFile size• Terrible write/read performance

Page 29: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

• Partition behavior by date• Reduce row size• Gained ability to scan across dates

Design: Second Attempt

Rowkey Columns

338VBChanel1370377143 23b93laddf82:1370377143973Hd92jslahd0a:1313323414192

338CCElectronic1370377143 z3be3la2dfa2:1370477142970kd9zjsla3d01:1313323414192

Page 30: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Design: Second Attempt Issues

• Hot spotting• Popular products or high level categories can

have millions of users, each day• One region serving same dimension type• Terrible write/read performance

Page 31: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

OK…I need a BREAK!!!

Page 32: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Design: Final

• Shard row to prevent hot-spotting• Shard into N number of regions• Significant improvement in read/write• Prepend a salt to each key

Page 33: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

[salt]_len_[siteId]_len_[metric]_len_[dimension]_len_[value][timestamp]

Design: Final

_len_ is the integer length of the field following it

timestamp is stored in day granularity

[salt] is computed by first creating a hash from the siteId, metric, and dimension, then combining this with a random number between 0 and N (number of shards) for sharding

Row key contains

attribute value siteId metric attribute timestamp userGUID

Page 34: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

• Complex segments contain many rules• Each rule = one behavior = one row key• Each row key returns a set of users• OR = Full outer join• AND = Inner join• Done in memory for small rules• Merged on disk for large rules

Design: Behavior Joins

Page 35: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

HBase Consumer Sync versus Async API

Page 36: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Segmentation Performance

• Seconds latency• 40K puts/sec over 2 tables, 8 regions per table• Scaling achieved through addition of regions• Small segments calculated in msecs• Mid-size segments in seconds• Large segments calculated in 10s of seconds

Page 37: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Page 38: How we solved Real-time User Segmentation using HBase

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Thank You