Analyzing Large-Scale User Data with Hadoop and HBase

A more technical presentation about WibiData, presented to the Bay Area Search meetup on 1/25/2012.

Analyzing Large-Scale User Data with Hadoop and HBase

Odiago, Inc.

Aaron Kimball – CTO

WibiData is…

• A large-scale storage, serving, and analysis platform

• For user- or other entity-centric data

WibiData use cases

• Product/content recommendations

– “Because you liked book X, you may like book Y”

• Ad targeting

– “Because of your interest in sports, check out…”

• Social network analysis

– “You may know these people…”

• Fraud detection, anti-spam, search personalization…

Use case characteristics

• Have a large number of users

• Want to store (large) transaction data as well as derived data (e.g., recommendations)

• Need to serve recommendations interactively

• Require a combination of offline and on-the-fly computation

A typical workflow

Challenges

• Support real-time retrieval of profile data

• Store a long transactional data history

• Keep related data logically and physically close

• Update data in a timely fashion without wasting computation

• Tolerate faults

• Handle data schemas that change over time

WibiData architecture

HBase data model

• Data in cells, addressed by four “coordinates”

– Row Id (primary key)

– Column family

– Column “qualifier”

– Timestamp
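
As a rough sketch of those four coordinates in the plain HBase Java client (the "users" table and "info:email" column are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CellCoordinates {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");  // hypothetical table

    // Write one cell: (row id, family, qualifier, timestamp) -> value
    Put put = new Put(Bytes.toBytes("user-1234"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("email"),
        System.currentTimeMillis(), Bytes.toBytes("someone@example.com"));
    table.put(put);

    // Read it back by the same coordinates; the latest timestamp wins by default
    Get get = new Get(Bytes.toBytes("user-1234"));
    get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));

    table.close();
  }
}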

Schema-free: not what you want

• HBase may not impose a schema, but your data still has one

• Up to the application to determine how to organize & interpret data

• You still need to pick a serialization system

Schemas = trade-offs

• Different schemas enable efficient storage/retrieval/analysis of different types of data

• Physical organization of data still makes a big difference

– Especially with respect to read/write patterns

WibiData workloads

• Large number of fat rows (one per user)

• Each row updated relatively few times/day

– Though updates may involve large records

• Raw data written to one set of columns

• Processed results read from another

– Often with an interactive latency requirement

• Needs to support complex data types

Serialization with Avro

• Apache Avro provides flexible serialization

• All data is written along with its “writer schema”

• Reader schema may differ from the writer’s

{
  "type": "record",
  "name": "LongList",
  "fields": [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["LongList", "null"]}
  ]
}
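
As a sketch of how a schema like this is used without generated classes (file names here are hypothetical), the Avro generic API embeds the writer schema in the data file it produces:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteLongList {
  public static void main(String[] args) throws Exception {
    // Parse the writer schema shown above (stored here in a hypothetical .avsc file)
    Schema schema = new Schema.Parser().parse(new File("LongList.avsc"));

    // Build a record generically -- no generated LongList class required
    GenericRecord node = new GenericData.Record(schema);
    node.put("value", 42L);
    node.put("next", null);

    // The container file stores the writer schema alongside the records
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("longlist.avro"));
    writer.append(node);
    writer.close();
  }
}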

Serialization with Avro

• No code generation required

• Producers and consumers of data can migrate independently

• Data format migrations do not require structural changes to underlying data
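
A minimal sketch of that independence, assuming the file written in the previous example: the consumer supplies its own (possibly newer) reader schema, and Avro resolves it against the writer schema stored in the file.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadLongList {
  public static void main(String[] args) throws Exception {
    // Hypothetical reader schema: the consumer's current view of the record
    Schema readerSchema = new Schema.Parser().parse(new File("LongList-v2.avsc"));

    // The writer schema comes from the data file itself; Avro reconciles the two,
    // filling added fields from defaults and ignoring fields the reader dropped
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<GenericRecord>(null, readerSchema);
    DataFileReader<GenericRecord> fileReader =
        new DataFileReader<GenericRecord>(new File("longlist.avro"), datumReader);

    while (fileReader.hasNext()) {
      System.out.println(fileReader.next());
    }
    fileReader.close();
  }
}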

WibiData: An extended data model

• Columns or whole families have common Avro schemas for evolvable storage and retrieval

<column>
  <name>email</name>
  <description>Email address</description>
  <schema>"string"</schema>
</column>

WibiData: An extended data model

• Column families are a logical concept

• Data is physically arranged in locality groups

• Row ids are hashed for uniform write pressure
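
One common way to get uniform write pressure (a sketch of the general technique, not necessarily WibiData's exact scheme) is to key each row by a hash of the natural user id rather than the id itself:

import java.security.MessageDigest;

public class HashedRowKey {
  // Hashing spreads writes evenly across regions instead of piling them up
  // on lexicographically adjacent keys (e.g., sequential user ids).
  public static byte[] rowKey(String userId) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    return md5.digest(userId.getBytes("UTF-8"));
  }

  public static void main(String[] args) throws Exception {
    System.out.println("row key is " + rowKey("user-1234").length + " bytes");
  }
}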

WibiData: An extended data model

• Wibi uses 3-d storage

• Data is often sorted by timestamp
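
In HBase terms the third dimension is the cell timestamp; a sketch of reading several timestamped versions of one (hypothetical) column, newest first:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedRead {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "users");

    // Ask for up to 10 timestamped versions of info:clicks for one user
    Get get = new Get(Bytes.toBytes("user-1234"));
    get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("clicks"));
    get.setMaxVersions(10);

    Result result = table.get(get);
    for (KeyValue kv : result.raw()) {  // versions come back newest first
      System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
    }
    table.close();
  }
}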

Analyzing data: Producers

• Producers create derived column values

• Produce operator works on one row at a time

– Can be run in MapReduce, or on a one-off basis

• Produce is a row mutation operator
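
The deck does not show WibiData's producer API itself; as a purely hypothetical sketch, a producer can be thought of as a per-row function that reads some columns and writes a derived one:

// Hypothetical interfaces for illustration only -- not the actual WibiData API.
interface UserRow {
  java.util.List<String> getStrings(String family, String qualifier);
  void put(String family, String qualifier, Object value);
}

// A producer mutates one row at a time: read raw columns, write a derived column.
class RecommendationProducer {
  public void produce(UserRow row) {
    java.util.List<String> clicks = row.getStrings("behavior", "recent_clicks");
    java.util.List<String> recs = recommendFrom(clicks);  // some model or lookup
    row.put("derived", "recommendations", recs);
  }

  private java.util.List<String> recommendFrom(java.util.List<String> clicks) {
    return java.util.Collections.emptyList();  // placeholder scoring logic
  }
}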

Analyzing data: Gatherers

• Gatherers aggregate data across all rows

• Always run within MapReduce

• A bridge between rows and (key, value) pairs
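
WibiData's own gatherer interface isn't shown in the deck; the raw HBase equivalent of that bridge is a TableMapper, which is called once per row and emits ordinary MapReduce (key, value) pairs. A sketch with made-up column names:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Gatherer-style mapper: one call per HBase row, emitting (key, value) pairs
// into a normal MapReduce shuffle and reduce.
public class InterestGatherer extends TableMapper<Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);

  @Override
  protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
      throws IOException, InterruptedException {
    byte[] interest = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("interest"));
    if (interest != null) {
      context.write(new Text(Bytes.toString(interest)), ONE);
    }
  }
}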

Interactive access: REST API

• REST API provides interactive access

• Producers can be triggered “on demand” to create fresh recommendations

(Diagram: GET and PUT requests against the REST API)
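
As a hypothetical example of the GET side (the endpoint path is invented for illustration), an application server might fetch a user's current recommendations like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RecommendationClient {
  public static void main(String[] args) throws Exception {
    // Hypothetical endpoint: interactive read of one user's derived data
    URL url = new URL("http://localhost:8080/users/user-1234/recommendations");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);  // e.g. a JSON list of recommended items
    }
    in.close();
  }
}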

Example: Ad Targeting

(Diagram slides showing the Producer step in the ad-targeting example)

Gathering Category Associations

• Gather observed behavior

• Associate interests, clicks

• … for all pairs

• And aggregate across all users

(Diagram: map phase and reduce phase of the aggregation)
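
A sketch of the reduce side of that aggregation, assuming the map phase (a gatherer) emits ("interest|clicked-category", 1) for each pair it observes in a user's row:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums per-user pair observations into a global association count.
public class AssociationCountReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text pair, Iterable<LongWritable> counts, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable count : counts) {
      sum += count.get();
    }
    context.write(pair, new LongWritable(sum));
  }
}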

Conclusions

• Hadoop, HBase, Avro form the core of a large-scale machine learning/analysis platform

• How you set up your schema matters

• Producer/gatherer programming model allows computations over tables to be expressed naturally; works with MapReduce

www.wibidata.com / @wibidata

Aaron Kimball – aaron@odiago.com
