27
Intro to HBase Alex Baranau, Sematext International, 2012 Monday, July 9, 12

Intro to HBase

Embed Size (px)

DESCRIPTION

Introduction to HBase briefly covers the following topics: what is HBase, how and when to used it.

Citation preview

Page 1: Intro to HBase

Intro to HBaseAlex Baranau, Sematext International, 2012

Monday, July 9, 12

Page 2: Intro to HBase

About Me

Software Engineer at Sematext International

http://blog.sematext.com/author/abaranau

@abaranau

http://github.com/sematext (abaranau)

Monday, July 9, 12

Page 3: Intro to HBase

Agenda

What is HBase?

How to use HBase?

When to use HBase?

Monday, July 9, 12

Page 4: Intro to HBase

What is HBase?

Monday, July 9, 12

Page 5: Intro to HBase

What: HBase is...

Think of it as a sparse, consistent, distributed, multidimensional, sorted map:

labeled tables of rows

row consist of key-value cells:

Open-source non-relational distributed column-oriented database modeled after Google’s BigTable.

(row key, column family, column, timestamp) -> value

Monday, July 9, 12

Page 6: Intro to HBase

What HBase is NOTNot an SQL database

Not relational

No joins

No fancy query language and no sophisticated query engine

No transactions out-of-the box

No secondary indices out-of-the box

Not a drop-in replacement for your RDBMS

Monday, July 9, 12

Page 7: Intro to HBase

What: Features-1

Linear scalability, capable of storing hundreds of terabytes of data

Automatic and configurable sharding of tables

Automatic failover support

Strictly consistent reads and writes

Monday, July 9, 12

Page 8: Intro to HBase

What: Part of Hadoop ecosystem

Provides realtime random read/write access to data stored in HDFS

HDFS

HBase

Data Producer

Data Consumer

write

write

write

read

read

Monday, July 9, 12

Page 9: Intro to HBase

What: Features-2Integrates nicely with Hadoop MapReduce (both as source and destination)

Easy Java API for client access

Thrift gateway and REST APIs

Bulk import of large amount of data

Replication across clusters & backup options

Block cache and Bloom filters for real-time queries

and many more...

Monday, July 9, 12

Page 10: Intro to HBase

How to use HBase?

Monday, July 9, 12

Page 11: Intro to HBase

How: the DataRow keys uninterpreted byte arrays

Columns grouped in columnfamilies (CFs)

CFs defined statically upon table creation

Cell is uninterpreted byte array and a timestamp

Row Key Data

Minsk geo:{‘country’:‘Belarus’,‘region’:‘Minsk’}demography:{‘population’:‘1,937,000’@ts=2011}

New_York_Citygeo:{‘country’:‘USA’,‘state’:’NY’}

demography:{‘population’:‘8,175,133’@ts=2010,‘population’:‘8,244,910’@ts=2011}

Suva geo:{‘country’:‘Fiji’}

Different data separated into CFs

All values stores as byte arrays

Rows are ordered and accessed by

row key

Rows can have different columns

Cell can have multiple versions

Data can be very “sparse”

Monday, July 9, 12

Page 12: Intro to HBase

How: Writing the DataRow updates are atomic

Updates across multiple rows are NOT atomic, no transaction support out of the box

HBase stores N versions of a cell (default 3)

Tables are usually “sparse”, not all columns populated in a row

Monday, July 9, 12

Page 13: Intro to HBase

How: Reading the DataReader will always read the last written (and committed) values

Reading single row: Get

Reading multiple rows: Scan (very fast)

Scan usually defines start key and stop key

Rows are ordered, easy to do partial key scan

Query predicate pushed down via server-side Filters

Row Key Data‘login_2012-03-01.00:09:17’ d:{‘user’:‘alex’}

... ...‘login_2012-03-01.23:59:35’ d:{‘user’:‘otis’}‘login_2012-03-02.00:00:21’ d:{‘user’:‘david’}

Monday, July 9, 12

Page 14: Intro to HBase

How: MapReduce Integration

Out of the box integration with Hadoop MapReduce

Data from HBase table can be source for MR job

MR job can write data into HBase

MR job can write data into HDFS directly and then output files can be very quickly loaded into HBase via “Bulk Loading” functionality

Monday, July 9, 12

Page 15: Intro to HBase

How: Sharding the DataAutomatic and configurable sharding of tables:

Tables partitioned into Regions

Region defined by start & end row keys

Regions are the “atoms” of distribution

Regions are assigned to RegionServers (HBase cluster slaves)

Monday, July 9, 12

Page 16: Intro to HBase

HMaster

How: Setup: ComponentsHBase components

RegionServer

RegionServer

ZooKeeper

RegionServer

ZooKeeperZooKeeper

client

RegionServerRegionServer

HMaster

Monday, July 9, 12

Page 17: Intro to HBase

How: Setup: Hadoop ClusterTypical Hadoop+HBase setup

HMaster

NameNode

DataNode DataNode

RegionServer RegionServer

Task

Trac

ker

Task

Trac

ker

JobTrackerMaster Node

Slave Node Slave Node

HDFS

MapReduce

HBase

Slave Nodes

Monday, July 9, 12

Page 18: Intro to HBase

How: Setup: Automatic Failover

DataNode failures handled by HDFS (replication)

RSs failures (incl. caused by whole server failure) handled automatically

Master re-assignes Regions to available RSs

HMaster failover: automatic with multiple HMasters

Monday, July 9, 12

Page 19: Intro to HBase

When to Use HBase?

Monday, July 9, 12

Page 20: Intro to HBase

When: What HBase is good at

Serving large amount of data: built to scale from the get-go

fast random access to the data

Write-heavy applications*

Append-style writing (inserting/overwriting new data) rather than heavy read-modify-write operations**

* clients should handle the loss of HTable client-side buffer** see https://github.com/sematext/HBaseHUT

Monday, July 9, 12

Page 21: Intro to HBase

When: HBase vs ...

Favors consistency over availability

Part of a Hadoop ecosystem

Great community; adopted by tech giants like Facebook, Twitter, Yahoo!, Adobe, etc.

Monday, July 9, 12

Page 22: Intro to HBase

When: Use-casesAudit logging systems

track user actions

answer questions/queries like:

what are the last 10 actions made by user?row key: userId_timestamp

which users logged into system yesterday?row key: action_timestamp_userId

Monday, July 9, 12

Page 23: Intro to HBase

When: Use-cases

Real-time analytics, OLAP

real-time counters

interactive reports showing trends, breakdowns, etc

time-series databases

Monday, July 9, 12

Page 24: Intro to HBase

When: Use-casesMonitoring system example

Monday, July 9, 12

Page 25: Intro to HBase

When: Use-casesMessages-centered systems

twitter-like messages/statuses

Content management systems

serving content out of HBase

Canonical use-case: webtable (pages stored during crawling the web)

And others

Monday, July 9, 12

Page 26: Intro to HBase

Future

Making stable enough to substitute RDBMS in mission critical cases

Easier system management

Performance improvements

Monday, July 9, 12

Page 27: Intro to HBase

Qs?(next: Intro into HBase Internals)

Sematext is hiring!Monday, July 9, 12