Introduction to HBase
Gokuldas K Pillai
@gokool
HBase - The Hadoop Database
• Based on Google’s BigTable (OSDI’06)
• Runs on top of Hadoop but provides real-time read/write access
• Distributed, column-oriented database
HBase Strengths
• Can scale to billions of rows × millions of columns
• Relatively cheap & easy to scale
• Random, real-time read/write access to very large data sets
• Supports updates and deletes
Who is using it
• StumbleUpon / su.pr
– Uses HBase as a real-time data storage and analytics platform
• Twitter
– Distributed read/write backup of all MySQL instances. Powers “people search”.
• Powerset (now part of MS)
• Adobe
• Yahoo
• Ning
• Meetup
• More at http://wiki.apache.org/hadoop/Hbase/PoweredBy
Key features
• Column Oriented store
– Table costs only for the data stored
– NULLs in rows are free
• Rows stored in sorted order
• Can scale to petabytes (as BigTable does at Google)
Comparing to RDBMS
• No Joins
• No Query engine
• No transactions
• No column typing
• No SQL, no ODBC/JDBC (HBql is available now)
Data Model - Tables
• Tables consisting of rows and columns
• Table cells are versioned (by timestamp)
• Tables are sorted by row keys
• Table access is via the primary key (the row key)
• Row updates lock the row no matter how many columns are involved
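The table model above can be illustrated with a small in-memory sketch. This is a toy model, not the HBase API: the class name `ToyTable`, its methods, and the `max_versions` parameter are assumptions made for illustration. It keeps row keys sorted and stores several timestamped versions per cell.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of a sorted, versioned table (not the HBase API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions   # hypothetical retention setting
        self.row_keys = []                 # kept sorted, like table row keys
        self.cells = defaultdict(dict)     # row -> {column: [(ts, value), ...]}

    def put(self, row, column, value, ts):
        if row not in self.cells:
            bisect.insort(self.row_keys, row)        # keep rows in sorted order
        versions = self.cells[row].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first
        del versions[self.max_versions:]             # drop the oldest versions

    def get(self, row, column, ts=None):
        """Return the newest value at or before ts (newest overall if ts is None)."""
        for version_ts, value in self.cells.get(row, {}).get(column, []):
            if ts is None or version_ts <= ts:
                return value
        return None

    def scan(self, start_row, stop_row):
        """Return the row keys in [start_row, stop_row), in sorted order."""
        lo = bisect.bisect_left(self.row_keys, start_row)
        hi = bisect.bisect_left(self.row_keys, stop_row)
        return self.row_keys[lo:hi]
```

Because the row keys stay sorted, a range scan is just a slice between two binary-search positions, which is why sorted row keys make range reads cheap.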
Column Families
• Row’s columns are grouped into families
• Column family members identified by a common ‘printable’ prefix
• Column families must be defined up front, but column family members can be added dynamically
– Member names can be arbitrary bytes
• All column family members are collocated on disk
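A minimal sketch of the family/member split, assuming the usual `family:qualifier` column naming. `ToySchemaTable` and its methods are hypothetical names, not HBase classes. Families are fixed at creation time, members appear on first write, and each family keeps its own store to mimic members being kept together on disk.

```python
class ToySchemaTable:
    """Toy model: fixed column families, dynamic members (not the HBase API)."""

    def __init__(self, families):
        # Families are declared once, when the table is created.
        self.stores = {family: {} for family in families}

    def put(self, row, column, value):
        family, _, qualifier = column.partition(":")
        if family not in self.stores:
            raise KeyError(f"unknown column family: {family!r}")
        # Members (qualifiers) are created on first write; no schema change needed.
        self.stores[family].setdefault(row, {})[qualifier] = value

    def get_family(self, row, family):
        """Return a copy of all members of one family for a row."""
        return dict(self.stores[family].get(row, {}))
```

Writing to `info:email` works without any prior declaration as long as `info` exists, while a write to an undeclared family is rejected.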
Server Architecture
• Similar to HDFS
– HBaseMaster ~ NameNode
– RegionServer ~ DataNode
• HBase stores state via the Hadoop FS API
• Can persist to:
– Local
– Amazon S3
– HDFS (Default)
HBaseMaster
What it does:
• Bootstraps a new instance
• Assigns regions and handles RegionServer problems
– Each region from every table is assigned to a RegionServer
• When machines fail, move regions
• When regions split, move regions to balance
What it does NOT do:
– Handle write requests (not a DB master)
– Handle location-finding requests (handled by the RegionServers)
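The assignment duties can be sketched as below. This is a toy model under stated assumptions: `ToyMaster` is a hypothetical class, and the "least-loaded server" policy is an illustrative choice, not the balancing algorithm HBase actually uses. Note that the master only tracks which server carries which region; it never touches row data.

```python
class ToyMaster:
    """Toy model of region assignment (least-loaded policy is an assumption)."""

    def __init__(self, servers):
        self.assignments = {server: [] for server in servers}

    def _least_loaded(self):
        # Ties go to the server that was registered first.
        return min(self.assignments, key=lambda s: len(self.assignments[s]))

    def assign(self, region):
        server = self._least_loaded()
        self.assignments[server].append(region)
        return server

    def handle_failure(self, dead_server):
        # Move every region of the failed server onto the survivors.
        orphaned = self.assignments.pop(dead_server)
        for region in orphaned:
            self.assign(region)
```

The sketch assumes at least one server survives; with no servers left, `assign` would raise `ValueError`.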
RegionServer
• Carry the regions
• Handle client read/write requests
• Manage region splits (inform the Master)
Regions
• Horizontal Partitioning
• Every region has a subset of the table’s rows
• Region identified as
– [table, first row (inclusive), last row (exclusive)]
• Table starts on a single region
• Splits into two roughly equal-sized regions once the original region grows past a size threshold, and so on
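A toy sketch of horizontal partitioning by row-key range. `ToyRegionedTable` and `max_rows_per_region` are hypothetical; the sketch splits on row count for simplicity, which is an assumption and not how a real split threshold is measured. Each region owns the rows from its start key (inclusive) up to the next region's start key (exclusive).

```python
import bisect


class ToyRegionedTable:
    """Toy model of regions as sorted, contiguous row-key ranges."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region   # hypothetical split threshold
        self.start_keys = [""]                # "" marks the start of the key space
        self.regions = [[]]                   # the table starts as a single region

    def _region_index(self, row):
        # The owning region is the last one whose start key is <= row.
        return bisect.bisect_right(self.start_keys, row) - 1

    def put(self, row):
        i = self._region_index(row)
        rows = self.regions[i]
        if row not in rows:
            bisect.insort(rows, row)
        if len(rows) > self.max_rows:
            self._split(i)

    def _split(self, i):
        # Cut the region at its middle row into two halves.
        rows = self.regions[i]
        mid = len(rows) // 2
        left, right = rows[:mid], rows[mid:]
        self.regions[i:i + 1] = [left, right]
        self.start_keys.insert(i + 1, right[0])

    def describe(self):
        """Return (first row inclusive, last row exclusive, row count) per region."""
        ends = self.start_keys[1:] + [None]   # None = open-ended last region
        return [(s, e, len(r))
                for s, e, r in zip(self.start_keys, ends, self.regions)]
```

Finding the region for a row is a binary search over the sorted start keys, so lookup cost grows with the logarithm of the region count.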
ZooKeeper
• Master election and server availability
• Cluster management
– Assignment transaction state management
• Client contacts ZooKeeper to bootstrap its connection to the HBase cluster
• Region key ranges, region server addresses
• Guarantees consistency of data across clients
Workflow (Client connecting first time)
• Client → ZooKeeper (returns the location of -ROOT-)
• Client → -ROOT- (returns the location of .META.)
• Client → .META. (returns the RegionServer)
• To avoid three lookups every time, the client caches this information
– Recache on fault
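The three-step lookup and the client-side cache can be sketched as follows. Everything here is a stand-in: `ToyCluster`, `ToyClient`, and the string "locations" are hypothetical and only count round trips; no real ZooKeeper or catalog-table call is modeled.

```python
class ToyCluster:
    """Stand-in for ZooKeeper, -ROOT- and .META.; counts lookups for illustration."""

    def __init__(self, region_to_server):
        self.region_to_server = dict(region_to_server)
        self.lookups = 0

    def ask_zookeeper(self):
        self.lookups += 1
        return "root-location"        # placeholder for where -ROOT- lives

    def ask_root(self, root_location):
        self.lookups += 1
        return "meta-location"        # placeholder for where .META. lives

    def ask_meta(self, meta_location, region):
        self.lookups += 1
        return self.region_to_server[region]


class ToyClient:
    def __init__(self, cluster):
        self.cluster = cluster
        self.cache = {}               # region -> RegionServer

    def locate(self, region):
        if region not in self.cache:  # cache miss: do all three lookups
            root = self.cluster.ask_zookeeper()
            meta = self.cluster.ask_root(root)
            self.cache[region] = self.cluster.ask_meta(meta, region)
        return self.cache[region]

    def on_fault(self, region):
        # The cached location was stale (e.g. the region moved): drop and relocate.
        self.cache.pop(region, None)
        return self.locate(region)
```

The cache trades freshness for speed: a moved region is only noticed when a request to the old server fails, at which point the client repeats the three lookups.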
Write/Read Operation
• Write request: Client → RegionServer → commit log (on HDFS), then memstore
• Flush to the filesystem when the memstore fills
• Read request: Client → RegionServer → look up the memstore first
– If not found there, look up the flush files (in reverse chronological order)
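The write and read paths above can be sketched with a toy store. `ToyRegionStore` and `memstore_limit` are hypothetical names, the commit log is a plain list standing in for the log on HDFS, and the flush threshold counts keys where a real memstore is measured in bytes.

```python
class ToyRegionStore:
    """Toy model of the write/read path: commit log, memstore, flush files."""

    def __init__(self, memstore_limit=2):
        self.memstore_limit = memstore_limit  # hypothetical flush threshold (keys)
        self.commit_log = []                  # append-only, stands in for HDFS log
        self.memstore = {}                    # recent writes, held in memory
        self.flush_files = []                 # oldest first; snapshots of the memstore

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record in the commit log
        self.memstore[key] = value            # 2. then apply to the memstore
        if len(self.memstore) >= self.memstore_limit:
            self.flush()

    def flush(self):
        # Persist the memstore contents as a new flush file and start fresh.
        self.flush_files.append(dict(self.memstore))
        self.memstore.clear()

    def read(self, key):
        if key in self.memstore:              # newest data lives in the memstore
            return self.memstore[key]
        for flushed in reversed(self.flush_files):  # then newest file first
            if key in flushed:
                return flushed[key]
        return None
```

Reading the flush files newest-first matters because an older file may still hold an outdated value for the same key.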
Integration
• Java HBase Client API
• High performance Thrift gateway
• A REST-ful Web service gateway (Stargate)
– Supports XML and binary data encoding options
• Cascading, Hive and Pig integration
• HBase shell (JRuby)
• TableInputFormat/TableOutputFormat for MapReduce
Main Classes
• HBaseAdmin
– Create table, drop table, list and alter table
• HTable
– Put
– Get
– Scan
Alternatives to HBase
• Cassandra (From Facebook)
– Based on Amazon’s Dynamo
– No master-slave; peer-to-peer (P2P) instead
– Tunable: consistency vs. latency
• Yahoo’s PNUTS
– Not open source
– Works well for multi-DC / geographically dispersed servers
References
• Hadoop – The Definitive Guide
• Cloudera website
• http://wiki.hbase.apache.org
• Lars George
– http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
• Comparing HBase, Cassandra and PNUTS
– http://blog.amandeepkhurana.com/2010/05/comparing-pnuts-hbase-and-cassandra.html
• ACID compliance of HBase
– http://hbase.apache.org/docs/r0.89.20100621/acid-semantics.html