01 hbase

HBase

RDBMS Scaling

• Cannot scale for large distributed data sets

• Vendors offer replication and partitioning solutions to grow the database beyond the confines of a single node, but these are generally complicated to install and maintain

• Such techniques compromise RDBMS features such as

– Joins, Complex queries, Views, Triggers and foreign key constraints

– These queries become expensive

Why BigTable?

• The performance of an RDBMS is good for transaction processing, but for very large scale analytic processing the solutions are expensive and specialized.

• Very large scale analytic processing

– Big queries, typically range or table scans.

– Big databases (100s of TB)

• MapReduce on Bigtable, optionally with Cascading on top to support some relational algebra, may be a cost-effective solution.

• Sharding (shared-nothing horizontal partitioning) is not a solution for scaling open source RDBMS platforms

– Application specific

– Labor-intensive (re)partitioning

Key concept

• At its core, HBase / BigTable is a map.

• It is persistent storage.

• HBase and BigTable are built upon distributed file systems.

• Unlike most map implementations, in HBase/BigTable the key/value pairs are kept in strict alphabetical order.

• Multidimensional map.

• Sparse.

HBase is a distributed column-oriented database built on top of HDFS.

Map

• A map is "an abstract data type composed of a collection of keys and a collection of values, where each key is associated with one value."

{

"Name" : "Subhas",

"Mail" : "[email protected]",

"Location" : "9F-TA-WS-21",

"Phone" : "+918025113529",

"Sal" : ************

}

In this example "Name" is a key, and "Subhas" is the corresponding value.

Persistent

• Persistence merely means that the data you put in this special map "persists" after the program that created or accessed it is finished.

• This is no different in concept than any other kind of persistent storage such as a file on a file-system.

• Each value can be versioned in HBase

Distributed

• Built upon distributed file systems

– file storage can be spread out among an array of independent machines.

– HBase sits atop either Hadoop's Distributed File System (HDFS) or Amazon's Simple Storage Service (S3),

– BigTable makes use of the Google File System (GFS).

• Data is replicated across a number of participating nodes in an analogous manner to how data is striped across discs in a RAID system.

Sorted

Continuing our example, the sorted version looks like this:

{

"Location" : "9F-TA-WS-21",


"Name" : "Subhas",

"Phone" : "+918025113529",

"Sal" : ************

}

Sorting can ensure that items of greatest interest to you are near each other
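The sorted-map idea can be sketched in plain Java (this is not the HBase API). java.util.TreeMap keeps its keys in sorted order regardless of insertion order, so a range lookup returns neighbouring keys together. The class name SortedMapSketch and the helper keysInRange are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class SortedMapSketch {
    // Builds the example map; TreeMap sorts the keys whatever the insertion order.
    static SortedMap<String, String> example() {
        SortedMap<String, String> m = new TreeMap<>();
        m.put("Phone", "+918025113529");
        m.put("Name", "Subhas");
        m.put("Location", "9F-TA-WS-21");
        return m;
    }

    // Returns the keys in [from, to): a range scan over adjacent keys.
    static List<String> keysInRange(SortedMap<String, String> m, String from, String to) {
        return new ArrayList<>(m.subMap(from, to).keySet());
    }

    public static void main(String[] args) {
        SortedMap<String, String> m = example();
        System.out.println(m.keySet());               // keys come back sorted
        System.out.println(keysInRange(m, "L", "O")); // only keys between "L" and "O"
    }
}
```

Because adjacent keys are stored together, choosing row keys so that related rows share a prefix makes such range scans cheap.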

Multidimensional

A map of maps

{
  "Location" : {
    "FL" : "9F",
    "TOWER" : "A",
    "WS" : "21"
  },
  "Name" : {
    "FIRST" : "Subhas",
    "MID" : "Kumar",
    "LAST" : "Ghosh"
  },
  "Phone" : "+918025113529",
  "Sal" : ************
}

Each key points to a map with one or more keys, e.g. "FL", "TOWER", "WS".

Top-level key/map pair is a "row".

Also, in BigTable/HBase nomenclature, the "Location" and "Name" mappings would be called "Column Families".

Multidimensional

• A table's column families are specified when the table is created, and are difficult or impossible to modify later.

• It can also be expensive to add new column families, so it's a good idea to specify all the ones you'll need up front.

• Fortunately, a column family may have any number of columns, denoted by a column "qualifier" or "label".

Multidimensional

…
"aaaaa" : {
  "A" : {
    "foo" : "y",
    "bar" : "d"
  },
  "B" : {
    "" : "w"
  }
},
"aaaab" : {
  "A" : {
    "foo" : "world",
    "bar" : "domination"
  },
  "B" : {
    "" : "ocean"
  }
},
…

The "A" column family has two columns: "foo" and "bar".

The "B" column family has just one column, whose qualifier is the empty string ("").

When asking HBase/BigTable for data, provide the full column name in the form "<family>:<qualifier>", e.g. "A:foo", "A:bar" and "B:".
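As a small illustration of the "<family>:<qualifier>" naming, the sketch below splits a full column name at the first colon. It is plain Java, not an HBase call; ColumnNameSketch and parse are hypothetical names chosen for this example.

```java
class ColumnNameSketch {
    // Splits "<family>:<qualifier>" at the first colon; the qualifier may be empty.
    static String[] parse(String fullName) {
        int i = fullName.indexOf(':');
        if (i < 0) {
            throw new IllegalArgumentException("expected <family>:<qualifier>");
        }
        return new String[] { fullName.substring(0, i), fullName.substring(i + 1) };
    }

    public static void main(String[] args) {
        String[] parts = parse("A:foo");
        System.out.println("family=" + parts[0] + ", qualifier=" + parts[1]);
    }
}
```

Note that "B:" parses to family "B" with an empty qualifier, which matches the single-column family in the example.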

Multidimensional

• Labeled tables of rows × columns × timestamp

– Cells addressed by row/column/timestamp

• Row keys are uninterpreted byte arrays, e.g. a URL

– Rows are ordered by a Comparator (default: byte order)

– Row updates are atomic, even across hundreds of columns

• Columns are grouped into column families

– Columns have a column-family prefix and then a qualifier

• E.g. webpage:mimetype, webpage:language

– Column-family names are printable; qualifiers are arbitrary bytes

– Column families are part of the table schema; qualifiers are not

• As a (perverse) Java declaration:

SortedMap<byte[], SortedMap<byte[], List<Cell>>> hbase =

new TreeMap<>(new RawByteComparator());
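The declaration above can be turned into a small runnable model using only java.util. This is a sketch of the data model under stated assumptions, not HBase code: CellMapSketch, BYTE_ORDER and the put/get helpers are made-up names, the column key is the full "family:qualifier" string, and versions are kept newest first.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

class CellMapSketch {
    // Unsigned lexicographic byte order, standing in for the default row ordering.
    static final Comparator<byte[]> BYTE_ORDER = (x, y) -> {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) return d;
        }
        return x.length - y.length;
    };

    // row -> "family:qualifier" -> timestamp (newest first) -> value
    final NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>(BYTE_ORDER);

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(bytes(row),
                    r -> new TreeMap<byte[], NavigableMap<Long, byte[]>>(BYTE_ORDER))
            .computeIfAbsent(bytes(column),
                    c -> new TreeMap<Long, byte[]>(Comparator.reverseOrder()))
            .put(timestamp, bytes(value));
    }

    // Latest version of a cell, or null when the cell is absent (the map is sparse).
    String get(String row, String column) {
        NavigableMap<byte[], NavigableMap<Long, byte[]>> columns = rows.get(bytes(row));
        if (columns == null) return null;
        NavigableMap<Long, byte[]> versions = columns.get(bytes(column));
        if (versions == null) return null;
        return new String(versions.firstEntry().getValue(), StandardCharsets.UTF_8);
    }
}
```

A custom comparator is needed because Java arrays do not implement Comparable; with it, TreeMap orders and looks up byte[] keys by content rather than by identity.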

Multidimensional

• Cell is uninterpreted byte array and a timestamp

– E.g. webpage content

• Tables partitioned into Regions

– Region defined by start & end row

– start < end, in the lexicographic sense

– Regions are the 'atoms' of distribution deployed around the cluster.
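Because regions are contiguous row ranges, finding the region for a row reduces to finding the greatest region start key that is less than or equal to the row. A minimal sketch in plain Java follows; RegionLookupSketch, the server names and the use of String keys (instead of byte arrays) are simplifying assumptions for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

class RegionLookupSketch {
    // Start row of each region -> name of the server hosting it (names are made up).
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startRow, String server) {
        regionsByStartKey.put(startRow, server);
    }

    // The region holding a row is the one with the greatest start key <= row.
    String serverFor(String row) {
        Map.Entry<String, String> e = regionsByStartKey.floorEntry(row);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch r = new RegionLookupSketch();
        r.addRegion("", "rs1");   // first region starts at the empty key
        r.addRegion("g", "rs2");
        r.addRegion("p", "rs3");
        System.out.println(r.serverFor("kiwi"));
    }
}
```

The first region uses the empty string as its start key, so every possible row falls into some region.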

Multidimensional

[Figure: a cell value is addressed by row key, column family, qualifier, and timestamp.]

Sparse

• Not all columns in all rows are filled

What HBase Is Not

• Tables have one primary index, the row key.

• No join operators.

• Scans and queries can select a subset of available columns, perhaps by using a wildcard.

• There are three types of lookups:

– Fast lookup using row key and optional timestamp.

– Full table scan

– Range scan from region start to end.

• Limited atomicity and transaction support.

– HBase supports multiple batched mutations of single rows only.

– Data is unstructured and untyped.

• Not accessed or manipulated via SQL.

– Programmatic access via Java, REST, or Thrift APIs.

– Scripting via JRuby.

– No JOIN, no sophisticated query engine, no column typing, no ODBC/JDBC, no Crystal Reports, no transactions, no secondary indices

Map-Reduce With HBase

• When we use a map-reduce framework with an HBase table, a map function is executed for each region independently, in parallel.

• Within each map, the query is answered by scanning the rows in order, from the lowest key to the highest.

• Optionally, certain rows and columns (column families) can be filtered out for better performance.

Architecture

Elements

– Table : a list of tuples sorted by row key ascending, column name ascending and timestamp descending.

– Regions: A Table is broken up into row ranges called regions. Each row range contains rows from start-key to end-key. (A set of regions, sorted appropriately, forms an entire table.)

– HStore: Each column family in a region is managed by an HStore.

– HFile: Each HStore may have one or more HFile (a Hadoop HDFS file type).

Components

• Master

o Responsible for monitoring region servers

o Load balancing for regions

o Redirects clients to the correct region servers

o The current SPOF (single point of failure)

• RegionServer slaves

o Serve client requests (write/read/scan)

o Send heartbeats to the Master

o Throughput and the number of regions scale with the number of region servers

Components

• ZooKeeper

– A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

– ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace, organized similarly to a standard file system.

– The namespace consists of data registers, called znodes in ZooKeeper parlance, and these are similar to files and directories.

– Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in memory, which means ZooKeeper can achieve high throughput and low latency.

Distributed Coordination

Data model and the hierarchical namespace

Distributed Coordination

• The replicated database is in-memory.

• Updates are logged to disk for recoverability.

• Writes are serialized to disk before they are applied to the in-memory database.

• Clients connect to exactly one server to submit requests.

• Read requests are serviced from the local replica of each server database.

• Requests that change the state of the service (write requests) are processed by an agreement protocol.

Distributed Coordination

• As part of the agreement protocol, all write requests from clients are forwarded to a single server, called the leader.

• The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery.

• The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.

• ZooKeeper uses a custom atomic messaging protocol.

– ZooKeeper can guarantee that the local replicas never diverge.

– When the leader receives a write request, it calculates what the state of the system is when the write is to be applied and transforms this into a transaction that captures this new state.

The general protocol flow

1. The client contacts ZooKeeper to find where it should put the data.

2. For this purpose, HBase maintains two catalog tables, namely -ROOT- and .META..

3. First, HBase finds the location of the .META. table from the -ROOT- table.

4. Subsequently, it finds the server location of the assigned region of a table from the .META. table.

5. The client caches this information and contacts the HRegionServer.

6. Next, the HRegionServer creates an HRegion object corresponding to the opened region.

   1. When the HRegion is "opened", it sets up an HStore instance for each HColumnFamily for every table, as defined by the user beforehand.

   2. Each of the Store instances has one or more StoreFile instances.

   3. StoreFiles are lightweight wrappers around the actual storage file, called HFile.

Where is my data?

[Figure: the Client asks ZooKeeper for -ROOT-; -ROOT- holds a row per .META. region; .META. holds a row per table region; the region of MyTable holds MyRow.]

The general protocol flow

7. The client issues a HTable.put(Put) request to the HRegionServer which hands the details to the matching HRegion instance.

8. The first step is to decide if the data should be first written to the "Write-Ahead-Log" (WAL) represented by the HLog class. The WAL is a standard Hadoop SequenceFile and it stores HLogKey's.

9. These keys contain a sequential number as well as the actual data and are used to replay not yet persisted data after a server crash.

10. Once the data is written (or not) to the WAL it is placed in the MemStore. At the same time it is checked if the MemStore is full and in that case a flush to disk is requested.

11. The store files created on disk are immutable. Sometimes the store files are merged together; this is done by a process called compaction. This buffer-flush-merge strategy is a common pattern described in Log-Structured Merge-Tree.

12. After a compaction, if a newly written store file size is greater than the size specified in hbase.hregion.max.filesize (default 256 MB), the region is split into two new regions.

[Figure: timeline of repeated flushes interleaved with occasional compactions.]

Log Structured Merge Trees

• Random IO for writes is bad in HDFS.

• LSM Trees convert random writes to sequential writes.

• Writes go to a commit log and in-memory storage (MemStore)

• The MemStore is occasionally flushed to disk (StoreFile)

• The disk stores are periodically compacted to HFile (on HDFS)

• Use Bloom Filters with merge.
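The buffer-flush-merge cycle described above can be sketched with in-memory maps. This is a toy model, not HBase code: LsmSketch, its fields and the flush threshold are hypothetical, the "files" are immutable-by-convention TreeMaps rather than HFiles on HDFS, and Bloom filters and delete handling are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class LsmSketch {
    private final int flushThreshold;
    final List<String> wal = new ArrayList<>();                          // append-only log
    TreeMap<String, String> memStore = new TreeMap<>();                  // sorted in-memory buffer
    final List<TreeMap<String, String>> storeFiles = new ArrayList<>();  // oldest first

    LsmSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);   // 1. sequential append to the log
        memStore.put(key, value);     // 2. buffer the write in memory
        if (memStore.size() >= flushThreshold) {
            flush();                  // 3. buffer full: write it out in one go
        }
    }

    // Writes the whole buffer out as one new sorted "file" and starts a fresh buffer.
    void flush() {
        if (memStore.isEmpty()) return;
        storeFiles.add(memStore);
        memStore = new TreeMap<>();
    }

    // Reads check the buffer first, then the files from newest to oldest.
    String get(String key) {
        if (memStore.containsKey(key)) return memStore.get(key);
        for (int i = storeFiles.size() - 1; i >= 0; i--) {
            String v = storeFiles.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }

    // Merges all files into one; files are applied oldest first, so newer values win.
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> f : storeFiles) {
            merged.putAll(f);
        }
        storeFiles.clear();
        if (!merged.isEmpty()) storeFiles.add(merged);
    }
}
```

Every write touches only the end of the log and the in-memory buffer, which is why random writes become sequential ones; the price is that reads may have to consult several files until a compaction merges them.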

Buffer-Flush-Compact (minor)

[Figure: Buffer-Flush-Compact within a Region. Labels: MemStore (buffer); HLog (append-only WAL on HDFS, sequence file, one per region); StoreFiles (HFiles on HDFS). Operations shown: Flush, Compact, Read.]

HFile: immutable sorted map (byte[] → byte[]), i.e. (row, column, timestamp) → cell value

Compaction

• Major compaction:

– The most important difference between minor and major compactions is that major compactions process delete markers, max versions, etc., while minor compactions don't.

– This is because delete markers might also affect data in the non-merged files, so it is only possible to do this when merging all files.

• When a delete is performed on an HBase table, nothing gets deleted immediately; rather, a delete marker (a.k.a. tombstone) is written.

– This is because HBase does not modify files once they are written.

– The deletes are processed during the major compaction process, at which point the data they hide and the delete marker itself will not be present in the merged file.
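The role of delete markers can be illustrated with a toy model in plain Java. Everything here is an assumption made for the example: TombstoneSketch is not an HBase class, store files are modelled as TreeMaps, and an empty Optional stands in for a tombstone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;

class TombstoneSketch {
    // Files are ordered oldest first; Optional.empty() plays the role of a delete marker.
    final List<TreeMap<String, Optional<String>>> files = new ArrayList<>();

    void addFile(TreeMap<String, Optional<String>> file) {
        files.add(file);
    }

    // Reads go from newest to oldest; a tombstone hides any older value.
    String get(String key) {
        for (int i = files.size() - 1; i >= 0; i--) {
            Optional<String> v = files.get(i).get(key);
            if (v != null) return v.orElse(null);
        }
        return null;
    }

    // Major compaction: merge ALL files, then drop tombstones and the data they hide.
    void majorCompact() {
        TreeMap<String, Optional<String>> merged = new TreeMap<>();
        for (TreeMap<String, Optional<String>> f : files) {
            merged.putAll(f);                         // newer entries overwrite older ones
        }
        merged.values().removeIf(v -> !v.isPresent()); // safe only because every file was merged
        files.clear();
        files.add(merged);
    }

    // Builds a two-file example: row1 is written, then deleted in a newer file.
    static TombstoneSketch demo() {
        TombstoneSketch t = new TombstoneSketch();
        TreeMap<String, Optional<String>> older = new TreeMap<>();
        older.put("row1", Optional.of("v1"));
        older.put("row2", Optional.of("v2"));
        TreeMap<String, Optional<String>> newer = new TreeMap<>();
        newer.put("row1", Optional.empty());
        t.addFile(older);
        t.addFile(newer);
        return t;
    }
}
```

If only some files were merged, dropping the tombstone could make an older value in an unmerged file visible again, which is why this step is reserved for major compactions.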

In Short

Java Example

HBaseConfiguration config = new HBaseConfiguration();

HTable table = new HTable(config, "myTable");

Cell cell = table.get("myRow",

"myColumnFamily:columnQualifier1");

Java Example: A Table Mapper

Scan scan = new Scan();

scan.addColumns(COLUMN_FAMILY_NAME);

// add more filters to the scan here, e.g. scan.setFilter(...);

TableMapReduceUtil.initTableMapperJob(TABLE_NAME, scan, Mapper.class,

ImmutableBytesWritable.class, IntWritable.class, job);

class Mapper extends TableMapper<ImmutableBytesWritable, IntWritable>

{

@Override

public void map(ImmutableBytesWritable row, Result values, Context context) throws

IOException

{

ImmutableBytesWritable userKey = new ImmutableBytesWritable(row.get());

for (KeyValue value: values.list())

{

ByteBuffer b = ByteBuffer.wrap(value.getValue());

String column = Bytes.toString(value.getColumn());

// compute something and store it in res

IntWritable res = new IntWritable(0); // placeholder for the computed value

try { context.write(userKey, res); }

catch (InterruptedException e) { throw new IOException(e); }

}

}

}

KeyValue in the HFile is a low-level byte array that allows for "zero-copy" access to the data, even with lazy or custom parsing if necessary.

http://blrk489d.in002.siemens.net/wiki/index.php/Image:KeyValue.png


Map-Reduce with HBase

Map-Reduce with HBase - Classes

InputFormat

• The InputFormat class is responsible for the actual splitting of the input data, as well as for returning a RecordReader instance that defines the classes of the key and value objects and provides a next() method used to iterate over each input record.

• In HBase, the implementation is called TableInputFormatBase, along with its subclass TableInputFormat.

• TableInputFormat is a light-weight concrete version.

• You can provide the name of the table to scan and the columns you want to process during the Map phase.

• It splits the table into proper pieces for you and hands them over to the subsequent classes.

Mapper

• The Mapper class(es) are for the next stage of the MapReduce.

• In this step, each record read using the RecordReader is processed using the map() method.

• HBase provides a TableMap class that is specific to iterating over an HBase table.

• One specific implementation is the IdentityTableMap, which is also a good example of how to add your own functionality to the supplied classes.

• The TableMap class itself does not implement anything but only adds the signatures of what the actual key/value pair classes are.

• The IdentityTableMap is simply passing on the records to the next stage of the processing.

Reducer

• The Reduce stage and class layout is very similar to the Mapper one explained above.

• This time we get the output of a Mapper class and process it after the data was shuffled and sorted.

OutputFormat

• The final stage is the OutputFormat class, whose job is to persist the data in various locations.

• There are specific implementations that allow output to files or to HBase tables in case of the TableOutputFormat.

• It uses a RecordWriter to write the data into the specific HBase output table.

• It is important to note the cardinality as well.

• While there are many Mappers handing records to many Reducers, there is only one OutputFormat that takes each output record from its Reducer subsequently.

• It is the final class handling the key/value pairs and writes them to their final destination, this being a file or a table.

• The name of the output table is specified when the job is created.

Map-reduce options with HBase

Input \ Output   Raw data                   Table-A                    Table-B

Raw data         Map + Reduce (Hadoop)      Map only or Map + Reduce   Map only or Map + Reduce

Table-A          Map only or Map + Reduce   Map + Reduce               Map

Table-B          Map only or Map + Reduce   Map                        Map + Reduce

Reading and writing into the same table: this hinders the proper distribution of regions across the servers (open scanners block region splits), and the scan may or may not see the new data. Updates must be written in TableReduce.reduce().

Read from one table and write to another: can write updates directly in the TableMap.map()

Map stage completely reads a table and then passes the data on in intermediate files to the Reduce stage.

Reducer reads from DFS and writes into the now idle HBase table

Usage

Classes

• HBaseAdmin

• HBaseConfiguration

• HTable

• HTableDescriptor

• Put

• Get

• Scanner

• Filters

[Figure: hierarchy of Database Admin → Table → Family → Column Qualifier.]

Using HBase API

HBaseConfiguration: Adds HBase configuration files to a Configuration

new HBaseConfiguration ( )

new HBaseConfiguration (Configuration c)

<property>

<name>name</name>

<value>value</value>

</property>

HBaseAdmin: new HBaseAdmin( HBaseConfiguration conf )

• Ex: HBaseAdmin admin = new HBaseAdmin(config);

admin.disableTable("tablename");

Using HBase API

HTableDescriptor: HTableDescriptor contains the name of an HTable, and its column families.

new HTableDescriptor()

new HTableDescriptor(String name)

• Ex: HTableDescriptor htd = new HTableDescriptor(tablename);

htd.addFamily(new HColumnDescriptor("Family"));

HColumnDescriptor: An HColumnDescriptor contains information about a column family

new HColumnDescriptor(String familyname)

• Ex:

HTableDescriptor htd = new HTableDescriptor(tablename);

HColumnDescriptor col = new HColumnDescriptor("content:");

htd.addFamily(col);

Using HBase API

HTable: Used for communication with a single HBase table.

new HTable(HBaseConfiguration conf, String tableName)

• Ex:HTable table = new HTable (conf, Bytes.toBytes ( tablename ));

ResultScanner scanner = table.getScanner ( family );

Put: Used to perform Put operations for a single row.

new Put(byte[] row)

new Put(byte[] row, RowLock rowLock)

• Ex:HTable table = new HTable (conf, Bytes.toBytes ( tablename ));

Put p = new Put ( brow );

p.add (family, qualifier, value);

table.put ( p );

Using HBase API

Get: Used to perform Get operations on a single row.

new Get (byte[] row)

new Get (byte[] row, RowLock rowLock)

• Ex: HTable table = new HTable(conf, Bytes.toBytes(tablename));

Get g = new Get(Bytes.toBytes(row));

Result: Single row result of a Get or Scan query.

new Result()

• Ex:HTable table = new HTable(conf, Bytes.toBytes(tablename));

Get g = new Get(Bytes.toBytes(row));

Result rowResult = table.get(g);

byte[] ret = rowResult.getValue(Bytes.toBytes(family), Bytes.toBytes(column));

Using HBase API

Scanner

• All operations are identical to Get

– Rather than specifying a single row, an optional startRow and stopRow may be defined.

• If rows are not specified, the Scanner will iterate over all rows.

– new Scan ()

– new Scan (byte[] startRow, byte[] stopRow)

– new Scan (byte[] startRow, Filter filter)

HBase Shell

• Non-SQL (intentional) “DSL”

• list : List all tables in hbase

• get : Get row or cell contents; pass table name, row, and optionally a dictionary of column(s), timestamp and versions.

• put : Put a cell 'value' at specified table/row/column and optionally timestamp coordinates.

• create : hbase> create 't1', {NAME => 'f1', VERSIONS => 5}

• scan : Scan a table; pass table name and optionally a dictionary of scanner specifications.

• delete : Put a delete cell value at specified table/row/column and optionally timestamp coordinates.

• enable : Enable the named table

• disable : Disable the named table: e.g. "hbase> disable 't1'"

• drop : Drop the named table.

HBase non-Java access

• Languages talking to the JVM:

– Jython interface to HBase

– Groovy DSL for HBase

– Scala interface to HBase

• Languages with a custom protocol

– REST gateway specification for HBase

– Thrift gateway specification for HBase

Example: Frequency Counter

• HBase has records of web_access_logs. We record each web page access by a user.

• The schema looks like this:

userID_timestamp => {
  details => {
    page:
  }
}

• We want to count how many times we have seen each user

row details:page

user1_t1 a.html

user2_t2 b.html

user3_t4 a.html

user1_t5 c.html

user1_t6 b.html

user2_t7 c.html

user4_t8 a.html

user count (frequency)

user1 3

user2 2

user3 1

user4 1
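The counting logic can be sketched without a cluster. The code below is plain Java, not the FreqCounter job itself: FreqCounterSketch and countByUser are hypothetical names, and the only assumption carried over from the schema is that row keys have the form userID_timestamp.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FreqCounterSketch {
    // Counts accesses per user from row keys of the form userID_timestamp.
    static Map<String, Integer> countByUser(List<String> rowKeys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String rowKey : rowKeys) {
            int sep = rowKey.lastIndexOf('_');
            String user = sep < 0 ? rowKey : rowKey.substring(0, sep); // "map": extract the user
            counts.merge(user, 1, Integer::sum);                        // "reduce": sum per user
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "user1_t1", "user2_t2", "user3_t4", "user1_t5",
                "user1_t6", "user2_t7", "user4_t8");
        System.out.println(countByUser(rows));
    }
}
```

In the MapReduce version, the extraction step would run in the mapper over each region's rows and the summation in the reducer, with the totals written to the summary table.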

Tutorial

• hbase shell

create 'access_logs', 'details'

create 'summary_user', {NAME=>'details', VERSIONS=>1}

• Add some data using Importer

• scan 'access_logs', {LIMIT => 5}

• Run 'FreqCounter'

• scan 'summary_user', {LIMIT => 5}

• Show output with PrintUserCount

Coprocessors

• The HBase 0.92 release provides coprocessor functionality, which includes

– observers (similar to triggers for certain events) and

– endpoints (similar to stored procedures to be invoked from the client)

• Observers can be at the region, master or at the WAL (Write Ahead Log) level.

• Once a Region Observer has been created, it can be specified in the hbase-default.xml which applies to all the regions and the tables in it or else the Region Observer can be specified on a table in which case it applies only to that table.

• Arbitrary code can run at each tablet in table server

• High-level call interface for clients

– Calls are addressed to rows or ranges of rows and the coprocessor client library resolves them to actual locations;

– Calls across multiple rows are automatically split into multiple parallelized RPC

• Provides a very flexible model for building distributed services

• Automatic scaling, load balancing, request routing for applications

Three observer interfaces

• RegionObserver: Provides hooks for data manipulation events, Get, Put, Delete, Scan, and so on. There is an instance of a RegionObserver coprocessor for every table region, and the scope of the observations they can make is constrained to that region.

• WALObserver: Provides hooks for write-ahead log (WAL) related operations. This is a way to observe or intercept WAL writing and reconstruction events. A WALObserver runs in the context of WAL processing. There is one such context per region server.

• MasterObserver: Provides hooks for DDL-type operation, i.e., create, delete, modify table, etc. The MasterObserver runs within the context of the HBase master.

Example

package org.apache.hadoop.hbase.coprocessor;

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;

// Sample access-control coprocessor. It utilizes RegionObserver
// and intercepts the preXXX() methods to check user privileges for the given table
// and column family.
public class AccessControlCoprocessor extends BaseRegionObserver {

  @Override
  public void preGet(final ObserverContext<RegionCoprocessorEnvironment> c,
      final Get get, final List<KeyValue> result) throws IOException {

    // check permissions..
    if (!permissionGranted()) {
      throw new AccessDeniedException("User is not allowed to access.");
    }
  }

  // override prePut(), preDelete(), etc.
}

Avoiding long pauses from the garbage collector

• Stop-the-world garbage collections are common in HBase, especially during loading.

• There are two issues to be addressed

– concurrent mark and sweep (CMS) performance, and

– fragmentation of the memstore.

• To address the first, start the CMS earlier than default by adding -XX:CMSInitiatingOccupancyFraction and setting it down from defaults. Start at 60 or 70 percent (The lower you bring down the threshold, the more GCing is done, the more CPU used).

• To address the second fragmentation issue, there is an experimental facility hbase.hregion.memstore.mslab.enabled (memstore local allocation buffer) to be set to true in configuration.

For loading data Pre-Create Regions

• Tables in HBase are initially created with one region by default.

• For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster.

• A useful pattern to speed up the bulk import process is to pre-create empty regions.

• Note that too-many regions can actually degrade performance.
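Pre-creating regions requires choosing the split points up front. The sketch below computes evenly spaced split keys under a strong assumption that is not from the slides: row keys begin with a two-hex-digit prefix ("00" to "ff") and are roughly uniformly distributed. SplitKeySketch and splitKeys are hypothetical names; how the keys are handed to HBase (for example, a table-creation call that accepts split keys, if your version provides one) is not shown.

```java
import java.util.ArrayList;
import java.util.List;

class SplitKeySketch {
    // Evenly spaced split points over a two-hex-digit key prefix ("00".."ff").
    // n regions need n - 1 split keys.
    static List<String> splitKeys(int regions) {
        if (regions < 1 || regions > 256) {
            throw new IllegalArgumentException("regions must be between 1 and 256");
        }
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < regions; i++) {
            keys.add(String.format("%02x", i * 256 / regions));
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(splitKeys(4));
    }
}
```

If the real keys are skewed (for example, sequential IDs or timestamps), evenly spaced splits will not balance the load, and the split points should be derived from a sample of the data instead.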

Enable Scan Caching

• When HBase is used as an input source for a MapReduce job, set the scan's caching (Scan.setCaching) to something greater than the default (which is 1).

• With the default value, the map task makes a call back to the region server for every record processed.

– Setting this value to 80, for example, will transfer 80 rows at a time to the client to be processed.

• There is a cost/benefit to have the cache value be large because it costs more in memory for both client and RegionServer, so bigger isn't always better.

• It appears from the experimentation that selecting a value between 50 and 100 gives good performance in our setup.
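The effect of the caching value on the number of calls to the region server is simple arithmetic, sketched below. This is a back-of-the-envelope model only: ScanCachingSketch and roundTrips are made-up names, and the model ignores memory use, row size and any other per-call costs.

```java
class ScanCachingSketch {
    // Number of fetches needed to read `rows` rows when `caching` rows come back per call.
    static long roundTrips(long rows, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be >= 1");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        System.out.println("caching=1:  " + roundTrips(rows, 1));
        System.out.println("caching=80: " + roundTrips(rows, 80));
    }
}
```

For one million rows, moving from a caching value of 1 to 80 cuts the number of fetches from 1,000,000 to 12,500; beyond some point the savings flatten out while the memory cost keeps growing.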

Right Scan Attribute Selection

• Whenever a Scan is used to process large numbers of rows (and especially when used as a MapReduce source), select only the attributes that are needed.

• If scan.addFamily is called then all of the attributes in the specified ColumnFamily will be returned to the client.

• If only a small number of the available attributes are to be processed, then only those attributes should be specified in the input scan because attribute over-selection is a non-trivial performance penalty over large datasets.

Optimize handler.count

• Count of RPC Listener instances spun up on RegionServers. The same property is used by the Master for the count of master handlers.

– Default is 10.

• This setting in essence sets how many requests are concurrently being processed inside the RegionServer at any one time.

• If multiple map-reduce jobs are running in the cluster and there is enough map capacity to handle the jobs concurrently, then this parameter needs to be tuned.

End of session

Day – 4: HBase
