
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera


Description

Apache HDFS, the file system on which HBase is most commonly deployed, was originally designed for high-latency, high-throughput batch analytic systems like MapReduce. Over the past two to three years, the rising popularity of HBase has driven many enhancements in HDFS to improve its suitability for real-time systems, including durability support for write-ahead logs, high availability, and improved low-latency performance. This talk will give a brief history of some of the enhancements from Hadoop 0.20.2 through 0.23.0, discuss some of the most exciting work currently under way, and explore some of the future enhancements we expect to develop in the coming years. We will include both high-level overviews of the new features and practical tips and benchmark results from real deployments.


Page 1: HBase and HDFS: Past, Present, Future

Todd Lipcon, [email protected]
Twitter: @tlipcon  #hbase IRC: tlipcon
May 22, 2012

Page 2: Intro / who am I?

- Been working on data stuff for a few years
- HBase, HDFS, MR committer
- Cloudera engineer since March '09

[Charts: (a) my posts to hbase-dev; (b) my posts to (core|hdfs|mapreduce)-dev]

You know I'm an engineer since my slides are ugly and written in LaTeX.

Page 3: Framework for discussion

- Time periods
  - Past (Hadoop pre-1.0)
  - Present (Hadoop 1.x, 2.0)
  - Future (Hadoop 2.x and later)
- Categories
  - Reliability/Availability
  - Performance
  - Feature set

Page 4: HDFS and HBase History - 2006

Author: Douglass Cutting <[email protected]>
Date:   Fri Jan 27 22:19:42 2006 +0000

    Create hadoop sub-project.

Page 5: HDFS and HBase History - 2007

Author: Douglass Cutting <[email protected]>
Date:   Tue Apr 3 20:34:28 2007 +0000

    HADOOP-1045. Add contrib/hbase, a BigTable-like online database.

Page 6: HDFS and HBase History - 2008

Author: Jim Kellerman <[email protected]>
Date:   Tue Feb 5 02:36:26 2008 +0000

    2008/02/04 HBase is now a subproject of Hadoop. The first HBase
    release as a subproject will be release 0.1.0, which will be
    equivalent to the version of HBase included in Hadoop 0.16.0...

Page 7: HDFS and HBase History - Early 2010

HBase has been around for 3 years, but HDFS still acts like MapReduce is the only important client! :(

People have accused HDFS of being like a molasses train: high throughput, but not so fast.

Page 8: HDFS and HBase History - 2010

- HBase becomes a top-level project
- Facebook chooses HBase for its Messages product
- Jump from HBase 0.20 to HBase 0.89 and 0.90
- First CDH3 betas include HBase
- HDFS community starts to work on features for HBase
- Infamous hadoop-0.20-append branch

Page 9: What did we get done? And where are we going?

Page 10: Reliability in the past: Hadoop 1.0

- Pre-1.0: if the DN crashed, HBase would lose its WALs (and your beloved data).
- 1.0 integrated the hadoop-0.20-append branch into a mainline release
  - True durability support for HBase
  - We have a fighting chance at metadata reliability!
- Numerous bug fixes for write-pipeline recovery and other error paths
- HBase is not nearly as forgiving as MapReduce!
  - "Single-writer" fault tolerance vs. "job-level" fault tolerance

Page 11: Reliability in the past: Hadoop 1.0

- Pre-1.0: if any disk failed, the entire DN would go offline
  - Problematic for HBase: the local RS would lose all locality!
- 1.0: per-disk failure detection in the DN (HDFS-457)
  - Allows HBase to lose a disk without losing all locality

Tip: configure dfs.datanode.failed.volumes.tolerated = 1

Page 12: Reliability today: Hadoop 2.0

- Integrates Highly Available HDFS
  - Active-standby hot failover removes the SPOF
  - Transparent to clients: no HBase changes necessary
  - Tested extensively under HBase read/write workloads
- Coupled with HBase master failover, no more HBase SPOF!

Page 13: HDFS HA (diagram)

Page 14: Reliability in the future: HA in 2.x

- Remove dependency on NFS (HDFS-3077)
  - Quorum-commit protocol for NameNode edit logs
  - Similar to ZAB/Multi-Paxos
- Automatic failover for HA NameNodes (HDFS-3042)
  - ZooKeeper-based master election, just like HBase
  - Merge to trunk should be this week

Page 15: Other reliability work for HDFS 2.x

- 2.0: the current hflush() API only guarantees that data is replicated to three machines, not that it is fully on disk.
  - A cluster-wide power outage can lose data.
- Upcoming in 2.x: support for hsync() (HDFS-744, HBASE-5954)
  - Calls fsync() for all replicas of the WAL
  - Full durability of edits, even through a full-cluster power outage (see the sketch below)
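To make the distinction concrete, here is a minimal client-side sketch of the two calls on FSDataOutputStream. The file path is made up for illustration, and hsync() assumes a build that includes HDFS-744:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalDurabilityDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical WAL-like file, for illustration only.
    FSDataOutputStream out = fs.create(new Path("/tmp/demo-wal"));
    out.write("edit-1".getBytes("UTF-8"));

    // hflush(): the edit is visible to new readers and has reached all
    // DataNodes in the pipeline -- but it may still sit in each DN's OS
    // page cache, so a cluster-wide power outage could lose it.
    out.hflush();

    // hsync() (HDFS-744): additionally forces an fsync() of the block
    // file on each DataNode, so the edit survives a power outage.
    out.hsync();

    out.close();
  }
}
```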

Page 16: hflush() and hsync() (diagram)

Page 17: HDFS wire compatibility in Hadoop 2.0

- In 1.0: the HDFS client version must match the server version closely
  - How many of you have manually copied HDFS client jars?
- Client-server compatibility in 2.0:
  - Protobuf-based RPC
  - Easier HBase installs: no more futzing with jars
  - Separate HBase upgrades from HDFS upgrades
- Intra-cluster server compatibility in the works
  - Allows rolling upgrade without downtime

Page 18: Performance: Hadoop 1.0

- Pre-1.0: even for reads from the local machine, the client connects to the DN via TCP
- 1.0: short-circuit local reads
  - Obtains direct access to the underlying local block file, then uses regular FileInputStream access
  - 2x speedup for random reads
- Configure dfs.client.read.shortcircuit = true (client-side; sketch below)
- Configure dfs.block.local-path-access.user = hbase
- Configure dfs.datanode.data.dir.perm = 755
- Currently does not support security :(
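As a rough illustration of the client side of this feature (the other two keys above are server-side settings that belong in the DataNode's hdfs-site.xml; the key names are the 1.0-era ones):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShortCircuitClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask the DFS client to bypass the DataNode and read local block
    // files directly when the data lives on this machine.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    FileSystem fs = FileSystem.get(conf);
    // Reads through fs of locally-hosted blocks now use plain
    // FileInputStream access instead of a TCP round-trip to the DN.
    System.out.println("connected to " + fs.getUri());
  }
}
```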

Page 19: Performance: Hadoop 2.0

- Pre-2.0: up to 50% of CPU spent verifying CRC
- 2.0: native checksums using SSE4.2 crc32 asm (HDFS-2080)
  - 2.5x speedup reading from the buffer cache
  - Now only 15% CPU overhead for checksumming
- Pre-2.0: re-establishes a TCP connection to the DN for each seek
- 2.0: rewritten BlockReader, keepalive to the DN (HDFS-941)
  - 40% improvement on random read for HBase
  - 2-2.5x in micro-benchmarks
- Total improvement vs. 0.20.2: 3.4x!

Page 20: Performance: Hadoop 2.x

- Currently: lots of CPU spent copying data in memory
- "Direct-read" API: read directly into user-provided DirectByteBuffers (HDFS-2834)
  - Another ~2x improvement to sequential throughput reading from cache (sketch below)
- Opportunity to avoid two more buffer copies reading compressed data (HADOOP-8148)
  - Codec APIs still in progress; needs integration into HBase
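A minimal sketch of what the direct-read path looks like from a client, assuming a build where FSDataInputStream exposes the ByteBuffer read from HDFS-2834; the file path is made up:

```java
import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectReadDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path("/tmp/some-hfile"));

    // Off-heap buffer: HDFS can fill it directly, skipping the extra
    // copy into an on-heap byte[].
    ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
    // Throws UnsupportedOperationException on streams that do not
    // implement ByteBufferReadable.
    int n = in.read(buf);
    System.out.println("read " + n + " bytes");
    in.close();
  }
}
```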

Page 21: Performance: Hadoop 2.x

- True "zero-copy read" support (HDFS-3051)
  - New API would allow direct access to mmapped block files
  - No syscall or JNI overhead for reads
  - Initial benchmarks indicate at least ~30% gain
  - Some open questions around the best safe implementation

Page 22: Current read path (diagram)

Page 23: Proposed read path (diagram)

Page 24: Performance: why emphasize CPU?

- Machines with lots of RAM are now inexpensive (48-96GB common)
  - We want to use that RAM to improve cache hit ratios
  - Unfortunately, 50GB+ Java heaps are still impractical (GC pauses too long)
  - So allocate the extra RAM to the OS buffer cache
  - The OS caches compressed data: another win!
- CPU overhead reading from the buffer cache thus becomes the limiting factor for read workloads

Page 25: What's up next in 2.x?

- HDFS hard links (HDFS-3370)
  - Will allow HBase to clone/snapshot tables efficiently!
  - Improves the HBase table-scoped backup story
- HDFS snapshots (HDFS-2802)
  - HBase-wide snapshot support for point-in-time recovery
  - Enables consistent backups copied off-site for DR

Page 26: What's up next in 2.x?

- Improved block placement policies (HDFS-1094)
- Fundamental tradeoff between the probability of data unavailability and the amount of data that becomes unavailable
  - Current scheme: if any 3 nodes not on the same rack die, some very small amount of data is unavailable
  - Proposed scheme: lessens the chance of unavailability, but if a certain three nodes die, a larger amount is unavailable
  - For many HBase applications, any single lost block halts the whole operation, so prefer to minimize the probability (back-of-envelope sketch below)
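A rough back-of-envelope sketch of that tradeoff (my illustration; these formulas are not from the talk). With n DataNodes, b blocks, and each block's three replicas placed on an effectively random node triple, the probability that a given simultaneous 3-node failure loses at least one block is:

```latex
P(\text{some block lost}) \;=\; 1 - \Bigl(1 - \tbinom{n}{3}^{-1}\Bigr)^{b}
\;\approx\; 1 \quad \text{when } b \gg \tbinom{n}{3}
```

So on a large cluster, nearly every 3-node failure loses a small amount of data. If placement is instead restricted to g fixed 3-node groups, only g of the C(n,3) possible triples are fatal, cutting the failure probability by a factor of roughly g/C(n,3), but each fatal failure now loses about b/g blocks at once.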

Page 27: What's up next in 2.x?

- HBase-specific block placement hints (HBASE-4755)
  - Assign each region a set of three RSs (a primary and two backups)
  - Place the underlying data blocks on those three DNs
  - Could then fail over and load-balance without losing any locality!

Page 28: Summary

|              | Hadoop 1.0                  | Hadoop 2.0                | Hadoop 2.x                                                  |
|--------------|-----------------------------|---------------------------|-------------------------------------------------------------|
| Availability | DN volume failure isolation | NameNode HA; wire compat  | HA without NAS; rolling upgrade                             |
| Performance  | Short-circuit reads         | Native CRC; DN keepalive  | Direct-read API; zero-copy API; direct codec API            |
| Features     | Durable hflush()            |                           | hsync(); snapshots; hard links; HBase-aware block placement |

Page 29: Summary

- HBase is no longer a second-class citizen.
- We've come a long way since Hadoop 0.20.2 in performance, reliability, and availability.
- New features are coming in the 2.x line specifically to benefit HBase use cases.

Hadoop 2.0 features are available today via the CDH4 beta. Several Cloudera customers are already using CDH4b2 with HBase with great success.

The official Hadoop 2.0 release and CDH4 GA are coming soon.

Page 30: Questions?

[email protected]
Twitter: @tlipcon
#hbase IRC: tlipcon

P.S. we're hiring!