Hadoop User Group France
HBase slides by Nicolas Liochon
What
• Random reads & random writes
• Scans
• Strong consistency
• Durability
• Latency
Usual context
• HBase: the Apache implementation of Google's Bigtable
• API (over-simplified)
– byte[] get(byte[] row)
– void put(byte[] row, byte[] key, byte[] value)
– byte[][] scan(byte[] rowBegin, byte[] rowEnd)
• Bigtable at Google
– Gmail, Google Docs and others
– 2.5 exabytes
– 600M QPS
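The over-simplified API above can be sketched as a toy in-memory model (this is not the real HBase client API; `ToyTable` and its String keys are stand-ins for illustration). A sorted map mirrors the key property HBase relies on: rows stored in order of their key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory model of the over-simplified get/put/scan API.
// String keys stand in for byte[] to keep the sketch readable.
public class ToyTable {
    // TreeMap keeps rows sorted by key, like HBase keeps rows ordered.
    private final TreeMap<String, Map<String, String>> rows = new TreeMap<>();

    public Map<String, String> get(String row) {
        return rows.get(row);
    }

    public void put(String row, String key, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(key, value);
    }

    // Scan returns all rows in [rowBegin, rowEnd), in key order.
    public List<Map<String, String>> scan(String rowBegin, String rowEnd) {
        return new ArrayList<>(rows.subMap(rowBegin, rowEnd).values());
    }

    public static void main(String[] args) {
        ToyTable t = new ToyTable();
        t.put("row2", "cf:a", "v2");
        t.put("row1", "cf:a", "v1");
        t.put("row3", "cf:a", "v3");
        System.out.println(t.get("row1").get("cf:a")); // v1
        System.out.println(t.scan("row1", "row3").size()); // 2 (end is exclusive)
    }
}
```

Because rows are sorted, `scan` is just a contiguous slice of the map, which is why range reads are cheap in this data model.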
Reads & Writes
• Random reads & writes: OLTP apps
– Per-row transactions
– Phoenix for the SQL fans
• Key-value store-like
Scans
• Data is ordered (between rows, and inside rows by columns)
• Stored contiguously on disk
• Typical use case: time series
– Get a subset of the time series for a given time interval
– Ordered by time series & time
• Access some data in a single hop
– Single machine / single disk read
• Big data duality
– Single machine: latency
– Parallel: throughput
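The time-series use case above rests on key design: if the row key is the series id followed by a zero-padded timestamp, all points of one series are contiguous and a time interval becomes a single range scan. A minimal sketch (the `TimeSeriesScan` class and key format are illustrative assumptions, not an HBase API):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Time-series pattern: row key = "<seriesId>#<zero-padded timestamp>",
// so one series' points are contiguous and an interval is one range scan.
public class TimeSeriesScan {
    private final TreeMap<String, Double> table = new TreeMap<>();

    // Zero-padding keeps lexicographic order equal to numeric order.
    private static String rowKey(String series, long ts) {
        return String.format("%s#%013d", series, ts);
    }

    public void put(String series, long ts, double value) {
        table.put(rowKey(series, ts), value);
    }

    // One contiguous read: timestamps in [tsBegin, tsEnd) for one series.
    public SortedMap<String, Double> scan(String series, long tsBegin, long tsEnd) {
        return table.subMap(rowKey(series, tsBegin), rowKey(series, tsEnd));
    }

    public static void main(String[] args) {
        TimeSeriesScan ts = new TimeSeriesScan();
        ts.put("cpu", 1000, 0.3);
        ts.put("cpu", 2000, 0.5);
        ts.put("cpu", 3000, 0.9);
        ts.put("mem", 1500, 0.7); // different series: outside the scan range
        System.out.println(ts.scan("cpu", 1000, 3000).size()); // 2
    }
}
```

On a single machine this is one disk-contiguous read (latency); across machines, disjoint key ranges can be scanned in parallel (throughput), which is the duality above.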
Technical requirements
• Durability & others
Durability
• Buffer in the client
• Buffer in the server
• Sync in OS cache
• Sync on disk
Each step trades cost against reliability.
Consistency 1/2
• Business need
– Needs to be implemented in the other layers if not available
• Easier increments & test-and-set operations
– Useful for some features
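The two primitives that strong per-row consistency makes cheap can be sketched in-memory (the `RowOps` class is a hypothetical stand-in; it is not the HBase client API, though HBase exposes similar increment and check-and-put operations):

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of atomic increment and test-and-set: easy to offer when one
// server owns the row, hard to retrofit on an eventually consistent store.
public class RowOps {
    private final ConcurrentHashMap<String, Long> counters = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, String> cells = new ConcurrentHashMap<>();

    // Atomic increment: no client-side read-modify-write race.
    public long increment(String row, long delta) {
        return counters.merge(row, delta, Long::sum);
    }

    // Test-and-set: write newValue only if the cell still holds expected
    // (expected == null means "only if the cell is absent").
    public synchronized boolean checkAndSet(String row, String expected, String newValue) {
        if (!Objects.equals(cells.get(row), expected)) return false;
        cells.put(row, newValue);
        return true;
    }

    public static void main(String[] args) {
        RowOps ops = new RowOps();
        ops.increment("hits", 1);
        System.out.println(ops.increment("hits", 2)); // 3
        System.out.println(ops.checkAndSet("lock", null, "me")); // true
        System.out.println(ops.checkAndSet("lock", null, "you")); // false: already taken
    }
}
```

The test-and-set shape is what features like lightweight locking or conditional updates build on.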
Consistency 2/2
• Simplifies testing
« eventual consistency is strong consistency 99% of the time » (Ian Varley / Salesforce)
Availability
• CAP
– CP vs. AP: consistent under partition
– Under partition, you can't be consistent without letting either C1 or C2 go
– In a big data system, you're unlikely to have all the data available in both partitions anyway
[Diagram: servers S1–S4, with clients C1 and C2 on opposite sides of the partition]
Availability
• HBase
– The data is replicated 3 times
– If there is at least one replica within the partition, we're good
– The partitioned nodes are excluded
– Data ownership goes to the non-partitioned nodes
• The data remains available & consistent
• Latency hit
Latency
• HBase targets millisecond latency
• Under normal operations
• Under failure
– Around 1 minute
– Can be less with configuration effort
– And GC tuning…
• Latency is about percentiles
– Average != 50th percentile
– There is often an order of magnitude between "average" and "95th percentile"
– Post-99% = the "magical 1%"
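The gap between average and percentiles is easy to see on a heavy-tailed sample (the numbers below are illustrative, not measured HBase latencies):

```java
import java.util.Arrays;

// Why "average" hides tail latency: on a heavy-tailed sample the mean
// sits far above the median, and p95 is another order of magnitude up.
public class Percentiles {
    // Nearest-rank percentile on a sorted array.
    static double percentile(double[] sorted, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        double[] latencies = new double[100];
        Arrays.fill(latencies, 0, 90, 1.0);    // 90 fast requests: 1 ms
        Arrays.fill(latencies, 90, 99, 100.0); // 9 slow requests: 100 ms
        latencies[99] = 2000.0;                // the "magical 1%": 2 s
        Arrays.sort(latencies);

        double mean = Arrays.stream(latencies).average().orElse(0);
        System.out.println("mean = " + mean + " ms"); // 29.9: ~30x the median
        System.out.println("p50  = " + percentile(latencies, 50) + " ms"); // 1.0
        System.out.println("p95  = " + percentile(latencies, 95) + " ms"); // 100.0
    }
}
```

A single 2-second outlier pulls the mean to ~30 ms while the median request still takes 1 ms, which is why SLAs are stated in percentiles, not averages.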
Normal operations
• Reads
– Cache!
– Google for "Nick Dimiduk"
• Writes
– Buffered in memory on all servers, then flushed
Failure
• Partitions
– Equivalent to a lot of single-node failures
– Hadoop's contract is also to have all data replicated 3 times
• Single node
– Data is always replicated
– Detect the failure (30s)
– Assign to another server (0s)
– Data replay (x * 10s): writes are available
– Client retry (0s)
Scoop: it could be much less
• Recovery is
– Failure detection
– Data recovery
• Failure detection is GC-driven
– Less than 10s is difficult
– But it can be delegated to the hardware
• Recovery can be minimized by keeping hot data in memory
– HDFS feature since ~2.x
– Then data recovery is ~1s
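The recovery budget above is just a sum of phases, which makes the tuning argument concrete (a back-of-the-envelope sketch; the exact numbers are illustrative, taken from the rough figures in these slides):

```java
// Back-of-the-envelope recovery budget: total downtime is the sum of
// the phases listed above. All numbers are illustrative.
public class RecoveryBudget {
    static double recoverySeconds(double detection, double assignment,
                                  double replay, double clientRetry) {
        return detection + assignment + replay + clientRetry;
    }

    public static void main(String[] args) {
        // Default-ish: GC-driven detection ~30s, tens of seconds of replay.
        System.out.println(recoverySeconds(30, 0, 30, 0)); // 60.0: ~1 minute
        // Tuned: faster detection, hot data in memory so replay is ~1s.
        System.out.println(recoverySeconds(5, 0, 1, 0)); // 6.0
    }
}
```

The sum shows where the leverage is: detection time and replay time dominate, and those are exactly the two knobs (hardware-assisted detection, hot data in memory) the slides call out.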
Latency: Stonebraker
• Bohrbugs. These are repeatable DBMS errors that cause the DBMS to crash. In other words, even when multiple data base replicas are available, the same transaction issued to the replicas will cause all of them to crash. No matter what, the world stops, and high availability is an impossible goal.
• Application errors. The application inadvertently updates (all copies) of the data base. The data base is now corrupted, and any sane DBA will stop the world and put the data base back into a consistent state. Again, high availability is impossible to achieve.
• Human error. A human types the database equivalent of RM * and causes a global outage. There is no possibility of continuing operation.
Applied same reasoning to latency
• In current deployments, GC and bugs are now more of an issue than recovery time
• Likely to be revisited in a year or so.
Conclusion?
HBase is about:
• Random access
• Transactions
• With durability
• And good performance
• Caching and GC are improving