Hypertable - massively scalable nosql database

Hypertable

An Open Source, High Performance,

Massively Scalable Database

Doug Judd CEO Hypertable Inc.

Three Reasons to Choose Hypertable

•  High Performance •  Open Source •  Future Direction SQL

Introduction

Highlights •  Modeled after Google’s Bigtable database •  High Performance Implementation (C++) •  Apache Thrift interface for all popular languages

(Java, PHP, Ruby, Python, Perl, etc) •  Broad Hadoop distribution support

o  Apache 2 o  Cloudera CDH3, CDH4, CDH5 o  IBM BigInsights 3 o  Hortonworks HDP2 o  MapR

•  Actively developed for 8 years

Open Source •  Licensed under the GPL •  Hosted on GitHub

o  git://github.com/hypertable/hypertable.git o  https://github.com/hypertable/hypertable.git

•  Online source documentation •  Mailing Lists

o  groups.google.com/group/hypertable-user o  groups.google.com/group/hypertable-dev

Bigtable •  Google’s most successful scalable database •  Bigtable underpins 100+ Google services •  YouTube, Blogger, Google Earth, Google Maps,

Orkut, Gmail, Google Analytics, Google Book Search, Google Code, Crawl Database, Google Code …

•  Data is physically ordered by primary key – it’s not a distributed hash table

How Hypertable Differs From A Traditional RDBMS

•  Horizontally Scalable •  Sparse Table Structure

o  Variable number of columns per-row o  Rows can have billions of columns

•  Cells can have multiple time stamped versions

Database Model •  Sparse, two-dimensional tables •  Cells can have multiple versions •  Cells addressed by 4-part key

o  Row o  Column family o  Column qualifier

o  Timestamp

Conceptual Table Representation

Actual Table Representation

Anatomy of a Key •  Column Family is 8-bit •  Timestamp and Revision are 64-bit integer

nanoseconds since Epoch •  Simple byte-wise comparison

Architecture

Table Growth Process

How Scaling Works

How Scaling Works

How Scaling Works

High Level Architecture






RangeServer Insert Handling

RangeServer Query Handling

Cellstore Format

Bloom Filter

Request Routing

Administration

Cluster Task AutomationTool

•  ht_cluster •  Modeled after Capistrano •  Role

o  Designates a function or service and the set of machines that will perform that function or service

o  Examples: Hyperspace, Master, Slave (RangeServer), ThriftBroker o  Machines can belong to one ore more roles

•  Task o  Script written for specific roles and used to manage the associated

function or service

o  Examples: start_hyperspace, stop_hyperspace

cluster.def INSTALL_PREFIX=/opt/hypertable HYPERTABLE_VERSION=0.9.8.2 PACKAGE_FILE=/tmp/hypertable-0.9.8.2-linux-x86_64.tar.gz FS=hadoop HADOOP_DISTRO=cdh4 ORIGIN_CONFIG_FILE=/root/hypertable.cfg PROMPT_CLEAN=true role: source test00 role: master test[00-02] role: hyperspace test[00-02] role: slave test[03-99] - test37 role: thriftbroker role: spare include: "core.tasks"

Common Tasks ht cluster start

ht cluster stop

ht cluster push_config

ht cluster install_package

ht cluster upgrade

Monitoring

Ganglia Metrics

Thrift Broker Metrics Metric Units Connections count Requests requests/s Errors errors/s Virtual Memory GB Resident Memory GB Heap Size GB Heap Slack Bytes GB CPU user percentage CPU sys percentage Version string

Range Server Metrics Metric Units Scans scans/s Updates updates/s Bytes Returned bytes/s Bytes Scanned bytes/s Byte Scan Yield percentage Bytes WriUen bytes/s Cells Returned cells/s Cells Scanned cells/s Cell Scan Yield percentage Outstanding Scanners count Request Backlog count

Metric Units Major Compactions count Minor Compactions count Merging Compactions count GC Compactions count Virtual Memory GB Resident Memory GB Heap Size GB Heap Slack Bytes GB Tracked Memory GB CPU user percentage CPU sys percentage

Range Server Metrics Metric Units Ranges count CellStores count Block Cache Hits percentage Block Cache Memory GB Block Cache Fill GB Query Cache Hits Percentage Query Cache Memory GB Query Cache Fill GB Version string

FS Broker Metrics Metric Units Read Throughput MB/s Write Throughput MB/s Syncs syncs/s Sync Latency milliseconds Errors count JVM GCs count JVM GC Time milliseconds JVM Heap Size GB Virtual Memory GB Resident Memory GB

Metric Units Heap Size GB Heap Slack Bytes GB CPU user percentage CPU sys percentage Version string

Master and Hyperspace Metrics

Metric Units Operations operations/s Virtual Memory GB Resident Memory GB Heap Size GB Heap Slack Bytes GB CPU user percentage CPU sys percentage Version string

Metric Units Requests requests/s Virtual Memory GB Resident Memory GB Heap Size GB Heap Slack Bytes GB CPU user percentage CPU sys percentage Version string

Master Hyperspace

Slow Query Log •  ThriftBroker feature •  Logs queries that

take longer than 10 seconds

•  Log line format o  End time (seconds) o  Start time (seconds) o  Function called o  Client IP/port o  Latency (milliseconds) o  Sub-scanner count

o  Bytes Returned o  Bytes Scanned o  Disk read o  Servers contacted o  Namespace o  HQL representation of query

Features

Namespaces

Namespaces USE ‘/’;

CREATE NAMESPACE foo; USE foo;

CREATE NAMESPACE bar;

CREATE TABLE mytable (a, b, c); GET LISTING;

(bar) namespace mytable

Atomic Counters •  Column option:

CREATE TABLE counts ( url COUNTER );

•  Modified via existing API using specially formatted values:

Value Format Description [+]n Increment counter by n -n Decrement counter by n =n Reset counter to n

Secondary Indexes

Total Cells Inserted: 1 billion Total Time Taken: 45 minutes

Aggregate Throughput (inserts/s): 372,362 Aggregate Throughput (bytes/s): 14,763,300

§  Six test machines -  Dual Six-core Opteron HE Processors -  24 GB RAM -  4X 2TB SATA drives

§  Single Indexed column -  Key: randomly generated 20-byte integer -  Value: two randomly chosen words from /usr/share/dict/

words

Secondary Indexes (HQL)

CREATE TABLE products ( title, section, info, category,

INDEX section, INDEX info, QUALIFIER INDEX info, QUALIFIER INDEX category );

Secondary Indexes SELECT title FROM products WHERE info:actor = “Jack Nicholson”;

B00002VWE0 title Five Easy Pieces (1970) B002VWNIDG title The Shining (1980)

Secondary Indexes SELECT title, info:author FROM products WHERE info:author =~ /^Stephen [PK]/;

0307743659 title The Shining Mass Market Paperback 0307743659 info:author Stephen King 0321776402 title C++ Primer Plus (6th Edition) (Developer's Library)

0321776402 info:author Stephen Prata

Secondary Indexes SELECT title FROM products WHERE Exists(info:studio);

B00002VWE0 title Five Easy Pieces (1970) B000Q66J1M title 2001: A Space Odyssey [Blu-ray] B002VWNIDG title The Shining (1980)

Secondary Indexes SELECT title FROM products WHERE info:author =~ /^Stephen P/ OR info:publisher =~ /Ânchor/; 0307743659 title The Shining Mass Market Paperback 0321776402 title C++ Primer Plus (6th Edition) (Developer's Library)

Secondary Indexes SELECT title FROM products WHERE info:author =~ /^Stephen [PK]/ AND info:publisher =~ /Ânchor/; 0307743659 title The Shining Mass Market Paperback

Secondary Indexes SELECT title FROM products WHERE ROW =^ 'B' AND info:actor = 'Jack Nicholson'; B00002VWE0 title Five Easy Pieces (1970) B002VWNIDG title The Shining (1980)

Regex Filtering •  Google’s RE2 regular expression engine

o  Extremely fast (up to 50X Java regex) o  Searches run in time linear in the size of the

input o  Searches constrained to a fixed amount of

memory

•  Supported Searches: o  Row key o Column qualifier o Value

Regex Filtering SELECT info:/â/ FROM products; 0307743659 info:author Stephen King 0321321928 info:author Stephen C. Dewhurst 0321776402 info:author Stephen Prata B00002VWE0 info:actor Karen Black B00002VWE0 info:actor Jack Nicholson B000Q66J1M info:actor Gary Lockwood B000Q66J1M info:actor Keir Dullea B002VWNIDG info:actor Shelley Duvall B002VWNIDG info:actor Jack Nicholson

Regex Filtering

SELECT title FROM products

WHERE ROW REGEXP "2";

0321321928 title C++ Common Knowledge: Essential Intermediate Programming [Paperback]

0321776402 title C++ Primer Plus (6th Edition) (Developer's Library)


Regex Filtering

SELECT title FROM products WHERE VALUE REGEXP "\("; 0321776402 title C++ Primer Plus (6th Edition) (Developer's Library)


Hadoop MapReduce •  MapReduce Input/Output formats

o  Normal (mapreduce) o  Streaming (mapred)

•  Load data from HT to Hive and vice-versa •  Use Hive types •  Use Hive QL (joins, aggregations) •  Low latency data warehousing •  Uses Hypertable’s native MapReduce Input/Output

format

Column Family Options •  TTL=<t>

o  “time to live” o  Remove cells that are older than <t>

•  MAX_VERSIONS=<n> o  Keep only most recent <n> cell versions

Access Groups CREATE TABLE User ( name, address, photo, profile, ACCESS GROUP default (name, address, photo), ACCESS GROUP profile (profile) );

Adaptive Memory Allocation

Group Commit •  Supports highly concurrent updates •  Trades average latency for better throughput •  By default, commit log writes are auto-coalesced •  Commit log write interval can be statically

configured per-table: CREATE TABLE counts ( url, domain ) GROUP_COMMIT_INTERVAL=100;

Caching •  Block Cache

o  Caches CellStore blocks o  Can be configured to store blocks compressed or

uncompressed (default = compressed) o  Dynamically adjusted size based on workload

•  Query Cache o  Caches query results o  Caches single row queries only

Compression •  Cell Store blocks are compressed •  Commit Log updates are compressed •  Supported Compression Schemes:

bmz, lzo, quicklz, snappy, zlib, none

•  Quicklz performance numbers:

Language Compression Speed (MB/s)

Decompression Speed (MB/s)

C++ 308 358

Java 127 95

Performance Study

Hypertable vs. HBase •  Modeled after test described in Bigtable paper •  Hypertable 0.9.5.5 vs. HBase 0.90.4 •  16-node Cluster

o  CPU: 2X AMD C32 Six-core model 4170 HE 2.1GHz o  RAM: 24GB o  Disk: 4X 2TB SATA

•  Tests Run o  Random Write o  Scan o  Random Read Zipfian o  Random Read Uniform

Random Write

Scan

Random Read Zipfian

Case Studies

•  Operational Data Store •  System metrics

o  CPU o  Memory o  IO o  Network

•  Application metrics o  Web o  DB o  Caches

•  Business metrics o  Usage o  Revenue

Case Study: Noah System

•  Storage Capacity o  Up to 100TB o  Up to 1 trillion records

•  Automatic Sharding o  Irregular data growth patterns

•  Heavy Writes o  ~30K inserts/s

•  Fast Reads of Recent Data •  Table Scans

System Requirements

Architecture Diagram

•  2nd Largest Indian Internet Portal •  Rediffmail

o  One of the world’s largest email services o  Over 100 Million registered users

•  Active Deployments o  Rediffmaill

o  Email SPAM classification o  News Crawl Database o  Recommendation System

Case Study: Rediff

Architectural Overview

Query Latency

Summary

•  High Performance •  Open Source •  Future Direction SQL

The End

Technology

Hypertable - massively scalable nosql database