HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

© 2014 MapR Technologies

HBase and Drill

How loosely typed SQL is ideal for NoSQL

Ted Dunning

June 10, 2015

Contact Information

Ted Dunning

Chief Applications Architect at MapR Technologies

Committer & PMC member for Apache Drill, ZooKeeper & Mahout

VP of Incubator at the Apache Software Foundation

Email [email protected] [email protected]

Twitter @ted_dunning

Hashtag today: #BDE2015



Agenda

• What does good mean?
• What do we mean by loose typing?
• Examples of what you can do
• Real database with 10-20x fewer tables
• Looking forward
• Questions


What Does Good Mean (for a DB)?

• Expressive
  – Must express the concepts we need
• Efficient
  – Must run fast enough on cheap enough hardware
• Introspectable
  – Must be able to inspect the data and schema and gain understanding


What is New Here

• Introspection is better when
  – A minimum of data entities are used to describe our model
  – No name overflow
  – Referential scoping helps narrow our focus to a simpler problem
  – Many-to-one relations can be inlined

• Introspection was not a goal for the design of the relational model

• Introspection was therefore not a result either


Older than Dirt

• Relational theory is old (1970)
  – Predates data structures
  – Predates mainstream recursive procedures
  – Predates lexical scoping
  – Predates logic programming
  – Predates real functional programming (Church, McCarthy, Iverson and Backus notwithstanding)

• Some updates are in order to enhance introspection


Contrast Relational and HBase Style noSQL

Relational
• Rows contain fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows

HBase / MapR DB
• Rows contain fields
• Fields contain bytes
• Structure is flexible
• No pre-defined structure
• Single key
• Column families
• Timestamps
• Versions


Contrast relational and HBase with Structuring

Relational
• Rows contain fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows

HBase + Structuring
• Rows contain fields
• Fields contain primitive types
  – Or objects, or lists
• Structure is flexible, ragged
• No pre-defined structure
• Single key


Turtle Models for Databases

• Allows complex objects in field values
  – JSON style lists and objects
• Allows references to objects via join
  – Includes references localized within lists
• Lists of objects and objects of lists are isomorphic to tables, so …
• Complex data in tables,
• But also tables in complex data,
• Even tables containing complex data containing tables


Proviso and Warning

• This is not your father’s BLOB
• And not the same as arrays with lateral view joins

• Rationale to come as we talk about idioms


A Catalog of noSQL Idioms


Tables as Objects, Objects as Tables

Row-wise form

Column-wise form

List of objects

Object containing lists
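The equivalence between these forms can be sketched in a few lines of Python. This is only an illustration under assumed data: the field names `time` and `value` and both helper functions are hypothetical, not part of HBase or Drill.

```python
def rows_to_columns(rows):
    """Turn a list of objects (row-wise) into an object containing lists (column-wise)."""
    columns = {}
    for i, row in enumerate(rows):
        for name, value in row.items():
            # A column first seen at row i is back-filled with None for earlier rows.
            columns.setdefault(name, [None] * i).append(value)
        for values in columns.values():
            # Columns missing from this row get None so all lists stay aligned.
            if len(values) == i:
                values.append(None)
    return columns


def columns_to_rows(columns):
    """Turn an object containing lists (column-wise) back into a list of objects."""
    names = list(columns)
    length = max((len(values) for values in columns.values()), default=0)
    return [{name: columns[name][i] for name in names} for i in range(length)]


# Hypothetical sample data in row-wise form.
row_wise = [
    {"time": 0, "value": 1.5},
    {"time": 1, "value": 1.7},
    {"time": 2, "value": 1.6},
]
column_wise = rows_to_columns(row_wise)
round_trip = columns_to_rows(column_wise)
```

For uniform rows the conversion is lossless in both directions; ragged rows are padded with `None` in this sketch.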


Micro Columnar Formats

An entire table stored in columnar form can be a first-class value using these techniques

This is very powerful for in-lining one-to-many relations.


Note

• If embedded tables are first-class, schema becomes data

• If schema is data-driven when embedded, constructs that elevate tables to top-level are impossible

• Thus, embedded first-class objects imply late discovery of schema information


A first example: Time-series data


Column names as data

• When column names are not pre-defined, they can convey information

• Examples
  – Time offsets within a window for time series
  – Top-level domains for web crawlers
  – Vendor IDs for customer purchase profiles

• Predefined schema is impossible for this idiom
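As a rough sketch of the time-offset case, the Python below models a table as a dict of rows, each row a dict of columns. The window size, the row-key layout and the function names are all assumptions made for illustration; they are not an HBase API.

```python
WINDOW_SECONDS = 3600  # hypothetical window size: one row per metric per hour


def to_cell(metric, timestamp, value):
    """Map one sample to (row key, column name, value)."""
    window_start = timestamp - timestamp % WINDOW_SECONDS
    row_key = f"{metric}:{window_start}"
    # The column name itself carries data: the offset within the window.
    column = str(timestamp - window_start)
    return row_key, column, value


def write(table, metric, timestamp, value):
    """Store a sample in a dict-of-dicts standing in for a wide-column table."""
    row_key, column, cell = to_cell(metric, timestamp, value)
    table.setdefault(row_key, {})[column] = cell


table = {}
write(table, "cpu", 7205, 0.91)
write(table, "cpu", 7217, 0.93)
write(table, "cpu", 10800, 0.40)
```

Because the set of offsets that occur is only known once data arrives, no fixed list of columns could be declared in advance.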


Relational Model for Time-series


Table Design: Point-by-Point


Table Design: Hybrid Point-by-Point + Sub-table

After close of window, data in row is restated as column-oriented tabular value in different column family.
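A minimal sketch of that restatement step, assuming a row is modeled as a dict of column families. The family names `points` and `columnar`, the column name `table` and the JSON encoding are hypothetical choices for illustration, not the format used in the talk.

```python
import json


def restate_window(row):
    """Move a closed window's point cells into one columnar value in another family."""
    points = sorted((int(offset), value) for offset, value in row["points"].items())
    table = {
        "offset": [offset for offset, _ in points],
        "value": [value for _, value in points],
    }
    # The whole sub-table becomes a single cell; the point-by-point cells are dropped.
    return {"points": {}, "columnar": {"table": json.dumps(table)}}


# Hypothetical open window with three samples written point by point.
open_row = {"points": {"17": 0.93, "5": 0.91, "0": 0.90}, "columnar": {}}
closed_row = restate_window(open_row)
sub_table = json.loads(closed_row["columnar"]["table"])
```

Writes stay cheap while the window is open, and reads of a closed window fetch one cell instead of one cell per sample.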


Compression Results

Samples are 64-bit time, 16-bit sample

Sampled at 10 kHz

Sample time jitter makes it important to keep original time-stamp

How much overhead to retain time-stamp?
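The uncompressed cost follows directly from the sample layout, and one common way to shrink jittered timestamps is to store only each sample's deviation from the nominal period. The sketch below assumes microsecond timestamps and uses hypothetical function names; it does not reproduce the measured results from the talk.

```python
TIME_BYTES = 8    # 64-bit timestamp
VALUE_BYTES = 2   # 16-bit sample
RATE_HZ = 10_000  # 10 kHz sampling

raw_bytes_per_second = (TIME_BYTES + VALUE_BYTES) * RATE_HZ
timestamp_share = TIME_BYTES / (TIME_BYTES + VALUE_BYTES)


def delta_encode(timestamps_us, period_us):
    """Keep the first timestamp plus each later gap's deviation from the nominal period."""
    residuals = [b - a - period_us for a, b in zip(timestamps_us, timestamps_us[1:])]
    return timestamps_us[0], residuals


def delta_decode(first, residuals, period_us):
    """Rebuild the original timestamps exactly from the encoded form."""
    timestamps = [first]
    for residual in residuals:
        timestamps.append(timestamps[-1] + period_us + residual)
    return timestamps


# Hypothetical jittered timestamps around a 100 microsecond period.
jittered = [1000, 1101, 1199, 1300]
first, residuals = delta_encode(jittered, period_us=100)
restored = delta_decode(first, residuals, period_us=100)
```

Stored naively, the timestamps account for 8 of every 10 bytes. The residuals are small integers that a general-purpose compressor can pack tightly, while the original time-stamps remain exactly recoverable.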


A second example: Music meta-data


MusicBrainz on NoSQL

• Artists, albums, tracks and labels are key objects
• Reality check:
  – Add works (compositions), recordings, release, release group
• 7 tables for artist alone
• 12 for place, 7 for label, 17 for release/group, 8 for work
  – (but only 4 for recording!)
  – Total of 12 + 7 + 17 + 8 + 4 = 48 tables
• But wait, there’s more!
  – 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
  – And 50 more tables that aren’t documented yet



180 tables not shown


236 tables to describe 7 kinds of things


Can we do better?







27 tables reduce to 4


27 tables reduce to 4 so far


Further Reductions

• All 86 link tables become properties on artists, releases and other entities

• All 44 tag, rating and annotation tables become list properties
• All 5 cover art tables become lists of file references

• Current score: 162 tables become 4

• You get the idea
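To make the link-table reduction concrete, here is a small sketch that folds rows of a separate link table into a list property on each artist. The field names (`artist_id`, `type`, `target`) and the sample rows are hypothetical and do not reflect the actual MusicBrainz schema.

```python
from collections import defaultdict


def inline_links(artists, link_rows):
    """Fold link-table rows into a `links` list property on each artist object."""
    links_by_artist = defaultdict(list)
    for link in link_rows:
        links_by_artist[link["artist_id"]].append(
            {"type": link["type"], "target": link["target"]}
        )
    # Artists without any link rows get an empty list rather than a missing field.
    return [dict(artist, links=links_by_artist.get(artist["id"], [])) for artist in artists]


# Hypothetical relational input: one entity table and one link table.
artists = [{"id": 1, "name": "Artist A"}, {"id": 2, "name": "Artist B"}]
artist_url_links = [
    {"artist_id": 1, "type": "official site", "target": "http://example.com/a"},
    {"artist_id": 1, "type": "fan page", "target": "http://example.com/a-fans"},
]
documents = inline_links(artists, artist_url_links)
```

After inlining, the link table is no longer needed as a separate top-level object; reading an artist returns its links in the same record.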


Is This Good?

• Expressivity
  – The JSON data model is at least as expressive as the original relational model
    • Many cases easier to describe in nested data
    • No cases are harder
• Efficiency
  – Inlining can increase data size. Locality improves, however
  – Sessionizing can substantially decrease data size
  – Inlining back-references is more efficient than ordinary indexes
  – Inlined columnar data allows 1000x speedup for time series
• Introspection (you decide)


But How Can We Query This?

• Can’t use SQL
  – SQL is strongly typed
  – SQL is heavily tied into the original relational model
  – SQL generating tools require relational model
• Must use SQL
  – Vast numbers of tools and people understand how to write SQL
  – SQL is the lingua franca of databases


Squaring the Circle

• Enter Apache Drill

• Drill is SQL compliant
  – Uses standard syntax and semantics
• Drill extends SQL
  – First class treatment of objects, lists
  – Full support for destructuring, flattening
  – Full power of relational model can be applied to complex data


Drill Provides Scalable and Extended SQL


Sample Query

• Find Elvis

    select distinct id, name, alias
    from (
      select id, name, flatten(alias.name) alias
      from artist
    )
    where alias like 'Elvis%Presley'
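What flattening does to a row can be emulated in plain Python: each element of a list-valued field yields its own output row, with the other columns copied. This is a sketch of the idea only, not Drill's implementation. The sample artists are made up, the aliases are simplified to plain strings, and the SQL `like` pattern is approximated with a shell-style wildcard.

```python
import fnmatch


def flatten(rows, field):
    """Yield one output row per element of the list stored in `field`."""
    for row in rows:
        # In this sketch, rows whose list is empty or missing produce no output.
        for element in row.get(field, []):
            yield dict(row, **{field: element})


# Hypothetical artist records with a list-valued alias field.
artists = [
    {"id": 1, "name": "Elvis Presley", "alias": ["Elvis Aaron Presley", "The King"]},
    {"id": 2, "name": "Elvis Costello", "alias": ["Declan MacManus"]},
]

flat = list(flatten(artists, "alias"))
# 'Elvis%Presley' in SQL LIKE corresponds roughly to 'Elvis*Presley' here.
matches = [row for row in flat if fnmatch.fnmatchcase(row["alias"], "Elvis*Presley")]
```

Once the list has been flattened into rows, ordinary relational operators such as filtering, `distinct` and joins apply without any special handling.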


Example Query

• Find discs where Elvis was credited

    select distinct album_id, name
    from (
      select id album_id, name, flatten(credit) artist_id
      from release
    ) albums
    join (
      select distinct artist_id
      from (
        select id artist_id, flatten(alias)
        from artist
        where name like 'Elvis%Presley'
      )
    ) artists
    using (artist_id)


Summary

• Extended relational model allows massive simplification
  – On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
  – This is good

• Apache Drill gives very high performance execution for extended relational problems

• You can try this out today



Short Books by Ted Dunning & Ellen Friedman

• Published by O’Reilly in 2014 and 2015
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR

http://bit.ly/ebook-real-world-hadoop

http://bit.ly/mapr-tsdb-ebook

http://bit.ly/ebook-anomaly

http://bit.ly/recommendation-ebook


Real World Hadoop by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)

Free copies at book signing today


Thank You!


Q & A

