© 2014 MapR Technologies

HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Ted Dunning

June 10, 2015

Contact Information

Ted Dunning

Chief Applications Architect at MapR Technologies

Committer and PMC member for Apache Drill, ZooKeeper and Mahout

VP of Incubator at the Apache Software Foundation

Email [email protected] [email protected]

Twitter @ted_dunning

Hashtag today: #BDE2015

Agenda

• What does good mean?
• What do we mean by loose typing?
• Examples of what you can do
• Real database with 10-20x fewer tables
• Looking forward
• Questions

What Does Good Mean (for a DB)?

• Expressive
  – Must express the concepts we need
• Efficient
  – Must run fast enough on cheap enough hardware
• Introspectable
  – Must be able to inspect the data and schema and gain understanding

What is New Here

• Introspection is better when
  – A minimum of data entities are used to describe our model
  – No name overflow
  – Referential scoping helps narrow our focus to a simpler problem
  – Many-to-one relations can be in-lined
• Introspection was not a goal for the design of the relational model
• Introspection was therefore not a result either

Older than Dirt

• Relational theory is old (1970)
  – Predates data structures
  – Predates mainstream recursive procedures
  – Predates lexical scoping
  – Predates logic programming
  – Predates real functional programming (Church, McCarthy, Iverson and Backus notwithstanding)
• Some updates are in order to enhance introspection

Contrast Relational and HBase-Style NoSQL

Relational
• Rows containing fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows

HBase / MapR DB
• Rows contain fields
• Fields contain bytes
• Structure is flexible
• No pre-defined structure
• Single key
• Column families
• Timestamps
• Versions

Contrast Relational and HBase with Structuring

Relational
• Rows containing fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows

HBase + Structuring
• Rows contain fields
• Fields contain primitive types
  – Or objects, or lists
• Structure is flexible, ragged
• No pre-defined structure
• Single key

Turtle Models for Databases

• Allow complex objects in field values
  – JSON-style lists and objects
• Allow references to objects via join
  – Includes references localized within lists
• Lists of objects and objects of lists are isomorphic to tables, so …
• Complex data in tables,
• But also tables in complex data,
• Even tables containing complex data containing tables

Proviso and Warning

• This is not your father’s BLOB
• And not the same as arrays with lateral view joins
• Rationale to come as we talk about idioms

A Catalog of NoSQL Idioms

Tables as Objects, Objects as Tables

• Row-wise form: a list of objects
• Column-wise form: an object containing lists
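The two forms can be converted into each other mechanically. Here is a minimal Python sketch, assuming every row carries the same fields. The function names and the sample data are illustrative and do not come from the talk.

```python
def rows_to_columns(rows):
    """Turn a list of objects (row-wise) into an object of lists (column-wise)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns):
    """Turn an object of lists back into a list of objects."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Hypothetical sample table
rows = [
    {"t": 0, "x": 1.5},
    {"t": 1, "x": 1.7},
    {"t": 2, "x": 1.6},
]
columns = rows_to_columns(rows)
```

Because the round trip loses nothing for uniform rows, either form can be stored as a single field value.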

Micro Columnar Formats

An entire table stored in columnar form can be a first-class value using these techniques.

This is very powerful for in-lining one-to-many relations.

Note

• If embedded tables are first-class, schema becomes data
• If schema is data-driven when embedded, constructs that elevate tables to top level are impossible
• Thus, embedded first-class objects imply late discovery of schema information

A first example: Time-series data

Column names as data

• When column names are not pre-defined, they can convey information
• Examples
  – Time offsets within a window for time series
  – Top-level domains for web crawlers
  – Vendor IDs for customer purchase profiles
• Predefined schema is impossible for this idiom
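The time-series case can be sketched with a plain dictionary standing in for a wide-column table. This is only an illustration: the one-hour window, the row-key layout and the name put_sample are assumptions, not details given in the talk.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # assumed window length: one hour


def put_sample(table, metric, timestamp, value):
    """Store one sample; the column name is the offset within the window."""
    offset = timestamp % WINDOW_SECONDS
    row_key = f"{metric}:{timestamp - offset}"
    table[row_key][offset] = value


table = defaultdict(dict)  # row key -> {column name -> value}
put_sample(table, "cpu", 7205, 0.50)
put_sample(table, "cpu", 7262, 0.75)
put_sample(table, "cpu", 10801, 0.25)
```

The set of column names differs from row to row and is only known once the data arrives, which is why no fixed schema can describe it.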

Relational Model for Time-series

Table Design: Point-by-Point

Table Design: Hybrid Point-by-Point + Sub-table

After the window closes, the data in the row is restated as a column-oriented tabular value in a different column family.
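A rough sketch of that restatement step follows. The column-family names "points" and "packed", the column name "samples" and the JSON encoding are all assumptions made for illustration. The talk does not say how the tabular value is serialized.

```python
import json


def close_window(row):
    """Restate the per-point columns of a closed window as one columnar value."""
    points = row["points"]
    offsets = sorted(points)
    packed = {"offset": offsets, "value": [points[o] for o in offsets]}
    return {"points": {}, "packed": {"samples": json.dumps(packed)}}


# Hypothetical open-window row: one column per sample in family "points"
open_row = {"points": {62: 0.75, 5: 0.50, 130: 0.60}, "packed": {}}
closed_row = close_window(open_row)
```

Writes stay cheap while the window is open because each sample is its own column. Reads of a closed window fetch one value instead of many.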

Compression Results

Samples are 64-bit time, 16-bit sample

Sampling at 10 kHz

Sample time jitter makes it important to keep the original time-stamp

How much overhead to retain the time-stamp?
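One way to reason about that question is delta encoding. This technique is assumed here and is not necessarily what the talk measured. The sketch below uses synthetic timestamps in microseconds with a nominal 100 µs spacing and a few microseconds of jitter. It keeps every original timestamp exactly while storing 16 bits per sample instead of 64.

```python
import random
import struct


def encode_times(times_us):
    """Pack the first timestamp as 64 bits and each later delta as 16 bits."""
    deltas = [b - a for a, b in zip(times_us, times_us[1:])]
    return struct.pack(f"<q{len(deltas)}h", times_us[0], *deltas)


def decode_times(blob):
    """Invert encode_times by accumulating the deltas."""
    count = (len(blob) - 8) // 2
    first, *deltas = struct.unpack(f"<q{count}h", blob)
    times = [first]
    for delta in deltas:
        times.append(times[-1] + delta)
    return times


# Synthetic data: nominal 100 microsecond spacing (10 kHz) with small jitter
random.seed(0)
times = [1_000_000]
for _ in range(9_999):
    times.append(times[-1] + 100 + random.randint(-5, 5))

blob = encode_times(times)
raw_bytes = 8 * len(times)   # one 64-bit timestamp per sample
packed_bytes = len(blob)     # 8 bytes, then 2 bytes per later sample
```

Under these assumptions the timestamps shrink by roughly a factor of four before any general-purpose compression is applied. The sketch only works while every delta fits in a signed 16-bit integer.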

A second example: Music meta-data

MusicBrainz on NoSQL

• Artists, albums, tracks and labels are key objects
• Reality check:
  – Add works (compositions), recordings, release, release group
• 7 tables for artist alone
• 12 for place, 7 for label, 17 for release/group, 8 for work
  – (but only 4 for recording!)
  – Total of 12 + 7 + 17 + 8 + 4 = 48 tables
• But wait, there’s more!
  – 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
  – And 50 more tables that aren’t documented yet

180 tables not shown

236 tables to describe 7 kinds of things

Can we do better?

27 tables reduce to 4 so far

Further Reductions

• All 86 link tables become properties on artists, releases and other entities
• All 44 tag, rating and annotation tables become list properties
• All 5 cover art tables become lists of file references
• Current score: 162 tables become 4
• You get the idea
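The move from a side table to a list property can be sketched as follows. The table and field names are hypothetical stand-ins, not the actual MusicBrainz schema, and inline_children is an illustrative helper.

```python
def inline_children(parents, children, foreign_key, prop):
    """Fold rows of a child table into a list property on each parent."""
    by_id = {p["id"]: {**p, prop: []} for p in parents}
    for child in children:
        item = {k: v for k, v in child.items() if k != foreign_key}
        by_id[child[foreign_key]][prop].append(item)
    return list(by_id.values())


# Hypothetical rows standing in for an artist table and an artist tag table
artists = [
    {"id": 1, "name": "Elvis Presley"},
    {"id": 2, "name": "Elvis Costello"},
]
artist_tags = [
    {"artist": 1, "tag": "rock and roll"},
    {"artist": 1, "tag": "gospel"},
    {"artist": 2, "tag": "new wave"},
]
docs = inline_children(artists, artist_tags, "artist", "tags")
```

Applying the same step once per side table leaves one document per entity, with the former tables carried as nested lists.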

Is This Good?

• Expressivity
  – The JSON data model is at least as expressive as the original relational model
    • Many cases are easier to describe in nested data
    • No cases are harder
• Efficiency
  – Inlining can increase data size. Locality improves, however
  – Sessionizing can substantially decrease data size
  – Inlining back-references is more efficient than ordinary indexes
  – Inlined columnar data allows 1000x speedup for time series
• Introspection (you decide)

But How Can We Query This?

• Can’t use SQL
  – SQL is strongly typed
  – SQL is heavily tied into the original relational model
  – SQL-generating tools require the relational model
• Must use SQL
  – Vast numbers of tools and people understand how to write SQL
  – SQL is the lingua franca of databases

Squaring the Circle

• Enter Apache Drill
• Drill is SQL compliant
  – Uses standard syntax and semantics
• Drill extends SQL
  – First-class treatment of objects, lists
  – Full support for destructuring, flattening
  – Full power of relational model can be applied to complex data

Drill Provides Scalable and Extended SQL

Sample Query

• Find Elvis

    select distinct id, name, alias
    from (
      select id, name, flatten(alias.name) alias
      from artist
    )
    where alias like 'Elvis%Presley'
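To make the flattening step concrete, here is a small Python emulation of what the query asks for: one output row per alias, then a pattern filter. It illustrates the semantics only and says nothing about how Drill executes the query. The sample documents are hypothetical, and the regular expression only approximates the SQL LIKE pattern.

```python
import re


def flatten(rows, field, out_name):
    """Yield one output row per element of row[field], copying the other fields."""
    for row in rows:
        for item in row.get(field, []):
            out = {k: v for k, v in row.items() if k != field}
            out[out_name] = item
            yield out


# Hypothetical artist documents with a nested list of alias objects
artists = [
    {"id": 1, "name": "Elvis Presley",
     "alias": [{"name": "Elvis Aaron Presley"}, {"name": "The King"}]},
    {"id": 2, "name": "Elvis Costello",
     "alias": [{"name": "Declan MacManus"}]},
]

# Project the alias names, flatten, then filter like: alias LIKE 'Elvis%Presley'
projected = [{**a, "alias": [x["name"] for x in a["alias"]]} for a in artists]
matches = [row for row in flatten(projected, "alias", "alias")
           if re.fullmatch(r"Elvis.*Presley", row["alias"])]
```

After flattening, each alias is an ordinary scalar column, so the usual relational operators (filter, distinct, join) apply without change.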

Example Query

• Find discs where Elvis was credited

    select distinct album_id, name
    from (
      select id album_id, name, flatten(credit)
      from release
    ) albums
    join (
      select distinct artist_id
      from (
        select id artist_id, flatten(alias)
        from artist
        where name like 'Elvis%Presley'
      )
    ) artists
    using (artist_id)
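The shape of this query (flatten the credits, join against a set of artist ids, de-duplicate) can be sketched in Python. The sketch assumes that "credit" is a list of artist ids, which the slide does not state. The release names and ids are made up, and elvis_ids stands in for the result of the inner artist query.

```python
# Hypothetical documents; "credit" is assumed to be a list of artist ids
releases = [
    {"id": 10, "name": "Album A", "credit": [1]},
    {"id": 11, "name": "Album B", "credit": [2]},
    {"id": 12, "name": "Album C", "credit": [1, 3]},
]
elvis_ids = {1}  # stand-in for the result of the inner artist query


def credited_releases(releases, artist_ids):
    """Flatten each release's credits, keep wanted artists, drop duplicates."""
    seen = set()
    for release in releases:
        for artist_id in release["credit"]:  # plays the role of flatten(credit)
            key = (release["id"], release["name"])
            if artist_id in artist_ids and key not in seen:  # join + distinct
                seen.add(key)
                yield {"album_id": release["id"], "name": release["name"]}


albums = list(credited_releases(releases, elvis_ids))
```

The nested list takes the place of a separate credit link table, so the join key is read directly from the release document.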

Summary

• Extended relational model allows massive simplification
  – On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
  – This is good
• Apache Drill gives very high performance execution for extended relational problems
• You can try this out today

Short Books by Ted Dunning & Ellen Friedman

• Published by O’Reilly in 2014 and 2015
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR

http://bit.ly/ebook-real-world-hadoop

http://bit.ly/mapr-tsdb-ebook

http://bit.ly/ebook-anomaly

http://bit.ly/recommendation-ebook

Real World Hadoop by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)

Free copies at book signing today

Thank You!

Q & A

@mapr maprtech

[email protected]

Engage with us!

MapR

maprtech

mapr-technologies