Apache Drill Architecture – High-Performance SQL with a JSON Data Model

© 2015 MapR Technologies

How Drill achieves Flexibility with Performance


Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)

• Fixed schema

• Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly (SCHEMA ON THE FLY)

• Fixed schema, evolving schema or schema-less

• Leverage schema in centralized repository or self-describing data
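To make "schema discovered on-the-fly" concrete, here is a minimal Python sketch of the idea: the schema is built up while self-describing records are read, rather than declared beforehand. It is an illustration only, not Drill's implementation; the sample records and the function name discover_schema are assumptions made for this example.

```python
import json

# Hypothetical sample input: two self-describing JSON records whose
# fields differ from record to record.
RAW_RECORDS = [
    '{"name": "Michael", "age": 6, "district": "Los Altos"}',
    '{"name": "Jennifer", "age": 3, "preschool": "CCLC"}',
]

def discover_schema(lines):
    """Return {field name: set of Python type names} seen while reading."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

# No schema was declared up front; it emerges from the data itself.
schema = discover_schema(RAW_RECORDS)
```

Each new record can add fields to the discovered schema, which is what lets fixed, evolving and schema-less data be handled by the same reader.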


Drill’s Data Model is Flexible

[Diagram: data formats (JSON, BSON, HBase, Parquet, Avro, CSV, TSV) placed on two axes, Flat to Complex and Fixed schema to Dynamic schema, with an arrow labelled Flexibility]

RDBMS/SQL-on-Hadoop table

Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table

{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}

{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
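The two records on this slide have nested fields, arrays, and fields that appear in only one record. The Python sketch below shows, under stated assumptions, what querying such data involves: following a dotted path into a nested object, tolerating a missing field, and unnesting an array into one row per element. The helper names get_path and flatten_hobbies are hypothetical, and this is plain Python, not Drill code.

```python
# The two records from the slide, written as Python dicts.
RECORDS = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

def get_path(record, path):
    """Follow a dotted path; a missing field yields None instead of an error."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def flatten_hobbies(records):
    """Emit one (first name, hobby) row per element of the hobbies array."""
    return [(get_path(r, "name.first"), hobby)
            for r in records
            for hobby in r.get("hobbies", [])]

rows = flatten_hobbies(RECORDS)
districts = [get_path(r, "district") for r in RECORDS]
```

The second record has no district field, so its entry in districts is None; a fixed-schema table would have needed that column declared in advance.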


Drill enables ‘SQL on Everything’

SELECT * FROM dfs.yelp.`business.json`

Storage plugin instance (dfs)

- DFS (Text, Parquet, JSON)

- HBase/MapR-DB

- Hive Metastore/HCatalog

- Easy API to go beyond Hadoop

Workspace (yelp)

- Sub-directory

- HBase namespace

- Hive database

Table (`business.json`)

- Pathnames

- Hive table

- HBase table
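The query on this slide names its source in three parts: storage plugin instance, workspace, and table. As a small illustration, the Python sketch below splits such a reference into those parts. The function split_table_reference is hypothetical and handles only the exact three-part, backtick-quoted form shown on the slide; it is not how Drill's own parser works.

```python
def split_table_reference(reference):
    """Split 'plugin.workspace.`table`' into its three parts.

    Assumes exactly the three-part form shown on the slide, with the
    table name wrapped in backticks so that dots inside it are preserved.
    """
    prefix, _, rest = reference.partition("`")
    table = rest.rstrip("`")
    plugin, workspace = prefix.rstrip(".").split(".")
    return {"plugin": plugin, "workspace": workspace, "table": table}

parts = split_table_reference("dfs.yelp.`business.json`")
```

Here the plugin is dfs, the workspace is yelp, and the table is the file business.json; the backticks keep the dot in the file name from being read as a separator.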


Drill is a Distributed SQL query engine

[Diagram: drillbits running alongside DataNode/RegionServer processes, with a set of ZooKeeper nodes]

Scale out

Columnar and Vectorized execution

Optimistic and pipelined execution (no MR, Spark, Tez)

Late binding

Extensible


Drill allows reuse of existing SQL Tools and Skills

Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support

Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data


Drill is Designed For A Wide Set Of Use Cases

Use cases: Raw Data Exploration, JSON Analytics, DWH Offload, …

Data sources: Files, Directories, Hive, HBase, …

Formats: {JSON}, Parquet, Text Files, …


MapR Optimized Data Architecture

Sources

- RELATIONAL, SAAS, MAINFRAME

- DOCUMENTS, EMAILS

- LOG FILES, CLICKSTREAMS

- SENSORS

- BLOGS, TWEETS, LINK DATA

DATA WAREHOUSE

Data Movement, Data Access

Analytics

- Search

- Schema-less data exploration

- BI, reporting

- Ad-hoc integrated analytics

Data Transformation, Enrichment and Integration

Operational Apps

- Recommendations

- Fraud Detection

- Logistics

Machine Learning

Optimized Data Architecture: MAPR DISTRIBUTION FOR HADOOP

- Streaming (Spark Streaming, Storm)

- Batch (MapReduce, Spark, Hive, Pig)

- Interactive (Drill, Impala)

MapR Data Platform

- MapR-DB

- MapR-FS


Architecture – Under the hood


High Level Architecture

Cluster of commodity servers

– Daemon (drillbit) on each node

ZooKeeper maintains ephemeral cluster membership information

– Drillbit uses ZooKeeper to find other drillbits in the cluster

– Client uses ZooKeeper to find drillbits

Built-in, optimistic query execution engine. Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez)

– Better performance and manageability

Data processing unit is columnar record batches

– Enables schema flexibility with negligible performance impact


Basic Process

[Diagram: three Drillbits, each running next to DFS/HBase/Hive, with ZooKeeper alongside; a query arrives at one Drillbit]

1. Query comes to any Drillbit (JDBC, ODBC, CLI, REST)

2. Drillbit generates execution plan based on query optimization & locality

3. Fragments are farmed to individual nodes

4. Result is returned to driving node
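The four steps above follow a scatter/gather shape. The Python sketch below imitates that shape on one machine with a thread pool: a driving function farms one fragment out per "node", each fragment works on its local data, and the partial results are merged. The node names, the data, and the functions run_fragment and run_query are all assumptions made for this example; planning, optimization and locality decisions from step 2 are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds a local slice of one integer column.
NODE_DATA = {
    "node-1": [3, 8, 15],
    "node-2": [4, 16],
    "node-3": [23, 42, 7],
}

def run_fragment(node, threshold):
    """Fragment: runs next to the data and returns a partial (count, sum)."""
    matching = [v for v in NODE_DATA[node] if v > threshold]
    return len(matching), sum(matching)

def run_query(threshold):
    """Driving node: farm out one fragment per node, then merge the partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda n: run_fragment(n, threshold), NODE_DATA))
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_count, total_sum

result = run_query(threshold=7)
```

Only the small partial results travel back to the driving function; the raw values stay with the fragment that scanned them.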


Core Modules within drillbit

[Diagram: modules inside a drillbit]

RPC Endpoint

SQL Parser

Optimizer

Logical Plan

Physical Plan

Execution

Storage Plugins: Hive, HBase, MongoDB, DFS


A Query engine that is…

• Columnar/Vectorized

• Optimistic/pipelined

• Runtime compilation

• Late binding

• Extensible


Columnar representation

[Diagram: a table with columns A, B, C, D, E; on disk, each column (A, B, C, D, E) is stored separately]
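As a rough illustration of the columnar layout, the Python sketch below pivots row-oriented records into one list per column. The sample rows and the function to_columns are assumptions made for this example; real columnar formats add encoding and compression on top of this basic idea.

```python
# Hypothetical rows with three fields each.
ROWS = [
    {"a": 1, "b": "x", "c": 1.5},
    {"a": 2, "b": "y", "c": 2.5},
    {"a": 3, "b": "z", "c": 3.5},
]

def to_columns(rows):
    """Pivot a list of row dicts into {column name: list of values}."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(ROWS)
# A query that touches only column "a" reads one contiguous list
# and never looks at "b" or "c".
total_a = sum(columns["a"])
```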


Columnar Encoding

• Values in a column stored next to one another

– Better compression

– Range-map: save min-max, can skip if not present

• Only retrieve columns participating in query

• Drill optimizes for BOTH columnar storage and execution
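The range-map bullet can be sketched as follows: each chunk of a column records its minimum and maximum, and a lookup skips any chunk whose range cannot contain the target. This is a simplified Python illustration; the chunk layout and the functions build_chunks and find_equal are assumptions made for this example, not a description of a specific file format.

```python
def build_chunks(values, chunk_size):
    """Split a column into chunks and record each chunk's min and max."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        data = values[start:start + chunk_size]
        chunks.append({"min": min(data), "max": max(data), "data": data})
    return chunks

def find_equal(chunks, target):
    """Return matching values plus how many chunks were actually scanned."""
    matches, scanned = [], 0
    for chunk in chunks:
        if target < chunk["min"] or target > chunk["max"]:
            continue  # target cannot be in this chunk, so skip it unread
        scanned += 1
        matches.extend(v for v in chunk["data"] if v == target)
    return matches, scanned

# Hypothetical column of nine values stored in three chunks.
chunks = build_chunks([1, 2, 3, 10, 11, 12, 20, 21, 22], chunk_size=3)
matches, scanned = find_equal(chunks, 11)
```

In this example only one of the three chunks is scanned; the other two are ruled out by their min-max ranges alone.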



Vectorization

Drill operates on more than one record at a time

– Word-sized manipulations

– SIMD instructions (GCC, LLVM and JVM all do various optimizations automatically)

– Manually coded algorithms

Logical Vectorization

– Bitmaps allow lightning fast null-checks

– Avoid branching to speed CPU pipeline
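To illustrate the bitmap idea under "Logical Vectorization", here is a small Python sketch in which one bit per slot marks whether a value is present. Null counts and all-valid checks then become operations on the bitmap as a whole, and a sum can use the validity bit as a multiplier instead of an if statement per value. The column contents and function names are assumptions made for this example; Python integers stand in for the word-sized bitmaps a real engine would use, so this shows the logic, not the performance.

```python
# Hypothetical column of five slots; slots 1 and 3 are null. Null slots still
# hold a placeholder value (0 here) so the value array stays fixed-width.
VALUES = [5, 0, 7, 0, 9]
VALIDITY = 0b10101  # bit i is 1 when slot i holds a real value

def count_nulls(validity, length):
    """Count nulls from the bitmap alone, without touching the values."""
    return length - bin(validity).count("1")

def all_valid(validity, length):
    """One comparison answers 'does this batch contain any null?'."""
    return validity == (1 << length) - 1

def masked_sum(values, validity):
    """Multiply each value by its validity bit rather than testing it."""
    return sum(v * ((validity >> i) & 1) for i, v in enumerate(values))

nulls = count_nulls(VALIDITY, len(VALUES))
total = masked_sum(VALUES, VALIDITY)
```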


Optimistic Execution

With a short time horizon, failures are infrequent

– Don’t spend energy and time creating boundaries and checkpoints to minimize recovery time

– Rerun the entire query in the face of failure

No barriers

No persistence unless memory overflow
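The optimistic approach trades recovery machinery for a simple rule: if something fails, start the whole query again. A minimal Python sketch of that rule follows. The slides do not say who triggers the rerun or how many times, so the wrapper run_optimistically, its max_attempts parameter, and the simulated failure are all assumptions made for this example.

```python
def run_optimistically(query, max_attempts=3):
    """Run `query` with no checkpoints; on failure, start over from scratch.

    `query` is any zero-argument callable. `max_attempts` is an assumption
    of this sketch, not a setting taken from the slides.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return query(), attempt
        except RuntimeError as error:
            last_error = error  # no partial state was saved, nothing to resume
    raise last_error

# Hypothetical query that fails once (for example, a node drops out)
# and succeeds on the second run.
calls = {"count": 0}

def flaky_query():
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("node lost")
    return [1, 2, 3]

result, attempts = run_optimistically(flaky_query)
```

The successful run pays no checkpointing cost at all, which is the bet that makes sense when queries are short and failures are rare.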


Pipelining

Record batch is the unit of work for Drill

– Operators work on a record batch

Record batches are pipelined between nodes

– ~256kB usually

Operator reconfiguration happens at batch boundaries

[Diagram: record batches flowing between three DrillBits]
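Pipelining over record batches can be imitated with Python generators: each operator pulls one batch from its input, processes it, and hands it downstream before the next batch is read. The operators scan, filter_batches and sum_batches and the batch size of four values are assumptions made for this example; they run in a single process and do not model batches moving between nodes.

```python
def scan(values, batch_size):
    """Source operator: yield the column in fixed-size record batches."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]

def filter_batches(batches, predicate):
    """Filter operator: handle one batch at a time, pass survivors downstream."""
    for batch in batches:
        kept = [v for v in batch if predicate(v)]
        if kept:
            yield kept

def sum_batches(batches):
    """Sink operator: fold batches into a running total as they arrive."""
    total = 0
    for batch in batches:
        total += sum(batch)
    return total

# Ten values flow through scan -> filter (even numbers) -> sum, four at a time.
pipeline = filter_batches(scan(list(range(10)), batch_size=4), lambda v: v % 2 == 0)
total = sum_batches(pipeline)
```

No operator waits for the full input: the sum starts accumulating as soon as the first batch clears the filter.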


Runtime Compilation is Faster

[Chart: time for 1 million evaluations (ms), axis from 0 to 500, for Trivial, Simple and Moderate expressions; legend: Janino interpreted]

Source: http://bit.ly/16Xk32x


Drill compiler

[Diagram: Drill compiler flow]

CodeModel generates code

Janino compiles runtime byte-code

Precompiled byte-code templates

Merge byte-code of the two classes

Loaded class
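The general idea behind runtime compilation can be shown with a Python analogy: instead of walking an expression tree for every row, generate source code for the expression once, compile it, and call the compiled function. The sketch below uses Python's built-in compile() and eval(); it is not Janino, CodeModel or Drill's byte-code merging, and the expression tree format and function names are assumptions made for this example.

```python
import operator

# Hypothetical expression tree for: (x + 1) * 2
EXPR = ("mul", ("add", ("var", "x"), ("const", 1)), ("const", 2))
OPS = {"add": operator.add, "mul": operator.mul}
SYMBOLS = {"add": "+", "mul": "*"}

def interpret(node, x):
    """Walk the tree on every call (the interpreted path)."""
    kind = node[0]
    if kind == "var":
        return x
    if kind == "const":
        return node[1]
    return OPS[kind](interpret(node[1], x), interpret(node[2], x))

def to_source(node):
    """Turn the tree into a Python expression string."""
    kind = node[0]
    if kind == "var":
        return node[1]
    if kind == "const":
        return repr(node[1])
    return f"({to_source(node[1])} {SYMBOLS[kind]} {to_source(node[2])})"

def compile_expr(node):
    """Generate source once, compile it, and return a plain function."""
    code = compile(f"lambda x: {to_source(node)}", "<generated>", "eval")
    return eval(code)

compiled = compile_expr(EXPR)
```

Both paths give the same answer for any input; the compiled function simply avoids re-walking the tree on each evaluation.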


Cost-based Optimization

Pluggable rules and cost model

Rules for distributed plan generation

- Insert Exchange operator into physical plan

- Parallel query plans

Pluggable cost model

- CPU, IO, memory, network cost (data locality)

- Storage engine features (HDFS vs Hive vs HBase)
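A cost-based choice between candidate plans can be sketched in a few lines of Python: estimate CPU, IO, memory and network usage for each plan, combine them with weights, and keep the cheapest. The weights, the candidate plans and their numbers below are invented for illustration, and the functions plan_cost and choose_plan are hypothetical; Drill's actual optimizer rules and cost formulas are not shown in the slides.

```python
# Hypothetical weights; a real cost model would calibrate these.
WEIGHTS = {"cpu": 1.0, "io": 4.0, "memory": 0.5, "network": 8.0}

# Hypothetical candidate plans with estimated resource usage.
CANDIDATES = [
    {"name": "broadcast_join", "cpu": 10, "io": 5, "memory": 40, "network": 12},
    {"name": "hash_partition_join", "cpu": 12, "io": 5, "memory": 20, "network": 6},
    {"name": "single_node_join", "cpu": 30, "io": 20, "memory": 80, "network": 0},
]

def plan_cost(plan, weights=WEIGHTS):
    """Weighted sum of the plan's estimated resource usage."""
    return sum(weights[resource] * plan[resource] for resource in weights)

def choose_plan(candidates, weights=WEIGHTS):
    """Pick the candidate with the lowest weighted cost."""
    return min(candidates, key=lambda plan: plan_cost(plan, weights))

best = choose_plan(CANDIDATES)
```

Passing a different weights dictionary changes which plan wins, which is the sense in which a cost model of this kind is pluggable.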



Integration and extensibility points

Support UDFs

– UDFs/UDAFs using high performance Java API

Not Hadoop centric

– Work with other NoSQL solutions including MongoDB, Cassandra, Riak, etc.

– Build one distributed query engine together rather than one per technology

Built-in classpath scanning and plugin concept to add additional storage engines, functions and operators with zero configuration

Support direct execution of strongly specified JSON-based logical and physical plans

– Simplifies testing

– Enables integration of alternative query languages


Additional Resources

Download Apache Drill

Tutorial: Apache Drill in 10 Minutes (http://drill.apache.org/docs/drill-in-10-minutes/)

Whiteboard Video with Tomer Shiran (https://youtu.be/6pGeQOXDdD8)

Links:

drill.apache.org

https://www.mapr.com/drill
