© 2015 MapR Technologies

Apache Drill Architecture – High-Performance SQL with a JSON Data Model

How Drill achieves Flexibility with Performance

Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)

• Fixed schema

• Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly (SCHEMA ON THE FLY)

• Fixed schema, evolving schema or schema-less

• Leverage schema in centralized repository or self-describing data

Drill’s Data Model is Flexible

[Figure: formats JSON, BSON, HBase, Parquet, Avro, CSV and TSV arranged by flexibility along two axes: fixed schema to dynamic schema, and flat to complex]

RDBMS/SQL-on-Hadoop table

Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table

{
  name: { first: Michael, last: Smith },
  hobbies: [ski, soccer],
  district: Los Altos
}

{
  name: { first: Jennifer, last: Gates },
  hobbies: [sing],
  preschool: CCLC
}

Drill enables ‘SQL on Everything’

SELECT * FROM dfs.yelp.`business.json`

Storage plugin instance (dfs)

- DFS (Text, Parquet, JSON)

- HBase/MapRDB

- Hive Metastore/HCatalog

- Easy API to go beyond Hadoop

Workspace (yelp)

- Sub-directory

- HBase namespace

- Hive database

Table (`business.json`)

- Pathnames

- Hive table

- HBase table
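
The three-part name in the query above can be taken apart mechanically. The following is a minimal sketch, assuming the plugin.workspace.table form shown on this slide and that backticks quote a component containing dots; `split_table_ref` is a hypothetical helper for illustration, not a Drill API.

```python
import re

# Hypothetical helper, not part of Drill: a backtick-quoted component
# (which may contain dots) counts as one part, bare components end at a dot.
_PART = re.compile(r"`([^`]*)`|([^.`]+)")

def split_table_ref(ref):
    """Split 'plugin.workspace.table' into its three named parts."""
    parts = [quoted or bare for quoted, bare in _PART.findall(ref)]
    if len(parts) != 3:
        raise ValueError("expected plugin.workspace.table, got %r" % ref)
    plugin, workspace, table = parts
    return {"plugin": plugin, "workspace": workspace, "table": table}

print(split_table_ref("dfs.yelp.`business.json`"))
```

Here `dfs` is the storage plugin instance, `yelp` the workspace and `business.json` the table; the dot inside the backticks is kept as part of the file name.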

Drill is a Distributed SQL query engine

[Figure: three nodes, each running a drillbit alongside a DataNode/RegionServer, plus a ZooKeeper ensemble]

Scale out

Columnar and Vectorized execution

Optimistic and pipelined execution (no MR, Spark, Tez)

Late binding

Extensible

Drill allows reuse of existing SQL Tools and Skills

Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support

Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data

Drill is Designed For A Wide Set Of Use Cases

Use cases: Raw Data Exploration, JSON Analytics, DWH Offload, …

Data sources: Hive, HBase, Files, Directories, {JSON}, Parquet, Text Files, …

MapR Optimized Data Architecture

[Figure: architecture diagram; its labels are listed below]

Sources: RELATIONAL, SAAS, MAINFRAME; DOCUMENTS, EMAILS; LOG FILES, CLICKSTREAMS; SENSORS; BLOGS, TWEETS, LINK DATA

DATA WAREHOUSE

Data Movement

Data Access

Analytics: Search; Schema-less data exploration; BI, reporting; Ad-hoc integrated analytics

Data Transformation, Enrichment and Integration

Operational Apps: Recommendations; Fraud Detection; Logistics

Machine Learning

MAPR DISTRIBUTION FOR HADOOP

– Streaming (Spark Streaming, Storm)

– Batch (MapReduce, Spark, Hive, Pig)

– Interactive (Drill, Impala)

MapR Data Platform: MapR-DB, MapR-FS

Architecture – Under the hood

High Level Architecture

Cluster of commodity servers

– Daemon (drillbit) on each node

ZooKeeper maintains ephemeral cluster membership information

– Drillbit uses ZooKeeper to find other drillbits in the cluster

– Client uses ZooKeeper to find drillbits

Built-in, optimistic query execution engine. Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez)

– Better performance and manageability

Data processing unit is columnar record batches

– Enables schema flexibility with negligible performance impact
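
The idea of ephemeral membership can be illustrated without a real ZooKeeper. This is a toy in-memory stand-in, assuming that an entry lives only as long as the session that created it; `MembershipRegistry` and the addresses are hypothetical, and nothing here is ZooKeeper's or Drill's actual API.

```python
class MembershipRegistry:
    """In-memory stand-in for ephemeral membership entries (illustration only)."""

    def __init__(self):
        self._members = {}  # session id -> drillbit address

    def register(self, session_id, address):
        """A drillbit announces itself when it starts."""
        self._members[session_id] = address

    def session_closed(self, session_id):
        """An ephemeral entry disappears when its owner's session ends."""
        self._members.pop(session_id, None)

    def drillbits(self):
        """What a client or another drillbit sees when it looks up the cluster."""
        return sorted(self._members.values())

registry = MembershipRegistry()
registry.register("s1", "node1:31010")
registry.register("s2", "node2:31010")
registry.session_closed("s1")  # node1 goes away; no explicit cleanup step needed
print(registry.drillbits())
```

Because membership is tied to the session, a crashed node drops out of the list without anyone having to deregister it.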

Basic Process

[Figure: a query arrives at one of three Drillbits, each co-located with DFS/HBase/Hive storage, with ZooKeeper alongside]

1. Query comes to any Drillbit (JDBC, ODBC, CLI, REST)

2. Drillbit generates execution plan based on query optimization & locality

3. Fragments are farmed to individual nodes

4. Result is returned to driving node
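
The steps above follow a scatter-gather shape, which can be sketched in a few lines. This is a toy model under stated assumptions: threads stand in for nodes, `NODE_DATA`, `run_fragment` and `run_query` are hypothetical names, and the planning step is reduced to "one fragment per node that holds data".

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data: each "node" holds a local slice of the rows.
NODE_DATA = {
    "node1": [3, 8, 15],
    "node2": [4, 23],
    "node3": [42, 16, 1],
}

def run_fragment(node, threshold):
    """Fragment executed next to the data: count local rows above threshold."""
    return sum(1 for value in NODE_DATA[node] if value > threshold)

def run_query(threshold):
    """Driving node: send one fragment per node, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda node: run_fragment(node, threshold), NODE_DATA)
        return sum(partials)

print(run_query(10))
```

Only the small partial counts travel back to the driving node; the rows themselves stay where they are stored.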

Core Modules within drillbit

[Figure: modules inside a drillbit: RPC Endpoint, SQL Parser, Logical Plan, Optimizer, Physical Plan, Execution, and Storage Plugins (Hive, HBase, MongoDB, DFS)]

A Query engine that is…

• Columnar/Vectorized

• Optimistic/pipelined

• Runtime compilation

• Late binding

• Extensible

Columnar representation

[Figure: a table with columns A, B, C, D, E; on disk, the values of each column are stored together]

Columnar Encoding

• Values in a column are stored next to one another

– Better compression

– Range-map: save min-max, can skip a range if the value sought is not present

• Only retrieve columns participating in query

• Drill optimizes for BOTH columnar storage and execution

[Figure: columns A, B, C, D, E stored separately on disk]
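
The min-max idea can be sketched as follows. This is an illustration only, assuming one column split into fixed-size chunks with a stored minimum and maximum per chunk; `build_chunks` and `scan_equals` are hypothetical names and do not reflect any particular file format's layout.

```python
def build_chunks(values, chunk_size):
    """Split one column into chunks and record each chunk's min and max."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        chunks.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return chunks

def scan_equals(chunks, target):
    """Return the matching values and how many chunks were actually read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if target < chunk["min"] or target > chunk["max"]:
            continue  # the stored range rules this chunk out, so skip it
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if v == target)
    return matches, chunks_read

age = [1, 2, 3, 3, 10, 11, 12, 14, 30, 31, 35, 39]
chunks = build_chunks(age, chunk_size=4)
print(scan_equals(chunks, 11))
```

With three chunks covering 1 to 3, 10 to 14 and 30 to 39, a lookup for 11 reads only the middle chunk, and a lookup for 20 reads none.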

Vectorization

Drill operates on more than one record at a time

– Word-sized manipulations

– SIMD instructions (GCC, LLVM and JVM all do various optimizations automatically)

– Manually code algorithms

Logical Vectorization

– Bitmaps allow lightning fast null-checks

– Avoid branching to speed CPU pipeline
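
A small sketch of the bitmap idea: nulls are tracked in a separate bitmap, so a null-check over many rows becomes a single bitwise operation rather than a branch per row. This is an illustration under assumptions, not Drill's vector implementation; the function names are hypothetical, and a Python int stands in for the machine words a real engine would use.

```python
def make_validity_bitmap(values):
    """Pack one bit per value into an int: 1 = present, 0 = null."""
    bitmap = 0
    for i, value in enumerate(values):
        if value is not None:
            bitmap |= 1 << i
    return bitmap

def count_non_null(bitmap):
    """Count present values from the bitmap alone, without touching the data."""
    return bin(bitmap).count("1")

def both_present(bitmap_a, bitmap_b):
    """Rows where both columns are non-null: one AND covers every row at once."""
    return bitmap_a & bitmap_b

a = [5, None, 7, 9]
b = [None, None, 2, 4]
valid = both_present(make_validity_bitmap(a), make_validity_bitmap(b))
print(bin(valid), count_non_null(valid))
```

Row 0 is the lowest bit, so the result has bits 2 and 3 set: only the last two rows have a value in both columns.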

Optimistic Execution

With a short time horizon, failures are infrequent

– Don’t spend energy and time creating boundaries and checkpoints to minimize recovery time

– Rerun entire query in face of failure

No barriers

No persistence unless memory overflow
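
The "rerun on failure" policy is simple enough to sketch. This is a minimal illustration, assuming a query is a callable that either returns a result or raises; `run_optimistically` and `make_flaky_query` are hypothetical names, and the retry limit is an assumption added here, not something the slide specifies.

```python
def run_optimistically(query, max_attempts=3):
    """Run query() with no checkpoints; on failure, rerun it from the start."""
    for attempt in range(1, max_attempts + 1):
        try:
            return query(), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up after the last attempt

def make_flaky_query(failures):
    """Build a toy query that fails a given number of times, then succeeds."""
    state = {"left": failures}
    def query():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("node lost")
        return sum(range(10))
    return query

print(run_optimistically(make_flaky_query(1)))
```

No intermediate state is saved between attempts, which is the trade the slide describes: nothing is spent on checkpoints, and a failure costs one full rerun.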

Pipelining

Record batch is the unit of work for Drill

– Operators work on a record batch

Record batches are pipelined between nodes

– ~256kB usually

Operator reconfiguration happens at batch boundaries

[Figure: record batches passing between three DrillBits]
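
Batch-at-a-time pipelining can be sketched with generators. This is a toy model, assuming batches are plain lists and operators are chained in one process; `scan`, `filter_op`, `project_op` and `build_pipeline` are hypothetical names, and the batch size of 4 rows is chosen only to keep the output small.

```python
def scan(rows, batch_size):
    """Source operator: emit the rows as fixed-size record batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def filter_op(batches, predicate):
    """Filter each batch as it arrives, without waiting for the full input."""
    for batch in batches:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept

def project_op(batches, fn):
    """Apply fn to every row of each incoming batch."""
    for batch in batches:
        yield [fn(row) for row in batch]

def build_pipeline(rows):
    """Chain the operators: scan, keep even rows, multiply by 10."""
    evens = filter_op(scan(rows, 4), lambda r: r % 2 == 0)
    return project_op(evens, lambda r: r * 10)

print(list(build_pipeline(list(range(10)))))
```

Each operator hands a batch downstream as soon as it is ready, so the first output batch is produced before the scan has finished reading its input.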

Runtime Compilation is Faster

[Chart: time for 1 million evaluations (ms), axis from 0 to 500, for Trivial, Simple and Moderate expressions; legend: Janino, interpreted]

Source: http://bit.ly/16Xk32x

Drill compiler

[Figure: CodeModel generates code; Janino compiles runtime byte-code; the byte-code of the two classes (runtime byte-code and precompiled byte-code templates) is merged; the result is the loaded class]
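
The difference between interpreting an expression and compiling it at runtime can be shown with a Python analogy. This is not Drill's Java pipeline: Python's built-in `compile` and `exec` stand in for the code generation and compilation steps, and the expression tree format, `interpret`, `to_source` and `compile_expr` are all hypothetical.

```python
# Expression tree for: (a + b) * 2
EXPR = ("mul", ("add", ("col", "a"), ("col", "b")), ("const", 2))

def interpret(node, row):
    """Walk the tree again for every row (the interpreted approach)."""
    kind = node[0]
    if kind == "col":
        return row[node[1]]
    if kind == "const":
        return node[1]
    left, right = interpret(node[1], row), interpret(node[2], row)
    return left + right if kind == "add" else left * right

def to_source(node):
    """Generate Python source text for the tree (the code generation step)."""
    kind = node[0]
    if kind == "col":
        return "row[%r]" % node[1]
    if kind == "const":
        return repr(node[1])
    op = "+" if kind == "add" else "*"
    return "(%s %s %s)" % (to_source(node[1]), op, to_source(node[2]))

def compile_expr(node):
    """Compile the generated source once; reuse the function for every row."""
    namespace = {}
    source = "def evaluate(row): return " + to_source(node)
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["evaluate"]

evaluate = compile_expr(EXPR)
row = {"a": 1, "b": 2}
print(to_source(EXPR), evaluate(row), interpret(EXPR, row))
```

Both paths give the same answer; the compiled function simply avoids re-walking the tree on each row, which is the effect the chart on the previous slide measures for Janino.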

Cost-based Optimization

Pluggable rules, and cost model

Rules for distributed plan generation

- Insert Exchange operator into physical plan

- Parallel query plans

Pluggable cost model

- CPU, IO, memory, network cost (data locality)

- Storage engine features (HDFS vs HIVE vs HBase)

[Figure: Query Optimizer with pluggable rules]
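
A pluggable cost model can be sketched as a function passed to the plan chooser. This is an illustration only: the plan names, the resource numbers and the weights are made up, and `choose_plan` and `weighted_cost` are hypothetical helpers rather than Drill's optimizer interfaces.

```python
def choose_plan(plans, cost_model):
    """Return the candidate plan with the lowest cost under cost_model."""
    return min(plans, key=cost_model)

def weighted_cost(weights):
    """Build a cost model from per-resource weights (made-up numbers)."""
    def cost(plan):
        return sum(weights[r] * plan["usage"].get(r, 0) for r in weights)
    return cost

plans = [
    {"name": "broadcast_join",
     "usage": {"cpu": 10, "io": 50, "memory": 30, "network": 80}},
    {"name": "hash_partition_join",
     "usage": {"cpu": 20, "io": 50, "memory": 20, "network": 40}},
]

# Two interchangeable cost models: one penalizes network, one penalizes CPU.
network_heavy = weighted_cost({"cpu": 1, "io": 1, "memory": 1, "network": 5})
cpu_heavy = weighted_cost({"cpu": 10, "io": 1, "memory": 1, "network": 0.1})

print(choose_plan(plans, network_heavy)["name"])
print(choose_plan(plans, cpu_heavy)["name"])
```

Swapping the cost model changes which plan wins without touching the chooser, which is the point of keeping the model pluggable.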

Integration and extensibility points

Support UDFs

– UDFs/UDAFs using high performance Java API

Not Hadoop centric

– Work with other NoSQL solutions including MongoDB, Cassandra, Riak, etc.

– Build one distributed query engine together rather than one per technology

Built-in classpath scanning and plugin concept to add additional storage engines, functions and operators with zero configuration

Support direct execution of strongly specified JSON-based logical and physical plans

– Simplifies testing

– Enables integration of alternative query languages