© 2015 MapR Technologies 2
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
Drill’s Data Model is Flexible
[Figure: data formats placed along two axes, flat to complex and fixed schema to dynamic schema (flexibility): JSON, BSON, HBase, Parquet, Avro, CSV, TSV]
RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table:
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
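As a rough illustration of what "schema discovered on the fly" means for the two records on this slide, the sketch below reads the field names and types off each record instead of declaring them up front. It is plain Python, not Drill code; `discover_schema` is a hypothetical helper introduced only for this example.

```python
import json

# The two self-describing records from the slide, written as valid JSON.
RECORDS = [
    '{"name": {"first": "Michael", "last": "Smith"}, '
    '"hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, '
    '"hobbies": ["sing"], "preschool": "CCLC"}',
]

def discover_schema(value, prefix=""):
    """Return {dotted_path: type_name} for one parsed JSON value (hypothetical helper)."""
    if isinstance(value, dict):
        schema = {}
        for key, child in value.items():
            schema.update(discover_schema(child, f"{prefix}{key}."))
        return schema
    # Leaf value: drop the trailing dot and record the Python type name.
    return {prefix.rstrip("."): type(value).__name__}

# No schema is declared in advance: it is read off each record and merged.
merged = {}
for line in RECORDS:
    merged.update(discover_schema(json.loads(line)))

for path in sorted(merged):
    print(path, merged[path])
```

Note that `district` appears only in the first record and `preschool` only in the second; the merged view still contains both.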
Drill enables ‘SQL on Everything’
SELECT * FROM dfs.yelp.`business.json`
Storage plugin instance (dfs)
- DFS (Text, Parquet, JSON)
- HBase/MapR-DB
- Hive Metastore/HCatalog
- Easy API to go beyond Hadoop
Workspace (yelp)
- Sub-directory
- HBase namespace
- Hive database
Table (`business.json`)
- Pathnames
- Hive table
- HBase table
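To make the three-part naming concrete, here is a small sketch that splits the table reference from the query on this slide into storage plugin instance, workspace and table. `split_table_name` is a hypothetical helper, not part of Drill, and it only handles the three-part form shown here.

```python
import re

def split_table_name(name):
    """Split 'plugin.workspace.`table`' into its three parts (hypothetical helper).

    Backtick-quoted parts may themselves contain dots, e.g. `business.json`.
    """
    # Each part is either a backtick-quoted string or a run without dots/backticks.
    parts = re.findall(r"`([^`]+)`|([^.`]+)", name)
    flat = [quoted or bare for quoted, bare in parts]
    if len(flat) != 3:
        raise ValueError(f"expected plugin.workspace.table, got {name!r}")
    return {"storage_plugin": flat[0], "workspace": flat[1], "table": flat[2]}

print(split_table_name("dfs.yelp.`business.json`"))
```

The backticks matter: without them the dot in `business.json` would be read as another name separator.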
Drill is a Distributed SQL query engine
[Diagram: a drillbit alongside each DataNode/RegionServer, plus a set of ZooKeeper nodes]
Scale out
Columnar and Vectorized execution
Optimistic and pipelined execution (no MR, Spark, Tez)
Late binding
Extensible
Drill allows reuse of existing SQL Tools and Skills
Leverage SQL-compatible tools
(BI, query builders, etc.) via Drill’s
standard ODBC, JDBC and ANSI
SQL support
Enable business analysts, technical
analysts and data scientists to
explore and analyze large volumes
of real-time data
Drill is Designed For A Wide Set Of Use Cases
Use cases: Raw Data Exploration, JSON Analytics, DWH Offload, …
Data sources: Files, Directories, Hive, HBase, …
File formats: {JSON}, Parquet, Text Files, …
MapR Optimized Data Architecture
Sources
- RELATIONAL, SAAS, MAINFRAME
- DOCUMENTS, EMAILS
- LOG FILES, CLICKSTREAMS
- SENSORS
- BLOGS, TWEETS, LINK DATA
DATA WAREHOUSE
Data Movement
Data Access
Analytics
- Search
- Schema-less data exploration
- BI, reporting
- Ad-hoc integrated analytics
Data Transformation, Enrichment and Integration
Operational Apps
- Recommendations
- Fraud Detection
- Logistics
Machine Learning
MAPR DISTRIBUTION FOR HADOOP
- Streaming (Spark Streaming, Storm)
- Batch (MapReduce, Spark, Hive, Pig)
- Interactive (Drill, Impala)
MapR Data Platform
- MapR-DB
- MapR-FS
High Level Architecture
Cluster of commodity servers
– Daemon (drillbit) on each node
ZooKeeper maintains ephemeral cluster membership information
– Drillbit uses ZooKeeper to find other drillbits in the cluster
– Client uses ZooKeeper to find drillbits
Built-in, optimistic query execution engine. Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez)
– Better performance and manageability
Data processing unit is columnar record batches
– Enables schema flexibility with negligible performance impact
Basic Process
[Diagram: a query arriving at a cluster of three Drillbits, each on top of DFS/HBase/Hive, with Zookeeper alongside]
1. Query comes to any Drillbit (JDBC, ODBC, CLI, REST)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node
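The four steps can be mimicked with a toy model. The sketch below is not Drill's planner: the node names, block layout and `run_query` function are all hypothetical, and "locality" is reduced to "send each block to the node that stores it".

```python
from collections import defaultdict

# Hypothetical cluster: which data blocks are local to which drillbit.
LOCAL_BLOCKS = {
    "drillbit-1": {"b0", "b1"},
    "drillbit-2": {"b2"},
    "drillbit-3": {"b3", "b4"},
}
# Toy data: each block holds a few numbers.
BLOCK_DATA = {"b0": [1, 2], "b1": [3], "b2": [4, 5], "b3": [6], "b4": [7, 8]}

def plan(blocks):
    """Step 2: assign each block to a node that holds it locally."""
    assignment = defaultdict(list)
    for block in sorted(blocks):
        node = next(n for n, local in sorted(LOCAL_BLOCKS.items()) if block in local)
        assignment[node].append(block)
    return dict(assignment)

def run_fragment(blocks):
    """Step 3: a fragment computes a partial sum over its local blocks."""
    return sum(sum(BLOCK_DATA[b]) for b in blocks)

def run_query(driving_node, blocks):
    """Steps 1-4: the receiving node plans, farms out fragments, merges results."""
    fragments = plan(blocks)
    partials = {node: run_fragment(bs) for node, bs in fragments.items()}
    return {"driving_node": driving_node, "partials": partials,
            "total": sum(partials.values())}

# Step 1: the query may arrive at any node; here it arrives at drillbit-2.
result = run_query("drillbit-2", BLOCK_DATA.keys())
print(result)
```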
Core Modules within drillbit
RPC Endpoint
SQL Parser
Logical Plan
Optimizer
Physical Plan
Execution
Storage Plugins: Hive, HBase, MongoDB, DFS
A Query engine that is…
• Columnar/Vectorized
• Optimistic/pipelined
• Runtime compilation
• Late binding
• Extensible
Columnar Encoding
• Values in a column are stored next to one another
– Better compression
– Range-map: save min-max, can skip if not present
• Only retrieve columns participating in query
• Drill optimizes for BOTH columnar storage and execution
[Diagram: columns A, B, C, D, E stored on disk]
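A minimal sketch of the two ideas on this slide, reading only the queried column and skipping chunks by their min-max range. The chunk layout and the `count_equal` helper are assumptions made for illustration; they do not describe Drill's or Parquet's actual file format.

```python
def make_chunks(values, size):
    """Split one column into chunks that carry min/max statistics."""
    chunks = []
    for start in range(0, len(values), size):
        part = values[start:start + size]
        chunks.append({"min": min(part), "max": max(part), "values": part})
    return chunks

# Toy columnar table: each column is stored contiguously, on its own.
TABLE = {
    "age": make_chunks([3, 6, 7, 9, 31, 35, 40, 42], size=4),
    "name": make_chunks(["a", "b", "c", "d", "e", "f", "g", "h"], size=4),
}

def count_equal(column, target):
    """Count rows where column == target, touching only that column."""
    scanned = 0
    hits = 0
    for chunk in TABLE[column]:
        # Skip the chunk when the target cannot fall inside its range.
        if target < chunk["min"] or target > chunk["max"]:
            continue
        scanned += 1
        hits += sum(1 for v in chunk["values"] if v == target)
    return hits, scanned

hits, scanned = count_equal("age", 35)
print(hits, scanned)
```

The `name` column is never read, and only one of the two `age` chunks is scanned.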
Vectorization
Drill operates on more than one record at a time
– Word-sized manipulations
– SIMD instructions (GCC, LLVM and JVM all do various optimizations
automatically)
– Manually code algorithms
Logical Vectorization
– Bitmaps allow lightning fast null-checks
– Avoid branching to speed CPU pipeline
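The logical-vectorization points can be sketched with an integer used as a validity bitmap. Python does not issue SIMD instructions here, so this only illustrates the logic (one word-sized operation for a whole batch, and arithmetic in place of an `if`); the column values and helper names are made up for the example.

```python
# Values for one column plus a validity bitmap: bit i set means row i is non-null.
values = [10, 0, 30, 0, 50, 60, 0, 80]
validity = 0b10110101  # rows 0, 2, 4, 5 and 7 are non-null

def count_non_null(bitmap):
    """Null check for the whole batch at once: a population count on one word."""
    return bin(bitmap).count("1")

def all_valid(bitmap, n):
    """A single comparison tells whether an n-row batch contains any null."""
    return bitmap == (1 << n) - 1

def sum_non_null(vals, bitmap):
    """Branch-free sum: multiply by the 0/1 validity bit instead of testing it."""
    return sum(v * ((bitmap >> i) & 1) for i, v in enumerate(vals))

print(count_non_null(validity), all_valid(validity, len(values)),
      sum_non_null(values, validity))
```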
Optimistic Execution
With a short time horizon, failures are infrequent
– Don’t spend energy and time creating boundaries and checkpoints to
minimize recovery time
– Rerun entire query in face of failure
No barriers
No persistence unless memory overflow
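A sketch of the "rerun the entire query" policy, assuming a caller-side retry loop. The slide does not say who performs the rerun or how many times, so `run_optimistically` and `max_attempts` are hypothetical.

```python
def run_optimistically(query, max_attempts=3):
    """Run `query` with no checkpoints; on failure, rerun it from scratch."""
    for attempt in range(1, max_attempts + 1):
        try:
            return query(), attempt
        except RuntimeError:
            # No partial state was saved, so there is nothing to recover: retry.
            if attempt == max_attempts:
                raise

# Hypothetical query that fails once (say a node drops out), then succeeds.
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("node lost")
    return sum(range(10))

result, attempts = run_optimistically(flaky_query)
print(result, attempts)
```

The trade-off is the one the slide states: no time is spent on checkpoints in the common case, at the price of repeating all work in the rare failure case.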
Pipelining
Record batch is the unit of work for Drill
– Operators work on a record batch
Record batches are pipelined between nodes
– ~256kB usually
Operator reconfiguration happens at batch boundaries
[Diagram: record batches passed between three DrillBits]
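Pipelining over record batches can be imitated with Python generators: each operator consumes one batch and hands it on before the next batch is produced. The operator names and the tiny batch size are illustrative assumptions, not Drill's operators.

```python
def scan(rows, batch_size):
    """Source operator: emit record batches of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def filter_op(batches, predicate):
    """Filter one batch at a time and pass it downstream immediately."""
    for batch in batches:
        yield [r for r in batch if predicate(r)]

def project_op(batches, fn):
    """Apply fn to every record of each incoming batch."""
    for batch in batches:
        yield [fn(r) for r in batch]

rows = list(range(10))
pipeline = project_op(
    filter_op(scan(rows, batch_size=4), lambda r: r % 2 == 0),
    lambda r: r * 10,
)
out = list(pipeline)
print(out)
```

Nothing is materialized between operators; a batch flows through the whole chain before the scan produces the next one.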
Runtime Compilation is Faster
[Chart: time for 1 million evaluations (ms), y-axis from 0 to 500, for Trivial, Simple and Moderate expressions; legend: Janino interpreted]
Source: http://bit.ly/16Xk32x
Drill compiler
CodeModel generates code
Janino compiles runtime byte-code
Precompiled byte-code templates
Merge byte-code of the two classes
Loaded class
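The idea of generating and compiling code at runtime, rather than interpreting an expression tree for every record, can be shown with a Python analogue. Drill's pipeline uses CodeModel and Janino to produce Java byte-code; the sketch below swaps in Python's built-in `compile` and `eval`, and the tuple-based expression format is an assumption made for the example.

```python
# Expression tree for (a + b) * 2, as nested tuples: (op, left, right).
EXPR = ("*", ("+", "a", "b"), 2)

def interpret(node, row):
    """Walk the tree for every record (the interpreted approach)."""
    if isinstance(node, tuple):
        op, left, right = node
        lhs, rhs = interpret(left, row), interpret(right, row)
        return lhs + rhs if op == "+" else lhs * rhs
    return row[node] if isinstance(node, str) else node

def to_source(node):
    """Generate source text for the expression, once per query."""
    if isinstance(node, tuple):
        op, left, right = node
        return f"({to_source(left)} {op} {to_source(right)})"
    return f"row[{node!r}]" if isinstance(node, str) else repr(node)

def compile_expr(node):
    """Compile the generated source into a callable at runtime."""
    code = compile(f"lambda row: {to_source(node)}", "<generated>", "eval")
    return eval(code)

compiled = compile_expr(EXPR)
row = {"a": 3, "b": 4}
print(to_source(EXPR))
print(interpret(EXPR, row), compiled(row))
```

Both paths give the same answer; the compiled one pays the tree walk once instead of once per record.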
Cost-based Optimization
Pluggable rules, and cost model
Rules for distributed plan generation
- Insert Exchange operator into physical plan
- Parallel query plans
Pluggable cost model
- CPU, IO, memory, network cost (data locality)
- Storage engine features (HDFS vs Hive vs HBase)
[Diagram: Query Optimizer with pluggable rules]
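A pluggable cost model can be pictured as a cost function handed to the optimizer, which then picks the cheapest candidate plan. The plan names, resource figures and weights below are invented for illustration and do not reflect Drill's real cost model.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    cpu: float
    io: float
    memory: float
    network: float

def weighted_cost(weights):
    """Build a cost function from per-resource weights (hypothetical model)."""
    def cost(plan):
        return (weights["cpu"] * plan.cpu + weights["io"] * plan.io
                + weights["memory"] * plan.memory
                + weights["network"] * plan.network)
    return cost

def choose(plans, cost):
    """Pick the cheapest candidate plan under the plugged-in cost function."""
    return min(plans, key=cost)

candidates = [
    Plan("broadcast join", cpu=2, io=4, memory=6, network=8),
    Plan("local-read hash join", cpu=4, io=4, memory=5, network=1),
]
# Two interchangeable cost models: one penalizes network traffic, one ignores it.
network_heavy = weighted_cost({"cpu": 1, "io": 1, "memory": 1, "network": 5})
network_free = weighted_cost({"cpu": 1, "io": 1, "memory": 1, "network": 0})

print(choose(candidates, network_heavy).name)
print(choose(candidates, network_free).name)
```

Swapping the cost function changes the chosen plan without touching the candidates or the selection logic.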
Integration and extensibility points
Support UDFs
– UDFs/UDAFs using high performance Java API
Not Hadoop centric
– Work with other NoSQL solutions including MongoDB, Cassandra, Riak, etc.
– Build one distributed query engine together rather than one per technology
Built-in classpath scanning and plugin concept to add additional storage engines, functions and operators with zero configuration
Support direct execution of strongly specified JSON based logical and physical plans
– Simplifies testing
– Enables integration of alternative query languages
Additional Resources
Download Apache Drill
Tutorial: Apache Drill in 10 Minutes
Whiteboard Video with Tomer Shiran