Updated version of the OpenDremel team's suggested design for the Apache Drill project.
Apache Drill Design proposal from
OpenDremel team
Camuel Gilyadov & Constantine Peresypkin,
Email: [email protected]
HLD Version 0.2, 9/sep/2012
Intro
• This is a high-level design proposal for the Apache Drill project from the OpenDremel team.
• History slides and the usual “about us” material have been moved to the end of the deck.
• A slide with all relevant links is also at the end.
Design Tenet #1
• Apache Drill must support multi-tenant semantics internally, rather than being run entirely inside guest VMs.
• It should be inspired by BigQuery, not only by the Dremel/PowerDrill/Tenzing papers.
• It is not practical to set up a dedicated cloud (billed hourly) just to run a query for a few seconds.
• The codebase must be clearly divided into a trusted part and an untrusted part. The trusted part must be kept to an absolute minimum and must be peer-reviewed, secured, audited and metered.
Design Tenet #2
• Apache Drill must be modular and customizable in many dimensions.
• The schema-on-read concept must be supported. An imperatively coded, high-performance data parser must be embeddable into the query.
• SQL is no longer enough. New query languages must be easy to add, as must user-defined functions (UDFs) implementing deep analytics (such as statistics and machine learning).
• Additionally, various data formats must be supported, such as column stores, row stores, PAX, RCFile, etc.
Design Tenet #2 (cont.)
• We suggest relaxing the query plan format to an arbitrary executable, and the data format to an arbitrary opaque BLOB.
• This way, new query languages and new data formats can be supported without changing the backend.
• As an added benefit, the backend becomes a generic, lightweight, homogeneous compute-storage cloud.
• This approach gives a good separation of control: the cloud operator controls and bills for generic infrastructure, and the query engine is left completely in the control of the tenant/user.
Design Tenet #3
• Apache Drill requests/queries must be hyper-elastic, meaning they can exploit the compute capacity of thousands of servers for a short duration of just a few seconds. No resources may be kept spinning per user between queries or when idle.
• Traditional VMs are too heavyweight for that. Container approaches such as OpenVZ and LXC are not secure enough in a multi-tenant context.
• We suggest making sandboxing pluggable and supporting ZeroVM (developed for OpenDremel) and LXC (fine for private clouds) to begin with.
Design Tenet #4
• Apache Drill must be efficient.
• Value-per-bit is extremely low with BigData.
• Overhead in the inner loop must be kept to a minimum.
• We found Java inefficient for general number crunching (such as data compression). The main problem with Java is that GC overhead is unavoidable for the whole data corpus being scanned. We went so far as to keep all data in byte arrays and auto-generate transformation code; it still underperformed, and code complexity went through the roof.
Suggested Architecture
[Diagram: Browser / Client → (query) → single-tenant frontend running inside a traditional guest VM (Query Compiler on a JVM) → (executable job) → multi-tenant backend: scale-out object store with in-situ compute.]
Suggested Frontend Design
• Usual Java single-tenant web application.
• In charge of:
– All interaction with user.
– Query/job submission
– Query/job progress monitoring
– Result browsing
[Diagram components: Java Servlet, Query Compiler, REST Gateway; Client Tools: CLI, AJAX App.]
Suggested AJAX
• Which AJAX framework?
• ExtJS?
• Look and feel: just clone the Google App with the trademarks and logos replaced?
• Why is a web UI more important for Drill than for Hive?
– Drill is interactive; at least a basic web UI must be provided with each release.
Suggested CLI Design
• Would Bash + curl suffice?
• Or a full-blown Java CLI tool?
Suggested REST-GW Design
• Usual vanilla Java WebApp with Spring!
Suggested Query Compiler Design #1
• The Query Compiler consists of two component libraries with a stable but language-dependent interface between them (so no reuse, unfortunately):
[Diagram: Query Text → Parsers → SemanticModelReader → Planners → Executable Script. Error outputs: Syntax Errors, Semantic Errors.]
Suggested Query Compiler Design #2
• DrqlSemanticModelReader is ready and published
under …..
• The SemanticModel that a parser produces closely follows the original language. A parser just parses the query text and does not attempt to “give it meaning” or annotate it.
• Simplified example:
– List<Expression> getResultColumns();
– List<DrqlQuery> getFromClause();
– List<ColumnId> getGroupByClause();
– etc.
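The simplified getters above could be gathered into a Java interface roughly as follows. This is only a sketch: the three method signatures come from the slide, while the fields of Expression and ColumnId and the ParsedQuery class are assumptions made for illustration, not the published DrqlSemanticModelReader code.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical value types: the slide names them but does not define them.
final class Expression {
    final String text;
    Expression(String text) { this.text = text; }
}

final class ColumnId {
    final String name;
    ColumnId(String name) { this.name = name; }
}

// Read-only view of a parsed query, using the three getters listed on the slide.
interface DrqlQuery {
    List<Expression> getResultColumns();
    List<DrqlQuery> getFromClause();     // nested subqueries, if any
    List<ColumnId> getGroupByClause();
}

// Hypothetical immutable implementation that a parser could hand to a planner.
final class ParsedQuery implements DrqlQuery {
    private final List<Expression> resultColumns;
    private final List<DrqlQuery> fromClause;
    private final List<ColumnId> groupByClause;

    ParsedQuery(List<Expression> resultColumns,
                List<DrqlQuery> fromClause,
                List<ColumnId> groupByClause) {
        this.resultColumns = Collections.unmodifiableList(resultColumns);
        this.fromClause = Collections.unmodifiableList(fromClause);
        this.groupByClause = Collections.unmodifiableList(groupByClause);
    }

    public List<Expression> getResultColumns() { return resultColumns; }
    public List<DrqlQuery> getFromClause() { return fromClause; }
    public List<ColumnId> getGroupByClause() { return groupByClause; }
}
```

The model stays purely syntactic here: it records what the query text said and leaves interpretation to the planner.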
Suggested Query Compiler Design #3
• What is an Executable Script?
– A self-contained, serializable, executable object. When executed with an appropriate executor, it yields the correct query result on given input data of the expected format.
– Self-contained means no dependencies: everything is included in the executable object.
– In particular, data parsing logic is included.
– However, data access logic is NOT included.
– The model for the script is: “Here is your blob of size N, mapped to memory starting at address S. You have time T to generate your result, up to size R, in memory starting at address D. You will be terminated without advance notice for any attempted violation of any restriction.”
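The contract above (input blob of size N, time budget T, result of at most size R) can be illustrated with a small Java harness. This is a sketch under stated assumptions: ExecutableScript and ScriptRunner are hypothetical names, and a worker thread with a timeout only mimics the contract. In the proposal itself, enforcement is the job of the sandbox (ZeroVM or LXC), not of JVM threads, and a thread interrupt will not stop a script that ignores it.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical contract: read the input blob, write the result into the output area.
interface ExecutableScript {
    void run(ByteBuffer input, ByteBuffer output) throws Exception;
}

final class ScriptRunner {
    // Runs the script with a time budget (T) and a result size cap (R).
    // Returns the result bytes, or throws IllegalStateException on a violation.
    static byte[] execute(ExecutableScript script, byte[] blob,
                          long timeLimitMillis, int maxResultBytes) {
        ByteBuffer input = ByteBuffer.wrap(blob).asReadOnlyBuffer(); // blob of size N, not writable
        ByteBuffer output = ByteBuffer.allocate(maxResultBytes);      // room for at most R bytes
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<?> job = worker.submit(() -> {
                script.run(input, output);
                return null;
            });
            job.get(timeLimitMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("time limit exceeded", e);
        } catch (ExecutionException e) {
            // Includes BufferOverflowException when the script writes more than R bytes.
            throw new IllegalStateException("script failed or violated a restriction", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            worker.shutdownNow(); // interrupts the script if it is still running
        }
        output.flip();
        byte[] result = new byte[output.remaining()];
        output.get(result);
        return result;
    }
}
```

Note that the script receives only buffers: it never opens files or sockets itself, which matches the point that data access logic is not part of the executable.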
Suggested Query Compiler Design #3 (cont.)
• How is the executable script generated?
1. A query object implementing the SemanticModelReader interface is provided to the planner by the parser.
2. The planner examines the semantic model through the SemanticModelReader interface and produces a query plan object that implements the QueryPlanModelReader interface. Query analysis and optimization take place during this stage and, if needed, additional QueryPlanModelRewriter and/or QueryPlanModelVisitor interfaces could be created for this purpose. However, DrQL is a simple language without a large (or any) search space, so the value of an optimizer is small. We suggest bypassing query rewriting and query optimization altogether for initial releases.
3. Once the query plan is generated, the most appropriate code template script is selected. The template engine then processes the template together with the QueryPlanModelReader object to produce the executable script.
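The three steps can be sketched end to end in Java. Everything below is an assumption made for illustration: the methods on SemanticModelReader and QueryPlanModelReader are invented and far smaller than any real interface, the "full-scan" template name is hypothetical, and plain ${name} string substitution stands in for a real template engine (the deck mentions Apache Velocity; its API is not used here).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily reduced versions of the two reader interfaces.
interface SemanticModelReader {
    List<String> resultColumns();
    String sourceTable();
}

interface QueryPlanModelReader {
    String templateName();            // which code template to use
    Map<String, String> bindings();   // values substituted into the template
}

final class Planner {
    // Step 2: derive a plan from the semantic model, with no rewriting or optimization.
    static QueryPlanModelReader plan(SemanticModelReader query) {
        final Map<String, String> bindings = new LinkedHashMap<>();
        bindings.put("columns", String.join(", ", query.resultColumns()));
        bindings.put("table", query.sourceTable());
        return new QueryPlanModelReader() {
            public String templateName() { return "full-scan"; }
            public Map<String, String> bindings() { return bindings; }
        };
    }
}

final class TemplateEngine {
    // Step 3: fill the ${name} placeholders of the selected template.
    static String render(Map<String, String> templates, QueryPlanModelReader plan) {
        String script = templates.get(plan.templateName());
        if (script == null) {
            throw new IllegalArgumentException("no template named " + plan.templateName());
        }
        for (Map.Entry<String, String> e : plan.bindings().entrySet()) {
            script = script.replace("${" + e.getKey() + "}", e.getValue());
        }
        return script;
    }
}
```

Because the planner only sees the reader interface, a new query language needs a new parser but can reuse the same planner and templates.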
Suggested Backend Design
• TODO
• Executors per se
– Janino-based Java Executor
– LXC-GCC-based C Executor
– ZeroVM-GCC-based C Executor
• Storage platforms with collocated data processing
– Local files (non-distributed)
– HDFS
– OpenStack Swift
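OpenDremel/Dazo
[Diagram: the suggested architecture (query → Query Compiler on a JVM → executable job → backend), annotated with the current OpenDremel/Dazo components.]
• Clients: two separate unfinished jQuery apps and a command-line app, with no particular codenames.
• Frontend: we call it Metaxa (historic reasons). BQL parser and an unfinished compiler based on Apache Velocity.
• Backend: we call it Zwift (Swift + ZeroVM). Alpha quality.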
What is Swift?
“Swift is a highly available, distributed,
eventually consistent object/blob store.
Organizations can use Swift to store
lots of data efficiently, safely, and
cheaply.”
Don’t get it?
Swift is THE open-source
counterpart of
Amazon S3
What is ZeroVM?
Highly secure, low-overhead, low-latency container-style virtualization based on the Google Native Client project. The critical security code is taken verbatim from the Chrome browser project and is therefore as secure as the Chrome browser. More info: http://ZeroVM.org and
http://news.ycombinator.com/item?id=3746222
ZeroVM highlights
1. Disposable VM per request
2. HyperElasticity per request
3. Embeddable into everything
4. High-performance (x86/ARM)
5. Erlang inspired clustering
6. Written in pure C, no deps
Don’t get it?
ZeroVM to Virtualization
is what
SQLite is to Databases
Links
• https://github.com/ApacheDrill/Brainstorm/wiki/Apache-Drill-Links
• OpenDremel (1st generation design):
– http://code.google.com/p/dremel/source/browse?repo=dremel
– http://code.google.com/p/dremel/source/browse?repo=metaxa
• Dazo (2nd generation design):
– https://github.com/Dazo-org
OpenDremel Story: 2010
• Camuel Gilyadov started a Dremel implementation, named OpenDremel, in summer 2010.
• David Gruzman joined the effort a few months later, followed by Constantine Peresypkin.
• There wasn’t a comprehensive design or architecture. The goal was to get the hierarchical-columnar transformation working smoothly and in strict accordance with the Dremel paper. We published several working implementations under the Apache License.
• Hong San was hired as the first full-timer to speed up development. The Metaxa milestone was set.
OpenDremel Story: 2011
• The early OpenDremel design was found to be too naive, mainly due to Java underperformance in inner number-crunching loops.
• After fierce brainstorming, the project was restarted from scratch under a new name, Dazo. With Dazo, a query plan is an arbitrary piece of executable native code, with a Java frontend.
• From then on we took inspiration from BigQuery rather than from the Dremel paper.
• We decided to use Google NaCl as the sandboxing technology to isolate queries as well as meter resource consumption. The new sandbox was named ZeroVM.
• For storage, we decided to use OpenStack Swift.
OpenDremel Story: 2012
• With four people full-time and several others part-time, we still don’t have a fully integrated version, but we are satisfied with what we have achieved and convinced that the decisions behind Dazo were correct.
• We believe ZeroVM could be a disruptive technology in itself, revolutionizing the BigData@Cloud space.
• We are excited by the Apache Drill initiative and hope to be useful to it.
• Check the blog: http://BigDataCraft.com