Spark SQL versus Apache Drill: Different Tools with Different Rules

© 2014 MapR Technologies


Contact Information

Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC for Apache’s Drill, Zookeeper & others
VP of Incubator at Apache Foundation

Email [email protected] [email protected]

Twitter @ted_dunning



What is Drill?


A query engine that is…
• Columnar/Vectorized
• Optimistic/pipelined
• Runtime compilation
• Late binding
• Extensible


Table Can Be an Entire Directory Tree

// On a file
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
group by errorLevel;

// On the entire data collection: all years, all months
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs`
group by errorLevel;
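As a rough analogy for the two queries above, the sketch below treats every file under a root directory as part of one table and runs the same group-by count over it. It is not how Drill is implemented: it assumes JSON-lines files instead of Parquet, and the helper names and sample data are hypothetical.

```python
import json
import os
import tempfile
from collections import Counter


def count_by_error_level(root):
    """Count records per errorLevel across every .json file under root."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".json"):
                continue
            with open(os.path.join(dirpath, name)) as handle:
                for line in handle:
                    if line.strip():
                        counts[json.loads(line)["errorLevel"]] += 1
    return counts


def build_demo_tree(root):
    """Create a tiny, made-up log tree: <root>/2014/Jan and <root>/2014/Feb."""
    samples = {
        ("2014", "Jan"): ["ERROR", "WARN", "ERROR"],
        ("2014", "Feb"): ["WARN", "INFO"],
    }
    for (year, month), levels in samples.items():
        folder = os.path.join(root, year, month)
        os.makedirs(folder)
        with open(os.path.join(folder, "part0001.json"), "w") as handle:
            for level in levels:
                handle.write(json.dumps({"errorLevel": level}) + "\n")


with tempfile.TemporaryDirectory() as demo_root:
    build_demo_tree(demo_root)
    # "Table" scoped to one month's directory.
    one_month = dict(count_by_error_level(os.path.join(demo_root, "2014", "Jan")))
    # "Table" scoped to the whole tree: all years, all months.
    whole_tree = dict(count_by_error_level(demo_root))
```

The only thing that changes between the two calls is the path, which mirrors how the two SQL statements differ only in the table name.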


Basic Process

[Diagram: Zookeeper, plus three nodes each showing a Drillbit, a Distributed Cache, and DFS/HBase]

1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node


Stages of Query Planning

[Diagram: SQL Query -> Parser -> Logical Planner (heuristic and cost based) -> Physical Planner (cost based) -> Query Foreman -> plan fragments sent to Drillbits]


Query Execution

[Diagram: components inside a Drillbit]
• Endpoints: RPC Endpoint, JDBC Endpoint, ODBC Endpoint
• Parsers: SQL Parser, HiveQL Parser, Pig Parser
• Logical Plan, Optimizer, Physical Plan
• Foreman, Scheduler, Operators
• Distributed Cache
• Storage Interface: HDFS, HBase, Mongo, Cassandra


Batches of Values

• Value vectors
  – List of values, with same schema
  – With the 4-value semantics for each value

• Shipped around in batches
  – max 256k bytes in a batch
  – max 64K rows in a batch

• RPC designed for multiple replies to a request
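To make the batching idea concrete, here is a minimal sketch that turns a stream of records into column-oriented batches with a row cap. It only models the 64K-row limit quoted above (the byte limit is left out), and the function and variable names are hypothetical rather than Drill's.

```python
MAX_ROWS_PER_BATCH = 64 * 1024  # row cap quoted on the slide


def to_column_batches(rows, columns, max_rows=MAX_ROWS_PER_BATCH):
    """Yield dicts mapping column name -> list of values, at most max_rows long."""
    batch = {name: [] for name in columns}
    size = 0
    for row in rows:
        for name in columns:
            batch[name].append(row.get(name))  # a missing field becomes None
        size += 1
        if size == max_rows:
            yield batch
            batch = {name: [] for name in columns}
            size = 0
    if size:
        yield batch  # final, partially filled batch


# Ten made-up records, batched four rows at a time to keep the demo small.
rows = [{"a": i, "b": str(i)} for i in range(10)]
batches = list(to_column_batches(rows, ["a", "b"], max_rows=4))
batch_sizes = [len(batch["a"]) for batch in batches]
```

Each batch holds one list per column, so all values in a list share a type, which is the property that makes vectorized processing possible.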


Fixed Value Vectors


Vectorization
• Drill operates on more than one record at a time
  – Word-sized manipulations
  – SIMD instructions

• GCC, LLVM and JVM all do various optimizations automatically
  – Manually code algorithms

• Logical Vectorization
  – Bitmaps allow lightning fast null-checks
  – Avoid branching to speed CPU pipeline
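The bitmap idea can be sketched in a few lines. This is only an illustration of the technique: Python does not emit SIMD instructions, and the layout here (one validity bit per slot, bit i for slot i) is an assumption for the example, not a description of Drill's vectors.

```python
def masked_sum(values, validity):
    """Sum the slots whose bit is set in the validity bitmap (bit i <-> values[i])."""
    total = 0
    for i, value in enumerate(values):
        bit = (validity >> i) & 1
        total += value * bit  # multiply by 0 or 1 instead of branching on null
    return total


def count_non_null(validity):
    """Count non-null slots with one pass over the bitmap's set bits."""
    return bin(validity).count("1")


values = [3, 99, 7, 99, 5]  # null slots hold an arbitrary placeholder (99)
validity = 0b10101          # bits 0, 2 and 4 set: slots 0, 2 and 4 are non-null

total = masked_sum(values, validity)
non_null = count_non_null(validity)
```

The null check never touches the values themselves: a whole word of the bitmap answers the question for many slots at once, and the arithmetic form avoids a data-dependent branch.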


Runtime Compilation is Faster
• JIT is smart, but more gains with runtime compilation
• Janino: Java-based Java compiler

From http://bit.ly/16Xk32x


Drill compiler

[Diagram: CodeModel generates code -> Janino compiles runtime byte-code -> merge byte-code of the two classes (runtime byte-code and precompiled byte-code templates) -> loaded class]
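The benefit of generating code per query can be illustrated with a small stand-in. The sketch below compares a generic interpreter that walks an expression tree for every row with a function generated and compiled once for that expression. Python's built-in `compile` and `exec` play the role that CodeModel and Janino play in Drill; the expression format and all names are hypothetical.

```python
def interpret(expr, row):
    """Walk a tiny expression tree for every row (the generic path)."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "const":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    if op == "+":
        return left + right
    if op == "*":
        return left * right
    if op == ">":
        return left > right
    raise ValueError("unknown operator: %r" % (op,))


def to_source(expr):
    """Turn the expression tree into Python source text."""
    op = expr[0]
    if op == "col":
        return "row[%r]" % (expr[1],)
    if op == "const":
        return repr(expr[1])
    if op not in ("+", "*", ">"):
        raise ValueError("unknown operator: %r" % (op,))
    return "(%s %s %s)" % (to_source(expr[1]), op, to_source(expr[2]))


def compile_expr(expr):
    """Generate and compile a function specialized to this one expression."""
    source = "def generated(row):\n    return " + to_source(expr) + "\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["generated"]


# Hypothetical filter: price * qty > 100
expr = (">", ("*", ("col", "price"), ("col", "qty")), ("const", 100))
rows = [{"price": 30, "qty": 4}, {"price": 5, "qty": 3}]

predicate = compile_expr(expr)
compiled_result = [predicate(row) for row in rows]
interpreted_result = [interpret(expr, row) for row in rows]
```

Both paths give the same answers; the generated function simply skips the per-row dispatch on operator type, which is the work that runtime compilation removes.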


Optimistic

[Chart: Speed vs. check-pointing. Categories: cmd pipeline, small db, med db, large db, dw, compilation, hadoop; axis from 0 to 160. Annotations: "No need to checkpoint", "Checkpoint frequently", "Apache Drill".]


Optimistic Execution
• Recovery code trivial
  – Running instances discard the failed query’s intermediate state
• Pipelining possible
  – Send results as soon as batch is large enough
  – Requires barrier-less decomposition of query


Pipelining
• Record batches are pipelined between nodes
  – ~256kB usually

• Unit of work for Drill
  – Operators work on a batch

• Operator reconfiguration happens at batch boundaries

[Diagram: record batches flowing between three Drillbits]
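A minimal sketch of batch pipelining, using Python generators as stand-ins for operators. Each operator consumes one batch and passes its output downstream before the upstream operator has produced the next batch. The operator names and the shared log are hypothetical and exist only to make the interleaving visible.

```python
def scan(values, batch_size):
    """Source operator: emit fixed-size batches."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


def filter_even(batches, log):
    """Keep even values; records the size of each incoming batch."""
    for batch in batches:
        log.append(("filter", len(batch)))
        yield [value for value in batch if value % 2 == 0]


def running_total(batches, log):
    """Emit the running sum after every batch instead of waiting for the end."""
    total = 0
    for batch in batches:
        total += sum(batch)
        log.append(("total", total))
        yield total


log = []
pipeline = running_total(filter_even(scan(list(range(10)), 4), log), log)
results = list(pipeline)
```

The log alternates between the two operators, which shows that the downstream operator sees the first batch before the second one has been filtered. There is no barrier between the stages.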


Pipelining
• Random access: sort without copy or restructuring
• Avoids serialization/deserialization
• Off-heap (no GC woes when lots of memory)
• Read/write to disk
  – when data larger than memory

[Diagram: Drillbit memory overflow uses disk]


Cost-based Optimization
• Using Optiq, an extensible framework
  – Pluggable rules and cost model
• Rules for distributed plan generation
  – Insert Exchange operator into physical plan
  – Optiq enhanced to explore parallel query plans
• Pluggable cost model
  – CPU, IO, memory, network cost (data locality)
  – Storage engine features (HDFS vs Hive vs HBase)

[Diagram: Query Optimizer with pluggable rules and a pluggable cost model]
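What "pluggable cost model" buys you can be shown with a toy planner: the plan chooser stays fixed while the cost function is swapped. This is a sketch under made-up numbers, not Optiq's API; the plan names, resource estimates, and weights are all hypothetical.

```python
def pick_cheapest(plans, cost_model):
    """Return the candidate plan with the lowest cost under the given model."""
    return min(plans, key=cost_model)


# Hypothetical candidate plans with made-up resource estimates.
plans = [
    {"name": "broadcast_join", "cpu": 10, "io": 50, "network": 200},
    {"name": "hash_partition_join", "cpu": 30, "io": 50, "network": 80},
]


def network_heavy_cost(plan):
    """Cost model for a cluster where moving data over the network is expensive."""
    return plan["cpu"] + plan["io"] + 5 * plan["network"]


def cpu_heavy_cost(plan):
    """Cost model for a cluster where CPU is the scarce resource."""
    return 20 * plan["cpu"] + plan["io"] + plan["network"]


choice_when_network_is_scarce = pick_cheapest(plans, network_heavy_cost)["name"]
choice_when_cpu_is_scarce = pick_cheapest(plans, cpu_heavy_cost)["name"]
```

The same candidates produce different winners under the two models, which is why a storage engine or deployment that knows its own costs benefits from being able to plug in its own model.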


What is SparkSQL?


What is Spark SQL
• Essentially syntactic sugar over a limited subset of Spark
• Inherits all the virtues (and vices) of Spark
  – Lambdas can serve as UDFs (has subtle issues for performance)
• Inputs have to be loaded
  – Perhaps lazily, not obvious when load actually happens
• Not designed as a streaming engine, requires more memory
• Some JSON support, but not so much for large or variable objects
• Embedded in a real language!


In More Detail
• A Spark program consists of a computation graph that consumes and produces so-called resilient distributed datasets (RDDs)
• SparkSQL allows these computations to be defined using SQL (but needs schema definitions on the RDDs)
• Conventional Spark programs and SparkSQL programs interoperate nearly seamlessly


Many Similarities

[Diagram: SQL Parser, Java, Scala and Python front ends -> Logical Plan -> Optimizer -> Physical Plan]


Important Differences
• Spark execution assumes RDDs are a complete representation, not a stream of row batches
• Input sources don’t inject optimization rules, nor expose detailed cost models
• Most RDDs don’t have a zero-copy capability
• Spark inherits the JVM memory model, with very limited use of off-heap memory


scala> sqlContext.sql("select * from json.`foo.json`").show
+---+------+----+
|  a|     b|   c|
+---+------+----+
|  3|[3, 2]| xyz|
|  7|  null| wxy|
|  7|    []|null|
+---+------+----+


scala> sqlContext.sql("select a, explode(b) b_v from json.`bug.json`").show
+---+---------+
|  a|      b_v|
+---+---------+
|  3|        3|
|  3|        2|
+---+---------+
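The second result has lost both rows where `a` is 7, because `explode` emits one row per array element and therefore nothing for a null or empty array. The plain-Python sketch below imitates that behaviour, alongside an outer variant that keeps such rows (later Spark versions provide `explode_outer` for this). It assumes `bug.json` holds the same three records shown for `foo.json`; the functions are simplified stand-ins, not Spark code.

```python
def explode(rows, column, alias):
    """One output row per array element; rows with a null or empty array vanish."""
    exploded = []
    for row in rows:
        for element in (row[column] or []):
            exploded.append({"a": row["a"], alias: element})
    return exploded


def explode_outer(rows, column, alias):
    """Like explode, but a null or empty array yields one row with a null element."""
    exploded = []
    for row in rows:
        for element in (row[column] or [None]):
            exploded.append({"a": row["a"], alias: element})
    return exploded


# Assumed contents of the JSON file, taken from the first query's output.
rows = [
    {"a": 3, "b": [3, 2], "c": "xyz"},
    {"a": 7, "b": None, "c": "wxy"},
    {"a": 7, "b": [], "c": None},
]

inner = explode(rows, "b", "b_v")
outer = explode_outer(rows, "b", "b_v")
```

Whether dropping those rows is the desired behaviour depends on the query, which is the point of the example: variable or missing nested data needs care.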


First Synthesis
• Drill has a more nuanced optimizer, better code generation
  – This often leads to ~2x speed advantage

• Drill has ValueVector and row batches
  – This leads to much less memory pressure

• Drill has a much stricter memory life-cycle
  – Query is done and gone, no need for big GCs even on big memory

• Drill is all about SQL execution


But …
• Spark can optimize across entire program
  – This often leads to ~2x speed advantage

• Spark has much more flexible memory structures
  – This can lead to much less memory pressure

• Spark has a much more flexible RDD life-cycle
  – RDDs can be cached, persisted or simply recomputed as necessary

• Spark is not all about SQL execution


The Really Big Differences
• Drill focuses heavily on secure, multi-tenant access to data
  – Strong impersonation semantics
  – Cascading rights via views
  – Queries co-exist in a cluster and reserve only their momentary resource requirements

• Spark focuses heavily on fully integrated execution models
  – Any Spark function works with (almost) any RDD
  – Memory residency of RDDs is the highest goal


Drill security
➢ End to end security from BI tools to Hadoop
➢ Standards-based PAM authentication
➢ 2-level user impersonation
➢ Fine-grained row and column level access control with Drill Views – no centralized security repository required


Granular security permissions through Drill views

Raw File (/raw/cards.csv)
Owner: Admins
Permission: Admins

Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist View (/views/maskedcards.view.drill)
Owner: Admins
Permission: Data Scientists
Not a physical data copy

Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Business Analyst View
Owner: Admins
Permission: Business Analysts

Name  City      State
Dave  San Jose  CA
John  Boulder   CO


Ownership Chaining
Combine Self Service Exploration with Data Governance

Raw File (/raw/cards.csv)
John (Owner)

Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist (/views/V_Scientist)
Jane (Read), John (Owner)

Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Analyst (/views/V_Analyst)
Jack (Read), Jane (Owner)

Name  City      State
Dave  San Jose  CA
John  Boulder   CO

[Diagram: V_Analyst -> V_Scientist -> RAW FILE, showing the access path and the ownership chaining]

Jack queries the view V_Analyst

Does Jack have access to V_Analyst? -> YES
Who is the owner of V_Analyst? -> Jane
Drill accesses V_Analyst as Jane (Impersonation hop 1)

Does Jane have access to V_Scientist? -> YES
Who is the owner of V_Scientist? -> John
Drill accesses V_Scientist as John (Impersonation hop 2)

Does John have permissions on raw file? -> YES
Who is the owner of raw file? -> John
Drill accesses source file as John (no impersonation here)

*Ownership chain length (# hops) is configurable
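The walk-through above can be sketched as a small loop over a catalog. This is an illustration of the chaining logic only, under assumed data structures: the catalog layout, function names, and error messages are hypothetical, and Drill's real implementation delegates permission checks to the underlying file system.

```python
# Hypothetical catalog mirroring the slide: each object has an owner, a set of
# readers, and the object it is defined on (None for the raw file).
CATALOG = {
    "V_Analyst": {"owner": "Jane", "readers": {"Jack"}, "source": "V_Scientist"},
    "V_Scientist": {"owner": "John", "readers": {"Jane"}, "source": "/raw/cards.csv"},
    "/raw/cards.csv": {"owner": "John", "readers": set(), "source": None},
}


def resolve(user, obj, max_hops):
    """Follow the ownership chain; return ([(object, checked_as)], hops used)."""
    accessed = []
    hops = 0
    current_user = user
    while obj is not None:
        entry = CATALOG[obj]
        allowed = current_user == entry["owner"] or current_user in entry["readers"]
        if not allowed:
            raise PermissionError("%s may not read %s" % (current_user, obj))
        accessed.append((obj, current_user))
        # Reading through a view owned by someone else costs one impersonation hop.
        if entry["source"] is not None and current_user != entry["owner"]:
            hops += 1
            if hops > max_hops:
                raise PermissionError("ownership chain longer than %d hops" % max_hops)
            current_user = entry["owner"]
        obj = entry["source"]
    return accessed, hops


def is_allowed(user, obj, max_hops):
    """Convenience wrapper that turns the permission error into a boolean."""
    try:
        resolve(user, obj, max_hops)
        return True
    except PermissionError:
        return False


accessed, hops = resolve("Jack", "V_Analyst", max_hops=2)
```

With the hop limit lowered to 1, the same query is rejected, which corresponds to the configurable chain length noted on the slide.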


But was that the right question?


Unification is Feasible
• It is relatively easy to build a DrillContext in Spark
  – compare to SQLContext

• Define Datasets as Drill data sources and sinks
  – Drill runs at the same time as Spark

• Orchestrate transport of Spark data to/from Drill

• Cost of transport is remarkably small


What does the Spark and Drill integration look like?

Features at a glance:
• Use Drill as an input to Spark
• Query Spark RDDs via Drill and create data pipelines


Is unification valuable?


Example of Unification


Simple Session Protocol
• Calls started at random intervals
• During calls, reconnection is done periodically
• Many log events are buffered and sent to current tower during active state


The Resulting Data
• Signal strength reports
  – Tower, timestamp, rank, caller, caller location*, signal strength
• Tower log events: HELLO, FAIL, CONNECT, END
• Call end

• Note that data for one tower is often received by another because callers buffer diagnostic data

*Location isn’t quite location … poetic license applied for


What can we do with it?


Baby Steps

• What does signal propagation look like?

select x, y, signal from cdr_stream where tower = 3

• Plot results to get a map of signal strength around a tower


Baby Steps

• What does tower coverage look like?

select x, y from cdr_stream where tower = 3 and event_type = 'CONNECT'

• Plot results to get a map of coverage area for a tower


What about anomaly detection?


Detecting Tower Loss

It’s important to know if traffic is stopped or delayed because of a problem…

But events from towers come at irregular intervals

How long after the last event should you begin to worry?


Event Stream (timing)
• Events of various types arrive at irregular intervals
  – we can assume Poisson distribution

• The key question is whether frequency has changed relative to expected values
  – This shows up as a change in interval

• Want alert as soon as possible
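Under the Poisson assumption, gaps between events are exponentially distributed, so P(gap > t) = exp(-rate · t). That gives a direct answer to "how long after the last event should you begin to worry": pick a percentile and solve for t. The sketch below does this; the event rate is a made-up number and the function names are hypothetical.

```python
import math


def silence_threshold(rate_per_second, percentile):
    """Gap length that is exceeded with probability (1 - percentile)."""
    return -math.log(1.0 - percentile) / rate_per_second


def gap_anomaly(rate_per_second, seconds_since_last_event):
    """-log P(gap >= observed gap); larger values are more surprising."""
    return rate_per_second * seconds_since_last_event


rate = 2.0  # assumed long-run average of two events per second
worry_at_999 = silence_threshold(rate, 0.999)
worry_at_9999 = silence_threshold(rate, 0.9999)
score_at_999 = gap_anomaly(rate, worry_at_999)
```

With two events per second, silence of about 3.5 seconds is already a 1-in-1000 event, and about 4.6 seconds is a 1-in-10000 event. The threshold scales inversely with the rate, which is why a changing event rate has to be accounted for before this test is applied.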


Converting Event Times to Anomaly

[Plot: thresholds marked at the 99.9%-ile and the 99.99%-ile]


But in the real world, event rates often change


Time Intervals Are Key to Modeling Sporadic Events




After Rate Correction


Detecting Anomalies in Sporadic Events


Propagation Anomalies
• What happens when something shadows part of the coverage field?
  – Can happen in urban areas with a construction crane

• Can solve heuristically
  – Subtract from reference image composed by long term averages
  – Doesn’t deal well with weak signal regions and low S/N

• Can solve probabilistically
  – Compute anomaly for each measurement, use mean of log(p)




Variable Signal/Noise Makes Heuristic Tricky

Far from the transmitter, received signal is dominated by noise. This makes subtraction of average value a bad algorithm.


Other Issues
• Finding anomalies in coverage area is similarly tricky

• Coverage area is roughly where tower signal strength is higher than neighbors

• Except for fuzziness due to hand-off delays
• Except for bias due to large-scale caller motions
  – Rush hour
  – Event mobs


Simple Answer for Propagation Anomalies
• Cluster signal strength reports
• Cluster locations using k-means, large k
• Model report rate anomaly using discrete event models
• Model signal strength anomaly using percentile model

• Trade larger k against higher report rates, faster detection

• Overall anomaly is sum of individual log(p) anomalies
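A minimal sketch of the last two ideas, for a single location cluster: an empirical percentile model turns a signal-strength reading into a probability, and the per-component probabilities are combined by summing logs. The sign convention (summing -log(p) so that larger means more anomalous), the smoothing, and the sample readings are assumptions made for the example.

```python
import math
from bisect import bisect_right


def tail_probability(history, value):
    """Empirical P(reading <= value), smoothed so it is never exactly zero."""
    ordered = sorted(history)
    at_or_below = bisect_right(ordered, value)
    return (at_or_below + 1) / (len(ordered) + 2)


def overall_anomaly(probabilities):
    """Sum of -log(p) over component anomalies treated as independent."""
    return -sum(math.log(p) for p in probabilities)


# Made-up signal strength history (dBm) for one cluster of caller locations.
history = [-70, -72, -68, -71, -69, -73, -70, -71]

p_normal = tail_probability(history, -70)  # a typical reading
p_weak = tail_probability(history, -95)    # far weaker than anything seen before

# Combine the weak-signal probability with an assumed report-rate probability of 0.5.
combined = overall_anomaly([p_weak, 0.5])
```

Because the model is built per cluster, a weak reading is judged against what is normal at that location, which sidesteps the problem that subtracting a long-term average has in low signal-to-noise regions.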


Coverage Areas


Just One Tower


Cluster Reports for That Tower




General Dataflow


Summary
• Drill and Spark provide healthy competition in Apache
• Over time, they have converged in many respects
  – But important distinctions remain
• Projects can work together to share key technology
  – Apache Arrow … started as off-shoot of Drill, now has >12 major projects as participants, including Spark
• Systems can work together even more deeply
  – DrillContext makes integration first class


e-book available courtesy of MapR

http://bit.ly/1jQ9QuL

A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)





Thank you for coming today!


