hadoop-summit
© 2014 MapR Technologies
Spark SQL versus Apache Drill: Different Tools with Different Rules
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC for Apache’s Drill, Zookeeper & others
VP of Incubator at Apache Foundation
Email [email protected] [email protected]
Twitter @ted_dunning
What is Drill?
A query engine that has…
• Columnar/Vectorized
• Optimistic/pipelined
• Runtime compilation
• Late binding
• Extensible
Table Can Be an Entire Directory Tree
// On a file
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs/2014/Janpart0001.parquet`
group by errorLevel;

// On the entire data collection: all years, all months
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs`
group by errorLevel;
Basic Process
[Diagram: Zookeeper, plus three Drillbits, each with a Distributed Cache and each running over DFS/HBase]
1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node
Stages of Query Planning
SQL Query -> Parser -> Logical Planner (heuristic and cost based) -> Physical Planner (cost based) -> Query Foreman -> plan fragments sent to Drillbits
Query Execution
[Diagram components: RPC, JDBC and ODBC endpoints; SQL, HiveQL and Pig parsers; Logical Plan; Optimizer; Physical Plan; Foreman; Scheduler; Operators; Distributed Cache; Storage Interface over HDFS, HBase, Mongo and Cassandra]
Batches of Values
• Value vectors
  – List of values, with same schema
  – With the 4-value semantics for each value
• Shipped around in batches
  – max 256k bytes in a batch
  – max 64K rows in a batch
• RPC designed for multiple replies to a request
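The two batch limits above can be illustrated with a small sketch. This is not Drill's implementation; the function name `batches` and the use of already-encoded `bytes` values are assumptions made only for illustration. It closes a batch as soon as either the row limit or the byte limit would be exceeded.

```python
from typing import Iterable, Iterator, List

MAX_BATCH_ROWS = 64 * 1024    # "max 64K rows in a batch"
MAX_BATCH_BYTES = 256 * 1024  # "max 256k bytes in a batch"


def batches(values: Iterable[bytes],
            max_rows: int = MAX_BATCH_ROWS,
            max_bytes: int = MAX_BATCH_BYTES) -> Iterator[List[bytes]]:
    """Group encoded values into batches that respect both limits.

    Hypothetical helper: a single value larger than max_bytes still
    gets a batch of its own rather than being rejected.
    """
    batch: List[bytes] = []
    size = 0
    for value in values:
        # Close the current batch if adding this value would break a limit.
        if batch and (len(batch) >= max_rows or size + len(value) > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(value)
        size += len(value)
    if batch:
        yield batch


if __name__ == "__main__":
    # 100,000 eight-byte values: the byte limit is reached before the row limit.
    sizes = [len(b) for b in batches(b"\0" * 8 for _ in range(100_000))]
    print(sizes)
```

With 8-byte values the byte limit binds first, so each full batch holds 32,768 rows; with narrower values the 64K row limit would bind instead.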
Fixed Value Vectors
Vectorization
• Drill operates on more than one record at a time
  – Word-sized manipulations
  – SIMD instructions
• GCC, LLVM and JVM all do various optimizations automatically
  – Manually code algorithms
• Logical Vectorization
  – Bitmaps allow lightning fast null-checks
  – Avoid branching to speed CPU pipeline
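The bitmap idea can be sketched in a few lines. This is only an illustration of the principle, not Drill's value vector code: Python integers stand in for machine words, and the helper names are hypothetical. One bit per value records validity, so a null check on a whole run of values becomes a single word comparison, and a sum can multiply by the validity bit instead of branching.

```python
from typing import List, Optional


def pack_validity(values: List[Optional[int]]) -> int:
    """Build a validity bitmap: bit i is 1 when values[i] is not null."""
    bitmap = 0
    for i, value in enumerate(values):
        if value is not None:
            bitmap |= 1 << i
    return bitmap


def has_nulls(bitmap: int, length: int) -> bool:
    """One comparison answers the question for all `length` values."""
    return bitmap != (1 << length) - 1


def null_count(bitmap: int, length: int) -> int:
    """Count nulls with a population count instead of a per-value test."""
    mask = (1 << length) - 1
    return length - bin(bitmap & mask).count("1")


def sum_valid(data: List[int], bitmap: int) -> int:
    """Sum without an if: null slots are multiplied by a 0 validity bit."""
    return sum(d * ((bitmap >> i) & 1) for i, d in enumerate(data))


if __name__ == "__main__":
    column = [5, None, 7, 1, None]
    bits = pack_validity(column)
    raw = [5, 99, 7, 1, 42]  # null slots may hold arbitrary bytes
    print(has_nulls(bits, len(column)), null_count(bits, len(column)), sum_valid(raw, bits))
```

In a compiled engine the same pattern maps onto word-sized and SIMD operations; the Python version only shows the logic.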
Runtime Compilation is Faster
• JIT is smart, but more gains with runtime compilation
• Janino: Java-based Java compiler
From http://bit.ly/16Xk32x
Drill compiler
[Diagram: CodeModel generates code -> Janino compiles runtime byte-code -> merge byte-code of the two classes (runtime byte-code and precompiled byte-code templates) -> loaded class]
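As a loose analogy for the generate-then-compile flow, the sketch below builds source text for one specific filter, compiles it at runtime and returns the resulting function. It uses Python's built-in `compile` and `exec` in place of CodeModel and Janino, and `compile_filter` is a hypothetical name, so it shows the idea rather than Drill's mechanism.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]

_ALLOWED_OPS = {"<", "<=", ">", ">=", "==", "!="}


def compile_filter(column: str, op: str, literal: float) -> Callable[[List[Row]], List[Row]]:
    """Generate and compile a filter specialized for one predicate."""
    if op not in _ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    if isinstance(literal, bool) or not isinstance(literal, (int, float)):
        raise ValueError("literal must be a number")
    # Code generation step: column name and constant are baked into the source.
    source = (
        "def batch_filter(batch):\n"
        f"    return [row for row in batch if row[{column!r}] {op} {literal!r}]\n"
    )
    # Compilation step: turn the generated source into executable code.
    namespace: dict = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["batch_filter"]


if __name__ == "__main__":
    keep_errors = compile_filter("errorLevel", ">=", 3)
    print(keep_errors([{"errorLevel": 1}, {"errorLevel": 3}, {"errorLevel": 5}]))
```

The generated function contains no interpretation of the predicate at all, which is the property runtime compilation is after.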
Optimistic
[Chart: speed vs. check-pointing for cmd pipeline, small db, med db, large db, dw, compilation and hadoop workloads, ranging from "No need to checkpoint" to "Checkpoint frequently", with Apache Drill marked]
Optimistic Execution
• Recovery code trivial
  – Running instances discard the failed query’s intermediate state
• Pipelining possible
  – Send results as soon as batch is large enough
  – Requires barrier-less decomposition of query
Pipelining
• Record batches are pipelined between nodes
  – ~256kB usually
• Unit of work for Drill
  – Operators work on a batch
• Operator reconfiguration happens at batch boundaries
[Diagram: three DrillBits]
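Batch-at-a-time pipelining can be sketched with generators. The operator names `scan`, `filter_op` and `project_op` are hypothetical and everything runs in one process, so this only illustrates the barrier-less flow: each operator hands a batch downstream as soon as it is ready, and the first result appears before the input has been fully read.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, int]
Batch = List[Row]


def scan(rows: Iterable[Row], batch_size: int) -> Iterator[Batch]:
    """Read rows and emit them in batches of at most batch_size."""
    batch: Batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def filter_op(batches: Iterable[Batch], predicate: Callable[[Row], bool]) -> Iterator[Batch]:
    """Filter one batch at a time and skip batches that end up empty."""
    for batch in batches:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept


def project_op(batches: Iterable[Batch], columns: List[str]) -> Iterator[Batch]:
    """Keep only the requested columns of each batch."""
    for batch in batches:
        yield [{c: row[c] for c in columns} for row in batch]


if __name__ == "__main__":
    consumed = []

    def source() -> Iterator[Row]:
        for i in range(10):
            consumed.append(i)  # record how much input has been pulled
            yield {"tower": i % 2, "signal": i}

    pipeline = project_op(filter_op(scan(source(), 4), lambda r: r["tower"] == 1), ["signal"])
    print(next(pipeline), "after reading", len(consumed), "rows")
```

Nothing in the chain waits for the full input, which is the property the "barrier-less decomposition" bullet refers to.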
Pipelining
• Random access: sort without copy or restructuring
• Avoids serialization/deserialization
• Off-heap (no GC woes when lots of memory)
• Read/write to disk
  – when data larger than memory
[Diagram: Drill Bit memory overflow uses disk]
Cost-based Optimization
• Using Optiq, an extensible framework
  – Pluggable rules and cost model
• Rules for distributed plan generation
  – Insert Exchange operator into physical plan
  – Optiq enhanced to explore parallel query plans
• Pluggable cost model
  – CPU, IO, memory, network cost (data locality)
  – Storage engine features (HDFS vs Hive vs HBase)
[Diagram: Query Optimizer with pluggable rules and pluggable cost model]
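A pluggable cost model can be pictured as a function that the optimizer calls to score candidate plans. The sketch below is not Optiq's API: the `Cost` fields follow the four cost dimensions on the slide, while the plan names, numbers and helper functions are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Cost:
    cpu: float
    io: float
    memory: float
    network: float


# A cost model is any function that reduces a Cost to one comparable number.
CostModel = Callable[[Cost], float]


def weighted(cpu: float = 1.0, io: float = 1.0,
             memory: float = 1.0, network: float = 1.0) -> CostModel:
    """Build a cost model as a weighted sum of the four dimensions."""
    def model(c: Cost) -> float:
        return cpu * c.cpu + io * c.io + memory * c.memory + network * c.network
    return model


def cheapest(plans: Dict[str, Cost], model: CostModel) -> str:
    """Return the name of the plan that the given model scores lowest."""
    return min(plans, key=lambda name: model(plans[name]))


# Hypothetical candidate plans with made-up cost estimates.
PLANS = {
    "broadcast_join": Cost(cpu=10, io=5, memory=8, network=20),
    "local_join": Cost(cpu=12, io=6, memory=8, network=2),
}

if __name__ == "__main__":
    print(cheapest(PLANS, weighted(network=4.0)))
    print(cheapest(PLANS, weighted(io=0.0, memory=0.0, network=0.0)))
```

Swapping the model changes the chosen plan without touching the plan enumeration, which is the point of making the cost model pluggable.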
What is SparkSQL?
What is Spark SQL
• Essentially syntactic sugar over a limited subset of Spark
• Inherits all the virtues (and vices) of Spark
  – Lambdas can serve as UDFs (has subtle issues for performance)
• Inputs have to be loaded
  – Perhaps lazily, not obvious when load actually happens
• Not designed as a streaming engine, requires more memory
• Some JSON support, but not so much for large or variable objects
• Embedded in a real language!
In More Detail
• A Spark program consists of a computation graph that consumes and produces so-called resilient distributed datasets (RDDs)
• SparkSQL allows these computations to be defined using SQL (but needs schema definitions on the RDDs)
• Conventional Spark programs and SparkSQL programs interoperate nearly seamlessly
Many Similarities
[Diagram components: SQL Parser; Logical Plan; Optimizer; Physical Plan; Java, Scala and Python]
Important Differences
• Spark execution assumes RDDs are a complete representation, not a stream of row batches
• Input sources don’t inject optimization rules, nor expose detailed cost models
• Most RDDs don’t have a zero-copy capability
• Spark inherits JVM memory model, very limited use of off-heap
scala> sqlContext.sql("select * from json.`foo.json`").show
+---+------+----+
|  a|     b|   c|
+---+------+----+
|  3|[3, 2]| xyz|
|  7|  null| wxy|
|  7|    []|null|
+---+------+----+
scala> sqlContext.sql("select a, explode(b) b_v from json.`bug.json`").show
+---+---------+
|  a|      b_v|
+---+---------+
|  3|        3|
|  3|        2|
+---+---------+
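The behaviour shown in the second output, where rows whose array is null or empty disappear, can be reproduced with a plain Python sketch. This is not Spark code; the `explode` helper is hypothetical, and it assumes `bug.json` holds the same three rows that the first query printed from `foo.json`.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def explode(rows: List[Row], keep: List[str], column: str, alias: str) -> List[Row]:
    """Emit one output row per array element.

    Rows whose array is null (None) or empty produce no output at all,
    which is why both a = 7 rows vanish from the result.
    """
    out: List[Row] = []
    for row in rows:
        for item in row.get(column) or []:
            new_row = {k: row[k] for k in keep}
            new_row[alias] = item
            out.append(new_row)
    return out


# Assumed contents of the JSON file, taken from the first query's output.
ROWS = [
    {"a": 3, "b": [3, 2], "c": "xyz"},
    {"a": 7, "b": None, "c": "wxy"},
    {"a": 7, "b": [], "c": None},
]

if __name__ == "__main__":
    for r in explode(ROWS, keep=["a"], column="b", alias="b_v"):
        print(r)
```

A query that needs to keep the a = 7 rows has to handle null and empty arrays explicitly rather than rely on the plain explode.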
First Synthesis
• Drill has a more nuanced optimizer, better code generation
  – This often leads to ~2x speed advantage
• Drill has ValueVector and row batches
  – This leads to much less memory pressure
• Drill has much stricter memory life-cycle
  – Query is done and gone, no need for big GCs even on big memory
• Drill is all about SQL execution
But …
• Spark can optimize across entire program
  – This often leads to ~2x speed advantage
• Spark has much more flexible memory structures
  – This can lead to much less memory pressure
• Spark has much more flexible RDD life-cycle
  – RDDs can be cached, persisted or simply recomputed as necessary
• Spark is not all about SQL execution
The Really Big Differences
• Drill focuses heavily on secure, multi-tenant access to data
  – Strong impersonation semantics
  – Cascading rights via views
  – Queries co-exist in a cluster and reserve only their momentary resource requirements
• Spark focuses heavily on fully integrated execution models
  – Any Spark function works with (almost) any RDD
  – Memory residency of RDDs is the highest goal
Drill security
➢ End to end security from BI tools to Hadoop
➢ Standards-based PAM authentication
➢ 2-level user impersonation
➢ Fine-grained row and column level access control with Drill Views – no centralized security repository required
Granular security permissions through Drill views
Raw File (/raw/cards.csv)
Owner: Admins, Permission: Admins
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist View (/views/maskedcards.view.drill), not a physical data copy
Owner: Admins, Permission: Data Scientists
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Business Analyst View
Owner: Admins, Permission: Business Analysts
Name  City      State
Dave  San Jose  CA
John  Boulder   CO
Ownership Chaining
Combine Self Service Exploration with Data Governance

Raw File (/raw/cards.csv): John (Owner)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist (/views/V_Scientist): Jane (Read), John (Owner)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Analyst (/views/V_Analyst): Jack (Read), Jane (Owner)
Name  City      State
Dave  San Jose  CA
John  Boulder   CO

Jack queries the view V_Analyst
[Diagram: access path and ownership chain across V_Analyst, V_Scientist and RAW FILE]
1. V_Analyst
   Does Jack have access to V_Analyst? -> YES
   Who is the owner of V_Analyst? -> Jane
   Drill accesses V_Analyst as Jane (Impersonation hop 1)
2. V_Scientist
   Does Jane have access to V_Scientist? -> YES
   Who is the owner of V_Scientist? -> John
   Drill accesses V_Scientist as John (Impersonation hop 2)
3. Raw file
   Does John have permissions on raw file? -> YES
   Who is the owner of raw file? -> John
   Drill accesses source file as John (no impersonation here)
*Ownership chain length (# hops) is configurable
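The walk through the ownership chain can be written down as a short loop. This is a sketch of the logic on the slide, not Drill's code: the `Resource` type, the `resolve` function and the choice to raise `PermissionError` when the hop limit is exceeded are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Resource:
    owner: str
    readers: FrozenSet[str]
    source: Optional[str] = None  # what a view reads from; None for a raw file


def resolve(resources: Dict[str, Resource], user: str, name: Optional[str],
            max_hops: int = 2) -> List[Tuple[str, str]]:
    """Return (resource, identity used to access it) for each step of the chain."""
    steps: List[Tuple[str, str]] = []
    hops = 0
    current = user
    while name is not None:
        res = resources[name]
        # Step 1: does the current identity have access?
        if current != res.owner and current not in res.readers:
            raise PermissionError(f"{current} cannot read {name}")
        # Step 2: switch to the owner; this costs one impersonation hop.
        if current != res.owner:
            hops += 1
            if hops > max_hops:
                raise PermissionError(f"more than {max_hops} impersonation hops")
            current = res.owner
        steps.append((name, current))
        name = res.source
    return steps


# The three resources from the slide.
RESOURCES = {
    "/raw/cards.csv": Resource(owner="John", readers=frozenset()),
    "V_Scientist": Resource(owner="John", readers=frozenset({"Jane"}), source="/raw/cards.csv"),
    "V_Analyst": Resource(owner="Jane", readers=frozenset({"Jack"}), source="V_Scientist"),
}

if __name__ == "__main__":
    for step in resolve(RESOURCES, "Jack", "V_Analyst"):
        print(step)
```

Jack never needs rights on `V_Scientist` or the raw file: each view is read with its owner's identity, and the hop limit bounds how long such a chain may grow.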
But was that the right question?
Unification is Feasible
• It is relatively easy to build a DrillContext in Spark
  – compare to SqlContext
• Define Datasets as Drill data sources and sinks
  – Drill runs at the same time as Spark
• Orchestrate transport of Spark data to/from Drill
• Cost of transport is remarkably small
What does the Spark and Drill integration look like?
Features at a glance:
• Use Drill as an input to Spark
• Query Spark RDDs via Drill and create data pipelines
Is unification valuable?
Example of Unification
Simple Session Protocol
• Calls started at random intervals
• During calls, reconnection is done periodically
• Many log events are buffered and sent to current tower during active state
The Resulting Data
• Signal strength reports
  – Tower, timestamp, rank, caller, caller location*, signal strength
• Tower log events: HELLO, FAIL, CONNECT, END
• Call end
• Note that data for one tower is often received by another due to caller buffering of diagnostic data
*Location isn’t quite location … poetic license applied for
What can we do with it?
Baby Steps
• What does signal propagation look like?
select x, y, signal from cdr_stream where tower = 3
• Plot results to get a map of signal strength around a tower
Baby Steps
• What does tower coverage look like?
select x, y from cdr_stream where tower = 3 and event_type = 'CONNECT'
• Plot results to get a map of coverage area for a tower
What about anomaly detection?
Detecting Tower Loss
It’s important to know if traffic is stopped or delayed because of a problem…
But events from towers come at irregular intervals
How long after the last event should you begin to worry?
Event Stream (timing)
• Events of various types arrive at irregular intervals
  – we can assume Poisson distribution
• The key question is whether frequency has changed relative to expected values
  – This shows up as a change in interval
• Want alert as soon as possible
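Under the Poisson assumption, the gaps between events are exponentially distributed, so the question of how long to wait before worrying has a closed form: a gap longer than -ln(1 - q) / rate occurs with probability 1 - q. The sketch below is a minimal illustration; the function names are hypothetical and the expected event rate is assumed to be known.

```python
import math


def silence_threshold(rate: float, quantile: float) -> float:
    """Gap length that is exceeded with probability 1 - quantile.

    For a Poisson process with `rate` events per unit time,
    P(gap > t) = exp(-rate * t), so t = -ln(1 - quantile) / rate.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not 0 < quantile < 1:
        raise ValueError("quantile must be strictly between 0 and 1")
    return -math.log(1.0 - quantile) / rate


def gap_anomaly(gap: float, rate: float) -> float:
    """Anomaly score -log P(gap at least this long), which equals rate * gap."""
    return rate * gap


if __name__ == "__main__":
    rate = 1.0  # assumed: one event per minute on average
    for q in (0.999, 0.9999):
        print(f"{q:.2%}-ile silence: {silence_threshold(rate, q):.2f} minutes")
```

The score grows linearly with the time since the last event, so an alert threshold on the score is the same thing as a threshold on the silence itself, scaled by the expected rate.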
Converting Event Times to Anomaly
[Chart: event times converted to anomaly, with 99.9%-ile and 99.99%-ile levels marked]
But in the real world, event rates often change
Time Intervals Are Key to Modeling Sporadic Events
After Rate Correction
Detecting Anomalies in Sporadic Events
Propagation Anomalies
• What happens when something shadows part of the coverage field?
  – Can happen in urban areas with a construction crane
• Can solve heuristically
  – Subtract from reference image composed by long term averages
  – Doesn’t deal well with weak signal regions and low S/N
• Can solve probabilistically
  – Compute anomaly for each measurement, use mean of log(p)
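One way to read the probabilistic option is sketched below, under stated assumptions: p is taken to be the empirical probability of a signal at least as weak as the one observed, measured against historical reports for the same location, with add-one smoothing so that log(p) stays finite. The function names and the sample values are hypothetical, and the deck does not say which estimator was actually used.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def tail_probability(reference_sorted: Sequence[float], value: float) -> float:
    """Smoothed empirical P(signal <= value) against sorted reference reports."""
    at_or_below = bisect_right(reference_sorted, value)
    return (at_or_below + 1) / (len(reference_sorted) + 1)


def mean_log_p(reference_sorted: Sequence[float], measurements: List[float]) -> float:
    """Mean of log(p) over new measurements; more negative means more anomalous."""
    if not measurements:
        raise ValueError("need at least one measurement")
    logs = [math.log(tail_probability(reference_sorted, m)) for m in measurements]
    return sum(logs) / len(logs)


# Made-up historical signal strengths (dBm) for one location, sorted ascending.
REFERENCE = [-90.0, -85.0, -80.0, -75.0, -70.0, -65.0, -60.0, -55.0, -50.0]

if __name__ == "__main__":
    print("typical :", mean_log_p(REFERENCE, [-72.0, -72.0]))
    print("shadowed:", mean_log_p(REFERENCE, [-95.0, -95.0]))
```

Because each measurement is compared with the history of its own location, a weak reading far from the transmitter is not flagged merely for being weak, which is where subtracting an average image runs into trouble.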
Variable Signal/Noise Makes Heuristic Tricky
Far from the transmitter, received signal is dominated by noise. This makes subtraction of average value a bad algorithm.
Other Issues
• Finding anomalies in coverage area is similarly tricky
• Coverage area is roughly where tower signal strength is higher than neighbors
• Except for fuzziness due to hand-off delays
• Except for bias due to large-scale caller motions
  – Rush hour
  – Event mobs
Simple Answer for Propagation Anomalies
• Cluster signal strength reports
• Cluster locations using k-means, large k
• Model report rate anomaly using discrete event models
• Model signal strength anomaly using percentile model
• Trade larger k against higher report rates, faster detection
• Overall anomaly is sum of individual log(p) anomalies
Coverage Areas
Just One Tower
Cluster Reports for That Tower
General Dataflow
Summary
• Drill and Spark provide healthy competition in Apache
• Over time, they have converged in many respects
  – But important distinctions remain
• Projects can work together to share key technology
  – Apache Arrow … started as off-shoot of Drill, now has >12 major projects as participants, including Spark
• Systems can work together even more deeply
  – DrillContext makes integration first class
e-book available courtesy of MapR
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
Thank you for coming today!
…helping you put data technology to work
● Find answers
● Ask technical questions
● Join on-demand training course discussions
● Follow release announcements
● Share and vote on product ideas
● Find Meetup and event listings
Connect with fellow Apache Hadoop and Spark professionals
community.mapr.com