
Page 1: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Julien Le Dem
Principal Architect, Dremio
VP Apache Parquet, Apache Arrow PMC

© 2016 Dremio Corporation @DremioHQ

Page 2: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


• Architect at @DremioHQ

• Formerly Tech Lead at Twitter on Data Platforms.

• Creator of Parquet

• Apache member

• Apache PMCs: Arrow, Incubator, Pig, Parquet

Julien Le Dem (@J_Julien)

Page 3: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Agenda

• Benefits of Columnar representation

– Immutable On disk (Apache Parquet)

– Mutable on disk (Apache Kudu)

– In memory (Apache Arrow)

• Community Driven Standard

• Interoperability and Ecosystem

Page 4: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Benefits of Columnar formats
(image credit: @EmrgencyKittens)

Page 5: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Columnar layout

[Figure: the same logical table representation stored in a row layout vs. a column layout]

Page 6: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Mutable or Immutable Storage

• Different trade-offs
– Immutable (Parquet):
• Higher write throughput (no random modification after completion).
• Easy to share, replicate, and access concurrently.
• Modifications require a rewrite of the dataset (sketched below).
• No operational overhead (no extra service, just your file system).
– Mutable (Kudu):
• More flexible trade-off between update speed and read speed.
• Low latency for short accesses (primary key indexes and quorum replication).
• Database-like semantics (initially single-row ACID).
• Needs to be managed (new daemon).
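A minimal sketch of the immutable trade-off, assuming the modern pyarrow API (which postdates this talk) and a hypothetical data.parquet file: there is no in-place update, so changing a single row means rewriting the whole file.

import pyarrow as pa
import pyarrow.parquet as pq

# Write an immutable Parquet file (hypothetical path).
table = pa.table({"id": [1, 2, 3], "value": [10, 20, 30]})
pq.write_table(table, "data.parquet")

# "Update" the row with id == 2: read everything, modify, rewrite.
df = pq.read_table("data.parquet").to_pandas()
df.loc[df["id"] == 2, "value"] = 99
pq.write_table(pa.Table.from_pandas(df, preserve_index=False), "data.parquet")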

Page 7: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

On Disk and in Memory

• Different trade-offs
– On disk: Storage.
• Accessed by multiple queries.
• Priority to I/O reduction (but still needs good CPU throughput).
• Mostly streaming access.
– In memory: Transient.
• Specific to one query execution.
• Priority to CPU throughput (but still needs good I/O).
• Streaming and random access.

Page 8: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Parquet on disk columnar format

Page 9: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Parquet on disk columnar format

• Nested data structures
• Compact format:
– type-aware encodings
– better compression
• Optimized I/O:
– Projection push down (column pruning)
– Predicate push down (filters based on stats)
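A hedged illustration of the compact-format point, using the modern pyarrow API (newer than this talk); the file name is hypothetical. Dictionary encoding is type-aware, and a general-purpose codec compresses the encoded pages further.

import pyarrow as pa
import pyarrow.parquet as pq

# A repetitive string column: ideal for dictionary encoding.
table = pa.table({"country": ["US", "FR", "US", "US", "FR"] * 200_000})
pq.write_table(table, "countries.parquet",
               use_dictionary=True,     # type-aware encoding
               compression="snappy")    # general-purpose compression on top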

Page 10: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Access only the data you need

[Figure: the same table (columns a, b, c; rows 1-5) shown three times: projection push down keeps only the needed columns, predicate push down keeps only the needed rows, and combined they leave just the cells the query touches.]

Columnar statistics: read only the data you need!
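A sketch of both push downs from the reader's side, assuming the modern pyarrow API (the filters argument is newer than this talk) and a hypothetical data.parquet: only columns a and b are decoded, and row groups whose statistics rule out the predicate are skipped entirely.

import pyarrow.parquet as pq

table = pq.read_table("data.parquet",
                      columns=["a", "b"],       # projection push down
                      filters=[("c", ">", 3)])  # predicate push down via column statistics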

Page 11: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Parquet nested representation

[Figure: schema tree for Document: DocId; Links with Backward and Forward; Name with Language (Code, Country) and Url]

Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url

Borrowed from the Google Dremel paper.
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
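The same schema can be sketched with modern pyarrow nested types (an illustration, not the talk's code); Parquet shreds every leaf of the tree into its own column, producing exactly the column paths listed above.

import pyarrow as pa

schema = pa.schema([
    ("docid", pa.int64()),
    ("links", pa.struct([("backward", pa.list_(pa.int64())),
                         ("forward",  pa.list_(pa.int64()))])),
    ("name", pa.list_(pa.struct([
        ("language", pa.list_(pa.struct([("code",    pa.string()),
                                         ("country", pa.string())]))),
        ("url", pa.string()),
    ]))),
])
print(schema)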

Page 12: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Kudu data representation

Page 13: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Kudu Tablets

• Typed columns
• Inserts buffered in an in-memory store (like HBase's memstore)
• Flushed to disk: columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
– Allows "SELECT AS OF <timestamp>" queries and consistent cross-tablet scans (toy sketch below)
• Near-optimal read path for "current time" scans
– No per-row branches, fast vectorized decoding and predicate evaluation
• Performance worsens with the number of recent updates
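A toy Python sketch of the MVCC idea (illustrative, not Kudu's implementation): each update is tagged with a timestamp rather than applied in place, so a reader can ask for a value as of any point in time.

import bisect

class MvccCell:
    def __init__(self):
        self.timestamps = []   # sorted update timestamps
        self.values = []       # value written at each timestamp

    def update(self, ts, value):
        i = bisect.bisect_right(self.timestamps, ts)
        self.timestamps.insert(i, ts)
        self.values.insert(i, value)

    def read_as_of(self, ts):
        # Latest version with timestamp <= ts, like "SELECT AS OF <timestamp>".
        i = bisect.bisect_right(self.timestamps, ts)
        return self.values[i - 1] if i else None

cell = MvccCell()
cell.update(10, "a")
cell.update(20, "b")
print(cell.read_as_of(15))   # "a": the value as it was at ts=15
print(cell.read_as_of(25))   # "b": the current value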

Page 14: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Kudu

• High throughput for big scans (columnar storage and replication)
– Goal: within 2x of Parquet
• Low latency for short accesses (primary key indexes and quorum replication)
– Goal: 1ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
– SQL query
– "NoSQL" style scan/insert/update (Java client)

Page 15: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

LSM vs Kudu

• LSM = Log Structured Merge (Cassandra, HBase, etc.)
– Inserts and updates all go to an in-memory map (MemStore) and are later flushed to on-disk files (HFile/SSTable)
– Reads perform an on-the-fly merge of all on-disk HFiles (toy sketch below)
• Kudu
– Shares some traits (memstores, compactions)
– More complex
– Slower writes in exchange for faster reads (especially scans)

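A toy sketch of that LSM read path (illustrative Python, not HBase's code): a read has to merge the in-memory store with every on-disk file, newest first.

# Recent writes live in memory; flushed files are ordered newest first.
memstore = {"k1": "v1-new"}
sstables = [{"k1": "v1-old", "k2": "v2"},
            {"k2": "v2-older", "k3": "v3"}]

def lsm_get(key):
    if key in memstore:              # memory first
        return memstore[key]
    for sstable in sstables:         # then each on-disk file in turn
        if key in sstable:
            return sstable[key]
    return None

print(lsm_get("k1"))   # "v1-new": the memstore shadows older versions
print(lsm_get("k3"))   # "v3": only found in the oldest file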

Page 16: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Kudu trade-offs: write

• Batch inserts are slower than Parquet

– Extra bloom filter lookup per insert

• Random updates are slower than HBase

– HBase model allows random updates without incurring a disk seek

– Kudu requires a key lookup before update and a Bloom filter lookup before insert (toy sketch below)

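A toy Bloom filter sketch (illustrative, not Kudu's code) of that extra per-insert probe: before inserting, the filter answers whether the key might already exist in a flushed file, usually avoiding a disk seek.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))   # True
print(bf.might_contain("row-99"))   # False with high probability: no seek needed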

Page 17: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Kudu trade-offs: read

• Scan speed is close to Parquet and faster than HBase
– Columnar on disk, like Parquet
– Only one DiskRowSet contains updates for a given row: fewer file lookups than HBase (but more than Parquet)
• Single-row reads may be slower than HBase (and both are faster than Parquet)
– Columnar design is optimized for scans
– Future: may introduce "column groups" for applications where single-row access is more important
– Especially slow at reading a row that has had many recent updates (e.g. YCSB "zipfian")

Page 18: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Kudu is…

– NOT a SQL database

• “Bring Your Own SQL”

– NOT a filesystem

• Data must have tabular structure

– NOT an in-memory database

• Very fast for memory-sized workloads, but can operate on larger data too


Page 19: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Arrow in memory columnar format

Page 20: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Arrow goals

• Well-documented and cross-language compatible

• Designed to take advantage of modern CPU characteristics

• Embeddable in execution engines, storage layers, etc.

• Interoperable

Page 21: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Arrow in memory columnar format

• Nested Data Structures

• Maximize CPU throughput

– Pipelining

– SIMD

– cache locality

• Scatter/gather I/O

Page 22: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


CPU pipeline

Page 23: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Minimize CPU cache misses

A cache miss costs tens to hundreds of cycles, depending on the cache level.
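A rough micro-benchmark sketch of why this matters (NumPy stands in for an execution engine; absolute numbers vary by machine): summing one field is much cheaper when its values sit contiguously, as in a columnar buffer, than when they are strided across row-wise records.

import time
import numpy as np

rows = np.random.rand(1_000_000, 8)      # row layout: 8 fields per record
col = np.ascontiguousarray(rows[:, 0])   # column layout: field 0, contiguous

t0 = time.perf_counter(); rows[:, 0].sum(); t1 = time.perf_counter()
t2 = time.perf_counter(); col.sum(); t3 = time.perf_counter()
print(f"strided (row layout):    {t1 - t0:.4f}s")
print(f"contiguous (col layout): {t3 - t2:.4f}s")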

Page 24: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Focus on CPU Efficiency

[Figure: the same data in a traditional memory buffer vs. an Arrow memory buffer]

• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant-time value access
– With minimal structure overhead
• Operate directly on columnar compressed data

Page 25: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Columnar data

persons = [{
    name: 'Joe',
    age: 18,
    phones: ['555-111-1111', '555-222-2222']
}, {
    name: 'Jack',
    age: 37,
    phones: ['555-333-3333']
}]
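The same persons records in Arrow form, sketched with the modern pyarrow API (an illustration, newer than this talk): each field becomes its own contiguous vector.

import pyarrow as pa

names  = pa.array(["Joe", "Jack"])
ages   = pa.array([18, 37])
phones = pa.array([["555-111-1111", "555-222-2222"],
                   ["555-333-3333"]])

batch = pa.record_batch([names, ages, phones],
                        names=["name", "age", "phones"])
print(batch.num_rows)    # 2 records
print(batch.column(1))   # ages, one contiguous vector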

Page 26: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Java: Memory Management

• Chunk-based managed allocator

– Built on top of Netty’s JEMalloc implementation

• Create a tree of allocators

– Limit and transfer semantics across allocators

– Leak detection and location accounting

• Wrap native memory from other applications
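A toy sketch of the tree-of-allocators idea (illustrative Python, not Arrow's Java allocator): every allocation is charged up the tree, so limits compose and usage can be accounted per child.

class Allocator:
    def __init__(self, name, limit, parent=None):
        self.name, self.limit, self.parent = name, limit, parent
        self.used = 0

    def child(self, name, limit):
        return Allocator(name, limit, parent=self)

    def allocate(self, n):
        node = self
        while node:                  # check limits all the way up first
            if node.used + n > node.limit:
                raise MemoryError(f"{node.name}: limit exceeded")
            node = node.parent
        node = self
        while node:                  # then charge every ancestor
            node.used += n
            node = node.parent

root = Allocator("root", limit=1 << 30)
query = root.child("query-1", limit=1 << 20)
query.allocate(512 * 1024)           # charged to query-1 and to root
print(root.used, query.used)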

Page 27: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Arrow RPC & IPC

Page 28: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Common Message Pattern

• Schema Negotiation
– Logical description of structure
– Identification of dictionary-encoded nodes
• Dictionary Batch
– Dictionary ID, values
• Record Batch
– Batches of records up to 64K
– Leaf nodes up to 2B values

[Figure: message stream: a Schema Negotiation, then Dictionary Batches (0..N), then Record Batches (1..N)]

Page 29: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Record Batch Construction

{
    name: 'Joe',
    age: 18,
    phones: ['555-111-1111', '555-222-2222']
}

[Figure: a data header describing offsets into the data, followed by one contiguous vector per buffer: name (bitmap, offset, data), age (bitmap, data), phones (bitmap, list offset, offset, data)]

Each box (vector) is contiguous memory. The entire record batch is contiguous on the wire.
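The bitmap/offset/data trio above can be inspected directly with the modern pyarrow API (a hedged sketch, newer than this talk): a string vector is a validity bitmap, an int32 offset buffer, and one contiguous data buffer.

import pyarrow as pa

names = pa.array(["Joe", None, "Jack"])
validity, offsets, data = names.buffers()
print(offsets.to_pybytes())   # int32 offsets into the data buffer
print(data.to_pybytes())      # b'JoeJack': all characters, back to back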

Page 30: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Moving Data Between Systems

RPC
• Avoid serialization & deserialization
• Layer TBD: focused on supporting vectored I/O
– Scatter/gather reads/writes against socket

IPC
• Alpha implementation using memory-mapped files
– Moving data between Python and Drill
• Working on a shared allocation approach
– Shared reference counting and well-defined ownership semantics
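A sketch of the memory-mapped IPC path with the modern pyarrow API (the 2016 alpha differed; the file name is hypothetical): write a record batch to an Arrow file, then map it back in without copying.

import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])

with pa.OSFile("batches.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, batch.schema) as writer:
        writer.write_batch(batch)

with pa.memory_map("batches.arrow", "rb") as source:
    reader = pa.ipc.open_file(source)
    same = reader.get_batch(0)    # backed by the mapped file, not a copy
    print(same.column(0))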

Page 31: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Shared Need => Open Source Opportunity

“We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions” -Impala Team

“A large fraction of the CPU time is spent waiting for data to be fetched from main memory…we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work” – Spark Team

“Drill provides a flexible hierarchical columnar data model that can represent complex, highly dynamic and evolving data models and allows efficient processing of it without need to flatten or materialize.” -Drill Team

Page 32: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Community Driven Standard

Page 33: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

An open source standard

• Parquet: common need for on-disk columnar.
• Arrow: common need for in-memory columnar.
• Arrow builds on the success of Parquet.
• Benefits:
– Share the effort
– Create an ecosystem
• Standard from the start

Page 34: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

The Apache Arrow Project

• New top-level Apache Software Foundation project
– Announced Feb 17, 2016
• Focused on columnar in-memory analytics:
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, R, Spark, Storm
– A significant % of the world's data will be processed through Arrow!

Page 35: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Interoperability and Ecosystem

Page 36: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

High Performance Sharing & Interchange

Today:
• Each system has its own internal memory format
• 70-80% of CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions

With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. a Parquet-to-Arrow reader)

Page 37: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

Language Bindings

Parquet
• Target languages:
– Java
– C++ (underway)
– Python & Pandas (underway)
• Engine integration: faster to list those that don't support it

Arrow
• Target languages:
– Java (beta)
– C++ (underway)
– Python & Pandas (underway)
– R
– Julia
• Initial focus:
– Read a structure
– Write a structure
– Manage memory

Kudu
• Target languages:
– Java
– C++
• Engine integration:
– MapReduce
– Spark
– Impala
– Drill

Page 38: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Example data exchanges:

Page 39: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

RPC: Query execution

The memory representation is sent over the wire. No serialization overhead.

[Figure: Parquet files feed three Scanners (projection push down: read only a and b), each Scanner feeds a Partial Agg, Arrow batches are shuffled to three Agg operators, which produce the Result]

Page 40: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

RPC: future Arrow-based interchange

The memory representation is sent over the wire. No serialization overhead.

[Figure: three Kudu tablets (each with in-memory and on-disk data) feed Scanners with projection/predicate push down; Arrow batches flow from each Scanner to an Operator inside SQL execution]

Page 41: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics

IPC: Python with Spark or Drill

[Figure: within a SQL engine, SQL Operator 1 hands Arrow data to a user-defined function running in a Python process, and SQL Operator 2 consumes its output; both sides read the shared Arrow memory]
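A hedged sketch of where this pattern landed later: Spark's Arrow-backed pandas UDFs (added in Spark 2.3, well after this talk) hand columns to the Python process as Arrow batches instead of pickled rows.

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

@pandas_udf("double")
def plus_one(v):          # v arrives as a pandas Series, not row by row
    return v + 1.0

spark.range(5).withColumn("y", plus_one("id")).show()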

Page 42: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


What’s Next

• Parquet – Arrow conversion for Python & C++

• Arrow IPC Implementation

• Kudu – Arrow integration

• Apache {Spark, Drill} to Arrow integration
– Faster UDFs, storage interfaces

• Support for integration with Intel’s Persistent Memory library via Apache Mnemonic

Page 43: The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Analytics


Get Involved

• Join the community

– dev@{arrow,parquet,kudu.incubator}.apache.org

– Slack:

• https://apachearrowslackin.herokuapp.com/

• https://getkudu-slack.herokuapp.com/

– http://{arrow,parquet,kudu}.apache.org

– Follow @Apache{Parquet,Arrow,Kudu}