The BI on Hadoop Benchmark

Bay Area Big Data Meetup – March 2016

www.atscale.com

© 2015 ATSCALE, INC. ALL RIGHTS RESERVED. CONFIDENTIAL & PROPRIETARY

Agenda

❑ Market Context

❑ AtScale Overview

❑ Benchmark Setup

❑ Results!!

❑ Lessons Learned

❑ Wrap Up & Q&A


Market Context


But Seriously…


Hadoop Use Cases

Yesterday Today


What AtScale Does

I.T. needs Control & Consistency

The Business needs Freedom & Self-Service

The Business Interface for Hadoop


How We Do It

❑ Any BI tool

❑ Industry standards

❑ Schema on demand

❑ Write once


Demo Time

Design Center

Designers

The AtScale Virtual Cube

On-Demand Aggregate Engine

Your Business Team


Benchmark Ingredients

1. Hadoop Cluster

12-node cluster with:

• 1 master node
• 1 AtScale gateway node
• 10 data nodes

RAM per node: 128G

CPU specs, data (worker) nodes: 32 CPU cores

Storage specs, data (worker) nodes: 2x 512mb SSD



2. SQL-on-Hadoop Engines

Spark SQL

Version: 1.6-SNAPSHOT
Hive Version: 1.2
File Format: Parquet
Workers: 70
Memory per worker: 14G
Cores per worker: 4

Impala

Version: 2.3
Hive Version: 1.2
File Format: Parquet
Workers: 10
Memory per worker: 110G

Hive on Tez

Tez Version: 0.7
Hive Version: 1.2
File Format: ORC
hive.tez.container.size: 4096mb
hive.cbo.enabled: true
hive.auto.convert.join.noconditionaltask.size: 3036549120
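As a rough sanity check on these settings, the short sketch below adds up the worker memory and cores listed for Spark SQL and Impala and compares them with the cluster described earlier. The figures are copied from the slides; the even spread of workers across the 10 data nodes is an assumption made only for this arithmetic.

```python
# Figures copied from the setup slides. Spreading workers evenly over
# the data nodes is an assumption, not something the slides state.
DATA_NODES = 10
NODE_RAM_GB = 128
NODE_CORES = 32


def per_node_gb(workers, mem_per_worker_gb):
    """Average worker memory that lands on each data node, in GB."""
    return workers * mem_per_worker_gb / DATA_NODES


# engine name -> (workers, memory per worker in GB)
engines = {"Spark SQL": (70, 14), "Impala": (10, 110)}

for name, (workers, mem_gb) in engines.items():
    total = workers * mem_gb
    share = per_node_gb(workers, mem_gb)
    print(f"{name}: {total} GB total, {share:.0f} GB per node, "
          f"{NODE_RAM_GB - share:.0f} GB of node RAM left over")

# Uses the "Cores per worker: 4" figure from the setup slide.
spark_cores = 70 * 4
print(f"Spark SQL cores: {spark_cores} of {DATA_NODES * NODE_CORES}")
```

Under these assumptions Spark SQL is given 980 GB and Impala 1,100 GB in total, so both fit within the 128G per node with some headroom.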



3. Benchmark Data Set

Star-Schema Benchmark (SSB)

| Table Name | Number of Rows |
|---|---|
| CUSTOMER | 1 Billion |
| LINEORDER | 6 Billion |
| SUPPLIER | 2 Million |
| PART | 2 Million |
| DATE | 16 Thousand |



4. Benchmark Queries

| Query ID | Joins | Largest Join Table (rows) | Group Bys | Filters | Comments |
|---|---|---|---|---|---|
| Q1.1 | 1 | 16,799 | 0 | 3 | 1 range condition, 1 comparative filter condition in fact table |
| Q1.2 | 1 | 16,799 | 0 | 3 | 2 range filter conditions directly on LINEORDER table |
| Q1.3 | 1 | 16,799 | 0 | 4 | 2 range filter conditions directly on fact, 2 conditions on joined table |
| Q2.1 | 3 | 2,000,000 | 2 | 2 | filter on p_category (less selective) |
| Q2.2 | 3 | 2,000,000 | 2 | 2 | filter on p_brand, 2 values (more selective) |
| Q2.3 | 3 | 2,000,000 | 2 | 2 | filter on p_brand, 1 value (most selective) |
| Q3.1 | 3 | 1,050,000,000 | 3 | 3 | filter on region (less selective) |
| Q3.2 | 3 | 1,050,000,000 | 3 | 3 | filter on nation (more selective) |
| Q3.3 | 3 | 1,050,000,000 | 3 | 3 | filter on city (most selective) |
| Q3.4 | 3 | 1,050,000,000 | 3 | 3 | filter on city (most selective) and month (vs. year) |
| Q4.1 | 4 | 1,050,000,000 | 2 | 2 | |
| Q4.2 | 4 | 1,050,000,000 | 3 | 3 | includes filter on year (more selective) |
| Q4.3 | 4 | 1,050,000,000 | 3 | 3 | includes filter on year and nation (most selective) |
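To make the Joins and Filters columns concrete, here is a small sketch of a query with the Q1.1 shape (1 join, 0 group bys, 3 filters), run against a toy in-memory SQLite database. The SQL follows the commonly published form of SSB Q1.1; the four sample rows and the table name `dates` (used here in place of DATE) are assumptions for illustration and have nothing to do with the benchmark data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dates (d_datekey INTEGER PRIMARY KEY, d_year INTEGER);
CREATE TABLE lineorder (
    lo_orderdate INTEGER,
    lo_quantity INTEGER,
    lo_discount INTEGER,
    lo_extendedprice INTEGER
);
INSERT INTO dates VALUES (19930101, 1993), (19940101, 1994);
INSERT INTO lineorder VALUES
    (19930101, 10, 2, 1000),   -- passes every filter
    (19930101, 30, 2, 1000),   -- fails lo_quantity < 25
    (19930101, 10, 5, 1000),   -- fails lo_discount BETWEEN 1 AND 3
    (19940101, 10, 2, 1000);   -- fails d_year = 1993
""")

Q1_1 = """
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM lineorder
JOIN dates ON lo_orderdate = d_datekey   -- the single join
WHERE d_year = 1993                      -- filter on the joined table
  AND lo_discount BETWEEN 1 AND 3        -- range condition on the fact table
  AND lo_quantity < 25                   -- comparative condition on the fact table
"""

# Only the first lineorder row survives all three filters.
revenue = conn.execute(Q1_1).fetchone()[0]
print(revenue)  # 2000
```

The later query groups keep the same pattern but add more dimension joins and group-by columns, which is why the largest joined table grows from DATE to PART and then to CUSTOMER.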



5. Real Bearded Wizard


Benchmark Framework

❑ Performs on Big Data

❑ Fast on Small Data

❑ Stable for Many Users


Performs on Big Data: 6B Rows


Fast on Small: Adaptive Cache


Stable for Many: Concurrency


Data Formats & Partitioning

❑ ORC for Hive, because the majority of Hive's speed-ups (vectorization, CBO, etc.) only work on ORC tables

❑ Parquet for Impala and Spark, because the majority of performance work for these engines is done for Parquet

❑ The tables were left unpartitioned to give a true test of performance against large data sets


Impala tuning

❑ Impala required the least amount of tuning. We configured it to use the same amount of memory as the other engines.

❑ For the queries:
  • Changed the formatting so they would run on Impala.
  • Changed the ordering of joins, which showed a 10-20% performance increase on a few queries.


Spark SQL tuning

❑ Best combination: 14G of memory per worker, 3 cores per worker, and 70 workers

❑ spark.sql.autoBroadcastJoinThreshold is your friend! The dims were all small enough that we could make sure all queries ran as broadcast joins.

❑ Changing the join order for the queries yielded a ~5% performance increase.
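The settings above could be written down as Spark properties roughly as follows. This is a sketch under two assumptions: that the "workers" on the slide correspond to Spark executors, and a placeholder broadcast threshold, since the slides do not give the value that was used. The property names themselves are standard Spark configuration keys.

```python
def spark_conf_lines(broadcast_threshold_bytes):
    """Render the tuning above as spark-defaults.conf style lines."""
    settings = {
        # "70 workers" is assumed here to mean 70 executors.
        "spark.executor.instances": "70",
        "spark.executor.memory": "14g",
        "spark.executor.cores": "3",
        # Tables estimated to be smaller than this many bytes are
        # broadcast to every executor instead of being shuffled.
        "spark.sql.autoBroadcastJoinThreshold": str(broadcast_threshold_bytes),
    }
    return [f"{key} {value}" for key, value in settings.items()]


# Hypothetical threshold: the actual value is not in the slides.
ASSUMED_THRESHOLD_BYTES = 1024 * 1024 * 1024  # 1 GiB

lines = spark_conf_lines(ASSUMED_THRESHOLD_BYTES)
print("\n".join(lines))
```

Raising the threshold only helps when every dimension table really fits in executor memory, which matches the caveat on the slide that the dims were all small enough.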


Hive 1.2.1 Tuning

❑ Required a different setup for each concurrent query test (setting hive.server2.tez.sessions.per.default.queue and hive.server2.tez.default.queues) to better support different concurrency levels.

❑ As with Spark's broadcast threshold, hive.auto.convert.join.noconditionaltask.size is your friend. We set it high enough (3,036,549,120 in our case) that all our queries would run as broadcast joins.

❑ The exceptions were queries Q4.1 to Q4.3, which we had to force to be sort-joins because of the GC pressure the broadcast joins caused.

❑ Unlike Spark and Impala, we did not have to change the join order, as Hive's CBO did that automatically.
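The per-query join settings described above might be generated like this. The size value and the Q4.x exception come from the slides; switching off hive.auto.convert.join for those queries is an assumed mechanism, since the slides do not say how the sort-joins were forced. The helper name `hive_session_settings` is hypothetical.

```python
BROADCAST_LIMIT_BYTES = 3_036_549_120  # value quoted in the slides
FORCE_SORT_JOIN = {"Q4.1", "Q4.2", "Q4.3"}


def hive_session_settings(query_id):
    """Return the SET statements to issue before running one query."""
    settings = ["SET hive.cbo.enabled=true;"]
    if query_id in FORCE_SORT_JOIN:
        # Assumed mechanism: disable automatic map-join conversion so
        # the join is not executed as a broadcast join.
        settings.append("SET hive.auto.convert.join=false;")
    else:
        settings.append("SET hive.auto.convert.join=true;")
        settings.append("SET hive.auto.convert.join.noconditionaltask=true;")
        settings.append(
            "SET hive.auto.convert.join.noconditionaltask.size="
            f"{BROADCAST_LIMIT_BYTES};"
        )
    return settings


for qid in ("Q1.1", "Q4.2"):
    print(qid)
    for statement in hive_session_settings(qid):
        print("  " + statement)
```

Keeping the exception list in one place makes it easy to rerun the Q4 queries with broadcast joins enabled and confirm whether the GC pressure still occurs.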


Benchmark Key Findings

❑ No outright winner: different engines have different sweet spots

❑ SparkSQL and Impala are better options for “Small Data” queries

❑ Impala is the clear winner as concurrency increases, though all engines scaled linearly

❑ Nobody is standing still, and there is plenty more to do!


Next Time

❑ Latest engines: Spark 2.0, Impala 2.5, Hive 2.0 with LLAP

❑ New engines: Drill, Presto, HAWQ...

❑ New queries, analytics, window functions

❑ Data model variations (embedded dimensions as maps, etc.)