The BI on Hadoop Benchmark
Bay Area Big Data Meetup – March 2016
www.atscale.com
© 2015 ATSCALE, INC. ALL RIGHTS RESERVED. CONFIDENTIAL & PROPRIETARY
Agenda
❑Market Context
❑AtScale Overview
❑Benchmark Setup
❑Results!!
❑Lessons Learned
❑Wrap Up & Q & A
Market Context
But Seriously…
Hadoop Use Cases
Yesterday Today
What AtScale Does
I.T. needs Control & Consistency
The Business needs Freedom & Self-Service
The Business Interface for Hadoop
How We Do It
❑Any BI tool
❑ Industry standards
❑Schema on demand
❑Write once
Demo Time
Design Center
Designers
The AtScale Virtual Cube
On-Demand Aggregate Engine
Your Business Team
Benchmark Ingredients
1. Hadoop Cluster
12-node cluster with:
• 1 master node
• 1 AtScale gateway node
• 10 data nodes
RAM per node: 128G
CPU specs, data (worker) nodes: 32 CPU cores
Storage specs, data (worker) nodes: 2x 512mb SSD
Benchmark Ingredients
2. SQL-on-Hadoop Engines
Spark SQL
Version: 1.6-SNAPSHOT
Hive Version: 1.2
File Format: Parquet
Workers: 70
Memory per worker: 14G
Cores per worker: 4
Impala
Version: 2.3
Hive Version: 1.2
File Format: Parquet
Workers: 10
Memory per worker: 110G
Hive on Tez
Tez Version: 0.7
Hive Version: 1.2
File Format: ORC
hive.tez.container.size: 4096mb
hive.cbo.enabled: true
hive.auto.convert.join.noconditionaltask.size: 3036549120
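To see how the worker settings relate to the cluster, the following sketch adds up the memory each worker configuration requests and compares it with the RAM on the 10 data nodes. Reading the 70-worker configuration as Spark SQL and the 10-worker configuration as Impala is an inference from the tuning notes later in the deck; the function names are illustrative only.

```python
# Cluster figures from the "Hadoop Cluster" slide.
DATA_NODES = 10
RAM_PER_NODE_GB = 128

# Worker settings from the engine slide. The engine labels are an inference
# (70 workers x 14G matches the Spark SQL tuning notes), not stated on the slide.
ENGINES = {
    "spark_sql": {"workers": 70, "memory_gb": 14},
    "impala": {"workers": 10, "memory_gb": 110},
}


def total_memory_gb(engine):
    """Memory requested by all workers of one engine, in GB."""
    cfg = ENGINES[engine]
    return cfg["workers"] * cfg["memory_gb"]


def memory_share(engine):
    """Fraction of total data-node RAM that the engine's workers request."""
    return total_memory_gb(engine) / (DATA_NODES * RAM_PER_NODE_GB)


if __name__ == "__main__":
    for name in ENGINES:
        print(f"{name}: {total_memory_gb(name)} GB, "
              f"{memory_share(name):.0%} of data-node RAM")
```

Under these figures the two configurations request 980 GB and 1,100 GB respectively, so they are in the same range rather than identical.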
Benchmark Ingredients
3. Benchmark Data Set: Star-Schema Benchmark (SSB)
Table Name    Number of Rows
CUSTOMER      1 Billion
LINEORDER     6 Billion
SUPPLIER      2 Million
PART          2 Million
DATE          16 Thousand
Benchmark Ingredients
4. Benchmark Queries
Query ID  Joins  Largest Join Table (rows)  Group Bys  Filters  Comments
Q1.1      1      16,799                     0          3        1 range condition, 1 comparative filter condition in fact table
Q1.2      1      16,799                     0          3        2 range filter conditions directly on LINEORDER table
Q1.3      1      16,799                     0          4        2 range filter conditions directly on fact, 2 conditions on joined table
Q2.1      3      2,000,000                  2          2        filter on p_category (less selective)
Q2.2      3      2,000,000                  2          2        filter on p_brand, 2 values (more selective)
Q2.3      3      2,000,000                  2          2        filter on p_brand, 1 value (most selective)
Q3.1      3      1,050,000,000              3          3        filter on region (less selective)
Q3.2      3      1,050,000,000              3          3        filter on nation (more selective)
Q3.3      3      1,050,000,000              3          3        filter on city (most selective)
Q3.4      3      1,050,000,000              3          3        filter on city (most selective) and month (vs. year)
Q4.1      4      1,050,000,000              2          2
Q4.2      4      1,050,000,000              3          3        includes filter on year (more selective)
Q4.3      4      1,050,000,000              3          3        includes filter on year and nation (most selective)
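As a concrete reference for the Q1.1 row (1 join, 0 group bys, 3 filters), here is the query as it appears in the published Star Schema Benchmark, run against a tiny in-memory SQLite database. The deck does not show the exact SQL that was submitted to each engine, so treat this as the standard SSB text rather than the benchmark's own script. The table name dwdate, the trimmed column list and the sample rows are assumptions made for the illustration.

```python
import sqlite3

# SSB Q1.1: one join to the date dimension, one filter on year,
# one range filter and one comparison filter on the fact table.
Q1_1 = """
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM lineorder
JOIN dwdate ON lo_orderdate = d_datekey
WHERE d_year = 1993
  AND lo_discount BETWEEN 1 AND 3
  AND lo_quantity < 25
"""


def run_q1_1(lineorder_rows, date_rows):
    """Load the sample rows into SQLite and return the Q1.1 revenue."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dwdate (d_datekey INTEGER PRIMARY KEY, d_year INTEGER)")
    conn.execute(
        "CREATE TABLE lineorder (lo_orderdate INTEGER, lo_quantity INTEGER, "
        "lo_discount INTEGER, lo_extendedprice INTEGER)")
    conn.executemany("INSERT INTO dwdate VALUES (?, ?)", date_rows)
    conn.executemany("INSERT INTO lineorder VALUES (?, ?, ?, ?)", lineorder_rows)
    (revenue,) = conn.execute(Q1_1).fetchone()
    conn.close()
    return revenue


# Made-up sample data: only the first fact row passes all three filters.
DATES = [(19930101, 1993), (19940101, 1994)]
LINEORDER = [
    (19930101, 10, 2, 1000),  # qualifies: contributes 1000 * 2
    (19930101, 30, 2, 1000),  # rejected: quantity is not below 25
    (19930101, 10, 5, 1000),  # rejected: discount outside 1..3
    (19940101, 10, 2, 1000),  # rejected: order date is in 1994
]

if __name__ == "__main__":
    print(run_q1_1(LINEORDER, DATES))
```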
Benchmark Ingredients
5. Real Bearded Wizard
Benchmark Framework
❑Performs on Big Data
❑Fast on Small Data
❑Stable for Many Users
Performs on Big Data: 6B Rows
Fast on Small: Adaptive Cache
Stable for Many: Concurrency
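A concurrency test of this kind can be sketched as a small harness in which each simulated user runs the full query list while the others do the same. This is not the harness used for the benchmark, which the deck does not describe; run_query is a hypothetical callable standing in for whatever client submits SQL to the engine under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(run_query, sql):
    """Return the wall-clock seconds taken by one query."""
    start = time.perf_counter()
    run_query(sql)
    return time.perf_counter() - start


def run_concurrent(run_query, queries, users):
    """Run the query list once per simulated user, all users in parallel.

    Returns a flat list with one latency (in seconds) per executed query.
    """
    def one_user(_):
        return [timed(run_query, q) for q in queries]

    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = list(pool.map(one_user, range(users)))
    return [latency for user in per_user for latency in user]


def average(latencies):
    """Mean latency in seconds."""
    return sum(latencies) / len(latencies)


if __name__ == "__main__":
    # Stub client that only sleeps; replace with a real engine connection.
    stub = lambda sql: time.sleep(0.01)
    for users in (1, 5, 10):
        latencies = run_concurrent(stub, ["Q1.1", "Q2.1"], users)
        print(users, "users:", round(average(latencies), 3), "s average")
```

Plotting the average latency against the number of users is one way to check whether an engine degrades linearly as load increases.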
Data Formats & Partitioning
❑ORC for Hive, because most of Hive's speed-ups (vectorization, CBO, etc.) only work on ORC tables
❑Parquet for Impala and Spark, because most of the performance work for these engines is done for Parquet
❑The tables were left unpartitioned, to give a true test of performance against large data sets
Impala tuning
❑Impala required the least tuning. We configured it to use the same amount of memory as the other engines.
❑For the queries:
– Changed the formatting so they would run on Impala.
– Changed the join order, which gave a 10-20% performance increase on a few queries.
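The deck does not say which join order was used. One commonly documented heuristic for hand-ordered joins in Impala is to list the largest table first and then the remaining tables from smallest to largest. The sketch below applies that heuristic to the row counts from the data set slide; the heuristic and the function name are assumptions, not the presenters' stated method.

```python
# Approximate row counts from the "Benchmark Data Set" slide.
ROW_COUNTS = {
    "lineorder": 6_000_000_000,
    "customer": 1_000_000_000,
    "supplier": 2_000_000,
    "part": 2_000_000,
    "date": 16_000,
}


def join_order(tables):
    """Largest table first, then the rest from smallest to largest.

    This is an assumed heuristic, not an order taken from the deck.
    """
    by_size = sorted(tables, key=lambda t: ROW_COUNTS[t])
    largest = by_size.pop()
    return [largest] + by_size


if __name__ == "__main__":
    # Tables joined by the Q3.x queries.
    print(join_order(["customer", "lineorder", "supplier", "date"]))
```

For the Q3.x queries this puts LINEORDER first, followed by DATE, SUPPLIER and CUSTOMER.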
Spark SQL tuning
❑ 14G of memory per worker, 3 cores and 70 workers was the best combination.
❑ spark.sql.autoBroadcastJoinThreshold is your friend! The dims were all small enough that we could make sure all queries ran as broadcast joins.
❑ Changing the join order for the queries yielded a ~5% performance increase.
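The effect of spark.sql.autoBroadcastJoinThreshold can be illustrated with a simplified model of the size check: a table is a broadcast candidate when its estimated size is at or below the threshold, and a negative threshold turns broadcasting off. The byte sizes below are hypothetical, since the deck gives row counts rather than on-disk sizes, and the helper names are made up for the example.

```python
# Spark's documented default for spark.sql.autoBroadcastJoinThreshold: 10 MB.
SPARK_DEFAULT_THRESHOLD = 10 * 1024 * 1024


def can_broadcast(estimated_bytes, threshold=SPARK_DEFAULT_THRESHOLD):
    """Simplified size check; a negative threshold disables broadcast joins."""
    if threshold < 0:
        return False
    return 0 <= estimated_bytes <= threshold


def threshold_for(dimension_bytes):
    """Smallest threshold (in bytes) at which every listed table qualifies."""
    return max(dimension_bytes.values())


# Hypothetical sizes, for illustration only; not measured in the benchmark.
HYPOTHETICAL_DIM_BYTES = {
    "date": 2 * 1024**2,
    "part": 150 * 1024**2,
    "supplier": 200 * 1024**2,
}

if __name__ == "__main__":
    raised = threshold_for(HYPOTHETICAL_DIM_BYTES)
    for name, size in HYPOTHETICAL_DIM_BYTES.items():
        print(name, can_broadcast(size), can_broadcast(size, raised))
```

With the default threshold only the smallest table qualifies; raising the threshold to the size of the largest dimension makes all of them eligible, which is the kind of adjustment the slide describes.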
Hive 1.2.1 Tuning
❑ Required a different setup for each concurrent query test (setting hive.server2.tez.sessions.per.default.queue and hive.server2.tez.default.queues) to better support different concurrency levels.
❑ As with Spark, hive.auto.convert.join.noconditionaltask.size is your friend. We set it high enough (3,036,549,120 in our case) that all our queries would run as broadcast joins.
❑ The exceptions were queries Q4.1 to Q4.3, which we had to force to be sort-joins because of the GC pressure the broadcast join caused.
❑ Unlike with Spark and Impala, we did not have to change the join order, as the Hive CBO did that automatically.
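The session-level settings named in the deck can be rendered as Hive SET statements, as in the sketch below. The three base values come from the engine configuration slide. The Q4 override is an assumption: turning off hive.auto.convert.join is one way to keep a query off the broadcast (map-join) path, but the deck does not say how the sort-joins were forced. The two hive.server2.tez.* properties are server-side settings whose values are not given, so they are left out.

```python
# Values from the engine configuration slide (container size is in MB).
HIVE_SESSION_SETTINGS = {
    "hive.tez.container.size": "4096",
    "hive.cbo.enabled": "true",
    "hive.auto.convert.join.noconditionaltask.size": "3036549120",
}

# Assumption: disabling automatic map-join conversion for Q4.1 to Q4.3.
# The deck does not state which setting was actually used.
Q4_OVERRIDES = {"hive.auto.convert.join": "false"}


def set_statements(query_id):
    """Return the SET statements to issue before running the given query."""
    settings = dict(HIVE_SESSION_SETTINGS)
    if query_id.startswith("Q4."):
        settings.update(Q4_OVERRIDES)
    return [f"SET {key}={value};" for key, value in settings.items()]


def bytes_to_gib(n):
    """Convert a byte count to GiB."""
    return n / 1024**3


if __name__ == "__main__":
    for query_id in ("Q3.1", "Q4.2"):
        print(query_id)
        for statement in set_statements(query_id):
            print("  " + statement)
    print(round(bytes_to_gib(3036549120), 2), "GiB map-join size limit")
```

The size limit of 3,036,549,120 bytes works out to roughly 2.83 GiB, which gives a sense of how much dimension data each task was allowed to hold in memory for a broadcast join.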
Benchmark Key Findings
❑ No outright winner - different engines have different sweet spots
❑ SparkSQL and Impala are better options for “Small Data” queries
❑ Impala is the clear winner as concurrency increases, though all engines scaled linearly
❑ Nobody is standing still, and there is plenty more to do!
Next Time
❑ Latest engines: Spark 2.0, Impala 2.5, Hive 2.0 with LLAP
❑ New engines: Drill, Presto, HAWQ, ...
❑ New queries, analytics, window functions
❑ Data model variations (embedded dimensions as maps, etc.)