Hive Now Sparks Chao Sun | April 2014

Hive Now Sparks - events.static.linuxfound.org · Apache Hive: a popular data processing tool for Hadoop • Apache Spark: a data computing framework to succeed MapReduce • Marrying


© Cloudera, Inc. All rights reserved.


Outline

• Background
• Architecture and Design
• Challenges
• Current Status
• Benchmarks


Background

• Apache Hive: a popular data processing tool for Hadoop
• Apache Spark: a data computing framework to succeed MapReduce
• Marrying the two can benefit users from both communities
• HIVE-7292: the most watched JIRA in Hive (160+ watchers)


Community Involvement

• Efforts from both communities (Hive and Spark)
• Contributions from many organizations


Design Principles

• No or limited impact on Hive’s existing code path
• Maximum code reuse
• Minimum feature customization
• Low future maintenance cost


Hive Internal

[Figure: Hive query flow. Components: Hive Client (Beeline, JDBC), HiveServer2, MetaStore, Parser, SemanticAnalyzer, Task Compiler, Hadoop Client, Hadoop. Artifacts passed between stages: HQL, AST, OperatorTree, Task, Result.]


Class Hierarchy

[Figure: class hierarchy before Spark support.]

• TaskCompiler: MapRedCompiler, TezCompiler
• Task (generated by a TaskCompiler): MapRedTask, TezTask
• Work (describes a Task): MapRedWork, TezWork


Class Hierarchy

[Figure: class hierarchy with the Spark classes added.]

• TaskCompiler: MapRedCompiler, TezCompiler, SparkCompiler
• Task (generated by a TaskCompiler): MapRedTask, TezTask, SparkTask
• Work (describes a Task): MapRedWork, TezWork, SparkWork
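
The hierarchy can be sketched in a few lines. This is a minimal Python illustration, not Hive's Java code: the class names come from the slide, while the `compile` method, the `compiler_for` helper, and the lookup table are assumptions made for the example. The keys `mr`, `tez`, and `spark` mirror the values accepted by `hive.execution.engine`.

```python
from dataclasses import dataclass


@dataclass
class Work:
    """Metadata describing what a task should execute."""
    description: str


class MapRedWork(Work): ...
class TezWork(Work): ...
class SparkWork(Work): ...


@dataclass
class Task:
    """A unit of execution, described by a Work object."""
    work: Work


class MapRedTask(Task): ...
class TezTask(Task): ...
class SparkTask(Task): ...


class TaskCompiler:
    """Generates a Task; subclasses choose the engine-specific types."""
    task_type = Task
    work_type = Work

    def compile(self, query: str) -> Task:  # hypothetical method name
        return self.task_type(self.work_type(query))


class MapRedCompiler(TaskCompiler):
    task_type, work_type = MapRedTask, MapRedWork


class TezCompiler(TaskCompiler):
    task_type, work_type = TezTask, TezWork


class SparkCompiler(TaskCompiler):
    task_type, work_type = SparkTask, SparkWork


# Hypothetical lookup from engine name to compiler class.
COMPILERS = {"mr": MapRedCompiler, "tez": TezCompiler, "spark": SparkCompiler}


def compiler_for(engine: str) -> TaskCompiler:
    return COMPILERS[engine]()


task = compiler_for("spark").compile("SELECT 1")
print(type(task).__name__, type(task.work).__name__)
```

Adding an engine in this sketch means adding one compiler, one task, and one work class, which is the shape of the change the slide shows for Spark.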


Work – Metadata for Task

• MapRedWork contains a MapWork and a possible ReduceWork
• SparkWork contains a graph of MapWorks and ReduceWorks

Ex Query:

SELECT name, sum(value) AS v
FROM src
GROUP BY name
ORDER BY v;

[Figure: with MapReduce the query runs as two jobs, MR Job 1 (MapWork1, ReduceWork1) and MR Job 2 (MapWork2, ReduceWork2). With Spark it runs as a single Spark Job (MapWork1, ReduceWork1, ReduceWork2).]
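
The difference between the two layouts can be made concrete with a small graph structure. This is a sketch only: `WorkGraph` and its methods are hypothetical names, and the work names are taken from the figure on this slide.

```python
from dataclasses import dataclass, field


@dataclass
class WorkGraph:
    """A directed graph of work units; an edge marks a shuffle between two of them."""
    edges: dict = field(default_factory=dict)

    def connect(self, parent: str, child: str) -> None:
        self.edges.setdefault(parent, []).append(child)
        self.edges.setdefault(child, [])

    def works(self) -> list:
        return list(self.edges)


# MapReduce: each job holds one MapWork and at most one ReduceWork,
# so the GROUP BY and the ORDER BY end up in two separate jobs.
mr_job_1 = WorkGraph()
mr_job_1.connect("MapWork1", "ReduceWork1")
mr_job_2 = WorkGraph()
mr_job_2.connect("MapWork2", "ReduceWork2")
mr_jobs = [mr_job_1, mr_job_2]

# Spark: a single graph chains both reduce stages.
spark_job = WorkGraph()
spark_job.connect("MapWork1", "ReduceWork1")
spark_job.connect("ReduceWork1", "ReduceWork2")

print(len(mr_jobs), [job.works() for job in mr_jobs])
print(1, spark_job.works())
```

In the figure, the MapReduce version needs an extra MapWork2 to start the second job, while the Spark version feeds ReduceWork1 directly into ReduceWork2.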


Spark Client

• Sits alongside the MR client and Tez client
• Talks to the Spark cluster
• Supports local, local-cluster, standalone, yarn-cluster, and yarn-client modes
• Handles job submission, monitoring, error reporting, statistics, metrics, and counters
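
As an illustration of the deployment modes listed above, the following sketch classifies a `spark.master` value. The function name and the regular expressions are assumptions for this example and cover only the common Spark 1.x forms of each value; the actual parsing is done inside Spark and is not shown in the slides.

```python
import re

# Patterns for the master values named on the slide (common Spark 1.x forms only).
_MASTER_PATTERNS = [
    ("local", re.compile(r"^local(\[(\d+|\*)\])?$")),
    ("local-cluster", re.compile(r"^local-cluster\[\d+,\s*\d+,\s*\d+\]$")),
    ("standalone", re.compile(r"^spark://[^:/\s]+:\d+$")),
    ("yarn-cluster", re.compile(r"^yarn-cluster$")),
    ("yarn-client", re.compile(r"^yarn-client$")),
]


def classify_master(master: str) -> str:
    """Return the deployment mode for a spark.master value, or raise ValueError."""
    for mode, pattern in _MASTER_PATTERNS:
        if pattern.match(master):
            return mode
    raise ValueError(f"unsupported spark.master value: {master!r}")


for value in ["local[4]", "local-cluster[2,1,1024]", "spark://host:7077", "yarn-client"]:
    print(value, "->", classify_master(value))
```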


Spark Context

• Core of the Spark client
• Heavyweight and not thread-safe
• Designed for a single-user application
• Doesn’t work in a multi-session environment
• Doesn’t scale with user sessions


Remote Spark Context (RSC)

• Created and lives outside HiveServer2
• In yarn-cluster mode, the Spark context lives in the application master (AM)
• Otherwise, the Spark context lives in a separate process (other than HiveServer2)

[Figure: User 1 and User 2 each have a session (Session 1, Session 2) in HiveServer2. Each session is paired with its own AM (RSC) running on a node (Node 1, Node 2, Node 3) of the YARN cluster.]


Data Processing via MapReduce

• Table is stored as HDFS files and read by the MR framework
• Map-side processing
• Map output is shuffled by the MR framework
• Reduce-side processing
• Reduce output is written to disk as part of reduce-side processing
• Output may be further processed by the next MR job or returned to the client


Data Processing via Spark

• Treat the table as a HadoopRDD (input RDD)
• Apply the function that wraps MR’s map-side processing
• Shuffle map output using Spark’s transformations (groupByKey, sortByKey, etc.)
• Apply the function that wraps MR’s reduce-side processing
• Output is either written to file or shuffled again
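
The steps above can be traced on a tiny in-memory dataset. The sketch below is plain Python, not the Spark API: `group_by_key` and `sort_by_key` only imitate the transformations named on the slide, the other function names are hypothetical, and the rows are made-up sample data. It evaluates a query of the form `SELECT name, sum(value) AS v FROM src GROUP BY name ORDER BY v`.

```python
from collections import defaultdict

# Stand-in for the rows of table `src` (hypothetical data).
src = [("a", 3), ("b", 1), ("a", 4), ("c", 2), ("b", 5)]


def map_side(row):
    """Map-side processing: emit (group key, value)."""
    name, value = row
    return name, value


def group_by_key(pairs):
    """Shuffle that gathers all values sharing a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())


def reduce_sum(group):
    """Reduce-side processing for GROUP BY: sum each group, then re-key by the sum."""
    name, values = group
    return sum(values), name


def sort_by_key(pairs):
    """Shuffle that orders records by key."""
    return sorted(pairs, key=lambda pair: pair[0])


def reduce_emit(pair):
    """Reduce-side processing for ORDER BY: restore the (name, v) column order."""
    v, name = pair
    return name, v


mapped = [map_side(row) for row in src]
summed = [reduce_sum(group) for group in group_by_key(mapped)]  # output is shuffled again
result = [reduce_emit(pair) for pair in sort_by_key(summed)]    # final output
print(result)  # [('c', 2), ('b', 6), ('a', 7)]
```

The output of the first reduce step is shuffled again instead of being written to disk, which is the last bullet above in miniature.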


Spark Plan

• MapInput – encapsulates a table
• MapTran – map-side processing
• ShuffleTran – shuffling
• ReduceTran – reduce-side processing

Ex Query:

SELECT name, sum(value) AS v
FROM src
GROUP BY name
ORDER BY v;

[Figure: plan for the query: MapInput, MapTran, ShuffleTran, ReduceTran, ShuffleTran, ReduceTran.]
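
One way to read the plan is as a translation of a chain of works into plan elements. The sketch below assumes a linear chain and assumes that every ReduceWork is preceded by a shuffle; the function `to_spark_plan` is hypothetical, and the slides do not describe how Hive on Spark actually generates the plan.

```python
def to_spark_plan(works):
    """Turn an ordered chain of work names into a chain of plan elements.

    Assumes a linear chain in which a MapWork reads a table and every
    ReduceWork consumes shuffled output from the work before it.
    """
    plan = []
    for work in works:
        if work.startswith("MapWork"):
            plan += ["MapInput", "MapTran"]
        elif work.startswith("ReduceWork"):
            plan += ["ShuffleTran", "ReduceTran"]
        else:
            raise ValueError(f"unknown work type: {work}")
    return plan


plan = to_spark_plan(["MapWork1", "ReduceWork1", "ReduceWork2"])
print(" -> ".join(plan))
# MapInput -> MapTran -> ShuffleTran -> ReduceTran -> ShuffleTran -> ReduceTran
```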


Advantages

• Reuse existing map-side and reduce-side processing
• Agnostic to Spark’s special transformations or actions
• No need to reinvent wheels
• Adopt existing features: authorization, window functions, UDFs, etc.
• Open to future features


Challenges

• Missing features or functional gaps in Spark
  • Concurrent data pipelines
  • Spark Context issues
  • Scala vs Java API
  • Scheduling issues

• Large code base in Hive, many contributors working in different areas

• Library dependency conflicts among projects


Dynamic Executor Scaling

• Spark cluster per user session
• Heavy user vs light user
• Big query vs small query
• Solution: scale executors up and down based on workload
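
A simplified version of such a scaling policy is sketched below. It is an assumption made for illustration, not Spark's scheduler code: the function and parameter names are hypothetical, and the idea is only that the number of executors follows the number of outstanding tasks within configured bounds (in Spark the bounds would correspond to settings such as `spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors`).

```python
import math


def target_executors(pending_tasks: int, running_tasks: int,
                     cores_per_executor: int,
                     min_executors: int, max_executors: int) -> int:
    """Executors needed to run all outstanding tasks at once, clamped to the bounds."""
    needed = math.ceil((pending_tasks + running_tasks) / cores_per_executor)
    return max(min_executors, min(max_executors, needed))


# An idle session shrinks to the minimum; a big query grows up to the cap.
print(target_executors(0, 0, 4, 1, 50))      # 1
print(target_executors(100, 20, 4, 1, 50))   # 30
print(target_executors(1000, 0, 4, 1, 50))   # 50
```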


Current Status

• All functionality is implemented
• First round of optimization is completed
• More optimization and benchmarking coming
• Beta release in CDH5.4
• Released in Apache Hive 1.1.0
• Follow HIVE-7292 for current and future work


Optimizations

• Map join, bucket map join, SMB, skew join (static and dynamic)

• Split generation and grouping
• CBO, vectorization
• More to come, including table caching and dynamic partition pruning


Summary

• Community-driven project
• Multi-organization support
• Combining merits from multiple projects
• Benefiting a large user base
• Bearing solid foundations
• A solid, evolving project


Benchmarks – Cluster Setup

• Done by our team members from Intel
• 8 physical nodes
• Each has 32 logical cores and 64GB memory
• 10000Mb/s network between the nodes

Component   Version
Hive        spark-branch
Spark       1.3.0
Hadoop      2.6.0
Tez         0.5.3


Benchmarks – Test Configuration

• 320GB and 4TB TPC-DS datasets
• The three engines share most of the configuration
  • Each node is allocated 32 cores and 48GB memory
  • Vectorization enabled
  • CBO enabled
  • hive.auto.convert.join.noconditionaltask.size = 600MB


Benchmarks – Test Configurations

• Hive on Tez
  • hive.prewarm.numcontainers = 250
  • hive.tez.auto.reducer.parallelism = true
  • hive.tez.dynamic.partition.pruning = true
• Hive on Spark
  • spark.master = yarn-client
  • spark.executor.memory = 5120m
  • spark.yarn.executor.memoryOverhead = 1024
  • spark.executor.cores = 4
  • spark.kryo.referenceTracking = false
  • spark.io.compression.codec = lzf
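
The Spark settings can be checked against the per-node allocation of 32 cores and 48GB given in the test configuration. The arithmetic below is a rough upper bound under stated assumptions: it ignores YARN's rounding of container sizes and the container used by the application master, and the variable names are chosen for this example.

```python
NODES = 8                      # physical nodes in the cluster setup
NODE_MEMORY_MB = 48 * 1024     # 48GB allocated per node
NODE_CORES = 32                # cores allocated per node

EXECUTOR_MEMORY_MB = 5120      # spark.executor.memory
EXECUTOR_OVERHEAD_MB = 1024    # spark.yarn.executor.memoryOverhead
EXECUTOR_CORES = 4             # spark.executor.cores

# Each executor container needs heap plus overhead.
container_mb = EXECUTOR_MEMORY_MB + EXECUTOR_OVERHEAD_MB

# Executors per node are limited by whichever resource runs out first.
by_memory = NODE_MEMORY_MB // container_mb
by_cores = NODE_CORES // EXECUTOR_CORES
executors_per_node = min(by_memory, by_cores)

print(container_mb, by_memory, by_cores, executors_per_node, executors_per_node * NODES)
# 6144 8 8 8 64
```

Under these assumptions memory and cores run out at the same point, 8 executors per node, so the chosen executor size uses the whole allocation on each node.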


Benchmarks – Data Collecting

• We run each query twice and measure the second run
• Spark on YARN waits for a number of executors to register before scheduling tasks, and thus has a bigger start-up overhead
• We measure a few queries for Hive on Tez with and without dynamic partition pruning for comparison, as this optimization hasn’t been implemented in Hive on Spark yet


MR vs Spark vs Tez, 320GB

[Chart: bars for MR, Spark, and Tez on Simple count(*), Simple map join, Simple common join, Q3, Q15, Q19, Q21, Q22, Q28, Q82, Q84, Q88, and Q90. Y-axis from 0 to 1000; units are not stated.]


MR vs Spark vs Tez, 320GB

[Chart: the same MR, Spark, and Tez comparison at 320GB (Simple count(*), Simple map join, Simple common join, Q3, Q15, Q19, Q21, Q22, Q28, Q82, Q84, Q88, Q90; y-axis from 0 to 1000), with the annotation “DPP helps Tez”.]


Benchmark – Dynamic Partition Pruning

• Prune partitions at runtime
• Ex Query:

SELECT count(*)
FROM src JOIN src_date ON (src.ds = src_date.ds)
WHERE src_date.`date` = '2015-04-16'

(src is partitioned on ds, but src_date is NOT partitioned.)

• Can dramatically improve performance when tables are joined on partitioned columns
• To be implemented in Hive on Spark
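
The idea can be illustrated on in-memory data. The sketch below is an assumption-laden toy, not Hive's implementation: the partition names, rows, and the `count_join` function are made up, and each `ds` value is assumed to appear once in `src_date`. It filters the small table first, collects the surviving join keys, and reads only the matching partitions of the large table.

```python
# Hypothetical data: `src` is stored as one list of rows per `ds` partition.
src_partitions = {
    "p1": [("p1", 10), ("p1", 11)],
    "p2": [("p2", 20)],
    "p3": [("p3", 30), ("p3", 31), ("p3", 32)],
}
# `src_date` is small and unpartitioned: rows of (ds, date), one row per ds.
src_date = [("p1", "2015-04-15"), ("p2", "2015-04-16"), ("p3", "2015-04-17")]


def count_join(prune: bool):
    """Return (join row count, number of src partitions scanned)."""
    # Filter the small side first; the surviving join keys are known only at runtime.
    keys = {ds for ds, date in src_date if date == "2015-04-16"}
    # With pruning, only partitions whose key survived the filter are read.
    to_scan = [p for p in src_partitions if not prune or p in keys]
    rows = [row for p in to_scan for row in src_partitions[p]]
    matches = sum(1 for ds, _ in rows if ds in keys)
    return matches, len(to_scan)


print(count_join(prune=False))  # (1, 3)
print(count_join(prune=True))   # (1, 1)
```

Both calls return the same count, but the pruned version reads one partition instead of three, which is where the runtime saving comes from.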


Spark vs Tez vs Tez w/o DPP, 320GB

[Chart: bars for Spark, Tez, and Tez w/o DPP on Q3, Q15, and Q19. Y-axis from 0 to 30; units are not stated.]


MR vs Spark vs Tez, 4TB

[Chart: bars for MR, Spark, and Tez on Simple count(*), Simple map join, Simple common join, Q3, Q15, Q19, Q21, Q22, Q28, Q82, Q84, Q88, and Q90. Y-axis from 0 to 7000; units are not stated.]


Spark vs Tez, 4TB

[Chart: bars for Spark and Tez on Simple count(*), Simple map join, Simple common join, Q3, Q15, Q19, Q21, Q22, Q28, Q82, Q84, Q88, and Q90. Y-axis from 0 to 4000; units are not stated.]


Spark vs Tez, 4TB

[Chart: the same Spark and Tez comparison at 4TB (Simple count(*), Simple map join, Simple common join, Q3, Q15, Q19, Q21, Q22, Q28, Q82, Q84, Q88, Q90; y-axis from 0 to 4000), with the annotation “DPP helps Tez”.]


Spark vs Tez vs Tez w/o DPP, 4TB

[Chart: bars for Spark, Tez, and Tez w/o DPP on Q3, Q15, and Q19. Y-axis from 0 to 1800; units are not stated.]


Spark vs Tez, 4TB

[Chart: the same Spark and Tez comparison at 4TB (Simple count(*), Simple map join, Simple common join, Q3, Q15, Q19, Q21, Q22, Q28, Q82, Q84, Q88, Q90; y-axis from 0 to 4000), with the annotation “Spark is faster”.]




Benchmarks – Summary

• Spark is as fast as or faster than Tez on many queries (Q28, Q88, etc.)
• Dynamic partition pruning helps Tez a lot on certain queries (Q3, Q15, Q19). With DPP disabled for Tez, Spark is close or even faster
• Spark is slower than Tez on certain queries (common join, Q84). Investigation and improvement are under way
• A bigger dataset seems to help Spark more
• Spark will likely be faster after DPP is implemented


Questions?