Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads

Ahsan Javed Awan, EMJD-DC (KTH-UPC) (https://www.kth.se/profile/ajawan/)

Mats Brorsson (KTH), Eduard Ayguade (UPC and BSC), Vladimir Vlassov (KTH)

Motivation: Why should we care about architecture support?

*Taken from Babak's slides

Data Growing Faster Than Technology

Motivation (cont.)

Our Goal

Improve node-level performance through architecture support

*Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/

Phoenix++, Metis, Ostrich,

Hadoop, Spark, Flink, etc.

Our Approach

● Performance characterization of in-memory data analytics on a modern cloud server, in 5th International IEEE Conference on Big Data and Cloud Computing, 2015 (Best Paper Award).

● How Data Volume Affects Spark Based Data Analytics on a Scale-up Server, in 6th International Workshop on Big Data Benchmarks, Performance Optimization and Emerging Hardware (BPOE), held in conjunction with VLDB 2015, Hawaii, USA.

– Limited to batch processing workloads only

– Does not consider the velocity aspect of big data

– Experiments are based on an older version of Spark.

What are the major performance bottlenecks?

What are the remaining questions?

● Does micro-architectural performance remain consistent across batch and stream processing workloads?

● How do DataFrames compare to RDDs at the micro-architectural level?

● How does data velocity affect micro-architectural performance?

Which Scale-out Framework?

[Picture Courtesy: Amir H. Payberah]

● Tuning of Spark internal parameters

● Tuning of JVM parameters (heap size, etc.)

● Micro-architecture level analysis using hardware performance counters
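The categories reported later in these slides (front-end bound, back-end bound, retirement) match the first level of Intel's top-down analysis method. Assuming that method was used, the sketch below shows how such a breakdown can be derived from raw counter values. The field names, the pipeline width of 4, and the sample counts are illustrative assumptions, not measurements from this work.

```python
from dataclasses import dataclass

# Assumption: 4 issue slots per cycle, as commonly used for Ivy Bridge cores.
PIPELINE_WIDTH = 4


@dataclass
class CounterSample:
    """Raw event counts for one measurement interval (field names are illustrative)."""
    cycles: int               # unhalted core cycles
    uops_not_delivered: int   # issue slots the front end left empty
    uops_issued: int          # micro-ops issued to the back end
    uops_retired_slots: int   # issue slots that ended in a retired micro-op
    recovery_cycles: int      # cycles spent recovering from mis-speculation


def top_down_level1(s: CounterSample) -> dict:
    """Split all issue slots into the four level-1 top-down categories."""
    slots = PIPELINE_WIDTH * s.cycles
    frontend = s.uops_not_delivered / slots
    retiring = s.uops_retired_slots / slots
    bad_spec = (s.uops_issued - s.uops_retired_slots
                + PIPELINE_WIDTH * s.recovery_cycles) / slots
    # Whatever is left over is attributed to back-end stalls.
    backend = 1.0 - (frontend + retiring + bad_spec)
    return {
        "front_end_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "back_end_bound": backend,
    }


# Hypothetical counter values, chosen only to exercise the formulas.
sample = CounterSample(
    cycles=1_000_000,
    uops_not_delivered=800_000,
    uops_issued=1_500_000,
    uops_retired_slots=1_400_000,
    recovery_cycles=25_000,
)
breakdown = top_down_level1(sample)
for category, fraction in breakdown.items():
    print(f"{category}: {fraction:.1%}")
```

The four fractions sum to one by construction, so a drop in one category (for example front-end bound stalls) must show up as a rise in another (for example retiring).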

Which benchmarks?

Our Hardware Configuration

Which Machine?

Hyper-Threading and Turbo Boost are disabled

Intel's Ivy Bridge Server

Does micro-architectural performance remain consistent?

Stream processing is micro-architecturally similar to batch processing in Spark

Streaming workloads with similar Spark transformations have different micro-architectural behavior

(cont.)

Workload | Spark Transformation | Input data rate | Window size (s) | Working set with 2 s sampling interval
WWc | FlatMap, Map, ReduceByKeyAndWindow | 10^4 | 30 | 15 x 10^4
CSpc | FlatMap, Map, CountByValueAndWindow | 10^4 | 10 | 5 x 10^4
CErpz | FlatMap, Map, Window, GroupByKey | 10^4 | 30 | 15 x 10^4
CAuC | FlatMap, Map, Window, GroupByKey, Count | 10^4 | 10 | 5 x 10^4
Tpt | FlatMap, ReduceByKeyAndWindow, Transform | 10^1 | 60 | 30 x 10^1

Micro-batch size determines the micro-architectural behavior of stream processing workloads with similar Spark transformations
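The working-set values in the table are consistent with input data rate x window size / sampling interval. That relationship is inferred from the numbers, not stated on the slides, and the unit of the input data rate is not given. A minimal sketch that reproduces the column under this assumption:

```python
# Assumption: working set = input rate * window size / sampling interval.
# This formula is inferred from the table values, not taken from the slides.
SAMPLING_INTERVAL_S = 2

# (input data rate, window size in seconds), copied from the table.
workloads = {
    "WWc": (10**4, 30),
    "CSpc": (10**4, 10),
    "CErpz": (10**4, 30),
    "CAuC": (10**4, 10),
    "Tpt": (10**1, 60),
}


def working_set(rate: int, window_s: int, interval_s: int = SAMPLING_INTERVAL_S) -> int:
    """Number of records held in one window under the assumed relationship."""
    return rate * window_s // interval_s


computed = {name: working_set(rate, window) for name, (rate, window) in workloads.items()}
for name, size in computed.items():
    print(f"{name}: working set = {size}")
```

Under this reading, WWc and CErpz share one working-set size and CSpc and CAuC share another, which makes it easy to compare workloads by working-set size rather than by transformation.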

Do DataFrames perform better than RDDs at the micro-architectural level?

DataFrames exhibit:

● 25% fewer back-end bound stalls

● 64% fewer DRAM-bound stalled cycles

● 25% less bandwidth consumption

● 10% less starvation of execution resources

DataFrames have better micro-architectural performance than RDDs

How does data velocity affect micro-architectural performance?

Better CPU utilization at higher data velocity

(cont.)

● Higher instruction retirement at higher data velocity

● Higher L1-bound stalls at higher data velocity

● Less starvation at higher data velocity

● Higher bandwidth consumption at higher data velocity

Conclusion

● Batch processing and stream processing have the same micro-architectural behavior in Spark if the only difference between the two implementations is micro-batching.

● Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.

● If the input data rates are small, stream processing workloads are front-end bound. However, front-end bound stalls are reduced at larger input data rates, and instruction retirement improves.
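To make "differs only in micro-batching" concrete, here is a pure-Python illustration that does not use the Spark API. It applies the same word-count logic once to the whole input and once per micro-batch with a sliding window, loosely mirroring a FlatMap, Map, ReduceByKeyAndWindow pipeline. The function names and input lines are hypothetical.

```python
from collections import Counter


def batch_word_count(lines):
    """Split lines into words (flatMap) and count each word (map + reduceByKey)."""
    words = (word for line in lines for word in line.split())
    return Counter(words)


def windowed_word_counts(micro_batches, window_len):
    """Apply the same counting logic per micro-batch, then merge the
    counts of the last `window_len` micro-batches at every step."""
    per_batch = [batch_word_count(batch) for batch in micro_batches]
    results = []
    for i in range(len(per_batch)):
        window = per_batch[max(0, i - window_len + 1): i + 1]
        total = Counter()
        for counts in window:
            total.update(counts)  # Counter.update adds counts together
        results.append(total)
    return results


# Hypothetical input, split into two micro-batches of two lines each.
lines = ["spark streams data", "spark batches data", "data data", "stream"]
micro_batches = [lines[0:2], lines[2:4]]

windows = windowed_word_counts(micro_batches, window_len=2)
print(windows[-1] == batch_word_count(lines))
```

When the window spans every micro-batch, the streaming result equals the batch result: the per-record work is identical and only the grouping into micro-batches differs.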

THANK YOU

List of Papers

● Performance characterization of in-memory data analytics on a modern cloud server, in 5th International IEEE Conference on Big Data and Cloud Computing, 2015 (Best Paper Award).

● How Data Volume Affects Spark Based Data Analytics on a Scale-up Server, in 6th International Workshop on Big Data Benchmarks, Performance Optimization and Emerging Hardware (BPOE), held in conjunction with VLDB 2015, Hawaii, USA.

● Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads. (accepted to BDCloud 2016)

● Node Architecture Implications for In-Memory Data Analytics in Scale-in Clusters (accepted to IEEE BDCAT 2016)

● Implications of In-Memory Data Analytics with Apache Spark on Near Data Computing Architectures (under submission).
