Architecture Support for Big Data Analytics/A.Awan.pdf · Architecture Support for Big Data...

Architecture Support for Big Data Analytics

Ahsan Javed Awan EMJD-DC (KTH-UPC)

(http://uk.linkedin.com/in/ahsanjavedawan/)Supervisors: Mats Brorsson(KTH), Eduard Ayguade(UPC), Vladimir

Vlassov(KTH)

MotivationWhy should we care?

*Source: Babak Falsafi slides

MotivationCont...

● A mismatch between the characteristics of emerging workloads and the underlying hardware.

– M. Ferdman et-al, “Clearing the clouds: A study of emerging scale-out workloads on modern hardware,” in ASPLOS 2012.

– Z. Jia, et-al “Characterizing data analysis workloads in data centers,” in IISWC 2013.

– C. Zheng et-al, “Characterizing os behavior of scale-out data center workloads.” in WIVOSCA 2013.

– M. Dimitrov et al, “Memory system characterization of big data workloads,” in BigData Conference, 2013.

– Z. Jia et-al, “Characterizing and subsetting big data workloads,” in IISWC 2014

– A. Yasin et-al, “Deep-dive analysis of the data analytics workload in cloudsuite,” in IISWC 2014.

– L. Wang et-al, “Bigdatabench: A big data benchmark suite from internet services,” in HPCA, 2014.

– T. Jiang, et-al, “Understanding the behavior of in-memory computing workloads,” in IISWC 2014

MotivationPerformance Characterization of In-Memory Data Analytics

on a Modern Cloud Server

*Source: SGI

MotivationCont...

Our Focus Our Focus

Improve the single node performancein scale-out configuration

*Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/

Phoenix ++,Metis, Ostrich,

Hadoop, Spark,Flink, etc..

Progress Meeting 12-12-14Which Scale-out Framework ?

[Picture Courtesy: Amir H. Payberah]

Our Approach

● A three fold analysis method at Application, Thread and Micro-architectural level

– Tuning of Spark internal Parameters

– Tuning of JVM Parameters (Heap size etc..)

– Concurrency Analysis

– General Architectural Exploration

Methodology

Our ApproachBenchmarks

3GB of Wikipedia raw datasets, Amazon Movies Reviews and numerical records have been used

Our Hardware Configuration

Machine Details

Hyper Threading and Turbo Boost is disabled

Hyper Threading and Turbo-boost are disabled

Our ApproachSystem Configuration

Multicore Scalability of SparkApplication Level Performance

Spark scales poorly in Scale-up configuration

Multicore Scalability of SparkStage Level Performance

● Shuffle Map Stages don't scale beyond 12 threads across different workloads

● No of concurrent files open in Map-side shuffling is C*R where C is no of threads in executor pool and R is no of reduce tasks

Multicore Scalability of SparkTask Level Performance

Percentage increase in Area Under the Curve compared to 1-thread

Is there thread level load imbalance ??

CPU Utilization is not scaling with performance

Is there any Work Time Inflation ??

How does Micro-architecture contribute to Work time inflation ??

Cont...

Is Memory Bandwidth a bottleneck??

Key Findings

● More than 12 threads in an executor pool does not yield significant performance

● Spark runtime system need to be improved to provide better load balancing and avoid work-time inflation.

● Work time inflation and load imbalance on the threads are the scalability bottlenecks.

● Removing the bottlenecks in the front-end of the processor would not remove more than 20% of stalls.

● Effort should be focused on removing the memory bound stalls since they account for up to 72% of stalls in the pipeline slots.

● Memory bandwidth of current processors is sufficient for in-memory data analytics

MotivationHow Data Volume Affects Spark Based Data Analytics on a

Scale-up Server

MotivationDo Spark based data analytics benefit from using larger

scale-up servers

MotivationIs GC detrimental to scalability of Spark applications?

MotivationHow does performance scale with data volume ?

MotivationDoes GC time scale linearly with Data Volume ??

MotivationHow does CPU utilization scale with data volume ?

MotivationIs File I/O detrimental to performance ?

MotivationHow does data size affects micro-architectural

performance ?

MotivationHow Data Volume Affects Spark Based Data Analytics on a

Scale-up Server

MotivationCont..

Key Findings

● Spark workloads do not benefit significantly from executors with more than 12 cores.

● The performance of Spark workloads degrades with large volumes of data due to substantial increase in garbage collection and file I/O time.

● With out any tuning, Parallel Scavenge garbage collection scheme outperforms Concurrent Mark Sweep and G1 garbage collectors for Spark workloads.

● Spark workloads exhibit improved instruction retirement due to lower L1 cache misses and better utilization of functional units inside cores at large volumes of data.

● Memory bandwidth utilization of Spark benchmarks decreases with large volumes of data and is 3x lower than the available off-chip bandwidth on our test machine

Our Approach

● A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade, “Performance charaterization of in-memory data analytics on a mordern cloud server,” in 5th International IEEE Conference on Big Data and Cloud Computing, 2015.

● A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade, “How Data Volume Affects Spark Based Data Analytics on a Scale-up Server”, in 6th International Workshop on Big Data Benchmarking, Performance Optimization and Emerging Hardware (BpoE) held in conjunction with VLDB, 2015.

What are the major bottlenecks??

MotivationFuture Directions

NUMA Aware Task Scheduling

Cache Aware Transformations

Exploiting Processing In Memory Architectures

HW/SW Data Prefectching

Rethinking Memory Architectures

Architecture Support for Big Data Analytics/A.Awan.pdf · Architecture Support for Big Data...

Documents

RSA: Security Analytics Architecture for APT

Juniper Secure Analytics Architecture and …...Title Juniper Secure Analytics Architecture and Deployment Guide Author Juniper Networks Created Date 20170913105844Z

IBM PureData System for Analytics Architecture - IBM Redbooks

Oracle Identity Analytics Architecture

A reference architecture for high performance analytics … · A reference architecture for high performance ... reference architecture for healthcare ... A reference architecture

SAS Insurance Analytics Architecture

Nexidia Technology and Architecture for Interaction Analytics

Software-defined Storage Architecture for Analytics ...vox.veritas.com/legacyfs/online/veritasdata/SDS Architecture For... · Software-defined Storage Architecture for Analytics Computing

Cloud Customer Architecture for Big Data and Analytics

Customer Cloud Architecture for Big Data Analytics · Customer Cloud Architecture for Big Data Analytics ... App & data protection; Security intelligence . 19 ... – Customer Cloud

ANALYTICS ARCHITECTURE FOR THE MODERN ......Analytics Architecture for the Modern Software Development Organization EXECUTIVE SUMMARY In many of today’s organizations, a disconnect

Building the Modern Analytics Architecture

Predictive Analytics, Choice Architecture and Bespoke Education

Privacy in Learning Analytics – Implications for System Architecture

DAMA - Innovations in DG Architecture and Analytics (online)

Analytics Architecture Guidelines - SAP NetWeaver7

COSMOS Data Analytics Architecture

The Chief Analytics Officers Guide to Getting Analytics Right · Self-Service: Democratizing Data to Empower New Users. 20 Planning Your Architecture Building an Analytics Architecture

Big Data & Analytics Architecture

Cloud Customer Architecture for Big Data and Analytics Version 2 · 2020-02-04 · Cloud Customer Architecture for Big Data and Analytics V2.0 . Executive Overview . Big data analytics