© 2015 IBM Corporation
State of Resource Management in Big Data: What it is and Why You Should Care
Khalid Ahmed, Senior Technical Staff Member (STSM), Architect, IBM Platform Computing, [email protected]
Yong Feng, Architect, IBM Platform Computing, [email protected]
Contents
1. Background
2. Resource Management Architectures
3. Comparisons: YARN, Mesos, Kubernetes
4. Use Cases
IBM Platform Computing: infrastructure software for high-performance applications
– Acquired by IBM in 2012
– 20 years managing distributed scale-out systems with 2000+ customers in many industries
– Market leading workload, resource and cluster management
– Unmatched scalability (small clusters to global grids) and enterprise production-proven reliability
– Heterogeneous environments – x86 and Power plus 3rd party systems, virtual and bare metal, accelerators / GPU, cloud, etc.
– Shared services for both compute and data intensive workloads
– 23 of the 30 largest commercial enterprises
– Over 5M CPUs under management
– 60% of the top financial services companies
Resource Management Terminology
Cluster Management
Resource Allocation
Distributed & Parallel Execution
Scheduling & Placement
Workload Management
Batch Queuing
History of Resource Management in Distributed Systems
1990s (High-performance Computing, Batch Queuing Systems, Message Passing Interface (MPI)): Platform LSF, Sun Grid Engine, NQS/DQS
2000s (P2P Computing, Parallel SOA, Big Data MR v1, Virtualization): VMware, United Devices, Apache Hadoop, DataSynapse, Globus, Platform Symphony
2010-2015 (Big Data MR v2, Cloud Computing, Virtualization): OpenStack, Apache YARN, Apache Mesos
2015+ (Containerization, Hyperconverged/Hyperscale, Hybrid Cloud, Data Center OS (DCOS)): Docker, Kubernetes, Swarm, Cloud Foundry
What problem are we trying to solve? Creating infrastructure silos to accommodate apps is inefficient.
Many new solution workloads, in addition to existing apps, lead to costly, complex, siloed, under-utilized infrastructure and replicated data. Example silos: overnight batch financial reporting; counterparty credit risk modeling; distributed ETL and sensitivity analysis; Hadoop-based sentiment analysis. Low utilization = higher cost.
Convergence of Compute & Data: Data-centric Architecture for High Performance
Old compute-centric model: data lives on disk and tape; data is moved to the CPU as needed; deep storage hierarchy.
New data-centric model: data lives in persistent storage/memory (flash, phase-change memory); many CPUs (manycore, FPGA) surround and use it; shallow/flat storage hierarchy; massive parallelism of data and computing.
Big Data and Exascale High Performance Computing are driving many similar computer systems requirements: Move the Compute to the Data!
Data Center OS: System Software for Hyperscale Datacenters
Layers, bottom-up:
• Virtual/physical hardware, with a node OS (device drivers) and a node agent on every node
• Distributed file/block/object system: persistent storage for applications and services, supporting multiple protocols
• Resource manager: aggregates and shares resources across multiple frameworks
• Remote execution & container management: manages the execution of containers (discovery, clustering, load balancing)
• Distributed services manager: manages the lifecycle of long-running services
• Patterns & REST API
Nodes become the resources managed by the Data Center OS. Specialized hardware (storage, network switches, routers) becomes software services on commodity hardware.
Resource Manager Architectures
What is expected from a Resource Manager

An open-source resource management solution manages the resources used by services on a shared infrastructure:
• Hortonworks, Cloudera, MapR: YARN
• Mesosphere, Twitter, eBay, Netflix: Mesos
• Google, Red Hat, CoreOS: Kubernetes
• Docker: Swarm

Expected capabilities: resource abstraction, workload placement, high availability, monitoring, membership management, workload provisioning and execution, scalability, troubleshooting, resource sharing and planning, security and isolation, performance, and service management.

1. Hide the details of resource management and failure handling so that users can focus on application development.
2. Operate with high availability and reliability, and support applications in doing the same.
3. Run workloads across tens of thousands of machines efficiently.

We need a common solution to manage the resources of large clusters (~10K machines) shared by multiple workloads:
• Sharing policies: tenant reservation, shares, isolation
• Placement policies: topology-driven affinity, anti-affinity, proximity, min/max/desired
• Execution: container and non-container
Figure: HPC, PaaS, data services, long-running, batch job, and other workloads run on top of Common Resource Management over Shared Infrastructure and Data.
Hadoop YARN
YARN is not the first general Resource Management platform. So what’s different? It’s data!
• Store all your data in one place … (HDFS)
• Interact with that data in multiple ways … (YARN Platform + Apps)
• Scale as you go, shared, multi-tenant, secure … (The Hadoop Stack)
YARN Architecture
Resource management framework: central
o Resource Manager (RM) controls resource allocation
o Application Master (AM) negotiates with the RM for resources and launches executors to run jobs
Resource allocation policies
o Policy plug-ins; currently supported: capacity scheduler, fair sharing
Framework integration
o Implement a client to launch the application through the RM
o Implement a driver for the application scheduler to communicate with the RM and Node Manager
o Make the framework executor available to YARN
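The request-based flow above can be sketched as a toy simulation. This is plain Python, not the real YARN API; class names, node names, and sizes are all illustrative. The point it shows is that the central RM holds the global view of free resources and makes the placement decision, while the AM only asks.

```python
# Toy model of YARN's request-based allocation (illustrative, not the real API).
class ResourceManager:
    def __init__(self, nodes):
        # Free memory (MB) per node -- the RM's global cluster view.
        self.free = dict(nodes)

    def allocate(self, mem_mb, count):
        """Grant up to `count` containers of `mem_mb` each, preferring
        the nodes with the most free memory (a simple placement policy)."""
        granted = []
        for node in sorted(self.free, key=self.free.get, reverse=True):
            while len(granted) < count and self.free[node] >= mem_mb:
                self.free[node] -= mem_mb
                granted.append((node, mem_mb))
        return granted

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm

    def run_job(self, tasks, mem_mb=1024):
        # The AM negotiates containers with the RM, then would launch
        # one executor per granted container.
        return self.rm.allocate(mem_mb, tasks)

rm = ResourceManager({"node1": 4096, "node2": 2048})
am = ApplicationMaster(rm)
print(am.run_job(tasks=3))  # three 1 GB containers, placed by the RM
```

Because the master sees every node at once, it can apply cluster-wide policies (capacity, fair share) at the moment of allocation; the trade-off is that all scheduling load flows through one component.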
Mesos in BDAS
Berkeley Data Analytics Stack (BDAS)
Mesos Programming Interface
Mesos Architecture
Resource management framework: hierarchical
o Mesos offers resources
o Framework schedulers accept or reject the offered resources
Resource allocation policies
o Pluggable allocation modules; currently supports fair sharing
o Resource allocation decisions are delegated to the allocation modules
o Resource preferences are communicated to Mesos through common APIs
Framework integration
o Modify the framework scheduler to communicate with the Mesos master through its API
o Make the framework executor binary available to Mesos
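The two-level offer model can be sketched the same way. Again this is an illustrative Python toy, not the real Mesos API; agent names and CPU counts are made up. The key contrast with the YARN sketch is that the master offers whatever is free without knowing anything about the workload, and the framework scheduler (the second level) decides what to accept.

```python
# Toy model of Mesos's two-level offer-based scheduling (illustrative).
class MesosMaster:
    def __init__(self, nodes):
        self.free = dict(nodes)          # free CPUs per agent node

    def make_offer(self):
        # Offer everything currently free -- no workload awareness.
        return dict(self.free)

    def launch(self, node, cpus):
        self.free[node] -= cpus

class FrameworkScheduler:
    """Second scheduling level: decides which offers to use."""
    def __init__(self, cpus_per_task):
        self.cpus_per_task = cpus_per_task

    def resource_offer(self, master, offer):
        accepted = []
        for node, cpus in offer.items():
            if cpus >= self.cpus_per_task:           # accept this offer
                master.launch(node, self.cpus_per_task)
                accepted.append(node)
            # else: the offer is implicitly declined
        return accepted

master = MesosMaster({"agent1": 4, "agent2": 1})
fw = FrameworkScheduler(cpus_per_task=2)
print(fw.resource_offer(master, master.make_offer()))  # ['agent1']
```

Here agent2's single CPU is declined because the framework needs two per task; in real Mesos that declined capacity goes back into the pool for other frameworks.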
Kubernetes basic concepts
• Only supports container-based applications/workloads
  – Currently only Docker and Rocket
• Pod: the smallest schedulable unit
  – All containers within a pod are placed on the same host and share the same (network) namespace
• Replication group: manages one or more pods
  – Uses pod labels to ensure that only the desired number of pods with specific labels are running at any time
  – Used for scale up/down, failure recovery, and rolling upgrades
• Services: find and load-balance between one or more pods
  – Use pod labels to define the endpoints of a service
  – Used to handle changes in IP address, host, number of pods, etc.
  – Service records are published via i) environment variables and ii) DNS service entries
• Namespaces: multi-tenancy support
  – Pods, services, and replication groups can be put into different namespaces to provide logical isolation for management purposes
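Label selection is the mechanism that both replication groups and services rely on, so it is worth making concrete. The sketch below is plain Python, not the Kubernetes API; pod names and labels are invented for illustration. It shows the matching rule: a selector matches a pod when every key/value pair in the selector appears in the pod's labels.

```python
# Toy illustration of Kubernetes label selectors (names are hypothetical).
pods = [
    {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "web-2", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "db-1",  "labels": {"app": "db",  "tier": "backend"}},
]

def select(pods, selector):
    """Return names of pods whose labels contain every key/value in `selector`."""
    return [p["name"] for p in pods
            if all(p["labels"].get(k) == v for k, v in selector.items())]

# A service with selector app=web load-balances across both web pods;
# a replication group uses the same match to count its running replicas.
print(select(pods, {"app": "web"}))       # ['web-1', 'web-2']
print(select(pods, {"tier": "backend"}))  # ['db-1']
```

Because membership is recomputed from labels rather than fixed lists, pods can come and go (failure recovery, scaling, rolling upgrades) without reconfiguring the service.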
Kubernetes architecture
K8s master: API server, scheduler, controller manager, and an etcd service holding cluster state; kubectl is the client.
Each K8s minion runs a kubelet, a proxy, and cAdvisor.
Many components are pluggable: schedulers, container runtime, persistent data store, cloud providers, etc.
Comparison of Open-source Resource Managers
Offer vs Request

Offer model: framework schedulers (one per job type) sit above a master that holds cluster state. Protocol: (1) partition resources among frameworks; (2) offer; (3) schedule; (4) accept/decline offers; (5) revoke offer. The master has no knowledge of workloads, and workloads have only a partial view of the system. Issue: offers are computed without any workload awareness and may be unsuitable for a workload. Possible solution: optimistic offers.

Request model (jobs of type A: short, small; jobs of type B: long running): Protocol: (1) request; (2) allocate resources based on workload priorities and requirements, and return the allocation; (3) schedule; (4) schedule small, short-lived tasks; (5) reclaim resources; (6) return resources. The master knows the entire state and a coarse-grained definition of the workloads; workloads have a partial view, but one selected based on the workload specification. Issues: a more complex protocol, and the master takes on some properties of a monolithic scheduler. Possible solution: multi-level scheduler.
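The trade-off between the two models can be shown in a few lines. This is a deliberately tiny Python illustration with made-up node names, not any real scheduler: the offer-based master hands out nodes without knowing the framework wants data locality, so unsuitable offers get declined before the right one arrives, while a request-based master can satisfy the stated constraint directly.

```python
# Toy contrast of offer-based vs request-based placement (illustrative).
def offer_model(free_nodes, data_node):
    """Master offers nodes one at a time, unaware of the locality need."""
    declined = 0
    for node in free_nodes:
        if node == data_node:            # framework finally gets a good offer
            return node, declined
        declined += 1                    # unsuitable offer: decline and wait
    return None, declined

def request_model(free_nodes, data_node):
    """Framework states its constraint; master uses its global state."""
    return data_node if data_node in free_nodes else None

nodes = ["n1", "n2", "n3"]
print(offer_model(nodes, "n3"))    # ('n3', 2): two offers declined first
print(request_model(nodes, "n3"))  # 'n3': placed on the first exchange
```

The declined offers are the cost of keeping the master workload-agnostic; optimistic offers and workload-aware requests are two ways of paying it down.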
Comparison: Mesos vs YARN vs Kubernetes
(Here Mesos = Mesos + Marathon and YARN = YARN + Slider. Rating scale in the original chart: complete / many features / some features / a few features / none.)
• Container support: YARN is planning to support Docker. Mesos supports both Docker and its own unified container. Kubernetes only supports containers as its execution facility.
• Placement policies: YARN focuses more on affinity. Marathon supports several placement constraints and policies. Kubernetes borrows some placement policies from Marathon and supports its own specific placement constraints.
• Resource sharing: YARN has good support for resource sharing (priority, preemption, fair share); Mesos does not support priority, and its preemption is weak; Kubernetes only supports quota.
• Service management: Marathon and Kubernetes both support service lifecycle management. Slider is still in incubation.
• Maturity: YARN has the longest development history and probably the most deployments; Mesos and Kubernetes are relatively new.
Spark on YARN
Cluster mode: spark-submit --master yarn-cluster --class MYCLASS MYJAR
Client mode: spark-submit --master yarn-client --class MYCLASS MYJAR
Spark on Mesos
Coarse-grain mode: conf.set("spark.mesos.coarse", "true")
Fine-grain mode: conf.set("spark.mesos.coarse", "false")
Spark on YARN vs Spark on Mesos
Spark on YARN
o Coarse grain
o Fixed size for each Spark executor; resources can be wasted if there are not enough tasks in an executor
o Leverages YARN's data-aware scheduling
Spark on Mesos (coarse-grain mode)
o Coarse grain
o Cannot launch multiple executors on the same host (fixed in Spark 2.0.0 by SPARK-5095), so newly offered resources on that host cannot be used, and a single big executor cannot fully use large memory due to JVM GC issues
o Spark schedules tasks by data affinity within the offer
Spark on Mesos (fine-grain mode)
o Fine grain
o Extra overhead when launching tasks
o Resources may not be rescheduled in time after a task finishes because of the Mesos scheduling interval
USE CASES
Applications in Financial Services

Workloads range from data-intensive to compute-intensive, fed by exchange/ECN data feeds and diverse sources of structured/unstructured big data (RDBMS, DFS such as HDFS and GPFS, in-memory caches, etc.):
• Real-time: streams, FPGA-based applications near market feeds
• Near real-time: analytic tasks that are often time-critical, supporting trading desks ("real-time" risk applications)
• Batch: long-running jobs

Example applications: algorithmic trading / HFT / "black-box" / "robo-trading"; program trading; arbitrage; trend following; exotics and derivative pricing; sentiment analysis; counterparty risk and CVA; deeper counterparty modeling; CRM; anti-money laundering (AML); "real-time" market risk; pre-trade and post-trade analytics; credit scoring; ETL; incremental modeling; fraud detection; forex; mining of unstructured data; sensitivity analysis; model backtesting; regulatory reporting; actuarial analysis; CEP; protocol conversion; variable annuities; FX/IR/equities; VaR; ALM; mortgage analytics; strategy and data mining; predictive analytics; optimization; trade surveillance; portfolio stress testing; P&L analysis; document processing; non-structured data query; check processing; image analytics.
Customer Example – Multi-tenancy of cloud native workload at a major bank
Example - Genome Sequencing
All the DNA contained in a living cell makes up the genome. The alphabet of the genome contains only four letters: A, C, G, and T. Just as a book uses words and letters to tell a story, these letters in the genome encode genes that carry out all cellular functions. Genomics is the study of the DNA sequence and the meaning of these letters in the genome (e.g. genes and mutations), so that scientists can precisely tell the story of life.

Next-generation sequencing pipeline for faster results (complex workflows and dependencies):
FASTQ → BWA (map to reference) → SAM/BAM → Samtools/Picard and ADAM on Spark (mark duplicates & sort) → realignment & recalibration → recalibrated BAM → GATK/MuTect (variant analysis) → VCF ("your life story")
Challenges – Genome Sequencing
• Poor resource utilization: peaks and valleys across different workloads
• How to orchestrate multi-phase workflows among many collaborating apps, with distributed workloads, sub-flows, and parallel flows, across diverse infrastructure
• Lack of reliable parallelism in workflows due to the variety of workload types and resource needs; move to job arrays, MPI/MPI2, distributed messaging and cache, MapReduce, or Spark frameworks?
• Data, app, and resource silos causing inefficiencies in data movement, app integration, and resource sharing
Figure: four siloed clusters, each with its own workload manager and its own set of resources (Resource 1..8): a MapReduce app (Job 1..N) on HDFS, an SOA app (App 1..N) on NFS, a batch app (Job 1..N) on POSIX storage, and Spark apps (App 1..N) on object storage.
A Life Science App Workflow with Hybrid Workloads – Genome Sequencing
• Genome Analysis Toolkit (GATK): a widely adopted genomics workflow from the Broad Institute
• ADAM: genomics formats and processing patterns for cloud-scale computing, UC Berkeley
• GATK workflow pipeline optimized using ADAM on Spark (parallelizes the mark-duplicate and sort processing)
• Results are replicated to remote sites (Site A, Site B, Site C) and shared worldwide immediately
Platform Computing is Part of IBM Software Defined Infrastructure
IBM Platform Computing/DCOS provides software-defined compute: LSF for high-performance computing (batch, serial, MPI, workflow); Symphony for high-performance analytics (low-latency parallel); Symphony MapReduce for Hadoop/big data; and the Application Service Controller for application frameworks (long-running services). Example applications and application frameworks include homegrown and traditional commercial applications. Spectrum Scale provides software-defined storage. The physical infrastructure can be on-premises, on-cloud, or hybrid: bare metal, hypervisor, x86, Linux on z. Software-defined infrastructure management is provided by IBM Platform Cluster Manager (bare-metal provisioning), IBM Cloud Manager with OpenStack (virtual machine provisioning), and the IBM Platform Computing Cloud Service (SoftLayer APIs and services), alongside other compute management software.
Resource Management Community Activities
• Active development with the Mesos community: 11 IBM developers
• 100+ JIRAs delivered or in progress
• Leading several work streams: POWER support, optimistic offers, container support, Swarm and Kubernetes integration
• YARN plug-in to Platform Symphony
• Technical preview of Mesos with IBM value-add (ASC) on Docker Hub, with both x86 and POWER images
For more information: ibm.com/systems