© 2015 IBM Corporation
State of Resource Management in Big Data: What it is and Why You Should Care
Khalid Ahmed, Senior Technical Staff Member (STSM), Architect, IBM Platform Computing, [email protected]
Yong Feng, Architect, IBM Platform Computing, [email protected]
Contents
1. Background
2. Resource Management Architectures
3. Comparisons: YARN, Mesos, Kubernetes
4. Use Cases
IBM Platform Computing: infrastructure software for high-performance applications
– Acquired by IBM in 2012
– 20 years managing distributed scale-out systems with 2000+ customers in many industries
– Market leading workload, resource and cluster management
– Unmatched scalability (small clusters to global grids) and enterprise production-proven reliability
– Heterogeneous environments – x86 and Power plus 3rd party systems, virtual and bare metal, accelerators / GPU, cloud, etc.
– Shared services for both compute and data intensive workloads
– 23 of the 30 largest commercial enterprises
– Over 5M CPUs under management
– 60% of the top financial services companies
Resource Management Terminology
Cluster Management
Resource Allocation
Distributed & Parallel Execution
Scheduling & Placement
Workload Management
Batch Queuing
History of Resource Management in Distributed Systems
1990s (High-performance Computing, Batch Queuing Systems, Message Passing Interface (MPI)): Platform LSF, Sun Grid Engine, NQS/DQS
2000s (P2P Computing, Parallel SOA, Big Data MR v1, Virtualization): VMware, United Devices, Apache Hadoop, DataSynapse, Globus, Platform Symphony
2010-2015 (Big Data MR v2, Cloud Computing, Virtualization): OpenStack, Apache YARN, Apache Mesos
2015+ (Containerization, Hyperconverged/Hyperscale, Hybrid Cloud, Data Center OS (DCOS)): Docker, Kubernetes, Swarm, Cloud Foundry
What problem are we trying to solve? Creating infrastructure silos to accommodate apps is inefficient.
Many new solution workloads, in addition to existing apps, lead to costly, complex, siloed, under-utilized infrastructure and replicated data. Example silos: overnight batch financial reporting; counterparty credit risk modeling; distributed ETL and sensitivity analysis; Hadoop-based sentiment analysis. Low utilization = higher cost.
Convergence of Compute & Data: Data-centric Architecture for High Performance
Old compute-centric model: data lives on disk and tape; data is moved to the CPU as needed; deep storage hierarchy.
New data-centric model: data lives in persistent storage/memory (flash, phase-change memory); many CPUs (manycore, FPGA) surround and use it; shallow/flat storage hierarchy; massive parallelism of data and computing.
Big Data and Exascale High Performance Computing are driving many similar computer systems requirements: Move the Compute to the Data!
Data Center OS: System Software for Hyperscale Datacenters
Layers, bottom-up:
• Virtual/physical hardware, with a node OS (device drivers) and a node agent on every node
• Distributed file/block/object system: persistent storage for applications and services, supporting multiple protocols
• Resource manager: aggregates and shares resources across multiple frameworks
• Remote execution & container management: manages the execution of containers (discovery, clustering, load balancing)
• Distributed services manager: manages the lifecycle of long-running services
• Patterns & REST API
Nodes become the resources managed by the Data Center OS. Specialized hardware (storage, network switches, routers) becomes software services on commodity hardware.
Resource Manager Architectures
What is expected from a Resource Manager

An open-source resource management solution manages the resources used by services on a shared infrastructure:
• Hortonworks, Cloudera, MapR: YARN
• Mesosphere, Twitter, eBay, Netflix: Mesos
• Google, Red Hat, CoreOS: Kubernetes
• Docker: Swarm

Expected capabilities: resource abstraction, workload placement, high availability, monitoring, membership management, workload provisioning and execution, scalability, troubleshooting, resource sharing and planning, security and isolation, performance, and service management.

1. Hide the details of resource management and failure handling so that users can focus on application development.
2. Operate with high availability and reliability, and support applications in doing the same.
3. Run workloads across tens of thousands of machines efficiently.

We need a common solution to manage the resources of large clusters (~10K machines) shared by multiple workloads:
• Sharing policies: tenant reservation, shares, isolation
• Placement policies: topology-driven affinity, anti-affinity, proximity, min/max/desired
• Execution: container and non-container
Figure: HPC, PaaS, data services, long-running, batch job, and other workloads run on top of Common Resource Management over Shared Infrastructure and Data.
Hadoop YARN
YARN is not the first general Resource Management platform. So what’s different? It’s data!
• Store all your data in one place … (HDFS)
• Interact with that data in multiple ways … (YARN Platform + Apps)
• Scale as you go, shared, multi-tenant, secure … (The Hadoop Stack)
YARN Architecture
Resource management framework: central
o Resource Manager (RM) controls resource allocation
o Application Master (AM) negotiates with the RM for resources and launches executors to run jobs
Resource allocation policies
o Policy plug-ins; currently supported: capacity scheduler, fair sharing
Framework integration
o Implement a client to launch the application through the RM
o Implement a driver for the application scheduler to communicate with the RM and Node Manager
o Make the framework executor available to YARN
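The request-based flow above can be sketched as a toy simulation. This is plain Python, not the real YARN API; class names, node names, and sizes are all illustrative. The point it shows is that the central RM holds the global view of free resources and makes the placement decision, while the AM only asks.

```python
# Toy model of YARN's request-based allocation (illustrative, not the real API).
class ResourceManager:
    def __init__(self, nodes):
        # Free memory (MB) per node -- the RM's global cluster view.
        self.free = dict(nodes)

    def allocate(self, mem_mb, count):
        """Grant up to `count` containers of `mem_mb` each, preferring
        the nodes with the most free memory (a simple placement policy)."""
        granted = []
        for node in sorted(self.free, key=self.free.get, reverse=True):
            while len(granted) < count and self.free[node] >= mem_mb:
                self.free[node] -= mem_mb
                granted.append((node, mem_mb))
        return granted

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm

    def run_job(self, tasks, mem_mb=1024):
        # The AM negotiates containers with the RM, then would launch
        # one executor per granted container.
        return self.rm.allocate(mem_mb, tasks)

rm = ResourceManager({"node1": 4096, "node2": 2048})
am = ApplicationMaster(rm)
print(am.run_job(tasks=3))  # three 1 GB containers, placed by the RM
```

Because the master sees every node at once, it can apply cluster-wide policies (capacity, fair share) at the moment of allocation; the trade-off is that all scheduling load flows through one component.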
Mesos in BDAS
Berkeley Data Analytics Stack (BDAS)
Mesos Programming Interface
Mesos Architecture
Resource management framework: hierarchical
o Mesos offers resources
o Framework schedulers accept or reject the offered resources
Resource allocation policies
o Pluggable allocation modules; currently supports fair sharing
o Resource allocation decisions are delegated to the allocation modules
o Resource preferences are communicated to Mesos through common APIs
Framework integration
o Modify the framework scheduler to communicate with the Mesos master through its API
o Make the framework executor binary available to Mesos
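The two-level offer model can be sketched the same way. Again this is an illustrative Python toy, not the real Mesos API; agent names and CPU counts are made up. The key contrast with the YARN sketch is that the master offers whatever is free without knowing anything about the workload, and the framework scheduler (the second level) decides what to accept.

```python
# Toy model of Mesos's two-level offer-based scheduling (illustrative).
class MesosMaster:
    def __init__(self, nodes):
        self.free = dict(nodes)          # free CPUs per agent node

    def make_offer(self):
        # Offer everything currently free -- no workload awareness.
        return dict(self.free)

    def launch(self, node, cpus):
        self.free[node] -= cpus

class FrameworkScheduler:
    """Second scheduling level: decides which offers to use."""
    def __init__(self, cpus_per_task):
        self.cpus_per_task = cpus_per_task

    def resource_offer(self, master, offer):
        accepted = []
        for node, cpus in offer.items():
            if cpus >= self.cpus_per_task:           # accept this offer
                master.launch(node, self.cpus_per_task)
                accepted.append(node)
            # else: the offer is implicitly declined
        return accepted

master = MesosMaster({"agent1": 4, "agent2": 1})
fw = FrameworkScheduler(cpus_per_task=2)
print(fw.resource_offer(master, master.make_offer()))  # ['agent1']
```

Here agent2's single CPU is declined because the framework needs two per task; in real Mesos that declined capacity goes back into the pool for other frameworks.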
Kubernetes basic concepts
• Only supports container-based applications/workloads
  – Currently only Docker and Rocket
• Pod: the smallest schedulable unit
  – All containers within a pod are placed on the same host and share the same (network) namespace
• Replication group: manages one or more pods
  – Uses pod labels to ensure that only the desired number of pods with specific labels are running at any time
  – Used for scale up/down, failure recovery, and rolling upgrades
• Services: find and load-balance between one or more pods
  – Use pod labels to define the endpoints of a service
  – Used to handle changes in IP address, host, number of pods, etc.
  – Service records are published via i) environment variables and ii) DNS service entries
• Namespaces: multi-tenancy support
  – Pods, services, and replication groups can be put into different namespaces to provide logical isolation for management purposes
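Label selection is the mechanism that both replication groups and services rely on, so it is worth making concrete. The sketch below is plain Python, not the Kubernetes API; pod names and labels are invented for illustration. It shows the matching rule: a selector matches a pod when every key/value pair in the selector appears in the pod's labels.

```python
# Toy illustration of Kubernetes label selectors (names are hypothetical).
pods = [
    {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "web-2", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "db-1",  "labels": {"app": "db",  "tier": "backend"}},
]

def select(pods, selector):
    """Return names of pods whose labels contain every key/value in `selector`."""
    return [p["name"] for p in pods
            if all(p["labels"].get(k) == v for k, v in selector.items())]

# A service with selector app=web load-balances across both web pods;
# a replication group uses the same match to count its running replicas.
print(select(pods, {"app": "web"}))       # ['web-1', 'web-2']
print(select(pods, {"tier": "backend"}))  # ['db-1']
```

Because membership is recomputed from labels rather than fixed lists, pods can come and go (failure recovery, scaling, rolling upgrades) without reconfiguring the service.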
Kubernetes architecture
K8s master: API server, scheduler, controller manager, and an etcd service holding cluster state; kubectl is the client.
Each K8s minion runs a kubelet, a proxy, and cAdvisor.
Many components are pluggable: schedulers, container runtime, persistent data store, cloud providers, etc.
Comparison of Open-source Resource Managers
Offer vs Request

Offer model: framework schedulers (one per job type) sit above a master that holds cluster state. Protocol: (1) partition resources among frameworks; (2) offer; (3) schedule; (4) accept/decline offers; (5) revoke offer. The master has no knowledge of workloads, and workloads have only a partial view of the system. Issue: offers are computed without any workload awareness and may be unsuitable for a workload. Possible solution: optimistic offers.

Request model (jobs of type A: short, small; jobs of type B: long running): Protocol: (1) request; (2) allocate resources based on workload priorities and requirements, and return the allocation; (3) schedule; (4) schedule small, short-lived tasks; (5) reclaim resources; (6) return resources. The master knows the entire state and a coarse-grained definition of the workloads; workloads have a partial view, but one selected based on the workload specification. Issues: a more complex protocol, and the master takes on some properties of a monolithic scheduler. Possible solution: multi-level scheduler.
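The trade-off between the two models can be shown in a few lines. This is a deliberately tiny Python illustration with made-up node names, not any real scheduler: the offer-based master hands out nodes without knowing the framework wants data locality, so unsuitable offers get declined before the right one arrives, while a request-based master can satisfy the stated constraint directly.

```python
# Toy contrast of offer-based vs request-based placement (illustrative).
def offer_model(free_nodes, data_node):
    """Master offers nodes one at a time, unaware of the locality need."""
    declined = 0
    for node in free_nodes:
        if node == data_node:            # framework finally gets a good offer
            return node, declined
        declined += 1                    # unsuitable offer: decline and wait
    return None, declined

def request_model(free_nodes, data_node):
    """Framework states its constraint; master uses its global state."""
    return data_node if data_node in free_nodes else None

nodes = ["n1", "n2", "n3"]
print(offer_model(nodes, "n3"))    # ('n3', 2): two offers declined first
print(request_model(nodes, "n3"))  # 'n3': placed on the first exchange
```

The declined offers are the cost of keeping the master workload-agnostic; optimistic offers and workload-aware requests are two ways of paying it down.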
Comparison: Mesos vs YARN vs Kubernetes
(Here Mesos = Mesos + Marathon and YARN = YARN + Slider. Rating scale in the original chart: complete / many features / some features / a few features / none.)
• Container support: YARN is planning to support Docker. Mesos supports both Docker and its own unified container. Kubernetes only supports containers as its execution facility.
• Placement policies: YARN focuses more on affinity. Marathon supports several placement constraints and policies. Kubernetes borrows some placement policies from Marathon and supports its own specific placement constraints.
• Resource sharing: YARN has good support for resource sharing (priority, preemption, fair share); Mesos does not support priority, and its preemption is weak; Kubernetes only supports quota.
• Service management: Marathon and Kubernetes both support service lifecycle management. Slider is still in incubation.
• Maturity: YARN has the longest development history and probably the most deployments; Mesos and Kubernetes are relatively new.
Spark on YARN
Cluster mode: spark-submit --master yarn-cluster --class MYCLASS MYJAR
Client mode: spark-submit --master yarn-client --class MYCLASS MYJAR
Spark on Mesos
Coarse-grain mode: conf.set("spark.mesos.coarse", "true")
Fine-grain mode: conf.set("spark.mesos.coarse", "false")
Spark on YARN vs Spark on Mesos
Spark on YARN
o Coarse grain
o Fixed size for each Spark executor; resources can be wasted if there are not enough tasks in an executor
o Leverages YARN's data-aware scheduling
Spark on Mesos (coarse-grain mode)
o Coarse grain
o Cannot launch multiple executors on the same host (fixed in Spark 2.0.0 by SPARK-5095), so newly offered resources on that host cannot be used, and a single big executor cannot fully use large memory due to JVM GC issues
o Spark schedules tasks by data affinity within the offer
Spark on Mesos (fine-grain mode)
o Fine grain
o Extra overhead when launching tasks
o Resources may not be rescheduled in time after a task finishes because of the Mesos scheduling interval
USE CASES
Applications in Financial Services

Workloads range from data-intensive to compute-intensive, fed by exchange/ECN data feeds and diverse sources of structured/unstructured big data (RDBMS, DFS such as HDFS and GPFS, in-memory caches, etc.):
• Real-time: streams, FPGA-based applications near market feeds
• Near real-time: analytic tasks that are often time-critical, supporting trading desks ("real-time" risk applications)
• Batch: long-running jobs

Example applications: algorithmic trading / HFT / "black-box" / "robo-trading"; program trading; arbitrage; trend following; exotics and derivative pricing; sentiment analysis; counterparty risk and CVA; deeper counterparty modeling; CRM; anti-money laundering (AML); "real-time" market risk; pre-trade and post-trade analytics; credit scoring; ETL; incremental modeling; fraud detection; forex; mining of unstructured data; sensitivity analysis; model backtesting; regulatory reporting; actuarial analysis; CEP; protocol conversion; variable annuities; FX/IR/equities; VaR; ALM; mortgage analytics; strategy and data mining; predictive analytics; optimization; trade surveillance; portfolio stress testing; P&L analysis; document processing; non-structured data query; check processing; image analytics.
Customer Example – Multi-tenancy of cloud native workload at a major bank
Example - Genome Sequencing
All the DNA contained in a living cell makes up the genome. The alphabet of the genome contains only four letters: A, C, G, and T. Just as a book uses words and letters to tell a story, these letters in the genome encode genes that carry out all cellular functions. Genomics is the study of the DNA sequence and the meaning of these letters in the genome (e.g. genes and mutations), so that scientists can precisely tell the story of life.

Next-generation sequencing pipeline for faster results (complex workflows and dependencies):
FASTQ → BWA (map to reference) → SAM/BAM → Samtools/Picard and ADAM on Spark (mark duplicates & sort) → realignment & recalibration → recalibrated BAM → GATK/MuTect (variant analysis) → VCF ("your life story")
Challenges – Genome Sequencing
• Poor resource utilization: peaks and valleys across different workloads
• How to orchestrate multi-phase workflows among many collaborating apps, with distributed workloads, sub-flows, and parallel flows, across diverse infrastructure
• Lack of reliable parallelism in workflows due to the variety of workload types and resource needs; move to job arrays, MPI/MPI2, distributed messaging and cache, MapReduce, or Spark frameworks?
• Data, app, and resource silos causing inefficiencies in data movement, app integration, and resource sharing
Figure: four siloed clusters, each with its own workload manager and its own set of resources (Resource 1..8): a MapReduce app (Job 1..N) on HDFS, an SOA app (App 1..N) on NFS, a batch app (Job 1..N) on POSIX storage, and Spark apps (App 1..N) on object storage.
A Life Science App Workflow with Hybrid Workloads – Genome Sequencing
• Genome Analysis Toolkit (GATK): a widely adopted genomics workflow from the Broad Institute
• ADAM: genomics formats and processing patterns for cloud-scale computing, UC Berkeley
• GATK workflow pipeline optimized using ADAM on Spark (parallelizes the mark-duplicate and sort processing)
• Results are replicated to remote sites (Site A, Site B, Site C) and shared worldwide immediately
Platform Computing is Part of IBM Software Defined Infrastructure
IBM Platform Computing/DCOS provides software-defined compute: LSF for high-performance computing (batch, serial, MPI, workflow); Symphony for high-performance analytics (low-latency parallel); Symphony MapReduce for Hadoop/big data; and the Application Service Controller for application frameworks (long-running services). Example applications and application frameworks include homegrown and traditional commercial applications. Spectrum Scale provides software-defined storage. The physical infrastructure can be on-premises, on-cloud, or hybrid: bare metal, hypervisor, x86, Linux on z. Software-defined infrastructure management is provided by IBM Platform Cluster Manager (bare-metal provisioning), IBM Cloud Manager with OpenStack (virtual machine provisioning), and the IBM Platform Computing Cloud Service (SoftLayer APIs and services), alongside other compute management software.
Resource Management Community Activities
• Active development with the Mesos community: 11 IBM developers
• 100+ JIRAs delivered or in progress
• Leading several work streams: POWER support, optimistic offers, container support, Swarm and Kubernetes integration
• YARN plug-in to Platform Symphony
• Technical preview of Mesos with IBM value-add (ASC) on Docker Hub, with both x86 and POWER images
For more information: ibm.com/systems