Benoit Hudzia Sr. Researcher; SAP Research CEC Belfast (UK)
Aidan Shribman Sr. Researcher; SAP Research Israel
SAP Virtualization Week 2012: TRND04
SAP DKOM 2012: NA 6747
The Lego Cloud
Agenda
Introduction
Hardware Trends
Live Migration
Flash Cloning
Memory Pooling
Distributed Shared Memory
Summary
Introduction: The evolution of the datacenter
Evolution of Virtualization
No virtualization → Basic consolidation → Flexible resource management (Cloud) → Resource disaggregation (True Utility Cloud)
Why Disaggregate Resources?
Better Performance
Replacing slow local devices (e.g. disk) with fast remote devices (e.g. DRAM).
Many remote devices working in parallel (e.g. DRAM, disk, compute)
Superior Scalability
Going beyond boundaries of the single node
Improved Economics
Do more with existing hardware
Reach better hardware utilization levels
The Hecatonchire Project
Hecatonchires in Greek mythology means “Hundred-Handed Ones”. The original idea: bring Distributed Shared Memory (DSM) capabilities to the cloud.
Strategic goal: full resource liberation brought to the cloud by:
Breaking down physical nodes into their core elements (CPU, memory, I/O)
Extending the existing cloud software stack (KVM, QEMU, libvirt, OpenStack) without degrading any existing capabilities
Using commodity cloud hardware and standard interconnects
Initiated by Benoit Hudzia in 2011. Currently developed by two
SAP Research TI Practice teams located in Belfast and Ra’anana
Hecatonchire is not a monolithic project but a set of separate capabilities. We are currently identifying stakeholders and defining use cases for each capability.
Hecatonchire Architecture
Cluster Servers
Commodity hosts (e.g. 64 GB, 16 cores)
Commodity network adapters:
– Standard: SoftiWARP over 1 GbE
– Enterprise: RoCE/iWARP over 10 GbE, or native IB
A modified QEMU/KVM hypervisor
An RDMA remote-memory kernel module
Guests / VMs
Use resources from one or several underlying hosts
Existing OSs/applications can run transparently
– Not exactly … but we will get to this later
[Diagram: Server #1 … Server #n, each contributing CPUs, memory, and I/O; guest VMs (app / OS / virtual H/W) consume these resources across hosts over fast RDMA communication]
The Team - Panoramic View
Hardware Trends: The blurring of physical host boundaries
DRAM Latency Has Remained Constant
CPU clock speed and memory bandwidth have increased steadily while memory latency has remained constant.
As a result, local memory appears slower from the CPU's perspective.
Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010
CPU Cores Stopped Getting Faster
Moore’s law prevailed until 2005, when cores hit a practical limit of about 3.4 GHz.
The “single-threaded free lunch” (as coined by Herb Sutter) is over.
So CPU cores have stopped getting faster – but you do get more cores now.
Source: http://www.intel.com/pressroom/kits/quickrefyr.htm
Source: “The Free Lunch Is Over…” by Herb Sutter
But … Interconnects Continue to Evolve
(providing higher bandwidth and lower latency)
Result: Remote Nodes Are Becoming “Closer”
Accessing DRAM on a remote host via IB interconnects is only 20x slower than local DRAM.
Remote DRAM is 100x or 5000x faster than local SSD or HDD devices respectively.
Source: HANA Performance Analysis, Intel Westmere (formerly Nehalem-C) and IB QDR, Chaim Bendelac, 2011
Result: Blurring the Boundaries of the Physical Host
[Latency table: disk ~10,000,000 ns whether local or remote; local DRAM 60–100 ns and local cache 15–80 ns, vs. ~2,000 ns when memory is accessed on a remote host over RDMA]
Live Migration: Serving as a platform to evaluate remote page faulting
Enabling Live Migration of SAP Workloads
Business Problem
Typical SAP workloads such as SAP ERP are transactional, large, and have a high rate of memory writes.
Classic live migration fails for such workloads, as rapid memory writes cause memory pages to be re-sent over and over again.
Hecatonchire’s Solution
Enable live migration by reducing both the number of pages re-sent and the cost of a page re-send.
Across-the-board improvement of live migration metrics:
– Downtime: reduced
– Service degradation: reduced
– Total migration time: reduced
Classic Pre-Copy Live Migration
Pre-migration: VM active on host A; destination host selected (block devices mirrored)
Reservation: initialize container on target host
Iterative pre-copy: copy dirty pages in successive rounds
Stop and copy: suspend VM on host A; redirect network traffic; synchronize remaining state; activate on host B
Commitment: VM state on host A released
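To make the iterative pre-copy phase concrete, here is a toy C sketch of the loop; the page counts, dirty-log behaviour, and thresholds are all invented for illustration. A guest that re-dirties pages faster than this model assumes never converges, which is exactly the failure mode for write-heavy SAP workloads:

```c
#include <stdbool.h>
#include <stdio.h>

#define NPAGES 1024
#define MAX_ROUNDS 30
#define STOP_THRESHOLD 16   /* few enough pages to copy during downtime */

static bool dirty[NPAGES];

/* Stand-in for the actual network transfer of one guest page. */
static void send_page(int i) { (void)i; }

/* Stand-in for the hypervisor's dirty log: marks pages the guest wrote
 * since the last round (here: a fake, steadily shrinking working set). */
static int collect_dirty_log(int round)
{
    int n = 0;
    for (int i = 0; i < NPAGES >> round; i++) { dirty[i] = true; n++; }
    return n;
}

int main(void)
{
    int remaining = collect_dirty_log(0);  /* round 0: all of RAM is "dirty" */

    for (int round = 1; round <= MAX_ROUNDS; round++) {
        for (int i = 0; i < NPAGES; i++)
            if (dirty[i]) { send_page(i); dirty[i] = false; }
        remaining = collect_dirty_log(round);   /* guest kept writing */
        printf("round %d: %d pages re-dirtied\n", round, remaining);
        if (remaining <= STOP_THRESHOLD)
            break;   /* suspend on A, copy the rest, activate on B */
    }
    return 0;
}
```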
Hecatonchire Pre-copy Live Migration
Reducing the number of page re-sends
Page LRU reordering, such that pages with a low chance of being re-dirtied are sent first
Contribution to QEMU planned for 2012
Reducing the cost of a page re-send
Using the XBZRLE delta encoder, page changes can be represented much more compactly
Contributed to QEMU during 2011
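The intuition behind XBZRLE: after the first send, most bytes of a re-dirtied page are unchanged, so a delta against the cached previous copy of the page carries only the changed runs. A simplified sketch follows; the 16-bit run headers are illustrative, not QEMU's actual XBZRLE wire format:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Encode new_page as a delta against old_page: alternating runs of
 * unchanged bytes (skipped) and changed bytes (copied). Returns bytes
 * used, or -1 if the delta would not fit (then send the full page). */
static int delta_encode(const uint8_t *old_page, const uint8_t *new_page,
                        int page_len, uint8_t *out, int out_len)
{
    int i = 0, used = 0;
    while (i < page_len) {
        int zrun = 0, nzrun = 0;
        while (i + zrun < page_len && old_page[i + zrun] == new_page[i + zrun])
            zrun++;                      /* unchanged: XOR would be zero */
        i += zrun;
        while (i + nzrun < page_len && old_page[i + nzrun] != new_page[i + nzrun])
            nzrun++;                     /* changed: these bytes must travel */
        if (used + 4 + nzrun > out_len)
            return -1;
        out[used++] = (uint8_t)(zrun & 0xff);   /* illustrative headers */
        out[used++] = (uint8_t)(zrun >> 8);
        out[used++] = (uint8_t)(nzrun & 0xff);
        out[used++] = (uint8_t)(nzrun >> 8);
        memcpy(out + used, new_page + i, nzrun);
        used += nzrun;
        i += nzrun;
    }
    return used;
}

int main(void)
{
    static uint8_t oldp[4096], newp[4096], out[4096];
    memcpy(newp, oldp, sizeof newp);
    newp[100] = 7; newp[101] = 9;        /* the guest dirtied two bytes */
    int n = delta_encode(oldp, newp, 4096, out, sizeof out);
    printf("page re-send shrunk from 4096 to %d bytes\n", n);
    return 0;
}
```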
More Than One Way to Live Migrate…
[Timeline diagram comparing three schemes; each begins with pre-migrate and reservation and ends with a commit:]
Pre-copy live migration: iterative pre-copy (X rounds), then stop and copy. Downtime covers the final copy; the VM is live on A, then live on B.
Post-copy live migration: stop and copy a minimal state, then page pushing (1 round). Downtime is short; the VM runs degraded on B while pages arrive.
Hybrid post-copy live migration: one pre-copy round, then stop and copy, then page pushing (1 round). Downtime is short; the warm-up round shortens the degraded phase on B.
Hecatonchire Post-copy Live Migration
In post-copy live migration we reverse the order:
1. Transfer of state: transfer the VM running state from A to B and immediately activate the VM on B
2. Transfer of memory: B can initiate a network-bound page fault handled by A; in the background, memory is actively pushed from A to B until completion
Post-copy has some unique advantages
Downtime is minimal, as only a few MBs of a GB-sized VM need to be transferred before re-activation
Total migration time is minimal and predictable
Hecatonchire’s unique enhancements
Low-latency RDMA page-transfer protocol
Demand pre-paging (pre-fetching) mechanism
Full Linux MMU integration
Hybrid post-copy supported
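Hecatonchire resolves these network-bound faults inside its RDMA kernel module. Purely to illustrate the mechanism, here is a sketch using Linux's later userfaultfd API (kernel 4.3+, not what Hecatonchire shipped): a "guest" thread on host B touches a missing page, and a handler thread fetches and installs it.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static long page_sz;

/* Stand-in for pulling one page from source host A over the network;
 * a demand pre-pager would also fetch neighbouring pages here. */
static void fetch_remote_page(void *buf, unsigned long addr)
{
    (void)addr;
    memset(buf, 0x42, page_sz);
}

/* "Guest" on host B: first touch of a not-yet-transferred page faults. */
static void *guest_thread(void *mem)
{
    printf("guest read byte: %d\n", ((volatile char *)mem)[5]);
    return NULL;
}

int main(void)
{
    page_sz = sysconf(_SC_PAGESIZE);
    char *mem = mmap(NULL, page_sz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *scratch = mmap(NULL, page_sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)mem, .len = page_sz },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    pthread_t guest;
    pthread_create(&guest, NULL, guest_thread, mem);

    /* Fault handler: wait for the missing-page event, fetch, install. */
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    poll(&pfd, 1, -1);
    struct uffd_msg msg;
    read(uffd, &msg, sizeof msg);

    fetch_remote_page(scratch, msg.arg.pagefault.address);
    struct uffdio_copy copy = {
        .dst = msg.arg.pagefault.address & ~(page_sz - 1),
        .src = (unsigned long)scratch,
        .len = page_sz,
    };
    ioctl(uffd, UFFDIO_COPY, &copy);   /* installs page, wakes the guest */

    pthread_join(guest, NULL);
    return 0;
}
```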
Demo
Flash Cloning: Sub-second elastic auto-scaling
Automated Elasticity
Elasticity is the basis of cloud economics
You can scale up or scale down on demand
You pay only for what you use
Chart depicts the evolution of scaling
Scale-up approach: purchase bigger machines to meet rising demand
Traditional scale-out approach: reconfigure the cluster size according to demand
Automated elasticity: grow and shrink your resources automatically in response to changing demand, as represented by monitored metrics
If you can’t respond fast enough, you either miss business opportunities or have to increase your margin of purchased resources.
Source: Amazon Web Services Guide
Hecatonchire Flash Cloning
Business Problem
AWS auto scaling (and others) takes minutes to scale up:
– Cloning a disk image from a template (AMI) image
– Running the full VM boot-up sequence
– Acquiring an IP address via DHCP
– Starting up the application
Hecatonchire Solution
Provide just-in-time (sub-second) scaling according to demand:
– Clone a paused source VM copy-on-write (CoW), including disk image, VM memory, and VM state (registers, etc.)
– Use a post-copy live-migration scheme: page-faulting fetches missing pages while active page pushing runs in the background
– Create a private network switch per clone (avoiding the need to assign a new MAC address and reconfigure IP)
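The memory side of flash cloning rests on the same copy-on-write principle the Linux kernel applies in fork(): the clone starts out sharing every page with its source, and a page is copied only when one side writes it. A minimal, plain-Linux illustration (not Hecatonchire code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* "Source VM memory": touched so the pages actually exist. */
    size_t len = 64 << 20;
    char *mem = malloc(len);
    memset(mem, 0xAA, len);

    pid_t pid = fork();         /* sub-second "clone": no pages copied yet */
    if (pid == 0) {
        mem[0] = 1;             /* first write: kernel copies just this page */
        printf("clone: wrote one page, the other %zu MB stay shared\n",
               len >> 20);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("source: byte 0 is still 0x%02x\n", (unsigned char)mem[0]);
    free(mem);
    return 0;
}
```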
Memory Pooling: Tapping into unused memory resources of remote hosts
Hecatonchire Breakthrough Capability: Breaking the Memory Box Barrier for Memory-Intensive Applications
[Chart: access speed (nsec to msec) vs. capacity (MB to PB) for embedded, local, and networked resources (local disk, SSD, NAS, SAN), with remote memory crossing the performance barrier that separates networked resources from local ones]
The Memory Cloud: Turning memory into a distributed memory service
[Diagram: Server1–Server3 hosting VMs and applications; each app's RAM is drawn from a pooled memory service spanning the servers, alongside storage]
Business Problem
Large amounts of DRAM are required on demand, from shared cloud hosts.
Current cloud offerings are limited by the size of their physical host: AWS can’t go beyond 68 GB of DRAM, as these large-memory instances fully occupy the physical host.
Hecatonchire Solution
Access remote DRAM via a low-latency RDMA stack (using pre-pushing to hide latency)
MMU integration for transparent consumption by applications and VMs; as a result, it also supports compression (zcache), de-duplication (KSM), and N-tier storage
No hardware investment needed! No need for dedicated servers!
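Because MMU-level integration makes remote pages look like ordinary anonymous memory, stock kernel services apply to them unchanged. As one example, a region can be offered to KSM for de-duplication with the standard madvise(2) call (plain Linux, nothing Hecatonchire-specific):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16 << 20;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Ask the kernel's KSM daemon to scan and merge identical pages. */
    if (madvise(buf, len, MADV_MERGEABLE))
        perror("madvise(MADV_MERGEABLE)");  /* needs CONFIG_KSM */
    else
        printf("region offered to KSM for de-duplication\n");
    return 0;
}
```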
Hecatonchire RRAIM
RRAIM: Remote Redundant Array of Inexpensive Memory. Memory fault tolerance as part of a full HA solution.
[Diagram: a cloud management stack over many physical nodes hosting a variety of VMs. VM high availability is provided by KVM Kemari / Xen Remus; Hecatonchire adds RRAIM-1 (mirroring), keeping master and slave copies of each VM's RAM on separate hosts.]
Distributed Shared Memory: Our next challenge
Cache-Coherent Non-Uniform Memory Access (ccNUMA)
Traditional cluster
Distributed memory
Standard interconnects
OS instance on each node
Distribution handled by application
ccNUMA
Cache coherent shared memory
Fast interconnects
One OS instance
Distribution handled by hardware/hypervisor
Hecatonchire Distributed Shared Memory (DSM) VM
Hecatonchire DSM – Cache Coherency (CC) Challenge
Standard ccNUMA
Inter-node (2,000 ns) cache coherency takes too long
Inter-node reads are expensive, while the processor cache is not large enough
Adding COMA (Cache Only Memory Access)
Can help improve performance in multi-read scenarios
A COMA implementation requires a 4 KB cache line, leading to false sharing
NUMA Topology / Dynamic NUMA Topology
An application’s NUMA-aware implementation may not be complete
Dynamic changes in the NUMA topology will not be supported by most current apps
We need to attempt to hide some of the performance challenges (so that we can expose a fixed NUMA topology)
Adding vCPU live migration
The compact vCPU state (only several KB) can be live migrated
Summary
Roadmap
2011
• Live Migration
• Pre-copy XBZRLE Delta Encoding
• Pre-copy LRU page reordering
• Post-copy using RDMA interconnects
2012
• Memory Cloud
• Memory Pooling
• Memory Fault Tolerance (RRAIM)
• Flash Cloning
2013
• Lego Landscape
• Distributed Shared Memory
• Flexible resource management
Key takeaways
Hecatonchire extends the standard Linux stack, requiring only standard commodity hardware
With Hecatonchire, unmodified applications or VMs (which are NUMA-aware) can tap into remote resources transparently
To be released as open source under the GPLv2 and LGPL licenses to the QEMU and Linux communities
Developed by the SAP Research Technology Infrastructure (TI) Practice
Thank you
Benoit Hudzia; Sr. Researcher;
SAP Research CEC Belfast
Aidan Shribman; Sr. Researcher;
SAP Research Israel
Appendix
Communication Stacks have Become Leaner
Traditional network interface
Application / OS context switches
Intermediate buffer copies
OS handles transport processing
RDMA adapters
Zero copy, directly from/to application physical memory
Transport processing offloaded to the RDMA adapter, effectively bypassing the OS and CPU
A standard interface, OFED “Verbs”, supporting all RDMA adapters (IB, RoCE, iWARP)
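As a concrete illustration of the Verbs interface, here is a hypothetical helper that fetches one remote buffer with a one-sided RDMA READ; creating and connecting the queue pair, registering memory, and exchanging the peer's address and rkey out of band are all assumed to have happened already:

```c
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* Read `len` bytes from the peer's memory into `local_buf`.
 * Assumes: a connected RC queue pair `qp` with completion queue `cq`,
 * `local_buf` covered by registered memory region `mr`, and the peer's
 * `remote_addr`/`rkey` learned out of band. Link with -libverbs. */
int rdma_read(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
              void *local_buf, size_t len,
              uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* data lands here, zero-copy */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,  /* one-sided: no CPU on the peer */
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma    = { .remote_addr = remote_addr, .rkey = rkey },
    };
    struct ibv_send_wr *bad = NULL;

    if (ibv_post_send(qp, &wr, &bad))    /* hand the WR to the adapter */
        return -1;

    /* Poll for completion: the adapter performed the whole transfer. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```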
Linux Kernel Virtual Machine (KVM)
Released as a Linux Kernel Module (LKM) under the GPLv2 license in 2007 by Qumranet
Full virtualization via the Intel VT-x and AMD-V virtualization extensions to the x86 instruction set
Uses QEMU for invoking KVM, for handling I/O, and for advanced capabilities such as VM live migration
KVM is considered the primary hypervisor on most major Linux distributions, such as Red Hat and SUSE
Remote Page Faulting Architecture Comparison
Hecatonchire
No context switches
Zero copy
Uses iWARP RDMA
Yobusame
Context switches into user mode
Uses standard TCP/IP transport
Sources: Horofuchi and Yamahata, KVM Forum 2011; Hudzia and Shribman, SYSTOR 2012
Hecatonchire DSM VM – ccNUMA Challenge
Linux NUMA topology
Linux is aware of the NUMA topology (which cores and memory banks reside in each zone/node) and exposes this topology for applications to make use of.
But it is up to the application to be NUMA-aware; if not, it may suffer when running on a NUMA topology.
And even if the application is NUMA-aware, the longer time needed for cache coherency (cc) may hurt performance.
Intel Nehalem memory hierarchy: inter-core (L3 cache) ~20 ns; inter-socket (main memory) ~100 ns; inter-node over IB (remote memory) ~2,000 ns
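A minimal sketch of an application consuming that exposed topology via libnuma (standard Linux, link with -lnuma; not Hecatonchire-specific):

```c
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);    /* the zone Linux exposes */

    /* Allocate from the local node to avoid inter-socket (~100 ns) or,
     * on a DSM VM, inter-node (~2,000 ns) accesses. */
    size_t len = 1 << 20;
    void *buf = numa_alloc_onnode(len, node);
    printf("cpu %d is on node %d, local buffer at %p\n", cpu, node, buf);
    numa_free(buf, len);
    return 0;
}
```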
Legal Disclaimer
The information in this presentation is confidential and proprietary to SAP and may not be disclosed without the permission of
SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP
has no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or
release any functionality mentioned therein. This document, or any related presentation and SAP's strategy and possible future
developments, products and or platforms directions and functionality are all subject to change and may be changed by SAP at
any time for any reason without notice. The information on this document is not a commitment, promise or legal obligation to
deliver any material, code or functionality. This document is provided without a warranty of any kind, either express or implied,
including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This
document is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or
omissions in this document, except if such damages were caused by SAP intentionally or grossly negligent.
All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially
from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as
of their dates, and they should not be relied upon in making purchasing decisions.