Project Hecatonchire
Dr. Benoit Hudzia, TIP HANA Cloud Computing, Systems Engineering, July 2013
The Need
Next Generation Datacentre
Advantages of Current Datacenter Designs
• Cheap, off-the-shelf, commodity parts
• No need for custom servers or networking kit
• (Sort of) easy to scale horizontally
• Runs standard software
• No need for “cluster” or “grid” OSs
• Ideal for VMs
• Highly redundant
• Homogeneous
Lots of Problems
• Datacenters mix customers and applications
• Heterogeneous, unpredictable workload patterns
• Competition over resources
• How to achieve high reliability?
• Inefficient resource consumption model
• Heat and power
• Datacenters consume roughly 30 billion watts worldwide
• May cost more than the machines themselves
Achieving the Right-sized Servers
• Eliminate unnecessary components
• Increase power efficiency
• Optimize performance per workload
Using high-volume components to build high-value systems
The Lego Cloud
Original Concept
Original Idea: Lego Cloud
Resource aggregation: a virtual machine's compute and memory span multiple physical nodes.
Nodes can be specialized for specific functionality:
– Memory servers
– Compute servers
– I/O servers
  o Storage
  o Accelerators (GPU)
  o Etc.
Hecatonchire Value Proposition
• Optimal price/performance by using commodity hardware
• Seamless deployment within existing clouds
• Memory / CPU / I/O resource pooling
• Just-in-time resource consumption
• Heterogeneous workloads and hardware
• Simplified Reliability, Availability and Serviceability model
[Figure: Server #1 … Server #n, each providing CPUs, memory and I/O; guest VMs (App / OS / H/W) span the servers over fast RDMA communication.]
From Rack Scale to Cloud Scale
[Figure: CPUs, memory and I/O pooled over fast RDMA communication, scaled from the rack to the datacentre.]
Rack Scale → Datacentre Scale: auto-scaling VM(s), efficient resource pooling
Aggregating Resources with Fast Networking
Author: Chaim Bendalac
Technology              | Bandwidth | Latency (4 kB)
PCIe SSD (FusionIO)     | 15 Gbps   | ~50 µs
PCIe Fabric             | 64 Gbps   | ~0.8 µs
Infiniband QDR          | 40 Gbps   | 4 µs
Infiniband FDR          | 54 Gbps   | 1 µs
iWARP 10 GbE            | 10 Gbps   | 6-7 µs
iWARP 40 GbE            | 40 Gbps   | 1-2 µs
Intel Silicon Photonics | 100 Gbps  | > 1 µs
The Memory Cloud
Focus On Memory Aggregation
How Do We Offer Better Memory TCO?
• Memory in the nodes of current clusters is over-provisioned to fit the requirements of “any” application
• It remains unused most of the time
• How can we unleash memory-constrained applications by using the memory in the rest of the nodes?
Memory that grows with your business, not before.
Decoupling Memory from Core Aggregation
Many shared-memory parallel applications do not scale beyond a few tens of cores. However, they may benefit from large amounts of memory:
• In-memory databases
• Data mining
• VMs
• Scientific applications
• etc.
Eliminate the physical limitations of the cloud / datacenter
The Idea: Turning memory into a distributed memory service
Breaks memory from the bounds of the physical box
Transparent deployment with performance at scale and reliability
Architecture
High Level Version
High Level Architecture
• Lightweight: minimal set of hooks in the MMU (5)
• Kernel module with 3 core components: MMU coherency engine, RDMA engine (on top of the RDMA drivers), and management/tracing
• 3 existing consumption models:
  • Native
  • User-space library (illustrated below)
  • KVM
• 4th model under development: hgroup
  • Similar to cgroup; allows barebone memory extension without modification
[Figure: apps consume remote memory natively, through the user-space library and char device, via KVM, or via hgroup; inside the Linux kernel the coherency engine hooks the MMU and drives the RDMA engine over the RDMA drivers.]
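To make the user-space consumption model concrete, below is a minimal sketch of how an application could request a remote-backed region through the char device and map it into its address space. The device path /dev/heca, the ioctl number and the request structure are assumptions made for illustration, not the actual Hecatonchire interface.

/* Hypothetical sketch of the user-space consumption model.
 * The device path, ioctl command and request layout are invented
 * for illustration; they are not the real Hecatonchire ABI. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

struct heca_map_req {            /* assumed request format */
    uint64_t length;             /* bytes of remote-backed memory */
    uint32_t sponsor_node;       /* node that sponsors the pages */
};

#define HECA_IOC_MAP 0x4800      /* placeholder ioctl number */

int main(void)
{
    int fd = open("/dev/heca", O_RDWR);          /* assumed char device */
    if (fd < 0) { perror("open"); return 1; }

    struct heca_map_req req = { .length = 1UL << 30, .sponsor_node = 2 };
    if (ioctl(fd, HECA_IOC_MAP, &req) < 0) { perror("ioctl"); return 1; }

    /* Map the remote-backed region into our address space; faults on
     * this range would be resolved by the coherency engine over RDMA. */
    void *mem = mmap(NULL, req.length, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    ((char *)mem)[0] = 42;       /* first touch triggers a remote fault */
    munmap(mem, req.length);
    close(fd);
    return 0;
}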
Extending Virtual Machine Memory
• Instead of using swap to free up RAM, push pages to remote hosts
• If a page is needed again, request it back
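To make the push/pull idea concrete, the self-contained toy below models the cycle with two in-process buffers standing in for local RAM and a remote host; all names and the memcpy-based mechanics are purely illustrative (the real system operates on page tables and RDMA).

/* Self-contained toy model of the push/pull idea above: instead of
 * swapping to disk, a "cold" page is copied to a buffer standing in for
 * a remote host, and pulled back on the next access. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NPAGES    4

static uint8_t local_ram[NPAGES][PAGE_SIZE];
static uint8_t remote_ram[NPAGES][PAGE_SIZE];   /* stands in for a remote host */
static int     is_remote[NPAGES];               /* stands in for the remote PTE */

static void push_page(int p)                    /* evict: local -> remote */
{
    memcpy(remote_ram[p], local_ram[p], PAGE_SIZE);
    memset(local_ram[p], 0, PAGE_SIZE);         /* local frame freed for reuse */
    is_remote[p] = 1;
}

static uint8_t *access_page(int p)              /* fault the page back in if needed */
{
    if (is_remote[p]) {                         /* "remote fault" */
        memcpy(local_ram[p], remote_ram[p], PAGE_SIZE);
        is_remote[p] = 0;
    }
    return local_ram[p];
}

int main(void)
{
    access_page(2)[0] = 7;      /* write to page 2 */
    push_page(2);               /* memory pressure: evict it remotely */
    printf("page 2 byte 0 after pull-back: %d\n", access_page(2)[0]);
    return 0;
}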
How does it work (Simplified Version)
[Figure: remote page fault resolution between Physical Node A and Physical Node B over the network fabric.]
1. On Node A, a virtual address misses in the MMU (+TLB): the page table entry is a remote PTE (a custom swap entry).
2. Node A's coherency engine issues a page request through the RDMA engine over the network fabric.
3. Node B's coherency engine invalidates its PTE and MMU entry, extracts the page and prepares it for RDMA transfer.
4. The page response is sent back; Node A writes the PTE, updates the MMU and resumes execution on the now-local physical address.
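The sketch below condenses steps 1-4 into an illustrative fault handler on the requesting node. It is pseudocode: the helper functions and types are invented for readability, and this is not the actual in-kernel implementation.

/* Illustrative pseudocode of the remote-fault path on the requesting
 * node (Node A). Helper functions and types are invented; the real
 * handler lives inside the kernel's fault path. */
static int heca_remote_fault(struct mm_ctx *mm, unsigned long vaddr)
{
    struct remote_pte rpte;
    void *frame;

    /* 1. The faulting PTE is a custom swap entry naming the owner. */
    if (!read_remote_pte(mm, vaddr, &rpte))
        return -1;                       /* not one of our entries */

    frame = alloc_local_frame();         /* destination for the page */

    /* 2.-3. Ask the owning node; its coherency engine invalidates its
     * own PTE/MMU entry before extracting and returning the page. */
    send_page_request(rpte.owner_node, rpte.remote_addr);
    wait_page_response(rpte.owner_node, frame, PAGE_SIZE);

    /* 4. Install a normal PTE, update the MMU/TLB and restart the
     * faulting instruction. */
    install_pte(mm, vaddr, frame);
    flush_tlb_page_local(mm, vaddr);
    return 0;
}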
Hecatonchire Memory Structure (a data-structure sketch follows the figure below)
• Up to 2^64 groups (DSM-VM, e.g. a meta-VM / process), each owning a single virtual memory address space
• Up to 2^64 sub-groups (SVM, a physical process) per group
• Up to 2^64 memory regions (MR) per group, partitioning the 64-bit address space
• Each sub-group is associated with a single process and can own zero or more MRs
• A virtual memory address space can be backed by more than one memory region (RRAIM)
• Each memory region is owned by one physical node (sub-group)
• Under development:
  • MR split / merge
  • MR relocation
[Figure: Heca → groups → virtual memory address space → sub-groups → memory regions hierarchy.]
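One way to picture this hierarchy in code is the C sketch below; the field names and types are assumptions for illustration and are not taken from the Hecatonchire sources.

/* Illustrative data-structure sketch of the group / sub-group / memory
 * region hierarchy described above. Field names and types are assumed. */
#include <stdint.h>

struct heca_mr {                      /* memory region */
    uint64_t start;                   /* base within the 64-bit VA space */
    uint64_t length;
    uint32_t owner_subgroup;          /* the sub-group (node) owning it */
    struct heca_mr *mirror;           /* optional RRAIM mirror copy */
};

struct heca_subgroup {                /* SVM: one physical process/node */
    uint64_t id;                      /* up to 2^64 per group */
    struct heca_mr **regions;         /* zero or more owned MRs */
    uint32_t nr_regions;
};

struct heca_group {                   /* DSM-VM: meta-VM or process */
    uint64_t id;                      /* up to 2^64 groups */
    /* One shared virtual memory address space, partitioned into MRs
     * that are each owned by exactly one sub-group. */
    struct heca_subgroup **subgroups;
    uint32_t nr_subgroups;
};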
Coherency Engine
• Different coherency models:
  • Shared nothing
  • Write invalidate
  • Some variants
• Custom distributed index (a toy illustration follows the figure below)
  • Worst case: O(log n) hops
  • On a 16-node cluster with random R/W on each node: 97% of lookups need < 2 hops, 99% need < 3 hops
• Smart Prefetching
[Figure: four nodes, each with a coherency engine, distributed index, page table/MMU and local RAM, exchanging prefetch traffic.]
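The slide only reports the observed hop counts; one common way to obtain behaviour like this is a "probable owner" index, where each node keeps a best guess and forwards the lookup until the true owner answers. The toy, self-contained simulation below illustrates that scheme; it is an assumption for illustration, not the index actually used.

/* Toy simulation of a "probable owner" distributed index. The scheme
 * and all names are assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>

#define NODES  16
#define NPAGES 8

/* hint[n][p]: node n's guess of who owns page p; owner[p]: ground truth. */
static uint32_t hint[NODES][NPAGES];
static uint32_t owner[NPAGES] = {3, 3, 7, 7, 12, 12, 0, 0};

static uint32_t find_owner(uint32_t self, uint32_t page, int *hops)
{
    uint32_t node = hint[self][page];
    while (node != owner[page]) {       /* wrong guess: follow that node's hint */
        node = hint[node][page];
        (*hops)++;
    }
    return node;
}

int main(void)
{
    for (uint32_t n = 0; n < NODES; n++)        /* most hints are up to date */
        for (uint32_t p = 0; p < NPAGES; p++)
            hint[n][p] = owner[p];
    hint[5][2] = 9;                             /* node 5 has a stale hint... */
    hint[9][2] = owner[2];                      /* ...but node 9 knows the owner */

    int hops = 1;                               /* the first request is one hop */
    uint32_t o = find_owner(5, 2, &hops);
    printf("page 2 owned by node %u, found in %d hop(s)\n", o, hops);
    return 0;
}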
RDMA Engine
• In-kernel implementation
• Optimised memory use
• Fabric agnostic (as long as the fabric exposes RDMA verbs; an example verbs call follows the figure below)
• Tested on: iWARP (Chelsio), Infiniband (ConnectX-2, QDR), SoftiWARP, RoCE, SoftRoCE
• Multiplexes groups/apps
• Flow control / multi-queue support
• Under development:
  • Multi-NIC support for a single group
  • NIC multiplexing (active/active failover)
[Figure: three nodes, each running several apps over a shared RDMA engine and local RAM.]
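For context, the snippet below shows how a single 4 KB page could be pulled with a one-sided RDMA READ using the standard user-space verbs API (libibverbs); the in-kernel engine uses the kernel verbs interface instead, and queue-pair setup and completion polling are elided here.

/* Sketch of pulling one page from a remote node's registered memory with
 * a one-sided RDMA READ via libibverbs. Connection management is elided;
 * the queue pair 'qp' is assumed to be connected to the sponsoring node,
 * and 'local_buf' must lie in the registered memory region 'mr'. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

int post_page_read(struct ibv_qp *qp, struct ibv_mr *mr,
                   void *local_buf, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = PAGE_SIZE,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = (uintptr_t)local_buf;  /* track the completion */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_READ;      /* one-sided read */
    wr.send_flags          = IBV_SEND_SIGNALED;     /* generate a CQE */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}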
Raw Performance
How fast can we move memory pages?
Hard Page Fault Resolution Performance (10 GB sequential R/W scan)

Fabric             | Avg. resolution time, 4 KB page (µs) | Page IOPS, 4 KB (with prefetch)
SoftiWARP (10 GbE) | 330                                  | 50k/s
iWARP (10 GbE)     | 28                                   | 250k/s
Infiniband (QDR)   | 16                                   | 650k/s
Average Compounded Page Fault Resolution Time (R/W with Prefetch)

[Chart: average compounded resolution time (1,000-6,000 µs) for 1-8 threads; series for iWARP 10 GbE and Infiniband 40 Gbps under sequential, binary-split and random-walk access, with average lines for iWARP and IB.]
Use Case: In-Memory Database
Scaling Out HANA Memory
Memory Scale-out for SAP HANA – Benchmark

Hardware:
• Server with Intel Xeon (Westmere): 4 sockets, 10 cores each, 1 TB RAM
• Fabric:
  • Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2
  • 10 GbE Ethernet switch + Chelsio T422 NIC

Virtual Machine:
• Large size: 1 TB RAM, 40 vCPUs
• Hypervisor: KVM
• Note: 5-7% overhead due to virtualization vs. bare metal
• Application: SAP HANA (in-memory database)
• Workload: OLAP (TPC-H variant)
  • Data size ~2.5 TB uncompressed => ~300 GB compressed
  • 18 different queries
  • 15 iterations of each query set
Use Case: Scaling Out HANA Virtual Memory
• The HANA VM sees 4 TB of RAM but is constrained to 1 TB of physical RAM (scratch pad) on its host.
• The other hosts each run Heca in the kernel plus a memory process (an empty VM) with 1 TB of virtual memory allocated, exposed to the VM over the RDMA fabric (a sponsor-process sketch follows the figure below).
• We never consume more than 4 TB of physical RAM at any time.
[Figure: one compute node running the HANA VM and several memory nodes, each with Heca in the kernel and an empty memory-process VM, connected by the RDMA fabric.]
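In the spirit of the earlier user-space sketch, a memory node's "memory process" could look like the following: reserve a large, lazily populated region and offer it to the compute node. The device path, ioctl and structure are the same invented placeholders as before, not the real interface.

/* Hypothetical sketch of a "memory sponsor" process on a memory node. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

struct heca_sponsor_req {            /* assumed request format */
    uint64_t addr;                   /* start of the sponsored region */
    uint64_t length;                 /* bytes offered to the demander */
};

#define HECA_IOC_SPONSOR 0x4801      /* placeholder ioctl number */

int main(void)
{
    size_t len = 1UL << 40;          /* 1 TB of virtual memory */

    /* Anonymous, lazily-populated region: physical pages are only
     * consumed when the remote VM actually pushes data into them. */
    void *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    int fd = open("/dev/heca", O_RDWR);              /* assumed device */
    if (fd < 0) { perror("open"); return 1; }

    struct heca_sponsor_req req = { .addr = (uintptr_t)region, .length = len };
    if (ioctl(fd, HECA_IOC_SPONSOR, &req) < 0) { perror("ioctl"); return 1; }

    pause();                         /* keep the sponsored region alive */
    return 0;
}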
Scaling Out HANA (Half of Memory Remote)

[Chart: query set completion time (seconds, 0-500) vs. number of users (1, 20, 40, 60) for the Baseline and Half-Half configurations.]

No. of users | HECA overhead per query set
1            | ~3%
20           | ~3%
40           | ~3%
60           | ~3%

Virtual Machine:
• 1 TB RAM, 40 vCPUs
• Application: HANA (in-memory database)
• Workload: OLAP (TPC-H variant)
• ~300 GB compressed data set

Hardware:
• Intel Xeon (Westmere): 4 sockets, 10 cores, 1 TB RAM
• Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2
Per Query Overhead
[Chart: per-query overhead (-50% to 200%) for each of the 18 queries, with 1, 20 and 40 users.]
Scaling Out HANA

Virtual Machine:
• 1 TB RAM, 40 vCPUs
• Application: HANA (in-memory database)
• Workload: OLAP (TPC-H variant), 80 users
• ~300 GB compressed data set

Memory ratio | HECA overhead per query set (80 users)
1:2          | 4%
1:3          | 5.6%
2:1:1        | 0.9%
1:1:1        | 2%

Hardware:
• Intel Xeon (Westmere): 4 sockets, 10 cores, 1 TB RAM
• Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2
High Availability of Memory Nodes (RRAIM) Running HANA

Memory ratio (Medium 128 GB - SAP-H benchmark - 80 users) | RRAIM overhead vs. Heca
1:2                                                       | -0.2%
1:3                                                       | -0.1%

[Figure: HA mirroring - each memory region of the compute node is backed by RAM on two remote nodes.]

• Each memory region is backed by two remote nodes.
• Remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
• Failure of a node has no immediate effect on the computation node.
• When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
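An illustrative pseudocode of the mirrored eviction path follows: the page is written to both remote nodes before the local frame is freed, so either copy can later satisfy a fault. Helper names and types are invented and this is not the actual implementation.

/* Illustrative pseudocode: RRAIM-style eviction to two mirror nodes. */
static int push_page_mirrored(struct page_ref *pg,
                              struct remote_node *primary,
                              struct remote_node *mirror)
{
    uint64_t addr_a = sponsor_alloc_slot(primary, PAGE_SIZE);
    uint64_t addr_b = sponsor_alloc_slot(mirror, PAGE_SIZE);

    /* Issue both RDMA writes back to back so they proceed in parallel. */
    rdma_write_async(primary, addr_a, pg->data, PAGE_SIZE);
    rdma_write_async(mirror,  addr_b, pg->data, PAGE_SIZE);

    /* Free the local frame only after both copies are durable, so a
     * single node failure never loses the page. */
    rdma_wait_completion(primary);
    rdma_wait_completion(mirror);

    set_remote_pte_mirrored(pg->owner_mm, pg->vaddr,
                            primary->id, addr_a, mirror->id, addr_b);
    free_local_frame(pg);
    return 0;
}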
Enterprise-Class Features

• Online RAM deduplication (via KSM; example after this list)
  [Figure: memory footprint before and after deduplication.]
• Automatic memory tiering
  [Figure: tiers ranging over local memory, remote memory, compressed memory, SSD/NVRAM and HDD, across local and remote nodes.]
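The deduplication feature builds on the kernel's KSM. The small, self-contained example below shows how a region is opted into KSM with madvise(MADV_MERGEABLE), which is the standard Linux interface; it requires CONFIG_KSM and /sys/kernel/mm/ksm/run set to 1.

/* Minimal example of registering a memory region with KSM, the kernel
 * same-page merging feature behind online RAM deduplication. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 64UL << 20;                      /* 64 MB */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Fill with identical pages so the KSM scanner has something to
     * merge into shared, copy-on-write pages. */
    memset(buf, 0xAB, len);

    /* Mark the region as mergeable; KSM deduplicates it in the
     * background (progress is visible under /sys/kernel/mm/ksm/). */
    if (madvise(buf, len, MADV_MERGEABLE) != 0) {
        perror("madvise(MADV_MERGEABLE)");
        return 1;
    }

    puts("region registered with KSM");
    pause();                                      /* keep the mapping alive */
    return 0;
}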
Next
Work in progress and possible future directions
Transitioning to a Memory Cloud - Transparent Cloud Integration

[Figure: memory cloud management services (OpenStack, Heca-NOVA) coordinating many physical nodes hosting a variety of VMs: compute VMs (memory demanders), memory VMs (memory sponsors) and combination VMs (both sponsor and demander).]

PoC Q4 2013
• Automatically reclaim underutilized or abandoned memory resources
• Automatically and intelligently redeploy memory workloads across the infrastructure
Hecatonchire Cloud
• A hierarchical, coordinated, global system can set and manage power budgets, respond to faults, and support enclave components that leverage machine learning
• Resource management is hierarchical, and managers are stackable
• Resource managers are integrated
• Resource managers are customizable and adaptable
• Sharing is avoided whenever possible
• Strict enforcement is costly
Barebone Memory Scale-Out and Fine-Grained Memory Sharing

Barebone memory scale-out (PoC Q4 2013):
• No need for virtualization and/or a hypervisor
• Similar to Linux control groups (cgroups)
• Allows barebone memory scale-out for HANA or any other application

Fine-grained distributed shared memory for user space (PoC Q4 2013):
• Memory coherence models:
  • Shared nothing
  • Write invalidate
  • Read/write protection
• POSIX-style shared memory (see the sketch after this list)
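For reference, the self-contained example below shows the plain POSIX shared-memory interface (shm_open + mmap) whose style the fine-grained user-space DSM aims to follow; how Hecatonchire would extend such a segment across nodes is not shown.

/* Minimal POSIX shared-memory example; link with -lrt on older glibc. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/heca_demo";          /* segment name visible to peers */
    size_t len = 4096;

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return 1; }

    /* Any cooperating process that shm_open()s the same name sees this
     * write; a distributed version would extend the same semantics
     * across nodes, with coherence handled by the engine. */
    strcpy(seg, "hello from the sharing process");

    munmap(seg, len);
    close(fd);
    shm_unlink(name);                          /* clean up the segment */
    return 0;
}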
Hierarchical Memory
• In the future a significant portion of memory will be non-volatile
  • Helps reduce power
  • Helps with resilience
  • Helps with cost
• Aim to offer transparent use of storage-class memory, and/or to ease its use
• Transparent integration with storage-class memory:
  • Local NVRAM
  • Remote NVRAM
  • Hybrid DRAM/NVRAM
  • Interfacing directly with NVMe
  • Persistence / HA
CPU Aggregation

• Leverage ACPI virtualization features from Intel/AMD
• “Titanomachie”: smart vCPU and data scheduling
  • Mixing prefetch, vCPU migration/placement, and data/computation reordering
• Variable granularity support
  • From cache line
  • To huge page
Incremental effort
• Prefetch
• Support for new fabrics
• Dirty-bit tracking and hardware features from next-generation AMD/Intel architectures
• MR-IOV support
Thank You!
Contact information:
Dr. Benoit Hudzia [email protected]
Hecatonchire Project:
WWW: http://www.hecatonchire.com
Github: https://github.com/hecatonchire/
Backup Slides
Raw Bandwidth Usage
HW: 4-core i5-2500 CPU @ 3.30 GHz - SoftiWARP 10 GbE - iWARP Chelsio T422 10 GbE - IB ConnectX-2 QDR 40 Gbps

[Chart: total Gbit/s (0-25) for 1-7 threads, with SoftiWARP, iWARP and Infiniband series under sequential, binary-split and random walks over 1 GB of shared RAM.]

• Maxing out bandwidth
• Not enough cores to saturate (?)
• No degradation under high load
• Software RDMA has significant overhead
Enabling Live Migration of HANA DB (Small Instance)
                                   | Baseline | Pre-Copy (Standard)                            | Post-Copy (Heca)
Downtime                           | N/A      | 7.47 s                                         | 675 ms
Performance degradation (80 users) | 0%       | Benchmark failed (HANA crash, VM unresponsive) | 5%
Redundant Array of Inexpensive RAM: RRAIM
1. Each memory region is backed by two remote nodes. Remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
2. Failure of a node has no immediate effect on the computation node.
3. When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
Quicksort Benchmark with Memory Constraint
Memory ratio (constraint using cgroup) | HECA overhead | RRAIM overhead
3:4                                    | 2.08%         | 5.21%
1:2                                    | 2.62%         | 6.15%
1:3                                    | 3.35%         | 9.21%
1:4                                    | 4.15%         | 8.68%
1:5                                    | 4.71%         | 9.28%

[Charts: DSM and RRAIM overhead (0-10%) vs. memory ratio (3:4 to 1:5) for quicksort benchmarks with 512 MB, 1 GB and 2 GB datasets.]
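The memory constraint in this benchmark was applied with cgroups; a minimal sketch of doing so with the cgroup v1 memory controller (the interface current in 2013) follows. The group name, limit value and benchmark binary are placeholders, and the memory controller is assumed to be mounted at /sys/fs/cgroup/memory.

/* Sketch: cap a benchmark at 512 MB of physical memory via cgroup v1. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s", value);
    return fclose(f);
}

int main(void)
{
    /* Create the group and set its physical memory limit (512 MB). */
    mkdir("/sys/fs/cgroup/memory/quicksort", 0755);
    write_file("/sys/fs/cgroup/memory/quicksort/memory.limit_in_bytes",
               "536870912");

    /* Move this process into the group, then exec the benchmark so it
     * inherits the constraint (binary name is a placeholder). */
    char pid[32];
    snprintf(pid, sizeof(pid), "%d", getpid());
    write_file("/sys/fs/cgroup/memory/quicksort/tasks", pid);

    execlp("./quicksort_bench", "./quicksort_bench", (char *)NULL);
    perror("execlp");
    return 1;
}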