Project Hecatonchire
Dr. Benoit Hudzia, TIP HANA Cloud Computing, Systems Engineering, July 2013
The Need
Next Generation Datacentre
Advantages of Current Datacenter Designs
• Cheap, off-the-shelf, commodity parts
• No need for custom servers or networking kit
• (Sort of) easy to scale horizontally
• Runs standard software
• No need for “cluster” or “grid” OSs
• Ideal for VMs
• Highly redundant
• Homogeneous
Lots of Problems
• Datacenters mix customers and applications
• Heterogeneous, unpredictable workload patterns
• Competition over resources
• How to achieve high reliability?
• Inefficient resource consumption model
• Heat and power
• Datacenters consume roughly 30 billion watts worldwide
• May cost more than the machines themselves
Achieving the Right-sized Servers
• Eliminate unnecessary components
• Increase power efficiency
• Optimize performance per workload
Using high-volume components to build high-value systems
The Lego Cloud
Original Concept
Original Idea: Lego Cloud
Resource aggregation: a virtual machine's compute and memory span multiple physical nodes.
Nodes can be specialized for specific functionality:
– Memory servers
– Compute servers
– I/O servers
  o Storage
  o Accelerators (GPU)
  o Etc.
Hecatonchire Value Proposition
• Optimal price/performance by using commodity hardware
• Seamless deployment within existing clouds
• Memory / CPU / I/O resource pooling
• Just-in-time resource consumption
• Heterogeneous workloads and hardware
• Simplified Reliability, Availability and Serviceability model
[Figure: Server #1 … Server #n, each providing CPUs, memory and I/O; guest VMs (App / OS / H/W) span the servers over fast RDMA communication.]
From Rack Scale to Cloud Scale
[Figure: CPUs, memory and I/O pooled over fast RDMA communication, scaled from the rack to the datacentre.]
Rack Scale → Datacentre Scale: auto-scaling VM(s), efficient resource pooling
Aggregating Resources with Fast Networking
Author: Chaim Bendalac
Technology              | Bandwidth | Latency (4 kB)
PCIe SSD (FusionIO)     | 15 Gbps   | ~50 µs
PCIe Fabric             | 64 Gbps   | ~0.8 µs
Infiniband QDR          | 40 Gbps   | 4 µs
Infiniband FDR          | 54 Gbps   | 1 µs
iWARP 10 GbE            | 10 Gbps   | 6-7 µs
iWARP 40 GbE            | 40 Gbps   | 1-2 µs
Intel Silicon Photonics | 100 Gbps  | > 1 µs
The Memory Cloud
Focus On Memory Aggregation
How Do We Offer Better Memory TCO?
• Memory in the nodes of current clusters is over-provisioned to fit the requirements of “any” application
• It remains unused most of the time
• How can we unleash memory-constrained applications by using the memory in the rest of the nodes?
Memory that grows with your business, not before.
Decoupling Memory from Core Aggregation
Many shared-memory parallel applications do not scale beyond a few tens of cores. However, they may benefit from large amounts of memory:
• In-memory databases
• Data mining
• VMs
• Scientific applications
• etc.
Eliminate the physical limitations of the cloud / datacenter
The Idea: Turning memory into a distributed memory service
Breaks memory from the bounds of the physical box
Transparent deployment with performance at scale and reliability
Architecture
High Level Version
High Level Architecture
• Lightweight: minimal set of hooks in the MMU (5)
• Kernel module with 3 core components: MMU coherency engine, RDMA engine (on top of the RDMA drivers), and management/tracing
• 3 existing consumption models:
  • Native
  • User-space library (illustrated below)
  • KVM
• 4th model under development: hgroup
  • Similar to cgroup; allows barebone memory extension without modification
[Figure: apps consume remote memory natively, through the user-space library and char device, via KVM, or via hgroup; inside the Linux kernel the coherency engine hooks the MMU and drives the RDMA engine over the RDMA drivers.]
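To make the user-space consumption model concrete, below is a minimal sketch of how an application could request a remote-backed region through the char device and map it into its address space. The device path /dev/heca, the ioctl number and the request structure are assumptions made for illustration, not the actual Hecatonchire interface.

/* Hypothetical sketch of the user-space consumption model.
 * The device path, ioctl command and request layout are invented
 * for illustration; they are not the real Hecatonchire ABI. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

struct heca_map_req {            /* assumed request format */
    uint64_t length;             /* bytes of remote-backed memory */
    uint32_t sponsor_node;       /* node that sponsors the pages */
};

#define HECA_IOC_MAP 0x4800      /* placeholder ioctl number */

int main(void)
{
    int fd = open("/dev/heca", O_RDWR);          /* assumed char device */
    if (fd < 0) { perror("open"); return 1; }

    struct heca_map_req req = { .length = 1UL << 30, .sponsor_node = 2 };
    if (ioctl(fd, HECA_IOC_MAP, &req) < 0) { perror("ioctl"); return 1; }

    /* Map the remote-backed region into our address space; faults on
     * this range would be resolved by the coherency engine over RDMA. */
    void *mem = mmap(NULL, req.length, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    ((char *)mem)[0] = 42;       /* first touch triggers a remote fault */
    munmap(mem, req.length);
    close(fd);
    return 0;
}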
Extending Virtual Machine Memory
• Instead of using swap to free up RAM, push pages to remote hosts
• If a page is needed again, request it back
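To make the push/pull idea concrete, the self-contained toy below models the cycle with two in-process buffers standing in for local RAM and a remote host; all names and the memcpy-based mechanics are purely illustrative (the real system operates on page tables and RDMA).

/* Self-contained toy model of the push/pull idea above: instead of
 * swapping to disk, a "cold" page is copied to a buffer standing in for
 * a remote host, and pulled back on the next access. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NPAGES    4

static uint8_t local_ram[NPAGES][PAGE_SIZE];
static uint8_t remote_ram[NPAGES][PAGE_SIZE];   /* stands in for a remote host */
static int     is_remote[NPAGES];               /* stands in for the remote PTE */

static void push_page(int p)                    /* evict: local -> remote */
{
    memcpy(remote_ram[p], local_ram[p], PAGE_SIZE);
    memset(local_ram[p], 0, PAGE_SIZE);         /* local frame freed for reuse */
    is_remote[p] = 1;
}

static uint8_t *access_page(int p)              /* fault the page back in if needed */
{
    if (is_remote[p]) {                         /* "remote fault" */
        memcpy(local_ram[p], remote_ram[p], PAGE_SIZE);
        is_remote[p] = 0;
    }
    return local_ram[p];
}

int main(void)
{
    access_page(2)[0] = 7;      /* write to page 2 */
    push_page(2);               /* memory pressure: evict it remotely */
    printf("page 2 byte 0 after pull-back: %d\n", access_page(2)[0]);
    return 0;
}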
How does it work (Simplified Version)
[Figure: remote page fault resolution between Physical Node A and Physical Node B over the network fabric.]
1. On Node A, a virtual address misses in the MMU (+TLB): the page table entry is a remote PTE (a custom swap entry).
2. Node A's coherency engine issues a page request through the RDMA engine over the network fabric.
3. Node B's coherency engine invalidates its PTE and MMU entry, extracts the page and prepares it for RDMA transfer.
4. The page response is sent back; Node A writes the PTE, updates the MMU and resumes execution on the now-local physical address.
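The sketch below condenses steps 1-4 into an illustrative fault handler on the requesting node. It is pseudocode: the helper functions and types are invented for readability, and this is not the actual in-kernel implementation.

/* Illustrative pseudocode of the remote-fault path on the requesting
 * node (Node A). Helper functions and types are invented; the real
 * handler lives inside the kernel's fault path. */
static int heca_remote_fault(struct mm_ctx *mm, unsigned long vaddr)
{
    struct remote_pte rpte;
    void *frame;

    /* 1. The faulting PTE is a custom swap entry naming the owner. */
    if (!read_remote_pte(mm, vaddr, &rpte))
        return -1;                       /* not one of our entries */

    frame = alloc_local_frame();         /* destination for the page */

    /* 2.-3. Ask the owning node; its coherency engine invalidates its
     * own PTE/MMU entry before extracting and returning the page. */
    send_page_request(rpte.owner_node, rpte.remote_addr);
    wait_page_response(rpte.owner_node, frame, PAGE_SIZE);

    /* 4. Install a normal PTE, update the MMU/TLB and restart the
     * faulting instruction. */
    install_pte(mm, vaddr, frame);
    flush_tlb_page_local(mm, vaddr);
    return 0;
}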
Hecatonchire Memory Structure (a data-structure sketch follows the figure below)
• Up to 2^64 groups (DSM-VM, e.g. a meta-VM / process), each owning a single virtual memory address space
• Up to 2^64 sub-groups (SVM, a physical process) per group
• Up to 2^64 memory regions (MR) per group, partitioning the 64-bit address space
• Each sub-group is associated with a single process and can own zero or more MRs
• A virtual memory address space can be backed by more than one memory region (RRAIM)
• Each memory region is owned by one physical node (sub-group)
• Under development:
  • MR split / merge
  • MR relocation
[Figure: Heca → groups → virtual memory address space → sub-groups → memory regions hierarchy.]
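One way to picture this hierarchy in code is the C sketch below; the field names and types are assumptions for illustration and are not taken from the Hecatonchire sources.

/* Illustrative data-structure sketch of the group / sub-group / memory
 * region hierarchy described above. Field names and types are assumed. */
#include <stdint.h>

struct heca_mr {                      /* memory region */
    uint64_t start;                   /* base within the 64-bit VA space */
    uint64_t length;
    uint32_t owner_subgroup;          /* the sub-group (node) owning it */
    struct heca_mr *mirror;           /* optional RRAIM mirror copy */
};

struct heca_subgroup {                /* SVM: one physical process/node */
    uint64_t id;                      /* up to 2^64 per group */
    struct heca_mr **regions;         /* zero or more owned MRs */
    uint32_t nr_regions;
};

struct heca_group {                   /* DSM-VM: meta-VM or process */
    uint64_t id;                      /* up to 2^64 groups */
    /* One shared virtual memory address space, partitioned into MRs
     * that are each owned by exactly one sub-group. */
    struct heca_subgroup **subgroups;
    uint32_t nr_subgroups;
};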
Coherency Engine
• Different coherency models:
  • Shared nothing
  • Write invalidate
  • Some variants
• Custom distributed index (a toy illustration follows the figure below)
  • Worst case: O(log n) hops
  • On a 16-node cluster with random R/W on each node: 97% of lookups need < 2 hops, 99% need < 3 hops
• Smart Prefetching
[Figure: four nodes, each with a coherency engine, distributed index, page table/MMU and local RAM, exchanging prefetch traffic.]
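The slide only reports the observed hop counts; one common way to obtain behaviour like this is a "probable owner" index, where each node keeps a best guess and forwards the lookup until the true owner answers. The toy, self-contained simulation below illustrates that scheme; it is an assumption for illustration, not the index actually used.

/* Toy simulation of a "probable owner" distributed index. The scheme
 * and all names are assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>

#define NODES  16
#define NPAGES 8

/* hint[n][p]: node n's guess of who owns page p; owner[p]: ground truth. */
static uint32_t hint[NODES][NPAGES];
static uint32_t owner[NPAGES] = {3, 3, 7, 7, 12, 12, 0, 0};

static uint32_t find_owner(uint32_t self, uint32_t page, int *hops)
{
    uint32_t node = hint[self][page];
    while (node != owner[page]) {       /* wrong guess: follow that node's hint */
        node = hint[node][page];
        (*hops)++;
    }
    return node;
}

int main(void)
{
    for (uint32_t n = 0; n < NODES; n++)        /* most hints are up to date */
        for (uint32_t p = 0; p < NPAGES; p++)
            hint[n][p] = owner[p];
    hint[5][2] = 9;                             /* node 5 has a stale hint... */
    hint[9][2] = owner[2];                      /* ...but node 9 knows the owner */

    int hops = 1;                               /* the first request is one hop */
    uint32_t o = find_owner(5, 2, &hops);
    printf("page 2 owned by node %u, found in %d hop(s)\n", o, hops);
    return 0;
}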
RDMA Engine
• In-kernel implementation
• Optimised memory use
• Fabric agnostic (as long as the fabric exposes RDMA verbs; an example verbs call follows the figure below)
• Tested on: iWARP (Chelsio), Infiniband (ConnectX-2, QDR), SoftiWARP, RoCE, SoftRoCE
• Multiplexes groups/apps
• Flow control / multi-queue support
• Under development:
  • Multi-NIC support for a single group
  • NIC multiplexing (active/active failover)
[Figure: three nodes, each running several apps over a shared RDMA engine and local RAM.]
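For context, the snippet below shows how a single 4 KB page could be pulled with a one-sided RDMA READ using the standard user-space verbs API (libibverbs); the in-kernel engine uses the kernel verbs interface instead, and queue-pair setup and completion polling are elided here.

/* Sketch of pulling one page from a remote node's registered memory with
 * a one-sided RDMA READ via libibverbs. Connection management is elided;
 * the queue pair 'qp' is assumed to be connected to the sponsoring node,
 * and 'local_buf' must lie in the registered memory region 'mr'. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

int post_page_read(struct ibv_qp *qp, struct ibv_mr *mr,
                   void *local_buf, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = PAGE_SIZE,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = (uintptr_t)local_buf;  /* track the completion */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_READ;      /* one-sided read */
    wr.send_flags          = IBV_SEND_SIGNALED;     /* generate a CQE */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}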
Raw Performance
How fast can we move memory pages?
Hard Page Fault Resolution Performance (10 GB sequential R/W scan)

Fabric             | Avg. resolution time, 4 KB page (µs) | Page IOPS, 4 KB (with prefetch)
SoftiWARP (10 GbE) | 330                                  | 50k/s
iWARP (10 GbE)     | 28                                   | 250k/s
Infiniband (QDR)   | 16                                   | 650k/s
Average Compounded Page Fault Resolution Time (R/W with Prefetch)

[Chart: average compounded resolution time (1,000-6,000 µs) for 1-8 threads; series for iWARP 10 GbE and Infiniband 40 Gbps under sequential, binary-split and random-walk access, with average lines for iWARP and IB.]
Use Case: In-Memory Database
Scaling Out HANA Memory
Memory Scale-out for SAP HANA – Benchmark

Hardware:
• Server with Intel Xeon (Westmere): 4 sockets, 10 cores each, 1 TB RAM
• Fabric:
  • Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2
  • 10 GbE Ethernet switch + Chelsio T422 NIC

Virtual Machine:
• Large size: 1 TB RAM, 40 vCPUs
• Hypervisor: KVM
• Note: 5-7% overhead due to virtualization vs. bare metal
• Application: SAP HANA (in-memory database)
• Workload: OLAP (TPC-H variant)
  • Data size ~2.5 TB uncompressed => ~300 GB compressed
  • 18 different queries
  • 15 iterations of each query set
Use Case: Scaling Out HANA Virtual Memory
• The HANA VM sees 4 TB of RAM but is constrained to 1 TB of physical RAM (scratch pad) on its host.
• The other hosts each run Heca in the kernel plus a memory process (an empty VM) with 1 TB of virtual memory allocated, exposed to the VM over the RDMA fabric (a sponsor-process sketch follows the figure below).
• We never consume more than 4 TB of physical RAM at any time.
[Figure: one compute node running the HANA VM and several memory nodes, each with Heca in the kernel and an empty memory-process VM, connected by the RDMA fabric.]
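In the spirit of the earlier user-space sketch, a memory node's "memory process" could look like the following: reserve a large, lazily populated region and offer it to the compute node. The device path, ioctl and structure are the same invented placeholders as before, not the real interface.

/* Hypothetical sketch of a "memory sponsor" process on a memory node. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

struct heca_sponsor_req {            /* assumed request format */
    uint64_t addr;                   /* start of the sponsored region */
    uint64_t length;                 /* bytes offered to the demander */
};

#define HECA_IOC_SPONSOR 0x4801      /* placeholder ioctl number */

int main(void)
{
    size_t len = 1UL << 40;          /* 1 TB of virtual memory */

    /* Anonymous, lazily-populated region: physical pages are only
     * consumed when the remote VM actually pushes data into them. */
    void *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    int fd = open("/dev/heca", O_RDWR);              /* assumed device */
    if (fd < 0) { perror("open"); return 1; }

    struct heca_sponsor_req req = { .addr = (uintptr_t)region, .length = len };
    if (ioctl(fd, HECA_IOC_SPONSOR, &req) < 0) { perror("ioctl"); return 1; }

    pause();                         /* keep the sponsored region alive */
    return 0;
}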
Scaling Out HANA (Half of Memory Remote)

[Chart: query set completion time (seconds, 0-500) vs. number of users (1, 20, 40, 60) for the Baseline and Half-Half configurations.]

No. of users | HECA overhead per query set
1            | ~3%
20           | ~3%
40           | ~3%
60           | ~3%

Virtual Machine:
• 1 TB RAM, 40 vCPUs
• Application: HANA (in-memory database)
• Workload: OLAP (TPC-H variant)
• ~300 GB compressed data set

Hardware:
• Intel Xeon (Westmere): 4 sockets, 10 cores, 1 TB RAM
• Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2
Per Query Overhead
[Chart: per-query overhead (-50% to 200%) for each of the 18 queries, with 1, 20 and 40 users.]
Scaling Out HANA

Virtual Machine:
• 1 TB RAM, 40 vCPUs
• Application: HANA (in-memory database)
• Workload: OLAP (TPC-H variant), 80 users
• ~300 GB compressed data set

Memory ratio | HECA overhead per query set (80 users)
1:2          | 4%
1:3          | 5.6%
2:1:1        | 0.9%
1:1:1        | 2%

Hardware:
• Intel Xeon (Westmere): 4 sockets, 10 cores, 1 TB RAM
• Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2
High Availability of Memory Nodes (RRAIM) Running HANA

Memory ratio (Medium 128 GB - SAP-H benchmark - 80 users) | RRAIM overhead vs. Heca
1:2                                                       | -0.2%
1:3                                                       | -0.1%

[Figure: HA mirroring - each memory region of the compute node is backed by RAM on two remote nodes.]

• Each memory region is backed by two remote nodes.
• Remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
• Failure of a node has no immediate effect on the computation node.
• When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
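An illustrative pseudocode of the mirrored eviction path follows: the page is written to both remote nodes before the local frame is freed, so either copy can later satisfy a fault. Helper names and types are invented and this is not the actual implementation.

/* Illustrative pseudocode: RRAIM-style eviction to two mirror nodes. */
static int push_page_mirrored(struct page_ref *pg,
                              struct remote_node *primary,
                              struct remote_node *mirror)
{
    uint64_t addr_a = sponsor_alloc_slot(primary, PAGE_SIZE);
    uint64_t addr_b = sponsor_alloc_slot(mirror, PAGE_SIZE);

    /* Issue both RDMA writes back to back so they proceed in parallel. */
    rdma_write_async(primary, addr_a, pg->data, PAGE_SIZE);
    rdma_write_async(mirror,  addr_b, pg->data, PAGE_SIZE);

    /* Free the local frame only after both copies are durable, so a
     * single node failure never loses the page. */
    rdma_wait_completion(primary);
    rdma_wait_completion(mirror);

    set_remote_pte_mirrored(pg->owner_mm, pg->vaddr,
                            primary->id, addr_a, mirror->id, addr_b);
    free_local_frame(pg);
    return 0;
}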
Enterprise-Class Features

• Online RAM deduplication (via KSM; example after this list)
  [Figure: memory footprint before and after deduplication.]
• Automatic memory tiering
  [Figure: tiers ranging over local memory, remote memory, compressed memory, SSD/NVRAM and HDD, across local and remote nodes.]
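The deduplication feature builds on the kernel's KSM. The small, self-contained example below shows how a region is opted into KSM with madvise(MADV_MERGEABLE), which is the standard Linux interface; it requires CONFIG_KSM and /sys/kernel/mm/ksm/run set to 1.

/* Minimal example of registering a memory region with KSM, the kernel
 * same-page merging feature behind online RAM deduplication. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 64UL << 20;                      /* 64 MB */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Fill with identical pages so the KSM scanner has something to
     * merge into shared, copy-on-write pages. */
    memset(buf, 0xAB, len);

    /* Mark the region as mergeable; KSM deduplicates it in the
     * background (progress is visible under /sys/kernel/mm/ksm/). */
    if (madvise(buf, len, MADV_MERGEABLE) != 0) {
        perror("madvise(MADV_MERGEABLE)");
        return 1;
    }

    puts("region registered with KSM");
    pause();                                      /* keep the mapping alive */
    return 0;
}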
Next
Work in progress and possible future directions
Transitioning to a Memory Cloud - Transparent Cloud Integration

[Figure: memory cloud management services (OpenStack, Heca-NOVA) coordinating many physical nodes hosting a variety of VMs: compute VMs (memory demanders), memory VMs (memory sponsors) and combination VMs (both sponsor and demander).]

PoC Q4 2013
• Automatically reclaim underutilized or abandoned memory resources
• Automatically and intelligently redeploy memory workloads across the infrastructure
Hecatonchire Cloud
• A hierarchical, coordinated, global system can set and manage power budgets, respond to faults, and support enclave components that leverage machine learning
• Resource management is hierarchical, and managers are stackable
• Resource managers are integrated
• Resource managers are customizable and adaptable
• Sharing is avoided whenever possible
• Strict enforcement is costly
Barebone Memory Scale-Out and Fine-Grained Memory Sharing

Barebone memory scale-out (PoC Q4 2013):
• No need for virtualization and/or a hypervisor
• Similar to Linux control groups (cgroups)
• Allows barebone memory scale-out for HANA or any other application

Fine-grained distributed shared memory for user space (PoC Q4 2013):
• Memory coherence models:
  • Shared nothing
  • Write invalidate
  • Read/write protection
• POSIX-style shared memory (see the sketch after this list)
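For reference, the self-contained example below shows the plain POSIX shared-memory interface (shm_open + mmap) whose style the fine-grained user-space DSM aims to follow; how Hecatonchire would extend such a segment across nodes is not shown.

/* Minimal POSIX shared-memory example; link with -lrt on older glibc. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/heca_demo";          /* segment name visible to peers */
    size_t len = 4096;

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return 1; }

    /* Any cooperating process that shm_open()s the same name sees this
     * write; a distributed version would extend the same semantics
     * across nodes, with coherence handled by the engine. */
    strcpy(seg, "hello from the sharing process");

    munmap(seg, len);
    close(fd);
    shm_unlink(name);                          /* clean up the segment */
    return 0;
}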
Hierarchical Memory
• In the future a significant portion of memory will be non-volatile
  • Helps reduce power
  • Helps with resilience
  • Helps with cost
• Aim to offer transparent use of storage-class memory, and/or to ease its use
• Transparent integration with storage-class memory:
  • Local NVRAM
  • Remote NVRAM
  • Hybrid DRAM/NVRAM
  • Interfacing directly with NVMe
  • Persistence / HA
CPU Aggregation

• Leverage ACPI virtualization features from Intel/AMD
• “Titanomachie”: smart vCPU and data scheduling
  • Mixing prefetch, vCPU migration/placement, and data/computation reordering
• Variable granularity support
  • From cache line
  • To huge page
Incremental effort
• Prefetch
• Support for new fabrics
• Dirty-bit tracking and hardware features from next-generation AMD/Intel architectures
• MR-IOV support
Thank You!
Contact information:
Dr. Benoit Hudzia [email protected]
Hecatonchire Project:
WWW: http://www.hecatonchire.com
Github: https://github.com/hecatonchire/
Backup Slides
Raw Bandwidth Usage
HW: 4-core i5-2500 CPU @ 3.30 GHz - SoftiWARP 10 GbE - iWARP Chelsio T422 10 GbE - IB ConnectX-2 QDR 40 Gbps

[Chart: total Gbit/s (0-25) for 1-7 threads, with SoftiWARP, iWARP and Infiniband series under sequential, binary-split and random walks over 1 GB of shared RAM.]

• Maxing out bandwidth
• Not enough cores to saturate (?)
• No degradation under high load
• Software RDMA has significant overhead
Enabling Live Migration of HANA DB (Small Instance)
                                   | Baseline | Pre-Copy (Standard)                            | Post-Copy (Heca)
Downtime                           | N/A      | 7.47 s                                         | 675 ms
Performance degradation (80 users) | 0%       | Benchmark failed (HANA crash, VM unresponsive) | 5%
Redundant Array of Inexpensive RAM: RRAIM
1. Each memory region is backed by two remote nodes. Remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
2. Failure of a node has no immediate effect on the computation node.
3. When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
Quicksort Benchmark with Memory Constraint
Memory ratio (constraint using cgroup) | HECA overhead | RRAIM overhead
3:4                                    | 2.08%         | 5.21%
1:2                                    | 2.62%         | 6.15%
1:3                                    | 3.35%         | 9.21%
1:4                                    | 4.15%         | 8.68%
1:5                                    | 4.71%         | 9.28%

[Charts: DSM and RRAIM overhead (0-10%) vs. memory ratio (3:4 to 1:5) for quicksort benchmarks with 512 MB, 1 GB and 2 GB datasets.]
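The memory constraint in this benchmark was applied with cgroups; a minimal sketch of doing so with the cgroup v1 memory controller (the interface current in 2013) follows. The group name, limit value and benchmark binary are placeholders, and the memory controller is assumed to be mounted at /sys/fs/cgroup/memory.

/* Sketch: cap a benchmark at 512 MB of physical memory via cgroup v1. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s", value);
    return fclose(f);
}

int main(void)
{
    /* Create the group and set its physical memory limit (512 MB). */
    mkdir("/sys/fs/cgroup/memory/quicksort", 0755);
    write_file("/sys/fs/cgroup/memory/quicksort/memory.limit_in_bytes",
               "536870912");

    /* Move this process into the group, then exec the benchmark so it
     * inherits the constraint (binary name is a placeholder). */
    char pid[32];
    snprintf(pid, sizeof(pid), "%d", getpid());
    write_file("/sys/fs/cgroup/memory/quicksort/tasks", pid);

    execlp("./quicksort_bench", "./quicksort_bench", (char *)NULL);
    perror("execlp");
    return 1;
}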