
The Road to a Hyper-Converged OpenStack
OpenStack Israel, June 2015

Muli Ben-Yehuda, Chief Scientist, Stratoscale

A Brief History of Time: Datacenter Architectures


Standalone “Converged” Servers

Performance / Reliability / Price

• Reliability (locality)
• Manageability
• Efficiency


Split Infrastructure

LAN

SAN

Performance / Reliability / Price

• VM Admin
• Storage Admin
• Expensive Fabric
• Expensive Appliances


Hyper-Converged Infrastructure

Performance / Reliability / Price

Every node is both a compute node and a storage node

The commoditization of the interconnect allows everyone to build a cluster of servers

A Briefer History of Time: Hyper-Converged Infrastructure


The Recipe for 1st Gen Hyper-Convergence

[Diagram: servers built from RAM, disk, and NIC, each running both Storage and Compute]

Two Black Boxes:
1) Control Plane
2) Data Plane

• Distributed Storage
• Virtualization

But: Shared Fabric

Let’s Build a Great Hyper-Converged Solution Based on OpenStack


Features/Benefits

1. Software-Only

2. Run Anything (VMs & Containers)

3. Store Everything (Enterprise-grade Storage)

4. Single Infrastructure (Anti-Silo, Cloud-like)

[Diagram: four servers, each with RAM, disk, and NIC; the hyper-converged control and data planes run across all of them, hosting the workloads]

1. Performant Data Center/Cloud

2. Efficient Resource Utilization

3. Single Pane of Glass (Manageability)

4. Scalability & Reliability


Considerations

Failure Domain

Storage dictates the Failure Domain

Hardware Heterogeneity

The initial cluster needs similar node types so that storage is balanced equally across nodes

Large deployments: heterogeneity works well, e.g. blades for compute and large servers for storage

Storage pools can be used to separate failure domains as well (see the sketch after this list):

Flash and HDD – speed/density of storage

Persistent and ephemeral storage, cold and hot data: smarter allocation of the physical storage while still adhering to failure-domain rules

Pass-through to directly attached storage (an optimization for ephemeral use cases)
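To make the pool idea concrete, here is a minimal sketch (hypothetical node names, racks, and tiers, not Stratoscale's actual placement code) of picking a storage pool by tier and spreading replicas across distinct racks so they land in separate failure domains:

from collections import namedtuple

Node = namedtuple("Node", "name rack tier")

# Hypothetical pools: one per storage tier (flash vs. HDD).
POOLS = {
    "ssd": [Node("n1", "rackA", "ssd"), Node("n2", "rackB", "ssd"), Node("n3", "rackC", "ssd")],
    "hdd": [Node("n4", "rackA", "hdd"), Node("n5", "rackB", "hdd")],
}

def place_replicas(tier, replica_count):
    """Pick replica_count nodes from the requested tier, at most one per rack."""
    chosen, used_racks = [], set()
    for node in POOLS[tier]:
        if node.rack not in used_racks:
            chosen.append(node)
            used_racks.add(node.rack)
        if len(chosen) == replica_count:
            return chosen
    raise RuntimeError(f"not enough distinct racks for {replica_count} replicas")

print(place_replicas("ssd", 2))  # two SSD nodes in two different racks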

Containers

Most Containers are stateless/ephemeral

Performance:

Serving storage requires CPU cycles, memory and bandwidth.

A node serving a disproportionate share of the storage should be configured with more CPU capacity, memory, and NICs (a rough sizing sketch follows below)

Topology

A node serving a disproportionate share of the storage will also be hit with more storage requests, so there are topology considerations too
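A rough back-of-the-envelope sketch of that sizing rule (the capacities and the simple proportionality assumption are illustrative, not measured figures):

# Storage capacity per node, in GB (illustrative numbers).
nodes = {"n1": 10_000, "n2": 10_000, "n3": 40_000}
total = sum(nodes.values())

# A node holding a disproportionate share of the storage will, on average,
# serve a proportional share of the storage requests, so give it a matching
# share of the CPU, memory, and NIC budget reserved for serving storage.
for name, capacity in nodes.items():
    share = capacity / total
    print(f"{name}: ~{share:.0%} of storage traffic expected "
          f"-> ~{share:.0%} of the storage CPU/RAM/NIC budget")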

The Building Blocks (a bit more than that, actually)


The Sub-Systems

Storage

Compute

Networking


Storage

Server-side Storage

Performant & Scalable (No Metadata Bottlenecks)

Predictable Performance

Robust & Resilient

Heterogeneity (SSD/HDD)

Storage Tiering

Predominantly Volume-Only Management

Fine-Grained Control of the system

[Diagram: a single namespace of virtual volumes spread across the physical nodes; each workload sees its volume as a local mount point]


Compute

Predictable Performance

Robust & Resilient

Heterogeneity (Windows/Linux/Containers)

Workload SLAs

Scalable (Eliminate Metadata Bottlenecks)

Predominantly VM-Only Management

Fine-Grained Control of the system


Inter-Connect (It’s Shared!)

The Control Plane (or, how things are really done)


Traits of the Solution

1. Scalable Installation Process

2. Distributed Systems Best Practices (Control Plane; a sketch of the discovery and self-healing practices follows this list):
   - No single point of failure
   - Service Discovery
   - Load Balancing
   - Self-Healing
   - Eventual Consistency

3. Cluster-Wide Resource Load-Balancing:
   - Interference
   - Contention
   - Optimizations

4. Managing Interference (Analytics)

5. Multiple Data Centers
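A minimal, self-contained sketch of two of those control-plane practices, service discovery and self-healing (the in-memory registry, the health-check input, and the fixed replica target are stand-ins for illustration, not the actual implementation):

import random

registry = {}  # service name -> set of nodes running it (service discovery)

def register(service, node):
    registry.setdefault(service, set()).add(node)

def discover(service):
    """Load balancing: hand back any currently registered instance."""
    return random.choice(sorted(registry[service]))

def self_heal(service, healthy_nodes, desired=3):
    """Self-healing: drop instances on dead nodes, start new ones elsewhere."""
    registry[service] &= healthy_nodes
    for node in sorted(healthy_nodes - registry[service]):
        if len(registry[service]) >= desired:
            break
        register(service, node)

for n in ("node1", "node2", "node3"):
    register("api", n)
self_heal("api", healthy_nodes={"node2", "node3", "node4"})  # node1 died
print(sorted(registry["api"]))  # ['node2', 'node3', 'node4'] -> still 3 copies
print(discover("api"))          # any one of them serves the next request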

Run Anything

Software Driven

VM

Store everything

Single infrastructure

Developer Friendly

Open Platform


Single Image

Management plane not dedicated to a specific VM or bare-metal server

No sizing exercise to figure out how to deploy management systems

Consensus-based decision making

Even if more than half of the cluster is lost, all management processes remain available!

All nodes run a subset of the processes required for the cloud to function
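A toy sketch of that "single image" idea: management services are not pinned to a dedicated controller node; every node is assigned a subset of them (the service names and the round-robin assignment are placeholders, not the actual component list):

SERVICES = ["api", "scheduler", "kv-store", "metrics", "image-service"]
NODES = ["node1", "node2", "node3", "node4", "node5"]
COPIES = 3  # each management service runs on three nodes

def assign(services, nodes, copies):
    """Round-robin each service onto `copies` nodes; no node runs everything."""
    plan = {n: [] for n in nodes}
    for i, svc in enumerate(services):
        for j in range(copies):
            plan[nodes[(i + j) % len(nodes)]].append(svc)
    return plan

for node, svcs in assign(SERVICES, NODES, COPIES).items():
    print(node, svcs)  # every node runs some, but not all, of the services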


Distributed


Distributed Storage

[Diagram: each VM accesses its virtual disks through a VirtIO driver; a block storage client and a block storage server run on every physical server, with block storage management coordinating the shared storage pool spread across all servers]
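A minimal sketch of the client/server split in that picture: the block storage client on each host maps a (virtual disk, block) pair to the physical server that stores it. The hash-based placement below is purely illustrative; it just shows that any client can reach any block without a central metadata server.

import hashlib

SERVERS = ["server1", "server2", "server3", "server4", "server5"]

def owner(volume_id, block_index, servers=SERVERS):
    """Deterministically map a block of a virtual disk to the server holding it."""
    key = f"{volume_id}:{block_index}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return servers[digest % len(servers)]

# The VirtIO-backed virtual disk the VM sees is just a range of blocks;
# each block I/O is forwarded to the owning server's block storage service.
for block in range(4):
    print("vol-123 block", block, "->", owner("vol-123", block))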


QoS & Traffic aggregation

Workload SLA: min/max bandwidth
Workload SLA: max latency
Workload SLA: absolute priority
Block storage requires bandwidth
Re-balancing requires min latency

Rate Shaper, Link Scheduler, Policy Arbiter, Traffic Aggregator
High-Speed Data Path
(a toy scheduling sketch follows below)
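A toy sketch of how a rate shaper and policy arbiter could share the fabric: each traffic class carries an SLA (here just a bandwidth budget and a priority), and the scheduler grants higher-priority classes first. Class names, priorities, and numbers are illustrative only.

# Per-class SLA: a priority (lower = more important) and a bandwidth budget.
classes = {
    "block-storage": {"priority": 1, "budget_mbps": 400},
    "workload-a":    {"priority": 2, "budget_mbps": 300},
    "rebalancing":   {"priority": 3, "budget_mbps": 100},
}

def schedule(pending, link_mbps=1000):
    """pending: {class_name: requested_mbps}; grant bandwidth by priority."""
    grants = {}
    for name in sorted(pending, key=lambda n: classes[n]["priority"]):
        grant = min(pending[name], classes[name]["budget_mbps"], link_mbps)
        grants[name] = grant
        link_mbps -= grant
    return grants

print(schedule({"block-storage": 500, "workload-a": 600, "rebalancing": 200}))
# -> block-storage capped at 400, then workload-a at 300, then rebalancing at 100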


Topology/Failure Domain

Rack A / Datacenter 1, Rack B / Datacenter 1, Rack C / Datacenter 2

"Rack-Scale Computing": a rack failure should not create havoc

Async replication across datacenters (a small placement sketch follows below)
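A small sketch of that topology rule (the rack and datacenter layout is invented for illustration): replicas are written synchronously to distinct racks in the local failure domain and shipped asynchronously to the remote datacenter.

# rack -> datacenter (illustrative layout)
TOPOLOGY = {"rackA": "dc1", "rackB": "dc1", "rackC": "dc2"}

def replicate(block, local_dc="dc1", sync_copies=2):
    """Sync replicas stay in the local DC on different racks; the remote DC gets an async copy."""
    local_racks = [r for r, dc in TOPOLOGY.items() if dc == local_dc][:sync_copies]
    remote_racks = [r for r, dc in TOPOLOGY.items() if dc != local_dc]
    print(f"{block}: sync write to {local_racks}, async ship to {remote_racks}")

replicate("vol-9/block-42")  # sync to rackA+rackB (dc1), async to rackC (dc2)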


Cluster Load-Balancing & Workload Profiling

Workload → Admission → Profiling → Classification Engine → Placement Engine → Running

● Analytics layer collecting time-series performance metrics
● CPU scheduling, network throttling, low-latency live migration
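A toy version of that pipeline: profile a workload from its time-series metrics, classify it, and let the placement engine pick the node with the most headroom for that class. The thresholds, classes, and node data are made up for illustration.

def classify(cpu_series, io_series):
    """Classification engine: label the workload from its profiled metrics."""
    if sum(io_series) / len(io_series) > 70:
        return "io-bound"
    if sum(cpu_series) / len(cpu_series) > 70:
        return "cpu-bound"
    return "balanced"

NODES = {
    "node1": {"free_cpu": 20, "free_iops": 90},
    "node2": {"free_cpu": 80, "free_iops": 30},
}

def place(workload_class):
    """Placement engine: pick the node with the most headroom for this class."""
    key = "free_iops" if workload_class == "io-bound" else "free_cpu"
    return max(NODES, key=lambda n: NODES[n][key])

cls = classify(cpu_series=[30, 40, 35], io_series=[80, 90, 85])
print(cls, "->", place(cls))  # io-bound -> node1 (most free IOPS)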


Config Space

Consul.io

• Consensus-based
• Key/Value store
• Exposed as DNS
• Provides a local cache
• Used for HA
• Used for LB
• Problem: strongly consistent
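For illustration, a minimal sketch of using Consul's KV store as that config space over its HTTP API (it assumes a local Consul agent on the default port 8500; the key name is made up):

import base64
import json
import urllib.request

BASE = "http://127.0.0.1:8500/v1/kv"

def put(key, value):
    req = urllib.request.Request(f"{BASE}/{key}", data=value.encode(), method="PUT")
    return urllib.request.urlopen(req).read() == b"true"

def get(key):
    with urllib.request.urlopen(f"{BASE}/{key}") as resp:
        entry = json.loads(resp.read())[0]            # Consul returns a list of entries
    return base64.b64decode(entry["Value"]).decode()  # values come back base64-encoded

put("cluster/storage/replica_count", "3")
print(get("cluster/storage/replica_count"))  # -> "3"

Services registered with the agent are also resolvable over DNS (for example, a hypothetical storage-api.service.consul), which is how discovery and load balancing are exposed to clients without a Consul-specific library.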


Post Copy Live Migration (PCLM)

The mechanism Stratoscale uses to migrate VMs from one node to another.

Only moves the "working set" memory, that is, the pages actively being used by CPU threads at that time.

Moves very small amounts of memory at a time. Example: 200 MB of RAM now, then the rest of the memory image at a later time, once the network has freed up or when a page is faulted in over the network.
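A toy simulation of that behaviour (the page counts and data are invented): push only the working set up front, resume the VM on the destination, and fault the remaining pages over the network on first access.

class PostCopyMigration:
    def __init__(self, all_pages, working_set):
        self.remote = dict(all_pages)  # pages still resident on the source node
        # Push only the working set before resuming the VM on the destination.
        self.local = {p: self.remote.pop(p) for p in working_set}

    def read(self, page):
        if page not in self.local:                    # post-copy page fault:
            self.local[page] = self.remote.pop(page)  # fetch it over the network
        return self.local[page]

pages = {i: f"data-{i}" for i in range(1000)}          # the VM's memory image
mig = PostCopyMigration(pages, working_set=range(50))  # the hot pages only
print(len(mig.local), "pages moved before resuming")       # 50
mig.read(700)                                              # cold page, faulted in on demand
print(len(mig.local), "pages resident after one fault")    # 51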

Wrapping Up


To Conclude

1. Software-Only

2. Run Anything (VMs & Containers)

3. Store Everything (Enterprise-grade Storage)

4. Single Infrastructure (Anti-Silo, Cloud-like)

[Diagram: four servers, each with RAM, disk, and NIC; the hyper-converged control and data planes run across all of them, hosting the workloads]

1. Performant Data Center/Cloud

2. Efficient Resource Utilization

3. Single Pane of Glass (Manageability)

4. Scalability & Reliability