Neutron Deployment at Scale Igor Bolotin, Cloud Architecture Vinay Bannai, SDN Architecture

Neutron scaling


Deploying Neutron at scale in a large enterprise environment. This presentation was given at the OpenStack Atlanta 2014 summit.


Page 1: Neutron scaling

Neutron Deployment at Scale

Igor Bolotin, Cloud Architecture

Vinay Bannai, SDN Architecture

Page 2: Neutron scaling

eBay Inc. enables commerce by delivering flexible and scalable solutions that foster merchant growth.

About eBay Inc

With 145 million active buyers globally, eBay is one of the world's largest online marketplaces, where practically anyone can buy and sell practically anything.

With 148 million registered accounts in 193 markets and 26 currencies around the world, PayPal enables global commerce, processing almost 8 million payments every day.

eBay Enterprise is a leading provider of commerce technologies, omnichannel operations and marketing solutions. It serves 1000 retailers and brands.

Page 3: Neutron scaling

• Business case

• Cloud at eBay Inc

• Deployment Patterns

• Problem Areas

• How we addressed them

• Future Direction

• Summary

• Q & A

Outline of the Presentation

Page 4: Neutron scaling

• Agility

− Reduce time to market

− Enable innovation

• Efficiency

− Elastic scale

− Reduce overall cost

• Multi-Tenancy

• Availability

• Security & Compliance

• Software Enabled Data Centers

What do our businesses need?

Page 5: Neutron scaling

Cloud at eBay Inc

[Diagram: eBay Inc Cloud. Labels: Global Orchestration; four Regions/DCs; Identity & Image Management; AZs; OpenStack Controllers; Nova Cells]

Page 6: Neutron scaling

• Private Cloud for all eBay Inc properties

• Global Orchestration with traffic and load balancing

• Identity Management

− Region level (eventually global)

• Image Management

− Region level

• Nova/Cinder/Neutron

− Availability Zones

− Active/Active servers

• Trove

• Zabbix for monitoring

• All services run behind a load balancer VIP

Deployment Patterns

Page 7: Neutron scaling

• Shared Cloud between tenants

• Different types of tenants

− eBay Production, PP production, StubHub, GSI Enterprise, etc.

− Dev/QA

− Sandbox environment, internal tenants (IT, VPCs)

• Production Traffic

− All bridged and no overlays

− No DHCP

• Dev/QA and some of the internal tenants

− Overlays

− DHCP

Deployment Patterns (contd.)

Page 8: Neutron scaling

[Diagram: Gateway Nodes and Physical Racks]

Page 9: Neutron scaling

• Hypervisor Scale Out

• Overlay Networks

• Bridged Networks

• Neutron Services (DHCP, Metadata, API server)

• SDN Controllers

• Network Gateway Nodes

• Upgrade

Areas With Scale Issues

Page 10: Neutron scaling

Hypervisor Scale Out

[Diagram: Nova API, Nova Sched, Nova Cells, Neutron API, DHCP Agent and SDN Controller instances]

Page 11: Neutron scaling

• Several hundred hypervisors in a cell

• Multiple cells in an AZ

• Several thousand hypervisors in an AZ

• Nova cells mitigate hypervisor scale

• Neutron scaling

− The majority of the hypervisors support bridged VMs

− Hybrid mode with both overlay and bridged VMs

Hypervisor Scale Out

Page 12: Neutron scaling

[Diagram: Network Virtualization Layer; L2 and L3 segments; VMs; tenant on overlay network; tenant on bridged network]

Bridged and Overlay

Page 13: Neutron scaling

• Overlay technology

− VXLAN

− STT

• Handling BUM traffic

− ARP

− Unknown unicast

− Multicast

• Logical switches and routers

• Distributed L3 routers

− Direct tunnels from hypervisor to hypervisor

• Scale out deployment of Gateway nodes

Overlay Networks

Page 14: Neutron scaling

• Keystone tokens

• Single-threaded Quantum/Neutron server

• DHCP Servers

• Heal Instance Info Cache Interval

Neutron Services

Page 15: Neutron scaling

Keystone Token Generation and Authentication

[Diagram: three clients; Nova, Neutron, Cinder and Image servers; Keystone server]

• UUID-based token

– Must be authenticated by the Keystone server on every call

• PKI-based token

– Authenticated by the servers themselves using Keystone certificates

• Token caching

– Prevents unnecessary token creation

Page 16: Neutron scaling

• Applies to UUID-based tokens

− 98% of tokens are generated by inter-API services

− 92% are Quantum/Neutron related

− An average of 25 to 30 tokens/sec is created by Quantum/Neutron alone

− RPC call overhead, bloated token table

• Fix

− Use token caching (1 hour)

− Use PKI for the service tenant

− Reduces network chatter and improves performance

• OpenStack bugs

− Bug ID: 1191159

− Bug ID: 1250580

Token Caching
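
The token-caching fix listed on this slide can be sketched as follows. This is a minimal illustration, not the Keystone or Neutron implementation: `TokenCache`, `fetch_token` and the injectable `clock` are hypothetical names, and only the one-hour lifetime comes from the slide.

```python
import time
from typing import Callable, Dict, Tuple


class TokenCache:
    """Reuse a service token until it is older than ttl_seconds."""

    def __init__(self, fetch_token: Callable[[str], str],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_token = fetch_token   # hypothetical call to the identity service
        self._ttl = ttl_seconds           # 1 hour, as quoted on the slide
        self._clock = clock               # injectable so the sketch can be tested
        self._cache: Dict[str, Tuple[str, float]] = {}

    def get(self, service: str) -> str:
        now = self._clock()
        entry = self._cache.get(service)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]               # cache hit: no new token is created
        token = self._fetch_token(service)  # miss or expired: create one token
        self._cache[service] = (token, now)
        return token
```

With a cache like this, a service that previously requested a token for every inter-API call requests at most one per hour per cached key, which is what keeps the token table from growing.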

Page 17: Neutron scaling

• Prior to Havana

• One API thread handled both REST calls and RPC calls

• We split the API into two threads

− One handles REST API calls

− The other handles RPC calls

• Havana fixes

− DHCP lease renewals are no longer handled by the Neutron servers; dhcp_release is used instead

− Multi-worker support

Neutron Multi Worker
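
The two-thread split described on this slide can be illustrated with a small sketch. It uses plain Python threads and queues as an assumption for readability; it is not the Neutron server code, and `start_worker` and `run_split_server` are hypothetical names.

```python
import queue
import threading
from typing import Callable, List, Tuple

_STOP = object()  # sentinel telling a worker thread to exit


def start_worker(name: str, inbox: "queue.Queue",
                 handle: Callable[[str], None]) -> threading.Thread:
    """Run `handle` on every item taken from `inbox` in a dedicated thread."""
    def loop() -> None:
        while True:
            item = inbox.get()
            if item is _STOP:
                break
            handle(item)

    thread = threading.Thread(target=loop, name=name, daemon=True)
    thread.start()
    return thread


def run_split_server(rest_requests: List[str],
                     rpc_messages: List[str]) -> Tuple[List[str], List[str]]:
    """Process REST and RPC work on separate threads; return what each handled."""
    rest_done: List[str] = []
    rpc_done: List[str] = []
    rest_q: "queue.Queue" = queue.Queue()
    rpc_q: "queue.Queue" = queue.Queue()
    workers = [
        start_worker("rest-api", rest_q, rest_done.append),
        start_worker("rpc", rpc_q, rpc_done.append),
    ]
    for req in rest_requests:
        rest_q.put(req)
    for msg in rpc_messages:
        rpc_q.put(msg)
    for q in (rest_q, rpc_q):
        q.put(_STOP)       # queued last, so all earlier items are handled first
    for worker in workers:
        worker.join()
    return rest_done, rpc_done
```

Because each kind of work has its own queue and thread, a burst of RPC messages from agents does not sit in front of REST calls from API clients, and vice versa.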

Page 18: Neutron scaling

• All nova computes regularly poll neutron server

• To get network info of the instances running on the compute node

• Default is 10 seconds

• Hundreds of hypervisors and tens of thousands of VMs will add up

− Even though only one instance is checked in each interval

• We adjusted the interval to 600 seconds

Heal Instance Info Cache Interval
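
A quick back-of-the-envelope calculation shows why the interval matters. The sketch below assumes one network-info request per compute node per interval, as the slide describes; the hypervisor count of 3000 is a hypothetical figure standing in for "several thousand", and the two intervals are the values quoted on the slide.

```python
def poll_requests_per_second(hypervisors: int, interval_seconds: float) -> float:
    """Each compute node makes one network-info request per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return hypervisors / interval_seconds


HYPERVISORS = 3000  # hypothetical AZ size, for illustration only

before = poll_requests_per_second(HYPERVISORS, 10)   # interval quoted as the default
after = poll_requests_per_second(HYPERVISORS, 600)   # interval the presenters chose
print(before, after)  # 300.0 5.0
```

Under these assumptions the steady polling load on the Neutron server drops by a factor of 60.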

Page 19: Neutron scaling

• The most common source of problems

• We employ multiple strategies

− DHCP active/standby

− Planning to support DHCP active/active

− No DHCP/Config Drive option

• Production Environment

− No DHCP

− Config drive management

− Requires cloud-init-aware images

DHCP Scaling

Page 20: Neutron scaling

SDN Controllers

[Diagram: three SDN Controllers, three Neutron API servers, Nova API, and three OS Ctrl nodes]

Page 21: Neutron scaling

• Only with overlay networks

• Scale out architecture

• Problems with high CPU utilization

• Number of flows in the gateway node

• East-West traffic also hitting VIPs

− Load balancer running as an appliance on a hypervisor

− Using SNAT

• OVS Enhancements

− Use megaflows in Open vSwitch

− Multi-core version of ovs-vswitchd

Network Gateway Nodes

Page 22: Neutron scaling

• Prior to OVS 1.11, the kernel module supported only exact-match flows

• Megaflow introduces wildcarding in kernel module

• Fewer misses and punts to user space

• Reduced number of flows in kernel

• Requires OVS 1.11 or greater

• Cons

− Using security groups nullifies the effects of megaflows

OVS Improvements: Megaflows
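
The effect of wildcarding on cache misses can be illustrated with a toy model. This is a deliberately simplified sketch, not how Open vSwitch computes its masks: the packet layout, the key functions and the traffic pattern are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple

Packet = Tuple[str, str, int, int]  # (src_ip, dst_ip, src_port, dst_port)


def count_upcalls(packets: Iterable[Packet],
                  key: Callable[[Packet], tuple]) -> int:
    """Count flow-cache misses, i.e. packets punted to user space."""
    cache = set()
    misses = 0
    for pkt in packets:
        k = key(pkt)
        if k not in cache:
            misses += 1      # miss: user space would install a flow for this key
            cache.add(k)
    return misses


def exact_match(pkt: Packet) -> tuple:
    return pkt               # exact-match cache: every header field is part of the key


def dst_ip_only(pkt: Packet) -> tuple:
    return (pkt[1],)         # wildcard every field the forwarding rules ignore


def dst_ip_and_ports(pkt: Packet) -> tuple:
    return (pkt[1], pkt[2], pkt[3])  # L4-aware rules force matching on ports


# 100 client connections (distinct source ports) to a single web server.
traffic = [("10.0.0.1", "10.0.0.2", 40000 + i, 80) for i in range(100)]
print(count_upcalls(traffic, exact_match))       # 100
print(count_upcalls(traffic, dst_ip_only))       # 1
print(count_upcalls(traffic, dst_ip_and_ports))  # 100
```

In this model, the third case shows why rules that inspect L4 ports, such as security groups, can push the miss count back up to the exact-match level.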

Page 23: Neutron scaling

OVS Improvements: Multi-Core

[Diagram: vswitchd in user space and the kernel module in kernel space, mapped onto CPU cores, for OVS < 2.0 versus OVS >= 2.0]

Page 24: Neutron scaling

• VPC model

• Neutron Tagging Blueprint

− Network Assignment for Bridged VMs

− Network Selection for VPC tenants

− Network Scheduling

− Additional metadata information

• Blueprint

− https://blueprints.launchpad.net/neutron/+spec/network-tagging

Future Work

Page 25: Neutron scaling

• There are two primary ways to plumb a VM into a network

− Pass the net-id to nova boot

− Create a port and pass it to nova boot

• Nova schedules the instance without much knowledge about the underlying network

• The blueprint proposes to address this issue as one of its use cases

Network Assignment for Bridged VMs
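
The two plumbing options can be written down as the NIC descriptions a client would pass at boot time. The sketch assumes the `net-id` and `port-id` keys used by the `nics` argument of python-novaclient's `servers.create` (the counterparts of `nova boot --nic net-id=...` and `--nic port-id=...`); the helper names and the UUIDs are hypothetical placeholders.

```python
from typing import Dict, List


def nics_for_network(net_id: str) -> List[Dict[str, str]]:
    """Option 1: hand Nova a network ID and let it create the port."""
    return [{"net-id": net_id}]


def nics_for_port(port_id: str) -> List[Dict[str, str]]:
    """Option 2: create the Neutron port first, then hand Nova the port ID."""
    return [{"port-id": port_id}]


# Placeholder UUIDs, for illustration only.
print(nics_for_network("11111111-1111-1111-1111-111111111111"))
print(nics_for_port("22222222-2222-2222-2222-222222222222"))
```

In both cases the network is fixed by the caller before scheduling, which is why the scheduler can end up placing a bridged VM on a host that has no access to that network.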

Page 26: Neutron scaling

[Diagram: a VM and four racks: Rack 1 with network N1 in FZ1, Rack 2 with N2 in FZ2, Rack 3 with N3 in FZ3, Rack 4 with N4 in FZ4]

Network Tagging

Page 27: Neutron scaling

• Know your requirements

• Understand your size and scale

• Pick the SDN controller based on your needs

• Design with multiple failure domains

• Overlay, Bridged or Hybrid

• Monitor your cloud for performance degradation

Summary

Page 28: Neutron scaling

Thank you.

Yes. We are hiring!
