OpenStack: Virtual Routers On Compute Nodes

Virtual Routers on Compute Nodes: A (Not So) Irrational Decision?

Sean Lynn

In the Beginning

• Neutron with OVS and VXLAN tenant networks
• Kilo release
• Virtual Routers hosted on three control nodes
• No HA routers

The Problem

• We had major network reliability issues
• Customers were being DOSed
• Environment was running out of capacity
• We had some misconfiguration that was hard to fix
• Network upgrade was months behind schedule
• Impact of control node failure was huge
• Need to reduce failure impact

The Dumb Idea

• Got together to brainstorm options
• Could we colocate routers with another service?
  – Spread routers
  – Spread load
  – Reduce failure group size
• What about compute nodes?
• Is this a bad idea?
• Why don’t other people do this?
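As a back-of-the-envelope illustration of "reduce failure group size", the sketch below estimates how many routers are affected when a single hosting node fails. It assumes routers are spread perfectly evenly across hosts, which real scheduling does not guarantee. The function name, the router count of 600, and the 100-compute-node figure are hypothetical values chosen for illustration; only the three control nodes come from the slides.

```python
def routers_lost_per_host_failure(total_routers: int, hosts: int) -> float:
    """Average number of routers affected when one hosting node fails.

    Assumes an even spread of routers across hosts (an idealization).
    """
    if hosts <= 0:
        raise ValueError("hosts must be positive")
    return total_routers / hosts


total_routers = 600  # hypothetical router count, not from the talk

# Routers hosted on three control nodes (the starting point in the slides).
on_control_nodes = routers_lost_per_host_failure(total_routers, 3)

# Routers spread over 100 compute nodes (hypothetical fleet size).
on_compute_nodes = routers_lost_per_host_failure(total_routers, 100)

print(f"control nodes: {on_control_nodes:.0f} routers per node failure")
print(f"compute nodes: {on_compute_nodes:.0f} routers per node failure")
```

Under these assumptions, the same total load is carried either way; what changes is how much of it a single node failure takes down.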

TWC Configurations and Testing

• Two dual-port 10G Intel X520 NICs
  – Cross-card, two-port LACP for tenant traffic
• 1U Cisco C220 rack mount servers
  – Intel E5-2650 processors
  – 256 GB RAM
• OpenStack Neutron
  – OVS
  – VXLAN
• Legacy Virtual Router testing:
  – Server bandwidth consumed? ALL of it!
  – Server RAM and CPU consumed? Negligible even with OVS

All The Options

• Traditional network nodes
  – Gigantic sized
  – Horizontally scaled smaller servers
• DVR - Distributed Virtual Router
• “VR-D” - Virtual Routers - Distributed
• Other solutions away from mainline Neutron

Network Nodes - Overview

Network Nodes - Detail

• Gigantic dedicated network nodes?
  – Servers are generally idle
  – Large failure domains
  – Long rebuild times
• Why not scale dedicated network nodes horizontally?
  – Servers are generally idle
• Resource Usage:
  – RAM and CPU - incredibly low.
  – Bandwidth!

DVR - A Layman’s Summary

• DVR - network nodes have less responsibility overall
• FloatingIP SNAT takes place on the compute node
  – L3 Agent is required on compute nodes
• FixedIP SNAT takes place on network nodes
  – Still requires a Virtual Router for external gateway
  – HA planned for Newton
• At TWC we were concerned about DVR’s:
  – Readiness for production?
  – Scale issues?
  – Operational tooling changes
  – Current use of many Floating IPs
  – Massive customer conversion to TWC OpenStack

DVR Packet Paths

● East-West between VMs (orange)
● North-South with a FloatingIP (purple)
● North-South without a FloatingIP (green)
● Other Cases (blue):
  ○ VMs on the same compute node.
  ○ VM and router on same compute node.

“VR-D” - what we made up

• “VR-D” is “Virtual Routers - Distributed”
• Traditional Virtual Routers that cohabitate with VMs on compute nodes!
• Servers running both L3 Agent and nova-compute
• Virtual Routers are either Legacy or HA
• Our current choice is not to include DHCP Agents
• What about?
  – VM and Virtual Router bandwidth contention?

VR-D Packet Paths

● East-West between VMs (orange)
● North-South with a FloatingIP (green)
● North-South without a FloatingIP (green)
  ○ Chief difference between VR-D and DVR!
● Other Cases (blue):
  ○ VMs on the same compute node.
  ○ VM and router on same compute node (shown)
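The north-south difference between the two sets of packet paths can be summarized in a small Python sketch. This is a simplified model built only from the bullets in the DVR and VR-D slides, not Neutron code: with DVR, FloatingIP traffic is translated on the VM's own compute node while FixedIP traffic goes through a network node; with VR-D, both cases go through whichever compute node hosts the tenant's router. All names here (`Mode`, `north_south_nat_host`, the host names) are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    DVR = "dvr"
    VR_D = "vr-d"


def north_south_nat_host(mode: Mode, has_floating_ip: bool,
                         vm_host: str, router_host: str) -> str:
    """Return the host where NAT happens for north-south traffic.

    router_host means:
      - DVR:  the network node holding the centralized SNAT router
      - VR-D: the compute node that happens to host the virtual router
    """
    if mode is Mode.DVR and has_floating_ip:
        # DVR handles FloatingIP traffic on the VM's own compute node.
        return vm_host
    # DVR without a FloatingIP, and every VR-D case, goes via the router host.
    return router_host


# Illustrative host names only.
print(north_south_nat_host(Mode.DVR, True, "compute-07", "network-01"))
print(north_south_nat_host(Mode.DVR, False, "compute-07", "network-01"))
print(north_south_nat_host(Mode.VR_D, True, "compute-07", "compute-42"))
print(north_south_nat_host(Mode.VR_D, False, "compute-07", "compute-42"))
```

In this model the two VR-D cases return the same host, which matches the slide drawing both north-south paths in the same color.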

Sean Lynn

Implementation and Automation

• Implementation: Surprisingly Easy
• Puppet to put l3-agent on all compute nodes
• Forgot about Metadata agent
• Manageable Issues

L3 Agent Scalability (LP#1498844)

• #1 problem we’ve encountered
• L3 agent queries handled by single thread in neutron-server
• Fixed in Mitaka, backport stalled

L3 Agent Scalability (LP#1498844)

• Rabbit queue for the L3 agents falls behind
• Status checks alone cause it to fall behind at 100 L3 agents
• Agent restarts request full state, which is a resource hog
• Rolling restarts had to be rate limited
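One way to rate limit rolling restarts is to restart agents in small batches with a pause between batches, so the full-state requests do not all hit neutron-server at once. The Python sketch below shows the idea only; it is not the tooling used at TWC. The `restart` callable, batch size, and pause length are hypothetical, and a real version would call whatever service manager or orchestration tool is in use.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(hosts: Iterable[str],
                    restart: Callable[[str], None],
                    batch_size: int = 5,
                    pause_seconds: float = 60.0,
                    sleep: Callable[[float], None] = time.sleep) -> List[List[str]]:
    """Restart hosts in batches, pausing between batches.

    Returns the batches in the order they were processed.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    host_list = list(hosts)
    batches: List[List[str]] = []
    for start in range(0, len(host_list), batch_size):
        batch = host_list[start:start + batch_size]
        for host in batch:
            restart(host)  # hypothetical hook: restart the L3 agent on this host
        batches.append(batch)
        if start + batch_size < len(host_list):
            sleep(pause_seconds)  # let the server and queue drain before continuing
    return batches


# Dry run with stand-ins for the restart and sleep calls.
demo_hosts = [f"compute-{i:02d}" for i in range(1, 8)]
done = rolling_restart(demo_hosts,
                       restart=lambda host: print(f"restarting l3 agent on {host}"),
                       batch_size=3,
                       pause_seconds=30.0,
                       sleep=lambda seconds: print(f"pausing {seconds:.0f}s"))
print(f"{len(done)} batches")
```

Passing `sleep` as a parameter keeps the pause testable; suitable batch sizes and pauses would need to be found by watching queue depth in the environment at hand.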

Operational Complexity

• One more thing to check when a node fails
• Tooling has to be updated
• Monitoring has to be updated

Where are we going?

Generally this “VR-D” solution is working well in production

• HA routers
• Custom router scheduling
• Routers on all compute nodes
• DVR?

Questions?

Clayton O’Neill
– [email protected]
– IRC: clayton
– Twitter: @clayton_oneill

Sean Lynn
– [email protected]
– IRC: trad511
