Virtual Routers on Compute Nodes: A (Not So) Irrational Decision?

OpenStack: Virtual Routers On Compute Nodes


Page 1: OpenStack: Virtual Routers On Compute Nodes

Virtual Routers on Compute Nodes: A (Not So) Irrational Decision?

Sean Lynn
Page 2: OpenStack: Virtual Routers On Compute Nodes

In the Beginning

• Neutron with OVS and VXLAN tenant networks
• Kilo release
• Virtual Routers hosted on three control nodes
• No HA routers

Page 3: OpenStack: Virtual Routers On Compute Nodes

The Problem

• We had major network reliability issues
• Customers were being DOSed
• Environment was running out of capacity
• We had some misconfiguration that was hard to fix
• Network upgrade was months behind schedule
• Impact of control node failure was huge
• Need to reduce failure impact

Page 4: OpenStack: Virtual Routers On Compute Nodes
Page 5: OpenStack: Virtual Routers On Compute Nodes

The Dumb Idea

• Got together to brainstorm options
• Could we colocate routers with another service?
  – Spread routers
  – Spread load
  – Reduce failure group size
• What about compute nodes?
• Is this a bad idea?
• Why don’t other people do this?
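The "reduce failure group size" argument is simple arithmetic, sketched below. This is an illustration only: the three control nodes come from the slides, while the router count, the number of compute nodes, and the even-spread assumption are hypothetical.

```python
from math import ceil


def routers_lost_per_node_failure(total_routers: int, hosting_nodes: int) -> int:
    """Worst-case number of routers affected when one hosting node fails.

    Assumes routers are spread as evenly as possible across the hosting
    nodes (an assumption for illustration, not a measured distribution).
    """
    if hosting_nodes <= 0:
        raise ValueError("hosting_nodes must be positive")
    return ceil(total_routers / hosting_nodes)


# Hypothetical numbers: 300 routers on 3 control nodes vs. 100 compute nodes.
on_control_nodes = routers_lost_per_node_failure(300, 3)
on_compute_nodes = routers_lost_per_node_failure(300, 100)
print(on_control_nodes, on_compute_nodes)
```

Under these assumed numbers, a single node failure takes out a third of all routers in the first layout and only a handful in the second.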

Page 6: OpenStack: Virtual Routers On Compute Nodes

TWC Configurations and Testing

• Two dual-port 10G Intel X520 NICs
  – Cross card, two port LACP for tenant traffic
• 1U Cisco C220 rack mount servers
  – Intel E5-2650 processors
  – 256 GB RAM
• OpenStack Neutron
  – OVS
  – VXLAN
• Legacy Virtual Router testing:
  – Server bandwidth consumed? ALL of it!
  – Server RAM and CPU consumed? Negligible, even with OVS

Page 7: OpenStack: Virtual Routers On Compute Nodes

All The Options

• Traditional network nodes
  – Gigantic sized
  – Horizontally scaled smaller servers
• DVR - Distributed Virtual Router
• “VR-D” - Virtual Routers - Distributed
• Other solutions away from mainline Neutron

Page 8: OpenStack: Virtual Routers On Compute Nodes

Network Nodes - Overview

Page 9: OpenStack: Virtual Routers On Compute Nodes

Network Nodes - Detail

• Gigantic dedicated network nodes?
  – Servers are generally idle
  – Large failure domains
  – Long rebuild times
• Why not scale dedicated network nodes horizontally?
  – Servers are generally idle
• Resource Usage:
  – RAM and CPU - incredibly low.
  – Bandwidth!

Page 10: OpenStack: Virtual Routers On Compute Nodes

DVR - A Layman’s Summary

• DVR - network nodes have less responsibility overall
• FloatingIP SNAT takes place on the compute node
  – L3 Agent is required on compute nodes
• FixedIP SNAT takes place on network nodes
  – Still requires a Virtual Router for external gateway
  – HA planned for Newton
• At TWC we were concerned about DVR’s:
  – Readiness for production?
  – Scale issues?
  – Operational tooling changes
  – Current use of many Floating IPs
  – Massive customer conversion to TWC OpenStack

Page 11: OpenStack: Virtual Routers On Compute Nodes

DVR Packet Paths

● East-West between VMs (orange)
● North-South with a FloatingIP (purple)
● North-South without a FloatingIP (green)
● Other Cases (blue):
  ○ VMs on the same compute node.
  ○ VM and router on same compute node.

Page 12: OpenStack: Virtual Routers On Compute Nodes

“VR-D” - what we made up

• “VR-D” is “Virtual Routers - Distributed”
• Traditional Virtual Routers that cohabitate with VMs on compute nodes!
• Servers running both L3 Agent and nova-compute
• Virtual Routers are either Legacy or HA
• Our current choice is not to include DHCP Agents
• What about?
  – VM and Virtual Router bandwidth contention?

Page 13: OpenStack: Virtual Routers On Compute Nodes

VR-D Packet Paths

● East-West between VMs (orange)
● North-South with a FloatingIP (green)
● North-South without a FloatingIP (green)
  ○ Chief difference between VR-D and DVR!
● Other Cases (blue):
  ○ VMs on the same compute node.
  ○ VM and router on same compute node (shown)
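The packet-path slides for DVR and VR-D boil down to one question: where does the routing happen for a given flow? A minimal sketch of that decision follows, assuming the simplified model in the slides. The `Mode` enum, the `routing_location` function, and the location strings are hypothetical names for illustration, not Neutron identifiers.

```python
from enum import Enum


class Mode(Enum):
    DVR = "dvr"
    VRD = "vr-d"


def routing_location(mode: Mode, north_south: bool, has_floating_ip: bool) -> str:
    """Return where a flow is routed under the simplified model in the slides."""
    if mode is Mode.VRD:
        # A traditional (legacy or HA) router handles every routed flow on
        # whichever compute node hosts that router.
        return "compute node hosting the router"
    if not north_south:
        # DVR: East-West traffic is routed on the source VM's compute node.
        return "source compute node"
    if has_floating_ip:
        # DVR: FloatingIP traffic is handled on the VM's compute node.
        return "source compute node"
    # DVR: North-South traffic without a FloatingIP goes via a network node.
    return "network node"


for mode in Mode:
    print(mode.value, routing_location(mode, north_south=True, has_floating_ip=False))
```

The last case is the one the slide calls out as the chief difference: with VR-D, North-South traffic without a FloatingIP never needs a dedicated network node.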

Page 14: OpenStack: Virtual Routers On Compute Nodes

Implementation and Automation

• Implementation: Surprisingly Easy
• Puppet to put l3-agent on all compute nodes
• Forgot about Metadata agent
• Manageable Issues

Page 15: OpenStack: Virtual Routers On Compute Nodes

L3 Agent Scalability (LP#1498844)

• #1 problem we’ve encountered
• L3 agent queries handled by single thread in neutron-server
• Fixed in Mitaka, backport stalled

Page 16: OpenStack: Virtual Routers On Compute Nodes

L3 Agent Scalability (LP#1498844)

• Rabbit queue for L3 agents falls behind
• It falls behind on status checks alone with 100 L3 agents
• Each agent restart requests a full state sync, which is a resource hog
• Rolling restarts had to be rate limited
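Rate limiting a rolling restart can be as simple as restarting agents in small batches with a pause between them, so neutron-server is not hit by many full-state requests at once. The sketch below shows that pattern only; it is not the tooling used at TWC. The `restart` callable, the batch size, and the pause length are all hypothetical and would need tuning for a real environment.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], None],
    batch_size: int = 5,
    pause_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Restart agents batch by batch, pausing between batches.

    `restart` is a caller-supplied function that restarts the L3 agent on
    one host (for example via SSH or a config-management tool).
    Returns the batches in the order they were processed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    host_list = list(hosts)
    batches: List[List[str]] = []
    for start in range(0, len(host_list), batch_size):
        batch = host_list[start:start + batch_size]
        for host in batch:
            restart(host)
        batches.append(batch)
        # Pause only if more hosts remain, to let the server catch up.
        if start + batch_size < len(host_list):
            sleep(pause_seconds)
    return batches
```

Passing `sleep` as a parameter keeps the function easy to dry-run: a test can substitute a recorder for the real delay.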

Page 17: OpenStack: Virtual Routers On Compute Nodes

Operational Complexity

• One more thing to check when a node fails
• Tooling has to be updated
• Monitoring has to be updated

Page 18: OpenStack: Virtual Routers On Compute Nodes

Where are we going?

Generally this “VR-D” solution is working well in production

• HA routers
• Custom router scheduling
• Routers on all compute nodes
• DVR?
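The slides do not say what "custom router scheduling" would look like. One common placement policy is to put each new router on the live agent that currently hosts the fewest routers. The sketch below illustrates that policy under stated assumptions: `pick_agent` and its inputs are hypothetical and do not follow Neutron's scheduler interface.

```python
from typing import Dict, Set


def pick_agent(router_counts: Dict[str, int], alive: Set[str]) -> str:
    """Pick the live L3 agent host with the fewest routers.

    `router_counts` maps host name to the number of routers it hosts.
    `alive` is the set of hosts whose agent is currently reporting in.
    Ties are broken by host name so the choice is deterministic.
    """
    candidates = [host for host in router_counts if host in alive]
    if not candidates:
        raise LookupError("no live L3 agent available")
    return min(candidates, key=lambda host: (router_counts[host], host))


# Hypothetical inventory of compute nodes and their router counts.
counts = {"compute-a": 3, "compute-b": 1, "compute-c": 1}
print(pick_agent(counts, alive={"compute-a", "compute-b", "compute-c"}))
```

A real policy would likely also weigh bandwidth, since the testing above found that routers consume bandwidth rather than RAM or CPU.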

Page 19: OpenStack: Virtual Routers On Compute Nodes

Questions?

Clayton O’Neill
– [email protected]
– IRC: clayton
– Twitter: @clayton_oneill

Sean Lynn
– [email protected]
– IRC: trad511