Virtual Routers on Compute Nodes: A (Not So) Irrational Decision?
• Neutron with OVS and VXLAN tenant networks
• Kilo release
• Virtual Routers hosted on three control nodes
• No HA routers
In the Beginning
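The starting point above (OVS mechanism driver with VXLAN tenant networks on Kilo) corresponds roughly to ML2 settings like the following. This is a minimal sketch for orientation; the VNI range and `local_ip` are illustrative placeholders, not TWC's actual values:

```ini
# ml2_conf.ini (sketch -- values are illustrative, not TWC's)
[ml2]
type_drivers = flat,vlan,vxlan
tenant_network_types = vxlan
mechanism_drivers = openvswitch

[ml2_type_vxlan]
vni_ranges = 1:1000

[ovs]
local_ip = 192.0.2.10   ; this node's VXLAN tunnel endpoint IP (placeholder)

[agent]
tunnel_types = vxlan
```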
• We had major network reliability issues
• Customers were being DoSed
• Environment was running out of capacity
• We had some misconfiguration that was hard to fix
• Network upgrade was months behind schedule
• Impact of a control node failure was huge
• Need to reduce failure impact
The Problem
The Dumb Idea
• Got together to brainstorm options
• Could we colocate routers with another service?
  – Spread routers
  – Spread load
  – Reduce failure group size
• What about compute nodes?
• Is this a bad idea?
• Why don’t other people do this?
• Two dual-port 10G Intel X520 NICs
  – Cross-card, two-port LACP for tenant traffic
• 1U Cisco C220 rack-mount servers
  – Intel E5-2650 processors
  – 256 GB RAM
• OpenStack Neutron
  – OVS
  – VXLAN
• Legacy Virtual Router testing:
  – Server bandwidth consumed? ALL of it!
  – Server RAM and CPU consumed? Negligible, even with OVS
TWC Configurations and Testing
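The cross-card LACP bond described above (one port from each X520, aggregated with 802.3ad) can be sketched with the standard Linux bonding driver. Interface names here are hypothetical, and the hash policy is our illustrative choice, not necessarily TWC's:

```text
# /etc/network/interfaces fragment (sketch; interface names are hypothetical)
auto bond0
iface bond0 inet manual
    bond-mode 802.3ad                 # LACP
    bond-slaves enp4s0f0 enp5s0f0     # one port from each X520 card (cross-card)
    bond-miimon 100
    bond-xmit-hash-policy layer3+4    # spread flows across both links
```

Using one port from each physical card means a single NIC failure degrades bandwidth rather than severing tenant traffic entirely.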
• Traditional network nodes
  – Gigantic servers
  – Horizontally scaled, smaller servers
• DVR - Distributed Virtual Router
• “VR-D” - Virtual Routers - Distributed
• Other solutions outside mainline Neutron
All The Options
Network Nodes - Overview
• Gigantic dedicated network nodes?
  – Servers are generally idle
  – Large failure domains
  – Long rebuild times
• Why not scale dedicated network nodes horizontally?
  – Servers are generally idle
• Resource usage:
  – RAM and CPU: incredibly low
  – Bandwidth!
Network Nodes - Detail
• DVR - network nodes have less responsibility overall
• FloatingIP SNAT takes place on the compute node
  – L3 Agent is required on compute nodes
• FixedIP SNAT takes place on network nodes
  – Still requires a Virtual Router for the external gateway
  – HA planned for Newton
• At TWC we were concerned about:
  – DVR’s readiness for production
  – Scale issues
  – Operational tooling changes
  – Our current use of many Floating IPs
  – Massive customer conversion to TWC OpenStack
DVR - A Layman’s Summary
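The split of responsibilities above maps onto Neutron's DVR options. A sketch of the relevant Kilo-era settings, not TWC's deployed configuration:

```ini
# neutron.conf on the API server (sketch)
[DEFAULT]
router_distributed = True       ; new routers are distributed by default

# l3_agent.ini on compute nodes -- FloatingIP NAT happens here
[DEFAULT]
agent_mode = dvr

# l3_agent.ini on network nodes -- centralized FixedIP SNAT lives here
[DEFAULT]
agent_mode = dvr_snat
```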
DVR Packet Paths
● East-West between VMs (orange)
● North-South with a FloatingIP (purple)
● North-South without a FloatingIP (green)
● Other Cases (blue):
  ○ VMs on the same compute node
  ○ VM and router on the same compute node
• “VR-D” is “Virtual Routers - Distributed”
• Traditional Virtual Routers that cohabitate with VMs on compute nodes!
• Servers running both the L3 Agent and nova-compute
• Virtual Routers are either Legacy or HA
• Our current choice is not to include DHCP Agents
• What about?
  – VM and Virtual Router bandwidth contention?
“VR-D” - what we made up
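Because VR-D just runs the stock L3 agent on every compute node, the agent configuration stays plain. A sketch of the relevant options, assuming HA routers are enabled via Neutron's `l3_ha` setting rather than reflecting TWC's exact values:

```ini
# l3_agent.ini on every compute node (sketch)
[DEFAULT]
agent_mode = legacy             ; ordinary virtual routers, no DVR

# neutron.conf, if HA routers are wanted (sketch)
[DEFAULT]
l3_ha = True                    ; new routers get VRRP-based HA
max_l3_agents_per_router = 2    ; active/standby pair
```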
VR-D Packet Paths
● East-West between VMs (orange)
● North-South with a FloatingIP (green)
● North-South without a FloatingIP (green)
  ○ Chief difference between VR-D and DVR!
● Other Cases (blue):
  ○ VMs on the same compute node
  ○ VM and router on the same compute node (shown)
• Implementation: surprisingly easy
• Puppet to put the l3-agent on all compute nodes
• Forgot about the Metadata agent
• Manageable issues
Implementation and Automation
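The metadata gotcha above comes from the fact that the L3 agent spawns a metadata proxy inside each router namespace, which in turn needs a local neutron-metadata-agent to forward requests to Nova. A sketch of the two pieces involved; the IP and secret are placeholders:

```ini
# l3_agent.ini (sketch) -- the L3 agent proxies metadata by default
[DEFAULT]
enable_metadata_proxy = True

# metadata_agent.ini on the same node (sketch; values are placeholders)
[DEFAULT]
nova_metadata_ip = 203.0.113.5
metadata_proxy_shared_secret = CHANGE_ME
```

So deploying the L3 agent to compute nodes also means deploying neutron-metadata-agent alongside it.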
L3 Agent Scalability (LP#1498844)
• The #1 problem we’ve encountered
• L3 agent queries are handled by a single thread in neutron-server
• Fixed in Mitaka, but the backport stalled
• RabbitMQ queue for the L3 agent falls behind
• Falls behind on status checks with 100 L3 agents
• Restarts request full state - a resource hog
• Rolling restarts had to be rate limited
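The rate-limited rolling restarts mentioned above can be sketched as a small pacing loop: restart a few agents, wait for their full-state syncs to drain, then move on. This is a hypothetical helper (the `restart` callable stands in for whatever SSH/Ansible/Puppet mechanism actually restarts neutron-l3-agent), not TWC's tooling:

```python
import time
from typing import Callable, Iterator, List


def batched(agents: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield agents in fixed-size batches."""
    for i in range(0, len(agents), batch_size):
        yield agents[i:i + batch_size]


def rolling_restart(
    agents: List[str],
    restart: Callable[[str], None],
    batch_size: int = 5,
    delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Restart L3 agents a few at a time, pausing between batches so the
    single-threaded neutron-server isn't flooded with full-sync requests."""
    for n, batch in enumerate(batched(agents, batch_size)):
        if n:                # pause between batches, not before the first one
            sleep(delay_s)
        for host in batch:
            restart(host)    # e.g. ssh host 'service neutron-l3-agent restart'
```

Injecting `restart` and `sleep` keeps the pacing logic testable without touching real hosts.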
Operational Complexity
• One more thing to check when a node fails
• Tooling has to be updated
• Monitoring has to be updated
Where are we going?
Generally, this “VR-D” solution is working well in production.
• HA routers
• Custom router scheduling
• Routers on all compute nodes
• DVR?
Questions?
Clayton O’Neill
– [email protected]
– IRC: clayton
– Twitter: @clayton_oneill
Sean Lynn
– [email protected]
– IRC: trad511