DE-CIX Route Server Testframework
Benedikt Rudolph (other contributors: Allen Taylor, Daniel Spierling, Johannes Moos)
DE-CIX, Researcher
RS functionality, scalability & stress testing: existing projects
[Figure: timeline of existing projects (2010, 2014, 2016, today). Named participants: AMS-IX, DE-CIX, NIX.CZ, JPNAP, NTT, LONAP, Switch & Data, LINX, PLIX, VIX]
Motivation: Make IXP Route Servers future proof!
Estimate route server capacity limits
Make assumptions objective through measurement
Stress testing → better capacity estimation
[Figure labels: BGPd unresponsive for ≥ t_BGP_hold; expiring sessions; flapping BGP sessions; secondary flaps]
Route server hardening based on observed incidents
Execute tests under realistic conditions
Test new SW features / configurations in large scale deployments
Design elements: inherited and new ones
Do not re-invent the wheel, build on existing tools
Container virtualization
Emphasize automation (for repeatable results)
Consider scalability (to meet future requirements)
$ ./bgperf.py
BGPerf implementation revisited
[Diagram: bgperf.py and three containers, each attached via eth1 to the bridge bgperf-br]
Container "monitor": GoBGP
Container "tester": ExaBGP (multiple instances)
Container "target": BIRD
Diagram labels: advertise routes to the target; get CPU/MEM stats; get # of routes the monitor receives
DE-CIX enhanced BGPerf
[Diagram: the BGPerf setup above (monitor: GoBGP, tester: ExaBGP instances, target: BIRD, bridge bgperf-br, bgperf.py) plus the DE-CIX additions listed below]
VPN-GW: l2tpeth1, l2tpeth2, l2tpeth3, eth1
AWS
Manager: PHP based solution; Start/Stop ExaBGP cont; realistic ExaBGP configs
Target: BIRD with the DE-CIX BIRD.conf
Monitor: BGP Add-Path
Action Sequencer: reads an input script; actions include Wait-Convergent and Interrupt-Peers; get CPU/MEM/routes; execute / stop actions
Input script outline:
script:
- wait-convergent:
- interrupt:
- sleep:
- ...
Realistic Peer Generation & Simulation
Emulation of peers with ExaBGP (https://github.com/Exa-Networks/exabgp)
One ExaBGP process per peer
Real world peer snapshots from DE-CIX route servers
Auto generated ExaBGP configs incl.:
Session Hold timers
Announced prefixes
AS-Path, BGP next hop, local pref,
(extended) BGP communities, ...
Export from per-customer RIBs
Includes all filtered prefixes as well
~ 720.000 routes in ExaBGP configs
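The config generation described above could look roughly like the following sketch. It is not the DE-CIX generator: the `PeerSnapshot` and `Route` types, the route server address 172.31.192.1 and the AS numbers are hypothetical, and the emitted text assumes the classic ExaBGP `neighbor { ... static { route ...; } }` layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Route:
    """One announced prefix with the attributes taken from a snapshot."""
    prefix: str
    next_hop: str
    as_path: List[int]
    communities: List[str] = field(default_factory=list)


@dataclass
class PeerSnapshot:
    """Hypothetical container for one peer exported from a per-customer RIB."""
    peer_ip: str      # address the emulated peer speaks from
    peer_as: int      # AS number of the emulated peer
    rs_ip: str        # route server (BGPd under test)
    rs_as: int        # AS number of the route server
    hold_time: int    # session hold timer in seconds
    routes: List[Route] = field(default_factory=list)


def render_exabgp_config(peer: PeerSnapshot) -> str:
    """Render one ExaBGP config (one process per peer) as text."""
    lines = [
        f"neighbor {peer.rs_ip} {{",
        f"    router-id {peer.peer_ip};",
        f"    local-address {peer.peer_ip};",
        f"    local-as {peer.peer_as};",
        f"    peer-as {peer.rs_as};",
        f"    hold-time {peer.hold_time};",
        "    static {",
    ]
    for r in peer.routes:
        attrs = f"next-hop {r.next_hop} as-path [ {' '.join(map(str, r.as_path))} ]"
        if r.communities:
            attrs += f" community [ {' '.join(r.communities)} ]"
        lines.append(f"        route {r.prefix} {attrs};")
    lines += ["    }", "}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = PeerSnapshot(
        peer_ip="172.31.192.43", peer_as=64500,
        rs_ip="172.31.192.1", rs_as=64511, hold_time=90,
        routes=[Route("192.0.2.0/24", "172.31.192.43", [64500, 64501], ["64511:100"])],
    )
    print(render_exabgp_config(demo))
```

Writing one file per peer keeps the "one ExaBGP process per peer" model simple: each process is started with exactly one generated config.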
Simulation of L2 problems
Emulate packet loss and delay with an existing tool (https://github.com/tylertreat/comcast)
Makes use of iptables and tc (on Linux)
Simulate L2 problems and the peer flaps that emerge from them
High loss leads to missed keepalives
Missed keepalives result in peer flaps
Example: simulate entire switch / linecard failures
Generate 100% packet loss for a given time
No flaps, but a high number of sessions go down
RS needs to calculate new best paths / send withdrawals
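As an illustration of how packet loss towards selected peers can be expressed with iptables, the sketch below only builds the command lines and does not execute them. It is not how the comcast tool works internally; it uses the standard iptables `statistic` match for partial loss and a plain `DROP` rule for 100% loss. The function names are hypothetical.

```python
from typing import List


def loss_rules(peers: List[str], loss_percent: int) -> List[List[str]]:
    """Build iptables commands that drop traffic from/to the given peer IPs."""
    if not 0 <= loss_percent <= 100:
        raise ValueError("loss_percent must be within 0..100")
    rules: List[List[str]] = []
    if loss_percent == 0:
        return rules  # nothing to drop
    for ip in peers:
        # One rule per direction: packets from the peer and packets to the peer.
        for chain, flag in (("INPUT", "-s"), ("OUTPUT", "-d")):
            rule = ["iptables", "-A", chain, flag, ip]
            if loss_percent < 100:
                # Random drop with the requested probability.
                rule += ["-m", "statistic", "--mode", "random",
                         "--probability", f"{loss_percent / 100:.2f}"]
            rule += ["-j", "DROP"]
            rules.append(rule)
    return rules


def undo(rule: List[str]) -> List[str]:
    """Turn an append (-A) command into the matching delete (-D) command."""
    return ["-D" if arg == "-A" else arg for arg in rule]


if __name__ == "__main__":
    for cmd in loss_rules(["172.31.192.43"], 30):
        print(" ".join(cmd))
```

Running the returned commands requires root privileges; after the interruption window, the `undo` variants remove the rules again so the sessions can recover.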
Automation: Benchmark Action Sequencer
Extension for bgperf (Execution Phase)
Load a script and execute actions on the testbed (three types)
Wait-Convergent Action
Wait until BGPd under test reached a steady state e.g.
CPU below x % for n measurement intervals
BGPd received at least m routes
Interrupt-Peers Action
Disrupt communication to a user-defined list of IPs from the testbed
Complete interruption or configurable packet loss
Sleep Action
Wait for a defined amount of time
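The steady-state criterion of the Wait-Convergent action can be sketched as a check over a series of measurements. This is a minimal illustration, not the bgperf code: the function name and the `(cpu, routes)` sample format are assumptions.

```python
from typing import Iterable, Tuple


def is_convergent(samples: Iterable[Tuple[float, int]],
                  cpu_below: float, routes: int, confidence: int) -> bool:
    """Return True once `confidence` consecutive samples satisfy both criteria.

    Each sample is (cpu utilization in percent, number of received routes),
    taken once per measurement interval.
    """
    streak = 0
    for cpu, n_routes in samples:
        if cpu < cpu_below and n_routes >= routes:
            streak += 1
            if streak >= confidence:
                return True
        else:
            streak = 0  # criteria violated: start counting again
    return False


if __name__ == "__main__":
    measurements = [(50.0, 100), (0.5, 200), (0.4, 210), (3.0, 210),
                    (0.2, 210), (0.1, 210), (0.3, 211)]
    print(is_convergent(measurements, cpu_below=1, routes=200, confidence=3))
```

Requiring several consecutive intervals avoids declaring convergence during a short idle gap while the BGPd is still processing updates.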
Benchmark Action Sequencer script file
--- # example input script for the bgperf benchmark action sequencer
script: # a script is an ordered list of actions that are executed in sequence
  - wait-convergent: # wait for bgpd convergence / steady state
      cpu-below: 1 # threshold for cpu utilization
      routes: 200 # greater or equal number of routes in master table
      confidence: 3 # check criteria in n consecutive measurement intervals
  - interrupt-peers: # interrupt a single peer
      duration: 1 # duration of the interruption in seconds
      peers: [172.31.192.43] # list of peers, at least one peer
      loss: 100 # packet-loss in percent [0-100]
      recovery: 20 # optional: How long to wait for recovery after duration?
  - interrupt-peers: &bigOnes1 # interrupt multiple peers (reusable through anchor)
      duration: 20
      peers: [172.31.192.43, 172.31.192.44, 172.31.192.45]
  - interrupt-peers: # reuse entry above, but redefine duration and add repetition
      <<: *bigOnes1 # reference to anchor "&bigOnes1"
      duration: 10
  - sleep: # sleep for a fixed duration
      duration: 1 # duration in seconds
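To show how such a script maps onto sequential execution, here is a dry-run sketch of a dispatcher. It is not the actual sequencer: `SCRIPT` is written as the Python structure a YAML parser would be expected to produce for the file above (the fourth entry shows the mapping after the `<<` merge key is resolved, with duration overridden to 10), the handlers only describe what they would do, and the default of 100% loss when `loss` is omitted is an assumption.

```python
from typing import Any, Callable, Dict, List

BIG_ONES = ["172.31.192.43", "172.31.192.44", "172.31.192.45"]

# Parsed form of the example script (assumed parser output, merge resolved).
SCRIPT: Dict[str, List[Dict[str, Dict[str, Any]]]] = {"script": [
    {"wait-convergent": {"cpu-below": 1, "routes": 200, "confidence": 3}},
    {"interrupt-peers": {"duration": 1, "peers": ["172.31.192.43"],
                         "loss": 100, "recovery": 20}},
    {"interrupt-peers": {"duration": 20, "peers": BIG_ONES}},
    {"interrupt-peers": {"duration": 10, "peers": BIG_ONES}},
    {"sleep": {"duration": 1}},
]}


def describe_wait(p: Dict[str, Any]) -> str:
    return (f"wait until cpu < {p['cpu-below']}% and routes >= {p['routes']} "
            f"for {p['confidence']} intervals")


def describe_interrupt(p: Dict[str, Any]) -> str:
    loss = p.get("loss", 100)  # assumption: complete interruption by default
    return f"interrupt {len(p['peers'])} peer(s) for {p['duration']}s at {loss}% loss"


def describe_sleep(p: Dict[str, Any]) -> str:
    return f"sleep {p['duration']}s"


HANDLERS: Dict[str, Callable[[Dict[str, Any]], str]] = {
    "wait-convergent": describe_wait,
    "interrupt-peers": describe_interrupt,
    "sleep": describe_sleep,
}


def run_script(script: Dict[str, Any],
               handlers: Dict[str, Callable[[Dict[str, Any]], str]]) -> List[str]:
    """Execute the actions in order; each list entry has exactly one action key."""
    log: List[str] = []
    for step in script["script"]:
        if len(step) != 1:
            raise ValueError("each step must contain exactly one action")
        (name, params), = step.items()
        if name not in handlers:
            raise ValueError(f"unknown action: {name}")
        log.append(handlers[name](params or {}))
    return log


if __name__ == "__main__":
    for line in run_script(SCRIPT, HANDLERS):
        print(line)
```

Keeping the handlers in a lookup table means a further action type only needs one new function and one new table entry.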
BIRD Memory Leak / Cisco Bug
Detection and investigation of a memory leak in BIRD
Customer facing Cisco bug CSCus56036, graceful restart (4s)
Memory leak, BIRD process killed by the OOM killer
Communicate with developers, bug fixed in BIRD v1.6.3
Reproduce scenario and test effectiveness of fix
[Figure: Graceful Restart bug. Phases: RS convergent, one peer flapping. Plotted: BGP best paths, CPU utilization in %, memory usage]
Simulation of a realistic L2 disruption
328 peers for 800s (e.g. caused by an edge switch SW-upgrade)
[Figure: L2 interrupt for 328 peers (800s), shown for BIRD v1.5.0 (Multi-RIB config) and BIRD v1.6.3 (Multi-RIB config). Plotted: BGP best paths, all received BGP routes, CPU utilization in %]
Conclusion
New toolset for route server testing
Enhanced and unique test framework
Realistic one-to-one copy of our live IXP network (use of custom BGPd config)
High scalability of peers due to AWS cloud integration
Dynamic and automated test benchmarks, using the action sequencer extension
Benedikt Rudolph, Researcher
Daniel Spierling, Network Engineer
Johannes Moos, Systems Engineer
Thank you!