126

Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

Embed Size (px)

Citation preview

Page 1: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges
Page 2: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

Scaling BGP BRKRST-3321

Luc De Ghein

Page 3: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Agenda

Scale Challenges

Full mesh iBGP

Update Groups

Slow Peer

RR Problems & Solutions

Deployment

Multi-Session

MPLS VPN

Memory Utilization

OS Enhancements

3

Page 4: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Scale Challenges

BGP is robust

BGP is simple

– Basically simple, knobs allows for anything

BGP is adaptable and is multi-protocol

BGP allows policy

We need to overcome the following:

– Newer services: new AFs

– More prefixes

– Larger scale: more (BGP) routers

– More multipath

– More resilience PIC Edge, Best External leading to more prefixes/paths

4

Page 5: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Growing Tables of BGP

5

Internet sees ~ 2 updates per sec

Source: bgp.potaroo.net

IPv4 IPv6

AS

Page 6: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

More Services Using BGP

6

IPv4

IDR

IPv4

enterprise

MPLS VPN

IPv6 IDR

BGP flowspec

BGP monitoring

protocol

multicast BGP PW Signalling

(VPLS)

BGP MAC

Signalling (EVPN)

BGP FC

Path Diversity

BGP FC

PIC

mVPN

services

scaling DMVPN Unified

MPLS

1990 1995 1999 2002 2009 2012 future

Inter-AS

MPLS VPN

mVPN

C-signalling

6PE 6VPE

Page 7: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Services = Address Families

7

IPv4 unicast

IPv4 multicast

IPv4 MDT

IPv4 MVPN

vpnv4 unicast

vpnv4 multicast

IPv6 unicast

IPv6 multicast

IPv6 MVPN vpnv6 unicast

vpnv6 multicast

l2vpn vpls

l2vpn evpn

rtfilter unicast

nsap unicast

IPv6

IPv4 unicast

multicast

vpn

multicast in overlay

Layer 2

Page 8: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

mVPN and BGP

8

mVPN old style

– BGP ipv4 MDT needed for multicast signalling in the core

mVPN new style

– New address family mvpn

– BGP replaces PIM in overlay signalling

– BGP Auto Discovery (AD) signalling • Auto Discovery of PE routers

• Advertisement of core/tree tunnel information

– BGP Customer or C-signalling • Customer signalling from PE to PE

BGP in Overlay

PE PE Source

S1,S2 MPLS cloud Receiver

PIM PIM

BGP

Page 9: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Seamless MPLS (aka Unified MPLS)

9

Seamless MPLS (aka Unified MPLS) allows to

create end-end Tunnel Label Switch Path and

extend MPLS services without boundary ( Edge

/Aggregation /Access )

To scale Tunnel LSP are construct in 2 layers

- Intra-Area Tunnel LSP (ISIS+LDP or OSPF+LDP)

- Inter-Area Tunnel LSP (BGP+label: RFC3107)

No IP prefix redistribution between IGP domains

IGP prefixes are moved into BGP

Edge

Aggregation Aggregation Access Core

Edge

L2VPN / L3VPN

Inter-Area MPLS

IGP 1 + LDP IGP 2 + LDP IGP 3 + LDP

Intra-Area MPLS

Seamless MPLS

client (PE)

Seamless MPLS

ABR

Seamless MPLS

ABR

iBGP + label (RFC3107)

MP-iBGP + VPN label

Access

Seamless MPLS

client (PE)

MPLS service

Inter-domain LSP

Intra-domain LSP Intra-domain LSP Intra-domain LSP

Page 10: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Seamless MPLS (aka Unified MPLS)

10

Edge

Aggregation Aggregation Access Core

Edge

L2VPN / L3VPN

Inter-Area MPLS

IGP 1 + LDP IGP 2 + LDP IGP 3 + LDP

Intra-Area MPLS

Seamless MPLS

client (PE)

Seamless MPLS

ABR

Seamless MPLS

ABR

iBGP + label (RFC3107)

MP-iBGP + VPN label

Access

Seamless MPLS

client (PE)

• ABR is a RR

• RR sets next-hop to self for reflected iBGP prefixes

neighbor x.x.x.x next-hop-self all

• RR is in the forwarding path !

MPLS service

Page 11: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Ethernet-VPN (E-VPN)

11

E-VPN is next generation Emulated Ethernet Lan Solution

Customer MAC-Addresses advertised using using BGP

E-VPN defines a single new BGP NLRI used to carry all E-VPN routes

The NLRI has a new SAFI (70)

BGP

PE PE

PE PE 1. New route types and attributes defined

2. MAC address advertisement and withdraw

3. MPLS label

Page 12: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

Full Mesh iBGP

Page 13: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Full Mesh iBGP Is a Full Mesh Scalable?

Per BGP standard: iBGP needs to be full mesh

Total iBGP sessions = n * (n-1) / 2

Sessions per BGP speaker = n - 1

0

2000

4000

6000

8000

10000

12000

0 50 100 150 200

session

peers Two solutions

1. Confederations

2. Route reflectors

13

Page 14: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Confederations Create # of sub-AS inside the larger confederation

Conferation AS looks like normal AS to the outside

R1

R3

R2

R4

subAS 65001

R5

R7

R6

subAS 65002

Full mesh iBGP still needed inside subAS

No full mesh needed between subAS (it’s eBGP)

Every BGP peer needs to be in a subAS

R8

R10

R9

R11

subAS 65003

R12

R13

R14

subAS 65004

R15

R16

confed eBGP

iBGP

eBGP

router bgp 65002

bgp confederation identifier 100

bgp confederation peers 65001 65004

AS 100

14

Page 15: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Confederations

Next-hop-self on confed eBGP peerings

R1

R3

R2

R4

subAS 65001

R5

R7

R6

subAS 65002 R8

R10

R9

R11

subAS 65003

R12

R13

R14

subAS 65004

R15

R16

confed eBGP

iBGP

eBGP

AS 100

router bgp 65002

neighbor R3 next-hop-self

neighbor R14 next-hop-self

Each subAs can have different IGP

Increased BGP scaling

– Divide and conquer

router bgp 65004

neighbor R7 next-hop-self

neighbor R11 next-hop-self

15

Page 16: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Confederations

R1

R3

R2

R4

subAS 65001

R5

R7

R6

subAS 65002 R8

R10

R9

R11

subAS 65003

R12

R13

R14

subAS 65004

R15

R16

confed eBGP

iBGP

eBGP

AS 100

Flexible confed eBGP peerings

– Redundancy needed vs increased memory/CPU

– But full mesh between subAS’s is not needed

Full mesh between two

subAS’s is not needed

No connectivity needed between

any subAS’s

– e.g. no confed eBGP peering between subAS 65002 and subAS 65003

16

Page 17: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Confederations Notes

17

Loop avoidance similar to normal AS

– We introduce a confederation part in the AS_PATH

– Two new segment types: 3: AS_CONFED_SEQUENCE (ordered) ( )

4: AS_CONFED_SET (unordered) [ ]

– They never go out of the Confederation

Attributes do not change across confed-eBGP

– We keep NEXT_HOP, LOCAL_PREF, MED

– Implications: Decisions is consistent (Example: One path with higher LOCAL_PREF wins across the confederation)

Since NEXT_HOP does not change, IGP should be one across the confederation

– Decision process does not change Prefer eBGP paths over iBGP AND confederation-eBGP

e.g. (65001 65002) 300 200

PATH inside confederation

Page 18: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Confederations – Pros & Cons

Pros

– Each subAS can have its own IGP

– iBGP policy control • Each subAS can have its own set of policies

― MED acceptance or stripping

― Local Preference settings

― Route Dampening

• This will not be possible with route reflectors

• Also possible: different confed eBGP policies

– Merging companies is easier/transparent • Adding/changing/moving subAS’s

– Route reflectors can exist inside subAS

– BGP sessions can follow physical topology

Cons

– Migration from full mesh to confederation is cumbersome • Ok for new design

18

Page 19: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Solve The iBGP Full Mesh Route Reflector (RR)

RR

– A route reflector is an iBGP speaker that reflects routes learned from iBGP peers to other iBGP peers, called RR clients

– iBGP full mesh is turned into hub-and-spoke

– RR is the hub in a hub-and-spoke design

R1 R2 R3 R1 R2 R3

OR router bgp 65002

neighbor R1 route-reflector-client

neighbor R2 route-reflector-client

router bgp 65002

neighbor R1 route-reflector-client

neighbor R2 route-reflector-client

RR clients are interconnected

RR clients are regular

iBGP peers

RR RR

RR

If client-to-client reflection

can be turned off

19

Page 20: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Route Reflector What’s Possible?

20

R1 R2 R4 R3

non-client

clients

R5

Any router can peer

eBGP

cluster cluster

AS 100 AS 101

RR RR

iBGP

eBGP

Page 21: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Route Reflector Clusters

21

Redundancy needed, min of 2 RRs per cluster

Cluster = RR and its clients

R1 R2 R4 R3

R5

RR RR RR RR

Page 22: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Route Reflector Overlapping Clusters

22

Full mesh between RRs and non-clients should be kept small

R1 R2 R4 R3

R5

Client can peer with RRs in multiple clusters

Page 23: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Route Reflector Hierarchical RRs

23

Chain RRs to keep the full mesh between RRs and non-clients small

Make RRs clients of other RRs

An RR is a RR and RR client at the same time

iBGP topology should follow physical topology

– To prevent suboptimal routing blackholing and routing loops

Tier 1

Tier 2

Tier 3

There is no limit to the amount of tiers

RR & RR client

RRs in top tier need to be fully meshed

RR

Page 24: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Route Reflector Loop Prevention

24

Because we have RRs and same prefix can be advertised multiple times within iBGP cloud: loop prevention needed in iBGP cloud

Two BGP attributes, two ways

– Originator ID

Set to the router ID of the router injecting the route into the AS

Set by the RR

– Cluster List

Each route reflector the route passes through adds their cluster-ID to this list

Cluster-id = Router ID by default

“bgp cluster-id x.x.x.x” command to set cluster-id

Router discard routes if:

– If ORIGINATOR-ID = our ROUTER-ID

– If CLUSTER_LIST contains our CLUSTER-ID

Page 25: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Route Reflector Loop Prevention

RR1 and RR2 have different cluster-IDs

RRC1 RRC2

RR1 RR2

AS Path: {65000}

AS Path: {65000}

AS Path: {65000}

Originator-ID: RRC1

Cluster-List: {RR1}

AS Path: {65000}

Originator-ID: RRC1

Cluster-List: {RR1, RR2}

loop detected

25

Page 26: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

RRs Questions?

How many RRs?

– More RRs = more redundancy

– But also = more management work and more memory

– 2 RRs for each service set is ok More is possible

Using the same or different cluster-ID for RRs in one set?

– Different cluster-ID Additional memory and processor overhead on RR

– Same cluster-ID Less redundant paths

26

Page 27: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Route Reflector Same Cluster-ID

RR1 has only 1 path for routes from RRC2

– RR1 and RR2 have the same cluster-ID

If one link RR to RR-client fails

– iBGP session remains up, it is between loopback IP addresses

RRC1 RRC2

RR1 RR2

If different cluster-ID:

– RR1 stores the path from RR2

– RR1 uses additional CPU and memory

– Potentially for many routes

27

Page 28: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Route Reflector Route Advertisement

28

RR & RR client

RR

iBGP

eBGP

prefix coming from eBGP peer

RR sends prefix to

clients and non-clients

(and sends to other

eBGP peers)

prefix coming from RR client

RR reflects prefix to

clients and sends to

non-clients (and sends

to eBGP peers)

prefix coming from a non-client

RR reflects prefix to

clients (and sends to

eBGP peers)

Page 29: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

The Need for Virtual Route Reflectors

More BGP services: dedicated RR virtualized and multiplied: vRR

Benefits:

– Scalability (64b OS)

– Performance (Multi core)

– Independence of operation: reload/update/changes version (VM)

– Same BGP implementation and software version as deployed on the Edge (XE/XR)

– Management (Hypervisor)

A dedicated RR does

– IGP

– BGP

A dedicated RR needs

– CPU

– memory

primary backup

service 1

set vRR’s

service 2

set vRR’s

service 3

set vRR’s

29

service 1 set

RR’s primary

backup

service 2 set

RR’s

service 3 set

RR’s

primary

backup

primary

backup

Page 30: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Evolution of Route-Reflector’s From 7200 to vRR

Integration

Scala

bili

ty /

Perf

orm

ance

IOS-7200

IOS XE - ASR1K

64b OS

64b OS

IOS XR 12000

with 8 RP / RR

IOS-XR o 64b Linux

ASK9K Ng VSM

with 4 vRR: 4 x 8 core CPU

vIOS-XE o VMware

IOS-XR ASR9001

64b OS

30

Page 31: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

BGP RR Scale - Selective RIB Download

31

To block some or all of the BGP prefixes into the RIB (and FIB)

Only for RR which is not in the forwarding path

Saves on memory and CPU

Implemented as filter extension to table-map command

router bgp 1

address-family ipv4

table-map block-into-rib filter

route-map block-into-rib deny 10

RR1#show ip route bgp

RR1#

R1#show ip cef

configuration no BGP prefixes in RIB no BGP prefixes in FIB

For AFs IPv4/6

– not needed for AFs vpnv4/6

Benefit

– ASR testing indicated 300% of RR-client session

scaling (in order of 1000s)

Page 32: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Multi-Cluster ID

32

Each set of peers in cluster ID has its own update group

Loop-prevention mechanism is modified

– Taking into account multiple cluster IDs

PE1

RR1

PE2

PE3

PE4

router bgp 1

no bgp client-to-client reflection intra-cluster cluster-id 0.0.0.1

no bgp client-to-client reflection intra-cluster cluster-id 0.0.0.2

cluster ID 2 cluster ID 1

no reflection no reflection

An RR can belong to multiple clusters

– On an IBGP neighbor of RR: configure cluster IDs on a per-neighbor basis

– The global cluster ID is still there

– Intra-cluster client-to-client reflection can be disabled, when clients are meshed

– Can be disabled for all clusters or per cluster

– More work -sending more updates- for RR clients

– Less work -sending fewer updates- for RRs

Page 33: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Comparing Confederations vs Route Reflectors

Confederation Route Reflector

Loop prevention AS Confederation Set Originator/Cluster ID

Break up a single AS Sub-AS’ Clusters

Redundancy Multiple connections between sub-AS’ Client connects to several reflectors

External Connections Anywhere in the network Anywhere in the network

Multilevel Hierarchy RRs within sub-AS’ Clusters within clusters

Policy Control Along outside borders and between sub-AS’ Along outside border

Scalability Medium; still requires full iBGP within each

sub-AS

Very high

Migration Very difficult (impossible in some situations) Moderately easy

Deployment of (new) features Decentralized Central on RR

For Your Reference

33

Page 34: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

Update Groups

Page 35: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Grouping of BGP Neighbors Optimization Can Be Achieved

35

• peer groups

• templates, session-groups, af-groups, neighbor-group

Configuration/administration Performance/scalability

• update groups

• Dynamic way of grouping BGP peers according to common

outbound policy

• BGP formats the update messages once and then

replicates to all members of the update group

• network that have the same best-path attributes can be

grouped into the same message improving packing efficiency

• replication instead of formatting updates per neighbor:

efficiency

• dynamic = policy changes, update group membership

changes

• AF independent : a peer can belong to different update

groups in different address families

• CLI only

BGP neighbors with same outbound policy will

be put in the same update group regardless if

‒ peer-groups are defined

‒ templates are defined

‒ neighbors are individually defined

Page 36: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Peer-groups vs Templates

36

Newest configuration

More flexible (nesting, multiple inheritance)

Less configuration

router bgp 1

neighbor ibgp-peers peer-group

neighbor ibgp-peers remote-as 1

neighbor ibgp-peers description peering to iBGP speakers

neighbor ibgp-peers update-source Loopback0

neighbor 10.100.1.3 peer-group ibgp-peers

neighbor 10.100.1.4 peer-group ibgp-peers

!

address-family ipv4

no neighbor 10.100.1.3 activate

no neighbor 10.100.1.4 activate

exit-address-family

!

address-family vpnv4

neighbor ibgp-peers send-community both

neighbor 10.100.1.3 activate

neighbor 10.100.1.4 activate

exit-address-family

peer-groups

router bgp 1

template peer-policy iBGP-policy

send-community both

inherit peer-policy nhs 10

exit-peer-policy

!

template peer-policy nhs

next-hop-self

exit-peer-policy

!

template peer-session ibgp-peers

remote-as 1

description peering to iBGP speakers

update-source Loopback0

exit-peer-session

!

neighbor 10.100.1.3 inherit peer-session ibgp-peers

address-family vpnv4

neighbor 10.100.1.3 activate

neighbor 10.100.1.3 send-community extended

neighbor 10.100.1.3 inherit peer-policy iBGP-policy

exit-address-family

templates

nesting

possible

Oldest configuration

peer session for

global/session

commands

(AF independent)

peer policy for

policy commands

(AD dependent)

Page 37: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Update Groups and Route Reflectors

Update groups are very usefull on all BGP speakers

– but mostly on RR due to

• # of peers

• equal outbound policy

iBGP typically has no outbound policy

– RRs have large number of iBGP peers in one update group

Turn the issue into a solution

– Create one BGP update once and replicate for all

– Assuming RR does not change attributes on iBGP sessions

RR#show bgp ipv4 unicast neighbors 10.200.1.1

For address family: IPv4 Unicast

Session: 10.200.1.1

BGP table version 3201, neighbor version 3201/0

Output queue size : 0

Index 2, Advertise bit 0

Route-Reflector Client

2 update-group member

For address family: VPNv4 Unicast

Session: 10.200.1.1

BGP table version 1, neighbor version 1/0

Output queue size : 0

Index 3, Advertise bit 0

3 update-group member

a peer can belong to

different update groups in

different address families

37

Page 38: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Update Group on RR

38

RR#show bgp ipv4 unicast update-group 2

BGP version 4 update-group 2, internal, Address Family: IPv4 Unicast

BGP Update version : 3201/0, messages 0

Route-Reflector Client

Topology: global, highest version: 3201, tail marker: 3201

Format state: Current working (OK, last not in list)

Refresh blocked (not in list, last not in list)

Update messages formatted 2013, replicated 24210, current 0, refresh 0, limit 2000

Number of NLRIs in the update sent: max 812, min 0

Minimum time between advertisement runs is 0 seconds

Has 101 members:

10.100.1.2 10.200.1.1 10.200.1.10 10.200.1.100

10.200.1.11 10.200.1.12 10.200.1.13 10.200.1.14

10.200.1.15 10.200.1.16 10.200.1.17 10.200.1.18

...

RR

Page 39: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Update Group Replication

39

RR

BGP update

format BGP update

BGP update

BGP update

replicate

BGP update

BGP update

...

1

n RR#show ip bgp replication 2

Current Next

Index Members Leader MsgFmt MsgRepl Csize Version Version

2 101 10.100.1.2 2013 24210 0/2000 3201/0

update

group 2

formatting

according to

his policy

# of

formatted

messages

# of

replications size of

cache

total # of

members

Page 40: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Adaptive Message Cache Size

40

Cache = place to store formatted BGP message, before they are send

Update message cache size throttles update groups during update generation and controls transient memory usage

Is now adaptive

– Variable (change over time) queue depth from 100 to 5000

• Number of peers in an update groups

• Installed system memory

• Type of address family

• Type of peers in an update group

– Benefits

• Update groups with large number of peers get larger update cache

• Allows routers with bigger system memory to have *appropriately* bigger cache size and thereby queue more update messages

• vpnv4 iBGP update groups have larger cache size

• Old cache sizing scheme could not take advantage of expanded memory available on new platforms

• Results in faster convergence

Page 41: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

New Update Generation and Advertisement IOS (latest)

41

• Delay in BGP message propagation

• Long bgp update generation run

• CPU (and memory) overhead due to “advertised-

bits” for VPN

• One peer can starve buffers

• LIFO behavior

• Temporary memory increase during neighbor/policy

change

• Parallel processing of route-refresh and new peers

– Eliminates serialization of update generation runs

– Serialization = faster peers cannot format new updates until slower one caught up

• Quantum processing in BGP

• Lots of internal changes • e.g. redo “advertised-bits” to save memory

• All BGP processes get fair share of CPU

• More FIFO behavior

• Eliminate serialization of update generation runs

• Better serve new update group members and Route

Refresh requests with progress to other members

• Optimized buffer management

Problem Solution

Results

Page 42: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Dynamic Refresh Update Groups Route-Refresh Update Group Older Solution

42

Route refresh request from one peer can hog CPU

Newer behavior in IOS SB train:

– A different update group (one) is created temporarily for servicing the route-refresh per update group

– Route-refresh update group is created when the "bgp refresh timer" expires • This is one minute (default) after first refresh request

• Can be changed with bgp route-refresh service-interval <0-3600>

• Several route refresh requests from one peer or several peers, will be processed together, when "bgp refresh timer expires, in the new route-refresh update group

– Peers in the route-refresh update group receive the full RIB, while the peers in the regular update group receive the regular BGP updates

– Multiple peers from one update-group that need a route refresh will be put in one route-refresh update group for that update group

– Route-refresh update group is deleted after all route refresh requests have been dealt with

– A new peer is treated like a peer with Route-Refresh

For Your Reference

Page 43: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Route-Refresh Update Group: When is a Route Refresh Request Sent?

43

A route refresh request is sent, when:

– a user types clear ip bgp [AF] {*|peer} in

– a user types clear ip bgp [AF] {*|peer} soft in

– adding or changing the inbound filtering on the BGP neighbor

• via route-map

– configuring allowas-in for the BGP neighbor

– configuring soft-configuration inbound on the BGP neighbor

– in MPLS VPN (for AFI/SAFI 1/128)

• a user adds a route-target import to a VRF

– in 6VPE (for AFI/SAFI 2/128)

• a user adds a route-target import to a VRF

For Your Reference

Page 44: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Parallel Processing of Route-refresh (and New Peers)

44

Original update group handles new transient updates while refresh update group handles re-announcements

IOS

– Serving all peers for which route-refresh is needed, at once

– Tracking the refreshing peers

– Maintains a copy of neighbor instance that needs to re-announce the table

– Allows original update group and its members to advertise transient updates without getting blocked

– Tracked with flags

– SPE flags to indicate refresh

– E flag : refresh end marker. Peer is scheduled or participating in a refresh.

– S flag : refresh has started. If not present: waiting for its refresh to start either because another refresh for the same group is in progress or because the net prepend is not yet complete

– P flag : refresh is paused waiting for other refresh members that started later to catch up

IOS-XR

– Refresh sub-goup

Version 0 Version X

Refresh Group Re-announcements Transient Updates

Table Versions of Prefixes

Page 45: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Parallel Processing of Route-Refresh

45

RR#show bgp ipv4 unicast update-group 1 summary

Summary for Update-group 1, Address Family IPv4 Unicast

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd

10.100.1.1 4 1 42 56007 100001 0 0 00:32:39 0

10.100.1.2 4 1 4 3001 1 0 1999 00:00:09 0 (SE)

10.100.1.3 4 1 40 10036 100001 0 0 00:33:06 0

10.100.1.4 4 1 10034 10035 100001 0 0 00:32:23 100000

RR#show bgp ipv4 unicast update-group 1

BGP version 4 update-group 1, internal, Address Family: IPv4 Unicast

BGP Update version : 1/100001, messages 1999

Topology: global, highest version: 100001, tail marker: 100001

Format state: Current working (OK, last no message space)

Refresh blocked (no message space, last no message space)

Update messages formatted 59997, replicated 89997, current 0, refresh 1999, limit 2000

Has 4 members:

10.100.1.1 10.100.1.2 10.100.1.3 10.100.1.4

RR#show bgp ipv4 unicast replication 1

Current Next

Index Members Leader MsgFmt MsgRepl Csize Version Version

1 4 10.100.1.1 59295 89295 1999/2000 1/100001

For Your Reference

Page 46: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

Slow Peer

Page 47: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Problem

Update groups are a given

– Cannot be disabled

One peer in update group can become persistently slow

– Slow = a peer that cannot keep up with the rate at which we are generating update messages over a prolonged period of time (in the order of minutes)

– High CPU

– Transport issues (packet loss/loaded links/TCP throughput)

A slow peer can slow down convergence towards the other update group members

– Slow in transmitting

– Filling up the replication cache

– The number of formatted updates pending transmission builds up in that update group because one peer (the slow one) does not ack the messages as fast as the other members

– Full cache: no more new messages can be formatted, until the cache is reduced, hence until some messages are send to the slow peers

47

Page 48: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Solution

Slow peer management feature – 2 steps

– Detection • Identify the slow peer

• Slow peers are detected by tracking the queue for the peer and TCP

– Protection • Movement of the slow peer out of the update group

• Slower update group is collection of slow peers with matching policies

• Unblocks the remaining update group members resulting in better convergence

• Allows for fast and slow peers to proceed at the their own speeds

Neither is enabled by default

Detection can be enabled without protection

48

Page 49: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Solution

Recovery

– When slow peer is no longer slow, it can move back to the original update group

• When the slow update group has converged, the recovery timer is started

• When the recovery timer has expired, the slow peer is moved back

Solution before this feature: manual movement

– Create a different outbound policy for the slow peer

• Policy must be different than any other • You do not want the slow peer to move to another already existing update group

• Use something that does not affect the actual policy • For example: change minimum advertisement interval (MRAI) of the peer (under AF)

49

Page 50: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Detection

Detection parameters

– BGP update timestamps

– Peers TCP connection characteristics

per AF/

per VRF

per peer

per peer policy

template

bgp slow-peer detection [threshold <seconds>] default is 5 min

neighbor {<nbr-addr>/<peer-grp-name>} slow-peer detection [threshold < seconds >]

neighbor {<nbr-addr>/<peer-grp-name>} disable

slow-peer detection [threshold < seconds >]

50

Page 51: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Protection

2 ways

– static (CLI): by the user

– automatic movement of slower peer(s) out of update group

Separate slow update group with matching policies created

Any slow members are moved to this slow update group

Automatic recovery

– Slow peers can move back to regular update group

• Multiple static slow peers with the same outbound policy will be placed into the same “slow” update group

• Static slow peer and dynamic slow peer are in the same slow peer update group

51

Page 52: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Protection

static per neighbor/

per peer-group

automatic per AF/

per VRF

neighbor {<nbr-addr>/<peer-grp-name>} slow-peer split-update-group static

slow-peer split-update-group static static via peer policy

template

bgp slow-peer split-update-group dynamic [permanent]

neighbor {<nbr-addr>/<peer-grp-name>} slow-peer split-update-group dynamic [permanent] automatic per neighor

automatic per peer

policy template slow-peer split-update-group dynamic [permanent]

permanent = peer is not

moved back automatically to

the update group

52

Page 53: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Mechanism Details Logging

53

Syslog message for detection (and movement) of slow peer

%BGP-5-SLOWPEER_DETECT: Neighbor IPv4 Unicast 10.100.1.1 has been detected as a

slow peer.

Syslog message for recovery of slow peer

%BGP-5-SLOWPEER_RECOVER: Slow peer IPv4 Unicast 10.100.1.1 has recovered.

Page 54: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peers: Displaying

54

CLI to display slow peers

Applicable to all address families

• show bgp all summary slow

• show bgp ipv4 unicast neighbors slow

• show bgp ipv4 unicast update-group summary slow

For Your Reference

Page 55: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Mechanism Details Identifying Slow Peer

55

RR#show bgp ipv4 unicast update-group 1 summary

Summary for Update-group 1, Address Family IPv4 Unicast

BGP router identifier 10.100.1.5, local AS number 1

BGP table version is 500001, main routing table version 500001

100000 network entries using 14400000 bytes of memory

BGP using 24373520 total bytes of memory

BGP activity 115574/15574 prefixes, 300000/200000 paths, scan interval 60 secs

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd

10.100.1.1 4 1 1257 67368 402061 0 2000 18:56:16 0

10.100.1.2 4 1 1219 23362 402061 0 0 18:23:46 0

10.100.1.3 4 1 1257 23398 402061 0 0 18:56:42 0

10.100.1.4 4 1 10002 1891 402061 0 0 00:01:37 100000

output queue is not empty,

persistently?

convergence is achieved if all

peers are at the table version

For Your Reference

Page 56: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Mechanism Details Identifying Slow Peer

56

RR#show bgp ipv4 unicast replication 1

Current Next

Index Members Leader MsgFmt MsgRepl Csize Version Version

1 4 10.100.1.1 78114 144314 2000/2000 402061/500001

RR#show bgp ipv4 unicast update-group 1

BGP version 4 update-group 1, internal, Address Family: IPv4 Unicast

BGP Update version : 402061/500001, messages 2000

Topology: global, highest version: 500001, tail marker: 402061 (pending)

Format state: Current blocked (no message space, last no message space)

Refresh blocked (not in list, last not in list)

Update messages formatted 78115, replicated 144318, current 2000, refresh 0, limit 2000

Minimum time between advertisement runs is 0 seconds

Has 4 members:

10.100.1.1 10.100.1.2 10.100.1.3 10.100.1.4

For Your Reference

Page 57: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Mechanism Details Enabling Slow Peer Feature - Verifying

57

RR#show bgp ipv4 unicast neighbors 10.100.1.1

BGP neighbor is 10.100.1.1, remote AS 1, internal link

BGP version 4, remote router ID 10.100.1.1

For address family: IPv4 Unicast

Session: 10.100.1.1

BGP table version 700001, neighbor version 602001/700001

Output queue size : 2000

Index 1, Advertise bit 0

1 update-group member

Slow-peer detection is enabled, threshold value is 300

Slow-peer split-update-group dynamic is enabled,, and active

Dynamically detected slow-peer

For Your Reference

Page 58: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Mechanism Details

RR#show bgp ipv4 unicast update-group 1

BGP version 4 update-group 1, internal, Address Family: IPv4 Unicast

BGP Update version : 1001311/1100001, messages 2000

Extended-community attribute sent to this neighbor

Topology: global, highest version: 1100001, tail marker: 1001311 (pending)

Format state: Current blocked (no message space, last no message space)

Refresh blocked (not in list, last not in list)

Update messages formatted 108548, replicated 238371, current 2000, refresh 0, limit 2000

Number of NLRIs in the update sent: max 1016, min 0

Minimum time between advertisement runs is 0 seconds

Slow-peer detection timer (expires in 240 seconds)

Has 4 members:

10.100.1.1 10.100.1.2 10.100.1.3 10.100.1.4

Slow-Peer Detection Timer

58

For Your Reference

Page 59: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Mechanism Details Slow Peer Movement

59

%BGP-5-SLOWPEER_DETECT: Neighbor IPv4 Unicast 10.100.1.1 has been detected as a slow peer.

RR#show bgp ipv4 unicast neighbors 10.100.1.1

For address family: IPv4 Unicast

Slow-peer detection is enabled, threshold value is 300

Slow-peer split-update-group dynamic is enabled,, and active

RR#show bgp ipv4 unicast update-group 2

BGP version 4 update-group 2, internal, Address Family: IPv4 Unicast

BGP Update version : 402061/500001, messages 1000

Slow update group

Topology: global, highest version: 500001, tail marker: 500001

Update messages formatted 1000, replicated 1000, current 1000, refresh 0, limit 1000

Has 1 member: (1 dynamically detected as slow)

10.100.1.1

For Your Reference

Page 60: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Mechanism Details Slow Peer Movement

60

RR#show bgp ipv4 unicast summary slow

BGP router identifier 10.100.1.5, local AS number 1

BGP table version is 500001, main routing table version 500001

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd

10.100.1.1 4 1 1268 67391 402061 0 3000 19:05:34 0

For Your Reference

Page 61: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Mechanism Details Update Group Recovers

61

RR#show bgp ipv4 unicast update-group 1 summary

BGP table version is 500001, main routing table version 500001

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd

10.100.1.2 4 1 1233 31494 500001 0 0 18:35:48 0

10.100.1.3 4 1 1271 31529 500001 0 0 19:08:44 0

10.100.1.4 4 1 10014 10023 500001 0 0 00:13:38 100000

RR#show bgp ipv4 unicast replication 1

Current Next

Index Members Leader MsgFmt MsgRepl Csize Version Version

1 3 10.100.1.2 86237 168703 0/2000 500001/0

peer 10.100.1.1 has been removed from update group 1

allowing update group 1 remaining peers

to converge faster

For Your Reference

Page 62: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Mechanism Details Recovery of the Slow Peer

62

the recovery timer starts when the slow-peer update group has “converged”

RR#show bgp ipv4 unicast update-group 2

BGP version 4 update-group 2, internal, Address Family: IPv4 Unicast

BGP Update version : 2300001/0, messages 0

Slow update group

Topology: global, highest version: 2300001, tail marker: 2300001

Format state: Current working (OK, last no message space)

Refresh blocked (not in list, last not in list)

Update messages formatted 10000, replicated 10000, current 0, refresh 0, limit 1000

Number of NLRIs in the update sent: max 10, min 0

Minimum time between advertisement runs is 0 seconds

Slow-peer recovery timer (expires in 4 seconds)

For Your Reference

Page 63: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Mechanism Details Slow Peer Recovers

63

%BGP-5-SLOWPEER_RECOVER: Slow peer IPv4 Unicast 10.100.1.1 has recovered.

RR#show bgp ipv4 unicast update-group 1 summary

Summary for Update-group 1, Address Family IPv4 Unicast

BGP router identifier 10.100.1.5, local AS number 1

BGP table version is 500001, main routing table version 500001

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd

10.100.1.1 4 1 1282 79396 500001 0 0 19:18:36 0

10.100.1.2 4 1 1244 31505 500001 0 0 18:46:06 0

10.100.1.3 4 1 1282 31541 500001 0 0 19:19:03 0

10.100.1.4 4 1 10025 10035 500001 0 0 00:23:57 100000

peer 10.100.1.1 has been put back into update group 1

For Your Reference

Page 64: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Slow Peer Mechanism Details Clearing

64

CLI to clear

This is a forced clear of the slow-peer status; the peer is moved to the original update group

Needed when the permanent keyword is configured

• clear bgp AF {unicast|multicast} * slow

• clear bgp AF {unicast|multicast} <AS number> slow

• clear bgp AF {unicast|multicast} peer-group <group-name> slow

• clear bgp AF {unicast|multicast} <neighbor-address> slow

For Your Reference

Page 65: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

RR Problems & Solutions

Page 66: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

BGP Best Path

The BGP4 protocol specifies the selection and propagation of a single best path for each prefix

As defined in the BGP standard, there are no mechanisms to distribute paths other then best path between its speakers

This behavior results in number of disadvantages for new applications and services

– So a solution was needed

66

Page 67: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

BGP Best Path Selection

Path selection mechanism Details

Weight This is a Cisco-defined attribute that is assigned locally to your router and does not get carried through to the router updates. If there are multiple paths

to a particular IP address (which is very common), then BGP looks for the path with the highest weight. There are several ways to set the weight parameter, such as the neighbor command, the as-path access list, or route maps.

Local Preference This is an indicator to the AS as to which path has local preference, with the highest preference being preferred. The default is 100.

Network or Aggregate This criterion prefers the path that was locally originated via a network or aggregate. The aggregation of specific routes into one route is very efficient and saves space on your network.

Shortest AS_PATH BGP uses this one only when there is a “tie” comparing weight, local preference, and locally originated vs. aggregate addresses.

Lowest origin type This deals with protocols such as Interior Gateway Protocol (IGP) being a lower preference than Exterior Gateway Protocol (EGP).

Lowest multi-exit discriminator (MED) This is also known as the external metric of a route. A lower MED value is preferred over a higher value

eBGP over iBGP Similar to “lowest origin type”, BGP AS Path prefers eBGP over iBGP

Lowest IGP metric This criterion prefers the path with the lowest IGP metric to the BGP next hop.

Multiple paths This determines if multiple paths require installation in the routing table.

External paths When both paths are external, it prefers the path that was received first (the oldest one).

Lowest router ID This prefers the route that comes from the BGP router with the lowest router ID.

Minimum cluster list If the originator or router ID is the same for multiple paths, it prefers the path with the minimum cluster list length.

Lowest neighbor address This prefers the path that comes from the lowest neighbor address

For Your Reference

http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a0080094431.shtml

67

Page 68: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Best Path Selection – Route Advertisement on RR

The BGP4 protocol specifies the selection and propagation of a single best path for each prefix

If RRs are used, only the best path will be propagated from RRs to ingress BGP speakers

The next slides explain features to distribute redundant paths to get around the “filtering” by the BGP Bestpath Calculation & Advertisement

NH: PE1, P: Z

NH: PE2, P: Z

P:Z

P: Z

Path 1: NH: PE1, best

Path 2: NH: PE2 P: Z

Path 1: NH: PE1, best

NH: PE1, P: Z

ingress PE does not learn 2nd path

68

CE1

PE1

PE2

RR PE3 CE2

Page 69: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Why Having Multiple Paths?

Convergence – BGP Fast Convergence (2+ paths in local BGP DB)

– BGP PIC Edge (2+ paths ready in forwarding plane)

Multipath load balancing

– ECMP

Prevent oscillation

– The additional info on backup paths leeds to local recovery as opposed to relying on iBGP

– Stop persistent route oscillations caused by comparison of paths based on MED in topologies where route reflectors or the confederation structure hide some paths

Allow hot potato routing

– = use optimal route

– The optimal route is not always known on the border routers

69

Page 70: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Diverse BGP Path Distribution Overview

VPN unique RD

BGP Best External

BGP shadow RR / session

BGP Add-Path

70

Page 71: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Diverse BGP Path Distribution Unique RD for MPLS VPN

Unique RD per VRF per PE

One IPv4 prefix in one VRF becomes unique vpnv4 prefix per VPN per PE

RR advertises everything

Available since the beginning of MPLS VPN, but only for MPLS VPN

NH: PE1, P: Z/RD1

NH: PE2, P: Z/RD2

P:Z

NH: PE1, P: Z/RD1

NH: PE2, P: Z/RD2

RR1

PE1

PE2

VRF

VRF

CE1

PE3

CE2

VRF P: Z Path 1: NH: PE1 Path 2: NH: PE2

RD2

RD1

71

Page 72: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Diverse BGP Path Distribution Shadow Route Reflector (aka RR Topologies)

Easy deployment – no upgrade of PE routers is required, just new iBGP session per each extra path (CLI knob on RR2)

One additional “shadow” RR per cluster

RR2 does announce the 2nd best path, which is different from the primary best path on RR1 by next hop

Additional path can be either of the following depending on the configuration: • the oldest multipath that is different than the best path

• backup path (2nd best path which is different from the primary best path by next hop)

NH: PE1, P: Z

NH: PE2, P: Z

P:Z

NH: PE2, P: Z shadow RR

router bgp 1

address-family ipv4

bgp additional-paths select backup

neighbor 10.100.1.3 advertise diverse-path backup

RR1

PE1

PE2

CE1

PE3 CE2

RR2

P: Z Path 1: NH: PE1 Path 2: NH: PE2

P: Z

Path 1: NH: PE1, best

Path 2: NH: PE2

P: Z

Path 1: NH: PE1, best

Path 2: NH: PE2, 2nd best

NH: PE1, P: Z

72

Page 73: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Diverse BGP Path Distribution Shadow Route Reflector

RR(config-router-af)#bgp bestpath igp-metric ignore

shadow RR

RR and shadow RR are co-located. They ‘re on same vlan with same IGP metric towards prefix.

Note 2: primary RRs do not need diverse path code

Note 1: primary and shadow RRs do not need to turn off IGP metric check

RR and shadow RR are not co-located.

Note: primary and shadow RRs need to have

the command to turn IGP metric check off. This is in order for all RRs to calculate the same best path so that primary and shadow RRs do not advertise the same path

all links have the same IGP cost

equal distance

link flaps can change topology

P

RR1

RR2

PE1

PE2

P:Z

shadow RR

PE1

P

RR1

RR2 PE2

P:Z

P: Z

Path 1: NH: PE1, best

Path 2: NH: PE2

P: Z

Path 1: NH: PE1, best

Path 2: NH: PE2, 2nd best

P: Z

Path 1: NH: PE1, best

Path 2: NH: PE2

P: Z

Path 1: NH: PE1, 2nd best

Path 2: NH: PE2, best

RR2 advertises same path as RR1 !

solution

73

Page 74: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Diverse BGP Path Distribution Shadow Route Reflector

The shadow RR needs to (1) calculate and (2) advertise the backup path

Calculation: triggered by any of the following per address family CLI

– Backup path calculation bgp additional-paths select backup

– Multipath calculation (existing CLI) maximum-paths (e)ibgp <n>

– Backup path calculation AND the installation can be triggered by the following CLI if “bgp additional-paths select backup” is NOT configured already

bgp additional-path install

Advertisement of the backup path

– Backup path advertisement advertise diverse-path backup

router bgp 1

address-family vpnv4

bgp additional-paths select backup

neighbor 10.100.1.3 activate

neighbor 10.100.1.3 send-community both

neighbor 10.100.1.3 route-reflector-client

neighbor 10.100.1.3 advertise diverse-path backup

shadow RR

example config

For Your Reference

74

Page 75: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Diverse BGP Path Distribution Shadow Route Reflector - Advertised-Routes

on the shadow RR

RR2#show bgp ipv4 unicast neighbors 10.100.1.3 advertised-routes

BGP table version is 10, local router ID is 10.100.1.10

Status codes: s suppressed, d damped, h history, * valid, > best, i -

internal,

r RIB-failure, S Stale, m multipath, b backup-path, f RT-

Filter,

x best-external, a additional-path, c RIB-compressed,

Origin codes: i - IGP, e - EGP, ? - incomplete

RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path

*bia10.2.5.0/24 10.2.2.12 0 100 0 65000 i

*bia10.2.6.0/24 10.2.4.14 0 100 0 65001 i

*>i 10.100.1.12/32 10.2.2.12 0 100 0 65000 i

*>i 10.100.1.14/32 10.2.4.14 0 100 0 65001 i

RR2#show bgp ipv4 unicast 10.2.5.0/24

BGP routing table entry for 10.2.5.0/24, version 7

Paths: (3 available, best #3, table default)

Advertised to update-groups:

1 2 3

Refresh Epoch 1

65000, (Received from a RR-client)

10.2.2.12 (metric 3) from 10.100.1.2 (10.100.1.2)

Origin IGP, metric 0, localpref 100, valid, internal, backup/repair

rx pathid: 0, tx pathid: 0

Refresh Epoch 1

65000, (Received from a RR-client)

10.2.1.11 (metric 3) from 10.100.1.1 (10.100.1.1)

Origin IGP, metric 0, localpref 100, valid, internal, best

rx pathid: 0, tx pathid: 0x0

For Your Reference

* valid

no “>” not the best path

B backup-path

I internal

A additional-path

this path is advertised

75

Page 76: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Diverse BGP Path Distribution Shadow Session

Easy deployment – only RR needs diverse path code and new iBGP session per each extra path (CLI knob in RR1)

Shadow iBGP session does announce the 2nd best path

– 2nd session between a pair of routers is no issue (use different loopback interfaces)

NH: PE1, P: Z

NH: PE2, P: Z

CE1

P:Z

NH: PE1, P: Z

NH: PE2, P: Z

P: Z Path 1: NH: PE1 Path 2: NH: PE2

Note: second session from RR to RR-client (PE3) has diverse-

path command in order to advertise 2nd best path

PE1

PE2

CE1 CE2 RR PE3

76

P: Z

Path 1: NH: PE1, best

Path 2: NH: PE2, 2nd best

Page 77: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

BGP PIC (Prefix Independent Convergence) Edge

• Convergence in flat FIB is prefix dependent • More prefixes -> more convergence time

• Classical convergence (flat FIB) • Routing protocols react - update RIB -

update CEF table (for affected prefixes)

• Time is proportional to # of prefixes

Problem

Solution

Result

• The idea of PIC: In both SW and HW:

• Pre-install a backup path in RIB

• Pre-install a backup path in FIB

• Pre-install a backup path in LFIB

• Improved convergence

• Reduce packet loss

• Have the same convergence time for all

BGP prefixes (PIC)

77

Page 78: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

MPLS VPN Dual-homed CE - no PIC Edge

Steps in convergence on ingress PE

1. Egress PE goes down

2. IGP notifies ingress PE in sub-second

3. Ingress PE recomputes BGP bestpath

4. Ingress PE installs new BGP bestpath in RIB

5. Ingress PE installs new BGP bestpath in FIB

6. Ingress PE reprograms hardware

NH: PE1, P: Z

NH: PE2, P: Z

P:Z

P: Z Path 1: NH: PE1, best Path 2: NH: PE2

skip this step with PIC Edge (BGP FRR)

because there is a precomputed

backup/repair path

this scales to the number of prefixes

CE1

PE1

PE2

PE3

CE2

78

Page 79: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

MPLS VPN

Steps in convergence on ingress PE

1. Egress PE goes down

2. IGP notifies ingress PE in sub-second

3. Switch to repair path with new NH

4. Ingress PE reprograms hardware

Dual-homed CE - PIC Edge

P: Z Path 1: NH: PE1, best Path 2: NH: PE2, backup/repair

router bgp 1

address-family vpnv4

bgp additional-paths install NH: PE1, P: Z

NH: PE2, P: Z

P:Z CE1

PE1

PE2

PE3

CE2

79

Page 80: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

BGP Additional-Path

Installation of the backup path

– Backup path installation bgp additional-path install

Depending on whether RR is control plane or forwarding plane, user may want to install the backup path by configuring this CLI

– Multipaths are automatically installed per existing CLI behavior

For Your Reference

80

Page 81: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

No BGP Best External With BGP Default Policy

P:Z

P: Z

Path 1: NH: CE1, localpref 100, external, best

P: Z

Path 1: NH: CE1, localpref 100, external, best

P: Z

Path 1: NH: PE1, internal, localpref 100, best

Path 2: NH: PE2, internal, localpref 100, backup/repair

NH: PE1, localpref: 100, P: Z

NH: PE2, localpref: 100, P: Z

NH: PE2, localpref: 100, P: Z NH: PE1, localpref: 100, P: Z

CE1

PE1

PE2

PE3

CE2

81

Page 82: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

P:Z CE1

PE1

PE2

PE3

CE2

No BGP Best External Changed Policy

• If default policy is changed, one egress PE could have iBGP path to other egress PE as best path and not its own external BGP path

P: Z

Path 1: NH: CE1, localpref 100, external, best

Path 2: NH: PE1, localpref: 200, internal, best

P: Z

Path 1: NH: CE1, external, best

P: Z

Path 1: NH: PE1, internal, localpref 200, best

NH: PE1, localpref: 200, P: Z local preference 200

NH: PE2, localpref: 100, P: Z

• Even with full mesh in iBGP, policy can prevent egress PE from learning all paths

NH: PE1, localpref: 200, P: Z

no backup/repair

path

82

Page 83: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

P:Z CE1

PE1

PE2

PE3

CE2

BGP Best External Changed Policy

With Best External, the backup PE (PE2) still propagates its own best external path to the RRs or iBGP peers

PE1 and PE3 have 2 paths

P: Z

Path 1: NH: CE1, external, best

Path 1: NH: PE1, localpref: 200, internal, best

P: Z

Path 1: NH: CE1, external, best

Path 2: NH: PE2, localpref 100, internal, backup/repair

NH: PE1, localpref: 200, P: Z

NH: PE2, localpref: 100, P: Z

P: Z

Path 1: NH: PE1, internal, localpref 200, best

Path 2: NH: PE2, localpref 100, internal, backup/repair

local preference 200

NH: PE2, localpref: 100, P: Z

router bgp 1

address-family vpnv4

bgp additional-paths install

bgp additional-paths select best-external

neighbor x.x.x.x advertise best-external

NH: PE1, localpref: 200, P: Z

backup/repair, advertise-best-external

83

Page 84: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

CE2

P:Z

ADD Path

Add-Path will signal diverse paths from 2 to X paths

RRs and edge BGP speakers need to have Add-Path support

Add-Path capability (send or receive or both) means that the NRLI encoding is extended to carry the Path Identifier

NH: PE1, P: Z

NH: PE2, P: Z

P: Z

Path 1: NH: PE1, best

Path 2: NH: PE2, best2 P: Z

Path 1: NH: PE1, best

Path 2: NH: PE2, backup/repair

router bgp 1

address-family ipv4

bgp additional-paths select best 2

bgp additional-paths send

neighbor PE3 advertise additional-paths best 2

NH: PE1, P: Z

NH: PE2, P: Z

router bgp 1

address-family ipv4

bgp additional-paths receive

bgp additional-paths install

PE routers need to run newer code in

order to understand second path

CE1

PE1

PE2

PE3 CE2 RR

84

Page 85: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

BGP Add-path Different Options

IETF draft draft-ietf-idr-add-paths-guidelines defines 4 flavors of Add-x-Path

2 options are implemented by Cisco

• RR will do best path computation for up to n paths and send n paths to the border routers

• This is the only mandatory selection mode • Pros

• less storage used for paths • less BGP info exchanged

• Cons • more best path computation

• Usecase: Primary + n-1 backup scenario (n is limited to 3, to preserve CPU power) = fast convergence

• RR will do the first best path computation and then send all paths to the border routers

• Pros • all paths are available on border routers

• Cons • all paths stored • more BGP info is exchanged

• Usecase: ECMP, hot potato routing

add-n-path add-all-path

bgp additional-paths select best<N> bgp additional-paths select all

85

Page 86: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Hot-Potato Routing No RRs

NH: PE2, P: Z

NH: PE1, P: Z NH: PE3, P: Z

eBGP: P: Z

eBGP: P: Z

eBGP: P: Z P: Z

Path 1: NH: PE1

Path 2: NH: PE2

Path 3: NH: PE3, best

Hot-potato routing = packets are passed on (to next AS) as soon as received

Shortest path though own AS must be used

In transit AS: same prefix could be announced many times from many eBGP peers

PE1

PE2

PE3

PE4

86

Page 87: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Hot-Potato Routing RRs are Present

Introducing RRs break hot potato routing

Solutions: Unique RD for MPLS VPN or Add Path

P: Z

Path 1: NH: PE1, best

NH: PE2, P: Z

NH: PE3, P: Z

eBGP: P: Z

eBGP: P: Z

eBGP: P: Z

PE1

PE2

PE4

PE3

NH: PE1, P: Z

RR

87

P: Z

Path 1: NH: PE1, best

Path 2: NH: PE2

Path 3: NH: PE3

NH: PE1, P: Z

Page 88: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Hot-Potato Routing in Large Transit SP No Add Path

Often large transit ISPs do have full mesh iBGP between regional RRs and hub/spoke between local BR and RR

Full mesh and global hot potato routing

RR

RR

RR RR

RR

RR

PE

PE

PE

PE

PE

PE

PE

PE

PE

88

Page 89: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Hot-Potato Routing in Large Transit SP With Add Path

With introduction of add-all-path we now could have centralized RR and do hot-potato routing

Initially, add-all-path could be deployed between centralized and regional RR’s

Also possible: remove the need for regional RR if all PE routers support add-path

RR

RR

RR RR

RR

RR

PE

PE

PE

PE

PE

PE

PE

PE

PE

RR

89

Page 90: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Don’t Forget

If you want primary and backup path installed in the RIB of the ingress PE (ECMP), you must have BGP multipath enabled

– But multipath does not solve the issue of RR only sending best path

With full mesh iBGP (and default policy), you can avoid all of the above, possible only needing/wanting BGP multipath on ingress PE

With different RD per PE for one VRF, you can avoid all of the above, but only in MPLS VPN networks

Inter-AS – ASBRs also perform Bestpath calculation and advertise only the bestpath across the AS-border /

into the local AS

– Consequently, the same requirements apply to Inter-AS as Intra-AS Unique RD, Add-Path, Best-external and/or RR-topologies

90

Page 91: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

Deployment

Page 92: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Path MTU Discovery (PMTUD)

• MSS (Max Segment Size) – Limit on the largest segment that can traverse a TCP session

– Anything larger must be fragmented & re-assembled at the TCP layer

– MSS is 536 bytes by default for client BGP without PMTUD

– Enable PMTU for BGP with

Older command “ip tcp path-mtu-discovery”

Newer command “bgp transport path-mtu-discovery” (PMTUD now on by default)

• 536 bytes is inefficient for Ethernet (MTU of 1500) or POS (MTU of 4470) networks

– TCP is forced to break large segments into 536 byte chunks

– Adds overheads

– Slows BGP convergence and reduces scalability

Another helpful command

show ip bgp neighbors | include max data

92

Page 93: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Session/Timers

93

Timers = keepalive and holdtime

– Default is ok

– Smallest is 3/9 for keepalive/holdtime

– Scaling <> small timers

Use BFD

– Built for speed

– When failure occurs, BFD notifies BFD client (in 10s of msecs)

Do not use Fast Session Deactivation (FSD) – Tracks the route to the BGP peer

– A temporary loss of IGP route, will kill off the iBGP sessions

– Very dangerous for iBGP peers IGP may not have a route to a peer for a split second

FSD would tear down the BGP session

– It is off by default neighbor x.x.x.x fall-over

– Next Hop Tracking (NHT), enabed by default, does the job fine

BFD Control Packets

BFD

BGP

EIGRP

OSPF

IS-IS

BFD Clients BFD Clients

BFD

BGP

EIGRP

OSPF

IS-IS

Page 94: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

Multisession

Page 95: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Multisession

95

BGP Multisession = multiple BGP sessions (and hence TCP sessions) between one pair of BGP speakers

– Even if there is only one BGP neighbor statement defined between the BGP speakers in the configuration)

Introduced with Multi Topology Routing (MTR)

– One session per topology

Now: possibility to have one session per AF/group of AFs

– Good for incremental deployment of AFs

• Avoids a BGP reset

• But multisession needs to be enabled beforehand

– Good for troubleshooting

– Good for issues when BGP session resets • For example “malformed update”

– Not so good for scalability

R2 R1

single session

carrying all AFs

R2 R1

multisession

1 topo per session

R2 R1

multisession

1 AF per session

Page 96: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Multisession

96

capability

multisession for

MTR

multisession

without MTR

BGP: 10.100.1.2 passive rcvd OPEN w/ optional parameter type 2 (Capability)

len 3

BGP: 10.100.1.2 passive OPEN has CAPABILITY code: 131, length 1

BGP: 10.100.1.2 passive OPEN has MULTISESSION capability, without grouping

R2#show bgp ipv4 unicast neighbors

BGP neighbor is 10.100.1.1, remote AS 1, internal link

BGP multisession with 3 sessions (3 established), first up for 00:05:43

Neighbor sessions:

3 active, is multisession capable

Session: 10.100.1.1 session 1

Topology IPv4 Unicast

Session: 10.100.1.1 session 2

Topology IPv4 Unicast voice

Session: 10.100.1.1 session 3

Topology IPv4 Unicast video

R2#show ip bgp neighbors 10.100.1.1 | include session|address family

BGP multisession with 3 sessions (3 established), first up for 00:02:29

Neighbor sessions:

3 active, is multisession capable

Session: 10.100.1.1 session 1

Session: 10.100.1.1 session 2

Session: 10.100.1.1 session 3

Route refresh: advertised and received(new) on session 1, 2, 3

Multisession Capability: advertised and received

For address family: IPv4 Unicast

Session: 10.100.1.1 session 1

session 1 member

For address family: IPv6 Unicast

Session: 10.100.1.1 session 2

session 2 member

For address family: VPNv4 Unicast

Session: 10.100.1.1 session 3

session 3 member

1 session per

topology

1 session per

address family

For Your Reference

Page 97: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Multisession Conclusion

97

Increases # of TCP sessions

Not really needed

Current default behavior = multisession is off

– Can be turned on by “neighbor x.x.x.x transport multi-session”

Makes sense to have IPv4 and IPv6 on seperate TCP sessions

– IPv6 over IPv4 (or IPv4 over IPv6) can be done, but next hop mediation is needed

Page 98: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

MPLS VPN

Page 99: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

RR-groups

99

Use one RR (set of RRs) for a subset of prefixes

– By carving up range of RTs

Only for vpnv4/6

– RR only stores and advertises the specific range of prefixes

Less storage on RR, but more RRs needed + more peerings

RR1

RR2

RR1

RR2

rr-group 1

rr-group 2 PE1 PE2

vpnv4/6

vpnv4/6

vpnv4/6

vpnv4/6

Page 100: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

RR-groups Configuration Example

100

Dividing of RTs done by simple ext community list 1-99 or ext community list with regular expression 100-500

PEs still send all vpnv4/6 prefixes to RR, but RR filters them

RR1

RR2

RR1

RR2

rr-group 1

rr-group 2 PE1 PE2

vpnv4/6

vpnv4/6

vpnv4/6

vpnv4/6

address-family vpnv4

bgp rr-group 100

address-family vpnv6

bgp rr-group 100

ip extcommunity-list 100 permit RT:1:(1|3|5)....

address-family vpnv4

bgp rr-group 100

address-family vpnv6

bgp rr-group 100

ip extcommunity-list 100 permit RT:1:(2|4|6)....

Dividing RT = more work

PEs are not involved, only RRs

BGP(4): 10.100.1.1 rcvd UPDATE w/ attr: nexthop 10.100.1.1, origin ?, localpref 100, metric 0, extended community

RT:1:10001

BGP(4): 10.100.1.1 rcvd 1:10001:100.1.1.2/32, label 22 -- DENIED due to: extended community not supported;

Page 101: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Route Target Constraint (RTC)

Current behavior:

– RR sends all vpnv4/6 routes to PE

– PE routers drops vpnv4/6 for which there is no importing VRF

RTC behavior: RR sends only “wanted” vpnv4/6 routes to PE

– “wanted”: PE has VRF importing the Route Targets for the specific routes

– RFC 4684

– New AF “rtfilter”

Received RT filters from neighbors are translated into oubound filtering policies for vpnv4/6 prefixes

The RT Filtering information is obtained from the VPN RT import list

Result: RR do not send vpnv4/6 unnecessarily prefixes to the PE routers

101

Page 102: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Route Target Constraint (RTC)

RR PE1 PE2

PE1 installs Default RT

filter for RR

PE sends all its vpnv4/6

prefixes to RR

capability 1/132 (RTFilter)

for vpnv4 & vpnv6

CE1

CE3

CE2

CE4

origin AS

Route

Target 1:1

origin AS

Route

Target 1:2

RR installs RT filter RT:1:1 & RT 1:2

for PE1 (implicitly denying all else)

MP_REACH_NLRI

AF RTFilter

RR sends only RED and green (not

blue) vpnv4/6 prefixes to PE1

BGP capability exchange

OPEN message

AF RTFilter exchange

AF vpnv4/6 prefixes exchange

origin AS

Route

Target 0:0

102

Page 103: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Route Target Constraint (RTC)

103

Results

– Eliminates the waste of processing power on the PE and the waste of bandwidth

– Number of vpnv4 formatted message is reduced by 75%

– BGP Convergence time is reduced by 20 - 50%

– The more sparse the VPNs (few common VPNs on PEs), the more performance gain

Note: PE and RR need the support for RTC

– Incremental deployment is possible (per PE)

– Behavior towards non-RT Constraint peers is not changed

Notes

– RTC clients of RR and non-RTC clients will end up in different update groups on RR

– RTC clients of RR with different set of importing RTs will be in the same update group on the RR In IOS-XR, different filter group under same subgroup

Page 104: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Update Groups and RTC

104

Now two levels of grouping of neighbors

We have selective replication possible per update group

– IOS • Update group

• Replication group (not visible) ― One or more per update group

– IOS-XR • Update group

• Filter group ― One or more per update group

Two levels needed for RT Constraint

– Within update group (same outbound policy), still different subset of neighbors possible because of different wanted RTs

Page 105: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

VRF-Based Dampening

105

BGP route dampening is now configurable per-VRF instead of for whole VPN table

Allows service provider to configure dampening parameters on an individual customer basis

Gives operators more flexible control of unstable customer routes in service provider network

router bgp <asn>

address-family ipv4 [unicast | multicast | vrf vrf-name]

bgp dampening <parameters>

Page 106: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Full Internet in a VRF? Why? Because design dictates it

Unique RD, so that RR can advertise 2 paths?

PRO

• Remove Internet routing table from P routers

• Security: move Internet into VPN, out of global

• Added flexibility

• More flexible DDOS mitigation

CON

• Increased memory and bandwidth consumption

Platform must support enough MPLS labels

– Label allocation is per-prefix by default

– Perhaps per-ce or per-vrf label allocation is wanted here

– Per-CE : one MPLS label per next-hop (CE router)

– Per-VRF : one MPLS label per VRF (all CE routers in the VRF)

– Con: IP lookup needed after label lookup

– Now also per-CE and per-VRF label allocation for 6PE

106

Page 107: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Full Internet in a VRF?

Two Internet gateways for redundancy

RRs are present: unique RDs needed

Considerations

107

RD 1:1

RD 1:2

PE1

PE2

PE3

PE4

RR

Page 108: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

Memory Utilization

Page 109: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

High Memory Utilization

BGP stores

– Prefixes many

– Paths could be many per prefix

– Attributes could be many per path

Prefixes, Paths, Attributes

AS 100 AS 10 AS 200

AS 20

{200}

{200}

{10 200}

{20 200}

{20 200}

109

R1#show bgp ipv4 unicast 10.1.1.0/24

Paths: (3 available, best #3, table default)

10 200

10.1.5.3 from 10.1.5.3 (10.100.1.3)

Origin IGP, localpref 100, valid, external

Community: internet 200:1 200:2 200:3

20 200

10.100.1.2 (metric 11) from 10.100.1.2 (10.100.1.2)

Origin IGP, metric 0, localpref 100, valid, internal

Community: 100:1 100:2 no-export

20 200

10.1.4.4 from 10.1.4.4 (10.100.1.4)

Origin IGP, localpref 100, valid, external, best

Community: internet 20:20

R1#show bgp ipv4 unicast paths

Address Hash Refcount Metric Path

0x6C8EE00 2229 1 0 20 200 i

0x6C8ED60 2246 1 0 10 200 i

0x6C8EEA0 2264 1 0 20 200 i

Page 110: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

High Memory Utilization Solutions

110

Prefixes

– Aggregation

– Filter

– Partial routing tables instead of full routing tables

Paths

– Reduce # peerings

– Use RRs instead of iBGP full mesh

Attributes

– Reduce # attributes

• Filter (extended) communties

• Typically inbound on eBGP peering

• Limit your own communities

Page 111: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

High Memory Utilization

• Multiple views exist in IGPs, as well

–But not on the same scale

–Neighbor adjacencies in IGPs are generally on a lower scale – In the hundreds, not the thousands

–Neighbor adjacencies in IGPs normally pick up different routes, rather than the same route multiple times

Views and Routes

111

For Your Reference

Page 112: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

High Memory Utilization

The more unique attribute sets you’re receiving, the more unique attribute sets you need to store

– You might have the same number of routes and views over time, but memory utilization can increase

Attributes

10.1.1.0/24

10.1.2.0/24

10.1.3.0/24

10.1.4.0/24

AS Path 1

AS Path 2

Community Set 1

Community Set 2

AS Path 3

Community Set 3

Community Set 4 BGP implementations build their memory structures around

minimizing storage

Attributes are stored once

– Rather than once per route

– Each route references an attribute set, rather than storing the attribute set

This is similar to the way BGP updates are formed

112

Page 113: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

High Memory Utilization Soft Reconfiguration Inbound vs Route Refresh

113

Recommendation

– Use soft reconfiguration inbound only on eBGP peering when you need to know what the peer previously advertised and you filtered out

AS 10 AS 20 AS 10 AS 20

route refresh soft reconfiguration inbound

Filtered prefixes are dropped: much more memory used

Support only on router itself

Changed filter: re-apply policy to table with filtered prefixes

Filtered prefixes are dropped

Support needed on peer, but this a very old feature

Changed filter: router sends out route refresh request to peer to get the full table from peer again

inbound filter inbound filter

Page 114: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Aggregation Reducing Number of Prefixes

114

Aggregate is to summarize many prefixes into fewer larger prefixes

– i.e. smaller mask

Natural for Service Providers, when using PA addressing

Many times not done because traffic engineering or load balancing is needed

Added bonus for aggregate

– Aggregate is stable if only a few components of the aggegate flap

– Even better when dampening is in place

Page 115: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Aggregation Aggregate Command

115

Summary-only needed

– Otherwise more specifics are sent too

By default: no AS-SET, no community info, more specifics are advertised

AS 10 AS 20 192.168.1.0/24

192.168.2.0/24

192.168.3.0/24

...

192.168.255.0/24

192.168.0.0/16

router bgp 1

address-family ipv4

aggregate-address 192.168.0.0 255.255.0.0 summary-only

Page 116: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

OS Enhancements

Page 117: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

BGP Scaling Enhancements Overview

BGP PE-CE Scale Enhancements

– Modified internal data structures and optimized internal algorithms for VRF based update generation

– Result = faster convergence / greater VRF and PE-CE session scaling

BGP PE Scale Enhancements

– Modified internal data structures for VRFs

– Result = considerable memory savings / greater prefix scalability

BGP Generic Scale Enhancements

– Parceling of BGP processes

– Created new BGP task IOS process: “BGP Task”

– Result = optimized update generation / faster convergence

BGP Keepalive Enhancements

– Priority queues for reading/writing Keepalive/Update messages

– Results = avoid neighbor flaps / ability to support small keepalive values in a scaled setup

117

Page 118: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

BGP PE Scale - PE - CE Optimization

118

• Slow convergence when the number of PE-

CE sessions was scaled on a PE router

• Selective evaluation of VPN prefixes for

processing/update generation

• Modified internal data structures and optimized

internal algorithms for VRF based update generation

• Update Generation convergence now bounded by

number of vpn nets and its related peers

• For each CE update group only consider prefixes in

CE’s VRF, not whole VPN table

• Enables VRF and peer session scaling

• Convergence typically improved by 300-400%

• Initial convergence for ASR with 4K VRFs, 8K

peers, and 2.3M routes is under 10 mins

Problem Solution

Results

Page 119: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

BGP PE Scale - VRF Based Advertised Bits

119

• Increased memory consumption when the

number of VRFs was scaled on a PE router

• Smart reuse of advertise bit space for VRFs

• Prefixes in a VRF used to have advertise bit

for every CE update group on the router

• Bits only needed for CEs in the same VRF

• Nets advertise bits made independent of number of

VPN VRFs configured

• Results in considerable memory savings per net

• For 1000 VRFs, net memory savings of 120+ bytes per net

• For 500 VRFs, net memory savings of 60+ bytes per net

• Greater prefix scalability

Problem Solution

Results

Page 120: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

BGP General Scale - Update Generation Enhancements

120

Update Generation is the most time sensitive, critical task for BGP

– Need to optimize to improve end-to-end convergence

Update Generation Parceling and a new process

– Parceling uses certain percentage of process quantum

– By parceling we mean having a task execute its work in small portions requiring limited CPU time (fraction of IOS process time)

– Long lived tasks split up their work across multiple parcels

– Less blocking

– Runs as a separate BGP process

– Created new BGP task IOS process: “BGP Task”

– Currently, only the new update generation is task based (4 tasks per BGP table)

Peer based update message queues

Inline freeing of transmitted update messages

Optimizing prefix based checkpointing

Page 121: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

BGP General Scale - Keepalive Enhancements

121

Persistent network issue

– Delayed processing of BGP Keepalives typically results in session flaps and network outage for

peers configured with aggressive keepalive values

– Unnecessary CPU and transient memory usage

Separate Keepalive Process to handle servicing of keepalive requests

Priority Queues for reading/writing Keepalive/Update messages

Optimizing Keepalive timeout cases

Benefits

– Ability to support aggressive keepalive values in a scaled router setup under stressed conditions

– Fixed unwanted session flaps and network outages

Page 122: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Multi-Instance BGP

122

A new IOS-XR BGP architecture to support multiple BGP instances

Each BGP instance is a separate process running on the same or a different RP/DRP node

The BGP instances do not share any prefix table between them

They can peer with each other

Multiple ASNs are possible *

Solves the 32-bit OS virtual memory limit

Different BGP routers: isolate services/AFs on common infrastructure

Achieve higher prefix scale (especially on a RR) by having different instances carrying different BGP tables

Achieve higher session scale by distributing the overall peering sessions between multiple instances

BGP vpnv4

PE1 10.0.0.1

10.0.0.2

BGP IPv4

BGP vpnv4

RR1 (active)

BGP IPv4

RR2 (backup)

BGP

20.0.0.1

30.0.0.1

20.0.0.2

30.0.0.2

* restrictions apply

Page 123: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Performance: Multi-thread Strategy

CPU clock: speed doesn’t have to increase anymore

CPU core: number of core per CPU are increasing: 2,4,8,16,64

IOS XE:

– All BGP processes runs within a single threat

– Optimized for single-core CPU

IOS XR: – Each component runs as a separate process (i.e. BGP, OSPF, ISIS)

– Most processes are multi-threaded

– Each thread is scheduled for the same time-slice using pre-emptive scheduling

– BGP has 16+ threads each of which performs a specific set of tasks

– Processes are optimized for multi-core CPUs

123

For Your Reference

Page 124: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Maximize your Cisco Live experience with your

free Cisco Live 365 account. Download session

PDFs, view sessions on-demand and participate in

live activities throughout the year. Click the Enter

Cisco Live 365 button in your Cisco Live portal to

log in.

Complete Your Online Session Evaluation

Give us your feedback and you could win fabulous prizes. Winners announced daily.

Receive 20 Cisco Daily Challenge points for each session evaluation you complete.

Complete your session evaluation online now through either the mobile app or internet kiosk stations.

124

Page 125: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges
Page 126: Scaling BGP - Deniz AYDINmonsterdark.com/wp-content/uploads/Scaling-BGP.pdf · BRKRST-3321 © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public Scale Challenges

© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3321 Cisco Public

Fast Convergence and Keeping Scalability

Keep BGP timers, use BFD

NHT enabled by default

Quick alternate paths

– BGP ADD-PATH

– Diverse session

– PIC

– Different RDs

MRAI

– Defaults are fine

Dampening

– One flap is propagated fast, not the subsequent ones