OTV Decoded – A Fancy GRE Tunnel
Posted by Brian McGahan, CCIE #8593 in CCIE Data Center, Nexus


About Brian McGahan, CCIE #8593: Brian McGahan was one of the youngest engineers in the world to obtain the CCIE, having achieved his first CCIE in Routing & Switching at the age of 20 in 2002. Brian has been teaching and developing CCIE training courses for over 8 years, and has assisted thousands of engineers in obtaining their CCIE certification. When not teaching or developing new products Brian consults with large ISPs and enterprise customers in the midwest region of the United States.



Edit: For those of you that want to take a look first-hand at these packets, the Wireshark PCAP files referenced in this post can be found here

One of the hottest topics in networking today is Data Center Virtualized Workload Mobility (VWM). For those of you that have been hiding under a rock for the past few years, workload mobility basically means the ability to dynamically and seamlessly reassign hardware resources to virtualized machines, often between physically disparate locations, while keeping this transparent to the end users. This is often accomplished through VMware vMotion, which allows for live migration of virtual machines between sites, or as similarly implemented in Microsoft’s Hyper-V and Citrix’s Xen hypervisors.

One of the typical requirements of workload mobility is that the hardware resources used must be on the same layer 2 network segment. E.g. the VMware Host machines must be in the same IP subnet and VLAN in order to allow for live migration of their VMs. The big design challenge then becomes, how do we allow for live migrations of VMs between Data Centers that are not in the same layer 2 network? One solution to this problem that Cisco has devised is a relatively new technology called Overlay Transport Virtualization (OTV).

As a side result of preparing for INE’s upcoming CCIE Data Center Nexus Bootcamp I’ve had the privilege (or punishment depending on how you look at it) of delving deep into the OTV implementation on Nexus 7000. My goal was to find out exactly what was going on behind the scenes with OTV. The problem I ran into though was that none of the external Cisco documentation, design guides, white papers, Cisco Live presentations, etc. really contained any of this information. The only thing that is out there on OTV is mainly marketing info, i.e. buzzword bingo, or very basic config snippets on how to implement OTV. In this blog post I’m going to discuss the details of my findings about how OTV actually works, with the most astonishing of these results being that OTV is, in fact, a fancy GRE tunnel.

From a high level overview, OTV is basically a layer 2 over layer 3 tunneling protocol. In essence OTV accomplishes the same goal as other L2 tunneling protocols such as L2TPv3, Any Transport over MPLS (AToM), or Virtual Private LAN Services (VPLS). For OTV specifically this goal is to take Ethernet frames from an end station, like a virtual machine, encapsulate them inside IPv4, transport them over the Data Center Interconnect (DCI) network, decapsulate them on the other side, and out pops your original Ethernet frame.

For this specific application OTV has some inherent benefits over other designs such as MPLS L2VPN with AToM or VPLS. The first is that OTV is transport agnostic. As long as there is IPv4 connectivity between Data Centers, OTV can be used. AToM and VPLS both require that the transport network be MPLS aware, which can limit your selection of Service Providers for the DCI. OTV, on the other hand, can technically be used over any regular Internet connectivity.

Another advantage of OTV is that provisioning is simple. AToM and VPLS tunnels are Provider Edge (PE) side protocols, while OTV is a Customer Edge (CE) side protocol. This means for AToM and VPLS the Service Provider has to pre-provision the pseudowires. Even though VPLS supports enhancements like BGP auto-discovery, provisioning of MPLS L2VPN still requires administrative overhead. OTV is much simpler in this case, because as we’ll see shortly, the configuration is just a few commands that are controlled by the CE router, not the PE router.

The next thing we have to consider with OTV is how exactly this layer 2 tunneling is accomplished. After all, we could just configure static GRE tunnels on our DCI edge routers and bridge IP over them, but this is probably not the best design option for either control plane or data plane scalability.

The way that OTV implements the control plane portion of its layer 2 tunnel is what is sometimes described as “MAC in IP Routing”. Specifically OTV uses Intermediate System to Intermediate System (IS-IS) to advertise the VLAN and MAC address information of the end hosts over the Data Center Interconnect. For those of you that are familiar with IS-IS, immediately this should sound suspect. After all, IS-IS isn’t an IP protocol, it’s part of the legacy OSI stack. This means that IS-IS is directly encapsulated over layer 2, unlike OSPF or EIGRP which ride over IP at layer 3. How then can IS-IS be encapsulated over the DCI network that is using IPv4 for transport? The answer? A fancy GRE tunnel.

The next portion that is significant about OTV’s operation is how it actually sends packets in the data plane. Assuming for a moment that the control plane “just works”, and the DCI edge devices learn about all the MAC addresses and VLAN assignments of the end hosts, how do we actually encapsulate layer 2 Ethernet frames inside of IP to send over the DCI? What if there is multicast traffic that is running over the layer 2 network? Also what if there are multiple sites reachable over the DCI? How does it know specifically where to send the traffic? The answer? A fancy GRE tunnel.

Next I want to introduce the specific topology that will be used for us to decode the details of how OTV is working behind the scenes. Within the individual Data Center sites, the layer 2 configuration and physical wiring is not relevant to our discussion of OTV. Assume simply that the end hosts have layer 2 connectivity to the edge routers. Additionally assume that the edge routers have IPv4 connectivity to each other over the DCI network. In this specific case I chose to use RIPv2 for routing over the DCI (yes, you read that correctly), simply so I could filter it from my packet capture output, and easily differentiate between the routing control plane in the DCI transport network vs. the routing control plane that was tunneled inside OTV between the Data Center sites.

What we are mainly concerned with in this topology is as follows:

OTV Edge Devices N7K1-3 and N7K2-7: These are the devices that actually encapsulate the Ethernet frames from the end hosts into the OTV tunnel. I.e. this is where the OTV config goes.

DCI Transport Device N7K2-8: This device represents the IPv4 transit cloud between the DC sites. From this device’s perspective it sees only the tunnel encapsulated traffic, and does not know the details about the hosts inside the individual DC sites. Additionally this is where packet capture is occurring so we can view the actual payload of the OTV tunnel traffic.

End Hosts R2, R3, Server 1, and Server 3: These are the end devices used to generate data plane traffic that ultimately flows over the OTV tunnel.

Now let’s look at the specific configuration on the edge routers that is required to form the OTV tunnel.

N7K1-3:

vlan 172

name OTV_EXTEND_VLAN

!

vlan 999

name OTV_SITE_VLAN

!

spanning-tree vlan 172 priority 4096

!

otv site-vlan 999

otv site-identifier 0x101

!

interface Overlay1

otv join-interface Ethernet1/23

otv control-group 224.100.100.100

otv data-group 232.1.2.0/24

otv extend-vlan 172

no shutdown

!

interface Ethernet1/23

ip address 150.1.38.3/24

ip igmp version 3

ip router rip 1

no shutdown

N7K2-7:

vlan 172

name OTV_EXTEND_VLAN

!

vlan 999

name OTV_SITE_VLAN

!

spanning-tree vlan 172 priority 4096

!

otv site-vlan 999

otv site-identifier 0x102

!

interface Overlay1

otv join-interface port-channel78

otv control-group 224.100.100.100

otv data-group 232.1.2.0/24

otv extend-vlan 172

no shutdown

!

interface port-channel78

ip address 150.1.78.7/24

ip igmp version 3

ip router rip 1

As you can see the configuration for OTV really isn’t that involved. The specific portions of the configuration that are relevant are as follows:

Extend VLANs: These are the layer 2 segments that will actually get tunneled over OTV. Basically these are the VLANs that your virtual machines reside on that you want to do the VM mobility between. In our case this is VLAN 172, which maps to the IP subnet 172.16.0.0/24.

Site VLAN: Used to synchronize the Authoritative Edge Device (AED) role within an OTV site. This is for when you have more than one edge router per site. OTV only allows a specific Extend VLAN to be tunneled by one edge router at a time for the purpose of loop prevention. Essentially this Site VLAN lets the edge routers talk to each other and figure out which one is active/standby on a per-VLAN basis for the OTV tunnel. The Site VLAN should not be included in the extend VLAN list.

Site Identifier: Should be unique per DC site. If you have more than one edge router per site, they must agree on the Site Identifier, as it’s used in the AED election.

Overlay Interface: The logical OTV tunnel interface.

OTV Join Interface: The physical link or port-channel that you use to route upstream towards the DCI.

OTV Control Group: Multicast address used to discover the remote sites in the control plane.

OTV Data Group: Used when you’re tunneling multicast traffic over OTV in the data plane.

IGMP Version 3: Needed to send (S,G) IGMP Report messages towards the DCI network on the Join Interface.

At this point that’s basically all that’s involved in the implementation of OTV. It “just works”, because all the behind-the-scenes stuff is hidden from us from a configuration point of view. A quick test of this from the end hosts shows us that:

R2#ping 255.255.255.255

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 255.255.255.255, timeout is 2 seconds:

Reply to request 0 from 172.16.0.3, 4 ms

Reply to request 1 from 172.16.0.3, 1 ms

Reply to request 2 from 172.16.0.3, 1 ms

Reply to request 3 from 172.16.0.3, 1 ms

Reply to request 4 from 172.16.0.3, 1 ms

R2#traceroute 172.16.0.3

Type escape sequence to abort.

Tracing the route to 172.16.0.3

VRF info: (vrf in name/id, vrf out name/id)

1 172.16.0.3 0 msec * 0 msec

The fact that R3 responds to R2’s packets going to the all hosts broadcast address (255.255.255.255) implies that they are in the same broadcast domain. How specifically is it working though? That’s what took a lot of further investigation.

To simplify the packet level verification a little further, I changed the MAC address of the four end devices that are used to generate the actual data plane traffic. The Device, IP address, and MAC address assignments are as follows:

The first thing I wanted to verify in detail was what the data plane looked like, and specifically what type of tunnel encapsulation was used. With a little searching I found that OTV is currently on the IETF standards track in draft format. As of writing, the newest draft is draft-hasmit-otv-03. Section 3.1 Encapsulation states:

3. Data Plane

3.1. Encapsulation

The overlay encapsulation format is a Layer-2 ethernet frame

encapsulated in UDP inside of IPv4 or IPv6.

The format of OTV UDP IPv4 encapsulation is as follows:

1 2 3

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

|Version| IHL |Type of Service| Total Length |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| Identification |Flags| Fragment Offset |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| Time to Live | Protocol = 17 | Header Checksum |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| Source-site OTV Edge Device IP Address |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| Destination-site OTV Edge Device (or multicast) Address |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| Source Port = xxxx | Dest Port = 8472 |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| UDP length | UDP Checksum = 0 |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

|R|R|R|R|I|R|R|R| Overlay ID |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| Instance ID | Reserved |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| |

| Frame in Ethernet or 802.1Q Format |

| |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

A quick PING sweep of packet lengths with the Don’t Fragment bit set allowed me to find the encapsulation overhead, which turns out to be 42 bytes, as seen below:

R3#ping 172.16.0.2 size 1459 df-bit

Type escape sequence to abort.

Sending 5, 1459-byte ICMP Echos to 172.16.0.2, timeout is 2 seconds:

Packet sent with the DF bit set

.....

Success rate is 0 percent (0/5)

R3#ping 172.16.0.2 size 1458 df-bit

Type escape sequence to abort.

Sending 5, 1458-byte ICMP Echos to 172.16.0.2, timeout is 2 seconds:

Packet sent with the DF bit set

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/4 ms
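To sanity check the math, assume a standard 1500 byte MTU in the transit network. The size option in the IOS ping specifies the full IP datagram size, so a 1458 byte packet plus the 42 bytes of OTV encapsulation comes out to exactly 1500 bytes and squeaks through, while 1459 + 42 = 1501 bytes is one byte too big and gets dropped once the DF bit is set. Keep this 42 bytes in mind for real deployments too: if the transit MTU can’t be raised above 1500, the end hosts in the extended VLANs have to send correspondingly smaller packets.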

None of my testing could verify what the encapsulation header actually was though. The draft says that the transport is supposed to be UDP port 8472, but none of my logging produced results showing that any UDP traffic was even in the transit network (save for my RIPv2 routing). After much frustration, I finally broke out the sniffer and took some packet samples. The first capture below shows a normal ICMP ping between R2 and R3.

MPLS? GRE? Where did those come from? That’s right, OTV is in fact a fancy GRE tunnel. More specifically it is an Ethernet over MPLS over GRE tunnel. My poor little PINGs between R2 and R3 are in fact encapsulated as ICMP over IP over Ethernet over MPLS over GRE over IP over Ethernet (IoIoEoMPLSoGREoIP for short). Let’s take a closer look at the encapsulation headers now:

In the detailed header output we see our transport Ethernet header, which in a real deployment can be anything depending on what the transport of your DCI is (Ethernet, POS, ATM, Avian Carrier, etc.). Next we have the IP OTV tunnel header, which surprised me in a few aspects. First, all documentation I read said that without the use of an OTV Adjacency Server, unicast can’t be used for transport. This is true… up to a point. Multicast, it turns out, is only used to establish the control plane, and to tunnel multicast over multicast in the data plane. Regular unicast traffic over OTV will be encapsulated as unicast, as seen in this capture.

The next header after IP is GRE. In other words, OTV is basically the same as configuring a static GRE tunnel between the edge routers and then bridging over it, along with some enhancements (hence fancy GRE). The OTV enhancements (which we’ll talk about shortly) are the reason why you wouldn’t just configure GRE statically. Nevertheless this surprised me, because even in hindsight the only mention of OTV using GRE I found was here. What’s really strange about this is that Cisco’s OTV implementation doesn’t follow what the standards track draft says, which is UDP, even though the authors of the OTV draft are Cisco engineers. Go figure.
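For context, the “static GRE tunnel plus bridging” alternative mentioned above would look roughly like the following on a pair of IOS edge routers. This is purely a hypothetical sketch (the addresses, interface names, and bridge group number are made up, and support for bridging over a tunnel interface varies by platform and feature set); it gets you the layer 2 extension, but none of the OTV extras like the IS-IS MAC routing, the ARP optimization, the automatic STP demarc, or the multi-site multicast handling discussed later.

bridge 1 protocol ieee
!
interface Tunnel0
 tunnel source 192.0.2.1
 tunnel destination 192.0.2.2
 bridge-group 1
!
interface GigabitEthernet0/1
 description LAN segment being extended
 bridge-group 1

The other site would mirror this with the tunnel source and destination swapped, and every additional site would need another point-to-point tunnel, which is exactly the per-neighbor provisioning that OTV’s multicast-based discovery avoids.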

The next header, MPLS, makes sense since the prior encapsulation is already GRE. Ethernet over MPLS over GRE is already well defined and used in deployment, so there’s no real reason to reinvent the wheel here. I haven’t verified this in detail yet, but I’m assuming that the MPLS Label value would be used in cases where the edge router has multiple overlay interfaces, in which case the label in the data plane would quickly tell it which overlay interface the incoming packet is destined for. This logic is similar to MPLS L3VPN, where the bottom of the stack VPN label tells a PE router which CE facing link the packet is ultimately destined for. I’m going to do some more testing later with a larger, more complex topology to actually verify this though, as all data plane traffic over this tunnel is always sharing the same MPLS label value.

Next we see the original Ethernet header, which is sourced from R2’s MAC address 0000.0000.0002 and going to R3’s MAC address 0000.0000.0003. Finally we have the original IP header and the final ICMP payload. The key with OTV is that this inner Ethernet header and its payload remain untouched, so from the end host’s perspective it looks like all the devices are just on the same LAN.

Now that it was apparent that OTV was just a fancy GRE tunnel, the IS-IS piece fell into place. Since IS-IS runs directly over layer 2 (e.g. Ethernet), and OTV is an Ethernet over MPLS over GRE tunnel, IS-IS can simply be encapsulated as IS-IS over Ethernet over MPLS over GRE (phew!). To test this, I changed the MAC address of one of the end hosts, and looked at the IS-IS LSP generation of the edge devices. After all, the goal of the OTV control plane is to use IS-IS to advertise the MAC addresses of end hosts in that particular site, as well as the particular VLAN that they reside in. The configuration steps and packet capture result of this are as follows:

R3#conf t

Enter configuration commands, one per line. End with CNTL/Z.

R3(config)#int gig0/0

R3(config-if)#mac-address 1234.5678.9abc

R3(config-if)#

*Aug 17 22:17:10.883: %LINK-5-CHANGED: Interface GigabitEthernet0/0, changed state to reset

*Aug 17 22:17:11.883: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0, changed

state to down

*Aug 17 22:17:16.247: %LINK-3-UPDOWN: Interface GigabitEthernet0/0, changed state to up

*Aug 17 22:17:17.247: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0, changed

state to up

The first thing I noticed about the IS-IS encoding over OTV is that it uses IPv4 Multicast. This makes sense, because if you have 3 or more OTV sites you don’t want to have to send your IS-IS LSPs as replicated Unicast. As long as all of the AEDs on all sites have joined the control group (224.100.100.100 in this case), the LSP replication should be fine. This multicast forwarding can also be verified in the DCI transport network core in this case as follows:

N7K2-8#show ip mroute

IP Multicast Routing Table for VRF "default"

(*, 224.100.100.100/32), uptime: 20:59:33, ip pim igmp

Incoming interface: Null, RPF nbr: 0.0.0.0

Outgoing interface list: (count: 2)

port-channel78, uptime: 20:58:46, igmp

Ethernet1/29, uptime: 20:58:53, igmp

(150.1.38.3/32, 224.100.100.100/32), uptime: 21:00:05, ip pim mrib

Incoming interface: Ethernet1/29, RPF nbr: 150.1.38.3

Outgoing interface list: (count: 2)

port-channel78, uptime: 20:58:46, mrib

Ethernet1/29, uptime: 20:58:53, mrib, (RPF)

(150.1.78.7/32, 224.100.100.100/32), uptime: 21:00:05, ip pim mrib

Incoming interface: port-channel78, RPF nbr: 150.1.78.7

Outgoing interface list: (count: 2)

port-channel78, uptime: 20:58:46, mrib, (RPF)

Ethernet1/29, uptime: 20:58:53, mrib

(*, 232.0.0.0/8), uptime: 21:00:05, pim ip

Incoming interface: Null, RPF nbr: 0.0.0.0

Outgoing interface list: (count: 0)

Note that N7K1-3 (150.1.38.3) and N7K2-7 (150.1.78.7) have both joined the (*, 224.100.100.100) group. A very important point about this is that the control group for OTV is an Any Source Multicast (ASM) group, not a Source Specific Multicast (SSM) group. This implies that your DCI transit network must run PIM Sparse Mode and have a Rendezvous Point (RP) configured in order to build the shared tree (RPT) for the OTV control group used by the AEDs. You technically could use Bidir, but you really wouldn’t want to for this particular application. How they chose to implement this kind of surprised me, because there are already more efficient ways of doing source discovery for SSM, for example how Multicast MPLS L3VPN uses the BGP AFI/SAFI Multicast MDT to advertise the (S,G) pairs of the PE routers. I suppose the advantage of doing OTV this way though is that it makes the OTV config very straightforward from an implementation point of view on the AEDs, and you don’t need an extra control plane protocol like BGP to exchange the (S,G) pairs before you actually join the tree. The alternative to this of course is to use the Adjacency Server and just skip using multicast altogether. This however will result in unicast replication in the core, which can be bad, mkay?
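To make the transit requirement concrete, the PIM side of this boils down to a handful of commands on the DCI core switches. The following is only a rough sketch assuming a Nexus core (the RP address and interface name are made up for illustration); the essential points are simply that PIM sparse mode runs on the links facing the AED join interfaces and that an RP exists for the ASM control group:

feature pim
!
ip pim rp-address 150.1.88.8 group-list 224.0.0.0/4
!
interface Ethernet1/29
 ip pim sparse-mode

If running multicast in the DCI core is not an option at all, the Adjacency Server model replaces it: one AED is configured with otv adjacency-server unicast-only under its overlay interface, and the other AEDs point at it with otv use-adjacency-server followed by its address and the unicast-only keyword in place of the control and data groups (treat the exact syntax as release dependent). The trade-off, as noted above, is head-end unicast replication on the AEDs.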

Also, for added fun, the actual MAC address routing table that the IS-IS control plane builds can be verified as follows:

N7K2-7# show otv route

OTV Unicast MAC Routing Table For Overlay1

VLAN MAC-Address Metric Uptime Owner Next-hop(s)

---- -------------- ------ -------- --------- -----------

172 0000.0000.0002 1 01:22:06 site port-channel27

172 0000.0000.0003 42 01:20:51 overlay N7K1-3

172 0000.0000.000a 42 01:18:11 overlay N7K1-3

172 0000.0000.001e 1 01:20:36 site port-channel27

172 1234.5678.9abc 42 00:19:09 overlay N7K1-3

N7K2-7# show otv isis database detail | no-more

OTV-IS-IS Process: default LSP database VPN: Overlay1

OTV-IS-IS Level-1 Link State Database

LSPID Seq Number Checksum Lifetime A/P/O/T

N7K2-7.00-00 * 0x000000A3 0xA36A 893 0/0/0/1

Instance : 0x000000A3

Area Address : 00

NLPID : 0xCC 0x8E

Hostname : N7K2-7 Length : 6

Extended IS : N7K1-3.01 Metric : 40

Vlan : 172 : Metric : 1

MAC Address : 0000.0000.001e

Vlan : 172 : Metric : 1

MAC Address : 0000.0000.0002

Digest Offset : 0

N7K1-3.00-00 0x00000099 0xBAA4 1198 0/0/0/1

Instance : 0x00000094

Area Address : 00

NLPID : 0xCC 0x8E

Hostname : N7K1-3 Length : 6

Extended IS : N7K1-3.01 Metric : 40

Vlan : 172 : Metric : 1

MAC Address : 1234.5678.9abc

Vlan : 172 : Metric : 1

MAC Address : 0000.0000.000a

Vlan : 172 : Metric : 1

MAC Address : 0000.0000.0003

Digest Offset : 0

N7K1-3.01-00 0x00000090 0xCBAB 718 0/0/0/1

Instance : 0x0000008E

Extended IS : N7K2-7.00 Metric : 0

Extended IS : N7K1-3.00 Metric : 0

Digest Offset : 0

So at this point we see that our ICMP PING was actually ICMP over IP over Ethernet over MPLS over GRE over IP over Ethernet, and our routing protocol was IS-IS over Ethernet over MPLS over GRE over IP over Ethernet :/ What about multicast in the data plane though? At this point verification of multicast over the DCI core is pretty straightforward, since we can just enable a routing protocol that uses multicast, like EIGRP, and look at the result. This can be seen below:

R2#config t

Enter configuration commands, one per line. End with CNTL/Z.

R2(config)#router eigrp 1

R2(config-router)#no auto-summary

R2(config-router)#network 0.0.0.0

R2(config-router)#end

R2#

R3#config t

Enter configuration commands, one per line. End with CNTL/Z.

R3(config)#router eigrp 1

R3(config-router)#no auto-summary

R3(config-router)#network 0.0.0.0

R3(config-router)#end

R3#

*Aug 17 22:39:43.419: %SYS-5-CONFIG_I: Configured from console by console

*Aug 17 22:39:43.423: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.0.2 (GigabitEthernet0/0) is

up: new adjacency

R3#show ip eigrp neighbors

IP-EIGRP neighbors for process 1

H Address Interface Hold Uptime SRTT RTO Q Seq

(sec) (ms) Cnt Num

0 172.16.0.2 Gi0/0 11 00:00:53 1 200 0 1

Our EIGRP adjacency came up, so multicast obviously is being tunneled over OTV. Let’s see the packet capture result:

We can see EIGRP being tunneled inside the OTV payload, but what’s with the outer header? Why is EIGRP using the ASM 224.100.100.100 group instead of the SSM 232.1.2.0/24 data group? My first guess was that link local multicast (i.e. 224.0.0.0/24) would get encapsulated as control plane instead of as data plane. This would make sense, because you would want control plane protocols like OSPF, EIGRP, PIM, etc. tunneled to all OTV sites, not just the ones that joined the SSM feeds. To test if this was the case, the only change I needed to make was to have one router join a non-link-local multicast group, and have the other router send ICMP pings. Since they’re effectively in the same LAN segment, no PIM routing is needed in the DC sites, just basic IGMP Snooping, which is enabled in NX-OS by default. The config on the IOS routers is as follows:

R2#config t

Enter configuration commands, one per line. End with CNTL/Z.

R2(config)#ip multicast-routing

R2(config)#int gig0/0

R2(config-if)#ip igmp join-group 224.10.20.30

R2(config-if)#end

R2#

R3#ping 224.10.20.30 repeat 1000 size 1458 df-bit

Type escape sequence to abort.

Sending 1000, 1458-byte ICMP Echos to 224.10.20.30, timeout is 2 seconds:

Packet sent with the DF bit set

Reply to request 0 from 172.16.0.2, 1 ms

Reply to request 1 from 172.16.0.2, 1 ms

Reply to request 2 from 172.16.0.2, 1 ms

The packet capture result was as follows:

This was more in line with what I expected. Now the multicast data plane packet was getting encapsulated as ICMP over IP over Ethernet over MPLS over GRE over IP *Multicast* over Ethernet, destined to the OTV data group. The payload wasn’t decoded, as I think even Wireshark was dumbfounded by this string of encapsulations.

In summary we can make the following observations about OTV:

OTV encapsulation has 42 bytes of overhead that consists of:

New Outer Ethernet II Header – 14 Bytes

New Outer IP Header – 20 Bytes

GRE Header – 4 Bytes

MPLS Header – 4 Bytes

OTV uses both Unicast and Multicast transport:

ASM Multicast is used to build the control plane for OTV IS-IS, ARP, IGMP, EIGRP, etc.

Unicast is used for normal unicast data plane transmission between sites

SSM Multicast is used for normal multicast data plane transmission between sites

Optionally ASM & SSM can be replaced with the Adjacency Server

GRE is the ultimate band-aid of networking

Now the next time someone is throwing around fancy buzzwords about OTV, DCI, VWM, etc. you can say “oh, you mean that fancy GRE tunnel?”

I’ll be continuing this series in the coming days and weeks on other Data Center and specifically CCIE Data Center related technologies. If you have a request for a specific topic or protocol that you’d like to see the behind-the-scenes details of, drop me a line at [email protected].

Happy Labbing!

Tags: gre, IS-IS, MPLS, multicast, otv


42 Responses to “OTV Decoded – A Fancy GRE Tunnel”

Currently I am building a DR site, and this is definitely going to help me a lot.

Reply

Well done, Brian!

Reply

Brian,

Sorry for being off-topic but I don’t know where else to ask this. I am interested in the new CCNP SP track and I have noticed that currently there is no specific documentation for this track. I would love to see a CCNP SP ATC from you .. there is really nothing like that or for that matter, specific documentation out there right now.. time is precious so having to the point training materials would be convenient.

Thanks

Reply

Hi Luciano,

We’re planning CCNP SP for Q1 2013.

Reply

Simply amazing post Brian!!!! That is just Wowwww!

Thanks a lot for sharing!

Regards,
Laurent

Reply

Hello Brian, this is by far the best explanation of OTV I have come across.

Would love to see something in the same vein for virtual port channels

Reply

Excellent post Brian! Do you agree with Cisco that OTV is the best way of extending VLANs over DCI for enterprise networks?

Suggestion for next post: FabricPath.

Reply

For DCI yes I would choose OTV over dark fiber/WDM or AToM/VPLS because of the extra enhancements. For example with OTV the STP demarc is at the edge router automatically. With other DCIs you can manually filter STP, but you leave yourself open to loops if you misconfigure the network or in certain failure scenarios. Also OTV optimizes ARP, by having the edge routers send proxy ARP replies to hosts within their own site that are trying to reach hosts in other sites across the OTV DCI. This can be a big control plane savings depending on the size of the sites.

Reply

So you’re saying STP stops at the OTV Edge devices? Does that mean STP is only local to each DC?

That would be preferable if that was the case. It never seems like a good idea to have a L2 DCI. Stretching VLANs is just creating one massive failure domain.

Now if you have a broadcast storm, etc, in one VLAN in DC1 – you’re taking out DC2 as well. Not a good design if you ask me.

CJ

Reply

That’s correct. The broadcast domain spans over the DCI, but the STP domain does not. The OTV edge devices should be the STP root switches for the extend VLANs, which ensures that they’re always at the top of the tree, and that all downstream links are forwarding. There are other protections against broadcast storms, like the ARP proxying. Even though ARP is a broadcast which normally would be flooded, the OTV edge device stops the ARP from going across the DCI if it already has the entry in its local ARP cache. The result is that while a layer 2 loop could still take out one DC, it’s not going to flood over the DCI and automatically take out the other DC sites. You don’t get this extra functionality by default if you run something like dark fiber or AToM or VPLS.

Reply

Hi Brian,

Is OTV VRF-aware i.e. can I have the Join-interface in a VRF? Also, does the latest NX-OS support SVI or loopback interfaces in the OTV VDC?

Thanks.

AB.

Reply

Excellent post Brian, you really nailed it down. I was so surprised to see that the Ethernet frames are not encapsulated in UDP but actually in MPLS over GRE. Cisco might have done this because encapsulating TCP packets over UDP would not make sense. One thing I figured out is that UDP and GRE+MPLS have the same header length: 16 bytes. If you look at the OTV draft http://tools.ietf.org/pdf/draft-hasmit-otv-03.pdf the expiration date is Jan 9th 2012. Also on the newer implementation of OTV on the ASR 1000, the “show otv” command actually tells you it does GRE/IPv4 encapsulation. I suppose when the draft is resubmitted again it will include the GRE+MPLS changes. From the point of view of Cisco’s OTV implementation, the ASIC on the line card does not need to be changed, as the total 42 bytes of overhead stays the same in both cases and the ASIC parses the same length of packet. This would be a software change instead.

Just out of curiosity and for everyone’s benefit, if you can upload the actual .pcap files, I would really appreciate it.

Reply

I’ll post the pcap files tonight.

Reply

Thank you Brian.

Reply

The Wireshark PCAP files referenced in this post can be found here

Reply

Hey Brian,

Can you do a detailed post on how vPC works (low level information) and how Layer 2 loop prevention happens in a vPC scenario?

Also covering details of vPC vs vPC+

Thanks!

Reply

You rock man!! Thanks for the blog about it. It would be good to see a comparison of these DCI technologies from the design point of view.

Reply

That would be an interesting write-up. Something like “Design Considerations in Choosing a Data Center Interconnect Technology”, comparing OTV, VPLS, AToM, L2TPv3, WDM, etc. I’ll definitely take it under consideration.

Thanks Marcio!

Reply

Thanks Brian. You got my point. I’m already waiting to read this article/blog

Reply

Hi all.

To me OTV is a “no go” in the DCI area, just because it doesn’t support unknown unicast flooding.

How many customers use Microsoft NLB or other similar technologies?

My preferred scenario is currently TRILL over DWDM.

Reply

Hi Brian,

“the VMware Host machines must be in the same IP subnet and VLAN in order to allow for live migration their VMs.”

Could you clarify: I thought it is the VMware guest that must end up in the same VLAN; the vMotion VLANs can be routed (though that might not be supported)

Reply

The vmkernel interface needs to be on the same subnet as the src/dst vm hosts. It’s possible this has changed in new versions though. A good design doc on this is http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DCI/4.0/EMC/EMC_2.html#wp1261714

Reply

Hi Brian,

Does OTV require enabling multicast end to end, i.e. should the underlying infrastructure (the service provider network infrastructure) support multicast?

Reply

By default it requires that the DCI network be multicast capable. This means that your layer 3 interconnect provider should be running PIM sparse mode with an RP or BIDIR PIM. If PIM isn’t running then you can use the OTV Adjacency Server which replaces the multicast requirement with unicast only.

Reply

Hi Brian, thank you very much for the clarification. One more question: the default gateway for most of the servers is an ASA or ACE, not an SVI on the N7K (based on the setup). My question is, does OTV work fine with these security devices?

Reply

It should be fine because the servers are essentially layer 2 adjacent over OTV. You can run into some weird routing patterns though, where a server has a default gateway in a remote DC over the DCI. It will work, it’s just that you have sub-optimal forwarding. One of the solutions that can help with this is Locator/Identifier Separation Protocol (LISP), where the goal is to separate your location (e.g. DC site) from your IP address.

Reply

This might be of interest to you as well: http://www.netcraftsmen.net/component/content/article/69-data-center/818.html

Thank you very much, looking forward to learning from you online next week.

How does OTV differ from VPLS?

Reply

OTV is a customer side layer 2 tunnel. VPLS is a provider side layer 2 tunnel. VPLS requires MPLS for transit, OTV does not. There are other differences of course but those are really the main ones.

Reply

Great post, thanks a lot. Looking forward to the Data Center curriculum (hopefully online classes?)

You made an argument for OTV vs VPLS and AToM but what about L2TP? Are OTV and L2TP not pretty much the same, with L2TP being supported by cheaper hardware?

Reply

L2TPv3 is also an option, and like you said it’s supported on lower level platforms (i.e. cheaper ones). It doesn’t have any built-in enhancements though, like the ARP flooding optimization.

Reply

Nice work. This is good stuff.

I am wondering if the MPLS label is used to encode the VLAN information. That Cisco OTV FAQ that you linked to notes that an “OTV shim” is added to the header to encode VLAN information. I’m thinking that maybe the OTV shim = MPLS label. Can you try and extend another VLAN over OTV using the same overlay interface and see if it generates a different MPLS label?

Reply

That’s a likely assumption. I’ll check into it in more detail and post an update.

Reply

Really a great post. Understanding the header details has significant implications re: HW support for other similar overlay technologies like VXLAN and LISP.

Nice Job !

Reply

Very insightful posting Brian, certainly helps me (even as a Cisco employee familiar with the technology). Great Work!!!

Reply

Very nice work Brian. And clearly well received by the readers.

Regarding the encapsulation: keep in mind that OTV is shipping on ASICs that were finalized well before we even conceived OTV. So to get the solution to the users, we worked with the existing ASIC capabilities and managed to deliver a hardware accelerated Ethernet in IP tunnel. As you well describe, the trick is to do this in two stages and utilize the capabilities of the existing hardware (Eth-in-MPLS + MPLS-in-GRE). We do not really use the MPLS bits for any MPLS purpose (that is a very important point); we use it as the OTV shim that carries segment (VLAN) information.

We did, however, design OTV with an ideal header in mind, and that is what we proposed to the standards bodies. It is also what you will see in future hardware as well as the new wave of technology proposals (LISP and VXLAN use the same exact header proposed by OTV to the IETF).

I wouldn’t trivialize OTV as “just” a fancy GRE tunnel. Although we use the GRE encap (because the secret sauce isn’t really in the encapsulation), the real value OTV brings is in its control plane and the way it handles traffic and simplifies configuration. In other words, the encap is not really that important, and we are working on getting most of these to converge into one.

Reply

Hi Victor,

First off, thanks for the detailed response! Don’t take it the wrong way, I’m not trying to trivialize OTV, in fact I think it’s a *brilliant* way of solving the problem. The optimization of ARP flooding, demarc-ing the STP domain local to the DCI edge, using IS-IS to exchange the control plane AND hiding the details from the front-end are all great ideas.

I have a few questions if you don’t mind. With the current implementation, does the label value encode the VLAN? Are your plans to move towards the UDP encap in future releases? If the EoMPLSoGRE is already hardware accelerated in the ASIC, what’s the advantage of using the UDP encap? Why not just amend the next draft proposal with the EoMPLSoGRE format?

Another point that others have brought up is the security of OTV. I’m sure as you know even in MPLS L3VPN environments, many designs require encryption due to compliance. Are there any plans to integrate GETVPN or other similar tunneling techniques into the AED itself, or is it assumed that this should be done on your “true” L3 edge device, such as ASR1K upstream of the AED? I saw some recent documents talking about the integration of GET and LISP on ASR.

Thanks again for reading!

Brian

Reply

Hi Brian,

Here are the answers to your questions from a little while back.

Q: I have a few questions if you don’t mind. With the current implementation, does the label value encode the VLAN?

A: Yes.

Q: Are your plans to move towards the UDP encap in future releases?

A: Yes, there are benefits to the UDP encap, so once the HW is available, we will support both modes.

Q: If the EoMPLSoGRE is already hardware accelerated in the ASIC, what’s the advantage of using the UDP encap?

A: The UDP encap is more efficient, but more importantly, the UDP encap allows better entropy as the core devices can hash on UDP port numbers and the encapsulated traffic doesn’t get polarized to a single path.

Q: Why not just amend the next draft proposal with the EoMPLSoGRE format?

A: That would be very confusing as the ideal encap is UDP and our other overlay efforts are converging on this UDP encap (LISP, VXLAN, OTV). EoMPLSoGRE was simply a way to get to market in a timely manner.

Q: Another point that others have brought up is the security of OTV. I’m sure as you know even in MPLS L3VPN environments, many designs require encryption due to compliance. Are there any plans to integrate GETVPN or other similar tunneling techniques into the AED itself, or is it assumed that this should be done on your “true” L3 edge device, such as ASR1K upstream of the AED?

A: The right tools for the right job in the right places. The encapsulated traffic can be easily encrypted by the WAN edge routers, like all other inter DC traffic is encrypted. No need to raise the cost of high density port offerings like the N7K by adding crypto HW, as you probably want to manage the policy at the WAN edge anyway and there is little incentive to encrypt the traffic between DC aggregation and WAN edge.

Q: I saw some recent documents talking about the integration of GET and LISP on ASR.

A: Yes, LISP plays a role in CPE devices like the ASR1K and ISRs, therefore it makes sense for the product to support such a solution. The ASR implementation of OTV is also integrated with crypto, allowing the encapsulated traffic to be encrypted. The model is similar to what you would do across multiple boxes in larger networks, only you do it on a single router at smaller sites that don’t require the speeds and densities of a Nexus switching infrastructure.

Reply

Hi Victor, thanks for the reply. That makes sense now about the layer 4 flow hashing in the DCI core and why you would want to use the UDP encap.

Reply

Hi Brian,

Is OTV VRF-aware, i.e. can I have the Join-interface in a VRF? Also, is the limitation of a separate VDC for SVI routing removed in the newer versions of NX-OS, and can an SVI or Loopback interface be used as the Join-interface?

Thanks.
AB.

Reply

The OTV join-interface is VRF aware, as everything inside the N7K NX-OS is in a VRF (everything is inside the default VRF if not otherwise specified). The limitation of no SVI isn’t lifted yet. Technically you can have an SVI, but it must be in the shutdown state.

Reply
