Open Data Center Network Reference Architecture

Towards an Open Data Center with an Interoperable Network

Volume 5: WAN and Ultra Low Latency Applications

Last update: May 2012

It is common for the volume of WAN traffic to increase at an annual rate of thirty percent

or more, and this traffic volume is expected to increase even further with the advent of

larger cloud data centers and multi-site enterprise disaster recovery solutions. In the past

data centers didn't extend broadcast domains over long distance. Filtering was required

for traffic intended to go outside a given broadcast domain. In a more modern

environment, there may be tens to hundreds or even thousands of virtual servers on a

single domain; if this is extended over distance, it would require a huge amount of WAN

bandwidth (otherwise, it might take a very long time to move a VM and its associated

data). Higher data rates on the WAN and service provider network would also drive

disproportionately higher data rates on switches within the data center and at the WAN

edge, which does not lend itself to cost effective scaling. This has motivated the

development of new WAN technologies as the data center network has evolved.
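
To make the scaling concern concrete, the following minimal sketch compounds the thirty percent annual growth figure quoted above and estimates how long a VM transfer would take over a given WAN link. The 40 GB VM size, the link rates, and the `efficiency` factor are illustrative assumptions, not values from this architecture.

```python
def projected_traffic(current_gbps: float, annual_growth: float, years: int) -> float:
    """Traffic volume after compounding a fixed annual growth rate."""
    return current_gbps * (1.0 + annual_growth) ** years


def transfer_time_seconds(data_gigabytes: float, link_gbps: float,
                          efficiency: float = 1.0) -> float:
    """Time to move a block of data over a WAN link.

    efficiency is the assumed fraction of the line rate available to the
    transfer after protocol overhead and competing traffic (0 < efficiency <= 1).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link rate must be positive and efficiency in (0, 1]")
    bits_to_move = data_gigabytes * 8e9  # decimal gigabytes to bits
    return bits_to_move / (link_gbps * 1e9 * efficiency)


if __name__ == "__main__":
    # Thirty percent annual growth compounded over five years (about 3.7x).
    print(f"traffic multiplier after 5 years: {projected_traffic(1.0, 0.30, 5):.2f}")
    # Hypothetical 40 GB of VM image and memory state over three link rates.
    for rate_gbps in (0.1, 1.0, 10.0):
        seconds = transfer_time_seconds(40.0, rate_gbps, efficiency=0.8)
        print(f"{rate_gbps:>5} Gbit/s: {seconds:,.0f} s")
```

Under these assumptions, moving the same VM over a 100 Mbit/second link takes on the order of an hour, which illustrates why extending a large layer-2 domain over distance drives WAN bandwidth requirements.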

Multi-site connectivity can be implemented in a number of ways. Public Internet

connections with IPsec secure tunneling are readily available and low cost, but do not

provide the quality of service and performance guarantees required for many larger

enterprises. Managed data connectivity services provide additional layers of security and

performance running over a public or private Internet connection. Leased line data

services are available from service providers which include options for private

management of point-to-point networks (known as private circuits or Layer 2 VPN) or

full mesh connectivity (Layer 3 VPN). In areas where leased optical fiber (or “dark

fiber”) is available, it is often cost effective for larger enterprises to use dedicated optical

wavelength division multiplexing (WDM) solutions. The cost of WDM is falling rapidly,

and it is also available as an integrated option on large Ethernet switches.

Historically, there have been four distinct generations of enterprise WAN technologies.

Starting in the mid to late 1980s, it became common for enterprise IT organizations to

deploy integrated TDM-based WANs to carry both voice and data traffic. In the early

1990s, IT organizations began to deploy Frame Relay-based WANs. In the mid to late

1990s, some IT organizations replaced their Frame Relay-based WANs with WANs

based on ATM (Asynchronous Transfer Mode) technology. Since around the year 2000,

most IT organizations have replaced their legacy WANs with MPLS based technology

combined with some Internet based services. More recently, MPLS has also been used

within a single data center to deliver the same benefits as when it is used on the WAN.

Since the price/performance of MPLS services tends to lag behind the expected growth of

WAN traffic, new technologies such as virtual private LAN services (VPLS) are being

deployed. VPLS represents the combination of Ethernet and MPLS whereby an Ethernet

frame is encapsulated inside of MPLS. As is typically the case with WAN services, the

viability of using VPLS vs. alternative services will hinge largely on the relative cost of

the services, which will vary by service provider and geographic location. When MPLS is

deployed between data centers, it functions as an overlay on top of an existing leased line

infrastructure; this can make it difficult to create a cost effective infrastructure for smaller

and medium sized enterprises.
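
As an illustration of the encapsulation described above, the sketch below prepends MPLS label stack entries to a raw Ethernet frame. The 32-bit entry layout (20-bit label, 3-bit traffic class, 1-bit bottom-of-stack flag, 8-bit TTL) follows RFC 3032. The use of an outer tunnel label plus an inner pseudowire label is the common VPLS arrangement; the label values, function names, and dummy frame are hypothetical, and the optional control word and the outer transport header are omitted.

```python
import struct
from typing import Tuple


def mpls_label_entry(label: int, traffic_class: int,
                     bottom_of_stack: bool, ttl: int) -> bytes:
    """Pack one 32-bit MPLS label stack entry in network byte order."""
    if not 0 <= label < 2 ** 20:
        raise ValueError("label must fit in 20 bits")
    if not 0 <= traffic_class < 8:
        raise ValueError("traffic class must fit in 3 bits")
    if not 0 <= ttl < 256:
        raise ValueError("TTL must fit in 8 bits")
    word = (label << 12) | (traffic_class << 9) | (int(bottom_of_stack) << 8) | ttl
    return struct.pack("!I", word)


def decode_label_entry(entry: bytes) -> Tuple[int, int, bool, int]:
    """Unpack a label stack entry into (label, traffic class, bottom flag, TTL)."""
    (word,) = struct.unpack("!I", entry)
    return word >> 12, (word >> 9) & 0x7, bool((word >> 8) & 0x1), word & 0xFF


def encapsulate(ethernet_frame: bytes, tunnel_label: int,
                pseudowire_label: int, ttl: int = 64) -> bytes:
    """Prepend an outer tunnel label and an inner pseudowire label to a frame."""
    outer = mpls_label_entry(tunnel_label, 0, False, ttl)
    inner = mpls_label_entry(pseudowire_label, 0, True, ttl)  # bottom of stack
    return outer + inner + ethernet_frame


if __name__ == "__main__":
    dummy_frame = bytes(64)  # placeholder for a customer Ethernet frame
    packet = encapsulate(dummy_frame, tunnel_label=1000, pseudowire_label=2000)
    print(len(packet), decode_label_entry(packet[0:4]), decode_label_entry(packet[4:8]))
```

Each label stack entry adds four bytes, so the two-label arrangement sketched here adds eight bytes of overhead per customer frame before the outer link-layer header is counted.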

There are several types of MPLS/VPLS services, depending on whether the application is

within a single data center or between data centers, and whether traffic is being managed

within a single VPN or between VPNs. Core switches that support MPLS/VPLS

standards enable collapsing the router and core tiers of the data center network into a

flatter network with fewer layers. MPLS/VPLS enable VM migration between data

centers by keeping VMs in a single layer-2 domain which spans multiple data centers.

One of the challenges associated with modern WANs in general and hybrid cloud

computing in particular is that hybrid clouds depend heavily on VM migration among

geographically dispersed servers. This is necessary in order to ensure high availability

and dynamic response to changes in user demand for services. The desire to have

transparency relative to the location of the applications has a number of networking

implications including the following:

• VLAN Extension - The VLANs within which VMs are migrated must be extended

over the WAN between the private and public data centers.

• Secure Tunneling - The tunnels that carry the extended VLANs must provide an adequate level of security for all the

required data flows over the Internet.

• Universal Access to Central Services - All application services, such as load balancing,

should be available and function transparently in this environment.

• Disaster Recovery Solutions - As data centers become larger, there is an increasing

need for multi-site managed backup, recovery, and continuous availability solutions. The

nature of these solutions depends on each user’s tolerance for the period of time their data

can be unavailable during an outage (recovery time objective), the amount of data which

can afford to be lost (recovery point objective), and other factors. Technical problems

remain with supporting multi-hop FCoE across switches at extended distance, so Fibre

Channel will continue to be used for long distance storage backups. Many enterprise

applications will continue to use ultra high availability solutions for their mission critical

data (such as GDPS in a mainframe environment).
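
The recovery point objective described above can be related to WAN capacity with a simple model. The sketch below assumes periodic asynchronous replication, in which the worst-case data loss window is roughly the replication interval plus the time needed to ship one interval's worth of changed data. This model, the function names, and the example figures are assumptions for illustration only; they are not part of this architecture.

```python
def transfer_seconds(changed_gigabytes: float, wan_gbps: float) -> float:
    """Time to ship the data changed during one replication interval."""
    if wan_gbps <= 0:
        raise ValueError("WAN rate must be positive")
    return changed_gigabytes * 8.0 / wan_gbps  # gigabytes to gigabits, then seconds


def worst_case_loss_window(interval_s: float, changed_gigabytes: float,
                           wan_gbps: float) -> float:
    """Assumed model: an outage just before a transfer completes loses the
    whole interval plus the data still in flight."""
    return interval_s + transfer_seconds(changed_gigabytes, wan_gbps)


def meets_rpo(rpo_s: float, interval_s: float, changed_gigabytes: float,
              wan_gbps: float) -> bool:
    """True if the modelled loss window fits within the recovery point objective."""
    return worst_case_loss_window(interval_s, changed_gigabytes, wan_gbps) <= rpo_s


if __name__ == "__main__":
    # Hypothetical site: 10 GB changes every 5 minutes over a 1 Gbit/s WAN link.
    window = worst_case_loss_window(300.0, 10.0, 1.0)
    print(f"loss window: {window:.0f} s, meets 10 minute RPO: "
          f"{meets_rpo(600.0, 300.0, 10.0, 1.0)}")
```

A tighter recovery point objective under this model requires either a shorter replication interval or more WAN bandwidth, which is the cost tradeoff discussed in the following paragraph.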

Since WAN costs can be relatively high compared with intra-site networking, small and

medium sized clients often cannot simply add more WAN capacity to their networks on

demand. There is a tradeoff between cost containment and increased network traffic

demands. Various WAN optimization and acceleration techniques can be used to get

increasing performance from the existing infrastructure. WAN optimization should

enable locating key servers in a centralized location by providing application

performance similar to that achieved on a LAN. Application accelerators for TCP/IP and

similar protocols also play an important role in performance optimization. If real time

applications are deployed over an accelerated WAN, then quality of service and

bandwidth optimization are desired features. There are vendor proprietary alternatives to

MPLS/VPLS; there are a number of concerns with these alternatives, including

requirements to configure them on core routers, security issues (particularly for

alternatives which transport traffic over an untrusted IP connection rather than an

MPLS/VPLS tunnel), and guaranteeing lossless performance and reserved bandwidth.

Further, MPLS/VPLS is a very mature protocol, with well developed traffic engineering

facilities, and since MPLS/VPLS is a shared network model, in principle it offers lower

cost to the end users.

An MPLS backbone for site to site connectivity is compatible with a dual homed Ethernet

architecture in the data center, including core switch connectivity with MLAG, TRILL,

SPB and other features described previously. Routers and firewalls should be deployed in

an active/active configuration and use separate WAN links, cross-connected to provide

high availability. Load balancing across redundant connections is optional depending on

traffic volumes and availability requirements of the application.

Ultra Low Latency Applications

One application which has recently received considerable interest is the design of data

centers to accommodate extremely low latency applications. In some cases, the network

will not be the limiting factor for low latency (for example, storage controllers may have

latency many times larger than the network); in other cases, the network latency may be a

significant factor. These applications may include areas such as telemedicine and other

remote control systems; one of the largest applications involves real time electronic

financial transactions. Sometimes known as high frequency trading (HFT), this approach

is currently responsible for over 1/3 of all stock transactions and is expected to grow

significantly in coming years. The overriding design consideration for HFT applications

is lowering latency, which refers to the total end to end time delay within the data center

network due to a combination of time of flight and processing delays within the network

equipment. Financial applications are especially sensitive to latency; a difference of

microseconds or less can mean millions of dollars in lost revenue. There are several

published examples from retail and online merchants in which added latency reduces the

number of search queries and retail transactions; some of these effects can persist even

after latency has returned to nominal levels following a brief increase. High latency

translates directly to lower performance because applications stall or idle when they are

waiting for a response over the network. Further, new types of network traffic are

particularly sensitive to latency, including virtual machine migration and storage traffic.

In the case of HFT, both the magnitude and consistency of the latency (jitter, or variation

in packet arrival times) are important. Low latency is critical to high performance,

especially for modern applications where the ratio of communication to computation is

relatively high compared to legacy applications. The Securities Technology Analysis

Center (STAC™) is a vendor neutral benchmarking organization comprised of leading

financial market firms, who write and maintain a library of test suites which represent

customer-defined, simulated market trading environments. Testing with this benchmark

is observed and audited by STAC™ and made available to their members and subscribing

companies.
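
The distinction drawn above between the magnitude and the consistency of latency can be illustrated with a short calculation. The sketch below summarizes a set of one-way latency samples by their mean and by two simple jitter measures (peak-to-peak spread and standard deviation). The sample values are hypothetical, the choice of jitter measures is an assumption, and this is not a representation of any STAC benchmark procedure.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(samples_us: Sequence[float]) -> Dict[str, float]:
    """Summarize latency samples (microseconds) by magnitude and variation."""
    if len(samples_us) < 2:
        raise ValueError("at least two samples are needed to estimate jitter")
    lowest, highest = min(samples_us), max(samples_us)
    return {
        "mean_us": statistics.fmean(samples_us),
        "min_us": lowest,
        "max_us": highest,
        "peak_to_peak_us": highest - lowest,         # worst-case spread
        "stdev_us": statistics.pstdev(samples_us),   # typical variation
    }


if __name__ == "__main__":
    # Two hypothetical paths with the same mean latency but different consistency.
    steady_path = [10.0, 10.2, 9.8, 10.1, 9.9]
    bursty_path = [6.0, 14.0, 7.0, 15.0, 8.0]
    for name, samples in (("steady", steady_path), ("bursty", bursty_path)):
        print(name, latency_summary(samples))
```

Both hypothetical paths average 10 microseconds, but the second varies far more from packet to packet, which is why mean latency alone is not a sufficient design target for these applications.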

Today there is a tradeoff between virtualization and latency, so applications with

very low latency requirements are typically not virtualized. In the long term, this

may change as increased speeds of multi-core processors and better software reduce the

latency overhead associated with virtualization.

The internal design of data center switches can influence latency. The number of switch

chip hops within a switch should be minimized; a single-chip switch offers not only

lower latency than a multi-chip switch, but also provides more consistent, deterministic

latency to every switch port. Single-chip solutions also offer higher reliability and lower

power dissipation.

Most of the latency associated with data center networks is incurred by the upper layer

protocols (TCP windowing, flow control, packet retransmission and routing, store and

forward, etc.). For this reason, RDMA techniques such as iWARP and RoCE can be used to

minimize the network stack latency. However, a significant amount of latency is also

incurred from wide area network transport. There are three major sources of latency in the

wide area network (WAN): fiber latency, WAN equipment latency, and the contributions

of equipment in the fiber path (signal regenerators, amplifiers, and dispersion

compensators). The fiber latency is fixed at approximately 5 microseconds per km, and will be

dominated by the WAN distance rather than distances within the data center. This is

particularly difficult to adjust, since fiber paths are often indirect and much longer than

the geographic distance between two locations. For connections between major cities,

existing fiber routes are not particularly direct, and new, more direct fiber builds are often

not economically justified since it is much easier to reinforce existing fiber routes.
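
The per-kilometer fiber latency quoted above follows from the speed of light divided by the group index of the fiber. The sketch below works through that arithmetic; the group index of 1.468 is an assumed typical value for standard single-mode fiber, and the `route_factor` parameter (ratio of fiber path length to geographic distance) is a hypothetical illustration of indirect routing.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # vacuum speed of light


def fiber_delay_us(length_km: float, group_index: float = 1.468) -> float:
    """One-way propagation delay in microseconds over a length of fiber."""
    if length_km < 0 or group_index < 1.0:
        raise ValueError("length must be non-negative and group index at least 1")
    return length_km * group_index / SPEED_OF_LIGHT_KM_PER_S * 1e6


def route_delay_us(geographic_km: float, route_factor: float = 1.0,
                   group_index: float = 1.468) -> float:
    """Delay when the fiber path is route_factor times the geographic distance."""
    if route_factor < 1.0:
        raise ValueError("a fiber route cannot be shorter than the direct distance")
    return fiber_delay_us(geographic_km * route_factor, group_index)


if __name__ == "__main__":
    print(f"per km: {fiber_delay_us(1.0):.2f} us")
    # Hypothetical 1000 km city pair, direct versus a route 40 percent longer.
    for factor in (1.0, 1.4):
        one_way_ms = route_delay_us(1000.0, factor) / 1000.0
        print(f"route factor {factor}: {one_way_ms:.2f} ms one way")
```

With the assumed group index the result is about 4.9 microseconds per km, consistent with the 5 microsecond figure used in this section. Because the delay scales linearly with path length, route selection is the only way to reduce this component.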

Figure 5.1 Sources of latency in the network

There are also potentially significant sources of latency in the long distance optical

transport equipment. For example, optical transponders are used to convert an incoming

data signal to a specific modulated optical wavelength for multiplexing purposes, or to

aggregate lower data rates using time division multiplexing. The electronic time

multiplexing, performance monitoring, protocol conversion, clock recovery, and forward

error correction (FEC) algorithms used in this application are all sources of added latency.

While this is usually negligible for typical applications, it can be significant for latency

sensitive applications. Higher data rates (over 10 Gbit/second) require FEC in order to

detect and correct bit errors, but this can add tens to hundreds of microseconds latency.

Similarly, the convergence of optical and electrical signals in a sub-rate multiplexing

architecture can be achieved using the industry standard ITU-T G.709, known as Optical

Transport Network (OTN). This approach encapsulates user data in a digital wrapper to

decouple the server links from the long haul links, and is commonly used to encapsulate

lower data rate traffic into a 40-100 Gbit/second backbone. However, OTN

encapsulation also introduces tens of microseconds additional latency, and should be

disabled for ultra low latency networks. We also note that many vendor proprietary

inter-switch links (ISLs) on Fibre Channel switches are not fully compatible with OTN, and

thus OTN should be disabled if these interconnects are used for long distance

transmission.

[Figure 5.1: network layers 1 to 7 with an axis running from less latency to more latency. Data units shown: bits, frames, packets, segments, data. Latency sources shown: line coding, framing, FEC, store-and-forward, switching, address lookup, packet forwarding, routing, TCP windowing, flow control, packet re-send, session, presentation, application.]

For distances exceeding 80-100 km, optical amplification and dispersion compensation

are required. Optical fiber amplifiers consist of specially doped sections of fiber which

may be tens to hundreds of meters or more in length. Optical signals passing through this

fiber are amplified without requiring optical to electronic signal conversion, so the

overall latency from an optical amplifier is lower than a corresponding electronic

amplifier; there is a tradeoff in signal integrity since the optical amplifier cannot retime a

signal like the electronic amplifier. Although the latency introduced by a single optical

amplifier is typically very low (less than a few microseconds), for fiber links with poor

noise figures, many amplifiers placed close together may be required, thus increasing the

aggregate latency. The type of optical amplifier will also make a difference. Erbium

doped fiber amplifiers (EDFAs) require longer fiber lengths within the amplifier, and

thus add more latency compared with Raman amplifiers.

Extended distance links also require dispersion compensation, to overcome the fixed

levels of chromatic dispersion associated with long distances of installed fiber. The type

of dispersion compensator can make a significant difference in latency. One approach

involves inserting spools of specially treated dispersion compensating fiber into the link,

which have a negative dispersion shift and cancel out the positive dispersion associated

with the rest of the fiber. A typical 100 km link can be compensated with about 14 km of

dispersion shifted fiber, which adds about 70 microseconds to the link latency [6]. If the

dispersion compensating fiber is not optimally placed, additional optical amplifier stages

may be required, which further increases the link latency. Another approach is the use of

dispersion compensation gratings, which are short lengths of optical fiber fabricated with

a chirped fiber Bragg grating in their core. This diffraction grating is able to induce high

levels of negative dispersion proportional to the optical wavelength; several possible

designs have been proposed [7]. A 100 km length of fiber can be compensated using

only about 20 meters of fiber Bragg grating, with an additional latency of less than 0.15

microseconds. Although dispersion compensating gratings are currently more expensive,

the cost difference may be justified in cases where ultra-low latency is required.
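
The latency figures for the two dispersion compensation approaches follow directly from the length of fiber each one adds to the link. The sketch below reproduces that arithmetic using the 5 microseconds per km figure from this section; the assumption that the compensator length scales linearly with link length, and the 400 km example, are illustrative only.

```python
FIBER_DELAY_US_PER_KM = 5.0  # per-km fiber latency used in this section

# Compensator length needed per 100 km of link, taken from the text above.
DCF_KM_PER_100_KM = 14.0     # dispersion compensating fiber
FBG_KM_PER_100_KM = 0.020    # fiber Bragg grating (about 20 meters)


def compensator_latency_us(link_km: float, compensator_km_per_100_km: float) -> float:
    """Added one-way latency, assuming compensator length scales with link length."""
    if link_km < 0:
        raise ValueError("link length must be non-negative")
    compensator_km = link_km / 100.0 * compensator_km_per_100_km
    return compensator_km * FIBER_DELAY_US_PER_KM


if __name__ == "__main__":
    for link_km in (100.0, 400.0):
        dcf = compensator_latency_us(link_km, DCF_KM_PER_100_KM)
        fbg = compensator_latency_us(link_km, FBG_KM_PER_100_KM)
        print(f"{link_km:.0f} km link: DCF adds {dcf:.1f} us, FBG adds {fbg:.2f} us")
```

For a 100 km link this gives 70 microseconds for dispersion compensating fiber and 0.1 microseconds for a Bragg grating, in line with the values cited above.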

Additional latency reduction can also be achieved through tuning of the application

environment, operating system, and hardware environment of the servers attached to the

network, although these details are beyond the scope of this architecture.

In summary, for ultra-low latency applications such as high frequency financial trading,

the data center network can introduce significant amounts of latency. Within a data

center, the entire network stack must be considered, including the server adapter, top of

rack switches, and core switches; end to end solutions which perform well on

independently audited latency benchmark tests are recommended. The number of switch

chips within a network switch should be minimized. For links between data centers,

latency can be optimized by selecting the shortest possible physical fiber path, disabling

FEC and OTN, using Raman amps instead of EDFAs, and using dispersion compensating

Bragg gratings instead of dispersion compensating fiber. In the future, as transaction

rates increase, we expect further reductions in latency will be possible through faster

processors and network interface controllers, accelerated middleware appliances, and

ultra-low latency switches, combined with a certain amount of tuning and design

optimization.

Technical References

A. Bach, “High speed networking and the race to zero”, Hot Interconnects (HOTI) conference, 11 Madison Ave, New York, NY, August 25-27, 2009; http://www.hoti.org/hoti17/program/

“Best practices for tuning system latency”, IBM White Paper, March 2011, 17 pp., http://publib.boulder.ibm.com/infocenter/lnxinfo/v3r0m0/topic/performance/rtbestp/rtbestp_pdf.pdf

Discussion on why MPLS is more secure than OTV: http://www.bandwidth.com/wiki/article/How_secure_is_MPLS

http://www.infinibandta.org/content/pages.php?pg=press_room_item&rec_id=663

IETF, “Request for Comments (RFC) Pages”, http://www.ietf.org/rfc.html
