Optics in Future Data Center Networks

Laurent Schares¹, Daniel M. Kuchta¹, and Alan F. Benner²

1: IBM – TJ Watson Research Center, Yorktown Heights, NY 10598, USA. Email: [email protected], [email protected]
2: IBM – Systems and Technology Group, Poughkeepsie, NY 12601, USA. Email: [email protected]

Abstract— Optical interconnects offer significant advantages for future high performance data center networks. Progress towards integrating new optical technologies deeper into systems is reviewed, and the prospects for optical architectures beyond point-to-point optical links are discussed.

Index Terms—Computer networks, optical interconnections, optical packaging, optical switches.

I. INTRODUCTION

Datacenters and supercomputers are among today’s largest compute systems, and major installations of each category can scale to hundreds of thousands of processor cores. One of the most important differences between datacenters and supercomputers is the interconnection network, which is architected to meet the needs of vastly different software applications. The performance of tightly coupled supercomputer applications strongly depends on extensive inter-node communication and on networks that provide either very low latency, such as multi-tori networks, or very high bisection bandwidth, such as fat-tree or Clos networks, or a combination of both. On the other hand, conventional data center networks typically have a tree-like topology that is often significantly oversubscribed, as many traditional datacenter applications require little inter-node communication. Many modern data center applications, however, such as software redistributions, all-to-all data shuffles or cluster-type applications, need to move huge amounts of data between servers, and network bandwidth and latency can become performance bottlenecks. This leads us to consider some of the supercomputing topologies and packaging techniques for future datacenter networks.
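To make the oversubscription point concrete, the short Python sketch below computes the oversubscription ratio of a single switch tier; the port counts and line rates are illustrative assumptions, not figures from any particular data center.

```python
# Back-of-the-envelope oversubscription calculation for a tree-like
# data center network. All port counts and rates are hypothetical.

def oversubscription(downlinks, down_gbps, uplinks, up_gbps):
    """Ratio of server-facing capacity to uplink capacity at one switch tier."""
    return (downlinks * down_gbps) / (uplinks * up_gbps)

# Example: a top-of-rack switch with 48 x 10 Gb/s server ports
# and 4 x 10 Gb/s uplinks toward the aggregation tier.
ratio = oversubscription(downlinks=48, down_gbps=10, uplinks=4, up_gbps=10)
print(f"Oversubscription ratio: {ratio:.0f}:1")   # -> 12:1

# A full-bisection fabric (fat-tree/Clos) would require a ratio of 1,
# i.e. as much uplink capacity as server-facing capacity.
```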

Traditional datacenter networks often include three different fabrics: a local area network such as Ethernet, a storage area network (usually Fibre Channel), and a cluster network (most typically InfiniBand). The consolidation of these multiple networks into a single converged layer-2 or layer-3 fabric [1], [2] has recently seen considerable activity, aimed at simplifying network management and lowering infrastructure costs. Network convergence could well increase the bandwidth required on each converged link, since each port of a single hub chip has to provide the connectivity that was previously provided by separate ports or chips for each protocol. Most optical connectivity in today’s datacenters is provided by active optical cables or by pluggable optical transceivers connected to fibers. These active cables or pluggable transceivers are attached by electrical connectors to the edge of switch or server boxes. However, as data rates and the number of lanes per I/O port increase, the size of the electrical connectors and the power dissipation in long copper traces between a host chip and the edge of the box will become limiting factors.

In the context of network bottlenecks, the question motivating this paper is how optical interconnects are best used to meet bandwidth and latency needs of data center networks. What are the most probable future insertion points for optical technologies, and which issues must be addressed in practical implementations? Can optics lead to a disruptive change in data center architectures by deeper integration of optics closer to the processors or even by using optical switches or multicast buses?

We expect the above-mentioned trends – data-intensive applications, network convergence and packaging density – to continue and to trigger the integration of high-bandwidth optical links closer to the processing cores. A major part of this work describes how to integrate optical interconnects deep into two IBM computer systems: first, at the product level, into the high-end POWER7-IH systems (in section II, [3]), and second, at the R&D level, into a research prototype of blade servers (in section III, [4]). The highest-end systems of section II generate such high I/O bandwidth that electrical modules and connectors simply do not have the necessary bandwidth density to deliver the required I/O bandwidth to the available area at the edge of a box. Deep system integration requires optical transmitter/receiver modules that are extremely small and consume little power. Very low-cost optical modules are a must and will largely be enabled by enormous volumes: for example, a single large POWER7-IH system will use an amount of parallel optics roughly on par with today’s total world market for parallel optics and may consequently lead to unprecedented economies of scale [3], [5].

In section IV, we will discuss the prospects of alternative optically enabled architectures for data center applications. We show in section IV.A how a software-controlled optical circuit switch enabled a low-latency stream computing system [6]. Such a reconfigurable network is particularly attractive for data-driven workloads that have long-lived circuit-like communication patterns, such as streaming applications. In section IV.B, we review emerging research on optically connected memory systems. As cloud computing data centers gain popularity, virtualization grows in importance. While virtualization can lead to significantly higher processor utilization, its performance relies heavily on high-bandwidth memory access. As memory bus bandwidths approach 1 Tb/s per bus, the limited electrical connector/packaging density triggers research into optical memory expansion links.

This work focuses on optical networks within a single data center. Optical technologies for the increasing communication needs between separate data centers, driven by the needs of high-availability and low-latency service delivery, are discussed elsewhere [7]-[8].



II. OPTICS FOR POWER7-IH SYSTEMS

This section describes the optical interconnects developed for IBM’s POWER7-IH computing systems, in particular for the Blue Waters Supercomputer [3], [9]. With more than 300,000 compute cores, Blue Waters is expected to achieve sustained performance of at least 1 petaflop on scientific and technical applications. Figure 1 shows a node drawer of the POWER7-IH system. The drawer is about 2U high by 3 ft. wide and 4.5 ft. deep and contains 8 nodes of 32-way POWER7 symmetric multiprocessors (SMPs). Each node incorporates a glass-ceramic multi-chip module (MCM) with four 8-core POWER7 processor chips, up to 512 GB of memory and a second MCM with a hub/switch chip and optical modules. Twelve node drawers are packaged into a rack, with optical fiber cables used for all drawer-to-drawer and rack-to-rack communication. Because all drawer-to-drawer communication is implemented optically, there is no electrical backplane.

Figure 1. A 256-core node drawer of the POWER7-IH System [3]. 224+224 12-channel optical transmitters and receivers provide more than 2+2 Tbytes/sec of optical I/O bandwidth from a single drawer.

The POWER7-IH interconnect topology is a fully connected two-tier network. The key network component is the high-bandwidth hub/switch chip on each node. In the first tier, which is composed of a four-drawer “supernode”, the hub/switch of every SMP is directly connected to the 31 other hub/switches in a supernode. In the second tier, every supernode has a direct connection to every other supernode to link up to 16,384 hub/switches. Optical fiber cables can be up to 100 meters long in the largest POWER7-IH systems.
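The following sketch works out the scale implied by these figures (32 hub/switches per four-drawer supernode, up to 16,384 hub/switches overall). The assumption of a single link per supernode pair in the second tier is a lower bound, since the number of parallel links per pair is not stated here.

```python
# Scale of the fully connected two-tier POWER7-IH topology, derived only
# from the figures quoted in the text. The second-tier count assumes a
# single link per supernode pair and is therefore a lower bound.

hubs_per_supernode = 8 * 4          # 8 nodes/drawer x 4 drawers
max_hubs = 16_384
supernodes = max_hubs // hubs_per_supernode

# First tier: all-to-all among the 32 hubs inside one supernode.
first_tier_links = hubs_per_supernode * (hubs_per_supernode - 1) // 2

# Second tier: all-to-all among supernodes (at least one link per pair).
second_tier_links = supernodes * (supernodes - 1) // 2

print(f"{supernodes} supernodes")                    # 512
print(f"{first_tier_links} intra-supernode links")   # 496 per supernode
print(f"{second_tier_links} inter-supernode links (minimum)")  # 130,816
```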

Each of the 8 hub/switch MCMs in the node drawer provides (280+280) GBytes/sec to optical I/Os, while the rest of the bandwidth is electrical and stays within the node drawer. The escape of such high optical bandwidths from densely packaged modules deep within the drawers requires optical transmitters and receivers that provide significantly higher bandwidth density than current commercially available optical transceivers. The 28+28 optical modules occupy less than 60×60 mm² on each MCM.
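A quick sanity check of these bandwidth figures is sketched below; the gap between the raw lane capacity and the quoted optical I/O bandwidth presumably reflects spare lanes and protocol or coding overhead, which is not broken down here.

```python
# Sanity check of the optical bandwidth figures quoted for POWER7-IH,
# using only numbers given in the text.

modules_per_mcm = 28 + 28           # transmitters + receivers
lanes_per_module = 12
lane_rate_gbps = 10

raw_per_mcm_gbps = modules_per_mcm * lanes_per_module * lane_rate_gbps
quoted_per_mcm_gbytes = 280 + 280                    # GB/s, from the text
quoted_per_mcm_gbps = quoted_per_mcm_gbytes * 8

area_mm2 = 60 * 60                                   # optics footprint per MCM
density = quoted_per_mcm_gbps / area_mm2

print(f"raw lane capacity per MCM:  {raw_per_mcm_gbps} Gb/s")    # 6720
print(f"quoted optical I/O per MCM: {quoted_per_mcm_gbps} Gb/s") # 4480
print(f"bandwidth density:          {density:.2f} Gb/s per mm^2")

# Drawer level: 8 MCMs x (280+280) GB/s = 4480 GB/s, consistent with the
# "more than 2+2 Tbytes/sec per drawer" figure in the Figure 1 caption.
print(f"per drawer: {8 * quoted_per_mcm_gbytes} GB/s")
```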

Figure 2. Rear view of POWER7-IH drawer, showing dense packing of fiber optic connectors at the edge of the drawer. The inset shows an MCM with a hub/switch chip and 28+28 optical transmitter/receiver attachment sites [3].

The POWER7-IH drawers use 12-channel, 10-Gb/s-per-channel optical transmitter and receiver modules called MicroPOD™ [5], [10], developed by Avago Technologies together with IBM and USConec. These modules, with dimensions of 7.8 mm (L) x 8.2 mm (W) x 3.9 mm (H), enable higher-density interconnects than previously possible with other optical or electrical interconnect solutions. The optical transmitter and receiver modules include non-retimed 12-channel VCSEL drivers and receiver circuits that are wire-bonded to 850-nm parallel VCSEL and photodiode arrays on a 250-µm pitch. The beams are expanded and collimated to provide relaxed alignment tolerances. A detachable optical connector with dimensions of 7.4 mm x 5.7 mm, called PRIZM™ LightTurn™ [11], couples the optical signals to a 12-fiber ribbon. This connector includes a total-internal-reflection surface and aspheric reflecting lenses to couple the vertical optical beams to near-horizontal fiber ribbons.

The electrical channels between hub/switch chips and the optical transmitters/receivers are an important part of the end-to-end opto-electronic link. Since the O/E modules are mounted on top of the hub/switch chip MCM, the electrical signals do not have to exit the hub modules and do not have to travel through transmission lines on the printed circuit board. The on-MCM electrical channels are kept to a maximum length of 50 mm, which is short enough for good signal integrity and high power efficiency. Good end-to-end optoelectronic link performance relies on low jitter of the optical part of the link, particularly since the optical modules are non-retimed and since the optical receiver includes a limiting amplifier [12]. Jitter measurements performed on the optical link operating at 10 Gb/s have shown low total jitter of 0.19 UI at BER = 10⁻¹² [3].
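For reference, the sketch below converts the quoted total jitter from unit intervals to time at the 10 Gb/s line rate.

```python
# Converting the quoted total jitter from unit intervals (UI) to time.
# At 10 Gb/s one UI is 100 ps, so 0.19 UI of total jitter corresponds to
# 19 ps, leaving 0.81 UI (81 ps) of eye opening at BER = 1e-12.

line_rate = 10e9                  # b/s
ui_ps = 1e12 / line_rate          # 100 ps per unit interval
total_jitter_ui = 0.19

print(f"UI at 10 Gb/s:       {ui_ps:.0f} ps")
print(f"total jitter:        {total_jitter_ui * ui_ps:.0f} ps")
print(f"remaining eye width: {(1 - total_jitter_ui) * ui_ps:.0f} ps")
```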

The enormous off-drawer bandwidths require the use of very high-density optical connectors at the edge of the drawers, as shown in Figure 2. The optical I/O ports on the POWER7-IH drawers are based on MPO connectors with 24-fiber and 48-fiber ferrules. Since each drawer may contain up to a total of 5,376 fibers, modifications had to be made to allow for handling of the densely packed connectors. Since some of the POWER7-IH installations might need more than 360,000 connectors, insertion loss requirements have to be balanced against the other systems-level requirements such as connector size, cost, and complexity.
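The sketch below illustrates this balance with purely hypothetical numbers: given a total loss budget and the number of mated connector pairs in a path, it derives the per-connector insertion loss that can be tolerated.

```python
# Illustrative trade-off between per-connector insertion loss and the number
# of mated connectors in an optical path. The budget, fiber loss and
# connector count below are hypothetical; the text only states that
# insertion-loss requirements must be balanced against connector size,
# cost and complexity.

def max_loss_per_connector(link_budget_db, fiber_loss_db, connectors_in_path):
    """Largest per-connector insertion loss that still closes the link."""
    return (link_budget_db - fiber_loss_db) / connectors_in_path

# Example: 6 dB total budget, 0.5 dB allocated to 100 m of multimode fiber,
# and 4 mated connector pairs between transmitter and receiver.
allowed = max_loss_per_connector(link_budget_db=6.0,
                                 fiber_loss_db=0.5,
                                 connectors_in_path=4)
print(f"allowed insertion loss per connector: {allowed:.2f} dB")  # ~1.4 dB

# Tightening the per-connector spec eases the link budget but raises the
# cost of each of the >360,000 connectors in a large installation.
```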



III. OPTICS FOR BLADE SERVERS

The blade form factor is becoming an increasingly popular packaging entity for servers. Blade servers are book-like structures that are plugged into a chassis with a midplane/backplane, which provides connections for power and almost all of the I/O bandwidth. Blade servers provide numerous benefits over rack-mounted servers, including lower parts count, better power efficiency, reliability and density, and simpler management. In addition to slots for blade servers, a blade enclosure or chassis, such as IBM’s BladeCenter® [13], typically includes first-level switch modules or extension modules that relay high speed I/O from the blades to the edge of the chassis. However, the signal integrity of the electrical I/O path between blade and outside connector can be challenging at high data rates, as it includes several electrical connectors and long copper traces. In some situations, a repeater IC is placed in this path to regenerate the signal, which adds to the power dissipation.

Figure 3. High-speed blade daughter card with blind mate optical connector and a pair of 12-fiber ribbons connecting to two optical modules, one on each side of the InfiniBand HCA in the center of the card (from [4]).

We have demonstrated a blade packaging concept where the majority of high speed I/O between blades and from chassis to chassis is carried over optical interconnects directly attached to each blade [4]. Initially, the optical connections are fiber based. As I/O ports become wider with a larger number of high-speed lanes and as the optical waveguide technology matures, we expect fibers to be replaced by denser and lower-cost polymer waveguides, which can be printed directly onto a printed circuit board or fabricated in a flex format [14].

Introducing direct fiber connections to the blade servers should be compatible with the current ease-of-use blade server concept. Bringing fibers out through the front of the blade would create a potential cable management issue and is not desirable. Our “fiber to the blade” implementations are based on high-speed daughter cards that plug onto the blade server and have small optical transceivers and fiber ribbons. These daughter cards connect to blind mate optical connectors mounted on a passive bar that we added inside the BladeCenter chassis. Additional fiber cables connect the adapter bar to optical adapters on the back side of the chassis.

We have implemented two versions of optical interconnects to blade servers: a dual-port InfiniBand DDR version using parallel multimode optics [4], and a dual-port 10-Gb/s Ethernet version using single-mode optics. Since the packaging concepts of the two versions are fairly similar, the remainder of this section focuses on the multimode InfiniBand version for the sake of brevity. The particular InfiniBand host channel adapter (HCA) that we interfaced to had two 20-Gbit/s ports, each comprising four transmit and four receive lanes running at 5 Gbit/s per lane. Sixteen fibers are needed to support both ports. In our implementation, we chose to bring 24 fibers to each blade, since ribbon fiber is commercially and conveniently available in multiples of 12.
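The fiber-count arithmetic is summarized in the short sketch below, using only the numbers given above.

```python
# Fiber count for the dual-port DDR InfiniBand daughter card described above.

ports = 2
tx_lanes = 4
rx_lanes = 4
lane_rate_gbps = 5

fibers_needed = ports * (tx_lanes + rx_lanes)        # one fiber per lane
port_bandwidth = tx_lanes * lane_rate_gbps           # 20 Gb/s per direction
ribbon_width = 12
ribbons = -(-fibers_needed // ribbon_width)          # ceiling division

print(f"fibers required: {fibers_needed}")           # 16
print(f"ribbons of 12:   {ribbons} -> {ribbons * ribbon_width} fibers routed")
print(f"per-port rate:   {port_bandwidth} Gb/s each way")
```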

Figure 4. High-speed optical daughter card mounted on a blade server connected to the electrical midplane. The daughter card extends through the midplane and plugs into the optical connector bar.

The optical daughter card is shown in Figure 3. It contains an InfiniBand HCA, two optical transceivers, an electrical connector that plugs into the host blade, a blind mate optical connector, and flex fiber ribbons connecting the optical transceivers to the optical connector. The optical blind mate connectors are placed on an extension of the daughter card about 6 cm beyond the edge of the blade and into the plenum space of the chassis between the two blower fans. They could not be mounted in the same plane at the edge of the blade as the electrical connectors connecting to the midplane, because the only available space for connectors was also the exit port for the cooling air (Figure 4). Our implementation allows the cooling air to exit the blade and spread around the connectors. We verified that the thermal impact of the blind mate connectors and connector bar was minimal.

Integrating optical transceivers on the daughter cards is challenging because of height and board space restrictions in a densely packaged system such as blade servers. There is a maximum height of 18 mm between the daughter card PCB and the blade server PCB. In the region of the daughter card, the server PCB has many tall power supply components that restrict the available board space for the optical modules. At the time that we designed these daughter cards, no commercially available parallel optical modules were small enough to fit into this tight space. In this project, we used unpackaged versions of POP4 optical transceivers [4], which have 4 transmitter and 4 receiver channels. To meet the space and height restrictions, we removed the heat sink and replaced the standard MTP connector with a smaller MT ferrule. While the two optical transceivers added 2.2W of power to the daughter card, they eliminated the need for electrical repeater chips and their resulting power dissipation. We note that the MicroPOD™ transmitter and receiver modules described in section II are smaller than the transceivers that we used in this project and could be used in a newer implementation.

Figure 5. Rear view of modified IBM BladeCenter chassis with blind mate optical connector bar and with optical adapters mounted on the rear of the chassis (from [4]). The internal fiber routing is designed to minimize air flow resistance. The top blower in the chassis has been removed for clarity.

The optical connectors were mounted on an aluminum bar in the back of the chassis. This bar had 14 optical connectors precisely placed such that the optical daughter cards on each of the 14 blades could blind mate to the connector bar at the same time that the blade itself connected to the electrical midplane. A portion of the mounting bar and five of the connections can be seen in Figure 5. For each blade position, a “Y” cable spans from the connector bar to a set of MTP adapters mounted at the rear of the chassis; it consists of a 24-fiber connector, plugged into the blind mate adapter, and two 12-fiber MTP connectors. The length of the Y cables ranges from 22 to 40 cm depending on the blade slot position. The Y cables are bundled together and routed along the internal sides of the chassis to avoid blocking the inputs to the exhaust blower. The total insertion loss from the transceivers to the adapters at the rear of the chassis was characterized and found to be compliant with our optical link budget. The largest measured insertion loss was only 1.3 dB and included two optical connectors, the flex cable on the daughter card and the fibers in the chassis. The blowers were exercised during the loss measurements to check for mechanical stability of the connections.

We tested the performance of the optically connected blades in a small Linux cluster using two chassis and an InfiniBand switch. Although the optical transceivers were rated to 3 Gb/s, we could operate them at 5 Gbit/s/channel as required for DDR InfiniBand. We established a 41-m long point-to-point link between two blades and measured the bandwidth using an MPI RDMA write bidirectional bandwidth benchmark test [4]. As expected, the optical cables only added time-of-flight to the total end-to-end latency and thus contributed a negligible amount to the latency for all but the smallest packet transmissions, even over the longest links.
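A quick estimate of the added time of flight is given below, assuming a group index of roughly 1.47 for silica fiber; the exact index is an assumption, not a measured value.

```python
# Time-of-flight added by the 41 m optical link in the blade-to-blade test.

C = 299_792_458            # m/s, speed of light in vacuum
group_index = 1.47         # assumed group index for silica fiber (~5 ns/m)
length_m = 41

tof_ns = length_m * group_index / C * 1e9
print(f"one-way time of flight: {tof_ns:.0f} ns")   # ~200 ns

# A couple of hundred nanoseconds is small next to typical end-to-end MPI
# latencies of several microseconds, which is why the fiber length barely
# affects the benchmark results.
```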

IV. ALTERNATIVE OPTICAL ARCHITECTURES

A. Reconfigurable Networks using Optical Circuit Switching

In a conventional tree-like data center network that includes top-of-rack, aggregation and core switches/routers, the end-to-end latency between two nodes can be tens of microseconds if data has to travel through several stages of Ethernet switches/routers. Additionally, the capacity between different branches of the tree is often significantly oversubscribed due to the high power dissipation and cost of core routers, which can lead to congestion and further impairments for latency-sensitive applications. Most higher-bandwidth and lower latency network architectures being proposed are based on flat 2-level hierarchies [3], [15], or on Clos- and fat-tree topologies with advanced layer-2 routing [16], [17]. While providing high bandwidth, these approaches come at the expense of high power dissipation and cost of a large number of electrical packet switches and cables.

The addition of an optical circuit switched (OCS) network to a conventional data center network, as shown in Figure 6, for high-bandwidth and latency-sensitive data traffic is an attractive alternative [6], [18]. This approach is particularly suited for streaming applications, which often have long-lived circuit-like communication patterns that are typically on the order of minutes or longer.

Figure 6. Example of an optical circuit switch with software control embedded in a datacenter network, enabling low latency for circuit-like traffic patterns.

We demonstrated the use of a software-controlled optical circuit switch in a stream computing system with layer-2 routing [6]. A software optimizing scheduler adapted the physical interconnect topology in response to system needs, matching logical flow graphs by reconfiguring the optical switch. The dynamic reallocation of network resources allowed us to balance the networking load over time for varying data and data processing graphs. Our OCS demonstrator comprised three IBM BladeCenter® chassis interconnected over two networks: an optical network using a 3D MEMS-mirror based optical circuit switch for data traffic, and a parallel electrical 1-Gb/s Ethernet network for both control and data traffic. Each chassis hosted four blade servers and a 10-Gb/s Ethernet switch module with optical transceivers. The 10GbE switches were optically connected to the OCS. Our stream computing middleware included a dynamic routing mechanism that reconfigured data streams to use either the optical or the electrical network. We demonstrated full functionality of our hybrid optical/electrical network demonstrator and ran streaming applications with multiple jobs. Our middleware was able to control the optical switch and set up the proper layer-2 routing, and the optimizing scheduler reconfigured both the optical and electrical network appropriately in response to changing requirements of streaming applications.
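The sketch below illustrates, in simplified form, the kind of placement policy such a scheduler can apply: long-lived, high-bandwidth streams are mapped onto the limited optical circuit switch ports, while short or bursty flows stay on the electrical packet network. The thresholds, names and decision logic are illustrative assumptions and do not reproduce the actual middleware of [6].

```python
# A minimal sketch of a hybrid-network placement policy: long-lived,
# high-bandwidth streams go over the optical circuit switch (OCS),
# everything else over the electrical packet network. Illustrative only.

from dataclasses import dataclass

@dataclass
class Stream:
    src: str
    dst: str
    rate_gbps: float       # sustained data rate
    lifetime_s: float      # expected duration of the flow

def place_streams(streams, ocs_ports, rate_threshold=5.0, lifetime_threshold=60.0):
    """Assign each stream to 'ocs' or 'packet', respecting the OCS port budget."""
    # Prefer the heaviest long-lived streams for the limited optical circuits.
    candidates = sorted(
        (s for s in streams
         if s.rate_gbps >= rate_threshold and s.lifetime_s >= lifetime_threshold),
        key=lambda s: s.rate_gbps, reverse=True)

    placement = {}
    used_ports = set()     # assume one OCS attachment per endpoint
    for s in candidates:
        if s.src not in used_ports and s.dst not in used_ports and \
           len(used_ports) + 2 <= ocs_ports:
            placement[(s.src, s.dst)] = "ocs"
            used_ports.update((s.src, s.dst))
    for s in streams:
        placement.setdefault((s.src, s.dst), "packet")
    return placement

demo = [Stream("chassis1", "chassis2", 9.0, 600),
        Stream("chassis1", "chassis3", 0.3, 5),
        Stream("chassis2", "chassis3", 7.5, 300)]
print(place_streams(demo, ocs_ports=4))
```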

B. Optically Attached Memory

As memory bus bandwidths are approaching 1 Tb/s per bus, it is becoming difficult to maintain bandwidth and capacity per core without innovations in the memory system architecture. The interest in alleviating bandwidth-to-memory bottlenecks and taking advantage of future ultra-dense memory, such as phase change memory, triggers investigation into optical buses specifically designed for memory interconnection [19]–[22].

Cost is a major reason for considering optical memory expansion links. The memory capacity is often limited by available board space, unless expensive high-density memory packaging technologies are used. Optical links to memory enable us to place memory meters away from processors without sacrificing bandwidth (Figure 7). Consequently, as more memory per core is required, optical memory expansion is becoming a promising low-cost solution. It is, however, important to consider the extra time-of-flight latency introduced by optical memory expansion links, which may limit the maximum processor-memory separation. Tiered memory subsystems may provide a tradeoff, with a smaller amount of memory close to the processors and a larger pool of expansion memory.
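The sketch below quantifies the extra round-trip time for a few example separations, assuming roughly 5 ns of propagation delay per meter of fiber; the comparison to typical local DRAM access latency is a general figure rather than one measured in this work.

```python
# Extra round-trip latency introduced by moving memory a few meters away over
# optical links, assuming ~5 ns/m propagation in fiber (group index ~1.5).
# The ~60-100 ns baseline for local DRAM access is a typical figure, not one
# quoted here.

ns_per_meter = 5.0          # ~ (group index) / c

for separation_m in (1, 3, 5, 10):
    round_trip_ns = 2 * separation_m * ns_per_meter
    print(f"{separation_m:>2} m separation -> +{round_trip_ns:.0f} ns round trip")

# A few meters adds tens of nanoseconds, comparable to local DRAM latency,
# which is why tiered memory (near memory plus optically attached expansion
# memory) is attractive.
```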

Figure 7. Example of optically connected expansion memory (from [19]).

Optical multi-drop memory architectures are also being proposed. Multi-drop buses can provide expandability by simply adding more memory modules and can be designed to provide uniform access times for all attached modules. While electrical multi-drop buses are difficult to implement at high speeds because of impedance mismatches, the optical version can be implemented based on optical splitters. The maximum number of drops depends on the optical link budget and splitter loss; 4-drop and 8-drop buses are believed to be realistic.
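The splitter-loss argument can be illustrated as follows; the excess-loss allowance and the remaining link budget are hypothetical values chosen only to show why 4- and 8-drop buses appear feasible while much larger fan-outs do not.

```python
# Splitter loss for an optical multi-drop memory bus. An ideal 1:N power
# splitter costs 10*log10(N) dB; the excess-loss and budget numbers are
# illustrative assumptions.

import math

def splitter_loss_db(drops, excess_db=0.5):
    """Ideal splitting loss plus a flat excess-loss allowance."""
    return 10 * math.log10(drops) + excess_db

link_budget_db = 12.0   # hypothetical budget left for splitting

for drops in (2, 4, 8, 16):
    loss = splitter_loss_db(drops)
    verdict = "fits" if loss <= link_budget_db else "exceeds"
    print(f"{drops:>2} drops: {loss:4.1f} dB splitter loss "
          f"({verdict} {link_budget_db} dB budget)")
```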

Efficient coding schemes for optical memory buses are an important part of future research. Memory controller interfaces are often uncoded in order to minimize latency, while most of today’s optical transceivers are ac-coupled and will not pass uncoded data if a data line is held at one logic level for an exceedingly long time.
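The run-length issue can be illustrated with the simple check below; 8b/10b is mentioned only as a familiar example of a code that bounds run length, not as the specific scheme proposed here.

```python
# Why uncoded memory traffic is a problem for ac-coupled optical transceivers:
# a long run of identical bits has no transitions and drifts through the
# receiver's baseline. The check below is generic; 8b/10b is cited only as a
# familiar code that bounds run length (to 5 bits).

def longest_run(bits):
    """Length of the longest run of identical symbols in a bit string."""
    best = run = 1
    for prev, cur in zip(bits, bits[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

uncoded_idle = "1" * 64                  # bus held at one logic level
encoded_like = "1100010111" * 6          # transition-rich, 8b/10b-like pattern

print(longest_run(uncoded_idle))         # 64 -> blocked by ac coupling
print(longest_run(encoded_like))         # 5  -> passes easily
```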

V. CONCLUSIONS AND ACKNOWLEDGEMENTS

As data center workloads demand higher I/O and memory bandwidths, optical interconnects will play an increasingly important role. We expect optical interconnects to be integrated significantly closer to host chips. This paper has summarized how optical interconnects are integrated deep into high-end systems and into research prototypes of blade servers. We have also reviewed emerging research into optically enabled architectures based on circuit switches and on optically attached memory.

Thanks are due to many colleagues at IBM for the work summarized here, in particular to the authors of references [3], [4], [6] and [19]. Part of this work has been sponsored by DARPA under contract No. HR0011-07-9-0002.

REFERENCES

[1] D. Cohen et al., "Remote Direct Memory Access over the Converged Enhanced Ethernet Fabric: Evaluating the Options", IEEE Hot Interconnects Symposium 2009.

[2] F. Marti, "iWARP 2.0", 2010 Sonoma Workshop, OpenFabrics Alliance. http://www.openfabrics.org/archives/sonoma2010.htm

[3] A. Benner, D. M. Kuchta, P. K. Pepeljugoski, R. A. Budd, G. Hougham, B. V. Fasano, K. Marston, H. Bagheri, E. J. Seminaro, H. Xu, D. Meadowcroft, M. H. Fields, L. McColloch, M. Robinson, F. W. Miller, R. Kaneshiro, R. Granger, D. Childers, E. Childers, "Optics for High-Performance Servers and Supercomputers", Optical Fiber Communications (OFC), paper OTuH1, March 2010.

[4] D. M. Kuchta, Y. Taira, C. Baks, G. McVicker, L. Schares, H. Numata, “Optical interconnects for servers”, Japanese Journal of Applied Physics, Vol.47, No.8, pp.6642-6645, Aug. 2008.

[5] M. H. Fields et al., "Transceivers and Optical Engines for Computer and Datacenter Interconnects", OFC 2010, paper OTuP1, March 2010.

[6] L. Schares, X.J. Zhang, R. Wagle, D. Rajan, P. Selo, S.P. Chang, J. Giles, K. Hildrum, D. Kuchta, J. Wolf, E. Schenfeld, “A reconfigurable interconnect fabric with optical circuit switch and software optimizer for stream computing systems”, OFC 2009, paper OTuA1, 2009.

[7] J. Berthold, “Optical Networking for Data Center Interconnects Across Wide Area Networks”, IEEE Hot Interconnects Symposium 2009.

[8] C. Lam, “Optical Network Technologies for Datacenter Networks”, Optical Fiber Communications (OFC), paper NWA3, March 2010.

[9] Blue Waters Petascale Supercomputer [Online]. Available: http://www.ncsa.illinois.edu/BlueWaters/

[10] MicroPODTM optical transmitter and receiver modules [Online]. Available: http://www.avagotech.com/pages/en/press/micropod.

[11] PRIZM™ LightTurn™ Connector [Online]. Available: http://usconec.com/pages/products/prizm/prizmfrm.html

[12] P. Pepeljugoski, D. Kuchta, "Jitter Performance of Short Length Optical Interconnects for Rack-to-Rack Applications", Optical Fiber Communications Conference (OFC), paper OTuA2, Mar. 2009.

[13] IBM BladeCenter® [Online]. Available: http://www-03.ibm.com/systems/bladecenter/

[14] F. E. Doany et al., "160 Gb/s Bidirectional Polymer Waveguide Board-Level Optical Interconnects Using CMOS-Based Transceivers," IEEE Trans. Adv. Pkg., vol. 32, pp. 345-359, 2009.

[15] J. Kim, W.J. Dally, S. Scott, D. Abts, "Technology-Driven, Highly-Scalable Dragonfly Topology", Proc. of 35th ISCA, 2008.

[16] A. Greenberg et al., "VL2: a scalable and flexible data center network”, Proc. of ACM SIGCOMM, Aug. 2009.

[17] R. N. Mysore, et al., "PortLand: a scalable fault-tolerant layer 2 data center network fabric", ACM SIGCOMM, 2009.

[18] G. Wang et al., "Your Data Center Is a Router: The Case for Reconfi-gurable Optical Circuit Switched Paths", ACM HotNets-VIII, Oct. 2009.

[19] Y. Katayama, A. Okazaki, "Optical Interconnect Opportunities for Future Server Memory Systems", Proc. of IEEE HPCA, pp. 46-50, 2007.

[20] M. Tan et al., "A High-Speed Optical Multi-drop Bus for Computer Interconnections", IEEE Hot Interconnects Symposium, 2008.

[21] D. Brunina, C. P. Lai, A. S. Garg, K. Bergman, "Optically-Connected Memory Systems for High-Performance Computing", 21st Workshop on Interconnections within High Speed Digital Systems, May 2010.

[22] S. Beamer et al., “Re-architecting DRAM memory systems with monolithically integrated silicon photonics,” Proc. of ISCA 2010.
