
TeraRack: A Tbps Rack for Machine Learning Training

Mehrdad Khani1 Manya Ghobadi1 Mohammad Alizadeh1

Ziyi Zhu2 Madeleine Glick2 Keren Bergman2

Amin Vahdat3 Benjamin Klenk4 Eiman Ebrahimi4

1 MIT  2 Columbia University  3 Google  4 NVIDIA

Abstract

We explore a novel, silicon photonics-based approach to build a high-bandwidth rack designed for machine learning training. Our goal is to scale state-of-the-art ML training platforms, such as NVIDIA's DGX and Intel's Gaudi, from a handful of GPUs in one platform to 256 GPUs in a rack while maintaining Tbps communication bandwidth. Our design, called TeraRack, leverages the emergence of silicon photonics technology to achieve Tbps bandwidth in/out of the GPU chip. TeraRack accelerates the training time of popular ML models using (i) a scheduling algorithm that finds the best wavelength allocation to maximize the throughput between communicating nodes; and (ii) a device placement algorithm that partitions ML models across nodes to ensure a sparse and local communication pattern that can be supported efficiently on the interconnect. We build a small prototype with FPGA boards and a 10 mm × 10 mm silicon photonics chip. Simulation results show that TeraRack's performance on realistic ML training workloads is equivalent to a full-bisection 256×1.2 Tbps electrical fabric at 6× lower cost, enabling faster model/data parallel training.

1 INTRODUCTION

Today's approach to assembling custom platforms for machine learning (ML) training—electrically interconnecting a few GPUs within one box—has a fundamental scaling challenge. The number of compute nodes in the box is limited by the wiring complexity of the interconnection between the nodes. For instance, Nvidia's DGX-2 [1], one of today's fastest ML training platforms, offers 1.2 Tbps communication bandwidth between GPU pairs at all times, but it is limited to interconnecting 16 GPUs using a custom-designed network fabric (NVSwitch [2]). Similarly, Intel's Gaudi training platform [3] interconnects 8 Tensor Processing Cores using an on-chip wiring mesh that provides 1 Tbps bandwidth for every pair [4].

Many researchers have recognized this cliff in scaling from a handful of accelerators to sizes needed by large ML training workloads. There are a number of proposals to scale across HPC-like clusters, such as IBM Summit [5, 6], with thousands of commodity or even customized servers [7–12]. However, these clusters have limited server-to-server bandwidth, making the network a bottleneck for large DNN training [13–17]. The reason is that as the number of parallel workers increases, the amount of computation at each worker decreases but the amount of communication remains relatively constant (since it depends primarily on model size [11, 12, 18, 19]). Consequently, the total training time becomes dominated by the communication time, resulting in diminishing returns with additional workers. Data-parallel training methods aim to mitigate this problem by processing more data in each training iteration as the system scales, thereby keeping computation and communication time balanced. However, this approach also hits scaling limits, since processing more data in each training iteration does not always translate to faster training [20]. Further, recent work has shown that other parallelization strategies with higher bandwidth requirements, such as model parallelism or hybrid model and data parallelism, can significantly improve performance [21]. We posit that the community will soon require 100s of accelerators for single training workloads and such deployments will require Tbps bandwidth per accelerator to maintain a scalable, cost-balanced system.

Figure 1: A high-level comparison of today's ML training platforms (bandwidth per node in Gbps vs. scale in number of compute nodes). Custom platforms [1, 4, 22–24] have high per-node bandwidth but are limited in scale, whereas HPC-like clusters [5, 27–30] can scale to thousands of nodes, but the training speed suffers from limited per-node bandwidth.

Figure 1 compares the bandwidth and scale of existing ML platforms. Custom platforms [1, 4, 22–24] provide Tbps of communication bandwidth to each node, but they suffer from scalability challenges because they rely on electrical packet switching chips that are already operating near their practical limits [25, 26]. HPC-like clusters [5, 27–31] enjoy a high scale, but they are limited by the communication bandwidth available between servers. Our proposed platform, TeraRack, fills the gap between these two sets of solutions. We explore the opportunities and challenges of building an ML training platform at moderate scale (e.g., 256 nodes) with ultra-high inter-node bandwidth (e.g., 2.4 Tbps). We argue that the new wave of silicon photonics-based cards, such as Intel/Ayar's TeraPHY chip [32–36], presents a unique opportunity to design interconnects with Tbps bandwidth.

Silicon photonics have been proposed as a potential game changer for ML-based systems [37–39], with a projected cost of $1–$5/Gbps, thus matching the cost of today's datacenter interconnects [40–42].1 Micro-ring resonators (MRRs) [43] play an important role in the success of silicon photonics [44, 45]. MRRs act as spectral filters to select and forward wavelengths, and they can switch between different wavelengths within a few microseconds. Prior work shows the feasibility of MRRs in optical switches and transceivers [34, 46–59]. In this work, we explore the use of MRRs to build rack-scale interconnects.

An immediate challenge of optical interconnects is that they tend to have either a millisecond reconfiguration time [60–63] or a limited port count [55, 64–66]. For instance, the Calient switch has 320 ports, but it takes roughly 30 ms to reconfigure the switch [67]. Recent proposals, such as Mordia [64], introduce microsecond switching time using wavelength-selective switches, but they are inherently limited to 4–8 ports. Low port count limits the scalability, and slow reconfiguration time hurts the performance of training (§2). We overcome this challenge by eliminating the need for an optical switch to interconnect the nodes in TeraRack altogether. Instead, we connect the MRRs on each node with fiber cables to form several rings-of-micro-rings and reconfigure the wavelengths on each ring on demand.

Although a ring topology eliminates the need for an optical switch, it introduces a well-known challenge: wavelength assignment [68]. The wavelength assignment problem is to select suitable wavelengths, among the many possible choices for each communicating node, such that no two paths sharing the same fiber are assigned the same wavelength. We develop a wavelength scheduling algorithm that exploits a connection with min-cost flow routing. Our algorithm provides a near-optimal solution, achieving training times that are within 5% of an integer linear programming (ILP) wavelength allocation solution.

To evaluate TeraRack, we develop an accurate simulator for parallel neural network training. Our simulations compare the cost and training iteration time of TeraRack with those of state-of-the-art electrical and optical architectures for data/model parallelism workloads. Our results demonstrate two key takeaways. First, for three representative DNN models (ResNet152 [69], VGG16 [70], Transformer [71]), TeraRack speeds up training time by factors of 2.12–4.09× and 1.65–3× compared to cost-equivalent electrical-switch and OCS-based interconnects respectively, at 256-GPU scale. Second, TeraRack achieves up to 55× speedup in total training time from fast (rescheduling allocation once every 100 µs) wavelength reuse at 256 GPUs. Finally, based on component costs, we estimate that a TeraRack interconnect will be an order of magnitude cheaper than a full-bisection bandwidth electrical interconnect.

We also build a small, three-node prototype using FPGA boards to emulate GPUs and three MRRs to show the feasibility of our architecture. We use Stratix V FPGAs and commodity 10 Gbps SFP+ transceivers to emulate the GPU training workflow, as no commercial GPU chip currently supports high bandwidth optical interfaces.2 We use the MRRs to select and forward any of the wavelengths to the target emulated GPUs. Using our prototype, we benchmark the throughput, reconfiguration time, and optical power budget of TeraRack's architecture. Our prototype demonstrates that (i) using MRRs to select/bypass traffic does not impact TCP's throughput; (ii) the MRR's reconfiguration latency is within 25 µs; and (iii) the optical power loss of each MRR is 0.025–0.05 dB.

1 Note that this is far cheaper than wide-area networks' DWDM technology (details in Section 4).

2 BACKGROUND AND MOTIVATION

A common approach to distributed training is data parallelism (DP), where the training data is distributed across multiple workers (e.g., GPU, TPU, CPU). Each worker has an identical copy of the model but trains on an independent subset of the training data, called a mini-batch, in parallel. In DP training, workers need to communicate their model parameters after each training iteration. This can be done in a variety of ways, including parameter servers [73], ring-allreduce [74, 75], tree-reduce [76], and hierarchical all-reduce [8].
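As a concrete reference point for the communication cost of DP, the ring-allreduce algorithm has a well-known per-worker traffic volume. The helper below is our own back-of-the-envelope sketch (not code from the paper) of that standard formula:

```python
def ring_allreduce_bytes_per_worker(model_bytes: float, num_workers: int) -> float:
    """Bytes each worker sends (and receives) in one ring all-reduce.

    Standard result for the ring algorithm: 2 * (N - 1) / N * model size,
    spread over 2 * (N - 1) steps (reduce-scatter followed by all-gather).
    """
    n = num_workers
    return 2 * (n - 1) / n * model_bytes


# Example: all-reducing a 420 MB gradient buffer across 8 workers moves
# roughly 735 MB in and out of every worker per iteration.
print(ring_allreduce_bytes_per_worker(420e6, 8) / 1e6, "MB per worker")
```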

In addition to DP training, there is a rapid increase in model parallelism (MP) training, motivated by the rapid increase in the computation and memory requirements of neural network training. The size of deep learning models has been doubling about every 3.5 months [9, 14, 77]. Many models, such as Google's Neural Machine Translation [78, 79] and Nvidia's Megatron [80], no longer fit on a single device [81–83] and need to be distributed across multiple GPUs [84]. To train such models, model parallelism (and hybrid data-model parallelism) approaches partition the model (and data) across different workers [85]. Model parallelism is an active area of research, with various model partitioning techniques [13, 14, 81–83, 86]. For example, pipeline parallel approaches, such as PipeDream [13] and GPipe [14], have emerged as a sub-area of model parallelism. Recent work [21] explores optimizing over a large space of fine-grained parallelization strategies (e.g., parallelizing each operator in a DNN computation graph separately), demonstrating an increase in training throughput of up to 3.8×. We posit that all of these approaches will benefit from a platform with high network bandwidth.

2.1 Need for Tbps Per-node Bandwidth

We profile traffic traces from deep neural networks distributed across a DGX-1V system [24] to motivate the need for a rack with 256 nodes and 2.4 Tbps per-node bandwidth, or 614.4 Tbps of total interconnect bandwidth. To put this into context, the interconnect bandwidth of today's commodity electrical packet switches is 12.8 Tbps (128 ports, each 100 Gbps); 48 times lower than our proposal.

Our DGX system contains eight GPUs connected by NVLinks with 1.2 Tbps bandwidth per GPU: each GPU has six NVLinks (each with 200 Gbps) to connect to seven other GPUs. Our profiling includes a mix of data and model parallel training of five popular deep neural networks: ResNet [69], Inception [87], VGG [88], GPT2 [80, 89], and Transformer [90]. We are not claiming our observations hold for all DNN jobs, but these models represent the state-of-the-art neural networks in production systems.

2 The first chips are expected to hit the market later this year [72].

Figure 2: GPU-to-GPU throughput (Tbps) over time for different distributed training workloads. (a) Transformer (8-way data parallel); (b) VGG (4-way model parallel); (c) GPT2 (4-way model and 2-way data parallel).

We instrument each NVLink to log its network traffic using Nvidia's nvml API every 500 µs while running the training workloads. We convert these logs into a series of GPU-to-GPU time-series representing the bandwidth demand between GPU pairs. These measurements illuminate the nature of ML workloads and provide a general framework to evaluate TeraRack. For reproducibility, our traffic trace data and GPU instrumentation code will be publicly available.

Fig. 2 shows the GPU-to-GPU bandwidth utilization for three training networks (Transformer [71], VGG [70], and GPT2 [91]) as examples of data, model, and hybrid data-model parallel workloads, respectively. We run the Transformer model in data parallel mode on our DGX-1 box across all eight GPUs, the VGG model on four GPUs in a 4-way model parallel mode, and the GPT2 model on eight GPUs, with a hybrid of 4-way model and 2-way data parallelism [84]. The figure shows that the throughput can hit the peak capacity of ∼1.2 Tbps. Moreover, we find that the throughput is periodic. The intuition behind the periodicity is that each training iteration repeats the same pattern of computation dependencies. To quantify this observation, we calculate the autocorrelation of each workload and find that the autocorrelation peaks correspond to training iterations (figure shown in Appendix A). Prior work has reported periodic behavior for GPU-to-memory transfers [92]; our measurements confirm their findings, but for GPU-to-GPU traffic.
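A minimal sketch of how the iteration period can be recovered from such a throughput trace via autocorrelation follows (the paper's actual analysis is in Appendix A; this is an illustrative stand-in using NumPy):

```python
import numpy as np


def iteration_period_s(throughput, sample_interval_s=500e-6):
    """Estimate the training-iteration period from a link-throughput trace.

    `throughput` is a per-link utilization time series sampled every
    `sample_interval_s` seconds (500 us in the logging setup above). The
    first prominent non-zero-lag autocorrelation peak gives the period.
    """
    x = np.asarray(throughput, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0 .. len(x)-1
    acf = acf / acf[0]                                  # normalize to 1 at lag 0
    for lag in range(1, len(acf) - 1):
        # first local maximum with a reasonably strong correlation
        if acf[lag] > 0.2 and acf[lag] >= acf[lag - 1] and acf[lag] >= acf[lag + 1]:
            return lag * sample_interval_s
    return None
```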

To understand the growing need for Tbps bandwidth at each node, we now provide the intuition behind the bandwidth requirements of distributed training tasks from first principles. In a balanced parallel training system, the communication time and computation time should be roughly equal, with near-perfect overlap. The advent of new hardware architectures, such as Cerebras [93] and Google's TPU [94], along with improvements in software stacks, is leading to a continuous decrease in compute time. Hence, bandwidth has to increase to avoid making the network a bottleneck. In data-parallel training, this effect can be partially compensated by increasing the amount of data that each GPU trains on (i.e., the local batch size). However, there is a limit to how much the total training time can be reduced by increasing the batch size [20]. Importantly, model-parallel training cannot leverage this technique since the size of the communicated data is proportional to the batch size.

As an example, we calculate the required bandwidth for data parallelism with the Transformer model [90]. Today, the batch size used for Transformer at the largest scale is 1280. We measure the backward pass as 23.5 ms on a V100 GPU, during which 420 MB need to be all-reduced. To achieve maximal overlap, the system needs to provide an effective all-reduce bandwidth of about 420 MB / 23.5 ms ≈ 150 Gbps per node. Hence, as we increase the system scale or improve the processor's performance through hardware or software, the required bandwidth grows rapidly. For example, a 5× processor improvement and a 2× larger system require an effective bandwidth of about 1.36 Tbps per GPU for the Transformer model.
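The back-of-the-envelope calculation above can be written down directly. The snippet below (ours, for illustration) reproduces the 420 MB / 23.5 ms example and shows how the requirement scales with a hypothetical compute speedup; the additional effect of system size depends on the all-reduce algorithm and batch-size regime, so it is not modeled here:

```python
def required_allreduce_gbps(grad_bytes: float, backward_time_s: float,
                            compute_speedup: float = 1.0) -> float:
    """Effective all-reduce bandwidth (Gbps) needed to fully overlap
    gradient communication with the backward pass."""
    return grad_bytes * 8 / 1e9 / (backward_time_s / compute_speedup)


print(required_allreduce_gbps(420e6, 23.5e-3))                     # ~143 Gbps (~150)
print(required_allreduce_gbps(420e6, 23.5e-3, compute_speedup=5))  # ~715 Gbps
```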

2.2 The Solution: An Optical Interconnect with µs Reconfigurability

The previous section establishes the need for Tbps per-node bandwidth. This section discusses the desired properties of an interconnect to achieve such high bandwidth at moderate scale (e.g., 256 nodes).

The strawperson design. An interconnect with electrical packet switches is the first design option to consider. For instance, the DGX-2 fabric uses 12 NVSwitch chips [2] to interconnect 16 GPUs. An NVSwitch offers the highest on-chip bandwidth, with 18 ports of 200 Gbps capacity (3.6 Tbps total bandwidth). However, scaling this fabric to 256 GPUs in one densely connected box would require a large number of these switching chips arranged in a multi-layer topology. For instance, the standard fat-tree topology [95] providing 2.4 Tbps of bandwidth to 256 GPUs would require 960 NVSwitches and 9216 NVLinks. The cost of such an interconnect is 10× higher than TeraRack (see Section 4 for a detailed cost comparison). However, using commodity electrical packet switches to interconnect 16 DGX-2 boxes (each with 16 GPUs) does not give the desired per-GPU bandwidth: each DGX box can provide up to 800 Gbps bandwidth shared across 16 GPUs (50 Gbps per GPU). Section 5 demonstrates that such a design cannot support the high bandwidth requirement of hybrid data-model parallel jobs.
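One way to arrive at the 960-switch and 9216-link figures quoted above is the standard k-ary fat-tree accounting (5k²/4 switches, 3k³/4 links, k³/4 host ports), replicated three times. The sketch below is our own arithmetic check, not a sizing tool from the paper:

```python
def fat_tree_counts(k: int) -> dict:
    """Standard three-stage k-ary fat-tree [95]: k^3/4 host ports,
    5*k^2/4 switches, 3*k^3/4 links."""
    return {"host_ports": k ** 3 // 4,
            "switches": 5 * k ** 2 // 4,
            "links": 3 * k ** 3 // 4}


# 256 GPUs with four bundled 200 Gbps ports each -> 1024 host ports -> k = 16.
# Replicating the fabric 3x raises per-GPU bandwidth from 800 Gbps to 2.4 Tbps.
tree, replicas = fat_tree_counts(16), 3
print(replicas * tree["switches"], "NVSwitches")  # 960
print(replicas * tree["links"], "NVLinks")        # 9216
```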


Figure 3: CDF of message sizes for typical computer vision and natural language processing neural network models (message size in bytes, 10^6 to 10^9; models: ResNet50, ResNet152, VGG16, InceptionV3, GPT2, Transformer).

Optical interconnects. Given the limited bandwidth provided by electrical fabrics, another attractive solution is a circuit-based optical interconnect. At first glance, it appears that standard optical interconnects fit the bill, as ML training consists of thousands of iterations with the same communication pattern, making it suitable for long-lasting optical circuits. An important consideration is that at a 256 port count, MEMS-based optical switches have a reconfiguration latency of 30 ms. As a result, they are suitable only for circuits that can last for several hundreds of milliseconds, such as large data center flows. Although ML jobs last many hours, and the communication pattern of ML jobs does not change between training iterations, the optical circuits may have to be reconfigured within iterations. Consequently, the iteration time plays an important role in determining potential circuit reconfiguration times. For instance, a circuit-based optical interconnect with a 30 ms reconfiguration time is not ideal if the iteration time is a few milliseconds: reconfiguring the circuits during the iteration will bloat the total training time significantly. In fact, our evaluations (§5) show that all model parallel training iterations require circuit reconfiguration within the iteration.

Estimating the total iteration time. The computation time plays an important role in determining the training iteration time. Section 2.1 shows that scaling the number of nodes or improving the compute capabilities of each node will result in a decrease in the computation time per node. To prevent the network from becoming a bottleneck, the communication time (which depends on the size of messages transferred between nodes) should overlap entirely with the computation time, making it a suitable proxy for a lower bound on the training iteration time at each node. Fig. 3 shows the cumulative distribution function of transfer sizes during iterations for different workloads, including ResNet50, ResNet152, VGG16, InceptionV3, GPT2, and Transformer. We train the Transformer model in 8-way data-parallel mode, and ResNet, VGG16, and InceptionV3 are 4-way model parallel. GPT2 is trained with a mix of 2-way data and 4-way model parallelism. In most cases, the transfer sizes vary from 1 MB to 1 GB. For instance, the median transfer size of Transformer is about 200 MB, or 1.6 ms on a 1 Tbps link. This means reconfiguring circuits for this workload requires an optical interconnect with a circuit reconfiguration time shorter than 160 µs to ensure the reconfiguration latency is a small fraction (<10%) of the circuit-hold time.
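The 160 µs budget follows directly from the numbers above; a small helper (ours, for illustration) makes the arithmetic explicit:

```python
def max_reconfig_time_us(transfer_bytes: float, link_gbps: float,
                         overhead_frac: float = 0.1) -> float:
    """Largest circuit reconfiguration time (us) that keeps switching
    overhead below `overhead_frac` of the circuit hold time."""
    hold_time_s = transfer_bytes * 8 / (link_gbps * 1e9)
    return overhead_frac * hold_time_s * 1e6


# Median Transformer transfer: ~200 MB on a 1 Tbps link -> 1.6 ms hold time,
# so reconfiguration must finish within ~160 us.
print(max_reconfig_time_us(200e6, 1000))  # 160.0
```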

2.3 Prior Proposals Are Insufficient

In this section, we briefly summarize why prior proposals fall short in supporting the above requirements of Tbps bandwidth and µs reconfiguration time simultaneously. Sections 4 and 5 quantitatively compare the cost and performance of TeraRack with relevant prior work.

Static topologies. Fat-tree [95], Dragonfly [96, 97], Slim Fly [98], Diamond [99], WaveCube [100], and Quartz [101] are all examples of static topologies proposed for datacenters. However, increasing the bandwidth of static topologies to support Tbps bandwidth is prohibitively expensive and requires non-trivial cabling and topology changes. For example, in Section 4, we show that the cost of a static topology supporting the scale and bandwidth of TeraRack is 10× larger than the projected cost of TeraRack.

Traffic-oblivious topologies. Jellyfish [102], Rotornet [103], and Opera [104] take advantage of the unpredictability of datacenter workloads and use expander-based topologies to improve the flow completion time of short and long flows. Random permutations are not ideal for ML workloads, however, as a training workload is a periodic repetition of thousands of iterations.

3D MEMS-based technologies. Helios [61], c-Through [62], and OSA [63] use commercialized high port count optical switches with ≈30 ms reconfiguration latency. These proposals might be suitable for data parallel workloads using the ring-reduce algorithm, but they are inefficient at supporting tree-reduce, hierarchical all-reduce [8], and hybrid data-model parallel training jobs. Section 5 evaluates the impact of reconfiguration delay on ML training time.

2D MEMS-based technologies. The lower port count 2D MEMS is often a component of a Wavelength Selective Switch (WSS) which, in combination with a wavelength-selective element such as a grating, has been used as a microsecond-reconfigurable multiplexer in Mordia [64] and REACToR [65]. However, such designs have limited port count. Megaswitch [60] reveals that as WSS port count increases, the reconfiguration time changes from 11.5 µs to 3 ms, making such proposals unsuitable for our needs. Moreover, each Megaswitch node broadcasts its wavelengths on a dedicated fiber, thus limiting the number of nodes in the topology to 30. In contrast, TeraRack can support up to 256 nodes.

Free-space interconnects. ProjecToR [105] and FireFly [106] have high port count and microsecond reconfiguration latency, but they use free-space optical links, which have alignment challenges and are sensitive to environmental vibrations.

Nanosecond switching fabrics. Shoal [107], Larry [108] and XFabric [109] have proposed reconfigurable datacenter interconnects with nanosecond switching fabrics. Similarly, Sirius [25] and WS-TDMA [26, 110–112] have demonstrated the feasibility of sub-nanosecond wavelength switching using fast tunable lasers and arrayed waveguide grating routers (AWGRs). We believe these proposals have the potential to become widely deployed in datacenter environments, but these fabrics do not support Tbps bandwidth between communicating nodes. Moreover, building a control plane with nanosecond response time is overkill for our use case.

Figure 4: TeraRack's design. (a) TeraRack node: a GPU attached to a TeraPHY optical I/O chiplet (600 Gbps per interface) containing micro-ring resonators, a modulator and receiver with clock/data recovery and distribution, ring tuning control, and electrical I/O, fed by an off-chip comb laser. (b) Topology with 256 nodes: nodes are connected by clockwise and counter-clockwise fiber rings, with a PCI-E connection serving as the control plane.

3 TERARACK DESIGN

TeraRack interconnects 256 compute nodes and is tailored for the communication patterns and bandwidth requirements of ML workloads.

TeraRack node. Fig. 4a illustrates the components in a TeraRack node designed based on TeraPHY silicon photonics technology [32, pages 25–28]. Each node in TeraRack has four optical interfaces [32, 34–36]. Note that the optical interfaces do not need the expensive and bulky DWDM transceivers used in wide-area networks. Instead, each interface can be connected to other nodes via an MTP fiber connector [113]. The capacity of each interface is 600 Gbps, achieving a total full-duplex bandwidth of 2.4 Tbps per node. On the send side, an off-chip comb laser generates light that is steered into the node via a fiber coupler, toward an array of MRRs modulating the GPU's transmitted data at 25 Gbps per wavelength. On the receive side, a second array of MRRs selects the wavelengths targeted to the GPU and passes through the remaining wavelengths. Each interface has 24 micro-ring resonators to select and forward any subset of 24 wavelengths. Section 6 measures the reconfiguration latency of MRRs in our testbed at 25 µs. Our laser technology is based on a comb laser source, called SuperNova Light Supply [114], that produces a spectrum consisting of multiple equidistant frequencies. At its simplest, an optical comb can be used to replace an array of independent lasers, like those of a dense wavelength division multiplexed (DWDM) communications system. The SuperNova laser source is capable of supplying light for 256 wavelengths.

TeraRack topology. TeraRack's design does not require an optical switch to interconnect its nodes. Instead, we use fiber rings to interconnect the micro-rings on each node. Fig. 4b shows the high-level view of TeraRack's topology. Data plane traffic is distributed across four single-mode fiber rings that connect the four interfaces of each node (two clockwise rings and two counter-clockwise). The figure shows one ring in each direction for clarity.

Figure 5: Wavelength allocation and its equivalent flow routing translation. (a) Wavelength allocation; (b) Flow allocation.

Wavelength scheduling. One of the core properties of TeraRack is the ability to dynamically place wavelengths around the fiber ring to maximize the throughput between communicating nodes. The optimal wavelength allocation maximizes the throughput while ensuring that no two paths sharing the same fiber are assigned the same wavelength. More formally, the wavelength allocation problem corresponds to the following optimization problem. Let $TM_{ij}$ denote the predicted GPU-to-GPU traffic matrix, and $W$ denote the total number of wavelengths. We can represent a wavelength allocation as a 3-dimensional binary matrix, $\Lambda$, where $\Lambda_{ijk}$ is 1 if GPU $i$ sends data to GPU $j$ using $\lambda_k$ and is zero otherwise. There are several possible objectives. A natural one is to minimize the maximum completion time for any GPU-to-GPU transfer, where the completion time is $TM_{ij} / \sum_k \Lambda_{ijk}$. This can be expressed as an Integer Linear Program (ILP) by maximizing the minimum inverse of the completion time, as follows:

\[
\underset{\Lambda \,\in\, \{0,1\}^{N \times N \times W}}{\text{maximize}} \;\; \min_{i,j \,:\, TM_{ij} > 0} \;\; \frac{\sum_k \Lambda_{ijk}}{TM_{ij}} \tag{1}
\]

The full formulation is presented in Appendix D. The constraints (1) ensure that fiber segments do not carry overlapping wavelengths (the ring constraint) and (2) ensure that each GPU can use each wavelength for communication with at most one other GPU (the node constraint).
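The exact formulation is in Appendix D (not included here); the following is only a sketch of how the two constraint families described above could be written, where $P(i,j)$ denotes the set of fiber segments on the ring path chosen for the pair $(i,j)$ (notation introduced here for illustration):

```latex
% Sketch only; the exact formulation is in Appendix D.
% P(i,j) = set of fiber segments on the ring path chosen for pair (i,j).
\begin{align*}
  \sum_{(i,j)\;:\;s \,\in\, P(i,j)} \Lambda_{ijk} &\le 1
      && \forall \text{ segments } s,\; \forall k
      && \text{(ring constraint)}\\[4pt]
  \sum_{j} \Lambda_{ijk} \le 1, \qquad \sum_{j} \Lambda_{jik} &\le 1
      && \forall i,\; \forall k
      && \text{(node constraint)}
\end{align*}
```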

Note that the size of the ILP solution space, $\Lambda \in \{0,1\}^{N \times N \times W}$, grows with the number of nodes in the network, rendering it intractable at larger scales. Hence, instead of solving the ILP, we present a more practical algorithm, FlowRing, that turns this discrete optimization problem into a min-cost flow routing problem which can be solved efficiently.


The FlowRing algorithm. Our approach is inspired by prior work on formulating wavelength assignment in optical networks as flow routing problems [68, 115–119]. To develop intuition, consider the example in Fig. 5a with four TeraRack nodes placed around a fiber ring. The arrows represent the traffic demand between nodes and the colors on each arrow show one possible wavelength allocation. Notice the reuse of wavelengths on non-overlapping ring segments. For example, wavelength R is used to satisfy three demands on three different ring segments, and wavelengths G and B are used to satisfy two different demands. Also notice that, on each ring segment, a wavelength is only assigned to a single demand.

We can construct a flow routing problem corresponding to this example as shown in Fig. 5b. We cut the ring at Node1, and then divide Node1 into two nodes: a source and a sink. The source node injects a continuous flow of size 1 unit into the ring and the sink node absorbs a flow of 1 unit. As the flow circles the graph between the source and the sink, it gets split across different outgoing edges at each node. These outgoing edges correspond to traffic demands out of that node, and the fraction of the flow on each edge corresponds to the fraction of wavelengths assigned to carry that demand. For instance, in Fig. 5b, the flow is split at Node2, with 1/3 going to the Node2→Node3 edge and 2/3 going to the Node2→Node4 edge. With W = 3 total wavelengths, this corresponds to the allocation in Fig. 5a.

A standard property of a valid flow routing is "flow conservation," i.e., the fact that the total flow into any node must be the same as the total flow out of that node (except for the source and sink nodes). In our setting, the implication of flow conservation for wavelengths is that when a wavelength is dropped at a GPU, it must be added back to the fiber ring on the next segment (provided there is any demand between that GPU and its subsequent GPUs). Flow conservation enforces the invariant that the total flow across a cut at any segment of the ring has size 1. This invariant captures the idea that, in any throughput-maximizing allocation, each ring segment must carry all wavelengths if there is traffic demand between GPUs that come before and after that segment.

Our FlowRing algorithm involves the following steps:

Step 1: Graph construction. We construct a directed graph, $G = (V, E)$, where $V$ is the set of nodes on the ring and, for every $TM_{uv} > 0$, there is a directed edge $e = (u, v)$. After including edges for the entire TM in $G$, we check whether every adjacent node-pair on the ring is connected in $G$. If not, we add a "dummy" edge between them to $E$. For instance, in Fig. 5b, the dotted line between Node4 and Node1 is a dummy edge, since it does not correspond to a demand in the traffic matrix. The direction of all edges in $G$ is the same as that of wave propagation on the fiber.3 We then add the sink and source nodes by cutting the edges in $G$ along an arbitrary ring segment. For simplicity, let us assume for now that this process cuts only one edge of the graph, as is the case in Fig. 5b, in which we cut the edge between Node1 and Node2. We add two terminal nodes on the two ends of the cut edge to be the source and sink. The source node injects a unit-sized flow into the ring and the sink node receives it.

3 We solve separate instances of the flow routing problem for the clockwise and counter-clockwise rings, including each demand on the ring for which it needs the fewest hops.

Step 2: Compute min-cost flow. Having constructed the graph $G$, we solve the following flow routing problem:

\[
\text{maximize} \;\; \sum_{e \,\in\, E \,:\, TM_e > 0} \frac{f_e}{TM_e} \tag{2}
\]

where for an edge $e = (u, v)$, $f_e$ is the flow on the edge and $TM_e = TM_{uv}$ is the traffic demand on that edge. The constraints (not shown for brevity) are the standard flow conservation constraints. The intuition for the above objective is that we wish to maximize throughput but preferentially allocate a larger flow (more wavelengths) to GPU-to-GPU paths with smaller demand. The reason for favoring smaller demands is to complete them quickly, reducing the number of nodes that each node has to communicate with. This keeps the unsatisfied traffic pattern sparse over time, allowing the remaining traffic to be handled efficiently in future wavelength reconfiguration events.

The objective in Eq. (2) can equivalently be written as a min-cost flow routing problem [120] by defining the weight of edge $e$ as $w_e = -1/TM_e$ if $TM_e > 0$, and $w_e = 0$ if $e$ is a dummy edge. The problem is then to minimize $\sum_e w_e f_e$. Min-cost flow routing can be solved using the network simplex algorithm [121, 122], and there exist a variety of combinatorial algorithms [120]. The procedure for constructing the graph and defining the flow routing problem is slightly more complicated when the cut chosen for adding the source and sink nodes includes more than one edge. In this case we need additional constraints to ensure consistency of flows between the cut edges (details in Appendix B).

Step 3: Remove and repeat. The solution obtained by solving the above min-cost flow problem may result in some GPU-to-GPU demands completing very quickly. However, since reconfiguration incurs delay (e.g., 25 µs in our prototype), we cannot reconfigure wavelengths too quickly without hurting efficiency (more on this below). Therefore we should plan the wavelength allocation based on a time horizon rather than looking only at the instantaneous traffic demands. To this end, we iteratively solve the min-cost flow problem in Equation (2), serving the TM with a step size of ∆ based on the flows obtained after each iteration, and repeating this procedure until there is no unserved demand left in the TM. We compute the mean of the flow allocations over all iterations as the final flow allocation.

Step 4: Mapping flows to wavelengths. Finally, we scale the flows from the previous step by W and then map them to integer numbers using a technique called randomized rounding [123]. This produces the final wavelength allocation.
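To make Steps 1–2 concrete, the following is a simplified sketch (not the paper's implementation) for one ring direction: the ring is cut between node N−1 and node 0, and every demand (u, v) is assumed to travel clockwise with u < v, so no demand crosses the cut. As a further simplification, it injects W integer units of flow so the integral min-cost solution maps directly to wavelength counts, rather than routing a unit flow and randomized-rounding as in Step 4. It uses NetworkX's network-simplex solver.

```python
import networkx as nx


def flowring_simplified(TM, W, scale=10**6):
    """Steps 1-2 of FlowRing for one ring direction, simplified: the ring is
    cut between node N-1 and node 0, and every demand (u, v) with
    TM[u][v] > 0 is assumed to travel clockwise with u < v. W integer flow
    units are injected so the min-cost solution gives wavelength counts."""
    n = len(TM)
    G = nx.DiGraph()
    G.add_node("src", demand=-W)   # injects W wavelengths into the ring
    G.add_node("sink", demand=W)   # absorbs them just before the cut
    G.add_edge("src", 0, capacity=W, weight=0)
    G.add_edge(n - 1, "sink", capacity=W, weight=0)

    # Demand edges: weight ~ -1/TM (scaled to integers), so edges with small
    # demand are preferred, mirroring the objective in Eq. (2).
    for u in range(n):
        for v in range(u + 1, n):
            if TM[u][v] > 0:
                G.add_edge(u, v, capacity=W, weight=-round(scale / TM[u][v]))

    # Dummy zero-weight edges between adjacent nodes keep every cut feasible.
    for u in range(n - 1):
        if not G.has_edge(u, u + 1):
            G.add_edge(u, u + 1, capacity=W, weight=0)

    _, flow = nx.network_simplex(G)
    return {(u, v): flow[u][v]
            for u in range(n) for v in range(u + 1, n) if TM[u][v] > 0}


# Toy 4-node example with W = 3 wavelengths.
TM = [[0, 2, 0, 1],
      [0, 0, 1, 2],
      [0, 0, 0, 1],
      [0, 0, 0, 0]]
print(flowring_simplified(TM, W=3))
```

Note that a single min-cost solve can starve some demands entirely (all wavelengths chain through the cheapest edges), which is exactly why Step 3 re-solves on the residual traffic matrix.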

Scheduling frequency. An important consideration in TeraRack's design is how frequently to reschedule the wavelength allocations. By rescheduling frequently, we can tailor the wavelength allocation better to the traffic demands. But rescheduling too quickly is also undesirable because each reconfiguration incurs a delay during which no traffic can be sent. In our experiments, we found that setting the rescheduling period to 100 µs (4× the reconfiguration delay) provides the best performance.

Device placement. Splitting a DNN's computation graph across multiple devices is non-trivial in the case of model/pipeline parallelism due to the need to balance computation and memory usage while maintaining low communication between devices [13, 14, 21, 124, 125]. TeraRack can use any existing model partitioning algorithm, but we present a simple heuristic to split a computation graph. The goal of this heuristic is to keep the communication between GPUs both sparse (i.e., a GPU communicates with few other GPUs) and local (i.e., a GPU communicates with nearby GPUs along the ring). Given the run-time of each operation, we first split the large operations across the batch dimension into equal-sized parallel operations following a similar approach to [126]. We then topologically sort the resulting computation graph. We traverse the operations in this sort order, placing each operation on the next GPU in a round-robin manner. The motivation for this algorithm is that DNN computation graphs often have long chains of operations (with each operation sending data only to the next operation on the chain). Now consider two operations i and i+1 along such a chain, and suppose that operation i is assigned to GPU j. Then the above algorithm is likely to assign operation i+1 to GPU j+1. Hence most communication will be between adjacent GPU pairs along the ring (or, more generally, nearby GPU pairs). This reduces the interference between GPU-GPU communications, enabling more wavelength reuse for TeraRack.
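A minimal sketch of the round-robin placement step (the batch-dimension splitting of large operations is omitted; `op_graph` is a hypothetical input mapping each operation to the set of operations it depends on):

```python
from graphlib import TopologicalSorter


def place_round_robin(op_graph, num_gpus):
    """Topologically sort the computation graph and assign operations to
    GPUs in round-robin order, so consecutive operations on a chain tend
    to land on ring-adjacent GPUs."""
    placement = {}
    for idx, op in enumerate(TopologicalSorter(op_graph).static_order()):
        placement[op] = idx % num_gpus
    return placement


# Toy chain a -> b -> c -> d on 4 GPUs: every edge crosses adjacent GPUs.
ops = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"c"}}
print(place_round_robin(ops, 4))  # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```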

4 COST OF TERARACK

We estimate the cost of a TeraRack interconnect with 256 nodes and 2.4 Tbps bandwidth per node and compare it with the cost of a scaled DGX fabric (ElectNet), an OCS-based interconnect (OcsNet), and Megaswitch [60], an optical ring interconnect. The cost of the Mordia [64], Quartz [101], and OSA [63] fabrics is higher than that of Megaswitch, hence we omit them in our cost comparison. Table 1 compares the cost of these interconnects.

Cost of ElectNet fabric. We posit that a fully electrical fabric comparable with TeraRack is an on-chip scaled DGX platform with 256 GPUs and 2.4 Tbps bandwidth per GPU. We reverse engineer the cost of today's DGX-2 and use the components' cost to estimate the cost of the ElectNet fabric. The retail price of a commercial DGX-2 box with 16 V100 GPUs and 12 NVSwitches (each with 18 NVLink ports) is $400K [127]. We assume that the actual cost of a DGX-2 fabric is $100K (a profit margin of $300K). We estimate the cost of a 200 Gbps NVLink at $100 (10× lower than the price of a commodity electrical transceiver [128]) and the cost of a V100 GPU at $1000 (8× lower than the price tag on Amazon [129]). As a result, we estimate the cost of an NVSwitch at $5,200. To scale the current fabric to 256 GPUs, we scale the DGX into an on-chip fat-tree interconnect with k=16 using NVSwitches and NVLinks. A fat-tree with k=16 can interconnect 1024 nodes; however, given that the bandwidth of NVLinks is only 200 Gbps, we bundle every four ports together to achieve 800 Gbps per-GPU bandwidth. To further scale the per-node bandwidth to 2.4 Tbps, we replicate the same fat-tree interconnect three times. The resulting interconnect has 960 NVSwitches interconnected with 9216 NVLinks.

Fabric       Component             Unit Cost ($)   Count    Total Cost ($K)
ElectNet     NVLink                          100    9216              921.6
             NVSwitch                       5200     960             4992
             Packaging/box                  1000       1                1
             All together                                            5,914
OcsNet       OCS switch                  100,000      12             1200
             Transceiver                     100    3072              307.2
             Fiber cable                      4.4   3072               13.5
             All together                                            1,520.7
Megaswitch   WSS                          60,000     256           15,360
             Amplifier                       100     256               25.6
             DWDM transceiver             10,000   24,832          248,320
             All together                                          263,705
TeraRack     SiP interface                   441    1024              451.6
             Electrical circuits             308     256               78.8
             SiP amplifier                   100     256               25.6
             Comb laser                      100     256               25.6
             Packaging                        30     256                7.7
             Single-mode fiber                 3    1024                3.1
             PCI-E slot                       45     256               11.5
             All together                                              604

Table 1: Component and interconnect costs of TeraRack and its electrical and optical counterparts. The table excludes the cost of GPUs, assuming it is roughly the same for all fabrics.
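The per-fabric totals in Table 1 follow from the unit costs and counts; the snippet below is a simple arithmetic sanity check (ours, not part of the paper's tooling):

```python
# Recompute the "All together" rows of Table 1 from (unit cost, count) pairs.
fabrics = {
    "ElectNet":   [(100, 9216), (5200, 960), (1000, 1)],
    "OcsNet":     [(100_000, 12), (100, 3072), (4.4, 3072)],
    "Megaswitch": [(60_000, 256), (100, 256), (10_000, 24_832)],
    "TeraRack":   [(441, 1024), (308, 256), (100, 256), (100, 256),
                   (30, 256), (3, 1024), (45, 256)],
}
for name, parts in fabrics.items():
    total_k = sum(unit * count for unit, count in parts) / 1e3
    print(f"{name}: ${total_k:,.1f}K")
# ElectNet: $5,914.6K   OcsNet: $1,520.7K
# Megaswitch: $263,705.6K   TeraRack: $603.9K
```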

Cost of OcsNet. To interconnect 256 GPUs optically, we design an OCS-based fabric following the state-of-the-art optical proposals [61, 62, 103, 104] using 12 OCS switches. We interconnect the GPUs to each OCS using 200 Gbps optical transceivers to achieve 2.4 Tbps per-GPU bandwidth. The cost of the OCS and transceivers is obtained from online quotes [130, 131]. We estimate the cost of fiber cables based on $0.44/m [132] and an average length of 10 meters.

Cost of Megaswitch. The required components for Megaswitch are obtained from the paper [60, Table 1]. We note that the current design of Megaswitch does not scale beyond 32 nodes; however, we conservatively assume the fabric is scalable without requiring new components. The prices are based on quotes we obtained from vendors. The most dominant component in the cost of Megaswitch is the price of DWDM transceivers. We note that there is an important distinction between the cost of commodity DWDM transponders and TeraRack's SiP-based design. In particular, DWDM transponders are designed to operate at long distances, which imposes strict challenges on the laser, manufacturing, forward-error correction, photodiode sensitivity, modulation scheme, as well as light coupling. In contrast, SiP interfaces are designed for short distances and do not require coherent detection; hence they can take advantage of the considerable development and commercialization of photonics components for short-distance datacenters. However, even if we replace the DWDM interfaces in Megaswitch with TeraRack's SiP interface, the price only comes down to $30,384K, which is still a factor of 50 higher than that of TeraRack. Intuitively, this is because Megaswitch's design consumes too many interfaces and does not take advantage of wavelength reuse, a key point in TeraRack's design (§3).

Cost of TeraRack. A TeraRack node consists of four TeraPHY SiP interfaces, each with 600 Gbps capacity [32]. The manufacturing cost of a silicon photonics chip depends on the area of the chip. Prior work has built TeraPHY SiP interfaces with size 8.86 mm × 5.5 mm [32, Slide 41]. This area contains optical transmit, receive, and MRRs. The cost of manufacturing this SiP interface is $44,082 for a volume of 20 chips ($4,408/chip) based on the 2020 Europractice price list [133].4 Hence, assuming the cost will drop by a factor of 10 at mass production, our cost estimate for each SiP interface in TeraRack is $441. Similarly, we use the Europractice price list to estimate the cost of the on-chip electrical circuitry based on a 10 mm² chip area [34–36, 134].5 The electrical circuitry includes drivers, the MRRs' tuning control logic, and CMOS transimpedance amplification (TIA). Our estimates of the cost of the on-chip SiP amplifier [135], off-chip comb laser [114], and packaging are based on quotes from manufacturers such as Tyndall [136] and PLCC [137], but their price lists are, unfortunately, confidential. The cost of single-mode fiber to interconnect TeraRack nodes into rings is obtained assuming $3/meter [106]. Finally, to put the rack together, we estimate the cost of PCI-E slots using online sellers [138].

5 EVALUATION

In this section, we quantify TeraRack's performance and cost with respect to other electrical and optical network interconnects (§5.2), analyze its performance for different parallelization strategies (§5.3), and evaluate its design decisions (§5.4).

Our simulation results show:

  • For three representative DNN models, TeraRack speeds up training time by factors of 2.12–4.09× and 1.65–3× compared to cost-equivalent electrical-switch and OCS-based interconnects respectively, at 256-GPU scale. TeraRack's performance is equivalent to electrical-switch and OCS-based interconnects that cost at least 6× and 4× more, respectively.

  • For data-parallel training, TeraRack effectively acts as a 256×2.4 Tbps electrical switch. Further, TeraRack's high bandwidth enables hybrid data and model parallel training strategies that outperform data-parallel-only training by up to 2× at 256-GPU scale.

  • TeraRack's performance strongly depends on its ability to reuse wavelengths: it outperforms a baseline without wavelength reuse by 55× at 256-GPU scale. TeraRack also relies on fast wavelength rescheduling (once every 100 µs) to deliver good performance.

4 Europractice is an EC initiative that provides industry and academia with a platform to develop smart integrated systems, ranging from advanced prototype design to volume production. The cost is listed as €80,000 on page 10 under imec Si-Photonics iSiPP50G and the volume is listed as 20 samples on page 6 under iSiPP50G.
5 Page 6 under GLOBALFOUNDRIES 22 nm FDSOI lists €14,000/mm² for 50 samples.

5.1 Methodology and Setup

Simulator. Our simulator models GPUs and the network interconnect. We simulate a TensorFlow application by defining ML computation graphs that are trained on multiple GPUs with data or model parallelism, or a hybrid of the two. We simulate a system with a high degree of communication-computation overlap. Specifically, in order to achieve high utilization for model-parallel workloads, we simulate microbatches as in GPipe [14]. Each device involved in model-parallel training sends the results of its microbatch as soon as it is ready, hence reducing the idle time of dependent workers. The interconnect component simulates our optical ring and SiP interfaces, as well as electrical, full-mesh, and optical circuit switching (OCS) topologies.

To simulate GPU behavior, we profile each kernel in each of our DNN models. We use the TensorFlow graph profiler tool while running each workload on a V100 GPU near its full utilization. We then construct a graph profile for each DNN model that we provide as input to our simulator. Each node in this graph is an operation kernel, labeled with the node's GPU time, CPU time, peak memory size, output bytes, and the number of floating-point operations required for running the operation. We profile three DNN models: ResNet152, VGG16, and Transformer. We use the ImageNet [139] training specifications for ResNet152 and VGG16, and LM1B [140] for Transformer.

To simulate GPU behavior, we profile each kernel in eachof our DNN models. We use the Tensorflow graph profiler toolwhile running each workload on a V100 GPU near its full uti-lization. We then construct a graph profile for each DNN modelthat we provide as input to our simulator. Each node in thisgraph is an operation kernel, labeled with the node’s GPU time,CPU time, peak memory size, output bytes, and the numberof floating-point operations required for running the operation.We profile three DNN models: ResNet152, VGG16, and Trans-former. We use the ImageNet [139] training specifications forResNet152 and VGG16, and LM1B [140] for Transformer.Parallelization strategy search. We develop a simple ap-proach to explore the space of hybrid data and model paral-lelism techniques and pick the best one for a given DNN modeland network interconnect. Our algorithm takes as input: (1) thenumber of GPUs, (2) the bandwidth available per GPU, (3) thegraph profile for the DNN model as described above, and (4)a curve providing the number of training iterations required toreach desired accuracy as a function of the (global) batch size.The training iterations vs. batch size curves were measuredfor the three models we consider in [20]. These curves play animportant role in determining a good parallelization strategy.In particular, for DP training, we use these curves to determinethe impact of increasing batch size (to reduce communica-tion bandwidth requirement, see §2.1) on training time. Giventhis information, we loop through all possible hybrid DP-MPconfigurations and use our device placement algorithm (§3)to allocate the operations to devices. We then estimate eachconfiguration’s run-time based on the graph profile and thebottleneck bandwidth. To estimate the effect of the network,we simply compute the latency for each data transfer (edge) inthe graph profile according to the bottleneck bandwidth. Wefinally select the fastest of all these parallelization strategies.

Two points are worth noting about this procedure. First, one of the strategies it considers as part of this process is basic DP (no MP). However, as we show, in many cases DP is not the best strategy for large-scale training (§5.3). Second, the runtime computed for a configuration in this procedure is only an estimate. In our actual simulations, a GPU's bandwidth can vary over time (e.g., due to wavelength reconfiguration). However, the full simulator is too slow to use as part of the search procedure, and we have found that the configurations found by the simple method align well with simulations.

Figure 6: TeraRack's performance at different scales compared to its equivalent-cost counterparts (total training time in hours vs. number of GPUs). (a) ResNet152; (b) Transformer; (c) VGG16. Schemes: ElectNet, OcsNet, TeraRack, Optimal.

Compared schemes. We consider five classes of architectures:

(1) Optimal: a fully-connected interconnect with 256 GPUs connected in a mesh topology. The bandwidth of each node is 255×2.4 Tbps and we assume zero propagation latency.

(2) ElectNet: an electrical switch fabric with 256 nodes. We vary the per-port bandwidth between 200 Gbps and 2.4 Tbps to compare the cost and efficiency of different design points. For each bandwidth value, we design a topology with commodity 64×200 Gbps switches as described in §4.

(3) Megaswitch: an optical ring interconnect as described in §4.

(4) OcsNet: an optical circuit switch interconnect with 256-port switches as described in §4. We connect each GPU to all the optical switches, similar to state-of-the-art optical interconnect proposals [61, 62, 103, 104]. We vary the number of OCS switches in the interconnect between 4 and 32, resulting in bandwidth per GPU between 800 Gbps and 6.4 Tbps. Since the OCS reconfiguration delay of 30 ms is too long compared to the typical training iteration time of our DNN models (<20 ms), we compute the best single-shot circuit schedule for each workload to achieve the best possible performance with the OCS-based interconnects. We find the optimal schedule by solving an Integer Linear Program, assuming complete knowledge of the traffic matrix.

(5) TeraRack: our design with 256 GPU nodes and 2.4 Tbps bandwidth per GPU. We assume a reconfiguration delay of 25 µs (as measured in §6). We use FlowRing as our scheduling algorithm with a rescheduling frequency of once every 100 µs, unless otherwise stated.

5.2 Overall System Performance

Comparison between cost-equivalent architectures. We first compare the total training time of TeraRack with cost-equivalent optical and electrical architectures for different DNN training workloads. Using the component costs in Table 1, we construct each interconnect such that its total cost matches that of TeraRack. For the ElectNet interconnect, this translates into an electrical fat-tree architecture with 256 nodes and total bandwidth of 51.2 Tbps. For the OcsNet interconnect, this translates into an OCS-based interconnect with total bandwidth of 204.8 Tbps. Fig. 6 shows that scaling the number of GPUs reduces the total training time. TeraRack is able to speed up the training time by factors of 2.12–4.09× and 1.65–3× compared to ElectNet and OcsNet, respectively, at 256-GPU scale. In particular, the gains with the Transformer model are more pronounced: TeraRack is within 2× of the Optimal architecture at all scales, while ElectNet and OcsNet are up to 8.7× and 3.5× worse than Optimal. This is because the DNN architecture of the Transformer model is highly amenable to model parallelism and requires very high per-node bandwidth. We discuss this in more detail later in this section.

Figure 7: Comparison with electrical and optical architectures at various cost points; the model is Transformer. (a) Electrical fabrics (ElectNet $1, $3, $6, $12; TeraRack $1); (b) Optical fabrics (OcsNet $1, $2, $4, $8; Megaswitch $1000; TeraRack $1).

How much more would it cost to match TeraRack's performance? We next compare the total training time of the Transformer model with electrical and optical architectures at various price points to find the extra investment needed to match the performance of TeraRack. We use the Transformer model because Fig. 6 shows it is a more challenging DNN for all the architectures and it has the largest model size (2.1 GB). Fig. 7a shows the total training time of each solution. The $ figure beside each approach indicates its cost ratio compared to TeraRack. For instance, ElectNet $3 is an ElectNet architecture that costs 3 times more than TeraRack. We observe that TeraRack's performance falls between that of ElectNet architectures with 6–12 times higher cost. In other words, to match the performance of TeraRack, operators need to invest in electrical fabrics that are 6–12 times more expensive. Similarly, Fig. 7b shows that the performance of TeraRack matches that of an OcsNet interconnect that is 4 times more expensive.


[Figure 8 omitted: total training time (hours) vs. number of GPUs for (a) ResNet152, (b) Transformer, and (c) VGG16, comparing Optimal–DP, TeraRack–DP, and TeraRack–Hybrid.]

Figure 8: Data Parallelism with TeraRack.

[Figure 9 omitted: optimal hybrid degree (DP degree and MP degree) vs. bandwidth per node (Gbps) for (a) ResNet152, (b) Transformer, and (c) VGG16.]

Figure 9: Optimal balance between the degree of MP and DP at different per-node bandwidths for 256 GPUs.

5.3 Impact of Parallelization Strategy

Today, most ML workloads rely on data-parallel training, partly due to the lack of a platform that can support the high bandwidth requirement of model-parallel or hybrid data-model-parallel workloads. Fig. 8 shows that when we force the parallelization strategy to be DP only, TeraRack matches the performance of an optimal network (the Optimal–DP and TeraRack–DP lines are nearly identical in all three workloads). This is to be expected, since TeraRack is perfectly suited to the communication pattern of DP with ring-allreduce [74, 75]. Fig. 8 also shows that TeraRack's best hybrid DP-MP strategy takes advantage of the high available bandwidth and outperforms a DP-only approach by up to 2× at the 256-node scale.
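As a back-of-the-envelope illustration of why ring-allreduce maps well onto nearest-neighbor links, the snippet below estimates the per-node traffic and communication time of one allreduce. The gradient size (2.1 GB) and per-node bandwidth (2.4 Tbps) are taken from the text, but treating the full per-node bandwidth as usable by the allreduce is an idealization, not a measurement from the paper.

```python
# Minimal sketch: ring-allreduce sends 2*(N-1)/N * model_size bytes per node,
# all of it over nearest-neighbor links on the ring.
def ring_allreduce_time(model_bytes, num_nodes, link_gbps):
    per_node_bytes = 2 * (num_nodes - 1) / num_nodes * model_bytes
    return per_node_bytes * 8 / (link_gbps * 1e9)  # seconds

# Example: 2.1 GB of gradients, 256 nodes, 2.4 Tbps per node (idealized).
print(f"{ring_allreduce_time(2.1e9, 256, 2400) * 1e3:.1f} ms per allreduce")
```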

Our results confirm the intuition behind today's practice: when the network bandwidth is low, it is more efficient to use a DP strategy, but when the network has high bandwidth available, combining MP and DP yields substantially better training time. Fig. 9 plots the optimal balance between DP and MP suggested by our parallelization strategy search algorithm. Here we keep the number of GPUs fixed at 256 and sweep the per-node bandwidth from 75 Gbps to 2.4 Tbps. The figure shows that as the per-node bandwidth increases on the x-axis, the optimal strategy uses more model parallelism to decrease the total training time. For instance, the Transformer workload shown in Fig. 9b starts with 32-way DP and 8-way MP, but at 2.4 Tbps bandwidth per node, our algorithm finds the best training time with 2-way DP and 128-way MP.
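The search over DP/MP splits can be sketched as an enumeration of factorizations of the GPU count under a coarse cost model. The compute, traffic, and large-batch-penalty terms below are hypothetical placeholders chosen only to expose the trade-off; this is not the paper's search algorithm or its profiled numbers.

```python
# Toy sketch of a DP/MP split search (not the paper's algorithm). All constants are
# made up; the point is the structure of the trade-off: model parallelism adds
# activation traffic but shrinks the per-replica gradient allreduce and avoids the
# statistical penalty of very large data-parallel batches.
def best_split(num_gpus, step_compute_s, model_bytes, act_bytes, link_gbps,
               dp_penalty=0.02):
    bw = link_gbps * 1e9 / 8                      # bytes/sec per node
    best = None
    for dp in (d for d in range(1, num_gpus + 1) if num_gpus % d == 0):
        mp = num_gpus // dp
        t_compute = step_compute_s / num_gpus     # both DP and MP parallelize compute
        t_grad = 2 * (dp - 1) / dp * (model_bytes / mp) / bw   # allreduce per model shard
        t_act = 2 * (mp - 1) * act_bytes / bw     # activation exchange across shards
        steps = 1 + dp_penalty * dp               # crude large-batch inefficiency factor
        total = steps * (t_compute + t_grad + t_act)
        if best is None or total < best[0]:
            best = (total, dp, mp)
    return best

# With these made-up constants, the chosen split shifts toward more MP as the
# per-node bandwidth grows.
for gbps in (100, 1000, 2400):
    _, dp, mp = best_split(256, 2.56, 2.1e9, 5e7, gbps)
    print(f"{gbps:>5} Gbps/node -> {dp}-way DP x {mp}-way MP")
```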

5.4 TeraRack's Design Decisions

Wavelength reuse. One of the key properties of TeraRack's flow scheduling algorithm is its ability to reuse wavelengths across GPU pairs that do not share the same fiber segment. To evaluate the impact of wavelength reuse, Fig. 10a compares the training time with and without this feature as the number of GPUs is scaled for the Transformer model.

[Figure 10 omitted: (a) total training time (hours) vs. number of GPUs with and without wavelength reuse; (b) training iteration time (sec) vs. circuit hold-time (µs) for Transformer and VGG16.]

Figure 10: Performance of TeraRack's circuit scheduling algorithm.

[Figure 11 omitted: PDF of flow path length (segments) for ResNet152, Transformer, and VGG16.]

Figure 11: Flow path length PDF distribution.

The figure shows that wavelength reuse dramatically improves the performance of TeraRack (a 55× speedup when the number of GPUs is 256). Indeed, without wavelength reuse, performance degrades at larger scale because the total bandwidth across the ring is fixed (96 wavelengths) but the amount of communication increases with scale.

Traffic locality. The main enabler of wavelength reuse is the locality in GPU-to-GPU communication achieved by TeraRack's device placement algorithm. In Fig. 11, we plot the Probability Density Function (PDF) of the number of segments that a GPU-to-GPU flow traverses on the ring. We see that more than 70% of the flows are between GPU pairs that are less than four nodes apart. The maximum distance traversed by a flow in all of our workloads was 16 hops. Another implication of Fig. 11 is that the majority of wavelengths in TeraRack pass through fewer than 3–4 amplifiers on the ring, thereby limiting the impact of amplifiers on the signal-to-noise ratio.

Impact of wavelength rescheduling frequency. Fig. 10b shows the impact of the wavelength rescheduling frequency on TeraRack's performance. We observe that holding scheduled circuits for more than 100 µs can degrade performance, and the training time gets substantially worse as the rescheduling interval reaches a millisecond.

FlowRing algorithm. TeraRack's performance with the FlowRing algorithm is within 5% of its performance using the ILP solution for all three models at 256-GPU scale.
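To illustrate what wavelength reuse over ring segments means in practice, the sketch below greedily assigns wavelengths to flows so that two flows share a wavelength only if their paths have no ring segment in common. This is an illustrative greedy assignment, not the FlowRing algorithm itself, and the flows are hypothetical.

```python
# Illustrative greedy wavelength assignment on a ring (not FlowRing itself):
# two flows may reuse the same wavelength only if their segment sets are disjoint.
def segments(src, dst, n):
    """Ring segments crossed when travelling clockwise from src to dst."""
    hops = (dst - src) % n
    return {(src + h) % n for h in range(hops)}

def assign_wavelengths(flows, n, num_wavelengths):
    used = [set() for _ in range(num_wavelengths)]   # segments occupied per wavelength
    alloc = {}
    for src, dst in flows:
        segs = segments(src, dst, n)
        for w in range(num_wavelengths):
            if used[w].isdisjoint(segs):
                used[w] |= segs
                alloc[(src, dst)] = w
                break
        else:
            alloc[(src, dst)] = None                 # blocked: no free wavelength
    return alloc

# Hypothetical local flows on a 16-node ring with 2 wavelengths:
flows = [(0, 2), (4, 6), (8, 11), (1, 4), (12, 15)]
print(assign_wavelengths(flows, n=16, num_wavelengths=2))
```

Because the flows are local, most of them fit on the same wavelength, which is exactly the effect that lets TeraRack's fixed pool of 96 wavelengths keep up as the ring grows.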

6 TERARACK PROTOTYPE

6.1 Benchmarks

We evaluate TeraRack using large-scale simulations and a small-scale prototype. The two experimental frameworks permit us to study different aspects of TeraRack. The prototype allows us to benchmark the switching time and throughput of the SiP-based architecture.


[Figure 12 omitted: annotated photo showing the Stratix V FPGA board (Rx/Tx buffers, classifier, compute core and cache, request/response paths to local and remote data), the 10 mm × 10 mm SiP chip with its MRRs and fiber I/O, and the bias control board.]

Figure 12: Photo of our prototype. We use Stratix V FPGA boards to emulate GPUs. Our SiP chip (shown at the top right corner of the photo) has six micro-ring resonators (MRRs).


6.2 Prototype Architecture

Fig. 12 shows a photograph of our experimental testbed. We built a three-node prototype of TeraRack using FPGA development boards (to emulate GPUs) and a thermo-optic SiP chip with six micro-ring resonators (MRRs). Each MRR is tuned to select one wavelength by receiving the appropriate bias signal from the bias control board. We use Stratix V FPGAs to emulate the GPU training workflow, as no commercial GPU chip supports optical interfaces. Our FPGAs have 50 Mb of embedded memory and 1152 MB of DDR3 memory. The FPGAs are programmed and configured as individual compute nodes with their own local memory. The controller logic is implemented using one of the FPGAs. A digital-to-analog converter (DAC) provides the necessary bias signals to the SiP chip to cause a state change in the MRRs, depending on the scheduling decision. We use commodity SFP+ transceivers connected to the high-speed serial transceiver port on the FPGA board to convert between the electrical and optical domains. Our three input wavelengths are λ1 = 1546.92 nm, λ2 = 1554.94 nm, and λ3 = 1556.55 nm. Our SiP optical chip consists of six MRRs (we use three of them, as shown in Fig. 13) to select and forward any of the wavelengths to the target emulated GPUs. To evaluate our prototype, we implement 2D convolutional computation workloads in Verilog to perform data fetching, computing, and storing between emulated GPU nodes. A GPU node can access another GPU node's memory and perform read/write operations, similar to how real GPUs communicate today.

Fig. 13 illustrates the logical details of our testbed. The physical connections between FPGAs are managed using the control logic of the FPGA to configure the optical SiP chip. Our controller provides the necessary bias signals to the SiP chip to cause a state change in the MRRs, depending on the scheduling decision. Each GPU transmits at a uniquely assigned wavelength, and the wavelengths are multiplexed together, forming the input to the bus waveguide of the SiP chip.

[Figure 13 omitted: three GPUs (GPU1–GPU3), each transmitting at λ1, λ2, or λ3; the wavelengths are multiplexed onto the fiber ring and pass through MRR1–MRR3 on the SiP switch, while the controller performs traffic matrix prediction, wavelength allocation, and micro-ring bias voltage control.]

Figure 13: Logical schematic of our prototype.

Each MRR can be tuned to select and forward any of the input wavelengths. Depending on which wavelengths are selected at each MRR, the SiP device reconfigures the connectivity between chips.

Example: programming the MRRs. Assume that in the first configuration, GPU1 is connected to GPU2; this means MRR1 is tuned to select and forward λ2 to GPU1, while MRR2 is tuned to select and forward λ1 to GPU2. For simplicity of the configuration logic, MRR3 is always tuned to λ1 but is effectively in idle mode, as the optical power of λ1 has been dropped through MRR2. To change the state to Configuration 2, where GPU1 is connected to GPU3, MRR1 should be tuned to select and forward λ3, while MRR2 should be detuned from λ1 so that the optical power of λ1 passes through MRR3 to GPU3. Note that in this configuration, MRR3 can remain tuned to λ1.

Testbed limitations. Our use of commodity FPGAs and transceivers is driven by pragmatic concerns. It allows us to implement workloads without needing separate modulation logic at the transmitter or demodulation logic at the receiver. Packets are forwarded to the SFP+ transceiver, which modulates the light for us. However, this method has limitations as well. Implementing convolutional neural networks in an FPGA, rather than a GPU as would be the case in the actual system, introduces complex Verilog logic with overhead for (de)serializing the remote memory access commands.
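Returning to the MRR-programming example above, the configuration logic can be captured in a few lines: given a desired receiver for each wavelength, the controller decides which wavelength each MRR should drop (or whether to detune it entirely). The sketch below is a hypothetical illustration of that mapping for the three-node testbed, not the actual Verilog/DAC control code, and it models an idle MRR simply as "detuned".

```python
# Hypothetical controller sketch for the 3-node testbed: decide each MRR's tuning
# from the desired GPU-to-GPU connections. MRR_i sits in front of GPU_i on the bus,
# so tuning MRR_i to wavelength w drops w (the sender transmitting on w) to GPU_i.
TX_WAVELENGTH = {"GPU1": "l1", "GPU2": "l2", "GPU3": "l3"}

def mrr_settings(connections):
    """connections: dict mapping receiver GPU -> sender GPU, e.g. {"GPU1": "GPU2"}."""
    settings = {"MRR1": None, "MRR2": None, "MRR3": None}   # None = detuned
    for rx, tx in connections.items():
        mrr = "MRR" + rx[-1]
        settings[mrr] = TX_WAVELENGTH[tx]                   # drop the sender's wavelength at rx
    return settings

# Configuration 1: GPU1 <- GPU2 and GPU2 <- GPU1 (as in the example above).
print(mrr_settings({"GPU1": "GPU2", "GPU2": "GPU1"}))
# Configuration 2: GPU1 <- GPU3; MRR2 is left detuned so l1 can continue on to MRR3.
print(mrr_settings({"GPU1": "GPU3", "GPU3": "GPU1"}))
```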

To validate the feasibility of our optical design, we answer the following four key questions. (i) What is the impact of using MRRs to select/bypass wavelengths on throughput? (ii) How fast can we reconfigure the MRRs to dynamically tune to the appropriate wavelengths? (iii) What is the end-to-end switching time? (iv) What is the impact of our scheduling algorithm on throughput?

MRRs as select/bypass interfaces. We first examine the select/bypass functions of our MRR-based interfaces. A transceiver channel is instantiated on the FPGA, and an SFP+ optical transceiver at 1546.92 nm is used to perform the throughput measurements for the select, bypass, and loopback cases. As shown in Fig. 14a, the throughput measurement of the select mode (the MRR tuned to 1546.92 nm) is the curve in black, while the result for bypassing the MRR is in blue. The red curve is the baseline measurement, where the optical transmitter is connected directly to the receiver channel without coupling the optical signal in or out of the SiP chip. Our measurements show that in all three cases the throughput is 9 to 9.3 Gbps, confirming the feasibility of using MRRs as select/bypass interfaces.


[Figure 14 omitted: (a) CDF of throughput (Gbps) for loopback, bypass, and select; (b) receive signal (Volts) over time (µs) at the photodetectors after MRR1 and MRR2, showing 20 µs and 8.4 µs transitions; (c) histogram of end-to-end reconfiguration time (µs) for Config. 1 and Config. 2; (d) normalized throughput vs. slot length (µs) for GPU3 → GPU2 and GPU2 → GPU3.]

Figure 14: (a) MRR's impact on throughput. (b) MRR's reconfiguration time. (c) End-to-end reconfiguration time between the FPGA boards. The reconfiguration time includes clock and data recovery, as well as optical switching time. (d) Achieved throughput between the FPGA boards while changing the reconfiguration slot length on the x-axis.

MRR reconfiguration time. To measure the reconfiguration time of our MRRs, we place InGaAs PIN photodetectors after MRR1 and MRR2 in Fig. 13 and change the bias voltage from Config 1 to Config 2, where MRR1 and MRR2 are tuned into and out of resonance with λ1. We switch light between the two photodetectors by applying different bias signals to the SiP chip every 125 µs. The photodetectors convert the received photocurrent into voltage. We use an oscilloscope to measure the real-time light intensity and can therefore measure the reconfiguration speed. Fig. 14b shows the received signal at the photodetectors. In one case, the signal reaches a stable state in approximately 20 µs, and in the other case, it takes only 8.4 µs. This is because tuning the MRR into the chosen wavelength is faster than tuning it out of that wavelength, due to our use of the thermal tuning effect. We conservatively consider 25 µs as the switching time in our simulations. This experiment micro-benchmarks the micro-ring reconfiguration time; additional time might be required for the transceivers to start decoding bits. This additional time is not fundamental, and next we show how we measured the end-to-end reconfiguration time between FPGAs.

End-to-end reconfiguration time. The end-to-end reconfiguration time includes the MRRs' reconfiguration time, the transceivers' locking time, and the handshaking time between newly connected nodes. The distribution of the end-to-end switching time between Config 1 and Config 2 is shown in Fig. 14c. We perform 300 measurements to obtain the distribution, showing that the average switching time to Config 1 is 13 µs and to Config 2 is 15 µs. Indeed, it is reasonable that the fastest end-to-end reconfiguration time may be less than the micro-ring reconfiguration time, as the receiver at the FPGA receives enough optical power to start the synchronization process before the light output power stabilizes. As described above, the micro-ring reconfiguration times for tuning and detuning are not equal, leading to two distinct distributions. The additional variation in the distribution of the reconfiguration time is a consequence of the time required for the transceiver to lock onto the new signal and carry out the handshaking protocol.

Putting it all together. We also measure the achieved throughput while changing the scheduling slot length between the two configurations. We conduct five case studies with slot lengths of 64, 128, 256, 512, and 1000 µs and measure the throughput normalized to the ideal.

Component            Count   Loss per component (dB)   Total loss (dB)
MRR                  24      0.025–0.05                0.6–1.2
Coupler              2       2                         4
Waveguide on chip    1       1–2                       1–2
Fiber connector      1       1                         1
End-to-end                                             6.6–8.2

Table 2: Optical loss measured at each SiP interface.

The curve in blue in Fig. 14d indicates the switching state from GPU3 to GPU2, lasting the duration set by the experiment; the curve in red indicates the switching from GPU2 to GPU3. As the plot shows, the link achieves above 90% of the ideal throughput when the scheduling slot length is 220 µs. This is because the end-to-end reconfiguration takes only about 20 µs; hence, a scheduling slot roughly 10 times larger results in near-optimal throughput.

Power budget. Optical power loss is a key measure for any optical system. Table 2 shows the power loss measured at each component of a SiP interface used in TeraRack nodes. The loss per MRR is negligible (0.025–0.05 dB per MRR), but coupling the light in and out of each node creates 4 dB of loss, because each SiP interface has an input and an output coupler with 2 dB loss each. In addition to coupling loss, waveguide loss on the chip can attenuate the light by 1–2 dB, depending on the length and bends. The fiber connector fixtures outside each node add another 1 dB from node to node. Overall, the total loss incurred by passing through each SiP interface is 6.6–8.2 dB. Hence, assuming a 20 dB power budget based on transmit power and receiver sensitivity [141], TeraRack requires amplification at every other node. At first blush, it appears infeasible to scale TeraRack to 256 nodes on a ring, since amplifiers add non-linear noise and hence a light signal cannot traverse more than 30 amplifiers, limiting the scale of TeraRack to 60 nodes. However, recall from Fig. 11 that the path length in TeraRack is limited to 16 nodes: although the size of the ring is 256 nodes, our scheduling algorithm places communicating GPUs close to each other so that every GPU only interacts with GPUs that are at most 15 nodes away. As a result, TeraRack's design can scale to 256 nodes, since its device placement algorithm (§3) partitions ML models across nodes to ensure a sparse and local communication pattern with path lengths shorter than 15 hops.
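A quick sanity check of the power-budget argument: using the per-interface loss from Table 2 and the amplifier-placement rule stated above (one amplifier every other node), the snippet below counts how many amplifiers a flow crosses as a function of its path length. The rule and the ~30-amplifier limit are taken from the text; the rest is simple arithmetic.

```python
# Back-of-the-envelope power-budget check: loss per SiP interface is 6.6-8.2 dB
# (Table 2), the budget is 20 dB, and the text places an amplifier every other node.
def amplifiers_on_path(hops, amp_every=2):
    return hops // amp_every

for hops in (4, 8, 16, 255):
    amps = amplifiers_on_path(hops)
    ok = amps <= 30                      # text: signal degrades after ~30 amplifiers
    print(f"{hops:>3}-hop path -> {amps:>3} amplifiers, within budget: {ok}")
```

A flow spanning the full 256-node ring would cross far more than 30 amplifiers, whereas the 16-hop paths produced by the device placement algorithm cross only about 8.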


7 CONCLUSION

This paper investigated a silicon photonics-based approach to building a 256×2.4 Tbps interconnect for a rack of 256 GPUs tailored for parallel machine learning training. We use a small prototype to evaluate the design choices of TeraRack. Using large-scale simulations, we showed how our design leverages the microsecond reconfiguration time of MRRs, together with effective dynamic reuse of wavelengths in the interconnect, to achieve high performance on three typical DNN training workloads. TeraRack's performance is equivalent to that of a 256×1.2 Tbps electrical fabric while costing at least 6× less. Compared to optical interconnects, TeraRack costs 4× less.

This work does not raise any ethical issues.

REFERENCES

[1] Nvidia DGX-2. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-2/dgx-2-print-datasheet-738070-nvidia-a4-web-uk.pdf.

[2] Alexander Ishii, Denis Foley, Eric Anderson, Bill Dally, Glenn Dearth, Larry Dennison, Mark Hummel, and John Schafer. NVIDIA's NVLink-Switching Chip and Scale-Up GPU-Compute Server. HotChips, 2018. https://www.hotchips.org/hc30/2conf/2.01_Nvidia_NVswitch_HotChips2018_DGX2NVS_Final.pdf.

[3] Intel Acquires Artificial Intelligence Chipmaker Habana Labs, Dec. 2019. https://newsroom.intel.com/news-releases/intel-ai-acquisition/#gs.sgcow8.

[4] Gaudi Training Platform White Paper, June 2019. https://habana.ai/wp-content/uploads/2019/06/Habana-Gaudi-Training-Platform-whitepaper.pdf.

[5] S S Vazhkudai, B R de Supinski, A S Bland, A Geist, J Sexton, J Kahle,C J Zimmer, S Atchley, S H Oral, D E Maxwell, V G Vergara Larrea,A Bertsch, R Goldstone, W Joubert, C Chambreau, D Appelhans,R Blackmore, B Casses, G Chochia, G Davison, M A Ezell, E Gon-siorowski, L Grinberg, B Hanson, B Hartner, I Karlin, M L Leininger,D Leverman, C Marroquin, A Moody, M Ohmacht, R Pankajakshan,F Pizzano, J H Rogers, B Rosenburg, D Schmidt, M Shankar, F Wang,P Watson, B Walkup, L D Weems, and J Yin. The design, deployment,and evaluation of the coral pre-exascale systems. 7 2018.

[6] Thorsten Kurth, Sean Treichler, Joshua Romero, Mayur Mudigonda,Nathan Luehr, Everett H. Phillips, Ankur Mahesh, Michael Matheson,Jack Deslippe, Massimiliano Fatica, Prabhat, and Michael Houston.Exascale deep learning for climate analytics. CoRR, abs/1810.01993,2018.

[7] Nouamane Laanait, Joshua Romero, Junqi Yin, M. Todd Young, SeanTreichler, Vitalii Starchenko, Albina Borisevich, Alex Sergeev, andMichael Matheson. Exascale deep learning for scientific inverseproblems. arXiv, 2019.

[8] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong,Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu,Tiegang Chen, Guangxiao Hu, Shaohuai Shi, and Xiaowen Chu.Highly scalable deep learning training system with mixed-precision:Training imagenet in four minutes. CoRR, abs/1807.11205, 2018.

[9] Shar Narasimhan. NVIDIA Clocks World’s Fastest BERTTraining Time and Largest Transformer Based Model,Paving Path For Advanced Conversational AI, Aug. 2019.https://devblogs.nvidia.com/training-bert-with-gpus/.

[10] Forrest N. Iandola, Khalid Ashraf, Matthew W. Moskewicz, and KurtKeutzer. Firecaffe: near-linear acceleration of deep neural networktraining on compute clusters. CoRR, abs/1511.00175, 2015.

[11] Minsik Cho, Ulrich Finkler, David Kung, and Hillery Hunter. Blue-connect: Decomposing all-reduce for deep learning on heterogeneousnetwork hierarchy. SysML Conference, 2019.

[12] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, LukaszWesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, andKaiming He. Accurate, large minibatch SGD: training imagenet in1 hour. CoRR, abs/1706.02677, 2017.

[13] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri,Nikhil R. Devanur, Gregory R. Ganger, and Phillip B. Gibbons.Pipedream: Fast and efficient pipeline parallel DNN training. CoRR,abs/1806.03377, 2018.

[14] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee,Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. Gpipe: Efficienttraining of giant neural networks using pipeline parallelism. CoRR,abs/1811.06965, 2018.

[15] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield,T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, M. Abeydeera,L. Adams, H. Angepat, C. Boehn, D. Chiou, O. Firestein, A. Forin,K. S. Gatlin, M. Ghandi, S. Heil, K. Holohan, A. El Husseini, T. Juhasz,K. Kagi, R. Kovvuri, S. Lanka, F. van Megen, D. Mukhortov, P. Patel,B. Perez, A. Rapsang, S. Reinhardt, B. Rouhani, A. Sapek, R. Seera,S. Shekar, B. Sridharan, G. Weisz, L. Woods, P. Yi Xiao, D. Zhang,R. Zhao, and D. Burger. Serving dnns in real time at datacenter scalewith project brainwave. IEEE Micro, 38(2):8–20, Mar 2018.

[16] Alexander Sergeev and Mike Del Balso. Horovod: fast and easydistributed deep learning in tensorflow. CoRR, abs/1802.05799, 2018.

[17] Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael,Adrian Caulfield, Todd Massengil, Ming Liu, Daniel Lo, ShlomiAlkalay, and Michael Haselman. Accelerating persistent neuralnetworks at datacenter scale. In Hot Chips, volume 29, 2017.

[18] Meet Horovod: Uber’s Open Source Distributed Deep LearningFramework for TensorFlow, 2017. https://eng.uber.com/horovod.

[19] Yang You, Igor Gitman, and Boris Ginsburg. Scaling sgd batch sizeto 32k for imagenet training. CoRR, 08 2017.

[20] Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, JaschaSohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring theeffects of data parallelism on neural network training. Journal ofMachine Learning Research, 20(112):1–49, 2019.

[21] Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and modelparallelism for deep neural networks. SysML, 2019.

[22] Dell Power Edge System. https://www.dell.com/en-us/work/shop/dell-nvidia-nvlink-for-quadro-rtx6000-and-rtx8000/apd/490-bfhk/graphic-video-cards#polaris-pd.

[23] IBM Power Systems Facts and Features. https://www.ibm.com/downloads/cas/EPNDE9D0.

[24] Nvidia DGX-1 With Tesla V100 System Architec-ture. http://www.dhtech.com/wp-content/uploads/2018/06/dgx1-v100-system-architecture-whitepaper.pdf.

[25] Hitesh Ballani, Paolo Costa, Istvan Haller, Krzysztof Jozwik, Kai Shi,Benn Thomsen, and Hugh Williams. Bridging the last mile for opticalswitching in data centers. In Optical Fiber Communication Conference(OFC’18). OSA, January 2018.

[26] K. Clark, H. Ballani, P. Bayvel, D. Cletheroe, T. Gerard, I. Haller,K. Jozwik, K. Shi, B. Thomsen, P. Watts, H. Williams, G. Zervas,P. Costa, and Z. Liu. Sub-nanosecond clock and data recovery in anoptically-switched data centre network. In 2018 European Conferenceon Optical Communication (ECOC), pages 1–3, Sep. 2018.

[27] High Performance Analytics and Computing Platform.https://hbp-hpc-platform.fz-juelich.de/.

[28] Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee,unjie Qian, Wencong Xiao, and Fan Yang. Analysis of large-scalemulti-tenant gpu clusters for dnn training workloads. In Proceedings ofthe 2019 USENIX Conference on Usenix Annual Technical Conference,USENIX ATC ’19, page 947–960, USA, 2019. USENIX Association.

[29] Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, Wei Zhao, Xunqiang Yin, Chaofeng Hou, Chenglong Zhang, Wei Ge, Jian Zhang, Yangang Wang, Chunbo Zhou, and Guangwen Yang. The Sunway TaihuLight supercomputer: system and applications. Science China Information Sciences, 59(7):072001, Jun 2016.

[30] Paul Teich. Tearing Apart Google's TPU 3.0 AI Accelerator, May 2018. https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/.

[31] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, IvoDanihelka, Agnieszka Grabska-Barwinska, Sergio Gómez Col-menarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou,AdriàPuigdomènech Badia, Karl Moritz Hermann, Yori Zwols,Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield,Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybridcomputing using a neural network with dynamic external memory.Nature, 538(7626):471–476, 2016.

[32] Mark Wade, Erik Anderson, Shahab Ardalan, Pavan Bhargava,Sidney Buchbinder, Michael Davenport, John Fini, Anatoly Khilo,Chandru Ramamurthy Roy Meade, Michael Rust, Vladimir StojanovicForrest Sedgwick, Derek Van Orden, Chong Zhang Edward Wang,Chen Sun, Sergey Shumarayev, Conor O’Keeffe, Tim T. Hoang, DavidKehlet, Ravi V. Mahajan, Allen Chan, and Tina Tran. TeraPHY: AChiplet Technology for Low-Power, High-Bandwidth Optical I/O.HotChips, pages i–xlviii, August 2019. https://www.hotchips.org/hc31/HC31_2.9_AyarLabs_20190820_HC_FINAL.pdf.

[33] Ayar Labs and Intel add optical input-output to an FPGA,Sept. 2019. http://www.gazettabyte.com/home/2019/9/11/ayar-labs-and-intel-add-optical-input-output-to-an-fpga.html.

[34] M. Wade, M. Davenport, M. De Cea Falco, P. Bhargava, J. Fini, D. VanOrden, R. Meade, E. Yeung, R. Ram, M. Popovic, V. Stojanovic, andC. Sun. A bandwidth-dense, low power electronic-photonic platformand architecture for multi-tbps optical i/o. pages 1–3, Sep. 2018.

[35] Amir H. Atabaki, Sajjad Moazeni, Fabio Pavanello, Hayk Gevorgyan,Jelena Notaros, Luca Alloatti, Mark T. Wade, Chen Sun, Seth A.Kruger, Huaiyu Meng, Kenaish Al Qubaisi, Imbert Wang, BohanZhang, Anatol Khilo, Christopher V. Baiocco, Milovs A. Popovic,Vladimir M. Stojanovic, and Rajeev J. Ram. Integrating photonicswith silicon nanoelectronics for the next generation of systems on achip. Nature, 556(7701):349–354, 2018.

[36] Chen Sun, Mark T. Wade, Yunsup Lee, Jason S. Orcutt, Luca Alloatti,Michael S. Georgas, Andrew S. Waterman, Jeffrey M. Shainline,Rimas R. Avizienis, Sen Lin, Benjamin R. Moss, Rajesh Kumar, FabioPavanello, Amir H. Atabaki, Henry M. Cook, Albert J. Ou, Jonathan C.Leu, Yu-Hsin Chen, Krste Asanovic, Rajeev J. Ram, MilošA. Popovic,and Vladimir M. Stojanovic. Single-chip microprocessor thatcommunicates directly using light. Nature, 528(7583):534–538, 2015.

[37] Bhavin Shastri, Lukas Chrostowski, and Sudip Shekhar. Siliconphotonics for neuromorphic computing and machine learning.https://siepic.ubc.ca/content/applications-silicon-photonics.

[38] Marc A. Taubenblatt. Optical interconnects for large scale computingsystems: Trends and challenges. In Advanced Photonics 2018 (BGPP,IPR, NP, NOMA, Sensors, Networks, SPPCom, SOF), page NeTh3F.2.Optical Society of America, 2018.

[39] Michael Y.-S. Fang, Sasikanth Manipatruni, Casimir Wierzynski,Amir Khosrowshahi, and Michael R. DeWeese. Design of opticalneural networks with component imprecisions. Opt. Express,27(10):14009–14029, May 2019.

[40] Aaron Tan. The perfect storm for silicon photonics. sept2019. https://www.computerweekly.com/news/252471095/The-perfect-storm-for-silicon-photonics.

[41] Isabel de Sousa and Louis-Marie Achard. The future of packagingwith silicon photonics. 2016. https://www.ibm.com/downloads/cas/M0NL8N85.

[42] Tom Hausken. Breaking the cost barrier. The Optical Society, 2015.https://www.osa-opn.org/home/articles/volume_26/july_august_2015/departments/breaking_the_cost_barrier/.

[43] Valentina Donzella, Ahmed Sherwali, Jonas Flueckiger, Samantha M.Grist, Sahba Talebi Fard, and Lukas Chrostowski. Design andfabrication of soi micro-ring resonators based on sub-wavelengthgrating waveguides. Opt. Express, 23(4):4791–4803, Feb 2015.

[44] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De Vos,S. Kumar Selvaraja, T. Claes, P. Dumon, P. Bienstman,D. Van Thourhout, and R. Baets. Silicon microring res-onators. Laser & Photonics Reviews, 6(1):47–73, 2012.https://onlinelibrary.wiley.com/doi/abs/10.1002/lpor.201100017.

[45] Q. Cheng, M. Bahadori, Y. Hung, Y. Huang, N. Abrams, andK. Bergman. Scalable microring-based silicon clos switch fabric withswitch-and-select stages. IEEE Journal of Selected Topics in QuantumElectronics, 25(5):1–11, Sep. 2019.

[46] E. R. H. Fuchs, R. E. Kirchain, and S. Liu. The future of siliconphotonics: Not so fast? insights from 100g ethernet lan transceivers.Journal of Lightwave Technology, 29(15):2319–2326, Aug 2011.

[47] David Thomson, Aaron Zilkie, John E Bowers, Tin Komljenovic,Graham T Reed, Laurent Vivien, Delphine Marris-Morini, EricCassan, Leopold Virot, Jean-Marc Fedeli, Jean-Michel Hartmann,Jens H Schmid, Dan-Xia Xu, Frederic Boeuf, Peter O’Brien, Goran ZMashanovich, and M Nedeljkovic. Roadmap on silicon photonics.Journal of Optics, 18(7):073003, 2016.

[48] N. Ophir, C. Mineo, D. Mountain, and K. Bergman. Silicon photonicmicroring links for high-bandwidth-density, low-power chip i/o. IEEEMicro, 33(1):54–67, Jan 2013.

[49] G.T. Reed and A.P. Knights. Silicon Photonics: An Introduction. Wiley,2004.

[50] Qixiang Cheng, Meisam Bahadori, Madeleine Glick, SebastienRumley, and Keren Bergman. Recent advances in optical technologiesfor data centers: a review. Optica, 5(11):1354–1370, Nov 2018.

[51] Madeleine Glick, Lionel C. Kimmerling, and Robert C. Pfahl. Aroadmap for integrated photonics. Opt. Photon. News, 29(3):36–41,Mar 2018.

[52] D. Kim, K. Y. Au, H. Y. L. X. Luo, Y. L. Ye, S. Bhattacharya, and G. Q.Lo. 2.5d silicon optical interposer for 400 gbps electronic-photonicintegrated circuit platform packaging. In 2017 IEEE 19th ElectronicsPackaging Technology Conference (EPTC), pages 1–4, Dec 2017.

[53] Xin Tu, Chaolong Song, Tianye Huang, Zhenmin Chen, and HongyanFu. State of the art and perspectives on silicon photonic switches.Micromachines, 10(1):51, 01 2019.

[54] Dessislava Nikolova, David M. Calhoun, Yang Liu, Sebastien Rumley,Ari Novack, Tom Baehr-Jones, Michael Hochberg, and Keren Bergman.Modular architecture for fully non-blocking silicon photonic switchfabric. Microsystems and Nanoengineering, 3:16071, 01 2017.

[55] Qixiang Cheng, Richard Dai, Meisam Bahadori, Nathan C. Abrams,Padraic E Morrissey, Madeleine Glick, Peter O’Brien, and KerenBergman. Si/sin microring-based optical router in switch-and-selecttopology. 09 2018.

[56] A. Rohit, J. Bolk, X. J. M. Leijtens, and K. A. Williams. Monolithicnanosecond-reconfigurable 4 × 4 space and wavelength selectivecross-connect. Journal of Lightwave Technology, 30(17):2913–2921,Sep. 2012.

[57] Nicolas Sherwood-Droz, Howard Wang, Long Chen, Benjamin G. Lee,Aleksandr Biberman, Keren Bergman, and Michal Lipson. Optical4×4 hitless silicon router for optical networks-on-chip (noc). Opt.Express, 16(20):15915–15922, Sep 2008.

[58] Qixiang Cheng, Sebastien Rumley, Meisam Bahadori, and KerenBergman. Photonic switching in high performance datacenters. Opt.Express, 26(12):16022–16043, Jun 2018.

[59] Ayar Labs TeraPHY Silicon Chip. https://ayarlabs.com/product/.

[60] Li Chen, Kai Chen, Zhonghua Zhu, Minlan Yu, George Porter, Chunming Qiao, and Shan Zhong. Enabling wide-spread communications on optical fabric with megaswitch. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 577–593, Boston, MA, 2017. USENIX Association.

[61] Nathan Farrington, George Porter, Sivasankar Radhakrishnan,Hamid Hajabdolali Bazzaz, Vikram Subramanya, Yeshaiahu Fainman,George Papen, and Amin Vahdat. Helios: A hybrid electrical/opticalswitch architecture for modular data centers. SIGCOMM’10, pages339–350.


[62] Guohui Wang, David G. Andersen, Michael Kaminsky, KonstantinaPapagiannaki, T.S. Eugene Ng, Michael Kozuch, and Michael Ryan.c-Through: Part-time optics in data centers. SIGCOMM’10, pages327–338.

[63] Ankit Singla, Atul Singh, and Yan Chen. OSA: An optical switchingarchitecture for data center networks with unprecedented flexibility.In Presented as part of the 9th USENIX Symposium on NetworkedSystems Design and Implementation (NSDI 12), pages 239–252, SanJose, CA, 2012. USENIX.

[64] George Porter, Richard Strong, Nathan Farrington, Alex Forencich,Pang Chen-Sun, Tajana Rosing, Yeshaiahu Fainman, George Papen,and Amin Vahdat. Integrating microsecond circuit switching into thedata center. SIGCOMM’13, pages 447–458.

[65] He Liu, Feng Lu, Alex Forencich, Rishi Kapoor, Malveeka Tewari,Geoffrey M. Voelker, George Papen, Alex C. Snoeren, and GeorgePorter. Circuit switching under the radar with REACToR. NSDI’14,pages 1–15.

[66] G. Michelogiannakis, Y. Shen, X. Meng M. Y. Teh, B. Aivazi,T. Groves, J. Shalf, M. Glick, M. Ghobadi, L. Dennison, andK. Bergman. Bandwidth steering for hpc using silicon nanophotonics.ACM/IEEE Supercomputing Conference (SC), 10 2019.

[67] Edge 64 Optical Circuit Switch. https://www.calient.net/products/edge640-optical-circuit-switch/.

[68] Rajiv Ramaswami, Kumar Sivarajan, and Galen Sasaki. OpticalNetworks: A Practical Perspective, 3rd Edition. Morgan KaufmannPublishers Inc., San Francisco, CA, USA, 3rd edition, 2009.

[69] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for imagerecognition. In 2016 IEEE Conference on Computer Vision and PatternRecognition (CVPR), pages 770–778, June 2016.

[70] Siddharth Das. CNN Architectures, 2017.

[71] Jakob Uszkoreit. Transformer: A Novel Neural Network Architecture for Language Understanding, Aug. 2017. https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html.

[72] Tiffany Trader. Ayar Labs toDemo Photonics Chiplet in FPGAPackage at Hot Chips, Aug. 2019.https://www.hpcwire.com/2019/08/19/ayar-labs-to-demo-photonics-chiplet-in-fpga-package-at-hot-chips/.

[73] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola,Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, andBor-Yiing Su. Scaling distributed machine learning with the parameterserver. OSDI’14, pages 583–598. USENIX Association, 2014.

[74] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimizationof collective communication operations in mpich. Int. J. High Perform.Comput. Appl., 19(1):49–66, February 2005.

[75] Baidu, 2017. https://github.com/baidu-research/baidu-allreduce.

[76] J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, March 1986.

[77] AI and Compute. https://openai.com/blog/ai-and-compute/.

[78] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.

[79] Neural Machine Translation.

[80] Raul Puri. Megatron: a large, powerful transformer, Aug. 2019. https://github.com/NVIDIA/Megatron-LM.

[81] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223–1231. Curran Associates, Inc., 2012.

[82] Tal Ben-Nun and Torsten Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. CoRR, abs/1802.09941, 2018.

[83] Amir Gholami, Ariful Azad, Kurt Keutzer, and Aydin Buluç. Integratedmodel and data parallelism in training neural networks. CoRR,abs/1712.04432, 2017.

[84] MegatronLM: Training Billion+ Parameter Language Mod-els Using GPU Model Parallelism, Jul. 2019. https://nv-adlr.github.io/MegatronLM.

[85] Seunghak Lee, Jin Kyu Kim, Xun Zheng, Qirong Ho, Garth A Gibson,and Eric P Xing. On model parallelization and scheduling strategiesfor distributed machine learning. In Z. Ghahramani, M. Welling,C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances inNeural Information Processing Systems 27, pages 2834–2842. CurranAssociates, Inc., 2014.

[86] Ravichandra Addanki, Shaileshh Bojja Venkatakrishnan, ShreyanGupta, Hongzi Mao, and Mohammad Alizadeh. Placeto: Learninggeneralizable device placement algorithms for distributed machinelearning. CoRR, abs/1906.08879, 2019.

[87] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, andZbigniew Wojna. Rethinking the inception architecture for computervision. In Proceedings of the IEEE conference on computer vision andpattern recognition, pages 2818–2826, 2016.

[88] Karen Simonyan and Andrew Zisserman. Very deep convolu-tional networks for large-scale image recognition. arXiv preprintarXiv:1409.1556, 2014.

[89] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei,and Ilya Sutskever. Language models are unsupervised multitasklearners. OpenAI Blog, 1(8), 2019.

[90] MLPerf v0.6: NVIDIA Implementation of Attention Mechanisms forTranslation, Aug. 2019. https://github.com/mlperf/training_results_v0.6/tree/master/NVIDIA/benchmarks/transformer/implementations/pytorch.

[91] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.Improving Language Understanding by Generative Pre-Training.

[92] Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, MuthianSivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng,Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. Gandiva:Introspective cluster scheduling for deep learning. In 13th USENIXSymposium on Operating Systems Design and Implementation (OSDI18), pages 595–610, Carlsbad, CA, 2018. USENIX Association.

[93] Andy Hock. Introducing the Cerebras CS-1, the Industry's Fastest Artificial Intelligence Computer, Nov. 2019. https://www.cerebras.net/introducing-the-cerebras-cs-1-the-industrys-fastest-artificial-intelligence-computer/.

[94] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, GauravAgrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden,Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, ChrisClark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb,Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland,Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, RobertHundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, AlexanderKaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, NaveenKumar, Steve Lacy, James Laudon, James Law, Diemthu Le, ChrisLeary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean,Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan,Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, MarkOmernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, MattRoss, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov,Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, MercedesTan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, VijayVasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe HyunYoon. In-datacenter performance analysis of a tensor processing unit.SIGARCH Comput. Archit. News, 45(2):1–12, June 2017.


[95] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. Ascalable, commodity data center network architecture. SIGCOMMComput. Commun. Rev., 38(4):63–74, August 2008.

[96] John Kim, Wiliam J. Dally, Steve Scott, and Dennis Abts. Technology-driven, highly-scalable dragonfly topology. SIGARCH Comput. Archit.News, 36(3):77–88, June 2008.

[97] Min Yee Teh, Jeremiah J. Wilke, Keren Bergman, and SébastienRumley. Design space exploration of the dragonfly topology. InJulian M. Kunkel, Rio Yokota, Michela Taufer, and John Shalf, editors,High Performance Computing, pages 57–74, Cham, 2017. SpringerInternational Publishing.

[98] M. Besta and T. Hoefler. Slim fly: A cost effective low-diameternetwork topology. In SC ’14: Proceedings of the InternationalConference for High Performance Computing, Networking, Storageand Analysis, pages 348–359, Nov 2014.

[99] Yong Cui, Shihan Xiao, Xin Wang, Zhenjie Yang, Chao Zhu, Xi-angyang Li, Liu Yang, and Ning Ge. Diamond: Nesting the data centernetwork with wireless rings in 3d space. In 13th USENIX Symposiumon Networked Systems Design and Implementation (NSDI 16), pages657–669, Santa Clara, CA, March 2016. USENIX Association.

[100] K. Chen, X. Wen, X. Ma, Y. Chen, Y. Xia, C. Hu, and Q. Dong.Wavecube: A scalable, fault-tolerant, high-performance opticaldata center architecture. In 2015 IEEE Conference on ComputerCommunications (INFOCOM), pages 1903–1911, April 2015.

[101] Yunpeng James Liu, Peter Xiang Gao, Bernard Wong, and SrinivasanKeshav. Quartz: A new design element for low-latency dcns.SIGCOMM’14, pages 283–294.

[102] Ankit Singla, Chi-Yao Hong, Lucian Popa, and P. Brighten Godfrey.Jellyfish: Networking data centers randomly. In Proceedings ofthe 9th USENIX Conference on Networked Systems Design andImplementation, NSDI’12, pages 17–17, Berkeley, CA, USA, 2012.USENIX Association.

[103] William M. Mellette, Rob McGuinness, Arjun Roy, Alex Forencich,George Papen, Alex C. Snoeren, and George Porter. Rotornet: Ascalable, low-complexity, optical datacenter network. SIGCOMM ’17,pages 267–280, 2017.

[104] William M. Mellette, Rajdeep Das, Yibo Guo, Rob McGuinness,Alex C. Snoeren, and George Porter. Expanding across time to deliverbandwidth efficiency and low latency. NSDI’20, 2020.

[105] Monia Ghobadi, Ratul Mahajan, Amar Phanishayee, Nikhil Devanur,Janardhan Kulkarni, Gireeja Ranade, Pierre-Alexandre Blanche,Houman Rastegarfar, Madeleine Glick, and Daniel Kilper. Projector:Agile reconfigurable data center interconnect. SIGCOMM ’16, pages216–229, 2016.

[106] Navid Hamedazimi, Zafar Qazi, Himanshu Gupta, Vyas Sekar,Samir R. Das, Jon P. Longtin, Himanshu Shah, and Ashish Tanwer.Firefly: A reconfigurable wireless data center fabric using free-spaceoptics. SIGCOMM’14, pages 319–330.

[107] Vishal Shrivastav, Asaf Valadarsky, Hitesh Ballani, Paolo Costa,Ki Suh Lee, Han Wang, Rachit Agarwal, and Hakim Weatherspoon.Shoal: A network architecture for disaggregated racks. In 16th USENIXSymposium on Networked Systems Design and Implementation(NSDI’19). USENIX, February 2019.

[108] Andromachi Chatzieleftheriou, Sergey Legtchenko, Hugh Williams,and Antony Rowstron. Larry: Practical network reconfigurability inthe data center. In 15th USENIX Symposium on Networked SystemsDesign and Implementation (NSDI 18), pages 141–156, Renton, WA,April 2018. USENIX Association.

[109] Sergey Legtchenko, Nicholas Chen, Daniel Cletheroe, Antony Row-stron, Hugh Williams, and Xiaohan Zhao. Xfabric: A reconfigurablein-rack network for rack-scale computers. In 13th USENIX Symposiumon Networked Systems Design and Implementation (NSDI 16), pages15–29, Santa Clara, CA, March 2016. USENIX Association.

[110] A. Funnell, J. Benjamin, H. Ballani, P. Costa, P. Watts, and B. C. Thomsen. High port count hybrid wavelength switched TDMA (WS-TDMA) optical switch for data centers. In 2016 Optical Fiber Communications Conference and Exhibition (OFC), pages 1–3, March 2016.

[111] A. C. Funnell, K. Shi, P. Costa, P. Watts, H. Ballani, and B. C. Thomsen. Hybrid wavelength switched-TDMA high port count all-optical data centre switch. Journal of Lightwave Technology, 35(20):4438–4444, Oct 2017.

[112] Kai Shi, Sophie Lange, Istvan Haller, Daniel Cletheroe, Raphael Behrendt, Benn Thomsen, Fotini Karinou, Krzysztof Jozwik, Paolo Costa, and Hitesh Ballani. System demonstration of nanosecond wavelength switching with burst-mode PAM4 transceiver. In ECOC, 2019.

[113] Mtp fiber connector guide, Jan. 2019. https://www.sanspot.com/technical/mtp-fiber-connector-guide.

[114] Supernova light supply. https://ayarlabs.com/products/.

[115] Amit Shukla, Lambodar P. Singh, and Raja Datta. Wavelength assignment in a ring topology for wavelength routed WDM optical networks. 2005.

[116] Asuman E. Ozdaglar and Dimitri P. Bertsekas. Routing and wavelengthassignment in optical networks.

[117] George N. Rouskas. Routing and Wavelength Assignment in OpticalWDM Networks. American Cancer Society, 2003.

[118] Gordon Wilfong and Peter Winkler. Ring routing and wavelengthtranslation. In Proceedings of the Ninth Annual ACM-SIAM Symposiumon Discrete Algorithms, SODA ’98, pages 333–341, Philadelphia, PA,USA, 1998. Society for Industrial and Applied Mathematics.

[119] S. Subramaniam and R. A. Barry. Wavelength assignment in fixed rout-ing wdm networks. In Proceedings of ICC’97 - International Confer-ence on Communications, volume 1, pages 406–410 vol.1, June 1997.

[120] Ravindra K Ahuja, Thomas L Magnanti, and James B Orlin. Networkflows. 1988.

[121] Robert E Tarjan. Dynamic trees as search trees via euler tours, appliedto the network simplex algorithm. Mathematical Programming,78(2):169–177, 1997.

[122] James B Orlin, Serge A Plotkin, and Éva Tardos. Polynomialdual network simplex algorithms. Mathematical programming,60(1-3):255–276, 1993.

[123] Prabhakar Raghavan and Clark D. Thompson. Randomized rounding:A technique for provably good algorithms and algorithmic proofs.Technical Report UCB/CSD-85-242, EECS Department, Universityof California, Berkeley, May 1985.

[124] Azalia Mirhoseini, Hieu Pham, Quoc V Le, Benoit Steiner, RasmusLarsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, SamyBengio, and Jeff Dean. Device placement optimization with reinforce-ment learning. In Proceedings of the 34th International Conferenceon Machine Learning-Volume 70, pages 2430–2439. JMLR. org, 2017.

[125] Ravichandra Addanki, Shaileshh Bojja Venkatakrishnan, ShreyanGupta, Hongzi Mao, and Mohammad Alizadeh. Learning generalizabledevice placement algorithms for distributed machine learning. InH. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox,and R. Garnett, editors, Advances in Neural Information ProcessingSystems 32, pages 3983–3993. Curran Associates, Inc., 2019.

[126] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley,Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXivpreprint arXiv:1909.08053, 2019.

[127] NVIDIA unveils 350lb ’DGX-2’ supercomputer with a $399,000 pricetag, 2018.

[128] 10GBASE-T SFP+ copper. https://www.fs.com/products/66612.html.

[129] Nvidia Tesla V100 Volta GPU accelerator 32GB graphics card. https://www.amazon.com/NVIDIA-Tesla-Volta-Accelerator-Graphics/dp/B07JVNHFFX/ref=sr_1_2?keywords=tesla+v100+gpu&qid=1580321850&sr=8-2.

[130] 0.5m (2ft) generic compatible 200g qsfp56 passive. http://www.tarluz.com/product-details/qsfp-dd-200g-sr8-100m-optical-transceiver/.

[131] Cisco qsfp-100g-sr4-s compatible 100gbase-sr4 qsfp28 850nm 100mdom transceiver module. https://www.fs.com/products/48354.html.

[132] Danyang Zhuo, Monia Ghobadi, Ratul Mahajan, Amar Phanishayee, Xuan Kelvin Zou, Hang Guan, Arvind Krishnamurthy, and Thomas Anderson. RAIL: A case for redundant arrays of inexpensive links in data center networks. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 561–576, Boston, MA, 2017. USENIX Association.

[133] 2020 General Europractice Pricelist, Jan. 2020. https://europractice-ic.com/wp-content/uploads/2020/01/General-MPW-EUROPRACTICE-200123-v3.pdf.

[134] D. Kim, K. Y. Au, H. Y. L. X. Luo, Y. L. Ye, S. Bhattacharya, and G. Q.Lo. 2.5d silicon optical interposer for 400 gbps electronic-photonicintegrated circuit platform packaging. In 2017 IEEE 19th ElectronicsPackaging Technology Conference (EPTC), pages 1–4, Dec 2017.

[135] John Bowers, Hyundai Park, Ying-Hao Kuo, Alexander Fang, RichardJones, Mario Paniccia, Oded Cohen, and Omri Raday. Integratedoptical amplifiers on silicon waveguides. 07 2007.

[136] Tyndall National Institute. https://www.tyndall.ie/packaging/.

[137] PLC Connections. http://plcconnections.com/products.html.

[138] PCI Express riser card PCI-e extender. https://www.ebowsos.com/products/.

[139] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[140] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013.

[141] Alvaro Moscoso-Martir, Juliana Müller, Johannes Hauck, Nicolas Chimot, Rony Setter, Avner Badihi, Daniel Rasmussen, Alexandre Garreau, Mads Nielsen, Elmira Islamova, Sebastian Romero-García, Bin Shen, Anna Sandomirsky, Sylvie Rockman, Chao Li, Saeed Sharif Azadeh, Guo-Qiang Lo, Elad Mentovich, Florian Merget, and Jeremy Witzens. Silicon photonics WDM transceiver with SOA and semiconductor mode-locked laser. Scientific Reports, 7, 05 2016.


Figure 15: Throughput autocorrelation: peer-to-peer communication between each pair of GPUs is predictable and repetitive at training time-scale. The peak-to-peak horizontal distance is equal to the training time per iteration.

[Figure 16 omitted: example on a four-node ring (Node1–Node4) showing (a) a wavelength allocation and (b) the equivalent flow allocation, with source/sink nodes and the cut variables X, X′, Y, Y′ marked on the segment between Node3 and Node4.]

Figure 16: Wavelength allocation and its equivalent flow routing translation for a multi-edge cut.

A ML WORKLOADS ARE PERIODIC

To measure the periodicity of the communication, we calculate the autocorrelation of each workload. For example, Fig. 15 plots the autocorrelation of the VGG workload in Fig. 2b, with peaks corresponding to training iterations. The autocorrelation is periodic with a slow decay, indicating that future communication requirements between GPUs can easily be predicted from recent measurements. This observation helps in making better scheduling decisions, as the future state of the demands is correlated with their current state.
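As an illustration of the measurement, the snippet below computes the autocorrelation of a synthetic, hypothetical per-GPU-pair traffic trace with numpy; a peak at the injected period is the kind of signature described above. The trace and its period are made up, not taken from our workloads.

```python
# Minimal sketch: autocorrelation of a synthetic per-GPU-pair traffic trace.
import numpy as np

def autocorrelation(x):
    x = x - x.mean()
    corr = np.correlate(x, x, mode="full")[len(x) - 1:]
    return corr / corr[0]                      # normalize so lag 0 == 1

# Hypothetical trace: a communication burst every 200 samples emulates one iteration.
rng = np.random.default_rng(0)
trace = rng.normal(0, 0.05, 4000)
trace[::200] += 1.0                            # periodic communication bursts
ac = autocorrelation(trace)
peak_lag = 1 + int(np.argmax(ac[1:300]))       # first peak: ~200, the injected period
print("first autocorrelation peak at lag", peak_lag)
```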

B MULTI-EDGE SEGMENT CUT

In Section 3, we explain how we can express the wavelength allocation problem as a flow problem if we cut at a segment through which only one traffic demand passes. In Fig. 16, we show the more general case of cuts with higher degrees. Suppose that we want to inject the flows at the segment between Node3 and Node4. The problem remains essentially the same, but we need to add the following three constraints in addition to the flow conservation constraints: (1) X = X′, (2) Y = Y′, and (3) X + Y = 1. We can simply add these constraints to our simplex problem as well.

C OPTICAL SIMULATIONS

Fig. 17 demonstrates our approach to building a SiP-interfaced GPU node at large scale. Every WDM input of 64 wavelengths from the previous GPU node is first de-interleaved into 4 groups of 16 wavelengths each. Cascaded SiP micro-ring filters are used to perform wavelength-selective add/drop or to pass wavelength(s) through the node based on the requirements of the global scheduler. To overcome the spectral power variability caused by the multi-staged optical components, we add optical amplifiers, optical (de)multiplexers, and variable optical attenuators (VOAs) to equalize the optical power of each wavelength at the output of the GPU node. An interleaver then combines all 4 groups and forwards the new WDM signal to the next GPU node. We simulate our SiP add/drop interface using the American Institute for Manufacturing Integrated Photonics (AIM Photonics) process design kit (PDK) in the OptSim software. The add/drop filters are from the AIM PDK, and the (de)interleavers are built with cascaded 2-stage MZIs. The optical multiplexers/demultiplexers are designed using ideal OptSim models with a bandwidth of 0.5 nm. The multiplexer/demultiplexer function can also be implemented with multimode interference couplers (MMIs). In the simulation, we achieve an equalized optical spectrum at the output of a GPU node for two cases: (1) 64 bypass wavelengths, and (2) 64 wavelengths with 32 wavelengths being dropped and added while the other 32 wavelengths bypass the node.

Figure 17: System-level diagram of GPU nodes with a scalable SiP select/bypass interface. The incoming 64 wavelengths are separated into four groups of 16 wavelengths each for select/bypass.


D INTEGER LINEAR PROGRAM FORMULATION

We can formulate the wavelength allocation ILP in Section 3 as follows:

$$
\begin{aligned}
\underset{\Lambda \in \{0,1\}^{N \times N \times W}}{\text{maximize}} \quad
& \min_{i,j\,:\,TM_{ij} > 0} \;\; \frac{\sum_{k} \Lambda_{ijk}}{TM_{ij}} \\
\text{s.t.} \quad
& \sum_{(j \le i \le N+l)\,\cap\,(l < j)} \Lambda_{(i \bmod N)\,jk} \le 1 \qquad \forall\, l, k \quad (1)\\
& \sum_{i} \Lambda_{ijk} \le 1 \qquad \forall\, j, k \quad (2)\\
& \sum_{j} \Lambda_{ijk} \le 1 \qquad \forall\, i, k \quad (3)
\end{aligned}
$$
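A hedged sketch of how this ILP could be fed to an off-the-shelf solver is shown below, using PuLP and an auxiliary variable to linearize the max-min objective. The segment-crossing rule (a flow from i to j occupies segments i, i+1, ..., j-1 clockwise) is our reading of constraint (1), and the small traffic matrix is hypothetical; this is not the authors' solver code.

```python
# Sketch of the wavelength-allocation ILP with PuLP (our linearization, not the
# paper's solver code). lam[i][j][k] = 1 if flow (i, j) is assigned wavelength k.
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

def solve(tm, W):
    N = len(tm)
    prob = LpProblem("wavelength_allocation", LpMaximize)
    lam = LpVariable.dicts("lam", (range(N), range(N), range(W)), cat=LpBinary)
    t = LpVariable("t", lowBound=0)
    prob += t                                            # maximize min normalized allocation
    demands = [(i, j) for i in range(N) for j in range(N) if tm[i][j] > 0]
    for i, j in demands:                                 # t <= sum_k lam_ijk / TM_ij
        prob += lpSum(lam[i][j][k] for k in range(W)) >= t * tm[i][j]
    for k in range(W):
        for l in range(N):                               # constraint (1): one flow per segment
            crossing = [(i, j) for i, j in demands
                        if l in {(i + h) % N for h in range((j - i) % N)}]
            prob += lpSum(lam[i][j][k] for i, j in crossing) <= 1
        for j in range(N):                               # constraint (2): one sender per (rx, k)
            prob += lpSum(lam[i][j][k] for i in range(N)) <= 1
        for i in range(N):                               # constraint (3): one receiver per (tx, k)
            prob += lpSum(lam[i][j][k] for j in range(N)) <= 1
    prob.solve()
    return t.value()

# Hypothetical 4-node ring, 2 wavelengths, unit demands between ring neighbors.
print(solve([[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]], W=2))
```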
