
IEEE TRANSACTIONS, DRAFT MANUSCRIPT, FEB 2020

Overview of the IBM Neural Computer Architecture

Pritish Narayanan, Charles E. Cox, Alexis Asseman, Nicolas Antoine, Harald Huels, Winfried W. Wilcke, and Ahmet S. Ozcan

Abstract—The IBM Neural Computer (INC) is a highly flexible, re-configurable parallel processing system that is intended as a research and development platform for emerging machine intelligence algorithms and computational neuroscience. It consists of hundreds of programmable nodes, primarily based on Xilinx's Field Programmable Gate Array (FPGA) technology. The nodes are interconnected in a scalable 3D mesh topology. We overview INC, emphasizing unique features such as flexibility and scalability, both in the types of computations performed and in the available modes of communication, enabling new machine intelligence approaches and learning strategies not well suited to the matrix-manipulation/SIMD libraries that GPUs are optimized for. This paper describes the architecture of the machine; applications are to be described in detail elsewhere.

Index Terms—FPGA, parallel processors, machine intelligence, re-configurable hardware.


1 INTRODUCTION

The revolution in deep learning over the last decade has been mainly driven by the confluence of two equally important factors – the generation of large amounts of data, and the availability of GPUs to train large neural networks. These factors accelerated research in the algorithmic domain by significantly decreasing experiment turn-around time. In recent years, as the computational demands for deep learning have increased, especially in consumer-facing domains such as image and speech recognition, there is a trend towards more hardware specialization to improve performance and energy efficiency. For instance, recent hardware approaches [1], [2], [3], [4], [5], [6], [7] feature techniques such as reduced precision, aggressive compression schemes and customized systolic data paths aimed at accelerating today's DNNs.

While these efforts are extremely important both from research and commercialization perspectives, the currently available hardware landscape is also somewhat restrictive. This is because GPUs (and their variants) may not lend themselves well to use cases where mini-batching/data parallelism is non-trivial, where multiply-accumulate operations are not dominant, or when low latency is critical. Therefore, as machine intelligence algorithms continue to evolve, it is unfortunate that promising approaches may be sidelined simply because they do not map well to a GPU, just as backpropagation trailed more conventional machine learning approaches for decades due to the lack of GPUs. One of the prime examples of an algorithm that is not well matched to SIMD architectures is the Monte Carlo Tree Search used in Google DeepMind's AlphaGo system [8].

This indicates the need for new computer architectures for machine intelligence. However, there is an obvious challenge – namely, how does one build hardware for algorithms and use cases that may not yet exist? The approach we have taken is to build a hardware system with tremendous flexibility, which will be explored in depth in the rest of the paper. This flexibility extends to the types of algorithms being executed, the portions that need to be off-loaded and accelerated, the model of communication, and the types of parallelism deployed.

• The authors, except H. Huels, are with the IBM Research Division, Almaden Research Center, San Jose, CA 95120. H. Huels is with IBM Boeblingen, Germany. E-mail: [email protected]

Manuscript revised January 29, 2020.

The system, named IBM Neural Computer (INC), is fundamentally a large, highly scalable parallel processing system with compute nodes interconnected in a 3D mesh topology. The total number of compute nodes within the existing single-cage system is 432. Each compute node contains a Xilinx Zynq System-on-Chip (an ARM A9 CPU + FPGA logic on the same die) along with 1 GB of dedicated RAM [9], [10], [11]. The availability of FPGA resources on every node allows application-specific processor offload, a feature that is not available on any parallel machine of this scale that we are aware of.

The communication network that realizes the 3D mesh topology is implemented using single-span and multi-span SERDES (Serializer-Deserializer) links connected to the FPGA fabric of the Zynq. It is therefore possible to build tailored hardware network controllers based on the communication mode(s) most suited to the application. The ability to optimize the system performance across application code, middleware, system software, and hardware is a key feature of INC.

This 3D topology of distributed memory and compute, with the ability for nodes to exchange signals/messages with one another, is somewhat reminiscent of works targeting the human brain (most prominently, the SpiNNaker project [12], [13]). However, an important distinction is that our goal is broader than computational neuroscience, and extends to machine intelligence in general.

The rest of the paper is organized as follows. Section 2 begins with an overview of the INC system. Section 3 discusses inter-node communication mechanisms in INC. Section 4 contains details on the diagnostic and debug capabilities of the INC system, which are crucial in a development environment. Section 5 concludes the paper.

Fig. 1. 3D Mesh Topology of a single INC card, with some node numbers shown. Each node contains its own dedicated Zynq System-on-Chip with an ARM Cortex-A9 and FPGA logic. Node 100 (blue shaded cube) is the gateway node to the external Ethernet. Nodes 000 (green shaded cube) and 200 have PCIe interfaces to communicate with a host computer.

Fig. 2. Hierarchical organization of the INC system (a) Conceptual design of INC 9000 with 48 cards and 1296 nodes (not yet built), (b) INC 3000 with 16 cards and 432 nodes (operational), (c) Single card with 27 nodes, (d) One node.

2 INC OVERVIEW

The INC system is designed primarily to be a development platform for emerging machine intelligence algorithms. It is a parallel processing system with a large number of compute nodes organized in a high-bandwidth 3D mesh network. The platform is designed to be highly flexible. Within each node is a Xilinx Zynq system-on-chip, which integrates a dual-core Cortex-A9 ARM processor and an FPGA on the same die, allowing the system to be reconfigured on a per-node basis. Each node also includes 1 GB of dedicated DRAM that can be used as program and data space, and is accessible both from the processor and the FPGA. In an eventual at-scale high-performance learning task, we envision that most of the performance-critical steps will be offloaded and optimized on the FPGA, with the ARM only providing auxiliary support – including initialization, diagnostics, output transfer, etc.

While INC is a distributed system in that it is composed of distinct processor+memory nodes interconnected by communication links, it has a unique combination of features not available elsewhere. It is not a multi-FPGA 'sea of gates' system [14], [15] whose structure would need to be defined by the logic resident on the FPGA. It has a very well-defined structure of compute nodes with a well-defined communications network. Therefore it does not carry the performance compromise associated with the need to support a fully generic interconnect.

It is also different from other distributed systems such as BlueGene [16]. In addition to the available FPGA offload capability at every node, the communication interfaces are not pre-defined to support a limited set of known use cases. Instead, access to the physical communication links is through the FPGA, and multiple distinct 'logical' channels of communication can be established, all utilizing the same underlying SERDES links. In this way, the network interfaces can be designed (and even progressively optimized) to best suit the applications executing on INC.

2.1 INC card

The basic building block of the system is an INC card. Each card contains 27 nodes arranged in a 3×3×3 cube.

Figure 1 shows the 3×3×3 topology of an individual card, along with (XYZ) co-ordinates overlaid to indicate the organization of the 3D mesh. The 27 nodes are placed on the card in a way that minimizes the connection lengths between logically adjacent nodes. All 27 nodes on a single card are largely identical, with a few important differences. Node (100) includes an Ethernet port, and can act as a gateway connecting an internal Ethernet network implemented on the FPGAs to a conventional external network. Node (000) is a controller node, and includes a 4-lane PCIe 2.0 connection that can be connected to a host PC. It also has a serial connection that can serve as a console during boot time, or be forwarded to the other nodes on the card. Node (200) is also capable of supporting a PCIe interface, should an application need additional bandwidth.

2.2 Backplane, Cages and Racks

In an INC system, individual cards plug into a backplane. Each backplane can support up to 16 cards, and the backplane wiring arranges the 432 nodes of the 16 cards into a 12×12×3 mesh. The backplane and cards are enclosed in an INC card cage (INC 3000 system, Fig. 2b). Connectors on the back side of the backplane allow up to four cages to be connected vertically to build a system of up to 12×12×12 or 1728 nodes (INC 9000 system, Fig. 2a).

The card design supports building an INC system with anywhere from one to 512 cards (13,824 nodes). However, to grow beyond the size of the INC 9000 system, a new backplane design is required.

2.3 Physical Links

Each node on a card is connected to its nearest orthogonal neighbors by a single-span link composed of two unidirectional serial connections. Each node has six single-span links. The nodes on the faces of the cube (i.e. all nodes other than the central (111) node) have single-span links that leave the card, and may have nearest neighbors on other cards in the system. In addition to single-span links, 6 bi-directional multi-span links allow for more efficient communication in a larger system. Multi-span links connect nodes that are three nodes apart in any one orthogonal direction, and will always begin and terminate on different cards. With a total of 432 links leaving or entering the card, and 1 gigabyte (GB) per second per link, this amounts to a potential maximum bandwidth of 432 GB per second leaving and entering one card. The bisection bandwidths for the INC 9000 and INC 3000 systems are 864 GB per second and 288 GB per second respectively.

The communications links are pairs of high-speed, serial, unidirectional SERDES connections. Each connection only has two wires: the differential data lines. There are no additional lines for handshake signals. The links are controlled by a credit scheme to ensure that overrun errors do not occur and no data is lost. A receiving link sends (via its paired transmit link) a count of how many bytes of data it is willing to receive. A transmitting link will decrement its count as it sends data and never send more data than it holds credits for from the receiver. The receiving side will add to the credit balance as it frees up buffer space. This credit system is implemented entirely in the hardware fabric and does not involve the ARM processor or software.
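To make the credit mechanism concrete, the following sketch models the transmit-side bookkeeping in C. The types and function names are hypothetical; in INC the equivalent logic is implemented entirely in the FPGA fabric rather than in software.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical software model of the per-link credit counter. */
typedef struct {
    uint32_t credits;   /* bytes the receiver is currently willing to accept */
} link_tx_state_t;

/* Called when the paired receive link delivers a credit update,
 * i.e. the receiver has freed buffer space. */
static void credit_grant(link_tx_state_t *link, uint32_t freed_bytes)
{
    link->credits += freed_bytes;
}

/* Attempt to send: only proceed if enough credits are held, so the
 * receiver's buffers can never overrun and no data is lost. */
static bool try_send(link_tx_state_t *link, uint32_t len)
{
    if (link->credits < len)
        return false;           /* stall until more credits arrive */
    link->credits -= len;       /* consume credits as the data goes out */
    /* ... the payload would be pushed onto the SERDES transmit path here ... */
    return true;
}

int main(void)
{
    link_tx_state_t link = { .credits = 0 };
    credit_grant(&link, 256);   /* receiver advertises 256 bytes of buffer space */
    printf("send 200 bytes: %s\n", try_send(&link, 200) ? "ok" : "stalled");
    printf("send 100 bytes: %s\n", try_send(&link, 100) ? "ok" : "stalled");
    return 0;
}
```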

2.4 Packet Routing

The communication network currently supports directed and broadcast packet routing schemes. Features such as multi-cast or network defect avoidance are being considered at the time of writing, and can be included based on application or hardware needs.

In a directed routing mode, a packet originating from the processor complex or the FPGA portion of a compute node is routed to a single destination. Both single-span and multi-span links may be used for the routing, and the packet will be delivered with a minimum number of hops. However, a deterministic routing path is not guaranteed, as each node involved may make a routing decision based on which links happen to be idle at that instant. This implies that in-order delivery of packets is not guaranteed¹. The packet routing mechanism is implemented entirely on the FPGA fabric, and the ARM processors may only be involved at the source and destination nodes, if at all.

A broadcast packet radiates out from the source node in all directions and is delivered to every node in the system. Broadcast packets only use the single-span links in the system, for simplicity of routing. Depending on which link received a broadcast packet, the receiving node may choose to a) forward to all other links, b) forward to a subset of links, or c) stop forwarding. By choosing the rules for these three scenarios carefully, it is possible to ensure that all nodes in the system receive exactly one copy of the broadcast packet.
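As an illustration only, the sketch below shows one standard rule set (dimension-ordered fan-out) that achieves the exactly-one-copy property on a 3D mesh; the rules actually implemented on INC may differ.

```c
#include <stdbool.h>
#include <stdio.h>

/* One possible forwarding rule set for broadcast packets on a 3D mesh
 * (dimension-ordered fan-out). The direction a packet arrived from fully
 * determines which outgoing single-span links it is forwarded on, and
 * these particular rules deliver exactly one copy to every node. */
typedef enum { FROM_LOCAL, FROM_X, FROM_Y, FROM_Z } arrival_t;

typedef struct {
    bool keep_going;  /* continue in the arrival dimension, away from the source */
    bool fan_out_x;   /* forward on both X links (only the source does this) */
    bool fan_out_y;   /* forward on both Y links */
    bool fan_out_z;   /* forward on both Z links */
} fwd_t;

static fwd_t broadcast_decide(arrival_t from)
{
    fwd_t d = { false, false, false, false };
    switch (from) {
    case FROM_LOCAL: d.fan_out_x = d.fan_out_y = d.fan_out_z = true; break;
    case FROM_X:     d.keep_going = true; d.fan_out_y = d.fan_out_z = true; break;
    case FROM_Y:     d.keep_going = true; d.fan_out_z = true; break;
    case FROM_Z:     d.keep_going = true; break;  /* end of the fan-out chain */
    }
    return d;
}

int main(void)
{
    const char *names[] = { "local", "x", "y", "z" };
    for (int f = FROM_LOCAL; f <= FROM_Z; f++) {
        fwd_t d = broadcast_decide((arrival_t)f);
        printf("from %-5s -> keep_going=%d x=%d y=%d z=%d\n",
               names[f], d.keep_going, d.fan_out_x, d.fan_out_y, d.fan_out_z);
    }
    return 0;
}
```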

3 CONNECTIVITY AND COMMUNICATION

Multiple virtual channels can be designed to sit atop the underlying packet router logic described in the previous section to give the processor and FPGA logic different virtual or logical interfaces to the communication network. In this section we review three approaches currently implemented on INC – Internal Ethernet, Postmaster Direct Memory Access (DMA), and Bridge FIFO.

3.1 Internal Ethernet

One of the virtual interfaces is designed to appear similar to an Ethernet interface². While the underlying hardware is quite different from an Ethernet network, this design point is chosen to take advantage of the large amount of standard application software readily available for IP networks, such as ssh. A Linux OS and associated device driver running on an ARM processor can then use these applications to communicate with other nodes on the internal network, operating as if it were communicating with a real Ethernet device. Similarly, applications that depend on standard parallel software libraries (e.g. Message Passing Interface (MPI) [17] and its variants) can be easily supported. Using stable, well-established networking applications was also extremely useful during initial debugging of the network hardware and the system software.

1. This does not preclude applications that need in-order delivery, as reordering can be achieved in either FPGA hardware or in software, or a different packet routing scheme can be devised as necessary.

2. Note that this is an interface for node-to-node communication, and is different from the 'real' physical Ethernet interface at node (100), which is intended for communicating with the external world.

Fig. 3. Operation of the virtual internal Ethernet interface: Network packets generated at a source node are initially in its DRAM. At the request of the Ethernet device driver, a DMA transfer is initiated and the packet is transferred to the router logic facing the SERDES links. The packet will traverse through zero or more intermediate nodes without processor interaction before reaching the destination, where the Ethernet device implemented on the FPGA fabric will raise a hardware interrupt, notifying the driver and thereby the kernel of the new packet to be processed.

The operating mechanism for transmitting packets is conceptualized in Fig. 3. During the transmit operation, the application passes information to the kernel networking stack, which adds various headers and sends it on to a virtual internal Ethernet interface (ethX). This interface is owned by the device driver, which manages a set of buffer descriptors that contain information about the size and memory locations of various packets in the DRAM. The device driver then informs the hardware about the availability of a packet to be transmitted, by setting a status bit. The actual transfer from the DRAM into the FPGA fabric is a DMA operation, using an AXI-HP bus on the Zynq chip. Packet receive is conceptually a reverse operation, with the distinction that the device driver has two mechanisms to know of the arrival of a packet on the interface – one is a hardware interrupt, and the other is a polling mechanism that is far more efficient under high traffic conditions. Note that while this description assumes applications running in software as the producers and consumers of the packets, the internal Ethernet can also be accessed from other hardware blocks on the FPGA fabric itself, if at all necessary.
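As a rough illustration of this transmit path, the C sketch below shows how a driver might hand a packet already residing in DRAM to the FPGA by filling a buffer descriptor and setting a status bit. The descriptor fields, register names and offsets are hypothetical and do not describe the actual INC register map.

```c
#include <stdint.h>

/* Hypothetical buffer descriptor and register layout for the virtual
 * internal Ethernet interface; names and fields are illustrative only. */
struct tx_desc {
    uint32_t dram_addr;   /* physical address of the packet in node DRAM */
    uint16_t length;      /* packet length in bytes */
    uint16_t flags;       /* ownership / status flags */
};
#define TX_DESC_OWNED_BY_HW  (1u << 0)

struct eth_regs {                 /* memory-mapped FPGA registers (illustrative) */
    volatile uint32_t tx_kick;    /* written by the driver: a descriptor is ready */
    volatile uint32_t irq_status; /* receive side: packet-arrived interrupt bit */
};

/* Driver-side transmit path: describe where the packet sits in DRAM, mark
 * the descriptor as owned by hardware, then notify the FPGA. The FPGA then
 * pulls the packet over an AXI-HP port by DMA and hands it to the packet
 * router facing the SERDES links. */
void eth_transmit(struct eth_regs *regs, struct tx_desc *d,
                  uint32_t pkt_phys_addr, uint16_t pkt_len)
{
    d->dram_addr = pkt_phys_addr;
    d->length    = pkt_len;
    d->flags     = TX_DESC_OWNED_BY_HW;
    regs->tx_kick = 1;            /* status bit: packet available for transmission */
}
```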

The availability of this virtual internal Ethernet also makes it straightforward for any node in the system to communicate with the external world using TCP/IP. This is done by using the physical Ethernet port on node (100) and configuring this node as an Ethernet gateway implementing Network Address Translation (NAT) and port forwarding. One immediate and obvious use of this feature is the implementation of an NFS (network file system) service to save application data from each of the nodes (whose file systems are implemented on the DRAM and are therefore volatile) to a non-volatile external storage medium.

3.2 Postmaster DMA

The Postmaster DMA logic provides a method to move small amounts of data between nodes in the system. The function is intended to be used directly by machine intelligence application code in software or by application hardware modules on the FPGA. It provides a communications channel with much lower overhead than going through the TCP/IP stack.

Postmaster DMA is a tunneled queue model (Fig. 4), where the processor (or hardware module) sees a queue that can be written at a known, fixed address. Data written to that queue is transferred to a remote node where it is picked up by a DMA engine and moved to a pre-allocated buffer in system memory. Multiple initiators may send data to the same target. At the target, the received data is stored in a linear stream in the order in which it is received. The Postmaster hardware guarantees that a packet of data from a single initiator is always stored in contiguous locations. To reiterate, it is not necessary that the Postmaster DMA be used only for processor-to-processor communication; other FPGA hardware implemented on the source/destination nodes could use it too if necessary (although in some cases this may make the 'DMA' superfluous).
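The following C sketch illustrates the tunneled queue model under stated assumptions: the fixed queue address, the word-granularity framing and the receive-side indices are hypothetical, and the real transfer and buffer management are performed by the FPGA DMA engine.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of the tunneled queue model. */
#define POSTMASTER_TX_QUEUE ((volatile uint32_t *)0x43C00000u) /* hypothetical fixed address */

/* Initiator side: push one small packet of words into the transmit queue.
 * The hardware guarantees the packet lands contiguously at the target. */
void postmaster_send(const uint32_t *words, size_t n)
{
    for (size_t i = 0; i < n; i++)
        *POSTMASTER_TX_QUEUE = words[i];
}

/* Target side: the DMA engine appends arriving packets, possibly interleaved
 * from several initiators, into one linear stream in a pre-allocated buffer. */
struct postmaster_rx {
    const uint32_t *buffer;   /* pre-allocated receive buffer in system memory */
    volatile size_t *wr_idx;  /* write index advanced by the DMA engine */
    size_t rd_idx;            /* read index advanced by the software consumer */
};

/* Poll for new data; returns 1 and stores one word if anything has arrived. */
int postmaster_poll(struct postmaster_rx *rx, uint32_t *out)
{
    if (rx->rd_idx == *rx->wr_idx)
        return 0;                       /* nothing new yet */
    *out = rx->buffer[rx->rd_idx++];    /* consume in arrival order */
    return 1;
}
```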

Packets from multiple initiators will be interleaved within the single data queue. This model is particularly well-suited to machine intelligence applications in which regions or learners are distributed across multiple nodes, and each node generates multiple small outputs during each time step which become the inputs in the next time step. The function of Postmaster is to allow the node to send those outputs to their intended targets as they are generated, rather than collect them and send them out as a larger transmission at the end of the time step. In addition to eliminating the burden of aggregating the data, this approach also allows much more overlap of computation and communication.

Fig. 4. Operation of Postmaster DMA, which is a lightweight, high-bandwidth interface: An application on the source node (either on the CPU or on the FPGA) writes data to a transmit queue on the FPGA logic. This data is then received on a destination node, where it can be either consumed in the destination FPGA and/or written into a memory-mapped region as shown. System software is only involved in the initialization and tear-down process, and is not shown.

Fig. 5. Implementation of a Bridge FIFO in the FPGA logic (diagram: user-defined emitter logic feeds Bridge FIFO transmit units on the source node, which pass through a Bridge FIFO mux, Packet Mux and Packet Router onto the network; on the destination node, the Packet Demux separates Bridge FIFO, Postmaster and NetTunnel packets and the Bridge FIFO demux delivers them to the corresponding Bridge FIFO receive units and the user-defined receiver logic).

3.3 Bridge FIFO

The bridge FIFO is intended to facilitate direct hardware-to-hardware communication between two hardware modules located in separate FPGAs by exposing a regular FIFO interface. The bridge FIFO takes care of assembling the data into network packets and communicating with the packet router logic. Figure 5 presents an implementation of the bridge FIFO.

The interface is composed of two modules implemented in pairs: the Bridge FIFO transmit and the Bridge FIFO receive. The first corresponds to the write port of the FIFO, while the second corresponds to the read port. They are always implemented in pairs, with the transmit unit located on the source node and the receive unit on the destination node. Together, they form a communication channel.

On the source node, the Bridge FIFO transmit converts its input (words of data) into network packets. These Bridge FIFO packets are multiplexed with other protocol packets within the Packet Mux unit, which transmits all the network packets to the Packet Router unit. The Packet Mux unit enables coexistence of multiple communication protocols. On the destination node, the Packet Router transmits the received network packets to the Packet Demux unit, which separates the various protocol packets and directs them to their corresponding receiver. The Bridge FIFO receive unit receives the packets and converts them back into words of data.

If multiple independent communication channels are required, then multiple pairs of Bridge FIFO transmit and receive units can be instantiated. The Bridge FIFO transmit units are multiplexed within the Bridge FIFO mux, and the Bridge FIFO receive units are demultiplexed within the Bridge FIFO demux. The Bridge FIFO Mux (respectively Demux) supports up to 32 Bridge FIFO transmit (respectively receive) units. If more channels are required, then another Bridge FIFO Mux (respectively Demux) must be instantiated.

The Bridge FIFO supports configurable bit-widths ranging from 7 to 64. If a wider FIFO is needed, then multiple bridge FIFOs must be used in parallel to achieve the required width.
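The sketch below is a software model, under hypothetical names and packet framing, of how Bridge FIFO packets can carry a channel index so that the mux on the source node and the demux on the destination node can multiplex up to 32 independent channels; the actual units are FPGA logic, not software.

```c
#include <stdint.h>

/* Software model of Bridge FIFO packet framing; the packet format is
 * hypothetical. Each packet carries a channel index so the demux on the
 * destination node can deliver it to the matching Bridge FIFO receive unit. */
#define BRIDGE_FIFO_MAX_CHANNELS 32   /* per mux/demux pair, as described above */
#define BRIDGE_PKT_MAX_WORDS      8

struct bridge_pkt {
    uint8_t  channel;                  /* which of the up-to-32 FIFO pairs */
    uint8_t  word_count;
    uint64_t words[BRIDGE_PKT_MAX_WORDS];
};

/* Transmit side: wrap a burst of FIFO words into one network packet. */
struct bridge_pkt bridge_fifo_frame(uint8_t channel, const uint64_t *words, uint8_t n)
{
    struct bridge_pkt p = { .channel = channel, .word_count = 0 };
    for (uint8_t i = 0; i < n && i < BRIDGE_PKT_MAX_WORDS; i++)
        p.words[p.word_count++] = words[i];
    return p;
}

/* Receive side: the demux looks only at the channel field and hands the
 * payload to the corresponding receive unit (modeled here as a callback). */
typedef void (*fifo_sink_t)(const uint64_t *words, uint8_t n);

void bridge_fifo_demux(const struct bridge_pkt *p,
                       fifo_sink_t sinks[BRIDGE_FIFO_MAX_CHANNELS])
{
    if (p->channel < BRIDGE_FIFO_MAX_CHANNELS && sinks[p->channel] != 0)
        sinks[p->channel](p->words, p->word_count);
}
```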


TABLE 1
Communication latency of the bridge FIFO between two nodes.

  Number of hops (nodes):   0      1     3     6
  Latency (µs):             0.25   1.1   2.5   4.7

Table 1 presents the measured latency when using the bridge FIFO. A hop count of zero corresponds to the case where emitter and receiver are on the same node, where the latency corresponds to the delay incurred by the Bridge FIFO logic alone. The cases with 1, 3 and 6 hops are the best, average and worst case respectively on a single-card system (27 nodes arranged in a cubic 3D mesh with 3 nodes on each edge).

4 DIAGNOSTIC CAPABILITIES

INC features a wide array of diagnostic capabilities built into the hardware platform. This is especially important in a development platform, as the reconfigurable hardware, the system software and the application software are all concurrently evolving. While some of these have been mentioned in passing in earlier sections, we present a more detailed discussion here.

4.1 JTAG

Each INC card has a single JTAG chain that is daisy-chained through all 27 Zynq FPGAs. The JTAG chain can be used to access both the individual processors and the FPGAs, as these appear as different devices on the chain. Therefore, it can be used for a broad variety of tasks including configuring the FPGAs, loading code, debugging FPGA logic with Xilinx Chipscope, and debugging the ARM code through the ARM debug access port (DAP). This mode of debug is especially useful in ironing out issues during initial system bring-up, when other modes may not yet be readily available.

4.2 Ring Bus and NetTunnel

The Ring Bus is a sideband communications channel that links all 27 nodes on each card. The bus is implemented as a ring composed of 27 unidirectional point-to-point links. The topology allows data transfer between any two nodes by forwarding the request and write data, or the read response, through the intervening nodes. The topology also supports broadcast write operations by forwarding a given write command to all nodes on the ring. The routing of ring traffic is controlled by the hardware with no processor intervention.
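As an illustration of this hop-by-hop behavior, the following C sketch models a directed write and a broadcast write traveling around the ring; the message format and helper functions are hypothetical, since the real routing is performed entirely in hardware.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative model of Ring Bus forwarding on one card (27 nodes). */
#define RING_NODES     27
#define RING_BROADCAST 0xFF

struct ring_msg {
    uint8_t  dest;      /* target node id, or RING_BROADCAST */
    uint8_t  hops;      /* nodes visited so far, used to terminate broadcasts */
    uint32_t addr;      /* address within the target node's 4 GB space */
    uint32_t data;
};

/* Stand-in for a write into a node's memory-mapped address space. */
static void apply_write(uint8_t node, uint32_t addr, uint32_t data)
{
    printf("node %2u: [0x%08x] <= 0x%08x\n", node, addr, data);
}

/* Called as a message arrives on a node's incoming ring link.
 * Returns true if the message must be forwarded to the next node. */
static bool ring_node_handle(uint8_t my_id, struct ring_msg *m)
{
    m->hops++;
    if (m->dest == RING_BROADCAST) {
        apply_write(my_id, m->addr, m->data);   /* broadcast: every node applies it */
        return m->hops < RING_NODES;            /* stop after one full trip around the ring */
    }
    if (m->dest == my_id) {
        apply_write(my_id, m->addr, m->data);   /* directed write reached its target */
        return false;
    }
    return true;                                /* intervening node: just pass it along */
}

int main(void)
{
    struct ring_msg m = { .dest = RING_BROADCAST, .hops = 0,
                          .addr = 0x1000, .data = 0xDEADBEEF };
    for (uint8_t node = 0; ring_node_handle(node, &m); node = (uint8_t)((node + 1) % RING_NODES))
        ;                                       /* walk the message around the ring */
    return 0;
}
```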

The NetTunnel logic is functionally similar to the Ring Bus, but uses the network fabric as the transport as opposed to a dedicated sideband channel. This allows the NetTunnel logic to span the entire system, whereas the Ring Bus is confined to a single card. Note that NetTunnel does not automatically make the Ring Bus superfluous, as the network and router logic can change depending on the demands of the applications. In this scenario, having a dedicated and reliable sideband for communication is particularly useful.

As both of these mechanisms have access to the entire 4 GB address space on each node they can reach, they can be used in a wide variety of scenarios. For instance, debugging reconfigurable logic often involves reading a set of hardware registers to determine the current status, active interrupts, errors and so forth. This is especially useful when communication between nodes is involved, as the issue could be at the source, the destination, or indeed along links on intermediate nodes. Similarly, checkpoints, statistics or relevant program data may be written into, and subsequently retrieved from, hardware registers. This can be very useful in debugging application or device driver code if stdout is not available or is too late, such as in a hang scenario.
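For example, a debug utility might sweep a status register across all nodes on a card and decode its bits, along the lines of the hedged C sketch below; the register address, bit layout and NetTunnel read helper are invented for illustration only.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical status-register layout and NetTunnel read helper,
 * illustrating the register-level debugging described above. */
#define LINK_STATUS_REG   0x80000010u   /* made-up address in a node's 4 GB space */
#define STATUS_ERR_CRC    (1u << 0)
#define STATUS_ERR_CREDIT (1u << 1)
#define STATUS_IRQ_RX     (1u << 8)

/* Stand-in for a read issued over NetTunnel (or the Ring Bus) to 'node'. */
static uint32_t nettunnel_read32(uint8_t node, uint32_t addr)
{
    (void)node; (void)addr;
    return STATUS_IRQ_RX;               /* placeholder value for the example */
}

static void dump_link_status(uint8_t node)
{
    uint32_t s = nettunnel_read32(node, LINK_STATUS_REG);
    printf("node %2u status=0x%08x crc_err=%u credit_err=%u rx_irq=%u\n",
           node, s, !!(s & STATUS_ERR_CRC), !!(s & STATUS_ERR_CREDIT),
           !!(s & STATUS_IRQ_RX));
}

int main(void)
{
    for (uint8_t node = 0; node < 27; node++)   /* e.g. sweep all nodes on one card */
        dump_link_status(node);
    return 0;
}
```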

4.3 PCIe Sandbox

PCIe Sandbox is an interactive utility that runs on a host x86 machine and provides access to the INC system through the PCIe interface on node (000). Using a set of simple commands, a user can read and write to addresses on all nodes in the INC system. PCIe Sandbox also supports a 'read all' command that uses the Ring Bus to retrieve data from the same address location on all nodes of the card. Underneath, PCIe Sandbox 'translates' these commands into read or write requests on the Ring Bus and NetTunnel mechanisms described above. This abstraction layer proves very useful for rapid debugging.

By reading and writing to registers on a node, special tasks can be accomplished, including attaching the UART serial console to a particular node, and reading bitstream build IDs for all the nodes, the temperature of the card, EEPROM information (which may contain useful information such as the USB-UART serial number, the MAC ID of the gateway Ethernet interface on (100), etc.), and the system configuration (i.e. how many cards are in the system).

Finally, PCIe Sandbox is capable of loading chunks of data into the DRAM of the compute nodes. This is the preferred method for system boot-up, as one can broadcast the kernel images and associated device-trees to all the nodes in the system and then write to a boot command register that initiates boot. The same method is used to configure the FPGAs in the system with new bitstreams or to program FLASH chips in the individual nodes.

Programming the FPGAs and FLASH using the PCIe host connection and internal network is much faster than programming over JTAG. For example, programming 27 FPGAs on a single card over JTAG takes approximately 15 minutes. On the other hand, programming 27 FPGAs on a single card over PCIe takes a couple of seconds, including the data transfer. Similarly, programming 432 FPGAs on 16 cards takes nearly the same time as programming one card, thanks to the network broadcast capability. It is important to note that JTAG can only work on a single card. The programming speed advantage is even more pronounced for programming the FLASH chips. On one occasion, it took more than 5 hours to program 27 FLASH chips on a single card over JTAG. In contrast, it takes about 2 minutes to program 1, 16, or 432 (or anything in between) FLASH chips over the PCIe interface.

4.4 ARM Processor

The ARM processor provides another dimension of per-node debug and diagnostic capability. Getting Linux running on the ARM immediately makes available a rich and varied set of diagnostic utilities. Network utilities such as iperf [18], tcpdump [19] and even ping [20] were used in debugging the internal Ethernet interface and the associated device driver. Counters on the ARM Performance Monitoring Unit can be readily accessed by perf [21] or other utilities to understand bottlenecks in application code and identify candidates for FPGA offload. Standard debuggers such as gdb [22] or valgrind [23] could also be cross-compiled and ported, though this is untested at this time.

5 CONCLUSIONS

We presented an overview of the INC machine intelligence platform; publications on applications will follow shortly. Here we focused on highlighting the main features of the hardware, including flexibility, configurability, high-bandwidth, highly interconnected node-to-node communication modes, and diagnostic capabilities. The uniqueness of our system is that each of the 432 nodes allows application-specific offload, a feature that is not available on any other parallel machine of this scale.

The practical challenges of using the INC system lie at the interface of hardware and software. Programming directly at the FPGA level is often a daunting task for machine learning practitioners who are used to high-level languages and frameworks. On the other hand, hardware designers are accustomed to working with well-defined specifications and functionality, whereas circuit and micro-architecture trade-offs often differ as the task being offloaded changes. This is a rather significant disconnect, exacerbated by the fast-paced, rapidly evolving field of artificial intelligence.

ACKNOWLEDGMENTS

The authors would like to thank Spike Narayan and Jeff Welser for management support. C. Cox thanks Mike Mastro and Kevin Holland for their support in manufacturing the cards and the system.

REFERENCES

[1] F. Sijstermans, "The NVIDIA Deep Learning Accelerator," in Proceedings of Hot Chips 30, Cupertino, CA, 2018.

[2] W. J. Dally, C. T. Gray, J. Poulton, B. Khailany, J. Wilson, and L. Dennison, "Hardware-Enabled Artificial Intelligence," in 2018 Symposium on VLSI Circuits Digest of Technical Papers, Honolulu, HI, 2018.

[3] M. Courbariaux, Y. Bengio, and J. David, "Low precision arithmetic for deep learning," CoRR, vol. abs/1412.7024, 2014. [Online]. Available: http://arxiv.org/abs/1412.7024

[4] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proceedings of the 32nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille, France: PMLR, 07–09 Jul 2015, pp. 1737–1746. [Online]. Available: http://proceedings.mlr.press/v37/gupta15.html

[5] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding," CoRR, vol. abs/1510.00149, 2015. [Online]. Available: http://arxiv.org/abs/1510.00149

[6] C. Chen, J. Choi, K. Gopalakrishnan, V. Srinivasan, and S. Venkataramani, "Exploiting approximate computing for deep learning acceleration," in 2018 Design, Automation Test in Europe Conference Exhibition (DATE), March 2018, pp. 821–826.

[7] N. P. Jouppi, C. Young, N. Patil, D. A. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, R. C. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-datacenter performance analysis of a tensor processing unit," CoRR, vol. abs/1704.04760, 2017. [Online]. Available: http://arxiv.org/abs/1704.04760

[8] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan 2016.

[9] V. Rajagopalan, V. Boppana, S. Dutta, B. Taylor, and R. Wittig, "Xilinx Zynq-7000 EPP: An extensible processing platform family," in 2011 IEEE Hot Chips 23 Symposium (HCS), Aug 2011, pp. 1–24.

[10] L. H. Crockett, R. A. Elliot, M. A. Enderwitz, and R. W. Stewart, The Zynq Book: Embedded Processing with the ARM Cortex-A9 on the Xilinx Zynq-7000 All Programmable SoC. UK: Strathclyde Academic Media, 2014.

[11] Xilinx, "Zynq-7000 technical reference manual (UG585)," 2018. [Online]. Available: https://www.xilinx.com/support/.../userguides/ug585-Zynq-7000-TRM.pdf

[12] S. B. Furber, F. Galluppi, S. Temple, and L. A. Plana, "The SpiNNaker project," Proceedings of the IEEE, vol. 102, no. 5, pp. 652–665, May 2014.

[13] E. Painkras, L. A. Plana, J. Garside, S. Temple, F. Galluppi, C. Patterson, D. R. Lester, A. D. Brown, and S. B. Furber, "SpiNNaker: A 1-W 18-core system-on-chip for massively-parallel neural network simulation," IEEE Journal of Solid-State Circuits, vol. 48, no. 8, pp. 1943–1953, Aug 2013.

[14] H. Krupnova, "Mapping multi-million gate SoCs on FPGAs: industrial methodology and experience," in Proceedings Design, Automation and Test in Europe Conference and Exhibition, vol. 2, Feb 2004, pp. 1236–1241.

[15] S. Asaad, R. Bellofatto, B. Brezzo, C. Haymes, M. Kapur, B. Parker, T. Roewer, P. Saha, T. Takken, and J. Tierno, "A cycle-accurate, cycle-reproducible multi-FPGA system for accelerating multi-core processor simulation," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA '12. New York, NY, USA: ACM, 2012, pp. 153–162. [Online]. Available: http://doi.acm.org/10.1145/2145694.2145720

[16] R. Haring, M. Ohmacht, T. Fox, M. Gschwind, D. Satterfield, K. Sugavanam, P. Coteus, P. Heidelberger, M. Blumrich, R. Wisniewski, A. Gara, G. Chiu, P. Boyle, N. Christ, and C. Kim, "The IBM Blue Gene/Q compute chip," IEEE Micro, vol. 32, no. 2, pp. 48–60, March 2012.

[17] A. Geist, W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, W. Saphir, T. Skjellum, and M. Snir, "MPI-2: Extending the message-passing interface," in Euro-Par'96 Parallel Processing, L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1996, pp. 128–135.

[18] A. Tirumala, F. Qin, J. Dugan, J. Ferguson, and K. Gibbs, "iPerf: TCP/UDP bandwidth measurement tool," 2005. [Online]. Available: https://iperf.fr/

[19] tcpdump, "tcpdump," 2018. [Online]. Available: http://www.tcpdump.org/

[20] M. Muuss, "The Story of the PING Program," 2010. [Online]. Available: https://www.webcitation.org/5saCKBpgH

[21] V. Weaver, "Linux perf event features and overhead," in The 2nd International Workshop on Performance Analysis of Workload Optimized Systems, FastPath, 01 2013.

[22] gdb, "GDB: The GNU Project Debugger," 2018. [Online]. Available: https://www.gnu.org/software/gdb/

[23] Valgrind, "Valgrind," 2018. [Online]. Available: http://valgrind.org/


Pritish Narayanan received his PhD in Electrical and Computer Engineering from the University of Massachusetts Amherst. He joined IBM Research Almaden as a Research Staff Member in 2013 as part of the Storage Class Memory/MIEC project, where he investigated circuit design challenges for access devices used in 3D crosspoint memory. His current research interests are in the area of ultra-high-performance hardware systems for Artificial Intelligence and Cognitive computing, including i) novel non-Von Neumann architectures based on emerging memory, where he is the lead circuit architect for two deep learning test sites based on Phase Change Memory (PCM) and mixed-signal hardware, and ii) FPGA-based systems exploiting massive parallelism and/or approximate computing techniques. Dr. Narayanan has presented one prior keynote (International Memory Workshop 2017) and a tutorial session (Device Research Conference 2017), in addition to several invited talks. He won Best Paper Awards at the IEEE Computer Society Symposium on VLSI 2008 and at IEEE Nanoarch 2013. He has also been a Guest Editor for the Journal of Emerging Technologies in Computing, the Program Chair at IEEE Nanoarch 2015, Special Session Chair for IEEE Nano 2016, and served on the Technical Program Committees of several conferences.

Alexis Asseman received his M.S. degree in electrical engineering jointly from the Grenoble Institute of Technology in France, the Politecnico of Turin in Italy, and EPFL in Switzerland in 2016. He joined the IBM Almaden Research Center in San Jose, CA in 2017 as a Research Engineer focusing on Machine Learning and high-performance and/or energy-efficient implementations on FPGA hardware.

Charles Cox received the B.S. degree in computer engineering from the University of Illinois, Urbana, in 1983. He joined the then IBM San Jose Research Laboratory in 1984 and is currently an Advisory Engineer at the IBM Almaden Research Center, San Jose, CA. He is interested in the rapid prototyping of custom hardware to verify new algorithms amenable to hardware realizations, and has worked in the areas of image processing, communications, pattern recognition, and error correction.

Nicolas Antoine received his M.S. degree in electrical engineering jointly from the Ecole Centrale de Nantes in France and the RWTH Aachen in Germany in 2018. He joined the IBM Almaden Research Center in San Jose, CA in 2018 as a Research Engineer working on hardware-efficient implementation of deep learning algorithms on FPGA.

Harald Huels received the M.S. degree in communications engineering from the Technische Hochschule Mittelhessen, Germany in 1984. He worked for several companies as well as the University of Mainz in the field of measurement and control technology as a development engineer. After this he was self-employed and developed control systems based on microprocessors for industrial automation. He joined the IBM Storage Division in 1990. Until 1998 he worked at the Analysis Lab, responsible for automation and testing, designing complex test systems for quality assurance in disk drive production. From 1998 until 2001 he was on a working assignment at IBM San Jose, CA, responsible for development and research of new test methodologies focused on future products of IBM's disk drives. Returning from the US, he joined Engineering and Technology Services of IBM and started a research project with IBM Almaden. He was responsible for second-level packaging of the Ice-Cube project. In 2005 he joined IBM Research and Development, Boeblingen, Germany. Since then he has been working as Senior Hardware Architect for both System-p and System-z with a strong focus on packaging. His major interest is in system design utilizing the newest technologies to build servers.

Winfried Wilcke received a Ph.D. in experimental nuclear physics (1976) and worked at the University of Rochester, Lawrence Berkeley Lab and Los Alamos on heavy-ion nuclear reactions and meson physics. In 1983 he joined IBM Research in Yorktown Heights, NY, where he played a key role in creating IBM's line of very fast distributed-memory supercomputers, including the Victor (research) and the IBM SP series of supercomputers. In 1991, Wilcke co-founded HAL Computer Systems, where he was first Director of Architecture and later CTO. HAL grew to 450 people before it was purchased by Fujitsu. It was instrumental in creating, jointly with Sun Microsystems, the 64-bit SPARC architecture, which now underlies all Sun and Fujitsu SPARC systems. In 1996, Wilcke retired temporarily from R&D and embarked on a long tropical sailing/diving voyage, then rejoined IBM Research in California, where, in 2001, he founded an IBM spin-out company. Later he became senior manager of Nanoscale Science & Technology, which includes projects in quantum physics, batteries and machine learning, and especially general artificial intelligence. Wilcke is co-author of over 120 publications (3500 citations) in nuclear physics, computer architecture and energy, a book (Random Walk at Amazon) and numerous patents in computer engineering. Wilcke has been chair of several conferences, including Compcon and Hot Chips, and is the founder of the Beyond Lithium Ion conference series.

Ahmet S. Ozcan received his Ph.D. in Physics from Boston University (2003) and joined IBM in 2006 after a postdoctoral appointment. He had worked in IBM's Microelectronics division and contributed to the development of several CMOS technology nodes, which were commercialized in IBM's server products and in mobile devices through industry partnerships. In 2015, after an assignment in France on FDSOI technology development, Ahmet moved to Silicon Valley to work on brain-inspired computing and artificial intelligence. He and his team currently work on visually-grounded language and reasoning using memory-augmented neural networks, as well as developing applications for FPGA-based re-configurable hardware systems. An IBM Master Inventor, Ahmet holds over 56 patents and 85 pending patent applications. He has authored and co-authored more than 50 peer-reviewed articles and one book chapter.