
Offloading Collective Operations to Programmable Logic

Martin Swany (With thanks to Omer Arap)

Center for Research in Extreme Scale Computing (CREST)
Department of Intelligent Systems Engineering

Indiana University, Bloomington, IN

FPGAs Should Be Used As MPI Accelerators


I know MPI accelerators. We have the best MPI accelerators. - DT

Overview

• We have developed an architecture and implementation for offloading collective operations to programmable logic in the communication substrate
– Not the first to use FPGAs for collectives

• It is clear that there is a tradeoff between performance and generality
– FPGAs are at one end of that spectrum
– Some folks live at the other end

• Programmable logic is burgeoning
– Xilinx Zynq
– Intel Xeon+Altera (HARP)
– Mellanox ConnectX-4 + FPGA
– FPGAs on other NICs and in switches

Programming FPGAs

• The programmable logic provided by FPGAs is a powerful option for creating task-specific functionality for applications

• Programming with FPGAs is not easy
• Many agree: Accelerate common functionality
• Others want to make it easy to use general-purpose toolchains to deploy arbitrary kernels
– Jeff and OpenACC
– Franck’s workshop
– New NSF call for ways to make it easy

• It may not ever be easy

Things that go bump in the wire

• We have been exploring a (relatively) generic approach for scenarios in which there is programmable logic in the communication pipeline
– Addresses the problem of how to get data to and from the device

• A bump in the wire
– Rich Graham clearly got this phrase from us

• In teaching how to use FPGAs, we have found this model to be quite useful
– Students who haven’t done Verilog can get to this in a semester
– Important topic in Indiana U.’s new engineering program

• This also maps to a growing use case

Collective Offload

• Offloading collective logic is a common technique in various platforms to improve performance
• Mellanox’s CORE-Direct is one such state-of-the-art collective operation offload framework
– Defines primitive tasks: send, receive, wait, binary calculations
– Collective algorithm defines list of tasks and tasks are performed by the NIC without CPU involvement
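As a rough illustration of the "list of tasks" idea, the sketch below builds a per-rank task list from the four primitive kinds named above. It is not the CORE-Direct API: the names `TaskKind`, `Task`, and `recursive_doubling_tasks` are hypothetical, and recursive doubling is chosen only as a familiar allreduce pattern for a power-of-two communicator.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class TaskKind(Enum):
    """The four primitive task kinds named on the slide."""
    SEND = "send"
    RECV = "receive"
    WAIT = "wait"
    CALC = "binary calculation"


@dataclass(frozen=True)
class Task:
    kind: TaskKind
    peer: Optional[int] = None  # partner rank for SEND/RECV, otherwise None


def recursive_doubling_tasks(rank: int, size: int) -> List[Task]:
    """Build a hypothetical task list for one rank of an allreduce.

    ASSUMPTION: recursive doubling on a power-of-two communicator.
    In each round the rank exchanges data with the peer whose rank
    differs in one bit, waits for the exchange, then combines.
    """
    if size < 1 or size & (size - 1):
        raise ValueError("this sketch assumes a power-of-two communicator size")
    tasks: List[Task] = []
    distance = 1
    while distance < size:
        peer = rank ^ distance
        tasks.append(Task(TaskKind.SEND, peer))
        tasks.append(Task(TaskKind.RECV, peer))
        tasks.append(Task(TaskKind.WAIT))
        tasks.append(Task(TaskKind.CALC))
        distance *= 2
    return tasks


if __name__ == "__main__":
    for task in recursive_doubling_tasks(rank=0, size=4):
        print(task.kind.name, task.peer)
```

The point of the list form is that the host can hand the whole sequence to the offload device once and then stay out of the loop until the collective completes.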

• Collective offload systems have been done many times, but ours is easy to use and is a starting point

Goals of FPGA-based Offload

• Reduce collective operation latency and variance

• Provide selection of collective algorithms that are implemented in hardware

• Implement collective algorithms independent of end-to-end communication protocols

• Utilize hardware-level, protocol-independent multicasting to optimize, and when possible, redesign major collective operation algorithms

• Support collective operation offload for different classes of collective operations

NetFPGA

A line-rate, flexible, open networking platform for teaching and research

(a) NetFPGA SUME Board (b) NetFPGA SUME Block Diagram

Fig. 1. NetFPGA SUME Board and Block Diagram

board is a Xilinx Virtex-7 690T FPGA device. There are five peripheral subsystems that complement the FPGA. A high-speed serial interfaces subsystem composed of 30 serial links running at up to 13.1Gb/s. These connect four 10Gb/s SFP+ Ethernet interfaces, two expansion connectors and a PCIe edge connector directly to the FPGA. The second subsystem, the latest generation 3.0 of PCIe is used to interface between the card and the host device, allowing both register access and packet transfer between the platform and the motherboard. The memory subsystem combines both SRAM and DRAM devices. SRAM memory is devised from three 36-bit QDRII+ devices, running at 500MHz. In contrast, DRAM memory is composed of two 64-bit DDR3 memory modules running at 933MHz (1866MT/s). Storage subsystems of the design permit both a MicroSD card and external disks through two SATA interfaces. Finally, the FPGA configuration subsystem is concerned with use of the FLASH devices. Additional NetFPGA SUME features support debug, extension and synchronization of the board, as detailed later. A block diagram of the board is provided in Figure 1(b). The board is implemented as a dual-slot, full-size PCIe adapter, that can operate as a standalone unit outside of a PCIe host.

A. High-Speed Interfaces Subsystem

The High-Speed Interfaces subsystem is the main enabler of 100Gb/s designs over the NetFPGA SUME board. This subsystem includes 30 serial links connected to Virtex-7 GTH transceivers, which can operate at up to 13.1Gb/s. The serial links are divided into four main groups; while the first group connects four serial links to four SFP+ Ethernet Interfaces, the second one, associates ten serial links to an FMC connector. Additional eight links are connected to a SAMTEC QTH-DP connector and are intended for passing traffic between multiple boards. The last eight links connect to the PCIe subsystem (Section IV-C).

The decision to use an FPGA version that supports only GTH transceivers rather than the one with GTZ transceivers, reaching 28.05Gb/s, arises as a trade-off between transceiver speed and availability of memory interfaces. An FPGA with GTZ transceivers allows multiple 100Gb/s ports, but lacks the I/O required by memory interfaces, making a packet buffering design of 40Gb/s and above infeasible.

There are also four motives that support our decision to use SFP+ Ethernet ports over CFP. Firstly, as the board is intended to be a commodity board, it is very unlikely that the main users like researchers and academia will be able to afford multiple CFP ports. Secondly, 10Gb/s equipment is far more common than 100Gb/s equipment; this provides a simpler debug environment and allows inter-operability with other commodity equipment (e.g. deployed routers, traffic generation NICs). In addition, SFP+ modules also support 1Gb/s operation. The third, CFP modules protrude the board at over twice the depth of SFP+; CFP use would have required either removing other subsystems from the board or not complying with PCIe adapter cards form factor. Lastly, being an open source platform, NetFPGA is using only open source FPGA cores or the cores available through the Xilinx XUP program. As a CAUI-10 core is currently unavailable, it can not be made the default network interface of the board.

A typical 100Gb/s application can achieve the required bandwidth by assembling an FMC daughter board. For example, the four SFP+ on board together with Faster Technology’s³ octal SFP+ board create a 120G system. Alternatively, native 100Gb/s port can be used by assembling a CFP FMC daughter board.

B. Memory Subsystem

DRAM memory subsystem contains two SoDIMM modules, supporting up to 16GB⁴ of memory running at 1866MT/s. Two 4GB DDR3-SDRAM modules are supplied with the card and are officially supported by Xilinx MIG cores. Users can choose to supplement or replace these with any other SoDIMM

³ http://www.fastertechnology.com/
⁴ 8GB is the maximum density per module defined by JEDEC standard no. 21C-4.20.18-R23B

Xilinx Zynq

All Programmable SoC family—the Zynq Z-7100—with enhanced DSP resources in the FPGA fabric. All five Zynq devices are optimized for specific combinations of system power, cost, and size.

Xilinx is leading the industry to usher in the trend to smarter systems with application-focused solutions for smarter networks, data centers, and vision-based systems. These solutions build on the Zynq-7000 All Programmable SoC with an ever expanding portfolio of building blocks for smarter systems called SmartCORE™ IP, a new generation of design tools called Vivado™ that includes the ability to design at higher abstraction levels, a variety of application design kits, and system-level expertise to help the rapid design and implementation of smarter systems.

Zynq: A Generation Ahead

Xilinx Zynq-7000 All Programmable SoCs are a generation ahead of alternatives and the smartest solution to a wide range of system-design problems in all markets, across the entire application spectrum. Here are 9 reasons why this is true:

Reason 1: Most efficient ARM+FPGA for analytics and control

A 1GHz, dual-core, hardened implementation of the ARM Cortex-A9 MPCore microprocessor sits at the heart of each Zynq All Programmable SoC. The two ARM processors communicate with on-chip memory, SDRAM and Flash memory controllers, and peripheral blocks through the ARM AMBA AXI-based interconnect. Together, these hardened blocks constitute the Zynq-7000 All Programmable SoC’s Processor System (PS).

The on-chip PS is attached to the Zynq device’s on-chip Programmable Logic (PL) through multiple ARM AMBA AXI ports, creating extremely efficient coupling between these two key components of the Zynq architecture. There are two 32-bit AXI master interfaces; two 32-bit AXI slave interfaces; four 64-bit configurable and buffered, high-performance AXI slave interfaces; and one 64-bit AXI ACP (Accelerator Coherency Port) interface. That’s nine AXI interfaces in total linking the Zynq PS to the PL.

The number and size of these ARM AXI PS-PL connections is a critical architectural choice—a choice based on a careful consideration of the bandwidth requirements of the Zynq PS. The four 64/32-bit configurable, high-performance AXI ports provide the PL with direct, high-speed access to the Zynq-7000 All Programmable SoC’s on-chip memory and SDRAM controller through four independent 1Kbyte FIFO buffers. In this way, several separate hardware accelerators implemented in the Zynq PL can have independent, high-speed access to a Zynq-based system’s main memories. If that access needs to be coherent with the on-chip caches, then the accelerators implemented in the PL can employ the 64-bit ACP connection, which is directly attached to the ARM Cortex-A9 MPCore processor’s snoop control unit.

[Figure: The Zynq-7000 All Programmable SoC block diagram, from the Xilinx backgrounder section "The Zynq-7000 SoC Family: Five Members Spanning a Range of Applications". Processing System: two Cortex-A9 MPCore cores, each with 32/32 KB I/D caches and a NEON DSP/FPU engine; snoop control unit; 512 Kbyte L2 cache; 256 Kbyte on-chip memory; general interrupt controller, watchdog timer, DMA, timers, configuration; ARM CoreSight multi-core debug and trace; multiport DRAM controller (DDR3, DDR3L, DDR2); flash controller (NOR, NAND, SRAM, Quad SPI); processor I/O mux with 2x I2C, 2x SPI, 2x CAN, 2x UART, GPIO, 2x SDIO with DMA, 2x USB with DMA, 2x GigE with DMA; AMBA interconnect. Programmable Logic (system gates, DSP, RAM): XADC (2x ADC, mux, thermal sensor), PCIe Gen2 (1-8 lanes), security (AES, SHA, RSA), multi-gigabit transceivers, multi-standard I/Os (3.3V and high-speed 1.8V). The two sides are linked by EMIO, the ACP, general-purpose AXI ports, and high-performance AXI ports.]

Zedboard & Ethernet FMC

Collective Offload Design

• Ease implementation with the “bump in the wire” model

• Host offloads collective to the Programmable Logic (PL) by sending a specially crafted UDP packet
– Simplest solution

• Host blocks until the PL generates a release message, which is also a UDP packet
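The host side of this exchange can be sketched with ordinary UDP sockets. This is a minimal illustration, not the authors' host library: the function name `offload_collective`, the timeout, and the reply size limit are assumptions, and the request and release payloads are treated as opaque bytes.

```python
import socket
from typing import Tuple


def offload_collective(
    sock: socket.socket,
    pl_addr: Tuple[str, int],
    request: bytes,
    timeout_s: float = 1.0,
    max_reply: int = 2048,
) -> bytes:
    """Send one offload request and block until the release message arrives.

    ASSUMPTIONS: `pl_addr` is wherever the programmable logic listens,
    and any datagram received on `sock` is taken to be the release
    message. A real client would also check the reply's header fields.
    """
    sock.settimeout(timeout_s)         # avoid blocking forever if the reply is lost
    sock.sendto(request, pl_addr)      # the specially crafted UDP packet
    reply, _sender = sock.recvfrom(max_reply)  # blocks until release or timeout
    return reply


if __name__ == "__main__":
    # Loopback demonstration with a stand-in for the PL.
    pl = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    pl.bind(("127.0.0.1", 0))
    host = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    host.bind(("127.0.0.1", 0))
    pl.sendto(b"release", host.getsockname())  # queued before the host asks
    print(offload_collective(host, pl.getsockname(), b"request"))
    host.close()
    pl.close()
```

Because UDP gives no delivery guarantee, the timeout is the only thing standing between a lost release message and a hung rank, which is one reason a sketch like this needs one.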

Collective Offload Packet

Bit offsets within each row: 0..3 4..7 8..11 12..15 16..19 20..23 24..27 28..31 32..35 36..39 40..43 44..47 48..51 52..55 56..59 60..63 (one 64-bit word per row)

dst_MAC | src_MAC_1
src_MAC_2 | type | ver | IHL | Diff_Serv
Total_Length | Identification | flags | fragment_offset | TTL | Protocol
Hdr_Cksum | src_IP | dst_IP_1
dst_IP_2 | UDP_Source_Port | UDP_Dest_Port | Length
UDP_checksum | coll_id | comm_size | coll_type
algo_type | node_type | msg_type | rank
local_rank_count | operation | data_type | count
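The offload fields that follow the UDP checksum can be packed and parsed with Python's `struct` module. The field names and their order are taken from the layout; the field widths are not stated, so this sketch assumes each of the eleven fields is an unsigned 16-bit integer in network byte order. That assumption is at least consistent with the layout, since three fields share the 48 bits left after `UDP_checksum` and four fields share each of the next two 64-bit rows. `CollectiveHeader`, `pack_header`, and `unpack_header` are illustrative names.

```python
import struct
from collections import namedtuple

# Field names and order as they appear in the packet layout.
FIELDS = (
    "coll_id", "comm_size", "coll_type",
    "algo_type", "node_type", "msg_type", "rank",
    "local_rank_count", "operation", "data_type", "count",
)

# ASSUMPTION: every field is an unsigned 16-bit integer, network byte order.
_FORMAT = "!" + "H" * len(FIELDS)
HEADER_SIZE = struct.calcsize(_FORMAT)

CollectiveHeader = namedtuple("CollectiveHeader", FIELDS)


def pack_header(header: CollectiveHeader) -> bytes:
    """Serialize the offload fields into the start of the UDP payload."""
    return struct.pack(_FORMAT, *header)


def unpack_header(payload: bytes) -> CollectiveHeader:
    """Parse the offload fields from the start of a UDP payload."""
    return CollectiveHeader(*struct.unpack_from(_FORMAT, payload))


if __name__ == "__main__":
    # The numeric codes below are made up for the demonstration.
    hdr = CollectiveHeader(
        coll_id=1, comm_size=8, coll_type=2, algo_type=0, node_type=1,
        msg_type=0, rank=3, local_rank_count=1, operation=4,
        data_type=2, count=16,
    )
    wire = pack_header(hdr)
    print(HEADER_SIZE, unpack_header(wire) == hdr)
```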

AXI-Based Platforms

• Advanced eXtensible Interface (AXI) standards define intermodule communication in programmable logic

• Widely utilized in recent Xilinx FPGA designs
– Recent designs depend on AXI-based protocols

• AXIS (AXI-Stream) is for streaming data between modules
– We employ the AXIS protocol in our design

• Support for multiple ranks per host
– Provides insight into how our design scales

Implementation

• Modules inherited from the NetFPGA package
– Input Arbiter, Output Queues
• Modules from Xilinx IP catalog
– AXI Ethernet, AXI DMA, AXI Width Converter, FPU, AXIS Data FIFO
• New Modules
– Stream Separator
– Stream Aggregator
– Collective Forwarding Engine (CFE)
– Collective Instruction Processing Engine (CIPE)
• Implements MPI_Allreduce, MPI_Reduce, MPI_Barrier, MPI_Scan, MPI_Alltoall…

Data Path

[Figure: data path block diagram. Input side: four AXI Ethernet ports and one AXI DMA port, each paired with a Stream Aggregator and a Width Converter (32-bit data and 32-bit control streams, 32-bit to 64-bit). Core: Input Arbiter, CFE (Collective Forwarding Engine), CIPE (Collective Instruction Processing Engine), Output Queues, connected by 64-bit streams. Output side: four AXI Ethernet ports and one AXI DMA port, each paired with a Width Converter and a Stream Separator (64-bit to 32-bit, 32-bit data and 32-bit control streams).]

Collective Forwarding Engine (CFE)
– Heart of the design
– Responsible for:
  • Assessing the offload request header fields
  • Detecting the collective operation and algorithm to run
  • Performing state transitions and saving the state for algorithms
  • Making forwarding decisions based on the state
  • Determining the instructions which the CIPE will apply
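A software analogue of these responsibilities might look like the sketch below: per-collective state is kept between packets, and each arriving packet produces an instruction for the next stage. This is an assumed, simplified state machine for a single reduction point, not the CFE's actual logic; `ForwardingEngine`, `Instr`, and the `expected` parameter are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Instr(Enum):
    """Hypothetical instructions handed to the processing stage."""
    FORWARD = "forward packet unchanged"
    BUFFER = "store packet in a buffer"
    REDUCE_AND_BUFFER = "combine packet with buffer, keep the result"
    REDUCE_AND_FORWARD = "combine packet with buffer, forward the result"


@dataclass
class CollectiveState:
    expected: int      # contributions this node waits for
    received: int = 0  # contributions seen so far


class ForwardingEngine:
    """Tracks state per collective id and picks an instruction per packet."""

    def __init__(self) -> None:
        self._states: Dict[int, CollectiveState] = {}

    def on_packet(self, coll_id: int, expected: int) -> Instr:
        # Look up (or create) the saved state for this collective.
        state = self._states.setdefault(coll_id, CollectiveState(expected))
        state.received += 1
        if state.expected == 1:
            # Nothing to combine with: pass the packet straight through.
            del self._states[coll_id]
            return Instr.FORWARD
        if state.received == 1:
            return Instr.BUFFER
        if state.received < state.expected:
            return Instr.REDUCE_AND_BUFFER
        # Last contribution: finish the reduction and release the state.
        del self._states[coll_id]
        return Instr.REDUCE_AND_FORWARD


if __name__ == "__main__":
    engine = ForwardingEngine()
    for _ in range(3):
        print(engine.on_packet(coll_id=7, expected=3).name)
```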

Collective Instruction Processing Engine (CIPE)

• CFE attaches 64-bit instruction header
• CIPE processes the instruction
• Example Instructions
– Forward packet, do nothing
– Buffer packet in specified buffer
– Perform reduction operations on incoming packet and specified buffers
  • Buffer or not buffer the result
– Perform reduction operations on specified buffers and ignore incoming packet
  • Buffer or not buffer the result
– Ignore incoming packet and forward data on specified buffer


• Interfaces with FPU and FIFO cores
– FPU cores for streaming and applying floating point arithmetic
– FIFO cores for streaming intermediate results of single cycle arithmetic operations
  • MIN, MAX, BOR, LOR, BAND, LAND, BXOR, LXOR, Integer SUM
• Internal BRAM buffers
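The example instructions and the list of reduction operations can be mimicked in software with a small interpreter over a set of buffers. This is a behavioral sketch only: the `Instruction` fields, the instruction kind strings, the choice of four buffers, and the rule that a result is either stored or forwarded (never both) are assumptions, and packets are modeled as lists of integers rather than a 64-bit stream.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Operation names come from the slide; the Python semantics are assumed.
REDUCTIONS: Dict[str, Callable[[int, int], int]] = {
    "MIN": min,
    "MAX": max,
    "SUM": operator.add,
    "BOR": operator.or_,
    "BAND": operator.and_,
    "BXOR": operator.xor,
    "LOR": lambda a, b: int(bool(a) or bool(b)),
    "LAND": lambda a, b: int(bool(a) and bool(b)),
    "LXOR": lambda a, b: int(bool(a) != bool(b)),
}


@dataclass(frozen=True)
class Instruction:
    kind: str                   # forward | buffer | reduce_packet | reduce_buffers | forward_buffer
    op: Optional[str] = None    # key into REDUCTIONS for the reduce kinds
    srcs: Tuple[int, ...] = ()  # buffers to read
    dst: Optional[int] = None   # buffer to write, if the result is kept


class InstructionProcessor:
    def __init__(self, n_buffers: int = 4) -> None:
        self.buffers: List[Optional[List[int]]] = [None] * n_buffers

    def execute(self, instr: Instruction,
                packet: Optional[List[int]] = None) -> Optional[List[int]]:
        """Apply one instruction; return the data to forward, or None."""
        if instr.kind == "forward":
            return packet
        if instr.kind == "buffer":
            self.buffers[instr.dst] = list(packet)
            return None
        if instr.kind == "forward_buffer":
            return list(self.buffers[instr.srcs[0]])
        operands = [self.buffers[i] for i in instr.srcs]
        if instr.kind == "reduce_packet":
            operands.append(packet)        # include the incoming packet
        elif instr.kind != "reduce_buffers":
            raise ValueError(f"unknown instruction kind: {instr.kind}")
        op = REDUCTIONS[instr.op]
        result = list(operands[0])
        for other in operands[1:]:         # element-wise reduction
            result = [op(a, b) for a, b in zip(result, other)]
        if instr.dst is not None:          # "buffer the result"
            self.buffers[instr.dst] = result
            return None
        return result                      # "not buffer": forward it


if __name__ == "__main__":
    cipe = InstructionProcessor()
    cipe.execute(Instruction("buffer", dst=0), [1, 5, 3])
    print(cipe.execute(Instruction("reduce_packet", op="SUM", srcs=(0,)), [10, 20, 30]))
```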

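[Figure: CIPE block diagram. The CFE feeds the Instruction Processor, which is connected to four internal buffers (Buffer 1 to Buffer 4), four FPU cores (FPU 1 to FPU 4, each with two operand inputs OpN_1 and OpN_2 and a result RSLTN), and four FIFO cores (FIFO 1 to FIFO 4); output goes to the Output Queues.]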


• Fully pipelined design
– FPU cores can be configured to produce results in 1 cycle
  • Due to observed instabilities, we increased the first result cycle to 11 (the maximum)
– Each extra FPU core introduces an extra cycle
– Each FIFO core employed introduces an extra cycle


Evaluation

• Experimental Setup
– 8 Zynq Zedboards with Ethernet FMC adapters
  • Directly connected
• Come see it at SC

Zynq Cluster – MPI_Allreduce

ARM Host vs PL Offloaded (Msg Size = 8 byte), times in usecs

Communicator Size | ARM Ave | ARM Min | PL Ave | PL Min | Speedup Ave | Speedup Min | ARM Diff | PL Diff | ARM % Diff | PL % Diff
8 | 474.418 | 344.875 | 264.001 | 254.875 | 1.797031072 | 1.353114272 | 129.543 | 9.126 | 27.30566715 | 3.456805088
16 | 651.901 | 481.688 | 288.386 | 277.125 | 2.26051542 | 1.738161479 | 170.213 | 11.261 | 26.11025294 | 3.90483588
32 | 1413.432 | 1110.344 | 396.775 | 353.312 | 3.562301052 | 3.142672765 | 303.088 | 43.463 | 21.44340867 | 10.95406717
64 | 3247.172 | 2668.188 | 529.798 | 472.109 | 6.12907561 | 5.651635533 | 578.984 | 57.689 | 17.83040751 | 10.88886708
128 | 8274.137 | 7020.398 | 782.918 | 742.266 | 10.56833155 | 9.458062204 | 1253.739 | 40.652 | 15.15250473 | 5.192370082

[Figure: ARM Host vs PL Offloaded (Msg Size = 8 byte). Series: Speedup Ave, Speedup Min, ARM Ave, ARM Min, PL Ave, PL Min. Axes: Speedup and Time (usecs) vs. Communicator Size (8 to 128).]

ARM Host vs PL Offloaded (Msg Size = 128 byte), times in usecs

Communicator Size | ARM Ave | ARM Min | PL Ave | PL Min | Speedup Ave | Speedup Min | ARM Diff | PL Diff | ARM % Diff | PL % Diff
16 | 669.094 | 506.75 | 291.758 | 281.625 | 2.293318435 | 1.799378606 | 162.344 | 10.133 | 24.26325748 | 3.473083857
32 | 1501.446 | 1195.062 | 404.684 | 369.625 | 3.710168922 | 3.233174163 | 306.384 | 35.059 | 20.40592868 | 8.663302725
64 | 3423.248 | 2827.922 | 540.836 | 509.312 | 6.329549068 | 5.552435442 | 595.326 | 31.524 | 17.39067692 | 5.828754003
128 | 8722.401 | 7200.812 | 795.331 | 760.242 | 10.96700745 | 9.471736631 | 1521.589 | 35.089 | 17.44461187 | 4.411873798

[Figure: ARM Host vs PL Offloaded (Msg Size = 128 byte). Series: Speedup Ave, Speedup Min, ARM Ave, ARM Min, PL Ave, PL Min. Axes: Speedup and Time (usecs) vs. Communicator Size (8 to 128).]

ARM Host vs PL Offloaded (Msg Size = 1024 Byte), times in usecs

Communicator Size | ARM Ave | ARM Min | PL Ave | PL Min | Speedup Ave | Speedup Min | ARM Diff | PL Diff | ARM % Diff | PL % Diff
16 | 798.617 | 601.875 | 328.766 | 320.75 | 2.429135008 | 1.876461419 | 196.742 | 8.016 | 24.63533834 | 2.438208331
32 | 1782.801 | 1388.562 | 457.331 | 414.969 | 3.898272805 | 3.346182486 | 394.239 | 42.362 | 22.11346078 | 9.262875248
64 | 4033.201 | 3207.953 | 608.148 | 580.578 | 6.631939923 | 5.525447054 | 825.248 | 27.57 | 20.46136555 | 4.53343594
128 | 9964.274 | 8374.555 | 942.519 | 911.148 | 10.57196088 | 9.191212624 | 1589.719 | 31.371 | 15.95418793 | 3.328420966

[Figure: ARM Host vs PL Offloaded (Msg Size = 1024 Byte). Series: Speedup Ave, Speedup Min, ARM Ave, ARM Min, PL Ave, PL Min. Axes: Speedup and Time (usecs) vs. Communicator Size (8 to 128).]
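The derived columns in the MPI_Allreduce data (Speedup, Diff, % Diff) appear to follow directly from the four measured columns. The sketch below states that relationship as code; the formulas are inferred from the numbers rather than given on the slides, and `derived_columns` is an illustrative name. Recomputing the 8-rank, 8-byte row this way reproduces the tabulated values.

```python
from typing import Dict


def derived_columns(arm_ave: float, arm_min: float,
                    pl_ave: float, pl_min: float) -> Dict[str, float]:
    """Recompute the derived columns from the measured times (usecs).

    ASSUMPTION (inferred from the data, not stated on the slides):
      Speedup = ARM time / PL time
      Diff    = Ave - Min
      % Diff  = 100 * Diff / Ave
    """
    arm_diff = arm_ave - arm_min
    pl_diff = pl_ave - pl_min
    return {
        "speedup_ave": arm_ave / pl_ave,
        "speedup_min": arm_min / pl_min,
        "arm_diff": arm_diff,
        "pl_diff": pl_diff,
        "arm_pct_diff": 100.0 * arm_diff / arm_ave,
        "pl_pct_diff": 100.0 * pl_diff / pl_ave,
    }


if __name__ == "__main__":
    # Communicator size 8, message size 8 bytes.
    row = derived_columns(474.418, 344.875, 264.001, 254.875)
    for name, value in row.items():
        print(f"{name}: {value:.6f}")
```

Read this way, the % Diff columns are a measure of variance: the gap between average and minimum time is roughly 15 to 27 percent of the average on the ARM host, against roughly 2 to 11 percent for the PL offload.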

Zynq Cluster – MPI_Allreduce

[Figure: Ave, Min Difference: ARM vs PL (Msg Size = 8B). Series: ARM % Diff, PL % Diff, ARM Diff, PL Diff. Axes: Percentage and Time (usecs) vs. Communicator Size (8 to 128).]

[Figures: two further panels of the same kind, Percentage and Time (usecs) vs. Communicator Size (8 to 128).]


Zynq Cluster – MILC

MILC - Clover Dynamical

Communicator Size | % speedup (min results) | % speedup (average results)
8 | 6.122437884 | 6.293399464
16 | 8.784307057 | 9.093066621
32 | 12.65767238 | 12.12946882
64 | 18.98636656 | 18.83355062

[Figure: MILC - Clover Dynamical. Series: % speedup (min results), % speedup (average results). Axes: Percentage vs. Communicator Size (8 to 64).]

MILC - Schroed_cl_inv

Communicator Size | % speedup (min results) | % speedup (average results)
8 | 11.16 | 11.12
16 | 20.19 | 20.84
32 | 31.36 | 32.82
64 | 39.66 | 40.93

[Figure: MILC - Schroed_cl_inv. Series: % speedup (min results), % speedup (average results). Axes: Percentage vs. Communicator Size (8 to 64).]

Conclusions and Future Work

• Empirical validation of the efficacy of collective offload

• Promising performance
• Increasing availability of programmable logic stands to increase impact
• The use of UDP allowed straightforward implementation, but is limiting
• Clearly, this is the right way to use FPGAs
