


Multi-Hybrid Accelerated Computing Platform
University of Tsukuba | Center for Computational Sciences
https://www.ccs.tsukuba.ac.jp/

• Combining the strengths of different types of accelerators: GPU + FPGA
• The GPU is still an essential accelerator for simple, large-degree parallelism, providing ~10 TFLOPS of peak performance
• The FPGA is a new type of accelerator that offers application-specific hardware with programmability and gains speed from pipelined computation
• The FPGA is also good at external communication, with an advanced high-speed interconnect of up to 100 Gbps x 4 channels

Supercomputer at CCS: Cygnus

OpenCL-ready High Speed FPGA Networking [1]

[Figure: overall network configuration. The InfiniBand EDR network (100 Gbps x4 per node, 4 ports with full bisection bandwidth) connects all computation nodes, both Albireo and Deneb. It is the ordinary inter-node communication channel for the CPUs and GPUs, but the FPGAs can also request communication over it. In addition, the FPGAs on the Albireo nodes are connected by their own inter-FPGA direct network.]

• Our new supercomputer "Cygnus"
  - Operation started in May 2019
  - 2x Intel Xeon CPUs, 4x NVIDIA V100 GPUs, and 2x Intel Stratix 10 FPGAs per node (FPGAs on Albireo nodes only)
  - Deneb: 48 CPU+GPU nodes
  - Albireo: 32 CPU+GPU+FPGA nodes with a dedicated 2D-torus network for the FPGAs (100 Gbps x4)

[Figure: Albireo node (x32) and Deneb node (x48).]

Specification of Cygnus
(Target GPU: NVIDIA Tesla V100, Target FPGA: Nallatech 520N)

Peak performance: 2.4 PFLOPS DP (GPU: 2.2 PFLOPS, CPU: 0.2 PFLOPS, FPGA: 0.6 PFLOPS SP), further enhanced by mixed-precision and variable-precision arithmetic on the FPGAs
# of nodes: 80 (32 Albireo (GPU+FPGA) nodes, 48 Deneb (GPU-only) nodes)
Memory: 192 GiB DDR4-2666 per node = 256 GB/s; 32 GiB x4 for the GPUs per node = 3.6 TB/s
CPU / node: Intel Xeon Gold (Skylake) x2 sockets
GPU / node: NVIDIA V100 x4 (PCIe)
FPGA / node: Intel Stratix 10 x2 (100 Gbps x4 links per FPGA, i.e. x8 links per node)
Global file system: Lustre, RAID6, 2.5 PB
Interconnection network: Mellanox InfiniBand HDR100 x4 (two HDR200 cables per node), 4 TB/s aggregated bandwidth
Programming languages: CPU: C, C++, Fortran, OpenMP; GPU: OpenACC, CUDA; FPGA: OpenCL, Verilog HDL
System vendor: NEC
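As a rough cross-check (assuming the 4 TB/s figure is the aggregate over all 80 nodes of their four HDR100 ports each):

$80 \times 4 \times 100\,\mathrm{Gbps} = 32\,\mathrm{Tbps} = 4\,\mathrm{TB/s}$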

• FPGA design plan
  - Router
    - Mandatory for the dedicated inter-FPGA network
    - Forwards packets to their destinations
  - User Logic
    - The OpenCL kernel runs here
    - Inter-FPGA communication can be controlled from the OpenCL kernel
  - SL3 (Serial Lite III, an Intel FPGA IP)
    - Includes the transceiver modules for inter-FPGA data transfer
    - Users do not need to care about this layer

[Figure: single-node block diagrams. In each node, the CPUs, GPUs, FPGAs, and HCAs are attached through PCIe switches; the HCAs connect to the InfiniBand network switches (100 Gbps x2), and on the nodes with FPGAs the FPGAs additionally connect to the inter-FPGA direct network (100 Gbps x4). The "single node (without FPGA)" diagram is identical except that the FPGAs and their inter-FPGA links are absent.]

Inter-FPGA direct network (Albireo nodes only)
• The 64 FPGAs on the Albireo nodes are connected directly in a 2D-torus configuration, without any Ethernet switch.

Cluster System with FPGAs

// Sender code on FPGA1 and receiver code on FPGA2. simple_out and simple_in
// are CIRCUS channels connected to the inter-FPGA network (their declarations
// are omitted here).
#pragma OPENCL EXTENSION cl_intel_channels : enable

__kernel void sender(__global float* restrict x, int n) {
    for (int i = 0; i < n; i++) {
        float v = x[i];
        write_channel_intel(simple_out, v);      // stream each value into the network
    }
}

__kernel void receiver(__global float* restrict x, int n) {
    for (int i = 0; i < n; i++) {
        float v = read_channel_intel(simple_in); // read directly from the network
        x[i] = v;
    }
}

Communication Integrated Reconfigurable CompUting System (CIRCUS)
• CIRCUS enables OpenCL code to communicate with other FPGAs on different nodes
• It extends Intel's channel mechanism to external communication
• Communication works in a pipelined manner: data is sent from, and received into, the compute pipeline directly

• I/O channel: connects OpenCL kernels with peripherals; we use this feature
• Channel extension: transfers data between kernels directly (low latency and high bandwidth); a multiple-kernel design can exploit spatial parallelism in an FPGA (see the sketch below)
• [Figure: communication without channels vs. communication with channels]

FPGA-based parallel computing with OpenCL
• Needs a communication system suited to OpenCL and Intel FPGAs
• Uses the Intel FPGA SDK for OpenCL
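A minimal sketch of the channel extension itself, independent of CIRCUS: two single-work-item kernels on the same FPGA connected by an on-chip channel. The kernel and channel names are illustrative only, not taken from CIRCUS.

#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float stage_ch;   // on-chip FIFO between the two kernels (illustrative name)

__kernel void producer(__global const float* restrict in, int n) {
    for (int i = 0; i < n; i++)
        write_channel_intel(stage_ch, in[i] * 2.0f);   // first pipeline stage
}

__kernel void consumer(__global float* restrict out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = read_channel_intel(stage_ch) + 1.0f;  // second pipeline stage
}

Both kernels run concurrently on the device, so data never round-trips through global memory; CIRCUS extends this idea so that a channel endpoint can be the inter-FPGA network instead of another kernel.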

[Figure: CIRCUS software stack (OpenCL kernels on top of the CIRCUS layer and its backends); the sender code runs on FPGA1 and the receiver code on FPGA2.]

[Figure: pipelined communication experiment with our proposed method; the Recv., Comp., and Send stages are overlapped in the pipeline, reaching 90.7 Gbps.]

Authentic Radiation Transfer [2]
• Accelerated Radiative transfer on grids using Oct-Tree (ARGOT) has been developed at the Center for Computational Sciences, University of Tsukuba
• ART is one of the algorithms used in ARGOT and is the dominant part (90% or more of the computation time) of the ARGOT program
• ART is a ray-tracing-based algorithm
  - The problem space is divided into meshes, and reactions are computed on each mesh
  - The memory access pattern depends on the ray direction
  - Not suitable for SIMD architectures (see the illustrative sketch below)
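To illustrate why the access pattern is direction-dependent (an illustrative sketch only, not the actual ART/ARGOT code): a ray marching through a row-major (nx, ny, nz) mesh touches memory with a stride determined by its direction, so different rays produce different, irregular access streams.

// Illustrative only: accumulate a quantity along one ray through the mesh.
// The index stride per step depends on the ray direction (dx, dy, dz),
// which makes the access pattern irregular for SIMD hardware, while an
// FPGA pipeline can tolerate it.
float trace_ray(__global const float* restrict mesh,
                int nx, int ny, int nz,
                int x, int y, int z, int dx, int dy, int dz) {
    float sum = 0.0f;
    while (x >= 0 && x < nx && y >= 0 && y < ny && z >= 0 && z < nz) {
        sum += mesh[(z * ny + y) * nx + x];   // stride = dx + dy*nx + dz*nx*ny per step
        x += dx; y += dy; z += dz;            // step to the next cell on this ray
    }
    return sum;
}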

[Figure: ART performance in M mesh/s versus mesh size, from (16,16,16) to (128,128,128), for CPU (14 cores), CPU (28 cores), P100 (x1), and FPGA; higher is better. The measured values are listed in Table 3 below.]

[The following tables and text are excerpted from the ART FPGA implementation paper [2].]

Table 2: Resource usage and clock frequency of the implementation.

size            # of PEs   ALMs (%)        Registers (%)   M20K (%)   MLAB    DSP (%)    Freq. [MHz]
(16, 16, 16)    (2, 2, 2)  132,283 (31%)   267,828 (31%)   739 (27%)  14,310  312 (21%)  193.2
(32, 32, 32)    (2, 2, 2)  169,882 (40%)   344,447 (40%)   796 (29%)  21,100  312 (21%)  173.8
(64, 64, 64)    (2, 2, 2)  169,549 (40%)   344,512 (40%)   796 (29%)  21,250  312 (21%)  167.0
(128, 128, 128) (2, 2, 2)  169,662 (40%)   344,505 (40%)   796 (29%)  21,250  312 (21%)  170.4

Table 3: Performance comparison between FPGA, CPU and GPU implementations. The unit is M mesh/sec.

Size            CPU(14C)   CPU(28C)   P100     FPGA
(16,16,16)      112.4      77.2       105.3    1282.8
(32,32,32)      158.9      183.4      490.4    1165.2
(64,64,64)      175.0      227.2      1041.4   1111.0
(128,128,128)   95.4       165.0      1116.1   1133.5

per link) multiple interconnection links (up to 4 channels) on it. Additionally, an HLS programming environment such as OpenCL is provided, and there are several types of research that apply them to FPGA computing. In [3], Kobayashi et al. show the basic features for utilizing the high-speed interconnect of an FPGA driven by OpenCL kernels. Therefore, although the performance of our implementation is almost the same as that of an NVIDIA P100 GPU, the overall performance on multiple directly connected FPGA-equipped computation nodes can easily overcome the GPU implementation, which requires host-CPU control and kernel switching for inter-node communication. The networking overhead on FPGAs is much lower than that on GPUs. Improving the current ART implementation with such an interconnect feature on the FPGA is our next step toward high-performance parallel FPGA computing.

8. CONCLUSION
In this paper, we optimized the ART method used in the ARGOT program, which solves a fundamental calculation of the space radiative transfer phenomenon in the early-stage universe, on an FPGA using the Intel FPGA SDK for OpenCL. We parallelized the algorithm within an FPGA using the SDK's channel extension. We achieved 4.89 times faster performance than the CPU implementation using OpenMP, as well as almost the same performance as the GPU implementation using CUDA.
Although the performance of the FPGA implementation is comparable to an NVIDIA P100 GPU, there is room for improvement. The most important optimization is resource optimization: if we can implement a larger number of PEs than at present, we can improve performance. However, it is difficult for us to reduce the usage of ALMs and registers because we do not describe them directly in the OpenCL code. Not only resources but also frequency matters; we suppose an Arria 10 OpenCL design can run at 200 MHz or higher.
We will implement the network functionality in the ART design to parallelize it across multiple FPGAs. We consider networking with FPGAs an important feature for parallel applications using FPGAs. Although GPUs have higher computational performance (FLOPS) and higher memory bandwidth than FPGAs, I/O, including networking, is a weak point for GPUs because they are connected to NICs through the PCIe bus. In addition to networking, we will try to run our code on the Stratix 10 FPGA, the next-generation Intel FPGA. We expect to be able to implement more PEs than on the Arria 10 FPGA because it has 2.2 times more ALM blocks and 3.8 times more DSP blocks.

9. REFERENCES
[1] K. M. Górski, E. Hivon, A. J. Banday, B. D. Wandelt, F. K. Hansen, M. Reinecke, and M. Bartelmann. HEALPix: A framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal, 622(2):759, 2005.
[2] K. Hill, S. Craciun, A. George, and H. Lam. Comparative analysis of OpenCL vs. HDL with image-processing kernels on Stratix-V FPGA. In 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 189-193, July 2015.
[3] R. Kobayashi, Y. Oobata, N. Fujita, Y. Yamaguchi, and T. Boku. OpenCL-ready high speed FPGA network for reconfigurable high performance computing. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2018, pages 192-201, New York, NY, USA, 2018. ACM.
[4] Y. Luo, X. Wen, K. Yoshii, S. Ogrenci-Memik, G. Memik, H. Finkel, and F. Cappello. Evaluating irregular memory access on OpenCL FPGA platforms: A case study with XSBench. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), pages 1-4, Sept 2017.
[5] T. Okamoto, K. Yoshikawa, and M. Umemura. ARGOT: accelerated radiative transfer on grids using oct-tree. Monthly Notices of the Royal Astronomical Society, 419(4):2855-2866, 2012.
[6] S. Tanaka, K. Yoshikawa, T. Okamoto, and K. Hasegawa. A new ray-tracing scheme for 3D diffuse radiation transfer on highly parallel architectures. Publications of the Astronomical Society of Japan, 67(4):62, 2015.
[7] H. R. Zohouri, N. Maruyama, A. Smith, M. Matsuda, and S. Matsuoka. Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16, pages 35:1-35:12, Piscataway, NJ, USA, 2016. IEEE Press.

[Figure: a (16x16x16) mesh decomposed into (8x8x8) sub-blocks.]

• The problem space is divided into small blocks, e.g. (16, 16, 16) → 8 x (8, 8, 8)
• A PE is assigned to each small block
• PEs are connected to each other by channels
• PE: Processing Element, BE: Boundary Element
• The PE and BE kernels are started automatically by the autorun attribute (see the sketch below)
• This lowers control overhead and resource usage because it decreases the number of host-controlled kernels
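A minimal sketch (not the actual ART kernels) of the autorun pattern described above, assuming a (2, 2, 2) array of PEs and a hypothetical channel array pe_link for neighbour communication: each replicated kernel starts automatically when the FPGA is configured and exchanges data only through channels, so the host does not have to launch or manage the PEs.

#pragma OPENCL EXTENSION cl_intel_channels : enable

// Hypothetical channels between neighbouring PEs (wiring shown schematically).
channel float pe_link[2][2][2][2];

__attribute__((autorun))                     // started automatically with the bitstream
__attribute__((max_global_work_dim(0)))      // single-work-item kernel
__attribute__((num_compute_units(2, 2, 2)))  // one PE per (8,8,8) sub-block
__kernel void pe(void) {
    const int px = get_compute_id(0);
    const int py = get_compute_id(1);
    const int pz = get_compute_id(2);
    while (1) {
        // Receive a value from a neighbour, apply the (placeholder) local
        // mesh update, and forward the result onwards.
        float v = read_channel_intel(pe_link[px][py][pz][0]);
        float updated = v;                   // placeholder for the ART mesh computation
        write_channel_intel(pe_link[px][py][pz][1], updated);
    }
}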

Results: 4.9x faster than the CPU implementation, and almost equal performance to the P100 GPU.

References

[1] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, and Taisuke Boku. Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming. 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 479-488, May 2019.

[2] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Yuuma Oobata, Taisuke Boku, Makito Abe, Kohji Yoshikawa, and Masayuki Umemura. Accelerating Space Radiative Transfer on FPGA using OpenCL. International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2018).

Acknowledgment

This research was carried out as part of the project "Development of Computing-Communication Unified Supercomputer in Next Generation" under the MEXT program "Research and Development for Next-Generation Supercomputing Technology". We thank the Intel University Program for providing both hardware and software.