AI processing at MEC(Multi-access Edge Computing)

AI processing at MEC(Multi-access Edge Computing)

CentralizedCloud

⑤スマートコミュニティ

MEC data center

5G5G

5G

5G enables < 0.5msec latencyTiming Critical JobsLow Power but High Performance

FPGA Computing

CREST Project: Computing basis for Society5.0 General Manager: Prof.Sakai• An integrated multi-node system for MEC (PJ leader: Amano) :

• 300million yen for 5.5 years (2019.10-2025.3)• Group 1: Hardware/Architecture

• Amano (Keio U.), Kudoh (U.of Tokyo), Namiki (TAT)

• Group 2: Reconfigurable Logic• Iida, Kuga, Amagasaki (Kumamoto U.), Zhou (KIT)

• Group 3: Runtime software• Sugaya (Shibaura Tech.), Ohkawa(U. of Tokai), Takano(AIST)

• Group 4: Application• Nishi (Keio U.), Nakadai(TIT), Inagaki(Alps),

• Group 5: CAD• Wakabayashi, Takahashi , et. al(NEC)

Final goal

3

Middleware/runtime-OS

High level synthesis

Reconfigurable Core

CPU

SERDESHw.

Switch

Hard Crust（Shell) I/O

SERDESHw.

Crust-Core HUB

Hw SLM logicCGRA

FPGA

FPGA

DSA

DSA

Crust Core HUB

Crust Core HUB

Crust Core HUB

Crust Core HUB

Crust Core HUB

Crust Core HUB

FPGA

FPGA

DSA

DSAF

PG

A

FP

GA

DS

A

DS

A

Edge Device Edge Device Edge Device Edge Device

5G Radio

5G Control Plane

5G User Plane

Multi-AccessEdge

Cloud

Distributed OS

Although the final goal is to develop the original chip,

we will use an FPGA cluster for the prototype and another possibility.

FPGA computing in MEC data center• Benefits:

• Suitable to timing critical applications.• Direct data input/output from/to IoT devices• The constant execution time with the hardware engine

• Low energy computation.• Much energy efficient than GPUs

• Scalability for multiple access.• Easy to increase the number of FPGAs• An FPGA itself can be used as a switch.

• Programming environment has been improved:• Open-CL is widespread for computational usage.• Vivado-HLS is popularly used for general usage.

• High performance FPGAs are available.• Intel’s Stratix 10 with a lot of floating DSPs• Xilinx’s Ultrascale+ with UltraRAM

HostPC

FPGA

PCIe

FPGA

FPGA

FPGA

FPGA

FPGA

FPGA

Our Proposal : A virtual FPGA system

A lot of cost-efficient middle-scale FPGAs aretightly connected.

They can be treated as if they were a single FPGA in HLS description level.

The data can be directly input/output to/from serial linksThe latency and bandwidth are guaranteed.

Higher performance per cost than conventionalFPGA in cloud.

Practically infinite resource is used.Separated into a number of virtual FPGAs and shared by the multiple accesses.

Direct datafrom IoT devices

FPGAFPGAFPGA

FPGAFPGAFPGAFPGASTDM switch STDM switch STDM switch STDM switch

FPGA

STDM switch STDM switch STDM switch STDM switch

Circuit switching network

HLS modules

FiC-SW

Host CPUI/O boardKCU1500

High Speed Serial Links

Flow-in-Cloud (FiC) overview

Flow-in-Cloud (FiC) SW Board

FiC Network8x4 9.9Gbps

Ethernet

Control Network

Application Logic Area

SWControl board

Raspberry Pi 3 model B

FPGAXilinx Kintex

Ultrascale XCKU095

Rusberry Pi 3

FPGA KU085/095

STDMSwitch

HLS modules

DDR-4 SDRAM 16GB

Here, we call eachlink “channel”,

and a bundle of4 channels “bundle”.

A board has 8 bundleseach of which has4 channels

DDR-4 SDRAM 16GB

The current FiC with 24 boards

9

The current FiC system

Control Server

SW

Users

FPGA Configuration and Table information(*.json)

FPGA

FPGA

FPGA

FPGA

SW

SW

SW

SW

Results

FiC Board ControlNetwork

Deliver of Configuration data with RESTful API through HTTP

Internet / Intranet

GUI control from remote terminals

FiC Network

File configured

Now under configuration

GUI from remote terminals

Con

volu

tion

Maxp

oolin

g

Weightdata

Con

volu

tion

Maxp

oolin

g

Weightdata

Cla

ssifie

r(fully c

on

necte

d)

Weightdata

Cla

ssifie

r(fully c

on

necte

d)

Weightdata

ReL

U

Softm

ax

Image Results

FPGA programming contest for CNN: Special Course of Computer Architecture (Graduated School)

A simple CNN Lennet

A design templatewithout optimizationis distributed.

12

Executing lenet in the cloud.

Optimize the design in Vivado-HLS, and generate configuration data with Vivadoon the local environment.

Reserve the FiC-SW board

Execute application by using the I/O board, and evaluate the execution time

Configure it through Rasbpi

XilinxAurora

XilinxAurora

STDMswitch

8.5Gbps x32 (4 chan. x 8 lane )

HLS module

PR domain

Static domain

8.5Gbps x32 (4 chan. x 8 lane )

Raspi3

Ethernet

DRAM

170bit 170bit

100MHz

100MHz

9.9Gbps

9.9Gbps

Block Diagram of FiC

STDMswitch

STDMswitch

STDMswitch

sw0sw1

sw2sw3

STDM (Static Time Division Multiplexing)

14

Port1

Port2

Port3

Port 4

Port1

Port2

Port3

Port 4

S1

S2

S3

S4

S1

S2

S3

S4

S1

S2

S3

S4

S1

S2

S3

S4An input register is selected according to the pre-loaded table, and transferred to the outputregister.

Input data arrive at each port cyclically registered.

Output data are cyclicallysent to the output port

An example of4x4 with four slots

STDM (Static Time Division Multiplexing)

15

Port1

Port2

Port3

Port 4

Port1

Port2

Port3

Port 4

S1

S2

S3

S4

S1

S2

S3

S4

S1

S2

S3

S4

S1

S2

S3

S4

P2S1 P2S2 P1S3 P3S4 P2S1 P2S2P1S3 P3S4…. ….port1

• A circuit is established betweensource and destination.

• Latency and bandwidth are kept.

• Latency = 45+2 x (# of slots)clock cycles

Multicast using the STDM

Port1

Port2

Port3

Port 4

Port1

Port2

Port3

Port 4

S1

S2

S3

S4

S1

S2

S3

S4

S1

S2

S3

S4

S1

S2

S3

S4

For internal usage

Multicast is done efficiently.

Multiple outputs can receive the same data in a specific slot.

The resource usage

GT: High speed link

Enough design is remained for HLS design.

4 switches are provided for each channel.

fic00

fic01

fic02

fic03

1

2

1

2

fic04

fic05

fic06

fic07

1

2

1

2

fic08

fic09

fic10

fic11

1

2

1

2

m2fic00

m2fic01

m2fic02

m2fic03

1

2

1

2

m2fic04

m2fic05

m2fic06

m2fic07

1

2

1

2

m2fic08

m2fic09

m2fic10

m2fic11

1

2

1

2

3

3

3

3

3

3

3

3

3

3

3

3

4

4

4

4

4

4

4

4

4

4

4

4

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

The current configuration: 4x6 torus

fic00

fic01

fic02

fic03

1

2

1

2

fic04

fic05

fic06

fic07

1

2

1

2

fic08

fic09

fic10

fic11

1

2

1

2

m2fic00

m2fic01

m2fic02

m2fic03

1

2

1

2

m2fic04

m2fic05

m2fic06

m2fic07

1

2

1

2

m2fic08

m2fic09

m2fic10

m2fic11

1

2

1

2

3

3

3

3

3

3

3

3

3

3

3

3

4

4

4

4

4

4

4

4

4

4

4

4

Round Trip Time (24boards)

Round Trip Time (24 nodes HLS-HLS)

Slot number

Late

ncy

(clo

ck c

ycle

s)

1.2GB HLS-HLS effective bandwidth4.6um max. latency

Transferred Data (x160b x4)

0

0.2

0.4

0.6

0.8

1

1.2

1.4

0 2000 4000 6000 8000 10000 12000

Node-to-Node Bandwidth (HLS-HLS)B

an

dw

idth

(G

B /

sec）

Transferred Data (x160b x4)

2-slot

4-slot

8-slot

16-slot

32-slot

64-slot

0

0.5

1

1.5

2

2.5

3

3.5

4

4 32 256 2048 16384 131072 1048576

バン

ド幅

(GB

/sec

)

データサイズ(Byte)

GPUtoGPU(PEACH3)

GPUtoGPU(PEACH2)

GPUtoGPU(MPI/IB)

Transferred Data (Byte)

Ban

dw

idth

(G

B /

sec）

FiC 2slot

Comparison between other FPGA Connected Multi-GPU system and MPI/Infiniband

Kaneda, et.al “Performance Evaluation of PEACH3: FPGA switch for Tightly Coupled Accelerators,”

HEART 2017

0

200

400

600

800

1000

1200

0 10 20 30 40 50 60 70

The number of Slots

8 boards

4 boards

2 boards

Rou

nd

Tri

p T

ime (

clo

cks)

Round Trip Time vs. The number of Slots

Pass through time for a board is about 450nsec

with 2 slots.

1

320

Slot synchronousdelay increases the round trip time

0

1

2

3

4

5

6

8 32 128 512

レイ

テン

シ(μsec)

データサイズ(Byte)

GPUtoGPU(PEACH3)

GPUtoGPU(PEACH2)

GPUtoGPU(MPI/IB)

Late

ncy

(μsec）

Transferred Data (Byte)

FiC 2slot (450ns)

K’s Tofu (100ns)

Comparison between other systems

Ajima, et.al “The Tofu Interconnect,” Hot Interconnect 2012

fic00

fic01

fic02

fic03

1

2

1

2

fic04

fic05

fic06

fic07

1

2

1

2

fic08

fic09

fic10

fic11

1

2

1

2

m2fic00

m2fic01

m2fic02

m2fic03

1

2

1

2

m2fic04

m2fic05

m2fic06

m2fic07

1

2

1

2

m2fic08

m2fic09

m2fic10

m2fic11

1

2

1

2

3

3

3

3

3

3

3

3

3

3

3

3

4

4

4

4

4

4

4

4

4

4

4

4

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

Broadcast can be done with the maximum hop

①

②

③

④⑤

0

100

200

300

400

500

0 5 10 15 20 25 30

1-to-all and All-to-all

Array Size

Late

ncy

(clo

cks)

2x2

2x4

4x6

4x4

All-to-all

1-to-all

All-to-all broadcast can be done efficiently

fic00

fic01

fic02

fic03

1

2

1

2

fic04

fic05

fic06

fic07

1

2

1

2

fic08

fic09

fic10

fic11

1

2

1

2

m2fic00

m2fic01

m2fic02

m2fic03

1

2

1

2

m2fic04

m2fic05

m2fic06

m2fic07

1

2

1

2

m2fic08

m2fic09

m2fic10

m2fic11

1

2

1

2

3

3

3

3

3

3

3

3

3

3

3

3

4

4

4

4

4

4

4

4

4

4

4

4

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

Reduction Computation

① ① ① ① ① ①

②

0

200

400

600

800

1000

1200

0 5 10 15 20

Reduction Calculation (170bit integer data)

The number of Slots

Com

pu

tati

on

Tim

e (

clo

cks)

Each nodefinishes

computation with1 clock cycle.

Communication centric example: CG method17.8x performance improvementto Xeon E5-2667 16 CPUs 2.9GHz

All-to-all data transfer needs9000 clock cycles

input[0]

Input[1]

input[M-1]

・・・

weight[0][0]

weight[0][1]

weight[N-1][M-1]

・・・・

・・

weight[0][1]

output[0]

output[1]

output[N-1]

・・

・

Size (N, M) multiply-add

M

N×M

N

register

registerInput buffer (x2)

Weight buffer

PE[0]

PE[1]

PE[N-1]

・・・・・・

bias[0]

bias[N-1]

・・・

Bias buffer

Output buffer(x2)

ReLu

ImplementationExample(LeNet FullyConnected layer)

Fully connection layer of the LenetFrequency（MHｚ）

Power（W) image/sec GOPS GOPS/W

1 board 100 17.89 23551 120.58 6.74

4 boards 100 71.56 94226 482.34 6.74

Frequency（MHz）

Power（W) GOPS GOPS/W

FiC 4boards 100 71.56 482.34 6.74

Stratex-V [1] 120 25.8 136.5 5.29

KCU060 [2] 200 25.0 172.0 6.88

[1] N.Suda, V.Chandra, G.Dasika, et.al “Throughput-optimized open-based FPGA Accelerator for large-scalecomvolutional neural networks,” FPGA2016.[2] C.Zhang, Z.Fang, P.Zhao, P.Pan, J.Cong, “Caffeine: Towards uniformed representation and accelerationfor deep convolutional neural networks,” ICCCAD2017.

Higher performance is achieved with the similar power efficiency compared witha single FPGA system.

Conclusion and future work

● A multi-FPGA system for MEC has been developed:

○ The management system has been used for a design contest in a class.

○ The latency and bandwidth can be guaranteed.

■ 34GB max. aggregated throughput with 32l channel

■ 1.2GB/sec 1-to-1 HLS throughput

■ 450nsec pass through time

■ 4.6μsec maximum HLS-to-HLS latency in the 24 nodes.● Now on going…

○ We distributed small systems to other groups: TAT, Shibaura Tech, Tokai University and NEC.

○ Practical application is now under development:

■ Genome matching, Alexnet, Lesnet, RNA and distributed learning.

Documents

AI processing at MEC(Multi-access Edge Computing)