Upload
others
View
12
Download
3
Embed Size (px)
Citation preview
AI processing at MEC(Multi-access Edge Computing)
CentralizedCloud
⑤スマートコミュニティ
MEC data center
5G5G
5G
5G enables < 0.5msec latencyTiming Critical JobsLow Power but High Performance
FPGA Computing
CREST Project: Computing basis for Society5.0 General Manager: Prof.Sakai• An integrated multi-node system for MEC (PJ leader: Amano) :
• 300million yen for 5.5 years (2019.10-2025.3)• Group 1: Hardware/Architecture
• Amano (Keio U.), Kudoh (U.of Tokyo), Namiki (TAT)
• Group 2: Reconfigurable Logic• Iida, Kuga, Amagasaki (Kumamoto U.), Zhou (KIT)
• Group 3: Runtime software• Sugaya (Shibaura Tech.), Ohkawa(U. of Tokai), Takano(AIST)
• Group 4: Application• Nishi (Keio U.), Nakadai(TIT), Inagaki(Alps),
• Group 5: CAD• Wakabayashi, Takahashi , et. al(NEC)
Final goal
3
Middleware/runtime-OS
High level synthesis
Reconfigurable Core
CPU
SERDESHw.
Switch
Hard Crust(Shell) I/O
SERDESHw.
Crust-Core HUB
Hw SLM logicCGRA
FPGA
FPGA
DSA
DSA
Crust Core HUB
Crust Core HUB
Crust Core HUB
Crust Core HUB
Crust Core HUB
Crust Core HUB
FPGA
FPGA
DSA
DSAF
PG
A
FP
GA
DS
A
DS
A
Edge Device Edge Device Edge Device Edge Device
5G Radio
5G Control Plane
5G User Plane
Multi-AccessEdge
Cloud
Distributed OS
Although the final goal is to develop the original chip,
we will use an FPGA cluster for the prototype and another possibility.
FPGA computing in MEC data center• Benefits:
• Suitable to timing critical applications.• Direct data input/output from/to IoT devices• The constant execution time with the hardware engine
• Low energy computation.• Much energy efficient than GPUs
• Scalability for multiple access.• Easy to increase the number of FPGAs• An FPGA itself can be used as a switch.
• Programming environment has been improved:• Open-CL is widespread for computational usage.• Vivado-HLS is popularly used for general usage.
• High performance FPGAs are available.• Intel’s Stratix 10 with a lot of floating DSPs• Xilinx’s Ultrascale+ with UltraRAM
HostPC
FPGA
PCIe
FPGA
FPGA
FPGA
FPGA
FPGA
FPGA
Our Proposal : A virtual FPGA system
A lot of cost-efficient middle-scale FPGAs aretightly connected.
They can be treated as if they were a single FPGA in HLS description level.
The data can be directly input/output to/from serial linksThe latency and bandwidth are guaranteed.
Higher performance per cost than conventionalFPGA in cloud.
Practically infinite resource is used.Separated into a number of virtual FPGAs and shared by the multiple accesses.
Direct datafrom IoT devices
FPGAFPGAFPGA
FPGAFPGAFPGAFPGASTDM switch STDM switch STDM switch STDM switch
FPGA
STDM switch STDM switch STDM switch STDM switch
Circuit switching network
HLS modules
FiC-SW
Host CPUI/O boardKCU1500
High Speed Serial Links
Flow-in-Cloud (FiC) overview
Flow-in-Cloud (FiC) SW Board
FiC Network8x4 9.9Gbps
Ethernet
Control Network
Application Logic Area
SWControl board
Raspberry Pi 3 model B
FPGAXilinx Kintex
Ultrascale XCKU095
Rusberry Pi 3
FPGA KU085/095
STDMSwitch
HLS modules
DDR-4 SDRAM 16GB
Here, we call eachlink “channel”,
and a bundle of4 channels “bundle”.
A board has 8 bundleseach of which has4 channels
DDR-4 SDRAM 16GB
The current FiC with 24 boards
9
The current FiC system
Control Server
SW
Users
FPGA Configuration and Table information(*.json)
FPGA
FPGA
FPGA
FPGA
SW
SW
SW
SW
Results
FiC Board ControlNetwork
Deliver of Configuration data with RESTful API through HTTP
Internet / Intranet
GUI control from remote terminals
FiC Network
File configured
Now under configuration
GUI from remote terminals
Con
volu
tion
Maxp
oolin
g
Weightdata
Con
volu
tion
Maxp
oolin
g
Weightdata
Cla
ssifie
r(fully c
on
necte
d)
Weightdata
Cla
ssifie
r(fully c
on
necte
d)
Weightdata
ReL
U
Softm
ax
Image Results
FPGA programming contest for CNN: Special Course of Computer Architecture (Graduated School)
A simple CNN Lennet
A design templatewithout optimizationis distributed.
12
Executing lenet in the cloud.
Optimize the design in Vivado-HLS, and generate configuration data with Vivadoon the local environment.
Reserve the FiC-SW board
Execute application by using the I/O board, and evaluate the execution time
Configure it through Rasbpi
XilinxAurora
XilinxAurora
STDMswitch
8.5Gbps x32 (4 chan. x 8 lane )
HLS module
PR domain
Static domain
8.5Gbps x32 (4 chan. x 8 lane )
Raspi3
Ethernet
DRAM
170bit 170bit
100MHz
100MHz
9.9Gbps
9.9Gbps
Block Diagram of FiC
STDMswitch
STDMswitch
STDMswitch
sw0sw1
sw2sw3
STDM (Static Time Division Multiplexing)
14
Port1
Port2
Port3
Port 4
Port1
Port2
Port3
Port 4
S1
S2
S3
S4
S1
S2
S3
S4
S1
S2
S3
S4
S1
S2
S3
S4An input register is selected according to the pre-loaded table, and transferred to the outputregister.
Input data arrive at each port cyclically registered.
Output data are cyclicallysent to the output port
An example of4x4 with four slots
STDM (Static Time Division Multiplexing)
15
Port1
Port2
Port3
Port 4
Port1
Port2
Port3
Port 4
S1
S2
S3
S4
S1
S2
S3
S4
S1
S2
S3
S4
S1
S2
S3
S4
P2S1 P2S2 P1S3 P3S4 P2S1 P2S2P1S3 P3S4…. ….port1
• A circuit is established betweensource and destination.
• Latency and bandwidth are kept.
• Latency = 45+2 x (# of slots)clock cycles
Multicast using the STDM
Port1
Port2
Port3
Port 4
Port1
Port2
Port3
Port 4
S1
S2
S3
S4
S1
S2
S3
S4
S1
S2
S3
S4
S1
S2
S3
S4
For internal usage
Multicast is done efficiently.
Multiple outputs can receive the same data in a specific slot.
The resource usage
GT: High speed link
Enough design is remained for HLS design.
4 switches are provided for each channel.
fic00
fic01
fic02
fic03
1
2
1
2
fic04
fic05
fic06
fic07
1
2
1
2
fic08
fic09
fic10
fic11
1
2
1
2
m2fic00
m2fic01
m2fic02
m2fic03
1
2
1
2
m2fic04
m2fic05
m2fic06
m2fic07
1
2
1
2
m2fic08
m2fic09
m2fic10
m2fic11
1
2
1
2
3
3
3
3
3
3
3
3
3
3
3
3
4
4
4
4
4
4
4
4
4
4
4
4
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
The current configuration: 4x6 torus
fic00
fic01
fic02
fic03
1
2
1
2
fic04
fic05
fic06
fic07
1
2
1
2
fic08
fic09
fic10
fic11
1
2
1
2
m2fic00
m2fic01
m2fic02
m2fic03
1
2
1
2
m2fic04
m2fic05
m2fic06
m2fic07
1
2
1
2
m2fic08
m2fic09
m2fic10
m2fic11
1
2
1
2
3
3
3
3
3
3
3
3
3
3
3
3
4
4
4
4
4
4
4
4
4
4
4
4
Round Trip Time (24boards)
Round Trip Time (24 nodes HLS-HLS)
Slot number
Late
ncy
(clo
ck c
ycle
s)
1.2GB HLS-HLS effective bandwidth4.6um max. latency
Transferred Data (x160b x4)
0
0.2
0.4
0.6
0.8
1
1.2
1.4
0 2000 4000 6000 8000 10000 12000
Node-to-Node Bandwidth (HLS-HLS)B
an
dw
idth
(G
B /
sec)
Transferred Data (x160b x4)
2-slot
4-slot
8-slot
16-slot
32-slot
64-slot
0
0.5
1
1.5
2
2.5
3
3.5
4
4 32 256 2048 16384 131072 1048576
バン
ド幅
(GB
/sec
)
データサイズ(Byte)
GPUtoGPU(PEACH3)
GPUtoGPU(PEACH2)
GPUtoGPU(MPI/IB)
Transferred Data (Byte)
Ban
dw
idth
(G
B /
sec)
FiC 2slot
Comparison between other FPGA Connected Multi-GPU system and MPI/Infiniband
Kaneda, et.al “Performance Evaluation of PEACH3: FPGA switch for Tightly Coupled Accelerators,”
HEART 2017
0
200
400
600
800
1000
1200
0 10 20 30 40 50 60 70
The number of Slots
8 boards
4 boards
2 boards
Rou
nd
Tri
p T
ime (
clo
cks)
Round Trip Time vs. The number of Slots
Pass through time for a board is about 450nsec
with 2 slots.
1
320
Slot synchronousdelay increases the round trip time
0
1
2
3
4
5
6
8 32 128 512
レイ
テン
シ(μsec)
データサイズ(Byte)
GPUtoGPU(PEACH3)
GPUtoGPU(PEACH2)
GPUtoGPU(MPI/IB)
Late
ncy
(μsec)
Transferred Data (Byte)
FiC 2slot (450ns)
K’s Tofu (100ns)
Comparison between other systems
Ajima, et.al “The Tofu Interconnect,” Hot Interconnect 2012
fic00
fic01
fic02
fic03
1
2
1
2
fic04
fic05
fic06
fic07
1
2
1
2
fic08
fic09
fic10
fic11
1
2
1
2
m2fic00
m2fic01
m2fic02
m2fic03
1
2
1
2
m2fic04
m2fic05
m2fic06
m2fic07
1
2
1
2
m2fic08
m2fic09
m2fic10
m2fic11
1
2
1
2
3
3
3
3
3
3
3
3
3
3
3
3
4
4
4
4
4
4
4
4
4
4
4
4
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
Broadcast can be done with the maximum hop
①
②
③
④⑤
0
100
200
300
400
500
0 5 10 15 20 25 30
1-to-all and All-to-all
Array Size
Late
ncy
(clo
cks)
2x2
2x4
4x6
4x4
All-to-all
1-to-all
All-to-all broadcast can be done efficiently
fic00
fic01
fic02
fic03
1
2
1
2
fic04
fic05
fic06
fic07
1
2
1
2
fic08
fic09
fic10
fic11
1
2
1
2
m2fic00
m2fic01
m2fic02
m2fic03
1
2
1
2
m2fic04
m2fic05
m2fic06
m2fic07
1
2
1
2
m2fic08
m2fic09
m2fic10
m2fic11
1
2
1
2
3
3
3
3
3
3
3
3
3
3
3
3
4
4
4
4
4
4
4
4
4
4
4
4
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
Reduction Computation
① ① ① ① ① ①
②
0
200
400
600
800
1000
1200
0 5 10 15 20
Reduction Calculation (170bit integer data)
The number of Slots
Com
pu
tati
on
Tim
e (
clo
cks)
Each nodefinishes
computation with1 clock cycle.
Communication centric example: CG method17.8x performance improvementto Xeon E5-2667 16 CPUs 2.9GHz
All-to-all data transfer needs9000 clock cycles
input[0]
Input[1]
input[M-1]
・・・
weight[0][0]
weight[0][1]
weight[N-1][M-1]
・・・・
・・
weight[0][1]
output[0]
output[1]
output[N-1]
・・
・
Size (N, M) multiply-add
M
N×M
N
register
registerInput buffer (x2)
Weight buffer
PE[0]
PE[1]
PE[N-1]
・・・・・・
bias[0]
bias[N-1]
・・・
Bias buffer
Output buffer(x2)
ReLu
ImplementationExample(LeNet FullyConnected layer)
Fully connection layer of the LenetFrequency(MHz)
Power(W) image/sec GOPS GOPS/W
1 board 100 17.89 23551 120.58 6.74
4 boards 100 71.56 94226 482.34 6.74
Frequency(MHz)
Power(W) GOPS GOPS/W
FiC 4boards 100 71.56 482.34 6.74
Stratex-V [1] 120 25.8 136.5 5.29
KCU060 [2] 200 25.0 172.0 6.88
[1] N.Suda, V.Chandra, G.Dasika, et.al “Throughput-optimized open-based FPGA Accelerator for large-scalecomvolutional neural networks,” FPGA2016.[2] C.Zhang, Z.Fang, P.Zhao, P.Pan, J.Cong, “Caffeine: Towards uniformed representation and accelerationfor deep convolutional neural networks,” ICCCAD2017.
Higher performance is achieved with the similar power efficiency compared witha single FPGA system.
Conclusion and future work
● A multi-FPGA system for MEC has been developed:
○ The management system has been used for a design contest in a class.
○ The latency and bandwidth can be guaranteed.
■ 34GB max. aggregated throughput with 32l channel
■ 1.2GB/sec 1-to-1 HLS throughput
■ 450nsec pass through time
■ 4.6μsec maximum HLS-to-HLS latency in the 24 nodes.● Now on going…
○ We distributed small systems to other groups: TAT, Shibaura Tech, Tokai University and NEC.
○ Practical application is now under development:
■ Genome matching, Alexnet, Lesnet, RNA and distributed learning.