FP-AMR: A Reconfigurable Fabric Framework for
Adaptive Mesh Refinement Applications
Tianqi Wang∗ Tong Geng† Xi Jin∗ Martin Herbordt†
∗Department of Physics, University of Science and Technology of China, Hefei, China
†Department of Electrical and Computer Engineering, Boston University, Boston, MA, USA
Abstract—Adaptive mesh refinement (AMR) is one of the most widely used methods in High Performance Computing, accounting for a large fraction of all supercomputing cycles. AMR operates by dynamically and adaptively applying computational resources non-uniformly to emphasize regions of the model as a function of their complexity. Because AMR generally uses dynamic and pointer-based data structures, acceleration is challenging, especially in hardware. As far as we are aware, no previous work has been published on accelerating AMR with FPGAs.
In this paper, we introduce a reconfigurable fabric framework called FP-AMR. The work is in two parts. In the first, FP-AMR offloads the bulk per-timestep computations to the FPGA; analogous systems have previously done this with GPUs. In the second part we show that the rest of the CPU-based tasks, including particle mesh mapping, mesh refinement, and coarsening, can also be mapped efficiently to the FPGA. We have evaluated FP-AMR using the widely used program AMReX and found that a single FPGA outperforms a Xeon E5-2660 CPU server (8 cores) by 21×-23× depending on problem size and data distribution.
Index Terms—Reconfigurable Computing, Adaptive Mesh Refinement, Scientific Computing, High Performance Computing
I. INTRODUCTION
The increase in supercomputer performance over the last
several decades has, by itself, been insufficient to solve many
challenging modeling and simulation problems. For example,
the complexity of solving evolutionary partial differential
equations (PDEs) scales as Ω(n4), where n is the num-
ber of mesh points per dimension. Thus, the three-order-of-
magnitude improvement in performance over the past 25 years
has meant just a 5.6× gain in spatio-temporal resolution [1].
To address this problem, many simulation packages [2]–[6] use
Adaptive Mesh Refinement (AMR)–which selectively applies
computing to regions of most interest–to increase resolution.
Its utility and versatility have made AMR one of the most
widely used frameworks in HPC [7].
The problem with traditional uniform meshes is that difficult
regions (discontinuities, steep gradients, shocks) require high
resolution; but since that high resolution is applied everywhere
equally, most of the computation is wasted. For example, the
physical universe is populated by structures at very different
scales, from superclusters down to galaxies. These strongly
correlated structures need fine resolution in both space and
time, while regions of space that are mostly void could
make due with coarse resolution (e.g., [6], [8]). When using
AMR, simulations start from a coarse grid. The program then
identifies regions that need more resolution and superimposes
finer sub-grids only on those regions.
With the emergence of FPGAs in large-scale clouds and
clusters, it makes sense to investigate whether they are an
appropriate tool for AMR acceleration; we are not aware of
any previous study. The challenge is that AMR generally
uses dynamic and pointer-based data structures, characteristics
that make any acceleration difficult [9], [10], since random
off-chip DRAM access is expensive and inefficient. AMR
frameworks that support GPU acceleration [8] generally use
the GPU only to execute the PDE solver and use a host CPU to
handle operations on the dynamic data structures [5]. FPGAs
can be designed with a cache to support traversal of pointer-
based data structures [11], but this approach does not support
their modification. Our approach provides a carefully designed
data structure and memory subsystem to support all relevant
operations.
In this work, we propose FP-AMR, a reconfigurable fabric
framework for block-structured adaptive mesh refinement ap-
plications [2]. FP-AMR helps programmers develop AMR ap-
plications for both FPGA-centric clusters and traditional HPC
clusters with FPGAs as accelerators. The novel contributions
are as follows:
• FP-AMR can offload all CPU-based tasks to FPGAs,
including particle mesh mapping, mesh refinement, and
coarsening. In all GPU AMR versions of which we are
aware, these functions must be executed on a CPU. The
advantage of complete offload is to drastically reduce
communication overhead.
• FP-AMR uses a Z-order space-filling curve to map the
simulation space and the adaptive mesh to a customized
data structure. This design uses spatial locality to reduce
data access conflicts and improve memory access effi-
ciency.
• FP-AMR supports direct FPGA-to-FPGA communica-
tion. This means that inter-accelerator communication,
which is necessary to periodically update ghost-zones,
avoids transit through the CPUs.
• Experiments show that, with FP-AMR, a single FPGA
outperforms a Xeon E5-2660 CPU by 21× to 23× de-
pending on cluster scale and simulation initial conditions.
Also, using the secondary network, performance
scales perfectly to eight FPGAs and is likely to scale
similarly on much larger FPGA clusters.
II. BACKGROUND
A. AMR
Cosmological simulation is used as a motivating example
of AMR (Algorithm 1). The functions Calc_Potential,
Calc_Accel, and Motion_Update, listed in Lines
2-4, can be replaced by any application-specific customized
functions. The key feature is the function Particle_Map, listed
in Line 1.
Algorithm 1 Algorithm for Time Forward
Input: P_i: particle information of time T_i
Output: P_i+1: particle information of time T_i+1
1: Mass_Mesh_i = Particle_Map(P_i)           # Map T_i particles to the adaptive mesh
2: Potential_i = Calc_Potential(Mass_Mesh_i) # Calculate the T_i potential distribution
3: Accel_i = Calc_Accel(Potential_i)         # Calculate all particles' accelerations at T_i
4: P_i+1 = Motion_Update(Accel_i, P_i)       # Motion update to generate P_i+1
5: return P_i+1
Fig. 1. Spatio-temporal refinement in the AMR algorithm: (A) space-dimension refinement across grid levels 0-2; (B) time-dimension refinement (subcycling with steps of t, t/2, and t/4)
Figure 1 shows details of Particle_Map: how a cosmologi-
cal simulation uses AMR to implement various resolutions in
the spatio-temporal regions of high mass density. As shown
in Figure 1(A), AMR starts from the coarsest grid (Level 0),
then identifies blocks with high mass density (red), applies finer
resolution to these blocks, and then generates a finer grid
(Level 1) for just those blocks. Continuing, the next grid (Level
2) is generated for high-density blocks in Level 1 (blue).
Figure 1(B) shows the subcycling-in-time approach used by
AMR. Step 1: Integrate Level 0 over t. Step 2: Integrate Level
1 over t/2. Step 3: Integrate Level 2 over t/4. Step 4: Integrate
Level 2 over t/4. Step 5: Synchronize Levels 1 and 2. Step
6: Integrate Level 1 over t/2. Step 7: Integrate Level 2 over
t/4. Step 8: Integrate Level 2 over t/4. Step 9: Synchronize
Levels 0, 1, and 2.
Algorithm 2 Algorithm for Particle Map
Input: P_i: particle information of time T_i
Output: Mass_Mesh_i: adaptive mesh of time T_i
1: Initialize(Mesh_Conf, P_Sort_i)
2: for each block ∈ Space do
3:     count = 0
4:     par_list = ∅
5:     for each par ∈ P_i do
6:         if par_loc ∈ block then
7:             count += 1
8:             par_list ← par
9:         end if
10:    end for
11:    if count > threshold1 then
12:        Mesh_Conf ← Refine(block)
13:    else if Σ_adj count < threshold2 then
14:        Mesh_Conf ← Coarsen(block)
15:    else
16:        Mesh_Conf ← Keep(block)
17:    end if
18:    P_Sort_i ← par_list
19: end for
20: for each par ∈ P_Sort_i do
21:    Mass_Mesh ← Interpolation(par, Mesh_Conf)
22: end for
23: return Mass_Mesh
Details of Particle_Map are shown in Algorithm 2. In Lines
2-19, the first nested loop traverses the whole simulation space
and counts each mesh block’s particle number. According to
the particle density of each block and its adjacent blocks, the
block is refined, coarsened, or kept. Based on Mesh_Conf ,
generated by Lines 2-19, Lines 20-22 traverse all particles and
interpolate the particles’ masses to the adaptive mesh. At the
same time, all particles are sorted based on spatial locality for
the next step’s particle interpolation.
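The two loops above can be sketched in software. The following is a hypothetical Python sketch of Particle_Map for the 2D case; the threshold values, block layout, and adjacency structure are our own illustrative assumptions, not the authors' parameters.

```python
# Illustrative sketch of Algorithm 2 (Particle_Map). Thresholds are assumed.
REFINE_THRESHOLD = 64    # assumed value of threshold1
COARSEN_THRESHOLD = 8    # assumed value of threshold2

def particle_map(particles, blocks, adjacency):
    """particles: [(x, y, mass)]; blocks: {id: (x0, y0, x1, y1)};
    adjacency: {id: [neighbor ids]}. Returns (Mesh_Conf, P_Sort)."""
    counts = {b: 0 for b in blocks}
    per_block = {b: [] for b in blocks}

    # First nested loop (lines 2-10): count each block's particles.
    for b, (x0, y0, x1, y1) in sorted(blocks.items()):
        for (x, y, m) in particles:
            if x0 <= x < x1 and y0 <= y < y1:
                counts[b] += 1
                per_block[b].append((x, y, m))

    # Lines 11-18: refine/coarsen/keep decision, and sort by block index.
    mesh_conf, p_sort = {}, []
    for b in sorted(blocks):
        if counts[b] > REFINE_THRESHOLD:
            mesh_conf[b] = "refine"
        elif sum(counts[a] for a in adjacency[b]) < COARSEN_THRESHOLD:
            mesh_conf[b] = "coarsen"
        else:
            mesh_conf[b] = "keep"
        p_sort.extend(per_block[b])     # sections in block-index order

    # Lines 20-22 would then interpolate each sorted particle onto the mesh.
    return mesh_conf, p_sort
```

Note that the sorted list comes out grouped by block, which is exactly the spatial-locality property the hardware later exploits.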
S1 = dx·dy,  S2 = (1−dx)·dy,  S3 = (1−dx)·(1−dy),  S4 = dx·(1−dy)
m1 = M·S1,  m2 = M·S2,  m3 = M·S3,  m4 = M·S4

Fig. 2. Bi-linear interpolation: particle mass M is split among the four surrounding grid points in proportion to the sub-areas S1-S4
In summary, the key features of Particle_Map used to build
and adapt the adaptive mesh are:
1) Refine mesh and coarsen mesh: According to the mass
density, the AMR method must decide whether each block should be
refined or coarsened.
2) Interpolate: As shown in Fig. 2, bi-linear or tri-linear
methods can be used for the 2D/3D cases. Based
on area/volume, mass is partitioned and mapped to the mesh.
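The 2D deposition can be sketched in a few lines. This is a minimal sketch assuming (dx, dy) is the particle's fractional offset within the cell, as in Fig. 2; the function name is ours.

```python
# Split mass M among the four surrounding grid points by opposite sub-areas.
# The four weights always sum to 1, so no mass is lost or created.
def bilinear_deposit(M, dx, dy):
    m1 = M * dx * dy
    m2 = M * (1 - dx) * dy
    m3 = M * (1 - dx) * (1 - dy)
    m4 = M * dx * (1 - dy)
    return m1, m2, m3, m4
```

A particle exactly at the cell center (dx = dy = 0.5) contributes a quarter of its mass to each corner.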
Fig. 3. Overview of FP-AMR: (A) system overview (nodes of host CPU plus reconfigurable device over PCIe, linked by both a host communication network and a direct device communication network); (B) FPGA design (router with transceiver links, DDR and PCIe controllers, user-logic kernels 1-n, and the FP-AMR module); (C) FP-AMR module (Phase I: sorting with PE_Sort; Phase II: particle mapping with PE_Map)
B. FPGA-centric clusters
In traditional HPC clusters, accelerators communicate
through their hosts, which results in substantial latency over-
head. In FPGA-centric clusters, such as Catapult I [12] and
many other recent systems [13], [14], FPGAs are intercon-
nected directly through their transceivers [15]–[17]. As shown
in Figure 3, the yellow-marked reconfigurable devices can
exchange data without detouring through the host CPUs. While there is
no technological reason why GPUs (and other accelerators)
cannot be built with a similar capability, to date NVLink
supports only a small number of GPUs and, unlike with
FPGAs, adds substantially to the cost of the cluster.
C. Challenge and Motivation
For AMR-based applications, accelerators use optimized
kernels to handle the functions in Algorithm 1, Lines 2-4.
Particle_Map uses tree-based dynamic data structures. For
scientific simulations, the size of these data structures is
usually hundreds of MB. These pointer-based dynamic data
structures must, therefore, be stored off-chip. Since the large
number of random accesses makes acceleration difficult,
programmers generally leave this work to the host CPUs.
Fig. 4. Communication and computation phases: host computation (Th), device computation (Td), and host-device communication (Tc) across timesteps
Figure 4 shows how traditional clusters execute AMR
through phases and communication between host and device.
The fraction of time for Particle_Map can be expressed as

Ratio_amr = (Th + 2Tc) / (Th + Td + 2Tc)    (1)
where Th is CPU time, Td is accelerator time, and Tc is
communication latency.
A better accelerator can reduce Td, but the time required to
exchange hundreds of MB of data between host and device
(e.g., through PCIe) is substantial. Moreover, Tc cannot be
overlapped with computation because timestep 2 cannot start before
timestep 1 finishes. Offloading Particle_Map to the accelerator is
therefore essential for further performance improvement.
III. FP-AMR FRAMEWORK
A. Overview
Figure 3 shows FP-AMR in an FPGA-centric cluster. Figure
3(A) shows the system overview; Figure 3(B) gives details of the
FPGA design. A router handles FPGA-to-FPGA communication
through the transceivers. Functions 2-4 in Algorithm
1 are instantiated as user-specific kernels 1-n. FP-AMR is in
the red block and wraps the user-specific kernels. According
to Algorithm 2, two nested loops are executed sequentially.
As shown in Figure 3(C), the FP-AMR module consists
of two parts, PE_Sort and PE_Map, corresponding to the two
loops. PE_Sort reads the particle information list and then
generates the sorted particle information list and AMR grid
status. According to the values in these structures, PE-map
generates the final mass mapping mesh. For most simulation
applications, the adaptive mesh data structures are too large
(usually several GB) to be stored on-chip.
B. Space Filling Curve
According to Algorithm 2, lines 2-19 and 20-22, AMR
must traverse the whole simulation space and the particle
information list. Since all necessary data structures (Figure
3(C)) are stored off-chip, FP-AMR needs to access off-chip
memory frequently. To make this more efficient, FP-AMR uses
an on-chip cache to take advantage of spatial locality. To map
the 2D/3D simulation space to the 1D memory address space,
a space-filling curve (SFC) is advantageous. SFCs have ranges
that contain the entire 2D/3D unit square; Z-order
and Peano-Hilbert are two widely used SFCs.

Fig. 5. Using a space-filling curve to index the whole simulation space and the adaptive mesh grid: (A) the simulation space containing blocks and grid points; (B) the blocks, extracted and indexed; (C) the grid points, extracted and indexed

Fig. 6. Different space-filling curves: (A) Z-order curve; (B) Peano-Hilbert curve
As shown in Figure 5(A), the whole simulation space
contains two basic elements: blocks and mesh grids. Blocks are
the orange space in Figure 5(A) and are used to count the par-
ticles’ spatial distribution. The mesh grids are the black points
in Figure 5(A)(C) and are used to store the mapped particles’
mass. To take advantage of spatial locality, the blocks and
mesh grids both need to be indexed with the SFC. We use the
Z-order curve as an example. In Figure 5(B)(C), the blocks and
mesh grid points are extracted separately and indexed. Figure
6 shows the indexing methods for the Z-order and Peano-
Hilbert curves. Note that the Z-order curve only needs the
bit-shuffle operation, as shown in Figure 6(A). In contrast, the
Peano-Hilbert curve generator is a recursive algorithm, which
is extremely expensive for hardware implementation. Clearly,
the Z-order curve is a more reasonable choice.
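The bit-shuffle that makes Z-order cheap in hardware can be modeled in a few lines of software. This is a sketch under the usual Morton-code convention (which coordinate's bits land in the even positions is a convention; Fig. 6(A) fixes one choice):

```python
# Z-order (Morton) index by bit interleaving: in hardware this is pure
# wiring, unlike the recursive Peano-Hilbert generator.
def z_order_2d(ix, iy, bits=16):
    """Interleave the bits of (ix, iy): x bits at even positions, y at odd."""
    z = 0
    for b in range(bits):
        z |= ((ix >> b) & 1) << (2 * b)
        z |= ((iy >> b) & 1) << (2 * b + 1)
    return z
```

For a 3D mesh the same idea interleaves three bit streams; either way, no arithmetic beyond shifts and masks is required.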
For the adaptive mesh, the blocks of the whole simulation
space are shown in Figure 7(A). From the adaptive mesh, we
can extract different level mesh grids (Level 1 to Level 3),
which are shown in Figure 7(B). AMR uses a finer grid to
index critical areas. For a particle located in the block marked
with the star (in Figure 7(A)), we generate its indices in all
three grid levels (see Figure 7(B)), and use the spliced three
indices to reference the particle (in Figure 7(B)).
C. Data Structures
In FP-AMR, particle information and the adaptive mesh are
stored in off-chip DRAM, so their data structures must be
designed carefully.

Fig. 7. Adaptive mesh indexing: (A) adaptive mesh blocks of the whole simulation space; (B) index method, splicing the L1/L2/L3 block indices (e.g., Index = 03_01_02)

Figure 8 shows all related data structures in
Algorithm 1 and their relationships. For time step i: (1) Sort:
the particle information list of Time_i (P_i: from particle 0 to
particle n) is sorted based on spatial locality to generate P_Sort_i
(from particle 0' to particle n'). (2) Interpolation: generate
Mass_Mesh_i of Time_i. (3) Calc_Potential/Calc_Accel:
user-specific kernels calculate the discrete potential and each
particle's acceleration at Time_i based on Mass_Mesh_i. (4)
Motion update: based on Time_i's accelerations, the (location,
velocity) fields of P_Sort_i are updated; the result is the next
timestep's particle information list P_i+1.
There are two kinds of data structures: the particle infor-
mation lists (Pi, P_Sorti) and the adaptive mass mesh point
list (Mass_Meshi). Both have similar memory layout.
For the particle information list (Figure 9(A)), the list is
divided into several sections, where each section contains all
particles that share the same block-index. The sections are
stored contiguously and the sections with larger block-indices
are stored in higher address space. As shown in Figure 8,
from time step i to time step i + 1, the location field of the
particle information list is updated. In the next time step i+2,
even if the particle spatial distribution is relatively stable,
some particles will flow into and out of each block. For this
reason each section has extra backup space at its end, in case
more particles need to be located in the block.
Besides the memory to store particle information, FP-AMR
needs an extra base-address array and a counter array. The
base-address array keeps each section's base address, and the
counter array records how many particles exist in each block.
Figure 9 shows how the coarsen and refine operations affect
the particle information list. First, the section or adjacent
sections (including the backup blanks) are fragmented or merged.
Then all related particles are re-sorted. The base-address and
counter arrays are updated.
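The sectioned layout with its two auxiliary arrays can be sketched as follows. This is an illustrative model only; the backup-space size and all names are our assumptions, not the paper's layout constants.

```python
# Sectioned particle information list: one contiguous section per block
# index, each ending in backup blanks, plus base-address and counter arrays.
BACKUP = 4  # assumed number of backup slots reserved per section

class ParticleList:
    def __init__(self, expected_counts):
        """expected_counts[b]: particles initially expected in block b."""
        self.base, self.count, self.capacity = [], [], []
        addr = 0
        for n in expected_counts:       # sections stored in block-index order
            self.base.append(addr)      # base-address array
            self.count.append(0)        # counter array
            self.capacity.append(n + BACKUP)
            addr += n + BACKUP          # backup blank at the end of the section
        self.mem = [None] * addr        # the DRAM-resident list

    def insert(self, b, particle):
        """Write at base[b] + count[b], then bump block b's counter."""
        if self.count[b] >= self.capacity[b]:
            raise MemoryError("section full: a re-sort/refine would be needed")
        self.mem[self.base[b] + self.count[b]] = particle
        self.count[b] += 1
```

The backup slots let a block absorb a few incoming particles without moving its neighbors; only when a section overflows (or a refine/coarsen fires) do sections get split, merged, and the two arrays rewritten.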
Fig. 8. Relationships among the data structures (P_i, P_Sort_i, Mass_Mesh_i, P_i+1) across one timestep: sort, interpolation, user kernels, and motion update
IV. ARCHITECTURE
In this section, the hardware architecture of the two phases
of FP-AMR, PE_Sort and PE_Map (shown in Figure
3(C)), is introduced. The PE_Sort module sorts all particle
information based on spatial locality and decides which blocks
need to be refined or coarsened. PE_Map maps the sorted
particles to the adaptive mesh with bi-/tri-linear interpolation.
A. PE_Sort
The data path of module PE_Sort is shown in Figure 10.
The inputs of PE_Sort are: coordinates of this level’s origin
(x0, y0); this level’s length L; the particle coordinates vector
(x, y); and the particles’ mass m. First, the pipeline calculates
(idx, idy) and (dx, dy):
(idx, idy) = (⌊(x − x0)/L⌋, ⌊(y − y0)/L⌋)    (2)

(dx, dy) = ((x − x0)/L − idx, (y − y0)/L − idy)    (3)
Fig. 9. Data structure of the particle information list: sections in block-index order, each ending with backup blanks, plus the base-address and counter arrays; coarsen and refine operations split or merge sections and update both arrays

Fig. 10. Data path of the PE_Sort module

(idx, idy) identifies the block in which the particle is located, and
(dx, dy) is the offset within that block. Using (idx, idy), the
Z-order module generates the block index, which is used
as the key to search the base-address array and to increment
the corresponding counter in the counter array. The base-address
and counter search results are added to form the address of the
particle's information. Then all related particle information (such
as (dx, dy, m)) is written to the on-chip cache. Because of the
spatial locality of the particles, the cache has a low miss rate.
Based on the counter-array search result, PE_Sort decides whether
the block needs to be refined, coarsened, or kept as is. If a refine
or coarsen operation is necessary, the data is streamed backward
as shown by the blue arrow.
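A software model of this per-particle step (Eqs. (2)-(3) plus the array lookups) is sketched below; the hardware pipeline uses fixed-point converters (FP2FIX) where this sketch uses floats, and all names here are ours.

```python
import math

def interleave(ix, iy, bits=8):
    """Z-order block index: interleave block coordinates (bit shuffle)."""
    z = 0
    for b in range(bits):
        z |= ((ix >> b) & 1) << (2 * b)
        z |= ((iy >> b) & 1) << (2 * b + 1)
    return z

def pe_sort_step(x, y, x0, y0, L, base_addr, counter):
    """Return (write_addr, dx, dy); mutates counter like the counter array."""
    idx = math.floor((x - x0) / L)          # Eq. (2): block coordinates
    idy = math.floor((y - y0) / L)
    dx = (x - x0) / L - idx                 # Eq. (3): offset within the block
    dy = (y - y0) / L - idy
    key = interleave(idx, idy)              # Z-order key into the arrays
    addr = base_addr[key] + counter[key]    # section base + running count
    counter[key] += 1
    return addr, dx, dy
```

Each particle thus resolves to one DRAM write address in a fixed number of cycles, which is what lets the sort stream through the pipeline.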
Fig. 11. Datapath of the PE_Map module: four RAM banks addressed by the Z-order indices of the 2×2 adjacent grid points (RAM select from the lowest 2 bits, offset from the higher bits)
B. PE_Map
The datapath of the PE_Map module is shown in Figure 11.
The bi-linear interpolation result of the particle mass m is
(m0, m1, m2, m3). These results are then added to the related
grid points:

m0 = m · dx · dy    (4)
m1 = m · (1 − dx) · dy    (5)
m2 = m · dx · (1 − dy)    (6)
m3 = m · (1 − dx) · (1 − dy)    (7)
At this point, we examine the design with respect to FPGA
resource constraints. We find that the numbers of computational
and on-chip memory units are sufficient; the latter, however,
only holds if data are mapped so as to remove bank conflicts.
In PE_Map, each particle map operation needs to add mass
interpolation results to four (2D) or eight (3D) adjacent grid
points. If these adjacent grid points are not located in different
memory banks, they cannot all be accessed in a single
cycle. Since waiting for data causes idle pipeline stages, a
suitable memory partition is necessary.
1) Memory Partition: The Z-order indexing is helpful in
creating a collisionless data layout. As shown in Figure 12,
grid points must be mapped to four different RAM banks.
The grid-point index is its Z-order value. The remainder of
z_order/4 is used for RAM bank selection, while the quotient of
z_order/4 is used as the address offset. For example, grid point 14 is
mapped to RAM 2's offset 3.
Figure 13 shows simultaneous memory accesses. For particles
located in green Block_0, the memory access request for
(00, 01, 02, 03) is collisionless. For particles located in orange
Block_3, the memory access request for (03, 06, 09, 12) is
collisionless. For particles located in ivory Block_5, the memory
access request for (12, 13, 06, 07) is also collisionless.
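The partition rule fits in one line; since the low two bits of a 2D Z-order index are exactly the (x mod 2, y mod 2) pair, the four corners of any 2×2 deposit always land in four different banks. A minimal sketch:

```python
# Map a Z-order grid-point index to (RAM bank, address offset) with 4 banks.
def bank_map(z):
    return z % 4, z // 4   # remainder selects the bank, quotient the offset
```

This is why the figure's three example point sets never collide, regardless of which block the particle falls in.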
Fig. 12. Memory partition method
Fig. 13. Conflict-free parallel memory access: the grid points of Block_0 (00, 01, 02, 03), Block_3 (03, 06, 09, 12), and Block_5 (06, 07, 12, 13) each fall in four different RAM banks
Fig. 14. Read-after-write data hazard: sequential same-block updates (LD/ADD/ST on the same address) insert pipeline bubbles during the FP-adder delay; interleaving particles from blocks i, i+4, i+8 keeps the load, adder, and store stages full
2) RAW hazard: In Figure 11, PE_Map reads the temporary
mass from RAM, adds the interpolation result, and stores the
sum back to RAM. However, the floating point adders have
long latency (more than ten cycles). As shown in Figure 14, if
we handle the particles marked blue sequentially, Particle_0 to
Particle_2, then PE_Map needs to repeatedly read and write
the same position in RAM, causing a RAW hazard. Therefore,
FP-AMR handles the particles marked brown sequentially,
Particle'_0 to Particle'_2. Because these particles have different
block ids and different grid points, the hazard is avoided.
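A minimal software model of this issue order follows; the block stride is the value suggested by Fig. 14, and a real scheduler would match the interleave depth to the adder's pipeline depth (our simplification).

```python
# Issue particles round-robin across blocks i, i+STRIDE, i+2*STRIDE, ...,
# so consecutive adds never target the same block's grid points while a
# previous floating-point add to them is still in flight.
STRIDE = 4  # assumed issue distance between block ids, as in Fig. 14

def issue_order(particles_by_block):
    """Return the (block, particle) issue sequence."""
    order = []
    for r in range(STRIDE):
        group = sorted(b for b in particles_by_block if b % STRIDE == r)
        depth = max((len(particles_by_block[b]) for b in group), default=0)
        for d in range(depth):            # round-robin within the group
            for b in group:
                if d < len(particles_by_block[b]):
                    order.append((b, particles_by_block[b][d]))
    return order
```

Because the sorted particle list is already grouped by block index, producing this interleaving is a matter of address generation, not data movement.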
V. EXPERIMENTAL EVALUATION
We use an astrophysics AMR-based N-body simulation,
which solves the Poisson Equation, to evaluate the perfor-
mance of FP-AMR in an FPGA-centric cluster. We implement
three different systems as the control group: 1) CPU-only
cluster, 2) CPU-GPU cluster (GPUs as co-processors), and
3) CPU-FPGA cluster (FPGAs as co-processors). These three
versions are based on AMReX, which is a publicly available
software framework designed for building massively parallel
block-structured AMR applications.
A. Experimental Setup
We use NVIDIA Tesla P100 GPUs and Xilinx VCU118
Boards as platforms. These devices all use the 16nm process.
The host CPU is a single-socket Intel Xeon E5-2660 with 8 cores
running a multithreaded version. Table I shows details of the
FP-AMR and the three control clusters. We currently have
resources to test all four configurations up to eight nodes with
one accelerator per node. The control clusters all use CPUs
to handle AMReX-based AMR and use CPU-side Ethernet
for communication between nodes. For the Poisson solver, the
CPU-only version uses the Intel AVX-enhanced FFTW library;
the CPU-GPU version uses cuFFT; and the two FPGA versions
both use Xilinx FFT IP.
TABLE I
EXPERIMENT SETUP

Design     | CPU Side       | Poisson Solver | Interconnection
CPU-only   | AMReX          | FFTW (AVX)     | Eth
CPU-GPU    | AMReX          | cuFFT          | Eth
CPU-FPGA   | AMReX          | Xilinx FFT IP  | Eth
FP-AMR     | Initialization | Xilinx FFT IP  | Eth + Trans
In the FP-AMR version, the CPU only handles program
initialization and workload scheduling. All AMR and Pois-
son solver tasks are completed on the FPGA side. All
other communication is direct FPGA-to-FPGA through FPGA
transceivers. The design is coded in Verilog. For the Poisson
solver, the frequency is 250MHz. For the utility (DMA, bus,
etc) and FP-AMR parts, the frequency is 200MHz.
The performance depends on the initial conditions of the
simulation. For example, the particles’ kinetic energy can
influence the number of refine/coarsen operations and the
frequency of particle information exchanges. We use NGenIC,
a widely used initial conditions generator for cosmological
simulations. We generate two different initial conditions:
strong interaction and weak interaction. These have the same
number of particles and the same distribution, but the kinetic
energy of the first is twice that of the second. In both cases,
results with 32- and 64-bit FP are nearly indistinguishable, so
the CPU/GPU/FPGA implementations all use the FP32 format.
B. Performance Summary
To measure performance we run the system for 100
timesteps after a 10 timestep warm-up. Figure 15 shows
average node performance for different initial conditions and
node count. Table II shows the systems’ performance. All of
the accelerators do very well compared with the CPU-only
system: CPU+GPU is 35×-37× better, CPU+FPGA 17×-19×,
and FP-AMR 21×-23×.
Fig. 15. Per-node performance (1/Time/Node) for different initial conditions (WI: weak interaction, SI: strong interaction) and node counts
Compared with the CPU+FPGA system, FP-AMR's performance
is 19%-20% better. The improvement of FP-AMR over
CPU+FPGA originates from two effects. First, at the end of each
time step, nodes transfer particles as they move through boundaries;
FP-AMR benefits from the direct FPGA-to-FPGA connections.
Second, as mentioned in Section II, FP-AMR offloads several
tasks to the FPGAs. If the initial condition has higher kinetic
energy, it causes more particle information exchange and degrades
the particles' spatial locality, which has a negative
effect on the particle map. As shown in Figure 15, for the strong-
interaction initial condition, the time consumption of each timestep
is 22.8% larger.
TABLE II
PERFORMANCE OVERVIEW

                   | CPU-only | CPU+GPU     | CPU+FPGA    | FP-AMR
Performance        | 1x       | 34.6x-36.6x | 17.3x-19.5x | 20.7x-23.2x
Cache Miss Ratio   | 17.1%    | 17.1%       | 17.1%       | 18.7%
Energy Consumption | 1x       | 0.075x      | 0.015x      | 0.012x
C. Memory subsystem
In our experiments, we find diminishing returns beyond
a 4KB 4-way set-associative cache. The replacement policy is
LRU. To evaluate FP-AMR's memory subsystem, we measure
the cache miss rate of the particle map in FP-AMR and the
CPU-only system. For FP-AMR, we instrument the FPGA
cache with counters and find a cache miss rate of 18.7%. For
the CPU-only system, we use Linux-perf to profile the CPU’s
last level cache miss ratio, which is 17.1%. Compared with the
CPU’s highly optimized and sophisticated cache design, FP-
AMR’s memory subsystem only causes a 1.6% higher cache
miss ratio.
D. Performance and Resource Utilization Break-Down
Fig. 16. CPU-FPGA and FP-AMR system time consumption break-down and utilization break-down (strong- and weak-interaction cases)

Figure 16 gives a break-down of time consumption for the
8-node versions of CPU-GPU, CPU-FPGA, and FP-AMR. For
the Poisson solver, for each initial condition, the two FPGA
clusters have similar times (as expected), while the GPU
cluster is more than three times faster. For the intra-node particle
map, both the CPU-FPGA and CPU-GPU clusters execute this
operation on the CPU and so, as expected, their performance
is similar; FP-AMR executes it on the FPGAs and is twice
as fast. For inter-node particle exchange, again the CPU-FPGA
and CPU-GPU clusters have similar performance, while FP-
AMR, which uses the direct FPGA-FPGA connections, is so
fast it is barely visible.
Figure 16 also shows the FPGA resource utilization. The
FP-AMR design has three parts: basic utility (DRAM con-
troller, Transceivers controller, etc), FP-AMR module, and
Poisson solver. For FP-AMR, the FPGA resource bottleneck
is DSP slices. The FP-AMR framework uses less than 5% of
DSP slices. This also explains the disparity in performance
between the GPU and FPGA in the Poisson solver.
E. System Scalability
Figure 17(A-B) shows the performance scalability of the
four systems with two different initial conditions. The AMReX
framework clearly gives the three control groups good
scalability. Based on this limited-scale experiment, FP-AMR's
performance scales with comparable linearity.
VI. DISCUSSION AND FUTURE WORK
In this work, we study the mapping of AMR onto FPGAs
and FPGA clusters. We believe that this is the first such
study. We create two versions, one where the FPGAs execute
only the computations that are traditionally offloaded to the
accelerator, while the CPUs retain control and data structures;
and a second, FP-AMR, where the entire computation (after
initialization) is executed on the FPGAs.

Fig. 17. Scalability

Executing the balance of the CPU-based tasks on the FPGAs,
including particle mesh mapping and mesh refinement and
coarsening, requires customized data structures and a customized
memory subsystem.
Another advantage of FP-AMR is that inter-iteration particle
exchanges are executed using direct FPGA-FPGA connections.
The FP-AMR enhancements lead to a 20% performance
improvement over the CPU+FPGA version.
Overall we find that both FPGA-based versions of AMR
substantially improve performance over CPU-based versions,
with factors of 17×-19× and 21×-23×, respectively. Energy
consumption is improved by a much greater factor. The
limiting factor on FPGA performance is the number of DSP units.
When compared with a high-end GPU of similar process
generation, we find that the performance of the GPU is 1.46×-
1.67× better than FP-AMR. This appears to be entirely due
to floating point resources available on the GPUs for the
execution of the intra-iteration payload computations. Clearly
having more DSPs would allow the FP-AMR to have equal
or better performance. The energy consumption of the FPGA-
based systems is about 6× less than that of the GPU-based
systems.
We tested the scalability of all of the systems and found that they scaled well within the size of the clusters to which we had access. Testing on larger systems is required to demonstrate the expected benefit of the direct FPGA-FPGA connections on scalability.
It is likely that we can improve the performance of the payload operations on the FPGA (the Poisson solver). We have not concentrated on this so far in this study, but it is the primary hindrance to matching GPU performance. Exchanging the IP-based FFTs for custom FFTs could bridge the gap, even given the disparity in floating-point resources.
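To make the spectral structure of that payload concrete, the sketch below solves a periodic 1-D Poisson problem in frequency space, using a naive O(n²) DFT as a stand-in for the transform cores; it illustrates the method only and is not FP-AMR's solver.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;

// Naive unnormalized DFT; sign = -1 forward, +1 inverse.
std::vector<cd> dft(const std::vector<cd>& a, int sign) {
    const double PI = std::acos(-1.0);
    const int n = static_cast<int>(a.size());
    std::vector<cd> out(n);
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            out[k] += a[j] * std::polar(1.0, sign * 2.0 * PI * k * j / n);
    return out;
}

// Solve u'' = f (periodic, zero-mean) by dividing each mode by -k^2.
std::vector<double> poisson1d(const std::vector<double>& f) {
    const int n = static_cast<int>(f.size());
    std::vector<cd> fh = dft(std::vector<cd>(f.begin(), f.end()), -1);
    for (int k = 1; k < n; ++k) {
        int kk = (k <= n / 2) ? k : k - n;   // signed wavenumber
        fh[k] /= -static_cast<double>(kk) * kk;
    }
    fh[0] = 0.0;                              // pin the mean to zero
    std::vector<cd> uh = dft(fh, +1);
    std::vector<double> u(n);
    for (int i = 0; i < n; ++i) u[i] = uh[i].real() / n;
    return u;
}
```

For f(x) = -sin(x) on [0, 2π) this recovers u(x) = sin(x); the two transforms dominate the cost, which is why the quality of the FFT implementation governs the payload performance discussed above.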