
FP-AMR: A Reconfigurable Fabric Framework for Adaptive Mesh Refinement Applications

Tianqi Wang∗ Tong Geng† Xi Jin∗ Martin Herbordt†

∗Department of Physics; University of Science and Technology of China, Hefei, China
†Department of Electrical and Computer Engineering; Boston University, Boston, MA, USA

Abstract—Adaptive mesh refinement (AMR) is one of the most widely used methods in High Performance Computing, accounting for a large fraction of all supercomputing cycles. AMR operates by dynamically and adaptively applying computational resources non-uniformly to emphasize regions of the model as a function of their complexity. Because AMR generally uses dynamic and pointer-based data structures, acceleration is challenging, especially in hardware. As far as we are aware, there has been no previous work published on accelerating AMR with FPGAs.

In this paper, we introduce a reconfigurable fabric framework called FP-AMR. The work is in two parts. In the first, FP-AMR offloads the bulk per-timestep computations to the FPGA; analogous systems have previously done this with GPUs. In the second part we show that the rest of the CPU-based tasks–including particle mesh mapping, mesh refinement, and coarsening–can also be mapped efficiently to the FPGA. We have evaluated FP-AMR using the widely used program AMReX and found that a single FPGA outperforms a Xeon E5-2660 CPU server (8 cores) by 21×-23×, depending on problem size and data distribution.

Index Terms—Reconfigurable Computing, Adaptive Mesh Refinement, Scientific Computing, High Performance Computing

I. INTRODUCTION

The increase in supercomputer performance over the last several decades has, by itself, been insufficient to solve many challenging modeling and simulation problems. For example, the complexity of solving evolutionary partial differential equations (PDEs) scales as Ω(n^4), where n is the number of mesh points per dimension. Thus, the three-order-of-magnitude improvement in performance over the past 25 years has meant just a 5.6× gain in spatio-temporal resolution [1]. To address this problem, many simulation packages [2]–[6] use Adaptive Mesh Refinement (AMR)–which selectively applies computing to regions of most interest–to increase resolution. Its utility and versatility have made AMR one of the most widely used frameworks in HPC [7].

The problem with traditional uniform meshes is that difficult regions (discontinuities, steep gradients, shocks) require high resolution; but since that high resolution is applied everywhere equally, most of the computation is wasted. For example, the physical universe is populated by structures at very different scales, from superclusters down to galaxies. These strongly correlated structures need fine resolution in both space and time, while regions of space that are mostly void could make do with coarse resolution (e.g., [6], [8]). When using AMR, simulations start from a coarse grid. The program then identifies regions that need more resolution and superimposes finer sub-grids only on those regions.

With the emergence of FPGAs in large-scale clouds and clusters, it makes sense to investigate whether they are an appropriate tool for AMR acceleration; we are not aware of any previous study. The challenge is that AMR generally uses dynamic and pointer-based data structures, characteristics that make any acceleration difficult [9], [10], since random off-chip DRAM access is expensive and inefficient. AMR frameworks that support GPU acceleration [8] generally use the GPU only to execute the PDE solver and use a host CPU to handle operations on the dynamic data structures [5]. FPGAs can be designed with a cache to support traversal of pointer-based data structures [11], but this approach does not support their modification. Our approach provides a carefully designed data structure and memory subsystem to support all relevant operations.

In this work, we propose FP-AMR, a reconfigurable fabric framework for block-structured adaptive mesh refinement applications [2]. FP-AMR helps programmers develop AMR applications for both FPGA-centric clusters and traditional HPC clusters with FPGAs as accelerators. The novel contributions are as follows:

• FP-AMR can offload all CPU-based tasks to FPGAs, including particle mesh mapping, mesh refinement, and coarsening. In all GPU AMR versions of which we are aware, these functions must be executed on a CPU. The advantage of complete offload is to drastically reduce communication overhead.

• FP-AMR uses a Z-order space-filling curve to map the simulation space and the adaptive mesh to a customized data structure. This design uses spatial locality to reduce data access conflicts and improve memory access efficiency.

• FP-AMR supports direct FPGA-to-FPGA communication. This means that inter-accelerator communication, which is necessary to periodically update ghost-zones, avoids transit through the CPUs.

• Experiments show that, with FP-AMR, a single FPGA outperforms a Xeon E5-2660 CPU by 21× to 23×, depending on cluster scale and simulation initial conditions. They also show that, using the secondary network, performance scales perfectly to eight FPGAs and is likely to scale similarly on much larger FPGA clusters.


II. BACKGROUND

A. AMR

Cosmological simulation is used as a motivating example of AMR (Algorithm 1). The functions Calc_Potential, Calc_Accel, and Motion_Update, listed in Lines 2-4, can be replaced by any application-specific customized functions. The key feature is the function Particle_Map listed in Line 1.

Algorithm 1 Algorithm for Time Forward
Input: Pi: particle information at time Ti
Output: Pi+1: particle information at time Ti+1
1: Mass_Meshi = Particle_Map(Pi)  # Map Ti particles to the adaptive mesh
2: Potentiali = Calc_Potential(Mass_Meshi)  # Calculate the Ti potential distribution
3: Acceli = Calc_Accel(Potentiali)  # Calculate all particles' accelerations at Ti
4: Pi+1 = Motion_Update(Acceli, Pi)  # Motion update to generate Pi+1
5: return Pi+1

Fig. 1. Spatio-temporal refinement in the AMR algorithm: (A) space dimension refinement across grid levels 0-2; (B) time dimension refinement (subcycling steps 1-9 over intervals t, t/2, and t/4).

Figure 1 shows details of Particle_Map: how a cosmological simulation uses AMR to implement varying resolutions in the spatio-temporal regions of high mass density. As shown in Figure 1(A), AMR starts from the coarsest grid (Level 0), identifies blocks with high mass density (red), applies finer resolution to those blocks, and generates a finer grid (Level 1) for just those blocks. Continuing, the next grid (Level 2) is generated for the high-density blocks in Level 1 (blue).

Figure 1(B) shows the subcycling-in-time approach used by AMR. Step 1: Integrate Level 0 over t. Step 2: Integrate Level 1 over t/2. Step 3: Integrate Level 2 over t/4. Step 4: Integrate Level 2 over t/4. Step 5: Synchronize Levels 1 and 2. Step 6: Integrate Level 1 over t/2. Step 7: Integrate Level 2 over t/4. Step 8: Integrate Level 2 over t/4. Step 9: Synchronize Levels 0, 1, and 2.
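To make the recursion in this schedule explicit, the following is a minimal Python sketch of subcycling-in-time; integrate and synchronize are illustrative stand-ins for the real per-level solver and synchronization, not functions from AMReX or FP-AMR, and the final synchronize(0, 1) corresponds to the end-of-step synchronization of all levels.

def integrate(level, dt):
    # Stand-in for the per-level solver (e.g., Lines 2-4 of Algorithm 1)
    print(f"integrate level {level} over {dt}")

def synchronize(coarse, fine):
    # Stand-in for coarse/fine synchronization (steps 5 and 9 in Fig. 1B)
    print(f"synchronize levels {coarse} and {fine}")

def advance(level, dt, max_level=2):
    """Subcycling-in-time (Fig. 1B): each finer level takes two
    half-size steps per coarse step, then syncs with its parent."""
    integrate(level, dt)
    if level < max_level:
        advance(level + 1, dt / 2, max_level)
        advance(level + 1, dt / 2, max_level)
        synchronize(level, level + 1)

advance(0, 1.0)  # reproduces steps 1-9 for grid levels 0-2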

Algorithm 2 Algorithm for Particle Map
Input: Pi: particle information at time Ti
Output: Mass_Meshi: adaptive mesh at time Ti
1: Initialize(Mesh_Conf, P_Sorti)
2: for each block ∈ Space do
3:   count = 0
4:   par_list = ∅
5:   for each par ∈ Pi do
6:     if par_loc ∈ block then
7:       count += 1
8:       par_list ← par
9:     end if
10:  end for
11:  if count > threshold1 then
12:    Mesh_Conf ← Refine(block)
13:  else if Σ_adj count < threshold2 then
14:    Mesh_Conf ← Coarsen(block)
15:  else
16:    Mesh_Conf ← Keep(block)
17:  end if
18:  P_Sorti ← par_list
19: end for
20: for each par ∈ P_Sorti do
21:  Mass_Mesh ← Interpolation(par, Mesh_Conf)
22: end for
23: return Mass_Mesh

Details of Particle_Map are shown in Algorithm 2. In Lines 2-19, the first nested loop traverses the whole simulation space and counts each mesh block's particle number. According to the particle density of each block and its adjacent blocks, the block is refined, coarsened, or kept. Based on Mesh_Conf, generated by Lines 2-19, Lines 20-22 traverse all particles and interpolate the particles' masses onto the adaptive mesh. At the same time, all particles are sorted based on spatial locality for the next step's particle interpolation.

Fig. 2. Bi-linear interpolation: a particle with mass M at offset (dx, dy) inside a cell divides the cell into areas $S_1 = dx \cdot dy$, $S_2 = (1-dx) \cdot dy$, $S_3 = (1-dx)(1-dy)$, and $S_4 = dx \cdot (1-dy)$; the mass deposited on the four surrounding grid points is $m_i = M \cdot S_i$.

In summary, the key features of Particle_Map used to build and adapt the adaptive mesh are:

1) Refine mesh and coarsen mesh: According to the mass density, the AMR method must decide whether each block should be refined or coarsened.

2) Interpolate: As shown in Fig. 2, bi-linear or tri-linear methods can be used for the 2D and 3D cases, respectively. Based on area/volume, mass is partitioned and mapped to the mesh.
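As a concrete illustration of the 2D case, here is a short Python sketch of bi-linear mass deposition; the corner-to-weight assignment follows the standard cloud-in-cell convention, and the names (deposit_bilinear, mesh) are illustrative rather than FP-AMR's implementation.

def deposit_bilinear(mass, x, y, x0, y0, L, mesh):
    """Split a particle's mass among the four surrounding grid points,
    each weighted by the area of the opposite sub-rectangle (Fig. 2)."""
    fx, fy = (x - x0) / L, (y - y0) / L   # position in cell units
    ix, iy = int(fx), int(fy)             # lower-left grid point
    dx, dy = fx - ix, fy - iy             # in-cell offset
    # The four weights sum to 1 (cf. Eqs. (4)-(7) in Sec. IV-B)
    for gx, gy, w in [(ix,     iy,     (1 - dx) * (1 - dy)),
                      (ix + 1, iy,     dx       * (1 - dy)),
                      (ix,     iy + 1, (1 - dx) * dy),
                      (ix + 1, iy + 1, dx       * dy)]:
        mesh[(gx, gy)] = mesh.get((gx, gy), 0.0) + mass * w

mesh = {}
deposit_bilinear(1.0, 0.25, 0.75, 0.0, 0.0, 1.0, mesh)
assert abs(sum(mesh.values()) - 1.0) < 1e-12  # mass is conserved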

Fig. 3. Overview of FP-AMR: (A) system overview (nodes 1-n, each with a host CPU, DRAM, and a reconfigurable device with its own DRAM, connected via PCIe, a host communication network, and a device direct communication network); (B) FPGA design (router with North/East/South/West/Up/Down transceiver links, PCIe and DDR controllers, the FP-AMR module, and user-logic kernels 1-n); (C) FP-AMR module (Phase I: PE_Sort turns particle information into sorted particle information and AMR grid status; Phase II: PE_Map produces the mapped mesh in DRAM).

B. FPGA-centric clusters

In traditional HPC clusters, accelerators communicate through their hosts, which results in substantial latency overhead. In FPGA-centric clusters, such as Catapult I [12] and many other recent systems [13], [14], FPGAs are interconnected directly through their transceivers [15]–[17]. As shown in Figure 3, the reconfigurable devices (marked yellow) can exchange data without detouring through the host CPUs. While there is no technological reason why GPUs (and other accelerators) cannot be built with a similar capability, to date NVLink supports only a small number of GPUs and, unlike with FPGAs, adds substantially to the cost of the cluster.

C. Challenge and Motivation

For AMR-based applications, accelerators use optimized kernels to handle the functions in Algorithm 1, Lines 2-4. Particle_Map uses tree-based dynamic data structures. For scientific simulations, the size of these data structures is usually hundreds of MB, so they must be stored off-chip. Since the large number of random accesses makes these operations difficult to accelerate, programmers generally run them on host CPUs.

Fig. 4. Communication and computation phases across timesteps: host phases (Th), host-device communication (Tc), and device phases (Td).

Figure 4 shows how traditional clusters execute AMR through phases and communication between host and device. The fraction of time for Particle_Map can be expressed as

$Ratio_{amr} = \frac{T_h + 2T_c}{T_h + T_d + 2T_c}$ (1)

where $T_h$ is CPU time, $T_d$ is accelerator time, and $T_c$ is communication latency.

A better accelerator can reduce $T_d$, but the time required to exchange hundreds of MB of data between host and device (e.g., through PCIe) is substantial. Moreover, $T_c$ cannot be overlapped because timestep 2 cannot start before timestep 1 finishes. Offloading Particle_Map to the accelerator is therefore essential to improve performance further.

III. FP-AMR FRAMEWORK

A. Overview

Figure 3 shows FP-AMR in an FPGA-centric cluster. Figure 3(A) shows the system overview; Figure 3(B) gives details of the FPGA design. A router handles FPGA-to-FPGA communication through the transceivers. Functions 2-4 in Algorithm 1 are instantiated as user-specific kernels 1-n. FP-AMR is in the red block and wraps the user-specific kernels. According to Algorithm 2, two nested loops are executed sequentially. As shown in Figure 3(C), the FP-AMR module consists of two parts, PE_Sort and PE_Map, corresponding to the two loops. PE_Sort reads the particle information list and then generates the sorted particle information list and the AMR grid status. Based on these structures, PE_Map generates the final mass mapping mesh. For most simulation applications, the adaptive mesh data structures are too large (usually several GB) to be stored on-chip.

B. Space Filling Curve

According to Algorithm 2, Lines 2-19 and Lines 20-22, AMR must traverse the whole simulation space and the particle information list. Since all necessary data structures (Figure 3(C)) are stored off-chip, FP-AMR needs to access off-chip memory frequently. To make this more efficient, FP-AMR uses an on-chip cache to take advantage of spatial locality. To map the 2D/3D simulation space to the 1D memory address space, a space-filling curve (SFC) is advantageous. SFCs have ranges that contain the entire 2D/3D unit square; Z-order and Peano-Hilbert are two widely used SFCs.

Fig. 5. Using a space-filling curve to index the whole simulation space and adaptive mesh grid: (A) the simulation space contains blocks and mesh grid points; (B) blocks extracted and indexed by (X_Block_ID, Y_Block_ID); (C) grid points extracted and indexed by (X_Grid_ID, Y_Grid_ID).

Fig. 6. Different space-filling curves over a 4×4 grid: (A) Z-order curve; (B) Peano-Hilbert curve.

As shown in Figure 5(A), the whole simulation space contains two basic elements: blocks and mesh grids. Blocks are the orange regions in Figure 5(A) and are used to count the particles' spatial distribution. The mesh grids are the black points in Figure 5(A)(C) and are used to store the mapped particles' mass. To take advantage of spatial locality, the blocks and mesh grids both need to be indexed with the SFC. We use the Z-order curve as an example. In Figure 5(B)(C), the blocks and mesh grid points are extracted separately and indexed. Figure 6 shows the indexing methods for the Z-order and Peano-Hilbert curves. Note that the Z-order curve only needs a bit-shuffle operation, as shown in Figure 6(A). In contrast, the Peano-Hilbert curve generator is a recursive algorithm, which is extremely expensive to implement in hardware. Clearly, the Z-order curve is the more reasonable choice.
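For reference, a minimal Python sketch of the Z-order bit shuffle; the bit-assignment convention (x bits to even positions, y bits to odd) is one common choice and is illustrative rather than the exact wiring used in FP-AMR.

def z_order_2d(x_id, y_id, bits=8):
    """Morton index by interleaving the bits of the two block/grid
    coordinates -- pure wiring in hardware, no recursion needed
    (unlike a Peano-Hilbert generator)."""
    z = 0
    for b in range(bits):
        z |= ((x_id >> b) & 1) << (2 * b)       # x bits -> even positions
        z |= ((y_id >> b) & 1) << (2 * b + 1)   # y bits -> odd positions
    return z

assert z_order_2d(0b01, 0b10) == 0b1001  # (x=1, y=2) -> index 9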

For the adaptive mesh, the blocks of the whole simulation space are shown in Figure 7(A). From the adaptive mesh, we can extract mesh grids at different levels (Level 1 to Level 3), as shown in Figure 7(B). AMR uses a finer grid to index critical areas. For a particle located in the block marked with the star in Figure 7(A), we generate its indices at all three grid levels and use the spliced three indices to reference the particle (Figure 7(B)).
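A small sketch of the multi-level splicing of Figure 7(B), assuming each level contributes a two-digit Z-order block id (the formatting and function name are illustrative):

def spliced_index(level_ids):
    """Concatenate per-level Z-order block ids, coarse to fine,
    to reference a particle in the adaptive mesh (Fig. 7B)."""
    return "_".join(f"{i:02d}" for i in level_ids)

assert spliced_index([3, 1, 2]) == "03_01_02"  # the starred block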

C. Data Structures

In FP-AMR, particle information and the adaptive mesh are stored in off-chip DRAM, so their data structures must be designed carefully.

Fig. 7. Adaptive mesh indexing: (A) the adaptive mesh with blocks 00, 01_00-01_03, 02, 03_00, 03_01_00-03_01_03, 03_02, and 03_03; (B) index method: the L1, L2, and L3 block indices are spliced, e.g., Index = 03_01_02.

Figure 8 shows all the data structures related to Algorithm 1 and their relationships. For time step i: (1) Sort: the particle information list of Time_i (P_i: particle 0 to particle n) is sorted by spatial locality to generate P_Sort_i (particle 0' to particle n'). (2) Interpolation: generate Mass_Mesh_i of Time_i. (3) Calc_Potential/Calc_Accel: user-specific kernels calculate the discrete potential and each particle's acceleration at Time_i based on Mass_Mesh_i. (4) Motion update: based on Time_i's accelerations, the (location, velocity) fields of P_Sort_i are updated; the result is the next timestep's particle information list P_{i+1}.

There are two kinds of data structures: the particle information lists (P_i, P_Sort_i) and the adaptive mass mesh point list (Mass_Mesh_i). Both have a similar memory layout.

For the particle information list (Figure 9(A)), the list is divided into several sections, where each section contains all particles that share the same block-index. The sections are stored contiguously, and sections with larger block-indices are stored at higher addresses. As shown in Figure 8, from time step i to time step i+1, the location field of the particle information list is updated. In the next time step i+2, even if the particle spatial distribution is relatively stable, some particles will flow into and out of each block. For each section there is extra backup space at the end, in case more particles come to be located in that block.

Besides the memory that stores particle information, FP-AMR needs an extra base-address array and a counter array. The base-address array keeps each section's base address, and the counter array records how many particles exist in each block. Figure 9(B) shows how the coarsen and refine operations affect the particle information list. First, the section or adjacent sections (including the backup blanks) are fragmented or merged. Then all related particles are re-sorted, and the base-address and counter arrays are updated.
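The following Python sketch models this layout: contiguous per-block sections with backup space, plus the base-address and counter arrays. All names are illustrative, and the structure is a software analogue of the DRAM layout, not the Verilog implementation.

class ParticleList:
    """Sectioned particle list (Fig. 9A): one section per block,
    ordered by Z-order block index, with backup space per section."""
    def __init__(self, counts, backup=8):
        self.base = []                       # base-address array
        self.counter = [0] * len(counts)     # counter array
        addr = 0
        for c in counts:
            self.base.append(addr)
            addr += c + backup               # section size + backup blank
        self.mem = [None] * addr             # flat stand-in for DRAM

    def insert(self, block_id, particle):
        # write address = section base + current particle count
        addr = self.base[block_id] + self.counter[block_id]
        self.mem[addr] = particle
        self.counter[block_id] += 1

plist = ParticleList(counts=[2, 3])          # two blocks
plist.insert(1, ("mass", "loc", "vec"))      # lands at base[1] + 0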

Fig. 8. Data structure relationships for timestep i: (1) Sort turns P_i (particles 0..n with Mass, <Loc, Vec>, and Other fields) into P_Sort_i (particles 0'..n'); (2) Interpolation produces Mass_Mesh_i (grid points 0..m, each holding a mass); (3) Calc_Potential/Calc_Accel/Motion_Update then yield P_{i+1} for the next timestep.

IV. ARCHITECTURE

In this section we introduce the hardware architecture of the two phases of FP-AMR, PE_Sort and PE_Map (shown in Figure 3(C)). The PE_Sort module sorts all particle information based on spatial locality and decides which blocks need to be refined or coarsened. PE_Map maps the sorted particles to the adaptive mesh with bi-/tri-linear interpolation.

A. PE_Sort

The data path of the PE_Sort module is shown in Figure 10. The inputs of PE_Sort are: the coordinates of this level's origin $(x_0, y_0)$; this level's cell length $L$; the particle coordinate vector $(x, y)$; and the particle's mass $m$. First, the pipeline calculates $(idx, idy)$ and $(dx, dy)$:

$(idx, idy) = \left( \left\lfloor \frac{x - x_0}{L} \right\rfloor, \left\lfloor \frac{y - y_0}{L} \right\rfloor \right)$ (2)

$(dx, dy) = \left( \frac{x - x_0}{L} - idx, \frac{y - y_0}{L} - idy \right)$ (3)

$(idx, idy)$ identifies the block in which the particle is located, and $(dx, dy)$ is the offset within the block.

Fig. 9. Data structure of the particle information list: (A) per-block sections (fields: Mass, Loc, Vec, Other) with backup blanks, plus the base-address array (Base k, k+1, k+2, ...) and counter array (Counter k, k+1, k+2, ...), with memory address increasing across sections; (B) coarsen and refine operations split or merge sections (e.g., block 02 into 02_00-02_03) and update the base-address and counter arrays accordingly.

Fig. 10. Data path of the PE_Sort module: FP2FIX conversion of (x−x0, y−y0, z−z0)/L, Z-ordering (bit shuffle), base-address and Counter-Z array lookups that form the buffer address, FIX2FP conversion of (dx, dy, dz) and (1−dx, 1−dy, 1−dz), the refine decision (streamed backward to refinement), and the on-chip cache in front of DRAM.

Using $(idx, idy)$, the Z-order module generates the block index, which is used as the key to search the base-address array and to increment the corresponding counter in the counter array. The search results of the base-address and counter arrays are added to form the address of the particle's information. Then all related particle information (such as $(dx, dy, m)$) is written to the on-chip cache. Because of the spatial locality of the particles, the cache has a low miss rate. Based on the counter array search result, PE_Sort decides whether the block needs to be refined, coarsened, or kept as is. If a refine or coarsen operation is necessary, the data is streamed backward, as shown by the blue arrow in Figure 10.
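Put together, PE_Sort's address generation can be sketched in software as follows; base_addr and counter_z stand in for the base-address and Counter-Z arrays, and all names are illustrative.

from math import floor

def pe_sort_address(x, y, x0, y0, L, base_addr, counter_z):
    """Eqs. (2)-(3) give the block id and in-block offset; the Z-order
    key then indexes the base-address and counter arrays (Fig. 10)."""
    idx, idy = floor((x - x0) / L), floor((y - y0) / L)   # Eq. (2)
    dx, dy = (x - x0) / L - idx, (y - y0) / L - idy       # Eq. (3)
    key = 0                                               # bit shuffle
    for b in range(16):
        key |= ((idx >> b) & 1) << (2 * b) | ((idy >> b) & 1) << (2 * b + 1)
    addr = base_addr[key] + counter_z[key]   # section base + count
    counter_z[key] += 1                      # bump the counter array
    return addr, (dx, dy)

base, cnt = {9: 100}, {9: 0}                 # toy arrays, Z-order key 9
print(pe_sort_address(1.5, 2.5, 0.0, 0.0, 1.0, base, cnt))  # (100, (0.5, 0.5))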

Fig. 11. Datapath of the PE_Map module: four multiply-accumulate lanes compute m0-m3 from (dx, dy), (1−dx, 1−dy), and m, and add them into RAM 0-3 at addresses derived from the Z-order indices of the adjacent grid points (i,j), (i+1,j), (i,j+1), (i+1,j+1); the lowest 2 bits of the Z-order index select the RAM (RAM_Select), the higher bits form the offset (RAM_Offset).

B. PE_Map

The datapath of the PE_Map module is shown in Figure 11. The bi-linear interpolation of the particle mass $m$ yields $(m_0, m_1, m_2, m_3)$, which are then added to the related grid points:

$m_0 = m \cdot dx \cdot dy$ (4)

$m_1 = m \cdot (1 - dx) \cdot dy$ (5)

$m_2 = m \cdot dx \cdot (1 - dy)$ (6)

$m_3 = m \cdot (1 - dx) \cdot (1 - dy)$ (7)

At this point, we examine the design with respect to FPGA resource constraints. We find that the numbers of computational and on-chip memory units are sufficient; the latter, however, holds only if data are mapped so as to remove bank conflicts. In PE_Map, each particle map operation needs to add mass interpolation results to four (2D) or eight (3D) adjacent grid points. If these adjacent grid points are not located in different memory banks, they cannot all be accessed in a single cycle. Since waiting for data causes idle stages in the pipeline, a suitable memory partition is necessary.

1) Memory Partition: The Z-order indexing is helpful in creating a collisionless data layout. As shown in Figure 12, grid points must be mapped to four different RAM banks. The grid-point index is its Z-order value: the remainder of z_order/4 selects the RAM bank, while the quotient of z_order/4 is used as the address offset. For example, grid point 14 is mapped to RAM 2 at offset 3.
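A one-line Python rendering of this partition (a sketch of the address split, with four banks assumed as in Figure 12):

def bank_map(z_index, num_banks=4):
    """Low bits of the Z-order grid index select the RAM bank,
    high bits give the in-bank offset (Fig. 12)."""
    return z_index % num_banks, z_index // num_banks   # (bank, offset)

assert bank_map(14) == (2, 3)   # the paper's example: RAM 2, offset 3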

Figure 13 shows simultaneous memory accesses. For particles located in green Block_0, the memory access request for grid points (00, 01, 02, 03) is collisionless. For particles located in orange Block_3, the request for (03, 06, 09, 12) is collisionless. For particles located in ivory Block_5, the request for (12, 13, 06, 07) is also collisionless.

Fig. 12. Memory partition method

Fig. 13. Conflict-free parallel memory access: grid points 0-15 are striped across RAM 0-3; Block_0 touches grid points (0,1,2,3), Block_3 touches (3,6,9,12), and Block_5 touches (6,7,12,13), each hitting four distinct RAM banks.

Fig. 14. Read-after-write data hazard: issuing Particle 0-2 of a single block back-to-back forces repeated LD/ADD/ST of the same grid point while the FP adder is still busy, creating pipeline bubbles in the LOAD/ADDER/STORE stages; issuing Particle 0'-2' from blocks i, i+4, and i+8 (in Z-ordering) keeps the pipeline full.

2) RAW hazard: In Figure 11, PE_Map reads the temporary mass from RAM, adds the interpolation result, and stores the sum back to RAM. However, the floating-point adders have long latency (more than ten cycles). As shown in Figure 14, if we handle the particles marked blue sequentially (Particle_0 to Particle_2), PE_Map repeatedly reads and writes the same positions in RAM, causing a RAW hazard. Therefore, FP-AMR handles the particles marked brown sequentially (Particle_0' to Particle_2'). Because these particles have different block ids and different grid points, the hazard is avoided.
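A software sketch of this issue order: interleave_blocks round-robins particles from different blocks so that back-to-back pipeline slots never touch the same grid points while the adder is in flight. The function name and list representation are illustrative, not FP-AMR's scheduler.

from itertools import zip_longest

def interleave_blocks(block_lists):
    """Round-robin particles from different blocks (e.g., blocks
    i, i+4, i+8 in Fig. 14) to break read-after-write dependences."""
    return [p for group in zip_longest(*block_lists)
            for p in group if p is not None]

# Blocks i, i+4, i+8, each with its own (already sorted) particles:
order = interleave_blocks([["a0", "a1"], ["b0", "b1"], ["c0", "c1"]])
assert order == ["a0", "b0", "c0", "a1", "b1", "c1"]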


V. EXPERIMENTAL EVALUATION

We use an astrophysics AMR-based N-body simulation, which solves the Poisson equation, to evaluate the performance of FP-AMR in an FPGA-centric cluster. We implement three different systems as the control group: 1) a CPU-only cluster, 2) a CPU-GPU cluster (GPUs as co-processors), and 3) a CPU-FPGA cluster (FPGAs as co-processors). These three versions are based on AMReX, a publicly available software framework designed for building massively parallel block-structured AMR applications.

A. Experimental Setup

We use NVIDIA Tesla P100 GPUs and Xilinx VCU118 boards as platforms; these devices all use a 16nm process. The host CPU is a single-socket Intel Xeon E5-2660 with 8 cores running a multithreaded version. Table I shows the details of FP-AMR and the three control clusters. We currently have resources to test all four configurations up to eight nodes with one accelerator per node. The control clusters all use CPUs to handle AMReX-based AMR and use CPU-side Ethernet for communication between nodes. For the Poisson solver, the CPU-only version uses the Intel AVX-enhanced FFTW library; the CPU-GPU version uses cuFFT; and the two FPGA versions both use the Xilinx FFT IP.

TABLE I
EXPERIMENT SETUP

Design    | CPU Side | Poisson Solver | Interconnection
CPU-only  | AMReX    | FFTW (AVX)     | Eth
CPU-GPU   | AMReX    | cuFFT          | Eth
CPU-FPGA  | AMReX    | Xilinx FFT IP  | Eth
FP-AMR    | Initial  | Xilinx FFT IP  | Eth + Trans

In the FP-AMR version, the CPU handles only program initialization and workload scheduling. All AMR and Poisson solver tasks are completed on the FPGA side, and all other communication is direct FPGA-to-FPGA through the FPGA transceivers. The design is coded in Verilog. The Poisson solver runs at 250MHz; the utility logic (DMA, bus, etc.) and the FP-AMR parts run at 200MHz.

Performance depends on the initial conditions of the simulation. For example, the particles' kinetic energy can influence the number of refine/coarsen operations and the frequency of particle information exchanges. We use NGenIC, a widely used initial-conditions generator for cosmological simulations, to generate two different initial conditions: strong interaction and weak interaction. These have the same particle number and distribution, but the kinetic energy of the first is twice that of the second. In both cases the results with 32- and 64-bit floating point are nearly indistinguishable, so the CPU/GPU/FPGA implementations all use the FP32 format.

B. Performance Summary

To measure performance we run the system for 100 timesteps after a 10-timestep warm-up. Figure 15 shows the average per-node performance for different initial conditions and node counts; Table II summarizes the systems' performance. All of the accelerated systems do very well compared with the CPU-only system: CPU+GPU is 35×-37× better, CPU+FPGA 17×-19×, and FP-AMR 21×-23×.

Fig. 15. Per-node performance (1/time/node) with different initial conditions (WI: weak interaction, SI: strong interaction) and node counts (one, two, four, and eight nodes) for CPU-only, CPU+GPU, CPU+FPGA, and FP-AMR.

Compared with the CPU+FPGA system, FP-AMR's performance is 19%-20% better. The improvement of FP-AMR over CPU+FPGA originates from two effects. First, at the end of each time step, nodes transfer particles as they move across boundaries; FP-AMR benefits from the direct FPGA-to-FPGA connections. Second, as mentioned in Section II, FP-AMR offloads several tasks to the FPGAs. If the initial condition has higher kinetic energy, it causes more particle information exchange and degrades the particles' spatial locality, which has a negative effect on the particle map. As shown in Figure 15, for the strong-interaction initial condition the time consumption of each timestep is 22.8% larger.

TABLE II
PERFORMANCE OVERVIEW

                   | CPU-only | CPU+GPU     | CPU+FPGA    | FP-AMR
Performance        | 1x       | 34.6x-36.6x | 17.3x-19.5x | 20.7x-23.2x
Cache Miss Ratio   | 17.1%    | 17.1%       | 17.1%       | 18.7%
Energy Consumption | 1x       | 0.075x      | 0.015x      | 0.012x

C. Memory Subsystem

In our experiments, we find diminishing returns beyond a 4KB 4-way set-associative cache with an LRU replacement policy. To evaluate FP-AMR's memory subsystem, we measure the cache miss rate of the particle map in FP-AMR and in the CPU-only system. For FP-AMR, we instrument the FPGA cache with counters and find a cache miss rate of 18.7%. For the CPU-only system, we use Linux perf to profile the CPU's last-level cache miss ratio, which is 17.1%. Compared with the CPU's highly optimized and sophisticated cache design, FP-AMR's memory subsystem incurs a cache miss ratio only 1.6 percentage points higher.

D. Performance and Resource Utilization Break-Down

Figure 16 gives a break-down of time consumption for the 8-node versions of CPU-GPU, CPU-FPGA, and FP-AMR. For the Poisson solver, for each initial condition, the two FPGA clusters have similar times (as expected), while the GPU cluster is more than three times faster. For intra-node particle map, both the CPU-FPGA and CPU-GPU clusters execute this operation on the CPU and so, as expected, their performance is similar; FP-AMR executes it on the FPGAs and is twice as fast. For inter-node particle exchange, again the CPU-FPGA and CPU-GPU clusters have similar performance, while FP-AMR, which uses the direct FPGA-FPGA connections, is so fast it is barely visible.

Fig. 16. CPU-FPGA and FP-AMR system time consumption break-down and resource utilization break-down (strong and weak interaction).

Figure 16 also shows the FPGA resource utilization. The FP-AMR design has three parts: basic utility (DRAM controller, transceiver controllers, etc.), the FP-AMR module, and the Poisson solver. For FP-AMR, the FPGA resource bottleneck is DSP slices; the FP-AMR framework itself uses less than 5% of the DSP slices. This also explains the disparity in performance between the GPU and FPGA on the Poisson solver.

E. System Scalability

Figure 17(A-B) shows the performance scalability of the four systems with the two different initial conditions. The AMReX framework ensures that the three control groups scale well. Within this limited-scale experiment, FP-AMR's performance shows comparable linearity.

Fig. 17. Scalability of the four systems under the two initial conditions.

VI. DISCUSSION AND FUTURE WORK

In this work, we study the mapping of AMR onto FPGAs and FPGA clusters. We believe that this is the first such study. We create two versions: one where the FPGAs execute only the computations that are traditionally offloaded to the accelerator, while the CPUs retain control and data structures; and a second, FP-AMR, where the entire computation (after initialization) is executed on the FPGAs. For the FPGAs to execute the balance of the CPU-based tasks–including particle mesh mapping, mesh refinement, and coarsening–requires customized data structures and a customized memory subsystem. Another advantage of FP-AMR is that inter-iteration particle exchanges are executed over direct FPGA-FPGA connections. The FP-AMR enhancements lead to a 20% performance improvement over the CPU+FPGA version.

Overall we find that both FPGA-based versions of AMR substantially improve performance over the CPU-based versions, by factors of 17×-19× and 21×-23×, respectively; energy consumption is improved by a much greater factor. The limiting factor on FPGA performance is the number of DSP units. When compared with a high-end GPU of a similar process generation, we find that the performance of the GPU is 1.46×-1.67× better than FP-AMR. This appears to be entirely due to the floating-point resources available on the GPU for the execution of the intra-iteration payload computations; clearly, having more DSPs would allow FP-AMR to achieve equal or better performance. The energy consumption of the FPGA-based systems is about 6× less than that of the GPU-based systems.

We tested the scalability of all of the systems and found that all of them scaled well within the size of the clusters to which we had access. Testing on larger systems is required to demonstrate the expected benefit of the direct FPGA-FPGA connections on scalability.

It is likely that we can improve the performance of the payload operations on the FPGA (the Poisson solver). This is something we have not concentrated on so far in this study, but it is the primary hindrance to matching GPU performance. Exchanging the IP-based FFTs for custom FFTs could bridge the gap, even given the disparity in floating-point resources.


REFERENCES

[1] C. Burstedde, O. Ghattas, G. Stadler, T. Tu, and L. C. Wilcox, "Towards adaptive mesh PDE simulations on petascale computers," Proceedings of Teragrid, vol. 8, 2008.
[2] A. Dubey, A. Almgren, J. Bell, M. Berzins, S. Brandt, G. Bryan, P. Colella, D. Graves, M. Lijewski, F. Löffler et al., "A survey of high level frameworks in block-structured adaptive mesh refinement packages," Journal of Parallel and Distributed Computing, vol. 74, no. 12, pp. 3217–3227, 2014.
[3] D. Calhoun, "Adaptive Mesh Refinement Resources," 2017. [Online]. Available: math.boisestate.edu/~calhoun/www_personal/research/amr_software/
[4] W. Zhang, A. Almgren, M. Day, T. Nguyen, J. Shalf, and D. Unat, "BoxLib with tiling: an AMR software framework," arXiv preprint arXiv:1604.03570, 2016.
[5] H.-Y. Schive, J. A. ZuHone, N. J. Goldbaum, M. J. Turk, M. Gaspari, and C.-Y. Cheng, "GAMER-2: a GPU-accelerated adaptive mesh refinement code–accuracy, performance, and scalability," Monthly Notices of the Royal Astronomical Society, vol. 481, no. 4, pp. 4815–4840, 2018.
[6] M. N. Farooqi, T. Nguyen, W. Zhang, A. S. Almgren, J. Shalf, and D. Unat, "Phase asynchronous AMR execution for productive and performant astrophysical flows," in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 2018, p. 70.
[7] P. Davis, "Adaptive Mesh Refinement: An Essential Ingredient in Computational Science," SIAM News, May 1, 2017.
[8] A. Almgren, V. Beckner, J. Bell, M. Day, L. Howell, C. Joggerst, M. Lijewski, A. Nonaka, M. Singer, and M. Zingale, "CASTRO: A new compressible astrophysical solver. I. Hydrodynamics and self-gravity," The Astrophysical Journal, vol. 715, no. 2, p. 1221, 2010.
[9] B. Peng, T. Wang, X. Jin, and C. Wang, "An Accelerating Solution for N-body MOND simulation with FPGA-SOC," International Journal of Reconfigurable Computing, vol. 2016, 2016.
[10] T. Wang, L. Zheng, X. Jin, B. Peng, and C. Wang, "FPGA acceleration of TreePM N-body simulations for Modified Newton Dynamics," in International Conference on Field-Programmable Technology, 2016, pp. 201–204.
[11] J. Coole, J. Wernsing, and G. Stitt, "A traversal cache framework for FPGA acceleration of pointer data structures: A case study on Barnes-Hut N-body simulation," in International Conference on Reconfigurable Computing and FPGAs, 2009, pp. 143–148.
[12] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray et al., "A reconfigurable fabric for accelerating large-scale datacenter services," ACM SIGARCH Computer Architecture News, vol. 42, no. 3, pp. 13–24, 2014.
[13] A. D. George, M. C. Herbordt, H. Lam, A. G. Lawande, J. Sheng, and C. Yang, "Novo-G#: Large-scale reconfigurable computing with direct and programmable interconnects," in IEEE High Performance Extreme Computing Conference, 2016, pp. 1–7.
[14] T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Patel, and M. Herbordt, "A framework for acceleration of CNN training on deeply-pipelined FPGA clusters with work and weight load balancing," in International Conference on Field Programmable Logic and Applications, 2018, pp. 394–398.
[15] J. Sheng, B. Humphries, H. Zhang, and M. Herbordt, "Design of 3D FFTs with FPGA Clusters," in IEEE High Performance Extreme Computing Conference, 2014.
[16] J. Sheng, C. Yang, A. Caulfield, M. Papamichael, and M. Herbordt, "HPC on FPGA Clouds: 3D FFTs and Implications for Molecular Dynamics," in Proc. IEEE Conf. on Field Programmable Logic and Applications, 2017.
[17] J. Sheng, C. Yang, and M. Herbordt, "High Performance Dynamic Communication on Reconfigurable Clusters," in Proc. IEEE Conf. on Field Programmable Logic and Applications, 2018.