
TWO LAYER LINEAR DIFFUSION EQUATION ON THE GPU

Technical Report TR-CIS-0420-09

Submitted to the Faculty

of

Indiana University-Purdue University Indianapolis

by

Robert J. Zigon

December 2015

Indianapolis, Indiana


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 Introduction
2 Graphics Processing Unit (GPU)
   2.1 Software Model
   2.2 Hardware Model
3 Diffusion
   3.1 One Layer
   3.2 Factorization of a Tridiagonal Matrix
       3.2.1 LU Factorization
       3.2.2 UL Factorization
   3.3 CPU Implementation
   3.4 Parallel LU Factorization
   3.5 Parallel Solver and Recursive Doubling
   3.6 Recursive Doubling and Nilpotent Matrices
   3.7 GPU Implementation
   3.8 Results
4 Diffusion with Layers
   4.1 Background
   4.2 Two Layers
   4.3 Interface Equation
   4.4 Complete System for Two Layers
   4.5 Difference Between Inclusion and Exclusion of the Interface Condition
   4.6 Implementation
   4.7 Results
5 Conclusion
REFERENCES
A CPU code
B GPU code


LIST OF TABLES

3.1 CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50
3.2 CPU LU vs CPU ZC with 1 Layer, Terms=32, Time Steps=50
3.3 CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50
4.1 Experimental parameters for 50 timesteps
4.2 CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 1, Terms=32, Time Steps=50
4.3 CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 10, Time Steps=50


LIST OF FIGURES

2.1 Memory bandwidth for the CPU and GPU
2.2 Floating point operations per second for the CPU and GPU
2.3 A GPU Core
2.4 A GPU and a Streaming Multiprocessor (SM or SMX)
3.1 CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50
3.2 CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50
4.1 A plant with multiple layers of soil
4.2 Time Evolution of Interface-Experiment 1
4.3 Solution to the Interface Neighborhood-Experiment 1
4.4 Time Evolution of Interface-Experiment 2
4.5 Solution to the Interface Neighborhood-Experiment 2
4.6 Error in the Interface Neighborhood-Experiment 2
4.7 Time Evolution of Interface-Experiment 3
4.8 Solution to the Interface Neighborhood-Experiment 3
4.9 Time Evolution of Interface-Experiment 4
4.10 Solution to the Interface Neighborhood-Experiment 4
4.11 Error in the Interface Neighborhood-Experiment 4


ABSTRACT

Zigon, Robert MS, Purdue University, December 2015. Two Layer Linear Diffusion Equation on the GPU. Major Professors: Raymond Chin, Shiaofen Fang and Fengguang Song.

The purpose of this project is to investigate the mathematical framework for evaluating the two layer linear diffusion equation on a GPU. The diffusion equation is first approximated using finite differences to produce the matrix equation Ax = f. The two term non-linear recurrence relation for the LU factorization of the A matrix is then converted into a three term linear recurrence relation by way of a Riccati transform. The three term relation is then shown to be parallelizable. After the numeric underflow problem for the LU solver of the system is reconciled, Stone's recursive doubling algorithm is then implemented. Finally, the parallel implementation is applied to a form of the two layer diffusion equation that properly models the flux across the internal boundary.


1 INTRODUCTION

In physics, diffusion is defined as the change in the distribution of a collection of particles, as well as its depletion, in time and space. The underlying partial differential equation can be used to model many different types of processes. For example, open a bottle of perfume. As the molecules of the scent first escape the container, they are in very high concentration. Over time they spread outward in every direction, into regions where they are in low concentration.

Another example of diffusion exists in biology. A process called morphogenesis controls the spatial distribution of cells during the embryonic development of an organism. Natural patterns, such as the spots on a leopard, are believed to be the result of cellular differentiation in many different directions [1].

The diffusion equation also appears in oncology with the use of radio frequency thermal ablation (RFA). In this process, tumor cells are killed by focusing energy on a diseased portion of the body. In order to better understand the ablation process, models are used to analyze the energy and temperature distribution in the context of the muscle, fat and bone that are adjacent to the tumor cells [2].

Yet another example of diffusion exists in hydrogeology, the study of the movement of groundwater in the soil and rocks of the Earth's crust. Groundwater does not always flow downhill in the subsurface by following the surface topography. Instead, it can be driven by pressure gradients in both saturated and unsaturated regions. This results in a behavior that is difficult to predict for all but the simplest situations.

The goal of this project is to implement a solver for the one dimensional linear diffusion equation with two layers on a GPU. We will begin with a description of the modern GPU. The problem itself will then be investigated in two phases. The first phase will start with the finite difference equations for the one dimensional, constant coefficient, linear diffusion equation (due to its relative simplicity) on the CPU and GPU.


The second phase will then investigate adding layers to the first phase. This two phase approach will allow us to first understand the issues surrounding diffusion on different hardware architectures, and then focus on the two layer problem, so that an efficient parallel solver can be implemented.

The solver consists of two components that are designed to improve the execution time or the accuracy of the solution. First, the tridiagonal structure of the underlying matrices will be considered so that the LU-decomposition can be applied with a computational complexity of O(n). While doing so, we will show how to implement Stone's recursive doubling algorithm to solve 2²⁶ equations, a goal that, up to now, has not exceeded 1,024 equations. For the second component, we will demonstrate the mathematics to treat the boundary between two different diffusion coefficients in a manner that reduces error in the solution.


2 GRAPHICS PROCESSING UNIT (GPU)

The modern Graphics Processing Unit (GPU) has its genesis in 2D and 3D computer graphics. In 2000, parallel processing and floating point arithmetic capabilities were added to graphics cards to accelerate the rate at which world geometries could be transformed, illuminated, projected, clipped and then displayed as pixels. This sequence of operations is called the graphics pipeline. It makes heavy use of five basic floating point operators (addition, subtraction, multiplication, division and square root). The process itself is called embarrassingly parallel because the transformation sequence applied to a three dimensional vertex is independent of the other vertices.

In 2002, researchers became interested in these parallel processing and floating point capabilities. They used the graphics application programming interface (API) to compute functions such as fast Fourier transforms and convolutions. NVidia took note of the scientific computing trend with GPUs and developed CUDA, the Compute Unified Device Architecture. CUDA [6] is a computing platform and programming model that uses a C-like language to expose the massive parallelism of GPU hardware. In retrospect, the demand for real time graphics has caused the GPU to evolve into a highly parallel, multithreaded, many-core processor with very high memory bandwidth and computing throughput, as illustrated in figures 2.1 and 2.2.

Figure 2.1.: Memory bandwidth for the CPU and GPU

Figure 2.2.: Floating point operations per second for the CPU and GPU

2.1 Software Model

CUDA was designed to overcome the challenge of writing applications that transparently scale with increasing numbers of processing cores by maintaining a low learning curve for programmers familiar with the C programming language. There are three abstractions at the core of CUDA: a hierarchy of thread groups, shared memories and barrier synchronization. These abstractions guide the programmer to partition a problem into sub-problems that can be solved independently in parallel by blocks of threads executing a kernel.

A kernel is a program written in CUDA C that is downloaded from the host to a GPU board at runtime. Parameters are passed to the kernel at invocation to provide it with operands that are transformed by the GPU. In listing 2.1 the kernel program extends from line 1 through line 5. Line 13 of the main program (that is executing on the Intel CPU) essentially downloads the VectorAdd kernel to the GPU and launches it on 1,000 threads. At runtime each thread is assigned a unique thread index, in this case ranging from 0 to 999. The ith thread loads A[i] and B[i], adds them, and then writes the result to C[i]. When all of the threads have executed the VectorAdd kernel, control is returned to the main program at line 14.

Modern GPU hardware (like an NVidia Tesla K20 board) can have as many as 2,496 processing cores. When a kernel is launched, one of the required parameters is the thread count. The requested thread count can exceed the number of physical cores on the GPU. From a conceptual standpoint, the hardware maps blocks of 32 threads (called warps) to 32 cores until they have finished executing. When one warp finishes, another one is allocated to the idle block of cores for execution. This is one of the key abstractions in CUDA that lends itself to transparent scalability. If, for example, a next generation board arrives with 10,000 cores, the kernels are oblivious to the environmental change. The hardware and runtime take care of the mapping from threads to cores, and the kernel is executed. The result is a platform that preserves the user's investment in code while insulating it from hardware changes.


1  __global__ void VectorAdd(float *A, float *B, float *C)
2  {
3      int i = threadIdx.x;
4      C[i] = A[i] + B[i];
5  }
6
7  int main()
8  {
9      const int N = 1000;
10     ...
11     // Kernel invocation from the host with N threads
12
13     VectorAdd<<<1, N>>>(A, B, C);
14     ...
15 }

Listing 2.1: Example kernel and host code

2.2 Hardware Model

From a hardware perspective, the fundamental computing unit in an NVidia GPU is a core (see figure 2.3). A core contains a 32 bit arithmetic logic unit (ALU) capable of performing operations such as min, max, add, subtract, multiply, divide, compare and bitwise logical operators. A core also contains a single and double precision floating point unit.

Figure 2.3.: A GPU Core

In figure 2.4 we see that a collection of cores is grouped together in a unit known as a Streaming Multiprocessor (SM). The SM in the figure has 192 cores within. A GPU itself is then a collection of SMs. Although figure 2.4 shows 8 SMs (for a total of 1,536 cores), the Tesla K20 cards used in this project have 13 SMs, for a total of 2,496 cores.

Figure 2.4.: A GPU and a Streaming Multiprocessor (SM or SMX)

An SM is designed to execute thousands of programming threads concurrently. To manage such a large number of threads, it employs a unique architecture called SIMT (Single Instruction, Multiple Thread). An SM schedules and executes the threads grouped as warps. Individual threads composing a warp begin execution at the same instruction address, but have their own register state and are therefore free to branch and execute independently. However, full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data dependent conditional branch, the warp serially executes each branch taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to a common path.

The SIMT architecture is somewhat similar to the architecture of vector processors known as SIMD (Single Instruction, Multiple Data). A key difference, however, is that the SIMD organization exposes the width of the vector to the software (and programmer). SIMT, on the other hand, specifies the execution and branching behavior in terms of a single thread. This model simplifies parallel programming somewhat and aids in program correctness.


3 DIFFUSION

3.1 One Layer

The one dimensional diffusion equation is

∂u/∂t = κ ∂²u/∂x²,  0 ≤ x ≤ L,  t > 0,   (3.1)

where u = u(x, t) is the dependent variable and κ is a real constant. The initial condition and boundary conditions are u(x, 0) = 0, u(0, t) = g(t) and u(L, t) = h(t). The particular problem we will solve has boundary and initial conditions that are equal to

u(0, t) = 0,  u(L, t) = 0,  u(x, 0) = sin(πx/L).   (3.2)

The solution to (3.1) subject to (3.2) is then

u(x, t) = sin(πx/L) exp(−κπ²t/L²).   (3.3)

The results of this project will be validated for both the CPU and GPU using (3.3).

We begin with an implicit time discretization scheme with a trapezoidal rule in which t_j = jδt. Then

∂u/∂t = F(x, t)

is converted to

[u^(j)(x_i) − u^(j−1)(x_i)] / δt = [F(x_i, t_j) + F(x_i, t_{j−1})] / 2.

Define u(x_i, t_j) to be replaced by u_i^(j). The time discretization of the diffusion equation becomes

[u^(j)(x_i) − u^(j−1)(x_i)] / δt = (κ/2) { (∂²u/∂x²)^(j)|_{x_i} + (∂²u/∂x²)^(j−1)|_{x_i} }.   (3.4)

We now use a central difference with uniform spacing h = L/N for the spatial dimension, where N is the number of sub-intervals. As a result, (3.4) becomes

[u^(j)(x_i) − u^(j−1)(x_i)] / δt = (κ/2) { [(u_{i−1} − 2u_i + u_{i+1})/h²]^(j) + [(u_{i−1} − 2u_i + u_{i+1})/h²]^(j−1) }.   (3.5)

If we let r = 2h²/(κδt), then (3.5) can be rewritten as

−u_{i−1}^(j) + (2 + r)u_i^(j) − u_{i+1}^(j) = u_{i−1}^(j−1) − (2 − r)u_i^(j−1) + u_{i+1}^(j−1),  i = 1, …, N − 1,  j > 0.   (3.6)

For convenience, define

f_1^(j) = u_0^(j−1) − (2 − r)u_1^(j−1) + u_2^(j−1)
f_i^(j) = u_{i−1}^(j−1) − (2 − r)u_i^(j−1) + u_{i+1}^(j−1),  2 ≤ i ≤ N − 2
f_{N−1}^(j) = u_{N−2}^(j−1) − (2 − r)u_{N−1}^(j−1) + u_N^(j−1);

then (3.6) becomes

(2 + r)u_1^(j) − u_2^(j) = f_1^(j) + u_0^(j)
−u_{i−1}^(j) + (2 + r)u_i^(j) − u_{i+1}^(j) = f_i^(j),  i = 2, …, N − 2,  j > 0   (3.7)
−u_{N−2}^(j) + (2 + r)u_{N−1}^(j) = f_{N−1}^(j) + u_N^(j).

Now rewrite (3.7) in terms of a left boundary equation, a set of interior equations for 1 < i < N − 1 and a right boundary equation. The left boundary equation is

(2 + r)u_1^(j) − u_2^(j) = f_1^(j) + g^(j),  j > 0.   (3.8)

The interior equations are

−u_{i−1}^(j) + (2 + r)u_i^(j) − u_{i+1}^(j) = f_i^(j),  i = 2, …, N − 2,  j > 0,   (3.9)

and the right boundary equation is

−u_{N−2}^(j) + (2 + r)u_{N−1}^(j) = f_{N−1}^(j) + h^(j),  j > 0.   (3.10)

This is done to emphasize the roles of the interior and boundary forcing functions. As such, the boundaries need no special treatment.

We can now rewrite (3.8), (3.9) and (3.10) in matrix form to yield

Ax = f   (3.11)

where

A = [ 2+r   −1                      ]
    [ −1   2+r   −1                 ]
    [       ⋱     ⋱     ⋱           ]
    [            −1   2+r   −1      ]
    [                 −1   2+r      ]  (N−1)×(N−1),   (3.12)

x = ( u_1^(j), u_2^(j), …, u_{N−2}^(j), u_{N−1}^(j) )ᵀ, and
f = ( f_1^(j) + g^(j), f_2^(j), …, f_{N−2}^(j), f_{N−1}^(j) + h^(j) )ᵀ.   (3.13)

This is the system of equations that will be solved.

3.2 Factorization of a Tridiagonal Matrix

3.2.1 LU Factorization

Assume a tridiagonal matrix B is represented as

B = [ b1   c1                          ]
    [ a2   b2   c2                     ]
    [       ⋱     ⋱     ⋱              ]
    [         a_{N−1}  b_{N−1}  c_{N−1} ]
    [                  a_N      b_N    ].   (3.14)

B can be factored into a lower bi-diagonal matrix L and an upper bi-diagonal matrix U such that B = LU, where

LU = [ γ1                    ] [ 1  δ1                ]
     [ a2   γ2               ] [    1   δ2            ]
     [       ⋱    ⋱          ] [        ⋱     ⋱       ]
     [        a_{N−1} γ_{N−1} ] [          1  δ_{N−1} ]
     [             a_N   γ_N ] [                1     ].   (3.15)

If we equate the coefficients of (3.14) and (3.15), the result is

γ1 = b1
δ1 = c1/γ1
γ_i = b_i − a_i δ_{i−1},  i = 2, …, N
δ_i = c_i/γ_i,  i = 2, …, N − 1.

Alternatively, you may cast LU such that

LU = [ 1                    ] [ η1   c1                ]
     [ β2   1               ] [      η2   c2           ]
     [       ⋱    ⋱         ] [           ⋱     ⋱      ]
     [        β_{N−1}  1    ] [           η_{N−1}  c_{N−1} ]
     [             β_N   1  ] [                    η_N ]

with similar systems of recurrence relations for their coefficients. The method chosen depends on its application.

If we have a system of equations involving B it can be written as

Bx = f, or   (3.16)
L(Ux) = f.

Now, introduce the intermediate vector y and we have

Ly = f, the forward substitution,   (3.17)
Ux = y, the backward substitution.   (3.18)

The system of equations that results from (3.17) is

γ1 y1 = f1
a2 y1 + γ2 y2 = f2
a_i y_{i−1} + γ_i y_i = f_i,  i = 3, …, N,

and solving for the y vector yields

y1 = f1/γ1
y_i = (f_i − a_i y_{i−1})/γ_i,  i = 2, …, N.   (3.19)

With the forward substitution step complete, we can write the backward substitution (3.18) as

x_N = y_N
x_i = y_i − δ_i x_{i+1},  i = N − 1, …, 1,   (3.20)

and generate the solution to (3.16).
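The factorization and the recurrences (3.19) and (3.20) together form the classical serial tridiagonal solve. A minimal self-contained sketch in plain C++ follows; it is an illustration of the algebra above (the report's production version, built on its CVector class, appears as Listing A.1 in the appendix), and the std::vector types and 0-based indexing here are presentation choices, not the report's.

#include <vector>

// Solve a tridiagonal system Bx = f via the LU factorization (3.14)-(3.20).
// a = sub-diagonal (a[0] unused), b = main diagonal, c = super-diagonal
// (c[N-1] unused). Diagonal dominance is assumed, so no pivoting is done.
std::vector<double> SolveTridiagonal(const std::vector<double>& a,
                                     const std::vector<double>& b,
                                     const std::vector<double>& c,
                                     const std::vector<double>& f)
{
    const int N = (int)b.size();
    std::vector<double> gamma(N), delta(N), y(N), x(N);

    // Factor: gamma_1 = b_1, delta_i = c_i/gamma_i,
    // gamma_i = b_i - a_i*delta_{i-1}.
    gamma[0] = b[0];
    delta[0] = c[0] / gamma[0];
    for (int i = 1; i < N; i++) {
        gamma[i] = b[i] - a[i] * delta[i - 1];
        if (i < N - 1) delta[i] = c[i] / gamma[i];
    }

    // Forward substitution (3.19): Ly = f.
    y[0] = f[0] / gamma[0];
    for (int i = 1; i < N; i++)
        y[i] = (f[i] - a[i] * y[i - 1]) / gamma[i];

    // Backward substitution (3.20): Ux = y.
    x[N - 1] = y[N - 1];
    for (int i = N - 2; i >= 0; i--)
        x[i] = y[i] - delta[i] * x[i + 1];

    return x;
}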

3.2.2 UL Factorization

As it turns out, there is also a UL factorization. If L is pre-multiplied by U, the resulting system is then

Bx = f, or   (3.21)
U(Lx) = f,

where

UL = [ 1  δ̄1               ] [ β1                     ]
     [    1   δ̄2           ] [ a2   β2                ]
     [        ⋱     ⋱      ] [       ⋱     ⋱          ]
     [          1  δ̄_{N−1} ] [        a_{N−1}  β_{N−1} ]
     [               1     ] [              a_N   β_N ].   (3.22)

The UL factorization is equivalent to reordering the vectors x and f from N to 1. It follows that the information is transmitted from N to 1. The need for this will become apparent when dealing with the interface equation in section 4.

When the intermediate vector z is introduced, we obtain

Uz = f, the backward substitution,
Lx = z, the forward substitution.

If we equate the coefficients of (3.14) and (3.22), the result is

β_N = b_N
δ̄_i = c_i/β_{i+1},  i = N − 1, …, 1
β_i = b_i − a_{i+1} δ̄_i,  i = N − 1, …, 1.

Now that the coefficients of U and L are computed, we can solve for the elements of vectors z and x to yield

z_N = f_N
z_i = f_i − δ̄_i z_{i+1},  i = N − 1, …, 1,

and

x1 = z1/β1
x_i = (z_i − a_i x_{i−1})/β_i,  i = 2, …, N,

which generates the solution to the system (3.21).

3.3 CPU Implementation

The linear system (3.11) that results from discretizing (3.1) is both tridiagonal and diagonally dominant (r > 0). The tridiagonal property implies that the LU decomposition can be performed in O(n) time. The diagonal dominance implies that pivoting is not required. We use these properties to generate the solution to (3.11).

Listing A.1 contains the implementation of the factorization and solver for (3.12) and (3.13) that runs on the CPU. A main function named RunCPU_1K1LTest repeatedly calls the pair with various configurations, then measures the execution time and the relative error with respect to the initial conditions in (3.2). At a high level, RunCPU_1K1LTest performs 5 functions (a usage sketch follows the list).

1. Assign boundary conditions.

2. Assign initial conditions.

3. Initialize the sub-diagonal, diagonal, and super-diagonal of matrix A.

4. Compute the LU factorization.

5. Repeatedly solve the system and step time.
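Grounded in the interfaces of Listing A.1, a minimal sketch of that driver's shape is shown below. The body of RunCPU_1K1LTest is not reproduced in the appendix, so the loop structure and the BuildRightHandSide helper are assumptions made for illustration; CVector is the report's 1-indexed vector class, and the sizes shown are indicative.

#include <cmath>

// Hypothetical outline of RunCPU_1K1LTest, following steps 1-5 above.
// LUFactorizationCPU and LUSolveCPU come from Listing A.1;
// BuildRightHandSide is an assumed helper that rebuilds f per (3.13).
void RunCPU_1K1LTest_Sketch(int N, int timeSteps, double r)
{
    CVector SubDiag(N - 1), MainDiag(N - 1), SuperDiag(N - 1);
    CVector Gamma(N - 1), Delta(N - 1), f(N - 1), x(N - 1), y(N - 1);

    // Steps 1-2: zero boundaries; initial condition u(x,0) = sin(pi x/L)
    // from (3.2).
    for (int i = 1; i <= N - 1; i++)
        x[i] = sin(M_PI * i / N);

    // Step 3: the three diagonals of A in (3.12).
    for (int i = 1; i <= N - 1; i++) {
        SubDiag[i]   = -1.0;
        MainDiag[i]  = 2.0 + r;
        SuperDiag[i] = -1.0;
    }

    // Step 4: factor once; A is constant across time steps.
    LUFactorizationCPU(SubDiag, MainDiag, SuperDiag, Gamma, Delta);

    // Step 5: rebuild f from the previous solution and solve repeatedly.
    for (int j = 1; j <= timeSteps; j++) {
        BuildRightHandSide(x, f);   // assumed helper, per (3.13)
        LUSolveCPU(Gamma, Delta, SubDiag, f, x, y);
    }
}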

3.4 Parallel LU Factorization

The symmetry of A in (3.12) leads to an LU factorization of

LU = [ γ1                  ] [ 1  δ1                ]
     [ −1   γ2             ] [    1   δ2            ]
     [      ⋱     ⋱        ] [        ⋱     ⋱       ]
     [       −1   γ_{N−1}  ] [          1  δ_{N−1} ]
     [            −1   γ_N ] [                1     ]

where

γ1 = 2 + r
δ1 = −1/γ1
γ_i = 2 + r + δ_{i−1},  i = 2, …, N
δ_i = −1/γ_i,  i = 2, …, N − 1,

which can be simplified to

γ1 = 2 + r
γ_i = 2 + r − 1/γ_{i−1},  i = 2, …, N.   (3.23)

The nonlinear two-term recurrence in (3.23) does not lend itself to parallel evaluation. Stone [7] and Blelloch [8], however, describe an algorithm for parallel evaluation of m-th order linear recurrence relations. So (3.23) is modified through the use of a Riccati transformation γ_i = q_i/q_{i−1} to produce a linear three-term recurrence

q0 = 1
q1 = 2 + r
q_i = (2 + r)q_{i−1} − q_{i−2},  i = 2, …, N.   (3.24)

The issue with (3.24) is that it can overflow when using floating point arithmetic. The term q_i grows and becomes unbounded as i → ∞. We seek a truncated form that gives acceptable results within the finite precision of the microprocessor. Recognize that the initial value problem in (3.24) can be solved analytically. Assume a solution of the form

q_i = A x₊^i + B x₋^i

with A and B real constants. Examination of the characteristic equation x² − (2 + r)x + 1 = 0 for (3.24) yields roots of

x± = (b ± √(b² − 4)) / 2,  where b = 2 + r.

With the initial conditions of q0 = 1 and q1 = b, we can find the constants A and B from

q0 = 1 = A + B
q1 = b = A x₊ + B x₋,

where A = (b − x₋)/(x₊ − x₋) and B = 1 − A, to yield the solution

q_i = (x₊^{i+1} − x₋^{i+1}) / (x₊ − x₋),

which can be rewritten as

q_i = x₊^i · [1 − (x₋/x₊)^{i+1}] / [1 − (x₋/x₊)].

With

[1 − (x₋/x₊)^{i+1}] / [1 − (x₋/x₊)] = 1 + x₋/x₊ + O((x₋/x₊)²),

it follows that

q_i = x₊^i { 1 + x₋/x₊ + O((x₋/x₊)²) }.

For sufficiently large i = N,

q_N = x₊^N { 1 + x₋/x₊ + O((x₋/x₊)²) },

and q_i will overflow as q_N → x₊^N.

When (3.23) is combined with (3.19) and (3.20), the forward and backward substitution logic simplifies to

y1 = f1/γ1
y_i = (f_i + y_{i−1})/γ_i,  i = 2, …, N,

and

x_N = y_N
x_i = y_i + x_{i+1}/γ_i,  i = N − 1, …, 1.

We now see that 1/γ_i is needed and not q_i, so

1/γ_i = (x₊^i − x₋^i) / (x₊^{i+1} − x₋^{i+1})
      = (x₊^i / x₊^{i+1}) · [1 − (x₋/x₊)^i] / [1 − (x₋/x₊)^{i+1}]
      = (1/x₊) { 1 − (x₋/x₊)^i + O((x₋/x₊)^{i+1}) }.

This tells us that 1/γ_i approaches its asymptotic limit faster than q_i. If we impose the constraint that

|1/γ_K − 1/γ_{K−1}| < 10^{−M}   (3.25)

for some integers K and M, then

1/γ_K ≈ 1/x₊ − (1/x₊)(x₋/x₊)^K = (1/x₊)(1 − 10^{−M}).

This suggests we stop computing when (3.25) is satisfied. From that point on we impose

1/γ_i = 1/x₊,  for i > K.

In summary, this section shows how to transform a non-linear two term recurrence relation into a linear three term recurrence relation that is now a candidate for parallelization. While doing so, a numeric overflow problem had to be identified and then mitigated by truncating the series once an acceptable level of accuracy has been reached.
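A minimal CPU sketch of this truncation, assuming double precision and a tolerance of 10⁻ᴹ per (3.25): it runs the stable form of the recurrence, γ_i = 2 + r − 1/γ_{i−1} from (3.23), stores the reciprocals 1/γ_i that the solver actually needs, and freezes them at the asymptotic value 1/x₊ once consecutive values agree.

#include <cmath>
#include <vector>

// Reciprocals 1/gamma_i of the LU pivots for A in (3.12), truncated
// once (3.25) is satisfied: |1/gamma_K - 1/gamma_{K-1}| < 10^{-M}.
std::vector<double> TruncatedInvGamma(int N, double r, int M)
{
    const double b   = 2.0 + r;
    const double xp  = (b + std::sqrt(b * b - 4.0)) / 2.0; // root x_+
    const double tol = std::pow(10.0, -M);

    std::vector<double> invGamma(N);
    double gamma = b;                    // gamma_1 = 2 + r
    invGamma[0] = 1.0 / gamma;

    int K = N;
    for (int i = 1; i < N; i++) {
        gamma = b - invGamma[i - 1];     // gamma_i = 2 + r - 1/gamma_{i-1}
        invGamma[i] = 1.0 / gamma;
        if (std::fabs(invGamma[i] - invGamma[i - 1]) < tol) { K = i; break; }
    }

    // Past K, impose the asymptotic limit 1/gamma_i = 1/x_+.
    for (int i = K + 1; i < N; i++)
        invGamma[i] = 1.0 / xp;

    return invGamma;
}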

3.5 Parallel Solver and Recursive Doubling

The sequential nature of the solver for the CPU does not lend itself to efficient implementation on the GPU. Other researchers have chosen to implement Cyclic Reduction (CR) and Parallel Cyclic Reduction (PCR) for their tridiagonal solvers. Two other researchers [9, 10] have attempted to implement Stone's Recursive Doubling (RD) algorithm [11] to solve tridiagonal systems on a GPU. Each reported problems, such as large numerical errors and arithmetic underflow and instability in RD, but they failed to analyze the source of the instability. This section, on the other hand, will discuss the source of the instability (the numeric underflow that occurs on the GPU) and then describe how to address the issue.

Stone described RD in terms of the following theorem and claimed that it could be used to solve recurrence relations of all orders.

Theorem 1 (Stone) Let y_i(j) satisfy a non-homogeneous two-term recurrence

y_{i+1}(j) = y_1(j) + y_i(j − 1)·(−m_j),  i, j ≥ 1,   (3.26)

with the boundary conditions

y_1(j) = b_j, j ≥ 1;  y_i(j) = 0, j ≤ 0;  y_i(j) = 0, i ≤ 0.

Then,

a) for s ≥ 1, y_i(j) satisfies the recurrence relation

y_{i+s}(j) = y_s(j) + y_i(j − s) ∏_{k=j−s+1}^{i} (−m_k),  i ≥ 1, j ≥ s;

b)

y_i(j) = Σ_{k=1}^{j} y_1(k) ∏_{s=k+1}^{j} (−m_s),  i ≥ j ≥ 1;   (3.27)

c) for i ≥ j ≥ 1, y_i(j) = z_j, where z_j is the jth component of the solution to

z_i = b_i − m_i z_{i−1},  z_1 = b_1.

Quite simply, the problem lies with (3.27). If the sequence {m_k}₁ⁿ is bounded by m̄, then

∏_{k=j−s+1}^{i} (−m_k) ≈ m̄^p,  where p = i − j + s.

The RD algorithm needs to be modified to avoid the arithmetic underflow that m̄^p causes when m̄ < 1.

3.6 Recursive Doubling and Nilpotent Matrices

The non-homogeneous two-term recurrence (3.26) with initial value y1 = b1 has a solution of the form

y_i = Σ_{k=1}^{i} b_k ∏_{s=k+1}^{i} (−m_s).   (3.28)

If the first few terms of the solution are expanded, we have

y1 = b1
y2 = b2 − b1 m2
y3 = b3 − b2 m3 + b1 m2 m3.

This is a compact representation of the linear system

Ly = b   (3.29)

where L is a lower bidiagonal matrix given by

L = [ 1                  ]
    [ m2   1             ]
    [      m3   1        ]
    [        ⋱     ⋱     ]
    [       m_{n−1}  1   ]
    [            m_n   1 ]  (n×n).

We rewrite L as the sum of an identity matrix I of order n and a nilpotent matrix N of index n, such that

L = I + N,

where N is the first lower diagonal matrix

N = [ 0                  ]
    [ m2   0             ]
    [      m3   0        ]
    [        ⋱     ⋱     ]
    [       m_{n−1}  0   ]
    [            m_n   0 ]  (n×n).

With N nilpotent, the inverse of L can be expressed as

L⁻¹ = (I + N)⁻¹ = I − N + N² − N³ + ⋯ + (−1)^{n−1} N^{n−1}.

For some p < n, the pth power of N is a matrix with non-zero values filling the pth lower diagonal. For example,

[ 0            ]²   [ 0                ]
[ m2  0        ]  = [ 0     0          ]
[ 0   m3  0    ]    [ m2m3  0    0     ]
[ 0   0   m4 0 ]    [ 0     m3m4  0  0 ].

We can now relate the solution of the non-homogeneous two-term recurrence (3.28) to the entries of the pth lower diagonal of N^p, given by

(N^p)_{p+i,i} = ∏_{k=i+1}^{p+i} m_k,  i = 1, 2, …, n − p.

If there is an upper bound on the entries of N,

m̄ = max_{1≤i≤n} |m_i|,

then the absolute value of an entry on the pth lower diagonal is

|(N^p)_{p+i,i}| = ∏_{k=i+1}^{p+i} |m_k| ≤ m̄^p,  i = 1, 2, …, n − p.

If m̄ < 1, we can now choose a value for p ∈ ℕ such that the solution has an error no bigger than 10^{−digit},

m̄^p ≤ 10^{−digit},  or  p ≥ −digit · ln 10 / ln m̄,

where digit is the number of required decimal places.
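As an illustrative calculation (not taken from the report): with m̄ = 0.5 and digit = 15, the bound gives p ≥ 15 ln 10 / ln 2 ≈ 49.8, so p = 50 terms suffice, while m̄ = 0.1 needs only p = 15.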

All of this now suggests computing

y = (I − N + N² − N³ + ⋯ + (−1)^p N^p) b

as the truncated form of the solution to (3.29). From a practical standpoint we have

y1 = b1,  y_i = b_i − m_i y_{i−1},  i = 2, …, p,   (3.30)

for the first p elements of y. Thereafter, for elements p + 1 ≤ j ≤ n, we have

y_j = b_j − b_{j−1} m_j + b_{j−2}(m_j m_{j−1}) − ⋯ + b_{j−p+1} ∏_{k=j−p+2}^{j} (−m_k)   (p terms),   (3.31)

which is just (3.27) truncated to p terms.
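A serial reference sketch of this truncated evaluation (a plain C++ illustration, not the report's GPU code): the first p elements use the exact recurrence (3.30), and each later element sums its own p-term window (3.31), which makes those elements independent of one another.

#include <vector>

// Truncated solution of Ly = b for the bidiagonal L in (3.29), keeping
// p terms of the series (I + N)^{-1} = I - N + N^2 - ... Here m[i] is
// the multiplier coupling element i to element i-1 (m[0] unused).
std::vector<double> TruncatedBidiagonalSolve(const std::vector<double>& m,
                                             const std::vector<double>& b,
                                             int p)
{
    const int n = (int)b.size();
    std::vector<double> y(n);

    // First p elements: the exact recurrence (3.30), y_i = b_i - m_i y_{i-1}.
    y[0] = b[0];
    for (int i = 1; i < n && i < p; i++)
        y[i] = b[i] - m[i] * y[i - 1];

    // Remaining elements: the p-term window (3.31). Each y_j depends only
    // on b_{j-p+1..j}, so all of these sums are independent -- this is
    // what the GPU evaluates with one thread per element.
    for (int j = p; j < n; j++) {
        double sum  = 0.0;
        double prod = 1.0;            // running product of (-m_k) factors
        for (int k = 0; k < p; k++) {
            sum  += prod * b[j - k];
            prod *= -m[j - k];        // extend the product to the next term
        }
        y[j] = sum;
    }
    return y;
}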


3.7 GPU Implementation

The forward and backward substitution logic is now represented by

y1 = f1/γ1
y_i = (1/γ_i)(y_{i−1} + f_i),  i = 2, …, N − 1,

and

x_{N−1} = y_{N−1}
x_i = y_i + (1/γ_i) x_{i+1},  i = N − 2, …, 1,

respectively. Each of these two equations is remarkably similar to (3.30) and (3.31), and the GPU code takes advantage of this.

Assume, for example, a diffusion problem has 2²⁰ equations in 2²⁰ unknowns, and for accuracy reasons it requires p = 15. The γ vector is first computed with the CPU and then passed into the GPU kernels. It was deemed unnecessary to implement the truncated version of the γ vector on the GPU given that the LU factorization is performed once outside of the solver loop. The ForwardLU_kernel (listing B.1) is launched with n = 2²⁰ threads. Recall that the threads are organized in groups of 32 called a warp, all of which execute the same instruction on a different data element. For a given warp 0 ≤ j < 2¹⁵, threads numbered 32j ≤ i < 32j + 32 will each execute

sum = 0
sum = (sum + f_{i−k}) · (1/γ_{i−k}),  k = 1, …, p.

The BackwardLU_kernel (listing B.2) behaves in a similar way.
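A hedged sketch of the host-side launch for the problem just described: the kernels and their signatures are those of Listings B.1 and B.2, and the 1024-thread block size matches the shared-memory sizing in those listings, but the wrapper function and buffer names are illustrative assumptions.

// Host-side launch sketch for the forward/backward passes (Appendix B).
// Assumes gpuGamma, gpuFhat, gpuYhat, gpuXhat are device buffers of n
// doubles already populated, and MAXTERMS equals the chosen p.
void SolveOnGPU(const double* gpuGamma, const double* gpuFhat,
                double* gpuYhat, double* gpuXhat, unsigned int n)
{
    const unsigned int threads = 1024;   // 32 warps per block, per B.1/B.2
    const unsigned int blocks  = (n + threads - 1) / threads;

    // Forward scan: y_i = (y_{i-1} + f_i)/gamma_i, truncated to p terms.
    ForwardLU_kernel<<<blocks, threads>>>(gpuGamma, gpuFhat, gpuYhat, n);

    // Backward scan: x_i = y_i + x_{i+1}/gamma_i, truncated to p terms.
    BackwardLU_kernel<<<blocks, threads>>>(gpuGamma, gpuYhat, gpuXhat, n);

    cudaDeviceSynchronize();             // wait before timing or readback
}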

3.8 Results

All of the software development was performed on a Lenovo S20 workstation with 14 GB of RAM, a 300 GB Western Digital VelociRaptor hard drive, an Intel Xeon W3520 quad core CPU @ 2.67 GHz, and an NVIDIA Tesla K40c. The workstation runs Windows 7/64. The development tools consisted of Visual Studio 2010, CUDA 6.5 and Tesla Driver 341.44. All of the generated applications were 64 bit.

Table 3.1 and figure 3.1 summarize the results of the CPU version of the LU decomposition (CPU LU) versus the CPU version of the truncated γ vector (CPU ZC) that was described in section 3.4. Each element x_i of the solution vector x was the sum of 20 terms. The number of time steps was 50. Table 3.2 is a similar experiment with the number of terms equal to 32. Table 3.2 emphasizes that the additional 12 terms result in relative errors that are nearly identical to the CPU LU version.

Table 3.3 and figure 3.2 summarize the results of the CPU version of the LU decomposition versus the GPU version of the RD algorithm (GPU RD1L) that was described in section 3.5. Again, each element x_i of the solution vector x was the sum of 20 terms. The number of time steps was 50. In this case, the data clearly shows that the GPU prefers large problems over smaller ones, and ultimately executes 36 times faster than the CPU LU algorithm.

Table 3.1.: CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50

Points            | CPU LU (ms) | CPU LU MaxRel Err | CPU ZC (ms) | CPU ZC MaxRel Err | Speed Up
2¹⁰ = 1,024       | 1.1         | 2.467E-05         | 3.5         | 3.136E-05         | 0.31
2¹⁶ = 65,536      | 69.4        | 6.858E-09         | 231.0       | 1.773E-05         | 0.30
2²⁰ = 1,048,576   | 1,222.2     | 7.650E-11         | 3,650.9     | 4.526E-06         | 0.33
2²² = 4,194,304   | 4,809.4     | 2.056E-10         | 14,671.2    | 2.333E-07         | 0.32
2²⁴ = 16,777,216  | 19,164.3    | 3.473E-10         | 59,080.4    | 2.333E-07         | 0.32
2²⁶ = 67,108,864  | 77,671.4    | 5.239E-09         | 236,833.5   | 2.333E-07         | 0.32

Table 3.2.: CPU LU vs CPU ZC with 1 Layer, Terms=32, Time Steps=50

Points      | CPU LU (ms) | CPU LU MaxRel Err | CPU ZC (ms) | CPU ZC MaxRel Err | Speed Up
1,024       | 1.1         | 2.467E-05         | 5.7         | 2.467E-05         | 0.19
65,536      | 69.4        | 6.858E-09         | 373.6       | 8.523E-09         | 0.18
1,048,576   | 1,222.2     | 7.650E-11         | 5,848.9     | 2.486E-10         | 0.20
4,194,304   | 4,809.4     | 2.056E-10         | 23,322.6    | 2.070E-10         | 0.20
16,777,216  | 19,164.3    | 3.473E-10         | 93,991.6    | 3.458E-10         | 0.20
67,108,864  | 77,671.4    | 5.239E-09         | 377,045.0   | 5.241E-09         | 0.20

Figure 3.1.: CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50 [log-log plot of Time (ms) versus Number of Points for the CPU LU and CPU ZC solvers]

Table 3.3.: CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50

Points      | CPU LU (ms) | CPU MaxRel Err | GPU RD1L (ms) | GPU MaxRel Err | Speed Up
1,024       | 1.1         | 2.467E-05      | 2.7           | 3.136E-05      | 0.4
65,536      | 69.4        | 6.858E-09      | 4.9           | 1.773E-05      | 14.1
1,048,576   | 1,222.2     | 7.650E-11      | 37.3          | 4.526E-06      | 32.7
4,194,304   | 4,809.4     | 2.056E-10      | 136.9         | 2.333E-07      | 35.1
16,777,216  | 19,164.3    | 3.473E-10      | 536.8         | 2.333E-07      | 35.7
67,108,864  | 77,671.4    | 5.239E-09      | 2,143.9       | 2.333E-07      | 36.2

Figure 3.2.: CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 [log-log plot of Time (ms) versus Number of Points for the CPU LU and GPU RD1L solvers]

4 DIFFUSION WITH LAYERS

4.1 Background

Consider a practical problem from hydrogeology shown in figure 4.1. The one dimensional, transient state, unsaturated flow through this section of earth is

∂θ(z, t)/∂t = ∂/∂z [ K(θ) ( ∂ψ(z, t)/∂z + 1 ) ]   (4.1)

where

θ(z, t) is the saturation (water content),
K(θ) is the hydraulic conductivity,
ψ(z, t) is the pressure head,
z is the elevation above a vertical datum, and
t is time.

Figure 4.1 effectively has multiple diffusion coefficients (in this case 5) because the hydraulic conductivity K(θ) varies with the subsurface soil type.¹ The hydraulic conductivity coefficient has units of distance/time and, as such, can be considered the diffusive velocity with which water moves through a substructure. The value can be as small as 10⁻⁸ meters/day for unfractured shale, and as large as 10⁴ meters/day for gravel, thereby spanning 12 orders of magnitude [12].

¹Although equation (4.1) technically is a non-linear partial differential equation, it is used as an example because we believe the reader can relate to the notion of water percolating down through different layers of soil.

Figure 4.1.: A plant with multiple layers of soil

If the diffusion coefficients are approximately the same order of magnitude, then a finite difference approach may work. However, as the ratio of the coefficients begins to vary by an order of magnitude or more, the finite difference approach can suffer from a host of numerical problems affecting its convergence.

4.2 Two Layers

A two layer problem is simply two single layers connected by interface conditions across a common boundary. The interface conditions describe the continuity of u(x, t) and its flux, K ∂u/∂x, across an interface x = l. Equations (4.2) through (4.5) describe the two layer problem.

∂u/∂t = κ₁ ∂²u/∂x²,  0 ≤ x < l₁,  t > 0   (4.2)
∂u/∂t = κ₂ ∂²u/∂x²,  l₁ < x ≤ 1,  t > 0   (4.3)
u(l₁⁻, t) = u(l₁⁺, t)   (4.4)
K₁ ∂u/∂x |_{l₁⁻} = K₂ ∂u/∂x |_{l₁⁺}   (4.5)

In this case, the diffusion coefficients κ₁, κ₂ are constant, as are the flux coefficients K₁ and K₂. We have used ± to denote the right side (+) and left side (−) of the interface x = l₁. The mathematics of two layers is similar to that in section 3.1, but it will now be generalized.

Assume we have layers with widths l₁ and l₂. Each layer i has N_i sub-intervals and the width of each sub-interval is h_i = l_i/N_i. Now define

M_i = Σ_{k=1}^{i} N_k,  r_i = 2h_i²/(κ_i δt),  i = 1, 2,

so that the coefficient matrices take on the form

A₁ = [ 2+r₁  −1                        ]
     [ −1   2+r₁  −1                   ]
     [       ⋱      ⋱      ⋱           ]
     [            −1   2+r₁   −1       ]
     [                 −1    2+r₁      ]  (N₁−1)×(N₁−1),   (4.6)

A₂ = [ 2+r₂  −1                        ]
     [ −1   2+r₂  −1                   ]
     [       ⋱      ⋱      ⋱           ]
     [            −1   2+r₂   −1       ]
     [                 −1    2+r₂      ]  (N₂−1)×(N₂−1);   (4.7)

the solution vectors take the form

x₁ = ( u_1^(j), u_2^(j), …, u_{N₁−2}^(j), u_{N₁−1}^(j) )ᵀ,
x₂ = ( u_{N₁+1}^(j), u_{N₁+2}^(j), …, u_{M₂−2}^(j), u_{M₂−1}^(j) )ᵀ;

the right hand terms take the form

f₁ = ( f_1^(j) + g^(j), f_2^(j), …, f_{N₁−2}^(j), f_{N₁−1}^(j) + u_{N₁⁻}^(j) )ᵀ, and
f₂ = ( f_{N₁+1}^(j) + u_{N₁⁺}^(j), f_{N₁+2}^(j), …, f_{M₂−2}^(j), f_{M₂−1}^(j) + h^(j) )ᵀ;

and the system of equations is

A₁x₁ = f₁  and  A₂x₂ = f₂.

We now need an equation that connects the layers, while preserving continuity of the state variable and the flux.

4.3 Interface Equation

At the interface x = l₁ we have the continuity of the state variable

u_{N₁⁻}^(j) = u_{N₁⁺}^(j) = u_{N₁}^(j)

and the continuity of the flux

K₁ (u_{N₁⁻}^(j) − u_{N₁−1}^(j)) / h₁ = K₂ (u_{N₁+1}^(j) − u_{N₁⁺}^(j)) / h₂.

When the two previous interface conditions are combined, we are left with an expression for the solution across the interface,

(K₁/h₁ + K₂/h₂) u_{N₁}^(j) = (K₁/h₁) u_{N₁−1}^(j) + (K₂/h₂) u_{N₁+1}^(j)  ∀j.   (4.8)
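As a quick consistency check (not stated in the report): when the two layers are identical, K₁ = K₂ and h₁ = h₂, equation (4.8) reduces to u_{N₁}^(j) = (u_{N₁−1}^(j) + u_{N₁+1}^(j)) / 2, the symmetric average one would expect at an ordinary interior point.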


4.4 Complete System for Two Layers

In matrix form, the tridiagonal system for two layers with the interface included consists of

A_if x^(j) = f^(j)   (4.9)

where

A_if = [ 2+r₁   −1                                              ]
       [ −1    2+r₁   −1                                        ]
       [        ⋱      ⋱      ⋱                                 ]
       [              −1    2+r₁    −1                          ]
       [            −K₁/h₁   K₁/h₁ + K₂/h₂   −K₂/h₂             ]
       [                      −1     2+r₂    −1                 ]
       [                              ⋱      ⋱      ⋱           ]
       [                                    −1    2+r₂          ]

and

x = ( u₁, u₂, …, u_{N₁−1}, u_{N₁}, u_{N₁+1}, …, u_{M₂−2}, u_{M₂−1} )ᵀ,
f = ( f₁ + u₀, f₂, …, f_{N₁−1}, 0, f_{N₁+1}, …, f_{M₂−2}, f_{M₂−1} + u_{M₂} )ᵀ.

Notice that the sub-matrix in the upper left hand corner of A_if is A₁ from (4.6), the sub-matrix in the lower right hand corner of A_if is A₂ from (4.7), and the 5 terms in the center of A_if relate to the interface equation (4.8).

Solving (4.9) now consists of the following 5 steps (a driver sketch follows the list).

1. Compute the LU factorization of A₁ and then perform the forward substitution to obtain the relation between u_{N₁⁻}^(j) and u_{N₁−1}^(j).

2. Compute the UL factorization of A₂ and then perform the backward substitution to obtain the relation between u_{N₁⁺}^(j) and u_{N₁+1}^(j).

3. Substitute the results of (1) and (2) into the interface equation (4.8) to find u_{N₁}^(j).

4. Perform the backward substitution in layer 1.

5. Perform the forward substitution in layer 2.
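A hedged sketch of a driver for these five steps, built on the routines in Listings A.2 and A.3 (the factor/solve pair carries out steps 1 through 5 internally). The report's actual driver is not reproduced, so the time-stepping loop and the BuildRightHandSide helper are illustrative assumptions, and the CVector extents shown are indicative.

// Hypothetical two-layer driver built on Listings A.2 and A.3.
// TwoLayerFactorization does the LU/UL factor work once; each call to
// Solve2LayerCPU then performs steps 1-5 for one time step.
void RunTwoLayer_Sketch(int N1, int N2, double r1, double r2,
                        double K1, double K2, double H1, double H2,
                        int timeSteps)
{
    CVector Gamma1(N1 - 1), Delta1(N1 - 1);   // LU factors of A1 in (4.6)
    CVector Gamma2(N2 - 1), Delta2(N2 - 1);   // UL factors of A2 in (4.7)
    CVector y(N1 - 1), z(N2 - 1), f(N1 + N2 - 1), u(N1 + N2 - 1);

    TwoLayerFactorization(r1, r2, Gamma1, Delta1, Gamma2, Delta2);

    for (int j = 1; j <= timeSteps; j++) {
        BuildRightHandSide(u, f);   // assumed helper: f of (4.9)
        Solve2LayerCPU(Gamma1, Delta1, y, f, Gamma2, Delta2, z, u,
                       K1, K2, H1, H2);
    }
}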

4.5 Difference Between Inclusion and Exclusion of the Interface Condition

Equations (4.4) and (4.5) describe the continuity of the state variable and its flux across the interface. If we look at

K₁ ∂u₁/∂x = K₂ ∂u₂/∂x

and assume that K₂ ≫ K₁, we have

(K₁/K₂) ∂u₁/∂x = ∂u₂/∂x → 0, so that ∂u₂/∂x ≈ 0.

This suggests a horizontal tangent of u₂ at the interface. This behavior is not captured by the form that excludes the interface condition, and the difference appears as an error between the two solutions. These errors are described in section 4.7 when the experiments are discussed.

4.6 Implementation

The CPU implementation for the two layer problem is a simple adaptation of the standard LU factorization and solver, running through the entire set of state variables from 1 to M₂. In listing A.1 the factorization requires 5 lines of code and the solver requires 6. In contrast, the two layer factorization in listing A.2 requires 11 lines of code and the solver in listing A.3 requires 13.

The GPU implementation for the two layer problem also leverages the logic from the one layer problem. The forward and backward substitution kernels for the LU solver require approximately 27 lines of code each. Since we now need forward and backward substitution kernels for the UL solver as well, they also contribute approximately 27 lines each to the total line count.

4.7 Results

If the interface expression in (4.8) is excluded and we are presented with a two layer problem, the tridiagonal matrix A₁₂ has the form

A₁₂ = [ 2+r₁  −1                              ]
      [ −1   2+r₁  −1                         ]
      [       ⋱     ⋱     ⋱                   ]
      [           −1   2+r₁   −1              ]
      [                −1    2+r₂   −1        ]
      [                       ⋱     ⋱    ⋱    ]
      [                          −1  2+r₂  −1 ]
      [                              −1  2+r₂ ]   (4.10)

for the system

A₁₂x = f.   (4.11)

Four experiments were performed that apply the CPU LU decomposition to (4.11) and compare the results to the 5 step algorithm (CPU 2L) described in section 4.4. The parameters for the experiments are described in table 4.1.

Experiment 1 in figures 4.2 and 4.3 shows that the results of solving (4.9) and (4.11) are nearly identical. Figure 4.2 demonstrates the time evolution of the single interface point u_{N₁} when the interface equation is utilized, and compares that value with the two values u_{N₁} and u_{N₁+1} that straddle u_{N₁+1/2} in (4.11), which excludes the interface. In figure 4.3, the systems are solved with 1024 points over 50 time steps. With the interface point at u₅₁₂, the 32 points on either side are in excellent agreement for both systems.

Table 4.1.: Experimental parameters for 50 timesteps

Experiment # | κ₁ | κ₂ | K₁ | K₂
1            | 1  | 1  | 1  | 1
2            | 1  | 10 | 1  | 1
3            | 1  | 1  | 1  | 10
4            | 1  | 10 | 1  | 10

Experiment 2 in figures 4.4 and 4.5 demonstrates the next set of results. The time evolution plot in figure 4.4 shows the interface point significantly above the excluded form, while the solution graphs in figure 4.5 do a poor job of coinciding. The error plot between the two solutions in figure 4.6 further quantifies the consequences of ignoring the interface equation.

Experiment 3 in figures 4.7 and 4.8 demonstrates results that are in good agreement with each other. This experiment also seems to suggest that the flux coefficient has little impact on the solution.

Finally, experiment 4 in figures 4.9 and 4.10 graphically demonstrates the problem that occurs when the diffusion and conductivity coefficients are κ₁ = 1, κ₂ = 10 and K₁ = 1, K₂ = 10, respectively. Figure 4.9 shows the actual interface point u_{N₁} receding from the two points that excluded the interface equation. Again, when the solution is graphed after 50 time steps in figure 4.10, it is readily apparent that excluding the interface equation introduces error into the solution. As in experiment 2, figure 4.11 quantifies the solution error. The results of experiments 2 and 4 suggest that the diffusion coefficients, along with the solver, play a large role in producing a correct solution.

With experiments 1 through 4 giving us faith in the interface equation approach, the benefits of accelerating the process on the GPU are now reported. Table 4.2 demonstrates the 18 fold acceleration of the GPU (GPU RD2L) over the CPU (CPU 2L) version with 2 layers, as well as relative errors that are small when compared with the analytical solution corresponding to κ₁ = 1, κ₂ = 1, K₁ = 1, K₂ = 1 with 32 terms.

The results of Table 4.3 are not surprising. With κ₁ = 1, κ₂ = 10, K₁ = 1, K₂ = 1 and the number of terms set to 32, a maximum acceleration of 18 fold is measured. The modifications made to RD allow us to process problems that are 65,536 times larger than the previously reported sizes for Stone's approach.

Table 4.2.: CPU 2L vs GPU RD2L, κ₁ = 1, κ₂ = 1, Terms=32, Time Steps=50

Points      | CPU 2L (ms) | CPU MaxRel Err | GPU RD2L (ms) | GPU MaxRel Err | Speed Up
1,024       | 1.7         | 8.756E-05      | 7.4           | 8.756E-05      | 0.2
65,536      | 70.9        | 2.318E-08      | 11.2          | 2.360E-08      | 6.3
1,048,576   | 1,038.4     | 8.091E-11      | 66.1          | 2.533E-10      | 15.7
4,194,304   | 4,163.8     | 2.056E-10      | 238.1         | 2.070E-10      | 17.4
16,777,216  | 16,638.1    | 3.473E-10      | 928.7         | 3.458E-10      | 17.9
67,108,864  | 66,963.9    | 5.239E-09      | 3,698.4       | 5.241E-09      | 18.1

Table 4.3.: CPU 2L vs GPU RD2L, κ₁ = 1, κ₂ = 10, Time Steps=50

Points      | CPU 2L (ms) | GPU RD2L (ms) | Speed Up
768         | 0.7         | 7.4           | 0.1
49,152      | 50.7        | 10.2          | 4.9
786,432     | 818.5       | 51.6          | 15.8
3,145,728   | 3,161.0     | 180.8         | 17.4
12,582,912  | 12,524.5    | 700.1         | 17.8
50,331,648  | 50,639.1    | 2,779.5       | 18.2

Figure 4.2.: Time Evolution of Interface-Experiment 1 [solution values versus time step for u_{N₁} with the interface equation included, and for u_{N₁} and u_{N₁+1} with it excluded; κ₁ = 1, κ₂ = 1, K₁ = 1, K₂ = 1]

Figure 4.3.: Solution to the Interface Neighborhood-Experiment 1 [solution values versus spatial index, interface equation included versus excluded; κ₁ = 1, κ₂ = 1, K₁ = 1, K₂ = 1, at ts = 50]

Figure 4.4.: Time Evolution of Interface-Experiment 2 [solution values versus time step for u_{N₁} with the interface equation included, and for u_{N₁} and u_{N₁+1} with it excluded; κ₁ = 1, κ₂ = 10, K₁ = 1, K₂ = 1]

Figure 4.5.: Solution to the Interface Neighborhood-Experiment 2 [solution values versus spatial index, interface equation included versus excluded; κ₁ = 1, κ₂ = 10, K₁ = 1, K₂ = 1, at ts = 50]

Figure 4.6.: Error in the Interface Neighborhood-Experiment 2 [|Error| versus spatial index near the interface; κ₁ = 1, κ₂ = 10, K₁ = 1, K₂ = 1, at ts = 50]

Figure 4.7.: Time Evolution of Interface-Experiment 3 [solution values versus time step for u_{N₁} with the interface equation included, and for u_{N₁} and u_{N₁+1} with it excluded; κ₁ = 1, κ₂ = 1, K₁ = 1, K₂ = 10]

Figure 4.8.: Solution to the Interface Neighborhood-Experiment 3 [solution values versus spatial index, interface equation included versus excluded; κ₁ = 1, κ₂ = 1, K₁ = 1, K₂ = 10, at ts = 50]

Figure 4.9.: Time Evolution of Interface-Experiment 4 [solution values versus time step for u_{N₁} with the interface equation included, and for u_{N₁} and u_{N₁+1} with it excluded; κ₁ = 1, κ₂ = 10, K₁ = 1, K₂ = 10]

Figure 4.10.: Solution to the Interface Neighborhood-Experiment 4 [solution values versus spatial index, interface equation included versus excluded; κ₁ = 1, κ₂ = 10, K₁ = 1, K₂ = 10, at ts = 50]

Figure 4.11.: Error in the Interface Neighborhood-Experiment 4 [|Error| versus spatial index near the interface; κ₁ = 1, κ₂ = 10, K₁ = 1, K₂ = 10, at ts = 50]

5 CONCLUSION

We started this project with the goal of creating a GPU based solver for the two layer linear diffusion equation. Before the solver could be addressed, the LU factorization of the finite difference form of the diffusion equation was investigated.

The two term non-linear recurrence relation that results from the LU factorization was shown to be sequential and not parallelizable. When converted to a three term linear recurrence relation, the result could be parallelized for a GPU. However, it suffered from arithmetic overflow. When the underlying initial value problem was solved analytically, we were able to demonstrate that a polynomial form of the solution could be truncated to achieve a prescribed level of machine accuracy. Our experiments showed excellent agreement with the solution (3.3) to the initial value problem.

We mentioned briefly that other researchers reported numerical instabilities when applying Stone's recursive doubling algorithm to the actual solver for problems with 1,024 unknowns. Overflow is inherent in Stone's algorithm. In addition, the instability is due to an arithmetic underflow problem that appears when the two term linear recurrence relation associated with the solver is evaluated. When Stone's algorithm is recast using matrix algebra, we were able to overcome these deficiencies by applying the theory of nilpotent matrices to generate a truncated form of the matrix polynomial.

Our experiments demonstrated a GPU based algorithm that is approximately 36 times faster than the CPU version for the 1 layer problem, and approximately 18 times faster than the CPU for the 2 layer problem. Our 1 layer implementation required two GPU kernels and the 2 layer implementation required 5 kernels. The GPU is designed for large problems that stream RAM through the cores. We attribute the 2 fold loss in efficiency to the presence of the additional kernels, as well as to the awkward coding patterns that were required to handle the interface condition. On the positive side, our algorithm is also capable of processing 2²⁶ equations, a value that is 65,536 times larger than that of other researchers.

Finally, we described the mathematical framework for solving the two layer problem. Our experiments show that error is introduced into the solution when the interface conditions, the continuity of the state variable and the flux across the interface boundary, are not properly accounted for. In fact, our work suggests that the flux coefficients control the placement of the solution, while the diffusion coefficients, in combination with the type of solver used, control the error in the solution. We have also shown that with the straightforward LU solver for two layers requiring approximately 54 lines of kernel code on the GPU, our parallel two layer solution is only modestly more complex at approximately 108 lines of code.


REFERENCES

[1] Greg Turk. Generating textures on arbitrary surfaces using reaction diffusion. In Proc. of the 18th Annual Conference on Computer Graphics and Interactive Techniques, pages 289–298. ACM Press, 1991.

[2] Daniele Bertaccini and Daniela Calvetti. Fast simulation of solid tumors thermal ablation treatments with a 3d reaction diffusion model. Computers in Biology and Medicine, 37(8):1173–1182, 2007.

[3] F. T. Tracy. Clean two- and three-dimensional analytical solutions of Richards' equation for testing numerical solvers. Water Resources Research, 42(8), 2006.

[4] R. Allan Freeze. Three dimensional, transient, saturated-unsaturated flow in a groundwater basin. Water Resources Research, 7(2):347–366, 1971.

[5] James S. Boswell and Greg A. Olyphant. Modeling the hydrologic response of groundwater dominated wetlands to transient boundary conditions: Implications for wetland restoration. Journal of Hydrology, 332(3):467–476, 2007.

[6] NVidia Corp. CUDA C Programming Guide. Technical Report PG02829001 v5.5, NVidia Corp, July 2013.

[7] Harold S. Stone and Peter M. Kogge. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, C-22(8):786–793, 1973.

[8] Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, Carnegie Mellon University, November 1990.

[9] Volodymyr Kindratenko. A guide for implementing tridiagonal solvers on GPUs. In Numerical Computations with GPUs. Springer International Publishing, 2014.

[10] Yao Zhang, Jonathan Cohen, and John D. Owens. Fast tridiagonal solvers on the GPU. In Proc. of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 127–136. ACM Press, January 2010.

[11] Harold S. Stone. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. Journal of the ACM, 20(1):27–38, 1973.

[12] Ralph C. Heath. Basic ground-water hydrology. Technical Report 2220, U.S. Geological Survey Water-Supply, 1983.


A CPU CODE

// =======================================================================
// LUFactorizationCPU
// =======================================================================
void LUFactorizationCPU(const CVector &SubDiag, const CVector &MainDiag,
                        const CVector &SuperDiag, CVector &Gamma, CVector &Delta)
{
    Gamma[1] = MainDiag[1];
    Delta[1] = SuperDiag[1] / Gamma[1];

    for (int i = 2; i <= Gamma.Last(); i++)
    {
        Gamma[i] = MainDiag[i] - SubDiag[i]*Delta[i-1];
        Delta[i] = SuperDiag[i] / Gamma[i];
    }
}

// =======================================================================
// LUSolveCPU
// =======================================================================
void LUSolveCPU(const CVector &Gamma, const CVector &Delta,
                const CVector &ASubDiag,
                const CVector &f, CVector &x, CVector &y)
{
    //
    // Forward substitution. Solve Ly = f.
    //
    y[1] = f[1] / Gamma[1];

    for (int i = 2; i <= y.Last(); i++)
    {
        y[i] = (f[i] - ASubDiag[i]*y[i-1]) / Gamma[i];
    }

    //
    // Backward substitution. Solve Ux = y.
    //
    x[x.Last()] = y[y.Last()];

    for (int i = x.Last() - 1; i > 0; i--)
    {
        x[i] = y[i] - Delta[i]*x[i+1];
    }
}

Listing A.1: 1 Layer CPU Factorization and Solver

// =======================================================================
// TwoLayerFactorization
// =======================================================================
void TwoLayerFactorization(const double r1, const double r2,
    CVector &Gamma1, CVector &Delta1, CVector &Gamma2, CVector &Delta2)
{
    //
    // The LU factorization for matrix A1
    //
    const double b1 = 2.0 + r1;
    const double b2 = 2.0 + r2;

    Gamma1[1] = b1;
    Delta1[1] = -1.0 / Gamma1[1];

    for (int i = 2; i <= Gamma1.Last(); i++)
    {
        Gamma1[i] = b1 + Delta1[i-1];
        Delta1[i] = -1.0 / Gamma1[i];
    }

    //
    // The UL factorization for matrix A2
    //
    Gamma2[Gamma2.Last()] = b2;

    for (int i = Gamma2.Last() - 1; i >= 1; i--)
    {
        Delta2[i] = -1.0 / Gamma2[i+1];
        Gamma2[i] = b2 + Delta2[i];
    }
}

Listing A.2: 2 Layer CPU Factorization

// =======================================================================
// Solve2LayerCPU
// =======================================================================
void Solve2LayerCPU(const CVector &Gamma1, const CVector &Delta1, CVector &y,
                    CVector &f, const CVector &Gamma2, const CVector &Delta2,
                    CVector &z, CVector &u,
                    const double K1, const double K2,
                    const double H1, const double H2)
{
    // Top down, y[1] to the point y[N_1 - 1].
    // The forward substitution for LU.
    y[1] = f[1] / Gamma1[1];

    for (int i = 2; i <= Gamma1.Last(); i++)
    {
        y[i] = (f[i] + y[i-1]) / Gamma1[i];
    }

    // Bottom up, z[N_1 - 1] to z[1].
    // The backward substitution for UL.
    z[z.Last()] = f[f.Last()];

    for (int i = 1; i < z.Last(); i++)
    {
        z[z.Last()-i] = f[f.Last()-i] - Delta2[Delta2.Last()-i]*z[z.Last()-i+1];
    }

    // Fix up U_{N_1}.
    const int N1 = Gamma1.Last() + 1;

    const double alpha = (K1/H1)*(1.0 - (1.0/Gamma1[Gamma1.Last()]))
                       + (K2/H2)*(1.0 - (1.0/Gamma2[1]));

    u[N1] = ((K1/H1)*y[y.Last()] + (K2/H2)*(z[1]/Gamma2[1])) / alpha; // update U_{N_1}

    // Advance up from point U_{N_1 - 1} to U_1.
    // The backward substitution for LU.
    for (int i = N1 - 1; i >= 1; i--)
    {
        u[i] = y[i] - Delta1[i]*u[i+1];
    }

    // Move down from point U_{N_1 + 1} to U_{M_2}.
    // The forward substitution for UL.
    for (int i = 1; i <= Gamma2.Last(); i++)
    {
        u[N1+i] = (z[i] + u[N1+i-1]) / Gamma2[i];
    }
}

Listing A.3: 2 Layer CPU Solver


B GPU CODE

// =======================================================================
// ForwardLU_kernel -- compute the forward scan on the GPU
// =======================================================================
__global__ void ForwardLU_kernel(const double * __restrict__ gpuGamma,
                                 const double * __restrict__ gpuFhat,
                                 double *gpuYhat, unsigned int NumberOfElements)
{
    __shared__ double sG[MAXTERMS + 1024];
    __shared__ double sF[MAXTERMS + 1024];

    const unsigned int gtid0 = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int idx0  = threadIdx.x;

    double TG0 = 0.0;
    double TF0 = 0.0;

    if (gtid0 < NumberOfElements)
    {
        TG0 = gpuGamma[gtid0];
        TF0 = gpuFhat[gtid0];
    }
    sG[idx0 + MAXTERMS] = TG0;
    sF[idx0 + MAXTERMS] = TF0;

    // The first MAXTERMS threads also stage a halo of the elements
    // preceding this block, needed by the truncated recurrence.
    if (idx0 < MAXTERMS)
    {
        TG0 = 0.0;
        TF0 = 0.0;
        if (gtid0 > MAXTERMS)
        {
            TG0 = gpuGamma[gtid0 - MAXTERMS];
            TF0 = gpuFhat[gtid0 - MAXTERMS];
        }
        sG[idx0] = TG0;
        sF[idx0] = TF0;
    }
    __syncthreads();

    // Accumulate the truncated forward recurrence over MAXTERMS terms.
    double Sum0 = 0.0;

    for (int i = 1; i <= MAXTERMS; i++)
    {
        double F0 = sF[idx0 + i];
        double G0 = sG[idx0 + i];

        Sum0 = (Sum0 + F0)*G0;
    }

    if (gtid0 < NumberOfElements)
    {
        gpuYhat[gtid0] = Sum0;
    }
}

Listing B.1: GPU Forward LU Solver

// =======================================================================
// BackwardLU_kernel -- compute the backward scan on the GPU
// =======================================================================
__global__ void BackwardLU_kernel(const double * __restrict__ gpuGamma,
                                  const double * __restrict__ gpuYhat,
                                  double *gpuXhat, unsigned int NumberOfElements)
{
    __shared__ double sG[1024 + MAXTERMS];
    __shared__ double sY[1024 + MAXTERMS];

    const unsigned int gtid   = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int idx    = threadIdx.x;
    const unsigned int warpid = idx >> 5;

    double TG = 0.0;
    double TY = 0.0;
    if (gtid < NumberOfElements)
    {
        TG = gpuGamma[gtid];
        TY = gpuYhat[gtid];
    }
    sG[idx] = TG;
    sY[idx] = TY;

    // Stage a halo of the elements following this block, needed by the
    // truncated backward recurrence.
    if (blockIdx.x < (NumberOfElements >> 10))  // if not in the last block
    {
        if (warpid == 31)  // if in the last warp
        {
            TG = gpuGamma[gtid + MAXTERMS];
            TY = gpuYhat[gtid + MAXTERMS];
            sG[idx + MAXTERMS] = TG;
            sY[idx + MAXTERMS] = TY;
        }
    }

    __syncthreads();

    // Accumulate the truncated backward recurrence over MAXTERMS terms.
    double Sum = 0.0;

    for (int i = MAXTERMS - 1; i >= 0; i--)
    {
        double Y = sY[idx + i];
        double G = (i == 0 ? 1.0 : sG[idx + i - 1]);

        Sum = (Sum + Y)*G;
    }

    if (gtid < NumberOfElements)
        gpuXhat[gtid] = Sum;
}

Listing B.2: GPU Backward LU Solver