

Nine-Month Report

Hardware Acceleration for Real Time Solution of the Discrete-Finite Element Method

Supervised by

Dr. Steven F Quigley

Prof. Andrew H C Chan

Lin Zhang

Department of Electronics, Electrical and Computer Engineering

The University of Birmingham

June 2008

Index

Abstract

1. Introduction

1.1 Overview

1.2 Outline

2. Combined Discrete Finite Element Method

2.1 Finite Element Method

2.1.1 Beam element

2.1.2 Space frame element

2.2 Discrete Element Method

3. Parallel Solution for Linear Matrix Equations

3.1 Classic Methods for Linear Matrix Equation Solution

3.1.1 Direct Solution Process

3.1.2 Iterative Computing Method

3.2 Overview of Parallel Computation

3.2.1 Matrix Partition

3.2.2 Matrix Multiplication

3.3 Paralleled Methods for Equation Solutions

3.3.1 Jacobi

3.3.2 Gauss-Seidel

3.3.3 Successive over-relaxation

3.3.4 CG and PCG

3.4 Summary and Outlook

4. Case Studies

4.1 Beam model

4.2 Space Frame Model

4.2.1 Optimization and balance

4.2.2 Design for parallel solution of matrix equations

5. Conclusion

5.1 Summary of the report

5.2 Work plan for next nine months

5.3 Outline plan to PhD submission

5.4 Publication plan

Reference

Abstract

This nine-month progress report presents a general overview of the combined finite-discrete element method and then goes through the fundamental principles of both the finite element method and parallel computing. After this background review, the report derives a basic structure for dealing with FEM problems. Two case studies on finite element models that use this structure are then introduced, covering the designs and implementations completed so far. The final part of the report is the work plan for the next stage.

1. Introduction

1.1 Overview

The combined Discrete-Finite Element Method (DFEM) is a promising approach for creating virtual reality and gaming environments that exhibit highly realistic physical behaviour [1]. It combines the advantages of finite element tools and techniques with those of discrete element algorithms. However, because of the complexity of the method, the DFEM equations are computationally expensive and cannot be solved in real time on common desktop PCs and workstations for complex virtual environments. Many numerical problems can be sped up greatly by hardware accelerators: specialized integrated circuits that exploit a high level of parallelism to give a rapid solution.

A practical attempt to solve the DFEM equations in real time could be made by using hardware accelerators on a low-cost plug-in board for desktop PCs. This will involve investigating a variety of formulations of the DFEM in order to identify which can best be accelerated by low-cost hardware with a clear partition between hardware and software. It will also involve designing the hardware accelerators and evaluating their effectiveness and their scalability to large-scale problems.

The past nine months of study formed the initial stage of the whole project. The main objectives were:

1) Studying the background of the DFEM, especially in the civil engineering field, and the fundamentals of parallel techniques for the solution of linear matrix equations.

2) Developing suitable techniques for hardware parallel computing that fit the characteristics of the FEM.

3) Designing a hardware accelerator and its software interface for two simple FEM models.

1.2 Outline

This report is organised as follows:

Section 1 introduces the research background of the whole project and states the targets for the work in the first nine-month period, followed by this outline of the report.

Section 2 is a summarized literature review of the combined Discrete-Finite Element Method. Because the research so far has focused mainly on the finite element method, this part begins with a detailed description of the FEM, followed by a brief introduction to the discrete element method (DEM).

Section 3 discusses general parallel techniques for the solution of linear matrix equations. Several approaches to matrix calculation and equation solution are also introduced.

Section 4 describes the case studies: a beam model in 32-bit fixed-point arithmetic and a space frame model in floating-point arithmetic.

Section 5 summarizes the work of the past nine months and gives the plan for future work and publication, together with the expected date for finishing this study.

2. Combined Discrete Finite Element Method

As the name suggests, the combined finite-discrete element method is a combination of finite element-based analysis of continua with discrete element-based transient dynamics, contact detection and contact interaction solutions. It is aimed at the transient dynamic analysis of systems that contain a large number of deformable bodies which interact with each other while undergoing breakage, fracture and fragmentation [2].

2.1 Finite Element Method

The finite element method is a technique for approximating the governing differential equations of a continuous system by a set of algebraic equations in a finite number of variables [3].

Classical structural mechanics (force, displacement and energy methods) can be used to solve simple, routine engineering problems. With the development of structural engineering, and especially of computer calculation, matrix analysis methods became more popular. One of the most commonly used is the matrix displacement method of structural mechanics, which is the precursor of the finite element method. In the finite element method, the structure is divided into units called elements, and an approximate displacement function is used to represent the behaviour of the structure. The finite element method is therefore an approximate numerical analysis. The principle of virtual work can be used to derive the finite element equations. For further information about the FEM, see textbooks such as [4, 5].

Generally speaking, in solid and structural mechanics the FEM can be used to

solve one-, two- and three-dimensional as well as axisymmetric problems,

including the elastic, elasto-plastic and viscoelastic analysis of trusses,

frames, plates, shells and solid bodies [6].

One-dimensional structural elements can be used for the analysis of skeletal systems such as planar trusses, space trusses, beams, continuous beams, planar frames, grid systems and space frames [7]. This leads to the same matrix format as the matrix displacement method in structural mechanics, and is used to calculate the internal forces and deformation of plane or three-dimensional structures.

The common procedure for solving a matrix structural problem is:

1) Discretize the whole structure as closely to its geometry as possible. That is, the skeletal model, including its elements and nodes, should fit the real object as well as possible.

2) Determine the properties of the elements and build the element matrices, such as the stiffness matrix.

3) Analyse the whole structure by assembling the equilibrium equations for each node into the structural stiffness matrix and the load vector.

4) Apply the boundary conditions and solve the matrix equations. Normally, the solution gives the displacement of each node.

5) Calculate the internal forces and deformation from the nodal displacements and output the quantities of interest.
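The five steps above can be sketched on the simplest possible case. The example below is only an illustration under stated assumptions, not the implementation used in this project: it assumes a straight bar fixed at one end and loaded axially at the other, modelled with two-node axial elements, and the function name `solve_fixed_free_bar` and all numerical values are hypothetical.

```python
import numpy as np

def solve_fixed_free_bar(n_elem, length, E, A, tip_load):
    """Axial bar fixed at node 0 and loaded at its free end (illustrative)."""
    # Step 1: discretize into equal two-node elements (nodes 0..n_elem).
    le = length / n_elem
    n_nodes = n_elem + 1

    # Step 2: element stiffness matrix of a two-node axial element.
    k_e = (E * A / le) * np.array([[1.0, -1.0], [-1.0, 1.0]])

    # Step 3: assemble the structural stiffness matrix and the load vector.
    K = np.zeros((n_nodes, n_nodes))
    F = np.zeros(n_nodes)
    for e in range(n_elem):
        dofs = [e, e + 1]
        K[np.ix_(dofs, dofs)] += k_e
    F[-1] = tip_load

    # Step 4: apply the boundary condition u[0] = 0 and solve for the rest.
    free = np.arange(1, n_nodes)
    u = np.zeros(n_nodes)
    u[free] = np.linalg.solve(K[np.ix_(free, free)], F[free])

    # Step 5: recover the internal (axial) force in each element.
    axial_force = (E * A / le) * np.diff(u)
    return u, axial_force

# Assumed example values: 2 m steel bar, 1 cm^2 section, 1 kN tip load.
u, axial_force = solve_fixed_free_bar(4, 2.0, 210e9, 1e-4, 1000.0)
```

For this load case the tip displacement should agree with the closed-form value PL/(EA), which gives a quick check on the assembly.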

The study so far has focused on the beam and space frame models.

2.1.1 Beam element

The deformation of a beam element is assumed to be described only by the transverse displacements ($v_1$, $v_2$) and rotations ($\theta_1$, $\theta_2$) at its two nodes, where $\theta = \mathrm{d}v/\mathrm{d}x$, as shown in [Figure 1].

Figure 1 A typical beam element

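Because there are four nodal displacements, a cubic displacement model is required:

\[ v(x) = \alpha_1 + \alpha_2 x + \alpha_3 x^2 + \alpha_4 x^3 \]

Using the conditions

\[ v = v_1 \ \text{and}\ \frac{\mathrm{d}v}{\mathrm{d}x} = \theta_1 \ \text{at}\ x = 0 \]

and

\[ v = v_2 \ \text{and}\ \frac{\mathrm{d}v}{\mathrm{d}x} = \theta_2 \ \text{at}\ x = l \]

where $l$ is the length of the element, the equation can also be written as

\[ v(x) = [N]\,\vec{q}, \qquad \vec{q} = (v_1,\ \theta_1,\ v_2,\ \theta_2)^T \]

where the shape function $[N] = [N_1\ N_2\ N_3\ N_4]$ is given by

\[ N_1 = \frac{2x^3 - 3lx^2 + l^3}{l^3}, \quad N_2 = \frac{x^3 - 2lx^2 + l^2 x}{l^2}, \quad N_3 = -\frac{2x^3 - 3lx^2}{l^3}, \quad N_4 = \frac{x^3 - lx^2}{l^2} \]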

Figure 2 Deformation of an element of frame in plane[8]

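From [Figure 2], the axial displacement due to the transverse displacement can be expressed as

\[ u = -y\,\frac{\partial v}{\partial x} \]

where $v$ is the transverse displacement and $y$ is the distance from the neutral axis. The axial strain is given by

\[ \varepsilon_{xx} = \frac{\partial u}{\partial x} = -y\,\frac{\partial^2 v}{\partial x^2} = [B]\,\vec{q} \]

where $\vec{q} = (v_1,\ \theta_1,\ v_2,\ \theta_2)^T$ is the vector of nodal displacements, $[N]$ is the shape function matrix of the cubic displacement model, and the strain-displacement matrix $[B]$ is given by:

\[ [B] = -y\,\frac{\mathrm{d}^2 [N]}{\mathrm{d}x^2} = -\frac{y}{l^3}\left[\ (12x - 6l) \quad l(6x - 4l) \quad -(12x - 6l) \quad l(6x - 2l)\ \right] \]

The element stiffness matrix can be calculated from

\[ [k] = \iiint_V [B]^T [D]\,[B]\,\mathrm{d}V = E \int_0^l \mathrm{d}x \iint_A [B]^T [B]\,\mathrm{d}A = \frac{E I_{zz}}{l^3} \begin{bmatrix} 12 & 6l & -12 & 6l \\ 6l & 4l^2 & -6l & 2l^2 \\ -12 & -6l & 12 & -6l \\ 6l & 2l^2 & -6l & 4l^2 \end{bmatrix} \]

with $[D] = [E]$, where $E$ is Young's modulus, $l$ is the length of the element and $I_{zz} = \iint_A y^2\,\mathrm{d}A$ is the area moment of inertia of the cross section about the $z$-axis.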
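As a numerical check on the beam element, the sketch below builds the standard 4 × 4 stiffness matrix of a two-node Euler-Bernoulli beam element, with degrees of freedom ordered as (deflection, rotation) at each node, and uses it to solve a cantilever with a transverse tip load. The function names and the numerical values are assumptions made for illustration; they are not taken from the project code.

```python
import numpy as np

def beam_stiffness(E, I, l):
    """4x4 stiffness matrix of a two-node Euler-Bernoulli beam element."""
    return (E * I / l**3) * np.array([
        [12.0,    6 * l,    -12.0,    6 * l],
        [6 * l,   4 * l**2, -6 * l,   2 * l**2],
        [-12.0,  -6 * l,     12.0,   -6 * l],
        [6 * l,   2 * l**2, -6 * l,   4 * l**2],
    ])

def cantilever_tip(E, I, length, P, n_elem=2):
    """Tip deflection and rotation of a cantilever under a tip load P."""
    le = length / n_elem
    ndof = 2 * (n_elem + 1)          # two DOFs per node
    K = np.zeros((ndof, ndof))
    k_e = beam_stiffness(E, I, le)
    for e in range(n_elem):
        dofs = np.arange(2 * e, 2 * e + 4)
        K[np.ix_(dofs, dofs)] += k_e
    F = np.zeros(ndof)
    F[-2] = P                        # transverse load at the last node
    free = np.arange(2, ndof)        # node 0 is clamped (both DOFs fixed)
    q = np.zeros(ndof)
    q[free] = np.linalg.solve(K[np.ix_(free, free)], F[free])
    return q[-2], q[-1]

# Assumed example values.
E, I, L, P = 200e9, 1e-6, 2.0, 100.0
deflection, rotation = cantilever_tip(E, I, L, P)
```

Because the cubic displacement model reproduces the exact solution for nodal loads, the result should match the beam-theory values PL³/(3EI) and PL²/(2EI) regardless of the number of elements.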

2.1.2 Space frame element

A typical space frame element is shown in [Figure 3].

Figure 3 Element with 12 degrees of freedom[9]

The 12 degrees of freedom can be divided into three groups. The axial displacements at the two nodes can be described using a linear displacement model, which leads to the stiffness matrix

\[ [k_a] = \frac{AE}{l} \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \]

where $A$, $E$ and $l$ are the area of the cross section, Young's modulus and the length of the element respectively.

From Hooke's law, $\tau = G\gamma$, where $\tau$ is the shear stress, $\gamma$ is the shear strain and $G$ is the shear modulus of the material. The stiffness matrix of the element corresponding to the torsional degrees of freedom at the two nodes can be derived as

\[ [k_t] = G \int_0^l \mathrm{d}x \iint_A r^2\,\mathrm{d}A \; \frac{1}{l^2} \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \]

where $r$ is the distance from the axis of the element. Since $\iint_A r^2\,\mathrm{d}A = J$, where $J$ is the polar moment of inertia of the cross section, this can be rewritten as

\[ [k_t] = \frac{GJ}{l} \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \]

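The second group of degrees of freedom consists of the two transverse displacements and two rotations that describe bending in the local $xy$ plane. These four degrees of freedom can be treated as a beam element. With the degrees of freedom ordered as (displacement, rotation) at the first node followed by the second, and the rotation taken as the slope of the deflection curve, the corresponding stiffness matrix can be derived as

\[ [k_{xy}] = \frac{E I_{zz}}{l^3} \begin{bmatrix} 12 & 6l & -12 & 6l \\ 6l & 4l^2 & -6l & 2l^2 \\ -12 & -6l & 12 & -6l \\ 6l & 2l^2 & -6l & 4l^2 \end{bmatrix} \]

where $I_{zz}$ is the area moment of inertia of the cross section about the $z$-axis.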

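Similarly, the remaining two transverse displacements and two rotations describe bending in the local $xz$ plane. With the degrees of freedom ordered as (displacement, rotation) at each node and the rotation again taken as the slope of the deflection curve, the stiffness matrix can be derived as

\[ [k_{xz}] = \frac{E I_{yy}}{l^3} \begin{bmatrix} 12 & 6l & -12 & 6l \\ 6l & 4l^2 & -6l & 2l^2 \\ -12 & -6l & 12 & -6l \\ 6l & 2l^2 & -6l & 4l^2 \end{bmatrix} \]

where $I_{yy}$ is the area moment of inertia of the cross section about the $y$-axis.

Assembling the stiffness matrices of the three groups according to the positions of their degrees of freedom gives the overall 12 × 12 local stiffness matrix of the element.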

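The local stiffness matrix $[k]$ is normally transformed to the global one $[K]$ using

\[ [K] = [\lambda]^T [k]\,[\lambda] \]

where the 12 × 12 transformation matrix is given by

\[ [\lambda] = \begin{bmatrix} [\lambda_0] & 0 & 0 & 0 \\ 0 & [\lambda_0] & 0 & 0 \\ 0 & 0 & [\lambda_0] & 0 \\ 0 & 0 & 0 & [\lambda_0] \end{bmatrix} \]

and the 3 × 3 transformation sub-matrix is given by

\[ [\lambda_0] = \begin{bmatrix} l_x & m_x & n_x \\ l_y & m_y & n_y \\ l_z & m_z & n_z \end{bmatrix} \]

whose rows contain the direction cosines of the local $x$, $y$ and $z$ axes with respect to the global axes.

Three nodes are needed to define the space frame element: two of them are the ends of the element and the third is used to define the position of the local $xy$ plane. $x$, $y$ and $z$ denote the local coordinates. With $\vec{p}_1$ and $\vec{p}_2$ the position vectors of the two end nodes and $\vec{p}_3$ that of the third node, the local $x$ direction can be defined as

\[ \hat{x} = \frac{\vec{p}_2 - \vec{p}_1}{\lvert \vec{p}_2 - \vec{p}_1 \rvert} \]

Therefore, the local $z$ direction can be defined by the cross product of $\hat{x}$ and $\vec{p}_3 - \vec{p}_1$:

\[ \hat{z} = \frac{\hat{x} \times (\vec{p}_3 - \vec{p}_1)}{\lvert \hat{x} \times (\vec{p}_3 - \vec{p}_1) \rvert} \]

Similarly, the local $y$ direction can be produced by the cross product of $\hat{z}$ and $\hat{x}$:

\[ \hat{y} = \hat{z} \times \hat{x} \]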
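The construction of the local axes from three nodes can be sketched as follows. This is a minimal illustration that assumes the third node lies in the local xy plane (so that z is obtained first and y second); the function names `direction_cosines` and `transformation_matrix` and the node coordinates are hypothetical.

```python
import numpy as np

def direction_cosines(p1, p2, p3):
    """3x3 matrix whose rows are the local x, y, z unit vectors.

    p1, p2 are the element ends; p3 is a third node assumed to lie
    in the local xy plane (and not on the line through p1 and p2).
    """
    p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p1, p2, p3))
    x = (p2 - p1) / np.linalg.norm(p2 - p1)
    z = np.cross(x, p3 - p1)
    z /= np.linalg.norm(z)
    y = np.cross(z, x)               # already unit length
    return np.vstack([x, y, z])

def transformation_matrix(lam0):
    """12x12 block-diagonal matrix with four copies of the 3x3 block."""
    return np.kron(np.eye(4), lam0)

# Assumed example: element along the global y axis.
lam0 = direction_cosines((0, 0, 0), (0, 2, 0), (-1, 0, 0))
T = transformation_matrix(lam0)
# A 12x12 local stiffness matrix k_local would then be transformed
# as K_global = T.T @ k_local @ T.
```

Since the rows of the sub-matrix are orthonormal, both the 3 × 3 block and the 12 × 12 matrix are orthogonal, which is a convenient sanity check.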

2.2 Discrete Element Method

The discrete element method was first proposed by Cundall in the early 1970s [10], and was originally used to analyze the mechanical behaviour of discontinuous media such as rocks. The main idea of the DEM is to divide a discontinuous object into an aggregate of rigid elements and then to use the equations of motion to calculate the motion of each element. From these equations, the state of motion of the whole object can be derived. The most important benefit of the DEM is that relative motion between particles is permitted, and there is no requirement for continuity of displacement or compatibility of deformation.

Although it originated from traditional problems in discontinuous media, the application area of the DEM has been extended to continuous objects and to problems involving the mechanical transition between continuous and discontinuous media. One typical example is the damage and destruction of brittle materials, such as concrete, under dynamic loading such as shock and penetration. Problems of this kind are normally hard to solve and simulate directly with algorithms based only on continuum mechanics, such as the finite element method described above [11, 12].

Work on the DEM has only just started and will be taken further in the following months.

3. Parallel Solution for Linear Matrix Equations

The most time-consuming part of the FEM can be the solution of the matrix equations. The DEM, on the other hand, requires no solution of matrix equations, but a small time step, less than the critical time step, has to be used because the explicit scheme used is only conditionally stable. The FEM involves a large number of matrix additions and multiplications and the DEM involves a large number of floating-point calculations, which normally makes the real-time solution of large FE and DE systems impracticable on sequential computers. One acceptable approach to solving the matrix equations is the use of parallel techniques.

3.1 Classic Methods for Linear Matrix Equation Solution

Several classic methods are normally used for the solution of matrix equations. They can be roughly divided into two categories: direct solution processes and iterative computing methods.

3.1.1 Direct Solution Process

The direct solution processes include the Gaussian elimination method, the Gauss-Jordan elimination method and the LU decomposition method. These schemes are normally suitable for dense systems of linear equations.

3.1.1.1 Gaussian elimination method

There are two basic steps in Gaussian elimination. Step one is Forward

Elimination, which reduces the given system of equations into triangular form,

or results in a degenerate equation with no solution, indicating that the system

is singular. This is accomplished through the use of elementary row

operations. The second step uses back substitution to find the solution of the

system above [13].

Partial pivoting is also widely used in the Gaussian elimination method: at each elimination step, rows are interchanged so that the entry with the largest absolute value in the current column moves to the “pivot position”. This improves the numerical stability of the algorithm and reduces the round-off error.
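A minimal sketch of Gaussian elimination with partial pivoting is given below to make the two steps concrete. It is a plain dense implementation written for illustration; the function name `gauss_solve` and the test system are assumptions, and a production code would normally call a library routine instead.

```python
import numpy as np

def gauss_solve(A, b):
    """Solve A x = b by forward elimination with partial pivoting
    followed by back substitution. Works on copies of A and b."""
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    n = len(b)

    # Forward elimination: reduce the system to upper triangular form.
    for k in range(n - 1):
        # Partial pivoting: bring the largest |entry| of column k
        # (on or below the diagonal) into the pivot position.
        p = k + int(np.argmax(np.abs(A[k:, k])))
        if A[p, k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            A[[k, p]] = A[[p, k]]
            b[[k, p]] = b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    if A[n - 1, n - 1] == 0.0:
        raise ValueError("matrix is singular")

    # Back substitution on the triangular system.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

# The zero in the (0, 0) position forces a row interchange.
x = gauss_solve([[0, 2, 1], [1, 1, 1], [2, 1, 3]], [7, 6, 13])
```

The example system is chosen so that elimination without pivoting would fail at the first step.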

3.1.1.2 Gauss-Jordan elimination method

Gauss-Jordan elimination is a variation of the traditional Gaussian elimination method that puts zeros both above and below the diagonal elements, whereas the traditional method leaves an upper triangular matrix. In other words, Gauss-Jordan elimination brings a matrix to reduced row echelon form.

Gauss–Jordan elimination is considerably less efficient than Gaussian

elimination with back-substitution when solving a system of linear equations.

However, it is well suited for calculating the matrix inverse [14].

3.1.1.3 LU decomposition method

The LU decomposition is a matrix decomposition which decomposes a matrix

as the product of a lower and upper triangular matrix [15]. It can also be

viewed as a variant of the Gauss elimination method.

Let $A$ be a square matrix and $A = LU$, where $U$ is an upper triangular matrix with unity on the diagonal and $L$ is a lower triangular matrix. Thus the linear matrix equation $Ax = b$ can be written as

\[ LUx = b \]

Let $Ux = y$; the original problem can then be transformed equivalently into

\[ Ly = b, \qquad Ux = y \]

This reduces the whole solution to two simpler ones that can easily be solved by forward and back substitution.

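Expanding the matrix product $A = LU$ element by element, the decomposition (the Crout form when the unit diagonal is carried by $U$, as here; the Doolittle form places it on $L$ instead) can be derived as

\[ l_{ij} = a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj}, \qquad i \ge j \]

\[ u_{ij} = \frac{1}{l_{ii}} \left( a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj} \right), \qquad i < j \]

with $u_{ii} = 1$.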

One advantage of the LU method when it is implemented on a computer is its efficient use of memory for storing the matrices. On the one hand, there is no need to store the “0”s in either $L$ or $U$, or the “1”s on the diagonal of $U$. Thus $L$ and $U$ can be stored together in one square matrix. On the other hand, each $a_{ij}$ is used only once, to calculate the corresponding $l_{ij}$ or $u_{ij}$, so the results $l_{ij}$ and $u_{ij}$ can simply overwrite $a_{ij}$.
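The in-place storage described above can be sketched as follows. The example assumes the convention stated in the text (unit diagonal on U, general diagonal on L), performs no pivoting, and therefore assumes that no zero pivot occurs; the names `lu_inplace` and `lu_solve` and the test matrix are hypothetical.

```python
import numpy as np

def lu_inplace(A):
    """Factor A = L U with unit-diagonal U, storing L (on and below the
    diagonal) and U (above the diagonal) in a single square array."""
    LU = np.array(A, dtype=float)    # copy; each entry is overwritten once
    n = LU.shape[0]
    for j in range(n):
        # Column j of L: l_ij = a_ij - sum_k l_ik u_kj, for i >= j.
        for i in range(j, n):
            LU[i, j] -= LU[i, :j] @ LU[:j, j]
        if LU[j, j] == 0.0:
            raise ZeroDivisionError("zero pivot; pivoting would be needed")
        # Row j of U: u_jk = (a_jk - sum_m l_jm u_mk) / l_jj, for k > j.
        for k in range(j + 1, n):
            LU[j, k] = (LU[j, k] - LU[j, :j] @ LU[:j, k]) / LU[j, j]
    return LU

def lu_solve(LU, b):
    """Solve L y = b (forward) and then U x = y (backward)."""
    n = LU.shape[0]
    b = np.asarray(b, dtype=float)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - LU[i, :i] @ y[:i]) / LU[i, i]
    x = y.copy()
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - LU[i, i + 1:] @ x[i + 1:]   # u_ii = 1
    return x

A = np.array([[4.0, 3.0, 2.0], [2.0, 1.0, 3.0], [3.0, 4.0, 1.0]])
b = A @ np.array([1.0, -2.0, 3.0])
LU = lu_inplace(A)
x = lu_solve(LU, b)
```

Only one n × n array is needed for both factors, which is the memory saving the paragraph above refers to.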

In most practical cases, such as the FEM, the coefficient matrix is normally symmetric when the problem is linear elastic. In this case, the LDLT method reduces the amount of calculation and the memory requirement, and simplifies the program design.

Let $A = LDU$, where $L$ is a unit lower triangular matrix, $D$ is a diagonal matrix and $U$ is a unit upper triangular matrix.

Because $A = A^T$, $LDU = U^T D L^T$. Based on the uniqueness of the decomposition of the matrix, we can safely conclude that $U = L^T$, so that $A = LDL^T$. Moreover, the decomposition is unique if $A$ is not singular. The elements of $D$ and $L$ are given by

\[ d_j = a_{jj} - \sum_{k=1}^{j-1} l_{jk}^2\, d_k \]

\[ l_{ij} = \frac{1}{d_j} \left( a_{ij} - \sum_{k=1}^{j-1} l_{ik}\, d_k\, l_{jk} \right), \qquad i > j \]
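A short sketch of the LDLT factorization and the associated solve is given below. It uses the standard column-by-column recurrences for a symmetric matrix without pivoting, so it assumes that no zero appears on the diagonal of D; the function names and the test matrix are illustrative assumptions.

```python
import numpy as np

def ldlt(A):
    """Factor a symmetric matrix as A = L D L^T.

    Returns the unit lower triangular L and the diagonal of D as a vector.
    Only the lower triangle of A is read.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        d[j] = A[j, j] - (L[j, :j] ** 2) @ d[:j]
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - (L[i, :j] * L[j, :j]) @ d[:j]) / d[j]
    return L, d

def ldlt_solve(L, d, b):
    """Solve L z = b, then D y = z, then L^T x = y."""
    n = len(d)
    z = np.array(b, dtype=float)
    for i in range(n):
        z[i] -= L[i, :i] @ z[:i]
    y = z / d
    x = y.copy()
    for i in range(n - 1, -1, -1):
        x[i] -= L[i + 1:, i] @ x[i + 1:]
    return x

A = np.array([[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]])
L, d = ldlt(A)
x = ldlt_solve(L, d, A @ np.array([1.0, 2.0, 3.0]))
```

Compared with a general LU factorization, only one triangular factor and one vector are computed, which is where the saving in work and memory comes from.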

3.1.2 Iterative Computing Method

Iterative computing methods, on the other hand, are more suitable for sparse matrix problems, although they can also be used for dense ones. The typical schemes in this category are the Jacobi method, the Gauss-Seidel method, successive over-relaxation (SOR) and, for symmetric matrices, the conjugate gradient method.

3.1.2.1 Jacobi method

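The Jacobi method is an iterative algorithm used to solve linear equations. The solution vector $x$ is sought where $Ax = b$.

Let $A = D + L + U$, where $D$, $L$ and $U$ represent the diagonal, strictly lower triangular and strictly upper triangular parts of the coefficient matrix $A$ respectively. Then the equation to be solved can be rewritten as

\[ Dx = b - (L + U)\,x \]

and

\[ x = D^{-1}\left[ b - (L + U)\,x \right] \]

If $a_{ii} \neq 0$ for each $i$, the Jacobi method can be expressed as

\[ x^{(k+1)} = D^{-1}\left[ b - (L + U)\,x^{(k)} \right] \]

where $k$ is the iteration count. The element-based form can be written as [16]

\[ x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j \neq i} a_{ij}\, x_j^{(k)} \right), \qquad i = 1, \dots, n \]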
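The Jacobi iteration can be sketched in a few lines. The stopping rule (maximum change between successive iterates below a tolerance), the default parameters and the test system are assumptions chosen for illustration.

```python
import numpy as np

def jacobi(A, b, tol=1e-10, max_iter=500):
    """Jacobi iteration for A x = b; returns (x, iterations used).

    Every component of the new iterate is computed from the previous
    iterate only, so all components can be updated independently.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    d = np.diag(A)                   # diagonal part D (must be non-zero)
    R = A - np.diag(d)               # off-diagonal part L + U
    x = np.zeros_like(b)
    for k in range(1, max_iter + 1):
        x_new = (b - R @ x) / d
        if np.linalg.norm(x_new - x, np.inf) < tol:
            return x_new, k
        x = x_new
    return x, max_iter

# Diagonally dominant test system with exact solution (1, 2, 3).
A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
b = np.array([2.0, 4.0, 10.0])
x, n_iter = jacobi(A, b)
```

Note that two vectors, the old and the new iterate, have to be kept during each sweep.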

3.1.2.2 Gauss-Seidel method

The Gauss-Seidel method is very similar to the Jacobi method described above. The Gauss-Seidel iteration is

\[ (D + L)\,x^{(k+1)} = b - U x^{(k)} \]

where $D$, $L$ and $U$ represent the diagonal, strictly lower triangular and strictly upper triangular parts of the coefficient matrix $A$ respectively. One explicit implementation of the Gauss-Seidel method is [17]

\[ x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j < i} a_{ij}\, x_j^{(k+1)} - \sum_{j > i} a_{ij}\, x_j^{(k)} \right), \qquad i = 1, \dots, n \]

Compared with the Jacobi method, the computation of $x_i^{(k+1)}$ in the Gauss-Seidel method always uses the most recently updated values. There is reason to believe that this can speed up the convergence. Moreover, the expression shows that once $x_i^{(k+1)}$ has been calculated, the value of $x_i^{(k)}$ is no longer needed. This may save memory at the program design stage, because $x_i^{(k)}$ can be overwritten as soon as the new value is obtained.
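The overwriting of old values by new ones can be seen directly in a small sketch of the Gauss-Seidel sweep. The stopping rule, default parameters and test system are assumptions made for illustration.

```python
import numpy as np

def gauss_seidel(A, b, tol=1e-10, max_iter=500):
    """Gauss-Seidel iteration for A x = b; returns (x, iterations used).

    A single vector x is updated in place, so each new component is
    used immediately by the components that follow it in the sweep.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(b)
    x = np.zeros(n)
    for k in range(1, max_iter + 1):
        change = 0.0
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            new = (b[i] - s) / A[i, i]
            change = max(change, abs(new - x[i]))
            x[i] = new               # overwrite the old value
        if change < tol:
            return x, k
    return x, max_iter

# Diagonally dominant test system with exact solution (1, 2, 3).
A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
b = np.array([2.0, 4.0, 10.0])
x, n_iter = gauss_seidel(A, b)
```

The inner loop over i cannot be reordered freely, which is the sequential dependency that makes this method harder to parallelize than the Jacobi method.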

3.1.2.3 Successive over-relaxation

Successive over-relaxation (SOR) is a numerical method originally used to speed up the convergence of the Gauss-Seidel method [18, 19]. The key point of SOR is the introduction of the relaxation factor $\omega$ and the correction $\Delta x^{(k)}$, where

\[ \Delta x^{(k)} = \tilde{x}^{(k+1)} - x^{(k)} \]

and $\tilde{x}^{(k+1)}$ is the value that the underlying iteration would produce from $x^{(k)}$. Thus the general iteration can be expressed as

\[ x^{(k+1)} = x^{(k)} + \omega\, \Delta x^{(k)} \]

where the right-hand side should be calculated according to eq. (3.18) instead of the normal iteration expressions.

For the Gauss-Seidel method, the SOR expression is

\[ x_i^{(k+1)} = (1 - \omega)\, x_i^{(k)} + \frac{\omega}{a_{ii}} \left( b_i - \sum_{j < i} a_{ij}\, x_j^{(k+1)} - \sum_{j > i} a_{ij}\, x_j^{(k)} \right) \]

Similarly, the iteration expression for the Jacobi method is

\[ x_i^{(k+1)} = (1 - \omega)\, x_i^{(k)} + \frac{\omega}{a_{ii}} \left( b_i - \sum_{j \neq i} a_{ij}\, x_j^{(k)} \right) \]

If the coefficient matrix is positive definite, the SOR iteration is convergent when $0 < \omega < 2$ [20].
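A minimal sketch of SOR built on the Gauss-Seidel sweep is shown below. The relaxation factor used in the example, the stopping rule and the test system are assumptions; with a relaxation factor of 1 the routine reduces to plain Gauss-Seidel.

```python
import numpy as np

def sor(A, b, omega=1.1, tol=1e-10, max_iter=500):
    """Gauss-Seidel based SOR for A x = b; returns (x, iterations used)."""
    if not 0.0 < omega < 2.0:
        raise ValueError("omega must lie in (0, 2)")
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(b)
    x = np.zeros(n)
    for k in range(1, max_iter + 1):
        change = 0.0
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            gs_value = (b[i] - s) / A[i, i]      # plain Gauss-Seidel value
            delta = omega * (gs_value - x[i])    # relaxed correction
            x[i] += delta
            change = max(change, abs(delta))
        if change < tol:
            return x, k
    return x, max_iter

# Symmetric positive definite test system with exact solution (1, 2, 3).
A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
b = np.array([2.0, 4.0, 10.0])
x, n_iter = sor(A, b, omega=1.1)
x_gs, _ = sor(A, b, omega=1.0)
```

The best value of the relaxation factor depends on the matrix, so in practice it is usually tuned or estimated rather than fixed in advance.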

3.1.2.4 Conjugate Gradient (CG) and Preconditioned Conjugate Gradient (PCG) Method

The Conjugate Gradient method is an effective method for symmetric positive

definite systems as shown in [21].

The method proceeds by generating vector sequences of iterates (i.e.,

successive approximations to the solution), residuals corresponding to the

iterates and search directions used in updating the iterates and residuals[22].

Under most circumstances, CG is used in combination with some kind of preconditioning [23]. The matrix $A$ is implicitly multiplied by $M^{-1}$, an approximation of $A^{-1}$, where the preconditioner $M$ is normally constructed to be an approximation of $A$ such that a system with $M$ is easier to solve than one with $A$; this reduces the condition number of the coefficient matrix. Jacobi preconditioning is often used, in which $M$ is the diagonal matrix formed from the diagonal elements of the matrix $A$.

The algorithm of Preconditioned Conjugate Gradient is as follows[24]:
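For orientation, the sketch below gives the standard textbook form of the conjugate gradient method with a Jacobi (diagonal) preconditioner. It is not claimed to be identical to the listing in [24]; the function name `pcg_jacobi`, the relative-residual stopping rule and the test system are assumptions.

```python
import numpy as np

def pcg_jacobi(A, b, tol=1e-10, max_iter=None):
    """Jacobi-preconditioned CG for symmetric positive definite A.

    Returns (x, iterations used). M = diag(A), so applying the
    preconditioner is an element-wise division.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n
    m_inv = 1.0 / np.diag(A)
    x = np.zeros(n)
    r = b - A @ x                    # residual
    z = m_inv * r                    # preconditioned residual, M z = r
    p = z.copy()                     # search direction
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for k in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k
        Ap = A @ p                   # the dominant cost per iteration
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = m_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Symmetric positive definite test system with exact solution (1, 2, 3).
A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
b = np.array([2.0, 4.0, 10.0])
x, n_iter = pcg_jacobi(A, b)
```

Each iteration needs one matrix-vector product, one preconditioner application and a few vector operations, which is why the matrix-vector product is the natural target for acceleration.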

3.2 Overview of Parallel Computation

Parallel computing is a process that uses several computational resources simultaneously to deal with a numerical problem [25]. There are two different aspects of parallel techniques: parallelism in the time domain, using for example pipelining, and parallelism in the spatial domain, using for example domain decomposition. Spatial parallelization seems to be more widely accepted as the definition of parallel computation.

In this research the main focus, for both the FEM and the DEM, is on processes that can speed up matrix-related calculations.

3.2.1 Matrix Partition

To deal with matrix-related problems, the first step is to divide the whole problem, normally the coefficient matrix $A$, into several sub-matrices for use in the following procedures. The two common ways to do so are striped partitioning, shown in [Figure 4] and [Figure 5], and checkerboard partitioning, shown in [Figure 6].

Figure 4 Striped partitioning examples of 16x16 matrixes

Figure 5 Striped row-major mapping of a 27x27 matrix on 3 processes

Figure 6 Checkerboard partitioning examples of a 16x16 matrix
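The two partitioning schemes can be illustrated with a few lines of array slicing. The sketch below covers block-striped and checkerboard partitioning only; the function names and the choice of four processes for a 16 × 16 matrix are assumptions for illustration.

```python
import numpy as np

def striped_rows(A, p):
    """Block-striped partitioning by rows into p stripes."""
    return np.array_split(A, p, axis=0)

def striped_cols(A, p):
    """Block-striped partitioning by columns into p stripes."""
    return np.array_split(A, p, axis=1)

def checkerboard(A, pr, pc):
    """Checkerboard partitioning into a pr x pc grid of blocks."""
    return [np.array_split(row_block, pc, axis=1)
            for row_block in np.array_split(A, pr, axis=0)]

A = np.arange(16 * 16, dtype=float).reshape(16, 16)
rows = striped_rows(A, 4)            # four 4 x 16 stripes
cols = striped_cols(A, 4)            # four 16 x 4 stripes
blocks = checkerboard(A, 4, 4)       # sixteen 4 x 4 blocks
```

Each stripe or block would be assigned to one process; reassembling the blocks in grid order recovers the original matrix.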

3.2.2 Matrix Multiplication

Let , and where . One simplest parallel

matrix multiplication method is described as in [Figure 7] where .

Figure 7 Example of the simple matrix multiplication

The pseudo code is

where , . The matrix shift one process forward per column

per loop[26].

There are also some sophisticated schemes for parallel matrix multiplication,

such as Cannon’s method[27], Fox’s method[28], DNS method[29] and

Systolic method[30].
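One possible reading of the simple scheme, in which one operand is shifted from process to process, is simulated serially below. It assumes that A is striped by rows and B by columns over p processes, and that after each step every process passes its column stripe of B on to its neighbour; the function name `ring_matmul` and the matrix sizes are hypothetical, and the loop over processes stands in for work that would run concurrently.

```python
import numpy as np

def ring_matmul(A, B, p):
    """Serial simulation of a striped matrix product with a ring shift."""
    A_rows = np.array_split(A, p, axis=0)        # process i owns A_rows[i]
    B_cols = np.array_split(B, p, axis=1)        # stripes of B that circulate
    edges = np.cumsum([0] + [blk.shape[1] for blk in B_cols])
    C_rows = [np.zeros((a.shape[0], B.shape[1])) for a in A_rows]

    held = list(range(p))            # held[i]: stripe of B held by process i
    for _ in range(p):
        for i in range(p):           # independent work, one unit per process
            j = held[i]
            C_rows[i][:, edges[j]:edges[j + 1]] = A_rows[i] @ B_cols[j]
        held = held[1:] + held[:1]   # every stripe moves to the next process
    return np.vstack(C_rows)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))
B = rng.standard_normal((5, 7))
C = ring_matmul(A, B, 3)
```

After p steps every process has seen every column stripe of B once and therefore holds a complete row stripe of the product.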

3.3 Paralleled Methods for Equation Solutions

Classic iterative matrix equation solution methods, such as Jacobi and Gauss-

Seidel, can also be modified to run in parallel.

3.3.1 Jacobi

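From the Jacobi iteration in section 3.1.2.1, let $G = -D^{-1}(L + U)$ and $c = D^{-1} b$, where $D$, $L$ and $U$ are the diagonal, strictly lower triangular and strictly upper triangular parts of the coefficient matrix. Then the basic process for each iteration is one matrix-vector multiplication followed by one vector addition:

\[ x^{(k+1)} = G\, x^{(k)} + c \]

The parallel algorithm for the Jacobi method then follows directly from the parallel matrix multiplication described in section 3.2.2.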

3.3.2 Gauss-Seidel

The Gauss-Seidel method calculates the elements one by one. It is a strictly sequential process, in that the calculation of each new element needs the newest results for all the preceding elements.

Let , the parallel algorithm of Gauss-Seidel

method can be expressed as[31]

where

The matrix multiplication in and are using parallel technique described in.

3.3.3 Successive over-relaxation

The parallel version of SOR depends on the method used to generate the SOR expression, as discussed in section 3.1.2.3. Generally speaking, the parallel methods described in sections 3.3.1 and 3.3.2 can be used to compute the underlying Jacobi or Gauss-Seidel update on the right-hand side of the SOR expression, while the remaining term is calculated simultaneously.

3.3.4 CG and PCG

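As the PCG algorithm shows, if the matrix $A$ or $M$ is not very sparse, most of the work is done in forming the matrix-vector product with $A$ or in solving the system with the preconditioner $M$. This is where parallelism is most beneficial.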

Rewritten the preconditioner , the parallel version of PCG is[24]:

All compute intensive operations can be done in parallel.

3.4 Summary and Outlook

In this section, several classical methods for the solution of linear matrix equations have been discussed, together with their parallel forms. There is another group of solution schemes that work on the local stiffness matrix generated in the FEM for each element instead of dealing with the whole global stiffness matrix. This could increase the efficiency of the classical methods, especially for the FEM. More research will be done on these element-by-element methods.

4. Case Studies

FPGA-based implementations of FEM have been reported in several fields [32, 33]. These efforts established common steps that can be followed when dealing with FEM applications.

4.1 Beam model

The beam is one of the simplest models in FEM. It was chosen as the first problem to work on, in order to build a common structure for future work.

Based on the principles introduced in section 2.1.1, the work was divided into two parts, software and hardware. The software version was written as MATLAB m-files, which provide convenient matrix calculation functions. The whole work includes four stages: Data Initialization, Equation Matrix Solution, Position Calculation and Result Plotting. The most important of these four is Equation Matrix Solution. In the first instance the LDLT scheme was used, which can be seen as an extension of the LU method that is suitable only for symmetric matrices. Since the FEM stiffness matrix is symmetric, it was considered the best fit for the FEM. The expressions were given with the direct solution process in section 3.1.1. The only problem with the LDLT scheme is its sequential nature, which could make it complex and costly in resources when moved to hardware. The second approach to solving the equations was the Jacobi iterative method. This scheme has been widely used on microprocessors because it does not require many resources and has good efficiency. Some attempts have also been made on the FPGA platform [34]. In any case, the software part is only the first stage: it makes sure that the whole design runs correctly at the arithmetic level, and it provides a result for the given problem against which the hardware-generated result can be compared [Figure 8].

Figure 8 User Interface of the software for beam model with the left end fixed and right end loaded

The hardware part was written in VHDL generated by System Generator. Following the same structure as the software implementation, the program was organized as shown in [Figure 9].

Figure 9 Four blocks of the Hardware

Parallel techniques were used in the blocks implemented on the FPGA.

As discussed in section 3, the Jacobi method was the first to be considered for the hardware design because of its iterative nature. Based on the Jacobi iteration equation, the idealized block diagram is shown in [Figure 10]. The first pipeline stage is multiplication: each row of the coefficient matrix feeds the multiplier leaf nodes, with the current solution vector at the other input simultaneously. After the multiplication, the output of each multiplier is routed to an accumulator to obtain the sum of products for that row. The third stage in the pipeline subtracts this sum from the corresponding right-hand-side entry, and the last stage is the division by the diagonal coefficient.

Figure 10 Idealized Jacobi block diagram[35]

In the actual design, division is normally a time-consuming operation in hardware, so it should be performed once at the beginning to obtain the reciprocal of each diagonal coefficient. These values are stored for later use. That is to say, the last stage of the pipeline can be changed to a simple multiplication, and the division is hidden behind the three stages described earlier. Another way to avoid performing division in hardware is to add one more group of input values containing the quotients precomputed in software. This second solution is the one used in the current design.
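The four pipeline stages, with the division replaced by a multiplication, can be sketched in software as follows. This is only a behavioural model under the assumption that the system is written as A x = b; the names `jacobi_step` and `inv_diag` are hypothetical and do not come from the VHDL design.

```python
def jacobi_step(A, b, x, inv_diag):
    """One Jacobi iteration, written to mirror the pipeline stages."""
    n = len(b)
    x_new = [0.0] * n
    for i in range(n):  # rows are independent of each other
        products = [A[i][j] * x[j] for j in range(n) if j != i]  # stage 1: multiply
        acc = sum(products)                                      # stage 2: accumulate
        diff = b[i] - acc                                        # stage 3: subtract
        x_new[i] = diff * inv_diag[i]                            # stage 4: multiply, no division
    return x_new


def jacobi(A, b, tol=1e-10, max_iter=1000):
    """Iterate until the largest change between iterates is below tol."""
    n = len(b)
    # Reciprocals are computed once, as if supplied by the software side.
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    for k in range(1, max_iter + 1):
        x_new = jacobi_step(A, b, x, inv_diag)
        if max(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new, k
        x = x_new
    return x, max_iter


# Diagonally dominant test system, chosen only for illustration.
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]
x, iterations = jacobi(A, b)
```

Because the reciprocals are fixed for a given matrix, passing them in as an extra input vector costs one extra transfer but removes the divider from the FPGA entirely.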

Table 1 Design parameters of Jacobi method

Parameter Type Description

integer Input vector length

integer Multiplier latency

integer Adder latency

integer Data path width

integer Reduction vector length

Simply increasing the size of the binary reduction tree to deal with larger matrices is unwise, because the resources on an FPGA are limited even though a larger board can normally be found. This way of using hardware resources is inefficient, and it will definitely fail once the problem size increases beyond a certain point. A better approach is to split the whole problem into several sub-matrices, with the largest data path width as the upper bound on the sub-block size, whenever the input vector is longer than that width [35].
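The splitting idea can be sketched as a row product computed in sub-blocks, each reduced by a binary adder tree, followed by a final reduction of the partial sums. This is a software model only; `width` stands for the largest data path width and is an assumed name, as are the two function names.

```python
def tree_reduce(values):
    """Sum values with a binary adder tree; return (sum, number of adder levels)."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)  # pad an odd level, like an idle adder input
        level = [level[k] + level[k + 1] for k in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


def blocked_row_product(row, x, width):
    """Dot product of row and x using sub-blocks no wider than width."""
    partial_sums = []
    for start in range(0, len(row), width):
        products = [a * v for a, v in zip(row[start:start + width],
                                          x[start:start + width])]
        block_sum, _ = tree_reduce(products)
        partial_sums.append(block_sum)
    total, _ = tree_reduce(partial_sums)  # reduce the sub-block results
    return total, len(partial_sums)


row = [float(k) for k in range(1, 11)]   # ten illustrative coefficients
x = [1.0] * 10
total, blocks = blocked_row_product(row, x, width=4)
```

With a width of 4 the ten-element row is handled as three sub-blocks, the last one only half full, which hints at why sizes that fill every sub-block exactly make the best use of the tree.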

The time for get one value of should be

where is the time cost in each sub binary reduction tree, is

the number of sub blocks and is the time for subtraction by and the

addition of the . is the time cost of the reduction tree for all the result

generated from sub blocks that

And the time cost by comparison of the condition of convergence is

It is clear that when and are any power of while is any multiple of ,

the whole system will have the highest efficiency.

Using the 32-bit fixed-point cores in System Generator on the RC2000-V2 board (Virtex-II xc2v6000-4ff1152), the time and resource costs are listed below.

Table 2 Available resource on Virtex-II xc2v6000 board[36]

Table 3 Core statistics

Core       Latency (clock cycles)   Resource cost
                                    Slices   FFs    LUTs
Multiply                            587      1104   1119
Add                                 17       0      33

If a problem has , the total resource cost is BRAMs = 73 and Slices = 10642.

Table 4 FPGA resource utilization

Slices   %      BRAMs   %
10642    31.5   73      50.7
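The percentages in Table 4 can be checked with a short calculation. The device totals used here (33792 slices and 144 block RAMs for the xc2v6000) are assumed from the Virtex-II data sheet rather than taken from this report, although they reproduce the listed percentages.

```python
XC2V6000_SLICES = 33792  # assumed device total, not stated in this report
XC2V6000_BRAMS = 144     # assumed device total, not stated in this report


def utilization(used, available):
    """Percentage of a resource in use, rounded to one decimal place."""
    return round(100.0 * used / available, 1)


slice_pct = utilization(10642, XC2V6000_SLICES)
bram_pct = utilization(73, XC2V6000_BRAMS)
```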

The time cost by this problem per iteration should be

336 2592 20544

559 4548 38805

The time cost of the Positioning part of the program was 28 clock cycles per

element and the hardware cost was 15604 Slices, 28420 FFs and 29354

LUTs as discussed in the three month report [37].

With the problem size for an coefficient matrix of single precision

elements, Jacobi iteration solution on software will cost times of division,

times of multiplication and times of addition according to Eqn.

And the time cost by comparison of the condition of convergence is

times of addition and times of multiplication. Thus, the time cost of the software calculation grows as a polynomial (roughly quadratic) function of the problem size:
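How the software cost scales can be illustrated with a rough operation count. The counting convention below is an assumption made for this sketch (a dense n by n matrix, and a convergence check based on the sum of squared differences); the exact counts in the report's expression may differ.

```python
def jacobi_update_ops(n):
    """Operations for one dense Jacobi update of all n unknowns."""
    return {
        "div": n,            # one division by the diagonal per row
        "mul": n * (n - 1),  # off-diagonal products
        "add": n * (n - 1),  # n - 2 additions plus one subtraction per row
    }


def convergence_check_ops(n):
    """Operations for a sum-of-squared-differences convergence check."""
    return {
        "mul": n,            # squaring each difference
        "add": 2 * n - 1,    # n subtractions plus n - 1 additions
    }


def total_ops(n):
    """Total arithmetic operations per software iteration."""
    return (sum(jacobi_update_ops(n).values())
            + sum(convergence_check_ops(n).values()))


growth = total_ops(200) / total_ops(100)  # close to 4 when n doubles
```

Under this convention the total is 2n^2 + 2n - 1 operations per iteration, so doubling the problem size roughly quadruples the software time.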

For hardware, once the structure is defined, the time cost per iteration of that particular design is fixed. The system reaches its peak efficiency when it is fully loaded.

Figure 11 Time Cost Comparison between Hardware and Software

4.2 Space Frame Model

As discussed in section 2.1.2, the space frame model is much more complicated than the beam model. The problem cannot be solved as easily as “pump the data in and wait for the result”, as it is in the beam model. The work for the space frame model was divided into two stages: structure design and hardware generation. The work done so far has been limited to the design stage for the hardware.

4.2.1 Optimization and balance

In the structure design stage, many points should be considered in order to find a solution which can generate the result of the problem in a reasonable time with an acceptable cost in computation resources.

The input data normally contain these components:

(1) Size of problem. The size of the space frame problem is defined by six parameters: the number of elements (nelem), the number of points (npoin), the number of outputs (noutp), the number of nodes (nnode), the number of dimensions (ndime) and the degrees of freedom (ndofn).

(2) Element connectivity Matrix lnods[nnode][nelem]. This matrix describes

how those elements link together.

(3) Matrix nboun[ndofn][npoin]. This matrix contains the boundary condition information, which gives the property of each nodal point at the ends of the elements.

(4) Matrix presc[ndofn][npoin]. This matrix contains the value of the force or non-zero boundary condition applied at each nodal point.

(5) Matrix coord[ndime][npoin]. As its name shows, the number in each row of

the matrix gives the coordinates of each nodal point.

(6) Vector d[7]. This vector contains the material parameters of the element.
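The input components listed above could be grouped as in the following sketch. The array names and dimensions follow the list; the container class, the shape check and the example values are hypothetical, and the interpretation of nnode as the number of nodes per element is an assumption based on the shape of lnods.

```python
from dataclasses import dataclass
from typing import List


def has_shape(matrix, rows, cols):
    """True if matrix is a rows x cols list of lists."""
    return len(matrix) == rows and all(len(r) == cols for r in matrix)


@dataclass
class SpaceFrameInput:
    nelem: int                # number of elements
    npoin: int                # number of points
    noutp: int                # number of outputs
    nnode: int                # number of nodes (per element, assumed)
    ndime: int                # number of dimensions
    ndofn: int                # degrees of freedom
    lnods: List[List[int]]    # [nnode][nelem] element connectivity
    nboun: List[List[int]]    # [ndofn][npoin] boundary condition codes
    presc: List[List[float]]  # [ndofn][npoin] forces or prescribed values
    coord: List[List[float]]  # [ndime][npoin] nodal coordinates
    d: List[float]            # 7 material parameters

    def check_shapes(self):
        """Raise ValueError if any array disagrees with the size parameters."""
        expected = {
            "lnods": (self.nnode, self.nelem),
            "nboun": (self.ndofn, self.npoin),
            "presc": (self.ndofn, self.npoin),
            "coord": (self.ndime, self.npoin),
        }
        for name, (rows, cols) in expected.items():
            if not has_shape(getattr(self, name), rows, cols):
                raise ValueError(f"{name} should be {rows} x {cols}")
        if len(self.d) != 7:
            raise ValueError("d should hold 7 material parameters")


# One element joining two points; all values are made up for illustration.
example = SpaceFrameInput(
    nelem=1, npoin=2, noutp=1, nnode=2, ndime=3, ndofn=3,
    lnods=[[0], [1]],
    nboun=[[1, 0], [1, 0], [1, 0]],
    presc=[[0.0, 0.0], [0.0, -10.0], [0.0, 0.0]],
    coord=[[0.0, 1.0], [0.0, 0.0], [0.0, 0.0]],
    d=[0.0] * 7,
)
example.check_shapes()
```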

4.2.1.1 Software or Hardware

In general, the whole design follows the basic structure suggested in [Figure 9], but the boundary between the software and hardware parts of the work needs to be clarified. After all the input values mentioned above have been read in, the procedures shown in [Figure 12] are carried out:

Figure 12 Process to calculate space frame element

(1) Process the boundary conditions, where “0” denotes free ends, “1” fixed ends and “-1” tied nodes;

(2) Create the coordinate array for each element;

(3) Generate the local stiffness matrix for each element;

(4) Assemble the global stiffness matrix;

(5) Generate the force vector;

(6) Calculate the final result.
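Procedure (4) in the list above can be sketched as a scatter-add of each local matrix into the global one. The sketch assumes 0-based node numbers, the lnods[nnode][nelem] layout described earlier and equation numbers ordered node by node; the function name and the two-spring example are illustrative only.

```python
def assemble_global(local_matrices, lnods, npoin, ndofn):
    """Add every element's local stiffness matrix into the global matrix."""
    nnode = len(lnods)
    nelem = len(lnods[0])
    size = npoin * ndofn
    K = [[0.0] * size for _ in range(size)]
    for e in range(nelem):
        # Global equation number of each local degree of freedom.
        eqs = [lnods[a][e] * ndofn + dof
               for a in range(nnode) for dof in range(ndofn)]
        ke = local_matrices[e]
        for r, global_row in enumerate(eqs):
            for c, global_col in enumerate(eqs):
                K[global_row][global_col] += ke[r][c]
    return K


# Two springs in series (one degree of freedom per node), for illustration.
lnods = [[0, 1],   # first node of element 0 and of element 1
         [1, 2]]   # second node of element 0 and of element 1
k1 = [[1.0, -1.0], [-1.0, 1.0]]
k2 = [[2.0, -2.0], [-2.0, 2.0]]
K = assemble_global([k1, k2], lnods, npoin=3, ndofn=1)
```

The shared node receives contributions from both elements, so the assembled matrix is symmetric and banded around the diagonal.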

Generally, the software part, which was written in ANSI-C, controls the data input and output and performs some simple processing that suits its sequential nature, since a PC processor can reach a very high instruction rate. Looking back at [Figure 12], procedure (1), which generates the new boundary condition matrix, is a sequential process that is more suitably handled on the PC side. Similarly, procedures (2) and (5) should run on the PC side as well. Procedure (3) is a repeated process that can be sped up by sharing the work among multiple instances, so it is better placed on the FPGA side. Procedure (6) is the solution of the matrix equations and should definitely be placed on the FPGA side as well. Communication between the PC and the FPGA is very time-consuming, so moving procedure (4) back to the PC would be unreasonable. It can therefore be concluded that the red zone in [Figure 12] is the extent of the hardware process for the space frame model.

4.2.1.2 Data storage

The key issue of data storage is to find the balance between time consumption and resource costs. Normally, a highly compressed storage strategy results in complex access procedures, which could cost more time. It can be seen as a compromise between time and storage.

In common structural computing problems, the total stiffness matrix is normally quite large and would need a large amount of memory to store. There are several ways to save these data. The easiest is the full-matrix method, which saves all of the elements of the stiffness matrix. In fact, this method is wasteful under most circumstances and is rarely used in practical cases.

Two storage schemes are frequently used in program design, especially for sparse matrices: two-dimensional constant bandwidth storage and one-dimensional variable bandwidth storage.

Because the total stiffness matrix is symmetric, only half of the data in the matrix needs to be saved. Furthermore, the non-zero elements in the matrix are usually concentrated in the area near the diagonal. Thus, those elements can be stored in a rectangular matrix whose number of rows equals the dimension of the global stiffness matrix and whose number of columns is the half-bandwidth. Returning to the space frame element, there are three degrees of freedom per node, so that

When the largest difference between the equation numbers of each frame

element is

shows the matchup between the original stiffness matrix and the modified

new matrix .

If there is one element in matrix is , this element in matrix should be

. The ratio for the cost of memory resource in space frame method is

It is clear that, for efficient storage, the difference between the equation numbers within each frame element should be small.
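A minimal sketch of two-dimensional constant bandwidth storage is given below. It assumes 0-based indices, storage of the upper half-band so that element (i, j) with j >= i is kept at position (i, j - i), and a half-bandwidth of (largest node number difference + 1) times the degrees of freedom per node. These conventions and all function names are assumptions and may differ from the mapping used in the report.

```python
def half_bandwidth(lnods, ndofn):
    """Half-bandwidth implied by the largest node number difference in any element."""
    nnode = len(lnods)
    nelem = len(lnods[0])
    max_diff = max(
        max(lnods[a][e] for a in range(nnode))
        - min(lnods[a][e] for a in range(nnode))
        for e in range(nelem)
    )
    return (max_diff + 1) * ndofn


def to_band(K, hbw):
    """Copy the upper half-band of symmetric K into an n x hbw array."""
    n = len(K)
    B = [[0.0] * hbw for _ in range(n)]
    for i in range(n):
        for j in range(i, min(i + hbw, n)):
            B[i][j - i] = K[i][j]
    return B


def band_get(B, i, j):
    """Read element (i, j) of the original matrix from band storage."""
    if i > j:
        i, j = j, i  # use symmetry for the lower triangle
    offset = j - i
    return B[i][offset] if offset < len(B[0]) else 0.0


# Illustrative 3 x 3 stiffness matrix of two springs in series.
K = [[1.0, -1.0, 0.0],
     [-1.0, 3.0, -2.0],
     [0.0, -2.0, 2.0]]
lnods = [[0, 1], [1, 2]]
hbw = half_bandwidth(lnods, ndofn=1)
B = to_band(K, hbw)
storage_ratio = hbw / len(K)  # band storage relative to the full matrix
```

The storage ratio is the half-bandwidth divided by the matrix dimension, so numbering the nodes to keep the differences within each element small directly reduces the memory needed.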

The one-dimensional variable bandwidth scheme is much more complex than the two-dimensional constant bandwidth method and will not be discussed further here.

The choice between the full-matrix method and the two-dimensional constant bandwidth method depends on the size of the problem. If the problem is small enough that the memory space is sufficient for the whole process, the full-matrix storage method could provide better performance in terms of time. Otherwise, if the memory space becomes the bottleneck of the problem, the band storage method should be the better choice.

4.2.2 Design for parallel solution of matrix equations

The Jacobi method often converges rather slowly compared with more sophisticated methods. As discussed in section 3, the SOR method should be a better solution for speeding up convergence. In this design, the successive over-relaxation of Jacobi (JSR) is used. According to the SOR expression, the structure for JSR needs to change only slightly from [Figure 10].

The main problem for an efficient JSR is finding a proper relaxation factor. This will be done in future work.
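The slight change from the Jacobi structure can be sketched as one extra weighted combination per element. The sketch assumes the common form in which the new value is (1 - omega) times the old value plus omega times the Jacobi value; the symbol omega, the function names and the test system are assumptions, and the report's JSR expression may be written differently.

```python
def jsr_step(A, b, x, inv_diag, omega):
    """One relaxed Jacobi iteration with relaxation factor omega."""
    n = len(b)
    x_new = []
    for i in range(n):
        acc = sum(A[i][j] * x[j] for j in range(n) if j != i)
        jacobi_value = (b[i] - acc) * inv_diag[i]
        # The only addition to the plain Jacobi pipeline is this weighting.
        x_new.append((1.0 - omega) * x[i] + omega * jacobi_value)
    return x_new


def jsr(A, b, omega, tol=1e-10, max_iter=2000):
    """Iterate until the largest change between iterates is below tol."""
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    for k in range(1, max_iter + 1):
        x_new = jsr_step(A, b, x, inv_diag, omega)
        if max(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new, k
        x = x_new
    return x, max_iter


# Illustrative system; omega = 1.0 reduces to the plain Jacobi method.
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]
x_relaxed, iterations_relaxed = jsr(A, b, omega=0.9)
x_plain, iterations_plain = jsr(A, b, omega=1.0)
```

Running the sketch with several values of omega and comparing the iteration counts is one simple way to explore the choice of relaxation factor for a given matrix.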

5. Conclusion

5.1 Summary of the report

This nine-month progress report has presented a general overview of the combined finite-discrete element method and then gone through the fundamental principles of both the finite element method and parallel computing processes. These background studies resulted in a basic structure for dealing with FEM problems. Moreover, two case studies on finite element models have been introduced.

5.2 Work plan for next nine months

The following points summarise the suggested work to be done in the next

period (10 to 18 working months):

Continue the current case study on the space frame element for FEM. The hardware programming is still in progress. A more detailed numerical analysis of the time and computing resource costs mentioned in this report should be performed.

Explore and adapt methods for efficient data transfer management between the PC and the FPGA. Data transfer is time-consuming in comparison with the actual calculation. As the size of the problem grows, it becomes impossible to store all the required data on the hardware at the beginning, so the control of data transfer should be taken into consideration in the overall design.

Background study on the discrete element method. The work done so far has mostly focused on the finite element method, so more attention should be given to DEM in the following work.

Case study on the discrete element method.

The Gantt chart in [Figure 13] shows the time plan for each aspect.

5.3 Outline plan to PhD submission

If the proposed work is completed smoothly, the expected date for submission of the final thesis is between September 2010 and January 2011.

5.4 Publication plan

Two related conferences are under consideration for my second year of study:

1) International Conference on Field Programmable Logic and Applications 2009

2) International Conference on Field-Programmable Technology 2009

The article will focus on the feasibility of real-time solution of FDEM on FPGAs, based on several small case studies. It will also compare the time efficiency of FPGAs and common supercomputers on FDEM problems.

Figure 13 Gantt chart for 10-18 month plan

Appendix

Reference
