
Page 1: The Chances and Challenges of Parallelism · 2007-01-23 · Robert Strzodka, Stanford University, Max Planck Center · Comparison of Hardwired (GPU) and Reconfigurable (FPGA) Devices

Robert Strzodka, Stanford University, Max Planck Center

The Chances and Challenges of Parallelism

Comparison of Hardwired (GPU) and Reconfigurable (FPGA) Devices

[Figure: Normalized CPU (double) and CPU-GPU (mixed precision) execution time; seconds per grid node vs. domain size in grid nodes (10^3 to 10^7); curves: 1x1 CG Opteron 250, 1x1 CG GF7800GTX, 2x2 MG-MG Opteron 250, 2x2 MG-MG GF7800GTX]

[Figure: Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG); number of slices (0 to 1600) vs. bits of mantissa (20 to 50); curves: Adder, Multiplier, CG kernel normalized (1/30)]

Page 2

The Chances

  • GPU: 249 GFLOPS single precision, 166 GB/s internal bandwidth, 51.2 GB/s external bandwidth

  • FPGA: 192 mad25x18 at 550 MHz + logic, almost unrestricted internal bandwidth, 120.0 GB/s external bandwidth (for all IO pins)

  • Clearspeed: 50 GFLOPS double precision, 200.0 GB/s internal bandwidth, 6.4 GB/s external bandwidth

  • Cell BE: 230 GFLOPS single precision, 21 GFLOPS double precision, 204.8 GB/s internal bandwidth, 25.6 GB/s external bandwidth
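These peak numbers can be condensed into a machine balance, the number of operations a kernel must perform per byte of external traffic to stay compute-bound. A minimal sketch using the GPU and Cell BE figures from this slide (the helper name is illustrative):

```cpp
#include <cassert>

// Machine balance: peak arithmetic throughput divided by external
// bandwidth, i.e. how many operations a kernel must perform per byte
// transferred to keep the device busy. A kernel whose arithmetic
// intensity lies below this balance is bandwidth-bound.
double balance(double peak_gflops, double ext_bandwidth_gb_s) {
    return peak_gflops / ext_bandwidth_gb_s;  // operations per byte
}
```

With the figures above, the GPU needs roughly 4.9 ops/byte (249/51.2) and the Cell BE roughly 9 ops/byte (230/25.6) before computation, not bandwidth, becomes the limit.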

Page 3

The Challenges

• Computing Paradigms

• Parallel Programming

• Precision and Accuracy

• Algorithmic Optimization

• Large Range Scaling

Page 4

Instruction-Stream-Based Processing

[Diagram: a processor is driven by a stream of instructions (Software) and exchanges data with memory through a cache (Hardware)]

Page 5

Data-Stream-Based Processing

[Diagram: a configuration (Configware) sets up a pipeline in the processor (Hardware/Morphware); data streams (Flowware) flow from memory through the pipeline back to memory]

Nomenclature from [Reiner Hartenstein. Data-stream-based computing: Models and architectural resources, MIDEM 2003]

Page 6

The Challenges

• Computing Paradigms

• Parallel Programming

• Precision and Accuracy

• Algorithmic Optimization

• Large Range Scaling

Page 7

PDE Example: The Poisson Problem

Given a function b: Ω → ℝ, find u: Ω → ℝ such that

  −Δu = b   inside the domain Ω,
  u = 0     on the boundary ∂Ω.

In 2D the Laplace operator is given as Δu(x,y) = ∂²u/∂x² + ∂²u/∂y².

[Figure: domain Ω with boundary ∂Ω and the solution u(x,y)]

Page 8

PDE Example: Discretization and Solvers

After discretization the Poisson problem −Δu = b becomes a linear equation system A u = b.

For small systems, Au = b is typically solved with an LU decomposition:

  PA = LU,   L y = P b,   U u = y.

For large systems, Au = b is typically solved with iterative schemes:

  u^0 := initial guess,   u^(k+1) := u^k + G(u^k).

We obtain a convergent series: u^0, u^1, u^2, … → u* as k → ∞.
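As a concrete instance of the iterative scheme u^(k+1) := u^k + G(u^k), here is a minimal Jacobi sketch for the 1D Poisson problem on a uniform grid with zero boundary values (the grid size, right-hand side and iteration count are illustrative choices, not from the slides):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One Jacobi step for -u'' = b on a uniform 1D grid with spacing h and
// homogeneous boundary values u[0] = u[n-1] = 0:
//   u_new[i] = (h^2 * b[i] + u[i-1] + u[i+1]) / 2
// This has the generic form u^(k+1) = u^k + G(u^k) with
// G(u) = D^(-1)(b - A u), D the diagonal of the discrete Laplacian A.
std::vector<double> jacobi_step(const std::vector<double>& u,
                                const std::vector<double>& b, double h) {
    std::vector<double> un(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        un[i] = (h * h * b[i] + u[i - 1] + u[i + 1]) / 2.0;
    return un;
}

// Iterate a fixed number of steps; returns the approximate solution.
std::vector<double> solve_poisson_1d(const std::vector<double>& b,
                                     double h, int max_iter = 10000) {
    std::vector<double> u(b.size(), 0.0);
    for (int k = 0; k < max_iter; ++k) u = jacobi_step(u, b, h);
    return u;
}
```

For b ≡ 2 on [0,1] the exact solution is u(x) = x(1−x), which the 3-point stencil reproduces exactly at the grid nodes.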

Page 9

Matrix Vector Product as Stencil Operation

[Figure: grid values at step n are combined by a local stencil into step n+1]

  v_α^(n+1) = F_h( (v_β^n)_{|β−α|≤C} ) = Σ_{β: |β−α|≤C} A_{α,β} · v_β^n

Page 10

Maths: Banded Matrix Vector Product r = Av

  • Configware (the per-component kernel)

    r_α = F(A_{α,·}, v) = Σ_β A_{α,β} · v_β

  • Flowware (the data flow over all components)

    r = Av,   r, v ∈ ℝ^(WIDTH·HEIGHT),   A ∈ ℝ^((WIDTH·HEIGHT)·(WIDTH·HEIGHT)),
    A_{α,β} ≠ 0 only for the 3·3 band offsets β₀, β₁, β₂, …

    r_α = A_{α,β₀} v_{β₀} + A_{α,β₁} v_{β₁} + A_{α,β₂} v_{β₂} + …
    r_{α₀} = F(A_{α₀,·}, v),   r_{α₁} = F(A_{α₁,·}, v),   r_{α₂} = F(A_{α₂,·}, v), …

Page 11

CPU: Banded Matrix Vector Product r = Av

• Configware in C/C++

  float kernel( float v[HEIGHT][WIDTH], float A[HEIGHT][WIDTH][3][3],
                int x, int y ) {
    float r= 0;
    for( int yo= -1; yo <= 1; yo++ ) {
      for( int xo= -1; xo <= 1; xo++ ) {
        r+= A[y][x][yo+1][xo+1] * v[y+yo][x+xo];
      }
    }
    return r;
  }

• Flowware in C/C++

  extern float A[HEIGHT][WIDTH][3][3];
  extern float r[HEIGHT][WIDTH], v[HEIGHT][WIDTH];

  for( int y= 0; y < HEIGHT; y++ ) {
    for( int x= 0; x < WIDTH; x++ ) {
      r[y][x]= kernel( v, A, x, y );
    }
  }

Page 12

GPU: Banded Matrix Vector Product r = Av

• Configware in Cg (high level language for GPUs)

  float kernel( array2d v, array2d Al, array2d Ac, array2d Au,
                float2 xy : WPOS ) : COLOR {
    float r= 0;
    array2d A[3]= { Al, Ac, Au };
    for( int yo= -1; yo <= 1; yo++ ) {
      for( int xo= -1; xo <= 1; xo++ ) {
        r+= arr2d(A[yo+1],xy)[xo+1] * arr2d(v,xy+float2(xo,yo));
      }
    }
    return r;
  }

• Flowware in C++

  // load configware to the GPU, define names for arrays, then initialize
  // enum EnumArr { ARR_r, ARR_v, ARR_Al, ARR_Ac, ARR_Au, ARR_NUM };
  for( int i= 0; i < ARR_NUM; i++ ) {
    GPUArr* arr= new GPUArr( "Array name", (i<=ARR_v)? 1 : 3 );
    arr->Initialize(WIDTH, HEIGHT);
    arrP.push_back(arr);
  }
  // ...
  SciGPU::op( ARR_r, VP_ID, FP_MAT_VEC, ARR_v, ARR_Al, ARR_Ac, ARR_Au );

Page 13

FPGA: Banded Matrix Vector Product r = Av

• Configware in ASC (high level language for FPGAs)

  void kernel() {
    HWfloatFormat(32, 24, SIGNMAGNITUDE);
    Arch(OUT); IOtype<float> r_out;      Arch(TMP); HWfloat r;
    Arch(IN);  IOtype<float> v_in;       Arch(TMP); HWfloat v;
    Arch(IN);  IOtype<float> A_in[3][3]; Arch(TMP); HWfloat A[3][3];
    v= v_in; r= 0;
    UNROLL_LOOP( int yo= 0; yo < 3; yo++ ) {
      UNROLL_LOOP( int xo= 0; xo < 3; xo++ ) {
        A[yo][xo]= A_in[yo][xo];
        r+= A[yo][xo] * prev(v, yo*WIDTH+xo);
      }
    }
    r_out= r;
  }

• Flowware in C++
  – The FPGA will use the same framework as the GPU.
  – Object-orientation: one interface, different implementations.
  – In development.

[Oskar Mencer: ASC, A Stream Compiler for Computing with FPGAs, IEEE Trans. CAD, 2006]

Page 14

GPU Programming

[Diagram: software stack]
• Application, e.g. in C/C++, Java, Fortran, Perl
• GPU library: hides the graphics-specific details
• Shader programs, e.g. in HLSL, GLSL, Cg
• Window manager, e.g. GLUT, Qt, Win32, Motif
• Graphics API, e.g. OpenGL, DirectX
• Operating system, e.g. Windows, Unix, Linux, MacOS
• Graphics hardware, e.g. Radeon (ATI), GeForce (NV)

Flowware and Configware mark the two programming layers in the diagram.

Page 15

FPGA Programming

ASC bridges the VLSI CAD Productivity Gap with a Software Approach to Hardware Generation

• The traditional hardware design process (system level model, behavioral synthesis, RTL / libraries, logic synthesis) is vertically fragmented across many companies, file formats, etc.; this is the major culprit for the productivity gap.
• ASC spans architecture generation, module generation and the gate level (PamDC), combined with a parallelizing compiler or manual optimization.
• Very high performance: the programmer has easy access to the design on all levels of abstraction.
• Easy to use: C++ syntax with custom types; e.g. the most comprehensive floating point library available today (>200 different units) was created in 2 months!

slide courtesy of Oskar Mencer

Page 16

The Challenges

• Computing Paradigms

• Parallel Programming

• Precision and Accuracy

• Algorithmic Optimization

• Large Range Scaling

Page 17

The Erratic Roundoff Error

[Figure: Roundoff error for 0 = f(a) := |(1+a)^3 − (1+3a^2) − (3a+a^3)|; x = log2(1/a) with a = 1/2^x, y = log2(f(a)) with 0 → 2^−100; curves: single precision, double precision; smaller is better]
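The experiment behind the plot is easy to reproduce; a minimal sketch of the plotted function (mathematically f(a) = 0 for every a, so any nonzero result is pure roundoff):

```cpp
#include <cassert>
#include <cmath>

// f(a) = |(1+a)^3 - (1+3a^2) - (3a+a^3)| is identically zero in exact
// arithmetic; in floating point only cancellation noise remains, so
// evaluating f exposes the roundoff error of the chosen precision.
template <typename T>
T f(T a) {
    T cube = (T(1) + a) * (T(1) + a) * (T(1) + a);
    return std::fabs(cube - (T(1) + T(3) * a * a) - (T(3) * a + a * a * a));
}
```

Evaluating f<float> and f<double> over a = 1/2^x reproduces the two erratic curves of the figure.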

Page 18

Precision and Accuracy

• There is no monotonic relation between the computational precision and the accuracy of the final result.

• Increasing precision can decrease accuracy !

• The increase or decrease of precision in different parts of a computation can have very different impact on the accuracy.

• The above can be exploited to significantly reduce the precision in parts of a computation without a loss in accuracy.

• We obtain a mixed precision method.

Page 19

Resource Consumption for Integer Operations

  Operation                    Area      Latency
  min(r,0), max(r,0)           b+1       2
  add(r1,r2), sub(r1,r2)       2b        b
  add(r1,r2,r3) → add(r4,r5)   2b        1
  mult(r1,r2), sqr(r)          b(b-2)    b log(b)
  sqrt(r)                      2c(c-5)   c(c+3)

b: bitlength of argument, c: bitlength of result
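The area formulas in the table already explain the FPGA area plots that follow: adder area grows linearly in the bitlength, multiplier area quadratically. A tiny sketch evaluating the table's formulas (function names are illustrative):

```cpp
#include <cassert>

// Area formulas from the table: adders/subtractors grow linearly in
// the argument bitlength b, multipliers quadratically. This is why
// trimming mantissa bits pays off so strongly on FPGAs.
int add_area(int b)  { return 2 * b; }       // slices for add/sub
int mult_area(int b) { return b * (b - 2); } // slices for mult/sqr
```

Halving the bitlength roughly quarters the multiplier area but only halves the adder area.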

Page 20

Resource Consumption on an FPGA

[Figure: Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG); number of slices (0 to 1600) vs. bits of mantissa (20 to 50); curves: Adder, Multiplier, CG kernel normalized (1/30); smaller is better]

Page 21

Generalized Iterative Refinement

For a function F: ℝ^N → ℝ^N, find parameters X ∈ ℝ^N with Q₀ ∈ ℝ^M such that

  F(X; Q₀) = 0.

This is typically used to solve a linear system of equations A X = B:

  B_{k+1} := B − A X_k,   A X̃_{k+1} − B_{k+1} = 0,   X_{k+1} := X_k + X̃_{k+1}.

Now we distinguish two cases:
1) We can find an approximate solution to F directly.
2) The approximate solution to F itself requires an iterative process.

As we cannot solve F exactly, we iterate, starting with some X₀ ∈ ℝ^N:

  Q_{k+1} := H(X_k, Q₀, …, Q_k),   F(X̃_{k+1}; Q_{k+1}) = 0,   X_{k+1} := X_k + X̃_{k+1},

i.e. we repeatedly solve F with different parameters Q_k.
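For the linear-system case this can be sketched in a few lines: residual and update are kept in double precision while the correction system is solved only approximately in single precision (the 2×2 example matrix and the Jacobi inner solver are illustrative choices, not from the talk):

```cpp
#include <cassert>
#include <cmath>

// Mixed precision iterative refinement for a fixed 2x2 system A X = B:
// the residual and the solution updates stay in double precision, the
// correction system is solved only approximately in single precision.
struct Vec2 { double x, y; };

const double A[2][2] = {{4.0, 1.0}, {1.0, 3.0}};

// Residual B - A*X, computed in double precision.
Vec2 residual(const Vec2& X, const Vec2& B) {
    return { B.x - (A[0][0] * X.x + A[0][1] * X.y),
             B.y - (A[1][0] * X.x + A[1][1] * X.y) };
}

// Low precision inner solver: a few Jacobi sweeps carried out in float.
Vec2 inner_solve(const Vec2& R) {
    float x = 0.f, y = 0.f;
    for (int i = 0; i < 20; ++i) {
        float nx = (float(R.x) - float(A[0][1]) * y) / float(A[0][0]);
        float ny = (float(R.y) - float(A[1][0]) * x) / float(A[1][1]);
        x = nx; y = ny;
    }
    return { double(x), double(y) };
}

// Outer refinement loop: X_{k+1} := X_k + X~_{k+1} in double precision.
Vec2 refine(const Vec2& B, int iters) {
    Vec2 X = {0.0, 0.0};
    for (int k = 0; k < iters; ++k) {
        Vec2 Xt = inner_solve(residual(X, B));  // float correction
        X.x += Xt.x;                            // double update
        X.y += Xt.y;
    }
    return X;
}
```

Each outer iteration gains roughly single-precision accuracy on the remaining error, so a handful of cheap inner solves reaches double-precision accuracy.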

Page 22

CPU Results: LU Solver

[Chart courtesy of Jack Dongarra; larger is better]

[Langou et al. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems), SC 2006, to appear]

Page 23

GPU Results: Conjugate Gradient (CG) and Multigrid (MG)

[Figure: Performance of double precision CPU and mixed precision CPU-GPU solvers; seconds per grid node (5e-7 to 5e-4) vs. data level (6 to 10); curves: CG CPU, CG GPU, MG2+2 CPU, MG2+2 GPU; smaller is better]

Page 24

FPGA Results: Conjugate Gradient with MUL18x18

[Figure: Area of Conjugate Gradient s??e11 float kernels on the xc2v8000; number of slices (0 to 60000) vs. bits of mantissa (20 to 50); curves: Number of Slices, Quadratic fit, Number of 4 input LUTs, Number of Slice Flip Flops, Number of MULT18X18s × 500; smaller is better]

Page 25

FPGA Results: Conjugate Gradient with MUL18x18

[Figure: Frequency/IO of Conjugate Gradient s??e11 float kernels on the xc2v8000 vs. bits of mantissa (20 to 50); curves: Maximal Frequency in MHz, Number of bonded IOBs in 10s; larger is better]

Page 26

The Challenges

• Computing Paradigms

• Parallel Programming

• Precision and Accuracy

• Algorithmic Optimization

• Large Range Scaling

Page 27

Arithmetic Intensity in Matrix-Vector Products

• Arithmetic intensity
  – Operations per memory access
  – Computation / bandwidth

• Analysis of banded MatVec r = Av, pre-assembled
  – Reads per component of r: 9 times into v, once into each band of A → 18 reads
  – Operations per component of r: 9 multiply-adds → 18 ops
  – Arithmetic intensity: 18/18 = 1

• Rule of thumb for CPU/GPU
  – Arithmetic intensity on floats should be > 8
  – On doubles twice as high
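The counting above can be written down directly; a minimal sketch for the pre-assembled banded MatVec (a multiply-add counts as two operations, matching the 18 ops above; names are illustrative):

```cpp
#include <cassert>

// Bookkeeping for the pre-assembled banded MatVec r = Av with an
// s x s stencil: per component of r we read s*s values of v and s*s
// matrix entries, and perform s*s multiply-adds (= 2*s*s operations).
struct Cost { int reads, ops; };

Cost banded_matvec_cost(int s) {
    int entries = s * s;                  // 9 for the 3x3 stencil
    return { 2 * entries, 2 * entries };  // reads and operations
}

// Arithmetic intensity: operations per memory access.
double arithmetic_intensity(Cost c) {
    return double(c.ops) / double(c.reads);
}
```

An intensity of 1 is far below the rule-of-thumb target of > 8, so the pre-assembled MatVec is firmly bandwidth-bound.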

Page 28

Trading Computation for Bandwidth

• Three possibilities for a matrix vector product A·v if A depends on some data and must be computed itself:
  – On-the-fly: compute entries of A for each A·v application
    • Lowest memory requirement
    • Good for simple entries or seldom use of A
  – Partial assembly: precompute only some intermediate results
    • Allows balancing computation and bandwidth requirements
    • A good choice of precomputed results requires little memory
  – Full assembly: precompute all entries of A, use these in A·v
    • Good if other computations hide the bandwidth problem in A·v
    • Otherwise try to use partial assembly

• For example, pre-compute only G[U^k] when solving
    A[U^k] · U^(k+1),   A[U^k] := 1 − τ div_h( G[U^k] ∇_h )

Page 29

Standard Conjugate GradientStandard Conjugate Gradient

kUr

kRr

1−kPr

A

kβkk

kkkk

PAQ

PRPrr

rrr

=

+= −− 11β

Vector operations 1

kk QPrr

⋅Dot product 1

kkkk

kkkk

QRR

PUUrrr

rrr

α

α

−=

+=+

+

1

1

Vector operations 2

11 ++ ⋅ kk RRrr

Dot product 2

1+kβ

kUr

kRr

kPr

kQr
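The structure above maps directly onto code; a minimal unpreconditioned CG sketch with the two vector-operation groups and the two dot products made explicit (the dense MatVec and the stopping threshold are illustrative choices):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Standard CG for a symmetric positive definite A: per iteration one
// MatVec, two dot products and two groups of vector operations.
Vec cg(const std::vector<Vec>& A, const Vec& B, int max_iter) {
    Vec U(B.size(), 0.0), R = B, P = R, Q(B.size(), 0.0);
    double rr = dot(R, R);
    for (int k = 0; k < max_iter && rr > 1e-28; ++k) {
        for (std::size_t i = 0; i < A.size(); ++i)    // Q_k = A P_k
            Q[i] = dot(A[i], P);
        double alpha = rr / dot(P, Q);                // dot product 1
        for (std::size_t i = 0; i < U.size(); ++i) {  // vector ops 2
            U[i] += alpha * P[i];                     // U_{k+1}
            R[i] -= alpha * Q[i];                     // R_{k+1}
        }
        double rr_new = dot(R, R);                    // dot product 2
        double beta = rr_new / rr;
        rr = rr_new;
        for (std::size_t i = 0; i < P.size(); ++i)    // vector ops 1
            P[i] = R[i] + beta * P[i];                // P_{k+1}
    }
    return U;
}
```

The separation into two dot products and two vector-operation groups is exactly what the pipelined variant on the next slide rearranges.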

Page 30

Pipelined Conjugate GradientPipelined Conjugate Gradient

kUr

kRr

kPr

kQr

Akαkβ

kk

kkkk

kkkk

kkkk

PAQ

PRP

QRR

PUU

rr

rrr

rrr

rrr

=

+=

−=

+=

++

+

+

β

α

α

11

1

1

Vector operations

11

11

11

++

++

++

kk

kk

kk

QQ

QP

RR

rr

rr

rrDot products

Scalaroperations

1+kα1+kβ

Page 31

The Challenges

• Computing Paradigms

• Parallel Programming

• Precision and Accuracy

• Algorithmic Optimization

• Large Range Scaling

Page 32

Discretization Grids

• Equidistant grid: easy to implement; one array holds all values
• Deformed tensor-product grid: parallel dynamic updates; one array for the values, a second for the deformation

Page 33

Discretization Grids

• Unstructured grid: good performance for static, poor for dynamic grid topology; an index array is needed
• Adaptive grid: can handle coherently changing dynamic grid topology; a hash, tree or page table is needed

Page 34

Glift: Generic, Efficient, Random-Access GPU Data Structures

STL-like abstraction of data containers from algorithms for GPUs

The Glift slides are based on Aaron Lefohn's presentation at the GPGPU Vis05 tutorial

Page 35

Glift: Virtual Memory

• Virtual N-D address space
  – Defined by physical memory and an address translator
  – The address translator can be a simple analytical or a complex mapping based on a page table, tree or hash
  – The same user interface irrespective of the actual physical storage

[Diagram: the abstraction layer presents a virtual representation of memory as a 3D grid; a translation maps it onto the actual physical layout: 3D native memory, 2D slices, or a flat 3D array]
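The "flat 3D array" case can be made concrete; a minimal sketch of an analytical address translator that virtualizes a 3D grid on top of flat storage by laying the z-slices side by side (the class and layout are illustrative, not Glift's actual API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Virtual 3D grid of size nx x ny x nz stored as one flat 2D array:
// the nz slices are laid out side by side along the physical x axis.
// The address translator maps virtual (x,y,z) to a physical offset.
struct Flat3DArray {
    int nx, ny, nz;
    std::vector<float> phys;  // physical 2D array of size (nx*nz) x ny

    Flat3DArray(int x, int y, int z)
        : nx(x), ny(y), nz(z), phys(std::size_t(x) * y * z, 0.f) {}

    // Address translation: virtual (x,y,z) -> physical offset.
    std::size_t translate(int x, int y, int z) const {
        int px = z * nx + x;  // slice z is shifted along physical x
        int py = y;
        return std::size_t(py) * (nx * nz) + px;
    }

    float& at(int x, int y, int z) { return phys[translate(x, y, z)]; }
};
```

Algorithms address the container only through `at(x,y,z)`, so swapping in a page-table or hash translator leaves the user code unchanged.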

Page 36

Glift Components

[Diagram: an Application written in C++ / Cg / OpenGL uses Container Adaptors on top of a VirtMem component, which is implemented by PhysMem plus AddrTrans]

Algorithms based on VirtMem do not depend on the physical memory capabilities: data layout optimization, code reuse, portability

Page 37

FEAST: Generalized Tensor-Product Grids

• Sufficient flexibility in domain discretization
  – Global unstructured macro mesh, domain decomposition
  – (An)isotropic refinement into local tensor-product grids

• Efficient computation
  – High data locality, large problems map well to clusters
  – Problem specific solvers depending on anisotropy level
  – Hardware accelerated solvers on regular sub-problems

[Stefan Turek et al. Hardware-oriented numerics and concepts for PDE software, 2006]

Page 38

FEAST: Deformation Adaptivity

• This grid is a tensor-product!
• Easier to accelerate in hardware than resolution adaptive grids
• The anisotropy level determines the optimal solver

Page 39

FEAST: Ad-hoc GPU Cluster Performance

[Figure: CPU, GPU performance study for 1x16p, 2x16p (Threshold=20K); seconds per macro grid node (0.0006 to 0.0022) vs. level (6 to 9); curves: 1x16p CPU MGCPU2, 1x16p GPU FX1400, 2x16p CPU MGCPU2, 2x16p GPU FX1400; smaller is better]

Page 40

Conclusions

• The flowware/configware distinction is important for efficiency; abstract interfaces facilitate programming

• Mixed precision methods often allow reducing the computational precision without a loss of final accuracy

• Balancing arithmetic intensity is more effective than one-sided bandwidth or computation optimizations

• Clever discretizations combine high flexibility with a very efficient parallel data layout for PDE solvers

Page 41

Collaborators and Associated Projects

• FPGAs, ASC
  – Lee Howes, Oliver Pell, Oskar Mencer (Imperial College)

• Mixed Precision Methods, FEAST
  – Dominik Göddeke, Stefan Turek (University of Dortmund)

• Cluster Computing, Scout
  – Patrick McCormick, Advanced Computing Lab (LANL)

• Parallel Adaptive Grids, Glift
  – Aaron Lefohn (Neoptica), Joe Kniss (University of Utah), Shubhabrata Sengupta, John Owens (University of California, Davis)

• Application Integration, PhysBAM
  – Ron Fedkiw's group, physical simulation and computer graphics (Stanford University)

Page 42

Robert Strzodka, Stanford University, Max Planck Center

The Chances and Challenges of Parallelism

Comparison of Hardwired (GPU) and Reconfigurable (FPGA) Devices

[Figure: Normalized CPU (double) and CPU-GPU (mixed precision) execution time; seconds per grid node vs. domain size in grid nodes; curves: 1x1 CG Opteron 250, 1x1 CG GF7800GTX, 2x2 MG-MG Opteron 250, 2x2 MG-MG GF7800GTX]

[Figure: Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG); number of slices vs. bits of mantissa; curves: Adder, Multiplier, CG kernel normalized (1/30)]