
Page 1: The Chances and Challenges of Parallelism · 2007-01-23 · Robert Strzodka, Stanford University, Max Planck Center · Comparison of Hardwired (GPU) and Reconfigurable (FPGA) Devices

Robert Strzodka, Stanford University, Max Planck Center

The Chances and Challenges of Parallelism

Comparison of Hardwired (GPU) and Reconfigurable (FPGA) Devices

[Figure: Normalized CPU (double) and CPU-GPU (mixed precision) execution time; seconds per grid node vs. domain size in grid nodes (10^3 to 10^7); curves: 1x1 CG Opteron 250, 1x1 CG GF7800GTX, 2x2 MG-MG Opteron 250, 2x2 MG-MG GF7800GTX]

[Figure: Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG); number of slices (0 to 1600) vs. bits of mantissa (20 to 50); curves: Adder, Multiplier, CG kernel normalized (1/30)]

Page 2

The Chances

  • GPU: 249 GFLOPS single precision, 166 GB/s internal bandwidth, 51.2 GB/s external bandwidth

  • FPGA: 192 mad25x18 at 550 MHz + logic, almost unrestricted internal bandwidth, 120.0 GB/s external bandwidth (for all IO pins)

  • Clearspeed: 50 GFLOPS double precision, 200.0 GB/s internal bandwidth, 6.4 GB/s external bandwidth

  • Cell BE: 230 GFLOPS single precision, 21 GFLOPS double precision, 204.8 GB/s internal bandwidth, 25.6 GB/s external bandwidth
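These peak numbers can be condensed into a machine balance, the number of operations a kernel must perform per byte of external traffic to stay compute-bound. A minimal sketch using the GPU and Cell BE figures from this slide (the helper name is illustrative):

```cpp
#include <cassert>

// Machine balance: peak arithmetic throughput divided by external
// bandwidth, i.e. how many operations a kernel must perform per byte
// transferred to keep the device busy. A kernel whose arithmetic
// intensity lies below this balance is bandwidth-bound.
double balance(double peak_gflops, double ext_bandwidth_gb_s) {
    return peak_gflops / ext_bandwidth_gb_s;  // operations per byte
}
```

With the figures above, the GPU needs roughly 4.9 ops/byte (249/51.2) and the Cell BE roughly 9 ops/byte (230/25.6) before computation, not bandwidth, becomes the limit.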

Page 3

The Challenges

• Computing Paradigms

• Parallel Programming

• Precision and Accuracy

• Algorithmic Optimization

• Large Range Scaling

Page 4

Instruction-Stream-Based Processing

[Diagram: a processor is driven by a stream of instructions (Software) and exchanges data with memory through a cache (Hardware)]

Page 5

Data-Stream-Based Processing

[Diagram: a configuration (Configware) sets up a pipeline in the processor (Hardware/Morphware); data streams (Flowware) flow from memory through the pipeline back to memory]

Nomenclature from [Reiner Hartenstein. Data-stream-based computing: Models and architectural resources, MIDEM 2003]

Page 6

The Challenges

• Computing Paradigms

• Parallel Programming

• Precision and Accuracy

• Algorithmic Optimization

• Large Range Scaling

Page 7

PDE Example: The Poisson Problem

Given a function b: Ω → ℝ, find u: Ω → ℝ such that

  −Δu = b   inside the domain Ω,
  u = 0     on the boundary ∂Ω.

In 2D the Laplace operator is given as Δu(x,y) = ∂²u/∂x² + ∂²u/∂y².

[Figure: domain Ω with boundary ∂Ω and the solution u(x,y)]

Page 8

PDE Example: Discretization and Solvers

After discretization the Poisson problem −Δu = b becomes a linear equation system A u = b.

For small systems, Au = b is typically solved with an LU decomposition:

  PA = LU,   L y = P b,   U u = y.

For large systems, Au = b is typically solved with iterative schemes:

  u^0 := initial guess,   u^(k+1) := u^k + G(u^k).

We obtain a convergent series: u^0, u^1, u^2, … → u* as k → ∞.
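As a concrete instance of the iterative scheme u^(k+1) := u^k + G(u^k), here is a minimal Jacobi sketch for the 1D Poisson problem on a uniform grid with zero boundary values (the grid size, right-hand side and iteration count are illustrative choices, not from the slides):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One Jacobi step for -u'' = b on a uniform 1D grid with spacing h and
// homogeneous boundary values u[0] = u[n-1] = 0:
//   u_new[i] = (h^2 * b[i] + u[i-1] + u[i+1]) / 2
// This has the generic form u^(k+1) = u^k + G(u^k) with
// G(u) = D^(-1)(b - A u), D the diagonal of the discrete Laplacian A.
std::vector<double> jacobi_step(const std::vector<double>& u,
                                const std::vector<double>& b, double h) {
    std::vector<double> un(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        un[i] = (h * h * b[i] + u[i - 1] + u[i + 1]) / 2.0;
    return un;
}

// Iterate a fixed number of steps; returns the approximate solution.
std::vector<double> solve_poisson_1d(const std::vector<double>& b,
                                     double h, int max_iter = 10000) {
    std::vector<double> u(b.size(), 0.0);
    for (int k = 0; k < max_iter; ++k) u = jacobi_step(u, b, h);
    return u;
}
```

For b ≡ 2 on [0,1] the exact solution is u(x) = x(1−x), which the 3-point stencil reproduces exactly at the grid nodes.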

Page 9

Matrix Vector Product as Stencil Operation

[Figure: grid values at step n are combined by a local stencil into step n+1]

  v_α^(n+1) = F_h( (v_β^n)_{|β−α|≤C} ) = Σ_{β: |β−α|≤C} A_{α,β} · v_β^n

Page 10

Maths: Banded Matrix Vector Product r = Av

  • Configware (the per-component kernel)

    r_α = F(A_{α,·}, v) = Σ_β A_{α,β} · v_β

  • Flowware (the data flow over all components)

    r = Av,   r, v ∈ ℝ^(WIDTH·HEIGHT),   A ∈ ℝ^((WIDTH·HEIGHT)·(WIDTH·HEIGHT)),
    A_{α,β} ≠ 0 only for the 3·3 band offsets β₀, β₁, β₂, …

    r_α = A_{α,β₀} v_{β₀} + A_{α,β₁} v_{β₁} + A_{α,β₂} v_{β₂} + …
    r_{α₀} = F(A_{α₀,·}, v),   r_{α₁} = F(A_{α₁,·}, v),   r_{α₂} = F(A_{α₂,·}, v), …

Page 11

CPU: Banded Matrix Vector Product r = Av

• Configware in C/C++

  float kernel( float v[HEIGHT][WIDTH], float A[HEIGHT][WIDTH][3][3],
                int x, int y ) {
    float r= 0;
    for( int yo= -1; yo <= 1; yo++ ) {
      for( int xo= -1; xo <= 1; xo++ ) {
        r+= A[y][x][yo+1][xo+1] * v[y+yo][x+xo];
      }
    }
    return r;
  }

• Flowware in C/C++

  extern float A[HEIGHT][WIDTH][3][3];
  extern float r[HEIGHT][WIDTH], v[HEIGHT][WIDTH];

  for( int y= 0; y < HEIGHT; y++ ) {
    for( int x= 0; x < WIDTH; x++ ) {
      r[y][x]= kernel( v, A, x, y );
    }
  }

Page 12

GPU: Banded Matrix Vector Product r = Av

• Configware in Cg (high level language for GPUs)

  float kernel( array2d v, array2d Al, array2d Ac, array2d Au,
                float2 xy : WPOS ) : COLOR {
    float r= 0;
    array2d A[3]= { Al, Ac, Au };
    for( int yo= -1; yo <= 1; yo++ ) {
      for( int xo= -1; xo <= 1; xo++ ) {
        r+= arr2d(A[yo+1],xy)[xo+1] * arr2d(v,xy+float2(xo,yo));
      }
    }
    return r;
  }

• Flowware in C++

  // load configware to the GPU, define names for arrays, then initialize
  // enum EnumArr { ARR_r, ARR_v, ARR_Al, ARR_Ac, ARR_Au, ARR_NUM };
  for( int i= 0; i < ARR_NUM; i++ ) {
    GPUArr* arr= new GPUArr( "Array name", (i<=ARR_v)? 1 : 3 );
    arr->Initialize(WIDTH, HEIGHT);
    arrP.push_back(arr);
  }
  // ...
  SciGPU::op( ARR_r, VP_ID, FP_MAT_VEC, ARR_v, ARR_Al, ARR_Ac, ARR_Au );

Page 13

FPGA: Banded Matrix Vector Product r = Av

• Configware in ASC (high level language for FPGAs)

  void kernel() {
    HWfloatFormat(32, 24, SIGNMAGNITUDE);
    Arch(OUT); IOtype<float> r_out;      Arch(TMP); HWfloat r;
    Arch(IN);  IOtype<float> v_in;       Arch(TMP); HWfloat v;
    Arch(IN);  IOtype<float> A_in[3][3]; Arch(TMP); HWfloat A[3][3];
    v= v_in; r= 0;
    UNROLL_LOOP( int yo= 0; yo < 3; yo++ ) {
      UNROLL_LOOP( int xo= 0; xo < 3; xo++ ) {
        A[yo][xo]= A_in[yo][xo];
        r+= A[yo][xo] * prev(v, yo*WIDTH+xo);
      }
    }
    r_out= r;
  }

• Flowware in C++
  – The FPGA will use the same framework as the GPU.
  – Object-orientation: one interface, different implementations.
  – In development.

[Oskar Mencer: ASC, A Stream Compiler for Computing with FPGAs, IEEE Trans. CAD, 2006]

Page 14

GPU Programming

[Diagram: software stack]
• Application, e.g. in C/C++, Java, Fortran, Perl
• GPU library: hides the graphics-specific details
• Shader programs, e.g. in HLSL, GLSL, Cg
• Window manager, e.g. GLUT, Qt, Win32, Motif
• Graphics API, e.g. OpenGL, DirectX
• Operating system, e.g. Windows, Unix, Linux, MacOS
• Graphics hardware, e.g. Radeon (ATI), GeForce (NV)

Flowware and Configware mark the two programming layers in the diagram.

Page 15

FPGA Programming

ASC bridges the VLSI CAD Productivity Gap with a Software Approach to Hardware Generation

• The traditional hardware design process (system level model, behavioral synthesis, RTL / libraries, logic synthesis) is vertically fragmented across many companies, file formats, etc.; this is the major culprit for the productivity gap.
• ASC spans architecture generation, module generation and the gate level (PamDC), combined with a parallelizing compiler or manual optimization.
• Very high performance: the programmer has easy access to the design on all levels of abstraction.
• Easy to use: C++ syntax with custom types; e.g. the most comprehensive floating point library available today (>200 different units) was created in 2 months!

slide courtesy of Oskar Mencer

Page 16

The Challenges

• Computing Paradigms

• Parallel Programming

• Precision and Accuracy

• Algorithmic Optimization

• Large Range Scaling

Page 17

The Erratic Roundoff Error

[Figure: Roundoff error for 0 = f(a) := |(1+a)^3 − (1+3a^2) − (3a+a^3)|; x = log2(1/a) with a = 1/2^x, y = log2(f(a)) with 0 → 2^−100; curves: single precision, double precision; smaller is better]
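The experiment behind the plot is easy to reproduce; a minimal sketch of the plotted function (mathematically f(a) = 0 for every a, so any nonzero result is pure roundoff):

```cpp
#include <cassert>
#include <cmath>

// f(a) = |(1+a)^3 - (1+3a^2) - (3a+a^3)| is identically zero in exact
// arithmetic; in floating point only cancellation noise remains, so
// evaluating f exposes the roundoff error of the chosen precision.
template <typename T>
T f(T a) {
    T cube = (T(1) + a) * (T(1) + a) * (T(1) + a);
    return std::fabs(cube - (T(1) + T(3) * a * a) - (T(3) * a + a * a * a));
}
```

Evaluating f<float> and f<double> over a = 1/2^x reproduces the two erratic curves of the figure.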

Page 18

Precision and Accuracy

• There is no monotonic relation between the computational precision and the accuracy of the final result.

• Increasing precision can decrease accuracy !

• The increase or decrease of precision in different parts of a computation can have very different impact on the accuracy.

• The above can be exploited to significantly reduce the precision in parts of a computation without a loss in accuracy.

• We obtain a mixed precision method.

Page 19

Resource Consumption for Integer Operations

  Operation                    Area      Latency
  min(r,0), max(r,0)           b+1       2
  add(r1,r2), sub(r1,r2)       2b        b
  add(r1,r2,r3) → add(r4,r5)   2b        1
  mult(r1,r2), sqr(r)          b(b-2)    b log(b)
  sqrt(r)                      2c(c-5)   c(c+3)

b: bitlength of argument, c: bitlength of result
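The area formulas in the table already explain the FPGA area plots that follow: adder area grows linearly in the bitlength, multiplier area quadratically. A tiny sketch evaluating the table's formulas (function names are illustrative):

```cpp
#include <cassert>

// Area formulas from the table: adders/subtractors grow linearly in
// the argument bitlength b, multipliers quadratically. This is why
// trimming mantissa bits pays off so strongly on FPGAs.
int add_area(int b)  { return 2 * b; }       // slices for add/sub
int mult_area(int b) { return b * (b - 2); } // slices for mult/sqr
```

Halving the bitlength roughly quarters the multiplier area but only halves the adder area.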

Page 20

Resource Consumption on an FPGA

[Figure: Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG); number of slices (0 to 1600) vs. bits of mantissa (20 to 50); curves: Adder, Multiplier, CG kernel normalized (1/30); smaller is better]

Page 21

Generalized Iterative Refinement

For a function F: ℝ^N → ℝ^N, find parameters X ∈ ℝ^N with Q₀ ∈ ℝ^M such that

  F(X; Q₀) = 0.

This is typically used to solve a linear system of equations A X = B:

  B_{k+1} := B − A X_k,   A X̃_{k+1} − B_{k+1} = 0,   X_{k+1} := X_k + X̃_{k+1}.

Now we distinguish two cases:
1) We can find an approximate solution to F directly.
2) The approximate solution to F itself requires an iterative process.

As we cannot solve F exactly, we iterate, starting with some X₀ ∈ ℝ^N:

  Q_{k+1} := H(X_k, Q₀, …, Q_k),   F(X̃_{k+1}; Q_{k+1}) = 0,   X_{k+1} := X_k + X̃_{k+1},

i.e. we repeatedly solve F with different parameters Q_k.
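For the linear-system case this can be sketched in a few lines: residual and update are kept in double precision while the correction system is solved only approximately in single precision (the 2×2 example matrix and the Jacobi inner solver are illustrative choices, not from the talk):

```cpp
#include <cassert>
#include <cmath>

// Mixed precision iterative refinement for a fixed 2x2 system A X = B:
// the residual and the solution updates stay in double precision, the
// correction system is solved only approximately in single precision.
struct Vec2 { double x, y; };

const double A[2][2] = {{4.0, 1.0}, {1.0, 3.0}};

// Residual B - A*X, computed in double precision.
Vec2 residual(const Vec2& X, const Vec2& B) {
    return { B.x - (A[0][0] * X.x + A[0][1] * X.y),
             B.y - (A[1][0] * X.x + A[1][1] * X.y) };
}

// Low precision inner solver: a few Jacobi sweeps carried out in float.
Vec2 inner_solve(const Vec2& R) {
    float x = 0.f, y = 0.f;
    for (int i = 0; i < 20; ++i) {
        float nx = (float(R.x) - float(A[0][1]) * y) / float(A[0][0]);
        float ny = (float(R.y) - float(A[1][0]) * x) / float(A[1][1]);
        x = nx; y = ny;
    }
    return { double(x), double(y) };
}

// Outer refinement loop: X_{k+1} := X_k + X~_{k+1} in double precision.
Vec2 refine(const Vec2& B, int iters) {
    Vec2 X = {0.0, 0.0};
    for (int k = 0; k < iters; ++k) {
        Vec2 Xt = inner_solve(residual(X, B));  // float correction
        X.x += Xt.x;                            // double update
        X.y += Xt.y;
    }
    return X;
}
```

Each outer iteration gains roughly single-precision accuracy on the remaining error, so a handful of cheap inner solves reaches double-precision accuracy.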

Page 22

CPU Results: LU Solver

[Chart courtesy of Jack Dongarra; larger is better]

[Langou et al. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems), SC 2006, to appear]

Page 23

GPU Results: Conjugate Gradient (CG) and Multigrid (MG)

[Figure: Performance of double precision CPU and mixed precision CPU-GPU solvers; seconds per grid node (5e-7 to 5e-4) vs. data level (6 to 10); curves: CG CPU, CG GPU, MG2+2 CPU, MG2+2 GPU; smaller is better]

Page 24

FPGA Results: Conjugate Gradient with MUL18x18

[Figure: Area of Conjugate Gradient s??e11 float kernels on the xc2v8000; number of slices (0 to 60000) vs. bits of mantissa (20 to 50); curves: Number of Slices, Quadratic fit, Number of 4 input LUTs, Number of Slice Flip Flops, Number of MULT18X18s × 500; smaller is better]

Page 25

FPGA Results: Conjugate Gradient with MUL18x18

[Figure: Frequency/IO of Conjugate Gradient s??e11 float kernels on the xc2v8000 vs. bits of mantissa (20 to 50); curves: Maximal Frequency in MHz, Number of bonded IOBs in 10s; larger is better]

Page 26

The Challenges

• Computing Paradigms

• Parallel Programming

• Precision and Accuracy

• Algorithmic Optimization

• Large Range Scaling

Page 27

Arithmetic Intensity in Matrix-Vector Products

• Arithmetic intensity
  – Operations per memory access
  – Computation / bandwidth

• Analysis of banded MatVec r = Av, pre-assembled
  – Reads per component of r: 9 times into v, once into each band of A → 18 reads
  – Operations per component of r: 9 multiply-adds → 18 ops
  – Arithmetic intensity: 18/18 = 1

• Rule of thumb for CPU/GPU
  – Arithmetic intensity on floats should be > 8
  – On doubles twice as high
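The counting above can be written down directly; a minimal sketch for the pre-assembled banded MatVec (a multiply-add counts as two operations, matching the 18 ops above; names are illustrative):

```cpp
#include <cassert>

// Bookkeeping for the pre-assembled banded MatVec r = Av with an
// s x s stencil: per component of r we read s*s values of v and s*s
// matrix entries, and perform s*s multiply-adds (= 2*s*s operations).
struct Cost { int reads, ops; };

Cost banded_matvec_cost(int s) {
    int entries = s * s;                  // 9 for the 3x3 stencil
    return { 2 * entries, 2 * entries };  // reads and operations
}

// Arithmetic intensity: operations per memory access.
double arithmetic_intensity(Cost c) {
    return double(c.ops) / double(c.reads);
}
```

An intensity of 1 is far below the rule-of-thumb target of > 8, so the pre-assembled MatVec is firmly bandwidth-bound.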

Page 28

Trading Computation for Bandwidth

• Three possibilities for a matrix vector product A·v if A depends on some data and must be computed itself:
  – On-the-fly: compute entries of A for each A·v application
    • Lowest memory requirement
    • Good for simple entries or seldom use of A
  – Partial assembly: precompute only some intermediate results
    • Allows balancing computation and bandwidth requirements
    • A good choice of precomputed results requires little memory
  – Full assembly: precompute all entries of A, use these in A·v
    • Good if other computations hide the bandwidth problem in A·v
    • Otherwise try to use partial assembly

• For example, pre-compute only G[U^k] when solving
    A[U^k] · U^(k+1),   A[U^k] := 1 − τ div_h( G[U^k] ∇_h )

Page 29

Standard Conjugate GradientStandard Conjugate Gradient

kUr

kRr

1−kPr

A

kβkk

kkkk

PAQ

PRPrr

rrr

=

+= −− 11β

Vector operations 1

kk QPrr

⋅Dot product 1

kkkk

kkkk

QRR

PUUrrr

rrr

α

α

−=

+=+

+

1

1

Vector operations 2

11 ++ ⋅ kk RRrr

Dot product 2

1+kβ

kUr

kRr

kPr

kQr
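The structure above maps directly onto code; a minimal unpreconditioned CG sketch with the two vector-operation groups and the two dot products made explicit (the dense MatVec and the stopping threshold are illustrative choices):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Standard CG for a symmetric positive definite A: per iteration one
// MatVec, two dot products and two groups of vector operations.
Vec cg(const std::vector<Vec>& A, const Vec& B, int max_iter) {
    Vec U(B.size(), 0.0), R = B, P = R, Q(B.size(), 0.0);
    double rr = dot(R, R);
    for (int k = 0; k < max_iter && rr > 1e-28; ++k) {
        for (std::size_t i = 0; i < A.size(); ++i)    // Q_k = A P_k
            Q[i] = dot(A[i], P);
        double alpha = rr / dot(P, Q);                // dot product 1
        for (std::size_t i = 0; i < U.size(); ++i) {  // vector ops 2
            U[i] += alpha * P[i];                     // U_{k+1}
            R[i] -= alpha * Q[i];                     // R_{k+1}
        }
        double rr_new = dot(R, R);                    // dot product 2
        double beta = rr_new / rr;
        rr = rr_new;
        for (std::size_t i = 0; i < P.size(); ++i)    // vector ops 1
            P[i] = R[i] + beta * P[i];                // P_{k+1}
    }
    return U;
}
```

The separation into two dot products and two vector-operation groups is exactly what the pipelined variant on the next slide rearranges.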

Page 30

Pipelined Conjugate GradientPipelined Conjugate Gradient

kUr

kRr

kPr

kQr

Akαkβ

kk

kkkk

kkkk

kkkk

PAQ

PRP

QRR

PUU

rr

rrr

rrr

rrr

=

+=

−=

+=

++

+

+

β

α

α

11

1

1

Vector operations

11

11

11

++

++

++

kk

kk

kk

QQ

QP

RR

rr

rr

rrDot products

Scalaroperations

1+kα1+kβ

Page 31

The Challenges

• Computing Paradigms

• Parallel Programming

• Precision and Accuracy

• Algorithmic Optimization

• Large Range Scaling

Page 32

Discretization Grids

• Equidistant grid: easy to implement; one array holds all values
• Deformed tensor-product grid: parallel dynamic updates; one array for the values, a second for the deformation

Page 33

Discretization Grids

• Unstructured grid: good performance for static, poor for dynamic grid topology; an index array is needed
• Adaptive grid: can handle coherently changing dynamic grid topology; a hash, tree or page table is needed

Page 34

Glift: Generic, Efficient, Random-Access GPU Data Structures

STL-like abstraction of data containers from algorithms for GPUs

The Glift slides are based on Aaron Lefohn's presentation at the GPGPU Vis05 tutorial

Page 35

Glift: Virtual Memory

• Virtual N-D address space
  – Defined by physical memory and an address translator
  – The address translator can be a simple analytical or a complex mapping based on a page table, tree or hash
  – The same user interface irrespective of the actual physical storage

[Diagram: the abstraction layer presents a virtual representation of memory as a 3D grid; a translation maps it onto the actual physical layout: 3D native memory, 2D slices, or a flat 3D array]
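The "flat 3D array" case can be made concrete; a minimal sketch of an analytical address translator that virtualizes a 3D grid on top of flat storage by laying the z-slices side by side (the class and layout are illustrative, not Glift's actual API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Virtual 3D grid of size nx x ny x nz stored as one flat 2D array:
// the nz slices are laid out side by side along the physical x axis.
// The address translator maps virtual (x,y,z) to a physical offset.
struct Flat3DArray {
    int nx, ny, nz;
    std::vector<float> phys;  // physical 2D array of size (nx*nz) x ny

    Flat3DArray(int x, int y, int z)
        : nx(x), ny(y), nz(z), phys(std::size_t(x) * y * z, 0.f) {}

    // Address translation: virtual (x,y,z) -> physical offset.
    std::size_t translate(int x, int y, int z) const {
        int px = z * nx + x;  // slice z is shifted along physical x
        int py = y;
        return std::size_t(py) * (nx * nz) + px;
    }

    float& at(int x, int y, int z) { return phys[translate(x, y, z)]; }
};
```

Algorithms address the container only through `at(x,y,z)`, so swapping in a page-table or hash translator leaves the user code unchanged.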

Page 36

Glift Components

[Diagram: an Application written in C++ / Cg / OpenGL uses Container Adaptors on top of a VirtMem component, which is implemented by PhysMem plus AddrTrans]

Algorithms based on VirtMem do not depend on the physical memory capabilities: data layout optimization, code reuse, portability

Page 37

FEAST: Generalized Tensor-Product Grids

• Sufficient flexibility in domain discretization
  – Global unstructured macro mesh, domain decomposition
  – (An)isotropic refinement into local tensor-product grids

• Efficient computation
  – High data locality, large problems map well to clusters
  – Problem specific solvers depending on anisotropy level
  – Hardware accelerated solvers on regular sub-problems

[Stefan Turek et al. Hardware-oriented numerics and concepts for PDE software, 2006]

Page 38

FEAST: Deformation Adaptivity

• This grid is a tensor-product!
• Easier to accelerate in hardware than resolution adaptive grids
• The anisotropy level determines the optimal solver

Page 39

FEAST: Ad-hoc GPU Cluster Performance

[Figure: CPU, GPU performance study for 1x16p, 2x16p (Threshold=20K); seconds per macro grid node (0.0006 to 0.0022) vs. level (6 to 9); curves: 1x16p CPU MGCPU2, 1x16p GPU FX1400, 2x16p CPU MGCPU2, 2x16p GPU FX1400; smaller is better]

Page 40

Conclusions

• The flowware/configware distinction is important for efficiency; abstract interfaces facilitate programming

• Mixed precision methods often allow reducing the computational precision without a loss of final accuracy

• Balancing arithmetic intensity is more effective than one-sided bandwidth or computation optimizations

• Clever discretizations combine high flexibility with a very efficient parallel data layout for PDE solvers

Page 41

Collaborators and Associated Projects

• FPGAs, ASC
  – Lee Howes, Oliver Pell, Oskar Mencer (Imperial College)

• Mixed Precision Methods, FEAST
  – Dominik Göddeke, Stefan Turek (University of Dortmund)

• Cluster Computing, Scout
  – Patrick McCormick, Advanced Computing Lab (LANL)

• Parallel Adaptive Grids, Glift
  – Aaron Lefohn (Neoptica), Joe Kniss (University of Utah), Shubhabrata Sengupta, John Owens (University of California, Davis)

• Application Integration, PhysBAM
  – Ron Fedkiw's group, physical simulation and computer graphics (Stanford University)

Page 42

Robert Strzodka, Stanford University, Max Planck Center

The Chances and Challenges of Parallelism

Comparison of Hardwired (GPU) and Reconfigurable (FPGA) Devices

[Figure: Normalized CPU (double) and CPU-GPU (mixed precision) execution time; seconds per grid node vs. domain size in grid nodes; curves: 1x1 CG Opteron 250, 1x1 CG GF7800GTX, 2x2 MG-MG Opteron 250, 2x2 MG-MG GF7800GTX]

[Figure: Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG); number of slices vs. bits of mantissa; curves: Adder, Multiplier, CG kernel normalized (1/30)]