77
Paraisoproject for automated generation of partial differential equations solvers for parallel computers Takayuki Muranush @nushio Astrophisicist, Assistant professor at The Hakubi Center, Kyoto University2010-2015

Paraiso project - cps-jp.org

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Paraiso project - cps-jp.org

「Paraiso」project for automated generation

of partial differential equations solvers for parallel computers

Takayuki Muranush @nushio

Astrophisicist, Assistant professor at

The Hakubi Center, Kyoto University(2010-2015)

Page 2: Paraiso project - cps-jp.org

Chapter 1. An Introduction to a Problem

we (numerical astronomers) have.

Page 3: Paraiso project - cps-jp.org
Page 4: Paraiso project - cps-jp.org

Acceleration

is

Parallelization

Page 5: Paraiso project - cps-jp.org
Page 6: Paraiso project - cps-jp.org

1 node

メモリ: (4GBx6) + (8GBx3 + 2GBx3) CPU:Westmere EP x2 GPU:Tesla 2050 (515Gflops + 3GB) x3 通信:Infiniband QDR 10GB/s ローカルディスク:SSD x2 RAID0 (460MB/s read)

2.2×1015flops

2.4?×1014flops

1.3×1013 Byte

7.6×1013 Byte

7.0? ×1015 Byte

Disks

1.4×1013 Byte/s

6.1×1014 Byte/s

3.4×1013 Byte/s

6.6×1011 Byte/s

1×1014 Byte/s

GPU CPU

GPUのメモリ ホストの

メモリ

SSD

DISK

Communications

L1 L2 L3

shared L1 L2

TSUBAME 2.0 Memory Hierarchy

Page 7: Paraiso project - cps-jp.org

• Thanks to the architects, today we have access to huge amount of memory and compute capability.

• That doesn’t come for free to the programmers. anymore.

90’832’896 CUDA Thread in Parallel

L1 Cache

L2 Cache

VRAM

HOST MEMORY

SSD

Hard Disk

Register

Shared Memory

4,386,816 floating operations in Parallel

Page 8: Paraiso project - cps-jp.org

What kind of computations

I’d like to do with such hardwares?

Page 9: Paraiso project - cps-jp.org

many categories of problems

Page 10: Paraiso project - cps-jp.org

Target Problem of Paraiso: Partial Differential Equations,

Explicit Solvers, on Uniform Mesh

General Relativity

Magneto-Hydrodynamics Hydrodynamics

Radiative Transfer (Relativistic)

• Hyperbolic PDEs that appers in astrophysics • combinations of these equations • combinations with chemistry etc..

Page 11: Paraiso project - cps-jp.org

Partial Differential Equations, Explicit Solvers, on Uniform Mesh

From computational point of view: • They are d-Dimensional, real-number cell

automata. • The state of each cell is a tuple of real numbers. • The state of the cell at generation (n+1) is defiend

as function of the states of its neighbor cells at generation (n).

n-th generation n+1 -th generation

Page 12: Paraiso project - cps-jp.org

What kind of equations?

• For example, the General Theory of Relativity which only two man on the earth truly understood, is as follows

it’s easy, isn’t it?

Page 13: Paraiso project - cps-jp.org

What kind of algorithms?

• A description of BSSN algorithm, to solve the Einstein’s equations

Page 14: Paraiso project - cps-jp.org

What kind of code we use?

Page 15: Paraiso project - cps-jp.org

*BSSN algorithm Fortran+OpenMP

Page 16: Paraiso project - cps-jp.org

*My MHD Solver in CUDA+MPI

Page 17: Paraiso project - cps-jp.org

What’s wrong?

• what makes our programs this long?

Page 18: Paraiso project - cps-jp.org

Programming is to choose

algebraic concepts

physical equations

time integration methods

space interpolation methods

data structures

optimization techniques

and hardware designs

tensors, its symmetry…

HD, MHD, GR, …

1st order, 2nd order, 4th order, more…

1st, 2nd, 3rd, TVD, Shock…

SoA, AoS, distribution, communication timing…

CPU, GPU, what next, …

Page 19: Paraiso project - cps-jp.org

To make things worse...

• We write more than one program.

• We make trial & error in search for better programs

• Suddenly the architecture changes and we are forced to use SSE, CUDA, AVX ...

• We need to insert communications correctly

To make a single point of change, we have to grep all over the codes

Page 20: Paraiso project - cps-jp.org

Modern Parallel Programming is like this

The amount of programs we write in our life is the product of the factors mentioned

Specify each of the sufficient knowledge modules, and programs like above are automatically generated

I want it like this

Page 21: Paraiso project - cps-jp.org

What a code generator aims for

• Generally you write Nf×Nmath×Neq×Nint×Nhw… lines of code

• You find a bug / improvement and want Neq = Neq + 1; then you need to re-write Nf×Nmath×1×Nint×Nhw… lines

• With code generator you only have to write

Nf + Nmath + Neq + Nint + Nhw… lines

• You want Neq = Neq + 1; then just add 1 line

• You can concentrate on physics

Page 22: Paraiso project - cps-jp.org

Can’t you do that in existing language

•possibly

• we can define vectors and tensors as classes

• we can overload operators

• we have templates

• we have accumulated sophisticated techniques such as expression templates

Page 23: Paraiso project - cps-jp.org

as a result ...

Page 24: Paraiso project - cps-jp.org

I want a language that is modular

(easy to reuse components) transplantable

fast beautiful

• hopeless or is it...?

if you limit the problem domain

Page 25: Paraiso project - cps-jp.org

Paraiso Input: Discretized Algorithms for solving Partial

Differential Equations, in mathematical notations

Output: Implementations on Distributed, Manycore Machines.

PARallel Automated Integration Scheme Organizer

The overall design of Chapter 2.

Page 26: Paraiso project - cps-jp.org

8 building blocks of PDE solvers

Imm load constant value Load (graph starts here) read from named array Store (graph ends here) write to named array Reduce array to scalar value Broadcast scalar to array Shift copy each cell to neighbourhood LoadIndex & LoadSize get coordinate of each cell get array size Arith various mathematical operations

Page 27: Paraiso project - cps-jp.org

Basic Equations

Discretized Form

OM Dataflow Graph

Native Code

Executable Files

a_i_j <- … …

q_i <- dt *

a_i_j * f_j

ld r2, g2[0,0,0]

ld r1, g2[0,0,1]

add r1,r2,r3

st r3,g1

*q=cudaMalloc(…);

__shared__ a,b;

a=q[idx];

b=q[idx+1];

p[idx]=a+b;

Manually

OM Builder

OM Compiler

Native Compiler

Orthotope Machine Code

result

Discrete PDE Language

Page 28: Paraiso project - cps-jp.org

サーベイしたところ出てきた 先行研究

Page 29: Paraiso project - cps-jp.org

かなり似ている・・・

基礎方程式

離散化形

VVM上のコード

実マシン上のコード

実マシン上の実行ファイル

さしあたり人手

自動

自動

既存コンパイラ

Page 30: Paraiso project - cps-jp.org

DEQSOLの末路・・・

Page 31: Paraiso project - cps-jp.org

Discretized PDE Language

• Describe your numerical algorithm using natural notations e.g. tensor notations, difference operators.

• Translates to Orthotope Machine

Algorithm notation

Page 32: Paraiso project - cps-jp.org

Orthotope Machine • A Real number cellular automata

• A virtual machine with multidimensional array vector register of infinite size

• arithmetic operations work in parallel on each mesh, loads from neighbour cells,

No intention of buiding a real hardware:

a thought object to construct a dataflow graph

Page 33: Paraiso project - cps-jp.org

Orthotope Machine Compiler • convert dataflow graph

to real codes

• Divide OM registers onto distributed memories

• Generate Communication codes

• The code generator has wide choice on data layout, granularity of communication, computation

Orthotope Machine Code

Native Codes

OM Compiler

Page 34: Paraiso project - cps-jp.org

Basic Equations

Discretized Form

OM Dataflow Graph

Native Code

Executable Files

a_i_j <- … …

q_i <- dt *

a_i_j * f_j

ld r2, g2[0,0,0]

ld r1, g2[0,0,1]

add r1,r2,r3

st r3,g1

*q=cudaMalloc(…);

__shared__ a,b;

a=q[idx];

b=q[idx+1];

p[idx]=a+b;

Manually

OM Builder

OM Compiler

Native Compiler

Orthotope Machine Code

result

Discrete PDE Language

Chapter 3.

Page 35: Paraiso project - cps-jp.org

More Details on Orthotope Machine

Page 36: Paraiso project - cps-jp.org

Orthotope Machine

Page 37: Paraiso project - cps-jp.org

Three Type parameters OM components have

Type Constructor for The dimension of the Orthotope Machine

Type of the Array Index

Type of the Annotation you can apply at graph node

Page 38: Paraiso project - cps-jp.org
Page 39: Paraiso project - cps-jp.org

bipartile graph consisting of value nodes and inst nodes

Page 40: Paraiso project - cps-jp.org

8 building blocks of PDE solvers

Imm load constant value Load (graph starts here) read from named array Store (graph ends here) write to named array Reduce array to scalar value Broadcast scalar to array Shift copy each cell to neighbourhood LoadIndex & LoadSize get coordinate of each cell get array size Arith various mathematical operations

Page 41: Paraiso project - cps-jp.org

Mul

Add

Shift(-1,0)

Reduce(Min)

Broadcast

1 4

7

2 5

8

3 6

9

1 4

7

2 5

8

3 6

9

4 10

16

3 9

15

5 11

17

12 30

48

9 27

45

15 33

51

3 3

3

3 3

3

3 3

3

3

NValue NInst

global value (scalar value)

local value(Array)

local value(Array)

Load(“hoge”)

Store(“hoge”)

Page 42: Paraiso project - cps-jp.org

DynValue = Array (data container) with realm and element type information

Local Realm = Array Global Realm = Scalar data

Page 43: Paraiso project - cps-jp.org

Type-level representation of Array

Page 44: Paraiso project - cps-jp.org

Value : type-level DynValue : value-level

Page 45: Paraiso project - cps-jp.org

• User interface is in Type-level

• The type-checker helps user

• and assures type-consistency for the backend

• Dataflow graph under cover is Value-level

• can handle the graph in one type.

Value TLocal Float -> Value TGlobal Int -> Value Tlocal Float

Value TLocal Float

Value TGlobal Int

DynValue -> DynValue -> DynValue

DynValue

DynValue

User space Internal Dataflow Graph

Page 46: Paraiso project - cps-jp.org

Basic Equations

Discretized Form

OM Dataflow Graph

Native Code

Executable Files

a_i_j <- … …

q_i <- dt *

a_i_j * f_j

ld r2, g2[0,0,0]

ld r1, g2[0,0,1]

add r1,r2,r3

st r3,g1

*q=cudaMalloc(…);

__shared__ a,b;

a=q[idx];

b=q[idx+1];

p[idx]=a+b;

Manually

OM Builder

OM Compiler

Native Compiler

Orthotope Machine Code

result

Discrete PDE Language

Chapter 4.

Page 47: Paraiso project - cps-jp.org

Builder Monads constructs dataflow graph

the interface has type-level info on Realm and Type but internal representations are value-level on Realm and Type

Page 48: Paraiso project - cps-jp.org

a helper function to define binary operators for Builder Monad

Page 49: Paraiso project - cps-jp.org

Builder monad being an Additive Builder monad being a Ring

Page 50: Paraiso project - cps-jp.org

• programing language Paraiso lacks frontend (lexer, parser, AST analyzer ...)

• Instead, Paraiso components are first-class objects of a widely-used language (Haskell) (the (ultimate (source (of (programmability)))))

Page 51: Paraiso project - cps-jp.org

Example : Implement Hydro Solver

Page 52: Paraiso project - cps-jp.org

An HLLC Riemann-solver in Builder Monad

It looks like usual mathematics, but every terms appear here are Builder Monads, and the expressions are as themselves code generators for Orthotope Machine.

Page 53: Paraiso project - cps-jp.org

sample hydrodynamics solver in Paraiso

Page 54: Paraiso project - cps-jp.org

Basic Equations

Discretized Form

OM Dataflow Graph

Native Code

Executable Files

a_i_j <- … …

q_i <- dt *

a_i_j * f_j

ld r2, g2[0,0,0]

ld r1, g2[0,0,1]

add r1,r2,r3

st r3,g1

*q=cudaMalloc(…);

__shared__ a,b;

a=q[idx];

b=q[idx+1];

p[idx]=a+b;

Manually

OM Builder

OM Compiler

Native Compiler

Orthotope Machine Code

result

Discrete PDE Language

Chapter 5.

Page 55: Paraiso project - cps-jp.org

old code generator...

• not so productive

OM Dataflow Graph

Native Code

directly

runBinder graph0 n0 binder = unlines $ header ++ [bindStr] ++ footer where (header,footer) = case context state of CtxGlobal -> ([],[]) CtxLocal loopIndex -> ([loop (symbol Cpp loopIndex) ++ " {"], ["}"]) loop i = "for (int " ++ i ++ " = 0 ; " ++ i ++ " < " ++ symbol Cpp sizeName ++ "() ; " ++ "++" ++ i ++ ")"

Page 56: Paraiso project - cps-jp.org

code generator ver. 2 OM Dataflow

Graph

Native Code

OMTrans Optimization :: OM -> OM

Analysis = add annotation

Plan = decisions made upon

• how much memory to allocate

• which part of calculation to take place in same subroutine

Claris

• a C++ -like syntax tree with CUDA extension.

Plan

PlanTrans

Claris

ClarisTrans

Annotated and Optimized OM

Optimization/Analysis

Page 57: Paraiso project - cps-jp.org

Annotation • Allocation, Ballon, Boundary, Comment,

Dependency

-- | a type that represents valid region of computation. newtype Valid g = Valid [Interval (NearBoundary g)] -- | the displacement around either side of the boundary. data NearBoundary a = NegaInfinity | LowerBoundary a | UpperBoundary a | PosiInfinity deriving (Eq, Ord, Show, Typeable) data Allocation = Existing -- ^ This entity is already allocated as a static variable. | Manifest -- ^ Allocate additional memory for this entity. | Delayed -- ^ Do not allocate, re-compute it whenever if needed. deriving (Eq, Show, Typeable)

Page 58: Paraiso project - cps-jp.org

Analysis

• decide allocation

• boundary analysis

• dependency analysis and write grouping

optimize level = case level of O0 -> gmap identity . writeGrouping . gmap boundaryAnalysis . gmap decideAllocation _ -> optimize O0

Page 59: Paraiso project - cps-jp.org

Allocation

• some of the dataflow graph nodes are marked ‘Manifest.’

• Manifest nodes are stored in memory.

• Delayed nodes are re-computed as needed.

data Allocation = Existing -- ^ This entity is already allocated as a static variable. | Manifest -- ^ Allocate additional memory for this entity. | Delayed -- ^ Do not allocate, re-compute it whenever if needed. deriving (Eq, Show, Typeable)

Page 60: Paraiso project - cps-jp.org

• Less computation

for(;;){

f[i] = calc_f(a[i], a[i+1]);

}

for (;;){

b[i] += f[i] – f[i-1];

}

• Less storage consumption & access for(;;){

f0 = calc_f(a[i-1], a[i]);

f1 = calc_f(a[i], a[i+1]);

b[i] += f1 – f0;

}

a

f

b

a

f

b

Page 61: Paraiso project - cps-jp.org

Valid Boundary Analysis

• Shift operations create undefined regions in value.

• Boundary analysis trace this to find out valid regions for every node in the graph.

• How many additional mesh we need to obtain valid answers for desired region.

• To generate boundary-region communications.

Add

Shift(-1,0)

1 4

7

2 5

8

3 6

9

nan nan

nan

2 5

8

3 6

9

nan nan

nan

3 9

15

5 11

17

NValue NInst

Page 62: Paraiso project - cps-jp.org

write grouping Kernel

• a user-defined API of the generated class

Subkernel

• a set of calculation executed in a loop

• = Fortran subroutine

• = CUDA __global__ kernel

void Life::proceed () { Life_sub_2(static_2_cell, manifest_1_67); Life_sub_3(static_1_generation, manifest_1_67, manifest_1_69, manifest_1_74); (static_0_population) = (manifest_1_69); (static_1_generation) = (manifest_1_74); (static_2_cell) = (manifest_1_67); }

Page 63: Paraiso project - cps-jp.org

a Kernel

Page 64: Paraiso project - cps-jp.org

write grouping = a Kernel -> subkernels

• all node written by one subkernel must have the same valid region

• nodes written by one subkernel must not depend on each other

• greedy

Page 65: Paraiso project - cps-jp.org

a Kernel Existing nodes

Page 66: Paraiso project - cps-jp.org

a Kernel Existing nodes

write group 0

(・A・) dependency

(・A・) not dependent but different boundary

Page 67: Paraiso project - cps-jp.org

a Kernel Existing nodes

subkernel 0

Page 68: Paraiso project - cps-jp.org

a Kernel Existing nodes

subkernel 0

write group 1

Page 69: Paraiso project - cps-jp.org

a Kernel Existing nodes

subkernel 0

subkernel 1

Page 70: Paraiso project - cps-jp.org

a Kernel Existing nodes

subkernel 0

subkernel 1

write group 2

Page 71: Paraiso project - cps-jp.org

a Kernel Existing nodes

subkernel 0

subkernel 1

subkernel 2

Page 72: Paraiso project - cps-jp.org

Diffusion Equation

• an example of code manipulation

Page 73: Paraiso project - cps-jp.org
Page 74: Paraiso project - cps-jp.org

Diffusion Equation Example Annotation size of .cpp file number of

subkernels memory consumption

no annotation 1019 lines 4 5 x N

Manifest 301 lines 6 9 x N

Page 75: Paraiso project - cps-jp.org

homework 「内職」

all Paraiso-generated code become OpenMP Compatible

in one line!

Page 76: Paraiso project - cps-jp.org

x8 faster! cpu consumption

Page 77: Paraiso project - cps-jp.org