Programming with CellSs BSC



Motivation

* Source: Platform 2015: Intel® Processor and Platform Evolution for the Next Decade, Intel White Paper


Outline

•CellSs

• StarSs Programming Model

• CellSs syntax

• CellSs compiler

• CellSs runtime

• Installing CellSs

• Programming examples

• Compiling and running a CellSs application

• Performance analysis using Paraver

•SMPSs

•Conclusions


Cell/B.E. Architecture

So, what is the Cell/B.E.?

• Users' point of view
• Architecture point of view
• Programmers' point of view

[Figure: one PPE and eight SPEs. Annotations on the figure: separate address spaces, tiny local memory, bandwidth, thin processor, SMT, hard to optimize.]


StarSs programming model

Basic idea

Sequential application:

...
for (i=0; i<N; i++){
    T1 (data1, data2);
    T2 (data4, data5);
    T3 (data2, data5, data6);
    T4 (data7, data8);
    T5 (data6, data8, data9);
}
...

• Task selection + parameters direction (input, output, inout)
• Task graph creation based on data precedence
• Scheduling, data transfer, task execution
• Synchronization, results transfer

[Figure: the loop is unrolled into a task graph (T1_0 ... T5_0, T1_1 ... T5_1, T1_2, ...) whose tasks run on parallel resources 1 ... N (multicore, SMP, cluster, grid).]
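The step "task graph creation based on data precedence" can be sketched in plain C. This is only an illustration, not the StarSs runtime: the types `dir_t`, `param_t`, `task_t` and the function `depends_on` are hypothetical names, and the parameter directions used in the usage note are assumptions, since the slide does not state them.

```c
#include <stddef.h>

/* Direction of a task parameter, as declared by the programmer. */
typedef enum { DIR_IN, DIR_OUT, DIR_INOUT } dir_t;

typedef struct {
    const void *addr;   /* address of the data passed to the task */
    dir_t dir;          /* how the task uses it */
} param_t;

typedef struct {
    int nparams;
    param_t params[4];
} task_t;

/* Returns 1 if `later` reads something that `earlier` writes
   (a true, read-after-write dependence), 0 otherwise. */
int depends_on(const task_t *earlier, const task_t *later)
{
    for (int i = 0; i < earlier->nparams; i++) {
        if (earlier->params[i].dir == DIR_IN)
            continue;                       /* earlier only reads it */
        for (int j = 0; j < later->nparams; j++) {
            if (later->params[j].dir != DIR_OUT &&
                later->params[j].addr == earlier->params[i].addr)
                return 1;
        }
    }
    return 0;
}
```

If T1 is assumed to write data2 and T3 to read it, `depends_on(&t1, &t3)` reports an edge T1 -> T3, while T1 and T4 share no data and can run in parallel.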


StarSs programming model

•GRIDSs, COMPSs

• Tailored for Grids or clusters

• Data dependency analysis based on files

• C/C++, Java

•SMPSs

• Tailored for SMPs or homogeneous multicores

• C or Fortran

•CellSs

• Tailored for Cell/B.E. processor

• C or Fortran


CellSs: Syntax example - matrix multiply

int main (int argc, char **argv) {
    int i, j, k;
    …

    initialize(A, B, C);

    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                block_addmultiply( C[i][j], A[i][k], B[k][j]);
}

static void block_addmultiply( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;

    /* BS is the block size; B is the input block */
    for (i=0; i < BS; i++)
        for (j=0; j < BS; j++)
            for (k=0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

[Figure: each matrix has NB x NB blocks; each block has BS x BS elements (labelled B in the figure).]


CellSs: Syntax example - matrix multiply

int main (int argc, char **argv) {
    int i, j, k;
    …

    initialize(A, B, C);

    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                block_addmultiply( C[i][j], A[i][k], B[k][j]);
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;

    /* BS is the block size; B is the input block */
    for (i=0; i < BS; i++)
        for (j=0; j < BS; j++)
            for (k=0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

[Figure: each matrix has NB x NB blocks; each block has BS x BS elements (labelled B in the figure).]


CellSs: Syntax

• Pragma syntax:

#pragma css task [input (<input parameters>)] \
                 [output (<output parameters>)] \
                 [inout (<input/output parameters>)] \
                 [highpriority]
void task(<parameters>) { ...

#pragma css wait on(<data address>)

#pragma css barrier

#pragma css start

#pragma css finish
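A minimal sketch of how these pragmas fit together, including the `output` clause, might look as follows. The function names `vec_add` and `run_example` and the array size are made up for the illustration. A compiler that does not know the `css` pragmas ignores them, so the same file also builds and runs as an ordinary sequential C program; whether CellSs itself accepts this exact layout is not verified here.

```c
#define N 8

static float a[N], b[N], c[N];

/* Task: reads a and b, writes c. The clauses give the direction of each
   parameter; a plain C compiler simply ignores the pragma. */
#pragma css task input(a, b) output(c)
void vec_add(float a[N], float b[N], float c[N])
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

float run_example(void)
{
    float sum = 0.0f;

    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }

#pragma css start
    vec_add(a, b, c);          /* under CellSs this would spawn a task */
#pragma css wait on(c)
    for (int i = 0; i < N; i++)    /* c is only read after the wait */
        sum += c[i];
#pragma css finish

    return sum;
}
```

Compiled sequentially, `run_example()` returns the sum of 3*i for i = 0..7.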


CellSs: Syntax

• Examples: task selection

#pragma css task input(A, B) inout(C)
void block_addmultiply( float C[N][N], float A[N][N], float B[N][N] ) { ...

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
void block_addmultiply( float *C, float *A, float *B ) { ...

#pragma css task input(A[BS][BS], B[BS][BS], BS) inout(C[BS][BS])
void block_addmultiply( float *C, float *A, float *B, int BS ) { ...

• Examples: waiting for data

#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse) { ...
...

are_blocks_equal (X[ii][jj], Y[ii][jj], &sq_error);
#pragma css wait on (sq_error)
if (sq_error > 0.0000001)


CellSs: Syntax

• Examples: synchronization

for (i=0; i < NB; i++)
    for (j=0; j < NB; j++)
        for (k=0; k < NB; k++)
            block_addmultiply( C[i][j], A[i][k], B[k][j]);
#pragma css barrier

• Examples: prioritization

#pragma css task input(lefthalo[32], tophalo[32], righthalo[32], \
                       bottomhalo[32]) inout(A[32][32]) highpriority
void jacobi (float *lefthalo, float *tophalo, float *righthalo,
             float *bottomhalo, float *A) { ... }


CellSs: Syntax

• Examples: CellSs program boundary

#pragma css start
for (i=0; i < NB; i++)
    for (j=0; j < NB; j++)
        for (k=0; k < NB; k++)
            block_addmultiply( C[i][j], A[i][k], B[k][j]);
#pragma css finish


CellSs: Syntax in Fortran

subroutine example()
  ...
  interface
    !$CSS TASK
    subroutine block_add_multiply(C, A, B, BS)
      implicit none
      integer, intent (in) :: BS
      real, intent (in) :: A(BS,BS), B(BS,BS)
      real, intent (inout) :: C(BS,BS)
    end subroutine
  end interface
  ...
  !$CSS START
  ...
  call block_add_multiply(C, A, B, BLOCK_SIZE)
  ...
  !$CSS FINISH
  ...
end subroutine

!$CSS TASK
subroutine block_add_multiply(C, A, B, BS)
  ...
end subroutine


CellSs compiler: Compiler phase

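[Figure: compiler phase of CELLSS-CC. app.c goes through code translation (mcc), which produces cellss-spu-cc_app.c, cellss-ppu-cc_app.c and app.tasks (tasks list). The SPE compiler and the PPE compiler turn the generated sources into objects (cellss-spu-cc_app.o and its PPU counterpart), and pack combines them into app.o.]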


CellSs compiler: Compiler phase

•Files

• app.c: User code, with CellSs annotations

• cellss-spu-cc_app.c: specific code generated for the spu (tasks code)

• cellss-ppu-cc_app.c: specific code generated for the ppu (main program)

• app.tasks: list of annotated tasks

•Compilation steps

• mcc: source to source compiler, based on the Mercurium compiler (BSC).

• SPE compiler: Generic SPE compiler (IBM SDK)

• PPE compiler: Generic PPE compiler (IBM SDK)

• pack: Specific CellSs module that combines objects (BSC)


CellSs compiler: Linker phase

[Figure: linker phase of CELLSS-CC.
Tools: unpack, glue code generator, SPE compiler, SPE linker, SPE embedder, PPE compiler, PPE linker.
Inputs: app.o (unpacked into cellss-spu-cc_app.o, cellss-ppu-cc_app.o and app.tasks), libCellSS-spu.a, libCellSS.so.
Intermediate files: exec-adapters.c, exec-registration.c, exec-adapters.o, exec-registration.o, exec-spu, exec-spu.o.
Output: exec.]


CellSs compiler: Linker phase

•Files

• exec-adapters.c: code generated for each of the annotated tasks to uniformly call them (“stubs”).

• exec-registration.c: code generated to register the annotated tasks

• Linker steps

• unpack: unpacks objects

• glue code generator: from all the *.tasks files of an application, generates a single “adapters” file and a single “registration” file per executable

• SPE, PPE compilers and linkers and SPE embedder (IBM SDK)


CellSs: Runtime

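[Figure: structure of the CellSs runtime.
PPE: the user main program runs on top of the CellSs PPU lib, with a main thread and a helper thread. Labelled functions: data dependence, data renaming (renaming table), scheduling, work assignment, synchronization.
SPE0, SPE1, SPE2, ...: the CellSs SPU lib wraps the original task code and performs DMA in, task execution, DMA out and synchronization.
Memory: user data, task control buffer, synchronization area.
Arrows between the components: tasks, stage in/out data, finalization signal.]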


CellSs: Runtime - argument renaming

•False dependences (WaW and WaR) are removed with dynamic renaming of arguments

for (i=0; i<N; i++) {
    T1 (…,…, block1);
    T2 (block1, …, block2[i]);
    T3 (block2[i],…,…);
}

• block1 is output from task T1
• block1 is input to task T2

[Figure: task graph T1_1, T2_1, T3_1, T1_2, T2_2, T3_2, ..., T1_N, T2_N, T3_N. All iterations share block1, which creates WaR and WaW dependences between consecutive iterations.]


CellSs: Runtime - argument renaming

•False dependences (WaW and WaR) are removed with dynamic renaming of arguments

[Figure: the same loop and task graph after renaming. Iteration i uses its own instance of block1 (block1_1, block1_2, ..., block1_N), so the WaR and WaW dependences between iterations disappear.]
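The idea of renaming can be sketched with a small table that maps the address used in the program to its most recent instance. This is a simplified illustration, not the CellSs implementation: `rename_table_t`, `rename_for_output` and `rename_for_input` are hypothetical names, and releasing old instances (which a real runtime must do once their readers finish) is left out.

```c
#include <stdlib.h>

#define MAX_OBJECTS 16

typedef struct {
    const void *original;  /* address used in the user program */
    void *current;         /* most recent renamed instance */
} rename_entry_t;

typedef struct {
    rename_entry_t entries[MAX_OBJECTS];
    int count;
} rename_table_t;

static rename_entry_t *find_entry(rename_table_t *t, const void *original)
{
    for (int i = 0; i < t->count; i++)
        if (t->entries[i].original == original)
            return &t->entries[i];
    return NULL;
}

/* A task that writes `original` gets a fresh instance, so it does not have
   to wait for earlier readers (WaR) or writers (WaW) of the old instance.
   Returns NULL if the table is full or memory is exhausted. */
void *rename_for_output(rename_table_t *t, const void *original, size_t size)
{
    rename_entry_t *e = find_entry(t, original);
    if (e == NULL) {
        if (t->count == MAX_OBJECTS)
            return NULL;
        e = &t->entries[t->count++];
        e->original = original;
        e->current = NULL;
    }
    void *fresh = malloc(size);
    if (fresh != NULL)
        e->current = fresh;
    return fresh;
}

/* A task that reads `original` is redirected to the latest instance,
   or to the original storage if it was never renamed. */
void *rename_for_input(rename_table_t *t, void *original)
{
    rename_entry_t *e = find_entry(t, original);
    return (e != NULL && e->current != NULL) ? e->current : original;
}
```

With this table, T1 in iteration 2 writes a different buffer than T1 in iteration 1, while T2 in each iteration still reads the instance produced by the T1 that precedes it.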


CellSs: Runtime - Dependence detection

[Figure: data structures for dependence detection. The renaming table maps a data address (@L) to its last renaming. Each object instance stores type, size, ..., *obj, *producer, *prev and a number of users; instances are chained through *prev, and *producer points to a task in the task dependence graph.]


CellSs: Runtime – scheduling

•Scheduling strategy

• Critical path

• Locality

[Figure: three kinds of bundles: a bundle of dependent tasks (data locality in the SPE), a bundle of independent tasks, and a mixed bundle.]


CellSs: Runtime – scheduling

•Scheduling for locality
•Ready lists (Ri). A higher subindex indicates higher priority according to memory locality
•Scheduling selects tasks from high-priority ready lists (higher “i”)

[Figure: tasks u and v in ready lists Ri and Ri+1, shown in two arrangements, with
ReadyLocs(u) = {@A, @B}
ReadyLocs(v) = {@C, @D}
LocHints(SPEj) = {@X, @Y, @B, @Z}
LocHints(SPEj+1) = {@U, @V, @W}]
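One plausible reading of the figure is that a task is preferred for an SPE when some of its ready data is already hinted to be in that SPE. The sketch below only counts that overlap; the function name `locality_score` and the use of plain integers for addresses are assumptions made for the illustration, and the real priority computation in CellSs may differ.

```c
#include <stddef.h>

/* Counts how many addresses in ReadyLocs(task) also appear in
   LocHints(SPE). A higher count suggests better data locality. */
int locality_score(const unsigned *ready_locs, size_t n_ready,
                   const unsigned *loc_hints, size_t n_hints)
{
    int score = 0;
    for (size_t i = 0; i < n_ready; i++) {
        for (size_t j = 0; j < n_hints; j++) {
            if (ready_locs[i] == loc_hints[j]) {
                score++;
                break;          /* count each ready location once */
            }
        }
    }
    return score;
}
```

With the sets from the figure, task u shares @B with SPEj and scores 1 there, while task v scores 0.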


CellSs: Runtime – scheduling

“Co-parent” edges

•Co-parent edges are added between tasks that share a direct descendant

•Odep(u): number of outstanding dependences of task u outside the current bundle


CellSs: Runtime – scheduling

“Co-parent” edges

•Co-parent edges are added between tasks that share a direct descendant

•Maximum of two co-parent edges (due to implementation costs)


CellSs: Runtime – scheduling

•Scheduling algorithm
• Ri: ready lists
• Btemp: candidates for being integrated in a bundle
• B: bundle to be scheduled

while not ScheduleStop {
    t = head (R_M) | M = max{i | 0 < i < N} and R_i not empty
    add_to_head (t, Btemp);
    while DepthSearch {
        u = head (Btemp);
        if Odep(u)==0 {
            add_to_tail (u, B);
            if ((b = CoParent (u)) != 0) add_to_head (b, Btemp);
            else if ((s = successor (u)) != 0) add_to_tail (s, Btemp);
        }
    }
}


CellSs: Runtime – scheduling

[Figure: task dependence graph with tasks numbered 1 to 17, used in the walkthrough of the scheduling algorithm.]

• Imagine

• R1 = {1} and

• R0 = {2, 3, 4, 7, 9, 10, 11}

•External loop

• t = 1;

• Btemp = {1}

• internal loop: iteration 1

• u = 1; Btemp = { };

• B = {u}

• b = 2; Btemp = {2};


CellSs: Runtime – scheduling


• internal loop: iteration 2

• u = 2; Btemp = { };

• B = {1,2}

• b = 3; Btemp = {3};

• internal loop: iteration 3

• u = 3; Btemp = { };

• B = {1,2,3}

• b = 4; Btemp = {4};

• internal loop: iteration 4

• u = 4; Btemp = { };

• B = {1,2,3,4}

• s = 5; Btemp = {5};


CellSs: Runtime – scheduling


• internal loop: iteration 5

• u = 5; Btemp = { };

• B = {1,2,3,4,5}

• s = 6; Btemp = {6};

• internal loop: iteration 6

• u = 6; Btemp = { };

• B = {1,2,3,4,5,6}

• b = 7; Btemp = {7};

• internal loop: iteration 7

• u = 7; Btemp = { };

• B = {1,2,3,4,5,6,7}

• s = 8; Btemp = {8};


CellSs: Runtime – scheduling


• internal loop: iteration 8

• u = 8; Btemp = { };

• B = {1,2,3,4,5,6,7,8}

• b = 9; Btemp = {9};

• internal loop: ends since the maximum bundle size is reached
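The inner loop of the walkthrough can be reproduced with a short C sketch. It is an illustration under assumptions: the function `build_bundle` and the array-based deque are hypothetical, the co-parent and successor tables in the usage note are read off the trace (the figure itself is not available), and the DepthSearch condition is taken to mean "Btemp is not empty and the bundle is not full".

```c
#define DEQUE_CAP 64

/* Builds one bundle starting from task `first`.
   coparent[u], successor[u]: task ids, 0 meaning "none".
   odep[u]: outstanding dependences of u outside the bundle.
   Returns the number of tasks stored in bundle[]. */
int build_bundle(int first, const int *coparent, const int *successor,
                 const int *odep, int *bundle, int max_bundle)
{
    int btemp[DEQUE_CAP];                 /* Btemp as an array deque */
    int head = DEQUE_CAP / 2;             /* empty when head == tail */
    int tail = DEQUE_CAP / 2;
    int n = 0;

    if (max_bundle > DEQUE_CAP / 2 - 1)   /* keep the deque in bounds */
        max_bundle = DEQUE_CAP / 2 - 1;

    btemp[--head] = first;                /* add_to_head(t, Btemp) */
    while (head < tail && n < max_bundle) {
        int u = btemp[head++];            /* u = head(Btemp) */
        if (odep[u] != 0)
            continue;                     /* u still waits on outside tasks */
        bundle[n++] = u;                  /* add_to_tail(u, B) */
        if (coparent[u] != 0)
            btemp[--head] = coparent[u];  /* add_to_head(b, Btemp) */
        else if (successor[u] != 0)
            btemp[tail++] = successor[u]; /* add_to_tail(s, Btemp) */
    }
    return n;
}
```

Starting from task 1 with a maximum bundle size of 8, and with the co-parent links 1-2, 2-3, 3-4, 6-7, 8-9 and the successors 4-5, 5-6, 7-8 taken from the trace, the sketch yields the bundle {1, 2, 3, 4, 5, 6, 7, 8}.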


CellSs: Runtime

•Paraver view of the runtime behavior

[Figure: Paraver trace of the runtime with one bundle highlighted.]

• Main thread: runs user code, and adds tasks to and removes tasks from the task graph
• SPEs: execute the tasks' code
• Helper thread: schedules tasks and synchronizes with the SPEs


CellSs: Runtime – specific SPE library features

•Data dependence analysis, data renaming and task scheduling are performed in the CellSs PPE runtime library

•The CellSs SPE runtime library implements specific features to assist the CellSs PPE runtime library, but independently

• Early callback

• Minimal stage-out

• Software cache in the SPE Local Store

• Double buffering


CellSs: Runtime – specific SPE library features

•Early call-back

• Initially, task completion is communicated on a per-bundle basis
• There are cases where this limits the application
• Task A in the example
• An early callback after the limiting task enables the scheduling of new bundles
• Condition: the task has more than one outgoing dependency


CellSs: Runtime – specific SPE library features

•Minimal stage-out

• The outputs of each task in a bundle are written back to main memory
• If a later task inside the bundle rewrites the same output, the earlier value does not need to be written back to main memory
• The case in the figure cannot happen!
• Thanks to renaming
• Example: matmul
• C[i][j] += A[i][k]*B[k][j]

[Figure: tasks X, Y and Z of a bundle, labelled "writes A", "reads A" and "writes A"; in one panel the second write is renamed to A'.]


CellSs: Runtime – specific SPE library features

...
#pragma css task input(A, B) inout(C)
block_addmultiply( C[i][j], A[i][k], B[k][j])

[Figure: blocks A[i][k] and B[k][j] feeding block C[i][j].]

• For each operation, two blocks of data are fetched from PPE memory into the SPE local storage
• Clusters of dependent tasks are scheduled to the same SPE
• The inout block is kept in the local storage and only put back in PPE memory once (reuse)


CellSs: Runtime – specific SPE library features

•Software cache in the SPE Local Store

• Maintained by the SPE runtime

• LRU replacement strategy

• PPE scheduling is not aware of this behavior
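An LRU replacement policy of this kind can be sketched as follows. This is a generic illustration, not the CellSs SPE code: `ls_cache_t`, `cache_access` and the slot count are hypothetical, and only the bookkeeping is shown (a real software cache would also hold the data and issue the DMA transfers).

```c
#define CACHE_SLOTS 4

typedef struct {
    unsigned addr[CACHE_SLOTS];          /* main-memory address held in each slot */
    unsigned long last_use[CACHE_SLOTS]; /* logical time of the last access */
    int valid[CACHE_SLOTS];
    unsigned long clock;                 /* incremented on every access */
} ls_cache_t;

/* Returns 1 on a hit. On a miss, places addr in a free slot or replaces
   the least recently used one, and returns 0. */
int cache_access(ls_cache_t *c, unsigned addr)
{
    int victim = 0;

    c->clock++;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->valid[i] && c->addr[i] == addr) {
            c->last_use[i] = c->clock;   /* hit: refresh its age */
            return 1;
        }
    }
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!c->valid[i]) {              /* prefer an empty slot */
            victim = i;
            break;
        }
        if (c->last_use[i] < c->last_use[victim])
            victim = i;                  /* otherwise the oldest one */
    }
    c->addr[victim] = addr;
    c->valid[victim] = 1;
    c->last_use[victim] = c->clock;
    return 0;
}
```

A zero-initialized `ls_cache_t` starts empty; a block that was touched recently survives while the oldest one is replaced.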


CellSs: Runtime - specific SPE library features

•Double buffering

• CellSs overlaps DMA transfers with computations

[Figure: timeline of one SPE processing a bundle (Task 1 in bundle, Task 2 in bundle, ..., Task N in bundle). Phases shown:
• DMA programming: reading task control buffer
• Waiting for DMA transfer
• DMA programming: reading data
• Task execution overlapped with data transfers
• DMA programming: writing data
• Synchronization with helper thread]
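The buffer rotation behind double buffering can be sketched on the host, with `memcpy` standing in for the DMA transfers. This is an assumption-laden illustration: `process_blocks` and `BLOCK` are made-up names, and since `memcpy` is synchronous the sketch shows only how the two buffers alternate, not the actual overlap that asynchronous DMA gives on an SPE.

```c
#include <string.h>

#define BLOCK 4   /* elements per block (illustrative) */

/* Sums nblocks blocks of BLOCK floats stored contiguously in main_mem,
   always computing on one buffer while the other is being filled. */
float process_blocks(const float *main_mem, int nblocks)
{
    float buf[2][BLOCK];
    float total = 0.0f;
    int cur = 0;

    if (nblocks <= 0)
        return 0.0f;

    /* "DMA in" of the first block */
    memcpy(buf[cur], main_mem, sizeof buf[cur]);

    for (int b = 0; b < nblocks; b++) {
        int next = 1 - cur;

        /* start fetching block b+1 into the other buffer */
        if (b + 1 < nblocks)
            memcpy(buf[next], main_mem + (b + 1) * BLOCK, sizeof buf[next]);

        /* compute on the current buffer (the "task execution") */
        for (int i = 0; i < BLOCK; i++)
            total += buf[cur][i];

        cur = next;   /* swap roles for the next iteration */
    }
    return total;
}
```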


CellSs: Runtime - specific SPE library features

•Double buffering: Paraver view

[Figure: Paraver trace. Labels: SPE reads data; SPE executes task.]


CellSs: Runtime - specific SPE library features

•Double buffering: Paraver view

[Figure: Paraver trace. Labels: DMA programming (twice); SPE waits for DMA in.]


CellSs: Runtime - specific SPE library features

•Double buffering: Paraver view

[Figure: Paraver trace. Labels: DMA out programming; DMA in programming; SPE waits for DMA in.]


CellSs: Runtime - specific SPE library features

•Double buffering: Paraver view

[Figure: Paraver trace. Labels: DMA out programming; SPE waits for DMA out (all).]


CellSs: Installing CellSs

•Download code:
• www.bsc.es/cellsuperscalar -> download
• gunzip, tar
• Installation instructions are in the CellSs manual
• www.bsc.es/cellsuperscalar -> documents
• Run the configure script with the installation directory as prefix

./configure --prefix=/opt/CellSS

• Other options can be specified

./configure --help

make

make install


CellSs: Programming examples

•Cholesky factorization
•Common matrix operation used to solve normal equations in linear least squares problems.
•Calculates a triangular matrix (L) from a symmetric and positive definite matrix A.

Cholesky(A) = L

L · L^T = A

•Different possible implementations, depending on how the matrix is traversed (by rows, by columns, left-looking, right-looking)
• It can be decomposed in block operations


CellSs: Programming examples

• In each iteration red and blue blocks are updated

• SPOTF: Compute the Cholesky factorization of the diagonal block.

• STRSM: Compute the column panel

• SSYRK: Update the rest of the matrix


CellSs: Programming examples

Cholesky factorization

main (){
...
  for (i = 0; i < DIM; i++) {
    for (j = 0; j < i; j++) {        /* tiles left of the diagonal in row i */
      for (k = 0; k < j; k++) {
        sgemm_tile( A[i][k], A[j][k], A[i][j] );
      }
      strsm_tile( A[j][j], A[i][j] );
    }
    for (j = 0; j < i; j++) {
      ssyrk_tile( A[i][j], A[i][i] );
    }
    spotrf_tile( A[i][i] );
  }
...
  for (int i = 0; i < DIM; i++) {
    for (int j = 0; j < DIM; j++) {
#pragma css wait on (A[i][j])
      print_block(A[i][j]);
    }
  }
...
}

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C)

#pragma css task input (T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B)

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C)

#pragma css task inout(A[64][64])
void spotrf_tile(float *A)

[Figure: matrix of DIM x DIM tiles; each tile has 64 x 64 elements.]


CellSs: Programming examples

•Sparse LU
• More generic factorization than Cholesky
• Deals with non-symmetric matrices
• Calculates one lower triangular matrix (L) and one upper triangular matrix (U) whose product equals a row permutation of the original matrix

Perm(A) = L*U

• Difficult to program for Cell, since some operations work on columns (not blocks)
• The example shown here is a simplified version (without pivoting) based on an initial sparse matrix
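For reference, LU factorization without pivoting on a single dense block can be sketched as below. The slides do not show the bodies of the block kernels, so this is only a plausible shape for the diagonal-block kernel; the name `lu0_dense` and the row-major layout are assumptions.

```c
/* In-place LU factorization without pivoting of an n x n row-major block.
   On return, the strict lower triangle holds L (unit diagonal implied) and
   the upper triangle, including the diagonal, holds U.
   Returns -1 if a zero pivot is found, 0 otherwise. */
int lu0_dense(float *a, int n)
{
    for (int k = 0; k < n; k++) {
        if (a[k * n + k] == 0.0f)
            return -1;                      /* would need pivoting */
        for (int i = k + 1; i < n; i++) {
            a[i * n + k] /= a[k * n + k];   /* multiplier l_ik */
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return 0;
}
```

For the 2 x 2 matrix {4, 2, 2, 5} the result is {4, 2, 0.5, 4}: L has 0.5 below the diagonal and U is {{4, 2}, {0, 4}}.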


CellSs: Programming examples

Sparse LU

int main(int argc, char **argv) {
  int ii, jj, kk;
  …
  for (kk=0; kk<NB; kk++) {
    lu0(A[kk][kk]);
    for (jj=kk+1; jj<NB; jj++)
      if (A[kk][jj] != NULL)
        fwd(A[kk][kk], A[kk][jj]);
    for (ii=kk+1; ii<NB; ii++)
      if (A[ii][kk] != NULL) {
        bdiv (A[kk][kk], A[ii][kk]);
        for (jj=kk+1; jj<NB; jj++)
          if (A[kk][jj] != NULL) {
            if (A[ii][jj]==NULL)
              A[ii][jj]=allocate_clean_block();
            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
          }
      }
  }
}

void lu0(float *diag);
void bdiv(float *diag, float *row);
void bmod(float *row, float *col, float *inner);
void fwd(float *diag, float *col);

[Figure: sparse matrix of NB x NB blocks; each block has B x B elements.]


CellSs: Programming examples

The main program is the same as on the previous slide. The slide highlights two points in it:
• Data dependent parallelism: tasks are only called for blocks that exist (the != NULL tests)
• Dynamic main memory allocation: A[ii][jj]=allocate_clean_block();

#pragma css task inout(diag[B][B]) highpriority
void lu0(float *diag);

#pragma css task input(diag[B][B]) inout(row[B][B])
void bdiv(float *diag, float *row);

#pragma css task input(row[B][B],col[B][B]) inout(inner[B][B])
void bmod(float *row, float *col, float *inner);

#pragma css task input(diag[B][B]) inout(col[B][B])
void fwd(float *diag, float *col);


CellSs: Programming examples

•SPU memory functionality: tailored CellSs API to deal with memory issues in the SPU

•Dynamic memory allocation
• Local Storage (LS) space in each SPU is limited, so CellSs tries to control as much of it as possible

#include <css_malloc.h>

void *css_malloc (unsigned int size);

void css_free (void *chunk);


CellSs: Programming examples

•Example: Dynamic memory allocation

#pragma css task input(bs, log2_N, is_forward, twiddle) inout(data, sync)
static void FFT1D_1 (int bs, int log2_N, float twiddle[CUBE_SIZE*2], int is_forward,
                     float data[bs][2*CUBE_SIZE], int sync[1])
{
  FFT1D_core ( bs, data, log2_N, twiddle, is_forward);
}

static void FFT1D_core (int bs, float data[bs][2*CUBE_SIZE], int log2_N,
                        float twiddle[CUBE_SIZE*2], int is_forward)
{
  int i;
  int n_floats_elems = (1 << log2_N)*2;
  float *work_re = css_malloc(sizeof(float)*n_floats_elems);
  float *work_im = css_malloc(sizeof(float)*n_floats_elems);
  for(i=0; i<bs; i++)
    spe_FFT_1D_core (log2_N, &data[i][0], twiddle, is_forward, work_re, work_im);
  css_free((void *)work_re);
  css_free((void *)work_im);
}


CellSs: Programming examples

•DMA accesses

• CellSs handles all data transfers from main memory to SPU Local Store

• Some applications may need to do explicit data transfer from main memory

• For transfers of 1, 2, 4, 8 bytes or multiples of 16 bytes up to 16 KB

#include <css_dma.h>

void css_get_a (void *ls, uint32_t ea, unsigned int dma_size, tagid_t tag);

void css_put_a (void *ls, uint32_t ea, unsigned int dma_size, tagid_t tag);

• ls: pointer to a 16-byte aligned allocated buffer in LS

• ea: pointer to main memory

• dma_size: size of the block

• tag: identifier of the DMA transfer
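The size and alignment rules stated above can be checked with two small helpers before choosing between the aligned and the general transfer calls. These helpers are not part of the CellSs API; `dma_size_ok` and `is_16_aligned` are hypothetical names for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* 1 if n is a valid size for an aligned DMA transfer as described above:
   1, 2, 4 or 8 bytes, or a multiple of 16 bytes up to 16 KB. */
int dma_size_ok(size_t n)
{
    if (n == 1 || n == 2 || n == 4 || n == 8)
        return 1;
    return n >= 16 && n % 16 == 0 && n <= 16 * 1024;
}

/* 1 if the pointer is aligned on a 16-byte boundary. */
int is_16_aligned(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

A 12-byte request, for example, fails `dma_size_ok` and would have to go through a transfer routine without these requirements.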


CellSs: Programming examples

•DMA accesses

• Obtaining a tag: returns a valid tag for a DMA transfer

tagid_t css_tag (void);

• Synchronization

void css_sync (tagid_t tag);

• For DMA transfers not meeting the previous requirements

void css_get (void *ls, unsigned int address, unsigned int dma_size, tagid_t tag);

void css_put (void *ls, unsigned int address, unsigned int dma_size, tagid_t tag);

• Example:

float *blocks = (float *)css_malloc(N*sizeof(Complex));

tag = css_tag ();

css_get_a (blocks, addr, (unsigned int)(N*sizeof(Complex)), tag);

css_sync(tag);


CellSs: Programming examples

•Strided Memory access

• Interface to scatter/gather data from 1D, 2D and 3D arrays

#include <css_stride.h>

dmal_h_t *css_gather_1d (void *ls, unsigned int start, int chunk, int stride, size_t size, size_t e_size, dmal_h_t *c_list);

dmal_h_t *css_scatter_1d (void *ls, unsigned int start, int chunk, int stride, size_t size, size_t e_size, dmal_h_t *c_list);

• ls: pointer to a 16-byte aligned allocated buffer in LS

• c_list: lets the same memory access pattern be reused (reuses DMA lists)

• size: number of objects to be copied

• e_size: size of one element

[Figure: 1D array annotated with start, chunk and stride.]


CellSs: Programming examples

•Strided Memory access

#include <css_stride.h>

dmal_h_t *css_gather_2d (void *ls, unsigned int start, int local_x, int local_y, int global_x, size_t e_size, dmal_h_t *c_list);

dmal_h_t *css_scatter_2d (void *ls, unsigned int start, int local_x, int local_y, int global_x, size_t e_size, dmal_h_t *c_list);

[Figure: 2D array of width global_x; the sub-block of local_x by local_y elements begins at start.]


CellSs: Programming examples

•Strided Memory access

#include <css_stride.h>

dmal_h_t *css_gather_3d (void *ls, unsigned int start, int local_x, int local_y, int local_z, int global_x, int global_z, size_t e_size, dmal_h_t *c_list);

dmal_h_t *css_scatter_3d (void *ls, unsigned int start, int local_x, int local_y, int local_z, int global_x, int global_z, size_t e_size, dmal_h_t *c_list);

•Example:

#pragma css task input (A_p) output (A[16*16])
void example (float *A, unsigned int A_p)
{
  dmal_h_t *entry = css_gather_1d (A, A_p, 4, 16, 64, sizeof(float), NULL);
  css_sync(entry->tag);
}
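The access pattern of a 1D gather can be emulated on the host to see which elements end up in the local buffer. The semantics below are an assumption inferred from the parameter names (copy `size` elements in total, `chunk` consecutive elements at a time, with chunk starts `stride` elements apart); the function `gather_1d_host` is hypothetical and uses `memcpy` instead of DMA lists.

```c
#include <stddef.h>
#include <string.h>

/* Copies `size` elements of `e_size` bytes from src to dst, taking `chunk`
   consecutive elements every `stride` elements. Returns the number of
   elements copied (0 if the parameters make no sense). */
size_t gather_1d_host(void *dst, const void *src, int chunk, int stride,
                      size_t size, size_t e_size)
{
    char *out = dst;
    const char *in = src;
    size_t copied = 0;

    if (chunk <= 0 || stride < chunk)
        return 0;

    while (copied < size) {
        size_t n = size - copied;
        if (n > (size_t)chunk)
            n = (size_t)chunk;              /* last chunk may be shorter */
        memcpy(out + copied * e_size, in, n * e_size);
        copied += n;
        in += (size_t)stride * e_size;      /* jump to the next chunk */
    }
    return copied;
}
```

Under this reading, the call in the example above (chunk 4, stride 16, size 64) would pick the first 4 floats of each group of 16, for 16 groups.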


CellSs: Programming examples

Another Cholesky

Original matrix A stored in consecutive positions in memory by rows; N = NB x B.

void sequential_cholesky(void)
{
  int STEP;
  int bm;

  for (STEP = 0; STEP <= STEPS-1; STEP++) {
    // Update and factorize the current diagonal block
    my_cholesky_ssyrk (STEP, NB, N, A);
    my_cholesky_spotf2(STEP, NB, N, A);

    if (STEP < STEPS-1) {
      // Compute the current block column
      for (bm = 0; bm < STEPS-STEP-1; bm++) {
        my_cholesky_sgemm(STEP, bm, NB, N, A);
        my_cholesky_strsm(STEP, bm, NB, N, A);
      }
    }
  }
}

void my_cholesky_ssyrk(int STEP, int nb, int N, float *A)
{
  for (int i = 0; i < STEP; i++)  // rank update for A[d][d]
  {
    ssyrk(A[STEP*B][i*B], A[STEP*B][STEP*B]);
  }
}

[Figure: matrix A of N x N elements, divided into NB x NB blocks of B x B elements.]


CellSs: Programming examples

void my_cholesky_ssyrk(int STEP, int nb, int N, float *A)
{
    for (int i = 0; i < STEP; i++) // rank update for A[d][d]
    {
        check_data_av(A, ShA, STEP, i   , N, nb, B);
        check_data_av(A, ShA, STEP, STEP, N, nb, B);
        ssyrk (ShA[STEP*nb+i], ShA[STEP*nb+STEP]);
    }
}

[Figure: matrix A and its shadow structure ShA, annotated with NB, B, STEP and i.]


CellSs: Programming examples

void check_data_av(float* M, float** shadow, int i, int j, int N, int nb, int B)
{
    int pp;
    // shadow holds nb x nb block pointers; each block has B x B floats
    if (shadow[i*nb+j] == NULL) {
        shadow[i*nb+j] = (float*) malloc(B*B*sizeof(float));
        pp = (int)&M[i*N*B+j*B];
        copy_to_shadow_block (&M[i*N*B+j*B], pp, B, N, shadow[i*nb+j]);
    }
}

void copy_back_to_matrix(float* M, float** shadow, int N, int nb, int B)
{
    int i, j, pp;
    for (i = 0; i < nb; i++) {
        for (j = 0; j < nb; j++) {
            if (shadow[i*nb+j] != NULL) {
                pp = (int)&M[i*N*B+j*B];
                copy_from_shadow_block (&M[i*N*B+j*B], pp, B, N, shadow[i*nb+j]);
            }
        }
    }
}

#pragma css task input (address[1], main_address, b, n) output (WA[64][64])
void copy_to_shadow_block (float *address, int main_address, int b, int n, float *WA);

#pragma css task input (WA[64][64], main_address, b, n) inout (address[1])
void copy_from_shadow_block (float *address, int main_address, int b, int n, float *WA);

#pragma css task inout(A[64][64]) highpriority
void spotrf_tile(float *A);

#pragma css task input (A[64][64]) inout(B[64][64])
void ssyrk_tile(float *A, float *B);

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C);

#pragma css task input (T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B);

Could be managed as a cache!
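The lazy "copy on first use" idea behind the shadow blocks can be illustrated in plain C. This is a hedged sketch only: `get_shadow_block` is a hypothetical name, a host-side `memcpy` stands in for the `copy_to_shadow_block` task, and no eviction policy (which a real cache would need) is shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the contiguous copy of block (i, j), creating it on first use.
 * M is an N x N row-major matrix split into nb x nb blocks of B x B
 * floats; shadow holds nb*nb pointers, all NULL initially.
 * Returns NULL if the allocation fails. */
float *get_shadow_block(const float *M, float **shadow,
                        int i, int j, int N, int nb, int B)
{
    assert(N == nb * B && i >= 0 && i < nb && j >= 0 && j < nb);
    float **slot = &shadow[i * nb + j];
    if (*slot == NULL) {
        *slot = malloc((size_t)B * B * sizeof(float));
        if (*slot == NULL)
            return NULL;
        /* First element of block (i, j) inside the row-major matrix. */
        const float *first = &M[(size_t)i * B * N + (size_t)j * B];
        for (int row = 0; row < B; row++)
            memcpy(*slot + (size_t)row * B, first + (size_t)row * N,
                   (size_t)B * sizeof(float));
    }
    return *slot;
}
```

A second request for the same block returns the existing copy, so each block is transferred at most once.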


CellSs: Programming examples

#pragma css task input (address[1], main_address, b, n) output (WA[64][64])
void copy_to_shadow_block (float *address, int main_address, int b, int n, float *WA)
{
    // address is a trick to ensure dependencies:
    // it points to the first element of the block, as a representation
    // of the whole block
    dmal_h_t *entry;

    entry = css_gather_2d (WA, main_address, b, b, n, sizeof(float), NULL);
    ls_sync(entry->tag);
}

#pragma css task input (WA[64][64], main_address, b, n) inout (address[1])
void copy_from_shadow_block (float *address, int main_address, int b, int n, float *WA)
{
    dmal_h_t *entry;

    address[0] = WA[0];
    // As address is inout, when the task finishes it copies its local value back
    // to the original position in main memory, so we need to assign the correct
    // value to that local variable.

    entry = css_scatter_2d (WA, main_address, b, b, n, sizeof(float), NULL);
    ls_sync(entry->tag);
}


CellSs: Programming examples

Checking LU

int main(int argc, char* argv[])
{
    ...
    copy_mat (A, origA);
    LU (A);
    split_mat (A, L, U);
    clean_mat (A);
    sparse_matmult (L, U, A);
    compare_mat (origA, A);
}

#pragma css task input(Src) output(Dst)
void copy_block (float Src[BS][BS], float Dst[BS][BS]);

void copy_mat (float *Src, float *Dst)
{
    ...
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            ...
            copy_block(Src[ii][jj], block);
            ...
}

#pragma css task input(A) output(L,U)
void split_block (float A[BS][BS], float L[BS][BS], float U[BS][BS]);

void split_mat (float *LU[NB][NB], float *L[NB][NB], float *U[NB][NB])
{
    ...
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++) {
            ...
            split_block (LU[ii][jj], L[ii][jj], U[ii][jj]);
            ...
        }
}


CellSs: Programming examples

Checking LU

void clean_mat (p_block_t Src[NB][NB])
{
    int ii, jj;

    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            if (Src[ii][jj] != NULL) {
                free (Src[ii][jj]);
                Src[ii][jj] = NULL;
            }
}

#pragma css task output(Dst)
void clean_block (float Dst[BS][BS]);

void clean_mat (p_block_t Src[NB][NB])
{
    int ii, jj;

    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            if (Src[ii][jj] != NULL) {
                clean_block(Src[ii][jj]);
            }
}


CellSs: Programming examples

Checking LU

void sparse_matmult (float *A[NB][NB], float *B[NB][NB], float *C[NB][NB])
{
    int ii, jj, kk;

    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            for (kk=0; kk<NB; kk++)
                if ((A[ii][kk] != NULL) && (B[kk][jj] != NULL)) {
                    if (C[ii][jj] == NULL)
                        C[ii][jj] = allocate_clean_block();
                    block_matmul (A[ii][kk], B[kk][jj], C[ii][jj]);
                }
}

#pragma css task input(a,b) inout(c)
void block_matmul(float a[BS][BS], float b[BS][BS], float c[BS][BS])
{
    int i, j, k;

    for (i=0; i<BS; i++)
        for (j=0; j<BS; j++)
            for (k=0; k<BS; k++)
                c[i][j] += a[i][k]*b[k][j];
}


CellSs: Programming examples

Checking LU

#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse);

void compare_mat (p_block_t X[NB][NB], p_block_t Y[NB][NB], struct timeval *stop)
{
    ...
    Zero_block = allocate_clean_block();
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++) {
            if (X[ii][jj] == NULL)
                if (Y[ii][jj] == NULL)
                    sq_error[ii][jj] = 0.0f;
                else
                    are_blocks_equal(Zero_block, Y[ii][jj], &sq_error[ii][jj]);
            else
                are_blocks_equal(X[ii][jj], Y[ii][jj], &sq_error[ii][jj]);
        }
#pragma css finish
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++)
            if (sq_error[ii][jj] > 0.0000001L) {
                printf ("block [%d, %d]: detected mse = % 20lf\n", ii, jj, sq_error[ii][jj]);
                some_difference = TRUE;
            }
    if (some_difference == FALSE)
        printf ("matrices are identical\n");
}


CellSs: Programming examples

Checking LU

#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse);

void compare_mat (p_block_t X[NB][NB], p_block_t Y[NB][NB], struct timeval *stop)
{
    ...
    Zero_block = allocate_clean_block();
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++) {
            if (X[ii][jj] == NULL)
                if (Y[ii][jj] == NULL)
                    sq_error[ii][jj] = 0.0f;
                else
                    are_blocks_equal(Zero_block, Y[ii][jj], &sq_error[ii][jj]);
            else
                are_blocks_equal(X[ii][jj], Y[ii][jj], &sq_error[ii][jj]);
        }
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++)
#pragma css wait on (&sq_error[ii][jj])
            if (sq_error[ii][jj] > 0.0000001L) {
                printf ("block [%d, %d]: detected mse = % 20lf\n", ii, jj, sq_error[ii][jj]);
                some_difference = TRUE;
            }
    if (some_difference == FALSE)
        printf ("matrices are identical\n");
}
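The slides declare `are_blocks_equal` but do not show its body. One plausible body, given that the result is printed as an "mse", is a mean squared error over the block. The sketch below is an assumption, not the original task code: the name `are_blocks_equal_reference` and the small `BS` value are chosen only for illustration.

```c
#include <assert.h>

#define BS 4  /* block side chosen for this sketch only */

/* Mean squared error between two BS x BS blocks; 0.0f means identical. */
void are_blocks_equal_reference(float ref_block[BS][BS],
                                float to_comp[BS][BS], float *mse)
{
    assert(mse != 0);
    float sum = 0.0f;
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++) {
            float d = ref_block[i][j] - to_comp[i][j];
            sum += d * d;
        }
    *mse = sum / (float)(BS * BS);
}
```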


CellSs: Programming examples

Behavior: Checking LU

[Figure: execution of copy_mat (A, origA), LU (A), split_mat (A, L, U), clean_mat (A), sparse_matmult (L, U, A) and compare_mat (origA, A), shown without CellSs and with CellSs (for an NB=4 matrix).]


CellSs: Programming examples

Behavior: Checking LU

0: are_blocks_equal
1: bdiv_adapte
2: block_mpy_add
3: bmod
4: clean_block
5: copy_block
6: fwd
7: lu0
8: split_block


CellSs: Programming examples

•Molecular dynamics: Argon simulation

• Simulates the mobility of argon atoms in the gas state, in a constant volume at T=300K

• All electrostatic forces that each atom experiences due to the others are considered (Fi)

• Newton's second law is then applied to each atom: Fi = m*ai

• The initial velocities are random but reasonable for argon atoms at 300K

• To maintain a constant temperature throughout the process, the Berendsen algorithm is applied
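The temperature control in the last bullet can be sketched as a plain velocity rescaling. This is a hedged illustration in C rather than the deck's Fortran: the function names are hypothetical, and the sketch returns the squared scaling factor, whose square root is what the program applies to the velocities. A full Berendsen thermostat also includes a coupling time constant, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Instantaneous temperature from the speeds v[0..n):
 * T = m * <v^2> / (3 * kb), averaged over all atoms. */
double instantaneous_temperature(const double *v, size_t n,
                                 double mass_kg, double kb)
{
    assert(n > 0 && kb > 0.0);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += mass_kg * v[i] * v[i] / (3.0 * kb);
    return acc / (double)n;
}

/* Squared velocity scaling factor that restores target_t;
 * velocities are multiplied by its square root. */
double rescale_factor_squared(double target_t, double current_t)
{
    assert(current_t > 0.0);
    return target_t / current_t;
}
```

When the system is hotter than the target the factor is below 1 and the velocities shrink; when it is colder they grow.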


CellSs: Programming examples

Molecular dynamics: Argon simulation

program argon
...
!$CSS START
  do step=1,niter
    do ii=1, N, BSIZE
      do jj=1, N, BSIZE
        call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), x(jj), y(jj), z(jj), &
                      vx(ii), vy(ii), vz(ii))
      enddo
    enddo
    do jj=1, N, BSIZE
      call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))
    enddo
!$CSS BARRIER
    tins=0.e0
    do i=1,N
      tins=mkg*v(i)**2/3.e0/kb+tins
    enddo
    tins=tins/N
    lam1=sqrt(t/tins)
    do ii=1, N, BSIZE
      call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), x(ii), y(ii), z(ii))
    enddo
  enddo
!$CSS FINISH
end

program argon
...
  interface
    !$CSS TASK
    subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, &
                        vx, vy, vz)
      implicit none
      integer, intent(in) :: BSIZE, ii, jj
      real, intent(in), dimension(BSIZE) :: xi, yi, zi, xj, yj, zj
      real, intent(inout), dimension(BSIZE) :: vx, vy, vz
    end subroutine
    !$CSS TASK
    subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
      implicit none
      integer, intent(in) :: BSIZE
      real, intent(in) :: lam1
      real, intent(inout), dimension(BSIZE) :: vx, vy, vz
      real, intent(inout), dimension(BSIZE) :: x, y, z
    end subroutine
    !$CSS TASK
    subroutine v_mod(BSIZE, v, vx, vy, vz)
      implicit none
      integer, intent(in) :: BSIZE
      real, intent(out) :: v(BSIZE)
      real, intent(in), dimension(BSIZE) :: vx, vy, vz
    end subroutine
  end interface


CellSs: Programming examples

Molecular dynamics: Argon simulation

!$CSS TASK
subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
  ! subroutine code
end subroutine

!$CSS TASK
subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
  ! subroutine code
end subroutine

!$CSS TASK
subroutine v_mod(BSIZE, v, vx, vy, vz)
  ! subroutine code
end subroutine


CellSs: Programming examples

•Vector reduction

[Figure: array A of size NB, split into blocks of size BS.]


CellSs: Programming examples

Vector Reduction

int main(int argc, char* argv[])
{
    LEVELS = log2 ((double)NB/BS);
#pragma css start
    for (level = 0; level < LEVELS; level++) {
        range = exp2 ((double)level);
        for (i=0; i<NB; i+=2*BS*range)
            block_reduce(&A[i], &A[i+BS*range]);
    }
    block_reduce2(&A[0], &reduction);
#pragma css finish
}

#pragma css task input(B[64*64]) inout(A[64*64])
void block_reduce(float *A, float *B)
{
    int i;
    for (i=0; i<BS; i++)
        A[i] += B[i];
}

#pragma css task input(A) output(x)
void block_reduce2(float *A, float *x)
{
    int i;
    *x = 0.0;
    for (i=0; i<BS; i++)
        *x += A[i];
}
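The tree-shaped reduction can be checked sequentially on the host. The following is a minimal sketch under stated assumptions: `block_add` and `tree_reduce` are hypothetical names, the vector is described by a block count and a block size rather than by the slide's `NB` and `BS`, and ordinary function calls stand in for CellSs tasks.

```c
#include <assert.h>
#include <stddef.h>

/* a[0..bs) += b[0..bs): the work done by one block-reduction task. */
static void block_add(float *a, const float *b, size_t bs)
{
    for (size_t i = 0; i < bs; i++)
        a[i] += b[i];
}

/* Sums all nblocks * bs elements of v, pairing blocks at distance
 * 1, 2, 4, ... so that the additions within one level touch disjoint
 * blocks. v is overwritten with partial sums. */
float tree_reduce(float *v, size_t nblocks, size_t bs)
{
    assert(nblocks > 0 && bs > 0);
    for (size_t stride = 1; stride < nblocks; stride *= 2)
        for (size_t b = 0; b + stride < nblocks; b += 2 * stride)
            block_add(&v[b * bs], &v[(b + stride) * bs], bs);
    /* Block 0 now holds one partial sum per position; add them up. */
    float total = 0.0f;
    for (size_t i = 0; i < bs; i++)
        total += v[i];
    return total;
}
```

Because the calls within one level are independent, a task runtime could execute them concurrently, which is the point of the tree shape.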


CellSs: Programming examples

•Vector reduction

[Figure: array A of size NB, split into blocks of size BS, accumulated into a block initialized with the neutral element.]

- Less concurrency for one vector

- Fine when considering several


CellSs: Programming examples

Vector Reduction

int main(int argc, char* argv[])
{
    LEVELS = log2 ((double)NB/BS);
#pragma css start
    for (i=0; i<NB; i+=BS)
        block_reduce(&RB[0], &A[i]);

    block_reduce2(&RB[0], &reduction);
#pragma css finish
}

#pragma css task input(B[64*64]) inout(A[64*64])
void block_reduce(float *A, float *B)
{
    int i;
    for (i=0; i<BS; i++)
        A[i] += B[i];
}

#pragma css task input(A) output(x)
void block_reduce2(float *A, float *x)
{
    int i;
    *x = 0.0;
    for (i=0; i<BS; i++)
        *x += A[i];
}


CellSs: Compiling and running a CellSs application

• Usage: cellss-cc <options and sources>

• cellss-cc -help : lists usage

• Options:

• Regular compilation flags: -O<opt. level>, -g, -o <filename>, -D<macro>...

• Specific compilation flags:

• -t: tracing enabled. Generates Paraver tracefiles

• -WPPUp,<options>: passes comma separated list of flags to the PPU preprocessor

• -WPPUc,<options>: passes comma separated list of flags to the PPU compiler

• -WPPUl,<options>: passes comma separated list of flags to the PPU linker

• -WPPUf,<options>: passes comma separated list of flags to the PPU Fortran compiler

• -WSPUp,<options>: passes comma separated list of flags to the SPU preprocessor

• -WSPUc,<options>: passes comma separated list of flags to the SPU compiler

• -WSPUf,<options>: passes comma separated list of flags to the SPU Fortran compiler


CellSs: Compiling and running a CellSs application

•Examples

> cellss-cc -O3 *.c -o my_binary

> cellss-cc -O3 matmul.f90 -o matmul

> cellss-cc -O2 -WSPUc,-funroll-loops,-ftree-vectorize -WSPUc,-ftree-vectorizer-verbose=3 matmul.c -o matmul

> cellss-cc -O3 -k test.c -o test

> cellss-cc -O5 -o argon2 argon2_css.f90 -t


CellSs: Compiling and running a CellSs application

•Multiple source files

> cellss-cc -O3 -c code1.c

> cellss-cc -O3 -c code2.c

> cellss-cc -O3 -c code3.f90

> cellss-cc -O3 code1.o code2.o code3.o -o my_binary

•Use in a Makefile

CC = cellss-cc

LD = cellss-cc

CFLAGS = -O2 -g

SOURCES = code1.c code2.c code3.c

BINARY = my_binary

$(BINARY): $(SOURCES)


CellSs: Compiling and running a CellSs application

• Running

• Setting the LD_LIBRARY_PATH (not always needed):

export LD_LIBRARY_PATH=$HOME_CELLSS/lib:$LD_LIBRARY_PATH

• Setting the number of SPUs (default 8; valid from 1 to 16 in a blade, from 1 to 6 in a PS3)

export CSS_NUM_SPUS=6

• Normal execution from command line:

./my_binary arg1 arg2 ... argn


CellSs: Compiling and running a CellSs application

•Generating a tracefile

• Compile with -t flag

> cellss-cc my_app.c -t -O3 -o my_binary_instr

• Run normally

> ./my_binary_instr arg1 arg2 ...

• Tracefile is automatically generated. Default name gss-trace-xxx.ext

gss-trace-0001.prv

gss-trace-0001.row

gss-trace-0001.pcf

• All three files are used by the Paraver performance analyser and visualizer

• Changing the tracefile name:

> export CSS_TRACE_FILENAME=tracefilename

Will generate tracefiles: tracefilename-0001.prv, ...


CellSs: Compiling and running a CellSs application

•CellSs configuration file

• Optional, default settings applied if not provided

• Plain text file

scheduler.min_tasks = 32

scheduler.initial_tasks = 128

scheduler.max_strand_size = 8

task_graph.task_count_high_mark = 2000

task_graph.task_count_low_mark = 1500

renaming.memory_high_mark = 134217728

renaming.memory_low_mark = 104857600
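Each line of the file above has the shape `key = value`. How such a line could be read is sketched below. This is not the CellSs parser: `parse_config_line` is a hypothetical helper, and it assumes integer values and keys shorter than 64 characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parses one "key = value" line such as "scheduler.min_tasks = 32".
 * Returns 1 and fills key and value on success, 0 otherwise.
 * key must have room for 64 bytes. */
int parse_config_line(const char *line, char key[64], long *value)
{
    /* The scanset stops the key at '=', a space or a tab; the spaces
     * in the format skip optional whitespace around '='. */
    return sscanf(line, " %63[^= \t] = %ld", key, value) == 2;
}
```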


CellSs: Compiling and running a CellSs application

•CellSs configuration file

• scheduler.initial_tasks (128): defines the number of ready tasks that are generated at the beginning of the execution of an application, before their scheduling and execution in the SPEs starts

• scheduler.min_tasks (16): defines the minimum number of ready tasks needed to call the scheduler

• scheduler.max_strand_size (8): defines the maximum number of tasks that are simultaneously scheduled to an SPE

• task_graph.task_count_high_mark (1000): defines the maximum number of non-executed tasks that the graph will hold

• task_graph.task_count_low_mark (900): whenever the task graph reaches the number of tasks defined in the previous variable, the task graph generation is suspended until the number of non-executed tasks goes below this amount
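The high and low marks form a hysteresis: generation stops at the high mark and resumes only below the low mark. The sketch below illustrates that behaviour under stated assumptions; `graph_throttle` and `throttle_update` are hypothetical names, not part of the CellSs runtime.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    long high_mark;   /* suspend generation at this many pending tasks */
    long low_mark;    /* resume once pending tasks drop below this */
    bool suspended;
} graph_throttle;

/* Updates the throttle for the current number of non-executed tasks
 * and returns true while task generation should stay suspended. */
bool throttle_update(graph_throttle *t, long pending)
{
    assert(t->low_mark < t->high_mark);
    if (!t->suspended && pending >= t->high_mark)
        t->suspended = true;
    else if (t->suspended && pending < t->low_mark)
        t->suspended = false;
    return t->suspended;
}
```

The gap between the two marks avoids switching generation on and off for every single task that completes.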


CellSs: Compiling and running a CellSs application

•CellSs configuration file

• renaming.memory_high_mark (∞): defines the maximum amount of memory used for renaming, in bytes

• renaming.memory_low_mark (1): whenever the renaming memory usage reaches the size specified in the previous variable, the task graph generation is suspended until the renaming memory usage goes below the number of bytes specified in this variable

> export CSS_CONFIG_FILE=file.cfg


CellSs: Performance Analysis with Paraver

•Paraver

• Flexible performance visualization and analysis tool that can be used to analyze:

• MPI, OpenMP, MPI+OpenMP

• Java

• Hardware counters profile

• Operating system activity

• ... and many other things you may think of

• Generally it uses external trace file generators. Example for MPI:

> mpitrace mpirun -n 10 my_mpi-binary

• For CellSs, the libraries have been instrumented.

• When installing the distribution, two libraries are generated: normal and instrumented

• Flag -t links with instrumented version

• Available for free from the BSC website: www.bsc.es/paraver


CellSs: Performance Analysis with Paraver

•Running paraver

paraver tracefile-0001.prv


CellSs: Performance Analysis with Paraver

•Configuration files

Configuration file: feature shown

• 2dh inbw.cfg: Histogram of the bandwidth achieved by individual DMA IN transfers.

• 2dh inbytes.cfg: Histogram of bytes read by the stage in DMA transfers.

• 2dh outbw.cfg: Histogram of the bandwidth achieved by individual DMA OUT transfers.

• 2dh outbytes.cfg: Histogram of bytes written by the stage out DMA transfers.

• 3dh duration phase.cfg: Histogram of duration for each of the runtime phases.

• 3dh duration tasks.cfg: Histogram of duration of SPU tasks.

• DMA bw.cfg: DMA (in + out) bandwidth per SPU.

• DMA bytes.cfg: Bytes being DMAed (in + out) by each SPU.

• execution phases.cfg: Profile of percentage of time spent by each thread at each of the major phases.


CellSs: Performance Analysis with Paraver

•Configuration files

Configuration file: feature shown

• flushing.cfg: Intervals (dark blue) where each SPU is flushing its local trace buffer to main memory.

• general.cfg: Mix of timelines.

• stage in out phase.cfg: Identification of DMA in (grey) and out phases (green).

• task.cfg: Outlined function being executed by each SPU.

• task distance histogram.cfg: Histogram of task distance between dependent tasks.

• task number.cfg: Number of the task being executed by each SPU.

• Task profile.cfg: Time (microseconds) each SPU spent executing the different tasks.

• task repetitions.cfg: Shows which SPU executed each task and the number of times that the task was executed.

• Total DMA bw.cfg: Total DMA (in + out) bandwidth to memory.


CellSs: Performance Analysis with Paraver

[Figure: Paraver timeline with annotations: Clustering; Group of 8 tasks (23 us); Block size: 64x64 floats; DMA in/out; Data re-use; Main thread; Helper thread.]


CellSs: Performance Analysis with Paraver

Another Cholesky


CellSs: Performance evolution

Performance: matrix multiply

• Versions with different task implementation

• Task duration:

• from 2000 µsecs (simple C scalar code)

• to 22 µsecs (highly hand-vectorized/optimized code)
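To relate these task durations to throughput, a rough conversion can be sketched. It assumes each task multiplies two 64 x 64 blocks (2 * 64^3 floating-point operations), which matches the block size quoted elsewhere in these slides but is an assumption for this particular measurement; the function name is illustrative.

```c
#include <assert.h>

/* GFlops sustained by one bs x bs x bs block multiply-add that takes
 * task_usecs microseconds (2 * bs^3 floating-point operations). */
double block_matmul_gflops(int bs, double task_usecs)
{
    assert(bs > 0 && task_usecs > 0.0);
    double flops = 2.0 * bs * bs * bs;
    return flops / (task_usecs * 1e-6) / 1e9;
}
```

Under these assumptions a 22 µs task corresponds to roughly 23.8 GFlops on one SPU, and a 2000 µs task to about 0.26 GFlops.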

[Figure: three "Matmul scalability" plots of speed-up vs. #SPUs (both axes 0 to 9), labelled July 2007, November 2007 and April 2007, each with one curve per task duration: 2023, 281.79, 117.96, 58.91, 28.46 and 22.43 (listed as 2022.77, 281.32, 117.47, 58.46, 27.87 and 21.86 usecs in the panel titled "Scalability analysis of matrix multiply").]

[Figure: "Matmul performance", GFlops (0 to 160) vs. #SPUs (0 to 9), with series March 2007, July 2007 and Nov 2007.]


CellSs: Performance evolution

Performance: Cholesky factorization

[Figure: three "Cholesky performance" plots of GFlops vs. matrix size (up to 4096): April 2007 (y axis 0 to 90), July 2007 and November 2007 (y axis 0 to 140).]


CellSs: Performance evolution

[Figure: task dependence graph for a 320 x 320 floats matrix (blocks of 64 x 64).]


CellSs: Performance evolution

[Figure: Cell/B.E. block diagram with SPEs (each labelled SXU, LS, DMA), the on-chip coherent bus, the PPE (with L1) and the memory controller.]


CellSs: Performance evolution

• Increase of locality for Matmul


CellSs: Performance evolution

• Increase of locality for Cholesky


CellSs: Performance evolution

• Increase of locality for Sparse LU


CellSs: Performance evolution

• Increase of locality in the software cache


CellSs: Performance evolution

• Increase of locality in the software cache


CellSs: issues and ongoing efforts

• CellSs programming model

• Memory association

• Array regions

• Subobject accesses

• Blocks larger than Local Store.

• Access to global memory by tasks?

• Inline directives

• CellSs runtime system

• Further optimization of overheads (insert task and remove task)

• Scheduling algorithms: overhead, locality

• Overlays

• Short circuiting (SPE-to-SPE transfers)

• SMP superscalar (SMPSs)


Outline

•CellSs

• StarSs Programming Model

• CellSs syntax

• CellSs compiler

• CellSs runtime

• Installing CellSs

• Programming examples

• Compiling and running a CellSs application

• Performance analysis using Paraver

•SMPSs

•Conclusions


SMPSs

• “Same” source code

• Higher flexibility (block size, ...)

• Same compiler

• Different back-end

• Execution environment

• Specific implementation

• Distributed scheduling

• No need for data copy

[Figure: "LU performance" on a 2 way POWER 5: GFlops (0 to 30) vs. number of threads (1 to 8), with series Machine peak, 1024, 2048 and 4096.]

[Figure: "Cholesky scalability" on an SGI Altix: speed-up (0 to 25) vs. #threads (0 to 35), with series 3072, 6144 and 7680.]


SMPSs: Programming example (version array regions)

•Merge-sort

• Splits in 4 subarrays each time

• Sorts the subarrays later on, calling a recursive sort to avoid sorting big arrays

• Using array regions

#pragma css task input(V[N]{i..j}) output (M[N][N]{i}{0..N-1})


SMPSs: Programming example (version array regions)

#pragma css task input(low[N]{i1..j1}, low[N]{i2..j2}, i1, j1, i2, j2) output (dest[N]{i1..j2})
void seqmerge (ELM *low, long i1, long j1, long i2, long j2, ELM *dest);

#pragma css task inout (low[N]{i..j}) input (i,j)
void seqquick (ELM *low, long i, long j);

void sort (ELM *low, long i, long j)
{
    ...
    if (size < QUICKSIZE) {
        seqquick (low, i, j);
    } else {
        quarter = size / 4;
        i1 = i;             j1 = i+quarter-1;
        i2 = i+quarter;     j2 = i+2*quarter-1;
        i3 = i+2*quarter;   j3 = i+3*quarter-1;
        i4 = i+3*quarter;   j4 = j;

        sort(low, i1, j1);
        sort(low, i2, j2);
        sort(low, i3, j3);
        sort(low, i4, j4);

        merge(low, i1, j1, i2, j2, tmp);
        merge(low, i3, j3, i4, j4, tmp);
        merge(tmp, i1, j2, i3, j4, low);
    }
}


SMPSs: Programming example (version array regions)

void merge (ELM *low, long i1, long j1, long i2, long j2, ELM *dest)
{
    ...
    if (size < MERGESIZE) {
        seqmerge(low, i1, j1, i2, j2, dest);
        return;
    }
    size /= 2;
    ...
    split(low, i1, j1, i2, j2, &split1, &split2);

    merge(low, i1, split1-1, i2, split2-1, dest);
    merge(low, split1, j1, split2, j2, dest);
}

main ()
{
#pragma css start
    sort(&array, 0, size-1);
#pragma css barrier
}
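The body of the `seqmerge` task is not shown in the slides. A standard two-way merge that fits its declared regions is sketched below. The name `seqmerge_reference`, the `ELM` typedef and the requirement that the two input runs be adjacent (`i2 == j1 + 1`) are assumptions of this sketch.

```c
#include <assert.h>

typedef long ELM;  /* element type assumed for this sketch */

/* Merges the sorted runs low[i1..j1] and low[i2..j2] (inclusive bounds)
 * into dest[i1..j2]. dest must not overlap low. */
void seqmerge_reference(const ELM *low, long i1, long j1,
                        long i2, long j2, ELM *dest)
{
    assert(i2 == j1 + 1);
    long a = i1, b = i2, out = i1;
    while (a <= j1 && b <= j2)
        dest[out++] = (low[b] < low[a]) ? low[b++] : low[a++];
    while (a <= j1)            /* left run has leftovers */
        dest[out++] = low[a++];
    while (b <= j2)            /* right run has leftovers */
        dest[out++] = low[b++];
}
```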


SMPSs: Programming example

•Queens

• Find a solution to the problem of placing N queens on an N x N board without any of them attacking each other


SMPSs : Programming example

#pragma css task input (j, i, n) inout (a[n]) highpriority
void add_queen_task(char *a, int j, int i, int n);

#pragma css task input (results) inout (acc) highpriority
void acumulate(int results, int *acc);

#pragma css task input (n, j, a[n]) output (results)
void nqueens_ser_task(int n, int j, char *a, int *results);

void nqueens(int n, int j, char *a, char *b, int depth)
{
    for (i = 0; i < n; i++) {
        a[j] = i;
        if (ok(j + 1, a)) {
            add_queen_task(b, j, i, n);
            if (depth < task_depth) {
                nqueens(n, j + 1, a, b, depth + 1);
            } else {
                nqueens_ser_task(n, j + 1, b, &results);
                acumulate(results, &total_res);
            }
        }
    }
}
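The code above calls a helper `ok` that the slides do not show. A conventional check for a partial placement is sketched below. `queens_ok` is a hypothetical name, and it assumes `a[r]` holds the column of the queen in row `r`.

```c
#include <assert.h>

/* Returns 1 if the queens in rows 0..n-1, placed at columns a[0..n-1],
 * do not attack each other, and 0 otherwise. */
int queens_ok(int n, const char *a)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            int d = j - i;  /* row distance between the two queens */
            /* Same column, or same diagonal in either direction. */
            if (a[j] == a[i] || a[j] == a[i] - d || a[j] == a[i] + d)
                return 0;
        }
    return 1;
}
```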

SMPSs: Compiler phase

[Diagram: inside SMPSS-CC, code translation (mcc) turns app.c into smpss-cc_app.c and a task list, app.tasks; the C compiler (gcc, icc, ...) compiles smpss-cc_app.c into smpss-cc_app.o; a pack step produces app.o.]

SMPSs: Linker phase

[Diagram: inside SMPSS-CC, app.o is unpacked into app.c, smpss-cc-app.c, smpss-cc_app.o and app.tasks. A glue code generator emits adapter and registration sources (app-adapters.c, exec-adapters.c, exec-registration.c), which the C compiler (gcc, icc, ...) compiles into exec-adapters.o and exec-registration.o. The linker combines the object files with libSMPSS.so to produce exec.]

SMPSs: runtime

[Diagram: the main thread (CPU0) runs the user main program and the SMPSs runtime library, which handles data dependence analysis, data renaming (renaming table), scheduling and task execution. Ready tasks are held in global ready task queues (high priority and low priority) and in one ready task queue per thread (thread 0, thread 1, thread 2). Worker threads 1 and 2 (CPU1, CPU2) run the SMPSs runtime library for scheduling and task execution of the original task code, with work stealing between the per-thread queues. All threads share memory.]
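How a thread might pick its next task from these queues can be sketched as follows. This is a single-threaded model with no locking, not the SMPSs runtime code. The lookup order (global high-priority queue, then the thread's own queue, then the global low-priority queue, then stealing) and the choice of newest-first for the own queue and oldest-first elsewhere are assumptions; the diagram only shows that these queues and work stealing exist. All type and function names are hypothetical.

```c
#include <assert.h>

#define NTHREADS 3  /* main thread plus two workers, as in the diagram */
#define QCAP 16     /* assumed fixed capacity, enough for a sketch */

/* Bounded ring buffer of task ids. */
typedef struct {
    int items[QCAP];
    int head;  /* index of the oldest item */
    int count; /* number of stored items */
} queue_t;

static int q_push(queue_t *q, int task)
{
    if (q->count == QCAP)
        return 0;
    q->items[(q->head + q->count) % QCAP] = task;
    q->count++;
    return 1;
}

static int q_pop_oldest(queue_t *q, int *task)
{
    if (q->count == 0)
        return 0;
    *task = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return 1;
}

static int q_pop_newest(queue_t *q, int *task)
{
    if (q->count == 0)
        return 0;
    q->count--;
    *task = q->items[(q->head + q->count) % QCAP];
    return 1;
}

typedef struct {
    queue_t high, low;       /* global ready queues */
    queue_t local[NTHREADS]; /* one ready queue per thread */
} sched_t;

/* Pick the next task for thread `self`; return 0 if nothing is ready. */
static int next_task(sched_t *s, int self, int *task)
{
    assert(self >= 0 && self < NTHREADS);

    if (q_pop_oldest(&s->high, task))        /* 1. global high priority */
        return 1;
    if (q_pop_newest(&s->local[self], task)) /* 2. own queue, newest first */
        return 1;
    if (q_pop_oldest(&s->low, task))         /* 3. global low priority */
        return 1;
    for (int k = 1; k < NTHREADS; k++) {     /* 4. steal the oldest task */
        int victim = (self + k) % NTHREADS;
        if (q_pop_oldest(&s->local[victim], task))
            return 1;
    }
    return 0;
}
```

Taking the newest task from the own queue tends to reuse data the thread has just produced, while stealing the oldest task from a victim leaves that thread its most recent work. A real runtime would also need synchronization on every queue.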

SMPSs: results

[Charts: Multi sort, N queens]

•Benchmarks used for OpenMP 3.0 development

• Similar performance in some ranges

• Overlap potential in SMPSs

• Programmability issues

• Reductions, memory allocations, synchronization representatives, nesting, …

Conclusions

•The road for new chips with multi and many cores is open

•New programming models that can deal with the complexity of the hardware are now more needed than ever

•StarSs

• Simple

• Portable

• Enough performance

• Ported to different architectures: CellSs, SMPSs

CellSs and SMPSs websites

•CellSs

• www.bsc.es/cellsuperscalar

•SMPSs

• www.bsc.es/smpsuperscalar

•Both available for download (open source, GPL and LGPL)