Processor Architectures and Program Mapping TU/e 5kk10 Henk Corporaal Jef van Meerbergen Bart Mesman Exploiting DLP SIMD architectures

Processor Architectures and Program Mapping

TU/e 5kk10Henk Corporaal

Jef van Meerbergen

Bart Mesman

Exploiting DLPSIMD architectures

04/18/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

2

flexibility

efficiency

DSP

Programmable CPU

Programmable DSP

Application specific instruction set

processor (ASIP)

Applicationspecific processor


3

SIMD PerformanceSIMD Performance

106

105

104

103

102

101

100

2 1 0.5 0.25 0.13 0.07

Computational efficiency [MOPS/W]

Feature size [um]

Application specific cores

Programmable processors[Roza]

SIMD


4

What are we talking about?

ILP = Instruction Level Parallelism =

ability to perform multiple operations (or instructions),from a single instruction stream,

in parallel

VLIW = Very Long Instruction Word architecture

operation 1 operation 2 operation 3 operation 4

Instruction format:

operation 5


5

SIMD: Topics Overview• Enhance performance: architecture methods

• Data Level Parallelism

– Application area– Subword parallelism

• Locally connected SIMDs

– Xetal

• Fully connected SIMDs

– Imagine

• Communication in SIMD processors– RCSIMD

– DCSIMD


6

Enhance performance: 3 architecture methods

• (Super)-pipelining

• Powerful instructions– MD-technique

• multiple data operands per operation

– MO-technique• multiple operations per instruction

• Multiple instruction issue


7

Characteristics of Media Applications• Poorly matched to conventional architectures

– Caches

– Instruction-Level Parallelism

– Few arithmetic units

• Well-matched to modern VLSI technology

– Lots (100’s - 1000’s) of ALUs fit on a single chip

Communication bandwidth is the scarce resource


8

Architecture methods

Powerful Instructions (1)

MD-technique• Multiple data operands per operation• SIMD: Single Instruction Multiple Data

Vector instruction:

for (i=0, i++, i<64) c[i] = a[i] + 5*b[i];

c = a + 5*b

Assembly:

set vl,64ldv v1,0(r2)mulvi v2,v1,5ldv v1,0(r1)addv v3,v1,v2stv v3,0(r3)


9



SIMD computing• Exploit data locality of

e.g. image processing applications

• Effect on code size?• Effect on power

consumption?

SIMD Execution Method

tim

e

Instruction 1

Instruction 2

Instruction 3

Instruction n

node1 node2 node-K


10



• Sub-word parallelism– SIMD on restricted scale:

– Used for Multi-media instructions

– Motivation: use a powerful 64-bit alu as 4 x 16-bit alus

• Examples– MMX, SUN-VIS, HP MAX-2, AMD-

K7/Athlon 3Dnow, Trimedia II

– Example: i=1..4|ai-bi| * * * *


11

LC-SIMD

LC-SIMD (Locally connected; e.g. Xetal, Imap) long communication delays: shift operations

PE1 PE2 PE319PE0

Instructions Bus

Memory

One wide port


12

FC-SIMDFC-SIMD (Fully Connected; Imagine)

expensive communication network

PE1 PE2 PE319PE0

Instructions Bus

Fully Connected Communication Network


13

LC: Xetal Objectives

High-degree of system integration

CMOS imaging + DSP

low cost camera systems

Low power consumption

mobile & remote sensing

Flexibility

programmable DSP and control functions

04/18/23 1

Xetal Architecture


15

Global ControllerGlobal Controller

tuned for Xetal Archit.

functions

loop/iteration control

system synchronization

exposure-time control

white balancing . . .

04/18/23 1

Xetal Architecture


17

Parallel Processing (SIMD) Parallel Processing (SIMD)

2 columns /processor

neighbour communication

low-speed clock (16 MHz)

clock gating

shared address decoding

minimal memory read access

LOW-POWER


18

Parallel Processing (Contd.) Parallel Processing (Contd.)


19

Xetal Specs & PerformanceXetal Specs & Performance


20

Simulation Results(1-input)Simulation Results(1-input)


21

Simulation Results(1-output)Simulation Results(1-output)


22

Simulation Results (2)Simulation Results (2)


23

Imagine

• Combining DLP (SIMD) and ILP (VLIW)– toplevel SIMD– per PE: VLIW


24

• Stereo Depth Extraction

• Polygon Rendering

• MPEG Encoding/Decoding

Encoded 2D Data 2D Video Stream

Encode/Decode

Imagine: Representative Applications

Render

101100010110001001


25

Stream Processing

• Little data reuse (pixels never revisited)• Highly data parallel (output pixels not dependent on other output pixels)• Compute intensive (60 arithmetic ops per memory reference)

SAD

Kernel StreamInput Data

Output Data

Image 1 convolve convolve

Image 0 convolve convolve

Depth Map


26

Stream Architecture Provides Data Bandwidth Hierarchy

2GB/s 32GB/s

SDRAM

SDRAM

SDRAM

SDRAM

Str

eam

R

egis

ter

File

ALU Cluster

ALU Cluster

ALU Cluster

544GB/s

ALU Cluster

ALU Cluster

ALU Cluster

ALU Cluster

ALU Cluster

SIMD/VLIW Control

Peak BW:


27

Application Data: Bandwidth Usage

2GB/s 32GB/s

SDRAM

SDRAM

SDRAM

SDRAMS

trea

m

Reg

iste

r F

ile

ALU Cluster

ALU Cluster

ALU Cluster

544GB/s

Memory BW Global RF BW Local RF BW

Depth Extractor 0.80 GB/s 18.45 GB/s 210.85 GB/s

MPEG Encoder 0.47 GB/s 2.46 GB/s 121.05 GB/s

Polygon Rendering 0.78 GB/s 4.06 GB/s 102.46 GB/s

QR Decomposition 0.46 GB/s 3.67 GB/s 234.57 GB/s


28

Stream Register File: Details

SRF:

Single-ported128KB SRAM

(1024 x 32W)

Stream

bu

ffers

32W/cycle

ArbiterTo/From:

Arithmetic Clusters,I/O,

Interprocessorcommunication,

and Main Memory


29

CU

Inte

rclu

ster

N

etw

ork+

From SRF

To SRF

+ * /

Cross Point

Local Register File

Arithmetic Cluster: Details

• Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions

– 4-cycle pipelined FMUL,FADD,FSUB,FTOI,ITOF,FFRAC

– 17-cycle FDIV (pipelined for 1 FDIV every 7 cycles)

+ *


30

The Imagine Stream Processor

Stream Register File:32kW SRAM

NetworkInterface

StreamController

Imagine Stream Processor

HostProcessor

Net

wor

k

AL

U C

lust

er 0

AL

U C

lust

er 1

AL

U C

lust

er 2

AL

U C

lust

er 3

AL

U C

lust

er 4

AL

U C

lust

er 5

AL

U C

lust

er 6

AL

U C

lust

er 7

SDRAMSDRAM SDRAMSDRAM

Streaming Memory System

Mic

roco

ntr

oll

er:

2K V

LIW

In

strs


31

Imagine Floorplan• 22 million

transistors

• 500 MHz

• TI GS30KA:– 0.15 m Ldrawn

– 0.m Leff

– CMOS process

Micro-Controller

ALU Cluster 0

ALU Cluster 1

ALU Cluster 2

ALU Cluster 3

ALU Cluster 4

ALU Cluster 5

ALU Cluster 6

ALU Cluster 7

Stre

am

bu

ffers

SR

F

Stre

am

bu

ffers

Me

mo

ry S

ys

tem

SRFControl

StreamController

NetworkInterface

7.8

mm

7.6mm


32

Imagine Programming Environment

StreamC

C++ compiler

KernelC

kernelscheduler

microcode

Ho

st

Imag

ine

streamscheduler

Run-time

Compile-time

StereoDepthExtraction(…) {

// Load Input Images

...

// Run Kernels

convolve7x7 (RawImage,ConvImage);

convolve3x3 (ConvImage,Conv2Image);

...

// Store Output

}

Convolve7x7(…){

...

while(!In.empty()) {

...

p0 = k0 * in10;

p12 = k21 * in32;

p34 = k43 * in54;

p56 = k65 * in76;

sum = (p0 + p12)

+ (p34 + p56);

...

}

}


33

RC-SIMD

• Imagine support full interconnect between PEs

• Do we need this expensive interconnect?

• Alternative: RC-SIMD


34

Basic template of communication architecture

S0

PE1

S1

PE2

S2

PE3PE0

1

1

11

11

0 0

00

0

0

Instructions Bus


35

Example

• 4-tap filter

LD0

* C0

LD+1

* C1

LD+2

* C2

LD+3

* C3

+

ST

PE 0 PE 1 PE 2 PE 3

0 LD 0 LD 0 LD 0 LD 0

1 * C0 * C0 * C0 * C0

2 LD +1 LD +1 LD +1 LD +1

3 * C1 * C1 * C1 * C1

4 LD +2 LD +2 LD +2 LD +2

5 * C2 * C2 * C2 * C2

6 LD +3 LD +3 LD +3 LD +3

7 * C3 * C3 * C3 * C3

8 sum sum sum sum

9 ST ST ST ST


36

Example

PE 0 PE 1 PE 2 PE 3

0 LD 0 LD 0 LD 0 LD 0

1 * C0 * C0 * C0 * C0

2 LD +1 LD +1 LD +1 LD +1

3 * C1 * C1 * C1 * C1

4 LD +2 LD +2 LD +2 LD +2

5 * C2 * C2 * C2 * C2

6 LD +3 LD +3 LD +3 LD +3

7 * C3 * C3 * C3 * C3

8 sum sum sum sum

9 ST ST ST ST

Resource sharing conflict

How to solve????

Pipeline (shift 1 cycle)

PE1

S1

PE2 PE3PE0

1

0

S21

0

S01

0


37

RC-SIMD: Basic architecture

cycle PE 0 PE 1 PE 2 PE 3

0 LD 0 - - -

1 * C0 LD 0 - -

2 LD +1 * C0 LDP0 -

3 * C1 LD +1 * C0 LD 0

4 LD +2 * C1 LD +1 * C0

5 * C2 LD +2 * C1 LD +1

6 LD +3 * C2 LD +2 * C1

7 * C3 LD +3 * C2 LD +2

8 sum * C3 LD +3 * C2

9 ST sum * C3 LD +3

10 - ST sum * C3

11 - - ST sum

12 - - - ST

• Schedule with delay-line

PE1

S1

PE2 PE3PE0

1

0

delay delay delay

S21

0

S01

0


38

Conflict model

Ld +2 S0 S1

S1 S2

1-1

00

0 0

0

0

• Schedule PE0 (using FACTS)

Node: resource usage

Sequence edge: timing dependency

Fact tools

Move problem From hardware to software

PE1

S1

PE2 PE3PE0

1

0

delay delay delay

S21

0

S01

0


39

Basic architecture

• Valid schedule

PE1

S1

PE2 PE3PE0

1

0

delay delay delay

S21

0

S01

0

cycle PE 0 PE 1 PE 2 PE 3

0 LD +3 - - -

1 LD +1 LD +3 - -

2 LD 0 LD +1 LD +3 -

3 * C0 LD 0 LD +1 LD +3

4 LD +2 * C0 LD 0 LD +1

5 * C1 LD +2 * C0 LD 0

6 * C2 * C1 LD +2 * C0

7 * C3 * C2 * C1 LD +2

8 sum * C3 * C2 * C1

9 ST sum * C3 * C2

10 - ST sum * C3

11 - - ST sum

12 - - - ST


40

Drawback

Ins 1

Ins 2 Ins 1

Ins 2 Ins 1

Ins 2 Ins 1

Ins 2

Ins 3

Ins 4 Ins 3

Ins 4 Ins 3

Ins 4 Ins 3

Ins 4

• 319 cycle between PE0 & PE319

• Size of conflict model (compile time)

PE 0 PE 1 PE 2 PE 3 PE 319


41

Update Architecture

Ins 1

Ins 2 Ins 1

Ins 2 Ins 1

Ins 2 Ins 1

Ins 2

Ins 3

Ins 4 Ins 3

Ins 4 Ins 3

Ins 4 Ins 3

Ins 4

Ins 1

Ins 2 Ins 1

Ins 2 Ins 1

Ins 2 Ins 1

Ins 2

Ins 3

Ins 4 Ins 3

Ins 4 Ins 3

Ins 4 Ins 3

Ins 4

Cycle 1

Cycle 2

Cycle 3

Cycle 4

Cycle 5

Cycle 6

Cycle 7

PE 0 PE 1 PE 2 PE 3 PE 4 PE 5 PE 5 PE 6


42

Updated RC-SIMD Architecture

PE1

S1

PE2 PE3PE0

1

0

delay delay

S2

1

0

S0

1

0

0

1

0

1

0

1

PE4

S3

1

0

0

1

delay delay


43

Results of mapping several kernels

Kernel Num operation Num cycle in LC-SIMD

Num cycle in FC-SIMD

Num cycle in RC-SIMD

Comm. Overhead in LC-SIMD

Cycle Improvement (compare to LC-

SIMD)

4-tap filter 8 10 8 8 2 20%

Image sub-sampling

21 26 21 21 5 19%

Convolution 7x7

98 126 98 98 28 22%

Haar filter 162 216 162 162 54 25%

FFT 26 - 26 28 - -


44

Imap


45

Imap


46

Difficult SIMD Applications

• Algorithms need Dynamic communication:– lens distortion– bucket processing– Mirroring,…


47

DC-SIMD Architecture

PE_1

Bus_0

Bus_1

Bus_2

PE_2 PE_3

R3

R2

R1

PE_4 PE_5 PE_6

R6

R5

R4

PE_7

R7

PE_6 PE_3

PE_4 PE_2V dst-add data src-add

Message format


48


PE_1

Bus_0

Bus_1

Bus_2

PE_2 PE_3

R3

R2

R1

PE_4 PE_5 PE_6

R6

R5

R4

PE_7

R7

Larger distance: PE_7 PE_1


49


PE_1

Bus_0

Bus_1

Bus_2

PE_2 PE_3

R3

R2

R1

PE_4 PE_5 PE_6

R6

R5

R4

PE_7

R7

PE_7 PE_5

PE_6 PE_2

Priority


50

DC-SIMD: arbitration

PE

V des-add data src-add

xorRead data

write: give priority to further PES

Read:

PEn PEn+1 PEn+2

V des-add data src-add

Next reg.

Select (ab)

ab

v 00

n+2 01

n+1 10

n 11a=v’.2’

b=a’.v’+a.1’

n+2 : 2.v

n+1 : (2+v).1

n : (1+2+v).0

Buffer instruction:

PEid


51

Measurements: Area overhead

0

5

10

15

20

25

30

1 2 4 8 16 32 64 128 256 512

Number of PEs

Are

a o

ve

rhe

ad

(%)


52

Measurements: Performance

0102030405060708090

100

1 2 3 4 5 6 7 8 9Number of Buses

perfo

rman

ce im

prov

. %


53

Measurements: required instruction buffer size

9600

9800

10000

10200

10400

10600

10800

11000

1

22 43 64 85

106

127

148

169

190

211

232

253

274

295

316

337

358

379

400

Buffer Size

Cyc

le


54

In PS3: CELL architecture


55

CELL Highlights• Observed clock speed

– – > 4 GHz

• Peak performance (single precision)– – > 256 GFlops

• Peak performance (double precision)– – >26 GFlops

• Area 221 mm2• Technology 90nm SOI• Total # of transistors 234M


56

Conclusions• SIMD nicely matches

– Image applications: data-level parallelism– VLSI efficiency: copy-paste of simple elements

• So – Very efficient architecture for image processing– Low power! Also by trading off clock vs peroformance

• But – Programmer is burdened with vector thinking– Vectorizing compilers are not good at recognizing opportunities for vector

executions– Need for a “control” processor for control code and if-then-else

• Communication is a problem:– Dimensioned for peak BW requirements -> RCSIMD– Unable to perform indirect PE addressing-> DCSIMD

Documents

Processor Architectures and Program Mapping TU/e 5kk10 Henk Corporaal Jef van Meerbergen Bart Mesman Exploiting DLP SIMD architectures