David Stephenson Timothy Kong Dee Lee Fred Chow Guang R. Gao

•Juergen Ributzka•David Stephenson•Timothy Kong•Dee Lee•Fred Chow•Guang R. Gao

Overview

• SiCortex Multiprocessor

• Software Pipelining Framework

• Results

• Future Work

3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 2

SiCortex Multiprocessor

• RISC

• 6 cores (MIPS 5KF)

• 500 – 700 MHz

• In-Order Execution

• Limited-Dual Issue

• Low Power (12 – 16 Watt)


Software Pipelining Framework


DDG

MII

MS

MVE

RA

CG

Front-End

Middle-End

Back-End

FORTRAN C/C++ Java

Loop NestOptimizer

GlobalOptimizer

CodeGenerator

Data Dependence Graph (DDG)

• Cyclic Data Dependence Graph

• No register anti- and output-dependencies

• Only single BBs


DDG

MII

MS

MVE

RA

CG

S1

S2

S1

S2

Minimum Initiation Interval

• Limited by:

– Resource Requirements

– Recurrence Requirements


DDG

MII

MS

MVE

RA

CG

Modulo Scheduling (MS)

• Huff Modulo Scheduling

– Lifetime sensitive

– Uses backtracking

• Hyper Node Reduction Scheduling

– Lifetime sensitive

– No backtracking


DDG

MII

MS

MVE

RA

CG

Modulo Variable Expansion (MVE)

• Separate TN set for every iteration

• # of unrollings depends on longest lifetime


DDG

MII

MS

MVE

RA

CG

TN10 op1 TNXX…TN20 op2 TN10 TNYY



Iteration

C

y

c

l

e

Register Allocation (RA)

• Only kernels are fully register allocated

• No regions

– SWP kernels are not a black box for GRA and LRA anymore

• Currently no register spilling support


DDG

MII

MS

MVE

RA

CG

Code Generation (CG)

• Prologues and epilogues are partially register allocated

• Need several epilogues due to different register sets


DDG

MII

MS

MVE

RA

CG

P

P0

K0

K1 E0

E1

E2

E

Benchmark Results


0.9

0.95

1

1.05

1.1

1.15

1.2

BT CG EP FT IS LU MG SP UA

spe

ed

up

NAS Parallel Benchmark

ABI n32

ABI 64

Benchmark Results


0.95

0.96

0.97

0.98

0.99

1

1.01

1.02

1.03

1.04

1.05

41

0.b

wav

es

43

3.m

ilc

43

4.z

eusm

p

43

5.g

rom

acs

43

6.c

actu

sAD

M

43

7.le

slie

3d

44

4.n

amd

44

7.d

ealII

45

0.s

op

lex

45

3.p

ovr

ay

45

4.c

alcu

lix

45

9.G

emsF

DTD

46

5.t

on

to

47

0.lb

m

48

1.w

rf

48

2.s

ph

inx3

40

0.p

erlb

ench

40

1.b

zip

2

40

3.g

cc

42

9.m

cf

44

5.g

ob

mk

45

6.h

mm

er

45

8.s

jen

g

46

2.li

bq

uan

tum

46

4.h

26

ref

47

1.o

mn

etp

p

47

3.a

star

spe

ed

up

SPEC2006

ABI n32

ABI 64

Conclusion and Future Work

• Spilling support needed to enable more loops for SWP

• Screen out loops with small trip count during runtime to reduce SWP overhead

• Hit-under-miss support to overcome current hardware limitations


Questions ?


Documents

David Stephenson Timothy Kong Dee Lee Fred Chow Guang R. Gao