04/19/23 Embedded Computer Architecture H. Corporaal and B. Mesman 2
What are we talking about?
ILP = Instruction Level Parallelism =
ability to perform multiple operations (or instructions), from a single instruction stream,
in parallel
VLIW = Very Long Instruction Word architecture
Instruction format example of 5 issue VLIW:
operation 1 | operation 2 | operation 3 | operation 4 | operation 5
Single Issue RISC vs VLIW
[Figure: the compiler maps a stream of operations into instructions.
 A single-issue RISC CPU executes 1 instruction (= 1 operation) per cycle;
 a 3-issue VLIW also executes 1 instruction per cycle, but each instruction
 holds 3 operations (unused slots padded with nops), i.e. 3 ops/cycle.]
Topics Overview
• How to speed up your processor?
  – What options do you have?
• Operation/Instruction Level Parallelism
  – Limits on ILP
• VLIW
  – Examples
  – Clustering
• Code generation (2nd slide-set)
• Hands-on
Speed-up
Pipelined Execution of Instructions

Simple 5-stage pipeline:
IF: Instruction Fetch
DC: Instruction Decode
RF: Register Fetch
EX: Execute instruction
WB: Write Result Register

[Figure: four instructions flowing through the IF-DC-RF-EX-WB stages,
 each one cycle behind the previous, over cycles 1-8.]

Purpose of pipelining:
• Reduce #gate_levels in critical path
• Reduce CPI close to one (instead of a large number for the multicycle machine)
• More efficient hardware

Problems:
• Hazards cause pipeline stalls
  – Structural hazards: add more hardware
  – Control hazards, branch penalties: use branch prediction
  – Data hazards: bypassing required
Speed-up
Pipelined Execution of Instructions
Superpipelining:
• Split one or more of the critical pipeline stages
• Superpipelining degree S:

      S(architecture) = Σ (Op ∈ I_set) f(Op) * lt(Op)

  where: f(Op) is the frequency of operation Op
         lt(Op) is the latency of operation Op
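As a sketch, the superpipelining degree can be computed directly from an operation mix. The mix and latencies below are invented example values, not from the slides:

```python
# Illustrative only: superpipelining degree S = sum over the operation
# set of f(Op) * lt(Op). Frequencies and latencies are made-up examples.

op_mix = {
    # op: (frequency f(Op), latency lt(Op) in cycles)
    "alu":    (0.50, 1),
    "load":   (0.25, 2),
    "store":  (0.10, 1),
    "branch": (0.15, 2),
}

S = sum(f * lt for f, lt in op_mix.values())
print(round(S, 2))  # 0.5*1 + 0.25*2 + 0.1*1 + 0.15*2 = 1.4
```

A degree above 1 means the average operation occupies more than one pipeline stage of execution, which is exactly what splitting critical stages produces.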
Speed-up
Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
Vector instruction:
for (i=0; i<64; i++) c[i] = a[i] + 5*b[i];
or
c = a + 5*b
Assembly:
set   vl,64
ldv   v1,0(r2)
mulvi v2,v1,5
ldv   v1,0(r1)
addv  v3,v1,v2
stv   v3,0(r3)
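The vector semantics can be sketched in plain Python (a scalar emulation standing in for the vector unit; the array contents are invented):

```python
# Scalar emulation of the vector computation c = a + 5*b over 64 elements.
# A real vector unit performs the element-wise multiply and add with a
# handful of vector instructions, as in the assembly above.

VL = 64                      # vector length, as set by "set vl,64"

a = list(range(VL))          # made-up input data
b = list(range(VL))

# mulvi v2,v1,5 followed by addv v3,v1,v2, element by element:
c = [a[i] + 5 * b[i] for i in range(VL)]

print(c[:4])  # [0, 6, 12, 18] for these inputs (a[i]=b[i]=i gives 6*i)
```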
Speed-up
Powerful Instructions (1)
SIMD computing
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploit data locality of e.g. image processing applications
• Dense encoding (few instruction bits needed)
[Figure: SIMD execution method — instructions 1..n broadcast over time
 to nodes 1..K, each node executing the same instruction on its own data.]
Speed-up
Powerful Instructions (1)
• Sub-word parallelism
  – SIMD on restricted scale
  – Used for multimedia instructions
• Examples
  – MMX, SSE, SUN-VIS, HP MAX-2, AMD 3DNow!, TriMedia II
  – Example operation: Σ (i=1..4) |a_i - b_i|
    (sum of absolute differences, the four subtractions done in parallel on sub-words)
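A scalar sketch of that multimedia example (illustrative; a sub-word SIMD instruction computes all four differences in parallel inside one register, while here they are emulated one by one):

```python
# Sum of absolute differences over four sub-words:
#   SAD = sum over i=1..4 of |a_i - b_i|
# This is the operation the slide's multimedia example computes.

def sad4(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

print(sad4([10, 20, 30, 40], [12, 18, 33, 35]))  # 2 + 2 + 3 + 5 = 12
```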
Speed-up
Powerful Instructions (2)
MO-technique: multiple operations per instruction
Two options:
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)
VLIW instruction example (one instruction, one field per FU):

  FU 1: sub r8, r5, 3 | FU 2: and r1, r5, 12 | FU 3: mul r6, r5, r2 | FU 4: ld r3, 0(r5) | FU 5: bnez r5, 13
VLIW architecture: central Register File

[Figure: 3 issue slots, each feeding 3 exec units (9 in total), all
 reading and writing a shared, multi-ported register file.]

Q: How many ports does the register file need for n-issue?
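A back-of-the-envelope answer to the question, under the usual assumption (not stated on the slide) that each operation is dyadic — 2 source operands and 1 result:

```python
# With n issue slots and dyadic operations (2 reads, 1 write each),
# a shared register file needs 2n read ports and n write ports.

def rf_ports(n_issue, reads_per_op=2, writes_per_op=1):
    return n_issue * reads_per_op, n_issue * writes_per_op

reads, writes = rf_ports(3)   # the 3-issue VLIW in the figure
print(reads, writes)          # 6 3
```

This linear growth in ports (and the quadratic growth in register-file area that follows from it) is exactly the scaling problem revisited in the VLIW evaluation slides later on.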
Philips oldie: TriMedia TM32A processor
[Die plot: D-cache, I-cache, TAG arrays, sequencer/decode, I/O interface,
 and the TM32A functional units (IF-MULs, DSP-MULs, DSP-ALUs, FALUs,
 ALUs, shifters, FCOMP, float units).]

0.18 micron, area: 16.9 mm2
200 MHz (typ), 1.4 W
7 mW/MHz (MIPS processor: 0.9 mW/MHz)
Speedup: Powerful Instructions (2)
VLIW Characteristics
• Only RISC-like operation support → short cycle times
• Flexible: can implement any FU mixture
• Extensible
• Tight inter-FU connectivity required
• Large instructions (up to 1024 bits)
• Not binary compatible !!!
• But good compilers exist
Speed-up
Multiple instruction issue (per cycle)
Who guarantees semantic correctness?
– which instructions can be executed in parallel?
• User: specifies multiple instruction streams
– Multi-processor: MIMD (Multiple Instruction Multiple Data)
• HW: Run-time detection of ready instructions
– Superscalar
• Compiler: Compile into dataflow representation
– Dataflow processors
Multiple instruction issue
Three Approaches
a := b + 15;
c := 3.14 * d;
e := c / f;
Translation to DDG (Data Dependence Graph)
[Figure: the resulting DDG — for a: ld &b → + 15 → st &a;
 for c: ld &d → * 3.14 → st &c; for e: that product and ld &f → / → st &e.]
Example code
Generated Code
Instr. Sequential Code
I1  ld   r1,M(&b)
I2  addi r1,r1,15
I3  st   r1,M(&a)
I4  ld   r1,M(&d)
I5  muli r1,r1,3.14
I6  st   r1,M(&c)
I7  ld   r2,M(&f)
I8  div  r1,r1,r2
I9  st   r1,M(&e)
3 approaches:
• An MIMD may execute two streams: (1) I1-I3  (2) I4-I9
  – No dependencies between streams; in practice communication and synchronization required between streams
• A superscalar issues multiple instructions from the sequential stream
  – Obey dependencies (true and name dependencies)
  – Reverse engineering of the DDG needed at run-time
• Dataflow code is a direct representation of the DDG
Dataflow Code
I1  ld M(&b)   -> I2
I2  addi 15    -> I3
I3  st M(&a)
I4  ld M(&d)   -> I5
I5  muli 3.14  -> I6, I8
I6  st M(&c)
I7  ld M(&f)   -> I8
I8  div        -> I9
I9  st M(&e)
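The dataflow rule — an instruction fires as soon as all of its input tokens have arrived — can be sketched with a toy interpreter for this DDG. The memory contents are invented, and the stores (I3, I6, I9) are folded into the final memory writes:

```python
# Toy dataflow execution of the slide's DDG (illustrative sketch).
# Each entry lists the instructions that feed it, mirroring the
# "-> consumer" annotations of the dataflow code above.

mem = {"&b": 7, "&d": 2.0, "&f": 4.0}   # made-up memory contents

prog = {
    # name: (function, input instruction names)
    "I1": (lambda: mem["&b"], []),
    "I2": (lambda x: x + 15, ["I1"]),
    "I4": (lambda: mem["&d"], []),
    "I5": (lambda x: x * 3.14, ["I4"]),
    "I7": (lambda: mem["&f"], []),
    "I8": (lambda x, y: x / y, ["I5", "I7"]),
}

tokens = {}
pending = dict(prog)
while pending:
    for name, (fn, deps) in list(pending.items()):
        if all(d in tokens for d in deps):      # all input tokens present?
            tokens[name] = fn(*[tokens[d] for d in deps])   # fire
            del pending[name]

# The stores I3, I6, I9 commit the results to memory:
mem["&a"], mem["&c"], mem["&e"] = tokens["I2"], tokens["I5"], tokens["I8"]
print(mem["&a"], mem["&c"], mem["&e"])
```

Note that no instruction ordering is given anywhere: I1/I4/I7 can all fire in the first sweep, exactly the parallelism the DDG exposes.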
Multiple Instruction Issue: Data flow processor
[Figure: token matching unit and token store feed instruction generate /
 instruction store; ready instructions go via reservation stations to
 FU-1 .. FU-K, whose result tokens loop back to the token matcher.]
Instruction Pipeline Overview
CISC:           IF DC RF EX WB
RISC:           IF DC/RF EX WB
Superscalar:    k parallel pipelines IFk DCk RFk EXk, with an ISSUE stage
                and reorder buffer (ROB) before WBk
Superpipelined: IF1 IF2 ... IFs DC RF EX1 EX2 ... EX5 WB
VLIW:           shared IF and DC, then k parallel RFk EXk WBk slots
Dataflow:       k independent RFk EXk WBk units (no pipelining)
Four dimensional representation of the architecture design space <I, O, D, S>
[Figure: axes Instructions/cycle 'I', Operations/instruction 'O',
 Data/operation 'D', Superpipelining degree 'S'. Superscalar, MIMD and
 Dataflow extend along I (10-100); VLIW along O (10); SIMD (100+) and
 Vector along D; Superpipelined along S (10); RISC sits near the unit
 point and CISC below it (I ≈ 0.1).]

Note: MIMD should better be a separate, 5th dimension!
Architecture design space
Architecture     K     I    O    D     S    Mpar
CISC             1     0.2  1.2  1.1   1    0.26
RISC             1     1    1    1     1.2  1.2
VLIW             10    1    10   1     1.2  12
Superscalar      3     3    1    1     1.2  3.6
Superpipelined   1     1    1    1     3    3
Vector           7     0.1  1    64    5    32
SIMD             1024  1    1    1024  1.2  1229
MIMD             32    32   1    1     1.2  38
Dataflow         10    10   1    1     1.2  12

Typical values of K (# of functional units or processor nodes) and
<I, O, D, S> for different architectures

Mpar = I*O*D*S
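The Mpar column of the table follows directly from the definition; a quick check:

```python
# Reproducing Mpar = I * O * D * S for the <I, O, D, S> values tabulated
# on this slide.

table = {
    # architecture: (I, O, D, S)
    "CISC":           (0.2, 1.2, 1.1, 1.0),
    "RISC":           (1, 1, 1, 1.2),
    "VLIW":           (1, 10, 1, 1.2),
    "Superscalar":    (3, 1, 1, 1.2),
    "Superpipelined": (1, 1, 1, 3),
    "Vector":         (0.1, 1, 64, 5),
    "SIMD":           (1, 1, 1024, 1.2),
    "MIMD":           (32, 1, 1, 1.2),
    "Dataflow":       (10, 1, 1, 1.2),
}

mpar = {arch: I * O * D * S for arch, (I, O, D, S) in table.items()}
for arch, m in mpar.items():
    print(f"{arch:15s} Mpar = {m:.2f}")
```

Note SIMD's 1*1*1024*1.2 = 1228.8, rounded to 1229 in the table.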
S(architecture) = Σ (Op ∈ I_set) f(Op) * lt(Op)
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism (ILP)
  – limits on ILP
• VLIW
  – Examples
• Clustering
• Code generation
• Hands-on
General organization of an ILP architecture
[Figure: CPU with instruction memory → instruction fetch unit →
 instruction decode unit, feeding FU-1 .. FU-5 over a bypassing network,
 with a shared register file and data memory.]
Motivation for ILP
• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like– multi-media (image, audio, video, 3-D, holographic)– intelligent search and filtering engines– neural, fuzzy, genetic computing
• More functionality
• Use of existing Code (Compatibility)
• Low Power: P = f·C·Vdd²
Low power through parallelism
• Sequential processor
  – Switching capacitance C
  – Frequency f
  – Voltage V
  – P = f·C·V²
• Parallel processor (two times the number of units)
  – Switching capacitance 2C
  – Frequency f/2
  – Voltage V' < V
  – P = (f/2)·2C·V'² = f·C·V'²
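The argument in numbers, as a sketch. The concrete values, and the assumption that halving the frequency permits V' = 0.7·V, are invented for illustration:

```python
# The slide's low-power argument: duplicate the datapath (2C), halve the
# clock (f/2), and exploit the slack to lower the supply voltage.
# Assumption (illustrative): at f/2 the circuit still meets timing at
# V' = 0.7 * V.

f, C, V = 200e6, 1e-9, 1.2          # made-up example values
P_seq = f * C * V**2                # sequential processor

V_prime = 0.7 * V                   # assumed achievable at half frequency
P_par = (f / 2) * (2 * C) * V_prime**2   # = f * C * V'^2

print(round(P_par / P_seq, 2))      # 0.49: same work rate, ~half the power
```

The throughput is unchanged (twice the units at half the rate), so the quadratic voltage term delivers the entire saving.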
Measuring and exploiting available ILP
• How much ILP is there in applications?
• How to measure parallelism within applications?
  – Using existing compiler
  – Using trace analysis
• Track all the real data dependencies (RAWs) of instructions from the issue window
  – register dependence
  – memory dependence
• Check for correct branch prediction
– if prediction correct continue
– if wrong, flush schedule and start in next cycle
Trace analysis
Program
For i := 0..2
A[i] := i;
S := X+3;
Compiled code
set r1,0
set r2,3
set r3,&A
Loop: st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
add r1,r5,3
Trace
set r1,0
set r2,3
set r3,&A
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
add r1,r5,3

How parallel can you execute this code?
Trace analysis
Parallel Trace
set r1,0 set r2,3 set r3,&A
st r1,0(r3) add r1,r1,1 add r3,r3,4
st r1,0(r3) add r1,r1,1 add r3,r3,4 brne r1,r2,Loop
st r1,0(r3) add r1,r1,1 add r3,r3,4 brne r1,r2,Loop
brne r1,r2,Loop
add r1,r5,3
Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7
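Checking the slide's arithmetic — 16 operations in the serial trace versus the 6 cycles of the parallel trace above:

```python
# Max ILP = speedup = serial trace length / parallel trace length.
L_serial, L_parallel = 16, 6
speedup = L_serial / L_parallel
print(round(speedup, 1))  # 2.7
```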
Ideal Processor

Assumptions for ideal/perfect processor:
1. Register renaming – infinite number of virtual registers => all register WAW & WAR hazards avoided
2. Branch and jump prediction – perfect => all program instructions available for execution
3. Memory-address alias analysis – addresses are known; a store can be moved before a load provided the addresses are not equal

Also:
– unlimited number of instructions issued per cycle (unlimited resources)
– unlimited instruction window
– perfect caches
– 1 cycle latency for all instructions (also FP *,/)
Programs were compiled using MIPS compiler with maximum optimization level
Upper Limit to ILP: Ideal Processor
[Bar chart: instruction issues per cycle (IPC) on the ideal processor —
 gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1.]

Integer: 18 - 60    FP: 75 - 150
Window Size and Branch Impact
• Change from infinite window to examining 2000 instructions and issuing at most 64 instructions per cycle

[Bar chart: IPC per program (gcc, espresso, li, fpppp, doducd, tomcatv)
 for branch predictors: Perfect, Tournament (selective), BHT(512)
 (standard 2-bit), Profile (static), and no prediction.]

Integer: 6 - 12    FP: 15 - 45
Limiting nr. of Renaming Registers
• Changes: 2000-instr. window, 64-instr. issue, 8K 2-level predictor (slightly better than tournament predictor)

[Bar chart: IPC per program for Infinite, 256, 128, 64, 32 renaming
 registers, and none.]

Integer: 5 - 15    FP: 11 - 45
Memory Address Alias Impact
• Changes: 2000-instr. window, 64-instr. issue, 8K 2-level predictor, 256 renaming registers

[Bar chart: IPC per program for alias analysis that is Perfect,
 Global/stack perfect, Inspection, and None.]

Integer: 4 - 9    FP: 4 - 45 (Fortran, no heap)
Reducing Window Size
• Assumptions: perfect disambiguation, 1K selective predictor, 16-entry return stack, 64 renaming registers, issue as many as the window allows

[Bar chart: IPC per program (gcc, espresso, li, fpppp, doducd, tomcatv)
 for window sizes Infinite, 256, 128, 64, 32, 16, 8, 4.]

Integer: 6 - 12    FP: 8 - 45
How to Exceed ILP Limits of This Study?
• WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not in memory
• Unnecessary dependences
  – the compiler did not unroll loops, so the iteration variable induces a dependence between iterations
• Overcoming the data flow limit: value prediction — predicting values and speculating on the prediction
  – address value prediction and speculation predicts addresses and speculates by reordering loads and stores; could provide better aliasing analysis
Conclusions
• Amount of parallelism is limited
  – higher in multimedia and signal processing applications
  – higher in kernels
• Trace analysis detects all types of parallelism
  – task, data and operation types
• Detected parallelism depends on
  – quality of compiler
  – hardware
  – source-code transformations
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
  – Examples
    • C6
    • TM
    • IA-64: Itanium, ....
    • TTA
• Clustering
• Code generation
• Hands-on
VLIW: general concept
[Figure: a VLIW architecture with 7 FUs — instruction memory feeds a wide
 instruction register with one field per function unit; 3 integer FUs and
 2 load/store units share the integer register file and data memory;
 2 FP FUs share the floating-point register file.]
VLIW characteristics
• Multiple operations per instruction
• One instruction per cycle issued (at most)
• Compiler is in control
• Only RISC-like operation support
  – Short cycle times
  – Easier to compile for
• Flexible: can implement any FU mixture
• Extensible / scalable

However:
• tight inter-FU connectivity required
• not binary compatible !!
  – (new long instruction format)
• low code density
VLIW example: TMS320C62
TMS320C62 VelociTI Processor
• 8 operations (of 32 bit) per instruction (256 bit)
• Two clusters
  – 8 FUs: 4 FUs / cluster (2 multipliers, 6 ALUs)
  – 2 x 16 registers
  – One bus available to write into the register file of the other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All instructions conditional
• Originally: 5 ns, 200 MHz, 0.25 µm, 5-layer CMOS
• 128 KB on-chip RAM
VLIW example: Philips TriMedia TM1000

[Figure: TM1000 datapath — 32 kB instruction cache, 16 kB data cache,
 instruction register with 5 issue slots, register file (128 regs,
 32 bit, 15 ports), and 27 FUs: 5 constant, 5 ALU, 2 memory, 2 shift,
 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare,
 1 FP div/sqrt.]
Intel EPIC Architecture IA-64

Explicit Parallel Instruction Computer (EPIC)
• IA-64 architecture -> Itanium; first realization 2001

Register model:
• 128 64-bit integer registers (plus NaT bit), stacked, rotating
• 128 82-bit floating point, rotating
• 64 1-bit boolean (predicate) registers
• 8 64-bit branch target address registers
• system control registers
See http://en.wikipedia.org/wiki/Itanium
EPIC Architecture: IA-64
• Instructions grouped in 128-bit bundles
  – 3 × 41-bit instructions
  – 5 template bits indicate type and stop location
• Each 41-bit instruction
  – starts with a 4-bit opcode, and
  – ends with a 6-bit guard (boolean) register-id
• Supports speculative loads
EPIC Architecture: IA-64
• EPIC allows for more binary compatibility than a plain VLIW:
  – Function unit assignment performed at run-time
  – Lock (stall) when FU results not available
• See the website of course 5MD00 for more info on IA-64:
  – www.ics.ele.tue.nl/~heco/courses/ACA
  – (look at related material)
What did we talk about?
ILP = Instruction Level Parallelism =
ability to perform multiple operations (or instructions), from a single instruction stream,
in parallel
VLIW = Very Long Instruction Word architecture
Example instruction format (5-issue):
operation 1 | operation 2 | operation 3 | operation 4 | operation 5
VLIW evaluation

[Figure: the general ILP organization — instruction memory, instruction
 fetch and decode units, FU-1 .. FU-5 on a bypassing network, register
 file, data memory — annotated with the scaling of the control problem:
 O(N) to O(N²) structures for N function units.]
VLIW evaluation
Strong points of VLIW:
– Scalable (add more FUs)
– Flexible (an FU can be almost anything; e.g. multimedia support)

Weak points — with N FUs:
– Bypassing complexity: O(N²)
– Register file complexity: O(N)
– Register file size: O(N²)
– Register file design restricts FU flexibility

Solution: .................................................. ?
Solution
TTA: Transport Triggered Architecture

[Figure: FUs (+, -, *, >, st) attached to a shared transport network
 instead of a fully connected bypass/register-file structure.]
Transport Triggered Architecture
General organization of a TTA

[Figure: same top-level blocks as the VLIW — instruction memory, fetch
 and decode units, FU-1 .. FU-5, register file, data memory — but the
 FUs and register file connect through a programmed transport network.]
TTA structure; datapath details
[Figure: TTA datapath — transport buses with sockets connecting the
 integer RF, float RF, boolean RF, instruction unit, immediate unit,
 two load/store units, two integer ALUs and a float ALU, plus data
 and instruction memory.]
TTA hardware characteristics
• Modular: building blocks easy to reuse
• Very flexible and scalable
  – easy inclusion of Special Function Units (SFUs)
• Very low complexity
  – > 50% reduction in # register ports
  – reduced bypass complexity (no associative matching)
  – up to 80% reduction in bypass connectivity
  – trivial decoding
  – reduced register pressure
  – easy register file partitioning (a single port is enough!)
TTA software characteristics
• More difficult to schedule!
• But: extra scheduling optimizations

Example — the RISC operation

  add r3, r1, r2

becomes three moves:

  r1 -> add.o1; r2 -> add.o2;
  add.r -> r3

That does not look like an improvement !?!

[Figure: an FU with operand ports o1, o2 and result port r.]
Program TTAs
How to do data operations?
1. Transport of operands to FU
   • Operand move(s)
   • Trigger move
2. Transport of results from FU
   • Result move(s)

How to do control flow?
1. Jump:   #jump-address -> pc
2. Branch: #displacement -> pcd
3. Call:   pc -> r; #call-address -> pcd

Example: add r3,r1,r2 becomes
  r1 -> Oint      // operand move to integer unit
  r2 -> Tadd      // trigger move to integer unit
  ....            // addition operation in progress
  Rint -> r3      // result move from integer unit

[Figure: FU pipeline — trigger and operand registers feed an internal
 stage, which produces the result.]
Scheduling example
VLIW code:
  add r1,r1,r2
  sub r4,r1,95

TTA code:
  r1 -> add.o1, r2 -> add.o2
  add.r -> sub.o1, 95 -> sub.o2
  sub.r -> r4

[Figure: the moves routed over the transport network between the integer
 RF, immediate unit, two integer ALUs and a load/store unit.]
TTA Instruction format
General MOVE field:    g | i | src | dst
  g   : guard specifier
  i   : immediate specifier
  src : source
  dst : destination

General MOVE instructions hold multiple fields:
  move 1 | move 2 | move 3 | move 4

How to use immediates?
  Small, 6 bits:   g | 1 | imm  | dst
  Long, 32 bits:   g | 0 | Ir-1 | dst   plus a 32-bit imm word
Programming TTAs
How to do conditional execution?
Each move is guarded.

Example:
  r1 -> cmp.o1    // operand move to compare unit
  r2 -> cmp.o2    // trigger move to compare unit
  cmp.r -> g      // put result in boolean register g
  g:r3 -> r4      // guarded move takes place when r1=r2
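The guarded move can be sketched as follows (an illustrative emulation; the register naming follows the example above, not any real TTA toolchain):

```python
# Emulation of the guarded conditional move: the compare unit writes a
# boolean guard g, and the move r3 -> r4 only commits when g is true.

def guarded_example(r1, r2, r3, r4):
    g = (r1 == r2)          # cmp.r -> g : compare result into guard
    if g:                   # g:r3 -> r4 : move commits only if g holds
        r4 = r3
    return r4

print(guarded_example(5, 5, 42, 0))   # guard true:  r4 becomes 42
print(guarded_example(5, 6, 42, 0))   # guard false: r4 stays 0
```

Because every move carries a guard field, conditionals need no branch: untaken work simply has its moves squashed.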
Register file port pressure for TTAs
[3-D chart: achieved ILP degree (1.0 - 3.5) as a function of the number
 of register-file read ports and write ports (1 - 5 each) — the read and
 write ports required.]
Summary of TTA Advantages
• Better usage of transport capacity
  – Instead of 3 transports per dyadic operation, about 2 are needed
  – # register ports reduced by at least 50%
  – Inter-FU connectivity reduced by 50-70%
• No full connectivity required
• Both the transport capacity and # register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
• Flexible: FUs can incorporate arbitrary functionality
• Scalable: #FUs, #reg.files, etc. can be changed
• FU splitting results in extra exploitable concurrency
• TTAs are easy to design and can have short cycle times
TTA automatic DSE
[Figure: the Move framework — architecture parameters feed an optimizer,
 which drives a parametric compiler (producing parallel object code) and
 a hardware generator (producing the chip), with feedback and user
 interaction; the result is a Pareto curve of cost vs. execution time
 (the solution space).]
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering and Reconfigurable components
• Code generation
• Hands-on
Clustered VLIW
• Clustering = splitting up the VLIW data path
  – the same can be done for the instruction path

[Figure: three clusters, each with its own FUs, loop buffer and register
 file, sharing a level-1 instruction cache, a level-1 data cache and a
 shared level-2 cache.]
Clustered VLIW
Why clustering?
• Timing: faster clock
• Lower cost
  – silicon area
  – T2M (time-to-market)
• Lower energy

What's the disadvantage?

Want to know more: see the PhD thesis of Andrei Terechko
Fine-grained reconfigurable: Xilinx XC4000 FPGA

[Figure: array of Configurable Logic Blocks (CLBs) with switch matrices
 and programmable interconnect, surrounded by I/O Blocks (IOBs); detail
 views of an IOB (input/output buffers, flip-flops, slew-rate control,
 passive pull-up/pull-down, delay) and a CLB (F, G and H function
 generators, two flip-flops with set/reset control, inputs F1-F4, G1-G4,
 C1-C4, H1, DIN, S/R, EC, outputs X and Y).]
Recent Coarse Grain Reconfigurable Architectures
• SmartCell (2009)
  – read http://www.hindawi.com/journals/es/2009/518659.html
• Montium (reconfigurable VLIW)
• RAPID
• NIOS II
• RAW
• PicoChip
• PACT XPP64
• ADRES (IMEC)
• many more ….
Xilinx Zynq with 2 ARM processors
ADRES
• Combines VLIW and reconfig. array
• PEs have local registers
• Top-row PEs share registers
PACT XPP: Architecture

• XPP (eXtreme Processing Platform)
  – A hierarchical structure consisting of PAEs
• PAEs:
  – coarse-grain PEs
  – adaptive
  – clustered in PACs (PAC = PA + CM, a configuration manager)
  – a hierarchical configuration tree
  – memory elements (beside the PAs)
  – I/O elements (on each side of the chip)

[Figure: four Processing Arrays (PA).]
RAW with mesh network

[Figure: tile with compute pipeline and routed mesh links — 8 32-bit
 channels, registered at input; the longest wire equals the length of
 a tile.]
Granularity Makes Differences

                     Fine-Grained    Coarse-Grained
Clock speed          Low             High
Configuration time   Long            Short
# of blocks          Large           Small
Flexibility          High            Low
Power                High            Low
Area                 Large           Small