Distributed L0 Buffer Architecture and Exploration for Low Energy Embedded Systems
Murali Jayapala
Francisco Barat
Pieter Op de Beeck
Tom Vander Aa
Geert Deconinck
ESAT/ACCA, K.U.Leuven, Belgium
Francky Catthoor
Henk Corporaal
IMEC, Leuven, Belgium
ESAT/ACCA

Overview
• Context: Introduction to the problem
• Motivation for L0 Buffer organization and status
• Distributed L0 Buffer organization
• Instruction Memory Exploration Software and Compiler Transformation
• Conclusions

Context
Low Power Embedded Systems
• Battery operated (low energy): 10-50 MOPS/mW
• Small
• Low cost
• Flexible
Multimedia Applications
• Video, audio, wireless
• High performance: 10-100 GOPS
• Real-time constraints
Result: Low Energy Embedded Systems

Context
Embedded systems: programmable, processor based
Embedded processors
• Power breakdown
 – 43% of power in on-chip memory (StrongARM SA110: "A 160MHz 32b 0.5W CMOS ARM processor")
 – 40% of power in internal memory (C6x, Texas Instruments Inc.)
 – 25-30% of power in instruction memory
To address the data memory issues:
• Data Transfer and Storage Exploration (DTSE) methodology, F. Catthoor et al.

Related Work
Significant power consumption in the instruction memory hierarchy
[Figure: Core, L1 cache (on-chip), Main Memory (off-chip)]
Compression (code size reduction)
- L. Benini et al., “Selective Instruction Compression for Memory Energy Reduction...”, ISLPED 1999
- P. Centoducatte et al., “Compressed Code Execution on DSP Architectures”, ISSS 1999
- T. Ishihara et al., “A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors”, DATE 2000
Software Transformations
- N. D. Zervas et al., “A Code Transformation-Based Methodology for Improving I-Cache Performance of DSP Applications”, ICECS 2001
- S. Parameswaran et al., “I-CoPES: Fast Instruction Code Placement for Embedded Systems to Improve Performance and Energy Efficiency”, ICCAD 2001

Overview
• Context: Introduction to the problem
• Motivation for L0 Buffer organization and status
• Distributed L0 Buffer organization
• Instruction Memory Exploration Software and Compiler Transformation
• Conclusions

Application Domain: Multimedia Characteristics (1)
High locality: IC_static < 1% of IC_dynamic
[Figure: static versus dynamic instruction count (IC static, IC dynamic) on a percentage scale]

Application Domain: Multimedia Characteristics (2)
[Figure: normalized dynamic instruction count versus normalized static instruction count]
Within a program, a few basic blocks or instructions take up most of the execution time (IC_dynamic)

Motivation for additional small memory
Application domain: high locality in a few basic blocks
• The size of the basic blocks with high locality is still large
• If the L1 cache (on-chip) is made small:
 – performance degrades: capacity (compulsory) misses
 – system power increases: off-chip memory / bus activity increases
A small memory, in addition to the conventional L1 cache, should be used to reduce energy without compromising performance
[Figure: Core, L1 cache (on-chip), Main Memory (off-chip)]

Related Work (Microarchitecture): Cache Design
N. Jouppi, “Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers”, ISCA 1990
• Aim: to reduce miss penalty cycles
• Miss caching, victim caching, stream buffers
[Figure: Core, L1 cache (on-chip) with an additional small cache, Main Memory (off-chip)]

Related Work (Microarchitecture): Cache Design
J. D. Bunda et al., “Instruction-Processing Optimization Techniques for VLSI Microprocessors”, PhD thesis, 1993
• Aim: to reduce instruction cache energy
• L0 buffer: cache block buffer (1 cache block + 1 tag)
• Limitations: block thrashing
[Figure: Core, L0 Buffer, L1 cache (on-chip), Main Memory (off-chip)]
J. Kin et al., “Filtering memory references to increase energy efficiency”, IEEE Trans. on Computers, 2000
• Aim: to reduce instruction cache energy
• L0 buffer: filter cache
 – Small regular cache (< 1KB)
 – L0 access (hit) latency: 1 cycle
 – L1 access (hit) latency: 2 cycles
• Limitations:
 – Energy is reduced at the expense of performance
 – 256-byte filter cache: 58% power reduction with 21% performance degradation

Related Work (Architecture): Software controlled L0 buffers
R. S. Bajwa et al., “Instruction Buffering to Reduce Power in Processors for Signal Processing”, IEEE Trans. VLSI Systems, vol. 5, no. 4, 1997
L. H. Lee et al. (M-CORE), “Instruction Fetch Energy Reduction Using Loop Caches for Applications with Small and Tight Loops”, ISLPED 1999
[Figure: Core, L0 Buffer with LC, L1 cache (on-chip), Main Memory (off-chip)]
- L0 Buffer: buffer (< 1KB) + Local Controller (LC); [no tags]
- L0 / L1 access latency: 1 cycle
- Used only for specific program segments (innermost loops)
- Software control: special instructions (lbon, sbb) map program segments to the L0 buffer
[Figure: three Datapath / L1 / L0 configurations labelled Normal Operation, Filling and L0 Buffer Operation, linked by Initiation, Execution and Termination]

Related Work (Architecture): Software controlled L0 buffers
• Assumed architecture
 – MIPS 4000 ISA
 – Single-issue processor
 – L1 cache: 16KB, direct mapped
 – Loop buffer (2KB): depth = 128 instructions, width = 16 bytes
• Tools
 – SimpleScalar 2.0
 – Wattch power estimator
• Loops with fewer than 128 instructions were hand-mapped onto the loop buffer
[Figure: bar chart of normalized energy consumption (scale 0 to 100) for cav_det, cjpeg, djpeg, epic, g721, gsm, mpeg2d, pegwit and unepic]

Related Work (Architecture): Software controlled L0 buffers
• Advantages
 – 50% (avg) energy reduction, with no performance degradation
 – Software control: allows mapping only selected program segments
• Limitations
 – Supports only innermost loops (regular basic blocks)
 – Other frequently executed basic blocks are still fetched from the L1 cache
 – No support for control constructs within loops
F. Vahid et al. [2001-2002]:
 – Hardware support for conditional constructs within loops
 – Identifying the loop address bounds (preloading the program segment/loop)
 – Sub-routines, conditional constructs, 1 level nested loop

Related Work (Architecture): Compiler controlled L0 buffers
N. Bellas et al., “Architectural and Compiler Support for Energy Reduction in Memory Hierarchy of High Performance Microprocessors”, ISLPED 1998
• Aim: reduce instruction cache energy by letting the compiler allocate basic blocks to the L0 buffer
• L0 Buffer: regular cache (< 1KB; 128 instr)
• Technique:
 – profile
 – function inlining
 – identify basic blocks
 – code layout
[Figure: Core, L0 Buffer, L1 cache (on-chip), Main Memory (off-chip); code layout places the basic blocks allocated to the L0 buffer in the L0 buffer address space]
Advantages
- Automated: a ‘tool’ can do this job
- Use of the basic block as the atomic unit of allocation
- 60% (avg) energy reduction in the i-mem hierarchy [SPEC95]
Limitations
- Tag overhead

Loop Buffers: Commercial Processors
• RISC DSP processors
 – SH-DSP: decoded instruction buffers; supports regular loops (no conditional constructs/nested loops)
• VLIW processors
 – StarCore SC140: supports regular and nested loops; conditional constructs through predication
 – STMicroelectronics ST120: supports nested loops and loops with conditional constructs

Overview
• Context: Introduction to the problem
• Motivation for L0 Buffer organization and status
• Distributed L0 Buffer organization
• Instruction Memory Exploration Software and Compiler Transformation
• Conclusions

Shortcomings
• So far...
 – Hardware, software and compiler optimizations to increase accesses/activity at L0 buffers
[Figure: Core, L0 Buffer, L1 cache (on-chip), Main Memory (off-chip), with increased accesses (activity) at the L0 buffer]
• Bottleneck to solve
 – L0 buffer organization
 – Interconnect: from L0 buffer to datapath
 – Efficient buffer controller
• The organization should be scalable with an increase in #FUs
[Figure: centralized organization, one L0 buffer with LC feeding four FUs]

Current Organizations for L0 Buffers
Uncompressed L0 Buffer
• Buffer: width tied to issue width (#FUs)
• Interconnect: long
• LC: simple addressing (counter based)
Ref: Bajwa et al., L. H. Lee et al., F. Vahid et al.
[Figure: one L0 buffer feeding four FUs]
Compressed L0 Buffer
• Buffer:
 – High storage density (no NOPs)
 – Width tied to issue width (#FUs)
 – Overhead in decompressing
• Interconnect: still centralized, long lines
• LC: simple addressing (counter based)
Ref: TI (execute packet fetch mechanism)
[Figure: one L0 buffer feeding four FUs through a decompressor/dispatch stage]

Current Organizations for L0 Buffers…
Sub-banked/Partitioned L0 Buffer with Compression
• Buffer: smaller memories, overhead in re-organizer
• Interconnect: still centralized
• LC: complex addressing (needs expensive tags)
• No correlation between partitioning and FUs
Ref: T. Conte et al. [TINKER]
[Figure: Banks 1 to 4 with one LC, feeding four FUs through a re-organizer]
Partitioned L0 Buffer
• Buffer: smaller memories
• Interconnect: still long
• LC:
 – Simple addressing (counter based)
 – Needs to access all the banks simultaneously, even if some of the FUs are not active
Ref: Sub-banking
[Figure: partitions par 1 to par 4 with one LC, feeding four FUs]

Solution: Distributed Instruction Buffer Organization
A balance of energy consumption between buffers, interconnect and local controllers is needed.
[Figure: FUs grouped into instruction clusters, each with its own buffers and ATC; labels: Distributor/Dispatch, Instruction Cluster, IROC]
Buffer Control
• Stores instructions in each partition
• Fetches instructions during loop execution
• Regulates the accesses to each partition
Buffers
• Sub-banked/partitioned in correlation with FU activation
Interconnect
• Localized (limited connectivity between FUs and buffers)
ATC: Address Translation and Control
IROC: Instruction Registers Operation and Control

Distributed L0 Buffer Operation
• Similar to conventional L0 buffer operation
• Initiation: special instruction LBON <offset>
• Filling: pre-fetching instructions from <start> to <end>
• Termination: when the program flow jumps to an address outside the <start> to <end> range
[Figure: three Datapath / L1 / Distributed L0 configurations labelled Normal Operation, Filling and L0 Buffer Operation, linked by Initiation, Execution and Termination]
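
The three phases above can be sketched as a small controller model. This is only an illustration under stated assumptions: LBON is taken to mark the address range [pc + 1, pc + offset], filling is modeled as a burst prefetch from L1, and the class, method and counter names are hypothetical rather than taken from the original design.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"        # instructions are fetched from L1
    L0_ACTIVE = "l0_active"  # instructions are fetched from the L0 buffer


class L0Controller:
    """Toy model of a software-controlled L0 buffer (hypothetical names)."""

    def __init__(self, l1_memory):
        self.l1 = l1_memory  # address -> instruction
        self.l0 = {}         # filled when LBON is executed
        self.mode = Mode.NORMAL
        self.start = self.end = None
        self.l1_accesses = 0
        self.l0_accesses = 0

    def lbon(self, pc, offset):
        """Initiation and filling: prefetch [pc + 1, pc + offset] into L0.

        The meaning of <offset> is an assumption made for this sketch.
        """
        self.start, self.end = pc + 1, pc + offset
        self.l0 = {}
        for addr in range(self.start, self.end + 1):
            self.l0[addr] = self.l1[addr]
            self.l1_accesses += 1
        self.mode = Mode.L0_ACTIVE

    def fetch(self, pc):
        """Return the instruction at pc and update mode and counters."""
        if self.mode is Mode.L0_ACTIVE and not (self.start <= pc <= self.end):
            self.mode = Mode.NORMAL  # termination: PC left the loop range
        if self.mode is Mode.L0_ACTIVE:
            self.l0_accesses += 1
            return self.l0[pc]
        self.l1_accesses += 1
        return self.l1[pc]


# Made-up program: a three-instruction loop executed ten times.
program = {0: "LBON 3", 1: "add", 2: "mul", 3: "bnz 1", 4: "ret"}
ctrl = L0Controller(program)
ctrl.fetch(0)              # the LBON instruction itself comes from L1
ctrl.lbon(pc=0, offset=3)  # fill the buffer with addresses 1..3
for _ in range(10):
    for pc in (1, 2, 3):
        ctrl.fetch(pc)
ctrl.fetch(4)              # leaves the range, back to normal operation
print(ctrl.l1_accesses, ctrl.l0_accesses, ctrl.mode)
```

In this made-up run the loop body is served entirely from the L0 buffer once it has been filled, so L1 is touched only for the LBON instruction, the fill, and the instruction after the loop.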

The Buffer Operation: An Illustration
Source loop:
for (..)
{ …
  if (..) {.….}
  else {.….}
  …
}
Schedule (one row per VLIW instruction; columns FU1, FU2, FU3, BR):
      LBON <offset>
S:    OP11  OP21  OP31  NOP
      NOP   OP22  OP32  BNZ ‘x’
      OP12  NOP   NOP   BR ‘y’     (if block)
X:    OP13  NOP   OP33  NOP        (else block)
Y:    OP14  OP23  NOP   BNZ ‘s’
IROC registers: START_ADDR, END_ADDR, IR_USE, NEW_PC, PC
Distributed buffer contents, with one (index, flag) entry per schedule row; "-" with flag 0 marks a row in which the partition holds no operation:
Partition  Buffer contents            Row 1  Row 2  Row 3  Row 4  Row 5
FU1        OP11 OP12 OP13 OP14        0 1    - 0    1 1    2 1    3 1
FU2        OP21 OP22 OP23             0 1    1 1    - 0    - 0    2 1
FU3        OP31 OP32 OP33             0 1    1 1    - 0    2 1    - 0
BR         BNZ ‘x’ BR ‘y’ BNZ ‘s’     - 0    0 1    1 1    - 0    2 1
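
The way the schedule in the illustration is spread over the four partitions can be reproduced with a short sketch. It assumes that each partition stores only its non-NOP operations and keeps, per schedule row, a pair made of a local index and a use flag, which is how the digit pairs on the slide read; the function and variable names are hypothetical.

```python
NOP = "NOP"

# Schedule from the illustration: one row per VLIW instruction,
# one column per partition (FU1, FU2, FU3, BR).
SCHEDULE = [
    ["OP11", "OP21", "OP31", NOP],
    [NOP, "OP22", "OP32", "BNZ x"],
    ["OP12", NOP, NOP, "BR y"],
    ["OP13", NOP, "OP33", NOP],
    ["OP14", "OP23", NOP, "BNZ s"],
]
PARTITIONS = ["FU1", "FU2", "FU3", "BR"]


def distribute(schedule, partitions):
    """Split a schedule into per-partition buffers and index tables.

    Returns {partition: (buffer, table)}. The buffer holds only the non-NOP
    operations; table[row] is (index, 1) when the partition is used in that
    row and (None, 0) otherwise.
    """
    result = {}
    for col, name in enumerate(partitions):
        buffer, table = [], []
        for row in schedule:
            op = row[col]
            if op == NOP:
                table.append((None, 0))
            else:
                table.append((len(buffer), 1))
                buffer.append(op)
        result[name] = (buffer, table)
    return result


layout = distribute(SCHEDULE, PARTITIONS)
for name, (buffer, table) in layout.items():
    print(name, buffer, table)
```

Each partition ends up storing three or four operations instead of five full-width rows, and a partition with flag 0 in a given row does not need to be accessed in that cycle.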

Energy Trade-Offs
$$Energy = \sum_{i=1}^{\#partitions} E_{buffer,i} + \sum_{i=1}^{\#partitions} E_{LC,i} + \sum_{i=1}^{\#partitions} E_{interconnect,i}$$
[Figure: normalized energy versus #partitions (from 1 to #FUs), with curves for E_buffer, E_interconnect and E_LC relative to the baseline]
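
The sum above translates directly into code. The sketch below is a minimal illustration; the `Partition` record, the function name and all numbers are invented for the example and are not measurements from the slides.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    e_buffer: float        # energy spent in this buffer partition
    e_lc: float            # energy spent in its local controller
    e_interconnect: float  # energy spent on the wires to its FUs


def total_energy(partitions):
    """Sum the three energy components over all partitions."""
    return sum(p.e_buffer + p.e_lc + p.e_interconnect for p in partitions)


# Invented numbers: one wide centralized buffer versus two smaller partitions.
centralized = [Partition(e_buffer=10.0, e_lc=1.0, e_interconnect=4.0)]
two_partitions = [
    Partition(e_buffer=4.0, e_lc=1.0, e_interconnect=1.5),
    Partition(e_buffer=4.0, e_lc=1.0, e_interconnect=1.5),
]
print(total_energy(centralized), total_energy(two_partitions))
```

With these invented values the controller term doubles when the buffer is split, while the buffer and interconnect terms shrink, which is the kind of balance the trade-off is about.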

Profile Based Clustering
[Figure: the dynamic trace, the static trace and the energy models (register file) feed the instruction clustering step, which produces the instruction clusters]
Dynamic trace (during loop execution):
1 1 1 0 0 … 1
1 0 1 0 1 … 0
0 1 1 0 1 … 1
...
1 1 1 0 1 … 0
Static trace (loops mapped to L0):
begin
1 1 1 0 0 … 1
1 0 1 0 1 … 0
end
begin
0 1 1 0 1 … 1
end
Instruction Cluster: a group of functional units with a separate local controller and an instruction buffer partition
Optimization problem:
$$\min \; Energy(clust, Dynamic_{profile}, Static_{profile}) \quad \text{s.t.} \quad \sum_{i=1}^{max\_clusters} clust(i,j) = 1 \;\; \forall j$$
where clust(i,j) = 1 if the j-th FU is assigned to cluster i, and 0 otherwise.
Instruction clusters (result):
- FU grouping
- Width and depth of instruction buffers in each partition
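
For a handful of FUs the minimization can be illustrated by exhaustive search. The sketch below is not the authors' clustering tool: it uses a made-up energy model (each active cluster costs one unit per FU it contains plus a fixed controller cost), looks only at the dynamic trace, and ignores buffer depth and the static profile. Representing the assignment as one cluster index per FU enforces the constraint that every FU belongs to exactly one cluster.

```python
from itertools import product


def energy(assignment, trace, e_per_fu=1.0, e_controller=0.5):
    """Hypothetical energy model for one FU-to-cluster assignment.

    assignment[j] is the cluster index of FU j. In every cycle, each cluster
    with at least one active FU is read once; a read costs e_per_fu per FU in
    the cluster (buffer width) plus e_controller for its local controller.
    """
    widths = {}
    for cluster in assignment:
        widths[cluster] = widths.get(cluster, 0) + 1
    total = 0.0
    for cycle in trace:
        active = {assignment[j] for j, bit in enumerate(cycle) if bit}
        total += sum(widths[c] * e_per_fu + e_controller for c in active)
    return total


def best_clustering(trace, max_clusters):
    """Exhaustively search all assignments (feasible for small #FUs only)."""
    n_fus = len(trace[0])
    best = min(
        product(range(max_clusters), repeat=n_fus),
        key=lambda a: energy(a, trace),
    )
    return best, energy(best, trace)


# Made-up activation trace: one row per cycle, one column per FU.
trace = [
    (1, 1, 0, 0),
    (1, 1, 0, 0),
    (1, 1, 0, 0),
    (0, 0, 1, 1),
]
assignment, cost = best_clustering(trace, max_clusters=2)
print(assignment, cost)
```

In this toy trace FUs 0 and 1 are always active together, as are FUs 2 and 3, so the search groups them accordingly. The search space grows as max_clusters to the power #FUs, so a practical tool would need a heuristic.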

Results
[Figure: normalized energy (scale 0 to 1.4) versus #partitions (1 to 10) for adpcmd, djpeg, idct and mpeg2d]
$$Energy = \sum_{i=1}^{\#partitions} E_{buffer,i} + \sum_{i=1}^{\#partitions} E_{LC,i}$$
Assumptions
- Only the buffers and controllers are modeled (no interconnect as yet)
- #FUs in datapath = 10
- Fixed schedule (activation trace)
- Schedule generated using Trimaran 2.0

In Comparison With Other Schemes
[Figure: bar chart of normalized energy (scale 0 to 1), split into Buffers and Controller/Overhead, for five schemes: Uncompressed; Compressed; Partitioned (sub-banked, no access regulation); Clustered (varying width only); Clustered (varying both width and depth)]
Results shown for ADPCM
Uncompressed - centralized L0 buffer
Compressed - centralized L0 buffer, 2 additional registers for VL decoding
Partitioned (no control) - 2 partitions
Clustered (width only) - 3 partitions
Clustered (width and depth) - 2 partitions

Fully Distributed Instruction Memory Hierarchy
[Figure: four L0 clusters, each an L0 buffer serving its own group of FUs (3, 2, 3 and 4 FUs), fed by two on-chip L1 caches (L1 clusters) that are backed by off-chip main memory]

Overview
• Context: Introduction to the problem
• Motivation for L0 Buffer organization and status
• Distributed L0 Buffer organization
• Instruction Memory Exploration Software and Compiler Transformation
• Conclusions

Exploration Methodology: What we have
[Figure: flow from Application through Software Transformations and Compiler (Scheduling) to the Clustering Tool, which uses Energy Models and produces Instruction Clusters]
Pareto Curve Generation
- For choosing the operating point at run-time
- Enables the designer to assess the trade-off between energy and performance
[Figure: energy versus delay curve; one end is optimized for performance (maximum cluster activity), the other for energy (minimal cluster activity)]
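
Pareto curve generation can be sketched as filtering a set of candidate (delay, energy) points, keeping only those that no other point beats on both axes. The function name and the sample points are assumptions made for this example; in the flow above the points would come from scheduling and clustering the application in different ways.

```python
def pareto_front(points):
    """Return the (delay, energy) points not dominated by any other point.

    A point q dominates p when q is no worse in both delay and energy and
    differs from p, so it is strictly better in at least one of the two.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Made-up exploration results: (delay in cycles, normalized energy).
candidates = [(10, 9.0), (12, 7.5), (12, 8.0), (15, 6.0), (16, 6.5)]
print(pareto_front(candidates))
```

The remaining points run from the fastest, most energy-hungry design to the slowest, most frugal one, which is the set an operating point would be picked from.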

Exploration Methodology: What we want to achieve…
[Figure: flow from Application through Software Transformations to the Compiler (Scheduling & Clustering), which uses Energy Models and produces both the Instruction Clusters and the Schedule]
Pareto Curve Generation
- For choosing the operating point at run-time
- Enables the designer to assess the trade-off between energy and performance
[Figure: energy versus delay curve; one end is optimized for performance (maximum cluster activity), the other for energy (minimal cluster activity)]

Compiler Scheduling
Compiler scheduling can change the functional unit activity, and hence the clustering result, energy and performance.
Energy reduction without performance loss:
  OP11  OP12  -     OP13  -     OP14      all 3 clusters need to be active
  OP11  OP12  OP13  OP14  -     -         only 2 clusters need to be active
Energy reduction at the expense of performance loss:
  OP11  OP12  -     OP13  -     OP14
  OP21  -     OP22  -     OP23  -         2 activations of all 3 clusters
versus
  OP11  OP12  -     -     -     -
  OP21  -     -     -     -     -
  -     -     OP22  OP13  OP23  OP14      2 activations for the 1st cluster, 1 activation each for the 2nd and 3rd
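
The activation counts quoted on this slide can be checked with a small sketch. It assumes six FU slots grouped into three clusters of two adjacent FUs, which is what the counts imply but is not stated explicitly; the function name is hypothetical.

```python
def cluster_activations(schedule, clusters):
    """Count, per cluster, the cycles in which at least one of its FUs is busy.

    schedule: list of cycles, each a list with one entry per FU (None = idle).
    clusters: list of lists of FU indices.
    """
    counts = [0] * len(clusters)
    for cycle in schedule:
        for c, fus in enumerate(clusters):
            if any(cycle[fu] is not None for fu in fus):
                counts[c] += 1
    return counts


CLUSTERS = [[0, 1], [2, 3], [4, 5]]  # assumed grouping: 3 clusters of 2 FUs

# Two-cycle schedule: every cluster is active in both cycles.
original = [
    ["OP11", "OP12", None, "OP13", None, "OP14"],
    ["OP21", None, "OP22", None, "OP23", None],
]
# Three-cycle schedule holding the same seven operations.
rescheduled = [
    ["OP11", "OP12", None, None, None, None],
    ["OP21", None, None, None, None, None],
    [None, None, "OP22", "OP13", "OP23", "OP14"],
]
print(cluster_activations(original, CLUSTERS))
print(cluster_activations(rescheduled, CLUSTERS))
```

The rescheduled version needs four cluster activations instead of six, but takes three cycles instead of two.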

Software Transformations
High-level code transformations can also change the clustering result, and hence energy and performance.
Loop Transformations
- Loop splitting
- Loop merging
- Loop peeling (for nested loops)
- Loop collapsing (nested loops)
- Code movement across loops
- ... etc.
[Figure: loop splitting, one loop becomes loop 1 and loop 2]
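
As a reminder of what loop splitting does, the sketch below shows one loop with two independent statements and the equivalent pair of smaller loops. It is written in Python only for consistency with the other sketches; the function names and the loop body are made up, and both inputs are assumed to have the same length.

```python
def fused(a, b):
    """One loop whose body performs two independent computations."""
    out_sum, out_scaled = [], []
    for i in range(len(a)):
        out_sum.append(a[i] + b[i])
        out_scaled.append(3 * a[i])
    return out_sum, out_scaled


def split(a, b):
    """The same computation after loop splitting: two smaller loops."""
    out_sum = []
    for i in range(len(a)):      # loop 1: additions only
        out_sum.append(a[i] + b[i])
    out_scaled = []
    for i in range(len(a)):      # loop 2: multiplications only
        out_scaled.append(3 * a[i])
    return out_sum, out_scaled


print(fused([1, 2, 3], [4, 5, 6]) == split([1, 2, 3], [4, 5, 6]))
```

Each of the two resulting loops has a smaller body that exercises a narrower mix of operations, which is one way such a transformation could change which FUs, and hence which clusters, are active.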

Overview
• Context: Introduction to the problem
• Motivation for L0 Buffer organization and status
• Distributed L0 Buffer organization
• Instruction Memory Exploration Software and Compiler Transformation
• Conclusions

Conclusions
• L0 Buffer Organization
 – Multimedia applications have high locality in small program segments
 – An additional small L0 buffer should be used
 – Current options for L0 buffers are still not energy efficient
 – A distributed L0 buffer organization should be sought
 – But the clustering/partitioning should be application specific
• L1 Cache Organization
 – Distributed (?)
• Instruction Memory Exploration
 – Software transformations and compiler scheduling can change the clustering results
 – An exploration methodology should be sought to analyze the trade-offs in energy and performance (Pareto curves)