Distributed L0 Buffer Architecture and Exploration for Low Energy Embedded Systems Murali Jayapala Francisco Barat Pieter Op de Beeck Tom Vander Aa Geert

Distributed L0 Buffer Architecture and Exploration for Low Energy Embedded Systems

Murali Jayapala

Francisco Barat

Pieter Op de Beeck

Tom Vander Aa

Geert Deconinck

ESAT/ACCA, K.U.Leuven, Belgium

Francky Catthoor

Henk Corporaal

IMEC, Leuven, Belgium

ESAT/ACCA

Overview

• Context: Introduction to the problem

• Motivation for L0 Buffer organization and status

• Distributed L0 Buffer organization

• Instruction Memory Exploration: Software and Compiler Transformations

• Conclusions

Context

Low Power Embedded Systems
• Battery operated (low energy): 10-50 MOPS/mW
• Small, low cost, flexible

Multimedia Applications
• Video, audio, wireless
• High performance: 10-100 GOPS, real-time constraints

Low Energy Embedded Systems

Context

Embedded systems: programmable, processor based

Embedded processors
• Power breakdown
– 43% of power in on-chip memory (StrongARM SA110: "A 160MHz 32b 0.5W CMOS ARM processor")
– 40% of power in internal memory (C6x, Texas Instruments Inc.)
– 25-30% of power in instruction memory
• To address the data memory issues: Data Transfer and Storage Exploration (DTSE) methodology, F. Catthoor et al.

Related Work

Significant power consumption in the instruction memory hierarchy

[Diagram: core, L1 cache (on-chip), main memory (off-chip)]

Compression (code size reduction)

- L. Benini et al., “Selective Instruction Compression for Memory Energy Reduction...”, ISLPED 1999

- P. Centoducatte et al., “Compressed Code Execution on DSP Architectures”, ISSS 1999

- T. Ishihara et al., “A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors”, DATE 2000

Software Transformations

- N. D. Zervas et al., “A Code Transformation-Based Methodology for Improving I-Cache Performance of DSP Applications”, ICECS 2001

- S. Parameswaran et al., “I-CoPES: Fast Instruction Code Placement for Embedded Systems to Improve Performance and Energy Efficiency”, ICCAD 2001


Application Domain: Multimedia Characteristics (1)

High locality in instruction count: IC static < 1% of IC dynamic

[Figure: static instruction count vs. dynamic instruction count]

Application Domain: Multimedia Characteristics (2)

[Figure: normalized dynamic instruction count vs. normalized static instruction count]

Within a program, a few basic blocks or instructions take up most of the execution time (IC dynamic)
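The locality claim can be checked on a basic-block profile. The sketch below is only an illustration, not the authors' tooling: the function name `static_share_for_coverage` and the profile numbers are hypothetical. It sorts basic blocks by execution count and reports which fraction of the static instructions is needed to cover a given share of the dynamic instruction count.

```python
def static_share_for_coverage(blocks, coverage=0.9):
    """Fraction of static instructions needed to cover `coverage` of IC dynamic.

    blocks: list of (static_instruction_count, execution_count) per basic block.
    """
    total_static = sum(size for size, _ in blocks)
    total_dynamic = sum(size * count for size, count in blocks)
    covered = 0   # dynamic instructions covered so far
    used = 0      # static instructions selected so far
    # Take the most frequently executed blocks first.
    for size, count in sorted(blocks, key=lambda b: b[1], reverse=True):
        if covered >= coverage * total_dynamic:
            break
        covered += size * count
        used += size
    return used / total_static


# Hypothetical profile: two hot loop bodies and a lot of cold code.
profile = [(10, 100000), (20, 50000), (500, 10), (470, 1)]
print(static_share_for_coverage(profile, 0.9))
```

With this made-up profile, 30 of 1000 static instructions already cover more than 90% of the executed instructions, which is the kind of skew the slide describes.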

Motivation for additional small memory

Application domain: high locality in a few basic blocks

• Size (basic blocks with high locality) is still large

• If the L1 cache (on-chip) is made small:
– performance degrades: capacity (compulsory) misses
– system power increases: off-chip memory / bus activity increases

A small memory, in addition to the conventional L1 cache, should be used to reduce energy without compromising performance

[Diagram: core, L1 cache (on-chip)]

Related Work (Microarchitecture): Cache Design

N. Jouppi et al., “Improving direct-mapped cache performance by addition of a small fully-associative cache and prefetch buffers”, ISCA 1990

• Aim: to reduce miss penalty cycles

• Techniques: miss caching, victim caching, stream buffers

[Diagram: core, small cache, L1 cache (on-chip)]

Related Work (Microarchitecture): Cache Design

J. D. Bunda et al., “Instruction-Processing Optimization Techniques for VLSI Microprocessors”, PhD thesis, 1993

• Aim: to reduce instruction cache energy

• L0 buffer: cache block buffer (1 cache block + 1 tag)

• Limitations: block thrashing

[Diagram: core, L0 buffer, L1 cache (on-chip)]

J. Kin et al., “Filtering memory references to increase energy efficiency”, IEEE Trans. on Computers, 2000

• Aim: to reduce instruction cache energy

• L0 buffer: filter cache
– Small regular cache (< 1KB)
– L0 access (hit) latency: 1 cycle
– L1 access (hit) latency: 2 cycles

• Limitations:
– Energy is reduced at the expense of performance
– 256-byte filter cache: 58% power reduction with 21% performance degradation

Related Work (Architecture): Software controlled L0 buffers

R. S. Bajwa et al., “Instruction Buffering to Reduce Power in Processors for Signal Processing”, IEEE Trans. VLSI Systems, vol. 5, no. 4, 1997

L. H. Lee et al. (M-CORE), “Instruction Fetch Energy Reduction Using Loop Caches for Applications with Small and Tight Loops”, ISLPED 1999

- L0 buffer: buffer (< 1KB) + Local Controller (LC); [no tags]
- L0 / L1 access latency: 1 cycle
- Used only for specific program segments (innermost loops)
- Software control: special instructions (lbon, sbb) map program segments to the L0 buffer

[Diagram: core, L0 buffer with LC, L1 cache (on-chip)]

[Diagram: datapath fetching from L1 or L0 in three states (normal operation, filling, L0 buffer operation), with transitions labeled initiation, execution and termination]


• Assumed architecture
– MIPS 4000 ISA, single-issue processor
– L1 cache: 16KB, direct mapped
– Loop buffer (2KB): depth = 128 instructions, width = 16 bytes

• Tools
– SimpleScalar 2.0
– Wattch power estimator

• Loops with fewer than 128 instructions were hand-mapped onto the loop buffer

[Figure: normalized energy consumption (scale 0 to 100) for cav_det, cjpeg, djpeg, epic, g721, gsm, mpeg2d, pegwit, unepic]




• Advantages
– 50% (avg) energy reduction, with no performance degradation
– Software control: allows mapping only selected program segments

• Limitations
– Supports only innermost loops (regular basic blocks)
– Other frequently executed basic blocks are still fetched from the L1 cache
– No support for control constructs within loops

F. Vahid et al. [2001-2002]: hardware support for
– conditional constructs within loops
– identifying the loop address bounds (preloading the program segment/loop)
– sub-routines, conditional constructs, 1-level nested loops

Related Work (Architecture): Compiler controlled L0 buffers

N. Bellas et al., “Architectural and Compiler Support for Energy Reduction in Memory Hierarchy of High Performance Microprocessors”, ISLPED 1998

• Aim: reduce instruction cache energy by letting the compiler assume the role of allocating basic blocks to the L0 buffer

• L0 buffer: regular cache (< 1KB; 128 instr)

• Technique:
– profile
– function inlining
– identify basic blocks
– code layout

[Diagram: core, L0 buffer, L1 cache (on-chip); code layout places the basic blocks allocated to the L0 buffer in the L0 buffer address space]

Advantages
- Automated: a ‘tool’ can do this job
- Use of the basic block as the atomic unit of allocation
- 60% (avg) energy reduction in i-mem hierarchy [SPEC95]

Limitations
- Tag overhead

Loop Buffers: Commercial Processors

• RISC DSP processors
– SH-DSP: decoded instruction buffers; supports regular loops (no conditional constructs/nested loops)

• VLIW processors
– StarCore SC140: supports regular and nested loops; conditional constructs through predication
– STMicroelectronics ST120: supports nested loops and loops with conditional constructs


Shortcomings

• So far...
– Hardware, software and compiler optimizations to increase accesses/activity at L0 buffers

[Diagram: core, L0 buffer, L1 cache (on-chip); increased accesses (activity) at the L0 buffer]

• Bottleneck to solve
– L0 buffer organization
– Interconnect: from L0 buffer to datapath
– Efficient buffer controller

• Organization should be scalable with the increase in #FUs

[Diagram: centralized organization, a single L0 buffer with LC feeding four FUs]

Current Organizations for L0 Buffers

Uncompressed L0 Buffer
• Buffer: Width issue width (# FUs)
• Interconnect: long
• LC: simple addressing (counter based)
Ref: Bajwa et al., L. H. Lee et al., F. Vahid et al.
[Diagram: L0 buffer feeding four FUs directly]

Compressed L0 Buffer
• Buffer:
– High storage density (no NOPs)
– Width issue width (# FUs)
– Overhead in decompressing
• Interconnect: still centralized, long lines
• LC: simple addressing (counter based)
Ref: TI (execute packet fetch mechanism)
[Diagram: L0 buffer feeding four FUs through a decompressor/dispatch stage]

Current Organizations for L0 Buffers…

Sub-banked/Partitioned L0 Buffer with Compression
• Buffer: smaller memories, overhead in re-organizer
• Interconnect: still centralized
• LC: complex addressing (needs expensive tags)
• No correlation between partitioning and FUs
Ref: T. Conte et al. [TINKER]
[Diagram: banks 1-4 with LC, feeding four FUs through a re-organizer]

Partitioned L0 Buffer
• Buffer: smaller memories
• Interconnect: still long
• LC:
– Simple addressing (counter based)
– Needs to access all the banks simultaneously, even if some of the FUs are not active
Ref: Sub-banking
[Diagram: partitions 1-4 with LC, feeding four FUs]

Solution: Distributed Instruction Buffer Organization

A balance of energy consumption between buffers, interconnect and local controllers is needed

[Diagram: instruction clusters, each with its own buffers, ATC and FUs; a shared IROC and distributor/dispatch]

Buffer Control
• Stores instructions in each partition
• Fetches instructions during loop execution
• Regulates the accesses to each partition

Buffers
• Sub-banked/partitioned in correlation with FU activation

Interconnect
• Localized (limited connectivity between FUs and buffers)

ATC: Address Translation and Control

IROC: Instruction Registers Operation and Control

Distributed L0 Buffer Operation

• Similar to conventional L0 buffer operation

• Initiation: special instruction LBON <offset>

• Filling: pre-fetching instructions from <start> to <end>

• Termination: when the program flow jumps to an address outside the <start> to <end> range

[Diagram: datapath fetching from L1 or the distributed L0 in three states (normal operation, filling, L0 buffer operation), with transitions labeled initiation, execution and termination]
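The initiation, filling and termination steps can be sketched as a small fetch-counting model. This is only an illustration under stated assumptions, not the authors' controller: the function `simulate_fetches` is hypothetical, the buffered range is assumed to be the `<offset>` instructions that follow LBON, and filling is modeled as one L1 fetch per buffered instruction.

```python
def simulate_fetches(trace, lbon):
    """Count instruction fetches served by L1 and by the L0 buffer.

    trace: program-counter values in execution order.
    lbon:  dict mapping the address of an LBON instruction to its <offset>.
    """
    l1_fetches = 0
    l0_fetches = 0
    start = end = None  # active L0 address range; None in normal operation
    for pc in trace:
        if start is not None and start <= pc <= end:
            l0_fetches += 1          # execution out of the L0 buffer
            continue
        start = end = None           # termination: PC left the buffered range
        l1_fetches += 1              # normal operation: fetch from L1
        if pc in lbon:
            # Initiation. Assumption: the range is the `offset`
            # instructions immediately after LBON.
            start, end = pc + 1, pc + lbon[pc]
            l1_fetches += lbon[pc]   # filling: each instruction read once from L1
    return l1_fetches, l0_fetches


# LBON at address 9 covers a 3-instruction loop body executed 3 times.
trace = [9, 10, 11, 12, 10, 11, 12, 10, 11, 12, 13, 14]
print(simulate_fetches(trace, {9: 3}))   # (L1 fetches, L0 fetches)
print(simulate_fetches(trace, {}))       # without an L0 buffer
```

In this toy trace the loop body is fetched from L1 only during filling; every later access inside the range is served by the L0 buffer until the program counter leaves it.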

The Buffer Operation: An Illustration

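Source loop:

for (..) { … if (..) {.….} else {.….} … }

Scheduled code (one VLIW instruction per row; slots FU1, FU2, FU3, BR):

      LBON <offset>
S:    OP11   OP21   OP31   NOP
      NOP    OP22   OP32   BNZ ‘x’
      OP12   NOP    NOP    BR ‘y’
X:    OP13   NOP    OP33   NOP
Y:    OP14   OP23   NOP    BNZ ‘s’

The third and fourth rows hold the if block and the else block.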

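IROC registers: START_ADDR, END_ADDR, IR_USE, NEW_PC, PC

Buffer contents and ATC entries per cluster. Each cluster has one ATC entry per loop instruction, in program order; an entry is a buffer index followed by a flag, and “-” with flag 0 marks a NOP slot.

Cluster   Buffer contents              ATC entries (index flag)
FU1       OP11 OP12 OP13 OP14          0 1 | - 0 | 1 1 | 2 1 | 3 1
FU2       OP21 OP22 OP23               0 1 | 1 1 | - 0 | - 0 | 2 1
FU3       OP31 OP32 OP33               0 1 | 1 1 | - 0 | 2 1 | - 0
BR        BNZ ‘x’ BR ‘y’ BNZ ‘s’       - 0 | 0 1 | 1 1 | - 0 | 2 1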
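The illustration can be replayed in a few lines of Python. This is a sketch under assumptions rather than the hardware design: the ATC entries are read as (buffer index, flag) pairs with one entry per loop instruction, and the names `issue` and `buffer_accesses` are hypothetical. It shows how each cluster only reads its own buffer when its flag is set, so NOP slots cost no buffer access.

```python
# Per-cluster instruction buffers; NOPs are not stored.
BUFFERS = {
    "FU1": ["OP11", "OP12", "OP13", "OP14"],
    "FU2": ["OP21", "OP22", "OP23"],
    "FU3": ["OP31", "OP32", "OP33"],
    "BR":  ["BNZ x", "BR y", "BNZ s"],
}

# One ATC entry per loop instruction: (buffer index, flag).
# Index None with flag 0 corresponds to a "-" entry (NOP slot).
ATC = {
    "FU1": [(0, 1), (None, 0), (1, 1), (2, 1), (3, 1)],
    "FU2": [(0, 1), (1, 1), (None, 0), (None, 0), (2, 1)],
    "FU3": [(0, 1), (1, 1), (None, 0), (2, 1), (None, 0)],
    "BR":  [(None, 0), (0, 1), (1, 1), (None, 0), (2, 1)],
}


def issue(pc_offset):
    """Rebuild the VLIW instruction at a given offset from the loop start."""
    word = {}
    for cluster, table in ATC.items():
        index, flag = table[pc_offset]
        word[cluster] = BUFFERS[cluster][index] if flag else "NOP"
    return word


def buffer_accesses(path):
    """Count buffer reads per cluster along a path of loop offsets."""
    counts = {cluster: 0 for cluster in ATC}
    for pc_offset in path:
        for cluster, table in ATC.items():
            counts[cluster] += table[pc_offset][1]
    return counts


print(issue(0))
# One iteration that executes rows 0, 1, 2 and 4 of the loop.
print(buffer_accesses([0, 1, 2, 4]))
```

For that iteration the clusters perform 11 buffer reads in total, whereas a centralized buffer that always delivers all four slots would deliver 16 operation slots for the same four instructions.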


Energy Trade-Offs

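$$\text{Energy} = \sum_{i=1}^{\#\text{partitions}} E_{\text{buffer},i} + \sum_{i=1}^{\#\text{partitions}} E_{\text{LC},i} + \sum_{i=1}^{\#\text{partitions}} E_{\text{interconnect},i}$$

[Figure: normalized energy vs. #partitions (from 1 to #FUs), with curves for E buffer, E interconnect and E LC relative to the baseline]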

Profile Based Clustering

Instruction Cluster: a group of functional units with a separate local controller and an instruction buffer partition

Inputs to instruction clustering
• Dynamic trace (during loop execution): one FU activation vector per executed instruction, e.g. 1 1 1 0 0 … 1, 1 0 1 0 1 … 0, 0 1 1 0 1 … 1, …, 1 1 1 0 1 … 0
• Static trace (loops mapped to L0): activation vectors grouped per loop, e.g. begin, 1 1 1 0 0 … 1, 1 0 1 0 1 … 0, end, begin, 0 1 1 0 1 … 1, end
• Energy models (register file)

Output: instruction clusters
- FU grouping
- Width and depth of instruction buffers in each partition

Optimization problem

$$\min \; \text{Energy}(clust, \text{Dynamic}_{profile}, \text{Static}_{profile}) \quad \text{s.t.} \quad \sum_{i=1}^{max\_clusters} clust(i,j) = 1 \;\; \forall j$$

where clust(i,j) = 1 if the jth FU is assigned to cluster i, and 0 otherwise.
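A small exhaustive search shows what the optimization does. This is a sketch, not the authors' clustering tool: the energy model (cost per access proportional to the number of FUs in the cluster, plus a fixed local-controller cost) and all names and constants are hypothetical, and enumerating every FU grouping only scales to a handful of FUs.

```python
def partitions(items):
    """Yield every way to split `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into one of the existing groups ...
        for i, group in enumerate(smaller):
            yield smaller[:i] + [[first] + group] + smaller[i + 1:]
        # ... or into a new group of its own.
        yield [[first]] + smaller


def energy(clusters, trace, e_slot=1.0, e_lc=0.5):
    """Hypothetical model: a cluster is accessed whenever any of its FUs is
    active; one access costs e_slot per FU in the cluster plus e_lc."""
    total = 0.0
    for cluster in clusters:
        accesses = sum(1 for row in trace if any(row[fu] for fu in cluster))
        total += accesses * (e_slot * len(cluster) + e_lc)
    return total


def best_clustering(trace, max_clusters):
    """Exhaustively pick the FU grouping with minimum modeled energy."""
    fus = list(range(len(trace[0])))
    candidates = (c for c in partitions(fus) if len(c) <= max_clusters)
    best = min(candidates, key=lambda c: energy(c, trace))
    return sorted(sorted(c) for c in best), energy(best, trace)


# Hypothetical activation trace: FUs 0 and 1 are busy in every cycle,
# FUs 2 and 3 only in the last one.
trace = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
print(best_clustering(trace, max_clusters=4))
print(energy([[0, 1, 2, 3]], trace))  # centralized buffer for comparison
```

Because each FU appears in exactly one group of a partition, the constraint that every FU belongs to exactly one cluster holds by construction. With this toy trace the search groups the two always-active FUs together and the two rarely active FUs together.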

Results

[Figure: normalized energy (scale 0 to 1.4) vs. #partitions (1 to 10) for adpcmd, djpeg, idct, mpeg2d]

$$\text{Energy} = \sum_{i=1}^{\#\text{partitions}} E_{\text{buffer},i} + \sum_{i=1}^{\#\text{partitions}} E_{\text{LC},i}$$

Assumptions
- Only the buffers and controllers are modeled (no interconnect as yet)
- #FUs in datapath = 10
- Fixed schedule (activation trace)
- Schedule generated using Trimaran 2.0

In Comparison With Other Schemes

[Figure: normalized energy (scale 0 to 1), split into buffers and controller/overhead, for each scheme listed below]

Results shown for ADPCM
- Uncompressed: centralized L0 buffer
- Compressed: centralized L0 buffer, 2 additional registers for VL decoding
- Partitioned (sub-banked, no access regulation): 2 partitions
- Clustered (varying width only): 3 partitions
- Clustered (varying both width and depth): 2 partitions

Fully Distributed Instruction Memory Hierarchy

[Diagram: four L0 clusters, each an L0 buffer partition feeding its own group of FUs (3, 2, 3 and 4 FUs), served by L1 clusters that each have their own on-chip L1 cache]


Exploration Methodology: What we have

Flow: Application → Compiler (scheduling) → Clustering tool (with energy models) → Instruction clusters

Pareto curve generation
- For choosing the operating point at run-time
- Enables the designer to assess the trade-off between energy and performance

[Figure: energy vs. delay Pareto curve; one end is optimized for performance (maximum cluster activity), the other is optimized for energy (minimal cluster activity)]
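Pareto curve generation over explored design points can be sketched as follows. The function name `pareto_front` and the (delay, energy) values are hypothetical; the sketch simply keeps every point that no other point beats in both delay and energy.

```python
def pareto_front(points):
    """Return the non-dominated (delay, energy) points, sorted by delay.

    A point is dropped if another point has delay and energy that are both
    no worse, and at least one of them strictly better.
    """
    front = []
    best_energy = float("inf")
    # Visit points from fastest to slowest; a slower point is only worth
    # keeping if it needs less energy than every faster one.
    for delay, energy in sorted(points):
        if energy < best_energy:
            front.append((delay, energy))
            best_energy = energy
    return front


# Hypothetical (delay in cycles, normalized energy) results of an exploration.
explored = [(10, 5.0), (12, 4.0), (12, 4.5), (15, 4.2), (20, 3.0), (8, 9.0)]
print(pareto_front(explored))
```

The fastest surviving point corresponds to the performance-optimized end of the curve and the slowest to the energy-optimized end; the points in between are the operating points a designer or run-time manager can choose from.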

Exploration Methodology: What we want to achieve…

Flow: Application → Compiler (scheduling & clustering, with energy models) → Instruction clusters and schedule

Pareto curve generation
- For choosing the operating point at run-time
- Enables the designer to assess the trade-off between energy and performance

[Figure: energy vs. delay Pareto curve; one end is optimized for performance (maximum cluster activity), the other is optimized for energy (minimal cluster activity)]

Compiler Scheduling

Compiler scheduling can change the functional unit activity, and hence the clustering result, and hence energy and performance

Example 1: energy reduction without performance loss

OP11 OP12 -    OP13 -    OP14      All 3 clusters need to be active
OP11 OP12 OP13 OP14 -    -         Only 2 clusters need to be active

Example 2: energy reduction at the expense of performance loss

OP11 OP12 -    OP13 -    OP14
OP21 -    OP22 -    OP23 -         2 activations of all 3 clusters

OP11 OP12 -    -    -    -
OP21 -    -    -    -    -
-    -    OP22 OP13 OP23 OP14      2 activations for 1st, 1 activation for 2nd and 3rd cluster
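The activation counts in the two scheduling examples can be reproduced with a short script. The helper `cluster_activations` is hypothetical, and the grouping of the six slots into three clusters of two adjacent slots is an assumption inferred from the slide, which speaks of three clusters.

```python
def cluster_activations(schedule, clusters):
    """Count, per cluster, the instructions in which it must be active.

    schedule: list of VLIW instructions, each a list of slots holding an
              operation name or None for an empty slot.
    clusters: list of slot-index lists, one per cluster.
    """
    return [
        sum(1 for word in schedule if any(word[s] is not None for s in cluster))
        for cluster in clusters
    ]


# Assumption: three clusters of two adjacent slots each.
CLUSTERS = [[0, 1], [2, 3], [4, 5]]

# Example 1: same length, operations packed into fewer clusters.
before_1 = [["OP11", "OP12", None, "OP13", None, "OP14"]]
after_1 = [["OP11", "OP12", "OP13", "OP14", None, None]]

# Example 2: one extra instruction, fewer cluster activations overall.
before_2 = [["OP11", "OP12", None, "OP13", None, "OP14"],
            ["OP21", None, "OP22", None, "OP23", None]]
after_2 = [["OP11", "OP12", None, None, None, None],
           ["OP21", None, None, None, None, None],
           [None, None, "OP22", "OP13", "OP23", "OP14"]]

for schedule in (before_1, after_1, before_2, after_2):
    print(len(schedule), "instructions:", cluster_activations(schedule, CLUSTERS))
```

The second rescheduling lowers the total number of cluster activations from six to four, but the schedule grows from two to three instructions, which is the performance cost mentioned on the slide.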

Software Transformations

High-level code transformations can also change the clustering result, and hence energy and performance

Loop transformations
- Loop splitting
- Loop merging
- Loop peeling (for nested loops)
- Loop collapsing (nested loops)
- Code movement across loops
- ... etc.

[Diagram: loop splitting, a single loop split into loop 1 and loop 2]
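As a minimal illustration of loop splitting, the sketch below splits one loop with two independent computations into two loops that produce the same results. The function names and the mapping of additions and multiplications to different functional units are assumptions made for the example; the point is only that each split loop exercises fewer kinds of operations, which is what can change the FU activation trace seen by the clustering step.

```python
def single_loop(a, b):
    """One loop body performing two independent computations."""
    sums, prods = [], []
    for x, y in zip(a, b):
        sums.append(x + y)    # e.g. work for an adder FU
        prods.append(x * y)   # e.g. work for a multiplier FU
    return sums, prods


def split_loops(a, b):
    """The same work after loop splitting: each loop does one computation."""
    sums = []
    for x, y in zip(a, b):    # loop 1: additions only
        sums.append(x + y)
    prods = []
    for x, y in zip(a, b):    # loop 2: multiplications only
        prods.append(x * y)
    return sums, prods


a, b = [1, 2, 3], [4, 5, 6]
print(single_loop(a, b) == split_loops(a, b))
```

Splitting is only legal when the two computations do not depend on each other across iterations, as is the case here.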


Conclusions

• L0 Buffer Organization
– Multimedia applications have high locality in small program segments
– An additional small L0 buffer should be used
– Current options for the L0 buffer are still not energy efficient
– A distributed L0 buffer organization should be sought
– But the clustering/partitioning should be application specific

• L1 Cache Organization
– Distributed (?)

• Instruction Memory Exploration
– Software transformations and compiler scheduling can change the clustering results
– An exploration methodology should be sought to analyze the trade-offs in energy and performance (Pareto curves)
