Ph.D. Thesis presentation

Scratchpad-oriented address generation for low-power embedded VLIW processors

Guillermo Talavera Velilla

Departament de Microelectrònica i Sistemes Electrònics

Universitat Autònoma de Barcelona

Thesis supervisor: Jordi Carrabina

Ph.D. Defense Presentation, October 15th, 2009

Outline: Embedded, Processors, Memories, Optimization, AGUs, Optimization, Conclusions

What are we talking about?

Scratchpad-oriented address generation for low-power embedded VLIW processors

• Scratchpad: type of memory
• Address generation: accessing data
• Low-power: energy optimization
• Embedded: small, portable, battery operated and multimedia
• VLIW processors: type of processors


What should I do if I am a VLIW processor working in the embedded domain and I want to access data (located in memory) while consuming little energy?



Let’s talk about…

… embedded.

Embedded


Embedded mobile systems



Greedy users

Users demand:
• More functionalities
• More speed
• More battery
• Cheap devices

“PC-like” functionalities

… and we give them VLIW-ASIPs



Performance vs Energy Efficiency

[Figure: trade-off between performance, energy efficiency and flexibility; the target is to be flexible enough]


Goal of the thesis

• Main goal:
– Optimization of the energy consumption of VLIW-ASIP architectures, focusing on the address generation process.

• Side goals:
– Analyze state-of-the-art optimizations
– Analyze state-of-the-art address generation units
– Test the template on different benchmarks and applications



… processors.

Processors


Definitions

• VLIW = Very Long Instruction Word (processor)
– Architecture design style that tries to maximize the available instruction-level parallelism.

• ASIP = Application-Specific Instruction-set Processor
– Processor where the instruction set is tailored to benefit a specific application.


Target Architecture Style: VLIW-ASIP

[Figure: clustered VLIW-ASIP template. A main cluster and further clusters, each with its own loop buffer, register file and functional units (FUs), connected to a Level 1 memory (on chip), a Level 2 memory (on chip) and an external memory.]


Superscalar vs VLIW (reminder)

• Superscalar: HW scheduling
• VLIW: SW scheduling


VLIW-ASIPs

• Ongoing work at imec:
– Novel architectures oriented to low power (x20)

HW + SW + compiler exploration:
• Data memory hierarchy
• Foreground memory (registers)
• Instruction/configuration memory
• Data-path
• Address-path


COFFEE: COmpiler Framework for Energy-aware Exploration

[Figure: COFFEE tool flow. Blocks: C code; XML processor model; XSLT converter; Trimaran MDES; enhanced Trimaran compiler; asm code; Trimaran simulator; trace; area, delay and power calculation; annotated XML processor model; total power calculation; compiler and process parameters. Outputs: power/energy results, performance results, area results.]



… memories.

Memories


Embedded multimedia domain

• 50%-70% of the energy consumption is caused by memory accesses*

Crucial to optimize:
• Memory size, type, number of ports, …
• Accesses (and related address computations)

As a driver example we use a real application: an MPEG4 encoder

* References of the thesis [WCNM96 and MNCM97]


Background data memory

Scratchpad (compared to cache)
– average energy reduction: 40% *
– average area-time reduction: 46% *

* References of the thesis [BSL+02a and AC06]

[Figure: memory hierarchy from the core through Level 1 (data/instruction memory) and Level 2 (data/instruction memory) to Level 3 (main memory); memories go from fast, small and expensive to slow, large and cheap]



… optimization methods.

Optimization


Data Transfer and Storage Exploration (DTSE)

• Goal:
– Reduce storage requirements
– Optimize locality of data

• Code rewriting
– Complex addressing
– Control flow
– Modulo and divider operations


DTSE transformations

Code before DTSE:

for (x=1; x<=N-2; ++x) {
  for (y=1; y<=M-2; ++y) {
    for (k=-1; k<=1; ++k) {
      A[x][y] += B[x+k][y] * C[abs(k)];
      A[x][y] /= tot;
    }
  }
}

Code after DTSE:

for (y=0; y<=M+2; ++y) {
  for (x=0; x<=N+2; ++x) {
    if (x>=0 && x<N && y>=1 && y<=M-2) {
      D[x%3] = B[(y*N+x)%8704 + (y*N+x)%8704*16384 + 7680];
    }
    if (x-1>=1 && x-1<=N-2 && y>=1 && y<=M-2) {
      for (k=-1; k<=1; ++k) {
        acc += D[(x-1+k)%3] * C[abs(k)];
      }
    }
    acc /= tot;
  }
}

Control flow and address calculation are the bottleneck after DTSE!!!


DTSE: Non-linear Operator Strength Reduction

Before:

for (y=0; y<M; ++y) {
  for (x=0; x<N; ++x) {
    cse0 = x%3;
    cse1 = (x-1)%3;
    cse2 = (x-2)%3;
    …
  }
}

After:

for (y=0; y<M; ++y) {
  p_cse0 = 0; // x%3
  p_cse1 = 2; // (x-1)%3
  p_cse2 = 1; // (x-2)%3
  for (x=0; x<N; ++x) {
    …
    p_cse2 = p_cse1;
    p_cse1 = p_cse0;
    p_cse0++;
    if (p_cse0>=3) p_cse0 = 0;
  }
}

Modulo operations cannot always be transformed for complex indexes
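The equivalence between the modulo expressions and the incrementally updated pointers can be checked with a minimal sketch. It assumes a mathematical (non-negative) modulo, as the slide's initial values 0, 2, 1 imply; C's own % operator returns negative results for negative operands. The helper names mod_floor and strength_reduction_matches are illustrative and do not come from the thesis.

```c
#include <assert.h>

/* Mathematical (non-negative) modulo; C's % can return negative values. */
static int mod_floor(int a, int m) {
    int r = a % m;
    return (r < 0) ? r + m : r;
}

/* Returns 1 if the incrementally updated pointers equal the modulo
 * expressions for every x in [0, n), and 0 otherwise. */
int strength_reduction_matches(int n) {
    int p_cse0 = 0; /* x % 3       at x = 0 */
    int p_cse1 = 2; /* (x - 1) % 3 at x = 0 */
    int p_cse2 = 1; /* (x - 2) % 3 at x = 0 */
    for (int x = 0; x < n; ++x) {
        if (p_cse0 != mod_floor(x, 3)) return 0;
        if (p_cse1 != mod_floor(x - 1, 3)) return 0;
        if (p_cse2 != mod_floor(x - 2, 3)) return 0;
        /* Update step from the slide: shift the window, then wrap. */
        p_cse2 = p_cse1;
        p_cse1 = p_cse0;
        p_cse0++;
        if (p_cse0 >= 3) p_cse0 = 0;
    }
    return 1;
}
```

Calling assert(strength_reduction_matches(1000)) from a test harness exercises the update step over many wrap-arounds; the inner loop then needs only copies, one increment and one compare instead of three modulo operations.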


DTSE: Arithmetic Cost Minimization

Before:

for (y=0; y<M; ++y) {
  for (x=0; x<N; ++x) {
    if (x>=4 && y>=6) {
      ce_img1[(x-2)%3] = …
      ce_img2[(x-2)%3] = …
    }
    if (x>=4 && y>=4) {
      ce_img1[(x-1)%3] = …
      ce_img2[(x-1)%3] = …
    }
  }
}

After:

for (y=0; y<M; ++y) {
  for (x=0; x<N; ++x) {
    if (x>=4 && y>=6) {
      cse0 = (x-2)%3;
      ce_img1[cse0] = …
      ce_img2[cse0] = …
    }
    if (x>=4 && y>=4) {
      cse1 = (x-1)%3;
      ce_img1[cse1] = …
      ce_img2[cse1] = …
    }
  }
}
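A small sketch of the same idea, computing a shared modulo index once and reusing it for both stores. The buffer size, the stored values and the function names (store_before, store_after, cse_matches) are assumptions made for illustration; the slide elides the right-hand sides.

```c
#include <assert.h>
#include <string.h>

enum { BUF = 3 };

/* Before: (x-2)%3 is evaluated once per array access. */
static void store_before(int x, int a[BUF], int b[BUF]) {
    a[(x - 2) % BUF] = x;
    b[(x - 2) % BUF] = -x;
}

/* After: the shared index is evaluated once and reused. */
static void store_after(int x, int a[BUF], int b[BUF]) {
    int cse0 = (x - 2) % BUF;
    a[cse0] = x;
    b[cse0] = -x;
}

/* Returns 1 if both versions leave identical buffers for x in [4, n). */
int cse_matches(int n) {
    int a1[BUF] = {0}, b1[BUF] = {0}, a2[BUF] = {0}, b2[BUF] = {0};
    for (int x = 4; x < n; ++x) {
        store_before(x, a1, b1);
        store_after(x, a2, b2);
        if (memcmp(a1, a2, sizeof a1) != 0 || memcmp(b1, b2, sizeof b1) != 0)
            return 0;
    }
    return 1;
}
```

The guard x >= 4 keeps x-2 non-negative, so C's % behaves like a mathematical modulo here. An optimizing compiler may already perform this elimination; doing it at source level makes the saving explicit for the address path.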



Control flow optimization

Loop nest splitting

Before:

for (i=0; i<50; i++) {
  for (j=0; j<i; j++) {
    if (i+j < 70)
      data = Aleft[i+j];
    else
      data = Aright[i+j-70];
    …
  }
}

After:

for (i=0; i<50; i++) {
  if (i <= 35) {
    for (j=0; j<i; j++) {
      data = Aleft[i+j];
    }
  } else {
    for (j=0; j<i; j++) {
      if (i+j < 70) {
        data = Aleft[i+j];
      } else {
        data = Aright[i+j-70];
      }
    }
  }
}
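The split point follows from the loop bounds: since j < i, the largest index is i+j = 2i-1, which stays below 70 exactly when i <= 35. A minimal sketch that checks the split version against the original is given below; accumulating a sum instead of assigning data, the array contents, and the names sum_original, sum_split and splitting_matches are assumptions made so that the two versions can be compared.

```c
#include <assert.h>

enum { LEFT_LEN = 70, RIGHT_LEN = 28 };

/* Original nest: the condition is tested in every inner iteration. */
long sum_original(const int *aleft, const int *aright) {
    long sum = 0;
    for (int i = 0; i < 50; i++)
        for (int j = 0; j < i; j++)
            sum += (i + j < 70) ? aleft[i + j] : aright[i + j - 70];
    return sum;
}

/* Split nest: for i <= 35 the condition i+j < 70 always holds,
 * so the test is only needed for the remaining iterations. */
long sum_split(const int *aleft, const int *aright) {
    long sum = 0;
    int i;
    for (i = 0; i <= 35; i++)
        for (int j = 0; j < i; j++)
            sum += aleft[i + j];
    for (; i < 50; i++)
        for (int j = 0; j < i; j++)
            sum += (i + j < 70) ? aleft[i + j] : aright[i + j - 70];
    return sum;
}

/* Returns 1 if both versions read the same data. */
int splitting_matches(void) {
    int aleft[LEFT_LEN], aright[RIGHT_LEN];
    for (int k = 0; k < LEFT_LEN; ++k) aleft[k] = k + 1;
    for (int k = 0; k < RIGHT_LEN; ++k) aright[k] = 1000 + k;
    return sum_original(aleft, aright) == sum_split(aleft, aright);
}
```

Writing the split as two consecutive outer loops is one possible form; the largest index into the right array is 49+48-70 = 27, which is why 28 elements are enough in this sketch.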



Data-path architecture exploration

Done at architecture level: #clusters, #FUs per cluster

2 clusters with 4 FUs each

MPEG4 encoder application:
- 90nm technology
- 500MHz



… address generation.

AGUs


How do I access data?

[Figure: the core accessing the Level 1 data cache]

Very often addresses are calculated in the normal data-path


Address Generation Unit (AGU)

[Figure: AGU block diagram with an address register file, an address control unit and an address data-path unit. Labels: indexes or addresses, range, address sequence.]

Address equation examples:
D[x%3] = B[(y*N+x)%8704 + (y*N+x)%8704*16384 + 7680];
AE1 = x%3
AE2 = (y*N+x)%8704 + (y*N+x)%8704*16384 + 7680
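To illustrate how much arithmetic AE2 contains, the sketch below evaluates it as written and in a cheaper form that shares the repeated modulo and replaces the multiplication by 16384 with a shift by 14. This rewriting is only an illustration of the kind of simplification an address path can exploit, not the mapping used in the thesis; the function names and the frame size in the usage note are assumptions.

```c
#include <assert.h>

/* AE2 exactly as written in the address equation. */
long ae2_direct(long x, long y, long n) {
    return (y * n + x) % 8704 + (y * n + x) % 8704 * 16384 + 7680;
}

/* Same value with the modulo shared and the multiply done as a shift
 * (16384 = 2^14). Valid for non-negative x, y and n. */
long ae2_shared(long x, long y, long n) {
    long t = (y * n + x) % 8704;
    return t + (t << 14) + 7680;
}

/* Returns 1 if both forms agree for all 0 <= x < n and 0 <= y < m. */
int ae2_matches(long n, long m) {
    for (long y = 0; y < m; ++y)
        for (long x = 0; x < n; ++x)
            if (ae2_direct(x, y, n) != ae2_shared(x, y, n))
                return 0;
    return 1;
}
```

For example, assert(ae2_matches(176, 144)) checks the two forms over an arbitrarily chosen 176x144 index range. The direct form needs two modulo operations and three multiplications per address, the shared form one modulo, one multiplication and one shift.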



AGU

• Multimedia Domain Programmable AGU



AGU Exploration Framework*

[Figure: AGU exploration flow. Elements: address calculation; PE implementation patterns (+, -, <<, +/-, +/-/<<, *, %); AGU exploration framework; AGU mapping framework; architecture file; report of cycles, area and energy.]

Constraints: max_pe=6, min_add=1, max_add=6, min_sub=1, max_sub=6, min_sft=1, max_sft=6, …

Evaluate all architectures which satisfy the constraints! Trade-off!

* From Osaka University


Experiments



Results: Area

15% < Hardware overhead < 200%

[Figure: original VLIW (clusters of FUs, each with loop buffer and register file) next to the VLIW with AGU, where AGUs sit alongside the FUs in each cluster]


Results: Speed and Energy consumption

[Figure: bar chart of the improvements per benchmark; data labels: 24%, 27%, 16%, 17%, 12%, 13%, 12%, 32%, 35%, 15%]


Results (applied to the MPEG4 application)

[Figure: 51% energy reduction with respect to the previous optimization step]


“Stand alone” AGU

for (k …) {
  for (j …) {
    for (i …) { … }
  }
}


“Stand alone” AGU (1)

Implements: i*cnst



“Stand alone” AGU (2)

Implements:
i+= “inc i”
i-= “dec i”



“Stand alone” AGU (3)

Implements:
i+j    i << j
i*j    i >> j


“Stand alone” AGU (4-5)

Implements:
(i+j)%val    (i << j)%val
(i*j)/val    (i >> j)/val

for (i=0; i<=20; i++)
  address = i%3;

ptr = -1;
for (i=0; i<=20; i++) {
  ptr++;
  if (ptr>=3) ptr -= 3;
  address = ptr;
}



“Stand alone” AGU (6)

Implements:
i+j+k    (i << j)+k
i*j+k    (i >> j)+k


With this AGU

• Conditions:
– No control flow
– No dividers*
– No modulo operations*

• In the cavity detector application:
– 2% hardware overhead
– 50% energy and cycles reduction

* That cannot be transformed with non-linear operator strength reduction



… optimization (again).

Optimization


Instruction loop buffering optimization

[Figure: datapath, L1 instruction memory and distributed L0 loop buffers, shown for normal operation and for the filling of the L0 buffers. Phases: initiation, execution, termination.]


Summary of the optimizations (on the MPEG4 application)

[Figure: optimizations at the code, compiler and hardware levels, applied to the data memory, address generation, data path and instruction memory]


Results: Cycles

MORE THAN 90%!!! with respect to the first straightforward implementation


Final energy distribution

MPEG4 encoder application:
- 90nm technology
- 500MHz



… conclusions.

Conclusions


Thesis contributions (1)

• Address generation unit template for the embedded multimedia domain
– Improvements between 12% and 35% on several benchmarks and applications (cycles and energy)
– Improvement of 51% in energy consumption on a real application (MPEG4), with respect to the previous optimization step
– Global improvements over 90% applying a complete optimization methodology


Thesis contributions (2)

• Quantitative comparison of different platforms commonly used in the embedded domain
• Systematic classification of address generators
• Review of literature on address generation optimization according to the classification
• Introduction of the reconfigurable AGU framework results into the COFFEE framework
• Application of a complete methodology to optimize energy consumption on a real data-flow application, including address generation steps.


Open issues:

• Support for more loops and control
• Bit calculation
• Merge of index expressions
• Extension to other benchmarks and applications
• Heterogeneous distributed AGUs
• Distributed loop buffers with different speeds
• Complete DTSE optimization




End of presentation and open discussion

??



Publications

Journal papers:

• G. Talavera, M. Jayapala, J. Carrabina, and F. Catthoor, “Address generation optimization for embedded high-performance processors: A survey”, Journal of Signal Processing Systems for Signal Image and Video Technology (formerly the Journal of VLSI Signal Processing Systems for Signal Image and Video Technology), May 2008 (online), December 2008 (printed version).

• G. Talavera, A. Portero, P. Raghavan, M. Jayapala, J. Carrabina, and F. Catthoor, “Power exploration and address generation optimization of multimedia applications on VLIW processors”, Planned for re-submission to the IEEE Transactions on Image Processing.

• A. Portero, G. Talavera, J. Carrabina, and F. Catthoor, “Methodology for multimedia applications in multiplatform implementation for energy-flexibility space exploration”, Planned for re-submission to the IEEE Transactions on Computers.

• A. Portero, G. Talavera, J. Carrabina, and F. Catthoor, “Data-dominant application implementation in multi-platform for energy-flexibility space exploration”, Planned for re-submission to the IEEE Transactions on Image Processing.


Conference papers

• A. Lambrechts, T. V. Aa, M. Jayapala, A. Leroy, G. Talavera, A. Shickova, F. Barat, F. Catthoor, D. Verkest, G. Deconinck, H. Corporaal, F. Robert, and J. C. Bordoll, “Design style case study for compute nodes of a heterogeneous NoC platform”, in 25th IEEE Real-Time Systems Symposium (RTSS), December 2004.

• G. Talavera, V. Nollet, J.-Y. Mignolet, D. Verkest, S. Vernalde, R. Lauwereins, and J. Carrabina, “Hardware-Software debugging techniques for reconfigurable Systems-on-Chip”, in International Conference on Industrial Technology (IEEE ICIT '04), vol. 3, Dec. 2004, pp. 1402-1407.

• G. Talavera, V. Nollet, J.-Y. Mignolet, D. Verkest, S. Vernalde, R. Lauwereins, and J. Carrabina, “Métodos de depuración HW-SW para sistemas on chip reconfigurables”, in Jornadas Sobre Computación Reconfigurable y Aplicaciones (JCRA), Barcelona, Spain, September 2004, pp. 251-258.

• A. Lambrechts, P. Raghavan, A. Leroy, G. Talavera, T. Vander Aa, M. Jayapala, F. Catthoor, D. Verkest, G. Deconinck, H. Corporaal, F. Robert, and J. Carrabina, “Power breakdown analysis for a heterogeneous NoC platform running a video application”, in 16th IEEE International Conference on Application-Specific Systems, Architecture Processors (ASAP), July 2005, pp. 179-184.

• A. Portero, G. Talavera, M. Monton, B. Martinez, and J. Carrabina, “NoC system for MPEG-4 SP using heterogeneous tiles”, in Design of Circuits and Integrated Systems (DCIS), San Diego, California, USA. December 2006.

• A. Portero, G. Talavera, M. Monton, B. Martinez, M. Moreno, F. Catthoor, and J. Carrabina, “Energy-aware mpeg-4 single profile in HW-SW multiplatform implementation”, in IEEE International SOC Conference, Austin, Texas, USA. Sept. 2006, pp. 13-16.

• A. Portero, G. Talavera, M. Monton, B. Martinez, F. Catthoor, and J. Carrabina, “Dynamic voltage scaling for power efficient MPEG4-SP implementation”, in Proceedings of the IEEE 17th International Conference on Application-specific Systems, Architectures and Processors (ASAP). Washington, DC, USA: IEEE Computer Society, 2006, pp. 257-260.

• A. Portero, G. Talavera, F. Catthoor, and J. Carrabina, “A study of a MPEG-4 codec in a multiprocessor platform”, in IEEE International Symposium on Industrial Electronics (ISIE), 2006, vol. 1, July 2006, pp. 661-666.


Teaching publications

• G. Talavera, J. Saiz, and J. Carrabina, “Dispositivos y plataformas para docencia de informática y electrónica”, in Jornadas Sobre Computación Reconfigurable y Aplicaciones (JCRA), Barcelona, Spain, September 2004, pp. 711-717.

• G. Talavera, B. Lorente, M. Monton, B. Martinez, J. Oliver, C. Ferrer, L. Ribas, J. Aguilo, and E. Valderrama, “Nuevas metodologías docentes y autoaprendizaje en la enseñanza técnica universitaria”, in Congreso Internacional de Docencia Universitaria e Innovación (CIDUI), Barcelona, Spain, 2006

• B. Lorente, G. Talavera, L. Ribas, and E. Valderrama, “Implantació d'una nova metodologia docent a les pràctiques de fonaments de computadors d'enginyeria informàtica”, in Congreso Internacional de Docencia Universitaria e Innovación (CIDUI), Barcelona, Spain, 2006.

• G. Talavera, X. Fitó, B. Lorente, A. Portero, M. Montón, B. Martínez, J. Oliver, C. Ferrer, L. Ribas, J. Aguiló, and E. Valderrama, “Adaptación metodológica a las nuevas directrices del EEES en la enseñanza técnica universitaria”, in Tecnologías Aplicadas a la Enseñanza de la Electrónica (TAEE), Madrid, Spain. 2006.

• A. Portero, J. Saiz, G. Talavera, R. Aragonés, M. Rullán, J. Aguiló, and E. Valderrama, “Aplicación del plan piloto en sistemas digitales en ingeniería informática siguiendo las directivas del EEES”, in Tecnologías Aplicadas a la Enseñanza de la Electrónica (TAEE), Madrid, Spain. 2006.

• G. Talavera, F. X. Fitó, B. Lorente, M. Montón, B. Martínez, C. Ferrer, and E. Valderrama, “Cas pràctic d'adaptació metodològica a les directrius EEES d'una assignatura d'enginyeria informàtica”, in III Jornada de Campus d'Innovació Docent. UAB, Barcelona. Spain. 20 Setembre de 2006.

• E. Valderrama, G. Talavera, M. Montón, B. Martínez, J. M. Fernández, and J. Muñoz, “Comparación de dos metodologías docentes utilizadas en los seminarios de fundamentos de computadores”, in XIV Jornadas de Enseñanza Universitaria de la Informática (JENUI), 2008.


Results: Energy

MORE THAN 90%!!! with respect to the first straightforward implementation


Reconfigurable AGU template



The loop buffer operation: An Illustration

Source loop:

for (..) { … if (..) {.….} else {.….} … }

Scheduled code (one VLIW instruction per row):

      LBON <offset>
S:    OP11  OP21  OP31  NOP
      NOP   OP22  OP32  BNZ ‘x’
      OP12  NOP   NOP   BR ‘y’      (if block)
X:    OP13  NOP   OP33  NOP         (else block)
Y:    OP14  OP23  NOP   BNZ ‘s’



[Figure: the same loop with the operations distributed over the local loop buffers. FU1: OP11, OP12, OP13, OP14. FU2: OP21, OP22, OP23. FU3: OP31, OP32, OP33. BR: BNZ ‘x’, BR ‘y’, BNZ ‘s’. Control labels: IROC, START_ADDR, END_ADDR, IR_USE, NEW_PC, PC.]


