Scratchpad-oriented address generation for low-power embedded VLIW processors Guillermo Talavera Velilla Departament de Microelectrònica i Sistemes Electrònics Universitat Autònoma de Barcelona Thesis supervisor: Jordi Carrabina Ph.D. Defense Presentation October 15th, 2009


Page 1: Ph.D. Thesis presentation

Scratchpad-oriented address generation for low-power embedded VLIW processors

Guillermo Talavera Velilla

Departament de Microelectrònica i Sistemes Electrònics

Universitat Autònoma de Barcelona

Thesis supervisor: Jordi Carrabina

Ph.D. Defense Presentation, October 15th, 2009

Page 2: Ph.D. Thesis presentation

Embedded Processors Memories Optimization AGUs Optimization Conclusions ? 2/50

What are we talking about?

Scratchpad-oriented address generation for low-power embedded VLIW processors

• Scratchpad-oriented: type of memory
• Address generation: accessing data
• Low-power: energy optimization
• Embedded: small, portable, battery-operated and multimedia
• VLIW processors: type of processors


Page 3: Ph.D. Thesis presentation

What should I do if I am a VLIW processor working in the embedded domain and I want to access data (located in memory) while consuming little energy?


Page 4: Ph.D. Thesis presentation


Let’s talk about…

… embedded.

Page 5: Ph.D. Thesis presentation

Embedded mobile systems

Page 6: Ph.D. Thesis presentation

Greedy users

Users demand:
• More functionalities
• More speed
• More battery
• Cheap devices

“PC-like” functionalities

… and we give them VLIW-ASIPs

Page 7: Ph.D. Thesis presentation

Performance vs Energy Efficiency

[Figure: trade-off between Performance, Energy efficiency and Flexibility; the target is marked "Flexible enough"]

Page 8: Ph.D. Thesis presentation

Goal of the thesis

• Main goal:
– Optimize the energy consumption of VLIW-ASIP architectures, focusing on the address generation process.

• Side goals:
– Analyze state-of-the-art optimizations
– Analyze state-of-the-art address generation units
– Test the template on different benchmarks and applications

Page 9: Ph.D. Thesis presentation


Let’s talk about…

… processors.

Page 10: Ph.D. Thesis presentation

Definitions

• VLIW = Very Long Instruction Word (processor)
– Architecture design style that tries to maximize the available instruction-level parallelism.

• ASIP = Application-Specific Instruction-set Processor
– Processor where the instruction set is tailored to benefit a specific application.

Page 11: Ph.D. Thesis presentation

Target Architecture Style: VLIW-ASIP

[Figure: clustered VLIW-ASIP template. A main cluster and further clusters, each with a loop buffer, a register file and several functional units (FUs); memory hierarchy with Level 1 memory (on chip), Level 2 memory (on chip) and external memory.]

Page 12: Ph.D. Thesis presentation

Superscalar vs VLIW (reminder)

Superscalar: HW scheduling

VLIW: SW scheduling

Page 13: Ph.D. Thesis presentation

VLIW-ASIPs

• Ongoing work at imec:
– Novel architectures oriented to low power (x20)

HW + SW + compiler exploration:
• Data memory hierarchy
• Foreground memory (registers)
• Instruction/configuration memory
• Data-path
• Address-path

Page 14: Ph.D. Thesis presentation

COFFEE: COmpiler Framework for Energy-aware Exploration

[Figure: COFFEE tool flow. Inputs: C code, XML processor model, compiler and process parameters. Blocks: XSLT converter, Trimaran MDES, enhanced Trimaran compiler, Trimaran simulator, area calculation, delay calculation, power calculation, total power calculation. Intermediate data: asm code, trace, annotated XML processor model. Outputs: power/energy results, performance results, area results.]

Page 15: Ph.D. Thesis presentation

Let’s talk about…

… memories.

Page 16: Ph.D. Thesis presentation

Embedded multimedia domain

• 50%-70% of the energy consumption is caused by memory accesses*

Crucial to optimize:
• Memory size, type, number of ports, …
• Accesses (and related address computations)

As a driver example we use a real application: an MPEG4 encoder

* Thesis references [WCNM96] and [MNCM97]

Page 17: Ph.D. Thesis presentation

Background data memory

Scratchpad (compared to cache):
– average energy reduction: 40% *
– average area-time reduction: 46% *

* Thesis references [BSL+02a] and [AC06]

[Figure: memory hierarchy. Core, Level 1 (data/instruction memory), Level 2 (data/instruction memory), Level 3 (main memory); from fast, small and expensive next to the core to slow, large and cheap.]

Page 18: Ph.D. Thesis presentation

Let’s talk about…

… optimization methods.

Page 19: Ph.D. Thesis presentation

Data Transfer and Storage Exploration (DTSE)

• Goal:
– Reduce storage requirements
– Optimize locality of data

• Code rewriting:
– Complex addressing
– Control flow
– Modulo and divider operations

Page 20: Ph.D. Thesis presentation

DTSE transformations

Code before DTSE:

for (x=1; x<=N-2; ++x) {
  for (y=1; y<=M-2; ++y) {
    for (k=-1; k<=1; ++k) {
      A[x][y] += B[x+k][y] * C[abs(k)];
      A[x][y] /= tot;
    }
  }
}

Code after DTSE:

for (y=0; y<=M+2; ++y) {
  for (x=0; x<=N+2; ++x) {
    if (x>=0 && x<N && y>=1 && y<=M-2) {
      D[x%3] = B[(y*N+x)%8704 + (y*N+x)%8704*16384 + 7680];
    }
    if (x-1>=1 && x-1<=N-2 && y>=1 && y<=M-2) {
      for (k=-1; k<=1; ++k) {
        acc += D[(x-1+k)%3] * C[abs(k)];
      }
    }
    acc /= tot;
  }
}

Control flow and address calculation are the bottleneck after DTSE!!!

Page 21: Ph.D. Thesis presentation

DTSE: Non-linear Operator Strength Reduction

Before:

for (y=0; y<M; ++y) {
  for (x=0; x<N; ++x) {
    cse0 = x%3;
    cse1 = (x-1)%3;
    cse2 = (x-2)%3;
    …
  }
}

After:

for (y=0; y<M; ++y) {
  p_cse0 = 0; // x%3
  p_cse1 = 2; // (x-1)%3
  p_cse2 = 1; // (x-2)%3
  for (x=0; x<N; ++x) {
    …
    p_cse2 = p_cse1;
    p_cse1 = p_cse0;
    p_cse0++;
    if (p_cse0>=3) p_cse0 = 0;
  }
}

Modulo operations cannot always be transformed for complex indexes
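The transformation above can be checked with a small sketch, which is not the thesis code: the three counters are advanced with a shift and a compare-and-reset, and compared against the directly computed modulo in every iteration. The helper names `floor_mod` and `count_mismatches` are hypothetical. One assumption is made explicit: the initial values 2 and 1 correspond to the mathematical (non-negative) modulo, whereas the C operator `%` gives -1 and -2 for (x-1)%3 and (x-2)%3 at x=0, so the reference uses a non-negative modulo helper.

```c
#include <assert.h>

/* Mathematical (always non-negative) modulo; C's % can return negatives. */
int floor_mod(int a, int m) {
    int r = a % m;
    return (r < 0) ? r + m : r;
}

/* Returns the number of iterations in which the incrementally updated
   counters differ from the directly computed modulo values. */
int count_mismatches(int n) {
    int mismatches = 0;
    int p_cse0 = 0; /* tracks x % 3       */
    int p_cse1 = 2; /* tracks (x - 1) % 3 */
    int p_cse2 = 1; /* tracks (x - 2) % 3 */
    for (int x = 0; x < n; ++x) {
        if (p_cse0 != floor_mod(x, 3) ||
            p_cse1 != floor_mod(x - 1, 3) ||
            p_cse2 != floor_mod(x - 2, 3)) {
            ++mismatches;
        }
        /* Shift the window, then advance with a compare-and-reset
           instead of a modulo operation. */
        p_cse2 = p_cse1;
        p_cse1 = p_cse0;
        p_cse0++;
        if (p_cse0 >= 3) p_cse0 = 0;
    }
    return mismatches;
}
```

Calling `count_mismatches(n)` for any loop length `n` should return 0 if the incremental update is equivalent to the modulo expressions under the stated assumption.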

Page 22: Ph.D. Thesis presentation


DTSE: Arithmetic Cost Minimization

Before:

for (y=0; y<M; ++y) {
  for (x=0; x<N; ++x) {
    if (x>=4 && y>=6) {
      ce_img1[(x-2)%3] = …
      ce_img2[(x-2)%3] = …
    }
    if (x>=4 && y>=4) {
      ce_img1[(x-1)%3] = …
      ce_img2[(x-1)%3] = …
    }
  }
}

After:

for (y=0; y<M; ++y) {
  for (x=0; x<N; ++x) {
    if (x>=4 && y>=6) {
      cse0 = (x-2)%3;
      ce_img1[cse0] = …
      ce_img2[cse0] = …
    }
    if (x>=4 && y>=4) {
      cse1 = (x-1)%3;
      ce_img1[cse1] = …
      ce_img2[cse1] = …
    }
  }
}

Page 23: Ph.D. Thesis presentation

Control flow optimization

Loop nest splitting

Before:

for (i=0; i<50; i++) {
  for (j=0; j<i; j++) {
    if (i+j < 70)
      data = Aleft[i+j];
    else
      data = Aright[i+j-70];
    …
  }
}

After:

for (i=0; i<50; i++) {
  if (i <= 35) {
    for ( ; i<=35; i++) {
      for (j=0; j<i; j++) {
        data = Aleft[i+j];
      }
    }
    i--; /* step back so that the outer i++ continues at i=36 */
  } else {
    for (j=0; j<i; j++) {
      if (i+j < 70) {
        data = Aleft[i+j];
      } else {
        data = Aright[i+j-70];
      }
    }
  }
}
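Why the split point is 35 can be illustrated with a short sketch under stated assumptions: since j < i, any i <= 35 gives i + j <= 69 < 70, so the condition is always true in that region and can be dropped. The code below is not the thesis implementation; it writes the split as two consecutive loop nests for clarity, and the array contents, `struct result`, `run_original` and `run_split` are hypothetical names introduced only for this check. It compares the accumulated data and counts how many times the condition is evaluated.

```c
#include <assert.h>

enum { OUTER = 50, LIMIT = 70, SPLIT = 35 };

static int Aleft[LIMIT];
static int Aright[2 * OUTER - LIMIT]; /* 30 entries; largest index used is 27 */

struct result {
    long sum;   /* accumulated data values         */
    long conds; /* evaluations of (i + j < LIMIT)  */
};

/* Fill both arrays with arbitrary, distinguishable test values. */
void fill_arrays(void) {
    for (int k = 0; k < LIMIT; ++k) Aleft[k] = k + 1;
    for (int k = 0; k < 2 * OUTER - LIMIT; ++k) Aright[k] = 1000 + k;
}

/* Original nest: the condition is tested in every inner iteration. */
struct result run_original(void) {
    struct result r = {0, 0};
    fill_arrays();
    for (int i = 0; i < OUTER; i++) {
        for (int j = 0; j < i; j++) {
            ++r.conds;
            if (i + j < LIMIT) r.sum += Aleft[i + j];
            else               r.sum += Aright[i + j - LIMIT];
        }
    }
    return r;
}

/* Split nest: for i <= SPLIT the condition is known to hold. */
struct result run_split(void) {
    struct result r = {0, 0};
    int i;
    fill_arrays();
    for (i = 0; i <= SPLIT; i++) {
        for (int j = 0; j < i; j++) {
            r.sum += Aleft[i + j]; /* no test needed: i + j <= 69 */
        }
    }
    for (; i < OUTER; i++) {
        for (int j = 0; j < i; j++) {
            ++r.conds;
            if (i + j < LIMIT) r.sum += Aleft[i + j];
            else               r.sum += Aright[i + j - LIMIT];
        }
    }
    return r;
}
```

With these loop bounds the original nest evaluates the condition 1225 times and the split version 595 times, while both accumulate the same data.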

Page 24: Ph.D. Thesis presentation

Data-path architecture exploration

Done at architecture level: #clusters, #FUs per cluster

2 clusters with 4 FUs each

MPEG4 encoder application:
- 90nm technology
- 500MHz

Page 25: Ph.D. Thesis presentation

Let’s talk about…

… address generation.

Page 26: Ph.D. Thesis presentation

How do I access data?

[Figure: core connected to Level 1 (data cache)]

Very often, addresses are calculated in the normal data-path

Page 27: Ph.D. Thesis presentation

Address Generation Unit (AGU)

[Figure: AGU block diagram. Blocks: address register file, address control unit, address data-path unit. Signals: indexes or addresses, range, address sequence.]

Address equation examples:

D[x%3] = B[(y*N+x)%8704 + (y*N+x)%8704*16384 + 7680];

AE1 = x%3
AE2 = (y*N+x)%8704 + (y*N+x)%8704*16384 + 7680
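To make the cost of these address equations concrete, the following sketch evaluates AE1 and AE2 in software. It is an illustration only, not the AGU of the thesis: the function names are hypothetical, and the second form of AE2 is a rewriting introduced here, which computes the shared term (y*N+x)%8704 once and replaces the multiplication by 16384 with a shift by 14. Non-negative x, y and N are assumed.

```c
#include <assert.h>

/* AE1 = x % 3: index into the three-entry buffer D. */
long ae1(long x) {
    return x % 3;
}

/* AE2 written term by term as in the address equation. */
long ae2_literal(long x, long y, long n) {
    return (y * n + x) % 8704 + (y * n + x) % 8704 * 16384 + 7680;
}

/* AE2 with the shared term computed once; 16384 == 1 << 14, so the
   multiplication becomes a shift. Assumes x, y and n are non-negative. */
long ae2_shared(long x, long y, long n) {
    long t = (y * n + x) % 8704;
    return t + (t << 14) + 7680;
}

/* Returns 1 if both forms agree on every index of a width x height grid. */
int ae2_forms_agree(long width, long height) {
    for (long y = 0; y < height; ++y) {
        for (long x = 0; x < width; ++x) {
            if (ae2_literal(x, y, width) != ae2_shared(x, y, width)) {
                return 0;
            }
        }
    }
    return 1;
}
```

For instance, `ae2_forms_agree(176, 144)` checks the two forms over a hypothetical 176x144 index space; the frame size is an assumption chosen for illustration and does not come from the slides.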

Page 28: Ph.D. Thesis presentation

AGU

• Multimedia Domain Programmable AGU

Page 29: Ph.D. Thesis presentation

AGU Exploration Framework*

[Figure: AGU exploration flow. Inputs: the address calculation, the PE implementation patterns (+, -, <<, +/-, +/-/<<, *, %) and the constraints (max_pe=6, min_add=1, max_add=6, min_sub=1, max_sub=6, min_sft=1, max_sft=6, …). Blocks: AGU exploration framework, architecture file, AGU mapping framework. Output: report of cycles, area and energy, evaluated for all architectures that satisfy the constraints, which exposes the trade-off.]

* From Osaka University

Page 30: Ph.D. Thesis presentation

Experiments

Page 31: Ph.D. Thesis presentation

Results: Area

15% < Hardware overhead < 200%

[Figure: cluster diagrams of the original VLIW and of the VLIW with AGU. Each cluster (main cluster and second cluster) has a loop buffer, a register file and FUs; in the AGU variant each cluster also contains an AGU.]

Page 32: Ph.D. Thesis presentation

Results: Speed and Energy consumption

[Figure: charts of speed and energy consumption results; value labels: 24%, 27%, 16%, 17%, 12%, 13%, 12%, 32%, 35%, 15%]

Page 33: Ph.D. Thesis presentation


Results (applied to the MPEG4 application)

[Figure: results chart for the MPEG4 application; value label: 51%]

Page 34: Ph.D. Thesis presentation


“Stand alone” AGU

for (k …) {
  for (j …) {
    for (i …) {
      …
    }
  }
}

Page 35: Ph.D. Thesis presentation

“Stand alone” AGU (1)

Implements: i*cnst

Page 36: Ph.D. Thesis presentation

“Stand alone” AGU (2)

Implements:
i += “inc i”
i -= “dec i”

Page 37: Ph.D. Thesis presentation

“Stand alone” AGU (3)

Implements:
i+j    i << j
i*j    i >> j

Page 38: Ph.D. Thesis presentation

“Stand alone” AGU (4-5)

Implements:
(i+j) % val    (i << j) % val
(i*j) / val    (i >> j) / val

for (i=0; i<=20; i++)
  address = i%3;

ptr = -1;
for (i=0; i<=20; i++) {
  ptr++;
  if (ptr>=3) ptr -= 3;
  address = ptr;
}
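The pointer update above generalizes to a stride other than 1, which can be sketched as a small software model of one modulo-addressing register. This is a hypothetical model for illustration, not the AGU hardware described in the thesis: `struct mod_counter`, its fields and the stride parameter are assumptions. It returns the current address and then advances (equivalent to starting at -1 and incrementing first), and it relies on 0 < step <= mod so that a single compare-and-subtract is enough.

```c
#include <assert.h>

/* Hypothetical software model of one modulo-addressing register. */
struct mod_counter {
    int ptr;  /* current address offset, kept in [0, mod)     */
    int step; /* increment per access, assumed 0 < step <= mod */
    int mod;  /* length of the circular buffer                 */
};

/* Returns the current address, then advances the pointer with one
   compare-and-subtract instead of a modulo operation. */
int mod_counter_next(struct mod_counter *c) {
    int address = c->ptr;
    c->ptr += c->step;
    if (c->ptr >= c->mod) c->ptr -= c->mod;
    return address;
}

/* Returns 1 if the first `count` addresses equal (i * step) % mod. */
int matches_modulo(int step, int mod, int count) {
    struct mod_counter c = {0, step, mod};
    for (int i = 0; i < count; ++i) {
        if (mod_counter_next(&c) != (i * step) % mod) {
            return 0;
        }
    }
    return 1;
}
```

With `step = 1` and `mod = 3` the model produces the same sequence as `address = i%3` in the loop above.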

Page 39: Ph.D. Thesis presentation

“Stand alone” AGU (6)

Implements:
i+j+k    (i << j)+k
i*j+k    (i >> j)+k

Page 40: Ph.D. Thesis presentation

With this AGU

• Conditions:
– No control flow
– No dividers*
– No modulo operations*

• In the cavity detector application:
– 2% hardware overhead
– 50% energy and cycles reduction

* That cannot be transformed with non-linear operator strength reduction

Page 41: Ph.D. Thesis presentation

Let’s talk about…

… optimization (again).

Page 42: Ph.D. Thesis presentation

Instruction loop buffering optimization

[Figure: three panels, each showing the datapath, L1 and the distributed L0. Panel and phase labels: Normal Operation, Filling, L0 Buffer Operation; Initiation, Execution, Termination.]

Page 43: Ph.D. Thesis presentation

Summary of the optimizations (on the MPEG4 application)

[Figure: optimizations at the code, compiler and hardware levels, applied to the data memory, address generation, data path and instruction memory.]

Page 44: Ph.D. Thesis presentation

Results: Cycles

MORE THAN 90%!!! with respect to the first straightforward implementation

Page 45: Ph.D. Thesis presentation

Final energy distribution

MPEG4 encoder application:
- 90nm technology
- 500MHz

Page 46: Ph.D. Thesis presentation

Let’s talk about…

… conclusions.

Page 47: Ph.D. Thesis presentation

Thesis contributions (1)

• Address generation unit template for the embedded multimedia domain
– Improvements between 12% and 35% on several benchmarks and applications (cycles and energy)
– Improvement of 51% in energy consumption on a real application (MPEG4), with respect to the previous optimization step
– Global improvements over 90% when applying a complete optimization methodology

Page 48: Ph.D. Thesis presentation

Thesis contributions (2)

• Quantitative comparison of different platforms commonly used in the embedded domain
• Systematic classification of address generators
• Review of the literature on address generation optimization according to the classification
• Introduction of the reconfigurable AGU framework results into the COFFEE framework
• Application of a complete methodology to optimize energy consumption on a real data-flow application, including address generation steps

Page 49: Ph.D. Thesis presentation

Open issues:

• Support for more loops and control
• Bit calculation
• Merge of index expressions
• Extension to other benchmarks and applications
• Heterogeneous distributed AGUs
• Distributed loop buffers with different speeds
• Complete DTSE optimization

Page 50: Ph.D. Thesis presentation

Page 51: Ph.D. Thesis presentation

Page 52: Ph.D. Thesis presentation

End of presentation and open discussion

??

Page 53: Ph.D. Thesis presentation

Page 54: Ph.D. Thesis presentation

Publications

Journal papers:

• G. Talavera, M. Jayapala, J. Carrabina, and F. Catthoor, “Address generation optimization for embedded high-performance processors: A survey”, Journal of Signal Processing Systems for Signal Image and Video Technology (formerly the Journal of VLSI Signal Processing Systems for Signal Image and Video Technology), May 2008 (online), December 2008 (printed version).

• G. Talavera, A. Portero, P. Raghavan, M. Jayapala, J. Carrabina, and F. Catthoor, “Power exploration and address generation optimization of multimedia applications on VLIW processors”, Planned for re-submission to the IEEE Transactions on Image Processing.

• A. Portero, G. Talavera, J. Carrabina, and F. Catthoor, “Methodology for multimedia applications in multiplatform implementation for energy-flexibility space exploration”, Planned for re-submission to the IEEE Transactions on Computers.

• A. Portero, G. Talavera, J. Carrabina, and F. Catthoor, “Data-dominant application implementation in multi-platform for energy-flexibility space exploration”, Planned for re-submission to the IEEE Transactions on Image Processing.

Page 55: Ph.D. Thesis presentation


Conference papers

• A. Lambrechts, T. Vander Aa, M. Jayapala, A. Leroy, G. Talavera, A. Shickova, F. Barat, F. Catthoor, D. Verkest, G. Deconinck, H. Corporaal, F. Robert, and J. C. Bordoll, “Design style case study for compute nodes of a heterogeneous NoC platform”, in 25th IEEE Real-Time Systems Symposium (RTSS), December 2004.

• G. Talavera, V. Nollet, J.-Y. Mignolet, D. Verkest, S. Vernalde, R. Lauwereins, and J. Carrabina, “Hardware-Software debugging techniques for reconfigurable Systems-on-Chip”, in International Conference on Industrial Technology (IEEE ICIT '04), vol. 3, Dec. 2004, pp. 1402-1407.

• G. Talavera, V. Nollet, J.-Y. Mignolet, D. Verkest, S. Vernalde, R. Lauwereins, and J. Carrabina, “Métodos de depuración HW-SW para sistemas on chip reconfigurables”, in Jornadas Sobre Computación Reconfigurable y Aplicaciones (JCRA), Barcelona, Spain, September 2004, pp. 251-258.

• A. Lambrechts, P. Raghavan, A. Leroy, G. Talavera, T. Vander Aa, M. Jayapala, F. Catthoor, D. Verkest, G. Deconinck, H. Corporaal, F. Robert, and J. Carrabina, “Power breakdown analysis for a heterogeneous NoC platform running a video application”, in 16th IEEE International Conference on Application-Specific Systems, Architecture Processors (ASAP), July 2005, pp. 179-184.

• A. Portero, G. Talavera, M. Monton, B. Martinez, and J. Carrabina, “NoC system for MPEG-4 SP using heterogeneous tiles”, in Design of Circuits and Integrated Systems (DCIS), San Diego, California, USA, December 2006.

• A. Portero, G. Talavera, M. Monton, B. Martinez, M. Moreno, F. Catthoor, and J. Carrabina, “Energy-aware mpeg-4 single profile in HW-SW multiplatform implementation”, in IEEE International SOC Conference, Austin, Texas, USA, Sept. 2006, pp. 13-16.

• A. Portero, G. Talavera, M. Monton, B. Martinez, F. Catthoor, and J. Carrabina, “Dynamic voltage scaling for power efficient MPEG4-SP implementation”, in Proceedings of the IEEE 17th International Conference on Application-specific Systems, Architectures and Processors (ASAP). Washington, DC, USA: IEEE Computer Society, 2006, pp. 257-260.

• A. Portero, G. Talavera, F. Catthoor, and J. Carrabina, “A study of a MPEG-4 codec in a multiprocessor platform”, in IEEE International Symposium on Industrial Electronics (ISIE), 2006, vol. 1, July 2006, pp. 661-666.

Page 56: Ph.D. Thesis presentation


Teaching publications

• G. Talavera, J. Saiz, and J. Carrabina, “Dispositivos y plataformas para docencia de informática y electrónica”, in Jornadas Sobre Computación Reconfigurable y Aplicaciones (JCRA), Barcelona, Spain, September 2004, pp. 711-717.

• G. Talavera, B. Lorente, M. Monton, B. Martinez, J. Oliver, C. Ferrer, L. Ribas, J. Aguilo, and E. Valderrama, “Nuevas metodologías docentes y autoaprendizaje en la enseñanza técnica universitaria”, in Congreso Internacional de Docencia Universitaria e Innovación (CIDUI), Barcelona, Spain, 2006

• B. Lorente, G. Talavera, L. Ribas, and E. Valderrama, “Implantació d'una nova metodologia docent a les pràctiques de fonaments de computadors d'enginyeria informàtica”, in Congreso Internacional de Docencia Universitaria e Innovación (CIDUI), Barcelona, Spain, 2006.

• G. Talavera, X. Fitó, B. Lorente, A. Portero, M. Montón, B. Martínez, J. Oliver, C. Ferrer, L. Ribas, J. Aguiló, and E. Valderrama, “Adaptación metodológica a las nuevas directrices del EEES en la enseñanza técnica universitaria”, in Tecnologías Aplicadas a la Enseñanza de la Electrónica (TAEE), Madrid, Spain. 2006.

• A. Portero, J. Saiz, G. Talavera, R. Aragonés, M. Rullán, J. Aguiló, and E. Valderrama, “Aplicación del plan piloto en sistemas digitales en ingeniería informática siguiendo las directivas del EEES”, in Tecnologías Aplicadas a la Enseñanza de la Electrónica (TAEE), Madrid, Spain, 2006.

• G. Talavera, F. X. Fitó, B. Lorente, M. Montón, B. Martínez, C. Ferrer, and E. Valderrama, “Cas pràctic d'adaptació metodològica a les directrius EEES d'una assignatura d'enginyeria informàtica”, in III Jornada de Campus d'Innovació Docent, UAB, Barcelona, Spain, 20 September 2006.

• E. Valderrama, G. Talavera, M. Montón, B. Martínez, J. M. Fernández, and J. Muñoz, “Comparación de dos metodologías docentes utilizadas en los seminarios de fundamentos de computadores”, in XIV Jornadas de Enseñanza Universitaria de la Informática (JENUI), 2008.

Page 57: Ph.D. Thesis presentation


Results: Energy

MORE THAN 90%!!! with respect to the first straightforward implementation

Page 58: Ph.D. Thesis presentation


Reconfigurable AGU template

Page 59: Ph.D. Thesis presentation

The loop buffer operation: An Illustration

for (..) {
  …
  if (..) {.….}
  else {.….}
  …
}

VLIW schedule (one row per instruction word):

LBON <offset>
S:  OP11  OP21  OP31  NOP
    NOP   OP22  OP32  BNZ ‘x’
    OP12  NOP   NOP   BR ‘y’     (if block)
X:  OP13  NOP   OP33  NOP        (else block)
Y:  OP14  OP23  NOP   BNZ ‘s’

Page 60: Ph.D. Thesis presentation

The loop buffer operation: An Illustration

[Figure: the loop code and VLIW schedule of the previous slide, together with the loop buffer control (IROC; START_ADDR, END_ADDR, IR_USE, NEW_PC, PC) and the operations stored per issue slot: FU1: OP11, OP12, OP13, OP14; FU2: OP21, OP22, OP23; FU3: OP31, OP32, OP33; BR: BNZ ‘x’, BR ‘y’, BNZ ‘s’.]
