Address Generation Unit as Accelerator Block in DSP Marko Ilić1, Mile Stojčev1
Abstract – A wide variety of arithmetic-intensive and scientific computing applications are characterized by a large number of data accesses. Such applications contain complex offset address manipulations. For most target digital signal processing (DSP) architectures, these memory-intensive applications present significant bottlenecks for system design in terms of memory bandwidth and memory access latency, which can result in poor utilization of the DSP computational logic. This calls for the design of optimized address generation units (AGUs) capable of dealing with higher issue and execution rates, a larger number of memory references, and demanding memory-bandwidth requirements. In this paper we describe an efficient hardware AGU, embedded into a standalone accelerator processing block, intended for fast generation of memory addresses in 1D and 2D organized data memory.
Keywords – Address generator, accelerator, ASIC design
I. INTRODUCTION
Future semiconductor chips will incorporate hundreds of processing elements running in parallel. In such solutions, access to data will become the main bottleneck that limits the available parallelism. Typical real-time embedded multimedia applications, including video and audio processing, are often characterized by a large number of data accesses [1]. Many of these applications contain intensive index manipulations, resulting in complex address patterns. Address calculations in such high-throughput systems involve linear and polynomial arithmetic expressions which have to be evaluated during program execution under strict timing constraints [2].
Manipulations with structured data types are not efficiently supported by current RISC architectures. Therefore, compilers must generate a considerable amount of code for fast manipulation of array data structures and various other complex data types, such as records. This code imposes significant overhead on system performance and slows down program execution a great deal. In order to cope with the latency of data access, i.e. to speed up address expression evaluation, special hardware building blocks, called address generation units, are designed. The function of an address generation unit (AGU) is threefold. First, during initialization, it transforms the host memory address space into the accelerator memory address space. Second, it provides efficient memory data access during accelerator operation. Third, it performs fast data transfer between the accelerator and host memory at the end of the computation.
In this paper, we describe the synthesis of an AGU architecture based on a data-path, in which a piece-wise affine address equation is used to generate address sequences for a planar memory array organization.
II. ADDRESS GENERATOR AS ACCELERATOR
In computing, hardware acceleration is the use of hardware to perform some function faster than is possible in software running on a general-purpose CPU. Normally, processors are sequential, and instructions are executed one by one. Various techniques are used to improve performance; hardware acceleration is one of them. The main difference between hardware and software is concurrency, which allows hardware to be much faster than software. Hardware accelerators are designed for computationally intensive software code. Depending on the granularity, hardware acceleration can vary from a small functional unit to a large functional block. The hardware that performs the acceleration, when it is a separate unit from the CPU, is referred to as a hardware accelerator [5].
Our approach of using address generation units (AGUs) as accelerator blocks starts from the fact that most current computers spend more time computing addresses and accessing data than performing the operations required by the application programs. This is particularly the case for RISC architectures, where the available addressing modes are very limited and address calculations are almost exclusively performed in software [4]. In other words, conventional architectures provide inadequate support for accessing the types of data structures used in current applications; AGUs attempt to overcome these deficiencies by decoupling the data access tasks from the data computation tasks and overlapping the execution of the two types of tasks.
A survey of methods and techniques that optimize the address generation process for embedded systems is given in [3]. Several AGU architectures for data-stream based computer systems have been briefly described in [6]. Here we propose an architecture that offers an opportunity to use fine-grain streamed data access patterns and to develop configurable FPGA-based hardware that directly supports stream-based data access modes.
III. CLASSICAL MACHINE MODEL
For the AGU implementation we will use an abstract machine model, to keep the proposed techniques non-target-specific while still taking advantage of the instruction-level parallelism and other special features offered by modern processors. The classical abstract model, shown in Fig. 1, which will be used as a starting design solution in our approach, captures the special capabilities of modern general-purpose CPU cores and DSPs [7]. It is essentially a RISC-like load-store architecture, but augmented with DSP features: simultaneous compute, data-move and address-update operations; linear and modular addressing; and built-in looping instructions.
1Marko Ilić, Mile Stojčev are with University of Nis, Faculty of Electronic Engineering, Aleksandra Medvedeva 14, 18000 Nis, Serbia, E-mail: [email protected], [email protected]
978-1-4577-2019-2/11/$26.00 ©2011 IEEE 563
TELSIKS 2011, Nis, Serbia, October 5-8, 2011
[Block diagram: Memory, Load/Store Unit, Data Registers, Data-path (Functional Units - FUs), Address registers, Index registers, Address generator; Processor, System Bus]
Fig. 1. Standard modern machine architecture
The machine has an arbitrary number of registers and functional units (FUs), with four main register types: integer data, floating-point data, address and index. Each operation performed by a single FU is called an atomic operation. The machine model has the ability to perform several atomic operations (called compound operations) per cycle. Compound operations are formed by: a) chaining – an FU produces a result that can be used by a different FU in the same instruction; b) grouping – two or more FUs execute simultaneously, and none requires the result of any other.
The communication path between a processor and memory poses fundamental limit to performance, commonly referred to as the “von Neumann bottleneck”. This bottleneck coupled with inefficient address manipulation capabilities, forces the serialization of data access and data computation which tremendously limits system performance. Decoupled access/execute architectures represent a viable solution to the above problem. These accelerator architectures are based on the decoupled access/execute model of computation. In this model, each computational task is partitioned into a data access process, responsible for address generation and memory access, and a data computation process responsible for performing the functional operations on data.
We have implemented the decoupled access/execute architecture as depicted in Fig. 2.
[Block diagram: stand-alone accelerator processing block containing the AGU, the Accelerator Memory (ACCM) and the Functional Units (FUs), connected over the System Bus to the Host and Host Memory]
Fig. 2. Decoupled access/execute architecture
A stand-alone accelerator processing block in our target architecture model consists of the Accelerator Memory, Functional Units, and AGU. The function and structure of AGU will be described next.
IV. AGU FUNCTION
Two main types of AGU architectures exist: ones based on look-up tables and ones based on a data-path. The main problem in AGU realization relates to efficient generation of address sequences for a given application. The address sequence is generated from an address equation (AE), which is a function extracted from the software description of the algorithm [3], and is defined as AE = f(I1,…, In, r1,…, rm), where the parameters are the indices Ii, i = 1,…,n, of nested loops, or range addresses rj, j = 1,…,m. From the regularity point of view the AE can take one of the following three forms [3]: 1) affine AE – the AE is a linear expression of the indices Ii and constants Ci, of the form AE = C0 + C1I1 + … + CnIn; 2) piece-wise affine AE – parts of the AE can be written as linear expressions of the indices and constants; and 3) non-linear AE – there is no linear relation between the AE and the address indices.
Here we consider the synthesis of an AGU architecture based on a data-path, in which a piece-wise affine AE is used for address sequence generation. Further, depending on the application, we assume that the Accelerator Memory block (see Fig. 2) can be organized as: a) a linear array (1D) – the memory address (see Fig. 3a)) consists of two address fields, Base and Index, and the AE is defined as AE1D = C0 + C1I1; or b) a planar array (2D) – the memory address (see Fig. 3b)) is composed of three fields, Base, Index1 and Index2, and the AE is defined as AE2D = C0 + C1I1 + C2I2.
In both cases, the address field Base points to the starting (base) memory location of the array, while the address fields Index1 and Index2 correspond to the offset of a data element with respect to the base address. The linear array organization is typical for 1D FIR and IIR filtering, while the planar array organization is suitable for image processing. The size of the Memory block (see Fig. 2) is finite and defined during the AGU's design phase. As a consequence, the address fields Base, Index1 and Index2 are of fixed bit-size, too. During AE calculation, this allows us to manipulate all three address fields concurrently and separately, concatenating them when the final memory access address is generated.
a) Base | Index
b) Base | Index1 | Index2
Fig. 3. Address format
AGU manipulations with the address fields Base and Index1(2) are given in Fig. 4. The AGU performs address generation in three steps. During the first step, switching between indices IX1 and IX2 is performed. The switching activity can be described as follows:
IX1(2)* = IX1(2), for sel = 0
IX1(2)* = IX2(1), for sel = 1          (1)
The second step covers manipulations with the indices IX1* and IX2*: logical, arithmetical and shift operations over the base and index variables are performed. As a result of this step, the transformed indices IX1(2)T and the transformed base ABT are obtained. In the last step, the memory access address is generated by concatenating the address fields ABT, IX1T and IX2T.
[Block diagram: the Switcher S1, controlled by SelAGU, passes or exchanges the indices IX1 and IX2 taken, together with the base AB, from the address register (ARx, XRx); the Index I1 and I2 Manipulators and the Base Manipulator produce the transformed fields IX1T, IX2T and ABT; the Address concatenator (physical grouping) outputs the memory access address (MAA)]
Fig. 4. Address fields concatenation in AGU
Table I lists the elementary operations that the AGU performs over a single index (IX1 or IX2) or the base address field (AB). We assume that the starting address is stored in an address register ARx, x = 0,…,3, and the offset value N in an index address register XRx.
TABLE I
ELEMENTARY OPERATIONS OVER ADDRESS FIELDS

Addressing mode      | Operation     | Comment
No-update            | IX1(2)        | no change
Exchanging indices   | XCHG          | Index1 ↔ Index2
Postincrement by k   | (IX1(2))+     | k = 1, 2, …, 8
Postdecrement by k   | (IX1(2))-     | k = -1, -2, …, -8
Preincrement by k    | +(IX1(2))     | k = 1, 2, …, 8
Predecrement by k    | -(IX1(2))     | k = -1, -2, …, -8
Postinc. by offset N | (IX1(2))+IRx  | x = 0, …, 3
Postdec. by offset N | (IX1(2))-IRx  | x = 0, …, 3
Indexed by offset N  | (IX1(2)+IRx)  | index addressing
Divide by 2          | IX1(2)/2      | right cyclic rotation
Multiply by 2        | IX1(2)*2      | left cyclic rotation
Short word disp.     | (IX1(2)+SW)   | SW = 15, 31, 63
Long word disp.      | (IX1(2)+LW)   | LW = 16, 32, 64
Having in mind that the Index1 and Index2 fields are of fixed size, all add/sub and inc/dec operations are performed in modulo arithmetic (i.e. circular addressing is possible).
[Block diagram: the control unit AGU_FSM generates the signals ComD, Sel1, Sel2, SelARx, SelXRx, Ctrl C1, Alu Fun 1, Alu Fun 2 and Shift L/R/Pass; the data-path AGU_DP contains the address registers AR0-AR3 (fields AB, IX1, IX2), the index registers XR0-XR3 (fields IB, IR1, IR2), the Switcher S1, multiplexers MUX1 and MUX2, ALU1 and ALU2, a Barrel Shifter, the comparator Comp, and the Const. logic (constants from -8 to +64); the output is the memory access address (MAA) on the DATA-BUS]
Fig. 5. Block diagram of AGU
The structure of the AGU is sketched in Fig. 5. The AGU consists of two building blocks: AGU_FSM, as the control unit, and AGU_DP, as the data-path. AGU_FSM is implemented as a finite state machine of Moore type; it generates the control signals for correctly driving the data-path constituents. AGU_DP is composed of:
- AR0,…,AR3 – four address registers, used for storing the base memory address of a 1D or 2D array element;
- XR0,…,XR3 – four index registers, intended for storing the index address of some element within a 1D or 2D array;
- S1 – a switching node which can pass through or mutually exchange (cross-point) the index values;
- MUX1, MUX2 – signal selector blocks;
- Comp – compares the value of the actually generated address with a predefined one;
- Const. logic – a combinatorial logic block used for generating a constant value (ranging from -8 up to +64) during address calculation.
Constant injection is realized as a bit-wise XOR operation on the corresponding data value bits. In Fig. 6 the principle of constant generation in the range from 0 up to 15 is presented. For example, the constant 6 (d0=0, d1=1, d2=1, d3=0) is generated when the control signals cont1=cont4=0 and cont2=cont3=1. A negative constant is obtained when ALU1 performs the NEG operation. ALU2 is used for index value updating.
[Logic diagram: Decoder Logic with select inputs sel0-sel2 and enable, and control lines cont1-cont4 setting the constant bits d0-d3]
Fig. 6. Constant logic
We will evaluate the AGU's performance for a case when the accelerator memory ACCM (see Fig. 2) uses a 2D organization and row-major ordering for storing the values of picture (image) elements (pixels).

[Access patterns over an image in x-y coordinates: a) Single steps; b) Linear scan; c) Video scan; d) Zig-zag scan]
Fig. 7. Access patterns

Let the capacity of ACCM be large enough to store at least three images, denoted as A, B and C, respectively. Each image has a resolution of n pixels in the horizontal and m lines in the vertical direction. To each pixel a value is appended which encodes its luminance and chrominance. We assume that the accelerator data-path (FUs in Fig. 2) performs the following data processing operation: C(i,j) = A(i,j) op B(i,j), where op is some arithmetical or logical instruction. The order of access to the pixels of pictures A, B and C depends on the application; some typical access patterns are presented in Fig. 7. The main task now is to evaluate the speedup, Sp, of the system when the AGU is used as a building block for address generation, compared with a solution in which address calculation is performed in software. In order to simplify the analysis, but without loss of generality, we assume that the duration of each assembly language machine cycle is equal to the AGU cycle. The obtained results are presented in Table II. As can be seen from Table II, for all access patterns Sp equals 2.5 when the access address pattern is described as a single nested loop, while Sp = 2.4 for doubly nested loops when the branch-not-taken condition is fulfilled. According to the obtained results, we can conclude that the implemented AGU is an efficient hardware building block, because it accelerates address pattern calculation by more than 240%.
TABLE II
SOFTWARE AND AGU ACCESS PATTERN IMPLEMENTATION

Access pattern | Soft. impl. (mach. cycles) | AGU (clock cycles) | Speedup
Fig. 7a), b)   | 9                          | 4                  | 2.5
Fig. 7c)       | 9                          | 4                  | 2.5
Fig. 7d)       | 9 (12)                     | 4 (5)              | 2.5 (2.4)
V. CONCLUSION
In the design of embedded systems (ESs), memory issues play a very important role and often significantly impact the ES's performance, power dissipation, and overall cost of implementation. The traditional CPU-memory gap widens and often becomes the dominant bottleneck in achieving high performance. Most current processors spend more time generating data addresses and accessing memory than performing the functional operations on data required by the application programs. This is mainly due to the inadequate support provided by conventional architectures for accessing the data types used in current applications and their inefficient handling of the "von Neumann bottleneck". Specialized programmable hardware address generation units are used to speed up address expression evaluation. These units benefit from the fact that compiler code optimization techniques which map HLL data constructs to the addressing unit of the architecture have lagged far behind. However, AGUs are costly in terms of area, since they include several arithmetical units, registers, and/or other combinatorial logic to provide enough programmability. In this paper, we have presented a structure of address generation hardware derived from the address sequence to be generated. One of the primary design goals was to devise efficient support for the decoupled access/execute architectural model in order to minimize the impact of memory accesses on processor performance. The structure is flexible, low-cost, and extensible to almost any sequence generation problem where the sequence exhibits some symmetry or regularity. The methodology used for address computation and generation was verified on real FPGA-based computer boards [8].
ACKNOWLEDGEMENTS
This work was supported by the Serbian Ministry of Science and Technological Development, Project No. TR-32009 – “Low-Power Reconfigurable Fault-Tolerant Platforms”.
REFERENCES
[1] Ze-Nian Li, Mark S. Drew, Fundamentals of Multimedia, Pearson Education, Inc., Upper Saddle River, NJ, 2004.
[2] Miguel A. Miranda et al., "High-level address optimization and synthesis techniques for data transfer-intensive applications", IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 6, No. 4, pp. 677-686, 1998.
[3] Guillermo Talavera, Murali Jayapala, Jordi Carrabina, Francky Catthoor, "Address generation optimization for embedded high-performance processors: a survey", J. Signal Processing Systems, Vol. 53, pp. 271-284, 2008.
[4] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 4-th ed., Morgan Kaufmann, San Francisco, 2007
[5] Hardware Acceleration, available 04, April 2011, http://en.wikipedia.org/wiki/Hardware_acceleration
[6] Michael Herz et al., "Memory Addressing Organization for Stream-based Reconfigurable Computing", ICECS 2002, The 9th IEEE International Conference on Electronics, Circuits and Systems, Vol. 2, pp. 813-817, Dubrovnik, 2002.
[7] Dake Liu, Embedded DSP Processor Design, Morgan Kaufmann, San Francisco 2008.
[8] XILINX, Embedded Development HW/SW Kit, Spartan-3A DSP S3D1800A, MicroBlaze Processor Edition.