Address Generation Unit as Accelerator Block in DSP Marko Ilić1, Mile Stojčev1
Abstract – A wide variety of arithmetic-intensive and scientific computing applications are characterized by a large number of data accesses. Such applications contain complex offset address manipulations. For most target digital signal processing (DSP) architectures, these memory-intensive applications present significant bottlenecks for system design in terms of memory bandwidth and memory access latency, which can result in poor utilization of the DSP computational logic. This calls for the design of optimized address generation units (AGUs) capable of dealing with higher issue and execution rates, a larger number of memory references, and demanding memory-bandwidth requirements. In this paper we describe an efficient hardware AGU, embedded into a standalone accelerator processing block, intended for fast generation of memory addresses in 1D and 2D organized data memory.
Keywords – Address generator, accelerator, ASIC design
I. INTRODUCTION
Future semiconductor chips will incorporate hundreds of processing elements running in parallel. In such solutions, access to data will become the main bottleneck that limits the available parallelism. Typical real-time embedded multimedia applications, including video and audio processing, are often characterized by a large number of data accesses [1]. Many of these applications contain intensive index manipulations, resulting in complex address patterns. Address calculations in such high-throughput systems involve linear and polynomial arithmetic expressions which have to be evaluated during program execution under strict timing constraints [2].
Manipulations with structured data types are not efficiently supported by current RISC architectures. Therefore, compilers must generate a considerable amount of code for fast manipulation of array data structures and various other complex data types, such as records. This code imposes significant overhead on system performance and slows down program execution a great deal. In order to cope with the latency of data access, i.e. to speed up address expression evaluation, special hardware building blocks, called address generation units, are designed. The function of an address generation unit (AGU) is threefold. First, during initialization, it transforms the host memory address space into the accelerator memory address space. Second, it provides efficient memory data access during accelerator operation. Third, it performs fast data transfer between the accelerator and host memory at the end of the computation.
In this paper, we describe the synthesis of an AGU architecture based on a data-path, in which a piece-wise affine address equation is used to generate address sequences for a planar memory array organization.
II. ADDRESS GENERATOR AS ACCELERATOR
In computing, hardware acceleration is the use of hardware to perform some function faster than is possible in software running on a general-purpose CPU. Normally, processors are sequential, and instructions are executed one by one. Various techniques are used to improve performance; hardware acceleration is one of them. The main difference between hardware and software is concurrency, which allows hardware to be much faster than software. Hardware accelerators are designed for computationally intensive software code. Depending on the granularity, hardware acceleration can vary from a small functional unit to a large functional block. The hardware that performs the acceleration, when it is a separate unit from the CPU, is referred to as a hardware accelerator [5].
Our approach of using address generation units (AGUs) as accelerator blocks starts from the fact that most current computers spend more time computing addresses and accessing data than performing the operations required by the application programs. This is particularly the case for RISC architectures, where the available addressing modes are very limited and address calculations are almost exclusively performed in software [4]. In other words, conventional architectures provide inadequate support for accessing the types of data structures used in current applications; AGUs attempt to overcome these deficiencies by decoupling the data access tasks from the data computation tasks and overlapping the execution of the two types of tasks.
A survey of methods and techniques that optimize the address generation process for embedded systems is given in [3]. Several AGU architectures for data-stream based computer systems have been briefly described in [6]. Here we propose an architecture that offers an opportunity to use fine-grain streamed data access patterns and to develop configurable FPGA-based hardware that directly supports stream-based data access modes.
III. CLASSICAL MACHINE MODEL
For the AGU implementation we will use an abstract machine model, to keep the proposed techniques non-target-specific while still taking advantage of the instruction-level parallelism and other special features offered by modern processors. The classical abstract model, shown in Fig. 1, which will be used as a starting design solution in our approach, captures the special capabilities of modern general-purpose CPU cores and DSPs [7]. It is essentially a RISC-like load-store architecture, but augmented with DSP features: simultaneous compute, data-move and address-update operations; linear and modular addressing; and built-in looping instructions.
1Marko Ilić, Mile Stojčev are with University of Nis, Faculty of Electronic Engineering, Aleksandra Medvedeva 14, 18000 Nis, Serbia, E-mail: [email protected], [email protected]
978-1-4577-2019-2/11/$26.00 ©2011 IEEE 563
TELSIKS 2011, Nis, Serbia, October 5-8, 2011
[Block diagram: Memory, Load/Store Unit, Data Registers, Data-path (Functional Units - FUs), Address registers, Index registers, Address generator; Processor, System Bus]
Fig. 1. Standard modern machine architecture
The machine has an arbitrary number of registers and functional units (FUs), with four main register types: integer data, floating-point data, address and index. Each operation performed by a single FU is called an atomic operation. The machine model has the ability to perform several atomic operations (called compound operations) per cycle. Compound operations are formed by: a) chaining – an FU produces a result that can be used by a different FU in the same instruction; b) grouping – two or more FUs execute simultaneously, and none requires the result of any other.
The communication path between a processor and memory poses fundamental limit to performance, commonly referred to as the “von Neumann bottleneck”. This bottleneck coupled with inefficient address manipulation capabilities, forces the serialization of data access and data computation which tremendously limits system performance. Decoupled access/execute architectures represent a viable solution to the above problem. These accelerator architectures are based on the decoupled access/execute model of computation. In this model, each computational task is partitioned into a data access process, responsible for address generation and memory access, and a data computation process responsible for performing the functional operations on data.
We have implemented the decoupled access/execute architecture as depicted in Fig. 2.
[Block diagram: stand-alone accelerator processing block containing the AGU, the Accelerator Memory (ACCM) and the Functional Units (FUs), connected over the System Bus to the Host and Host Memory]
Fig. 2. Decoupled access/execute architecture
A stand-alone accelerator processing block in our target architecture model consists of the Accelerator Memory, Functional Units, and AGU. The function and structure of AGU will be described next.
IV. AGU FUNCTION
Two main types of AGU architectures exist: ones based on look-up tables and ones based on a data-path. The main problem in AGU realization relates to efficient generation of address sequences for a given application. The address sequence is generated from an address equation (AE), which is a function extracted from the software description of the algorithm [3], and is defined as AE = f(I1,…, In, r1,…, rm), where the parameters are the indices Ii, i = 1,…,n, of nested loops, or range addresses rj, j = 1,…,m. From the regularity point of view the AE can take one of the following three forms [3]: 1) affine AE – the AE is a linear expression of the indices Ii and constants Ci, of the form AE = C0 + C1I1 + … + CnIn; 2) piece-wise affine AE – parts of the AE can be written as linear expressions of the indices and constants; and 3) non-linear AE – there is no linear relation between the AE and the address indices.
Here we consider the synthesis of an AGU architecture based on a data-path, in which a piece-wise affine AE is used for address sequence generation. Further, depending on the application, we assume that the Accelerator Memory block (see Fig. 2) can be organized as: a) a linear array (1D) – the memory address (see Fig. 3a)) consists of two address fields, Base and Index, and the AE is defined as AE1D = C0 + C1I1; or b) a planar array (2D) – the memory address (see Fig. 3b)) is composed of three fields, Base, Index1 and Index2, and the AE is defined as AE2D = C0 + C1I1 + C2I2.
In both cases, the address field Base points to the starting (base) memory location of the array, while the address fields Index1 and Index2 correspond to the offset of a data element with respect to the base address. The linear array organization is typical for 1D FIR and IIR filtering, while the planar array organization is suitable for image processing. The size of the Memory block (see Fig. 2) is finite and defined during the AGU's design phase. As a consequence, the address fields Base, Index1 and Index2 are of fixed bit-size, too. During AE calculation, this allows us to manipulate all three address fields concurrently and separately, concatenating them when the final memory access address is generated.
a) Base | Index
b) Base | Index1 | Index2
Fig. 3. Address format
AGU manipulations with the address fields Base and Index1(2) are given in Fig. 4. The AGU performs address generation in three steps. During the first step, switching between indices IX1 and IX2 is performed. The switching activity can be described as follows:
IX1(2)* = IX1(2), for sel = 0
IX1(2)* = IX2(1), for sel = 1          (1)
The second step covers manipulations with the indices IX1* and IX2*: logical, arithmetical and shift operations over the base and index variables are performed. As a result of this step, the transformed indices IX1(2)T and the transformed base ABT are obtained. In the last step, the memory access address is generated by concatenating the address fields ABT, IX1T and IX2T.
[Block diagram: the Switcher S1, controlled by SelAGU, passes or exchanges the indices IX1 and IX2 taken, together with the base AB, from the address register (ARx, XRx); the Index I1 and I2 Manipulators and the Base Manipulator produce the transformed fields IX1T, IX2T and ABT; the Address concatenator (physical grouping) outputs the memory access address (MAA)]
Fig. 4. Address fields concatenation in AGU
Table I lists the elementary operations that the AGU performs over a single index (IX1 or IX2) or the base address field (AB). We assume that the starting address is stored in an address register ARx, x = 0,…,3, and the offset value N in an index address register XRx.
TABLE I
ELEMENTARY OPERATIONS OVER ADDRESS FIELDS

Addressing mode      | Operation     | Comment
No-update            | IX1(2)        | no change
Exchanging indices   | XCHG          | Index1 ↔ Index2
Postincrement by k   | (IX1(2))+     | k = 1, 2, …, 8
Postdecrement by k   | (IX1(2))-     | k = -1, -2, …, -8
Preincrement by k    | +(IX1(2))     | k = 1, 2, …, 8
Predecrement by k    | -(IX1(2))     | k = -1, -2, …, -8
Postinc. by offset N | (IX1(2))+IRx  | x = 0, …, 3
Postdec. by offset N | (IX1(2))-IRx  | x = 0, …, 3
Indexed by offset N  | (IX1(2)+IRx)  | index addressing
Divide by 2          | IX1(2)/2      | right cyclic rotation
Multiply by 2        | IX1(2)*2      | left cyclic rotation
Short word disp.     | (IX1(2)+SW)   | SW = 15, 31, 63
Long word disp.      | (IX1(2)+LW)   | LW = 16, 32, 64
Having in mind that the Index1 and Index2 fields are of fixed size, all add/sub and inc/dec operations are performed in modulo arithmetic (i.e. circular addressing is possible).
[Block diagram: the control unit AGU_FSM generates the signals ComD, Sel1, Sel2, SelARx, SelXRx, Ctrl C1, Alu Fun 1, Alu Fun 2 and Shift L/R/Pass; the data-path AGU_DP contains the address registers AR0-AR3 (fields AB, IX1, IX2), the index registers XR0-XR3 (fields IB, IR1, IR2), the Switcher S1, multiplexers MUX1 and MUX2, ALU1 and ALU2, a Barrel Shifter, the comparator Comp, and the Const. logic (constants from -8 to +64); the output is the memory access address (MAA) on the DATA-BUS]
Fig. 5. Block diagram of AGU
The structure of the AGU is sketched in Fig. 5. The AGU consists of two building blocks: AGU_FSM, as the control unit, and AGU_DP, as the data-path. AGU_FSM is implemented as a finite state machine of Moore type; it generates the control signals for correctly driving the data-path constituents. AGU_DP is composed of:
- AR0,…,AR3 – four address registers, used for storing the base memory address of a 1D or 2D array element;
- XR0,…,XR3 – four index registers, intended for storing the index address of some element within a 1D or 2D array;
- S1 – a switching node which can pass through or mutually exchange (cross-point) the index values;
- MUX1, MUX2 – signal selector blocks;
- Comp – compares the value of the actually generated address with a predefined one;
- Const. logic – a combinatorial logic block used for generating a constant value (ranging from -8 up to +64) during address calculation.
Constant injection is realized as a bit-wise XOR operation on the corresponding data value bits. In Fig. 6 the principle of constant generation in the range from 0 up to 15 is presented. For example, the constant 6 (d0=0, d1=1, d2=1, d3=0) is generated when the control signals cont1=cont4=0 and cont2=cont3=1. A negative constant is obtained when ALU1 performs the NEG operation. ALU2 is used for index value updating.
[Logic diagram: Decoder Logic with select inputs sel0-sel2 and enable, and control lines cont1-cont4 setting the constant bits d0-d3]
Fig. 6. Constant logic
We will evaluate the AGU's performance for a case when the accelerator memory ACCM (see Fig. 2) uses a 2D organization and row-major ordering for storing the values of picture (image) elements (pixels).

[Access patterns over an image in x-y coordinates: a) Single steps; b) Linear scan; c) Video scan; d) Zig-zag scan]
Fig. 7. Access patterns

Let the capacity of ACCM be large enough to store at least three images, denoted as A, B and C, respectively. Each image has a resolution of n pixels in the horizontal and m lines in the vertical direction. To each pixel a value is appended which encodes its luminance and chrominance. We assume that the accelerator data-path (FUs in Fig. 2) performs the following data processing operation: C(i,j) = A(i,j) op B(i,j), where op is some arithmetical or logical instruction. The order of access to the pixels of pictures A, B and C depends on the application; some typical access patterns are presented in Fig. 7. The main task now is to evaluate the speedup, Sp, of the system when the AGU is used as a building block for address generation, compared with a solution in which address calculation is performed in software. In order to simplify the analysis, but without loss of generality, we assume that the duration of each assembly language machine cycle is equal to the AGU cycle. The obtained results are presented in Table II. As can be seen from Table II, for all access patterns Sp equals 2.5 when the access address pattern is described as a single nested loop, while Sp = 2.4 for doubly nested loops when the branch-not-taken condition is fulfilled. According to the obtained results, we can conclude that the implemented AGU is an efficient hardware building block, because it accelerates address pattern calculation by more than 240%.
TABLE II
SOFTWARE AND AGU ACCESS PATTERN IMPLEMENTATION

Access pattern | Soft. impl. (mach. cycles) | AGU (clock cycles) | Speedup
Fig. 7a), b)   | 9                          | 4                  | 2.5
Fig. 7c)       | 9                          | 4                  | 2.5
Fig. 7d)       | 9 (12)                     | 4 (5)              | 2.5 (2.4)
V. CONCLUSION
In the design of embedded systems (ESs), memory issues play a very important role and often significantly impact the ES's performance, power dissipation, and overall cost of implementation. The traditional CPU-memory gap widens and often becomes the dominant bottleneck in achieving high performance. Most current processors spend more time generating data addresses and accessing memory than performing the functional operations on data required by the application programs. This is mainly due to the inadequate support provided by conventional architectures for accessing the data types used in current applications and their inefficient handling of the "von Neumann bottleneck". Specialized programmable hardware address generation units are used to speed up address expression evaluation. These units benefit from the fact that compiler code optimization techniques which map HLL data constructs to the addressing unit of the architecture have lagged far behind. However, AGUs are costly in terms of area, since they include several arithmetical units, registers, and/or other combinatorial logic to provide enough programmability. In this paper, we have presented a structure of address generation hardware derived from the address sequence to be generated. One of the primary design goals was to devise efficient support for the decoupled access/execute architectural model in order to minimize the impact of memory accesses on processor performance. The structure is flexible, low-cost, and extensible to almost any sequence generation problem where the sequence exhibits some symmetry or regularity. The methodology used for address computation and generation was verified on real FPGA-based computer boards [8].
ACKNOWLEDGEMENTS
This work was supported by the Serbian Ministry of Science and Technological Development, Project No. TR-32009 – “Low-Power Reconfigurable Fault-Tolerant Platforms”.
REFERENCES
[1] Ze-Nian Li, Mark S. Drew, Fundamentals of Multimedia, Pearson Education, Inc., Upper Saddle River, NJ, 2004.
[2] Miguel A. Miranda et al., "High-level address optimization and synthesis techniques for data transfer-intensive applications", IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 6, No. 4, pp. 677-686, 1998.
[3] Guillermo Talavera, Murali Jayapala, Jordi Carrabina, Francky Catthoor, "Address generation optimization for embedded high-performance processors: a survey", J. Signal Processing Systems, Vol. 53, pp. 271-284, 2008.
[4] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 4-th ed., Morgan Kaufmann, San Francisco, 2007
[5] Hardware Acceleration, available 04, April 2011, http://en.wikipedia.org/wiki/Hardware_acceleration
[6] Michael Herz et al., "Memory Addressing Organization for Stream-based Reconfigurable Computing", ICECS 2002, The 9th IEEE International Conference on Electronics, Circuits and Systems, Vol. 2, pp. 813-817, Dubrovnik, 2002.
[7] Dake Liu, Embedded DSP Processor Design, Morgan Kaufmann, San Francisco 2008.
[8] XILINX, Embedded Development HW/SW Kit, Spartan-3A DSP S3D1800A, MicroBlaze Processor Edition.