CORDIC and SVD Implementation in Digital Hardware
Przemysław M. Szecówka, Piotr Malinowski*
Faculty of Microsystem Electronics and Photonics
Wrocław University of Technology, Wrocław, Poland
[email protected]
*) now with University School of Physical Education in Wrocław, Poland
Abstract: Singular Value Decomposition is classified among the most effective numerical methods of matrix inversion. The paper presents a study of hardware implementation of the SVD and CORDIC algorithms. Various digital architectures are proposed and compared, including low-cost sequential and high-performance pipelined solutions. Both fixed-point and floating-point arithmetic are considered. The concepts were implemented in VHDL, verified and synthesized with Xilinx tools. A selected approach was physically implemented and tested.
Index Terms: CORDIC, SVD, digital, hardware, VHDL, FPGA
I. INTRODUCTION
Processing of matrices, especially inversion, remains a key challenge for contemporary computing machines. Very smart algorithms were proposed many years ago by scientists who expected rapid development of digital hardware in the future. Many of those solutions were intended to work on futuristic parallel devices. CORDIC and Singular Value Decomposition (SVD) are good examples here [1-3]. Recent years have finally brought the long-expected rapid development of digital hardware and growth in the complexity of programmable logic devices. There is growing interest in the construction of dedicated digital hardware according to more or less classic concepts [4-7].
This paper describes a study of hardware implementation of Singular Value Decomposition of a matrix based on replicated CORDIC modules. The authors focus on comparison of architecture variants in the context of resource allocation, speed and accuracy. Similar works may be found in contemporary literature [8], showing growing interest in practical use of the achievements of the great mid-20th-century mathematicians.
II. CORDIC AND SVD OVERVIEW
The CORDIC algorithm (Coordinate Rotation Digital Computer) was proposed by Volder in 1959 [2]. Initially it was used to transform polar coordinates to rectangular ones and back. CORDIC was then extended to provide estimation of hyperbolic and exponential functions, calculation of square roots and other numeric applications. Nowadays it is extensively used in digital signal and data processing, e.g. DFT [7] and SVD [5]. It is thus a quite universal tool which may be applied in many variants and configurations. In general, CORDIC consists in iterative rotations of a vector by a predefined series of constant angles. The angles decrease in a special manner, forming the series 45°, 26.57°, 14.04°, 7.13°, 3.58° etc. (i.e. arctg 2^-i for i = 0, 1, 2, ...). Consecutive rotations are left or right depending on the target and the current result. Accuracy increases with the number of rotations n. This generic scheme may be applied in various modes, depending on needs. If the target is rotation by a defined angle, a series of rotations is performed. For 2-dimensional space, where the vector [x0, y0]^T is to be rotated by an angle z0, after n iterations the new coordinates are:
$$x_n = \frac{1}{K_n}\left(x_0 \cos z_0 - y_0 \sin z_0\right)$$
$$y_n = \frac{1}{K_n}\left(y_0 \cos z_0 + x_0 \sin z_0\right)$$
whilst the final rotation angle $z_n = 0$.
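The rotation mode described above can be sketched in a few lines of Python. This is a floating-point model for illustration only, not the authors' VHDL design. The function name `cordic_rotate`, the default of 16 iterations and the explicit definition of K_n as the product of cos(arctg 2^-i) are assumptions made for this sketch.

```python
import math

def cordic_rotate(x0, y0, z0, n=16):
    """Rotate (x0, y0) by z0 radians using n CORDIC micro-rotations.

    Returns (x_n, y_n, z_n, K_n). The outputs x_n and y_n still carry
    the gain 1/K_n; multiply them by K_n to obtain the rotated vector.
    """
    x, y, z = x0, y0, z0
    k = 1.0
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0          # rotate so that z moves toward 0
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
        z -= d * math.atan(2.0**-i)          # constant angle of step i
        k *= math.cos(math.atan(2.0**-i))    # assumed definition of K_n
    return x, y, z, k

# Usage: rotate the unit vector [1, 0] by 30 degrees.
x, y, z, k = cordic_rotate(1.0, 0.0, math.pi / 6)
print(x * k, y * k, z)  # approximately cos(30 deg), sin(30 deg), 0
```

The sketch converges only for |z0| up to the sum of all elementary angles (about 1.74 rad); larger angles would need an initial quadrant correction, which is omitted here.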
In vector mode CORDIC determines the angle between the vector [x0, y0]^T and the X axis. After a series of dummy iterative rotations the new coordinates would be
$$x_n = \frac{1}{K_n}\sqrt{x_0^2 + y_0^2}$$
$$y_n = 0$$
and
$$z_n = \operatorname{arctg}\frac{y_0}{x_0}.$$
The product of the algorithm in such a case, however, is the numerical value of $z_n$, determined by the cumulated sum of angles (+/- for left/right) applied in consecutive rotations.
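A comparable sketch for the vector mode is given below, again as a floating-point model under assumptions: the name `cordic_vector`, the iteration count and the restriction to x0 > 0 are illustrative choices, not taken from the paper.

```python
import math

def cordic_vector(x0, y0, n=16):
    """Drive y toward 0 and accumulate the applied angles in z.

    Assumes x0 > 0. Returns (x_n, y_n, z_n); x_n carries the gain 1/K_n.
    """
    x, y, z = x0, y0, 0.0
    for i in range(n):
        d = -1.0 if y >= 0 else 1.0          # choose the direction that shrinks |y|
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
        z -= d * math.atan(2.0**-i)          # cumulated sum of elementary angles
    return x, y, z

# Usage: the vector [3, 4] has length 5 and angle arctg(4/3).
x, y, z = cordic_vector(3.0, 4.0)
print(z, math.atan2(4.0, 3.0))
```

The rotations themselves are "dummy" in the sense of the text: the useful result is z_n, while x_n is a scaled vector length that is available as a by-product.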
MIXED DESIGN MIXDES 2010, 17th International Conference "Mixed Design of Integrated Circuits and Systems", June 24-26, 2010, Wrocław, Poland
Copyright 2010 by Department of Microelectronics & Computer Science, Technical University of Lodz
Singular Value Decomposition of a matrix consists in finding a series of singular values $\sigma_1, \sigma_2, \ldots, \sigma_l$ which simplify inversion of the matrix. For each matrix $M \in \mathbb{R}^{m,n}$ there exist orthogonal matrices $U \in \mathbb{R}^{m,m}$ and $V \in \mathbb{R}^{n,n}$ for which
$$U^T M V = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_l) \in \mathbb{R}^{m,n}$$
where l = min(m,n), and for r = rank(M) the diagonal values fulfill the conditions
$$\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r > 0$$
$$\sigma_{r+1} = \sigma_{r+2} = \ldots = \sigma_l = 0$$
The pseudo-inverse matrix $M^+$ may be determined by
$$M^+ = V \Sigma^+ U^T$$
where $\Sigma^+$ is the pseudo-inverse of the diagonal matrix, i.e. it is a diagonal matrix formed by the inverted (when non-zero) values of $\sigma_1, \sigma_2, \ldots, \sigma_l$. SVD is currently classified among the most efficient numerical methods of matrix inversion. SVD may be performed by the appropriate rotation of a matrix. For a basic 2x2 matrix
$$M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
the rotation angle is determined as the arctangent (arctg) of a ratio formed from the entries a, b, c and d.
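To make the idea of diagonalizing a 2x2 matrix by rotation concrete, the sketch below uses a two-sided rotation in the spirit of the schemes discussed in [4], [5]. The sum/difference form of the angles, the counter-clockwise sign convention and all function names are assumptions chosen for this illustration; the exact angle expression used in the presented hardware may differ. Library `atan2` calls stand in for the CORDIC vector mode.

```python
import math

def rot(t):
    """Counter-clockwise 2x2 rotation matrix (assumed convention)."""
    return [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]

def matmul(p, q):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def svd2x2(a, b, c, d):
    """Return (theta_l, theta_r, s) with s = rot(theta_l)^T * M * rot(theta_r).

    s is diagonal; its entries are the singular values up to sign.
    """
    t_sum = math.atan2(c + b, a - d)     # theta_l + theta_r
    t_diff = math.atan2(c - b, a + d)    # theta_l - theta_r
    tl, tr = (t_sum + t_diff) / 2, (t_sum - t_diff) / 2
    m = [[a, b], [c, d]]
    s = matmul(matmul(rot(-tl), m), rot(tr))   # rot(-t) is the transpose of rot(t)
    return tl, tr, s

# Usage: diagonalize M = [[1, 2], [3, 4]].
tl, tr, s = svd2x2(1.0, 2.0, 3.0, 4.0)
print(abs(s[0][0]), abs(s[1][1]))  # singular values of M
```

Two arctangent evaluations followed by rotations correspond to the double use of CORDIC: once to find the angles, once to rotate.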
This operation may be done by double use of CORDIC in two modes. First the appropriate angles are determined and then the rotations are performed. Due to the properties of CORDIC, the iterations may be described by combinations of additions/subtractions and bit shifts:
$$x_{i+1} = x_i - \delta_i \cdot \mathrm{SHIFT}_i(y_i)$$
$$y_{i+1} = y_i + \delta_i \cdot \mathrm{SHIFT}_i(x_i)$$
where $\delta_i = \pm 1$ denotes left or right rotation and $\mathrm{SHIFT}_i$ is a shift by i bits. Consequently, a hardware implementation of CORDIC consists of adders, subtractors and multiplexers.
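The shift-and-add form of the iteration can be modelled with integers only, which is closer to what adders, subtractors and multiplexers do in hardware. The sketch below is an assumption-laden illustration: the 16 fractional bits, the table name `ANGLES` and the function name `cordic_rotate_fixed` are hypothetical, and the word length does not match the 8-bit or 25-bit designs of the paper.

```python
import math

FRAC = 16  # assumed number of fractional bits (illustrative only)
# Elementary angles arctg(2^-i), pre-computed and scaled by 2**FRAC.
ANGLES = [round(math.atan(2.0**-i) * (1 << FRAC)) for i in range(16)]

def cordic_rotate_fixed(x, y, z):
    """Rotation mode on integers scaled by 2**FRAC.

    Only additions, subtractions and arithmetic right shifts are used;
    Python's >> on negative integers behaves like a two's-complement shift.
    """
    for i, a in enumerate(ANGLES):
        if z >= 0:   # rotate one way
            x, y, z = x - (y >> i), y + (x >> i), z - a
        else:        # rotate the other way
            x, y, z = x + (y >> i), y - (x >> i), z + a
    return x, y, z

# Usage: pre-scale x by K (about 0.60725) so the outputs are cos and sin directly.
x0 = round(0.6072529350 * (1 << FRAC))
z0 = round(math.pi / 6 * (1 << FRAC))
x, y, z = cordic_rotate_fixed(x0, 0, z0)
print(x / (1 << FRAC), y / (1 << FRAC))  # close to cos(30 deg) and sin(30 deg)
```

Pre-scaling the input by K is one common way to absorb the gain; the paper instead mentions dedicated scale-correction stages in the SVD unit.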
Figure 1. CORDIC - sequential architecture
III. CORDIC ARCHITECTURE
Two variants of CORDIC architecture are presented in Fig. 1 and 2. Both solutions are fully synchronous with a single clock. In the first, sequential approach the arithmetic modules are shared by all iterations. Intermediate results are fed back via the registers and the appropriate angles are delivered to the arithmetic units by the multiplexers. Control is provided by an iteration counter. The other concept is the pipelined architecture presented in Fig. 2. The schematic shows hardware providing 3 consecutive iterations. Arithmetic blocks are replicated for each iteration, so the data flow forms a pipeline. This solution provides much higher throughput but needs more hardware resources. On the other hand, its control circuitry is simpler, leading to some savings and a much higher available clock frequency. The two concepts were implemented in VHDL [9], verified and synthesized with Xilinx ISE [10] tools for a Virtex-5 programmable device. The arithmetic is fixed point with 8-bit numbers coded in two's complement. The synthesis results summarized in Table I clearly show the difference between the low-cost and high-speed approaches.
TABLE I. SYNTHESIS RESULTS FOR 2 VARIANTS OF CORDIC ARCHITECTURES
                            Sequential          Pipelined
Number of Slice Registers   56                  208
Number of Slice LUTs        151                 243
Clock frequency             257 MHz             428 MHz
Levels of Logic             10                  2
Delay                       3.891 ns            2.336 ns
Delay on Logic              1.612 ns (41.4%)    0.659 ns (28.2%)
Delay on Route              2.279 ns (58.6%)    1.677 ns (71.8%)
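The clock frequencies in Table I follow directly from the reported delays, and the throughput gap between the two variants can be estimated from them. The short calculation below is a sketch: the function names are hypothetical and the figure of 8 iterations per result for the sequential variant is an assumption (one iteration per bit of the 8-bit word), since the iteration count is not stated.

```python
def fmax_mhz(delay_ns):
    """Maximum clock frequency in MHz implied by a critical-path delay in ns."""
    return 1000.0 / delay_ns

def results_per_second(f_mhz, cycles_per_result):
    """Sustained throughput for a given clock and cycles needed per result."""
    return f_mhz * 1e6 / cycles_per_result

seq_f = fmax_mhz(3.891)    # about 257 MHz, as in Table I
pipe_f = fmax_mhz(2.336)   # about 428 MHz, as in Table I

N_ITER = 8  # ASSUMPTION: iterations per result in the sequential variant
seq_rate = results_per_second(seq_f, N_ITER)   # one result every N_ITER cycles
pipe_rate = results_per_second(pipe_f, 1)      # one result per cycle once the pipeline is full

print(round(seq_f), round(pipe_f), pipe_rate / seq_rate)
```

Under this assumption the pipelined variant delivers roughly an order of magnitude more results per second, at the cost of about 3.7 times as many registers and 1.6 times as many LUTs.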
Figure 2. CORDIC pipelined architecture.
IV. SVD ARCHITECTURE
The general concept of an SVD architecture based on CORDIC modules is presented in Fig. 3. The input is a basic 2x2 matrix. The primary outputs are two singular values; the secondary outputs are the rotation angles. This module, either replicated or reused, may be applied in the construction of dedicated devices working with bigger matrices. A detailed schematic of the vector rotation block is presented in Fig. 4. It is a synchronous machine based on a single CORDIC element reused for consecutive iterations. The CORDIC output is fed back to the input via the register until the final value is obtained and latched. The rotation angle is delivered by the module shown in Fig. 5. Its arithmetic block is again reused for consecutive iterations, thus the output is fed back. The appropriate angles for elementary rotations are stored in a memory. Control of data flow in these two modules is provided by a Finite State Machine working together with an iteration counter. The schematic of the FSM is presented in Fig. 6. The initial neutral state is wait. Activation of the strobe signal forces calculation of the angle and then the following steps of processing.
[Figure 3: block diagram of the 2x2 SVD unit, with matrix entries a, b, c, d as inputs, two CORDIC blocks and four SHIFT-SUM blocks, and two singular values and two rotation angles as outputs.]
Figure 3. Basic SVD architecture composed of CORDIC blocks
After transition to each state the iteration counter is activated and counts to a predefined value. When the appropriate number of iterations is reached, the FSM moves to the next state. The two final stages are used to correct the scale of the output values, which is disturbed during the iterative approximations. In general the machine cycles through all the states, with one exception: on request, new processing may start immediately with the wait state skipped.
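The control scheme described above can be modelled as a small state machine driven by an iteration counter. The sketch below is hypothetical: the state names other than wait, the use of the same count (24) in every state and the function name `step` are assumptions for illustration, and the actual FSM of Fig. 6 may be organized differently.

```python
from enum import Enum, auto

class State(Enum):
    WAIT = auto()     # initial neutral state
    ANGLE = auto()    # calculation of the rotation angle (name assumed)
    ROTATE = auto()   # vector rotation (name assumed)
    SCALE1 = auto()   # first scale-correction stage (name assumed)
    SCALE2 = auto()   # second scale-correction stage (name assumed)

ORDER = [State.WAIT, State.ANGLE, State.ROTATE, State.SCALE1, State.SCALE2]

def step(state, count, strobe, n_iter=24):
    """One clock tick: return (next_state, next_count)."""
    if state is State.WAIT:
        return (State.ANGLE, 0) if strobe else (State.WAIT, 0)
    if count + 1 < n_iter:               # iteration counter still running
        return state, count + 1
    if state is State.SCALE2:            # last stage finished
        # On request (strobe active) the wait state is skipped.
        return (State.ANGLE, 0) if strobe else (State.WAIT, 0)
    return ORDER[ORDER.index(state) + 1], 0

# Usage: one full pass through all states after a single strobe.
state, count = step(State.WAIT, 0, strobe=True)
ticks = 0
while state is not State.WAIT:
    state, count = step(state, count, strobe=False)
    ticks += 1
print(ticks)  # number of clock ticks spent outside the wait state
```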
[Figure 4: schematic of the vector rotation block; blocks: CORDIC and two shiftsum units; signal labels: x1, y1, y2, c, d, clk, nreset, enable, iteration, FSM state, Out 1, Out 2.]
Figure 4. SVD architecture - vector rotation block.
[Figure 5: schematic of the rotation angle module; blocks: ROM holding rotation_angle1 to rotation_angle24; signal labels: z1, Z2, di, angle, angle L, angle R, zero, clk, nreset, enable, iteration, FSM state.]
Figure 5. SVD architecture - calculation of rotation angle.
For this part of the study two number formats and kinds of arithmetic were applied. In the first approach, floating-point numbers compatible with the IEEE 754 standard [11] were used. In this format the bit vector consists of a sign bit, an 8-bit exponent coded in two's complement and a 23-bit (non-negative) significand. The other approach was fixed-point arithmetic with 25-bit vectors coded in two's complement. For the constant angles a specific format was chosen: fixed point with 2 bits reserved for the integral part and the rest left for the fraction (the possible angle values, when expressed in radians, do not exceed 2). The CORDIC module described in the previous section was redesigned twice for these two formats.
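The fixed-point angle format can be illustrated by generating a table of elementary angles in a 25-bit word with 23 fractional bits. This is a sketch under assumptions: it takes the sign bit to be one of the 2 integral bits (range from -2 up to, but not including, 2), assumes 24 table entries as suggested by Fig. 5, and uses hypothetical helper names `to_fixed` and `from_fixed`.

```python
import math

WIDTH, FRAC = 25, 23   # ASSUMPTION: 2 integral bits (including sign) + 23 fractional bits

def to_fixed(value):
    """Quantize a real value to a WIDTH-bit two's-complement bit pattern."""
    q = round(value * (1 << FRAC))
    if not -(1 << (WIDTH - 1)) <= q < (1 << (WIDTH - 1)):
        raise ValueError("value does not fit the fixed-point range")
    return q & ((1 << WIDTH) - 1)

def from_fixed(bits):
    """Interpret a WIDTH-bit two's-complement pattern as a real value."""
    if bits & (1 << (WIDTH - 1)):   # sign bit set: negative number
        bits -= 1 << WIDTH
    return bits / (1 << FRAC)

# Table of elementary rotation angles arctg(2^-i) in radians, i = 0..23.
ROM = [to_fixed(math.atan(2.0**-i)) for i in range(24)]

print(hex(ROM[0]), from_fixed(ROM[0]))  # first entry is pi/4 in fixed point
```

With 23 fractional bits the last table entry is a single least significant bit, so further iterations would add no angular resolution in this format.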
Figure 6. Finite State Machine controlling SVD
The SVD architecture with 2 variants of arithmetic was implemented in VHDL and synthesized for a Xilinx Virtex-5 device. The synthesis results are summarized in Table II. In terms of resource allocation, there is no big difference in the number of registers (337 vs. 314). On the other hand, the floating-point variant consumes much more combinatorial logic (4648 vs. 2609 slice LUTs). There is a huge difference in maximum clock speed: 148 MHz for the fixed-point version and only 35 MHz for the floating-point approach. Arithmetic operations on floating-point numbers require long chains of combinatorial logic (74 vs. 35 levels of logic), which need more time to transfer a signal from one register to another.
TABLE II. SYNTHESIS RESULTS FOR 2 VARIANTS OF SVD ARCHITECTURE
                            32-bit IEEE floating point   25-bit fixed point
Clock frequency             35 MHz                       148 MHz
Levels of Logic             74                           35
Delay                       28.602 ns                    6.738 ns
Number of Slice Registers   337 (1%)                     314 (1%)
Number of Slice LUTs        4648 (14%)                   2609 (7%)
The two code variants were simulated in Xilinx ISE environment for several sample matrices. The results were sent to a file, converted and compared with the ones given by SVD algorithm run in Octave environment. Fig. 7 shows two plots of relative errors obtained for two architectures. It is visible that fixed point architecture delivers substantially better results.
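The comparison against a reference SVD can be expressed as a relative error per singular value. The sketch below only illustrates the metric; it does not reproduce the data of the paper. The function names, the reference value and the use of simple rounding to 23 fractional bits as a stand-in for a hardware result are all assumptions.

```python
def relative_error(sigma_hw, sigma_ref):
    """Relative error of a computed singular value against a reference value."""
    return abs(sigma_hw - sigma_ref) / abs(sigma_ref)

def quantize(value, frac_bits=23):
    """Round a value to a fixed-point grid with frac_bits fractional bits."""
    return round(value * (1 << frac_bits)) / (1 << frac_bits)

# Usage: sigma_ref is the largest singular value of [[1, 2], [3, 4]] (rounded),
# and the "hardware" result is emulated by quantization alone.
sigma_ref = 5.4649857
err = relative_error(quantize(sigma_ref), sigma_ref)
print(err)
```

Quantization alone gives a lower bound on the error of a fixed-point design; errors accumulated over the CORDIC iterations come on top of it.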
[Figure 7: two plots of the relative error of the first singular value; horizontal axis labelled σ1, logarithmic, from 10^-23 to 10^37; vertical axis: relative error, from 0 to 1.4x10^-6.]
Figure 7. Relative error of singular value determination for two kinds of arithmetic: 25-bit fixed point (lower plot) and 32-bit floating point (upper plot).
V. CONCLUSIONS
A comprehensive study of digital hardware dedicated to Singular Value Decomposition was performed. The motivation was the authors' interest in the construction of specialized computing machines performing operations on matrices in a highly parallel way. Significant effort was devoted to the CORDIC algorithm, which was used for SVD but may be treated as a separate issue as well. The results lead to the conclusion that contemporary FPGAs are very close to enabling construction of machines dealing with huge computational complexity.
The presented results, limited to small matrices, are a good basis for further work, and at this stage already deliver reasonable comparative material about architecture and arithmetic variants. In this context the results obtained for fixed and floating point are very interesting. As expected, the fixed-point approach provides higher processing speed and lower allocation of logic resources. A surprising result was the higher precision obtained with fixed point. It should be noted, however, that the 25-bit vectors were selected after very careful consideration and estimation.
Further research will focus on construction of devices dealing with matrices of higher dimension, perhaps with processing decomposed into basic 2x2 elements, so that the described modules may be used without any redesign. An advantage of this approach is the chance to develop a methodology of processing matrices of unlimited dimension with a limited number of basic SVD/CORDIC units. That would enable optimal utilization of currently available resources, at least partially independent of input complexity.
REFERENCES
[1] C. Eckart, G. Young, The approximation of one matrix by another of lower rank, Psychometrika, vol. 1, no. 3, 1936.
[2] J.E. Volder, The CORDIC Trigonometric Computing Technique, IRE Transactions on Electronic Computers, 1959.
[3] G. Golub, W. Kahan, Calculating the singular values and pseudo-inverse of a matrix, J. SIAM Numerical Analysis, Ser. B, vol. 2, no. 2, 1965, pp. 205-224.
[4] R.P. Brent, F.T. Luk, C.F. Van Loan, Computation of the singular value decomposition using mesh-connected processors, Journal for VLSI Computer Systems, vol. 1, no. 3, 1985, pp. 243-270.
[5] J.R. Cavallaro, F.T. Luk, CORDIC Arithmetic for a SVD Processor, Journal for Parallel and Distributed Computing, vol. 5, 1988, pp. 271-290.
[6] R. Andraka, A Survey of CORDIC Algorithms for FPGA based computers, in FPGA '98: Proc. of sixth international symposium on Field programmable gate arrays ACM/SIGDA, 1998, pp. 191-200.
[7] F. Deprettere (ed.), SVD and signal processing. Algorithms, applications and architectures, Department of Electrical Engineering, Delft University of Technology, Elsevier Science Publishers B.V., Amsterdam, 1988.
[8] H. Wang, P. Leray, J. Palicot, A CORDIC-based dynamically reconfigurable FPGA architecture for signal processing algorithms, URSI 08, The XXIX General Assembly of the International Union of Radio Science, Chicago IL, 2008.
[9] VHDL, IEEE Std No. 1076, 2000.
[10] Xilinx ISE Web Pack, www.xilinx.com, 2009.
[11] Floating-point arithmetic, IEEE Std No. 754, 2008.