
CORDIC and SVD Implementation in Digital Hardware


  • CORDIC and SVD Implementation in Digital Hardware

    Przemysław M. Szecówka, Piotr Malinowski*

    Faculty of Microsystem Electronics and Photonics, Wrocław University of Technology, Wrocław, Poland

    [email protected] *) now with University School of Physical Education in Wrocław, Poland

    Abstract: Singular Value Decomposition is classified among the most effective numerical methods of matrix inversion. The paper presents a study of hardware implementations of the SVD and CORDIC algorithms. Various digital architectures are proposed and compared, including low-cost sequential and high-performance pipelined solutions. Both fixed point and floating point arithmetic are considered. The concepts were implemented in VHDL, then verified and synthesized with Xilinx tools. A selected approach was physically implemented and tested.

    Index Terms: CORDIC, SVD, digital, hardware, VHDL, FPGA

    I. INTRODUCTION

    Processing of matrices, especially inversion, remains a key challenge for contemporary computing machines. Very smart algorithms were proposed many years ago by scientists who expected rapid development of digital hardware in the future. Many of those solutions were intended to run on futuristic parallel devices. CORDIC and Singular Value Decomposition (SVD) are good examples here [1-3]. Recent years have finally brought the long-expected rapid development of digital hardware and growth in the complexity of programmable logic devices. There is growing interest in the construction of dedicated digital hardware based on more or less classic concepts [4-7].

    This paper describes a study of hardware implementation of Singular Value Decomposition of a matrix, based on replicated CORDIC modules. The authors focus on the comparison of architecture variants in terms of resource allocation, speed and accuracy. Similar works may be found in contemporary literature [8], showing growing interest in the practical use of the achievements of the great mathematicians of the mid-20th century.

    II. CORDIC AND SVD OVERVIEW

    The CORDIC algorithm (Coordinate Rotation Digital Computer) was proposed by Volder in 1959 [2]. Initially it was used to transform polar coordinates to rectangular ones and back. Then CORDIC was extended to estimate hyperbolic and exponential functions, to calculate square roots and to serve other numeric applications. Nowadays it is extensively used in digital signal and data processing, for example in DFT [7] and SVD [5]. In other words, it is a quite universal tool which may be applied in many variants and configurations.

    In general CORDIC consists in iterative rotations of a vector by a predefined series of constant angles. The angles decrease in a special manner, as arctg(2^-i), forming the series 45°, 26.57°, 14.04°, 7.13°, 3.58° etc. Each rotation is performed to the left or to the right, depending on the difference between the target and the current result. Accuracy increases with the number of rotations n. This generic scheme may be applied in various modes, depending on needs. If the target is a rotation by a defined angle, a series of rotations is performed. For 2-dimensional space, where the $[x_0, y_0]^T$ vector is to be rotated by an angle of $z_0$, after n iterations the new coordinates are:

    $$x_n = \frac{1}{K_n}\left(x_0 \cos z_0 - y_0 \sin z_0\right)$$

    $$y_n = \frac{1}{K_n}\left(y_0 \cos z_0 + x_0 \sin z_0\right)$$

    whilst the final rotation angle $z_n = 0$.
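    The rotation-mode result can be checked with a short floating point model. The Python sketch below is an illustration only, not the authors' VHDL design: the function names, the default of 16 iterations and the use of floating point are assumptions made here. It uses the standard CORDIC angle series arctg(2^-i) and the gain $K_n = \prod_i 1/\sqrt{1 + 2^{-2i}}$.

```python
import math

def cordic_rotate(x0, y0, z0, n=16):
    """Rotation-mode CORDIC: rotate (x0, y0) by z0 radians in n micro-rotations.

    The result is NOT yet corrected for the gain 1/K_n.
    """
    x, y, z = x0, y0, z0
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0            # direction that drives z toward 0
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
        z -= d * math.atan(2.0**-i)            # constant angle of step i
    return x, y, z

def scale_k(n):
    """K_n = product of 1/sqrt(1 + 2^(-2i)) for i = 0 .. n-1."""
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 2.0**(-2 * i))
    return k

# Rotate the unit vector on the X axis by 30 degrees, then remove the gain.
x, y, z = cordic_rotate(1.0, 0.0, math.radians(30.0))
k = scale_k(16)
print(x * k, y * k, z)   # close to cos(30 deg), sin(30 deg), 0
```

    The input angle has to stay within the convergence range of the angle series (roughly ±99.9° for this series); larger angles need a preliminary quadrant correction, which is left out of this sketch.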

    In vector mode CORDIC determines the angle between the $[x_0, y_0]^T$ vector and the X axis. After a series of auxiliary iterative rotations that bring the vector onto the X axis, the new coordinates would be

    $$x_n = \frac{1}{K_n}\sqrt{x_0^2 + y_0^2}, \qquad y_n = 0$$

    and $z_n = \operatorname{arctg}\dfrac{y_0}{x_0}$. In this case, however, the product of the algorithm is the numerical value of $z_n$, determined by the cumulated sum of the angles (+/- for left/right) applied in the consecutive rotations.
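    Vector mode differs from rotation mode only in the rule that selects the direction of each step. A minimal Python sketch, assuming a vector in the right half-plane (x0 > 0), a start value z = 0 and hypothetical names chosen here for illustration:

```python
import math

def cordic_vector(x0, y0, n=16):
    """Vector-mode CORDIC: drive y toward 0 and accumulate the applied angles in z."""
    x, y, z = x0, y0, 0.0
    for i in range(n):
        d = -1.0 if y >= 0 else 1.0            # rotate toward the X axis
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
        z -= d * math.atan(2.0**-i)            # cumulated sum of signed angles
    return x, y, z

n = 16
# Gain 1/K_n, written as a product of sqrt(1 + 2^(-2i)) = sqrt(1 + 4^(-i)).
gain = math.prod(math.sqrt(1.0 + 4.0**-i) for i in range(n))

x, y, z = cordic_vector(3.0, 4.0, n)
print(x / gain, y, z)   # close to the length 5, 0 and arctg(4/3)
```

    Besides the angle, the x output delivers the vector length scaled by 1/K_n, which is why a scale correction step is needed whenever the magnitude is used.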

    Singular Value Decomposition of a matrix consists in finding a series of singular values $\sigma_1, \sigma_2, \ldots, \sigma_l$ which simplify inversion of the matrix.

    MIXED DESIGN MIXDES 2010, 17th International Conference "Mixed Design of Integrated Circuits and Systems", June 24-26, 2010, Wrocław, Poland

    Copyright 2010 by Department of Microelectronics & Computer Science, Technical University of Lodz

  • For each matrix $M \in \mathbb{R}^{m,n}$ there exist orthogonal matrices $U \in \mathbb{R}^{m,m}$ and $V \in \mathbb{R}^{n,n}$, for which

    $$U^T M V = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_l) \in \mathbb{R}^{m,n}$$

    where l = min(m,n), and for r = rank(M) the diagonal values fulfill the conditions

    $$\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r > 0$$

    $$\sigma_{r+1} = \sigma_{r+2} = \ldots = \sigma_l = 0$$

    A pseudo-inverse matrix $M^+$ may be determined by

    $$M^+ = V \Sigma^+ U^T$$

    where $\Sigma^+$ is the pseudo-inverse of the diagonal matrix, i.e. it is a diagonal matrix formed by the inverted (when non-zero) values of $\sigma_1, \sigma_2, \ldots, \sigma_l$. SVD is currently classified among the most efficient numerical methods of matrix inversion.

    SVD may be performed by the appropriate rotation of a matrix. For a basic 2x2 matrix

    $$M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

    the rotation angle is obtained as the arctangent (arctg) of a ratio formed from the entries $a$, $d$ and $b$, $c$.

    This operation may be done by double use of CORDIC in two modes. First the appropriate angles are determined and then the rotations are performed. Due to the properties of CORDIC, the iterations may be described by combinations of additions, subtractions and bit shifts:

    $$x_{i+1} = x_i - \delta_i \cdot \mathrm{SHIFT}_i(y_i)$$

    $$y_{i+1} = y_i + \delta_i \cdot \mathrm{SHIFT}_i(x_i)$$

    where $\delta_i = \pm 1$ selects the left or right direction of the i-th rotation and $\mathrm{SHIFT}_i$ denotes a shift by i bits, i.e. multiplication by $2^{-i}$. Consequently, a hardware implementation of CORDIC consists of adders, subtractors and multiplexers.
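    The shift-and-add form of the iteration can be modelled with plain integers. The Python sketch below is an illustration under assumptions made here, not the authors' design: a fixed point format with 14 fractional bits, 14 iterations, and an angle table that plays the role of the constant angles delivered to the arithmetic units. Python's `>>` on negative integers behaves like an arithmetic shift of a two's complement number.

```python
import math

FRAC_BITS = 14                 # hypothetical format: 14 fractional bits
N_ITER = 14                    # hypothetical number of iterations
ONE = 1 << FRAC_BITS           # fixed point representation of 1.0

# Constant angles arctg(2^-i) in fixed point radians, as a ROM would hold them.
ANGLES = [round(math.atan(2.0**-i) * ONE) for i in range(N_ITER)]

# K_n in fixed point; pre-scaling the input removes the CORDIC gain.
K_FIXED = round(ONE / math.prod(math.sqrt(1.0 + 4.0**-i) for i in range(N_ITER)))

def cordic_rotate_fixed(x, y, z):
    """Rotation mode using only additions, subtractions and right shifts."""
    for i in range(N_ITER):
        if z >= 0:             # delta_i = +1
            x, y, z = x - (y >> i), y + (x >> i), z - ANGLES[i]
        else:                  # delta_i = -1
            x, y, z = x + (y >> i), y - (x >> i), z + ANGLES[i]
    return x, y, z

angle = 0.5                                       # radians
x, y, _ = cordic_rotate_fixed(K_FIXED, 0, round(angle * ONE))
print(x / ONE, y / ONE)                           # close to cos(0.5), sin(0.5)
```

    Because every shift truncates, the error grows slowly with the number of iterations; this is one reason why the word length has to be chosen together with the iteration count.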

    Figure 1. CORDIC - sequential architecture


  • III. CORDIC ARCHITECTURE

    Two variants of CORDIC architecture are presented in Fig. 1 and 2. Both solutions are fully synchronous with a single clock. In the first, sequential approach the arithmetic modules are shared by all iterations. Intermediate results are fed back via registers and the appropriate angles are delivered to the arithmetic units by multiplexers. Control is provided by an iteration counter. The other concept is the pipelined architecture presented in Fig. 2. The schematic shows hardware providing 3 consecutive iterations. Arithmetic blocks are replicated for each iteration, so the data flow may form a pipeline. This solution provides much higher throughput but needs more hardware resources. On the other hand, its control circuitry is simpler, which gives some savings and allows a much higher clock frequency. The two concepts were implemented in VHDL [9], verified and synthesized with Xilinx ISE [10] tools for a Virtex-5 programmable device. The arithmetic is fixed point with 8-bit numbers coded in two's complement. The synthesis results summarized in Table I clearly show the difference between the low-cost and the high-speed approach.

    TABLE I. SYNTHESIS RESULTS FOR 2 VARIANTS OF CORDIC ARCHITECTURES

    Parameter                   Sequential          Pipelined
    Number of Slice Registers   56                  208
    Number of Slice LUTs        151                 243
    Clock frequency             257 MHz             428 MHz
    Levels of Logic             10                  2
    Delay                       3.891 ns            2.336 ns
    Delay on Logic              1.612 ns (41.4%)    0.659 ns (28.2%)
    Delay on Route              2.279 ns (58.6%)    1.677 ns (71.8%)
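    The clock frequencies in Table I do not show the whole throughput difference, because the sequential unit needs several clock cycles per result while a filled pipeline delivers one result per cycle. The rough Python estimate below assumes 8 iterations per result for the sequential variant; this number is an assumption made here and is not stated in the text, and control overhead is ignored.

```python
def results_per_second(clock_mhz, cycles_per_result):
    """Throughput of a unit that needs the given number of clock cycles per result."""
    return clock_mhz * 1e6 / cycles_per_result

N_ITER = 8   # assumed iteration count for the sequential variant (hypothetical)

seq = results_per_second(257.0, N_ITER)   # arithmetic reused N_ITER times per result
pipe = results_per_second(428.0, 1)       # one result per cycle once the pipeline is full

print(f"sequential: {seq / 1e6:.1f} M results/s")
print(f"pipelined:  {pipe / 1e6:.1f} M results/s")
print(f"speed-up:   {pipe / seq:.1f}x")
```

    Under this assumption the pipelined variant is faster by roughly the iteration count times the ratio of the clock frequencies, at the cost of about 4 times more registers and 1.6 times more LUTs according to Table I.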

    Figure 2. CORDIC pipelined architecture.

    IV. SVD ARCHITECTURE

    The general concept of the SVD architecture based on CORDIC modules is presented in Fig. 3. The input is a basic 2x2 matrix. The primary outputs are two singular values; the secondary outputs are the rotation angles. This module, either replicated or reused, may serve as a building block of dedicated devices working with bigger matrices. A detailed schematic of the vector rotation block is presented in Fig. 4. It is a synchronous machine based on a single CORDIC element reused for consecutive iterations. The CORDIC output is fed back to the input via a register until the final value is obtained and latched. The rotation angle is delivered by the module shown in Fig. 5. Here the arithmetic block is again reused for consecutive iterations, so the output is fed back. The appropriate angles for the elementary rotations are stored in a memory. Control of the data flow in these two modules is provided by a Finite State Machine working together with an iteration counter. The schematic of the FSM is presented in Fig. 6. The initial neutral state is wait. Activation of the strobe signal forces calculation of the angle and then the following steps of processing.

    Figure 3. Basic SVD architecture composed of CORDIC blocks. [Block diagram: an SVD 2x2 module with inputs a, b, c, d, built from two CORDIC blocks and four SHIFT-SUM blocks.]
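    The data flow of Fig. 3 can be mimicked numerically. The Python sketch below uses a well-known closed form for the 2x2 case, in which sums and differences of the entries are halved and then passed through two arctangent/magnitude computations (the job of a vector-mode CORDIC). It is offered as a plausible reading of the architecture, not as the authors' exact formulation; all function names are hypothetical. The decomposition has the form M = R(phi) · diag(s1, s2) · R(theta) with plane rotations R, and s2 is left signed (it is negative when det M < 0). The pseudo-inverse then follows the formula $M^+ = V \Sigma^+ U^T$ from Section II.

```python
import math

def svd2x2(a, b, c, d):
    """Return (phi, s1, s2, theta) such that M = R(phi) * diag(s1, s2) * R(theta)."""
    e, f = (a + d) / 2.0, (a - d) / 2.0           # shift-sum: add/subtract and halve
    g, h = (c + b) / 2.0, (c - b) / 2.0
    q, a2 = math.hypot(e, h), math.atan2(h, e)    # first vector-mode computation
    r, a1 = math.hypot(f, g), math.atan2(g, f)    # second vector-mode computation
    return (a2 + a1) / 2.0, q + r, q - r, (a2 - a1) / 2.0

def rot(t):
    """Plane rotation matrix R(t)."""
    return [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]

def matmul(p, q):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def pinv2x2(a, b, c, d, tol=1e-12):
    """Pseudo-inverse M+ = V * Sigma+ * U^T with U = R(phi), V = R(-theta)."""
    phi, s1, s2, theta = svd2x2(a, b, c, d)
    sigma_plus = [[1.0 / s1 if abs(s1) > tol else 0.0, 0.0],
                  [0.0, 1.0 / s2 if abs(s2) > tol else 0.0]]
    return matmul(matmul(rot(-theta), sigma_plus), rot(-phi))

phi, s1, s2, theta = svd2x2(1.0, 2.0, 3.0, 4.0)
print(abs(s1), abs(s2))                 # singular values of [[1, 2], [3, 4]]
print(pinv2x2(1.0, 2.0, 3.0, 4.0))      # equals the ordinary inverse here
```

    In hardware the two `hypot`/`atan2` pairs would come from vector-mode CORDIC units, which also explains why the outputs carry the CORDIC gain and need a final scale correction.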


  • After transition to each state the iteration counter is activated and counts up to a predefined value. When the appropriate number of iterations is reached, the FSM moves to the next state. The two final stages are used to correct the scale of the output values, which is disturbed during the iterative approximations. In general the machine cycles through all the states, with one exception: on request, new processing may start immediately, with the wait state skipped.
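    The control scheme described above can be sketched as a small behavioural model. Apart from wait, the state names below are hypothetical, and the model assumes that every state lasts the same number of clock cycles; the text gives neither the actual state names nor the per-state counts.

```python
from enum import Enum

class State(Enum):
    WAIT = 0       # neutral state named in the text
    ANGLE = 1      # hypothetical name: calculation of the rotation angle
    ROTATE = 2     # hypothetical name: vector rotation
    SCALE_1 = 3    # hypothetical names: the two scale correction stages
    SCALE_2 = 4

ORDER = [State.ANGLE, State.ROTATE, State.SCALE_1, State.SCALE_2]

class SvdFsm:
    def __init__(self, n_iter=24):
        self.n_iter = n_iter           # assumed: same count in every state
        self.state = State.WAIT
        self.count = 0

    def clock(self, strobe=False):
        """Advance by one clock cycle and return the resulting state."""
        if self.state is State.WAIT:
            if strobe:                 # strobe starts the angle calculation
                self.state, self.count = State.ANGLE, 0
            return self.state
        self.count += 1                # iteration counter
        if self.count < self.n_iter:
            return self.state
        self.count = 0
        if self.state is State.SCALE_2:
            # Last stage finished: restart at once on request, otherwise idle.
            self.state = State.ANGLE if strobe else State.WAIT
        else:
            self.state = ORDER[ORDER.index(self.state) + 1]
        return self.state

fsm = SvdFsm(n_iter=3)
fsm.clock(strobe=True)
print([fsm.clock().name for _ in range(12)])
```

    Keeping the counter separate from the state register, as in this model, lets the same counter value address the angle memory in every state.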

    Figure 4. SVD architecture - vector rotation block. [Schematic: registers x1, y1 and y2, a CORDIC block and two shiftsum blocks, all driven by clk, nreset and enable; inputs c and d; outputs Out 1 and Out 2; selection signals: iteration, FSM state.]

    Figure 5. SVD architecture - calculation of rotation angle. [Schematic: a ROM holding the constants rotation_angle1 to rotation_angle24; registers z1 and Z2 driven by clk, nreset and enable; signals di, angle, angle L, angle R and zero; selection signals: iteration, FSM state.]

    For this part of the study two kinds of number formats and arithmetic were applied. In the first approach floating point numbers compatible with the IEEE 754 standard [11] were used. In this format the bit vector consists of a sign bit, an 8-bit exponent coded in two's complement and a 23-bit (non-negative) significand. The other approach was fixed point arithmetic with 25-bit vectors coded in two's complement.

  • For the constant angles a specific format was chosen: fixed point with 2 bits reserved for the integral part and the rest left for the fraction (the possible angle values, when expressed in radians, do not exceed 2). The CORDIC module described in the previous section was redesigned twice, once for each of these two formats.
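    The angle format can be illustrated with a short conversion routine. The Python sketch below assumes that the 2 integral bits include the sign bit, which leaves 23 fractional bits in a 25-bit word and a representable range of [-2, 2); this split is an interpretation made here, and the function names are hypothetical.

```python
import math

WIDTH = 25
FRAC_BITS = 23                 # assumption: 2 integral bits (sign included) + 23 fraction bits
SCALE = 1 << FRAC_BITS

def to_fixed(angle):
    """Encode an angle in radians as a WIDTH-bit two's complement word."""
    value = round(angle * SCALE)
    if not -(1 << (WIDTH - 1)) <= value < (1 << (WIDTH - 1)):
        raise OverflowError("angle outside the representable range [-2, 2)")
    return value & ((1 << WIDTH) - 1)      # negative values wrap into two's complement

def from_fixed(word):
    """Decode a WIDTH-bit two's complement word back to radians."""
    if word & (1 << (WIDTH - 1)):          # sign bit set
        word -= 1 << WIDTH
    return word / SCALE

# First entries of the constant angle table, arctg(2^-i).
for i in range(4):
    angle = math.atan(2.0**-i)
    word = to_fixed(angle)
    print(f"arctg(2^-{i}) = {angle:.7f} rad -> 0x{word:07X} -> {from_fixed(word):.7f}")
```

    With 23 fractional bits the rounding error of a stored angle is at most 2^-24 rad, the same order as the resolution of a 32-bit floating point significand for values near 1.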

    Figure 6. Finite State Machine controlling SVD

    The SVD architecture with the 2 variants of arithmetic was implemented in VHDL and synthesized for a Xilinx Virtex-5 device. The synthesis results are summarized in Table II. In terms of resource allocation, there is no large difference in the number of registers. On the other hand, the floating point variant consumes much more combinatorial logic. There is also a large difference in maximum clock speed: 148 MHz for the fixed point version and only 35 MHz for the floating point one. Arithmetic operations on floating point numbers require long chains of combinatorial logic, so a signal needs more time to travel from one register to another.

    TABLE II. SYNTHESIS RESULTS FOR 2 VARIANTS OF SVD ARCHITECTURE

    Parameter                   32-bit IEEE floating point   25-bit fixed point
    Clock frequency             35 MHz                       148 MHz
    Levels of Logic             74                           35
    Delay                       28.602 ns                    6.738 ns
    Number of Slice Registers   337 (1%)                     314 (1%)
    Number of Slice LUTs        4648 (14%)                   2609 (7%)

    The two code variants were simulated in the Xilinx ISE environment for several sample matrices. The results were written to a file, converted and compared with the ones given by the SVD algorithm run in the Octave environment. Fig. 7 shows two plots of the relative errors obtained for the two architectures. The fixed point architecture clearly delivers substantially better results.

    Figure 7. Relative error of singular value determination for two kinds of arithmetic: 25-bit fixed point (lower plot) and 32-bit floating point (upper plot). [Plot: horizontal axis $\sigma_1$, logarithmic, from 10^-23 to 10^37; vertical axis $|\Delta\sigma_1 / \sigma_1|$, from 0 to 1.4x10^-6.]


  • V. CONCLUSIONS

    A comprehensive study of digital hardware dedicated to Singular Value Decomposition was performed. The motivation was the authors' interest in the construction of specialized computing machines performing operations on matrices in a highly parallel way. Significant effort was devoted to the CORDIC algorithm, which was used here for SVD but may be treated as a separate issue as well. The results lead to the conclusion that contemporary FPGAs are very close to enabling the construction of machines dealing with huge computational complexity.

    The presented results, although limited to small matrices, are a good basis for further work, and already at this stage deliver quite reasonable comparative material about architecture and arithmetic variants. In this context the results obtained for fixed and floating point are very interesting. As expected, the fixed point approach provides higher processing speed and lower allocation of logic resources. A surprising result was the higher precision obtained with fixed point. It should be noted, however, that the 25-bit vectors were selected after very careful considerations and estimations.

    Further research will focus on the construction of devices dealing with matrices of higher dimension, perhaps with processing decomposed into basic 2x2 elements, so that the described modules may be used without any redesign. An advantage of this approach is a chance to develop a methodology of processing matrices of unlimited dimension with a limited number of basic SVD/CORDIC units. That would enable optimal utilization of the currently available resources, at least partially independent of the input complexity.

    REFERENCES

    [1] C. Eckart, G. Young, The approximation of one matrix by another of lower rank, Psychometrika, vol. 1, no. 3, 1936.

    [2] J.E. Volder, The CORDIC Trigonometric Computing Technique, IRE Transactions on Electronic Computers, 1959.

    [3] G. Golub, W. Kahan, Calculating the singular values and pseudo-inverse of a matrix, J. SIAM Numerical Analysis, Ser. B, vol. 2, no. 2, 1965, pp. 205-224.

    [4] R.P. Brent, F.T. Luk, C.F. Van Loan, Computation of the singular value decomposition using mesh-connected processors, Journal for VLSI Computer Systems, vol. 1, no. 3, 1985, pp. 243-270.

    [5] J.R. Cavallaro, F.T. Luk, CORDIC Arithmetic for a SVD Processor, Journal for Parallel and Distributed Computing, vol. 5, 1988, pp. 271-290.

    [6] R. Andraka, A Survey of CORDIC Algorithms for FPGA based computers, in FPGA '98: Proc. of sixth international symposium on Field programmable gate arrays ACM/SIGDA, 1998, pp. 191-200.

    [7] F. Deprettere (ed.), SVD and signal processing. Algorithms, applications and architectures, Department of Electrical Engineering, Delft University of Technology, Elsevier Science Publishers B.V., Amsterdam, 1988.

    [8] H. Wang, P. Leray, J. Palicot, A CORDIC-based dynamically reconfigurable FPGA architecture for signal processing algorithms, URSI 08, The XXIX General Assembly of the International Union of Radio Science, Chicago IL, 2008.

    [9] VHDL, IEEE Std No. 1076, 2000.

    [10] Xilinx ISE Web Pack, www.xilinx.com, 2009.

    [11] Floating-point arithmetic, IEEE Std No. 754, 2008.
