ECE 734 VLSI Array Structures for Digital
Signal Processing
Project Report
Title:
A Recursive Method for the Solution of the Linear
Least Squares Formulation – algorithm and
performance using PLX instruction set
Authors:
Claus Benjaminsen
Shyam Bharat
Table of Contents
Introduction
Algorithm
Theoretical Analysis
Numerical Analysis in Matlab
Implementation Using the PLX instruction set
Implementation of different dimensional multiplications using the PLX instruction set
Problems with the PLX instruction set
Conclusion
The PLX assembly code implementation
References
Introduction
The least squares method of parameter estimation seeks to minimize the squared
error between the observed data sequence and the assumed signal model, which is some
function of the parameter to be estimated. In particular, the linear least squares
formulation represents the case where the signal model is a linear function of the
parameter to be estimated. The salient feature of the least squares method is that no
probabilistic assumptions are made about the data – only a signal model is assumed. This
method is generally used in situations where a precise statistical characterization of the
data is unknown.
The linear least squares estimation (LLSE) method ultimately boils down to
solving a set of linear equations. As will be seen from the equations that follow, the
solution to the LLSE involves the computation of the inverse of a matrix. This matrix
happens to be the autocorrelation matrix of the assumed signal model. Now, the
dimensions of this autocorrelation matrix are P x P, where P is the number of samples
in the observation vector. The computation of the matrix inverse is straightforward for
small values of P, but for large values of P it becomes increasingly computationally
intensive. In digital signal processing (DSP) applications, the computation is generally
required to be done in real-time for a continuous succession of samples. In other words,
for DSP applications one matrix inversion is required for every new input vector. Since it
is very time consuming to calculate the inverse of a matrix, it is essential to find an
alternate way of solving the system of linear equations to yield the unknown parameters.
This problem is well researched, and an alternative method of solving for the unknown
parameters exists. This alternative method basically involves updating a so-called weight
vector in time, based on its value at the previous time instant. The meaning of this
statement will become clearer with the aid of the equations presented below.
The function to be minimized with respect to the parameter at the present instant
is given by:
J(w(n)) = Σ_{k=1}^{n} λ^{n−k} |e_n(k)|^2 ,  0 < λ < 1
where e_n(k) = d(k) − w^H(n)u(k)
In matrix form, this function can be written as:
J(w(n)) = [d − u^H w(n)]^H Λ [d − u^H w(n)]
where Λ = diag{λ^{n−1}, λ^{n−2}, … , λ, 1}
The solution to this equation is given by:
w(n) = (u Λ u^H)^{-1} u Λ d
or w(n) = Φ^{-1}(n) θ(n)
where Φ(n) = u Λ u^H and θ(n) = u Λ d.
As can be observed from the above equations, the path to the solution involves the
computation of the inverse of a matrix for every value of n. The method of recursive
least squares seeks to circumvent this computationally intensive step by instead
calculating w(n) from w(n−1), its value at the previous time instant.
Φ(n) and θ(n) can be rewritten in recursive form as follows:
Φ(n) = λΦ(n−1) + u(n)u^H(n)
θ(n) = λθ(n−1) + u(n)d*(n)
The matrix inversion lemma shown below is applied to Φ(n):
(A + uu^H)^{-1} = A^{-1} − (A^{-1}u u^H A^{-1}) / (1 + u^H A^{-1}u)
Therefore,
Φ^{-1}(n) = λ^{-1}Φ^{-1}(n−1) − (λ^{-2}Φ^{-1}(n−1)u(n)u^H(n)Φ^{-1}(n−1)) / (1 + λ^{-1}u^H(n)Φ^{-1}(n−1)u(n))
At this point, the following 2 new terms are defined:
P(n) = Φ^{-1}(n), where P(n) has dimensions M x M
k(n) = (λ^{-1}P(n−1)u(n)) / (1 + λ^{-1}u^H(n)P(n−1)u(n)), where k(n) is an M x 1 gain vector
It can be seen that: k(n) = P(n)u(n)
If we proceed with the math, we ultimately arrive at the following recursion formula for
w(n):
w(n) = w(n−1) + k(n)[d*(n) − u^H(n)w(n−1)]
The last term (in square brackets) is denoted α*(n), where α(n) = d(n) − w^H(n−1)u(n) is
called the 'innovation' or 'a priori estimation error'. This differs from e(n), which is the
'a posteriori estimation error'. To update w(n−1) to w(n), we need k(n) and α(n).
Algorithm
Based on the above formulation, shown below is an algorithm for solving the least
squares equation recursively:
Initialization: P(0) = δ^{-1}I, where δ is small and positive
w(0) = 0
For n = 1, 2, 3, ...
x(n) = λ^{-1}P(n−1)u(n)
k(n) = [1 + u^H(n)x(n)]^{-1}x(n)
α(n) = d(n) − w^H(n−1)u(n)
w(n) = w(n−1) + k(n)α*(n)
P(n) = λ^{-1}P(n−1) − k(n)x^H(n)
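To connect the algorithm box above with the implementation work that follows, here is a floating-point reference version in Python/NumPy. This sketch is ours, not part of the original project code; the function name and the default values of lam and delta are our assumptions, and the real-valued, scalar-d simplification matches the one stated in the next section:

import numpy as np

def rls(u_seq, d_seq, K, lam=0.99, delta=0.01):
    """One pass of recursive least squares (real-valued, scalar d).
    u_seq: sequence of (K,) input vectors; d_seq: matching scalars."""
    P = np.eye(K) / delta                # P(0) = delta^-1 * I
    w = np.zeros(K)                      # w(0) = 0
    for u, d in zip(u_seq, d_seq):
        Plam = P / lam                   # lambda^-1 P(n-1), computed once and reused
        x = Plam @ u                     # x(n) = lambda^-1 P(n-1) u(n)
        k = x / (1.0 + u @ x)            # k(n) = [1 + u^H(n) x(n)]^-1 x(n)
        alpha = d - w @ u                # a priori error alpha(n)
        w = w + k * alpha                # w(n) = w(n-1) + k(n) alpha(n)
        P = Plam - np.outer(k, x)        # P(n) = lambda^-1 P(n-1) - k(n) x^H(n)
    return w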
Theoretical Analysis
The first step in analyzing the algorithm is to look at the sizes of the different quantities.
Doing so, one finds that choosing the sizes of the input vector u and the target values d
determines the sizes of all the other quantities in the algorithm. Therefore, if u is (K x 1)
and d is (N x 1), the quantities in the algorithm have the following dimensions:
Quantity:    u        d        P        x        k        w        α
Dimension:  (K x 1)  (N x 1)  (K x K)  (K x 1)  (K x 1)  (K x N)  (N x 1)
Considering the scope of the project, we make the following simplifications to the
general formulation of the algorithm. First, we assume only real quantities, which
implies that both the input values u and the output values d are real. Second, we limit d
to be a scalar, that is N = 1, and the dimensions of the quantities are then:
Quantity:    u        d        P        x        k        w        α
Dimension:  (K x 1)  (1 x 1)  (K x K)  (K x 1)  (K x 1)  (K x 1)  (1 x 1)
Now the computations required in the algorithm, in terms of vector and matrix algebra,
can be investigated.
The first line requires P multiplied by a constant (λ^{-1}) (K^2 multiplications) and a matrix-
vector product (K^2 multiplications and K(K−1) additions). This could be done with fewer
operations if the matrix-vector product were carried out first, followed by the
multiplication by the constant; but since the same constant-matrix product λ^{-1}P(n−1) is
required to calculate P(n), it should be computed first and stored, so it can be reused in
the calculation of P(n).
The second line requires a vector inner product (K multiplications and K−1 additions),
then an addition of the scalar 1 (1 addition), a division (taking the inverse) of a scalar (1
division) and finally a constant multiplied by a vector (K multiplications).
In the third line there is a vector inner product (K multiplications and K−1 additions) and
a scalar subtraction (1 subtraction).
The fourth line has a vector multiplied by a scalar (K multiplications) and a vector
addition (K additions).
Finally, the fifth line reuses the product λ^{-1}P(n−1) from the first line, so this doesn't
need to be calculated again. It then has a vector outer product (K^2 multiplications) and
finally a matrix addition (K^2 additions).
This gives a total per iteration of
3K^2 + 4K multiplications
K(K−1) + (K−1) + 1 + (K−1) + K + K^2 = 2K^2 + 2K − 1 additions
1 division
1 subtraction
If we don't consider the division, a general-purpose microprocessor with a separate
multiplier will need 5K^2 + 6K operations to execute one iteration of the algorithm, not
counting the operations needed to move, load and store data. With the PLX
instruction set it is possible to carry out 2 multiplications and up to 4 additions in parallel,
which has the potential of reducing the number of operations per iteration to roughly
2K^2 + 2.5K.
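As a concrete check of these formulas (the arithmetic below is ours), consider the vector length K = 4 used later in the PLX implementation:

multiplications:  3K^2 + 4K = 3·16 + 16 = 64
additions:        2K^2 + 2K − 1 = 32 + 8 − 1 = 39
sequential total (subtraction included, division excluded): 64 + 39 + 1 = 104 = 5K^2 + 6K
with PLX sub-word parallelism: 64/2 + 40/4 = 32 + 10 = 42 = 2K^2 + 2.5K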
The algorithm can be drawn as a data flow graph (DFG), see Figure 1.
[Figure: data flow graph with multiplier, adder, transpose, 1/x and delay (Z^-1) nodes, labeled A through O; the labels are referred to below.]
Figure 1 The DFG of the RLS algorithm
The DFG has 6 loops:
1. A – B – A, t1: tmsm + tmma
2. A – C – D – E – B – A, t2: tmsm + tmvm + tvt + tvom + tmma
3. A – C – J – E – B – A, t3: tmsm + tmvm + tvsm + tvom + tmma
4. A – C – G – H – I – J – E – B – A, t4: tmsm + tmvm + tvim + tssa + tssd + tvsm + tvom + tmma
5. K – L – M – O – N – K, t5: tvim + tssa + tvsm + tvva + tvt
6. O – O, t6: tvva
where the time required for each operation is defined as:
tmsm – Matrix–scalar multiplication
tmma – Matrix addition
tmvm – Matrix–vector multiplication
tvt – Vector transpose
tvom – Vector outer product
tvsm – Vector–scalar multiplication
tvim – Vector inner product
tssa – Scalar addition
tssd – Scalar division
tvva – Vector addition
All loops contain exactly one delay element, and therefore the longest loop computation
time will be the iteration bound. First it is noted that tvt = 0, as there is in general no
need to perform a vector transpose in hardware, where column vectors and row vectors
are not distinguished, only the computations they are involved in. Secondly, many of the
nodes are shared between loops, and comparing the loops shows that only loop 4 and
loop 5 can have the longest computation time. Cancelling the computations common to
the two loops, one finds that loop 4 is the longer of the two if
tmsm + tmvm + tssd + tvom + tmma > tvva,
which is a reasonable assumption.
Therefore loop 4 determines the iteration bound, which is
T = tmsm + tmvm + tvim + tssa + tssd + tvsm + tvom + tmma
The critical path is contained in loop 4 and is equal to the iteration bound T; therefore no
speed improvement can be achieved by retiming. Also, since the iteration bound can be
achieved directly by using enough hardware with the implementation in the DFG above,
there is nothing to gain from unfolding either. Since this project focuses on an
implementation using the PLX instruction set, it is clear that to get anywhere close to the
iteration bound, several PLX processors need to be interconnected, so the different loops
can be executed in parallel. To look into that a little further, we will investigate the inter-
and intra-iteration dependence structure of the algorithm. This is done by drawing a
dependence graph (DG) of the variables in the algorithm, as in Figure 2.
[Figure: dependence graph of the variables x, α, k, P and w across iterations n−1 and n.]
Figure 2 The dependence graph of the RLS algorithm
From the figure it is seen that there is a lot of inter-iteration dependence; that is, few
calculations of a new iteration can be started until most of the calculations of the old
iteration are completed. If the algorithm is calculated in place, meaning new values of the
different quantities are stored over old values, then once P(n−1) has been calculated, only
the quantity x(n) of the next iteration can be calculated before the calculation of w(n−1)
is completed. The reason k(n) can't be calculated yet is an output dependence on k: since
the calculation of w(n−1) depends on k(n−1), it needs to be completed before the new
value k(n) can be stored on top of the old value. Once w(n−1) has been calculated, only
α(n) of the next iteration can be calculated until P(n−1) is also available. If the algorithm
is not calculated in place, that is, old values are not overwritten by new values but instead
stored into arrays, the output dependence on k is removed. It is then seen from the graph
above that the left branch (x, k and P) can be executed independently of the right branch.
The right branch has an input dependence on k, and therefore the left branch must be
calculated before the right branch can be completed. It is therefore possible to make an
implementation with two PLX processors, where only the value of k is communicated
between the two processors. Connecting this with the DFG, one processor computes A,
B, C, E, G, H, I and J, and the other processor computes K, L, M and O, ignoring the
transpose operations. This configuration will therefore reach the iteration bound, and no
further speed-up can be achieved by using extra PLX processors.
Looking into the intra-iteration dependence, it is seen that k has an input dependence on
x, P depends on k and x, and w depends on k and α. This gives the following possible
execution orders within each iteration:
1. x(n), k(n), P(n), α(n), w(n)
2. x(n), k(n), α(n), P(n), w(n)
3. x(n), k(n), α(n), w(n), P(n)
4. x(n), α(n), k(n), P(n), w(n)
5. x(n), α(n), k(n), w(n), P(n)
6. α(n), x(n), k(n), P(n), w(n)
7. α(n), x(n), k(n), w(n), P(n)
There are even more possibilities if the execution of successive iterations is mixed; one
possibility is
8. α(n−1), w(n−1), P(n−1), x(n), k(n)
This just requires that x(1) and k(1) are calculated as initial values before the execution of
the loop is started.
This shows that there are several possible ways to order the calculations of the different
variables in the implementation, and these options can be explored to find the ordering
which gives the most efficient implementation, for instance by minimizing the number of
move operations.
Numerical Analysis in Matlab
We used the concept of ‘system identification’ to determine whether the results of the
RLS algorithm are accurate. Essentially, this boils down to the choice of the input signal
(u) and the observation vector (d). Initially, 4 weights are defined. The observation vector
is chosen to be the inner product of the 4 weights with the input signal (and random noise
is added to this product). With this a priori information, the RLS algorithm is coded in an
m-file and executed. Based on this simple experiment, it is seen from the figure below
that the weights tend to converge to their a priori defined values after just a few tens of
iterations. Of course, the convergence behavior depends on the parameters 'delta' and
'lambda' (both of which are positive and less than 1). In general, higher values of
'lambda' tend to eliminate most of the 'hunting' of the weight estimates about their actual
values, while lower values of 'lambda' lead to a significant amount of fluctuation of the
weights about their true values. 'delta' affects the convergence pattern in a different
manner: higher values of 'delta' lead to slower convergence, i.e. the algorithm takes a
larger number of iterations to attain the true values, while lower values of 'delta' lead to
faster convergence. Thus, higher values of 'lambda' and lower values of 'delta' are better
suited for the RLS algorithm. The aim of this analysis was to verify the correct operation
of the RLS algorithm using known inputs and outputs, so that the estimated weights can
be compared with the known true values. This was essential in order to implement the
algorithm using the PLX instruction set.
[Figure: weight values versus iteration (0–250), converging to w1 = 3.2, w2 = 1.1, w3 = −2.1, w4 = 0.7; 64-bit floating point.]
Figure 3 The convergence pattern of the weights used in the RLS algorithm
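The experiment can be reproduced with the floating-point sketch given after the algorithm box. The snippet below is our own illustration of the system-identification setup; the noise level, the seed, and the values of lam and delta are assumptions, while the weight values are the ones shown in Figure 3:

import numpy as np

np.random.seed(0)
w_true = np.array([3.2, 1.1, -2.1, 0.7])      # the a priori defined weights
U = np.random.randn(250, 4)                   # 250 random input vectors
d = U @ w_true + 0.1 * np.random.randn(250)   # observations plus noise
w_est = rls(U, d, K=4, lam=0.99, delta=0.01)  # should approach w_true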
The graph in Figure 3 shows the convergence of the weights when the RLS algorithm is
executed in Matlab and the default double precision floating point values (64 bits) are
used in the calculations. PLX doesn't support floating point, and to make use of the
sub-word parallelism inherent in the PLX architecture, we need to reduce the number of
bits, so that more than one number can be stored in each 64-bit PLX register. To get the
full benefit of the parallelism, 4 numbers should be stored in each register, allowing each
number to be represented with 16 bits. In order to test the RLS algorithm with this
reduced precision, we use the fixed point package in Matlab to simulate the execution of
the algorithm using 16-bit fixed point values. The remaining choice is how many bits to
use for the integer part and how many for the fractional part. Since the values in the
algorithm can be both positive and negative, we need to represent the values in two's
complement, and hence one bit is used as the sign bit. This means that we can choose
from 0 to 15 fractional bits, and we use the simulations to find the best choice.
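The quantization being simulated can be sketched outside Matlab as well. The helper below is ours (saturation at the range limits is our assumption; the Matlab fixed point package can be configured either way) and shows what representing a value with a given number of fractional bits means:

def quantize(value, frac_bits=10, word_bits=16):
    """Round to a two's complement fixed-point grid: 1 sign bit,
    word_bits-1-frac_bits integer bits, frac_bits fractional bits."""
    scale = 1 << frac_bits
    lo = -(1 << (word_bits - 1))                   # most negative code
    hi = (1 << (word_bits - 1)) - 1                # most positive code
    code = max(lo, min(hi, round(value * scale)))  # saturate to the range
    return code / scale

# e.g. quantize(3.2, frac_bits=10) -> 3277/1024, about 3.2002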
In Figure 4 the convergence patterns of the weights for 6 and 8 fractional bits are shown.
The patterns for 10 and 11 fractional bits are shown in Figure 5.
[Figure: two panels of weight value versus number of iterations (0–250) for 16-bit fixed point; w1 = 3.2, w2 = 1.1, w3 = −2.1, w4 = 0.7.]
Figure 4 The convergence patterns for the weights using 16 bit fixed point values. In the graph on the left 6 fractional bits are used, and on the right 8 fractional bits are used.
[Figure: two panels of weight value versus number of iterations (0–250) for 16-bit fixed point; w1 = 3.2, w2 = 1.1, w3 = −2.1, w4 = 0.7.]
Figure 5 The convergence patterns for the weights using 16 bit fixed point values. In the graph on the left 10 fractional bits are used, and on the right 11 fractional bits are used.
From the figures it is clearly seen that as the number of fractional bits is increased, the
convergence becomes much more stable and the number of big fluctuations is decreased.
There is a limit though, because as the number of fractional bits is increased, the number
of integer bits is decreased, and hence at some point there will be too few integer bits to
represent the values accurately. When this happens, the algorithm becomes unstable and
the weights do not converge, as shown on the right in Figure 5. The simulations therefore
show that 10 fractional bits give the best performance with respect to convergence and
stability.
Implementation Using the PLX instruction set
The above Matlab simulations showed that 16 bits are enough to represent the values in
the RLS algorithm while still maintaining good performance. The number of fractional
bits should be 10, which actually doesn't affect the implementation very much; it only
determines the format in which the input values, the constants and the output values are
stored.
The PLX assembly code for the RLS algorithm can be found at the end of this report.
Implementation of different dimensional multiplications using the PLX instruction set
The RLS algorithm involves multiplications, additions and even a scalar division. Of the
multiplications, there are vector–scalar products, vector–vector products and
matrix–vector products. The results of such multiplications can be vectors, scalars or
matrices, depending on the combination and orientation used. The PLX architecture
provides thirty-two 64-bit registers. Each of these registers can thus store four sub-words,
each 16 bits long. Hence, for ease of implementation, we have defined our vectors to be
sized 4 x 1 and matrices to be sized 4 x 4. Also, each element of a vector or matrix, as
well as each scalar, is defined to be 16 bits in size. As a result, we adopt the convention
that a vector is stored entirely in one register, with each 16-bit sub-word of the register
carrying one element of the vector. Storing a matrix is similar: the first row of the matrix
is stored like a vector (in one register), while the other 3 rows are stored in 3 other
registers. A scalar is a little different: since only 16 bits are required, scalars are stored in
the least significant 16-bit sub-word of the appropriate register. They can then be
replicated into the other sub-words, depending on the requirements of the computation.
Described below are examples of common multiplication operations, all of which appear
in the RLS algorithm as implemented with PLX instructions.
Matrix-Vector product: Initially, each row of the matrix P (P is a 4x4 matrix) is stored
in a register, so each element has 16 bits allocated for its storage. Each row of P is then
multiplied with the column vector 'u' using the instructions 'pmul.odd' and 'pmul.even'.
This is done because, with the sub-word multiplication instructions in PLX, only the
odd-indexed or the even-indexed sub-words can be multiplied simultaneously, and each
result is stored as a 32-bit number. Subsequently, we limit each multiplication result to
its lower 16 bits in order to maintain a standard size for each variable; thus, wrap-around
arithmetic is used here. After the requisite multiplications, the 'mix.2.r' instruction is
used 4 times to concatenate the appropriate results together. After these instructions are
executed, we have 4 registers; the content of each register represents four 16-bit words,
each of which is the product of an element of P and an element of 'u'. Now, the contents
of each register need to be added up. This is not possible while the 16-bit data are in the
same register, so they are moved to corresponding positions in different registers using
the 'check.4' and 'excheck.4' instructions. Once this is accomplished, the final vector
resulting from the matrix-vector product is obtained using 3 successive 'padd.2'
operations.
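The net effect of this instruction sequence can be modeled in a few lines of Python. This is our own model of the data flow just described (32-bit products truncated to their low 16 bits, then summed per row); it does not reproduce the register-level mix/check steps, and a true fixed-point implementation would additionally rescale each product by the number of fractional bits:

import numpy as np

def subword_matvec(P_rows, u):
    """Model of the PLX 4x4 matrix times 4-vector product:
    pmul.odd/pmul.even form 32-bit products, of which only the low
    16 bits are kept (wrap-around), then each row is summed (padd.2)."""
    x = np.empty(4, dtype=np.int16)
    for i in range(4):
        prod32 = P_rows[i].astype(np.int32) * u.astype(np.int32)
        prod16 = prod32.astype(np.int16)    # keep the low 16 bits only
        x[i] = prod16.sum(dtype=np.int16)   # wrap-around additions
    return x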
Vector-vector inner product: An inner product is essentially the transpose (or
Hermitian, i.e. complex conjugate transpose) of one vector multiplied by the other, and it
results in a scalar. The instructions 'pmul.odd' and 'pmul.even' are used to generate the 4
multiplication results. The two 'result' registers are then added together using the
'padd.2' instruction. With the same convention of retaining the lower 16 bits of each
multiplication result, all that remains is to add the least significant and second-most
significant words of the result of the 'padd.2' instruction. This is done by first using the
'excheck.4' instruction and then the 'padd.2' instruction again. The final result is stored
in the least significant word of the register.
Vector-vector outer product: A vector-vector outer product results in a matrix; for
example, a Kx1 vector multiplied by a 1xK vector results in a KxK matrix. Here, we have
to multiply a column vector with the Hermitian of another column vector. Since all our
data are real, the Hermitian merely translates into the transpose operation. Each element
of the matrix resulting from the outer product is the product of an element of one vector
and an element of the other. For example, the first element of the first column vector is
multiplied with each element of the transposed vector to give the first row of the result
matrix. Similarly, the 2nd element of the first column vector is multiplied with each
element of the transposed vector to give the 2nd row of the result matrix. This process
continues until all rows of the result matrix are formed. In terms of the PLX
implementation, this is done by initially replicating the first element of the first column
vector into all 64 bits of a register using the 'permset.2' instruction. Then the requisite
multiplications are performed using the 'pmul.odd' and 'pmul.even' instructions. Now,
the 'mix.2.r' instruction is used to arrange the elements of the first row of the matrix as
16-bit sub-words of the register in which it is stored. This process is then repeated for
each of the other elements of the first column vector, to generate the remaining rows of
the result matrix.
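Element-wise, the replicate-multiply-mix pattern just described corresponds to the following sketch (ours, with the same low-16-bit truncation as in the matrix-vector model above):

import numpy as np

def subword_outer(k, x):
    """Model of the PLX outer product k x^T: each element of k is
    replicated (permset.2), multiplied with all of x (pmul.odd/even),
    and the low 16 bits are kept per element (mix.2.r) to form one row."""
    rows = []
    for ki in k.astype(np.int32):
        row32 = ki * x.astype(np.int32)      # four 32-bit products
        rows.append(row32.astype(np.int16))  # truncate to 16 bits
    return np.vstack(rows)                   # the 4x4 result matrix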
Vector-scalar multiplication: This results in another vector. The operation is a special
case of the vector-vector outer product described above. The scalar is loaded into all 64
bits of a register using the 'permset.2' instruction. The 'pmul.odd' and 'pmul.even'
instructions are then employed, followed by the 'mix.2.r' instruction, to obtain a register
whose 64 bits are filled with four 16-bit values, each of which represents an element of
the result vector.
Scalar division: Division is not supported in the PLX instruction set, and hence we had
to develop a division routine. This is done using multiple shift and subtract operations
performed in a loop. The denominator is first shifted 15 places to the left, after which the
loop is started. In each iteration the denominator, if greater than zero, is compared with
the numerator; if the numerator is the larger of the two, the denominator is subtracted
from it and the difference is stored as the new numerator, a 1 is shifted into the result
register, and the denominator is shifted one place to the right. If the denominator is the
larger, a 0 is shifted into the result register instead, and the denominator is likewise
shifted one place to the right before the next iteration begins. This continues until the
denominator has been shifted 31 times, after which the division is done and the result
obtained.
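This is a restoring (shift-and-subtract) divider. The sketch below is our own restatement of the loop for non-negative operands with a quotient below 2; the actual subroutine runs for 31 iterations to cover a wider operand range:

def fixed_div(num, denom, frac_bits=15):
    """Bit-serial restoring division: computes floor((num << frac_bits) / denom)
    one quotient bit per iteration, as the PLX Division subroutine does."""
    rem = num << frac_bits          # dividend scaled to fixed point
    div = denom << frac_bits        # divisor aligned to the top quotient bit
    q = 0
    for _ in range(frac_bits + 1):  # one quotient bit per shift position
        q <<= 1
        if rem >= div:              # compare (cmp.ge)
            rem -= div              # subtract (psub)
            q |= 1                  # shift a 1 into the result
        div >>= 1                   # cf. srli denom,denom,1
    return q

# fixed_div(1, 3) == 10922, i.e. 10922/2^15, about 0.3333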
Problems with the PLX instruction set
During our work on implementing the algorithm using the PLX instruction set, we came
across a problem with the compare instruction cmp. It turns out that the PLX simulator
has an error, and hence in some cases the cmp instruction gives wrong results. An
example of this is shown in Figure 6.
Figure 6 A screen dump showing the execution of the PLX simulator. As highlighted, the cmp.ge
instruction first yields a wrong result and then, an instruction later, suddenly gives the right result.
The result of the first compare instruction is wrong, but when the exact same instruction
is executed an instruction later, it suddenly gives the right result. We reported the
problem and were sent another simulator; at first the problem seemed to be fixed, because
the new simulator didn't give the same errors as the first one.
Figure 7 The second simulator also has problems with the compare instruction. It doesn't give the
same errors as the first simulator, but another error, as shown by the small oval in the lower left.
Unfortunately, the second simulator has another error, also involving the compare
instruction. An example is shown in Figure 7, where the cmp.gt should evaluate to true
but returns false.
We have therefore not been able to test our implementation of the algorithm, and cannot
verify whether it works correctly.
Conclusion
The RLS algorithm is widely used in applications like adaptive beamforming, tracking
and other adaptive filtering tasks. This project has investigated the algorithm with the
goal of implementing it using the PLX instruction set, to take advantage of the sub-word
parallelism in the PLX architecture. A successful implementation in PLX would make it
possible to use the RLS algorithm in practical applications at very low cost, because it
only requires a PLX processor and not the development of dedicated hardware, which
can be very costly to manufacture.
The main performance criteria in connection with the implementation have been studied,
including speed, stability and convergence. From the analysis performed it was
concluded that 16 bits are enough to store the variables in the algorithm, and with 10 of
these as fractional bits, the algorithm is stable and good convergence is achieved.
The functionality of the implementation in PLX has not been verified, because errors
were found in the PLX simulator, and hence the results of compare instructions are
unreliable.
Concluding on the speed of the implementation anyway, a count of instructions in the
assembly file shows that there are 78 instructions in the main loop. This doesn't include
the division subroutine, which takes a varying number of instructions depending on the
operands. As currently implemented using shift and subtract, the division takes a long
time to execute, and it would therefore be very beneficial to add a separate division unit
to the PLX processor. In that case the 78 instructions would be a good estimate of how
fast this algorithm can run on a PLX processor. With a processor speed of 100 MHz and
one instruction per cycle, this gives an iteration frequency of roughly 1.28 MHz, which
in many cases will be fast enough for real-time applications.
The PLX assembly code implementation
// all values are 16 bits: 8 integer and 8 fractional

// constants
#define lambda1 0x0100   // 1.0 represents lambda inverse
#define delta1  0xFA00   // 250.0 represents delta inverse

#define u      R1
#define P1     R2
#define P2     R3
#define P3     R4
#define P4     R5
#define x      R6
#define k      R7
#define d      R8
#define alpha  R9
#define w      R10
#define lambda R11
#define count  R27
#define num    R28
#define denom  R29
#define divres R30

main proc
// Initialization
loadi.z.0 w,0x0000

loadi.z.3 P1,delta1
loadi.z.2 P2,delta1
loadi.z.1 P3,delta1
loadi.z.0 P4,delta1

loadi.z.3 R12,lambda1
permset.2 lambda,R12,3333

// Load data u and d
Loop:
loadi.z.0 u,0x0010
loadi.z.1 u,0x0020
loadi.z.2 u,0x0030
loadi.z.3 u,0x0040

loadi.z.0 d,0x0001

// Calculate lambda^-1 * P (the 'lambda' register holds lambda inverse)
pmul.even R12,lambda,P1
pmul.odd  R13,lambda,P1
mix.2.r   P1,R13,R12

pmul.even R12,lambda,P2
pmul.odd  R13,lambda,P2
mix.2.r   P2,R13,R12

pmul.even R12,lambda,P3
pmul.odd  R13,lambda,P3
mix.2.r   P3,R13,R12

pmul.even R12,lambda,P4
pmul.odd  R13,lambda,P4
mix.2.r   P4,R13,R12
// Calculate x = lambda^-1 * P * u
pmul.even R12,P1,u
pmul.odd  R13,P1,u    // row 1 of P times u

pmul.even R14,P2,u
pmul.odd  R15,P2,u    // row 2 of P times u

pmul.even R16,P3,u
pmul.odd  R17,P3,u    // row 3 of P times u

pmul.even R18,P4,u
pmul.odd  R19,P4,u    // row 4 of P times u

mix.2.r R12,R12,R14
mix.2.r R13,R13,R15
mix.2.r R14,R16,R18
mix.2.r R15,R17,R19

check.4   R16,R12,R14
check.4   R17,R13,R15
excheck.4 R18,R14,R12
excheck.4 R19,R15,R13

// R16,R17,R18,R19 contain the sub-words of x, which are to be added
padd.2 R16,R16,R17
padd.2 R17,R18,R19
padd.2 x,R16,R17      // x = lambda^-1 * P * u

// Calculate the term to divide by, i.e. (1 + u.x)
pmul.even  R12,x,u
pmul.odd   R13,x,u
padd.2     R14,R12,R13
excheck.4  R15,R14,R0
paddincr.2 denom,R14,R15
// The rightmost 16 bits of 'denom' contain the scalar,
// while the upper 48 bits are don't care

call Division
// Calculate k = (divres) times (x)
permset.2 R12,divres,3333
pmul.odd  R13,R12,x
pmul.even R14,R12,x
mix.2.r   k,R13,R14

// Calculate Hermitian of 'w' times 'u', which is a scalar (used in the
// calculation of alpha)
pmul.even R12,w,u
pmul.odd  R13,w,u
padd.2    R14,R12,R13
excheck.4 R15,R14,R0
padd.2    R15,R14,R15
// The rightmost 16 bits of 'R15' contain the required scalar,
// while the upper 48 bits are don't care

// Calculate alpha
psub.2 alpha,d,R15
// assuming that the scalar 'd' is stored
// in the rightmost 16 bits of the register 'd' (R8)

// Calculate k times alpha
permset.2 R12,alpha,0000
// here, we are replicating the 16-bit alpha into all 64 bits of R12,
// assuming alpha to be in the rightmost 16 bits; if it is
// in the leftmost 16 bits, 0000 should be replaced with 3333
pmul.odd  R13,R12,k
pmul.even R14,R12,k
mix.2.r   R15,R13,R14
// 'k times alpha' is stored in R15

// Calculate w(n) = w(n-1) + 'k times alpha'
padd.2 w,w,R15
// the 2nd argument w is w(n-1), which is added to R15,
// and the result is stored in the first argument w, which is now w(n)

// Calculate k times 'Hermitian of x' .... 'Hermitian of x' is simply the
// transpose of x, since x is real-valued
permset.2 R12,k,3333
pmul.odd  R13,R12,x
pmul.even R14,R12,x
mix.2.r   R15,R13,R14
// R15 is the first row of 'k times Hermitian of x'

permset.2 R12,k,2222
pmul.odd  R13,R12,x
pmul.even R14,R12,x
mix.2.r   R16,R13,R14
// R16 is the second row of 'k times Hermitian of x'

permset.2 R12,k,1111
pmul.odd  R13,R12,x
pmul.even R14,R12,x
mix.2.r   R17,R13,R14
// R17 is the third row of 'k times Hermitian of x'

permset.2 R12,k,0000
pmul.odd  R13,R12,x
pmul.even R14,R12,x
mix.2.r   R18,R13,R14
// R18 is the fourth row of 'k times Hermitian of x'

// Calculate P and store it in P1,P2,P3 and P4 (rows 1 to 4, respectively, of P)
psub.2 P1,P1,R15
psub.2 P2,P2,R16
psub.2 P3,P3,R17
psub.2 P4,P4,R18

jmp Loop

stop: trap 0FFFFh

Division:
loadi.z.0 num,0x0001    // numerator
loadi.z.0 count,0x001F  // load counter
loadi.z.0 R26,0x0001
slli denom,denom,15

compare:
cmp.ge num,denom,P1,P2
P1 cmp.gt denom,R0,P1,P2
P1 jmp sub
slli divres,divres,1
shift: psub count,count,R26
cmp.eq count,R0,P3,P4
P3 jmp done
srli denom,denom,1
jmp compare

sub: pshiftadd.1.l divres,divres,R26
psub.2.s num,num,denom
jmp shift

done: ret R31
References
Simon Haykin, Adaptive Filter Theory, Prentice Hall, 2002.
Kalavai J. Raghunath and Keshab K. Parhi, "Fixed and Floating Point Error Analysis of QRD-RLS and STAR-RLS Adaptive Filters," Proc. IEEE ICASSP-94, 19-22 April 1994, vol. 3, pp. III/81-III/84.
Minglu Jin, "Partial Updating RLS Algorithm," Proc. ICSP '04 (7th International Conference on Signal Processing), 31 Aug.-4 Sept. 2004, vol. 1, pp. 392-395.
K. J. R. Liu and An-Yeu Wu, "Algorithms and Architectures for Split Recursive Least Squares," VLSI Signal Processing VII (Workshop), 26-28 Oct. 1994, pp. 460-469.