Appendix A: Linear Algebra Basics
A.1 Matrices and Vectors
A matrix $\mathbf{A}$ [1, 24, 25, 27], here indicated by a bold capital letter, consists of a set of ordered elements arranged in rows and columns. A matrix with $N$ rows and $M$ columns is indicated with the following notations:

$$\mathbf{A} = \mathbf{A}_{N\times M} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1M} \\ a_{21} & a_{22} & \cdots & a_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NM} \end{bmatrix} \qquad (A.1)$$

or

$$\mathbf{A} = [a_{ij}], \quad i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, M, \qquad (A.2)$$

where $i$ and $j$ are, respectively, the row and column indices. The elements $a_{ij}$ may be real or complex variables. A real matrix with $N$ rows and $M$ columns $(N \times M)$ can be indicated as $\mathbf{A} \in \mathbb{R}^{N\times M}$, while in the complex case as $\mathbf{A} \in \mathbb{C}^{N\times M}$. When a property holds in both the real and the complex case, the matrix can be indicated as $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$, as $\mathbf{A}\ (N\times M)$, or simply as $\mathbf{A}_{N\times M}$.
A.2 Notation, Preliminary Definitions, and Properties
A.2.1 Transpose and Hermitian Matrix
Given a matrix $\mathbf{A} \in \mathbb{R}^{N\times M}$, the transpose matrix, indicated as $\mathbf{A}^T \in \mathbb{R}^{M\times N}$, is obtained by interchanging the rows and columns of $\mathbf{A}$, so that
A. Uncini, Fundamentals of Adaptive Signal Processing, Signals and Communication
Technology, DOI 10.1007/978-3-319-02807-1,
© Springer International Publishing Switzerland 2015
$$\mathbf{A}^T = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{N1} \\ a_{12} & a_{22} & \cdots & a_{N2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1M} & a_{2M} & \cdots & a_{NM} \end{bmatrix} \qquad (A.3)$$

or

$$\mathbf{A}^T = [a_{ji}], \quad i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, M. \qquad (A.4)$$
It therefore holds that $(\mathbf{A}^T)^T = \mathbf{A}$.
In the case of a complex matrix $\mathbf{A} \in \mathbb{C}^{N\times M}$, the Hermitian matrix is defined as the transpose and complex-conjugate matrix:

$$\mathbf{A}^H = \left[a^*_{ji}\right], \quad i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, M. \qquad (A.5)$$

If the matrix is indicated as $\mathbf{A}\ (N\times M)$, the symbol $(\cdot)^H$ can be used to indicate both the transpose in the real case and the Hermitian in the complex case.
A.2.2 Row and Column Vectors of a Matrix
Given a matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$, its $i$th row vector is indicated as

$$\mathbf{a}_{i:} \in (\mathbb{R},\mathbb{C})^{M\times 1} = [\,a_{i1}\ a_{i2}\ \cdots\ a_{iM}\,]^H \qquad (A.6)$$

while its $j$th column vector as

$$\mathbf{a}_{:j} \in (\mathbb{R},\mathbb{C})^{N\times 1} = [\,a_{1j}\ a_{2j}\ \cdots\ a_{Nj}\,]^H \qquad (A.7)$$

A matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$ can be represented by its $N$ row vectors as

$$\mathbf{A} = \begin{bmatrix} \mathbf{a}^H_{1:} \\ \mathbf{a}^H_{2:} \\ \vdots \\ \mathbf{a}^H_{N:} \end{bmatrix} = [\,\mathbf{a}_{1:}\ \mathbf{a}_{2:}\ \cdots\ \mathbf{a}_{N:}\,]^H \qquad (A.8)$$
or by its M column vectors as
$$\mathbf{A} = [\,\mathbf{a}_{:1}\ \mathbf{a}_{:2}\ \cdots\ \mathbf{a}_{:M}\,] = \begin{bmatrix} \mathbf{a}^H_{:1} \\ \mathbf{a}^H_{:2} \\ \vdots \\ \mathbf{a}^H_{:M} \end{bmatrix}^H \qquad (A.9)$$
Given a matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$, one can associate with it a vector $\mathrm{vec}(\mathbf{A}) \in (\mathbb{R},\mathbb{C})^{NM\times 1}$ containing, stacked, all the column vectors of $\mathbf{A}$:

$$\mathrm{vec}(\mathbf{A}) = [\,\mathbf{a}^H_{:1}\ \mathbf{a}^H_{:2}\ \cdots\ \mathbf{a}^H_{:M}\,]^H_{NM\times 1} = [\,a_{11}, \ldots, a_{N1}, a_{12}, \ldots, a_{N2}, \ldots\ldots, a_{1M}, \ldots, a_{NM}\,]^H_{NM\times 1}. \qquad (A.10)$$
Remark Note that in Matlab you can extract entire columns or rows of a matrix with the following instructions:

- `A(i,:)` extracts the entire $i$th row as a row vector of dimension $M$;
- `A(:,j)` extracts the entire $j$th column as a column vector of size $N$;
- `A(:)` stacks the entire matrix into a column vector of dimension $N \cdot M$.
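The same extractions can be sketched in NumPy (an illustrative analogue, not part of the original text; note that NumPy stores arrays row-major, so column-major stacking must be requested explicitly):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # N = 2 rows, M = 3 columns

row_i = A[0, :]                    # entire 1st row, length M
col_j = A[:, 1]                    # entire 2nd column, length N
vec_A = A.flatten(order='F')       # column-major stacking, as in vec(A) of (A.10)

print(row_i)   # [1 2 3]
print(col_j)   # [2 5]
print(vec_A)   # [1 4 2 5 3 6]
```

The `order='F'` (Fortran order) argument is what makes `flatten` stack columns rather than rows.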
A.2.3 Partitioned Matrices
Sometimes it can be useful to represent a matrix $\mathbf{A}_{(M+N)\times(P+Q)}$ in partitioned form of the type

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}_{(M+N)\times(P+Q)} \qquad (A.11)$$

in which the elements $\mathbf{A}_{ij}$ are in turn matrices defined as

$$\mathbf{A}_{11} \in (\mathbb{R},\mathbb{C})^{M\times P}, \quad \mathbf{A}_{12} \in (\mathbb{R},\mathbb{C})^{M\times Q}, \quad \mathbf{A}_{21} \in (\mathbb{R},\mathbb{C})^{N\times P}, \quad \mathbf{A}_{22} \in (\mathbb{R},\mathbb{C})^{N\times Q}. \qquad (A.12)$$

The partitioned product follows the same rules as the ordinary product of matrices. For example,

$$\begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}\begin{bmatrix} \mathbf{B}_{1} \\ \mathbf{B}_{2} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11}\mathbf{B}_{1} + \mathbf{A}_{12}\mathbf{B}_{2} \\ \mathbf{A}_{21}\mathbf{B}_{1} + \mathbf{A}_{22}\mathbf{B}_{2} \end{bmatrix} \qquad (A.13)$$

Obviously, the dimensions of the partition matrices must be compatible.
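The block-product rule above can be checked numerically; the following sketch (an illustration added here, assuming NumPy) builds a partitioned matrix and verifies the identity (A.13) on random blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
# Compatible block dimensions: A11 (2x3), A12 (2x4), A21 (5x3), A22 (5x4)
A11, A12 = rng.standard_normal((2, 3)), rng.standard_normal((2, 4))
A21, A22 = rng.standard_normal((5, 3)), rng.standard_normal((5, 4))
B1, B2 = rng.standard_normal((3, 2)), rng.standard_normal((4, 2))

A = np.block([[A11, A12], [A21, A22]])   # assemble the partitioned matrix
B = np.vstack([B1, B2])

lhs = A @ B                               # full product
rhs = np.vstack([A11 @ B1 + A12 @ B2,     # block-wise product, as in (A.13)
                 A21 @ B1 + A22 @ B2])
assert np.allclose(lhs, rhs)
```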
A.2.4 Diagonal, Symmetric, Toeplitz, and Hankel Matrices
A given matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times N}$ is called diagonal if $a_{ij} = 0$ for $i \neq j$. It is called symmetric if $a_{ji} = a_{ij}$, or Hermitian if $a_{ji} = a^*_{ij}$ in the complex case, whereby $\mathbf{A}^T = \mathbf{A}$ in the real case and $\mathbf{A}^H = \mathbf{A}$ in the complex case.

A matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times N} = [a_{ij}]$ such that $a_{i,j} = a_{i+1,j+1} = a_{i-j}$ is Toeplitz, i.e., each descending diagonal from left to right is constant.

Moreover, a matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times N} = [a_{ij}]$ such that $a_{i,j} = a_{i-1,j+1} = a_{i+j}$ is Hankel, i.e., each ascending diagonal from left to right is constant.
For example, the following matrices $\mathbf{A}_T$, $\mathbf{A}_H$:

$$\mathbf{A}_T = \begin{bmatrix} a_i & a_{i-1} & a_{i-2} & a_{i-3} & \cdots \\ a_{i+1} & a_i & a_{i-1} & a_{i-2} & \ddots \\ a_{i+2} & a_{i+1} & a_i & a_{i-1} & \ddots \\ a_{i+3} & a_{i+2} & a_{i+1} & a_i & \ddots \\ \vdots & \ddots & \ddots & \ddots & \ddots \end{bmatrix}, \quad \mathbf{A}_H = \begin{bmatrix} a_{i-3} & a_{i-2} & a_{i-1} & a_i & \cdots \\ a_{i-2} & a_{i-1} & a_i & a_{i+1} & \iddots \\ a_{i-1} & a_i & a_{i+1} & a_{i+2} & \iddots \\ a_i & a_{i+1} & a_{i+2} & a_{i+3} & \iddots \\ \vdots & \iddots & \iddots & \iddots & \ddots \end{bmatrix} \qquad (A.14)$$

are Toeplitz and Hankel matrices, respectively.
Given a vector $\mathbf{x} = [\,x(0)\ \cdots\ x(M-1)\,]^T$, a special kind of Toeplitz/Hankel matrix, called a circulant matrix, is obtained by rotating the elements of $\mathbf{x}$ for each column (or row), as in

$$\mathbf{A}_T = \begin{bmatrix} x(0) & x(3) & x(2) & x(1) \\ x(1) & x(0) & x(3) & x(2) \\ x(2) & x(1) & x(0) & x(3) \\ x(3) & x(2) & x(1) & x(0) \end{bmatrix}, \quad \mathbf{A}_H = \begin{bmatrix} x(0) & x(1) & x(2) & x(3) \\ x(1) & x(2) & x(3) & x(0) \\ x(2) & x(3) & x(0) & x(1) \\ x(3) & x(0) & x(1) & x(2) \end{bmatrix} \qquad (A.15)$$

Remark Circulant matrices are important in DSP because they are diagonalized by the discrete Fourier transform, computable with a fast FFT algorithm.
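The diagonalization of a circulant matrix by the DFT can be verified numerically. The following sketch (an added illustration, assuming NumPy; the unnormalized DFT matrix convention is an assumption of this example) builds the circulant matrix of (A.15) and checks that conjugating it by the DFT matrix yields a diagonal matrix whose entries are the FFT of $\mathbf{x}$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
N = len(x)

# Circulant matrix: column k is x rotated down by k positions, as in (A.15).
C = np.column_stack([np.roll(x, k) for k in range(N)])

# Unnormalized DFT matrix: F[j, k] = exp(-2*pi*1j*j*k/N).
F = np.fft.fft(np.eye(N))

# F C F^{-1} is diagonal, with the eigenvalues given by the DFT of x.
D = F @ C @ np.linalg.inv(F)
assert np.allclose(D, np.diag(np.fft.fft(x)), atol=1e-9)
```

So the eigenvalues of a circulant matrix are obtained with a single FFT of its first column, with no eigen-decomposition required.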
A.2.5 Some Basic Properties
The following fundamental properties are valid:

$$\begin{aligned} (\mathbf{ABC}\cdots)^{-1} &= \cdots\mathbf{C}^{-1}\mathbf{B}^{-1}\mathbf{A}^{-1} \\ (\mathbf{A}^H)^{-1} &= (\mathbf{A}^{-1})^H \\ (\mathbf{A}+\mathbf{B})^H &= \mathbf{A}^H + \mathbf{B}^H \\ (\mathbf{AB})^H &= \mathbf{B}^H\mathbf{A}^H \\ (\mathbf{ABC}\cdots)^H &= \cdots\mathbf{C}^H\mathbf{B}^H\mathbf{A}^H. \end{aligned} \qquad (A.16)$$
A.3 Inverse, Pseudoinverse, and Determinant of a Matrix
A.3.1 Inverse Matrix
A square matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times N}$ is called invertible or nonsingular if there exists a matrix $\mathbf{B} \in (\mathbb{R},\mathbb{C})^{N\times N}$ such that $\mathbf{BA} = \mathbf{I}$, where $\mathbf{I}_{N\times N}$ is the so-called identity matrix or unit matrix defined as $\mathbf{I} = \mathrm{diag}(1,1,\ldots,1)$. In such a case the matrix $\mathbf{B}$ is uniquely determined by $\mathbf{A}$ and is defined as the inverse of $\mathbf{A}$, indicated as $\mathbf{A}^{-1}$ (so that $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$).

Note that if $\mathbf{A}$ is nonsingular the system of equations

$$\mathbf{Ax} = \mathbf{b} \qquad (A.17)$$

has a unique solution, given by $\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$.
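As a small numerical illustration (added here, assuming NumPy), the unique solution of (A.17) for a nonsingular matrix can be computed either with a solver or with the explicit inverse; the two agree, although in practice a solver is preferred for numerical robustness:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])         # nonsingular: det(A) = 5
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)          # unique solution of A x = b
assert np.allclose(A @ x, b)
assert np.allclose(x, np.linalg.inv(A) @ b)   # same as x = A^{-1} b
```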
A.3.2 Generalized Inverse or Pseudoinverse of a Matrix
The generalized inverse or Moore–Penrose pseudoinverse of a matrix represents a general way to determine the solution of a linear real or complex system of equations of the type (A.17), in the case of $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$, $\mathbf{x} \in (\mathbb{R},\mathbb{C})^{M\times 1}$, $\mathbf{b} \in (\mathbb{R},\mathbb{C})^{N\times 1}$. In general terms, considering a generic matrix $\mathbf{A}_{N\times M}$, we can define its pseudoinverse $\mathbf{A}^{\#}_{M\times N}$ as a matrix such that the following four properties hold:

$$\mathbf{A}\mathbf{A}^{\#}\mathbf{A} = \mathbf{A}, \qquad \mathbf{A}^{\#}\mathbf{A}\mathbf{A}^{\#} = \mathbf{A}^{\#} \qquad (A.18)$$

and

$$\mathbf{A}\mathbf{A}^{\#} = (\mathbf{A}\mathbf{A}^{\#})^H, \qquad \mathbf{A}^{\#}\mathbf{A} = (\mathbf{A}^{\#}\mathbf{A})^H. \qquad (A.19)$$
Given a linear system (A.17), for its solution we can distinguish the following three cases:

$$\mathbf{A}^{\#} = \begin{cases} \mathbf{A}^{-1} & N = M, \text{ square matrix} \\ \mathbf{A}^H(\mathbf{A}\mathbf{A}^H)^{-1} & N < M, \text{ “fat” matrix} \\ (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H & N > M, \text{ “tall” matrix} \end{cases} \qquad (A.20)$$

whereby the solution of the system (A.17) may always be expressed as

$$\mathbf{x} = \mathbf{A}^{\#}\mathbf{b}. \qquad (A.21)$$
The proof of (A.20) for the cases of a square and of a fat matrix is immediate. The case of a tall matrix can be easily demonstrated after the introduction of the SVD decomposition presented below. A different method for calculating the pseudoinverse relies on suitable decompositions of the matrix $\mathbf{A}$.
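The tall and fat cases of (A.20) can be checked against a library pseudoinverse. The following sketch (an added illustration, assuming NumPy and full-rank random matrices) verifies both closed forms:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Tall" matrix, N > M: A# = (A^H A)^{-1} A^H is a left inverse.
A_tall = rng.standard_normal((6, 3))
A_tall_pinv = np.linalg.inv(A_tall.T @ A_tall) @ A_tall.T
assert np.allclose(A_tall_pinv, np.linalg.pinv(A_tall))
assert np.allclose(A_tall_pinv @ A_tall, np.eye(3))

# "Fat" matrix, N < M: A# = A^H (A A^H)^{-1} is a right inverse.
A_fat = rng.standard_normal((3, 6))
A_fat_pinv = A_fat.T @ np.linalg.inv(A_fat @ A_fat.T)
assert np.allclose(A_fat_pinv, np.linalg.pinv(A_fat))
assert np.allclose(A_fat @ A_fat_pinv, np.eye(3))
```

Random Gaussian matrices are full rank with probability one, which is why the explicit formulas apply here; `np.linalg.pinv` itself uses the SVD and also covers the rank-deficient case.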
A.3.3 Determinant
Given a square matrix $\mathbf{A}_{N\times N}$, the determinant, indicated as $\det(\mathbf{A})$ or $\Delta_A$, is a scalar value associated with the matrix itself, which summarizes some of its fundamental properties and is calculated by the following rule.

If $\mathbf{A} = a \in \mathbb{R}^{1\times 1}$, by definition the determinant is $\det(\mathbf{A}) = a$. The determinant of a square matrix $\mathbf{A} \in \mathbb{R}^{N\times N}$ is defined in terms of determinants of order $N-1$ with the following recursive expression:

$$\det(\mathbf{A}) = \sum_{j=1}^{N} a_{ij}(-1)^{i+j}\det(\mathbf{A}_{ij}), \qquad (A.22)$$

where $\mathbf{A}_{ij} \in \mathbb{R}^{(N-1)\times(N-1)}$ is the matrix obtained by eliminating the $i$th row and the $j$th column of $\mathbf{A}$.

Moreover, it should be noted that the value $\det(\mathbf{A}_{ij})$ is called the complementary minor of $a_{ij}$, and the product $(-1)^{i+j}\det(\mathbf{A}_{ij})$ is called the algebraic complement (cofactor) of the element $a_{ij}$.
Property Given the matrices $\mathbf{A}_{N\times N}$ and $\mathbf{B}_{N\times N}$, the following properties are valid:

$$\begin{aligned} \det(\mathbf{A}) &= \prod_i \lambda_i, \quad \lambda_i = \mathrm{eig}(\mathbf{A}) \\ \det(\mathbf{AB}) &= \det(\mathbf{A})\det(\mathbf{B}) \\ \det(\mathbf{A}^H) &= \det(\mathbf{A})^* \\ \det(\mathbf{A}^{-1}) &= 1/\det(\mathbf{A}) \\ \det(c\mathbf{A}) &= c^N\det(\mathbf{A}) \\ \det(\mathbf{I} + \mathbf{a}\mathbf{b}^H) &= 1 + \mathbf{b}^H\mathbf{a} \\ \det(\mathbf{I} + \delta\mathbf{A}) &\cong 1 + \delta\,\mathrm{Tr}(\mathbf{A}) + \tfrac{1}{2}\delta^2\,\mathrm{Tr}(\mathbf{A})^2 - \tfrac{1}{2}\delta^2\,\mathrm{Tr}(\mathbf{A}^2) \quad \text{for small } \delta. \end{aligned} \qquad (A.23)$$
A matrix $\mathbf{A}_{N\times N}$ with $\det(\mathbf{A}) \neq 0$ is called nonsingular and is always invertible. Note that the determinant of a diagonal or triangular matrix is the product of the values on the diagonal.
A.3.4 Matrix Inversion Lemma
Very useful in the development of adaptive algorithms, the matrix inversion lemma (MIL), also known as the Sherman–Morrison–Woodbury formula [1, 2], states that if $\mathbf{A}^{-1}$ and $\mathbf{C}^{-1}$ exist, the following algebraically verifiable equation is true¹:

$$[\mathbf{A} + \mathbf{BCD}]^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\left(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B}\right)^{-1}\mathbf{D}\mathbf{A}^{-1}, \qquad (A.24)$$

where $\mathbf{A} \in \mathbb{C}^{M\times M}$, $\mathbf{B} \in \mathbb{C}^{M\times N}$, $\mathbf{C} \in \mathbb{C}^{N\times N}$, and $\mathbf{D} \in \mathbb{C}^{N\times M}$. Note that (A.24) has numerous variants, the first of which, for simplicity, is that for $\mathbf{D} = \mathbf{B}^H$:

$$[\mathbf{A} + \mathbf{B}\mathbf{C}\mathbf{B}^H]^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\left(\mathbf{C}^{-1} + \mathbf{B}^H\mathbf{A}^{-1}\mathbf{B}\right)^{-1}\mathbf{B}^H\mathbf{A}^{-1} \qquad (A.25)$$

Kailath's variant is defined for $\mathbf{D} = \mathbf{I}$, in which (A.24) takes the form

$$[\mathbf{A} + \mathbf{BC}]^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\left(\mathbf{I} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B}\right)^{-1}\mathbf{C}\mathbf{A}^{-1} \qquad (A.26)$$

A further variant of the previous one arises when the matrices $\mathbf{B}$ and $\mathbf{D}$ are vectors, i.e., for $\mathbf{B} \to \mathbf{b} \in \mathbb{C}^{M\times 1}$, $\mathbf{D} \to \mathbf{d}^H \in \mathbb{C}^{1\times M}$, and $\mathbf{C} = \mathbf{I}$, for which (A.24) becomes

$$[\mathbf{A} + \mathbf{b}\mathbf{d}^H]^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1}\mathbf{b}\mathbf{d}^H\mathbf{A}^{-1}}{1 + \mathbf{d}^H\mathbf{A}^{-1}\mathbf{b}}. \qquad (A.27)$$

A case of particular interest in adaptive filtering is when in the above we have $\mathbf{d} = \mathbf{b}$.

In all the variants the inverse of the sum $\mathbf{A} + \mathbf{BCD}$ is a function of the inverse of the matrix $\mathbf{A}$. It should also be noted that the term that appears in the denominator of (A.27) is a scalar value.
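The lemma (A.24) lends itself to a direct numerical check. The sketch below (an added illustration, assuming NumPy; the diagonal shifts only keep the random matrices well conditioned) compares the left- and right-hand sides:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 5, 2
A = rng.standard_normal((M, M)) + M * np.eye(M)   # well-conditioned M x M
B = rng.standard_normal((M, N))
C = rng.standard_normal((N, N)) + N * np.eye(N)   # well-conditioned N x N
D = rng.standard_normal((N, M))

inv = np.linalg.inv
lhs = inv(A + B @ C @ D)
rhs = inv(A) - inv(A) @ B @ inv(inv(C) + D @ inv(A) @ B) @ D @ inv(A)
assert np.allclose(lhs, rhs)
```

In adaptive filtering this identity is what turns an $O(M^3)$ re-inversion of a rank-one-updated correlation matrix into an $O(M^2)$ update, as in the RLS algorithm.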
¹ The algebraic verification can be done by developing the following expression:

$$[\mathbf{A} + \mathbf{BCD}]\left[\mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{D}\mathbf{A}^{-1}\right] = \mathbf{I} + \mathbf{BCD}\mathbf{A}^{-1} - \mathbf{B}(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{D}\mathbf{A}^{-1} - \mathbf{BCD}\mathbf{A}^{-1}\mathbf{B}(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{D}\mathbf{A}^{-1} = \cdots = \mathbf{I}.$$

A.4 Inner and Outer Product of Vectors

Given two vectors $\mathbf{x} \in (\mathbb{R},\mathbb{C})^{N\times 1}$ and $\mathbf{w} \in (\mathbb{R},\mathbb{C})^{N\times 1}$, we define their inner product (also called scalar product or, sometimes, dot product), indicated as $\langle\mathbf{x},\mathbf{w}\rangle \in (\mathbb{R},\mathbb{C})$, as

$$\langle\mathbf{x},\mathbf{w}\rangle = \mathbf{x}^H\mathbf{w} = \sum_{i=1}^{N} x^*_i w_i. \qquad (A.28)$$
The outer product between two vectors $\mathbf{x} \in (\mathbb{R},\mathbb{C})^{M\times 1}$ and $\mathbf{w} \in (\mathbb{R},\mathbb{C})^{N\times 1}$, denoted as $\rangle\mathbf{x},\mathbf{w}\langle\; \in (\mathbb{R},\mathbb{C})^{M\times N}$, is a matrix defined by the product

$$\rangle\mathbf{x},\mathbf{w}\langle\; = \mathbf{x}\mathbf{w}^H = \begin{bmatrix} x_1w^*_1 & \cdots & x_1w^*_N \\ \vdots & \ddots & \vdots \\ x_Mw^*_1 & \cdots & x_Mw^*_N \end{bmatrix}_{M\times N}. \qquad (A.29)$$
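Both products of (A.28) and (A.29) are available in NumPy; the following sketch (an added illustration; note that `np.vdot` conjugates its first argument, matching $\mathbf{x}^H\mathbf{w}$) shows them on complex vectors:

```python
import numpy as np

x = np.array([1 + 1j, 2.0, 3.0])
w = np.array([2.0, 1j, 1.0])

inner = np.vdot(x, w)             # x^H w: conjugates x, as in (A.28)
assert np.isclose(inner, np.sum(np.conj(x) * w))

outer = np.outer(x, np.conj(w))   # x w^H: an M x N matrix, as in (A.29)
assert outer.shape == (3, 3)
assert np.isclose(outer[0, 1], x[0] * np.conj(w[1]))
```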
Given two matrices $\mathbf{A}_{N\times M}$ and $\mathbf{B}_{P\times M}$, represented by their respective column vectors as

$$\mathbf{A} = [\,\mathbf{a}_{:1}\ \mathbf{a}_{:2}\ \cdots\ \mathbf{a}_{:M}\,]_{N\times M}, \qquad \mathbf{B} = [\,\mathbf{b}_{:1}\ \mathbf{b}_{:2}\ \cdots\ \mathbf{b}_{:M}\,]_{P\times M} \qquad (A.30)$$

with $\mathbf{a}_{:j} = [\,a_{1j}\ a_{2j}\ \cdots\ a_{Nj}\,]^T$ and $\mathbf{b}_{:j} = [\,b_{1j}\ b_{2j}\ \cdots\ b_{Pj}\,]^T$, we define the matrix outer product as

$$\mathbf{A}\mathbf{B}^H \in (\mathbb{R},\mathbb{C})^{N\times P} = \sum_{i=1}^{M}\mathbf{a}_{:i}\mathbf{b}^H_{:i}. \qquad (A.31)$$

Note that the above expression is the sum of the outer products of the column vectors of the respective matrices.
A.4.1 Geometric Interpretation
The inner product of a vector with itself, $\mathbf{x}^H\mathbf{x}$, is often written as

$$\|\mathbf{x}\|^2_2 \triangleq \langle\mathbf{x},\mathbf{x}\rangle = \mathbf{x}^H\mathbf{x} \qquad (A.32)$$

which, as better specified below, corresponds to the square of the vector's length in a Euclidean space. Moreover, in Euclidean geometry, the inner product of vectors expressed in an orthonormal basis is related to their lengths and angle.

Let $\|\mathbf{x}\| \triangleq \sqrt{\|\mathbf{x}\|^2_2}$ be the length of $\mathbf{x}$; if $\mathbf{w}$ is another vector and $\theta$ is the angle between $\mathbf{x}$ and $\mathbf{w}$, we have

$$\mathbf{x}^H\mathbf{w} = \|\mathbf{x}\|\cdot\|\mathbf{w}\|\cos\theta. \qquad (A.33)$$
A.5 Linearly Independent Vectors
Given a set of vectors $\{\mathbf{a}_i\}$, $\mathbf{a}_i \in (\mathbb{R},\mathbb{C})^P$, $\forall i$, $i = 1, \ldots, N$, and a set of scalars $c_1, c_2, \ldots, c_N$, we define the vector $\mathbf{b} \in (\mathbb{R},\mathbb{C})^P$ as a linear combination of the vectors $\{\mathbf{a}_i\}$:

$$\mathbf{b} = \sum_{i=1}^{N}c_i\mathbf{a}_i. \qquad (A.34)$$

The vectors $\{\mathbf{a}_i\}$ are defined as linearly independent if, and only if, (A.34) is zero only in the case that all the scalars $c_i$ are zero. Equivalently, the vectors are called linearly dependent if, for a set of scalars $c_1, c_2, \ldots, c_N$ not all zero,

$$\sum_{i=1}^{N}c_i\mathbf{a}_i = \mathbf{0}. \qquad (A.35)$$

Note that the columns of a matrix $\mathbf{A}$ are linearly independent if, and only if, the matrix $(\mathbf{A}^H\mathbf{A})$ is nonsingular or, as explained in the next section, $\mathbf{A}$ is a full-rank matrix. Similarly, the rows of the matrix $\mathbf{A}$ are linearly independent if, and only if, $(\mathbf{A}\mathbf{A}^H)$ is nonsingular.
A.6 Rank and Subspaces Associated with a Matrix
Given $\mathbf{A}_{N\times M}$, the rank of the matrix, indicated as $r = \mathrm{rank}(\mathbf{A})$, is defined as the maximum number of its linearly independent columns. Note that $\mathrm{rank}(\mathbf{A}) = \mathrm{rank}(\mathbf{A}^H)$; it follows that a matrix is called a reduced-rank matrix when $\mathrm{rank}(\mathbf{A}) < \min(N,M)$ and a full-rank matrix when $\mathrm{rank}(\mathbf{A}) = \min(N,M)$. It also holds that

$$\mathrm{rank}(\mathbf{A}) = \mathrm{rank}(\mathbf{A}^H) = \mathrm{rank}(\mathbf{A}^H\mathbf{A}) = \mathrm{rank}(\mathbf{A}\mathbf{A}^H). \qquad (A.36)$$
A.6.1 Range or Column Space of a Matrix
We define the column space of a matrix $\mathbf{A}_{N\times M}$ (also called range or image), indicated as $R(\mathbf{A})$ or $\mathrm{Im}(\mathbf{A})$, as the subspace obtained from the set of all possible linear combinations of its linearly independent column vectors. So, calling $\mathbf{A} = [\,\mathbf{a}_1\ \cdots\ \mathbf{a}_M\,]$ the column partition of the matrix, $R(\mathbf{A})$ represents the linear span² (also called the linear hull) of the set of column vectors in a vector space:

$$R(\mathbf{A}) \triangleq \mathrm{span}\{\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_M\} = \left\{\mathbf{y}\in\mathbb{R}^N : \mathbf{y} = \mathbf{Ax}, \text{ for some } \mathbf{x}\in\mathbb{R}^M\right\}. \qquad (A.37)$$

Moreover, calling $\mathbf{A} = [\,\mathbf{b}_1\ \cdots\ \mathbf{b}_N\,]^H$ the row partition of the matrix, the dual definition is

$$R(\mathbf{A}^H) \triangleq \mathrm{span}\{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_N\} = \left\{\mathbf{x}\in\mathbb{R}^M : \mathbf{x} = \mathbf{A}^H\mathbf{y}, \text{ for some } \mathbf{y}\in\mathbb{R}^N\right\}. \qquad (A.38)$$

It follows from the previous definition that the rank of $\mathbf{A}$ is equal to the dimension of its column space:

$$\mathrm{rank}(\mathbf{A}) = \dim(R(\mathbf{A})). \qquad (A.39)$$

² The term $\mathrm{span}(\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_n)$ denotes the set of all vectors, i.e., the space, that can be represented as linear combinations of $\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_n$.
A.6.2 Kernel or Nullspace of a Matrix
The kernel or nullspace of a matrix $\mathbf{A}_{N\times M}$, indicated as $N(\mathbf{A})$ or $\mathrm{Ker}(\mathbf{A})$, is the set of all vectors $\mathbf{x}$ for which $\mathbf{Ax} = \mathbf{0}$. More formally,

$$N(\mathbf{A}) \triangleq \left\{\mathbf{x}\in(\mathbb{R},\mathbb{C})^M : \mathbf{Ax} = \mathbf{0}\right\}. \qquad (A.40)$$

Similarly, the dual definition of the left nullspace is

$$N(\mathbf{A}^H) \triangleq \left\{\mathbf{y}\in(\mathbb{R},\mathbb{C})^N : \mathbf{A}^H\mathbf{y} = \mathbf{0}\right\}. \qquad (A.41)$$

The dimension of the kernel is called the nullity of the matrix:

$$\mathrm{null}(\mathbf{A}) = \dim(N(\mathbf{A})). \qquad (A.42)$$

In fact, the expression $\mathbf{Ax} = \mathbf{0}$ is a homogeneous system of linear equations, and the kernel is the span of the solutions of that system. Whereby, calling $\mathbf{A} = [\,\mathbf{a}_1\ \cdots\ \mathbf{a}_N\,]^H$ the row partition of $\mathbf{A}$, the product $\mathbf{Ax} = \mathbf{0}$ can be expressed as

$$\mathbf{Ax} = \mathbf{0} \iff \begin{bmatrix} \mathbf{a}^H_1\mathbf{x} \\ \mathbf{a}^H_2\mathbf{x} \\ \vdots \\ \mathbf{a}^H_N\mathbf{x} \end{bmatrix} = \mathbf{0} \qquad (A.43)$$

It follows that $\mathbf{x} \in N(\mathbf{A})$ if, and only if, $\mathbf{x}$ is orthogonal to the space described by the row vectors of $\mathbf{A}$, or

$$\mathbf{x} \perp \mathrm{span}\{\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_N\}$$

Namely, a vector $\mathbf{x}$ lies in the nullspace of $\mathbf{A}$ iff it is perpendicular to every vector in the row space of $\mathbf{A}$. In other words, the row space of the matrix $\mathbf{A}$ is orthogonal to its nullspace: $R(\mathbf{A}^H) \perp N(\mathbf{A})$.
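A nullspace basis can be extracted numerically from the SVD (introduced formally later in this appendix). The sketch below (an added illustration, assuming NumPy) builds a rank-deficient matrix, recovers its nullspace, and checks the orthogonality just stated:

```python
import numpy as np

# Rank-deficient matrix: the third row is the sum of the first two.
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [1.0, 3.0, 4.0]])

U, s, Vh = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))          # numerical rank
null_basis = Vh[r:].T               # remaining right singular vectors span N(A)

assert r == 2
assert np.allclose(A @ null_basis, 0, atol=1e-10)    # basis vectors are in N(A)
assert r + null_basis.shape[1] == A.shape[1]         # rank plus nullity equals M
```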
A.6.3 Rank–Nullity Theorem
For any matrix $\mathbf{A}_{N\times M}$,

$$\dim(N(\mathbf{A})) + \dim(R(\mathbf{A})) = \mathrm{null}(\mathbf{A}) + \mathrm{rank}(\mathbf{A}) = M. \qquad (A.44)$$

The above equation is known as the rank–nullity theorem.
A.6.4 The Four Fundamental Matrix Subspaces

When the matrix $\mathbf{A}_{N\times M}$ is full rank, i.e., $r = \mathrm{rank}(\mathbf{A}) = \min(N,M)$, the matrix always admits a left inverse $\mathbf{B}$ or a right inverse $\mathbf{C}$ or, in the case of $N = M$, admits the inverse $\mathbf{A}^{-1}$.

As a corollary, it is appropriate to recall the fundamental subspaces definable for a matrix $\mathbf{A}_{N\times M}$:

1. Column space of $\mathbf{A}$: indicated as $R(\mathbf{A})$, it is defined by the span of the columns of $\mathbf{A}$.
2. Nullspace of $\mathbf{A}$: indicated as $N(\mathbf{A})$, it contains all vectors $\mathbf{x}$ such that $\mathbf{Ax} = \mathbf{0}$.
3. Row space of $\mathbf{A}$: equivalent to the column space of $\mathbf{A}^H$, indicated as $R(\mathbf{A}^H)$, it is defined by the span of the rows of $\mathbf{A}$.
4. Left nullspace of $\mathbf{A}$: equivalent to the nullspace of $\mathbf{A}^H$, indicated as $N(\mathbf{A}^H)$, it contains all vectors $\mathbf{x}$ such that $\mathbf{A}^H\mathbf{x} = \mathbf{0}$.

Indicating with $R^\perp(\mathbf{A})$ and $N^\perp(\mathbf{A})$ the orthogonal complements, respectively, of $R(\mathbf{A})$ and $N(\mathbf{A})$, the following relations are valid (Fig. A.1):

$$R(\mathbf{A}) = N^\perp(\mathbf{A}^H), \qquad N(\mathbf{A}) = R^\perp(\mathbf{A}^H) \qquad (A.45)$$

and the dual

$$R^\perp(\mathbf{A}) = N(\mathbf{A}^H), \qquad N^\perp(\mathbf{A}) = R(\mathbf{A}^H). \qquad (A.46)$$
A.6.5 Projection Matrix
A square matrix $\mathbf{P} \in (\mathbb{R},\mathbb{C})^{N\times N}$ is defined as a projection operator iff $\mathbf{P}^2 = \mathbf{P}$, i.e., it is idempotent. If $\mathbf{P}$ is symmetric, then the projection is orthogonal. Furthermore, if $\mathbf{P}$ is a projection matrix, so is $(\mathbf{I} - \mathbf{P})$.

Examples of orthogonal projection matrices are the matrices associated with the pseudoinverse $\mathbf{A}^{\#}$ in the over- and under-determined cases.

In the overdetermined case, $N > M$ and $\mathbf{A}^{\#} = (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H$, we have

$$\mathbf{P} = \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H, \quad \text{projection operator} \qquad (A.47)$$

$$\mathbf{P}^\perp = \mathbf{I} - \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H, \quad \text{orthogonal-complement projection operator} \qquad (A.48)$$

such that $\mathbf{P} + \mathbf{P}^\perp = \mathbf{I}$, i.e., $\mathbf{P}$ projects a vector onto the subspace $\Psi = R(\mathbf{A})$, while $\mathbf{P}^\perp$ projects onto its orthogonal complement $\Psi^\perp = R^\perp(\mathbf{A}) = N(\mathbf{A}^H)$. Indeed, calling $\mathbf{x} \in (\mathbb{R},\mathbb{C})^{M\times 1}$ and $\mathbf{y} \in (\mathbb{R},\mathbb{C})^{N\times 1}$, such that $\mathbf{Ax} = \mathbf{y}$, we have that $\mathbf{Py} = \mathbf{u}$ and $\mathbf{P}^\perp\mathbf{y} = \mathbf{v}$, with $\mathbf{u} \in R(\mathbf{A})$ and $\mathbf{v} \in N(\mathbf{A}^H)$ (see Fig. A.2).

In the underdetermined case, where $N < M$ and $\mathbf{A}^{\#} = \mathbf{A}^H(\mathbf{A}\mathbf{A}^H)^{-1}$, we have

$$\mathbf{P} = \mathbf{A}^H(\mathbf{A}\mathbf{A}^H)^{-1}\mathbf{A} \qquad (A.49)$$

$$\mathbf{P}^\perp = \mathbf{I} - \mathbf{A}^H(\mathbf{A}\mathbf{A}^H)^{-1}\mathbf{A}. \qquad (A.50)$$
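The defining properties of the overdetermined-case projectors (A.47)–(A.48) can be sketched numerically as follows (an added illustration, assuming NumPy and a real, full-rank $\mathbf{A}$, so that $\mathbf{A}^H = \mathbf{A}^T$):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 2))             # tall full-rank matrix, N > M

P = A @ np.linalg.inv(A.T @ A) @ A.T        # projector onto R(A), as in (A.47)
P_perp = np.eye(6) - P                      # orthogonal-complement projector

assert np.allclose(P @ P, P)                # idempotent: P^2 = P
assert np.allclose(P + P_perp, np.eye(6))   # P + P_perp = I

y = rng.standard_normal(6)
u, v = P @ y, P_perp @ y                    # u in R(A), v in N(A^H)
assert np.isclose(u @ v, 0)                 # the two components are orthogonal
```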
A.7 Orthogonality and Unitary Matrices
In DSP, the conditions of orthogonality, orthonormality, and bi-orthogonality represent tools of primary importance. Here are some basic definitions.
Fig. A.1 The four subspaces associated with the matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$. These subspaces determine an orthogonal decomposition of the space $(\mathbb{R},\mathbb{C})^M$ into the row space $R(\mathbf{A}^H)$ and the nullspace $N(\mathbf{A})$ and, similarly, an orthogonal decomposition of $(\mathbb{R},\mathbb{C})^N$ into the column space $R(\mathbf{A})$ and the left nullspace $N(\mathbf{A}^H)$.
A.7.1 Orthogonality and Unitary Matrices
Two vectors $\mathbf{x}, \mathbf{w} \in (\mathbb{R},\mathbb{C})^N$ are orthogonal if their inner product is zero: $\langle\mathbf{x},\mathbf{w}\rangle = 0$. This situation is sometimes denoted as $\mathbf{x} \perp \mathbf{w}$.

A set of vectors $\{\mathbf{q}_i\}$, $\mathbf{q}_i \in (\mathbb{R},\mathbb{C})^N$, $\forall i$, $i = 1, \ldots, N$, is called orthogonal if

$$\mathbf{q}^H_i\mathbf{q}_j = 0 \quad \text{for } i \neq j \qquad (A.51)$$

A set of vectors $\{\mathbf{q}_i\}$ is called orthonormal if

$$\mathbf{q}^H_i\mathbf{q}_j = \delta_{ij} = \delta[i-j], \qquad (A.52)$$

where $\delta_{ij}$ is the Kronecker symbol, defined as $\delta_{ij} = 1$ for $i = j$ and $\delta_{ij} = 0$ for $i \neq j$. A matrix $\mathbf{Q}_{N\times N}$ is orthonormal if its columns form an orthonormal set of vectors. Formally,

$$\mathbf{Q}^H\mathbf{Q} = \mathbf{Q}\mathbf{Q}^H = \mathbf{I}. \qquad (A.53)$$

Note that in the case of orthonormality $\mathbf{Q}^{-1} = \mathbf{Q}^H$. Moreover, a matrix for which $\mathbf{Q}^H\mathbf{Q} = \mathbf{Q}\mathbf{Q}^H$ is defined as a normal matrix.

An important property of an orthonormal matrix is that it has no effect on the inner product:

$$\langle\mathbf{Qx},\mathbf{Qy}\rangle = (\mathbf{Qx})^H\mathbf{Qy} = \mathbf{x}^H\mathbf{Q}^H\mathbf{Qy} = \langle\mathbf{x},\mathbf{y}\rangle. \qquad (A.54)$$

Furthermore, when multiplied by a vector it does not change its length:

$$\|\mathbf{Qx}\|^2_2 = (\mathbf{Qx})^H\mathbf{Qx} = \mathbf{x}^H\mathbf{Q}^H\mathbf{Qx} = \|\mathbf{x}\|^2_2. \qquad (A.55)$$
Fig. A.2 Representation of the orthogonal projection operators $\mathbf{P} = \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H$ and $\mathbf{P}^\perp = \mathbf{I} - \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H$: the vector $\mathbf{y}$ is decomposed into $\mathbf{u} = \mathbf{Py} \in \Psi = R(\mathbf{A})$ and $\mathbf{v} = \mathbf{P}^\perp\mathbf{y} \in \Sigma = R^\perp(\mathbf{A}) = N(\mathbf{A}^H)$.
A.7.2 Bi-Orthogonality and Bi-Orthogonal Bases
Given two matrices $\mathbf{Q}$ and $\mathbf{P}$, not necessarily square, these are called bi-orthogonal if

$$\mathbf{Q}^H\mathbf{P} = \mathbf{P}^H\mathbf{Q} = \mathbf{I}. \qquad (A.56)$$

Moreover, note that in the case of bi-orthonormality $\mathbf{Q}^H = \mathbf{P}^{-1}$ and $\mathbf{P}^H = \mathbf{Q}^{-1}$.

The pair of vector sets $\{\mathbf{q}_i,\mathbf{p}_j\}$ represents a bi-orthogonal basis if, and only if, both of the following propositions are valid:

1. For each $i, j \in \mathbb{Z}$,

$$\langle\mathbf{q}_i,\mathbf{p}_j\rangle = \delta[i-j] \qquad (A.57)$$

2. There exist $A, B, \tilde{A}, \tilde{B} \in \mathbb{R}^+$ such that, $\forall\,\mathbf{x} \in E$, the following inequalities are valid:

$$A\|\mathbf{x}\|^2 \leq \sum_k |\langle\mathbf{q}_k,\mathbf{x}\rangle|^2 \leq B\|\mathbf{x}\|^2 \qquad (A.58)$$

$$\tilde{A}\|\mathbf{x}\|^2 \leq \sum_k |\langle\mathbf{p}_k,\mathbf{x}\rangle|^2 \leq \tilde{B}\|\mathbf{x}\|^2. \qquad (A.59)$$

The pairs of vectors that satisfy condition (1.) and the inequalities (2.) are called Riesz bases [2], for which the following expansion formulas apply:

$$\mathbf{x} = \sum_k \langle\mathbf{q}_k,\mathbf{x}\rangle\mathbf{p}_k = \sum_k \langle\mathbf{p}_k,\mathbf{x}\rangle\mathbf{q}_k. \qquad (A.60)$$

Comparing the previous inequalities with (A.52), we observe that the term bi-orthogonal is used when the basis $\{\mathbf{q}_i\}$ is non-orthogonal and is associated with a dual basis $\{\mathbf{p}_j\}$ that satisfies the condition (A.57). If $\{\mathbf{p}_i\}$ were orthogonal, the expansion (A.60) would be the usual orthogonal expansion.
A.7.3 Paraunitary Matrix
A matrix $\mathbf{Q} \in (\mathbb{R},\mathbb{C})^{N\times M}$ is called a paraunitary matrix if

$$\mathbf{Q}^{\#} = \mathbf{Q}^H \qquad (A.61)$$

In the case of a square matrix, then,

$$\mathbf{Q}^H\mathbf{Q} = c\mathbf{I} \qquad (A.62)$$
A.8 Eigenvalues and Eigenvectors
The eigenvalues of a square matrix $\mathbf{A}_{N\times N}$ are the solutions of the characteristic polynomial $p(\lambda)$, of order $N$, defined as

$$p(\lambda) \triangleq \det(\mathbf{A} - \lambda\mathbf{I}) = 0 \qquad (A.63)$$

so the eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_N\}$ of the matrix $\mathbf{A}$, denoted as $\lambda(\mathbf{A})$ or $\mathrm{eig}(\mathbf{A})$, are the roots of the characteristic polynomial $p(\lambda)$.

With each eigenvalue $\lambda$ is associated an eigenvector $\mathbf{q}$ defined by the equation

$$(\mathbf{A} - \lambda\mathbf{I})\mathbf{q} = \mathbf{0} \quad \text{or} \quad \mathbf{Aq} = \lambda\mathbf{q}. \qquad (A.64)$$

Consider a simple example with a real matrix $\mathbf{A}_{2\times 2}$ defined as

$$\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}. \qquad (A.65)$$

By (A.63) the characteristic polynomial is

$$\det(\mathbf{A} - \lambda\mathbf{I}) = \det\begin{bmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{bmatrix} = \lambda^2 - 4\lambda + 3 = 0 \qquad (A.66)$$

with two distinct real roots, $\lambda_1 = 1$ and $\lambda_2 = 3$, so that $\lambda_i(\mathbf{A}) = (1,3)$.
The eigenvector related to $\lambda_1 = 1$ is

$$\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} q_1 \\ q_2 \end{bmatrix} = \begin{bmatrix} q_1 \\ q_2 \end{bmatrix} \;\Rightarrow\; \mathbf{q}_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \qquad (A.67)$$

while the eigenvector related to $\lambda_2 = 3$ is

$$\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} q_1 \\ q_2 \end{bmatrix} = 3\begin{bmatrix} q_1 \\ q_2 \end{bmatrix} \;\Rightarrow\; \mathbf{q}_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \qquad (A.68)$$

The eigenvectors of a matrix $\mathbf{A}_{N\times N}$ are sometimes referred to as $\mathrm{eigenvect}(\mathbf{A})$.
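The worked example of (A.65)–(A.68) can be reproduced numerically (an added sketch, assuming NumPy; the eigenvectors returned by the library are normalized to unit length, so they are scaled versions of $[1,-1]^T$ and $[1,1]^T$):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # the matrix of (A.65)

eigvals, Q = np.linalg.eig(A)
assert np.allclose(np.sort(eigvals), [1.0, 3.0])   # roots of lambda^2 - 4 lambda + 3

# Each column q of Q satisfies the eigenvalue equation A q = lambda q of (A.64):
for lam, q in zip(eigvals, Q.T):
    assert np.allclose(A @ q, lam * q)
```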
A.8.1 Trace of Matrix
The trace of a matrix $\mathbf{A}_{N\times N}$ is defined as the sum of the elements on its main diagonal and, equivalently, is equal to the sum of its (complex) eigenvalues:

$$\mathrm{tr}[\mathbf{A}] = \sum_{i=1}^{N}a_{ii} = \sum_{i=1}^{N}\lambda_i. \qquad (A.69)$$

Moreover, we have that

$$\begin{aligned} \mathrm{tr}[\mathbf{A}+\mathbf{B}] &= \mathrm{tr}[\mathbf{A}] + \mathrm{tr}[\mathbf{B}] \\ \mathrm{tr}[\mathbf{A}] &= \mathrm{tr}[\mathbf{A}^H] \\ \mathrm{tr}[c\mathbf{A}] &= c\cdot\mathrm{tr}[\mathbf{A}] \\ \mathrm{tr}[\mathbf{ABC}] &= \mathrm{tr}[\mathbf{BCA}] = \mathrm{tr}[\mathbf{CAB}] \\ \mathbf{a}^H\mathbf{a} &= \mathrm{tr}[\mathbf{a}\mathbf{a}^H]. \end{aligned} \qquad (A.70)$$

Matrices admit the Frobenius inner product, which is analogous to the vector inner product. It is defined as the sum of the products of the corresponding components of two matrices $\mathbf{A}$ and $\mathbf{B}$ having the same size:

$$\langle\mathbf{A},\mathbf{B}\rangle = \sum_i\sum_j a_{ij}b_{ij} = \mathrm{tr}(\mathbf{A}^H\mathbf{B}) = \mathrm{tr}(\mathbf{A}\mathbf{B}^H).$$
A.9 Matrix Diagonalization
A matrix $\mathbf{A}_{N\times N}$ is called diagonalizable if there is an invertible matrix $\mathbf{Q}$ such that there exists a decomposition

$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1} \qquad (A.71)$$

or, equivalently,

$$\mathbf{\Lambda} = \mathbf{Q}^{-1}\mathbf{A}\mathbf{Q}. \qquad (A.72)$$

This is possible if, and only if, the matrix $\mathbf{A}$ has $N$ linearly independent eigenvectors; the matrix $\mathbf{Q}$, partitioned into column vectors as $\mathbf{Q} = [\,\mathbf{q}_1\ \mathbf{q}_2\ \cdots\ \mathbf{q}_N\,]$, is built with the independent eigenvectors of $\mathbf{A}$. In this case, $\mathbf{\Lambda}$ is the diagonal matrix built with the eigenvalues of $\mathbf{A}$, i.e., $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1,\lambda_2,\ldots,\lambda_N)$.
A.9.1 Diagonalization of a Normal Matrix
The matrix $\mathbf{A}_{N\times N}$ is said to be a normal matrix if $\mathbf{A}^H\mathbf{A} = \mathbf{A}\mathbf{A}^H$. A matrix $\mathbf{A}$ is normal iff it can be factorized as

$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^H \qquad (A.73)$$

where $\mathbf{Q}^H\mathbf{Q} = \mathbf{Q}\mathbf{Q}^H = \mathbf{I}$, $\mathbf{Q} = [\,\mathbf{q}_1\ \mathbf{q}_2\ \cdots\ \mathbf{q}_N\,]$, $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1,\lambda_2,\ldots,\lambda_N)$, and $\mathbf{\Lambda} = \mathbf{Q}^H\mathbf{A}\mathbf{Q}$.

The set of all eigenvalues of $\mathbf{A}$ is defined as the spectrum of the matrix. The radius of the spectrum, or spectral radius, is defined as the eigenvalue of maximum modulus:

$$\rho(\mathbf{A}) = \max_i |\mathrm{eig}(\mathbf{A})|. \qquad (A.74)$$
Property If the matrix $\mathbf{A}_{N\times N}$ is nonsingular, then all its eigenvalues are nonzero and the eigenvalues of the inverse matrix $\mathbf{A}^{-1}$ are the reciprocals of $\mathrm{eig}(\mathbf{A})$.

Property If the matrix $\mathbf{A}_{N\times N}$ is symmetric and positive semi-definite, then all its eigenvalues are real and nonnegative. So we have that:

1. The eigenvalues $\lambda_i$ of $\mathbf{A}$ are real and nonnegative:

$$\mathbf{q}^H_i\mathbf{A}\mathbf{q}_i = \lambda_i\mathbf{q}^H_i\mathbf{q}_i \;\Rightarrow\; \lambda_i = \frac{\mathbf{q}^H_i\mathbf{A}\mathbf{q}_i}{\mathbf{q}^H_i\mathbf{q}_i} \quad (\text{Rayleigh quotient}) \qquad (A.75)$$

2. The eigenvectors of $\mathbf{A}$ are orthogonal for distinct $\lambda_i$:

$$\mathbf{q}^H_i\mathbf{q}_j = 0, \quad \text{for } i \neq j \qquad (A.76)$$

3. The matrix $\mathbf{A}$ can be diagonalized as

$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^H \qquad (A.77)$$

where $\mathbf{Q} = [\,\mathbf{q}_1\ \mathbf{q}_2\ \cdots\ \mathbf{q}_N\,]$, $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1,\lambda_2,\ldots,\lambda_N)$, and $\mathbf{Q}$ is a unitary matrix, i.e., $\mathbf{Q}^H\mathbf{Q} = \mathbf{I}$.

4. An alternative representation of $\mathbf{A}$ is then

$$\mathbf{A} = \sum_{i=1}^{N}\lambda_i\mathbf{q}_i\mathbf{q}^H_i = \sum_{i=1}^{N}\lambda_i\mathbf{P}_i \qquad (A.78)$$

where the term $\mathbf{P}_i = \mathbf{q}_i\mathbf{q}^H_i$ is defined as a spectral projection.
A.10 Norms of Vectors and Matrices
A.10.1 Norm of Vectors
Given a vector $\mathbf{x} \in (\mathbb{R},\mathbb{C})^N$, its norm expresses its length relative to a vector space. In the case of a space of order $p$, called an $L_p$ space, the norm is indicated as $\|\mathbf{x}\|_{L_p}$ or $\|\mathbf{x}\|_p$ and is defined as

$$\|\mathbf{x}\|_p \triangleq \left[\sum_{i=1}^{N}|x_i|^p\right]^{1/p}, \quad \text{for } p \geq 1. \qquad (A.79)$$
$L_0$ norm The expression (A.79) remains meaningful for $0 < p < 1$; however, the result is not exactly a norm. For $p = 0$, (A.79) becomes

$$\|\mathbf{x}\|_0 \triangleq \lim_{p\to 0}\|\mathbf{x}\|^p_p = \sum_{i=1}^{N}|x_i|^0. \qquad (A.80)$$

Note that (A.80) is equal to the number of nonzero entries of the vector $\mathbf{x}$.
$L_1$ norm

$$\|\mathbf{x}\|_1 \triangleq \sum_{i=1}^{N}|x_i|, \quad L_1 \text{ norm}. \qquad (A.81)$$

The previous expression represents the sum of the absolute values of the elements of the vector $\mathbf{x}$.
$L_\infty$ norm For $p \to \infty$, (A.79) becomes

$$\|\mathbf{x}\|_\infty \triangleq \max_{i=1,\ldots,N}|x_i| \qquad (A.82)$$

called the uniform norm or maximum norm and denoted as $L_{\mathrm{inf}}$.
Euclidean or $L_2$ norm The Euclidean norm is defined for $p = 2$ and expresses the standard length of the vector:

$$\|\mathbf{x}\|_2 \triangleq \sqrt{\sum_{i=1}^{N}|x_i|^2} = \sqrt{\mathbf{x}^H\mathbf{x}}, \quad \text{Euclidean or } L_2 \text{ norm} \qquad (A.83)$$

$$\|\mathbf{x}\|^2_2 \triangleq \mathbf{x}^H\mathbf{x}, \quad \text{squared Euclidean norm} \qquad (A.84)$$

$$\|\mathbf{x}\|^2_{\mathbf{G}} \triangleq \left|\mathbf{x}^H\mathbf{G}\mathbf{x}\right|, \quad \text{squared weighted Euclidean norm,} \qquad (A.85)$$

where $\mathbf{G}$ is a diagonal weighting matrix.
Frobenius norm Coinciding, for vectors, with the $L_2$ norm, it is defined as

$$\|\mathbf{x}\|_F \triangleq \sqrt{\sum_{i=1}^{N}|x_i|^2}, \quad \text{Frobenius norm} \qquad (A.86)$$
Property Every norm satisfies the following properties:

1. $\|\mathbf{x}\| \geq 0$, where the equality holds only for $\mathbf{x} = \mathbf{0}$;
2. $\|\alpha\mathbf{x}\| = |\alpha|\,\|\mathbf{x}\|$, $\forall\alpha$;
3. $\|\mathbf{x}+\mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|$ (triangle inequality).

The distance between two vectors $\mathbf{x}$ and $\mathbf{y}$ is defined as

$$\|\mathbf{x}-\mathbf{y}\|_p \triangleq \left[\sum_{i=1}^{N}|x_i-y_i|^p\right]^{1/p}, \quad \text{for } p > 0. \qquad (A.87)$$

It is called distance or similarity measure in the Minkowski metric [1].
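The vector norms defined above map directly onto library calls; the following sketch (an added illustration, assuming NumPy) evaluates them on a small example:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

assert np.linalg.norm(x, 1) == 7.0          # L1: sum of absolute values
assert np.linalg.norm(x, 2) == 5.0          # L2: Euclidean length sqrt(9 + 16)
assert np.linalg.norm(x, np.inf) == 4.0     # Linf: largest absolute value
assert np.count_nonzero(x) == 2             # "L0": number of nonzero entries
```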
A.10.2 Norm of Matrices
The norms of a matrix, similar to the vector norms, may be defined in the following way. Given a matrix $\mathbf{A}_{N\times M}$:

$L_1$ norm

$$\|\mathbf{A}\|_1 \triangleq \max_j \sum_{i=1}^{N}|a_{ij}|, \quad L_1 \text{ norm} \qquad (A.88)$$

represents the column of $\mathbf{A}$ with the largest sum of absolute values.

Euclidean or $L_2$ norm The Euclidean norm is defined for $p = 2$ and equals the largest singular value of the matrix:

$$\|\mathbf{A}\|_2 \triangleq \sqrt{\lambda_{\max}}, \quad \lambda_{\max} = \max_{\lambda_i}\,\mathrm{eig}(\mathbf{A}^H\mathbf{A}) = \max_{\lambda_i}\,\mathrm{eig}(\mathbf{A}\mathbf{A}^H) \qquad (A.89)$$

$L_\infty$ norm

$$\|\mathbf{A}\|_\infty \triangleq \max_i \sum_{j=1}^{M}|a_{ij}|, \quad L_{\mathrm{inf}} \text{ norm} \qquad (A.90)$$

represents the row with the largest sum of absolute values.

Frobenius norm

$$\|\mathbf{A}\|_F \triangleq \sqrt{\sum_{i=1}^{N}\sum_{j=1}^{M}|a_{ij}|^2}, \quad \text{Frobenius norm} \qquad (A.91)$$
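The matrix norms (A.88)–(A.91) can likewise be checked against library implementations (an added sketch, assuming NumPy):

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

assert np.linalg.norm(A, 1) == 6.0             # max column sum: |-2| + |4| = 6
assert np.linalg.norm(A, np.inf) == 7.0        # max row sum: |3| + |4| = 7
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A, 2), s[0])  # spectral norm = largest sigma
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(1 + 4 + 9 + 16))
```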
A.11 Singular Value Decomposition Theorem
Given a matrix $\mathbf{X} \in (\mathbb{R},\mathbb{C})^{N\times M}$ with $K = \min(N,M)$ and rank $r \leq K$, there exist two orthonormal matrices $\mathbf{U} \in (\mathbb{R},\mathbb{C})^{N\times N}$ and $\mathbf{V} \in (\mathbb{R},\mathbb{C})^{M\times M}$, containing as columns, respectively, the eigenvectors of $\mathbf{X}\mathbf{X}^H$ and the eigenvectors of $\mathbf{X}^H\mathbf{X}$, namely,

$$\mathbf{U}_{N\times N} = \mathrm{eigenvect}(\mathbf{X}\mathbf{X}^H) = [\,\mathbf{u}_0\ \mathbf{u}_1\ \cdots\ \mathbf{u}_{N-1}\,] \qquad (A.92)$$

$$\mathbf{V}_{M\times M} = \mathrm{eigenvect}(\mathbf{X}^H\mathbf{X}) = [\,\mathbf{v}_0\ \mathbf{v}_1\ \cdots\ \mathbf{v}_{M-1}\,] \qquad (A.93)$$

such that the following equality is valid:

$$\mathbf{U}^H\mathbf{X}\mathbf{V} = \mathbf{\Sigma}, \qquad (A.94)$$

or, equivalently,

$$\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^H \qquad (A.95)$$

or

$$\mathbf{X}^H = \mathbf{V}\mathbf{\Sigma}^T\mathbf{U}^H. \qquad (A.96)$$

The expressions (A.94)–(A.96) represent the SVD decomposition of the matrix $\mathbf{X}$, shown graphically in Fig. A.3.
The matrix $\mathbf{\Sigma} \in \mathbb{R}^{N\times M}$ is characterized by the following structure:

$$K = \min(M,N): \quad \mathbf{\Sigma} = \begin{bmatrix} \mathbf{\Sigma}_K & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}; \qquad K = N = M: \quad \mathbf{\Sigma} = \mathbf{\Sigma}_K, \qquad (A.97)$$

where $\mathbf{\Sigma}_K \in \mathbb{R}^{K\times K}$ is a diagonal matrix containing the positive square roots of the eigenvalues of the matrix $\mathbf{X}^H\mathbf{X}$ (or $\mathbf{X}\mathbf{X}^H$), defined as singular values.³ In formal terms,

$$\mathbf{\Sigma}_K = \mathrm{diag}(\sigma_0,\sigma_1,\ldots,\sigma_{K-1}) \triangleq \sqrt{\mathrm{diag}\left(\mathrm{eig}(\mathbf{X}^H\mathbf{X})\right)} = \sqrt{\mathrm{diag}\left(\mathrm{eig}(\mathbf{X}\mathbf{X}^H)\right)}, \qquad (A.98)$$

where

$$\sigma_0 \geq \sigma_1 \geq \cdots \geq \sigma_{K-1} > 0 \quad \text{and} \quad \sigma_K = \cdots = \sigma_{N-1} = 0. \qquad (A.99)$$
Note that the singular values $\sigma_i$ of $\mathbf{X}$ are in descending order. Moreover, the column vectors $\mathbf{u}_i$ and $\mathbf{v}_i$ are defined, respectively, as the left singular vectors and right singular vectors of $\mathbf{X}$. Since $\mathbf{U}$ and $\mathbf{V}$ are orthogonal, it is easy to see that the matrix $\mathbf{X}$ can be written as the sum of rank-one products

$$\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^H = \sum_{i=0}^{K-1}\sigma_i\mathbf{u}_i\mathbf{v}^H_i. \qquad (A.100)$$

³ Remember that the nonzero eigenvalues of the matrices $\mathbf{X}^H\mathbf{X}$ and $\mathbf{X}\mathbf{X}^H$ are identical.
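The theorem and the rank-one expansion (A.100) can be verified numerically (an added sketch, assuming NumPy and a real data matrix, so $\mathbf{X}^H = \mathbf{X}^T$):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 3))

U, s, Vh = np.linalg.svd(X)          # X = U Sigma V^H, as in (A.95)
assert np.all(s[:-1] >= s[1:])       # singular values come in descending order

# Rebuild X from the rank-one expansion (A.100):
X_rebuilt = sum(s[i] * np.outer(U[:, i], Vh[i]) for i in range(len(s)))
assert np.allclose(X, X_rebuilt)

# Singular values are the square roots of the eigenvalues of X^H X, as in (A.98):
ev = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
assert np.allclose(s**2, ev)
```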
A.11.1 Subspaces of Matrix X and SVD
The SVD reveals important properties of the matrix $\mathbf{X}$. In fact, with $r = \mathrm{rank}(\mathbf{X})$, the first $r$ columns of $\mathbf{U}$ form an orthonormal basis of the column space $R(\mathbf{X})$, while the last $M - r$ columns of $\mathbf{V}$ form an orthonormal basis for the nullspace (or kernel) $N(\mathbf{X})$ of $\mathbf{X}$, i.e.,

$$r = \mathrm{rank}(\mathbf{X}), \qquad R(\mathbf{X}) = \mathrm{span}\{\mathbf{u}_0,\mathbf{u}_1,\ldots,\mathbf{u}_{r-1}\}, \qquad N(\mathbf{X}) = \mathrm{span}\{\mathbf{v}_r,\mathbf{v}_{r+1},\ldots,\mathbf{v}_{M-1}\}. \qquad (A.101)$$

In the case that $r < K$, moreover, (A.99) becomes

$$\sigma_0 \geq \sigma_1 \geq \cdots \geq \sigma_{r-1} > 0 \quad \text{and} \quad \sigma_r = \cdots = \sigma_{K-1} = 0. \qquad (A.102)$$

It follows that (A.97), for the over/under-determined cases, becomes

$$\mathbf{\Sigma} = \begin{bmatrix} \mathbf{\Sigma}_r & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix},$$

where
Fig. A.3 Schematic of the SVD decomposition $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^H$ in the cases (a) overdetermined (the data matrix $\mathbf{X}$ is tall) and (b) underdetermined (the data matrix $\mathbf{X}$ is fat): $\mathbf{U}$ and $\mathbf{V}$ are unitary matrices, and $\mathbf{\Sigma}$ contains the diagonal block $\mathbf{\Sigma}_r$ bordered by null blocks.
$$\mathbf{\Sigma}_r = \mathrm{diag}(\sigma_0,\sigma_1,\ldots,\sigma_{r-1}). \qquad (A.103)$$

Moreover, from the previous development the following expansion applies:

$$\mathbf{X} = [\,\mathbf{U}_1\ \mathbf{U}_2\,]\begin{bmatrix} \mathbf{\Sigma}_r & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}\begin{bmatrix} \mathbf{V}^H_1 \\ \mathbf{V}^H_2 \end{bmatrix} = \mathbf{U}_1\mathbf{\Sigma}_r\mathbf{V}^H_1 = \sum_{i=0}^{r-1}\sigma_i\mathbf{u}_i\mathbf{v}^H_i, \qquad (A.104)$$

where $\mathbf{V}_1$, $\mathbf{V}_2$, $\mathbf{U}_1$, and $\mathbf{U}_2$ are orthonormal matrices defined as

$$\mathbf{V} = [\,\mathbf{V}_1\ \mathbf{V}_2\,] \quad \text{with } \mathbf{V}_1 \in \mathbb{C}^{M\times r} \text{ and } \mathbf{V}_2 \in \mathbb{C}^{M\times(M-r)} \qquad (A.105)$$

$$\mathbf{U} = [\,\mathbf{U}_1\ \mathbf{U}_2\,] \quad \text{with } \mathbf{U}_1 \in \mathbb{C}^{N\times r} \text{ and } \mathbf{U}_2 \in \mathbb{C}^{N\times(N-r)} \qquad (A.106)$$

for which, by (A.101), we have $\mathbf{V}^H_1\mathbf{V}_2 = \mathbf{0}$ and $\mathbf{U}^H_1\mathbf{U}_2 = \mathbf{0}$. The representation (A.104) is sometimes called the thin SVD of $\mathbf{X}$.
Note also that the Euclidean norm of $\mathbf{X}$ is equal to

$$\|\mathbf{X}\|_2 = \sigma_0 \qquad (A.107)$$

while its Frobenius norm is equal to

$$\|\mathbf{X}\|_F \triangleq \sqrt{\sum_{i=0}^{N-1}\sum_{j=0}^{M-1}|x_{ij}|^2} = \sqrt{\sigma^2_0 + \sigma^2_1 + \cdots + \sigma^2_{r-1}}. \qquad (A.108)$$

Remark An important special case of the SVD decomposition occurs when the matrix $\mathbf{X}$ is symmetric and nonnegative definite. In this case,

$$\mathbf{\Sigma} = \mathrm{diag}(\lambda_0,\lambda_1,\ldots,\lambda_{r-1}), \qquad (A.109)$$

where $\lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_{r-1} \geq 0$ are the real eigenvalues of $\mathbf{X}$ corresponding to the eigenvectors $\mathbf{v}_i$.
A.11.2 Pseudoinverse Matrix and SVD
The Moore–Penrose pseudoinverse in the overdetermined case is defined as $\mathbf{X}^{\#} = (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H$, while in the underdetermined case it is $\mathbf{X}^{\#} = \mathbf{X}^H(\mathbf{X}\mathbf{X}^H)^{-1}$. It should be noted that, by the expression (A.95), $\mathbf{X}^{\#}$ always takes the following forms:

$$\mathbf{X}^{\#} = (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H = \mathbf{V}\begin{bmatrix} \mathbf{\Sigma}^{-1}_K & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}\mathbf{U}^H, \quad N > M$$

$$\mathbf{X}^{\#} = \mathbf{X}^H(\mathbf{X}\mathbf{X}^H)^{-1} = \mathbf{V}\begin{bmatrix} \mathbf{\Sigma}^{-1}_K & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}\mathbf{U}^H, \quad N < M, \qquad (A.110)$$

where, for $K = \min(N,M)$, $\mathbf{\Sigma}^{-1}_K = \mathrm{diag}(\sigma^{-1}_0,\sigma^{-1}_1,\ldots,\sigma^{-1}_{K-1})$, and, for $r \leq K$,

$$\mathbf{X}^{\#} = \mathbf{V}_1\mathbf{\Sigma}^{-1}_r\mathbf{U}^H_1. \qquad (A.111)$$

For both the over- and the under-determined case, these expressions can be verified by means of (A.95) together with the partitions (A.105) and (A.106).
Remark Remember that the right singular vectors $\mathbf{v}_0, \mathbf{v}_1, \ldots, \mathbf{v}_{M-1}$ of the data matrix $\mathbf{X}$ are equal to the eigenvectors of the matrix $\mathbf{X}^H\mathbf{X}$, while the left singular vectors $\mathbf{u}_0, \mathbf{u}_1, \ldots, \mathbf{u}_{N-1}$ are equal to the eigenvectors of the matrix $\mathbf{X}\mathbf{X}^H$. It is also true that $r = \mathrm{rank}(\mathbf{X})$, i.e., the number of positive singular values is equal to the rank of the data matrix $\mathbf{X}$. Therefore, the SVD decomposition provides a practical tool for determining the rank of a matrix and its pseudoinverse.
Corollary For the calculation of the pseudoinverse it is also possible to use other types of decomposition, such as the one shown below.

Given a matrix $\mathbf{X} \in (\mathbb{R},\mathbb{C})^{N\times M}$ with $\mathrm{rank}(\mathbf{X}) = r < \min(N,M)$, there exist two matrices $\mathbf{C}_{N\times r}$ and $\mathbf{D}_{r\times M}$ such that $\mathbf{X} = \mathbf{CD}$. Using these matrices it is easy to verify that

$$\mathbf{X}^{\#} = \mathbf{D}^H(\mathbf{D}\mathbf{D}^H)^{-1}(\mathbf{C}^H\mathbf{C})^{-1}\mathbf{C}^H. \qquad (A.112)$$
A.12 Condition Number of a Matrix
In numerical analysis the condition number, indicated as $\chi(\cdot)$, associated with a problem measures the degree of numerical tractability of the problem itself. A matrix $\mathbf{A}$ is called ill-conditioned if $\chi(\mathbf{A})$ takes large values. In this case, some methods of matrix inversion can exhibit large numerical errors.

Given a matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$, the condition number is defined as

$$\chi(\mathbf{A}) \triangleq \|\mathbf{A}\|_p\|\mathbf{A}^{\#}\|_p, \qquad 1 \leq \chi(\mathbf{A}) \leq \infty, \qquad (A.113)$$

where $p = 1, 2, \ldots, \infty$, $\|\cdot\|_p$ may also be the Frobenius norm, and $\mathbf{A}^{\#}$ is the pseudoinverse of $\mathbf{A}$. The number $\chi(\mathbf{A})$ depends on the type of norm chosen. In particular, in the case of the $L_2$ norm it is possible to prove that

$$\chi(\mathbf{A}) = \|\mathbf{A}\|_2\|\mathbf{A}^{\#}\|_2 = \frac{\sigma_{\max}}{\sigma_{\min}}, \qquad (A.114)$$

where $\sigma_{\max} = \sigma_1$ and $\sigma_{\min}$ ($= \sigma_M$ or $\sigma_N$) are, respectively, the maximum and minimum singular values of $\mathbf{A}$. In the case of a square matrix,

$$\chi(\mathbf{A}) = \frac{|\lambda_{\max}|}{|\lambda_{\min}|}, \qquad (A.115)$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum and minimum eigenvalues of $\mathbf{A}$.
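The ratio in (A.114) is exactly what `np.linalg.cond` computes for the $L_2$ norm; the sketch below (an added illustration, assuming NumPy) checks this on a deliberately ill-conditioned matrix:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])          # nearly singular, hence ill-conditioned

s = np.linalg.svd(A, compute_uv=False)
chi = s[0] / s[-1]                   # sigma_max / sigma_min, as in (A.114)
assert np.isclose(chi, 1e6)
assert np.isclose(np.linalg.cond(A, 2), chi)
```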
A.13 Kroneker Product
The Kronecker product between two matrices A ∈ ðℝ,ℂ ÞP�Q and
B ∈ ðℝ,ℂ ÞN�M, usually indicated as A � B, is defined as
A� B ¼a11B � � � a1QB⋮ ⋱ ⋮
aP1B � � � aPQB
24
35∈ ℝ;ℂð ÞPN�QM: ðA:116Þ
The Kronecker product can be convenient to represent linear systems equations and
some linear transformations.
Given a matrix A ∈ ðℝ,ℂ ÞN�M, you can associate with it a vector,
vecðAÞ ∈ ðℝ,ℂ ÞNM�1, containing all its column vectors [see (A.10)].
For example, given the matricesAN�M and XM�P, it is possible to represent their
product as
AX ¼ B, ðA:117Þ
where BN�P; using the definition (A.10) and the Kronecker product, we have that
I� Að Þvec Xð Þ ¼ vec Bð Þ ðA:118Þ
that represents a system of linear equations of NP equations and MP unknowns.
Similarly, given the matrices A_(N×M), X_(M×P), and B_(P×Q), it is possible to
represent their product

AXB = C        (A.119)

in an equivalent manner as a system of NQ linear equations in MP unknowns, or as

(B^T ⊗ A) vec(X) = vec(C).        (A.120)
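The identity (A.120) can be verified numerically; the following minimal pure-Python sketch (not from the book) does so on small matrices, with illustrative helper names `kron`, `vec`, and `matmul`.

```python
# Illustrative check of (B^T kron A) vec(X) = vec(C) for A X B = C,
# using plain nested lists (no external libraries).

def matmul(A, B):
    """Matrix product of two lists-of-rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def kron(A, B):
    """Kronecker product of A (P x Q) with B (N x M) -> (PN x QM), per (A.116)."""
    return [[A[p][q] * B[n][m]
             for q in range(len(A[0])) for m in range(len(B[0]))]
            for p in range(len(A)) for n in range(len(B))]

def vec(A):
    """Stack the columns of A into a single column (returned as a flat list)."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]

A = [[1, 2], [3, 4]]          # N x M = 2 x 2
X = [[0, 1], [1, 1]]          # M x P = 2 x 2
B = [[2, 0], [1, 1]]          # P x Q = 2 x 2

C = matmul(matmul(A, X), B)   # A X B = C, as in (A.119)
x = vec(X)
lhs = [sum(row[j] * x[j] for j in range(4)) for row in kron(transpose(B), A)]
assert lhs == vec(C)          # (B^T kron A) vec(X) = vec(C), i.e. (A.120)
```

The same `vec` and `kron` helpers also verify (A.118) by taking B as the identity.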
Appendix B: Elements of Nonlinear Programming
B.1 Unconstrained Optimization
The term nonlinear programming (NLP) indicates the process of solving linear or
nonlinear systems of equations, not through a closed mathematical–algebraic
approach, but with a methodology that minimizes or maximizes some cost function
associated with the problem.
This Appendix briefly introduces the basic concepts of NLP. In particular, it
presents some fundamental concepts of the unconstrained and the constrained
optimization methods [3–15].
B.1.1 Numerical Methods for Unconstrained Optimization
The problem of unconstrained optimization can be formulated as follows: find a
vector w ∈ Ω ⊆ ℝ^M (see footnote 4) that minimizes (maximizes) a scalar
function J(w). Formally,

w* = arg min_{w∈Ω} J(w).        (B.1)

The real function J(w), J : ℝ^M → ℝ, is called cost function (CF), or loss
function, objective function, or energy function; w is an M-dimensional vector
of variables that could have any values, positive or negative, and Ω is the
variables or search space. Minimizing a function is equivalent to maximizing the
negative of the function itself. Therefore, without loss of generality,
minimizing or maximizing a function are equivalent problems.
A point w* is a global minimum of the function J(w) if
4 For uniformity of writing, we denote by Ω the search space, which in the
absence of constraints coincides with the whole space, i.e., Ω ≡ ℝ^M. As we will
see later, in the presence of constraints, there is a reduced search space,
i.e., Ω ⊂ ℝ^M.
J(w*) ≤ J(w),   ∀ w ∈ ℝ^M,        (B.2)

and w* is a strict local minimizer if (B.2) holds with strict inequality (for
w ≠ w*) in an ε-radius ball centered in w*, indicated as B(w*, ε).
B.1.2 Existence and Characterization of the Minimum
The admissible solutions of a problem can be characterized in terms of some
necessary and sufficient conditions.
First-order necessary condition (FONC) (for minimization or maximization) is
that

∇J(w) = 0,        (B.3)

where the operator ∇J(w) ∈ ℝ^M is a vector indicating the gradient of the
function J(w), defined as

∇J(w) ≜ ∂J(w)/∂w = [ ∂J(w)/∂w_1   ∂J(w)/∂w_2   ⋯   ∂J(w)/∂w_M ]^T.        (B.4)
Second-order necessary condition (SONC) is that the Hessian matrix
∇²J(w) ∈ ℝ^(M×M), defined as

∇²J(w) ≜ (∂/∂w)[∂J(w)/∂w]^T = (∂/∂w)[∇J]^T
       = [ ∂²J(w)/∂w_1²       ∂²J(w)/∂w_1∂w_2   ⋯   ∂²J(w)/∂w_1∂w_M
           ∂²J(w)/∂w_2∂w_1    ∂²J(w)/∂w_2²      ⋯   ∂²J(w)/∂w_2∂w_M
           ⋮                  ⋮                  ⋱   ⋮
           ∂²J(w)/∂w_M∂w_1    ∂²J(w)/∂w_M∂w_2   ⋯   ∂²J(w)/∂w_M² ],        (B.5)

is positive semi-definite (PSD), i.e.,

w^T ∇²J(w) w ≥ 0,   for all w.        (B.6)
Second-order sufficient condition (SOSC) is that, given the FONC satisfied, the
Hessian matrix ∇²J(w) is positive definite, that is, w^T ∇²J(w) w > 0 for all
w ≠ 0.
A necessary and sufficient condition for w* to be a strict local minimizer of
J(w) can be formalized by the following theorem:

Theorem The point w* is a strict local minimizer of J(w) iff ∇J(w*) = 0 and
the Hessian ∇²J(w*) is symmetric and positive definite; in that case there
exists ε > 0 such that J(w*) < J(w) for all w with 0 < ||w − w*|| < ε.
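The conditions above can be checked numerically at a candidate point; the following pure-Python sketch (not from the book) uses central finite differences on an illustrative quadratic CF, with the 2×2 positive-definiteness test (positive diagonal and determinant).

```python
# Numerical check of FONC/SOSC at a candidate minimizer, via central
# finite differences. The cost function J below is an illustrative example.

def J(w):
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2

def grad(J, w, h=1e-5):
    """Central-difference gradient estimate of J at w."""
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += h; wm[i] -= h
        g.append((J(wp) - J(wm)) / (2 * h))
    return g

def hess(J, w, h=1e-4):
    """Central-difference Hessian estimate of J at w."""
    M = len(w)
    H = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(M):
            wpp, wpm, wmp, wmm = list(w), list(w), list(w), list(w)
            wpp[i] += h; wpp[j] += h
            wpm[i] += h; wpm[j] -= h
            wmp[i] -= h; wmp[j] += h
            wmm[i] -= h; wmm[j] -= h
            H[i][j] = (J(wpp) - J(wpm) - J(wmp) + J(wmm)) / (4 * h * h)
    return H

w_star = [1.0, -2.0]                 # candidate minimizer of this J
g = grad(J, w_star)
H = hess(J, w_star)
assert all(abs(gi) < 1e-6 for gi in g)                            # FONC (B.3)
assert H[0][0] > 0 and H[0][0] * H[1][1] - H[0][1] * H[1][0] > 0  # SOSC, 2x2 PD
```

For this separable quadratic the exact Hessian is diag(2, 4), which the finite-difference estimate reproduces to high accuracy.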
B.2 Algorithms for Unconstrained Optimization
In the field of unconstrained optimization, it is known that some general principles
can be used to study most of the algorithms. This section describes some of these
fundamental principles.
B.2.1 Basic Principles
Our problem is to determine (or better, estimate) the vector w*, called the
optimal solution, which minimizes the CF J(w). If the CF is smooth and its
gradient is available, the optimal solution can be computed (estimated) by an
iterative procedure that minimizes the CF, i.e., starting from some initial
condition (IC) w₋₁, a suitable solution is available only after a certain number
of adaptation steps: w₋₁ → w₀ → w₁ → ... → w_k → ... → w*. The recursive
estimator has a form of the type

w_{k+1} = w_k + μ_k d_k        (B.7)

or

w_k = w_{k−1} + μ_k d_k,        (B.8)

where k is the adaptation index. The vector d_k represents the adaptation
direction, and the parameter μ_k is the step size, also called adaptation rate,
step length, learning rate, etc., that can be obtained by means of a
one-dimensional search.
An important aspect of the recursive procedure (B.7) concerns the algorithm order.
In the first-order algorithms, the adaptation is carried out using only knowledge
about the CF gradient, evaluated with respect to the free parameters w. In the
second-order algorithms, to reduce the number of iterations needed for conver-
gence, information about the JðwÞ curvature, i.e., the CF Hessian, is also used.
Appendix B: Elements of Nonlinear Programming 605
Figure B.1 shows a qualitative evolution of the recursive optimization algorithm.
B.2.2 First- and Second-Order Algorithms
Let J(w) be the CF to be minimized; if the CF gradient is available at learning step k, indicated as ∇J(w_k), it is possible to define a family of iterative methods for the
optimum solution computation. These methods are referred to as search methods or
searching the performance surface, and the best-known algorithm of this class is the
steepest descent algorithm (SDA) (Cauchy 1847). Note that, given the popularity of
the SDA, this class of search methods is often identified with the name SDA
algorithms.
Considering the general adaptation formula (B.7), and indicating for simplicity
the gradient as g_k = ∇J(w_k), the direction vector d_k is defined as follows:

d_k = −g_k,   SDA algorithms.        (B.9)
The SDA are first-order algorithms because adaptation is determined by knowledge
of the gradient, i.e., only the first derivative of the CF. Starting from a
given IC w₋₁, they proceed by updating the solution (B.7) along the direction
opposite to the CF gradient with a step length μ.
The learning algorithm's performance can be improved by using second-order
derivatives. In the case that the Hessian matrix is known, the method, called
exact Newton, has a form of the type

d_k = −[∇²J(w_k)]^{-1} g_k,   exact Newton.        (B.10)
In the case where the Hessian is unknown, the method, called quasi-Newton
(Broyden 1965; see [3] and [4] for other details), has a form of the type
Fig. B.1 Qualitative evolution of the trajectory of the weights w_k, during the
optimization process towards the optimal solution w*, for a generic
two-dimensional objective function: (a) qualitative trend of steepest descent
along the negative gradient of the surface J(w); (b) detail of the direction
and the step size.
d_k = −H_k g_k,   quasi-Newton,        (B.11)

where the matrix H_k is an approximation of the inverse of the Hessian matrix:

H_k ≅ [∇²J(w_k)]^{-1}.        (B.12)
The matrix H_k is a weighting matrix that can be estimated in various ways.
As Fig. B.2 shows, the product μ_k H_k can be interpreted as an optimum choice
of direction and step-size length, calculated so as to follow the
surface-gradient descent in very few steps; in the limit, as in the exact
Newton method, in only one step.
B.2.3 Line Search and Wolfe Condition
The step size μ of the unconstrained minimization procedure can be chosen a
priori (according to certain rules) and kept fixed during the entire process,
or it may be variable, denoted μ_k. In this case, the step size can be optimized
according to some criterion, e.g., the line search method defined as

μ_k = arg min_{μ_min<μ<μ_max} J(w_k + μ d_k).        (B.13)

With this technique, the parameter μ_k is (locally) increased, using a certain
step, as long as the CF continues to decrease. The length of the learning rate
is variable, usually becoming smaller when approaching the optimal solution.
Fig. B.2 In the second-order algorithms, the matrix H_k determines a
transformation, in terms of rotation and gain, of the vector d_k towards the
direction of the minimum of the surface J(w).
A typical qualitative evolution of the line search during descent along the
gradient of the CF is shown in Fig. B.3.
As illustrated in Fig. B.3, in certain situations the number of iterations
needed to reach the optimal point can be drastically reduced, however, at a
considerable increase in computational cost due to the calculation of the
expression (B.13).
For noisy or rippled CFs the expression (B.13) can be computed only with some
difficulty, so algorithms for the determination of the optimal step size should
be used with some caution.
The Wolfe conditions are a set of inequalities for performing an inexact line
search, especially in second-order methods, in order to determine an acceptable
step size. Inexact line searches provide an efficient way of computing a step
size μ that reduces the objective function "sufficiently," rather than
minimizing the objective function over μ ∈ ℝ⁺ exactly. A line search algorithm
can use the Wolfe conditions as a requirement for any guessed μ, before finding
a new search direction d_k. A step length μ_k is said to satisfy the Wolfe
conditions if the following two inequalities hold:
J(w_k + μ_k d_k) ≤ J(w_k) + σ1 μ_k d_k^T g_k
d_k^T ∇J(w_k + μ_k d_k) ≥ σ2 d_k^T g_k,        (B.14)

where 0 < σ1 < σ2 < 1. The first inequality ensures that the CF J_k is reduced
sufficiently. The second, called the curvature condition, ensures that the slope
has been reduced sufficiently. It is easy to show that if d_k is a descent
direction, if J_k is continuously differentiable, and if J_k is bounded below
along the ray {w_k + μ d_k | μ > 0}, then there always exist step sizes
satisfying (B.14). Algorithms that are guaranteed to find, in a finite number of
iterations, a point satisfying the Wolfe conditions have been developed by
several authors (see [4] for details).
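As a concrete illustration (not from the book's listings), the two inequalities (B.14) can be tested for a trial step in a few lines of pure Python; the quadratic CF is an illustrative example, and σ1, σ2 take the values 10⁻⁴ and 0.9 suggested later in the text.

```python
# Minimal sketch: test the two Wolfe inequalities (B.14) for a trial step mu
# along a descent direction d, on a simple quadratic CF (illustrative).

def J(w):
    return 0.5 * (w[0] ** 2 + 4.0 * w[1] ** 2)

def gradJ(w):
    return [w[0], 4.0 * w[1]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def wolfe_ok(w, d, mu, s1=1e-4, s2=0.9):
    g = gradJ(w)                                   # g_k = grad J(w_k)
    w_new = [wi + mu * di for wi, di in zip(w, d)]
    sufficient_decrease = J(w_new) <= J(w) + s1 * mu * dot(d, g)
    curvature = dot(d, gradJ(w_new)) >= s2 * dot(d, g)
    return sufficient_decrease and curvature

w = [2.0, 1.0]
d = [-gi for gi in gradJ(w)]       # steepest descent direction
assert not wolfe_ok(w, d, 1e-6)    # tiny step: decrease ok, curvature fails
assert wolfe_ok(w, d, 0.2)         # moderate step: both conditions hold
```

The tiny step is rejected by the curvature condition precisely because the slope at the new point is still almost as steep as at w_k.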
Fig. B.3 Qualitative evolution of the descent along the negative gradient of
the CF with the line search method. The μ parameter is increased as long as the
CF continues to decrease.
If we modify the curvature condition as

|d_k^T ∇J(w_k + μ_k d_k)| ≤ σ2 |d_k^T g_k|,        (B.15)

known as the strong Wolfe condition, this can result in a value for the step
size that is close to a minimizer of J(w_k + μ_k d_k).
B.2.3.1 Line Search Condition for Quadratic Form

Let A ∈ ℝ^(M×M) be a symmetric and positive definite matrix; for a quadratic CF
defined as

J(w) = c − w^T b + (1/2) w^T A w,        (B.16)

the optimal step size is

μ = (d_{k−1}^T d_{k−1}) / (d_{k−1}^T A d_{k−1}).        (B.17)
Proof The line search is a procedure to find the best step size along the
steepest direction, i.e., the one that satisfies ∂J(w)/∂μ → 0. Using the chain
rule, we can write

∂J(w_k)/∂μ = [∂J(w_k)/∂w_k]^T (∂w_k/∂μ) = [∇J(w_k)]^T d_{k−1}.

Intuitively, from the current point reached by the line search procedure, the
next direction is orthogonal to the previous direction, that is, d_k ⊥ d_{k−1}
(see Fig. B.3). For the determination of the optimal step size μ, we note that
∇J(w_k) = −d_k. It follows that

d_k^T d_{k−1} = 0
[∇J(w_k)]^T d_{k−1} = 0.        (B.18)

For a CF of the type (B.16), at the kth iteration, the negative gradient
(search direction) is −∇J(w_k) = b − A w_k. Letting the weights' correction be
w_k = w_{k−1} + μ d_{k−1}, the expression (B.18) can be written as

[b − A w_k]^T d_{k−1} = 0
[b − A(w_{k−1} + μ d_{k−1})]^T d_{k−1} = 0;

from the latter, with the position d_{k−1} = b − A w_{k−1},
Appendix B: Elements of Nonlinear Programming 609
[d_{k−1} − μ A d_{k−1}]^T d_{k−1} = 0
d_{k−1}^T d_{k−1} − μ d_{k−1}^T A d_{k−1} = 0.

Finally, solving for μ, we have

μ = (d_{k−1}^T d_{k−1}) / (d_{k−1}^T A d_{k−1}).

Q.E.D.
Example Consider a quadratic CF (B.16) with A = [1 0.8; 0.8 1],
b = [0.1 −0.2]^T, and c = 0.1; the plot of the performance surface is reported
in Fig. B.4.

Problem Find the optimal solution, using a Matlab procedure, with tolerance
Tol = 1e−6, starting with IC w₋₁ = [0 3]^T.
In Fig. B.5 the weights trajectories, plotted over the isolevel performance
surface, are reported for the standard SDA and the SDA plus the Wolfe condition.

Computed optimum solution

w[0] = 0.72222
w[1] = -0.77778

SDA computed optimum solution with μ = 0.1

n. Iter = 1233
w[0] = 0.72222
w[1] = -0.77778
Fig. B.4 Trend of the cost function considered in the example.
SDA2_Wolfe optimal solution, μ computed with Eq. (B.17)

n. Iter = 30
w[0] = 0.72222
w[1] = -0.77778
Matlab Functions

%--------------------------------------------------------------------------
% Standard Steepest Descent Algorithm
%
% Copyright 2013 - A. Uncini
% DIET Dpt - University of Rome 'La Sapienza' - Italy
% $Revision: 1.0$  $Date: 2013/03/09$
%--------------------------------------------------------------------------
function [w, k] = SDA(w, g, R, c, mu, tol, MaxIter)
  % Steepest descent -------------------------------------------------------
  for k = 1 : MaxIter
    gradJ = grad_CF(w, g, R);              % gradient computation
    w = w - mu*gradJ;                      % update solution
    if ( norm(gradJ) < tol ), break, end   % end criterion
  end
end

%--------------------------------------------------------------------------
% Standard Steepest Descent Algorithm with Wolfe condition
% for quadratic CF:  J(w) = c - w'b + (1/2)w'Aw
%
% Copyright 2013 - A. Uncini
% DIET Dpt - University of Rome 'La Sapienza' - Italy
% $Revision: 1.0$  $Date: 2013/03/09$
%--------------------------------------------------------------------------
function [w, k] = SDA2(w, g, R, c, mu, tol, MaxIter)
  for k = 1 : MaxIter
    gradJ = R*w - g;                       % gradient computation
    mu = (gradJ'*gradJ)/(gradJ'*R*gradJ);  % optimal step size, Eq. (B.17)
    w = w - mu*gradJ;                      % update solution
    if ( norm(gradJ) < tol ), break, end   % end criterion
  end
end

%--------------------------------------------------------------------------
% Standard quadratic cost function and gradient computation
%
% Copyright 2013 - A. Uncini
% DIET Dpt - University of Rome 'La Sapienza' - Italy
% $Revision: 1.0$  $Date: 2013/03/09$
%--------------------------------------------------------------------------
function [Jw] = CF(w, c, b, A)
  Jw = c - w'*b + 0.5*(w'*A*w);            % J(w) of Eq. (B.16)
end
%--------------------------------------------------------------------------
function [gradJ] = grad_CF(w, b, A)
  gradJ = A*w - b;
end
B.2.4 The Standard Newton’s Method
The Newton methods are based on the exact computation of the minimum of a
quadratic local approximation of the CF. In other words, rather than directly
determining the approximate minimum of the true CF, the minimum of a locally
quadratic approximation of the CF is exactly computed.
The method can be formalized by considering the truncated second-order Taylor
series expansion of the CF J(w) around a point w_k, defined as

J(w) ≅ J(w_k) + [w − w_k]^T ∇J(w_k) + (1/2)[w − w_k]^T ∇²J(w_k)[w − w_k].   (B.19)

The minimum of (B.19) is determined by imposing ∇J(w) → 0, so the point w_{k+1}
(that minimizes the CF) necessarily satisfies the relationship⁵

∇J(w_k) + (1/2) ∇²J(w_k)[w_{k+1} − w_k] = 0.        (B.20)

If the inverse of the Hessian matrix exists, the previous expression can be
written in the following form of finite difference equation (FDE):

w_{k+1} = w_k − μ_k [∇²J(w_k)]^{-1} ∇J(w_k),   for ∇²J(w_k) ≠ 0,        (B.21)

where μ_k > 0 is a suitable constant. The expression (B.21) represents the
standard form of the discrete Newton's method.
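The one-step convergence implied by (B.25) below is easy to verify numerically; this pure-Python sketch (not from the book's Matlab listings) applies a single Newton step with μ = 1 to the quadratic example of Fig. B.4, writing out the 2×2 inverse explicitly.

```python
# Illustrative check: for the quadratic CF (B.16) a Newton step with unit
# step size reaches w* = A^{-1} b in one iteration (2x2 example).

A = [[1.0, 0.8], [0.8, 1.0]]            # symmetric positive definite
b = [0.1, -0.2]
w = [0.0, 3.0]                          # arbitrary initial condition

# Gradient of J(w) = c - w'b + (1/2) w'Aw  is  A w - b
g = [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
     A[1][0] * w[0] + A[1][1] * w[1] - b[1]]

# Explicit 2x2 inverse of the Hessian (which equals A for a quadratic CF)
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
Ainv = [[ A[1][1] / det, -A[0][1] / det],
        [-A[1][0] / det,  A[0][0] / det]]

# Newton update: w <- w - A^{-1} (A w - b) = A^{-1} b
w = [w[0] - (Ainv[0][0] * g[0] + Ainv[0][1] * g[1]),
     w[1] - (Ainv[1][0] * g[0] + Ainv[1][1] * g[1])]

# Compare with the optimum computed in the book's example
assert abs(w[0] - 0.72222) < 1e-4 and abs(w[1] + 0.77778) < 1e-4
```

One step lands exactly on w* = A⁻¹b = [0.72222, −0.77778]ᵀ, the same solution that the SDA of the example needs 1233 iterations to reach.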
Fig. B.5 Trajectories of the weights on the isolevel CF curves for the steepest
descent algorithm (SDA) and the Wolfe SDA.
5 In the optimum point w_{k+1}, by definition, J(w_{k+1}) ≅ J(w_k). It follows
that (B.19) can be written as 0 = [w_{k+1} − w_k]^T ∇J(w_k) +
(1/2)[w_{k+1} − w_k]^T ∇²J(w_k)[w_{k+1} − w_k]. So, simplifying the term
[w_{k+1} − w_k]^T gives (B.20).
Remark The CF approximation with a quadratic form is significant because J(w)
is usually an energy function. As explained by the Lyapunov method [5], you can
think of that function as the energy associated with a continuous-time
dynamical system described by a system of differential equations of the form

dw/dt = −μ₀ [∇²J(w_k)]^{-1} ∇J(w_k),        (B.22)

such that for μ₀ > 0, (B.21) corresponds to its numeric approximation. In this
case, the convergence properties of Newton's method can be studied in the
context of a quadratic programming problem of the type

w* = arg min_{w∈Ω} J(w)        (B.23)

when the CF has a quadratic form of type (B.16). Note that for A positive
definite the function J(w) is strictly convex and admits an absolute minimum w*
that satisfies

A w* = b  ⇒  w* = A^{-1} b.        (B.24)

Also, observe that the gradient and the Hessian of the expression (B.16) are
calculated explicitly as ∇J(w) = Aw − b and ∇²J(w) = A; replacing these values
in the form (B.21) for μ_k = 1, the recurrence becomes

w_{k+1} = w_k − A^{-1}(A w_k − b) = A^{-1} b.        (B.25)

The above expression indicates that the Newton method theoretically converges
to the minimum point in only one iteration. In practice, however, the gradient
calculation and the Hessian inversion pose many difficulties. In fact, the
Hessian matrix is usually ill-conditioned, and its inversion represents an
ill-posed problem. Furthermore, the IC w₋₁ can be quite far from the minimum
point, and the Hessian at that point may not be positive definite, leading the
algorithm to diverge. In practice, a way to overcome these drawbacks is to slow
the adaptation speed by including a step-size parameter μ_k in the recurrence.
It follows that in causal form, (B.25) can be written as

w_k = w_{k−1} − μ_k A^{-1}(A w_{k−1} − b).        (B.26)
As mentioned above, in the simplest form of the Newton method, the weighting of
equation (B.10) is made with the inverse Hessian matrix, or with its estimate.
We then have

H_k = [∇²J_{k−1}]^{-1},   exact Newton algorithms,        (B.27)

H_k ≅ [∇²J_{k−1}]^{-1},   quasi-Newton algorithms,        (B.28)

thereby forcing both the direction and the step size towards the minimum of the
function. The learning parameter can be constant (μ_k < 1) or also estimated
with a suitable optimization procedure.
B.2.5 The Levenberg–Marquardt Variant

A simple method to overcome the problem of ill-conditioning of the Hessian
matrix, called the Levenberg–Marquardt variant [6, 7], consists in the
definition of an adaptation rule of the type

w_k = w_{k−1} − μ_k [δI + ∇²J_{k−1}]^{-1} g_k,        (B.29)

where the constant δ > 0 must be chosen considering two contradictory
requirements: small, to increase the convergence speed, and sufficiently large
as to make the Hessian matrix always positive definite.
The Levenberg–Marquardt method is an approximation of the Newton algorithm. It
also has quadratic convergence characteristics. Furthermore, convergence is
guaranteed even when the initial condition estimate is far from the minimum
point.
Note that the addition of the term δI, besides ensuring the positivity of the
Hessian matrix, is strictly related to the Tikhonov regularization theory. In
the presence of a noisy CF, the term δI can be viewed as a Tikhonov
regularizing term which determines the optimal solution of a smooth version of
the CF [8].
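The effect of the damping term δI in (B.29) can be illustrated with a tiny pure-Python sketch (not from the book): a singular Hessian becomes invertible after damping, and the damped Newton direction remains a descent direction. The matrix, gradient, and δ values are illustrative.

```python
# Sketch of the Levenberg-Marquardt damping (B.29): adding delta*I makes a
# singular (hence non-invertible) Hessian usable. Values are illustrative.

H = [[1.0, 1.0], [1.0, 1.0]]             # singular Hessian: det = 0
g = [1.0, 0.5]                           # gradient at the current point
delta = 0.1

Hd = [[H[0][0] + delta, H[0][1]],        # damped Hessian: delta*I + H
      [H[1][0], H[1][1] + delta]]

det = Hd[0][0] * Hd[1][1] - Hd[0][1] * Hd[1][0]
assert det > 0                           # damping restored invertibility

# Damped Newton direction d = -(delta*I + H)^{-1} g, via the 2x2 inverse
d = [-( Hd[1][1] * g[0] - Hd[0][1] * g[1]) / det,
     -(-Hd[1][0] * g[0] + Hd[0][0] * g[1]) / det]

# d is still a descent direction: d'g < 0
assert d[0] * g[0] + d[1] * g[1] < 0
```

Because δI + H is symmetric positive definite, the direction −(δI + H)⁻¹g always points downhill, which is exactly the Tikhonov-like stabilization described above.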
B.2.6 Quasi-Newton Methods or Variable Metric Methods
In many optimization problems, the Hessian matrix is not explicitly available.
In the quasi-Newton methods, also known as variable metric methods, the inverse
Hessian matrix is determined iteratively and in an approximate way. The Hessian
is updated by analyzing successive gradient vectors. For example, in the
so-called sequential quasi-Newton methods, the estimate of the inverse Hessian
matrix is evaluated by considering two successive values of the CF gradient.
Consider the second-order CF approximation and let Δw = [w − w_k],
g_k = ∇J(w_k), and B_k an approximation of the Hessian matrix,
B_k ≅ ∇²J(w_k); from Eq. (B.19) we can write

J(w + Δw) ≅ J(w_k) + Δw^T g_k + (1/2) Δw^T B_k Δw.        (B.30)
The gradient of this approximation (with respect to Δw) can be written as

∇J(w_k + Δw_k) ≅ g_k + B_k Δw_k,        (B.31)

called the secant equation. The Hessian approximation can be chosen in order to
exactly satisfy Eq. (B.31); so, setting this gradient to zero, with
Δw_k → d_k, provides the quasi-Newton adaptation direction

d_k = −B_k^{-1} g_k.        (B.32)
In particular, in the method of Broyden–Fletcher–Goldfarb–Shanno (BFGS)
[3, 9–11], the adaptation takes the form

d_k = −B_k^{-1} g_k
w_{k+1} = w_k + μ_k d_k
B_{k+1} = B_k − (B_k s_k s_k^T B_k^T)/(s_k^T B_k s_k) + (u_k u_k^T)/(u_k^T s_k)
s_k = w_{k+1} − w_k
u_k = g_{k+1} − g_k,        (B.33)

where the step size μ_k satisfies the above Wolfe conditions (B.14). It has been
found that for optimal performance a very loose line search, with suggested
values of the parameters in (B.14) equal to σ1 = 10^{−4} and σ2 = 0.9, is
sufficient.
A method that can be considered as a serious contender of the BFGS [4] is the
so-called symmetric rank-one (SR1) method, where the update is given by

B_{k+1} = B_k + ((d_k − B_k s_k)(d_k − B_k s_k)^T)/(s_k^T (d_k − B_k s_k)).   (B.34)

It was first discovered by Davidon (1959), in his seminal paper on quasi-Newton
methods, and rediscovered by several authors. The SR1 method can be derived by
posing the following simple problem. Given a symmetric matrix B_k and the
vectors s_k and d_k, find a new symmetric matrix B_{k+1} such that
(B_{k+1} − B_k) has rank one, and such that

B_{k+1} s_k = d_k.        (B.35)

Note that, to prevent the method from failing, one can simply set B_{k+1} = B_k
when the denominator in (B.34) is close to zero, though this could slow down
the convergence speed.
Remark In order to avoid the computation of the inverse matrix B_k, denoting by
H_k an approximation of the inverse Hessian matrix (H_k ≅ [∇²J(w_k)]^{-1}), and
approximating (d_k ≅ Δw_k), the recursion (B.33) can be rewritten as

w_{k+1} = w_k + μ_k d_k
d_k ≅ w_{k+1} − w_k = −H_k g_k
u_k = g_{k+1} − g_k
H_{k+1} = [I − (d_k u_k^T)/(d_k^T u_k)] H_k [I − (u_k d_k^T)/(d_k^T u_k)]
          + (d_k d_k^T)/(d_k^T u_k),        (B.36)

where, usually, the step size μ_k is optimized by a one-dimensional line search
procedure (B.13) that takes the form

μ_k ∴ min_{μ∈ℝ⁺} J[w_k − μ H_k ∇J_k].        (B.37)

The procedure is initialized with an arbitrary IC w₋₁ and with the matrix
H₋₁ = I.
Alternatively, in the last of (B.36), H_k can be calculated with the
Barnes–Rosen formula (see [3] for details)

H_{k+1} = H_k + ((d_k − H_k u_k)(d_k − H_k u_k)^T)/((d_k − H_k u_k)^T u_k).  (B.38)
The variable metric methods are computationally more efficient than Newton's
method. In particular, good line search implementations of the BFGS method are
given in the IMSL and NAG scientific software libraries. The BFGS method is
fast and robust and is currently used to solve a myriad of optimization
problems [4].
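A defining property of the inverse-Hessian update (B.36) is that the new matrix satisfies the secant condition H_{k+1} u_k = d_k exactly. The following pure-Python sketch (not from the book's listings) checks this on a 2×2 example; the vectors d and u are illustrative.

```python
# Sketch of the inverse-Hessian BFGS update in (B.36), checking that the
# updated matrix satisfies the secant condition H_new u = d (2x2 example).

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def madd(A, B, s=1.0):
    """Return A + s*B, element-wise."""
    return [[A[i][j] + s * B[i][j] for j in range(len(A[0]))]
            for i in range(len(A))]

def mmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

H = [[1.0, 0.0], [0.0, 1.0]]         # initial inverse-Hessian estimate H = I
d = [0.3, -0.1]                      # step d_k = w_{k+1} - w_k
u = [0.5, 0.2]                       # gradient change u_k = g_{k+1} - g_k
du = d[0] * u[0] + d[1] * u[1]       # d'u (must be > 0 for a PD update)

I = [[1.0, 0.0], [0.0, 1.0]]
L = madd(I, outer(d, u), -1.0 / du)  # I - d u'/(d'u)
R = madd(I, outer(u, d), -1.0 / du)  # I - u d'/(d'u)
H_new = madd(mmul(mmul(L, H), R), outer(d, d), 1.0 / du)   # (B.36)

Hu = matvec(H_new, u)
assert all(abs(Hu[i] - d[i]) < 1e-12 for i in range(2))    # secant: H_new u = d
```

The check holds by construction: the right factor annihilates u, and the rank-one correction then maps u exactly onto d.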
B.2.7 Conjugate Gradient Method
Introduced by Hestenes and Stiefel [12], the conjugate gradient algorithm (CGA)
marks the beginning of the field of large-scale nonlinear optimization. The
CGA, while representing a simple change compared to the SDA and the
quasi-Newton method, has the advantage of a significant increase in convergence
speed and requires the storage of only a few vectors.
Although there are many recent developments of limited-memory and discrete
Newton methods, the CGA is still one of the best choices for solving very large
problems with relatively inexpensive objective functions. The CGA, in fact, has
remained one of the most useful techniques for solving problems large enough to
make matrix storage impractical.
B.2.7.1 Conjugate Direction

Two vectors (d₁, d₂) ∈ ℝ^(M×1) are defined orthogonal if d₁^T d₂ = 0 or
⟨d₁, d₂⟩ = 0. Given a symmetric and positive definite matrix A ∈ ℝ^(M×M), the
vectors are defined as A-orthogonal, or A-conjugate, indicated as
⟨d₁, d₂⟩|_A = 0, if d₁^T A d₂ = 0. In terms of scalar products the result is
⟨Ad₁, d₂⟩ = ⟨A^T d₁, d₂⟩ = ⟨d₁, A^T d₂⟩ = ⟨d₁, A d₂⟩ = 0.

Proposition Conjugation implies linear independence: for A ∈ ℝ^(M×M) symmetric
and positive definite, a set of nonzero A-conjugate vectors,
⟨d_{k−1}, d_k⟩|_A = 0 for k = 0, ..., M−1, indicated as [d_k]_{k=0}^{M−1}, is
linearly independent (Fig. B.6).
B.2.7.2 Conjugate Direction Optimization Algorithm
Given the standard optimization problem (B.1), with the hypothesis that the CF
is a quadratic form of the type (B.16), the following theorem holds.

Theorem Given a set of nonzero A-conjugate directions [d_k]_{k=0}^{M−1}, for
each IC w₋₁ ∈ ℝ^(M×1) the sequence w_k ∈ ℝ^(M×1) generated as

w_{k+1} = w_k + μ_k d_k,   for k = 0, 1, ...,        (B.39)

with μ_k determined by the line search criterion (B.17), converges in M steps
to the unique optimum solution w*.
Proof The proof is performed in two steps: (1) computation of the step size
μ_k; (2) proof of the subspace optimality theorem.

1. Computation of the step size μ_k
Consider the standard quadratic CF minimization problem, for which

∇J(w) → 0  ⇒  A w = b,        (B.40)

with optimal solution w* = A^{-1} b. A set of nonzero A-conjugate directions
[d_k]_{k=0}^{M−1} forms a basis over ℝ^M, such that the solution can be
expressed as

w* = Σ_{k=0}^{M−1} μ_k d_k.        (B.41)
Fig. B.6 Example of orthogonal and A-conjugate directions.
For the previous expression, the system (B.40) for w = w* can be written as

b = A Σ_{k=0}^{M−1} μ_k d_k = Σ_{k=0}^{M−1} μ_k A d_k.        (B.42)

Moreover, left-multiplying both members of the preceding expression by d_i^T,
and being by definition ⟨d_i^T A, d_j⟩ = 0 for i ≠ j, we can write

d_i^T b = μ_k d_i^T A d_k,        (B.43)

which allows the calculation of the coefficients μ_k of the basis (B.41) as

μ_k = (d_k^T b)/(d_k^T A d_k).        (B.44)

For the definition of the CGA method, we consider a recursive solution for the
CF minimization, in which at the (k−1)th iteration we consider the negative
gradient around w_{k−1}, called in this context the residue. Indicating the
negative direction of the gradient as g_{k−1} = −∇J(w_{k−1}), we have

g_{k−1} = b − A w_{k−1}.        (B.45)

The expression (B.44) can be rewritten as

μ_k = (d_k^T (g_{k−1} + A w_{k−1}))/(d_k^T A d_k).        (B.46)

From the definition of A-conjugate directions, d_k^T A w_{k−1} = 0, so we have

μ_k = (d_k^T g_{k−1})/(d_k^T A d_k).        (B.47)
Remark Expression (B.47) represents an alternative formulation for the optimal
step-size computation (B.17).

2. Subspace optimality theorem
Given a quadratic CF J(w) = (1/2) w^T A w − w^T b, and a set of nonzero
A-conjugate directions [d_k]_{k=0}^{M−1}, for any IC w₋₁ ∈ ℝ^(M×1) the sequence
w_k ∈ ℝ^(M×1) generated as

w_{k+1} = w_k + μ_k d_k,   for k ≥ 0,        (B.48)

with
618 Appendix B: Elements of Nonlinear Programming
μ_k = (d_k^T g_{k−1})/(d_k^T A d_k)        (B.49)

reaches its minimum w_{k+1} → w* in the set w₋₁ + span{d_0, ..., d_k}.
Equivalently, considering the general solution w, we have that
[∇J(w)]^T d_k = 0. Then there is, necessarily, a parameter β_i ∈ ℝ such that

w = w₋₁ + β_0 d_0 + ... + β_k d_k.        (B.50)

Then

0 = [∇J(w)]^T d_i
  = [A(w₋₁ + β_0 d_0 + ... + β_k d_k) − b]^T d_i
  = [A w₋₁ − b]^T d_i + β_0 d_0^T A d_i + ... + β_k d_k^T A d_i
  = [∇J(w₋₁)]^T d_i + β_i d_i^T A d_i,        (B.51)

whereby we can calculate the parameter β_i as

β_i = −([∇J(w₋₁)]^T d_i)/(d_i^T A d_i).        (B.52)

Q.E.D.
Fig. B.7 Trajectories of the weights on the isolevel CF curves for the steepest
descent algorithm (SDA) and the standard Hestenes–Stiefel conjugate gradient
algorithm.
B.2.7.3 The Standard Hestenes–Stiefel Conjugate Gradient Algorithm

From the earlier discussion, the basic algorithm of the conjugate directions
can be defined with an iterative procedure which allows the recursive
calculation of the parameters μ_k and β_k. We can define the standard CGA [13]
as (Fig. B.7)

d₋₁ = g₋₁ = b − A w₋₁   (w₋₁ arbitrary)   IC        (B.53)

do {
    μ_k = ||g_k||² / (d_k^T A d_k),      computation of step size        (B.54)
    w_{k+1} = w_k + μ_k d_k,             new solution or adaptation      (B.55)
    g_{k+1} = g_k − μ_k A d_k,           gradient direction update
    β_k = ||g_{k+1}||² / ||g_k||²,       computation of "beta" parameter (B.56)
    d_{k+1} = g_{k+1} + β_k d_k,         search direction                (B.57)
} while ( ||g_k|| > ε )

end criterion: output for ||g_k|| < ε.
%--------------------------------------------------------------------------
% The type 1 Hestenes-Stiefel Conjugate Gradient Algorithm
% for CF:  J(w) = c - w'b + (1/2)w'Aw
%
% Copyright 2013 - A. Uncini
% DIET Dpt - University of Rome 'La Sapienza' - Italy
% $Revision: 1.0$  $Date: 2013/03/09$
%--------------------------------------------------------------------------
function [w, k] = CGA1(w, b, A, c, mu, tol, MaxIter)
  d = b - A*w;                      % (B.53) initial direction and residual
  g = d;
  g1 = g'*g;
  for k = 1 : MaxIter
    Ad = A*d;
    mu = g1/(d'*Ad);                % (B.54) optimal step size
    w  = w + mu*d;                  % (B.55) update solution
    g  = g - mu*Ad;                 % update gradient (residual)
    g2 = g'*g;
    be = g2/g1;                     % (B.56) 'beta' parameter
    d  = g + be*d;                  % (B.57) update direction
    g1 = g2;
    if ( g2 <= tol ), break, end    % end criterion
  end
end
% Hestenes-Stiefel Conjugate Gradient Algorithm type 1 --------------------
Remark In place of the formulas (B.54) and (B.56) one may use

μ_k = (d_k^T g_k)/(d_k^T A d_k),        (B.58)

β_k = −(g_{k+1}^T A d_k)/(d_k^T A d_k).        (B.59)

These formulas, although more complicated than (B.54) and (B.56), allow the μ
and β parameters to be changed more easily during the iterations.
Moreover, note that the directions of the estimated gradients (or residuals)
g_k are mutually orthogonal, ⟨g_{k+1}, g_k⟩ = 0, while the direction vectors
d_k are mutually A-conjugate, ⟨d_{k+1}, A d_k⟩ = 0.
%--------------------------------------------------------------------------
% The type 2 Hestenes-Stiefel Conjugate Gradient Algorithm
% for CF:  J(w) = c - w'b + (1/2)w'Aw
%
% Copyright 2013 - A. Uncini
% DIET Dpt - University of Rome 'La Sapienza' - Italy
% $Revision: 1.0$  $Date: 2013/03/09$
%--------------------------------------------------------------------------
function [w, k] = CGA2(w, b, A, c, mu, tol, MaxIter)
  d = b - A*w;                          % initial direction and residual
  g = d;
  for k = 1 : MaxIter
    Ad  = A*d;
    dAd = d'*Ad;
    mu  = (d'*g)/dAd;                   % (B.58) optimal step size
    w   = w + mu*d;                     % update solution
    g   = g - mu*Ad;                    % update gradient (residual)
    be  = -(g'*Ad)/dAd;                 % (B.59) 'beta' parameter
    d   = g + be*d;                     % update direction
    if ( norm(g) <= tol ), break, end   % end criterion
  end
end
% Hestenes-Stiefel Conjugate Gradient Algorithm type 2 --------------------
B.2.7.4 Gradient Algorithm for Generic CF

The method of conjugate gradients can be generalized to find a minimum of a
generic CF. In this case the search method is sometimes called nonlinear CGA
[14], and the gradient cannot be explicitly computed but only estimated in
various ways. In particular, the residual cannot be directly found; instead,
letting ∇J(w_k) be an estimate of the CF's gradient at the kth iteration, we
set the residual as g_k = −∇J(w_k).
The line search procedure cannot be computed as in the Hestenes–Stiefel CGA
previously described and could be substituted by minimizing the expression

[∇J(w_k + μ_k d_k)]^T d_k.        (B.60)

Moreover, the estimated Hessian of the CF, ∇²J(w_k), plays the role of the
matrix A.
A simple modified CGA method is defined by the following recurrence, starting
from the IC w₋₁ and β₀^{XY} = 0:

w₋₁   (w₋₁ arbitrary)   IC        (B.61)
d₋₁ = g₋₁ = −∇J(w₋₁)   IC        (B.62)

do {
    determine μ_k,                       Wolfe conditions
    w_{k+1} = w_k + μ_k d_k,             adaptation                    (B.63)
    g_{k+1} = −∇J(w_{k+1}),              gradient estimation
    compute β_k = β_k^{XY},              "beta" parameter              (B.64)
    d_{k+1} = g_{k+1} + β_k d_k,         compute the search direction  (B.65)
    if |g_{k+1}^T g_k| > 0.2 ||g_{k+1}||²  then  d_{k+1} = g_{k+1},
                                         restart condition             (B.66)
} while ( ||g_k|| > ε )

end criterion: exit when ||g_k|| < ε.
The parameter β_k^{XY}, which plays a central role in the nonlinear CGA, can be
determined through various philosophies of calculation. Below are the most
common methods for the calculation of the beta parameter, written in terms of
the residuals g_k = −∇J(w_k) (see [15] for details):

β_k^{HS} = (g_{k+1}^T (g_{k+1} − g_k)) / (d_k^T (g_k − g_{k+1})),
                                          Hestenes–Stiefel (HS)        (B.67)

β_k^{PRP} = (g_{k+1}^T (g_{k+1} − g_k)) / (g_k^T g_k),
                                          Polak–Ribière–Polyak (PRP)   (B.68)

β_k^{LS} = (g_{k+1}^T (g_{k+1} − g_k)) / (d_k^T g_k),
                                          Liu–Storey (LS)              (B.69)

β_k^{FR} = (g_{k+1}^T g_{k+1}) / (g_k^T g_k),
                                          Fletcher–Reeves (FR)         (B.70)

β_k^{CD} = (g_{k+1}^T g_{k+1}) / (d_k^T g_k),
                                          Conjugate Descent–Fletcher (CD) (B.71)

β_k^{DY} = (g_{k+1}^T g_{k+1}) / (d_k^T (g_k − g_{k+1})),
                                          Dai–Yuan (DY)                (B.72)
Note that, in the specialized literature, there are many other variants (see, for example, [4]). For a strictly quadratic CF this method reduces to the linear search provided μ_k is the exact minimizer [3]. Other choices of the parameter β_k^{XY} in (B.65) also possess this property and give rise to distinct algorithms for nonlinear problems.
In the CGA, the increase in convergence speed is obtained from information on the search direction, which depends on the previous iteration d_{k−1}; moreover, for a quadratic CF, it is conjugate to the gradient direction. Theoretically, for w ∈ ℝ^M, the algorithm converges in M or fewer iterations.
To avoid numerical inaccuracy in the direction search calculation, or because of the non-quadratic nature of the CF, the method requires a periodic reinitialization. Indeed, under certain conditions, (B.67)–(B.72) may assume negative values, so a more appropriate choice is

β_k = max{β_k^{XY}, 0}.   (B.73)

Thus, if a negative value of β_k^{PR} occurs, this strategy will restart the iteration along the correct steepest descent direction.
The CGA can be considered an intermediate approach between the SDA and the quasi-Newton method. Unlike other algorithms, the CGA's main advantage derives from not needing to explicitly estimate the Hessian matrix, which is, in practice, replaced by the β_k parameter.
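As an illustration of the recurrence (B.61)–(B.66), the following Python sketch (not from the text) minimizes a simple assumed non-quadratic CF, J(w) = (w₁ − 1)⁴ + (w₂ + 2)², using the PRP beta with the restart rule (B.73); a plain backtracking (Armijo) line search stands in for the Wolfe conditions.

```python
import math

def J(w):
    # assumed example CF: J(w) = (w1 - 1)^4 + (w2 + 2)^2, minimum at (1, -2)
    return (w[0] - 1.0) ** 4 + (w[1] + 2.0) ** 2

def grad_J(w):
    return [4.0 * (w[0] - 1.0) ** 3, 2.0 * (w[1] + 2.0)]

def nonlinear_cga(w, tol=1e-8, max_iter=500):
    g = [-gi for gi in grad_J(w)]        # residual g = -grad J(w), cf. (B.62)
    d = g[:]                             # initial search direction
    for _ in range(max_iter):
        # backtracking line search, a simple stand-in for the Wolfe conditions
        mu, j0 = 1.0, J(w)
        slope = sum(gi * di for gi, di in zip(g, d))
        while J([wi + mu * di for wi, di in zip(w, d)]) > j0 - 1e-4 * mu * slope:
            mu *= 0.5
            if mu < 1e-16:
                break
        w = [wi + mu * di for wi, di in zip(w, d)]             # adaptation (B.63)
        g_new = [-gi for gi in grad_J(w)]                      # gradient estimation
        num = sum(gn * (gn - go) for gn, go in zip(g_new, g))  # PRP numerator
        den = sum(go * go for go in g)
        beta = max(num / den, 0.0)                             # restart rule (B.73)
        d = [gn + beta * di for gn, di in zip(g_new, d)]       # direction (B.65)
        g = g_new
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:         # exit criterion
            break
    return w
```

Starting from w = (0, 0), the iterate converges to the minimizer (1, −2) in a handful of iterations.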
B.3 Constrained Optimization Problem
The problem of constrained optimization can be formulated as follows: find a vector w ∈ Ω ⊆ ℝ^M that minimizes (maximizes) a scalar function
min_{w∈Ω} J(w)   (B.74)

subject to (s.t.) the constraints

g_i(w) ≥ 0,   for i = 1, 2, ..., M.   (B.75)
Methods for solving constrained optimization problems are often characterized by
two conflicting needs:
• Finding admissible solutions,
• Finding the algorithm to minimize the objective function.
In general, there are two basic approaches:
• Transform the problems into simpler constrained problems,
• Transform the problems into a sequence (in the limit a single) of unconstrained
problems.
B.3.1 Single Equality Constraint: Existence and Characterization of the Minimum

As in unconstrained optimization problems (see Sect. B.1.2), to have an admissible solution some sufficient and necessary conditions must be satisfied.

For example, in the case of a single equality constraint the problem can be formulated as

min_{w∈Ω} J(w)   s.t.   h(w) = b.   (B.76)

A first-order necessary condition (FONC) for a minimum (or maximum) is that the functions J(w) and h(w) have continuous first-order partial derivatives and that there exists some free scalar parameter λ such that

∇J(w) + λ∇h(w) = 0   (B.77)

or, as illustrated in Fig. B.8, the two surfaces must be tangent. Note that h(w) = b and −h(w) = −b are the same constraint and that there is no restriction on the sign of λ.
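Condition (B.77) can be verified numerically on a small assumed example (not from the text): J(w) = w₁² + w₂² subject to h(w) = w₁ + w₂ = 1, whose minimizer is w = (0.5, 0.5) with multiplier λ = −1.

```python
# assumed example: J(w) = w1^2 + w2^2 with the constraint h(w) = w1 + w2 = 1
def fonc_residual(w, lam):
    grad_J = (2.0 * w[0], 2.0 * w[1])  # gradient of the CF
    grad_h = (1.0, 1.0)                # gradient of the constraint
    return tuple(gj + lam * gh for gj, gh in zip(grad_J, grad_h))

w_opt, lam_opt = (0.5, 0.5), -1.0      # constrained minimizer and its multiplier
res = fonc_residual(w_opt, lam_opt)    # should vanish, per (B.77)
```

At any other feasible point (e.g., (0.7, 0.3)) the residual of (B.77) is nonzero, so the gradients are not aligned there.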
B.3.2 Constrained Optimization: Method of Lagrange Multipliers

The method of Lagrange multipliers (MLM) is the fundamental tool for analyzing and solving nonlinear constrained optimization problems. Lagrange multipliers can be used to find the extrema of a multivariate function J(w) subject to the constraint h(w) = b, where J and h are functions with continuous first partial derivatives on an open set containing the curve h(w) − b = 0, and ∇h(w) ≠ 0 at any point on the curve.
Fig. B.8 At the optimal point the curves J(w) and h(w) = b are necessarily tangent
B.3.2.1 Optimization with Single Constraint

In the case of the single equality constrained optimization problem (B.76), we define the Lagrangian or Lagrange function as

L(w, λ) = J(w) + λ(h(w) − b)   (B.78)

such that, when the existence condition is verified, the solution can be found by solving the following unconstrained optimization problems associated with (B.76):

min_{w∈Ω} L(w, λ)   (B.79)
min_{λ} L(w, λ).   (B.80)

That is, ∇L(w, λ) = 0, or

∇_w L(w, λ) = ∇J(w) + λ∇h(w) = 0   (B.81)
∇_λ L(w, λ) = h(w) − b = 0.   (B.82)

If (B.81) and (B.82) hold, then (w, λ) is a stationary point for the Lagrange function. In other words, the Lagrange multiplier method represents a necessary condition for the existence of an optimal solution in such constrained optimization problems. Fig. B.9 shows an example of a constrained optimization problem for M = 2.
B.3.2.2 Optimization Problem with Multiple Inequality Constraints:
Kuhn–Tucker Conditions
The generalization for multiple constraints can be formulated as
Fig. B.9 Example of a constrained optimization problem for M = 2. The constrained optimum value is the point belonging to the constraint curve f(w) = b that is closest to the unconstrained optimum
min_{w∈Ω} J(w)   s.t.   g_i(w) ≥ 0,   i = 1, 2, ..., K   (B.83)

and the Lagrangian is defined as

L(w, λ) = J(w) + Σ_{i=1}^{K} λ_i g_i(w).   (B.84)

In this case, if a solution w* exists, then the following FONC, called the Kuhn–Tucker (KT) conditions, hold:

∇J(w*) + Σ_{i=1}^{K} λ*_i ∇g_i(w*) = 0
g_i(w*) ≥ 0
λ*_i ≤ 0,   for i = 1, 2, ..., K
λ*_i g_i(w*) = 0.   (B.85)
A feasible point w* for the minimization problem (B.83) is a regular point if the set of vectors ∇g_i(w*) is linearly independent over the set of indices corresponding to the constraints that hold with equality at the optimal point w*; formally,

∇g_i(w*),   i ∈ I_0,   for I_0 ≜ {i ∈ [1, K] : g_i(w*) = 0}.   (B.86)

In (B.85) we have assumed that the first derivatives ∇J(w) and ∇g(w) exist and that w* is a regular point, or that the constraints satisfy the regularity conditions. Moreover, a point w ∈ Ω ⊆ ℝ^M is called a feasible point, and the optimization problem is called consistent, if the set of feasible points is nonempty. A feasible point w* is a local minimizer if J(w*) is a minimum on the set of feasible points.

A point (w*, λ*) at which the KT conditions hold is called a saddle point for the Lagrangian function if J(w) is convex and all g_i(w) are concave. At the saddle point the Lagrangian satisfies the inequalities

L(w*, λ) ≤ L(w*, λ*) ≤ L(w, λ*).   (B.87)

So, for the Lagrange function a minimum exists with respect to w and a maximum with respect to λ. Note also that the last of the conditions (B.85), that is, λ*_i g_i(w*) = 0, i = 1, 2, ..., K, is called the complementary slackness condition.
Example Consider the problem

min_{w} (w₁² + w₂²)   s.t.   −(1/4)(w₁ − 2)² − (w₂ − 2)² + 1 ≥ 0.   (B.88)

The KT conditions are

[2w₁  2w₂]^T − λ[(1/2)(2 − w₁)  2(2 − w₂)]^T = 0
−(1/4)(w₁ − 2)² − (w₂ − 2)² + 1 ≥ 0
λ ≥ 0
λ(1 − (1/4)(w₁ − 2)² − (w₂ − 2)²) = 0,   (B.89)

as geometrically illustrated in Fig. B.10.
Calculation of the solution with the KT conditions: from the stationarity condition

[2w₁  2w₂]^T − λ[(1/2)(2 − w₁)  2(2 − w₂)]^T = 0   (B.90)

it follows that

w₁ = 2λ/(4 + λ),   w₂ = 2λ/(1 + λ).   (B.91)

For λ = 0, one has w₁ = 0 and w₂ = 0, which, however, is not a feasible solution, as the constraint conditions (B.88) are not met. It follows that λ must necessarily be positive. Substituting the values (B.91) into the constraint,
Fig. B.10 At the optimum point, the surface of the CF J(w) = w₁² + w₂² is tangent to the curve of the constraint g(w) = −(1/4)(w₁ − 2)² − (w₂ − 2)² + 1
−(1/4)(2λ/(4 + λ) − 2)² − (2λ/(1 + λ) − 2)² + 1 ≥ 0   (B.92)

and solving for the equality with λ > 0, we obtain λ = 1.8, for which the optimum point is w*₁ = 0.61, w*₂ = 1.28.
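The value λ ≈ 1.8 can be checked numerically; the following Python sketch (an illustration, not from the text) substitutes (B.91) into the active constraint (B.92) and solves for its positive root by bisection.

```python
def g_active(lam):
    # constraint of (B.88) evaluated on the stationary path (B.91)
    w1 = 2.0 * lam / (4.0 + lam)
    w2 = 2.0 * lam / (1.0 + lam)
    return -(w1 - 2.0) ** 2 / 4.0 - (w2 - 2.0) ** 2 + 1.0

# g_active is increasing for lam > 0, so bisect for its positive root
lo, hi = 1.0, 3.0          # g_active(1) < 0 < g_active(3)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if g_active(mid) < 0.0:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)
w1 = 2.0 * lam / (4.0 + lam)
w2 = 2.0 * lam / (1.0 + lam)
```

The root is λ ≈ 1.8, giving w₁ ≈ 0.61 and w₂ ≈ 1.28, in agreement with the values quoted above (to the printed precision).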
B.3.2.3 Optimization Problem with Mixed Constraints: Karush–Kuhn–Tucker Conditions

The KT conditions are generalized by the more general Karush–Kuhn–Tucker (KKT) conditions, which take into account equality and inequality constraints of the most general forms h_i(w) = 0, g_i(w) ≥ 0, and f_i(w) ≤ 0. The KKT conditions are necessary for a solution in nonlinear programming to be optimal, provided some regularity conditions are satisfied.
In the presence of equality and inequality constraints the nonlinear optimization problem can be written as

min J(w)   s.t.   { l_i(w) ≤ b_i,   i = 1, ..., K_l
                    g_i(w) ≥ b_i,   i = 1, ..., K_g
                    h_i(w) = b_i,   i = 1, ..., K_e,   (B.93)

where J(w), l_i(w), g_i(w), and h_i(w), for all i, have continuous first-order partial derivatives on some subset Ω ⊆ ℝ^M. Let

λ ∈ ℝ^K = [κ₁ ··· κ_{K_l}  σ₁ ··· σ_{K_g}  υ₁ ··· υ_{K_e}]^T,   (B.94)

with K = K_l + K_g + K_e, be the vector containing all the Lagrange multipliers, and

f(w) = [l(w)  g(w)  h(w)]^T   (B.95)

a vector of functions containing all the inequality and equality constraints; for the problem (B.93) the Lagrangian assumes the forms
L(w, λ) = J(w) + Σ_{i=1}^{K} λ_i (f_i(w) − b_i)
        = J(w) + [κ σ υ][l(w) g(w) h(w)]^T = J(w) + λ^T f(w),   (B.96)
where the vectors κ, σ, and υ are called dual variables. Further, suppose that w* is a regular point for the problem. If w* is a local minimum that satisfies some regularity conditions, then there exists a constant vector λ* such that (KKT conditions)

∇J(w*) + Σ_{i=1}^{K} λ*_i ∇f_i(w*) = 0   (B.97)

and

κ*_i ≥ 0,   i = 1, 2, ..., K_l
σ*_i ≤ 0,   i = 1, 2, ..., K_g
υ*_i of arbitrary sign,   i = 1, 2, ..., K_e
λ*_i = 0,   i ∈ I_0
λ*_i (f_i(w*) − b_i) = 0,   i = 1, 2, ..., K_l + K_g,   (B.98)

where I_0 denotes the set of indices i ∈ [1, K_l + K_g] for which the inequalities are satisfied at w* as strict inequalities.
If the functions J(w) and f_i(w) are convex (λ*_i > 0) or concave (λ*_i < 0), for i = 1, 2, ..., K, then the point (w*, λ*) is a saddle point of the Lagrangian function (B.96), and w* is a global minimizer of the problem (B.93).

Observe that when only equality constraints are present, h_i(w) = b_i, i = 1, 2, ..., K, the above condition simplifies to

∇J(w*) + Σ_{i=1}^{K} υ*_i ∇h_i(w*) = 0   (B.99)

and the conditions (B.98) are vacuous.
Remarks The KKT conditions provide that the intersection of the set of feasible directions with the set of descent directions coincides with the intersection of the set of feasible directions for the linearized constraints with the set of descent directions.

To ensure that the necessary KKT conditions identify a local minimum point, the assumption of regularity of the constraints must be satisfied. In general, one may require the regularity of all admissible solutions but, in practice, it is sufficient that the regularity conditions are satisfied only at such a point.
In some cases, the necessary conditions are also sufficient for optimality. This is the case when the objective function J and the inequality constraints l_i, g_i are continuously differentiable convex functions and the equality constraints h_j are affine functions. Moreover, the broadest class of functions for which the KKT conditions guarantee global optimality are the so-called invex functions.

The invex functions, which represent a generalization of convex functions, are defined as differentiable vector functions r(w) for which there exists a vector-valued function q(w, u) such that

r(w) − r(u) ≥ q(w, u)^T ∇r(u)   ∀ w, u.   (B.100)

In other words, a function r(w) is an invex function iff each stationary point (a point where the derivative is zero) is a global minimum point.

So, if the equality constraints are affine functions and the inequality constraints and the objective function are continuously differentiable invex functions, then the KKT conditions are sufficient for global optimality.
B.3.3 Dual Problem Formulation
Consider the previously treated optimization problem (Sect. B.3.2.2), with multiple inequality constraints (B.83) and Lagrangian (B.84), with a convex objective function J(w) and concave constraint functions g_i(w), here called the primal inequality-constrained problem.

For this problem, at the saddle point the Lagrangian satisfies the inequalities (B.87), that is, L(w*, λ) ≤ L(w*, λ*) ≤ L(w, λ*), and the following properties hold:

∇_w L(w*, λ*) = 0
∇_λ L(w*, λ*) = 0
∇_w L(w, λ*) ≥ 0
∇_λ L(w*, λ) ≤ 0.   (B.101)
Note that, since the Lagrangian exhibits a minimum with respect to w and a maximum with respect to λ, we can reformulate the primal inequality-constrained problem (B.83), (B.84) as the min–max problem of finding a vector w* which solves

min_{w∈Ω} max_{λ_i≥0} L(w, λ) = min_{w∈Ω} max_{λ_i≥0} {J(w) + Σ_{i=1}^{K} λ_i g_i(w)}.   (B.102)
The above expression allows us to transform the primal min–max problem (B.102) into an equivalent dual max–min problem defined as
max_{w∈Ω} L(w, λ)   s.t.   ∇J(w) + Σ_{i=1}^{K} λ_i ∇g_i(w) = 0,   λ_i ≥ 0.   (B.103)

Assuming that there is a unique minimum (w*, λ*) for the problem min_{w∈Ω} L(w, λ), then for each fixed vector λ ≥ 0 we can define a Lagrange function in terms of the Lagrange multipliers λ alone as

L(λ) ≜ min_{w∈Ω} L(w, λ).   (B.104)

The optimization problem can now be defined, in a simpler and more elegant dual form, as

max L(λ)   s.t.   λ_i ≥ 0,   i = 1, 2, ..., K,   (B.105)

where the Lagrange multipliers λ are called dual variables and L(λ) is called the dual objective function. So, letting g(w) = [g₁(w) ··· g_K(w)]^T be the vector containing the constraint functions, we obtain the simple relation

∇_λ L(λ) = g(w(λ)).   (B.106)
The dual form may or may not be simpler than the original (primal) optimization. In some particular cases, when the problem presents some special structure, the dual problem can be easier to solve. For example, the dual problem can show some advantage for separable and partially separable problems.
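A minimal worked example (assumed, not from the text) of the dual form (B.104)–(B.105): for min w² s.t. w ≥ 1, with the constraint written as g(w) = 1 − w ≤ 0 so that λ ≥ 0, the inner minimization gives w(λ) = λ/2 and L(λ) = λ − λ²/4, which is maximized at λ* = 2, w* = 1.

```python
# primal (assumed example): min w^2  s.t.  w >= 1; constraint as g(w) = 1 - w <= 0
def dual(lam):
    w = lam / 2.0                    # argmin_w of L(w, lam) = w^2 + lam*(1 - w)
    return w ** 2 + lam * (1.0 - w)  # dual objective L(lam), cf. (B.104)

# maximize the (concave) dual over lam >= 0 by a coarse grid search
lam_star = max((i * 0.001 for i in range(4001)), key=dual)
w_star = lam_star / 2.0
```

Here the dual optimum L(λ*) = 1 equals the primal optimum J(w*) = 1, a case of strong duality for this convex problem.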
Appendix C: Elements of Random Variables,
Stochastic Processes, and Estimation Theory
C.1 Random Variables
A random variable (RV) (or stochastic variable) is a variable that can assume different values depending on some random phenomenon [16–23].

Definition of RV (Papoulis [16]) An RV is a number x(ζ) ∈ (ℝ, ℂ) assigned to every outcome ζ ∈ S of an experiment. This number can be the gain in a game of chance, the voltage of a random source, ..., or any numerical quantity that is of interest in the performance of the experiment.

An RV is indicated as x(ζ), y(ζ), z(ζ), ... or x₁(ζ), x₂(ζ), ..., and can be defined with discrete or continuous values. For example, consider a poll of students at a certain university. The set of all students is denoted by S = (ζ₁, ζ₂, ..., ζ_N) while, as shown in Fig. C.1, the discrete RVs x₁(ζ) and x₂(ζ) represent, respectively, the age (in years) and the number of passed exams, while the continuous RVs x₄(ζ) and x₅(ζ) represent, respectively, the height and the weight of the students.

In other words, the RV x(ζ) ∈ ℝ represents a function with domain S, defined as an abstract probability space (or universal set of experimental results), of possibly infinite dimension (e.g., the 52-card deck, the six faces of a die, the value of a voltage generator, the temperature of an oven, etc.), which assigns to each ζ_k ∈ S a number, i.e., x : S → ℝ.

More formally, the result of the experiment ζ_k is defined as a stochastic event or occurrence ζ_k ∈ F ⊆ S, where the subset F, called events, is a σ-field, which represents a subset collection of S with the closure property.⁶
Remark The value related to a specific event or occurrence of an RV is denoted as
xðζkÞ ¼ x (e.g., if the kth student is 22 years old, x1ðζkÞ ¼ 22). Instead, the
⁶A σ-field (or σ-algebra or Borel field) is a collection of sets on which a given measure is defined. This concept is important in probability theory, where it is interpreted as a collection of events to which probabilities can be attributed.
notation x(ζ) = x is interpreted as an event defined by all occurrences of ζ such that x(ζ) = x. For example, x₂(ζ) = 15 denotes all the students who have passed 15 exams. Moreover, in the case of continuous RVs, the notation x(ζ) ≤ x or a ≤ x(ζ) ≤ b is interpreted as an interval. For example, 1.72 ≤ x₄(ζ) ≤ 1.82 denotes all the students with a height between 1.72 and 1.82 [m]. Indeed, for continuous RVs, a fixed value is meaningless and a range of values should always be considered (e.g., x₄(ζ) = 1.8221312567125367 is, obviously, meaningless).
In the study of RVs an important question concerns the probability⁷ related to an event ζ_k ∈ S, which can be defined by a nonnegative quantity denoted as p(ζ_k), k = 1, 2, .... However, it should be noted that the abstract probability space may not be a metric space. So, rather than referring to the elements ζ_k ∈ S, we consider the RVs x(ζ) ∈ ℝ associated with the events that, by definition, are defined on a metric space. For example, what is the probability that x₁(ζ) ≤ 24 or that x₂(ζ) = 20? Or that x₄(ζ) ≤ 1.85 or 71.3 ≤ x₅(ζ) ≤ 90.2? For this reason, the predictability of the events x(ζ_k) = x or, considering the continuous case, x(ζ) ≤ x, or a ≤ x(ζ) ≤ b, ..., is manipulated through a probability function p(·), characterized by the following axiomatic properties:

p(x(ζ) = +∞) = 0
p(x(ζ) = −∞) = 0.   (C.1)

From the above definitions random phenomena can be characterized by (1) the definition of an abstract probability space described by the triple (S, F, p) and (2) the axiomatic definition of the probability of an RV.
Fig. C.1 Example of RVs defined over the set of students S for a scholastic poll: student's age x₁(ζ), number of exams x₂(ζ), eye color x₃(ζ), height x₄(ζ), and weight x₅(ζ)
⁷From the Latin probare (test, try) and ilis (be able to).
Remark In the context of RVs, care must be taken with the notation used. Sometimes RVs are indicated as X(ζ) or as x(ζ) (as in [16]). In these notes we prefer using the italic font x(ζ) for an RV, bold font x(ζ) for RV vectors, and the form x(t, ζ) or x(t, ζ) (x[n, ζ] or x[n, ζ] in DT) for stochastic processes. Moreover, a complex RV z(ζ) ∈ ℂ is defined as the sum z(ζ) = x(ζ) + j·y(ζ), where x(ζ), y(ζ) ∈ ℝ.
C.1.1 Distributions and Probability Density Function

The elements of an event x(ζ) ≤ x change depending on the number x; it follows that the probability of this event, indicated as p(x(ζ) ≤ x), is a function of x itself.

Given an RV x(ζ), we define the probability density function (pdf), denoted as f_x(x), as a nonnegative integrable function such that

p(a ≤ x(ζ) ≤ b) = ∫_a^b f_x(x) dx,   probability density function.   (C.2)
Therefore, from the basic axioms (C.1) it is possible to demonstrate that the probability of the sure event can be written as

∫_{−∞}^{+∞} f_x(x) dx = 1.   (C.3)

Moreover, the event x(ζ) ≤ x is characterized by the cumulative density function (cdf) defined as

F_x(x) = p(x(ζ) ≤ x),   for −∞ < x < ∞   (C.4)

or, from (C.2),

F_x(x) = ∫_{−∞}^{x} f_x(υ) dυ,   cumulative density function.   (C.5)

In fact, we have that f_x(x) = dF_x(x)/dx, and the value of the cdf represents a measure of the probability p(x(ζ) ≤ x).

For the cdf the following properties apply:

0 ≤ F_x(x) ≤ 1;   F_x(−∞) = 0;   F_x(+∞) = 1
F_x(x₁) ≤ F_x(x₂)   if x₁ < x₂.

It follows that the cdf is a nondecreasing monotone function.
Note that f_x(x) is not a probability measure. To obtain the probability of the event x < x(ζ) ≤ x + Δx, we must multiply the pdf by the interval Δx. That is,

f_x(x)Δx ≈ ΔF_x(x) ≜ F_x(x + Δx) − F_x(x) = p(x < x(ζ) ≤ x + Δx).   (C.6)
Some examples of continuous, discrete, and mixed pdfs and cdfs are reported in Fig. C.2.
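Relation (C.6) can be checked numerically; the following sketch (an illustration, not from the text) uses a unit-rate exponential RV as an assumed example, with f_x(x) = e^(−x) and F_x(x) = 1 − e^(−x).

```python
import math

f = lambda x: math.exp(-x)        # assumed pdf: unit-rate exponential
F = lambda x: 1.0 - math.exp(-x)  # its cdf, satisfying F'(x) = f(x)

x, dx = 1.0, 1e-5
lhs = f(x) * dx                   # f_x(x) * Delta x
rhs = F(x + dx) - F(x)            # Delta F_x(x), as in (C.6)
```

For small Δx the two sides agree to within O(Δx²), and F(x) → 1 as x → ∞, consistent with (C.3).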
C.1.2 Statistical Averages
The pdf completely characterizes an RV. However, in many situations it is convenient or necessary to represent the RV more concisely through a few specific parameters that describe its average behavior. These numbers, defined as statistical averages or moments, are determined by the mathematical expectation. Note that even if, formally, knowledge of the pdf is necessary for the determination of the statistical averages, these averages can somehow be estimated without explicit knowledge of the pdf.
C.1.2.1 Expectation Operator

The mathematical expectation, usually indicated as E{x(ζ)}, is a number defined by the following integral:

E{x(ζ)} = ∫_{−∞}^{∞} x f_x(x) dx,   (C.7)
Fig. C.2 Example trends of the cumulative distribution function (top) and of the probability density function (bottom) for a discrete RV (left), a continuous RV (middle), and a mixed discrete–continuous RV (right)
where E{·} indicates the expected value or the average value or mean value. The expected value is also indicated as μ = E{x(ζ)}.
C.1.2.2 Moments and Central Moments

Considering a function of an RV, denoted as g[x(ζ)], the expected value becomes

E{g[x(ζ)]} = ∫_{−∞}^{∞} g(x) f_x(x) dx.   (C.8)

In the case that g[x(ζ)] = x^m(ζ) (raising to the mth power), the previous expression is defined as the moment of order m:

E{x^m(ζ)} = ∫_{−∞}^{∞} x^m f_x(x) dx.   (C.9)

The calculation of the moment is of particular significance when the expected value μ is removed from the RV, i.e., considering the RV (x(ζ) − μ). In this case the statistical function, called the central moment, is defined as

E{(x(ζ) − μ)^m} = ∫_{−∞}^{∞} (x − μ)^m f_x(x) dx.   (C.10)
C.1.3 Statistical Quantities Associated with Moments of Order m

The moments computed with the previous expressions are of particular significance for certain orders. For example, the first-order moment (m = 1) is just the expected value μ defined by (C.7). Generalizing, moments and central moments of any order can be written as

r_x^{(m)} = E{x^m(ζ)},   c_x^{(m)} = E{(x(ζ) − μ)^m}.   (C.11)

In particular, note that c_x^{(0)} = 1 and c_x^{(1)} = 0; moreover, it is obvious that for zero-mean processes the central moments are identical to the moments.
C.1.3.1 Variance and Standard Deviation

We define the variance, indicated as σ_x², as the value of the second-order central moment

σ_x² = c_x^{(2)} = E{[x(ζ) − μ]²} = ∫_{−∞}^{∞} (x − μ)² f_x(x) dx,   (C.12)

where the positive constant σ_x = √(σ_x²) is defined as the standard deviation of x.

Figure C.3 shows the pdfs of two overlapped Gaussian (or normal) processes with representations of the expected value and standard deviation. (The expression of the normal distribution pdf is given in Sect. C.1.5.2.)
C.1.3.2 The Third- and Fourth-Order Moments: Skewness and Kurtosis

The skewness is the statistical quantity associated with the third-order central moment, defined by the following relation:

k_x^{(3)} ≜ E{((x(ζ) − μ)/σ_x)³} = c_x^{(3)}/σ_x³.   (C.13)

The skewness, as illustrated in Fig. C.4a for k_x^{(3)} > 0 and k_x^{(3)} < 0, represents the degree of asymmetry of a generic pdf. In fact, in the case where the pdf is symmetric, the skewness is zero.

The kurtosis is a statistical quantity related to the fourth-order moment, defined as

k_x^{(4)} ≜ E{((x(ζ) − μ)/σ_x)⁴} − 3 = c_x^{(4)}/σ_x⁴ − 3.   (C.14)

Note that the term −3, as we shall see later, provides a zero kurtosis in the case of Gaussian distribution processes. As illustrated in Fig. C.4b, for k_x^{(4)} > 0, there is a "narrow" distribution trend that is called super-Gaussian. If k_x^{(4)} < 0, the trend of the pdf is more "broad" and is called sub-Gaussian.
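Definitions (C.13) and (C.14) apply directly to a discrete pmf as well; a small sketch (assumed example, not from the text: a fair die) shows zero skewness for a symmetric distribution and a negative (sub-Gaussian) kurtosis.

```python
# skewness and kurtosis of a fair die (assumed example) from its pmf
vals = [1, 2, 3, 4, 5, 6]
p = 1.0 / 6
mu = sum(v * p for v in vals)                       # mean = 3.5
c = lambda m: sum((v - mu) ** m * p for v in vals)  # central moment, cf. (C.10)
var = c(2)                                          # 35/12
skew = c(3) / var ** 1.5                            # (C.13): 0 by symmetry
kurt = c(4) / var ** 2 - 3.0                        # (C.14): about -1.27
```

The negative kurtosis classifies this flat, bounded distribution as sub-Gaussian, in line with the discussion above.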
Fig. C.3 Typical trends of Gaussian or normal pdfs with the indication of the expected value and standard deviation
C.1.3.3 Chebyshev's Inequality

Given an RV x(ζ) with mean value μ and standard deviation σ_x, for any real number k > 0 the following inequality is true:

p(|x(ζ) − μ| ≥ kσ_x) ≤ 1/k²,   k > 0.   (C.15)

That is, an RV deviates from its mean value by at least k standard deviations with probability less than or equal to 1/k². Chebyshev's inequality (C.15) is a useful result for a generic distribution f_x(x), regardless of its form.
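Inequality (C.15) can be checked exactly for a discrete RV; the sketch below (assumed example, not from the text: a fair die) compares the exact tail probability against the bound 1/k².

```python
# exact check of Chebyshev's inequality (C.15) for a fair-die RV
vals = [1, 2, 3, 4, 5, 6]
p = 1.0 / 6
mu = sum(v * p for v in vals)                        # mean = 3.5
sigma = sum((v - mu) ** 2 * p for v in vals) ** 0.5  # standard deviation

def tail(k):
    # p(|x - mu| >= k*sigma), computed exactly from the pmf
    return sum(p for v in vals if abs(v - mu) >= k * sigma)
```

For every k > 0 the exact tail never exceeds 1/k²; e.g., for k = 1.1 the tail is 1/3 (outcomes 1 and 6) against a bound of about 0.83.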
C.1.3.4 Characteristic Function and Cumulants

Consider the sign-reversed Laplace (or Fourier) transform of the pdf f_x(x), which, in the context of statistics, is called the characteristic function, defined as

Φ_x(s) = ∫_{−∞}^{∞} f_x(x) e^{sx} dx,   (C.16)

where s is the complex Laplace variable.⁸

Equation (C.16) can be interpreted as the moment-generating function. In fact, the Taylor series development of (C.16) around s = 0 yields
Fig. C.4 Typical trends of distributions with positive and negative (a) skewness; (b) kurtosis (zero: Gaussian or normal distribution)
⁸The complex Laplace variable can be written s = α + jξ. Note that the complex part jξ should not be interpreted as a frequency.
Φ_x(s) ≜ E{e^{sx(ζ)}} = E{1 + sx(ζ) + (sx(ζ))²/2! + ··· + (sx(ζ))^m/m! + ···}
       = 1 + sμ + (s²/2!) r_x^{(2)} + ··· + (s^m/m!) r_x^{(m)} + ···,   (C.17)

which is defined in terms of all the moments of the RV x(ζ). In addition, we can note that differentiating (C.17) at s = 0 yields

r_x^{(m)} = d^m Φ_x(s)/ds^m |_{s=0},   for m = 1, 2, ....   (C.18)
The cumulants are statistical descriptors, similar to the moments, which allow having "more information" in the case of higher-order statistics.

The cumulant-generating function is defined as the logarithm of the moment-generating function:

Ψ_x(s) ≜ ln Φ_x(s).   (C.19)

Hence, we define the mth-order cumulant as

κ_x^{(m)} ≜ d^m Ψ_x(s)/ds^m |_{s=0},   for m = 1, 2, ....   (C.20)

From the above definition we can see that, for a zero-mean RV, the first five cumulants are

κ_x^{(1)} = r_x^{(1)} = μ = 0
κ_x^{(2)} = r_x^{(2)} = σ_x²
κ_x^{(3)} = c_x^{(3)}
κ_x^{(4)} = c_x^{(4)} − 3σ_x⁴
κ_x^{(5)} = c_x^{(5)} − 10 c_x^{(3)} σ_x².   (C.21)

Note that the first two are identical to the central moments.
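The relations (C.21) can be verified on a zero-mean discrete RV; for the binary RV x ∈ {−1, +1} with p = 1/2 (an assumed example, not from the text), they give κ₂ = 1, κ₃ = 0, and κ₄ = 1 − 3 = −2.

```python
# cumulants via (C.21) for the zero-mean binary RV x in {-1, +1}, p = 1/2
vals, p = [-1.0, 1.0], 0.5
c = lambda m: sum(v ** m * p for v in vals)  # central moment (the mean is zero)
var = c(2)                   # kappa_2 = sigma^2 = 1
k3 = c(3)                    # kappa_3 = 0 (symmetric pmf)
k4 = c(4) - 3.0 * var ** 2   # kappa_4 = 1 - 3 = -2
```

The strongly negative κ₄ marks this binary RV as sub-Gaussian, consistent with the kurtosis discussion in Sect. C.1.3.2.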
C.1.4 Dependent RVs: The Joint and Conditional Probability Distribution

If there is some dependence between two (or more) RVs, one needs to study how the probability of one affects the other and vice versa.

For example, consider the experiment described in Fig. C.1, where the RVs x₄ and x₅, representing, respectively, the height and weight of students, are statistically dependent, as are the age x₁ and the number of exams x₂. In probabilistic terms, this means that tall students are probably heavier or, considering the random variables x₁ and x₂, that younger students are likely to have passed fewer exams.
In terms of pdfs, given two RVs x(ζ) and y(ζ), we define the joint pdf, denoted as f_xy(x, y), as the pdf of the event obtained by the intersection between the events a ≤ x(ζ) ≤ b and c ≤ y(ζ) ≤ d, i.e., the distribution probability of the occurrence of the two events. Therefore, extending the definition (C.2), the joint pdf can be defined by the following integral:

p(a ≤ x(ζ) ≤ b, c ≤ y(ζ) ≤ d) = ∫_c^d ∫_a^b f_xy(x, y) dx dy,   joint pdf   (C.22)

namely, the probability that x(ζ) and y(ζ) assume values inside the intervals [a, b] and [c, d], respectively. Let us also define f_x|y(x|y), the conditional pdf of x(ζ) given y(ζ), such that it is possible to evaluate the probability of the event p(a ≤ x(ζ) ≤ b, y(ζ) = c) as

p(a ≤ x(ζ) ≤ b, y(ζ) = c) = ∫_a^b f_x|y(x|y) dx,   conditional pdf   (C.23)

i.e., the probability that x(ζ) assumes a value inside the interval [a, b] given that y(ζ) = c.

Let f_y(y) be the pdf of y(ζ), called in this context the marginal pdf. From the previous expressions, the joint pdf, in the case that x(ζ) is conditioned by y(ζ), can be written as f_xy(x, y) = f_x|y(x|y) f_y(y). This expression indicates how the probability of the event x(ζ) is conditioned by the probability of y(ζ). Moreover, let f_x(x) be the marginal pdf of x(ζ); by simple symmetry it follows that the joint pdf is also f_xy(x, y) = f_y|x(y|x) f_x(x). So we can now relate the joint and conditional pdfs by Bayes' rule, which states that

f_xy(x, y) = f_x|y(x|y) f_y(y) = f_y|x(y|x) f_x(x).   Bayes' rule   (C.24)

Moreover, we have

∫_x ∫_y f_xy(x, y) dy dx = 1.   (C.25)

Definition Two (or more) RVs are independent iff

f_x|y(x|y) = f_x(x)   and   f_y|x(y|x) = f_y(y)   (C.26)

or, considering (C.24), iff

f_xy(x, y) = f_x(x) f_y(y).   (C.27)
Property If two RVs are independent they are necessarily uncorrelated.
The covariance and the correlation of jointly distributed RVs are respectively defined as

c_xy^{(2)} = E{x(ζ)y(ζ)} − E{y(ζ)}E{x(ζ)}   (C.28)
r_xy^{(2)} = c_xy^{(2)}/(σ_x σ_y).   (C.29)

Two RVs x(ζ), y(ζ) are uncorrelated iff their cross-correlation (covariance) is zero. Consequently, if (C.27) holds, then E{x(ζ)y(ζ)} = E{y(ζ)}E{x(ζ)}, and by (C.28) their cross-correlation is zero.

Finally, note that if two RVs are uncorrelated, they are not necessarily independent.
C.1.5 Typical RV Distributions

C.1.5.1 Uniform Distribution

The uniform distribution is appropriate for the description of an RV with equiprobable events in the interval [a, b]. The pdf of the uniform distribution is defined as

f_x(x) = { 1/(b − a),   a ≤ x ≤ b
           0,           elsewhere.   (C.30)

The corresponding cdf is

F_x(x) = ∫_{−∞}^{x} f_x(v) dv = { 0,                 x < a
                                  (x − a)/(b − a),   a ≤ x ≤ b
                                  1,                 x > b.   (C.31)

Its characteristic function is

Φ_x(s) = (e^{sb} − e^{sa})/(s(b − a)).   (C.32)

Finally, the mean value and the variance are

μ = (a + b)/2   and   σ_x² = (b − a)²/12.   (C.33)
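Formulas (C.33) can be verified by numerically integrating the uniform pdf (C.30); a sketch (an illustration, not from the text) with the assumed interval [a, b] = [2, 5] and a midpoint rule:

```python
# midpoint-rule check of (C.33) for a uniform pdf on the assumed interval [2, 5]
a, b, n = 2.0, 5.0, 20000
f = 1.0 / (b - a)                              # pdf value inside [a, b], (C.30)
dx = (b - a) / n
xs = [a + (i + 0.5) * dx for i in range(n)]    # midpoints of the subintervals
mu = sum(x * f * dx for x in xs)               # should equal (a + b)/2 = 3.5
var = sum((x - mu) ** 2 * f * dx for x in xs)  # should equal (b - a)^2/12 = 0.75
```

The numerical mean and variance match (a + b)/2 and (b − a)²/12 to within the quadrature error.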
C.1.5.2 Normal Distribution

The normal distribution, also called the Gaussian distribution, is one of the most useful and appropriate descriptions of many statistical phenomena.

The normal distribution pdf, already illustrated in Fig. C.3, with mean value μ and standard deviation σ_x, is

f_x(x) = (1/√(2πσ_x²)) e^{−(x−μ)²/(2σ_x²)}   (C.34)

with a CF

Φ_x(s) = e^{μs + (1/2)σ_x²s²}.   (C.35)

From the previous equations, an RV with normal pdf, often referred to as N(μ, σ_x²), is defined by its mean value μ and its variance σ_x². Note also that the moments of higher order can be determined in terms of only the first two moments. In fact, we have (Fig. C.5)

c_x^{(m)} = E{|x(ζ) − μ|^m} = { 1·3·5···(m − 1)σ_x^m,   for m even
                                0,                       for m odd.   (C.36)

In particular, the fourth-order moment is c_x^{(4)} = 3σ_x⁴ and for the Gaussian distribution the kurtosis is zero.

Remark From (C.36) we observe that an RV with Gaussian distribution is fully characterized only by its mean value and variance and that the moments of higher order do not contain any additional useful information.
C.1.6 The Central Limit Theorem

An important theorem is the central limit theorem, whose statement says that the sum of N independent RVs with the same distribution (i.e., iid) with finite variance tends to the normal distribution as N → ∞.

Fig. C.5 Qualitative behavior of some typical distributions: uniform, Gaussian, super-Gaussian, and sub-Gaussian

A generalization of the theorem, due to Gnedenko and Kolmogorov, valid for a wider class of distributions, states that the sum of RVs with slowly decaying power tails, decreasing as 1/|x|^{α+1} with α ≤ 2, tends to the Levy alpha-stable distribution as N → ∞.
C.1.7 Random Variables Vectors
A random vector or RV vector is defined as a collection of RVs of the type

\mathbf{x}(\zeta) = [x_0(\zeta)\;\; x_1(\zeta)\;\; \cdots]^T.
By a generalization of the definition (C.7), the expectation of a random vector is also a vector that, omitting the writing of the event (\zeta), is defined as

\boldsymbol{\mu} = E\{\mathbf{x}\} = [E\{x_0\}\;\; E\{x_1\}\;\; \cdots]^T = [\mu_0\;\; \mu_1\;\; \cdots]^T. \qquad (C.37)
C.1.8 Covariance and Correlation Matrix
In the case of a random vector, the second-order statistic is a matrix. Therefore, the covariance matrix is defined as

\mathbf{C}_x = E\left\{ (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \right\}, \quad \text{Covariance matrix.} \qquad (C.38)
For example, given a two-dimensional random vector \mathbf{x} = [x_0\;\; x_1]^T, the covariance is defined as

\mathbf{C}_x = E\left\{ \begin{bmatrix} x_0 - \mu_0 \\ x_1 - \mu_1 \end{bmatrix} \begin{bmatrix} (x_0 - \mu_0) & (x_1 - \mu_1) \end{bmatrix} \right\} = \begin{bmatrix} E\{|x_0 - \mu_0|^2\} & E\{(x_0 - \mu_0)(x_1 - \mu_1)\} \\ E\{(x_1 - \mu_1)(x_0 - \mu_0)\} & E\{|x_1 - \mu_1|^2\} \end{bmatrix} \qquad (C.39)
so the autocovariance matrix is symmetric

\mathbf{C}_x = \mathbf{C}_x^T, \qquad (C.40)

where the superscript "T" indicates matrix transposition. Moreover, the autocorrelation matrix is defined as

\mathbf{R}_x = E\{\mathbf{x}\mathbf{x}^T\}, \quad \text{Autocorrelation matrix.} \qquad (C.41)
For the two-dimensional RV previously defined, it is then

\mathbf{R}_x = \begin{bmatrix} E\{|x_0|^2\} & E\{x_0 x_1\} \\ E\{x_1 x_0\} & E\{|x_1|^2\} \end{bmatrix} \qquad (C.42)
and

\mathbf{R}_x = \mathbf{R}_x^T. \qquad (C.43)
Property The autocorrelation matrix of an RV vector \mathbf{x} is always nonnegative definite, i.e., for each vector \mathbf{w} = [w_0\;\; w_1\;\; \cdots\;\; w_{M-1}]^T the quadratic form \mathbf{w}^T\mathbf{R}_x\mathbf{w} is positive semi-definite or nonnegative:

\mathbf{w}^T \mathbf{R}_x \mathbf{w} \geq 0. \qquad (C.44)
Proof Consider the inner product between \mathbf{x} and \mathbf{w}

\alpha = \mathbf{w}^T\mathbf{x} = \mathbf{x}^T\mathbf{w} = \sum_{k=0}^{M-1} w_k x_k. \qquad (C.45)

The mean squared value of the RV \alpha is defined as

E\{\alpha^2\} = E\{\mathbf{w}^T\mathbf{x}\mathbf{x}^T\mathbf{w}\} = \mathbf{w}^T E\{\mathbf{x}\mathbf{x}^T\} \mathbf{w} = \mathbf{w}^T\mathbf{R}_x\mathbf{w}. \qquad (C.46)

Since, by definition, \alpha^2 \geq 0, it follows that \mathbf{w}^T\mathbf{R}_x\mathbf{w} \geq 0.
Q.E.D.
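The nonnegativity of the quadratic form (C.44) can be illustrated numerically. A minimal NumPy sketch (an illustration added here, not from the original text; the sizes and the sample-estimate construction of R are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# A sample estimate of the autocorrelation matrix R = E{x x^T}, built from
# K observed random vectors; by construction it is symmetric and positive
# semi-definite, so w^T R w >= 0 for every w (cf. (C.44)).
M, K = 8, 1000
X = rng.standard_normal((K, M))       # K realizations of an M-dim RV vector
R = (X.T @ X) / K                     # sample autocorrelation matrix

# Quadratic forms for many random test vectors w are never negative.
quad_forms = [w @ R @ w for w in rng.standard_normal((100, M))]
min_qf = min(quad_forms)
```

Since R = (1/K) X^T X, every quadratic form equals ‖Xw‖²/K, which makes the nonnegativity explicit.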
C.1.8.1 Eigenvalues and Eigenvectors of the Autocorrelation Matrix
From geometry (see Sect. A.8), the eigenvalues can be computed by solving the characteristic polynomial p(\lambda), defined as p(\lambda) \triangleq \det(\mathbf{R} - \lambda\mathbf{I}) = 0.
An autocorrelation matrix \mathbf{R} \in \mathbb{R}^{M \times M} is symmetric (Hermitian in the complex case) and positive semi-definite. We know that for this type of matrix the following properties are valid.
1. The eigenvalues \lambda_i of \mathbf{R} are real and nonnegative. In fact, from (A.61) we have that \mathbf{R}\mathbf{q}_i = \lambda_i\mathbf{q}_i, and by left multiplying by \mathbf{q}_i^T, we get

\mathbf{q}_i^T\mathbf{R}\mathbf{q}_i = \lambda_i\, \mathbf{q}_i^T\mathbf{q}_i \;\Rightarrow\; \lambda_i = \frac{\mathbf{q}_i^T\mathbf{R}\mathbf{q}_i}{\mathbf{q}_i^T\mathbf{q}_i} \geq 0, \quad \text{Rayleigh quotient.} \qquad (C.47)
2. The eigenvectors \mathbf{q}_i, i = 0, 1, \ldots, M-1, of \mathbf{R} are orthogonal for distinct values of \lambda_i:

\mathbf{q}_i^T\mathbf{q}_j = 0, \quad \text{for } i \neq j. \qquad (C.48)
3. The matrix \mathbf{R} can always be diagonalized as

\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T, \qquad (C.49)

where \mathbf{Q} = [\mathbf{q}_0\;\; \mathbf{q}_1\;\; \cdots\;\; \mathbf{q}_{M-1}],

\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_0, \lambda_1, \ldots, \lambda_{M-1}) \qquad (C.50)

and \mathbf{Q} is a unitary matrix, i.e., \mathbf{Q}^T\mathbf{Q} = \mathbf{I}.
4. An alternative representation for \mathbf{R} is

\mathbf{R} = \sum_{i=0}^{M-1} \lambda_i\, \mathbf{q}_i\mathbf{q}_i^T = \sum_{i=0}^{M-1} \lambda_i\, \mathbf{P}_i, \qquad (C.51)

where the term \mathbf{P}_i = \mathbf{q}_i\mathbf{q}_i^T is defined as a spectral projection.
5. The trace of the matrix \mathbf{R} is

\mathrm{tr}[\mathbf{R}] = \sum_{i=0}^{M-1} \lambda_i \;\Rightarrow\; \frac{1}{M}\sum_{i=0}^{M-1} \lambda_i = r_{xx}[0] = \sigma_x^2. \qquad (C.52)
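Properties 1–5 can be verified numerically on a sample autocorrelation matrix. A minimal NumPy sketch (not from the original text; the data and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Check properties 1-5 on a sample autocorrelation matrix R:
# real nonnegative eigenvalues, orthonormal eigenvectors, R = Q Lambda Q^T,
# and tr[R] = sum of the eigenvalues (cf. (C.47)-(C.52)).
M = 6
X = rng.standard_normal((5000, M))
R = (X.T @ X) / 5000                    # symmetric, positive semi-definite

lam, Q = np.linalg.eigh(R)              # eigh: for symmetric matrices

ok_nonneg = bool(np.all(lam >= -1e-12))            # property 1
ok_orth = np.allclose(Q.T @ Q, np.eye(M))          # properties 2-3 (unitary Q)
ok_diag = np.allclose(Q @ np.diag(lam) @ Q.T, R)   # property 3
ok_trace = np.isclose(np.trace(R), lam.sum())      # property 5
```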
C.2 Stochastic Processes
Generalizing the concept of RV, a stochastic process (SP) is a rule to assign to each result \zeta a function x(t, \zeta). Hence, an SP is a family of two-dimensional functions of the variables t and \zeta, where the domain is defined over the set of all the experimental results \zeta \in S, while the time variable t represents the set of real numbers t \in \mathbb{R}. If \mathbb{R} represents the real axis of time, then x(t, \zeta) is a continuous-time stochastic process. In the case that the time domain is a set of integers, we have a discrete-time stochastic process, and the time index is denoted by n \in \mathbb{Z}.
In general terms, a discrete-time SP is a time series x[n, \zeta] consisting of all possible sequences of the process. Each individual sequence, corresponding to a specific result \zeta = \zeta_k and indicated as x[n, \zeta_k], represents an RV sequence (indexed by n) that is called a realization or sample sequence of the process.
Since the SP is a function of two variables, there are four possible interpretations:

i) x[n, \zeta] is an SP \Rightarrow n variable, \zeta variable;
ii) x[n, \zeta_k] is an RV sequence \Rightarrow n variable, \zeta fixed;
iii) x[n_k, \zeta] is an RV \Rightarrow n fixed, \zeta variable;
iv) x[n_k, \zeta_k] is a number \Rightarrow n fixed, \zeta fixed.
For clarity of presentation, as usual in many scientific contexts (signal processing, neural networks, etc.), the writing of the parameter \zeta is omitted and, later in the text, the SP x[n, \zeta] is indicated only with x[n] or \mathbf{x}[n] (sometimes bold is omitted), and the sample sequence of the process x[n, \zeta_k] is often simply referred to as x_k[n].

Definition We define a discrete-time stochastic process (DT-SP), denoted as \mathbf{x}[n] \in \mathbb{R}^N, as an RV vector defined as

\mathbf{x}[n] = \left[ x_1[n], x_2[n], \ldots, x_N[n] \right], \qquad (C.53)
where the integer n \in \mathbb{Z} represents the time index. Note, as illustrated in Fig. C.6, that in (C.53) each realization x_k[n] represents an RV sequence of the same process.
C.2.1 Statistical Averages of an SP
The determination of the statistical averages of SPs can be performed exactly as for RVs. In fact, note that for a given fixed temporal index (see interpretation iii) the process reduces to a simple RV, so that it is possible to evaluate all the statistical functions proceeding as in Sect. C.1.2. Similarly, fixing the parameter \zeta and considering two different temporal indexes n_1 and n_2, we are in the presence of joint RVs, so that it is possible to characterize the process by the joint cdf F_x[x_1, x_2; n_1, n_2]. However, in general an SP contains an infinite number of
Fig. C.6 Representation of the stochastic process x[n, \zeta] as a set of realizations x[n, \zeta_1], x[n, \zeta_2], \ldots, x[n, \zeta_N]: fixing n = n_k gives the RV x[n_k, \zeta], while fixing \zeta = \zeta_k gives the sample sequence x[n, \zeta_k]. As usual in the context of DSP, the process sample is simply indicated as x[n]
such RVs; hence, to completely describe, in a statistical sense, an SP, the knowledge of the k-order joint cdf, for every k, is sufficient. It is defined as

F_x[x_1, \ldots, x_k; n_1, \ldots, n_k] = p\left\{ x[n_1] \leq x_1, \ldots, x[n_k] \leq x_k \right\}. \qquad (C.54)
On the other hand, an SP can be characterized by the joint pdf defined as

f_x[x_1, \ldots, x_k; n_1, \ldots, n_k] \triangleq \frac{\partial^k F_x[x_1, \ldots, x_k; n_1, \ldots, n_k]}{\partial x_1\, \partial x_2 \cdots \partial x_k}. \qquad (C.55)
From now on we write the SP simply as x[n] (not in bold).
C.2.1.1 First-Order Moment: Expectation
We define the expected value of an SP x[n], with pdf f(x[n]), as the value of its first-order moment at a given time index n. According to Eq. (C.7), the expected value is defined as

\mu_n = E\{x[n]\}. \qquad (C.56)
Referring to Fig. C.6, and considering the notation x[n, \zeta], the expectation operator E\{\cdot\} represents the ensemble average of the RV: \mu_{n_k} = E\{x[n_k, \zeta]\}.
Equation (C.56) can also be interpreted in terms of relative frequency by the following expression:

\mu_{n_k} = \lim_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} x_j[n_k]. \qquad (C.57)

In other words (see Fig. C.6), the expectation represents the mean value of the set of RVs x[n_k] at a fixed time instant.
If the process is not stationary, i.e., its statistics change in time, its mean value varies over time. So, in general, we have

\mu_n \neq \mu_m, \quad \text{for } n \neq m. \qquad (C.58)
C.2.1.2 Second-Order Moment: Autocorrelation and Autocovariance
We define the autocorrelation, or second-order moment, as the sequence

r[n, m] = E\{x[n]\, x[m]\}. \qquad (C.59)
In terms of relative frequency, Eq. (C.59) can be written as

r[n, m] = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} x_k[n]\, x_k[m]. \qquad (C.60)
The autocorrelation is a measure that indicates the degree of association or dependency between the process at time n and at time m. Moreover, we have that

r[n, n] = E\{x^2[n]\}, \quad \text{average power of the sequence.}
We define the autocovariance, or second-order central moment, as the sequence

c[n, m] = E\left\{ (x[n] - \mu_n)(x[m] - \mu_m) \right\} = r[n, m] - \mu_n\mu_m. \qquad (C.61)
C.2.1.3 Variance and Standard Deviation
Similarly to the definition in Sect. C.1.3.1, the variance of an SP is a value related to the central second-order moment, defined as

\sigma_{x_n}^2 = E\left\{ (x[n] - \mu_n)^2 \right\} = E\{x^2[n]\} - \mu_n^2. \qquad (C.62)

The quantity \sigma_{x_n} is defined as the standard deviation, which represents a measure of the dispersion of the observation x[n] around its mean value \mu_n.
Remark For zero-mean processes, the central moment coincides with the moment. It then follows that \sigma_{x_n}^2 = r[n, n] = E\{x^2[n]\}; in other words, the variance coincides with the signal power.
C.2.1.4 Cross-correlation and Cross-covariance
The statistical relationships between two jointly distributed SPs x[n] and y[n] (i.e., defined over the same result space S) can be described by their joint second-order moments (the cross-correlation and cross-covariance), defined, respectively, as

r_{xy}[n, m] = E\{x[n]\, y[m]\} \qquad (C.63)

c_{xy}[n, m] = E\left\{ (x[n] - \mu_{x_n})(y[m] - \mu_{y_m}) \right\} = r_{xy}[n, m] - \mu_{x_n}\mu_{y_m}. \qquad (C.64)
Moreover, the normalized cross-correlation is defined as

\rho_{xy}[n, m] = \frac{c_{xy}[n, m]}{\sigma_{x_n}\sigma_{y_m}}. \qquad (C.65)
C.2.2 High-Order Moments
In linear systems the high-order moments are rarely used compared with the first- and second-order ones. The interest in higher-order moments is, in fact, increasing for nonlinear systems.
C.2.2.1 Moments of Order m
Generalizing the foregoing for first- and second-order statistics, moments and central moments of any order can be written as

r^{(m)}[n_1, \ldots, n_m] = E\left\{ x[n_1] \cdot x[n_2] \cdots x[n_m] \right\}
c^{(m)}[n_1, \ldots, n_m] = E\left\{ (x[n_1] - \mu_{n_1})(x[n_2] - \mu_{n_2}) \cdots (x[n_m] - \mu_{n_m}) \right\}.

For a particular index n, the previous expressions simplify to

r_x^{(m)} = E\left\{ (x[n])^m \right\}
c_x^{(m)} = E\left\{ (x[n] - \mu_x)^m \right\}.
Note, also, that c_x^{(0)} = 1 and c_x^{(1)} = 0. It is obvious that, for zero-mean processes, the central moment is identical to the moment.
C.2.2.2 Moments of Third Order
The third-order moments are defined as

r^{(3)}[k, m, n] = E\left\{ x[k] \cdot x[m] \cdot x[n] \right\}
c^{(3)}[k, m, n] = E\left\{ (x[k] - \mu_k)(x[m] - \mu_m)(x[n] - \mu_n) \right\}.
C.2.3 Property of Stochastic Processes
C.2.3.1 Independent SP
An SP is called independent iff

f_x[x_1, \ldots, x_k; n_1, \ldots, n_k] = f_1[x_1; n_1] \cdot f_2[x_2; n_2] \cdots f_k[x_k; n_k] \qquad (C.66)

\forall k, n_i, i = 1, \ldots, k; in other words, x[n] is an SP formed with independent RVs x_1[n], x_2[n], \ldots.
For two, or more, independent sequences x[n] and y[n] we also have that

E\{x[n] \cdot y[n]\} = E\{x[n]\} \cdot E\{y[n]\}. \qquad (C.67)
C.2.3.2 Independent Identically Distributed SP
If all the SP sequences are independent and with equal pdf, i.e., f_1[x_1; n_1] = \cdots = f_k[x_k; n_k], then the SP is defined as iid.
C.2.3.3 Uncorrelated SP
An SP is called uncorrelated if

c[n, m] = E\left\{ (x[n] - \mu_n)(x[m] - \mu_m) \right\} = \sigma_{x_n}^2\, \delta[n - m]. \qquad (C.68)
Two processes x[n] and y[n] are uncorrelated if

c_{xy}[n, m] = E\left\{ (x[n] - \mu_{x_n})(y[m] - \mu_{y_m}) \right\} = 0 \qquad (C.69)

i.e., if

r_{xy}[n, m] = \mu_{x_n}\mu_{y_m}. \qquad (C.70)
Remark If the SPs x[n] and y[n] are independent, they are also necessarily uncorrelated, while the contrary is not always true; i.e., the assumption of independence is stronger than that of uncorrelation.
C.2.3.4 Orthogonal SP
Two processes x[n] and y[n] are defined as orthogonal iff

r_{xy}[n, m] = 0. \qquad (C.71)
C.2.4 Stationary Stochastic Processes
An SP is defined stationary or time invariant if the statistic of x½n� is identical to thetranslated process x½n � k� statistics. Very often in real situations we consider the
processes as stationary. This is due to the simplifications of the correlation func-
tions associated with them.
In particular, a sequence is called strict sense stationary (SSS) or stationary oforder N if we have
Appendix C: Elements of Random Variables, Stochastic Processes, and Estimation Theory 651
f x x1, :::, xN; n1, :::, nN½ � ¼ f x x1, :::, xN; n1 � k, :::, nN � k½ � 8k ðC:72Þ
An SP is wide-sense stationary (WSS) if its first-order statistics do not change over time

E\{x[n]\} = E\{x[n + k]\} = \mu \quad \forall n, k. \qquad (C.73)
As a corollary, consider also the following definitions. An SP is defined as wide-sense periodic (WSP) if

E\{x[n]\} = E\{x[n + N]\} = \mu \quad \forall n. \qquad (C.74)

An SP is wide-sense cyclostationary (WSC) if the following relations are true:

E\{x[n]\} = E\{x[n + N]\}, \quad r[m, n] = r[m + N, n + N] \quad \forall m, n. \qquad (C.75)
Defining k = n - m as the correlation lag or correlation delay, the correlation is usually written as

r[k] = E\{x[n]\, x[n - k]\} = E\{x[n + k]\, x[n]\}. \qquad (C.76)

The latter is often referred to as the autocorrelation function (acf).
Similarly, for WSS processes the autocovariance (C.61) is defined as

c[k] = E\left\{ (x[n + k] - \mu)(x[n] - \mu) \right\} = r[k] - \mu^2. \qquad (C.77)
Property The acf of WSS processes has the following properties:

1. The autocorrelation sequence r[k] is symmetric with respect to the delay

r[-k] = r[k] \qquad (C.78)

2. The correlation sequence is nonnegative definite. So, for any M > 0 and \mathbf{w} \in \mathbb{R}^M we have that

\sum_{k=1}^{M} \sum_{m=1}^{M} w[k]\, r[k - m]\, w[m] \geq 0 \qquad (C.79)

This property represents a necessary and sufficient condition for r[k] to be an acf.

3. The zero-time-delay term is that of maximum amplitude

E\{x^2[n]\} = r[0] \geq |r[k]| \quad \forall n, k. \qquad (C.80)
Given two jointly WSS processes x[n] and y[n], the cross-correlation function (ccf) is defined as

r_{xy}[k] = E\{x[n]\, y[n - k]\} = E\{x[n + k]\, y[n]\}. \qquad (C.81)

Finally, the cross-covariance sequence is defined as

c_{xy}[k] = E\left\{ (x[n + k] - \mu_x)(y[n] - \mu_y) \right\} = r_{xy}[k] - \mu_x\mu_y. \qquad (C.82)
C.2.5 Ergodic Processes
An SP is called ergodic if the ensemble averages coincide with the time averages.
The consequence of this definition is that an ergodic process must, necessarily, also
be strict sense stationary.
C.2.5.1 Statistics Averages of Ergodic Processes
For the determination of the statistics of an ergodic process it is necessary to define the time-average mathematical operation. For a discrete-time random signal x[n], the time-average operator, indicated as \langle x[n] \rangle, is defined as

\langle x[n] \rangle = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} x[n]

\langle x[n + k]\, x[n] \rangle = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} x[n + k]\, x[n]. \qquad (C.83)
It is possible to define all the statistical quantities and functions by replacing the ensemble-average operator E\{\cdot\} with the time-average operator \langle \cdot \rangle. In other words, if x[n] is an ergodic process, we have that

\mu = \langle x[n] \rangle = E\{x[n]\}. \qquad (C.84)

If x[n] is an ergodic process, for the correlation we have

\langle x[n + k]\, x[n] \rangle = E\{x[n + k]\, x[n]\}. \qquad (C.85)
If a process is ergodic then it is WSS, i.e., only stationary processes can be ergodic. On the contrary, a WSS process is not necessarily ergodic.
Considering the sequence x[n], we have that

\langle x[n] \rangle, \quad \text{Mean value} \qquad (C.86)
\langle x^2[n] \rangle, \quad \text{Mean square value} \qquad (C.87)
\langle (x[n] - \mu)^2 \rangle, \quad \text{Variance} \qquad (C.88)
\langle x[n + k]\, x[n] \rangle, \quad \text{Autocorrelation} \qquad (C.89)
\langle (x[n + k] - \mu)(x[n] - \mu) \rangle, \quad \text{Autocovariance} \qquad (C.90)
\langle x[n + k]\, y[n] \rangle, \quad \text{Cross-correlation} \qquad (C.91)
\langle (x[n + k] - \mu_x)(y[n] - \mu_y) \rangle, \quad \text{Cross-covariance} \qquad (C.92)
For deterministic power signals, it is important to mention the similarity between the correlation sequences calculated by the temporal average (C.89) and those determined by the definition (C.76). Although there is a formal similarity, due to the fact that random sequences are power signals, the time averages are (by the closure property) RVs, while the corresponding quantities for deterministic power signals are numbers or deterministic sequences.
Two individually ergodic SPs x[n] and y[n] have the property of joint ergodicity if the cross-correlation is identical to Eq. (C.91), i.e.,

E\{x[n + k]\, y[n]\} = \langle x[n + k]\, y[n] \rangle. \qquad (C.93)
Remark Ergodic processes are very important in applications, as very often only one realization of the process is available; in many practical situations, moreover, the processes are stationary and ergodic. Therefore, the assumption of ergodicity allows the estimation of statistical functions starting from the time averages of the single available realization of the process. Moreover, in the case of ergodic sequences of finite duration, the expression (C.83) is calculated as

r[k] = \begin{cases} \dfrac{1}{N} \displaystyle\sum_{n=0}^{N-1-k} x[n + k]\, x[n] & k \geq 0 \\ r[-k] & k < 0 \end{cases} \qquad (C.94)
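The finite-duration estimator (C.94) translates directly into code. A minimal NumPy sketch (not from the original text; the function name and the white-noise test signal are illustrative assumptions):

```python
import numpy as np

def acf_estimate(x, max_lag):
    """Biased time-average estimate of the acf r[k], as in (C.94)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # r[k] = (1/N) * sum_{n=0}^{N-1-k} x[n+k] x[n]; by symmetry r[-k] = r[k].
    return np.array([np.dot(x[k:], x[:N - k]) / N for k in range(max_lag + 1)])

rng = np.random.default_rng(3)
x = rng.standard_normal(100_000)   # unit-power white noise: r[0] ~ 1, r[k] ~ 0
r = acf_estimate(x, 5)
```

For white noise the estimate should return r[0] close to the signal power and near-zero values at the other lags, consistent with (C.68).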
C.2.6 Correlation Matrix of Random Sequences
A stochastic process can be represented as an RV vector and, as defined in Sect. C.1.7, its second-order statistics are defined by the mean value vector and by the correlation matrix. Consider a random vector \mathbf{x}_n from the SP x[n] as follows:

\mathbf{x}_n \triangleq [x[n]\;\; x[n-1]\;\; \cdots\;\; x[n-M+1]]^T \qquad (C.95)

From the definition (C.37), its mean value is defined as
\boldsymbol{\mu}_{x_n} = [\mu_{x_n}\;\; \mu_{x_{n-1}}\;\; \cdots\;\; \mu_{x_{n-M+1}}]^T \qquad (C.96)
and, from (C.41) and (C.63), the autocorrelation matrix is defined as

\mathbf{R}_{x_n} = E\{\mathbf{x}_n\mathbf{x}_n^T\} = \begin{bmatrix} r_x[n, n] & \cdots & r_x[n, n-M+1] \\ \vdots & \ddots & \vdots \\ r_x[n-M+1, n] & \cdots & r_x[n-M+1, n-M+1] \end{bmatrix}. \qquad (C.97)

Since r_x[n-i, n-j] = r_x[n-j, n-i] for 0 \leq i, j \leq M-1, \mathbf{R}_{x_n} is symmetric (or Hermitian for complex processes).
In the case of a stationary process, the acf is independent of the index n and, by defining the correlation lag as k = j - i, we obtain

r_x[n-i, n-j] = r_x[j-i] = r_x[k]. \qquad (C.98)

Then the autocorrelation matrix is a symmetric Toeplitz matrix of the form

\mathbf{R}_x = E\{\mathbf{x}\mathbf{x}^T\} = \begin{bmatrix} r[0] & r[1] & \cdots & r[M-1] \\ r[1] & r[0] & \cdots & r[M-2] \\ \vdots & \vdots & \ddots & \vdots \\ r[M-1] & r[M-2] & \cdots & r[0] \end{bmatrix}. \qquad (C.99)

The autocorrelation matrix of a stationary process is always Toeplitz (see Sect. A.2.4) and, by (C.44), nonnegative definite.
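The Toeplitz structure (C.99) can be built directly from the acf values. A minimal NumPy sketch (not from the original text; the helper name and the MA(1)-type acf are illustrative assumptions):

```python
import numpy as np

def autocorr_matrix(r):
    """Symmetric Toeplitz autocorrelation matrix (C.99) from r[0..M-1]."""
    M = len(r)
    idx = np.abs(np.arange(M)[:, None] - np.arange(M)[None, :])
    return np.asarray(r)[idx]          # element (i, j) equals r[|i - j|]

# acf of a hypothetical MA(1) process x[n] = w[n] + a*w[n-1], w unit-power white:
a = 0.5
r = np.array([1 + a**2, a, 0.0, 0.0])  # r[0], r[1], r[2], r[3]
R = autocorr_matrix(r)

eigvals = np.linalg.eigvalsh(R)        # nonnegative, consistent with (C.44)
```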
C.2.7 Stationary Random Sequences and TD LTI Systems
For random sequences processed by TD LTI systems, it is necessary to study the relationship between the input and output pdfs. For simplicity, consider a stable TD LTI circuit characterized by the impulse response h[n], where the input x[n] is a random, real or complex, stationary (WSS) sequence. The output y[n] is computed by the DT convolution defined as

y[n] = \sum_{l=-\infty}^{\infty} h[l]\, x[n - l]. \qquad (C.100)
C.2.7.1 Input–Output Cross-correlation Sequence
Consider the expression (C.100); pre-multiplying both sides by x[n + k] and performing the expectation, we get

E\{x[n + k]\, y[n]\} = \sum_{l=-\infty}^{\infty} h[l]\, E\{x[n + k]\, x[n - l]\}, \qquad (C.101)

i.e.,

r_{xy}[k] = \sum_{l=-\infty}^{\infty} h[l]\, r_{xx}[k + l] = \sum_{m=-\infty}^{\infty} h[-m]\, r_{xx}[k - m]. \qquad (C.102)

In other words, the following relations are valid:

r_{xy}[k] = h[-k] * r_{xx}[k] \qquad (C.103)

and similarly

r_{yx}[k] = h[k] * r_{xx}[k]. \qquad (C.104)

From the previous we also have that

r_{xy}[k] = r_{yx}[-k]. \qquad (C.105)
C.2.7.2 Output Autocorrelation Sequence
Multiplying both sides of (C.100) by y[n - k] and computing the expectation, we get

E\{y[n]\, y[n - k]\} = \sum_{l=-\infty}^{\infty} h[l]\, E\{x[n - l]\, y[n - k]\} \qquad (C.106)

or

r_{yy}[k] = \sum_{l=-\infty}^{\infty} h[l]\, r_{xy}[k - l] = h[k] * r_{xy}[k]. \qquad (C.107)

In other words, we can write

r_{yy}[k] = h[k] * h[-k] * r_{xx}[k]. \qquad (C.108)

By defining the term r_{hh}[k] as

r_{hh}[k] \triangleq h[k] * h[-k] = \sum_{l=-\infty}^{\infty} h[l]\, h[l - k], \qquad (C.109)

(C.108) can be written as
r_{yy}[k] = r_{hh}[k] * r_{xx}[k]. \qquad (C.110)
Therefore, when a stationary signal x[n] is filtered with a circuit of impulse response h[n], the output autocorrelation is equivalent to the input autocorrelation filtered with an impulse response equal to r_{hh}[k] = h[k] * h[-k].
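Relation (C.110) can be checked numerically: for unit-power white input, r_xx[k] = δ[k], so r_yy[k] should equal r_hh[k]. A minimal NumPy sketch (not from the original text; the FIR h and the sequence length are illustrative assumptions):

```python
import numpy as np

# Deterministic side: r_hh[k] = h[k] * h[-k], cf. (C.109).
h = np.array([1.0, 0.5, 0.25])         # hypothetical FIR impulse response
r_hh = np.convolve(h, h[::-1])         # lags -2..2; lag 0 sits at index 2

# Monte Carlo side: filter a long unit-power white sequence and estimate r_yy.
rng = np.random.default_rng(4)
x = rng.standard_normal(200_000)
y = np.convolve(x, h, mode="full")[: len(x)]

N = len(x)
r_yy = np.array([np.dot(y[k:], y[: N - k]) / N for k in range(3)])  # lags 0..2
```

The estimated r_yy at lags 0, 1, 2 should approach r_hh at the same lags, i.e., the second half of the deterministic sequence.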
C.2.7.3 Output Pdf
The determination of the output pdf of a DT LTI system is usually a difficult task. However, for a Gaussian input process, the output is always a Gaussian process, with correlation given by (C.110). In the case of multiple iid inputs, the output is determined by the weighted sum of the independent input SPs; therefore, the output pdf is equal to the convolution of the pdfs of the individual SPs.
C.2.7.4 Stationary Random Sequences Spectral Representation
Given a stationary zero-mean discrete-time signal x[n] for -\infty < n < \infty, this does not, in general, have finite energy, for which reason the DTFT, and more generally the z-transform, does not converge. The autocorrelation sequence r_{xx}[k], computed by (C.76) or in terms of relative frequency, however, is "almost always" of finite energy and, when this is true, its envelope decays (goes to zero) as the delay increases. In these cases the autocorrelation sequence is absolutely summable and its z-transform, defined as

R_{xx}(z) = \sum_{k=-\infty}^{\infty} r_{xx}[k]\, z^{-k},

admits some convergence region in the z-plane. Note, also, that for the symmetry property (C.78), we have R_{xx}(z^{-1}) = R_{xx}(z).
C.2.7.5 Power Spectral Density
We define the power spectral density (PSD) as the DTFT of the autocorrelation

R_{xx}(e^{j\omega}) = \sum_{k=-\infty}^{\infty} r_{xx}[k]\, e^{-j\omega k}. \qquad (C.111)

The PSD is a nonnegative real function that does not preserve the phase information. R_{xx}(e^{j\omega}) provides a measure of the distribution of the average power of a random process as a function of frequency.
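As a concrete case of (C.111), consider an acf with r[0] = 1 + a², r[±1] = a and all other lags zero (an MA(1)-type process): the sum collapses to the closed form 1 + a² + 2a·cos ω, which is real and nonnegative for |a| ≤ 1. A minimal NumPy sketch (not from the original text; a = 0.6 is an illustrative assumption):

```python
import numpy as np

# PSD of the acf r[0] = 1 + a^2, r[+-1] = a:
# R_xx(e^{jw}) = (1 + a^2) + 2*a*cos(w), cf. (C.111).
a = 0.6
w = np.linspace(-np.pi, np.pi, 513)    # grid including w = 0 and w = +-pi
psd = (1 + a**2) + 2 * a * np.cos(w)

# Real, nonnegative, with extremes (1 - a)^2 and (1 + a)^2.
psd_min, psd_max = psd.min(), psd.max()
```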
We define the cross-spectrum or cross-PSD (CPSD) as the DTFT of the cross-correlation sequence

R_{xy}(e^{j\omega}) = \sum_{k=-\infty}^{\infty} r_{xy}[k]\, e^{-j\omega k}. \qquad (C.112)

The CPSD is a complex function. Its amplitude describes to what extent the frequencies of the SP x[n] are associated, with large or small amplitude, with those of the SP y[n]. The phase \angle R_{xy}(e^{j\omega}) indicates the phase delay of y[n] with respect to x[n] for each frequency.
From Eq. (C.105), the following property holds:

R_{xy}(e^{j\omega}) = R_{yx}^*(e^{j\omega}) \qquad (C.113)

so R_{xy}(e^{j\omega}) and R_{yx}(e^{j\omega}) have the same modulus but opposite phase.
C.2.7.6 Spectral Representation of Stationary SP and TD LTI systems
For an impulse response h[n], with z-transform H(z) = Z\{h[n]\}, we have the following property:

Z\{h[n]\} = H(z) \;\Leftrightarrow\; Z\{h^*[-n]\} = H^*(1/z^*). \qquad (C.114)

From the above and from (C.103)–(C.106), then

R_{xy}(z) = H^*(1/z^*)\, R_{xx}(z) \qquad (C.115)
R_{yx}(z) = H(z)\, R_{xx}(z) \qquad (C.116)
R_{yy}(z) = H(z)\, H^*(1/z^*)\, R_{xx}(z). \qquad (C.117)
For z = e^{j\omega}, we can write

R_{xy}(e^{j\omega}) = H^*(e^{j\omega})\, R_{xx}(e^{j\omega}) \qquad (C.118)
R_{yx}(e^{j\omega}) = H(e^{j\omega})\, R_{xx}(e^{j\omega}) \qquad (C.119)
R_{yy}(e^{j\omega}) = H(e^{j\omega})\, H^*(e^{j\omega})\, R_{xx}(e^{j\omega}) = |H(e^{j\omega})|^2\, R_{xx}(e^{j\omega}). \qquad (C.120)
Moreover, from (C.118) and (C.119) we have that

R_{yx}(e^{j\omega}) = R_{xy}^*(e^{j\omega}). \qquad (C.121)
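The pair r_hh[k] = h[k] ∗ h[−k] ↔ |H(e^{jω})|² implied by (C.120) can be checked on an FFT grid. A minimal NumPy sketch (not from the original text; the FIR h and the FFT size are illustrative assumptions):

```python
import numpy as np

h = np.array([1.0, -0.4, 0.2])         # hypothetical FIR impulse response
nfft = 256
H = np.fft.fft(h, nfft)                # H(e^{jw}) on the FFT grid

r_hh = np.convolve(h, h[::-1])         # lags -2..2; lag 0 sits at index 2

# Place the lags circularly so that lag 0 is at index 0, then take the DFT:
r_pad = np.zeros(nfft)
r_pad[:3] = r_hh[2:]                   # lags 0, 1, 2
r_pad[-2:] = r_hh[:2]                  # lags -2, -1 (wrapped to the end)
Rhh = np.fft.fft(r_pad)

# The DTFT of r_hh equals |H|^2 exactly (up to round-off) on this grid.
match = np.allclose(Rhh, np.abs(H) ** 2, atol=1e-10)
```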
Example Consider the sum of two SPs w[n] = x[n] + y[n], and evaluate r_{ww}[k]. By applying the definition (C.76) we have that

r_{ww}[k] = E\{w[n]\, w[n - k]\} = E\left\{ (x[n] + y[n])(x[n - k] + y[n - k]) \right\}
= E\{x[n]\, x[n - k]\} + E\{x[n]\, y[n - k]\} + E\{y[n]\, x[n - k]\} + E\{y[n]\, y[n - k]\}
= r_{xx}[k] + r_{xy}[k] + r_{yx}[k] + r_{yy}[k].

For zero-mean uncorrelated sequences the cross contributions are zero [see (C.67)]. Hence, we obtain r_{ww}[k] = r_{xx}[k] + r_{yy}[k]; therefore, for the PSD we have

R_{ww}(e^{j\omega}) = R_{xx}(e^{j\omega}) + R_{yy}(e^{j\omega}).
Example Evaluate the output PSD R_{yy}(e^{j\omega}) for the TD LTI system illustrated in Fig. C.7a, with random uncorrelated input sequences x_1[n] and x_2[n].
The inputs x_1[n] and x_2[n] are mutually uncorrelated and, since the system is linear, can be considered separately by the superposition principle. The output PSD is calculated as the sum of the single contributions, each evaluated when the other input is null. So we have

R_{yy}(e^{j\omega}) = R_{yy}^{x_1}(e^{j\omega}) + R_{yy}^{x_2}(e^{j\omega}).

From (C.120), we get

R_{yy}^{x_1}(e^{j\omega}) \triangleq R_{yy}(e^{j\omega})\big|_{x_2[n]=0} = |H(e^{j\omega})|^2\, R_{x_1x_1}(e^{j\omega})
R_{yy}^{x_2}(e^{j\omega}) \triangleq R_{yy}(e^{j\omega})\big|_{x_1[n]=0} = |G(e^{j\omega})|^2\, R_{x_2x_2}(e^{j\omega}).

Finally, we have that

R_{yy}(e^{j\omega}) = |H(e^{j\omega})|^2\, R_{x_1x_1}(e^{j\omega}) + |G(e^{j\omega})|^2\, R_{x_2x_2}(e^{j\omega}).
Example Evaluate the PSDs R_{y_1y_2}(e^{j\omega}), R_{y_2y_1}(e^{j\omega}), R_{y_1y_1}(e^{j\omega}), and R_{y_2y_2}(e^{j\omega}) for the TD LTI system illustrated in Fig. C.7b, with random uncorrelated input sequences x_1[n] and x_2[n].
From (C.118)–(C.120), the output PSDs we obtain are
Fig. C.7 Block diagrams of the TD LTI systems illustrated in the examples: (a) single output y[n] from inputs x_1[n] and x_2[n] through H(z) and G(z); (b) two outputs y_1[n] and y_2[n]
R_{y_1y_1}(e^{j\omega}) = |H(e^{j\omega})|^2\, R_{x_1x_1}(e^{j\omega}) + R_{x_2x_2}(e^{j\omega})
R_{y_2y_2}(e^{j\omega}) = |G(e^{j\omega})|^2\, R_{x_2x_2}(e^{j\omega}) + R_{x_1x_1}(e^{j\omega}).
For the CPSD R_{y_1y_2}(e^{j\omega}), we observe that the sequences y_1[n] and y_2[n] are related to the input sequences through the TFs H(e^{j\omega}) and G(e^{j\omega}). Moreover, since x_1[n] and x_2[n] are uncorrelated, by the superposition principle we can write

R_{y_1y_2}(e^{j\omega}) = R_{y_1y_2}^{x_1}(e^{j\omega}) + R_{y_1y_2}^{x_2}(e^{j\omega}).
Note that for x_2[n] = 0 it is y_2[n] \equiv x_1[n], for which, from (C.119), we obtain

R_{y_1y_2}^{x_1}(e^{j\omega}) = R_{y_1y_2}(e^{j\omega})\big|_{x_2[n]=0} = H(e^{j\omega})\, R_{x_1x_1}(e^{j\omega}).
Similarly, for the other input, when x_1[n] = 0, from (C.118) we obtain

R_{y_1y_2}^{x_2}(e^{j\omega}) = R_{y_1y_2}(e^{j\omega})\big|_{x_1[n]=0} = G^*(e^{j\omega})\, R_{x_2x_2}(e^{j\omega}).
The CPSD R_{y_1y_2}(e^{j\omega}) is then

R_{y_1y_2}(e^{j\omega}) = H(e^{j\omega})\, R_{x_1x_1}(e^{j\omega}) + G^*(e^{j\omega})\, R_{x_2x_2}(e^{j\omega}).
Similarly, for the CPSD R_{y_2y_1}(e^{j\omega}), we get

R_{y_2y_1}(e^{j\omega}) = H^*(e^{j\omega})\, R_{x_1x_1}(e^{j\omega}) + G(e^{j\omega})\, R_{x_2x_2}(e^{j\omega}),

in agreement with (C.121).
C.3 Basic Concepts of Estimation Theory
In many real applications the distribution functions are not a priori known and should be determined by appropriate experiments carried out using a finite set of measured data. The estimation of such statistics can be performed using methodologies defined in the context of Estimation Theory^9 (ET) [16–22].
^9 Estimation Theory is a very old discipline: famous scientists such as Lagrange, Gauss, and Legendre used it in the past, and in the last century attention to it considerably increased. In fact, many scientists have worked in this field (Wold, Fisher, Kolmogorov, Wiener, Kalman, etc.). Among these, N. Wiener, between 1930 and 1940, was among those who most emphasized the importance of considering not only the noise but also the signals as stochastic processes.
C.3.1 Preliminary Definitions and Notation
Let \Theta be defined as the parameter space. The general problem of parameter estimation is the determination of a parameter \theta \in \Theta or, more generally, of a vector of unknown parameters \boldsymbol{\theta} \in \Theta \triangleq \{\theta[n]\}_0^{L-1}, starting from a series of observations or measurements \mathbf{x} \triangleq \{x[n]\}_0^{N-1}, by means of an estimation function h(\cdot), called the estimator, such that the estimate is \hat{\theta} = h(\mathbf{x}).
Before proceeding to further developments, let us introduce some preliminary formal definitions.
\theta \in \Theta  In general, \theta indicates the parameter vector to be estimated. Depending on the estimation paradigm adopted, as better illustrated in the following, \theta can be considered as an RV, characterized by a certain a priori known or supposed (hypothesized) distribution, or simply considered as a deterministic unknown.

h(\mathbf{x})  This function, which is itself an RV, indicates the estimator, namely, the law which determines the value of the parameters to be estimated starting from the observations \mathbf{x}.

\hat{\theta}  This symbol indicates the result of the estimation, i.e., \hat{\theta} = h(\mathbf{x}). Note that the estimated value is always an RV characterized by a certain pdf and/or the values of its moments.
C.3.1.1 Sampling Distribution
The above definitions show that the estimator relative to the \zeta_kth event, denoted by h(\{x[n, \zeta_k]\}_0^{N-1}), is defined in an N-dimensional space, whose distribution can be obtained from the joint distribution of the RVs \{x[n, \zeta]\}_0^{N-1} and \theta. This distribution, in the case of a single deterministic parameter estimation, is written as f_{x;\theta}(\mathbf{x}; \theta) and is defined as the sampling distribution.
Note that the sampling distribution represents one of the fundamental concepts in estimation theory because it contains all the information needed to define the estimator quality characteristics. In fact, it is intuitive to think that the sampling distribution of a "good" estimator should be as concentrated as possible, so that it has a small variance around the true value of the parameter to be estimated.
C.3.1.2 Estimation Theory: Classical and Bayesian Approaches
In classical estimation theory \theta represents an unknown deterministic vector of parameters. Therefore, the formalism f_{x;\theta}(\mathbf{x}; \theta) indicates a parametric dependency of the pdf related to the measures \mathbf{x} on the parameters \theta. For example, consider the simple case N = 1, where the parameter \theta represents a certain (mean) value and the pdf f_{x;\theta}(x[0]; \theta) is normally distributed around this value, x[0] \sim N(\theta, \sigma_{x[0]}^2), so that

f_{x[0];\theta}(x[0]; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma_{x[0]}}\, e^{-\frac{1}{2\sigma_{x[0]}^2}(x[0]-\theta)^2} \qquad (C.122)

illustrated, by way of example, in Fig. C.8 for some values of the parameter \theta. In other words, the parameter \theta is not an RV, and f_{x;\theta}(x[0]; \theta) indicates a parametric pdf that depends on a deterministic value \theta.
On the contrary, in Bayesian estimation theory \theta is an RV characterized by its pdf f_\theta(\theta), the a priori pdf, which contains all the a priori known (or believed) information. The quantity to be estimated is then interpreted as a realization of the RV \theta. Subsequently, the estimation process is described by the joint pdf through the Bayes rule, as [see Sect. C.1.4, Eq. (C.24)]

f_{x,\theta}(\mathbf{x}, \theta) = f_{x|\theta}(\mathbf{x}\,|\,\theta)\, f_\theta(\theta) = f_{\theta|x}(\theta\,|\,\mathbf{x})\, f_x(\mathbf{x}), \qquad (C.123)

where f_{x|\theta}(\mathbf{x}|\theta) is the conditional pdf that represents the knowledge carried by the data \mathbf{x} conditioned on the knowledge of the distribution f_\theta(\theta).^{10}
For the definition of the estimator quality, it is not always possible to know the sampling distribution f_{x;\theta}(\mathbf{x}; \theta). In practice, however, it is possible to use the low-order moments, such as the expectation E\{\hat{\theta}\}, the variance, denoted as var(\hat{\theta}) or \sigma_{\hat{\theta}}^2, and the mean square error (MSE), denoted as mse(\hat{\theta}).
C.3.1.3 Estimator, Expectation, and Bias
An estimator is called unbiased, if the expectation of the estimated value tends to
the true value of the parameter to be estimated. In other words,
Fig. C.8 Dependency of the pdf f_{x;\theta}(x[0]; \theta) on the unknown parameter \theta, shown for values \theta_1, \theta_2, \theta_3
^{10} The notation f_{x;\theta}(\mathbf{x}; \theta) indicates a parametric pdf family where \theta is the free parameter. Moreover, remember that the notation f_{x,\theta}(\mathbf{x}, \theta) indicates the joint pdf, while f_{x|\theta}(\mathbf{x}|\theta) indicates the conditional pdf.
E\{\hat{\theta}\} = \theta. \qquad (C.124)

If E\{\hat{\theta}\} \neq \theta, it is possible to define a quantity called the deviation or bias as

b(\hat{\theta}) \triangleq E\{\hat{\theta}\} - \theta. \qquad (C.125)
Remark The presence of a bias term probably indicates the presence of a systematic error, i.e., one due to the measurement process (or to the estimation algorithm). Note that an unbiased estimator is not necessarily a "good" estimator. In fact, the only guarantee is that, on average, it tends to the true value.
C.3.1.4 Estimator Variance
For better characterizing the estimation quality, we define the estimator variance as

var(\hat{\theta}) = \sigma_{\hat{\theta}}^2 \triangleq E\left\{ |\hat{\theta} - E\{\hat{\theta}\}|^2 \right\} \qquad (C.126)

which represents a measure of the dispersion of the pdf of \hat{\theta} around its expected value (Fig. C.9).
C.3.1.5 Estimator’s Mean Square Error and Bias-Vs.-Variance Trade-off
Given the true value \theta and its estimated value \hat{\theta}, the MSE of the related estimator \hat{\theta} = h(\mathbf{x}) can be defined as

mse(\hat{\theta}) = E\left\{ |\hat{\theta} - \theta|^2 \right\}. \qquad (C.127)

So mse(\cdot) is a measure of the average quadratic deviation of the estimated value with respect to the true value. Note that, considering the definitions (C.125) and (C.126), the mse(\hat{\theta}) can be written as
Fig. C.9 Estimator bias and variance: (a) biased estimator; (b) unbiased estimator
mse θ� � ¼ σ2
θþ ��b θ
� ���2: ðC:128Þ
In fact, by summing and subtracting the term E θ� �
, it is possible to write
E��θ � θ
��2n o¼ E jθ � θ þ E
�θ� � E
���2n o
¼ E j θ � E θ� �� �þ E θ
� �� θ� �j2� �
¼ E��θ � E θ
� ���2n o|fflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflffl}
σ2θ
þ ��E θ� �� θ
��2|fflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflffl}��b θ� ���2
: ðC:129Þ
The expression (C.128) shows that the MSE is the sum of two contributions: one due to the estimator variance and the other due to its bias.
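The decomposition (C.128) can be verified numerically. Below is a minimal sketch (Python with NumPy assumed; the shrunk-mean estimator and all numeric values are illustrative, not from the text) that builds a deliberately biased estimator and checks that the Monte Carlo MSE equals variance plus squared bias:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma_w, N, trials = 2.0, 1.0, 16, 200_000
a = 0.8  # deliberate shrinkage -> biased estimator

# Monte Carlo: 'trials' realizations of N samples of x[n] = theta + w[n]
x = theta + sigma_w * rng.standard_normal((trials, N))
theta_hat = a * x.mean(axis=1)

mse = np.mean((theta_hat - theta) ** 2)      # E{|theta_hat - theta|^2}, (C.127)
var = np.var(theta_hat)                      # estimator variance, (C.126)
bias2 = (np.mean(theta_hat) - theta) ** 2    # |b(theta_hat)|^2, (C.125)

print(mse, var + bias2)  # the two numbers agree, as in (C.128)
```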
C.3.1.6 Example: Estimate the Current Gain of a White Gaussian
Sequence
As an example, we consider the estimation of a discrete sequence $x[n]$ consisting of $N$ independent samples, defined as

$$x[n] = \theta + w[n], \qquad (C.130)$$

where $\theta$ represents the constant component (by analogy with the constant electrical direct current, DC) and $w[n]$ is additive white Gaussian noise (AWGN) with zero mean, indicated as $w[n] \sim N(0,\sigma_w^{2})$.

Reasoning intuitively, we can define different algorithms for the estimation of $\theta$.
For example, two very commonly used estimators are defined as

$$\hat{\theta}_{1} = h_{1}(\mathbf{x}) \triangleq x[0] \qquad (C.131)$$

$$\hat{\theta}_{2} = h_{2}(\mathbf{x}) \triangleq \frac{1}{N}\sum_{n=0}^{N-1} x[n]. \qquad (C.132)$$
To assess the quality of the estimators $h_{1}(\mathbf{x})$ and $h_{2}(\mathbf{x})$, we calculate their respective expected values and variances. For the expected values we have

$$E(\hat{\theta}_{1}) = E\{x[0]\} = \theta \qquad (C.133)$$

$$E(\hat{\theta}_{2}) = E\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right) = \frac{1}{N}\sum_{n=0}^{N-1} E\{x[n]\} = \frac{1}{N}\left[N\theta\right] = \theta. \qquad (C.134)$$
Therefore, both estimators have the same expected value, which coincides with the true value of the parameter $\theta$ to estimate. Reasoning in a similar way, for the variances we have
$$\mathrm{var}(\hat{\theta}_{1}) = \mathrm{var}\{x[0]\} = \sigma_w^{2} \qquad (C.135)$$

and

$$\mathrm{var}(\hat{\theta}_{2}) = \mathrm{var}\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right). \qquad (C.136)$$
The latter, under the independence hypothesis, can be rewritten as

$$\mathrm{var}(\hat{\theta}_{2}) = \frac{1}{N^{2}}\sum_{n=0}^{N-1} \mathrm{var}\{x[n]\} = \frac{1}{N^{2}}\left[N\sigma_w^{2}\right] = \frac{\sigma_w^{2}}{N}. \qquad (C.137)$$
It then follows that $\mathrm{var}\{h_{2}(\mathbf{x})\} < \mathrm{var}\{h_{1}(\mathbf{x})\}$ and, for $N \to \infty$, $\mathrm{var}(\hat{\theta}_{2}) \to 0$. For this reason, the estimator $h_{2}(\mathbf{x})$ turns out to be better than $h_{1}(\mathbf{x})$. In fact, according to certain paradigms, as we shall see later, $h_{2}(\mathbf{x})$ is the best possible estimator.
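A quick Monte Carlo comparison of the two estimators (C.131) and (C.132) illustrates the variance results (C.135)-(C.137). The sketch below assumes Python with NumPy; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma_w, N, trials = 1.5, 1.0, 32, 100_000

x = theta + sigma_w * rng.standard_normal((trials, N))
theta1 = x[:, 0]         # h1(x) = x[0], (C.131)
theta2 = x.mean(axis=1)  # h2(x) = sample mean, (C.132)

# Both estimators are unbiased, (C.133)-(C.134) ...
print(theta1.mean(), theta2.mean())
# ... but var(h2) = sigma_w^2/N is far smaller than var(h1) = sigma_w^2, (C.135)-(C.137)
print(theta1.var(), theta2.var())
```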
C.3.1.7 Minimum Variance Unbiased (MVU) Estimator
Ideally, a good estimator should have an MSE that tends to zero. Unfortunately, the adoption of this criterion produces, in most cases, a non-feasible estimator. In fact, the expression of the MSE (C.128) contains the contribution of the variance added to that of the bias. To better understand this, consider the example of the average value estimator (C.132), redefined by the following expression:
$$\hat{\theta} = h(\mathbf{x}) \triangleq a\,\frac{1}{N}\sum_{n=0}^{N-1} x[n], \qquad (C.138)$$
where a is a suitable constant. The problem, now, consists in determining the value
of the constant a such that the MSE of the estimator is minimal.
Since, by definition, $E(\hat{\theta}) = a\theta$ and $\mathrm{var}(\hat{\theta}) = a^{2}\sigma_w^{2}/N$, from Eq. (C.128) we have

$$\mathrm{mse}(\hat{\theta}) = \frac{a^{2}\sigma_w^{2}}{N} + (a-1)^{2}\theta^{2}. \qquad (C.139)$$
Hence, differentiating the MSE with respect to $a$, we obtain

$$\frac{d\,\mathrm{mse}(\hat{\theta})}{da} = \frac{2a\sigma_w^{2}}{N} + 2(a-1)\theta^{2}. \qquad (C.140)$$
The optimum value $a_{\mathrm{opt}}$ is obtained by setting this derivative to zero and solving with respect to $a$. It follows that

$$a_{\mathrm{opt}} = \frac{\theta^{2}}{\theta^{2} + \sigma_w^{2}/N}. \qquad (C.141)$$
The previous expression shows that the value $a_{\mathrm{opt}}$ depends on $\theta$; i.e., the estimator's goodness depends on the very parameter $\theta$ that should be determined by the estimator itself. Such a paradox indicates the non-computability of the $a_{\mathrm{opt}}$ parameter, i.e., the non-feasibility of the estimator. Generally, with certain exceptions, any criterion that depends on the bias determines a non-feasible estimator.
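The paradox can be made concrete with a small grid search over $a$: the minimizer of (C.139) matches (C.141) and moves when $\theta$ changes, so no single feasible value of $a$ works for all $\theta$. A sketch under illustrative parameter values (Python with NumPy assumed):

```python
import numpy as np

sigma2, N = 1.0, 10  # per-sample noise variance and sample length (illustrative)

def a_opt(theta):
    # (C.141): the optimal shrinkage depends on the unknown theta itself
    return theta**2 / (theta**2 + sigma2 / N)

theta = 1.0
a = np.linspace(0.0, 1.5, 15001)
mse = a**2 * sigma2 / N + (a - 1.0)**2 * theta**2  # (C.139)
a_min = a[np.argmin(mse)]

# grid minimizer matches (C.141); a_opt changes with theta
print(a_min, a_opt(1.0), a_opt(2.0))
```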
On the other hand, the optimal estimator is not the one with minimum MSE but the one that constrains the bias to zero and minimizes the estimator variance. For this reason, this estimator is called the minimum variance unbiased (MVU) estimator. For an MVU estimator, from definition (C.128),
$$\mathrm{mse}(\hat{\theta}) = \sigma_{\hat{\theta}}^{2}, \qquad \text{MVU estimator}. \qquad (C.142)$$
C.3.1.8 Bias Vs. Variance Trade-off
From what was said, a "good" estimator should be unbiased and have minimum variance. Often, in practical situations, the two features are mutually contradictory, i.e., reducing the variance increases the bias. This situation reflects a kind of indeterminacy between bias and variance, often referred to as the bias-variance trade-off.

The MVU estimator does not always exist; this is generally the case when the variance of the estimator depends on the value of the parameter to be estimated. Note also that the existence of the MVU estimator does not imply its determination. In other words, although it theoretically exists, it is not guaranteed that we can determine it.
C.3.1.9 Consistent Estimator
An estimator is said to be weakly consistent if it converges in probability to the true parameter value as the sample length $N$ tends to infinity:

$$\lim_{N\to\infty} p\left\{\left|h(\mathbf{x}) - \theta\right| < \varepsilon\right\} = 1 \qquad \forall\varepsilon > 0. \qquad (C.143)$$

An estimator is called consistent in the strong sense if it converges with probability one to the parameter value as the sample length $N$ tends to infinity:

$$p\left\{\lim_{N\to\infty} h(\mathbf{x}) = \theta\right\} = 1. \qquad (C.144)$$
Sufficient conditions for a weakly consistent estimator are that the variance and the bias tend to zero as the sample length $N$ tends to infinity, i.e.,

$$\lim_{N\to\infty} E\{h(\mathbf{x})\} = \theta, \qquad \lim_{N\to\infty} \mathrm{var}\{h(\mathbf{x})\} = 0. \qquad (C.145)$$

In this case, the sampling distribution tends to become an impulse centered on the value to be estimated.
C.3.1.10 Confidence Interval
As the sample length $N$ increases, under sufficiently general conditions, the estimate tends to the true value $(\hat{\theta} \to \theta$ for $N \to \infty)$. Moreover, by the central limit theorem, as $N$ increases, the pdf of $\hat{\theta}$ is well approximated by the normal distribution.

Knowing the sampling distribution of an estimator, it is possible to calculate a certain interval $(-\Delta, \Delta)$ that corresponds to a specified probability. Such an interval, called the confidence interval, indicates that the event $\theta$ lies in the range $(-\Delta, \Delta)$ around $\hat{\theta}$ with probability $(1-\beta)$, or confidence $(1-\beta)\cdot 100\%$ (see Fig. C.10).
C.3.2 Classical and Bayesian Estimation
In classical ET, as previously indicated, the problem is addressed by considering the parameter to be estimated as deterministic, while in Bayesian ET the parameter to estimate is considered stochastic. If the parameter is an RV, it is characterized by a certain pdf that reflects the a priori knowledge of the parameter itself.
Both theories have found several applications in signal processing and, in
particular, the three main estimation paradigms used are the following:
i) the maximum a posteriori estimation (MAP);
ii) the maximum likelihood estimation (ML);
iii) the minimum mean squares error estimation (MMSE).
Fig. C.10 Confidence interval around the true θ value
C.3.2.1 Maximum a Posteriori Estimation
In the MAP estimator, the parameter $\theta$ is characterized by an a priori pdf $f_{\theta}(\theta)$ that is determined from the knowledge available before the measurement of the data $\mathbf{x}$, in the absence of other information. The new knowledge obtained from the measurement therefore determines a change in the pdf of $\theta$, which becomes conditioned by the measurement itself. The new pdf, indicated as $f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$, is defined as the a posteriori pdf of $\theta$ conditioned on the measurements $\mathbf{x}$. Note that $f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$ is a one-dimensional function of the scalar parameter $\theta$, but it is also subject to conditioning due to the measurements.
Therefore, the MAP estimate consists in determining the maximum of the a posteriori pdf. This can be obtained by differentiating $f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$ with respect to the parameter $\theta$ and equating the result to zero:

$$\hat{\theta}_{\mathrm{MAP}} \triangleq \left\{\theta : \frac{\partial f_{\theta|\mathbf{x}}(\theta|\mathbf{x})}{\partial\theta} = 0\right\}. \qquad (C.146)$$
Sometimes, instead of the maximum of $f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$, we consider its natural logarithm. So $\hat{\theta}_{\mathrm{MAP}}$ can be found from the maximum of the function $\ln f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$, for which

$$\hat{\theta}_{\mathrm{MAP}} \triangleq \left\{\theta : \frac{\partial \ln f_{\theta|\mathbf{x}}(\theta|\mathbf{x})}{\partial\theta} = 0\right\}. \qquad (C.147)$$
Since the logarithm is a monotonically increasing function, the value found is the same as that in (C.146). However, the determination of $f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$ or $\ln f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$ is often problematic; using the rule derived from the Bayes theorem, by (C.123) it is possible to write the conditional pdf as

$$f_{\theta|\mathbf{x}}(\theta|\mathbf{x}) = \frac{f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)\,f_{\theta}(\theta)}{f_{\mathbf{x}}(\mathbf{x})}. \qquad (C.148)$$

Considering the logarithm of both sides of the previous, we can write

$$\ln f_{\theta|\mathbf{x}}(\theta|\mathbf{x}) = \ln f_{\mathbf{x}|\theta}(\mathbf{x}|\theta) + \ln f_{\theta}(\theta) - \ln f_{\mathbf{x}}(\mathbf{x}).$$

Thus, the procedure for the MAP estimate is

$$\frac{\partial}{\partial\theta}\left[\ln f_{\mathbf{x}|\theta}(\mathbf{x}|\theta) + \ln f_{\theta}(\theta) - \ln f_{\mathbf{x}}(\mathbf{x})\right] = 0$$
and, since $\ln f_{\mathbf{x}}(\mathbf{x})$ does not depend on $\theta$, we can write
$$\hat{\theta}_{\mathrm{MAP}} \triangleq \left\{\theta : \frac{\partial}{\partial\theta}\left[\ln f_{\mathbf{x}|\theta}(\mathbf{x}|\theta) + \ln f_{\theta}(\theta)\right] = 0\right\}. \qquad (C.149)$$
Finally, note that it is possible to determine the MAP solution equivalently through (C.146), (C.147), or (C.149).
C.3.2.2 Maximum-Likelihood Estimation
In maximum-likelihood (ML) estimation, the parameter $\theta$ to be estimated is considered as a simple deterministic unknown. Therefore, in ML estimation the determination of $\hat{\theta}_{\mathrm{ML}}$ is carried out through the maximization of the function $f_{\mathbf{x};\theta}(\mathbf{x};\theta)$, defined as a parametric pdf family, where $\theta$ is the deterministic parameter. In this respect, the function $f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ is sometimes referred to as the likelihood function $L_{\theta}$. Note that if $f_{\mathbf{x};\theta}(\mathbf{x};\theta_{1}) > f_{\mathbf{x};\theta}(\mathbf{x};\theta_{2})$, then the value $\theta_{1}$ is "more plausible" than the value $\theta_{2}$, so the ML paradigm indicates that the estimated value $\hat{\theta}_{\mathrm{ML}}$ is the most likely according to the observations $\mathbf{x}$. As for the MAP method, for the ML estimator the natural logarithm $\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ is often considered. Note that, although $\theta$ is a deterministic parameter, the likelihood function $L_{\theta}$ (or $\ln L_{\theta}$) has a stochastic nature and is considered an RV. In this case, if the estimate exists, it can be found as the solution of the likelihood equation defined as
$$\hat{\theta}_{\mathrm{ML}} \triangleq \left\{\theta : \frac{\partial \ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta} = 0\right\}. \qquad (C.150)$$
Such solution is defined as maximum-likelihood estimate (MLE).
In other words, the ML method searches, within the space $\Theta$ of all possible $\theta$ values, for the value of the parameter that is most plausible given the observations. From a mathematical point of view, calling $L_{\theta} = f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ the likelihood function, we have

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta\in\Theta}\{L_{\theta}\}. \qquad (C.151)$$
The MLE also has the following properties:
• Sufficiency: if there is a sufficient statistic¹¹ for θ, then the MLE is also a sufficient statistic;
• Efficiency: an estimator is called efficient if it attains the lower limit of the variance obtainable by an unbiased estimator. An estimator that reaches this limit is
¹¹ A sufficient statistic is a statistic such that "no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter" [18]. In other words, a statistic is sufficient for a pdf family if the sample from which it is calculated gives no additional information than does the statistic.
called a fully efficient estimator. Although, for a finite set of N observations, the fully efficient estimator generally does not exist, in many practical cases the ML estimator turns out to be asymptotically fully efficient.
• Gaussianity: the MLE turns out to be asymptotically Gaussian.
If an efficient estimator does not exist, then the lower limit cannot be achieved by the MLE and, in general, it is difficult to measure the distance from this limit.
Remark Comparing the ML and MAP estimators, it should be noted that in the latter the estimate is derived using a combination of a priori and a posteriori information on θ, where the a priori knowledge is formulated in terms of the pdf $f_{\theta}(\theta)$. However, ML estimation is potentially more feasible in practical problems because it does not require any a priori knowledge. Both procedures require knowledge of the joint pdf of the observations.

Note also that the ML estimator can be derived from the MAP one by considering the parameter θ as an RV with pdf uniformly distributed over $[-\infty, +\infty]$.
C.3.2.3 Example: Noisy Measure of a Parameter with a Single Observation
As a simple example to illustrate the MAP and ML methodologies, consider a single measure $x$ consisting of the sum of a parameter $\theta$ and a normally distributed zero-mean RV $w$ (AWGN), $w \sim N(0,\sigma_w^{2})$. The process is then defined as

$$x = \theta + w. \qquad (C.152)$$

Recall that (1) in ML estimation the parameter $\theta$ is a deterministic unknown constant, while (2) in MAP estimation $\theta$ is an RV with an a priori pdf of normal type, $N(\theta_{0},\sigma_{\theta}^{2})$.
ML Estimation
In the ML method, the likelihood function $L_{\theta} = f_{x;\theta}(x;\theta)$ is a scalar function of a single variable. From Eq. (C.152), $x$ is, by definition, Gaussian with mean value $\theta$ and variance $\sigma_w^{2}$. It follows that the likelihood function $L_{\theta}$ reflects this dependence and is defined as

$$L_{\theta} = f_{x;\theta}(x;\theta) = \frac{1}{\sqrt{2\pi\sigma_w^{2}}}\,e^{-\frac{1}{2\sigma_w^{2}}(x-\theta)^{2}}. \qquad (C.153)$$
Its logarithm is
$$\ln L_{\theta} = \ln f_{x;\theta}(x;\theta) = -\frac{1}{2}\ln\left(2\pi\sigma_w^{2}\right) - \frac{1}{2\sigma_w^{2}}(x-\theta)^{2}. \qquad (C.154)$$
To determine the maximum, we differentiate with respect to $\theta$ and equate to zero:

$$\hat{\theta}_{\mathrm{ML}} \triangleq \left\{\theta : \frac{1}{\sigma_w^{2}}(x-\theta) = 0\right\},$$

that is,

$$\hat{\theta}_{\mathrm{ML}} = x. \qquad (C.155)$$
It follows, then, that the best estimate in the ML sense is just the measured value $x$. This is an intuitive result since, in the absence of other information, it is not possible in any way to refine the estimate of the parameter $\theta$. The variance associated with the estimated value is

$$\mathrm{var}(\hat{\theta}_{\mathrm{ML}}) = E\left\{\hat{\theta}_{\mathrm{ML}}^{2}\right\} - E^{2}\left\{\hat{\theta}_{\mathrm{ML}}\right\} = E\{x^{2}\} - E^{2}\{x\}$$

which, for $x = \theta + w$, gives

$$\mathrm{var}(\hat{\theta}_{\mathrm{ML}}) = \theta^{2} + \sigma_w^{2} - \theta^{2} = \sigma_w^{2},$$
which obviously coincides with the variance of the superimposed noise w.
MAP Estimation
In the MAP method we have $x = \theta + w$ with $w \sim N(0,\sigma_w^{2})$, and we suppose the a priori pdf $f_{\theta}(\theta)$ to be known and normally distributed, $N(\theta_{0},\sigma_{\theta}^{2})$. The MAP estimate is obtained from Eq. (C.149) as

$$\hat{\theta}_{\mathrm{MAP}} \triangleq \left\{\theta : \frac{\partial}{\partial\theta}\left[\ln f_{x|\theta}(x|\theta) + \ln f_{\theta}(\theta)\right] = 0\right\}. \qquad (C.156)$$
Given the value of $\theta$, the pdf of $x$ is Gaussian with mean value $\theta$ and variance $\sigma_w^{2}$. It follows that the logarithm of this density is

$$\ln f_{x|\theta}(x|\theta) = -\frac{1}{2}\ln\left(2\pi\sigma_w^{2}\right) - \frac{1}{2\sigma_w^{2}}(x-\theta)^{2}, \qquad (C.157)$$

while the a priori known density $f_{\theta}(\theta)$ is equal to
$$f_{\theta}(\theta) = \frac{1}{\sqrt{2\pi\sigma_{\theta}^{2}}}\,e^{-\frac{1}{2\sigma_{\theta}^{2}}(\theta-\theta_{0})^{2}} \qquad (C.158)$$

with logarithm

$$\ln f_{\theta}(\theta) = -\frac{1}{2}\ln\left(2\pi\sigma_{\theta}^{2}\right) - \frac{1}{2\sigma_{\theta}^{2}}(\theta-\theta_{0})^{2}. \qquad (C.159)$$
By substituting (C.157) and (C.159) in (C.156) we obtain

$$\hat{\theta}_{\mathrm{MAP}} \triangleq \left\{\theta : \frac{\partial}{\partial\theta}\left[-\frac{1}{2}\ln\left(2\pi\sigma_w^{2}\right) - \frac{1}{2\sigma_w^{2}}(x-\theta)^{2} - \frac{1}{2}\ln\left(2\pi\sigma_{\theta}^{2}\right) - \frac{1}{2\sigma_{\theta}^{2}}(\theta-\theta_{0})^{2}\right] = 0\right\}.$$
Differentiating, we obtain

$$\frac{x - \hat{\theta}_{\mathrm{MAP}}}{\sigma_w^{2}} - \frac{\hat{\theta}_{\mathrm{MAP}} - \theta_{0}}{\sigma_{\theta}^{2}} = 0, \qquad (C.160)$$

that is,

$$\hat{\theta}_{\mathrm{MAP}} = \frac{x\,\sigma_{\theta}^{2} + \theta_{0}\,\sigma_w^{2}}{\sigma_w^{2} + \sigma_{\theta}^{2}} = \frac{x + \theta_{0}\left(\sigma_w^{2}/\sigma_{\theta}^{2}\right)}{1 + \sigma_w^{2}/\sigma_{\theta}^{2}}. \qquad (C.161)$$
Comparing the latter with the ML estimate (C.155), we observe that the MAP estimate can be viewed as a weighted sum of the ML estimate $x$ and of the a priori mean value $\theta_{0}$. In (C.161), the ratio of the variances $(\sigma_w^{2}/\sigma_{\theta}^{2})$ can be seen as a measure of confidence in the value $\theta_{0}$: the lower the value of $\sigma_{\theta}^{2}$, the greater the ratio $(\sigma_w^{2}/\sigma_{\theta}^{2})$ and the confidence in $\theta_{0}$, and the smaller the weight of the observation $x$.

In the limit case where $(\sigma_w^{2}/\sigma_{\theta}^{2}) \to \infty$, the MAP estimate is simply given by the a priori mean $\theta_{0}$. At the opposite extreme, if $\sigma_{\theta}^{2}$ increases, the MAP estimate coincides with the ML estimate, $\hat{\theta}_{\mathrm{MAP}} \to x$.
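The weighting behavior of (C.161) is easy to explore numerically. A sketch (plain Python; the function name and all values are hypothetical):

```python
def theta_map(x, theta0, s2w, s2t):
    # (C.161): weighted combination of the measurement x and the prior mean theta0
    r = s2w / s2t  # confidence ratio: large r means a strong (narrow) prior
    return (x + theta0 * r) / (1.0 + r)

x, theta0, s2w = 1.0, 0.0, 1.0
print(theta_map(x, theta0, s2w, 1e-4))  # strong prior: estimate pulled toward theta0
print(theta_map(x, theta0, s2w, 1e4))   # weak prior: estimate tends to the ML value x
```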
C.3.2.4 Example: Noisy Measure of a Parameter by N Observations
Let's now consider the previous example when N measurements are available:

$$x[n] = \theta + w[n], \qquad n = 0, 1, \ldots, N-1, \qquad (C.162)$$

where the samples $w[n]$ are iid, zero-mean, Gaussian distributed, $N(0,\sigma_w^{2})$.
ML Estimation
In the MLE, the likelihood function $L_{\theta} = f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ is an N-dimensional multivariate Gaussian defined as

$$L_{\theta} = f_{\mathbf{x};\theta}(\mathbf{x};\theta) = \frac{1}{\left(2\pi\sigma_w^{2}\right)^{N/2}}\,e^{-\frac{1}{2\sigma_w^{2}}\sum_{n=0}^{N-1}\left(x[n]-\theta\right)^{2}}. \qquad (C.163)$$
Its logarithm is

$$\ln L_{\theta} = \ln f_{\mathbf{x};\theta}(\mathbf{x};\theta) = -\frac{N}{2}\ln\left(2\pi\sigma_w^{2}\right) - \frac{1}{2\sigma_w^{2}}\sum_{n=0}^{N-1}\left(x[n]-\theta\right)^{2}.$$
Differentiating with respect to $\theta$ and setting to zero,

$$\frac{\partial \ln L_{\theta}}{\partial\theta} = \frac{1}{\sigma_w^{2}}\sum_{n=0}^{N-1}\left(x[n] - \hat{\theta}_{\mathrm{ML}}\right) = 0,$$

we obtain

$$\hat{\theta}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]. \qquad (C.164)$$
It follows, then, that the best estimate in the ML sense coincides with the average value of the observed data. This is an intuitive result, already reached previously, since, in the absence of other information, it is not possible to do better.
MAP Estimation
In MAP estimation we have $x[n] = \theta + w[n]$ with $w \sim N(0,\sigma_w^{2})$, and we suppose that the a priori pdf $f_{\theta}(\theta)$ is normally distributed, $N(\bar{\theta},\sigma_{\theta}^{2})$. The MAP estimate, proceeding as in the previous case, is obtained from

$$\sum_{n=0}^{N-1}\frac{x[n] - \hat{\theta}_{\mathrm{MAP}}}{\sigma_w^{2}} - \frac{\hat{\theta}_{\mathrm{MAP}} - \bar{\theta}}{\sigma_{\theta}^{2}} = 0, \qquad (C.165)$$
that is,
$$\hat{\theta}_{\mathrm{MAP}} = \frac{\dfrac{1}{N}\displaystyle\sum_{n=0}^{N-1} x[n] + \bar{\theta}\left(\sigma_w^{2}/N\sigma_{\theta}^{2}\right)}{1 + \sigma_w^{2}/N\sigma_{\theta}^{2}}. \qquad (C.166)$$
Again, comparing the latter with the ML estimate, we observe that the MAP estimate can be viewed as a weighted sum of the MLE and the a priori mean value. Comparing with the case of a single observation [Eq. (C.161)], one can observe that increasing the number of observations N reduces the dependence on the a priori density by a factor N. This result is reasonable and intuitive: each new observation reduces the variance of the estimate and reduces the dependence on the a priori model.
C.3.2.5 Example: Noisy Measure of L Parameters with N Observations
We now consider the general case where we have N measurements $\mathbf{x} \triangleq \left\{x[n]\right\}_{n=0}^{N-1}$ and estimate L parameters $\boldsymbol{\theta} \triangleq \left\{\theta[n]\right\}_{n=0}^{L-1}$, where the samples $w[n]$ are zero-mean Gaussian, $N(0,\sigma_w^{2})$, iid.
MAP Estimation
We proceed first with the MAP estimate. We seek to maximize the a posteriori density $f_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta}|\mathbf{x})$ or, equivalently, $\ln f_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta}|\mathbf{x})$, with respect to $\boldsymbol{\theta}$. This is achieved by differentiating with respect to each component of $\boldsymbol{\theta}$ and equating to zero:

$$\frac{\partial \ln f_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta}|\mathbf{x})}{\partial\theta[n]} = 0, \qquad n = 0, 1, \ldots, L-1. \qquad (C.167)$$

By separating the derivatives we obtain L equations in the L unknown parameters θ[0], θ[1], ..., θ[L-1] that, with a change of notation, can be expressed as
$$\nabla_{\boldsymbol{\theta}}\, f_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta}|\mathbf{x}) = \mathbf{0}, \qquad (C.168)$$

where the symbol $\nabla_{\boldsymbol{\theta}}$ indicates the differential operator, called "gradient," defined as

$$\nabla_{\boldsymbol{\theta}} \triangleq \left[\frac{\partial}{\partial\theta[0]},\, \frac{\partial}{\partial\theta[1]},\, \cdots,\, \frac{\partial}{\partial\theta[L-1]}\right]^{T}.$$
As in the case of a single parameter, the Bayes rule holds, so we have
$$f_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta}|\mathbf{x}) = \frac{f_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x}|\boldsymbol{\theta})\,f_{\boldsymbol{\theta}}(\boldsymbol{\theta})}{f_{\mathbf{x}}(\mathbf{x})},$$

which, taking the logarithm, can be written as

$$\ln f_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta}|\mathbf{x}) = \ln f_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x}|\boldsymbol{\theta}) + \ln f_{\boldsymbol{\theta}}(\boldsymbol{\theta}) - \ln f_{\mathbf{x}}(\mathbf{x}),$$

where $f_{\mathbf{x}}(\mathbf{x})$ does not depend on $\boldsymbol{\theta}$, so we can write

$$\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} \triangleq \left\{\boldsymbol{\theta} : \frac{\partial\left[\ln f_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x}|\boldsymbol{\theta}) + \ln f_{\boldsymbol{\theta}}(\boldsymbol{\theta})\right]}{\partial\theta[n]} = 0, \;\; n = 0, 1, \ldots, L-1\right\}. \qquad (C.169)$$
Finally, the solution of the above simultaneous equations constitutes the MAP estimate.
ML Estimation
In ML estimation, the likelihood function is $L_{\boldsymbol{\theta}} = f_{\mathbf{x};\boldsymbol{\theta}}(\mathbf{x};\boldsymbol{\theta})$ or, equivalently, its logarithm $\ln L_{\boldsymbol{\theta}} = \ln f_{\mathbf{x};\boldsymbol{\theta}}(\mathbf{x};\boldsymbol{\theta})$. Its maximum is defined as

$$\hat{\boldsymbol{\theta}}_{\mathrm{ML}} \triangleq \left\{\boldsymbol{\theta} : \frac{\partial \ln f_{\mathbf{x};\boldsymbol{\theta}}(\mathbf{x};\boldsymbol{\theta})}{\partial\theta[n]} = 0, \;\; n = 0, 1, \ldots, L-1\right\}. \qquad (C.170)$$
C.3.2.6 Variance Lower Bound: Cramer–Rao Lower Bound
A very important issue in estimation theory concerns the existence of a lower limit for the variance of the MVU estimator. This limit, known in the literature as the Cramer–Rao lower bound (CRLB) (also known as the Cramer–Rao inequality or information inequality), in honor of the mathematicians Harald Cramer and Calyampudi Radhakrishna Rao, who first derived it [23], expresses the minimum variance that can be achieved in the estimation of a vector of deterministic parameters $\boldsymbol{\theta}$.

For the determination of the limit we consider a classical estimator and a vector of RVs $\mathbf{x}(\zeta) = \left[x_{0}(\zeta)\;\; x_{1}(\zeta)\;\; \cdots\;\; x_{N-1}(\zeta)\right]^{T}$, and an unbiased estimator $\hat{\boldsymbol{\theta}} = h(\mathbf{x})$ such that, by definition, $E\{\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\} = \mathbf{0}$, characterized by the $(L \times L)$ covariance matrix $\mathbf{C}_{\hat{\boldsymbol{\theta}}}$ defined as [see (C.38)]

$$\mathbf{C}_{\hat{\boldsymbol{\theta}}} = \mathrm{cov}(\hat{\boldsymbol{\theta}}) = E\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^{T}\right]. \qquad (C.171)$$
Moreover, define the Fisher information matrix $\mathbf{J} \in \mathbb{R}^{L\times L}$, whose elements are¹²

$$J(i,j) = -E\left[\frac{\partial^{2}\ln f_{\mathbf{x};\boldsymbol{\theta}}(\mathbf{x};\boldsymbol{\theta})}{\partial\theta[i]\,\partial\theta[j]}\right], \qquad i,j = 0, 1, \ldots, L-1. \qquad (C.172)$$
The CRLB is defined by the inequality

$$\mathbf{C}_{\hat{\boldsymbol{\theta}}} \succeq \mathbf{J}^{-1}. \qquad (C.173)$$
The above indicates that the covariance of the estimator cannot be lower than the inverse of the amount of information contained in the random vector $\mathbf{x}$. In other words, inequality (C.173) expresses the lower limit of the variance obtainable by an unbiased estimator for the parameter vector $\boldsymbol{\theta}$.

As defined in Sect. C.3.1.7, an estimator attaining (C.173) with equality is a minimum variance unbiased (MVU) estimator. Note that (C.173) can be interpreted as $[\mathbf{C}_{\hat{\boldsymbol{\theta}}} - \mathbf{J}^{-1}] \succeq \mathbf{0}$ (positive semi-definite). An estimator which has the property (C.173) in the sense of equality is fully efficient.
Equation (C.173) expresses a general condition on the covariance matrix of the parameters. Sometimes it is useful to bound the variances of the individual parameter estimates: these correspond to the diagonal elements of the matrix $[\mathbf{C}_{\hat{\boldsymbol{\theta}}} - \mathbf{J}^{-1}]$. It follows that the diagonal elements of this matrix are nonnegative, i.e.,

$$\mathrm{var}(\hat{\theta}[i]) \geq \frac{1}{J(i,i)}, \qquad i = 0, 1, \ldots, L-1, \qquad (C.174)$$
from which, for a scalar parameter, we have

$$\mathrm{var}(\hat{\theta}) \geq \frac{1}{-E\left[\dfrac{\partial^{2}\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta^{2}}\right]} \qquad (C.175)$$
or
¹² The Fisher information is defined as the variance of the derivative of the log-likelihood function. It can be interpreted as the amount of information carried by an observable RV $\mathbf{x}$ about a nonobservable parameter θ upon which the likelihood function of θ, $L_{\theta} = f_{\mathbf{x};\theta}(\mathbf{x};\theta)$, depends.
$$\mathrm{var}(\hat{\theta}) \geq \frac{1}{E\left[\left(\dfrac{\partial\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta}\right)^{2}\right]}, \qquad (C.176)$$
which represents an equivalent form of the CRLB.
Proof We have that

$$\frac{\partial^{2}\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta^{2}} = \frac{\frac{\partial^{2}}{\partial\theta^{2}}f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{f_{\mathbf{x};\theta}(\mathbf{x};\theta)} - \left(\frac{\frac{\partial}{\partial\theta}f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{f_{\mathbf{x};\theta}(\mathbf{x};\theta)}\right)^{2} = \frac{\frac{\partial^{2}}{\partial\theta^{2}}f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{f_{\mathbf{x};\theta}(\mathbf{x};\theta)} - \left(\frac{\partial\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta}\right)^{2}.$$

Since

$$\int\frac{\partial\ln f(\mathbf{x};\theta)}{\partial\theta}\,f(\mathbf{x};\theta)\,d\mathbf{x} = \int\frac{\partial f(\mathbf{x};\theta)}{\partial\theta}\,d\mathbf{x} = \frac{\partial}{\partial\theta}\int f(\mathbf{x};\theta)\,d\mathbf{x} = \frac{\partial}{\partial\theta}\,1 = 0,$$

we get

$$E\left[\frac{\frac{\partial^{2}}{\partial\theta^{2}}f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{f_{\mathbf{x};\theta}(\mathbf{x};\theta)}\right] = \cdots = \frac{\partial^{2}}{\partial\theta^{2}}\int f(\mathbf{x};\theta)\,d\mathbf{x} = \frac{\partial^{2}}{\partial\theta^{2}}\,1 = 0.$$

Therefore

$$E\left[\left(\frac{\partial\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta}\right)^{2}\right] = -E\left[\frac{\partial^{2}\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta^{2}}\right].$$

Q.E.D.
Remark The CRLB expresses the minimum error variance of the estimator $h(\mathbf{x})$ of θ in terms of the pdf $f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ of the observations $\mathbf{x}$. So any unbiased estimator has an error variance not smaller than the CRLB.
Example As an example, consider the ML estimator for a single observation already studied in Sect. C.3.2.3, where we have [see (C.154)]

$$\ln L_{\theta} = \ln f_{x;\theta}(x;\theta) = -\frac{1}{2}\ln\left(2\pi\sigma_w^{2}\right) - \frac{1}{2\sigma_w^{2}}(x-\theta)^{2}.$$

From (C.175), the CRLB is

$$\mathrm{var}(\hat{\theta}) \geq \frac{1}{-E\left[\dfrac{\partial^{2}\ln f_{x;\theta}(x;\theta)}{\partial\theta^{2}}\right]} = \frac{1}{E\left[\dfrac{\partial^{2}}{\partial\theta^{2}}\left(\dfrac{1}{2\sigma_w^{2}}(x-\theta)^{2}\right)\right]}. \qquad (C.177)$$
Simplifying, we note that the CRLB is given by the simple relationship

$$\mathrm{var}(\hat{\theta}) \geq \sigma_w^{2}. \qquad (C.178)$$

This lower limit coincides with the ML estimator variance and, in this case, one can conclude that the ML estimator reaches the CRLB even for a finite number of observations.
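Extending the example to N iid observations, for which the CRLB becomes $\sigma_w^{2}/N$, one can check by Monte Carlo that the sample-mean ML estimator (C.164) attains the bound. A sketch (Python with NumPy assumed; all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, s2w, N, trials = 0.5, 2.0, 25, 200_000

x = theta + np.sqrt(s2w) * rng.standard_normal((trials, N))
theta_ml = x.mean(axis=1)  # sample-mean ML estimator, (C.164)

crlb = s2w / N             # CRLB for N iid Gaussian observations
print(theta_ml.var(), crlb)  # empirical variance sits at the bound
```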
C.3.2.7 Minimum Mean Squares Error Estimator
Suppose we want to estimate the parameter θ using a single measure $x$, such that the mean squares error defined in (C.127) is minimized. Let $\hat{\theta} = h(x)$; then $\mathrm{mse}(\hat{\theta}) = E\{|\hat{\theta} - \theta|^{2}\}$, so we have

$$\mathrm{mse}(\hat{\theta}) = E\left\{|h(x) - \theta|^{2}\right\}. \qquad (C.179)$$
The expected value above can be rewritten as

$$\mathrm{mse}(\hat{\theta}) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |h(x) - \theta|^{2}\, f_{x,\theta}(x,\theta)\, d\theta\, dx. \qquad (C.180)$$

Remember that the joint pdf $f_{x,\theta}(x,\theta)$ can be factored as

$$f_{x,\theta}(x,\theta) = f_{\theta|x}(\theta|x)\, f_{x}(x). \qquad (C.181)$$
Then we obtain

$$\mathrm{mse}(\hat{\theta}) = \int_{-\infty}^{\infty} f_{x}(x)\left[\int_{-\infty}^{\infty} |h(x) - \theta|^{2}\, f_{\theta|x}(\theta|x)\, d\theta\right] dx. \qquad (C.182)$$
In the previous expression, both integrands are positive everywhere (by pdf definition). Moreover, the external integral is fully independent of the function $h(x)$. It follows that the minimization of (C.182) is equivalent to the minimization of the internal integral

$$\int_{-\infty}^{\infty} |h(x) - \theta|^{2}\, f_{\theta|x}(\theta|x)\, d\theta. \qquad (C.183)$$
Differentiating with respect to $h(x)$ and setting to zero,

$$2\int_{-\infty}^{\infty}\left(h(x) - \theta\right) f_{\theta|x}(\theta|x)\, d\theta = 0$$
or
$$h(x)\int_{-\infty}^{\infty} f_{\theta|x}(\theta|x)\, d\theta = \int_{-\infty}^{\infty} \theta\, f_{\theta|x}(\theta|x)\, d\theta.$$

By definition, $\int_{-\infty}^{\infty} f_{\theta|x}(\theta|x)\, d\theta = 1$, whence

$$\hat{\theta}_{\mathrm{MMSE}} = h(x) \triangleq \int_{-\infty}^{\infty} \theta\, f_{\theta|x}(\theta|x)\, d\theta = E(\theta|x). \qquad (C.184)$$
The MMSE estimator is obtained when the function $h(x)$ equals the expectation of θ conditioned on the data $x$. Moreover, note that, differently from MAP and ML, the MMSE estimator requires knowledge of the conditional expected value of the a posteriori pdf but does not require its explicit knowledge. The estimate $\hat{\theta}_{\mathrm{MMSE}} = E(\theta|x)$ is, in general, a nonlinear function of the data. An important exception is when the a posteriori pdf is Gaussian; in this case, $\hat{\theta}_{\mathrm{MMSE}}$ becomes a linear function of $x$.
It is interesting to compare the MAP estimator described above and the MMSE. Both estimators consider the parameter θ to be estimated as an RV, so both can be considered Bayesian. Both also produce estimates based on the a posteriori pdf of θ; the distinction between the two lies in the optimization criterion. The MAP takes the maximum (peak) of the function, while the MMSE criterion considers the expected value. Moreover, note that for a symmetric density the peak and the expected value (and thus the MAP and MMSE estimates) coincide; note also that this class includes the common case of a Gaussian a posteriori density.

Comparing classical and Bayesian estimators, we observe that in the former case quality is defined in terms of bias, consistency, efficiency, etc. In Bayesian estimation, θ being an RV makes these indicators inappropriate: the performance is evaluated in terms of a cost function such as the one in (C.182). Note that the MMSE cost function is not the only possible choice. In principle, one can choose other criteria such as, for example, the minimum absolute value or Minimum Absolute Error (MAE)

$$\mathrm{mae}(\hat{\theta}) = E\left\{|h(x) - \theta|\right\}. \qquad (C.185)$$
Indeed, the MAP estimator can be derived from different forms of the cost function. The optimal estimator in the MAE sense coincides with the median of the a posteriori density. For a symmetric density, the MAE estimate coincides with the MMSE and the MAP ones. In the case of unimodal symmetric densities, the optimal solution can be obtained with a wide class of cost functions and, moreover, coincides with the solution $\hat{\theta}_{\mathrm{MMSE}}$.
Finally, note that in the case of multivariate densities, expression (C.184) can be generalized as

$$\hat{\boldsymbol{\theta}}_{\mathrm{MMSE}} = E(\boldsymbol{\theta}|\mathbf{x}). \qquad (C.186)$$
C.3.2.8 Linear MMSE Estimator
The expression of the MMSE estimator (C.184) or (C.186), as noted in the previous paragraph, is generally nonlinear. Suppose now that we impose on the form of the MMSE estimator the constraint of linearity with respect to the observed data $\mathbf{x}$. With this constraint, the estimator consists of a simple linear combination of the measures. It therefore assumes the form

$$\hat{\theta}_{\mathrm{MMSE}}^{*} = h(\mathbf{x}) \triangleq \sum_{i=0}^{N-1} h_{i}\, x[i] = \mathbf{h}^{T}\mathbf{x}, \qquad (C.187)$$
where the coefficients $\mathbf{h}$ are weights that can be determined by the minimization of the mean squares error, i.e.,

$$\mathbf{h}_{\mathrm{opt}} \triangleq \left\{\mathbf{h} : \frac{\partial}{\partial\mathbf{h}}\, E\left\{\left|\theta - \mathbf{h}^{T}\mathbf{x}\right|^{2}\right\} = \mathbf{0}\right\}. \qquad (C.188)$$
For the derivative computation it is convenient to define the quantity "error" as

$$e = \theta - \hat{\theta}_{\mathrm{MMSE}}^{*} = \theta - \mathbf{h}^{T}\mathbf{x} \qquad (C.189)$$

and, using this definition, it is possible to express the mean squares error as a function of the estimator parameters $\mathbf{h}$ as

$$J(\mathbf{h}) \triangleq E\left\{e^{2}\right\} = E\left\{\left|\theta - \mathbf{h}^{T}\mathbf{x}\right|^{2}\right\}. \qquad (C.190)$$
With the previous positions, the derivative in (C.188) is

$$\frac{\partial J(\mathbf{h})}{\partial\mathbf{h}} = \frac{\partial E\left\{|e|^{2}\right\}}{\partial\mathbf{h}} = E\left\{2e\,\frac{\partial\left(\theta - \mathbf{h}^{T}\mathbf{x}\right)}{\partial\mathbf{h}}\right\} = -2E\{e\,\mathbf{x}\}. \qquad (C.191)$$

The optimal solution can be computed by setting $\partial J(\mathbf{h})/\partial\mathbf{h} = \mathbf{0}$, which gives

$$E\{e\cdot\mathbf{x}\} = \mathbf{0}. \qquad (C.192)$$
The above expression indicates that, at the optimal solution point, the error $e$ is orthogonal to the vector of data $\mathbf{x}$ (the measures). In other words, (C.192) expresses the principle of orthogonality, which represents a fundamental property of the linear MMSE estimation approach.
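The orthogonality principle (C.192) can be verified by solving the sample normal equations $\mathbf{R}_{xx}\mathbf{h} = \mathbf{r}_{x\theta}$ and checking that the residual error is numerically orthogonal to the data. A sketch (Python with NumPy assumed; the data model is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
M, samples = 3, 200_000

# Illustrative Bayesian linear model: each of the M entries of x observes theta in noise
theta = rng.standard_normal(samples)
x = theta[:, None] + 0.5 * rng.standard_normal((samples, M))

# Normal equations Rxx h = r_xtheta, with sample estimates of the second-order moments
Rxx = x.T @ x / samples
r_xt = x.T @ theta / samples
h = np.linalg.solve(Rxx, r_xt)

e = theta - x @ h         # error, (C.189)
print(x.T @ e / samples)  # E{e x} ~ 0: the orthogonality principle (C.192)
```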
C.3.2.9 Example: Signal Estimation
We now extend the concepts presented in the preceding paragraphs to the estimation of signals defined as time sequences.

With this assumption, the vector of measured data is represented by the sequence $\mathbf{x} = \left[x[n]\;\; x[n-1]\;\; \cdots\;\; x[n-N+1]\right]^{T}$, while the vector of parameters to be estimated is another sequence, in this context called the desired signal, indicated as $\mathbf{d} = \left[d[n]\;\; d[n-1]\;\; \cdots\;\; d[n-L+1]\right]^{T}$. In this situation, the estimator is defined by the operator

$$\hat{\mathbf{d}} = T\{\mathbf{x}\}. \qquad (C.193)$$

In other words, $T\{\cdot\}$ maps the sequence $\mathbf{x}$ to another sequence $\hat{\mathbf{d}}$.
For such a problem the MAP, ML, MMSE, and linear MMSE estimators are defined as follows:

1. MAP

$$\arg\max\; f_{d|\mathbf{x}}\left(d[n]\,\big|\,x[n]\right), \qquad (C.194)$$

2. ML

$$\arg\max\; f_{\mathbf{x};d}\left(d[n];\,x[n]\right), \qquad (C.195)$$

3. MMSE

$$\hat{d}[n] = E\left\{d[n]\,\big|\,x[n]\right\}, \qquad (C.196)$$

4. Linear MMSE

$$\hat{d}[n] = \mathbf{h}^{T}\mathbf{x}. \qquad (C.197)$$
Comparing the four procedures, we can say that the linear MMSE estimator, while the least general, has the simplest implementation form. In fact, methods 1, 2, and 3 require the explicit knowledge of the densities of the signals (and of the parameters to estimate) or, at least, of the conditional expectations. The linear MMSE, however, can be obtained from the knowledge of the second-order moments (acf, ccf) of the data and parameters alone and, even if these are not known, they can easily be estimated directly from the data. As another strong point of the linear MMSE method, note that the structure of the operator $T\{\cdot\}$ has the form of a convolution (inner or dot product) and takes the form of an FIR filter; so we have
$$\hat{d}[n] = \sum_{k=0}^{M-1} w[k]\, x[n-k] = \mathbf{w}^{T}\mathbf{x} \qquad (C.198)$$

for which the parameters $\mathbf{h}$ in (C.197) are replaced by the coefficients $\mathbf{w}$ of a linear FIR filter. This solution, which happens to be one of the best and most widely used in adaptive signal processing, also extends to many artificial neural network architectures.
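As a concrete instance of (C.198), the sketch below (Python with NumPy assumed; the signal model, noise level, and filter length are all illustrative) designs the FIR coefficients purely from second-order statistics and checks that filtering reduces the error with respect to the raw noisy observation:

```python
import numpy as np

rng = np.random.default_rng(5)
n, M = 100_000, 8

# Desired signal: unit-variance AR(1); observation: desired signal plus white noise
a = 0.95
d = np.zeros(n)
for k in range(1, n):
    d[k] = a * d[k - 1] + np.sqrt(1 - a**2) * rng.standard_normal()
x = d + 0.5 * rng.standard_normal(n)

# Delay-line data matrix: row i holds [x[i+M-1], x[i+M-2], ..., x[i]]
X = np.stack([x[M - 1 - k : n - k] for k in range(M)], axis=1)
dd = d[M - 1 :]

# FIR coefficients from (sample) second-order moments only, as in (C.198)
w = np.linalg.solve(X.T @ X / len(dd), X.T @ dd / len(dd))
d_hat = X @ w

# Raw-observation MSE vs. filtered MSE
print(np.mean((x[M - 1 :] - dd) ** 2), np.mean((d_hat - dd) ** 2))
```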
C.3.3 Stochastic Models
An extremely powerful paradigm, useful for the statistical characterization of many types of time series, is to consider a stochastic sequence as the output of a linear time-invariant filter whose input is a white noise sequence. This type of random sequence is defined as a linear stochastic process. For stationary sequences this model is general, and the following theorem holds.
C.3.3.1 Wold Theorem
A stationary random sequence $x[n]$ that can be represented as the output of a causal, stable, time-invariant filter, characterized by the impulse response $h[n]$, for a white noise input $\eta[n]$,

$$x[n] = \sum_{k=0}^{\infty} h[k]\,\eta[n-k], \qquad (C.199)$$

is defined as a linear stochastic process.

Moreover, let $H(e^{j\omega})$ be the frequency response of $h[n]$ [see (C.120)]; the power spectral density (PSD) of $x[n]$ is then defined as

$$R_{xx}\left(e^{j\omega}\right) = \left|H\left(e^{j\omega}\right)\right|^{2}\sigma_{\eta}^{2}, \qquad (C.200)$$

where $\sigma_{\eta}^{2}$ represents the variance (the power) of the white noise $\eta[n]$.
C.3.3.2 Autoregressive Model
The autoregressive (AR) time-series model is characterized by the following difference equation:

$$x[n] = -\sum_{k=1}^{p} a[k]\,x[n-k] + \eta[n], \qquad (C.201)$$

which defines the pth-order autoregressive model, indicated as AR(p). The filter coefficients $\mathbf{a} = \left[a_{1}\;\; a_{2}\;\; \cdots\;\; a_{p}\right]^{T}$ are called the autoregressive parameters.
The frequency response of the AR filter is

$$H\left(e^{j\omega}\right) = \frac{1}{1 + \displaystyle\sum_{k=1}^{p} a[k]\,e^{-j\omega k}}, \qquad (C.202)$$

so it is an all-pole filter. Therefore, the PSD of the process is (Fig. C.11)

$$R_{xx}\left(e^{j\omega}\right) = \frac{\sigma_{\eta}^{2}}{\left|1 + \displaystyle\sum_{k=1}^{p} a[k]\,e^{-j\omega k}\right|^{2}}. \qquad (C.203)$$
Moreover, it is easy to show that the acf of an AR(p) model satisfies the following difference equation:

$$r[k] = \begin{cases} -\displaystyle\sum_{l=1}^{p} a[l]\,r[k-l], & k \geq 1 \\[6pt] -\displaystyle\sum_{l=1}^{p} a[l]\,r[l] + \sigma_{\eta}^{2}, & k = 0. \end{cases} \qquad (C.204)$$
Note that the latter can be written in matrix form as

$$\begin{bmatrix} r[0] & r[1] & \cdots & r[p-1] \\ r[1] & r[0] & \cdots & r[p-2] \\ \vdots & \vdots & \ddots & \vdots \\ r[p-1] & r[p-2] & \cdots & r[0] \end{bmatrix} \begin{bmatrix} a[1] \\ a[2] \\ \vdots \\ a[p] \end{bmatrix} = -\begin{bmatrix} r[1] \\ r[2] \\ \vdots \\ r[p] \end{bmatrix}. \qquad (C.205)$$
Fig. C.11 Discrete-time circuit for the generation of a linear autoregressive random sequence
Moreover, from (C.204) we have that

$$\sigma_\eta^2 = r[0] + \sum_{k=1}^{p} a[k]\,r[k]. \qquad (C.206)$$
From the foregoing, supposing the acf values $r[k]$ for $k = 0, 1, \ldots, p$ are known,
the AR parameters can be determined by solving the system of
$p$ linear equations (C.205). These equations are known as the Yule–Walker equations.
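Given the acf values, the system (C.205) together with (C.206) can be solved with a plain linear solver. The sketch below assumes NumPy and uses illustrative names; in practice the Toeplitz structure of (C.205) is exploited by the Levinson–Durbin recursion, but a direct solve shows the idea:

```python
import numpy as np

def yule_walker(r):
    # Solve (C.205) for the AR parameters, then (C.206) for the noise variance.
    # r = [r[0], r[1], ..., r[p]] are the known acf values.
    r = np.asarray(r, dtype=float)
    p = r.size - 1
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz
    a = np.linalg.solve(R, -r[1:])
    sigma2_eta = r[0] + a @ r[1:]  # (C.206)
    return a, sigma2_eta

# For an acf of the form r[k] = r[0] a^k (Markov-I process with a = 0.8),
# the solution must be a[1] = -0.8, a[2] = 0, and sigma^2 = r[0](1 - a^2).
a_hat, s2 = yule_walker([1.0, 0.8, 0.64])
```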
Example: First-Order AR Process (Markov Process) Consider a first-order AR
process in which, for simplicity of exposition, it is assumed that $a = -a[1]$; we have
that

$$x[n] = a\,x[n-1] + \eta[n], \qquad n \ge 0,\quad x[-1] = 0. \qquad (C.207)$$
The TF has a single pole, $H(z) = 1/(1 - az^{-1})$. From (C.204),

$$r[k] = \begin{cases} a\,r[k-1] & k \ge 1 \\ a\,r[1] + \sigma_\eta^2 & k = 0, \end{cases} \qquad (C.208)$$

which can be solved as

$$r[k] = r[0]\,a^{k}, \qquad k > 0. \qquad (C.209)$$
Hence from (C.206) we have that

$$\sigma_\eta^2 = r[0] - a\,r[1]. \qquad (C.210)$$

It is possible to derive the acf as a function of the parameter $a$ as

$$r[k] = \frac{\sigma_\eta^2}{1 - a^2}\,a^{k}. \qquad (C.211)$$
The process generated by (C.207) is typically called a first-order Markov
stochastic process (Markov-I model). In this case, the AR filter has an impulse
response that decreases geometrically at a rate $a$ determined by the position of
the pole on the $z$-plane.
Narrowband First-Order Markov Process with Unitary Variance
Usually, the performance of adaptive algorithms is measured with
narrowband unit-variance SP. Very often, these SPs are generated with Eq. (C.207)
for values of $a$ very close to 1, i.e., $0 \ll a < 1$.
In addition, from (C.211), for the process $x[n]$ to have unit variance it is
sufficient that the input GWN have a variance equal to $1 - a^2$. In other words, for
$\eta[n] \sim N(0,1)$, it is sufficient to have the TF $H(z) = \sqrt{1-a^2}\,/\,(1 - az^{-1})$, which
corresponds to the difference equation

$$x[n] = a\,x[n-1] + \sqrt{1-a^2}\,\eta[n], \qquad n \ge 0,\quad x[-1] = 0. \qquad (C.212)$$
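A minimal generator for (C.212), assuming NumPy (the function name and seed handling are illustrative, not from the text):

```python
import numpy as np

def markov1_unit_variance(a, N, seed=None):
    # x[n] = a x[n-1] + sqrt(1 - a^2) eta[n], eta[n] ~ N(0, 1), x[-1] = 0  (C.212)
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal(N)
    g = np.sqrt(1.0 - a * a)
    x = np.empty(N)
    prev = 0.0
    for n in range(N):
        prev = a * prev + g * eta[n]
        x[n] = prev
    return x

x = markov1_unit_variance(0.95, 200_000, seed=0)
# For long sequences, var(x) tends to 1 and r[1]/r[0] tends to a.
```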
In this case the acf is $r[k] = \sigma_\eta^2\,a^{k}$ for $k = 0, 1, \ldots, M$, so the autocorrelation
matrix is

$$\mathbf{R}_{xx} = \sigma_\eta^2 \begin{bmatrix} 1 & a & a^2 & \cdots & a^{M-1} \\ a & 1 & a & \cdots & a^{M-2} \\ a^2 & a & 1 & \cdots & \vdots \\ \vdots & \vdots & \vdots & \ddots & a \\ a^{M-1} & a^{M-2} & \cdots & a & 1 \end{bmatrix}. \qquad (C.213)$$
For example, in the case $M = 2$, the condition number of $\mathbf{R}_{xx}$, given by the ratio
between the maximum and minimum eigenvalues, is equal to$^{13}$

$$\chi(\mathbf{R}_{xx}) = \frac{1+a}{1-a}, \qquad (C.214)$$

for which, in order to test the algorithms under extreme conditions, it is possible to
generate a process with a predetermined value of the condition number. In fact,
solving the latter for $a$, we get

$$a = \frac{\chi(\mathbf{R}_{xx}) - 1}{\chi(\mathbf{R}_{xx}) + 1}. \qquad (C.215)$$
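For instance, (C.214)–(C.215) can be cross-checked numerically (NumPy assumed; names are illustrative): choose $a$ from a target condition number, then recompute $\chi$ from the eigenvalues of the $M = 2$ matrix (C.213):

```python
import numpy as np

def a_for_condition_number(chi):
    # Invert (C.214): pick a so that the M = 2 matrix R_xx has condition number chi.
    return (chi - 1.0) / (chi + 1.0)

a = a_for_condition_number(100.0)        # a = 99/101, close to 1
Rxx = np.array([[1.0, a], [a, 1.0]])     # (C.213) with M = 2 and unit variance
w = np.linalg.eigvalsh(Rxx)              # eigenvalues 1 - a and 1 + a
chi_check = w.max() / w.min()
```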
C.3.3.3 Moving Average Model
The moving average (MA) time-series model is characterized by the following
difference equation:
$$x[n] = \sum_{k=0}^{q} b[k]\,\eta[n-k], \qquad (C.216)$$

which defines the order-$q$ moving average model, indicated as MA($q$). The coef-
ficients of the filter $\mathbf{b} = [b_0\ b_1\ \cdots\ b_q]^T$ are called the moving average parameters.
The scheme of the moving average circuit model is illustrated in Fig. C.12.
$^{13}$ $p(\lambda) = \det\begin{bmatrix} 1-\lambda & a \\ a & 1-\lambda \end{bmatrix} = \lambda^2 - 2\lambda + (1 - a^2)$, for which $\lambda_{1,2} = 1 \pm a$.
The frequency response of the filter is

$$H(e^{j\omega}) = \sum_{k=0}^{q} b[k]\,e^{-j\omega k}. \qquad (C.217)$$

The filter has a multiple pole at the origin and is characterized only by zeros. The
PSD of the process is

$$R_{xx}(e^{j\omega}) = \sigma_\eta^2 \left| \sum_{k=0}^{q} b[k]\,e^{-j\omega k} \right|^2. \qquad (C.218)$$
The acf of the MA($q$) model is

$$r[k] = \begin{cases} \sigma_\eta^2 \sum_{l=0}^{q-|k|} b[l]\,b[l+|k|] & |k| \le q \\[4pt] 0 & |k| > q. \end{cases} \qquad (C.219)$$
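The acf (C.219) is straightforward to tabulate; a sketch assuming NumPy (the function name is illustrative):

```python
import numpy as np

def ma_acf(b, sigma2_eta, k):
    # acf of an MA(q) process per (C.219); b = [b_0, b_1, ..., b_q].
    b = np.asarray(b, dtype=float)
    q = b.size - 1
    k = abs(int(k))
    if k > q:
        return 0.0                      # the acf vanishes beyond lag q
    return sigma2_eta * float(np.sum(b[:q - k + 1] * b[k:]))

b = [1.0, 0.5]                          # MA(1) example
r0 = ma_acf(b, 1.0, 0)                  # b_0^2 + b_1^2 = 1.25
r1 = ma_acf(b, 1.0, 1)                  # b_0 b_1 = 0.5
r2 = ma_acf(b, 1.0, 2)                  # 0, since lag 2 > q
```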
C.3.3.4 Spectral Estimation with Autoregressive Moving Average Model
If the generation filter has both poles and zeros, the model is an autoregressive moving
average (ARMA) model. Denoting by $q$ and $p$, respectively, the degrees of the
numerator and denominator polynomials of the transfer function $H(z)$, the model is
indicated as ARMA($p, q$). The model is then characterized by the following
difference equation:
$$x[n] = -\sum_{k=1}^{p} a[k]\,x[n-k] + \sum_{k=0}^{q} b[k]\,\eta[n-k]. \qquad (C.220)$$
For the PSD we then have

$$R_{xx}(e^{j\omega}) = \sigma_\eta^2 \left|H(e^{j\omega})\right|^2 = \sigma_\eta^2\, \frac{\left|b_0 + b_1 e^{-j\omega} + b_2 e^{-j2\omega} + \cdots + b_q e^{-jq\omega}\right|^2}{\left|1 + a_1 e^{-j\omega} + a_2 e^{-j2\omega} + \cdots + a_p e^{-jp\omega}\right|^2}. \qquad (C.221)$$
Fig. C.12 Discrete-time circuit for the generation of a linear moving average random sequence
Remark The AR, MA, and ARMA models are widely used in digital signal
processing, in many contexts: the analysis and synthesis of
signals, signal compression, signal classification, quality enhancement, etc.
The expression (C.221) defines a power spectral density that represents an
estimate of the spectrum of the signal $x[n]$. In other words, (C.221) allows the
estimation of the PSD through the estimation of the parameters $\mathbf{a}$ and $\mathbf{b}$ of the
stochastic ARMA signal generation model. In signal analysis, such methods
are referred to as parametric methods of spectral estimation [17].
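As a sketch of such a parametric estimate, once the parameters a and b have been estimated, (C.221) is evaluated on a frequency grid (NumPy assumed; names are illustrative):

```python
import numpy as np

def arma_psd(a, b, sigma2_eta, omega):
    # Parametric PSD (C.221): sigma^2 |B(e^{jw})|^2 / |A(e^{jw})|^2, with
    # A(z) = 1 + a_1 z^{-1} + ... + a_p z^{-p} and B(z) = b_0 + ... + b_q z^{-q}.
    a = np.concatenate(([1.0], np.asarray(a, dtype=float)))
    b = np.asarray(b, dtype=float)
    A = np.exp(-1j * np.outer(omega, np.arange(a.size))) @ a
    B = np.exp(-1j * np.outer(omega, np.arange(b.size))) @ b
    return sigma2_eta * np.abs(B) ** 2 / np.abs(A) ** 2

omega = np.linspace(0, np.pi, 256)
psd = arma_psd([-0.5], [1.0, 0.4], 1.0, omega)   # ARMA(1, 1) example
# At omega = 0 the value is |1.4|^2 / |0.5|^2 = 7.84
```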
References
1. Golub GH, Van Loan CF (1989) Matrix computations. Johns Hopkins University Press, Baltimore, MD. ISBN 0-80183772-3
2. Sherman J, Morrison WJ (1950) Adjustment of an inverse matrix corresponding to a change in
one element of a given matrix. Ann Math Stat 21(1):124–127
3. Fletcher R (1986) Practical methods of optimization. Wiley, New York. ISBN 0471278289
4. Nocedal J (1992) Theory of algorithms for unconstrained optimization. Acta Numerica 1:199–242
5. Lyapunov AM (1966) Stability of motion. Academic, New York
6. Levenberg K (1944) A method for the solution of certain problems in least squares. Quart Appl
Math 2:164–168
7. Marquardt D (1963) An algorithm for least squares estimation on nonlinear parameters.
SIAM J Appl Math 11:431–441
8. Tychonoff AN, Arsenin VY (1977) Solution of Ill-posed problems. Winston & Sons,
Washington, DC. ISBN 0-470-99124-0
9. Broyden CG (1970) The convergence of a class of double-rank minimization algorithms. J Inst
Math Appl 6:76–90
10. Goldfarb D (1970) A family of variable metric updates derived by variational means.
Math Comput 24:23–26
11. Shanno DF (1970) Conditioning of quasi-Newton methods for function minimization. Math Comput 24:647–656
12. Hestenes MR, Stiefel E (1952) Methods of conjugate gradients for solving linear systems. J Res Natl Bur Stand 49:409–436
13. Hestenes MR, Stiefel E (1952) Methods of conjugate gradients for solving linear systems.
J Res Natl Bur Stand 49(6):409–436, available on-line http://nvlpubs.nist.gov/nistpubs/jres/
049/6/V49.N06.A08.pdf
14. Shewchuk JR (1994) An introduction to the conjugate gradient method without the agonizing
pain. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
15. Andrei N (2008) Conjugate gradient methods for large-scale unconstrained optimization
scaled conjugate gradient algorithms for unconstrained optimization. Ovidius University,
Constantza, on-line available on http://www.ici.ro/camo/neculai/cg.ppt
16. Papoulis A (1991) Probability, random variables, and stochastic processes, 3rd edn.
McGraw-Hill, New York
17. Kay SM (1998) Fundamentals of statistical signal processing: detection theory. Prentice Hall, Upper Saddle River, NJ
18. Fisher RA (1922) On the mathematical foundations of theoretical statistics. Philos Trans R Soc
A 222:309–368
19. Manolakis DG, Ingle VK, Kogon SM (2005) Statistical and adaptive signal processing.
Artech House, Norwood, MA
20. Widrow B, Stearns SD (1985) Adaptive signal processing. Prentice Hall ed, Englewood Cliffs,
NJ
21. Sayed AH (2003) Fundamentals of adaptive filtering. IEEE Wiley Interscience, Hoboken, NJ
22. Wiener N (1949) Extrapolation, interpolation and smoothing of stationary time series, with
engineering applications. Wiley, New York
23. Rao CR (1994) Selected papers of C.R. Rao. In: Das Gupta S (ed). Wiley. ISBN 978-0470220917
24. Strang G (1988) Linear algebra and its applications, 3rd edn. Thomas Learning, Lakewood,
CO. ISBN 0-15-551005-3
25. Petersen KB, Pedersen MS (2012) The matrix cookbook, Ver. November 15
26. Daubechies I (1988) Orthonormal bases of compactly supported wavelets. Commun Pure Appl
Math 41:909–996
27. Wikipedia: http://en.wikipedia.org/wiki/Matrix_theory
Index
A
Active noise control (ANC)
confined spaces, 76
in duct, 75, 76
free space, 76
one-dimensional tubes, 75
operation principle of, 75, 76
personal protection, 76
Adaptation algorithm
first-order SDA and SGA algorithms,
208–209
general properties
energy conservation, 223–225
minimal perturbation properties,
221–223
nonlinearity error adaptation, 220
principle of energy conservation,
224–225
SGA analysis, 221
performance
convergence speed and learning curve,
218–219
nonlinear dynamic system, 215–216
stability analysis, 216
steady-state performance, 217–218
tracking properties, 219–220
weights error vector and root mean
square deviation, 216–217
priori and posteriori errors, 209–210
recursive formulation, 207–208
second-order SDA and SGA algorithms
conjugate gradient algorithms (CGA)
algorithms, 212–213
discrete Newton’s method, 210
formulation, 211
Levenberg–Marquardt variant, 211–212
on-line learning algorithms, 214
optimal filtering, 213
quasi-Newton/variable metric methods,
212
weighing matrix, 211
steepest-descent algorithms, 206
stochastic-gradient algorithms, 206
transversal adaptive filter, 206–207
Adaptive acoustic echo canceller scheme, 74
Adaptive beamforming, sidelobe canceller
composite-notation GSC, 554–556
frequency domain GSC, 556–558
generalized sidelobe canceller, 547, 548
with block algorithms, 551–553
block matrix determination, 549, 550
geometric interpretation of, 553–554
interference canceller, 549
with on-line algorithms, 551
high reverberant environment, 559–561
multiple sidelobe canceller, 547
robust GSC beamforming, 558–559
Adaptive channel equalization, 69–71
Adaptive filter (AF)
active noise control
confined spaces, 76
in duct, 75, 76
free space, 76
one-dimensional tubes, 75
operation principle of, 75, 76
personal protection, 76
adaptive inverse modeling estimation
adaptive channel equalization,
69–71
control and predistortion, 71
downstream/upstream estimation
schemes, 68, 69
adaptive noise/interference cancellation, 72
array processing
adaptive interference/noise cancellation
microphone array, 78, 79
beamforming, 78–81
detection of arrivals, sensors for, 78, 79
room acoustics active control, 81–82
very large array radio telescope, 77
biological inspired intelligent circuits
artificial neural networks, 82–85
biological brain characteristics, 83
blind signal processing, 86
blind signal separation, 86–89
formal neurons, 84
multilayer perceptron network, 84–85
reinforcement learning, 85–86
supervised learning algorithm, 85, 86
classification based on
cost function characteristics, 62–66
input-output characteristics, 60, 61
learning algorithm, 61–62
definition of, 55, 58
discrete-time, 58, 59
dynamic physical system identification
process
model selection, 66–67
pseudo random binary sequences, 67
schematic representation, 66, 67
set of measures, 67–68
structural identification procedure, 67
echo cancellation
adaptive echo cancellers, 74
hybrid circuit, 73
multichannel case, 75
teleconference scenario, 73
two-wire telephone communication, 73
linear adaptive filter
filter input-output relation, 92
real and complex domain vector
notation, 92–94
MIMO filter (see Multiple-input
multiple-output (MIMO) filter)
multichannel filter with blind learning
scheme, 86, 87
optimization criterion and cost functions,
99–100
prediction system, 68
schematic representation of, 57, 58
stochastic optimization
adaptive filter performance
measurement, 110–113
coherence function, 108–110
correlation matrix estimation, 105–108
frequency domain interpretation,
108–110
geometrical interpretation and
orthogonality principle, 113–114
multichannel Wiener’s normal
equations, 119–121
principal component analysis of optimal
filter, 114–118
Wiener filter, complex domain
extension, 118–119
Wiener–Hopf notation (see Wiener–
Hopf notation)
stochastic processes, 91
usability of, 55
Adaptive interference/noise cancellation
(AIC), 72
acoustic underwater exploration, 138–139
adaptive noise cancellation principle
scheme, 133
error minimization, 133
impulse response, 132
microphone array, 78, 79
performances analysis, 137–138
primary reference, 131
reverberant noisy environment, 134, 135
scalar version, 133
secondary reference, 131, 135
signal error, 131–132
without secondary reference signal
adaptive line enhancement, 140–141
broadband signal and narrowband noise,
139–140
Adaptive inverse modeling estimation, AF
adaptive channel equalization, 69–71
control and predistortion, 71
downstream/upstream estimation schemes,
68, 69
Adaptive line enhancement (ALE), 140–141
Affine projection algorithms (APA)
computational complexity of, 298
delay input vector, 299
description of, 295
minimal perturbation property, 296–298
variants of, 299
All-pole inverse lattice filter, 464–465
Approximate stochastic optimization (ASO),
144–145
adaptive filtering formulation
minimum error energy, 151
notations and definitions, 148–149
Yule–Walker normal equations,
149–151
adaptive filter performance measurement
error surface, canonical form of,
112–113
excess-mean-square error, 113
minimum error energy, 112
performance surface, 110–112
coherence function, 108–110
correlation matrix estimation
sequences estimation, 106–107
vectors estimation, 107–108
data matrix X
autocorrelation method, 155
covariance method, 155
post-windowing method, 153–154
projection operator and column space,
158–159
sensors arrays, 155–156
frequency domain interpretation
coherence function, 109
magnitude square coherence, 109
optimal filter, frequency response
of, 109
power spectral density, 108–110
Wiener filter interpretation, 108
geometrical interpretation, 113–114,
156–157
linearly constrained LS, 164–166
LS solution property, 159
multichannel Wiener’s normal equations
cross-correlation matrix, 120
error vector, 119
multichannel correlation matrix, 120
nonlinear LS
exponential decay, 167
rational function model, 167
separable least squares, 168
transformation, 168
optimal filter
condition number of correlation
matrix, 117
correlation matrix, 114
decoupled cross-correlation, 115
excess-mean-square error (EMSE), 116
modal matrix, 115
optimum filter output, 116–117
principal component analysis, 117
principal coordinates, 116
orthogonality principle, 113–114, 157
regularization and ill-conditioning,
163–164
regularization term, 161–163
stochastic generation model, 145–146
weighed and regularized LS, 164
weighted least squares, 160–161
Wiener filter, complex domain extension,
118–119
Wiener–Hopf notation (see Wiener–Hopf
notation)
Array gain, BF
diffuse noise field, 509
geometric gain, 510
homogeneous noise field, 509
supergain ratio, 510
symmetrical cylindrical isotropic noise,
508–509
symmetrical spherical isotropic noise, 508
white noise gain, 510
Array processing (AP), 478
adaptive filter
adaptive interference/noise cancellation
microphone array, 78, 79
beamforming, 78–81
detection of arrivals, sensors for, 78, 79
room acoustics active control, 81–82
very large array radio telescope, 77
algorithms, 480–481
circuit model
array space-time aperture, 495–497
filter steering vector, 497–498
MIMO notation, 493–495
propagation model, 481–484
sensor radiation diagram, 485–486
steering vector, 484–485
signal model
anechoic signal propagation model,
486–488
echoic signal propagation model,
488–489
numerical model, 486
steering vector
harmonic linear array, 492–493
uniform circular array, 491–492
uniform linear array, 490–491
Artificial neural networks (ANNs), 82–85
Augmented Yule–Walker normal equations,
437–439
Autoregressive moving average (ARMA)
model, 439–440
B
Backward linear prediction (BLP), 431–433
Backward prediction, 424
Backward prediction RLS filter, 469–470
Basis matrix, 6, 7
Batch joint process estimation, ROF
adaptive ladder filter parameters
determination, 458–459
Burg estimation formula, 459
Batch joint process estimation (cont.)
lattice-ladder filter structure for, 456, 457
stage-by-stage orthogonalization, 457–458
Beamforming (BF), 78–81
Beampattern, 507
Biological inspired intelligent circuits, AF
artificial neural networks, 82–85
biological brain characteristics, 83
blind signal processing, 86
blind signal separation, 86–89
formal neurons, 84
multilayer perceptron network, 84–85
reinforcement learning, 85–86
supervised learning algorithm, 85, 86
Blind signal processing (BSP), 86
Blind signal separation (BSS), 86
deconvolution of sources, 88–89
independent sources separation, 87–88
Block adaptive filter
BLMS algorithm
characterization of, 357
convergence properties of, 358
definition of, 357
block matrix, 355
block update parameter, 355
error vector, 356
schematic representation of, 355
Block algorithms
definition of, 351
indicative framework for, 352
L-length signal block, 353
and online algorithms, 354
Block iterative algebraic reconstruction
technique (BI-ART), 173
Bruun’s algorithm, 397–399
Burg estimation formula, 459
C
Circular convolution FDAF (CC-FDAF)
algorithm, 373–375
Combined one-step forward-backward linear
prediction (CFBLP), 434, 435
discrete-time two-port network structure,
455
and lattice adaptive filters, 453–456
Confined propagation model, 488–489
Continuous time signal-integral transformation
(CTFT), 15
Continuous-time signal-series expansion
(CTFS), 15
Conventional beamforming
broadband beamformer, 522–523
differential sensors array
DMA array gain for spherical isotropic
noise field, 519–521
DMA radiation diagram, 517–519
DMA with adaptive calibration filter,
521–522
DSBF-ULA
DSBF gains, 512–515
radiation pattern, 512
steering delay, 515–516
spatial response direct synthesis
alternation theorem, 524
frequency-angular sampling, 525–527
windowing method, 524–525
Cramer-Rao bound (CRB), 561–562
D
Data-dependent beamforming
minimum variance broadband beamformer,
537–538
constrained power minimization, 539
geometric interpretation, 542–544
lagrange multipliers solution, 541
LCMV constraints, 544–546
matrix constrain determination, 540
recursive procedure, 541–542
post-filtering beamformer
definition, 534–535
separate post-filter adaptation, 537
signal model, 535
superdirective beamformer
Cox’s regularized solutions, 529–531
line-array superdirective beamformer,
531–534
standard capon beamforming, 528
Data-dependent transformation matrix, 12–14
Data-dependent unitary transformation, 12–14
Data windowing constraints, 360
Delayed learning LMS algorithms
adjoint LMS (AD-LMS) algorithm,
277–278
definition, 273–274
delayed LMS (DLMS) algorithm, 275
discrete-time domain filtering operator,
274–275
filtered-X LMS Algorithm, 276–277
multichannel AD-LMS, 284
multichannel FX-LMS algorithm, 278–284
adaptation rule, 284
composite notation 1, 281–283
composite notation 2, 278–281
data matrix definition, 279
Kronecker convolution, 280
vectors and matrices size, 283
Differential microphones array (DMA)
with adaptive calibration filter, 521–522
array gain for spherical isotropic noise field,
519–521
frequency response, 518–519
polar diagram, 518
radiation diagram, 517–519
Direction of arrival (DOA), 478
broadband, 568–569
narrowband
with Capon’s beamformer, 563
with parametric methods, 566–568
signal model, 562
steered response power method,
562–563
with subspace analysis, 563–565
Discrete cosine transform (DCT), 10–11
Discrete Fourier transform (DFT)
definition, 8, 9
matrix, 8
periodic sequence, 8
properties of, 9
with unitary transformations, 8–9
Discrete Hartley transform (DHT), 9–10
Discrete sine transform (DST), 11
Discrete space-time filtering
array processing, 478
algorithms, 480–481
circuit model, 493–498
propagation model, 481–486
signal model, 486–489
conventional beamforming
broadband beamformer, 522–523
differential sensors array, 516–522
DSBF-ULA, 511–516
spatial response direct synthesis,
523–527
data-dependent beamforming
minimum variance broadband
beamformer, 537–546
post-filtering beamformer, 534–537
superdirective beamformer, 528–534
direction of arrival
broadband, 568–569
narrowband, 561–568
electromagnetic fields, 479
isotropic sensors, 478–479
noise field
array quality, 504–511
characteristics, 501–504
spatial covariance matrix, 498–501
sidelobe canceller
composite-notation GSC, 554–556
frequency domain GSC, 556–558
generalized sidelobe canceller, 547–549
GSC adaptation, 551–554
high reverberant environment, 559–561
multiple sidelobe canceller, 547
robust GSC beamforming, 558–559
spatial aliasing, 477
spatial frequency, 478
spatial sensors distribution, 479–480
time delay estimation
cross-correlation method, 569–570
Knapp–Carter’s generalized cross-
correlation method, 570–574
steered response power PHAT method,
574–576
Discrete-time adaptive filter, 58, 59
Discrete-time (DT) circuits
analog signal processing
advantages of, 20
current use, 21
bounded-input-bounded-output stability,
22–23
causality, 22
digital signal processing
current applications of, 21
disadvantages, 20
elements definition, 25–27
FDE (see Finite difference equation, DT circuits)
frequency response
computation, 28–29
Fourier series, 29
graphic form, 28
periodic function, 29
impulse response, 23
linearity, 22
linear time invariant
convolution sum, 24
finite duration sequences, 25
single-input-single-output (SISO), 21
time invariance, 22
transformed domains
discrete-time fourier transform, 31–35
FFT Algorithm, 37
transfer function (TF), 36
z-transform, 30–31
Discrete-time (DT) signals
definition, 2
deterministic sequences, 3, 4
real and complex exponential sequence,
5, 6
unitary impulse, 3–4
unit step, 4–5
graphical representation, 2, 3
random sequences, 3, 4
with unitary transformations
basis matrix, 6
data-dependent transformation matrix,
12–14
DCT, 10–11
DFT, 8–9
DHT, 9–10
DST, 11
Haar transform, 11–12
Hermitian matrix, 7
nonstationary signals, 7
orthonormal expansion (see Orthonormal expansion, DT signals)
unitary transform, 6, 7
DT delta function. See Unitary impulse
E
Echo cancellation, AF
adaptive echo cancellers, 74
hybrid circuit, 73
multichannel case, 75
teleconference scenario, 73
two-wire telephone communication, 73
Energy conservation theorem, 225
Error sequential regression (ESR) algorithms
average convergence study, 292
definitions and notation, 290–291
derivation of, 291–292
Estimation of signal parameters via rotational
invariance technique (ESPRIT)
algorithm, 566–568
Exponentiated gradient algorithms (EGA)
exponentiated RLS algorithm, 347–348
positive and negative weights, 346–347
positive weights, 344–346
F
Fast a posteriori error sequential technique
(FAEST) algorithm, 472–474
Fast block LMS (FBLMS). See Overlap-save FDAF (OS-FDAF) algorithm
Fast Fourier transform (FFT) algorithm, 37
Fast Kalman algorithm, 470–472
Fast LMS (FLMS). See Overlap-save FDAF (OS-FDAF) algorithm
Filter tracking capability, 314
Finite difference equation, DT circuits
BIBO stability criterion, 39–41
circuit representation, 38
impulse response
convolution-operator-matrix input
sequence, 42–43
data-matrix impulse-response vector,
41–42
FIR filter, 41
inner product vectors, 43–44
pole-zero plot, 38–39
Finite Impulse Response (FIR) filters, 494, 517
FOCal Underdetermined System Solver
(FOCUSS) algorithm, 198–199
diversity measure, 199–200
Lagrange multipliers method, 200–202
multichannel extension, 202–203
sparse solution determination, 197
weighted minimum norm solution, 197
Formal neurons, 84
Forward linear prediction (FLP), 431
estimation error, 428–429
filter structure, 429
forward prediction error filter, 430
Forward prediction, 424
Forward prediction RLS filter, 467–468
Free-field propagation model, 486–488
Frequency domain adaptive filter (FDAF)
algorithms, 353, 363
and BLMS algorithm, 358–359
classification of, 364
computational cost analysis, 376
linear convolution
data windowing constraints, 360
DFT and IDFT, in vector notation,
360–361
in frequency domain with overlap-save
method, 361–363
normalized correlation matrix, 378
overlap-add algorithm, 370–371
overlap-save algorithm
with frequency domain error, 371–372
implementative scheme of, 368–369
linear correlation coefficients, 365
structure of, 367
weight update and gradient’s constraint,
365–368
partitioned block algorithms (see Partitioned block FDAF algorithms)
performance analysis of, 376–378
schematic representation of, 359
step-size normalization procedure, 364–365
UFDAF algorithm (see Unconstrained FDAF (UFDAF) algorithm)
Frost algorithm, 537–538
constrained power minimization, 539
geometric interpretation, 542–544
lagrange multipliers solution, 541
LCMV constraints, 544–546
matrix constrain determination, 540
recursive procedure, 541–542
G
Generalized sidelobe canceller (GSC),
547, 548
with block algorithms, 551–553
block matrix determination, 549, 550
composite-notation, 554–556
frequency domain, 556–558
geometric interpretation of, 553–554
interference canceller, 549
with on-line algorithms, 551
robustness, 558–559
Gilloire-Vetterli’s tridiagonal SAF structure,
413–415
Gradient adaptive lattice (GAL) algorithm,
ROF, 459
adaptive filtering, 460–462
finite difference equations, 460
H
Haar unitary transform, 11–12
Hermitian matrix, 7
High-tech sectors, 20
I
Input signal buffer composition mechanism,
352
Inverse discrete Fourier transform (IDFT), 8, 9
K
Kalman filter algorithms
applications, 315
cyclic representation of, 321
discrete-time formulation, 316–319
observation mode, knowledge of, 320
in presence of external signal, 323–324
process model, knowledge of, 320
recursive nature of, 321
robustness, 323
significance of, 322
state space representation, of linear system,
315, 316
Kalman gain vector, 302
Karhunen–Loeve transform (KLT), 390
Kullback–Leibler divergence (KLD), 344
L
Lagrange function, 165
Lattice filters, properties of
optimal nesting, 455
orthogonality, of backward/forward
prediction errors, 456
stability, 455
Least mean squares (LMS) algorithm
characterization and convergence
error at optimal solution, 248
mean square convergence, 250–252
noisy gradient model, 252–253
weak convergence, 249–250
weights error vector, 248
complex domain signals
computational cost, 239
filter output, 237–238
stochastic gradient, 238–239
convergence speed
eigenvalues disparity, 258–260
nonuniform convergence, 258–260
excess of MSE (EMSE)
learning curve, 257–258
steady-state error, 254–256
formulation
adaptation, 233
computational cost, 236
DT circuit representation, 233
gradient vector, 233
instantaneous SDA approximation,
234–235
priori error, 233
recursive form, 235
vs. SDA comparison, 236
sum of squared error (SSE), 234
gradient estimation filter, 271–272
leaky LMS, 267–269
adaptation law, 267
cost function, 267
minimum and maximum correlation
matrix, 267
nonzero steady-state coefficient
bias, 268
transient performance, 267
least mean fourth (LMF) algorithm,
270–271
least mean mixed norm algorithm, 271
linear constraints
linearly constrained LMS (LCLMS)
algorithm, 239–240
local Lagrangian, 239
recursive gradient projection LCLMS,
240–241
minimum perturbation property, 236–237
momentum LMS algorithm, 272–273
multichannel LMS algorithms
filter-by-filter adaptation, 244–245
filters banks adaptation, 244
global adaptation, 243–244
impulse response, 242
input and output signal, 242–243
as MIMO-SDA approximation, 245
priori error vector, 243
normalized LMS algorithm
computational cost, 263
minimal perturbation properties,
264–265
variable learning rate, 262–263
proportionate NLMS (PNLMS) algorithm,
265–267
adaptation rule, 265
Gn matrix choice, 265
improved PNLMS, 265
impulse response w sparseness, 266
regularization parameter, 266
sparse impulse response, 266, 267
signed-error LMS, 269
signed-regressor LMS, 270
sign-sign LMS, 270
statistical analysis, adaptive algorithms
performance
adaptive algorithms performance,
246–247
convergence, 248–253
dynamic system model, 246
minimum energy error, 247–248
transient and steady-state filter
performance, 247
steady-state analysis, deterministic input,
260–262
Least squares (LS) method
approximate stochastic optimization (ASO)
methods, 144–145
adaptive filtering formulation, 146–151
stochastic generation model, 145–146
linear equations system
continuous nonlinear time-invariant
dynamic system, 171
iterative LS, 172–174
iterative weighed LS, 174
Levenberg-Marquardt variant, 171
Lyapunov theorem, 172
overdetermined systems, 169
underdetermined systems, 170
matrix factorization
algebraic nature, 174–175
amplitude domain formulation, 175
Cholesky decomposition, 175–177
orthogonal transformation, 177–180
power-domain formulation, 175
singular value decomposition (SVD),
180–184
principle of, 143–144
with sparse solution
matching pursuit algorithms, 191–192
minimum amplitude solution, 190
minimum fuel solution, 191
minimum Lp-norm (or sparse) solution,
193
minimum quadratic norm sparse
solution, 193–195
numerosity, 191
uniqueness, 195
total least squares (TLS) method
constrained optimization problem, 185
generalized TLS, 188–190
TLS solution, 186–188
zero-mean Gaussian stochastic
processes, 185
Levenberg–Marquardt variant, 171
Levinson–Durbin algorithm
LPC, of speech signals, 442
k and β parameters, initialization of,
450–451
prediction error filter structure, 452–453
pseudo-code of, 451
reflection coefficients determination,
448–450
reverse, 452
in scalar form, 448
in vector form, 447–448
Linear adaptive filter
filter input–output relation, 92
real and complex domain vector notation
coefficients’ variation, 93
filter coefficients, 93
input vector regression, 93
weight vector, 92
Linear estimation, 424
Linearly constrained adaptive beamforming,
559–561
Linearly constrained minimum variance
(LCMV)
eigenvector constraint, 546
minimum variance distortionless response,
544–545
multiple amplitude-frequency derivative
constraints, 545–546
Linear prediction
augmented Yule–Walker normal equations,
437–439
coding of speech signals, 440–442
schematic illustration of, 425
using LS approach, 435–437
Wiener’s optimum approach
augmented normal equations, 427–428
BLP, 431–433
CFBLP, 434, 435
estimation error, 424
FLP, 428–430
forward and backward prediction, 424
linear estimation, 424
minimum energy error, 427
predictor vector, 425
SFBLP, 434, 435
square error, 426
stationary process, prediction
coefficients for, 433–434
Linear prediction coding (LPC), of speech
signals
with all-pole inverse filter, 441
general synthesis-by-analysis scheme,
440, 441
Levinson-Durbin algorithm, 442
k and β parameters, initialization of,
450–451
prediction error filter structure, 452–453
pseudo-code of, 451
reflection coefficients determination,
448–450
reverse, 452
in scalar form, 448
in vector form, 447–448
low-rate voice transmission, 441
speech synthesizer, 441, 442
Linear random sequence, spectral estimation
of, 439–440
LMS algorithm, tracking performance of
mean square convergence of, 331–332
nonstationary RLS performance, 332–334
stochastic differential equation, 330
weak convergence analysis, 330–331
LMS Newton algorithm, 174, 293
Low-diversity inputs MIMO adaptive filtering,
335–336
channels dependent LMS algorithm,
337–338
multi-channels factorized RLS algorithm,
336–337
LOw-Resolution Electromagnetic
Tomography Algorithm
(LORETA), 196
Lyapunov attractor
continuous nonlinear time-invariant
dynamic system, 171
finite-difference equations (FDE), 173
generalized energy function, 172
iterative update expression, 174
learning rates, 173
LMS Newton algorithm, 174
online adaptation algorithms, 173
order recursive technique, 173
row-action-projection method, 173
M
MIMO error sequential regression algorithms
low-diversity inputs MIMO adaptive
filtering, 335–338
MIMO RLS, 334–335
multi-channel APA algorithm, 338–339
Moore–Penrose pseudoinverse matrix, 151
Multi-channel APA algorithm, 338–339
Multilayer perceptron (MLP) network, 84–85
Multiple error filtered-x (MEFEX), 82
Multiple-input multiple-output (MIMO) filter
composite notation 1, 96
composite notation 2, 97
impulse responses, 96
output snap-shot, 95
parallel of Q filters banks, 97–98
P inputs and Q outputs, 94
snap-shot notation, 98–99
N
Narrowband direction of arrival
with Capon’s beamformer, 563
with parametric methods, 566–568
signal model, 562
steered response power method, 562–563
with subspace analysis, 563–565
Newton’s algorithm
convergence study, 289–290
formulation of, 288
Noise field
array quality
array gain, 507–510
array sensitivity, 510–511
radiation functions, 506–507
signal-to-noise ratio, 505–506
characteristics
coherent field, 502
combined noise field, 504
diffuse field, 503–504
incoherent field, 502
spatial covariance matrix
definition, 498
isotropic noise, 500–501
projection operators, 500
spatial white noise, 499
spectral factorization, 500
Nonstationary AF performance analysis
delay noise, 330
estimation noise, 330
excess error, 327–328
misalignment and non-stationarity degree,
328–329
optimal solution a posteriori error, 327
optimal solution a priori error, 327
weight error lag, 329, 330
weight error noise, 329, 330
weights error vector correlation
matrix, 329
weights error vector mean square
deviation, 329
Normalized correlation matrix, 378
Normalized least mean squares (NLMS), 173
Numerical filter
definition of, 55
linear vs. nonlinear, 56–57
O
Online adaptation algorithms, 173
Optimal linear filter theory
adaptive filter basic and notations
(see Adaptive filter (AF))
adaptive interference/noise cancellation
(AIC)
acoustic underwater exploration,
138–139
adaptive noise cancellation principle
scheme, 133
error minimization, 133
impulse response, 132
performance analysis, 137–138
primary reference, 131
reverberant noisy environment,
134, 135
scalar version, 133
secondary reference, 131, 135
signal error, 131–132
without secondary reference signal,
139–141
communication channel equalization
channel model, 130
channel TF G(z), 127
equalizer input, 128
impulse response g[n] and input s[n], 129
optimum filter, 129
partial fractions, 130–131
receiver’s input, 130
dynamical system modeling 1
cross-correlation vector, 122
H(z) system output, 122
linear dynamic system model, 122
optimum model parameter
computation, 121
performance surface and minimum
energy error, 123
dynamical system modeling 2
linear dynamic system model, 124
optimum Wiener filter, 124–125
time delay estimation
matrix determination R, 126
performance surface, 127
stochastic moving average
(MA) process, 126
vector computation g, 126
Wiener solution, 127
Orthogonality principle, 157
Orthonormal expansion, DT signals
CTFT and CTFS, 15
discrete-time signal, 15–16
Euclidean space, 14
inner product, 14–15
kernel function
energy conservation principle, 17
expansion, 16–17
Haar expansion, 18
quadratically summable sequences, 14
Output projection matrix, 362
Overlap-add FDAF (OA-FDAF) algorithm,
370–371
Overlap-save FDAF (OS-FDAF) algorithm
with frequency domain error, 371–372
implementative scheme of, 368–369
linear correlation coefficients, 365
structure of, 367
weight update and gradient’s constraint,
365–368
Overlap-save sectioning method, 361–363
P
Partial rank algorithm (PRA), 299–300
Partitioned block FDAF (PBFDAF)
algorithms, 379
computational cost of, 385–386
development, 382–384
FFT calculation, 382
filter weights, augmented form of, 380
performance analysis of, 386–388
structure of, 384, 385
time-domain partitioned convolution
schematization, 380, 381
Partitioned frequency domain adaptive
beamformer (PFDABF), 556–558
Partitioned frequency domain adaptive filters
(PFDAF), 354
Partitioned matrix inversion lemma, 443–445
Phase transform method (PHAT), 573–574
Positive weights EGA, 344–346
Pradhan-Reddy’s polyphase SAF architecture,
416–418
A priori error fast transversal filter, 474–475
Propagation model, AP, 481–484
anechoic signal, 486–488
echoic signal, 488–489
sensor radiation diagram, 485–486
steering vector, 484–485
Pseudo random binary sequences (PRBS), 67
R
Random walk model, 325, 326
Real and complex exponential sequence, 5, 6
Recursive-in-model-order adaptive filter
algorithms. See Recursive order filter (ROF)
Recursive least squares (RLS)
computational complexity of, 307–308
conventional algorithm, 305–307
convergence of, 309–312
correlation matrix, with forgetting
factor/Kalman gain, 301–302
derivation of, 300–301
eigenvalues spread, 310
nonstationary, 314–315
performance analysis, 308, 309
a posteriori error, 303–305
a priori error, 303
regularization parameter, 310
robustness, 313–314
steady-state and convergence performance
of, 313
steady-state error of, 312–313
transient performance of, 313
Recursive order filter (ROF), 445–447
all-pole inverse lattice filter, 464–465
batch joint process estimation
adaptive ladder filter parameters
determination, 458–459
Burg estimation formula, 459
lattice-ladder filter structure for,
456, 457
stage-by-stage orthogonalization,
457–458
GAL algorithm, 459
adaptive filtering, 460–462
finite difference equations, 460
importance of, 443
partitioned matrix inversion lemma,
443–445
RLS algorithm
backward prediction RLS filter,
469–470
FAEST, 472–474
fast Kalman algorithm, 470–472
fast RLS algorithm, 465, 466
forward prediction RLS filter, 467–468
a priori error fast transversal filter,
474–475
transversal RLS filter, 466–467
Schur algorithm, 463
Riccati equation, 302
Riemann metric tensor, 343
Robust GSC beamforming, 558–559
Room acoustics active control, 81–82
Room transfer functions (RTF), 81, 82
Row-action-projection method, 173
S
SAF. See Subband adaptive filter (SAF)
Schur algorithm, 463
Second-order adaptive algorithms, 287,
324–325
affine projection algorithms
computational complexity of, 298
delay input vector, 299
description of, 295
minimal perturbation property,
296–298
variants of, 299
error sequential regression algorithms
average convergence study, 292
definitions and notation, 290–291
derivation of, 291–292
general adaptation law
adaptive regularized form, with sparsity
constraints, 340–344
exponentiated gradient algorithms,
344–348
types, 339–340
Kalman filter
applications, 315
cyclic representation of, 321
discrete-time formulation, 316–319
observation mode, knowledge of, 320
in presence of external signal, 323–324
process model, knowledge of, 320
recursive nature of, 321
robustness, 323
significance of, 322
state space representation, of linear
system, 315, 316
LMS algorithm, tracking performance of
mean square convergence of, 331–332
nonstationary RLS performance,
332–334
stochastic differential equation, 330
weak convergence analysis, 330–331
LMS-Newton algorithm, 293
MIMO error sequential regression
algorithms
low-diversity inputs MIMO adaptive
filtering, 335–338
MIMO RLS, 334–335
multi-channel APA algorithm, 338–339
Newton’s algorithm
convergence study, 289–290
formulation of, 288
performance analysis indices
delay noise, 330
estimation noise, 330
excess error, 327–328
misalignment and non-stationarity
degree, 328–329
optimal solution a posteriori error, 327
optimal solution a priori error, 327
weight error lag, 329, 330
weight error noise, 329, 330
weights error vector correlation
matrix, 329
weights error vector mean square
deviation, 329
recursive least squares
computational complexity of, 307–308
conventional, 305–307
convergence of, 309–312
correlation matrix, with forgetting
factor/Kalman gain, 301–302
derivation of, 300–301
eigenvalues spread, 310
nonstationary, 314–315
performance analysis, 308, 309
a posteriori error, 303–305
a priori error, 303
regularization parameter, 310
robustness, 313–314
steady-state and convergence
performance of, 313
steady-state error of, 312–313
transient performance of, 313
time-average autocorrelation matrix,
recursive estimation of, 293
initialization, 295
with matrix inversion lemma, 294
sequential regression algorithm,
294–295
tracking analysis model
assumptions of, 327
first-order Markov process, 326
minimum error energy, 327
nonstationary stochastic process,
325, 326
Signals
analog/continuous-time signals, 1–2
array processing
anechoic signal propagation model,
486–488
echoic signal propagation model,
488–489
numerical model, 486
complex domain, 1, 2
definition, 1
DT signals (see Discrete-time (DT) signals)
Signal-to-noise ratio (SNR), 478
Singular value decomposition (SVD) method
computational cost, 182
singular values, 181
SVD-LS algorithm, 182
Tikhonov regularization theory, 183–184
Sliding window, 354
Smoothed coherence transform method
(SCOT), 572–573
Speech signals, LPC of. See Linear prediction coding (LPC), of speech signals
Steepest-Descent algorithm (SDA)
convergence and stability
learning curve and weights trajectories,
228
natural modes, 227
similarity unitary transformation, 227
stability condition, 228–229
weights error vector, 227
convergence speed
convergence time constant and learning
curve, 231–232
eigenvalues disparities, 229
performance surface trends, 230
rotated expected error, 229
signal spectrum and eigenvalues spread,
230–231
error expectation, 225–226
multichannel extension, 226
recursive solution, 225
Steered response power PHAT (SRP-PHAT),
574–576
Steering vector, AP, 489–493
harmonic linear array, 492–493
uniform circular array, 491–492
uniform linear array, 490–491
Stochastic-gradient algorithms (SGA), 206
Subband adaptive filter (SAF), 354
analysis-synthesis filter banks, 418–419
circuit architectures for
Gilloire-Vetterli’s tridiagonal structure,
413–415
LMS adaptation algorithm, 415–416
Pradhan-Reddy’s polyphase
architecture, 416–418
optimal solution, conditions for, 410–412
schematic representation, 401, 402
subband-coding, 401, 402
subband decomposition, 401
two-channel subband-coding
closed-loop error computation,
409, 410
conjugate quadrature filters, 408–409
with critical sample rate, 402
in modulation domain z-transform
representation, 402–405
open-loop error computation, 409, 410
perfect reconstruction conditions,
405–407
quadrature mirror filters, 407–408
Superdirective beamformer
Cox’s regularized solutions, 529–531
line-array superdirective beamformer,
531–534
standard Capon beamforming, 528
Superposition principle, 22
Symmetric forward-backward linear prediction
(SFBLP), 434, 435, 437
T
Teleconference scenario, echo cancellation
in, 73
Temporal array aperture, 495–496
Tikhonov regularization parameter, 310
Tikhonov’s regularization theory, 163,
183–184
Time-average autocorrelation matrix, recursive
estimation of, 293
initialization, 295
with matrix inversion lemma, 294
sequential regression algorithm, 294–295
Time band-width product (TBWP), 497
Time delay estimation (TDE)
cross-correlation method, 569–570
Knapp–Carter’s generalized cross-
correlation method, 570–574
steered response power PHAT method,
574–576
Total least squares (TLS) method
constrained optimization problem, 185
generalized TLS, 188–190
TLS solution, 186–188
zero-mean Gaussian stochastic
processes, 185
Tracking analysis model
assumptions of, 327
first-order Markov process, 326
minimum error energy, 327
nonstationary stochastic process, 325, 326
Transform domain adaptive filter (TDAF)
algorithms
data-dependent optimal transformation,
390
definition of, 351
FDAF (see Frequency domain adaptive
filter (FDAF) algorithms)
performance analysis, 399–400
a priori fixed sub-optimal transformations,
390
schematic illustration of, 388, 389
sliding transformation LMS, bandpass
filters
DFT bank representation, 394
frequency responses of DFT/DCT, 395
non-recursive DFT filter bank, 397–399
recursive DCT filter bank, 395–397
short-time Fourier transform, 392
signal process in two-dimensional
domain, 393
transform domain LMS algorithm, 391–392
unitary similarity transformation, 390
Transversal RLS filter, 466–467
Two-channel subband-coding
closed-loop error computation, 409, 410
conjugate quadrature filters, 408–409
with critical sample rate, 402
in modulation domain z-transform
representation, 402–405
open-loop error computation, 409, 410
perfect reconstruction conditions, 405–407
quadrature mirror filters, 407–408
Two-wire telephone communication, 73
Type II discrete cosine transform (DCT-II),
391
Type II discrete sine transform (DST-II), 391
U
Unconstrained FDAF (UFDAF) algorithm
circulant Toeplitz matrix, 373
circular convolution FDAF scheme,
373–375
configuration of, 368, 369
convergence analysis, 376–378
convergence speed of, 369
for N ¼ M, 372–375
Unitary impulse, 3–4
Unit step sequence, 4–5
W
Weighted projection operator (WPO), 161
Weighted least squares (WLS), 160–161
Weighting matrix, 362
Wiener–Hopf notation
adaptive filter (AF), 103
autocorrelation matrix, 102
normal equations, 103
scalar notation
autocorrelation, 104
correlation functions, 104
error derivative, 104
filter output, 103
square error, 102
Wiener’s optimal filtering theory, 103
Wiener’s optimum approach, linear prediction
augmented normal equations, 427–428
BLP, 431–433
CFBLP, 434, 435
estimation error, 424
FLP, 428–430
forward and backward prediction, 424
linear estimation, 424
minimum energy error, 427
predictor vector, 425
SFBLP, 434, 435
square error, 426
stationary process, prediction coefficients
for, 433–434
Y
Yule–Walker normal equations, 150