Appendix A: Linear Algebra Basics
A.1 Matrices and Vectors
A matrix $\mathbf{A}$ [1, 24, 25, 27], here indicated by a bold capital letter, consists of a set of ordered elements arranged in rows and columns. A matrix with $N$ rows and $M$ columns is indicated with the following notations:

$$\mathbf{A} = \mathbf{A}_{N\times M} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1M} \\ a_{21} & a_{22} & \cdots & a_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NM} \end{bmatrix} \qquad (A.1)$$

or

$$\mathbf{A} = [a_{ij}], \quad i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, M, \qquad (A.2)$$

where $i$ and $j$ are, respectively, the row and column indices. The elements $a_{ij}$ may be real or complex variables. A real matrix with $N$ rows and $M$ columns $(N \times M)$ can be indicated as $\mathbf{A} \in \mathbb{R}^{N\times M}$, while in the complex case as $\mathbf{A} \in \mathbb{C}^{N\times M}$. When a property holds in both the real and the complex case, the matrix can be indicated as $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$, as $\mathbf{A}\ (N\times M)$, or simply as $\mathbf{A}_{N\times M}$.
A.2 Notation, Preliminary Definitions, and Properties
A.2.1 Transpose and Hermitian Matrix
Given a matrix $\mathbf{A} \in \mathbb{R}^{N\times M}$, the transpose matrix, indicated as $\mathbf{A}^T \in \mathbb{R}^{M\times N}$, is obtained by interchanging the rows and columns of $\mathbf{A}$, so that
A. Uncini, Fundamentals of Adaptive Signal Processing, Signals and Communication
Technology, DOI 10.1007/978-3-319-02807-1,
© Springer International Publishing Switzerland 2015
$$\mathbf{A}^T = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{N1} \\ a_{12} & a_{22} & \cdots & a_{N2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1M} & a_{2M} & \cdots & a_{NM} \end{bmatrix} \qquad (A.3)$$

or

$$\mathbf{A}^T = [a_{ji}], \quad i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, M. \qquad (A.4)$$
It therefore holds that $(\mathbf{A}^T)^T = \mathbf{A}$.
In the case of a complex matrix $\mathbf{A} \in \mathbb{C}^{N\times M}$, the Hermitian matrix is defined as the transpose and complex-conjugate matrix:

$$\mathbf{A}^H = \left[a^*_{ji}\right], \quad i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, M. \qquad (A.5)$$

If the matrix is indicated as $\mathbf{A}\ (N\times M)$, the symbol $(\cdot)^H$ can be used to indicate both the transpose in the real case and the Hermitian in the complex case.
A.2.2 Row and Column Vectors of a Matrix
Given a matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$, its $i$th row vector is indicated as

$$\mathbf{a}_{i:} \in (\mathbb{R},\mathbb{C})^{M\times 1} = [\,a_{i1}\ a_{i2}\ \cdots\ a_{iM}\,]^H \qquad (A.6)$$

while its $j$th column vector as

$$\mathbf{a}_{:j} \in (\mathbb{R},\mathbb{C})^{N\times 1} = [\,a_{1j}\ a_{2j}\ \cdots\ a_{Nj}\,]^H \qquad (A.7)$$

A matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$ can be represented by its $N$ row vectors as

$$\mathbf{A} = \begin{bmatrix} \mathbf{a}^H_{1:} \\ \mathbf{a}^H_{2:} \\ \vdots \\ \mathbf{a}^H_{N:} \end{bmatrix} = [\,\mathbf{a}_{1:}\ \mathbf{a}_{2:}\ \cdots\ \mathbf{a}_{N:}\,]^H \qquad (A.8)$$
or by its M column vectors as
$$\mathbf{A} = [\,\mathbf{a}_{:1}\ \mathbf{a}_{:2}\ \cdots\ \mathbf{a}_{:M}\,] = \begin{bmatrix} \mathbf{a}^H_{:1} \\ \mathbf{a}^H_{:2} \\ \vdots \\ \mathbf{a}^H_{:M} \end{bmatrix}^H \qquad (A.9)$$
Given a matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$, one can associate with it a vector $\mathrm{vec}(\mathbf{A}) \in (\mathbb{R},\mathbb{C})^{NM\times 1}$ containing, stacked, all the column vectors of $\mathbf{A}$:

$$\mathrm{vec}(\mathbf{A}) = [\,\mathbf{a}^H_{:1}\ \mathbf{a}^H_{:2}\ \cdots\ \mathbf{a}^H_{:M}\,]^H_{NM\times 1} = [\,a_{11}, \ldots, a_{N1}, a_{12}, \ldots, a_{N2}, \ldots\ldots, a_{1M}, \ldots, a_{NM}\,]^H_{NM\times 1}. \qquad (A.10)$$
Remark Note that in Matlab you can extract entire columns or rows of a matrix with the following instructions:

- `A(i,:)` extracts the entire $i$th row as a row vector of dimension $M$;
- `A(:,j)` extracts the entire $j$th column as a column vector of size $N$;
- `A(:)` stacks the entire matrix into a column vector of dimension $N \cdot M$.
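The same extractions can be sketched in NumPy (an illustrative analogue, not part of the original text; note that NumPy stores arrays row-major, so column-major stacking must be requested explicitly):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # N = 2 rows, M = 3 columns

row_i = A[0, :]                    # entire 1st row, length M
col_j = A[:, 1]                    # entire 2nd column, length N
vec_A = A.flatten(order='F')       # column-major stacking, as in vec(A) of (A.10)

print(row_i)   # [1 2 3]
print(col_j)   # [2 5]
print(vec_A)   # [1 4 2 5 3 6]
```

The `order='F'` (Fortran order) argument is what makes `flatten` stack columns rather than rows.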
A.2.3 Partitioned Matrices
Sometimes it can be useful to represent a matrix $\mathbf{A}_{(M+N)\times(P+Q)}$ in partitioned form of the type

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}_{(M+N)\times(P+Q)} \qquad (A.11)$$

in which the elements $\mathbf{A}_{ij}$ are in turn matrices defined as

$$\mathbf{A}_{11} \in (\mathbb{R},\mathbb{C})^{M\times P}, \quad \mathbf{A}_{12} \in (\mathbb{R},\mathbb{C})^{M\times Q}, \quad \mathbf{A}_{21} \in (\mathbb{R},\mathbb{C})^{N\times P}, \quad \mathbf{A}_{22} \in (\mathbb{R},\mathbb{C})^{N\times Q}. \qquad (A.12)$$

The partitioned product follows the same rules as the ordinary product of matrices. For example,

$$\begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}\begin{bmatrix} \mathbf{B}_{1} \\ \mathbf{B}_{2} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11}\mathbf{B}_{1} + \mathbf{A}_{12}\mathbf{B}_{2} \\ \mathbf{A}_{21}\mathbf{B}_{1} + \mathbf{A}_{22}\mathbf{B}_{2} \end{bmatrix} \qquad (A.13)$$

Obviously, the dimensions of the partition matrices must be compatible.
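The block-product rule above can be checked numerically; the following sketch (an illustration added here, assuming NumPy) builds a partitioned matrix and verifies the identity (A.13) on random blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
# Compatible block dimensions: A11 (2x3), A12 (2x4), A21 (5x3), A22 (5x4)
A11, A12 = rng.standard_normal((2, 3)), rng.standard_normal((2, 4))
A21, A22 = rng.standard_normal((5, 3)), rng.standard_normal((5, 4))
B1, B2 = rng.standard_normal((3, 2)), rng.standard_normal((4, 2))

A = np.block([[A11, A12], [A21, A22]])   # assemble the partitioned matrix
B = np.vstack([B1, B2])

lhs = A @ B                               # full product
rhs = np.vstack([A11 @ B1 + A12 @ B2,     # block-wise product, as in (A.13)
                 A21 @ B1 + A22 @ B2])
assert np.allclose(lhs, rhs)
```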
A.2.4 Diagonal, Symmetric, Toeplitz, and Hankel Matrices
A given matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times N}$ is called diagonal if $a_{ij} = 0$ for $i \neq j$. It is called symmetric if $a_{ji} = a_{ij}$, or Hermitian if $a_{ji} = a^*_{ij}$ in the complex case, whereby $\mathbf{A}^T = \mathbf{A}$ in the real case and $\mathbf{A}^H = \mathbf{A}$ in the complex case.

A matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times N} = [a_{ij}]$ such that $a_{i,j} = a_{i+1,j+1} = a_{i-j}$ is Toeplitz, i.e., each descending diagonal from left to right is constant.

Moreover, a matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times N} = [a_{ij}]$ such that $a_{i,j} = a_{i-1,j+1} = a_{i+j}$ is Hankel, i.e., each ascending diagonal from left to right is constant.
For example, the following matrices $\mathbf{A}_T$, $\mathbf{A}_H$:

$$\mathbf{A}_T = \begin{bmatrix} a_i & a_{i-1} & a_{i-2} & a_{i-3} & \cdots \\ a_{i+1} & a_i & a_{i-1} & a_{i-2} & \ddots \\ a_{i+2} & a_{i+1} & a_i & a_{i-1} & \ddots \\ a_{i+3} & a_{i+2} & a_{i+1} & a_i & \ddots \\ \vdots & \ddots & \ddots & \ddots & \ddots \end{bmatrix}, \quad \mathbf{A}_H = \begin{bmatrix} a_{i-3} & a_{i-2} & a_{i-1} & a_i & \cdots \\ a_{i-2} & a_{i-1} & a_i & a_{i+1} & \iddots \\ a_{i-1} & a_i & a_{i+1} & a_{i+2} & \iddots \\ a_i & a_{i+1} & a_{i+2} & a_{i+3} & \iddots \\ \vdots & \iddots & \iddots & \iddots & \ddots \end{bmatrix} \qquad (A.14)$$

are Toeplitz and Hankel matrices, respectively.
Given a vector $\mathbf{x} = [\,x(0)\ \cdots\ x(M-1)\,]^T$, a special kind of Toeplitz/Hankel matrix, called a circulant matrix, is obtained by rotating the elements of $\mathbf{x}$ for each column (or row), as in

$$\mathbf{A}_T = \begin{bmatrix} x(0) & x(3) & x(2) & x(1) \\ x(1) & x(0) & x(3) & x(2) \\ x(2) & x(1) & x(0) & x(3) \\ x(3) & x(2) & x(1) & x(0) \end{bmatrix}, \quad \mathbf{A}_H = \begin{bmatrix} x(0) & x(1) & x(2) & x(3) \\ x(1) & x(2) & x(3) & x(0) \\ x(2) & x(3) & x(0) & x(1) \\ x(3) & x(0) & x(1) & x(2) \end{bmatrix} \qquad (A.15)$$

Remark Circulant matrices are important in DSP because they are diagonalized by the discrete Fourier transform, computable with a fast FFT algorithm.
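The diagonalization of a circulant matrix by the DFT can be verified numerically. The following sketch (an added illustration, assuming NumPy; the unnormalized DFT matrix convention is an assumption of this example) builds the circulant matrix of (A.15) and checks that conjugating it by the DFT matrix yields a diagonal matrix whose entries are the FFT of $\mathbf{x}$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
N = len(x)

# Circulant matrix: column k is x rotated down by k positions, as in (A.15).
C = np.column_stack([np.roll(x, k) for k in range(N)])

# Unnormalized DFT matrix: F[j, k] = exp(-2*pi*1j*j*k/N).
F = np.fft.fft(np.eye(N))

# F C F^{-1} is diagonal, with the eigenvalues given by the DFT of x.
D = F @ C @ np.linalg.inv(F)
assert np.allclose(D, np.diag(np.fft.fft(x)), atol=1e-9)
```

So the eigenvalues of a circulant matrix are obtained with a single FFT of its first column, with no eigen-decomposition required.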
A.2.5 Some Basic Properties
The following fundamental properties are valid:

$$\begin{aligned} (\mathbf{ABC}\cdots)^{-1} &= \cdots\mathbf{C}^{-1}\mathbf{B}^{-1}\mathbf{A}^{-1} \\ (\mathbf{A}^H)^{-1} &= (\mathbf{A}^{-1})^H \\ (\mathbf{A}+\mathbf{B})^H &= \mathbf{A}^H + \mathbf{B}^H \\ (\mathbf{AB})^H &= \mathbf{B}^H\mathbf{A}^H \\ (\mathbf{ABC}\cdots)^H &= \cdots\mathbf{C}^H\mathbf{B}^H\mathbf{A}^H. \end{aligned} \qquad (A.16)$$
A.3 Inverse, Pseudoinverse, and Determinant of a Matrix
A.3.1 Inverse Matrix
A square matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times N}$ is called invertible or nonsingular if there exists a matrix $\mathbf{B} \in (\mathbb{R},\mathbb{C})^{N\times N}$ such that $\mathbf{BA} = \mathbf{I}$, where $\mathbf{I}_{N\times N}$ is the so-called identity matrix or unit matrix defined as $\mathbf{I} = \mathrm{diag}(1,1,\ldots,1)$. In such a case the matrix $\mathbf{B}$ is uniquely determined by $\mathbf{A}$ and is defined as the inverse of $\mathbf{A}$, indicated as $\mathbf{A}^{-1}$ (so that $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$).

Note that if $\mathbf{A}$ is nonsingular the system of equations

$$\mathbf{Ax} = \mathbf{b} \qquad (A.17)$$

has a unique solution, given by $\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$.
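As a small numerical illustration (added here, assuming NumPy), the unique solution of (A.17) for a nonsingular matrix can be computed either with a solver or with the explicit inverse; the two agree, although in practice a solver is preferred for numerical robustness:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])         # nonsingular: det(A) = 5
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)          # unique solution of A x = b
assert np.allclose(A @ x, b)
assert np.allclose(x, np.linalg.inv(A) @ b)   # same as x = A^{-1} b
```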
A.3.2 Generalized Inverse or Pseudoinverse of a Matrix
The generalized inverse or Moore–Penrose pseudoinverse of a matrix represents a general way to determine the solution of a linear real or complex system of equations of the type (A.17), in the case of $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$, $\mathbf{x} \in (\mathbb{R},\mathbb{C})^{M\times 1}$, $\mathbf{b} \in (\mathbb{R},\mathbb{C})^{N\times 1}$. In general terms, considering a generic matrix $\mathbf{A}_{N\times M}$, we can define its pseudoinverse $\mathbf{A}^{\#}_{M\times N}$ as a matrix such that the following four properties hold:

$$\mathbf{A}\mathbf{A}^{\#}\mathbf{A} = \mathbf{A}, \qquad \mathbf{A}^{\#}\mathbf{A}\mathbf{A}^{\#} = \mathbf{A}^{\#} \qquad (A.18)$$

and

$$\mathbf{A}\mathbf{A}^{\#} = (\mathbf{A}\mathbf{A}^{\#})^H, \qquad \mathbf{A}^{\#}\mathbf{A} = (\mathbf{A}^{\#}\mathbf{A})^H. \qquad (A.19)$$
Given a linear system (A.17), for its solution we can distinguish the following three cases:

$$\mathbf{A}^{\#} = \begin{cases} \mathbf{A}^{-1} & N = M, \text{ square matrix} \\ \mathbf{A}^H(\mathbf{A}\mathbf{A}^H)^{-1} & N < M, \text{ “fat” matrix} \\ (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H & N > M, \text{ “tall” matrix} \end{cases} \qquad (A.20)$$

whereby the solution of the system (A.17) may always be expressed as

$$\mathbf{x} = \mathbf{A}^{\#}\mathbf{b}. \qquad (A.21)$$
The proof of (A.20) for the cases of a square and of a fat matrix is immediate. The case of a tall matrix can be easily demonstrated after the introduction of the SVD decomposition presented below. A different method for calculating the pseudoinverse relies on suitable decompositions of the matrix $\mathbf{A}$.
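The tall and fat cases of (A.20) can be checked against a library pseudoinverse. The following sketch (an added illustration, assuming NumPy and full-rank random matrices) verifies both closed forms:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Tall" matrix, N > M: A# = (A^H A)^{-1} A^H is a left inverse.
A_tall = rng.standard_normal((6, 3))
A_tall_pinv = np.linalg.inv(A_tall.T @ A_tall) @ A_tall.T
assert np.allclose(A_tall_pinv, np.linalg.pinv(A_tall))
assert np.allclose(A_tall_pinv @ A_tall, np.eye(3))

# "Fat" matrix, N < M: A# = A^H (A A^H)^{-1} is a right inverse.
A_fat = rng.standard_normal((3, 6))
A_fat_pinv = A_fat.T @ np.linalg.inv(A_fat @ A_fat.T)
assert np.allclose(A_fat_pinv, np.linalg.pinv(A_fat))
assert np.allclose(A_fat @ A_fat_pinv, np.eye(3))
```

Random Gaussian matrices are full rank with probability one, which is why the explicit formulas apply here; `np.linalg.pinv` itself uses the SVD and also covers the rank-deficient case.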
A.3.3 Determinant
Given a square matrix $\mathbf{A}_{N\times N}$, the determinant, indicated as $\det(\mathbf{A})$ or $\Delta_A$, is a scalar value associated with the matrix itself, which summarizes some of its fundamental properties and is calculated by the following rule.

If $\mathbf{A} = a \in \mathbb{R}^{1\times 1}$, by definition the determinant is $\det(\mathbf{A}) = a$. The determinant of a square matrix $\mathbf{A} \in \mathbb{R}^{N\times N}$ is defined in terms of determinants of order $N-1$ with the following recursive expression:

$$\det(\mathbf{A}) = \sum_{j=1}^{N} a_{ij}(-1)^{i+j}\det(\mathbf{A}_{ij}), \qquad (A.22)$$

where $\mathbf{A}_{ij} \in \mathbb{R}^{(N-1)\times(N-1)}$ is the matrix obtained by eliminating the $i$th row and the $j$th column of $\mathbf{A}$.

Moreover, it should be noted that the value $\det(\mathbf{A}_{ij})$ is called the complementary minor of $a_{ij}$, and the product $(-1)^{i+j}\det(\mathbf{A}_{ij})$ is called the algebraic complement (cofactor) of the element $a_{ij}$.
Property Given the matrices $\mathbf{A}_{N\times N}$ and $\mathbf{B}_{N\times N}$, the following properties are valid:

$$\begin{aligned} \det(\mathbf{A}) &= \prod_i \lambda_i, \quad \lambda_i = \mathrm{eig}(\mathbf{A}) \\ \det(\mathbf{AB}) &= \det(\mathbf{A})\det(\mathbf{B}) \\ \det(\mathbf{A}^H) &= \det(\mathbf{A})^* \\ \det(\mathbf{A}^{-1}) &= 1/\det(\mathbf{A}) \\ \det(c\mathbf{A}) &= c^N\det(\mathbf{A}) \\ \det(\mathbf{I} + \mathbf{a}\mathbf{b}^H) &= 1 + \mathbf{b}^H\mathbf{a} \\ \det(\mathbf{I} + \delta\mathbf{A}) &\cong 1 + \delta\,\mathrm{Tr}(\mathbf{A}) + \tfrac{1}{2}\delta^2\,\mathrm{Tr}(\mathbf{A})^2 - \tfrac{1}{2}\delta^2\,\mathrm{Tr}(\mathbf{A}^2) \quad \text{for small } \delta. \end{aligned} \qquad (A.23)$$
A matrix $\mathbf{A}_{N\times N}$ with $\det(\mathbf{A}) \neq 0$ is called nonsingular and is always invertible. Note that the determinant of a diagonal or triangular matrix is the product of the values on the diagonal.
A.3.4 Matrix Inversion Lemma
Very useful in the development of adaptive algorithms, the matrix inversion lemma (MIL), also known as the Sherman–Morrison–Woodbury formula [1, 2], states that if $\mathbf{A}^{-1}$ and $\mathbf{C}^{-1}$ exist, the following algebraically verifiable equation is true¹:

$$[\mathbf{A} + \mathbf{BCD}]^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\left(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B}\right)^{-1}\mathbf{D}\mathbf{A}^{-1}, \qquad (A.24)$$

where $\mathbf{A} \in \mathbb{C}^{M\times M}$, $\mathbf{B} \in \mathbb{C}^{M\times N}$, $\mathbf{C} \in \mathbb{C}^{N\times N}$, and $\mathbf{D} \in \mathbb{C}^{N\times M}$. Note that (A.24) has numerous variants, the first of which, for simplicity, is that for $\mathbf{D} = \mathbf{B}^H$:

$$[\mathbf{A} + \mathbf{B}\mathbf{C}\mathbf{B}^H]^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\left(\mathbf{C}^{-1} + \mathbf{B}^H\mathbf{A}^{-1}\mathbf{B}\right)^{-1}\mathbf{B}^H\mathbf{A}^{-1} \qquad (A.25)$$

Kailath's variant is defined for $\mathbf{D} = \mathbf{I}$, in which (A.24) takes the form

$$[\mathbf{A} + \mathbf{BC}]^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\left(\mathbf{I} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B}\right)^{-1}\mathbf{C}\mathbf{A}^{-1} \qquad (A.26)$$

A further variant of the previous one arises when the matrices $\mathbf{B}$ and $\mathbf{D}$ are vectors, i.e., for $\mathbf{B} \to \mathbf{b} \in \mathbb{C}^{M\times 1}$, $\mathbf{D} \to \mathbf{d}^H \in \mathbb{C}^{1\times M}$, and $\mathbf{C} = \mathbf{I}$, for which (A.24) becomes

$$[\mathbf{A} + \mathbf{b}\mathbf{d}^H]^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1}\mathbf{b}\mathbf{d}^H\mathbf{A}^{-1}}{1 + \mathbf{d}^H\mathbf{A}^{-1}\mathbf{b}}. \qquad (A.27)$$

A case of particular interest in adaptive filtering is when in the above we have $\mathbf{d} = \mathbf{b}$.

In all the variants the inverse of the sum $\mathbf{A} + \mathbf{BCD}$ is a function of the inverse of the matrix $\mathbf{A}$. It should also be noted that the term that appears in the denominator of (A.27) is a scalar value.
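The lemma (A.24) lends itself to a direct numerical check. The sketch below (an added illustration, assuming NumPy; the diagonal shifts only keep the random matrices well conditioned) compares the left- and right-hand sides:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 5, 2
A = rng.standard_normal((M, M)) + M * np.eye(M)   # well-conditioned M x M
B = rng.standard_normal((M, N))
C = rng.standard_normal((N, N)) + N * np.eye(N)   # well-conditioned N x N
D = rng.standard_normal((N, M))

inv = np.linalg.inv
lhs = inv(A + B @ C @ D)
rhs = inv(A) - inv(A) @ B @ inv(inv(C) + D @ inv(A) @ B) @ D @ inv(A)
assert np.allclose(lhs, rhs)
```

In adaptive filtering this identity is what turns an $O(M^3)$ re-inversion of a rank-one-updated correlation matrix into an $O(M^2)$ update, as in the RLS algorithm.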
¹ The algebraic verification can be done by developing the following expression:

$$[\mathbf{A} + \mathbf{BCD}]\left[\mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{D}\mathbf{A}^{-1}\right] = \mathbf{I} + \mathbf{BCD}\mathbf{A}^{-1} - \mathbf{B}(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{D}\mathbf{A}^{-1} - \mathbf{BCD}\mathbf{A}^{-1}\mathbf{B}(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{D}\mathbf{A}^{-1} = \cdots = \mathbf{I}.$$

A.4 Inner and Outer Product of Vectors

Given two vectors $\mathbf{x} \in (\mathbb{R},\mathbb{C})^{N\times 1}$ and $\mathbf{w} \in (\mathbb{R},\mathbb{C})^{N\times 1}$, we define their inner product (also called scalar product or, sometimes, dot product), indicated as $\langle\mathbf{x},\mathbf{w}\rangle \in (\mathbb{R},\mathbb{C})$, as

$$\langle\mathbf{x},\mathbf{w}\rangle = \mathbf{x}^H\mathbf{w} = \sum_{i=1}^{N} x^*_i w_i. \qquad (A.28)$$
The outer product between two vectors $\mathbf{x} \in (\mathbb{R},\mathbb{C})^{M\times 1}$ and $\mathbf{w} \in (\mathbb{R},\mathbb{C})^{N\times 1}$, denoted as $\rangle\mathbf{x},\mathbf{w}\langle\; \in (\mathbb{R},\mathbb{C})^{M\times N}$, is a matrix defined by the product

$$\rangle\mathbf{x},\mathbf{w}\langle\; = \mathbf{x}\mathbf{w}^H = \begin{bmatrix} x_1w^*_1 & \cdots & x_1w^*_N \\ \vdots & \ddots & \vdots \\ x_Mw^*_1 & \cdots & x_Mw^*_N \end{bmatrix}_{M\times N}. \qquad (A.29)$$
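Both products of (A.28) and (A.29) are available in NumPy; the following sketch (an added illustration; note that `np.vdot` conjugates its first argument, matching $\mathbf{x}^H\mathbf{w}$) shows them on complex vectors:

```python
import numpy as np

x = np.array([1 + 1j, 2.0, 3.0])
w = np.array([2.0, 1j, 1.0])

inner = np.vdot(x, w)             # x^H w: conjugates x, as in (A.28)
assert np.isclose(inner, np.sum(np.conj(x) * w))

outer = np.outer(x, np.conj(w))   # x w^H: an M x N matrix, as in (A.29)
assert outer.shape == (3, 3)
assert np.isclose(outer[0, 1], x[0] * np.conj(w[1]))
```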
Given two matrices $\mathbf{A}_{N\times M}$ and $\mathbf{B}_{P\times M}$, represented by their respective column vectors as

$$\mathbf{A} = [\,\mathbf{a}_{:1}\ \mathbf{a}_{:2}\ \cdots\ \mathbf{a}_{:M}\,]_{N\times M}, \qquad \mathbf{B} = [\,\mathbf{b}_{:1}\ \mathbf{b}_{:2}\ \cdots\ \mathbf{b}_{:M}\,]_{P\times M} \qquad (A.30)$$

with $\mathbf{a}_{:j} = [\,a_{1j}\ a_{2j}\ \cdots\ a_{Nj}\,]^T$ and $\mathbf{b}_{:j} = [\,b_{1j}\ b_{2j}\ \cdots\ b_{Pj}\,]^T$, we define the matrix outer product as

$$\mathbf{A}\mathbf{B}^H \in (\mathbb{R},\mathbb{C})^{N\times P} = \sum_{i=1}^{M}\mathbf{a}_{:i}\mathbf{b}^H_{:i}. \qquad (A.31)$$

Note that the above expression is the sum of the outer products of the column vectors of the respective matrices.
A.4.1 Geometric Interpretation
The inner product of a vector with itself, $\mathbf{x}^H\mathbf{x}$, is often written as

$$\|\mathbf{x}\|^2_2 \triangleq \langle\mathbf{x},\mathbf{x}\rangle = \mathbf{x}^H\mathbf{x} \qquad (A.32)$$

which, as better specified below, corresponds to the square of the vector's length in a Euclidean space. Moreover, in Euclidean geometry, the inner product of vectors expressed in an orthonormal basis is related to their lengths and angle.

Let $\|\mathbf{x}\| \triangleq \sqrt{\|\mathbf{x}\|^2_2}$ be the length of $\mathbf{x}$; if $\mathbf{w}$ is another vector and $\theta$ is the angle between $\mathbf{x}$ and $\mathbf{w}$, we have

$$\mathbf{x}^H\mathbf{w} = \|\mathbf{x}\|\cdot\|\mathbf{w}\|\cos\theta. \qquad (A.33)$$
A.5 Linearly Independent Vectors
Given a set of vectors $\{\mathbf{a}_i\}$, $\mathbf{a}_i \in (\mathbb{R},\mathbb{C})^P$, $\forall i$, $i = 1, \ldots, N$, and a set of scalars $c_1, c_2, \ldots, c_N$, we define the vector $\mathbf{b} \in (\mathbb{R},\mathbb{C})^P$ as a linear combination of the vectors $\{\mathbf{a}_i\}$:

$$\mathbf{b} = \sum_{i=1}^{N}c_i\mathbf{a}_i. \qquad (A.34)$$

The vectors $\{\mathbf{a}_i\}$ are defined as linearly independent if, and only if, (A.34) is zero only in the case that all the scalars $c_i$ are zero. Equivalently, the vectors are called linearly dependent if, for a set of scalars $c_1, c_2, \ldots, c_N$ not all zero,

$$\sum_{i=1}^{N}c_i\mathbf{a}_i = \mathbf{0}. \qquad (A.35)$$

Note that the columns of a matrix $\mathbf{A}$ are linearly independent if, and only if, the matrix $(\mathbf{A}^H\mathbf{A})$ is nonsingular or, as explained in the next section, $\mathbf{A}$ is a full-rank matrix. Similarly, the rows of the matrix $\mathbf{A}$ are linearly independent if, and only if, $(\mathbf{A}\mathbf{A}^H)$ is nonsingular.
A.6 Rank and Subspaces Associated with a Matrix
Given $\mathbf{A}_{N\times M}$, the rank of the matrix, indicated as $r = \mathrm{rank}(\mathbf{A})$, is defined as the maximum number of its linearly independent columns. Note that $\mathrm{rank}(\mathbf{A}) = \mathrm{rank}(\mathbf{A}^H)$; it follows that a matrix is called a reduced-rank matrix when $\mathrm{rank}(\mathbf{A}) < \min(N,M)$ and a full-rank matrix when $\mathrm{rank}(\mathbf{A}) = \min(N,M)$. It also holds that

$$\mathrm{rank}(\mathbf{A}) = \mathrm{rank}(\mathbf{A}^H) = \mathrm{rank}(\mathbf{A}^H\mathbf{A}) = \mathrm{rank}(\mathbf{A}\mathbf{A}^H). \qquad (A.36)$$
A.6.1 Range or Column Space of a Matrix
We define the column space of a matrix $\mathbf{A}_{N\times M}$ (also called range or image), indicated as $R(\mathbf{A})$ or $\mathrm{Im}(\mathbf{A})$, as the subspace obtained from the set of all possible linear combinations of its linearly independent column vectors. So, calling $\mathbf{A} = [\,\mathbf{a}_1\ \cdots\ \mathbf{a}_M\,]$ the column partition of the matrix, $R(\mathbf{A})$ represents the linear span² (also called the linear hull) of the set of column vectors in a vector space:

$$R(\mathbf{A}) \triangleq \mathrm{span}\{\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_M\} = \left\{\mathbf{y}\in\mathbb{R}^N : \mathbf{y} = \mathbf{Ax}, \text{ for some } \mathbf{x}\in\mathbb{R}^M\right\}. \qquad (A.37)$$

Moreover, calling $\mathbf{A} = [\,\mathbf{b}_1\ \cdots\ \mathbf{b}_N\,]^H$ the row partition of the matrix, the dual definition is

$$R(\mathbf{A}^H) \triangleq \mathrm{span}\{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_N\} = \left\{\mathbf{x}\in\mathbb{R}^M : \mathbf{x} = \mathbf{A}^H\mathbf{y}, \text{ for some } \mathbf{y}\in\mathbb{R}^N\right\}. \qquad (A.38)$$

It follows from the previous definition that the rank of $\mathbf{A}$ is equal to the dimension of its column space:

$$\mathrm{rank}(\mathbf{A}) = \dim(R(\mathbf{A})). \qquad (A.39)$$

² The term $\mathrm{span}(\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_n)$ denotes the set of all vectors, i.e., the space, that can be represented as linear combinations of $\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_n$.
A.6.2 Kernel or Nullspace of a Matrix
The kernel or nullspace of a matrix $\mathbf{A}_{N\times M}$, indicated as $N(\mathbf{A})$ or $\mathrm{Ker}(\mathbf{A})$, is the set of all vectors $\mathbf{x}$ for which $\mathbf{Ax} = \mathbf{0}$. More formally,

$$N(\mathbf{A}) \triangleq \left\{\mathbf{x}\in(\mathbb{R},\mathbb{C})^M : \mathbf{Ax} = \mathbf{0}\right\}. \qquad (A.40)$$

Similarly, the dual definition of the left nullspace is

$$N(\mathbf{A}^H) \triangleq \left\{\mathbf{y}\in(\mathbb{R},\mathbb{C})^N : \mathbf{A}^H\mathbf{y} = \mathbf{0}\right\}. \qquad (A.41)$$

The dimension of the kernel is called the nullity of the matrix:

$$\mathrm{null}(\mathbf{A}) = \dim(N(\mathbf{A})). \qquad (A.42)$$

In fact, the expression $\mathbf{Ax} = \mathbf{0}$ is a homogeneous system of linear equations, and the kernel is the span of the solutions of that system. Whereby, calling $\mathbf{A} = [\,\mathbf{a}_1\ \cdots\ \mathbf{a}_N\,]^H$ the row partition of $\mathbf{A}$, the product $\mathbf{Ax} = \mathbf{0}$ can be expressed as

$$\mathbf{Ax} = \mathbf{0} \iff \begin{bmatrix} \mathbf{a}^H_1\mathbf{x} \\ \mathbf{a}^H_2\mathbf{x} \\ \vdots \\ \mathbf{a}^H_N\mathbf{x} \end{bmatrix} = \mathbf{0} \qquad (A.43)$$

It follows that $\mathbf{x} \in N(\mathbf{A})$ if, and only if, $\mathbf{x}$ is orthogonal to the space described by the row vectors of $\mathbf{A}$, or

$$\mathbf{x} \perp \mathrm{span}\{\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_N\}$$

Namely, a vector $\mathbf{x}$ lies in the nullspace of $\mathbf{A}$ iff it is perpendicular to every vector in the row space of $\mathbf{A}$. In other words, the row space of the matrix $\mathbf{A}$ is orthogonal to its nullspace: $R(\mathbf{A}^H) \perp N(\mathbf{A})$.
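A nullspace basis can be extracted numerically from the SVD (introduced formally later in this appendix). The sketch below (an added illustration, assuming NumPy) builds a rank-deficient matrix, recovers its nullspace, and checks the orthogonality just stated:

```python
import numpy as np

# Rank-deficient matrix: the third row is the sum of the first two.
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [1.0, 3.0, 4.0]])

U, s, Vh = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))          # numerical rank
null_basis = Vh[r:].T               # remaining right singular vectors span N(A)

assert r == 2
assert np.allclose(A @ null_basis, 0, atol=1e-10)    # basis vectors are in N(A)
assert r + null_basis.shape[1] == A.shape[1]         # rank plus nullity equals M
```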
A.6.3 Rank–Nullity Theorem
For any matrix $\mathbf{A}_{N\times M}$,

$$\dim(N(\mathbf{A})) + \dim(R(\mathbf{A})) = \mathrm{null}(\mathbf{A}) + \mathrm{rank}(\mathbf{A}) = M. \qquad (A.44)$$

The above equation is known as the rank–nullity theorem.
A.6.4 The Four Fundamental Matrix Subspaces

When the matrix $\mathbf{A}_{N\times M}$ is full rank, i.e., $r = \mathrm{rank}(\mathbf{A}) = \min(N,M)$, the matrix always admits a left inverse $\mathbf{B}$ or a right inverse $\mathbf{C}$ or, in the case of $N = M$, admits the inverse $\mathbf{A}^{-1}$.

As a corollary, it is appropriate to recall the fundamental subspaces definable for a matrix $\mathbf{A}_{N\times M}$:

1. Column space of $\mathbf{A}$: indicated as $R(\mathbf{A})$, it is defined by the span of the columns of $\mathbf{A}$.
2. Nullspace of $\mathbf{A}$: indicated as $N(\mathbf{A})$, it contains all vectors $\mathbf{x}$ such that $\mathbf{Ax} = \mathbf{0}$.
3. Row space of $\mathbf{A}$: equivalent to the column space of $\mathbf{A}^H$, indicated as $R(\mathbf{A}^H)$, it is defined by the span of the rows of $\mathbf{A}$.
4. Left nullspace of $\mathbf{A}$: equivalent to the nullspace of $\mathbf{A}^H$, indicated as $N(\mathbf{A}^H)$, it contains all vectors $\mathbf{x}$ such that $\mathbf{A}^H\mathbf{x} = \mathbf{0}$.

Indicating with $R^\perp(\mathbf{A})$ and $N^\perp(\mathbf{A})$ the orthogonal complements, respectively, of $R(\mathbf{A})$ and $N(\mathbf{A})$, the following relations are valid (Fig. A.1):

$$R(\mathbf{A}) = N^\perp(\mathbf{A}^H), \qquad N(\mathbf{A}) = R^\perp(\mathbf{A}^H) \qquad (A.45)$$

and the dual

$$R^\perp(\mathbf{A}) = N(\mathbf{A}^H), \qquad N^\perp(\mathbf{A}) = R(\mathbf{A}^H). \qquad (A.46)$$
A.6.5 Projection Matrix
A square matrix $\mathbf{P} \in (\mathbb{R},\mathbb{C})^{N\times N}$ is defined as a projection operator iff $\mathbf{P}^2 = \mathbf{P}$, i.e., it is idempotent. If $\mathbf{P}$ is symmetric, then the projection is orthogonal. Furthermore, if $\mathbf{P}$ is a projection matrix, so is $(\mathbf{I} - \mathbf{P})$.

Examples of orthogonal projection matrices are the matrices associated with the pseudoinverse $\mathbf{A}^{\#}$ in the over- and under-determined cases.

In the overdetermined case, $N > M$ and $\mathbf{A}^{\#} = (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H$, we have

$$\mathbf{P} = \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H, \quad \text{projection operator} \qquad (A.47)$$

$$\mathbf{P}^\perp = \mathbf{I} - \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H, \quad \text{orthogonal-complement projection operator} \qquad (A.48)$$

such that $\mathbf{P} + \mathbf{P}^\perp = \mathbf{I}$, i.e., $\mathbf{P}$ projects a vector onto the subspace $\Psi = R(\mathbf{A})$, while $\mathbf{P}^\perp$ projects onto its orthogonal complement $\Psi^\perp = R^\perp(\mathbf{A}) = N(\mathbf{A}^H)$. Indeed, calling $\mathbf{x} \in (\mathbb{R},\mathbb{C})^{M\times 1}$ and $\mathbf{y} \in (\mathbb{R},\mathbb{C})^{N\times 1}$, such that $\mathbf{Ax} = \mathbf{y}$, we have that $\mathbf{Py} = \mathbf{u}$ and $\mathbf{P}^\perp\mathbf{y} = \mathbf{v}$, with $\mathbf{u} \in R(\mathbf{A})$ and $\mathbf{v} \in N(\mathbf{A}^H)$ (see Fig. A.2).

In the underdetermined case, where $N < M$ and $\mathbf{A}^{\#} = \mathbf{A}^H(\mathbf{A}\mathbf{A}^H)^{-1}$, we have

$$\mathbf{P} = \mathbf{A}^H(\mathbf{A}\mathbf{A}^H)^{-1}\mathbf{A} \qquad (A.49)$$

$$\mathbf{P}^\perp = \mathbf{I} - \mathbf{A}^H(\mathbf{A}\mathbf{A}^H)^{-1}\mathbf{A}. \qquad (A.50)$$
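The defining properties of the overdetermined-case projectors (A.47)–(A.48) can be sketched numerically as follows (an added illustration, assuming NumPy and a real, full-rank $\mathbf{A}$, so that $\mathbf{A}^H = \mathbf{A}^T$):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 2))             # tall full-rank matrix, N > M

P = A @ np.linalg.inv(A.T @ A) @ A.T        # projector onto R(A), as in (A.47)
P_perp = np.eye(6) - P                      # orthogonal-complement projector

assert np.allclose(P @ P, P)                # idempotent: P^2 = P
assert np.allclose(P + P_perp, np.eye(6))   # P + P_perp = I

y = rng.standard_normal(6)
u, v = P @ y, P_perp @ y                    # u in R(A), v in N(A^H)
assert np.isclose(u @ v, 0)                 # the two components are orthogonal
```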
A.7 Orthogonality and Unitary Matrices
In DSP, the conditions of orthogonality, orthonormality, and bi-orthogonality represent tools of primary importance. Here are some basic definitions.
Fig. A.1 The four subspaces associated with the matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$. These subspaces determine an orthogonal decomposition of the space $(\mathbb{R},\mathbb{C})^M$ into the row space $R(\mathbf{A}^H)$ and the nullspace $N(\mathbf{A})$ and, similarly, an orthogonal decomposition of $(\mathbb{R},\mathbb{C})^N$ into the column space $R(\mathbf{A})$ and the left nullspace $N(\mathbf{A}^H)$.
A.7.1 Orthogonality and Unitary Matrices
Two vectors $\mathbf{x}, \mathbf{w} \in (\mathbb{R},\mathbb{C})^N$ are orthogonal if their inner product is zero: $\langle\mathbf{x},\mathbf{w}\rangle = 0$. This situation is sometimes denoted as $\mathbf{x} \perp \mathbf{w}$.

A set of vectors $\{\mathbf{q}_i\}$, $\mathbf{q}_i \in (\mathbb{R},\mathbb{C})^N$, $\forall i$, $i = 1, \ldots, N$, is called orthogonal if

$$\mathbf{q}^H_i\mathbf{q}_j = 0 \quad \text{for } i \neq j \qquad (A.51)$$

A set of vectors $\{\mathbf{q}_i\}$ is called orthonormal if

$$\mathbf{q}^H_i\mathbf{q}_j = \delta_{ij} = \delta[i-j], \qquad (A.52)$$

where $\delta_{ij}$ is the Kronecker symbol, defined as $\delta_{ij} = 1$ for $i = j$ and $\delta_{ij} = 0$ for $i \neq j$. A matrix $\mathbf{Q}_{N\times N}$ is orthonormal if its columns form an orthonormal set of vectors. Formally,

$$\mathbf{Q}^H\mathbf{Q} = \mathbf{Q}\mathbf{Q}^H = \mathbf{I}. \qquad (A.53)$$

Note that in the case of orthonormality $\mathbf{Q}^{-1} = \mathbf{Q}^H$. Moreover, a matrix for which $\mathbf{Q}^H\mathbf{Q} = \mathbf{Q}\mathbf{Q}^H$ is defined as a normal matrix.

An important property of an orthonormal matrix is that it has no effect on the inner product:

$$\langle\mathbf{Qx},\mathbf{Qy}\rangle = (\mathbf{Qx})^H\mathbf{Qy} = \mathbf{x}^H\mathbf{Q}^H\mathbf{Qy} = \langle\mathbf{x},\mathbf{y}\rangle. \qquad (A.54)$$

Furthermore, when multiplied by a vector it does not change its length:

$$\|\mathbf{Qx}\|^2_2 = (\mathbf{Qx})^H\mathbf{Qx} = \mathbf{x}^H\mathbf{Q}^H\mathbf{Qx} = \|\mathbf{x}\|^2_2. \qquad (A.55)$$
Fig. A.2 Representation of the orthogonal projection operators $\mathbf{P} = \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H$ and $\mathbf{P}^\perp = \mathbf{I} - \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H$: the vector $\mathbf{y}$ is decomposed into $\mathbf{u} = \mathbf{Py} \in \Psi = R(\mathbf{A})$ and $\mathbf{v} = \mathbf{P}^\perp\mathbf{y} \in \Sigma = R^\perp(\mathbf{A}) = N(\mathbf{A}^H)$.
A.7.2 Bi-Orthogonality and Bi-Orthogonal Bases
Given two matrices $\mathbf{Q}$ and $\mathbf{P}$, not necessarily square, these are called bi-orthogonal if

$$\mathbf{Q}^H\mathbf{P} = \mathbf{P}^H\mathbf{Q} = \mathbf{I}. \qquad (A.56)$$

Moreover, note that in the case of bi-orthonormality $\mathbf{Q}^H = \mathbf{P}^{-1}$ and $\mathbf{P}^H = \mathbf{Q}^{-1}$.

The pair of vector sets $\{\mathbf{q}_i,\mathbf{p}_j\}$ represents a bi-orthogonal basis if, and only if, both of the following propositions are valid:

1. For each $i, j \in \mathbb{Z}$,

$$\langle\mathbf{q}_i,\mathbf{p}_j\rangle = \delta[i-j] \qquad (A.57)$$

2. There exist $A, B, \tilde{A}, \tilde{B} \in \mathbb{R}^+$ such that, $\forall\,\mathbf{x} \in E$, the following inequalities are valid:

$$A\|\mathbf{x}\|^2 \leq \sum_k |\langle\mathbf{q}_k,\mathbf{x}\rangle|^2 \leq B\|\mathbf{x}\|^2 \qquad (A.58)$$

$$\tilde{A}\|\mathbf{x}\|^2 \leq \sum_k |\langle\mathbf{p}_k,\mathbf{x}\rangle|^2 \leq \tilde{B}\|\mathbf{x}\|^2. \qquad (A.59)$$

The pairs of vectors that satisfy condition (1.) and the inequalities (2.) are called Riesz bases [2], for which the following expansion formulas apply:

$$\mathbf{x} = \sum_k \langle\mathbf{q}_k,\mathbf{x}\rangle\mathbf{p}_k = \sum_k \langle\mathbf{p}_k,\mathbf{x}\rangle\mathbf{q}_k. \qquad (A.60)$$

Comparing the previous inequalities with (A.52), we observe that the term bi-orthogonal is used when the basis $\{\mathbf{q}_i\}$ is non-orthogonal and is associated with a dual basis $\{\mathbf{p}_j\}$ that satisfies the condition (A.57). If $\{\mathbf{p}_i\}$ were orthogonal, the expansion (A.60) would be the usual orthogonal expansion.
A.7.3 Paraunitary Matrix
A matrix $\mathbf{Q} \in (\mathbb{R},\mathbb{C})^{N\times M}$ is called a paraunitary matrix if

$$\mathbf{Q}^{\#} = \mathbf{Q}^H \qquad (A.61)$$

In the case of a square matrix, then,

$$\mathbf{Q}^H\mathbf{Q} = c\mathbf{I} \qquad (A.62)$$
A.8 Eigenvalues and Eigenvectors
The eigenvalues of a square matrix $\mathbf{A}_{N\times N}$ are the solutions of the characteristic polynomial $p(\lambda)$, of order $N$, defined as

$$p(\lambda) \triangleq \det(\mathbf{A} - \lambda\mathbf{I}) = 0 \qquad (A.63)$$

so the eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_N\}$ of the matrix $\mathbf{A}$, denoted as $\lambda(\mathbf{A})$ or $\mathrm{eig}(\mathbf{A})$, are the roots of the characteristic polynomial $p(\lambda)$.

With each eigenvalue $\lambda$ is associated an eigenvector $\mathbf{q}$ defined by the equation

$$(\mathbf{A} - \lambda\mathbf{I})\mathbf{q} = \mathbf{0} \quad \text{or} \quad \mathbf{Aq} = \lambda\mathbf{q}. \qquad (A.64)$$

Consider a simple example with a real matrix $\mathbf{A}_{2\times 2}$ defined as

$$\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}. \qquad (A.65)$$

By (A.63) the characteristic polynomial is

$$\det(\mathbf{A} - \lambda\mathbf{I}) = \det\begin{bmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{bmatrix} = \lambda^2 - 4\lambda + 3 = 0 \qquad (A.66)$$

with two distinct real roots, $\lambda_1 = 1$ and $\lambda_2 = 3$, so that $\lambda_i(\mathbf{A}) = (1,3)$.
The eigenvector related to $\lambda_1 = 1$ is

$$\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} q_1 \\ q_2 \end{bmatrix} = \begin{bmatrix} q_1 \\ q_2 \end{bmatrix} \;\Rightarrow\; \mathbf{q}_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \qquad (A.67)$$

while the eigenvector related to $\lambda_2 = 3$ is

$$\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} q_1 \\ q_2 \end{bmatrix} = 3\begin{bmatrix} q_1 \\ q_2 \end{bmatrix} \;\Rightarrow\; \mathbf{q}_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \qquad (A.68)$$

The eigenvectors of a matrix $\mathbf{A}_{N\times N}$ are sometimes referred to as $\mathrm{eigenvect}(\mathbf{A})$.
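The worked example of (A.65)–(A.68) can be reproduced numerically (an added sketch, assuming NumPy; the eigenvectors returned by the library are normalized to unit length, so they are scaled versions of $[1,-1]^T$ and $[1,1]^T$):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # the matrix of (A.65)

eigvals, Q = np.linalg.eig(A)
assert np.allclose(np.sort(eigvals), [1.0, 3.0])   # roots of lambda^2 - 4 lambda + 3

# Each column q of Q satisfies the eigenvalue equation A q = lambda q of (A.64):
for lam, q in zip(eigvals, Q.T):
    assert np.allclose(A @ q, lam * q)
```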
A.8.1 Trace of Matrix
The trace of a matrix $\mathbf{A}_{N\times N}$ is defined as the sum of the elements on its main diagonal and, equivalently, is equal to the sum of its (complex) eigenvalues:

$$\mathrm{tr}[\mathbf{A}] = \sum_{i=1}^{N}a_{ii} = \sum_{i=1}^{N}\lambda_i. \qquad (A.69)$$

Moreover, we have that

$$\begin{aligned} \mathrm{tr}[\mathbf{A}+\mathbf{B}] &= \mathrm{tr}[\mathbf{A}] + \mathrm{tr}[\mathbf{B}] \\ \mathrm{tr}[\mathbf{A}] &= \mathrm{tr}[\mathbf{A}^H] \\ \mathrm{tr}[c\mathbf{A}] &= c\cdot\mathrm{tr}[\mathbf{A}] \\ \mathrm{tr}[\mathbf{ABC}] &= \mathrm{tr}[\mathbf{BCA}] = \mathrm{tr}[\mathbf{CAB}] \\ \mathbf{a}^H\mathbf{a} &= \mathrm{tr}[\mathbf{a}\mathbf{a}^H]. \end{aligned} \qquad (A.70)$$

Matrices admit the Frobenius inner product, which is analogous to the vector inner product. It is defined as the sum of the products of the corresponding components of two matrices $\mathbf{A}$ and $\mathbf{B}$ having the same size:

$$\langle\mathbf{A},\mathbf{B}\rangle = \sum_i\sum_j a_{ij}b_{ij} = \mathrm{tr}(\mathbf{A}^H\mathbf{B}) = \mathrm{tr}(\mathbf{A}\mathbf{B}^H).$$
A.9 Matrix Diagonalization
A matrix $\mathbf{A}_{N\times N}$ is called diagonalizable if there is an invertible matrix $\mathbf{Q}$ such that there exists a decomposition

$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1} \qquad (A.71)$$

or, equivalently,

$$\mathbf{\Lambda} = \mathbf{Q}^{-1}\mathbf{A}\mathbf{Q}. \qquad (A.72)$$

This is possible if, and only if, the matrix $\mathbf{A}$ has $N$ linearly independent eigenvectors; the matrix $\mathbf{Q}$, partitioned into column vectors as $\mathbf{Q} = [\,\mathbf{q}_1\ \mathbf{q}_2\ \cdots\ \mathbf{q}_N\,]$, is built with the independent eigenvectors of $\mathbf{A}$. In this case, $\mathbf{\Lambda}$ is the diagonal matrix built with the eigenvalues of $\mathbf{A}$, i.e., $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1,\lambda_2,\ldots,\lambda_N)$.
A.9.1 Diagonalization of a Normal Matrix
The matrix $\mathbf{A}_{N\times N}$ is said to be a normal matrix if $\mathbf{A}^H\mathbf{A} = \mathbf{A}\mathbf{A}^H$. A matrix $\mathbf{A}$ is normal iff it can be factorized as

$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^H \qquad (A.73)$$

where $\mathbf{Q}^H\mathbf{Q} = \mathbf{Q}\mathbf{Q}^H = \mathbf{I}$, $\mathbf{Q} = [\,\mathbf{q}_1\ \mathbf{q}_2\ \cdots\ \mathbf{q}_N\,]$, $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1,\lambda_2,\ldots,\lambda_N)$, and $\mathbf{\Lambda} = \mathbf{Q}^H\mathbf{A}\mathbf{Q}$.

The set of all eigenvalues of $\mathbf{A}$ is defined as the spectrum of the matrix. The radius of the spectrum, or spectral radius, is defined as the eigenvalue of maximum modulus:

$$\rho(\mathbf{A}) = \max_i |\mathrm{eig}(\mathbf{A})|. \qquad (A.74)$$
Property If the matrix $\mathbf{A}_{N\times N}$ is nonsingular, then all its eigenvalues are nonzero and the eigenvalues of the inverse matrix $\mathbf{A}^{-1}$ are the reciprocals of $\mathrm{eig}(\mathbf{A})$.

Property If the matrix $\mathbf{A}_{N\times N}$ is symmetric and positive semi-definite, then all its eigenvalues are real and nonnegative. So we have that:

1. The eigenvalues $\lambda_i$ of $\mathbf{A}$ are real and nonnegative:

$$\mathbf{q}^H_i\mathbf{A}\mathbf{q}_i = \lambda_i\mathbf{q}^H_i\mathbf{q}_i \;\Rightarrow\; \lambda_i = \frac{\mathbf{q}^H_i\mathbf{A}\mathbf{q}_i}{\mathbf{q}^H_i\mathbf{q}_i} \quad (\text{Rayleigh quotient}) \qquad (A.75)$$

2. The eigenvectors of $\mathbf{A}$ are orthogonal for distinct $\lambda_i$:

$$\mathbf{q}^H_i\mathbf{q}_j = 0, \quad \text{for } i \neq j \qquad (A.76)$$

3. The matrix $\mathbf{A}$ can be diagonalized as

$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^H \qquad (A.77)$$

where $\mathbf{Q} = [\,\mathbf{q}_1\ \mathbf{q}_2\ \cdots\ \mathbf{q}_N\,]$, $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1,\lambda_2,\ldots,\lambda_N)$, and $\mathbf{Q}$ is a unitary matrix, i.e., $\mathbf{Q}^H\mathbf{Q} = \mathbf{I}$.

4. An alternative representation of $\mathbf{A}$ is then

$$\mathbf{A} = \sum_{i=1}^{N}\lambda_i\mathbf{q}_i\mathbf{q}^H_i = \sum_{i=1}^{N}\lambda_i\mathbf{P}_i \qquad (A.78)$$

where the term $\mathbf{P}_i = \mathbf{q}_i\mathbf{q}^H_i$ is defined as a spectral projection.
A.10 Norms of Vectors and Matrices
A.10.1 Norm of Vectors
Given a vector $\mathbf{x} \in (\mathbb{R},\mathbb{C})^N$, its norm expresses its length relative to a vector space. In the case of a space of order $p$, called an $L_p$ space, the norm is indicated as $\|\mathbf{x}\|_{L_p}$ or $\|\mathbf{x}\|_p$ and is defined as

$$\|\mathbf{x}\|_p \triangleq \left[\sum_{i=1}^{N}|x_i|^p\right]^{1/p}, \quad \text{for } p \geq 1. \qquad (A.79)$$
$L_0$ norm The expression (A.79) remains meaningful for $0 < p < 1$; however, the result is not exactly a norm. For $p = 0$, (A.79) becomes

$$\|\mathbf{x}\|_0 \triangleq \lim_{p\to 0}\|\mathbf{x}\|^p_p = \sum_{i=1}^{N}|x_i|^0. \qquad (A.80)$$

Note that (A.80) is equal to the number of nonzero entries of the vector $\mathbf{x}$.
$L_1$ norm

$$\|\mathbf{x}\|_1 \triangleq \sum_{i=1}^{N}|x_i|, \quad L_1 \text{ norm}. \qquad (A.81)$$

The previous expression represents the sum of the absolute values of the elements of the vector $\mathbf{x}$.
$L_\infty$ norm For $p \to \infty$, (A.79) becomes

$$\|\mathbf{x}\|_\infty \triangleq \max_{i=1,\ldots,N}|x_i| \qquad (A.82)$$

called the uniform norm or maximum norm and denoted as $L_{\mathrm{inf}}$.
Euclidean or $L_2$ norm The Euclidean norm is defined for $p = 2$ and expresses the standard length of the vector:

$$\|\mathbf{x}\|_2 \triangleq \sqrt{\sum_{i=1}^{N}|x_i|^2} = \sqrt{\mathbf{x}^H\mathbf{x}}, \quad \text{Euclidean or } L_2 \text{ norm} \qquad (A.83)$$

$$\|\mathbf{x}\|^2_2 \triangleq \mathbf{x}^H\mathbf{x}, \quad \text{squared Euclidean norm} \qquad (A.84)$$

$$\|\mathbf{x}\|^2_{\mathbf{G}} \triangleq \left|\mathbf{x}^H\mathbf{G}\mathbf{x}\right|, \quad \text{squared weighted Euclidean norm,} \qquad (A.85)$$

where $\mathbf{G}$ is a diagonal weighting matrix.
Frobenius norm Coinciding, for vectors, with the $L_2$ norm, it is defined as

$$\|\mathbf{x}\|_F \triangleq \sqrt{\sum_{i=1}^{N}|x_i|^2}, \quad \text{Frobenius norm} \qquad (A.86)$$
Property Every norm satisfies the following properties:

1. $\|\mathbf{x}\| \geq 0$, where the equality holds only for $\mathbf{x} = \mathbf{0}$;
2. $\|\alpha\mathbf{x}\| = |\alpha|\,\|\mathbf{x}\|$, $\forall\alpha$;
3. $\|\mathbf{x}+\mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|$ (triangle inequality).

The distance between two vectors $\mathbf{x}$ and $\mathbf{y}$ is defined as

$$\|\mathbf{x}-\mathbf{y}\|_p \triangleq \left[\sum_{i=1}^{N}|x_i-y_i|^p\right]^{1/p}, \quad \text{for } p > 0. \qquad (A.87)$$

It is called distance or similarity measure in the Minkowski metric [1].
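The vector norms defined above map directly onto library calls; the following sketch (an added illustration, assuming NumPy) evaluates them on a small example:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

assert np.linalg.norm(x, 1) == 7.0          # L1: sum of absolute values
assert np.linalg.norm(x, 2) == 5.0          # L2: Euclidean length sqrt(9 + 16)
assert np.linalg.norm(x, np.inf) == 4.0     # Linf: largest absolute value
assert np.count_nonzero(x) == 2             # "L0": number of nonzero entries
```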
A.10.2 Norm of Matrices
The norms of a matrix, similar to the vector norms, may be defined in the following way. Given a matrix $\mathbf{A}_{N\times M}$:

$L_1$ norm

$$\|\mathbf{A}\|_1 \triangleq \max_j \sum_{i=1}^{N}|a_{ij}|, \quad L_1 \text{ norm} \qquad (A.88)$$

represents the column of $\mathbf{A}$ with the largest sum of absolute values.

Euclidean or $L_2$ norm The Euclidean norm is defined for $p = 2$ and equals the largest singular value of the matrix:

$$\|\mathbf{A}\|_2 \triangleq \sqrt{\lambda_{\max}}, \quad \lambda_{\max} = \max_{\lambda_i}\,\mathrm{eig}(\mathbf{A}^H\mathbf{A}) = \max_{\lambda_i}\,\mathrm{eig}(\mathbf{A}\mathbf{A}^H) \qquad (A.89)$$

$L_\infty$ norm

$$\|\mathbf{A}\|_\infty \triangleq \max_i \sum_{j=1}^{M}|a_{ij}|, \quad L_{\mathrm{inf}} \text{ norm} \qquad (A.90)$$

represents the row with the largest sum of absolute values.

Frobenius norm

$$\|\mathbf{A}\|_F \triangleq \sqrt{\sum_{i=1}^{N}\sum_{j=1}^{M}|a_{ij}|^2}, \quad \text{Frobenius norm} \qquad (A.91)$$
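The matrix norms (A.88)–(A.91) can likewise be checked against library implementations (an added sketch, assuming NumPy):

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

assert np.linalg.norm(A, 1) == 6.0             # max column sum: |-2| + |4| = 6
assert np.linalg.norm(A, np.inf) == 7.0        # max row sum: |3| + |4| = 7
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A, 2), s[0])  # spectral norm = largest sigma
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(1 + 4 + 9 + 16))
```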
A.11 Singular Value Decomposition Theorem
Given a matrix $\mathbf{X} \in (\mathbb{R},\mathbb{C})^{N\times M}$ with $K = \min(N,M)$ and rank $r \leq K$, there exist two orthonormal matrices $\mathbf{U} \in (\mathbb{R},\mathbb{C})^{N\times N}$ and $\mathbf{V} \in (\mathbb{R},\mathbb{C})^{M\times M}$, containing as columns, respectively, the eigenvectors of $\mathbf{X}\mathbf{X}^H$ and the eigenvectors of $\mathbf{X}^H\mathbf{X}$, namely,

$$\mathbf{U}_{N\times N} = \mathrm{eigenvect}(\mathbf{X}\mathbf{X}^H) = [\,\mathbf{u}_0\ \mathbf{u}_1\ \cdots\ \mathbf{u}_{N-1}\,] \qquad (A.92)$$

$$\mathbf{V}_{M\times M} = \mathrm{eigenvect}(\mathbf{X}^H\mathbf{X}) = [\,\mathbf{v}_0\ \mathbf{v}_1\ \cdots\ \mathbf{v}_{M-1}\,] \qquad (A.93)$$

such that the following equality is valid:

$$\mathbf{U}^H\mathbf{X}\mathbf{V} = \mathbf{\Sigma}, \qquad (A.94)$$

or, equivalently,

$$\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^H \qquad (A.95)$$

or

$$\mathbf{X}^H = \mathbf{V}\mathbf{\Sigma}^T\mathbf{U}^H. \qquad (A.96)$$

The expressions (A.94)–(A.96) represent the SVD decomposition of the matrix $\mathbf{X}$, shown graphically in Fig. A.3.
The matrix $\mathbf{\Sigma} \in \mathbb{R}^{N\times M}$ is characterized by the following structure:

$$K = \min(M,N): \quad \mathbf{\Sigma} = \begin{bmatrix} \mathbf{\Sigma}_K & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}; \qquad K = N = M: \quad \mathbf{\Sigma} = \mathbf{\Sigma}_K, \qquad (A.97)$$

where $\mathbf{\Sigma}_K \in \mathbb{R}^{K\times K}$ is a diagonal matrix containing the positive square roots of the eigenvalues of the matrix $\mathbf{X}^H\mathbf{X}$ (or $\mathbf{X}\mathbf{X}^H$), defined as singular values.³ In formal terms,

$$\mathbf{\Sigma}_K = \mathrm{diag}(\sigma_0,\sigma_1,\ldots,\sigma_{K-1}) \triangleq \sqrt{\mathrm{diag}\left(\mathrm{eig}(\mathbf{X}^H\mathbf{X})\right)} = \sqrt{\mathrm{diag}\left(\mathrm{eig}(\mathbf{X}\mathbf{X}^H)\right)}, \qquad (A.98)$$

where

$$\sigma_0 \geq \sigma_1 \geq \cdots \geq \sigma_{K-1} > 0 \quad \text{and} \quad \sigma_K = \cdots = \sigma_{N-1} = 0. \qquad (A.99)$$
Note that the singular values $\sigma_i$ of $\mathbf{X}$ are in descending order. Moreover, the column vectors $\mathbf{u}_i$ and $\mathbf{v}_i$ are defined, respectively, as the left singular vectors and right singular vectors of $\mathbf{X}$. Since $\mathbf{U}$ and $\mathbf{V}$ are orthogonal, it is easy to see that the matrix $\mathbf{X}$ can be written as the sum of rank-one products

$$\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^H = \sum_{i=0}^{K-1}\sigma_i\mathbf{u}_i\mathbf{v}^H_i. \qquad (A.100)$$

³ Remember that the nonzero eigenvalues of the matrices $\mathbf{X}^H\mathbf{X}$ and $\mathbf{X}\mathbf{X}^H$ are identical.
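The theorem and the rank-one expansion (A.100) can be verified numerically (an added sketch, assuming NumPy and a real data matrix, so $\mathbf{X}^H = \mathbf{X}^T$):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 3))

U, s, Vh = np.linalg.svd(X)          # X = U Sigma V^H, as in (A.95)
assert np.all(s[:-1] >= s[1:])       # singular values come in descending order

# Rebuild X from the rank-one expansion (A.100):
X_rebuilt = sum(s[i] * np.outer(U[:, i], Vh[i]) for i in range(len(s)))
assert np.allclose(X, X_rebuilt)

# Singular values are the square roots of the eigenvalues of X^H X, as in (A.98):
ev = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
assert np.allclose(s**2, ev)
```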
A.11.1 Subspaces of Matrix X and SVD
The SVD reveals important properties of the matrix $\mathbf{X}$. In fact, with $r = \mathrm{rank}(\mathbf{X})$, the first $r$ columns of $\mathbf{U}$ form an orthonormal basis of the column space $R(\mathbf{X})$, while the last $M - r$ columns of $\mathbf{V}$ form an orthonormal basis for the nullspace (or kernel) $N(\mathbf{X})$ of $\mathbf{X}$, i.e.,

$$r = \mathrm{rank}(\mathbf{X}), \qquad R(\mathbf{X}) = \mathrm{span}\{\mathbf{u}_0,\mathbf{u}_1,\ldots,\mathbf{u}_{r-1}\}, \qquad N(\mathbf{X}) = \mathrm{span}\{\mathbf{v}_r,\mathbf{v}_{r+1},\ldots,\mathbf{v}_{M-1}\}. \qquad (A.101)$$

In the case that $r < K$, moreover, (A.99) becomes

$$\sigma_0 \geq \sigma_1 \geq \cdots \geq \sigma_{r-1} > 0 \quad \text{and} \quad \sigma_r = \cdots = \sigma_{K-1} = 0. \qquad (A.102)$$

It follows that (A.97), for the over/under-determined cases, becomes

$$\mathbf{\Sigma} = \begin{bmatrix} \mathbf{\Sigma}_r & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix},$$

where
Fig. A.3 Schematic of the SVD decomposition $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^H$ in the cases (a) overdetermined (the data matrix $\mathbf{X}$ is tall) and (b) underdetermined (the data matrix $\mathbf{X}$ is fat): $\mathbf{U}$ and $\mathbf{V}$ are unitary matrices, and $\mathbf{\Sigma}$ contains the diagonal block $\mathbf{\Sigma}_r$ bordered by null blocks.
$$\mathbf{\Sigma}_r = \mathrm{diag}(\sigma_0,\sigma_1,\ldots,\sigma_{r-1}). \qquad (A.103)$$

Moreover, from the previous development the following expansion applies:

$$\mathbf{X} = [\,\mathbf{U}_1\ \mathbf{U}_2\,]\begin{bmatrix} \mathbf{\Sigma}_r & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}\begin{bmatrix} \mathbf{V}^H_1 \\ \mathbf{V}^H_2 \end{bmatrix} = \mathbf{U}_1\mathbf{\Sigma}_r\mathbf{V}^H_1 = \sum_{i=0}^{r-1}\sigma_i\mathbf{u}_i\mathbf{v}^H_i, \qquad (A.104)$$

where $\mathbf{V}_1$, $\mathbf{V}_2$, $\mathbf{U}_1$, and $\mathbf{U}_2$ are orthonormal matrices defined as

$$\mathbf{V} = [\,\mathbf{V}_1\ \mathbf{V}_2\,] \quad \text{with } \mathbf{V}_1 \in \mathbb{C}^{M\times r} \text{ and } \mathbf{V}_2 \in \mathbb{C}^{M\times(M-r)} \qquad (A.105)$$

$$\mathbf{U} = [\,\mathbf{U}_1\ \mathbf{U}_2\,] \quad \text{with } \mathbf{U}_1 \in \mathbb{C}^{N\times r} \text{ and } \mathbf{U}_2 \in \mathbb{C}^{N\times(N-r)} \qquad (A.106)$$

for which, by (A.101), we have $\mathbf{V}^H_1\mathbf{V}_2 = \mathbf{0}$ and $\mathbf{U}^H_1\mathbf{U}_2 = \mathbf{0}$. The representation (A.104) is sometimes called the thin SVD of $\mathbf{X}$.
Note also that the Euclidean norm of $\mathbf{X}$ is equal to

$$\|\mathbf{X}\|_2 = \sigma_0 \qquad (A.107)$$

while its Frobenius norm is equal to

$$\|\mathbf{X}\|_F \triangleq \sqrt{\sum_{i=0}^{N-1}\sum_{j=0}^{M-1}|x_{ij}|^2} = \sqrt{\sigma^2_0 + \sigma^2_1 + \cdots + \sigma^2_{r-1}}. \qquad (A.108)$$

Remark An important special case of the SVD decomposition occurs when the matrix $\mathbf{X}$ is symmetric and nonnegative definite. In this case,

$$\mathbf{\Sigma} = \mathrm{diag}(\lambda_0,\lambda_1,\ldots,\lambda_{r-1}), \qquad (A.109)$$

where $\lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_{r-1} \geq 0$ are the real eigenvalues of $\mathbf{X}$ corresponding to the eigenvectors $\mathbf{v}_i$.
A.11.2 Pseudoinverse Matrix and SVD
The Moore–Penrose pseudoinverse in the overdetermined case is defined as $\mathbf{X}^{\#} = (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H$, while in the underdetermined case it is $\mathbf{X}^{\#} = \mathbf{X}^H(\mathbf{X}\mathbf{X}^H)^{-1}$. It should be noted that, by the expression (A.95), $\mathbf{X}^{\#}$ always takes the following forms:

$$\mathbf{X}^{\#} = (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H = \mathbf{V}\begin{bmatrix} \mathbf{\Sigma}^{-1}_K & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}\mathbf{U}^H, \quad N > M$$

$$\mathbf{X}^{\#} = \mathbf{X}^H(\mathbf{X}\mathbf{X}^H)^{-1} = \mathbf{V}\begin{bmatrix} \mathbf{\Sigma}^{-1}_K & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}\mathbf{U}^H, \quad N < M, \qquad (A.110)$$

where, for $K = \min(N,M)$, $\mathbf{\Sigma}^{-1}_K = \mathrm{diag}(\sigma^{-1}_0,\sigma^{-1}_1,\ldots,\sigma^{-1}_{K-1})$, and, for $r \leq K$,

$$\mathbf{X}^{\#} = \mathbf{V}_1\mathbf{\Sigma}^{-1}_r\mathbf{U}^H_1. \qquad (A.111)$$

For both the over- and the under-determined case, these expressions can be verified by means of (A.95) together with the partitions (A.105) and (A.106).
Remark Remember that the right singular vectors $\mathbf{v}_0, \mathbf{v}_1, \ldots, \mathbf{v}_{M-1}$ of the data matrix $\mathbf{X}$ are equal to the eigenvectors of the matrix $\mathbf{X}^H\mathbf{X}$, while the left singular vectors $\mathbf{u}_0, \mathbf{u}_1, \ldots, \mathbf{u}_{N-1}$ are equal to the eigenvectors of the matrix $\mathbf{X}\mathbf{X}^H$. It is also true that $r = \mathrm{rank}(\mathbf{X})$, i.e., the number of positive singular values is equal to the rank of the data matrix $\mathbf{X}$. Therefore, the SVD decomposition provides a practical tool for determining the rank of a matrix and its pseudoinverse.
Corollary For the calculation of the pseudoinverse it is also possible to use other types of decomposition, such as the one shown below.

Given a matrix $\mathbf{X} \in (\mathbb{R},\mathbb{C})^{N\times M}$ with $\mathrm{rank}(\mathbf{X}) = r < \min(N,M)$, there exist two matrices $\mathbf{C}_{N\times r}$ and $\mathbf{D}_{r\times M}$ such that $\mathbf{X} = \mathbf{CD}$. Using these matrices it is easy to verify that

$$\mathbf{X}^{\#} = \mathbf{D}^H(\mathbf{D}\mathbf{D}^H)^{-1}(\mathbf{C}^H\mathbf{C})^{-1}\mathbf{C}^H. \qquad (A.112)$$
A.12 Condition Number of a Matrix
In numerical analysis the condition number, indicated as $\chi(\cdot)$, associated with a problem measures the degree of numerical tractability of the problem itself. A matrix $\mathbf{A}$ is called ill-conditioned if $\chi(\mathbf{A})$ takes large values. In this case, some methods of matrix inversion can exhibit large numerical errors.

Given a matrix $\mathbf{A} \in (\mathbb{R},\mathbb{C})^{N\times M}$, the condition number is defined as

$$\chi(\mathbf{A}) \triangleq \|\mathbf{A}\|_p\|\mathbf{A}^{\#}\|_p, \qquad 1 \leq \chi(\mathbf{A}) \leq \infty, \qquad (A.113)$$

where $p = 1, 2, \ldots, \infty$, $\|\cdot\|_p$ may also be the Frobenius norm, and $\mathbf{A}^{\#}$ is the pseudoinverse of $\mathbf{A}$. The number $\chi(\mathbf{A})$ depends on the type of norm chosen. In particular, in the case of the $L_2$ norm it is possible to prove that

$$\chi(\mathbf{A}) = \|\mathbf{A}\|_2\|\mathbf{A}^{\#}\|_2 = \frac{\sigma_{\max}}{\sigma_{\min}}, \qquad (A.114)$$

where $\sigma_{\max} = \sigma_1$ and $\sigma_{\min}$ ($= \sigma_M$ or $\sigma_N$) are, respectively, the maximum and minimum singular values of $\mathbf{A}$. In the case of a square matrix,

$$\chi(\mathbf{A}) = \frac{|\lambda_{\max}|}{|\lambda_{\min}|}, \qquad (A.115)$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum and minimum eigenvalues of $\mathbf{A}$.
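The ratio in (A.114) is exactly what `np.linalg.cond` computes for the $L_2$ norm; the sketch below (an added illustration, assuming NumPy) checks this on a deliberately ill-conditioned matrix:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])          # nearly singular, hence ill-conditioned

s = np.linalg.svd(A, compute_uv=False)
chi = s[0] / s[-1]                   # sigma_max / sigma_min, as in (A.114)
assert np.isclose(chi, 1e6)
assert np.isclose(np.linalg.cond(A, 2), chi)
```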
A.13 Kroneker Product
The Kronecker product between two matrices A ∈ ðℝ,ℂ ÞP�Q and
B ∈ ðℝ,ℂ ÞN�M, usually indicated as A � B, is defined as
A� B ¼a11B � � � a1QB⋮ ⋱ ⋮
aP1B � � � aPQB
24
35∈ ℝ;ℂð ÞPN�QM: ðA:116Þ
The Kronecker product can be convenient to represent linear systems equations and
some linear transformations.
Given a matrix A ∈ ðℝ,ℂ ÞN�M, you can associate with it a vector,
vecðAÞ ∈ ðℝ,ℂ ÞNM�1, containing all its column vectors [see (A.10)].
For example, given the matricesAN�M and XM�P, it is possible to represent their
product as
AX ¼ B, ðA:117Þ
where BN�P; using the definition (A.10) and the Kronecker product, we have that
I� Að Þvec Xð Þ ¼ vec Bð Þ ðA:118Þ
that represents a system of linear equations of NP equations and MP unknowns.
Similarly, given the matrices A_(N×M), X_(M×P), and B_(P×Q), it is possible to
represent their product

AXB = C        (A.119)

in an equivalent manner as a system of NQ linear equations in MP unknowns, or as

(B^T ⊗ A) vec(X) = vec(C).        (A.120)
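The identity (A.120) can be verified numerically; the following minimal pure-Python sketch (not from the book) does so on small matrices, with illustrative helper names `kron`, `vec`, and `matmul`.

```python
# Illustrative check of (B^T kron A) vec(X) = vec(C) for A X B = C,
# using plain nested lists (no external libraries).

def matmul(A, B):
    """Matrix product of two lists-of-rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def kron(A, B):
    """Kronecker product of A (P x Q) with B (N x M) -> (PN x QM), per (A.116)."""
    return [[A[p][q] * B[n][m]
             for q in range(len(A[0])) for m in range(len(B[0]))]
            for p in range(len(A)) for n in range(len(B))]

def vec(A):
    """Stack the columns of A into a single column (returned as a flat list)."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]

A = [[1, 2], [3, 4]]          # N x M = 2 x 2
X = [[0, 1], [1, 1]]          # M x P = 2 x 2
B = [[2, 0], [1, 1]]          # P x Q = 2 x 2

C = matmul(matmul(A, X), B)   # A X B = C, as in (A.119)
x = vec(X)
lhs = [sum(row[j] * x[j] for j in range(4)) for row in kron(transpose(B), A)]
assert lhs == vec(C)          # (B^T kron A) vec(X) = vec(C), i.e. (A.120)
```

The same `vec` and `kron` helpers also verify (A.118) by taking B as the identity.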
Appendix B: Elements of Nonlinear Programming
B.1 Unconstrained Optimization
The term nonlinear programming (NLP) indicates the process of solving linear or
nonlinear systems of equations, not through a closed mathematical–algebraic
approach, but with a methodology that minimizes or maximizes some cost function
associated with the problem.
This Appendix briefly introduces the basic concepts of NLP. In particular, it
presents some fundamental concepts of the unconstrained and the constrained
optimization methods [3–15].
B.1.1 Numerical Methods for Unconstrained Optimization
The problem of unconstrained optimization can be formulated as follows: find a
vector w ∈ Ω ⊆ ℝ^M (see footnote 4) that minimizes (maximizes) a scalar
function J(w). Formally,

w* = arg min_{w∈Ω} J(w).        (B.1)

The real function J(w), J : ℝ^M → ℝ, is called cost function (CF), or loss
function, objective function, or energy function; w is an M-dimensional vector
of variables that could have any values, positive or negative, and Ω is the
variables or search space. Minimizing a function is equivalent to maximizing the
negative of the function itself. Therefore, without loss of generality,
minimizing or maximizing a function are equivalent problems.
A point w* is a global minimum of the function J(w) if
4 For uniformity of writing, we denote by Ω the search space, which in the
absence of constraints coincides with the whole space, i.e., Ω ≡ ℝ^M. As we will
see later, in the presence of constraints, there is a reduced search space,
i.e., Ω ⊂ ℝ^M.
J(w*) ≤ J(w),   ∀ w ∈ ℝ^M,        (B.2)

and w* is a strict local minimizer if (B.2) holds with strict inequality (for
w ≠ w*) in an ε-radius ball centered in w*, indicated as B(w*, ε).
B.1.2 Existence and Characterization of the Minimum
The admissible solutions of a problem can be characterized in terms of some
necessary and sufficient conditions.
First-order necessary condition (FONC) (for minimization or maximization) is
that

∇J(w) = 0,        (B.3)

where the operator ∇J(w) ∈ ℝ^M is a vector indicating the gradient of the
function J(w), defined as

∇J(w) ≜ ∂J(w)/∂w = [ ∂J(w)/∂w_1   ∂J(w)/∂w_2   ⋯   ∂J(w)/∂w_M ]^T.        (B.4)
Second-order necessary condition (SONC) is that the Hessian matrix
∇²J(w) ∈ ℝ^(M×M), defined as

∇²J(w) ≜ (∂/∂w)[∂J(w)/∂w]^T = (∂/∂w)[∇J]^T
       = [ ∂²J(w)/∂w_1²       ∂²J(w)/∂w_1∂w_2   ⋯   ∂²J(w)/∂w_1∂w_M
           ∂²J(w)/∂w_2∂w_1    ∂²J(w)/∂w_2²      ⋯   ∂²J(w)/∂w_2∂w_M
           ⋮                  ⋮                  ⋱   ⋮
           ∂²J(w)/∂w_M∂w_1    ∂²J(w)/∂w_M∂w_2   ⋯   ∂²J(w)/∂w_M² ],        (B.5)

is positive semi-definite (PSD), i.e.,

w^T ∇²J(w) w ≥ 0,   for all w.        (B.6)
Second-order sufficient condition (SOSC) is that, given the FONC satisfied, the
Hessian matrix ∇²J(w) is positive definite, that is, w^T ∇²J(w) w > 0 for all
w ≠ 0.
A necessary and sufficient condition for w* to be a strict local minimizer of
J(w) can be formalized by the following theorem:

Theorem The point w* is a strict local minimizer of J(w) iff ∇J(w*) = 0 and
the Hessian ∇²J(w*) is symmetric and positive definite; in that case there
exists ε > 0 such that J(w*) < J(w) for all w with 0 < ||w − w*|| < ε.
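The conditions above can be checked numerically at a candidate point; the following pure-Python sketch (not from the book) uses central finite differences on an illustrative quadratic CF, with the 2×2 positive-definiteness test (positive diagonal and determinant).

```python
# Numerical check of FONC/SOSC at a candidate minimizer, via central
# finite differences. The cost function J below is an illustrative example.

def J(w):
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2

def grad(J, w, h=1e-5):
    """Central-difference gradient estimate of J at w."""
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += h; wm[i] -= h
        g.append((J(wp) - J(wm)) / (2 * h))
    return g

def hess(J, w, h=1e-4):
    """Central-difference Hessian estimate of J at w."""
    M = len(w)
    H = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(M):
            wpp, wpm, wmp, wmm = list(w), list(w), list(w), list(w)
            wpp[i] += h; wpp[j] += h
            wpm[i] += h; wpm[j] -= h
            wmp[i] -= h; wmp[j] += h
            wmm[i] -= h; wmm[j] -= h
            H[i][j] = (J(wpp) - J(wpm) - J(wmp) + J(wmm)) / (4 * h * h)
    return H

w_star = [1.0, -2.0]                 # candidate minimizer of this J
g = grad(J, w_star)
H = hess(J, w_star)
assert all(abs(gi) < 1e-6 for gi in g)                            # FONC (B.3)
assert H[0][0] > 0 and H[0][0] * H[1][1] - H[0][1] * H[1][0] > 0  # SOSC, 2x2 PD
```

For this separable quadratic the exact Hessian is diag(2, 4), which the finite-difference estimate reproduces to high accuracy.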
B.2 Algorithms for Unconstrained Optimization
In the field of unconstrained optimization, it is known that some general principles
can be used to study most of the algorithms. This section describes some of these
fundamental principles.
B.2.1 Basic Principles
Our problem is to determine (or better, estimate) the vector w*, called the
optimal solution, which minimizes the CF J(w). If the CF is smooth and its
gradient is available, the optimal solution can be computed (estimated) by an
iterative procedure that minimizes the CF, i.e., starting from some initial
condition (IC) w₋₁, a suitable solution is available only after a certain number
of adaptation steps: w₋₁ → w₀ → w₁ → ... → w_k → ... → w*. The recursive
estimator has a form of the type

w_{k+1} = w_k + μ_k d_k        (B.7)

or

w_k = w_{k−1} + μ_k d_k,        (B.8)

where k is the adaptation index. The vector d_k represents the adaptation
direction, and the parameter μ_k is the step size, also called adaptation rate,
step length, learning rate, etc., that can be obtained by means of a
one-dimensional search.
An important aspect of the recursive procedure (B.7) concerns the algorithm order.
In the first-order algorithms, the adaptation is carried out using only knowledge
about the CF gradient, evaluated with respect to the free parameters w. In the
second-order algorithms, to reduce the number of iterations needed for conver-
gence, information about the JðwÞ curvature, i.e., the CF Hessian, is also used.
Appendix B: Elements of Nonlinear Programming 605
Figure B.1 shows a qualitative evolution of the recursive optimization algorithm.
B.2.2 First- and Second-Order Algorithms
Let J(w) be the CF to be minimized; if the CF gradient is available at learning step k, indicated as ∇J(w_k), it is possible to define a family of iterative methods for the
optimum solution computation. These methods are referred to as search methods or
searching the performance surface, and the best-known algorithm of this class is the
steepest descent algorithm (SDA) (Cauchy 1847). Note that, given the popularity of
the SDA, this class of search methods is often identified with the name SDA
algorithms.
Considering the general adaptation formula (B.7), and indicating for simplicity
the gradient as g_k = ∇J(w_k), the direction vector d_k is defined as follows:

d_k = −g_k,   SDA algorithms.        (B.9)
The SDA are first-order algorithms because adaptation is determined by knowledge
of the gradient, i.e., only the first derivative of the CF. Starting from a
given IC w₋₁, they proceed by updating the solution (B.7) along the direction
opposite to the CF gradient with a step length μ.
The learning algorithm's performance can be improved by using second-order
derivatives. In the case that the Hessian matrix is known, the method, called
exact Newton, has a form of the type

d_k = −[∇²J(w_k)]^{-1} g_k,   exact Newton.        (B.10)
In the case where the Hessian is unknown, the method, called quasi-Newton
(Broyden 1965; see [3] and [4] for other details), has a form of the type
Fig. B.1 Qualitative evolution of the trajectory of the weights w_k, during the
optimization process towards the optimal solution w*, for a generic
two-dimensional objective function: (a) qualitative trend of steepest descent
along the negative gradient of the surface J(w); (b) detail of the direction
and the step size.
d_k = −H_k g_k,   quasi-Newton,        (B.11)

where the matrix H_k is an approximation of the inverse of the Hessian matrix:

H_k ≅ [∇²J(w_k)]^{-1}.        (B.12)
The matrix H_k is a weighting matrix that can be estimated in various ways.
As Fig. B.2 shows, the product μ_k H_k can be interpreted as an optimum choice
of direction and step-size length, calculated so as to follow the
surface-gradient descent in very few steps; in the limit, as in the exact
Newton method, in only one step.
B.2.3 Line Search and Wolfe Condition
The step size μ of the unconstrained minimization procedure can be chosen a
priori (according to certain rules) and kept fixed during the entire process,
or it may be variable, denoted μ_k. In this case, the step size can be optimized
according to some criterion, e.g., the line search method defined as

μ_k = arg min_{μ_min<μ<μ_max} J(w_k + μ d_k).        (B.13)

With this technique, the parameter μ_k is (locally) increased, using a certain
step, as long as the CF continues to decrease. The length of the learning rate
is variable, usually becoming smaller when approaching the optimal solution.
Fig. B.2 In the second-order algorithms, the matrix H_k determines a
transformation, in terms of rotation and gain, of the vector d_k towards the
direction of the minimum of the surface J(w).
A typical qualitative evolution of the line search during descent along the
gradient of the CF is shown in Fig. B.3.
As illustrated in Fig. B.3, in certain situations the number of iterations
needed to reach the optimal point can be drastically reduced, however, at a
considerable increase in computational cost due to the calculation of the
expression (B.13).
For noisy or rippled CFs the expression (B.13) can be computed only with some
difficulty, so algorithms for the determination of the optimal step size should
be used with some caution.
The Wolfe conditions are a set of inequalities for performing an inexact line
search, especially in second-order methods, in order to determine an acceptable
step size. Inexact line searches provide an efficient way of computing a step
size μ that reduces the objective function "sufficiently," rather than
minimizing the objective function over μ ∈ ℝ⁺ exactly. A line search algorithm
can use the Wolfe conditions as a requirement for any guessed μ, before finding
a new search direction d_k. A step length μ_k is said to satisfy the Wolfe
conditions if the following two inequalities hold:
J(w_k + μ_k d_k) ≤ J(w_k) + σ1 μ_k d_k^T g_k
d_k^T ∇J(w_k + μ_k d_k) ≥ σ2 d_k^T g_k,        (B.14)

where 0 < σ1 < σ2 < 1. The first inequality ensures that the CF J_k is reduced
sufficiently. The second, called the curvature condition, ensures that the slope
has been reduced sufficiently. It is easy to show that if d_k is a descent
direction, if J_k is continuously differentiable, and if J_k is bounded below
along the ray {w_k + μ d_k | μ > 0}, then there always exist step sizes
satisfying (B.14). Algorithms that are guaranteed to find, in a finite number of
iterations, a point satisfying the Wolfe conditions have been developed by
several authors (see [4] for details).
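As a concrete illustration (not from the book's listings), the two inequalities (B.14) can be tested for a trial step in a few lines of pure Python; the quadratic CF is an illustrative example, and σ1, σ2 take the values 10⁻⁴ and 0.9 suggested later in the text.

```python
# Minimal sketch: test the two Wolfe inequalities (B.14) for a trial step mu
# along a descent direction d, on a simple quadratic CF (illustrative).

def J(w):
    return 0.5 * (w[0] ** 2 + 4.0 * w[1] ** 2)

def gradJ(w):
    return [w[0], 4.0 * w[1]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def wolfe_ok(w, d, mu, s1=1e-4, s2=0.9):
    g = gradJ(w)                                   # g_k = grad J(w_k)
    w_new = [wi + mu * di for wi, di in zip(w, d)]
    sufficient_decrease = J(w_new) <= J(w) + s1 * mu * dot(d, g)
    curvature = dot(d, gradJ(w_new)) >= s2 * dot(d, g)
    return sufficient_decrease and curvature

w = [2.0, 1.0]
d = [-gi for gi in gradJ(w)]       # steepest descent direction
assert not wolfe_ok(w, d, 1e-6)    # tiny step: decrease ok, curvature fails
assert wolfe_ok(w, d, 0.2)         # moderate step: both conditions hold
```

The tiny step is rejected by the curvature condition precisely because the slope at the new point is still almost as steep as at w_k.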
Fig. B.3 Qualitative evolution of the descent along the negative gradient of
the CF with the line search method. The μ parameter is increased as long as the
CF continues to decrease.
If we modify the curvature condition as

|d_k^T ∇J(w_k + μ_k d_k)| ≤ σ2 |d_k^T g_k|,        (B.15)

known as the strong Wolfe condition, this can result in a value for the step
size that is close to a minimizer of J(w_k + μ_k d_k).
B.2.3.1 Line Search Condition for Quadratic Form

Let A ∈ ℝ^(M×M) be a symmetric and positive definite matrix; for a quadratic CF
defined as

J(w) = c − w^T b + (1/2) w^T A w,        (B.16)

the optimal step size is

μ = (d_{k−1}^T d_{k−1}) / (d_{k−1}^T A d_{k−1}).        (B.17)
Proof The line search is a procedure to find the best step size along the
steepest direction, i.e., the one that satisfies ∂J(w)/∂μ → 0. Using the chain
rule, we can write

∂J(w_k)/∂μ = [∂J(w_k)/∂w_k]^T (∂w_k/∂μ) = [∇J(w_k)]^T d_{k−1}.

Intuitively, from the current point reached by the line search procedure, the
next direction is orthogonal to the previous direction, that is, d_k ⊥ d_{k−1}
(see Fig. B.3). For the determination of the optimal step size μ, we note that
∇J(w_k) = −d_k. It follows that

d_k^T d_{k−1} = 0
[∇J(w_k)]^T d_{k−1} = 0.        (B.18)

For a CF of the type (B.16), at the kth iteration, the negative gradient
(search direction) is −∇J(w_k) = b − A w_k. Letting the weights' correction be
w_k = w_{k−1} + μ d_{k−1}, the expression (B.18) can be written as

[b − A w_k]^T d_{k−1} = 0
[b − A(w_{k−1} + μ d_{k−1})]^T d_{k−1} = 0;

from the latter, with the position d_{k−1} = b − A w_{k−1},
Appendix B: Elements of Nonlinear Programming 609
[d_{k−1} − μ A d_{k−1}]^T d_{k−1} = 0
d_{k−1}^T d_{k−1} − μ d_{k−1}^T A d_{k−1} = 0.

Finally, solving for μ, we have

μ = (d_{k−1}^T d_{k−1}) / (d_{k−1}^T A d_{k−1}).

Q.E.D.
Example Consider a quadratic CF (B.16) with A = [1 0.8; 0.8 1],
b = [0.1 −0.2]^T, and c = 0.1; the plot of the performance surface is reported
in Fig. B.4.

Problem Find the optimal solution, using a Matlab procedure, with tolerance
Tol = 1e−6, starting with IC w₋₁ = [0 3]^T.
In Fig. B.5 the weights trajectories, plotted over the isolevel performance
surface, are reported for the standard SDA and the SDA plus the Wolfe condition.

Computed optimum solution

w[0] = 0.72222
w[1] = -0.77778

SDA computed optimum solution with μ = 0.1

n. Iter = 1233
w[0] = 0.72222
w[1] = -0.77778
Fig. B.4 Trend of the cost function considered in the example.
SDA2_Wolfe optimal solution, μ computed with Eq. (B.17)

n. Iter = 30
w[0] = 0.72222
w[1] = -0.77778
Matlab Functions

%--------------------------------------------------------------------------
% Standard Steepest Descent Algorithm
%
% Copyright 2013 - A. Uncini
% DIET Dpt - University of Rome 'La Sapienza' - Italy
% $Revision: 1.0$  $Date: 2013/03/09$
%--------------------------------------------------------------------------
function [w, k] = SDA(w, g, R, c, mu, tol, MaxIter)
  % Steepest descent -------------------------------------------------------
  for k = 1 : MaxIter
    gradJ = grad_CF(w, g, R);              % gradient computation
    w = w - mu*gradJ;                      % update solution
    if ( norm(gradJ) < tol ), break, end   % end criterion
  end
end

%--------------------------------------------------------------------------
% Standard Steepest Descent Algorithm with Wolfe condition
% for quadratic CF:  J(w) = c - w'b + (1/2)w'Aw
%
% Copyright 2013 - A. Uncini
% DIET Dpt - University of Rome 'La Sapienza' - Italy
% $Revision: 1.0$  $Date: 2013/03/09$
%--------------------------------------------------------------------------
function [w, k] = SDA2(w, g, R, c, mu, tol, MaxIter)
  for k = 1 : MaxIter
    gradJ = R*w - g;                       % gradient computation
    mu = (gradJ'*gradJ)/(gradJ'*R*gradJ);  % optimal step size, Eq. (B.17)
    w = w - mu*gradJ;                      % update solution
    if ( norm(gradJ) < tol ), break, end   % end criterion
  end
end

%--------------------------------------------------------------------------
% Standard quadratic cost function and gradient computation
%
% Copyright 2013 - A. Uncini
% DIET Dpt - University of Rome 'La Sapienza' - Italy
% $Revision: 1.0$  $Date: 2013/03/09$
%--------------------------------------------------------------------------
function [Jw] = CF(w, c, b, A)
  Jw = c - w'*b + 0.5*(w'*A*w);            % J(w) of Eq. (B.16)
end
%--------------------------------------------------------------------------
function [gradJ] = grad_CF(w, b, A)
  gradJ = A*w - b;
end
B.2.4 The Standard Newton’s Method
The Newton methods are based on the exact computation of the minimum of a
quadratic local approximation of the CF. In other words, rather than directly
determining the approximate minimum of the true CF, the minimum of a locally
quadratic approximation of the CF is exactly computed.
The method can be formalized by considering the truncated second-order Taylor
series expansion of the CF J(w) around a point w_k, defined as

J(w) ≅ J(w_k) + [w − w_k]^T ∇J(w_k) + (1/2)[w − w_k]^T ∇²J(w_k)[w − w_k].   (B.19)

The minimum of (B.19) is determined by imposing ∇J(w) → 0, so the point w_{k+1}
(that minimizes the CF) necessarily satisfies the relationship⁵

∇J(w_k) + (1/2) ∇²J(w_k)[w_{k+1} − w_k] = 0.        (B.20)

If the inverse of the Hessian matrix exists, the previous expression can be
written in the following form of finite difference equation (FDE):

w_{k+1} = w_k − μ_k [∇²J(w_k)]^{-1} ∇J(w_k),   for ∇²J(w_k) ≠ 0,        (B.21)

where μ_k > 0 is a suitable constant. The expression (B.21) represents the
standard form of the discrete Newton's method.
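The one-step convergence implied by (B.25) below is easy to verify numerically; this pure-Python sketch (not from the book's Matlab listings) applies a single Newton step with μ = 1 to the quadratic example of Fig. B.4, writing out the 2×2 inverse explicitly.

```python
# Illustrative check: for the quadratic CF (B.16) a Newton step with unit
# step size reaches w* = A^{-1} b in one iteration (2x2 example).

A = [[1.0, 0.8], [0.8, 1.0]]            # symmetric positive definite
b = [0.1, -0.2]
w = [0.0, 3.0]                          # arbitrary initial condition

# Gradient of J(w) = c - w'b + (1/2) w'Aw  is  A w - b
g = [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
     A[1][0] * w[0] + A[1][1] * w[1] - b[1]]

# Explicit 2x2 inverse of the Hessian (which equals A for a quadratic CF)
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
Ainv = [[ A[1][1] / det, -A[0][1] / det],
        [-A[1][0] / det,  A[0][0] / det]]

# Newton update: w <- w - A^{-1} (A w - b) = A^{-1} b
w = [w[0] - (Ainv[0][0] * g[0] + Ainv[0][1] * g[1]),
     w[1] - (Ainv[1][0] * g[0] + Ainv[1][1] * g[1])]

# Compare with the optimum computed in the book's example
assert abs(w[0] - 0.72222) < 1e-4 and abs(w[1] + 0.77778) < 1e-4
```

One step lands exactly on w* = A⁻¹b = [0.72222, −0.77778]ᵀ, the same solution that the SDA of the example needs 1233 iterations to reach.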
Fig. B.5 Trajectories of the weights on the isolevel CF curves for the steepest
descent algorithm (SDA) and the Wolfe SDA.
5 In the optimum point w_{k+1}, by definition, J(w_{k+1}) ≅ J(w_k). It follows
that (B.19) can be written as 0 = [w_{k+1} − w_k]^T ∇J(w_k) +
(1/2)[w_{k+1} − w_k]^T ∇²J(w_k)[w_{k+1} − w_k]. So, simplifying the term
[w_{k+1} − w_k]^T gives (B.20).
Remark The CF approximation with a quadratic form is significant because J(w)
is usually an energy function. As explained by the Lyapunov method [5], you can
think of that function as the energy associated with a continuous-time
dynamical system described by a system of differential equations of the form

dw/dt = −μ₀ [∇²J(w_k)]^{-1} ∇J(w_k),        (B.22)

such that for μ₀ > 0, (B.21) corresponds to its numeric approximation. In this
case, the convergence properties of Newton's method can be studied in the
context of a quadratic programming problem of the type

w* = arg min_{w∈Ω} J(w)        (B.23)

when the CF has a quadratic form of type (B.16). Note that for A positive
definite the function J(w) is strictly convex and admits an absolute minimum w*
that satisfies

A w* = b  ⇒  w* = A^{-1} b.        (B.24)

Also, observe that the gradient and the Hessian of the expression (B.16) are
calculated explicitly as ∇J(w) = Aw − b and ∇²J(w) = A; replacing these values
in the form (B.21) for μ_k = 1, the recurrence becomes

w_{k+1} = w_k − A^{-1}(A w_k − b) = A^{-1} b.        (B.25)

The above expression indicates that the Newton method theoretically converges
to the minimum point in only one iteration. In practice, however, the gradient
calculation and the Hessian inversion pose many difficulties. In fact, the
Hessian matrix is usually ill-conditioned, and its inversion represents an
ill-posed problem. Furthermore, the IC w₋₁ can be quite far from the minimum
point, and the Hessian at that point may not be positive definite, leading the
algorithm to diverge. In practice, a way to overcome these drawbacks is to slow
the adaptation speed by including a step-size parameter μ_k in the recurrence.
It follows that in causal form, (B.25) can be written as

w_k = w_{k−1} − μ_k A^{-1}(A w_{k−1} − b).        (B.26)
As mentioned above, in the simplest form of the Newton method, the weighting of
equation (B.10) is made with the inverse Hessian matrix, or with its estimate.
We then have

H_k = [∇²J_{k−1}]^{-1},   exact Newton algorithms,        (B.27)

H_k ≅ [∇²J_{k−1}]^{-1},   quasi-Newton algorithms,        (B.28)

thereby forcing both the direction and the step size towards the minimum of the
function. The learning parameter can be constant (μ_k < 1) or also estimated
with a suitable optimization procedure.
B.2.5 The Levenberg–Marquardt Variant

A simple method to overcome the problem of ill-conditioning of the Hessian
matrix, called the Levenberg–Marquardt variant [6, 7], consists in the
definition of an adaptation rule of the type

w_k = w_{k−1} − μ_k [δI + ∇²J_{k−1}]^{-1} g_k,        (B.29)

where the constant δ > 0 must be chosen considering two contradictory
requirements: small, to increase the convergence speed, and sufficiently large
as to make the Hessian matrix always positive definite.
The Levenberg–Marquardt method is an approximation of the Newton algorithm. It
also has quadratic convergence characteristics. Furthermore, convergence is
guaranteed even when the initial condition estimate is far from the minimum
point.
Note that the addition of the term δI, besides ensuring the positivity of the
Hessian matrix, is strictly related to the Tikhonov regularization theory. In
the presence of a noisy CF, the term δI can be viewed as a Tikhonov
regularizing term which determines the optimal solution of a smooth version of
the CF [8].
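The effect of the damping term δI in (B.29) can be illustrated with a tiny pure-Python sketch (not from the book): a singular Hessian becomes invertible after damping, and the damped Newton direction remains a descent direction. The matrix, gradient, and δ values are illustrative.

```python
# Sketch of the Levenberg-Marquardt damping (B.29): adding delta*I makes a
# singular (hence non-invertible) Hessian usable. Values are illustrative.

H = [[1.0, 1.0], [1.0, 1.0]]             # singular Hessian: det = 0
g = [1.0, 0.5]                           # gradient at the current point
delta = 0.1

Hd = [[H[0][0] + delta, H[0][1]],        # damped Hessian: delta*I + H
      [H[1][0], H[1][1] + delta]]

det = Hd[0][0] * Hd[1][1] - Hd[0][1] * Hd[1][0]
assert det > 0                           # damping restored invertibility

# Damped Newton direction d = -(delta*I + H)^{-1} g, via the 2x2 inverse
d = [-( Hd[1][1] * g[0] - Hd[0][1] * g[1]) / det,
     -(-Hd[1][0] * g[0] + Hd[0][0] * g[1]) / det]

# d is still a descent direction: d'g < 0
assert d[0] * g[0] + d[1] * g[1] < 0
```

Because δI + H is symmetric positive definite, the direction −(δI + H)⁻¹g always points downhill, which is exactly the Tikhonov-like stabilization described above.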
B.2.6 Quasi-Newton Methods or Variable Metric Methods
In many optimization problems, the Hessian matrix is not explicitly available.
In the quasi-Newton methods, also known as variable metric methods, the inverse
Hessian matrix is determined iteratively and in an approximate way. The Hessian
is updated by analyzing successive gradient vectors. For example, in the
so-called sequential quasi-Newton methods, the estimate of the inverse Hessian
matrix is evaluated by considering two successive values of the CF gradient.
Consider the second-order CF approximation and let Δw = [w − w_k],
g_k = ∇J(w_k), and B_k an approximation of the Hessian matrix,
B_k ≅ ∇²J(w_k); from Eq. (B.19) we can write

J(w + Δw) ≅ J(w_k) + Δw^T g_k + (1/2) Δw^T B_k Δw.        (B.30)
The gradient of this approximation (with respect to Δw) can be written as

∇J(w_k + Δw_k) ≅ g_k + B_k Δw_k,        (B.31)

called the secant equation. The Hessian approximation can be chosen in order to
exactly satisfy Eq. (B.31); so, setting this gradient to zero, with
Δw_k → d_k, provides the quasi-Newton adaptation direction

d_k = −B_k^{-1} g_k.        (B.32)
In particular, in the method of Broyden–Fletcher–Goldfarb–Shanno (BFGS)
[3, 9–11], the adaptation takes the form

d_k = −B_k^{-1} g_k
w_{k+1} = w_k + μ_k d_k
B_{k+1} = B_k − (B_k s_k s_k^T B_k^T)/(s_k^T B_k s_k) + (u_k u_k^T)/(u_k^T s_k)
s_k = w_{k+1} − w_k
u_k = g_{k+1} − g_k,        (B.33)

where the step size μ_k satisfies the above Wolfe conditions (B.14). It has been
found that for optimal performance a very loose line search, with suggested
values of the parameters in (B.14) equal to σ1 = 10^{−4} and σ2 = 0.9, is
sufficient.
A method that can be considered as a serious contender of the BFGS [4] is the
so-called symmetric rank-one (SR1) method, where the update is given by

B_{k+1} = B_k + ((d_k − B_k s_k)(d_k − B_k s_k)^T)/(s_k^T (d_k − B_k s_k)).   (B.34)

It was first discovered by Davidon (1959), in his seminal paper on quasi-Newton
methods, and rediscovered by several authors. The SR1 method can be derived by
posing the following simple problem. Given a symmetric matrix B_k and the
vectors s_k and d_k, find a new symmetric matrix B_{k+1} such that
(B_{k+1} − B_k) has rank one, and such that

B_{k+1} s_k = d_k.        (B.35)

Note that, to prevent the method from failing, one can simply set B_{k+1} = B_k
when the denominator in (B.34) is close to zero, though this could slow down
the convergence speed.
Remark In order to avoid the computation of the inverse matrix B_k, denoting by
H_k an approximation of the inverse Hessian matrix (H_k ≅ [∇²J(w_k)]^{-1}), and
approximating (d_k ≅ Δw_k), the recursion (B.33) can be rewritten as

w_{k+1} = w_k + μ_k d_k
d_k ≅ w_{k+1} − w_k = −H_k g_k
u_k = g_{k+1} − g_k
H_{k+1} = [I − (d_k u_k^T)/(d_k^T u_k)] H_k [I − (u_k d_k^T)/(d_k^T u_k)]
          + (d_k d_k^T)/(d_k^T u_k),        (B.36)

where, usually, the step size μ_k is optimized by a one-dimensional line search
procedure (B.13) that takes the form

μ_k ∴ min_{μ∈ℝ⁺} J[w_k − μ H_k ∇J_k].        (B.37)

The procedure is initialized with an arbitrary IC w₋₁ and with the matrix
H₋₁ = I.
Alternatively, in the last of (B.36), H_k can be calculated with the
Barnes–Rosen formula (see [3] for details)

H_{k+1} = H_k + ((d_k − H_k u_k)(d_k − H_k u_k)^T)/((d_k − H_k u_k)^T u_k).  (B.38)
The variable metric methods are computationally more efficient than Newton's
method. In particular, good line search implementations of the BFGS method are
given in the IMSL and NAG scientific software libraries. The BFGS method is
fast and robust and is currently used to solve a myriad of optimization
problems [4].
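A defining property of the inverse-Hessian update (B.36) is that the new matrix satisfies the secant condition H_{k+1} u_k = d_k exactly. The following pure-Python sketch (not from the book's listings) checks this on a 2×2 example; the vectors d and u are illustrative.

```python
# Sketch of the inverse-Hessian BFGS update in (B.36), checking that the
# updated matrix satisfies the secant condition H_new u = d (2x2 example).

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def madd(A, B, s=1.0):
    """Return A + s*B, element-wise."""
    return [[A[i][j] + s * B[i][j] for j in range(len(A[0]))]
            for i in range(len(A))]

def mmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

H = [[1.0, 0.0], [0.0, 1.0]]         # initial inverse-Hessian estimate H = I
d = [0.3, -0.1]                      # step d_k = w_{k+1} - w_k
u = [0.5, 0.2]                       # gradient change u_k = g_{k+1} - g_k
du = d[0] * u[0] + d[1] * u[1]       # d'u (must be > 0 for a PD update)

I = [[1.0, 0.0], [0.0, 1.0]]
L = madd(I, outer(d, u), -1.0 / du)  # I - d u'/(d'u)
R = madd(I, outer(u, d), -1.0 / du)  # I - u d'/(d'u)
H_new = madd(mmul(mmul(L, H), R), outer(d, d), 1.0 / du)   # (B.36)

Hu = matvec(H_new, u)
assert all(abs(Hu[i] - d[i]) < 1e-12 for i in range(2))    # secant: H_new u = d
```

The check holds by construction: the right factor annihilates u, and the rank-one correction then maps u exactly onto d.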
B.2.7 Conjugate Gradient Method
Introduced by Hestenes and Stiefel [12], the conjugate gradient algorithm (CGA)
marks the beginning of the field of large-scale nonlinear optimization. The
CGA, while representing a simple change compared to the SDA and the
quasi-Newton method, has the advantage of a significant increase in convergence
speed and requires the storage of only a few vectors.
Although there are many recent developments of limited-memory and discrete
Newton methods, the CGA is still one of the best choices for solving very large
problems with relatively inexpensive objective functions. The CGA, in fact, has
remained one of the most useful techniques for solving problems large enough to
make matrix storage impractical.
B.2.7.1 Conjugate Direction

Two vectors (d₁, d₂) ∈ ℝ^(M×1) are defined orthogonal if d₁^T d₂ = 0 or
⟨d₁, d₂⟩ = 0. Given a symmetric and positive definite matrix A ∈ ℝ^(M×M), the
vectors are defined as A-orthogonal, or A-conjugate, indicated as
⟨d₁, d₂⟩|_A = 0, if d₁^T A d₂ = 0. In terms of scalar products the result is
⟨Ad₁, d₂⟩ = ⟨A^T d₁, d₂⟩ = ⟨d₁, A^T d₂⟩ = ⟨d₁, A d₂⟩ = 0.

Proposition Conjugation implies linear independence: for A ∈ ℝ^(M×M) symmetric
and positive definite, a set of nonzero A-conjugate vectors,
⟨d_{k−1}, d_k⟩|_A = 0 for k = 0, ..., M−1, indicated as [d_k]_{k=0}^{M−1}, is
linearly independent (Fig. B.6).
B.2.7.2 Conjugate Direction Optimization Algorithm
Given the standard optimization problem (B.1), with the hypothesis that the CF
is a quadratic form of the type (B.16), the following theorem holds.

Theorem Given a set of nonzero A-conjugate directions [d_k]_{k=0}^{M−1}, for
each IC w₋₁ ∈ ℝ^(M×1) the sequence w_k ∈ ℝ^(M×1) generated as

w_{k+1} = w_k + μ_k d_k,   for k = 0, 1, ...,        (B.39)

with μ_k determined by the line search criterion (B.17), converges in M steps
to the unique optimum solution w*.
Proof The proof is performed in two steps: (1) computation of the step size
μ_k; (2) proof of the subspace optimality theorem.

1. Computation of the step size μ_k
Consider the standard quadratic CF minimization problem, for which

∇J(w) → 0  ⇒  A w = b,        (B.40)

with optimal solution w* = A^{-1} b. A set of nonzero A-conjugate directions
[d_k]_{k=0}^{M−1} forms a basis over ℝ^M, such that the solution can be
expressed as

w* = Σ_{k=0}^{M−1} μ_k d_k.        (B.41)
Fig. B.6 Example of orthogonal and A-conjugate directions.
For the previous expression, the system (B.40) for w = w* can be written as

b = A Σ_{k=0}^{M−1} μ_k d_k = Σ_{k=0}^{M−1} μ_k A d_k.        (B.42)

Moreover, left-multiplying both members of the preceding expression by d_i^T,
and being by definition ⟨d_i^T A, d_j⟩ = 0 for i ≠ j, we can write

d_i^T b = μ_k d_i^T A d_k,        (B.43)

which allows the calculation of the coefficients μ_k of the basis (B.41) as

μ_k = (d_k^T b)/(d_k^T A d_k).        (B.44)

For the definition of the CGA method, we consider a recursive solution for the
CF minimization, in which at the (k−1)th iteration we consider the negative
gradient around w_{k−1}, called in this context the residue. Indicating the
negative direction of the gradient as g_{k−1} = −∇J(w_{k−1}), we have

g_{k−1} = b − A w_{k−1}.        (B.45)

The expression (B.44) can be rewritten as

μ_k = (d_k^T (g_{k−1} + A w_{k−1}))/(d_k^T A d_k).        (B.46)

From the definition of A-conjugate directions, d_k^T A w_{k−1} = 0, so we have

μ_k = (d_k^T g_{k−1})/(d_k^T A d_k).        (B.47)
Remark Expression (B.47) represents an alternative formulation for the optimal
step-size computation (B.17).

2. Subspace optimality theorem
Given a quadratic CF J(w) = (1/2) w^T A w − w^T b, and a set of nonzero
A-conjugate directions [d_k]_{k=0}^{M−1}, for any IC w₋₁ ∈ ℝ^(M×1) the sequence
w_k ∈ ℝ^(M×1) generated as

w_{k+1} = w_k + μ_k d_k,   for k ≥ 0,        (B.48)

with
618 Appendix B: Elements of Nonlinear Programming
μ_k = (d_k^T g_{k−1})/(d_k^T A d_k)        (B.49)

reaches its minimum w_{k+1} → w* in the set w₋₁ + span{d_0, ..., d_k}.
Equivalently, considering the general solution w, we have that
[∇J(w)]^T d_k = 0. Then there is, necessarily, a parameter β_i ∈ ℝ such that

w = w₋₁ + β_0 d_0 + ... + β_k d_k.        (B.50)

Then

0 = [∇J(w)]^T d_i
  = [A(w₋₁ + β_0 d_0 + ... + β_k d_k) − b]^T d_i
  = [A w₋₁ − b]^T d_i + β_0 d_0^T A d_i + ... + β_k d_k^T A d_i
  = [∇J(w₋₁)]^T d_i + β_i d_i^T A d_i,        (B.51)

whereby we can calculate the parameter β_i as

β_i = −([∇J(w₋₁)]^T d_i)/(d_i^T A d_i).        (B.52)

Q.E.D.
Fig. B.7 Trajectories of the weights on the isolevel CF curves for the steepest
descent algorithm (SDA) and the standard Hestenes–Stiefel conjugate gradient
algorithm.
B.2.7.3 The Standard Hestenes–Stiefel Conjugate Gradient Algorithm

From the earlier discussion, the basic algorithm of the conjugate directions
can be defined with an iterative procedure which allows the recursive
calculation of the parameters μ_k and β_k. We can define the standard CGA [13]
as (Fig. B.7)

d₋₁ = g₋₁ = b − A w₋₁   (w₋₁ arbitrary)   IC        (B.53)

do {
    μ_k = ||g_k||² / (d_k^T A d_k),      computation of step size        (B.54)
    w_{k+1} = w_k + μ_k d_k,             new solution or adaptation      (B.55)
    g_{k+1} = g_k − μ_k A d_k,           gradient direction update
    β_k = ||g_{k+1}||² / ||g_k||²,       computation of "beta" parameter (B.56)
    d_{k+1} = g_{k+1} + β_k d_k,         search direction                (B.57)
} while ( ||g_k|| > ε )

end criterion: output for ||g_k|| < ε.
%--------------------------------------------------------------------------
% The type 1 Hestenes-Stiefel Conjugate Gradient Algorithm
% for CF:  J(w) = c - w'b + (1/2)w'Aw
%
% Copyright 2013 - A. Uncini
% DIET Dpt - University of Rome 'La Sapienza' - Italy
% $Revision: 1.0$  $Date: 2013/03/09$
%--------------------------------------------------------------------------
function [w, k] = CGA1(w, b, A, c, mu, tol, MaxIter)
  d = b - A*w;                      % (B.53) initial direction and residual
  g = d;
  g1 = g'*g;
  for k = 1 : MaxIter
    Ad = A*d;
    mu = g1/(d'*Ad);                % (B.54) optimal step size
    w  = w + mu*d;                  % (B.55) update solution
    g  = g - mu*Ad;                 % update gradient (residual)
    g2 = g'*g;
    be = g2/g1;                     % (B.56) 'beta' parameter
    d  = g + be*d;                  % (B.57) update direction
    g1 = g2;
    if ( g2 <= tol ), break, end    % end criterion
  end
end
% Hestenes-Stiefel Conjugate Gradient Algorithm type 1 --------------------
Remark In place of the formulas (B.54) and (B.56) one may use

μ_k = (d_k^T g_k)/(d_k^T A d_k),        (B.58)

β_k = −(g_{k+1}^T A d_k)/(d_k^T A d_k).        (B.59)

These formulas, although more complicated than (B.54) and (B.56), allow the μ
and β parameters to be changed more easily during the iterations.
Moreover, note that the directions of the estimated gradients (or residuals)
g_k are mutually orthogonal, ⟨g_{k+1}, g_k⟩ = 0, while the direction vectors
d_k are mutually A-conjugate, ⟨d_{k+1}, A d_k⟩ = 0.
%--------------------------------------------------------------------------
% The type 2 Hestenes-Stiefel Conjugate Gradient Algorithm
% for CF:  J(w) = c - w'b + (1/2)w'Aw
%
% Copyright 2013 - A. Uncini
% DIET Dpt - University of Rome 'La Sapienza' - Italy
% $Revision: 1.0$  $Date: 2013/03/09$
%--------------------------------------------------------------------------
function [w, k] = CGA2(w, b, A, c, mu, tol, MaxIter)
  d = b - A*w;                          % initial direction and residual
  g = d;
  for k = 1 : MaxIter
    Ad  = A*d;
    dAd = d'*Ad;
    mu  = (d'*g)/dAd;                   % (B.58) optimal step size
    w   = w + mu*d;                     % update solution
    g   = g - mu*Ad;                    % update gradient (residual)
    be  = -(g'*Ad)/dAd;                 % (B.59) 'beta' parameter
    d   = g + be*d;                     % update direction
    if ( norm(g) <= tol ), break, end   % end criterion
  end
end
% Hestenes-Stiefel Conjugate Gradient Algorithm type 2 --------------------
B.2.7.4 Gradient Algorithm for Generic CF

The method of conjugate gradients can be generalized to find a minimum of a
generic CF. In this case the search method is sometimes called nonlinear CGA
[14], and the gradient cannot be explicitly computed but only estimated in
various ways. In particular, the residual cannot be directly found; instead,
letting ∇J(w_k) be an estimate of the CF's gradient at the kth iteration, we
set the residual as g_k = −∇J(w_k).
The line search procedure cannot be computed as in the Hestenes–Stiefel CGA
previously described and could be substituted by minimizing the expression

[∇J(w_k + μ_k d_k)]^T d_k.        (B.60)

Moreover, the estimated Hessian of the CF, ∇²J(w_k), plays the role of the
matrix A.
A simple modified CGA method is defined by the following recurrence, starting
from the IC w₋₁ and β₀^{XY} = 0:

w₋₁   (w₋₁ arbitrary)   IC        (B.61)
d₋₁ = g₋₁ = −∇J(w₋₁)   IC        (B.62)

do {
    determine μ_k,                       Wolfe conditions
    w_{k+1} = w_k + μ_k d_k,             adaptation                    (B.63)
    g_{k+1} = −∇J(w_{k+1}),              gradient estimation
    compute β_k = β_k^{XY},              "beta" parameter              (B.64)
    d_{k+1} = g_{k+1} + β_k d_k,         compute the search direction  (B.65)
    if |g_{k+1}^T g_k| > 0.2 ||g_{k+1}||²  then  d_{k+1} = g_{k+1},
                                         restart condition             (B.66)
} while ( ||g_k|| > ε )

end criterion: exit when ||g_k|| < ε.
The parameter β_k^{XY}, which plays a central role in the nonlinear CGA, can be
determined through various philosophies of calculation. Below are the most
common methods for the calculation of the beta parameter, written in terms of
the residuals g_k = −∇J(w_k) (see [15] for details):

β_k^{HS} = (g_{k+1}^T (g_{k+1} − g_k)) / (d_k^T (g_k − g_{k+1})),
                                          Hestenes–Stiefel (HS)        (B.67)

β_k^{PRP} = (g_{k+1}^T (g_{k+1} − g_k)) / (g_k^T g_k),
                                          Polak–Ribière–Polyak (PRP)   (B.68)

β_k^{LS} = (g_{k+1}^T (g_{k+1} − g_k)) / (d_k^T g_k),
                                          Liu–Storey (LS)              (B.69)

β_k^{FR} = (g_{k+1}^T g_{k+1}) / (g_k^T g_k),
                                          Fletcher–Reeves (FR)         (B.70)

β_k^{CD} = (g_{k+1}^T g_{k+1}) / (d_k^T g_k),
                                          Conjugate Descent–Fletcher (CD) (B.71)

β_k^{DY} = (g_{k+1}^T g_{k+1}) / (d_k^T (g_k − g_{k+1})),
                                          Dai–Yuan (DY)                (B.72)
Note that, in the specialized literature, there are many other variants (see, for example, [4]). For a strictly quadratic CF this method reduces to the linear search provided μ_k is the exact minimizer [3]. Other choices of the parameter β_k^{XY} in (B.65) also possess this property and give rise to distinct algorithms for nonlinear problems.
In the CGA, the increase in convergence speed is obtained from information on the search direction, which depends on the previous iteration d_{k−1}; moreover, for a quadratic CF, it is conjugate to the gradient direction. Theoretically, for w ∈ ℝ^M, the algorithm converges in M or fewer iterations.
To avoid numerical inaccuracy in the direction search calculation, or because of the non-quadratic nature of the CF, the method requires a periodic reinitialization. Indeed, under certain conditions, (B.67)–(B.72) may assume negative values, so a more appropriate choice is

β_k = max{β_k^{XY}, 0}.   (B.73)

Thus, if a negative value of β_k^{PR} occurs, this strategy will restart the iteration along the correct steepest descent direction.
The CGA can be considered an intermediate approach between the SDA and the quasi-Newton method. Unlike other algorithms, the CGA's main advantage derives from not needing to explicitly estimate the Hessian matrix, which is, in practice, replaced by the β_k parameter.
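As an illustration of the recurrence (B.61)–(B.66), the following Python sketch (not from the text) minimizes a simple assumed non-quadratic CF, J(w) = (w₁ − 1)⁴ + (w₂ + 2)², using the PRP beta with the restart rule (B.73); a plain backtracking (Armijo) line search stands in for the Wolfe conditions.

```python
import math

def J(w):
    # assumed example CF: J(w) = (w1 - 1)^4 + (w2 + 2)^2, minimum at (1, -2)
    return (w[0] - 1.0) ** 4 + (w[1] + 2.0) ** 2

def grad_J(w):
    return [4.0 * (w[0] - 1.0) ** 3, 2.0 * (w[1] + 2.0)]

def nonlinear_cga(w, tol=1e-8, max_iter=500):
    g = [-gi for gi in grad_J(w)]        # residual g = -grad J(w), cf. (B.62)
    d = g[:]                             # initial search direction
    for _ in range(max_iter):
        # backtracking line search, a simple stand-in for the Wolfe conditions
        mu, j0 = 1.0, J(w)
        slope = sum(gi * di for gi, di in zip(g, d))
        while J([wi + mu * di for wi, di in zip(w, d)]) > j0 - 1e-4 * mu * slope:
            mu *= 0.5
            if mu < 1e-16:
                break
        w = [wi + mu * di for wi, di in zip(w, d)]             # adaptation (B.63)
        g_new = [-gi for gi in grad_J(w)]                      # gradient estimation
        num = sum(gn * (gn - go) for gn, go in zip(g_new, g))  # PRP numerator
        den = sum(go * go for go in g)
        beta = max(num / den, 0.0)                             # restart rule (B.73)
        d = [gn + beta * di for gn, di in zip(g_new, d)]       # direction (B.65)
        g = g_new
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:         # exit criterion
            break
    return w
```

Starting from w = (0, 0), the iterate converges to the minimizer (1, −2) in a handful of iterations.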
B.3 Constrained Optimization Problem
The problem of constrained optimization can be formulated as follows: find a vector w ∈ Ω ⊆ ℝ^M that minimizes (maximizes) a scalar function
min_{w∈Ω} J(w)   (B.74)

subject to (s.t.) the constraints

g_i(w) ≥ 0,   for i = 1, 2, ..., M.   (B.75)
Methods for solving constrained optimization problems are often characterized by
two conflicting needs:
• Finding admissible solutions,
• Finding the algorithm to minimize the objective function.
In general, there are two basic approaches:
• Transform the problems into simpler constrained problems,
• Transform the problems into a sequence (in the limit a single) of unconstrained
problems.
B.3.1 Single Equality Constraint: Existence and Characterization of the Minimum

As in unconstrained optimization problems (see Sect. B.1.2), to have an admissible solution some sufficient and necessary conditions must be satisfied.

For example, in the case of a single equality constraint the problem can be formulated as

min_{w∈Ω} J(w)   s.t.   h(w) = b.   (B.76)

A first-order necessary condition (FONC) for a minimum (or maximum) is that the functions J(w) and h(w) have continuous first-order partial derivatives and that there exists some free scalar parameter λ such that

∇J(w) + λ∇h(w) = 0   (B.77)

or, as illustrated in Fig. B.8, the two surfaces must be tangent. Note that h(w) = b and −h(w) = −b are the same constraint and that there is no restriction on the sign of λ.
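Condition (B.77) can be verified numerically on a small assumed example (not from the text): J(w) = w₁² + w₂² subject to h(w) = w₁ + w₂ = 1, whose minimizer is w = (0.5, 0.5) with multiplier λ = −1.

```python
# assumed example: J(w) = w1^2 + w2^2 with the constraint h(w) = w1 + w2 = 1
def fonc_residual(w, lam):
    grad_J = (2.0 * w[0], 2.0 * w[1])  # gradient of the CF
    grad_h = (1.0, 1.0)                # gradient of the constraint
    return tuple(gj + lam * gh for gj, gh in zip(grad_J, grad_h))

w_opt, lam_opt = (0.5, 0.5), -1.0      # constrained minimizer and its multiplier
res = fonc_residual(w_opt, lam_opt)    # should vanish, per (B.77)
```

At any other feasible point (e.g., (0.7, 0.3)) the residual of (B.77) is nonzero, so the gradients are not aligned there.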
B.3.2 Constrained Optimization: Method of Lagrange Multipliers

The method of Lagrange multipliers (MLM) is the fundamental tool for analyzing and solving nonlinear constrained optimization problems. Lagrange multipliers can be used to find the extrema of a multivariate function J(w) subject to the constraint h(w) = b, where J and h are functions with continuous first partial derivatives on an open set containing the curve h(w) − b = 0, and ∇h(w) ≠ 0 at any point on the curve.
Fig. B.8 At the optimal point the curves J(w) and h(w) = b are necessarily tangent
B.3.2.1 Optimization with Single Constraint

In the case of the single equality constrained optimization problem (B.76), we define the Lagrangian or Lagrange function as

L(w, λ) = J(w) + λ(h(w) − b)   (B.78)

such that, when the existence condition is verified, the solution can be found by solving the following unconstrained optimization problems associated with (B.76):

min_{w∈Ω} L(w, λ)   (B.79)
min_{λ} L(w, λ).   (B.80)

That is, ∇L(w, λ) = 0, or

∇_w L(w, λ) = ∇J(w) + λ∇h(w) = 0   (B.81)
∇_λ L(w, λ) = h(w) − b = 0.   (B.82)

If (B.81) and (B.82) hold, then (w, λ) is a stationary point for the Lagrange function. In other words, the Lagrange multiplier method represents a necessary condition for the existence of an optimal solution in such constrained optimization problems. Fig. B.9 shows an example of a constrained optimization problem for M = 2.
B.3.2.2 Optimization Problem with Multiple Inequality Constraints:
Kuhn–Tucker Conditions
The generalization for multiple constraints can be formulated as
Fig. B.9 Example of a constrained optimization problem for M = 2. The constrained optimum value is the point belonging to the constraint curve f(w) = b that is closest to the unconstrained optimum
min_{w∈Ω} J(w)   s.t.   g_i(w) ≥ 0,   i = 1, 2, ..., K   (B.83)

and the Lagrangian is defined as

L(w, λ) = J(w) + Σ_{i=1}^{K} λ_i g_i(w).   (B.84)

In this case, if a solution w* exists, then the following FONC, called the Kuhn–Tucker (KT) conditions, hold:

∇J(w*) + Σ_{i=1}^{K} λ*_i ∇g_i(w*) = 0
g_i(w*) ≥ 0
λ*_i ≤ 0,   for i = 1, 2, ..., K
λ*_i g_i(w*) = 0.   (B.85)
A feasible point w* for the minimization problem (B.83) is a regular point if the set of vectors ∇g_i(w*) is linearly independent over the set of indices corresponding to the constraints that hold with equality at the optimal point w*; formally,

∇g_i(w*),   i ∈ I_0,   for I_0 ≜ {i ∈ [1, K] : g_i(w*) = 0}.   (B.86)

In (B.85) we have assumed that the first derivatives ∇J(w) and ∇g(w) exist and that w* is a regular point, or that the constraints satisfy the regularity conditions. Moreover, a point w ∈ Ω ⊆ ℝ^M is called a feasible point, and the optimization problem is called consistent, if the set of feasible points is nonempty. A feasible point w* is a local minimizer if J(w*) is a minimum on the set of feasible points.

A point (w*, λ*) at which the KT conditions hold is called a saddle point for the Lagrangian function if J(w) is convex and all g_i(w) are concave. At the saddle point the Lagrangian satisfies the inequalities

L(w*, λ) ≤ L(w*, λ*) ≤ L(w, λ*).   (B.87)

So, for the Lagrange function a minimum exists with respect to w and a maximum with respect to λ. Note also that the last of the conditions (B.85), that is, λ*_i g_i(w*) = 0, i = 1, 2, ..., K, is called the complementary slackness condition.
Example Consider the problem

min_{w} (w₁² + w₂²)   s.t.   −(1/4)(w₁ − 2)² − (w₂ − 2)² + 1 ≥ 0.   (B.88)

The KT conditions are

[2w₁  2w₂]^T − λ[(1/2)(2 − w₁)  2(2 − w₂)]^T = 0
−(1/4)(w₁ − 2)² − (w₂ − 2)² + 1 ≥ 0
λ ≥ 0
λ(1 − (1/4)(w₁ − 2)² − (w₂ − 2)²) = 0,   (B.89)

as geometrically illustrated in Fig. B.10.
Calculation of the solution with the KT conditions: from the stationarity condition

[2w₁  2w₂]^T − λ[(1/2)(2 − w₁)  2(2 − w₂)]^T = 0   (B.90)

it follows that

w₁ = 2λ/(4 + λ),   w₂ = 2λ/(1 + λ).   (B.91)

For λ = 0, one has w₁ = 0 and w₂ = 0, which, however, is not a feasible solution, as the constraint conditions (B.88) are not met. It follows that λ must necessarily be positive. Substituting the values (B.91) into the constraint,
Fig. B.10 At the optimum point, the surface of the CF J(w) = w₁² + w₂² is tangent to the curve of the constraint g(w) = −(1/4)(w₁ − 2)² − (w₂ − 2)² + 1
−(1/4)(2λ/(4 + λ) − 2)² − (2λ/(1 + λ) − 2)² + 1 ≥ 0   (B.92)

and solving for the equality with λ > 0, we obtain λ = 1.8, for which the optimum point is w*₁ = 0.61, w*₂ = 1.28.
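The value λ ≈ 1.8 can be checked numerically; the following Python sketch (an illustration, not from the text) substitutes (B.91) into the active constraint (B.92) and solves for its positive root by bisection.

```python
def g_active(lam):
    # constraint of (B.88) evaluated on the stationary path (B.91)
    w1 = 2.0 * lam / (4.0 + lam)
    w2 = 2.0 * lam / (1.0 + lam)
    return -(w1 - 2.0) ** 2 / 4.0 - (w2 - 2.0) ** 2 + 1.0

# g_active is increasing for lam > 0, so bisect for its positive root
lo, hi = 1.0, 3.0          # g_active(1) < 0 < g_active(3)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if g_active(mid) < 0.0:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)
w1 = 2.0 * lam / (4.0 + lam)
w2 = 2.0 * lam / (1.0 + lam)
```

The root is λ ≈ 1.8, giving w₁ ≈ 0.61 and w₂ ≈ 1.28, in agreement with the values quoted above (to the printed precision).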
B.3.2.3 Optimization Problem with Mixed Constraints: Karush–Kuhn–Tucker Conditions

The KT conditions are generalized by the more general Karush–Kuhn–Tucker (KKT) conditions, which take into account equality and inequality constraints of the most general forms h_i(w) = 0, g_i(w) ≥ 0, and f_i(w) ≤ 0. The KKT conditions are necessary for a solution in nonlinear programming to be optimal, provided some regularity conditions are satisfied.
In the presence of equality and inequality constraints the nonlinear optimization problem can be written as

min J(w)   s.t.   { l_i(w) ≤ b_i,   i = 1, ..., K_l
                    g_i(w) ≥ b_i,   i = 1, ..., K_g
                    h_i(w) = b_i,   i = 1, ..., K_e,   (B.93)

where J(w), l_i(w), g_i(w), and h_i(w), for all i, have continuous first-order partial derivatives on some subset Ω ⊆ ℝ^M. Let

λ ∈ ℝ^K = [κ₁ ··· κ_{K_l}  σ₁ ··· σ_{K_g}  υ₁ ··· υ_{K_e}]^T,   (B.94)

with K = K_l + K_g + K_e, be the vector containing all the Lagrange multipliers, and

f(w) = [l(w)  g(w)  h(w)]^T   (B.95)

a vector of functions containing all the inequality and equality constraints; for the problem (B.93) the Lagrangian assumes the forms
L(w, λ) = J(w) + Σ_{i=1}^{K} λ_i (f_i(w) − b_i)
        = J(w) + [κ σ υ][l(w) g(w) h(w)]^T = J(w) + λ^T f(w),   (B.96)
where the vectors κ, σ, and υ are called dual variables. Further, suppose that w* is a regular point for the problem. If w* is a local minimum that satisfies some regularity conditions, then there exists a constant vector λ* such that (KKT conditions)

∇J(w*) + Σ_{i=1}^{K} λ*_i ∇f_i(w*) = 0   (B.97)

and

κ*_i ≥ 0,   i = 1, 2, ..., K_l
σ*_i ≤ 0,   i = 1, 2, ..., K_g
υ*_i of arbitrary sign,   i = 1, 2, ..., K_e
λ*_i = 0,   i ∈ I_0
λ*_i (f_i(w*) − b_i) = 0,   i = 1, 2, ..., K_l + K_g,   (B.98)

where I_0 denotes the set of indices i ∈ [1, K_l + K_g] for which the inequalities are satisfied at w* as strict inequalities.
If the functions J(w) and f_i(w) are convex (λ*_i > 0) or concave (λ*_i < 0), for i = 1, 2, ..., K, then the point (w*, λ*) is a saddle point of the Lagrangian function (B.96), and w* is a global minimizer of the problem (B.93).

Observe that when only equality constraints are present, h_i(w) = b_i, i = 1, 2, ..., K, the above condition simplifies to

∇J(w*) + Σ_{i=1}^{K} υ*_i ∇h_i(w*) = 0   (B.99)

and the conditions (B.98) are vacuous.
Remarks The KKT conditions provide that the intersection of the set of feasible directions with the set of descent directions coincides with the intersection of the set of feasible directions for the linearized constraints with the set of descent directions.

To ensure that the necessary KKT conditions identify a local minimum point, the assumption of regularity of the constraints must be satisfied. In general, one may require the regularity of all admissible solutions but, in practice, it is sufficient that the regularity conditions are satisfied only at such a point.
In some cases, the necessary conditions are also sufficient for optimality. This is the case when the objective function J and the inequality constraints l_i, g_i are continuously differentiable convex functions and the equality constraints h_j are affine functions. Moreover, the broadest class of functions for which the KKT conditions guarantee global optimality are the so-called invex functions.

The invex functions, which represent a generalization of convex functions, are defined as differentiable vector functions r(w) for which there exists a vector-valued function q(w, u) such that

r(w) − r(u) ≥ q(w, u)^T ∇r(u)   ∀ w, u.   (B.100)

In other words, a function r(w) is an invex function iff each stationary point (a point where the derivative is zero) is a global minimum point.

So, if the equality constraints are affine functions and the inequality constraints and the objective function are continuously differentiable invex functions, then the KKT conditions are sufficient for global optimality.
B.3.3 Dual Problem Formulation
Consider the previously treated optimization problem (Sect. B.3.2.2), with multiple inequality constraints (B.83) and Lagrangian (B.84), with a convex objective function J(w) and concave constraint functions g_i(w), here called the primal inequality-constrained problem.

For this problem, at the saddle point the Lagrangian satisfies the inequalities (B.87), that is, L(w*, λ) ≤ L(w*, λ*) ≤ L(w, λ*), and the following properties hold:

∇_w L(w*, λ*) = 0
∇_λ L(w*, λ*) = 0
∇_w L(w, λ*) ≥ 0
∇_λ L(w*, λ) ≤ 0.   (B.101)
Note that, since the Lagrangian exhibits a minimum with respect to w and a maximum with respect to λ, we can reformulate the primal inequality-constrained problem (B.83), (B.84) as the min–max problem of finding a vector w* which solves

min_{w∈Ω} max_{λ_i≥0} L(w, λ) = min_{w∈Ω} max_{λ_i≥0} {J(w) + Σ_{i=1}^{K} λ_i g_i(w)}.   (B.102)
The above expression allows us to transform the primal min–max problem (B.102) into an equivalent dual max–min problem defined as
max_{w∈Ω} L(w, λ)   s.t.   ∇J(w) + Σ_{i=1}^{K} λ_i ∇g_i(w) = 0,   λ_i ≥ 0.   (B.103)

Assuming that there is a unique minimum (w*, λ*) for the problem min_{w∈Ω} L(w, λ), then for each fixed vector λ ≥ 0 we can define a Lagrange function in terms of the Lagrange multipliers λ alone as

L(λ) ≜ min_{w∈Ω} L(w, λ).   (B.104)

The optimization problem can now be defined, in a simpler and more elegant dual form, as

max L(λ)   s.t.   λ_i ≥ 0,   i = 1, 2, ..., K,   (B.105)

where the Lagrange multipliers λ are called dual variables and L(λ) is called the dual objective function. So, letting g(w) = [g₁(w) ··· g_K(w)]^T be the vector containing the constraint functions, we obtain the simple relation

∇_λ L(λ) = g(w(λ)).   (B.106)
The dual form may or may not be simpler than the original (primal) optimization. In some particular cases, when the problem presents some special structure, the dual problem can be easier to solve. For example, the dual problem can show some advantage for separable and partially separable problems.
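A minimal worked example (assumed, not from the text) of the dual form (B.104)–(B.105): for min w² s.t. w ≥ 1, with the constraint written as g(w) = 1 − w ≤ 0 so that λ ≥ 0, the inner minimization gives w(λ) = λ/2 and L(λ) = λ − λ²/4, which is maximized at λ* = 2, w* = 1.

```python
# primal (assumed example): min w^2  s.t.  w >= 1; constraint as g(w) = 1 - w <= 0
def dual(lam):
    w = lam / 2.0                    # argmin_w of L(w, lam) = w^2 + lam*(1 - w)
    return w ** 2 + lam * (1.0 - w)  # dual objective L(lam), cf. (B.104)

# maximize the (concave) dual over lam >= 0 by a coarse grid search
lam_star = max((i * 0.001 for i in range(4001)), key=dual)
w_star = lam_star / 2.0
```

Here the dual optimum L(λ*) = 1 equals the primal optimum J(w*) = 1, a case of strong duality for this convex problem.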
Appendix C: Elements of Random Variables,
Stochastic Processes, and Estimation Theory
C.1 Random Variables
A random variable (RV) (or stochastic variable) is a variable that can assume different values depending on some random phenomenon [16–23].

Definition of RV (Papoulis [16]) An RV is a number x(ζ) ∈ (ℝ, ℂ) assigned to every outcome ζ ∈ S of an experiment. This number can be the gain in a game of chance, the voltage of a random source, ..., or any numerical quantity that is of interest in the performance of the experiment.

An RV is indicated as x(ζ), y(ζ), z(ζ), ... or x₁(ζ), x₂(ζ), ..., and can be defined with discrete or continuous values. For example, consider a poll of students at a certain university. The set of all students is denoted by S = (ζ₁, ζ₂, ..., ζ_N) while, as shown in Fig. C.1, the discrete RVs x₁(ζ) and x₂(ζ) represent, respectively, the age (in years) and the number of passed exams, while the continuous RVs x₄(ζ) and x₅(ζ) represent, respectively, the height and the weight of the students.

In other words, the RV x(ζ) ∈ ℝ represents a function with domain S, defined as an abstract probability space (or universal set of experimental results), of possibly infinite dimension (e.g., the 52-card deck, the six faces of a die, the value of a voltage generator, the temperature of an oven, etc.), which assigns to each ζ_k ∈ S a number, i.e., x : S → ℝ.

More formally, the result of the experiment ζ_k is defined as a stochastic event or occurrence ζ_k ∈ F ⊆ S, where the subset F, called events, is a σ-field, which represents a subset collection of S with the closure property.⁶
Remark The value related to a specific event or occurrence of an RV is denoted as
xðζkÞ ¼ x (e.g., if the kth student is 22 years old, x1ðζkÞ ¼ 22). Instead, the
⁶A σ-field (or σ-algebra or Borel field) is a collection of sets on which a given measure is defined. This concept is important in probability theory, where it is interpreted as a collection of events to which probabilities can be attributed.
notation x(ζ) = x is interpreted as an event defined by all occurrences of ζ such that x(ζ) = x. For example, x₂(ζ) = 15 denotes all the students who have passed 15 exams. Moreover, in the case of continuous RVs, the notation x(ζ) ≤ x or a ≤ x(ζ) ≤ b is interpreted as an interval. For example, 1.72 ≤ x₄(ζ) ≤ 1.82 denotes all the students with a height between 1.72 and 1.82 [m]. Indeed, for continuous RVs, a fixed value is meaningless and a range of values should always be considered (e.g., x₄(ζ) = 1.8221312567125367 is, obviously, meaningless).
In the study of RVs an important question concerns the probability⁷ related to an event ζ_k ∈ S, which can be defined by a nonnegative quantity denoted as p(ζ_k), k = 1, 2, .... However, it should be noted that the abstract probability space may not be a metric space. So, rather than referring to the elements ζ_k ∈ S, we consider the RVs x(ζ) ∈ ℝ associated with the events that, by definition, are defined on a metric space. For example, what is the probability that x₁(ζ) ≤ 24 or that x₂(ζ) = 20? Or that x₄(ζ) ≤ 1.85 or 71.3 ≤ x₅(ζ) ≤ 90.2? For this reason, the predictability of the events x(ζ_k) = x or, considering the continuous case, x(ζ) ≤ x, or a ≤ x(ζ) ≤ b, ..., is manipulated through a probability function p(·), characterized by the following axiomatic properties:

p(x(ζ) = +∞) = 0
p(x(ζ) = −∞) = 0.   (C.1)

From the above definitions random phenomena can be characterized by (1) the definition of an abstract probability space described by the triple (S, F, p) and (2) the axiomatic definition of the probability of an RV.
Fig. C.1 Example of RVs defined over the set of students S for a scholastic poll: student's age x₁(ζ), number of exams x₂(ζ), eye color x₃(ζ), height x₄(ζ), and weight x₅(ζ)
⁷From the Latin probare (test, try) and ilis (be able to).
Remark In the context of RVs, care must be taken with the notation used. Sometimes RVs are indicated as X(ζ) or as x(ζ) (as in [16]). In these notes we prefer using the italic font x(ζ) for an RV, bold font x(ζ) for RV vectors, and the form x(t, ζ) or x(t, ζ) (x[n, ζ] or x[n, ζ] in DT) for stochastic processes. Moreover, a complex RV z(ζ) ∈ ℂ is defined as the sum z(ζ) = x(ζ) + j·y(ζ), where x(ζ), y(ζ) ∈ ℝ.
C.1.1 Distributions and Probability Density Function

The elements of an event x(ζ) ≤ x change depending on the number x; it follows that the probability of this event, indicated as p(x(ζ) ≤ x), is a function of x itself.

Given an RV x(ζ), we define the probability density function (pdf), denoted as f_x(x), as a nonnegative integrable function such that

p(a ≤ x(ζ) ≤ b) = ∫_a^b f_x(x) dx,   probability density function.   (C.2)
Therefore, from the basic axioms (C.1) it is possible to demonstrate that the probability of the sure event can be written as

∫_{−∞}^{+∞} f_x(x) dx = 1.   (C.3)

Moreover, the event x(ζ) ≤ x is characterized by the cumulative density function (cdf) defined as

F_x(x) = p(x(ζ) ≤ x),   for −∞ < x < ∞   (C.4)

or, from (C.2),

F_x(x) = ∫_{−∞}^{x} f_x(υ) dυ,   cumulative density function.   (C.5)

In fact, we have that f_x(x) = dF_x(x)/dx, and the value of the cdf represents a measure of the probability p(x(ζ) ≤ x).

For the cdf the following properties apply:

0 ≤ F_x(x) ≤ 1;   F_x(−∞) = 0;   F_x(+∞) = 1
F_x(x₁) ≤ F_x(x₂)   if x₁ < x₂.

It follows that the cdf is a nondecreasing monotone function.
Note that f_x(x) is not a probability measure. To obtain the probability of the event x < x(ζ) ≤ x + Δx, we must multiply the pdf by the interval Δx. That is,

f_x(x)Δx ≈ ΔF_x(x) ≜ F_x(x + Δx) − F_x(x) = p(x < x(ζ) ≤ x + Δx).   (C.6)
Some examples of continuous, discrete, and mixed pdfs and cdfs are reported in Fig. C.2.
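Relation (C.6) can be checked numerically; the following sketch (an illustration, not from the text) uses a unit-rate exponential RV as an assumed example, with f_x(x) = e^(−x) and F_x(x) = 1 − e^(−x).

```python
import math

f = lambda x: math.exp(-x)        # assumed pdf: unit-rate exponential
F = lambda x: 1.0 - math.exp(-x)  # its cdf, satisfying F'(x) = f(x)

x, dx = 1.0, 1e-5
lhs = f(x) * dx                   # f_x(x) * Delta x
rhs = F(x + dx) - F(x)            # Delta F_x(x), as in (C.6)
```

For small Δx the two sides agree to within O(Δx²), and F(x) → 1 as x → ∞, consistent with (C.3).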
C.1.2 Statistical Averages
The pdf completely characterizes an RV. However, in many situations it is convenient or necessary to represent the RV more concisely through a few specific parameters that describe its average behavior. These numbers, defined as statistical averages or moments, are determined by the mathematical expectation. Note that even if, formally, knowledge of the pdf is necessary for the determination of the statistical averages, these averages can somehow be estimated without explicit knowledge of the pdf.
C.1.2.1 Expectation Operator

The mathematical expectation, usually indicated as E{x(ζ)}, is a number defined by the following integral:

E{x(ζ)} = ∫_{−∞}^{∞} x f_x(x) dx,   (C.7)
Fig. C.2 Example trends of the cumulative distribution function (top) and of the probability density function (bottom) for a discrete RV (left), a continuous RV (middle), and a mixed discrete–continuous RV (right)
where E{·} indicates the expected value or the average value or mean value. The expected value is also indicated as μ = E{x(ζ)}.
C.1.2.2 Moments and Central Moments

Considering a function of an RV, denoted as g[x(ζ)], the expected value becomes

E{g[x(ζ)]} = ∫_{−∞}^{∞} g(x) f_x(x) dx.   (C.8)

In the case that g[x(ζ)] = x^m(ζ) (raising to the mth power), the previous expression is defined as the moment of order m:

E{x^m(ζ)} = ∫_{−∞}^{∞} x^m f_x(x) dx.   (C.9)

The calculation of the moment is of particular significance when the expected value μ is removed from the RV, i.e., considering the RV (x(ζ) − μ). In this case the statistical function, called the central moment, is defined as

E{(x(ζ) − μ)^m} = ∫_{−∞}^{∞} (x − μ)^m f_x(x) dx.   (C.10)
C.1.3 Statistical Quantities Associated with Moments of Order m

The moments computed with the previous expressions are of particular significance for certain orders. For example, the first-order moment (m = 1) is just the expected value μ defined by (C.7). Generalizing, moments and central moments of any order can be written as

r_x^{(m)} = E{x^m(ζ)},   c_x^{(m)} = E{(x(ζ) − μ)^m}.   (C.11)

In particular, note that c_x^{(0)} = 1 and c_x^{(1)} = 0; moreover, it is obvious that for zero-mean processes the central moments are identical to the moments.
C.1.3.1 Variance and Standard Deviation

We define the variance, indicated as σ_x², as the value of the second-order central moment

σ_x² = c_x^{(2)} = E{[x(ζ) − μ]²} = ∫_{−∞}^{∞} (x − μ)² f_x(x) dx,   (C.12)

where the positive constant σ_x = √(σ_x²) is defined as the standard deviation of x.

Figure C.3 shows the pdfs of two overlapped Gaussian (or normal) processes with representations of the expected value and standard deviation. (The expression of the normal distribution pdf is given in Sect. C.1.5.2.)
C.1.3.2 The Third- and Fourth-Order Moments: Skewness and Kurtosis

The skewness is the statistical quantity associated with the third-order central moment, defined by the following relation:

k_x^{(3)} ≜ E{((x(ζ) − μ)/σ_x)³} = c_x^{(3)}/σ_x³.   (C.13)

The skewness, as illustrated in Fig. C.4a for k_x^{(3)} > 0 and k_x^{(3)} < 0, represents the degree of asymmetry of a generic pdf. In fact, in the case where the pdf is symmetric, the skewness is zero.

The kurtosis is a statistical quantity related to the fourth-order moment, defined as

k_x^{(4)} ≜ E{((x(ζ) − μ)/σ_x)⁴} − 3 = c_x^{(4)}/σ_x⁴ − 3.   (C.14)

Note that the term −3, as we shall see later, provides a zero kurtosis in the case of Gaussian distribution processes. As illustrated in Fig. C.4b, for k_x^{(4)} > 0, there is a "narrow" distribution trend that is called super-Gaussian. If k_x^{(4)} < 0, the trend of the pdf is more "broad" and is called sub-Gaussian.
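Definitions (C.13) and (C.14) apply directly to a discrete pmf as well; a small sketch (assumed example, not from the text: a fair die) shows zero skewness for a symmetric distribution and a negative (sub-Gaussian) kurtosis.

```python
# skewness and kurtosis of a fair die (assumed example) from its pmf
vals = [1, 2, 3, 4, 5, 6]
p = 1.0 / 6
mu = sum(v * p for v in vals)                       # mean = 3.5
c = lambda m: sum((v - mu) ** m * p for v in vals)  # central moment, cf. (C.10)
var = c(2)                                          # 35/12
skew = c(3) / var ** 1.5                            # (C.13): 0 by symmetry
kurt = c(4) / var ** 2 - 3.0                        # (C.14): about -1.27
```

The negative kurtosis classifies this flat, bounded distribution as sub-Gaussian, in line with the discussion above.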
Fig. C.3 Typical trends of Gaussian or normal pdfs with the indication of the expected value and standard deviation
C.1.3.3 Chebyshev's Inequality

Given an RV x(ζ) with mean value μ and standard deviation σ_x, for any real number k > 0 the following inequality is true:

p(|x(ζ) − μ| ≥ kσ_x) ≤ 1/k²,   k > 0.   (C.15)

That is, an RV deviates from its mean value by at least k standard deviations with probability less than or equal to 1/k². Chebyshev's inequality (C.15) is a useful result for a generic distribution f_x(x), regardless of its form.
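Inequality (C.15) can be checked exactly for a discrete RV; the sketch below (assumed example, not from the text: a fair die) compares the exact tail probability against the bound 1/k².

```python
# exact check of Chebyshev's inequality (C.15) for a fair-die RV
vals = [1, 2, 3, 4, 5, 6]
p = 1.0 / 6
mu = sum(v * p for v in vals)                        # mean = 3.5
sigma = sum((v - mu) ** 2 * p for v in vals) ** 0.5  # standard deviation

def tail(k):
    # p(|x - mu| >= k*sigma), computed exactly from the pmf
    return sum(p for v in vals if abs(v - mu) >= k * sigma)
```

For every k > 0 the exact tail never exceeds 1/k²; e.g., for k = 1.1 the tail is 1/3 (outcomes 1 and 6) against a bound of about 0.83.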
C.1.3.4 Characteristic Function and Cumulants

Consider the sign-reversed Laplace (or Fourier) transform of the pdf f_x(x), which, in the context of statistics, is called the characteristic function, defined as

Φ_x(s) = ∫_{−∞}^{∞} f_x(x) e^{sx} dx,   (C.16)

where s is the complex Laplace variable.⁸

Equation (C.16) can be interpreted as the moment-generating function. In fact, the Taylor series development of (C.16) around s = 0 yields
Fig. C.4 Typical trends of distributions with positive and negative (a) skewness; (b) kurtosis (zero: Gaussian or normal distribution)
⁸The complex Laplace variable can be written s = α + jξ. Note that the complex part jξ should not be interpreted as a frequency.
Φ_x(s) ≜ E{e^{sx(ζ)}} = E{1 + sx(ζ) + (sx(ζ))²/2! + ··· + (sx(ζ))^m/m! + ···}
       = 1 + sμ + (s²/2!) r_x^{(2)} + ··· + (s^m/m!) r_x^{(m)} + ···,   (C.17)

which is defined in terms of all the moments of the RV x(ζ). In addition, we can note that differentiating (C.17) at s = 0 yields

r_x^{(m)} = d^m Φ_x(s)/ds^m |_{s=0},   for m = 1, 2, ....   (C.18)
The cumulants are statistical descriptors, similar to the moments, which allow having "more information" in the case of higher-order statistics.

The cumulant-generating function is defined as the logarithm of the moment-generating function:

Ψ_x(s) ≜ ln Φ_x(s).   (C.19)

Hence, we define the mth-order cumulant as

κ_x^{(m)} ≜ d^m Ψ_x(s)/ds^m |_{s=0},   for m = 1, 2, ....   (C.20)

From the above definition we can see that, for a zero-mean RV, the first five cumulants are

κ_x^{(1)} = r_x^{(1)} = μ = 0
κ_x^{(2)} = r_x^{(2)} = σ_x²
κ_x^{(3)} = c_x^{(3)}
κ_x^{(4)} = c_x^{(4)} − 3σ_x⁴
κ_x^{(5)} = c_x^{(5)} − 10 c_x^{(3)} σ_x².   (C.21)

Note that the first two are identical to the central moments.
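The relations (C.21) can be verified on a zero-mean discrete RV; for the binary RV x ∈ {−1, +1} with p = 1/2 (an assumed example, not from the text), they give κ₂ = 1, κ₃ = 0, and κ₄ = 1 − 3 = −2.

```python
# cumulants via (C.21) for the zero-mean binary RV x in {-1, +1}, p = 1/2
vals, p = [-1.0, 1.0], 0.5
c = lambda m: sum(v ** m * p for v in vals)  # central moment (the mean is zero)
var = c(2)                   # kappa_2 = sigma^2 = 1
k3 = c(3)                    # kappa_3 = 0 (symmetric pmf)
k4 = c(4) - 3.0 * var ** 2   # kappa_4 = 1 - 3 = -2
```

The strongly negative κ₄ marks this binary RV as sub-Gaussian, consistent with the kurtosis discussion in Sect. C.1.3.2.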
C.1.4 Dependent RVs: The Joint and Conditional Probability Distribution

If there is some dependence between two (or more) RVs, one needs to study how the probability of one affects the other and vice versa.

For example, consider the experiment described in Fig. C.1, where the RVs x₄ and x₅, representing, respectively, the height and weight of students, are statistically dependent, as are the age x₁ and the number of exams x₂. In probabilistic terms, this means that tall students are probably heavier or, considering the random variables x₁ and x₂, that younger students are likely to have passed fewer exams.
In terms of pdfs, given two RVs x(ζ) and y(ζ), we define the joint pdf, denoted as f_xy(x, y), as the pdf of the event obtained by the intersection between the events a ≤ x(ζ) ≤ b and c ≤ y(ζ) ≤ d, i.e., the distribution probability of the occurrence of the two events. Therefore, extending the definition (C.2), the joint pdf can be defined by the following integral:

p(a ≤ x(ζ) ≤ b, c ≤ y(ζ) ≤ d) = ∫_c^d ∫_a^b f_xy(x, y) dx dy,   joint pdf   (C.22)

namely, the probability that x(ζ) and y(ζ) assume values inside the intervals [a, b] and [c, d], respectively. Let us also define f_x|y(x|y), the conditional pdf of x(ζ) given y(ζ), such that it is possible to evaluate the probability of the event p(a ≤ x(ζ) ≤ b, y(ζ) = c) as

p(a ≤ x(ζ) ≤ b, y(ζ) = c) = ∫_a^b f_x|y(x|y) dx,   conditional pdf   (C.23)

i.e., the probability that x(ζ) assumes a value inside the interval [a, b] given that y(ζ) = c.

Let f_y(y) be the pdf of y(ζ), called in this context the marginal pdf. From the previous expressions, the joint pdf, in the case that x(ζ) is conditioned by y(ζ), can be written as f_xy(x, y) = f_x|y(x|y) f_y(y). This expression indicates how the probability of the event x(ζ) is conditioned by the probability of y(ζ). Moreover, let f_x(x) be the marginal pdf of x(ζ); by simple symmetry it follows that the joint pdf is also f_xy(x, y) = f_y|x(y|x) f_x(x). So we can now relate the joint and conditional pdfs by Bayes' rule, which states that

f_xy(x, y) = f_x|y(x|y) f_y(y) = f_y|x(y|x) f_x(x).   Bayes' rule   (C.24)

Moreover, we have

∫_x ∫_y f_xy(x, y) dy dx = 1.   (C.25)

Definition Two (or more) RVs are independent iff

f_x|y(x|y) = f_x(x)   and   f_y|x(y|x) = f_y(y)   (C.26)

or, considering (C.24), iff

f_xy(x, y) = f_x(x) f_y(y).   (C.27)
Property If two RVs are independent they are necessarily uncorrelated.
The covariance and the correlation of jointly distributed RVs are respectively defined as

c_xy^{(2)} = E{x(ζ)y(ζ)} − E{y(ζ)}E{x(ζ)}   (C.28)
r_xy^{(2)} = c_xy^{(2)}/(σ_x σ_y).   (C.29)

Two RVs x(ζ), y(ζ) are uncorrelated iff their cross-correlation (covariance) is zero. Consequently, if (C.27) holds, then E{x(ζ)y(ζ)} = E{y(ζ)}E{x(ζ)}, and by (C.28) their cross-correlation is zero.

Finally, note that if two RVs are uncorrelated, they are not necessarily independent.
C.1.5 Typical RV Distributions

C.1.5.1 Uniform Distribution

The uniform distribution is appropriate for the description of an RV with equiprobable events in the interval [a, b]. The pdf of the uniform distribution is defined as

f_x(x) = { 1/(b − a),   a ≤ x ≤ b
           0,           elsewhere.   (C.30)

The corresponding cdf is

F_x(x) = ∫_{−∞}^{x} f_x(v) dv = { 0,                 x < a
                                  (x − a)/(b − a),   a ≤ x ≤ b
                                  1,                 x > b.   (C.31)

Its characteristic function is

Φ_x(s) = (e^{sb} − e^{sa})/(s(b − a)).   (C.32)

Finally, the mean value and the variance are

μ = (a + b)/2   and   σ_x² = (b − a)²/12.   (C.33)
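Formulas (C.33) can be verified by numerically integrating the uniform pdf (C.30); a sketch (an illustration, not from the text) with the assumed interval [a, b] = [2, 5] and a midpoint rule:

```python
# midpoint-rule check of (C.33) for a uniform pdf on the assumed interval [2, 5]
a, b, n = 2.0, 5.0, 20000
f = 1.0 / (b - a)                              # pdf value inside [a, b], (C.30)
dx = (b - a) / n
xs = [a + (i + 0.5) * dx for i in range(n)]    # midpoints of the subintervals
mu = sum(x * f * dx for x in xs)               # should equal (a + b)/2 = 3.5
var = sum((x - mu) ** 2 * f * dx for x in xs)  # should equal (b - a)^2/12 = 0.75
```

The numerical mean and variance match (a + b)/2 and (b − a)²/12 to within the quadrature error.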
C.1.5.2 Normal Distribution

The normal distribution, also called the Gaussian distribution, is one of the most useful and appropriate descriptions of many statistical phenomena.

The normal distribution pdf, already illustrated in Fig. C.3, with mean value μ and standard deviation σ_x, is

f_x(x) = (1/√(2πσ_x²)) e^{−(x−μ)²/(2σ_x²)}   (C.34)

with a CF

Φ_x(s) = e^{μs + (1/2)σ_x²s²}.   (C.35)

From the previous equations, an RV with normal pdf, often referred to as N(μ, σ_x²), is defined by its mean value μ and its variance σ_x². Note also that the moments of higher order can be determined in terms of only the first two moments. In fact, we have (Fig. C.5)

c_x^{(m)} = E{|x(ζ) − μ|^m} = { 1·3·5···(m − 1)σ_x^m,   for m even
                                0,                       for m odd.   (C.36)

In particular, the fourth-order moment is c_x^{(4)} = 3σ_x⁴ and for the Gaussian distribution the kurtosis is zero.

Remark From (C.36) we observe that an RV with Gaussian distribution is fully characterized only by its mean value and variance and that the moments of higher order do not contain any additional useful information.
C.1.6 The Central Limit Theorem

An important theorem is the central limit theorem, whose statement says that the sum of N independent RVs with the same distribution (i.e., iid) with finite variance tends to the normal distribution as N → ∞.

Fig. C.5 Qualitative behavior of some typical distributions: uniform, Gaussian, super-Gaussian, and sub-Gaussian

A generalization of the theorem, due to Gnedenko and Kolmogorov, valid for a wider class of distributions, states that the sum of RVs with slowly decaying power tails, decreasing as 1/|x|^{α+1} with α ≤ 2, tends to the Levy alpha-stable distribution as N → ∞.
C.1.7 Random Variables Vectors
A random vector or RV vector is defined as a collection of RVs of the type

\mathbf{x}(\zeta) = [x_0(\zeta)\;\; x_1(\zeta)\;\; \cdots]^T.
By a generalization of the definition (C.7), the expectation of a random vector is also a vector that, omitting the writing of the event (\zeta), is defined as

\boldsymbol{\mu} = E\{\mathbf{x}\} = [E\{x_0\}\;\; E\{x_1\}\;\; \cdots]^T = [\mu_0\;\; \mu_1\;\; \cdots]^T. \qquad (C.37)
C.1.8 Covariance and Correlation Matrix
In the case of a random vector, the second-order statistic is a matrix. Therefore, the covariance matrix is defined as

\mathbf{C}_x = E\left\{ (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \right\}, \quad \text{Covariance matrix.} \qquad (C.38)
For example, given a two-dimensional random vector \mathbf{x} = [x_0\;\; x_1]^T, the covariance is defined as

\mathbf{C}_x = E\left\{ \begin{bmatrix} x_0 - \mu_0 \\ x_1 - \mu_1 \end{bmatrix} \begin{bmatrix} (x_0 - \mu_0) & (x_1 - \mu_1) \end{bmatrix} \right\} = \begin{bmatrix} E\{|x_0 - \mu_0|^2\} & E\{(x_0 - \mu_0)(x_1 - \mu_1)\} \\ E\{(x_1 - \mu_1)(x_0 - \mu_0)\} & E\{|x_1 - \mu_1|^2\} \end{bmatrix} \qquad (C.39)
so the autocovariance matrix is symmetric

\mathbf{C}_x = \mathbf{C}_x^T, \qquad (C.40)

where the superscript "T" indicates matrix transposition. Moreover, the autocorrelation matrix is defined as

\mathbf{R}_x = E\{\mathbf{x}\mathbf{x}^T\}, \quad \text{Autocorrelation matrix.} \qquad (C.41)
For the two-dimensional RV previously defined, it is then

\mathbf{R}_x = \begin{bmatrix} E\{|x_0|^2\} & E\{x_0 x_1\} \\ E\{x_1 x_0\} & E\{|x_1|^2\} \end{bmatrix} \qquad (C.42)
and

\mathbf{R}_x = \mathbf{R}_x^T. \qquad (C.43)
Property The autocorrelation matrix of an RV vector \mathbf{x} is always nonnegative definite, i.e., for each vector \mathbf{w} = [w_0\;\; w_1\;\; \cdots\;\; w_{M-1}]^T the quadratic form \mathbf{w}^T\mathbf{R}_x\mathbf{w} is positive semi-definite or nonnegative:

\mathbf{w}^T \mathbf{R}_x \mathbf{w} \geq 0. \qquad (C.44)
Proof Consider the inner product between \mathbf{x} and \mathbf{w}

\alpha = \mathbf{w}^T\mathbf{x} = \mathbf{x}^T\mathbf{w} = \sum_{k=0}^{M-1} w_k x_k. \qquad (C.45)

The mean squared value of the RV \alpha is defined as

E\{\alpha^2\} = E\{\mathbf{w}^T\mathbf{x}\mathbf{x}^T\mathbf{w}\} = \mathbf{w}^T E\{\mathbf{x}\mathbf{x}^T\} \mathbf{w} = \mathbf{w}^T\mathbf{R}_x\mathbf{w}. \qquad (C.46)

Since, by definition, \alpha^2 \geq 0, it follows that \mathbf{w}^T\mathbf{R}_x\mathbf{w} \geq 0.
Q.E.D.
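The nonnegativity of the quadratic form (C.44) can be illustrated numerically. A minimal NumPy sketch (an illustration added here, not from the original text; the sizes and the sample-estimate construction of R are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# A sample estimate of the autocorrelation matrix R = E{x x^T}, built from
# K observed random vectors; by construction it is symmetric and positive
# semi-definite, so w^T R w >= 0 for every w (cf. (C.44)).
M, K = 8, 1000
X = rng.standard_normal((K, M))       # K realizations of an M-dim RV vector
R = (X.T @ X) / K                     # sample autocorrelation matrix

# Quadratic forms for many random test vectors w are never negative.
quad_forms = [w @ R @ w for w in rng.standard_normal((100, M))]
min_qf = min(quad_forms)
```

Since R = (1/K) X^T X, every quadratic form equals ‖Xw‖²/K, which makes the nonnegativity explicit.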
C.1.8.1 Eigenvalues and Eigenvectors of the Autocorrelation Matrix
From geometry (see Sect. A.8), the eigenvalues can be computed by solving the characteristic polynomial p(\lambda), defined as p(\lambda) \triangleq \det(\mathbf{R} - \lambda\mathbf{I}) = 0.
An autocorrelation matrix \mathbf{R} \in \mathbb{R}^{M \times M} is symmetric (Hermitian in the complex case) and positive semi-definite. We know that for this type of matrix the following properties are valid.
1. The eigenvalues \lambda_i of \mathbf{R} are real and nonnegative. In fact, from (A.61) we have that \mathbf{R}\mathbf{q}_i = \lambda_i\mathbf{q}_i, and by left multiplying by \mathbf{q}_i^T, we get

\mathbf{q}_i^T\mathbf{R}\mathbf{q}_i = \lambda_i\, \mathbf{q}_i^T\mathbf{q}_i \;\Rightarrow\; \lambda_i = \frac{\mathbf{q}_i^T\mathbf{R}\mathbf{q}_i}{\mathbf{q}_i^T\mathbf{q}_i} \geq 0, \quad \text{Rayleigh quotient.} \qquad (C.47)
2. The eigenvectors \mathbf{q}_i, i = 0, 1, \ldots, M-1, of \mathbf{R} are orthogonal for distinct values of \lambda_i:

\mathbf{q}_i^T\mathbf{q}_j = 0, \quad \text{for } i \neq j. \qquad (C.48)
3. The matrix \mathbf{R} can always be diagonalized as

\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T, \qquad (C.49)

where \mathbf{Q} = [\mathbf{q}_0\;\; \mathbf{q}_1\;\; \cdots\;\; \mathbf{q}_{M-1}],

\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_0, \lambda_1, \ldots, \lambda_{M-1}) \qquad (C.50)

and \mathbf{Q} is a unitary matrix, i.e., \mathbf{Q}^T\mathbf{Q} = \mathbf{I}.
4. An alternative representation for \mathbf{R} is

\mathbf{R} = \sum_{i=0}^{M-1} \lambda_i\, \mathbf{q}_i\mathbf{q}_i^T = \sum_{i=0}^{M-1} \lambda_i\, \mathbf{P}_i, \qquad (C.51)

where the term \mathbf{P}_i = \mathbf{q}_i\mathbf{q}_i^T is defined as a spectral projection.
5. The trace of the matrix \mathbf{R} is

\mathrm{tr}[\mathbf{R}] = \sum_{i=0}^{M-1} \lambda_i \;\Rightarrow\; \frac{1}{M}\sum_{i=0}^{M-1} \lambda_i = r_{xx}[0] = \sigma_x^2. \qquad (C.52)
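Properties 1–5 can be verified numerically on a sample autocorrelation matrix. A minimal NumPy sketch (not from the original text; the data and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Check properties 1-5 on a sample autocorrelation matrix R:
# real nonnegative eigenvalues, orthonormal eigenvectors, R = Q Lambda Q^T,
# and tr[R] = sum of the eigenvalues (cf. (C.47)-(C.52)).
M = 6
X = rng.standard_normal((5000, M))
R = (X.T @ X) / 5000                    # symmetric, positive semi-definite

lam, Q = np.linalg.eigh(R)              # eigh: for symmetric matrices

ok_nonneg = bool(np.all(lam >= -1e-12))            # property 1
ok_orth = np.allclose(Q.T @ Q, np.eye(M))          # properties 2-3 (unitary Q)
ok_diag = np.allclose(Q @ np.diag(lam) @ Q.T, R)   # property 3
ok_trace = np.isclose(np.trace(R), lam.sum())      # property 5
```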
C.2 Stochastic Processes
Generalizing the concept of RV, a stochastic process (SP) is a rule to assign to each result \zeta a function x(t, \zeta). Hence, an SP is a family of two-dimensional functions of the variables t and \zeta, where the domain is defined over the set of all the experimental results \zeta \in S, while the time variable t represents the set of real numbers t \in \mathbb{R}. If \mathbb{R} represents the real axis of time, then x(t, \zeta) is a continuous-time stochastic process. In the case that the time domain is a set of integers, we have a discrete-time stochastic process, and the time index is denoted by n \in \mathbb{Z}.
In general terms, a discrete-time SP is a time series x[n, \zeta] consisting of all possible sequences of the process. Each individual sequence, corresponding to a specific result \zeta = \zeta_k and indicated as x[n, \zeta_k], represents an RV sequence (indexed by n) that is called a realization or sample sequence of the process.
Since the SP is a function of two variables, there are four possible interpretations:

i) x[n, \zeta] is an SP \Rightarrow n variable, \zeta variable;
ii) x[n, \zeta_k] is an RV sequence \Rightarrow n variable, \zeta fixed;
iii) x[n_k, \zeta] is an RV \Rightarrow n fixed, \zeta variable;
iv) x[n_k, \zeta_k] is a number \Rightarrow n fixed, \zeta fixed.
For clarity of presentation, as usual in many scientific contexts (signal processing, neural networks, etc.), the writing of the parameter \zeta is omitted and, later in the text, the SP x[n, \zeta] is indicated only with x[n] or \mathbf{x}[n] (sometimes bold is omitted), and the sample sequence of the process x[n, \zeta_k] is often simply referred to as x_k[n].

Definition We define a discrete-time stochastic process (DT-SP), denoted as \mathbf{x}[n] \in \mathbb{R}^N, as an RV vector defined as

\mathbf{x}[n] = \left[ x_1[n], x_2[n], \ldots, x_N[n] \right], \qquad (C.53)
where the integer n \in \mathbb{Z} represents the time index. Note, as illustrated in Fig. C.6, that in (C.53) each realization x_k[n] represents an RV sequence of the same process.
C.2.1 Statistical Averages of an SP
The determination of the statistical averages of SPs can be performed exactly as for RVs. In fact, note that for a given fixed temporal index (see interpretation iii) the process reduces to a simple RV, so that it is possible to evaluate all the statistical functions proceeding as in Sect. C.1.2. Similarly, fixing the parameter \zeta and considering two different temporal indexes n_1 and n_2, we are in the presence of joint RVs, so that it is possible to characterize the process by the joint cdf F_x[x_1, x_2; n_1, n_2]. However, in general an SP contains an infinite number of
Fig. C.6 Representation of the stochastic process x[n, \zeta] as a set of realizations x[n, \zeta_1], x[n, \zeta_2], \ldots, x[n, \zeta_N]: fixing n = n_k gives the RV x[n_k, \zeta], while fixing \zeta = \zeta_k gives the sample sequence x[n, \zeta_k]. As usual in the context of DSP, the process sample is simply indicated as x[n]
such RVs; hence, to completely describe, in a statistical sense, an SP, the knowledge of the k-order joint cdf, for every k, is sufficient. It is defined as

F_x[x_1, \ldots, x_k; n_1, \ldots, n_k] = p\left\{ x[n_1] \leq x_1, \ldots, x[n_k] \leq x_k \right\}. \qquad (C.54)
On the other hand, an SP can be characterized by the joint pdf defined as

f_x[x_1, \ldots, x_k; n_1, \ldots, n_k] \triangleq \frac{\partial^k F_x[x_1, \ldots, x_k; n_1, \ldots, n_k]}{\partial x_1\, \partial x_2 \cdots \partial x_k}. \qquad (C.55)
From now on we write the SP simply as x[n] (not in bold).
C.2.1.1 First-Order Moment: Expectation
We define the expected value of an SP x[n], with pdf f(x[n]), as the value of its first-order moment at a given time index n. According to Eq. (C.7), the expected value is defined as

\mu_n = E\{x[n]\}. \qquad (C.56)
Referring to Fig. C.6, and considering the notation x[n, \zeta], the expectation operator E\{\cdot\} represents the ensemble average of the RV: \mu_{n_k} = E\{x[n_k, \zeta]\}.
Equation (C.56) can also be interpreted in terms of relative frequency by the following expression:

\mu_{n_k} = \lim_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} x_j[n_k]. \qquad (C.57)

In other words (see Fig. C.6), the expectation represents the mean value of the set of RVs x[n_k] at a fixed time instant.
If the process is not stationary, i.e., its statistics change in time, its mean value varies over time. So, in general, we have

\mu_n \neq \mu_m, \quad \text{for } n \neq m. \qquad (C.58)
C.2.1.2 Second-Order Moment: Autocorrelation and Autocovariance
We define the autocorrelation, or second-order moment, as the sequence

r[n, m] = E\{x[n]\, x[m]\}. \qquad (C.59)
In terms of relative frequency, Eq. (C.59) can be written as

r[n, m] = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} x_k[n]\, x_k[m]. \qquad (C.60)
The autocorrelation is a measure that indicates the degree of association or dependency between the process at time n and at time m. Moreover, we have that

r[n, n] = E\{x^2[n]\}, \quad \text{average power of the sequence.}
We define the autocovariance, or second-order central moment, as the sequence

c[n, m] = E\left\{ (x[n] - \mu_n)(x[m] - \mu_m) \right\} = r[n, m] - \mu_n\mu_m. \qquad (C.61)
C.2.1.3 Variance and Standard Deviation
Similarly to the definition in Sect. C.1.3.1, the variance of an SP is a value related to the central second-order moment, defined as

\sigma_{x_n}^2 = E\left\{ (x[n] - \mu_n)^2 \right\} = E\{x^2[n]\} - \mu_n^2. \qquad (C.62)

The quantity \sigma_{x_n} is defined as the standard deviation, which represents a measure of the dispersion of the observation x[n] around its mean value \mu_n.
Remark For zero-mean processes, the central moment coincides with the moment. It then follows that \sigma_{x_n}^2 = r[n, n] = E\{x^2[n]\}; in other words, the variance coincides with the signal power.
C.2.1.4 Cross-correlation and Cross-covariance
The statistical relationships between two jointly distributed SPs x[n] and y[n] (i.e., defined over the same result space S) can be described by their joint second-order moments (the cross-correlation and cross-covariance), defined, respectively, as

r_{xy}[n, m] = E\{x[n]\, y[m]\} \qquad (C.63)

c_{xy}[n, m] = E\left\{ (x[n] - \mu_{x_n})(y[m] - \mu_{y_m}) \right\} = r_{xy}[n, m] - \mu_{x_n}\mu_{y_m}. \qquad (C.64)
Moreover, the normalized cross-correlation is defined as

\rho_{xy}[n, m] = \frac{c_{xy}[n, m]}{\sigma_{x_n}\sigma_{y_m}}. \qquad (C.65)
C.2.2 High-Order Moments
In linear systems the high-order moments are rarely used compared with the first- and second-order ones. The interest in higher-order moments is, in fact, increasing for nonlinear systems.
C.2.2.1 Moments of Order m
Generalizing the foregoing for first- and second-order statistics, moments and central moments of any order can be written as

r^{(m)}[n_1, \ldots, n_m] = E\left\{ x[n_1] \cdot x[n_2] \cdots x[n_m] \right\}
c^{(m)}[n_1, \ldots, n_m] = E\left\{ (x[n_1] - \mu_{n_1})(x[n_2] - \mu_{n_2}) \cdots (x[n_m] - \mu_{n_m}) \right\}.

For a particular index n, the previous expressions simplify to

r_x^{(m)} = E\left\{ (x[n])^m \right\}
c_x^{(m)} = E\left\{ (x[n] - \mu_x)^m \right\}.
Note, also, that c_x^{(0)} = 1 and c_x^{(1)} = 0. It is obvious that, for zero-mean processes, the central moment is identical to the moment.
C.2.2.2 Moments of Third Order
The third-order moments are defined as

r^{(3)}[k, m, n] = E\left\{ x[k] \cdot x[m] \cdot x[n] \right\}
c^{(3)}[k, m, n] = E\left\{ (x[k] - \mu_k)(x[m] - \mu_m)(x[n] - \mu_n) \right\}.
C.2.3 Property of Stochastic Processes
C.2.3.1 Independent SP
An SP is called independent iff

f_x[x_1, \ldots, x_k; n_1, \ldots, n_k] = f_1[x_1; n_1] \cdot f_2[x_2; n_2] \cdots f_k[x_k; n_k] \qquad (C.66)

\forall k, n_i, i = 1, \ldots, k; in other words, x[n] is an SP formed with independent RVs x_1[n], x_2[n], \ldots.
For two, or more, independent sequences x[n] and y[n] we also have that

E\{x[n] \cdot y[n]\} = E\{x[n]\} \cdot E\{y[n]\}. \qquad (C.67)
C.2.3.2 Independent Identically Distributed SP
If all the SP sequences are independent and with equal pdf, i.e., f_1[x_1; n_1] = \cdots = f_k[x_k; n_k], then the SP is defined as iid.
C.2.3.3 Uncorrelated SP
An SP is called uncorrelated if

c[n, m] = E\left\{ (x[n] - \mu_n)(x[m] - \mu_m) \right\} = \sigma_{x_n}^2\, \delta[n - m]. \qquad (C.68)
Two processes x[n] and y[n] are uncorrelated if

c_{xy}[n, m] = E\left\{ (x[n] - \mu_{x_n})(y[m] - \mu_{y_m}) \right\} = 0 \qquad (C.69)

i.e., if

r_{xy}[n, m] = \mu_{x_n}\mu_{y_m}. \qquad (C.70)
Remark If the SPs x[n] and y[n] are independent, they are also necessarily uncorrelated, while the contrary is not always true; i.e., the assumption of independence is stronger than that of uncorrelation.
C.2.3.4 Orthogonal SP
Two processes x[n] and y[n] are defined as orthogonal iff

r_{xy}[n, m] = 0. \qquad (C.71)
C.2.4 Stationary Stochastic Processes
An SP is defined stationary or time invariant if the statistic of x½n� is identical to thetranslated process x½n � k� statistics. Very often in real situations we consider the
processes as stationary. This is due to the simplifications of the correlation func-
tions associated with them.
In particular, a sequence is called strict sense stationary (SSS) or stationary oforder N if we have
Appendix C: Elements of Random Variables, Stochastic Processes, and Estimation Theory 651
f x x1, :::, xN; n1, :::, nN½ � ¼ f x x1, :::, xN; n1 � k, :::, nN � k½ � 8k ðC:72Þ
An SP is wide-sense stationary (WSS) if its first-order statistics do not change over time

E\{x[n]\} = E\{x[n + k]\} = \mu \quad \forall n, k. \qquad (C.73)
As a corollary, consider also the following definitions. An SP is defined as wide-sense periodic (WSP) if

E\{x[n]\} = E\{x[n + N]\} = \mu \quad \forall n. \qquad (C.74)

An SP is wide-sense cyclostationary (WSC) if the following relations are true:

E\{x[n]\} = E\{x[n + N]\}, \quad r[m, n] = r[m + N, n + N] \quad \forall m, n. \qquad (C.75)
Defining k = n - m as the correlation lag or correlation delay, the correlation is usually written as

r[k] = E\{x[n]\, x[n - k]\} = E\{x[n + k]\, x[n]\}. \qquad (C.76)

The latter is often referred to as the autocorrelation function (acf).
Similarly, for WSS processes the autocovariance (C.61) is defined as

c[k] = E\left\{ (x[n + k] - \mu)(x[n] - \mu) \right\} = r[k] - \mu^2. \qquad (C.77)
Property The acf of WSS processes has the following properties:

1. The autocorrelation sequence r[k] is symmetric with respect to the delay

r[-k] = r[k] \qquad (C.78)

2. The correlation sequence is nonnegative definite. So, for any M > 0 and \mathbf{w} \in \mathbb{R}^M we have that

\sum_{k=1}^{M} \sum_{m=1}^{M} w[k]\, r[k - m]\, w[m] \geq 0 \qquad (C.79)

This property represents a necessary and sufficient condition for r[k] to be an acf.

3. The zero-time-delay term is that of maximum amplitude

E\{x^2[n]\} = r[0] \geq |r[k]| \quad \forall n, k. \qquad (C.80)
Given two jointly WSS processes x[n] and y[n], the cross-correlation function (ccf) is defined as

r_{xy}[k] = E\{x[n]\, y[n - k]\} = E\{x[n + k]\, y[n]\}. \qquad (C.81)

Finally, the cross-covariance sequence is defined as

c_{xy}[k] = E\left\{ (x[n + k] - \mu_x)(y[n] - \mu_y) \right\} = r_{xy}[k] - \mu_x\mu_y. \qquad (C.82)
C.2.5 Ergodic Processes
An SP is called ergodic if the ensemble averages coincide with the time averages.
The consequence of this definition is that an ergodic process must, necessarily, also
be strict sense stationary.
C.2.5.1 Statistics Averages of Ergodic Processes
For the determination of the statistics of an ergodic process it is necessary to define the time-average mathematical operation. For a discrete-time random signal x[n], the time-average operator, indicated as \langle x[n] \rangle, is defined as

\langle x[n] \rangle = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} x[n]

\langle x[n + k]\, x[n] \rangle = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} x[n + k]\, x[n]. \qquad (C.83)
It is possible to define all the statistical quantities and functions by replacing the ensemble-average operator E\{\cdot\} with the time-average operator \langle \cdot \rangle. In other words, if x[n] is an ergodic process, we have that

\mu = \langle x[n] \rangle = E\{x[n]\}. \qquad (C.84)

If x[n] is an ergodic process, for the correlation we have

\langle x[n + k]\, x[n] \rangle = E\{x[n + k]\, x[n]\}. \qquad (C.85)
If a process is ergodic then it is WSS, i.e., only stationary processes can be ergodic. On the contrary, a WSS process is not necessarily ergodic.
Considering the sequence x[n], we have that

\langle x[n] \rangle, \quad \text{Mean value} \qquad (C.86)
\langle x^2[n] \rangle, \quad \text{Mean square value} \qquad (C.87)
\langle (x[n] - \mu)^2 \rangle, \quad \text{Variance} \qquad (C.88)
\langle x[n + k]\, x[n] \rangle, \quad \text{Autocorrelation} \qquad (C.89)
\langle (x[n + k] - \mu)(x[n] - \mu) \rangle, \quad \text{Autocovariance} \qquad (C.90)
\langle x[n + k]\, y[n] \rangle, \quad \text{Cross-correlation} \qquad (C.91)
\langle (x[n + k] - \mu_x)(y[n] - \mu_y) \rangle, \quad \text{Cross-covariance} \qquad (C.92)
For deterministic power signals, it is important to mention the similarity between the correlation sequences calculated by the temporal average (C.89) and those determined by the definition (C.76). Although there is a formal similarity, due to the fact that random sequences are power signals, the time averages are (by the closure property) RVs, while the corresponding quantities for deterministic power signals are numbers or deterministic sequences.
Two individually ergodic SPs x[n] and y[n] have the property of joint ergodicity if the cross-correlation is identical to Eq. (C.91), i.e.,

E\{x[n + k]\, y[n]\} = \langle x[n + k]\, y[n] \rangle. \qquad (C.93)
Remark Ergodic processes are very important in applications, as very often only one realization of the process is available; in many practical situations, moreover, the processes are stationary and ergodic. Therefore, the assumption of ergodicity allows the estimation of statistical functions starting from the time averages of the single available realization of the process. Moreover, in the case of ergodic sequences of finite duration, the expression (C.83) is calculated as

r[k] = \begin{cases} \dfrac{1}{N} \displaystyle\sum_{n=0}^{N-1-k} x[n + k]\, x[n] & k \geq 0 \\ r[-k] & k < 0 \end{cases} \qquad (C.94)
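The finite-duration estimator (C.94) translates directly into code. A minimal NumPy sketch (not from the original text; the function name and the white-noise test signal are illustrative assumptions):

```python
import numpy as np

def acf_estimate(x, max_lag):
    """Biased time-average estimate of the acf r[k], as in (C.94)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # r[k] = (1/N) * sum_{n=0}^{N-1-k} x[n+k] x[n]; by symmetry r[-k] = r[k].
    return np.array([np.dot(x[k:], x[:N - k]) / N for k in range(max_lag + 1)])

rng = np.random.default_rng(3)
x = rng.standard_normal(100_000)   # unit-power white noise: r[0] ~ 1, r[k] ~ 0
r = acf_estimate(x, 5)
```

For white noise the estimate should return r[0] close to the signal power and near-zero values at the other lags, consistent with (C.68).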
C.2.6 Correlation Matrix of Random Sequences
A stochastic process can be represented as an RV vector and, as defined in Sect. C.1.7, its second-order statistics are defined by the mean value vector and by the correlation matrix. Consider a random vector \mathbf{x}_n from the SP x[n] as follows:

\mathbf{x}_n \triangleq [x[n]\;\; x[n-1]\;\; \cdots\;\; x[n-M+1]]^T \qquad (C.95)

From the definition (C.37), its mean value is defined as
\boldsymbol{\mu}_{x_n} = [\mu_{x_n}\;\; \mu_{x_{n-1}}\;\; \cdots\;\; \mu_{x_{n-M+1}}]^T \qquad (C.96)
and, from (C.41) and (C.63), the autocorrelation matrix is defined as

\mathbf{R}_{x_n} = E\{\mathbf{x}_n\mathbf{x}_n^T\} = \begin{bmatrix} r_x[n, n] & \cdots & r_x[n, n-M+1] \\ \vdots & \ddots & \vdots \\ r_x[n-M+1, n] & \cdots & r_x[n-M+1, n-M+1] \end{bmatrix}. \qquad (C.97)

Since r_x[n-i, n-j] = r_x[n-j, n-i] for 0 \leq i, j \leq M-1, \mathbf{R}_{x_n} is symmetric (or Hermitian for complex processes).
In the case of a stationary process, the acf is independent of the index n and, by defining the correlation lag as k = j - i, we obtain

r_x[n-i, n-j] = r_x[j-i] = r_x[k]. \qquad (C.98)

Then the autocorrelation matrix is a symmetric Toeplitz matrix of the form

\mathbf{R}_x = E\{\mathbf{x}\mathbf{x}^T\} = \begin{bmatrix} r[0] & r[1] & \cdots & r[M-1] \\ r[1] & r[0] & \cdots & r[M-2] \\ \vdots & \vdots & \ddots & \vdots \\ r[M-1] & r[M-2] & \cdots & r[0] \end{bmatrix}. \qquad (C.99)

The autocorrelation matrix of a stationary process is always Toeplitz (see Sect. A.2.4) and, by (C.44), nonnegative definite.
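The Toeplitz structure (C.99) can be built directly from the acf values. A minimal NumPy sketch (not from the original text; the helper name and the MA(1)-type acf are illustrative assumptions):

```python
import numpy as np

def autocorr_matrix(r):
    """Symmetric Toeplitz autocorrelation matrix (C.99) from r[0..M-1]."""
    M = len(r)
    idx = np.abs(np.arange(M)[:, None] - np.arange(M)[None, :])
    return np.asarray(r)[idx]          # element (i, j) equals r[|i - j|]

# acf of a hypothetical MA(1) process x[n] = w[n] + a*w[n-1], w unit-power white:
a = 0.5
r = np.array([1 + a**2, a, 0.0, 0.0])  # r[0], r[1], r[2], r[3]
R = autocorr_matrix(r)

eigvals = np.linalg.eigvalsh(R)        # nonnegative, consistent with (C.44)
```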
C.2.7 Stationary Random Sequences and TD LTI Systems
For random sequences processed by TD LTI systems, it is necessary to study the relationship between the input and output pdfs. For simplicity, consider a stable TD LTI circuit characterized by the impulse response h[n], where the input x[n] is a random, real or complex, stationary (WSS) sequence. The output y[n] is computed by the DT convolution defined as

y[n] = \sum_{l=-\infty}^{\infty} h[l]\, x[n - l]. \qquad (C.100)
C.2.7.1 Input–Output Cross-correlation Sequence
Consider the expression (C.100); pre-multiplying both sides by x[n + k] and performing the expectation, we get

E\{x[n + k]\, y[n]\} = \sum_{l=-\infty}^{\infty} h[l]\, E\{x[n + k]\, x[n - l]\}, \qquad (C.101)

i.e.,

r_{xy}[k] = \sum_{l=-\infty}^{\infty} h[l]\, r_{xx}[k + l] = \sum_{m=-\infty}^{\infty} h[-m]\, r_{xx}[k - m]. \qquad (C.102)

In other words, the following relations are valid:

r_{xy}[k] = h[-k] * r_{xx}[k] \qquad (C.103)

and similarly

r_{yx}[k] = h[k] * r_{xx}[k]. \qquad (C.104)

From the previous we also have that

r_{xy}[k] = r_{yx}[-k]. \qquad (C.105)
C.2.7.2 Output Autocorrelation Sequence
Multiplying both sides of (C.100) by y[n - k] and computing the expectation, we get

E\{y[n]\, y[n - k]\} = \sum_{l=-\infty}^{\infty} h[l]\, E\{x[n - l]\, y[n - k]\} \qquad (C.106)

or

r_{yy}[k] = \sum_{l=-\infty}^{\infty} h[l]\, r_{xy}[k - l] = h[k] * r_{xy}[k]. \qquad (C.107)

In other words, we can write

r_{yy}[k] = h[k] * h[-k] * r_{xx}[k]. \qquad (C.108)

By defining the term r_{hh}[k] as

r_{hh}[k] \triangleq h[k] * h[-k] = \sum_{l=-\infty}^{\infty} h[l]\, h[l - k], \qquad (C.109)

(C.108) can be written as
r_{yy}[k] = r_{hh}[k] * r_{xx}[k]. \qquad (C.110)
Therefore, when a stationary signal x[n] is filtered with a circuit of impulse response h[n], the output autocorrelation is equivalent to the input autocorrelation filtered with an impulse response equal to r_{hh}[k] = h[k] * h[-k].
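Relation (C.110) can be checked numerically: for unit-power white input, r_xx[k] = δ[k], so r_yy[k] should equal r_hh[k]. A minimal NumPy sketch (not from the original text; the FIR h and the sequence length are illustrative assumptions):

```python
import numpy as np

# Deterministic side: r_hh[k] = h[k] * h[-k], cf. (C.109).
h = np.array([1.0, 0.5, 0.25])         # hypothetical FIR impulse response
r_hh = np.convolve(h, h[::-1])         # lags -2..2; lag 0 sits at index 2

# Monte Carlo side: filter a long unit-power white sequence and estimate r_yy.
rng = np.random.default_rng(4)
x = rng.standard_normal(200_000)
y = np.convolve(x, h, mode="full")[: len(x)]

N = len(x)
r_yy = np.array([np.dot(y[k:], y[: N - k]) / N for k in range(3)])  # lags 0..2
```

The estimated r_yy at lags 0, 1, 2 should approach r_hh at the same lags, i.e., the second half of the deterministic sequence.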
C.2.7.3 Output Pdf
The determination of the output pdf of a DT LTI system is usually a difficult task. However, for a Gaussian input process, the output is always a Gaussian process, with correlation given by (C.110). In the case of multiple iid inputs, the output is determined by the weighted sum of the independent input SPs; therefore, the output pdf is equal to the convolution of the pdfs of the individual SPs.
C.2.7.4 Stationary Random Sequences Spectral Representation
Given a stationary zero-mean discrete-time signal x[n] for -\infty < n < \infty, this does not, in general, have finite energy, for which reason the DTFT, and more generally the z-transform, does not converge. The autocorrelation sequence r_{xx}[k], computed by (C.76) or in terms of relative frequency, however, is "almost always" of finite energy and, when this is true, its envelope decays (goes to zero) as the delay increases. In these cases the autocorrelation sequence is absolutely summable and its z-transform, defined as

R_{xx}(z) = \sum_{k=-\infty}^{\infty} r_{xx}[k]\, z^{-k},

admits some convergence region in the z-plane. Note, also, that for the symmetry property (C.78), we have R_{xx}(z^{-1}) = R_{xx}(z).
C.2.7.5 Power Spectral Density
We define the power spectral density (PSD) as the DTFT of the autocorrelation

R_{xx}(e^{j\omega}) = \sum_{k=-\infty}^{\infty} r_{xx}[k]\, e^{-j\omega k}. \qquad (C.111)

The PSD is a nonnegative real function that does not preserve the phase information. R_{xx}(e^{j\omega}) provides a measure of the distribution of the average power of a random process as a function of frequency.
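As a concrete case of (C.111), consider an acf with r[0] = 1 + a², r[±1] = a and all other lags zero (an MA(1)-type process): the sum collapses to the closed form 1 + a² + 2a·cos ω, which is real and nonnegative for |a| ≤ 1. A minimal NumPy sketch (not from the original text; a = 0.6 is an illustrative assumption):

```python
import numpy as np

# PSD of the acf r[0] = 1 + a^2, r[+-1] = a:
# R_xx(e^{jw}) = (1 + a^2) + 2*a*cos(w), cf. (C.111).
a = 0.6
w = np.linspace(-np.pi, np.pi, 513)    # grid including w = 0 and w = +-pi
psd = (1 + a**2) + 2 * a * np.cos(w)

# Real, nonnegative, with extremes (1 - a)^2 and (1 + a)^2.
psd_min, psd_max = psd.min(), psd.max()
```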
We define the cross-spectrum or cross-PSD (CPSD) as the DTFT of the cross-correlation sequence

R_{xy}(e^{j\omega}) = \sum_{k=-\infty}^{\infty} r_{xy}[k]\, e^{-j\omega k}. \qquad (C.112)

The CPSD is a complex function. Its amplitude describes to what extent the frequencies of the SP x[n] are associated, with large or small amplitude, with those of the SP y[n]. The phase \angle R_{xy}(e^{j\omega}) indicates the phase delay of y[n] with respect to x[n] for each frequency.
From Eq. (C.105), the following property holds:

R_{xy}(e^{j\omega}) = R_{yx}^*(e^{j\omega}) \qquad (C.113)

so R_{xy}(e^{j\omega}) and R_{yx}(e^{j\omega}) have the same modulus but opposite phase.
C.2.7.6 Spectral Representation of Stationary SP and TD LTI systems
For an impulse response h[n], with z-transform H(z) = Z\{h[n]\}, we have the following property:

Z\{h[n]\} = H(z) \;\Leftrightarrow\; Z\{h^*[-n]\} = H^*(1/z^*). \qquad (C.114)

From the above and from (C.103)–(C.106), then

R_{xy}(z) = H^*(1/z^*)\, R_{xx}(z) \qquad (C.115)
R_{yx}(z) = H(z)\, R_{xx}(z) \qquad (C.116)
R_{yy}(z) = H(z)\, H^*(1/z^*)\, R_{xx}(z). \qquad (C.117)
For z = e^{j\omega}, we can write

R_{xy}(e^{j\omega}) = H^*(e^{j\omega})\, R_{xx}(e^{j\omega}) \qquad (C.118)
R_{yx}(e^{j\omega}) = H(e^{j\omega})\, R_{xx}(e^{j\omega}) \qquad (C.119)
R_{yy}(e^{j\omega}) = H(e^{j\omega})\, H^*(e^{j\omega})\, R_{xx}(e^{j\omega}) = |H(e^{j\omega})|^2\, R_{xx}(e^{j\omega}). \qquad (C.120)
Moreover, from (C.118) and (C.119) we have that

R_{yx}(e^{j\omega}) = R_{xy}^*(e^{j\omega}). \qquad (C.121)
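The pair r_hh[k] = h[k] ∗ h[−k] ↔ |H(e^{jω})|² implied by (C.120) can be checked on an FFT grid. A minimal NumPy sketch (not from the original text; the FIR h and the FFT size are illustrative assumptions):

```python
import numpy as np

h = np.array([1.0, -0.4, 0.2])         # hypothetical FIR impulse response
nfft = 256
H = np.fft.fft(h, nfft)                # H(e^{jw}) on the FFT grid

r_hh = np.convolve(h, h[::-1])         # lags -2..2; lag 0 sits at index 2

# Place the lags circularly so that lag 0 is at index 0, then take the DFT:
r_pad = np.zeros(nfft)
r_pad[:3] = r_hh[2:]                   # lags 0, 1, 2
r_pad[-2:] = r_hh[:2]                  # lags -2, -1 (wrapped to the end)
Rhh = np.fft.fft(r_pad)

# The DTFT of r_hh equals |H|^2 exactly (up to round-off) on this grid.
match = np.allclose(Rhh, np.abs(H) ** 2, atol=1e-10)
```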
Example Consider the sum of two SPs w[n] = x[n] + y[n], and evaluate r_{ww}[k]. By applying the definition (C.76) we have that

r_{ww}[k] = E\{w[n]\, w[n - k]\} = E\left\{ (x[n] + y[n])(x[n - k] + y[n - k]) \right\}
= E\{x[n]\, x[n - k]\} + E\{x[n]\, y[n - k]\} + E\{y[n]\, x[n - k]\} + E\{y[n]\, y[n - k]\}
= r_{xx}[k] + r_{xy}[k] + r_{yx}[k] + r_{yy}[k].

For zero-mean uncorrelated sequences the cross contributions are zero [see (C.67)]. Hence, we obtain r_{ww}[k] = r_{xx}[k] + r_{yy}[k]; therefore, for the PSD we have

R_{ww}(e^{j\omega}) = R_{xx}(e^{j\omega}) + R_{yy}(e^{j\omega}).
Example Evaluate the output PSD R_{yy}(e^{j\omega}) for the TD LTI system illustrated in Fig. C.7a, with random uncorrelated input sequences x_1[n] and x_2[n].
The inputs x_1[n] and x_2[n] are mutually uncorrelated and, since the system is linear, can be considered separately by the superposition principle. The output PSD is calculated as the sum of the single contributions, each evaluated when the other input is null. So we have

R_{yy}(e^{j\omega}) = R_{yy}^{x_1}(e^{j\omega}) + R_{yy}^{x_2}(e^{j\omega}).

From (C.120), we get

R_{yy}^{x_1}(e^{j\omega}) \triangleq R_{yy}(e^{j\omega})\big|_{x_2[n]=0} = |H(e^{j\omega})|^2\, R_{x_1x_1}(e^{j\omega})
R_{yy}^{x_2}(e^{j\omega}) \triangleq R_{yy}(e^{j\omega})\big|_{x_1[n]=0} = |G(e^{j\omega})|^2\, R_{x_2x_2}(e^{j\omega}).

Finally, we have that

R_{yy}(e^{j\omega}) = |H(e^{j\omega})|^2\, R_{x_1x_1}(e^{j\omega}) + |G(e^{j\omega})|^2\, R_{x_2x_2}(e^{j\omega}).
Example Evaluate the PSDs R_{y_1y_2}(e^{j\omega}), R_{y_2y_1}(e^{j\omega}), R_{y_1y_1}(e^{j\omega}), and R_{y_2y_2}(e^{j\omega}) for the TD LTI system illustrated in Fig. C.7b, with random uncorrelated input sequences x_1[n] and x_2[n].
From (C.118)–(C.120), the output PSDs we obtain are
Fig. C.7 Block diagrams of the TD LTI systems illustrated in the examples: (a) single output y[n] from inputs x_1[n] and x_2[n] through H(z) and G(z); (b) two outputs y_1[n] and y_2[n]
R_{y_1y_1}(e^{j\omega}) = |H(e^{j\omega})|^2\, R_{x_1x_1}(e^{j\omega}) + R_{x_2x_2}(e^{j\omega})
R_{y_2y_2}(e^{j\omega}) = |G(e^{j\omega})|^2\, R_{x_2x_2}(e^{j\omega}) + R_{x_1x_1}(e^{j\omega}).
For the CPSD R_{y_1y_2}(e^{j\omega}), we observe that the sequences y_1[n] and y_2[n] are related to the input sequences through the TFs H(e^{j\omega}) and G(e^{j\omega}). Moreover, since x_1[n] and x_2[n] are uncorrelated, by the superposition principle we can write

R_{y_1y_2}(e^{j\omega}) = R_{y_1y_2}^{x_1}(e^{j\omega}) + R_{y_1y_2}^{x_2}(e^{j\omega}).
Note that for x_2[n] = 0 it is y_2[n] \equiv x_1[n], for which, from (C.119), we obtain

R_{y_1y_2}^{x_1}(e^{j\omega}) = R_{y_1y_2}(e^{j\omega})\big|_{x_2[n]=0} = H(e^{j\omega})\, R_{x_1x_1}(e^{j\omega}).
Similarly, for the other input, when x_1[n] = 0, from (C.118) we obtain

R_{y_1y_2}^{x_2}(e^{j\omega}) = R_{y_1y_2}(e^{j\omega})\big|_{x_1[n]=0} = G^*(e^{j\omega})\, R_{x_2x_2}(e^{j\omega}).
The CPSD R_{y_1y_2}(e^{j\omega}) is then

R_{y_1y_2}(e^{j\omega}) = H(e^{j\omega})\, R_{x_1x_1}(e^{j\omega}) + G^*(e^{j\omega})\, R_{x_2x_2}(e^{j\omega}).
Similarly, for the CPSD R_{y_2y_1}(e^{j\omega}), we get

R_{y_2y_1}(e^{j\omega}) = H^*(e^{j\omega})\, R_{x_1x_1}(e^{j\omega}) + G(e^{j\omega})\, R_{x_2x_2}(e^{j\omega}),

in agreement with (C.121).
C.3 Basic Concepts of Estimation Theory
In many real applications the distribution functions are not a priori known and should be determined by appropriate experiments carried out using a finite set of measured data. The estimation of such statistics can be performed using methodologies defined in the context of Estimation Theory^9 (ET) [16–22].
^9 Estimation Theory is a very old discipline: famous scientists such as Lagrange, Gauss, and Legendre used it in the past, and in the last century attention to it considerably increased. In fact, many scientists have worked in this field (Wold, Fisher, Kolmogorov, Wiener, Kalman, etc.). Among these, N. Wiener, between 1930 and 1940, was among those who most emphasized the importance of considering not only the noise but also the signals as stochastic processes.
C.3.1 Preliminary Definitions and Notation
Let \Theta be defined as the parameter space. The general problem of parameter estimation is the determination of a parameter \theta \in \Theta or, more generally, of a vector of unknown parameters \boldsymbol{\theta} \in \Theta \triangleq \{\theta[n]\}_0^{L-1}, starting from a series of observations or measurements \mathbf{x} \triangleq \{x[n]\}_0^{N-1}, by means of an estimation function h(\cdot), called the estimator, such that the estimate is \hat{\theta} = h(\mathbf{x}).
Before proceeding to further developments, let us introduce some preliminary formal definitions.
\theta \in \Theta  In general, \theta indicates the parameter vector to be estimated. Depending on the estimation paradigm adopted, as better illustrated in the following, \theta can be considered as an RV, characterized by a certain a priori known or supposed (hypothesized) distribution, or simply considered as a deterministic unknown.

h(\mathbf{x})  This function, which is itself an RV, indicates the estimator, namely, the law which determines the value of the parameters to be estimated starting from the observations \mathbf{x}.

\hat{\theta}  This symbol indicates the result of the estimation, i.e., \hat{\theta} = h(\mathbf{x}). Note that the estimated value is always an RV characterized by a certain pdf and/or the values of its moments.
C.3.1.1 Sampling Distribution
The above definitions show that the estimator relative to the \zeta_kth event, denoted by h(\{x[n, \zeta_k]\}_0^{N-1}), is defined in an N-dimensional space, whose distribution can be obtained from the joint distribution of the RVs \{x[n, \zeta]\}_0^{N-1} and \theta. This distribution, in the case of a single deterministic parameter estimation, is written as f_{x;\theta}(\mathbf{x}; \theta) and is defined as the sampling distribution.
Note that the sampling distribution represents one of the fundamental concepts in estimation theory because it contains all the information needed to define the estimator quality characteristics. In fact, it is intuitive to think that the sampling distribution of a "good" estimator should be as concentrated as possible, so that it has a small variance around the true value of the parameter to be estimated.
C.3.1.2 Estimation Theory: Classical and Bayesian Approaches
In classical estimation theory \theta represents an unknown deterministic vector of parameters. Therefore, the formalism f_{x;\theta}(\mathbf{x}; \theta) indicates a parametric dependency of the pdf related to the measures \mathbf{x} on the parameters \theta. For example, consider the simple case N = 1, where the parameter \theta represents a certain (mean) value and the pdf f_{x;\theta}(x[0]; \theta) is normally distributed around this value, x[0] \sim N(\theta, \sigma_{x[0]}^2), so that

f_{x[0];\theta}(x[0]; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma_{x[0]}}\, e^{-\frac{1}{2\sigma_{x[0]}^2}(x[0]-\theta)^2} \qquad (C.122)

illustrated, by way of example, in Fig. C.8 for some values of the parameter \theta. In other words, the parameter \theta is not an RV, and f_{x;\theta}(x[0]; \theta) indicates a parametric pdf that depends on a deterministic value \theta.
On the contrary, in Bayesian estimation theory \theta is an RV characterized by its pdf f_\theta(\theta), the a priori pdf, which contains all the a priori known (or believed) information. The quantity to be estimated is then interpreted as a realization of the RV \theta. Subsequently, the estimation process is described by the joint pdf through the Bayes rule, as [see Sect. C.1.4, Eq. (C.24)]

f_{x,\theta}(\mathbf{x}, \theta) = f_{x|\theta}(\mathbf{x}\,|\,\theta)\, f_\theta(\theta) = f_{\theta|x}(\theta\,|\,\mathbf{x})\, f_x(\mathbf{x}), \qquad (C.123)

where f_{x|\theta}(\mathbf{x}|\theta) is the conditional pdf that represents the knowledge carried by the data \mathbf{x} conditioned on the knowledge of the distribution f_\theta(\theta).^{10}
For the definition of the estimator quality, it is not always possible to know the sampling distribution f_{x;\theta}(\mathbf{x}; \theta). In practice, however, it is possible to use the low-order moments, such as the expectation E\{\hat{\theta}\}, the variance, denoted as var(\hat{\theta}) or \sigma_{\hat{\theta}}^2, and the mean square error (MSE), denoted as mse(\hat{\theta}).
C.3.1.3 Estimator, Expectation, and Bias
An estimator is called unbiased, if the expectation of the estimated value tends to
the true value of the parameter to be estimated. In other words,
Fig. C.8 Dependency of the pdf f_{x;\theta}(x[0]; \theta) on the unknown parameter \theta, shown for values \theta_1, \theta_2, \theta_3
^{10} The notation f_{x;\theta}(\mathbf{x}; \theta) indicates a parametric pdf family where \theta is the free parameter. Moreover, remember that the notation f_{x,\theta}(\mathbf{x}, \theta) indicates the joint pdf, while f_{x|\theta}(\mathbf{x}|\theta) indicates the conditional pdf.
E\{\hat{\theta}\} = \theta. \qquad (C.124)

If E\{\hat{\theta}\} \neq \theta, it is possible to define a quantity called the deviation or bias as

b(\hat{\theta}) \triangleq E\{\hat{\theta}\} - \theta. \qquad (C.125)
Remark The presence of a bias term probably indicates the presence of a systematic error, i.e., one due to the measurement process (or to the estimation algorithm). Note that an unbiased estimator is not necessarily a "good" estimator. In fact, the only guarantee is that, on average, it tends to the true value.
C.3.1.4 Estimator Variance
For better characterizing the estimation quality, we define the estimator variance as

var(\hat{\theta}) = \sigma_{\hat{\theta}}^2 \triangleq E\left\{ |\hat{\theta} - E\{\hat{\theta}\}|^2 \right\} \qquad (C.126)

which represents a measure of the dispersion of the pdf of \hat{\theta} around its expected value (Fig. C.9).
C.3.1.5 Estimator’s Mean Square Error and Bias-Vs.-Variance Trade-off
Given the true value \theta and its estimated value \hat{\theta}, the MSE of the related estimator \hat{\theta} = h(\mathbf{x}) can be defined as

mse(\hat{\theta}) = E\left\{ |\hat{\theta} - \theta|^2 \right\}. \qquad (C.127)

So mse(\cdot) is a measure of the average quadratic deviation of the estimated value with respect to the true value. Note that, considering the definitions (C.125) and (C.126), the mse(\hat{\theta}) can be written as
Fig. C.9 Estimator bias and variance: (a) biased estimator; (b) unbiased estimator
mse θ� � ¼ σ2
θþ ��b θ
� ���2: ðC:128Þ
In fact, by summing and subtracting the term E θ� �
, it is possible to write
E��θ � θ
��2n o¼ E jθ � θ þ E
�θ� � E
���2n o
¼ E j θ � E θ� �� �þ E θ
� �� θ� �j2� �
¼ E��θ � E θ
� ���2n o|fflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflffl}
σ2θ
þ ��E θ� �� θ
��2|fflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflffl}��b θ� ���2
: ðC:129Þ
The expression (C.128) shows that the MSE is the sum of two contributions: one due to the estimator variance and the other due to its bias.
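The decomposition (C.128) can be verified numerically. Below is a minimal sketch (Python with NumPy assumed; the shrunk-mean estimator and all numeric values are illustrative, not from the text) that builds a deliberately biased estimator and checks that the Monte Carlo MSE equals variance plus squared bias:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma_w, N, trials = 2.0, 1.0, 16, 200_000
a = 0.8  # deliberate shrinkage -> biased estimator

# Monte Carlo: 'trials' realizations of N samples of x[n] = theta + w[n]
x = theta + sigma_w * rng.standard_normal((trials, N))
theta_hat = a * x.mean(axis=1)

mse = np.mean((theta_hat - theta) ** 2)      # E{|theta_hat - theta|^2}, (C.127)
var = np.var(theta_hat)                      # estimator variance, (C.126)
bias2 = (np.mean(theta_hat) - theta) ** 2    # |b(theta_hat)|^2, (C.125)

print(mse, var + bias2)  # the two numbers agree, as in (C.128)
```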
C.3.1.6 Example: Estimate the Current Gain of a White Gaussian
Sequence
As an example, we consider the estimation of a discrete sequence $x[n]$ consisting of $N$ independent samples, defined as

$$x[n] = \theta + w[n], \qquad (C.130)$$

where $\theta$ represents the constant component (by analogy with the constant electrical direct current, DC) and $w[n]$ is additive white Gaussian noise (AWGN) with zero mean, indicated as $w[n] \sim N(0,\sigma_w^{2})$.

Reasoning intuitively, we can define different algorithms for the estimation of $\theta$.
For example, two very commonly used estimators are defined as

$$\hat{\theta}_{1} = h_{1}(\mathbf{x}) \triangleq x[0] \qquad (C.131)$$

$$\hat{\theta}_{2} = h_{2}(\mathbf{x}) \triangleq \frac{1}{N}\sum_{n=0}^{N-1} x[n]. \qquad (C.132)$$
To assess the quality of the estimators $h_{1}(\mathbf{x})$ and $h_{2}(\mathbf{x})$, we calculate their respective expected values and variances. For the expected values we have

$$E(\hat{\theta}_{1}) = E\{x[0]\} = \theta \qquad (C.133)$$

$$E(\hat{\theta}_{2}) = E\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right) = \frac{1}{N}\sum_{n=0}^{N-1} E\{x[n]\} = \frac{1}{N}\left[N\theta\right] = \theta. \qquad (C.134)$$
Therefore, both estimators have the same expected value, which coincides with the true value of the parameter $\theta$ to estimate. Reasoning in a similar way, for the variances we have
$$\mathrm{var}(\hat{\theta}_{1}) = \mathrm{var}\{x[0]\} = \sigma_w^{2} \qquad (C.135)$$

and

$$\mathrm{var}(\hat{\theta}_{2}) = \mathrm{var}\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right). \qquad (C.136)$$
The latter, under the independence hypothesis, can be rewritten as

$$\mathrm{var}(\hat{\theta}_{2}) = \frac{1}{N^{2}}\sum_{n=0}^{N-1} \mathrm{var}\{x[n]\} = \frac{1}{N^{2}}\left[N\sigma_w^{2}\right] = \frac{\sigma_w^{2}}{N}. \qquad (C.137)$$
It then follows that $\mathrm{var}\{h_{2}(\mathbf{x})\} < \mathrm{var}\{h_{1}(\mathbf{x})\}$ and, for $N \to \infty$, $\mathrm{var}(\hat{\theta}_{2}) \to 0$. For this reason, the estimator $h_{2}(\mathbf{x})$ turns out to be better than $h_{1}(\mathbf{x})$. In fact, according to certain paradigms, as we shall see later, $h_{2}(\mathbf{x})$ is the best possible estimator.
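A quick Monte Carlo comparison of the two estimators (C.131) and (C.132) illustrates the variance results (C.135)-(C.137). The sketch below assumes Python with NumPy; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma_w, N, trials = 1.5, 1.0, 32, 100_000

x = theta + sigma_w * rng.standard_normal((trials, N))
theta1 = x[:, 0]         # h1(x) = x[0], (C.131)
theta2 = x.mean(axis=1)  # h2(x) = sample mean, (C.132)

# Both estimators are unbiased, (C.133)-(C.134) ...
print(theta1.mean(), theta2.mean())
# ... but var(h2) = sigma_w^2/N is far smaller than var(h1) = sigma_w^2, (C.135)-(C.137)
print(theta1.var(), theta2.var())
```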
C.3.1.7 Minimum Variance Unbiased (MVU) Estimator
Ideally, a good estimator should have an MSE that tends to zero. Unfortunately, the adoption of this criterion produces, in most cases, a non-feasible estimator. In fact, the expression of the MSE (C.128) contains the contribution of the variance added to that of the bias. To better understand this, consider the example of the average value estimator (C.132), redefined by the following expression:
$$\hat{\theta} = h(\mathbf{x}) \triangleq a\,\frac{1}{N}\sum_{n=0}^{N-1} x[n], \qquad (C.138)$$
where a is a suitable constant. The problem, now, consists in determining the value
of the constant a such that the MSE of the estimator is minimal.
Since, by definition, $E(\hat{\theta}) = a\theta$ and $\mathrm{var}(\hat{\theta}) = a^{2}\sigma_w^{2}/N$, from Eq. (C.128) we have

$$\mathrm{mse}(\hat{\theta}) = \frac{a^{2}\sigma_w^{2}}{N} + (a-1)^{2}\theta^{2}. \qquad (C.139)$$
Hence, differentiating the MSE with respect to $a$, we obtain

$$\frac{d\,\mathrm{mse}(\hat{\theta})}{da} = \frac{2a\sigma_w^{2}}{N} + 2(a-1)\theta^{2}. \qquad (C.140)$$
The optimum value $a_{\mathrm{opt}}$ is obtained by setting this derivative to zero and solving with respect to $a$. It follows that

$$a_{\mathrm{opt}} = \frac{\theta^{2}}{\theta^{2} + \sigma_w^{2}/N}. \qquad (C.141)$$
The previous expression shows that the value $a_{\mathrm{opt}}$ depends on $\theta$; i.e., the estimator's goodness depends on the very parameter $\theta$ that should be determined by the estimator itself. Such a paradox indicates the non-computability of the $a_{\mathrm{opt}}$ parameter, i.e., the non-feasibility of the estimator. Generally, with certain exceptions, any criterion that depends on the bias determines a non-feasible estimator.
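The paradox can be made concrete with a small grid search over $a$: the minimizer of (C.139) matches (C.141) and moves when $\theta$ changes, so no single feasible value of $a$ works for all $\theta$. A sketch under illustrative parameter values (Python with NumPy assumed):

```python
import numpy as np

sigma2, N = 1.0, 10  # per-sample noise variance and sample length (illustrative)

def a_opt(theta):
    # (C.141): the optimal shrinkage depends on the unknown theta itself
    return theta**2 / (theta**2 + sigma2 / N)

theta = 1.0
a = np.linspace(0.0, 1.5, 15001)
mse = a**2 * sigma2 / N + (a - 1.0)**2 * theta**2  # (C.139)
a_min = a[np.argmin(mse)]

# grid minimizer matches (C.141); a_opt changes with theta
print(a_min, a_opt(1.0), a_opt(2.0))
```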
On the other hand, the optimal estimator is not the one with minimum MSE but the one that constrains the bias to zero and minimizes the estimator variance. For this reason, this estimator is called the minimum variance unbiased (MVU) estimator. For an MVU estimator, from definition (C.128),
$$\mathrm{mse}(\hat{\theta}) = \sigma_{\hat{\theta}}^{2}, \qquad \text{MVU estimator}. \qquad (C.142)$$
C.3.1.8 Bias Vs. Variance Trade-off
From what was said, a "good" estimator should be unbiased and have minimum variance. Often, in practical situations, the two features are mutually contradictory, i.e., reducing the variance increases the bias. This situation reflects a kind of indeterminacy between bias and variance, often referred to as the bias-variance trade-off.

The MVU estimator does not always exist; this is generally the case when the variance of the estimator depends on the value of the parameter to be estimated. Note also that the existence of the MVU estimator does not imply its determination. In other words, although it theoretically exists, it is not guaranteed that we can determine it.
C.3.1.9 Consistent Estimator
An estimator is said to be weakly consistent if it converges in probability to the true parameter value as the sample length $N$ tends to infinity:

$$\lim_{N\to\infty} p\left\{\left|h(\mathbf{x}) - \theta\right| < \varepsilon\right\} = 1 \qquad \forall\varepsilon > 0. \qquad (C.143)$$

An estimator is called consistent in the strong sense if it converges with probability one to the parameter value as the sample length $N$ tends to infinity:

$$p\left\{\lim_{N\to\infty} h(\mathbf{x}) = \theta\right\} = 1. \qquad (C.144)$$
Sufficient conditions for a weakly consistent estimator are that the variance and the bias tend to zero as the sample length $N$ tends to infinity, i.e.,

$$\lim_{N\to\infty} E\{h(\mathbf{x})\} = \theta, \qquad \lim_{N\to\infty} \mathrm{var}\{h(\mathbf{x})\} = 0. \qquad (C.145)$$

In this case, the sampling distribution tends to become an impulse centered on the value to be estimated.
C.3.1.10 Confidence Interval
As the sample length $N$ increases, under sufficiently general conditions, the estimate tends to the true value $(\hat{\theta} \to \theta$ for $N \to \infty)$. Moreover, by the central limit theorem, as $N$ increases, the pdf of $\hat{\theta}$ is well approximated by the normal distribution.

Knowing the sampling distribution of an estimator, it is possible to calculate a certain interval $(-\Delta, \Delta)$ that corresponds to a specified probability. Such an interval, called the confidence interval, indicates that the event $\theta$ lies in the range $(-\Delta, \Delta)$ around $\hat{\theta}$ with probability $(1-\beta)$, or confidence $(1-\beta)\cdot 100\%$ (see Fig. C.10).
C.3.2 Classical and Bayesian Estimation
In classical ET, as previously indicated, the problem is addressed by considering the parameter to be estimated as deterministic, while in Bayesian ET the parameter to estimate is considered stochastic. If the parameter is an RV, it is characterized by a certain pdf that reflects the a priori knowledge of the parameter itself.
Both theories have found several applications in signal processing and, in
particular, the three main estimation paradigms used are the following:
i) the maximum a posteriori estimation (MAP);
ii) the maximum likelihood estimation (ML);
iii) the minimum mean squares error estimation (MMSE).
Fig. C.10 Confidence interval around the true θ value
C.3.2.1 Maximum a Posteriori Estimation
In the MAP estimator, the parameter $\theta$ is characterized by an a priori pdf $f_{\theta}(\theta)$ that is determined from the knowledge available before the measurement of the data $\mathbf{x}$, in the absence of other information. The new knowledge obtained from the measurement therefore determines a change in the pdf of $\theta$, which becomes conditioned by the measurement itself. The new pdf, indicated as $f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$, is defined as the a posteriori pdf of $\theta$ conditioned on the measurements $\mathbf{x}$. Note that $f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$ is a one-dimensional function of the scalar parameter $\theta$, but it is also subject to conditioning due to the measurements.
Therefore, the MAP estimate consists in determining the maximum of the a posteriori pdf. This can be obtained by differentiating $f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$ with respect to the parameter $\theta$ and equating the result to zero:

$$\hat{\theta}_{\mathrm{MAP}} \triangleq \left\{\theta : \frac{\partial f_{\theta|\mathbf{x}}(\theta|\mathbf{x})}{\partial\theta} = 0\right\}. \qquad (C.146)$$
Sometimes, instead of the maximum of $f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$, we consider its natural logarithm. So $\hat{\theta}_{\mathrm{MAP}}$ can be found from the maximum of the function $\ln f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$, for which

$$\hat{\theta}_{\mathrm{MAP}} \triangleq \left\{\theta : \frac{\partial \ln f_{\theta|\mathbf{x}}(\theta|\mathbf{x})}{\partial\theta} = 0\right\}. \qquad (C.147)$$
Since the logarithm is a monotonically increasing function, the value found is the same as that in (C.146). However, the determination of $f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$ or $\ln f_{\theta|\mathbf{x}}(\theta|\mathbf{x})$ is often problematic; using the rule derived from the Bayes theorem, by (C.123) it is possible to write the conditional pdf as

$$f_{\theta|\mathbf{x}}(\theta|\mathbf{x}) = \frac{f_{\mathbf{x}|\theta}(\mathbf{x}|\theta)\,f_{\theta}(\theta)}{f_{\mathbf{x}}(\mathbf{x})}. \qquad (C.148)$$

Considering the logarithm of both sides of the previous, we can write

$$\ln f_{\theta|\mathbf{x}}(\theta|\mathbf{x}) = \ln f_{\mathbf{x}|\theta}(\mathbf{x}|\theta) + \ln f_{\theta}(\theta) - \ln f_{\mathbf{x}}(\mathbf{x}).$$

Thus, the procedure for the MAP estimate is

$$\frac{\partial}{\partial\theta}\left[\ln f_{\mathbf{x}|\theta}(\mathbf{x}|\theta) + \ln f_{\theta}(\theta) - \ln f_{\mathbf{x}}(\mathbf{x})\right] = 0$$
and, since $\ln f_{\mathbf{x}}(\mathbf{x})$ does not depend on $\theta$, we can write
$$\hat{\theta}_{\mathrm{MAP}} \triangleq \left\{\theta : \frac{\partial}{\partial\theta}\left[\ln f_{\mathbf{x}|\theta}(\mathbf{x}|\theta) + \ln f_{\theta}(\theta)\right] = 0\right\}. \qquad (C.149)$$
Finally, note that it is possible to determine the MAP solution equivalently through (C.146), (C.147), or (C.149).
C.3.2.2 Maximum-Likelihood Estimation
In maximum-likelihood (ML) estimation, the parameter $\theta$ to be estimated is considered as a simple deterministic unknown. Therefore, in ML estimation the determination of $\hat{\theta}_{\mathrm{ML}}$ is carried out through the maximization of the function $f_{\mathbf{x};\theta}(\mathbf{x};\theta)$, defined as a parametric pdf family, where $\theta$ is the deterministic parameter. In this respect, the function $f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ is sometimes referred to as the likelihood function $L_{\theta}$. Note that if $f_{\mathbf{x};\theta}(\mathbf{x};\theta_{1}) > f_{\mathbf{x};\theta}(\mathbf{x};\theta_{2})$, then the value $\theta_{1}$ is "more plausible" than the value $\theta_{2}$, so the ML paradigm indicates that the estimated value $\hat{\theta}_{\mathrm{ML}}$ is the most likely according to the observations $\mathbf{x}$. As for the MAP method, for the ML estimator the natural logarithm $\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ is often considered. Note that, although $\theta$ is a deterministic parameter, the likelihood function $L_{\theta}$ (or $\ln L_{\theta}$) has a stochastic nature and is considered an RV. In this case, if the estimate exists, it can be found as the solution of the likelihood equation defined as
$$\hat{\theta}_{\mathrm{ML}} \triangleq \left\{\theta : \frac{\partial \ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta} = 0\right\}. \qquad (C.150)$$
Such solution is defined as maximum-likelihood estimate (MLE).
In other words, the ML method searches, within the space $\Theta$ of all possible $\theta$ values, for the value of the parameter that is most plausible given the observations. From a mathematical point of view, calling $L_{\theta} = f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ the likelihood function, we have

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta\in\Theta}\{L_{\theta}\}. \qquad (C.151)$$
The MLE also has the following properties:
• Sufficiency: if there is a sufficient statistic¹¹ for θ, then the MLE is also a sufficient statistic;
• Efficiency: an estimator is called efficient if it attains the lower limit of the variance obtainable by an unbiased estimator. An estimator that reaches this limit is
¹¹ A sufficient statistic is a statistic such that "no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter" [18]. In other words, a statistic is sufficient for a pdf family if the sample from which it is calculated gives no additional information than does the statistic.
called a fully efficient estimator. Although, for a finite set of N observations, the fully efficient estimator generally does not exist, in many practical cases the ML estimator turns out to be asymptotically fully efficient.
• Gaussianity: the MLE turns out to be asymptotically Gaussian.
If an efficient estimator does not exist, then the lower limit cannot be achieved by the MLE and, in general, it is difficult to measure the distance from this limit.
Remark Comparing the ML and MAP estimators, it should be noted that in the latter the estimate is derived using a combination of a priori and a posteriori information on θ, where the a priori knowledge is formulated in terms of the pdf $f_{\theta}(\theta)$. However, ML estimation is potentially more feasible in practical problems because it does not require any a priori knowledge. Both procedures require knowledge of the joint pdf of the observations.

Note also that the ML estimator can be derived from the MAP one by considering the parameter θ as an RV with pdf uniformly distributed over $[-\infty, +\infty]$.
C.3.2.3 Example: Noisy Measure of a Parameter with a Single Observation
As a simple example to illustrate the MAP and ML methodologies, consider a single measure $x$ consisting of the sum of a parameter $\theta$ and a normally distributed zero-mean RV $w$ (AWGN), $w \sim N(0,\sigma_w^{2})$. The process is then defined as

$$x = \theta + w. \qquad (C.152)$$

Recall that (1) in ML estimation the parameter $\theta$ is a deterministic unknown constant, while (2) in MAP estimation $\theta$ is an RV with an a priori pdf of normal type, $N(\theta_{0},\sigma_{\theta}^{2})$.
ML Estimation
In the ML method, the likelihood function $L_{\theta} = f_{x;\theta}(x;\theta)$ is a scalar function of a single variable. From Eq. (C.152), $x$ is, by definition, Gaussian with mean value $\theta$ and variance $\sigma_w^{2}$. It follows that the likelihood function $L_{\theta}$ reflects this dependence and is defined as

$$L_{\theta} = f_{x;\theta}(x;\theta) = \frac{1}{\sqrt{2\pi\sigma_w^{2}}}\,e^{-\frac{1}{2\sigma_w^{2}}(x-\theta)^{2}}. \qquad (C.153)$$
Its logarithm is
$$\ln L_{\theta} = \ln f_{x;\theta}(x;\theta) = -\frac{1}{2}\ln\left(2\pi\sigma_w^{2}\right) - \frac{1}{2\sigma_w^{2}}(x-\theta)^{2}. \qquad (C.154)$$
To determine the maximum, we differentiate with respect to $\theta$ and equate to zero:

$$\hat{\theta}_{\mathrm{ML}} \triangleq \left\{\theta : \frac{1}{\sigma_w^{2}}(x-\theta) = 0\right\},$$

that is,

$$\hat{\theta}_{\mathrm{ML}} = x. \qquad (C.155)$$
It follows, then, that the best estimate in the ML sense is just the measured value $x$. This is an intuitive result since, in the absence of other information, it is not possible in any way to refine the estimate of the parameter $\theta$. The variance associated with the estimated value is

$$\mathrm{var}(\hat{\theta}_{\mathrm{ML}}) = E\left\{\hat{\theta}_{\mathrm{ML}}^{2}\right\} - E^{2}\left\{\hat{\theta}_{\mathrm{ML}}\right\} = E\{x^{2}\} - E^{2}\{x\}$$

which, for $x = \theta + w$, gives

$$\mathrm{var}(\hat{\theta}_{\mathrm{ML}}) = \theta^{2} + \sigma_w^{2} - \theta^{2} = \sigma_w^{2},$$
which obviously coincides with the variance of the superimposed noise w.
MAP Estimation
In the MAP method we have $x = \theta + w$ with $w \sim N(0,\sigma_w^{2})$, and we suppose the a priori pdf $f_{\theta}(\theta)$ to be known and normally distributed, $N(\theta_{0},\sigma_{\theta}^{2})$. The MAP estimate is obtained from Eq. (C.149) as

$$\hat{\theta}_{\mathrm{MAP}} \triangleq \left\{\theta : \frac{\partial}{\partial\theta}\left[\ln f_{x|\theta}(x|\theta) + \ln f_{\theta}(\theta)\right] = 0\right\}. \qquad (C.156)$$
Given the value of $\theta$, the pdf of $x$ is Gaussian with mean value $\theta$ and variance $\sigma_w^{2}$. It follows that the logarithm of this density is

$$\ln f_{x|\theta}(x|\theta) = -\frac{1}{2}\ln\left(2\pi\sigma_w^{2}\right) - \frac{1}{2\sigma_w^{2}}(x-\theta)^{2}, \qquad (C.157)$$

while the a priori known density $f_{\theta}(\theta)$ is equal to
$$f_{\theta}(\theta) = \frac{1}{\sqrt{2\pi\sigma_{\theta}^{2}}}\,e^{-\frac{1}{2\sigma_{\theta}^{2}}(\theta-\theta_{0})^{2}} \qquad (C.158)$$

with logarithm

$$\ln f_{\theta}(\theta) = -\frac{1}{2}\ln\left(2\pi\sigma_{\theta}^{2}\right) - \frac{1}{2\sigma_{\theta}^{2}}(\theta-\theta_{0})^{2}. \qquad (C.159)$$
By substituting (C.157) and (C.159) in (C.156) we obtain

$$\hat{\theta}_{\mathrm{MAP}} \triangleq \left\{\theta : \frac{\partial}{\partial\theta}\left[-\frac{1}{2}\ln\left(2\pi\sigma_w^{2}\right) - \frac{1}{2\sigma_w^{2}}(x-\theta)^{2} - \frac{1}{2}\ln\left(2\pi\sigma_{\theta}^{2}\right) - \frac{1}{2\sigma_{\theta}^{2}}(\theta-\theta_{0})^{2}\right] = 0\right\}.$$
Differentiating, we obtain

$$\frac{x - \hat{\theta}_{\mathrm{MAP}}}{\sigma_w^{2}} - \frac{\hat{\theta}_{\mathrm{MAP}} - \theta_{0}}{\sigma_{\theta}^{2}} = 0, \qquad (C.160)$$

that is,

$$\hat{\theta}_{\mathrm{MAP}} = \frac{x\,\sigma_{\theta}^{2} + \theta_{0}\,\sigma_w^{2}}{\sigma_w^{2} + \sigma_{\theta}^{2}} = \frac{x + \theta_{0}\left(\sigma_w^{2}/\sigma_{\theta}^{2}\right)}{1 + \sigma_w^{2}/\sigma_{\theta}^{2}}. \qquad (C.161)$$
Comparing the latter with the ML estimate (C.155), we observe that the MAP estimate can be viewed as a weighted sum of the ML estimate $x$ and of the a priori mean value $\theta_{0}$. In (C.161), the ratio of the variances $(\sigma_w^{2}/\sigma_{\theta}^{2})$ can be seen as a measure of confidence in the value $\theta_{0}$: the lower the value of $\sigma_{\theta}^{2}$, the greater the ratio $(\sigma_w^{2}/\sigma_{\theta}^{2})$ and the confidence in $\theta_{0}$, and the smaller the weight of the observation $x$.

In the limit case where $(\sigma_w^{2}/\sigma_{\theta}^{2}) \to \infty$, the MAP estimate is simply given by the a priori mean $\theta_{0}$. At the opposite extreme, if $\sigma_{\theta}^{2}$ increases, the MAP estimate coincides with the ML estimate, $\hat{\theta}_{\mathrm{MAP}} \to x$.
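The weighting behavior of (C.161) is easy to explore numerically. A sketch (plain Python; the function name and all values are hypothetical):

```python
def theta_map(x, theta0, s2w, s2t):
    # (C.161): weighted combination of the measurement x and the prior mean theta0
    r = s2w / s2t  # confidence ratio: large r means a strong (narrow) prior
    return (x + theta0 * r) / (1.0 + r)

x, theta0, s2w = 1.0, 0.0, 1.0
print(theta_map(x, theta0, s2w, 1e-4))  # strong prior: estimate pulled toward theta0
print(theta_map(x, theta0, s2w, 1e4))   # weak prior: estimate tends to the ML value x
```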
C.3.2.4 Example: Noisy Measure of a Parameter by N Observations
Let's now consider the previous example when N measurements are available:

$$x[n] = \theta + w[n], \qquad n = 0, 1, \ldots, N-1, \qquad (C.162)$$

where the samples $w[n]$ are iid, zero-mean, Gaussian distributed, $N(0,\sigma_w^{2})$.
ML Estimation
In the MLE, the likelihood function $L_{\theta} = f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ is an N-dimensional multivariate Gaussian defined as

$$L_{\theta} = f_{\mathbf{x};\theta}(\mathbf{x};\theta) = \frac{1}{\left(2\pi\sigma_w^{2}\right)^{N/2}}\,e^{-\frac{1}{2\sigma_w^{2}}\sum_{n=0}^{N-1}\left(x[n]-\theta\right)^{2}}. \qquad (C.163)$$
Its logarithm is

$$\ln L_{\theta} = \ln f_{\mathbf{x};\theta}(\mathbf{x};\theta) = -\frac{N}{2}\ln\left(2\pi\sigma_w^{2}\right) - \frac{1}{2\sigma_w^{2}}\sum_{n=0}^{N-1}\left(x[n]-\theta\right)^{2}.$$
Differentiating with respect to $\theta$ and setting to zero,

$$\frac{\partial \ln L_{\theta}}{\partial\theta} = \frac{1}{\sigma_w^{2}}\sum_{n=0}^{N-1}\left(x[n] - \hat{\theta}_{\mathrm{ML}}\right) = 0,$$

we obtain

$$\hat{\theta}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]. \qquad (C.164)$$
It follows, then, that the best estimate in the ML sense coincides with the average value of the observed data. This is an intuitive result, already reached previously, since, in the absence of other information, it is not possible to do better.
MAP Estimation
In MAP estimation we have $x[n] = \theta + w[n]$ with $w \sim N(0,\sigma_w^{2})$, and we suppose that the a priori pdf $f_{\theta}(\theta)$ is normally distributed, $N(\bar{\theta},\sigma_{\theta}^{2})$. The MAP estimate, proceeding as in the previous case, is obtained from

$$\sum_{n=0}^{N-1}\frac{x[n] - \hat{\theta}_{\mathrm{MAP}}}{\sigma_w^{2}} - \frac{\hat{\theta}_{\mathrm{MAP}} - \bar{\theta}}{\sigma_{\theta}^{2}} = 0, \qquad (C.165)$$
that is,
$$\hat{\theta}_{\mathrm{MAP}} = \frac{\dfrac{1}{N}\displaystyle\sum_{n=0}^{N-1} x[n] + \bar{\theta}\left(\sigma_w^{2}/N\sigma_{\theta}^{2}\right)}{1 + \sigma_w^{2}/N\sigma_{\theta}^{2}}. \qquad (C.166)$$
Again, comparing the latter with the ML estimate, we observe that the MAP estimate can be viewed as a weighted sum of the MLE and the a priori mean value. Comparing with the case of a single observation [Eq. (C.161)], one can observe that increasing the number of observations N reduces the dependence on the a priori density by a factor N. This result is reasonable and intuitive: each new observation reduces the variance of the estimate and reduces the dependence on the a priori model.
C.3.2.5 Example: Noisy Measure of L Parameters with N Observations
We now consider the general case where we have N measurements $\mathbf{x} \triangleq \left\{x[n]\right\}_{n=0}^{N-1}$ and estimate L parameters $\boldsymbol{\theta} \triangleq \left\{\theta[n]\right\}_{n=0}^{L-1}$, where the samples $w[n]$ are zero-mean Gaussian, $N(0,\sigma_w^{2})$, iid.
MAP Estimation
We proceed first with the MAP estimate. We seek to maximize the a posteriori density $f_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta}|\mathbf{x})$ or, equivalently, $\ln f_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta}|\mathbf{x})$, with respect to $\boldsymbol{\theta}$. This is achieved by differentiating with respect to each component of $\boldsymbol{\theta}$ and equating to zero:

$$\frac{\partial \ln f_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta}|\mathbf{x})}{\partial\theta[n]} = 0, \qquad n = 0, 1, \ldots, L-1. \qquad (C.167)$$

By separating the derivatives we obtain L equations in the L unknown parameters θ[0], θ[1], ..., θ[L-1] that, with a change of notation, can be expressed as
$$\nabla_{\boldsymbol{\theta}}\, f_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta}|\mathbf{x}) = \mathbf{0}, \qquad (C.168)$$

where the symbol $\nabla_{\boldsymbol{\theta}}$ indicates the differential operator, called "gradient," defined as

$$\nabla_{\boldsymbol{\theta}} \triangleq \left[\frac{\partial}{\partial\theta[0]},\, \frac{\partial}{\partial\theta[1]},\, \cdots,\, \frac{\partial}{\partial\theta[L-1]}\right]^{T}.$$
As in the case of a single parameter, the Bayes rule holds, so we have
$$f_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta}|\mathbf{x}) = \frac{f_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x}|\boldsymbol{\theta})\,f_{\boldsymbol{\theta}}(\boldsymbol{\theta})}{f_{\mathbf{x}}(\mathbf{x})},$$

which, taking the logarithm, can be written as

$$\ln f_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta}|\mathbf{x}) = \ln f_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x}|\boldsymbol{\theta}) + \ln f_{\boldsymbol{\theta}}(\boldsymbol{\theta}) - \ln f_{\mathbf{x}}(\mathbf{x}),$$

where $f_{\mathbf{x}}(\mathbf{x})$ does not depend on $\boldsymbol{\theta}$, so we can write

$$\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} \triangleq \left\{\boldsymbol{\theta} : \frac{\partial\left[\ln f_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x}|\boldsymbol{\theta}) + \ln f_{\boldsymbol{\theta}}(\boldsymbol{\theta})\right]}{\partial\theta[n]} = 0, \;\; n = 0, 1, \ldots, L-1\right\}. \qquad (C.169)$$
Finally, the solution of the above simultaneous equations constitutes the MAP estimate.
ML Estimation
In ML estimation, the likelihood function is $L_{\boldsymbol{\theta}} = f_{\mathbf{x};\boldsymbol{\theta}}(\mathbf{x};\boldsymbol{\theta})$ or, equivalently, its logarithm $\ln L_{\boldsymbol{\theta}} = \ln f_{\mathbf{x};\boldsymbol{\theta}}(\mathbf{x};\boldsymbol{\theta})$. Its maximum is defined as

$$\hat{\boldsymbol{\theta}}_{\mathrm{ML}} \triangleq \left\{\boldsymbol{\theta} : \frac{\partial \ln f_{\mathbf{x};\boldsymbol{\theta}}(\mathbf{x};\boldsymbol{\theta})}{\partial\theta[n]} = 0, \;\; n = 0, 1, \ldots, L-1\right\}. \qquad (C.170)$$
C.3.2.6 Variance Lower Bound: Cramer–Rao Lower Bound
A very important issue in estimation theory concerns the existence of a lower limit for the variance of the MVU estimator. This limit, known in the literature as the Cramer–Rao lower bound (CRLB) (also known as the Cramer–Rao inequality or information inequality), in honor of the mathematicians Harald Cramer and Calyampudi Radhakrishna Rao, who first derived it [23], expresses the minimum variance that can be achieved in the estimation of a vector of deterministic parameters $\boldsymbol{\theta}$.

For the determination of the limit we consider a classical estimator and a vector of RVs $\mathbf{x}(\zeta) = \left[x_{0}(\zeta)\;\; x_{1}(\zeta)\;\; \cdots\;\; x_{N-1}(\zeta)\right]^{T}$, and an unbiased estimator $\hat{\boldsymbol{\theta}} = h(\mathbf{x})$ such that, by definition, $E\{\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\} = \mathbf{0}$, characterized by the $(L \times L)$ covariance matrix $\mathbf{C}_{\hat{\boldsymbol{\theta}}}$ defined as [see (C.38)]

$$\mathbf{C}_{\hat{\boldsymbol{\theta}}} = \mathrm{cov}(\hat{\boldsymbol{\theta}}) = E\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^{T}\right]. \qquad (C.171)$$
Moreover, define the Fisher information matrix $\mathbf{J} \in \mathbb{R}^{L\times L}$, whose elements are¹²

$$J(i,j) = -E\left[\frac{\partial^{2}\ln f_{\mathbf{x};\boldsymbol{\theta}}(\mathbf{x};\boldsymbol{\theta})}{\partial\theta[i]\,\partial\theta[j]}\right], \qquad i,j = 0, 1, \ldots, L-1. \qquad (C.172)$$
The CRLB is defined by the inequality

$$\mathbf{C}_{\hat{\boldsymbol{\theta}}} \succeq \mathbf{J}^{-1}. \qquad (C.173)$$
The above indicates that the covariance of the estimator cannot be lower than the inverse of the amount of information contained in the random vector $\mathbf{x}$. In other words, inequality (C.173) expresses the lower limit of the variance obtainable by an unbiased estimator for the parameter vector $\boldsymbol{\theta}$.

As defined in Sect. C.3.1.7, an estimator attaining (C.173) with equality is a minimum variance unbiased (MVU) estimator. Note that (C.173) can be interpreted as $[\mathbf{C}_{\hat{\boldsymbol{\theta}}} - \mathbf{J}^{-1}] \succeq \mathbf{0}$ (positive semi-definite). An estimator which has the property (C.173) in the sense of equality is fully efficient.
Equation (C.173) expresses a general condition on the covariance matrix of the parameters. Sometimes it is useful to bound the variances of the individual parameter estimates: these correspond to the diagonal elements of the matrix $[\mathbf{C}_{\hat{\boldsymbol{\theta}}} - \mathbf{J}^{-1}]$. It follows that the diagonal elements of this matrix are nonnegative, i.e.,

$$\mathrm{var}(\hat{\theta}[i]) \geq \frac{1}{J(i,i)}, \qquad i = 0, 1, \ldots, L-1, \qquad (C.174)$$
from which, for a scalar parameter, we have

$$\mathrm{var}(\hat{\theta}) \geq \frac{1}{-E\left[\dfrac{\partial^{2}\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta^{2}}\right]} \qquad (C.175)$$
or
¹² The Fisher information is defined as the variance of the derivative of the log-likelihood function. It can be interpreted as the amount of information carried by an observable RV $\mathbf{x}$ about a nonobservable parameter θ upon which the likelihood function of θ, $L_{\theta} = f_{\mathbf{x};\theta}(\mathbf{x};\theta)$, depends.
$$\mathrm{var}(\hat{\theta}) \geq \frac{1}{E\left[\left(\dfrac{\partial\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta}\right)^{2}\right]}, \qquad (C.176)$$
which represents an equivalent form of the CRLB.
Proof We have that

$$\frac{\partial^{2}\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta^{2}} = \frac{\frac{\partial^{2}}{\partial\theta^{2}}f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{f_{\mathbf{x};\theta}(\mathbf{x};\theta)} - \left(\frac{\frac{\partial}{\partial\theta}f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{f_{\mathbf{x};\theta}(\mathbf{x};\theta)}\right)^{2} = \frac{\frac{\partial^{2}}{\partial\theta^{2}}f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{f_{\mathbf{x};\theta}(\mathbf{x};\theta)} - \left(\frac{\partial\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta}\right)^{2}.$$

Since

$$\int\frac{\partial\ln f(\mathbf{x};\theta)}{\partial\theta}\,f(\mathbf{x};\theta)\,d\mathbf{x} = \int\frac{\partial f(\mathbf{x};\theta)}{\partial\theta}\,d\mathbf{x} = \frac{\partial}{\partial\theta}\int f(\mathbf{x};\theta)\,d\mathbf{x} = \frac{\partial}{\partial\theta}\,1 = 0,$$

we get

$$E\left[\frac{\frac{\partial^{2}}{\partial\theta^{2}}f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{f_{\mathbf{x};\theta}(\mathbf{x};\theta)}\right] = \cdots = \frac{\partial^{2}}{\partial\theta^{2}}\int f(\mathbf{x};\theta)\,d\mathbf{x} = \frac{\partial^{2}}{\partial\theta^{2}}\,1 = 0.$$

Therefore

$$E\left[\left(\frac{\partial\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta}\right)^{2}\right] = -E\left[\frac{\partial^{2}\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta^{2}}\right].$$

Q.E.D.
Remark The CRLB expresses the minimum error variance of the estimator $h(\mathbf{x})$ of θ in terms of the pdf $f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ of the observations $\mathbf{x}$. So any unbiased estimator has an error variance not smaller than the CRLB.
Example As an example, consider the ML estimator for a single observation already studied in Sect. C.3.2.3, where we have [see (C.154)]

$$\ln L_{\theta} = \ln f_{x;\theta}(x;\theta) = -\frac{1}{2}\ln\left(2\pi\sigma_w^{2}\right) - \frac{1}{2\sigma_w^{2}}(x-\theta)^{2}.$$

From (C.175), the CRLB is

$$\mathrm{var}(\hat{\theta}) \geq \frac{1}{-E\left[\dfrac{\partial^{2}\ln f_{x;\theta}(x;\theta)}{\partial\theta^{2}}\right]} = \frac{1}{E\left[\dfrac{\partial^{2}}{\partial\theta^{2}}\left(\dfrac{1}{2\sigma_w^{2}}(x-\theta)^{2}\right)\right]}. \qquad (C.177)$$
Simplifying, we note that the CRLB is given by the simple relationship

$$\mathrm{var}(\hat{\theta}) \geq \sigma_w^{2}. \qquad (C.178)$$

This lower limit coincides with the ML estimator variance and, in this case, one can conclude that the ML estimator reaches the CRLB even for a finite number of observations.
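Extending the example to N iid observations, for which the CRLB becomes $\sigma_w^{2}/N$, one can check by Monte Carlo that the sample-mean ML estimator (C.164) attains the bound. A sketch (Python with NumPy assumed; all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, s2w, N, trials = 0.5, 2.0, 25, 200_000

x = theta + np.sqrt(s2w) * rng.standard_normal((trials, N))
theta_ml = x.mean(axis=1)  # sample-mean ML estimator, (C.164)

crlb = s2w / N             # CRLB for N iid Gaussian observations
print(theta_ml.var(), crlb)  # empirical variance sits at the bound
```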
C.3.2.7 Minimum Mean Squares Error Estimator
Suppose we want to estimate the parameter θ using a single measure $x$, such that the mean squares error defined in (C.127) is minimized. Let $\hat{\theta} = h(x)$; then $\mathrm{mse}(\hat{\theta}) = E\{|\hat{\theta} - \theta|^{2}\}$, so we have

$$\mathrm{mse}(\hat{\theta}) = E\left\{|h(x) - \theta|^{2}\right\}. \qquad (C.179)$$
The expected value above can be rewritten as

$$\mathrm{mse}(\hat{\theta}) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |h(x) - \theta|^{2}\, f_{x,\theta}(x,\theta)\, d\theta\, dx. \qquad (C.180)$$

Remember that the joint pdf $f_{x,\theta}(x,\theta)$ can be factored as

$$f_{x,\theta}(x,\theta) = f_{\theta|x}(\theta|x)\, f_{x}(x). \qquad (C.181)$$
Then we obtain

$$\mathrm{mse}(\hat{\theta}) = \int_{-\infty}^{\infty} f_{x}(x)\left[\int_{-\infty}^{\infty} |h(x) - \theta|^{2}\, f_{\theta|x}(\theta|x)\, d\theta\right] dx. \qquad (C.182)$$
In the previous expression, both integrands are positive everywhere (by pdf definition). Moreover, the external integral is fully independent of the function $h(x)$. It follows that the minimization of (C.182) is equivalent to the minimization of the internal integral

$$\int_{-\infty}^{\infty} |h(x) - \theta|^{2}\, f_{\theta|x}(\theta|x)\, d\theta. \qquad (C.183)$$
Differentiating with respect to $h(x)$ and setting to zero,

$$2\int_{-\infty}^{\infty}\left(h(x) - \theta\right) f_{\theta|x}(\theta|x)\, d\theta = 0$$
or
$$h(x)\int_{-\infty}^{\infty} f_{\theta|x}(\theta|x)\, d\theta = \int_{-\infty}^{\infty} \theta\, f_{\theta|x}(\theta|x)\, d\theta.$$

By definition, $\int_{-\infty}^{\infty} f_{\theta|x}(\theta|x)\, d\theta = 1$, whence

$$\hat{\theta}_{\mathrm{MMSE}} = h(x) \triangleq \int_{-\infty}^{\infty} \theta\, f_{\theta|x}(\theta|x)\, d\theta = E(\theta|x). \qquad (C.184)$$
The MMSE estimator is obtained when the function $h(x)$ equals the expectation of θ conditioned on the data $x$. Moreover, note that, differently from MAP and ML, the MMSE estimator requires knowledge of the conditional expected value of the a posteriori pdf but does not require its explicit knowledge. The estimate $\hat{\theta}_{\mathrm{MMSE}} = E(\theta|x)$ is, in general, a nonlinear function of the data. An important exception is when the a posteriori pdf is Gaussian; in this case, $\hat{\theta}_{\mathrm{MMSE}}$ becomes a linear function of $x$.
It is interesting to compare the MAP estimator described above and the MMSE. Both estimators consider the parameter θ to be estimated as an RV, so both can be considered Bayesian. Both also produce estimates based on the a posteriori pdf of θ; the distinction between the two lies in the optimization criterion. The MAP takes the maximum (peak) of the function, while the MMSE criterion considers the expected value. Moreover, note that for a symmetric density the peak and the expected value (and thus the MAP and MMSE estimates) coincide; note also that this class includes the common case of a Gaussian a posteriori density.

Comparing classical and Bayesian estimators, we observe that in the former case quality is defined in terms of bias, consistency, efficiency, etc. In Bayesian estimation, θ being an RV makes these indicators inappropriate: the performance is evaluated in terms of a cost function such as the one in (C.182). Note that the MMSE cost function is not the only possible choice. In principle, one can choose other criteria such as, for example, the minimum absolute value or Minimum Absolute Error (MAE)

$$\mathrm{mae}(\hat{\theta}) = E\left\{|h(x) - \theta|\right\}. \qquad (C.185)$$
Indeed, the MAP estimator can be derived from different forms of the cost function. The optimal estimator in the MAE sense coincides with the median of the a posteriori density. For a symmetric density, the MAE estimate coincides with the MMSE and the MAP ones. In the case of unimodal symmetric densities, the optimal solution can be obtained with a wide class of cost functions and, moreover, coincides with the solution $\hat{\theta}_{\mathrm{MMSE}}$.
Finally, note that in the case of multivariate densities, expression (C.184) can be generalized as

$$\hat{\boldsymbol{\theta}}_{\mathrm{MMSE}} = E(\boldsymbol{\theta}|\mathbf{x}). \qquad (C.186)$$
C.3.2.8 Linear MMSE Estimator
The expression of the MMSE estimator (C.184) or (C.186), as noted in the previous paragraph, is generally nonlinear. Suppose now that we impose on the form of the MMSE estimator the constraint of linearity with respect to the observed data $\mathbf{x}$. With this constraint, the estimator consists of a simple linear combination of the measures. It therefore assumes the form

$$\hat{\theta}_{\mathrm{MMSE}}^{*} = h(\mathbf{x}) \triangleq \sum_{i=0}^{N-1} h_{i}\, x[i] = \mathbf{h}^{T}\mathbf{x}, \qquad (C.187)$$
where the coefficients $\mathbf{h}$ are weights that can be determined by the minimization of the mean squares error, i.e.,

$$\mathbf{h}_{\mathrm{opt}} \triangleq \left\{\mathbf{h} : \frac{\partial}{\partial\mathbf{h}}\, E\left\{\left|\theta - \mathbf{h}^{T}\mathbf{x}\right|^{2}\right\} = \mathbf{0}\right\}. \qquad (C.188)$$
For the derivative computation it is convenient to define the quantity "error" as

$$e = \theta - \hat{\theta}_{\mathrm{MMSE}}^{*} = \theta - \mathbf{h}^{T}\mathbf{x} \qquad (C.189)$$

and, using this definition, it is possible to express the mean squares error as a function of the estimator parameters $\mathbf{h}$ as

$$J(\mathbf{h}) \triangleq E\left\{e^{2}\right\} = E\left\{\left|\theta - \mathbf{h}^{T}\mathbf{x}\right|^{2}\right\}. \qquad (C.190)$$
With the previous positions, the derivative in (C.188) is

$$\frac{\partial J(\mathbf{h})}{\partial\mathbf{h}} = \frac{\partial E\left\{|e|^{2}\right\}}{\partial\mathbf{h}} = E\left\{2e\,\frac{\partial\left(\theta - \mathbf{h}^{T}\mathbf{x}\right)}{\partial\mathbf{h}}\right\} = -2E\{e\,\mathbf{x}\}. \qquad (C.191)$$

The optimal solution can be computed by setting $\partial J(\mathbf{h})/\partial\mathbf{h} = \mathbf{0}$, which gives

$$E\{e\cdot\mathbf{x}\} = \mathbf{0}. \qquad (C.192)$$
The above expression indicates that, at the optimal solution point, the error $e$ is orthogonal to the vector of data $\mathbf{x}$ (the measures). In other words, (C.192) expresses the principle of orthogonality, which represents a fundamental property of the linear MMSE estimation approach.
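The orthogonality principle (C.192) can be verified by solving the sample normal equations $\mathbf{R}_{xx}\mathbf{h} = \mathbf{r}_{x\theta}$ and checking that the residual error is numerically orthogonal to the data. A sketch (Python with NumPy assumed; the data model is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
M, samples = 3, 200_000

# Illustrative Bayesian linear model: each of the M entries of x observes theta in noise
theta = rng.standard_normal(samples)
x = theta[:, None] + 0.5 * rng.standard_normal((samples, M))

# Normal equations Rxx h = r_xtheta, with sample estimates of the second-order moments
Rxx = x.T @ x / samples
r_xt = x.T @ theta / samples
h = np.linalg.solve(Rxx, r_xt)

e = theta - x @ h         # error, (C.189)
print(x.T @ e / samples)  # E{e x} ~ 0: the orthogonality principle (C.192)
```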
C.3.2.9 Example: Signal Estimation
We now extend the concepts presented in the preceding paragraphs to the estimation of signals defined as time sequences.

With this assumption, the vector of measured data is represented by the sequence $\mathbf{x} = \left[x[n]\;\; x[n-1]\;\; \cdots\;\; x[n-N+1]\right]^{T}$, while the vector of parameters to be estimated is another sequence, in this context called the desired signal, indicated as $\mathbf{d} = \left[d[n]\;\; d[n-1]\;\; \cdots\;\; d[n-L+1]\right]^{T}$. In this situation, the estimator is defined by the operator

$$\hat{\mathbf{d}} = T\{\mathbf{x}\}. \qquad (C.193)$$

In other words, $T\{\cdot\}$ maps the sequence $\mathbf{x}$ to another sequence $\hat{\mathbf{d}}$.
For such a problem the MAP, ML, MMSE, and linear MMSE estimators are defined as follows:

1. MAP

$$\arg\max\; f_{d|\mathbf{x}}\left(d[n]\,\big|\,x[n]\right), \qquad (C.194)$$

2. ML

$$\arg\max\; f_{\mathbf{x};d}\left(d[n];\,x[n]\right), \qquad (C.195)$$

3. MMSE

$$\hat{d}[n] = E\left\{d[n]\,\big|\,x[n]\right\}, \qquad (C.196)$$

4. Linear MMSE

$$\hat{d}[n] = \mathbf{h}^{T}\mathbf{x}. \qquad (C.197)$$
Comparing the four procedures, we can say that the linear MMSE estimator, while the least general, has the simplest implementation form. In fact, methods 1, 2, and 3 require the explicit knowledge of the densities of the signals (and of the parameters to estimate) or, at least, of the conditional expectations. The linear MMSE, however, can be obtained from the knowledge of the second-order moments (acf, ccf) of the data and parameters alone and, even if these are not known, they can easily be estimated directly from the data. As another strong point of the linear MMSE method, note that the structure of the operator $T\{\cdot\}$ has the form of a convolution (inner or dot product) and takes the form of an FIR filter; so we have
$$\hat{d}[n] = \sum_{k=0}^{M-1} w[k]\, x[n-k] = \mathbf{w}^{T}\mathbf{x} \qquad (C.198)$$

for which the parameters $\mathbf{h}$ in (C.197) are replaced by the coefficients $\mathbf{w}$ of a linear FIR filter. This solution, which happens to be one of the best and most widely used in adaptive signal processing, also extends to many artificial neural network architectures.
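As a concrete instance of (C.198), the sketch below (Python with NumPy assumed; the signal model, noise level, and filter length are all illustrative) designs the FIR coefficients purely from second-order statistics and checks that filtering reduces the error with respect to the raw noisy observation:

```python
import numpy as np

rng = np.random.default_rng(5)
n, M = 100_000, 8

# Desired signal: unit-variance AR(1); observation: desired signal plus white noise
a = 0.95
d = np.zeros(n)
for k in range(1, n):
    d[k] = a * d[k - 1] + np.sqrt(1 - a**2) * rng.standard_normal()
x = d + 0.5 * rng.standard_normal(n)

# Delay-line data matrix: row i holds [x[i+M-1], x[i+M-2], ..., x[i]]
X = np.stack([x[M - 1 - k : n - k] for k in range(M)], axis=1)
dd = d[M - 1 :]

# FIR coefficients from (sample) second-order moments only, as in (C.198)
w = np.linalg.solve(X.T @ X / len(dd), X.T @ dd / len(dd))
d_hat = X @ w

# Raw-observation MSE vs. filtered MSE
print(np.mean((x[M - 1 :] - dd) ** 2), np.mean((d_hat - dd) ** 2))
```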
C.3.3 Stochastic Models
An extremely powerful paradigm, useful for the statistical characterization of many types of time series, is to consider a stochastic sequence as the output of a linear time-invariant filter whose input is a white noise sequence. This type of random sequence is defined as a linear stochastic process. For stationary sequences this model is general, and the following theorem holds.
C.3.3.1 Wold Theorem
A stationary random sequence $x[n]$ that can be represented as the output of a causal, stable, time-invariant filter, characterized by the impulse response $h[n]$, for a white noise input $\eta[n]$,

$$x[n] = \sum_{k=0}^{\infty} h[k]\,\eta[n-k], \qquad (C.199)$$

is defined as a linear stochastic process.

Moreover, let $H(e^{j\omega})$ be the frequency response of $h[n]$ [see (C.120)]; the power spectral density (PSD) of $x[n]$ is then defined as

$$R_{xx}\left(e^{j\omega}\right) = \left|H\left(e^{j\omega}\right)\right|^{2}\sigma_{\eta}^{2}, \qquad (C.200)$$

where $\sigma_{\eta}^{2}$ represents the variance (the power) of the white noise $\eta[n]$.
C.3.3.2 Autoregressive Model
The autoregressive (AR) time-series model is characterized by the following difference equation:

$$x[n] = -\sum_{k=1}^{p} a[k]\,x[n-k] + \eta[n], \qquad (C.201)$$

which defines the pth-order autoregressive model, indicated as AR(p). The filter coefficients $\mathbf{a} = \left[a_{1}\;\; a_{2}\;\; \cdots\;\; a_{p}\right]^{T}$ are called the autoregressive parameters.
The frequency response of the AR filter is

$$H\left(e^{j\omega}\right) = \frac{1}{1 + \displaystyle\sum_{k=1}^{p} a[k]\,e^{-j\omega k}}, \qquad (C.202)$$

so it is an all-pole filter. Therefore, the PSD of the process is (Fig. C.11)

$$R_{xx}\left(e^{j\omega}\right) = \frac{\sigma_{\eta}^{2}}{\left|1 + \displaystyle\sum_{k=1}^{p} a[k]\,e^{-j\omega k}\right|^{2}}. \qquad (C.203)$$
Moreover, it is easy to show that the acf of an AR(p) model satisfies the following difference equation:

$$r[k] = \begin{cases} -\displaystyle\sum_{l=1}^{p} a[l]\,r[k-l], & k \geq 1 \\[6pt] -\displaystyle\sum_{l=1}^{p} a[l]\,r[l] + \sigma_{\eta}^{2}, & k = 0. \end{cases} \qquad (C.204)$$
Note that the latter can be written in matrix form as

$$\begin{bmatrix} r[0] & r[1] & \cdots & r[p-1] \\ r[1] & r[0] & \cdots & r[p-2] \\ \vdots & \vdots & \ddots & \vdots \\ r[p-1] & r[p-2] & \cdots & r[0] \end{bmatrix} \begin{bmatrix} a[1] \\ a[2] \\ \vdots \\ a[p] \end{bmatrix} = -\begin{bmatrix} r[1] \\ r[2] \\ \vdots \\ r[p] \end{bmatrix}. \qquad (C.205)$$
Fig. C.11 Discrete-time circuit for the generation of a linear autoregressive random sequence
Moreover, from (C.204) we have that

$$\sigma_\eta^2 = r[0] + \sum_{k=1}^{p} a[k]\,r[k]. \qquad (C.206)$$
From the foregoing, supposing the acf values $r[k]$ for $k = 0, 1, \ldots, p$ are known,
the AR parameters can be determined by solving the system of
$p$ linear equations (C.205). These equations are known as the Yule–Walker equations.
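Given the acf values, the system (C.205) together with (C.206) can be solved with a plain linear solver. The sketch below assumes NumPy and uses illustrative names; in practice the Toeplitz structure of (C.205) is exploited by the Levinson–Durbin recursion, but a direct solve shows the idea:

```python
import numpy as np

def yule_walker(r):
    # Solve (C.205) for the AR parameters, then (C.206) for the noise variance.
    # r = [r[0], r[1], ..., r[p]] are the known acf values.
    r = np.asarray(r, dtype=float)
    p = r.size - 1
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz
    a = np.linalg.solve(R, -r[1:])
    sigma2_eta = r[0] + a @ r[1:]  # (C.206)
    return a, sigma2_eta

# For an acf of the form r[k] = r[0] a^k (Markov-I process with a = 0.8),
# the solution must be a[1] = -0.8, a[2] = 0, and sigma^2 = r[0](1 - a^2).
a_hat, s2 = yule_walker([1.0, 0.8, 0.64])
```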
Example: First-Order AR Process (Markov Process) Consider a first-order AR
process in which, for simplicity of exposition, it is assumed that $a = -a[1]$; we have
that

$$x[n] = a\,x[n-1] + \eta[n], \qquad n \ge 0,\quad x[-1] = 0. \qquad (C.207)$$
The TF has a single pole, $H(z) = 1/(1 - az^{-1})$. From (C.204),

$$r[k] = \begin{cases} a\,r[k-1] & k \ge 1 \\ a\,r[1] + \sigma_\eta^2 & k = 0, \end{cases} \qquad (C.208)$$

which can be solved as

$$r[k] = r[0]\,a^{k}, \qquad k > 0. \qquad (C.209)$$
Hence from (C.206) we have that

$$\sigma_\eta^2 = r[0] - a\,r[1]. \qquad (C.210)$$

It is possible to derive the acf as a function of the parameter $a$ as

$$r[k] = \frac{\sigma_\eta^2}{1 - a^2}\,a^{k}. \qquad (C.211)$$
The process generated by (C.207) is typically called a first-order Markov
stochastic process (Markov-I model). In this case, the AR filter has an impulse
response that decreases geometrically at a rate $a$ determined by the position of
the pole on the $z$-plane.
Narrowband First-Order Markov Process with Unitary Variance
Usually, the performance of adaptive algorithms is measured with
narrowband unit-variance SP. Very often, these SPs are generated with Eq. (C.207)
for values of $a$ very close to 1, i.e., $0 \ll a < 1$.
In addition, from (C.211), for the process $x[n]$ to have unit variance it is
sufficient that the input GWN have a variance equal to $1 - a^2$. In other words, for
$\eta[n] \sim N(0,1)$, it is sufficient to have the TF $H(z) = \sqrt{1-a^2}\,/\,(1 - az^{-1})$, which
corresponds to the difference equation

$$x[n] = a\,x[n-1] + \sqrt{1-a^2}\,\eta[n], \qquad n \ge 0,\quad x[-1] = 0. \qquad (C.212)$$
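A minimal generator for (C.212), assuming NumPy (the function name and seed handling are illustrative, not from the text):

```python
import numpy as np

def markov1_unit_variance(a, N, seed=None):
    # x[n] = a x[n-1] + sqrt(1 - a^2) eta[n], eta[n] ~ N(0, 1), x[-1] = 0  (C.212)
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal(N)
    g = np.sqrt(1.0 - a * a)
    x = np.empty(N)
    prev = 0.0
    for n in range(N):
        prev = a * prev + g * eta[n]
        x[n] = prev
    return x

x = markov1_unit_variance(0.95, 200_000, seed=0)
# For long sequences, var(x) tends to 1 and r[1]/r[0] tends to a.
```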
In this case the acf is $r[k] = \sigma_\eta^2\,a^{k}$ for $k = 0, 1, \ldots, M$, so the autocorrelation
matrix is

$$\mathbf{R}_{xx} = \sigma_\eta^2 \begin{bmatrix} 1 & a & a^2 & \cdots & a^{M-1} \\ a & 1 & a & \cdots & a^{M-2} \\ a^2 & a & 1 & \cdots & \vdots \\ \vdots & \vdots & \vdots & \ddots & a \\ a^{M-1} & a^{M-2} & \cdots & a & 1 \end{bmatrix}. \qquad (C.213)$$
For example, in the case $M = 2$, the condition number of $\mathbf{R}_{xx}$, given by the ratio
between the maximum and minimum eigenvalues, is equal to$^{13}$

$$\chi(\mathbf{R}_{xx}) = \frac{1+a}{1-a}, \qquad (C.214)$$

for which, in order to test the algorithms under extreme conditions, it is possible to
generate a process with a predetermined value of the condition number. In fact,
solving the latter for $a$, we get

$$a = \frac{\chi(\mathbf{R}_{xx}) - 1}{\chi(\mathbf{R}_{xx}) + 1}. \qquad (C.215)$$
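For instance, (C.214)–(C.215) can be cross-checked numerically (NumPy assumed; names are illustrative): choose $a$ from a target condition number, then recompute $\chi$ from the eigenvalues of the $M = 2$ matrix (C.213):

```python
import numpy as np

def a_for_condition_number(chi):
    # Invert (C.214): pick a so that the M = 2 matrix R_xx has condition number chi.
    return (chi - 1.0) / (chi + 1.0)

a = a_for_condition_number(100.0)        # a = 99/101, close to 1
Rxx = np.array([[1.0, a], [a, 1.0]])     # (C.213) with M = 2 and unit variance
w = np.linalg.eigvalsh(Rxx)              # eigenvalues 1 - a and 1 + a
chi_check = w.max() / w.min()
```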
C.3.3.3 Moving Average Model
The moving average (MA) time-series model is characterized by the following
difference equation:
$$x[n] = \sum_{k=0}^{q} b[k]\,\eta[n-k], \qquad (C.216)$$

which defines the order-$q$ moving average model, indicated as MA($q$). The coef-
ficients of the filter $\mathbf{b} = [b_0\ b_1\ \cdots\ b_q]^T$ are called the moving average parameters.
The scheme of the moving average circuit model is illustrated in Fig. C.12.
$^{13}$ $p(\lambda) = \det\begin{bmatrix} 1-\lambda & a \\ a & 1-\lambda \end{bmatrix} = \lambda^2 - 2\lambda + (1 - a^2)$, for which $\lambda_{1,2} = 1 \pm a$.
The frequency response of the filter is

$$H(e^{j\omega}) = \sum_{k=0}^{q} b[k]\,e^{-j\omega k}. \qquad (C.217)$$

The filter has a multiple pole at the origin and is characterized only by zeros. The
PSD of the process is

$$R_{xx}(e^{j\omega}) = \sigma_\eta^2 \left| \sum_{k=0}^{q} b[k]\,e^{-j\omega k} \right|^2. \qquad (C.218)$$
The acf of the MA($q$) model is

$$r[k] = \begin{cases} \sigma_\eta^2 \sum_{l=0}^{q-|k|} b[l]\,b[l+|k|] & |k| \le q \\[4pt] 0 & |k| > q. \end{cases} \qquad (C.219)$$
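The acf (C.219) is straightforward to tabulate; a sketch assuming NumPy (the function name is illustrative):

```python
import numpy as np

def ma_acf(b, sigma2_eta, k):
    # acf of an MA(q) process per (C.219); b = [b_0, b_1, ..., b_q].
    b = np.asarray(b, dtype=float)
    q = b.size - 1
    k = abs(int(k))
    if k > q:
        return 0.0                      # the acf vanishes beyond lag q
    return sigma2_eta * float(np.sum(b[:q - k + 1] * b[k:]))

b = [1.0, 0.5]                          # MA(1) example
r0 = ma_acf(b, 1.0, 0)                  # b_0^2 + b_1^2 = 1.25
r1 = ma_acf(b, 1.0, 1)                  # b_0 b_1 = 0.5
r2 = ma_acf(b, 1.0, 2)                  # 0, since lag 2 > q
```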
C.3.3.4 Spectral Estimation with Autoregressive Moving Average Model
If the generation filter has both poles and zeros, the model is an autoregressive moving
average (ARMA) model. Denoting by $q$ and $p$, respectively, the degrees of the
numerator and denominator polynomials of the transfer function $H(z)$, the model is
indicated as ARMA($p, q$). The model is then characterized by the following
difference equation:
$$x[n] = -\sum_{k=1}^{p} a[k]\,x[n-k] + \sum_{k=0}^{q} b[k]\,\eta[n-k]. \qquad (C.220)$$
For the PSD we then have

$$R_{xx}(e^{j\omega}) = \sigma_\eta^2 \left|H(e^{j\omega})\right|^2 = \sigma_\eta^2\, \frac{\left|b_0 + b_1 e^{-j\omega} + b_2 e^{-j2\omega} + \cdots + b_q e^{-jq\omega}\right|^2}{\left|1 + a_1 e^{-j\omega} + a_2 e^{-j2\omega} + \cdots + a_p e^{-jp\omega}\right|^2}. \qquad (C.221)$$
Fig. C.12 Discrete-time circuit for the generation of a linear moving average random sequence
Remark The AR, MA, and ARMA models are widely used in digital signal
processing, in many contexts: the analysis and synthesis of
signals, signal compression, signal classification, quality enhancement, etc.
The expression (C.221) defines a power spectral density that represents an
estimate of the spectrum of the signal $x[n]$. In other words, (C.221) allows the
estimation of the PSD through the estimation of the parameters $\mathbf{a}$ and $\mathbf{b}$ of the
stochastic ARMA signal generation model. In signal analysis, such methods
are referred to as parametric methods of spectral estimation [17].
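As a sketch of such a parametric estimate, once the parameters a and b have been estimated, (C.221) is evaluated on a frequency grid (NumPy assumed; names are illustrative):

```python
import numpy as np

def arma_psd(a, b, sigma2_eta, omega):
    # Parametric PSD (C.221): sigma^2 |B(e^{jw})|^2 / |A(e^{jw})|^2, with
    # A(z) = 1 + a_1 z^{-1} + ... + a_p z^{-p} and B(z) = b_0 + ... + b_q z^{-q}.
    a = np.concatenate(([1.0], np.asarray(a, dtype=float)))
    b = np.asarray(b, dtype=float)
    A = np.exp(-1j * np.outer(omega, np.arange(a.size))) @ a
    B = np.exp(-1j * np.outer(omega, np.arange(b.size))) @ b
    return sigma2_eta * np.abs(B) ** 2 / np.abs(A) ** 2

omega = np.linspace(0, np.pi, 256)
psd = arma_psd([-0.5], [1.0, 0.4], 1.0, omega)   # ARMA(1, 1) example
# At omega = 0 the value is |1.4|^2 / |0.5|^2 = 7.84
```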
References
1. Golub GH, Van Loan CF (1989) Matrix computations. Johns Hopkins University Press, Baltimore, MD. ISBN 0-80183772-3
2. Sherman J, Morrison WJ (1950) Adjustment of an inverse matrix corresponding to a change in
one element of a given matrix. Ann Math Stat 21(1):124–127
3. Fletcher R (1986) Practical methods of optimization. Wiley, New York. ISBN 0471278289
4. Nocedal J (1992) Theory of algorithms for unconstrained optimization. Acta Numerica 1:199–242
5. Lyapunov AM (1966) Stability of motion. Academic, New York
6. Levenberg K (1944) A method for the solution of certain problems in least squares. Quart Appl
Math 2:164–168
7. Marquardt D (1963) An algorithm for least squares estimation on nonlinear parameters.
SIAM J Appl Math 11:431–441
8. Tychonoff AN, Arsenin VY (1977) Solution of Ill-posed problems. Winston & Sons,
Washington, DC. ISBN 0-470-99124-0
9. Broyden CG (1970) The convergence of a class of double-rank minimization algorithms. J Inst
Math Appl 6:76–90
10. Goldfarb D (1970) A family of variable metric updates derived by variational means.
Math Comput 24:23–26
11. Shanno DF (1970) Conditioning of quasi-Newton methods for function minimization. Math Comput 24:647–656
12. Hestenes MR, Stiefel E (1952) Methods of conjugate gradients for solving linear systems. J Res Natl Bur Stand 49:409–436
13. Hestenes MR, Stiefel E (1952) Methods of conjugate gradients for solving linear systems.
J Res Natl Bur Stand 49(6):409–436, available on-line http://nvlpubs.nist.gov/nistpubs/jres/
049/6/V49.N06.A08.pdf
14. Shewchuk JR (1994) An introduction to the conjugate gradient method without the agonizing
pain. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
15. Andrei N (2008) Conjugate gradient methods for large-scale unconstrained optimization
scaled conjugate gradient algorithms for unconstrained optimization. Ovidius University,
Constantza, on-line available on http://www.ici.ro/camo/neculai/cg.ppt
16. Papoulis A (1991) Probability, random variables, and stochastic processes, 3rd edn.
McGraw-Hill, New York
17. Kay SM (1998) Fundamentals of statistical signal processing: detection theory. Prentice Hall, Upper Saddle River, NJ
18. Fisher RA (1922) On the mathematical foundations of theoretical statistics. Philos Trans R Soc
A 222:309–368
19. Manolakis DG, Ingle VK, Kogon SM (2005) Statistical and adaptive signal processing.
Artech House, Norwood, MA
20. Widrow B, Stearns SD (1985) Adaptive signal processing. Prentice Hall ed, Englewood Cliffs,
NJ
21. Sayed AH (2003) Fundamentals of adaptive filtering. IEEE Wiley Interscience, Hoboken, NJ
22. Wiener N (1949) Extrapolation, interpolation and smoothing of stationary time series, with
engineering applications. Wiley, New York
23. Rao CR (1994) Selected papers of C.R. Rao. In: Das Gupta S (ed). Wiley. ISBN 978-0470220917
24. Strang G (1988) Linear algebra and its applications, 3rd edn. Thomas Learning, Lakewood,
CO. ISBN 0-15-551005-3
25. Petersen KB, Pedersen MS (2012) The matrix cookbook, Ver. November 15
26. Daubechies I (1988) Orthonormal bases of compactly supported wavelets. Commun Pure Appl
Math 41:909–996
27. Wikipedia: http://en.wikipedia.org/wiki/Matrix_theory
Index
A
Active noise control (ANC)
confined spaces, 76
in duct, 75, 76
free space, 76
one-dimensional tubes, 75
operation principle of, 75, 76
personal protection, 76
Adaptation algorithm
first-order SDA and SGA algorithms,
208–209
general properties
energy conservation, 223–225
minimal perturbation properties,
221–223
nonlinearity error adaptation, 220
principle of energy conservation,
224–225
SGA analysis, 221
performance
convergence speed and learning curve,
218–219
nonlinear dynamic system, 215–216
stability analysis, 216
steady-state performance, 217–218
tracking properties, 219–220
weights error vector and root mean
square deviation, 216–217
priori and posteriori errors, 209–210
recursive formulation, 207–208
second-order SDA and SGA algorithms
conjugate gradient algorithms (CGA)
algorithms, 212–213
discrete Newton’s method, 210
formulation, 211
Levenberg–Marquardt variant, 211–212
on-line learning algorithms, 214
optimal filtering, 213
quasi-Newton/variable metric methods,
212
weighing matrix, 211
steepest-descent algorithms, 206
stochastic-gradient algorithms, 206
transversal adaptive filter, 206–207
Adaptive acoustic echo canceller scheme, 74
Adaptive beamforming, sidelobe canceller
composite-notation GSC, 554–556
frequency domain GSC, 556–558
generalized sidelobe canceller, 547, 548
with block algorithms, 551–553
block matrix determination, 549, 550
geometric interpretation of, 553–554
interference canceller, 549
with on-line algorithms, 551
high reverberant environment, 559–561
multiple sidelobe canceller, 547
robust GSC beamforming, 558–559
Adaptive channel equalization, 69–71
Adaptive filter (AF)
active noise control
confined spaces, 76
in duct, 75, 76
free space, 76
one-dimensional tubes, 75
operation principle of, 75, 76
personal protection, 76
adaptive inverse modeling estimation
adaptive channel equalization,
69–71
control and predistortion, 71
downstream/upstream estimation
schemes, 68, 69
adaptive noise/interference cancellation, 72
array processing
adaptive interference/noise cancellation
microphone array, 78, 79
beamforming, 78–81
detection of arrivals, sensors for, 78, 79
room acoustics active control, 81–82
very large array radio telescope, 77
biological inspired intelligent circuits
artificial neural networks, 82–85
biological brain characteristics, 83
blind signal processing, 86
blind signal separation, 86–89
formal neurons, 84
multilayer perceptron network, 84–85
reinforcement learning, 85–86
supervised learning algorithm, 85, 86
classification based on
cost function characteristics, 62–66
input-output characteristics, 60, 61
learning algorithm, 61–62
definition of, 55, 58
discrete-time, 58, 59
dynamic physical system identification
process
model selection, 66–67
pseudo random binary sequences, 67
schematic representation, 66, 67
set of measures, 67–68
structural identification procedure, 67
echo cancellation
adaptive echo cancellers, 74
hybrid circuit, 73
multichannel case, 75
teleconference scenario, 73
two-wire telephone communication, 73
linear adaptive filter
filter input-output relation, 92
real and complex domain vector
notation, 92–94
MIMO filter (see Multiple-input
multiple-output (MIMO) filter)
multichannel filter with blind learning
scheme, 86, 87
optimization criterion and cost functions,
99–100
prediction system, 68
schematic representation of, 57, 58
stochastic optimization
adaptive filter performance
measurement, 110–113
coherence function, 108–110
correlation matrix estimation, 105–108
frequency domain interpretation,
108–110
geometrical interpretation and
orthogonality principle, 113–114
multichannel Wiener’s normal
equations, 119–121
principal component analysis of optimal
filter, 114–118
Wiener filter, complex domain
extension, 118–119
Wiener–Hopf notation (see Wiener–
Hopf notation)
stochastic processes, 91
usability of, 55
Adaptive interference/noise cancellation
(AIC), 72
acoustic underwater exploration, 138–139
adaptive noise cancellation principle
scheme, 133
error minimization, 133
impulse response, 132
microphone array, 78, 79
performances analysis, 137–138
primary reference, 131
reverberant noisy environment, 134, 135
scalar version, 133
secondary reference, 131, 135
signal error, 131–132
without secondary reference signal
adaptive line enhancement, 140–141
broadband signal and narrowband noise,
139–140
Adaptive inverse modeling estimation, AF
adaptive channel equalization, 69–71
control and predistortion, 71
downstream/upstream estimation schemes,
68, 69
Adaptive line enhancement (ALE), 140–141
Affine projection algorithms (APA)
computational complexity of, 298
delay input vector, 299
description of, 295
minimal perturbation property, 296–298
variants of, 299
All-pole inverse lattice filter, 464–465
Approximate stochastic optimization (ASO),
144–145
adaptive filtering formulation
minimum error energy, 151
notations and definitions, 148–149
Yule–Walker normal equations,
149–151
adaptive filter performance measurement
error surface, canonical form of,
112–113
excess-mean-square error, 113
minimum error energy, 112
performance surface, 110–112
coherence function, 108–110
correlation matrix estimation
sequences estimation, 106–107
vectors estimation, 107–108
data matrix X
autocorrelation method, 155
covariance method, 155
post-windowing method, 153–154
projection operator and column space,
158–159
sensors arrays, 155–156
frequency domain interpretation
coherence function, 109
magnitude square coherence, 109
optimal filter, frequency response
of, 109
power spectral density, 108–110
Wiener filter interpretation, 108
geometrical interpretation, 113–114,
156–157
linearly constrained LS, 164–166
LS solution property, 159
multichannel Wiener’s normal equations
cross-correlation matrix, 120
error vector, 119
multichannel correlation matrix, 120
nonlinear LS
exponential decay, 167
rational function model, 167
separable least squares, 168
transformation, 168
optimal filter
condition number of correlation
matrix, 117
correlation matrix, 114
decoupled cross-correlation, 115
excess-mean-square error (EMSE), 116
modal matrix, 115
optimum filter output, 116–117
principal component analysis, 117
principal coordinates, 116
orthogonality principle, 113–114, 157
regularization and ill-conditioning,
163–164
regularization term, 161–163
stochastic generation model, 145–146
weighed and regularized LS, 164
weighted least squares, 160–161
Wiener filter, complex domain extension,
118–119
Wiener–Hopf notation (see Wiener–Hopf
notation)
Array gain, BF
diffuse noise field, 509
geometric gain, 510
homogeneous noise field, 509
supergain ratio, 510
symmetrical cylindrical isotropic noise,
508–509
symmetrical spherical isotropic noise, 508
white noise gain, 510
Array processing (AP), 478
adaptive filter
adaptive interference/noise cancellation
microphone array, 78, 79
beamforming, 78–81
detection of arrivals, sensors for, 78, 79
room acoustics active control, 81–82
very large array radio telescope, 77
algorithms, 480–481
circuit model
array space-time aperture, 495–497
filter steering vector, 497–498
MIMO notation, 493–495
propagation model, 481–484
sensor radiation diagram, 485–486
steering vector, 484–485
signal model
anechoic signal propagation model,
486–488
echoic signal propagation model,
488–489
numerical model, 486
steering vector
harmonic linear array, 492–493
uniform circular array, 491–492
uniform linear array, 490–491
Artificial neural networks (ANNs), 82–85
Augmented Yule–Walker normal equations,
437–439
Autoregressive moving average (ARMA)
model, 439–440
B
Backward linear prediction (BLP), 431–433
Backward prediction, 424
Backward prediction RLS filter, 469–470
Basis matrix, 6, 7
Batch joint process estimation, ROF
adaptive ladder filter parameters
determination, 458–459
Burg estimation formula, 459
Batch joint process estimation (cont.)
lattice-ladder filter structure for, 456, 457
stage-by-stage orthogonalization, 457–458
Beamforming (BF), 78–81
Beampattern, 507
Biological inspired intelligent circuits, AF
artificial neural networks, 82–85
biological brain characteristics, 83
blind signal processing, 86
blind signal separation, 86–89
formal neurons, 84
multilayer perceptron network, 84–85
reinforcement learning, 85–86
supervised learning algorithm, 85, 86
Blind signal processing (BSP), 86
Blind signal separation (BSS), 86
deconvolution of sources, 88–89
independent sources separation, 87–88
Block adaptive filter
BLMS algorithm
characterization of, 357
convergence properties of, 358
definition of, 357
block matrix, 355
block update parameter, 355
error vector, 356
schematic representation of, 355
Block algorithms
definition of, 351
indicative framework for, 352
L-length signal block, 353
and online algorithms, 354
Block iterative algebraic reconstruction
technique (BI-ART), 173
Bruun’s algorithm, 397–399
Burg estimation formula, 459
C
Circular convolution FDAF (CC-FDAF)
algorithm, 373–375
Combined one-step forward-backward linear
prediction (CFBLP), 434, 435
discrete-time two-port network structure,
455
and lattice adaptive filters, 453–456
Confined propagation model, 488–489
Continuous time signal-integral transformation
(CTFT), 15
Continuous-time signal-series expansion
(CTFS), 15
Conventional beamforming
broadband beamformer, 522–523
differential sensors array
DMA array gain for spherical isotropic
noise field, 519–521
DMA radiation diagram, 517–519
DMA with adaptive calibration filter,
521–522
DSBF-ULA
DSBF gains, 512–515
radiation pattern, 512
steering delay, 515–516
spatial response direct synthesis
alternation theorem, 524
frequency-angular sampling, 525–527
windowing method, 524–525
Cramer-Rao bound (CRB), 561–562
D
Data-dependent beamforming
minimum variance broadband beamformer,
537–538
constrained power minimization, 539
geometric interpretation, 542–544
lagrange multipliers solution, 541
LCMV constraints, 544–546
matrix constrain determination, 540
recursive procedure, 541–542
post-filtering beamformer
definition, 534–535
separate post-filter adaptation, 537
signal model, 535
superdirective beamformer
Cox’s regularized solutions, 529–531
line-array superdirective beamformer,
531–534
standard capon beamforming, 528
Data-dependent transformation matrix, 12–14
Data-dependent unitary transformation, 12–14
Data windowing constraints, 360
Delayed learning LMS algorithms
adjoint LMS (AD-LMS) algorithm,
277–278
definition, 273–274
delayed LMS (DLMS) algorithm, 275
discrete-time domain filtering operator,
274–275
filtered-X LMS Algorithm, 276–277
multichannel AD-LMS, 284
multichannel FX-LMS algorithm, 278–284
adaptation rule, 284
composite notation 1, 281–283
composite notation 2, 278–281
data matrix definition, 279
Kronecker convolution, 280
vectors and matrices size, 283
Differential microphones array (DMA)
with adaptive calibration filter, 521–522
array gain for spherical isotropic noise field,
519–521
frequency response, 518–519
polar diagram, 518
radiation diagram, 517–519
Direction of arrival (DOA), 478
broadband, 568–569
narrowband
with Capon’s beamformer, 563
with parametric methods, 566–568
signal model, 562
steered response power method,
562–563
with subspace analysis, 563–565
Discrete cosine transform (DCT), 10–11
Discrete Fourier transform (DFT)
definition, 8, 9
matrix, 8
periodic sequence, 8
properties of, 9
with unitary transformations, 8–9
Discrete Hartley transform (DHT), 9–10
Discrete sine transform (DST), 11
Discrete space-time filtering
array processing, 478
algorithms, 480–481
circuit model, 493–498
propagation model, 481–486
signal model, 486–489
conventional beamforming
broadband beamformer, 522–523
differential sensors array, 516–522
DSBF-ULA, 511–516
spatial response direct synthesis,
523–527
data-dependent beamforming
minimum variance broadband
beamformer, 537–546
post-filtering beamformer, 534–537
superdirective beamformer, 528–534
direction of arrival
broadband, 568–569
narrowband, 561–568
electromagnetic fields, 479
isotropic sensors, 478–479
noise field
array quality, 504–511
characteristics, 501–504
spatial covariance matrix, 498–501
sidelobe canceller
composite-notation GSC, 554–556
frequency domain GSC, 556–558
generalized sidelobe canceller, 547–549
GSC adaptation, 551–554
high reverberant environment, 559–561
multiple sidelobe canceller, 547
robust GSC beamforming, 558–559
spatial aliasing, 477
spatial frequency, 478
spatial sensors distribution, 479–480
time delay estimation
cross-correlation method, 569–570
Knapp–Carter’s generalized cross-
correlation method, 570–574
steered response power PHAT method,
574–576
Discrete-time adaptive filter, 58, 59
Discrete-time (DT) circuits
analog signal processing
advantages of, 20
current use, 21
bounded-input-bounded-output stability,
22–23
causality, 22
digital signal processing
current applications of, 21
disadvantages, 20
elements definition, 25–27
FDE (see Finite difference equation, DT circuits)
frequency response
computation, 28–29
Fourier series, 29
graphic form, 28
periodic function, 29
impulse response, 23
linearity, 22
linear time invariant
convolution sum, 24
finite duration sequences, 25
single-input-single-output (SISO), 21
time invariance, 22
transformed domains
discrete-time fourier transform, 31–35
FFT Algorithm, 37
transfer function (TF), 36
z-transform, 30–31
Discrete-time (DT) signals
definition, 2
deterministic sequences, 3, 4
real and complex exponential sequence,
5, 6
unitary impulse, 3–4
unit step, 4–5
graphical representation, 2, 3
random sequences, 3, 4
with unitary transformations
basis matrix, 6
data-dependent transformation matrix,
12–14
DCT, 10–11
DFT, 8–9
DHT, 9–10
DST, 11
Haar transform, 11–12
Hermitian matrix, 7
nonstationary signals, 7
orthonormal expansion (see Orthonormal expansion, DT signals)
unitary transform, 6, 7
DT delta function. See Unitary impulse
E
Echo cancellation, AF
adaptive echo cancellers, 74
hybrid circuit, 73
multichannel case, 75
teleconference scenario, 73
two-wire telephone communication, 73
Energy conservation theorem, 225
Error sequential regression (ESR) algorithms
average convergence study, 292
definitions and notation, 290–291
derivation of, 291–292
Estimation of signal parameters via rotational
invariance technique (ESPRIT)
algorithm, 566–568
Exponentiated gradient algorithms (EGA)
exponentiated RLS algorithm, 347–348
positive and negative weights, 346–347
positive weights, 344–346
F
Fast a posteriori error sequential technique
(FAEST) algorithm, 472–474
Fast block LMS (FBLMS). See Overlap-save FDAF (OS-FDAF) algorithm
Fast Fourier transform (FFT) algorithm, 37
Fast Kalman algorithm, 470–472
Fast LMS (FLMS). See Overlap-save FDAF (OS-FDAF) algorithm
Filter tracking capability, 314
Finite difference equation, DT circuits
BIBO stability criterion, 39–41
circuit representation, 38
impulse response
convolution-operator-matrix input
sequence, 42–43
data-matrix impulse-response vector,
41–42
FIR filter, 41
inner product vectors, 43–44
pole-zero plot, 38–39
Finite Impulse Response (FIR) filters, 494, 517
FOCal Underdetermined System Solver
(FOCUSS) algorithm, 198–199
diversity measure, 199–200
Lagrange multipliers method, 200–202
multichannel extension, 202–203
sparse solution determination, 197
weighted minimum norm solution, 197
Formal neurons, 84
Forward linear prediction (FLP), 431
estimation error, 428–429
filter structure, 429
forward prediction error filter, 430
Forward prediction, 424
Forward prediction RLS filter, 467–468
Free-field propagation model, 486–488
Frequency domain adaptive filter (FDAF)
algorithms, 353, 363
and BLMS algorithm, 358–359
classification of, 364
computational cost analysis, 376
linear convolution
data windowing constraints, 360
DFT and IDFT, in vector notation,
360–361
in frequency domain with overlap-save
method, 361–363
normalized correlation matrix, 378
overlap-add algorithm, 370–371
overlap-save algorithm
with frequency domain error, 371–372
implementative scheme of, 368–369
linear correlation coefficients, 365
structure of, 367
weight update and gradient’s constraint,
365–368
partitioned block algorithms (see Partitioned block FDAF algorithms)
performance analysis of, 376–378
schematic representation of, 359
step-size normalization procedure, 364–365
UFDAF algorithm (see Unconstrained FDAF (UFDAF) algorithm)
Frost algorithm, 537–538
constrained power minimization, 539
geometric interpretation, 542–544
lagrange multipliers solution, 541
LCMV constraints, 544–546
matrix constrain determination, 540
recursive procedure, 541–542
G
Generalized sidelobe canceller (GSC),
547, 548
with block algorithms, 551–553
block matrix determination, 549, 550
composite-notation, 554–556
frequency domain, 556–558
geometric interpretation of, 553–554
interference canceller, 549
with on-line algorithms, 551
robustness, 558–559
Gilloire-Vetterli’s tridiagonal SAF structure,
413–415
Gradient adaptive lattice (GAL) algorithm,
ROF, 459
adaptive filtering, 460–462
finite difference equations, 460
H
Haar unitary transform, 11–12
Hermitian matrix, 7
High-tech sectors, 20
I
Input signal buffer composition mechanism,
352
Inverse discrete Fourier transform (IDFT), 8, 9
K
Kalman filter algorithms
applications, 315
cyclic representation of, 321
discrete-time formulation, 316–319
observation mode, knowledge of, 320
in presence of external signal, 323–324
process model, knowledge of, 320
recursive nature of, 321
robustness, 323
significance of, 322
state space representation, of linear system,
315, 316
Kalman gain vector, 302
Karhunen–Loeve transform (KLT), 390
Kullback–Leibler divergence (KLD), 344
L
Lagrange function, 165
Lattice filters, properties of
optimal nesting, 455
orthogonality, of backward/forward
prediction errors, 456
stability, 455
Least mean squares (LMS) algorithm
characterization and convergence
error at optimal solution, 248
mean square convergence, 250–252
noisy gradient model, 252–253
weak convergence, 249–250
weights error vector, 248
complex domain signals
computational cost, 239
filter output, 237–238
stochastic gradient, 238–239
convergence speed
eigenvalues disparity, 258–260
nonuniform convergence, 258–260
excess of MSE (EMSE)
learning curve, 257–258
steady-state error, 254–256
formulation
adaptation, 233
computational cost, 236
DT circuit representation, 233
gradient vector, 233
instantaneous SDA approximation,
234–235
priori error, 233
recursive form, 235
vs. SDA comparison, 236
sum of squared error (SSE), 234
gradient estimation filter, 271–272
leaky LMS, 267–269
adaptation law, 267
cost function, 267
minimum and maximum correlation
matrix, 267
nonzero steady-state coefficient
bias, 268
transient performance, 267
least mean fourth (LMF) algorithm,
270–271
least mean mixed norm algorithm, 271
linear constraints
linearly constrained LMS (LCLMS)
algorithm, 239–240
local Lagrangian, 239
recursive gradient projection LCLMS,
240–241
minimum perturbation property, 236–237
momentum LMS algorithm, 272–273
multichannel LMS algorithms
filter-by-filter adaptation, 244–245
filters banks adaptation, 244
global adaptation, 243–244
impulse response, 242
input and output signal, 242–243
as MIMO-SDA approximation, 245
priori error vector, 243
normalized LMS algorithm
computational cost, 263
minimal perturbation properties,
264–265
variable learning rate, 262–263
proportionate NLMS (PNLMS) algorithm,
265–267
adaptation rule, 265
Gn matrix choice, 265
improved PNLMS, 265
impulse response w sparseness, 266
regularization parameter, 266
sparse impulse response, 266, 267
signed-error LMS, 269
signed-regressor LMS, 270
sign-sign LMS, 270
statistical analysis, adaptive algorithms
performance
adaptive algorithms performance,
246–247
convergence, 248–253
dynamic system model, 246
minimum energy error, 247–248
transient and steady-state filter
performance, 247
steady-state analysis, deterministic input,
260–262
Least squares (LS) method
approximate stochastic optimization (ASO)
methods, 144–145
adaptive filtering formulation, 146–151
stochastic generation model, 145–146
linear equations system
continuous nonlinear time-invariant
dynamic system, 171
iterative LS, 172–174
iterative weighed LS, 174
Levenberg-Marquardt variant, 171
Lyapunov theorem, 172
overdetermined systems, 169
underdetermined systems, 170
matrix factorization
algebraic nature, 174–175
amplitude domain formulation, 175
Cholesky decomposition, 175–177
orthogonal transformation, 177–180
power-domain formulation, 175
singular value decomposition (SVD),
180–184
principle of, 143–144
with sparse solution
matching pursuit algorithms, 191–192
minimum amplitude solution, 190
minimum fuel solution, 191
minimum Lp-norm (or sparse) solution,
193
minimum quadratic norm sparse
solution, 193–195
numerosity, 191
uniqueness, 195
total least squares (TLS) method
constrained optimization problem, 185
generalized TLS, 188–190
TLS solution, 186–188
zero-mean Gaussian stochastic
processes, 185
Levenberg–Marquardt variant, 171
Levinson–Durbin algorithm
LPC, of speech signals, 442
k and β parameters, initialization of,
450–451
prediction error filter structure, 452–453
pseudo-code of, 451
reflection coefficients determination,
448–450
reverse, 452
in scalar form, 448
in vector form, 447–448
Linear adaptive filter
filter input–output relation, 92
real and complex domain vector notation
coefficients’ variation, 93
filter coefficients, 93
input vector regression, 93
weight vector, 92
Linear estimation, 424
Linearly constrained adaptive beamforming,
559–561
Linearly constrained minimum variance
(LCMV)
eigenvector constraint, 546
minimum variance distortionless response,
544–545
multiple amplitude-frequency derivative
constraints, 545–546
Linear prediction
augmented Yule–Walker normal equations,
437–439
coding of speech signals, 440–442
schematic illustration of, 425
using LS approach, 435–437
Wiener’s optimum approach
augmented normal equations, 427–428
BLP, 431–433
CFBLP, 434, 435
estimation error, 424
FLP, 428–430
forward and backward prediction, 424
linear estimation, 424
minimum energy error, 427
predictor vector, 425
SFBLP, 434, 435
square error, 426
stationary process, prediction
coefficients for, 433–434
Linear prediction coding (LPC), of speech
signals
with all-pole inverse filter, 441
general synthesis-by-analysis scheme,
440, 441
Levinson-Durbin algorithm, 442
k and β parameters, initialization of,
450–451
prediction error filter structure, 452–453
pseudo-code of, 451
reflection coefficients determination,
448–450
reverse, 452
in scalar form, 448
in vector form, 447–448
low-rate voice transmission, 441
speech synthesizer, 441, 442
Linear random sequence, spectral estimation
of, 439–440
LMS algorithm, tracking performance of
mean square convergence of, 331–332
nonstationary RLS performance, 332–334
stochastic differential equation, 330
weak convergence analysis, 330–331
LMS Newton algorithm, 174, 293
Low-diversity inputs MIMO adaptive filtering,
335–336
channels dependent LMS algorithm,
337–338
multi-channels factorized RLS algorithm,
336–337
LOw-Resolution Electromagnetic
Tomography Algorithm
(LORETA), 196
Lyapunov attractor
continuous nonlinear time-invariant
dynamic system, 171
finite-difference equations (FDE), 173
generalized energy function, 172
iterative update expression, 174
learning rates, 173
LMS Newton algorithm, 174
online adaptation algorithms, 173
order recursive technique, 173
row-action-projection method, 173
M
MIMO error sequential regression algorithms
low-diversity inputs MIMO adaptive
filtering, 335–338
MIMO RLS, 334–335
multi-channel APA algorithm, 338–339
Moore–Penrose pseudoinverse matrix, 151
Multi-channel APA algorithm, 338–339
Multilayer perceptron (MLP) network, 84–85
Multiple error filtered-x (MEFEX), 82
Multiple-input multiple-output (MIMO) filter
composite notation 1, 96
composite notation 2, 97
impulse responses, 96
output snap-shot, 95
parallel of Q filters banks, 97–98
P inputs and Q outputs, 94
snap-shot notation, 98–99
N
Narrowband direction of arrival
with Capon’s beamformer, 563
with parametric methods, 566–568
signal model, 562
steered response power method, 562–563
with subspace analysis, 563–565
Newton’s algorithm
convergence study, 289–290
formulation of, 288
Noise field
array quality
array gain, 507–510
array sensitivity, 510–511
radiation functions, 506–507
signal-to-noise ratio, 505–506
characteristics
coherent field, 502
combined noise field, 504
diffuse field, 503–504
incoherent field, 502
spatial covariance matrix
definition, 498
isotropic noise, 500–501
projection operators, 500
spatial white noise, 499
spectral factorization, 500
Nonstationary AF performance analysis
delay noise, 330
estimation noise, 330
excess error, 327–328
misalignment and non-stationarity degree,
328–329
optimal solution a posteriori error, 327
optimal solution a priori error, 327
weight error lag, 329, 330
weight error noise, 329, 330
weights error vector correlation
matrix, 329
weights error vector mean square
deviation, 329
Normalized correlation matrix, 378
Normalized least mean squares (NLMS), 173
Numerical filter
definition of, 55
linear vs. nonlinear, 56–57
O
Online adaptation algorithms, 173
Optimal linear filter theory
adaptive filter basic and notations
(see Adaptive filter (AF))
adaptive interference/noise cancellation
(AIC)
acoustic underwater exploration,
138–139
adaptive noise cancellation principle
scheme, 133
error minimization, 133
impulse response, 132
performance analysis, 137–138
primary reference, 131
reverberant noisy environment,
134, 135
scalar version, 133
secondary reference, 131, 135
signal error, 131–132
without secondary reference signal,
139–141
communication channel equalization
channel model, 130
channel TF G(z), 127
equalizer input, 128
impulse response g[n] and input s[n], 129
optimum filter, 129
partial fractions, 130–131
receiver’s input, 130
dynamical system modeling 1
cross-correlation vector, 122
H(z) system output, 122
linear dynamic system model, 122
optimum model parameter
computation, 121
performance surface and minimum
energy error, 123
dynamical system modeling 2
linear dynamic system model, 124
optimum Wiener filter, 124–125
time delay estimation
matrix determination R, 126
performance surface, 127
stochastic moving average
(MA) process, 126
vector computation g, 126
Wiener solution, 127
Orthogonality principle, 157
Orthonormal expansion, DT signals
CTFT and CTFS, 15
discrete-time signal, 15–16
Euclidean space, 14
inner product, 14–15
kernel function
energy conservation principle, 17
expansion, 16–17
Haar expansion, 18
quadratically summable sequences, 14
Output projection matrix, 362
Overlap-add FDAF (OA-FDAF) algorithm,
370–371
Overlap-save FDAF (OS-FDAF) algorithm
with frequency domain error, 371–372
implementative scheme of, 368–369
linear correlation coefficients, 365
structure of, 367
weight update and gradient’s constraint,
365–368
Overlap-save sectioning method, 361–363
P
Partial rank algorithm (PRA), 299–300
Partitioned block FDAF (PBFDAF)
algorithms, 379
computational cost of, 385–386
development, 382–384
FFT calculation, 382
filter weights, augmented form of, 380
performance analysis of, 386–388
structure of, 384, 385
time-domain partitioned convolution
schematization, 380, 381
Partitioned frequency domain adaptive
beamformer (PFDABF), 556–558
Partitioned frequency domain adaptive filters
(PFDAF), 354
Partitioned matrix inversion lemma, 443–445
Phase transform method (PHAT), 573–574
Positive weights EGA, 344–346
Pradhan-Reddy’s polyphase SAF architecture,
416–418
A priori error fast transversal filter, 474–475
Propagation model, AP, 481–484
anechoic signal, 486–488
echoic signal, 488–489
sensor radiation diagram, 485–486
steering vector, 484–485
Pseudo random binary sequences (PRBS), 67
R
Random walk model, 325, 326
Real and complex exponential sequence, 5, 6
Recursive-in-model-order adaptive filter
algorithms. See Recursive order filter (ROF)
Recursive least squares (RLS)
computational complexity of, 307–308
conventional algorithm, 305–307
convergence of, 309–312
correlation matrix, with forgetting
factor/Kalman gain, 301–302
derivation of, 300–301
eigenvalues spread, 310
nonstationary, 314–315
performance analysis, 308, 309
a posteriori error, 303–305
a priori error, 303
regularization parameter, 310
robustness, 313–314
steady-state and convergence performance
of, 313
steady-state error of, 312–313
transient performance of, 313
Recursive order filter (ROF), 445–447
all-pole inverse lattice filter, 464–465
batch joint process estimation
adaptive ladder filter parameters
determination, 458–459
Burg estimation formula, 459
lattice-ladder filter structure for,
456, 457
stage-by-stage orthogonalization,
457–458
GAL algorithm, 459
adaptive filtering, 460–462
finite difference equations, 460
importance of, 443
partitioned matrix inversion lemma,
443–445
RLS algorithm
backward prediction RLS filter,
469–470
FAEST, 472–474
fast Kalman algorithm, 470–472
fast RLS algorithm, 465, 466
forward prediction RLS filter, 467–468
a priori error fast transversal filter,
474–475
transversal RLS filter, 466–467
Schur algorithm, 463
Riccati equation, 302
Riemann metric tensor, 343
Robust GSC beamforming, 558–559
Room acoustics active control, 81–82
Room transfer functions (RTF), 81, 82
Row-action-projection method, 173
S
SAF. See Subband adaptive filter (SAF)
Schur algorithm, 463
Second-order adaptive algorithms, 287,
324–325
affine projection algorithms
computational complexity of, 298
delay input vector, 299
description of, 295
minimal perturbation property,
296–298
variants of, 299
error sequential regression algorithms
average convergence study, 292
definitions and notation, 290–291
derivation of, 291–292
general adaptation law
adaptive regularized form, with sparsity
constraints, 340–344
exponentiated gradient algorithms,
344–348
types, 339–340
Kalman filter
applications, 315
cyclic representation of, 321
discrete-time formulation, 316–319
observation mode, knowledge of, 320
in presence of external signal, 323–324
process model, knowledge of, 320
recursive nature of, 321
robustness, 323
significance of, 322
state space representation, of linear
system, 315, 316
LMS algorithm, tracking performance of
mean square convergence of, 331–332
nonstationary RLS performance,
332–334
stochastic differential equation, 330
weak convergence analysis, 330–331
LMS-Newton algorithm, 293
MIMO error sequential regression
algorithms
low-diversity inputs MIMO adaptive
filtering, 335–338
MIMO RLS, 334–335
multi-channel APA algorithm, 338–339
Newton’s algorithm
convergence study, 289–290
formulation of, 288
performance analysis indices
delay noise, 330
estimation noise, 330
excess error, 327–328
misalignment and non-stationarity
degree, 328–329
optimal solution a posteriori error, 327
optimal solution a priori error, 327
weight error lag, 329, 330
weight error noise, 329, 330
weights error vector correlation
matrix, 329
weights error vector mean square
deviation, 329
recursive least squares
computational complexity of, 307–308
conventional, 305–307
convergence of, 309–312
correlation matrix, with forgetting
factor/Kalman gain, 301–302
derivation of, 300–301
eigenvalues spread, 310
nonstationary, 314–315
performance analysis, 308, 309
a posteriori error, 303–305
a priori error, 303
regularization parameter, 310
robustness, 313–314
steady-state and convergence
performance of, 313
steady-state error of, 312–313
transient performance of, 313
time-average autocorrelation matrix,
recursive estimation of, 293
initialization, 295
with matrix inversion lemma, 294
sequential regression algorithm,
294–295
tracking analysis model
assumptions of, 327
first-order Markov process, 326
minimum error energy, 327
nonstationary stochastic process,
325, 326
Signals
analog/continuous-time signals, 1–2
array processing
anechoic signal propagation model,
486–488
echoic signal propagation model,
488–489
numerical model, 486
complex domain, 1, 2
definition, 1
DT signals (see Discrete-time (DT) signals)
Signal-to-noise ratio (SNR), 478
Singular value decomposition (SVD) method
computational cost, 182
singular values, 181
SVD-LS algorithm, 182
Tikhonov regularization theory, 183–184
Sliding window, 354
Smoothed coherence transform method
(SCOT), 572–573
Speech signals, LPC of. See Linear prediction coding (LPC), of speech signals
Steepest-Descent algorithm (SDA)
convergence and stability
learning curve and weights trajectories,
228
natural modes, 227
similarity unitary transformation, 227
stability condition, 228–229
weights error vector, 227
convergence speed
convergence time constant and learning
curve, 231–232
eigenvalues disparities, 229
performance surface trends, 230
rotated expected error, 229
signal spectrum and eigenvalues spread,
230–231
error expectation, 225–226
multichannel extension, 226
recursive solution, 225
Steered response power PHAT (SRP-PHAT),
574–576
Steering vector, AP, 489–493
harmonic linear array, 492–493
uniform circular array, 491–492
uniform linear array, 490–491
Stochastic-gradient algorithms (SGA), 206
Subband adaptive filter (SAF), 354
analysis-synthesis filter banks, 418–419
circuit architectures for
Gilloire-Vetterli’s tridiagonal structure,
413–415
LMS adaptation algorithm, 415–416
Pradhan-Reddy’s polyphase
architecture, 416–418
optimal solution, conditions for, 410–412
schematic representation, 401, 402
subband-coding, 401, 402
subband decomposition, 401
two-channel subband-coding
closed-loop error computation,
409, 410
conjugate quadrature filters, 408–409
with critical sample rate, 402
in modulation domain z-transform
representation, 402–405
open-loop error computation, 409, 410
perfect reconstruction conditions,
405–407
quadrature mirror filters, 407–408
Superdirective beamformer
Cox’s regularized solutions, 529–531
line-array superdirective beamformer,
531–534
standard Capon beamforming, 528
Superposition principle, 22
Symmetric forward-backward linear prediction
(SFBLP), 434, 435, 437
T
Teleconference scenario, echo cancellation
in, 73
Temporal array aperture, 495–496
Tikhonov regularization parameter, 310
Tikhonov’s regularization theory, 163,
183–184
Time-average autocorrelation matrix, recursive
estimation of, 293
initialization, 295
with matrix inversion lemma, 294
sequential regression algorithm, 294–295
Time band-width product (TBWP), 497
Time delay estimation (TDE)
cross-correlation method, 569–570
Knapp–Carter’s generalized cross-
correlation method, 570–574
steered response power PHAT method,
574–576
Total least squares (TLS) method
constrained optimization problem, 185
generalized TLS, 188–190
TLS solution, 186–188
zero-mean Gaussian stochastic
processes, 185
Tracking analysis model
assumptions of, 327
first-order Markov process, 326
minimum error energy, 327
nonstationary stochastic process, 325, 326
Transform domain adaptive filter (TDAF)
algorithms
data-dependent optimal transformation,
390
definition of, 351
FDAF (see Frequency domain adaptive
filter (FDAF) algorithms)
performance analysis, 399–400
a priori fixed sub-optimal transformations,
390
schematic illustration of, 388, 389
sliding transformation LMS, bandpass
filters
DFT bank representation, 394
frequency responses of DFT/DCT, 395
non-recursive DFT filter bank, 397–399
recursive DCT filter bank, 395–397
short-time Fourier transform, 392
signal process in two-dimensional
domain, 393
transform domain LMS algorithm, 391–392
unitary similarity transformation, 390
Transversal RLS filter, 466–467
Two-channel subband-coding
closed-loop error computation, 409, 410
conjugate quadrature filters, 408–409
with critical sample rate, 402
in modulation domain z-transform
representation, 402–405
open-loop error computation, 409, 410
perfect reconstruction conditions, 405–407
quadrature mirror filters, 407–408
Two-wire telephone communication, 73
Type II discrete cosine transform (DCT-II),
391
Type II discrete sine transform (DST-II), 391
U
Unconstrained FDAF (UFDAF) algorithm
circulant Toeplitz matrix, 373
circular convolution FDAF scheme,
373–375
configuration of, 368, 369
convergence analysis, 376–378
convergence speed of, 369
for N ¼ M, 372–375
Unitary impulse, 3–4
Unit step sequence, 4–5
W
Weighted projection operator (WPO), 161
Weighted least squares (WLS), 160–161
Weighting matrix, 362
Wiener–Hopf notation
adaptive filter (AF), 103
autocorrelation matrix, 102
normal equations, 103
scalar notation
autocorrelation, 104
correlation functions, 104
error derivative, 104
filter output, 103
square error, 102
Wiener’s optimal filtering theory, 103
Wiener’s optimum approach, linear prediction
augmented normal equations, 427–428
BLP, 431–433
CFBLP, 434, 435
estimation error, 424
FLP, 428–430
forward and backward prediction, 424
linear estimation, 424
minimum energy error, 427
predictor vector, 425
SFBLP, 434, 435
square error, 426
stationary process, prediction coefficients
for, 433–434
Y
Yule–Walker normal equations, 150