Page 1: Efficient Automatic Differentiation of Matrix Functions

Peder A. Olsen, Steven J. Rennie, and Vaibhava Goel
IBM T.J. Watson Research Center

Automatic Differentiation 2012, Fort Collins, CO
July 25, 2012

Page 2

How can we compute the derivative?

Numerical: methods that do numeric differentiation by formulas like the finite difference

    f'(x) ≈ (f(x + h) − f(x)) / h.

These methods are simple to program, but lose half of all significant digits.

Symbolic: symbolic differentiation gives the symbolic representation of the derivative without saying how best to implement it. In general, symbolic derivatives are as accurate as they come.

Automatic: something between numeric and symbolic differentiation. The derivative implementation is computed from the code for f(x).

Page 3

Finite difference:

    f'(x) ≈ (f(x + h) − f(x)) / h

Central difference:

    f'(x) ≈ (f(x + h) − f(x − h)) / (2h)

For h < εx, where ε is the machine precision, these approximations become zero. Thus h has to be chosen with care.

Page 4

The error in the finite difference is bounded by

    ‖f''‖∞ h / 2 + 2ε‖f‖∞ / h,

whose minimum is

    2 √(ε ‖f‖∞ ‖f''‖∞),

achieved for

    h = 2 √(ε ‖f‖∞ / ‖f''‖∞).
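A quick numerical check of this bound (a sketch; for f(x) = eˣ near x = 1 we have ‖f‖∞ ≈ ‖f''‖∞ locally, so the optimal step is h ≈ 2√ε ≈ 3·10⁻⁸):

```python
import numpy as np

eps = np.finfo(float).eps              # machine precision, ~2.2e-16
x = 1.0
for h in [1e-4, 2 * np.sqrt(eps), 1e-12]:
    fd = (np.exp(x + h) - np.exp(x)) / h   # forward difference
    print(f"h = {h:.1e}   error = {abs(fd - np.exp(x)):.1e}")
# the error is smallest near h = 2*sqrt(eps), as the bound predicts
```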

Page 5

Numerical Differentiation Error

[Figure not reproduced.] The graph shows the error in approximating the derivative for f(x) = eˣ as a function of h.

Page 6

An imaginary derivative

The error stemming from the subtractive cancellation can be completely removed by using complex numbers. Let f : R → R; then

    f(x + ih) ≈ f(x) + ih f'(x) + (ih)²/2! f''(x) + (ih)³/3! f⁽³⁾(x)
              = f(x) − (h²/2) f''(x) + ih (f'(x) − (h²/6) f⁽³⁾(x)).

Thus the derivative can be approximated by

    f'(x) ≈ Im(f(x + ih)) / h.
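Because there is no subtraction of nearby values, h can be taken absurdly small. A minimal sketch of the complex-step derivative:

```python
import numpy as np

def complex_step(f, x, h=1e-20):
    # no subtractive cancellation, so h can be far below sqrt(eps)
    return np.imag(f(x + 1j * h)) / h

print(abs(complex_step(np.exp, 1.0) - np.e))      # ~0: accurate to machine precision
fd = (np.exp(1.0 + 1e-8) - np.exp(1.0)) / 1e-8    # finite difference for comparison
print(abs(fd - np.e))                             # ~1e-8: half the digits lost
```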

Page 7

Imaginary Differentiation Error

[Figure not reproduced.] The graph shows the error in approximating the derivative for f(x) = eˣ as a function of h.

Page 8

By using dual numbers a + bE, where E² = 0, we can further reduce the error. For second-order derivatives we can use hyper-dual numbers (J. Fike, 2012).

Complex number implementations usually come with most programming languages. For dual or hyper-dual numbers we need to download or implement code.
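A minimal dual-number sketch (only the operators needed for a polynomial; a real implementation would overload the rest):

```python
class Dual:
    """a + b*E with E**2 = 0; the b component carries an exact first derivative"""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def _lift(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._lift(o)
        return Dual(self.a + o.a, self.b + o.b)
    def __mul__(self, o):
        o = self._lift(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)  # E^2 term vanishes
    __radd__, __rmul__ = __add__, __mul__

def f(x):
    return x * x * x              # f'(x) = 3x^2

print(f(Dual(2.0, 1.0)).b)        # 12.0, exact: no truncation error at all
```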

Page 9

What is Automatic Differentiation?

“Algorithmic Differentiation (often referred to as Automatic Differentiation or just AD) uses the software representation of a function to obtain an efficient method for calculating its derivatives. These derivatives can be of arbitrary order and are analytic in nature (do not have any truncation error).” — B. Bell, author of CppAD

Use of dual or complex numbers is a form of automatic differentiation. More common, though, are operator-overloading implementations such as CppAD.

Page 10

The Speelpenning function is

    f(y) = ∏_{i=1}^{n} x_i(y).

We consider x_i = (y − t_i), so that f(y) = ∏_{i=1}^{n} (y − t_i). If we compute the symbolic derivative we get

    f'(y) = Σ_{i=1}^{n} ∏_{j≠i} (y − t_j),

which requires n(n − 1) multiplications to compute with a straightforward implementation. Rewriting the expression as

    f'(y) = Σ_{i=1}^{n} f(y) / (y − t_i)

reduces the computational cost, but is not numerically stable if y ≈ t_i.

Page 11

Reverse Mode Differentiation

Define f_k = ∏_{i=1}^{k} (y − t_i) and r_k = ∏_{i=k}^{n} (y − t_i); then

    f'(y) = Σ_{i=1}^{n} ∏_{j≠i} (y − t_j) = Σ_{i=1}^{n} f_{i−1} r_{i+1},

with f_i = (y − t_i) f_{i−1}, f_1 = y − t_1, r_i = (y − t_i) r_{i+1}, r_n = y − t_n. Note that f(y) = f_n = r_1.

The cost of computing f(y) is n subtractions and n − 1 multiplies. At the overhead of storing f_1, ..., f_n, the derivative can be computed with an additional n − 1 additions and 2n − 1 multiplications.
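A sketch of this prefix/suffix-product scheme (function and variable names are ours):

```python
import numpy as np

def speelpenning(y, t):
    # forward sweep: prefix products f_pre[i] = prod_{j<i} (y - t_j)
    d = y - t
    n = len(d)
    f_pre = np.ones(n + 1)
    for i in range(n):
        f_pre[i + 1] = f_pre[i] * d[i]
    # reverse sweep: suffix products r_suf[i] = prod_{j>=i} (y - t_j)
    r_suf = np.ones(n + 1)
    for i in range(n - 1, -1, -1):
        r_suf[i] = r_suf[i + 1] * d[i]
    f = f_pre[n]
    fprime = sum(f_pre[i] * r_suf[i + 1] for i in range(n))
    return f, fprime

print(speelpenning(2.5, np.array([1.0, 2.0, 3.0])))   # (-0.375, -0.25)
```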

Page 12

Reverse Mode Differentiation

“Under quite realistic assumptions the evaluation of a gradient requires never more than five times the effort of evaluating the underlying function by itself.” — Andreas Griewank (1988)

Page 13

Reverse mode differentiation (1971) is essentially applying the chain rule to a function in the reverse of the order in which the function is computed.

This is the same technique as applied in the backpropagation algorithm (1969) for neural networks and the forward-backward algorithm (1970) for training HMMs.

Reverse mode differentiation applies to very general classes of functions.

Page 14

Reverse mode differentiation was independently invented by

    G. M. Ostrovskii (1971)
    S. Linnainmaa (1976)
    John Larson and Ahmed Sameh (1978)
    Bert Speelpenning (1980)

Not surprisingly the fame went to B. Speelpenning, who admittedly told the full story at AD 2012. A. Sameh was on B. Speelpenning's thesis committee.

The Speelpenning function was suggested as a motivation for Speelpenning's thesis by his Canadian interim advisor, Prof. Art Sedgwick, who passed away on July 5th.

Page 15

Why worry about matrix differentiation?

The differentiation of functions of a matrix argument is still not well understood:

    There is no simple “calculus” for computing matrix derivatives.
    Matrix derivatives can be complicated to compute even for quite simple expressions.

Goal: build on known results to define a matrix-derivative calculus.

Page 16

Quiz

The Tikhonov-regularized maximum (log-)likelihood for learning the covariance Σ given an empirical covariance S is

    f(Σ) = −log det(Σ) − trace(SΣ⁻¹) − ‖Σ⁻¹‖²_F.

What are f'(Σ) and f''(Σ)? Statisticians can compute it. Can you? Can a calculus be reverse-engineered?

Page 17

Terminology

Classical calculus:

    scalar-scalar functions (f : R → R)
    scalar-vector functions (f : Rᵈ → R)

Matrix calculus:

    scalar-matrix functions (f : R^{m×n} → R)
    matrix-matrix functions (f : R^{m×n} → R^{k×l})

Note:

    1. Derivatives of scalar-matrix functions require derivatives of matrix-matrix functions (why?)
    2. Matrix-matrix derivatives should be computed implicitly wherever possible (why?)

Page 18

Optimizing Scalar-Matrix Functions

Useful quantities for optimizing a scalar-matrix function f(X) : R^{d×d} → R:

    The gradient is a matrix-matrix function: f'(X) : R^{d×d} → R^{d×d}.
    The Hessian is also a matrix-matrix function: f''(X) : R^{d×d} → R^{d²×d²}.

Desiderata: the derivative of a matrix-matrix function should be a matrix, so that we can directly use it to compute the Hessian.

Page 19

Optimizing Scalar-Matrix Functions (continued)

Taking the scalar-matrix derivative of f(G(X)) will require the information in the matrix-matrix derivative ∂G/∂X.

Desiderata: the derivative of a matrix-matrix function should be a matrix, so that a convenient chain rule can be established.

Page 20

The Matrix-Matrix Derivative

We define the matrix-matrix derivative to be

    ∂F/∂X := ∂vec(Fᵀ)/∂vecᵀ(Xᵀ) =

        [ ∂f11/∂x11  ∂f11/∂x12  ···  ∂f11/∂xmn ]
        [ ∂f12/∂x11  ∂f12/∂x12  ···  ∂f12/∂xmn ]
        [     ⋮           ⋮      ⋱        ⋮    ]
        [ ∂fkl/∂x11  ∂fkl/∂x12  ···  ∂fkl/∂xmn ]

Note that the matrix-matrix derivative of a scalar-matrix function is not the same as the scalar-matrix derivative:

    ∂mat(f)/∂X = vecᵀ((∂f/∂X)ᵀ).


Page 24

Derivatives of Linear Matrix Functions

1. A matrix-matrix derivative that is a matrix outer product:

    F(X) = trace(AX)B,   ∂F/∂X = vec(Bᵀ)vecᵀ(A).

2. A matrix-matrix derivative that is a Kronecker product:

    F(X) = AXB,   ∂F/∂X = A ⊗ Bᵀ.

3. A matrix-matrix derivative that cannot be expressed with standard operations. We call the new operation the box product:

    F(X) = AXᵀB,   ∂F/∂X = A □ Bᵀ.
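The first two cases are easy to check numerically. A finite-difference sketch of the matrix-matrix derivative under the layout defined above (rows enumerate F's entries row by row, columns X's entries row by row; jacobian_fd is our own helper name):

```python
import numpy as np

def jacobian_fd(F, X, eps=1e-6):
    # matrix-matrix derivative by finite differences, rows/columns laid out
    # row-major over F and X respectively (i.e., vec(F^T) against vec(X^T))
    fx = F(X)
    J = np.zeros((fx.size, X.size))
    for idx in range(X.size):
        Xp = X.reshape(-1).copy()
        Xp[idx] += eps
        J[:, idx] = ((F(Xp.reshape(X.shape)) - fx) / eps).reshape(-1)
    return J

m, n = 3, 4
X = np.random.randn(m, n)
A = np.random.randn(n, m)                     # so that trace(AX) is defined
B = np.random.randn(2, 5)
J1 = jacobian_fd(lambda X: np.trace(A @ X) * B, X)
print(np.allclose(J1, np.outer(B.reshape(-1), A.reshape(-1, order='F')), atol=1e-4))

A2, B2 = np.random.randn(2, m), np.random.randn(n, 5)
J2 = jacobian_fd(lambda X: A2 @ X @ B2, X)    # d(AXB)/dX = A kron B^T
print(np.allclose(J2, np.kron(A2, B2.T), atol=1e-4))
```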

Page 25

Direct Matrix Products

A direct matrix product, X = A ⊛ B, is a matrix whose elements are

    x_{(i1 i2)(i3 i4)} = a_{i_σ(1) i_σ(2)} b_{i_σ(3) i_σ(4)},

where (i1 i2) is shorthand for i1 n2 + i2 − 1 with i2 ∈ {1, 2, ..., n2}, and σ is a permutation over Z4.

The direct matrix products are central to matrix-matrix differentiation. Eight of them can be expressed using Kronecker products:

    A ⊗ B    Aᵀ ⊗ B    A ⊗ Bᵀ    Aᵀ ⊗ Bᵀ
    B ⊗ A    Bᵀ ⊗ A    B ⊗ Aᵀ    Bᵀ ⊗ Aᵀ

Eight more can be expressed using matrix outer products, and the final eight using the box product.

Page 26

Definition

Let A ∈ R^{m1×n1} and B ∈ R^{m2×n2}.

Definition (Kronecker Product). A ⊗ B ∈ R^{(m1 m2)×(n1 n2)} is defined by

    (A ⊗ B)_{(i−1)m2+j, (k−1)n2+l} = a_{ik} b_{jl} = (A ⊗ B)_{(ij)(kl)}.

Definition (Box Product). A □ B ∈ R^{(m1 m2)×(n1 n2)} is defined by

    (A □ B)_{(i−1)m2+j, (k−1)n1+l} = a_{il} b_{jk} = (A □ B)_{(ij)(kl)}.
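A direct sketch of the box product from this definition, together with a check of the vector-multiplication identity stated later, (Bᵀ □ A)vec(X) = vec(AXᵀB), using column-major vec:

```python
import numpy as np

def box(A, B):
    # (A box B)[(i*m2+j), (k*n1+l)] = A[i,l] * B[j,k]   (0-based indices)
    m1, n1 = A.shape
    m2, n2 = B.shape
    return np.einsum('il,jk->ijkl', A, B).reshape(m1 * m2, n1 * n2)

def vec(M):
    return M.reshape(-1, order='F')    # column-major vec

A = np.random.randn(3, 4)
B = np.random.randn(5, 2)
X = np.random.randn(5, 4)
print(np.allclose(box(B.T, A) @ vec(X), vec(A @ X.T @ B)))   # True
```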

Page 27

An example Kronecker and box product

To see the structure, consider the Kronecker and box products of two 2×2 matrices A and B:

    A ⊗ B =
        [ a11b11  a11b12  a12b11  a12b12 ]
        [ a11b21  a11b22  a12b21  a12b22 ]
        [ a21b11  a21b12  a22b11  a22b12 ]
        [ a21b21  a21b22  a22b21  a22b22 ]

    A □ B =
        [ a11b11  a12b11  a11b12  a12b12 ]
        [ a11b21  a12b21  a11b22  a12b22 ]
        [ a21b11  a22b11  a21b12  a22b12 ]
        [ a21b21  a22b21  a21b22  a22b22 ]


Page 35

    A = [  1  1  1 ···  1 ]        B = [ 1 2 3 ··· 10 ]
        [  2  2  2 ···  2 ]            [ 1 2 3 ··· 10 ]
        [  3  3  3 ···  3 ]            [ 1 2 3 ··· 10 ]
        [  ⋮  ⋮  ⋮      ⋮ ]            [ ⋮ ⋮ ⋮       ⋮ ]
        [ 10 10 10 ··· 10 ]            [ 1 2 3 ··· 10 ]

Pages 36–39

[Figures not reproduced.] Heatmaps of A ⊗ B, A □ B, and vec(A)vecᵀ(B) for the matrices A and B above, showing the distinct block structure of each direct product.

Page 40

The box product behaves similarly to the Kronecker product:

1. Vector multiplication:

    (Bᵀ ⊗ A)vec(X) = vec(AXB)        (Bᵀ □ A)vec(X) = vec(AXᵀB)

2. Matrix multiplication:

    (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)     (A □ B)(C □ D) = (AD) ⊗ (BC)

3. Inverse and transpose:

    (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹            (A □ B)⁻¹ = B⁻¹ □ A⁻¹
    (A ⊗ B)ᵀ = Aᵀ ⊗ Bᵀ               (A □ B)ᵀ = Bᵀ □ Aᵀ

4. Mixed products:

    (A ⊗ B)(C □ D) = (AC) □ (BD)     (A □ B)(C ⊗ D) = (AD) □ (BC)
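These identities are easy to spot-check with the box() helper sketched earlier (square matrices keep all the shapes compatible):

```python
import numpy as np

d = 3
A, B, C, D = (np.random.randn(d, d) for _ in range(4))
print(np.allclose(box(A, B) @ box(C, D), np.kron(A @ D, B @ C)))    # rule 2
print(np.allclose(np.linalg.inv(box(A, B)),
                  box(np.linalg.inv(B), np.linalg.inv(A))))         # rule 3
print(np.allclose(box(A, B) @ np.kron(C, D), box(A @ D, B @ C)))    # rule 4
```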


Page 44

Let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}. Then:

Trace:

    trace(A ⊗ B) = trace(A) trace(B)        trace(A □ B) = trace(AB)

Determinant (here m1 = n1 and m2 = n2 is required):

    det(A ⊗ B) = (det(A))^{m2} (det(B))^{m1}
    det(A □ B) = (−1)^{(m1 choose 2)(m2 choose 2)} (det(A))^{m2} (det(B))^{m1}

Associativity:

    (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)        (A □ B) □ C = A □ (B □ C),

but not for mixed products. In general,

    (A ⊗ B) □ C ≠ A ⊗ (B □ C)        (A □ B) ⊗ C ≠ A □ (B ⊗ C).

Page 45

Identity Box Products

The identity box product I_m □ I_n is a permutation matrix with interesting properties. Let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}.

Orthonormal: (I_m □ I_n)ᵀ(I_m □ I_n) = I_{mn}

Transposition: (I_{m1} □ I_{n1})vec(A) = vec(Aᵀ)

Connector: converting a Kronecker product to a box product, (A ⊗ B)(I_{n1} □ I_{n2}) = A □ B; converting a box product to a Kronecker product, (A □ B)(I_{n2} □ I_{n1}) = A ⊗ B.
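Again with the box() and vec() helpers from before:

```python
import numpy as np

m, n = 3, 4
A = np.random.randn(m, n)
P = box(np.eye(m), np.eye(n))                 # I_m box I_n, an mn x mn permutation
print(np.allclose(P @ vec(A), vec(A.T)))      # transposition
print(np.allclose(P.T @ P, np.eye(m * n)))    # orthonormal

B = np.random.randn(2, 5)                     # connector: (A kron B)(I_n1 box I_n2) = A box B
print(np.allclose(np.kron(A, B) @ box(np.eye(n), np.eye(5)), box(A, B)))
```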

Page 46

Box products are old things in a new wrapping

Although the notation for box products is new, I_m □ I_n has long been known as T_{m,n} to physicists, or as the stride permutation L^{mn}_m to others. These objects are identical, but the box product allows us to express more complex identities compactly; e.g., let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}; then

    (I_{m1} □ I_{n1 m2} □ I_{n2}) vec((A ⊗ B)ᵀ) = vec(vec(A)vecᵀ(B)).

If the box-product notation were not being used, the initial permutation matrix would have to be written

    I_{m1} □ I_{n1 m2} □ I_{n2} = (I_{m1} ⊗ T_{n1 m2, m1}) T_{m1, n1 m1 m2}.

Page 47

Direct matrix products are important because such matrices can be multiplied fast with each other and with vectors.

Let us show how the FFT can be done in terms of Kronecker and box products. Recall that the DFT matrix of order n is given by F_n = [e^{−2πikl/n}]_{0≤k,l<n} = [ω_n^{kl}]_{0≤k,l<n}, and therefore

    F2 = [ 1   1 ]        F4 = [ 1   1   1   1 ]
         [ 1  −1 ]             [ 1  −i  −1   i ]
                               [ 1  −1   1  −1 ]
                               [ 1   i  −1  −i ]

Page 48

We can factor the matrix F4 as follows:

    F4 = [ 1  0   1   0 ]   [ 1          ]   [ 1  0   1   0 ]
         [ 0  1   0   1 ] · [   1        ] · [ 1  0  −1   0 ]
         [ 1  0  −1   0 ]   [      1     ]   [ 0  1   0   1 ]
         [ 0  1   0  −1 ]   [         −i ]   [ 0  1   0  −1 ]

or more compactly

    F4 = (F2 ⊗ I2) diag(vec([1 1; 1 −i])) (I2 □ F2).

Page 49

In general, if we define the matrix

    V_{N,M}(α) = [ 1     1        1          ...  1                ]
                 [ 1     α        α²         ...  α^{M−1}          ]
                 [ 1     α²       α⁴         ...  α^{2(M−1)}       ]
                 [ ⋮     ⋮        ⋮          ⋱    ⋮                ]
                 [ 1     α^{N−1}  α^{2(N−1)} ...  α^{(N−1)(M−1)}   ],

then for N = km we have the following factorizations of the DFT matrix:

    F_N = (F_k ⊗ I_m) diag(vec(V_{m,k}(ω_N))) (I_k □ F_m)
    F_N = (F_m □ I_k) diag(vec(V_{m,k}(ω_N))) (F_k ⊗ I_m)

Page 50

This allows FFT_N(x) = y = F_N x to be computed as

    F_N x = vec((V_{m,k}(ω_N) ∘ (F_m Xᵀ)) F_kᵀ),

where x = vec(X) and ∘ denotes elementwise multiplication (.* in Matlab). The direct computation has a cost of O(k²m²), whereas the formula above does the job in O(km(k + m)) operations. Repeated use of the identity for N = 2ⁿ leads to the Cooley-Tukey FFT algorithm. (James Cooley was at T. J. Watson from 1962 to 1991.)

The fastest FFT library in the world as of 2012 (Spiral) uses knowledge of several such factorizations to automatically optimize the FFT implementation for an arbitrary value of n on a given platform!
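A sketch of one such splitting step, checked against numpy's FFT (we take X ∈ C^{k×m} with x = vec(X) column-major, an assumption that makes all the shapes work out):

```python
import numpy as np

def fft_one_split(x, k, m):
    # one Cooley-Tukey step: F_N x = vec((V ∘ (F_m X^T)) F_k^T),  N = k*m
    N = k * m
    X = x.reshape(m, k).T                            # k x m, column-major reshape
    Fm = np.fft.fft(np.eye(m))                       # dense DFT matrices, for clarity
    Fk = np.fft.fft(np.eye(k))
    w = np.exp(-2j * np.pi / N)
    V = w ** np.outer(np.arange(m), np.arange(k))    # twiddle factors V_{m,k}(w_N)
    Y = (V * (Fm @ X.T)) @ Fk.T                      # O(km(k+m)) instead of O(k^2 m^2)
    return Y.T.reshape(-1)                           # column-major vec

x = np.random.randn(12) + 1j * np.random.randn(12)
print(np.allclose(fft_one_split(x, 3, 4), np.fft.fft(x)))   # True
```

Applying the same split recursively to F_m and F_k gives the full FFT.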

Page 51

Here we show the role of the Kronecker and box products in the matrix-matrix differentiation framework.

Page 52

Basic Differentiation Identities

Let X ∈ R^{m×n}.

Identity: F(X) = X,  F'(X) = I_{mn}

Transpose: F(X) = Xᵀ,  F'(X) = I_m □ I_n

Chain rule: F(X) = G(H(X)),  F'(X) = G'(H(X)) H'(X)

Product rule: F(X) = G(X)H(X),  F'(X) = (I ⊗ Hᵀ(X)) G'(X) + (G(X) ⊗ I) H'(X)

Page 53

More Derivative Identities

Assume X ∈ R^{m×m} is a square matrix.

Square: F(X) = X²,  F'(X) = I_m ⊗ Xᵀ + X ⊗ I_m

Inverse: F(X) = X⁻¹,  F'(X) = −X⁻¹ ⊗ X⁻ᵀ

Inverse + transpose: F(X) = X⁻ᵀ,  F'(X) = −X⁻ᵀ □ X⁻¹

Square root: F(X) = X^{1/2},  F'(X) = (I ⊗ (X^{1/2})ᵀ + X^{1/2} ⊗ I)⁻¹

Integer power: F(X) = Xᵏ,  F'(X) = Σ_{i=0}^{k−1} Xⁱ ⊗ (X^{k−1−i})ᵀ
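Spot-checking two of these with the jacobian_fd helper sketched earlier:

```python
import numpy as np

d = 4
X = np.eye(d) + 0.3 * np.random.randn(d, d)   # kept well-conditioned for the inverse
I = np.eye(d)
Xinv = np.linalg.inv(X)

J_sq = jacobian_fd(lambda X: X @ X, X)
print(np.allclose(J_sq, np.kron(I, X.T) + np.kron(X, I), atol=1e-3))   # square rule

J_inv = jacobian_fd(np.linalg.inv, X)
print(np.allclose(J_inv, -np.kron(Xinv, Xinv.T), atol=1e-3))           # inverse rule
```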

Page 54

First some simple identities:

    f(X) = trace(AX),    f'(X) = Aᵀ
    f(X) = trace(AXᵀ),   f'(X) = A
    f(X) = log det(X),   f'(X) = X⁻ᵀ

Then the chain rule for more general expressions:

    vec((∂/∂X log det(G(X)))ᵀ) = (∂G/∂X)ᵀ vec((G(X))⁻¹)
    vec((∂/∂X trace(G(X)))ᵀ) = (∂G/∂X)ᵀ vec(I)
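The simple identities are easy to verify numerically (grad_fd is our own finite-difference helper for scalar-matrix gradients):

```python
import numpy as np

def grad_fd(f, X, eps=1e-6):
    # scalar-matrix gradient by finite differences
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X)) / eps
    return G

X = np.eye(3) + 0.1 * np.random.randn(3, 3)
A = np.random.randn(3, 3)
print(np.allclose(grad_fd(lambda X: np.trace(A @ X), X), A.T, atol=1e-4))
print(np.allclose(grad_fd(lambda X: np.log(np.linalg.det(X)), X),
                  np.linalg.inv(X).T, atol=1e-4))
```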

Page 55

The chain rule to get scalar-matrix derivatives is awkward to use. Instead we have some short-cuts:

    ∂/∂X trace(G(X)H(X)) = ∂/∂X trace(H(Y)G(X) + H(X)G(Y)) |_{Y=X},

    ∂/∂X trace(A F⁻¹(X)) = −∂/∂X trace(F⁻¹(Y) A F⁻¹(Y) F(X)) |_{Y=X},

and

    ∂/∂X log det(F(X)) = ∂/∂X trace(F⁻¹(Y) F(X)) |_{Y=X}.

Page 56

Use the formulas on the previous page to compute

    ∂/∂X trace((X − A)(X − B)⁻¹(X − C)).

The answer is:

    (X − C)ᵀ(X − B)⁻ᵀ + (X − B)⁻ᵀ(X − A)ᵀ
        − (X − B)⁻ᵀ(X − A)ᵀ(X − C)ᵀ(X − B)⁻ᵀ.
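A numerical check of this answer with the grad_fd helper above (note the minus sign on the last term, which comes from differentiating the inverse):

```python
import numpy as np

n = 3
X = np.eye(n) + 0.1 * np.random.randn(n, n)
A, B, C = (0.1 * np.random.randn(n, n) for _ in range(3))

f = lambda X: np.trace((X - A) @ np.linalg.inv(X - B) @ (X - C))
Qit = np.linalg.inv(X - B).T                      # (X - B)^{-T}
G = (X - C).T @ Qit + Qit @ (X - A).T \
    - Qit @ (X - A).T @ (X - C).T @ Qit
print(np.allclose(grad_fd(f, X), G, atol=1e-4))   # True
```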


Page 58

Finally, if

    r(x) = q(x)/p(x) = (Σ_{i=1}^{n} a_i xⁱ) / (Σ_{j=1}^{n} b_j xʲ)

is a scalar-scalar function, we can form a matrix-matrix function by simply substituting a matrix X for the scalar x. Then

    ∂/∂X trace(r(X)) = (r'(X))ᵀ,

and if h(x) = log(r(x)), then

    ∂/∂X log det(r(X)) = (h'(X))ᵀ.
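For example, with the polynomial r(x) = x³ (so r'(x) = 3x²), checked with the grad_fd helper above:

```python
import numpy as np

X = 0.5 * np.random.randn(3, 3)
f = lambda X: np.trace(X @ X @ X)
print(np.allclose(grad_fd(f, X), (3 * X @ X).T, atol=1e-3))   # (r'(X))^T
```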

Pages 59–65

Compute f(X) = trace(F(X)) = trace(((I + X)⁻¹ Xᵀ) X)

[The slides build the expression tree for f node by node; the tree diagrams are not reproduced.] The forward sweep introduces the intermediates

    T1 = I + X,   T2 = T1⁻¹,   T3 = Xᵀ,   T4 = T2 T3,   T5 = T4 X,   f = trace(T5).

Pages 66–72

Forward mode computation of f'(X)

[Tree diagrams not reproduced.] Each intermediate's matrix-matrix derivative is assembled from its children's, using X' = I ⊗ I and I' = 0:

    T1' = 0 + I ⊗ I                          (sum rule: (G + H)' = G' + H')
    T2' = −(T2 ⊗ T2ᵀ) T1'                    (chain rule: (T1⁻¹)' = −(T1⁻¹ ⊗ T1⁻ᵀ) T1')
    T3' = I □ I                              ((Xᵀ)' = I □ I)
    T4' = (I ⊗ T3ᵀ) T2' + (T2 ⊗ I) T3'       (product rule: (GH)' = (I ⊗ Hᵀ)G' + (G ⊗ I)H')
    T5' = (I ⊗ Xᵀ) T4' + (T4 ⊗ I)(I ⊗ I)     (product rule)
    (f')ᵀ = vec⁻¹(vecᵀ(I) T5')               (((trace(F))')ᵀ = vec⁻¹(vecᵀ(I) F'))

Page 73

Forward mode differentiation:

    Requires huge intermediate matrices.
    Only the last step reduces the size.

Critical points:

    F'(X) is composed of Kronecker, box (and outer) products for a large class of functions. (wow!)
    These can be “unwound” by multiplication with vectorized scalar-matrix derivatives (gasp!):

        vecᵀ(C)(A ⊗ B) = vecᵀ(BᵀCA)

Reverse mode differentiation:

    Evaluate the derivative from top to bottom.
    Small scalar-matrix derivatives are propagated down the tree.

Pages 74–83

Reverse Mode Differentiation: (f')ᵀ = vec⁻¹(vecᵀ(I) F')

[Tree diagrams not reproduced.] Starting from R0 = I at the root, each step uses an unwinding identity of the form vecᵀ(R)(A ⊗ B) = vecᵀ(BᵀRA) to push a small scalar-matrix sensitivity down one edge of the tree:

    R0 = I                 (root: f = trace(T5))
    R1 = X R0              vecᵀ(R0)(I ⊗ Xᵀ) = vecᵀ(X R0 I)         (T5 → T4)
    R2 = R0 T4             vecᵀ(R0)(T4 ⊗ I) = vecᵀ(I R0 T4)        (T5 → X)
    R3 = T3 R1             vecᵀ(R1)(I ⊗ T3ᵀ) = vecᵀ(T3 R1 I)       (T4 → T2)
    R4 = R1 T2             vecᵀ(R1)(T2 ⊗ I) = vecᵀ(I R1 T2)        (T4 → T3)
    R5 = −T2 R3 T2         vecᵀ(R3)(−T2 ⊗ T2ᵀ) = vecᵀ(−T2 R3 T2)   (T2 → T1)
    R6 = R4ᵀ               vecᵀ(R4)(I □ I) = vecᵀ(I R4ᵀ I)         (T3 → X)
    R7 = 0                 vecᵀ(R5) 0 = vecᵀ(0)                    (T1 → I)
    R8 = R5                vecᵀ(R5)(I ⊗ I) = vecᵀ(R5)              (T1 → X)

Collecting the three paths that reach X:

    f'(X) = R2ᵀ + R6ᵀ + R8ᵀ.
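A sketch of the whole reverse pass for this example, checked against finite differences:

```python
import numpy as np

def f_and_grad(X):
    # forward sweep
    I = np.eye(X.shape[0])
    T1 = I + X
    T2 = np.linalg.inv(T1)
    T3 = X.T
    T4 = T2 @ T3
    T5 = T4 @ X
    f = np.trace(T5)
    # reverse sweep, following the R_i above
    R0 = I
    R1 = X @ R0              # T5 -> T4
    R2 = R0 @ T4             # T5 -> X
    R3 = T3 @ R1             # T4 -> T2
    R4 = R1 @ T2             # T4 -> T3
    R5 = -T2 @ R3 @ T2       # T2 -> T1 (through the inverse)
    R6 = R4.T                # T3 -> X  (through the transpose)
    R8 = R5                  # T1 -> X
    return f, R2.T + R6.T + R8.T

X = 0.1 * np.random.randn(4, 4)
f, G = f_and_grad(X)
eps, G_fd = 1e-6, np.zeros_like(X)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(X); E[i, j] = eps
        G_fd[i, j] = (f_and_grad(X + E)[0] - f) / eps
print(np.allclose(G, G_fd, atol=1e-4))   # True
```

Note that every Rᵢ is a small d×d matrix: the huge Kronecker and box products are never formed.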

Page 84

There is growing interest in Machine Learning in scalar-matrix objective functions:

    Probabilistic graphical models
    Covariance selection
    Optimization in graphs and networks
    Data mining in social networks

Can we use this theory to help optimize such functions? We are on the look-out for interesting problems.

Page 85

The Anatomy of a Matrix-Matrix Function

Theorem. Let R(X) be a rational matrix-matrix function formed from constant matrices and K occurrences of X using arithmetic matrix operations (+, −, ∗ and (·)⁻¹) and transposition ((·)ᵀ). Then the derivative of the matrix-matrix function is of the form

    Σ_{i=1}^{k1} A_i ⊗ B_i + Σ_{i=k1+1}^{K} A_i □ B_i.

The matrices A_i and B_i are computed as parts of R(X).

Page 86

Hessian Forms

The derivative of a function f of the form trace(R(X)) or log det(R(X)) is a rational function. Therefore the Hessian is of the form

    Σ_{i=1}^{k1} A_i ⊗ B_i + Σ_{i=k1+1}^{K} A_i □ B_i,

where K is the number of times X occurs in the expression for the derivative. If the A_i, B_i are d × d matrices, then multiplication by H can be done with O(Kd³) operations. Generalizations of this result can be found in the paper.
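The O(Kd³) multiply never forms H explicitly: it applies (A ⊗ B)vec(V) = vec(BVAᵀ) and (A □ B)vec(V) = vec(BVᵀAᵀ) term by term (column-major vec; hess_matvec is our own helper name, checked against the explicit H built with the box() and vec() helpers from earlier):

```python
import numpy as np

def hess_matvec(As, Bs, k1, V):
    # H = sum_{i<k1} A_i kron B_i + sum_{i>=k1} A_i box B_i, applied to vec(V)
    out = np.zeros_like(V)
    for i, (A, B) in enumerate(zip(As, Bs)):
        W = V if i < k1 else V.T          # box terms act on the transpose
        out += B @ W @ A.T                # O(d^3) per term
    return out

d, k1, K = 3, 2, 4
As = [np.random.randn(d, d) for _ in range(K)]
Bs = [np.random.randn(d, d) for _ in range(K)]
H = sum(np.kron(A, B) for A, B in zip(As[:k1], Bs[:k1])) \
  + sum(box(A, B) for A, B in zip(As[k1:], Bs[k1:]))
V = np.random.randn(d, d)
print(np.allclose(H @ vec(V), vec(hess_matvec(As, Bs, k1, V))))   # True
```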

Page 87

Let f : R^{d×d} → R with f(X) = trace(R(X)) or f(X) = log det(R(X)).

Derivative: G = f'(X) ∈ R^{d×d}. Hessian: H = f''(X) ∈ R^{d²×d²}.

The Newton direction vec(V) = H⁻¹vec(G) can be computed efficiently if K ≪ d:

    K     Algorithm                       Complexity     Storage
    1     (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹           O(d³)          O(Kd²)
    2     Bartels-Stewart algorithm       O(d³)          O(Kd²)
    ≥ 3   Conjugate gradient algorithm    O(Kd⁵)         O(Kd²)
    ≥ d   General matrix inversion        O(d⁶ + Kd⁴)    O(d⁴)

The (generalized) Bartels-Stewart algorithm solves the Sylvester-like equation AX + XᵀB = C.

Page 88

What about optimizing functions of the form

    f(X) + ‖vec(X)‖₁?

Common approaches:

    Strategy       Method              Uses Hessian structure?
    Newton-Lasso   Coordinate descent  yes
    Newton-Lasso   FISTA               yes
    Orthantwise    ℓ₁ CG               yes
    Orthantwise    L-BFGS              no

It is not obvious how to take advantage of the Hessian structure for all these methods. For orthantwise CG, for example, the sub-problem requires the Newton direction for a sub-matrix of the Hessian.

Page 89

1. We have applied this methodology to the covariance selection problem; a paper is forthcoming.

2. We are digging deeper into the theory of matrix differentiation and the properties of the box products.

3. We are on the look-out for more interesting matrix optimization problems. All suggestions are appreciated!