Page 1: Efficient Automatic Differentiation of Matrix Functions

Peder A. Olsen, Steven J. Rennie, and Vaibhava Goel
IBM T.J. Watson Research Center

Automatic Differentiation 2012, Fort Collins, CO
July 25, 2012

Page 2

How can we compute the derivative?

Numerical: methods that do numeric differentiation by formulas like the finite difference

    f'(x) ≈ (f(x + h) − f(x)) / h.

These methods are simple to program, but lose half of all significant digits.

Symbolic: symbolic differentiation gives the symbolic representation of the derivative without saying how best to implement it. In general, symbolic derivatives are as accurate as they come.

Automatic: something between numeric and symbolic differentiation. The derivative implementation is computed from the code for f(x).

Page 3

Finite difference:

    f'(x) ≈ (f(x + h) − f(x)) / h

Central difference:

    f'(x) ≈ (f(x + h) − f(x − h)) / (2h)

For h < εx, where ε is the machine precision, these approximations become zero. Thus h has to be chosen with care.

Page 4

The error in the finite difference is bounded by

    ‖f''‖∞ h / 2 + 2ε‖f‖∞ / h,

whose minimum is

    2 √(ε ‖f‖∞ ‖f''‖∞),

achieved for

    h = 2 √(ε ‖f‖∞ / ‖f''‖∞).
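A quick numerical check of this bound (a sketch; for f(x) = eˣ near x = 1 we have ‖f‖∞ ≈ ‖f''‖∞ locally, so the optimal step is h ≈ 2√ε ≈ 3·10⁻⁸):

```python
import numpy as np

eps = np.finfo(float).eps              # machine precision, ~2.2e-16
x = 1.0
for h in [1e-4, 2 * np.sqrt(eps), 1e-12]:
    fd = (np.exp(x + h) - np.exp(x)) / h   # forward difference
    print(f"h = {h:.1e}   error = {abs(fd - np.exp(x)):.1e}")
# the error is smallest near h = 2*sqrt(eps), as the bound predicts
```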

Page 5

Numerical Differentiation Error

[Figure not reproduced.] The graph shows the error in approximating the derivative for f(x) = eˣ as a function of h.

Page 6

An imaginary derivative

The error stemming from the subtractive cancellation can be completely removed by using complex numbers. Let f : R → R; then

    f(x + ih) ≈ f(x) + ih f'(x) + (ih)²/2! f''(x) + (ih)³/3! f⁽³⁾(x)
              = f(x) − (h²/2) f''(x) + ih (f'(x) − (h²/6) f⁽³⁾(x)).

Thus the derivative can be approximated by

    f'(x) ≈ Im(f(x + ih)) / h.
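Because there is no subtraction of nearby values, h can be taken absurdly small. A minimal sketch of the complex-step derivative:

```python
import numpy as np

def complex_step(f, x, h=1e-20):
    # no subtractive cancellation, so h can be far below sqrt(eps)
    return np.imag(f(x + 1j * h)) / h

print(abs(complex_step(np.exp, 1.0) - np.e))      # ~0: accurate to machine precision
fd = (np.exp(1.0 + 1e-8) - np.exp(1.0)) / 1e-8    # finite difference for comparison
print(abs(fd - np.e))                             # ~1e-8: half the digits lost
```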

Page 7

Imaginary Differentiation Error

[Figure not reproduced.] The graph shows the error in approximating the derivative for f(x) = eˣ as a function of h.

Page 8

By using dual numbers a + bE, where E² = 0, we can further reduce the error. For second-order derivatives we can use hyper-dual numbers (J. Fike, 2012).

Complex number implementations usually come with most programming languages. For dual or hyper-dual numbers we need to download or implement code.
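A minimal dual-number sketch (only the operators needed for a polynomial; a real implementation would overload the rest):

```python
class Dual:
    """a + b*E with E**2 = 0; the b component carries an exact first derivative"""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def _lift(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._lift(o)
        return Dual(self.a + o.a, self.b + o.b)
    def __mul__(self, o):
        o = self._lift(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)  # E^2 term vanishes
    __radd__, __rmul__ = __add__, __mul__

def f(x):
    return x * x * x              # f'(x) = 3x^2

print(f(Dual(2.0, 1.0)).b)        # 12.0, exact: no truncation error at all
```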

Page 9

What is Automatic Differentiation?

“Algorithmic Differentiation (often referred to as Automatic Differentiation or just AD) uses the software representation of a function to obtain an efficient method for calculating its derivatives. These derivatives can be of arbitrary order and are analytic in nature (do not have any truncation error).” — B. Bell, author of CppAD

Use of dual or complex numbers is a form of automatic differentiation. More common, though, are operator-overloading implementations such as CppAD.

Page 10

The Speelpenning function is

    f(y) = ∏_{i=1}^{n} x_i(y).

We consider x_i = (y − t_i), so that f(y) = ∏_{i=1}^{n} (y − t_i). If we compute the symbolic derivative we get

    f'(y) = Σ_{i=1}^{n} ∏_{j≠i} (y − t_j),

which requires n(n − 1) multiplications to compute with a straightforward implementation. Rewriting the expression as

    f'(y) = Σ_{i=1}^{n} f(y) / (y − t_i)

reduces the computational cost, but is not numerically stable if y ≈ t_i.

Page 11

Reverse Mode Differentiation

Define f_k = ∏_{i=1}^{k} (y − t_i) and r_k = ∏_{i=k}^{n} (y − t_i); then

    f'(y) = Σ_{i=1}^{n} ∏_{j≠i} (y − t_j) = Σ_{i=1}^{n} f_{i−1} r_{i+1},

with f_i = (y − t_i) f_{i−1}, f_1 = y − t_1, r_i = (y − t_i) r_{i+1}, r_n = y − t_n. Note that f(y) = f_n = r_1.

The cost of computing f(y) is n subtractions and n − 1 multiplies. At the overhead of storing f_1, ..., f_n, the derivative can be computed with an additional n − 1 additions and 2n − 1 multiplications.
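A sketch of this prefix/suffix-product scheme (function and variable names are ours):

```python
import numpy as np

def speelpenning(y, t):
    # forward sweep: prefix products f_pre[i] = prod_{j<i} (y - t_j)
    d = y - t
    n = len(d)
    f_pre = np.ones(n + 1)
    for i in range(n):
        f_pre[i + 1] = f_pre[i] * d[i]
    # reverse sweep: suffix products r_suf[i] = prod_{j>=i} (y - t_j)
    r_suf = np.ones(n + 1)
    for i in range(n - 1, -1, -1):
        r_suf[i] = r_suf[i + 1] * d[i]
    f = f_pre[n]
    fprime = sum(f_pre[i] * r_suf[i + 1] for i in range(n))
    return f, fprime

print(speelpenning(2.5, np.array([1.0, 2.0, 3.0])))   # (-0.375, -0.25)
```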

Page 12

Reverse Mode Differentiation

“Under quite realistic assumptions the evaluation of a gradient requires never more than five times the effort of evaluating the underlying function by itself.” — Andreas Griewank (1988)

Page 13

Reverse mode differentiation (1971) is essentially applying the chain rule to a function in the reverse of the order in which the function is computed.

This is the same technique as applied in the backpropagation algorithm (1969) for neural networks and the forward-backward algorithm (1970) for training HMMs.

Reverse mode differentiation applies to very general classes of functions.

Page 14

Reverse mode differentiation was independently invented by

    G. M. Ostrovskii (1971)
    S. Linnainmaa (1976)
    John Larson and Ahmed Sameh (1978)
    Bert Speelpenning (1980)

Not surprisingly the fame went to B. Speelpenning, who admittedly told the full story at AD 2012. A. Sameh was on B. Speelpenning's thesis committee.

The Speelpenning function was suggested as a motivation for Speelpenning's thesis by his Canadian interim advisor, Prof. Art Sedgwick, who passed away on July 5th.

Page 15

Why worry about matrix differentiation?

The differentiation of functions of a matrix argument is still not well understood:

    There is no simple “calculus” for computing matrix derivatives.
    Matrix derivatives can be complicated to compute even for quite simple expressions.

Goal: build on known results to define a matrix-derivative calculus.

Page 16

Quiz

The Tikhonov-regularized maximum (log-)likelihood for learning the covariance Σ given an empirical covariance S is

    f(Σ) = −log det(Σ) − trace(SΣ⁻¹) − ‖Σ⁻¹‖²_F.

What are f'(Σ) and f''(Σ)? Statisticians can compute it. Can you? Can a calculus be reverse-engineered?

Page 17

Terminology

Classical calculus:

    scalar-scalar functions (f : R → R)
    scalar-vector functions (f : Rᵈ → R)

Matrix calculus:

    scalar-matrix functions (f : R^{m×n} → R)
    matrix-matrix functions (f : R^{m×n} → R^{k×l})

Note:

    1. Derivatives of scalar-matrix functions require derivatives of matrix-matrix functions (why?)
    2. Matrix-matrix derivatives should be computed implicitly wherever possible (why?)

Page 18

Optimizing Scalar-Matrix Functions

Useful quantities for optimizing a scalar-matrix function f(X) : R^{d×d} → R:

    The gradient is a matrix-matrix function: f'(X) : R^{d×d} → R^{d×d}.
    The Hessian is also a matrix-matrix function: f''(X) : R^{d×d} → R^{d²×d²}.

Desiderata: the derivative of a matrix-matrix function should be a matrix, so that we can directly use it to compute the Hessian.

Page 19

Optimizing Scalar-Matrix Functions (continued)

Taking the scalar-matrix derivative of f(G(X)) will require the information in the matrix-matrix derivative ∂G/∂X.

Desiderata: the derivative of a matrix-matrix function should be a matrix, so that a convenient chain rule can be established.

Page 20

The Matrix-Matrix Derivative

We define the matrix-matrix derivative to be

    ∂F/∂X := ∂vec(Fᵀ)/∂vecᵀ(Xᵀ) =

        [ ∂f11/∂x11  ∂f11/∂x12  ···  ∂f11/∂xmn ]
        [ ∂f12/∂x11  ∂f12/∂x12  ···  ∂f12/∂xmn ]
        [     ⋮           ⋮      ⋱        ⋮    ]
        [ ∂fkl/∂x11  ∂fkl/∂x12  ···  ∂fkl/∂xmn ]

Note that the matrix-matrix derivative of a scalar-matrix function is not the same as the scalar-matrix derivative:

    ∂mat(f)/∂X = vecᵀ((∂f/∂X)ᵀ).


Page 24

Derivatives of Linear Matrix Functions

1. A matrix-matrix derivative that is a matrix outer product:

    F(X) = trace(AX)B,   ∂F/∂X = vec(Bᵀ)vecᵀ(A).

2. A matrix-matrix derivative that is a Kronecker product:

    F(X) = AXB,   ∂F/∂X = A ⊗ Bᵀ.

3. A matrix-matrix derivative that cannot be expressed with standard operations. We call the new operation the box product:

    F(X) = AXᵀB,   ∂F/∂X = A □ Bᵀ.
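The first two cases are easy to check numerically. A finite-difference sketch of the matrix-matrix derivative under the layout defined above (rows enumerate F's entries row by row, columns X's entries row by row; jacobian_fd is our own helper name):

```python
import numpy as np

def jacobian_fd(F, X, eps=1e-6):
    # matrix-matrix derivative by finite differences, rows/columns laid out
    # row-major over F and X respectively (i.e., vec(F^T) against vec(X^T))
    fx = F(X)
    J = np.zeros((fx.size, X.size))
    for idx in range(X.size):
        Xp = X.reshape(-1).copy()
        Xp[idx] += eps
        J[:, idx] = ((F(Xp.reshape(X.shape)) - fx) / eps).reshape(-1)
    return J

m, n = 3, 4
X = np.random.randn(m, n)
A = np.random.randn(n, m)                     # so that trace(AX) is defined
B = np.random.randn(2, 5)
J1 = jacobian_fd(lambda X: np.trace(A @ X) * B, X)
print(np.allclose(J1, np.outer(B.reshape(-1), A.reshape(-1, order='F')), atol=1e-4))

A2, B2 = np.random.randn(2, m), np.random.randn(n, 5)
J2 = jacobian_fd(lambda X: A2 @ X @ B2, X)    # d(AXB)/dX = A kron B^T
print(np.allclose(J2, np.kron(A2, B2.T), atol=1e-4))
```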

Page 25

Direct Matrix Products

A direct matrix product, X = A ⊛ B, is a matrix whose elements are

    x_{(i1 i2)(i3 i4)} = a_{i_σ(1) i_σ(2)} b_{i_σ(3) i_σ(4)},

where (i1 i2) is shorthand for i1 n2 + i2 − 1 with i2 ∈ {1, 2, ..., n2}, and σ is a permutation over Z4.

The direct matrix products are central to matrix-matrix differentiation. Eight of them can be expressed using Kronecker products:

    A ⊗ B    Aᵀ ⊗ B    A ⊗ Bᵀ    Aᵀ ⊗ Bᵀ
    B ⊗ A    Bᵀ ⊗ A    B ⊗ Aᵀ    Bᵀ ⊗ Aᵀ

Eight more can be expressed using matrix outer products, and the final eight using the box product.

Page 26

Definition

Let A ∈ R^{m1×n1} and B ∈ R^{m2×n2}.

Definition (Kronecker Product). A ⊗ B ∈ R^{(m1 m2)×(n1 n2)} is defined by

    (A ⊗ B)_{(i−1)m2+j, (k−1)n2+l} = a_{ik} b_{jl} = (A ⊗ B)_{(ij)(kl)}.

Definition (Box Product). A □ B ∈ R^{(m1 m2)×(n1 n2)} is defined by

    (A □ B)_{(i−1)m2+j, (k−1)n1+l} = a_{il} b_{jk} = (A □ B)_{(ij)(kl)}.
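A direct sketch of the box product from this definition, together with a check of the vector-multiplication identity stated later, (Bᵀ □ A)vec(X) = vec(AXᵀB), using column-major vec:

```python
import numpy as np

def box(A, B):
    # (A box B)[(i*m2+j), (k*n1+l)] = A[i,l] * B[j,k]   (0-based indices)
    m1, n1 = A.shape
    m2, n2 = B.shape
    return np.einsum('il,jk->ijkl', A, B).reshape(m1 * m2, n1 * n2)

def vec(M):
    return M.reshape(-1, order='F')    # column-major vec

A = np.random.randn(3, 4)
B = np.random.randn(5, 2)
X = np.random.randn(5, 4)
print(np.allclose(box(B.T, A) @ vec(X), vec(A @ X.T @ B)))   # True
```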

Page 27

An example Kronecker and box product

To see the structure, consider the Kronecker and box products of two 2×2 matrices A and B:

    A ⊗ B =
        [ a11b11  a11b12  a12b11  a12b12 ]
        [ a11b21  a11b22  a12b21  a12b22 ]
        [ a21b11  a21b12  a22b11  a22b12 ]
        [ a21b21  a21b22  a22b21  a22b22 ]

    A □ B =
        [ a11b11  a12b11  a11b12  a12b12 ]
        [ a11b21  a12b21  a11b22  a12b22 ]
        [ a21b11  a22b11  a21b12  a22b12 ]
        [ a21b21  a22b21  a21b22  a22b22 ]


Page 35

    A = [  1  1  1 ···  1 ]        B = [ 1 2 3 ··· 10 ]
        [  2  2  2 ···  2 ]            [ 1 2 3 ··· 10 ]
        [  3  3  3 ···  3 ]            [ 1 2 3 ··· 10 ]
        [  ⋮  ⋮  ⋮      ⋮ ]            [ ⋮ ⋮ ⋮       ⋮ ]
        [ 10 10 10 ··· 10 ]            [ 1 2 3 ··· 10 ]

Pages 36–39

[Figures not reproduced.] Heatmaps of A ⊗ B, A □ B, and vec(A)vecᵀ(B) for the matrices A and B above, showing the distinct block structure of each direct product.

Page 40

The box product behaves similarly to the Kronecker product:

1. Vector multiplication:

    (Bᵀ ⊗ A)vec(X) = vec(AXB)        (Bᵀ □ A)vec(X) = vec(AXᵀB)

2. Matrix multiplication:

    (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)     (A □ B)(C □ D) = (AD) ⊗ (BC)

3. Inverse and transpose:

    (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹            (A □ B)⁻¹ = B⁻¹ □ A⁻¹
    (A ⊗ B)ᵀ = Aᵀ ⊗ Bᵀ               (A □ B)ᵀ = Bᵀ □ Aᵀ

4. Mixed products:

    (A ⊗ B)(C □ D) = (AC) □ (BD)     (A □ B)(C ⊗ D) = (AD) □ (BC)
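These identities are easy to spot-check with the box() helper sketched earlier (square matrices keep all the shapes compatible):

```python
import numpy as np

d = 3
A, B, C, D = (np.random.randn(d, d) for _ in range(4))
print(np.allclose(box(A, B) @ box(C, D), np.kron(A @ D, B @ C)))    # rule 2
print(np.allclose(np.linalg.inv(box(A, B)),
                  box(np.linalg.inv(B), np.linalg.inv(A))))         # rule 3
print(np.allclose(box(A, B) @ np.kron(C, D), box(A @ D, B @ C)))    # rule 4
```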


Page 44

Let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}. Then:

Trace:

    trace(A ⊗ B) = trace(A) trace(B)        trace(A □ B) = trace(AB)

Determinant (here m1 = n1 and m2 = n2 is required):

    det(A ⊗ B) = (det(A))^{m2} (det(B))^{m1}
    det(A □ B) = (−1)^{(m1 choose 2)(m2 choose 2)} (det(A))^{m2} (det(B))^{m1}

Associativity:

    (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)        (A □ B) □ C = A □ (B □ C),

but not for mixed products. In general,

    (A ⊗ B) □ C ≠ A ⊗ (B □ C)        (A □ B) ⊗ C ≠ A □ (B ⊗ C).

Page 45

Identity Box Products

The identity box product I_m □ I_n is a permutation matrix with interesting properties. Let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}.

Orthonormal: (I_m □ I_n)ᵀ(I_m □ I_n) = I_{mn}

Transposition: (I_{m1} □ I_{n1})vec(A) = vec(Aᵀ)

Connector: converting a Kronecker product to a box product, (A ⊗ B)(I_{n1} □ I_{n2}) = A □ B; converting a box product to a Kronecker product, (A □ B)(I_{n2} □ I_{n1}) = A ⊗ B.
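Again with the box() and vec() helpers from before:

```python
import numpy as np

m, n = 3, 4
A = np.random.randn(m, n)
P = box(np.eye(m), np.eye(n))                 # I_m box I_n, an mn x mn permutation
print(np.allclose(P @ vec(A), vec(A.T)))      # transposition
print(np.allclose(P.T @ P, np.eye(m * n)))    # orthonormal

B = np.random.randn(2, 5)                     # connector: (A kron B)(I_n1 box I_n2) = A box B
print(np.allclose(np.kron(A, B) @ box(np.eye(n), np.eye(5)), box(A, B)))
```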

Page 46

Box products are old things in a new wrapping

Although the notation for box products is new, I_m □ I_n has long been known as T_{m,n} to physicists, or as the stride permutation L^{mn}_m to others. These objects are identical, but the box product allows us to express more complex identities compactly; e.g., let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}; then

    (I_{m1} □ I_{n1 m2} □ I_{n2}) vec((A ⊗ B)ᵀ) = vec(vec(A)vecᵀ(B)).

If the box-product notation were not being used, the initial permutation matrix would have to be written

    I_{m1} □ I_{n1 m2} □ I_{n2} = (I_{m1} ⊗ T_{n1 m2, m1}) T_{m1, n1 m1 m2}.

Page 47

Direct matrix products are important because such matrices can be multiplied fast with each other and with vectors.

Let us show how the FFT can be done in terms of Kronecker and box products. Recall that the DFT matrix of order n is given by F_n = [e^{−2πikl/n}]_{0≤k,l<n} = [ω_n^{kl}]_{0≤k,l<n}, and therefore

    F2 = [ 1   1 ]        F4 = [ 1   1   1   1 ]
         [ 1  −1 ]             [ 1  −i  −1   i ]
                               [ 1  −1   1  −1 ]
                               [ 1   i  −1  −i ]

Page 48

We can factor the matrix F4 as follows:

    F4 = [ 1  0   1   0 ]   [ 1          ]   [ 1  0   1   0 ]
         [ 0  1   0   1 ] · [   1        ] · [ 1  0  −1   0 ]
         [ 1  0  −1   0 ]   [      1     ]   [ 0  1   0   1 ]
         [ 0  1   0  −1 ]   [         −i ]   [ 0  1   0  −1 ]

or more compactly

    F4 = (F2 ⊗ I2) diag(vec([1 1; 1 −i])) (I2 □ F2).

Page 49

In general, if we define the matrix

    V_{N,M}(α) = [ 1     1        1          ...  1                ]
                 [ 1     α        α²         ...  α^{M−1}          ]
                 [ 1     α²       α⁴         ...  α^{2(M−1)}       ]
                 [ ⋮     ⋮        ⋮          ⋱    ⋮                ]
                 [ 1     α^{N−1}  α^{2(N−1)} ...  α^{(N−1)(M−1)}   ],

then for N = km we have the following factorizations of the DFT matrix:

    F_N = (F_k ⊗ I_m) diag(vec(V_{m,k}(ω_N))) (I_k □ F_m)
    F_N = (F_m □ I_k) diag(vec(V_{m,k}(ω_N))) (F_k ⊗ I_m)

Page 50

This allows FFT_N(x) = y = F_N x to be computed as

    F_N x = vec((V_{m,k}(ω_N) ∘ (F_m Xᵀ)) F_kᵀ),

where x = vec(X) and ∘ denotes elementwise multiplication (.* in Matlab). The direct computation has a cost of O(k²m²), whereas the formula above does the job in O(km(k + m)) operations. Repeated use of the identity for N = 2ⁿ leads to the Cooley-Tukey FFT algorithm. (James Cooley was at T. J. Watson from 1962 to 1991.)

The fastest FFT library in the world as of 2012 (Spiral) uses knowledge of several such factorizations to automatically optimize the FFT implementation for an arbitrary value of n on a given platform!
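A sketch of one such splitting step, checked against numpy's FFT (we take X ∈ C^{k×m} with x = vec(X) column-major, an assumption that makes all the shapes work out):

```python
import numpy as np

def fft_one_split(x, k, m):
    # one Cooley-Tukey step: F_N x = vec((V ∘ (F_m X^T)) F_k^T),  N = k*m
    N = k * m
    X = x.reshape(m, k).T                            # k x m, column-major reshape
    Fm = np.fft.fft(np.eye(m))                       # dense DFT matrices, for clarity
    Fk = np.fft.fft(np.eye(k))
    w = np.exp(-2j * np.pi / N)
    V = w ** np.outer(np.arange(m), np.arange(k))    # twiddle factors V_{m,k}(w_N)
    Y = (V * (Fm @ X.T)) @ Fk.T                      # O(km(k+m)) instead of O(k^2 m^2)
    return Y.T.reshape(-1)                           # column-major vec

x = np.random.randn(12) + 1j * np.random.randn(12)
print(np.allclose(fft_one_split(x, 3, 4), np.fft.fft(x)))   # True
```

Applying the same split recursively to F_m and F_k gives the full FFT.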

Page 51

Here we show the role of the Kronecker and box products in the matrix-matrix differentiation framework.

Page 52

Basic Differentiation Identities

Let X ∈ R^{m×n}.

Identity: F(X) = X,  F'(X) = I_{mn}

Transpose: F(X) = Xᵀ,  F'(X) = I_m □ I_n

Chain rule: F(X) = G(H(X)),  F'(X) = G'(H(X)) H'(X)

Product rule: F(X) = G(X)H(X),  F'(X) = (I ⊗ Hᵀ(X)) G'(X) + (G(X) ⊗ I) H'(X)

Page 53

More Derivative Identities

Assume X ∈ R^{m×m} is a square matrix.

Square: F(X) = X²,  F'(X) = I_m ⊗ Xᵀ + X ⊗ I_m

Inverse: F(X) = X⁻¹,  F'(X) = −X⁻¹ ⊗ X⁻ᵀ

Inverse + transpose: F(X) = X⁻ᵀ,  F'(X) = −X⁻ᵀ □ X⁻¹

Square root: F(X) = X^{1/2},  F'(X) = (I ⊗ (X^{1/2})ᵀ + X^{1/2} ⊗ I)⁻¹

Integer power: F(X) = Xᵏ,  F'(X) = Σ_{i=0}^{k−1} Xⁱ ⊗ (X^{k−1−i})ᵀ
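Spot-checking two of these with the jacobian_fd helper sketched earlier:

```python
import numpy as np

d = 4
X = np.eye(d) + 0.3 * np.random.randn(d, d)   # kept well-conditioned for the inverse
I = np.eye(d)
Xinv = np.linalg.inv(X)

J_sq = jacobian_fd(lambda X: X @ X, X)
print(np.allclose(J_sq, np.kron(I, X.T) + np.kron(X, I), atol=1e-3))   # square rule

J_inv = jacobian_fd(np.linalg.inv, X)
print(np.allclose(J_inv, -np.kron(Xinv, Xinv.T), atol=1e-3))           # inverse rule
```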

Page 54

First some simple identities:

    f(X) = trace(AX),    f'(X) = Aᵀ
    f(X) = trace(AXᵀ),   f'(X) = A
    f(X) = log det(X),   f'(X) = X⁻ᵀ

Then the chain rule for more general expressions:

    vec((∂/∂X log det(G(X)))ᵀ) = (∂G/∂X)ᵀ vec((G(X))⁻¹)
    vec((∂/∂X trace(G(X)))ᵀ) = (∂G/∂X)ᵀ vec(I)
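The simple identities are easy to verify numerically (grad_fd is our own finite-difference helper for scalar-matrix gradients):

```python
import numpy as np

def grad_fd(f, X, eps=1e-6):
    # scalar-matrix gradient by finite differences
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X)) / eps
    return G

X = np.eye(3) + 0.1 * np.random.randn(3, 3)
A = np.random.randn(3, 3)
print(np.allclose(grad_fd(lambda X: np.trace(A @ X), X), A.T, atol=1e-4))
print(np.allclose(grad_fd(lambda X: np.log(np.linalg.det(X)), X),
                  np.linalg.inv(X).T, atol=1e-4))
```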

Page 55

The chain rule to get scalar-matrix derivatives is awkward to use. Instead we have some short-cuts:

    ∂/∂X trace(G(X)H(X)) = ∂/∂X trace(H(Y)G(X) + H(X)G(Y)) |_{Y=X},

    ∂/∂X trace(A F⁻¹(X)) = −∂/∂X trace(F⁻¹(Y) A F⁻¹(Y) F(X)) |_{Y=X},

and

    ∂/∂X log det(F(X)) = ∂/∂X trace(F⁻¹(Y) F(X)) |_{Y=X}.

Page 56

Use the formulas on the previous page to compute

    ∂/∂X trace((X − A)(X − B)⁻¹(X − C)).

The answer is:

    (X − C)ᵀ(X − B)⁻ᵀ + (X − B)⁻ᵀ(X − A)ᵀ
        − (X − B)⁻ᵀ(X − A)ᵀ(X − C)ᵀ(X − B)⁻ᵀ.
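A numerical check of this answer with the grad_fd helper above (note the minus sign on the last term, which comes from differentiating the inverse):

```python
import numpy as np

n = 3
X = np.eye(n) + 0.1 * np.random.randn(n, n)
A, B, C = (0.1 * np.random.randn(n, n) for _ in range(3))

f = lambda X: np.trace((X - A) @ np.linalg.inv(X - B) @ (X - C))
Qit = np.linalg.inv(X - B).T                      # (X - B)^{-T}
G = (X - C).T @ Qit + Qit @ (X - A).T \
    - Qit @ (X - A).T @ (X - C).T @ Qit
print(np.allclose(grad_fd(f, X), G, atol=1e-4))   # True
```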


Page 58

Finally, if

    r(x) = q(x)/p(x) = (Σ_{i=1}^{n} a_i xⁱ) / (Σ_{j=1}^{n} b_j xʲ)

is a scalar-scalar function, we can form a matrix-matrix function by simply substituting a matrix X for the scalar x. Then

    ∂/∂X trace(r(X)) = (r'(X))ᵀ,

and if h(x) = log(r(x)), then

    ∂/∂X log det(r(X)) = (h'(X))ᵀ.
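For example, with the polynomial r(x) = x³ (so r'(x) = 3x²), checked with the grad_fd helper above:

```python
import numpy as np

X = 0.5 * np.random.randn(3, 3)
f = lambda X: np.trace(X @ X @ X)
print(np.allclose(grad_fd(f, X), (3 * X @ X).T, atol=1e-3))   # (r'(X))^T
```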

Pages 59–65

Compute f(X) = trace(F(X)) = trace(((I + X)⁻¹ Xᵀ) X)

[The slides build the expression tree for f node by node; the tree diagrams are not reproduced.] The forward sweep introduces the intermediates

    T1 = I + X,   T2 = T1⁻¹,   T3 = Xᵀ,   T4 = T2 T3,   T5 = T4 X,   f = trace(T5).

Pages 66–72

Forward mode computation of f'(X)

[Tree diagrams not reproduced.] Each intermediate's matrix-matrix derivative is assembled from its children's, using X' = I ⊗ I and I' = 0:

    T1' = 0 + I ⊗ I                          (sum rule: (G + H)' = G' + H')
    T2' = −(T2 ⊗ T2ᵀ) T1'                    (chain rule: (T1⁻¹)' = −(T1⁻¹ ⊗ T1⁻ᵀ) T1')
    T3' = I □ I                              ((Xᵀ)' = I □ I)
    T4' = (I ⊗ T3ᵀ) T2' + (T2 ⊗ I) T3'       (product rule: (GH)' = (I ⊗ Hᵀ)G' + (G ⊗ I)H')
    T5' = (I ⊗ Xᵀ) T4' + (T4 ⊗ I)(I ⊗ I)     (product rule)
    (f')ᵀ = vec⁻¹(vecᵀ(I) T5')               (((trace(F))')ᵀ = vec⁻¹(vecᵀ(I) F'))

Page 73

Forward mode differentiation:

    Requires huge intermediate matrices.
    Only the last step reduces the size.

Critical points:

    F'(X) is composed of Kronecker, box (and outer) products for a large class of functions. (wow!)
    These can be “unwound” by multiplication with vectorized scalar-matrix derivatives (gasp!):

        vecᵀ(C)(A ⊗ B) = vecᵀ(BᵀCA)

Reverse mode differentiation:

    Evaluate the derivative from top to bottom.
    Small scalar-matrix derivatives are propagated down the tree.

Pages 74–83

Reverse Mode Differentiation: (f')ᵀ = vec⁻¹(vecᵀ(I) F')

[Tree diagrams not reproduced.] Starting from R0 = I at the root, each step uses an unwinding identity of the form vecᵀ(R)(A ⊗ B) = vecᵀ(BᵀRA) to push a small scalar-matrix sensitivity down one edge of the tree:

    R0 = I                 (root: f = trace(T5))
    R1 = X R0              vecᵀ(R0)(I ⊗ Xᵀ) = vecᵀ(X R0 I)         (T5 → T4)
    R2 = R0 T4             vecᵀ(R0)(T4 ⊗ I) = vecᵀ(I R0 T4)        (T5 → X)
    R3 = T3 R1             vecᵀ(R1)(I ⊗ T3ᵀ) = vecᵀ(T3 R1 I)       (T4 → T2)
    R4 = R1 T2             vecᵀ(R1)(T2 ⊗ I) = vecᵀ(I R1 T2)        (T4 → T3)
    R5 = −T2 R3 T2         vecᵀ(R3)(−T2 ⊗ T2ᵀ) = vecᵀ(−T2 R3 T2)   (T2 → T1)
    R6 = R4ᵀ               vecᵀ(R4)(I □ I) = vecᵀ(I R4ᵀ I)         (T3 → X)
    R7 = 0                 vecᵀ(R5) 0 = vecᵀ(0)                    (T1 → I)
    R8 = R5                vecᵀ(R5)(I ⊗ I) = vecᵀ(R5)              (T1 → X)

Collecting the three paths that reach X:

    f'(X) = R2ᵀ + R6ᵀ + R8ᵀ.
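A sketch of the whole reverse pass for this example, checked against finite differences:

```python
import numpy as np

def f_and_grad(X):
    # forward sweep
    I = np.eye(X.shape[0])
    T1 = I + X
    T2 = np.linalg.inv(T1)
    T3 = X.T
    T4 = T2 @ T3
    T5 = T4 @ X
    f = np.trace(T5)
    # reverse sweep, following the R_i above
    R0 = I
    R1 = X @ R0              # T5 -> T4
    R2 = R0 @ T4             # T5 -> X
    R3 = T3 @ R1             # T4 -> T2
    R4 = R1 @ T2             # T4 -> T3
    R5 = -T2 @ R3 @ T2       # T2 -> T1 (through the inverse)
    R6 = R4.T                # T3 -> X  (through the transpose)
    R8 = R5                  # T1 -> X
    return f, R2.T + R6.T + R8.T

X = 0.1 * np.random.randn(4, 4)
f, G = f_and_grad(X)
eps, G_fd = 1e-6, np.zeros_like(X)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(X); E[i, j] = eps
        G_fd[i, j] = (f_and_grad(X + E)[0] - f) / eps
print(np.allclose(G, G_fd, atol=1e-4))   # True
```

Note that every Rᵢ is a small d×d matrix: the huge Kronecker and box products are never formed.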

Page 84

There is growing interest in Machine Learning in scalar-matrix objective functions:

    Probabilistic graphical models
    Covariance selection
    Optimization in graphs and networks
    Data mining in social networks

Can we use this theory to help optimize such functions? We are on the look-out for interesting problems.

Page 85

The Anatomy of a Matrix-Matrix Function

Theorem. Let R(X) be a rational matrix-matrix function formed from constant matrices and K occurrences of X using arithmetic matrix operations (+, −, ∗ and (·)⁻¹) and transposition ((·)ᵀ). Then the derivative of the matrix-matrix function is of the form

    Σ_{i=1}^{k1} A_i ⊗ B_i + Σ_{i=k1+1}^{K} A_i □ B_i.

The matrices A_i and B_i are computed as parts of R(X).

Page 86

Hessian Forms

The derivative of a function f of the form trace(R(X)) or log det(R(X)) is a rational function. Therefore the Hessian is of the form

    Σ_{i=1}^{k1} A_i ⊗ B_i + Σ_{i=k1+1}^{K} A_i □ B_i,

where K is the number of times X occurs in the expression for the derivative. If the A_i, B_i are d × d matrices, then multiplication by H can be done with O(Kd³) operations. Generalizations of this result can be found in the paper.
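The O(Kd³) multiply never forms H explicitly: it applies (A ⊗ B)vec(V) = vec(BVAᵀ) and (A □ B)vec(V) = vec(BVᵀAᵀ) term by term (column-major vec; hess_matvec is our own helper name, checked against the explicit H built with the box() and vec() helpers from earlier):

```python
import numpy as np

def hess_matvec(As, Bs, k1, V):
    # H = sum_{i<k1} A_i kron B_i + sum_{i>=k1} A_i box B_i, applied to vec(V)
    out = np.zeros_like(V)
    for i, (A, B) in enumerate(zip(As, Bs)):
        W = V if i < k1 else V.T          # box terms act on the transpose
        out += B @ W @ A.T                # O(d^3) per term
    return out

d, k1, K = 3, 2, 4
As = [np.random.randn(d, d) for _ in range(K)]
Bs = [np.random.randn(d, d) for _ in range(K)]
H = sum(np.kron(A, B) for A, B in zip(As[:k1], Bs[:k1])) \
  + sum(box(A, B) for A, B in zip(As[k1:], Bs[k1:]))
V = np.random.randn(d, d)
print(np.allclose(H @ vec(V), vec(hess_matvec(As, Bs, k1, V))))   # True
```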

Page 87

Let f : R^{d×d} → R with f(X) = trace(R(X)) or f(X) = log det(R(X)).

Derivative: G = f'(X) ∈ R^{d×d}. Hessian: H = f''(X) ∈ R^{d²×d²}.

The Newton direction vec(V) = H⁻¹vec(G) can be computed efficiently if K ≪ d:

    K     Algorithm                       Complexity     Storage
    1     (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹           O(d³)          O(Kd²)
    2     Bartels-Stewart algorithm       O(d³)          O(Kd²)
    ≥ 3   Conjugate gradient algorithm    O(Kd⁵)         O(Kd²)
    ≥ d   General matrix inversion        O(d⁶ + Kd⁴)    O(d⁴)

The (generalized) Bartels-Stewart algorithm solves the Sylvester-like equation AX + XᵀB = C.

Page 88

What about optimizing functions of the form

    f(X) + ‖vec(X)‖₁?

Common approaches:

    Strategy       Method              Uses Hessian structure?
    Newton-Lasso   Coordinate descent  yes
    Newton-Lasso   FISTA               yes
    Orthantwise    ℓ₁ CG               yes
    Orthantwise    L-BFGS              no

It is not obvious how to take advantage of the Hessian structure for all these methods. For orthantwise CG, for example, the sub-problem requires the Newton direction for a sub-matrix of the Hessian.

Page 89

1. We have applied this methodology to the covariance selection problem; a paper is forthcoming.

2. We are digging deeper into the theory of matrix differentiation and the properties of the box products.

3. We are on the look-out for more interesting matrix optimization problems. All suggestions are appreciated!