
Can We Gain More from Orthogonality Regularizations in Training Deep CNNs?
Nitin Bansal, Xiaohan Chen, Zhangyang Wang

Department of Computer Science and Engineering, Texas A&M University

OVERVIEW

• We develop novel orthogonality regularizations on training deep CNNs, by borrowing ideas and tools from sparse optimization.

• These plug-and-play regularizations can be conveniently incorporated into the training of almost any CNN without extra hassle.

• In extensive empirical experiments, the proposed regularizations consistently improve the performance of baseline deep networks on the CIFAR-10/100, ImageNet and SVHN datasets, and also accelerate and stabilize the training curves.

• The proposed orthogonality regularizations outperform existing competitors.

PRELIMINARIES

Goal We aim to regularize the (overcomplete or undercomplete) CNN weights to be “close” to orthogonal ones, improving both training stability and final accuracy.

Notation The weight in one fully-connected layer is denoted as $W \in \mathbb{R}^{m \times n}$. For a convolutional layer $C \in \mathbb{R}^{S \times H \times C \times M}$, we reshape $C$ into $W' \in \mathbb{R}^{m' \times n'}$, where $m' = S \times H \times C$ and $n' = M$, to reduce it to the form of a fully-connected layer.
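As a concrete illustration, here is a minimal PyTorch sketch of this reshaping (the function name is ours). PyTorch stores conv kernels as $M \times C \times S \times H$, so the flattened rows come out in an order permuted relative to $S \times H \times C$; this leaves the Gram matrix $W'^T W'$, which is all the regularizers below use, unchanged.

```python
import torch

def reshape_conv_weight(conv_weight: torch.Tensor) -> torch.Tensor:
    """Flatten a conv kernel into a 2-D matrix W' with one filter
    per column, reducing it to the fully-connected form."""
    M = conv_weight.shape[0]               # number of filters (out channels)
    return conv_weight.reshape(M, -1).t()  # shape: (C*S*H, M)
```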

Mutual Coherence The mutual coherence of a weight $W$ is defined as

$$\mu_W = \max_{i \neq j} \frac{|\langle w_i, w_j \rangle|}{\|w_i\| \cdot \|w_j\|}, \qquad (1)$$

where $w_i$ denotes the $i$-th column of $W$, $i = 1, 2, \ldots, n$. In order for $W$ to have orthogonal or near-orthogonal columns, $\mu_W$ should be as low as possible (zero if $m \geq n$).
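Continuing the PyTorch sketch above (function name ours), Eq. (1) can be computed by normalizing the columns and reading off the largest absolute off-diagonal entry of the Gram matrix:

```python
def mutual_coherence(W: torch.Tensor) -> torch.Tensor:
    """Compute mu_W (Eq. 1) for a 2-D weight matrix W."""
    Wn = W / W.norm(dim=0, keepdim=True).clamp_min(1e-12)  # unit columns
    G = Wn.t() @ Wn                                        # Gram matrix
    n = G.shape[0]
    # The diagonal of G is 1 after normalization, so subtracting the
    # identity leaves exactly the off-diagonal coherence terms.
    return (G - torch.eye(n, device=G.device)).abs().max()
```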

Restricted Isometry Property We rewrite the Restricted Isometry Property condition of $W$ as:

$$\delta_W = \sup_{z \in \mathbb{R}^n,\, z \neq 0} \left| \frac{\|Wz\|^2}{\|z\|^2} - 1 \right|, \qquad (2)$$

where $z$ is $k$-sparse. Note that $\delta_W$ reduces to the spectral norm of $W^T W - I$, denoted as $\sigma(W^T W - I)$, if we let $k = n$.
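For the $k = n$ case this quantity can be computed exactly; a small sketch (ours), useful only as a sanity check, since a full decomposition is too expensive inside a training loop:

```python
def spectral_norm_exact(W: torch.Tensor) -> torch.Tensor:
    """Exact sigma(W^T W - I), i.e., delta_W for k = n (Eq. 2)."""
    n = W.shape[1]
    A = W.t() @ W - torch.eye(n, device=W.device)
    return torch.linalg.matrix_norm(A, ord=2)  # largest singular value
```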

ORTHOGONALITY REGULARIZATION

Soft Orthogonality Regularization (SO) SO simply minimizes the distance from the Gram matrix of $W$ to the identity matrix:

$$\text{(SO)} \qquad \lambda \|W^T W - I\|_F^2. \qquad (3)$$
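A minimal PyTorch sketch of the SO penalty (function name ours); it is differentiable everywhere, so it can simply be added to the task loss:

```python
def so_reg(W: torch.Tensor, lam: float) -> torch.Tensor:
    """Soft Orthogonality (Eq. 3): lam * ||W^T W - I||_F^2."""
    n = W.shape[1]
    G = W.t() @ W
    return lam * (G - torch.eye(n, device=W.device)).pow(2).sum()
```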

Double Soft Orthogonality Regularization (DSO) DSO tries to regularize better when $W$ is overcomplete, by appending another term to (3):

$$\text{(DSO)} \qquad \lambda \left( \|W^T W - I\|_F^2 + \|W W^T - I\|_F^2 \right). \qquad (4)$$
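Since $\|W W^T - I\|_F^2$ is just the SO penalty applied to $W^T$, DSO can reuse the sketch above:

```python
def dso_reg(W: torch.Tensor, lam: float) -> torch.Tensor:
    """Double Soft Orthogonality (Eq. 4): penalize both Gram matrices,
    so whichever of W^T W or W W^T can reach identity is driven there."""
    return so_reg(W, lam) + so_reg(W.t(), lam)
```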

Mutual Coherence Regularization (MC) We suppress $\mu_W$ to enforce orthogonality. Assuming the columns of $W$ are normalized to unit vectors (what if not?), we propose the following MC regularization based on (1):

$$\text{(MC)} \qquad \lambda \|W^T W - I\|_\infty. \qquad (5)$$
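A sketch of Eq. (5) (ours), reading $\|\cdot\|_\infty$ as the entrywise maximum absolute value; the max is nonsmooth, but subgradients suffice for SGD:

```python
def mc_reg(W: torch.Tensor, lam: float) -> torch.Tensor:
    """Mutual Coherence regularizer (Eq. 5): lam * max |(W^T W - I)_ij|.
    With unit-norm columns the diagonal of W^T W - I vanishes, so this
    equals lam * mu_W."""
    n = W.shape[1]
    G = W.t() @ W
    return lam * (G - torch.eye(n, device=W.device)).abs().max()
```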

Spectral Restricted Isometry Property Regularization (SRIP) We suppress the spectral norm $\sigma(W^T W - I)$ to enforce orthogonality, and propose the following SRIP regularization based on (2):

$$\text{(SRIP)} \qquad \lambda \cdot \sigma(W^T W - I). \qquad (6)$$

Power Method for Efficient SRIP Implementation To avoid the computationally expensive EVD, we approximate the spectral norm using the truncated power iteration method. Starting with a randomly initialized $v \in \mathbb{R}^n$, we iteratively perform the following procedure a small number of times (2 times by default):

$$u \leftarrow (W^T W - I)\,v, \quad v \leftarrow (W^T W - I)\,u, \quad \sigma(W^T W - I) \leftarrow \frac{\|v\|}{\|u\|}. \qquad (7)$$
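Putting Eqs. (6) and (7) together, a minimal differentiable sketch (ours; the released code may differ in details such as the iteration count or re-normalization):

```python
def srip_reg(W: torch.Tensor, lam: float, n_iters: int = 2) -> torch.Tensor:
    """SRIP (Eq. 6) via truncated power iteration (Eq. 7): estimate the
    spectral norm of A = W^T W - I without an expensive EVD."""
    n = W.shape[1]
    A = W.t() @ W - torch.eye(n, device=W.device)
    v = torch.randn(n, 1, device=W.device)
    for _ in range(n_iters):
        u = A @ v
        v = A @ u
    # ||v|| / ||u|| = ||A u|| / ||u||, which approaches sigma(A).
    return lam * v.norm() / u.norm().clamp_min(1e-12)
```

In training, the chosen penalty is summed over all (reshaped) layer weights and added to the task loss; since every step above is differentiable, autograd handles the gradients.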

LINKS

arXiv preprint: Source Codes:

EXPERIMENTAL RESULTS

• We perform experiments on several of the most popular state-of-the-art models: ResNet (including several variants), Wide ResNet and ResNeXt. Datasets include CIFAR-10, CIFAR-100, SVHN and ImageNet.

• All results endorse the advantages of orthogonality regularization in improving final accuracy: the gains are evident, stable and reproducible, sometimes by a large margin. SRIP is the best among all, and incurs negligible extra computational load.

EFFECTS ON THE TRAINING PROCESS

We carefully inspect the training curves (validation accuracy vs. epoch number) of the different methods on CIFAR-10 and CIFAR-100; ResNet-110 curves are shown here. Top: CIFAR-10; Bottom: CIFAR-100.