
Gated Autoencoders with Tied Input Weights

Droniou Alain [email protected]
Sigaud Olivier [email protected]

Institut des Systèmes Intelligents et de Robotique (ISIR), Université Pierre et Marie Curie - CNRS 7222, 4 Place Jussieu, 75005 Paris, France

Abstract

The semantic interpretation of images is one of the core applications of deep learning. Several techniques have been recently proposed to model the relation between two images, with applications to pose estimation, action recognition or invariant object recognition. Among these techniques, higher-order Boltzmann machines or relational autoencoders consider projections of the images on different subspaces, and intermediate layers act as transformation-specific detectors. In this work, we extend the mathematical study of (Memisevic, 2012b) to show that it is possible to use a unique projection for both images in a way that turns intermediate layers into spectrum encoders of transformations. We show that this results in networks that are easier to tune and have greater generalization capabilities.

1. Introduction

Recognizing actions from images is an important challenge for developmental robotics (Fitzpatrick et al., 2003; Montesano et al., 2008). Gated neural networks can lead to state-of-the-art performance in action recognition (Taylor et al., 2010), and are very good at encoding relations between images (Memisevic & Hinton, 2007; Memisevic, 2012a).

The performance of gated neural networks comes from the multiplicative interactions between their different layers. These interactions allow the network to uncouple the representation of a "dictionary" from the representation of the plausible transformations of its elements.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

In the particular case of encoding the relation between images, this allows the network to focus on the transformation between two images without spending computational resources on the representation of their content: the elements of the dictionary are the images themselves, which are provided as inputs. In this case, the role of the network is to discover an efficient representation of the correlations between images, which is encoded in a layer commonly referred to as "mapping units".

Recently, (Memisevic, 2012b) proposed a mathematical framework for gated autoencoders which links energy-based models, such as higher-order Boltzmann machines (Memisevic & Hinton, 2010), with relational autoencoders. This framework shows that classical gated autoencoders learn pairs of filters which are related by the transformation to detect. The projection of both images on these pairs creates a correlation peak when the transformation between the two images is consistent with the transformation encoded in a pair of filters. Moreover, for each transformation to be detected, two pairs of filters are learned in order to span the subset of images: to be able to detect a transformation independently of the content of the images, the two pairs are in quadrature, such that if the images are orthogonal to the first pair of filters, the transformation will be detected by the other pair. A factor layer computes the correlation between the projections of the images on each pair of filters, and a mapping layer pools over the factor units corresponding to quadrature pairs to obtain a content-independent detection of the transformation. This implies that the number of filters and mapping units has to grow with the desired accuracy of transformation detection, since the number of mapping units grows inversely with the step size between two transformations to be distinguished. This can lead to prohibitive network sizes for real-world applications such as action recognition, especially when the number of filter pairs increases, since each filter adds one weight per image pixel.


Furthermore, Memisevic's work shows that commuting transformations, like translations or rotations, can share the same set of filters. In this case, the detection of a specific transformation requires detecting an angle between the projection of the first image on the filters in quadrature and the projection of the second image on the same filters. However, this raises the aperture problem if we detect the angle after normalizing the projections, as suggested by Memisevic: this can lead to false or very inaccurate detection of transformations when the projections of the images are close to zero.

In this work, we show that the normalization of the projections is not mandatory. With this approach, mapping units are no longer transformation specific, and we show that the mapping layer represents a discretization of the spectrum of the transformations instead of a discretization of the transformations themselves. This reduces the size needed for a correct representation of transformations, from n×n matrices to n eigenvalues.

2. Gated autoencoders

In this section, we briefly recall the mathematical analysis of gated autoencoders (a.k.a. relational autoencoders) from (Memisevic, 2012b).

Let us consider two images x and y, each in the form of a vector (x, y ∈ R^n), related by a transformation L such that

y = Lx (1)

where L is an orthogonal matrix (L L^T = I, I being the identity matrix). In particular, orthogonal matrices cover all the permutations of pixels, and these transformations can be roughly seen as "constant information" transformations, where all the pixels of one image can be fully predicted from the pixels of the other image.

Orthogonal matrices can be diagonalized by

L = U D U^T    (2)

where D is the diagonal matrix containing the eigenvalues, which are all complex numbers of modulus 1, and U contains the eigenvectors. When two transformations L1 and L2 of the same space commute, they share the same eigenvectors and their factorizations differ only by the diagonal matrix. Remarkably, from (2), since L is a real matrix, we have U^-1 = U^T and (1) can be rewritten

U^T y = D U^T x.    (3)

It then appears that the eigenvalues can be retrieved by detecting the rotation angles between the projections of x and y¹. Memisevic showed that the inner product between normalized projections directly provides the cosine of the angle, but the normalization of these projections is unstable when they are too close to zero, leading to false detections of transformations.
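A minimal numpy sketch of this spectral view, using a circular pixel shift as the orthogonal transformation and the conjugate transpose of the eigenvector matrix for the projection (the sizes and the choice of transformation are illustrative assumptions):

    import numpy as np

    n = 8
    rng = np.random.default_rng(0)

    # An orthogonal transformation: a circular shift of the pixel vector.
    L = np.roll(np.eye(n), 1, axis=0)

    # Eigendecomposition L = U diag(d) U^{-1}; here U is unitary, so its inverse
    # is its conjugate transpose.
    d, U = np.linalg.eig(L)

    x = rng.standard_normal(n)     # first "image" (flattened)
    y = L @ x                      # second image, related to x by L

    ux = U.conj().T @ x            # projection of x on the eigenvectors
    uy = U.conj().T @ y            # projection of y on the eigenvectors

    # Eq. (3): the two projections differ only by the eigenvalues, i.e. by a
    # rotation of each complex component.
    assert np.allclose(uy, d * ux)

    # The per-component rotation angle recovers the spectrum of the transformation.
    angles = np.angle(uy * np.conj(ux))
    print(np.allclose(np.exp(1j * angles), d))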

(Memisevic, 2012a;b) reformulates the problem as a detection task consisting in learning input and output filters U and V, such that V incorporates D and U. Then, Eq. (3) becomes

U^T y = V^T x    (4)

and the correlation between input and output filters is a detector whose value depends on the transformation and on the content of the images, through their projections on U and V. To obtain a content-independent representation, it is possible to pool over a set of detectors which represent the same transformation but span the space of images (they are in "quadrature"). In these architectures, a factor layer computes the correlations between projections, while a mapping layer pools over the factor layer to obtain the transformation-specific detectors.
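Schematically, the bottom-up pass of such a classical gated autoencoder looks as follows (an illustrative numpy sketch; the sizes are arbitrary and the random matrices stand in for the learned filters U, V and the pooling weights W):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    n_pixels, n_factors, n_mappings = 169, 200, 40     # illustrative sizes
    rng = np.random.default_rng(0)

    U = rng.standard_normal((n_pixels, n_factors))     # input filters
    V = rng.standard_normal((n_pixels, n_factors))     # output filters (transformation specific)
    W = rng.standard_normal((n_mappings, n_factors))   # factor-to-mapping (pooling) weights
    b = np.zeros(n_mappings)

    x = rng.standard_normal(n_pixels)
    y = rng.standard_normal(n_pixels)

    # Factor layer: correlation between the projections of the two images.
    f = (U.T @ x) * (V.T @ y)

    # Mapping layer: pools over quadrature pairs to get content-independent detectors.
    m = sigmoid(W @ f + b)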

3. Encoding relationship with tied input weights

The approach described in Section 2 is suboptimal, since it requires several pairs of filters to represent each specific transformation. The idea behind our work is to use Eq. (2) directly and make the factor layer represent the diagonal matrix D, while the filters represent the matrix U. This leads to a simpler network, since there is no longer a distinction between input and output filters, and the filters are shared between all commuting transformations.

3.1. Theoretical description

Recalling Eq. (3) and the fact that D contains complex values of modulus 1, all the information about D can be retrieved through the cosine and sine of the angle between the projections of x and y, which we generically denote ux and uy. If we see these complex values as 2D vectors, the inner product ux.uy and the magnitude² of the cross product ux × uy provide the cosine and sine up to a multiplicative factor ||ux||.||uy||, which is the same for both values:

ux . uy = ||ux||.||uy||.cos(ux, uy)
ux × uy = ||ux||.||uy||.sin(ux, uy).    (5)

¹ A multiplication by a complex number of modulus 1 is equivalent to a rotation in the complex plane.

² The magnitude corresponds to the cross product without the unit vector.



Figure 1. Gated autoencoder with tied input weights.

These equations are sufficient to retrieve the angle modulo 2π, since their quotient gives the tangent of the angle. However, taking this quotient would lead to the same problem as normalization, with very unstable values when the projections (i.e., the norms of ux and uy) are close to zero. It is much better to keep the cosine and sine values apart, so that small values simply indicate that the corresponding eigenvectors are not relevant to describe the considered transformation. Considering complex numbers, the inner product is given by Real(uy conj(ux)) and the magnitude of the cross product by Imag(uy conj(ux)), where conj(·) denotes complex conjugation and Real and Imag are respectively the real and imaginary parts of a complex number. To sum up, all the information is contained in uy conj(ux)³.

Our factor layer is thus given by

f = U^T y ∗ U^T x    (6)

where ∗ denotes element-wise multiplication. We can then use the mapping layer of classical relational autoencoders to compress the factor layer into a lower-dimensional representation.
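In complex notation, the factor layer and its reading as unnormalized cosine and sine values can be sketched as follows (illustrative sizes; the conjugation convention follows the discussion above and is a choice of this sketch):

    import numpy as np

    n_pixels, n_factors = 169, 100                     # illustrative sizes
    rng = np.random.default_rng(0)

    # Complex filters: each column plays the role of one eigenvector-like filter.
    U = rng.standard_normal((n_pixels, n_factors)) \
        + 1j * rng.standard_normal((n_pixels, n_factors))

    x = rng.standard_normal(n_pixels)
    y = rng.standard_normal(n_pixels)

    ux = U.conj().T @ x                                # projection of the first image
    uy = U.conj().T @ y                                # projection of the second image

    # One complex product per filter: its real part is the inner product and its
    # imaginary part the cross product, both scaled by ||ux||.||uy|| (no normalization).
    f = uy * np.conj(ux)
    inner, cross = f.real, f.imag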

We train our network as classical gated autoencoders, where the goal is to minimize the error of the reconstruction of the second image from the first.

³ From (3), we deduce uy conj(ux) = d ||ux||².

To do that, each image is filtered by a stack of filters U, which projects the images onto the factor space, which is then projected onto the mapping layer. A backward pass through the network then produces the reconstruction of the output image (Fig. 1). We also adopt the denoising approach, in which inputs are corrupted with some noise before flowing through the network, to avoid overfitting (Vincent et al., 2010). In our case, we use a zero-mask noise, which sets a fraction of the pixels (typically 30%) to zero and leaves the others unaltered. We denote by x̃ the corrupted version of x. This constraint is all the more important as sparsity-based regularizations (such as (Lee et al., 2008)) are not well suited to a spectrum representation.
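The zero-mask corruption can be written as a small helper (our own sketch; the 30% corruption level is the one quoted above):

    import numpy as np

    def zero_mask(x, corruption_level=0.3, rng=None):
        """Set a random fraction of the entries of x to zero, leave the others unaltered."""
        rng = rng or np.random.default_rng()
        keep = rng.random(x.shape) >= corruption_level
        return x * keep

    rng = np.random.default_rng(0)
    x = rng.standard_normal(169)
    x_tilde = zero_mask(x, rng=rng)    # corrupted input fed to the network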

The mapping layer is given by

m = σ(W1 f + bias_mappings)    (7)

where f is the factor layer (Eq. 6) obtained from the corrupted versions of x and y, and σ is the softplus activation function σ(x) = log(1 + exp(x)), a smooth version of rectified linear units. Note that we use softplus units where classical gated autoencoders use sigmoid units. Since classical gated autoencoders act as "transformation-specific detectors" and pool over a set of detectors for each transformation, their mapping units represent the presence or absence of a given transformation, and a "probability" unit is a good choice. In our case, we want to represent the spectrum of the transformation, so that linear units are better suited than sigmoid units.

The reconstruction is then

r = U (W2^T m ∗ U^T x) + bias_output.    (8)

The reconstruction error used for the gradient descent is finally given by

error = ||y − r||².    (9)

Note that two weight matrices W1 and W2 are used between the factor and mapping layers. This is useful since the reconstruction given by (8) is not strictly correct with respect to our mathematical description. In particular, it is closer to r = ||U^T x||² U D U^T x (because uy conj(ux) = d ||ux||²), whereas we expect y = U D U^T x (Eq. 1 and 2). This would be the case in particular if W1 = W2, but untying these weights allows the network to scale the reconstructed activation of the factor layer differently from its bottom-up activation. It also lets the network cancel out factor values which correspond to eigenvectors that are irrelevant for a given transformation, in the case where the learning set is composed of several classes of transformations.
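Gathering Eqs. (6)-(9), one bottom-up/top-down pass can be sketched in complex arithmetic as follows (an illustrative numpy sketch; the splitting of the factor layer into real and imaginary parts mirrors, but simplifies, the real-valued implementation of Section 3.2, and all sizes and random matrices are placeholders):

    import numpy as np

    def softplus(a):
        # log(1 + exp(a)), computed in a numerically stable way
        return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

    n_pixels, n_factors, n_mappings = 169, 100, 40
    rng = np.random.default_rng(0)

    U  = rng.standard_normal((n_pixels, n_factors)) \
         + 1j * rng.standard_normal((n_pixels, n_factors))
    W1 = rng.standard_normal((n_mappings, 2 * n_factors))
    W2 = rng.standard_normal((n_mappings, 2 * n_factors))
    b_map, b_out = np.zeros(n_mappings), np.zeros(n_pixels)

    x = rng.standard_normal(n_pixels)
    y = rng.standard_normal(n_pixels)

    # Eq. (6): complex factor layer, kept as separate real and imaginary parts.
    f = (U.conj().T @ y) * np.conj(U.conj().T @ x)
    f_real = np.concatenate([f.real, f.imag])

    # Eq. (7): mapping layer with softplus units.
    m = softplus(W1 @ f_real + b_map)

    # Eq. (8): reconstruction of y from x, modulated by the code in the mapping layer.
    g = W2.T @ m
    modulation = g[:n_factors] + 1j * g[n_factors:]    # per-filter complex gains
    r = (U @ (modulation * (U.conj().T @ x))).real + b_out

    # Eq. (9): reconstruction error minimized by gradient descent.
    error = np.sum((y - r) ** 2)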

Page 4: Gated Autoencoders with Tied Input Weights · Gated Autoencoders with Tied Input Weights Furthermore, Memisevic’s work shows that commut-ing transformations, like translations or

Gated Autoencoders with Tied Input Weights

3.2. Practical implementation

Our mathematical description involves complex numbers, whereas neural networks generally work on real numbers, and very few studies have been carried out on more general mathematical frameworks (see for instance (Baldi, 2012; Baldi et al., 2012)). In particular, our study involves multiplications of complex numbers. When working with real numbers, we have to split the real and imaginary parts of each filter into filter pairs, as mentioned in (Memisevic, 2012b). However, the element-wise multiplication between the projections of both images, as usually performed in gated autoencoders, makes it impossible to retrieve the complex multiplication. In previous gated autoencoders, this was hidden by the fact that only specific detectors were learned, so that only the cosine of the angle was necessary and a simple inner product was sufficient. In our case, we need to retrieve both the cosine and the sine of the angle, so we also need the cross product. Otherwise, it would be impossible to differentiate symmetric transformations, such as rotations of θ and −θ. One way to deal with this issue consists in computing a matrix multiplication between both projections, which would create all possible two-term products, but this would lead to a huge factor layer, with many weights to project it onto the mapping layer. Moreover, most of these coefficients would be useless, since they involve multiplications between unrelated filters. This also prevents mini-batch learning optimizations through a matrix representation of multiple training cases: one training case could not flow through the network as one column of a matrix, since the factor layer would require a matrix for each case.

Our proposed solution is to put prior knowledge about complex multiplication inside the connectivity of the network. To do that, we artificially duplicate the factor layer: the first half involves term-to-term multiplications, while the second half is "crossed" so that term-to-term multiplications correspond to the multiplication between the real and imaginary parts of each filter. In other words, we shape the network with the structure sufficient to learn filters corresponding to our mathematical description. "Crossing" the second half of the factors simply consists in duplicating the factors from one image, while the factors from the other image are multiplied by a block-diagonal matrix B composed of 2×2 blocks

    (  0  1
      -1  0 ).

We also use a pooling matrix to tie the learning process of pairs of filters, as mentioned in (Memisevic, 2012a), and to directly retrieve the values of the inner and cross products. This pooling matrix is also useful to

restrict the final factor layer size to the one obtained before our artificial duplication of factors. The whole process is illustrated in Fig. 2.

Figure 2. The real-valued implementation of our algorithm involves some hard-coded operations to simulate complex-valued multiplications. These operations are computed by multiplications with constant matrices (see Section 3.2).

Let P_l be the pooling matrix of size l × 2l,

    P_l = ( 1 1 0 0 0 0 ...
            0 0 1 1 0 0 ...
                  ...       ),

I_l the identity matrix of size l, B_l the block-diagonal matrix B defined above with size l, and R_l a reordering matrix

    R_l = ( 1 0 0 ...  0 0 0 ...
            0 0 0 ... -1 0 0 ...
            0 1 0 ...  0 0 0 ...
            0 0 0 ...  0 -1 0 ...
                  ...            ).

Finally, let E1, E2 and E3 be duplicating matrices of size 2l × l (l being the size of the factor layer), obtained by vertical stacking:

    E1 = ( I_l ; I_l ),   E2 = ( I_l ; B_l ),   E3 = ( R_l ; B_l R_l ).

Equations (7) and (8) can finally be rewritten:

m = σ( W1 P ((E1 U^T x) ∗ (E2 U^T y)) + bias_mappings )
r = U P ((E3 W2^T m) ∗ (E1 U^T x)) + bias_output    (10)

Note that E3 could be incorporated into W2 during the learning process rather than being defined a priori. In that case, W2 would have twice as many rows as W1.
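One concrete way to build these constant matrices and check the bottom-up line of Eq. (10) is sketched below (our own numpy construction; the variable names, the even factor layer size and the use of np.kron are assumptions of the sketch, and the top-down line would use E3 in the same way):

    import numpy as np

    l, n_pixels = 6, 169               # factor layer size (even) and image size, illustrative
    rng = np.random.default_rng(0)

    # Real filters: columns (2i, 2i+1) play the role of the real and imaginary
    # parts of the i-th complex filter.
    U = rng.standard_normal((n_pixels, l))
    x = rng.standard_normal(n_pixels)
    y = rng.standard_normal(n_pixels)

    I = np.eye(l)
    B = np.kron(np.eye(l // 2), np.array([[0.0, 1.0], [-1.0, 0.0]]))   # 2x2 blocks ((0,1),(-1,0))
    P = np.kron(np.eye(l), np.array([[1.0, 1.0]]))                     # l x 2l, pools consecutive pairs
    E1 = np.vstack([I, I])
    E2 = np.vstack([I, B])

    fx, fy = U.T @ x, U.T @ y
    pooled = P @ ((E1 @ fx) * (E2 @ fy))   # bottom-up factors of Eq. (10), before W1 and the bias

    # The first half of `pooled` holds the per-filter inner products and the second
    # half the per-filter cross products, i.e. the complex multiplication is hard-coded.
    ux = fx[0::2] + 1j * fx[1::2]
    uy = fy[0::2] + 1j * fy[1::2]
    assert np.allclose(pooled[: l // 2], (uy * np.conj(ux)).real)
    assert np.allclose(pooled[l // 2:], (uy * np.conj(ux)).imag)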

3.3. Relation with quadrature pairs

The notion of "generalized quadrature pairs" was introduced by (Bethge et al., 2007) to highlight the fact that any orthogonal transformation L has a block-diagonal representation W^T L W = G ∈ R^{m×m} such that each block is at most two-dimensional, and each two-dimensional block belongs to SO(2, R) (the special orthogonal group). Therefore, the corresponding pairs of basis functions are referred to as "generalized quadrature pairs".

Remarkably, if we denote by L = U D U^T the eigendecomposition of L (which involves complex numbers), we have G = M D M^T, where M is a block-diagonal matrix with blocks

    (sqrt(2)/2) ( 1  i
                  i  1 ).

Then L = U D U^T = W G W^T = U M^T G M U^T, which leads to W^T U M^T G = G W^T U M^T. Thus, G and W^T U M^T commute, so they share the same eigenvectors. Since D = M^T G M, the eigenvectors of G are the columns of M, and W^T U M^T M = M Θ, where Θ is a diagonal matrix containing the eigenvalues of W^T U M^T (which are all complex numbers of modulus 1, by orthogonality), leading to U = W M Θ.

Finally, the real and imaginary parts of each complex filter in U correspond to a quadrature pair of W, up to a rotation in the complex plane (the matrix Θ).
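A small numerical illustration of this correspondence, using a single 2D rotation for which W can be taken as the identity and the generalized quadrature pair is the canonical basis (the check below is our own sketch and only verifies that the real and imaginary parts of a complex eigenvector form an orthogonal pair of equal norms):

    import numpy as np

    theta = 0.7
    L = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])    # an element of SO(2, R)

    eigvals, eigvecs = np.linalg.eig(L)                # eigenvalues exp(+/- i*theta)
    u = eigvecs[:, 0]                                  # one unit-norm complex eigenvector

    re, im = u.real, u.imag
    # The real and imaginary parts span the plane like a quadrature pair, up to the
    # phase factor (the matrix Theta above) and a 1/sqrt(2) scaling.
    print(np.isclose(re @ im, 0.0),
          np.isclose(np.linalg.norm(re), np.linalg.norm(im)))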

4. Experiments

In this section, we present some experimental results to show the effectiveness of our approach.

4.1. Experimental setup

For all our experiments, we use 13×13 pixel images consisting of random dots, whose values are drawn independently from a normal distribution. We then apply translations combining uniformly random +/-3 pixel shifts horizontally and vertically, or rotations between -50 and +50 degrees. Pixels of the output image not covered by the input image are also drawn randomly according to the same distribution. For each experiment, we create a training set of 100,000 pairs of images, equally distributed among the considered transformations. We measure the error as the arithmetic mean of ||y − r||², where r is the reconstruction of y (Eq. 10).
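Such image pairs can be generated along the following lines (a sketch using scipy.ndimage for the geometric transformations; the interpolation orders and the way uncovered pixels are detected are our own simplifications):

    import numpy as np
    from scipy.ndimage import rotate, shift

    def make_pair(rng, size=13, kind="rotation"):
        """Create an (x, y) pair of random-dot images related by a random transformation."""
        x = rng.standard_normal((size, size))              # i.i.d. normal pixels
        if kind == "rotation":
            angle = rng.uniform(-50, 50)                    # degrees
            y = rotate(x, angle, reshape=False, order=1, mode="constant")
        else:
            dy, dx = rng.integers(-3, 4, size=2)            # +/- 3 pixel shifts
            y = shift(x, (dy, dx), order=0, mode="constant")
        # Crude detection of the pixels not covered by the input image (filled with 0),
        # redrawn from the same distribution as described above.
        uncovered = (y == 0.0)
        y[uncovered] = rng.standard_normal(np.count_nonzero(uncovered))
        return x.ravel(), y.ravel()

    rng = np.random.default_rng(0)
    pairs = [make_pair(rng, kind="rotation") for _ in range(1000)]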

We use the network presented in the previous section with different sizes for each layer, and we decrease the learning rate during training. We use the Theano⁴

Python library (Bergstra et al., 2010), and our code is an extension of Memisevic's code for gated autoencoders, available online⁵. This code is also used for

⁴ http://deeplearning.net/software/theano/
⁵ http://learning.cs.toronto.edu/~rfm/code/rae/index.html

Table 1. Common parameters for all the experiments, for both CGA and our algorithm.

    Parameter           Value
    Corruption type     Zero mask
    Corruption level    0.3
    Weight decay        0
    Minibatch size      100
    Input size          13 × 13 pixels
    Learning rate       0.005 / max(1, floor(epoch × 0.1))

comparison between our approach and a classical gated autoencoder in Sections 4.3 and 4.4. In this section, we refer to Memisevic's implementation of the classical gated autoencoder as "CGA". Since our network uses softplus units at the mapping layer, instead of the sigmoid units of CGA, we also test a modified version of CGA with softplus units, which we call "CGA-softplus".

We keep some parameters constant across all the experiments; they are given in Table 1. The initial learning rate is chosen such that no instability of the mean reconstruction error is visible during learning. For the other parameters, the values are the defaults in Memisevic's code, except for the corruption level, which we fixed at 30% instead of 50%.

Finally, we recall that the mathematical study is valid only for "constant information" transformations. In practice, this condition is rarely met: for a translation, for instance, pixels at the edges cannot be deduced from the previous image. To be closer to the theoretical conditions, we apply a Gaussian mask on the images when computing the reconstruction error, such that edge pixels have a smaller weight than center ones.
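The Gaussian weighting of the reconstruction error can be implemented with a small helper (our own sketch; the width of the mask is not specified in the text and is an assumption here):

    import numpy as np

    def gaussian_mask(size=13, sigma=4.0):
        """2D Gaussian mask giving a smaller weight to edge pixels than to center ones."""
        c = (size - 1) / 2.0
        i, j = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
        return np.exp(-((i - c) ** 2 + (j - c) ** 2) / (2.0 * sigma ** 2)).ravel()

    mask = gaussian_mask()
    # Weighted version of the reconstruction error of Eq. (9):
    #   error = np.sum(mask * (y - r) ** 2)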

4.2. Experiment 1: Factor layer as spectrum representation of transformations

In this experiment, we want to test the validity of our mathematical approach, to see if the network behavior meets our expectations.

We use a property of orthogonal transformations, namely L L^T = I, i.e. L^-1 = L^T = U conj(D) U^T. Thus, by testing inverse transformations, like rotations of angle +θ and −θ, we expect the corresponding factor values to be conjugate, i.e. their inner products to be the same, while their cross products have opposite signs. Therefore, we train a network with 400 units on the factor layer and 40 units on the mapping layer on rotations, and we compare the factor values for opposite rotations. We average the result over 100 rotations of different images for each angle. As expected, we can see in


Fig. 3 that the values corresponding to the inner products are correlated, while the values corresponding to the cross products are anti-correlated.

Figure 3. Mean correlations of the activation of units in the factor layer for rotations between -50 and 50 degrees (represented by the covariance matrix). Left panel: inner product correlations; right panel: cross product correlations. As expected, the network learns both an inner product part and a cross product part, the former being correlated between opposite rotations, while the latter is anti-correlated.

Our network thus learns a representation consistent with our mathematical description, even if the experimental conditions do not exactly meet our mathematical assumption of orthogonal transformations, given noise and border effects.

4.3. Experiment 2: Filters are shared between commuting transformations

In this section, we want to demonstrate the higher generality of the learned features compared to the CGA described in (Memisevic, 2012a). This relies on the fact that the learned filters are shared between all commuting transformations, whereas in the other approach, each filter is specific to a given transformation.

To do that, we train the same network as above on a sparse subset of rotations, of -50, -40, -30, ..., 40 and 50 degrees. Then, we test the network on every rotation of integer angle between -50 and 50 degrees. As before, we average our results over 100 rotations of different images for each angle.

Fig. 4 represents the distance between mapping layer activations for each rotation. CGA tends to create clusters around the rotations in the training set, whereas our network produces a smoother representation: it interpolates its representations, whereas CGA acts more like a nearest-neighbor algorithm. This effect is less visible for CGA-softplus. This can be explained because sigmoid units tend to force the activation to be 0 or 1 (to be "transformation-specific" detectors), whereas softplus units allow smoother variations. However, the discriminative power of our network is better than that of both versions of CGA: out of the

main diagonal, distances are bigger and the transition is sharper.

Figure 4. Distances between the mean activations of the mapping layer for rotations between -50 and 50 degrees with a network trained with only 11 different rotations (see text for details). Distances have been normalized between 0 and 1. Our network produces a smoother representation of rotations compared to CGA and has a higher discriminative power: the "block" effect around rotations in the training set, which leads the network to assimilate all the rotations in the block, is less visible, and distances out of the main diagonal are bigger.

Figure 5. Activation of mapping units for rotations between -50 and 50 degrees with a network trained on 11 different rotations. Our network learns a simpler representation compared with CGA and CGA-softplus. In particular, one mapping unit specializes for each rotation in the training set.

Mapping layer activations in our network and in CGA are compared in Fig. 5. The representation learned by our network is simpler: one unit specializes for each rotation in the dataset. Since there are only 11 different rotations in the dataset for 40 mapping units, 29 units are not used and their activation stays close to 0. This raises one question: how relevant for generalization is the representation learned by our network?

First, we can assume that the learned weights for each unit correspond to the spectrum of the rotation for which its response is maximal. Then, the network approximates intermediate rotations between two consecutive learned rotations by a linear mixing of their spectra. If the rotations in the training set are not too far from each other, this is a good approximation. Let A and B be two commuting transformations and a and b their eigenvalues corresponding to an eigenvector v. We


consider T, the average transformation between A and B, i.e. T = (AB)^{1/2}, where we define the matrix square root by M^{1/2} M^{1/2} = M. In particular, when M is diagonalizable, M = V D V^-1, then M^{1/2} = V D^{1/2} V^-1 and the eigenvalues of M^{1/2} are the square roots of those of M. In our case, the eigenvalues of T are thus the geometric means of the eigenvalues of A and B: t = sqrt(ab). Finally, approximating the geometric mean by the arithmetic mean is a good approximation when both numbers are not too far from each other (which is the case for the eigenvalues of nearby orthogonal matrices, by continuity): if a = c + δ and b = c − δ, the first-order error is δ²/(2c).

Figure 6. Activation of mapping units for rotations between -50 and 50 degrees. Compared to Fig. 5, there are fewer units than rotations in the training set. In this case, the mapping units span the set of rotations regularly.

Fig. 6 shows what happens when there are fewer mapping units than transformations in the training set. We use the same dataset as above, but with only 5 mapping units. In this case, the mapping units span the set of transformations regularly for the three algorithms: since the resources are very constrained, each unit is forced to represent a different subset of transformations. These results illustrate the fact that our network is easier to tune, since an oversized mapping layer has a lower impact on its generalization capabilities than for CGA.

4.4. Experiment 3: Comparison between CGA and our tied input weights autoencoder

Finally, we compare the performance of our network with that of the one described in (Memisevic, 2012a). We train the networks for 500 epochs on a set of rotations and translations, and measure their reconstruction error on test images spanning the whole set of transformations used for training (10,000 pairs of images, equally distributed). Given the proximity between both algorithms, we compare their performance with the same parameters, in particular the learning rates, the mini-batch size and the input corruption level (see Table 1).

Figure 7. Learning curves for our algorithm and CGA, both trained for 500 epochs, for factor layers of 50, 100, 200, 400, 700 and 800 units (left panel: 20 mapping units; right panel: 100 mapping units). We plot the evolution of the reconstruction error of one algorithm against the other; training proceeds from the upper right corner towards the bottom left corner. For small factor layers, CGA is better (the end point is below the principal axis), but our algorithm converges faster and becomes better for bigger factor layers (see the text for an explanation). However, we must recall that for a given factor layer size, CGA has twice as many input-to-factor weights as our algorithm. When we compare for a given number of weights, our algorithm is better (see Fig. 8).

Fig. 7 shows the learning curves of both algorithms for different sets of parameters. Our algorithm converges faster and has a lower final error than CGA when the number of factor units is large enough. In contrast, CGA is better for small factor layers. This can be surprising given our study, which states that filters are shared between commuting transformations. However, when there are not enough factor units to represent all the eigenvectors of a transformation, this is equivalent to assigning a null eigenvalue to the missing eigenvectors, while the eigenvalues of orthogonal transformations all have modulus 1. Moreover, we must recall that for a given size of the factor layer, CGA has twice as many input-to-factor weights, providing more flexibility. This explains why our algorithm makes worse estimations for small factor layers. On the other hand, we can see that the limit predicted by the mathematical description, which is reached when the size of the factor layer is equal to the number of eigenvectors for every transformation, is confirmed by our experiments. With two classes of transformations, we expect a limit at 2 × 2 × 13² = 676 factor units⁶.

Fig. 8 plots the mean reconstruction error on the test set for both algorithms. We can see that CGA is subject to a severe degradation of performance when the number of factor units increases, which is not the case for our algorithm (the performance only slightly decreases for 100 mapping units). This observation is consistent with the fact that our network generalizes better.

⁶ Each transformation is represented by a 169 × 169 matrix, and has 169 eigenvectors. In our architecture, each eigenvector is represented by two factor units (real and imaginary parts).

Figure 8. Mean reconstruction error on the test set for both algorithms with 50, 100, 200, 400, 700 and 800 factor units (curves for 20 and 100 mapping units, and for CGA-softplus). Left: comparison with the same factor layer size for both algorithms; right: comparison with the same number of input-to-factor weights. For a given number of free parameters, our algorithm outperforms CGA (right).

Figure 9. Features learned at the factor level by our algorithm on a dataset which contains rotations and translations. The network has 700 factor units and 100 mapping units.

In particular, our network is easier to design, since an oversized factor layer has a smaller impact on performance compared with CGA, and there are fewer dependencies between the factor and mapping layer sizes (Fig. 8).

Fig. 9 shows the filters learned by our network. As expected, most filters come in pairs which represent the real and imaginary parts of one complex filter. They are similar to the ones learned by a CGA (see for instance (Memisevic, 2012a)).

5. Conclusion and future work

In this work we presented a new architecture for relational autoencoders, which achieves better performance than previous methods, in particular from the generalization point of view, while having fewer parameters to learn. It relies on a complex-valued reformulation of classical gated networks.

Our network differs from previous gated autoencoders in that both images are projected on the same subspace, through the same matrix of weights U. There is thus only one weight matrix shared between inputs at different timesteps. This makes our network closer to classical recurrent networks than usual gated networks, which consider either a sliding window of a time-dependent input signal or pairs of related inputs, and associate each input with a different weight matrix (Le et al., 2011; Memisevic, 2012a; Susskind et al., 2011; Sutskever & Hinton, 2007; Taylor et al., 2011; Baccouche et al., 2011). This opens the way to a simple recurrent gated autoencoder, where the input consists of only one image. In particular, we believe that this could be useful for motion analysis with a moving camera, a situation often encountered in robotics. Indeed, motions of the camera induce image transformations like translations and rotations, which can be encoded by a network similar to the one described in this paper. A multiplicative interaction at the mapping layer, between visual inputs and an action-based encoding of the camera's motion (or of the gaze position, as in (Larochelle & Hinton, 2010; Denil et al., 2012)), could achieve better performance and robustness in motion recognition.

One of the obvious extensions of our work is to handle more realistic image sizes, for instance by using convolutional techniques, so as to use the network on real videos streamed from cameras. Like other gated networks, the algorithmic complexity of our algorithm is quadratic in the number of pixels. This prevents these algorithms from scaling directly to bigger images, and convolutional techniques are one way to keep the computational complexity under control. Furthermore, convolutional techniques are well suited to handle some invariances, like the position of a moving object in a video, and provide greater factorization possibilities, for instance when several objects are moving in different ways in the same video.

Finally, unsupervised clustering algorithms, such as self-organizing maps, could be used to autonomously cluster the different classes of transformations and extract a more semantic discretization of the observed transformations.


Acknowledgments

This work is supported by the French ANR program (ANR 2010 BLAN 0216 01); more at http://macsi.isir.upmc.fr

References

Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., and Baskurt, A. Sequential Deep Learning for Human Action Recognition. In A.A. Salah, B. L. (ed.), 2nd International Workshop on Human Behavior Understanding (HBU), Lecture Notes in Computer Science, pp. 29–39. Springer, 2011.

Baldi, P. Autoencoders, Unsupervised Learning, and Deep Architectures. Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012.

Baldi, P., Forouzan, S., and Lu, Z. Complex-Valued Autoencoders. Neural Networks, 33:136–147, 2012.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. Theano: a CPU and GPU Math Expression Compiler. In Proceedings of SciPy, 2010.

Bethge, M., Gerwinn, S., and Macke, J. H. Unsupervised learning of a steerable basis for invariant image representations. In Proceedings of SPIE Human Vision and Electronic Imaging, 2007.

Denil, M., Bazzani, L., Larochelle, H., and de Freitas, N. Learning where to Attend with Deep Architectures for Image Tracking. Neural Computation, 2012.

Fitzpatrick, P., Metta, G., Natale, L., Rao, S., and Sandini, G. Learning about objects through action – initial steps towards artificial cognition. In IEEE International Conference on Robotics and Automation (ICRA), pp. 3140–3145, 2003.

Larochelle, H. and Hinton, G. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 1243–1251, 2010.

Le, Q. V., Zou, W. Y., Yeung, S. Y., and Ng, A. Y. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR 2011, pp. 3361–3368. IEEE, June 2011.

Lee, H., Ekanadham, C., and Ng, A. Sparse deep belief net model for visual area V2. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. (eds.), Advances in Neural Information Processing Systems 20, pp. 873–880. MIT Press, Cambridge, MA, 2008.

Memisevic, R. On multi-view feature learning. In ICML, 2012a.

Memisevic, R. Learning to relate images: Mapping units, complex cells and simultaneous eigenspaces. ArXiv e-prints, 2012b.

Memisevic, R. and Hinton, G. Unsupervised Learning of Image Transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.

Memisevic, R. and Hinton, G. E. Learning to Represent Spatial Transformations with Factored Higher-Order Boltzmann Machines. Neural Computation, 22(6):1473–1492, 2010.

Montesano, L., Lopes, M., Bernardino, A., and Santos-Victor, J. Learning object affordances: From sensory-motor coordination to imitation. IEEE Transactions on Robotics, 24(1):15–26, Feb. 2008.

Susskind, J., Memisevic, R., Hinton, G., and Pollefeys, M. Modeling the joint density of two images under a variety of transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011.

Sutskever, I. and Hinton, G. E. Learning Multilevel Distributed Representations for High-dimensional Sequences. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.

Taylor, G. W., Fergus, R., LeCun, Y., and Bregler, C. Convolutional learning of spatio-temporal features. In ECCV'10, pp. 140–153, September 2010.

Taylor, G. W., Hinton, G. E., and Roweis, S. T. Two Distributed-State Models For Generating High-Dimensional Time Series. J. Mach. Learn. Res., 12:1025–1068, 2011.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. J. Mach. Learn. Res., 11:3371–3408, 2010.