
HybridNet: Classification and Reconstruction Cooperation for Semi-Supervised Learning

Thomas Robert1, Nicolas Thome2, and Matthieu Cord1

1 Sorbonne Université, CNRS, LIP6, F-75005 Paris, France
thomas.robert,[email protected]

2 CEDRIC - Conservatoire National des Arts et Métiers, 75003 Paris, [email protected]

Abstract. In this paper, we introduce a new model for leveraging unlabeled data to improve the generalization performance of image classifiers: a two-branch encoder-decoder architecture called HybridNet. The first branch receives a supervision signal and is dedicated to the extraction of invariant class-related representations. The second branch is fully unsupervised and dedicated to modeling the information discarded by the first branch in order to reconstruct input data. To further support the expected behavior of our model, we propose an original training objective. It favors stability in the discriminative branch and complementarity between the learned representations in the two branches. HybridNet is able to outperform state-of-the-art results on CIFAR-10, SVHN and STL-10 in various semi-supervised settings. In addition, visualizations and ablation studies validate our contributions and the behavior of the model on both CIFAR-10 and STL-10 datasets.

Keywords: Deep learning, semi-supervised learning, regularization, reconstruction, invariance and stability, encoder-decoder

1 Introduction

Deep learning and Convolutional Neural Networks (ConvNets) have shown impressive state-of-the-art results in the last years on various visual recognition tasks, e.g. image classification [1,2,3], object localization [4,5,6], image segmentation [7] and even multimodal embedding [8,9,10]. Some key elements are the use of very deep models with a huge number of parameters and the availability of large-scale datasets such as ImageNet. When dealing with smaller datasets, however, the need for proper regularization methods becomes more crucial to control overfitting [11,12,13,14]. An appealing direction to tackle this issue is to take advantage of the large amount of available unlabeled data by developing semi-supervised learning techniques.

Many approaches attempt to design semi-supervised techniques where the unsupervised cost produces encoders that have high data-likelihood or small reconstruction error [15]. This strategy has been followed by historical deep learning approaches [16], but also in some promising recent results with modern ConvNets [17,18]. However, the unsupervised term in reconstruction-based approaches arguably conflicts with the supervised loss, which requires intra-class invariant representations. This is the motivation for designing auto-encoders that are able to discard information, such as the Ladder Networks [19].

Fig. 1. Illustration of HybridNet behavior: the input image $x$ is processed by two network paths of weights $W_c$ and $W_u$; each path produces a partial reconstruction, and both are summed to produce the final reconstruction $\hat{x}$, while only one path is used to produce a classification prediction $\hat{y}$. Thanks to a joint training of both tasks, the weights $W_c$ and $W_u$ influence each other to cooperate

Another interesting regularization criterion relies on stability. Prediction functions which are stable under small input variations are likely to generalize well, especially when training with small amounts of data. Theoretical works have shown the stability properties of some deep models, e.g. by using harmonic analysis for scattering transforms [20,21] or for Convolution Kernel Machines [22]. In addition, recent semi-supervised models incorporate a stability-based regularizer on the prediction [23,24,25].

In this paper, we propose a new approach for regularizing ConvNets using unlabeled data. The behavior of our model, called HybridNet, is illustrated in Fig. 1. It consists of a “hybrid” auto-encoder with the feature extraction path decomposed into two branches.

The top branch encoder, of parameters $W_c$, is connected to a classification layer that produces class predictions, while the decoder from this branch is used to partly reconstruct the input image from the discriminative features, leading to $\hat{x}_c$. Since those features are expected to extract invariant class-specific patterns, information is lost and exact reconstruction is not possible. To complement it, a second encoder-decoder branch of parameters $W_u$ is added to produce a complementary reconstruction $\hat{x}_u$ such that the sum $\hat{x} = \hat{x}_c + \hat{x}_u$ is the final complete reconstruction.

During training, the supervised classification cost impacts $W_c$ while an unsupervised reconstruction cost is applied to both $W_c$ and $W_u$ to properly reconstruct the input image. The main assumption behind HybridNet is that the two-path architecture helps in making classification and reconstruction cooperate. To encourage this, we use additional costs and training techniques, namely a stability regularization in the discriminative branch and a branch complementarity training method.

2 Related Work

Training deep models with relatively small annotated datasets is a crucial issue nowadays. To this end, the design of proper regularization techniques plays a central role. In this paper, we address the problem of taking advantage of unlabeled data for improving generalization performances of deep ConvNets with semi-supervised learning [26].

One standard goal followed when training deep models with unlabeled data consists in designing models which fit input data well. Reconstruction error is the standard criterion used in (possibly denoising) Auto-Encoders [15,27,28,29], while maximum likelihood is used with generative models, e.g. Restricted Boltzmann Machines, Deep Belief Networks or Deep Generative Models [16,30,31,32]. This unsupervised training framework was generally used as a pre-training before supervised learning with back-propagation [33], potentially with an intermediate step [34]. The currently very popular Generative Adversarial Networks [35] also fall into this category. With modern ConvNets, regularization with unlabeled data is generally formulated as a multi-task learning problem, where reconstruction and classification objectives are combined during training [17,18,36]. In these architectures, the encoder used for classification is regularized by a decoder dedicated to reconstruction.

This strategy of classification and reconstruction with an Auto-Encoder is however questionable, since classification and reconstruction may play contradictory roles in terms of feature extraction. Classification arguably aims at extracting invariant class-specific features, improving sample complexity of the learned model [37], therefore inducing an information loss which prevents exact reconstruction. Ladder Networks [19] have historically been designed to overcome the previously mentioned conflict between reconstruction and classification, by designing Auto-Encoders capable of discarding information. Reconstruction is produced using higher-layer representation and a noisy version of the reconstruction target. However, it is not obvious that providing a noisy version of the target and training the network to remove the noise allows the encoder to lose some information since it must be able to correct low-level errors that require details.

Another interesting regularization criterion relies on stability or smoothness of the prediction function, which is at the basis of interesting unsupervised training methods, e.g. Slow Feature Analysis [38]. Adding stability to the prediction function was studied in Adversarial Training [39] for supervised learning and further extended to semi-supervised learning in the Virtual Adversarial Training method [40]. Other recent semi-supervised models incorporate a stability-based regularizer on the prediction. The idea, first introduced by [23], is to make the prediction vector stable toward data augmentation (translation, rotation, shearing, noise, etc.) and model stochasticity (dropout) for a given input. Follow-up work [24,25] slightly improves upon it by proposing variants on the way to compute stability targets to increase their consistency and better adapt to the model's evolution over training.

When using large modern ConvNets, the problem of designing decoders able to invert the encoding is still an open question [41]. The usual solution is to mirror the architecture of the encoder by using transposed convolutions [42]. This problem is exacerbated with irreversible pooling operations such as max-pooling that must be reversed by an upsampling operation. The authors of [17,18] use unpooling operations to bring back spatial information from the encoder to the decoder, reusing pooling switch locations for upsampling. Another interesting option is to explicitly create models which are reversible by design. This is the option followed by recent works such as RevNet [43] and i-RevNet [44], inspired by the second generation of bi-orthogonal multi-resolution analysis and wavelets [45] from the signal processing literature.

To sum up, using reconstruction as a regularization cost added to classification is an appealing idea but the best way to efficiently use it as a regularizer is still an open question. As we have seen, when applied to an auto-encoding architecture [17,18], reconstruction and classification would compete. To overcome the aforementioned issues, we propose HybridNet, a new framework for semi-supervised learning. Presented in Fig. 2, this framework can be seen as an extension of the popular auto-encoding architecture. In HybridNet, the usual auto-encoder that does both classification and reconstruction is assisted by an additional auto-encoder so that the first one is allowed to discard information in order to produce intra-class invariant features while the second one retains the lost information. The combination of both branches then produces the reconstruction. This way, our architecture prevents the conflict between classification and reconstruction and allows the two branches to cooperate and accomplish both classification and reconstruction tasks.

Compared to Ladder Networks [19], our two-branch approach without direct skip connection allows for a finer and learned information separation and is thus expected to have a more favorable impact in terms of discriminative encoder regularization. Our HybridNet model also has conceptual connections with wavelet decomposition [46]: the first branch can be seen as extracting discriminative low-pass features from input images, and the second branch as a high-pass filter that restores the lost information. HybridNet also differs from reversible models [43,44] by the explicit decomposition between supervised and unsupervised signals, enabling the discriminative encoder to have fewer parameters and better sample complexity.

In this paper, our contributions with the HybridNet framework are twofold: first, in Section 3.1, we propose an architecture designed to efficiently allow both reconstruction and classification losses to cooperate; second, in Section 3.2, we design a training loss adapted to it that includes reconstruction, stability in the discriminative branch and a branch complementarity technique. In Section 4, we perform experiments to show that HybridNet is able to outperform state-of-the-art results in various semi-supervised settings on CIFAR-10, SVHN and STL-10. We also provide ablation studies validating the favorable impact of our contributions. Finally, we show several visualizations on CIFAR-10 and STL-10 datasets analogous to Fig. 1 to validate the behavior of both branches, with a discriminative branch that loses information that is restored by the second branch.

Fig. 2. General description of the HybridNet framework. $E_c$ and $C$ correspond to a classifier, $E_c$ and $D_c$ form an autoencoder that we call the discriminative path, and $E_u$ and $D_u$ form a second autoencoder called the unsupervised path. The various loss functions used to train HybridNet ($\mathcal{L}_{\text{cls}}$, $\Omega_{\text{stability}}$, $\mathcal{L}_{\text{rec}}$, $\mathcal{L}_{\text{rec-inter}}$ and branch complementarity) are also represented in yellow

3 HybridNet: a semi-supervised learning framework

In this section, we detail the proposed HybridNet model: the chosen architecture to mix supervised and unsupervised information efficiently in Section 3.1, and the semi-supervised training method adapted to this particular architecture in Section 3.2.

3.1 Designing the HybridNet architecture

General architecture. As we have seen, classification requires intra-class invariant features while reconstruction needs to retain all the information. To circumvent this issue, HybridNet is composed of two auto-encoding paths, the discriminative path ($E_c$ and $D_c$) and the unsupervised path ($E_u$ and $D_u$). Both encoders $E_c$ and $E_u$ take an input image $x$ and produce representations $h_c$ and $h_u$, while decoders $D_c$ and $D_u$ take respectively $h_c$ and $h_u$ as input to produce two partial reconstructions $\hat{x}_c$ and $\hat{x}_u$. Finally, a classifier $C$ produces a class prediction using discriminative features only: $\hat{y} = C(h_c)$. Even if the two paths can have similar architectures, they should play different and complementary roles. The discriminative path must extract discriminative features $h_c$ that should eventually be well crafted to perform a classification task effectively, and produce a purposely partial reconstruction $\hat{x}_c$ that should not be perfect, since preserving all the information is not a behavior we want to encourage. Consequently, the role of the unsupervised path is to be complementary to the discriminative branch by retaining in $h_u$ the information lost in $h_c$. This way, it can produce a complementary reconstruction $\hat{x}_u$ so that, when merging $\hat{x}_u$ and $\hat{x}_c$, the final reconstruction $\hat{x}$ is close to $x$. The HybridNet architecture, shown in Fig. 2, can be described by the following equations:

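\[
\begin{aligned}
h_c &= E_c(x), & \hat{x}_c &= D_c(h_c), & \hat{y} &= C(h_c), \\
h_u &= E_u(x), & \hat{x}_u &= D_u(h_u), & \hat{x} &= \hat{x}_c + \hat{x}_u .
\end{aligned}
\tag{1}
\]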
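As a minimal sketch of Eq. (1), the data flow can be written with placeholder callables. The function name `hybridnet_forward` and the toy encoders, decoders and classifier below are assumptions made for illustration only; they are not the authors' networks, which are ConvNets.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def hybridnet_forward(
    x: Vector,
    enc_c: Callable[[Vector], Vector],
    dec_c: Callable[[Vector], Vector],
    enc_u: Callable[[Vector], Vector],
    dec_u: Callable[[Vector], Vector],
    classifier: Callable[[Vector], Vector],
) -> Tuple[Vector, Vector, Vector, Vector]:
    """Return (y_hat, x_hat_c, x_hat_u, x_hat) following Eq. (1)."""
    h_c = enc_c(x)            # discriminative features
    h_u = enc_u(x)            # complementary features
    x_hat_c = dec_c(h_c)      # partial reconstruction, discriminative path
    x_hat_u = dec_u(h_u)      # partial reconstruction, unsupervised path
    y_hat = classifier(h_c)   # the prediction uses h_c only
    x_hat = [a + b for a, b in zip(x_hat_c, x_hat_u)]  # merge by addition
    return y_hat, x_hat_c, x_hat_u, x_hat


# Hypothetical toy branches: the discriminative path keeps only the mean of
# the input, the unsupervised path keeps the residual around that mean.
def mean_encoder(x: Vector) -> Vector:
    return [sum(x) / len(x)]


def mean_decoder(h: Vector, size: int = 4) -> Vector:
    return [h[0]] * size


def residual_encoder(x: Vector) -> Vector:
    m = sum(x) / len(x)
    return [v - m for v in x]


def identity(h: Vector) -> Vector:
    return list(h)


def sign_classifier(h: Vector) -> Vector:
    return [1.0, 0.0] if h[0] >= 0 else [0.0, 1.0]


x = [1.0, 2.0, 3.0, 6.0]
y_hat, x_hat_c, x_hat_u, x_hat = hybridnet_forward(
    x, mean_encoder, mean_decoder, residual_encoder, identity, sign_classifier
)
```

In this toy setting the discriminative reconstruction is deliberately lossy (a constant vector), and the sum of the two partial reconstructions recovers the input.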

Note that the end-role of reconstruction is just to act as a regularizer for the discriminative encoder. The main challenge and contribution of this paper is to find a way to ensure that the two paths will in fact behave in this desired way. The two main issues that we tackle are the fact that we want the discriminative branch to focus on discriminative features, and that we want both branches to cooperate and contribute to the reconstruction. Indeed, with such an architecture, we could end up with two paths that work independently: a classification path $\hat{y} = C(E_c(x))$ and a reconstruction path $\hat{x} = \hat{x}_u = D_u(E_u(x))$, with $\hat{x}_c = 0$. We address both those issues through the design of the architecture of the encoders and decoders as well as an appropriate loss and training procedure.

Branches design. To design the HybridNet architecture, we start with a convolutional architecture adapted to the targeted dataset, for example a state-of-the-art ResNet architecture for CIFAR-10. This architecture is split into two modules: the discriminative encoder $E_c$ and the classifier $C$. On top of this model, we add the discriminative decoder $D_c$. The location of the splitting point in the original network is free, but $C$ will not be directly affected by the reconstruction loss. In our experiments, we choose $h_c$ ($E_c$'s output) to be the last intermediate representation before the final pooling that aggregates all the spatial information, leaving in $C$ a global average pooling followed by one or more fully-connected layers. The decoder $D_c$ is designed to be a “mirror” of the encoder's architecture, as commonly done in the literature, e.g. [17,19,47].

After constructing the discriminative branch, we add an unsupervised complementary branch. To ensure that both branches are “balanced” and behave in a similar way, the internal architecture of $E_u$ and $D_u$ is mostly the same as for $E_c$ and $D_c$. The only difference lies in the mirroring of pooling layers, which can be reversed either by upsampling or unpooling. An upsampling increases the spatial size of a feature map without any additional information, while an unpooling, used in [17,18], uses spatial information (pooling switches) from the corresponding max-pooling layer to do the upsampling. In our architecture, we propose to use upsampling in the discriminative branch because we want to encourage spatial invariance, and to use unpooling in the unsupervised branch to compensate for this information loss and favor the learning of spatial-dependent low-level information. An example of HybridNet architecture is presented in Fig. 3.
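To make the difference between the two ways of reversing a pooling layer concrete, here is a small one-dimensional sketch in plain Python. The function names are hypothetical, and a real model would apply the 2D equivalents provided by a deep learning framework; the point is only that unpooling reuses the argmax positions (switches) recorded by the encoder, while upsampling does not.

```python
from typing import List, Tuple


def max_pool_1d(x: List[float], size: int = 2) -> Tuple[List[float], List[int]]:
    """Max-pool non-overlapping windows; also return argmax positions (switches)."""
    pooled, switches = [], []
    for start in range(0, len(x) - size + 1, size):
        window = x[start:start + size]
        offset = max(range(size), key=lambda i: window[i])
        pooled.append(window[offset])
        switches.append(start + offset)
    return pooled, switches


def upsample_nearest(pooled: List[float], size: int = 2) -> List[float]:
    """Upsampling: repeat each value; no spatial information from the encoder."""
    return [v for v in pooled for _ in range(size)]


def unpool(pooled: List[float], switches: List[int], length: int) -> List[float]:
    """Unpooling: put each value back at the position recorded by the encoder."""
    out = [0.0] * length
    for value, position in zip(pooled, switches):
        out[position] = value
    return out


signal = [0.1, 0.9, 0.7, 0.2, 0.3, 0.4]
pooled, switches = max_pool_1d(signal)
upsampled = upsample_nearest(pooled)              # as in the discriminative decoder
unpooled = unpool(pooled, switches, len(signal))  # as in the unsupervised decoder
```

The upsampled signal is the same wherever the maxima were located inside each window, which is the spatial invariance sought in the discriminative branch; the unpooled signal keeps the exact locations.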


As mentioned previously, one key problem to tackle is to ensure that this model will behave as expected, i.e. by learning discriminative features in the discriminative encoder and non-discriminative features in the unsupervised one. This is encouraged in different ways by the design of the architecture. First, the fact that only $h_c$ is used for classification means that $E_c$ will be pushed by the classification loss to produce discriminative features. Thus, the unsupervised branch will naturally focus on information lost by $E_c$. Using upsampling in $D_c$ and unpooling in $D_u$ also encourages the unsupervised branch to focus on low-level information. In addition to this, the design of an adapted loss and training protocol is a major contribution to the efficient training of HybridNet.

3.2 Training HybridNet

The HybridNet architecture has two information paths with only one producing a class prediction and both producing partial reconstructions that should be combined. In this section, we will address the question of training this architecture efficiently. The complete loss is composed of various terms as illustrated in Fig. 2. It comprises terms for classification with $\mathcal{L}_{\text{cls}}$; final reconstruction with $\mathcal{L}_{\text{rec}}$; intermediate reconstructions with $\mathcal{L}_{\text{rec-inter}_{b,l}}$ (for layer $l$ and branch $b$); and stability with $\Omega_{\text{stability}}$. It is also accompanied by a branch complementarity training method. Each term is weighted by a corresponding parameter $\lambda$:

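\[
\mathcal{L} = \lambda_c \mathcal{L}_{\text{cls}} + \lambda_r \mathcal{L}_{\text{rec}} + \sum_{b \in \{c,u\},\, l} \lambda_{r_{b,l}} \mathcal{L}_{\text{rec-inter}_{b,l}} + \lambda_s \Omega_{\text{stability}} .
\tag{2}
\]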
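The weighted sum of Eq. (2) can be sketched as a small helper. The dictionary keys and the numeric loss values below are made up for illustration; the weights of 1 and 100 mirror the simple setting reported for the ablation experiments of Section 4.1, and a missing weight is treated as 0, i.e. a disabled term.

```python
from typing import Dict


def total_loss(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of loss terms; a term without a weight entry is disabled."""
    return sum(weights.get(name, 0.0) * value for name, value in terms.items())


# Hypothetical per-batch values of each term of Eq. (2).
terms = {
    "cls": 0.5,
    "rec": 0.25,
    "rec_inter_c_1": 0.125,   # intermediate reconstruction, branch c, layer 1
    "rec_inter_u_1": 0.125,   # intermediate reconstruction, branch u, layer 1
    "stability": 0.01,
}

# Full training loss: every lambda is 1 except the stability weight.
full_weights = {name: 1.0 for name in terms}
full_weights["stability"] = 100.0

# Supervised baseline: only the classification term is active.
baseline_weights = {"cls": 1.0}

full = total_loss(terms, full_weights)
baseline = total_loss(terms, baseline_weights)
```

Disabling a term by dropping its weight is how the ablation baselines can share the same code path as the full model.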

HybridNet can be trained on a partially labeled dataset, i.e. one that is composed of labeled pairs $\mathcal{D}_{\text{sup}} = \{(x^{(k)}, y^{(k)})\}_{k=1..N_s}$ and unlabeled images $\mathcal{D}_{\text{unsup}} = \{x^{(k)}\}_{k=1..N_u}$. Each batch is composed of $n$ samples, divided into $n_s$ image-label pairs from $\mathcal{D}_{\text{sup}}$ and $n_u$ unlabeled images from $\mathcal{D}_{\text{unsup}}$.

Classification. The classification term is a regular cross-entropy term that is applied only on the $n_s$ labeled samples of the batch and averaged over them:

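\[
\ell_{\text{cls}} = \ell_{\text{CE}}(\hat{y}, y) = -\sum_i y_i \log \hat{y}_i , \qquad
\mathcal{L}_{\text{cls}} = \frac{1}{n_s} \sum_k \ell_{\text{cls}}(\hat{y}^{(k)}, y^{(k)}) .
\tag{3}
\]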
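A plain-Python sketch of Eq. (3) is given below. Representing an unlabeled sample by a target of `None` is a convention chosen for this example, not something specified in the paper; the predictions are assumed to be already normalized probability vectors.

```python
import math
from typing import List, Optional, Sequence


def cross_entropy(y_hat: Sequence[float], y: Sequence[float]) -> float:
    """l_CE(y_hat, y) = -sum_i y_i log(y_hat_i), skipping classes with y_i = 0."""
    return -sum(t * math.log(p) for p, t in zip(y_hat, y) if t > 0.0)


def classification_loss(
    predictions: List[Sequence[float]],
    targets: List[Optional[Sequence[float]]],
) -> float:
    """Average cross-entropy over the labeled samples only (target is not None)."""
    labeled = [(p, t) for p, t in zip(predictions, targets) if t is not None]
    if not labeled:
        return 0.0
    return sum(cross_entropy(p, t) for p, t in labeled) / len(labeled)


# A batch of n = 3 samples, of which n_s = 2 are labeled (one-hot targets).
predictions = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
targets = [[1.0, 0.0], [0.0, 1.0], None]
loss_cls = classification_loss(predictions, targets)
```

The third prediction does not influence the result, since the average runs over the two labeled samples only.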

Reconstruction losses. In HybridNet, we chose to keep discriminative and unsupervised paths separate so that they produce two complementary reconstructions ($\hat{x}_u$, $\hat{x}_c$) that we combine with an addition into $\hat{x} = \hat{x}_u + \hat{x}_c$. Keeping the two paths independent until the reconstruction in pixel space, as well as the merge-by-addition strategy, allows us to apply different treatments to them and influence their behavior efficiently. The merge by addition in pixel space is also analogous to wavelet decomposition where the signal is decomposed into low- and high-pass branches that are then decoded and summed in pixel space. The reconstruction loss that we use is a simple mean-squared error between the input and the sum of the partial reconstructions:

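\[
\ell_{\text{rec}} = \|\hat{x} - x\|_2^2 = \|\hat{x}_u + \hat{x}_c - x\|_2^2 , \qquad
\mathcal{L}_{\text{rec}} = \frac{1}{n} \sum_k \ell_{\text{rec}}(\hat{x}^{(k)}, x^{(k)}) .
\tag{4}
\]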
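Eq. (4) can be illustrated with a short plain-Python sketch operating on flattened images. The function names and the toy batch are assumptions for this example.

```python
from typing import List

Vector = List[float]


def rec_loss_sample(x_hat_c: Vector, x_hat_u: Vector, x: Vector) -> float:
    """Squared L2 norm of (x_hat_u + x_hat_c - x) for one sample."""
    return sum((u + c - t) ** 2 for c, u, t in zip(x_hat_c, x_hat_u, x))


def rec_loss_batch(
    batch_c: List[Vector], batch_u: List[Vector], batch_x: List[Vector]
) -> float:
    """Mean of the per-sample losses over the n samples of the batch."""
    losses = [rec_loss_sample(c, u, t) for c, u, t in zip(batch_c, batch_u, batch_x)]
    return sum(losses) / len(losses)


batch_x = [[1.0, 2.0], [0.0, 4.0]]
batch_c = [[1.0, 1.0], [0.0, 3.0]]   # coarse partial reconstructions
batch_u = [[0.0, 1.0], [0.0, 0.0]]   # complementary details
loss_rec = rec_loss_batch(batch_c, batch_u, batch_x)
```

Only the sum of the two partial reconstructions is compared to the input: the first sample is reconstructed exactly even though neither branch matches the input alone.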

In addition to the final reconstruction loss, we also add reconstruction costs between intermediate representations in the encoders and the decoders, which is possible since encoders and decoders have a mirrored structure. We apply these costs to the representations $h_{b,l}$ (for branch $b$ and layer $l$) produced just after pooling layers in the encoders and the reconstructions $\hat{h}_{b,l}$ produced just before the corresponding upsampling or unpooling layers in the decoders. This is common in the literature [17,18,19] but is particularly important in our case: in addition to guiding the model to produce the right final reconstruction, it pushes the discriminative branch to produce a reconstruction and avoids the undesired situation where only the unsupervised branch would contribute to the final reconstruction. This is applied in both branches ($b \in \{c, u\}$):

\[
\mathcal{L}_{\text{rec-inter}_{b,l}} = \frac{1}{n} \sum_k \|\hat{h}^{(k)}_{b,l} - h^{(k)}_{b,l}\|_2^2 .
\tag{5}
\]

Branch cooperation. As described previously, we want to ensure that both branches contribute to the final reconstruction, otherwise this would mean that the reconstruction is not helping to regularize $E_c$, which is our end-goal. Having both branches produce a partial reconstruction and using intermediate reconstructions already helps with this goal. In addition, to balance their training even more, we propose a training technique such that the reconstruction loss is only backpropagated to the branch that contributes less to the final reconstruction of each sample. This is done by comparing $\|\hat{x}_c - x\|_2^2$ and $\|\hat{x}_u - x\|_2^2$ and only applying the final reconstruction loss to the branch with the higher error.

This can be implemented either in the gradient descent step or simply by preventing gradient propagation in one branch or the other using features like tf.stop_gradient in TensorFlow or .detach() in PyTorch:

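\[
\ell_{\text{rec-balanced}} =
\begin{cases}
\|\hat{x}_u + \text{stopgrad}(\hat{x}_c) - x\|_2^2 & \text{if } \|\hat{x}_u - x\|_2^2 \geq \|\hat{x}_c - x\|_2^2 \\
\|\text{stopgrad}(\hat{x}_u) + \hat{x}_c - x\|_2^2 & \text{otherwise}
\end{cases}
\tag{6}
\]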
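The effect of Eq. (6) on the gradients can be sketched without any framework by writing the derivatives by hand. This is an illustrative sketch for a single flattened sample, with hypothetical function names: the gradient of the squared error with respect to the active branch output is twice the residual, while the branch wrapped in stopgrad receives a zero gradient.

```python
from typing import List, Tuple

Vector = List[float]


def sq_norm(v: Vector) -> float:
    """Squared L2 norm of a vector."""
    return sum(a * a for a in v)


def balanced_rec_loss(
    x_hat_c: Vector, x_hat_u: Vector, x: Vector
) -> Tuple[float, Vector, Vector]:
    """Return (loss, grad wrt x_hat_c, grad wrt x_hat_u) for one sample, per Eq. (6)."""
    err_c = sq_norm([c - t for c, t in zip(x_hat_c, x)])
    err_u = sq_norm([u - t for u, t in zip(x_hat_u, x)])
    residual = [c + u - t for c, u, t in zip(x_hat_c, x_hat_u, x)]
    loss = sq_norm(residual)
    grad = [2.0 * r for r in residual]   # gradient for the branch being updated
    zeros = [0.0] * len(x)               # the stopgrad branch gets no gradient
    if err_u >= err_c:
        return loss, zeros, grad         # unsupervised branch has the higher error
    return loss, grad, zeros             # discriminative branch has the higher error


x = [1.0, 2.0]
x_hat_c = [0.5, 1.0]   # error 0.25 + 1.0 = 1.25
x_hat_u = [0.0, 0.5]   # error 1.0 + 2.25 = 3.25
loss_bal, grad_c, grad_u = balanced_rec_loss(x_hat_c, x_hat_u, x)
```

The loss value is the same as in Eq. (4); only the routing of the gradient changes, sample by sample.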

Encouraging invariance in the discriminative branch. We have seen that an important issue that needs to be addressed when training this model is to ensure that the discriminative branch will filter out information and learn invariant features. For now, the only signal that pushes the model to do so is the classification loss. However, in a semi-supervised context, when only a small portion of our dataset is labeled, this signal can be fairly weak and might not be sufficient to make the discriminative encoder focus on invariant features.

In order to further encourage this behavior, we propose to use a stability regularizer. Such a regularizer is currently at the core of the models that give state-of-the-art results in semi-supervised settings on the most common datasets [23,24,25]. The principle is to encourage the classifier's output prediction $\hat{y}^{(k)}$ for sample $k$ to be invariant to different sources of randomness applied on the input (translation, horizontal flip, random noise, etc.) and in the network (e.g. dropout). This is done by minimizing the squared Euclidean distance between the output $\hat{y}^{(k)}$ and a “stability” target $z^{(k)}$. Multiple methods have been proposed to compute such a target [23,24,25], for example by using a second pass of the sample in the network with a different draw of random factors that will therefore produce a different output. We have:

Fig. 3. Example of HybridNet architecture where an original classifier (ConvLarge) constitutes $E_c$ and has been mirrored to create $D_c$ and duplicated for $E_u$ and $D_u$, with the addition of unpooling (using pooling switches) in the unsupervised branch

\[
\Omega_{\text{stability}} = \frac{1}{n} \sum_k \|\hat{y}^{(k)} - z^{(k)}\|_2^2 .
\tag{7}
\]

By applying this loss on $\hat{y}$, we encourage $E_c$ to find invariant patterns in the data, patterns that have more chances of being discriminative and useful for classification. Furthermore, this loss has the advantage of being applicable to both labeled and unlabeled images.

In the experiments, we tried both Temporal Ensembling [24] and Mean Teacher [25] methods and did not see a major difference. In Temporal Ensembling, the target $z^{(k)}$ is a moving average of the predictions $\hat{y}^{(k)}$ over the previous passes of $x^{(k)}$ in the network during training, while in Mean Teacher, $z^{(k)}$ is the output of a secondary model whose weights are a moving average of the weights of the model being trained.
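Both target strategies rely on a moving average, applied either to predictions or to weights. The sketch below shows the stability term of Eq. (7) and a generic exponential moving average in plain Python. The decay value of 0.5 and the function names are assumptions chosen to keep the arithmetic readable; the sketch omits the refinements of the original methods [24,25].

```python
from typing import List

Vector = List[float]


def ema_update(average: Vector, new: Vector, decay: float) -> Vector:
    """Exponential moving average: decay * average + (1 - decay) * new."""
    return [decay * a + (1.0 - decay) * n for a, n in zip(average, new)]


def stability_loss(predictions: List[Vector], targets: List[Vector]) -> float:
    """Eq. (7): mean squared Euclidean distance between predictions and targets."""
    per_sample = [
        sum((p - z) ** 2 for p, z in zip(pred, target))
        for pred, target in zip(predictions, targets)
    ]
    return sum(per_sample) / len(per_sample)


# Mean Teacher style: the secondary model's weights track the trained weights.
teacher_weights = ema_update([0.0, 1.0], [1.0, 1.0], decay=0.5)

# Temporal Ensembling style: the target tracks a sample's past predictions.
target = ema_update([0.5, 0.5], [1.0, 0.0], decay=0.5)

omega = stability_loss(
    predictions=[[0.5, 0.5], [1.0, 0.0]],
    targets=[[0.5, 0.5], [0.0, 1.0]],
)
```

In both cases the target is treated as a constant when computing the loss, so that only the prediction of the model being trained is pulled toward it.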

4 Experiments

In this section, we will study and validate the behavior of our novel framework. We first perform ablation studies to validate the architecture and loss terms of the model. We also propose visualizations of the behavior of the model in various configurations, before demonstrating the capability of HybridNet to obtain state-of-the-art results.

In these experiments, we use three image datasets: SVHN [48], CIFAR-10 [49] and STL-10 [50]. Both SVHN and CIFAR-10 are 10-class datasets of 32 × 32 pixel images. SVHN has 73,257 images for training, 26,032 for testing and 531,131 extra images used only as unlabeled data. CIFAR-10 has 50,000 training images and 10,000 testing images. For our semi-supervised experiments, we only keep N labeled training samples (with N/10 samples per class) while the rest of the data is kept unlabeled, as is commonly done. STL-10 has the same 10 classes as CIFAR-10 with 96 × 96 pixel images. It is designed for semi-supervised learning since it contains 10 folds of 1,000 labeled training images, 100,000 unlabeled training images and 8,000 test images with labels.
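The class-balanced selection of N labeled samples can be sketched as follows. The function name, the fixed seed and the toy label list are assumptions for illustration; the paper does not describe how the labeled subset is drawn beyond keeping N/10 samples per class.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_labeled(
    labels: List[int], n_labeled: int, num_classes: int = 10, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Pick n_labeled / num_classes indices per class; the rest stay unlabeled."""
    per_class = n_labeled // num_classes
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # local generator, so the split is reproducible
    labeled: List[int] = []
    for c in range(num_classes):
        labeled.extend(rng.sample(by_class[c], per_class))

    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled


# Toy dataset: 200 samples with 10 classes in round-robin order.
toy_labels = [i % 10 for i in range(200)]
labeled_idx, unlabeled_idx = split_labeled(toy_labels, n_labeled=40)
per_class_counts = [
    sum(1 for i in labeled_idx if toy_labels[i] == c) for c in range(10)
]
```

With N = 40 on this toy dataset, each class contributes 4 labeled samples and the remaining 160 samples are used without their labels.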

4.1 HybridNet framework validation

First, we propose a thorough analysis of the behavior of our model at two different levels: first by comparing it to baselines that we obtain when disabling parts of the architecture, and second by analyzing the contribution of the different terms of the training loss of HybridNet both quantitatively and through visualizations.

This study was mainly performed using the ConvLarge architecture [19] on CIFAR-10 since it is a very common setup used in recent semi-supervised experiments [23,24,25]. The design of the HybridNet version of this architecture follows Section 3 (illustrated in Fig. 3) and uses Temporal Ensembling to produce stability targets $z$. Additional results are provided using an adapted version of ConvLarge for STL-10 with added blocks of convolutions and pooling.

Models are trained with Adam with a learning rate of 0.003 for 600 epochs with batches of 20 labeled images and 80 unlabeled ones. The various loss-weighting terms $\lambda$ of the general loss (Eq. (2)) could have been optimized on a validation set but for these experiments they were simply set so that the different loss terms have values of the same order of magnitude. Thus, every $\lambda$ was set to 1 when the corresponding term is activated and to 0 otherwise, except $\lambda_s$, which was set to 100 when activated and to 0 otherwise. All the details of the experiments (exact architecture, hyperparameters, optimization, etc.) are provided in the supplementary material.

Ablation study of the architecture. We start this analysis by validating our architecture with an ablation study on CIFAR-10 with various numbers of labeled samples. By disabling parts of the model and training terms, we compare HybridNet to different baselines and validate the importance of combining both contributions of the paper: the architecture and the training method.

Results are presented in Table 1. The classification and auto-encoder results are obtained with the same code and hyperparameters by simply disabling different losses and parts of the model: the classifier only uses $E_c$ and $C$, and the auto-encoder (similar to [17]) only uses $E_c$, $D_c$ and $C$. For both, we can add the stability loss. For HybridNet, the first result (“HybridNet architecture”) only uses the classification and reconstruction loss terms, while the second result uses the full training loss.


Table 1. Ablation study performed on CIFAR-10 with ConvLarge architecture

                                                Labeled samples
Model                                           1000    2000    4000
Classification                                  63.4    71.5    79.0
Classification and stability                    65.6    74.6    81.3
Auto-encoder                                    65.0    73.6    79.8
Auto-encoder and stability                      71.8    80.4    84.9
HybridNet architecture                          63.2    74.0    80.3
HybridNet architecture and full training loss   74.1    81.6    86.6

First, we can see that the HybridNet architecture alone already yields an improvement over the baseline and the auto-encoder, except at 1000 labels. This could be explained by the fact that with very few labels, the model fails to correctly separate the information between the two branches because of the faint classification signal, and the additional loss terms that control the training of HybridNet are even more necessary. Overall, the architecture alone does not provide an important gain since it is not guided to efficiently take advantage of the two branches. Indeed, we see that the addition of the complete HybridNet loss allows the model to provide much stronger results, with an improvement of 6-7 pts over the architecture alone, around 5-6 pts better than the stability or auto-encoding baseline, and 7-10 pts more than the supervised baseline. The most challenging baseline is the stabilized auto-encoder, which manages to take advantage of the stability loss but over which we still improve by 1.2-2.8 pts.

This ablation study demonstrates the capability of the HybridNet framework to surpass the different architectural baselines, and shows the importance of the complementarity between the two-branch architecture and the complete training loss.

Importance of the various loss terms. We now propose a more fine-grained study to look at the importance of each loss term of the HybridNet training described in Section 3.2, both through classification results and visualizations.

First, in Table 2a we show classification accuracy on CIFAR-10 with 2000 labels and STL-10 with 1000 labels for numerous combinations of loss terms. These results demonstrate that each loss term has its importance and that all of them cooperate in order to reach the final best result of the full HybridNet model. In particular, the stability loss is an important element of the training but is not sufficient, as shown by lines b and f-h, while the other terms bring an equivalent gain, as shown by lines c-e. Both of these ∼5 pt gains combine to reach the final score of line i, a ∼10 pt gain.

Table 2. Detailed ablation studies when activating different terms and techniques of the HybridNet learning. These results are obtained with ConvLarge on CIFAR-10 with 2000 labeled samples and ConvLarge-like on STL-10 with 1000 labeled samples

(a) Test accuracy (%). Each line activates a different combination of $\mathcal{L}_{\text{classif}}$, $\Omega_{\text{stability}}$, $\mathcal{L}_{\text{rec}}$ (hybrid), $\mathcal{L}_{\text{rec-inter}}$ and $\mathcal{L}_{\text{rec-balanced}}$ [the per-line activation marks are missing from this text]

Line    CIFAR-10    STL-10
a       71.5        65.6
b       74.6        69.8
c       72.4        67.8
d       74.0        –
e       75.2        –
f       77.7        71.5
g       77.4        –
h       80.8        72.2
i       81.6        74.1

(b) Visualization of partial and combined reconstructions ($x$, $\hat{x}_c$, $\hat{x}_u$, $\hat{x}$) for lines c, d, e and i, with different combinations of $\mathcal{L}_{\text{rec}}$ (hybrid), $\mathcal{L}_{\text{rec-inter}}$, $\mathcal{L}_{\text{rec-balanced}}$ and $\Omega_{\text{stability}}$ [images not reproduced]

Second, to interpret how the branches behave, we propose to visualize the different reconstructions $\hat{x}_c$, $\hat{x}_u$ and $\hat{x}$ for different combinations of loss terms in Table 2b. With only the final reconstruction term (lines c), the discriminative branch does not contribute to the reconstruction and is thus barely regularized by the reconstruction loss, showing little gain over the classification baseline. The addition of the intermediate reconstruction terms helps the discriminative branch to produce a weak reconstruction (lines d) and is complemented by the branch balancing technique (lines e) to produce balanced reconstructions in both branches. The stability loss (lines i) has little visual impact on $\hat{x}_c$; it probably has more impact on the quality of the latent representation $h_c$ and seems to help in making the discriminative features and classifier more robust, with a large improvement of the accuracy.

Visualization of information separation on CIFAR-10 and STL-10. Overall, we can see in Table 2b lines i that, thanks to the full HybridNet training loss, the information is correctly separated between $\hat{x}_c$ and $\hat{x}_u$, which both contribute somewhat equally while specializing on different types of information. For example, for the blue car, $\hat{x}_c$ produces a blurry car with approximate colors, while $\hat{x}_u$ provides both shape details and exact color information. For nicer visualizations, we also show in Fig. 4 reconstructions of the full HybridNet model trained on STL-10, which has larger images. These confirm the observations on CIFAR-10 with a very good final reconstruction composed of a rough reconstruction that lacks texture and color details from the discriminative branch, completed by low-level details of shape, texture, writings, color correction and background information from the unsupervised branch.


[Figure 4: for each STL-10 image, Input (x), then Discr. (xc) + Unsup. (xu) = Rec.]

Fig. 4. Visualizations of input, partial and final reconstructions of STL-10 images using a HybridNet model derived from a ConvLarge-like architecture

4.2 State-of-the-art comparison

After studying the behavior of this novel architecture, we demonstrate its effectiveness and its capability to produce state-of-the-art results for semi-supervised learning on three datasets: SVHN, CIFAR-10 and STL-10.

We use ResNet architectures to constitute the supervised encoder Ec and classifier C, and augment them with a mirror decoder Dc and an unsupervised second branch containing an encoder Eu and a decoder Du using the same architecture. For SVHN and CIFAR-10, we use the small ResNet from [51], which is used in Mean Teacher [25] and currently achieves state-of-the-art results on CIFAR-10. For STL-10, we upscale the images to 224×224 px and use a regular ResNet-50 pretrained on the Places dataset.
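The wiring of the two branches can be sketched with tiny linear maps standing in for the ResNet encoders and mirror decoders. Everything in this block is an illustrative assumption (the class name `ToyHybridNet`, the layer sizes, the tanh non-linearity, the flattened input); only the roles of Ec, C, Dc, Eu and Du, and the sum of the two partial reconstructions suggested by Fig. 4, come from the text.

```python
import numpy as np


class ToyHybridNet:
    """Two-branch encoder-decoder wiring with linear stand-ins for ResNets."""

    def __init__(self, dim_in, dim_hc, dim_hu, n_classes, seed=0):
        rng = np.random.default_rng(seed)

        def init(rows, cols):
            return rng.normal(scale=0.1, size=(rows, cols))

        self.E_c = init(dim_in, dim_hc)    # supervised encoder Ec
        self.C = init(dim_hc, n_classes)   # classifier C on top of hc
        self.D_c = init(dim_hc, dim_in)    # mirror decoder Dc
        self.E_u = init(dim_in, dim_hu)    # unsupervised encoder Eu
        self.D_u = init(dim_hu, dim_in)    # unsupervised decoder Du

    def forward(self, x):
        h_c = np.tanh(x @ self.E_c)        # discriminative representation
        h_u = np.tanh(x @ self.E_u)        # complementary representation
        logits = h_c @ self.C              # class scores use hc only
        x_c = h_c @ self.D_c               # partial reconstruction (discr.)
        x_u = h_u @ self.D_u               # partial reconstruction (unsup.)
        return logits, x_c, x_u, x_c + x_u


# Usage on a batch of 4 flattened 3x32x32 images (sizes are arbitrary).
net = ToyHybridNet(dim_in=3 * 32 * 32, dim_hc=64, dim_hu=64, n_classes=10)
batch = np.random.default_rng(1).random((4, 3 * 32 * 32))
logits, x_c, x_u, x_hat = net.forward(batch)
print(logits.shape, x_hat.shape)
```

Note that the classifier only sees the discriminative representation, so the unsupervised branch can influence the prediction only through the shared reconstruction objective.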

We trained HybridNet with the training method described in Section 3.2, including Mean Teacher to produce stability targets z(k). The training protocol follows exactly the protocol of Mean Teacher [25] for CIFAR-10, and a similar one for SVHN and STL-10, for which [25] does not report results with ResNet. The hyperparameters added in HybridNet, i.e. the weights of the reconstruction terms (final and intermediate), were coarsely adjusted on a validation set (we tried values 0.25, 0.5 and 1.0 for both of them). Details are in the supplementary material.
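Two ingredients of this protocol are easy to sketch in plain Python. Mean Teacher [25] maintains teacher weights as an exponential moving average of the student weights; the decay value used below is an assumption, not a setting taken from this paper. The grid assumes that the two reconstruction weights were varied jointly over the three values quoted above, which the text does not state explicitly.

```python
from itertools import product


def ema_update(teacher, student, decay=0.99):
    """Exponential moving average of weights, as used by Mean Teacher.

    teacher, student: dicts mapping parameter names to floats.
    decay: assumed value; higher means a slower-moving teacher.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Candidate values quoted in the text for both reconstruction weights.
CANDIDATES = (0.25, 0.5, 1.0)

# Assumption: every (final, intermediate) pair is evaluated on validation.
weight_grid = list(product(CANDIDATES, repeat=2))

teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, decay=0.9)
print(teacher)            # teacher moved 10% of the way toward the student
print(len(weight_grid))   # number of configurations in the coarse grid
```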

The results of these experiments are presented in Table 3. We can see the huge performance boost obtained by HybridNet compared to the ResNet baselines, in particular on CIFAR-10 with 1000 labels, where the error rate goes from 45.2% to 8.81%, which demonstrates the large benefit of our regularizer. HybridNet also improves over the strong Mean Teacher baseline [25], with an improvement of 1.29 pt with 1000 labeled samples on CIFAR-10, and 0.9 pt on STL-10. We also significantly improve over other stability-based approaches [23,24], and over the Ladder Networks [19] and GAN-based techniques [52,53].

Table 3. Results (error rate, %) on CIFAR-10, STL-10 and SVHN using a ResNet-based HybridNet. “Mean Teacher ResNet” is our classification & stability baseline; results marked with ∗ are not reported in the original paper and were obtained ourselves

Dataset CIFAR-10 SVHN STL-10

Nb. labeled images 1000 2000 4000 500 1000 1000

SWWAE [17] 23.56 25.67
Ladder Network [19] 20.40
Improved GAN [53] 21.83 19.61 18.63 18.44 8.11
CatGAN [52] 19.58
Stability regularization [23] 11.29 6.03
Temporal Ensembling [24] 12.16 5.12 4.42
Mean Teacher ConvLarge [25] 21.55 15.73 12.31 4.18 3.95
Mean Teacher ResNet [25] 10.10 6.28 ∗2.33 ∗2.05 ∗16.8

ResNet baseline [51] 45.2 24.3 15.45 12.27 9.56 18.0
HybridNet [ours] 8.81 7.87 6.09 1.85 1.80 15.9

These results demonstrate that HybridNet can be applied to large residual architectures, which are very common nowadays, and can improve over baselines that already provide very good performance.
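The improvements quoted in this section follow from simple differences of error rates. The sketch below reproduces them from the values given in the text and in Table 3; the relative reduction is an additional illustrative quantity that the paper does not report, and the function names are hypothetical.

```python
# Error rates (%) with 1000 labeled images, taken from Table 3 and the text.
ERROR = {
    "CIFAR-10": {
        "ResNet baseline": 45.2,
        "Mean Teacher ResNet": 10.10,
        "HybridNet": 8.81,
    },
    "STL-10": {
        "ResNet baseline": 18.0,
        "Mean Teacher ResNet": 16.8,
        "HybridNet": 15.9,
    },
}


def absolute_gain(dataset, reference, method="HybridNet"):
    """Error reduction in points of `method` with respect to `reference`."""
    rates = ERROR[dataset]
    return round(rates[reference] - rates[method], 2)


def relative_gain(dataset, reference, method="HybridNet"):
    """Error reduction as a percentage of the reference error rate."""
    rates = ERROR[dataset]
    return 100.0 * (rates[reference] - rates[method]) / rates[reference]


for dataset in ERROR:
    for reference in ("ResNet baseline", "Mean Teacher ResNet"):
        print(dataset, "vs", reference,
              absolute_gain(dataset, reference), "pt,",
              round(relative_gain(dataset, reference), 1), "% relative")
```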

5 Conclusion

In this paper, we described a novel semi-supervised framework called HybridNet that proposes an auto-encoder-based architecture with two distinct paths that separate the discriminative information useful for classification from the remaining information that is only useful for reconstruction. This architecture is accompanied by a loss and training technique that allows the architecture to behave in the desired way. In the experiments, we validate the significant performance boost brought by HybridNet in comparison with several other common architectures that use reconstruction losses and stability. We also show that HybridNet is able to produce state-of-the-art results on multiple datasets.

With two latent representations that explicitly encode classification information on one side and the remaining information on the other side, our model may be seen as a competitor to the recently proposed fully reversible RevNet models, which implicitly encode both types of information. We plan to further explore the relationships between these approaches.

Acknowledgements. This work was funded by grant DeepVision (ANR-15-CE23-0029-02, STPGP-479356-15), a joint French/Canadian call by ANR & NSERC.


References

1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS). (2012)

2. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)

3. Durand, T., Mordan, T., Thome, N., Cord, M.: WILDCAT: Weakly Supervised Learning of Deep ConvNets for Image Classification, Pointwise Localization and Segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)

4. Dai, J., Li, Y., He, K., Sun, J.: R-fcn: Object detection via region-based fully convolutional networks. In: Advances in Neural Information Processing Systems (NIPS). (2016)

5. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)

6. Mordan, T., Thome, N., Cord, M., Henaff, G.: Deformable Part-based Fully Convolutional Network for Object Detection. In: British Machine Vision Conference (BMVC). (2017)

7. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2018)

8. Engilberge, M., Chevallier, L., Perez, P., Cord, M.: Finding beans in burgers: Deep semantic-visual embedding with localization. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

9. Carvalho, M., Cadene, R., Picard, D., Soulier, L., Thome, N., Cord, M.: Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings. Special Interest Group on Information Retrieval (SIGIR) (2018)

10. Ben-Younes, H., Cadene, R., Thome, N., Cord, M.: Mutan: Multimodal tucker fusion for visual question answering. IEEE International Conference on Computer Vision (ICCV) (2017)

11. Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. In: Advances in Neural Information Processing Systems (NIPS). (1992)

12. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. Journal of Machine Learning Research (JMLR) (2016)

13. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR) (2014)

14. Blot, M., Robert, T., Thome, N., Cord, M.: Shade: Information-based regularization for deep learning. In: IEEE International Conference on Image Processing (ICIP). (2018)

15. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: Advances in Neural Information Processing Systems (NIPS). (2007)

16. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science (2006)


17. Zhao, J., Mathieu, M., Goroshin, R., LeCun, Y.: Stacked What-Where Auto-encoders. In: International Conference on Learning Representations Workshop (ICLR-W). (2016)

18. Zhang, Y., Lee, K., Lee, H.: Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In: International Conference on Machine Learning (ICML). (2016)

19. Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T.: Semi-supervised learning with ladder networks. In: Advances in Neural Information Processing Systems (NIPS). (2015)

20. Mallat, S.: Group invariant scattering. Communications on Pure and Applied Mathematics (CPAM) (2012)

21. Bruna, J., Mallat, S.: Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2013)

22. Bietti, A., Mairal, J.: Group Invariance, Stability to Deformations, and Complexity of Deep Convolutional Representations. In: Advances in Neural Information Processing Systems (NIPS). (2017)

23. Sajjadi, M., Javanmardi, M., Tasdizen, T.: Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: Advances in Neural Information Processing Systems (NIPS). (2016)

24. Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. In: International Conference on Learning Representations (ICLR). (2017)

25. Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in Neural Information Processing Systems (NIPS). (2017)

26. Zhu, X.: Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison (2005)

27. Ranzato, M., Szummer, M.: Semi-supervised learning of compact document representations with deep networks. In: International Conference on Machine Learning (ICML). (2008)

28. Ranzato, M., Huang, F.J., Boureau, Y.L., LeCun, Y.: Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2007)

29. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: International Conference on Machine Learning (ICML). (2008)

30. Ranzato, M., Poultney, C., Chopra, S., Lecun, Y.: Efficient learning of sparse representations with an energy-based model. In: Advances in Neural Information Processing Systems (NIPS). (2007)

31. Larochelle, H., Bengio, Y.: Classification using discriminative restricted boltzmann machines. In: International Conference on Machine Learning (ICML). (2008)

32. Kingma, D.P., Mohamed, S., Rezende, D.J., Welling, M.: Semi-supervised learning with deep generative models. In: Advances in Neural Information Processing Systems (NIPS). (2014)

33. Erhan, D., Bengio, Y., Courville, A., Manzagol, P.A., Vincent, P., Bengio, S.: Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research (JMLR) (2010)

34. Goh, H., Thome, N., Cord, M., Lim, J.H.: Top-down regularization of deep belief networks. In: Advances in Neural Information Processing Systems (NIPS). (2013)

35. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (NIPS). (2014)


36. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I.: Adversarial autoencoders. In: International Conference on Learning Representations (ICLR). (2016)

37. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer (2009)

38. Theriault, C., Thome, N., Cord, M.: Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2013)

39. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (ICLR). (2015)

40. Miyato, T., Maeda, S.i., Koyama, M., Nakae, K., Ishii, S.: Distributional smoothing with virtual adversarial training. In: International Conference on Learning Representations (ICLR). (2016)

41. Wojna, Z., Ferrari, V., Guadarrama, S., Silberman, N., Chen, L.C., Fathi, A., Uijlings, J.: The devil is in the decoder. In: British Machine Vision Conference (BMVC). (2017)

42. Dumoulin, V., Visin, F.: A guide to convolution arithmetic for deep learning. Technical Report (2016)

43. Gomez, A.N., Ren, M., Urtasun, R., Grosse, R.B.: The reversible residual network: Backpropagation without storing activations. In: Advances in Neural Information Processing Systems (NIPS). (2017)

44. Jacobsen, J.H., Smeulders, A., Oyallon, E.: i-RevNet: Deep Invertible Networks. In: International Conference on Learning Representations (ICLR). (2018)

45. Sweldens, W.: The lifting scheme: A new philosophy in biorthogonal wavelet constructions. In: Wavelet Applications in Signal and Image Processing III. (1995)

46. Mallat, S.G., Peyre, G.: A wavelet tour of signal processing: the sparse way. Academic Press (2009)

47. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision (ECCV). (2014)

48. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS workshop on deep learning and unsupervised feature learning. (2011)

49. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical Report (2009)

50. Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: International Conference on Artificial Intelligence and Statistics (AISTATS). (2011)

51. Gastaldi, X.: Shake-shake regularization of 3-branch residual networks. In: International Conference on Learning Representations Workshop (ICLR-W). (2017)

52. Springenberg, J.T.: Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks. In: International Conference on Learning Representations (ICLR). (2016)

53. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., Chen, X.: Improved techniques for training gans. In Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., eds.: Advances in Neural Information Processing Systems (NIPS). Curran Associates, Inc. (2016)
