Adapting Deep Networks Across Domains, Modalities, and Tasks
Judy Hoffman
Eric Tzeng Saurabh Gupta Kate Saenko Trevor Darrell
Recent Visual Recognition Progress
[Chart: ImageNet performance, accuracy on a 70–100% scale over 2010–2014, with the arrival of deep models annotated.]
Deep Visual Models
Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.
... neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5×5×48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3×3×256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3×3×192, and the fifth convolutional layer has 256 kernels of size 3×3×192. The fully-connected layers have 4096 neurons each.
4 Reducing Overfitting
Our neural network architecture has 60 million parameters. Although the 1000 classes of ILSVRC make each training example impose 10 bits of constraint on the mapping from image to label, this turns out to be insufficient to learn so many parameters without considerable overfitting. Below, we describe the two primary ways in which we combat overfitting.
4.1 Data Augmentation
The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations (e.g., [25, 4, 5]). We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk. In our implementation, the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images. So these data augmentation schemes are, in effect, computationally free.
The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224×224 patches (and their horizontal reflections) from the 256×256 images and training our network on these extracted patches [4]. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224×224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network's softmax layer on the ten patches.
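As an illustration only (not the paper's released code), the crop/flip scheme and the ten-crop test-time averaging might look like this in NumPy, where `predict_fn` is a hypothetical model call:

```python
# Sketch of the crop/flip augmentation described above: random 224x224
# crops plus horizontal reflections at train time (32*32*2 = 2048
# variants per image), and ten-crop averaging at test time.
import numpy as np

def random_crop_flip(img):
    """img: 256x256x3 array. Returns one augmented 224x224 patch."""
    y = np.random.randint(0, img.shape[0] - 224 + 1)
    x = np.random.randint(0, img.shape[1] - 224 + 1)
    patch = img[y:y + 224, x:x + 224]
    if np.random.rand() < 0.5:          # horizontal reflection
        patch = patch[:, ::-1]
    return patch

def ten_crop_predict(img, predict_fn):
    """Average softmax predictions over 4 corners + center and their flips."""
    h, w = img.shape[:2]
    offsets = [(0, 0), (0, w - 224), (h - 224, 0), (h - 224, w - 224),
               ((h - 224) // 2, (w - 224) // 2)]
    crops = [img[y:y + 224, x:x + 224] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]  # add horizontal reflections
    return np.mean([predict_fn(c) for c in crops], axis=0)
```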
The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1.
[4] This is the reason why the input images in Figure 2 are 224×224×3-dimensional.
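A hedged NumPy sketch of this PCA color scheme; the standard deviation 0.1 follows the paper's description, everything else (names, shapes) is illustrative:

```python
# PCA ("fancy PCA") color augmentation: fit a 3x3 PCA over all RGB pixel
# values, then perturb each image along the principal components.
import numpy as np

def fit_rgb_pca(images):
    """images: (N, H, W, 3) array of training images."""
    pixels = images.reshape(-1, 3).astype(np.float64)
    pixels -= pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)        # 3x3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvals, eigvecs

def pca_color_augment(img, eigvals, eigvecs, sigma=0.1):
    # Add multiples of the principal components, with magnitudes
    # proportional to the eigenvalues times N(0, sigma) draws.
    alphas = np.random.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alphas * eigvals)      # (3,) RGB offset
    return img + shift                        # broadcast over all pixels
```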
[Krizhevsky 2012]
[LeCun 89, 98]
[Simonyan 2014] [Szegedy 2014]
[Hubel and Wiesel 59]
Domain Adaptation: Train on source, adapt to target

Source Domain: lots of labeled data (backpack, chair, bike)
$D_S = \{(x_i, y_i)\ \forall i \in \{1, \dots, N\}\} \sim P_S(X, Y)$

Target Domain: unlabeled or limited labels (bike?)
$D_T = \{(z_j, \cdot)\ \forall j \in \{1, \dots, M\}\} \sim P_T(Z, H)$

Adapt the source-trained model to the target domain.
Prior work: domain adaptation
• Minimizing distribution distance
  Borgwardt '06, Mansour '09, Pan '09, Fernando '13
• Deep model adaptation
  Chopra '13, Tzeng '14, Long '15, Ganin '15
Adapting across domains: minimize discrepancy  [ICCV 2015]

A shared representation $\theta_{repr}$ feeds both the object classifier $\theta_C$ (trained on labeled source examples $x_i$) and a domain classifier $\theta_D$. The domain classifier is trained to tell the domains apart using the true domain labels $q^s_i = [1, 0]$ for source and $q^t_j = [0, 1]$ for target; the representation is in turn trained to confuse it, by pushing its predicted domain distributions toward the uniform distribution (cross-entropy with the uniform distribution):

$$\min_{\theta_{repr}} \mathcal{L}_{conf}(x, z, \theta_D; \theta_{repr}) = \sum_{x_i \in S} H(U(D), q^s_i) + \sum_{z_j \in T} H(U(D), q^t_j)$$

Here $H$ denotes cross-entropy, $U(D)$ is the uniform distribution over the $D$ domains, and inside $\mathcal{L}_{conf}$ the $q^s_i$, $q^t_j$ are the domain classifier's predicted domain distributions for source example $x_i$ and target example $z_j$.
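As a minimal PyTorch sketch of the two sides of this objective (my own illustration, not the authors' code; `domain_head` and the feature tensors are placeholders):

```python
# Sketch: domain classifier loss (trains theta_D) and domain confusion
# loss (trains theta_repr). Source is domain 0 (q_s = [1, 0]), target is
# domain 1 (q_t = [0, 1]); here D = 2 domains.
import torch
import torch.nn.functional as F

def domain_classifier_loss(domain_head, f_src, f_tgt):
    feats = torch.cat([f_src, f_tgt]).detach()  # freeze the representation
    logits = domain_head(feats)
    labels = torch.cat([
        torch.zeros(len(f_src), dtype=torch.long, device=feats.device),
        torch.ones(len(f_tgt), dtype=torch.long, device=feats.device),
    ])
    return F.cross_entropy(logits, labels)

def domain_confusion_loss(domain_head, f_src, f_tgt):
    # H(U(D), q): cross-entropy between the uniform distribution over
    # domains and the classifier's prediction, averaged over the batch.
    log_q = F.log_softmax(domain_head(torch.cat([f_src, f_tgt])), dim=1)
    return -log_q.mean(dim=1).mean()
```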
Adapting across domains: minimize discrepancy

Architecture (built up in stages):
• Source data (backpack, chair, bike) passes through conv1–conv5, fc6, fc7, fc8 with a classification loss.
• Labeled target data passes through a second stream whose layers are shared with the source stream.
• All target and source data additionally feed a domain classifier head (fcD), trained with the domain classifier loss, while the shared representation is trained with the domain confusion loss.
Verify confusion
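To verify confusion in practice, one can watch the domain classifier's accuracy fall toward chance. A sketch of the alternating optimization the slides imply, reusing the two loss helpers above (the nets, optimizers, and loaders are placeholders, and `lam` is an illustrative weight):

```python
# Alternate: (1) fit fcD to the domain labels; (2) update the shared
# representation and object classifier against classification + confusion.
import torch
import torch.nn.functional as F

lam = 0.1  # confusion-loss weight (hyperparameter, illustrative value)

for (x_s, y_s), x_t in zip(source_loader, target_loader):
    f_s, f_t = net(x_s), net(x_t)            # shared conv1-fc7 features

    # (1) domain classifier step: only domain_head parameters move
    d_loss = domain_classifier_loss(domain_head, f_s, f_t)
    opt_domain.zero_grad(); d_loss.backward(); opt_domain.step()

    # (2) representation step: opt_repr holds net + classifier parameters
    loss = F.cross_entropy(classifier(f_s), y_s) \
         + lam * domain_confusion_loss(domain_head, f_s, f_t)
    opt_repr.zero_grad(); loss.backward(); opt_repr.step()
```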
Standard supervised deep loss  [ICCV 2015]

$H(h, p) = \mathbb{E}_h[-\log p]$

Cross-entropy between the ground-truth label distribution $h_j$ (a one-hot vector, e.g. "bottle" over the classes Bottle, Mug, Chair, Laptop, Keyboard) and the network's predicted distribution $p_j$ from the fc8 layer of the conv1–fc8 network.
Source soft labels  [ICCV 2015]

Pass the source examples of each class (e.g. all source bottle examples) through the source CNN, apply a high-temperature softmax to each output, and average the resulting distributions over the classes (Bottle, Mug, Chair, Laptop, Keyboard) to obtain that class's soft label.
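A small sketch of this soft label computation, assuming a trained `source_cnn` and a tensor `examples` holding one class's source images (e.g. all source bottles); the temperature value is illustrative:

```python
# Average the high-temperature softmax of the source CNN over all source
# examples of one class to get that class's soft label.
import torch
import torch.nn.functional as F

def class_soft_label(source_cnn, examples, temperature=2.0):
    with torch.no_grad():
        logits = source_cnn(examples)                   # (N, K) class scores
        soft = F.softmax(logits / temperature, dim=1)   # softened per example
    return soft.mean(dim=0)                             # (K,) soft label
```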
Class correlation transfer loss  [ICCV 2015]

$H(h, p) = \mathbb{E}_h[-\log p]$

The same cross-entropy form, but the hard one-hot target $h_j$ is replaced by the source soft label for the class (e.g. "bottle"), so training the target network transfers the inter-class correlations captured by the source model.
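A matching sketch of the transfer loss, where `soft_labels` is a (K, K) matrix whose row k stores the source soft label for class k (the names are mine, not the paper's):

```python
# Cross-entropy H(h, p) with the one-hot target replaced by the source
# soft label of the example's class.
import torch
import torch.nn.functional as F

def softlabel_loss(logits, labels, soft_labels):
    log_p = F.log_softmax(logits, dim=1)   # (B, K) predicted log-probs
    h = soft_labels[labels]                # (B, K) soft targets per example
    return -(h * log_p).sum(dim=1).mean()  # E_h[-log p], batch-averaged
```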
Class correlation transfer loss: full architecture

• Source data and labeled target data pass through shared conv1–conv5, fc6, fc7, fc8 layers with a classification loss.
• All target and source data feed the fcD domain classifier head, trained with the domain classifier loss while the representation minimizes the domain confusion loss.
• Source soft labels, computed with the high-temperature softmax, supervise the target stream through the softlabel loss.
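In the ICCV 2015 paper these three pieces combine into a single training objective; as a sketch, with $\lambda$ and $\nu$ hyperparameters weighting the confusion and softlabel terms:

$$\min_{\theta_{repr},\, \theta_C} \; \mathcal{L}_C(x_S, y_S, x_T, y_T; \theta_{repr}, \theta_C) \;+\; \lambda\, \mathcal{L}_{conf}(x_S, x_T, \theta_D; \theta_{repr}) \;+\; \nu\, \mathcal{L}_{soft}(x_T, y_T; \theta_{repr}, \theta_C)$$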
Office dataset experiment  (from Saenko et al., "Adapting Visual Category Models to New Domains")

[Fig. 4: 31 categories (keyboard, headphones, file cabinet, laptop, letter tray, ...) across 3 domains (amazon, dslr, webcam), with several instances per category. New dataset for investigating domain shifts in visual category recognition tasks. Images of objects from 31 categories are downloaded from the web as well as captured by a high definition and a low definition camera.]

... popular way to acquire data, as it allows for easy access to large amounts of data that lends itself to learning category models. These images are of products shot at medium resolution typically taken in an environment with studio lighting conditions. We collected two datasets: amazon contains 31 categories [4] with an average of 90 images each. The images capture the large intra-class variation of these categories, but typically show the objects only from a canonical viewpoint. amazonINS contains 17 object instances (e.g. can of Taster's Choice instant coffee) with an average of two images each.

Images from a digital SLR camera: The second domain consists of images that are captured with a digital SLR camera in realistic environments with natural lighting conditions. The images have high resolution (4288x2848) and low noise. We have recorded two datasets: dslr has images of the 31 object categories ...

[4] The 31 categories in the database are: backpack, bike, bike helmet, bookcase, bottle, calculator, desk chair, desk lamp, computer, file cabinet, headphones, keyboard, laptop, letter tray, mobile phone, monitor, mouse, mug, notebook, pen, phone, printer, projector, puncher, ring binder, ruler, scissors, speaker, stapler, tape, and trash can.
• all classes have source labeled examples
• 15 classes have target labeled examples
• evaluate on remaining 16 classes
[Saenko '10]
Method                            A→W         A→D         D→A         D→W         W→A         W→D         Average
DLID [7]                          51.9        –           –           78.2        –           89.9        –
DeCAF6 S+T [9]                    80.7 ± 2.3  –           –           94.8 ± 1.2  –           –           –
DaNN [13]                         53.6 ± 0.2  –           –           71.2 ± 0.0  –           83.5 ± 0.0  –
Source CNN                        56.5 ± 0.3  64.6 ± 0.4  47.6 ± 0.1  92.4 ± 0.3  42.7 ± 0.1  93.6 ± 0.2  66.22
Target CNN                        80.5 ± 0.5  81.8 ± 1.0  59.9 ± 0.3  80.5 ± 0.5  59.9 ± 0.3  81.8 ± 1.0  74.05
Source+Target CNN                 82.5 ± 0.9  85.2 ± 1.1  65.8 ± 0.5  93.9 ± 0.5  65.2 ± 0.7  96.3 ± 0.5  81.50
Ours: dom confusion only          82.8 ± 0.9  85.9 ± 1.1  66.2 ± 0.4  95.6 ± 0.4  64.9 ± 0.5  97.5 ± 0.2  82.13
Ours: soft labels only            82.7 ± 0.7  84.9 ± 1.2  66.0 ± 0.5  95.9 ± 0.6  65.2 ± 0.6  98.3 ± 0.3  82.17
Ours: dom confusion+soft labels   82.7 ± 0.8  86.1 ± 1.2  66.2 ± 0.3  95.7 ± 0.5  65.0 ± 0.5  97.6 ± 0.2  82.22

Table 1. Multi-class accuracy evaluation on the standard supervised adaptation setting with the Office dataset. We evaluate on all 31 categories using the standard experimental protocol from [28]. Here, we compare against three state-of-the-art domain adaptation methods as well as a CNN trained using only source data, only target data, or both source and target data together.
Method                            A→W         A→D         D→A         D→W         W→A         W→D         Average
MMDT [18]                         –           44.6 ± 0.3  –           –           –           58.3 ± 0.5  –
Source CNN                        54.2 ± 0.6  63.2 ± 0.4  36.4 ± 0.1  89.3 ± 0.5  34.7 ± 0.1  94.5 ± 0.2  62.0
Ours: dom confusion only          55.2 ± 0.6  63.7 ± 0.9  41.2 ± 0.1  91.3 ± 0.4  41.1 ± 0.0  96.5 ± 0.1  64.8
Ours: soft labels only            56.8 ± 0.4  65.2 ± 0.9  41.7 ± 0.3  89.6 ± 0.1  38.8 ± 0.4  96.5 ± 0.2  64.8
Ours: dom confusion+soft labels   59.3 ± 0.6  68.0 ± 0.5  43.1 ± 0.2  90.0 ± 0.2  40.5 ± 0.2  97.5 ± 0.1  66.4

Table 2. Multi-class accuracy evaluation on the standard semi-supervised adaptation setting with the Office dataset. We evaluate on 16 held-out categories for which we have no access to target labeled data. We show results on these unsupervised categories for the source only model, our model trained using only soft labels for the 15 auxiliary categories, and finally using domain confusion together with soft labels on the 15 auxiliary categories.
... target domain. We report accuracies on the remaining unlabeled images, following the standard protocol introduced with the dataset [28]. In addition to a variety of baselines, we report numbers for both soft label fine-tuning alone as well as soft labels with domain confusion in Table 1. Because the Office dataset is imbalanced, we report multi-class accuracies, which are obtained by computing per-class accuracies independently, then averaging over all 31 categories.
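A hypothetical sketch of that metric (per-class accuracies averaged over categories):

```python
# Mean per-class ("multi-class") accuracy over the 31 Office categories.
import numpy as np

def multiclass_accuracy(y_true, y_pred, num_classes=31):
    per_class = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.any():                    # skip classes absent from the split
            per_class.append((y_pred[mask] == c).mean())
    return float(np.mean(per_class))
```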
We see that fine-tuning with soft labels or domain confusion provides a consistent improvement over hard label training in 5 of 6 shifts. Combining soft labels with domain confusion produces marginally higher performance on average. This result follows the intuitive notion that when enough target labeled examples are present, directly optimizing for the joint source and target classification objective (Source+Target CNN) is a strong baseline and so using either of our new losses adds enough regularization to improve performance.

Next, we experiment with the semi-supervised adaptation setting. We consider the case in which training data and labels are available for some, but not all of the categories in the target domain. We are interested in seeing whether we can transfer information learned from the labeled classes to the unlabeled classes.

To do this, we consider having 10 target labeled examples per category from only 15 of the 31 total categories, following the standard protocol introduced with the Office dataset [28]. We then evaluate our classification performance on the remaining 16 categories for which no data was available at training time.

In Table 2 we present multi-class accuracies over the 16 held-out categories and compare our method to a previous domain adaptation method [18] as well as a source-only trained CNN. Note that, since the performance here is computed over only a subset of the categories in the dataset, the numbers in this table should not be directly compared to the supervised setting in Table 1.

We find that all variations of our method (only soft label loss, only domain confusion, and both together) outperform the baselines. Contrary to the fully supervised case, here we note that both domain confusion and soft labels contribute significantly to the overall performance improvement of our method. This stems from the fact that we are now evaluating on categories which lack labeled target data, and thus the network can not implicitly enforce domain invariance through the classification objective alone. Separately, the fact that we get improvement from the soft label training on related tasks indicates that information is being effectively transferred between tasks.

In Figure 5, we show examples for the Amazon→Webcam shift where our method correctly classifies images from held out object categories and the ...
Office dataset Experiment
Multiclass accuracy over 16 classes which lack target labels
[Figure: a target test image with bar charts of soft label activations over the 31 Office categories (back pack, bike, bike helmet, ..., trash can; y-axis 0 to 0.1), comparing "Baseline soft label" against "Ours soft label"; panels labeled ring binder and monitor contrast the baseline soft activation, our soft activation, and the source soft labels.]
Cross-dataset Experiment Setup
Source: ImageNet. Target: Caltech256. 40 categories. Evaluate adaptation performance with 0, 1, 3, 5 target labeled examples per class.
[Tommasi '14]
ImageNet adapted to Caltech  [ICCV 2015]
[Plot: multiclass accuracy (72–78%) vs. number of labeled target examples per category (0, 1, 3, 5; also marked in total labeled target examples, up to 400), comparing Source+Target CNN, Ours: softlabels only, and Ours: dom confusion+softlabels.]
Summary: simultaneous transfer across domains and tasks
Domain confusion aligns the distributions
Softlabels transfer class correlations
Paper presented in poster session Wednesday 12/16 4B
Discrepancy due to modality shift  [NIPS 2014, CVPR 2015]

Lots of data exist to train models in RGB, with a label space of lamp, pillow, bed, night-stand. When the system uses the model on a new modality, the current output covers only lamp, bed, night-stand; the desired output covers the full label space (lamp, pillow, bed, night-stand): a label space discrepancy.
Adapting Deep Visual Models
• Adapting across domains
• Adapting across tasks
• Adapting across modalities
• Error bounds on adapted deep models
Generally applicable to adaptation with deep learning in AI
Thank you.