Deep Transfer Mapping for Unsupervised Writer Adaptation
Hong-Ming Yang1,2, Xu-Yao Zhang1,2, Fei Yin1,2, Jun Sun4, Cheng-Lin Liu1,2,3
1 NLPR, Institute of Automation, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 CAS Center for Excellence in Brain Science and Intelligence Technology
4 Fujitsu Research & Development Center
Aug. 8, 2018
Outline
• Introduction
• Style Transfer Mapping
• Motivation and the Proposed Method
• Experiments and Analysis
• Conclusions
Introduction
A main challenge for handwriting recognition: the large variability of distributions between the training data and different test data, caused by, e.g.:
– Different writing styles of different writers
– Different writing tools (e.g. different pens or electronic writing devices)
– Different writing environments (e.g. normal or emergency situations)
[Figure: written characters of two writers.]
Introduction
• Domain adaptation: a form of transfer learning. Adapt the base classifier to each domain in the test dataset.
• Recent methods: mainly based on deep learning
– Fine-tuning with target-domain data
– Learning domain-invariant representations (features)
– Projecting source- or target-domain data to align the distributions
[Diagram: training datasets + training methods produce a base classifier, which is then adapted to each test writer (Writer 1, Writer 2, …, Writer N).]
Style Transfer Mapping
[Xu-Yao Zhang et al., Writer adaptation with style transfer mapping. TPAMI'13]
Style transfer mapping (STM)
Main idea: project the target-domain (test) data to align the data distributions
– Problem: p(x^s) ≠ p(x^t)
– Learn a projection x̂^t = A x^t + b such that p(x^s) ≈ p(x̂^t)
– Learn the classifier on x^s and feed x̂^t to the base classifier
Learning of the projection (A and b):
min_{A∈R^{d×d}, b∈R^d}  Σ_{i=1}^n f_i ‖A s_i + b − t_i‖₂² + β‖A − I‖_F² + γ‖b‖₂²
– Source points s_i: features in the target domain, i.e., x^t
– Target points t_i: prototype (LVQ) or class mean (MQDF) of class y_i (y_i is the label of sample s_i)
Style Transfer Mapping
Solution: a convex quadratic programming problem with a closed-form solution:
A = Q P⁻¹,  b = (1/f̂)(t̂ − A ŝ)
Q = Σ_{i=1}^n f_i t_i s_iᵀ − (1/f̂) t̂ ŝᵀ + β I
P = Σ_{i=1}^n f_i s_i s_iᵀ − (1/f̂) ŝ ŝᵀ + β I
where  t̂ = Σ_{i=1}^n f_i t_i,  ŝ = Σ_{i=1}^n f_i s_i,  f̂ = Σ_{i=1}^n f_i + γ
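As a sanity check, the closed-form solution can be written in a few lines of NumPy. This is a minimal sketch of the formulas above, not the authors' code; the function name `solve_stm` and the default confidence weights are our own assumptions.

```python
import numpy as np

def solve_stm(S, T, f=None, beta=1.0, gamma=1.0):
    """Closed-form STM: minimize
    sum_i f_i ||A s_i + b - t_i||^2 + beta ||A - I||_F^2 + gamma ||b||^2.
    S: (n, d) source points s_i; T: (n, d) target points t_i;
    f: (n,) confidence weights f_i (defaults to all ones)."""
    n, d = S.shape
    f = np.ones(n) if f is None else f
    f_hat = f.sum() + gamma                 # f^ = sum_i f_i + gamma
    s_hat, t_hat = f @ S, f @ T             # s^ = sum_i f_i s_i, t^ = sum_i f_i t_i
    Q = (T * f[:, None]).T @ S - np.outer(t_hat, s_hat) / f_hat + beta * np.eye(d)
    P = (S * f[:, None]).T @ S - np.outer(s_hat, s_hat) / f_hat + beta * np.eye(d)
    A = Q @ np.linalg.inv(P)
    b = (t_hat - A @ s_hat) / f_hat
    return A, b
```

With β and γ close to zero this reduces to a weighted least-squares affine fit, so an exact affine relation between S and T is recovered almost perfectly.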
Extend to convolutional neural networks (CNNs)
Main idea: perform adaptation on the deep features
– φ(x): CNN feature extractor
– Apply STM to the features: s = φ(x^s), t = φ(x^t)
Dealing with unsupervised adaptation
– Use pseudo labels predicted by the base classifier
– Iterative method: base classifier → pseudo labels → adaptation → better pseudo labels → adaptation → …
[Xu-Yao Zhang et al., Online and offline handwritten Chinese character recognition: a comprehensive study and new benchmark. PR'17]
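The pseudo-label iteration can be sketched with a nearest-prototype base classifier in NumPy. The function names, the unweighted STM solve (all f_i = 1), and the toy setup are our own illustration of the loop, not the paper's implementation.

```python
import numpy as np

def stm(S, T, beta=1.0, gamma=1.0):
    """Unweighted closed-form STM (all f_i = 1)."""
    n, d = S.shape
    f_hat = n + gamma
    s_hat, t_hat = S.sum(0), T.sum(0)
    Q = T.T @ S - np.outer(t_hat, s_hat) / f_hat + beta * np.eye(d)
    P = S.T @ S - np.outer(s_hat, s_hat) / f_hat + beta * np.eye(d)
    A = Q @ np.linalg.inv(P)
    return A, (t_hat - A @ s_hat) / f_hat

def adapt_with_pseudo_labels(X_t, prototypes, n_iter=3, beta=1.0, gamma=1.0):
    """base classifier -> pseudo labels -> adaptation -> better pseudo labels -> ...
    X_t: (n, d) target-domain (writer) features; prototypes: (K, d) class prototypes."""
    X = X_t
    for _ in range(n_iter):
        # pseudo-label with the nearest prototype (the base classifier)
        dists = ((X[:, None, :] - prototypes[None]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # STM: map each sample toward the prototype of its pseudo class
        A, b = stm(X_t, prototypes[labels], beta, gamma)
        X = X_t @ A.T + b
    return X, labels
```

Each pass re-labels the adapted features, so a better mapping yields better pseudo labels, which in turn yield a better mapping.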
Motivations & Methods
Motivations: traditional adaptation methods with CNNs
– Consider only the fully connected layers
– Perform adaptation on only one layer
Proposed
– Adaptation on both fully connected layers and convolutional layers
– Adaptation on multiple (or all) layers of the base CNN
Adaptation method for fully connected layers
– STM on the deep features of the layer (unsupervised adaptation)
Adaptation methods for convolutional layers
– Use a linear transformation to project the target-domain data to align the data distributions
– Four variations of the linear transformation, based on different assumptions about the spatial relations in the feature maps
Motivations & Methods
Output of a convolutional layer for an input x_i:
O_i = { o_{cjk} },  c = 1…C, j = 1…H, k = 1…W
c, j, k: indices of the feature maps, rows, and columns in each feature map
Fully associate adaptation (FAA):
– Assumption: all positions (c, j, k) are related to each other
– Method: expand O_i into a long vector v_i of dimension CHW, and learn a transformation A ∈ R^{CHW×CHW}, b ∈ R^{CHW} by STM
– v_i' = A v_i + b, i.e., v_i'^m = Σ_{n=1}^{CHW} A_{mn} v_i^n + b_m: each position m in v_i' is related to all positions in v_i
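FAA's bookkeeping can be sketched in NumPy with toy sizes. We use a stand-in transformation instead of a solved STM, since solving an actual CHW×CHW system is exactly what makes FAA expensive; the sizes and names here are our own illustration.

```python
import numpy as np

C, H, W = 4, 3, 3                 # toy sizes; real conv layers are much larger
d = C * H * W                     # FAA flattens each output O_i to a CHW-vector
rng = np.random.default_rng(0)

O = rng.standard_normal((10, C, H, W))          # 10 target-domain conv outputs
# Stand-in for the STM solution: a full CHW x CHW matrix A and a CHW-vector b.
A = np.eye(d) + 0.01 * rng.standard_normal((d, d))
b = 0.01 * rng.standard_normal(d)

V = O.reshape(10, d)              # v_i
V_adapted = V @ A.T + b           # v_i' = A v_i + b: every output position
O_adapted = V_adapted.reshape(10, C, H, W)      # mixes all CHW input positions
```

Even at moderate sizes such as C = 128, H = W = 16, the matrix alone has (128·16·16)² ≈ 10⁹ entries, which motivates the cheaper PAA/WIA/SIA variants.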
Motivations & Methods
Partly associate adaptation (PAA):
– Assumption: positions within the same feature map are related to each other, but the feature maps are mutually independent
– Method: expand each feature map into a vector of dimension HW, and learn a transformation A_c ∈ R^{HW×HW}, b_c ∈ R^{HW} for each feature map c separately by STM
– The transformation A_c, b_c captures the relations of positions within a feature map; learning each A_c, b_c separately ensures the independence between the feature maps
– A separate STM for each feature map
Motivations & Methods
Weakly independent adaptation (WIA):
– Assumption: all positions (c, j, k) in O_i are independent of each other
– Method: learn a scalar transformation a, b ∈ R for each position (c₀, j₀, k₀) separately by STM
– o_i'^{c₀j₀k₀} = a · o_i^{c₀j₀k₀} + b
min_{a,b∈R}  Σ_{i=1}^{n_t} (a · o_i^{c₀j₀k₀} + b − t_i^{c₀j₀k₀})² + β(a − 1)² + γ b²
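Because each position has only a scalar pair (a, b), the problem above is a 2×2 linear system per position and can be solved for all CHW positions at once. A minimal NumPy sketch (the function name `wia` and the array layout are our own assumptions):

```python
import numpy as np

def wia(O, T, beta=1.0, gamma=1.0):
    """Per-position scalar STM.
    O, T: (n_t, C, H, W) target-domain outputs and their target points.
    At every position, setting the gradients of the objective to zero gives
        a*(sum o^2 + beta) + b*(sum o)         = sum o*t + beta
        a*(sum o)          + b*(n_t + gamma)   = sum t
    solved here by Cramer's rule, vectorized over all (c, j, k)."""
    n = O.shape[0]
    ss = (O * O).sum(0) + beta        # sum_i o_i^2 + beta, shape (C, H, W)
    s1 = O.sum(0)                     # sum_i o_i
    st = (O * T).sum(0) + beta        # sum_i o_i * t_i + beta
    t1 = T.sum(0)                     # sum_i t_i
    det = ss * (n + gamma) - s1 * s1
    a = (st * (n + gamma) - s1 * t1) / det
    b = (ss * t1 - s1 * st) / det
    return a, b                       # each of shape (C, H, W)
```

With small β and γ, an exact per-position affine relation between O and T is recovered.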
Motivations & Methods
Strong independent adaptation (SIA):
– Assumption: all positions are independent of each other, and the positions within the same feature map share the same linear transformation
– Method: learn a scalar transformation a, b ∈ R for each feature map separately by STM
min_{a,b∈R}  Σ_{i=1}^{n_t} Σ_{j=1}^{H} Σ_{k=1}^{W} (a · o_i^{c₀jk} + b − t_i^{c₀jk})² + β(a − 1)² + γ b²
– Similar to the linear projection in a batch normalization (BN) layer
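SIA is the same scalar solve with the sums additionally pooled over all H·W positions of a channel, so each channel is effectively a scalar STM with n_t·H·W samples. A minimal NumPy sketch (function name and array layout are our own illustration):

```python
import numpy as np

def sia(O, T, beta=1.0, gamma=1.0):
    """One scalar pair (a, b) per channel, shared by all H*W positions.
    O, T: (n_t, C, H, W) target-domain outputs and their target points."""
    n, C, H, W = O.shape
    m = n * H * W                        # effective sample count per channel
    ss = (O * O).sum((0, 2, 3)) + beta   # per-channel sums, each of shape (C,)
    s1 = O.sum((0, 2, 3))
    st = (O * T).sum((0, 2, 3)) + beta
    t1 = T.sum((0, 2, 3))
    det = ss * (m + gamma) - s1 * s1
    a = (st * (m + gamma) - s1 * t1) / det
    b = (ss * t1 - s1 * st) / det
    return a, b                          # each of shape (C,)
```

Applying the result is a per-channel affine map, O' = a[None, :, None, None] * O + b[None, :, None, None], structurally the same scale-and-shift as a BN layer.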
Motivations & Methods
Analysis and comparison
– Complexity & flexibility: FAA > PAA > WIA > SIA
– Computation & memory efficiency: SIA > WIA > PAA > FAA
Comparison of the four adaptation methods:
– Assumption: FAA — all positions related; PAA — related within each feature map; WIA — all positions independent; SIA — all positions independent, parameters shared within a map
– Feature dimension: FAA CHW; PAA HW; WIA 1; SIA 1
– Matrix size: FAA CHW×CHW; PAA HW×HW; WIA 1×1; SIA 1×1
– Number of transformations: FAA 1; PAA C; WIA CHW; SIA C
– Total parameters: FAA CHW(CHW+1); PAA CHW(HW+1); WIA 2CHW; SIA 2C
Motivations & Methods
Deep transfer mapping (DTM): perform adaptation on multiple layers in a deep manner
Algorithm:
1. Select a group of layers L on which to perform adaptation.
2. From the bottom to the top layer in L, perform adaptation on each layer with the proposed adaptation methods, keeping the other layers unchanged.
3. After adapting a layer, insert a linear layer after it and set the weights and bias of this linear layer to the solved A and b.
Advantage of DTM: captures more comprehensive information and minimizes the discrepancy of the distributions at different abstraction levels, which is more powerful for flexibly aligning the distributions between the domains.
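The layer-by-layer procedure can be sketched on a toy two-layer network in NumPy. Everything here (the tiny fully connected "network", the unweighted STM solve, and the per-layer target points passed in as inputs) is our own stand-in for illustration, not the paper's CNN.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 4))     # toy base "network": two dense layers
W2 = rng.standard_normal((3, 8))
relu = lambda z: np.maximum(z, 0.0)

def stm(S, T, beta=1.0, gamma=1.0):
    # unweighted closed-form STM (all f_i = 1)
    n, d = S.shape
    f_hat = n + gamma
    s_hat, t_hat = S.sum(0), T.sum(0)
    Q = T.T @ S - np.outer(t_hat, s_hat) / f_hat + beta * np.eye(d)
    P = S.T @ S - np.outer(s_hat, s_hat) / f_hat + beta * np.eye(d)
    A = Q @ np.linalg.inv(P)
    return A, (t_hat - A @ s_hat) / f_hat

def dtm(X_t, layer_targets, beta=1.0, gamma=1.0):
    """Adapt from bottom to top: after a layer is adapted, the solved (A, b)
    acts as an inserted linear layer when computing the next layer's features."""
    maps = []
    H1 = relu(X_t @ W1.T)                       # features of layer 1
    A1, b1 = stm(H1, layer_targets[0], beta, gamma)
    H1 = H1 @ A1.T + b1                         # inserted linear layer
    maps.append((A1, b1))
    H2 = relu(H1 @ W2.T)                        # layer 2 sees adapted input
    A2, b2 = stm(H2, layer_targets[1], beta, gamma)
    maps.append((A2, b2))
    return maps
```

If the target points of a layer already coincide with its features, the solve returns the identity map, so an unneeded adaptation step leaves the network unchanged.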
Experiments & Analysis
Datasets
– Training set: CASIA OLHWDB 1.0–1.2 — 2,697,673 samples (in the 3,755-class set), 1,020 writers (domains)
– Test set: On-ICDAR2013 competition set — 224,590 samples, 60 writers (domains)
– Adaptation set: unlabeled samples from each domain (writer) in the test set
Online handwritten Chinese characters; the samples of each writer are stored in a single file and can be viewed as one domain.
Experiments & Analysis
Different adaptation methods for convolutional layers
– Base classifier: 11 layers, 97.55% accuracy on the test set
– The four adaptation methods are compared on the same convolutional layer (#8)
Experiments & Analysis
Adaptation property of different layers in the CNN (adaptation method: WIA)
– From the bottom to the top layers, the adaptation performance increases
– Bottom layers extract general features, which are applicable across different domains, so the improvement after adaptation is not obvious
– Top layers capture abstract features, which are more domain-specific, so adaptation is helpful for such layers
Experiments & Analysis
Deep transfer mapping
– DTM can further boost the performance of the base classifier
– DTM still has limitations: the improvement becomes marginal when too many layers are adapted
Conclusions
• Unsupervised domain adaptation to alleviate writing-style variation, assuming each writer has a consistent style
• Four variations of adaptation methods for convolutional layers, based on different assumptions about the spatial relations in the output of convolutional layers
• A deep transfer mapping (DTM) method that conducts adaptation on multiple (or all) layers of the CNN, to better align the data distributions of different styles
• Remaining problems
– What is the best way to adapt deep neural networks?
– How to adapt with few samples in adaptation/testing (currently 3,755 samples per writer)?
– Theoretical modeling of within-writer and between-writer style variation
– Continuous adaptation of the classifier
Thanks for your attention!