Boosting for Transfer Learning
Wenyuan Dai¹, Qiang Yang², Gui-Rong Xue¹, Yong Yu¹
¹Department of Computer Science and Engineering, Shanghai Jiao Tong University
²Department of Computer Science and Engineering, Hong Kong University of Science and Technology
The 24th International Conference on Machine Learning
Wenyuan Dai, Qiang Yang et al. Boosting for Transfer Learning
Outline
1 Introduction
2 TrAdaBoost = Transfer AdaBoost
  Two Ensemble Methods
  Algorithm
  Theoretical Properties
3 Experimental Results
4 Conclusion
Transfer Learning
Effective Bayesian Transfer Learning (2005)
Transfer learning is what happens when someone finds it much easier to learn to play chess having already learned to play checkers; or to recognize tables having already learned to recognize chairs; or to learn Spanish having already learned Italian.

NIPS Inductive Transfer Workshop (2005)
Transfer learning emphasizes the transfer of knowledge across domains, tasks, and distributions that are similar but not the same.
Our Problem
The training and test data are from different data sources, e.g. new Web pages vs. old Web pages.
  The Web data are easily outdated.
The training and test data come from different domains.
  Domain adaptation in NLP [6], etc.
The training and test data share the same or a similar class-label set.
  The knowledge can be transferred between the two data sets.
  Learning with auxiliary data [5].
Example
A1 – cat, A2 – tiger; B1 – dog, B2 – wolf.

The task is to distinguish tiger from wolf with the help of the data about cat and dog.
Problem Formulation
Notations
Test data S.
Training data T = Ts ∪ Td.
  Base labeled data Ts: distributed the same as S. We assume that Ts is inadequate to train a good classifier.
  Auxiliary labeled data Td: might be distributed differently from S.

Our goal is to train a classifier which minimizes the prediction error on S.
Motivation
The motivation of our work is to show how boosting can help transfer learning.
Reweighting the examples
Misclassified examples:
increase the weights of the misclassified data in Ts
decrease the weights of the misclassified data in Td
Two Ensemble Methods
Hedge(β) [2]
Hedge(β) is able to find a strategy that approximates the best strategy among all the N strategies, even though we do not know which one is the best.

AdaBoost [2]
Given a weak learner, AdaBoost is able to minimize the error suffered by the learner on the training data, and boost it to a strong learner.
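As a minimal sketch of the Hedge(β) update (the three strategies, their loss rates, and the β value here are illustrative choices of mine, not from the paper):

```python
import random

def hedge_update(weights, losses, beta):
    # One Hedge(beta) step: each strategy's weight is multiplied by
    # beta**loss, so strategies that keep losing decay exponentially.
    return [w * beta ** l for w, l in zip(weights, losses)]

random.seed(0)
w = [1.0, 1.0, 1.0]
loss_rates = (0.1, 0.5, 0.9)   # strategy 0 rarely loses, strategy 2 often
for _ in range(50):
    losses = [1.0 if random.random() < r else 0.0 for r in loss_rates]
    w = hedge_update(w, losses, beta=0.8)
p = [wi / sum(w) for wi in w]
print(p)   # weight concentrates on the strategy with the fewest losses
```

After 50 rounds almost all the weight sits on the strategy with the smallest cumulative loss, which is exactly the "approximates the best strategy" behaviour described above.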
Basic Idea
AdaBoost is applied to the data in Ts
increase the weights of the misclassified data in Ts
Hedge(β) is applied to the data in Td
decrease the weights of the misclassified data in Td
Transfer AdaBoost
Freund & Schapire (EuroCOLT 1995)
Hedge(β): given N different strategies, Hedge(β) finds a strategy that approximates the best strategy among all the N strategies, even though we do not know which is the best.
AdaBoost: given a weak learner, AdaBoost is able to minimize the error suffered by the learner on the training data, and boost it to a strong learner.

TrAdaBoost combines the two:
On Ts (base labeled data), AdaBoost minimizes the error suffered by TrAdaBoost on Ts.
On Td (auxiliary labeled data), Hedge(β) chooses the best examples, those with minimal average training error, to help the learning.
S is the unlabeled test data.
Algorithm Description

Input: the two labeled data sets Td and Ts, the unlabeled data set S, a base learning algorithm Learner, and the maximum number of iterations N.

Initialize the initial weight vector w¹ = (w¹_1, ..., w¹_{n+m}).

For t = 1, ..., N:

1 Set pᵗ = wᵗ / (∑_{i=1}^{n+m} wᵗ_i).

2 Call Learner, providing it the combined training set T with the distribution pᵗ over T and the unlabeled data set S. Then, get back a hypothesis h_t : X → Y (or [0, 1] by confidence).

3 Calculate the error of h_t on Ts:

  ε_t = ∑_{i=n+1}^{n+m} wᵗ_i · |h_t(x_i) − c(x_i)| / ∑_{i=n+1}^{n+m} wᵗ_i.

4 Set β_t = ε_t / (1 − ε_t) and β = 1/(1 + √(2 ln n / N)). Note that ε_t is required to be less than 1/2.

5 Update the new weight vector:

  wᵗ⁺¹_i = wᵗ_i · β^{|h_t(x_i) − c(x_i)|}       for 1 ≤ i ≤ n (the auxiliary data Td),
  wᵗ⁺¹_i = wᵗ_i · β_t^{−|h_t(x_i) − c(x_i)|}    for n + 1 ≤ i ≤ n + m (the base data Ts).

Output the hypothesis:

  h_f(x) = 1 if ∏_{t=⌈N/2⌉}^{N} β_t^{−h_t(x)} ≥ ∏_{t=⌈N/2⌉}^{N} β_t^{−1/2}, and 0 otherwise.
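The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: a weighted 1-D decision stump stands in for Learner, and the toy data, the clamping constants, and all names are my own assumptions.

```python
import math

def stump(X, y, p):
    """Weighted 1-D decision stump: choose the (threshold, polarity) pair
    with the smallest weighted 0/1 error under the distribution p."""
    best = None
    for thr in sorted(set(X)):
        for pol in (0, 1):
            err = sum(pi for xi, yi, pi in zip(X, y, p)
                      if (pol if xi >= thr else 1 - pol) != yi)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    _, thr, pol = best
    return lambda x: pol if x >= thr else 1 - pol

def tradaboost(Xd, yd, Xs, ys, N):
    """Minimal TrAdaBoost sketch: indices 0..n-1 hold the auxiliary data Td,
    indices n..n+m-1 hold the same-distribution data Ts."""
    n, m = len(Xd), len(Xs)
    X, y = list(Xd) + list(Xs), list(yd) + list(ys)
    w = [1.0] * (n + m)
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / N))   # Hedge rate
    hyps, betas = [], []
    for _ in range(N):
        total = sum(w)
        p = [wi / total for wi in w]
        h = stump(X, y, p)
        # error of h_t measured on Ts only
        ws = sum(w[n:])
        eps = sum(w[n + i] * abs(h(Xs[i]) - ys[i]) for i in range(m)) / ws
        eps = min(max(eps, 1e-6), 0.49)   # the paper requires eps_t < 1/2
        bt = eps / (1.0 - eps)
        for i in range(n):                # Td: Hedge shrinks misclassified
            w[i] *= beta ** abs(h(X[i]) - y[i])
        for i in range(n, n + m):         # Ts: AdaBoost grows misclassified
            w[i] *= bt ** (-abs(h(X[i]) - y[i]))
        hyps.append(h)
        betas.append(bt)
    start = math.ceil(N / 2) - 1          # final vote uses rounds ceil(N/2)..N
    def hf(x):
        lhs = sum(-math.log(b) * h(x) for h, b in zip(hyps[start:], betas[start:]))
        rhs = sum(-0.5 * math.log(b) for b in betas[start:])
        return 1 if lhs >= rhs else 0
    return hf

# Toy transfer task: Td is labeled by a shifted concept (x >= 0.3),
# Ts by the target concept (x >= 0.5); Ts alone is tiny.
Xd = [i / 20 for i in range(20)]
yd = [1 if x >= 0.3 else 0 for x in Xd]
Xs = [0.1, 0.2, 0.35, 0.45, 0.55, 0.6, 0.8, 0.9]
ys = [1 if x >= 0.5 else 0 for x in Xs]
hf = tradaboost(Xd, yd, Xs, ys, N=20)
print([hf(x) for x in (0.35, 0.45, 0.55, 0.8)])   # -> [0, 0, 1, 1]
```

On this toy task the auxiliary labels follow a shifted threshold, yet the final vote recovers the target threshold: the misleading Td examples in [0.3, 0.5) are exponentially down-weighted, which is exactly the intended effect of the Hedge(β) update.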
Rate of Convergence
Ld – the training loss w.r.t. Td through N iterations.
L(x_i) – the training loss w.r.t. x_i through N iterations.

Theorem 1 (Hedge(β))
In TrAdaBoost, when the number of iterations is N, we have

  Ld / N ≤ min_{1≤i≤n} L(x_i) / N + √(2 ln n / N) + (ln n) / N.   (1)

The convergence rate of TrAdaBoost over the data in Td is O(√(ln n / N)).
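To get a feel for the bound, the two additive terms of (1) can be evaluated numerically; the values of n and N below are arbitrary examples of mine, not figures from the paper:

```python
import math

def hedge_regret_terms(n, N):
    # The two additive terms of Theorem 1: sqrt(2 ln n / N) + (ln n) / N.
    return math.sqrt(2.0 * math.log(n) / N) + math.log(n) / N

for N in (10, 100, 1000, 10000):
    print(N, round(hedge_regret_terms(1000, N), 4))
```

The gap shrinks like √(ln n / N): multiplying N by 100 roughly divides the dominant term by 10, which is the "somewhat slow" convergence noted in the conclusion.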
The average weighted error on Td
Let lᵗ_i = |h_t(x_i) − c(x_i)| be the loss of the training instance x_i suffered by the hypothesis h_t.

Theorem 2
In TrAdaBoost, pᵗ_i denotes the weight of the training instance x_i, which is defined as pᵗ = wᵗ/(∑_{i=1}^{n+m} wᵗ_i). Then,

  lim_{N→∞} (∑_{t=⌈N/2⌉}^{N} ∑_{i=1}^{n} pᵗ_i lᵗ_i) / (N − ⌈N/2⌉) = 0.   (2)

Intuitively, Theorem 2 indicates that the average weighted training loss by TrAdaBoost on the data in Td from the ⌈N/2⌉-th iteration to the N-th converges to zero.
The prediction error on Ts
Theorem 3 (AdaBoost)
Let I = {i : h_f(x_i) ≠ c(x_i) and n + 1 ≤ i ≤ n + m}. The prediction error on the data in Ts suffered by the final hypothesis h_f is defined as ε = Pr_{x∈Ts}[h_f(x) ≠ c(x)] = |I|/m. Then,

  ε ≤ 2^{⌈N/2⌉} ∏_{t=⌈N/2⌉}^{N} √(ε_t(1 − ε_t)).   (3)

Theorem 3 shows that TrAdaBoost preserves a similar error-convergence property as the AdaBoost algorithm.
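As a quick numeric illustration (my own, not from the paper): when every ε_t is bounded away from 1/2, the right-hand side of (3) decays rapidly with N, since each round contributes a factor 2√(ε_t(1 − ε_t)) < 1.

```python
import math

def theorem3_bound(eps, N):
    """Evaluate the Theorem 3 bound 2^ceil(N/2) * prod sqrt(eps_t(1 - eps_t))
    for a constant per-round error eps_t = eps over rounds ceil(N/2)..N."""
    half = math.ceil(N / 2)
    terms = N - half + 1
    return 2.0 ** half * math.sqrt(eps * (1.0 - eps)) ** terms

for N in (20, 50, 100):
    print(N, theorem3_bound(0.3, N))   # shrinks as N grows when eps < 1/2
```

With a constant per-round error of 0.3 the bound drops below 1 already at N = 20 and keeps falling geometrically, mirroring AdaBoost's exponential training-error convergence.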
Generalization Error
Theorem 4 (AdaBoost)
Let d_VC be the VC-dimension of the hypothesis space. The generalization error on Ts, with high probability, is at most

  ε + O(√(N · d_VC / |Ts|)).   (4)

Here, N is the number of iterations and ε is the error on Ts from h_f.

TrAdaBoost preserves the generalization error bound of AdaBoost.
Summary
TrAdaBoost
converges in O(√(ln n / N)).
minimizes the average weighted training error on Td.
minimizes the prediction error on Ts.

TrAdaBoost preserves the properties of AdaBoost, and minimizes the training error on Td simultaneously.
Data Sets
Four Data Sets:
20 Newsgroups
SRAA
Reuters-21578
the mushroom data set from the UCI Machine Learning Repository
The data are split based on sub-categories.
Description of the Data Sets
Data Set              KL-divergence   |Td|    |Ts ∪ S|
rec vs talk           1.102           3,669   3,561
rec vs sci            1.021           3,961   3,965
sci vs talk           0.854           3,374   3,828
auto vs aviation      1.126           8,000   8,000
real vs simulated     1.048           8,000   8,000
orgs vs people        0.303           1,016   1,046
orgs vs places        0.329           1,079   1,080
people vs places      0.307           1,239   1,210
edible vs poisonous   1.315           4,608   3,516
Baseline Methods
Baseline   Labeled Training Data   Unlabeled Training Data   Test Data   Basic Learner
SVM        Ts                      ∅                         S           SVM
SVMt       Ts ∪ Td                 ∅                         S           SVM
AUX [5]    Ts ∪ Td                 ∅                         S           SVM

AUX – Improving SVM Accuracy by Training on Auxiliary Data Sources (Wu & Dietterich, ICML 2004)
Performance
Table: The error rates under supervised learning (|Ts|/|Td| = 0.01)

Data Set              SVM     SVMt    AUX     TrAdaBoost(SVM)
rec vs talk           0.222   0.127   0.127   0.080
rec vs sci            0.240   0.164   0.153   0.097
sci vs talk           0.234   0.177   0.173   0.125
auto vs aviation      0.131   0.192   0.188   0.096
real vs simulated     0.140   0.219   0.210   0.119
orgs vs people        0.494   0.285   0.287   0.280
orgs vs places        0.423   0.440   0.433   0.315
people vs places      0.412   0.255   0.257   0.216
edible vs poisonous   0.127   0.135   0.082   0.071
Vary the size of Ts
[Figure: error rate vs. the ratio between same-distribution and diff-distribution training data (0 to 0.5), on people vs places and orgs vs places, comparing TrAdaBoost(SVM), SVM, and SVMt.]

Ratio – |Ts|/|Td|.
TrAdaBoost performs the best in situations where the ratio is less than 0.1.
Convergence
[Figure: error rate vs. the number of iterations (0–100) on people vs places, with one curve per ratio |Ts|/|Td| from 0.01 to 0.50.]
Conclusion
Pros
AdaBoost is extended to better solve the transfer learning problem.
TrAdaBoost preserves the properties of AdaBoost, and has some good properties on the auxiliary data.
TrAdaBoost outperforms several state-of-the-art classifiers in the transfer learning setting.

Cons
The algorithm is heuristic.
The rate of convergence is somewhat slow (O(√(ln n / N))).
Appendix Reference
References
Caruana, R. (1997). Multitask Learning. Machine Learning, 28(1), 41–75.

Freund, Y., & Schapire, R. E. (1997). A Decision-theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1997). Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. Proceedings of the Fourteenth International Conference on Machine Learning.

Schapire, R. E. (1999). A Brief Introduction to Boosting. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence.
Wu, P., & Dietterich, T. G. (2004). Improving SVM Accuracy by Training on Auxiliary Data Sources. Proceedings of the Twenty-First International Conference on Machine Learning.

Daumé III, H., & Marcu, D. (2006). Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research, 26, 101–126.