Boosting for Transfer Learning
Wenyuan Dai¹, Qiang Yang², Gui-Rong Xue¹, Yong Yu¹
¹Department of Computer Science and Engineering, Shanghai Jiao Tong University
²Department of Computer Science and Engineering, Hong Kong University of Science and Technology
The 24th International Conference on Machine Learning
Wenyuan Dai, Qiang Yang et al. Boosting for Transfer Learning
Outline
1 Introduction
2 TrAdaBoost = Transfer AdaBoost
  Two Ensemble Methods
  Algorithm
  Theoretical Properties
3 Experimental Results
4 Conclusion
Transfer Learning
Effective Bayesian Transfer Learning (2005)
Transfer learning is what happens when someone finds it much easier to learn to play chess having already learned to play checkers; or to recognize tables having already learned to recognize chairs; or to learn Spanish having already learned Italian.

NIPS Inductive Transfer Workshop (2005)
Transfer learning emphasizes the transfer of knowledge across domains, tasks, and distributions that are similar but not the same.
Our Problem
The training and test data are from different data sources, e.g. new Web pages vs. old Web pages.
  The Web data are easily outdated.
The training and test data come from different domains.
  Domain adaptation in NLP [6], etc.
The training and test data share the same or a similar class-label set.
  The knowledge can be transferred between the two data sets.
  Learning with auxiliary data [5].
Example
A1 – cat, A2 – tiger; B1 – dog, B2 – wolf.

The task is to distinguish tiger from wolf with the help of the data about cat and dog.
Problem Formulation
Notations
Test data S.
Training data T = Ts ∪ Td.
  Base labeled data Ts: distributed the same as S. We assume that Ts is inadequate to train a good classifier.
  Auxiliary labeled data Td: might be distributed differently from S.

Our goal is to train a classifier which minimizes the prediction error on S.
Motivation
The motivation of our work is to show how boosting can help transfer learning.
Reweighting the examples
Misclassified examples:
increase the weights of the misclassified data in Ts
decrease the weights of the misclassified data in Td
Two Ensemble Methods
Hedge(β) [2]
Hedge(β) is able to find a strategy that approximates the best strategy among all the N strategies, even though we do not know which one is the best.

AdaBoost [2]
Given a weak learner, AdaBoost is able to minimize the error suffered by the learner on the training data, and boost it to a strong learner.
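As a minimal sketch of the Hedge(β) update (the three strategies, their loss rates, and the β value here are illustrative choices of mine, not from the paper):

```python
import random

def hedge_update(weights, losses, beta):
    # One Hedge(beta) step: each strategy's weight is multiplied by
    # beta**loss, so strategies that keep losing decay exponentially.
    return [w * beta ** l for w, l in zip(weights, losses)]

random.seed(0)
w = [1.0, 1.0, 1.0]
loss_rates = (0.1, 0.5, 0.9)   # strategy 0 rarely loses, strategy 2 often
for _ in range(50):
    losses = [1.0 if random.random() < r else 0.0 for r in loss_rates]
    w = hedge_update(w, losses, beta=0.8)
p = [wi / sum(w) for wi in w]
print(p)   # weight concentrates on the strategy with the fewest losses
```

After 50 rounds almost all the weight sits on the strategy with the smallest cumulative loss, which is exactly the "approximates the best strategy" behaviour described above.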
Basic Idea
AdaBoost is applied to the data in Ts
increase the weights of the misclassified data in Ts
Hedge(β) is applied to the data in Td
decrease the weights of the misclassified data in Td
Transfer AdaBoost
Freund & Schapire (EuroCOLT 1995)
Hedge(β): given N different strategies, Hedge(β) finds a strategy that approximates the best strategy among all the N strategies, even though we do not know which is the best.
AdaBoost: given a weak learner, AdaBoost is able to minimize the error suffered by the learner on the training data, and boost it to a strong learner.

TrAdaBoost combines the two:
On Ts (base labeled data), AdaBoost minimizes the error suffered by TrAdaBoost on Ts.
On Td (auxiliary labeled data), Hedge(β) chooses the best examples, those with minimal average training error, to help the learning.
S is the unlabeled test data.
Algorithm Description

Input: the two labeled data sets Td and Ts, the unlabeled data set S, a base learning algorithm Learner, and the maximum number of iterations N.

Initialize the initial weight vector w¹ = (w¹_1, ..., w¹_{n+m}).

For t = 1, ..., N:

1 Set pᵗ = wᵗ / (∑_{i=1}^{n+m} wᵗ_i).

2 Call Learner, providing it the combined training set T with the distribution pᵗ over T and the unlabeled data set S. Then, get back a hypothesis h_t : X → Y (or [0, 1] by confidence).

3 Calculate the error of h_t on Ts:

  ε_t = ∑_{i=n+1}^{n+m} wᵗ_i · |h_t(x_i) − c(x_i)| / ∑_{i=n+1}^{n+m} wᵗ_i.

4 Set β_t = ε_t / (1 − ε_t) and β = 1/(1 + √(2 ln n / N)). Note that ε_t is required to be less than 1/2.

5 Update the new weight vector:

  wᵗ⁺¹_i = wᵗ_i · β^{|h_t(x_i) − c(x_i)|}       for 1 ≤ i ≤ n (the auxiliary data Td),
  wᵗ⁺¹_i = wᵗ_i · β_t^{−|h_t(x_i) − c(x_i)|}    for n + 1 ≤ i ≤ n + m (the base data Ts).

Output the hypothesis:

  h_f(x) = 1 if ∏_{t=⌈N/2⌉}^{N} β_t^{−h_t(x)} ≥ ∏_{t=⌈N/2⌉}^{N} β_t^{−1/2}, and 0 otherwise.
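The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: a weighted 1-D decision stump stands in for Learner, and the toy data, the clamping constants, and all names are my own assumptions.

```python
import math

def stump(X, y, p):
    """Weighted 1-D decision stump: choose the (threshold, polarity) pair
    with the smallest weighted 0/1 error under the distribution p."""
    best = None
    for thr in sorted(set(X)):
        for pol in (0, 1):
            err = sum(pi for xi, yi, pi in zip(X, y, p)
                      if (pol if xi >= thr else 1 - pol) != yi)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    _, thr, pol = best
    return lambda x: pol if x >= thr else 1 - pol

def tradaboost(Xd, yd, Xs, ys, N):
    """Minimal TrAdaBoost sketch: indices 0..n-1 hold the auxiliary data Td,
    indices n..n+m-1 hold the same-distribution data Ts."""
    n, m = len(Xd), len(Xs)
    X, y = list(Xd) + list(Xs), list(yd) + list(ys)
    w = [1.0] * (n + m)
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / N))   # Hedge rate
    hyps, betas = [], []
    for _ in range(N):
        total = sum(w)
        p = [wi / total for wi in w]
        h = stump(X, y, p)
        # error of h_t measured on Ts only
        ws = sum(w[n:])
        eps = sum(w[n + i] * abs(h(Xs[i]) - ys[i]) for i in range(m)) / ws
        eps = min(max(eps, 1e-6), 0.49)   # the paper requires eps_t < 1/2
        bt = eps / (1.0 - eps)
        for i in range(n):                # Td: Hedge shrinks misclassified
            w[i] *= beta ** abs(h(X[i]) - y[i])
        for i in range(n, n + m):         # Ts: AdaBoost grows misclassified
            w[i] *= bt ** (-abs(h(X[i]) - y[i]))
        hyps.append(h)
        betas.append(bt)
    start = math.ceil(N / 2) - 1          # final vote uses rounds ceil(N/2)..N
    def hf(x):
        lhs = sum(-math.log(b) * h(x) for h, b in zip(hyps[start:], betas[start:]))
        rhs = sum(-0.5 * math.log(b) for b in betas[start:])
        return 1 if lhs >= rhs else 0
    return hf

# Toy transfer task: Td is labeled by a shifted concept (x >= 0.3),
# Ts by the target concept (x >= 0.5); Ts alone is tiny.
Xd = [i / 20 for i in range(20)]
yd = [1 if x >= 0.3 else 0 for x in Xd]
Xs = [0.1, 0.2, 0.35, 0.45, 0.55, 0.6, 0.8, 0.9]
ys = [1 if x >= 0.5 else 0 for x in Xs]
hf = tradaboost(Xd, yd, Xs, ys, N=20)
print([hf(x) for x in (0.35, 0.45, 0.55, 0.8)])   # -> [0, 0, 1, 1]
```

On this toy task the auxiliary labels follow a shifted threshold, yet the final vote recovers the target threshold: the misleading Td examples in [0.3, 0.5) are exponentially down-weighted, which is exactly the intended effect of the Hedge(β) update.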
Rate of Convergence
Ld – the training loss w.r.t. Td through N iterations.
L(x_i) – the training loss w.r.t. x_i through N iterations.

Theorem 1 (Hedge(β))
In TrAdaBoost, when the number of iterations is N, we have

  Ld / N ≤ min_{1≤i≤n} L(x_i) / N + √(2 ln n / N) + (ln n) / N.   (1)

The convergence rate of TrAdaBoost over the data in Td is O(√(ln n / N)).
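To get a feel for the bound, the two additive terms of (1) can be evaluated numerically; the values of n and N below are arbitrary examples of mine, not figures from the paper:

```python
import math

def hedge_regret_terms(n, N):
    # The two additive terms of Theorem 1: sqrt(2 ln n / N) + (ln n) / N.
    return math.sqrt(2.0 * math.log(n) / N) + math.log(n) / N

for N in (10, 100, 1000, 10000):
    print(N, round(hedge_regret_terms(1000, N), 4))
```

The gap shrinks like √(ln n / N): multiplying N by 100 roughly divides the dominant term by 10, which is the "somewhat slow" convergence noted in the conclusion.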
The average weighted error on Td
Let lᵗ_i = |h_t(x_i) − c(x_i)| be the loss of the training instance x_i suffered by the hypothesis h_t.

Theorem 2
In TrAdaBoost, pᵗ_i denotes the weight of the training instance x_i, which is defined as pᵗ = wᵗ/(∑_{i=1}^{n+m} wᵗ_i). Then,

  lim_{N→∞} (∑_{t=⌈N/2⌉}^{N} ∑_{i=1}^{n} pᵗ_i lᵗ_i) / (N − ⌈N/2⌉) = 0.   (2)

Intuitively, Theorem 2 indicates that the average weighted training loss by TrAdaBoost on the data in Td from the ⌈N/2⌉-th iteration to the N-th converges to zero.
The prediction error on Ts
Theorem 3 (AdaBoost)
Let I = {i : h_f(x_i) ≠ c(x_i) and n + 1 ≤ i ≤ n + m}. The prediction error on the data in Ts suffered by the final hypothesis h_f is defined as ε = Pr_{x∈Ts}[h_f(x) ≠ c(x)] = |I|/m. Then,

  ε ≤ 2^{⌈N/2⌉} ∏_{t=⌈N/2⌉}^{N} √(ε_t(1 − ε_t)).   (3)

Theorem 3 shows that TrAdaBoost preserves a similar error-convergence property as the AdaBoost algorithm.
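As a quick numeric illustration (my own, not from the paper): when every ε_t is bounded away from 1/2, the right-hand side of (3) decays rapidly with N, since each round contributes a factor 2√(ε_t(1 − ε_t)) < 1.

```python
import math

def theorem3_bound(eps, N):
    """Evaluate the Theorem 3 bound 2^ceil(N/2) * prod sqrt(eps_t(1 - eps_t))
    for a constant per-round error eps_t = eps over rounds ceil(N/2)..N."""
    half = math.ceil(N / 2)
    terms = N - half + 1
    return 2.0 ** half * math.sqrt(eps * (1.0 - eps)) ** terms

for N in (20, 50, 100):
    print(N, theorem3_bound(0.3, N))   # shrinks as N grows when eps < 1/2
```

With a constant per-round error of 0.3 the bound drops below 1 already at N = 20 and keeps falling geometrically, mirroring AdaBoost's exponential training-error convergence.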
Generalization Error
Theorem 4 (AdaBoost)
Let d_VC be the VC-dimension of the hypothesis space. The generalization error on Ts, with high probability, is at most

  ε + O(√(N · d_VC / |Ts|)).   (4)

Here, N is the number of iterations and ε is the error on Ts from h_f.

TrAdaBoost preserves the generalization error bound of AdaBoost.
Summary
TrAdaBoost
converges in O(√(ln n / N)).
minimizes the average weighted training error on Td.
minimizes the prediction error on Ts.

TrAdaBoost preserves the properties of AdaBoost, and minimizes the training error on Td simultaneously.
Data Sets
Four Data Sets:
20 Newsgroups
SRAA
Reuters-21578
the mushroom data set from the UCI Machine Learning Repository
The data are split based on sub-categories.
Description of the Data Sets
Data Set              KL-divergence   |Td|    |Ts ∪ S|
rec vs talk           1.102           3,669   3,561
rec vs sci            1.021           3,961   3,965
sci vs talk           0.854           3,374   3,828
auto vs aviation      1.126           8,000   8,000
real vs simulated     1.048           8,000   8,000
orgs vs people        0.303           1,016   1,046
orgs vs places        0.329           1,079   1,080
people vs places      0.307           1,239   1,210
edible vs poisonous   1.315           4,608   3,516
Baseline Methods
Baseline   Labeled Training Data   Unlabeled Training Data   Test Data   Basic Learner
SVM        Ts                      ∅                         S           SVM
SVMt       Ts ∪ Td                 ∅                         S           SVM
AUX [5]    Ts ∪ Td                 ∅                         S           SVM

AUX – Improving SVM Accuracy by Training on Auxiliary Data Sources (Wu & Dietterich, ICML 2004)
Performance
Table: The error rates under supervised learning (|Ts|/|Td| = 0.01)

Data Set              SVM     SVMt    AUX     TrAdaBoost(SVM)
rec vs talk           0.222   0.127   0.127   0.080
rec vs sci            0.240   0.164   0.153   0.097
sci vs talk           0.234   0.177   0.173   0.125
auto vs aviation      0.131   0.192   0.188   0.096
real vs simulated     0.140   0.219   0.210   0.119
orgs vs people        0.494   0.285   0.287   0.280
orgs vs places        0.423   0.440   0.433   0.315
people vs places      0.412   0.255   0.257   0.216
edible vs poisonous   0.127   0.135   0.082   0.071
Vary the size of Ts
[Figure: error rate vs. the ratio between same-distribution and diff-distribution training data (0 to 0.5), on people vs places and orgs vs places, comparing TrAdaBoost(SVM), SVM, and SVMt.]

Ratio – |Ts|/|Td|.
TrAdaBoost performs the best in situations where the ratio is less than 0.1.
Convergence
[Figure: error rate vs. the number of iterations (0–100) on people vs places, with one curve per ratio |Ts|/|Td| from 0.01 to 0.50.]
Conclusion
Pros
AdaBoost is extended to better solve the transfer learning problem.
TrAdaBoost preserves the properties of AdaBoost, and has some good properties on the auxiliary data.
TrAdaBoost outperforms several state-of-the-art classifiers in the transfer learning setting.

Cons
The algorithm is heuristic.
The rate of convergence is somewhat slow (O(√(ln n / N))).
Appendix Reference
References
Caruana, R. (1997). Multitask Learning. Machine Learning, 28(1), 41–75.

Freund, Y., & Schapire, R. E. (1997). A Decision-theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1997). Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. Proceedings of the Fourteenth International Conference on Machine Learning.

Schapire, R. E. (1999). A Brief Introduction to Boosting. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence.
Wu, P., & Dietterich, T. G. (2004). Improving SVM Accuracy by Training on Auxiliary Data Sources. Proceedings of the Twenty-First International Conference on Machine Learning.

Daumé III, H., & Marcu, D. (2006). Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research, 26, 101–126.