
Page 1:

A Comparative Study of Methods for Transductive Transfer Learning

Andrew Arnold, Ramesh Nallapati, William W. Cohen
Machine Learning Department

Carnegie Mellon University

IEEE ICDM Workshop on Mining and Management of Biological Data

October 28, 2007

Page 2:

What we are able to do:

Train: The neuronal cyclin-dependent kinase p35/cdk5 comprises a catalytic subunit (cdk5) and an activator subunit (p35)

Test: Reversible histone acetylation changes the chromatin structure and can modulate gene transcription. Mammalian histone deacetylase 1 (HDAC1)

• Supervised learning
  – Train on large, labeled data sets drawn from the same distribution as the testing data
  – Well-studied problem

Page 3:

What we're getting better at doing:

• Semi-supervised learning
  – Same as before, but now:
    • Add large unlabeled or weakly labeled data sets from the same domain
  – [Zhu '05, Grandvalet '05]

Train: The neuronal cyclin-dependent kinase p35/cdk5 comprises a catalytic subunit (cdk5) and an activator subunit (p35)

Auxiliary (available for training): Reversible histone acetylation changes the chromatin structure and can modulate gene transcription. Mammalian histone deacetylase 1 (HDAC1)

Page 4:

What we're getting better at doing:

• Transductive learning
  – Unlabeled test data is available during training
  – Easier than inductive learning:
    • Learning specific predictions rather than a general function
    • [Joachims '99, '03, Sindhwani '05, Vapnik '98]

Train: The neuronal cyclin-dependent kinase p35/cdk5 comprises a catalytic subunit (cdk5) and an activator subunit (p35)

Auxiliary & Test: Reversible histone acetylation changes the chromatin structure and can modulate gene transcription. Mammalian histone deacetylase 1 (HDAC1)

Page 5:

What we'd like to be able to do:

• Transfer learning (domain adaptation):
  – Leverage large, previously labeled data from a related domain
    • Related domain we'll be training on (with lots of data): Source
    • Domain we're interested in and will be tested on (data scarce): Target
  – [Ng '06, Daumé '06, Jiang '06, Blitzer '06, Ben-David '07, Thrun '96]

Train (source domain: Abstract): The neuronal cyclin-dependent kinase p35/cdk5 comprises a catalytic subunit (cdk5) and an activator subunit (p35)

Test (target domain: Caption): Neuronal cyclin-dependent kinase p35/cdk5 (Fig 1, a) comprises a catalytic subunit (cdk5, left panel) and an activator subunit (p35, fmi #4)

Another example pair: source domain E-mail, target domain IM.

Page 6:

What we'd like to be able to do:

• Transfer learning (multi-task):
  – Same domain, but a slightly different task
  – Related task we'll be training on (with lots of data): Source
  – Task we're interested in and will be tested on (data scarce): Target
  – [Ando '05, Sutton '05]

Train (source task: Proteins): The neuronal cyclin-dependent kinase p35/cdk5 comprises a catalytic subunit (cdk5) and an activator subunit (p35)

Test (target task: Action Verbs): Reversible histone acetylation changes the chromatin structure and can modulate gene transcription. Mammalian histone deacetylase 1 (HDAC1)

Another example pair: source task Names, target task Pronouns.

Page 7:

Motivation

• Why is transfer important?
  – Often we violate the non-transfer assumption without realizing it. How much data is truly independent and identically distributed (i.i.d.)?
    • E.g., different authors, annotators, time periods, sources
  – Large amounts of labeled data and trained classifiers already exist
    • Why waste data & computation?
    • Can learning be made easier by leveraging related domains/problems?
  – Life-long learning
• Why is transduction important?
  – Why solve a harder problem than we need to?
  – Unlabeled data is vast and cheap
• Are transduction and transfer so different?
  – Can we learn more about one by studying the other?

Page 8:

Outline

• Motivating Problems
  – Supervised learning
  – Semi-supervised learning
  – Transductive learning
  – Transfer learning: domain adaptation
  – Transfer learning: multi-task
• Methods
  – Maximum entropy (MaxEnt)
  – Source-regularized maximum entropy
  – Feature space expansion
  – Feature selection
  – Feature space transformation
  – Iterative Pseudo Labeling (IPL)
  – Biased thresholding
  – Support Vector Machines (SVMs)
    • Inductive SVM
    • Transductive SVM
• Experiment:
  – Domain & Data
• Results
• Conclusions & Contributions
• Limitations & future work

Page 9:

Maximum Entropy (MaxEnt)

• Discriminative model
  – Matches the model's feature expectations to the data

Conditional likelihood:

Regularized optimization:
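In standard form, with features f_i, weights λ_i, and a Gaussian (L2) prior of variance σ² (a common choice; the slides' exact notation may differ), these are, respectively:

P(y \mid x; \Lambda) = \frac{\exp\left( \sum_i \lambda_i f_i(x, y) \right)}{\sum_{y'} \exp\left( \sum_i \lambda_i f_i(x, y') \right)}

\Lambda^{*} = \arg\max_{\Lambda} \sum_{(x,y) \in D} \log P(y \mid x; \Lambda) \;-\; \frac{\|\Lambda\|_2^2}{2\sigma^2}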

Page 10:


Summary of Learning Settings

Page 11:

Source-regularized MaxEnt

• Instead of regularizing towards zero:
  – Learn model parameters Λ on the source data
  – During target training, regularize towards the source-trained Λ

[Chelba '04]
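Concretely, following [Chelba '04], the Gaussian prior is re-centered on the source-trained weights (written Λ_src here, our notation):

\Lambda^{*} = \arg\max_{\Lambda} \sum_{(x,y) \in D_{tgt}} \log P(y \mid x; \Lambda) \;-\; \frac{\|\Lambda - \Lambda_{src}\|_2^2}{2\sigma^2}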

Page 12:

Feature Space Expansion

• Add extra degrees of freedom
  – Allow the classifier to discern general vs. domain-specific features

[Daumé '06, '07]
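A minimal sketch of this augmentation in Python (the three-block layout and the function name augment are ours, for illustration; [Daumé '06, '07] describe the shared/source/target copy idea):

import numpy as np

def augment(x, domain):
    """Expand a d-dim feature vector to 3d dims: (shared, source-only, target-only)."""
    zeros = np.zeros_like(x)
    if domain == "source":
        # Source examples activate the shared block and the source block.
        return np.concatenate([x, x, zeros])
    # Target examples activate the shared block and the target block.
    return np.concatenate([x, zeros, x])

Training any linear classifier on the augmented vectors then lets each weight decompose into a general component plus a domain-specific correction.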

Page 13:

Feature Selection

• Emphasize features shared by the source and target data
• De-emphasize features that differ between them
• How to measure?
  – Fisher exact test (sketched below): is P(feature | source) == P(feature | target)?
    • If so, the feature is shared: keep it
    • If not, the feature is domain-specific: discard it
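A small sketch of this filter using SciPy's Fisher exact test (the count-based interface and the 0.05 cutoff are illustrative assumptions, not values from the slides):

from scipy.stats import fisher_exact

def keep_feature(src_with, src_without, tgt_with, tgt_without, alpha=0.05):
    """Keep a feature whose occurrence rate looks the same in both domains."""
    # 2x2 contingency table: rows are domains, columns are counts of
    # examples with / without the feature.
    table = [[src_with, src_without],
             [tgt_with, tgt_without]]
    _, p_value = fisher_exact(table)
    # Cannot reject P(feature|source) == P(feature|target): shared, so keep.
    return p_value >= alpha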

Page 14:

Feature Space Transformation

• Source and target are originally only independently separable
• Learn a transformation, G, that allows joint separation:
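One plausible way to formalize this (assuming a linear map G applied to target inputs and a shared loss ℓ for a joint separator w; the slide's own equation may differ):

\min_{G, w} \; \sum_{(x,y) \in D_{src}} \ell\left(y, w^{\top} x\right) \;+\; \sum_{(x,y) \in D_{tgt}} \ell\left(y, w^{\top} G x\right)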

Page 15:

Iterative Pseudo Labeling (IPL)

• Novel algorithm for MaxEnt-based transfer
• Adjust feature values to match feature expectations in source and target
• θ trades off certainty vs. adaptivity (see the sketch below)
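As a rough sketch of the loop (self-training flavor only; the paper's actual update matches feature expectations, and LogisticRegression here is just a stand-in for MaxEnt):

import numpy as np
from sklearn.linear_model import LogisticRegression

def ipl(X_src, y_src, X_tgt, theta=0.95, n_iters=5):
    """Iteratively add confidently pseudo-labeled target data and retrain.

    Higher theta stays closer to the source model (certainty);
    lower theta adapts faster to the target (adaptivity).
    """
    model = LogisticRegression(max_iter=1000).fit(X_src, y_src)
    for _ in range(n_iters):
        probs = model.predict_proba(X_tgt)
        confident = probs.max(axis=1) >= theta  # keep only sure predictions
        if not confident.any():
            break
        pseudo_y = model.classes_[probs.argmax(axis=1)[confident]]
        X_aug = np.vstack([X_src, X_tgt[confident]])
        y_aug = np.concatenate([y_src, pseudo_y])
        model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
    return model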

Page 16:

IPL analysis

Given a linear transform, we can express the conditional feature expectations of the target data in terms of a transformation of the source's:
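Schematically, in our notation (not necessarily the paper's): if the target's features relate to the source's by a linear transform W, then by linearity of expectation the conditional feature expectations transform the same way:

f^{tgt}(x, y) = W f^{src}(x, y) \quad \Longrightarrow \quad \mathbb{E}\left[ f^{tgt}(x, y) \mid y \right] = W \, \mathbb{E}\left[ f^{src}(x, y) \mid y \right]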

Page 17:

Biased Thresholding

• Different proportions of positive examples
  – E.g., learning to predict rain in humid vs. arid climates
• How to maximize F1 (and not accuracy)? Two options, sketched below:
  – Score Cut (s-cut)
    • Select a score threshold over the ranked training scores
    • Apply it to the test data
  – Percentage Cut (p-cut)
    • Estimate the proportion of positive examples expected in the target data
    • Set the threshold so as to select this amount
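A minimal sketch of both cuts (function names are ours; the quantile implementation of p-cut is one convenient choice):

import numpy as np
from sklearn.metrics import f1_score

def s_cut(train_scores, train_labels):
    """Pick the training-score threshold that maximizes training F1."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.unique(train_scores):
        f1 = f1_score(train_labels, train_scores >= t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t  # apply at test time as: test_scores >= best_t

def p_cut(target_scores, expected_pos_rate):
    """Pick the cutoff that labels the expected fraction of examples positive."""
    return np.quantile(target_scores, 1.0 - expected_pos_rate)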

Page 18:

Support Vector Machines (SVMs)

• Inductive (standard) SVM:
  – Learn a separating hyperplane on labeled training data, then evaluate on held-out testing data
• Transductive SVM:
  – Learn the hyperplane in the presence of labeled training data AND unlabeled testing data, using the distribution of the testing points to assist
  – Easier to learn particular labels than a whole function
  – More expensive than inductive
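For reference, the transductive objective of [Joachims '99] also optimizes over the unknown test labels y*_j (C and C* weigh the labeled and unlabeled slack terms):

\min_{w, b, y^{*}, \xi, \xi^{*}} \; \frac{1}{2}\|w\|^{2} + C \sum_{i} \xi_{i} + C^{*} \sum_{j} \xi^{*}_{j}

subject to y_i (w \cdot x_i + b) \ge 1 - \xi_i for labeled points, y^{*}_j (w \cdot x^{*}_j + b) \ge 1 - \xi^{*}_j for unlabeled test points, and \xi_i, \xi^{*}_j \ge 0.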

Page 19:

Transductive vs. Inductive SVM

[Joachims '99, '03]

Page 20:


Domain

Page 21:

Data

              UT         Yapex
% positive    7%         15%
positive      14,360     9,058
negative      202,435    51,472
total         216,795    60,530
abstracts     747        200

• Notice the differences in:
  – Length and density of protein names
  – Number of training examples: ||UT|| ≈ 4 × ||Yapex||
  – % positive examples: twice as many in Yapex

<prot> p38 stress-activated protein kinase </prot> inhibitor reverses <prot> bradykinin B(1) receptor </prot>-mediated component of inflammatory hyperalgesia.

<Protname>p35</Protname>/<Protname>cdk5 </Protname> binds and phosphorylates <Protname>beta-catenin</Protname> and regulates <Protname>beta-catenin </Protname> / <Protname>presenilin-1</Protname> interaction.

Page 22:

Experiment

• Examining three dimensions:
  – Labeled vs. unlabeled vs. prior auxiliary data
    • E.g., % target positive examples, few labeled target data
  – Transduction vs. induction
  – Transfer vs. non-transfer
• Since there are few true positives, we focus on F1 := (2 × Precision × Recall) / (Precision + Recall)
• Source = UT, target = Yapex
• For IPL, θ = 0.95 (conservative)

Page 23:

Results: Transfer

• Transfer is much more difficult
  – Accuracy is not the problem

Page 24:

Results: Transduction

• Transduction helps in the transfer setting
  – TSVM copes better than MaxEnt or ISVM

Page 25:

Results: IPL

• IPL can help boost performance
  – Makes transfer MaxEnt competitive with TSVM
  – But it is bounded by the quality of the initial pseudo-labels

Page 26:

Results: Priors

• Priors improve unsupervised transfer
  – Thresholding helps balance recall and precision, yielding better F1
  – A little bit of knowledge can help a lot

Page 27:

Results: Supervision

• Supervised transfer beats supervised non-transfer
  – Significant at a 99% binomial CI on precision and recall
• But not by as much as might be hoped for
• Even relatively simple transfer methods can help

Page 28:

Conclusions & Contributions

• Introduced a novel MaxEnt transfer method: IPL
  – Can match transduction in the unsupervised setting
  – Gives probabilistic results
• Analyzed and compared various methods related to transfer learning and concluded:
  – Transfer is hard
    • But made easier when explicitly addressed
  – Transduction is a good start
    • TSVM excels even with scant prior knowledge
  – A little prior target knowledge is even better
    • No need for a fully labeled target data set

Page 29:

Limitations & Future Work

• Thresholding is important:
  – Currently only used at test time
  – Why not incorporate it earlier, to get better pseudo-labels?
• Priors seem to help a lot:
  – Currently only using feature means; what about variances?
• Can structuring the feature space lead to parsimonious, transferable priors?

[Figure: a feature hierarchy rooted at token, with left/right context branches and leaf features such as token.is.capitalized and token.is.numeric]

Page 30:

Limitations & Future Work: high-level

• How can we make better use of the source data?
  – Why doesn't the source data help more?
• Is IPL convex?
  – Is this exactly what we want to optimize?
  – How does regularization affect convexity?
• What, exactly, is the relationship between transduction and transfer?
  – Can their theories be unified?
• When is it worth explicitly modeling transfer?
  – How different do the domains need to be?
  – How much source/target data do we need?
  – What kind of priors do we need?

Page 31:


☺ Thank you! ☺

¿ Questions ?

Page 32:


References