
Relaxed Transfer of Different Classes

via Spectral Partition

Xiaoxiao Shi1 Wei Fan2 Qiang Yang3 Jiangtao Ren4

1 University of Illinois at Chicago; 2 IBM T. J. Watson Research Center;
3 Hong Kong University of Science and Technology; 4 Sun Yat-sen University

1. Unsupervised
2. Can use data with different classes to help. How so?


What is Transfer Learning?

Standard supervised learning: a classifier is trained on labeled New York Times articles and tested on unlabeled New York Times articles, reaching 85.5% accuracy.

What is Transfer Learning?

In reality, the labeled New York Times training data are insufficient: the same setup reaches only 47.3% accuracy on the unlabeled test data. How can we improve the performance?


What is Transfer Learning?

Transfer learning: a transfer classifier is trained on a labeled source domain (Reuters) and tested on an unlabeled target domain (New York Times), reaching 82.6%. The source and target are not necessarily from the same domain and do not follow the same distribution.


Transfer across Different Class Labels

The same setup: a transfer classifier is trained on the labeled source domain (Reuters) and applied to the unlabeled target domain (New York Times), reaching 82.6%.

Since they are from different domains, they may have different class labels!

Source labels: Markets, Politics, Entertainment, Blogs, …
Target labels: World, U.S., Fashion & Style, Travel, …

How to transfer when the class labels are different in number and meaning?


Two Main Categories of Transfer Learning

• Unsupervised Transfer Learning
  – Does not have any labeled data from the target domain.
  – Uses the source domain to help learning.
  – Question: is it better than clustering?

• Supervised Transfer Learning
  – Has a limited number of labeled examples from the target domain.
  – Question: is it better than not using any source data examples?


Transfer across Different Class Labels

• Two sub-problems:
  – (1) What and how to transfer, since we cannot explicitly use P(x|y) or P(y|x) to build the similarity among tasks (the class labels y have different meanings)?
  – (2) How to avoid negative transfer, since the tasks may be from very different domains?

Negative transfer: when the tasks are too different, transfer learning may hurt learning accuracy.


The proposed solution

• (1) What and how to transfer?
  – Transfer the eigenspace.

Eigenspace: the space spanned by a set of eigenvectors.

[Scatter plot: the dataset exhibits complex cluster shapes. K-means performs very poorly in this space due to its bias toward dense spherical clusters.]

[Scatter plot: in the eigenspace (the space given by the eigenvectors), the clusters are trivial to separate. This is what spectral clustering does.]
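To make the contrast concrete, here is a minimal sketch (assuming scikit-learn and the classic two-moons toy data, which are stand-ins for the slide's example, not taken from it) comparing k-means in the input space against spectral clustering, which clusters in the eigenspace of the graph Laplacian:

```python
# Minimal sketch: k-means vs. spectral clustering on non-spherical clusters.
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means works directly in the input space and prefers spherical clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering first embeds the data into the eigenspace of the
# graph Laplacian, where the two moons become easy to separate.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)

print("k-means ARI: ", adjusted_rand_score(y, km))   # typically low
print("spectral ARI:", adjusted_rand_score(y, sc))   # typically near 1.0
```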


The proposed solution

• (2) How to avoid negative transfer?
  – A new clustering-based KL divergence to reflect distribution differences.
  – If the distributions are too different (KL is large), automatically decrease the effect from the source domain.

Traditional KL divergence: KL(P || Q) = Σ_x P(x) log( P(x) / Q(x) ). It requires P(x) and Q(x) for every x, which is normally difficult to obtain.

To get the clustering-based KL divergence:
(1) Perform clustering on the combined dataset.
(2) Calculate the KL divergence from some basic statistical properties of the clusters. See the example below; a code sketch follows.
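A minimal sketch of the two-step procedure (my own instantiation, not the paper's exact estimator: here the per-cluster membership proportions are compared with a smoothed discrete KL):

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_based_kl(P, Q, n_clusters=2, eps=1e-10, random_state=0):
    """Estimate KL(P || Q) from cluster membership proportions.

    Sketch of the slide's two steps: (1) cluster the combined data,
    (2) compare, per cluster, the fraction of examples coming from each
    dataset. The paper's exact combination rule may differ.
    """
    # Step 1: cluster the combined dataset.
    X = np.vstack([P, Q])
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X)

    # Step 2: per-cluster counts of P's and Q's examples.
    p_hist = np.array([np.sum(labels[:len(P)] == c) for c in range(n_clusters)])
    q_hist = np.array([np.sum(labels[len(P):] == c) for c in range(n_clusters)])
    p = (p_hist + eps) / (p_hist.sum() + eps * n_clusters)
    q = (q_hist + eps) / (q_hist.sum() + eps * n_clusters)

    # Discrete KL divergence over the cluster distributions.
    return float(np.sum(p * np.log(p / q)))
```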


An Example

Cluster the combined dataset P ∪ Q; suppose this yields two clusters, C1 and C2. Here S(P', C) denotes the portion of examples in cluster C that come from P (consistent with the numbers below: C1 contains 3 examples from P and 3 from Q; C2 contains 5 from P and 4 from Q; 15 examples in total).

S(P', C1) = 0.5      S(Q', C1) = 0.5
S(P', C2) = 5/9      S(Q', C2) = 4/9

E(P) = 8/15,  E(Q) = 7/15
P'(C1) = 3/15,  Q'(C1) = 3/15,  P'(C2) = 5/15,  Q'(C2) = 4/15

KL = 0.0309
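To make the bookkeeping explicit, a tiny script (cluster counts taken from the example; the variable names are mine) reproducing the slide's statistics:

```python
# Cluster composition from the example: (examples from P, examples from Q).
counts = {"C1": (3, 3), "C2": (5, 4)}
n_total = sum(p + q for p, q in counts.values())   # 15
n_p = sum(p for p, _ in counts.values())           # 8
n_q = sum(q for _, q in counts.values())           # 7

print(f"E(P) = {n_p}/{n_total}, E(Q) = {n_q}/{n_total}")
for name, (p, q) in counts.items():
    size = p + q
    print(f"S(P',{name}) = {p}/{size}, S(Q',{name}) = {q}/{size}, "
          f"P'({name}) = {p}/{n_total}, Q'({name}) = {q}/{n_total}")
# The paper combines these statistics into its clustering-based KL,
# which evaluates to 0.0309 for this example (formula not shown here).
```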


Objective Function

• Objective: find an eigenspace that separates the target data well.
  – Intuition: if the source data is similar to the target data, make good use of the source eigenspace;
  – Otherwise, keep the original structure of the target data.

The objective combines a traditional normalized cut with a penalty term, trading off "prefer the source eigenspace" against "prefer the original structure". The trade-off is balanced by R(L; U): the more similar the distributions, the smaller R(L; U) is, and the more the function relies on the source eigenspace TL. A schematic sketch follows.
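A minimal sketch of how such a balance could behave (my own illustration; the function names and the exponential form are assumptions, not the paper's exact R(L; U)):

```python
import numpy as np

def source_weight(kl_divergence, scale=1.0):
    # Illustrative only: the influence of the source eigenspace decays
    # as the clustering-based KL between domains grows. The paper's
    # R(L; U) realizes the same monotone trade-off, possibly differently.
    return np.exp(-scale * kl_divergence)

def combined_objective(ncut, source_penalty, target_penalty, kl):
    w = source_weight(kl)
    # Small KL (similar domains): lean on the source eigenspace TL.
    # Large KL (different domains): keep the target's own structure TU.
    return ncut + w * source_penalty + (1.0 - w) * target_penalty
```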


How to construct the constraints TL and TU?

• Principle:
  – To construct TL: derive it directly from the "must-link" constraint (examples with the same label should be together).
  – To construct TU: (1) perform standard spectral clustering (e.g., Ncut) on U; (2) examples in the same cluster should be together.

Example graph with nodes 1–6:
  From the labels: 1, 2, 4 should be together (blue); 3, 5, 6 should be together (red).
  From the clustering: 1, 2, 3 should be together; 4, 5, 6 should be together.


How to construct the constraints TL and TU?

• Construct the constraint matrix M = [m1, m2, …, mr]', one row per must-link pair: +1 for one node, -1 for the other. For the example graph:

       | 1  -1   0   0   0   0 |   (1 and 2)
  ML = | 1   0   0  -1   0   0 |   (1 and 4)
       | 0   0   1   0  -1   0 |   (3 and 5)
       | …                     |

A code sketch of this construction follows.
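A minimal sketch (assuming must-link pairs are given as index tuples; the helper name is mine, not from the paper):

```python
import numpy as np

def must_link_matrix(pairs, n):
    """Build the constraint matrix M: one row per must-link pair (i, j),
    with +1 at column i and -1 at column j, so that M @ v == 0 exactly
    when each constrained pair receives the same embedding value."""
    M = np.zeros((len(pairs), n))
    for row, (i, j) in enumerate(pairs):
        M[row, i - 1] = 1    # 1-based node ids, as on the slide
        M[row, j - 1] = -1
    return M

# Pairs from the slide's example: (1, 2), (1, 4), (3, 5), ...
M_L = must_link_matrix([(1, 2), (1, 4), (3, 5)], n=6)
print(M_L)
```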


Experiment Data Sets

[Tables describing the experiment data sets were shown as images and are not reproduced in this transcript.]

Text Classification

Task A: Comp1 vs. Rec1, with three source settings: (1) Comp2 vs. Rec2; (2) 4 classes (Graphics, etc.); (3) 3 classes (crypt, etc.).
Task B: Org1 vs. People1, with three source settings: (1) Org2 vs. People2; (2) 3 classes (Places, etc.); (3) 3 classes (crypt, etc.).

[Bar charts: accuracy of Full Transfer, No Transfer, and RSP on source settings 1–3 for each task.]


Image Classification

Task A: Homer vs. Real Bear, with three source settings: (1) Superman vs. Teddy; (2) 3 classes (cartman, etc.); (3) 4 classes (laptop, etc.).
Task B: Cartman vs. Fern, with three source settings: (1) Superman vs. Bonsai; (2) 3 classes (homer, etc.); (3) 4 classes (laptop, etc.).

[Bar charts: accuracy of Full Transfer, No Transfer, and RSP on source settings 1–3 for each task.]


Parameter Sensitivity

[Parameter sensitivity plots were shown as images and are not reproduced in this transcript.]


Conclusions

• Problem: transfer across tasks with different class labels.
• Two sub-problems:
  • (1) What and how to transfer? Transfer the eigenspace.
  • (2) How to avoid negative transfer? Propose an effective clustering-based KL divergence; if KL is large, i.e., the distributions are too different, decrease the effect from the source domain.


Thanks!

Datasets and codes: http://www.cs.columbia.edu/~wfan/software.htm


# Clusters?

Condition for Lemma 1 to be valid: in each cluster, the expected values of the target and source data are about the same.

Adaptively control the number of clusters to guarantee that Lemma 1 is valid: stop the bisecting clustering when a cluster contains only target or only source data, or when the difference between the expected values of the target and source data in that cluster is close to 0. A code sketch follows.
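A minimal sketch of this stopping rule (my own illustration; the 2-means bisection step, the tolerance, and the function name are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def adaptive_bisect(X, is_source, tol=1e-2, min_size=2):
    """Recursively bisect a cluster until, inside each leaf, the mean of
    the source examples is close to the mean of the target examples
    (the condition for Lemma 1), or the leaf is single-domain."""
    src, tgt = X[is_source], X[~is_source]
    # Stop: only target or only source data in this cluster.
    if len(src) == 0 or len(tgt) == 0 or len(X) < min_size:
        return [(X, is_source)]
    # Stop: expected values (means) are already about the same.
    if np.linalg.norm(src.mean(axis=0) - tgt.mean(axis=0)) < tol:
        return [(X, is_source)]
    # Otherwise bisect with 2-means and recurse on each half.
    half = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    leaves = []
    for c in (0, 1):
        mask = half == c
        if mask.sum() == 0:          # degenerate split; stop here
            return [(X, is_source)]
        leaves += adaptive_bisect(X[mask], is_source[mask], tol, min_size)
    return leaves
```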


Optimization

[The derivation (the "Let …, then …" equations) and the algorithm flow were shown as images and are not reproduced in this transcript.]