Semi-Supervised Learning
an overview
Xiaojin “Jerry” Zhu
Computer Science Department
University of Wisconsin, Madison
Outline
background
four methods
open questions
A computer scientist’s view . . .
Why semi-supervised learning?
Because people want better performance for free.
Exactly how hard is it to obtain labeled data?
Switchboard speech transcription: 400 hours of annotation for each hour of speech
film ⇒ f ih_n uh_gl_n m
be all ⇒ bcl b iy iy_tr ao_tr ao l_dl
Penn Chinese Treebank: 2 years for 4,000 sentences
example: ‘The National Track and Field Championship has finished.’
The landscape
supervised learning (classification, regression): {(x1:n, y1:n)}
↕
semi-supervised classification/regression: {(x1:l, y1:l), xl+1:n}
↕
semi-supervised clustering: {x1:n, must-links, cannot-links}
↕
unsupervised learning (clustering): {x1:n}
transduction (predictions limited to x1:n) ↔ induction (applies to unseen data)
The problem
Goal:
Using both labeled and unlabeled data to build better
learners (than using labeled data alone).
Notation:
input features x, label y
learner f : X → Y
labeled data (Xl, Yl) = {(x1:l, y1:l)}
unlabeled data Xu = {xl+1:n}
usually l ≪ n
How can Xu help?
Method 1: generative models
Self-training:
1. Train f from (Xl, Yl)
2. Predict on x ∈ Xu
3. Add (x, f(x)) to labeled data
4. Repeat
Naïve? Are errors self-reinforcing?
But if you set things up just right, this is in fact the EM
algorithm on mixture models . . .
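
As a concrete sketch (not from the original slides), the self-training loop might look like this in Python with scikit-learn; the base classifier and the confidence threshold are illustrative choices:

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, threshold=0.95, max_iter=10):
    # Self-training: repeatedly add the learner's own confident
    # predictions to the labeled set and retrain.
    for _ in range(max_iter):
        f = LogisticRegression().fit(X_l, y_l)    # 1. train f from (Xl, Yl)
        if len(X_u) == 0:
            break
        proba = f.predict_proba(X_u)              # 2. predict on x in Xu
        pick = proba.max(axis=1) >= threshold     # only keep confident predictions
        if not pick.any():
            break
        X_l = np.vstack([X_l, X_u[pick]])         # 3. add (x, f(x)) to labeled data
        y_l = np.concatenate([y_l, f.classes_[proba[pick].argmax(axis=1)]])
        X_u = X_u[~pick]                          # 4. repeat
    return f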
Method 1: generative models
Example: EM for Gaussian mixture models
θ = {p(c), µc, Σc}, c = 1, . . . , C
Start from MLE θ on (Xl, Yl), repeat:
1. E-step: compute the expected labels p(y|x, θ) for all
x ∈ Xu
assign class 1 to p(y = 1|x, θ) fraction of x
assign class 2 to p(y = 2|x, θ) fraction of x
. . .
2. M-step: update MLE θ with the original labeled and
(now labeled) unlabeled data
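
A minimal sketch of one such EM iteration (Python with numpy/scipy; classes are assumed to be coded 0, . . . , C−1, and the fractional class assignments appear as the responsibility matrix R):

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X_l, y_l, X_u, weights, means, covs):
    # One EM iteration for a C-class Gaussian mixture with labeled and
    # unlabeled data. weights: (C,), means: (C, d), covs: (C, d, d).
    C, l = len(weights), len(X_l)
    X = np.vstack([X_l, X_u])
    n = len(X)
    # E-step: expected labels p(y|x, theta) as responsibilities;
    # labeled points keep hard assignments to their true class.
    R = np.zeros((n, C))
    R[np.arange(l), y_l] = 1.0
    for c in range(C):
        R[l:, c] = weights[c] * multivariate_normal.pdf(X_u, means[c], covs[c])
    R[l:] /= R[l:].sum(axis=1, keepdims=True)
    # M-step: MLE of theta from labeled and (now fractionally labeled) data
    for c in range(C):
        r = R[:, c]
        weights[c] = r.sum() / n
        means[c] = (r[:, None] * X).sum(axis=0) / r.sum()
        d = X - means[c]
        covs[c] = (r[:, None, None] * np.einsum('ni,nj->nij', d, d)).sum(axis=0) / r.sum()
    return weights, means, covs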
Method 1: generative models
The MLE of θ without and with Xu is different.
labeled data only:
log p(Xl, Yl|θ) = ∑_{i=1}^{l} log p(yi|θ) p(xi|yi, θ)
labeled and unlabeled:
log p(Xl, Yl, Xu|θ) = ∑_{i=1}^{l} log p(yi|θ) p(xi|yi, θ) + ∑_{i=l+1}^{n} log ( ∑_{y=1}^{C} p(y|θ) p(xi|y, θ) )
[figure: two panels showing the Gaussian-mixture decision boundary fit from labeled data only vs. from labeled and unlabeled data]
In principle Xu is useful for other generative models too.
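
To make the difference concrete, here is a sketch of the semi-supervised log-likelihood for a Gaussian mixture (assuming the same θ = {p(c), µc, Σc} parameterization as above); the unlabeled term marginalizes over y with a log-sum-exp:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def loglik(X_l, y_l, X_u, weights, means, covs):
    # log p(Xl, Yl | theta): labeled points use their known class y_i
    ll = sum(np.log(weights[y]) + multivariate_normal.logpdf(x, means[y], covs[y])
             for x, y in zip(X_l, y_l))
    # log p(Xu | theta): unlabeled points are marginalized over all classes y
    log_joint = np.stack([np.log(weights[c]) +
                          multivariate_normal.logpdf(X_u, means[c], covs[c])
                          for c in range(len(weights))], axis=1)   # shape (n-l, C)
    return ll + logsumexp(log_joint, axis=1).sum()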
Method 2: multi-view
Two views of an item: image and HTML text
Feature split x = [x(1); x(2)]
Method 2: multi-view
Co-training:
1. Train f(1) from (X(1)l, Yl) and f(2) from (X(2)l, Yl).
2. Classify Xu with f (1) and f (2) separately.
3. Add most-confident (x, f (1)(x)) to f (2)’s labeled data.
4. Add most-confident (x, f (2)(x)) to f (1)’s labeled data.
5. Repeat.
Encourages agreement between two classifiers.
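
A sketch of this loop (Python with scikit-learn; naive Bayes as the base learner and one point per view per round are illustrative choices, and the two views are assumed row-aligned):

import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(Xl1, Xl2, yl, Xu1, Xu2, rounds=10):
    # Co-training: each view's classifier labels its most confident
    # unlabeled point for the other view's classifier.
    L1 = (Xl1.copy(), yl.copy())      # (features, labels) pool for view 1
    L2 = (Xl2.copy(), yl.copy())      # (features, labels) pool for view 2
    f1, f2 = GaussianNB().fit(*L1), GaussianNB().fit(*L2)
    for _ in range(rounds):
        if len(Xu1) == 0:
            break
        p1, p2 = f1.predict_proba(Xu1), f2.predict_proba(Xu2)
        i1, i2 = p1.max(axis=1).argmax(), p2.max(axis=1).argmax()
        # f1's most confident label goes to f2's pool, and vice versa
        L2 = (np.vstack([L2[0], Xu2[i1:i1 + 1]]),
              np.append(L2[1], f1.classes_[p1[i1].argmax()]))
        L1 = (np.vstack([L1[0], Xu1[i2:i2 + 1]]),
              np.append(L1[1], f2.classes_[p2[i2].argmax()]))
        keep = np.ones(len(Xu1), dtype=bool)
        keep[[i1, i2]] = False
        Xu1, Xu2 = Xu1[keep], Xu2[keep]
        f1, f2 = GaussianNB().fit(*L1), GaussianNB().fit(*L2)
    return f1, f2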
Method 2: multi-view
A regularized risk minimization framework to encourage
multi-learner agreement:
min_f ∑_{v=1}^{M} ( ∑_{i=1}^{l} c(yi, fv(xi)) + λ1 ‖fv‖²_K ) + λ2 ∑_{u,v=1}^{M} ∑_{i=l+1}^{n} (fu(xi) − fv(xi))²
M learners, c loss function (e.g., hinge)
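
As a sketch, the objective for M linear learners fv(x) = wv·x, with squared loss standing in for c (a simplifying assumption); any gradient method could then minimize it:

import numpy as np

def multiview_risk(W, Xl, yl, Xu, lam1=1e-2, lam2=1e-2):
    # Regularized risk with a pairwise agreement penalty on unlabeled data.
    # W: (M, d) matrix, one linear learner per row.
    Fl = W @ Xl.T                              # (M, l) predictions on labeled data
    Fu = W @ Xu.T                              # (M, n-l) predictions on unlabeled data
    loss = ((Fl - yl) ** 2).sum()              # sum_v sum_i c(yi, fv(xi))
    reg = lam1 * (W ** 2).sum()                # sum_v lambda1 ||fv||^2
    diff = Fu[:, None, :] - Fu[None, :, :]     # fu(xi) - fv(xi), all pairs (u, v)
    agree = lam2 * (diff ** 2).sum()           # pairwise disagreement penalty
    return loss + reg + agree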
Method 3: graph-based
Example: Classify astronomy vs. travel articles.
            d1  d5  d6  d7  d3  d4  d8  d9  d2
asteroid     •
bright       •   •
comet            •   •
year                 •   •
zodiac                   •   •
...
airport                          •
bike                             •   •
camp                                 •   •
yellowstone                              •   •
zion                                         •
(• = the word occurs in that article)
Unlabeled articles are stepping stones.
Method 3: graph-based
The graph has nodes Xl ∪ Xu and edges:
k-nearest-neighbor unweighted graph
fully connected graph, weights decay with distance: wij = exp(−‖xi − xj‖²/σ²)
other (expert knowledge)
The graph is represented by an n × n weight matrix.
[figure: a small example graph on the documents d1, d2, d3, d4]
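
A sketch of the graph construction in numpy (the RBF weights follow the formula above; the kNN sparsification is one common variant):

import numpy as np

def rbf_graph(X, sigma=1.0, k=None):
    # Weight matrix of a fully connected RBF graph; optionally keep
    # only each node's k strongest edges (kNN graph, symmetrized).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    if k is not None:
        idx = np.argsort(-W, axis=1)[:, :k]               # k nearest neighbors
        M = np.zeros_like(W, dtype=bool)
        M[np.arange(len(X))[:, None], idx] = True
        W = W * np.maximum(M, M.T)                        # keep edge if either end picks it
    return W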
Method 3: graph-based
An electric network view:
Edges are resistors with conductance wij
1-volt battery connects to labeled points y = 0, 1
The voltage at the nodes is the harmonic function f
[figure: an electric network where each edge is a resistor Rij = 1/wij and a 1-volt battery clamps the labeled nodes at 1 and 0]
The harmonic function minimizes ∑_{i,j=1}^{n} wij (f(xi) − f(xj))².
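
The minimizer has a closed form: writing the graph Laplacian L = D − W (D diagonal with the row sums of W) in blocks over labeled and unlabeled nodes, the unlabeled values are fu = −Luu⁻¹ Lul yl. A sketch:

import numpy as np

def harmonic(W, y_l):
    # Harmonic solution on a graph whose first l nodes carry labels 0/1.
    l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    L_uu, L_ul = L[l:, l:], L[l:, :l]
    f_u = np.linalg.solve(L_uu, -L_ul @ y_l)    # f_u = -L_uu^{-1} L_ul y_l
    return f_u                                  # values in [0, 1]; threshold at 1/2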
Method 3: graph-based
Manifold regularization extends the harmonic solution to
handle
noisy labels
unseen examples
min_f ∑_{i=1}^{l} c(yi, f(xi)) + λ1 ‖f‖²_K + λ2 ∑_{i,j=1}^{n} wij (f(xi) − f(xj))²
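
A gradient-descent sketch of this objective for a linear f (squared loss standing in for c; both are simplifying assumptions):

import numpy as np

def manifold_reg_fit(X, y_l, W, lam1=1e-2, lam2=1e-2, lr=1e-3, steps=1000):
    # Manifold regularization with f(x) = w . x and squared loss.
    # X: all n points (first l labeled); W: (n, n) graph weight matrix.
    l = len(y_l)
    L = np.diag(W.sum(axis=1)) - W      # Laplacian; sum_ij wij (fi - fj)^2 = 2 f^T L f
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        f = X @ w
        grad = 2 * X[:l].T @ (f[:l] - y_l)    # labeled loss term
        grad += 2 * lam1 * w                  # ||f||^2 regularizer (linear kernel)
        grad += 4 * lam2 * X.T @ (L @ f)      # graph smoothness term
        w -= lr * grad
    return w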
Method 4: S3VMs
Semi-supervised SVMs (S3VMs, transductive SVMs):
maximizing “unlabeled margin”
[figure: labeled + and − points; the S3VM boundary maximizes the margin with respect to the unlabeled data]
Method 4: S3VMs
standard SVM with hinge loss:
min_f ∑_{i=1}^{l} (1 − yi f(xi))+ + λ ‖f‖²_K
[figure: the hinge loss (1 − yi f(xi))+ as a function of yi f(xi)]
Prefers labeled points on the ‘correct’ side.
Method 4: S3VMs
How to incorporate unlabeled points?
Assign putative labels sign(f(x)) to x ∈ Xu, so that sign(f(x)) f(x) = |f(x)|.
The hinge loss on unlabeled points becomes (1 − yi f(xi))+ = (1 − |f(xi)|)+.
Semi-supervised SVMs:
min_f ∑_{i=1}^{l} (1 − yi f(xi))+ + λ1 ‖f‖²_K + λ2 ∑_{i=l+1}^{n} (1 − |f(xi)|)+
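
A sketch of this objective for a linear f(x) = w·x (numpy; labels in {−1, +1}):

import numpy as np

def s3vm_objective(w, Xl, yl, Xu, lam1=1e-2, lam2=1e-2):
    # Hinge loss on labeled points, hat loss on unlabeled points.
    hinge = np.maximum(0, 1 - yl * (Xl @ w)).sum()    # (1 - yi f(xi))_+
    hat = np.maximum(0, 1 - np.abs(Xu @ w)).sum()     # (1 - |f(xi)|)_+
    return hinge + lam1 * (w @ w) + lam2 * hat

Note that the hat loss makes this objective non-convex (see Question 4 below).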
Method 4: S3VMs
The hat loss (1 − |f(xi)|)+ prefers f(x) ≥ 1 or f(x) ≤ −1.
[figure: the hat loss (1 − |f(xi)|)+ as a function of f(xi)]
min_f ∑_{i=1}^{l} (1 − yi f(xi))+ + λ1 ‖f‖²_K + λ2 ∑_{i=l+1}^{n} (1 − |f(xi)|)+
The third term prefers unlabeled points outside the margin.
Question 1
What can we say about convergence, consistency, etc.?
n → ∞, l fixed
l → 0
l → ∞, n → ∞, l/n → 0
Question 2
∑_{i=1}^{l} log p(yi|θ) p(xi|yi, θ) + ∑_{i=l+1}^{n} log ( ∑_{y=1}^{C} p(y|θ) p(xi|y, θ) )

min_f ∑_{v=1}^{M} ( ∑_{i=1}^{l} c(yi, fv(xi)) + λ1 ‖fv‖²_K ) + λ2 ∑_{u,v=1}^{M} ∑_{i=l+1}^{n} (fu(xi) − fv(xi))²

min_f ∑_{i=1}^{l} c(yi, f(xi)) + λ1 ‖f‖²_K + λ2 ∑_{i,j=1}^{n} wij (f(xi) − f(xj))²

min_f ∑_{i=1}^{l} (1 − yi f(xi))+ + λ1 ‖f‖²_K + λ2 ∑_{i=l+1}^{n} (1 − |f(xi)|)+
Question 2
Why 4 methods?
Why not just 1 method?
What is the model that unifies semi-supervised
learning?
Why not 40 methods?
What are new semi-supervised learning approaches?
Question 3
no pain, no gain
no model assumption, no gain
wrong model assumption, no gain, a lot of pain
How do we know that we are making the right model
assumptions?
= Which semi-supervised learning method should I use?
Question 4
Are the methods practical?
e.g., S3VM is not convex
[figure: the hat loss is non-convex]
[figure: log-log plot of unlabeled data size (10^2 to 10^10, with reference lines for people in a full stadium, internet users in the US, and the world population) against labeled data size (10^0 to 10^4)]
Question 5
Do humans do semi-supervised learning?
word learning in 17-month-old infants
hearing a word beforehand ⇒ easier to associate the word
with a visual object
References
Google “semi-supervised learning survey”
thank you