Semi-Supervised Learning
an overview
Xiaojin “Jerry” Zhu
Computer Science Department
University of Wisconsin, Madison
Outline
background
four methods
open questions
A computer scientist’s view . . .
Why semi-supervised learning?
Because people want better performance for free.
Exactly how hard is it to obtain labeled data?
Switchboard speech transcription: 400 hours of annotation for each hour of speech
film ⇒ f ih_n uh_gl_n m
be all ⇒ bcl b iy iy_tr ao_tr ao l_dl
Penn Chinese Treebank: 2 years for 4,000 sentences
example: ‘The National Track and Field Championship has finished.’
The landscape
supervised learning (classification, regression): {(x1:n, y1:n)}
↕
semi-supervised classification/regression: {(x1:l, y1:l), xl+1:n}
↕
semi-supervised clustering: {x1:n, must-links, cannot-links}
↕
unsupervised learning (clustering): {x1:n}
transduction (predictions limited to x1:n) ↔ induction (applies to unseen data)
The problem
Goal:
Using both labeled and unlabeled data to build better
learners (than using labeled data alone).
Notation:
input features x, label y
learner f : X → Y
labeled data (Xl, Yl) = {(x1:l, y1:l)}
unlabeled data Xu = {xl+1:n}
usually l ≪ n
How can Xu help?
Method 1: generative models
Self-training:
1. Train f from (Xl, Yl)
2. Predict on x ∈ Xu
3. Add (x, f(x)) to labeled data
4. Repeat
Naïve? Are errors self-reinforcing?
But if you set things up just right, this is in fact the EM
algorithm on mixture models . . .
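
As a concrete sketch (not from the original slides), the self-training loop might look like this in Python with scikit-learn; the base classifier and the confidence threshold are illustrative choices:

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, threshold=0.95, max_iter=10):
    # Self-training: repeatedly add the learner's own confident
    # predictions to the labeled set and retrain.
    for _ in range(max_iter):
        f = LogisticRegression().fit(X_l, y_l)    # 1. train f from (Xl, Yl)
        if len(X_u) == 0:
            break
        proba = f.predict_proba(X_u)              # 2. predict on x in Xu
        pick = proba.max(axis=1) >= threshold     # only keep confident predictions
        if not pick.any():
            break
        X_l = np.vstack([X_l, X_u[pick]])         # 3. add (x, f(x)) to labeled data
        y_l = np.concatenate([y_l, f.classes_[proba[pick].argmax(axis=1)]])
        X_u = X_u[~pick]                          # 4. repeat
    return f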
Method 1: generative models
Example: EM for Gaussian mixture models
θ = {p(c), µc, Σc}, c = 1, . . . , C
Start from MLE θ on (Xl, Yl), repeat:
1. E-step: compute the expected labels p(y|x, θ) for all
x ∈ Xu
assign class 1 to p(y = 1|x, θ) fraction of x
assign class 2 to p(y = 2|x, θ) fraction of x
. . .
2. M-step: update MLE θ with the original labeled and
(now labeled) unlabeled data
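
A minimal sketch of one such EM iteration (Python with numpy/scipy; classes are assumed to be coded 0, . . . , C−1, and the fractional class assignments appear as the responsibility matrix R):

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X_l, y_l, X_u, weights, means, covs):
    # One EM iteration for a C-class Gaussian mixture with labeled and
    # unlabeled data. weights: (C,), means: (C, d), covs: (C, d, d).
    C, l = len(weights), len(X_l)
    X = np.vstack([X_l, X_u])
    n = len(X)
    # E-step: expected labels p(y|x, theta) as responsibilities;
    # labeled points keep hard assignments to their true class.
    R = np.zeros((n, C))
    R[np.arange(l), y_l] = 1.0
    for c in range(C):
        R[l:, c] = weights[c] * multivariate_normal.pdf(X_u, means[c], covs[c])
    R[l:] /= R[l:].sum(axis=1, keepdims=True)
    # M-step: MLE of theta from labeled and (now fractionally labeled) data
    for c in range(C):
        r = R[:, c]
        weights[c] = r.sum() / n
        means[c] = (r[:, None] * X).sum(axis=0) / r.sum()
        d = X - means[c]
        covs[c] = (r[:, None, None] * np.einsum('ni,nj->nij', d, d)).sum(axis=0) / r.sum()
    return weights, means, covs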
Method 1: generative models
The MLE of θ without and with Xu is different.
labeled data only:
log p(Xl, Yl|θ) = ∑_{i=1}^{l} log p(yi|θ) p(xi|yi, θ)
labeled and unlabeled:
log p(Xl, Yl, Xu|θ) = ∑_{i=1}^{l} log p(yi|θ) p(xi|yi, θ) + ∑_{i=l+1}^{n} log ( ∑_{y=1}^{C} p(y|θ) p(xi|y, θ) )
[figure: two panels showing the Gaussian-mixture decision boundary fit from labeled data only vs. from labeled and unlabeled data]
In principle Xu is useful for other generative models too.
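
To make the difference concrete, here is a sketch of the semi-supervised log-likelihood for a Gaussian mixture (assuming the same θ = {p(c), µc, Σc} parameterization as above); the unlabeled term marginalizes over y with a log-sum-exp:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def loglik(X_l, y_l, X_u, weights, means, covs):
    # log p(Xl, Yl | theta): labeled points use their known class y_i
    ll = sum(np.log(weights[y]) + multivariate_normal.logpdf(x, means[y], covs[y])
             for x, y in zip(X_l, y_l))
    # log p(Xu | theta): unlabeled points are marginalized over all classes y
    log_joint = np.stack([np.log(weights[c]) +
                          multivariate_normal.logpdf(X_u, means[c], covs[c])
                          for c in range(len(weights))], axis=1)   # shape (n-l, C)
    return ll + logsumexp(log_joint, axis=1).sum()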
Method 2: multi-view
Two views of an item: image and HTML text
Feature split x = [x(1); x(2)]
Method 2: multi-view
Co-training:
1. Train f(1) from (X(1)l, Yl) and f(2) from (X(2)l, Yl).
2. Classify Xu with f (1) and f (2) separately.
3. Add most-confident (x, f (1)(x)) to f (2)’s labeled data.
4. Add most-confident (x, f (2)(x)) to f (1)’s labeled data.
5. Repeat.
Encourages agreement between two classifiers.
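
A sketch of this loop (Python with scikit-learn; naive Bayes as the base learner and one point per view per round are illustrative choices, and the two views are assumed row-aligned):

import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(Xl1, Xl2, yl, Xu1, Xu2, rounds=10):
    # Co-training: each view's classifier labels its most confident
    # unlabeled point for the other view's classifier.
    L1 = (Xl1.copy(), yl.copy())      # (features, labels) pool for view 1
    L2 = (Xl2.copy(), yl.copy())      # (features, labels) pool for view 2
    f1, f2 = GaussianNB().fit(*L1), GaussianNB().fit(*L2)
    for _ in range(rounds):
        if len(Xu1) == 0:
            break
        p1, p2 = f1.predict_proba(Xu1), f2.predict_proba(Xu2)
        i1, i2 = p1.max(axis=1).argmax(), p2.max(axis=1).argmax()
        # f1's most confident label goes to f2's pool, and vice versa
        L2 = (np.vstack([L2[0], Xu2[i1:i1 + 1]]),
              np.append(L2[1], f1.classes_[p1[i1].argmax()]))
        L1 = (np.vstack([L1[0], Xu1[i2:i2 + 1]]),
              np.append(L1[1], f2.classes_[p2[i2].argmax()]))
        keep = np.ones(len(Xu1), dtype=bool)
        keep[[i1, i2]] = False
        Xu1, Xu2 = Xu1[keep], Xu2[keep]
        f1, f2 = GaussianNB().fit(*L1), GaussianNB().fit(*L2)
    return f1, f2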
Method 2: multi-view
A regularized risk minimization framework to encourage
multi-learner agreement:
min_f ∑_{v=1}^{M} ( ∑_{i=1}^{l} c(yi, fv(xi)) + λ1 ‖fv‖²_K ) + λ2 ∑_{u,v=1}^{M} ∑_{i=l+1}^{n} (fu(xi) − fv(xi))²
M learners, c loss function (e.g., hinge)
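
As a sketch, the objective for M linear learners fv(x) = wv·x, with squared loss standing in for c (a simplifying assumption); any gradient method could then minimize it:

import numpy as np

def multiview_risk(W, Xl, yl, Xu, lam1=1e-2, lam2=1e-2):
    # Regularized risk with a pairwise agreement penalty on unlabeled data.
    # W: (M, d) matrix, one linear learner per row.
    Fl = W @ Xl.T                              # (M, l) predictions on labeled data
    Fu = W @ Xu.T                              # (M, n-l) predictions on unlabeled data
    loss = ((Fl - yl) ** 2).sum()              # sum_v sum_i c(yi, fv(xi))
    reg = lam1 * (W ** 2).sum()                # sum_v lambda1 ||fv||^2
    diff = Fu[:, None, :] - Fu[None, :, :]     # fu(xi) - fv(xi), all pairs (u, v)
    agree = lam2 * (diff ** 2).sum()           # pairwise disagreement penalty
    return loss + reg + agree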
Method 3: graph-based
Example: Classify astronomy vs. travel articles.
            d1  d5  d6  d7  d3  d4  d8  d9  d2
asteroid     •
bright       •   •
comet            •   •
year                 •   •
zodiac                   •   •
...
airport                          •
bike                             •   •
camp                                 •   •
yellowstone                              •   •
zion                                         •
(• = the word occurs in that article)
Unlabeled articles are stepping stones.
Method 3: graph-based
The graph has nodes Xl ∪ Xu and edges:
k-nearest-neighbor unweighted graph
fully connected graph, weights decay with distance: wij = exp(−‖xi − xj‖²/σ²)
other (expert knowledge)
The graph is represented by an n × n weight matrix.
[figure: a small example graph on the documents d1, d2, d3, d4]
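
A sketch of the graph construction in numpy (the RBF weights follow the formula above; the kNN sparsification is one common variant):

import numpy as np

def rbf_graph(X, sigma=1.0, k=None):
    # Weight matrix of a fully connected RBF graph; optionally keep
    # only each node's k strongest edges (kNN graph, symmetrized).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    if k is not None:
        idx = np.argsort(-W, axis=1)[:, :k]               # k nearest neighbors
        M = np.zeros_like(W, dtype=bool)
        M[np.arange(len(X))[:, None], idx] = True
        W = W * np.maximum(M, M.T)                        # keep edge if either end picks it
    return W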
Method 3: graph-based
An electric network view:
Edges are resistors with conductance wij
1-volt battery connects to labeled points y = 0, 1
The voltage at the nodes is the harmonic function f
[figure: an electric network where each edge is a resistor Rij = 1/wij and a 1-volt battery clamps the labeled nodes at 1 and 0]
The harmonic function minimizes ∑_{i,j=1}^{n} wij (f(xi) − f(xj))².
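
The minimizer has a closed form: writing the graph Laplacian L = D − W (D diagonal with the row sums of W) in blocks over labeled and unlabeled nodes, the unlabeled values are fu = −Luu⁻¹ Lul yl. A sketch:

import numpy as np

def harmonic(W, y_l):
    # Harmonic solution on a graph whose first l nodes carry labels 0/1.
    l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    L_uu, L_ul = L[l:, l:], L[l:, :l]
    f_u = np.linalg.solve(L_uu, -L_ul @ y_l)    # f_u = -L_uu^{-1} L_ul y_l
    return f_u                                  # values in [0, 1]; threshold at 1/2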
Method 3: graph-based
Manifold regularization extends the harmonic solution to
handle
noisy labels
unseen examples
min_f ∑_{i=1}^{l} c(yi, f(xi)) + λ1 ‖f‖²_K + λ2 ∑_{i,j=1}^{n} wij (f(xi) − f(xj))²
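
A gradient-descent sketch of this objective for a linear f (squared loss standing in for c; both are simplifying assumptions):

import numpy as np

def manifold_reg_fit(X, y_l, W, lam1=1e-2, lam2=1e-2, lr=1e-3, steps=1000):
    # Manifold regularization with f(x) = w . x and squared loss.
    # X: all n points (first l labeled); W: (n, n) graph weight matrix.
    l = len(y_l)
    L = np.diag(W.sum(axis=1)) - W      # Laplacian; sum_ij wij (fi - fj)^2 = 2 f^T L f
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        f = X @ w
        grad = 2 * X[:l].T @ (f[:l] - y_l)    # labeled loss term
        grad += 2 * lam1 * w                  # ||f||^2 regularizer (linear kernel)
        grad += 4 * lam2 * X.T @ (L @ f)      # graph smoothness term
        w -= lr * grad
    return w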
Method 4: S3VMs
Semi-supervised SVMs (S3VMs, transductive SVMs):
maximizing “unlabeled margin”
[figure: labeled + and − points; the S3VM boundary maximizes the margin with respect to the unlabeled data]
Method 4: S3VMs
standard SVM with hinge loss:
min_f ∑_{i=1}^{l} (1 − yi f(xi))+ + λ ‖f‖²_K
[figure: the hinge loss (1 − yi f(xi))+ as a function of yi f(xi)]
Prefers labeled points on the ‘correct’ side.
Method 4: S3VMs
How to incorporate unlabeled points?
Assign putative labels sign(f(x)) to x ∈ Xu, so that sign(f(x)) f(x) = |f(x)|.
The hinge loss on unlabeled points becomes (1 − yi f(xi))+ = (1 − |f(xi)|)+.
Semi-supervised SVMs:
min_f ∑_{i=1}^{l} (1 − yi f(xi))+ + λ1 ‖f‖²_K + λ2 ∑_{i=l+1}^{n} (1 − |f(xi)|)+
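
A sketch of this objective for a linear f(x) = w·x (numpy; labels in {−1, +1}):

import numpy as np

def s3vm_objective(w, Xl, yl, Xu, lam1=1e-2, lam2=1e-2):
    # Hinge loss on labeled points, hat loss on unlabeled points.
    hinge = np.maximum(0, 1 - yl * (Xl @ w)).sum()    # (1 - yi f(xi))_+
    hat = np.maximum(0, 1 - np.abs(Xu @ w)).sum()     # (1 - |f(xi)|)_+
    return hinge + lam1 * (w @ w) + lam2 * hat

Note that the hat loss makes this objective non-convex (see Question 4 below).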
Method 4: S3VMs
The hat loss (1 − |f(xi)|)+ prefers f(x) ≥ 1 or f(x) ≤ −1.
[figure: the hat loss (1 − |f(xi)|)+ as a function of f(xi)]
min_f ∑_{i=1}^{l} (1 − yi f(xi))+ + λ1 ‖f‖²_K + λ2 ∑_{i=l+1}^{n} (1 − |f(xi)|)+
The third term prefers unlabeled points outside the margin.
Question 1
What can we say about convergence, consistency, etc.?
n → ∞, l fixed
l → 0
l → ∞, n → ∞, l/n → 0
Question 2
∑_{i=1}^{l} log p(yi|θ) p(xi|yi, θ) + ∑_{i=l+1}^{n} log ( ∑_{y=1}^{C} p(y|θ) p(xi|y, θ) )

min_f ∑_{v=1}^{M} ( ∑_{i=1}^{l} c(yi, fv(xi)) + λ1 ‖fv‖²_K ) + λ2 ∑_{u,v=1}^{M} ∑_{i=l+1}^{n} (fu(xi) − fv(xi))²

min_f ∑_{i=1}^{l} c(yi, f(xi)) + λ1 ‖f‖²_K + λ2 ∑_{i,j=1}^{n} wij (f(xi) − f(xj))²

min_f ∑_{i=1}^{l} (1 − yi f(xi))+ + λ1 ‖f‖²_K + λ2 ∑_{i=l+1}^{n} (1 − |f(xi)|)+
Question 2
Why 4 methods?
Why not just 1 method?
What is the model that unifies semi-supervised
learning?
Why not 40 methods?
What are new semi-supervised learning approaches?
Question 3
no pain, no gain
no model assumption, no gain
wrong model assumption, no gain, a lot of pain
How do we know that we are making the right model
assumptions?
= Which semi-supervised learning method should I use?
Question 4
Are the methods practical?
e.g., S3VM is not convex
[figure: the hat loss is non-convex]
[figure: log-log plot of unlabeled data size (10^2 to 10^10, with reference lines for people in a full stadium, internet users in the US, and the world population) against labeled data size (10^0 to 10^4)]
Question 5
Do humans do semi-supervised learning?
word learning in 17-month-old infants
hearing a word beforehand ⇒ easier to associate the word
with a visual object
References
Google “semi-supervised learning survey”
thank you