Robust Multi-Class Transductive Learning with Graphs
Wei Liu and Shih-Fu Chang
Columbia University
June 19, 2009
Wei Liu and Shih-Fu Chang Robust Multi-Class Transductive Learning with Graphs
Introduction
Graph Construction
Graph Learning
Robust Multi-Class Graph Transduction (RMGT)
Experiments
What is Semi-Supervised Learning (SSL)?
★ In the narrow sense, SSL refers particularly to semi-supervised classification using labeled data and unlabeled data, which often includes transductive and inductive cases.
Figure: Narrow-sense semi-supervised learning: transductive learning operates on seen data, while inductive learning extends to unseen data.
What is Semi-Supervised Learning (SSL)?
★ In the wide sense, SSL covers all learning tasks where prior knowledge about a few data points is known and knowledge about the remaining data can be inferred. The knowledge may be labels, response values, vector representations, or pairwise relations.
Figure: Wide-sense semi-supervised learning, e.g., semi-supervised regression and clustering.
Survey and Book
Xiaojin Zhu. Semi-Supervised Learning Literature Survey, Computer Sciences Technical Report 1530, University of Wisconsin-Madison, 2005.
Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning, MIT Press, 2006.
Binary-Class SSL Setting
▸ A data set X = {x1, ..., xl, ..., xn} ⊂ R^d in which the first l samples are labeled and the remaining u = n − l are unlabeled. Prior labels are stored in y ∈ R^n such that yi ∈ {1, −1} if xi is labeled and yi = 0 if unlabeled. We use the graph Laplacian matrix L or its normalized variant L̃ to infer the overall labeling f ∈ R^n.
▸ Graph Laplacian: L = D − W, where W is the weight matrix of the graph G(V, E, W) built on the data set X, and D_ii = Σ_j W_ij.
▸ Normalized graph Laplacian: L̃ = D^(−1/2) L D^(−1/2).
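As an illustrative sketch (not from the slides), the Laplacian definitions above can be checked numerically on a tiny hand-made weight matrix:

```python
import numpy as np

# Small hand-made 4-node weight matrix; values are illustrative only.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
d = W.sum(axis=1)                                  # degrees D_ii = sum_j W_ij
L = np.diag(d) - W                                 # graph Laplacian L = D - W
L_norm = np.diag(d**-0.5) @ L @ np.diag(d**-0.5)   # D^{-1/2} L D^{-1/2}

# Sanity check: f^T L f = (1/2) * sum_ij W_ij (f_i - f_j)^2
f = np.array([1., 1., -1., -1.])
smooth = 0.5 * sum(W[i, j] * (f[i] - f[j])**2
                   for i in range(4) for j in range(4))
assert np.isclose(f @ L @ f, smooth)               # both equal 8.0 here
```

Rows of L sum to zero by construction, which is exactly why constant labelings incur zero smoothness penalty.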
State of the Art
★ Label Propagation – the key is the Laplacian-shaped regularizer.
Gaussian Fields and Harmonic Functions (GFHF), Zhu et al. 2003:

    min_f  f^T L f    s.t.  f_l = y_l

Local and Global Consistency (LGC), Zhou et al. 2004:

    min_f  ‖f − y‖^2 + μ f^T L̃ f

Quadratic Criterion (QC), Bengio et al. 2006:

    min_f  ‖f_l − y_l‖^2 + μ f^T L f + μ_ε ‖f‖^2
★ Remarks
1. All these methods are akin to each other. I found that X. Zhu's method GFHF gives more robust performance because of its hard constraint and absence of trade-off parameters.
2. All these methods heavily depend on the graph structure.
3. All these methods naturally generalize to multi-class problems.
Motivation
1. ”Several graph-based methods listed here are similar to each other. They differ in the particular choice of the loss function and the regularizer. We believe it is more important to construct a good graph than to choose among the methods. However graph construction, as we will see later, is not a well studied area.” (X. Zhu, the SSL survey, 2005)
2. The two most commonly used kinds of graphs: the k-NN graph and the ε-neighborhood graph. Empirically, a weighted k-NN graph with small k tends to perform better.
A Simple Toy Problem – Noisy Two Moons
Figure: Noisy two moons given two labeled points. We only have ground-truth labels for the points on the two moons, so we evaluate classification performance on these on-manifold points.
A Simple Toy Problem – Noisy Two Moons
Figure: Error rates over unlabeled points. (a) LGC with 13.55% error rate using a 10-NN graph; (b) GFHF with 14.21% error rate using a 10-NN graph; (c) GFHF with zero error rate using a symmetry-favored 10-NN graph.
Illumination
▸ Using the traditional k-NN graph, LGC and GFHF make many errors, but GFHF achieves perfect results with the proposed symmetry-favored k-NN graph. This illustrates that graph quality is critical to SSL: the same SSL method yields very different results under different graph construction schemes.
k-NN Graph
▸ Let us define an asymmetric n × n matrix:

    A_ij = exp(−d(x_i, x_j)^2 / σ^2)  if j ∈ N_i,
    A_ij = 0                          otherwise,        (1)

where the set N_i holds the indexes of the k nearest neighbors of point x_i, and d(x_i, x_j) is some distance measure (e.g., Euclidean distance) between x_i and x_j.
▸ The parameter σ is estimated empirically as σ = Σ_{i=1}^{n} d(x_i, x_{ik}) / n, where x_{ik} is the k-th nearest neighbor of x_i. This estimate has proven simple and sufficiently effective.
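A minimal sketch of eq. (1) together with the σ rule above (the function name and the brute-force distance computation are my own, not from the paper):

```python
import numpy as np

def knn_affinity(X, k):
    """Asymmetric k-NN affinity A of eq. (1); sigma is the mean
    distance to the k-th nearest neighbor, as described above."""
    n = X.shape[0]
    # Pairwise Euclidean distances (brute force, fine for small n).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    nbrs = order[:, 1:k + 1]            # N_i: k nearest neighbors (skip self)
    sigma = dist[np.arange(n), nbrs[:, -1]].mean()   # mean k-th NN distance
    A = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    A[rows, cols] = np.exp(-dist[rows, cols] ** 2 / sigma ** 2)
    return A
```

Each row of A then has exactly k positive entries; A is asymmetric in general because neighborhood membership is not mutual.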
k-NN sGraph
▸ Let us define a symmetric n × n matrix:

    W_ij = A_ij + A_ji  if j ∈ N_i and i ∈ N_j,
    W_ij = A_ji         if j ∉ N_i and i ∈ N_j,
    W_ij = A_ij         otherwise.                      (2)

Obviously W = A + A^T, and W is symmetric with W_ii = 0 (to avoid self-loops). This weighting scheme favors the symmetric edges ⟨x_i, x_j⟩ such that x_i is in the neighborhood of x_j and x_j is simultaneously in the neighborhood of x_i.
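Continuing the sketch, eq. (2) amounts to a one-liner (helper name is mine; contrast with the traditional max{A, A^T} weighting):

```python
import numpy as np

def sgraph(A):
    """Symmetry-favored weights of eq. (2): W = A + A^T.
    Mutual-neighbor edges get their weights doubled, whereas the
    traditional k-NN graph would use np.maximum(A, A.T)."""
    return A + A.T
```

For example, if A[0, 1] = A[1, 0] = 0.5 (a mutual edge), then W[0, 1] = 1.0, while a one-sided edge keeps its original weight.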
Remarks
1. The weights of the symmetric edges are explicitly doubled, reflecting the reasonable consideration that two points connected by a symmetric edge are likely to lie on the same submanifold.
2. In contrast, the weighting scheme adopted by traditional k-NN graphs treats all edges in the same manner, defining the weighted adjacency matrix by max{A, A^T}.
3. We call the graph constructed through eq. (2) the symmetry-favored k-NN graph, or k-NN sGraph for short. The proposed graph is relatively robust to noise because it reinforces the similarities between points on manifolds.
Comparison
Figure: 2-NN graph vs. 2-NN sGraph. Thicker edges represent larger edge weights.
Graph Laplacian
▸ Given the constructed graph G(V, E, W), the smoothness semi-norm used in most graph-based approaches is

    ‖f‖_G^2 = (1/2) Σ_{i,j} (f(v_i) − f(v_j))^2 W_ij = f^T L f,

where we obtain the graph Laplacian matrix

    L = D − W.                                          (3)

▸ The degree matrix D ∈ R^{n×n} is a diagonal matrix with D_ii = Σ_{j=1}^{n} W_ij; D_ii approximates the local density of the neighborhood around x_i.
Doubly-Stochastic Matrix
▸ Theorem 1 (in the paper) implies that the smoothness norm emphasizes neighborhoods of high density (large D_ii). However, sampling is usually not uniform in practice, so over-emphasizing high-density neighborhoods may occlude the information in sparse regions.

Figure: Non-uniform sampling.
Doubly-Stochastic Matrix
▸ To fully exploit the power of unlabeled data, we should not let high-density regions dominate the sparse ones. We therefore enforce the equal-degree constraint D_ii = 1 by requiring W1 = 1, which makes the adjacency matrix W a doubly-stochastic matrix.
How to learn?
▸ We learn W from the training data without any presumed functional form. We only assume that W is close to the initial W0 calculated via eq. (2).
▸ We can infuse semi-supervised information into W. Consider the pair set

    T = {(i, j) | i = j, or x_i and x_j have different labels}

and define its matrix form T. In particular, we require W_ij = 0 for (i, j) ∈ T, or equivalently Σ_{(i,j)∈T} W_ij = 0 since W_ij ≥ 0. This constraint is intuitive: it removes self-loops and erroneous edges.
Learning W
▸ We formulate learning a doubly-stochastic W subject to the differently-labeled information T as

    min  G(W) = (1/2) ‖W − W0‖_F^2
    s.t. Σ_{(i,j)∈T} W_ij = 0,
         W1 = 1,  W = W^T,  W ≥ 0,                      (4)

where ‖·‖_F stands for the Frobenius norm. Eq. (4) is an instance of quadratic programming (QP).
Learning W
▸ For efficient computation, we divide this QP problem into two convex sub-problems:

    min  G(W) = (1/2) ‖W − W0‖_F^2
    s.t. Σ_{(i,j)∈T} W_ij = 0,  W1 = 1,  W = W^T        (5)

and

    min  G(W) = (1/2) ‖W − W0‖_F^2    s.t.  W ≥ 0.      (6)
Learning W
▸ The sub-problem in eq. (6) has a simple solution: W = ⌈W0⌉_{≥0}, where the operator ⌈·⌉_{≥0} zeros out all negative entries of W0. This operator is essentially a conic subspace projection.
▸ We solve the sub-problem in eq. (5) by

    W = P(W0, T) = W0 − (t0 + 2·1^T Tμ0 / |T|) T + μ0 1^T + 1 μ0^T,   (7)

where P(W0, T) behaves as an affine subspace projection operator, and t0 and μ0 are also computed from W0.
Successive Projection
▸ We tackle the original QP problem eq. (4) by successive projection using the two subspace projection operators.
▸ Von Neumann's successive projection lemma: the alternating projection process converges onto the intersection of the affine and conic subspaces. The lemma thus ensures that alternately solving sub-problems eq. (5) and (6) converges to the globally optimal solution of the target problem eq. (4).
Algorithm 1. Doubly-Stochastic Adjacency Matrix Learning
INPUT: the initial adjacency matrix W0,
       the differently-labeled information T,
       the maximum iteration number MaxIter.
LOOP:  for m = 1, ..., MaxIter:
       Wm = P(W_{m−1}, T);
       if Wm ≥ 0, stop LOOP;
       else Wm = ⌈Wm⌉_{≥0}.
OUTPUT: W = Wm.
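A simplified sketch of Algorithm 1 (the labeled-pair constraint T is omitted for brevity, so the affine step projects only onto {W : W1 = 1, W = W^T}; function names and the closed-form multipliers are my own derivation, not the paper's P(W0, T)):

```python
import numpy as np

def project_affine(W):
    """Projection onto the affine subspace {W : W1 = 1, W = W^T},
    derived via Lagrange multipliers (cf. eq. (5) without T)."""
    n = W.shape[0]
    r = W.sum(axis=1)                       # current row sums
    s = (n - r.sum()) / (2.0 * n)
    mu = (1.0 - r - s) / n
    one = np.ones(n)
    return W + np.outer(mu, one) + np.outer(one, mu)

def project_nonneg(W):
    """Conic projection of eq. (6): zero out negative entries."""
    return np.maximum(W, 0.0)

def learn_doubly_stochastic(W0, max_iter=5000, tol=1e-9):
    """Von Neumann-style alternating projection, as in Algorithm 1."""
    W = 0.5 * (W0 + W0.T)                   # ensure symmetry
    for _ in range(max_iter):
        W = project_affine(W)
        if W.min() >= -tol:                 # feasible: stop, as in Algorithm 1
            break
        W = project_nonneg(W)
    return project_nonneg(W)
```

Starting from any symmetric nonnegative W0, the output has unit row sums, symmetry, and nonnegative entries, i.e., it is (numerically) doubly stochastic.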
Two Rings Toy Problem
Figure: Two rings toy data.
Two Rings Toy Problem
Figure: k = 10. (a) k-NN graph vs. (b) b-matching graph; the b-matching graph is a regular graph where each node has k adjacent nodes.
Two Rings Toy Problem
Figure: (c) unit-degree graph and (d) unit-degree graph given two labeled points. Both graphs have doubly-stochastic matrices learned based on the 10-NN sGraph. The former does not use the differently-labeled information T (good enough!), while the latter does.
Merits of Doubly-Stochastic Matrix
▸ It offers a nonparametric form for W, flexibly representing data lying in compact clusters or on intrinsic low-dimensional submanifolds.
▸ It is highly robust to noise: e.g., when a noisy sample x_j invades the neighborhood of x_i, the unit-degree constraint keeps the weight W_ij very small compared to the weights between x_i and its closer neighbors.
▸ It provides a “balanced” graph Laplacian whose smoothness norm penalizes the label prediction function uniformly on each sample (node), resulting in uniform label propagation.
Goal
▸ Solve for a soft label matrix F ∈ R^{n×c} for any multi-class SSL task.
Figure: F = [F_{·1}, F_{·2}, ..., F_{·c}] = [F_l; F_u], with one column per class. Given the known class assignment F_l = Y_l, infer the unknown F_u.
Multi-Class Constraints
★ It suffices to suppose the class posteriors for the labeled data to be p(C_k | x_i) = Y_ik = 1 if x_i ∈ C_k and p(C_k | x_i) = Y_ik = 0 otherwise. Importantly, if we knew the class priors ω = [p(C_1), ..., p(C_c)]^T (with ω^T 1_c = 1) and regarded the soft labels F_ik as p(C_k | x_i), we would have

    (1^T F_{·k}) / n ≅ Σ_{i=1}^{n} p(C_k | x_i) / n = Σ_{i=1}^{n} p(x_i) p(C_k | x_i) = p(C_k),   (8)

where the marginal probability p(x_i) ∝ D_ii = 1 is assumed to be 1/n. Eq. (8) induces a hard constraint 1^T F = nω^T (equivalently F^T 1 = nω).
Multi-Class Label Propagation
▸ To address multi-class problems, our motivation is to let the soft labels F_ik carry the main properties of p(C_k | x_i). Hence we impose two hard constraints, F^T 1 = nω and F 1_c = 1 (since Σ_k p(C_k | x_i) = 1; 1_c is the c-dimensional all-ones vector), to obtain a constrained multi-class label propagation:

    min_F  tr(F^T L F)
    s.t.   F_l = Y_l,  F 1_c = 1,  F^T 1 = nω           (9)
Multi-Class Label Propagation
▸ Eq. (9) reduces to

    min  Q(F_u) = tr(F_u^T L_uu F_u) + 2 tr(F_u^T L_ul Y_l)
    s.t. F_u 1_c = 1_u,  F_u^T 1_u = nω − Y_l^T 1_l,    (10)

where L_uu and L_ul are sub-matrices of L = [L_ll, L_lu; L_ul, L_uu], and 1_l and 1_u are the l- and u-dimensional all-ones vectors, respectively.
Multi-Class Label Propagation
▸ Theorem 2 (in the paper) gives a closed-form solution to eq. (10). The formulated multi-class label propagation thus succeeds in incorporating class priors, unlike all existing label propagation methods.
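Since the closed form of Theorem 2 is not reproduced on these slides, here is a hedged sketch of only the unconstrained part of eq. (10), i.e., the plain multi-class harmonic solution F_u = −L_uu^{−1} L_ul Y_l (the class-prior and row-sum constraints are omitted; the function name is mine):

```python
import numpy as np

def harmonic_labels(W, Y_l):
    """Multi-class harmonic propagation: minimizes tr(F^T L F)
    subject only to F_l = Y_l (prior constraints of eq. (10) omitted)."""
    l = Y_l.shape[0]                        # first l samples are labeled
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian, eq. (3)
    L_uu, L_ul = L[l:, l:], L[l:, :l]
    # Setting the gradient of Q(F_u) to zero gives L_uu F_u = -L_ul Y_l.
    F_u = np.linalg.solve(L_uu, -L_ul @ Y_l)
    return F_u
```

On a tiny 4-node graph with one labeled point per class, the argmax of each row of F_u recovers the intended cluster, and each row sums to 1 when the rows of Y_l do.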
Flowchart of RMGT
Figure: The RMGT algorithm. Input feature vectors → k-NN sGraph → doubly-stochastic adjacency matrix learning → unit-degree graph → multi-class label propagation (given prior labels) → global classification.
Experimental Setup
Data            #Features   #Samples   #Classes
USPS (test)     256         2007       10
FRGC (subset)   4608        3160       316

Figure: Digit and face images.

RMGT: without graph adjacency matrix learning.
RMGT(W): with graph adjacency matrix learning.
Performance Curves
Figure: USPS: error rate vs. number of labeled samples (20 to 100). FRGC: recognition rate vs. number of labeled samples (300 to 1000). Compared methods: LGC, SGT, GFHF+CMN, RMGT, and RMGT(W).
Conclusions
▸ All compared SSL algorithms achieve performance gains when switching from k-NN graphs to k-NN sGraphs.
▸ RMGT performs better than the other methods, demonstrating the success of multi-class label propagation with class priors.
▸ RMGT(W) is significantly superior to the others, showing that the proposed graph learning technique (doubly-stochastic adjacency matrix learning) boosts graph-based SSL performance.
Thanks! For any problems, please email [email protected].