Co-Training and Expansion: Towards Bridging Theory and Practice

Maria-Florina Balcan, Avrim Blum, Ke Yang

Carnegie Mellon University, Computer Science Department


Combining Labeled and Unlabeled Data (a.k.a. Semi-supervised Learning)

• Many applications have lots of unlabeled data, but labeled data is rare or expensive:

  • Web page, document classification
  • OCR, image classification

• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:

  • Transductive SVM
  • Co-training
  • Graph-based methods


Co-training: method for combining labeled & unlabeled data

• Works in scenarios where examples have distinct, yet sufficient feature sets:
  – An example: $x = \langle x_1, x_2 \rangle$
  – Belief is that the two parts of the example are consistent, i.e. $\exists\, c_1, c_2$ such that $c_1(x_1) = c_2(x_2) = c(x)$
  – Each view is sufficient for correct classification

• Works by using unlabeled data to propagate learned information.

[Figure: positive (+) examples shown in the two views X1 and X2]


Co-Training: method for combining labeled & unlabeled data

• For example, if we want to classify web pages:

[Figure: a web page example showing the text "My Advisor" and "Prof. Avrim Blum"]

• x = link info & text info
• x1 = link info
• x2 = text info


Iterative Co-Training

• Have learning algorithms A1, A2 on each of the two views.

• Use labeled data to learn two initial hypotheses h1, h2.

• Look through unlabeled data to find examples where one of the h_i is confident but the other is not.

• Have the confident h_i label it for algorithm A_{3-i}.

• Repeat.
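The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each view is taken to be a single real number, only positive labeled examples are given, and each learner simply outputs the tightest interval around the points it has been told are positive. The names fit_interval, confident and co_train are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Optional[Tuple[float, float]]


def fit_interval(positives: List[float]) -> Interval:
    """Assumed learner: tightest interval containing the positive points."""
    if not positives:
        return None
    return (min(positives), max(positives))


def confident(h: Interval, x: float) -> bool:
    """A hypothesis is confident (positive) only inside its interval."""
    return h is not None and h[0] <= x <= h[1]


def co_train(labeled, unlabeled, rounds=10):
    """labeled: positive (x1, x2) pairs; unlabeled: (x1, x2) pairs."""
    pos1 = [x1 for x1, _ in labeled]
    pos2 = [x2 for _, x2 in labeled]
    h1, h2 = fit_interval(pos1), fit_interval(pos2)
    for _ in range(rounds):
        new1, new2 = [], []
        for x1, x2 in unlabeled:
            c1, c2 = confident(h1, x1), confident(h2, x2)
            if c1 and not c2:
                new2.append(x2)  # h1 labels the example for A2
            elif c2 and not c1:
                new1.append(x1)  # h2 labels the example for A1
        if not new1 and not new2:
            break  # nothing left on which exactly one view is confident
        pos1 += new1
        pos2 += new2
        h1, h2 = fit_interval(pos1), fit_interval(pos2)
    return h1, h2


labeled = [(0.5, 0.5)]
unlabeled = [(0.5, 0.2), (0.1, 0.2), (0.1, 0.9), (0.8, 0.9)]
print(co_train(labeled, unlabeled))
```

In this toy run each round lets one view extend the other, so the two intervals grow alternately until no unlabeled example has exactly one confident view.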


Iterative Co-Training, A Simple Example: Learning Intervals

Use labeled data to learn $h_1^1$ and $h_2^1$

Use unlabeled data to bootstrap

[Figure: target intervals c1 and c2 with labeled and unlabeled examples; the hypotheses progress from $h_1^1$, $h_2^1$ to $h_1^2$, $h_2^2$]


Theoretical/Conceptual Question

• What properties do we need for co-training to work well?

• Need assumptions about:
  – the underlying data distribution
  – the learning algorithms on the two sides


Theoretical/Conceptual Question

• What property of the data do we need for co-training to work well?

• Previous work:
  1) Independence given the label
  2) Weak rule dependence

• Our work - much weaker assumption about how the data should behave:

• expansion property of the underlying distribution

• Though we will need a stronger assumption on the learning algorithm compared to (1).


Co-Training, Formal Setting

• Assume that examples are drawn from distribution D over instance space $X = X_1 \times X_2$.

• Let c be the target function; assume that each view is sufficient for correct classification:
  – c can be decomposed into c1, c2 over each view s.t. D has no probability mass on examples x with $c_1(x_1) \neq c_2(x_2)$

• Let X+ and X- denote the positive and negative regions of X.

• Let D+ and D- be the marginal distribution of D over X+ and X- respectively.

• Let … – think of … as … [symbols lost in extraction]

[Figure: the regions D+ and D-]


(Formalization)

• We assume that D+ is expanding.

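• Expansion: D+ is $\epsilon$-expanding if for any $S_1 \subseteq X_1^+$ and $S_2 \subseteq X_2^+$, $\Pr(S_1 \oplus S_2) \ge \epsilon \min[\Pr(S_1 \wedge S_2), \Pr(\overline{S_1} \wedge \overline{S_2})]$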

• This is a natural analog of the graph-theoretic notions of conductance and expansion.

[Figure: the sets S1 and S2]
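For a small discrete distribution, the expansion property can be checked by brute force. The sketch below assumes the condition takes the form Pr(S1 xor S2) >= eps * min[Pr(S1 and S2), Pr(not S1 and not S2)] over all subsets S1, S2 of the two positive views, and returns the largest eps for which it holds. The function name expansion_constant and the dictionary representation of D+ are illustrative assumptions; the enumeration is exponential, so this is only usable on toy examples.

```python
from itertools import chain, combinations


def subsets(items):
    """All subsets of items, as tuples."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


def expansion_constant(dist):
    """dist maps (x1, x2) pairs to probabilities under D+.

    Returns the smallest ratio Pr(S1 xor S2) / min(Pr(S1 and S2),
    Pr(neither)) over all subset pairs where the denominator is positive
    (float('inf') if there is no such pair).
    """
    left = sorted({u for u, _ in dist})
    right = sorted({v for _, v in dist})
    best = float("inf")
    for s1 in map(set, subsets(left)):
        for s2 in map(set, subsets(right)):
            both = sum(p for (u, v), p in dist.items()
                       if u in s1 and v in s2)
            neither = sum(p for (u, v), p in dist.items()
                          if u not in s1 and v not in s2)
            xor = sum(p for (u, v), p in dist.items()
                      if (u in s1) != (v in s2))
            denom = min(both, neither)
            if denom > 0:
                best = min(best, xor / denom)
    return best


# Two disconnected components: confidence can never cross between them.
split = {("a", "x"): 0.5, ("b", "y"): 0.5}
# Every left value co-occurs with every right value.
mixed = {(u, v): 0.25 for u in "ab" for v in "xy"}
print(expansion_constant(split), expansion_constant(mixed))
```

The disconnected distribution has constant 0 (taking S1 = {a}, S2 = {x} gives no probability mass on the "exactly one view confident" event), while the fully mixed one has a strictly positive constant.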


Property of the underlying distribution

• Necessary condition for co-training to work well:
  – If S1 and S2 (our confident sets) do not expand, then we might never see examples for which one hypothesis could help the other.

• We show that it is also sufficient for co-training to generalize well in a relatively small number of iterations, under some assumptions:
  – the data is perfectly separable
  – we have strong learning algorithms on the two sides


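Expansion, Examples: Learning Intervals

[Figure: two panels, "Non-expanding distribution" and "Expanding distribution", each showing D+ over the target intervals c1 and c2; the annotation "Zero probability mass in the regions" and the sets S1 and S2 are marked]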


Weaker than independence given the label & than weak rule dependence.

[Figure: the regions D+ and D-, with the sets S1 and S2 inside D+]

E.g., w.h.p. a random degree-3 bipartite graph is expanding, but would NOT have independence given the label or weak rule dependence.


Main Result

• Assume D+ is $\epsilon$-expanding.

• Assume that on each of the two views we have algorithms A1 and A2 for learning from positive data only.

• Assume that we have initial confident sets $S_1^0$ and $S_2^0$ such that … [formula lost in extraction]


Main Result, Interpretation

• Assumption on A1, A2 implies that they never generalize incorrectly.

• Question is: what needs to be true for them to actually generalize to the whole of D+?

[Figure: positive (+) examples in the positive regions X1+ and X2+]


Main Result, Proof Idea

• Expansion implies that at each iteration, there is reasonable probability mass on "new, useful" data.

• Algorithms generalize to most of this new region.

• See paper for real proof.


What if assumptions are violated?

• What if our algorithms can make incorrect generalizations and/or there is no perfect separability?


What if assumptions are violated?

• Expect "leakage" to negative region.

• If the negative region is expanding too, then incorrect generalizations will grow at an exponential rate.

• Correct generalizations are growing at an exponential rate too, but will slow down first.

• Expect overall accuracy to go up then down.


Synthetic Experiments

• Create a 2n-by-2n bipartite graph:
  – nodes 1 to n on each side represent positive clusters
  – nodes n+1 to 2n on each side represent negative clusters

• Connect each node on the left to 3 nodes on the right:
  – each neighbor is chosen with prob. $1 - \alpha$ to be a random node of the same class, and with prob. $\alpha$ to be a random node of the opposite class

• Begin with an initial confident set and then propagate confidence through rounds of co-training:
  – monitor the percentage of the positive class covered, the percentage of the negative class mistakenly covered, and the overall accuracy
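The setup above can be simulated roughly as follows. This is a sketch under several assumptions rather than the authors' experimental code: nodes are 0-indexed (0 to n-1 positive, n to 2n-1 negative), the cross-class probability is passed as a parameter named alpha, the initial confident set is a single positive node on the left, a node becomes confident as soon as any of its neighbors is confident, and overall accuracy is computed on the left side with the two classes weighted equally. The function name run_experiment is hypothetical.

```python
import random


def run_experiment(n=500, d=3, alpha=0.01, rounds=20, seed=0):
    """Propagate confidence on a random 2n-by-2n bipartite graph.

    Returns one (positive coverage, negative coverage, accuracy) tuple
    per round, measured on the left-hand nodes.
    """
    rng = random.Random(seed)

    def same_class(i):
        return rng.randrange(0, n) if i < n else rng.randrange(n, 2 * n)

    def other_class(i):
        return rng.randrange(n, 2 * n) if i < n else rng.randrange(0, n)

    # edges[i] lists the d right-side neighbors of left node i.
    edges = [
        [other_class(i) if rng.random() < alpha else same_class(i)
         for _ in range(d)]
        for i in range(2 * n)
    ]

    left, right = {0}, set()  # assumed initial confident set
    history = []
    for _ in range(rounds):
        # Confident left nodes make their right neighbors confident.
        right |= {v for u in left for v in edges[u]}
        # Confident right nodes make their left neighbors confident.
        left |= {u for u in range(2 * n)
                 if any(v in right for v in edges[u])}
        pos = sum(1 for u in left if u < n) / n
        neg = sum(1 for u in left if u >= n) / n
        history.append((pos, neg, (pos + (1 - neg)) / 2))
    return history


for pos, neg, acc in run_experiment()[:5]:
    print(f"pos={pos:.3f} neg={neg:.3f} acc={acc:.3f}")
```

With alpha set to 0 there are no cross-class edges, so the negative coverage stays at zero; with a small positive alpha the run can be used to look for the "up then down" accuracy pattern described earlier.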


Synthetic Experiments

• solid line indicates overall accuracy
• green curve is accuracy on positives
• red curve is accuracy on negatives

[Figure: two panels, $\alpha = 0.01$, n = 5000, d = 3 and $\alpha = 0.001$, n = 5000, d = 3]


Conclusions

• We propose a much weaker expansion assumption on the underlying data distribution.

• It seems to be the “right” condition on the distribution for co-training to work well.

• It directly motivates the iterative nature of many of the practical co-training based algorithms.
