Co-Training and Expansion: Towards Bridging Theory and Practice
Maria-Florina Balcan, Avrim Blum, Ke Yang
Carnegie Mellon University, Computer Science Department



Page 1

Co-Training and Expansion: Towards Bridging Theory and Practice

Maria-Florina Balcan, Avrim Blum, Ke Yang

Carnegie Mellon University, Computer Science Department

Page 2

Combining Labeled and Unlabeled Data (a.k.a. Semi-supervised Learning)

• Many applications have lots of unlabeled data, but labeled data is rare or expensive:

– Web page, document classification
– OCR, image classification

• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:

– Transductive SVM
– Co-training
– Graph-based methods

Page 3

Co-training: method for combining labeled & unlabeled data

• Works in scenarios where examples have distinct, yet sufficient feature sets:
– An example: x = (x1, x2)
– Belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c(x)
• Each view is sufficient for correct classification

• Works by using unlabeled data to propagate learned information.

[Figure: a positive example split into its two views X1 and X2]

Page 4

Co-Training: method for combining labeled & unlabeled data

• For example, if we want to classify web pages:

[Figure: a web page about "Prof. Avrim Blum" reached via a hyperlink labeled "My Advisor"]

x1 - Link info
x2 - Text info
x - Link info & Text info
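To make the two-view representation concrete, here is a minimal sketch in Python; the particular word features are illustrative assumptions, not taken from the slides:

# x = (x1, x2): the same web page seen through two different feature sets.
page_views = {
    "x1_link_info": {"my", "advisor"},            # words on hyperlinks pointing to the page
    "x2_text_info": {"prof", "avrim", "blum"},    # words appearing on the page itself
}
x = (page_views["x1_link_info"], page_views["x2_text_info"])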

Page 5

Iterative Co-Training

• Have learning algorithms A1, A2 on each of the two views.

• Use labeled data to learn two initial hypotheses h1, h2.

• Look through unlabeled data to find examples where one of the hi is confident but the other is not.

• Have the confident hi label those examples for the other algorithm, A_{3-i}.

• Repeat.
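A minimal sketch of this loop in Python, assuming scikit-learn is available and using GaussianNB as the learner on each view; the synthetic two-view data, the fixed confidence threshold, and the simplification of labeling any unlabeled example on which either hypothesis is confident are my assumptions, not part of the slides:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def make_two_view_data(n):
    # Each example x = (x1, x2); both views are noisy reflections of the label.
    y = rng.integers(0, 2, n)
    x1 = y[:, None] + rng.normal(0, 0.8, (n, 2))   # "view 1" features
    x2 = y[:, None] + rng.normal(0, 0.8, (n, 2))   # "view 2" features
    return x1, x2, y

x1, x2, y = make_two_view_data(2000)
labeled = np.zeros(len(y), dtype=bool)
labeled[:20] = True                                # only a few labeled examples
y_work = np.where(labeled, y, -1)                  # -1 marks unlabeled examples

h = [GaussianNB(), GaussianNB()]                   # A1 on view 1, A2 on view 2
views = [x1, x2]

for _ in range(30):                                # rounds of co-training
    for i in (0, 1):
        h[i].fit(views[i][labeled], y_work[labeled])
    newly_labeled = False
    for i in (0, 1):
        probs = h[i].predict_proba(views[i])       # confidence of h_i on every example
        conf = probs.max(axis=1)
        pick = (~labeled) & (conf > 0.95)          # h_i is confident, example still unlabeled
        if pick.any():
            # the confident h_i labels these examples for the other algorithm
            y_work[pick] = h[i].classes_[probs[pick].argmax(axis=1)]
            labeled |= pick
            newly_labeled = True
    if not newly_labeled:
        break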

Page 6

Iterative Co-Training A Simple Example: Learning Intervals

[Figure: target intervals c1 and c2 on the two views, with labeled and unlabeled examples. Use the labeled data to learn initial hypotheses h1^1 and h2^1, then use unlabeled data to bootstrap, obtaining successively larger intervals h1^2, h2^2, and so on.]
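A small sketch of the picture above in code, assuming a concrete pair of target intervals and using "the smallest interval containing the currently confident positives" as the learner on each view; both choices are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
c1 = (0.2, 0.8)                 # target interval in view 1
c2 = (0.3, 0.9)                 # target interval in view 2

def sample_outside(lo, hi, size):
    # Uniform on [0, 1] excluding [lo, hi].
    u = rng.uniform(0, 1 - (hi - lo), size)
    return np.where(u < lo, u, u + (hi - lo))

n_pos, n_neg = 300, 300
x1 = np.concatenate([rng.uniform(c1[0], c1[1], n_pos), sample_outside(*c1, n_neg)])
x2 = np.concatenate([rng.uniform(c2[0], c2[1], n_pos), sample_outside(*c2, n_neg)])

confident = np.zeros(n_pos + n_neg, dtype=bool)
confident[:3] = True            # a few labeled positive examples to start

for t in range(20):             # rounds of bootstrapping with unlabeled data
    h1 = (x1[confident].min(), x1[confident].max())   # h1^t: smallest interval so far
    h2 = (x2[confident].min(), x2[confident].max())   # h2^t
    in_h1 = (x1 >= h1[0]) & (x1 <= h1[1])             # h1 is confident about these examples
    in_h2 = (x2 >= h2[0]) & (x2 <= h2[1])             # h2 is confident about these examples
    new = (in_h1 | in_h2) & ~confident                # each view labels new data for the other
    if not new.any():
        break
    confident |= new
# 'confident' typically grows to cover the positive region while never including a negative.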

Page 7

Theoretical/Conceptual Question

• What properties do we need for co-training to work well?

• Need assumptions about:
– the underlying data distribution
– the learning algorithms on the two sides

Page 8

Theoretical/Conceptual Question

• What property of the data do we need for co-training to work well?

• Previous work:
1) Independence given the label
2) Weak rule dependence

• Our work makes a much weaker assumption about how the data should behave:

– the expansion property of the underlying distribution

• Though we will need a stronger assumption on the learning algorithms compared to (1).

Page 9

Co-Training, Formal Setting

• Assume that examples are drawn from a distribution D over the instance space X = X1 × X2.

• Let c be the target function; assume that each view is sufficient for correct classification:
– c can be decomposed into c1, c2 over each view s.t. D has no probability mass on examples x with c1(x1) ≠ c2(x2)

• Let X+ and X- denote the positive and negative regions of X.

• Let D+ and D- be the marginal distributions of D over X+ and X-, respectively.

[Figure: the instance space with its positive region (distribution D+) and negative region (distribution D-)]

Page 10

(Formalization)

• We assume that D+ is expanding.

• Expansion: for any confident sets S1 (in view 1) and S2 (in view 2) within the positive region,
Pr(S1 ⊕ S2) ≥ ε · min[ Pr(S1 ∧ S2), Pr(¬S1 ∧ ¬S2) ],
where probabilities are taken under D+ and S1 ⊕ S2 denotes the event that exactly one of x1 ∈ S1, x2 ∈ S2 holds.

• This is a natural analog of the graph-theoretic notions of conductance and expansion.

[Figure: confident sets S1 and S2 in the two views]
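As a sanity check on the definition above, here is a brute-force sketch that computes the best ε for a tiny, discrete positive distribution given as (x1, x2, probability) triples. The function name, the triple representation, and the example distributions are illustrative assumptions; the exhaustive enumeration over subsets is only feasible for a handful of distinct view values.

from itertools import chain, combinations

def powerset(vals):
    vals = list(vals)
    return chain.from_iterable(combinations(vals, r) for r in range(len(vals) + 1))

def expansion(dist):
    """dist: list of (x1, x2, prob) triples describing D+ over the positive region."""
    v1 = {x1 for x1, _, _ in dist}
    v2 = {x2 for _, x2, _ in dist}
    eps = float('inf')
    for S1 in map(set, powerset(v1)):
        for S2 in map(set, powerset(v2)):
            both = sum(p for x1, x2, p in dist if x1 in S1 and x2 in S2)
            neither = sum(p for x1, x2, p in dist if x1 not in S1 and x2 not in S2)
            exactly_one = sum(p for x1, x2, p in dist if (x1 in S1) != (x2 in S2))
            denom = min(both, neither)
            if denom > 0:
                eps = min(eps, exactly_one / denom)
    return eps

# Two positive "clusters" with no cross mass: not expanding (eps = 0).
print(expansion([('a', 'A', 0.5), ('b', 'B', 0.5)]))
# A little cross probability makes the distribution expanding (eps > 0).
print(expansion([('a', 'A', 0.45), ('b', 'B', 0.45), ('a', 'B', 0.1)]))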

Page 11

Property of the underlying distribution

• Necessary condition for co-training to work well:
– If S1 and S2 (our confident sets) do not expand, then we might never see examples for which one hypothesis could help the other.

• We show it is also sufficient for co-training to generalize well in a relatively small number of iterations, under some assumptions:
– the data is perfectly separable
– we have strong learning algorithms on the two sides

Page 12

Expansion, Examples: Learning Intervals

[Figure: two interval examples with targets c1, c2 and positive distribution D+. Left: a non-expanding distribution, with zero probability mass in the regions that would connect the confident sets S1 and S2. Right: an expanding distribution.]

Page 13

Weaker than independence given the label & than weak rule dependence.

[Figure: D+ and D- with confident sets S1 and S2]

e.g., w.h.p. a random degree-3 bipartite graph is expanding, but would NOT have independence given the label, or weak rule dependence.

Page 14

Main Result

• Assume D+ is ε-expanding.

• Assume that on each of the two views we have algorithms A1 and A2 for learning from positive data only.

• Assume that we have initial confident sets S1^0 and S2^0 with non-negligible probability mass under D+.

Page 15

Main Result, Interpretation

• The assumption on A1, A2 implies that they never generalize incorrectly.

• Question is: what needs to be true for them to actually generalize to the whole of D+?

[Figure: the positive regions X1+ and X2+ in the two views]

Page 16

Main Result, Proof Idea

• Expansion implies that at each iteration, there is reasonable probability mass on "new, useful" data.

• Algorithms generalize to most of this new region.

• See paper for real proof.


Page 17

What if assumptions are violated?

• What if our algorithms can make incorrect generalizations and/or there is no perfect separability?

Page 18

What if assumptions are violated?

• Expect "leakage" to negative region.

• If negative region is expanding too, then incorrect generalizations will grow at exponential rate.

• Correct generalization are growing at exponential rate too, but will slow down first.

• Expect overall accuracy to go up then down.

Page 19

Synthetic Experiments

• Create a 2n-by-2n bipartite graph;
– nodes 1 to n on each side represent positive clusters
– nodes n+1 to 2n on each side represent negative clusters

• Connect each node on the left to 3 nodes on the right:
– each neighbor is chosen with prob. 1-α to be a random node of the same class, and with prob. α to be a random node of the opposite class

• Begin with an initial confident set and then propagate confidence through rounds of co-training (a code sketch follows below):
– monitor the percentage of the positive class covered, the percent of the negative class mistakenly covered, and the overall accuracy
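A sketch of this experiment in Python; the propagation rule (a cluster becomes confident as soon as it has a confident neighbor on the other side), the single-node starting set, and the accuracy computation are my assumptions about details the slides leave implicit:

import numpy as np
from collections import defaultdict

def run(alpha, n=5000, d=3, rounds=15, seed=0):
    rng = np.random.default_rng(seed)
    # On each side, nodes 0..n-1 are positive clusters, n..2n-1 are negative clusters.
    adj_left = defaultdict(set)    # left node -> right neighbors
    adj_right = defaultdict(set)   # right node -> left neighbors
    for u in range(2 * n):
        for _ in range(d):
            same = rng.random() > alpha              # prob. 1-alpha: same-class neighbor
            neighbor_pos = (u < n) == same
            v = int(rng.integers(0, n)) if neighbor_pos else int(rng.integers(n, 2 * n))
            adj_left[u].add(v)
            adj_right[v].add(u)

    conf_left, conf_right = {0}, set()               # initial confident set: one positive node
    for t in range(rounds):
        conf_right |= {v for u in conf_left for v in adj_left[u]}
        conf_left |= {u for v in conf_right for u in adj_right[v]}
        pos_cov = (sum(u < n for u in conf_left) + sum(v < n for v in conf_right)) / (2 * n)
        neg_cov = (sum(u >= n for u in conf_left) + sum(v >= n for v in conf_right)) / (2 * n)
        acc = (pos_cov + 1 - neg_cov) / 2            # predict positive iff confident
        print(f"alpha={alpha} round={t + 1}: pos covered={pos_cov:.3f} "
              f"neg covered={neg_cov:.3f} accuracy={acc:.3f}")

run(alpha=0.01)
run(alpha=0.001)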

Page 20

Synthetic Experiments

• solid line indicates overall accuracy
• green curve is accuracy on positives
• red curve is accuracy on negatives

[Plots: left panel α=0.01, n=5000, d=3; right panel α=0.001, n=5000, d=3]

Page 21

Conclusions

• We propose a much weaker assumption on the underlying data distribution: expansion.

• It seems to be the "right" condition on the distribution for co-training to work well.

• It directly motivates the iterative nature of many of the practical co-training-based algorithms.