Maria-Florina Balcan Active Learning of Binary Classifiers Presenters: Nina Balcan and Steve Hanneke

Maria-Florina Balcan

Active Learning of Binary Classifiers

Presenters: Nina Balcan and Steve Hanneke


Outline

• What is Active Learning?

• Active Learning Linear Separators

• General Theories of Active Learning

• Open Problems


Labeled examples

Supervised Passive Learning

Learning Algorithm

Expert / Oracle

Data Source

Unlabeled examples

Algorithm outputs a classifier


Incorporating Unlabeled Data in the Learning process

• In many settings, unlabeled data is cheap & easy to obtain, labeled data is much more expensive.

• Web page, document classification• OCR, Image classification


Labeled Examples

Semi-Supervised Passive Learning

Learning Algorithm

Expert / Oracle

Data Source

Unlabeled examples

Algorithm outputs a classifier

Unlabeled examples



• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:

• Transductive SVM [Joachims ’98]• Co-training [Blum & Mitchell ’98], [BBY04]• Graph-based methods [Blum & Chawla01],

[ZGL03]





[ZGL03]

+

+

_

_

Labeled data only

+

+

_

_

+

+

_

_

Transductive SVMSVM



Workshops [ICML ’03, ICML’ 05]

Books: Semi-Supervised Learning, MIT 2006 O. Chapelle, B. Scholkopf and A. Zien (eds)



[ZGL03]

Theoretical models: Balcan-Blum’05


A Label for that ExampleRequest for the Label of an Example

A Label for that ExampleRequest for the Label of an Example

Active Learning

Data Source

Unlabeled examples

. . . Algorithm outputs a classifier

Learning Algorithm

Expert / Oracle


Choose the label requests carefully, to get informative labels.

What Makes a Good Algorithm?

• Guaranteed to output a relatively good classifier for most learning problems.

• Doesn’t make too many label requests.


Can It Really Do Better Than Passive?

• YES! (sometimes)

• We often need far fewer labels for active learning than for passive.

• This is predicted by theory and has been observed in practice.


Active Learning in Practice

• Active SVM (Tong & Koller, ICML 2000) seems to be quite useful in practice.

At any time during the alg., we have a “current guess” of the separator: the max-margin separator of all labeled points so far.

E.g., strategy 1: request the label of the example closest to the current separator.


When Does it Work? And Why?

• The algorithms currently used in practice are not well understood theoretically.

• We don’t know if/when they output a good classifier, nor can we say how many labels they will need.

• So we seek algorithms that we can understand and state formal guarantees for.

Rest of this talk: surveys recent theoretical results.


Standard Supervised Learning Setting

• S={(x, l)} - set of labeled examples- drawn i.i.d. from some distr. D over X and

labeled by some target concept c* 2 C

–err(h)=Prx 2 D(h(x) c*(x))

• Want to do optimization over S to find some hyp. h, but we want h to have small error over D.

Sample Complexity, Finite Hyp. Space, Realizable case


Sample Complexity: Uniform Convergence Bounds

• Infinite Hypothesis Case

E.g., if C - class of linear separators in Rd, then we need roughly O(d/ examples to achieve generalization error

Non-realizable case – replace with 2.


Active Learning

• The learner has the ability to choose specific examples to be labeled:

- The learner works harder, in order to use fewer labeled examples.

• We get to see unlabeled data first, and there is a charge for every label.

• How many labels can we save by querying adaptively?


Can adaptive querying help? [CAL92, Dasgupta04]

• Consider threshold functions on the real line:

w

+-

Exponential improvement in sample complexity

hw(x) = 1(x ¸ w), C = {hw: w 2 R}

• Sample with 1/ unlabeled examples.

-

• Binary search – need just O(log 1/) labels.

Active setting: O(log 1/) labels to find an -accurate threshold.

Supervised learning needs O(1/) labels.

+-


Active Learning might not help [Dasgupta04]

C = {linear separators in R1}: active learning reduces sample complexity substantially.

In this case: learning to accuracy requires 1/ labels…

In general, number of queries needed depends on C and also on D.

C = {linear separators in R2}: there are some target hyp. for which no improvement can be achieved! - no matter how benign the input distr.

h1

h2

h3

h0


Examples where Active Learning helps

• C = {linear separators in R1}: active learning reduces sample complexity substantially no matter what is the input distribution.

In general, number of queries needed depends on C and also on D.

• C - homogeneous linear separators in Rd, D - uniform distribution over unit sphere: • need only d log 1/ labels to find a hypothesis

with error rate < .

• Freund et al., ’97.

• Dasgupta, Kalai, Monteleoni, COLT 2005

• Balcan-Broder-Zhang, COLT 07


Region of uncertainty [CAL92]

current version space

• Example: data lies on circle in R2 and hypotheses are homogeneous linear separators.

region of uncertainty in data space

• Current version space: part of C consistent with labels so far.• “Region of uncertainty” = part of data space about which there is still some uncertainty (i.e. disagreement within version space)

++



Pick a few points at random from the current region of uncertainty and query their labels.


region of uncertainy

Algorithm:





region of uncertainty in data space

++




new version space

New region of uncertainty in data space

++


Region of uncertainty [CAL92], Guarantees

Algorithm: Pick a few points at random from the current region of uncertainty and query their labels.

• C - homogeneous linear separators in Rd, D -uniform distribution over unit sphere.

• low noise, need only d2 log 1/ labels to find a hypothesis with error rate < .• realizable case, d3/2 log 1/ labels.

•supervised -- d/ labels.

[Balcan, Beygelzimer, Langford, ICML’06]

Analyze a version of this alg. which is robust to noise.

• C- linear separators on the line, low noise, exponential improvement.


Margin Based Active-Learning Algorithm

Use O(d) examples to find w1 of error 1/8.

iterate k=2, … , log(1/) • rejection sample mk samples x from D

satisfying |wk-1T ¢ x| · k ;

• label them;• find wk 2 B(wk-1, 1/2k ) consistent with all these

examples.end iterate

[Balcan-Broder-Zhang, COLT 07]

wk

wk+

1

γk

w*


BBZ’07, Proof Ideaiterate k=2, … , log(1/)

Rejection sample mk samples x from D satisfying |wk-1

T ¢ x| · k ;ask for labels and find wk 2 B(wk-1, 1/2k ) consistent with all these examples.

end iterate

Assume wk has error · . We are done if 9 k s.t. wk+1 has error · /2

and only need O(d log( 1/)) labels in round k.

wk

wk+1

γk

w*





end iterate



wk

wk+1

γk

w*





end iterate



Key Point

Under the uniform distr. assumption for

we have · /4 wk

wk+1

γk

w*


BBZ’07, Proof Idea

Key Point

Under the uniform distr. assumption for

we have · /4

So, it’s enough to ensure that

We can do so by only using O(d log( 1/)) labels in round k.

wk

wk+1

γk

w*

Key Point


General Theories of Active Learning


General Concept Spaces

• In the general learning problem, there is a concept space C, and we want to find an -optimal classifier h C with high probability 1-.


How Many Labels Do We Need?• In passive learning, we know of an algorithm

(empirical risk minimization) that needs only

labels (for realizable learning), and

if there is noise.

• We also know this is close to the best we can expect from any passive algorithm.


How Many Labels Do We Need?

• As before, we want to explore the analogous idea for Active Learning, (but now for general concept space C).

• How many label requests are necessary and sufficient for Active Learning?

• What are the relevant complexity measures? (i.e., the Active Learning analogue of VC dimension)


What ARE the Interesting Quantities?

• Generally speaking, we want examples whose labels are highly controversial among the set of remaining concepts.

• The likelihood of drawing such an informative example is an important quantity to consider.

• But there are many ways to define “informative” in general.


What Do You Mean By “Informative”?

• Want examples that reduce the version space.• But how do we measure progress?

• A problem-specific measure P on C?

• The Diameter?

• Measure of the region of disagreement?• Cover size? (see e.g., Hanneke, COLT 2007)

All of these seem to have interesting theories associated with them. As an example, let’s take a look at Diameter in detail.


Diameter (Dasgupta, NIPS 2005)

• Define distance d(g,h) = Pr(g(X)h(X)).• One way to guarantee our classifier is

within of the target classifier is to (safely) reduce the diameter to size .

Imagine each pair of concepts separated by distance > has an edge between them.We have to rule out at least one of the two concepts for each edge.

Each unlabeled example X partitions the concepts into two sets.

And guarantees some fraction of the edges will have at least one concept contradicted, no matter which label it has.


Diameter

•Theorem: (Dasgupta, NIPS 2005)

If, for any finite subset V C, PrX(X eliminates a ρ fraction of the edges) , then (assuming no noise) we can reduce the diameter to using a number of label requests at most

Furthermore, there is an algorithm that does this, which with high probability requires a number of unlabeled examples at most


Open Problems in Active Learning

• Efficient (correct) learning algorithms for linear separators provably achieving significant improvements on many distributions.

• What about binary feature spaces?• Tight general-purpose sample

complexity bounds, for both realizable and agnostic.

• An optimal active learning algorithm?

Documents

Maria-Florina Balcan Active Learning of Binary Classifiers Presenters: Nina Balcan and Steve Hanneke