
# Active Learning

Maria-Florina Balcan, Lecture 26




Active Learning

[Diagram: the learning algorithm draws unlabeled examples from the data source, sends a request for the label of a chosen example to an expert/oracle, receives a label for that example, and finally outputs a classifier.]

• The learner can choose specific examples to be labeled.
• It works harder computationally, in order to use fewer labeled examples.


• Choose the label requests carefully, to get informative labels.
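This interaction can be sketched as a generic pool-based loop. A minimal sketch; `fit` and `select` are hypothetical stand-ins for whatever training and query-selection rules a particular active learner uses:

```python
import random

def active_learning_loop(pool, oracle, fit, select, budget):
    """Generic pool-based active learning loop (illustrative sketch).

    pool   : unlabeled examples
    oracle : x -> label (each call costs one label request)
    fit    : list of (x, y) pairs -> classifier
    select : (classifier, pool) -> index of the next example to query
    budget : maximum number of label requests
    """
    pool = list(pool)
    # Seed with one random query so `fit` has something to work with.
    x0 = pool.pop(random.randrange(len(pool)))
    labeled = [(x0, oracle(x0))]
    h = fit(labeled)
    for _ in range(budget - 1):
        if not pool:
            break
        i = select(h, pool)              # choose the label request carefully
        x = pool.pop(i)
        labeled.append((x, oracle(x)))   # pay for exactly one label
        h = fit(labeled)
    return h
```

The whole design question of active learning lives inside `select`: a good choice of query makes each of the `budget` labels as informative as possible.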

What Makes a Good Algorithm?

• Guaranteed to output a relatively good classifier for most learning problems.

• Doesn’t make too many label requests.


Can It Really Do Better Than Passive?

• YES! (sometimes)

• We often need far fewer labels for active learning than for passive.

• This is predicted by theory and has been observed in practice.

Can adaptive querying help? [CAL92, Dasgupta04]

• Threshold functions on the real line: h_w(x) = 1(x ≥ w), C = {h_w : w ∈ R}.

[Figure: threshold w on the line, − labels to its left, + labels to its right.]

• Passive supervised: Ω(1/ε) labels to find an ε-accurate threshold.

• Active algorithm: sample O(1/ε) unlabeled examples; binary search needs just O(log(1/ε)) labels.

Exponential improvement. Other interesting results as well.
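The binary-search idea above can be written out directly; a minimal illustration (function and variable names are mine, not from the slides):

```python
def learn_threshold(unlabeled, oracle):
    """Active learner for 1-D thresholds h_w(x) = 1(x >= w).

    Sorts the unlabeled sample and binary-searches for the -/+ boundary,
    so only O(log n) of the n points are ever labeled.
    Returns (threshold, number_of_label_queries).
    """
    xs = sorted(unlabeled)
    if oracle(xs[0]) == 1:            # everything is positive
        return xs[0], 1
    if oracle(xs[-1]) == 0:           # everything is negative
        return xs[-1] + 1.0, 2
    queries = 2
    lo, hi = 0, len(xs) - 1
    # Invariant: xs[lo] is labeled -, xs[hi] is labeled +.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if oracle(xs[mid]) == 1:
            hi = mid
        else:
            lo = mid
        queries += 1
    # Any threshold strictly between xs[lo] and xs[hi] is consistent
    # with every point in the unlabeled sample.
    return (xs[lo] + xs[hi]) / 2, queries
```

On a sample of n = O(1/ε) points this spends about log2(n) label queries, versus labeling all n points in the passive setting.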


Active Learning might not help [Dasgupta04]

In general, the number of queries needed depends on C and also on the distribution D.

• C = {linear separators in R^1}: active learning reduces sample complexity substantially.

• C = {linear separators in R^2}: there are some target hypotheses for which no improvement can be achieved, no matter how benign the input distribution. In this case, learning to accuracy ε requires Ω(1/ε) labels.

[Figure: hypotheses h0, h1, h2, h3 illustrating the bad case.]


Examples where Active Learning helps

• C = {linear separators in R^1}: active learning reduces sample complexity substantially, no matter what the input distribution is.

In general, the number of queries needed depends on C and also on D.

• C = homogeneous linear separators in R^d, D = uniform distribution over the unit sphere: need only d log(1/ε) labels to find a hypothesis with error rate < ε.
  [Freund et al. '97; Dasgupta, Kalai, Monteleoni, COLT 2005; Balcan-Broder-Zhang, COLT '07]


Region of uncertainty [CAL92]

• Current version space: the part of C consistent with the labels so far.
• "Region of uncertainty" = the part of data space about which there is still some uncertainty (i.e., disagreement within the version space).

• Example: data lies on a circle in R^2 and hypotheses are homogeneous linear separators.

[Figure: current version space and the corresponding region of uncertainty in data space.]


Region of uncertainty [CAL92]

Algorithm: pick a few points at random from the current region of uncertainty and query their labels.

[Figure: current version space and region of uncertainty.]
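For the circle example, this scheme can be sketched with a discretized version space; a toy illustration (the discretization and all parameter choices are mine, not from [CAL92]):

```python
import math, random

def cal_active_learn(points, oracle, n_hyps=360, budget=25, rng=random):
    """CAL-style sketch: only query points in the region of uncertainty.

    Hypotheses are homogeneous linear separators in R^2, discretized as
    n_hyps unit normal vectors; the "version space" is the subset still
    consistent with every label seen so far.
    """
    def predict(w, p):
        return 1 if w[0] * p[0] + w[1] * p[1] >= 0 else -1

    version_space = [(math.cos(2 * math.pi * i / n_hyps),
                      math.sin(2 * math.pi * i / n_hyps)) for i in range(n_hyps)]
    n_queries = 0
    for _ in range(budget):
        # Region of uncertainty: points the version space disagrees on.
        uncertain = [p for p in points
                     if any(predict(w, p) != predict(version_space[0], p)
                            for w in version_space)]
        if not uncertain:
            break                        # all surviving hypotheses agree everywhere
        x = rng.choice(uncertain)        # query a random uncertain point
        y = oracle(x)
        n_queries += 1
        # Keep only hypotheses consistent with the new label.
        version_space = [w for w in version_space if predict(w, x) == y]
    return version_space[0], n_queries
```

Points outside the region of uncertainty are never queried: every hypothesis still under consideration already agrees on their labels, so a query there would be wasted.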


Region of uncertainty [CAL92]

After the queried labels are added, the version space shrinks, leaving a new, smaller region of uncertainty in data space.

[Figure: new version space and the new region of uncertainty.]


Region of uncertainty [CAL92], Guarantees

Algorithm: pick a few points at random from the current region of uncertainty and query their labels.

[Balcan, Beygelzimer, Langford, ICML'06]: analyze a version of this algorithm which is robust to noise. Guarantees:

• C = homogeneous linear separators in R^d, D = uniform distribution over the unit sphere:
  - realizable case: d^{3/2} log(1/ε) labels;
  - low noise: only d^2 log(1/ε) labels to find a hypothesis with error rate < ε;
  - passive supervised: d/ε labels.

• C = linear separators on the line, low noise: exponential improvement.


Margin Based Active-Learning Algorithm

[Balcan-Broder-Zhang, COLT 07]

• Use O(d) labeled examples to find w_1 of error ≤ 1/8.
• iterate k = 2, …, log(1/ε):
    - rejection sample m_k samples x from D satisfying |w_{k−1} · x| ≤ γ_k;
    - label them;
    - find w_k ∈ B(w_{k−1}, 1/2^k) consistent with all these examples.
  end iterate

[Figure: w_k, w_{k+1}, the margin band of width γ_k around w_k, and the target w*.]
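A toy 2-D sketch of this scheme (the γ_k schedule, the search radius, and the grid-search subroutine below are my own simplifications for illustration, not the paper's exact parameters):

```python
import math, random

def best_angle(data, center, radius, grid=400):
    """Angle in [center - radius, center + radius] with fewest mistakes on data."""
    def errs(theta):
        c, s = math.cos(theta), math.sin(theta)
        return sum(1 for (x, y) in data
                   if (1 if c * x[0] + s * x[1] >= 0 else -1) != y)
    cands = [center - radius + 2 * radius * i / grid for i in range(grid + 1)]
    return min(cands, key=errs)

def margin_based_al(oracle, rounds=6, m_per_round=40, rng=random):
    """Toy 2-D sketch of margin-based active learning [Balcan-Broder-Zhang].

    Each round queries only points near the current decision boundary
    (|w . x| <= gamma_k) and refits within a shrinking neighborhood of
    the previous hypothesis.  Total labels spent: rounds * m_per_round.
    """
    def sample_on_circle():
        phi = rng.uniform(0, 2 * math.pi)
        return (math.cos(phi), math.sin(phi))

    # Round 1: plain supervised learning on a few random points.
    data = [(x, oracle(x)) for x in (sample_on_circle() for _ in range(m_per_round))]
    theta = best_angle(data, 0.0, math.pi)        # search the whole circle

    for k in range(2, rounds + 1):
        gamma = 1.0 / 2 ** (k - 1)                # shrinking margin band (assumed schedule)
        w = (math.cos(theta), math.sin(theta))
        band = []
        while len(band) < m_per_round:            # rejection sampling from the band
            x = sample_on_circle()
            if abs(w[0] * x[0] + w[1] * x[1]) <= gamma:
                band.append((x, oracle(x)))       # labels are spent only on band points
        theta = best_angle(band, theta, math.pi / 2 ** (k - 1))  # stay near previous w
    return (math.cos(theta), math.sin(theta))
```

The key design choice mirrors the slide: as the hypothesis improves, the band around its boundary narrows, so the same per-round label budget keeps buying information exactly where the remaining disagreement with w* is concentrated.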


Margin Based Active-Learning, Realizable Case

Theorem: Assume P_X is uniform over the unit sphere. If γ_k and m_k are chosen appropriately (the exact conditions are elided in this transcription), then after s = log(1/ε) iterations, w_s has error ≤ ε.

[Proof ingredients, Facts 1-3: geometric relations between the angle θ(u, v) of two unit vectors u, v and the probability mass on which they disagree; the exact expressions are elided in this transcription.]


BBZ'07, Proof Idea

iterate k = 2, …, log(1/ε):
    rejection sample m_k samples x from D satisfying |w_{k−1} · x| ≤ γ_k;
    ask for labels and find w_k ∈ B(w_{k−1}, 1/2^k) consistent with all these examples.
end iterate

Induction step: assume w_k has error ≤ ε_k. We are done if we can show that w_{k+1} has error ≤ ε_k/2 while using only O(d log(1/ε)) labels in round k.

[Figure: w_k, w_{k+1}, margin band of width γ_k, target w*.]



BBZ'07, Proof Idea (continued)

Key Point: under the uniform distribution assumption, for an appropriate choice of γ_k, the error of any candidate w_{k+1} outside the band {x : |w_k · x| ≤ γ_k} is ≤ ε_k/4 (the exact expressions are elided in this transcription).

[Figure: w_k, w_{k+1}, margin band of width γ_k, target w*.]


BBZ'07, Proof Idea

Key Point: under the uniform distribution assumption, for an appropriate choice of γ_k, the error contributed outside the band is ≤ ε_k/4 (expressions elided in this transcription).

So, it is enough to ensure that the error inside the band is also small, and we can do so by only using O(d log(1/ε)) labels in round k.

[Figure: w_k, w_{k+1}, margin band of width γ_k, target w*.]
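Putting the pieces together, a rough label count based only on the quantities quoted in these slides (ignoring constants and lower-order factors): the algorithm runs for log(1/ε) rounds and spends O(d log(1/ε)) labels per round, so

```latex
\underbrace{O(\log(1/\epsilon))}_{\text{rounds}}
\;\times\;
\underbrace{O\!\left(d\,\log(1/\epsilon)\right)}_{\text{labels per round}}
\;=\;
O\!\left(d\,\log^{2}(1/\epsilon)\right)
```

versus the d/ε labels quoted earlier for passive supervised learning: an exponential improvement in the dependence on 1/ε.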