View
219
Download
4
Tags:
Embed Size (px)
Citation preview
MariaFlorina Balcan
Active Learning of Binary Classifiers
Presenters: Nina Balcan and Steve Hanneke
MariaFlorina Balcan
Outline
• What is Active Learning?
• Active Learning Linear Separators
• General Theories of Active Learning
• Open Problems
MariaFlorina Balcan
Labeled examples
Supervised Passive Learning
Learning Algorithm
Expert / Oracle
Data Source
Unlabeled examples
Algorithm outputs a classifier
MariaFlorina Balcan
Incorporating Unlabeled Data in the Learning process
• In many settings, unlabeled data is cheap & easy to obtain, labeled data is much more expensive.
• Web page, document classification• OCR, Image classification
MariaFlorina Balcan
Labeled Examples
SemiSupervised Passive Learning
Learning Algorithm
Expert / Oracle
Data Source
Unlabeled examples
Algorithm outputs a classifier
Unlabeled examples
MariaFlorina Balcan
SemiSupervised Passive Learning
• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
• Transductive SVM [Joachims ’98]• Cotraining [Blum & Mitchell ’98], [BBY04]• Graphbased methods [Blum & Chawla01],
[ZGL03]
MariaFlorina Balcan
SemiSupervised Passive Learning
• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
• Transductive SVM [Joachims ’98]• Cotraining [Blum & Mitchell ’98], [BBY04]• Graphbased methods [Blum & Chawla01],
[ZGL03]
+
+
_
_
Labeled data only
+
+
_
_
+
+
_
_
Transductive SVMSVM
MariaFlorina Balcan
SemiSupervised Passive Learning
Workshops [ICML ’03, ICML’ 05]
Books: SemiSupervised Learning, MIT 2006 O. Chapelle, B. Scholkopf and A. Zien (eds)
• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
• Transductive SVM [Joachims ’98]• Cotraining [Blum & Mitchell ’98], [BBY04]• Graphbased methods [Blum & Chawla01],
[ZGL03]
Theoretical models: BalcanBlum’05
MariaFlorina Balcan
A Label for that ExampleRequest for the Label of an Example
A Label for that ExampleRequest for the Label of an Example
Active Learning
Data Source
Unlabeled examples
. . . Algorithm outputs a classifier
Learning Algorithm
Expert / Oracle
MariaFlorina Balcan
Choose the label requests carefully, to get informative labels.
What Makes a Good Algorithm?
• Guaranteed to output a relatively good classifier for most learning problems.
• Doesn’t make too many label requests.
MariaFlorina Balcan
Can It Really Do Better Than Passive?
• YES! (sometimes)
• We often need far fewer labels for active learning than for passive.
• This is predicted by theory and has been observed in practice.
MariaFlorina Balcan
Active Learning in Practice
• Active SVM (Tong & Koller, ICML 2000) seems to be quite useful in practice.
At any time during the alg., we have a “current guess” of the separator: the maxmargin separator of all labeled points so far.
E.g., strategy 1: request the label of the example closest to the current separator.
MariaFlorina Balcan
When Does it Work? And Why?
• The algorithms currently used in practice are not well understood theoretically.
• We don’t know if/when they output a good classifier, nor can we say how many labels they will need.
• So we seek algorithms that we can understand and state formal guarantees for.
Rest of this talk: surveys recent theoretical results.
MariaFlorina Balcan
Standard Supervised Learning Setting
• S={(x, l)}  set of labeled examples drawn i.i.d. from some distr. D over X and
labeled by some target concept c* 2 C
–err(h)=Prx 2 D(h(x) c*(x))
• Want to do optimization over S to find some hyp. h, but we want h to have small error over D.
Sample Complexity, Finite Hyp. Space, Realizable case
MariaFlorina Balcan
Sample Complexity: Uniform Convergence Bounds
• Infinite Hypothesis Case
E.g., if C  class of linear separators in Rd, then we need roughly O(d/ examples to achieve generalization error
Nonrealizable case – replace with 2.
MariaFlorina Balcan
Active Learning
• The learner has the ability to choose specific examples to be labeled:
 The learner works harder, in order to use fewer labeled examples.
• We get to see unlabeled data first, and there is a charge for every label.
• How many labels can we save by querying adaptively?
MariaFlorina Balcan
Can adaptive querying help? [CAL92, Dasgupta04]
• Consider threshold functions on the real line:
w
+
Exponential improvement in sample complexity
hw(x) = 1(x ¸ w), C = {hw: w 2 R}
• Sample with 1/ unlabeled examples.

• Binary search – need just O(log 1/) labels.
Active setting: O(log 1/) labels to find an accurate threshold.
Supervised learning needs O(1/) labels.
+
MariaFlorina Balcan
Active Learning might not help [Dasgupta04]
C = {linear separators in R1}: active learning reduces sample complexity substantially.
In this case: learning to accuracy requires 1/ labels…
In general, number of queries needed depends on C and also on D.
C = {linear separators in R2}: there are some target hyp. for which no improvement can be achieved!  no matter how benign the input distr.
h1
h2
h3
h0
MariaFlorina Balcan
Examples where Active Learning helps
• C = {linear separators in R1}: active learning reduces sample complexity substantially no matter what is the input distribution.
In general, number of queries needed depends on C and also on D.
• C  homogeneous linear separators in Rd, D  uniform distribution over unit sphere: • need only d log 1/ labels to find a hypothesis
with error rate < .
• Freund et al., ’97.
• Dasgupta, Kalai, Monteleoni, COLT 2005
• BalcanBroderZhang, COLT 07
MariaFlorina Balcan
Region of uncertainty [CAL92]
current version space
• Example: data lies on circle in R2 and hypotheses are homogeneous linear separators.
region of uncertainty in data space
• Current version space: part of C consistent with labels so far.• “Region of uncertainty” = part of data space about which there is still some uncertainty (i.e. disagreement within version space)
++
MariaFlorina Balcan
Region of uncertainty [CAL92]
Pick a few points at random from the current region of uncertainty and query their labels.
current version space
region of uncertainy
Algorithm:
MariaFlorina Balcan
Region of uncertainty [CAL92]
• Current version space: part of C consistent with labels so far.• “Region of uncertainty” = part of data space about which there is still some uncertainty (i.e. disagreement within version space)
current version space
region of uncertainty in data space
++
MariaFlorina Balcan
Region of uncertainty [CAL92]
• Current version space: part of C consistent with labels so far.• “Region of uncertainty” = part of data space about which there is still some uncertainty (i.e. disagreement within version space)
new version space
New region of uncertainty in data space
++
MariaFlorina Balcan
Region of uncertainty [CAL92], Guarantees
Algorithm: Pick a few points at random from the current region of uncertainty and query their labels.
• C  homogeneous linear separators in Rd, D uniform distribution over unit sphere.
• low noise, need only d2 log 1/ labels to find a hypothesis with error rate < .• realizable case, d3/2 log 1/ labels.
•supervised  d/ labels.
[Balcan, Beygelzimer, Langford, ICML’06]
Analyze a version of this alg. which is robust to noise.
• C linear separators on the line, low noise, exponential improvement.
MariaFlorina Balcan
Margin Based ActiveLearning Algorithm
Use O(d) examples to find w1 of error 1/8.
iterate k=2, … , log(1/) • rejection sample mk samples x from D
satisfying wk1T ¢ x · k ;
• label them;• find wk 2 B(wk1, 1/2k ) consistent with all these
examples.end iterate
[BalcanBroderZhang, COLT 07]
wk
wk+
1
γk
w*
MariaFlorina Balcan
BBZ’07, Proof Ideaiterate k=2, … , log(1/)
Rejection sample mk samples x from D satisfying wk1
T ¢ x · k ;ask for labels and find wk 2 B(wk1, 1/2k ) consistent with all these examples.
end iterate
Assume wk has error · . We are done if 9 k s.t. wk+1 has error · /2
and only need O(d log( 1/)) labels in round k.
wk
wk+1
γk
w*
MariaFlorina Balcan
BBZ’07, Proof Ideaiterate k=2, … , log(1/)
Rejection sample mk samples x from D satisfying wk1
T ¢ x · k ;ask for labels and find wk 2 B(wk1, 1/2k ) consistent with all these examples.
end iterate
Assume wk has error · . We are done if 9 k s.t. wk+1 has error · /2
and only need O(d log( 1/)) labels in round k.
wk
wk+1
γk
w*
MariaFlorina Balcan
BBZ’07, Proof Ideaiterate k=2, … , log(1/)
Rejection sample mk samples x from D satisfying wk1
T ¢ x · k ;ask for labels and find wk 2 B(wk1, 1/2k ) consistent with all these examples.
end iterate
Assume wk has error · . We are done if 9 k s.t. wk+1 has error · /2
and only need O(d log( 1/)) labels in round k.
Key Point
Under the uniform distr. assumption for
we have · /4 wk
wk+1
γk
w*
MariaFlorina Balcan
BBZ’07, Proof Idea
Key Point
Under the uniform distr. assumption for
we have · /4
So, it’s enough to ensure that
We can do so by only using O(d log( 1/)) labels in round k.
wk
wk+1
γk
w*
Key Point
MariaFlorina Balcan
General Theories of Active Learning
MariaFlorina Balcan
General Concept Spaces
• In the general learning problem, there is a concept space C, and we want to find an optimal classifier h C with high probability 1.
MariaFlorina Balcan
How Many Labels Do We Need?• In passive learning, we know of an algorithm
(empirical risk minimization) that needs only
labels (for realizable learning), and
if there is noise.
• We also know this is close to the best we can expect from any passive algorithm.
MariaFlorina Balcan
How Many Labels Do We Need?
• As before, we want to explore the analogous idea for Active Learning, (but now for general concept space C).
• How many label requests are necessary and sufficient for Active Learning?
• What are the relevant complexity measures? (i.e., the Active Learning analogue of VC dimension)
MariaFlorina Balcan
What ARE the Interesting Quantities?
• Generally speaking, we want examples whose labels are highly controversial among the set of remaining concepts.
• The likelihood of drawing such an informative example is an important quantity to consider.
• But there are many ways to define “informative” in general.
MariaFlorina Balcan
What Do You Mean By “Informative”?
• Want examples that reduce the version space.• But how do we measure progress?
• A problemspecific measure P on C?
• The Diameter?
• Measure of the region of disagreement?• Cover size? (see e.g., Hanneke, COLT 2007)
All of these seem to have interesting theories associated with them. As an example, let’s take a look at Diameter in detail.
MariaFlorina Balcan
Diameter (Dasgupta, NIPS 2005)
• Define distance d(g,h) = Pr(g(X)h(X)).• One way to guarantee our classifier is
within of the target classifier is to (safely) reduce the diameter to size .
Imagine each pair of concepts separated by distance > has an edge between them.We have to rule out at least one of the two concepts for each edge.
Each unlabeled example X partitions the concepts into two sets.
And guarantees some fraction of the edges will have at least one concept contradicted, no matter which label it has.
MariaFlorina Balcan
Diameter
•Theorem: (Dasgupta, NIPS 2005)
If, for any finite subset V C, PrX(X eliminates a ρ fraction of the edges) , then (assuming no noise) we can reduce the diameter to using a number of label requests at most
Furthermore, there is an algorithm that does this, which with high probability requires a number of unlabeled examples at most
MariaFlorina Balcan
Open Problems in Active Learning
• Efficient (correct) learning algorithms for linear separators provably achieving significant improvements on many distributions.
• What about binary feature spaces?• Tight generalpurpose sample
complexity bounds, for both realizable and agnostic.
• An optimal active learning algorithm?