Maria-Florina Balcan
Modern Topics in Learning Theory
04/19/2006
Modern Topics in Learning Theory
• Semi-Supervised Learning
• Active Learning
• Kernels and Similarity Functions
• Tighter Data Dependent Bounds
Semi-Supervised Learning
• Hot topic in recent years in Machine Learning.
• Many applications have lots of unlabeled data, but labeled data is rare or expensive:
– Web page / document classification
– OCR, image classification
Combining Labeled and Unlabeled Data
• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
– Transductive SVM [J98]
– Co-training [BM98, BBY04]
– Graph-based methods [BC01], [ZGL03], [BLRR04]
• Augmented PAC model for SSL [BB05, BB06]
Can we extend the PAC model to deal with Unlabeled Data?
• PAC model – nice/standard model for learning from labeled data.
• Goal – extend it naturally to the case of learning from both labeled and unlabeled data.
– Different algorithms are based on different assumptions about how data should behave.
– Question – how to capture many of the assumptions typically used?
Example of “typical” assumption
• The separator goes through low-density regions of the space / has large margin.
  – assume we are looking for a linear separator
  – belief: a separator with large separation should exist
[Figure: the same +/− data shown twice – with labeled data only, the SVM separator; with unlabeled points added, the Transductive SVM separator passing through a low-density region.]
Another Example
• Agreement between two parts: co-training.
  – examples contain two sufficient sets of features, i.e. an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x)
  – for example, if we want to classify web pages:
    x1 – link info (e.g. anchor text “My Advisor” pointing to the page)
    x2 – text info (e.g. page text “Prof. Avrim Blum”)
    x = ⟨x1, x2⟩ – link info & text info
Co-Training [BM98]
[Figure: the two feature spaces X1 (link info) and X2 (text info), with a few labeled + examples whose labels carry over between the views.]
Works by using unlabeled data to propagate learned information.
Proposed Model [BB05, BB06]
• Augment the notion of a concept class C with a notion of compatibility between a concept and the data distribution.
• “learn C” becomes “learn (C, χ)” (i.e. learn class C under compatibility notion χ)
• Express relationships that one hopes the target function and underlying distribution will possess.
• Idea: use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.
Proposed Model, cont
• Idea: use unlabeled data & our belief to reduce size(C) down to size(highly compatible functions in C) in our sample complexity bounds.
• Want to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well.
• Require that the degree of compatibility be something that can be estimated from a finite sample.
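The reduction described above can be sketched in a few lines. This is a minimal illustration, not the paper's code; `estimate_compatibility` and `compatible_subset` are hypothetical names, and the χ used in the usage example below is an invented toy.

```python
def estimate_compatibility(chi, h, unlabeled):
    """Empirical estimate of chi(h, D) = E_{x~D}[chi(h, x)] from an unlabeled sample."""
    return sum(chi(h, x) for x in unlabeled) / len(unlabeled)

def compatible_subset(C, chi, unlabeled, eps):
    """Reduce C down to the hypotheses whose estimated unlabeled error rate
    err_unl(h) = 1 - chi(h, D) is at most eps."""
    return [h for h in C
            if 1.0 - estimate_compatibility(chi, h, unlabeled) <= eps]
```

Usage with hypotheses represented as real thresholds and a toy χ that penalizes points within 0.25 of the threshold: `compatible_subset([0.0, 0.5], chi, [0.1, 0.9], 0.1)` keeps only the threshold 0.5, since 0.0 has an unlabeled point inside its margin.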
Proposed Model, cont

• Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
• Require that the degree of compatibility be something that can be estimated from a finite sample.
• Require χ to be an expectation over individual examples:
  – χ(h,D) = E_{x∼D}[χ(h,x)] – compatibility of h with D, where χ(h,x) ∈ [0,1]
  – err_unl(h) = 1 − χ(h,D) – incompatibility of h with D (unlabeled error rate of h)
Margins, Compatibility

• Margins: belief is that there should exist a large-margin separator.
• Incompatibility of h and D (unlabeled error rate of h) – the probability mass within distance γ of h.
• Can be written as an expectation over individual examples, χ(h,D) = E_{x∼D}[χ(h,x)], where:
  – χ(h,x) = 0 if dist(x,h) ≤ γ
  – χ(h,x) = 1 if dist(x,h) ≥ γ

[Figure: a highly compatible separator – all + and − points lie outside the margin band.]
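A minimal sketch of this margin notion for a linear separator h = (w, b), assuming Euclidean distance to the hyperplane; the function name, data, and γ below are illustrative, not from the slides.

```python
import numpy as np

def unlabeled_error(w, b, X, gamma):
    """err_unl(h) = 1 - E_x[chi(h, x)], where chi(h, x) = 1 iff the point
    lies at distance >= gamma from the hyperplane w.x + b = 0."""
    dist = np.abs(X @ w + b) / np.linalg.norm(w)  # distance of each row of X to h
    chi = (dist >= gamma).astype(float)           # per-example compatibility in {0, 1}
    return 1.0 - chi.mean()                       # probability mass inside the margin
```

For example, with points at distances 2, 2, and 0.1 from the hyperplane and γ = 1, one of three points falls inside the margin, so the unlabeled error rate is 1/3.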
Margins, Compatibility

• Margins: belief is that there should exist a large-margin separator.
• If we do not want to commit to γ in advance, define χ(h,x) to be a smooth function of dist(x,h).
• Illegal notion of compatibility: the largest γ s.t. D has probability mass exactly zero within distance γ of h (not an expectation over individual examples, so it cannot be estimated from a finite sample).

[Figure: a highly compatible separator with + and − points outside the margin band.]
Co-Training, Compatibility
• Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩.
• Hope is that the two parts of the example are consistent.
• Legal (and natural) notion of compatibility:
  – the compatibility of ⟨h1, h2⟩ and D: Pr_{⟨x1,x2⟩∼D}[h1(x1) = h2(x2)]
  – can be written as an expectation over examples: χ(⟨h1,h2⟩, ⟨x1,x2⟩) = 1 iff h1(x1) = h2(x2)
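A sketch of this agreement-based notion, reconstructed from the stated hope that the two views of each example be consistent (the function name is mine):

```python
def cotrain_compatibility(h1, h2, unlabeled_pairs):
    """Empirical compatibility of the pair <h1, h2> with D: the fraction of
    unlabeled two-view examples <x1, x2> on which the two hypotheses agree,
    i.e. chi(<h1,h2>, <x1,x2>) = 1 iff h1(x1) == h2(x2)."""
    agree = sum(1 for x1, x2 in unlabeled_pairs if h1(x1) == h2(x2))
    return agree / len(unlabeled_pairs)
```

Note that no labels are needed: agreement between the two views is computed from unlabeled pairs alone, which is what makes this a legal compatibility notion.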
Examples of results in our model Sample Complexity - Uniform convergence bounds
Finite Hypothesis Spaces, Doubly Realizable Case
• Define C_{D,χ}(ε) = {h ∈ C : err_unl(h) ≤ ε}.
Theorem
• The bound on the # of labeled examples is a measure of the helpfulness of D with respect to χ – a helpful distribution is one in which C_{D,χ}(ε) is small.
Semi-Supervised Learning Natural Formalization (PAC)
• We will say an algorithm "PAC-learns" if it runs in
poly time using samples poly in respective bounds.
• E.g., can think of ln|C| as the # of bits to describe the target without knowing D, and ln|C_{D,χ}(ε)| as the # of bits to describe the target knowing a good approximation to D, given the assumption that the target has low unlabeled error rate.
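A toy numeric illustration of this bit-counting view (all numbers invented): if unlabeled data shrinks the effective class from |C| = 10^6 hypotheses to 100 highly compatible ones, the ln|·| factor in the labeled-sample bound shrinks accordingly.

```python
import math

total = 10**6        # |C|: ln|C| ~ bits to describe the target without knowing D
compatible = 100     # |C_{D,chi}(eps)|: after filtering by compatibility
ratio = math.log(total) / math.log(compatible)
print(ratio)         # ~3x fewer labeled examples needed in this toy case
```

The point is only the ratio of the logarithms; the constants hidden in the actual sample complexity bounds are not modeled here.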
Examples of results in our model Sample Complexity - Uniform convergence bounds
Finite Hypothesis Spaces – c* not fully compatible:
Theorem
Examples of results in our model Sample Complexity - Uniform convergence bounds
Infinite Hypothesis Spaces
• Assume χ(h,x) ∈ {0,1} and define χ(C) = {χ_h : h ∈ C}, where χ_h(x) = χ(h,x).
• C[m, D] – the expected number of splits of m points drawn from D with concepts in C.
Examples of results in our model Sample Complexity - Uniform convergence bounds
• For S ⊆ X, denote by U_S the uniform distribution over S, and by C[m, U_S] the expected number of splits of m points from U_S with concepts in C.
• Assume err(c*) = 0 and err_unl(c*) = 0.
• Theorem
• The number of labeled examples depends on the unlabeled sample.
• Useful since can imagine the learning alg. performing some calculations over the unlabeled data and then deciding how many labeled examples to purchase.
Examples of results in our model Sample Complexity – ε-Cover-based bounds

• For algorithms that behave in a specific way:
  – first use the unlabeled data to choose a representative set of compatible hypotheses
  – then use the labeled sample to choose among these
Theorem
• Can result in much better bound than uniform convergence!
Implications of our analysis
Ways in which unlabeled data can help

• If the target is highly compatible with D and we have enough unlabeled data to estimate χ over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low).
• By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as Annealed VC-entropy or the size of the smallest ε-cover).
• If D is nice, so that the set of compatible h ∈ C has a small ε-cover and the elements of the cover are far apart, then we can learn from even fewer labeled examples than the 1/ε needed just to verify a good hypothesis.
Modern Topics in Learning Theory
• Semi-Supervised Learning
• Active Learning
• Kernels and Similarity Functions
• Data Dependent Bounds
Active Learning

• Unlabeled data is cheap & easy to obtain; labeled data is (much) more expensive.
• The learner has the ability to choose specific examples to be labeled:
- The learner works harder, in order to use fewer labeled examples.
Membership queries
• The learner “constructs” the examples.
• [Baum and Lang, 1991] tried fitting a neural net to handwritten characters.
  – the synthetic instances created were incomprehensible to humans…
A PAC-like model [CAL92]
• Underlying distribution P on the (x,y) data. (agnostic setting)
• Learner has two abilities:
  – draw an unlabeled sample from the distribution
  – ask for the label of one of these samples
• Special case: assume the data is separable, i.e. some concept h ∈ C labels all points perfectly. (realizable setting)
Can adaptive querying help? [CAL92, D04]
• Consider threshold functions on the real line:
• Start with 1/ε unlabeled points.
• Binary search – need just log(1/ε) labels, from which the rest can be inferred.
• Output a consistent hypothesis.

[Figure: points on a line, − to the left of threshold w, + to the right.]

Exponential improvement in sample complexity.
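The binary search above, as a sketch (function and variable names are mine): with a sorted pool of n unlabeled points that some threshold separates into '-' below and '+' above, the boundary is found with about log2(n) label queries instead of n.

```python
def learn_threshold(points, query_label):
    """points: sorted unlabeled reals; query_label(x) returns '+' or '-'.
    Assumes labels are '-' below some threshold and '+' at or above it.
    Returns the smallest pool point labeled '+' (a consistent threshold)."""
    lo, hi = 0, len(points) - 1
    if query_label(points[hi]) == '-':   # everything in the pool is negative
        return None
    if query_label(points[lo]) == '+':   # everything in the pool is positive
        return points[lo]
    while hi - lo > 1:                   # invariant: points[lo] is '-', points[hi] is '+'
        mid = (lo + hi) // 2
        if query_label(points[mid]) == '+':
            hi = mid
        else:
            lo = mid
    return points[hi]
```

On a pool of 100 points this asks roughly 2 + log2(100) ≈ 9 label queries, matching the slide's log(1/ε) versus 1/ε comparison.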
Region of uncertainty [CAL92]
[Figure: current version space and the corresponding region of uncertainty in data space.]

• Example: data lies on a circle in R² and hypotheses are linear separators.
• Current version space: the part of C consistent with the labels so far.
• “Region of uncertainty” = the part of the data space about which there is still some uncertainty (i.e. disagreement within the version space).
Region of uncertainty [CAL92]
[Figure: current version space and the corresponding region of uncertainty in data space.]
Algorithm: Of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
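A sketch of this selective-sampling loop for a finite version space (the names and the threshold test below are my own, not CAL's original presentation): each unlabeled point is queried only if the current version space disagrees on it; otherwise its label is inferred for free.

```python
import random

def cal_active_learn(version_space, pool, query_label, seed=0):
    """CAL-style selective sampling: query only points in the region of
    uncertainty, then prune hypotheses inconsistent with the answer."""
    version_space = list(version_space)
    pool = list(pool)
    random.Random(seed).shuffle(pool)        # pick a random order of candidates
    labels_used = 0
    for x in pool:
        predictions = {h(x) for h in version_space}
        if len(predictions) > 1:             # x lies in the region of uncertainty
            y = query_label(x)
            labels_used += 1
            version_space = [h for h in version_space if h(x) == y]
    return version_space, labels_used
```

With threshold hypotheses on the line, for instance, points where all surviving thresholds agree cost no labels; only points between thresholds still in the version space trigger queries.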
Region of uncertainty [CAL92]
• Number of labels needed depends on C and also on P.
• Example: C – linear separators in R^d, D – uniform distribution over the unit sphere.
  – need only d^{3/2} log(1/ε) labels to find a hypothesis with error rate < ε
  – supervised learning: d/ε labels
• Exponential improvement in sample complexity.

For a robust version of CAL’92 see BBL’06.