Maria-Florina Balcan
Modern Topics in Learning Theory
04/19/2006
Modern Topics in Learning Theory
• Semi-Supervised Learning
• Active Learning
• Kernels and Similarity Functions
• Tighter Data Dependent Bounds
Semi-Supervised Learning
• Hot topic in recent years in Machine Learning.
• Many applications have lots of unlabeled data, but labeled data is rare or expensive:
– Web page / document classification
– OCR, image classification
Combining Labeled and Unlabeled Data
• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
– Transductive SVM [J98]
– Co-training [BM98, BBY04]
– Graph-based methods [BC01], [ZGL03], [BLRR04]
• Augmented PAC model for SSL [BB05, BB06]
Can we extend the PAC model to deal with Unlabeled Data?
• PAC model – nice/standard model for learning from labeled data.
• Goal – extend it naturally to the case of learning from both labeled and unlabeled data.
– Different algorithms are based on different assumptions about how data should behave.
– Question – how to capture many of the assumptions typically used?
Example of “typical” assumption
• The separator goes through low-density regions of the space / has large margin.
  – assume we are looking for a linear separator
  – belief: a separator with large separation should exist
[Figure: the same +/− data shown twice – with labeled data only, the SVM separator; with unlabeled points added, the Transductive SVM separator passing through a low-density region.]
Another Example
• Agreement between two parts: co-training.
  – examples contain two sufficient sets of features, i.e. an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x)
  – for example, if we want to classify web pages:
    x1 – link info (e.g. anchor text “My Advisor” pointing to the page)
    x2 – text info (e.g. page text “Prof. Avrim Blum”)
    x = ⟨x1, x2⟩ – link info & text info
Co-Training [BM98]
[Figure: the two feature spaces X1 (link info) and X2 (text info), with a few labeled + examples whose labels carry over between the views.]
Works by using unlabeled data to propagate learned information.
Proposed Model [BB05, BB06]
• Augment the notion of a concept class C with a notion of compatibility between a concept and the data distribution.
• “learn C” becomes “learn (C, χ)” (i.e. learn class C under compatibility notion χ)
• Express relationships that one hopes the target function and underlying distribution will possess.
• Idea: use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.
Proposed Model, cont
• Idea: use unlabeled data & our belief to reduce size(C) down to size(highly compatible functions in C) in our sample complexity bounds.
• Want to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well.
• Require that the degree of compatibility be something that can be estimated from a finite sample.
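The reduction described above can be sketched in a few lines. This is a minimal illustration, not the paper's code; `estimate_compatibility` and `compatible_subset` are hypothetical names, and the χ used in the usage example below is an invented toy.

```python
def estimate_compatibility(chi, h, unlabeled):
    """Empirical estimate of chi(h, D) = E_{x~D}[chi(h, x)] from an unlabeled sample."""
    return sum(chi(h, x) for x in unlabeled) / len(unlabeled)

def compatible_subset(C, chi, unlabeled, eps):
    """Reduce C down to the hypotheses whose estimated unlabeled error rate
    err_unl(h) = 1 - chi(h, D) is at most eps."""
    return [h for h in C
            if 1.0 - estimate_compatibility(chi, h, unlabeled) <= eps]
```

Usage with hypotheses represented as real thresholds and a toy χ that penalizes points within 0.25 of the threshold: `compatible_subset([0.0, 0.5], chi, [0.1, 0.9], 0.1)` keeps only the threshold 0.5, since 0.0 has an unlabeled point inside its margin.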
Proposed Model, cont

• Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
• Require that the degree of compatibility be something that can be estimated from a finite sample.
• Require χ to be an expectation over individual examples:
  – χ(h,D) = E_{x∼D}[χ(h,x)] – compatibility of h with D, where χ(h,x) ∈ [0,1]
  – err_unl(h) = 1 − χ(h,D) – incompatibility of h with D (unlabeled error rate of h)
Margins, Compatibility

• Margins: belief is that there should exist a large-margin separator.
• Incompatibility of h and D (unlabeled error rate of h) – the probability mass within distance γ of h.
• Can be written as an expectation over individual examples, χ(h,D) = E_{x∼D}[χ(h,x)], where:
  – χ(h,x) = 0 if dist(x,h) ≤ γ
  – χ(h,x) = 1 if dist(x,h) ≥ γ

[Figure: a highly compatible separator – all + and − points lie outside the margin band.]
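A minimal sketch of this margin notion for a linear separator h = (w, b), assuming Euclidean distance to the hyperplane; the function name, data, and γ below are illustrative, not from the slides.

```python
import numpy as np

def unlabeled_error(w, b, X, gamma):
    """err_unl(h) = 1 - E_x[chi(h, x)], where chi(h, x) = 1 iff the point
    lies at distance >= gamma from the hyperplane w.x + b = 0."""
    dist = np.abs(X @ w + b) / np.linalg.norm(w)  # distance of each row of X to h
    chi = (dist >= gamma).astype(float)           # per-example compatibility in {0, 1}
    return 1.0 - chi.mean()                       # probability mass inside the margin
```

For example, with points at distances 2, 2, and 0.1 from the hyperplane and γ = 1, one of three points falls inside the margin, so the unlabeled error rate is 1/3.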
Margins, Compatibility

• Margins: belief is that there should exist a large-margin separator.
• If we do not want to commit to γ in advance, define χ(h,x) to be a smooth function of dist(x,h).
• Illegal notion of compatibility: the largest γ s.t. D has probability mass exactly zero within distance γ of h (not an expectation over individual examples, so it cannot be estimated from a finite sample).

[Figure: a highly compatible separator with + and − points outside the margin band.]
Co-Training, Compatibility
• Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩.
• Hope is that the two parts of the example are consistent.
• Legal (and natural) notion of compatibility:
  – the compatibility of ⟨h1, h2⟩ and D: Pr_{⟨x1,x2⟩∼D}[h1(x1) = h2(x2)]
  – can be written as an expectation over examples: χ(⟨h1,h2⟩, ⟨x1,x2⟩) = 1 iff h1(x1) = h2(x2)
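A sketch of this agreement-based notion, reconstructed from the stated hope that the two views of each example be consistent (the function name is mine):

```python
def cotrain_compatibility(h1, h2, unlabeled_pairs):
    """Empirical compatibility of the pair <h1, h2> with D: the fraction of
    unlabeled two-view examples <x1, x2> on which the two hypotheses agree,
    i.e. chi(<h1,h2>, <x1,x2>) = 1 iff h1(x1) == h2(x2)."""
    agree = sum(1 for x1, x2 in unlabeled_pairs if h1(x1) == h2(x2))
    return agree / len(unlabeled_pairs)
```

Note that no labels are needed: agreement between the two views is computed from unlabeled pairs alone, which is what makes this a legal compatibility notion.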
Examples of results in our model Sample Complexity - Uniform convergence bounds
Finite Hypothesis Spaces, Doubly Realizable Case
• Define C_{D,χ}(ε) = {h ∈ C : err_unl(h) ≤ ε}.
Theorem
• The bound on the # of labeled examples is a measure of the helpfulness of D with respect to χ – a helpful distribution is one in which C_{D,χ}(ε) is small.
Semi-Supervised Learning Natural Formalization (PAC)
• We will say an algorithm "PAC-learns" if it runs in
poly time using samples poly in respective bounds.
• E.g., can think of ln|C| as the # of bits to describe the target without knowing D, and ln|C_{D,χ}(ε)| as the # of bits to describe the target knowing a good approximation to D, given the assumption that the target has low unlabeled error rate.
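A toy numeric illustration of this bit-counting view (all numbers invented): if unlabeled data shrinks the effective class from |C| = 10^6 hypotheses to 100 highly compatible ones, the ln|·| factor in the labeled-sample bound shrinks accordingly.

```python
import math

total = 10**6        # |C|: ln|C| ~ bits to describe the target without knowing D
compatible = 100     # |C_{D,chi}(eps)|: after filtering by compatibility
ratio = math.log(total) / math.log(compatible)
print(ratio)         # ~3x fewer labeled examples needed in this toy case
```

The point is only the ratio of the logarithms; the constants hidden in the actual sample complexity bounds are not modeled here.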
Examples of results in our model Sample Complexity - Uniform convergence bounds
Finite Hypothesis Spaces – c* not fully compatible:
Theorem
Examples of results in our model Sample Complexity - Uniform convergence bounds
Infinite Hypothesis Spaces
• Assume χ(h,x) ∈ {0,1} and define χ(C) = {χ_h : h ∈ C}, where χ_h(x) = χ(h,x).
• C[m, D] – the expected number of splits of m points drawn from D with concepts in C.
Examples of results in our model Sample Complexity - Uniform convergence bounds
• For S ⊆ X, denote by U_S the uniform distribution over S, and by C[m, U_S] the expected number of splits of m points from U_S with concepts in C.
• Assume err(c*) = 0 and err_unl(c*) = 0.
• Theorem
• The number of labeled examples depends on the unlabeled sample.
• Useful since can imagine the learning alg. performing some calculations over the unlabeled data and then deciding how many labeled examples to purchase.
Examples of results in our model Sample Complexity – ε-Cover-based bounds

• For algorithms that behave in a specific way:
  – first use the unlabeled data to choose a representative set of compatible hypotheses
  – then use the labeled sample to choose among these
Theorem
• Can result in much better bound than uniform convergence!
Implications of our analysis
Ways in which unlabeled data can help

• If the target is highly compatible with D and we have enough unlabeled data to estimate χ over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low).
• By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as Annealed VC-entropy or the size of the smallest ε-cover).
• If D is nice, so that the set of compatible h ∈ C has a small ε-cover and the elements of the cover are far apart, then we can learn from even fewer labeled examples than the 1/ε needed just to verify a good hypothesis.
Modern Topics in Learning Theory
• Semi-Supervised Learning
• Active Learning
• Kernels and Similarity Functions
• Data Dependent Bounds
Active Learning

• Unlabeled data is cheap & easy to obtain; labeled data is (much) more expensive.
• The learner has the ability to choose specific examples to be labeled:
- The learner works harder, in order to use fewer labeled examples.
Membership queries
• The learner “constructs” the examples.
• [Baum and Lang, 1991] tried fitting a neural net to handwritten characters.
  – the synthetic instances created were incomprehensible to humans…
A PAC-like model [CAL92]
• Underlying distribution P on the (x,y) data. (agnostic setting)
• Learner has two abilities:
  – draw an unlabeled sample from the distribution
  – ask for the label of one of these samples
• Special case: assume the data is separable, i.e. some concept h ∈ C labels all points perfectly. (realizable setting)
Can adaptive querying help? [CAL92, D04]
• Consider threshold functions on the real line:
• Start with 1/ε unlabeled points.
• Binary search – need just log(1/ε) labels, from which the rest can be inferred.
• Output a consistent hypothesis.

[Figure: points on a line, − to the left of threshold w, + to the right.]

Exponential improvement in sample complexity.
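The binary search above, as a sketch (function and variable names are mine): with a sorted pool of n unlabeled points that some threshold separates into '-' below and '+' above, the boundary is found with about log2(n) label queries instead of n.

```python
def learn_threshold(points, query_label):
    """points: sorted unlabeled reals; query_label(x) returns '+' or '-'.
    Assumes labels are '-' below some threshold and '+' at or above it.
    Returns the smallest pool point labeled '+' (a consistent threshold)."""
    lo, hi = 0, len(points) - 1
    if query_label(points[hi]) == '-':   # everything in the pool is negative
        return None
    if query_label(points[lo]) == '+':   # everything in the pool is positive
        return points[lo]
    while hi - lo > 1:                   # invariant: points[lo] is '-', points[hi] is '+'
        mid = (lo + hi) // 2
        if query_label(points[mid]) == '+':
            hi = mid
        else:
            lo = mid
    return points[hi]
```

On a pool of 100 points this asks roughly 2 + log2(100) ≈ 9 label queries, matching the slide's log(1/ε) versus 1/ε comparison.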
Region of uncertainty [CAL92]
[Figure: current version space and the corresponding region of uncertainty in data space.]

• Example: data lies on a circle in R² and hypotheses are linear separators.
• Current version space: the part of C consistent with the labels so far.
• “Region of uncertainty” = the part of the data space about which there is still some uncertainty (i.e. disagreement within the version space).
Region of uncertainty [CAL92]
[Figure: current version space and the corresponding region of uncertainty in data space.]
Algorithm: Of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
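A sketch of this selective-sampling loop for a finite version space (the names and the threshold test below are my own, not CAL's original presentation): each unlabeled point is queried only if the current version space disagrees on it; otherwise its label is inferred for free.

```python
import random

def cal_active_learn(version_space, pool, query_label, seed=0):
    """CAL-style selective sampling: query only points in the region of
    uncertainty, then prune hypotheses inconsistent with the answer."""
    version_space = list(version_space)
    pool = list(pool)
    random.Random(seed).shuffle(pool)        # pick a random order of candidates
    labels_used = 0
    for x in pool:
        predictions = {h(x) for h in version_space}
        if len(predictions) > 1:             # x lies in the region of uncertainty
            y = query_label(x)
            labels_used += 1
            version_space = [h for h in version_space if h(x) == y]
    return version_space, labels_used
```

With threshold hypotheses on the line, for instance, points where all surviving thresholds agree cost no labels; only points between thresholds still in the version space trigger queries.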
Region of uncertainty [CAL92]
• Number of labels needed depends on C and also on P.
• Example: C – linear separators in R^d, D – uniform distribution over the unit sphere.
  – need only d^{3/2} log(1/ε) labels to find a hypothesis with error rate < ε
  – supervised learning: d/ε labels
• Exponential improvement in sample complexity.

For a robust version of CAL’92 see BBL’06.