Maria-Florina Balcan – Modern Topics in Learning Theory – 04/19/2006


Page 1:

Maria-Florina Balcan

Modern Topics in Learning Theory

Maria-Florina Balcan
04/19/2006

Page 2:

Modern Topics in Learning Theory

• Semi-Supervised Learning

• Active Learning

• Kernels and Similarity Functions

• Tighter Data Dependent Bounds

Page 3:

Semi-Supervised Learning

• Hot topic in recent years in Machine Learning.

• Many applications have lots of unlabeled data, but labeled data is rare or expensive:

• Web page and document classification
• OCR, image classification

Page 4:

Combining Labeled and Unlabeled Data

• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:

• Transductive SVM [J98]
• Co-training [BM98, BBY04]
• Graph-based methods [BC01], [ZGL03], [BLRR04]

• Augmented PAC model for SSL [BB05, BB06]

Page 5:

Can we extend the PAC model to deal with Unlabeled Data?

• PAC model – nice/standard model for learning from labeled data.

• Goal – extend it naturally to the case of learning from both labeled and unlabeled data.

– Different algorithms are based on different assumptions about how data should behave.

– Question – how to capture many of the assumptions typically used?

Page 6:

Example of “typical” assumption

• The separator goes through low-density regions of the space / has large margin.
  – assume we are looking for a linear separator
  – belief: there should exist one with large separation

[Figure: the same labeled examples (+/−) shown three times – labeled data only, the SVM separator, and the Transductive SVM separator.]

Page 7:

Another Example

• Agreement between two parts: co-training.
  – examples contain two sufficient sets of features, i.e. an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x)
  – for example, if we want to classify web pages: x = ⟨x1, x2⟩, where x1 is the link info and x2 is the text info

[Figure: a web page (“My Advisor – Prof. Avrim Blum”) split into its link info (x1) and its text info (x2).]

Page 8:

Co-Training [BM98]

[Figure: the two views, X1 (link info) and X2 (text info), with labeled + examples in each view.]

Works by using unlabeled data to propagate learned information.
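The propagation idea can be sketched as a simple loop: train one classifier per view on the labeled pairs, then let each view label the unlabeled pairs it is most confident about for the other. This is a toy illustration of the idea, not the exact [BM98] algorithm; the midpoint-threshold learner and all names are my own.

```python
# Toy co-training loop: two redundant real-valued views per example,
# a midpoint-threshold learner per view, and each round every view
# labels the unlabeled pairs it is most confident about (farthest
# from its threshold). Assumes both classes appear in the initial
# labeled set.

def train_threshold(points, labels):
    """Fit a 1-D threshold halfway between the two classes."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    neg = [p for p, y in zip(points, labels) if y == 0]
    return (min(pos) + max(neg)) / 2.0

def co_train(labeled, unlabeled, rounds=5, per_round=2):
    """labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        for view in (0, 1):
            thr = train_threshold([x[view] for x, _ in labeled],
                                  [y for _, y in labeled])
            # most confident = farthest from this view's threshold
            unlabeled.sort(key=lambda x: -abs(x[view] - thr))
            for x in unlabeled[:per_round]:
                labeled.append((x, 1 if x[view] > thr else 0))
            unlabeled = unlabeled[per_round:]
    return labeled

seed = [((0.9, 0.85), 1), ((0.1, 0.15), 0)]
pool = [(0.8, 0.75), (0.2, 0.25), (0.6, 0.65), (0.4, 0.35)]
print(co_train(seed, pool))  # all four pool pairs acquire labels
```

With real web-page data the per-view learners would be, e.g., a link-based and a text-based classifier, and confidence would come from the classifiers themselves.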

Page 9:

Proposed Model [BB05, BB06]

• Augment the notion of a concept class C with a notion of compatibility between a concept and the data distribution.

• “learn C” becomes “learn (C, χ)” (i.e. learn class C under compatibility notion χ)

• Express relationships that one hopes the target function and underlying distribution will possess.

• Idea: use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.

Page 10:

Proposed Model, cont

• Idea: use unlabeled data & our belief to reduce size(C) down to size(highly compatible functions in C) in our sample complexity bounds.

• Want to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well.

• Require that the degree of compatibility be something that can be estimated from a finite sample.

Page 11:

Proposed Model, cont

• Augment the notion of a concept class C with a notion of compatibility between a concept and the data distribution.

• Require that the degree of compatibility be something that can be estimated from a finite sample.

• Require χ to be an expectation over individual examples:

  – χ(h, D) = E_{x∼D}[χ(h, x)] is the compatibility of h with D, with χ(h, x) ∈ [0, 1]

  – err_unl(h) = 1 − χ(h, D) is the incompatibility of h with D (the unlabeled error rate of h)
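The requirement that χ be an expectation over individual examples is exactly what makes compatibility estimable from a finite unlabeled sample: average χ(h, x) over the sample. A minimal sketch, using the margin compatibility of the following slide as the concrete χ (the function names are illustrative, not from the talk):

```python
# Estimate chi(h, D) = E_x[chi(h, x)] and err_unl(h) = 1 - chi(h, D)
# from a finite unlabeled sample. Concrete chi: margin compatibility
# of a linear separator through the origin, chi(h, x) = 1 iff x lies
# at distance >= gamma from the separating hyperplane.

import math

def dist_to_hyperplane(w, x):
    """Distance from point x to the hyperplane {z : w . z = 0}."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot) / norm

def margin_chi(w, x, gamma):
    """chi(h, x) = 1 if dist(x, h) >= gamma, else 0."""
    return 1.0 if dist_to_hyperplane(w, x) >= gamma else 0.0

def estimated_err_unl(w, unlabeled, gamma):
    """Empirical 1 - chi(h, D): fraction of sample within gamma of h."""
    chi_hat = sum(margin_chi(w, x, gamma) for x in unlabeled) / len(unlabeled)
    return 1.0 - chi_hat

# Separator x1 = 0 in R^2; two points far from it, one inside the band.
sample = [(1.0, 0.3), (-0.9, 0.1), (0.05, 0.8)]
print(estimated_err_unl((1.0, 0.0), sample, gamma=0.5))
# about 0.33: one of the three points lies inside the margin band
```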

Page 12:

Margins, Compatibility

• Margins: the belief is that there should exist a large-margin separator.

• Incompatibility of h and D (the unlabeled error rate of h) – the probability mass within distance γ of h.

• It can be written as an expectation over individual examples, χ(h, D) = E_{x∼D}[χ(h, x)], where:

  • χ(h, x) = 0 if dist(x, h) < γ

  • χ(h, x) = 1 if dist(x, h) ≥ γ

[Figure: a highly compatible separator – no points fall within distance γ of it.]

Page 13:

Margins, Compatibility

• Margins: the belief is that there should exist a large-margin separator.

• If we do not want to commit to γ in advance, define χ(h, x) to be a smooth function of dist(x, h), e.g.:

• Illegal notion of compatibility: the largest γ s.t. D has probability mass exactly zero within distance γ of h.

[Figure: a highly compatible large-margin separator.]
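The concrete smooth function referred to by “e.g.” above is an image and is not recoverable from the transcript; one plausible choice (my assumption, not necessarily the slide's) is a linear ramp in dist(x, h) that saturates at γ:

```python
# A smooth margin compatibility: instead of a hard 0/1 cut at gamma,
# let chi(h, x) grow linearly with distance up to gamma. This ramp is
# an illustrative choice; the formula on the original slide is lost.

def smooth_margin_chi(dist, gamma):
    """chi in [0, 1], increasing in dist(x, h), saturating at gamma."""
    return min(dist / gamma, 1.0)

print(smooth_margin_chi(0.0, 0.5))   # 0.0 (on the separator)
print(smooth_margin_chi(0.25, 0.5))  # 0.5 (halfway into the band)
print(smooth_margin_chi(1.0, 0.5))   # 1.0 (outside the band)
```

Note this is still an expectation over individual examples, so it remains estimable from a finite sample; and unlike the “illegal” notion above, a small change in D changes the compatibility only slightly.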

Page 14:

Co-Training, Compatibility

• Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩.

• Hope is that the two parts of the example are consistent.

• Legal (and natural) notion of compatibility:
  – the compatibility of ⟨h1, h2⟩ and D:

– can be written as an expectation over examples:
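The compatibility and its expectation form appear only as images in the transcript; under the consistency belief, the natural choice (a hedged reconstruction) is χ(⟨h1, h2⟩, ⟨x1, x2⟩) = 1 if h1(x1) = h2(x2) and 0 otherwise, so compatibility with D is the probability that the two views' predictions agree:

```python
# Co-training compatibility: chi(<h1,h2>, <x1,x2>) = 1 if
# h1(x1) == h2(x2), else 0. Compatibility with D is then the
# probability the two views agree. (Reconstruction of the slide's
# expectation; function names are illustrative.)

def pair_chi(h1, h2, x):
    x1, x2 = x
    return 1.0 if h1(x1) == h2(x2) else 0.0

def pair_compatibility(h1, h2, unlabeled_pairs):
    """Empirical chi(<h1,h2>, D): fraction of pairs where views agree."""
    return sum(pair_chi(h1, h2, x) for x in unlabeled_pairs) / len(unlabeled_pairs)

h1 = lambda x1: x1 > 0.5   # threshold on the link-info view
h2 = lambda x2: x2 > 0.5   # threshold on the text-info view
pairs = [(0.9, 0.8), (0.2, 0.1), (0.7, 0.3)]  # last pair: views disagree
print(pair_compatibility(h1, h2, pairs))  # agreement on 2 of 3 pairs
```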

Page 15:

Examples of results in our model
Sample Complexity – Uniform convergence bounds

Finite Hypothesis Spaces, Doubly Realizable Case

• Define C_{D,χ}(ε) = {h ∈ C : err_unl(h) ≤ ε}.

Theorem

• Bound the number of labeled examples as a measure of the helpfulness of D with respect to χ
  – a helpful distribution is one in which C_{D,χ}(ε) is small
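The theorem box on this slide is an image and did not survive the transcript. A hedged reconstruction of the kind of bound [BB05] proves for the doubly realizable finite case (the exact constants and form here are my approximation, not the slide's statement):

```latex
% Sketch only; not the slide's exact statement.
\textbf{Theorem (doubly realizable case, sketch).}
Suppose $err(c^*) = 0$ and $err_{unl}(c^*) = 0$. Then, with probability
at least $1 - \delta$, given
\[
  m_u \gtrsim \frac{1}{\epsilon}\Big[\ln |C| + \ln \tfrac{1}{\delta}\Big]
  \quad \text{unlabeled examples and} \quad
  m_l \gtrsim \frac{1}{\epsilon}\Big[\ln |C_{D,\chi}(\epsilon)| + \ln \tfrac{1}{\delta}\Big]
  \quad \text{labeled examples,}
\]
every $h \in C$ that is consistent with the labeled sample and has zero
empirical unlabeled error satisfies $err(h) \le \epsilon$.
```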

Page 16:

Semi-Supervised Learning
Natural Formalization (PAC)

• We will say an algorithm “PAC-learns” if it runs in polynomial time using labeled and unlabeled sample sizes polynomial in the respective bounds.

• E.g., one can think of ln|C| as the number of bits needed to describe the target without knowing D, and ln|C_{D,χ}(ε)| as the number of bits needed to describe the target knowing a good approximation to D, given the assumption that the target has a low unlabeled error rate.

Page 17:

Examples of results in our model
Sample Complexity – Uniform convergence bounds

Finite Hypothesis Spaces – c* not fully compatible:

Theorem

Page 18:

Examples of results in our model
Sample Complexity – Uniform convergence bounds

Infinite Hypothesis Spaces

Assume χ(h, x) ∈ {0, 1} and let χ(C) = {χ_h : h ∈ C}, where χ_h(x) = χ(h, x).

C[m, D] – the expected number of splits of m points drawn from D with concepts in C.

Page 19:

Examples of results in our model
Sample Complexity – Uniform convergence bounds

• For S ⊆ X, denote by U_S the uniform distribution over S, and by C[m, U_S] the expected number of splits of m points from U_S with concepts in C.

• Assume err(c*) = 0 and err_unl(c*) = 0.

• Theorem

• The number of labeled examples depends on the unlabeled sample.

• Useful, since one can imagine the learning algorithm performing some calculations over the unlabeled data and then deciding how many labeled examples to purchase.

Page 20:

Examples of results in our model
Sample Complexity – ε-cover-based bounds

• For algorithms that behave in a specific way:
  – first use the unlabeled data to choose a representative set of compatible hypotheses
  – then use the labeled sample to choose among these

Theorem

• Can result in much better bound than uniform convergence!

Page 21:

Implications of our analysis
Ways in which unlabeled data can help

• If the target is highly compatible with D and we have enough unlabeled data to estimate χ over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low).

• By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as Annealed VC-entropy or the size of the smallest ε-cover).

• If D is nice, so that the set of compatible h ∈ C has a small ε-cover and the elements of the cover are far apart, then we can learn from even fewer labeled examples than the 1/ε needed just to verify a good hypothesis.

Page 22:

Modern Topics in Learning Theory

• Semi-Supervised Learning

• Active Learning

• Kernels and Similarity Functions

• Data Dependent Bounds

Page 23:

Active Learning

• Unlabeled data is cheap and easy to obtain; labeled data is (much) more expensive.

• The learner has the ability to choose specific examples to be labeled:

- The learner works harder, in order to use fewer labeled examples.

Page 24:

Membership queries

• The learner “constructs” the examples.

• [Baum and Lang, 1991] tried fitting a neural net to handwritten characters.
  – the synthetic instances created were incomprehensible to humans…

Page 25:

A PAC-like model [CAL92]

• Underlying distribution P on the (x,y) data. (agnostic setting)

• The learner has two abilities:
  • draw an unlabeled sample from the distribution
  • ask for the label of one of these samples

• Special case: assume the data is separable, i.e. some concept h ∈ C labels all points perfectly.

(realizable setting)

Page 26:

Can adaptive querying help? [CAL92, D04]

• Consider threshold functions on the real line:

• Start with 1/ε unlabeled points.

• Binary search – need just log(1/ε) labels, from which the rest can be inferred.

• Output a consistent hypothesis.

[Figure: a threshold w on the real line separating − examples from + examples.]

Exponential improvement in sample complexity
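The binary-search argument above can be sketched directly for threshold concepts (a toy illustration; the names are my own):

```python
# Active learning of a threshold on the real line via binary search:
# sort the unlabeled pool, then query labels only at midpoints of the
# interval still in doubt. Needs O(log n) label queries on a pool of
# n points; the remaining labels are inferred.

def learn_threshold(pool, query_label):
    """query_label(x) -> 0/1 oracle. Returns (threshold, #queries)."""
    pts = sorted(pool)
    lo, hi = 0, len(pts)          # indices still possibly at the boundary
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query_label(pts[mid]) == 1:
            hi = mid              # pts[mid] is +: boundary at or left of mid
        else:
            lo = mid + 1          # pts[mid] is -: boundary right of mid
    # pts[lo] is the leftmost + point; all other labels are inferred
    return (pts[lo] if lo < len(pts) else float("inf")), queries

oracle = lambda x: 1 if x >= 0.37 else 0   # hidden target threshold
pool = [i / 100 for i in range(100)]
thr, used = learn_threshold(pool, oracle)
print(thr, used)  # -> 0.37 7  (log2(100) ~ 7 queries, not 100)
```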

Page 27:

Region of uncertainty [CAL92]

• Example: data lies on a circle in R^2 and hypotheses are linear separators.

• Current version space: the part of C consistent with the labels so far.
• “Region of uncertainty” = the part of the data space about which there is still some uncertainty (i.e. disagreement within the version space).

[Figure: the current version space and the corresponding region of uncertainty in the data space.]

Page 28:

Region of uncertainty [CAL92]

[Figure: the current version space and the corresponding region of uncertainty in the data space.]

Algorithm: Of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
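For threshold concepts on the line, the region of uncertainty is just the gap between the rightmost point known to be negative and the leftmost point known to be positive, so the query rule above can be sketched as follows (an illustrative toy, not the general version-space computation):

```python
import random

# CAL-style active learning for thresholds on the line: query only
# points inside the current region of uncertainty (the open interval
# between the rightmost known negative and the leftmost known
# positive). Points outside it need no labels, because every
# threshold still in the version space classifies them the same way.

def cal_thresholds(pool, query_label, seed=0):
    rng = random.Random(seed)
    lo, hi = min(pool) - 1.0, max(pool) + 1.0  # region of uncertainty
    queries = 0
    while True:
        region = [x for x in pool if lo < x < hi]
        if not region:
            break                 # version space agrees on all pool points
        x = rng.choice(region)    # random point in the region
        queries += 1
        if query_label(x) == 1:
            hi = x                # threshold lies at or left of x
        else:
            lo = x                # threshold lies right of x
    return (lo + hi) / 2.0, queries

oracle = lambda x: 1 if x >= 0.37 else 0   # hidden target threshold
pool = [i / 100 for i in range(100)]
thr, used = cal_thresholds(pool, oracle)
print(used)  # typically far fewer queries than the 100 pool points
```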

Page 29:

Region of uncertainty [CAL92]

• Number of labels needed depends on C and also on P.

• Example: C – linear separators in R^d, D – the uniform distribution over the unit sphere.

• Need only d^{3/2} log(1/ε) labels to find a hypothesis with error rate < ε.
• Supervised learning: d/ε labels.

Exponential improvement in sample complexity.

For a robust version of CAL’92 see BBL’06.