Analysis of greedy active learning Sanjoy Dasgupta UC San Diego

Standard learning model

Given m labeled points, want to learn a classifier with misclassification rate $< \epsilon$, chosen from a hypothesis class H with VC dimension $d < \infty$.

VC theory: need m to be roughly $d/\epsilon$, in the realizable case.

Active learning

Unlabeled data is easy to come by, but there is a charge for each label.

What is the minimum number of labels needed to achieve the target error rate?

Can adaptive querying help?

Simple hypothesis class: threshold functions on the real line:

$h_w(x) = \mathbf{1}(x \ge w)$, $H = \{h_w : w \in \mathbb{R}\}$

Start with $m \approx 1/\epsilon$ unlabeled points.

Binary search – need just log m labels, from which the rest can be inferred! An exponential improvement in sample complexity.
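The binary search over thresholds can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `learn_threshold` and the `query_label` oracle are hypothetical, and the oracle is assumed to be consistent with some threshold (the realizable case).

```python
def learn_threshold(points, query_label):
    """Label every point for a threshold classifier h_w(x) = 1(x >= w).

    points: unlabeled pool of real numbers.
    query_label: hypothetical label oracle, x -> 0 or 1, assumed to be
        consistent with some threshold w.
    Returns (labels of the sorted points, number of oracle calls).
    """
    xs = sorted(points)
    lo, hi = 0, len(xs)      # index of the first positive point lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query_label(xs[mid]) == 1:
            hi = mid         # first positive point is at mid or earlier
        else:
            lo = mid + 1     # first positive point is after mid
    # All remaining labels are inferred, not queried.
    labels = [0] * lo + [1] * (len(xs) - lo)
    return labels, queries
```

On a pool of 1000 points the loop runs at most 10 times, since the interval of candidate positions at least halves with each query.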

Binary search

[Figure: query tree for binary search over the data points, with internal nodes "X1?", "X6?", "X8?".]

m data points: there are effectively m+1 different hypotheses.

Query tree has m+1 leaves, depth $\approx \log m$.

Question: Is this a general phenomenon? For other hypothesis classes, is a generalized binary search possible?

Bad news – I

H = {linear separators in $\mathbb{R}^1$}: active learning reduces sample complexity from m to log m.

But H = {linear separators in $\mathbb{R}^2$}: there are some target hypotheses for which all m labels need to be queried! (No matter how benign the input distribution.)

In this case: learning to accuracy $\epsilon$ requires $1/\epsilon$ labels…

The benefit of averaging

For linear separators in $\mathbb{R}^2$:

In the worst case over target hypotheses, active learning offers no improvement in sample complexity.

But there is a query tree in which the depths of the $O(m^2)$ target hypotheses are spread almost evenly over $[\log m, m]$.

The average depth is just log m.

Question: does active learning help only in a Bayesian model?

Degrees of Bayesian-ity

Prior $\pi$ over hypotheses

Pseudo-Bayesian model: the prior is used only to count queries.

Bayesian model: the prior is used for counting queries and also for the generalization bound.

Different stopping criteria. Suppose the remaining version space is:

[Figure: remaining version space, with regions labeled "High mass" and "Low mass".]

Effective hypothesis class

Fix a hypothesis class H of VC dimension $d < \infty$, and a set of unlabeled examples $x_1, x_2, \ldots, x_m$, where $m \ge d/\epsilon$.

Sauer’s lemma: H can label these points in only $O(m^d)$ different ways… the effective hypothesis class

$H_{\mathrm{eff}} = \{ (h(x_1), h(x_2), \ldots, h(x_m)) : h \in H \}$

has size $|H_{\mathrm{eff}}| = O(m^d)$.
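As a small illustration, the effective hypothesis class can be enumerated directly for thresholds on the line. The pool of points, the grid of thresholds, and the helper names below are assumptions made for this sketch. The exact form of Sauer's lemma used for comparison, $\sum_{i \le d} \binom{m}{i}$, is taken from standard VC theory rather than from the slides.

```python
from math import comb

def effective_class(points, hypotheses):
    """Distinct labelings (h(x_1), ..., h(x_m)) realized by the given hypotheses."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def threshold(w):
    """Threshold classifier h_w(x) = 1(x >= w)."""
    return lambda x: int(x >= w)

# Illustrative pool of m = 4 points and a finite grid of thresholds (assumptions).
points = [0.1, 0.4, 0.5, 0.9]
hypotheses = [threshold(k / 20) for k in range(21)]  # w = 0.0, 0.05, ..., 1.0

H_eff = effective_class(points, hypotheses)
m, d = len(points), 1  # thresholds on the line have VC dimension 1
sauer_bound = sum(comb(m, i) for i in range(d + 1))  # exact form of Sauer's lemma
print(len(H_eff), sauer_bound)
```

Although the grid holds 21 thresholds, only m+1 distinct labelings survive, which is the sense in which there are "effectively m+1 different hypotheses".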

Goal (in the realizable case): pick the element of Heff which is consistent with all the hidden labels, while asking for just a small subset of these labels.

Model of an active learner

Query tree:

[Figure: query tree with internal nodes "X1?", "X6?", "X8?", "X3?" and leaves h1, h5, h6, h3, h2.]

Each leaf is annotated with an element of $H_{\mathrm{eff}}$.

Weights $\pi$ over $H_{\mathrm{eff}}$.

Goal: a tree T of small average depth,

$Q(T, \pi) = \sum_h \pi(h) \cdot \mathrm{depth}(h)$

(can also use random coin flips at internal nodes)
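The average-depth objective can be computed as in the sketch below. The tree encoding (a leaf is a hypothesis name, an internal node is a tuple of query, negative subtree, positive subtree) and the two example trees are assumptions for illustration; they are not the tree from the figure. Randomized trees are not covered.

```python
def leaf_depths(tree, depth=0):
    """Map each leaf (hypothesis name) to its depth in a deterministic query tree."""
    if not isinstance(tree, tuple):
        return {tree: depth}
    _query, neg_subtree, pos_subtree = tree
    depths = leaf_depths(neg_subtree, depth + 1)
    depths.update(leaf_depths(pos_subtree, depth + 1))
    return depths

def average_depth(tree, pi):
    """Q(T, pi) = sum over h of pi(h) * depth(h)."""
    depths = leaf_depths(tree)
    return sum(pi[h] * depths[h] for h in pi)

# Two hypothetical trees over the same four hypotheses.
balanced = ("x2", ("x1", "h0", "h1"), ("x3", "h2", "h3"))
skewed = ("x1", "h0", ("x2", "h1", ("x3", "h2", "h3")))

uniform = {"h0": 0.25, "h1": 0.25, "h2": 0.25, "h3": 0.25}
peaked = {"h0": 0.5, "h1": 0.25, "h2": 0.125, "h3": 0.125}

print(average_depth(balanced, uniform))
print(average_depth(skewed, uniform))
print(average_depth(skewed, peaked))
```

Under the peaked weights the skewed tree has a smaller average depth than the balanced one, which shows why the best tree depends on the choice of $\pi$.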

Question: in this averaged model, can we always find a tree of depth o(m)?

Bad news – II

Pick any $d > 0$ and $m \ge 2d$. There is an input space of size m and a hypothesis class H of VC dimension d such that (for uniform $\pi$) any active learning strategy requires $\ge m/8$ queries on average.

Choose:

Input space = any $\{x_1, \ldots, x_m\}$

H = all concepts which are positive on exactly d inputs.

A revised goal

Depending on the choice of the hypothesis class, and perhaps the input distribution, the average number of labels needed by an optimal active learner is somewhere in the range $[d \log m, m]$ (within constants).

Ideal case: $d \log m$ (perfect binary search)

Worst case: $m$ (randomly chosen labels)

Is there a generic active learning strategy which always achieves close to the optimal number of queries, no matter what it might be?

Heuristics for active learning

A common strategy in many heuristics:

Greedy strategy. After seeing t labels, remaining version space is some $H_t$. Always choose the point which most evenly divides $H_t$, according to $\pi$-mass.

For instance, Tong-Koller (2000) – linear separators: $\pi \propto$ volume
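The greedy rule can be sketched over an explicit, finite effective hypothesis class. This is an illustrative sketch only, not the Tong-Koller method: the representation (a dictionary from hypothesis name to its label vector), the function names, and the threshold instance are all assumptions, and distinct hypotheses are assumed to have distinct label vectors.

```python
def greedy_tree(S, labels, pi):
    """Build a query tree by always picking the query that splits S most evenly in pi-mass.

    S: list of hypothesis names still in the version space.
    labels[h]: tuple of 0/1 labels that h assigns to x_0, ..., x_{m-1}.
    pi[h]: weight of hypothesis h.
    Returns a leaf (hypothesis name) or a tuple (query index, negative subtree, positive subtree).
    """
    if len(S) == 1:
        return S[0]
    m = len(labels[S[0]])
    best = None
    for i in range(m):
        pos = [h for h in S if labels[h][i] == 1]
        neg = [h for h in S if labels[h][i] == 0]
        if not pos or not neg:
            continue  # query i tells us nothing about S
        imbalance = abs(sum(pi[h] for h in pos) - sum(pi[h] for h in neg))
        if best is None or imbalance < best[0]:
            best = (imbalance, i, neg, pos)
    _, i, neg, pos = best
    return (i, greedy_tree(neg, labels, pi), greedy_tree(pos, labels, pi))

def leaf_depths(tree, depth=0):
    """Map each leaf to its depth."""
    if not isinstance(tree, tuple):
        return {tree: depth}
    _, neg, pos = tree
    return {**leaf_depths(neg, depth + 1), **leaf_depths(pos, depth + 1)}

# Illustrative instance: thresholds on m = 7 points give 8 labelings; uniform pi.
m = 7
labels = {k: tuple(int(j >= k) for j in range(m)) for k in range(m + 1)}
pi = {k: 1 / (m + 1) for k in labels}

tree = greedy_tree(list(labels), labels, pi)
depths = leaf_depths(tree)
avg_depth = sum(pi[h] * depths[h] for h in labels)
print(avg_depth)
```

On this instance the greedy rule recovers plain binary search: every hypothesis ends up at depth 3.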

Question: How good is this greedy scheme? And how does its performance depend on the choice of $\pi$?

Greedy active learning

Choose any $\pi$. How does the greedy query tree $T_G$ compare to the optimal tree $T^*$?

Upper bound. $Q(T_G, \pi) \le 4\, Q(T^*, \pi) \log \frac{1}{\min_h \pi(h)}$.

Example: For uniform $\pi$, the approximation ratio is $\log |H_{\mathrm{eff}}| = O(d \log m)$.

Lower bounds.

[1] Uniform $\pi$: we have an example in which $Q(T_G, \pi) \ge Q(T^*, \pi) \cdot \Omega(\log |H_{\mathrm{eff}}| / \log\log |H_{\mathrm{eff}}|)$

[2] Non-uniform $\pi$: an example where $\pi$ ranges between $1/2$ and $1/2^n$, and $Q(T_G, \pi) \ge Q(T^*, \pi) \cdot \Omega(n)$.

Sub-optimality of greedy scheme

[1] The case of uniform $\pi$.

There are simple examples in which the greedy scheme uses $\Omega(\log n/\log\log n)$ times the optimal number of labels.

(a) The hypothesis class consists of several clusters

(b) Each cluster is efficiently searchable

(c) But first the version space must be narrowed down to one of these clusters: an inefficient process

[Invoke this construction recursively.]

Optimal strategy reduces entropy only gradually at first, then ramps it up later – an over-eager greedy scheme is fooled.

Sub-optimality, cont’d

[2] The case of general $\pi$.

For any $n \ge 2$:

There is a hypothesis class H of size $2n+1$ and distribution $\pi$ over H such that:

(a) $\pi$ ranges from $1/2$ to $1/2^{n+1}$

(b) optimal expected number of queries is $< 3$

(c) greedy strategy uses $\ge n/2$ queries on average.

[Figure: the hypothesis class H = $\{h_0, h_{11}, \ldots, h_{1n}, h_{21}, \ldots, h_{2n}\}$, each hypothesis drawn with area proportional to its $\pi$-mass.]

Sub-optimality, cont’d

Three types of queries:

(i) Is target some $h_{1i}$? (ii) some $h_{2i}$? (iii) $h_{1j}$ or $h_{2j}$?

Upper bound: overview

Upper bound. $Q(T_G, \pi) \le 4\, Q(T^*, \pi) \log \frac{1}{\min_h \pi(h)}$.

If the optimal tree is short, then

either: there is a query which (in expectation) cuts off a good chunk of the version space

or: some particular hypothesis has high weight.

At least in the first case, the greedy scheme gets off to a good start [cf. Johnson’s argument for set cover].

Quality of a query

Need a notion of query quality which can only decrease with time.

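If S is a version space, and query $x_i$ splits it into $S^+$, $S^-$, we’ll say that “$x_i$ shrinks $(S, \pi)$” by

$\dfrac{2\,\pi(S^+)\,\pi(S^-)}{\pi(S)}$

Claim: If $x_i$ shrinks $(H_{\mathrm{eff}}, \pi)$ by $\Delta$, then it shrinks $(S, \pi)$ by at most $\Delta$ for any $S \subseteq H_{\mathrm{eff}}$.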
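The monotonicity claim can be checked by brute force on a small instance. The sketch below assumes the shrinkage measure is the fraction $2\pi(S^+)\pi(S^-)/\pi(S)$; the threshold instance, the weights, and the function name `shrink` are illustrative choices. Exact rational arithmetic is used so the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations

def shrink(S, i, labels, pi):
    """Amount by which query x_i shrinks (S, pi): 2 * pi(S+) * pi(S-) / pi(S)."""
    pos = sum(pi[h] for h in S if labels[h][i] == 1)
    neg = sum(pi[h] for h in S if labels[h][i] == 0)
    return 2 * pos * neg / (pos + neg)

# Illustrative instance: thresholds on m = 4 points, non-uniform weights.
m = 4
labels = {k: tuple(int(j >= k) for j in range(m)) for k in range(m + 1)}
weights = [8, 6, 3, 2, 1]
pi = {k: Fraction(w, sum(weights)) for k, w in zip(labels, weights)}

H = list(labels)
# Check every non-empty subset S of H against every query x_i.
monotone = all(
    shrink(S, i, labels, pi) <= shrink(H, i, labels, pi)
    for r in range(1, len(H) + 1)
    for S in combinations(H, r)
    for i in range(m)
)
print(monotone)
```

The check passes because $ab/(a+b)$ is non-decreasing in both $a$ and $b$, and restricting to a subset can only lower $\pi(S^+)$ and $\pi(S^-)$.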

When is the optimal tree short?

Claim: Pick any $S \subseteq H_{\mathrm{eff}}$, and any tree T whose leaves include all of S. Then there must be a query which shrinks $(S, \pi_S)$ by at least:

$(1 - \mathrm{CP}(\pi_S)) / Q(T, \pi_S)$.

Here:

$\pi_S$ is $\pi$ restricted to S

$\mathrm{CP}(\pi) = \sum_h \pi(h)^2$ (collision probability)
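A numeric sanity check of this claim on one instance is sketched below. It assumes $\pi_S$ is normalized to sum to 1 over S, so that the shrinkage of a query reduces to $2\pi_S(S^+)\pi_S(S^-)$. The instance (thresholds on 7 points, uniform weights, S equal to the whole class) and the helper names are illustrative. For T, the balanced binary-search tree is used, which places all 8 hypotheses at depth 3.

```python
from fractions import Fraction

def collision_probability(pi):
    """CP(pi) = sum over h of pi(h)^2."""
    return sum(p * p for p in pi.values())

def restrict(pi, S):
    """pi restricted to S and renormalized to total mass 1."""
    total = sum(pi[h] for h in S)
    return {h: pi[h] / total for h in S}

def best_shrink(S, labels, pi_S):
    """Largest amount by which a single query shrinks (S, pi_S), using pi_S(S) = 1."""
    m = len(next(iter(labels.values())))
    best = 0
    for i in range(m):
        pos = sum(pi_S[h] for h in S if labels[h][i] == 1)
        best = max(best, 2 * pos * (1 - pos))
    return best

# Illustrative instance: thresholds on m = 7 points, uniform weights, S = everything.
m = 7
labels = {k: tuple(int(j >= k) for j in range(m)) for k in range(m + 1)}
pi = {k: Fraction(1, m + 1) for k in labels}
S = list(labels)
pi_S = restrict(pi, S)

Q_T = 3  # average depth of the balanced binary-search tree over 8 leaves
lower_bound = (1 - collision_probability(pi_S)) / Q_T
print(best_shrink(S, labels, pi_S), lower_bound)
```

Here the collision probability is 1/8, so the claim guarantees a query that shrinks by at least 7/24; the middle query shrinks by 1/2.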

Main argument

If the optimal tree has small average depth, then there are two possible cases:

Case one: there is some query which shrinks the version space significantly

In this case, the greedy strategy will find such a query and clear progress will be made. The resulting subtrees, considered together, will also require few queries.

Proof, cont’d

Case two: some classifier h* has very high $\pi$-mass

In this case, the version space might shrink by just an insignificant amount in one round. But:

in roughly the number of queries that the optimal strategy requires for target h*, the greedy strategy will either eliminate h* or declare it to be the answer.

In the former case, by the time h* is eliminated, the version space will have shrunk significantly.

These two cases form the basis of an inductive argument.

An open problem

Just about the only positive result in active learning:

[FSST97] Query by committee: if the data distribution is uniform over the unit sphere, can learn homogeneous linear separators using just $O(d \log 1/\epsilon)$ labels.

But the minute we allow non-homogeneous hyperplanes, the query complexity increases to $1/\epsilon$… What’s going on?
