
Page 1: A tutorial on active learning - Machine Learning (Theory)

A tutorial on active learning

Sanjoy Dasgupta (UC San Diego)

John Langford (Yahoo Labs)

Page 5: A tutorial on active learning - Machine Learning (Theory)

Exploiting unlabeled data

Unlabeled data is plentiful and cheap, e.g.:

documents off the web

speech samples

images and video

But labeling can be expensive.

[Figure panels: unlabeled points / supervised learning / semi-supervised and active learning]

Page 6: A tutorial on active learning - Machine Learning (Theory)

Active learning example: drug design [Warmuth et al 03]

Goal: find compounds which bind to a particular target

Large collection of compounds, from:

vendor catalogs

corporate collections

combinatorial chemistry

unlabeled point ≡ description of chemical compound

label ≡ active (binds to target) vs. inactive

getting a label ≡ chemistry experiment

Page 7: A tutorial on active learning - Machine Learning (Theory)

Active learning example: pedestrian detection [Freund et al 03]

Page 9: A tutorial on active learning - Machine Learning (Theory)

Typical heuristics for active learning

Start with a pool of unlabeled data

Pick a few points at random and get their labels

Repeat

Fit a classifier to the labels seen so far. Query the unlabeled point that is closest to the boundary (or most uncertain, or most likely to decrease overall uncertainty, ...).

Biased sampling: the labeled points are not representative of the underlying distribution!
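To make the heuristic concrete, here is a minimal pool-based uncertainty-sampling loop (an illustration, not part of the tutorial). The use of scikit-learn's LogisticRegression, the oracle callback, and the budget parameters are assumptions of this sketch, and it assumes the random seed set contains both classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_pool, oracle, seed_size=10, budget=50, rng=None):
    """Pool-based uncertainty sampling. X_pool: (n, d) numpy array;
    oracle(x) returns the 0/1 label of x."""
    rng = rng or np.random.default_rng(0)
    n = len(X_pool)
    queried = list(rng.choice(n, size=seed_size, replace=False))   # random seed set
    labels = {i: oracle(X_pool[i]) for i in queried}

    while len(queried) < budget:
        clf = LogisticRegression().fit(X_pool[queried], [labels[i] for i in queried])
        probs = clf.predict_proba(X_pool)[:, 1]
        pool = [i for i in range(n) if i not in labels]
        q = min(pool, key=lambda i: abs(probs[i] - 0.5))            # most uncertain point
        labels[q] = oracle(X_pool[q])
        queried.append(q)

    final = LogisticRegression().fit(X_pool[queried], [labels[i] for i in queried])
    return final, labels
```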

Page 12: A tutorial on active learning - Machine Learning (Theory)

Sampling bias

Start with a pool of unlabeled data

Pick a few points at random and get their labels

Repeat

Fit a classifier to the labels seen so far. Query the unlabeled point that is closest to the boundary (or most uncertain, or most likely to decrease overall uncertainty, ...).

Example:

45% 5% 5% 45%

Even with infinitely many labels, this converges to a classifier with 5% error instead of the best achievable, 2.5%. Not consistent!

Manifestation in practice, e.g., Schutze et al 03.

Page 13: A tutorial on active learning - Machine Learning (Theory)

Can adaptive querying really help?

There are two distinct narratives for explaining how adaptive querying can help.

Case I: Exploiting (cluster) structure in data

Case II: Efficient search through hypothesis space

Page 15: A tutorial on active learning - Machine Learning (Theory)

Case I: Exploiting cluster structure in data

Suppose the unlabeled data looks like this.

Then perhaps we just need five labels!

Challenges: In general, the cluster structure (i) is not so clearly defined and (ii) exists at many levels of granularity. And the clusters themselves might not be pure in their labels. How to exploit whatever structure happens to exist?

Page 17: A tutorial on active learning - Machine Learning (Theory)

Case II: Efficient search through hypothesis space

Ideal case: each query cuts the version space in two.

[Figure: version space H split by a query into a + half and a − half]

Then perhaps we need just log |H| labels to get a perfect hypothesis!

Challenges: (1) Do there always exist queries that will cut off a good portion of the version space? (2) If so, how can these queries be found? (3) What happens in the nonseparable case?

Page 18: A tutorial on active learning - Machine Learning (Theory)

Outline of tutorial

I. Exploiting (cluster) structure in data

II. Efficient search through hypothesis space

Sine qua non: statistical consistency.

Page 23: A tutorial on active learning - Machine Learning (Theory)

A cluster-based active learning scheme [ZGL 03]

(1) Build neighborhood graph
(2) Query some random points
(3) Propagate labels
(4) Make query and go to (3)

[Figure: neighborhood graph with two queried points labeled 1 and 0; propagated soft labels (e.g. .8, .7, .6, .5, .4, .2) on the remaining nodes]

Clusters in data ⇒ graph cut which curtails propagation of influence
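A much-simplified sketch of this scheme, for intuition only: build a kNN graph, clamp the queried labels, and iterate neighborhood averaging. The graph construction, the uniform edge weights, and the fixed number of iterations are assumptions of the sketch, not the exact procedure of [ZGL 03].

```python
import numpy as np

def propagate_labels(X, queried, n_neighbors=5, n_iters=200):
    """X: (n, d) points; queried: dict {index: 0/1 label}. Returns soft labels in [0, 1]."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)
    W = np.zeros((n, n))
    for i in range(n):                                     # symmetric kNN graph
        for j in np.argsort(d2[i])[:n_neighbors]:
            W[i, j] = W[j, i] = 1.0

    f = np.full(n, 0.5)                                    # soft labels, start uncertain
    for i, y in queried.items():
        f[i] = y
    for _ in range(n_iters):                               # harmonic-style averaging
        f = W @ f / np.maximum(W.sum(1), 1e-12)
        for i, y in queried.items():                       # clamp the queried points
            f[i] = y
    return f
```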

Page 24: A tutorial on active learning - Machine Learning (Theory)

Exploiting cluster structure in data [DH 08]

Basic primitive:

Find a clustering of the data

Sample a few randomly-chosen points in each cluster

Assign each cluster its majority label

Now use this fully labeled data set to build a classifier
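A minimal sketch of this primitive, with k-means standing in for "find a clustering"; the clustering method, the number of clusters, and the queries-per-cluster parameter are assumptions of the sketch.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def cluster_then_label(X, oracle, n_clusters=10, queries_per_cluster=5, seed=0):
    """Cluster X, query a few random points per cluster, give each cluster its majority label."""
    rng = np.random.default_rng(seed)
    assignment = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    y_hat = np.empty(len(X), dtype=int)
    for c in range(n_clusters):
        members = np.flatnonzero(assignment == c)
        sample = rng.choice(members, size=min(queries_per_cluster, len(members)),
                            replace=False)
        majority = Counter(oracle(X[i]) for i in sample).most_common(1)[0][0]
        y_hat[members] = majority          # assign the cluster its majority label
    return y_hat                           # now train any supervised learner on (X, y_hat)
```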

Page 30: A tutorial on active learning - Machine Learning (Theory)

Finding the right granularity

[Figure panels: unlabeled data → find a clustering → ask for some labels → now what? → refine the clustering]

Page 31: A tutorial on active learning - Machine Learning (Theory)

Using a hierarchical clustering

Rules:

Always work with some pruning of the hierarchy: a clustering induced by the tree. Pick a cluster (intelligently) and query a random point in it.

For each tree node (i.e. cluster) v maintain: (i) majority label L(v); (ii) empirical label frequencies p_{v,l}; and (iii) confidence interval [p^lb_{v,l}, p^ub_{v,l}].

Page 32: A tutorial on active learning - Machine Learning (Theory)

Algorithm: hierarchical sampling

Input: Hierarchical clustering T

For each node v maintain: (i) majority label L(v); (ii) empirical label frequencies p_{v,l}; and (iii) confidence interval [p^lb_{v,l}, p^ub_{v,l}]

Initialize: pruning P = {root}, labeling L(root) = ℓ0

for t = 1, 2, 3, . . . :
  v = select-node(P)
  pick a random point z in subtree T_v and query its label
  update empirical counts for all nodes along the path from z to v
  choose the best pruning and labeling (P′, L′) of T_v
  P = (P \ {v}) ∪ P′ and L(u) = L′(u) for all u in P′

for each v in P: assign each leaf in T_v the label L(v)
return the resulting fully labeled data set

select-node(P) ≡
  Prob[v] ∝ |T_v|                       (random sampling)
  Prob[v] ∝ |T_v| (1 − p^lb_{v,l})      (active sampling)
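Below is a much-simplified sketch of the active select-node rule and a refinement step. The nested-dict tree representation, the Hoeffding-style lower bound, and the impurity-based refinement rule stand in for the admissible pruning-and-labeling search of [DH 08], and queried counts are not propagated along the path as in the algorithm above.

```python
import numpy as np
from collections import Counter

def lower_bound(counts, n):
    """Crude lower confidence bound on the majority-label fraction of a cluster."""
    if n == 0:
        return 0.0
    p = max(counts.values()) / n
    return max(0.0, p - np.sqrt(np.log(10.0) / (2 * n)))   # Hoeffding-style slack

def hierarchical_sampling(root, oracle, budget, rng=None):
    """root: node dict {'indices': [...], 'children': [left, right] or None}."""
    rng = rng or np.random.default_rng(0)
    pruning = [root]
    stats = {id(root): (Counter(), 0)}                      # node -> (label counts, #queries)
    for _ in range(budget):
        # active select-node: Prob[v] proportional to |T_v| * (1 - p_lb)
        weights = np.array([len(v['indices']) *
                            (1.0 - lower_bound(*stats[id(v)])) for v in pruning])
        v = pruning[rng.choice(len(pruning), p=weights / weights.sum())]
        i = rng.choice(v['indices'])                        # random point in the subtree
        counts, n = stats[id(v)]
        counts[oracle(i)] += 1
        stats[id(v)] = (counts, n + 1)
        # refine: replace a cluster that looks impure by its children
        if v['children'] and n + 1 >= 5 and lower_bound(counts, n + 1) < 0.8:
            pruning.remove(v)
            for child in v['children']:
                stats.setdefault(id(child), (Counter(), 0))
                pruning.append(child)
    # label every point with the majority label of its cluster in the final pruning
    labels = {}
    for v in pruning:
        counts, n = stats[id(v)]
        majority = counts.most_common(1)[0][0] if n else None
        for i in v['indices']:
            labels[i] = majority
    return labels
```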

Page 33: A tutorial on active learning - Machine Learning (Theory)

Outline of tutorial

I. Exploiting (cluster) structure in data

II. Efficient search through hypothesis space

(a) The separable case
(b) The general case

Page 37: A tutorial on active learning - Machine Learning (Theory)

Efficient search through hypothesis space

Threshold functions on the real line:

H = {h_w : w ∈ R}, where h_w(x) = 1(x ≥ w)

[Figure: the real line, labeled − to the left of the threshold w and + to the right]

Supervised: for misclassification error ≤ ǫ, need ≈ 1/ǫ labeled points.

Active learning: instead, start with 1/ǫ unlabeled points.

Binary search: need just log 1/ǫ labels, from which the rest can be inferred. Exponential improvement in label complexity!
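A tiny sketch of this binary-search argument (illustrative only; the oracle callback and noiseless, separable labels are assumptions of the sketch).

```python
import numpy as np

def learn_threshold(xs, oracle):
    """xs: ~1/eps unlabeled points on the real line; oracle(x) returns 1(x >= w*)."""
    xs = np.sort(xs)
    lo, hi = 0, len(xs) - 1
    if oracle(xs[lo]) == 1:            # every point is on the + side
        return xs[lo]
    if oracle(xs[hi]) == 0:            # every point is on the - side
        return xs[hi]
    while hi - lo > 1:                 # O(log n) = O(log 1/eps) label queries
        mid = (lo + hi) // 2
        if oracle(xs[mid]) == 1:
            hi = mid
        else:
            lo = mid
    return xs[hi]                      # estimated threshold; all other labels are now inferred
```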

Challenges: Nonseparable data? Other hypothesis classes?

Page 39: A tutorial on active learning - Machine Learning (Theory)

Some results of active learning theory

Two axes: aggressive vs. mellow querying, and separable vs. general (nonseparable) data.

Aggressive, separable data: Query by committee (Freund, Seung, Shamir, Tishby, 97); Splitting index (D, 05).

Mellow, separable data: Generic active learner (Cohn, Atlas, Ladner, 91).

Mellow, general (nonseparable) data: A2 algorithm (Balcan, Beygelzimer, L, 06); Disagreement coefficient (Hanneke, 07); Reduction to supervised (D, Hsu, Monteleoni, 2007); Importance-weighted approach (Beygelzimer, D, L, 2009).

Issues:

Computational tractability

Are labels being used as efficiently as possible?

Page 43: A tutorial on active learning - Machine Learning (Theory)

A generic mellow learner [CAL ’91]

For separable data that is streaming in.

H_1 = hypothesis class

Repeat for t = 1, 2, . . .
  Receive unlabeled point x_t.
  If there is any disagreement within H_t about x_t's label:
    query label y_t and set H_{t+1} = {h ∈ H_t : h(x_t) = y_t}
  else:
    H_{t+1} = H_t

Is a label needed? H_t = current candidate hypotheses; queries are made only in its region of uncertainty.

Problems: (1) intractable to maintain H_t; (2) nonseparable data.

Page 45: A tutorial on active learning - Machine Learning (Theory)

Maintaining Ht

Explicitly maintaining H_t is intractable. Do it implicitly, by reduction to supervised learning.

Explicit version:

  H_1 = hypothesis class
  For t = 1, 2, . . .:
    Receive unlabeled point x_t
    If disagreement in H_t about x_t's label:
      query label y_t of x_t
      H_{t+1} = {h ∈ H_t : h(x_t) = y_t}
    else:
      H_{t+1} = H_t

Implicit version:

  S = (points seen so far)
  For t = 1, 2, . . .:
    Receive unlabeled point x_t
    If learn(S ∪ {(x_t, 1)}) and learn(S ∪ {(x_t, 0)}) both return an answer:
      query label y_t
    else:
      set y_t to whichever label succeeded
    S = S ∪ {(x_t, y_t)}

This scheme is no worse than straight supervised learning. But can one bound the number of labels needed?
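For intuition, here is the implicit version instantiated for 1-D threshold classifiers, where learn() can be written in closed form; the threshold class and the noiseless oracle are assumptions of this sketch, not part of the tutorial.

```python
def learn(S):
    """S: list of (x, y), y in {0, 1}, for h_w(x) = 1(x >= w). Return a consistent w or None."""
    zeros = [x for x, y in S if y == 0]
    ones = [x for x, y in S if y == 1]
    lo = max(zeros, default=float('-inf'))   # w must lie above every 0-labeled point
    hi = min(ones, default=float('inf'))     # w must lie at or below every 1-labeled point
    return hi if lo < hi else None

def cal_implicit(stream, oracle):
    S = []
    for x in stream:
        if learn(S + [(x, 1)]) is not None and learn(S + [(x, 0)]) is not None:
            y = oracle(x)                                           # genuine disagreement: query
        else:
            y = 1 if learn(S + [(x, 1)]) is not None else 0         # label is forced, no query
        S.append((x, y))
    return learn(S)
```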

Page 49: A tutorial on active learning - Machine Learning (Theory)

Label complexity [Hanneke]

The label complexity of CAL (mellow, separable) active learning can be captured by the VC dimension d of the hypothesis class and by a parameter θ called the disagreement coefficient.

Regular supervised learning, separable case. Suppose data are sampled iid from an underlying distribution. To get a hypothesis whose misclassification rate (on the underlying distribution) is ≤ ǫ with probability ≥ 0.9, it suffices to have d/ǫ labeled examples.

CAL active learner, separable case. Label complexity is θ d log(1/ǫ).

There is a version of CAL for nonseparable data. (More to come!) If the best achievable error rate is ν, it suffices to have

  θ (d log²(1/ǫ) + d ν²/ǫ²)

labels. Usual supervised requirement: d/ǫ².

Page 50: A tutorial on active learning - Machine Learning (Theory)

Disagreement coefficient [Hanneke]

Let P be the underlying probability distribution on the input space X. It induces a (pseudo-)metric on hypotheses, d(h, h′) = P[h(X) ≠ h′(X)], with the corresponding ball B(h, r) = {h′ ∈ H : d(h, h′) < r}.

Disagreement region of any set of candidate hypotheses V ⊆ H:

  DIS(V) = {x : ∃ h, h′ ∈ V such that h(x) ≠ h′(x)}.

Disagreement coefficient for target hypothesis h* ∈ H:

  θ = sup_r P[DIS(B(h*, r))] / r.

[Figure: d(h*, h) = P[shaded region]; some elements of B(h*, r); DIS(B(h*, r))]

Page 52: A tutorial on active learning - Machine Learning (Theory)

Disagreement coefficient: separable case

Let P be the underlying probability distribution on the input space X. Let H_ǫ be all hypotheses in H with error ≤ ǫ. Disagreement region:

  DIS(H_ǫ) = {x : ∃ h, h′ ∈ H_ǫ such that h(x) ≠ h′(x)}.

Then the disagreement coefficient is

  θ = sup_ǫ P[DIS(H_ǫ)] / ǫ.

Example: H = thresholds in R, any data distribution. A threshold with error ≤ ǫ must lie within probability mass ǫ of the target on either side, so DIS(H_ǫ) is an interval of mass at most 2ǫ around the target. Therefore θ = 2.
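A quick numeric sanity check of this calculation (an illustration, not from the tutorial): estimate P[DIS(H_ǫ)]/ǫ by simulation for a few values of ǫ; the sampled distribution and target threshold below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(200_000)        # any data distribution; here N(0, 1)
w_star = 0.3                            # target threshold

for eps in [0.01, 0.05, 0.1]:
    # thresholds with error <= eps sit within probability mass eps of w_star on either side;
    # DIS(H_eps) is the interval between the two extreme such thresholds
    left = np.quantile(X, np.mean(X < w_star) - eps)
    right = np.quantile(X, np.mean(X < w_star) + eps)
    p_dis = np.mean((X >= left) & (X < right))
    print(f"eps={eps:.2f}  P[DIS]/eps ~ {p_dis / eps:.2f}")   # about 2 in each case
```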

Page 56: A tutorial on active learning - Machine Learning (Theory)

Disagreement coefficient: examples [H ’07, F ’09]

Thresholds in R, any data distribution: θ = 2. Label complexity O(log 1/ǫ).

Linear separators through the origin in R^d, uniform data distribution: θ ≤ √d. Label complexity O(d^{3/2} log 1/ǫ).

Linear separators in R^d, smooth data density bounded away from zero: θ ≤ c(h*) d, where c(h*) is a constant depending on the target h*. Label complexity O(c(h*) d² log 1/ǫ).

Page 57: A tutorial on active learning - Machine Learning (Theory)

Gimme the algorithms

1. The A2 algorithm.

2. Limitations

3. IWAL

Page 58: A tutorial on active learning - Machine Learning (Theory)

The A2 Algorithm in Action

[Figure: error rates of threshold functions; x-axis: threshold/input feature on [0, 1], y-axis: error rate from 0 to 0.5]

Problem: find the optimal threshold function on the [0, 1] interval in a noisy domain.

Page 59: A tutorial on active learning - Machine Learning (Theory)

Sampling

[Figure: labeled samples overlaid on the threshold error-rate plot; points toward the left labeled 0, points toward the right labeled 1]

Label samples at random.

Page 60: A tutorial on active learning - Machine Learning (Theory)

Bounding

[Figure: upper and lower confidence bounds on the error rate of each threshold]

Compute upper and lower bounds on the error rate of each hypothesis.

Theorem: For all H, for all D, for all numbers of samples m,

  Pr(|true error rate − empirical error rate| ≤ f(H, δ, m)) ≥ 1 − δ
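One standard instantiation of f(H, δ, m) for a finite hypothesis class, via Hoeffding's inequality plus a union bound over H; this particular bound is an assumption of the sketch, not necessarily the one used in A².

```python
import math

def f(num_hypotheses, delta, m):
    """Uniform deviation bound: with probability >= 1 - delta, every h in H satisfies
    |true error(h) - empirical error(h)| <= f."""
    return math.sqrt(math.log(2 * num_hypotheses / delta) / (2 * m))

print(f(num_hypotheses=1000, delta=0.1, m=500))   # roughly 0.10
```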

Page 62: A tutorial on active learning - Machine Learning (Theory)

Eliminating

[Figure: part of the threshold space eliminated by the confidence bounds, and the corresponding region of the feature space that no longer needs queries]

Chop away part of the hypothesis space, implying that you cease to care about part of the feature space. Recurse!

Page 66: A tutorial on active learning - Machine Learning (Theory)

What is this master algorithm?

Let e(h, D) = Pr_{x,y∼D}(h(x) ≠ y).

Let UB(S, h) = upper bound on e(h, D) assuming S is IID from D.
Let LB(S, h) = lower bound on e(h, D) assuming S is IID from D.

Disagree(H, D) = Pr_{x,y∼D}(∃ h, h′ ∈ H : h(x) ≠ h′(x))

Done(H, D) = (min_{h∈H} UB(S, h) − min_{h∈H} LB(S, h)) · Disagree(H, D)

Page 67: A tutorial on active learning - Machine Learning (Theory)

Agnostic Active (error rate ǫ, classifiers H)

while Done(H, D) > ǫ:
  S = ∅, H′ = H
  while Disagree(H′, D) ≥ (1/2) Disagree(H, D):
    if Done(H′, D) < ǫ: return any h ∈ H′
    S′ = 2|S| + 1 unlabeled x on which H disagrees
    S = {(x, Label(x)) : x ∈ S′}
    H′ ← {h ∈ H : LB(S, h) ≤ min_{h′∈H} UB(S, h′)}
  H ← H′
return any h ∈ H
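A much-simplified sketch of this elimination loop for 1-D thresholds. The finite grid of candidate thresholds, the Hoeffding-style confidence bound, the doubling sample sizes, and the rejection sampling of the disagreement region are all assumptions of the sketch, not the exact algorithm above.

```python
import numpy as np

def a2_thresholds(oracle, draw_x, grid, delta=0.05, rounds=8, batch=100, rng=None):
    """oracle(x): noisy 0/1 label; draw_x(rng): one draw from the marginal D_X."""
    rng = rng or np.random.default_rng(0)
    H = np.array(sorted(grid))                     # surviving candidate thresholds
    emp = np.zeros(len(H))
    for r in range(rounds):
        lo, hi = H.min(), H.max()                  # disagreement region of the survivors
        xs = [x for x in (draw_x(rng) for _ in range(10 ** 5))
              if lo <= x < hi][: batch * 2 ** r]   # fresh, doubling sample from the region
        if not xs:
            break
        X, Y = np.array(xs), np.array([oracle(x) for x in xs])
        emp = np.array([np.mean((X >= w) != Y) for w in H])   # regional empirical errors
        slack = np.sqrt(np.log(2 * len(H) * rounds / delta) / (2 * len(X)))
        keep = emp - slack <= (emp + slack).min()             # keep h with LB(h) <= min UB
        H, emp = H[keep], emp[keep]
        if len(H) <= 1:
            break
    return H[np.argmin(emp)]
```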

Page 68: A tutorial on active learning - Machine Learning (Theory)

Agnostic Active: result

Theorem: There exists an algorithm Agnostic Active that:

1. (Correctness) For all H, D, ǫ, with probability 0.99 returns an ǫ-optimal classifier.

2. (Fall-back) For all H, D, the number of labeled samples required is O(batch supervised sample complexity).

3. (Structured low noise) For all H, D with disagreement coefficient θ and d = VC(H), if ν < ǫ then O(θ² d ln²(1/ǫ)) labeled examples suffice.

4. (Structured high noise) For all H, D with disagreement coefficient θ and d = VC(H), if ν > ǫ then O(θ² (ν²/ǫ²) d) labeled examples suffice.

Page 69: A tutorial on active learning - Machine Learning (Theory)

Proof of low noise/high speedup case

[Figure: the noiseless case; error rate of each threshold on [0, 1] under the measure D_X(x)]

ν noise cannot significantly alter this picture for ν small.

⇒ a constant fraction of classifiers has error rate > 0.25.
⇒ O(ln(1/δ)) labeled examples give the error rates of all thresholds up to tolerance 1/8 with probability 1 − δ.

Page 70: A tutorial on active learning - Machine Learning (Theory)

General low noise/high speedup case

⇒ a constant fraction of classifiers has error rate > 0.25.
⇒ O(θ²(d + ln(1/δ))) labeled examples give the error rates of all classifiers up to tolerance 1/8 with probability 1 − δ.

Page 71: A tutorial on active learning - Machine Learning (Theory)

Proof of low noise/high speedup case II

[Figure: the noiseless case; error rate of each threshold on [0, 1]]

⇒ 1/2 of the "far" classifiers can be eliminated via bound algebra.
⇒ The problem recurses. Spreading δ across recursions implies the result.

Page 72: A tutorial on active learning - Machine Learning (Theory)

Proof of high noise/low speedup case

The low noise/high speedup theorem ⇒ Disagree(C, D) ≃ ν after few examples.

Done(C, D) = [min_{c∈C} UB(S, c) − min_{c∈C} LB(S, c)] · Disagree(C, D)
           = [min_{c∈C} UB(S, c) − min_{c∈C} LB(S, c)] · ν

⇒ computing bounds to error-rate precision ǫ/ν makes Done(C, D) ≃ ǫ.

O(θ² (ν²/ǫ²) d) samples suffice.

Page 73: A tutorial on active learning - Machine Learning (Theory)

Gimme the algorithms

1. The A2 algorithm

2. Limitations

3. IWAL

Page 75: A tutorial on active learning - Machine Learning (Theory)

What’s wrong with A2?

1. Unlabeled complexity: You need infinite unlabeled data to measure Disagree(C, D) to infinite precision.

2. Computation: You need to enumerate and check hypotheses, which is exponentially slower and uses exponentially more space than common learning algorithms.

3. Label complexity: Can't get logarithmic label complexity for ǫ < ν.

4. Label complexity: Throwing away examples from previous iterations can't be optimal, right?

5. Label complexity: Bounds are often way too loose.

6. Generality: We care about more than 0/1 loss.

Page 76: A tutorial on active learning - Machine Learning (Theory)

What’s wrong with A2?

1. Unlabeled complexity: [DHM07].

2. Computation: [DHM07], partially; [Beygelzimer, D, L, 2009], partially.

3. Label complexity: [Kaariainen 2006], logarithmic label complexity is impossible for small ǫ.

4. Label complexity: [DHM07], use all labels for all decisions.

5. Label complexity: [BDL09], importance weights partially address loose bounds.

6. Generality: [BDL09], importance weights address other losses.

Page 79: A tutorial on active learning - Machine Learning (Theory)

An improved lower bound for active learning

Theorem [BDL09]: For all H with VC(H) = d, and for all ǫ, ν > 0 with ǫ < ν/2 < 1/8, there exists a D such that every active learning algorithm requires

  Ω(d ν² / ǫ²)

labels to find an h satisfying e(h, D) < ν + ǫ.

Implication: For ǫ = √(dν/T), as in supervised learning, every label complexity bound must have a dependence on

  d ν² / ǫ² = d ν² / (dν/T) = T ν.

Proof sketch:

1. VC(H) = d means there are d inputs x where the prediction can go either way.

2. Make |P(y = 1|x) − P(y = 0|x)| = 2ǫ/(ν + 2ǫ) on these points.

3. Apply a coin-flipping lower bound for determining whether heads or tails is more common.

Page 80: A tutorial on active learning - Machine Learning (Theory)

Gimme the algorithms

1. The A2 algorithm.

2. Limitations

3. IWAL

Page 81: A tutorial on active learning - Machine Learning (Theory)

IWAL(subroutine Rejection-Threshold)

S = ∅
While (unlabeled examples remain):

1. Receive unlabeled example x.

2. Set p = Rejection-Threshold(x, S).

3. If U(0, 1) ≤ p, get label y and add (x, y, 1/p) to S.

4. Let h = Learn(S).
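A minimal skeleton of this loop; Rejection-Threshold and Learn are passed in as callables, and the learner is assumed to accept importance-weighted (x, y, 1/p) triples.

```python
import numpy as np

def iwal(stream, oracle, rejection_threshold, learn, rng=None):
    """Importance-weighted active learning skeleton over a stream of unlabeled x."""
    rng = rng or np.random.default_rng(0)
    S = []                                     # (x, y, importance weight) triples
    h = None
    for x in stream:
        p = rejection_threshold(x, S)          # query probability for this point
        if rng.uniform() <= p:
            y = oracle(x)
            S.append((x, y, 1.0 / p))          # importance-weight the example by 1/p
        h = learn(S)
    return h
```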

Page 82: A tutorial on active learning - Machine Learning (Theory)

IWAL: fallback result

Theorem: For all choices of Rejection-Threshold, if for all x, S:

  Rejection-Threshold(x, S) ≥ p_min

then IWAL is at most a constant factor worse than supervised learning.

Proof: Standard PAC/VC analysis, except with a martingale.

But is it better than supervised learning?

Page 83: A tutorial on active learning - Machine Learning (Theory)

Loss-weighting(unlabeled example x , history S)

1. Let H_S = the hypotheses that we might still eventually choose, assuming the distribution is IID.

2. Return max_{f,g∈H_S, y} ℓ(f(x), y) − ℓ(g(x), y).

Page 84: A tutorial on active learning - Machine Learning (Theory)

Loss-weighting safety

(Not trivial, because sometimes 0 is returned.)

Theorem: IWAL(Loss-weighting) competes with supervised learning.

But does it give us speedups?

Page 85: A tutorial on active learning - Machine Learning (Theory)

The Slope Asymmetry

For two predictions z, z′, what is the ratio of the maximum and minimum change in loss as a function of y?

  C_ℓ = sup_{z,z′} [max_{y∈Y} |ℓ(z, y) − ℓ(z′, y)|] / [min_{y∈Y} |ℓ(z, y) − ℓ(z′, y)|]

= the generalized maximum derivative ratio of the loss function ℓ:

  = 1 for 0/1 loss
  = 1 for hinge loss on [−1, 1] (i.e. normalized-margin SVM)
  ≤ 1 + e^B for logistic loss on [−B, B]
  = ∞ for squared loss
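A quick numeric check of the logistic-loss case (illustrative, not from the tutorial): over a grid of prediction pairs in [−B, B], the observed ratio stays below 1 + e^B. The grid resolution and the value of B are arbitrary choices.

```python
import numpy as np

def logistic_loss(z, y):
    return np.log1p(np.exp(-y * z))            # l(z, y) = log(1 + exp(-y z)), y in {-1, +1}

B = 2.0
zs = np.linspace(-B, B, 201)
worst = 0.0
for z in zs:
    for zp in zs:
        if z == zp:
            continue
        diffs = [abs(logistic_loss(z, y) - logistic_loss(zp, y)) for y in (-1, +1)]
        worst = max(worst, max(diffs) / min(diffs))
print(worst, "<=", 1 + np.exp(B))              # roughly 7.3 <= 8.39
```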

Page 86: A tutorial on active learning - Machine Learning (Theory)

[Figure: logistic loss log(1 + exp(−x)) as a function of the margin on [−3, 3], with lines showing its maximum and minimum derivatives]

Page 87: A tutorial on active learning - Machine Learning (Theory)

Generalized Disagreement Coefficient

Was:

  θ = sup_r P[DIS(B(h*, r))] / r
    = sup_r E_{x∼D} 1[∃ h, h′ ∈ B(h*, r) : h(x) ≠ h′(x)] / r

Generalized:

  θ = sup_r E_{x∼D} max_{h∈B(h*,r)} max_y |ℓ(h(x), y) − ℓ(h*(x), y)| / r

A big disagreement coefficient ⇔ near-optimal hypotheses often disagree in loss.

Page 88: A tutorial on active learning - Machine Learning (Theory)

Sufficient conditions for IWAL(Loss-weighting)

Let ν = min_{h∈H} E_{x,y∼D} ℓ(h(x), y) = the minimum possible loss rate.

Theorem: For all learning problems D and all hypothesis sets H, the label complexity after T unlabeled samples is at most (up to constants)

  θ C_ℓ (νT + √(T ln(|H| T / δ)))

Page 89: A tutorial on active learning - Machine Learning (Theory)

Experiments in the convex case

If:

1. Loss is convex

2. Representation is linear

Then computation is easy. (*)

(*) Caveat: you can't quite track H_S perfectly, so you must keep a minimum sampling probability for safety.

Page 90: A tutorial on active learning - Machine Learning (Theory)

Experiments in the nonconvex case

But sometimes a linear predictor isn't the best predictor. What do we do?

Bootstrap(unlabeled example x, history S):

1. If |S| < t, return 1.

2. If |S| = t, train 10 hypotheses H′ = {h_1, ..., h_10}, each on a bootstrap sample of S.

3. If |S| ≥ t, return p_min + (1 − p_min) max_{f,g∈H′, y} ℓ(f(x), y) − ℓ(g(x), y).
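A minimal sketch of this bootstrap rejection threshold, with scikit-learn decision trees standing in for J48 and 0/1 loss as the loss ℓ; both of these choices, and the (x, y, 1/p) triple format for S, are assumptions of the sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def make_bootstrap_threshold(t=100, n_models=10, p_min=0.1, seed=0):
    rng = np.random.default_rng(seed)
    committee = []                                 # trained once, when |S| reaches t

    def rejection_threshold(x, S):
        if len(S) < t:
            return 1.0                             # always query at first
        if not committee:
            X = np.array([xi for xi, yi, wi in S])
            Y = np.array([yi for xi, yi, wi in S])
            for _ in range(n_models):              # train each tree on a bootstrap sample
                idx = rng.integers(0, len(S), size=len(S))
                committee.append(DecisionTreeClassifier(random_state=0).fit(X[idx], Y[idx]))
        preds = [int(m.predict(np.asarray(x).reshape(1, -1))[0]) for m in committee]
        # for 0/1 loss, max_{f,g,y} l(f(x),y) - l(g(x),y) is 1 iff any two members disagree on x
        disagreement = float(max(preds) != min(preds))
        return p_min + (1 - p_min) * disagreement

    return rejection_threshold
```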

Page 91: A tutorial on active learning - Machine Learning (Theory)

Label savings results

Learner                    Mode    Rejection threshold   Dataset    Label savings
Logistic / interior point  Online  Constrained           MNIST      65%
J48                        Batch   Bootstrap             MNIST      35%
J48                        Batch   Bootstrap             Adult      60%
J48                        Batch   Bootstrap             Pima       32%
J48                        Batch   Bootstrap             Yeast      19%
J48                        Batch   Bootstrap             Spambase   55%
J48                        Batch   Bootstrap             Waveform   16%
J48                        Batch   Bootstrap             Letter     25%

In all cases the active learner's prediction performance ≃ supervised prediction performance. (In fact, more often better than worse.)

Page 92: A tutorial on active learning - Machine Learning (Theory)

An Example Plot: J48/Batch Bootstrap/MNist 3 vs 5

[Figure: left, test error (out of 2000) vs. number of points seen, for supervised and active; right, number of points queried vs. number of points seen]

The active learner marginally outperforms the supervised learner on the same number of examples while using only 2/3rds as many labels.

Page 93: A tutorial on active learning - Machine Learning (Theory)

What the theory says

Active learning applies anywhere supervised learning applies. Current techniques help most when:

1. Discrete loss: hard choices with discrete losses are made.

2. Low noise: the best predictor has small error.

3. Constrained H: the set of predictors you care about is constrained.

4. Low data: you don't already have lots of labeled data.

Page 94: A tutorial on active learning - Machine Learning (Theory)

Future work for all of us

1. Foundations: Is active learning possible in a fully adversarial setting?

2. Application: Is an active learning reduction to supervised learning possible without constraints?

3. Extension: What about other settings for interactive learning? (Structured prediction? Partial labels? Differing oracles with differing expertise?)

4. Empirical: Can we achieve good active learning performance with a consistent algorithm on a state-of-the-art problem?

Further discussion at http://hunch.net

Page 95: A tutorial on active learning - Machine Learning (Theory)

Bibliography

1. Yotam Abramson and Yoav Freund, Active Learning for Visual Object Recognition, UCSD tech report.

2. Nina Balcan, Alina Beygelzimer, and John Langford, Agnostic Active Learning, ICML 2006.

3. Alina Beygelzimer, Sanjoy Dasgupta, and John Langford, Importance Weighted Active Learning, ICML 2009.

4. David Cohn, Les Atlas, and Richard Ladner, Improving Generalization with Active Learning, Machine Learning 15(2):201-221, 1994.

5. Sanjoy Dasgupta and Daniel Hsu, Hierarchical Sampling for Active Learning, ICML 2008.

6. Sanjoy Dasgupta, Coarse Sample Complexity Bounds for Active Learning, NIPS 2005.

Page 96: A tutorial on active learning - Machine Learning (Theory)

Bibliography II

1. Sanjoy Dasgupta, Daniel J. Hsu, and Claire Monteleoni, A General Agnostic Active Learning Algorithm, NIPS 2007.

2. Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby, Selective Sampling Using the Query by Committee Algorithm, Machine Learning 28:133-168, 1997.

3. S. Hanneke, A Bound on the Label Complexity of Agnostic Active Learning, ICML 2007.

4. Manfred K. Warmuth, Gunnar Ratsch, Michael Mathieson, Jun Liao, and Christian Lemmon, Active Learning in the Drug Discovery Process, NIPS 2001.

5. Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani, Combining Active Learning and Semi-supervised Learning Using Gaussian Fields and Harmonic Functions, ICML 2003 workshop.