
Dasgupta, Kalai & Monteleoni COLT 2005

Analysis of perceptron-based active learning

Sanjoy Dasgupta, UCSD Adam Tauman Kalai, TTI-Chicago

Claire Monteleoni, MIT


Selective sampling, online constraints

Selective sampling framework:
- Unlabeled examples x_t are received one at a time.
- The learner makes a prediction at each time step.
- A noiseless oracle for the label y_t can be queried, at a cost.

Goal: minimize the number of labels needed to reach error ε, where ε is the error rate (w.r.t. the target) on the sampling distribution.

Online constraints:
- Space: the learner cannot store all previously seen examples (and then perform batch learning).
- Time: the running time of the learner's belief update step should not scale with the number of seen examples/mistakes.
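As a minimal sketch of this protocol (in Python, with illustrative names such as learner, stream, and oracle that are not from the slides), the selective-sampling loop with its online constraints might look like:

```python
# Minimal sketch of the selective-sampling protocol; names are illustrative.
def selective_sampling(learner, stream, oracle, num_rounds):
    """Online loop: predict on every point, pay only for queried labels."""
    labels_used = 0
    for _ in range(num_rounds):
        x_t = stream()                    # unlabeled example arrives
        learner.predict(x_t)              # learner must predict at every time step
        if learner.wants_label(x_t):      # filtering rule decides whether to query
            y_t = oracle(x_t)             # noiseless label, obtained at a cost
            labels_used += 1
            learner.update(x_t, y_t)      # constant-time, constant-space update
    return labels_used
```

The online constraints are reflected in the fact that update sees only the current example and label, never the history of the stream.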


[Illustration: AC Milan v. Inter Milan]


Problem framework

[Figure: target u, current hypothesis v_t, and the angle θ_t between them]

Target: u, a unit vector through the origin.
Current hypothesis: v_t.
Error region: ξ_t = {x : SGN(u · x) ≠ SGN(v_t · x)}.
Assumptions: separability; u passes through the origin; x ~ Uniform on the unit sphere S.
Error rate: ε_t = P[x ∈ ξ_t] = θ_t / π, where θ_t is the angle between u and v_t.
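As a quick numerical check of the relation ε_t = θ_t / π (a sketch, not from the slides; it assumes only NumPy):

```python
# Monte Carlo check that, for x uniform on the unit sphere, the probability
# that u and v_t disagree on SGN(.) equals theta_t / pi.
import numpy as np

rng = np.random.default_rng(0)
d = 10
u = rng.normal(size=d); u /= np.linalg.norm(u)     # target
v = rng.normal(size=d); v /= np.linalg.norm(v)     # current hypothesis
theta = np.arccos(np.clip(u @ v, -1.0, 1.0))       # angle between u and v

x = rng.normal(size=(200_000, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)      # x ~ Uniform on S
disagreement = np.mean(np.sign(x @ u) != np.sign(x @ v))
print(disagreement, theta / np.pi)                 # the two values should be close
```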


Related work

Analysis, under the selective sampling model, of the Query By Committee algorithm [Seung, Opper & Sompolinsky '92]:

Theorem [Freund, Seung, Shamir & Tishby '97]: Under selective sampling from the uniform distribution, QBC can learn a half-space through the origin to generalization error ε using Õ(d log 1/ε) labels.

But: the space required and the time complexity of the update both scale with the number of seen mistakes!


Related work

Perceptron: a simple online algorithm.
If y_t ≠ SGN(v_t · x_t), then:        (filtering rule)
    v_{t+1} = v_t + y_t x_t           (update step)

Distribution-free mistake bound O(1/γ²), if a margin γ exists.

Theorem [Baum '89]: The Perceptron, given sequential labeled examples from the uniform distribution, converges to generalization error ε after Õ(d/ε²) mistakes.
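A sketch of the standard Perceptron loop described above (labels supplied with the stream; this is the classical algorithm, not the modified update introduced later):

```python
# Standard online Perceptron: update only when the prediction is a mistake.
import numpy as np

def perceptron(examples, labels):
    """Return the final hypothesis and the number of mistakes on the stream."""
    v = np.zeros(examples.shape[1])
    mistakes = 0
    for x_t, y_t in zip(examples, labels):
        if y_t * (v @ x_t) <= 0:      # filtering rule: y_t != SGN(v_t . x_t)
            v = v + y_t * x_t         # update step
            mistakes += 1
    return v, mistakes
```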


Our contributions

A lower bound of Ω(1/ε²) labels for the Perceptron in the active learning context.

A modified Perceptron update with a Õ(d log 1/ε) mistake bound.

An active learning rule and a label bound of Õ(d log 1/ε).

A bound of Õ(d log 1/ε) on total errors (labeled or not).


Perceptron

Perceptron update: v_{t+1} = v_t + y_t x_t

The error does not decrease monotonically.

[Figure: u, v_t, x_t, v_{t+1}]


Lower bound on labels for Perceptron

Theorem 1: The Perceptron algorithm, using any active learning rule, requires Ω(1/ε²) labels to reach generalization error ε w.r.t. the uniform distribution.

Proof idea:
Lemma: For small θ_t, the Perceptron update will increase θ_t unless ||v_t|| is large: Ω(1/sin θ_t). But ||v_t|| grows at a rate of at most √t, so we need t ≥ 1/sin² θ_t. Under the uniform distribution, ε_t ∝ θ_t ≥ sin θ_t, which gives the Ω(1/ε²) bound.

[Figure: u, v_t, x_t, v_{t+1}]


A modified Perceptron update

Standard Perceptron update:
    v_{t+1} = v_t + y_t x_t

Instead, weight the update by "confidence" w.r.t. the current hypothesis v_t:
    v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t     (v_1 = y_0 x_0)

(similar to the update in [Blum et al. '96] for noise-tolerant learning)

Unlike the Perceptron:
Error decreases monotonically:
    cos(θ_{t+1}) = u · v_{t+1} = u · v_t + 2 |v_t · x_t| |u · x_t|
                 ≥ u · v_t = cos(θ_t)
||v_t|| = 1 (due to the factor of 2)
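A small sketch checking the two properties above numerically (assumed setup, not from the slides: unit-norm points drawn uniformly at random and labels y_t = SGN(u · x_t) from a target u):

```python
# Modified Perceptron update: verify that ||v_t|| stays 1 and that
# cos(theta_t) = u . v_t never decreases across updates.
import numpy as np

rng = np.random.default_rng(1)
d = 20
u = rng.normal(size=d); u /= np.linalg.norm(u)      # target

x0 = rng.normal(size=d); x0 /= np.linalg.norm(x0)
v = np.sign(u @ x0) * x0                            # v_1 = y_0 x_0
for _ in range(5000):
    x = rng.normal(size=d); x /= np.linalg.norm(x)
    y = np.sign(u @ x)
    if np.sign(v @ x) != y:                         # mistake: apply the update
        cos_before = u @ v
        v = v + 2 * y * abs(v @ x) * x              # confidence-weighted update
        assert abs(np.linalg.norm(v) - 1.0) < 1e-9  # norm preserved by the factor of 2
        assert u @ v >= cos_before - 1e-12          # error decreases monotonically
print("final angle theta_t:", np.arccos(np.clip(u @ v, -1.0, 1.0)))
```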


A modified Perceptron update

Perceptron update: v_{t+1} = v_t + y_t x_t
Modified Perceptron update: v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t

[Figure: u, x_t, and the hypotheses v_t and v_{t+1} under the two updates]


Mistake bound

Theorem 2: In the supervised setting, the modified Perceptron converges to generalization error ε after Õ(d log 1/ε) mistakes.

Proof idea: The exponential convergence follows from a multiplicative decrease in sin² θ_t. On an update, cos(θ_{t+1}) = cos(θ_t) + 2 |v_t · x_t| |u · x_t|, and we lower-bound 2 |v_t · x_t| |u · x_t|, with high probability, using our distributional assumption.


Mistake bound

[Figure: the band {x : |a · x| ≤ k} around the great circle of S orthogonal to a]

Theorem 2: In the supervised setting, the modified Perceptron converges to generalization error ε after Õ(d log 1/ε) mistakes.

Lemma (band): For any fixed a with ||a|| = 1, any γ ≤ 1, and x ~ U on S, the band {x : |a · x| ≤ γ/√d} has probability Θ(γ).

Apply this to |v_t · x| and |u · x| ⇒ 2 |v_t · x_t| |u · x_t| is large enough in expectation (using the size of θ_t).
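A Monte Carlo illustration of the band lemma's scaling (a sketch, not from the slides; the printed values are empirical, not the lemma's constants):

```python
# Empirical check: for x uniform on the unit sphere in R^d and a fixed unit
# vector a, P[ |a . x| <= gamma / sqrt(d) ] grows roughly linearly in gamma.
import numpy as np

rng = np.random.default_rng(2)
d = 50
a = np.zeros(d); a[0] = 1.0                        # any fixed unit vector
x = rng.normal(size=(500_000, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)      # x ~ Uniform on S

for gamma in (0.1, 0.2, 0.4, 0.8):
    p = np.mean(np.abs(x @ a) <= gamma / np.sqrt(d))
    print(f"gamma = {gamma:.1f}   P[|a.x| <= gamma/sqrt(d)] = {p:.3f}")
```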


Active learning rule

[Figure: hypothesis v_t, target u, and the labeling region L of width s_t around v_t's decision boundary]

Goal: filter so that we label just those points in the error region. But θ_t (and thus the error region ξ_t) is unknown!

Define the labeling region L = {x : |v_t · x| ≤ s_t}. Tradeoff in choosing the threshold s_t:
- If too high, we may wait too long for an error.
- If too low, the resulting update is too small.
The right choice of s_t, which depends on θ_t, makes the probability that a queried point is an error constant.

But θ_t is unknown! So choose s_t adaptively: start high, and halve it if there is no error in R consecutive labels. A sketch of this rule follows below.
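Putting the pieces together, a sketch of the active-learning rule (the initial threshold s0 and the value of R below are illustrative placeholders, not the constants from the analysis; stream() is assumed to return unit vectors and oracle(x) their true labels):

```python
# Modified Perceptron + adaptive label filtering: query only points with
# |v_t . x_t| <= s_t, and halve s_t after R consecutive labels with no mistake.
import numpy as np

def dkm_active_learner(stream, oracle, num_rounds, R=10, s0=1.0):
    v = None
    s = s0
    clean_run = 0            # queried labels since the last mistake
    labels_used = 0
    for _ in range(num_rounds):
        x = stream()
        if v is None:                          # initialize v_1 = y_0 x_0
            y = oracle(x); labels_used += 1
            v = y * x
            continue
        if abs(v @ x) <= s:                    # labeling region: query the oracle
            y = oracle(x); labels_used += 1
            if np.sign(v @ x) != y:            # mistake: modified Perceptron update
                v = v + 2 * y * abs(v @ x) * x
                clean_run = 0
            else:
                clean_run += 1
                if clean_run >= R:             # no error in R consecutive labels
                    s /= 2.0                   # halve the threshold
                    clean_run = 0
    return v, labels_used
```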


Label bound

Theorem 3: In the active learning setting, the modified Perceptron, using the adaptive filtering rule, converges to generalization error ε after Õ(d log 1/ε) labels.

Corollary: The total number of errors (labeled and unlabeled) is Õ(d log 1/ε).
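As a usage example of the sketch given after the active learning rule (it reuses dkm_active_learner and assumes a uniform-sphere stream with a hypothetical target u):

```python
# Driver for the dkm_active_learner sketch: uniform-sphere stream, target u.
import numpy as np

rng = np.random.default_rng(3)
d = 20
u = rng.normal(size=d); u /= np.linalg.norm(u)

def stream():
    z = rng.normal(size=d)
    return z / np.linalg.norm(z)               # x ~ Uniform on S

oracle = lambda x: np.sign(u @ x)              # noiseless labels from the target

v, labels = dkm_active_learner(stream, oracle, num_rounds=50_000)
print("labels queried:", labels)
print("final angle to u:", np.arccos(np.clip(u @ v, -1.0, 1.0)))
```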


Proof technique

Proof outline: We show the following lemmas hold with sufficient probability:

Lemma 1. s_t does not decrease too quickly.

Lemma 2. We query labels on a constant fraction of ξ_t.

Lemma 3. With constant probability, the update is good.

By the algorithm, roughly a 1/R fraction of the labels are mistakes, and there exists R = Õ(1).
⇒ We can thus bound labels and total errors in terms of mistakes.


Proof technique

Lemma 1. s_t is large enough.

Proof (by contradiction): Let t be the first time s_t becomes too small. A halving event means we saw R consecutive labels with no mistakes, so we bound the probability of that happening.

Lemma 1a: For any particular i, this event happens with probability ≤ 3/4.


Proof technique

[Figure: u, v_t, and the threshold s_t]

Lemma 1a, proof idea: Using this value of s_t, the band lemma in R^{d-1} gives constant probability of x′ falling in an appropriately defined band w.r.t. u′, where:
- x′ is the component of x orthogonal to v_t,
- u′ is the component of u orthogonal to v_t.


Proof technique

Lemma 2. We query labels on a constant fraction of ξ_t.
Proof: Assume Lemma 1 for the lower bound on s_t; then apply Lemma 1a and the band lemma.

Lemma 3. With constant probability, the update is good.
Proof: Assuming Lemma 1, by Lemma 2 each error is labeled with constant probability. From the mistake bound proof, each update is good (a multiplicative decrease in error) with constant probability.

Finally, solve for R: every R labels there is at least one update or we halve s_t, and there exists R = Õ(1) for which the bounds hold.


Summary of contributions (samples / mistakes / labels / total errors / online?)

PAC complexity [Long'95, Long'03]: Õ(d/ε) samples, with a matching Ω(d/ε) lower bound; all samples labeled; not online.

Perceptron [Baum'97]: Õ(d/ε³) samples; Õ(d/ε²) mistakes; Ω(1/ε²) labels and total errors; online.

QBC [FSST'97]: Õ((d/ε) log 1/ε) samples; Õ(d log 1/ε) labels and total errors; not online.

[DKM'05]: Õ((d/ε) log 1/ε) samples; Õ(d log 1/ε) mistakes; Õ(d log 1/ε) labels; Õ(d log 1/ε) total errors; online.


Conclusions and open problems

- We achieve the optimal label complexity for this problem and, unlike QBC, with a fully online algorithm.
- Matching bound on total errors (labeled and unlabeled).

Future work:
- Relax the distributional assumptions: uniformity is sufficient but not necessary for the proof. (Note: this bound is not possible under arbitrary distributions [Dasgupta '04].)
- Relax the separability assumption: allow a "margin" of tolerated error.
- Analyze a margin version, for exponential convergence without the dependence on d.


Thank you!