Geometry-based sampling in learning and classification
Or: Universal ε-approximators for integration of some nonnegative functions
Leonard J. Schulman, Caltech
Joint with Michael Langberg, Open U. Israel
(work in progress)
Outline
1. Vapnik-Chervonenkis (VC) method / PAC learning; ε-nets, ε-approximators. The shatter function as a covering code.
2. ε-approximators (core-sets) for clustering; universal approximation of integrals of families of unbounded nonnegative functions.
3. Failure of the naive sampling approach.
4. Small-variance estimators. Sensitivity and total sensitivity.
5. Some results on sensitivity; the MIN operator on families. Sensitivity for k-medians.
6. Covering code for k-median.
PAC Learning. Integrating {0,1}-valued functions
If F is a family of “concepts” (functions f : X → {0,1}) of finite VC dimension d, then for every input distribution μ there exists a distribution ν with support of size O((d/ε²) log(1/ε)) such that for every f ∈ F,
|∫ f dμ − ∫ f dν| < ε.
Method: Create ν by repeated independent sampling x_1, …, x_m from μ. This creates an estimator T = (1/m) Σ_{i=1}^m f(x_i) which w.h.p. approximates the integral of any specific f to within additive ε.
Example: F = characteristic functions of intervals [a,b] on ℝ. These are {0,1}-valued functions with VC dimension = 2.
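To make the sampling step concrete, here is a minimal sketch (my addition, not part of the talk) for the interval example: ν is the uniform distribution on m i.i.d. draws from μ, and the “true” integral is itself approximated by a large Monte Carlo sample. The mixture chosen for μ and the helper sample_mu are arbitrary illustrations.

```python
# Sketch: additive eps-approximation of interval-indicator integrals by plain i.i.d. sampling.
import numpy as np

rng = np.random.default_rng(0)

def interval_indicator(a, b):
    # characteristic function of [a, b]
    return lambda x: ((x >= a) & (x <= b)).astype(float)

def sample_mu(m):
    # an arbitrary input distribution mu on the line (two-component Gaussian mixture)
    comp = rng.integers(0, 2, size=m)
    return np.where(comp == 0, rng.normal(-2.0, 0.5, m), rng.normal(3.0, 1.0, m))

eps, d = 0.05, 2                               # intervals have VC dimension 2
m = int((d / eps**2) * np.log(1 / eps))        # sample size suggested by the slide's bound
support = sample_mu(m)                         # nu = uniform distribution on these m points

f = interval_indicator(-1.0, 2.0)
true_val = f(sample_mu(2_000_000)).mean()      # Monte Carlo stand-in for the integral over mu
core_val = f(support).mean()                   # integral over nu
print(f"m={m}  additive error = {abs(true_val - core_val):.4f}  (target eps = {eps})")
```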
PAC Learning. Integrating {0,1}-valued functions (cont.)
It is easy to see that for any particular f ∈ F, w.h.p. |∫ f dμ − ∫ f dν| < ε. But how do we argue this is simultaneously achieved for all the functions f ∈ F? We can't take a union bound over infinitely many “bad events”.
Need to express that there are few “types” of bad events. To conquer the infinite union bound, apply the “Red & Green Points” argument.
Sample m = O((d/ε²) log(1/ε)) “green” points G from μ. We will use ν = ν_G = the uniform distribution on G.
P(ν_G is not an ε-approximator) = P(∃ f ∈ F badly counted by G) ≤ P(∃ f ∈ F : |μ(f) − ν_G(f)| > ε).
Suppose ν_G is not an ε-approximator: ∃ f : |μ(f) − ν_G(f)| > ε. Sample another m = O((d/ε²) log(1/ε)) “red” points R from μ.
With probability > ½, |μ(f) − ν_R(f)| < ε/2 (Markov inequality). So:
P(∃ f ∈ F : |μ(f) − ν_G(f)| > ε) < 2 P(∃ f ∈ F : |ν_R(f) − ν_G(f)| > ε/2).
PAC Learning. Integrating {0,1}-valued functions (cont.)
P(∃ f ∈ F : |μ(f) − ν_G(f)| > ε) < 2 P(∃ f ∈ F : |ν_R(f) − ν_G(f)| > ε/2).
μ has vanished from our expression! Our failure event depends only on the restriction of f to R ∪ G.
Key: for finite-VC-dimension F, every f ∈ F is identical on R ∪ G to one of a small (much less than 2^(2m)) set of functions. These are a “covering code” for F on R ∪ G. Cardinality Π(2m) ≤ (2m)^d ≪ 2^(2m).
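A small numeric companion (my addition) to the “few types of bad events” point: restricted to the 2m red and green points, interval concepts realize only on the order of m² distinct patterns, far fewer than 2^(2m).

```python
# Count the distinct restrictions of interval indicators to 2m sample points.
import numpy as np

m = 8
pts = np.sort(np.random.default_rng(1).normal(size=2 * m))   # the 2m red+green points

patterns = set()
# On sorted points, an interval picks out a contiguous block, so enumerating
# block boundaries enumerates all possible restrictions.
for lo in range(len(pts) + 1):
    for hi in range(lo, len(pts) + 1):
        mask = (np.arange(len(pts)) >= lo) & (np.arange(len(pts)) < hi)
        patterns.add(tuple(mask))

print(len(patterns), "distinct restrictions, versus 2^(2m) =", 2 ** (2 * m))
# Expect roughly (2m)(2m+1)/2 + 1 patterns, i.e. polynomial in m.
```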
Integrating functions into larger ranges (still additive approximation)
Extensions of the VC-dimension notion to families of f : X → {0,...,n}: Pollard 1984; Natarajan 1989; Vapnik 1989; Ben-David, Cesa-Bianchi, Haussler, Long 1997.
Families of functions f : X → [0,1]: extension of the VC-dimension notion (analogous to the discrete definitions, but insisting on quantitative separation of values): “fat-shattering”. Alon, Ben-David, Cesa-Bianchi, Haussler 1993; Kearns, Schapire 1994; Bartlett, Long, Williamson 1996; Bartlett, Long 1998.
Function classes with finite “dimension” (as above) possess small core-sets for additive ε-approximation of integrals. The same method still works: simply construct ν by repeatedly sampling from μ.
This does not solve multiplicative approximation of nonnegative functions.
What classes of ℝ⁺-valued functions possess core-sets for integration?
Why multiplicative approximation? In optimization we often wish to minimize a nonnegative loss function.
It makes sense to settle for a (1+ε)-multiplicative approximation (and this is often unavoidable because of hardness).
Example: important optimization problems arise in classification.
Choose c_1, ..., c_k to minimize:
• k-median function: cost(f_{c_1,...,c_k}) = ∫ ||x − {c_1,...,c_k}|| dμ(x), where ||x − {c_1,...,c_k}|| is the distance from x to the nearest center
• k-means function: cost(f_{c_1,...,c_k}) = ∫ ||x − {c_1,...,c_k}||² dμ(x)
• Or, for any α > 0: cost(f_{c_1,...,c_k}) = ∫ ||x − {c_1,...,c_k}||^α dμ(x)
Core-sets can be useful because of the standard algorithmic approach:
(1) Replace the input μ (an empirical distribution on a huge number of points, or even a continuous distribution given via certain “oracles”) by an ε-approximator ν (aka core-set) supported on a small set.
(2) Find an optimal (or near-optimal) c_1, ..., c_k for ν.
(3) Infer that it is near-optimal for μ.
In this lecture we focus solely on the existence/non-existence of finite-cardinality core-sets, not on how to find them. Theorems will hold for any “input” distribution μ regardless of how it is presented.
Known core-sets
In numerical analysis: quadrature methods for exact integration over canonical measures (constant on an interval or ball; Gaussian; etc.).
In CS previously: very different approaches from what I’ll present today. (If μ is uniform on a finite set, let n = |Support(μ)|.)
Our general goal is to find out which families of functions have, for every μ, bounded core-sets for integration.
In particular, our method shows existence (but gives no algorithm) of core-sets for k-median, of size poly(d, k, 1/ε), roughly ~O(d k³ / ε²).
Size unbounded (dependent on n):
Har-Peled, Mazumdar ’04: k-medians & k-means: O((k/ε^d) log n)
Chen ’06: k-medians & k-means: ~O((d k² / ε²) log n)
Har-Peled ’06: in one dimension, integration of other families of functions (e.g., monotone): ~O([family-specific] log n)
Size bounded (independent of n):
Effros, Schulman ’04: k-means: (k (d/ε)^d)^O(k), deterministically
Har-Peled, Kushal ’05: k-medians: O(k²/ε^d), k-means: O(k³/ε^(d+1))
Why doesn’t “sample from μ” work? Ex. 1
For additive approximation, the learning theory approach, “construct ν by repeatedly sampling from μ”, was sufficient to obtain a core-set. Why does it fail now?
Simple example: 1-means functions. F = {(x−a)²}_{a ∈ ℝ}. Let μ be concentrated on two well-separated “singularities”, with almost all of its mass on the left-hand one.
Construct ν by sampling repeatedly from μ: almost surely all samples will lie in the left-hand singularity.
If a is at the left-hand singularity, ∫ f dμ > 0, but w.h.p. ∫ f dν = 0. No multiplicative approximation.
Underlying problem: the naive estimator of ∫ f dμ has large variance.
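An illustrative sketch (my addition) of how this failure shows up numerically; the specific μ (almost all mass at 0, a single point far away) and the sample size are arbitrary choices.

```python
# Naive uniform sampling gives no multiplicative guarantee for 1-means costs
# when mu is extremely unbalanced.
import numpy as np

rng = np.random.default_rng(2)
n_heavy = 10**6
# mu: almost all mass at x = 0 (the "left-hand singularity"), one point at x = 1000.
points  = np.concatenate([np.zeros(n_heavy), [1000.0]])
weights = np.full(n_heavy + 1, 1.0 / (n_heavy + 1))
weights /= weights.sum()

a = 0.0                                              # center placed at the heavy singularity
true_cost = np.sum(weights * (points - a) ** 2)      # integral of (x-a)^2 over mu, strictly > 0

sample = rng.choice(points, size=200, p=weights)     # naive i.i.d. sample from mu
est_cost = np.mean((sample - a) ** 2)                # integral over nu = uniform on the sample
print(f"true cost = {true_cost:.4f}, naive estimate = {est_cost:.4f}")
# With overwhelming probability the sample misses x = 1000, so the estimate is 0:
# the ratio estimate/true is 0, i.e. no (1+eps)-multiplicative approximation.
```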
Why doesn’t “sample from μ” work? Ex. 2
For additive approximation, the learning theory approach, “construct ν by repeatedly sampling from μ”, was sufficient to obtain a core-set. Why does it fail now?
Even simpler example: interval functions. F = {f_{a,b}} where f_{a,b}(x) = 1 for x ∈ [a,b], 0 otherwise. These are {0,1}-valued functions with VC dimension = 2. Let μ be the uniform distribution on n points.
For μ = uniform dist. on n points and an interval capturing a single point, f̄ = 1/n, while for x sampled from μ, StDev(f(x)) ≈ 1/√n.
Actually nothing works for this family F: for an accurate multiplicative estimate of f̄ = ∫ f dμ, a core-set would need to contain the entire support of μ.
Why doesn’t “sample from μ” work? Ex. 3: step functions
Even the simpler family of step functions doesn’t have finite core-sets:
an ε-approximator needs geometrically spaced points (factor (1+ε)) in the support of μ.
General approach: weighted sampling
Sample not from μ but from a distribution q which depends on both μ and F.
Weighted sampling has long been used for clustering algorithms [Fernandez de la Vega, Kenyon ’98; Kleinberg, Papadimitriou, Raghavan ’98; Schulman ’98; Alon, Sudakov ’99; ...], to reduce the size of the data set.
What we’re trying to explain is (a) for which classes of functions weighted sampling can provide an ε-approximator (core-set); (b) what the connection is with the VC proof of existence of ε-approximators in learning theory.

Return to Ex. 1: show ∃ a small-variance estimator for f̄ = ∫ f dμ
Sample x from q. The random variable T = μ_x f_x / q_x is an unbiased estimator of f̄.
Can we design q so that Var(T) is small ∀ f ∈ F? Ideally: Var(T) ∈ O(f̄²).
For the case of “1-means in one dimension”, the optimization
Given μ, choose q to minimize max_{f ∈ F} Var(T)
can be solved (with mild pain) by Lagrange multipliers. Solution: let σ² = Var(μ) and center μ at 0. Then sample from
q_x = μ_x (σ² + x²) / (2σ²).
(Note: q heavily weights the tails of μ.) Calculation: Var(T) ≤ f̄².
(Now average O(1/ε²) samples. For any specific f, only a 1±ε error.)
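A sketch (my addition) of the weighted-sampling estimator on a small discrete μ, using the q_x = μ_x(σ²+x²)/(2σ²) formula from this slide; the particular support points and sample counts are arbitrary.

```python
# Weighted-sampling estimator for 1-means costs in one dimension.
import numpy as np

rng = np.random.default_rng(3)
xs = np.array([0.0, 0.0, 0.0, 0.0, 50.0])       # unbalanced support
mu = np.full(len(xs), 1.0 / len(xs))

xs = xs - np.sum(mu * xs)                        # center mu at 0
sigma2 = np.sum(mu * xs**2)                      # sigma^2 = Var(mu)
q = mu * (sigma2 + xs**2) / (2 * sigma2)         # the Lagrange-multiplier solution
assert abs(q.sum() - 1.0) < 1e-9                 # q is a probability distribution

def estimate(a, n_samples=400):
    """Average of T = mu_x * f_x / q_x over i.i.d. samples x ~ q, with f_x = (x-a)^2."""
    idx = rng.choice(len(xs), size=n_samples, p=q)
    T = mu[idx] * (xs[idx] - a) ** 2 / q[idx]
    return T.mean()

for a in [xs[0], xs[-1], 10.0]:
    exact = np.sum(mu * (xs - a) ** 2)
    print(f"a={a:8.2f}  exact={exact:10.3f}  weighted-sampling estimate={estimate(a):10.3f}")
```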
For what classes F of nonnegative functions does there exist, for every μ, an estimator T with Var(T) ∈ O(f̄²)?
E.g., what about nonnegative quartics, f_x = (x−a)²(x−b)²?
We shouldn’t have to do Lagrange multipliers each time.
Key notion: sensitivity. Define the sensitivity of x w.r.t. (F, μ): s_x = sup_{f ∈ F} f_x / f̄. Define the total sensitivity of F: S(F) = sup_μ ∫ s_x dμ.
Sample from the distribution q_x = μ_x s_x / S. (This emphasizes sensitive x’s.)
Theorem 1: Var(T) ≤ (S−1) f̄². Proof omitted.
Exercise: for “parabolas”, F = {(x−a)²}, show S = 2. Corollary: Var(T) ≤ f̄² (as previously obtained via Lagrange multipliers).
Theorem 2 (slightly harder): T has a Chernoff bound (its distribution has exponential tails). We don’t need this today.
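A sketch (my addition) of sensitivity-based sampling for the parabola family on a finite-support μ. The supremum defining s_x is approximated here by a crude grid over centers a, so the computed total sensitivity is only an estimate (it should come out at most 2 for this family, per the exercise above).

```python
# Sensitivity-based sampling for F = {(x-a)^2} on a finite-support mu.
import numpy as np

rng = np.random.default_rng(4)
xs = np.array([-3.0, -1.0, 0.0, 0.0, 0.0, 2.0, 40.0])
mu = np.full(len(xs), 1.0 / len(xs))

a_grid = np.linspace(-100, 100, 4001)                  # crude stand-in for the sup over a
fvals = (xs[None, :] - a_grid[:, None]) ** 2           # f_x for each candidate center a
fbar = fvals @ mu                                      # f-bar for each candidate a
s = np.max(fvals / fbar[:, None], axis=0)              # s_x, approximated over the grid
S = np.sum(mu * s)                                     # total sensitivity for this mu (<= 2)
q = mu * s / S                                         # the sampling distribution q_x = mu_x s_x / S
print(f"estimated total sensitivity for this mu: {S:.3f}")

def estimate(a, n_samples=200):
    idx = rng.choice(len(xs), size=n_samples, p=q)
    return np.mean(mu[idx] * (xs[idx] - a) ** 2 / q[idx])

a = 40.0
exact = np.sum(mu * (xs - a) ** 2)
print(f"exact={exact:.3f}  estimate={estimate(a):.3f}  (Theorem 1: Var(T) <= (S-1) * fbar^2)")
```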
Can we generalize the success of Ex. 1?
Example 1. Let V be a real or complex d-dimensional vector space of functions on X. For each v = (... v_x ...) ∈ V define an f ∈ F by f_x = |v_x|².
Theorem 3: S(F) = d. Proof omitted.
Corollary (again): quadratics in 1 dimension have S(F) = 2.
Quartics in 1 dimension have S(F) = 3.
Quadratics in r dimensions have S(F) = r+1.
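A numeric sanity check (my addition) of Theorem 3 on a finite ground set X: writing v = Aw for a basis A of V, s_x becomes a generalized Rayleigh quotient, μ_x s_x is a statistical leverage score, and the sensitivities integrate to d, matching Theorem 3.

```python
# Check Theorem 3 for the family f_x = |(A w)_x|^2 on a finite ground set.
import numpy as np

rng = np.random.default_rng(5)
n, d = 50, 4
A = rng.normal(size=(n, d))            # basis of a d-dimensional function space on X
mu = rng.random(n); mu /= mu.sum()     # an arbitrary distribution on X

M = A.T @ (mu[:, None] * A)            # M = A^T diag(mu) A
# s_x = sup_w (a_x . w)^2 / (w^T M w) = a_x^T M^{-1} a_x   (Cauchy-Schwarz in the M-inner product)
s = np.einsum("xi,ij,xj->x", A, np.linalg.inv(M), A)
print("integral of s_x dmu =", np.sum(mu * s), " (Theorem 3: equals d =", d, ")")
```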
Can we calculate S for more examples?
Example 2. Let F+G = {f+g : f ∈ F, g ∈ G}.
Theorem 4 (easy): S(F+G) ≤ S(F) + S(G).
Corollary: bounded sums of squares of bounded-degree polynomials have finite S.
Example 3. Parabolas on k disjoint regions (figure: parabolas supported on the disjoint intervals [0,1], [1,2], [2,3]). This is a direct sum of vector spaces, so S ≤ 2k.
Can we calculate S for more examples?
But all these examples don’t even handle the 1-median functions f_x = ||x − a||.
And certainly not the k-median functions f_x = ||x − {c_1,...,c_k}||.
Will return to this...
What about MIN(F,G)?
Question: if F and G have finite total sensitivity, is the same true of
MIN(F,G) = {min(f,g) : f ∈ F, g ∈ G}?
We want this for optimization: e.g., the k-means or k-median functions are constructed by MIN out of simple families.
We know S(Parabolas) = 2; what is S(MIN(Parabolas, Parabolas))?
Answer: unbounded.
So total sensitivity does not remain finite under the MIN operator.
Idea for counterexample:
Roughly, on a suitable distribution μ (figure: density ∝ e^{−|x|}), a sequence of “pairs of parabolas” can mimic a sequence of step functions.
And recall from earlier: step functions have unbounded total sensitivity.
This counterexample relies on scaling the two parabolas differently.
What if we only allow translates of a single “base function”?
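A quick numeric companion (my addition) to the “step functions have unbounded total sensitivity” claim: for F = {1_[t,∞)}, s_x = 1/μ([x,∞)), so for μ uniform on n points the total sensitivity is the harmonic number H_n, which grows without bound as n grows.

```python
# Total sensitivity of step functions on the uniform distribution over n points.
import numpy as np

for n in [10, 100, 1000, 10000]:
    xs = np.arange(n, dtype=float)          # n distinct support points
    mu = np.full(n, 1.0 / n)
    tail = mu[::-1].cumsum()[::-1]          # mu([x, infinity)) for each support point x
    s = 1.0 / tail                          # sensitivity of each point under the step family
    total = np.sum(mu * s)                  # equals the harmonic number H_n
    print(f"n={n:6d}  total sensitivity = {total:8.3f}   (ln n = {np.log(n):.3f})")
```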
Finite total sensitivity of clustering functions
Let M be any metric space and let F = {||x − a||^α}_{a ∈ M} (1-median is α = 1, 1-means is α = 2).
Theorem 5: For any α > 0, S(F) < ∞. Note: the bound is independent of M. S is not an analogue of the VC dimension / cover function; it is a new parameter, needed for unbounded functions.
Theorem 6: For any α > 0, S(MIN(F, ...(k times)..., F)) ∈ O(k).
But remember this is only half the story: bounded S ensures only good approximation of f̄ = ∫ f dμ for each individual function f ∈ F. We also need to handle all f simultaneously: the “VC” aspect.
(figure: a 2-median function in M = ℝ²)
Cover codes for families of functions F
Recall the “Red and Green Points” argument in VC theory: after picking 2m points from μ, all the {0,1}-valued functions in the concept class fall into just m^O(1) equivalence classes by their restriction to R ∪ G. (The “shatter function” is Π(m) = O(m^VC-dim).) These restrictions are a covering code for the concept class.
For ℝ⁺-valued functions we use a more general definition. First try: f ∈ F is “covered” by g if ∀ x ∈ R ∪ G, f_x = (1±ε) g_x.
But this definition neglects the role of sensitivity. Corrected definition: f ∈ F is “covered” by g if ∀ x ∈ R ∪ G, |f_x − g_x| < ε f̄ s_x / 8S.
Notes: (1) The error can scale with f̄ rather than f_x. (2) More error is tolerated on high-sensitivity points.
A “covering code” (for μ, R ∪ G) is a small (Π(m,ε) subexponential in m) family G, such that every f ∈ F is covered by some g ∈ G.
Cover codes for families of functions F (cont.)
So (now focusing on k-median) we need to prove two things:
Theorem 6: S(MIN(F, ...(k times)..., F)) ∈ O(k).
Theorem 7:
(a) In ℝ^d, Π(MIN(F, ...(k times)..., F)) ≤ m^poly(k, 1/ε, d).
(b) A Chernoff bound for ∫ f_x/s_x dν_G as an estimator of ∫ f_x/s_x dν_{R∪G}. (Recall ν_G = uniform dist. on G.)
Today we talk only about:
Theorem 6;
Theorem 7 in the case k = 1, d arbitrary.
Thm 6: Total sensitivity of k-median functions
Theorem 6: S(MIN(F, ...(k times)..., F)) ∈ O(k).
Proof: Given μ, let f* be the optimal clustering function, with centers u*_1, ..., u*_k, so that
h = k-median-cost(μ) = ∫ ||x − {u*_1,...,u*_k}|| dμ.
For any x, we need to upper bound s_x. Let U_i = the Voronoi region of u*_i, and set
p_i = ∫_{U_i} dμ
h_i = (1/p_i) ∫_{U_i} ||x − u*_i|| dμ
so that h = Σ_i p_i h_i.
Suppose x ∈ U_1. Let f be any k-median function, with centers u_1, ..., u_k; wlog the center closest to u*_1 is u_1. Let a = ||u*_1 − u_1||. By the Markov inequality, at least p_1/2 of the mass is within distance 2h_1 of u*_1. So:
f̄ ≥ (p_1/2) max(0, a − 2h_1)
f̄ ≥ h
and so
f̄ ≥ h/2 + (p_1/4) max(0, a − 2h_1).
(figure: Voronoi regions of the optimal centers u*_1, u*_2, u*_3; a point x ∈ U_1; the nearest new center u_1 at distance a from u*_1)
Thm 6: Total sensitivity of k-median functions (cont.)
Recall f̄ ≥ h/2 + (p_1/4) max(0, a − 2h_1). From the definition of sensitivity,
s_x = max_f f_x / f̄ ≤ max_f ||x − {u_1,...,u_k}|| / f̄ ≤ max_f ||x − u_1|| / f̄ ≤ ...
(one can show the worst case is either a = 2h_1 or a = ∞)
... ≤ 4h_1/h + 2||x − u*_1||/h + 4/p_1.
Thus
S = ∫ s_x dμ = Σ_i ∫_{U_i} s_x dμ
≤ Σ_i ∫_{U_i} [ 4h_i/h + 2||x − u*_i||/h + 4/p_i ] dμ
= (4/h) Σ_i p_i h_i + (2/h) ∫ ||x − {u*_1,...,u*_k}|| dμ + Σ_i 4
= 4 + 2 + 4k = 6 + 4k.
(Best possible up to constants.)
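A sketch (my addition) of how the bound just derived could drive a sampling scheme on a finite point set: the centers below are simply assumed to be a near-optimal k-median solution (the talk does not prescribe how to obtain them), the per-point quantity is the upper bound 4h_i/h + 2||x − u*_i||/h + 4/p_i, and its empirical average reproduces the 6 + 4k total.

```python
# Sensitivity-upper-bound sampling for k-median on a finite point set.
import numpy as np

rng = np.random.default_rng(6)
X = np.concatenate([rng.normal(0, 1, (500, 2)),
                    rng.normal(8, 1, (500, 2)),
                    rng.normal((0, 12), 1, (500, 2))])
centers = np.array([[0.0, 0.0], [8.0, 8.0], [0.0, 12.0]])   # assumed near-optimal centers u*_i

dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
assign = dists.argmin(axis=1)                                # Voronoi region U_i of each point
n, k = len(X), len(centers)
h  = dists[np.arange(n), assign].mean()                      # k-median cost of mu = uniform on X
p  = np.bincount(assign, minlength=k) / n                    # p_i = mu(U_i)
hi = np.array([dists[assign == i, i].mean() for i in range(k)])

s_bound = 4 * hi[assign] / h + 2 * dists[np.arange(n), assign] / h + 4 / p[assign]
print("empirical total-sensitivity bound:", s_bound.mean(), " (Theorem 6: 6 + 4k =", 6 + 4 * k, ")")

# Sample a weighted core-set proportionally to the sensitivity bound.
q = s_bound / s_bound.sum()
m = 200
idx = rng.choice(n, size=m, p=q, replace=True)
core_weights = (1.0 / n) / (m * q[idx])                      # importance weights

# Usage example: estimate the k-median cost at some other centers from the core-set alone.
test_centers = np.array([[1.0, 1.0], [7.0, 7.5], [-1.0, 11.0]])
d_test = np.linalg.norm(X[:, None, :] - test_centers[None, :, :], axis=2).min(axis=1)
print(f"cost at test centers: exact {d_test.mean():.3f}, "
      f"core-set estimate {np.sum(core_weights * d_test[idx]):.3f}")
```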
Thm 7: Chernoff bound for “VC” argument; cover code
Theorem 7(b): Consider R ∪ G as having been chosen already; now select G at random within R ∪ G. We need a Chernoff bound for the random variable ∫ f_x/s_x dν_G as an estimator of ∫ f_x/s_x dν_{R∪G}.
Proof: Recall f_x / s_x ≤ f̄, so 0 ≤ ∫ f_x/s_x dν_{R∪G} ≤ f̄. We need error O(ε f̄) in the estimator, but not necessarily O(ε ∫ f_x/s_x dν_{R∪G}); so standard Chernoff bounds for bounded random variables suffice.
Theorem 7(a): Start with the case k = 1, i.e. the family F_1 = {||x − a||}_{a ∈ ℝ^d}. (Wlog shift so that the minimum h is achieved at a = 0.) By the Markov inequality, at least ½ the mass lies in B(0, 2h). Cover code: two “clouds” of f’s. Inner cloud: centers a sprinkled so that the balls B(a, εh/mS) cover B(0, 3h). Outer cloud: geometrically spaced, by a factor (1 + ε/mS), to cover B(0, hmS/ε).
NB: The size of the cover code is roughly (mS/ε)^O(d): polynomial in m, so the “Red/Green” argument works.
Thm 7: Chernoff bound for “VC” argument; cover code (cont.)
Why is every f = ||x − a|| covered by this code? In cases 1 & 2, f is covered by the g whose root b is closest to a.
Case 1: a ∈ the inner ball B(0, 3h). Then for all x, |f_x − g_x| is bounded via the Lipschitz property.
Case 2: a ∈ the outer ball B(0, hmS/ε). This forces f̄ to be large (proportional to a rather than h), which makes it easier to achieve |f_x − g_x| ≤ ε f̄; again use the Lipschitz property.
Case 3: a ∉ the outer ball B(0, hmS/ε). In this case f is covered by the constant function g_x = a. Again this forces f̄ to be large (proportional to a rather than h), but for x far from 0 this is not enough. Use the inequality h ≥ |x|/s_x: distant points have high sensitivity. Take advantage of the extra tolerance for error on high-sensitivity points.
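A sketch (my addition) of the two-cloud construction specialized to one dimension, with the slide's parameters taken literally (inner spacing εh/mS, outer ratio 1 + ε/mS, outer radius hmS/ε); the numbers plugged in are arbitrary.

```python
# Generate the inner-grid and outer-geometric cover-code centers in 1 dimension.
import numpy as np

def cover_code_centers(h, m, S, eps):
    step = eps * h / (m * S)
    inner = np.arange(-3 * h, 3 * h + step, step)             # inner cloud covering B(0, 3h)
    ratio = 1 + eps / (m * S)
    n_out = int(np.ceil(np.log((m * S / eps) / 2) / np.log(ratio)))
    outer = 2 * h * ratio ** np.arange(n_out + 1)              # geometric spacing out to h*m*S/eps
    return np.concatenate([inner, outer, -outer])

centers = cover_code_centers(h=1.0, m=50, S=4.0, eps=0.5)
print(len(centers), "cover-code centers")   # polynomial in m, so the Red/Green argument applies
```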
Thm 7: Chernoff bound for “VC” argument; cover code. k > 1
For k > 1, use a similar construction, but start from the optimal clustering f* with centers u*_1, ..., u*_k. Surround each u*_i by a cloud of appropriate radius.
Given a k-median function f with centers u_i, cover it by a function g which, in each Voronoi region U_i of f*, is either a constant or a 1-median function centered at a cloud point nearest u_i.
This produces a covering code for k-median functions, with
log |covering code| ∈ ~O(k d log(S/ε)).
We need m (the number of samples from μ) to be about (log |covering code|) × (Var(T)/f̄²) ≈ k d S² / ε² ≈ d k³ / ε².
Some open questions
(1) Is there an efficient algorithm to find a small ε-approximator? (Suppose Support(μ) is finite.)
(2) For {0,1}-valued functions there was a finitary characterization of whether the cover function of F is exponential or sub-exponential: the largest set shattered by F.
Question: is there an analogous finitary characterization of the cover function for multiplicative approximation of ℝ⁺-valued functions?
(It is not sufficient that the level sets have low VC dimension; step functions are a counterexample.)