Geometry-based sampling in learning and classification
Or: Universal ε-approximators for integration of some nonnegative functions
Leonard J. Schulman, Caltech
Joint with Michael Langberg, Open U. Israel
(work in progress)
Outline
1. Vapnik-Chervonenkis (VC) method / PAC learning; ε-nets, ε-approximators. The shatter function as a covering code.
2. ε-approximators (core-sets) for clustering; universal approximation of integrals of families of unbounded nonnegative functions.
3. Failure of the naive sampling approach.
4. Small-variance estimators. Sensitivity and total sensitivity.
5. Some results on sensitivity; the MIN operator on families. Sensitivity for k-medians.
6. Covering code for k-median.
PAC Learning. Integrating {0,1}-valued functions
If F is a family of “concepts” (functions f : X → {0,1}) of finite VC dimension d, then for every input distribution μ there exists a distribution ν with support of size O((d/ε²) log(1/ε)) such that for every f ∈ F,
|∫ f dμ − ∫ f dν| < ε.
Method: Create ν by repeated independent sampling x_1, …, x_m from μ. This creates an estimator T = (1/m) Σ_{i=1}^m f(x_i) which w.h.p. approximates the integral of any specific f to within additive ε.
Example: F = characteristic functions of intervals [a,b] on ℝ. These are {0,1}-valued functions with VC dimension = 2.
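To make the sampling step concrete, here is a minimal sketch (my addition, not part of the talk) for the interval example: ν is the uniform distribution on m i.i.d. draws from μ, and the “true” integral is itself approximated by a large Monte Carlo sample. The mixture chosen for μ and the helper sample_mu are arbitrary illustrations.

```python
# Sketch: additive eps-approximation of interval-indicator integrals by plain i.i.d. sampling.
import numpy as np

rng = np.random.default_rng(0)

def interval_indicator(a, b):
    # characteristic function of [a, b]
    return lambda x: ((x >= a) & (x <= b)).astype(float)

def sample_mu(m):
    # an arbitrary input distribution mu on the line (two-component Gaussian mixture)
    comp = rng.integers(0, 2, size=m)
    return np.where(comp == 0, rng.normal(-2.0, 0.5, m), rng.normal(3.0, 1.0, m))

eps, d = 0.05, 2                               # intervals have VC dimension 2
m = int((d / eps**2) * np.log(1 / eps))        # sample size suggested by the slide's bound
support = sample_mu(m)                         # nu = uniform distribution on these m points

f = interval_indicator(-1.0, 2.0)
true_val = f(sample_mu(2_000_000)).mean()      # Monte Carlo stand-in for the integral over mu
core_val = f(support).mean()                   # integral over nu
print(f"m={m}  additive error = {abs(true_val - core_val):.4f}  (target eps = {eps})")
```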
PAC Learning. Integrating {0,1}-valued functions (cont.)
It is easy to see that for any particular f ∈ F, w.h.p. |∫ f dμ − ∫ f dν| < ε. But how do we argue this is simultaneously achieved for all the functions f ∈ F? We can't take a union bound over infinitely many “bad events”.
Need to express that there are few “types” of bad events. To conquer the infinite union bound, apply the “Red & Green Points” argument.
Sample m = O((d/ε²) log(1/ε)) “green” points G from μ. We will use ν = ν_G = the uniform distribution on G.
P(ν_G is not an ε-approximator) = P(∃ f ∈ F badly counted by G) ≤ P(∃ f ∈ F : |μ(f) − ν_G(f)| > ε).
Suppose ν_G is not an ε-approximator: ∃ f : |μ(f) − ν_G(f)| > ε. Sample another m = O((d/ε²) log(1/ε)) “red” points R from μ.
With probability > ½, |μ(f) − ν_R(f)| < ε/2 (Markov inequality). So:
P(∃ f ∈ F : |μ(f) − ν_G(f)| > ε) < 2 P(∃ f ∈ F : |ν_R(f) − ν_G(f)| > ε/2).
PAC Learning. Integrating {0,1}-valued functions (cont.)
P(∃ f ∈ F : |μ(f) − ν_G(f)| > ε) < 2 P(∃ f ∈ F : |ν_R(f) − ν_G(f)| > ε/2).
μ has vanished from our expression! Our failure event depends only on the restriction of f to R ∪ G.
Key: for finite-VC-dimension F, every f ∈ F is identical on R ∪ G to one of a small (much less than 2^(2m)) set of functions. These are a “covering code” for F on R ∪ G. Cardinality Π(2m) ≤ (2m)^d ≪ 2^(2m).
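A small numeric companion (my addition) to the “few types of bad events” point: restricted to the 2m red and green points, interval concepts realize only on the order of m² distinct patterns, far fewer than 2^(2m).

```python
# Count the distinct restrictions of interval indicators to 2m sample points.
import numpy as np

m = 8
pts = np.sort(np.random.default_rng(1).normal(size=2 * m))   # the 2m red+green points

patterns = set()
# On sorted points, an interval picks out a contiguous block, so enumerating
# block boundaries enumerates all possible restrictions.
for lo in range(len(pts) + 1):
    for hi in range(lo, len(pts) + 1):
        mask = (np.arange(len(pts)) >= lo) & (np.arange(len(pts)) < hi)
        patterns.add(tuple(mask))

print(len(patterns), "distinct restrictions, versus 2^(2m) =", 2 ** (2 * m))
# Expect roughly (2m)(2m+1)/2 + 1 patterns, i.e. polynomial in m.
```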
Integrating functions into larger ranges (still additive approximation)
Extensions of the VC-dimension notion to families of f : X → {0,...,n}: Pollard 1984; Natarajan 1989; Vapnik 1989; Ben-David, Cesa-Bianchi, Haussler, Long 1997.
Families of functions f : X → [0,1]: extension of the VC-dimension notion (analogous to the discrete definitions, but insisting on quantitative separation of values): “fat-shattering”. Alon, Ben-David, Cesa-Bianchi, Haussler 1993; Kearns, Schapire 1994; Bartlett, Long, Williamson 1996; Bartlett, Long 1998.
Function classes with finite “dimension” (as above) possess small core-sets for additive ε-approximation of integrals. The same method still works: simply construct ν by repeatedly sampling from μ.
This does not solve multiplicative approximation of nonnegative functions.
What classes of ℝ⁺-valued functions possess core-sets for integration?
Why multiplicative approximation? In optimization we often wish to minimize a nonnegative loss function.
It makes sense to settle for a (1+ε)-multiplicative approximation (and this is often unavoidable because of hardness).
Example: important optimization problems arise in classification.
Choose c_1, ..., c_k to minimize:
• k-median function: cost(f_{c_1,...,c_k}) = ∫ ||x − {c_1,...,c_k}|| dμ(x), where ||x − {c_1,...,c_k}|| is the distance from x to the nearest center
• k-means function: cost(f_{c_1,...,c_k}) = ∫ ||x − {c_1,...,c_k}||² dμ(x)
• Or, for any α > 0: cost(f_{c_1,...,c_k}) = ∫ ||x − {c_1,...,c_k}||^α dμ(x)
Core-sets can be useful because of the standard algorithmic approach:
(1) Replace the input μ (an empirical distribution on a huge number of points, or even a continuous distribution given via certain “oracles”) by an ε-approximator ν (aka core-set) supported on a small set.
(2) Find an optimal (or near-optimal) c_1, ..., c_k for ν.
(3) Infer that it is near-optimal for μ.
In this lecture we focus solely on the existence/non-existence of finite-cardinality core-sets, not on how to find them. Theorems will hold for any “input” distribution μ regardless of how it is presented.
Known core-sets
In numerical analysis: quadrature methods for exact integration over canonical measures (constant on an interval or ball; Gaussian; etc.).
In CS previously: very different approaches from what I’ll present today. (If μ is uniform on a finite set, let n = |Support(μ)|.)
Our general goal is to find out which families of functions have, for every μ, bounded core-sets for integration.
In particular, our method shows existence (but gives no algorithm) of core-sets for k-median, of size poly(d, k, 1/ε), roughly ~O(d k³ / ε²).
Size unbounded (dependent on n):
Har-Peled, Mazumdar ’04: k-medians & k-means: O((k/ε^d) log n)
Chen ’06: k-medians & k-means: ~O((d k² / ε²) log n)
Har-Peled ’06: in one dimension, integration of other families of functions (e.g., monotone): ~O([family-specific] log n)
Size bounded (independent of n):
Effros, Schulman ’04: k-means: (k (d/ε)^d)^O(k), deterministically
Har-Peled, Kushal ’05: k-medians: O(k²/ε^d), k-means: O(k³/ε^(d+1))
Why doesn’t “sample from μ” work? Ex. 1
For additive approximation, the learning theory approach, “construct ν by repeatedly sampling from μ”, was sufficient to obtain a core-set. Why does it fail now?
Simple example: 1-means functions. F = {(x−a)²}_{a ∈ ℝ}. Let μ be concentrated on two well-separated “singularities”, with almost all of its mass on the left-hand one.
Construct ν by sampling repeatedly from μ: almost surely all samples will lie in the left-hand singularity.
If a is at the left-hand singularity, ∫ f dμ > 0, but w.h.p. ∫ f dν = 0. No multiplicative approximation.
Underlying problem: the naive estimator of ∫ f dμ has large variance.
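An illustrative sketch (my addition) of how this failure shows up numerically; the specific μ (almost all mass at 0, a single point far away) and the sample size are arbitrary choices.

```python
# Naive uniform sampling gives no multiplicative guarantee for 1-means costs
# when mu is extremely unbalanced.
import numpy as np

rng = np.random.default_rng(2)
n_heavy = 10**6
# mu: almost all mass at x = 0 (the "left-hand singularity"), one point at x = 1000.
points  = np.concatenate([np.zeros(n_heavy), [1000.0]])
weights = np.full(n_heavy + 1, 1.0 / (n_heavy + 1))
weights /= weights.sum()

a = 0.0                                              # center placed at the heavy singularity
true_cost = np.sum(weights * (points - a) ** 2)      # integral of (x-a)^2 over mu, strictly > 0

sample = rng.choice(points, size=200, p=weights)     # naive i.i.d. sample from mu
est_cost = np.mean((sample - a) ** 2)                # integral over nu = uniform on the sample
print(f"true cost = {true_cost:.4f}, naive estimate = {est_cost:.4f}")
# With overwhelming probability the sample misses x = 1000, so the estimate is 0:
# the ratio estimate/true is 0, i.e. no (1+eps)-multiplicative approximation.
```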
Why doesn’t “sample from μ” work? Ex. 2
For additive approximation, the learning theory approach, “construct ν by repeatedly sampling from μ”, was sufficient to obtain a core-set. Why does it fail now?
Even simpler example: interval functions. F = {f_{a,b}} where f_{a,b}(x) = 1 for x ∈ [a,b], 0 otherwise. These are {0,1}-valued functions with VC dimension = 2. Let μ be the uniform distribution on n points.
For μ = uniform dist. on n points and an interval capturing a single point, f̄ = 1/n, while for x sampled from μ, StDev(f(x)) ≈ 1/√n.
Actually nothing works for this family F: for an accurate multiplicative estimate of f̄ = ∫ f dμ, a core-set would need to contain the entire support of μ.
Why doesn’t “sample from μ” work? Ex. 3: step functions
Even the simpler family of step functions doesn’t have finite core-sets:
an ε-approximator needs geometrically spaced points (factor (1+ε)) in the support of μ.
General approach: weighted sampling
Sample not from μ but from a distribution q which depends on both μ and F.
Weighted sampling has long been used for clustering algorithms [Fernandez de la Vega, Kenyon ’98; Kleinberg, Papadimitriou, Raghavan ’98; Schulman ’98; Alon, Sudakov ’99; ...], to reduce the size of the data set.
What we’re trying to explain is (a) for which classes of functions weighted sampling can provide an ε-approximator (core-set); (b) what the connection is with the VC proof of existence of ε-approximators in learning theory.

Return to Ex. 1: show ∃ a small-variance estimator for f̄ = ∫ f dμ
Sample x from q. The random variable T = μ_x f_x / q_x is an unbiased estimator of f̄.
Can we design q so that Var(T) is small ∀ f ∈ F? Ideally: Var(T) ∈ O(f̄²).
For the case of “1-means in one dimension”, the optimization
Given μ, choose q to minimize max_{f ∈ F} Var(T)
can be solved (with mild pain) by Lagrange multipliers. Solution: let σ² = Var(μ) and center μ at 0. Then sample from
q_x = μ_x (σ² + x²) / (2σ²).
(Note: q heavily weights the tails of μ.) Calculation: Var(T) ≤ f̄².
(Now average O(1/ε²) samples. For any specific f, only a 1±ε error.)
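A sketch (my addition) of the weighted-sampling estimator on a small discrete μ, using the q_x = μ_x(σ²+x²)/(2σ²) formula from this slide; the particular support points and sample counts are arbitrary.

```python
# Weighted-sampling estimator for 1-means costs in one dimension.
import numpy as np

rng = np.random.default_rng(3)
xs = np.array([0.0, 0.0, 0.0, 0.0, 50.0])       # unbalanced support
mu = np.full(len(xs), 1.0 / len(xs))

xs = xs - np.sum(mu * xs)                        # center mu at 0
sigma2 = np.sum(mu * xs**2)                      # sigma^2 = Var(mu)
q = mu * (sigma2 + xs**2) / (2 * sigma2)         # the Lagrange-multiplier solution
assert abs(q.sum() - 1.0) < 1e-9                 # q is a probability distribution

def estimate(a, n_samples=400):
    """Average of T = mu_x * f_x / q_x over i.i.d. samples x ~ q, with f_x = (x-a)^2."""
    idx = rng.choice(len(xs), size=n_samples, p=q)
    T = mu[idx] * (xs[idx] - a) ** 2 / q[idx]
    return T.mean()

for a in [xs[0], xs[-1], 10.0]:
    exact = np.sum(mu * (xs - a) ** 2)
    print(f"a={a:8.2f}  exact={exact:10.3f}  weighted-sampling estimate={estimate(a):10.3f}")
```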
For what classes F of nonnegative functions does there exist, for every μ, an estimator T with Var(T) ∈ O(f̄²)?
E.g., what about nonnegative quartics, f_x = (x−a)²(x−b)²?
We shouldn’t have to do Lagrange multipliers each time.
Key notion: sensitivity. Define the sensitivity of x w.r.t. (F, μ): s_x = sup_{f ∈ F} f_x / f̄. Define the total sensitivity of F: S(F) = sup_μ ∫ s_x dμ.
Sample from the distribution q_x = μ_x s_x / S. (This emphasizes sensitive x’s.)
Theorem 1: Var(T) ≤ (S−1) f̄². Proof omitted.
Exercise: for “parabolas”, F = {(x−a)²}, show S = 2. Corollary: Var(T) ≤ f̄² (as previously obtained via Lagrange multipliers).
Theorem 2 (slightly harder): T has a Chernoff bound (its distribution has exponential tails). We don’t need this today.
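A sketch (my addition) of sensitivity-based sampling for the parabola family on a finite-support μ. The supremum defining s_x is approximated here by a crude grid over centers a, so the computed total sensitivity is only an estimate (it should come out at most 2 for this family, per the exercise above).

```python
# Sensitivity-based sampling for F = {(x-a)^2} on a finite-support mu.
import numpy as np

rng = np.random.default_rng(4)
xs = np.array([-3.0, -1.0, 0.0, 0.0, 0.0, 2.0, 40.0])
mu = np.full(len(xs), 1.0 / len(xs))

a_grid = np.linspace(-100, 100, 4001)                  # crude stand-in for the sup over a
fvals = (xs[None, :] - a_grid[:, None]) ** 2           # f_x for each candidate center a
fbar = fvals @ mu                                      # f-bar for each candidate a
s = np.max(fvals / fbar[:, None], axis=0)              # s_x, approximated over the grid
S = np.sum(mu * s)                                     # total sensitivity for this mu (<= 2)
q = mu * s / S                                         # the sampling distribution q_x = mu_x s_x / S
print(f"estimated total sensitivity for this mu: {S:.3f}")

def estimate(a, n_samples=200):
    idx = rng.choice(len(xs), size=n_samples, p=q)
    return np.mean(mu[idx] * (xs[idx] - a) ** 2 / q[idx])

a = 40.0
exact = np.sum(mu * (xs - a) ** 2)
print(f"exact={exact:.3f}  estimate={estimate(a):.3f}  (Theorem 1: Var(T) <= (S-1) * fbar^2)")
```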
Can we generalize the success of Ex. 1?
Example 1. Let V be a real or complex d-dimensional vector space of functions on X. For each v = (... v_x ...) ∈ V define an f ∈ F by f_x = |v_x|².
Theorem 3: S(F) = d. Proof omitted.
Corollary (again): quadratics in 1 dimension have S(F) = 2.
Quartics in 1 dimension have S(F) = 3.
Quadratics in r dimensions have S(F) = r+1.
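A numeric sanity check (my addition) of Theorem 3 on a finite ground set X: writing v = Aw for a basis A of V, s_x becomes a generalized Rayleigh quotient, μ_x s_x is a statistical leverage score, and the sensitivities integrate to d, matching Theorem 3.

```python
# Check Theorem 3 for the family f_x = |(A w)_x|^2 on a finite ground set.
import numpy as np

rng = np.random.default_rng(5)
n, d = 50, 4
A = rng.normal(size=(n, d))            # basis of a d-dimensional function space on X
mu = rng.random(n); mu /= mu.sum()     # an arbitrary distribution on X

M = A.T @ (mu[:, None] * A)            # M = A^T diag(mu) A
# s_x = sup_w (a_x . w)^2 / (w^T M w) = a_x^T M^{-1} a_x   (Cauchy-Schwarz in the M-inner product)
s = np.einsum("xi,ij,xj->x", A, np.linalg.inv(M), A)
print("integral of s_x dmu =", np.sum(mu * s), " (Theorem 3: equals d =", d, ")")
```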
Can we calculate S for more examples?
Example 2. Let F+G = {f+g : f ∈ F, g ∈ G}.
Theorem 4 (easy): S(F+G) ≤ S(F) + S(G).
Corollary: bounded sums of squares of bounded-degree polynomials have finite S.
Example 3. Parabolas on k disjoint regions (figure: parabolas supported on the disjoint intervals [0,1], [1,2], [2,3]). This is a direct sum of vector spaces, so S ≤ 2k.
Can we calculate S for more examples?
But all these examples don’t even handle the 1-median functions f_x = ||x − a||.
And certainly not the k-median functions f_x = ||x − {c_1,...,c_k}||.
Will return to this...
What about MIN(F,G)?
Question: if F and G have finite total sensitivity, is the same true of
MIN(F,G) = {min(f,g) : f ∈ F, g ∈ G}?
We want this for optimization: e.g., the k-means or k-median functions are constructed by MIN out of simple families.
We know S(Parabolas) = 2; what is S(MIN(Parabolas, Parabolas))?
Answer: unbounded.
So total sensitivity does not remain finite under the MIN operator.
Idea for counterexample:
Roughly, on a suitable distribution μ (figure: density ∝ e^{−|x|}), a sequence of “pairs of parabolas” can mimic a sequence of step functions.
And recall from earlier: step functions have unbounded total sensitivity.
This counterexample relies on scaling the two parabolas differently.
What if we only allow translates of a single “base function”?
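A quick numeric companion (my addition) to the “step functions have unbounded total sensitivity” claim: for F = {1_[t,∞)}, s_x = 1/μ([x,∞)), so for μ uniform on n points the total sensitivity is the harmonic number H_n, which grows without bound as n grows.

```python
# Total sensitivity of step functions on the uniform distribution over n points.
import numpy as np

for n in [10, 100, 1000, 10000]:
    xs = np.arange(n, dtype=float)          # n distinct support points
    mu = np.full(n, 1.0 / n)
    tail = mu[::-1].cumsum()[::-1]          # mu([x, infinity)) for each support point x
    s = 1.0 / tail                          # sensitivity of each point under the step family
    total = np.sum(mu * s)                  # equals the harmonic number H_n
    print(f"n={n:6d}  total sensitivity = {total:8.3f}   (ln n = {np.log(n):.3f})")
```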
Finite total sensitivity of clustering functions
Let M be any metric space and let F = {||x − a||^α}_{a ∈ M} (1-median is α = 1, 1-means is α = 2).
Theorem 5: For any α > 0, S(F) < ∞. Note: the bound is independent of M. S is not an analogue of the VC dimension / cover function; it is a new parameter, needed for unbounded functions.
Theorem 6: For any α > 0, S(MIN(F, ...(k times)..., F)) ∈ O(k).
But remember this is only half the story: bounded S ensures only good approximation of f̄ = ∫ f dμ for each individual function f ∈ F. We also need to handle all f simultaneously: the “VC” aspect.
(figure: a 2-median function in M = ℝ²)
Cover codes for families of functions F
Recall the “Red and Green Points” argument in VC theory: after picking 2m points from μ, all the {0,1}-valued functions in the concept class fall into just m^O(1) equivalence classes by their restriction to R ∪ G. (The “shatter function” is Π(m) = O(m^VC-dim).) These restrictions are a covering code for the concept class.
For ℝ⁺-valued functions we use a more general definition. First try: f ∈ F is “covered” by g if ∀ x ∈ R ∪ G, f_x = (1±ε) g_x.
But this definition neglects the role of sensitivity. Corrected definition: f ∈ F is “covered” by g if ∀ x ∈ R ∪ G, |f_x − g_x| < ε f̄ s_x / 8S.
Notes: (1) The error can scale with f̄ rather than f_x. (2) More error is tolerated on high-sensitivity points.
A “covering code” (for μ, R ∪ G) is a small (Π(m,ε) subexponential in m) family G, such that every f ∈ F is covered by some g ∈ G.
Cover codes for families of functions F (cont.)
So (now focusing on k-median) we need to prove two things:
Theorem 6: S(MIN(F, ...(k times)..., F)) ∈ O(k).
Theorem 7:
(a) In ℝ^d, Π(MIN(F, ...(k times)..., F)) ≤ m^poly(k, 1/ε, d).
(b) A Chernoff bound for ∫ f_x/s_x dν_G as an estimator of ∫ f_x/s_x dν_{R∪G}. (Recall ν_G = uniform dist. on G.)
Today we talk only about:
Theorem 6;
Theorem 7 in the case k = 1, d arbitrary.
Thm 6: Total sensitivity of k-median functions
Theorem 6: S(MIN(F, ...(k times)..., F)) ∈ O(k).
Proof: Given μ, let f* be the optimal clustering function, with centers u*_1, ..., u*_k, so that
h = k-median-cost(μ) = ∫ ||x − {u*_1,...,u*_k}|| dμ.
For any x, we need to upper bound s_x. Let U_i = the Voronoi region of u*_i, and set
p_i = ∫_{U_i} dμ
h_i = (1/p_i) ∫_{U_i} ||x − u*_i|| dμ
so that h = Σ_i p_i h_i.
Suppose x ∈ U_1. Let f be any k-median function, with centers u_1, ..., u_k; wlog the center closest to u*_1 is u_1. Let a = ||u*_1 − u_1||. By the Markov inequality, at least p_1/2 of the mass is within distance 2h_1 of u*_1. So:
f̄ ≥ (p_1/2) max(0, a − 2h_1)
f̄ ≥ h
and so
f̄ ≥ h/2 + (p_1/4) max(0, a − 2h_1).
(figure: Voronoi regions of the optimal centers u*_1, u*_2, u*_3; a point x ∈ U_1; the nearest new center u_1 at distance a from u*_1)
Thm 6: Total sensitivity of k-median functions (cont.)
Recall f̄ ≥ h/2 + (p_1/4) max(0, a − 2h_1). From the definition of sensitivity,
s_x = max_f f_x / f̄ ≤ max_f ||x − {u_1,...,u_k}|| / f̄ ≤ max_f ||x − u_1|| / f̄ ≤ ...
(one can show the worst case is either a = 2h_1 or a = ∞)
... ≤ 4h_1/h + 2||x − u*_1||/h + 4/p_1.
Thus
S = ∫ s_x dμ = Σ_i ∫_{U_i} s_x dμ
≤ Σ_i ∫_{U_i} [ 4h_i/h + 2||x − u*_i||/h + 4/p_i ] dμ
= (4/h) Σ_i p_i h_i + (2/h) ∫ ||x − {u*_1,...,u*_k}|| dμ + Σ_i 4
= 4 + 2 + 4k = 6 + 4k.
(Best possible up to constants.)
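A sketch (my addition) of how the bound just derived could drive a sampling scheme on a finite point set: the centers below are simply assumed to be a near-optimal k-median solution (the talk does not prescribe how to obtain them), the per-point quantity is the upper bound 4h_i/h + 2||x − u*_i||/h + 4/p_i, and its empirical average reproduces the 6 + 4k total.

```python
# Sensitivity-upper-bound sampling for k-median on a finite point set.
import numpy as np

rng = np.random.default_rng(6)
X = np.concatenate([rng.normal(0, 1, (500, 2)),
                    rng.normal(8, 1, (500, 2)),
                    rng.normal((0, 12), 1, (500, 2))])
centers = np.array([[0.0, 0.0], [8.0, 8.0], [0.0, 12.0]])   # assumed near-optimal centers u*_i

dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
assign = dists.argmin(axis=1)                                # Voronoi region U_i of each point
n, k = len(X), len(centers)
h  = dists[np.arange(n), assign].mean()                      # k-median cost of mu = uniform on X
p  = np.bincount(assign, minlength=k) / n                    # p_i = mu(U_i)
hi = np.array([dists[assign == i, i].mean() for i in range(k)])

s_bound = 4 * hi[assign] / h + 2 * dists[np.arange(n), assign] / h + 4 / p[assign]
print("empirical total-sensitivity bound:", s_bound.mean(), " (Theorem 6: 6 + 4k =", 6 + 4 * k, ")")

# Sample a weighted core-set proportionally to the sensitivity bound.
q = s_bound / s_bound.sum()
m = 200
idx = rng.choice(n, size=m, p=q, replace=True)
core_weights = (1.0 / n) / (m * q[idx])                      # importance weights

# Usage example: estimate the k-median cost at some other centers from the core-set alone.
test_centers = np.array([[1.0, 1.0], [7.0, 7.5], [-1.0, 11.0]])
d_test = np.linalg.norm(X[:, None, :] - test_centers[None, :, :], axis=2).min(axis=1)
print(f"cost at test centers: exact {d_test.mean():.3f}, "
      f"core-set estimate {np.sum(core_weights * d_test[idx]):.3f}")
```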
Thm 7: Chernoff bound for “VC” argument; cover code
Theorem 7(b): Consider R ∪ G as having been chosen already; now select G at random within R ∪ G. We need a Chernoff bound for the random variable ∫ f_x/s_x dν_G as an estimator of ∫ f_x/s_x dν_{R∪G}.
Proof: Recall f_x / s_x ≤ f̄, so 0 ≤ ∫ f_x/s_x dν_{R∪G} ≤ f̄. We need error O(ε f̄) in the estimator, but not necessarily O(ε ∫ f_x/s_x dν_{R∪G}); so standard Chernoff bounds for bounded random variables suffice.
Theorem 7(a): Start with the case k = 1, i.e. the family F_1 = {||x − a||}_{a ∈ ℝ^d}. (Wlog shift so that the minimum h is achieved at a = 0.) By the Markov inequality, at least ½ the mass lies in B(0, 2h). Cover code: two “clouds” of f’s. Inner cloud: centers a sprinkled so that the balls B(a, εh/mS) cover B(0, 3h). Outer cloud: geometrically spaced, by a factor (1 + ε/mS), to cover B(0, hmS/ε).
NB: The size of the cover code is roughly (mS/ε)^O(d): polynomial in m, so the “Red/Green” argument works.
Thm 7: Chernoff bound for “VC” argument; cover code (cont.)
Why is every f = ||x − a|| covered by this code? In cases 1 & 2, f is covered by the g whose root b is closest to a.
Case 1: a ∈ the inner ball B(0, 3h). Then for all x, |f_x − g_x| is bounded via the Lipschitz property.
Case 2: a ∈ the outer ball B(0, hmS/ε). This forces f̄ to be large (proportional to a rather than h), which makes it easier to achieve |f_x − g_x| ≤ ε f̄; again use the Lipschitz property.
Case 3: a ∉ the outer ball B(0, hmS/ε). In this case f is covered by the constant function g_x = a. Again this forces f̄ to be large (proportional to a rather than h), but for x far from 0 this is not enough. Use the inequality h ≥ |x|/s_x: distant points have high sensitivity. Take advantage of the extra tolerance for error on high-sensitivity points.
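A sketch (my addition) of the two-cloud construction specialized to one dimension, with the slide's parameters taken literally (inner spacing εh/mS, outer ratio 1 + ε/mS, outer radius hmS/ε); the numbers plugged in are arbitrary.

```python
# Generate the inner-grid and outer-geometric cover-code centers in 1 dimension.
import numpy as np

def cover_code_centers(h, m, S, eps):
    step = eps * h / (m * S)
    inner = np.arange(-3 * h, 3 * h + step, step)             # inner cloud covering B(0, 3h)
    ratio = 1 + eps / (m * S)
    n_out = int(np.ceil(np.log((m * S / eps) / 2) / np.log(ratio)))
    outer = 2 * h * ratio ** np.arange(n_out + 1)              # geometric spacing out to h*m*S/eps
    return np.concatenate([inner, outer, -outer])

centers = cover_code_centers(h=1.0, m=50, S=4.0, eps=0.5)
print(len(centers), "cover-code centers")   # polynomial in m, so the Red/Green argument applies
```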
Thm 7: Chernoff bound for “VC” argument; cover code. k > 1
For k > 1, use a similar construction, but start from the optimal clustering f* with centers u*_1, ..., u*_k. Surround each u*_i by a cloud of appropriate radius.
Given a k-median function f with centers u_i, cover it by a function g which, in each Voronoi region U_i of f*, is either a constant or a 1-median function centered at a cloud point nearest u_i.
This produces a covering code for k-median functions, with
log |covering code| ∈ ~O(k d log(S/ε)).
We need m (the number of samples from μ) to be about (log |covering code|) × (Var(T)/f̄²) ≈ k d S² / ε² ≈ d k³ / ε².
Some open questions
(1) Is there an efficient algorithm to find a small ε-approximator? (Suppose Support(μ) is finite.)
(2) For {0,1}-valued functions there was a finitary characterization of whether the cover function of F is exponential or sub-exponential: the largest set shattered by F.
Question: is there an analogous finitary characterization of the cover function for multiplicative approximation of ℝ⁺-valued functions?
(It is not sufficient that the level sets have low VC dimension; step functions are a counterexample.)