
Learning, testing, and approximating halfspaces

Rocco Servedio

Columbia University

DIMACS-RUTCOR

Jan 2009

Overview

[Figure: points labeled + and - separated by a halfspace]

Halfspaces over $\{-1,1\}^n$:

• testing
• learning
• approximation

Joint work with:

Ronitt Rubinfeld

Kevin Matulef

Ryan O’Donnell

Ilias Diakonikolas

Approximation

Given a function $f: \{-1,1\}^n \to \{-1,1\}$, the goal is to obtain a “simpler” function $g$ such that $\Pr_x[f(x) \ne g(x)] \le \epsilon$.

• We measure distance between functions under the uniform distribution.
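Not from the talk: a minimal Monte Carlo sketch of this distance measure (the helper name est_dist is my own).

```python
import random

def est_dist(f, g, n, samples=200_000):
    """Estimate dist(f,g) = Pr_{x uniform over {-1,1}^n}[f(x) != g(x)]."""
    disagreements = 0
    for _ in range(samples):
        x = [random.choice((-1, 1)) for _ in range(n)]
        disagreements += (f(x) != g(x))
    return disagreements / samples

# Example: majority on 9 bits vs. the single coordinate x_1 (estimate ~0.36).
maj = lambda x: 1 if sum(x) > 0 else -1
first_bit = lambda x: x[0]
print(est_dist(maj, first_bit, 9))
```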

Approximating classes of functions

Interested in statements of the form:

“Every function in class $C$ has a simple approximator.”

Example statement: Every size-$s$ decision tree can be $\epsilon$-approximated by a decision tree of depth $\log(s/\epsilon)$.

[Figure: a small decision tree with 0/1 leaf labels]

Testing

Goal: infer a “global” property of a function via few “local” inspections.

The tester makes black-box queries to an oracle for an arbitrary $f: \{-1,1\}^n \to \{-1,1\}$, and must output:

• “yes” whp if $f \in C$
• “no” whp if $f$ is $\epsilon$-far from every $g \in C$
• Any answer is OK if $f$ is $\epsilon$-close to some $g \in C$

Usual focus: the information-theoretic number of queries required.

Some known property testing results

Classes of functions over $\{-1,1\}^n$ that are testable with a number of queries independent of $n$:

• parity functions [BLR93]
• degree-$d$ polynomials [AKK+03]
• literals [PRS02]
• conjunctions [PRS02]
• $k$-juntas [FKRSS04]
• $s$-term monotone DNF [PRS02]
• $s$-term DNF [DLM+07]
• size-$s$ decision trees [DLM+07]
• $s$-sparse polynomials [DLM+07]

We’ll get to learning later

Halfspaces

A function $f: \{-1,1\}^n \to \{-1,1\}$ is a halfspace if there exist $w \in \mathbb{R}^n$ and $\theta \in \mathbb{R}$ such that $f(x) = \mathrm{sgn}(w \cdot x - \theta)$ for all $x \in \{-1,1\}^n$.

• Also called linear threshold functions (LTFs), threshold gates, etc.

• Fundamental to learning theory – halfspaces are at the heart of many learning algorithms: Perceptron, Winnow, boosting, Support Vector Machines, …

• Well studied in complexity theory

Some examples of halfspaces

Weights can be all the same (e.g. majority, $\mathrm{sgn}(x_1 + \cdots + x_n)$)… but they don’t have to be (e.g. exponentially decreasing weights give a decision list).
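A small sketch of these objects as code (my own illustration; make_halfspace is a hypothetical helper name, reused in later sketches):

```python
def make_halfspace(w, theta=0.0):
    """The halfspace x -> sgn(w . x - theta) over {-1,1}^n, with sgn(0) = +1."""
    def f(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else -1
    return f

majority5 = make_halfspace([1, 1, 1, 1, 1])   # all weights the same...
weighted5 = make_halfspace([5, 3, 2, 1, 1])   # ...but they don't have to be
```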

What’s a “simple” halfspace?

Every halfspace has a representation with integer weights:

– the domain is finite, so we can “nudge” the weights to rational numbers (yielding an equivalent halfspace), then scale to integers.

Some halfspaces over $\{-1,1\}^n$ require integer weights as large as $n^{\Omega(n)}$ [MTT61, H94].

Low-weight halfspaces are nice for complexity and learning.

Approximating halfspaces using small weights?

Let $f$ be an arbitrary halfspace. If $g$ is a halfspace which $\epsilon$-approximates $f$, how large do the weights of $g$ need to be?

Let’s warm up with a concrete example. Consider $f(x,y) = 1$ iff $x \ge y$, where $x, y \in \{0,1\}^n$ are viewed as $n$-bit binary numbers. This is a halfspace: $f(x,y) = \mathrm{sgn}\big(\sum_{i=1}^{n} 2^{n-i}(x_i - y_i)\big)$.

Any halfspace representation of $f$ requires weight $2^{\Omega(n)}$… but it’s easy to $\epsilon$-approximate $f$ with weight $\mathrm{poly}(1/\epsilon)$, by comparing only the top $\log(1/\epsilon)$ bits.
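A sketch of this example (my own, under the convention that x and y are 0/1 bit vectors with the most significant bit first):

```python
def compare_exact(xbits, ybits):
    """+1 iff x >= y: the halfspace sgn(sum_i 2^(n-1-i) * (x_i - y_i)),
    whose weights are exponentially large in n."""
    n = len(xbits)
    s = sum(2 ** (n - 1 - i) * (xbits[i] - ybits[i]) for i in range(n))
    return 1 if s >= 0 else -1

def compare_truncated(xbits, ybits, k):
    """Low-weight approximator: compare only the k most significant bits
    (weights < 2^k).  It can disagree with compare_exact only when those k
    bits tie, which happens with probability 2^(-k) over uniform inputs,
    so k ~ log2(1/eps) gives an eps-approximator of weight O(1/eps)."""
    return compare_exact(xbits[:k], ybits[:k])
```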

Approximating all halfspaces using small weights?

So there are halfspaces that require weight $2^{\Omega(n)}$ but can be $\epsilon$-approximated with weight $\mathrm{poly}(1/\epsilon)$.

Let $f$ be an arbitrary halfspace. If $g$ is a halfspace which $\epsilon$-approximates $f$, how large do the weights of $g$ need to be?

Can every halfspace be approximated by a small-weight halfspace?

Yes:

Every halfspace has a low-weight approximator.

Theorem [S06]: Let $f$ be any halfspace. For any $\epsilon > 0$ there is an $\epsilon$-approximator for $f$ which is a halfspace with integer weights, each of magnitude at most $\sqrt{n} \cdot 2^{\tilde{O}(1/\epsilon^2)}$.

How good is this bound?

• Can’t do better in terms of $n$: weight $\sqrt{n}$ may be needed for some halfspaces.
• The dependence on $\epsilon$ must be exponential in a power of $1/\epsilon$ [H94].

Idea behind the approximation

Let $f(x) = \mathrm{sgn}(w_1 x_1 + \cdots + w_n x_n - \theta)$; WLOG $w_1 \ge w_2 \ge \cdots \ge w_n \ge 0$.

Key idea: look at how these weights decrease.

• If the weights decrease rapidly, then $f$ is well approximated by a junta.
• If the weights decrease slowly, then $w \cdot x$ is “nice” – we can get a handle on its distribution.

A few more details

Let $\sigma_k = \sqrt{w_k^2 + w_{k+1}^2 + \cdots + w_n^2}$.

Def: The critical index of $w$ is the first index $k$ such that $w_k$ is “small relative to the remaining weights”: $w_k \le \epsilon \cdot \sigma_k$.

[Figure: the decreasing weight sequence $w_1 \ge w_2 \ge \cdots \ge w_n$ with the critical index marked]
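In code, the critical index is a single scan (my own sketch; the exact threshold constant varies across write-ups):

```python
import math

def critical_index(w, eps):
    """First index k (0-based) with w[k] <= eps * sqrt(w[k]^2 + ... + w[n-1]^2),
    assuming w[0] >= w[1] >= ... >= 0.  Returns len(w) if no such index."""
    tail_sq = sum(wi * wi for wi in w)   # sum of squares of w[k..n-1]
    for k, wk in enumerate(w):
        if wk <= eps * math.sqrt(tail_sq):
            return k
        tail_sq -= wk * wk
    return len(w)
```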

Sketch of approximation: case 1

First case: the critical index is large (bigger than $k = \tilde{O}(1/\epsilon^2)$).
(Recall: the critical index is the first index $j$ such that $w_j \le \epsilon \cdot \sigma_j$.)

• The first $k$ weights all decrease rapidly: $\sigma_j$ shrinks by a factor of $\sqrt{1 - \epsilon^2}$ at each step.
• So the remaining weight after index $k$ is very small.
• Can show $f$ is $\epsilon$-close to $h(x) = \mathrm{sgn}(w_1 x_1 + \cdots + w_k x_k - \theta)$, so we can approximate $f$ just by truncating (see the sketch after this list).
• $h$ has only $k$ relevant variables, so it can be expressed with integer weights each at most $2^{O(k \log k)} = 2^{\tilde{O}(1/\epsilon^2)}$.
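The case-1 approximator is just a truncation (a sketch, reusing the hypothetical make_halfspace helper from the earlier sketch):

```python
def truncate_head(w, theta, k):
    """Case-1 junta approximator: keep only the k head weights.  When the
    critical index is large, the discarded tail has tiny l2 norm, so the
    sign of w . x - theta rarely changes."""
    return make_halfspace(w[:k], theta)
```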

Why does truncating work?

Write $H$ for the “head” $w_1 x_1 + \cdots + w_k x_k$ and $T$ for the “tail” $w_{k+1} x_{k+1} + \cdots + w_n x_n$.

Have $f(x) \ne h(x)$ only if either

• $|T|$ is large – each of these tail weights is small, so this is unlikely by the Hoeffding bound; or
• $H - \theta$ lands close to $0$ – unlikely by a more complicated argument (split the head weights into blocks; a symmetry argument on each block bounds the probability by $1/2$; use independence across blocks).

Sketch of approximation: case 2

Second case: the critical index is small (at most $\tilde{O}(1/\epsilon^2)$).
(Recall: the critical index is the first index $k$ such that $w_k \le \epsilon \cdot \sigma_k$.)

• Past the critical index, the “weights are smooth”: no single weight dominates.
• Intuition: $w \cdot x$ behaves like a Gaussian.
• Can show it’s OK to round the weights to small integers (each at most $\sqrt{n} \cdot \mathrm{poly}(1/\epsilon)$; see the sketch after this list).
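A sketch of the rounding step (my own illustration; the talk’s actual granularity comes out of the analysis):

```python
import math

def round_weights(w, grid):
    """Rescale w to unit l2 norm, then round each weight to the nearest
    multiple of `grid`; dividing out `grid` leaves integer weights of
    magnitude at most about 1/grid.  Anticoncentration of w . x (it behaves
    like a Gaussian in the smooth-weights case) keeps sgn(w . x - theta)
    stable under this small perturbation for most x."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [round(wi / (norm * grid)) for wi in w]
```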

Why does rounding work?

Let $w'$ be the rounded weight vector, so each $|w_i - w'_i|$ is small.

Have $\mathrm{sgn}(w' \cdot x - \theta) \ne \mathrm{sgn}(w \cdot x - \theta)$ only if either

• $|(w - w') \cdot x|$ is large – each coordinate of $w - w'$ is small, so this is unlikely by the Hoeffding bound; or
• $w \cdot x - \theta$ lands close to $0$ – unlikely since $w \cdot x$ behaves like a Gaussian, and the Gaussian is “anticoncentrated” (it puts little mass in any short interval).

Sketch of approximation: case 2, continued

• Also need to deal with the weights before the critical index – but there are at most $\tilde{O}(1/\epsilon^2)$ of them, so they cost a factor of at most $2^{\tilde{O}(1/\epsilon^2)}$ in the weight bound.

END OF SKETCH

Extensions

We saw:

Theorem [S06]: Let $f$ be any halfspace. For any $\epsilon > 0$ there is an $\epsilon$-approximator for $f$ with integer weights, each at most $\sqrt{n} \cdot 2^{\tilde{O}(1/\epsilon^2)}$.

Recent improvement [DS09]: replace the $2^{\tilde{O}(1/\epsilon^2)}$ factor with $2^{\tilde{O}(1/\epsilon^{2/3})}$.

For $x \in \{-1,1\}^n$, write $x^{\oplus i}$ for $x$ with the $i$-th bit flipped. The influence of variable $i$ is $\mathrm{Inf}_i(f) = \Pr_x[f(x) \ne f(x^{\oplus i})]$, and the total influence is $I(f) = \sum_i \mathrm{Inf}_i(f)$.

Standard fact: Every halfspace has $I(f) \le \sqrt{n}$ (but it can be much less).

Friedgut’s theorem: Every Boolean $f$ is $\epsilon$-close to a function on $2^{O(I(f)/\epsilon)}$ variables.

We show: Every halfspace is $\epsilon$-close to a function on $\mathrm{poly}(I(f), 1/\epsilon)$ variables – an (exponential) sharpening of Friedgut’s theorem for halfspaces. The proof uses structural properties of halfspaces from testing & learning.

The proof combines:

• Littlewood-Offord type theorems on “anticoncentration” of $w \cdot x$
• delicate linear programming arguments

It also gives a new proof of the original $\sqrt{n} \cdot 2^{\tilde{O}(1/\epsilon^2)}$ bound that does not use the “critical index”.

So halfspaces have low-weight approximators. What about testing?

Use approximation viewpoint: two possibilities depending on critical index.

First case: critical index large

• $f$ is close to a junta halfspace over $\mathrm{poly}(1/\epsilon)$ variables

• Implicitly identify the junta variables (high influence)

• Do Occam-type “implicit learning” similar to [DLMORSW07]

(building on [FKRSS02]): check every possible halfspace over the junta variables

– If $f$ is a halfspace, it’ll be close to some function you check.

– If $f$ is far from every halfspace, it’ll be close to no function you check.

So halfspaces have low-weight approximators. What about testing?

Second case: critical index small

• every restriction of the high-influence variables makes $f$ “regular”

– all weights & influences are small

• Low-influence halfspaces have nice Fourier properties

• Can use Fourier analysis to check that each restriction

is close to a low-influence halfspace

• Also need to check:

– cross-consistency of different restrictions (are they close to low-influence halfspaces with the same weights?)

– global consistency with a single set of high-influence weights (for most restrictions)

A taste of Fourier

A helpful Fourier result about low-influence halfspaces:

“Theorem” [MORS07]: Let $f$ be any Boolean function such that:

• all the degree-1 Fourier coefficients of $f$ are small
• the degree-0 Fourier coefficient “synchs up” with the degree-1 coefficients.

Then $f$ is close to a halfspace – in fact, close to the halfspace $\mathrm{sgn}(\hat{f}(1)x_1 + \cdots + \hat{f}(n)x_n)$.

• Useful for soundness portion of test

Testing halfspaces

When all the dust settles:

Theorem: [MORS07]

The class of halfspaces over $\{-1,1\}^n$ is testable with $\mathrm{poly}(1/\epsilon)$ queries.

approximation

testing

What about learning?

Learning halfspaces from random labeled examples is easy using poly-time linear programming.
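For concreteness, here is a minimal sketch of the LP approach (my own illustration; it assumes numpy/scipy and a sample that is strictly separable by some halfspace):

```python
import numpy as np
from scipy.optimize import linprog

def learn_halfspace_lp(X, y):
    """Find (w, theta) with y_i * (w . x_i - theta) >= 1 for every example.
    X: m x n array of {-1,1} examples; y: length-m array of {-1,1} labels.
    Encoded for linprog as  -y_i * (w . x_i) + y_i * theta <= -1."""
    m, n = X.shape
    A_ub = np.hstack([-(y[:, None] * X), y[:, None]])  # columns: w_1..w_n, theta
    b_ub = -np.ones(m)
    c = np.zeros(n + 1)            # pure feasibility: any feasible point is fine
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1), method="highs")
    return (res.x[:n], res.x[n]) if res.success else None
```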

[Figure: a linearly separable set of + and - examples, and a second data set with noisy labels that no halfspace fits (“?!”)]

There are other, harder learning models…

1. The RFA model
2. Agnostic learning under the uniform distribution

The RFA learning model

• Introduced by [BDD92]: “restricted focus of attention”

• For each labeled example, the learner gets to choose one bit of the example that he can see (plus the label, of course).

• Examples are drawn from the uniform distribution over $\{-1,1\}^n$.

• The goal is to construct an $\epsilon$-accurate hypothesis.

Question: [BDD92, ADJKS98, G01]

Are halfspaces learnable in RFA model?

The RFA learning model in action

Learner: May I have a random example, please?

Oracle: Sure, which bit would you like to see?

Learner: Oh, man… uh, $x_7$.

Oracle: Here’s your example: …

Learner: Thanks, I guess.

Oracle: Watch your manners.
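The exchange above, as a sketch (my own function name):

```python
import random

def one_rfa_example(f, n, i):
    """One 1-RFA draw: x is uniform over {-1,1}^n, but the learner sees only
    the i-th bit of x, plus the label f(x)."""
    x = [random.choice((-1, 1)) for _ in range(n)]
    return x[i], f(x)
```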

Very brief Fourier interlude

Every $f: \{-1,1\}^n \to \mathbb{R}$ has a unique Fourier representation $f(x) = \sum_{S \subseteq [n]} \hat{f}(S) \prod_{i \in S} x_i$.

The degree-0 and degree-1 coefficients $\hat{f}(\emptyset), \hat{f}(\{1\}), \dots, \hat{f}(\{n\})$ are sometimes called the Chow parameters of $f$.

Another view of the RFA learning model

Recall that the degree-0 and degree-1 Fourier coefficients $\hat{f}(\emptyset), \hat{f}(\{1\}), \dots, \hat{f}(\{n\})$ are the Chow parameters of $f$.

Not hard to see: in the RFA model, all the learner can do is estimate the Chow parameters.

• With $O(\log(1/\delta)/\tau^2)$ examples, the learner can estimate any given Chow parameter to additive accuracy $\pm\tau$.
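A sketch of that estimate (my own illustration): since $\hat{f}(i) = \mathrm{E}[f(x)\,x_i]$, each sample needs only one bit and the label, so it runs verbatim in the 1-RFA model.

```python
import random

def est_chow(f, n, i, samples=100_000):
    """Estimate the Chow parameter hat-f(i) = E[f(x) * x_i] by sampling.
    By Hoeffding, O(log(1/delta)/tau^2) samples give additive accuracy
    +-tau with probability 1 - delta."""
    total = 0
    for _ in range(samples):
        x = [random.choice((-1, 1)) for _ in range(n)]
        total += f(x) * x[i]
    return total / samples
```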

(Approximately) reconstructing halfspaces from their (approximate) Chow parameters

Perfect information about the Chow parameters suffices for halfspaces:

Theorem [C61]: If $f$ is a halfspace and $g: \{-1,1\}^n \to \{-1,1\}$ has $\hat{g}(S) = \hat{f}(S)$ for all $|S| \le 1$, then $g = f$.

To solve 1-RFA learning problem, need a version of Chow’s theorem which is both robust and effective

• robust: we only get approximate Chow parameters (and can only hope for an approximation to $f$)

• effective: want an actual poly(n) time algorithm!

Previous results

[ADJKS98] proved:

Theorem: Let $f$ be a weight-$W$ halfspace. Let $g$ be any Boolean function satisfying $|\hat{g}(S) - \hat{f}(S)| \le \mathrm{poly}(\epsilon/W)$ for all $|S| \le 1$. Then $g$ is an $\epsilon$-approximator for $f$.

• Good for low-weight halfspaces, but $W$ could be $n^{\Theta(n)}$.

[Goldberg01] proved:

Theorem: Let $f$ be any halfspace. Let $g$ be any function satisfying $|\hat{g}(S) - \hat{f}(S)| \le (\epsilon/n)^{\mathrm{polylog}(n/\epsilon)}$ for all $|S| \le 1$. Then $g$ is an $\epsilon$-approximator for $f$.

• Better bound for high-weight halfspaces, but the required accuracy is superpolynomially small in $n$.

Neither of these results is algorithmic.

Robust, effective version of Chow’s theorem

Theorem [OS08]: For any constant $\epsilon$ and any halfspace $f$: given accurate enough approximations of the Chow parameters of $f$, the algorithm runs in $\tilde{O}(n^2)$ time and w.h.p. outputs a halfspace that is $\epsilon$-close to $f$.

This is the fastest runtime dependence on $n$ of any algorithm for learning halfspaces, even in the usual random-examples model:

– Previous best runtime: $\mathrm{poly}(n)$ time, with a larger exponent, for learning to constant accuracy.
– Any algorithm needs $\Omega(n)$ examples, i.e. $\Omega(n^2)$ bits of input.

Corollary [OS08]: Halfspaces are learnable to any constant accuracy in $\mathrm{poly}(n)$ time in the RFA model.

A tool from testing halfspaces

Recall the helpful Fourier result about low-influence halfspaces:

“Theorem”: Let $f$ be any function such that:

• all the degree-1 Fourier coefficients of $f$ are small
• the degree-0 Fourier coefficient “synchs up” with the degree-1 coefficients.

Then $f$ is close to the halfspace $\mathrm{sgn}(\hat{f}(1)x_1 + \cdots + \hat{f}(n)x_n)$.

If $f$ itself is a low-influence halfspace, this means we can plug in the degree-1 Fourier coefficients as weights and get a good approximator – and in the RFA setting we know (approximations to) exactly these coefficients, in polynomial time!

We also need to deal with the high-influence case… a hassle, but doable.
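In code, the low-influence case of the reconstruction is strikingly simple (a sketch only, not the full [OS08] algorithm, which must also handle the high-influence variables):

```python
def chow_hypothesis(chow1, theta=0.0):
    """Use the (estimated) degree-1 Chow parameters chow1 = [hat-f(1), ...,
    hat-f(n)] themselves as the weights of the hypothesis halfspace; theta
    would be tuned so that the hypothesis mean matches hat-f(0)."""
    def h(x):
        return 1 if sum(ci * xi for ci, xi in zip(chow1, x)) >= theta else -1
    return h
```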

Recap of whole talk

[Figure: + and - examples separated by a halfspace]

approximation

testing

learning

1. Every halfspace can be approximated to any constant accuracy with small integer weights.

2. Halfspaces can be tested with $\mathrm{poly}(1/\epsilon)$ queries.

3. Halfspaces can be efficiently learned from (approximations of) their degree-0 and degree-1 Fourier coefficients.

Halfspaces over $\{-1,1\}^n$

Future directions

Better quantitative results (dependence on $\epsilon$?) for:

– Testing
– Approximating
– Learning (from Chow parameters)

What about {approximating, testing, learning} w.r.t. other distributions?

– Rich theory of distribution-independent PAC learning
– Less fully developed theory of distribution-independent testing [HK03, HK04, HK05, AC06]
– Things are harder; what is doable?
– [GS07] Any distribution-independent algorithm for testing whether $f$ is a halfspace requires $n^{\Omega(1)}$ queries.

Thank you for your attention

II. Learning a concept class

Setup: The learner is given a sample of labeled examples $(x, f(x))$.

• The target function $f \in C$ is unknown to the learner.
• Each example $x$ in the sample is independent and uniform over $\{-1,1\}^n$.

Goal: For every $f \in C$, with probability $1 - \delta$ the learner should output a hypothesis $h$ such that $\Pr_x[h(x) \ne f(x)] \le \epsilon$.

(“PAC learning the concept class $C$ under the uniform distribution.”)