Two Broad Classes of Functions for Which a No Free Lunch Result Does Not Hold
Matthew J. Streeter
Genetic Programming, Inc., Mountain View, California
[email protected]
GECCO 2003, July 12-16, Chicago




Overview

• Definitions & prior work

• Bayesian formulation of No Free Lunch & its implications

• No Free Lunch and description length

• No Free Lunch and infinite sets

Definitions (search algorithm framework)

X = set of points (individuals) in the search space
Y = set of cost (fitness) values
xi = an individual
yi = the fitness of xi

• Search algorithm takes as input a list {(x0,y0), (x1,y1), …, (xn,yn)} of previously visited points and their cost values and produces next point xn+1 as output

• We consider only deterministic, non-retracing search algorithms

• Conclusions can be extended to stochastic, potentially retracing algorithms using the arguments of Wolpert and Macready (1997)
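This interface can be sketched concretely. The code below is my own illustration (the function names and the example search policy are not from the talk), assuming X is a list of integers and lower cost is better:

```python
# Sketch of the search-algorithm interface: a deterministic,
# non-retracing algorithm maps the history of visited points to
# the next unvisited point. All names here are illustrative.

def greedy_neighbor_search(history, X):
    """Given the list of previously visited (x, y) pairs, return
    the next point; never revisits a point (non-retracing)."""
    visited = {x for x, _ in history}
    unvisited = [x for x in X if x not in visited]
    if not history:
        return unvisited[0]                        # deterministic first choice
    best_x, _ = min(history, key=lambda p: p[1])   # lowest cost seen so far
    return min(unvisited, key=lambda x: abs(x - best_x))

def run(algorithm, f, X, n):
    """Run the algorithm against cost function f for n steps and
    return its performance vector V(A, f)."""
    history = []
    for _ in range(n):
        x = algorithm(history, X)
        history.append((x, f(x)))
    return [y for _, y in history]
```

For example, `run(greedy_neighbor_search, lambda x: (x - 3) ** 2, list(range(8)), 3)` returns the performance vector `[9, 4, 1]`.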

Definitions (search algorithm framework)

A = a search algorithm
F = a set of functions to be optimized
P = a probability distribution over F
f = a particular function in F

• Performance vector V (A,f ): the vector of cost (fitness) values that A receives when run against f

• Performance measure M (V (A,f )): a function that assigns a score to a performance vector

• Overall performance: MO(A) = Σf∈F P (f )·M (V (A,f ))

Definitions (search algorithm framework)

• A No Free Lunch result applies to (F,P ) iff, for any performance measure M and any search algorithms A and B,

MO(A) = MO(B)
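For small finite X and Y this definition can be checked exhaustively. The sketch below is my own (not from the talk): it enumerates every f : X → Y with |X| = 3 and |Y| = 2 and shows that two different deterministic, non-retracing algorithms produce identical multisets of performance vectors, so under uniform P we get MO(A) = MO(B) for every choice of M:

```python
from itertools import product

X, Y = (0, 1, 2), (0, 1)

def alg_left(history):
    """Visits points in ascending order, ignoring cost values."""
    visited = {x for x, _ in history}
    return min(x for x in X if x not in visited)

def alg_adaptive(history):
    """Starts in the middle and branches on the first observed cost."""
    visited = {x for x, _ in history}
    unvisited = [x for x in X if x not in visited]
    if not history:
        return 1
    return min(unvisited) if history[0][1] == 0 else max(unvisited)

def perf_vector(alg, f):
    history = []
    for _ in range(len(X)):
        x = alg(history)
        history.append((x, f[x]))
    return tuple(y for _, y in history)

all_fs = list(product(Y, repeat=len(X)))       # each f as a tuple of y-values
vecs_a = sorted(perf_vector(alg_left, f) for f in all_fs)
vecs_b = sorted(perf_vector(alg_adaptive, f) for f in all_fs)
print(vecs_a == vecs_b)   # True: same multiset, hence MO(A) = MO(B) for any M
```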

Definitions (permutations of functions)

• A permutation f′ of a function f is a rearrangement of f ’s y values with respect to its x values

• A set of functions F is closed under permutation iff, for any f ∈ F and any permutation f′ of f, f′ ∈ F
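Closure under permutation can be tested by brute force for small sets. A sketch (my own, with each function represented as the tuple of its y-values over a fixed ordering of X):

```python
from itertools import permutations

def is_closed_under_permutation(F):
    """F: a set of functions, each a tuple (f(x0), f(x1), ...).
    Closed iff every rearrangement of every f's y-values is in F."""
    return all(p in F for f in F for p in permutations(f))

# Needle-in-a-haystack functions on a 3-point space: closed
needles = {(1, 0, 0), (0, 1, 0), (0, 0, 1)}
print(is_closed_under_permutation(needles))        # True

# A single non-constant function is not closed under permutation
print(is_closed_under_permutation({(0, 1, 2)}))    # False
```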

Some No Free Lunch results

• NFL holds for (F,P ) where P is uniform and F is the set of all functions f : X → Y with X and Y finite (Wolpert and Macready 1997)

• NFL holds for (F,P ) where P is uniform iff F is closed under permutation (Schumacher, Vose and Whitley 2001)

Known limitations of NFL

• Does not hold for some classes of extremely simple functions (Droste, Jansen, and Wegener 1999)

• Under certain restrictions on “successor graph”, SUBMEDIAN-SEEKER outperforms random search (Christensen and Oppacher 2001)

• Does not hold when F has less than the maximum number of local optima or less than the maximum steepness (Igel and Toussaint 2001)

Bayesian formulation of NFL

• Let f be a function drawn at random from F under probability distribution PF

• Consider running a search algorithm for n steps to obtain the set of (point,cost) pairs:

S = {(x0,y0), (x1,y1), …, (xn,yn)}

• Let P (f (xi )=y | S ) denote the conditional probability that f (xi )=y, given our knowledge of S

Bayesian formulation of NFL

• No Free Lunch result holds for (F,PF ) if and only if:

P (f (xi )=y | S ) = P (f (xj )=y | S )

for all xi, xj, S, and y, where

xi, xj have not yet been visited
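This condition can be computed directly for a small uniform case. In this sketch of mine, we enumerate all f : X → Y, condition on an observed sample S, and compare the posteriors at two unvisited points:

```python
from itertools import product
from fractions import Fraction

X, Y = (0, 1, 2, 3), (0, 1)
all_fs = list(product(Y, repeat=len(X)))     # each f as a tuple of y-values

def posterior(xi, y, S):
    """P(f(xi) = y | S) under a uniform distribution over all functions."""
    consistent = [f for f in all_fs if all(f[x] == yx for x, yx in S)]
    return Fraction(sum(1 for f in consistent if f[xi] == y),
                    len(consistent))

S = [(0, 1), (2, 0)]        # visited x=0 (cost 1) and x=2 (cost 0)
# Posteriors at the unvisited points x=1 and x=3 agree, as the
# Bayesian formulation requires when NFL holds:
print(posterior(1, 1, S), posterior(3, 1, S))   # 1/2 1/2
```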

Proof (If)

• By induction:

– Base case: If S={}, P (f (xi )=y) = P (f (xj )=y), so the search algorithm has no influence on the first cost (fitness) value obtained

– Inductive step: If S = {(x0,y0), (x1,y1), …, (xk,yk)} and the search algorithm cannot influence the cost values in S, then P (f (xi )=y | S ) = P (f (xj )=y | S ), so the search algorithm cannot influence yk+1 either

• Because the list of cost values is independent of the search algorithm, performance is independent of the search algorithm

Proof (Only if)

• Assume for some xi, xj, S, y,

P (f (xi )=y | S ) ≠ P (f (xj)=y | S )

where S = {(x0,y0), (x1,y1), …, (xn,yn)}

• Consider performance measure M that only rewards performance vectors beginning with the prefix y0, y1, …, yn, y

• We can construct search algorithms A and B that behave identically, except that A samples xi after observing S whereas B samples xj

• It follows that MO(A) > MO(B ), so NFL does not hold

Analysis

• P (f (xi)=y | S ) = P (f(xj )=y | S ) implies that

– E[|f (xi )−f (xj )|] = E[|f (xk )−f (xl )|]: no correlation between genotypic similarity and similarity of fitness

– P (ymin ≤ f (xi ) ≤ ymax | S ) = P (ymin ≤ f (xj ) ≤ ymax | S ): we cannot use S to estimate the probability of unseen points lying in some fitness interval [ymin, ymax]

Contrast with Holland’s assumptions

• Holland assumed that by sampling a schema, you can extrapolate about the fitness of unseen instances of that schema; this implies P (ymin ≤ f (xi ) ≤ ymax | S ) ≠ P (ymin ≤ f (xj ) ≤ ymax | S )

• The objection that NFL’s assumptions violate Holland’s has been made before (Droste, Jansen, and Wegener 1999), but the result above shows that NFL holds only when this assumption is violated

No Free Lunch and Description Length

Definitions

• Description length of a function is the length of the shortest program that implements that function (a.k.a. Kolmogorov complexity)

• Description length depends on which Turing machine you use, but the differences between any two TMs are bounded by a constant (Compiler Theorem)

Previous work on NFL and problem description length

• Schumacher 2001: NFL can hold for sets of functions that are highly compressible (e.g., set of needle-in-the-haystack functions)

• Droste, Jansen, and Wegener 1999: NFL does not hold for certain sets with highly restricted description length (“Perhaps Not a Free Lunch, but at Least a Free Appetizer”)

My results concerning description length

• NFL does not hold in general for sets with bounded description length

• For sets defined by bound on description length, any reasonable bound rules out NFL

Proof outline

• Fk = the set of all functions f : X → Y with description length k or less

• If k is sufficiently large, Fk will contain a hashing function like h (x ) = x mod |Y |

• If the number of permutations of h (x ) is more than 2^(k+1) − 1, Fk cannot be closed under permutation, therefore:

– the NFL result will not hold for (Fk ,P ) where P is uniform (Schumacher, Vose, and Whitley 2001)

Illustration

• Let X consist of all n-bit chromosomes, Y consist of all m-bit fitness values, and Fk be defined as before

• It turns out the number of permutations of h (x ) = x mod |Y | is:

|X|! / [ (⌈|X|/|Y|⌉!)^(|X| mod |Y|) · (⌊|X|/|Y|⌋!)^(|Y| − |X| mod |Y|) ]
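This is a standard multinomial count: under h, each of the |X| mod |Y| "large" fitness values occurs ⌈|X|/|Y|⌉ times and the remaining values occur ⌊|X|/|Y|⌋ times. A sketch checking the closed form against brute-force enumeration for small spaces (my own verification, not from the talk):

```python
from itertools import permutations
from math import factorial

def perm_count_formula(nX, nY):
    """Distinct rearrangements of h(x) = x mod nY on nX points."""
    r = nX % nY                   # fitness values occurring ceil(nX/nY) times
    hi, lo = -(-nX // nY), nX // nY          # ceil and floor of nX/nY
    return factorial(nX) // (factorial(hi) ** r * factorial(lo) ** (nY - r))

def perm_count_brute(nX, nY):
    h = tuple(x % nY for x in range(nX))
    return len(set(permutations(h)))

for nX, nY in [(4, 2), (5, 2), (6, 3), (7, 3)]:
    assert perm_count_formula(nX, nY) == perm_count_brute(nX, nY)
print(perm_count_formula(5, 2))   # 10 = 5! / (3! * 2!)
```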

Illustration

Chromosome length n | Fitness value length m | NFL does not hold when
16    | 1  | kmod ≤ k ≤ 6.55×10^4
32*8  | 8  | kmod ≤ k ≤ 9.26×10^77
500*8 | 32 | kmod ≤ k ≤ 4.22×10^1205

where kmod = description length of h (x ) = x mod |Y |
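The table's upper bounds can be reproduced numerically (my own sketch): there are at most 2^(k+1) − 1 programs of length k or less, so NFL fails once log2 of the permutation count of h exceeds roughly k, i.e. k can be as large as about log2(#permutations):

```python
from math import lgamma, log

def log2_perm_count(nX, nY):
    """log2 of the number of distinct rearrangements of h(x) = x mod nY,
    computed via log-gamma to avoid astronomically large factorials."""
    r, hi, lo = nX % nY, -(-nX // nY), nX // nY
    lf = lambda n: lgamma(n + 1) / log(2)        # log2(n!)
    return lf(nX) - r * lf(hi) - (nY - r) * lf(lo)

# Table row n = 16, m = 1: |X| = 2**16, |Y| = 2**1
print(log2_perm_count(2**16, 2**1))   # ~6.55e4, matching the table's bound
```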

Additional proofs

• NFL does not hold if P assigns probability that is strictly decreasing w.r.t. description length

• NFL does not hold if P is a Solomonoff-Levin distribution

• Same proof technique (counting argument, permutation closure)

No Free Lunch and infinite sets

• Original NFL proof depends on P assigning equal probability to every f in F; can’t do this for infinite F

• Bayesian formulation still works for infinite F

• NFL only holds when P (f ) can be expressed solely as a function of f ’s domain and range

• Unlike the case of finite F, there is no aesthetically pleasing argument for assuming P is such that NFL holds

Limitations of this work

• When we show that an NFL result does not hold for a certain (F,P ), all this means is that for some performance measure, some algorithms perform better than others

• The fact that NFL does not hold for (F,P ) does not necessarily mean (F,P ) is “searchable”

• For stronger results under more restrictive assumptions, see Christensen and Oppacher 2001

Conclusions

• Many statements have been made with NFL as the justification:

– All search algorithms are equally robust

– All search algorithms are equally specialized

– Performance of algorithms on benchmark problems is not predictive of performance in general

Conclusions

• However, these statements are only true if you assume that:

– the (F,P ) of interest to the GA community is such that parent and offspring fitness are uncorrelated, and

– there is no correlation between genotypic similarity and similarity of fitness

Conclusions

• Moreover, there are theoretically motivated choices of (F,P ) for which NFL does not hold:

– P (f ) as a decreasing function of f ’s description length

– P (f ) as a Solomonoff-Levin distribution

Conclusions

• If anything, proponents of NFL are on shakier ground when F is infinite