Two Broad Classes of Functions for Which a No Free Lunch Result Does Not
Hold
Matthew J. Streeter
Genetic Programming, Inc.
Mountain View, California
GECCO 2003, July 12-16 Chicago
Overview
• Definitions & prior work
• Bayesian formulation of No Free Lunch & its implications
• No Free Lunch and description length
• No Free Lunch and infinite sets
Definitions (search algorithm framework)
X = set of points (individuals) in search space
Y = set of cost (fitness) values
xi = an individual
yi = the fitness of xi
• Search algorithm takes as input a list {(x0,y0), (x1,y1), …, (xn,yn)} of previously visited points and their cost values and produces next point xn+1 as output
• We consider only deterministic, non-retracing search algorithms, but
• Conclusions can be extended to stochastic, potentially retracing algorithms using arguments of (Wolpert and Macready 1997)
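The framework above can be sketched in a few lines of Python; the names (`enumeration_search`, `performance_vector`) are illustrative, not from the paper:

```python
# Minimal sketch of the search-algorithm framework (illustrative names).
# A deterministic, non-retracing algorithm maps the list of previously
# visited (point, cost) pairs to a new, not-yet-visited point.

def enumeration_search(history, X):
    """Visit the points of X in a fixed order, skipping visited ones."""
    visited = {x for (x, y) in history}
    return next(x for x in X if x not in visited)

def performance_vector(algorithm, f, X, n_steps):
    """Run `algorithm` against cost function `f` for n_steps and
    return the vector of observed cost (fitness) values."""
    history = []
    for _ in range(n_steps):
        x = algorithm(history, X)
        history.append((x, f(x)))
    return [y for (_, y) in history]

X = [0, 1, 2, 3]
f = lambda x: x % 2          # a toy cost function
print(performance_vector(enumeration_search, f, X, 4))  # [0, 1, 0, 1]
```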
Definitions (search algorithm framework)
A = a search algorithm
F = a set of functions to be optimized
P = a probability distribution over F
f = a particular function in F
• Performance vector V (A,f ): the vector of cost (fitness) values that A receives when run against f
• Performance measure M (V (A,f )): a function that assigns a score to a performance vector
• Overall performance: MO(A) = Σf∈F P (f ) M (V (A,f ))
Definitions (search algorithm framework)
• A No Free Lunch result applies to (F,P ) iff, for any performance measure M and any search algorithms A and B,
MO(A) = MO(B)
Definitions (permutations of functions)
• A permutation f ′ of a function f is a rearrangement of f ’s y values with respect to its x values
• A set of functions F is closed under permutation iff, for any f ∈ F and any permutation f ′ of f, f ′ ∈ F
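On a small finite X, permutation closure can be tested directly by representing each function as the tuple of its y values; `closed_under_permutation` is a hypothetical helper, not from the paper:

```python
from itertools import permutations

# A function f : X -> Y over finite X can be represented as the tuple
# (f(x0), f(x1), ..., f(x_{|X|-1})); a permutation of f is then a
# rearrangement of this tuple.

def closed_under_permutation(F):
    """F is a set of functions, each represented as a tuple of y values."""
    return all(p in F for f in F for p in set(permutations(f)))

needle = {(1, 0, 0), (0, 1, 0), (0, 0, 1)}   # needle-in-a-haystack set
print(closed_under_permutation(needle))       # True

lopsided = {(1, 0, 0), (0, 1, 0)}             # missing one rearrangement
print(closed_under_permutation(lopsided))     # False
```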
Some No Free Lunch results
• NFL holds for (F,P ) where P is uniform and F is the set of all functions f : X → Y, where X and Y are finite (Wolpert and Macready 1997)
• NFL holds for (F,P ) where P is uniform iff F is closed under permutation (Schumacher, Vose and Whitley 2001)
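The Wolpert–Macready case can be checked exhaustively on a tiny search space. The sketch below (illustrative, not from the paper) compares two deterministic, non-retracing algorithms over all functions f : X → Y with |X| = 3, |Y| = 2:

```python
from itertools import product

# Exhaustive check of NFL for uniform P over ALL functions f : X -> Y,
# with X = {0,1,2} and Y = {0,1}; each f is a tuple of its y values.
X, Y = [0, 1, 2], [0, 1]

def run(order_rule, f, n_steps=3):
    """Performance vector of a deterministic non-retracing algorithm
    defined by `order_rule`, which picks the next unvisited point."""
    history = []
    for _ in range(n_steps):
        x = order_rule(history)
        history.append((x, f[x]))
    return tuple(y for _, y in history)

ascending  = lambda h: min(set(X) - {x for x, _ in h})
descending = lambda h: max(set(X) - {x for x, _ in h})

vecs_a = sorted(run(ascending, f) for f in product(Y, repeat=len(X)))
vecs_b = sorted(run(descending, f) for f in product(Y, repeat=len(X)))
assert vecs_a == vecs_b    # identical multisets of performance vectors
print("NFL holds on this set")
```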
Known limitations of NFL
• Does not hold for some classes of extremely simple functions (Droste, Jansen, and Wegener 1999)
• Under certain restrictions on “successor graph”, SUBMEDIAN-SEEKER outperforms random search (Christensen and Oppacher 2001)
• Does not hold when F has less than the maximum number of local optima or less than the maximum steepness (Igel and Toussaint 2001)
Bayesian formulation of NFL
• Let f be a function drawn at random from F under probability distribution PF
• Consider running a search algorithm for n steps to obtain the set of (point,cost) pairs:
S = {(x0,y0), (x1,y1), …, (xn,yn)}
• Let P (f (xi )=y | S ) denote the conditional probability that f (xi )=y, given our knowledge of S
Bayesian formulation of NFL
• No Free Lunch result holds for (F,PF ) if and only if:
P (f (xi )=y | S ) = P (f (xj )=y | S )
for all xi, xj, S, and y, where
xi, xj have not yet been visited
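For the uniform distribution over all functions, this condition can be verified directly on a small example; the helper `p_cond` below is illustrative, not from the paper:

```python
from itertools import product

# Empirical check of the Bayesian NFL condition for uniform P over the
# set of all functions f : X -> Y, with X = {0,1,2,3} and Y = {0,1}.
X, Y = range(4), range(2)
all_f = list(product(Y, repeat=len(X)))   # each f as a tuple of y values

S = [(0, 1)]                              # visited so far: f(0) = 1
consistent = [f for f in all_f if all(f[x] == y for (x, y) in S)]

def p_cond(x, y):
    """P(f(x) = y | S) under the uniform prior."""
    return sum(1 for f in consistent if f[x] == y) / len(consistent)

# For any two unvisited points, the conditional probabilities agree:
print(p_cond(1, 0), p_cond(2, 0), p_cond(3, 0))   # 0.5 0.5 0.5
```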
Proof (If)
• By induction:
– Base case: If S={}, P (f (xi )=y) = P (f (xj )=y), so the search algorithm has no influence on the first cost (fitness) value obtained
– Inductive step: If S = {(x0,y0), (x1,y1), …, (xk,yk)} and the search algorithm cannot influence the first k+1 cost values, then P (f (xi )=y | S ) = P (f (xj )=y | S ), so the search algorithm cannot influence yk+1
• Because list of cost values is independent of search algorithm, performance is independent of search algorithm
Proof (Only if)
• Assume for some xi, xj, S, y,
P (f (xi )=y | S ) ≠ P (f (xj)=y | S )
where S = {(x0,y0), (x1,y1), …, (xn,yn)}
• Consider performance measure M that only rewards performance vectors beginning with the prefix y0, y1, …, yn, y
• We can construct search algorithms A and B that behave identically, except that A samples xi after observing S whereas B samples xj
• It follows that MO(A) > MO(B ), so NFL does not hold
Analysis
• P (f (xi )=y | S ) = P (f (xj )=y | S ) implies that
– E[|f (xi ) − f (xj )|] = E[|f (xk ) − f (xl )|]: no correlation between genotypic similarity and similarity of fitness
– P (ymin ≤ f (xi ) ≤ ymax | S ) = P (ymin ≤ f (xj ) ≤ ymax | S ): we cannot use S to estimate the probability of unseen points lying in some fitness interval [ymin, ymax]
Contrast with Holland’s assumptions
• Holland assumed that by sampling a schema, you can extrapolate about the fitness of unseen instances of that schema; this implies P (ymin ≤ f (xi ) ≤ ymax | S ) ≠ P (ymin ≤ f (xj ) ≤ ymax | S )
• The objection has been made before that NFL assumptions violate Holland’s assumption (Droste, Jansen, and Wegener 1999), but I have just shown that NFL holds only when this assumption is violated
Definitions
• Description length of a function is the length of the shortest program that implements that function (a.k.a. Kolmogorov complexity)
• Description length depends on which Turing machine you use, but the difference between any two TMs is bounded by a constant (Compiler Theorem)
Previous work on NFL and problem description length
• Schumacher 2001: NFL can hold for sets of functions that are highly compressible (e.g., set of needle-in-the-haystack functions)
• Droste, Jansen, and Wegener 1999: NFL does not hold for certain sets with highly restricted description length (“Perhaps Not a Free Lunch, but at Least a Free Appetizer”)
My results concerning description length
• NFL does not hold in general for sets with bounded description length
• For sets defined by bound on description length, any reasonable bound rules out NFL
Proof outline
• Fk = the set of all functions f : X → Y with description length k or less
• If k is sufficiently large, Fk will contain a hashing function such as h (x ) = x mod |Y |
• If the number of permutations of h (x ) is more than 2^(k+1) − 1, Fk cannot be closed under permutation, therefore:
– NFL result will not hold for (F,P ) where P is uniform (Schumacher, Vose, and Whitley 2001)
Illustration
• Let X consist of all n-bit chromosomes, Y consist of all m-bit fitness values, and Fk be defined as before
• It turns out that the number of permutations of h (x ) = x mod |Y | is:

|X |! / ( (⌈|X |/|Y |⌉!)^(|X | mod |Y |) · (⌊|X |/|Y |⌋!)^(|Y | − |X | mod |Y |) )
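The count above is the standard multiset-permutation (multinomial) formula; it can be cross-checked by brute force on a small search space. This is a sketch assuming that reading; names are illustrative:

```python
from itertools import permutations
from math import factorial, prod
from collections import Counter

# Verify the closed-form count of distinct permutations of a function
# against brute force, for h(x) = x mod |Y| on a small search space.

def n_permutations(values):
    """|X|! divided by the product of (multiplicity of each y value)!"""
    counts = Counter(values)
    return factorial(len(values)) // prod(factorial(c) for c in counts.values())

X_size, Y_size = 6, 4
h = [x % Y_size for x in range(X_size)]      # h(x) = x mod |Y|

# Closed form: 6! / (2! * 2! * 1! * 1!) = 180; brute force agrees.
assert n_permutations(h) == len(set(permutations(h)))
print(n_permutations(h))   # 180
```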
Illustration
Chromosome length n | Fitness value length m | NFL does not hold when
16                  | 1                      | kmod ≤ k ≤ 6.55*10^4
32*8                | 8                      | kmod ≤ k ≤ 9.26*10^77
500*8               | 32                     | kmod ≤ k ≤ 4.22*10^1205

where kmod = description length of h (x ) = x mod |Y |
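The first row of the table can be roughly reproduced with log-factorials; this is a sketch assuming the counting argument above (the number of programs of length at most k is at most 2^(k+1) − 1, so NFL fails while that is below the permutation count):

```python
from math import lgamma, log

# For n = 16, m = 1: log2 of the number of permutations of
# h(x) = x mod |Y|, i.e. log2( |X|! / ((|X|/|Y|)!)^|Y| ).

def log2_factorial(n):
    return lgamma(n + 1) / log(2)   # lgamma gives ln Gamma(n + 1) = ln n!

n, m = 16, 1
X_size, Y_size = 2 ** n, 2 ** m
per_value = X_size // Y_size        # |Y| divides |X| here, so no ceil/floor

log2_perms = log2_factorial(X_size) - Y_size * log2_factorial(per_value)

# NFL fails while 2^(k+1) - 1 < number of permutations,
# i.e. for k up to roughly log2_perms - 1 ~ 6.55 * 10^4.
print(round(log2_perms))
```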
Additional proofs
• NFL does not hold if P assigns probabilities that are strictly decreasing w.r.t. description length
• NFL does not hold if P is a Solomonoff-Levin distribution
• Same proof technique (counting argument, permutation closure)
No Free Lunch and infinite sets
• Original NFL proof depends on P assigning equal probability to every f in F; can’t do this for infinite F
• Bayesian formulation still works for infinite F
• NFL only holds when P (f ) can be expressed solely as a function of f ’s domain and range
• Unlike the case of finite F, there is no aesthetically pleasing argument for assuming P is such that NFL holds
Limitations of this work
• When we show that an NFL result does not hold for a certain (F,P ), all this means is that for some performance measure, some algorithms perform better than others
• The fact that NFL does not hold for (F,P ) does not necessarily mean (F,P ) is “searchable”
• For stronger results under more restrictive assumptions, see Christensen and Oppacher 2001
Conclusions
• Many statements have been made with NFL as the justification:
– All search algorithms are equally robust
– All search algorithms are equally specialized
– Performance of algorithms on benchmark problems is not predictive of performance in general
Conclusions
• However, these statements are only true if you assume that:
– the (F,P ) of interest to the GA community is such that parent and offspring fitness are uncorrelated, and
– there is no correlation between genotypic similarity and similarity of fitness
Conclusions
• Moreover, there are theoretically motivated choices of (F,P ) for which NFL does not hold:
– P (f ) a decreasing function of f ’s description length
– P (f ) a Solomonoff-Levin distribution