Initial-Population Bias in the Univariate Estimation of Distribution Algorithm


Martin Pelikan and Kumara Sastry

Missouri Estimation of Distribution Algorithms Laboratory (MEDAL)
University of Missouri, St. Louis, MO

http://medal.cs.umsl.edu/
[email protected]

Download MEDAL Report No. 2009001

http://medal.cs.umsl.edu/files/2009001.pdf


Motivation

Importance of bias

I Efficiency enhancements of EDAs may introduce bias.
I Examples
  I Local search.
  I Injection of prior full or partial solutions.
  I Bias based on prior knowledge about the problem.
I Bias may have positive or negative effects.
I It is important to understand these effects.

This study

I Study the effects of biasing the initial population.
I Consider UMDA on onemax and noisy onemax.
I Theory and experiment.


Outline

1. UMDA.

2. Basic model for bias.

3. Population size.

4. Number of generations.

5. Compare to hill climber.

6. Conclusions.

7. Future work.


Probability Vector as a Model

Probability vector, p

I Store probability of 1 in each position.

  I p = (p_1, p_2, ..., p_n).
  I p_i is the probability of a 1 in position i.

Replace crossover/mutation by model building and sampling

I Learn the probability vector from selected points.

I Sample new points according to the learned vector.


Univariate Marginal Distribution Algorithm (UMDA)

UMDA (Muhlenbein & Paaß, 1996).

1. Generate random population of binary strings.
2. Selection (e.g. tournament selection).
3. Learn probability vector for selected solutions.
4. Sample probability vector to generate new solutions.
5. Incorporate new solutions into original population.
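The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes binary tournament selection with replacement (for brevity), full replacement of the old population in step 5, and onemax as the fitness; the function and parameter names are made up for this sketch.

```python
import random

def onemax(bits):
    """Number of ones in the string."""
    return sum(bits)

def umda(n, pop_size, generations, fitness=onemax, seed=0):
    rng = random.Random(seed)
    # 1. Generate a random population of binary strings.
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Binary tournament selection (with replacement; an assumption).
        selected = []
        for _ in range(pop_size):
            a, b = rng.choice(pop), rng.choice(pop)
            selected.append(a if fitness(a) >= fitness(b) else b)
        # 3. Learn the probability vector from the selected solutions.
        p = [sum(s[i] for s in selected) / len(selected) for i in range(n)]
        # 4. Sample the probability vector to generate new solutions.
        # 5. Full replacement: the new solutions become the population.
        pop = [[1 if rng.random() < p[i] else 0 for i in range(n)]
               for _ in range(pop_size)]
    return max(pop, key=fitness)

best = umda(n=20, pop_size=100, generations=30)
print(onemax(best), "ones out of 20")
```

With a population this large relative to n, the run should converge to a string of (nearly) all ones.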


Example: Probability Vector

(Mühlenbein, Paass, 1996), (Baluja, 1994)

Current population:   11001  10101  01011  11000

Selected population:  11001  10101

Probability vector:   1.0  0.5  0.5  0.0  1.0

New population:       11101  11001  10101  10001
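As a small check of the example, the probability vector can be recomputed from a selected population. The sketch below assumes the selected strings are 11001 and 10101, the two strings from the current population that reproduce the vector shown on the slide; the helper name is hypothetical.

```python
def learn_probability_vector(selected):
    """Proportion of ones in each position of the selected strings."""
    n = len(selected[0])
    return [sum(int(s[i]) for s in selected) / len(selected) for i in range(n)]

selected = ["11001", "10101"]
print(learn_probability_vector(selected))  # [1.0, 0.5, 0.5, 0.0, 1.0]
```

Positions 1 and 5 are fixed to 1 and position 4 is fixed to 0, which is why every string in the new population has the form 1??01.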


Assumptions

Algorithm

I UMDA with binary tournament selection and full replacement.
I Results should generalize to other selection methods with fixed selection intensity.

Fitness

I Deterministic onemax:

$$\mathrm{onemax}(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} X_i$$

I Noisy onemax:

$$\mathrm{onemax}_{noisy}(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} X_i + N(0, \sigma^2)$$

I Results should generalize to other separable problems of bounded order (if a good model is used).
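The two fitness functions translate directly into code. In this sketch the noise is drawn afresh on every evaluation; the parameter name `sigma2` (the noise variance) and the optional `rng` argument are choices made for the illustration.

```python
import random

def onemax(bits):
    """Deterministic onemax: the number of ones."""
    return sum(bits)

def noisy_onemax(bits, sigma2, rng=random):
    """Onemax plus zero-mean Gaussian noise with variance sigma2."""
    return onemax(bits) + rng.gauss(0.0, sigma2 ** 0.5)

x = [1, 0, 1, 1]
print(onemax(x))                    # 3
print(noisy_onemax(x, sigma2=1.0))  # 3 plus a random perturbation
```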


Basic Model for Bias

Basic model

I Introduce bias in the initial population.

I Increase or decrease the initial proportion p_init of optimal bits.

I Use the same bias for all string positions.

I Examples

  p_init = 0.2   p_init = 0.5   p_init = 0.8
  00001          11110          11110
  00001          01010          01011
  01000          11101          01111
  00010          00010          11111
  10000          11011          10111

I What to expect?
  I p_init grows ⇒ UMDA performance improves.
  I p_init decreases ⇒ UMDA performance suffers.
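One way to generate such a biased initial population is sketched below. It follows the procedure described in the accompanying report text (place the expected number of 1s in every string position, then shuffle the bits across the population), which keeps the proportion of ones in each position at p_init up to rounding. The function name and the rounding of `p_init * pop_size` are assumptions of this sketch.

```python
import random

def biased_initial_population(pop_size, n, p_init, rng=None):
    rng = rng or random.Random()
    ones = round(p_init * pop_size)  # expected number of 1s per position
    columns = []
    for _ in range(n):
        column = [1] * ones + [0] * (pop_size - ones)
        rng.shuffle(column)          # spread the 1s randomly over the strings
        columns.append(column)
    # Transpose: row j collects the j-th bit of every column.
    return [list(row) for row in zip(*columns)]

for string in biased_initial_population(pop_size=5, n=5, p_init=0.2):
    print("".join(map(str, string)))
```

Each printed column contains exactly one 1, matching the p_init = 0.2 example above.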


Theoretical Model for Deterministic Onemax

Population size

I Gambler’s ruin population-sizing model (Harik et al., 1997).

I Population sizing bound

$$N = -\frac{1}{4 p_{\mathrm{init}}} \ln\alpha \, \sqrt{\pi n}$$

Number of generations

I Convergence model (Thierens & Goldberg, 1994).

I Number of generations bound

$$G = \left(\frac{\pi}{2} - \arcsin(2 p_{\mathrm{init}} - 1)\right) \sqrt{\pi n}$$
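The two bounds are easy to evaluate numerically. The sketch below reads them as N = -ln(α) √(πn) / (4 p_init) and G = (π/2 - arcsin(2 p_init - 1)) √(πn). The slide does not define α; here it is treated as a small tolerated-error parameter in (0, 1), and the value 0.01 is only an illustrative choice.

```python
import math

def population_size(n, p_init, alpha):
    """Gambler's ruin population-sizing bound for onemax."""
    return -math.log(alpha) * math.sqrt(math.pi * n) / (4.0 * p_init)

def generations(n, p_init):
    """Time-to-convergence bound for onemax."""
    return (math.pi / 2 - math.asin(2 * p_init - 1)) * math.sqrt(math.pi * n)

for p_init in (0.25, 0.5, 0.75):
    print(p_init, population_size(500, p_init, alpha=0.01), generations(500, p_init))
```

Halving p_init doubles the population-size bound, while the generations bound shrinks to zero as p_init approaches 1.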


Deterministic Onemax: Theoretical Speedup

Speedup factors

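I By what factor does each quantity change compared to p_init = 0.5? (A factor below 1 is a speedup; a factor above 1 is a slowdown.)

I Population size:

$$\eta_N = \frac{1}{2 p_{\mathrm{init}}}$$

I Number of generations:

$$\eta_G = 1 - \frac{2 \arcsin(2 p_{\mathrm{init}} - 1)}{\pi}$$

I Number of evaluations:

$$\eta_E = \eta_N \, \eta_G = \frac{1}{2 p_{\mathrm{init}}} \left(1 - \frac{2 \arcsin(2 p_{\mathrm{init}} - 1)}{\pi}\right)$$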
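A short worked example may help in reading these factors. The sketch below evaluates them as ratios relative to the base case p_init = 0.5, so a value below 1 means fewer resources are needed; the function names are chosen for this illustration.

```python
import math

def eta_N(p_init):
    """Population-size factor relative to p_init = 0.5."""
    return 1.0 / (2.0 * p_init)

def eta_G(p_init):
    """Number-of-generations factor relative to p_init = 0.5."""
    return 1.0 - 2.0 * math.asin(2.0 * p_init - 1.0) / math.pi

def eta_E(p_init):
    """Number-of-evaluations factor: product of the two factors above."""
    return eta_N(p_init) * eta_G(p_init)

for p_init in (0.25, 0.5, 0.75):
    print(p_init, eta_N(p_init), eta_G(p_init), eta_E(p_init))
```

For p_init = 0.75 both the population size and the number of generations shrink to about 2/3 of the base case, so the number of evaluations drops to about 4/9. For p_init = 0.25 the factors are 2 and 4/3, giving roughly 2.7 times as many evaluations.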


Experimental Setup

Basic setup

I Binary tournament selection without replacement.

I Full replacement (no elitism or niching).

I Problems of n = 100 to n = 500 tested (focus on n = 500).

I Population size set using bisection so that all 10 out of 10 independent runs succeed, where a successful run finds a solution with at least 95% of bits optimal.

I Bisection repeated 10 times for each setting.
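The slides do not spell out the bisection procedure, so the following is only a sketch of one common variant: double the population size until the success criterion holds, then bisect between the last failing and the first succeeding size until the interval is within a relative tolerance. The predicate `succeeds`, the starting size and the 10% tolerance are all assumptions; in the setting above, `succeeds(N)` would run UMDA 10 times with population size N and report whether every run reached 95% ones.

```python
def bisect_population_size(succeeds, start=10, tolerance=0.1):
    low, high = start, start
    # Doubling phase: grow until the criterion is met.
    while not succeeds(high):
        low, high = high, high * 2
    # Bisection phase: shrink the interval between failure and success.
    while high - low > 1 and (high - low) / high > tolerance:
        mid = (low + high) // 2
        if succeeds(mid):
            high = mid
        else:
            low = mid
    return high

# Toy predicate standing in for "all 10 UMDA runs succeeded".
print(bisect_population_size(lambda size: size >= 137))
```

Because real runs are stochastic, repeating the whole bisection several times and averaging (as done here, 10 times per setting) gives a more stable estimate.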

Observed statistics

I Population size.

I Number of generations.

I Number of evaluations.


Deterministic Onemax: Speedup and Slowdown

[Plots omitted: speedup and slowdown relative to p_init = 0.5 as functions of p_init, with curves for number of evaluations, population size, number of generations, and the base case.]

Figure 2: The factor by which the population size, the number of generations and the number of evaluations should change with varying p_init compared to the base case with p_init = 0.5. The three factors are based on the population-sizing and time-to-convergence models. The results are shown as speedup and slowdown curves.

... problem of n = 500 is used in most presented results. In all experiments, binary tournament selection without replacement is used. The sampling ensures that the generated population of strings corresponds to the used probability vector as closely as possible in order to minimize sampling errors; this is done by first generating the expected number of 1s for each position and then randomly shuffling the bits in each position across the population of new strings (Branke, Lode, & Shapiro, 2007). In each generation, a population of the same size as the original population is generated and the old population is fully replaced by the newly generated strings. There is no elitism or niching. To alleviate the effects of stochastic noise on the initial-population bias, the initial population is generated in the same way as the new solutions in each iteration: first, the expected number of 1s is put in every string position, and the bits are then shuffled randomly.

The bisection method (Sastry, 2001; Pelikan, 2005) was used to find an adequate population size for each setting, which ensured that the algorithm finds a good-enough solution in each of the 10 independent runs. In each run, UMDA was required to find a solution with at least 95% of string positions containing a 1 (at least 95% of the bits of the global optimum were correctly discovered). To provide more accurate estimates of the actual UMDA performance, the bisection was repeated 10 times for each setting, yielding 100 successful runs for each configuration.

The results for UMDA are shown in Figure 3. The results indicate a close match between the theoretical model and experimental results. An analogous match was observed for other problem sizes.

5 Mathematical Model for Noisy Onemax

Genetic algorithms are known to perform well even when the fitness cannot be determined exactly due to external noise. The noise may come from an inherently noisy problem domain or a noisy approximation of a computationally expensive fitness function. External noise can also be used to model certain types of problem difficulty. This section extends the analysis presented thus far to onemax with external Gaussian noise. This will provide us with a basic understanding of the interaction between the initial-population bias and external noise.

The section starts with the description of the noisy onemax function. Next, the section presents the population-sizing and time-to-convergence models for noisy onemax. Finally, the models are verified with experiments and the results are discussed.


I Empirical results confirm intuition.
I Positive bias improves performance.
I Negative bias worsens performance.


Deterministic Onemax: Experiments vs. Theory

[Plots omitted: (a) population size, (b) number of generations, (c) number of evaluations, each plotted against p_init for experiment and theory.]

Figure 3: Effects of initial-population bias on UMDA performance on onemax of n = 500 bits without external noise.

5.1 Noisy Onemax

Noisy fitness functions can be modeled as (Goldberg, Deb, & Clark, 1992)

fitness′ = fitness + noise, (18)

where fitness is the true fitness of the candidate solution and noise is a random variable corresponding to the external noise. External noise can be characterized by two basic quantities: bias and variance (Sastry, 2001). Unbiased noise only adds variance to the population without affecting the expected fitness of any candidate solution. While increasing the fitness variance makes decision making between competing partial solutions or building blocks more difficult, the expected outcome of the computation should remain the same, assuming that the population size is increased to average out the added variance. On the other hand, biased noise may change the expected fitness of any candidate solution. This makes decision making more difficult and may also change the overall outcome, since the best solution of the noisy fitness may not represent the best solution with no external noise.

In models of noisy fitness functions, external noise is typically distributed according to a zero-mean Gaussian distribution, which is also the case considered here. The noisy onemax is thus defined as

$$\mathrm{onemax}_{noisy}(X_1, X_2, \ldots, X_n) = \mathrm{onemax}(X_1, X_2, \ldots, X_n) + N(0, \sigma_N^2),$$ (19)

where $N(0, \sigma_N^2)$ is a random variable distributed according to the Gaussian distribution with mean 0 and variance $\sigma_N^2$.

Intuitively, the effects of external noise depend on how its variance compares to the overall variance of the fitness function. If the external-noise variance is very small compared to the fitness variance, the effects of external noise can be expected to be negligible; on the other hand, if the noise variance is larger than the fitness variance, the effects of external noise can be expected to be substantial. That is why the strength (variance) of external noise is often measured relative to the fitness variance:

$$\sigma_N^2 = \beta \sigma_F^2,$$ (20)

where $\sigma_F^2$ is the fitness variance and β ≥ 0 is a constant factor. In this paper, we consider four levels of external noise: β ∈ {0.5, 1, 1.5, 2}. Note that for the considered range of β, the effects of noise on the fitness function are substantial. For the smallest β = 0.5, the variance of the external noise is half the overall fitness variance, and for the largest β = 2, the variance of the added noise is twice the fitness variance.
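To make the noise levels concrete, the sketch below computes the noise variance for each β. It assumes the fitness variance of onemax is that of a sum of n independent bits, n·p·(1 - p), which equals 0.25n for p = 0.5 (the value quoted in the figure captions); the function names are illustrative.

```python
def fitness_variance(n, p=0.5):
    """Variance of onemax over strings with independent bits, P(bit = 1) = p."""
    return n * p * (1 - p)

def noise_variance(beta, n, p=0.5):
    """External-noise variance: sigma_N^2 = beta * sigma_F^2."""
    return beta * fitness_variance(n, p)

for beta in (0.5, 1, 1.5, 2):
    print(beta, noise_variance(beta, n=500))
```

For n = 500 the fitness variance is 125, so the four noise levels correspond to noise variances of 62.5, 125, 187.5 and 250.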


I Empirical results match theory.

I Theory makes conservative estimates.

I Empirical results confirm intuition.


Theoretical Model for Noisy Onemax: Population Size

Population size

I Gambler’s ruin population-sizing model (Harik et al., 1997).

I Variance of external noise given in terms of fitness variance:

$$\sigma_{noise}^2 = \beta \times \sigma_{fitness}^2$$

I Population sizing bound becomes

$$N = -\frac{1}{4 p_{\mathrm{init}}} \ln\alpha \, \sqrt{\pi n (1 + \beta)}$$


Theoretical Model for Noisy Onemax: Generations

Number of generations

I Convergence model (Miller & Goldberg, 1994; Sastry, 2001; Goldberg, 2002).

I Difficult to solve analytically for arbitrary p_init.

I Effects of p_init modeled by an empirical fit.

I Number of generations bound

$$G = \frac{\pi}{2} \sqrt{\pi n} \, \sqrt{1 + \beta} \left(1 - \frac{2 \arcsin(2 p_{\mathrm{init}} - 1)}{\pi}\right)$$
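The noisy-case bounds can be evaluated in the same way as the deterministic ones. The sketch below reads them as N = -ln(α) √(πn(1 + β)) / (4 p_init) and G = (π/2) √(πn) √(1 + β) (1 - 2 arcsin(2 p_init - 1)/π), and takes the number of evaluations to be E = N · G, which holds under full replacement. The value of α and the function names are assumptions of this illustration.

```python
import math

def population_size_noisy(n, p_init, alpha, beta):
    return -math.log(alpha) * math.sqrt(math.pi * n * (1 + beta)) / (4.0 * p_init)

def generations_noisy(n, p_init, beta):
    bias_factor = 1.0 - 2.0 * math.asin(2.0 * p_init - 1.0) / math.pi
    return (math.pi / 2) * math.sqrt(math.pi * n) * math.sqrt(1 + beta) * bias_factor

def evaluations_noisy(n, p_init, alpha, beta):
    return population_size_noisy(n, p_init, alpha, beta) * generations_noisy(n, p_init, beta)

for beta in (0, 0.5, 1, 1.5, 2):
    print(beta, evaluations_noisy(500, 0.5, alpha=0.01, beta=beta))
```

Both N and G scale with √(1 + β), so the number of evaluations scales with (1 + β), and the dependence on p_init factors out unchanged.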


Noisy Onemax: Theoretical Speedup

Speedup factors same as for deterministic case!

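I Population size:

$$\eta_N = \frac{1}{2 p_{\mathrm{init}}}$$

I Number of generations:

$$\eta_G = 1 - \frac{2 \arcsin(2 p_{\mathrm{init}} - 1)}{\pi}$$

I Number of evaluations:

$$\eta_E = \frac{1}{2 p_{\mathrm{init}}} \left(1 - \frac{2 \arcsin(2 p_{\mathrm{init}} - 1)}{\pi}\right)$$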


Noisy Onemax: Experiments vs. Theory for β = 1

[Plots omitted: population size, number of generations, and number of evaluations, each plotted against p_init for experiment and theory.]

Figure 4: Effects of initial-population bias on UMDA performance on onemax of n = 500 bits, $\sigma_N^2 = 0.5\sigma_F^2 = 0.125n$.

[Plots omitted: population size, number of generations, and number of evaluations, each plotted against p_init for experiment and theory.]

Figure 5: Effects of initial-population bias on UMDA performance on onemax of n = 500 bits, $\sigma_N^2 = \sigma_F^2 = 0.25n$.

Figure 8 visualizes the effects of external noise on the number of evaluations. Comparing the overall number of evaluations confirms the intuition that the number of evaluations grows with the strength of external noise. Specifically, from the models presented earlier, the number of evaluations grows linearly with the ratio of the variance of external noise and the fitness variance, β.

6 Effects of mutation and the sampling procedure

Thus far we examined the effects of the initial-population bias on the performance of standard UMDA. This section answers three additional questions related to this topic:

(1) How does bit-flip mutation affect performance of UMDA with initial population bias?

(2) How does the sampling procedure affect performance of UMDA with initial population bias?

(3) How is performance of mutation-based hill climbing affected by initial bias and noise compared to UMDA?


I Empirical results match theory.

I Population sizing remains a conservative estimate.

I Note: β = 1 is a lot of noise (noise variance equal to overall fitness variance).


Compare to Hill Climber on Deterministic Case

[Plots omitted: (a) number of evaluations against p_init for standard and noiseless sampling; (b) speedup of noiseless over standard sampling against p_init for n = 100, 200, 300, 400, 500.]

Figure 12: Effects of sampling on UMDA performance with external noise $\sigma_N^2 = 2\sigma_F^2$.

[Plots omitted: (a) hill climbing on 500-bit onemax, number of evaluations against p_init for experiment and theory; (b) comparison of UMDA and HC, number of evaluations against p_init.]

Figure 13: Performance of HC on a 500-bit deterministic onemax and its comparison to UMDA. Theoretical model of Muhlenbein (Muhlenbein, 1992) is used to provide an upper bound on the time to convergence (on the left).

... 1s. The total number of evaluations until termination was then recorded and averaged over all 100 runs. To analyze the influence of the initial-population bias on HC performance, the initial string is generated to contain p_init · n ones.

It is well known that HC works very well on onemax and similar unimodal problems (Muhlenbein, 1992). This is confirmed by the results shown in Figure 13. HC is clearly much less sensitive to the value of p_init than UMDA. This is not surprising because it was shown (Muhlenbein, 1992) that in HC the majority of computational resources are typically spent in making the last few steps toward the optimum. Furthermore, the results clearly indicate that on onemax HC significantly outperforms UMDA except for relatively large values of p_init. The smaller p_init, the greater the differences between HC and UMDA.
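For illustration, a mutation-based hill climber of the kind discussed here might look as follows. The exact HC variant is not specified in this excerpt, so the sketch assumes bit-flip mutation with probability 1/n per bit and acceptance of any offspring that is at least as good; only the initialization with p_init · n ones is taken from the text. The evaluation cap `max_evals` is a safety limit added for the sketch.

```python
import random

def hill_climb(n, p_init, rng=None, max_evals=10**6):
    rng = rng or random.Random()
    ones = round(p_init * n)
    x = [1] * ones + [0] * (n - ones)   # initial string with p_init * n ones
    rng.shuffle(x)
    fx, evals = sum(x), 1
    while fx < n and evals < max_evals:
        # Flip each bit independently with probability 1/n (assumed rate).
        y = [b ^ 1 if rng.random() < 1.0 / n else b for b in x]
        fy = sum(y)
        evals += 1
        if fy >= fx:                    # keep the offspring if it is no worse
            x, fx = y, fy
    return x, evals

solution, evals = hill_climb(n=100, p_init=0.5)
print(sum(solution), "ones after", evals, "evaluations")
```

Trying several values of p_init with this sketch shows the effect described above: most evaluations are spent fixing the last few zeros, so the starting proportion of ones matters relatively little.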

6.3.2 Performance of Hill Climbing on Noisy Onemax

After several initial experiments, it became clear that while on deterministic onemax HC significantly outperforms UMDA, on noisy onemax the situation changes rapidly. This result is not a big surprise, because it was shown that recombination-based search can deal with external noise much more effectively than search based on local operators (Sastry & Goldberg, 2004); nonetheless, the


I Performance of HC is great regardless of bias.

I This agrees with theory (Muhlenbein, 1992).


Compare to Hill Climber on Noisy Case

Performance of HC becomes poor with noise!

β     n    p_init   HC evaluations   UMDA evaluations
0.5   10   0.1               4,449              1,210
0.5   25   0.1           2,125,373              1,886
0.5   10   0.5              11,096                 66
0.5   25   0.5           8,248,140                169
1.0    5   0.1                 215                574
1.0   15   0.1           5,691,725              1,210
1.0    5   0.5                  64                 20
1.0   15   0.5          15,738,168                 64


Conclusions

I We have a good theoretical understanding of the effects of one type of initial-population bias on performance of UMDA on deterministic and noisy onemax.

I Effects of bias match intuition.
  I Good bias improves performance.
  I Bad bias worsens performance.

I Effects of bias are independent of noise.

I Experimental results match theory.


Future Work

I Study specific efficiency enhancement techniques and the bias they introduce, and apply the theory developed here to estimate the final effects.

I Extend this work to other types of bias.

I Extend this work to other evolutionary algorithms, especially the standard genetic algorithms with two-parent recombination and EDAs with multivariate models (e.g. BOA and ecGA).

I Eliminate the empirical fit from the model for the noisy onemax.


Acknowledgments


I NSF; NSF CAREER grant ECS-0547013.

I U.S. Air Force, AFOSR; FA9550-06-1-0096.

I University of Missouri; High Performance Computing Collaboratory sponsored by Information Technology Services; Research Award; Research Board.