
Hill climbing: Simulated annealing and Tabu search

Heuristic algorithms

Giovanni Righini

University of Milan

Department of Computer Science (Crema)

Hill climbing

Instead of repeating local search, it is possible to carry on the search after a local optimum has been reached:

• either by changing the neighborhood or the objective;

• or by accepting sub-optimal solutions, i.e. always moving to the best neighbor

$$x' := \arg\min_{y \in N(x)} z(y)$$

even when this is a worsening move.

The main problem with the latter alternative is looping, i.e. cyclically visiting the same solutions.

The two main strategies to control this effect are

• Simulated Annealing (SA), which uses randomness;

• Tabu Search (TS), which uses memory.

Annealing

The SA algorithm derives from the Metropolis algorithm (1953), which simulates a physical process:

• a metal is brought to a temperature close to the melting point, so that particles spread in a random and uniform way;

• then it is cooled very slowly, so that energy decreases, but there is enough time to converge to thermal equilibrium.

The aim of the process is to obtain

• a regular crystal lattice with no defects, corresponding to the ground state (the configuration of minimum energy);

• a material with useful physical properties.

Simulated Annealing

The correspondence with combinatorial optimization is the following:

• the particles correspond to variables (the spin of the particles corresponds to a binary domain);

• the states of the physical system correspond to solutions;

• the energy corresponds to the objective function;

• the ground state corresponds to globally minimal solutions;

• the state transitions correspond to local search moves;

• the temperature corresponds to a parameter.

This suggests using the Metropolis algorithm for optimization purposes.

According to the laws of thermodynamics, at thermal equilibrium each state $i$ has probability

$$\pi'_T(i) = \frac{e^{-E_i/(kT)}}{\sum_{j \in S} e^{-E_j/(kT)}}$$

with $S$ the set of states, $T$ the temperature and $k$ Boltzmann's constant.

It describes what happens at thermal equilibrium when the system is continuously subject to random transitions between states.

Metropolis algorithm

The Metropolis algorithm generates a random sequence of states:

• the current state $i$ has energy $E_i$;

• the algorithm perturbs $i$, generating a state $j$ with energy $E_j$;

• the transition from $i$ to $j$ occurs with probability

$$\pi_T(i,j) = \begin{cases} 1 & \text{if } E_j < E_i \\ e^{(E_i - E_j)/(kT)} = \dfrac{\pi'_T(j)}{\pi'_T(i)} & \text{if } E_j \ge E_i \end{cases}$$

The Simulated Annealing algorithm simulates this.
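The transition rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the choice of passing the product kT as a single parameter `kT` are assumptions, not part of the original slides.

```python
import math
import random

def metropolis_probability(e_i, e_j, kT):
    """Probability of moving from a state with energy e_i to one with energy e_j."""
    if e_j < e_i:
        return 1.0                        # lower energy: always accepted
    return math.exp((e_i - e_j) / kT)     # higher or equal energy: Boltzmann ratio

def metropolis_accept(e_i, e_j, kT, rng=random):
    """Draw a uniform number in [0, 1) and decide whether the transition occurs."""
    return rng.random() < metropolis_probability(e_i, e_j, kT)
```

A transition towards a state with equal energy has probability e^0 = 1, so it is always accepted.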

Simulated Annealing

Algorithm SimulatedAnnealing(I, x(0), T)
  x := x(0);
  x∗ := x(0);
  While Stop() = false do
    x′ := RandomExtract(N, x); { random uniform extraction }
    If z(x′) < z(x) or U[0; 1] < e^((z(x) − z(x′))/T) then x := x′;
    If z(x′) < z(x∗) then x∗ := x′;
    T := Update(T);
  EndWhile;
  Return (x∗, z(x∗));

Remark: worsening moves can be made even when improving moves exist, because the neighborhood is not fully explored.
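As a hedged illustration, the pseudocode can be turned into the following Python sketch. The stopping rule (a plain iteration budget `max_iters`), the callbacks `random_neighbor` and `update_t`, and the toy problem at the end are assumptions introduced here for the example.

```python
import math
import random

def simulated_annealing(z, random_neighbor, x0, t0, update_t, max_iters, rng=random):
    """Minimize z starting from x0; returns (best solution, best value)."""
    x, best = x0, x0
    t = t0
    for _ in range(max_iters):               # Stop() is an iteration budget here
        y = random_neighbor(x, rng)           # uniform random extraction from N(x)
        delta = z(x) - z(y)                   # positive when y improves on x
        if delta > 0 or rng.random() < math.exp(delta / t):
            x = y
        if z(y) < z(best):
            best = y
        t = update_t(t)                       # must keep t strictly positive
    return best, z(best)

# Toy usage (hypothetical problem): minimize (x - 7)^2 over the integers 0..20.
def step_neighbor(x, rng):
    return min(20, max(0, x + rng.choice((-1, 1))))

best, value = simulated_annealing(lambda x: (x - 7) ** 2, step_neighbor, 0,
                                  10.0, lambda t: 0.99 * t, 2000, random.Random(0))
```

Only one random neighbor is evaluated per iteration, which is why a worsening move can be accepted while an improving one exists elsewhere in the neighborhood.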

Acceptance criterion

$$\pi_T(x, x') = \begin{cases} 1 & \text{if } z(x') < z(x) \\ e^{(z(x) - z(x'))/T} & \text{if } z(x') \ge z(x) \end{cases}$$

The “temperature” parameter $T$ calibrates the probability of accepting worsening moves:

• with $T \gg 0$ they are frequently accepted: the search tends to diversify, as in a random walk;

• with $T \approx 0$ they are frequently rejected: the search tends to intensify, as in steepest descent.

Note the analogy with ILS.

Convergence to the optimum

The probability that the current solution is $x'$ is the sum over all possible predecessor states $x$ of the probabilities of

• extracting move $(x, x')$, which is uniform,

• and accepting the move, which is

$$\pi_T(x, x') = \begin{cases} 1 & \text{if } z(x') < z(x) \\ e^{(z(x) - z(x'))/T} & \text{if } z(x') \ge z(x) \end{cases}$$

Hence, at each step it depends only on the probability of the previous state: the random variable $x$ forms a Markov chain.

For each given value of $T$, the transition probabilities do not change over time: the Markov chain is homogeneous.

If the search space is connected with respect to neighborhood $N$, the probability of reaching each state is strictly positive and the Markov chain is irreducible.

Under these assumptions, the probability of the states tends to a stationary distribution, independent of the initial solution.


The stationary distribution is the one that thermodynamics indicates for the thermal equilibrium of physical systems, and it favors “good” solutions:

$$\pi_T(x) = \frac{e^{-z(x)/T}}{\sum_{y \in X} e^{-z(y)/T}} \quad \text{for each } x \in X$$

where $X$ is the feasible region.

If $T \to 0$, the distribution tends to a limit distribution

$$\pi(x) = \lim_{T \to 0} \pi_T(x) = \begin{cases} \dfrac{1}{|X^*|} & \text{for } x \in X^* \\ 0 & \text{for } x \in X \setminus X^* \end{cases}$$

where $X^*$ is the set of global optima. This corresponds to guaranteed convergence to a global optimum (!)
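The effect of the temperature on the stationary distribution can be checked numerically on a small instance. The sketch below assumes the feasible region is given simply as a list of objective values; the function name is illustrative.

```python
import math

def stationary_distribution(values, t):
    """Return pi_T(x) for each objective value z(x) listed in `values`."""
    z_min = min(values)                  # shifting the exponents avoids underflow
    weights = [math.exp(-(v - z_min) / t) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

# Four solutions, two of which are globally optimal (z = 0).
cold = stationary_distribution([0.0, 0.0, 1.0, 3.0], 0.01)
hot = stationary_distribution([0.0, 0.0, 1.0, 3.0], 1e6)
```

Subtracting the minimum value multiplies numerator and denominator by the same factor, so the distribution is unchanged. With a very low temperature the mass concentrates on the two optima (about 1/2 each); with a very high one the distribution is nearly uniform.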


However, the result holds at equilibrium and low values of T imply

• high probability of visiting a global optimum;

• slow convergence to the optimum (many moves are rejected).

In finite time, using a lower value for T does not always improve the result.

On the other hand, it is not necessary to visit global optima often: one visit is enough to discover the optimum.

In practice, T is updated, decreasing it according to a cooling schedule.

The initial value $T^{[0]}$ is set

• high enough to allow accepting many moves;

• low enough to allow rejecting the worst moves.

After sampling the first neighborhood $N(x^{(0)})$, one usually fixes $T^{[0]}$ so that a given fraction (e.g., 90%) of $N(x^{(0)})$ is accepted.
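One possible way to implement this calibration, sketched under assumptions, is a bisection on T. The slides do not prescribe a method, so the functions below, their names and the search bounds are hypothetical. The input is the list of differences z(x') − z(x(0)) over the sampled neighbors.

```python
import math

def acceptance_fraction(deltas, t):
    """Expected fraction of accepted neighbors; deltas are z(x') - z(x0)."""
    return sum(1.0 if d < 0 else math.exp(-d / t) for d in deltas) / len(deltas)

def initial_temperature(deltas, target=0.9, lo=1e-9, hi=1e9, steps=200):
    """Bisection on a logarithmic scale: the accepted fraction grows with T."""
    for _ in range(steps):
        mid = math.sqrt(lo * hi)          # geometric midpoint of [lo, hi]
        if acceptance_fraction(deltas, mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

sample = [1.0, 2.0, 3.0, -1.0]            # three worsening moves, one improving
t0 = initial_temperature(sample)
```

The bisection works because the expected acceptance fraction is non-decreasing in T. If the target cannot be reached within the bounds, the function simply returns a value close to one of them.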

Cooling schedule

In each outer iteration $r = 0, \ldots, m$:

• a constant value $T^{[r]}$ is used for $\ell^{[r]}$ inner iterations;

• $T^{[r]}$ is updated according to an exponential function

$$T^{[r]} := \alpha^r \, T^{[0]} \quad \text{with } 0 < \alpha < 1;$$

• $\ell^{[r]}$ is also updated
  • increasing with $r$ (e.g. linearly),
  • depending on the diameter of the search graph (and hence on the size of the instance).

If $T$ is variable, we have a non-homogeneous Markov chain, but

• if $T$ decreases slowly enough, it converges to the global optimum;

• the parameters depend on the instance (in particular, on $z(\bar{x}) - z(x^*)$, where $\bar{x}$ is the best local-but-not-global optimum).
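A minimal sketch of such a schedule follows, assuming a geometric temperature decrease with a fixed factor `alpha` and a linear growth of the number of inner iterations. The parameter names `l0` and `l_step` are illustrative, not taken from the slides.

```python
def cooling_schedule(t0, alpha, l0, l_step, m):
    """Yield (T[r], l[r]) for r = 0..m, with T[r] = alpha**r * T[0] and linear l[r]."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    for r in range(m + 1):
        yield t0 * alpha ** r, l0 + l_step * r

schedule = list(cooling_schedule(100.0, 0.5, 10, 5, 3))
```

Each pair gives the temperature to hold constant and the number of inner iterations to run at that temperature.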

Computational efficiency and variants

Instead of computing probabilities through an exponential function, it is convenient to pre-compute a table of values of $e^{\delta/T}$ for each possible $\delta = z(x) - z(x')$.
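The table can be sketched as follows, assuming integer objective values and a known bound `max_worsening` on the worsening that is worth considering. Both are assumptions of this example. The table is indexed by the worsening −δ = z(x') − z(x) ≥ 0.

```python
import math

def build_acceptance_table(t, max_worsening):
    """table[d] = exp(-d / t) for each integer worsening d = z(x') - z(x) >= 0."""
    return [math.exp(-d / t) for d in range(max_worsening + 1)]

def acceptance_probability(table, z_x, z_y):
    """Look up the probability of moving from value z_x to value z_y."""
    worsening = z_y - z_x
    if worsening <= 0:
        return 1.0               # improving or equal moves are always accepted
    if worsening >= len(table):
        return 0.0               # beyond the table: treated as negligible (assumption)
    return table[worsening]

table = build_acceptance_table(2.0, 5)
```

The table depends on T, so it has to be rebuilt whenever the temperature changes, for instance once per outer iteration of the cooling schedule.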

In adaptive simulated annealing algorithms the parameter $T$ depends on the results obtained:

• $T$ is tuned so that a given fraction of $N(x)$ is likely to be accepted;

• $T$ is increased if the solution does not improve significantly and decreased otherwise.

Tabu Search (TS)

Tabu Search (Glover, 1986) keeps the same selection criterion as the steepest descent algorithm

$$x' := \arg\min_{y \in N(x)} z(y)$$

i.e. it selects the best solution in the neighborhood of the current one.

If trivially implemented, this would cause loops in the search.

The idea is to forbid already visited solutions, by imposing some tabu on the search:

$$x' := \arg\min_{y \in N(x) \setminus V} z(y)$$

where $V$ is the set of tabu solutions.

The principle is very simple, but the crucial issue is how to make it efficient.

Tabu search

An exchange heuristic based on the exhaustive exploration of the neighborhood with a tabu on the already visited solutions requires:

1. to evaluate the feasibility of each subset produced by the exchanges (when it is not possible to guarantee it a priori);

2. to evaluate the cost of each feasible solution;

3. to evaluate the tabu/non-tabu status of each promising feasible solution;

4. to select the best feasible and non-tabu solution.

An easy way to evaluate the tabu status is

• to record the already visited solutions in a suitable data-structure (called tabu list);

• to check whether each explored solution belongs to the tabu list or not.
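The scheme just described, with the tabu list holding whole solutions, can be sketched as below. The sketch assumes that solutions are hashable and that all neighbors are feasible; the function names and the toy instance are illustrative.

```python
def best_non_tabu_neighbor(x, neighbors, z, visited):
    """Return the best neighbor of x not in `visited`, or None if all are tabu."""
    candidates = [y for y in neighbors(x) if y not in visited]
    return min(candidates, key=z) if candidates else None

def tabu_search_visited(x0, neighbors, z, max_iters):
    """Tabu search that forbids every solution already visited."""
    x, best = x0, x0
    visited = {x0}                        # hash set used as tabu list
    for _ in range(max_iters):
        y = best_non_tabu_neighbor(x, neighbors, z, visited)
        if y is None:                     # the whole neighborhood is tabu: stuck
            break
        x = y
        visited.add(x)
        if z(x) < z(best):
            best = x
    return best, len(visited)

# Toy instance (hypothetical): five solutions on a line, neighbors are x - 1 and x + 1.
costs = [3, 1, 2, 0, 5]
line_neighbors = lambda x: [y for y in (x - 1, x + 1) if 0 <= y < len(costs)]
result = tabu_search_visited(1, line_neighbors, lambda x: costs[x], 10)
```

Starting from the local optimum in position 1, the search is forced through position 2 and reaches the global optimum in position 3, then stops in position 4 when every neighbor is tabu.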

Making tabu search efficient

This is very inefficient:

• the check requires linear time in the size of the tabu list (it can be reduced with hash tables and search trees);

• the number of visited solutions increases with time;

• the memory occupation increases with time.

The Cancellation Sequence Method and the Reverse Elimination Method tackle these problems, exploiting the observation that in general

• visited solutions form a chain of little variations;

• few visited solutions belong to the neighborhood of the current one.

The idea is to concentrate on the variations, not on the solutions:

• to keep a list of moves, instead of a list of solutions;

• to evaluate the overall variations done;

• to find solutions that have been subject to few/little changes (recent solutions or solutions subject to changes that have been undone later).

More reasons for not using tabu solutions

There are other phenomena that affect the effectiveness of tabu search.

Forbidding already visited solutions may have two different negative effects:

• it can disconnect the search graph

(hence it would be better to avoid absolute prohibitions);

• it may slow down the exit from attraction basins

(hence it would be better to apply the tabu status to many other solutions in the same basin).

The two observations suggest opposite remedies.

Example

A tricky example is the following:

• the ground set $E$ contains $L$ elements;

• all subsets are feasible: $X = 2^E$;

• the objective combines an additive term which is almost uniform ($\epsilon \ll 1$) and a large negative term, which applies in $x = E$ and is zero otherwise:

$$z(x) = \begin{cases} \sum_{i \in x} (1 + \epsilon i) & \text{for } x \ne E \\ \sum_{i \in x} (1 + \epsilon i) - L - 1 & \text{for } x = E \end{cases}$$

If we consider the neighborhood made of the solutions at Hamming distance $\le 1$

$$N_{H_1}(x) = \{x' \in 2^E : d(x, x') \le 1\}$$

the problem has

• a local optimum $\bar{x} = \emptyset$ with $z(\bar{x}) = 0$, whose attraction basin contains all solutions with $|x| \le L - 2$;

• a global optimum $x^* = E$ with $z(x^*) = L(L-1)\epsilon/2 - 1 < 0$, whose attraction basin contains all solutions with $|x| \ge L - 1$.
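The structure of this instance can be verified numerically with a small sketch. It assumes that the elements are indexed 0, ..., L − 1, which is the indexing consistent with z(x*) = L(L − 1)ε/2 − 1; all function names are illustrative.

```python
def make_z(L, eps):
    """Objective of the example; elements are 0..L-1 (an indexing assumption)."""
    E = frozenset(range(L))
    def z(x):
        value = sum(1 + eps * i for i in x)
        return value - L - 1 if x == E else value
    return z, E

def hamming_neighbors(x, E):
    """Solutions at Hamming distance exactly 1 from x (x itself is left out)."""
    return [x ^ frozenset([i]) for i in E]

def steepest_descent(x, z, E):
    """Move to the best neighbor until no neighbor improves."""
    while True:
        y = min(hamming_neighbors(x, E), key=z)
        if z(y) >= z(x):
            return x
        x = y

z, E = make_z(5, 0.01)
from_small = steepest_descent(frozenset({0, 1, 2}), z, E)      # |x| = L - 2
from_large = steepest_descent(frozenset({0, 1, 2, 3}), z, E)   # |x| = L - 1
```

With L = 5, steepest descent started from a solution with L − 2 elements ends in the empty set, while a start with L − 1 elements ends in E.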

Example

Starting from $x^{(0)} = \bar{x} = \emptyset$ and running Tabu Search forbidding the already visited solutions, the trajectory of the search

• scans a large part of $2^E$, going farther from $\bar{x}$ and then closer again, with $z$ oscillating up and down;

• for values of $L \ge 4$ it gets stuck in a solution whose neighborhood has been completely explored, although other solutions have not been visited yet;

• for large values of $L$ (e.g., $L = 16$), it cannot reach the global optimum.

Example

The oscillations of the objective function show the drawbacks of the method.

The solution $x$ repeatedly goes farther from $x^{(0)} = \bar{x}$ and then closer to it:

• it visits almost entirely the attraction basin of $\bar{x}$;

• eventually it does not leave the basin, but it remains in a solution whose neighborhood is completely tabu.

Tabu attributes

To overcome these difficulties some simple techniques are used:

1. instead of forbidding visited solutions, solutions are tabu when they possess some “attributes” in common with the visited solutions:

• a set $A$ of relevant attributes is defined;

• a subset $\bar{A} \subseteq A$ of attributes (initially empty) is declared tabu;

• all solutions with tabu attributes are tabu

$$A(y) \cap \bar{A} \ne \emptyset \Rightarrow y \text{ is tabu}$$

where $A(y)$ is the set of attributes of solution $y$;

• if a move transforms the current solution $x$ into $x'$, the attributes that $x$ had and $x'$ does not have are inserted into $\bar{A}$ (in this way $x$ becomes tabu).

This means that

• solutions similar to those already visited are tabu;

• the search is faster in leaving the attraction basins of the already visited local optima.

Temporary tabu and aspiration criteria

Since the tabu list generates regions that are difficult or impossible to reach,

2. the tabu status has a limited duration, defined by a number of iterations L:

• tabu solutions become accessible again;

• it is possible to re-visit the same solutions (however, if the set of tabu attributes is different, the next iterations will be different).

The tabu tenure L is a critical parameter of TS.

Since the tabu list could forbid global optima just because they are similar to visited solutions, an aspiration criterion is used: a tabu solution is accepted when it is better than the best incumbent solution.

When all solutions in the neighborhood of the current solution are tabu, the algorithm accepts the one with the oldest tabu status.

Tabu search

Algorithm TabuSearch(I, x(0), L)
  x := x(0); x∗ := x(0);
  Ā := ∅; { Ā: set of tabu attributes }
  While Stop() = false do
    z′ := +∞;
    For each y ∈ N(x) do
      If z(y) < z′ then
        If Tabu(y, Ā) = false or z(y) < z(x∗) then x′ := y; z′ := z(y);
      EndIf
    EndFor
    Ā := Update(Ā, x′, L);
    If z(x′) < z(x∗) then x∗ := x′;
    x := x′;
  EndWhile
  Return (x∗, z(x∗));
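The pseudocode can be sketched in Python for problems whose solutions are subsets of a ground set, with the neighborhood "flip one element". The choice of the flipped element as tabu attribute, the iteration budget and all names are assumptions made for this example. The parameter `tenure` plays the role of L in the pseudocode.

```python
import math

def tabu_search(z, E, x0, tenure, max_iters):
    """Tabu search over subsets of E; flipping i is tabu while t <= last_flip[i] + tenure."""
    x, best = frozenset(x0), frozenset(x0)
    last_flip = {i: -math.inf for i in E}
    for t in range(1, max_iters + 1):
        chosen, chosen_i, chosen_z = None, None, math.inf
        for i in E:
            y = x ^ frozenset([i])
            zy = z(y)
            tabu = t <= last_flip[i] + tenure
            # Aspiration: a tabu move is allowed if it beats the incumbent.
            if zy < chosen_z and (not tabu or zy < z(best)):
                chosen, chosen_i, chosen_z = y, i, zy
        if chosen is None:
            # Everything is tabu: take the move with the oldest tabu status.
            chosen_i = min(E, key=lambda i: last_flip[i])
            chosen = x ^ frozenset([chosen_i])
            chosen_z = z(chosen)
        x = chosen
        last_flip[chosen_i] = t
        if chosen_z < z(best):
            best = x
    return best, z(best)

# Usage on a 5-element instance where only the full set has a negative value.
ground = list(range(5))
full = frozenset(ground)
z_trap = lambda x: sum(1 + 0.01 * i for i in x) - (6 if x == full else 0)
best, value = tabu_search(z_trap, ground, [], 3, 20)
```

On this small instance a tenure of 3 is enough to climb out of the basin of the empty set, because the elements just inserted cannot be removed immediately.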

Tabu attributes

Some possible definitions of “attribute”:

• an element belongs to the solution ($A(x) = x$): when the move from $x$ to $x'$ deletes an element $i$ from the solution, the tabu status forbids the reinsertion of $i$ in the next $L$ iterations; every solution with element $i$ becomes tabu;

• an element does not belong to the solution ($A(x) = E \setminus x$): when the move from $x$ to $x'$ inserts an element $i$ in the solution, the tabu status forbids the deletion of $i$ in the next $L$ iterations; every solution without element $i$ becomes tabu.

It is common to use several attributes together, each one with its own tabu tenure and tabu list (e.g., after replacing $i$ with $j$, it is forbidden to delete $j$ for $L_{in}$ iterations and to reinsert $i$ for $L_{out}$ iterations, with $L_{in} \ne L_{out}$).

Tabu attributes

Other examples of attributes:

• the value of the objective function

• the value of an auxiliary function (e.g., the distance from the bestincumbent solution)

Complex attributes can be obtained by combining simple ones:

• if a move from x to x′ replaces element i with element j, we can forbid the replacement of j with i, but we can allow for deleting j only or inserting i only.

Efficient evaluation of the tabu status

Even when it is based on attributes, the evaluation of the tabu status of a solution must be efficient: scanning the whole solution is not acceptable. Attributes are associated with moves, not with solutions.

The evaluation can be done in constant time by recording in a data-structure the iteration in which the tabu status begins, for each attribute.

When insertions are tabu (the attribute is the presence of an element):

• at iteration $t$, it is tabu to insert any $i \in E \setminus x$ such that $t \le T^{in}_i + L^{in}$;

• at iteration $t$, we set $T^{in}_i := t$ for each $i$ just deleted from $x$.

When deletions are tabu (the attribute is the absence of an element):

• at iteration $t$, it is tabu to delete any $i \in x$ such that $t \le T^{out}_i + L^{out}$;

• at iteration $t$, we set $T^{out}_i := t$ for each $i$ just inserted into $x$.

If both are used, one vector is enough, since either $i \in x$ or $i \in E \setminus x$.

For more complex attributes, matrices or other data-structures are needed.
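The single-vector variant can be sketched as a small Python class. The class and method names are illustrative assumptions; the time stamps are initialized to −∞ so that no move is tabu at the start.

```python
import math

class FlipTabuList:
    """Constant-time tabu test on insertions and deletions of elements."""

    def __init__(self, elements, l_in, l_out):
        self.l_in, self.l_out = l_in, l_out
        # One vector suffices: at any time i is either in x or in E \ x.
        self.stamp = {i: -math.inf for i in elements}

    def is_tabu(self, i, in_solution, t):
        """Is flipping i tabu at iteration t? Deleting uses l_out, inserting l_in."""
        tenure = self.l_out if in_solution else self.l_in
        return t <= self.stamp[i] + tenure

    def record(self, i, t):
        """Store the iteration at which i was last inserted or deleted."""
        self.stamp[i] = t

tabu_list = FlipTabuList(range(4), l_in=3, l_out=1)
tabu_list.record(2, 1)          # element 2 is inserted at iteration 1
```

After this call, deleting element 2 is tabu at iteration 2 (since 2 ≤ 1 + 1) and allowed again from iteration 3.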

Example: the TSP

We consider the neighborhood $N_{R_2}$ generated by the 2-opt exchanges and we use both the presence and the absence of edges as attributes.

• Initially $T_{ij} := -\infty$ for every edge $(i, j) \in A$;

• at each iteration $t$, the algorithm scans the $n(n-1)/2$ pairs of edges that can be deleted and the corresponding pairs of edges that would replace them;

• the move $(i, j)$ that replaces $(s_i, s_{i+1})$ and $(s_j, s_{j+1})$ with $(s_i, s_j)$ and $(s_{i+1}, s_{j+1})$ is tabu at iteration $t$ if one of the following conditions holds:

1. $t \le T_{s_i, s_{i+1}} + L_{out}$

2. $t \le T_{s_j, s_{j+1}} + L_{out}$

3. $t \le T_{s_i, s_j} + L_{in}$

4. $t \le T_{s_{j+1}, s_{i+1}} + L_{in}$

• once the move $(i^*, j^*)$ has been chosen, the data-structures are updated:

1. $T_{s_{i^*}, s_{i^*+1}} := t$

2. $T_{s_{j^*}, s_{j^*+1}} := t$

3. $T_{s_{i^*}, s_{j^*}} := t$

4. $T_{s_{j^*+1}, s_{i^*+1}} := t$

Since $n$ edges belong to the solution and $n(n-2)$ do not, it is convenient to set $L_{out} \ll L_{in}$.
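The four tests and the four updates can be sketched as follows. For brevity the sketch treats edges as undirected and stores the time stamps in a dictionary with default −∞ rather than in a matrix; both choices, like the function names, are assumptions of this example.

```python
import math

def edge(u, v):
    """Undirected edge key, so that (u, v) and (v, u) share one time stamp."""
    return (u, v) if u < v else (v, u)

def is_tabu_2opt(tour, i, j, stamp, t, l_in, l_out):
    """Check the four conditions for the 2-opt move (i, j) at iteration t."""
    n = len(tour)
    si, si1 = tour[i], tour[(i + 1) % n]
    sj, sj1 = tour[j], tour[(j + 1) % n]
    never = -math.inf
    return (t <= stamp.get(edge(si, si1), never) + l_out or    # edge to delete
            t <= stamp.get(edge(sj, sj1), never) + l_out or    # edge to delete
            t <= stamp.get(edge(si, sj), never) + l_in or      # edge to insert
            t <= stamp.get(edge(si1, sj1), never) + l_in)      # edge to insert

def record_2opt(tour, i, j, stamp, t):
    """Time-stamp the two deleted and the two inserted edges of move (i, j)."""
    n = len(tour)
    si, si1 = tour[i], tour[(i + 1) % n]
    sj, sj1 = tour[j], tour[(j + 1) % n]
    for e in (edge(si, si1), edge(sj, sj1), edge(si, sj), edge(si1, sj1)):
        stamp[e] = t

tour = [0, 1, 2, 3, 4]
stamp = {}
free_before = is_tabu_2opt(tour, 0, 2, stamp, 1, 5, 2)
record_2opt(tour, 0, 2, stamp, 1)    # call it on the tour before applying the move
```

Each test costs constant time, so checking the tabu status does not change the complexity of scanning the 2-opt neighborhood.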

Example: the KP

The neighborhood $N_{H_1}$ contains the solutions at Hamming distance $\le 1$.

For simplicity we use the attribute “flipped variable”: a vector $T$ records when each variable $i \in E$ was flipped the last time. Let $L = 3$.

t = 1   T = [−∞ −∞ −∞ −∞]

t = 2   T = [−∞ −∞ 1 −∞]

t = 3   T = [−∞ 2 1 −∞]

Tuning the tabu tenure

The value $L$ of the tabu tenure is of paramount importance:

• too large values may hide the global optimum and in the worst case they block the search;

• too small values may leave the search in useless regions and in the worst case they allow for looping.

The best value for $L$

• in general depends on the size of the instance;

• often slowly increases with it (a recipe is $L \in O(\sqrt{n})$);

• almost constant values work fine also for different sizes.

Extracting $L$ at random from a range $[L_{min}; L_{max}]$ breaks loops.

Adaptive tabu tenures react to the results of the search, updating $L$ within a given range $[L_{min}; L_{max}]$:

• $L$ decreases when the current solution $x$ improves: the search is likely to approach a local optimum and one wants to intensify the search;

• $L$ increases when the current solution $x$ worsens: the search is likely to escape from a visited attraction basin and one wants to diversify the search.
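The adaptive rule can be sketched as a one-step update. The unit step and the function name are assumptions; the slides only state the direction of the change and the range.

```python
def adapt_tenure(tenure, z_old, z_new, l_min, l_max, step=1):
    """Shrink the tenure after an improving move, grow it after a worsening one."""
    if z_new < z_old:
        tenure -= step          # intensify: a local optimum is likely to be near
    elif z_new > z_old:
        tenure += step          # diversify: leave the visited attraction basin
    return max(l_min, min(l_max, tenure))   # keep the tenure within [l_min, l_max]
```

For instance, with the range [3; 7] a tenure of 5 becomes 4 after an improving move and 6 after a worsening one, and it never leaves the range.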

Variations

In the long run, adaptive techniques tend to lose their effectiveness.

Long-term strategies are employed:

• Reactive Tabu Search:
  • uses efficient data-structures to record visited solutions;
  • detects loops;
  • if solutions repeat too often, it shifts the range [Lmin; Lmax] to larger values.

• Frequency-based Tabu Search:
  • records the frequency of each attribute in the solution in data-structures similar to the tabu list;
  • if an attribute occurs very often
    • it favors the moves that remove it, by a modification of z (as in DLS);
    • or it forbids the moves that insert it, or penalizes them by a modification of z.

• Exploring Tabu Search: re-initializes the search from good solutions already found but never used as current solution (they are the “second best solutions” in some neighborhood).

• Granular Tabu Search: modifies the neighborhood by progressively enlarging it.
