Probabilistic Modeling for Combinatorial Optimization
Scott Davies
School of Computer Science
Carnegie Mellon University
Joint work with Shumeet Baluja (done partially at Justsystem Pittsburgh Research Center)
Combinatorial Optimization
Maximize an “evaluation function” f(x)
- Input: fixed-length bitstring x
- Output: real value

x might represent:
- job shop schedules
- TSP tours
- discretized numeric values
- etc.

Our focus: “Black Box” optimization, with no domain-dependent heuristics
Most Commonly Used Approaches
Hill-climbing, simulated annealing:
- Generate candidate solutions “neighboring” the single current working solution (e.g. differing by one bit)
- Typically make no attempt to model how particular bits affect solution quality

Genetic algorithms:
- Attempt to implicitly capture dependency of solution quality on bit values by maintaining a population of candidate solutions
- Use crossover and mutation operators on population members to generate new candidate solutions
Using Explicit Probabilistic Models
Maintain an explicit probability distribution P from which we generate new candidate solutions.
- Initialize P to the uniform distribution.
- Until termination criteria met:
  - Stochastically generate K candidate solutions from P
  - Evaluate them
  - Update P to make it more likely to generate solutions “similar” to the “good” solutions
Several different choices for what sorts of P to use and how to update it after candidate solution evaluation
Probability Distributions Over Bitstrings
Let x = (x1, x2, …, xn), where xi can take one of the values {0, 1} and n is the length of the bitstring.
Can factorize any distribution P(x1…xn) bit by bit:
P(x1,…,xn) = P(x1) P(x2 | x1) P(x3 | x1, x2)…P(xn|x1, …, xn-1)
In general, the above formula is just another way of representing a big lookup table with one entry for each of the 2^n possible bitstrings.
Obviously too many parameters to estimate from limited data!
Representing Independencies with Bayesian Networks
Graphical representation of probability distributions:
- Each variable is a vertex
- Each variable’s probability distribution is conditioned only on its parents in the directed acyclic graph (“dag”)
[Figure: example dag with nodes F = Wean on Fire, D = Office Door Open, I = Ice Cream Truck Nearby, A = Fire Alarm Activated, H = Hear Bells]

P(F,D,I,A,H) = P(F) * P(D) * P(I|F) * P(A|F) * P(H|D,I,A)
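To make the factorization concrete, here is a minimal ancestral-sampling sketch. Only the structure comes from the slide; every numeric probability below is invented for illustration.

```python
import random

# Ancestral sampling from the example network above. The conditional
# probability values are made up purely for illustration.
def sample_network():
    F = random.random() < 0.01                      # P(F):   Wean on Fire
    D = random.random() < 0.30                      # P(D):   Office Door Open
    I = random.random() < (0.02 if F else 0.10)     # P(I|F): Ice Cream Truck Nearby
    A = random.random() < (0.95 if F else 0.01)     # P(A|F): Fire Alarm Activated
    p_h = 0.99 if A else (0.70 if (I and D) else 0.05)
    H = random.random() < p_h                       # P(H|D,I,A): Hear Bells
    return F, D, I, A, H
```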
“Bayesian Networks” for Bitstring Optimization?
P(x1,…,xn) = P(x1) P(x2 | x1) P(x3 | x1, x2)…P(xn|x1, …, xn-1)
[Figure: fully connected network over x1, x2, x3, …, xn, with each bit conditioned on all earlier bits]
Yuck. Let’s just assume all the bits are independent instead.
(For now.)
[Figure: network over x1, x2, x3, …, xn with no edges: all bits independent]
Ah…much better.
P(x1,…,xn) = P(x1) P(x2) P(x3)…P(xn)
Population-Based Incremental Learning
Population-Based Incremental Learning (PBIL) [Baluja, 1995] maintains a vector of probabilities: one independent probability P(xi=1) for each bit xi.

Until termination criteria met:
- Generate a population of K bitstrings from P
- Evaluate them
- Use each of the best M of the K to update P as follows:

  P'(xi=1) = P(xi=1) + α · (1 − P(xi=1))   if xi is set to 1, or
  P'(xi=1) = (1 − α) · P(xi=1)             if xi is set to 0

  (α is the learning rate)
Optionally, also update P similarly with the bitwise complement of the worst of the K bitstrings, and/or “mutate” P with random tweaks
Return best bitstring ever evaluated
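A minimal Python sketch of this loop (parameter values are arbitrary; f is any evaluation function over 0/1 lists, higher is better; the optional complement-of-worst update and probability mutation are omitted):

```python
import random

# Minimal PBIL sketch.
def pbil(f, n_bits, pop_size=100, n_best=5, alpha=0.1, generations=1000):
    p = [0.5] * n_bits                       # uniform initial distribution
    best_x, best_val = None, float("-inf")
    for _ in range(generations):
        pop = [[int(random.random() < p[i]) for i in range(n_bits)]
               for _ in range(pop_size)]
        scored = sorted(((f(x), x) for x in pop), reverse=True)
        if scored[0][0] > best_val:
            best_val, best_x = scored[0]
        for _, x in scored[:n_best]:         # best M of the K samples
            for i in range(n_bits):
                # p + alpha*(x - p) equals p + alpha*(1 - p) when x = 1,
                # and (1 - alpha)*p when x = 0, matching the update rule.
                p[i] += alpha * (x[i] - p[i])
    return best_x, best_val
```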
PBIL vs. Discrete Learning Automata
Equivalent to a team of Discrete Learning Automata, one automaton per bit [Thathachar & Sastry, 1987]. Learning automata choose actions independently, but receive a common reinforcement signal dependent on all their actions.
PBIL update rule equivalent to linear reward-inaction algorithm [Hilgard & Bower, 1975] with “success” defined as “best in the bunch”
However, Discrete Learning Automata typically used previously in problems with few variables but noisy evaluation functions
PBIL vs. Genetic Algorithms
PBIL originated as tool for understanding GA behavior
Similar to Bit-Based Simulated Crossover (BSC) [Syswerda, 1993]:
- Regenerates P from scratch after every generation
- All K used to update P, weighted according to the probabilities that a GA would have selected them for reproduction

Why might normal GAs be better?
- They implicitly capture inter-bit dependencies with the population
- However, because the model is only implicit, crossover must be randomized
- Also, limited population size often leads to premature convergence based on noise in samples
Four Peaks Problem
Problem used in [Baluja & Caruana, 1995] to test how well GAs maintain multiple solutions before converging.
Given input vector X with N bits, and difficulty parameter T:

FourPeaks(T,X) = MAX(head(1,X), tail(0,X)) + Bonus(T,X)

head(b,X) = # of contiguous leading bits in X set to b
tail(b,X) = # of contiguous trailing bits in X set to b
Bonus(T,X) = 100 if (head(1,X) > T) AND (tail(0,X) > T), or 0 otherwise
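A direct Python transcription of these definitions:

```python
# Straightforward transcription of the Four Peaks evaluation function.
def head(b, x):
    """Number of contiguous leading bits of x equal to b."""
    count = 0
    for bit in x:
        if bit != b:
            break
        count += 1
    return count

def tail(b, x):
    return head(b, x[::-1])

def four_peaks(t, x):
    bonus = 100 if (head(1, x) > t and tail(0, x) > t) else 0
    return max(head(1, x), tail(0, x)) + bonus
```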
GA with one-point crossover can combine two solutions to get Bonus relatively easily?
Four Peaks Problem Results
80 GA settings tried; best five shown here. Average over 50 runs; 100 bits, varying T.

[Figure: Four Peaks results plot]
Large-Scale PBIL Empirical Comparison
Sorry, this slide is a placeholder for a big table on another non-PowerPoint slide!
Gist of results: 27 domains; comparison of:
- Two versions of PBIL
- Three versions of multiple-restart hillclimbing
- Two GAs

PBIL best on 22, hillclimbing best on 3, GA best on 2.
Modeling Inter-Bit Dependencies
How about automatically learning probability distributions in which at least some dependencies between variables are modeled?
Problem statement: given a dataset D and a set of allowable Bayesian networks {Bi}, find the Bi with the maximum posterior probability:
$$\arg\max_{B_i} P(B_i \mid D) = \arg\max_{B_i} P(B_i)\, P(D \mid B_i)$$
Likelihood Scoring Function

$$\log P(D \mid B) = \sum_{d^j \in D} \log P(d^j \mid B) = \sum_{d^j \in D} \sum_{i=1}^{n} \log P(x_i = d_i^j \mid \Pi_i = d_{\Pi_i}^j, B)$$

$$= |D| \sum_{i=1}^{n} \sum_{\text{all possible } x_i = a_i,\ \Pi_i = b_i} P'(x_i = a_i, \Pi_i = b_i) \log P(x_i = a_i \mid \Pi_i = b_i, B)$$

Where:
- d^j is the jth datapoint
- d_i^j is the value assigned to x_i by d^j
- Π_i is the set of x_i’s parents in B
- d_{Π_i}^j is the set of values assigned to Π_i by d^j
- P' is the empirical probability distribution exhibited by D
Likelihood Scoring Function, cont’d

Fact: given that B has a network structure S, the optimal probabilities to use in B are just the probabilities in D, i.e. P'. So:

Pick B to maximize:

$$\sum_{i=1}^{n} \sum_{\text{all possible } x_i = a_i,\ \Pi_i = b_i} P'(x_i = a_i, \Pi_i = b_i) \log P(x_i = a_i \mid \Pi_i = b_i, B)$$

Pick S to maximize:

$$\sum_{i=1}^{n} \sum_{\text{all possible } x_i = a_i,\ \Pi_i = b_i} P'(x_i = a_i, \Pi_i = b_i) \log P'(x_i = a_i \mid \Pi_i = b_i, S) = -\sum_{i=1}^{n} H'(x_i \mid \Pi_i, S)$$

where H' denotes empirical conditional entropy.
Entropy Calculation Example

[Figure: x20 has parents x17 and x32]

x20’s contribution to the score, computed from 80 total datapoints:

                  # times x20=0   # times x20=1   total # times
x17=0, x32=0            5               5              10
x17=0, x32=1            5              15              20
x17=1, x32=0           20              10              30
x17=1, x32=1           10              10              20

$$H'(x_{20} \mid x_{17}, x_{32}) = \frac{10}{80} H\left(\frac{5}{10}, \frac{5}{10}\right) + \frac{20}{80} H\left(\frac{5}{20}, \frac{15}{20}\right) + \frac{30}{80} H\left(\frac{20}{30}, \frac{10}{30}\right) + \frac{20}{80} H\left(\frac{10}{20}, \frac{10}{20}\right)$$

where H(p,q) = −p log p − q log q.
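A short sketch checking this computation numerically (base-2 logs assumed):

```python
from math import log2

# Conditional entropy H'(x20 | x17, x32) from the contingency table above.
# counts[(x17, x32)] = (# times x20=0, # times x20=1)
counts = {(0, 0): (5, 5), (0, 1): (5, 15), (1, 0): (20, 10), (1, 1): (10, 10)}

def H(*probs):
    """Entropy of a discrete distribution (in bits)."""
    return -sum(p * log2(p) for p in probs if p > 0)

total = sum(n0 + n1 for n0, n1 in counts.values())   # 80 datapoints
h = sum((n0 + n1) / total * H(n0 / (n0 + n1), n1 / (n0 + n1))
        for n0, n1 in counts.values())
print(h)   # about 0.92 bits
```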
Single-Parent Tree-Shaped Networks
Now let’s allow each bit to be conditioned on at most one other bit.

[Figure: example tree rooted at x17, with descendants such as x5, x22, x2, x6, x19, x13, x14, x8]
Adding an arc from xj to xi increases the network score by

H(xi) − H(xi|xj)   (xj’s “information gain” with xi)
= H(xj) − H(xj|xi)   (not necessarily obvious, but true)
= I(xi, xj)   (the mutual information between xi and xj)
Optimal Single-Parent Tree-Shaped Networks
To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj [Chow and Liu, 1968]:
- Start with an arbitrary root node xr.
- Until all n nodes have been added to the tree:
  - Of all pairs of nodes xin and xout, where xin has already been added to the tree but xout has not, find the pair with the largest I(xin, xout).
  - Add xout to the tree with xin as its parent.

Can be done in O(n²) time (assuming D has already been reduced to sufficient statistics), as sketched below.
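A Python sketch of the construction (Prim’s algorithm with mutual-information weights; for the O(n²) bound, the I(xi, xj) matrix would be precomputed from sufficient statistics rather than recomputed as here):

```python
import numpy as np

# Empirical mutual information between bits i and j of a 0/1 dataset.
def mutual_info(data, i, j):
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((data[:, i] == a) & (data[:, j] == b))
            p_a = np.mean(data[:, i] == a)
            p_b = np.mean(data[:, j] == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# Grow the maximum spanning tree one node at a time (Prim's algorithm).
# Returns parent[i] for every node, with -1 marking the root.
def chow_liu_tree(data):
    n = data.shape[1]
    parent = [-1] * n
    in_tree = {0}                      # arbitrary root: bit 0
    while len(in_tree) < n:
        x_in, x_out = max(((i, j) for i in in_tree
                           for j in range(n) if j not in in_tree),
                          key=lambda e: mutual_info(data, e[0], e[1]))
        parent[x_out] = x_in
        in_tree.add(x_out)
    return parent
```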
Optimal Dependency Trees for Combinatorial Optimization
[Baluja & Davies, 1997]

- Start with a dataset D initialized from the uniform distribution.
- Until termination criteria met:
  - Build the optimal dependency tree T with which to model D.
  - Generate K bitstrings from the probability distribution represented by T. Evaluate them.
  - Add the best M bitstrings to D after decaying the weight of all datapoints already in D by a factor between 0 and 1.
- Return the best bitstring ever evaluated.

Running time: O(K·n + M·n²) per iteration.
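The generation step can be sketched as ancestral sampling from the learned tree (reusing chow_liu_tree above; datapoint weighting is omitted for brevity):

```python
import random
import numpy as np

# Sample one bitstring from the tree: fix the root from its marginal,
# then sample each child conditioned on its parent's sampled value.
def sample_from_tree(data, parent):
    n = data.shape[1]
    children = [[] for _ in range(n)]
    for i, p in enumerate(parent):
        if p >= 0:
            children[p].append(i)
    root = parent.index(-1)
    x = [0] * n
    x[root] = int(random.random() < np.mean(data[:, root]))
    stack = [root]
    while stack:
        p = stack.pop()
        for c in children[p]:
            rows = data[data[:, p] == x[p]]            # condition on parent
            p1 = np.mean(rows[:, c]) if len(rows) else 0.5
            x[c] = int(random.random() < p1)
            stack.append(c)
    return x
```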
Tree-based Optimization vs. MIMIC
Tree-based optimization algorithm inspired by Mutual Information Maximization for Input Clustering (MIMIC) [De Bonet, et al., 1997], which learned chain-shaped networks rather than tree-shaped networks.

Dataset: best N% of all bitstrings ever evaluated.
+: dataset has a simple, well-defined interpretation
-: have to remember bitstrings
-: seems to converge too quickly on some larger problems

[Figure: example chain-shaped network over bits such as x4, x8, x17, x5]
MIMIC dataset vs. Exp. Decay dataset
Solving a system of linear equations the hard way: error of best solution as a function of generation #, averaged over 10 problems; 81 bits.

[Figure: error vs. generation for the MIMIC-style and exponential-decay datasets]
Graph-Coloring Example
Noisy graph-coloring example: for each edge connecting vertices of different colors, add +1 to the evaluation function with probability 0.5.

[Figure: example graph]
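A sketch of this noisy evaluation for the two-color case (the one-bit-per-vertex encoding is an assumption here):

```python
import random

# Noisy two-coloring score: each properly-colored edge contributes +1
# with probability 0.5. bits[v] is vertex v's color; edges is a list
# of (u, v) pairs.
def noisy_coloring_score(bits, edges):
    score = 0
    for u, v in edges:
        if bits[u] != bits[v] and random.random() < 0.5:
            score += 1
    return score
```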
Peaks problems results
Fraction of time Bonus rewarded in 100 runs.

[Four results plots omitted.]
Checkerboard problem results
16×16 grid of bits. For each bit in the middle 14×14 subgrid, add +1 to the evaluation for each of its four neighbors set to the opposite value.
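A direct sketch of this evaluation (row-major flat bitstring assumed):

```python
# Checkerboard evaluation: reward opposite-valued neighbors for every
# bit in the interior 14x14 subgrid of a 16x16 grid.
def checkerboard_score(bits, size=16):
    grid = [bits[r * size:(r + 1) * size] for r in range(size)]
    score = 0
    for r in range(1, size - 1):
        for c in range(1, size - 1):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if grid[r + dr][c + dc] != grid[r][c]:
                    score += 1
    return score   # maximum possible: 14 * 14 * 4 = 784
```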
[Figure: checkerboard results plot]
Genetic Algorithm (avg: 740)
PBIL (avg: 742)
Chain (avg: 760)
Tree (avg: 776)
Linear Equations Results
81 bits again; 9 problems, each with results averaged over 25 runs. Minimize error.

Problem #   Tree   Chain   95% sig Tree < Chain   PBIL   GA mut .001   GA mut .005   GA3 mut .02
1            313     393   Yes                    2306   3994          2779          3737
2            721    1537   Yes                    3825   4132          1812          4096
3            405     544   Yes                    2981   3361          2493          4069
4            706    1347   Yes                    3204   3554          2313          3921
5            848    1255   Yes                    3031   3370          2198          3511
6            648     520   No                     2248   3561          1975          3641
7            692    1034   Yes                    3331   3719          2252          3719
8            708    1029   Yes                    2996   3100          2076          3841
9            577     904   No                     3135   3537          1875          3837
Modeling Higher-Order Dependencies
The maximum spanning tree algorithm gives us the optimal Bayesian Network in which each node has at most one parent.
What about finding the best network in which each node has at most K parents, for K > 1? NP-complete problem! [Chickering, et al., 1995] However, search heuristics such as hillclimbing can be used to look for “good” network structures (e.g., [Heckerman, et al., 1995]).
Scoring Function for Arbitrary Networks
Rather than restricting K directly, add a penalty term to the scoring function to limit total network size:
- Equivalent to priors favoring simpler network structures
- Alternatively, lends itself nicely to an MDL interpretation
- Size of the penalty controls the exploration/exploitation tradeoff

$$\log P(B \mid D) = -|D| \sum_{i=1}^{n} H'(x_i \mid \Pi_i, B) - \lambda\,|B| + \text{const}$$

where λ is the penalty factor and |B| is the number of parameters in B.
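A sketch of this penalized score (base-2 logs and the exact form of the penalty term are assumptions of this reconstruction):

```python
from math import log2
import numpy as np

# Penalized score: -|D| * (sum of empirical conditional entropies), minus
# lam times the number of parameters. parents[i] lists bit i's parents;
# data is a (num_samples, n) 0/1 array.
def penalized_score(data, parents, lam=0.5):
    num, n = data.shape
    score = 0.0
    for i in range(n):
        groups = {}                    # samples grouped by parent values
        for row in data:
            groups.setdefault(tuple(row[parents[i]]), []).append(row[i])
        h = 0.0
        for vals in groups.values():
            p1 = sum(vals) / len(vals)
            for p in (p1, 1.0 - p1):
                if p > 0:
                    h -= (len(vals) / num) * p * log2(p)
        score -= num * h                      # -|D| * H'(x_i | parents_i)
        score -= lam * 2 ** len(parents[i])   # one parameter per parent setting
    return score
```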
Bayesian Network-Based Combinatorial Optimization
- Initialize D with C bitstrings from the uniform distribution, and Bayesian network B to the empty network containing no edges.
- Until termination criteria met:
  - Perform steepest-ascent hillclimbing from B to find a locally optimal network B'. Repeat until no changes increase the score:
    - Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score.
    - Perform the change that maximizes the increase in score.
  - Set B to B'.
  - Generate and evaluate K bitstrings from B.
  - Decay the weight of datapoints in D by a factor between 0 and 1.
  - Add the best M of the K recently generated datapoints to D.
- Return the best bitstring ever evaluated.
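A sketch of the structure search, reusing penalized_score above (arc reversal is omitted for brevity, and the acyclicity check is naive; the efficiency hacks described on the next slide are not shown):

```python
# Detect whether any node can reach itself by following parent links.
def creates_cycle(parents):
    def reachable(start):
        seen, stack = set(), list(parents[start])
        while stack:
            v = stack.pop()
            if v == start:
                return True
            if v not in seen:
                seen.add(v)
                stack.extend(parents[v])
        return False
    return any(reachable(i) for i in range(len(parents)))

# Steepest-ascent search over structures: per pass, apply the single
# arc addition or deletion that most increases the penalized score.
def hillclimb_structure(data, parents):
    n = data.shape[1]
    current = penalized_score(data, parents)
    while True:
        best_trial, best_score = None, current
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                trial = [list(p) for p in parents]
                if j in trial[i]:
                    trial[i].remove(j)     # candidate: delete arc j -> i
                else:
                    trial[i].append(j)     # candidate: add arc j -> i
                    if creates_cycle(trial):
                        continue
                s = penalized_score(data, trial)
                if s > best_score:
                    best_trial, best_score = trial, s
        if best_trial is None:             # no change increases the score
            return parents
        parents, current = best_trial, best_score

# Usage: start from the empty network, e.g.
#   parents = hillclimb_structure(data, [[] for _ in range(data.shape[1])])
```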
Cutting Computational Costs
Can cache contingency tables for all possible one-arc changes to the network structure:
- Only have to recompute scores associated with at most two nodes after an arc is added, removed, or reversed.
- Prevents having to slog through the entire dataset recomputing score changes for every possible arc change when the dataset changes.
- However, kiss memory goodbye.

Only a few network structure changes are required after each iteration, since the dataset hasn’t changed much (one or two structural changes on average).
Evolution of Network Complexity
100 bits; +1 added to evaluation function for every bit set to the parity of the previous K bits
[Figure: network complexity over the course of the run]
Summation Cancellation
Minimize magnitudes of the cumulative sums of discretized numeric parameters (s1, …, sN), each represented with a standard binary encoding, with si ∈ [−0.16, 0.15]:

$$y_1 = s_1, \qquad y_i = s_i + y_{i-1} \text{ for } i = 2 \ldots N$$

$$f = \frac{1}{10^{-5} + \sum_{i=1}^{N} |y_i|}$$

Parameters   Bits   Bayesian Network   Tree   PBIL   GA
10            50    2797               53.7   21     23.6
12            60    32.2               29.3   16.1   18.8
15            75    20.3               16.8   11.2   14.6
20           100    12.4               11     7.5    8.9

Average value over 50 runs of best solution found in 2000 generations.
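A sketch of this evaluation (the form of f above is a reconstruction, so it is an assumption here; decoding bits into the si values is omitted):

```python
# Summation cancellation: reward parameter settings whose running sums
# stay near zero. s is the list of decoded parameter values.
def summation_cancellation(s):
    y, total = 0.0, 0.0
    for si in s:
        y += si                      # y_i = s_i + y_{i-1}
        total += abs(y)
    return 1.0 / (1e-5 + total)      # large when the cumulative sums cancel
```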
Bayesian Networks: Empirical Results Summary
Does better than the tree-based optimization algorithm on some toy problems:
- Significantly better on the “Summation Cancellation” problem
- 10% reduction in error on the System of Linear Equations problems

Roughly the same as the tree-based algorithm on others, e.g. small Knapsack problems.

Significantly more computation despite the efficiency hacks, however.

Why not much better results?
- Too much emphasis on exploitation rather than exploration?
- Steepest-ascent hillclimbing over network structures not good enough, particularly when starting from old networks?
Using Probabilistic Models for Intelligent Restarts
Tree-based algorithm’s O(n²) execution time per generation is very expensive for large problems, and even more so for more complicated Bayesian networks.

One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing.
COMIT
Combining Optimizers with Mutual Information Trees [Baluja & Davies, 1997b]:
- Initialize dataset D with bitstrings drawn from the uniform distribution.
- Until termination criteria met:
  - Build the optimal dependency tree T with which to model D.
  - Generate K bitstrings from the distribution represented by T. Evaluate them.
  - Execute a hillclimbing run starting from the single best bitstring of these K.
  - Replace up to M bitstrings in D with the best bitstrings found during the hillclimbing run.
- Return the best bitstring ever evaluated.
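A high-level sketch reusing chow_liu_tree and sample_from_tree from earlier (the hillclimb routine, assumed to return the bitstrings visited on one run, is left abstract; keeping the dataset at exactly M rows is a simplification):

```python
import numpy as np

# COMIT outer loop: tree-guided restarts for a hillclimber. f scores a
# bitstring; hillclimb(f, start) returns the bitstrings found on one run.
def comit(f, n_bits, hillclimb, K=1000, M=100, iterations=50):
    D = np.random.randint(0, 2, size=(M, n_bits))   # uniform initial dataset
    best_x, best_val = None, float("-inf")
    for _ in range(iterations):
        parent = chow_liu_tree(D)
        samples = [sample_from_tree(D, parent) for _ in range(K)]
        start = max(samples, key=f)                 # best of the K samples
        found = hillclimb(f, start)
        for x in found:
            if f(x) > best_val:
                best_x, best_val = list(x), f(x)
        new_rows = sorted(found, key=f, reverse=True)[:M]
        D[:len(new_rows)] = new_rows                # replace up to M datapoints
    return best_x, best_val
```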
COMIT, cont’d
Empirical tests performed with a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before restarting.

We compare COMIT vs.:
- Hillclimbing with restarts from bitstrings chosen randomly from the uniform distribution
- Hillclimbing with restarts from the best bitstring out of K chosen randomly from the uniform distribution
COMIT: Example of Behavior
TSP domain: 100 cities, 700 bits.

[Figure: tour length (×10³) vs. evaluation number (×10³) for the Hillclimber and COMIT]
COMIT: Empirical Comparisons
Each number is an average over at least 25 runs. Highlighted: better than each non-COMIT hillclimber with P > 95%.
- AHCxxx: pick best of xxx randomly generated starting points before hillclimbing
- COMITxxx: pick best of xxx starting points generated by the tree before hillclimbing

Problem             Min/Max   # bits   HC         AHC100     AHC1000    COMIT100   COMIT1000
TSP                 Min       700      1629       1599       1573       1335       1336
Jobshop 1           Min       500      998        988        982        978        970
Jobshop 2           Min       700      965        961        957        954        953
Jobshop 3           Min       700      1207       1200       1199       1196       1195
Bin-packing         Min       504      1.70E-03   1.58E-03   1.62E-03   1.56E-03   1.45E-03
Knapsack            Max       512      3238       3377       3335       6684       6259
Rectangle-packing   Min       525      8.32E+06   8.23E+06   8.25E+06   8.24E+06   8.21E+06
Sum. Canc.          Min       675      64         61         59         54         52
Summary
PBIL uses a very simple probability distribution from which new solutions are generated, yet works surprisingly well
Algorithm using tree-based distributions seems to work even better, though at significantly more computational expense
More sophisticated networks: past the point of diminishing marginal returns? “Future research”
COMIT makes tree-based algorithm applicable to much larger problems
Future Work
Making the algorithm based on complex Bayesian networks more practical: combine with simpler search algorithms, à la COMIT?
Applying COMIT to more interesting problems WALKSAT?
Using COMIT to combine results of multiple search algorithms
Optimization in real-valued state spaces. What sorts of PDF representations might be useful?
- Gaussians?
- Kernel-based representations?
- Hierarchical representations?
- Hands off :-)
Acknowledgements
Shumeet Baluja
Justsystem Pittsburgh Research Center
Doug Baker, Justin Boyan, Lonnie Chrisman, Greg Cooper, Geoff Gordon, Andrew Moore, …?