Probabilistic Modeling for Combinatorial Optimization
Scott Davies
School of Computer Science
Carnegie Mellon University
Joint work with Shumeet Baluja (done partially at Justsystem Pittsburgh Research Center)
Combinatorial Optimization
Maximize an “evaluation function” f(x)
- Input: fixed-length bitstring x
- Output: real value

x might represent:
- job shop schedules
- TSP tours
- discretized numeric values
- etc.

Our focus: “Black Box” optimization, with no domain-dependent heuristics
Most Commonly Used Approaches
Hill-climbing, simulated annealing:
- Generate candidate solutions “neighboring” the single current working solution (e.g. differing by one bit)
- Typically make no attempt to model how particular bits affect solution quality

Genetic algorithms:
- Attempt to implicitly capture dependency of solution quality on bit values by maintaining a population of candidate solutions
- Use crossover and mutation operators on population members to generate new candidate solutions
Using Explicit Probabilistic Models
Maintain an explicit probability distribution P from which we generate new candidate solutions.
- Initialize P to the uniform distribution.
- Until termination criteria met:
  - Stochastically generate K candidate solutions from P
  - Evaluate them
  - Update P to make it more likely to generate solutions “similar” to the “good” solutions
Several different choices for what sorts of P to use and how to update it after candidate solution evaluation
Probability Distributions Over Bitstrings
Let x = (x1, x2, …, xn), where xi can take one of the values {0, 1} and n is the length of the bitstring.
Can factorize any distribution P(x1…xn) bit by bit:
P(x1,…,xn) = P(x1) P(x2 | x1) P(x3 | x1, x2)…P(xn|x1, …, xn-1)
In general, the above formula is just another way of representing a big lookup table with one entry for each of the 2^n possible bitstrings.
Obviously too many parameters to estimate from limited data!
Representing Independencies with Bayesian Networks
Graphical representation of probability distributions:
- Each variable is a vertex
- Each variable’s probability distribution is conditioned only on its parents in the directed acyclic graph (“dag”)
[Figure: example dag with nodes F = Wean on Fire, D = Office Door Open, I = Ice Cream Truck Nearby, A = Fire Alarm Activated, H = Hear Bells]

P(F,D,I,A,H) = P(F) * P(D) * P(I|F) * P(A|F) * P(H|D,I,A)
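To make the factorization concrete, here is a minimal ancestral-sampling sketch. Only the structure comes from the slide; every numeric probability below is invented for illustration.

```python
import random

# Ancestral sampling from the example network above. The conditional
# probability values are made up purely for illustration.
def sample_network():
    F = random.random() < 0.01                      # P(F):   Wean on Fire
    D = random.random() < 0.30                      # P(D):   Office Door Open
    I = random.random() < (0.02 if F else 0.10)     # P(I|F): Ice Cream Truck Nearby
    A = random.random() < (0.95 if F else 0.01)     # P(A|F): Fire Alarm Activated
    p_h = 0.99 if A else (0.70 if (I and D) else 0.05)
    H = random.random() < p_h                       # P(H|D,I,A): Hear Bells
    return F, D, I, A, H
```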
“Bayesian Networks” for Bitstring Optimization?
P(x1,…,xn) = P(x1) P(x2 | x1) P(x3 | x1, x2)…P(xn|x1, …, xn-1)
[Figure: fully connected network over x1, x2, x3, …, xn, with each bit conditioned on all earlier bits]
Yuck. Let’s just assume all the bits are independent instead.
(For now.)
[Figure: network over x1, x2, x3, …, xn with no edges: all bits independent]
Ah…much better.
P(x1,…,xn) = P(x1) P(x2) P(x3)…P(xn)
Population-Based Incremental Learning
Population-Based Incremental Learning (PBIL) [Baluja, 1995] maintains a vector of probabilities: one independent probability P(xi=1) for each bit xi.

Until termination criteria met:
- Generate a population of K bitstrings from P
- Evaluate them
- Use each of the best M of the K to update P as follows:

  P'(xi=1) = P(xi=1) + α · (1 − P(xi=1))   if xi is set to 1, or
  P'(xi=1) = (1 − α) · P(xi=1)             if xi is set to 0

  (α is the learning rate)
Optionally, also update P similarly with the bitwise complement of the worst of the K bitstrings, and/or “mutate” P with random tweaks
Return best bitstring ever evaluated
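A minimal Python sketch of this loop (parameter values are arbitrary; f is any evaluation function over 0/1 lists, higher is better; the optional complement-of-worst update and probability mutation are omitted):

```python
import random

# Minimal PBIL sketch.
def pbil(f, n_bits, pop_size=100, n_best=5, alpha=0.1, generations=1000):
    p = [0.5] * n_bits                       # uniform initial distribution
    best_x, best_val = None, float("-inf")
    for _ in range(generations):
        pop = [[int(random.random() < p[i]) for i in range(n_bits)]
               for _ in range(pop_size)]
        scored = sorted(((f(x), x) for x in pop), reverse=True)
        if scored[0][0] > best_val:
            best_val, best_x = scored[0]
        for _, x in scored[:n_best]:         # best M of the K samples
            for i in range(n_bits):
                # p + alpha*(x - p) equals p + alpha*(1 - p) when x = 1,
                # and (1 - alpha)*p when x = 0, matching the update rule.
                p[i] += alpha * (x[i] - p[i])
    return best_x, best_val
```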
PBIL vs. Discrete Learning Automata
Equivalent to a team of Discrete Learning Automata, one automaton per bit [Thathachar & Sastry, 1987]. Learning automata choose actions independently, but receive a common reinforcement signal dependent on all their actions.
PBIL update rule equivalent to linear reward-inaction algorithm [Hilgard & Bower, 1975] with “success” defined as “best in the bunch”
However, Discrete Learning Automata typically used previously in problems with few variables but noisy evaluation functions
PBIL vs. Genetic Algorithms
PBIL originated as tool for understanding GA behavior
Similar to Bit-Based Simulated Crossover (BSC) [Syswerda, 1993]:
- Regenerates P from scratch after every generation
- All K used to update P, weighted according to the probabilities that a GA would have selected them for reproduction

Why might normal GAs be better?
- They implicitly capture inter-bit dependencies with the population
- However, because the model is only implicit, crossover must be randomized
- Also, limited population size often leads to premature convergence based on noise in samples
Four Peaks Problem
Problem used in [Baluja & Caruana, 1995] to test how well GAs maintain multiple solutions before converging.
Given input vector X with N bits, and difficulty parameter T:

FourPeaks(T,X) = MAX(head(1,X), tail(0,X)) + Bonus(T,X)

head(b,X) = # of contiguous leading bits in X set to b
tail(b,X) = # of contiguous trailing bits in X set to b
Bonus(T,X) = 100 if (head(1,X) > T) AND (tail(0,X) > T), or 0 otherwise
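A direct Python transcription of these definitions:

```python
# Straightforward transcription of the Four Peaks evaluation function.
def head(b, x):
    """Number of contiguous leading bits of x equal to b."""
    count = 0
    for bit in x:
        if bit != b:
            break
        count += 1
    return count

def tail(b, x):
    return head(b, x[::-1])

def four_peaks(t, x):
    bonus = 100 if (head(1, x) > t and tail(0, x) > t) else 0
    return max(head(1, x), tail(0, x)) + bonus
```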
GA with one-point crossover can combine two solutions to get Bonus relatively easily?
Four Peaks Problem Results
80 GA settings tried; best five shown here. Average over 50 runs; 100 bits, varying T.

[Figure: Four Peaks results plot]
Large-Scale PBIL Empirical Comparison
Sorry, this slide is a placeholder for a big table on another non-PowerPoint slide!
Gist of results: 27 domains; comparison of:
- Two versions of PBIL
- Three versions of multiple-restart hillclimbing
- Two GAs

PBIL best on 22, hillclimbing best on 3, GA best on 2.
Modeling Inter-Bit Dependencies
How about automatically learning probability distributions in which at least some dependencies between variables are modeled?
Problem statement: given a dataset D and a set of allowable Bayesian networks {Bi}, find the Bi with the maximum posterior probability:
$$\arg\max_{B_i} P(B_i \mid D) = \arg\max_{B_i} P(B_i)\, P(D \mid B_i)$$
Likelihood Scoring Function

$$\log P(D \mid B) = \sum_{d^j \in D} \log P(d^j \mid B) = \sum_{d^j \in D} \sum_{i=1}^{n} \log P(x_i = d_i^j \mid \Pi_i = d_{\Pi_i}^j, B)$$

$$= |D| \sum_{i=1}^{n} \sum_{\text{all possible } x_i = a_i,\ \Pi_i = b_i} P'(x_i = a_i, \Pi_i = b_i) \log P(x_i = a_i \mid \Pi_i = b_i, B)$$

Where:
- d^j is the jth datapoint
- d_i^j is the value assigned to x_i by d^j
- Π_i is the set of x_i’s parents in B
- d_{Π_i}^j is the set of values assigned to Π_i by d^j
- P' is the empirical probability distribution exhibited by D
Likelihood Scoring Function, cont’d

Fact: given that B has a network structure S, the optimal probabilities to use in B are just the probabilities in D, i.e. P'. So:

Pick B to maximize:

$$\sum_{i=1}^{n} \sum_{\text{all possible } x_i = a_i,\ \Pi_i = b_i} P'(x_i = a_i, \Pi_i = b_i) \log P(x_i = a_i \mid \Pi_i = b_i, B)$$

Pick S to maximize:

$$\sum_{i=1}^{n} \sum_{\text{all possible } x_i = a_i,\ \Pi_i = b_i} P'(x_i = a_i, \Pi_i = b_i) \log P'(x_i = a_i \mid \Pi_i = b_i, S) = -\sum_{i=1}^{n} H'(x_i \mid \Pi_i, S)$$

where H' denotes empirical conditional entropy.
Entropy Calculation Example

[Figure: x20 has parents x17 and x32]

x20’s contribution to the score, computed from 80 total datapoints:

                  # times x20=0   # times x20=1   total # times
x17=0, x32=0            5               5              10
x17=0, x32=1            5              15              20
x17=1, x32=0           20              10              30
x17=1, x32=1           10              10              20

$$H'(x_{20} \mid x_{17}, x_{32}) = \frac{10}{80} H\left(\frac{5}{10}, \frac{5}{10}\right) + \frac{20}{80} H\left(\frac{5}{20}, \frac{15}{20}\right) + \frac{30}{80} H\left(\frac{20}{30}, \frac{10}{30}\right) + \frac{20}{80} H\left(\frac{10}{20}, \frac{10}{20}\right)$$

where H(p,q) = −p log p − q log q.
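A short sketch checking this computation numerically (base-2 logs assumed):

```python
from math import log2

# Conditional entropy H'(x20 | x17, x32) from the contingency table above.
# counts[(x17, x32)] = (# times x20=0, # times x20=1)
counts = {(0, 0): (5, 5), (0, 1): (5, 15), (1, 0): (20, 10), (1, 1): (10, 10)}

def H(*probs):
    """Entropy of a discrete distribution (in bits)."""
    return -sum(p * log2(p) for p in probs if p > 0)

total = sum(n0 + n1 for n0, n1 in counts.values())   # 80 datapoints
h = sum((n0 + n1) / total * H(n0 / (n0 + n1), n1 / (n0 + n1))
        for n0, n1 in counts.values())
print(h)   # about 0.92 bits
```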
Single-Parent Tree-Shaped Networks
Now let’s allow each bit to be conditioned on at most one other bit.

[Figure: example tree rooted at x17, with descendants such as x5, x22, x2, x6, x19, x13, x14, x8]
Adding an arc from xj to xi increases the network score by

H(xi) − H(xi|xj)   (xj’s “information gain” with xi)
= H(xj) − H(xj|xi)   (not necessarily obvious, but true)
= I(xi, xj)   (the mutual information between xi and xj)
Optimal Single-Parent Tree-Shaped Networks
To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj [Chow and Liu, 1968]:
- Start with an arbitrary root node xr.
- Until all n nodes have been added to the tree:
  - Of all pairs of nodes xin and xout, where xin has already been added to the tree but xout has not, find the pair with the largest I(xin, xout).
  - Add xout to the tree with xin as its parent.

Can be done in O(n²) time (assuming D has already been reduced to sufficient statistics), as sketched below.
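A Python sketch of the construction (Prim’s algorithm with mutual-information weights; for the O(n²) bound, the I(xi, xj) matrix would be precomputed from sufficient statistics rather than recomputed as here):

```python
import numpy as np

# Empirical mutual information between bits i and j of a 0/1 dataset.
def mutual_info(data, i, j):
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((data[:, i] == a) & (data[:, j] == b))
            p_a = np.mean(data[:, i] == a)
            p_b = np.mean(data[:, j] == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# Grow the maximum spanning tree one node at a time (Prim's algorithm).
# Returns parent[i] for every node, with -1 marking the root.
def chow_liu_tree(data):
    n = data.shape[1]
    parent = [-1] * n
    in_tree = {0}                      # arbitrary root: bit 0
    while len(in_tree) < n:
        x_in, x_out = max(((i, j) for i in in_tree
                           for j in range(n) if j not in in_tree),
                          key=lambda e: mutual_info(data, e[0], e[1]))
        parent[x_out] = x_in
        in_tree.add(x_out)
    return parent
```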
Optimal Dependency Trees for Combinatorial Optimization
[Baluja & Davies, 1997]

- Start with a dataset D initialized from the uniform distribution.
- Until termination criteria met:
  - Build the optimal dependency tree T with which to model D.
  - Generate K bitstrings from the probability distribution represented by T. Evaluate them.
  - Add the best M bitstrings to D after decaying the weight of all datapoints already in D by a factor between 0 and 1.
- Return the best bitstring ever evaluated.

Running time: O(K·n + M·n²) per iteration.
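The generation step can be sketched as ancestral sampling from the learned tree (reusing chow_liu_tree above; datapoint weighting is omitted for brevity):

```python
import random
import numpy as np

# Sample one bitstring from the tree: fix the root from its marginal,
# then sample each child conditioned on its parent's sampled value.
def sample_from_tree(data, parent):
    n = data.shape[1]
    children = [[] for _ in range(n)]
    for i, p in enumerate(parent):
        if p >= 0:
            children[p].append(i)
    root = parent.index(-1)
    x = [0] * n
    x[root] = int(random.random() < np.mean(data[:, root]))
    stack = [root]
    while stack:
        p = stack.pop()
        for c in children[p]:
            rows = data[data[:, p] == x[p]]            # condition on parent
            p1 = np.mean(rows[:, c]) if len(rows) else 0.5
            x[c] = int(random.random() < p1)
            stack.append(c)
    return x
```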
Tree-based Optimization vs. MIMIC
Tree-based optimization algorithm inspired by Mutual Information Maximization for Input Clustering (MIMIC) [De Bonet, et al., 1997], which learned chain-shaped networks rather than tree-shaped networks.

Dataset: best N% of all bitstrings ever evaluated.
+: dataset has a simple, well-defined interpretation
-: have to remember bitstrings
-: seems to converge too quickly on some larger problems

[Figure: example chain-shaped network over bits such as x4, x8, x17, x5]
MIMIC dataset vs. Exp. Decay dataset
Solving a system of linear equations the hard way: error of best solution as a function of generation #, averaged over 10 problems; 81 bits.

[Figure: error vs. generation for the MIMIC-style and exponential-decay datasets]
Graph-Coloring Example
Noisy graph-coloring example: for each edge connecting vertices of different colors, add +1 to the evaluation function with probability 0.5.

[Figure: example graph]
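A sketch of this noisy evaluation for the two-color case (the one-bit-per-vertex encoding is an assumption here):

```python
import random

# Noisy two-coloring score: each properly-colored edge contributes +1
# with probability 0.5. bits[v] is vertex v's color; edges is a list
# of (u, v) pairs.
def noisy_coloring_score(bits, edges):
    score = 0
    for u, v in edges:
        if bits[u] != bits[v] and random.random() < 0.5:
            score += 1
    return score
```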
Peaks problems results
Fraction of time Bonus rewarded in 100 runs.

[Four results plots omitted.]
Checkerboard problem results
16×16 grid of bits. For each bit in the middle 14×14 subgrid, add +1 to the evaluation for each of its four neighbors set to the opposite value.
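A direct sketch of this evaluation (row-major flat bitstring assumed):

```python
# Checkerboard evaluation: reward opposite-valued neighbors for every
# bit in the interior 14x14 subgrid of a 16x16 grid.
def checkerboard_score(bits, size=16):
    grid = [bits[r * size:(r + 1) * size] for r in range(size)]
    score = 0
    for r in range(1, size - 1):
        for c in range(1, size - 1):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if grid[r + dr][c + dc] != grid[r][c]:
                    score += 1
    return score   # maximum possible: 14 * 14 * 4 = 784
```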
[Figure: checkerboard results plot]
Genetic Algorithm (avg: 740)
PBIL (avg: 742)
Chain (avg: 760)
Tree (avg: 776)
Linear Equations Results
81 bits again; 9 problems, each with results averaged over 25 runs. Minimize error.

Problem #   Tree   Chain   95% sig Tree < Chain   PBIL   GA mut .001   GA mut .005   GA3 mut .02
1            313     393   Yes                    2306   3994          2779          3737
2            721    1537   Yes                    3825   4132          1812          4096
3            405     544   Yes                    2981   3361          2493          4069
4            706    1347   Yes                    3204   3554          2313          3921
5            848    1255   Yes                    3031   3370          2198          3511
6            648     520   No                     2248   3561          1975          3641
7            692    1034   Yes                    3331   3719          2252          3719
8            708    1029   Yes                    2996   3100          2076          3841
9            577     904   No                     3135   3537          1875          3837
Modeling Higher-Order Dependencies
The maximum spanning tree algorithm gives us the optimal Bayesian Network in which each node has at most one parent.
What about finding the best network in which each node has at most K parents, for K > 1? NP-complete problem! [Chickering, et al., 1995] However, search heuristics such as hillclimbing can be used to look for “good” network structures (e.g., [Heckerman, et al., 1995]).
Scoring Function for Arbitrary Networks
Rather than restricting K directly, add a penalty term to the scoring function to limit total network size:
- Equivalent to priors favoring simpler network structures
- Alternatively, lends itself nicely to an MDL interpretation
- Size of the penalty controls the exploration/exploitation tradeoff

$$\log P(B \mid D) = -|D| \sum_{i=1}^{n} H'(x_i \mid \Pi_i, B) - \lambda\,|B| + \text{const}$$

where λ is the penalty factor and |B| is the number of parameters in B.
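A sketch of this penalized score (base-2 logs and the exact form of the penalty term are assumptions of this reconstruction):

```python
from math import log2
import numpy as np

# Penalized score: -|D| * (sum of empirical conditional entropies), minus
# lam times the number of parameters. parents[i] lists bit i's parents;
# data is a (num_samples, n) 0/1 array.
def penalized_score(data, parents, lam=0.5):
    num, n = data.shape
    score = 0.0
    for i in range(n):
        groups = {}                    # samples grouped by parent values
        for row in data:
            groups.setdefault(tuple(row[parents[i]]), []).append(row[i])
        h = 0.0
        for vals in groups.values():
            p1 = sum(vals) / len(vals)
            for p in (p1, 1.0 - p1):
                if p > 0:
                    h -= (len(vals) / num) * p * log2(p)
        score -= num * h                      # -|D| * H'(x_i | parents_i)
        score -= lam * 2 ** len(parents[i])   # one parameter per parent setting
    return score
```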
Bayesian Network-Based Combinatorial Optimization
- Initialize D with C bitstrings from the uniform distribution, and Bayesian network B to the empty network containing no edges.
- Until termination criteria met:
  - Perform steepest-ascent hillclimbing from B to find a locally optimal network B'. Repeat until no changes increase the score:
    - Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score.
    - Perform the change that maximizes the increase in score.
  - Set B to B'.
  - Generate and evaluate K bitstrings from B.
  - Decay the weight of datapoints in D by a factor between 0 and 1.
  - Add the best M of the K recently generated datapoints to D.
- Return the best bitstring ever evaluated.
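A sketch of the structure search, reusing penalized_score above (arc reversal is omitted for brevity, and the acyclicity check is naive; the efficiency hacks described on the next slide are not shown):

```python
# Detect whether any node can reach itself by following parent links.
def creates_cycle(parents):
    def reachable(start):
        seen, stack = set(), list(parents[start])
        while stack:
            v = stack.pop()
            if v == start:
                return True
            if v not in seen:
                seen.add(v)
                stack.extend(parents[v])
        return False
    return any(reachable(i) for i in range(len(parents)))

# Steepest-ascent search over structures: per pass, apply the single
# arc addition or deletion that most increases the penalized score.
def hillclimb_structure(data, parents):
    n = data.shape[1]
    current = penalized_score(data, parents)
    while True:
        best_trial, best_score = None, current
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                trial = [list(p) for p in parents]
                if j in trial[i]:
                    trial[i].remove(j)     # candidate: delete arc j -> i
                else:
                    trial[i].append(j)     # candidate: add arc j -> i
                    if creates_cycle(trial):
                        continue
                s = penalized_score(data, trial)
                if s > best_score:
                    best_trial, best_score = trial, s
        if best_trial is None:             # no change increases the score
            return parents
        parents, current = best_trial, best_score

# Usage: start from the empty network, e.g.
#   parents = hillclimb_structure(data, [[] for _ in range(data.shape[1])])
```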
Cutting Computational Costs
Can cache contingency tables for all possible one-arc changes to the network structure:
- Only have to recompute scores associated with at most two nodes after an arc is added, removed, or reversed.
- Prevents having to slog through the entire dataset recomputing score changes for every possible arc change when the dataset changes.
- However, kiss memory goodbye.

Only a few network structure changes are required after each iteration, since the dataset hasn’t changed much (one or two structural changes on average).
Evolution of Network Complexity
100 bits; +1 added to evaluation function for every bit set to the parity of the previous K bits
[Figure: network complexity over the course of the run]
Summation Cancellation
Minimize magnitudes of the cumulative sums of discretized numeric parameters (s1, …, sN), each represented with a standard binary encoding, with si ∈ [−0.16, 0.15]:

$$y_1 = s_1, \qquad y_i = s_i + y_{i-1} \text{ for } i = 2 \ldots N$$

$$f = \frac{1}{10^{-5} + \sum_{i=1}^{N} |y_i|}$$

Parameters   Bits   Bayesian Network   Tree   PBIL   GA
10            50    2797               53.7   21     23.6
12            60    32.2               29.3   16.1   18.8
15            75    20.3               16.8   11.2   14.6
20           100    12.4               11     7.5    8.9

Average value over 50 runs of best solution found in 2000 generations.
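A sketch of this evaluation (the form of f above is a reconstruction, so it is an assumption here; decoding bits into the si values is omitted):

```python
# Summation cancellation: reward parameter settings whose running sums
# stay near zero. s is the list of decoded parameter values.
def summation_cancellation(s):
    y, total = 0.0, 0.0
    for si in s:
        y += si                      # y_i = s_i + y_{i-1}
        total += abs(y)
    return 1.0 / (1e-5 + total)      # large when the cumulative sums cancel
```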
Bayesian Networks: Empirical Results Summary
Does better than the tree-based optimization algorithm on some toy problems:
- Significantly better on the “Summation Cancellation” problem
- 10% reduction in error on the System of Linear Equations problems

Roughly the same as the tree-based algorithm on others, e.g. small Knapsack problems.

Significantly more computation despite the efficiency hacks, however.

Why not much better results?
- Too much emphasis on exploitation rather than exploration?
- Steepest-ascent hillclimbing over network structures not good enough, particularly when starting from old networks?
Using Probabilistic Models for Intelligent Restarts
Tree-based algorithm’s O(n²) execution time per generation is very expensive for large problems, and even more so for more complicated Bayesian networks.

One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing.
COMIT
Combining Optimizers with Mutual Information Trees [Baluja & Davies, 1997b]:
- Initialize dataset D with bitstrings drawn from the uniform distribution.
- Until termination criteria met:
  - Build the optimal dependency tree T with which to model D.
  - Generate K bitstrings from the distribution represented by T. Evaluate them.
  - Execute a hillclimbing run starting from the single best bitstring of these K.
  - Replace up to M bitstrings in D with the best bitstrings found during the hillclimbing run.
- Return the best bitstring ever evaluated.
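A high-level sketch reusing chow_liu_tree and sample_from_tree from earlier (the hillclimb routine, assumed to return the bitstrings visited on one run, is left abstract; keeping the dataset at exactly M rows is a simplification):

```python
import numpy as np

# COMIT outer loop: tree-guided restarts for a hillclimber. f scores a
# bitstring; hillclimb(f, start) returns the bitstrings found on one run.
def comit(f, n_bits, hillclimb, K=1000, M=100, iterations=50):
    D = np.random.randint(0, 2, size=(M, n_bits))   # uniform initial dataset
    best_x, best_val = None, float("-inf")
    for _ in range(iterations):
        parent = chow_liu_tree(D)
        samples = [sample_from_tree(D, parent) for _ in range(K)]
        start = max(samples, key=f)                 # best of the K samples
        found = hillclimb(f, start)
        for x in found:
            if f(x) > best_val:
                best_x, best_val = list(x), f(x)
        new_rows = sorted(found, key=f, reverse=True)[:M]
        D[:len(new_rows)] = new_rows                # replace up to M datapoints
    return best_x, best_val
```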
COMIT, cont’d
Empirical tests performed with a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before restarting.

We compare COMIT vs.:
- Hillclimbing with restarts from bitstrings chosen randomly from the uniform distribution
- Hillclimbing with restarts from the best bitstring out of K chosen randomly from the uniform distribution
COMIT: Example of Behavior
TSP domain: 100 cities, 700 bits.

[Figure: tour length (×10³) vs. evaluation number (×10³) for the Hillclimber and COMIT]
COMIT: Empirical Comparisons
Each number is an average over at least 25 runs. Highlighted: better than each non-COMIT hillclimber with P > 95%.
- AHCxxx: pick best of xxx randomly generated starting points before hillclimbing
- COMITxxx: pick best of xxx starting points generated by the tree before hillclimbing

Problem             Min/Max   # bits   HC         AHC100     AHC1000    COMIT100   COMIT1000
TSP                 Min       700      1629       1599       1573       1335       1336
Jobshop 1           Min       500      998        988        982        978        970
Jobshop 2           Min       700      965        961        957        954        953
Jobshop 3           Min       700      1207       1200       1199       1196       1195
Bin-packing         Min       504      1.70E-03   1.58E-03   1.62E-03   1.56E-03   1.45E-03
Knapsack            Max       512      3238       3377       3335       6684       6259
Rectangle-packing   Min       525      8.32E+06   8.23E+06   8.25E+06   8.24E+06   8.21E+06
Sum. Canc.          Min       675      64         61         59         54         52
Summary
PBIL uses a very simple probability distribution from which new solutions are generated, yet works surprisingly well
Algorithm using tree-based distributions seems to work even better, though at significantly more computational expense
More sophisticated networks: past the point of diminishing marginal returns? “Future research”
COMIT makes tree-based algorithm applicable to much larger problems
Future Work
Making the algorithm based on complex Bayesian networks more practical: combine with simpler search algorithms, à la COMIT?
Applying COMIT to more interesting problems WALKSAT?
Using COMIT to combine results of multiple search algorithms
Optimization in real-valued state spaces. What sorts of PDF representations might be useful?
- Gaussians?
- Kernel-based representations?
- Hierarchical representations?
- Hands off :-)
Acknowledgements
Shumeet Baluja
Justsystem Pittsburgh Research Center
Doug Baker, Justin Boyan, Lonnie Chrisman, Greg Cooper, Geoff Gordon, Andrew Moore, …?