
A Bayesian Approach to Tackling Hard Computational Problems

Eric Horvitz
Microsoft Research
Redmond, WA 98052
[email protected]

Yongshao Ruan
Comp. Sci. & Engr.
Univ. of Washington
Seattle, WA 98195
[email protected]

Carla Gomes
Dept. of Comp. Sci.
Cornell Univ.
Ithaca, NY 14853
[email protected]

Henry Kautz
Comp. Sci. & Engr.
Univ. of Washington
Seattle, WA 98195
[email protected]

Bart Selman
Dept. of Comp. Sci.
Cornell Univ.
Ithaca, NY 14853
[email protected]

Max Chickering
Microsoft Research
Redmond, WA 98052
[email protected]

Abstract

We describe research and results centering on the construction and use of Bayesian models that can predict the run time of problem solvers. Our efforts are motivated by observations of high variance in the time required to solve instances for several challenging problems. The methods have application to the decision-theoretic control of hard search and reasoning algorithms. We illustrate the approach with a focus on the task of predicting run time for general and domain-specific solvers on a hard class of structured constraint satisfaction problems. We review the use of learned models to predict the ultimate length of a trial, based on observing the behavior of the search algorithm during an early phase of a problem session. Finally, we discuss how we can employ the models to inform dynamic run-time decisions.

[Figure 1 is a diagram in which contextual, structural, and execution evidence about a problem solver feed a Bayesian learner that produces a predictive model of run time, which in turn informs design, real-time control, and feature refinement.]

Figure 1: Bayesian approach to problem solver design and optimization. We seek to learn predictive models to refine and control computational procedures as well as to gain insights about problem structure and hardness.

1 Introduction

The design of procedures for solving difficult problems relies on a combination of insight, observation, and iterative refinements that take into consideration the behavior of algorithms on problem instances. Complex, impenetrable relationships often arise in the process of problem solving, and such complexity leads to uncertainty about the basis for observed efficiencies and inefficiencies associated with specific problem instances. We believe that recent advances in Bayesian methods for learning predictive models from data offer valuable tools for designing, controlling, and understanding automated reasoning methods.

We focus on using machine learning to characterize variation in the run time of instances observed in inherently exponential search and reasoning problems. Predictive models for run time in this domain could provide the basis for more optimal decision making at the microstructure of algorithmic activity, as well as inform higher-level policies that guide the allocation of resources.

Our overall methodology is highlighted in Fig. 1. We seek to develop models for predicting execution time by considering dependencies between execution time and one or more classes of observations. Such classes include evidence about the nature of the generator that has provided instances, about the structural properties of instances noted before problem solving, and about the run-time behaviors of solvers as they struggle to solve the instances.

The research is fundamentally iterative in nature. We exploit learning methods to identify and continue to refine observational variables and models, balancing the predictive power of multiple observations with the cost of the real-time evaluation of such evidential distinctions. We seek ultimately to harness the learned models to optimize the performance of automated reasoning procedures. Beyond this direct goal, the overall exploratory process promises to be useful for providing new insights about problem hardness.

From: AAAI Technical Report FS-01-04. Compilation copyright © 2001, AAAI (www.aaai.org). All rights reserved.

We first provide background on the problem-solving domains we have been focusing on. Then, we describe our efforts to instrument problem solvers and to learn predictive models for run time. We describe the formulation of variables we used in learning and model construction and review the accuracy of the inferred models. Finally, we discuss opportunities for exploiting the models. We focus on the sample application of generating context-sensitive restart policies in randomized search algorithms.

2 Hard Search Problems

We have focused on applying learning methods to characterize run times observed in backtracking search procedures for solving NP-complete problems encoded as constraint satisfaction (CSP) and Boolean satisfiability (SAT) problems. For these problems, it has proven extremely difficult to predict the particular sensitivities of run time to changes in instances, initialization settings, and solution policies. Numerous studies have demonstrated that the probability distributions over run times exhibit so-called heavy tails [10]. Restart strategies have been used in an attempt to find settings for an instance that allow it to be solved rapidly, by avoiding costly journeys into a long tail of run time. Restarts are introduced by way of a parameter that terminates the run and restarts the search from the root with a new random seed after some specified amount of time passes, measured in choices or backtracks.
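To make the role of the cutoff concrete, the sketch below simulates a fixed-cutoff restart policy against an illustrative heavy-tailed run-time distribution. The Pareto model, cutoff value, and function names are our illustration, not taken from the paper's experiments:

```python
import random

def run_with_restarts(solve_once, cutoff, max_restarts=10000):
    """Repeatedly run a randomized solver trial, restarting from the
    root with a fresh random seed whenever the cutoff (measured in
    choices or backtracks) is exceeded; returns the total work spent."""
    total = 0.0
    for _ in range(max_restarts):
        cost = solve_once()          # run time of one randomized trial
        if cost <= cutoff:
            return total + cost      # solved within the cutoff
        total += cutoff              # abandon this run and restart
    raise RuntimeError("no solution within the restart budget")

def pareto_trial(alpha=0.5, scale=10.0):
    """Illustrative heavy-tailed (Pareto) run-time model: most trials
    are short, but occasional trials are extremely long."""
    return scale / (random.random() ** (1.0 / alpha))

random.seed(0)
no_restart = [pareto_trial() for _ in range(1000)]
with_restart = [run_with_restarts(pareto_trial, cutoff=50.0)
                for _ in range(1000)]
# Restarting truncates the long tail of the run-time distribution.
print(max(no_restart), max(with_restart))
```

Under this simulated distribution, the worst case over 1000 runs typically drops by orders of magnitude when restarts are used, which is the motivation for the restart parameter described above.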

Progress on the design and study of algorithms for SAT and CSP has been aided by the recent development of new methods for generating hard random problem instances. Pure random instances, such as k-SAT, have played a key role in the development of algorithms for propositional deduction and satisfiability testing. However, they lack the structure that characterizes real-world domains. Gomes and Selman [9] introduced a new benchmark domain based on quasigroups, the Quasigroup Completion Problem (QCP). QCP captures the structure that occurs in a variety of real-world problems such as timetabling, routing, and statistical experimental design.

Figure 2: Graphical representation of the quasigroup problem. Left: A quasigroup instance with its completion. Right: A balanced instance with two holes per row/column.

A quasigroup is a discrete structure whose multiplication table corresponds to a Latin square. A Latin square of order n is an n x n array in which n distinct symbols are arranged so that each symbol occurs once in each row and column. A partial quasigroup (or Latin square) of order n is an n x n array based on n distinct symbols in which some cells may be empty but no row or column contains the same element twice. The Quasigroup Completion Problem (QCP) can be stated as follows: Given a partial quasigroup of order n, can it be completed to a quasigroup of the same order?
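The definitions above translate directly into a simple validity check. In the sketch below (function names are ours), None marks an empty cell, and a partial Latin square is one in which no symbol repeats within any row or column:

```python
def is_partial_latin_square(square):
    """True if no row or column of the n x n array contains the same
    symbol twice; None marks an uncolored (empty) cell."""
    rows = [list(r) for r in square]
    cols = [list(c) for c in zip(*square)]
    for line in rows + cols:
        filled = [x for x in line if x is not None]
        if len(filled) != len(set(filled)):
            return False  # a symbol repeats in this row or column
    return True

def is_complete_latin_square(square):
    """True if every cell is filled and the Latin square property holds."""
    return (all(x is not None for row in square for x in row)
            and is_partial_latin_square(square))

partial = [[1, None, 3],
           [None, 3, None],
           [3, None, None]]
print(is_partial_latin_square(partial))   # → True
```

A QCP instance is then a partial Latin square, and a solution is a complete Latin square that agrees with it on every filled cell.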

QCP is an NP-complete problem [5], and random instances have been found to exhibit a peak in problem hardness as a function of the ratio of the number of uncolored cells to the total number of cells. The peak occurs over a particular range of values of this parameter, referred to as a region of phase transition [9, 2]. A variant of the QCP problem, Quasigroup with Holes (QWH) [2], includes only satisfiable instances. The QWH instance-generation procedure essentially inverts the completion task: it begins with a randomly generated completed Latin square, and then erases colors or "pokes holes." Completing QWH is NP-hard [2]. A structural property that significantly affects the hardness of instances is the pattern of the holes in rows and columns. Balancing the number of holes in each row and column of instances has been found to significantly increase the hardness of the problems [1].
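The generate-then-erase idea can be sketched as follows. For brevity, this sketch starts from a permuted cyclic Latin square rather than a properly uniform random one (the published generator uses Markov-chain shuffling [2]), and it pokes holes uniformly rather than in a balanced pattern, so it is an illustration of the construction, not the benchmark generator itself:

```python
import random

def qwh_instance(n, holes, seed=None):
    """Generate a Quasigroup-with-Holes instance: build a completed
    Latin square, then erase ('poke holes' in) `holes` cells.
    The result is satisfiable by construction."""
    rng = random.Random(seed)
    # Cyclic Latin square of order n with rows, columns, and symbols
    # independently permuted; each row and column stays a permutation.
    rows = rng.sample(range(n), n)
    cols = rng.sample(range(n), n)
    syms = rng.sample(range(n), n)
    square = [[syms[(rows[r] + cols[c]) % n] for c in range(n)]
              for r in range(n)]
    # Erase a random subset of cells; None marks an uncolored cell.
    for r, c in rng.sample([(r, c) for r in range(n) for c in range(n)],
                           holes):
        square[r][c] = None
    return square

inst = qwh_instance(5, holes=10, seed=42)
print(sum(x is None for row in inst for x in row))  # → 10
```

Because the holes are erased from a complete square, every instance produced this way has at least one completion, which is what distinguishes QWH from raw QCP generation.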

3 Experiments with Problem Solvers

We performed a number of experiments with Bayesian learning methods to elucidate previously hidden distinctions and relationships in SAT and CSP reasoners. We experimented with both a randomized SAT algorithm running on Boolean encodings of QWH and a randomized CSP solver for QWH. The SAT algorithm was Satz-Rand [11], a randomized version of the Satz system of Li and Anbulagan [20]. Satz is the fastest known complete SAT algorithm for hard random 3-SAT problems, and is well suited to many interesting classes of structured satisfiability problems, including SAT encodings of quasigroup completion problems [10] and planning problems [17]. The solver is a version of the classic Davis-Putnam (DPLL) algorithm [7], augmented with one-step lookahead and a sophisticated variable choice heuristic. The lookahead operation is invoked at most choice points and finds any variable/value choices that would immediately lead to a contradiction after unit propagation; for these, the opposite variable assignment can be immediately made. The variable choice heuristic is based on picking a variable that, if set, would cause the greatest number of ternary clauses to be reduced to binary clauses. The variable choice set was enlarged by a noise parameter of 30%, and value selection was performed deterministically by always branching on 'true' first.

The second backtrack search algorithm we studied is a randomized version of a specialized CSP solver for quasigroup completion problems, written using the ILOG Solver constraint programming library. The backtrack search algorithm uses as a variable choice heuristic a variant of the Brelaz heuristic. Furthermore, it uses a sophisticated propagation method to enforce the constraints that assert that all the colors in a row/column must be different. We refer to such a constraint as alldiff. The propagation of the alldiff constraint corresponds to solving a matching problem on a bipartite graph using a network-flow algorithm [9, 26, 24].
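The matching view of alldiff can be illustrated with a feasibility check: the constraint over one row is satisfiable exactly when the bipartite graph of cells versus remaining colors admits a perfect matching. The minimal augmenting-path sketch below is ours; it is not the cited network-flow implementation:

```python
def alldiff_feasible(domains):
    """domains[i] is the set of colors still allowed for cell i.
    Returns True iff every cell can receive a distinct color, found
    via augmenting paths in the cell/color bipartite graph."""
    match = {}  # color -> cell currently matched to it

    def augment(cell, seen):
        for color in domains[cell]:
            if color in seen:
                continue
            seen.add(color)
            # Use the color if it is free, or if its current cell
            # can be re-matched to some other color.
            if color not in match or augment(match[color], seen):
                match[color] = cell
                return True
        return False

    return all(augment(cell, set()) for cell in range(len(domains)))

print(alldiff_feasible([{1, 2}, {2, 3}, {1, 3}]))  # → True
print(alldiff_feasible([{1}, {1, 2}, {2}]))        # → False
```

Full alldiff propagation goes further, pruning every cell/color edge that lies in no maximum matching, but the feasibility test above is the core subproblem.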

We learned predictive models for run time, motivated by two different classes of target problem. For the first class of problem, we assume that a solver is challenged by an instance and must solve that specific problem as quickly as possible. We term this the Single Instance problem. In a second class of problem, we draw cases from a distribution of instances and are required to solve any instance as soon as possible, or as many instances as possible for any amount of time allocated. We call these challenges Multiple Instance problems, and refer to the subproblems as the Any Instance and Max Instances problems, respectively.

We collected evidence and built models for CSP and Satz solvers applied to the QWH problem for both the Single Instance and Multiple Instance challenges. We shall refer to the four problem-solving experiments as CSP-QWH-Single, CSP-QWH-Multi, Satz-QWH-Single, and Satz-QWH-Multi. Building predictive Bayesian models for the CSP-QWH-Single and Satz-QWH-Single problems centered on gathering data on the probabilistic relationships between observational variables and run time for single instances with randomized restarts. Experiments for the CSP-QWH-Multi and Satz-QWH-Multi problems centered on performing single runs on multiple instances drawn from the same instance generator.

3.1 Formulating Evidential Variables

We worked to define variables that we believed could provide information on problem-solving progress for a period of observation in an early phase of a run that we refer to as the observation horizon. The definition of variables was initially guided by intuition. However, results from our early experiments helped us to refine sets of variables and to propose additional candidates.

We initially explored a large number of variables, including those that were difficult to compute. Although we planned ultimately to avoid the use of costly observations in real-time forecasting settings, we were interested in probing the predictive power and interdependencies among features regardless of cost. Understanding such informational dependencies promised to be useful in understanding the potential losses in predictive power with the removal of costly features, or the substitution of expensive evidence with less expensive, approximate observations. We eventually limited the features explored to those that could be computed with low (constant) overhead.

We sought to collect information about base values as well as several variants and combinations of these values. For example, we formulated features that could capture higher-level patterns and dynamics of problem solver state and progress, and that could serve as useful probes of the progress of the problem solver. Beyond exploring base observations about the program state at particular points in a case, we defined new families of observations, such as first and second derivatives of the base variables and summaries of the status of variables over time.

Rather than include a separate variable in the model for each feature at each choice point--which would have led to an explosion in the number of variables and severely limited generalization--features and their dynamics were represented by variables for their summary statistics over the observation horizon. The summary statistics included initial, final, average, minimum, and maximum values of the features during the observation period. For example, at each choice point, the SAT solver recorded the current number of binary clauses. The training data would thus include a variable for the average first derivative of the number of binary clauses during the observation period. Finally, for several of the features, we also computed a summary statistic that measured the number of times the sign of the feature changed from negative to positive or vice versa.
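The summarization step can be sketched as follows; the statistic names are ours, chosen to mirror the text (initial, final, average, minimum, and maximum of a trace and of its first derivative, plus sign changes of the derivative):

```python
def summarize(trace):
    """Reduce a per-choice-point feature trace to summary statistics
    over the observation horizon, as used for model variables."""
    deriv = [b - a for a, b in zip(trace, trace[1:])]  # first derivative
    sign_changes = sum(1 for a, b in zip(deriv, deriv[1:])
                       if a * b < 0)  # derivative flips sign
    stats = {}
    for prefix, xs in (("", trace), ("d1_", deriv)):
        stats.update({prefix + "initial": xs[0],
                      prefix + "final": xs[-1],
                      prefix + "avg": sum(xs) / len(xs),
                      prefix + "min": min(xs),
                      prefix + "max": max(xs)})
    stats["d1_sign_changes"] = sign_changes
    return stats

# e.g. the number of binary clauses recorded at each choice point
s = summarize([120, 118, 121, 119, 119, 124])
print(s["d1_avg"], s["d1_sign_changes"])  # → 0.8 2
```

Each base feature thus contributes a fixed, small set of model variables regardless of the length of the observation horizon, which is what keeps the variable count bounded.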

We developed distinct sets of observational variables for the CSP and Satz solvers. The features for the CSP solver included some that were generic to any constraint satisfaction problem, such as the number of backtracks, the depth of the search tree, and the average domain size of the unbound CSP variables. Other features, such as the variance in the distribution of unbound CSP variables between different columns of the square, were specific to Latin squares. As we will see below, the inclusion of such domain-specific features was important in learning strongly predictive models. The CSP solver recorded 18 basic features at each choice point, which were summarized by a total of 135 variables. The variables that turned out to be most informative for prediction are described in Sec. 4.1 below.

The features recorded by Satz-Rand were largely generic to SAT. We included a feature for the number of Boolean variables that had been set positively; this feature is problem specific in the sense that, under the SAT encoding we used, only a positive Boolean variable corresponds to a bound CSP variable (i.e., a colored square). Some features measured the current problem size (e.g., the number of unbound variables), others the size of the search tree, and still others the effectiveness of unit propagation and lookahead.

We also calculated two other features of special note. One was the logarithm of the total number of possible truth assignments (models) that had been ruled out at any point in the search; this quantity can be efficiently calculated by examining the stack of assumed and proven Boolean variables managed by the DPLL algorithm. The other is a quantity from the theory of random graphs called λ, which measures the degree of interaction between the binary clauses of the formula [23]. In all, Satz recorded 25 basic features that were summarized in 127 variables.

3.2 Collecting Run-Time Data

For all experiments, observational variables were collected over an observational horizon of 1000 solver choice points. Choice points are states in search procedures where the algorithm assigns a value to a variable and that assignment is not forced via propagation of previously set values, as occurs with unit propagation, backtracking, lookahead, or forward-checking. These are points at which an assignment is chosen heuristically per the policies implemented in the problem solver.

For the studies described, we represented run time as a binary variable with discrete states short versus long. We defined short runs as cases completed before the median of the run times for all cases in each data set. Instances with run times shorter than the observation horizon were not considered in the analyses.

4 Models and Results

We employed Bayesian structure learning to infer predictive models from data and to identify key variables from the larger set of observations we collected. Over the last decade, there has been steady progress on methods for inferring Bayesian networks from data [6, 27, 12, 13]. Given a dataset, the methods typically perform heuristic search over a space of dependency models and employ a Bayesian score to identify models with the greatest ability to predict the data. The Bayesian score estimates p(model | data) by approximating p(data | model)p(model). Chickering, Heckerman, and Meek [4] show how to evaluate the Bayesian score for models in which the conditional distributions are decision trees. This Bayesian score requires a prior distribution over both the parameters and the structure of the model. In our experiments, we used a uniform parameter prior. Chickering et al. suggest using a structure prior of the form p(model) = κ^f, where 0 < κ < 1 and f is the number of free parameters in the model. Intuitively, smaller values of κ make large trees unlikely a priori, and thus κ can be used to help avoid overfitting. We used this prior, and tuned κ as described below.

We employed the methods of Chickering et al. to infer models and to build decision trees for run time from the data collected in experiments with CSP and Satz problem solvers applied to QWH problem instances. We shall describe sample results from the data collection and four learning experiments, focusing on the CSP-QWH-Single case in detail.

4.1 CSP-QWH-Single Problem

For a sample CSP-QWH-Single problem, we built a training set by selecting a nonbalanced QWH problem instance of order 34 with 380 unassigned variables. We solved this instance 4000 times for the training set and 1000 times for the test data set, initiating each run with a random seed. We collected run-time data and the states of multiple variables for each case over an observational horizon of 1000 choice points. We also created a marginal model, capturing the overall run-time statistics for the training set.

We optimized the κ parameter used in the structure prior of the Bayesian score by splitting the training set 70/30 into training and holdout data sets, respectively. We selected a κ value by identifying a soft peak in the Bayesian score. This value was used to build a dependency model and decision tree for run time from the full training set. We then tested the abilities of the marginal model and the learned decision tree to predict the outcomes in the test data set. We computed a classification accuracy for the learned and marginal models to characterize the power of these models. The classification accuracy is the likelihood that the classifier will correctly identify the run time of cases in the test set. We also computed an average log score for the models.
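The effect of the structure prior on model selection can be sketched numerically. With the Chickering et al. prior p(model) = kappa^f, the log prior contributes f·log(kappa), so each additional free parameter subtracts a fixed penalty from the score; the likelihood values below are invented for illustration:

```python
import math

def structure_log_prior(free_params, kappa):
    """log p(model) for the prior p(model) = kappa ** f: each free
    parameter costs log(kappa) < 0, discouraging large trees."""
    assert 0 < kappa < 1
    return free_params * math.log(kappa)

def bayesian_log_score(log_marginal_likelihood, free_params, kappa):
    """Approximate log p(model | data) up to a constant:
    log p(data | model) + log p(model)."""
    return log_marginal_likelihood + structure_log_prior(free_params, kappa)

# A larger tree must buy enough extra likelihood to pay its penalty.
small = bayesian_log_score(-510.0, free_params=8, kappa=0.1)
large = bayesian_log_score(-500.0, free_params=20, kappa=0.1)
print(small > large)  # 10 nats of likelihood < 12 extra parameters' cost
```

Tuning κ on holdout data, as described above, amounts to choosing how steep this per-parameter penalty should be.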


Fig. 3 displays the learned Bayesian network for this dataset. The figure highlights key dependencies and variables discovered for the data set. Fig. 4 shows the decision tree for run time.

The classification accuracy for the learned model is 0.963, in contrast with a classification accuracy of 0.489 for the marginal model. The average log score of the learned model is -0.134, and the average log score of the marginal model is -0.693.
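The two evaluation measures can be computed as in the sketch below. This is our formulation: accuracy of thresholded predictions, and the mean natural-log probability assigned to the true outcome, under which a marginal model near p = 0.5 scores about ln 0.5 ≈ -0.693, matching the marginal baselines reported here; the probabilities and labels are toy data:

```python
import math

def classification_accuracy(probs, labels):
    """Fraction of cases where the predicted probability of a long run
    falls on the correct side of 0.5."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)

def average_log_score(probs, labels):
    """Mean log probability assigned to the observed outcome."""
    return sum(math.log(p if y else 1.0 - p)
               for p, y in zip(probs, labels)) / len(labels)

probs = [0.9, 0.8, 0.3, 0.1]   # predicted P(long run)
labels = [1, 1, 0, 0]          # 1 = long run, 0 = short run
print(classification_accuracy(probs, labels))   # → 1.0
print(average_log_score(probs, labels))         # near 0 for sharp models
```

The log score rewards well-calibrated confidence, not just correct classification, which is why it is reported alongside accuracy throughout this section.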

Because this was both the strongest and most compact model we learned, we will discuss the features it involves in more detail. Following Fig. 4 from left to right, these are:

VarRowColumn measures the variance in the number of uncolored cells in the QWH instance across rows and across columns. A low variance indicates the open cells are evenly balanced throughout the square. As noted earlier, balanced instances are harder to solve than unbalanced ones [1]. A rather complex summary statistic of this quantity appears at the root of the tree, namely the minimum of the first derivative of this quantity during the observation period. In future work we will examine this feature carefully in order to determine why this particular statistic was most relevant.

AvgColumn measures the number of uncolored cells averaged across columns. A low value for this feature indicates the quasigroup is almost complete. The second node in the decision tree indicates that when the minimum value of this quantity across the entire observation period is low, the run is predicted to be short.

MinDepth is the minimum depth of all leaves of the search tree, and the summary statistic is simply the final value of this quantity. The third and fourth nodes of the decision tree show that short runs are associated with high minimum depth and long runs with low minimum depth. This may be interpreted as indicating that the search trees for the shorter runs have a more regular shape.

AvgDepth is the average depth of a node in the search tree. The model discovers that short runs are associated with a high frequency in the change of the sign of the first derivative of the average depth. In other words, frequent fluctuations up and down in the average depth indicate a short run. We do not yet have an intuitive explanation for this phenomenon.

VarRowColumn appears again as the last node in the decision tree. Here we see that if the maximum variance of the number of uncolored cells in the QWH instance across rows and columns is low (i.e., the problem remains balanced), then the run is long, as might be expected.

4.2 CSP-QWH-Multi Problem

For a CSP-QWH-Multi problem, we built training and test sets by selecting instances of nonbalanced QWH problems of order 34 with 380 unassigned variables. We collected data on 4000 instances for the training set and 1000 instances for the test set.

As we were running instances of potentially different fundamental hardness, we normalized the feature measurements by the size of the instance (measured in CSP variables) after the instances were initially simplified by forward-checking. That is, although all the instances originally had the same number of uncolored cells, polynomial-time preprocessing fills in some of the cells, thus revealing the true size of the instance.

We collected run-time data for each instance over an observational horizon of 1000 choice points. The learned model was found to have a classification accuracy of 0.715, in comparison to the marginal model accuracy of 0.539. The average log score for the learned model was found to be -0.562, and the average log score for the marginal model was -0.690.

4.3 Satz-QWH-Single Problem

We performed analogous studies with the Satz solver. In a study of the Satz-QWH-Single problem, we studied a single QWH instance (bqwh-34-410-16). We found that the learned model had a classification accuracy of 0.603, in comparison to a classification accuracy of 0.517 for the marginal model. The average log score of the learned model was found to be -0.651, and the log score of the marginal model was -0.693.

The predictive power of the SAT model was less than that of the corresponding CSP model. This is reasonable, since the CSP model had access to features that more precisely captured special features of quasigroup problems (such as balance). The decision tree was still relatively small, containing 12 nodes that referred to 10 different summary variables.

Observations that turned out to be most relevant for the SAT model included:

• The maximum number of variables set to 'true' during the observation period. As noted earlier, this corresponds to the number of CSP variables that would be bound in the direct CSP encoding.

• The number of models ruled out.

• The number of unit propagations performed.

• The number of variables eliminated by Satz's lookahead component: that is, the effectiveness of lookahead.


Figure 3: The learned Bayesian network for a sample CSP-QWH-Single problem. Key dependencies and variables are highlighted.

Figure 4: The decision tree inferred for run time from data gathered in a CSP-QWH-Single experiment. The probability of a short run is captured by the light component of the bar graphs displayed at the leaves.


• The quantity λ described in Sec. 3.1 above, a measure of the constrainedness of the binary-clause subproblem.

4.4 Satz-QWH-Multi Problem

For the experiment with the Satz-QWH-Multi problem, we executed single runs of QWH instances with the same parameters as the instance studied in the Satz-QWH-Single problem (bqwh-34-410) for the training and test sets. Run time and observational variables were normalized in the same manner as for the CSP-QWH-Multi problem. The classification accuracy of the learned model was found to be 0.715. The classification accuracy of the marginal model was found to be 0.526. The average log score for the model was -0.557, and the average log score for the marginal model was -0.692.

4.5 Toward Larger Studies

For broad application in guiding computational problem solving, it is important to develop an understanding of how results for sample instances, such as the problems described in Sections 4.1 through 4.4, generalize to new instances within and across distinct classes of problems. We have been working to build insights about generalizability by exploring the statistics of the performance of classifiers on sets of problem instances. The work on studies with larger numbers of data sets has been limited by the amount of time required to generate data sets for the hard problems being studied. With our computing platforms, several days of computational effort were typically required to produce each data set.

As an example of our work on generalization, we review the statistics of model quality and classification accuracy, and the regularity of discriminatory features, for additional data sets of instances in the CSP-QWH-Single problem class, as described in Section 4.1.

We defined ten additional nonbalanced QWH problem instances, parameterized in the same manner as the CSP problem described in Section 4.1 (order 34 with 380 unassigned variables). We employed the same data generation and analysis procedures as before, building and testing ten separate models. Generating data for these analyses using the ILOG library executed on an Intel Pentium III (running at 600 MHz) required approximately twenty-four hours per 1000 runs. Thus, each CSP dataset required approximately five days of computation.

In summary, we found significant boosts in classification accuracy for all of the instances. For the ten datasets, the mean classification accuracy for the learned models was 0.812 with a standard deviation of 0.101. The predictive power of the learned models stands in contrast to the classification accuracy of using background statistics; the mean classification accuracy of the marginal models was 0.497 with a standard deviation of 0.025. The log scores of the models averaged -0.304 with a standard deviation of 0.167. Thus, we observed relatively consistent predictive power of the methods across the new instances.

We observed variation in the tree structure and discriminatory features across the ten learned models. Nevertheless, several features appeared as valuable discriminators in multiple models, including statistics based on measures of VarRowColumn, AvgColumn, AvgDepth, and MinDepth. Some of the evidential features recurred for different problems, showing significant predictive value across models with greater frequency than others. For example, a measure of the maximum variation in the number of uncolored cells in the QWH instance across rows and columns (MaxVarRowColumn) appeared as an important discriminator in many of the models.

5 Generalizing Observation Policies

For the experiments described in Sections 3 and 4, we employed a policy of gathering evidence over an observation horizon of the initial 1000 choice points. This observational policy can be generalized in several ways. For example, in addition to harvesting evidence within the observation horizon, we can consider the amount of time expended so far during a run as an explicit observation. Also, evidence gathering can be generalized to consider the status of variables and statistics of variables at progressively later times during a run.

Beyond experimenting with different observational policies, we believe that there is potential for employing value-of-information analyses to optimize the gathering of information. For example, there is opportunity for employing offline analysis and optimization to generate tractable real-time observation policies that dictate which evidence to evaluate at different times during a run, potentially conditioned on evidence that has already been observed during that run.

5.1 Time Expended as Evidence

In the process of exploring alternate observation policies, we investigated the value of extending the bounded-horizon policy, described in Section 3, with a consideration of the time expended so far during a run. To probe potential boosts with inclusion of time expended, we divided several of the data sets explored in Section 4.5 into subsets based on whether runs within the data set had exceeded specific run-time boundaries. Then, we built distinct run-time-specific models and tested the predictive power of these models on test sets containing instances of appropriate minimal length. Such time-specific models could be used in practice as a cascade of models, depending on the amount of time that had already been expended on a run.

We typically found boosts in predictive power with such decompositions, especially for the longer run times, as we had expected. As an example, let us consider one of the data sets generated for the study in Section 4.5. The model that had been built previously with all of the data had a classification accuracy of 0.793. The median time for the runs represented in the set was nearly 18,000 choice points. We created three separate subsets of the complete set of runs: the set of runs that exceeded 5,000 choice points, the set that exceeded 8,000 choice points, and the set that had exceeded 11,000 choice points. We created distinct predictive models for each training set and tested these models with cases drawn from test sets containing runs of appropriate minimal length. The classification accuracies of the models for the low, medium, and high time expenditures were 0.779, 0.799, and 0.850, respectively. We shall continue to study the use of time allocated as a predictive variable.
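Such a cascade of run-time-specific models can be sketched as follows. The thresholds, the feature, and the stub classifiers below are hypothetical stand-ins for the learned models, not the paper's implementation; the sketch only illustrates the dispatch-by-elapsed-time idea.

```python
# Sketch: a cascade of run-time-specific classifiers. Each stage is trained
# only on runs that survived past its threshold; prediction dispatches on
# the number of choice points already expended. All values are illustrative.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StageModel:
    min_choice_points: int                # runs shorter than this never reach this model
    predict_long: Callable[[dict], bool]  # True if the run is predicted to be "long"

def cascade_predict(stages: List[StageModel], elapsed: int, features: dict) -> bool:
    """Use the most specific model whose threshold the run has already passed."""
    applicable = [s for s in stages if elapsed >= s.min_choice_points]
    chosen = max(applicable, key=lambda s: s.min_choice_points)
    return chosen.predict_long(features)

# Toy stand-in classifiers keyed on a single hypothetical feature.
stages = [
    StageModel(0,      lambda f: f["avg_depth"] > 12.0),
    StageModel(5_000,  lambda f: f["avg_depth"] > 10.0),
    StageModel(8_000,  lambda f: f["avg_depth"] > 9.0),
    StageModel(11_000, lambda f: f["avg_depth"] > 8.0),
]

print(cascade_predict(stages, elapsed=9_500, features={"avg_depth": 9.5}))  # → True (8,000-step model)
```

A run observed at 9,500 choice points is classified by the 8,000-step model, the deepest stage it has passed, mirroring the low/medium/high decomposition described above.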

6 Application: Dynamic Restart Policies

A predictive model can be used in several ways to control a solver. For example, the variable-selection heuristic used to decompose the problem instance can be designed to minimize the expected solution time of the subproblems. Another application centers on building distinct models to predict the run time associated with different global strategies. As an example, we can learn to predict the relative performance of ordinary chronological backtrack search and dependency-directed backtracking with clause learning [16]. Such a predictive model could be used to decide whether the overhead of clause learning would be worthwhile for a particular instance.
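One minimal way to realize such a choice is to dispatch each instance to the strategy with the lowest predicted cost. The predictors below are hypothetical linear stand-ins (the feature name and coefficients are invented for illustration), not the learned Bayesian models of the paper; they merely encode the overhead-versus-scaling trade-off described above.

```python
# Sketch: per-strategy run-time predictors, with dispatch to the cheapest.
# Predictor forms, features, and constants are illustrative assumptions.

def pick_strategy(predictors, features):
    """predictors: dict mapping strategy name -> function(features) -> expected steps."""
    return min(predictors, key=lambda name: predictors[name](features))

predictors = {
    # low fixed overhead, but cost grows quickly with instance size
    "chronological": lambda f: 50_000 + 400 * f["num_unassigned"],
    # clause learning: higher fixed overhead, better scaling
    "clause_learning": lambda f: 120_000 + 150 * f["num_unassigned"],
}

print(pick_strategy(predictors, {"num_unassigned": 100}))    # → chronological
print(pick_strategy(predictors, {"num_unassigned": 1_000}))  # → clause_learning
```

With these toy numbers, the overhead of clause learning pays off only on the larger instance, which is exactly the per-instance decision the predictive model would inform.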

Problem- and instance-specific predictions of run time can also be used to drive dynamic cutoff decisions on when to suspend a current case and restart with a new random seed or new problem instance, depending on the class of problem. For example, consider a greedy analysis, where we consider the expected value of ceasing a run that is in progress and performing a restart on that instance or another instance, given predictions about run time. A predictive model provides the expected time remaining until completion of the current run. Initiating a new run will have an expected run time provided by the statistics of the marginal model.

From the perspective of a single-step analysis, when the expected time remaining for the current instance is greater than the expected time of the next instance, as defined by the background marginal model, it is better to cease activity and perform a restart. More generally, we can consider richer multistep analyses that provide the fastest solutions to a particular instance or the highest rate of completed solutions with computational effort. Such policies can be formulated by computing the cumulative probability of achieving a solution as a function of the total dwell time allocated to an instance, based on the inferred probability distribution over execution time. For an ideal policy, we wish to choose dynamically a cutoff time t that maximizes an expected rate of return of answers.
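The rate-maximizing view can be sketched concretely. Given a discretized distribution over run length, the expected rate of return for a fixed cutoff t is P(T <= t) / E[min(T, t)], solutions per step when every run is cut off at t; the routine below scans cutoffs for the best rate. The distributions used are illustrative, not measured data from the paper.

```python
# Sketch: choose the cutoff t maximizing completed solutions per step.
# p[i] is the (hypothetical) probability a run finishes at exactly i+1 steps;
# probability mass beyond the last bin never finishes within the horizon.

def best_cutoff(p):
    best_t, best_rate = None, 0.0
    cum = 0.0         # running P(T <= t)
    exp_capped = 0.0  # running E[min(T, t)] = sum of survival probabilities
    for t, pt in enumerate(p, start=1):
        exp_capped += 1.0 - cum    # every still-running run pays one more step
        cum += pt
        rate = cum / exp_capped    # solutions per step with cutoff t
        if rate > best_rate:
            best_t, best_rate = t, rate
    return best_t, best_rate

print(best_cutoff([0.4, 0.1]))  # mass arrives early: cut off early → (1, 0.4)
print(best_cutoff([0.0, 0.5]))  # mass arrives late: wait → (2, 0.25)
```

The first toy distribution exhibits the heavy-tailed shape discussed in the paper: most successful runs end quickly, so an aggressive cutoff yields the higher rate of return.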

We can also use the predictive models to perform comparative analyses with previous policies. Luby et al. [21] have shown that the optimal restart policy, assuming full knowledge of the distribution, is one with a fixed cutoff. They also provide a universal strategy (using gradually increasing cutoffs) for minimizing the expected cost of randomized procedures, assuming no prior knowledge of the probability distribution. They show that the universal strategy is within a log factor of optimal. These results essentially settle the distribution-free case.

Consider now the following dynamic policy: Observe a run for O steps. If a solution is not found, then predict whether the run will complete within a total of L steps. If the prediction is negative, then immediately restart; otherwise, continue to run for up to a total of L steps before restarting if no solution is found.

An upper bound on the expected run time of this policy can be calculated in terms of the model accuracy A and the probability P_i that a single run successfully ends in i or fewer steps. For simplicity of exposition, we assume that the model's accuracy in predicting long and short runs is identical. The expected number of runs until a solution is found is E(N) = 1 / (A(P_L - P_O) + P_O). An upper bound on the expected number of steps in a single run can be calculated by assuming that runs that end within O steps take exactly O steps, and that runs that end in O+1 to L steps take exactly L steps. The probability that the policy continues a run past O steps (i.e., the prediction was positive) is A P_L + (1 - A)(1 - P_L). An upper bound on the expected length of a single run is thus E_ub(R) = O + (L - O)(A P_L + (1 - A)(1 - P_L)). Therefore, an upper bound on the expected time to solve a problem using the policy is E(N) E_ub(R).
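The bound can be transcribed directly into code. The functions below implement the three quantities above; the particular values plugged in (A, P_O, P_L, O, L) are illustrative, not measurements from the paper.

```python
# Direct transcription of the upper-bound analysis, under the stated
# assumption that accuracy A is the same for long and short runs.

def expected_runs(A, P_O, P_L):
    """E(N) = 1 / (A*(P_L - P_O) + P_O): expected runs until one succeeds."""
    return 1.0 / (A * (P_L - P_O) + P_O)

def expected_run_length_ub(A, P_L, O, L):
    """E_ub(R) = O + (L - O)*(A*P_L + (1 - A)*(1 - P_L)): bound on steps per run."""
    return O + (L - O) * (A * P_L + (1 - A) * (1 - P_L))

def expected_total_time_ub(A, P_O, P_L, O, L):
    """Bound on expected steps to solve a problem: E(N) * E_ub(R)."""
    return expected_runs(A, P_O, P_L) * expected_run_length_ub(A, P_L, O, L)

# Illustrative values: A = 0.8, P_O = 0.1, P_L = 0.4, O = 1000, L = 5000.
print(expected_total_time_ub(0.8, 0.1, 0.4, 1_000, 5_000))  # ≈ 8117.6 steps
```

With these values, roughly 1/0.34 ≈ 2.9 runs are needed, each bounded by 2,760 steps, so the bound is about 8,118 steps; varying L in such a calculation is exactly the optimization discussed next.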

It is important to note that the expected time depends on both the accuracy of the model and the prediction point L; in general, one would want to vary L in order to optimize the solution time. Furthermore, in general, it would be better to design more sophisticated dynamic policies that make use of all information gathered over a run, rather than just during the first O steps. But even a non-optimized policy based directly on the models discussed in this paper can outperform the optimal fixed policy. For example, in the CSP-QWH-Single problem case, the optimal fixed policy has an expected solution time of 38,000 steps, while the dynamic policy has an expected solution time of only 27,000 steps. Optimizing the choice of L should provide about an order of magnitude further improvement.

While it may not be surprising that a dynamic policy can outperform the optimal fixed policy, it is interesting to note that this can occur when the observation time O is greater than the fixed cutoff. That is, for proper values of L and A, it may be worthwhile to observe each run for 1000 steps even if the optimal fixed strategy is to cut off after 500 steps. These and other issues concerning applications of prediction models to restart policies are examined in detail in a forthcoming paper.

7 Related Work

Learning methods have been employed in previous research in an attempt to enhance the performance of reasoning systems. In work on "speed-up learning," investigators have attempted to increase planning efficiency by learning goal-specific preferences for plan operators [22, 19]. Khardon and Roth explored the offline reformulation of representations, based on experiences with problem solving in an environment, to enhance run-time efficiency [18]. Our work on using probabilistic models to learn about algorithmic performance and to guide problem solving is most closely related to research on flexible computation and decision-theoretic control. In related work in this arena on the use of predictive models to control computation, Breese and Horvitz [3] collected data about the progress of search for graph cliquing and of cutset analysis for use in minimizing the time of probabilistic inference with Bayesian networks. The work was motivated by the challenge of identifying the ideal time for preprocessing graphical models for faster inference before initiating inference, trading off reformulation time for inference time. Trajectories of progress as a function of parameters of Bayesian network problem instances were learned for use in dynamic decisions about the partition of resources between reformulation and inference. In other work, Horvitz and Klein [14] constructed Bayesian models considering the time expended so far in theorem proving. They monitored the progress of search in a propositional theorem prover and used measures of progress in updating the probability of truth or falsity of assertions. A Bayesian model was harnessed to update belief about different outcomes as a function of the amount of time that problem solving continued without halting. Stepping back to view the larger body of work on the decision-theoretic control of computation, measures of expected value of computation [15, 8, 25], employed to guide problem solving, rely on forecasts of the refinements of partial results with future computation. More generally, representations of problem-solving progress have been central in research on flexible or anytime methods: procedures that exhibit a relatively smooth surface of performance with the allocation of computational resources.

8 Future Work and Directions

This work represents a point in a space of ongoing research. We are pursuing several lines of research with the goals of enhancing the power and generalizing the applicability of the predictive methods. We are exploring the modeling of run time at a finer grain through the use of continuous variables and prototypical named distributions. We are also exploring the value of decomposing the learning problem into models that predict the average execution times seen with multiple runs and models that predict how well a particular instance will do relative to the overall hardness of the problem. In other extensions, we are exploring the feasibility of inferring the likelihood that an instance is solvable versus unsolvable and building models that forecast the overall expected run time to completion by conditioning on each situation. For the analyses described in this paper, we have used fixed policies over a short initial observation horizon. We are interested in pursuing more general, dynamic observational policies and in harnessing the value of information to identify a set of conditional decisions about the pattern and timing of monitoring. Finally, we are continuing to investigate the formulation and testing of ideal policies for harnessing the predictive models to optimize restart policies.

9 Summary

We presented a methodology for characterizing the run time of problem instances for randomized backtrack-style search algorithms that have been developed to solve a hard class of structured constraint-satisfaction problems. The methods are motivated by recent successes with using fixed restart policies to address the high variance in running time typically exhibited by backtracking search algorithms. We described two distinct formulations of problem-solving goals and built probabilistic dependency models for each. Finally, we discussed opportunities for leveraging the predictive power of the models to optimize the performance of randomized search procedures by monitoring execution and dynamically setting cutoffs.

Acknowledgments

We thank Dimitris Achlioptas for his insightful contri-butions and feedback.

References

[1] H. Kautz, Y. Ruan, D. Achlioptas, C.P. Gomes, B. Selman, and M. Stickel. Balance and filtering in structured satisfiable problems. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-01), Seattle, WA, 2001.

[2] D. Achlioptas, C.P. Gomes, H. Kautz, and B. Selman. Generating satisfiable instances. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-00), Austin, TX, 2000. AAAI Press.

[3] J.S. Breese and E.J. Horvitz. Ideal reformulation of belief networks. In Proceedings of UAI-90, Cambridge, MA, pages 64-72. Association for Uncertainty in AI, Mountain View, CA, July 1990.

[4] D.M. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. In Proceedings of UAI-97, Providence, RI, pages 80-89. Morgan Kaufmann, August 1997.

[5] C. Colbourn. The complexity of completing partial Latin squares. Discrete Applied Mathematics, 8:25-30, 1984.

[6] G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.

[7] M. Davis, G. Logemann, and D. Loveland. A machine program for theorem proving. Communications of the ACM, 5:394-397, 1962.

[8] T.L. Dean and M.P. Wellman. Planning and Control, chapter 8.3, Temporally Flexible Inference, pages 353-363. Morgan Kaufmann, San Francisco, 1991.

[9] C.P. Gomes and B. Selman. Problem structure in the presence of perturbations. In Proceedings of AAAI-97, Providence, RI, 1997. AAAI Press.

[10] C.P. Gomes, B. Selman, N. Crato, and H. Kautz. Heavy-tailed phenomena in satisfiability and constraint satisfaction problems. Journal of Automated Reasoning, 24(1-2):67-100, 2000.

[11] C.P. Gomes, B. Selman, and H. Kautz. Boosting combinatorial search through randomization. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), Madison, WI, 1998. AAAI Press.

[12] D. Heckerman, J. Breese, and K. Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49-57, 1995.

[13] D. Heckerman, D.M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for density estimation, collaborative filtering, and data visualization. In Proceedings of UAI-2000, Stanford, CA, pages 82-88, 2000.

[14] E. Horvitz and A. Klein. Reasoning, metareasoning, and mathematical truth: Studies of theorem proving under limited resources. In Proceedings of UAI-95, Montreal, Canada, pages 306-314. Morgan Kaufmann, San Francisco, August 1995.

[15] E.J. Horvitz. Reasoning under varying and uncertain resource constraints. In Proceedings of AAAI-88, pages 111-116. Morgan Kaufmann, San Mateo, CA, August 1988.

[16] R.J. Bayardo Jr. and R. Schrag. Using CSP look-back techniques to solve real-world SAT instances. In AAAI/IAAI, pages 203-208. AAAI Press, 1997.

[17] H. Kautz and B. Selman. Unifying SAT-based and graph-based planning. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, 1999.

[18] R. Khardon and D. Roth. Learning to reason. In Proceedings of AAAI-94, pages 682-687. AAAI Press, July 1994.

[19] C. Knoblock, S. Minton, and O. Etzioni. Integration of abstraction and explanation-based learning in PRODIGY. In Proceedings of AAAI-91. AAAI, Morgan Kaufmann, August 1991.

[20] C.M. Li and Anbulagan. Heuristics based on unit propagation for satisfiability problems. In Proceedings of IJCAI-97, pages 366-371. AAAI Press, 1997.

[21] M. Luby, A. Sinclair, and D. Zuckerman. Optimal speedup of Las Vegas algorithms. Information Processing Letters, 47(4):173-180, 1993.

[22] S. Minton, J.G. Carbonell, C.A. Knoblock, D.R. Kuokka, O. Etzioni, and Y. Gil. Explanation-based learning: A problem solving perspective. Artificial Intelligence, 40:63-118, 1989.

[23] M. Molloy, R. Robinson, H. Robalewska, and N. Wormald. Factorisations of random regular graphs.

[24] J. Régin. A filtering algorithm for constraints of difference in CSPs. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), pages 362-367, 1994.

[25] S. Russell. Fine-grained decision-theoretic search control. In Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence. Association for Uncertainty in AI, Mountain View, CA, August 1990.

[26] P. Shaw, K. Stergiou, and T. Walsh. Arc consistency and quasigroup completion. In Proceedings of the ECAI-98 Workshop on Non-binary Constraints, 1998.

[27] D. Spiegelhalter, A. Dawid, S. Lauritzen, and R. Cowell. Bayesian analysis in expert systems. Statistical Science, 8:219-282, 1993.