
Proceedings of the 2011 Industrial Engineering Research Conference
T. Doolen and E. Van Aken, eds.

A Dynamic Programming Approach for Nation Building Problems

Gregory Tauer, Rakesh Nagi, and Moises Sudit
Department of Industrial and Systems Engineering,

State University of New York at Buffalo
Buffalo, NY

    Abstract

Recent attention has been given to quantitative methods for studying Nation Building problems. A nation's economic, political, and social structures constitute large and complex dynamic systems. This leads to the construction of large and computationally intensive Nation Building simulation models, especially when a high level of detail and validity are important. We consider a Markov Decision Process model for the Nation Building problem and attempt a dynamic programming solution approach. However, DP algorithms are subject to the "curse of dimensionality". This is especially problematic since the models we consider are of large size and high dimensionality. We propose an algorithm that focuses on a local decision rule for the area of a Nation Building model's state space around the target nation's actual state. This process progresses in an online fashion; as the actual state transitions, a new local decision rule is computed. Decisions are chosen to maximize an infinite horizon discounted reward criterion that considers both short- and long-term gains. Short-term gains can be described exactly by the local model. Long-term gains, which must be considered to avoid myopic behavior of local decisions, are approximated as fixed costs locally.

Keywords
Nation Building, Approximate Dynamic Programming, Markov Decision Processes.

1. Introduction
The ultimate goal of a Nation Building mission, as stated by RAND, is to "... promote political and economic reforms with the objective of transforming a society emerging from conflict into one at peace with itself and its neighbors." [1]. Nation Building is a task at which the United Nations and United States have considerable experience, and much effort has been devoted to understanding what works well and how the process can be improved [2]. Part of this improvement effort includes the construction of simulation models to study nation building problems [3, 4].

This paper considers the problem of using simulation models for the task of planning in a nation building environment. Our goal is to find a policy that can be followed to most efficiently realize a given measure of stability. We take a dynamic programming approach and view the problem as a Markov Decision Process (MDP), which allows us to study in a structured way the dynamic and random nature of the nation building problem. The largest difficulty faced in this effort is the high dimensionality, and therefore large number of potential states, in the simulation models under study. This is the well-known "Curse of Dimensionality" in dynamic programming. For instance, the example model shown in Section 4 involves 500 people that can each be in any of 20 employment categories, for a total of 2.28 × 10^34 state combinations (one for each unique employment configuration of the 500 people).

To help overcome the intractable number of states encountered in nation building problems we assume some good policy is known prior to any study of the simulation model. This assumption is based on the long history and study of nation building efforts. In Section 3.1 we will call this prior policy an Expert Policy and it will be used to help judge long-term consequences of immediate actions. In some sense this is similar to transfer learning, specifically the transfer of human advice [5]. Our suggested approach is even more closely related to the rollout policy approach [6] since we concentrate on results which improve a given policy.

The approach we adopt is similar in many aspects to response surface methodology (RSM) in the statistics literature. In RSM, one starts from some initial factor setting ($x_0$) and runs a series of experiments around $x_0$. The results of these experiments are then used to fit a function, often a low-degree polynomial function, to approximate the response surface around $x_0$. This function is then used as guidance to make small changes to improve the system. Usually, only small changes are made because the function being used to approximate the response surface was fit exclusively to the area near $x_0$; the approximation might provide poor generalization if big changes are made. This prompts an iterative approach where a function is fit, a small change is made, then a new function is fit for the area around the new factor setting, and so on.

In our problem we have a large dynamic system with some current state $x_0$. We will sample the dynamics of the system around $x_0$, fit a function to approximate the value of states (the value function) near $x_0$, and then use that function to make decisions about how to control the system. While in RSM it is possible to directly sample the response surface, the main complication in our problem is that it is not possible to directly sample the value function. The expert policy helps to more accurately estimate the true value function by providing a mechanism to look at the system far away from $x_0$ without needing to explicitly consider far away states. Similarly to RSM, after the system moves away from $x_0$ a new function will be fit; when the system transitions to a new state, $x_0$ is redefined as the new current state.

The approach of localized approximation is not in itself novel [7-10], and neither is the approach of using parameterized functions to approximate the value function [11-13]. However, we will discuss how their combination can help overcome some problems with using DP with simulation, as well as propose an extension to these techniques that allows the incorporation of prior knowledge about how to act: our Expert Policy.

2. Background
This section contains a brief overview of necessary topics. Section 2.1 briefly describes the Markov Decision Process framework and Section 2.2 shows how a linear programming formulation can solve such problems. Section 2.3 discusses specific difficulties faced when using simulation models to study MDPs. Finally, Section 2.4 discusses the use of local approximations and Section 2.5 covers using linear basis functions to approximate the value function of an MDP.

2.1 Markov Decision Processes
Certain qualities of Nation Building models can be inferred from the systems they address. We assume these models are stochastic and discrete time. With the additional simplifying assumption that any continuous quantities (such as gallons of water in a reservoir) can be expressed as discrete quantities (the reservoir is empty, half full, or full), the result is a discrete time, discrete state, and stochastic system that is well modeled as a Markov Decision Process.

A Markov Decision Process (MDP) [14] consists of a set of states $X$, a set of actions $A$, a state transition probability function $p(x'|x,a) \in [0,1]$ from all $x \in X$ to $x' \in X$ when action $a \in A$ is taken, and a reward function $R(x,a) \in \mathbb{R}$ for all $x \in X$ and $a \in A$. Given the current state $x \in X$, a decision maker must choose an action $a \in A$. Their action choice will result in the system transitioning to state $x'$ with probability $p(x'|x,a)$ and the decision maker receiving an immediate reward of $R(x,a)$. The decision maker must then choose a new action to take in $x'$, and so on. The decision maker's behavior can be specified by a decision rule $\pi : X \to A$, known as a policy. The goal of optimization from a planning perspective is to find a policy that directs the system such that as much reward as possible is collected. The specific objective that will be used here is to maximize the discounted expected reward received over an infinite horizon with respect to a starting state. This quantity is called the value of a state and is denoted as $V^\pi(x)$. Starting from state $x_0$, where $x_t$ is the state at time $t$, this is given by:

$V^\pi(x_0) = E\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, \pi(x_t)) \,\middle|\, x_0\right], \quad 0 < \gamma < 1.$   (1)

The discount parameter $0 < \gamma < 1$ controls the tradeoff between preferring short- or long-term rewards. The ultimate goal of planning in an MDP is to solve for an optimal policy $\pi^*$ that satisfies $\pi^* = \arg\max_\pi V^\pi(x)$. When $\pi$ is omitted from the notation $V^\pi(x)$, the value is assumed to be with respect to an optimal policy:

$V(x) = V^{\pi^*}(x) = \max_\pi V^\pi(x).$   (2)

$V(x)$ can be represented recursively with respect to the value of the other states. This relationship is called the Bellman equation: $V(x) = R(x,\pi^*(x)) + \gamma \sum_{x' \in X} p(x'|x,\pi^*(x)) V(x')$. Combined with the relationship of Equation (2), the Bellman equation shows that:

$V(x) = \max_{a \in A}\left( R(x,a) + \gamma \sum_{x' \in X} p(x'|x,a) V(x') \right),$   (3)

which also means that $\pi^*$ can be defined with respect to $V(x)$ by the action which maximizes the argument: $\pi^*(x) = \arg\max_{a \in A}\big( R(x,a) + \gamma \sum_{x' \in X} p(x'|x,a) V(x') \big)$. In the following sections, it will be assumed that $0 \le R(x,a) \le R_{\max}$ for all $x \in X$ and $a \in A$.
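To make Equations (2) and (3) concrete, the following is a minimal sketch of value iteration and greedy-policy extraction on a small illustrative MDP; the transition and reward arrays are invented for illustration and are not part of the paper's model.

```python
import numpy as np

# Illustrative 3-state, 2-action MDP (not from the paper's model).
# P[a, x, x2] = p(x2 | x, a), R[x, a] = immediate reward.
P = np.array([[[0.9, 0.1, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.1, 0.9]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 0.5],
              [2.0, 3.0]])
gamma = 0.95

# Repeatedly apply the Bellman optimality operator of Equation (3).
V = np.zeros(3)
for _ in range(10000):
    Q = R + gamma * np.einsum('axy,y->xa', P, V)  # Q[x, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

pi_star = Q.argmax(axis=1)  # greedy policy: the maximizing action in each state
print(V, pi_star)
```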

A key problem faced by MDPs is the curse of dimensionality, or the tendency for the size of $X$ to quickly become intractable with increasing dimensionality of the state space. Section 2.4 and Section 2.5 will review two methods for addressing this problem.

2.2 Linear Programming Solution Approach
The solution method that will be used here is a linear programming approach. Equation (3) can be used as a set of bounds on $V(x)$, with one bound for each action $a \in A$: $V(x) \ge R(x,a) + \gamma \sum_{x' \in X} p(x'|x,a) V(x'), \; \forall a \in A$. The MDP can be solved using these bounds for all states $x \in X$ using the linear program [14]:

$LP_P = \left\{ \min \sum_{x \in X} V(x) \;\middle|\; V(x) \ge R(x,a) + \gamma \sum_{x' \in X} p(x'|x,a) V(x'), \; \forall x \in X, a \in A \right\}.$   (4)

There are $|X|$ variables in $LP_P$; these variables are the state values $V(x)$, one for each $x \in X$. $LP_P$ also has $|X| \cdot |A|$ constraints, one for each state-action pair; this approach scales poorly with high dimensionality.
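As a sketch of how the exact program (4) could be solved for a small MDP, the following uses scipy.optimize.linprog on the same illustrative arrays as the value-iteration sketch above; it assumes full knowledge of $p(x'|x,a)$, which Section 2.3 notes is not available for the simulation models of interest.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative MDP (same as the value-iteration sketch): P[a, x, x2], R[x, a].
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 3.0]])
n_states, n_actions, gamma = 3, 2, 0.95

c = np.ones(n_states)  # objective of (4): minimize sum_x V(x)

# One constraint per state-action pair:
#   V(x) >= R(x,a) + gamma * sum_x2 p(x2|x,a) V(x2)
# rewritten into linprog's A_ub @ V <= b_ub form.
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        row = gamma * P[a, x, :].copy()
        row[x] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
V_lp = res.x  # agrees with value iteration up to solver tolerance
```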

2.3 MDP and Simulation
Using Markov Decision Processes to study simulation models presents additional challenges. For one, the simulation models we examine do not provide the state transition probability function $p(x'|x,a)$ explicitly. Reinforcement Learning deals with this problem of solving an MDP without prior knowledge of $p(x'|x,a)$ [15]. One way to handle not knowing $p(x'|x,a)$ is to estimate it by running the simulation model, although this approach is problematic in practice when facing large state spaces.

We assume a type of simulation model called a generative model, defined by [8] to be a black box that takes as an input any state-action pair $(x,a)$ and returns a randomly sampled next state $x'$ and reward function value $R(x,a)$. This allows an estimate of $p(x'|x,a)$ to be strengthened for an arbitrary $x \in X$. It is also possible to feed the generative model's output back into itself, resulting in a trajectory of sequential states $\{x_0, x_1, \ldots, x_N\}$, which is the output typically considered a simulation run.
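A minimal sketch of the generative-model interface assumed here; the class and function names are placeholders, not part of the cited work.

```python
import random

class GenerativeModel:
    """Black box: given a state-action pair (x, a), sample a next state and R(x, a)."""

    def step(self, x, a):
        # A real nation building simulator would advance its stochastic
        # dynamics by one time period here and return (next_state, reward).
        raise NotImplementedError

def sample_trajectory(model, x0, policy, n_steps):
    """Feed the model's output back into itself: returns {x0, x1, ..., xN}."""
    trajectory, x = [x0], x0
    for _ in range(n_steps):
        x, _reward = model.step(x, policy(x))
        trajectory.append(x)
    return trajectory

def random_policy(actions):
    """A trivial exploration policy, as used in the sampling step of Section 3.1."""
    return lambda x: random.choice(actions)
```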

2.4 Local Approximations
Examining only the state space centered around (local to) the current system state is a method for dealing with very large state spaces [7-10]. An H-neighborhood around state $x_0$ is defined to be the subset of states $X' \subseteq X$ such that every $x \in X'$ is reachable from $x_0$ within $H$ transitions of the system. It is possible to find an H-neighborhood when given a generative model by using a look-ahead tree [8]. Using a full look-ahead tree is often impractical due to the exponential slowdown it experiences with increasing $H$. In Section 3 this will be avoided by sampling depth-first on only a small subset of the branches.

Given an H-neighborhood $X'$, a local approximation method applied to planning in an MDP first solves for an optimal policy considering just $X'$, implements this policy for the current transition, and then repeats this process for the state the system transitions into. This approach is advantageous since the size of $X'$, and thus the complexity of solving the approximation, can be set through $H$ independent of the number of all states, $|X|$.

The solution to a local model centered around $x_0$ containing states $X'$ can be found using the LP:

$LP_H = \left\{ \min V(x_0) \;\middle|\; V(x) \ge R(x,a) + \gamma \sum_{x' \in X'} p(x'|x,a) V(x'), \; \forall x \in X', a \in A \right\}.$   (5)

$LP_H$ exactly solves an H-neighborhood local approximation around $x_0$ [10].

2.5 Value Function Approximation
Another method for dealing with very large state spaces is to represent $V(x)$ using a function with parameters $\theta$ [11-13]. We are specifically interested in approximating $V(x)$ using a function that is linear with respect to $\theta$ since this will allow a linear programming approach [12]. This is done using a set of basis functions $F$, where the value of basis function $f \in F$ for state $x$ is denoted $f(x)$. Then $V(x)$ can be approximated by a linear combination of these basis functions as $\tilde{V}(x|\theta)$:

$\tilde{V}(x|\theta) = \sum_{f \in F} \theta_f f(x)$   (6)

The goal of this approach is to use far fewer basis functions than there are states and let solution methods solve for $\theta$ instead of directly for $V(x)$. Note that the basis functions $f(x)$ may be nonlinear even though $\tilde{V}(x|\theta)$ is linear with respect to $\theta$. An overview of using basis functions to compactly represent functions is given in [16].
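A sketch of evaluating the approximation (6) with one possible choice of basis: degree-two polynomials centered on a reference state, mirroring the basis later used in Section 4. The helper names are assumptions.

```python
import numpy as np
from itertools import combinations

def poly2_basis(x, x0):
    """Degree-two polynomial basis centered on x0: constant, linear, squared, and cross terms."""
    d = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    feats = [1.0]
    feats.extend(d)                       # linear terms
    feats.extend(d * d)                   # squared terms
    feats.extend(d[i] * d[j] for i, j in combinations(range(len(d)), 2))  # cross terms
    return np.array(feats)

def V_tilde(x, x0, theta):
    """Equation (6): V~(x | theta) = sum over f of theta_f * f(x)."""
    return float(poly2_basis(x, x0) @ theta)
```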

The approximation $\tilde{V}(x|\theta)$ can be substituted into the exact $LP_P$ (4) to obtain the approximate linear program shown in (7), the solution to which approximates the solution to the MDP under study [12].

$LP_A = \left\{ \min_\theta \sum_{x \in X} \sum_{f \in F} \theta_f f(x) \;\middle|\; \sum_{f \in F} \theta_f f(x) \ge R(x,a) + \gamma \sum_{x' \in X} p(x'|x,a) \sum_{f \in F} \theta_f f(x'), \; \forall x \in X, a \in A \right\}$   (7)

The only difference between $LP_A$ and $LP_P$ is the substitution of $\tilde{V}$ for $V$. $LP_A$ has only $|\theta|$ variables, which can be far fewer than $LP_P$, which has $|X|$. Unfortunately, $LP_A$ still has $|X| \cdot |A|$ constraints. The result is that $LP_A$ still has an intractable number of constraints when $|X|$ is large. A solution is proposed in which only a subset of the constraints is imposed; the resulting LP is called a Reduced Linear Program [17]. While this approach only offers formal guarantees under special conditions, it has been shown to work well in some applications, as illustrated by the queuing network example of [17]. Unfortunately, $LP_A$ is not suitable for direct use in a simulation setting since it requires prior knowledge of $X$ and $p(x'|x,a)$, which are assumed to be unavailable.

3. Methodology
In Section 2, two different methods were examined for reducing the number of states that must be considered in finding an approximate solution to an MDP. In Section 3.1, the method from Section 2.5 of linear programming using a linear value function approximation will be modified to consider only the area local to some initial state $x_0$. Section 3.2 will show how this approximation can be used in an online fashion to approximately solve the MDP under study, and Section 3.3 discusses some of this method's limitations and practical issues.

3.1 Incorporating Expert Policies into the Linear Programming Approach
The largest advantage enjoyed by methods of local approximations to MDPs, that they do not consider states outside of $X'$, is also a significant weakness. Take for example an extreme case where $R(x,a) = 0$ for all $x \in X'$ but $R(x,a) > 0$ for all $x \in X \setminus X'$. From the local model's point of view this example contains no rewards, so the estimated value of all $x \in X'$ will be zero and all policies will be thought optimal. In some sense, this error is due to the local approximation not considering the portion of the state space that is interesting ($R(x,a) > 0$). In this section we discuss how, through use of a given suggested policy, it is possible to sample the value of following that policy and include the result as a fixed reward (by which we mean constant reward) in the local model.

It is the case in many systems that an expert is able to provide a policy before any optimization. We know by definition from (2) that an expert's policy $\pi_E$ has a value $V^{\pi_E}(x) \le V(x)$. An important property of $V^{\pi_E}(x)$ is that it can be estimated with arbitrary precision through simulation; all that is needed is to repeatedly run the simulation starting from $x$, record the sum of discounted rewards received, and take their average. We will call this estimated value $\hat{V}^{\pi_E}_x$ and note that $\hat{V}^{\pi_E}_x = V^{\pi_E}(x) + \epsilon$ for some error $\epsilon$. In this section we assume $\epsilon = 0$, but in Section 3.3 we propose a strategy to cope with the reality that $|\epsilon| > 0$ due to sampling error.
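A sketch of the Monte Carlo estimate $\hat{V}^{\pi_E}_x$ described above, truncated after a finite horizon in the spirit of the parameter $D$ introduced later in this section; the function and parameter names are placeholders.

```python
def estimate_expert_value(model, pi_E, x, gamma=0.95, n_runs=50, horizon=100):
    """Average truncated discounted return (Equation (1)) of following pi_E from state x."""
    total = 0.0
    for _ in range(n_runs):
        state, ret, discount = x, 0.0, 1.0
        for _ in range(horizon):
            state, reward = model.step(state, pi_E(state))
            ret += discount * reward
            discount *= gamma
        total += ret
    return total / n_runs  # V_hat^{pi_E}_x = V^{pi_E}(x) + epsilon
```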

We pursue an approximate approach similar to that of Section 2.5 since it allows us to set our workload independently of $|X|$ through controlling how many constraints are sampled. This is especially important since, in problems of high dimensionality, the H-neighborhood around $x_0$ might itself be too large to solve exactly. Unfortunately, what is left after we enforce only a sample of the constraints local to $x_0$ from $LP_A$ is an approximation of an approximation, the performance of which is hard to guarantee.

The resulting LP, which we name $RALP_{HE}$, is a Reduced Approximate Linear Program for the H-neighborhood around an initial state $x_0$, assuming a given expert's policy $\pi_E$:

$\min_\theta \; \sum_{x \in X^{(F1)}} \sum_{f \in F} \theta_f f(x)$   (8)

subject to:

$\sum_{f \in F} \theta_f f(x) \ge \hat{R}(x,a) + \gamma \sum_{x' \in X} \hat{p}(x'|x,a) \sum_{f \in F} \theta_f f(x'), \quad \forall x \in X^{(F1)}, \; a \in A$   (9)

$\sum_{f \in F} \theta_f f(x) \ge \hat{V}^{\pi_E}_x, \quad \forall x \in X^{(F2)}$   (10)

This approximation is built on samples taken from a given generative model. The sets $X^{(F1)}$ and $X^{(F2)}$ are random subsets of $X$. The constraint set (9) is built as follows. First, the generative model is run from $x_0$ using a random action $a_0$, resulting in a next state $x_1$. The generative model is run again, this time from $x_1$ using a new randomly chosen action $a_1$, to get a next state $x_2$. This process is repeated until $x_H$; then one of the visited states is chosen at random and added to the set $X^{(F1)}$. This process is repeated $F1$ times, resulting in $|X^{(F1)}| = F1$ (it is assumed there are no repeats). Next, one constraint of type (9) is built for each $x_i \in X^{(F1)}$ by sampling the generative model $C1$ times for each $a \in A$. The resulting samples are used to form the approximations $\hat{R}(x,a)$ and $\hat{p}(x'|x,a)$ and the constraint (9) for each state in $X^{(F1)}$. While constraint set (9) includes a summation over all $x' \in X$, this summation only has to be performed over those $x'$ for which $\hat{p}(x'|x,a) > 0$, of which there are at most $C1$ (since only $C1$ samples were taken to build $\hat{p}(x'|x,a)$). $X^{(F1)}$ is also now used to build the objective function.
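A sketch of the sampling procedure just described for building $X^{(F1)}$ and the empirical quantities $\hat{R}$ and $\hat{p}$ behind constraint set (9). Data structures and names are placeholders, and the handling of repeated states is a simplification.

```python
import random
from collections import defaultdict

def build_F1(model, x0, actions, H, F1, C1):
    """Sample F1 anchor states from random-action trajectories of length H, then
    estimate R_hat(x, a) and p_hat(x2 | x, a) from C1 generative-model calls each."""
    anchors = set()
    while len(anchors) < F1:  # assumes at least F1 distinct states are reachable
        x, trajectory = x0, [x0]
        for _ in range(H):
            x, _ = model.step(x, random.choice(actions))
            trajectory.append(x)
        anchors.add(random.choice(trajectory))  # states must be hashable, e.g. tuples

    R_hat, p_hat = {}, {}
    for x in anchors:
        for a in actions:
            counts, reward_sum = defaultdict(int), 0.0
            for _ in range(C1):
                x_next, r = model.step(x, a)
                counts[x_next] += 1
                reward_sum += r
            R_hat[(x, a)] = reward_sum / C1
            p_hat[(x, a)] = {xn: k / C1 for xn, k in counts.items()}
    return anchors, R_hat, p_hat
```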

$X^{(F2)}$ is built in the same way as $X^{(F1)}$, except instead of selecting a state to add at random from $\{x_0, x_1, \ldots, x_H\}$, $x_H$ is chosen. $\hat{V}^{\pi_E}_x$ is then estimated from the average of (1) over $C2$ runs of $D$ time steps while following $\pi_E$. Each one of the $C2$ runs for each $x_H \in X^{(F2)}$ requires running the generative model for $D$ time steps using policy $\pi_E$ to get $\{x_H, x_{H+1}, \ldots, x_{H+D}\}$ and evaluating (1) for this sequence of states, actions, and rewards. The estimate $\hat{V}^{\pi_E}_x$ can then be used to generate the corresponding constraint (10) for each $x \in X^{(F2)}$.

The solution to $RALP_{HE}$ is a value for the parameters $\theta$. Since the interest here is in planning, we wish to know what action to take from $x_0$. This can be found using $\theta$ by:

$a_{RALP} = \arg\max_{a \in A}\left( R(x_0,a) + \gamma \sum_{x' \in X} p(x'|x_0,a) \sum_{f \in F} \theta_f f(x') \right),$   (11)

and for convenience this whole process, with parameters, is referred to as $RALP(x_0, \pi_E, H, D, C1, C2, F1, F2, \gamma) = a_{RALP}$.
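A sketch of extracting $a_{RALP}$ via Equation (11) once $\theta$ has been fit, re-sampling the generative model to estimate the expectation; the names and sample count are placeholders.

```python
def ralp_action(model, x0, actions, theta, basis, gamma=0.95, n_samples=20):
    """Equation (11): choose the action maximizing R(x0,a) + gamma * E[V~(x' | theta)]."""
    best_action, best_value = None, float('-inf')
    for a in actions:
        reward_sum, next_value_sum = 0.0, 0.0
        for _ in range(n_samples):
            x_next, r = model.step(x0, a)
            reward_sum += r
            next_value_sum += basis(x_next) @ theta
        q = reward_sum / n_samples + gamma * next_value_sum / n_samples
        if q > best_value:
            best_action, best_value = a, q
    return best_action
```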

3.2 An Online Approximate Linear Programming Approach
In this section an algorithm is described that uses the approximate local model described in Section 3.1 to compute a policy on demand. We call this policy $\pi_{HE}$ and its value for a given state is computed as:

$\pi_{HE}(x) = RALP(x, \pi_E, H, D, C1, C2, F1, F2, \gamma).$   (12)

This method is called an Online Approach because $\pi_{HE}$ is only computed as required; $\pi_{HE}(x)$ will only be evaluated if state $x$ is encountered. If only a subset of states is ever expected to be visited, this online approach is beneficial since it limits the time spent planning policy choices for states that are never encountered.
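A sketch of the online application of $\pi_{HE}$: a new local decision rule is computed each time the actual state transitions. Here `ralp` stands in for the whole $RALP(x_0, \pi_E, H, D, C1, C2, F1, F2, \gamma)$ computation and is hypothetical.

```python
def run_online(model, x0, ralp, n_steps):
    """Apply pi_HE online: solve a fresh local model around each state encountered."""
    x, total_reward = x0, 0.0
    for _ in range(n_steps):
        a = ralp(x)               # pi_HE(x), computed only for states actually visited
        x, r = model.step(x, a)   # the real system transitions; reward is observed
        total_reward += r
    return total_reward / n_steps  # average reward, as reported in Section 4
```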

3.3 Practical Issues
One large issue the method described in Section 3.2 faces in practice is that of choosing suitable basis functions. In many cases some intuition on this is available; in other cases it is necessary to guess. When forced to guess, it might become necessary to include far more flexibility in the approximation than would otherwise be required. When using a linear combination of basis functions as an approximation, increased flexibility comes from an increased number of basis functions (higher $|F|$). This is undesirable from a computational standpoint since the resulting linear program's size is directly related to $|F|$. One strength of our method over similar approaches is that by limiting the states examined to a local subset, the number of basis functions needed to provide a decent fit over that area should be smaller.

A second issue faced in practical applications is that, although assumed known exactly in Section 3.1, the estimate $\hat{V}^{\pi_E}_x$ has uncertainty associated with it. This could be handled by using the worst-case confidence interval bound; however, since the goal is to compute a next-step action in an approximate sense, it seems reasonable to continue assigning $\hat{V}^{\pi_E}_x$ to the mean of its samples as described in Section 3.1.

4. Results
The proposed method is illustrated using a discrete time nation building simulation model having a structure simplified from [4], then modified to have multiple regions and be controllable. The model under consideration is of a labor market and concerns the employment state of an area's population. Within each of the model's five regions, each resident can occupy one of four employment states: unemployed, private sector employee, government employee, or criminal. Unemployed persons may become employed as either private sector or government employees or as criminals. Likewise, private sector employees, government employees, and criminals may all become unemployed. In addition to employment, unemployed persons may emigrate to a different region. The allowable employment state transitions are shown in Figure 1.

Figure 1: Population transitions within region.

The rates at which transitions occur follow binomial distributions with rates governed by action choice. There are two basic types of control actions in this model: Economic Stimulus, which favorably affects the unemployed to private sector / government employee transition rates, and Crime Prevention, which favorably affects the unemployed to criminal transition rates. These rates are given in Table 1 for Crime Prevention and Table 2 for Economic Stimulus actions. Each action can be chosen for at most one region in each discrete time period, leading to a total of $|A| = 36$ unique action settings.

Table 1: Effect of crime prevention action on rates between unemployed and criminal populations.

                 On                        Off
From/To          Unemployed   Criminal    Unemployed   Criminal
Unemployed       -            0.00        -            0.02
Criminal         0.06         -           0.02         -

Table 2: Effect of economic stimulus action on rates between unemployed and govt./private employed populations.

                 On                                     Off
From/To          Unemployed   Private   Government     Unemployed   Private   Government
Unemployed       -            0.025     0.038          -            0.010     0.010
Private          0.000        -         -              0.009        -         -
Government       0.000        -         -              0.009        -         -

The total population in the model is 500 residents spread across the five regions. Unemployed residents may migrate from their current region to another. Residents emigrating from Region 1 will join Region 2, leaving Region 2 to Region 3, ..., Region 5 back to Region 1. The number of unemployed residents migrating also follows a binomial distribution with a rate of 0.01.

Considered as an MDP, the state of this simulation is a 20-dimensional population vector (one dimension per population group per region) for a total of $|X| = 2.28 \times 10^{34}$ states. The reward function $R(x,a)$ is defined to be the total population across all five regions employed by the private sector or government. The decision maker's goal is to choose on which regions to apply the control actions to direct the system for maximum employment.
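The following is a sketch of one time step of the labor-market model as we read Tables 1 and 2 and the migration rule; the exact order in which the binomial draws are applied, and the clamping of counts at zero, are our assumptions rather than details given in the paper.

```python
import numpy as np

U, P, G, C = 0, 1, 2, 3          # unemployed, private, government, criminal
rng = np.random.default_rng(0)

def step(pop, stim_region, crime_region):
    """One period of the 5-region model. pop is a (5, 4) array of counts.
    stim_region / crime_region: region index receiving each action (or None)."""
    new = pop.copy()
    for r in range(5):
        stim, crime = (r == stim_region), (r == crime_region)
        u, p, g, c = pop[r]
        # Outflows from unemployment (Tables 1 and 2, plus migration at 0.01).
        to_p = rng.binomial(u, 0.025 if stim else 0.010)
        to_g = rng.binomial(u, 0.038 if stim else 0.010)
        to_c = rng.binomial(u, 0.00 if crime else 0.02)
        migrate = rng.binomial(u, 0.01)
        # Inflows back to unemployment.
        from_p = rng.binomial(p, 0.000 if stim else 0.009)
        from_g = rng.binomial(g, 0.000 if stim else 0.009)
        from_c = rng.binomial(c, 0.06 if crime else 0.02)
        new[r, U] += from_p + from_g + from_c - to_p - to_g - to_c - migrate
        new[r, P] += to_p - from_p
        new[r, G] += to_g - from_g
        new[r, C] += to_c - from_c
        new[(r + 1) % 5, U] += migrate   # Region 1 -> 2 -> ... -> 5 -> 1
    new = np.maximum(new, 0)             # crude guard against oversubscribed draws
    reward = int(new[:, P].sum() + new[:, G].sum())  # R(x, a): total employed population
    return new, reward
```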

The model was tested using the method of local approximation described in Section 3, using parameter choices from Table 3. As required for the method proposed in Section 3, $\pi_E$ is defined as: apply the economic stimulus and crime prevention efforts to the region with the highest total population, with ties broken at random. The set of basis functions $F$ for the method is defined as the set of polynomials that span all polynomials of degree two centered on $x_0$, so there are a total of $|F| = \binom{20}{2} + \binom{20}{1} + 20 + 1 = 231$ basis functions.
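A quick check of the basis-function count for a 20-dimensional state under the degree-two polynomial basis described above.

```python
from math import comb

# cross terms + linear terms + squared terms + constant
n_basis = comb(20, 2) + comb(20, 1) + 20 + 1
print(n_basis)  # 231
```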

Table 3: Detailed parameter settings.

H     C1    C2    F1      F2     D      γ
20    20    50    1000    100    100    0.95

Testing was done from two starting states. In the first starting state, each region has 50 unemployed residents, 50 criminal residents, and no private or government employees. In the second starting state, Region 1 has 75 private sector and 75 government employees and no unemployed or criminal residents, while the other four regions have 100 unemployed residents. The second starting state was chosen to demonstrate a condition where $\pi_E$ performs poorly.

The model was run repeatedly from each starting state to compare the performance of three policy choices: random action selection, $\pi_E$, and the method of local approximation from Section 3. Runs were carried out for 200 discrete time steps from the first starting state and 20 from the second. The shorter time period from the second starting condition was chosen to illustrate the ability of $RALP_{HE}$ to recover in a state area where $\pi_E$ alone performs poorly. Results are reported as a 99% confidence interval on the average reward received over the course of a run. No confidence interval bounds are reported for the random and $\pi_E$ cases since they have short runtime and were run enough times (10,000 each) to make their confidence interval widths insignificant. The results are presented in Table 4.

Table 4: Results from two starting states of the three tested policies.

                     Random    πE        RALP_HE
                     Mean      Mean      CI-      Mean     CI+
Starting Point 1     171.7     179.5     175.9    180.8    185.8
Starting Point 2     218.4     209.1     222.6    225.9    229.1

As shown by Table 4, $RALP_{HE}$ outperformed random action selection from both starting points. While the mean performance of $RALP_{HE}$ also exceeded the mean performance of $\pi_E$ from each starting state, this difference was not statistically significant from Starting Point 1. The second starting point demonstrates how $RALP_{HE}$ can still perform better than random choices on this model even when the expert policy it uses, $\pi_E$, cannot. Unfortunately, the state space of this model is too large to permit an optimal solution, so it is not possible to report on performance with respect to optimality.

5. Conclusions
Using an online method to compute policy choices is advantageous if the effort required to compute a single choice approximately is far less than that required to compute the entire policy at once. This is the case when each policy choice is computed based on only the subset of an MDP's state space near the current state $x_0$, since the H-neighborhood around $x_0$ depends on $H$, not on the number of total states. Section 3 discusses how the estimates obtained from these local approximations can be strengthened using a reasonable policy guess, and Section 4 showed this to be the case empirically. Future work will focus on improving this approach and further testing its performance. One specific area for improvement is that, while the method presented here focuses on improving an approximation of the value function, this may not always result in an improved policy.


References
1. Dobbins, J., Jones, S. G., Crane, K., and DeGrasse, B. C., 2007. The Beginner's Guide to Nation Building. RAND Corporation.

    2. Dobbins, J., 2005. Nation-Building: UN Surpasses U.S. on Learning Curve. Technical report, RAND Corporation.

3. Choucri, N., Electris, C., Goldsmith, D., Mistree, D., Madnick, S., Morrison, J. B., Siegel, M., and Sweitzer-Hamilton, M., 2005. Understanding and Modeling State Stability: Exploiting System Dynamics. Technical report, Composite Information Systems Laboratory, Sloan School of Management, Massachusetts Institute of Technology.

4. Richardson, D. B., Deckro, R. F., and Wiley, V. D., 2004. Modeling and Analysis of Post-Conflict Reconstruction. The Journal of Defense Modeling and Simulation: Applications, Methodology, Technology, 1(4), 201-213.

5. Taylor, M. E. and Stone, P., 2009. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research, 10, 1633-1685.

6. Bertsekas, D. P. and Castanon, D. A., 1999. Rollout Algorithms for Stochastic Scheduling Problems. Journal of Heuristics, 5(1), 89-108.

7. Chang, H. S. and Marcus, S. I., 2003. Approximate Receding Horizon Approach for Markov Decision Processes: Average Reward Case. Journal of Mathematical Analysis and Applications, 286, 636-651.

8. Kearns, M., Mansour, Y., and Ng, A. Y., 2002. A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning, 49, 193-208.

9. Leizarowitz, A. and Shwartz, A., 2008. Exact Finite Approximations of Average-Cost Countable Markov Decision Processes. Automatica, 44, 1480-1487.

10. Heinz, S., Kaibel, V., Peinhardt, M., Rambau, J., and Tuchscherer, A., 2006. LP-Based Local Approximation for Markov Decision Problems. Technical report, Konrad-Zuse-Zentrum für Informationstechnik Berlin.

11. Tsitsiklis, J. N. and Roy, B. V., 1997. An Analysis of Temporal-Difference Learning with Function Approximation. IEEE Transactions on Automatic Control, 42(5), 674-690.

12. de Farias, D. P. and Roy, B. V., 2003. The Linear Programming Approach to Approximate Dynamic Programming. Operations Research, 51(6), 850-865.

    13. Powell, W., 2007. Approximate Dynamic Programming. John Wiley & Sons Inc.

14. Puterman, M., 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons Inc.

15. Kaelbling, L., Littman, M., and Moore, A., 1996. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4, 237-285.

    16. Bishop, C., 2006. Pattern Recognition and Machine Learning. Springer.

17. de Farias, D. P. and Roy, B. V., 2004. On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming. Mathematics of Operations Research, 29(3), 462-478.