114167416 Dynamic Programming and Optimal Control

DynamicProgramming andOptimalControl VolumeI THIRDEDITION P.Bertsekas MassachusettsInstitute of Technology WWW siteforbookinformation and http:/ /www.athenasc.com liJAthena Scientific,Belmont, Athena Scientific PostOfficeBox805 NH03061-0805 U.S.A. Ernail:[email protected] WWW:http:/ /www.athenasc.co:m CoverDesign:AnnGallager,www .gallagerdesign.com 2005,2000,1995DimitriP.Bertsekas Allrightsreserved.Nopartofthisbookmaybereproducedinanyform by ~ m ~electronic or mechanical means(including photocopying, recording, ormformationstorageandretrieval)withoutpermissioninwritingfrom the publisher. Publisher'sCataloging-in-Publication Data I3ertsekas,DimitriP. Dynamic Programming andOptimalControl IncludesBibliographyandIndex 1.MathematicalOptimization.2.DynamicProgramming.I. Title. QA402.5.B4652005519.70300-91281 ISBN1-886529-26-4 ABOUTTHE AUTHOR DimitriBertsekasstudiedMeehaniealandElectricalEngineeringat theNationalTechnicalUniversityofAthens,Greece,andobtainedhis Ph.D. in system science from the Massachusetts Institute of Technology.He hasheldfacultypositionswiththeEngineering-EconomicSystemsDept., StanfordUniversity,andtheElectricalEngineeringDept.oftheUniver-sityof Illinois,Urbana.Since1979hehasbeenteachingattheElectrical EngineeringandComputerScienceDepartmentoftheMassachusettsIn-stituteof Technology(M.I.T.),whereheiscurrentlyMcAfeeProfessorof Engineering. His research spans several fields,including optimization, control, large-scalecomputation,anddatacommunicationnetworks,andiscloselytied tohisteachingandbookauthoringactivities.Hehaswrittennmnerous research papers,and thirteen books,several of which are used astextbooks inMITclasses.Heconsultsregularlywithprivateindustryandhasheld editorialpositionsin several journals. ProfessorBertsekaswasawardedtheINFORMS1997PrizeforH.e-search Excellencein the Interface Between Operations ResearchandCom-puterScienceforhisbook"Neuro-DynamicProgramming"(eo-authored withJohn Tsitsiklis),the2000GreekNationalAwardforOperationsRe-search,andthe2001ACCJohnR.RagazziniEducationAward.In2001, he waselectedto the UnitedStatesNationalAcademy of Engineering. ATHENASCIENTIFIC OPTIMIZATIONANDC0l\1PUTATIONSERIES 1.Convex Analysis and Optimization, by Dimitri P.Bertsekas, with AngeliaNedicandAsumanE.Ozdaglar,2003,ISBN1-886529-45-0,560pages 2.Introduction to Probability,by Dimitri P.Bertsekas and John N. Tsitsiklis,2002,ISBN1-886529-40-X,430pages 3.DynamicProgrammingandOptimalControl,Two-VolumeSet, by Dimitri P.Bertsekas,2005,ISBN1-886529-08-6,840pages 4.NonlinearProgramming,2ndEdition,byDimitriP.Bertsekas, 1999,ISBN1-886529-00-0,791pages 5.Network Optimization:Continuous and Discrete Models, by Dim-itri P.Bertsekas,1998,ISBN1-886529-02-7,608pages 6.Network Flows and Monotropic Optimization, by R. Tyrrell Rock-afellar,1998,ISBN1-886529-06-X,634 pages 7.Introduction to Linear Optimization,by Dimitris Bertsimas and John N.Tsitsiklis,1997,ISBN1-886529-19-1,608pages 8.ParallelandDistributedComputation:NumericalMethods,by Dimitri P. Bertsekas and John N.Tsitsiklis, 1997, ISBN 1-886529-01-9,718pages 9.Neuro-Dynamic Programming, by Dimitri P.Bertsekas and John N.Tsitsiklis,1996,ISBN1-886529-10-8,512pages 10.Constra.inedOptimization and Lagrange Multiplier Methods,by DimitriP.Bertsekas,1996,ISBN1-88(1529-04-3,410pages 11.Stochastic Opthnal Control:The Discrete-Time Case, by Dimitri P.BertsekasandStevenE.Shreve,1996,ISBN1-886529-03-5, 330pages Contents 1.The Dynamic ProgrammingAlgorithm 1.1.Introduction......... 1.2.The Basic Problem.......... 1.3.The Dynamic ProgrammingAlgorithm. 1.4.State Augmentation andOther Reformulations 1.5.SomeMathematical Issues........ 1.6.Dynamic Prograrnmingand MinimaxControl 1.7.Notes,Sources,and Exercises....... 2.DeterministicSystemsand the ShortestPath Problein 2.1.Finite-State Systemsand ShortestPaths 2.2.SomeShortest Path Applications 2.2.1.Critical Path Analysis 2.2.2.Hidden MarkovModelsand the ViterbiAlgorithm 2.3.Shortest Path Algorithms........... 2.3.1.LabelCorrecting Methods........ 2.3.2.LabelCorrecting Variations- A*Algorithm 2.3.3.Branch-and-Bound.......... 2.3.4.Constrained and Multiobjective Problems 2.4.Notes,Sources,and Exercises. 3.DeterministicContinuous-Time 3. LContinuous-TimeOptimal Control 3.2.The Hamilton-Jacobi-Bellman Equation 3.3.The Pontryagin Minimum Principle 3.3.1.An InformalDerivation Usingthe HJBEquation 3.3.2.ADerivation Based on Variational Ideas 3.3.3.MinimumPrincipleforDiscrete-Time Problems 3.4.Extensions of the Minimum Principle 3.4.1.Fixed Terminal State 3.4. 2.FreeInitial State p.2 p.12 p.18 p.35 p.42 p.46 p.51 p.64 p.f\8 p.68 p.70 p.77 p.78 p.87 p.88 p.91 p.97 p.106 p.109 p.115 p.115 p.125 p.129 p.131 p.131 p.135 vi 3.4.3.FreeTerminalTime..... Time-Varying SystemandCost 3.4.5.SingularProblems.. Notes,Sonrces,andExercises.... Contents p.135 p.138 p.139 p.142 4.Proble1nswithPerfectState Information 4.1.LinearSystems andQuadraticCost p.148 p.162 p.170 p.176 p.186 p.190 p.191 p.197 p.201 4. 2.InventoryControl L1.3.DynamicPortfolioAnalysis.... 4.4.Optimal Stopping Problems.... 4.5.Scheduling andthe Interchange Argument 4.6.Set-MembershipDescription of Uncertainty 4.6.1.Set-MembershipEstimation.... 4.6.2.Control withUnknown-but-BoundedDisturbances 4. 7.Notes,Sources,andExercises....... 5.Problen'lswith ImperfectState Information 5.1.Reductiontothe PerfectInformationCase 5.2.LinearSystemsandQuadraticCost p.218 p.229 p.236 p.251 p.252 p.258 p.270 6. 5.3.MinimumVarianceControl of LinearSystems 5.4.SufiicientStatistics andFinite-State MarkovChains 5.4.1.TheConditionalState Distribution 5.4.2.Finite-State Systems. 5.5.Notes,Sources,andExercises .. ,JLJL.u2pw,we see that the optimal open-loop policy isto play timid in oneof the twogamesandplayboldinthe other,and otherwiseitisoptimal toplayboldinbothgames.For Pw= 0.45andPd=0.9,Eq.(1.3)givesan optimal open-loop match win probability of roughly 0.425.Thus,the value of the information (the outcome of the firstgame) is the difference of the optimal closed-loop and open-loop values, which is approximately 0.53-0.425 =0.105. Sec.1.2The Basic Problem Moregenerally,by subtracting Eqs.(1.2)and(1.:3),weseethat Valueof information= P;v (2- Pw)+ Pw (lPw )Pd - P;v- Pw(l- Pw)max(2pw,Pd) = Pw(l- Pw) min(pw,Pd- Pw) 11 It shouldbenoted,however,thatwhereasavailabilityofthestate informationcannothurt,itmaynotresultinanadvantageeither.For instance,indeterministicproblems,wherenorandomdisturbancesare present,one can predict the future states given the initial state andthese-quence of controls.Thus, optimization over all sequences {1to, 'ttl, ... , 1LN _1} of controlsleadsto the sameoptimalcostasoptimizationoveralladmis-siblepolicies.The same can be true even in some stochastic problems(see forexampleExercise1.13).Thisbringsuparelatedissue.Assumingno information isforgotten,the controlleractually knowsthe prior states and controls:r:o,tto, ... ,Xk-I,'ttk-1aswellasthecnrrentstate:r;kTherefore, the questionariseswhetherpoliciesthatusethe entire system history can be superior to policies that use just the current state.The answer turns out to be negativealthough the proof istechnically complicated(see[BeS78]). Theintuitivereasonisthat,foragiventimekandstateXk,allfuture expected costsdepend explicitly just onXkand notonpriorhistory. Encoding Riskin theCost:Function Asmentionedabove,animportantcharacteristicofstochasticproblems isthepossibilityofusinginformationwithadvantage.Anotherdistin-guishing characteristic isthe need to take intoaccountT'iskin the problem formulation.For example,in atypical investmentproblem oneisnotonly interested in the expected profitof the investmentdecision,butalsoinits variance:givenachoicebetweentwoinvestmentswithnearlyequalex-pected profitand markedly differentvariance,mostinvestorswouldprefer theinvestmentwithsmallervariance.Thisindicatesthatexpectedvalue of cost or reward need not be the most appropriate yardstick forexpressing adecisionmaker'spreferencebetweendecisions. Asamoredramaticexampleoftheneedtotakeriskintoaccount whenformulatingoptimizationproblemsunderuncertainty,considerthe so-called St. Petersburg paradox.Here,aperson isoffered the opportunity of payingxdollarsinexchangeforparticipationinthefollowinggame:a faircoinisflippedsequentiallyandthepersonispaid2kdollars,wherek isthenumberoftimesheadshavecomeupbeforetailscomeupforthe firsttime.Thedecisionthatthepersonmustmakeiswhethertoaccept orrejectparticipation inthe game.Nowif heaccepts,hisexpectedprofit lTheDynamic Programming AlgorithmChap.1 fromthe game is 001 V-- 2k-X =00 ~2k+1' k=O soifhisaeceptaneeeriterionisbasedonmaximizationof expectedprofit, heiswillingtopayanyamountxto enterthegame.This,however,isin strongdisagreementwithobservedbehavior,duetotheriskelementin-volvedinentering the game,and showsthat adifferentformulationof the problemisneeded.Theformulationof problemsof deeisionunderuncer-tainty so that riskisproperly taken into aeeountisadeep subject with an interesting theory.An introduction to thistheory isgiveninAppendix G. It isshowninparticular that minimization of expected costisappropriate under reasonable assumptions, provided the cost function is suitably chosen sothatitproperly en eo desthe riskpreferencesof the decisionmaker. 1.3THE DYNAMICPROGRAMMINGALGORITHM Thedynamieprogramming(DP)techniquerestsonaverysimpleidea, theprincipleof optimality.Thename isdue to Bellman,whocontributed agreatdealtothepopularizationof DPandtoitstransformationintoa systematictool.H.oughly,theprincipleofoptimalitystatesthefollowing ratherobviousfact. P r::._1 of Optirnality Letn*{JLo,11i, ... , 11N- _1}beanoptimalpolicyforthe basicprob-lem,andassumethat whenusingn*,agivenstateXioccursat time iwithpositiveprobability.Considerthe subproblem wherebyweare atXiattimeiandwishto minimizethe"cost-to-go"fromtimeito timeN Then the truncated poliey {JLi, fLi+ll ... , p,J'v _1}is optimal for this sub-problem. The intuitive justification of the principle of optimality is very simple. If the truncated policy{JLi, J-Li+I,. .. , J-LJ'v _1}were not optimal as stated, we wouldbeableto reduce the costfurtherby switching to an optimal policy forthe subproblemoncewereachXiForanauto travelanalogy,suppose that the fastestroute fromLosAngeles to Boston passes through Chicago. The principle of optimality translates to the obvious factthat the Chicago to Boston portion of the route isalso the fastestroute foratrip that starts fromChicagoand ends inBoston. Sec.1.3 TheDynamic Progmrnming Algorithm 19 Theprincipleofoptimalitysuggeststhatanoptimalpolieycanbe construetedinpiecemealfashion,firstconstruetinganoptimalpolicyfor the"tail subproblem"involvingthe laststage,then extending the optimal policy to the "tail subproblem"involving the last two stages, and continuing in this manner until an optimal poliey for the entire problem isconstrueted. The DP algorithm is based on this idea:it proceeds sequentially,by solving allthetailsubproblemsofagiventimelength,usingthesolutionofthe tailsubproblemsof shortertimelength.Weintroducethe algorithmwith twoexamples,one deterministicandone stochastie. Let us consider the scheduling example of the preceding section,anclletus apply the principle of optimality to calculate the optimal schedule.We ~ 1 ~ v e tosehecluleoptimallythefouroperationsA,B,C,andD.Thetrans1tlon and setup costsare shownin Fig.1.3.1next to the correspondingares. Aecordingto the prineipleof optimality,the"tail"portionof anop-timalschedulemustbeoptimal.Forexample,supposethattheoptimal scheduleisCABD.Then,havingscheduledfirstCandthenA,itmust be optimal tocompletethe sehedulewith BDratherthan with DB.With thisinmind,wesolveallpossibletail subproblemsof length two,thenall tail subproblems of length three,and finallytheoriginalproblem that has lengthfour(thesubproblemsoflengthoneareofcoursetrivialbecause thereisonlyoneoperationthatisasyetunseheduled).Aswewillsee shortly,the tail subproblemsof lengthk + 1 are easily solved once we have solved thetail subproblemsof lengt.hk,andthisistheessenceof theDP technique. TailSubproblemsof Length 2:These subproblems are the onesthat involve two unscheduled operations and correspond to the states AB,AC,CA, and CD(seeFig.1.3.1) 8tateAB: Here it is only possible to schedule operation Cas the next operation,sotheoptimalcostofthissubproblemis9(theeostof schedulingCafterB,whichis3,plusthe eostof schedulingDafter C,whichis6). StateAC:Herethepossibilitiesareto(a)scheduleoperationBand thenD,whichhascost5,or(b)scheduleoperationDandthenB, which has cost 9.The firstpossibility isoptimal,and the correspond-ing cost of the tail subproblem is5,asshown next to node ACin Fig. 1.3.1. StateCA:Herethepossibilitiesareto(a)sehecluleoperationBand thenD,whichhascost3,or(b)scheduleoperationDandthenB, which has cost 7.The firstpossibility isoptimal,and the eorrespond-The Dynamic Programming AlgorithmChap.1 i() Figure1.3.1'[htnsitiongraphofthedeterministicschedulingproblem,with thecostofeachdecisionshownnexttothecorrespondingarc.Nexttoeach node/stateweshowthecosttooptimallycompletetheschedulestartingfrom that state.Thisisthe optimalcost of thecorresponding tail subproblem(cf.the principleofoptimality).Theoptimalcostfortheoriginalproblemisequalto 10,asshownnexttotheinitialstate.Theoptimalschedulecorrespondstothe thick-linearcs. ing cost of the tail subproblem is3,as shown next to node CAin Fig. 1.3.1. StateCD:Here it isonly possible to schedule operation Aas the next operation,sotheoptimal costof this subproblemis5. Ta'ilSubpmblemsof Length3:These subproblems can now be solvedusing theoptimal costsof the subproblemsof length2. StateA:Herethepossibilitiesareto(a)schedulenextoperationB (cost2)andthensolveoptimallythecorrespondingsubproblemof length 2(cost 9,ascomputed earlier),atotal cost of 11, or (b)sched-ule next operation C(cost 3)and then solve optimally the correspond-ing subproblem of length2(cost5,ascomputed earlier),atotal cost of 8.The second possibility isoptimal,and the corresponding cost of the tail subproblem is8,asshownnextto nodeAinFig.1.3.1. StateC:Herethepossibilitiesareto(a)schedulenextoperationA (cost4)andthensolveoptimallythecorrespondingsubproblemof length 2 (cost 3,as computed earlier),atotal cost of 7,or (b)schedule next operation D(cost 6)and then solve optimally the corresponding Sec.1.3 The DynamicProgramming Algorithm 21 subproblemof length2(cost5,ascomputedearlier),atotalcostof 11.The firstpossibilityisoptimal,and the corresponding costof the tail subproblem is7,as shownnext to nodeAinFig. OriginalProblemof Length4:The possibilitieshereare(a)start with op-eration A(cost5)and then solveoptimally the corresponding subproblem oflength3(cost8,ascomputedearlier),atotalcostof13,or(b)start with operation C(cost3)and then solveoptimally the corresponding sub-problemof length3(cost7,ascomputed earlier),atotalcostof10.The secondpossibilityisoptimal,andthe corresponding optimal costis10,as shownnext to theinitial state nodein Fig.1.:3.1. Note that havingcomputed the optimal costof theoriginalproblem through the solution of all the tail subproblems, wecan construct the opti-malschedulebystarting attheinitialnodeandproceedingforward,each timechoosingtheoperationthatstartstheoptimalscheduleforthecor-respondingtailsubproblem.Inthisway,byinspectionofthegraphand thecomputationalresultsofFig.1.3.1,wedeterminethatCABDisthe optimal schedule. Consider the inventorycontrol example of the previous section.Similarto thesolutionof theprecedingdeterministicschedulingproblem,wecalcu-late sequentiallytheoptimalcostsofallthetailsubproblems,goingfrom shortertolongerproblems.Theonlydifferenceisthattheoptimalcosts arecomputed asexpected values,sincethe problem hereisstochastic. Ta,ilSubproblemsofLength1: thatatthebeginningofperiod N- 1 thestockisXN-1Clearly, matterwhathappenedinthepast, theinventorymanagershouldordertheamountofinventorythatmini-mizesoverUN-1 0the sum of the ordering costand the expected tenni-nalholding/shortagecost.Thus,heshouldminimizeoverUN-1thesum C1LN-1+ E{R(xN )}, whichcan be written as CUN-1+E{ R(XN-1 + 'LLN-1- 'WN-d }. 'WN-1 Adding the holding/ shortage cost of period N1,weseethat the optimal costforthe last period(plusthe terminalcost)isgivenby JN-1(XN-d =r(XN-1) +minrCUN-1+E{ R(XN-1 + 'ILN-1- 'WN-1) }] . UN-12.0lWN-1 Naturally,JN_1 isafunctionofthestockXN-lIt iscalcnla,tedeither analyticallyornumerically(inwhichcaseatableisusedforcomputer 22: The Dynamic Programming AlgorithmChap.1 stora.ge of the function JN-1).In the process of caleulating JN-1, we obtain the optimal inventory policy p,]v_1 (xN-d forthe last period:p,]v_1 (xN-d isthevalueof 'LlN -1thatminimizestheright-handsideof thepreceding equationforagivenvalueof XN-l Ta'ilS'ubproblernsofLength2:Assumethatatthebeginningofperiod N2thestockisx:N-2It isclearthattheinventorymanagershould ordertheamountof inventorythatminimizesnotjusttheexpectedcost of periodN- 2butrather the (expectedcostof periodN- 2)+ (expectedcostof periodN- 1, giventhatanoptimal policy willbe usedatperiodN- 1), whichisequalto Usingthe system equation 1;N-1= XN-2 + uN_2- WN-2, the lastterm is alsowrittenasJN-I(XN-2 + HN-2WN-2). Thustheoptimalcostforthelasttwoperiodsgiventhatweareat state 'l:N-2,denotedJN-2(XN-2),isgivenby =T(XN-2) +,min[c'LtN-2+E{ JN-I(XN-2 + 'LlN-2- WN-2) }] IIN-2?,0WN-2 AgainJN-2(:r:N-2)iscalculatedforeveryXN_2.Atthesametime,the optimalpolicyf1,]v_2(x:N-2)isalsocomputed. Ta'ilSubproblemsof LengthN- k:Similarly,wehavethatatperiodk.:,whenthe stock is:x:k,the inventorymanager shouldorder Ukto minimize (expectedcostof periodk)+ (expectedcostof periodsk + 1, ... , N- 1, giventhat an optimal policywillbeusedfortheseperiods). BydenotingbyJk(xk)theoptimalcost,wehave (1.4) whichisactuallythedynamicprogramming equationforthisproblem. ThefunctionsJk(:x:k)denotetheoptimalexpectedcostforthetail subproblemthatstartsatperiodkwithinitialinventoryXk.Thesefunc-tionsare computed recursivelybackwardin time,starting at period N- 1 andendingatperiod0.ThevalueJo (:.co)istheoptimalexpectedcost whenthe initial stock at time 0isx: 0 .During the calculations,the optiinal Sec.1.3The Dynamic Programming Algorithm23 policy is simultaneously computed from the minimization in the right-hand sideof Eq.(1.4). TheexampleillustratesthemainadvantageofferedbyDP.While theoriginalinventoryproblemrequiresanoptimizationoverthesetof policies,theDPalgorithmofEq.(1.4)decomposesthisproblemintoa sequenceofminimizationscarriedoutoverthesetofcontrols.Eachof theseminimizationsismuch simpler than the originalproblem. TheDPAlgorithm WenowstatetheDPalgorithmforthebasicproblemandshowitsopti-mality by translating into mathematical terms the heuristic argument given aboveforthe inventoryexample. Proposition 1.3.1:For every initial state xo, the optimal cost J*(xo) of thebasicproblemisequaltoJo(xo),givenbythelaststepof the followingalgorithm,whichproceedsbackwardintimefromperiod N- 1toperiod 0: (1.5) k= 0, 1, ... , N- 1, (1.6) where the expectation is taken with respect to the probability distribu-tion of 'Wk,which depends on Xkand llk.F'urthermore,if uk=Jt'k(xk) minimizestherightsideofEq."(1.6)foreachXkandk,thepolicy 1r*={Jt0, ... , p, ]v _1}isoptimal. Proof:tForany admissiblepolicy7f= {p,o, fLI,... , flN-dandeachk= 0,1, ... , N-1, denote 1rk={J.tk, p,k+b ... , JLN-dFork0, 1, ... , N-1, letJJ:(xk)be the optimal cost forthe (N- k)-stage problem that starts at stateXkand timek,andendsattimeN, tOurproofissomewhatinformalandassumesthatthefunctionsJkare well-definedandfinite.Forastrictlyrigorousproof,sometechnicalmathemat-icalissuesmustbeaddressed;seeSection1.5.Theseissuesdonotariseifthe disturbanceWktakesafiniteorcountablenumberofvaluesandtheexpected valuesofalltermsintheexpressionofthecostfunction(1.1)arewell-defined andfiniteforeveryadmissiblepolicy7f. The Dynamic Programming AlgorithmChap.1 ForkN,wedefineJjy(xN)=9N(XN ).Wewillshowbyinduction thatthefunctionsJ'kareequaltothefunctionsJkgeneratedbytheDP algorithm,sothatfork=0,wewillobtain the desiredresult. Indeed,wehavebydefinitionJjy=J N=9N.Assumethatfor somekandallXk+l,wehaveJk+l (xk+d=Jk-1-l (xk+l)Then,since Irk(Jtk:,1rk+l ),wehaveforallXk =minE{gk (xk:,J-lk(x:k),wk) ftk'Wk+ m k ~ n l[,E{gN(XN) +~9i (xi, P,i(Xi), 'Wi)}]} nfwk+l, ... ,wN-1. t=k+l =minE{gk (xk, fl,k(xk), wk)+ Jk-1-l(fk (xk, ftk(xk), wk))} JLk 'Wk=minE {gk(xk,Jtk(xk),wk)+ Jk+l(fk(xk,J-lk(xk),wk))} Jlk'Wk =minE{gk(Xk,'Uk,wk) + Jk+l(fk(Xk,ttk,wk))} 'ltkEUk(:ck)wk =Jk(Xk), completingtheinduction.Inthesecondequationabove,wemovedthe minimum over 1rk+linside the braced expression,using aprinciple of opti-malityargument:"thetailportion of anoptimal policy isoptimal forthe tail subproblem"(a more rigorous justification of this step isgivenin Sec-tion1.5).In the third equation,weusedthe definition of Jk+l'andin the fourth equation weused the induction hypothesis.In the fifthequation, we convertedtheminimizationoverftktoaminimizationoverUk,usingthe factthat forany functionFof xand u,wehave min F(x, tt(x))=minF(;r:,u), ttEMuEU(x) whereMisthesetof allfunctionsft ( x)suchthatJt ( x)EU ( x)forallx. Theargumentoftheprecedingproofprovidesaninterpretationof Jk(xk)asthe optimalcostforan(N- k)-stageproblemstarting at state :x:k andtimek,andendingattimeN.WeconsequentlycallJk(xk)the cost-to-go at state Xkand time k,and refer toJkas thecost-to-go function attimek. Sec.1.3The Dynamic Programming Algorithm25 Ideally,wewouldlike to usethe DPalgorithmto obtain closed-form expressionsforJ koranoptimalpolicy.Inthisbook,wewilldiscussa large number of modelsthatadmitanalytical solution byDP.Even if such models rely on oversimplified assumptions, they are often very useful.They may provide valuable insights about the structure of the optimal solution of more complex models,and they may formthe basis forsuboptimal control schemes.J:t'urthermore,the broad collection of analytically solvable models provides helpful guidelines formodeling:when facedwith anew problem it isworth trying topatternitsmodelafteroneof theprincipalanalytically tractable models. Unfortunately,inmanypracticalcasesananalyticalsolutionisnot possible, and one has to resort to numerical execution of the DP algorithm. Thismaybequitetime-consumingsincetheminimizationintheDPEq. (1.6)mustbecarriedoutforeachvalueofXk.Thestatespacemustbe discretizedinsomewayif itisnotalreadyafiniteset.Thecomputa-tionalrequirementsareproportionaltothenumberofpossiblevaluesof Xk,soforcomplexproblemsthecomputationalburdenmaybeexcessive. Nonetheless,DPistheonlygeneralapproachforsequentialoptimization under uncertainty,and even when itiscomputationally prohibitive,itcan serveas the basisformorepractical suboptimal approaches,which willbe discussedinChapter 6. Thefollowingexamplesillustrate someof theanalyticalandcompu-tational aspectsof DP. Example1.3.1 Acertain material ispassed through asequence of twoovens(seeFig.1.3.2). Denote xo:initialtemperature of the material, Xk,k =1, 2:temperatureof the material at the exitof ovenk, Uk-I,k=1, 2:prevailingtemperatureinovenk. Weassumeamodel of the form k0, 1, whereaisaknownscalarfromtheinterval(0, 1).Theobjectiveistoget the finaltemperature x2close to agiven target T,whileexpending relatively little energy.This isexpressedbyacostfunctionof theform wherer> 0isagivenscalar.WeassumenoconstraintsonHk.(Inreality, thereareconstraints,butifwecansolvetheunconstrainedproblemand verify that the solution satisfies the constraints,everything willbe une.)The problem isdeterministic;that is,there isno stochasticuncertainty.However, 26TheDynamic Programming AlgorithmChap.1 Initial Temperature x0 Oven1 Temperature uo .---------. Final Oven 2Temperature x2 Temperature u1Figure1.3.2ProblemofExample1.3.1.Thetemperatureofthematerial evolvesaccordingtoXk+l=(1a)xk+auk,whereaissomescalarwith O0, ifXN= 0, ifXN< 0. (1.10) In this equation,wehaveJN(O)=Pwbecause when the score iseven after N games(xN= 0),itisoptimal to play boldinthe firstgame of sudden death. By executing the DPalgorithm(1.8)starting withthe terminalcondi-tion(1.10),andusingthecriterion(1.9)foroptimalityof boldplay,wefind thefollowing,assuming that Pel>Pw: optirna1play:either JN-1(1)= max(pd + (1- Pd)Pw,Pw+ (1Pw)Pw] Pel+(1- Pd)Pw;optimalplay:timid JN-1(0)=Pw;optimal play:bold optimal play:bold optimalplay:either. Also,givenJN-I(XN-1),andEqs.(1.8)and(1.9)weobtain JN-2(0)=max [PdPw+ (1- Pd)P'!,Pw (Pel+(1- Pd)Pw)+ (1-= Pw (Pw+ (P.w+ Pd)(1Pw)) andthatifthescoreisevenwith2gamesremaining,itisoptirnaltoplay bold.Thusfora2-gamematch,theoptimalpolicyforbothperiodsisto play timidif and onlyif the player 1saheadin the score.Theregionof pairs (Pw,Pd)forwhichthe player has abetter than 50-50chance to wina2-game match is and,asnotedin the preceding section,itincludespointswhere Pw< 1/2. Example1.3.4(Finite-State Systems) Wementionedearlier( cf.theexamplesinSection1.1)thatsystemswith afinitenumberofstatescanberepresentedeitherintermsofadiscrete-timesystemequationorintermsoftheprobabilitiesof transitionbetween thestates.LetusworkouttheDPalgorithmcorrespondingtothelatter case.Weassumeforthesakeofthefollowingdiscussionthattheproblem isstationary(i.e.,thetransitionprobabilities,thecostperstage,andthe controlconstraintsetsdonotchangefromonestage to the next).Then,if 'll} The Dynamic ProgrammingAlgorithmChap.1 arethe transitionprobabilities,wecan alternativelyrepresentthe systemby the system equation( cf.the discussionof theprevioussection) wheretheprobabilitydistribution of the disturbance Wkis P{wk= jI Xk= i, Uk= u} = Pij(u). Using this system equation and denoting by g( i, u)the expected cost per stage at state 'iwhencontrol uisapplied,theDPalgorithmcan herewrittenas .h(i) =min[g(i, u) + E{ A+I(wk) }] uEU(t) orequivalently(inviewof the distributionof wkgivenpreviously) Jk('i)=min[g(i,u) + "'"'Pii(u)Jk+l(j)]. uEU(t)L...,.. j Asanillustration,inthe machinereplacementexample of Section1.1, thisalgorithmtakestheform i=1, ... ,n, The two expressions in the above minimization correspond to the two available decisions(replaceornotreplacethe machine). InthequeueingexampleofSection1.1,theDPalgorithmtakesthe form lN('i)= R(i),'i=O,l, ... ,n, The two expressions in the above minimization correspond to the two possible decisions(fastandslowservice). Note that if there are nstates at each stage,and U ( i)contains as many asmcontrols,theminimizationintheright-handsideoftheDPalgorithm requires,foreach(i, k),asmanyasaconstantmultipleofmnoperations. SincetherearenN state-timepairs,thetotalnumberof operationsforthe DPalgorithmisaslargeasaconstantmultipleofmn2 Noperations.By contrast,thenmnberofallpoliciesisexponentialinnN(itisaslargeas rnnN),soabrute forceapproach which enumerates allpolicies and compares theircost,requiresan exponentialnumberof operationsinnN. Sec.1.4 State AugmentationandOther Reformulations35 l.4STATE AUGMENTATION AND OTHER REFORlVJ[ULATIONS We now discuss how to deal with situations where some of the assumptions of the basic problem are violated.Generally,in such cases the problem can be reformulated into the basic problem format.This process iscalledstate augmentat'ionbecauseittypicallyinvolvestheenlargementofthestate space.Thegeneralguidelineinstateaugmentationistoindude'inthe enlar-gedstateattimekalltheinfor-mationthat'isknowntothecontr-oller attimekandcanbeusedwithadvantageinselecting'Uk.Unfortunately, state augmentation often comesataprice:the reformulatedproblem may have very complex state and/or control spaces.We provide some examples . TimeLags Inmanyapplicationsthe system state ::ck+ldependsnotonlyonthepre-cedingstate :rkandcontrolukbutalsoonearlierstatesandcontrols.In otherwords,statesandcontrolsinfluencefuturestateswithsometime lag.Suchsituationscanbehandledbystateaugmentation;thestateis expanded toincludeanappropriatenumberof earlierstatesand controls. Forsimplicity,assumethatthereisatmostasingleperiodtimelag inthe state andcontrol;thatis,the system equationhasthe form k=1, 2, ... , N- 1,(1.11) ::r1=fo(xo, uo, wo). Time lagsof morethan oneperiod . ~ a nbe handledsimilarly. If weintroduce additional state variables Ykandsk,and wemake the identifications Yk=Xk-1?Sk=Uk-1,the system equation(1.11)yields (fk(Xk, Yk, :uk, Sk, Wk)) ;x,k. Uk (1.12) By definingXk(xk,Yk,sk)asthenew state,wehave where the system functionA isdefinedfromEq.(1.12).By usingthepre-ceding equation as the system equation and by expressing the cost function intermsofthenewstate,theproblemisreducedtothebasicproblem withouttimelags.Naturally,thecontrolukshouldnowdependonthe new state Xk,or equivalently apolicy shouldconsistof functionsf"kof the currentstateXk,aswellastheprecedingstateXk-1andthepreceding control Uk-1 TheDynamic Programming AlgorithmChap.1 WhentheDPalgorithmforthereformulatedproblemistranslated intermsofthe variablesof the originalproblem,it takesthe form JN-1 (::CN-1,XN-2, 'ILN-2) minE{9N-I(XN-1, 'ILN-b WN-d 'UN-1 EUN-l(XN-1) WN-1 + JN(JN-l(XN-1, XN-2, UN-1, UN-2, 'WN-1)) }, Xk-1, UJ.;-1)=minE{gk(;x;k, uk, wk) ukEUk(xk)Wk + Jk+l (fk(xk, :rk-1, Uk, Uk-1, wk), Xk, uk) },k =1, ... , N- 2, Jo(:ro)minE{go(xo,'Lto,wo)+ J1(fo(xo,uo,wo),xo,uo)}. uoEUo(xo)wo Similar reformulations are possible when time lags appear in the cost; forexample,inthecase where the costhas the form The extremecaseof time lagsin thecostarisesin the nonadditiveform E{gN(;x;N, ;x;N-1, ... , xo, UN-b ... , uo, WN-1, ... , wo) }. Then,the problemcanbereducedto the basicproblem format,by taking asaugmented state and E{gN(':EN)}as reformulated cost.Policies consist of functions /Lkof the presentand past statesXk,... , xo,the pastcontrols 'Uk- 1,... , uo,and the pastdisturbancesWk-b ... , wo.Naturally,wemustassumethatthepast disturbancesareknowntothecontroller.Otherwise,wearefacedwith aproblemwherethestateisimpreciselyknowntothecontroller.Such problems are known asproblems with imperfect state information and will be discussedinChapter5. 8ec.1.4.State AugmentationandOther ReFormulat;ions31 CorrelatedDisturbances We turn now to the case where the disturbances wkare correlated over time. Acommon situation that can be handled efficientlyby state angrnentation ariseswhenthe processwo, ... , WN-1canberepresentedastheoutputof alinearsystemdrivenbyindependentrandomvariables.Asanexample, suppose that by using statistical methods, wedetermine that the evolution of Wkcanbe modeledby an equation of theform where.A isagivenscalarand{ isasequenceofindependentrandom vectors with given distribution.Then wecan introduce an additional state variable Yk=Wk-l and obtain anew system equation wherethenewstate isthepairfh=(xk, Yk)andthenewdisturbanceis the vector More generally,suppose that Wkcanhe modeledby where k0, ... ,N -1, Ak,Ckare knownmatrices of appropriatedimension,and areindepen-dentrandomvectorswithgivendistribution(seeFig.1.4.1).Byviewing Ykasanadditional state variable,weobtain thenew system equation ;k Yk + 1 = AkYk +;k Figure 1.4.1Representing correlated disturbances as the output of alinear sys-tem drivenbyindependentrandomvectors. 38The Dynamic ProgTamrningAlgorithmChap.1 Notethatinordertohaveperfectstateinformation,thecontroller mustbe able to observe YkUnfortunately,this istrue only in the minority of practical cases;forexamplewhenCkisthe identity matrix and Wk-1is observed before 'ILkisapplied.In the caseof perfectstate information,the DPalgorithm takestheform JN(XN) YN)= 9N(X:N ), Jh:(:ck,yk)=min ukEUk(xk)f;,k+ Jk+l (ik ( Xk,1Lk, Ck(AkYk + AkYk +Forecasts Finally,considerthecasewhereattimekthecontrollerhasaccesstoa forecastYkthatresultsinareassessmentof the probability distribution of Wkandpossiblyof futuredisturbances.For example,Ykmaybe anexact prediction of Wkoran exactprediction that the probability distribution of 'Wkisaspecificoneoutof afinitecollectionof distributions.Forecastsof interestinpracticeare,forexample,probabilisticpredictionson the state of the weather,theinterestrate formoney,and the demand forinventory. Generally,forecastscanbehandledbystateaugmentationalthough the reformulation into the basic problem format may be quite complex.We willtreathereonlyasimple special case. Assumethatatthebeginningofeachperiodk,thecontrollerre-ceivesan accurate prediction that the nextdisturbance 'Wkwillbe selected accordingtoaparticularprobabilitydistributionoutof agivencollection of distributions{ Q1, ... , Qm};thatis,iftheforecastisi,then'Wkisse-lectedaccordingtoQ,i.The aprioriprobability that the forecastwillbe i isdenotedby Piand isgiven. Forinstance,supposethatinourearlierinventoryexamplethede-mand Wkisdetermined according to one of three distributions Q1,Q2,and Qcl,corresponding to"srnall,""medium,"and"large"demand.Each of the three types of demand occurs with agiven probability at each time period, independently of the values of demandatprevious time periods.However, theinventorymanager,priortoordering Uk,gets to knowthroughafore-eastthetypeof demandthatwilloccur.(Notethatitistheprobability distributionof demandthatbecomesknownthrough the forecast,notthe demanditself.) The forecastingprocesscan be representedby means of the equation Yk+l= whereYk-t-lcantakethevalues1, ... , m,correspondingtothem possible forecasts,and isarandomvariabletakingthevalueiwithprobability Sec.1.4State Augmentation andOther ReFormulations 39 P'iTheinterpretationhereisthatwhen takesthevalue't,thenwk-t--l willoccuraccording to the distributionQi. By combining the system equation with the forecastequation Yk+l= weobtain anaugmented systemgivenby (Xk+l)=(fk(xk,Uk,wk)). Yk+lThenew state is Xk=(xk, Yk), andbecausetheforecastYkisknownattimek,perfectstateinformation prevails.The newdisturbanceis 'l.;)k =(wk, anditsprobabilitydistributionisdeterminedbythedistributionsQiand theprobabilities Pi,anddependsexplicitlyonXk(viaYk)butnotonthe priordisturbances. Thus,by suitablereformulationof the cost,theproblem canbecast intothebasicproblemformat.Notethatthecontrolapplieddependson boththecurrentstateandthecurrentforecast.TheDPalgorithmtakes the form JN(XN,YN)=9N(XN), minE{9k(:r;k, 'ILk,wk) ukEUk(xk)wk m(1.13) + PiJk+l (fk(:I:k, 'ILk,wk), i)I Yk}, i=l whereYkmaytakethevalues1,. :-., m,andtheexpectationoverwkis taken with respectto the distributionQyk. It shouldbeclearthattheprecedingformulationadmitsseveralex-tensions.Oneexampleisthecasewhereforecastscanbeinfluencedby thecontrolactionandinvolveseveralfuturedisturbances.However,the priceforthese extensions isincreased complexity of the corresponding DP algorithm. Whenaugmentingthestateofagivensystemoneoftenendsupwith compositestates,consistingofseveralcomponents.It turnsontthatif someof thesecomponentscannotbeaffectedbythechoiceof control,the DPalgorithmcan besimplifiedconsiderably,aswewillnowdescribe. Let the state of the system be acomposite (xk, Yk)of two components XkandYkTheevolutionof themaincomponent,xk,isaffectedbythe controlUkaccordingto theequation Xk+l=fk(Xk, Yk, 'l.tk,wk), 40Tlw DynamicProgramming AlgorithmCI1ap.1 wheretheprobabilitydistributionPk(wkI Xk, Yk, uk)isgiven.Theevolu-tionof theothercomponent,Yk,isgovernedbyagivenconditionaldistri-lmtionPk(YkI Xk:)and cannotbeaffectedby the control,except indirectly throughXk.OneistemptedtoviewYkasadisturbance,butthereisa difference:YkisobservedbythecontrollerbeforeapplyingUk,while'Wk occursafterukisapplied,andindeedwkmayprobabilisticallydependon Uk;. WewillformulateaDPalgorithmthat isexecutedoverthe control-lablecomponentof thestate,withthedependenceontheuncontrollable componentbeing"averagedout."Inparticular,letJk(xk, Yk)denotethe optimal cost-to-goat stagekandstate(xk, Yk),anddefine WewillderiveaDP algorithmthat generatesJk(xk) Indeed,wehave .h(:rk) = Eyk {Jk(Xk, Yk)I Xk} = Eyk {minEwk,xk+lYk+l {gk(Xk, Yk, Uk, Wk) ttkE[h(xk,Yk) + Jk-H(xk+bYk+I)I Xk,Yk,uk}I Xk} Eyk {minEwk,xk-H {9k(xk, Yk, uk, wk) ttkEVk(xk,Yk) + EYk+l { "1-t(Xk+l, Yk+t)I Xk+dI Xk, Yk, Uk}I Xk }. and finally E{9k(xk, Yk, 'Uk,wk) Wk + Jk+l(!k(xk,Yk,1tk,wk)) }I :1;k } (1.14) Theadvantage of thisequivalentDPalgorithm isthat itisexecuted over asignificantly reduced state space.For example, if Xktakes npossible valuesandYktakesmpossiblevalues,thenDPisexecutedovernstates insteadofnm states.Note,however,thattheminimizationintheright-handsideoftheprecedingequationyieldsanoptimalcontrollawasa functionof thefullstate(xk, Yk) As an example, consider the augmented state resulting from the incor-poration of forecasts,asdescribed earlier.Then, the forecast'Ukrepresents Sec.1.4 State Augmentation andOther Reformulations4:1 an uncontrolled state component, sothat the DP algorithmcan be simpli-fiedasin Eq.(1.14).In particular,bydefining rn jk (X k)=Pd k( X k, i),k=0, 1, ... , N- 1, i=l and wehave,usingEq.(1.13), m J k ( x k)= minE{ g k ( x k, Uk, Wk) i=lttkEUk(xk) wk + Jk+l(fk(Xk,Uk,Wk))I Yk= i}, which isexecutedoverthe spaceof xkratherthan Xkand Yk Uncontrolled state components often occur in arrival systems, such as queueing,where action must be taken in response to arandom event(such asacustomerarrival)thatcannotbeinfluencedbythechoiceof control. Thenthestateofthearrivalsystemmustbeaugmentedtoincludethe random event,but the DP algorithm can be executed over asmaller space, asper Eq.(1.14).Hereisanother exampleof similartype. Exan1ple1.4.1:(Tetris) Tetris isapopular video gameon a two-dimensional grid.Each square inthegridcanbefullorempty, upa"wallofbricks"with"holes" anda"jagged top".The squares fillup asblocksof differentshapes fallfrom thetopofthegridandareaddedtothetopof thewall.Asagivenblock falls,theplayercanmovehorizontallyandrotatetheblockinallpossible ways,subjecttotheconstraintsimposedbythesidesofthegridandthe top of thewall.The fallingblocksaregeneratedindependentlyaccordingto someprobabilitydistribution,definedoverafinitesetofstandardshapes. The gamestarts withanempty gridandendswhenasquareinthe top row becomesfullandthetopofthewallreachesthetopofthegrid.Whena rowof fullsquares iscreated,this rowisremoved,the brickslyingabovethis rowmoveonerowdownward,andtheplayerscoresapoint.Theplayer's objectiveistomaximizethescoreattained(totalnumberofrowsremoved) withinNstepsoruptotermination of the game,whicheveroccursfirst. Wecan model the problem of findingan optimal tetris playing strategy asastochasticDPproblem.Thecontrol,denotedbyu,isthehorizontal positioningandrotationappliedtothefallingblock.Thestateconsistsof twocomponents: (1)Theboardposition,i.e.,abinarydescriptionof thefull/emptystatus of eachsquare,denotedbyx. 1.5 42TileDynamic Programming AlgorithmChap.1 ( 2)The shape of the currentfallingblock,denotedbyy. Thereisalsoanadditionalterminationstatewhichiscost-free.Oncethe state reachesthe termination state,itstays therewithnochangeincost. Theshapeyisgeneratedaccordingtoaprobabilitydistributionp(y), independentlyofthecontrol,soitcanbeviewedasanuncontrollablestate component.The DP algorithm(1.14)isexecuted over the space of xand has theintuitive form J k(x;)I::p(y) m1~ X[g(x, y, u) + J k+l (f(x, y, u)) J,forallx, ywhere g(x, y, u) and f(x, y, u) are the number of points scored (rows removed), and the board position(or termination state) when the state is(x, y)and con-troluisapplied,respectively.Note,however,thatdespitethesimplification intheDPalgorithmachievedbyeliminatingtheuncontrollableportionof thestate,thenumberof statesxisenormous,andtheproblemcanonlybe addressedby suboptimalmethods,whichwillbe discussedinChapter 6and inVol.II. SOME 1\flATHEl\IIATICALISSUES Letusnowdiscusssometechnicalissuesrelatingtothebasicproblem formulationandthevalidityof the DPalgorithm.The readerwhoisnot mathematically inclined neednotbe concernedabout theseissuesand can skipthissectionwithoutlossof continuity. Onceanadmissiblepolicy{Ito,... , JlN-disadopted,thefollowing sequenceof eventsisenvisionedat the typical stage k: 1.The controllerobservesXkandapplies 'lLk= Jlk(xk). 2.ThedisturbanceWkisgeneratedaccordingtothegivendistribution Pk(-1Xk,/lk(Xk)). ~ ) .The cost9k(xk,Jlk(x:k),wk)isincurredandaddedto previouscosts. Ll.The nextstate x;k+lisgeneratedaccordingto the systemequation If thisisthelaststage( k=N- 1),theterminalcostg N( x N)is addedtopreviouscostsandtheprocessterminates.Otherwise,kis incremented,and the same sequence of events isrepeated at the next stage. For each stage, the above process is well-defined and iscouched in pre-ciseprobabilisticterms.Matters are,however,complicated by the needto Sec.1.5Some MathematicalIssues viewthe costasawell-definedrandom variable with well-definedexpected value.Theframeworkof probabilitytheoryrequiresthatforeachpolicy wedefineanunderlyingprobabilityspace,thatis,asetn,acollectionof eventsin0,andaprobabilitymeasureontheseevents.Inaddition,the costmustbeawell-definedrandomvariableonthisspaceinthesenseof Appendix C(a measurable function from the probability space into the real lineintheterminologyof measure-theoreticprobabilitytheory).Forthis tobe true,additional(measurability)assumptions onthe functionsfk,9k, andJ1kmayberequired,anditmaybenecessarytointroducea.dditional structure onthespacesSk,Ck,andDk.Furthermore,theseassumptions may restrictthe class of admissible policies,sincethe functionsJLkmaybe constrained to satisfyadditional(measurability)requiremen'ts. Thus, unless these additional assumptions and structure are specified, thebasieproblemisformulatedinadequatelyfromamathematicalpoint ofview.Unfortunately,arigorousformulationforgeneralstate,eonLrol, and disturbance spaeesiswellbeyond the mathematical framework of this introductorybookandwillnotbeundertakenhere.Nonetheless,ittnrns outthatthesedifficultiesaremainlytechnicalanddonotsubstantially affect the basic results to be obtained.For this reason,we findit convenient to proceed with informal derivations and arguments;this isconsistent with rnostof the literature on the subject. Wewouldliketostress,however,thatunderatleastonefrequently satisfiedassumption,the mathematical difficultiesmentionedabovedisap-pear.Inparticular,letusassumethatthedisturbancespacesDkareall countableandtheexpectedvaluesofalltermsinthecostarefinitefor every admissible policy(this istrue in particular if the spacesDkare finite sets).Then,forevery admissible policy,the expected valuesof allthe cost terms can be written as(possibly infinite) sumsinvolvingthe probabilities of the elementsof the spacesDk. Alternatively,onemay writethe costas (1.15) where with the preceding expectation taken with respect to the distribu.tiouPk (I Xk, Jlk(xk))defined on the countable set Dk.Then one may take as the ba-sicprobability space theCartesian product of the spacesSk,k1, ... , N, givenforallkby TheDynamic Programmjng AlgorithmChap.1 whereSo={ Xo}.The setskisthe subset of all states that can be reached at tirnekwhen the policy {JLo,... , Jl'N-disused.Because the disturbance spaces Dkare countable, the sets Ehare also countable (this is true since the nnion of any countable collection of countable sets isacountable set).The systemequationXk+1=fk (xk, J-Lk(xk), wk),theprobabilitydistributions Pk (- I xk, JLk(xk)),the initial state xo,and the policy {!Lo,... , /-LN -d define aprobabilitydistributiononthecountablesetEhx xSN,andthe expectedvalueinthecostexpression(1.15)isdefinedwith respectto this latter distribution. Letusnowgiveamoredetailedproof ofthevalidityoftheDPal-gorithm(Prop.1.3.1).WeassumethatthedisturbanceWktakesafinite orcountablenumberof valuesandtheexpectedvaluesof alltermsinthe expression of the cost functionare finite forevery admissible policy 1r.Fur-thermore,thefunctionsJk(xk)generatedbytheDPalgorithmarefinite forall states Xkand timesk.We do not need to assume that the minimum over'Ltkinthedefinition of Jk(xk)isattainedby some 'LtkEU(xk) Foranyadmissiblepolicy1r={p,o,P,1, ... ,JLN-dandeachk= 0, 1, ... , N1,denote 1rk={p,k, /-Lk+l?... , JLN-dJ:iork= 0, 1, ... , N1, letJk(l:k)be the optimal cost forthe (N- k)-stage problem that starts at state 1:kand timek,and endsat timeN;that is, Fork=N,wedefineJJ.r(xN)=9N(XN ).Wewillshowbyinduction thatthefunctionsJkareequaltothefunctionsJkgeneratedbytheDP algorithm,sothat fork =0,wewillobtain the desiredresult. ForanyE>0,andforallkandxk,letJL'k(xk)attaintheminimum intheequation (1.16) k0,1, ... ,N- 1, withinE;thatis,forallXkandk,wehaveJL'k(xk)EUk(xk)and + Jk+I(fk(xk,JL'k(xk),wk))}::;Jk(xk)+E.(1.17) 'WkLetlh:(l:k)bethe expectedcoststartingat stateXkattimek,andusing thepolicy{p,'k,JLk+l'... , f.t]v_1 }.WewillshowthatforallXkandk,we have Jk(Xk):S; Jk(xk):S; Jk(Xk)+ (Nk)E, lie (xk):Slh:(xk):Slie (xk) + (N- k )E, Jk(xk)=Jk(xk) (1.18) (1.19) (1.20) Sec.1.5SomeMathematicalIssues It isseenusingEq.(1.17)thattheinequalities(1.18)and(1.19)holdfor k;= N- 1.By takingE-t 0inEqs.(1.18)and(1.19),itisalsoseenthat lN-1=Jjy_1.AssumethatEqs.(1.18)-(1.20)holdforindexk + 1.We willshow that theyalsoholdforindexk;. Indeed,wehave Jk(xk)=E {gk (xk, JLk(l:k), wk)+ Jk+1 (fkwk))} Wk ::;E{gk (xk,wk)+ Jk+l (fk (1:k,JL'k(xk),wk))}+ (Nk- 1)E 'Wk :S; ,/k,(xk) + E + (N- k1)E =Jk(xk) + (N- k)E, wherethefirstequationholdsbythedefinitionofJk,thefirstinequality holdsbytheinductionhypothesis,andthesecondinequalityholdsEq. (1.17).Wealsohave Jk(xk)=E {9k (xk, Jt'k(xk), wk)+ Jk+1 (fk (xk, lt'k(xk), wk))} wk 2::E{gk(xk,JL'k(xk),wk)+ Jk+I(fk(Xk,Jtk(l:k),wk))} 'Wk wherethefirstinequalityholdsbytheinductionhypothesis.Combining theprecedingtworelations,wesee that Eq.(1.18)holdsforindexk:. Foreverypolicy1r {Jto, P,1,.>., JLN-d,wehave Jk(xk)=E{gk(xk,JL'k(xk),wk)+Jk+1(fk(xk,JL'k(xk),wk))} Wk ::;E {gk (xk,wk)+ Jk+I (fk (xk, ftk(xk), wk))}+ (N- k- 1)E wk :SminE { g k ( x k, u k, w k)+ J k+ 1 (f k ( x k, u k, w k) ) } + (N- k) E ukEU(xk) wk :SminE {gk(xk, uk, wk)+ Jk+I (fk(l:k,'llk, wk))} + (N- k:)c ukEU(xk) Wk :SE{9k ( Xk,Jlk(xk), 'Wk)+ J1rk+l(fk (xk, JLk (;r;k),'Wk))}+ (N - k )E Wk wherethe firstinequalityholdsbytheinductionhypothesis,andthesec-ondinequalityholdsbyEq.(1.17).Takingtheminimumover1rkinthe preceding relation,weobtain forallXk TheDynamic Programming AlgorithmChap.1 WealsohavebythedefinitionofJ'(;,forallXk, J'(;(xk)~Jk(xk). Combiningtheprecedingtworelations,weseethatEq.(1.19)holdsfor indexk.Finally,Eq.(1.20)followsfromEqs.(1.18)and(1.19),bytaking c: -+ 0,andtheinductioniscomplete. Notethatby usingc: =0inthe relation J0(xk)~J0(xk)+ Nc:, [cf.Eq.(1.19)],weseethatapolicythatattainstheminimumforallx:k andkin Eq.(1.16)isoptimal. In conclusion,the basicproblem hasbeen formulatedrigorously,and theDPalgorithmhasbeenprovedrigorouslyonlywhenthedisturbance spacesDo, ... , D N_ 1arecountablesets,andtheexpectedvaluesofall the cost expressions associated with the problem and the DP algorithm are finite.In the absence of these assumptions, the reader should interpret sub-sequentresultsandconclusionsasessentiallycorrectbutmathematically imprecisestatements.Infact,whendiscussinginfinitehorizonproblems (wheretheneedforprecisionisgreater),wewillmakethecountability assmnption explicit. Wenote,however,thattheadvancedreader willhavelittle difficulty iuestablishingmostofoursubsequentresultsconcerningspecificfinite horizonapplications,evenifthecountabilityassumptionisnotsatisfied. Thiscanbedoneby usingthe DPalgorithmasaveEificationtheorem.In particular,if one can findwithin asubset of policies II(such as those satis-fyingcertain measurability restrictions)apolicy that attains the m i n i m u ~ intheDPalgorithm,thenthispolicycanbereadilyshowntobeopti-malwithinfr.ThisresultisdevelopedinExercise1.12,andcanbeused bythemathematically oriented readerto establishrigorouslymany of our snbseqnentresults concerning specificapplications.For example,in linear-quadraticproblems(Section4.1)onedetermines fromthe DP algorithma polieyindosedform,whichislinearinthecurrentstate.When'Wkcan takeuncountablymany values,itisnecessary that admissiblepolicies con-sist of Borel measurable functions/LkSince the linear policy obtained from theDPalgorithmbelongsto thisclass,the resultof Exercise1.12guaran-teesthat thispolicy isoptimal.Forarigorousmathematicaltreatment of DPthatresolvestheassociatedmeasurabilityissuesandsupplementsthe presenttext,seethebookby Bertsekasand Shreve[BeS78]. L6DYNAJVHCPROGRAlVIMINGANDMINilVIAXCONTROL 'l'he problem of optimal control of uncertain systems has traditionally been treated in astochastic framework,whereby alluncertain quantities are de-scribedbyprobabilitydistributions,andthe expectedvalueof thecostis ,Sec.1.6Dynamic Programming andIVIinimaxControl minimized.However,inmanypractical situationsastochasticdescription of the uncertainty may not be available,and one may have information with lessdetailedstructure,suchasboundson themagnitudeoftheuncertain quantities.In other words,one may knowaset within whichthe uncertain quantitiesareknowntolie,butmaynotknowthecorrespondingprob-abilitydistribution.Underthesecircumstancesonemayuseaminimax approach,wherebytheworstpossiblevaluesoftheuncertainquantities withinthe givensetareassumedto occur. Theminimaxapproachfordecisionmakingunderuncertaintyisde-scribedin Appendix Gand iscontrasted with the expected costapproach, which we have been following so far.In its simplest form,the corresponding decisionproblemisdescribedbyatriplet(11,vV,J),whereIIisthesetof policiesunder consideration,Wisthe set in which the uncertain quantities are known to belong,andJ: II xVV~ - - - - +[-oo, +oo]isagivencostfunction. The objective isto overall1rETI. minimizemax J (1r, w) wEW It ispossibletoformulateaminimaxcounterparttothebasicprob-lemwithperfectstateinformation.Thisproblemisaspecialeaseof the abstractminimaxproblemabove,asdiscussedmorefullyinAppendixG. Generally,it is unusual foreven the simplest special cases of this problem to admit aclosed-form solution.However,acomputational solution using DP ispossible,and our purpose in this section isto describe the corresponding algorithm. Intheframeworkof thebasicproblem,considerthecasewherethe disturbancesw0 ,w1, ... ,WN-Idonothaveaprobabilisticdescriptionbut ratherareknowntobelong tocorrei:ipondinggivensetsWk ( x k, 'Uk)CD k, k = 0, 1, ... , N- 1,whichmay depend on the current state x:kand control Uk.Considertheproblemoffindingapolicy1r{J -oo, uEU forallwE W, wehave min max[ao(w)+Gr (f(w), J-L(f(w) ))]=max [ao(w)+minGI (f(w), u)]. JtEMwEJ,VwEWttEU Proof:WehaveforallJL EJ\1[ rnax [co( w)+G1 ( f(w), JL(j(w)) )]~max [co(w) +minG1 (f(w), 'n)] wEWwEWuEU andbytaking the minimum overJL E1\1[,weobtain minmax[Go(w)+Gl(f(w),JL(f(w)) )]~max[Go(w)+minGl(f(w),u)]. JtEMwEWwEWuEU (1.23) Sec.1.6Dynamic Programming andIVIinimaxControl 49 To show the reverse inequality,forany E >0,let/LcE1\1[ be such that Gl(.f(w),JLc(f(w)))::; miunG1(J(w),1t)+E,forallwE W. uE [Suchap,cexistsbecauseof theassumptionminuEU G1 (f(w), 'n)>-oo.] Then minmax[co(w)+G1 ( f(w), tt(f(w)) )] J-LEMwEW ::;max [co( w)+G1 ( f(w), JLt:(j(w)) )] wEW ::;max[co(w) +min G1(f(w), ?t)]+E. 'WEWuEU SinceE >0canbetakenarbitrarilysmall,weobtainthereversetoEq. (1.23),and the desiredresultfollows.Q.E.D. To see how the conclusion of the lemma can failwithout the condition min,uEUG1 (.f(w), u)>-oo forallw,letu beasealar,letw(w1, w2) beatwo-dimensionalvector,andlettherebenoconstraintsonuandw (U =~ 'W=~x~ ' w h e r e ~isthe realline).Letalso Go(w)=w1,f(w)= w2,G1(J(w),u)=f('w)+'tt. Then,forallfJ, EMwehave, max[Go(w)+Gl(f(w),JL(f(w)))j=max[w1+w2+1-t(w2)]=co, wEWw1 Errr,w2Errr sothat min max [ao(w) +G1 ( f(w), Jl(f(w)) )]co. ttEM wEW Onthe other hand, max [ao(w) +min G1 (f(w), u)j=..max[w1+min[w2 + uJ]=-co, wEWuEUw1Errc,w2ERuEW sinceminuErrr[w2+u]=-oo forallw2. WenowturntoprovingtheDPalgorithm(1.21)-(1.22).Theproofis similartotheonefortheDPalgorithmforstochasticproblems.Theoptimal costJ* ( xo)of the problemisgivenby J*(xo)=minminmax max fLO JLN-1woEW[xo,fto(xo)J'WN-lEW[xN-litN-l(xN-1)] [ ~9k(Xk,J'k(Xk), Wk)+ 9N(XN )] min min[minmax max JtO ILN-2JtN-l woEW[xo,Jto(xo)J'WN-2EW(xN-21LN-2(a;N-2)] [N-2 9k(Xk1f-tk(Xk),'11Jk)+max k=OwN.-lEW[xN-l>ltN-l(xN-1)] [9N-1 (xN-1, /'N-l(XN-1), WN-1)-j- JN(XN) l]] 50 The Dynamic Programming AlgorithmChap.1 We can interchange the minimum over fLN-1and the maximum over wo, ... , 'WN-2 byapplyingLemma1.6.1withtheidentifications 'W=(wo,WI,... , 'WN -2), Go(w) where tt=lLN-1,j(w)XN-1, ifwkEWk(Xk,ttk(xk))forallk, otherwise, if 'UEUN -1 (f (w)) , otherwise, Ch (J(w), u)=max[gN-1 (f(w), u, WN-1) wN-lEWN-1 (f(w),u) + JN(JN-l(f(w),tt,WN-1))], toobtain J* (:r:o)=minmin {LQ JLN-2 n1axmax woEW[xo,J-to(xo)]w N-2EW[x N-2fLN -2(x N-2)](1.24) [ ~9k (xk, /tk(Xk), Wk)+ JN-l(XN-1)] TherequiredconditionminuEUG1(f(w), u)>-ooforallw(requiredforap-plicationof Lemma1.6.1)isimpliedbytheassumptionJN-I(XN-I)> -oo for allXN-lNow,byworkingwiththe expressionforJ*(xo)inEq.(1.24),andby similarlycontinuingbackwards,withN- 1inplaceofN,etc.,afterNsteps weobtainJ*(xo)=J0(::x;0),whichisthedesiredrelation.Thelineof argument justgivenalsoshowsthatanoptimalpolicyfortheminimaxproblemcanbe constructedbyminimizingintheright-handsideof theDPEq.(1.22),similar tothe caseof the DPalgorithmforthe stochastic basicproblem. Unfortunately,asmentionedearlier,therearehardlyanyinterestingex-amplesofananalytical,closed-formsolutionof theDPalgorithm(1.21)-(1.22). Acornputationalsolution,requiresroughlycomparableefforttotheoneof the ::;tochasticDPalgorithm.Insteadof theexpectationoperation,onemustcarry ontamaximizationoperation foreachXkandk. Minimaxcontrolproblemswillbe revisitedinChapter 4inthe contextof reachabilityof targetsetsandtargettubes(Section4.6.2),andinChapter6in thecontextof competitivegamesandcomputer chess(Section6.3),andmodel predictivecontrol(Section6.5). Sec.1. 7 Notes,Sources,andExercises51 1.7NOTES,SOURCES,ANDEXERCISES Dynamicprogrammingisasimplemathematicaltechniquethathasbeen usedformany yearsby engineers,mathematicians,andsocialscientistsin avarietyof contexts.It wasBellman,however,whorealizedintheearly fiftiesthat DP could be developed(in conjunction with the then appearing digital computer)into asystematic tool foroptimization.In hisinfluential books[Bel57],[BeD62],Bellman demonstratedthe broad seope of DPand helpedstreamlineitstheory. FollowingBellman'sworks,therewasmuchresearchonDP.Inpar-ticular,the mathematical and algorithmic aspects of infinitehorizonprob-lemswere extensively investigated, extensions to continuous-time problems wereformulatedandanalyzed,andthemathematicalissuesdiscussedin Section1.5,relatingtotheformulationofDPproblems,wereaddressed. Inaddition,DPwasusedinaverybroadvarietyof applications,ranging frommanybranchesof engineeringtostatistics,economics,finance,and someof the socialsciences.Samplesof theseapplicationswillbegivenin subsequentchapters. EXERCISES 1.1 Considerthe system k= 0, 1, 2, 3, with initial state x 05,andthe costfunction 3 l ) x ~+ v . ~ ) . k=O Apply the DPalgorithmforthefollowingthree cases: (a)ThecontrolconstraintsetUk(Xk)is{uI 0:SXku:S5,u:integer}for allXkandk,andthe disturbance 'Wkisequalto zeroforallk. (b)The control constraintand the disturbance 'Wkare asinpart(a),but there isin addition aconstraint X4= 5on the finalstate.Hint:For this problem youneedtodefineastatespaceforX4thatconsistsofjustthevalue X4= 5,and alsoto redefineU3(x3).Alternatively,yonmay useaterminal cost94(x4)equaltoaverylargenumberforX4# 5. 52The Dynamic Programming AlgorithmChap.1 (c)Thecontrolconstraintisasinpart(a)andthedisturbanceWktakesthe values-1and1withequalprobability1/2forallXkandttk,exceptif :r:k + ttkisequal to 0or5,in whichcase 'Wk= 0with probability1. 1.2 Carryoutthecalculationsneededto verifythatJo(1)= 2.67andJ0 (2)= 2.608 inExample1.3.2. 1.3 Supposewehaveamachinethatiseitherrunningorisbrokendown.If itruns throughoutone week,itmakesagrossprofitof $100.If itfailsduring the week, grossprofitiszero.If itisrunningatthestartoftheweekandweperform preventive maintenance, the probability that it willfailduring the week is0.4.If wedonotperformsuchmaintenance,the probability of failureis0.7.However, maintenance willcost $20.When the machine isbroken down at the start of the week,itmayeitherbe repairedatacostof $40,in which caseit willfailduring theweekwithaprobability of 0.4,oritmaybereplacedatacostof $150bya newmachine that isguaranteed to run through its firstweekof operation.Find theoptimalrepair,replacement,andmaintenancepolicythatmaximizestotal profitoverfourweeks,assuminganewmachineat the start of the firstweek. 1.4 Agame of the blackjack variety isplayed by two players as follows:Both players throwadie.Thefirstplayer,knowinghisopponent'sresult,maystopormay throw the die again and add the result to the result of his previous throw.He then maystoporthrowagainandaddtheresultof thenewthrow to the sumof his previous throws.Hemayrepeatthis processasmany timesashe wishes.If his sum exceeds seven(i.e.,he busts),he loses the game.If he stops before exceeding seven,the secondplayer takes over and throws the die successively until the sum of histhrowsisfourorhigher.If the sum of the secondplayerisoverseven,he losesthegame.Otherwise the player with the largersum wins,andincase of a tiethe secondplayer wins.The problemistodetermineastopping strategyfor the firstplayer that maximizes his probability of winning foreach possible initial throwof thesecondplayer.Formulatetheproblemintermsof DPandfindan optimal stopping strategyforthecasewherethe secondplayer'sinitialthrowis three.Jiint:TakeN= 6andastate spaceconsisting of the following15states: x1 :busted x1+i: already stoppedat sum i(1:::; i:::;7), :rs-t-i:current sum isibut the playerhasnotyetstopped(1:::; i:::;7). The optimal strategy isto throwuntil the sum isfourorhigher. Sec.1. 7 Notes,Sources,and Exercises 1.5(Computer Assignment) In the classicalgame of blackjacktheplayerdrawscardsknowingonlyonecard of thedealer.Theplayerlosesuponreachingasumofcardsexceeding21.If theplayer stops before exceeding21,the dealerdrawscardsuntil reaching17or higher.The dealer loses upon reaching asum exceeding 21or stopping at alower sumthantheplayer's.If playeranddealerendupwithanequalsumnoone wins.Inallothercasesthedealerwins.Anacefortheplayermaybecounted asa1oran11astheplayerchooses.Anaceforthedealeriscountedasan11 if thisresultsinasumfrom17to21andasa1otherwise.Jacks,queens,and kingscount as10forboth dealer and player.Weassume an infinite card deck so the probability of aparticular cardshowingupisindependentof earliercards. (a)Foreverypossibleinitialdealercard,calculatetheprobabilitythatthe dealerwillreachasum of 17,18,19,20,21,or over21. (b)Calculatetheoptimalchoiceof theplayer(draworstop)foreachof the possible combinations of dealer's card and player's sum of 12 to 20.Assume that the player'scardsdonotincludeanace. (c)Repeatpart(b)forthe case where the player'scardsincludean ace. 1.6(DiscountedCostper Stage) In the frameworkof the basic problem,consider the case where the costisof the form whereaisadiscountfactorwith0< a< 1.Show that an alternate formof the DP algorithm isgivenby, VN(XN)= 9N(XN), Vk(xk)=minE{9k(Xk, ttk, wk) + o:Vk+l (fk(xk, 'lLk,wk)) } ukEUk(xk) 'Wk1. 7(ExponentialCostFunction) In the frameworkof the basic problem,consider the case wherethe costisof the form (a)Showthattheoptimalcostandanoptimalpolicycanbeobtainedfrom the DP-likealgorithm 54The Dynamic Programming Algorit}unChap.1 (b)Definethe functionsVk (xk)=In Jk (xk).Assumealsothat 9kisafunction of Xkand 'ltkonly(andnotof wk).Showthat the abovealgorithmcan be rewrittenas Note:Theexponentialcostfunctionisanexampleofarisk-sensitive costfunctionthatcanbeusedtoencodeapreferenceforpolicieswith asmallvarianceofthecost9N(XN)+

9k(Xk, Uk, wk).Theassoci-atedproblemshavealotof interestingproperties,whicharediscussedin several sources,e.g.,Whittle(Whi90],Fernandez-GaucherandandMarkus (FeM94],.James,Baras,andElliott[.JBE94],Basar and Bernhard[BaB95]. 1.8(Terrninating Process) Inthe frameworkof the basicproblem,considerthe casewherethe system evo-lutionterminatesattimeiwhenagivenvalue Wiof thedisturbanceattimei occurs,orwhenatermination decisionUiismadeby the controller.If termina-tionoccursattimei,theresultingcostis T+I: 9k(xk, uk, wk), k=O whereTis aterminationcost.If theprocesshasnotterminateduptothefinal timeJV,theresultingcostis9N(xN)+ 9k(Xk,uk,wk).Reformulatethe problem into the framework of the basic problem.Hint:Augment the state space withaspecialtermination state. 1.9(MultiplicativeCost) In the frameworkof the basicproblem,consider the case where the costhas the multiplicativeform E{9N(XN) 9N-I(XN-l, UN-I, 'WN-1) go(xo, uo, wo) }. 'Wkk=O,l, ... ,N-1 DevelopaDP-likealgorithmforthisproblemassumingthat9k(Xk, Uk, Wk)0 forallXk,'Uk,Wk,andk. Sec.1.7Notes,Sources,andExercises55 1.10 Assumethatwehaveavesselwhosemaximumweightcapacityiszandwhose cargoistoconsistofdifferentquantitiesofNdifferentitems.Letv.;denote thevalueof theith typeof item,w.;theweightof ithtypeof itern,andXithe numberof itemsof typeithatareloadedinthevessel.Theproblemistofind themostvaluablecargo,i.e.,tomaximize :J:i'Uisubjecttotheconstraints 2::1 X-i'Wi:s;zandXi=0,1, 2,... Formulate thisproblemintermsof DP. 1.11 ConsideradeviceconsistingofNstagesconnectedinseries,whereeachstage consistsof aparticularcomponent.Thecomponentsaresubjecttofailure,and toincreasethereliabilityof thedeviceduplicatecomponentsareprovided.For _j = 1, 2, ... , N,let(1+ ffij)be thenumberof componentsforthejth stage,let Pi ( ffij)be the probability of successful operation of the jth stage when( 1 + mj) componentsare used,and let Cjdenote the costof asingle component at the jth stage.Formulate in terms of DP the problem of finding the number of cornponents at eachstage that maximizethereliability of the deviceexpressedby subjectto the costconstraintCjffij:s;A, whereA> 0isgiven. 1.12(Minimization overaSubsetof Thisproblemisprimarilyoftheoreticalinterest(seetheendofSection1.5). Consideravariationof the basicproblemwhereby weseek min J1r(xo), 1rETI whereITissomegivensubset of thesetof sequences{tto, /Ll,... , /LN- I} of func-tions/Lk: sk--7ck withjtk(Xk)EUk(xk)forallXkESk.Assumethat belongstoITandattainstheminimumintheDPalgorithm;thatis,forall k0,1, ... ,N -1 andXkEsk, Jk(Xk)=E + 'WkminE{9k(Xk,uk,wk) + .h+l(fk(xk,uk,'Wk:))}, ukEUk(xk)'Wk withJN(xN)=9N(XN).AssumefurtherthatthefunctionsJkarereal-valued and that the preceding expectationsare well-definedandfinite.Showthat1r*is optimal within ITandthatthe corresponding optimalcostisequaltoJo(xo). The Dynamic Programming AlgorithmChap.1 1.13(SemilinearSystems) Consideraprobleminvolvingthe system whereXkE fkare givenfunctions,andAkand 'Wkarerandom nx nmatri-cesandn-vectors,respectively,withgivenprobabilitydistributionsthatdonot dependonXk,UkorpriorvaluesofAkand'Wk.Assumethatthecostisof the form {+ 9k (!Lk(Xk)))}, whereCkaregivenvectorsand9kare givenfunctions.Showthatif theoptimal cost forthis problem isfiniteand the control constraint sets Uk(xk)are indepen-dentofXk,thenthecost-to-gofunctionsof theDPalgorithmareaffine(linear plusconstant).Assumingthatthereisatleastoneoptimalpolicy,showthat thereexistsanoptimalpolicythatconsistsofconstantfunctionsJ-L'k; thatis, ftHxk)=constantforallXkE3ln. 1.14 AfarmerannuallyproducingXkunitsofacertaincropstores(1- Uk)Xkunits of hisproduction,where 0s;Uk::;1,andinveststhe remaining UkXkunits,thus increasing the nextyear'sproduction to alevelXk+lgivenby k= 0, 1, ... , N- 1. The scalars 'Wkareindependentrandomvariableswithidenticalprobability dis-tributions that do not depend either onXkor Uk.Furthermore,E{wk} = w > 0. Theproblemistofindtheoptimalinvestmentpolicythatmaximizesthetotal expectedproductstored overNyears { N-1} - .JliXN+ Uk)Xk. k-O,l, ... ,N-1k-0 Showthe optimality of the followingpolicythat consistsof constantfunctions: (a)If W> 1, = = ft'Jv_1 (XN-1)= 1. (b)If 0 < W< 1/N,= =p,'Jv_1(XN-l) = 0. (c)If 1/N::; w::;1, = = f.-LN-l(XN-1)= 0, where k issuch that1/(k + 1)< w::;1/k. Sec.1. 7Notes,Sources,andExercises51 1.15 LetXkdenote the numberof educators inacertain country at timekand letYk denote the number of research scientists at time k.New scientists(potential edu-cators or research scientists)are produced during the kth period by educators at arate "/kper educator, while educators and research scientists leave the fielddue to death, retirement, and transfer at arate Sk.The scalars "(k,k= 0, 1, ... , N-1, areindependentidenticallydistributedrandomvariablestakingvalueswithina closed and bounded interval of positive numbers.Similarly 1h,k0, 1, ... ,N -1, areindependentidentically distributedand takevaluesinaninterval[S, J']with 0< S::;S'0.Thus,for example,therewardforstartingwithactivity1,switchingto2attimet1,and switching back to1attimet2> t1earnstotal reward Wewanttofindasetof switchingtimesthatmaximizethetotalreward.As-sume thatthe function91(t)- g2(t)changes signafinitenumber of timesin the interval[0, T].Formulate the problem asafinitehorizonproblem and writethe corresponding DPalgorithm. Sec.1. 7Notes,Sources,and Exercises59 1.19(Garnes) (a)Considerasmallerversionofapopularpuzzlegame.Threesquaretiles numbered1,2,and3areplacedina2x2gridwithonespace leftempty. Thetwotilesadjacenttotheemptyspacecanbemovedintothatspace, therebycreatingnewconfigurations.UseaDPargumenttoanswerthe questionwhetheritispossibletogenerateagivenconfigurationstarting fromany other configuration. (b)Fromapileof elevenmatchsticks,twoplayerstaketurnsremovingoneor four sticks.The player who removes the last stick wins.Use aDP argument to showthat there isawinning strategyfortheplayerwhoplaysfirst. 1.20(TheCounterfeitCoin Problem.) vVeare given six coins,one of which iscounterfeitand isknownto have different weightthantherest.Constructastrategytofindthecounterfeitcoinusinga two-pan scale in aminimum average number of tries.!Tint:There are two initial decisionsthatmakesense:(1)testtwoof thecoinsagainsttwoothers,and(2) testoneof the coinsagainstoneother. 1.21(Regular Polygon Theorem) Accordingtoafamoustheorem(attributed totheancientGreekgeometerZen-odorus),of allN-side polygonsinscribedinagivencircle,those thatareregular (allsidesareequal)havemaximalarea. (a)Prove the theorem by applying DP to asuitable problem involving sequen-tialplacementof Npointsin the circle. (b)Use DP to solve the problem of placing agiven number of points on asubarc of the circle,soastomaximizethearea of thepolygonwhoseverticesare these points,the endpointsof the subarc,and thecenterof the circle. 1.22(InscribedPolygon of lVIaximalPerhneter) Considertheproblemof inscribinganN -sidepolygoninagivencircle,sothat the polygonhasmaximalperimeter. (a)Formulate the problemasaDP problem involving sequential placernentof Npointsin the circle. (b)UseDP to show that the optimal polygonisregular(allsidesareequal). 60The Dynamic Programming AlgorithmChap.1 1.23(lVJ[onotonicityProperty of DP) An evident, yetvery important property of the DP algorithm is that if the termi-nal cost g Nis changed to auniformly larger cost gN[i.e.,g N ( x N):::; gN( x N)forall x N],then the last stage cost-to-go J N-l (x N_I) will be uniformly increased.More generally,giventwo functionsJk+landJk+lwithJk+I(Xk+I):::;Jk+I(Xk+l)for allXk-f-l,wehave,forallXkand 1Lk EUk(Xk),

{ Uk, Wk)-!- Jk+l (fk(Xk, 1Lk 1Wk))} :::;E{gk(Xk, Uk, Wk)+ Jk+l (fk(Xk, Uk, Wk)) } wk Supposenowthatinthe basicproblemthe systemandcostaretimeinvariant; thatis,Sk:=S,Ck:=C,Dk:=D,fk= f,Uk=U,Pk=P,and9k:=g forsomeS,C,D,f,U,P,andg.Showthatif intheDPalgorithmwehave JN-1(x):::;JN(x)forallxES, then forallxES andk. Similarly,if wehave1N-l(x) 2:: JN(x)forallxES, then forallxESand k. 1.24(TravelingRepairman Problem) Arepairman must service nsites,whichare located along aline and are sequen-tially numbered 1, 2, ... , n.The repairman starts at agiven sites with 10.Aphase(node)j iscompletedwhenallactivitiesorarcs(i,j)thatareincomingtojare completed.Thespecialnodes1andNrepresentthe startandendof the project.Node 1hasno incomingarcs,whilenode Nhasno outgoingarcs. Furthermore,thereisatleastonepath fromnode1toeveryothernode. Figure2.2.1Graphofanactivitynetwork.Arcsrepresentactivitiesandare labeled by the corresponding duration.Nodes represent completion of some phase of the project.Aphase iscompleted if all activities associated with incoming arcs atthecorrespondingnodearecompleted.Theprojectiscompletedwhenall phases are completed.The project duration time isthe length of the longestpath fromnode1 tonode5,whichisshown with thick line. An important characteristic of an activity network is that it isacyclic; thatis,ithasnocycles.Thisisinherentintheproblemformulationand the interpretation of nodesasphase completions. For any path p= {(1, j1), (j1, j ~ ,), ... , (jk, i)}fromnode1 to anode i,letDpbe the duration of the path definedasthe sum of durationsof its activities;that is, Dp=lift + ti112+ + tjki ]'hen the time Tirequiredto complete phase iis Ti=maxDp. pathsp from1toiThus to findTi,weshould findthelongest path from1to i.This problem mayalsobeviewedasashortestpathproblemwiththelengthofeach arc( i, j)being-tij.Inparticular,findingthedurationof theprojectis equivalenttofindingtheshortestpathfrom1toN.Thispathisalso calledacriticalpath.It canbeseenthatadelaybyagivenamountin thecompletionof oneof theactivitiesonthecriticalpathwilldelaythe completionof the overallprojectbythe sameamount.Notethatbecause the network isacyclic,there can be only afinitenumber of paths from1 to any i,sothat at leastone of thesepaths correspondsto the maximal path duration Ti. 10Deterministic Systemsandthe ShortestPathProblemChap.2 Letusdenote byS1the set of phasesthatdo not depend on comple-tionofanyotherphase,andmoregenerally,fork =1, 2,... ,letskbe the set sk{'i1aupaths from1 to ihavekarcsor less}, withSo{ 1}.ThesetsS kcanbeviewedasthestatespacesforthe equivalent DP problem.Using maximization in place of minimization while changingthesignof thearclengths,the DPalgorithm can be written as foralliESkwith itj_ Sk-l Notethat thisisaforwardalgorithm;thatis,itstartsatthe origin1 and proceedstowardsthedestinationN.Analternativebackwardalgorithm, whichstarts atNand proceedstowards1,isalsopossible,asdiscussedin the preceding section. Asanexample,forthe activitynetworkof Fig.2.2.1,wehave So{1},S1= {1, 2},82{1, 2, 3}, 83= {1,2,3,4},84= {1,2,3,4,5}. Acalculation nsingthepreceding formulayields 'I'hecriticalpath is1~2~3~4~5. 2.2.2Hidden Markovl\1odelsand the ViterbiAlgorithm ConsideraMarkovchainwithafinitenumberofstatesandgivenstate transitionprobabilitiesP'ij.Supposethatwhenatransitionoccurs,the states corresponding to the transition are unknown (or"hidden") to us,but insteadweobtainanobservationthatrelatestothattransition.Givena sequence of observations, we want to estimate in some optimal sense the se-quence of corresponding transitions.Wearegiven theprobability r(z; i,j) of anobservationtakingvaluezwhenthestatetransitionisfromitoj. Weassume independent observations;that is,an observation depends only onitscorrespondingtransitionandnotonothertransitions.Wearealso giventhe probability 1f'ithat theinitial state takesvaluei.The probabili-tiesp,ijrmdr-(z; i, j) areassumedtobe independentof time fornotational convenience.Themethodologytobedescribedadmitsastraightforward extension to the case of time-varying system and observation probabilities. Markovchainswhosestatetransitionsareimperfectlyobservedac-cordingtotheprobabilisticmechanismjustdescribedarecalledHidden Jvfar-kovJY!odels(Hl'VIJ\!Isfor- shor-t)orpartiallyobser-vableMarkovchains. Sec.2.2Some ShortestJ>athApplications71 InChapter5wewilldiscussthecontrolofsuchMarkovchainsinthe contextof stochasticoptimalcontrolproblemswithimperfectstateinfor-mation.In the present section,wewillfocuson theproblem of estimating the state sequencegivenacorrespondingobservationsequence.Thisisan importantproblem that arisesinabroad varietyof practicalcontexts. Weusea"mostlikelystate"estimationcriterion,wherebygiventhe observation sequenceZN={z1, z2, ... , ZN },weadoptasonrestimate the statetransitionsequenceX N={ Xo,X1,... , xN}thatmaximizesoverall XN={xo, x1, ... , XN}theconditionalprobabilityp(XNI ZN ).Wewill showthatX Ncanbefoundbysolvingaspecialtypeofshortestpath problemthatinvolvesanacyclicgraph. Wehave (XI z) =p(XN' ZN) PNNp(ZN), where p( X N,Z N)and p( Z N)aretheunconditionalpro habili tiesof occur-rence of (XN, ZN)and ZN, respectively.Since p(ZN) isapositive constant onceZNisknown,wecanmaximizep(XN, ZN)inplaeeof p(XNI ZN ). The probability p(XN, ZN)can be written as p(XN,ZN) =p(xo,Xl, ... ,XN,Zl,Z2, ... ,zN) =1fx0p(:ri, ... ,XN,Z1,Z2, ... ,zNI :-ro) =7rx0p(xl,zli :r:o)p(x2, ... ,XN,Z2, ... ,ZNI :ro,XJ,ZL) = 1fx0Pxox1T(zl;xo,:r:l)P(X2, . .. , ~ I : N , Z 2 ,... ,ZNI ::eo,:r:1,Z1). Thiscalculation can becontinuedby writing p(x2, ... , XN, z2,... , ZNI xo, :r:1,z1) =p(x2,z2l xo,:r:l,zl)p(x3, ... ,:r:N,Z3,,zN I :x;o,xl,Zl,:x;2,z2) = Px1x2T(z2; :r;1, x2)p(x3, ... , XN, Z3,... , ZNI :x:o,:1:1,Zt, :c2,z2), where forthe lastequation,weused the independence of the observations, i.e., p(z2I xo, X1,X2,zl) = r(z2; x1, :1:2).Combining the above two relations, p(XN, ZN)1f xoPx0x1 r(z1; :x:o,Xl)Px1x2T( Z2;:r1, :r2) p(:x:3, ... ,:x:N,Z3, ... ,zNI :r:o,xl,ZI,X2,z2), and continuing inthe samemanner,weobtain N p(XN' ZN)=1fxon Pxk-lxkr-(zk; :Dk-l, ;r;k) k=1 (2.5) Wenowshowhowthemaximizationoftheaboveexpressioncanbe viewedasashortestpath problem.Inparticular,weconstructagraphof state-time pairs,calledthetreil'isdiagram,by concatenatingN+ 1copies Det;erministicSystemsandtlw ShortestPathProblemChap.2 of the state space,and by preceding and following them with dummy nodes sandt,respectively,asshowninFig.2.2.2.Thenodesof thekthcopy correspondtothe statesXk-1at timek- 1.Anarc connectsanodeXk-l of thekthcopywithanodeXkof the(k + l)stcopyif thecorresponding transition probability Pxk_1xkispositive.Since maximizing apositive cost functionisequivalenttomaximizingitslogarithm,weseefromEq.(2.5) that, given the observation sequenceZN= {z1 , z2 , .. ., ZN},the problem of maximizing p(XN, ZN)isequivalentto the problem N minimize- In( 1fx0)'2: ln(pxk_1xk r(zk; Xk-b Xk)) k=l overallpossiblesequences{xo, x1, ... , XN }. Byassigningtoanarc( s, xo)thelength- ln( 1rxo),toanarc( x N, t)the length0,and to anarc(xk-1, xk)the length-ln(pxk_1xk r(zk; Xk-1, xk)), weseethattheaboveminimizationproblemisequivalenttotheproblem of finding the shortest path fromsto tin the trellis diagram.This shortest path definesthe estimated state sequence{ xo, x1, ... , xN}. Figure2.2.2StateestimationofanHMMviewedasaproblemoffindinga shortestpathfromstot.Lengthof arcsfromstostatesxoisIn( 1rxo),and lengthof arcsfromstatesXNtot iszero.Length of anarc fromastate Xk-lto Xkis-1n(pxk_1xkT(zk;Xk-l,Xk)),whereZkisthekth observation. Inpractice,theshortestpathismostconvenientlyconstructedse-quentially by forwardDP,that is,by firstcalculating the shortest distance fromsto each node x1,then using these distances to calculate the shortest distancefromsto eachnodex2,etc.In particular,supposethat wehave computed the shortest distances Dk(xk) froms to all states Xkon the basis of the observation sequence Zk,and suppose that the new observation zk+l isobtained.ThentheshortestdistancesDk+l (xk+l)fromstoanystate Sec.2.2SomeShortestPathApplications13 ;Ck+lcan becomputed by trw DPrecursion TheinitialconditionisDo(xo)=-ln(7rx0).Thefinalestimatedstate sequenceXNcorrespondstotheshortestpathfromstothefinalstate :c NthatminimizesD N( x N)overthefinitesetofpossiblestatesx N.An advantage of this procedure isthat itcan be executedin real time,as soon aseach newobservationisobtained. Thereareanumberof practical schemesforestimatingaportionof thestatesequencewithoutwaitingtoreceivetheentireobservationse-quenceZN,andthisisusefulif ZNisa.longsequence.Forexample,one cancheckfairlyeasilywhetherforsomek,allshortestpathsfromsto states Xkpass through a.single node in the subgraph of states xo,... , ::ck-l If so,itcanbeseenfromFig.2.2.3thattheshortestpath fromstothat node will not be affected by reception of additional observations,and there-forethe subsequence of state estimates up to that node can be determined withoutwaiting forthe remaining observations. Shortest Path fromstom ~ 0 ~' ,//0 Shortest Paths'>< / frommto States l{k /'/'''O Figure2.2.3Estimatingaportionof thestatesequencepriortoreceivingthe entireobservation sequence.Suppose that the shortestpathsfromsto allstates Xkpassthroughasinglenodem.If anadditionalobservationisreceived,the shortestpaths fromsto allstatesXk+lwillcontinuetopassthroughm.There-fore,theportionofthestatesequenceuptonodemcanbesafelyestimated because additional observations willnot change the initialportion of the shortest paths fromsuptom. The shortest path-based estimation procedure just described is known astheViterbialgorithm,andfindsnumerousapplicationsinavarietyof contexts.An example isspeechrecognition,where the basic goal isto tran-scribeaspokenwordsequenceintermsof elementaryspeechunitscalled phonemes.OnepossibilityistoassociatethestatesoftheHHl\IIwith Deterministic Systemsandthe ShortestPathProblemChap.2 phonemes,and givenasequence of recordedphonemesZN={ z1 ,... ,ZN }, tofindaphonemicsequenceXN={x1 ,... ,xN}thatmaximizesoverall XN{:r1, ... , ;r;N}the conditional probability p(XNI ZN ).The probabil-itiesr(zk; :r:k-1, ::ck)and Pxk_1xkcanbe experimentallyobtained,if neces-sary by specialized"training"foreach speaker that uses the speech recogni-tion system.The Viterbi algorithm can then be used to findthe most likely phonemic sequence.There are also other HMMs used for word and sentence recognition,whereonlyphonemicsequencesthatconstitutewordsfroma givendictionaryareconsidered.WereferthereadertoRabiner[Rab89] andPicone[Pic90]forageneralreviewof HMMsappliedto speechrecog-nitionandforfurtherreferencestorelatedwork.It isalsopossibletouse similarmodelsforcomputerizedrecognitionof handwriting. TheViterbialgorithmwasoriginallydevelopedasaschemeforde-codingdataaftertransmissionoveranoisycommunicationchannel.The followingexampledescribesthiscontextinsomedetail. Whenbinarydataaretransmittedoveranoisycommunicationchannel,it isoftenessentialtousecodingasameansofenhancingreliabilityofcom-rrmnication.Acommontypeof codingmethod,calledconvolutionalcoding, convertsasource-generatedbinary data sequence {w1, w2,... },WkE{0, 1},k=1,2, ... , intoacodedsequence{y1, y2, ... },whereeach Ykisann-dimensional vector withbinarycoordinates,calledcodewoTd, Y1E{0, 1},i= 1, ... ,n,k= 1,2, .... Thesequence{Yl, Y2,... }isthentransmittedoveranoisychannelandgets transformedinto asequence{ z1, z2, ... , } ,whichisthen decodedto yield the decoded data sequence {11; 1 ,1112, .. };see Fig.2.2.4.The objective is to design theencoder/ decoderschemesothatthedecodedsequenceisasclosetothe originalaspossible. CodewordReceivedDecoded Data (Oor 1) SequenceSequenceSequence EncoderDecoder W1,W2 y,, Y2....Z1,Z2, ... ""w1,w2, ... Figure2.2.4Encoder/decoder scheme. Sec.2.2Some ShortestPathApplications75 The problem just discussediscentral in information theoryandcan be approachedinseveraldifferentways.Inaparticularlypopularandefiective techniquecalledconvolutionalcoding,thevectorsYkarerelatedtoWkvia equationsof the form k=1,2, ... ' k= 1,2, ... ':ro:given, (2.6) (2.7) where Xkisan m-dimensional vector with binary coordinates,which we view asstate,andC,d,A,andb arenxm,nx1,mxrn,andrnx1matrices, respectively,with binary coordinates.The products and the sums involved in the expressionsCxk- 1+ dwkandA:r:k-1+ lJ'Wkarecalculatedusingmodulo 2arithmetic. Asan example,letm= 2,n=3,and A Thentheevolutionofthesystem(2.6)-(2.7)canberepresentedbythedi-agramshowninFig.2.2.5.GiventheinitialXo,thisdiagramcanbeused togeneratethecodewordsequence{ Yl, Y2,... }correspondingtoadatase-quence{w1 ,w2 ,.. }.For example,when the initial state isx:o00,the data sequence {1111,'W2,W3,W4}= {1,0,0,1} generatesthe state sequence {xo,X11X2,X3,X4}={00,01,11,10,00}, andthe codewordsequence Old State xk _ 1 00 01 10 11 New State xk 00 01 10 1 1 Figure2.2.5State transition diagram forconvolutionalcoding.Thebinary number pair on each arc isthe data/codeword pair wk/Ykforthe correspond-ingtransition.Soforexample,whenXk-101,azerodatabit(wk=0) effectsatransition to Xk= 11andgeneratesthecodeword001. '16Deterministic Systems andthe ShortestPathProblemChap.2 Assumenowthat thecharacteristics of thenoisytransmissionchannel aresuchthatacodewordyisactuallyreceivedaszwithknownprobability p(zI y),wherezisany n-bitbinary number.Weassumeindependenterrors sothat N p(ZNI YN)=IT p(zkI Yk),(2.8) k=l whereZN={ Z1, ... ,ZN}isthereceivedsequenceandYN= {Yl, ... , YN}is thetransmitted sequence.Byassociatingthe codewordsywith state transi-tions,weformulateamaximumlikelihoodestimationproblem,wherebywe wantto findasequenceYN={j)I, i/2, ... , YN}suchthat The constrainton YNisthat it must be afeasiblecodeword sequence(i.e.,it mnstcorrespondtosomeinitialstateanddata sequence,orequivalently,to asequenceofarcsof the trellis diagram). LetusconstructatrellisdiagrambyconcatenatingNstatetransi-tiondiagramsandappendingdummynodessandtonitsleftandright sides,whichareconnectedwithzero-lengtharcstothestatesxoandXN, respectively.ByusingEq.(2.8),weseethat,giventhereceivedsequence Z N={z1, Z2,... , z N},theproblemof maximizingp( Z NI Y N)isequivalent totheproblem N miniinize k=l overallbinary sequences{ Y1,y2, ... , y N}. This is equivalent to a problem of finding a shortest path in the trellis diagram fromstot,wherethelengthof thearcassociatedwiththecodewordYkis -ln(p(zkI Yk)),andthelengthsofeacharcincidenttoadummynodeis zero.From the shortest path and the trellisdiagram,wecan then obtain the corresponding data sequence{tu1,... , fv N}, which isacceptedasthe decoded data. The maximum likelihoodestimate YNcan be foundby solving the cor-responding shortestpath problem using the Viterbi algorithm.In particular, the shortest distancesDk+l (xk+l)fromsto any state Xk+Iare computed by the DP recursion minfnk(xk)-ln(p(zk+II Yk+I) )] , allxksuchthat (xk,xk+I)isanarc whereYk+Iisthecodewordcorrespondingtothearc(xk, Xk+dThefinal state xNonthe shortestpath isthe onethat minimizesD N ( x N)overx N. Sec.2.3ShortestPathAlgorithms'1'1 2.3SHORTESTPATHALGORITHMS Wehaveseenthatshortestpathproblemsanddeterministicfinite-state optimalcontrolproblemsareequivalent.Thecomputationalimplications of thisaretwofold. (a)OnecanuseDPto solvegeneralshortestpathproblems.Notethat thereareseveralothershortestpathmethods,someofwhichhave superiortheoreticalworst-caseperformancetoDP.However,DPis oftenpreferredinpractice,particularly forproblemswithanacyclic graph structure andalsowhenaparallel computer isavailable. (b)Onecanusegeneralshortestpath methods(otherthanDP)forde-terministicfinite-stateoptimalcontrolproblems.Inmostcases,DP ispreferable to other shortest path methods,because itistailoredto the sequential nature of optimal control problems.However,there are importantcaseswhereother shortestpath methodsarepreferable. Inthissectionwediscussseveralalternativeshortestpathmethods. Wemotivate thesemethodsby focusingon shortestpathproblemswitha very large number of nodes.Suppose that there isonly one origin and only onedestination,asinshortestpathproblemsarisingfromdeterministic optimal control( cf.Fig.2.1.1).Then it isoften true that most of the nodes arenotrelevanttotheshortestpathprobleminthesensethattheyare unlikely candidates for inclusion in ashortest path between the given origin anddestination.Unfortunately,however,in theDPalgorithmeverynode andarcwillparticipateinthe computation,sotherearisesthepossibility of other moreefficientmethods. A similarsituation arisesinsomesearchproblemsthatarecommon inartificialintelligenceandcombiriatorialoptimization.Generally,these problemsinvolvedecisionsthatcanbebrokendownintostages.With proper reformulation, the decision stages can be madeto correspond to arc selectionsinashortestpath problem,ortothe stagesof aDPalgorithm. Weprovidesome examples. Example2.3.1(TheFourQueensProblem) Fourqueensmustbeplacedona4x4portionofachessboardsothatno queen canattackanother.In otherwords,theplacementmustbesuchthat every row,column, or diagonal of the 4 x 4 board contains at most one queen. Equivalently, we can view the problem as a sequence of problems;first,placing aqueeninoneof thefirsttwosquaresinthetoprow,thenplacinganother queeninthesecondrowsothat itisnotattackedbythefirst,andsimilarly placing the third and fourthqueens.(Itissufficientto consideronlythe first twosquaresofthetoprow,sincetheothertwosquaresleadtosymmetric positions.)Wecanassociatepositionswithnodesof anacyclicgraphwhere the rootnodescorresponds to the position withnoqueensandthe terminal nodescorrespondto thepositionswherenoadditionalqueenscanbeplaced '7'8DeterrninisticSystemsandthe ShortestPathProblemChap.2 without some queen attacking another.Let us connect each terminal position withanartificialnodetbymeansofanarc.Letusalsoassigntoallarcs lengthzeroexceptfortheartificialarcsconnectingterminalpositionswith lessthan fourqueens with the artificial node t.These latter arcs are assigned the length oo(see Fig.2.3.1)to express the factthat they correspond to dead-endpositionsthatcannotleadtoasolution.Then,thefourqueensproblem reducesto findingashortestpath fromnodesto node t. Note that once the nodes ofthe graph are enumerated the problem is es-sentially solved.In this 4 x 4 problem the number of nodes issmall.However, wecan think of similar problems with much larger memory requirements.For example,thereisaneightqueensproblemwheretheboardis8x8instead of 4x4. Exarnple2.3.2(The TravelingSalesman Problem) Animportantmodelforschedulingasequenceof operationsistheclassical travelingsalesmanproblem.HerewearegivenNcitiesandthemileage betweeneachpairofcities.Wewishtofindaminimum-mileagetripthat visitseachofthecitiesexactlyonceandreturnstotheoriginnode.To convertthisproblemtoashortestpathproblem,weassociateanodewith everysequenceof ndistinctcities,wheren:::; N.Theconstructionandarc lengthsofthecorrespondinggraphareexplainedbymeansofanexample inFig.2 . ~ 1 . 2 .TheoriginnodesconsistsofcityA,takenasthestart.A sequenceof ncities(n< N)yieldsasequenceof( n+ 1)citiesbyaddinga newcity.Twosuchsequencesareconnectedbyanarcwithlengthequalto the mileage between the last two of the n + 1 cities.Each sequence of Ncities isconnectedtoanartificialterminalnodetwithanarchaving lengthequal tothe distance fromthe last city of the sequence to the starting city A.Note thatthenumber of nodesgrowsexponentially withthenumber of cities. Intheshortestpathproblemthatwewillconsiderinthissection, thereisaspecialnodes,calledtheorigin,and aspecial node t,calledthe destination.Wewillassumeasingledestination,butthemethodstobe discussed admit extensions to the case of multiple destinations (see Exercise 2.6).Anode _j iscalledachild of nodeiif there isan arc(i,j) connecting 'iwithj.The lengthof arc(i,_j)isdenotedby aijand weassumethatall aTcshavenonnegativelength.Exercise2. 7dealswiththecasewhereall cyclelengths(rather than arclengths)areassumednonnegative.We wish tofindashortestpath fromorigintodestination. 2.3.1LabelCorrectingl\1ethods Wenowdiscussageneraltypeof shortestpath algorithm.Theideaisto progressivelydiscovershorterpaths fromthe originto everyothernodei, andtomaintainthelengthof theshortestpath foundsofarinavariable dicalledthelabelof i.Each time diisreduced followingthe discovery of a Sec.2.3ShortestPathAlgorithms Length== ooStartingPosition Root Nodes Artificial Terminal Nodet 79 Figure2.3.1Shortestpath formulationof thefourqueensproblern.Symmetric positions resulting fromplacing aqueen in one of the rightmost squares in the top rowhave been ignored.Squarescontaining aqueenhavebeen darlmned.Allarcs havelengthzeroexceptforthoseconnectingdead-endpositionstotheartificial terminalnode. shorter path to i,the algorithm checksto see if the labels djof the children _j of 'icanbe"corrected,"thatis,theycanbereducedby setting themto 0Deterministic Systemsandthe ShortestPathProblemCl1ap.2 Artificial Terminal Node t Figure2.3.2Exampleof ashortestpathformulationof thetravelingsalesman problem.The distancebetweenthefourcitiesA,B,C,andDareshowninthe table.The arclengthsare shownnextto thearcs. di+ a'i:i[the lengthof the shortest path to ifoundthus farfollowedbyarc (i,j)].Thelabeldtofthedestinationismaintainedinavariablecalled UPPER,whichplaysaspecialroleinthealgorithm.Thelabeldsof the originisinitializedat0andremainsat0throughoutthealgorithm.The labelsof allothernodesare initializedatoo,i.e.,di=ooforallii=s. The algorithm also makes use of alist of nodes called OPEN (another namefrequentlyusediscandidatelist).ThelistOPENcontainsnodes thatarecurrentlyactiveinthe sensethattheyarecandidatesforfurther examinationby thealgorithmandpossibleinclusioninthe shortestpath. Initially,OPEN contains just the origin nodes.Each node that has entered OPEN atleastonce,excepts,isassigneda"parent,"whichissome other node.Theparentnodesarenotnecessaryforthecomputationofthe shortestdistance;theyareneededfortracingtheshortestpathtothe originafterthealgorithmterminates.Thestepsof thealgorithmareas follows(seealsoFig.2.3.3): Sec.2.3ShortestPathAlgorithms1 LabelCorrecting Algorithm Step 1:Remove a node ifromOPEN and foreach child .iof i,execute step2. Step 2:If di+ aij< min{dj, UPPER},setd:i=di+ aiJand setito betheparentof j.Inaddition,if ji=t,placejinOPEN if itisnot already in OPEN, whileif j=t,set UPPER to the new value di+ ait of dt. Step 3:If OPENisempty,terminate;elsegotostep1. Set dj = di + aii INSERT Is di + aij < UPPER ? .. - ~ - Y _ E _ s _ ~ - = - - ~(Does the path s --> i --> j have a chance to be part of ashorter s --> t path?) YES is di + aii < dj? (Is the paths--> i --> j shorter than the current s --> jpath?) Figure2.3.3Diagrammaticillustrationofthelabelcorrectingalgorithm,with aninterpretationof the tests forinsertionof anodeinto the OPENlist. It canbeseenbyinductionthat,throughoutthealgorithm,cljis eitheroo(ifnodejhasnotyetenteredtheOPENlist),orelseitisthe lengthof somepath fromstojconsistingof nodesthathaveenteredthe OPENlistatleastonce.Inthelattercase,thepathcanbeconstructed bytracingbackwardtheparentnodesstarting withtheparentof nodej. F\1rthermore,UPPERiseitheroo,orelseitisthelengthofsomepath fromstot, and consequently it isan upper bound of the shortest distance fromstot.Theideainthealgorithmisthatwhenapathfromstoj isdiscovered,whichisshorterthan thoseconsideredearlier( di+ aij0isagiveninterestrate,andu(t)2:0 ishisrate of expenditure.The total enjoymenthe willobtain isgivenby Here(3 issomepositivescalar,whichservesto discountfutureenjoyment.Find theopt

Documents

114167416 Dynamic Programming and Optimal Control