Planning & Reinforcement Learning
Slides borrowed from Sheila McIlraith, Kate Larson, and David Silver
Why Planning
• E.g., if we have a robot, we want the robot to decide what to do; how to act to achieve our goals.
Planning vs. Search
• How to change the world to suit our needs.
• Critical issue: we need to reason about what the world will be like after doing a few actions.
• This aspect of planning is just like Search.

GOAL: Steven has coffee. CURRENTLY: robot in mailroom, has no coffee, coffee not made, Steven in office, etc. TODO: go to lounge, make coffee, …
Autonomous Agents for Space Exploration
• Autonomous planning, scheduling, control
  • NASA: JPL and Ames
• Remote Agent Experiment (RAX)
  • Deep Space 1
• Mars Exploration Rover (MER)
Other Applications (cont.)
• Scheduling with Action Choices & Resource Requirements
  • Problems in supply chain management
  • HSTS (Hubble Space Telescope scheduler)
  • Workflow management
• Air Traffic Control
  • Route aircraft between runways and terminals. Craft must be kept safely separated. The safe distance depends on the craft and mode of transport. Minimize taxi and wait time.
• Character Animation
  • Generate step-by-step character behaviour from a high-level spec
• Plan-based Interfaces
  • E.g., NLP-to-database interfaces
  • Plan recognition, activity recognition
Applications
These applications require more than search. It is not sufficient to simply find a sequence of actions for transforming the world so as to achieve a goal state.
• These applications involve dealing with uncertainty.
• Sensing the world, and planning to sense the world, so as to reduce uncertainty.
• Generating a plan that has high payoff or high expected payoff, rather than simply achieving a fixed goal.
• Running into problems when executing a plan and having to recover.
• Etc.
Planning
• Agent: single agent or multi-agent
• State: complete or incomplete (logical/probabilistic); state of the world and/or the agent's state of knowledge
• Actions: world-altering and/or knowledge-altering (e.g., sensing); deterministic or non-deterministic (logical/stochastic)
• Goal Condition: satisfying or optimizing; final-state or temporally extended; optimizing for preference/cost/utility
• Reasoning: offline or online (fully observable, partially observable)
• Plans: partial-order, sequential, conditional
Simplifying the Planning Problem
• We simplify the planning problem as follows:
  • Assume complete information about the initial state, through the closed world assumption (CWA)
  • Assume a finite domain of objects
  • Assume action effects are restricted to making conjunctions of atomic formulae true or false. No conditional effects, etc.
  • Assume action preconditions are restricted to conjunctions of ground atoms
• Perform Classical Planning. No incomplete or uncertain knowledge.
Classical Planning Assumptions
• Finite system: finitely many states, actions, events
• Fully observable: the controller always knows the current state
• Deterministic: each action has only one outcome
• Static: changes occur only as the result of controller actions
• Attainment goals: a set of goal states Sg
• Sequential plans: a plan is a linearly ordered sequence of actions (a1, …, an)
• Implicit time: actions are instantaneous (they have no duration)
• Off-line planning: the planner doesn't know the execution status
STRIPS Representation
• STRIPS (Stanford Research Institute Problem Solver)
• A way of representing actions with respect to a CW-KB, a closed-world knowledge base representing the state of the world
STRIPS Actions
• STRIPS represents actions using 3 lists:
  • A list of action preconditions.
  • A list of action add effects.
  • A list of action delete effects.
• These lists contain variables, so that we can represent a whole class of actions with one specification.
• Each ground instantiation of the variables yields a specific action.
STRIPS Actions: Example
pickup(X):
  Pre:  {handempty, clear(X), ontable(X)}
  Adds: {holding(X)}
  Dels: {handempty, clear(X), ontable(X)}
[Figure: blocks-world configuration with blocks A, B, C and a robot hand]
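As a concrete illustration, here is a minimal Python sketch (not course code) of pickup(X) in this list-based format, with the CW-KB state represented as a set of ground atoms; the dict layout and the helper names applicable and apply_action are assumptions made for the example.

```python
# Illustrative sketch (assumed encoding, not course code): a STRIPS action as
# precondition/add/delete sets, and a CW-KB state as a set of ground atoms.

def pickup(x):
    """Ground instance of the pickup(X) operator for block x."""
    return {
        "pre":  {"handempty", f"clear({x})", f"ontable({x})"},
        "adds": {f"holding({x})"},
        "dels": {"handempty", f"clear({x})", f"ontable({x})"},
    }

def applicable(state, action):
    return action["pre"] <= state            # every precondition holds

def apply_action(state, action):
    """Progress the CW-KB: remove the delete list, then add the add list."""
    return (state - action["dels"]) | action["adds"]

state = {"handempty", "clear(a)", "ontable(a)", "clear(b)", "ontable(b)"}
act = pickup("a")                            # ground instantiation: X = a
if applicable(state, act):
    state = apply_action(state, act)         # now holding(a); handempty is gone
```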
STRIPS Actions: Example
pickup(X) is called a STRIPS operator.
pickup(a) (a particular instance) is called an action.
STRIPS Actions: Example
putdown(X):
  Pre:  {holding(X)}
  Adds: {clear(X), ontable(X), handempty}
  Dels: {holding(X)}
STRIPS Actions: Example
stack(X,Y):
  Pre:  {holding(X), clear(Y)}
  Adds: {on(X,Y), handempty, clear(X)}
  Dels: {holding(X), clear(Y)}
STRIPS has no Conditional Effects
• Blocks World assumption: the table has infinite space, so it is always clear
• If we stack something on the table (Y = table), we cannot delete clear(table)
• But if Y is an ordinary block, we must delete clear(Y)

stack(X,Y):
  Pre:  {holding(X), clear(Y)}
  Adds: {on(X,Y), handempty, clear(X)}
  Dels: {holding(X), clear(Y)}
STRIPS has no Conditional Effects
• Since STRIPS has no conditional effects, we must sometimes utilize extra actions: one for each type of condition.
• We embed the condition in the precondition, and then alter the effects accordingly.

stack(X,Y):
  Pre:  {holding(X), clear(Y)}
  Adds: {on(X,Y), handempty, clear(X)}
  Dels: {holding(X), clear(Y)}

putdown(X):
  Pre:  {holding(X)}
  Adds: {ontable(X), handempty, clear(X)}
  Dels: {holding(X)}
STRIPS Actions: Example
unstack(X,Y):
  Pre:  { }
  Adds: { }
  Dels: { }
STRIPS Actions: Example
unstack(X,Y):
  Pre:  {clear(X), on(X,Y), handempty}
  Adds: {holding(X), clear(Y)}
  Dels: {clear(X), on(X,Y), handempty}
Planning as a Search Problem
• Given:
  • a CW-KB representing the initial state,
  • a set of STRIPS operators that map a state to a new state, and
  • a goal condition (a conjunction of facts, or a formula),
• the planning problem is to determine a sequence of actions that, when applied to the initial CW-KB, yields an updated CW-KB which satisfies the goal.

This is the classical planning task.
Planning As Search
• This is a search problem, in which our state space representation is a CW-KB.
  • The initial CW-KB is the initial state.
  • Actions are operators mapping a state to a new state.
  • The goal is satisfied by any state that satisfies the goal condition. Typically the goal is a conjunction of primitive facts, so we just need to check whether all the facts in the goal are contained in the CW-KB.
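A hedged sketch of this search formulation: breadth-first forward search over CW-KB states, reusing the {pre, adds, dels} action format from the earlier sketch. Passing operators as (name, ground action) pairs is an assumption of the example, not part of the STRIPS definition.

```python
from collections import deque

# Sketch of forward state-space search (BFS) over CW-KB states.
# `operators` is an assumed list of (name, ground_action) pairs in the
# {"pre", "adds", "dels"} format from the earlier sketch.
def plan_bfs(init, goal, operators):
    """Return a list of action names reaching a state satisfying goal, else None."""
    start = frozenset(init)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:                    # goal: conjunction of ground atoms
            return plan
        for name, act in operators:
            if act["pre"] <= state:          # action is applicable
                succ = frozenset((state - act["dels"]) | act["adds"])
                if succ not in seen:
                    seen.add(succ)
                    frontier.append((succ, plan + [name]))
    return None
```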
Example
[Figure: fragment of the blocks-world search space — block configurations connected by the actions move(b,c), move(c,b), move(c,table), and move(a,b)]
Problems
• The search tree is generally quite large
  • Randomly reconfiguring 9 blocks takes thousands of CPU seconds
• But: the representation suggests some structure
  • Each action only affects a small set of facts
  • Actions depend on each other via their preconditions
• Planning algorithms are designed to take advantage of the fact that the representation makes the "locality" of action changes explicit
Planning Summary
• The model of the environment is known
• The agent performs computations with its model (without external interaction)
• The agent improves its policy
  • Deliberation, reasoning, introspection, pondering, thought, search
How can we inform our agent of what actions to take?
• Assume: the environment is initially unknown
• Consider using a reward function to guide the agent
• If the agent doesn't know what actions to take:
  • Try an action out
  • See what the reward of taking that action is
• This is Reinforcement Learning
Example: Tic Tac Toe
• State: board configuration
• Actions: next move
• Reward: 1 for a win, -1 for a loss, 0 for a draw
• Problem: find π: S → A that maximizes reward
Example: Mobile Robot
• State: location of robot, people
• Actions: motion
• Reward: number of happy faces
• Problem: find π: S → A that maximizes reward
Example: Atari
• State: pixel locations of game agents
• Actions: agent movement
• Reward: score
• Problem: find π: S → A that maximizes reward
Quadruped Robot
http://www.andrewng.org/portfolio/quadruped-robot-locomotion/
Reinforcement Learning
• Goal: learn to choose actions that maximize
  REWARD = r0 + γ r1 + γ² r2 + …, where 0 < γ < 1
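For instance, this discounted sum is a one-liner; the reward list and γ = 0.9 below are invented for illustration.

```python
# Discounted return: r0 + gamma*r1 + gamma^2*r2 + ... (sketch).
def discounted_return(rewards, gamma=0.9):
    return sum(gamma**t * r for t, r in enumerate(rewards))

discounted_return([1, 0, 0, 10])   # 1 + 0.9**3 * 10 = 8.29
```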
Reward
• A reward Rt is a scalar feedback signal
• It indicates how well the agent is doing at step t
• The agent's job is to maximize cumulative reward

Reward hypothesis: all goals can be described by the maximization of expected cumulative reward.
Sequential Decision Making
• Goal: select actions to maximize total future reward
• Actions may have long-term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward (exploitation vs. exploration)
Exploration and Exploitation
• Reinforcement learning is like trial-and-error learning
• Agents should discover a good policy
  • from their experiences of the environment (explore)
  • without losing too much of the reward along the way (exploit)
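One standard way to balance the two is ε-greedy action selection. The sketch below assumes a tabular value estimate Q keyed by (state, action) pairs; the names are illustrative, not from the slides.

```python
import random

# epsilon-greedy selection (sketch): with probability epsilon pick a random
# action (explore); otherwise pick the action with the highest estimated
# value in the assumed tabular estimate Q[(state, action)] (exploit).
def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```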
Agent's Learning Task
• Execute actions in the world
• Observe the results
• Learn a policy π: S → A that maximizes reward from some initial state
Fully Observable Environment
• Full observability: the agent directly observes the environment state
• Agent state = environment state = information state
• Formally, this is a Markov Decision Process (MDP)
Partially Observable Environment
• Partial observability: the agent indirectly observes the environment
  • E.g., a robot with camera vision isn't told its absolute location
  • A trading agent only observes current prices
  • A poker-playing agent only observes public cards
• Agent state ≠ environment state
• Formally, this is a partially observable Markov Decision Process (POMDP)
• The agent must construct its own state representation S_t^a, e.g.:
  • Complete history: S_t^a = H_t
  • Beliefs of environment state: S_t^a = (P[S_t^e = s^1], …, P[S_t^e = s^n])
  • Recurrent neural network: S_t^a = σ(S_{t-1}^a W_s + O_t W_o)
RL Agent
• An RL agent may include one or more of these components:
  • Policy: the agent's behaviour function
  • Value function: how good each state and/or action is
  • Model: the agent's representation of the environment
Maze Example
• States: agent's location
• Actions: N, E, S, W
• Rewards: -1 per time-step
Maze Example
Policy: the agent's behaviour
• Map from state to action
• Deterministic policy: a = π(s)
• Stochastic policy: π(a|s) = P(At = a | St = s)

[Figure: each arrow represents the policy π(s) for each state s]
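A tiny sketch of the two policy forms; the states, actions, and probabilities are invented for illustration.

```python
import random

# Deterministic policy a = pi(s) vs. stochastic policy pi(a|s) (sketch).
det_policy = {"s0": "N", "s1": "E"}              # state -> action
sto_policy = {"s0": {"N": 0.8, "E": 0.2}}        # state -> distribution over actions

def sample_action(policy, s):
    """Draw an action from a stochastic policy at state s."""
    dist = policy[s]
    return random.choices(list(dist), weights=list(dist.values()))[0]
```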
Maze Example
Value function:
• Prediction of future reward
• Used to evaluate the goodness/badness of states

vπ(s) = Eπ[Rt+1 + γ Rt+2 + … | St = s]

[Figure: numbers represent the value function vπ(s) of each state s]
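A sketch of iterative policy evaluation for this definition. It assumes a deterministic policy and deterministic dynamics (so the expectation collapses to a single backup), with a hypothetical step(s, a) -> (next_state, reward) function standing in for the maze.

```python
# Iterative policy evaluation (sketch) for
#   v_pi(s) = E_pi[R_{t+1} + gamma*R_{t+2} + ... | S_t = s].
# Assumes deterministic policy/dynamics; `step(s, a) -> (next_state, reward)`
# is a hypothetical stand-in for the maze, and terminal successors absent
# from v contribute 0 to the backup.
def evaluate_policy(states, policy, step, gamma=1.0, sweeps=100):
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            s2, r = step(s, policy[s])
            v[s] = r + gamma * v.get(s2, 0.0)    # Bellman backup under pi
    return v
```

With the maze's -1-per-step rewards and gamma = 1, this converges to minus the number of steps the policy takes to reach the goal from each state, matching the numbers on the slide's figure.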
Maze Example
Model:
• Predicts what the environment will do next
• The agent may have an internal model of the environment, which determines
  • how actions change the state, and
  • how much reward should be given for each state.
• The model may be imperfect