Automatic Induction of MAXQ Hierarchies

Neville Mehta, Michael Wynkoop, Soumya Ray, Prasad Tadepalli, Tom Dietterich
School of EECS, Oregon State University

Funded by the DARPA Transfer Learning Program
Hierarchical Reinforcement Learning

- Exploits domain structure to facilitate learning
  - Policy constraints
  - State abstraction
- Paradigms: Options, HAMs, MaxQ
- MaxQ task hierarchy
  - Directed acyclic graph of subtasks
  - Leaves are the primitive MDP actions
- Traditionally, task structure is provided as prior knowledge to the learning agent
Model Representation

- Dynamic Bayesian Networks for the transition and reward models
- Conditional probabilities and reward values are represented symbolically as decision trees
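As an illustration of this representation (a hypothetical sketch, not the authors' code), a conditional distribution from such a DBN can be stored as a decision tree over the current-state variables; the class names and the example variable are illustrative only:

```python
# Hypothetical sketch: a DBN conditional distribution stored as a decision tree.

class Leaf:
    def __init__(self, dist):
        self.dist = dist              # next-step value -> probability

class Split:
    def __init__(self, var, children):
        self.var = var                # current-step variable tested at this node
        self.children = children      # variable value -> subtree

def lookup(tree, state):
    """Walk the tree using the current state; return the predicted distribution."""
    while isinstance(tree, Split):
        tree = tree.children[state[tree.var]]
    return tree.dist

# Illustrative CPT: the next value of agent.resource depends on region.goldmine.
cpt = Split("region.goldmine", {
    1: Leaf({"gold": 1.0}),           # at the goldmine, mining yields gold
    0: Leaf({0: 1.0}),                # elsewhere the agent stays empty-handed
})
print(lookup(cpt, {"region.goldmine": 1}))    # {'gold': 1.0}
```

The tree makes context-specific independence explicit: variables never tested on a path are irrelevant in that context, which is exactly what the later relevance analysis exploits.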
Goal: Learn Task Hierarchies

- Avoid the significant manual engineering of task decomposition
  - Requires a deep understanding of the purpose and function of subroutines, as in computer science
- Frameworks for learning exit-option hierarchies:
  - HEXQ: determines exit states through random exploration
  - VISA: determines exit states by analyzing DBN action models
Focused Creation of Subtasks

- HEXQ and VISA create a separate subtask for each possible exit state
  - This can generate a large number of subtasks
- Claim: defining good subtasks requires maximizing state abstraction while identifying "useful" subgoals
- Our approach: selectively define subtasks with single abstract exit states
Transfer Learning Scenario

- Working hypothesis:
  - MaxQ value-function learning is much quicker than non-hierarchical (flat) Q-learning
  - Hierarchical structure is more amenable to transfer from source tasks to the target than value functions
- Transfer scenario:
  - Solve a "source problem" (no CPU time limit)
    - Learn DBN models
    - Learn the MAXQ hierarchy
  - Solve a "target problem" under the assumption that the same hierarchical structure applies
    - This constraint will be relaxed in future work
MaxNode State Abstraction

[DBN figure: nodes X_t, Y_t, action A_t, next-step nodes X_t+1, Y_t+1, and reward R_t+1]

- Y is irrelevant within this action
  - It affects the dynamics but not the reward function
- In HEXQ, VISA, and our work, we assume there is only one terminal abstract state, hence no pseudo-reward is needed
- As a side effect, this enables "funnel" abstractions in parent tasks
Our Approach: AI-MAXQ

1. Learn DBN action models via random exploration (other work)
2. Apply Q learning to solve the source problem
3. Generate a good trajectory from the learned Q function
4. Analyze the trajectory to produce a CAT (this talk)
5. Analyze the CAT to define a MAXQ hierarchy (this talk)
Wargus Resource-Gathering Domain

[Screenshot of the Wargus resource-gathering map, annotated with region (reg.*) and agent (a.*) variables]
Causally Annotated Trajectory (CAT)

[CAT figure: Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End, with arcs labeled a.r, req.gold, req.wood, a.*, and reg.*]

- A variable v is relevant to an action if the DBN for that action tests or changes v (this includes both the variable nodes and the reward nodes)
- Create an arc from action A to action B labeled with variable v iff v is relevant to both A and B but not to any intermediate action
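The arc-construction rule above can be sketched as follows. This is a hypothetical rendering: the trajectory, the relevance table, and the convention that Start writes (and End reads) every variable are illustrative assumptions, not the paper's code.

```python
# Sketch of the CAT arc-construction rule: for each variable v, link each pair of
# consecutive v-relevant actions. Start and End are sentinel markers assumed to
# touch every variable.

def build_cat(trajectory, relevant, variables):
    """Return arcs (i, j, v): v is relevant to actions i and j, but to none between."""
    seq = ["Start"] + list(trajectory) + ["End"]

    def is_relevant(action, v):
        return action in ("Start", "End") or v in relevant[action]

    arcs = []
    for v in sorted(variables):
        prev = 0                                   # index of the Start marker
        for j in range(1, len(seq)):
            if is_relevant(seq[j], v):
                arcs.append((prev, j, v))
                prev = j
    return arcs

# Illustrative mini-trajectory with a made-up relevance table:
relevance = {
    "Goto":     {"agent.x"},
    "MineGold": {"agent.x", "agent.resource"},
    "Deposit":  {"agent.resource", "req.gold"},
}
arcs = build_cat(["Goto", "MineGold", "Deposit"], relevance,
                 {"agent.x", "agent.resource", "req.gold"})
# e.g. (0, 3, 'req.gold'): req.gold links Start directly to Deposit,
# since no intermediate action tests or changes it.
```

Arcs that skip over long stretches of the trajectory are the signal the later scan uses: they mark variables untouched inside a candidate segment.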
CAT Scan

[CAT figure: Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End]

- An action is absorbed regressively into a trajectory segment as long as:
  - It does not have an effect beyond the trajectory segment (this prevents exogenous effects)
  - It does not increase the state abstraction
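The first absorption condition can be sketched over the (i, j, v) arc triples of a CAT. This is a hypothetical simplification: "no effect beyond the segment" becomes "every arc leaving the candidate action lands inside the segment", and the second condition (the state abstraction must not grow) is omitted.

```python
# Sketch of the regressive absorption test over CAT arcs (i, j, v).

def grow_segment(arcs, goal_index):
    """Scan backward from the action achieving a goal; return the segment start."""
    start = goal_index
    while start > 1:                   # index 0 is the Start marker
        candidate = start - 1
        leaving = [j for (i, j, _v) in arcs if i == candidate]
        if all(j <= goal_index for j in leaving):
            start = candidate          # nothing escapes the segment: absorb
        else:
            break                      # effect escapes the segment: stop here
    return start

# Illustrative arcs: action 2 affects action 5, outside a segment ending at 4,
# so the scan absorbs action 3 but stops before action 2.
arcs = [(0, 1, "x"), (1, 2, "x"), (2, 5, "y"), (3, 4, "z"), (4, 5, "z")]
print(grow_segment(arcs, 4))    # 3
```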
[Subsequent slides animate the scan over the same CAT: the trajectory is progressively carved into segments, which are grouped under Root and then under the subtasks Harvest Gold and Harvest Wood]
Induced Wargus Hierarchy

Root
  Harvest Gold
    Get Gold: GGoto(goldmine), Mine Gold
    Put Gold: GGoto(townhall), GDeposit
  Harvest Wood
    Get Wood: WGoto(forest), Chop Wood
    Put Wood: WGoto(townhall), WDeposit

(each *Goto task invokes the shared Goto(loc) subtask)
Induced Abstraction & Termination

Task Name       | State Abstraction                         | Termination Condition
----------------|-------------------------------------------|----------------------
Root            | req.gold, req.wood                        | req.gold = 1 && req.wood = 1
Harvest Gold    | req.gold, agent.resource, region.townhall | req.gold = 1
Get Gold        | agent.resource, region.goldmine           | agent.resource = gold
Put Gold        | req.gold, agent.resource, region.townhall | agent.resource = 0
GGoto(goldmine) | agent.x, agent.y                          | agent.resource = 0 && region.goldmine = 1
GGoto(townhall) | agent.x, agent.y                          | req.gold = 0 && agent.resource = gold && region.townhall = 1
Harvest Wood    | req.wood, agent.resource, region.townhall | req.wood = 1
Get Wood        | agent.resource, region.forest             | agent.resource = wood
Put Wood        | req.wood, agent.resource, region.townhall | agent.resource = 0
WGoto(forest)   | agent.x, agent.y                          | agent.resource = 0 && region.forest = 1
WGoto(townhall) | agent.x, agent.y                          | req.wood = 0 && agent.resource = wood && region.townhall = 1
Mine Gold       | agent.resource, region.goldmine           | NA
Chop Wood       | agent.resource, region.forest             | NA
GDeposit        | req.gold, agent.resource, region.townhall | NA
WDeposit        | req.wood, agent.resource, region.townhall | NA
Goto(loc)       | agent.x, agent.y                          | NA

Note that because each subtask has a unique terminal state, Result Distribution Irrelevance applies.
Claims

- The resulting hierarchy is unique
  - It does not depend on the order in which goals and trajectory sequences are analyzed
- All state abstractions are safe
  - Extends MaxQ Node Irrelevance to the induced structure
- There exists a hierarchical policy within the induced hierarchy that will reproduce the observed trajectory
- The learned hierarchical structure is "locally optimal"
  - No local change in the trajectory segmentation can improve the state abstractions (a very weak notion of optimality)
Experimental Setup

- Randomly generate pairs of source-target resource-gathering maps in Wargus
- Learn the optimal policy in the source
- Induce the task hierarchy from a single (near-)optimal trajectory
- Transfer this hierarchical structure to the MaxQ value-function learner for the target
- Compare to direct Q learning and to MaxQ learning on a manually engineered hierarchy within the target
Hand-Built Wargus Hierarchy

Root
  Get Gold: Goto(loc), Mine Gold
  Get Wood: Goto(loc), Chop Wood
  GWDeposit: Goto(loc), Deposit
Hand-Built Abstractions & Terminations

Task Name    | State Abstraction                                   | Termination Condition
-------------|-----------------------------------------------------|----------------------
Root         | req.gold, req.wood, agent.resource                  | req.gold = 1 && req.wood = 1
Harvest Gold | agent.resource, region.goldmine                     | agent.resource ≠ 0
Harvest Wood | agent.resource, region.forest                       | agent.resource ≠ 0
GWDeposit    | req.gold, req.wood, agent.resource, region.townhall | agent.resource = 0
Mine Gold    | region.goldmine                                     | NA
Chop Wood    | region.forest                                       | NA
Deposit      | req.gold, req.wood, agent.resource, region.townhall | NA
Goto(loc)    | agent.x, agent.y                                    | NA
Results: Wargus

[Learning curves, Wargus domain, 7 reps: total duration vs. episode (0-100) for Induced (MAXQ), Hand-engineered (MAXQ), and No transfer (Q)]
Need for Demonstrations

- VISA uses DBNs only for causal information
  - Its analysis is applied globally across the state space, without focusing on the pertinent subspace
- Problems:
  - Global variable coupling might prevent concise abstraction
  - Exit states can grow exponentially: one for each path in the decision-tree encoding
- A modified bitflip domain exposes these shortcomings
Modified Bitflip Domain

- State space: b0, …, b(n-1)
- Action space:
  - Flip(i), 0 ≤ i < n-1:
    - If b0 ∧ … ∧ b(i-1) = 1 then b(i) ← ~b(i)
    - Else b0 ← 0, …, b(i) ← 0
  - Flip(n-1):
    - If parity(b0, …, b(n-2)) ∧ b(n-2) = 1 then b(n-1) ← ~b(n-1)
    - Else b0 ← 0, …, b(n-1) ← 0
    - parity(…) requires even parity if n-1 is even, odd parity otherwise
- Reward: -1 for all actions
- Terminal/goal state: b0 ∧ … ∧ b(n-1) = 1
Modified Bitflip Domain (example, n = 7)

1 1 1 0 0 0 0  --Flip(3)-->  1 1 1 1 0 0 0  --Flip(1)-->  1 0 1 1 0 0 0  --Flip(4)-->  0 0 0 0 0 0 0
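The example trace above can be reproduced with a minimal executable sketch of the dynamics (my own 0-indexed rendering of the rules, not the authors' code):

```python
# Minimal sketch of the modified bitflip dynamics (0-indexed bits).

def flip(state, i):
    b = list(state)
    n = len(b)
    if i < n - 1:
        if all(b[:i]):                     # b0 ∧ … ∧ b(i-1); vacuously true for i = 0
            b[i] ^= 1
        else:
            b[: i + 1] = [0] * (i + 1)     # failed precondition resets b0..bi
    else:                                  # Flip(n-1)
        want_even = (n - 1) % 2 == 0       # parity(...) is even iff n-1 is even
        parity_ok = (sum(b[: n - 1]) % 2 == 0) == want_even
        if parity_ok and b[n - 2] == 1:
            b[n - 1] ^= 1
        else:
            b = [0] * n                    # reset all bits
    return b

# The example trace above:
s = [1, 1, 1, 0, 0, 0, 0]
s = flip(s, 3)    # -> [1, 1, 1, 1, 0, 0, 0]
s = flip(s, 1)    # -> [1, 0, 1, 1, 0, 0, 0]
s = flip(s, 4)    # -> [0, 0, 0, 0, 0, 0, 0]  (b1 = 0 fails the precondition)
print(s)
```

Note the strictly nested preconditions: Flip(i) only succeeds once b0 … b(i-1) are all set, which is what makes the domain decompose into the chain-shaped hierarchy induced later.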
VISA’s Causal GraphVISA’s Causal Graph
Variables grouped into two strongly connected Variables grouped into two strongly connected components (dashed ellipses)components (dashed ellipses)
Both components affect the reward nodeBoth components affect the reward node
b0 b1
Flip(1) Flip(2)bn-2 bn-1
Flip(n-1)b2
Flip(2)
R
Flip(n-1)
Flip(n-1)Flip(3)
Flip(n-1)Flip(3)Flip(2)
Flip(n-2)
Flip(n-2)
Flip(n-2)
VISA Task Hierarchy

[Hierarchy figure: Root invokes Flip(n-1) and a subtask for achieving Parity(b0, …, b(n-2)) ∧ b(n-2) = 1, which has 2^(n-3) exit options and invokes Flip(0), Flip(1), …]
Bitflip CAT

[CAT figure: Start → Flip(0) → Flip(1) → … → Flip(n-2) → Flip(n-1) → End, with arcs labeled b0, b1, …, b(n-2), b(n-1), and b0, …, b(n-1) into End]
Induced MAXQ Task Hierarchy

[Hierarchy figure: Root invokes Flip(n-1) and a subtask with goal b0 ∧ … ∧ b(n-2) = 1; that subtask invokes Flip(n-2) and a subtask with goal b0 ∧ … ∧ b(n-3) = 1; and so on down to a subtask with goal b0 ∧ b1 = 1 that invokes Flip(0) and Flip(1)]
Results: Bitflip

[Learning curves, bitflip domain, 7 bits, 20 reps: total duration vs. episode (0-100) for Q, MaxQ, and VISA]
Conclusion

- Causality analysis is the key to our approach
  - It enables us to find concise subtask definitions from a demonstration
  - The CAT scan is easy to perform
- Future work: extend to learn from multiple demonstrations
  - Disjunctive goals