Behavior Hierarchy Learning in a Behavior-based System using Reinforcement Learning

Amir massoud Farahmand, Majid Nili Ahmadabadi, Babak Najar Araabi
[email protected], {mnili, araabi}@ut.ac.ir
Department of Electrical and Computer Engineering, University of Tehran, Iran

IROS 2004 (Sendai, Japan)
Paper Outline

• Challenges and Requirements of Robotic Systems
• Behavior-based Approach to AI
• How should we design a Behavior-based System (BBS)?
• Learning in BBS
• Structure Learning in BBS
• Value Function Decomposition
• Experiments: Multi-Robot Object Lifting
• Conclusions, Ongoing Research, and Future Work
Challenges and Requirements of Robotic Systems

Challenges
• Sensor and effector uncertainty
• Partial observability
• Non-stationarity

Requirements (among many others)
• Multiple goals
• Robustness
• Multiple sensors
• Scalability
• Automatic design
• [Learning]
Behavior-based Approach to AI

• The behavior-based approach is a good candidate for low-level intelligence.
• Behavioral (activity) decomposition
  – as opposed to functional decomposition
• Behavior: sensor → action (a direct link between perception and action)
• Situatedness
  – Situatedness motto: the world is its own best model!
• Embodiment
• Intelligence as emergence
  – from the interaction of the agent with its environment
Behavioral Decomposition

Behaviors run in parallel between sensors and actuators, layered from lowest to highest:
• avoid obstacles
• locomote
• explore
• build maps
• manipulate the world
Behavior-based System Design

• Hand design
  – Common almost everywhere (just ask some people at IROS04)
  – Complicated: may be infeasible for complex problems
  – Even if a working system can be found, it is probably not optimal.
• Evolution
  – Time consuming
  – Can find good solutions
  – Biologically plausible
• Learning
  – Biologically plausible
  – Learning is essential for the life-time survival of the agent.

We focus on learning in this presentation.
The Importance of Learning

• Unknown environment/body
  – An [exact] model of the environment/body is not known
• Non-stationary environment/body
  – Changing environments (offices, houses, streets, and almost everywhere)
  – Aging
• The designer may not know how to benefit from every aspect of the agent/environment
  – Let the agent learn it by itself (learning as optimization)
• etc.
Learning in Behavior-based Systems

• There are a few works on behavior-based learning
  – Mataric, Mahadevan, Maes, and others
• … but there is no deep investigation of it (especially a mathematical formulation)!
Learning in Behavior-based Systems

There are different methods of learning with different viewpoints, but we concentrate on Reinforcement Learning:
– [Agent] Did I perform it correctly?
– [Tutor] Yes/No!
Learning in Behavior-based Systems

We divide learning in a BBS into two parts:
• Structure learning
  – How should we organize behaviors in the architecture, assuming a repertoire of working behaviors?
• Behavior learning
  – How should each behavior behave? (We do not yet have the necessary toolbox.)
Structure Learning Assumptions

• Structure learning in the Subsumption Architecture, as a good sample of a BBS
• Purely parallel case
• We know B1, B2, …, but we do not know how to arrange them in the architecture
  – We know how to {avoid obstacles, pick an object, stop, move forward, turn, …}, but we do not know which behavior is superior to the others.
Structure Learning

Behavior Toolbox: {avoid obstacles, locomote, explore, build maps, manipulate the world}

The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).
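In the purely parallel subsumption setting described above, the arrangement matters because the highest activated behavior suppresses all lower ones. The following is a minimal sketch of that arbitration rule; the behavior representation (a function that either proposes an action or abstains with `None`) and the toy behaviors are illustrative assumptions, not the paper's implementation.

```python
def arbitrate(layers, percept):
    """layers[0] is the highest (most dominant) layer.
    The highest activated behavior suppresses all lower ones
    and becomes the controlling behavior."""
    for behavior in layers:
        action = behavior(percept)
        if action is not None:  # this behavior is activated
            return action       # it controls; lower layers are suppressed
    return "noop"               # no behavior activated

# Toy behaviors (hypothetical): avoid obstacles placed above explore.
avoid = lambda p: "turn" if p.get("obstacle") else None
explore = lambda p: "forward"

print(arbitrate([avoid, explore], {"obstacle": True}))   # -> turn
print(arbitrate([avoid, explore], {"obstacle": False}))  # -> forward
```

Swapping the order of the two behaviors in `layers` changes which one wins when both are activated, which is exactly the arrangement the agent has to learn.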
Structure Learning

Behavior Toolbox: {avoid obstacles, locomote, explore, build maps, manipulate the world}

1. "explore" becomes the controlling behavior and suppresses "avoid obstacles".
2. The agent hits a wall!
Structure Learning

The tutor (environment) gives "explore" a punishment for being in that place in the structure.
Structure Learning

"explore" is not a very good behavior for the highest position in the structure, so it is replaced by "avoid obstacles".
Structure Learning Issues

• How should we represent the structure?
  – Sufficient (the concept space should be covered by the hypothesis space)
  – Tractable (a small hypothesis space)
  – Well-defined credit assignment
• How should we assign credit to the architecture?
  – If the agent receives a reward/punishment, how should we reward/punish the structure of the architecture?
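Tractability is the crux: the space of full orderings of n behaviors grows factorially, while a table of per-(layer, behavior) values grows only quadratically. A quick back-of-the-envelope comparison (the toolbox size n = 5 is just an example):

```python
from math import factorial

n = 5  # number of behaviors in the toolbox (illustrative)

# Concept space: every distinct full ordering of the n behaviors.
num_structures = factorial(n)

# A per-(layer, behavior) value table, as in the Zero-Order method
# introduced later, needs only one entry per (layer, behavior) pair.
table_entries = n * n

print(num_structures, table_entries)  # -> 120 25
```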
Value Function Decomposition and Structure Learning

Each structure has a value, defined by the reinforcement signal it receives:

V(T) = E[ r_t | agent with structure T ]

• The objective is to find a structure T with a high value.
• We decompose the value function into simpler components, which lets us benefit from previous experiments.
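A naive way to read the definition V(T) = E[r_t | agent with structure T] is as a Monte Carlo average of the reinforcement received while running each candidate structure. The sketch below does exactly that against a toy stand-in for the environment; `run_episode` and the structures "A"/"B" are hypothetical, and the point is only that estimating every structure independently ignores shared components, which is what the decomposition fixes.

```python
import random

def estimate_structure_value(run_episode, structure, episodes=100):
    """Monte Carlo estimate of V(T): average reinforcement
    collected over repeated episodes with structure T."""
    rewards = [run_episode(structure) for _ in range(episodes)]
    return sum(rewards) / len(rewards)

# Toy environment stand-in: structure "A" happens to be better than "B".
def run_episode(structure):
    base = 1.0 if structure == "A" else 0.2
    return base + random.uniform(-0.1, 0.1)

random.seed(0)
vA = estimate_structure_value(run_episode, "A")
vB = estimate_structure_value(run_episode, "B")
assert vA > vB  # the estimator prefers the better structure
```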
Value Function Decomposition

• It is possible to decompose the total system value into the value of each behavior in each layer.
• We call this the Zero-Order method.

V_ZO(i, j) = E[ r_t | B_j is the controlling behavior in the i-th layer ]
Value Function Decomposition: Zero-Order Method

The Zero-Order method stores the value of a behavior being in a specific layer.

ZO value table in the agent's mind:

               avoid obstacles   explore   locomote
Higher layer        0.8            0.7       0.4
Lower layer         0.6            0.9       0.4
Credit Assignment for the Zero-Order Method

• The controlling behavior is the only behavior responsible for the current reinforcement signal.
• An appropriate updating method for the ZO value table is available.
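The stated credit-assignment rule — only the controlling behavior's (layer, behavior) entry is credited — can be sketched as a simple running-average update. The learning rate `alpha`, the dictionary layout, and the exact update form are illustrative assumptions; the paper's actual updating method is in the proceedings.

```python
alpha = 0.1          # assumed learning rate
zo_value = {}        # (layer, behavior) -> estimated V_ZO(i, j)

def zo_update(layer, controlling_behavior, reward):
    """Credit only the controlling behavior in its layer."""
    key = (layer, controlling_behavior)
    old = zo_value.get(key, 0.0)
    zo_value[key] = old + alpha * (reward - old)

# Example: "explore" controls from the highest layer and the robot
# hits a wall (punishment); "avoid obstacles" later earns a reward there.
zo_update(layer=0, controlling_behavior="explore", reward=-1.0)
zo_update(layer=0, controlling_behavior="avoid obstacles", reward=+1.0)
```

After enough updates, a structure can be read off the table, e.g. by placing in each layer the behavior with the highest value for that layer.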
Value Function Decomposition: Another Method (First Order)

The First-Order method stores the value of the relative order of behaviors:
– How good/bad is it if B1 is placed higher than B2?
• V(avoid obstacles > explore) = 0.8
• V(explore > avoid obstacles) = -0.3
• Sorry, not that easy (or informative) to show graphically!
• Credits are assigned to all (controlling, activated) pairs of behaviors.
  – The agent receives a reward while B1 is controlling and B3 and B5 are activated:
    • (B1 > B3): +
    • (B1 > B5): +
Structure Representation

Both methods come with probabilistic reasoning that shows how to
– decompose the total system value into simple components
– assign credits
– update the value tables

See the Proceedings for the mathematical formulation!
Example: Multi-Robot Object Lifting

• A group of three robots wants to lift an object using their own local sensors
  – No central control
  – No communication
  – Local sensors
• Objectives
  – Reaching a prescribed height
  – Keeping the tilt angle small
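The two objectives suggest what a reinforcement signal for this task might look like. The function below is a hypothetical sketch combining a height-error term and a tilt penalty; the form and the weight `w_tilt` are my assumptions, not the signal used in the paper.

```python
def lifting_reward(height, goal_height, tilt_deg, w_tilt=0.1):
    """Hypothetical reinforcement for the lifting task:
    penalize distance from the goal height and penalize tilt."""
    height_term = -abs(goal_height - height)  # reach the prescribed height
    tilt_term = -w_tilt * abs(tilt_deg)       # keep the tilt angle small
    return height_term + tilt_term

print(lifting_reward(height=2.0, goal_height=3.0, tilt_deg=10.0))  # -> -2.0
```

As the later slides note, designing such a signal well is not easy: the weight trades off speed of lifting against levelness of the object.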
Example: Multi-Robot Object Lifting

Behavior Toolbox: {Stop, Push More, Hurry Up, Slow Down, Don't Go Fast} — but in which arrangement?!
Example: Multi-Robot Object Lifting

[Figure: Average total reward per episode over 50 episodes, comparing the Zero-Order and First-Order methods against the mean hand-designed performance.]
Example: Multi-Robot Object Lifting

[Figure: Sample shot of the height (z) of each of the three robots over the steps of an episode after sufficient learning, together with the goal height.]
Example: Multi-Robot Object Lifting

[Figure: Sample shot of the tilt angle of the object (in degrees) over the steps of an episode after sufficient learning.]
Conclusions, Ongoing Research, and Future Work

• We have devised two different methods for structure learning in behavior-based systems.
• Good results in two different tasks
  – Multi-robot object lifting
  – An abstract problem (not reported yet)
Conclusions, Ongoing Research, and Future Work

• … but where should the necessary behaviors come from?
  – Behavior learning
• We have devised some methods for behavior learning, which will be reported soon.
Conclusions, Ongoing Research, and Future Work

• However, many steps remain before fully automated agent design:
  – How should we generate new behaviors without even knowing which sensory information is necessary for the task (feature selection)?
  – The problem of reinforcement signal design
    • Designing a good reinforcement signal is not easy at all.