Page 1:

Pruning for Monte Carlo Distributed Reinforcement Learning in Dec-POMDPs

Conference Paper by: Bikramjit Banerjee, University of Southern Mississippi
From the Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013)

Presentation by: John Mills, Florida Institute of Technology

Page 2:

Agenda

• Review of POMDPs

• Decentralized POMDPs

• Applications of Dec-POMDPs

• Overview of Reinforcement Learning

• Purpose of this Work

• Reinforcement Learning for Dec-POMDPs

• Dec-POMDP Algorithms

• Experimental Results

• Review

• Conclusions

Page 3:

Review of POMDPs

• Partially Observable Markov Decision Processes

• Agents do not have complete knowledge of their state

• POMDPs are essentially MDPs with sensor models

• Transition Model: P(s’|s,a)

• Actions: A(s)

• Reward Function: R(s)

• Sensor Model: P(e|s)

• Agents estimate their state by computing a belief state: a conditional probability distribution over the actual states, given the agent's history of observations and actions (a small update sketch follows this list)

• The optimal action only depends on the current belief state
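
As a concrete reference for the belief-state update described above, the standard recursion is b'(s') ∝ P(e|s') * sum_s P(s'|s,a) b(s). Below is a minimal NumPy sketch of that update; the array layouts (T[a, s, s'] and O[s', e]) are illustrative choices, not taken from the slides.

import numpy as np

def belief_update(b, a, e, T, O):
    """One step of the standard discrete POMDP belief update.

    b : current belief over states, shape (|S|,)
    a : index of the action just taken
    e : index of the observation just received
    T : transition model, T[a, s, s2] = P(s2 | s, a)
    O : sensor model, O[s2, e] = P(e | s2)
    """
    predicted = T[a].T @ b                    # sum_s P(s'|s,a) * b(s)
    unnormalized = O[:, e] * predicted        # weight by P(e|s')
    return unnormalized / unnormalized.sum()  # normalize to obtain b'(s')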

Page 4:

Decentralized POMDPs

• Dec-POMDPs are used to model realistic multi-agent systems

• The goal is to find the optimal policy for each agent to maximize a global reward using only local observations

• Agents cannot explicitly communicate their states and observations

• A Dec-POMDP is defined by the following components (a minimal representation sketch follows the list):

• a number of agents, n
• a finite set of states, S
• a set of joint actions A = A_1 × ... × A_n, where A_i is the set of actions agent i can perform
• a transition probability model P(s' | s, a)
• an immediate reward function R(s, a)
• a set of joint observations Ω = Ω_1 × ... × Ω_n, where Ω_i is the set of observations agent i can receive
• an observation probability model O(o | s', a)
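
To make the tuple concrete, here is a minimal sketch of one way to hold these components in code; the field names and callable signatures are illustrative assumptions, not taken from the paper.

from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass
class DecPOMDP:
    """Illustrative container for the Dec-POMDP tuple."""
    n_agents: int
    states: Sequence[str]                     # finite set of states S
    actions: Sequence[Sequence[str]]          # actions[i] = action set A_i of agent i
    observations: Sequence[Sequence[str]]     # observations[i] = observation set of agent i
    P: Callable[[str, Tuple[str, ...], str], float]               # P(s, a, s2) = Pr(s2 | s, a)
    R: Callable[[str, Tuple[str, ...]], float]                    # R(s, a): immediate reward
    O: Callable[[str, Tuple[str, ...], Tuple[str, ...]], float]   # O(s2, a, o) = Pr(o | s2, a)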

Page 5:

Applications of Dec-POMDPs• Formation flight of UAVs

• Cooperative robotics

• Swarm robotics

• Load Balancing Among Queues

• Communication Networks

• Sensor Networks

Page 6:

Overview of Reinforcement Learning

• Reinforcement learning problems are modelled as MDPs

• Agents have no prior knowledge of the environment model and reward function

• Agents learn the optimal policy by performing actions and observing rewards

• Methods can be model-based or model-free

• A variation of Q-learning is used in this work (a generic tabular update sketch follows this list)

• Q-learning learns an action-utility function (the Q-function) rather than state utilities

• Q-values are the value of performing a certain action in a certain state

• The utility of a state is directly related to the Q-values: U(s) = max_a Q(s, a)

• It is a model-free method (the transition model is not needed to estimate Q)

• Without a model, however, agents cannot look ahead
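
For context on the bullets above, here is a minimal tabular sketch of the generic Q-learning update rule, Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)); this is the textbook rule, not the specific variation used in the paper, and alpha/gamma are illustrative values.

from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Generic tabular Q-learning update (not the paper's variant)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)      # max_a' Q(s', a')
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Usage: Q maps (state, action) pairs to values and defaults to 0.
Q = defaultdict(float)
q_learning_update(Q, s="s0", a="listen", r=-1.0, s_next="s1", actions=["listen", "open"])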

Page 7:

Purpose of this Work

• Exact solutions to finite-horizon Dec-POMDPs require significant time and memory

• Many Dec-POMDP solvers are centralized and assume prior knowledge of the model

• Research by Banerjee et al. aims to use decentralized planning and reinforcement learning to solve Dec-POMDPs with lower sample complexity and minimal error

• Additionally, pruning is used to remove parts of the experience tree

• Their methods are evaluated by solving four benchmark Dec-POMDPs

Page 8:

Reinforcement Learning for Dec-POMDPs

• The authors used a Monte Carlo approach to solve Dec-POMDP problems

• Agents take turns learning the best response to each other's policies (sketched after this list)

• Agents do not know the models P, R, and O

• Assumptions:

• Agents know the size of the problem

• Agents know the overall maximum reward

• Agents have partial communication during learning phase

• This approach is semi-model-based because it estimates immediate-reward and history-transition functions
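
A rough sketch of the turn-taking scheme mentioned above is shown below; it only illustrates the alternation of best-response learning and is not the authors' pseudocode (the agent objects, initial_policy, and learn_best_response are hypothetical).

def alternate_learning(agents, learn_best_response, n_rounds):
    """Agents take turns: each learns a best response while the others
    keep their current policies fixed (illustrative sketch only)."""
    policies = {agent: agent.initial_policy() for agent in agents}
    for _ in range(n_rounds):
        for agent in agents:
            teammates = {a: p for a, p in policies.items() if a is not agent}
            # The learning agent runs Monte Carlo episodes against the
            # fixed teammate policies and updates only its own policy.
            policies[agent] = learn_best_response(agent, teammates)
    return policies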

Page 9:

MCQ-ALT Algorithm

• Algorithm description:

• The first action is performed and rewards and observations are received

• The experience tree is explored as the reward and history transitions are estimated

• After a policy has been created, it is evaluated and the Q-value is updated

• Subroutine descriptions (an illustrative episode sketch follows this list):

• Actions are selected based on the history (SELECTACTION)

• Reward and history transition functions are estimated (STEP)

• Visit counts of history-action pairs are tracked; when a pair has occurred frequently enough (N times), the history is marked "known" (ENDEPISODE)

• The Q-value is estimated (QUPDATE)
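
The subroutines above fit together roughly as in the sketch below; the callback names are hypothetical stand-ins for SELECTACTION, STEP, ENDEPISODE, and QUPDATE, and this is a paraphrase of the description, not the published pseudocode.

def run_episode(select_action, env_step, end_episode, q_update, horizon):
    """One learning episode of the Monte Carlo scheme described above
    (illustrative sketch with hypothetical callbacks)."""
    history = ()                          # empty action-observation history
    for _ in range(horizon):
        a = select_action(history)        # SELECTACTION: pick an action for this history
        r, obs = env_step(a)              # STEP: act, receive reward and observation,
                                          #       update the R / history-transition estimates
        history = history + ((a, obs),)   # descend one level in the experience tree
    end_episode(history)                  # ENDEPISODE: update visit counts; mark histories
                                          #             "known" once sampled often enough
    q_update()                            # QUPDATE: re-estimate Q-values from the statistics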

Page 10:

MCQ-ALT Algorithm

[Annotated pseudocode figure: explore the experience tree; estimate the R and H functions; greedy selection; least-frequent selection; update the Q-value]

Page 11:

Modification to MCQ-ALT

• MCQ-ALT invests N samples in every leaf node of the experience tree

• Rare histories become a significant liability and contribute little to the value function

• The value function may not need to be so accurate, because:

• 1) Policies usually converge before value functions

• 2) Most of the experience tree does not appear in the optimal policy

• Confidence is preserved by removing actions that do not meet a derived criterion (a generic confidence-test sketch follows)
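
The paper derives a specific confidence-preservation criterion; purely as an illustration of the general idea (not the paper's bound), a Hoeffding-style test can prune an action at a history once even its optimistic value estimate cannot beat the pessimistic estimate of the current best action.

import math

def prunable(q_est, counts, candidate, delta=0.05, value_range=1.0):
    """Generic confidence-interval pruning test (illustrative only; not
    the criterion derived in the paper). q_est and counts map each
    action to its current value estimate and sample count."""
    def radius(n):
        # Hoeffding half-width for the mean of n samples in [0, value_range]
        return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * max(n, 1)))

    best = max(q_est, key=q_est.get)
    if candidate == best:
        return False
    lower_best = q_est[best] - radius(counts[best])
    upper_candidate = q_est[candidate] + radius(counts[candidate])
    # Prune when the candidate's optimistic estimate is still below the
    # best action's pessimistic estimate.
    return upper_candidate < lower_best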

Page 12:

IMCQ-ALT Algorithm

[Annotated pseudocode figure: remove actions from a history at the proper level based on the confidence-preservation criterion; perform several passes through the experience tree]

Page 13:

Experimental Results

• The MCQ-ALT and IMCQ-ALT algorithms were tested on four benchmark Dec-POMDPs

• DEC-TIGER

• RECYCLING ROBOTS

• BOX-PUSHING

• MARS-ROVERS

• Varied parameters (a small sweep skeleton follows this list):

• Maximum Frequency of Action-History Pair: N = 10, 20, 50, 100, 200, 500

• Maximum Number of Action-Observation Steps (Horizon): T = 3, 4, 5 and T = 2, 3, 4
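
The parameter values above can be swept with a small skeleton like the one below; which horizon set applies to which benchmark is not specified here, and the evaluate callback (train and score one algorithm on one benchmark) is hypothetical.

from itertools import product

N_VALUES = [10, 20, 50, 100, 200, 500]   # visit thresholds from the slide
T_VALUES = [3, 4, 5]                     # or [2, 3, 4], depending on the benchmark

def run_sweep(evaluate):
    """Run evaluate(N, T) over the full parameter grid; returns {(N, T): score}."""
    return {(N, T): evaluate(N, T) for N, T in product(N_VALUES, T_VALUES)}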

Page 14:

DEC-TIGER

Page 15:

BOX-PUSHING

Page 16:

Review

Advantages

• Runtime and memory usage are improved

• No model is needed

• Computational burden is effectively distributed

• Algorithm has a well-defined stopping criterion

Disadvantages

• Agents can only learn from the previous agent

• Robustness and adaptability are not considered

Page 17:

Conclusions

• Decentralized POMDPs appear to be a promising framework for modelling multi-agent problems

• The authors proposed an approach to solving Dec-POMDPs that promises to find (near-)optimal solutions in less time and with less memory than existing methods

• The algorithm has been shown to perform well for benchmark problems

• Larger horizon values should be investigated

• Future testing should apply the algorithm to other problems with well defined requirements and success criteria

Page 18:

QUESTIONS?

Page 19:

REFERENCES

• Dec-POMDPs by Frans A. Oliehoek


• Banerjee, B. 2013. Pruning for Monte Carlo Distributed Reinforcement Learning in Decentralized POMDPs. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (AAAI-13), 88-94.

• Banerjee, B.; Lyle, J.; Kraemer, L.; Yellamraju, R. 2012. Solving Finite Horizon Decentralized POMDPs by Distributed Reinforcement Learning. In The Seventh Annual Workshop on Multiagent Sequential Decision-Making Under Uncertainty (MSDM-2012).