Results from International Probabilistic Planning Competition
IPPC 2011
@raimonbosch
Why Markov domains? (Crossing Traffic)
Missing information / stochastic behavior: the next state (n+1) cannot be predicted exactly.
We can obtain better rewards depending on the policy we follow.
(1) Solutions are functions (policies) mapping states into actions.
(2) Given an observation, stochastic behaviors can emerge.
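The point above can be sketched with a toy Markov domain (all state and action names here are illustrative, not from the competition domains):

```python
import random

# Toy Markov domain: the successor state is sampled from a distribution,
# so state n+1 cannot be predicted exactly -- which is why solutions are
# policies (state -> action maps) rather than fixed action sequences.
transitions = {
    ("start", "move"): [("goal", 0.7), ("start", 0.3)],
    ("start", "wait"): [("start", 1.0)],
}

# A policy maps each state to an action.
policy = {"start": "move"}

def step(state):
    # Sample the successor state given the policy's chosen action.
    outcomes = transitions[(state, policy[state])]
    states, probs = zip(*outcomes)
    return random.choices(states, weights=probs)[0]
```

Running `step("start")` repeatedly yields `"goal"` about 70% of the time, so a good policy maximizes reward in expectation rather than guaranteeing an outcome.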
IPPC 2011: DOMAINS AND EVALUATION
• 8 domains
– Traffic Control: highly exogenous, concurrent
– Elevator Control: highly exogenous, concurrent
– Game of Life: highly combinatoric
– SysAdmin: highly exogenous, complex transitions
– Navigation: goal-oriented, determinization killer
– Crossing Traffic: goal-oriented, deterministic if move far left
– Skill Teaching: few exogenous events
– Reconnaissance: few exogenous events
• Conditions
– 24 hours for all runs
– 10 instances per domain, 30 runs per instance
Changes from IPPC 2008
- Not goal-based.
- Large branching factors.
- Finite-horizon reward maximization.
- More realistic planning scenarios.
MDP winners
1st: PROST (Eyerich, Keller – Uni. Freiburg)
     UCT / single-outcome determinization, caching
2nd: Glutton (Kolobov, Dai, Mausam, Weld – UW)
     Iterative deepening RTDP, caching

POMDP winners
1st: POMDPX_NUS (Wu, W. S. Lee, D. Hsu – NUS)
     SARSOP / UCT (POMCP)
2nd: KAIST-AILAB (D. Kim, K. Lee, K.-E. Kim – KAIST)
     Symbolic HSVI (ADDs), symmetry detection
Understanding UCT: Monte Carlo tree search
Understanding UCT: multi-armed bandit problem
UCT algorithm by Kocsis and Szepesvári (2006)
Parts of UCT
(1) Monte Carlo tree search.
(2) Performs rollouts in a tree of decision and chance nodes. In decision nodes:
    * Choose any unvisited successor randomly if there is one.
    * Otherwise, choose the successor maximizing the UCB1 policy.
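The decision-node rule above can be sketched as follows (a minimal illustration; the dict-based node layout is an assumption, not any competitor's actual data structure):

```python
import math
import random

def ucb1(parent_visits, child):
    # UCB1 score: mean reward (exploitation) plus a bonus that grows
    # for rarely visited children (exploration).
    return (child["value"] / child["visits"]
            + math.sqrt(2 * math.log(parent_visits) / child["visits"]))

def select(children, parent_visits):
    # Decision node: pick any unvisited successor at random if one
    # exists; otherwise pick the successor maximizing UCB1.
    unvisited = [c for c in children if c["visits"] == 0]
    if unvisited:
        return random.choice(unvisited)
    return max(children, key=lambda c: ucb1(parent_visits, c))
```

For example, with `children = [{"visits": 10, "value": 5.0}, {"visits": 2, "value": 1.8}]` and 12 parent visits, the second child wins: its lower mean (0.9 vs. 0.5... actually higher mean here) and larger exploration bonus give it the higher UCB1 score.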
1st MDP: PROST
Domain-independent probabilistic planning based on UCT, combined with additional techniques:
- Reasonable action pruning
- Q-value initialization
- Search depth limitation
- Reward lock detection
2nd MDP: Glutton
LRTDP with reverse iterative deepening:
• Subsampling the transition function
• Correlated transition function samples
• Caching
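At the core of (L)RTDP are Bellman backups along simulated trajectories. A single finite-horizon backup could be sketched like this (the dict-based `V`, `T`, `R` layout is an assumption for illustration, not Glutton's implementation):

```python
def bellman_backup(V, s, h, actions, T, R):
    # Finite-horizon backup with h steps to go:
    #   Q(s, a) = R(s, a) + sum_{s'} T(s'|s,a) * V[h-1][s']
    #   V[h][s] = max_a Q(s, a)
    # V: dict of dicts, V[steps_to_go][state] -> value estimate.
    best_q = max(
        R[(s, a)] + sum(p * V[h - 1].get(s2, 0.0)
                        for s2, p in T[(s, a)].items())
        for a in actions
    )
    V.setdefault(h, {})[s] = best_q
    return best_q
```

Reverse iterative deepening solves the horizon-1 problem first, then reuses those values as `V[h-1]` when solving horizon 2, and so on, which is why caching pays off.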
POMDP Track: Challenges
- Agent acting under uncertainty.
- Stochastic sequential decision problems.
- Very large number of states.
- Compact representation needed.
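Acting under uncertainty means the agent maintains a belief (a probability distribution over states) instead of knowing the state. A minimal Bayes belief update, with assumed dict-based transition and observation tables, looks like:

```python
def belief_update(b, a, o, T, O):
    # Bayes filter: b'(s') is proportional to
    #   O(o | s', a) * sum_s T(s' | s, a) * b(s)
    nb = {}
    for s, p in b.items():
        for s2, pt in T[(s, a)].items():
            nb[s2] = nb.get(s2, 0.0) + p * pt * O[(s2, a)].get(o, 0.0)
    z = sum(nb.values())  # normalizer; equals Pr(o | b, a)
    return {s: v / z for s, v in nb.items()}
```

These explicit tables scale with the number of states, which is exactly why very large POMDPs need compact (factored) representations such as the ADDs used by KAIST-AILAB.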
1st POMDP: SARSOP
Successive Approximations of the Reachable Space under Optimal Policies
- Solves POMDPs by sampling the belief space.
2nd POMDP: KAIST-AILAB
Uses symbolic heuristic search value iteration (symbolic HSVI) for factored POMDPs:
- Alpha-vector masking method.
- Algebraic decision diagram (ADD) representation.
- Elimination of symmetric structures in the domains.
Thanks!

[1] T. Keller and P. Eyerich, "PROST: Probabilistic Planning Based on UCT," in Proc. ICAPS, 2012.
[2] A. Kolobov, P. Dai, Mausam, and D. S. Weld, "Reverse Iterative Deepening for Finite-Horizon MDPs with Large Branching Factors," in Proc. ICAPS, 2012.
[3] H. Kurniawati, D. Hsu, and W. S. Lee, "SARSOP: Efficient Point-Based POMDP Planning by Approximating Optimally Reachable Belief Spaces," in Proc. Robotics: Science and Systems, 2008.
[4] H. S. Sim, K.-E. Kim, J. H. Kim, D.-S. Chang, and M.-W. Koo, "Symbolic Heuristic Search Value Iteration for Factored POMDPs," in Proc. AAAI, 2008, pp. 1088–1093.