Results from International Probabilistic Planning Competition
IPPC 2011
@raimonbosch
Why Markov domains? (Crossing Traffic)
Missing information / stochastic behavior: the next state (n+1) cannot be predicted exactly.
We can obtain better rewards depending on the policy we follow.
(1) Solutions are functions (policies) mapping states into actions.
(2) Given an observation, stochastic behaviors can emerge.
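The point above can be sketched with a toy Markov domain (all state and action names here are illustrative, not from the competition domains):

```python
import random

# Toy Markov domain: the successor state is sampled from a distribution,
# so state n+1 cannot be predicted exactly -- which is why solutions are
# policies (state -> action maps) rather than fixed action sequences.
transitions = {
    ("start", "move"): [("goal", 0.7), ("start", 0.3)],
    ("start", "wait"): [("start", 1.0)],
}

# A policy maps each state to an action.
policy = {"start": "move"}

def step(state):
    # Sample the successor state given the policy's chosen action.
    outcomes = transitions[(state, policy[state])]
    states, probs = zip(*outcomes)
    return random.choices(states, weights=probs)[0]
```

Running `step("start")` repeatedly yields `"goal"` about 70% of the time, so a good policy maximizes reward in expectation rather than guaranteeing an outcome.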
IPPC 2011: DOMAINS AND EVALUATION
• 8 domains
– Traffic Control: highly exogenous, concurrent
– Elevator Control: highly exogenous, concurrent
– Game of Life: highly combinatoric
– SysAdmin: highly exogenous, complex transitions
– Navigation: goal-oriented, determinization killer
– Crossing Traffic: goal-oriented, deterministic if move far left
– Skill Teaching: few exogenous events
– Reconnaissance: few exogenous events
• Conditions
– 24 hours for all runs
– 10 instances per domain, 30 runs per instance
Changes from IPPC 2008
- Not goal-based.
- Large branching factors.
- Finite-horizon reward maximization.
- More realistic planning scenarios.
MDP winners
1st: PROST (Eyerich, Keller – Uni. Freiburg)
     UCT / single-outcome determinization, caching
2nd: Glutton (Kolobov, Dai, Mausam, Weld – UW)
     Iterative deepening RTDP, caching

POMDP winners
1st: POMDPX_NUS (Wu, W. S. Lee, D. Hsu – NUS)
     SARSOP / UCT (POMCP)
2nd: KAIST-AILAB (D. Kim, K. Lee, K.-E. Kim – KAIST)
     Symbolic HSVI (ADDs), symmetry detection
Understanding UCT: Monte Carlo tree search
Understanding UCT: multi-armed bandit problem
UCT algorithm by Kocsis and Szepesvári (2006)
Parts of UCT
(1) Monte Carlo tree search.
(2) Performs rollouts in a tree of decision and chance nodes. In decision nodes:
    * Choose any unvisited successor randomly if there is one.
    * Otherwise, choose the successor maximizing the UCB1 policy.
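The decision-node rule above can be sketched as follows (a minimal illustration; the dict-based node layout is an assumption, not any competitor's actual data structure):

```python
import math
import random

def ucb1(parent_visits, child):
    # UCB1 score: mean reward (exploitation) plus a bonus that grows
    # for rarely visited children (exploration).
    return (child["value"] / child["visits"]
            + math.sqrt(2 * math.log(parent_visits) / child["visits"]))

def select(children, parent_visits):
    # Decision node: pick any unvisited successor at random if one
    # exists; otherwise pick the successor maximizing UCB1.
    unvisited = [c for c in children if c["visits"] == 0]
    if unvisited:
        return random.choice(unvisited)
    return max(children, key=lambda c: ucb1(parent_visits, c))
```

For example, with `children = [{"visits": 10, "value": 5.0}, {"visits": 2, "value": 1.8}]` and 12 parent visits, the second child wins: its lower mean (0.9 vs. 0.5... actually higher mean here) and larger exploration bonus give it the higher UCB1 score.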
1st MDP: PROST
Domain-independent probabilistic planning based on UCT, combined with additional techniques:
- Reasonable action pruning
- Q-value initialization
- Search depth limitation
- Reward lock detection
2nd MDP: Glutton
LRTDP with reverse iterative deepening:
• Subsampling the transition function
• Correlated transition function samples
• Caching
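At the core of (L)RTDP are Bellman backups along simulated trajectories. A single finite-horizon backup could be sketched like this (the dict-based `V`, `T`, `R` layout is an assumption for illustration, not Glutton's implementation):

```python
def bellman_backup(V, s, h, actions, T, R):
    # Finite-horizon backup with h steps to go:
    #   Q(s, a) = R(s, a) + sum_{s'} T(s'|s,a) * V[h-1][s']
    #   V[h][s] = max_a Q(s, a)
    # V: dict of dicts, V[steps_to_go][state] -> value estimate.
    best_q = max(
        R[(s, a)] + sum(p * V[h - 1].get(s2, 0.0)
                        for s2, p in T[(s, a)].items())
        for a in actions
    )
    V.setdefault(h, {})[s] = best_q
    return best_q
```

Reverse iterative deepening solves the horizon-1 problem first, then reuses those values as `V[h-1]` when solving horizon 2, and so on, which is why caching pays off.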
POMDP Track: Challenges
- Agent acting under uncertainty.
- Stochastic sequential decision problems.
- Very large number of states.
- Compact representation needed.
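Acting under uncertainty means the agent maintains a belief (a probability distribution over states) instead of knowing the state. A minimal Bayes belief update, with assumed dict-based transition and observation tables, looks like:

```python
def belief_update(b, a, o, T, O):
    # Bayes filter: b'(s') is proportional to
    #   O(o | s', a) * sum_s T(s' | s, a) * b(s)
    nb = {}
    for s, p in b.items():
        for s2, pt in T[(s, a)].items():
            nb[s2] = nb.get(s2, 0.0) + p * pt * O[(s2, a)].get(o, 0.0)
    z = sum(nb.values())  # normalizer; equals Pr(o | b, a)
    return {s: v / z for s, v in nb.items()}
```

These explicit tables scale with the number of states, which is exactly why very large POMDPs need compact (factored) representations such as the ADDs used by KAIST-AILAB.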
1st POMDP: SARSOP
Successive Approximations of the Reachable Space under Optimal Policies
- Solves POMDPs by sampling the belief space.
2nd POMDP: KAIST-AILAB
Uses symbolic heuristic search value iteration (symbolic HSVI) for factored POMDPs:
- Alpha-vector masking method.
- Algebraic decision diagram (ADD) representation.
- Elimination of symmetric structures in the domains.
Thanks!

[1] T. Keller and P. Eyerich, "PROST: Probabilistic Planning Based on UCT," in Proc. ICAPS, 2012.
[2] A. Kolobov, P. Dai, Mausam, and D. S. Weld, "Reverse Iterative Deepening for Finite-Horizon MDPs with Large Branching Factors," in Proc. ICAPS, 2012.
[3] H. Kurniawati, D. Hsu, and W. S. Lee, "SARSOP: Efficient Point-Based POMDP Planning by Approximating Optimally Reachable Belief Spaces," in Proc. Robotics: Science and Systems, 2008.
[4] H. S. Sim, K.-E. Kim, J. H. Kim, D.-S. Chang, and M.-W. Koo, "Symbolic Heuristic Search Value Iteration for Factored POMDPs," in Proc. AAAI, 2008, pp. 1088–1093.