Search and Planning for Inference and Learning
in Computer Vision
Iasonas Kokkinos, Sinisa Todorovic and Matt (Tianfu) Wu
Markov Decision Processes & Reinforcement Learning
Sinisa Todorovic and Iasonas Kokkinos
June 7, 2015
Multi-Armed Bandit Problem
• A gambler faces K slot machines ("armed bandits")
• Each machine provides a random reward from an unknown distribution specific to that machine
• Problem: in which order to play the machines so as to maximize the sum of rewards over a sequence of lever pulls
[Figure: a single state s with actions a1, a2, …, aK and rewards R(s, a1), R(s, a2), …, R(s, aK)]
Robbins 1952
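For concreteness, a minimal epsilon-greedy sketch for this problem (not from the slides; the pull interface, the Bernoulli arm probabilities, and all parameter values are illustrative):

import random

def epsilon_greedy(pull, K, steps=1000, eps=0.1):
    # pull(k) samples a reward from machine k's unknown distribution (assumed interface)
    counts = [0] * K      # times each arm was played
    values = [0.0] * K    # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if random.random() < eps:
            k = random.randrange(K)                     # explore a random arm
        else:
            k = max(range(K), key=lambda i: values[i])  # exploit the best arm so far
        r = pull(k)
        counts[k] += 1
        values[k] += (r - values[k]) / counts[k]        # incremental mean update
        total += r
    return total, values

# Example with 3 made-up Bernoulli arms
probs = [0.2, 0.5, 0.7]
total, est = epsilon_greedy(lambda k: float(random.random() < probs[k]), K=3)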
Outline
• Stochastic Process
• Markov Property
• Markov Chain
• Markov Decision Process
• Reinforcement Learning
Discrete Stochastic Process
• A collection of indexed random variables with well-defined ordering
• Characterized by probabilities that the variables take given values, called states
Andrey Markov
Stochastic Process Example
• Classic: Random Walk
– Start at state X0 at time t0
– At time ti, move a step Zi where P(Zi = -1) = p and P(Zi = 1) = 1 - p
– At time ti, state Xi = X0 + Z1 +…+ Zi
http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
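A few lines of Python suffice to simulate this walk (parameter values are illustrative):

import random

def random_walk(x0=0, steps=100, p=0.5):
    # Xi = X0 + Z1 + ... + Zi with P(Zi = -1) = p and P(Zi = 1) = 1 - p
    x = x0
    path = [x]
    for _ in range(steps):
        z = -1 if random.random() < p else 1   # step Zi
        x += z                                 # new state Xi
        path.append(x)
    return path

print(random_walk(steps=10))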
Markov Property
• Also thought of as the “memoryless” property
• The probability that Xn+1 takes any given value depends only on Xn, not on the earlier history: Pr(Xn+1 = x | X1, …, Xn) = Pr(Xn+1 = x | Xn)
Markov Chain
• Discrete-time stochastic process with the Markov property
• Example: Google’s PageRank – the likelihood that a random surfer following links ends up on a given page
http://en.wikipedia.org/wiki/PageRank
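As a small illustration, a power-iteration sketch of PageRank’s underlying Markov chain (the 3-page link graph is made up; the damping factor 0.85 is the commonly cited default; assumes every page has at least one outgoing link):

def pagerank(links, d=0.85, iters=50):
    # links[i] lists the pages that page i links to
    n = len(links)
    rank = [1.0 / n] * n                           # start from the uniform distribution
    for _ in range(iters):
        new = [(1.0 - d) / n] * n                  # random-jump (teleportation) term
        for i, outs in enumerate(links):
            for j in outs:
                new[j] += d * rank[i] / len(outs)  # follow a random outgoing link
        rank = new
    return rank

# Made-up 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
print(pagerank([[1], [2], [0, 1]]))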
Markov Decision Process (MDP)
• Discrete-time stochastic control process
• Extension of Markov chains
• Differences:
– Addition of actions (choice)
– Addition of rewards (motivation)
• If the actions are fixed, an MDP reduces to a Markov chain
Description of MDPs
• Tuple (S, A, P(·,·), R(·))
– S -> state space
– A -> action space
– Pa(s, s’) = Pr(st+1 = s’ | st = s, at = a)
– R(s) = immediate reward at state s
• Goal: maximize a cumulative function of the rewards (the utility function)
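A minimal, hypothetical Python container mirroring this tuple (field names and the discount factor are illustrative, not from the slides):

from dataclasses import dataclass

@dataclass
class MDP:
    states: list        # S: state space
    actions: list       # A: action space
    P: dict             # P[(s, a)] = {s': Pr(st+1 = s' | st = s, at = a)}
    R: dict             # R[s] = immediate reward at state s
    gamma: float = 0.9  # discount factor used by the cumulative utility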
Example MDP
[Figure: an example MDP graph alternating between state nodes and action nodes]
Solution to an MDP = Policy π
• Given a state, selects the optimal action regardless of history
Value function
• Vπ(s): the expected cumulative reward when starting in state s and following policy π
Learning Policy
• Value Iteration
• Policy Iteration
• Modified Policy Iteration
• Prioritized Sweeping
Value Iteration
 k   Vk(PU)  Vk(PF)  Vk(RU)  Vk(RF)
 1    0       0      10      10
 2    0       4.5    14.5    19
 3    2.03    8.55   18.55   24.18
 4    4.76   11.79   19.26   29.23
 5    7.45   15.30   20.81   31.82
 6   10.23   17.67   22.72   33.68
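A generic value-iteration sketch that prints successive rows like the table above (a sketch only; mdp follows the hypothetical container from the MDP description slide, and the example’s actual transitions and rewards are not reproduced here):

def value_iteration(mdp, sweeps=6):
    # Bellman update: V_{k+1}(s) = R(s) + gamma * max_a sum_{s'} Pa(s, s') * V_k(s')
    V = {s: 0.0 for s in mdp.states}
    for k in range(sweeps):
        new_V = {}
        for s in mdp.states:
            # assumes every state has at least one action with transitions in P
            best = max(sum(p * V[s2] for s2, p in mdp.P[(s, a)].items())
                       for a in mdp.actions if (s, a) in mdp.P)
            new_V[s] = mdp.R[s] + mdp.gamma * best
        V = new_V
        print(k + 1, {s: round(v, 2) for s, v in V.items()})  # one row of the table
    return V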
Why So Interesting?
• Straightforward if the transition probabilities are known, but...
• If the transition probabilities are unknown, then this problem is reinforcement learning.
A Typical Agent
• In reinforcement learning (RL), an agent observes a state and takes an action.
• Afterwards, the agent receives a reward.
Mission: Optimize Reward
• Rewards are calculated in the environment
• Used to teach the agent how to reach a goal state
• Must signal what we ultimately want achieved, not necessarily subgoals
• May be discounted over time
• In general, seek to maximize the expected return
Monte Carlo Methods
• Instead of computing the value function from a known model of the environment,
• Compute, from sampled episodes, the action-value function:
• Qπ(s, a): Expected reward when starting in state s, taking action a, and thereafter following policy π
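A first-visit Monte-Carlo sketch for estimating Qπ from sampled episodes (the sample_episode interface, returning a list of (state, action, reward) triples from one rollout of π, is an assumption, as are the parameter values):

from collections import defaultdict

def mc_q_estimate(sample_episode, episodes=1000, gamma=0.9):
    # Average the return observed after the FIRST visit of each (s, a) pair
    Q = defaultdict(float)   # running mean of returns per (s, a)
    N = defaultdict(int)     # number of first visits per (s, a)
    for _ in range(episodes):
        traj = sample_episode()            # [(s, a, r), ...] from one rollout of pi
        G = 0.0
        first_return = {}
        for s, a, r in reversed(traj):     # accumulate the discounted return backwards
            G = r + gamma * G
            first_return[(s, a)] = G       # overwritten until the earliest visit remains
        for sa, g in first_return.items():
            N[sa] += 1
            Q[sa] += (g - Q[sa]) / N[sa]   # incremental mean update
    return Q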
Monte-Carlo Tree Search
• Builds a tree rooted at current state by repeated Monte-Carlo simulation of a “rollout policy”
• Key Idea: Use statistics of previous trajectories to expand the tree in most promising direction
• Needs no heuristic function, unlike A* and branch-and-bound methods
Kocsis & Szepesvari 2006; Browne et al. 2012
Monte-Carlo Tree Search
[Figure: the four MCTS phases – selection (select the best state so far), expansion (take an action and move to a new state), simulation, and backpropagation of the total reward of the simulation – repeated until the maximum tree depth is reached]
Monte-Carlo Tree Search
• During construction each tree node s stores:
– state-visitation count n(s)
– action counts n(s, a)
– action values Q(s, a)
• Repeat until time is up:
1. Select action a and simulate to the end to obtain a total reward
2. Update statistics of each node s on the trajectory:
• Increment n(s) and n(s, a) for the selected action a
• Update Q(s, a) by the total reward of the simulation
Monte-Carlo Tree Search
• Select the action a maximizing Q(s, a) + c · sqrt( ln n(s) / n(s, a) )
(the first term drives exploitation, the second exploration)
Theoretically, guaranteed to converge to the optimal solution if run long enough.
Practically, it often shows good anytime behavior.
Kocsis & Szepesvari 2006; Browne et al. 2012
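A compact UCT-style sketch combining the node statistics, selection rule, and backpropagation described above (the actions, step, and rollout interfaces and the exploration constant c are illustrative assumptions, not from the slides):

import math, random

class Node:
    def __init__(self, state):
        self.state = state
        self.n = 0          # state-visitation count n(s)
        self.n_a = {}       # action counts n(s, a)
        self.Q = {}         # action values Q(s, a)
        self.children = {}  # action -> child Node

def uct_action(node, c=1.4):
    # Selection rule: argmax_a Q(s, a) + c * sqrt(ln n(s) / n(s, a))
    return max(node.n_a, key=lambda a: node.Q[a] +
               c * math.sqrt(math.log(node.n) / node.n_a[a]))

def mcts(root_state, actions, step, rollout, iters=1000):
    root = Node(root_state)
    for _ in range(iters):
        node, path = root, []
        # 1. Selection: descend while the node is fully expanded
        while node.n_a and len(node.n_a) == len(actions(node.state)):
            a = uct_action(node)
            path.append((node, a))
            node = node.children[a]
        # 2. Expansion: try one untried action, if any
        untried = [a for a in actions(node.state) if a not in node.n_a]
        if untried:
            a = random.choice(untried)
            child = Node(step(node.state, a))
            node.children[a] = child
            node.n_a[a] = 0
            node.Q[a] = 0.0
            path.append((node, a))
            node = child
        # 3. Simulation: the rollout policy estimates the total reward
        reward = rollout(node.state)
        # 4. Backpropagation: update statistics along the trajectory
        for n_, a_ in path:
            n_.n += 1
            n_.n_a[a_] += 1
            n_.Q[a_] += (reward - n_.Q[a_]) / n_.n_a[a_]
        node.n += 1
    return max(root.n_a, key=lambda a: root.n_a[a])  # most-visited root action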
Acknowledgements
NSF IIS 1302700
DARPA MSEE FA 8650-11-1-7149