Search and Planning for Inference and Learning
in Computer Vision
Iasonas Kokkinos, Sinisa Todorovic and Matt (Tianfu) Wu
Markov Decision Processes & Reinforcement Learning
Sinisa Todorovic and Iasonas Kokkinos
June 7, 2015
Multi-Armed Bandit Problem
• A gambler faces K slot machines ("armed bandits")
• Each machine provides a random reward from an unknown distribution specific to that machine
• Problem: in which order to play the machines so as to maximize the sum of rewards over a sequence of lever pulls
[Figure: a single state s with actions a1, a2, …, aK and rewards R(s, a1), R(s, a2), …, R(s, aK)]
Robbins 1952
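For concreteness, a minimal epsilon-greedy sketch for this problem (not from the slides; the pull interface, the Bernoulli arm probabilities, and all parameter values are illustrative):

import random

def epsilon_greedy(pull, K, steps=1000, eps=0.1):
    # pull(k) samples a reward from machine k's unknown distribution (assumed interface)
    counts = [0] * K      # times each arm was played
    values = [0.0] * K    # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if random.random() < eps:
            k = random.randrange(K)                     # explore a random arm
        else:
            k = max(range(K), key=lambda i: values[i])  # exploit the best arm so far
        r = pull(k)
        counts[k] += 1
        values[k] += (r - values[k]) / counts[k]        # incremental mean update
        total += r
    return total, values

# Example with 3 made-up Bernoulli arms
probs = [0.2, 0.5, 0.7]
total, est = epsilon_greedy(lambda k: float(random.random() < probs[k]), K=3)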
Outline
• Stochastic Process
• Markov Property
• Markov Chain
• Markov Decision Process
• Reinforcement Learning
Discrete Stochastic Process
• A collection of indexed random variables with well-defined ordering
• Characterized by probabilities that the variables take given values, called states
Andrey Markov
Stochastic Process Example
• Classic: Random Walk
– Start at state X0 at time t0
– At time ti, move a step Zi where P(Zi = -1) = p and P(Zi = 1) = 1 - p
– At time ti, state Xi = X0 + Z1 +…+ Zi
http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
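A few lines of Python suffice to simulate this walk (parameter values are illustrative):

import random

def random_walk(x0=0, steps=100, p=0.5):
    # Xi = X0 + Z1 + ... + Zi with P(Zi = -1) = p and P(Zi = 1) = 1 - p
    x = x0
    path = [x]
    for _ in range(steps):
        z = -1 if random.random() < p else 1   # step Zi
        x += z                                 # new state Xi
        path.append(x)
    return path

print(random_walk(steps=10))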
Markov Property
• Also thought of as the “memoryless” property
• The probability that Xn+1 takes any given value depends only on Xn, not on the earlier history: Pr(Xn+1 = x | X1, …, Xn) = Pr(Xn+1 = x | Xn)
Markov Chain
• Discrete-time stochastic process with the Markov property
• Example: Google’s PageRank – the likelihood that a random surfer following links ends up on a given page
http://en.wikipedia.org/wiki/PageRank
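As a small illustration, a power-iteration sketch of PageRank’s underlying Markov chain (the 3-page link graph is made up; the damping factor 0.85 is the commonly cited default; assumes every page has at least one outgoing link):

def pagerank(links, d=0.85, iters=50):
    # links[i] lists the pages that page i links to
    n = len(links)
    rank = [1.0 / n] * n                           # start from the uniform distribution
    for _ in range(iters):
        new = [(1.0 - d) / n] * n                  # random-jump (teleportation) term
        for i, outs in enumerate(links):
            for j in outs:
                new[j] += d * rank[i] / len(outs)  # follow a random outgoing link
        rank = new
    return rank

# Made-up 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
print(pagerank([[1], [2], [0, 1]]))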
Markov Decision Process (MDP)
• Discrete-time stochastic control process
• Extension of Markov chains
• Differences:
– Addition of actions (choice)
– Addition of rewards (motivation)
• If the actions are fixed, an MDP reduces to a Markov chain
Description of MDPs
• Tuple (S, A, P(·,·), R(·))
– S -> state space
– A -> action space
– Pa(s, s’) = Pr(st+1 = s’ | st = s, at = a)
– R(s) = immediate reward at state s
• Goal: maximize a cumulative function of the rewards (the utility function)
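A minimal, hypothetical Python container mirroring this tuple (field names and the discount factor are illustrative, not from the slides):

from dataclasses import dataclass

@dataclass
class MDP:
    states: list        # S: state space
    actions: list       # A: action space
    P: dict             # P[(s, a)] = {s': Pr(st+1 = s' | st = s, at = a)}
    R: dict             # R[s] = immediate reward at state s
    gamma: float = 0.9  # discount factor used by the cumulative utility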
Example MDP
[Figure: an example MDP graph alternating between state nodes and action nodes]
Solution to an MDP = Policy π
• Given a state, selects the optimal action regardless of history
Value function
• Vπ(s): the expected cumulative reward when starting in state s and following policy π
Learning Policy
• Value Iteration
• Policy Iteration
• Modified Policy Iteration
• Prioritized Sweeping
Value Iteration
 k   Vk(PU)  Vk(PF)  Vk(RU)  Vk(RF)
 1    0       0      10      10
 2    0       4.5    14.5    19
 3    2.03    8.55   18.55   24.18
 4    4.76   11.79   19.26   29.23
 5    7.45   15.30   20.81   31.82
 6   10.23   17.67   22.72   33.68
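A generic value-iteration sketch that prints successive rows like the table above (a sketch only; mdp follows the hypothetical container from the MDP description slide, and the example’s actual transitions and rewards are not reproduced here):

def value_iteration(mdp, sweeps=6):
    # Bellman update: V_{k+1}(s) = R(s) + gamma * max_a sum_{s'} Pa(s, s') * V_k(s')
    V = {s: 0.0 for s in mdp.states}
    for k in range(sweeps):
        new_V = {}
        for s in mdp.states:
            # assumes every state has at least one action with transitions in P
            best = max(sum(p * V[s2] for s2, p in mdp.P[(s, a)].items())
                       for a in mdp.actions if (s, a) in mdp.P)
            new_V[s] = mdp.R[s] + mdp.gamma * best
        V = new_V
        print(k + 1, {s: round(v, 2) for s, v in V.items()})  # one row of the table
    return V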
Why So Interesting?
• Straightforward if the transition probabilities are known, but...
• If the transition probabilities are unknown, then this problem is reinforcement learning.
A Typical Agent
• In reinforcement learning (RL), an agent observes a state and takes an action.
• Afterwards, the agent receives a reward.
Mission: Optimize Reward
• Rewards are calculated in the environment
• Used to teach the agent how to reach a goal state
• Must signal what we ultimately want achieved, not necessarily subgoals
• May be discounted over time
• In general, seek to maximize the expected return
Monte Carlo Methods
• Instead of computing the value function from a known model of the environment,
• Compute, from sampled episodes, the action-value function:
• Qπ(s, a): Expected reward when starting in state s, taking action a, and thereafter following policy π
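A first-visit Monte-Carlo sketch for estimating Qπ from sampled episodes (the sample_episode interface, returning a list of (state, action, reward) triples from one rollout of π, is an assumption, as are the parameter values):

from collections import defaultdict

def mc_q_estimate(sample_episode, episodes=1000, gamma=0.9):
    # Average the return observed after the FIRST visit of each (s, a) pair
    Q = defaultdict(float)   # running mean of returns per (s, a)
    N = defaultdict(int)     # number of first visits per (s, a)
    for _ in range(episodes):
        traj = sample_episode()            # [(s, a, r), ...] from one rollout of pi
        G = 0.0
        first_return = {}
        for s, a, r in reversed(traj):     # accumulate the discounted return backwards
            G = r + gamma * G
            first_return[(s, a)] = G       # overwritten until the earliest visit remains
        for sa, g in first_return.items():
            N[sa] += 1
            Q[sa] += (g - Q[sa]) / N[sa]   # incremental mean update
    return Q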
Monte-Carlo Tree Search
• Builds a tree rooted at current state by repeated Monte-Carlo simulation of a “rollout policy”
• Key Idea: Use statistics of previous trajectories to expand the tree in most promising direction
• Needs no heuristic function, unlike A* and branch-and-bound methods
Kocsis & Szepesvari 2006; Browne et al. 2012
Monte-Carlo Tree Search
[Figure: the four MCTS phases – selection (select the best state so far), expansion (take an action and move to a new state), simulation, and backpropagation of the total reward of the simulation – repeated until the maximum tree depth is reached]
Monte-Carlo Tree Search
• During construction each tree node s stores:
– state-visitation count n(s)
– action counts n(s, a)
– action values Q(s, a)
• Repeat until time is up:
1. Select action a and simulate to the end to obtain a total reward
2. Update statistics of each node s on the trajectory:
• Increment n(s) and n(s, a) for the selected action a
• Update Q(s, a) by the total reward of the simulation
Monte-Carlo Tree Search
• Select the action a maximizing Q(s, a) + c · sqrt( ln n(s) / n(s, a) )
(the first term drives exploitation, the second exploration)
Theoretically, guaranteed to converge to the optimal solution if run long enough.
Practically, it often shows good anytime behavior.
Kocsis & Szepesvari 2006; Browne et al. 2012
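A compact UCT-style sketch combining the node statistics, selection rule, and backpropagation described above (the actions, step, and rollout interfaces and the exploration constant c are illustrative assumptions, not from the slides):

import math, random

class Node:
    def __init__(self, state):
        self.state = state
        self.n = 0          # state-visitation count n(s)
        self.n_a = {}       # action counts n(s, a)
        self.Q = {}         # action values Q(s, a)
        self.children = {}  # action -> child Node

def uct_action(node, c=1.4):
    # Selection rule: argmax_a Q(s, a) + c * sqrt(ln n(s) / n(s, a))
    return max(node.n_a, key=lambda a: node.Q[a] +
               c * math.sqrt(math.log(node.n) / node.n_a[a]))

def mcts(root_state, actions, step, rollout, iters=1000):
    root = Node(root_state)
    for _ in range(iters):
        node, path = root, []
        # 1. Selection: descend while the node is fully expanded
        while node.n_a and len(node.n_a) == len(actions(node.state)):
            a = uct_action(node)
            path.append((node, a))
            node = node.children[a]
        # 2. Expansion: try one untried action, if any
        untried = [a for a in actions(node.state) if a not in node.n_a]
        if untried:
            a = random.choice(untried)
            child = Node(step(node.state, a))
            node.children[a] = child
            node.n_a[a] = 0
            node.Q[a] = 0.0
            path.append((node, a))
            node = child
        # 3. Simulation: the rollout policy estimates the total reward
        reward = rollout(node.state)
        # 4. Backpropagation: update statistics along the trajectory
        for n_, a_ in path:
            n_.n += 1
            n_.n_a[a_] += 1
            n_.Q[a_] += (reward - n_.Q[a_]) / n_.n_a[a_]
        node.n += 1
    return max(root.n_a, key=lambda a: root.n_a[a])  # most-visited root action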
Acknowledgements
NSF IIS 1302700
DARPA MSEE FA 8650-11-1-7149