Optimal Learning & Bayes-Adaptive MDPs: An Overview
Slides on M. Duff’s Thesis/Ch.1
SDM-RG, Mar-09
Slides prepared by Georgios Chalkiadakis


Page 1: Optimal Learning & Bayes-Adaptive MDPs

Slides prepared by Georgios Chalkiadakis

Optimal Learning & Bayes-Adaptive MDPs

An Overview

Slides on M. Duff’s Thesis/Ch.1
SDM-RG, Mar-09

Page 2: Optimal Learning & Bayes-Adaptive MDPs

Optimal Learning: Overview

Behaviour that maximizes expected total reward while interacting with an uncertain world.

Behave well while learning, learn while behaving well.

Page 3: Optimal Learning & Bayes-Adaptive MDPs

Optimal Learning: Overview

What does it mean to behave optimally under uncertainty?

Optimality is defined with respect to a distribution of environments.

Explore vs. Exploit given prior uncertainty regarding environments

What is the “value of information”?

Page 4: Optimal Learning & Bayes-Adaptive MDPs

Optimal Learning: Overview

Bayesian approach: Evolve uncertainty about unknown process parameters

The parameters describe prior distributions over the world model (transitions/rewards)

That is, about information states

Page 5: Optimal Learning & Bayes-Adaptive MDPs

Optimal Learning: Overview

The sequential problem is described by a “hyperstate”-MDP (“Bayes-Adaptive MDP”):

Instead of just physical states, we now have physical states + information states

Page 6: Optimal Learning & Bayes-Adaptive MDPs

Simple “stateless” example

Bernoulli process parameters θ1, θ2 describe the actual (but unknown) probabilities of success

Bayesian approach: uncertainty about the parameters is described by conjugate prior distributions:
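(The prior specification itself is an image on the slide and is not reproduced here; presumably it is the standard beta prior, θ_i ~ Beta(a_i, b_i) for i = 1, 2, with density proportional to θ_i^{a_i - 1} (1 - θ_i)^{b_i - 1} and mean a_i / (a_i + b_i).)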

Page 7: Optimal Learning & Bayes-Adaptive MDPs

Conjugate Priors

A prior is conjugate to a likelihood function if the posterior is in the same family as the prior

Prior in the family, posterior in the family

A simple update of hyperparameters is enough to get the posterior!
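A worked instance (not on the slide), for the Bernoulli/beta pair used here: with likelihood θ^s (1 - θ)^f after s successes and f failures, and prior θ ~ Beta(a, b),

p(θ | data) ∝ θ^{a-1} (1 - θ)^{b-1} · θ^s (1 - θ)^f = θ^{a+s-1} (1 - θ)^{b+f-1},

i.e. the posterior is Beta(a + s, b + f): updating the hyperparameters amounts to adding the observed counts.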

Page 8: Optimal Learning & Bayes-Adaptive MDPs
Page 9: Optimal Learning & Bayes-Adaptive MDPs

Information-state transition diagram

Page 10: Optimal Learning & Bayes-Adaptive MDPs

It simply becomes:
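(The figure is not reproduced here. A plausible reading, given the conjugacy above: the information-state transition simply becomes a hyperparameter increment. From hyperstate (a_1, b_1, a_2, b_2), pulling arm i yields a success with predictive probability a_i / (a_i + b_i), leading to a_i + 1, and a failure with probability b_i / (a_i + b_i), leading to b_i + 1.)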

Page 11: Optimal Learning & Bayes-Adaptive MDPs

Bellman optimality equation (with k steps to go)
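(The equation is shown as an image on the slide; a sketch of its likely form for the two-armed bandit, assuming reward 1 for a success and 0 for a failure, with V_0 ≡ 0:

V_k(a_1, b_1, a_2, b_2) = max_{i ∈ {1,2}} { (a_i / (a_i + b_i)) [1 + V_{k-1}(…, a_i + 1, b_i, …)] + (b_i / (a_i + b_i)) V_{k-1}(…, a_i, b_i + 1, …) },

possibly with a discount factor γ multiplying the V_{k-1} terms.)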

Page 12: Optimal Learning & Bayes-Adaptive MDPs

Enter physical states (MDPs)

2 physical states

Page 13: Optimal Learning & Bayes-Adaptive MDPs

Enter physical states (MDPs)

2 physical states / 2 actions

Four Bernoulli processes: action 1 at state 1, action 2 at state 1, action 1 at state 2, action 2 at state 2

(a_1^1, b_1^1): hyperparameters of the beta distribution capturing uncertainty about p^1_{11} (the probability of transitioning from state 1 to state 1 under action 1)

full hyperstate:

Note: we now have to be in a specific physical state to sample a related process
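(The hyperstate formula is an image on the slide. A plausible reconstruction, with one beta distribution per state-action pair: the hyperstate is the current physical state i together with all eight hyperparameters, e.g. (i; a_1^1, b_1^1, a_1^2, b_1^2, a_2^1, b_2^1, a_2^2, b_2^2). Observing a transition from state i under action a updates exactly one pair (a_i^a, b_i^a): increment a_i^a or b_i^a depending on which successor state was reached.)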

Page 14: Optimal Learning & Bayes-Adaptive MDPs

Enter physical states (MDPs)

Page 15: Optimal Learning & Bayes-Adaptive MDPs

Optimality equation
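(The equation is an image on the slide. A sketch of its likely form, writing the predictive transition probability from i to state 1 under action a as a_i^a / (a_i^a + b_i^a), and its complement for the transition to state 2:

V(i; hyperparameters) = max_a Σ_j Pr(j | i, a, hyperparameters) [ r_{ij}^a + γ V(j; hyperparameters with (a_i^a, b_i^a) updated for the observed i → j transition) ].)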

Page 16: Optimal Learning & Bayes-Adaptive MDPs

More than 2 physical states… What priors now?

Dirichlet priors: conjugate to multinomial sampling. Sampling is now multinomial: from a state s there are many possible successor states s’ (the update itself is sketched in the note below)

We will see examples in future readings…
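(Not on the slide: the Dirichlet update mirrors the beta/Bernoulli case. With prior (p_{s1}^a, …, p_{sN}^a) ~ Dirichlet(α_{s1}^a, …, α_{sN}^a), observing a transition s → s’ under action a simply increments α_{ss’}^a by one, and the predictive probability of s → s’ is α_{ss’}^a / Σ_j α_{sj}^a.)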

Page 17: Optimal Learning & Bayes-Adaptive MDPs

Certainty equivalence? Truncate the horizon:

Compute terminal values using the means of the current belief distributions

…and proceed with a receding-horizon approach: perform DP, take the first “optimal” action, shift the window forward, repeat

Or, even more simply, use a myopic certainty-equivalence (c-e) approach with a horizon of 1 (see the sketch below):

Use the means of the current priors to compute DP-“optimal” policies

Execute the “optimal” action, observe the transition

Update the distributions, repeat
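Not on the original slides: a minimal sketch of the myopic certainty-equivalence loop just described, assuming Dirichlet counts over transitions and known expected rewards. All names here are illustrative, not taken from Duff’s thesis.

import numpy as np

def myopic_ce_action(counts, R, state, gamma=0.95, vi_iters=200):
    # counts[s, a, s2]: Dirichlet hyperparameters over transitions
    # R[s, a]: (assumed known) expected reward for taking action a in state s
    # Certainty equivalence: treat the posterior-mean model as if it were the truth
    P = counts / counts.sum(axis=2, keepdims=True)
    V = np.zeros(counts.shape[0])
    for _ in range(vi_iters):            # value iteration on the mean model
        Q = R + gamma * (P @ V)          # Q[s, a] = R[s, a] + gamma * sum_s2 P[s, a, s2] * V[s2]
        V = Q.max(axis=1)
    return int(np.argmax(Q[state]))      # greedy ("optimal") action for the mean model

# Myopic c-e loop: act greedily, observe the transition, update the counts, repeat.
#   a = myopic_ce_action(counts, R, s)
#   s_next = environment_step(s, a)      # hypothetical environment interface
#   counts[s, a, s_next] += 1
#   s = s_next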

Page 18: Optimal Learning & Bayes-Adaptive MDPs

No, it’s not a good idea!...

Actions / state transitions might be starved forever,

…even if the initial prior is an accurate model of uncertainty!

Page 19: Optimal Learning & Bayes-Adaptive MDPs

Example

Page 20: Optimal Learning & Bayes-Adaptive MDPs

Example (cont.)

Page 21: Optimal Learning & Bayes-Adaptive MDPs

So, we have to be properly Bayesian

If the prior is an accurate model of uncertainty, “important” actions/states will not be starved

There exist Bayesian RL algorithms that do more than a decent job! (future readings)

However, if the prior provides a distorted picture of reality, then we have no convergence guarantees

…but “optimal learning” is still in place (assuming that other algorithms operate with the same prior knowledge)