Page 2: Outline

Outline

• MDP (brief)
  – Background
  – Learning MDP
    • Q learning
• Game theory (brief)
  – Background
• Markov games (2-player)
  – Background
  – Learning Markov games
    • Littman’s Minimax Q learning (zero-sum)
    • Hu & Wellman’s Nash Q learning (general-sum)

Page 3

Stochastic games (SG): a multi-agent generalization of MDPs.

Partially observable SG (POSG): stochastic games in which each agent receives only partial observations of the state.
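
For the two-player case treated in the rest of the deck, a stochastic game can be written as a tuple (a standard formalization, not taken verbatim from the slide):

    (S, A, O, T, R1, R2)

where S is the state set, A and O are the action sets of the agent and the opponent, T(s, a, o, s’) is the state transition probability, and R1, R2 are the two players’ reward functions (with R2 = −R1 in the zero-sum case).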

Page 4

Value function decomposition (annotated in the slide figure): immediate reward, expectation over next states, value of next state.
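
The equation itself was not extracted; the three labels above are the terms of the Bellman optimality equation for an MDP, reconstructed here in standard notation (not verbatim from the slide):

    v*(s) = max_a [ R(s,a) + γ Σ_{s’} T(s,a,s’) v*(s’) ]

with R(s,a) the immediate reward, T(s,a,s’) the probability of reaching s’ from s under a, and γ the discount factor.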

Page 5: Outline

• Model-based reinforcement learning:
  1. Learn the reward function and the state transition function
  2. Solve for the optimal policy

• Model-free reinforcement learning:
  1. Directly learn the optimal policy without knowing the reward function or the state transition function

Page 6

Statistics kept for model learning (annotated in the slide figure): the number of times action a has been executed in state s, the number of times action a caused the transition s → s’, and the total reward accrued when applying a in s.
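
These counts give the usual maximum-likelihood estimates of the model; writing N(s,a) and N(s,a,s’) for the two counters (notation mine, not the slide’s):

    T̂(s,a,s’) = N(s,a,s’) / N(s,a)
    R̂(s,a)   = (total reward accrued when applying a in s) / N(s,a)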


Page 8

1. Start with arbitrary initial values of Q(s,a), for all s ∈ S, a ∈ A

2. At each time t the agent chooses an action and observes its reward r_t

3. The agent then updates its Q-values based on the Q-learning rule (see below)

4. The learning rate α_t needs to decay over time in order for the learning algorithm to converge
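
The Q-learning rule referred to in step 3 is the standard one-step update (standard form, not extracted from the slide):

    Q(s_t, a_t) ← Q(s_t, a_t) + α_t [ r_t + γ max_{a’} Q(s_{t+1}, a’) − Q(s_t, a_t) ]

As a minimal sketch in code, assuming a tabular representation of Q (function and variable names are illustrative, not from the slides):

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha, gamma=0.9):
        """One tabular Q-learning step; Q is a (num_states, num_actions) array."""
        td_target = r + gamma * np.max(Q[s_next])   # reward plus discounted value of the best next action
        Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s,a) toward the target at rate alpha
        return Q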

Page 10

Famous game theory example

Page 13

A co-operative game

Page 16

Mixed strategy: a probability distribution over a player’s actions.

Markov games: a generalization of MDP to multiple interacting agents.

Page 18

Stationary: the agent’s policy does not change over time

Deterministic: the same action is always chosen whenever the agent is in state s
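
In symbols (standard notation, not taken from the slide), a stationary policy is a fixed mapping

    π : S → PD(A)

from states to probability distributions over actions; it is deterministic when it can be written as π : S → A, i.e. each π(s) puts all of its probability on a single action.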

Page 19

Example (payoff matrices from the slide figure):

  [ 0  1 -1 ]
  [-1  0  1 ]
  [ 1 -1  0 ]

  [ 1 -1 ]
  [-1  1 ]

  [ 2  1  1 ]
  [ 1  2  1 ]
  [ 1  1  2 ]   (State 1)

  (State 2)
  [ 1  1 ]
  [ 1  1 ]

Page 20

A policy π* is optimal if v(s, π*) ≥ v(s, π) for all s ∈ S and all policies π.

Page 21

Maximize V

such that: rock + paper + scissors = 1

(here rock, paper and scissors denote the probabilities with which the agent plays each action)
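
Only the objective and the probability constraint survived extraction; the remaining constraints of the maximin linear program, one per opponent action, have the standard form (my reconstruction, not verbatim from the slide):

    maximize    V
    subject to  Σ_a p_a R(a, o) ≥ V    for every opponent action o
                Σ_a p_a = 1,  p_a ≥ 0

where p_a is the probability of playing action a (p_rock, p_paper, p_scissors above) and R(a, o) is the payoff for playing a against opponent action o. For rock-paper-scissors the solution is the uniform strategy (1/3, 1/3, 1/3) with value V = 0.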

Page 22

Minimax value of a state (parts annotated in the slide figure): best response, worst case, expectation over all actions.
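
The annotated equation is the minimax value used by Littman’s algorithm; a standard rendering (notation mine):

    v(s) = max_{π(s,·)} min_o Σ_a π(s,a) Q(s,a,o)

The outer max is the agent’s best response (a mixed strategy), the min is the opponent’s worst-case choice o, and the sum is the expectation over the agent’s own actions.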

Page 24

Quality of a state-action pair

Discounted value of all succeeding states weighted by their likelihood

Discounted value of all succeeding states

This learning rule converges to the correct values of Q and v
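
These annotations correspond to the minimax-Q quantities; in the usual notation for Littman’s zero-sum algorithm (a reconstruction, not extracted from the slide):

    Q(s,a,o) = R(s,a,o) + γ Σ_{s’} T(s,a,o,s’) v(s’)      (quality of a state-action pair)
    Q(s,a,o) ← (1 − α) Q(s,a,o) + α [ r + γ v(s’) ]        (learning rule, using the observed next state s’)

with v given by the minimax value on page 22.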

Page 25

explor controls how often the agent will deviate from its current policy.

Q(s,a,o): the expected reward for taking action a when the opponent chooses o from state s.
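
The best-response step of minimax-Q solves the linear program from page 21 at every update. A minimal sketch, assuming SciPy is available and a tabular Q (function and variable names are illustrative, not Littman’s original implementation):

    import numpy as np
    from scipy.optimize import linprog

    def minimax_policy(Q_s):
        """Solve max_pi min_o sum_a pi[a] * Q_s[a, o] as a linear program.

        Q_s is the (agent actions x opponent actions) slice of the Q-table
        for one state; returns the mixed strategy pi and the value v.
        """
        n_a, n_o = Q_s.shape
        # Decision variables: pi[0..n_a-1] followed by the game value V.
        c = np.zeros(n_a + 1)
        c[-1] = -1.0                                  # maximize V == minimize -V
        # One constraint per opponent action o:  V - sum_a pi[a]*Q_s[a, o] <= 0
        A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])
        b_ub = np.zeros(n_o)
        # Probabilities sum to one; V is left out of this equality row.
        A_eq = np.ones((1, n_a + 1))
        A_eq[0, -1] = 0.0
        b_eq = np.array([1.0])
        bounds = [(0.0, 1.0)] * n_a + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[:n_a], res.x[-1]

For the rock-paper-scissors payoff matrix on page 19 this returns the uniform strategy with value 0.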

Page 31

Hu and Wellman: general-sum Markov games as a framework for RL.

Theorem (Nash, 1951): there exists a mixed-strategy Nash equilibrium for any finite bimatrix game.
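
Hu & Wellman’s Nash Q-learning extends the minimax-Q update to general-sum games; a standard statement of the rule for agent i (a reconstruction in my notation, not extracted from the slides):

    Q_i(s, a1, a2) ← (1 − α) Q_i(s, a1, a2) + α [ r_i + γ NashQ_i(s’) ]

where NashQ_i(s’) is agent i’s payoff in a selected Nash equilibrium of the bimatrix stage game defined by Q_1(s’,·,·) and Q_2(s’,·,·); the existence of such an equilibrium is exactly what Nash’s theorem above guarantees.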
