
Reinforcement Learning with Multiple, Qualitatively Different State Representations

Harm van Seijen (TNO / UvA), Bram Bakker (UvA), Leon Kester (TNO / UvA)


The Reinforcement Learning Problem

[Diagram: the Agent sends action a to the Environment; the Environment returns state s and reward r.]

Goal: maximize cumulative discounted reward

Question: What is the best way to represent the environment?
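To make the interaction loop concrete, here is a minimal tabular Q-learning sketch of this agent-environment cycle; the `env.reset()`/`env.step(a)` interface and the choice of Q-learning are illustrative assumptions, not details taken from the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, n_steps, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning on a small, discrete state space (illustrative sketch)."""
    Q = defaultdict(float)                     # Q[(state, action)] -> estimated return
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        s_next, r = env.step(a)                # environment returns next state s and reward r
        # move Q(s, a) towards the one-step bootstrapped target
        target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q
```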


[Figure slides: example plots with x and y axes running from 0 to 100.]


Explanation of our Approach.


Suppose 3 agents work in the same environment and have the same action space, but different state spaces:

agent 1: state space S1 = {s1_1, s1_2, s1_3, ..., s1_N1}, state space size = N1

agent 2: state space S2 = {s2_1, s2_2, s2_3, ..., s2_N2}, state space size = N2

agent 3: state space S3 = {s3_1, s3_2, s3_3, ..., s3_N3}, state space size = N3

(mutual) action space A = {a1, a2}, action space size = 2


Extension of the action space

External actions:
a_e1 : old a1
a_e2 : old a2

Switch actions:
a_s1 : 'switch to representation 1'
a_s2 : 'switch to representation 2'
a_s3 : 'switch to representation 3'

New action space:
a1 : a_e1 + a_s1
a2 : a_e1 + a_s2
a3 : a_e1 + a_s3
a4 : a_e2 + a_s1
a5 : a_e2 + a_s2
a6 : a_e2 + a_s3
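A minimal sketch of this construction: the new action set is the Cartesian product of the external actions and the switch actions (the names used below are illustrative).

```python
from itertools import product

external_actions = ["a_e1", "a_e2"]                             # the original actions a1 and a2
switch_actions = ["switch_to_1", "switch_to_2", "switch_to_3"]

# each new action is an (external action, switch action) pair: 2 x 3 = 6 actions
extended_actions = list(product(external_actions, switch_actions))

for i, (a_e, a_s) in enumerate(extended_actions, start=1):
    print(f"a{i}: {a_e} + {a_s}")
```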


Extension of the state space

agent 1: state space S1 = {s1_1, s1_2, s1_3, ..., s1_N1}, state space size = N1

agent 2: state space S2 = {s2_1, s2_2, s2_3, ..., s2_N2}, state space size = N2

agent 3: state space S3 = {s3_1, s3_2, s3_3, ..., s3_N3}, state space size = N3

switch agent: state space S = {s1_1, s1_2, ..., s1_N1, s2_1, s2_2, ..., s2_N2, s3_1, s3_2, ..., s3_N3}, state space size = N1 + N2 + N3
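Putting the two extensions together, the sketch below shows how a single tabular switch agent could be organised. It assumes Q-learning and a user-supplied `observe(rep, raw_obs)` function that maps the raw observation to a state of representation `rep`; tagging each state with its representation index makes the joint state space the disjoint union S1 + S2 + S3. This is a sketch of the idea under those assumptions, not the authors' exact implementation.

```python
import random
from collections import defaultdict
from itertools import product

external_actions = ["a_e1", "a_e2"]
switch_actions = [1, 2, 3]                  # index of the representation to use at the next step
actions = list(product(external_actions, switch_actions))   # the 6 extended actions

def switch_q_learning(env, observe, n_steps, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning for the switch agent (illustrative sketch)."""
    Q = defaultdict(float)                  # states live in the union S1 + S2 + S3
    rep = 1                                 # start with representation 1 (arbitrary choice)
    raw = env.reset()
    s = (rep, observe(rep, raw))            # state tagged with its representation index
    for _ in range(n_steps):
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        a_ext, rep = a                      # external part is executed; switch part picks the next rep
        raw, r = env.step(a_ext)
        s_next = (rep, observe(rep, raw))
        target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q
```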


Requirements and Advantages.


Requirements for Convergence

Theoretical Requirement

If the individual representations obey the Markov property, then convergence to the optimal solution is guaranteed.

Empirical Requirement

Each representation should contain information that is useful for deciding on which external action to take and information that is useful for deciding when to switch.


State-Action Space Sizes Example

Representation   States    Actions   State-Actions
Rep 1               100        2            200
Rep 2                50        2            100
Rep 3               100        2            200
Switch (OR)         250        6          1,500
Union (AND)     500,000        2      1,000,000
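The table entries follow directly from the construction; a quick check using the example sizes N1 = 100, N2 = 50, N3 = 100 and 2 external actions:

```python
from math import prod

sizes = [100, 50, 100]                 # N1, N2, N3: states per individual representation
n_external = 2                         # number of external actions
n_reps = len(sizes)

# switch representation (OR): union of state spaces, product action space
switch_states = sum(sizes)             # 250
switch_actions = n_external * n_reps   # 6
print(switch_states * switch_actions)  # 1,500 state-action pairs

# union representation (AND): product of state spaces, original action space
union_states = prod(sizes)             # 500,000
print(union_states * n_external)       # 1,000,000 state-action pairs
```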


Switching is advantageous if:

1. the state space is very large, AND

2. the state space is heterogeneous.


Results.


Traffic Scenario

Situation: a crossroad of 2 one-way roads.

Task: the traffic agent has to decide at each time step whether the vertical lane or the horizontal lane gets a green light. Changing lights involves an orange period of 5 time steps.

Reward: -1 times the total number of cars waiting in front of the traffic lights.
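As a small illustration of this reward signal (the function name and arguments are hypothetical, not code from the experiments):

```python
def traffic_reward(cars_waiting_vertical, cars_waiting_horizontal):
    """Reward = -1 * total number of cars waiting in front of the traffic lights."""
    return -(cars_waiting_vertical + cars_waiting_horizontal)
```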


Representation 1

[Figure: crossroad with a vertical lane (V-Lane) and a horizontal lane (H-Lane).]

Feature 1: traffic light status (4 values)

Features 2-3: occupancy of the first 2 squares of the vertical lane (2x2 values)

Features 4-5: occupancy of the first 2 squares of the horizontal lane (2x2 values)
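A sketch of how representation 1's features could be packed into a single discrete state index (the helper name and feature ordering are illustrative): 4 x 2 x 2 x 2 x 2 = 64 states, matching the Representations Compared table below. Representation 2 (4 x 3 x 2 = 24 states) can be encoded analogously.

```python
def rep1_state(light_status, v1, v2, h1, h2):
    """Encode representation 1 as a single index in [0, 64).

    light_status: traffic light status, an integer 0..3 (4 values)
    v1, v2: occupancy (0 or 1) of the first 2 squares of the vertical lane
    h1, h2: occupancy (0 or 1) of the first 2 squares of the horizontal lane
    """
    state = light_status
    for bit in (v1, v2, h1, h2):
        state = state * 2 + bit
    return state                        # 4 * 2 * 2 * 2 * 2 = 64 distinct states
```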


Representation 2

[Figure: crossroad with a vertical lane (V-Lane) and a horizontal lane (H-Lane).]

Feature 1: traffic light status (4 values)

Feature 2: first car configuration (3 values)

Feature 3: occupancy of the first square before the traffic light (2 values)


Representations Compared

Representation   States   Actions   State-Actions
Rep 1                64        2            128
Rep 2                24        2             48
Switch               88        4            352
Rep 1+              256        2            512


On-line performance for Traffic Scenario

[Plot: average reward per timestep (roughly -2 to -0.4) versus timesteps (0 to 5 x 10^5), with curves for representation 1, representation 2, the switch representation, and representation 1+.]


Demo.


Conclusions and Future Work.


Conclusions

• We introduced an extension to the standard RL problem by allowing the decision agent to dynamically switch between a number of qualitatively different representations.

• This approach offers advantages in RL problems with large, heterogeneous state spaces.

• Experiments with a (simulated) traffic control problem showed good results: the agent that was allowed to switch reached a higher final performance, while its convergence rate was similar to that of a representation with a comparable state-action space size.


Future Work

• Use larger state spaces (~ a few hundred states per representation) and more than 2 different representations.

• Explore the application domain of sensor management (for example, switching between radar settings).

• Combine the switching approach with function approximation.

• Examine in more detail the convergence properties of the switch representation.

• Use representations that describe realistic sensor output.

• Explore new methods for switching.


Thank you.


Switching Algorithm versus POMDP

POMDP:
• Update an estimate of a hidden variable and base decisions on a probability distribution over all possible values of this hidden variable.
• It is not possible to choose between different representations.

Switch algorithm:
• Hidden information is present but not taken into account; the price for this is a more stochastic action outcome.
• When hidden information is very important for the decision-making process, the agent can decide to switch to a different representation that does take that information into account.