
Reinforcement Learning with Multiple, Qualitatively Different State Representations

Harm van Seijen (TNO / UvA), Bram Bakker (UvA), Leon Kester (TNO / UvA)


The Reinforcement Learning Problem

[Diagram: the Agent sends action a to the Environment; the Environment returns state s and reward r.]

Goal: maximize cumulative discounted reward

Question: What is the best way to represent the environment?
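To make the interaction loop concrete, here is a minimal tabular Q-learning sketch of this agent-environment cycle; the `env.reset()`/`env.step(a)` interface and the choice of Q-learning are illustrative assumptions, not details taken from the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, n_steps, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning on a small, discrete state space (illustrative sketch)."""
    Q = defaultdict(float)                     # Q[(state, action)] -> estimated return
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        s_next, r = env.step(a)                # environment returns next state s and reward r
        # move Q(s, a) towards the one-step bootstrapped target
        target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q
```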


[Figure slides: example plots with x and y axes running from 0 to 100.]


Explanation of our Approach.


Suppose 3 agents work in the same environment and have the same action space, but different state spaces:

agent 1: state space S1 = {s1_1, s1_2, s1_3, ..., s1_N1}, state space size = N1

agent 2: state space S2 = {s2_1, s2_2, s2_3, ..., s2_N2}, state space size = N2

agent 3: state space S3 = {s3_1, s3_2, s3_3, ..., s3_N3}, state space size = N3

(mutual) action space A = {a1, a2}, action space size = 2


Extension of the action space

External actions:
a_e1 : old a1
a_e2 : old a2

Switch actions:
a_s1 : 'switch to representation 1'
a_s2 : 'switch to representation 2'
a_s3 : 'switch to representation 3'

New action space:
a1 : a_e1 + a_s1
a2 : a_e1 + a_s2
a3 : a_e1 + a_s3
a4 : a_e2 + a_s1
a5 : a_e2 + a_s2
a6 : a_e2 + a_s3
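A minimal sketch of this construction: the new action set is the Cartesian product of the external actions and the switch actions (the names used below are illustrative).

```python
from itertools import product

external_actions = ["a_e1", "a_e2"]                             # the original actions a1 and a2
switch_actions = ["switch_to_1", "switch_to_2", "switch_to_3"]

# each new action is an (external action, switch action) pair: 2 x 3 = 6 actions
extended_actions = list(product(external_actions, switch_actions))

for i, (a_e, a_s) in enumerate(extended_actions, start=1):
    print(f"a{i}: {a_e} + {a_s}")
```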


Extension of the state space

agent 1: state space S1 = {s1_1, s1_2, s1_3, ..., s1_N1}, state space size = N1

agent 2: state space S2 = {s2_1, s2_2, s2_3, ..., s2_N2}, state space size = N2

agent 3: state space S3 = {s3_1, s3_2, s3_3, ..., s3_N3}, state space size = N3

switch agent: state space S = {s1_1, s1_2, ..., s1_N1, s2_1, s2_2, ..., s2_N2, s3_1, s3_2, ..., s3_N3}, state space size = N1 + N2 + N3
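Putting the two extensions together, the sketch below shows how a single tabular switch agent could be organised. It assumes Q-learning and a user-supplied `observe(rep, raw_obs)` function that maps the raw observation to a state of representation `rep`; tagging each state with its representation index makes the joint state space the disjoint union S1 + S2 + S3. This is a sketch of the idea under those assumptions, not the authors' exact implementation.

```python
import random
from collections import defaultdict
from itertools import product

external_actions = ["a_e1", "a_e2"]
switch_actions = [1, 2, 3]                  # index of the representation to use at the next step
actions = list(product(external_actions, switch_actions))   # the 6 extended actions

def switch_q_learning(env, observe, n_steps, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning for the switch agent (illustrative sketch)."""
    Q = defaultdict(float)                  # states live in the union S1 + S2 + S3
    rep = 1                                 # start with representation 1 (arbitrary choice)
    raw = env.reset()
    s = (rep, observe(rep, raw))            # state tagged with its representation index
    for _ in range(n_steps):
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        a_ext, rep = a                      # external part is executed; switch part picks the next rep
        raw, r = env.step(a_ext)
        s_next = (rep, observe(rep, raw))
        target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q
```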


Requirements and Advantages.


Requirements for Convergence

Theoretical Requirement

If the individual representations obey the Markov property, then convergence to the optimal solution is guaranteed.

Empirical Requirement

Each representation should contain information that is useful for deciding on which external action to take and information that is useful for deciding when to switch.


State-Action Space Sizes Example

Representation   States    Actions   State-Actions
Rep 1               100        2            200
Rep 2                50        2            100
Rep 3               100        2            200
Switch (OR)         250        6          1,500
Union (AND)     500,000        2      1,000,000
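The table entries follow directly from the construction; a quick check using the example sizes N1 = 100, N2 = 50, N3 = 100 and 2 external actions:

```python
from math import prod

sizes = [100, 50, 100]                 # N1, N2, N3: states per individual representation
n_external = 2                         # number of external actions
n_reps = len(sizes)

# switch representation (OR): union of state spaces, product action space
switch_states = sum(sizes)             # 250
switch_actions = n_external * n_reps   # 6
print(switch_states * switch_actions)  # 1,500 state-action pairs

# union representation (AND): product of state spaces, original action space
union_states = prod(sizes)             # 500,000
print(union_states * n_external)       # 1,000,000 state-action pairs
```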


Switching is advantageous if:

1. the state space is very large, AND

2. the state space is heterogeneous.


Results.


Traffic Scenario

Situation: a crossroad of 2 one-way roads.

Task: the traffic agent has to decide at each time step whether the vertical lane or the horizontal lane gets a green light. Changing lights involves an orange period of 5 time steps.

Reward: -1 times the total number of cars waiting in front of the traffic lights.
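As a small illustration of this reward signal (the function name and arguments are hypothetical, not code from the experiments):

```python
def traffic_reward(cars_waiting_vertical, cars_waiting_horizontal):
    """Reward = -1 * total number of cars waiting in front of the traffic lights."""
    return -(cars_waiting_vertical + cars_waiting_horizontal)
```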


Representation 1

[Figure: crossroad with a vertical lane (V-Lane) and a horizontal lane (H-Lane).]

Feature 1: traffic light status (4 values)

Features 2-3: occupancy of the first 2 squares of the vertical lane (2x2 values)

Features 4-5: occupancy of the first 2 squares of the horizontal lane (2x2 values)
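A sketch of how representation 1's features could be packed into a single discrete state index (the helper name and feature ordering are illustrative): 4 x 2 x 2 x 2 x 2 = 64 states, matching the Representations Compared table below. Representation 2 (4 x 3 x 2 = 24 states) can be encoded analogously.

```python
def rep1_state(light_status, v1, v2, h1, h2):
    """Encode representation 1 as a single index in [0, 64).

    light_status: traffic light status, an integer 0..3 (4 values)
    v1, v2: occupancy (0 or 1) of the first 2 squares of the vertical lane
    h1, h2: occupancy (0 or 1) of the first 2 squares of the horizontal lane
    """
    state = light_status
    for bit in (v1, v2, h1, h2):
        state = state * 2 + bit
    return state                        # 4 * 2 * 2 * 2 * 2 = 64 distinct states
```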


Representation 2

[Figure: crossroad with a vertical lane (V-Lane) and a horizontal lane (H-Lane).]

Feature 1: traffic light status (4 values)

Feature 2: first car configuration (3 values)

Feature 3: occupancy of the first square before the traffic light (2 values)


Representations Compared

Representation   States   Actions   State-Actions
Rep 1                64        2            128
Rep 2                24        2             48
Switch               88        4            352
Rep 1+              256        2            512


On-line performance for Traffic Scenario

[Plot: average reward per timestep (roughly -2 to -0.4) versus timesteps (0 to 5 x 10^5), with curves for representation 1, representation 2, the switch representation, and representation 1+.]


Demo.


Conclusions and Future Work.


Conclusions

• We introduced an extension to the standard RL problem by allowing the decision agent to dynamically switch between a number of qualitatively different representations.

• This approach offers advantages in RL problems with large, heterogeneous state spaces.

• Experiments with a (simulated) traffic control problem showed good results: the agent that was allowed to switch reached a higher final performance, while its convergence rate was similar to that of a representation with a comparable state-action space size.


Future Work

• Use larger state spaces (~ a few hundred states per representation) and more than 2 different representations.

• Explore the application domain of sensor management (for example, switching between radar settings).

• Combine the switching approach with function approximation.

• Examine in more detail the convergence properties of the switch representation.

• Use representations that describe realistic sensor output.

• Explore new methods for switching.


Thank you.


Switching Algorithm versus POMDP

POMDP:
• Update an estimate of a hidden variable and base decisions on a probability distribution over all possible values of this hidden variable.
• It is not possible to choose between different representations.

Switch algorithm:
• Hidden information is present but not taken into account; the price for this is a more stochastic action outcome.
• When hidden information is very important for the decision-making process, the agent can decide to switch to a different representation that does take that information into account.