HIERARCHICAL DECISION MAKING USING SPATIO-TEMPORAL ABSTRACTION AND REPRESENTATION LEARNING FOR STATE APPROXIMATION IN PLATFORM GAMES
by S. K. Ramnandan, Deep Learning Group, IIT Madras
How to emulate intelligent behaviour?
Spatial abstraction - by ignoring irrelevant sensory input
Group sets of primitive states in the MDP into abstract states
Temporal abstraction - by ignoring fine-grained details of actions
Extended actions directly take the agent from one abstract state to another
Identify useful skills
Motivation for spatial abstraction:
Find regions of state space that are well-connected - abstract states
Idea from conformation dynamics - metastability:
Particles stay in the same region of state space for long periods of time without external stimulus
Behaviour under random walks
Identified using a spectral clustering algorithm - PCCA+
Construct the Laplacian of the transition matrix corresponding to a random walk on the underlying MDP
The spectrum of the Laplacian encodes the connectivity properties of the underlying graph
In the transformed (eigenvector) basis, the states lie in a simplex whose vertices correspond to the abstract states
States are assigned to abstract states based on their degree of membership to the clusters after projection (a numpy sketch follows below)
Advantages:
Degree of membership of states to each abstract state
Connectivity information between abstract states
Automatic estimation of the number of abstract states
PCCA+
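A minimal numpy sketch of this procedure, assuming a row-stochastic transition matrix T is already available: embed states via the dominant spectrum of the Laplacian, locate the simplex vertices with a simplified inner-simplex search, and read off soft memberships. The function name, the greedy vertex search, and the clipping step are illustrative stand-ins for the full PCCA+ optimization, not the original implementation.

```python
import numpy as np

def pcca_memberships(T, k):
    """Soft memberships of n states in k abstract states (PCCA+-style sketch).
    T: (n, n) row-stochastic transition matrix of the random walk."""
    n = T.shape[0]
    L = np.eye(n) - T                        # Laplacian of the random walk
    w, V = np.linalg.eig(L)                  # non-symmetric: may be complex
    idx = np.argsort(w.real)[:k]             # k smallest eigenvalues
    Y = V[:, idx].real                       # spectral embedding, rows = states
    # Simplified inner-simplex search: embedded states lie (approximately)
    # inside a simplex; greedily pick the k states spanning it.
    verts = [int(np.argmax(np.linalg.norm(Y - Y.mean(axis=0), axis=1)))]
    W = Y - Y[verts[0]]                      # shift so first vertex is origin
    for _ in range(k - 1):
        norms = np.linalg.norm(W, axis=1)
        v = int(np.argmax(norms))            # farthest from span so far
        verts.append(v)
        u = W[v] / norms[v]
        W = W - np.outer(W @ u, u)           # deflate the chosen direction
    # PCCA+-style linear transform chi = Y A with A = inv(Y[verts]);
    # clip and renormalize so each state's memberships form a distribution.
    chi = Y @ np.linalg.pinv(Y[verts])
    chi = np.clip(chi, 0.0, None)
    return chi / chi.sum(axis=1, keepdims=True)
```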
Use the partition of the state space into abstract states, along with the membership function returned by PCCA+, to compose options for free (a sketch follows below)
Thus, the structural information obtained can define behavioural policies for the subtasks, independent of the task being solved
Hence these skills may work even for platform games where rewards are heavily delayed
TEMPORAL ABSTRACTION: OPTIONS
Option policy to go from abstract state 1 to 2 in 3-room domain
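As an illustration of "composing options for free", a hedged sketch: an option policy that, from the current state, greedily picks the action expected to most increase membership in the target abstract state. The one-step greedy rule and the per-action transition model P are assumptions made for illustration (the deck only discusses estimating the random-walk transition matrix), not necessarily the policy definition used in the original work.

```python
import numpy as np

def option_policy(s, target, P, chi):
    """Greedy one-step option policy toward the `target` abstract state.
    s:      current (primitive) state index
    P:      (n_actions, n, n) per-action transition probabilities (assumed)
    chi:    (n, k) PCCA+ membership matrix from the sketch above
    Returns the action maximizing expected membership in `target`."""
    expected_membership = P[:, s, :] @ chi[:, target]   # (n_actions,)
    return int(np.argmax(expected_membership))
```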
No access to a model of the MDP
Have to estimate the transition matrix from sampled trajectories
The policy followed while sampling cannot be random, since exploration of the MDP depends heavily on a near-optimal policy (a count-based estimator sketch follows below)
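A minimal count-based sketch of this estimation step, assuming trajectories are lists of state indices; the small smoothing constant is an assumption added to keep rows for unvisited states well-defined.

```python
import numpy as np

def estimate_transition_matrix(trajectories, n_states):
    """Maximum-likelihood estimate of the random-walk transition matrix
    from sampled trajectories (each a list of state indices)."""
    counts = np.zeros((n_states, n_states))
    for traj in trajectories:
        for s, s_next in zip(traj[:-1], traj[1:]):
            counts[s, s_next] += 1          # tally observed transitions
    counts += 1e-8                          # smooth rows of unvisited states
    return counts / counts.sum(axis=1, keepdims=True)
```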
ONLINE AGENT FOR PLATFORM GAMES
Pipeline: Trajectories → Featurization → Dimensionality Reduction → Clustering → Fitting a Markov State Model → PCCA+
Exponential state space - 25^352 possible states
22 x 16 tiled grid, each tile taking one of 25 possible values
Higher-level state representation than pixel space
FEATURIZATION
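The deck does not spell out the featurization scheme. As a purely hypothetical example of how a 22 x 16 grid of 25 tile codes could be reduced to the 240-dimensional vector reported two slides later, one could merge the raw tile codes into 15 coarse categories and histogram each of the 16 grid columns (16 x 15 = 240). The merge table below is a placeholder, not the actual mapping.

```python
import numpy as np

N_CATEGORIES = 15
TILE_TO_CATEGORY = np.arange(25) % N_CATEGORIES   # placeholder merge table

def featurize(grid):
    """grid: (22, 16) array of raw tile codes in [0, 25).
    Returns a 16 * 15 = 240-dimensional histogram feature vector."""
    cats = TILE_TO_CATEGORY[grid]                  # (22, 16) category codes
    feats = np.zeros((grid.shape[1], N_CATEGORIES))
    for col in range(grid.shape[1]):
        feats[col] = np.bincount(cats[:, col], minlength=N_CATEGORIES)
    return feats.ravel()
```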
12 possible primitive actions
Rewards for achieving ‘side’ goals, such as gathering coins and killing monsters
MARIO DOMAIN
After featurization, dimensionality of state vector = 240
For 10,000 trajectories, time taken to cluster & fit MSM:
Curse of dimensionality, local feature relevance problem
Reduced dimension representation learning:
Deep Q-Network
Autoencoder (Denoising)
Stacked denoising autoencoder
DIMENSIONALITY REDUCTION
Reduced dimension           1-D      3-D       240-D
Time to cluster & fit MSM   15 min   307 min   ?
DQN
RL presents challenges from a deep learning perspective:
No direct association between inputs and targets - RL algorithms must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed
Correlated data - In RL, encounter sequences of highly correlated data
Non-stationary training distribution - Problematic for deep learning methods that assume a fixed underlying distribution
A neural network trained with the TD error acts as a non-linear function approximator for action values
Experience replay mechanism - randomly samples previous transitions (s, a, r, s') from a replay pool (a sketch follows below)
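A minimal sketch of the replay mechanism: store transitions in a bounded pool and sample minibatches uniformly at random, which breaks the temporal correlation in the data stream. Capacity and batch size are illustrative choices.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store (s, a, r, s', done) transitions and draw
    random minibatches for training."""
    def __init__(self, capacity=100_000):
        self.pool = deque(maxlen=capacity)   # oldest transitions fall off

    def add(self, s, a, r, s_next, done):
        self.pool.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.pool, batch_size)
```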
DQN architecture: 84x84x4 input → 16 8x8 filters → 32 4x4 filters → fully connected hidden layer → fully connected output layer
• Deriving an approximate state representation
• Compress the last hidden layer to simulate the encoder in autoencoders
• Summarize the state by the values of the neurons in the last hidden layer
• For Mario, where the input is not in pixel space, the convolutional layers are replaced with fully connected layers (see the sketch below)
Note: contractive nature of the reduced dimensions as training epochs increase
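A hedged PyTorch sketch of this idea for the Mario case (fully connected layers, no convolutions). Layer sizes, names, and the use of ReLU are illustrative; the point is that the narrow last hidden layer is read off as the approximate state after the Q-network is trained.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Fully connected Q-network whose last hidden layer doubles as a
    compressed state summary (illustrative sizes)."""
    def __init__(self, in_dim=240, hidden=256, latent=3, n_actions=12):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent), nn.ReLU(),  # compressed "encoder" layer
        )
        self.head = nn.Linear(latent, n_actions)   # Q-values per action

    def forward(self, x):
        return self.head(self.trunk(x))

    def state_summary(self, x):
        with torch.no_grad():
            return self.trunk(x)                   # reduced-dimension state
```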
AUTOENCODER
Cross-entropy error for binary inputs
Can this loss function be used directly for ordinal data inputs?
AUTOENCODER (DENOISING)
• Is the representation learnt from the autoencoder useful enough?
• Further constraints need to be applied to attempt to separate useful information from noise
• This corruption will naturally translate to a non-zero reconstruction error
• Two implicit underlying ideas:
• A higher level representation should be rather stable and robust under corruptions of the input
• Performing the denoising task well requires extracting features that capture useful structure in the input distribution (a sketch follows below)
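A minimal PyTorch sketch of the denoising variant, using masking noise and the cross-entropy reconstruction error mentioned earlier. Sizes, the masking scheme, and the requirement that inputs be normalized to [0, 1] (for BCELoss) are assumptions; the deck reports experiments at 0% and 25% noise.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """One-layer denoising autoencoder: corrupt the input, reconstruct the
    clean vector (illustrative sizes)."""
    def __init__(self, in_dim=240, latent=3):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, latent), nn.Sigmoid())
        self.dec = nn.Sequential(nn.Linear(latent, in_dim), nn.Sigmoid())

    def forward(self, x, noise=0.25):
        mask = (torch.rand_like(x) > noise).float()  # zero ~25% of inputs
        return self.dec(self.enc(x * mask))

model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()              # cross-entropy reconstruction error

def train_step(x):                  # x: batch of state vectors in [0, 1]
    opt.zero_grad()
    loss = loss_fn(model(x), x)     # reconstruct the *clean* input
    loss.backward()
    opt.step()
    return loss.item()
```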
VISUALIZATION OF REDUCED DIMENSION
Figure: DQN representations in 1-D, 2-D and 3-D vs. autoencoder representations in 1-D, 2-D and 3-D
VISUALIZATION OF REDUCED DIMENSION
Figure: denoising autoencoder (25% noise) vs. plain autoencoder (0% noise) representations in 1-D, 2-D and 3-D
RECONSTRUCTION ERROR
Reduced dimension   Auto      dAuto (25% noise)
h-1                 200.559   177.456
h-2                 168.765   158.984
h-3                 158.751   151.514
h-5                 156.246   139.845
Figure: training cost curves for Auto and dAuto - the fall in training cost is smoother for the denoising autoencoder
END-TO-END TESTING RESULTS FOR STATE APPROXIMATION
• Average % increase in return per episode: 15.3%
• Average % decrease in time spent per episode: 4.39%
END-TO-END TESTING RESULTS FOR STATE APPROXIMATION
Observations:
Performance improves when the state is approximated using the denoising variant of the autoencoder, for the same latent representation size
Tradeoff when increasing the dimensionality of the approximated state:
Increase in end-to-end performance
Significant increase in time taken for clustering & fitting a Markov state model
MONTEZUMA’S REVENGE
• Much higher emphasis on representation learning than Mario
• DeepMind’s DQN reports its worst performance on this game - 0% relative to a human test player
• After training the DQN, the last fully connected hidden layer outputs a 256-dimensional real-valued feature vector
• It has been observed that the magnitudes of the output values themselves do not matter in an image recognition task
• Hence the values can be binarized to obtain a 256-bit binary feature vector representing a state (a sketch follows below)
• Perform further state approximation using d-Autoencoder
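A hedged sketch of the binarization step. The per-unit median threshold fitted on a reference set of activations is an assumption; the deck does not specify how the values are thresholded.

```python
import numpy as np

def binarize(features, reference):
    """features:  (256,) last-hidden-layer activations for one state
    reference: (N, 256) activations used to fit per-unit thresholds
    Returns a 256-bit binary code for the state."""
    thresholds = np.median(reference, axis=0)      # assumed threshold rule
    return (features > thresholds).astype(np.uint8)
```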