HIERARCHICAL DECISION MAKING USING SPATIO-TEMPORAL ABSTRACTION AND REPRESENTATION LEARNING FOR STATE APPROXIMATION IN PLATFORM GAMES
by S. K. Ramnandan, Deep Learning Group, IIT Madras
How to emulate intelligent behaviour?
Spatial abstraction - by ignoring irrelevant sensory input
Group sets of primitive states in the MDP into abstract states
Temporal abstraction - by ignoring fine-grained details of actions
Extended actions directly take the agent from one abstract state to another
Identify useful skills
Motivation for spatial abstraction:
Find regions of state space that are well-connected - abstract states
Idea from conformation dynamics - metastability:
Particles stay in the same region of state space for long periods of time without external stimulus
Behaviour under random walks
Identified using a spectral clustering algorithm - PCCA+
Construct the Laplacian of the transition matrix corresponding to a random walk on the underlying MDP
The spectrum of the Laplacian encodes the connectivity properties of the underlying graph
In the transformed (eigenvector) basis, the states lie in a simplex whose vertices correspond to the abstract states
States are assigned to abstract states based on their degree of membership to the clusters after projection (a numpy sketch follows below)
Advantages:
Degree of membership of states to each abstract state
Connectivity information between abstract states
Automatic estimation of the number of abstract states
PCCA+
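A minimal numpy sketch of this procedure, assuming a row-stochastic transition matrix T is already available: embed states via the dominant spectrum of the Laplacian, locate the simplex vertices with a simplified inner-simplex search, and read off soft memberships. The function name, the greedy vertex search, and the clipping step are illustrative stand-ins for the full PCCA+ optimization, not the original implementation.

```python
import numpy as np

def pcca_memberships(T, k):
    """Soft memberships of n states in k abstract states (PCCA+-style sketch).
    T: (n, n) row-stochastic transition matrix of the random walk."""
    n = T.shape[0]
    L = np.eye(n) - T                        # Laplacian of the random walk
    w, V = np.linalg.eig(L)                  # non-symmetric: may be complex
    idx = np.argsort(w.real)[:k]             # k smallest eigenvalues
    Y = V[:, idx].real                       # spectral embedding, rows = states
    # Simplified inner-simplex search: embedded states lie (approximately)
    # inside a simplex; greedily pick the k states spanning it.
    verts = [int(np.argmax(np.linalg.norm(Y - Y.mean(axis=0), axis=1)))]
    W = Y - Y[verts[0]]                      # shift so first vertex is origin
    for _ in range(k - 1):
        norms = np.linalg.norm(W, axis=1)
        v = int(np.argmax(norms))            # farthest from span so far
        verts.append(v)
        u = W[v] / norms[v]
        W = W - np.outer(W @ u, u)           # deflate the chosen direction
    # PCCA+-style linear transform chi = Y A with A = inv(Y[verts]);
    # clip and renormalize so each state's memberships form a distribution.
    chi = Y @ np.linalg.pinv(Y[verts])
    chi = np.clip(chi, 0.0, None)
    return chi / chi.sum(axis=1, keepdims=True)
```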
Use the partition of the state space into abstract states, along with the membership function returned by PCCA+, to compose options for free (a sketch follows below)
Thus, the structural information obtained can define behavioural policies for the subtasks, independent of the task being solved
Hence these skills may work even for platform games where rewards are heavily delayed
TEMPORAL ABSTRACTION: OPTIONS
Option policy to go from abstract state 1 to 2 in 3-room domain
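As an illustration of "composing options for free", a hedged sketch: an option policy that, from the current state, greedily picks the action expected to most increase membership in the target abstract state. The one-step greedy rule and the per-action transition model P are assumptions made for illustration (the deck only discusses estimating the random-walk transition matrix), not necessarily the policy definition used in the original work.

```python
import numpy as np

def option_policy(s, target, P, chi):
    """Greedy one-step option policy toward the `target` abstract state.
    s:      current (primitive) state index
    P:      (n_actions, n, n) per-action transition probabilities (assumed)
    chi:    (n, k) PCCA+ membership matrix from the sketch above
    Returns the action maximizing expected membership in `target`."""
    expected_membership = P[:, s, :] @ chi[:, target]   # (n_actions,)
    return int(np.argmax(expected_membership))
```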
No access to a model of the MDP
Have to estimate the transition matrix from sampled trajectories
The policy followed while sampling cannot be random, since exploration of the MDP depends heavily on a near-optimal policy (a count-based estimator sketch follows below)
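A minimal count-based sketch of this estimation step, assuming trajectories are lists of state indices; the small smoothing constant is an assumption added to keep rows for unvisited states well-defined.

```python
import numpy as np

def estimate_transition_matrix(trajectories, n_states):
    """Maximum-likelihood estimate of the random-walk transition matrix
    from sampled trajectories (each a list of state indices)."""
    counts = np.zeros((n_states, n_states))
    for traj in trajectories:
        for s, s_next in zip(traj[:-1], traj[1:]):
            counts[s, s_next] += 1          # tally observed transitions
    counts += 1e-8                          # smooth rows of unvisited states
    return counts / counts.sum(axis=1, keepdims=True)
```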
ONLINE AGENT FOR PLATFORM GAMES
Pipeline: Trajectories → Featurization → Dimensionality Reduction → Clustering → Fitting a Markov State Model → PCCA+
Exponential state space - 25^352 possible states
22 x 16 tiled grid, each tile taking one of 25 possible values
Higher-level state representation than pixel space
FEATURIZATION
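The deck does not spell out the featurization scheme. As a purely hypothetical example of how a 22 x 16 grid of 25 tile codes could be reduced to the 240-dimensional vector reported two slides later, one could merge the raw tile codes into 15 coarse categories and histogram each of the 16 grid columns (16 x 15 = 240). The merge table below is a placeholder, not the actual mapping.

```python
import numpy as np

N_CATEGORIES = 15
TILE_TO_CATEGORY = np.arange(25) % N_CATEGORIES   # placeholder merge table

def featurize(grid):
    """grid: (22, 16) array of raw tile codes in [0, 25).
    Returns a 16 * 15 = 240-dimensional histogram feature vector."""
    cats = TILE_TO_CATEGORY[grid]                  # (22, 16) category codes
    feats = np.zeros((grid.shape[1], N_CATEGORIES))
    for col in range(grid.shape[1]):
        feats[col] = np.bincount(cats[:, col], minlength=N_CATEGORIES)
    return feats.ravel()
```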
12 possible primitive actions
Rewards for achieving ‘side’ goals, such as gathering coins and killing monsters
MARIO DOMAIN
After featurization, dimensionality of state vector = 240
For 10,000 trajectories, time taken to cluster & fit MSM:
Curse of dimensionality, local feature relevance problem
Reduced dimension representation learning:
Deep Q-Network
Autoencoder (Denoising)
Stacked denoising autoencoder
DIMENSIONALITY REDUCTION
Reduced dimension           1-D      3-D       240-D
Time to cluster & fit MSM   15 min   307 min   ?
DQN
RL presents challenges from a deep learning perspective:
No direct association between inputs and targets - RL algorithms must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed
Correlated data - In RL, encounter sequences of highly correlated data
Non-stationary training distribution - Problematic for deep learning methods that assume a fixed underlying distribution
A neural network trained with the TD error acts as a non-linear function approximator for action values
Experience replay mechanism - randomly samples previous transitions (s, a, r, s') from a replay pool (a sketch follows below)
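A minimal sketch of the replay mechanism: store transitions in a bounded pool and sample minibatches uniformly at random, which breaks the temporal correlation in the data stream. Capacity and batch size are illustrative choices.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store (s, a, r, s', done) transitions and draw
    random minibatches for training."""
    def __init__(self, capacity=100_000):
        self.pool = deque(maxlen=capacity)   # oldest transitions fall off

    def add(self, s, a, r, s_next, done):
        self.pool.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.pool, batch_size)
```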
DQN architecture: 84x84x4 input → 16 8x8 filters → 32 4x4 filters → fully connected hidden layer → fully connected output layer
• Deriving an approximate state representation
• Compress the last hidden layer to simulate the encoder in autoencoders
• Summarize the state by the values of the neurons in the last hidden layer
• For Mario, where the input is not in pixel space, the convolutional layers are replaced with fully connected layers (see the sketch below)
Note: contractive nature of the reduced dimensions as training epochs increase
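A hedged PyTorch sketch of this idea for the Mario case (fully connected layers, no convolutions). Layer sizes, names, and the use of ReLU are illustrative; the point is that the narrow last hidden layer is read off as the approximate state after the Q-network is trained.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Fully connected Q-network whose last hidden layer doubles as a
    compressed state summary (illustrative sizes)."""
    def __init__(self, in_dim=240, hidden=256, latent=3, n_actions=12):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent), nn.ReLU(),  # compressed "encoder" layer
        )
        self.head = nn.Linear(latent, n_actions)   # Q-values per action

    def forward(self, x):
        return self.head(self.trunk(x))

    def state_summary(self, x):
        with torch.no_grad():
            return self.trunk(x)                   # reduced-dimension state
```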
AUTOENCODER
Cross-entropy error for binary inputs
Can this loss function be used directly for ordinal data inputs?
AUTOENCODER (DENOISING)
• Is the representation learnt from the autoencoder useful enough?
• Further constraints need to be applied to attempt to separate useful information from noise
• This corruption will naturally translate to a non-zero reconstruction error
• Two implicit underlying ideas:
• A higher level representation should be rather stable and robust under corruptions of the input
• Performing the denoising task well requires extracting features that capture useful structure in the input distribution (a sketch follows below)
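A minimal PyTorch sketch of the denoising variant, using masking noise and the cross-entropy reconstruction error mentioned earlier. Sizes, the masking scheme, and the requirement that inputs be normalized to [0, 1] (for BCELoss) are assumptions; the deck reports experiments at 0% and 25% noise.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """One-layer denoising autoencoder: corrupt the input, reconstruct the
    clean vector (illustrative sizes)."""
    def __init__(self, in_dim=240, latent=3):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, latent), nn.Sigmoid())
        self.dec = nn.Sequential(nn.Linear(latent, in_dim), nn.Sigmoid())

    def forward(self, x, noise=0.25):
        mask = (torch.rand_like(x) > noise).float()  # zero ~25% of inputs
        return self.dec(self.enc(x * mask))

model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()              # cross-entropy reconstruction error

def train_step(x):                  # x: batch of state vectors in [0, 1]
    opt.zero_grad()
    loss = loss_fn(model(x), x)     # reconstruct the *clean* input
    loss.backward()
    opt.step()
    return loss.item()
```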
VISUALIZATION OF REDUCED DIMENSION
Figure: DQN representations in 1-D, 2-D and 3-D vs. autoencoder representations in 1-D, 2-D and 3-D
VISUALIZATION OF REDUCED DIMENSION
Figure: denoising autoencoder (25% noise) vs. plain autoencoder (0% noise) representations in 1-D, 2-D and 3-D
RECONSTRUCTION ERROR
Reduced dimension   Auto      dAuto (25% noise)
h-1                 200.559   177.456
h-2                 168.765   158.984
h-3                 158.751   151.514
h-5                 156.246   139.845
Figure: training cost curves for Auto and dAuto - the fall in training cost is smoother for the denoising autoencoder
END-TO-END TESTING RESULTS FOR STATE APPROXIMATION
• Average % increase in return per episode: 15.3%
• Average % decrease in time spent per episode: 4.39%
END-TO-END TESTING RESULTS FOR STATE APPROXIMATION
Observations:
Performance improves when the state is approximated using the denoising variant of the autoencoder, for the same latent representation size
Tradeoff when increasing the dimensionality of the approximated state:
Increase in end-to-end performance
Significant increase in time taken for clustering & fitting a Markov state model
MONTEZUMA’S REVENGE
• Much higher emphasis on representation learning than Mario
• DeepMind’s DQN reports its worst performance on this game - 0% relative to a human test player
• After training the DQN, the last fully connected hidden layer outputs a 256-dimensional real-valued feature vector
• It has been observed that the magnitudes of the output values themselves do not matter in an image recognition task
• Hence the values can be binarized to obtain a 256-bit binary feature vector representing a state (a sketch follows below)
• Perform further state approximation using d-Autoencoder
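A hedged sketch of the binarization step. The per-unit median threshold fitted on a reference set of activations is an assumption; the deck does not specify how the values are thresholded.

```python
import numpy as np

def binarize(features, reference):
    """features:  (256,) last-hidden-layer activations for one state
    reference: (N, 256) activations used to fit per-unit thresholds
    Returns a 256-bit binary code for the state."""
    thresholds = np.median(reference, axis=0)      # assumed threshold rule
    return (features > thresholds).astype(np.uint8)
```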