
A Reinforcement Learning Method Based on Adaptive Simulated Annealing

Authored by: Amir F. Atiya

Department of Computer Engineering, Cairo University, Giza, Egypt

Alexander G. Parlos, Department of Mechanical Engineering

Texas A&M University, College Station, Texas

Lester Ingber, Lester Ingber Research

Ingber.com

September 13, 2003

Presented by Doug Moody, May 18, 2004


Glass-Blowing and its Impact on Reinforcement Learning

• Considering the whole piece while focusing on a particular section

• Slow cooling to relieve stress and gain consistency

• Use of “annealing”


Paper Approach

• Review the reinforcement learning problem, and introduce the use of function approximation to determine state values

• Briefly review the use of an adaptation of “annealing” algorithms to find functions that will determine a state’s value

• Use this approach on a straightforward decision-making problem


Function Approximation: Introduction

• Much of our emphasis in reinforcement learning has treated a value function as one entry for each state-action pair

• Finite Markov decision processes have a fixed number of states and actions

• This approach can, in some problems, introduce limitations when there are many states, insufficient samples across all states, or a continuous state space

• These limitations can be addressed by “generalization”

• Generalization also can be referred to as “function approximation”

• Function approximation has been widely studied in many fields (think regression analysis!)
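To make the contrast concrete, here is a toy sketch of a tabular value function versus a parameterized one (the names and numbers are illustrative, not from the slides):

```python
import numpy as np

# Tabular: one stored entry per state; unseen states get no value.
v_table = {"state_a": 0.3, "state_b": 0.7}

# Function approximation: a parameterized function of state features;
# any state, seen or not, gets a value through the shared parameters.
def v_approx(features, theta):
    return float(np.dot(theta, features))
```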


Function Approximation: Characteristics

• A “batch” or “supervised learning” approach versus the on-line approach we have encountered

• Requires a “static” training set from which to learn

• Cannot handle dynamically changing target functions, such as those produced by bootstrapping

• Hence, function approximation is not suitable for all types of reinforcement learning


Function Approximation: Goals


• The value function depends on a parameter vector, which could be, for example, the vector of connection weights in a neural network

• Typically, function approximation seeks to minimize:

$$\mathrm{MSE}(\vec{\theta}_t) = \sum_{s \in S} P(s)\,\big[V^{\pi}(s) - V_t(s)\big]^2$$

where MSE is the mean squared error, $P(s)$ is a distribution weighting the error at each state, and $\vec{\theta}_t$ is the vector of function parameters.
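A minimal sketch of this objective for a linear approximator (the function and variable names are assumptions for illustration):

```python
import numpy as np

def weighted_mse(theta, phi, v_true, p):
    """MSE(theta) = sum_s P(s) * [V_pi(s) - V_t(s)]^2.

    theta  : parameter vector being learned
    phi    : feature matrix, one row of basis-function values per state
    v_true : target values V_pi(s) for each state
    p      : weighting distribution P(s) over states
    """
    v_approx = phi @ theta          # V_t(s) = sum_k theta_k * phi_k(s)
    return float(np.sum(p * (v_true - v_approx) ** 2))
```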


Function Approximation: Methods

• Step-by-step approach: gradient descent, moving slowly toward the optimal “fit” (update rule shown after this list)

• Linear approach: a special case of gradient descent in which the value function is linear in the parameter (column) vector

• Coding Methods

– Coarse

– Tile

– Radial Basis Functions
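For reference, these methods build on the standard gradient-descent update (Sutton and Barto's form, with $v_t$ the training target and $\alpha$ the step size):

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\,\big[v_t - V_t(s_t)\big]\,\nabla_{\vec{\theta}_t} V_t(s_t)$$

In the linear case $V_t(s) = \vec{\theta}_t^{\,\top}\vec{\phi}_s$, the gradient is simply the feature vector $\vec{\phi}_{s_t}$, which is what makes the coding methods below attractive.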


COARSE CODING

Features should relate to the characteristics of the state. For instance, for a robot the location and remaining power may be used; for chess, the number of pieces, the available moves for a pawn or queen, etc.

Slide from Sutton and Barto textbook
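A minimal sketch of binary coarse coding with circular receptive fields (the function shape and names are assumptions):

```python
import numpy as np

def coarse_code(state, centers, radius):
    """Feature i is 1 if the state falls inside receptive field i
    (a circle of the given radius), else 0. Overlapping fields make
    nearby states share features, which is the source of generalization."""
    return (np.linalg.norm(centers - state, axis=1) <= radius).astype(float)
```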


LEARNING AND COARSE CODING

Slide from Sutton and Barto textbook


TILE CODING

• Binary feature for each tile

• Number of features present at any one time is constant

• Binary features mean the weighted sum is easy to compute

• Easy to compute indices of the features present

Slide from Sutton and Barto textbook
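A sketch of how tile-coding indices can be computed for a state scaled to [0, 1)^d, assuming one grid per tiling shifted by a per-tiling offset (details are illustrative; real implementations often add hashing):

```python
import numpy as np

def tile_indices(state, n_tilings, tiles_per_dim, offsets):
    """Return the one active tile index per tiling, so exactly
    n_tilings binary features are present for any state."""
    d = len(state)
    indices = []
    for t in range(n_tilings):
        coords = np.floor((state + offsets[t]) * tiles_per_dim).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        # Offset each tiling's block of feature indices, then flatten
        # the grid coordinates into a single index.
        indices.append(t * tiles_per_dim ** d
                       + int(np.ravel_multi_index(coords, (tiles_per_dim,) * d)))
    return indices
```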


RADIAL BASIS FUNCTIONS (GAUSSIAN)

$$\phi_s(i) = \exp\!\left(-\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2}\right)$$

The feature value reflects the degree to which the feature is present; the variance $\sigma_i^2$ shows the extent of the feature's influence in the state space.
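A direct transcription of this formula (the array shapes are assumptions):

```python
import numpy as np

def rbf_features(state, centers, sigmas):
    """phi_s(i) = exp(-||s - c_i||^2 / (2 * sigma_i^2)).
    Unlike binary tiles, each feature takes a value in (0, 1]
    reflecting the degree to which it is present."""
    dists_sq = np.sum((centers - state) ** 2, axis=1)
    return np.exp(-dists_sq / (2.0 * sigmas ** 2))
```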


Paper’s Description of the Reinforcement Learning Model

• Basic system

• Value definition

• Policy definition

• Optimal policy

• Maximal value

(Presented on the original slide through the paper's Eq. 4 and Eq. 5.)
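The slide's equations appear as images in the original; in conventional notation (a reconstruction, not necessarily the paper's exact Eq. 4 and Eq. 5), the definitions listed above are:

$$V^{\pi}(s) = E_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t} r_{t+1}\,\middle|\, s_0 = s\right], \qquad \pi^{*} = \arg\max_{\pi} V^{\pi}(s), \qquad V^{*}(s) = \max_{\pi} V^{\pi}(s)$$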


Value Function to Optimize

The value function is approximated as a weighted sum of basis functions:

$$V(s) = \sum_k w_k\,\phi_k(s)$$

where $\phi_k$ is a basis function and $w_k$ a weight parameter.

GOAL: find the optimal set of weights $w_k$ that will lead to the most accurate evaluation


Use Simulated Annealing to find the best set of weights w_k

• Annealing algorithms seek to search the entire state space while slowly “cooling” into appropriate local minima

• Algorithms trade off between fast convergence and continued sampling of the entire search space

• Typically used to optimize combinatorial problems

• Requirements (see the sketch after this list):

– Concise definition of the system

– Random generator of moves

– Objective function to be optimized

– Temperature schedule
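A minimal skeleton showing how the four requirements fit together (illustrative code, not the paper's implementation):

```python
import math
import random

def simulated_annealing(x0, neighbor, cost, schedule, n_steps):
    """x0: concise system definition (initial state)
    neighbor: random generator of moves
    cost: objective function to be optimized (minimized)
    schedule: temperature schedule, k -> T_k > 0"""
    x, fx = x0, cost(x0)
    best, fbest = x, fx
    for k in range(n_steps):
        temp = schedule(k)
        x_new = neighbor(x, temp)
        f_new = cost(x_new)
        # Accept improvements always; accept uphill moves with a
        # temperature-dependent probability to escape local minima.
        if f_new < fx or random.random() < math.exp(-(f_new - fx) / temp):
            x, fx = x_new, f_new
            if fx < fbest:
                best, fbest = x, fx
    return best, fbest
```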


Example of Simulated Annealing

• Problem: find the lowest valley in a mountainous region

• View the problem as having two directions: North-South and East-West

• Use a bouncing ball to explore the terrain at high temperature

• The ball can make high bounces, exploring many regions

• Each point in the terrain has a “cost function” to optimize

• As the temperature cools, the ball’s range and exploration decrease as it focuses on a smaller region of the terrain

• Two distributions are used: a generating distribution (for each parameter) and an acceptance distribution

• The acceptance distribution determines whether to stay in the valley or bounce out (see the rule below)

• Both distributions are affected by temperature
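The standard Boltzmann acceptance rule behind this analogy (the generating distribution varies by algorithm) is

$$P(\text{accept}) = \min\!\left(1,\; e^{-\Delta E / T}\right)$$

where $\Delta E$ is the increase in cost and $T$ the current temperature: at high $T$ the ball readily bounces out of a valley; as $T \to 0$, only downhill moves are kept.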


Glass-Blowing Example

• Larger changes are made to the glass piece at higher temperatures

• As the glass cools, the piece is still scanned (albeit more quickly) for stress points

• It cannot be “heated up” again while keeping previous results


Adaptive Simulated Annealing (ASA)

• Takes the same basic approach as “simulated annealing”

• Uses a specific generating distribution with a wider tail

• Does not rely on “quenching” to achieve quick convergence

• Has been available as a C-language software package

• Relies heavily upon a large set of tuning options:

– scaling of temperatures and probabilities

– limits on searching in regions with certain parameters

– linear vs. non-linear vectors

• Supports re-annealing: time (and hence temperature) is wound back after some results are achieved, to take advantage of discovered sensitivities

• Good for non-linear functions

More information and software available at www.ingber.com
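A sketch of ASA's wide-tailed generating distribution and per-parameter temperature schedule, following the formulas in Ingber's ASA papers (the variable names are mine):

```python
import math
import random

def asa_generate(x, lo, hi, temps):
    """Propose a new point: y_i = sgn(u - 1/2) * T_i * ((1 + 1/T_i)**|2u-1| - 1)
    samples y_i in [-1, 1] with heavy tails at high T_i, so occasional
    long jumps remain possible even late in the search."""
    x_new = []
    for xi, a, b, t in zip(x, lo, hi, temps):
        u = random.random()
        y = math.copysign(1.0, u - 0.5) * t * ((1.0 + 1.0 / t) ** abs(2.0 * u - 1.0) - 1.0)
        x_new.append(min(b, max(a, xi + y * (b - a))))
    return x_new

def asa_temperature(t0, c, k, n_dims):
    """Annealing schedule T_i(k) = T0_i * exp(-c_i * k**(1/D)),
    which is what lets ASA avoid quenching in D dimensions."""
    return t0 * math.exp(-c * k ** (1.0 / n_dims))
```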


Reinforcement learning with ASA Search
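The original slide presents this as a diagram. One plausible reading of how the pieces connect (hypothetical glue code; `act` and `step` stand in for the greedy policy and the environment, and are not from the paper):

```python
def asa_rl_objective(weights, start_states, gamma, act, step):
    """Score a candidate weight vector for the ASA search: run the
    policy induced by V(s; weights) from each start state and return
    the negative mean discounted reward (ASA minimizes)."""
    total = 0.0
    for state in start_states:
        done, discount = False, 1.0
        while not done:
            action = act(weights, state)        # greedy w.r.t. V(.; weights)
            state, reward, done = step(state, action)
            total += discount * reward
            discount *= gamma
    return -total / len(start_states)
```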


Sample Implementation

• Problem: Choose the highest number from a sequence of numbers

– Numbers are generated from an unknown source, with a normal distribution having a mean between 0 and 1 and a standard deviation between 0 and 0.5

– As time passes the reward is discounted

– Hence the tradeoff: more waiting provides more information, but a penalty is incurred

• Paper used 100 sources, with each generating 1000 numbers for a given sequence as the training set.
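A sketch of that training set as described (the seed and helper name are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_set(n_sources=100, seq_len=1000):
    """Each source draws one sequence from a normal distribution whose
    mean lies in [0, 1] and standard deviation in [0, 0.5]; the learner
    never sees these parameters directly."""
    sequences = []
    for _ in range(n_sources):
        mu = rng.uniform(0.0, 1.0)
        sigma = rng.uniform(0.0, 0.5)
        sequences.append(rng.normal(mu, sigma, size=seq_len))
    return sequences
```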


Solution Approach

• Define a state space as a combination of the following:

– time t

– the current mean at time t of observed numbers

– the current standard deviation

– the highest number chosen thus far

• Place 10 Gaussian basis functions throughout the State Space

• Use the algorithm to optimize a vector of weight parameters to the basis functions
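A sketch of the resulting state encoding and value function (basis-function placement is not specified in the deck, so `centers` and `sigmas` are assumptions):

```python
import numpy as np

def state_features(t, observed):
    """The slide's state: time t, current mean and standard deviation
    of the observed numbers, and the highest number chosen thus far."""
    x = np.asarray(observed)
    return np.array([t, x.mean(), x.std(), x.max()])

def value(state, centers, sigmas, weights):
    """V(s) = sum_k w_k * phi_k(s) with the 10 Gaussian basis functions."""
    phi = np.exp(-np.sum((centers - state) ** 2, axis=1) / (2.0 * sigmas ** 2))
    return float(weights @ phi)
```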


RESULTS

[Results table from the original slide: overall reward value achieved by ASA, compared against Q-Learning, with standard deviations]

• The improvement is substantial given that picking the first number in each set would yield 0.5


Paper Comments

• Pros

– Used existing reinforcement learning taxonomies to discuss the problem

– Selected a straightforward problem

• Cons

– Did not fully describe the basis function placement

– Insufficient parameter settings given for the Q-Learning comparison

– Did not show a non-linear example

– Could have provided more information on the ASA options used, to allow duplication of the results