
ECE 517: Reinforcement Learning in Artificial Intelligence Lecture 19: Case Studies


Page 1: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


ECE 517: Reinforcement Learning in Artificial Intelligence

Lecture 19: Case Studies

Dr. Itamar Arel

College of Engineering
Department of Electrical Engineering and Computer Science

The University of Tennessee
Fall 2010

November 10, 2010

Page 2: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Final Project Recap

Requirements:
Presentation

In-class 15-minute presentation + 5 minutes for questions

Presentation assignment slots have been posted on the website

Project report – due Friday, Dec 3rd

Comprehensive documentation of your work
Recall that the Final Project is 30% of the course grade!

Page 3: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Introduction

We’ll discuss several case studies of reinforcement learning

The intention is to illustrate some of the trade-offs and issues that arise in real applications

For example, we emphasize how domain knowledge is incorporated into the formulation and solution of the problem

We also highlight the representation issues that are so often critical to successful applications

Applications of reinforcement learning are still far from routine and typically require as much art as science

Making applications easier and more straightforward is one of the goals of current research in reinforcement learning

Page 4: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


TD-Gammon (Tesauro, 1992, 1994, 1995, …)

One of the most impressive applications of RL to date is Gerry Tesauro’s (IBM) backgammon program

TD-Gammon required little backgammon knowledge, yet learned to play extremely well, near the level of the world's strongest grandmasters

The learning algorithm was a straightforward combination of the TD(λ) algorithm and nonlinear function approximation

FA using an FFNN trained by backpropagating TD errors

There are probably more professional backgammon players than there are professional chess players

Backgammon is in part a game of chance, and it can be viewed as a large MDP

Page 5: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


TD-Gammon (cont.)

The game is played with 15 white and 15 black pieces on a board of 24 locations, called points

Here’s a typical position early in the game, seen from the perspective of the white player

Page 6: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


TD-Gammon (cont.)

White has just rolled a 5 and a 2, so it can move one of its pieces 5 steps and one (possibly the same piece) 2 steps

The objective is to advance all pieces to points 19–24, and then off the board

Hitting – removal of a single piece

30 pieces and 24 locations imply an enormous number of configurations (the state set is ~10^20)

Effective branching factor of ~400, since each of the ~21 distinct dice rolls can typically be played in ~20 ways

Page 7: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


TD-Gammon – details

Although the game is highly stochastic, a complete description of the game's state is available at all times

The estimated value of any state was meant to predict the probability of winning starting from that state

Reward: 0 at all times except those in which the game is won, when it is 1

Episodic (game = episode), undiscounted

Non-linear form of TD(λ) using a feedforward neural network
Weights initialized to small random numbers
Backpropagation of the TD error
Four input units for each point; unary encoding of the number of white pieces, plus other features
Use of afterstates

Learning during self-play – fully incremental
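To make the update concrete, here is a minimal sketch of nonlinear TD(λ) with a feedforward network, in the spirit of the setup above. The network sizes, learning rate, trace decay, and the board feature vector are illustrative assumptions, not Tesauro's actual implementation.

```python
import numpy as np

class TDGammonNet:
    """Minimal nonlinear TD(lambda) sketch: a one-hidden-layer network whose
    sigmoid output estimates the probability of winning from a board encoding."""

    def __init__(self, n_in=198, n_hidden=40, lam=0.7, alpha=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # weights initialized to small random numbers, as on the slide
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.W2 = rng.normal(0.0, 0.1, (1, n_hidden))
        self.lam, self.alpha = lam, alpha
        self.reset_traces()

    def reset_traces(self):
        # one eligibility trace per weight, cleared at the start of each game
        self.e1 = np.zeros_like(self.W1)
        self.e2 = np.zeros_like(self.W2)

    def value(self, x):
        """Estimated P(win) for feature vector x (e.g. unary piece counts per point)."""
        h = 1.0 / (1.0 + np.exp(-(self.W1 @ x)))
        v = 1.0 / (1.0 + np.exp(-(self.W2 @ h).item()))
        return v, h

    def td_step(self, x, reward, v_next):
        """One incremental TD(lambda) update, applied once the value of the next
        afterstate (or the terminal reward: 1 for a win, else 0) is known."""
        v, h = self.value(x)
        dv = v * (1.0 - v)                                   # output sigmoid derivative
        grad_W2 = dv * h[None, :]
        grad_W1 = (dv * self.W2[0] * h * (1.0 - h))[:, None] * x[None, :]
        # accumulate eligibility traces, then move all weights along the TD error
        self.e2 = self.lam * self.e2 + grad_W2
        self.e1 = self.lam * self.e1 + grad_W1
        delta = reward + v_next - v                          # undiscounted
        self.W2 += self.alpha * delta * self.e2
        self.W1 += self.alpha * delta * self.e1
```

During self-play, after each dice roll the program evaluates every legal afterstate with value() and picks the best one for the player to move; td_step() then nudges the previous estimate toward the new one, with reward = 1 only on the move that wins the game.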

Page 8: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


TD-Gammon – Neural Network Employed

Page 9: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Summary of TD-Gammon Results

Two players played against each other
Each had no prior knowledge of the game
Only the rules of the game were prescribed

Humans learn from machines: TD-Gammon learned to play certain opening positions differently than was the convention among the best human players

Page 10: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Rebuttal on TD-Gammon

For an alternative view, see “Why did TD-Gammon Work?”, Jordan Pollack and Alan Blair, NIPS 9 (1997)

Claim: it was the “co-evolutionary training strategy, playing games against itself, which led to the success”

Any such approach would work with backgammon

Success does not extend to other problems
e.g., Tetris and maze-type problems, where the exploration issue comes up

Page 11: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


The Acrobot

Robotic application of RL

Roughly analogous to a gymnast swinging on a high bar

The first joint (corresponding to the hands on the bar) cannot exert torque

The second joint (corresponding to the gymnast bending at the waist) can

This system has been widely studied by control engineers and machine learning researchers

Page 12: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


The Acrobot (cont.)

One objective for controlling the Acrobot is to swing the tip (the "feet") above the first joint by an amount equal to one of the links, in minimum time

In this task, the torque applied at the second joint is limited to three choices: positive torque of a fixed magnitude, negative torque of the same magnitude, or no torque

A reward of –1 is given on all time steps until the goal is reached, which ends the episode. No discounting is used

Thus, the optimal value of any state is minus the minimum time to reach the goal (an integer number of steps)

Sutton (1996) addressed the Acrobot swing-up task in an on-line, model-free context
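The learning curves on the next slide are for Sarsa(λ); Sutton's experiments used linear function approximation over tile-coded (CMAC) features. The sketch below is a minimal linear Sarsa(λ) with replacing traces, assuming a Gymnasium-style Acrobot environment and a user-supplied phi(state) that returns the active tile indices; the hyperparameters are placeholders, not the published settings.

```python
import numpy as np

def sarsa_lambda(env, phi, n_features, n_actions=3, episodes=500,
                 alpha=0.1, lam=0.9, gamma=1.0, epsilon=0.0):
    """Linear Sarsa(lambda) with binary features and replacing traces.

    phi(state) -> array of active feature indices (e.g. tile-coding tiles).
    With -1 reward per step and zero-initialized weights, the initial value
    estimates are optimistic, so a greedy policy (epsilon = 0) still explores.
    """
    w = np.zeros((n_actions, n_features))            # one weight vector per action

    def q(active, a):
        return w[a, active].sum()

    def act(active):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(active, a) for a in range(n_actions)]))

    steps_per_episode = []
    for _ in range(episodes):
        z = np.zeros_like(w)                          # eligibility traces
        state, _ = env.reset()
        active = phi(state)
        a = act(active)
        done, steps = False, 0
        while not done:
            next_state, r, terminated, truncated, _ = env.step(a)   # r = -1 each step
            done = terminated or truncated
            delta = r - q(active, a)
            z[a, active] = 1.0                        # replacing traces
            if not done:
                next_active = phi(next_state)
                a_next = act(next_active)
                delta += gamma * q(next_active, a_next)
                active, a = next_active, a_next
            w += alpha * delta * z                    # move weights along TD error
            z *= gamma * lam
            steps += 1
        steps_per_episode.append(steps)
    return w, steps_per_episode
```

As learning progresses, the number of steps needed to swing the tip up drops, which is what the Sarsa(λ) learning curves on the next slide show.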

Page 13: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Acrobot Learning Curves for Sarsa(λ)

Page 14: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Typical Acrobot Learned Behavior

Page 15: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


RL in Robotics

Robot motor capabilities were investigated using RL

Walking, grabbing and delivering (MIT Media Lab)
RoboCup competitions – soccer games

Sony AIBOs are commonly employed

Maze-type problems
Balancing themselves on unstable platforms
Multi-dimensional input streams

Hopefully some new applications soon

Page 16: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Introduction to Wireless Sensor Networks (WSN)

A sensor network is composed of a large number of sensor nodes, which are densely deployed either inside the phenomenon or very close to it

Random deployment
Cooperative capabilities

May be wireless or wired; however, most modern applications require wireless communications

May be mobile or static

Main challenge: maximize the life of the network under battery constraints!

Page 17: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Communication Topology of Sensor Networks

Page 18: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Fire detection and monitoring

Page 19: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Nodes we have here at the lab

UCB TelosB

Intel Mote

Page 20: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Energy Consumption in WSN

Sources of Energy Consumption:
Sensing
Computation
Communication (dominant)

Energy Wastes on Communications:
Collisions (packet retransmission increases energy consumption)
Idle listening (listening to the channel when the node is not intending to transmit)
Communication overhead (the communication cost of the MAC protocol)
Overhearing (receiving packets that are destined for other nodes)

Page 21: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


MAC-related problems in WSN

Goal: to schedule or coordinate the communications among multiple nodes sharing the same wireless radio frequency

(Figure: example topology with nodes 1–7, illustrating the two problems below)

Hidden Terminal Problem: node 5 and node 3 want to transmit data to node 1. Since node 3 is out of the communication range of node 5, if the transmissions occur simultaneously, node 1 will experience a collision.

Exposed Terminal Problem: node 1 sends data to node 3; since node 5 also overhears it, the transmission from node 6 to node 5 is constrained.

Page 22: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


S-MAC – Example of WSN MAC Protocol

S-MAC — by Ye, Heidemann and Estrin (2003)

Tradeoffs: energy vs. latency vs. fairness

Major components in S-MAC:
• Periodic listen and sleep
• Collision avoidance
• Overhearing avoidance
• Message passing

Page 23: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


RL-MAC (Z. Liu, I. Arel, 2005)

Formulate the MAC problem as an RL problem

Similar frame-based structure as in S-MAC/T-MAC

Each node infers the state of other nodes as part of its decision-making process

Active time and duty cycle are both a function of the traffic load; Q-learning was used

The main effort involved crafting the reward signal

n_b – number of packets queued
t_r – action (active time)
Ratio of successful rx vs. tx
Number of failed attempts
A term reflecting delay
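The annotated quantities above suggest how the per-frame decision could be cast as Q-learning. Below is a hypothetical skeleton, not the published RL-MAC formulation: the state is a coarse bucket of the queue length n_b, the action is the active time t_r, and the reward mixes successful exchanges with penalties for failed attempts and residual queueing (delay). The action set, weights, and quantization are illustrative assumptions.

```python
import random
from collections import defaultdict

# Illustrative Q-learning skeleton for a duty-cycle MAC decision, loosely in the
# spirit of RL-MAC; the state/reward shaping is a hypothetical stand-in, not the
# exact formulation of Liu & Arel (2005).

ACTIVE_TIMES = [1, 2, 4, 8]          # candidate active slots per frame (action t_r)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = defaultdict(float)               # Q[(queue_bucket, action_index)]

def bucket(n_queued):
    """Coarse state: quantize the number of queued packets n_b."""
    return min(n_queued, 10)

def choose_action(state):
    """Epsilon-greedy choice of active time for the coming frame."""
    if random.random() < EPSILON:
        return random.randrange(len(ACTIVE_TIMES))
    return max(range(len(ACTIVE_TIMES)), key=lambda a: Q[(state, a)])

def reward(tx_ok, rx_ok, failed, still_queued):
    # Reward successful exchanges; penalize failed attempts and queueing delay.
    return (tx_ok + rx_ok) - 0.5 * failed - 0.1 * still_queued

def frame_update(state, action, r, next_state):
    """One Q-learning update at the end of a frame."""
    best_next = max(Q[(next_state, a)] for a in range(len(ACTIVE_TIMES)))
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
```

A node would call choose_action at the start of each frame, keep its radio on for ACTIVE_TIMES[action] slots, tally the outcomes, and call frame_update before sleeping until the next frame.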

Page 24: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


RL-MAC Results

Page 25: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


RL-MAC Results (cont.)

Page 26: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Summary

RL is a powerful tool which can support a wide range of applications

There is an art to defining the observations, states, rewards and actions
Main goal: formulate an “as simple as possible” representation
Depends on the application
Can impact results significantly

Fits in high-resource and low-resource systems

Next class, we’ll talk about a particular class of RL techniques called Neuro-Dynamic Programming