
ECE 517: Reinforcement Learning in Artificial Intelligence Lecture 19: Case Studies


Page 1: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


ECE 517: Reinforcement Learning in Artificial Intelligence

Lecture 19: Case Studies

Dr. Itamar Arel

College of Engineering
Department of Electrical Engineering and Computer Science

The University of Tennessee
Fall 2010

November 10, 2010

Page 2: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Final Project Recap

Requirements:
Presentation

In-class 15-minute presentation + 5 minutes for questions

Presentation assignment slots have been posted on the website

Project report – due Friday, Dec 3rd

Comprehensive documentation of your work
Recall that the Final Project is 30% of the course grade!

Page 3: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Introduction

We’ll discuss several case studies of reinforcement learning

The intention is to illustrate some of the trade-offs and issues that arise in real applications

For example, we emphasize how domain knowledge is incorporated into the formulation and solution of the problem

We also highlight the representation issues that are so often critical to successful applications

Applications of reinforcement learning are still far from routine and typically require as much art as science

Making applications easier and more straightforward is one of the goals of current research in reinforcement learning

Page 4: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


TD-Gammon (Tesauro, 1992, 1994, 1995, …)

One of the most impressive applications of RL to date is Gerry Tesauro’s (IBM) backgammon program

TD-Gammon required little backgammon knowledge, yet learned to play extremely well, near the level of the world's strongest grandmasters

The learning algorithm was a straightforward combination of the TD(λ) algorithm and nonlinear function approximation

FA using an FFNN trained by backpropagating TD errors

There are probably more professional backgammon players than there are professional chess players

Backgammon is in part a game of chance, and it can be viewed as a large MDP

Page 5: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


TD-Gammon (cont.)

The game is played with 15 white and 15 black pieces on a board of 24 locations, called points

Here’s a typical position early in the game, seen from the perspective of the white player

Page 6: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


TD-Gammon (cont.)

White has just rolled a 5 and a 2, so it can move one of its pieces 5 steps and one (possibly the same piece) 2 steps

The objective is to advance all pieces to points 19–24, and then off the board

Hitting – removal of a single piece

30 pieces and 24 locations imply an enormous number of configurations (the state set is ~10^20)

Effective branching factor of ~400, since each of the ~21 distinct dice rolls can typically be played in ~20 ways

Page 7: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


TD-Gammon – details

Although the game is highly stochastic, a complete description of the game's state is available at all times

The estimated value of any state was meant to predict the probability of winning starting from that state

Reward: 0 at all times except those in which the game is won, when it is 1

Episodic (game = episode), undiscounted

Non-linear form of TD(λ) using a feedforward neural network
Weights initialized to small random numbers
Backpropagation of the TD error
Four input units for each point; unary encoding of the number of white pieces, plus other features
Use of afterstates

Learning during self-play – fully incremental
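To make the update concrete, here is a minimal sketch of nonlinear TD(λ) with a feedforward network, in the spirit of the setup above. The network sizes, learning rate, trace decay, and the board feature vector are illustrative assumptions, not Tesauro's actual implementation.

```python
import numpy as np

class TDGammonNet:
    """Minimal nonlinear TD(lambda) sketch: a one-hidden-layer network whose
    sigmoid output estimates the probability of winning from a board encoding."""

    def __init__(self, n_in=198, n_hidden=40, lam=0.7, alpha=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # weights initialized to small random numbers, as on the slide
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.W2 = rng.normal(0.0, 0.1, (1, n_hidden))
        self.lam, self.alpha = lam, alpha
        self.reset_traces()

    def reset_traces(self):
        # one eligibility trace per weight, cleared at the start of each game
        self.e1 = np.zeros_like(self.W1)
        self.e2 = np.zeros_like(self.W2)

    def value(self, x):
        """Estimated P(win) for feature vector x (e.g. unary piece counts per point)."""
        h = 1.0 / (1.0 + np.exp(-(self.W1 @ x)))
        v = 1.0 / (1.0 + np.exp(-(self.W2 @ h).item()))
        return v, h

    def td_step(self, x, reward, v_next):
        """One incremental TD(lambda) update, applied once the value of the next
        afterstate (or the terminal reward: 1 for a win, else 0) is known."""
        v, h = self.value(x)
        dv = v * (1.0 - v)                                   # output sigmoid derivative
        grad_W2 = dv * h[None, :]
        grad_W1 = (dv * self.W2[0] * h * (1.0 - h))[:, None] * x[None, :]
        # accumulate eligibility traces, then move all weights along the TD error
        self.e2 = self.lam * self.e2 + grad_W2
        self.e1 = self.lam * self.e1 + grad_W1
        delta = reward + v_next - v                          # undiscounted
        self.W2 += self.alpha * delta * self.e2
        self.W1 += self.alpha * delta * self.e1
```

During self-play, after each dice roll the program evaluates every legal afterstate with value() and picks the best one for the player to move; td_step() then nudges the previous estimate toward the new one, with reward = 1 only on the move that wins the game.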

Page 8: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


TD-Gammon – Neural Network Employed

Page 9: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Summary of TD-Gammon Results

Two players played against each other
Each had no prior knowledge of the game
Only the rules of the game were prescribed

Humans learn from machines: TD-Gammon learned to play certain opening positions differently than was the convention among the best human players

Page 10: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Rebuttal on TD-Gammon

For an alternative view, see “Why did TD-Gammon Work?”, Jordan Pollack and Alan Blair, NIPS 9 (1997)

Claim: it was the “co-evolutionary training strategy, playing games against itself, which led to the success”

Any such approach would work with backgammon

Success does not extend to other problems
e.g., Tetris and maze-type problems, where the exploration issue comes up

Page 11: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


The Acrobot

Robotic application of RL

Roughly analogous to a gymnast swinging on a high bar

The first joint (corresponding to the hands on the bar) cannot exert torque

The second joint (corresponding to the gymnast bending at the waist) can

This system has been widely studied by control engineers and machine learning researchers

Page 12: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


The Acrobot (cont.)

One objective for controlling the Acrobot is to swing the tip (the "feet") above the first joint by an amount equal to one of the links, in minimum time

In this task, the torque applied at the second joint is limited to three choices: positive torque of a fixed magnitude, negative torque of the same magnitude, or no torque

A reward of –1 is given on all time steps until the goal is reached, which ends the episode. No discounting is used

Thus, the optimal value of any state is minus the minimum time to reach the goal (an integer number of steps)

Sutton (1996) addressed the Acrobot swing-up task in an on-line, model-free context
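The learning curves on the next slide are for Sarsa(λ); Sutton's experiments used linear function approximation over tile-coded (CMAC) features. The sketch below is a minimal linear Sarsa(λ) with replacing traces, assuming a Gymnasium-style Acrobot environment and a user-supplied phi(state) that returns the active tile indices; the hyperparameters are placeholders, not the published settings.

```python
import numpy as np

def sarsa_lambda(env, phi, n_features, n_actions=3, episodes=500,
                 alpha=0.1, lam=0.9, gamma=1.0, epsilon=0.0):
    """Linear Sarsa(lambda) with binary features and replacing traces.

    phi(state) -> array of active feature indices (e.g. tile-coding tiles).
    With -1 reward per step and zero-initialized weights, the initial value
    estimates are optimistic, so a greedy policy (epsilon = 0) still explores.
    """
    w = np.zeros((n_actions, n_features))            # one weight vector per action

    def q(active, a):
        return w[a, active].sum()

    def act(active):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(active, a) for a in range(n_actions)]))

    steps_per_episode = []
    for _ in range(episodes):
        z = np.zeros_like(w)                          # eligibility traces
        state, _ = env.reset()
        active = phi(state)
        a = act(active)
        done, steps = False, 0
        while not done:
            next_state, r, terminated, truncated, _ = env.step(a)   # r = -1 each step
            done = terminated or truncated
            delta = r - q(active, a)
            z[a, active] = 1.0                        # replacing traces
            if not done:
                next_active = phi(next_state)
                a_next = act(next_active)
                delta += gamma * q(next_active, a_next)
                active, a = next_active, a_next
            w += alpha * delta * z                    # move weights along TD error
            z *= gamma * lam
            steps += 1
        steps_per_episode.append(steps)
    return w, steps_per_episode
```

As learning progresses, the number of steps needed to swing the tip up drops, which is what the Sarsa(λ) learning curves on the next slide show.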

Page 13: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Acrobot Learning Curves for Sarsa(λ)

Page 14: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Typical Acrobot Learned Behavior

Page 15: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


RL in Robotics

Robot motor capabilities were investigated using RL

Walking, grabbing and delivering (MIT Media Lab)
RoboCup competitions – soccer games

Sony AIBOs are commonly employed

Maze-type problems
Balancing themselves on unstable platforms
Multi-dimensional input streams

Hopefully some new applications soon

Page 16: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Introduction to Wireless Sensor Networks (WSN)

A sensor network is composed of a large number of sensor nodes, which are densely deployed either inside the phenomenon or very close to it

Random deployment
Cooperative capabilities

May be wireless or wired; however, most modern applications require wireless communications

May be mobile or static

Main challenge: maximize the life of the network under battery constraints!

Page 17: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Communication Topology of Sensor Networks

Page 18: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Fire detection and monitoring

Page 19: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Nodes we have here at the lab

UCB TelosB

Intel Mote

Page 20: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Energy Consumption in WSN

Sources of Energy Consumption:
Sensing
Computation
Communication (dominant)

Energy Wastes on Communications:
Collisions (packet retransmission increases energy consumption)
Idle listening (listening to the channel when the node is not intending to transmit)
Communication overhead (the communication cost of the MAC protocol)
Overhearing (receiving packets that are destined for other nodes)

Page 21: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


MAC-related problems in WSN

Goal: to schedule or coordinate the communications among multiple nodes sharing the same wireless radio frequency

(Figure: example topology with nodes 1–7, illustrating the two problems below)

Hidden Terminal Problem: node 5 and node 3 want to transmit data to node 1. Since node 3 is out of the communication range of node 5, if the transmissions occur simultaneously, node 1 will experience a collision.

Exposed Terminal Problem: node 1 sends data to node 3; since node 5 also overhears it, the transmission from node 6 to node 5 is constrained.

Page 22: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


S-MAC – Example of WSN MAC Protocol

S-MAC — by Ye, Heidemann and Estrin (2003)

Tradeoffs: energy vs. latency vs. fairness

Major components in S-MAC:
• Periodic listen and sleep
• Collision avoidance
• Overhearing avoidance
• Message passing

Page 23: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


RL-MAC (Z. Liu, I. Arel, 2005)

Formulate the MAC problem as an RL problem

Similar frame-based structure as in S-MAC/T-MAC

Each node infers the state of other nodes as part of its decision-making process

Active time and duty cycle are both a function of the traffic load; Q-learning was used

The main effort involved crafting the reward signal

n_b – number of packets queued
t_r – action (active time)
Ratio of successful rx vs. tx
Number of failed attempts
A term reflecting delay
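The annotated quantities above suggest how the per-frame decision could be cast as Q-learning. Below is a hypothetical skeleton, not the published RL-MAC formulation: the state is a coarse bucket of the queue length n_b, the action is the active time t_r, and the reward mixes successful exchanges with penalties for failed attempts and residual queueing (delay). The action set, weights, and quantization are illustrative assumptions.

```python
import random
from collections import defaultdict

# Illustrative Q-learning skeleton for a duty-cycle MAC decision, loosely in the
# spirit of RL-MAC; the state/reward shaping is a hypothetical stand-in, not the
# exact formulation of Liu & Arel (2005).

ACTIVE_TIMES = [1, 2, 4, 8]          # candidate active slots per frame (action t_r)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = defaultdict(float)               # Q[(queue_bucket, action_index)]

def bucket(n_queued):
    """Coarse state: quantize the number of queued packets n_b."""
    return min(n_queued, 10)

def choose_action(state):
    """Epsilon-greedy choice of active time for the coming frame."""
    if random.random() < EPSILON:
        return random.randrange(len(ACTIVE_TIMES))
    return max(range(len(ACTIVE_TIMES)), key=lambda a: Q[(state, a)])

def reward(tx_ok, rx_ok, failed, still_queued):
    # Reward successful exchanges; penalize failed attempts and queueing delay.
    return (tx_ok + rx_ok) - 0.5 * failed - 0.1 * still_queued

def frame_update(state, action, r, next_state):
    """One Q-learning update at the end of a frame."""
    best_next = max(Q[(next_state, a)] for a in range(len(ACTIVE_TIMES)))
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
```

A node would call choose_action at the start of each frame, keep its radio on for ACTIVE_TIMES[action] slots, tally the outcomes, and call frame_update before sleeping until the next frame.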

Page 24: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


RL-MAC Results

Page 25: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


RL-MAC Results (cont.)

Page 26: ECE 517: Reinforcement Learning in  Artificial Intelligence Lecture 19: Case Studies


Summary

RL is a powerful tool which can support a wide range of applications

There is an art to defining the observations, states, rewards and actions
Main goal: formulate an “as simple as possible” representation
Depends on the application
Can impact results significantly

Fits in high-resource and low-resource systems

Next class, we’ll talk about a particular class of RL techniques called Neuro-Dynamic Programming