Deep Learning Shanghai #4




Page 1: Shanghai deep learning meetup 4

Deep Learning Shanghai

#4

Page 2: Shanghai deep learning meetup 4

Goal

• Help people get introduced to DL

• Help investors find potential projects

• Utilize wonderful techniques to solve problems

Page 3: Shanghai deep learning meetup 4

Review of #1 meetup

• Introduction to DL

• Pai Peng’s talk: DeepCamera: A Unified Framework for Recognizing Places-of-Interest based on Deep ConvNets. CIKM 2015

• Problems left:

• fraud detection (9)

• social topic extraction (7)

• image problems (7)

Page 4: Shanghai deep learning meetup 4

Review of #2 meetup

• Tom’s talk about Deep Learning in HealthCare

• Anson & John’s talk about Introduction to RapidMiner & presentation of DL4J Deep Learning extension

Page 5: Shanghai deep learning meetup 4

Review of #3 meetup

• PART 1: Deep Learning Program

• PART 2: informative sharing:

• AlphaGo-related technology, by Davy

• CNN for text classification, by Yale from Alibaba

Page 6: Shanghai deep learning meetup 4

Schedule

• PART 1: Deep Reinforcement Learning

• Reinforcement learning

• Deep Q-Network

• Atari games

• PART 2: evaluation platforms:

• RLLAB

• OpenAI gym

Page 7: Shanghai deep learning meetup 4

Part 1

• Deep reinforcement learning

• Reinforcement learning basics

• Deep Q-Network

• Atari games

Page 8: Shanghai deep learning meetup 4

Reinforcement Learning

• Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. [definition from Wikipedia]

Page 9: Shanghai deep learning meetup 4

Reinforcement learning: an intersection of various domains

• game theory,

• control theory,

• operations research,

• information theory,

• simulation-based optimization,

• multi-agent systems,

• swarm intelligence,

• statistics,

• genetic algorithms.

Page 10: Shanghai deep learning meetup 4

economics and game theory

• In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

Page 11: Shanghai deep learning meetup 4

Machine learning

• In machine learning, the environment is typically formulated as a Markov decision process (MDP), as many reinforcement learning algorithms for this context utilize dynamic programming techniques.

• The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not need knowledge about the MDP, and they target large MDPs where exact methods become infeasible.

Page 12: Shanghai deep learning meetup 4

Characteristics of RL

• Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected.

• Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

• The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.

Page 13: Shanghai deep learning meetup 4

Reinforcement Learning

From Richard Sutton’s book, RL problems have three characteristics:

1. being closed-loop in an essential way,
2. not having direct instructions as to what actions to take, and
3. the consequences of actions, including reward signals, playing out over extended time periods.

Page 14: Shanghai deep learning meetup 4

Elements of RL

• a policy,

• a reward signal,

• a value function, and

• optionally, a model of the environment.

Page 15: Shanghai deep learning meetup 4

policy

A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.

Page 16: Shanghai deep learning meetup 4

reward signal

A reward signal defines the goal in a reinforcement learning problem.

• On each time step, the environment sends to the reinforcement learning agent a single number, a reward.

• The agent’s sole objective is to maximize the total reward it receives over the long run.

• The reward signal thus defines what are the good and bad events for the agent.

Page 17: Shanghai deep learning meetup 4

Value function

Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run.

• Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.

Page 18: Shanghai deep learning meetup 4

Comments on reinforcement learning

• Rewards are basically given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime.

• In fact, the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values.

• The central role of value estimation is arguably the most important thing we have learned about reinforcement learning over the last few decades.

Page 19: Shanghai deep learning meetup 4

Model

The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment, or more generally, that allows inferences to be made about how the environment will behave.

Page 20: Shanghai deep learning meetup 4

Model

• For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.

• Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners—viewed as almost the opposite of planning.

Page 21: Shanghai deep learning meetup 4

Definition of RL

• Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making.

• It is distinguished from other computational approaches by its emphasis on learning by an agent from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment.

• In our opinion, reinforcement learning is the first field to seriously address the computational issues that arise when learning from interaction with an environment in order to achieve long-term goals.

Page 22: Shanghai deep learning meetup 4

History of RL

The term “optimal control” came into use in the late 1950s to describe the problem of designing a controller to minimize a measure of a dynamical system’s behavior over time.

One of the approaches to this problem was developed in the mid-1950s by Richard Bellman and others through extending a nineteenth century theory of Hamilton and Jacobi.

This approach uses the concepts of a dynamical system’s state and of a value function, or “optimal return function,” to define a functional equation, now often called the Bellman equation.

The class of methods for solving optimal control problems by solving this equation came to be known as dynamic programming (Bellman, 1957a).

Bellman (1957b) also introduced the discrete stochastic version of the optimal control problem known as Markovian decision processes (MDPs), and Ronald Howard (1960) devised the policy iteration method for MDPs. All of these are essential elements underlying the theory and algorithms of modern reinforcement learning.

Page 23: Shanghai deep learning meetup 4

dynamic programming for reinforcement learning

Dynamic programming is widely considered the only feasible way of solving general stochastic optimal control problems. It suffers from what Bellman called “the curse of dimensionality,” meaning that its computational requirements grow exponentially with the number of state variables, but it is still far more efficient and more widely applicable than any other general method.

Dynamic programming has been extensively developed since the late 1950s, including extensions to partially observable MDPs (surveyed by Lovejoy, 1991), many applications (surveyed by White, 1985, 1988, 1993), approximation methods (surveyed by Rust, 1996), and asynchronous methods (Bertsekas, 1982, 1983).

Many excellent modern treatments of dynamic programming are available (e.g., Bertsekas, 2005, 2012; Puterman, 1994; Ross, 1983; and Whittle, 1982, 1983). Bryson (1996) provides an authoritative history of optimal control.

Page 24: Shanghai deep learning meetup 4

Atari Games

• https://deepmind.com/dqn.html

Page 25: Shanghai deep learning meetup 4

Atari Games

• Breakout

• https://www.youtube.com/watch?v=UXurvvDY93o

• https://github.com/corywalker/deep_q_rl/tree/pull_request

• van Hasselt, H., Guez, A., & Silver, D. (2015). Deep Reinforcement Learning with Double Q-learning. arXiv preprint arXiv:1509.06461.

Page 26: Shanghai deep learning meetup 4

Reinforcement Learning

• Another paradigm of machine learning

• learning from interaction

Page 27: Shanghai deep learning meetup 4

Ingredients of RL

• Markov Decision Process

• Discounted Future Reward (see the formula after this list)

• Q-learning

Page 28: Shanghai deep learning meetup 4

Questions in Reinforcement learning

• What are the main challenges in reinforcement learning? We will cover the credit assignment problem and the exploration-exploitation dilemma here.

• How to formalize reinforcement learning in mathematical terms? We will define the Markov Decision Process and use it to reason about reinforcement learning.

• How do we form long-term strategies? We define the “discounted future reward”, which forms the basis for the algorithms in the next sections.

Page 29: Shanghai deep learning meetup 4

Questions in Reinforcement learning (cont.)

• How can we estimate or approximate the future reward? A simple table-based Q-learning algorithm is defined and explained here (see the sketch after this list).

• What if our state space is too big? Here we see how the Q-table can be replaced with a (deep) neural network.

• What do we need to make it actually work? The experience replay technique, which stabilizes learning with neural networks, will be discussed here.

• Are we done yet? Finally, we will consider some simple solutions to the exploration-exploitation problem.

Page 30: Shanghai deep learning meetup 4
Page 31: Shanghai deep learning meetup 4
Page 32: Shanghai deep learning meetup 4
Page 33: Shanghai deep learning meetup 4
Page 34: Shanghai deep learning meetup 4
Page 35: Shanghai deep learning meetup 4

Go Deeper: Reinforcement Learning

• Deep Q-Network [Volodymyr Mnih]

Page 36: Shanghai deep learning meetup 4

Deep Q-Network

• Deep Q-Network

• Experience Replay

• Exploration-Exploitation

Page 37: Shanghai deep learning meetup 4

Structure of DQN

Page 38: Shanghai deep learning meetup 4

Parameter settings

Page 39: Shanghai deep learning meetup 4
Page 40: Shanghai deep learning meetup 4

• The state of the environment in the Breakout game can be defined by the location of the paddle, the location and direction of the ball, and the presence or absence of each individual brick. This intuitive representation, however, is game-specific.

Page 41: Shanghai deep learning meetup 4

• If we apply the same preprocessing to game screens as in the DeepMind paper – take the four last screen images, resize them to 84×84 and convert to grayscale with 256 gray levels – we would have 256^(84×84×4) ≈ 10^67970 possible game states. This means 10^67970 rows in our imaginary Q-table.

Page 42: Shanghai deep learning meetup 4

Deep Q-learning algorithm

Page 43: Shanghai deep learning meetup 4

Experience replay

Page 44: Shanghai deep learning meetup 4

Deep Q-learning algorithm with experience replay

Page 45: Shanghai deep learning meetup 4

Pipeline of DQN

Page 46: Shanghai deep learning meetup 4
Page 47: Shanghai deep learning meetup 4
Page 48: Shanghai deep learning meetup 4
Page 49: Shanghai deep learning meetup 4

Extended Data Figure 1 | Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced during a combination of human and agent play in Space Invaders. The plot was generated by running the t-SNE algorithm [25] on the last hidden layer representation assigned by DQN to game states experienced during a combination of human (30 min) and agent (2 h) play. The fact that there is similar structure in the two-dimensional embeddings corresponding to the DQN representation of states experienced during human play (orange points) and DQN play (blue points) suggests that the representations learned by DQN do indeed generalize to data generated from policies other than its own. The presence in the t-SNE embedding of overlapping clusters of points corresponding to the network representation of states experienced during human and agent play shows that the DQN agent also follows sequences of states similar to those found in human play. Screenshots corresponding to selected states are shown (human: orange border; DQN: blue border).

Page 50: Shanghai deep learning meetup 4

• Double Q-learning http://arxiv.org/abs/1509.06461 (target formula after this list)

• Prioritized Experience Replay http://arxiv.org/abs/1511.05952

• Dueling Network Architecture http://arxiv.org/abs/1511.06581

• extension to continuous action space http://arxiv.org/abs/1509.02971

Page 51: Shanghai deep learning meetup 4

• But beware: deep Q-learning has been patented by Google!

Page 52: Shanghai deep learning meetup 4

OpenAI

• From (partially) closed to open

• Released a benchmark platform for reinforcement learning

Page 53: Shanghai deep learning meetup 4

Plan for the DL program

• explanations about the design, the implementation, and the tricks inside DL

• instructions for solving problems (we prefer using DL)

• inspirations for novel thoughts and applications

Page 54: Shanghai deep learning meetup 4

DL startups

• Clarifai

• AlchemyAPI

• MetaMind

• …

Page 55: Shanghai deep learning meetup 4

online courses

• https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/ (Lectures 15–16)

• http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html (RL)

• http://rll.berkeley.edu/deeprlcourse/

Page 56: Shanghai deep learning meetup 4

Benchmarks

• RLLAB

• OpenAI gym

Page 57: Shanghai deep learning meetup 4

Recall the Plan

• explanations about the design, the implementation, and the tricks inside DL

• instructions for solving problems (we prefer using DL)

• inspirations for novel thoughts and applications

Page 58: Shanghai deep learning meetup 4

Standpoint

• independent researchers and practitioners

• open to novel ideas

• focus on technology for addressing humanity’s issues

• open to sponsors

Page 59: Shanghai deep learning meetup 4

Tracking

• Trello board

• Meetup page

• Periscope

Page 60: Shanghai deep learning meetup 4
Page 61: Shanghai deep learning meetup 4

References

• http://www.nervanasys.com/demystifying-deep-reinforcement-learning/

• http://www.nature.com/news/game-playing-software-holds-lessons-for-neuroscience-1.16979

• http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf