Deep Learning Shanghai #4




Page 1: Shanghai deep learning meetup 4

Deep Learning Shanghai

#4

Page 2: Shanghai deep learning meetup 4

Goal

• Help people get introduced to DL

• Help investors find potential projects

• Utilize wonderful techniques to solve problems

Page 3: Shanghai deep learning meetup 4

Review of #1 meetup

• Introduction to DL

• Pai Peng’s talk: DeepCamera: A Unified Framework for Recognizing Places-of-Interest based on Deep ConvNets. CIKM 2015

• Problems left:

• fraud detection (9)

• social topic extraction (7)

• image problems (7)

Page 4: Shanghai deep learning meetup 4

Review of #2 meetup

• Tom’s talk about Deep Learning in HealthCare

• Anson & John’s talk about Introduction to RapidMiner & presentation of DL4J Deep Learning extension

Page 5: Shanghai deep learning meetup 4

Review of #3 meetup

• PART 1: Deep Learning Program

• PART 2: informative sharing:

• AlphaGo-related technology, by Davy

• CNN for text classification, by Yale from Alibaba

Page 6: Shanghai deep learning meetup 4

Schedule

• PART 1: Deep Reinforcement Learning

• Reinforcement learning

• Deep Q-Network

• Atari games

• PART 2: evaluation platforms:

• RLLAB

• OpenAI gym

Page 7: Shanghai deep learning meetup 4

Part 1

• Deep reinforcement learning

• Reinforcement learning basics

• Deep Q-Network

• Atari games

Page 8: Shanghai deep learning meetup 4

Reinforcement Learning

• Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. [definition from Wikipedia]

Page 9: Shanghai deep learning meetup 4

Reinforcement learning: an intersection of various domains

• game theory,

• control theory,

• operations research,

• information theory,

• simulation-based optimization,

• multi-agent systems,

• swarm intelligence,

• statistics,

• genetic algorithms.

Page 10: Shanghai deep learning meetup 4

economics and game theory

• In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

Page 11: Shanghai deep learning meetup 4

Machine learning

• In machine learning, the environment is typically formulated as a Markov decision process (MDP), as many reinforcement learning algorithms for this context utilize dynamic programming techniques.

• The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not need knowledge about the MDP, and they target large MDPs where exact methods become infeasible.

Page 12: Shanghai deep learning meetup 4

Characteristics of RL

• Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected.

• Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

• The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.

Page 13: Shanghai deep learning meetup 4

Reinforcement Learning

From Richard Sutton’s book, RL problems have three characteristics:

1. being closed-loop in an essential way,
2. not having direct instructions as to what actions to take, and
3. the consequences of actions, including reward signals, playing out over extended time periods.

Page 14: Shanghai deep learning meetup 4

Elements of RL

• a policy,

• a reward signal,

• a value function, and

• optionally, a model of the environment.

Page 15: Shanghai deep learning meetup 4

policy

A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.

Page 16: Shanghai deep learning meetup 4

reward signal

A reward signal defines the goal in a reinforcement learning problem.

• On each time step, the environment sends to the reinforcement learning agent a single number, a reward.

• The agent’s sole objective is to maximize the total reward it receives over the long run.

• The reward signal thus defines what are the good and bad events for the agent.

Page 17: Shanghai deep learning meetup 4

Value function

Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run.

• Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.

Page 18: Shanghai deep learning meetup 4

Comments on reinforcement learning

• Rewards are basically given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime.

• In fact, the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values.

• The central role of value estimation is arguably the most important thing we have learned about reinforcement learning over the last few decades.

Page 19: Shanghai deep learning meetup 4

Model

The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment, or more generally, that allows inferences to be made about how the environment will behave.

Page 20: Shanghai deep learning meetup 4

Model

• For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.

• Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners—viewed as almost the opposite of planning.

Page 21: Shanghai deep learning meetup 4

Definition of RL

• Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making.

• It is distinguished from other computational approaches by its emphasis on learning by an agent from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment.

• In our opinion, reinforcement learning is the first field to seriously address the computational issues that arise when learning from interaction with an environment in order to achieve long-term goals.

Page 22: Shanghai deep learning meetup 4

History of RL

The term “optimal control” came into use in the late 1950s to describe the problem of designing a controller to minimize a measure of a dynamical system’s behavior over time.

One of the approaches to this problem was developed in the mid-1950s by Richard Bellman and others through extending a nineteenth century theory of Hamilton and Jacobi.

This approach uses the concepts of a dynamical system’s state and of a value function, or “optimal return function,” to define a functional equation, now often called the Bellman equation.

The class of methods for solving optimal control problems by solving this equation came to be known as dynamic programming (Bellman, 1957a).

Bellman (1957b) also introduced the discrete stochastic version of the optimal control problem known as Markovian decision processes (MDPs), and Ronald Howard (1960) devised the policy iteration method for MDPs. All of these are essential elements underlying the theory and algorithms of modern reinforcement learning.

Page 23: Shanghai deep learning meetup 4

dynamic programming for reinforcement learning

Dynamic programming is widely considered the only feasible way of solving general stochastic optimal control problems. It suffers from what Bellman called “the curse of dimensionality,” meaning that its computational requirements grow exponentially with the number of state variables, but it is still far more efficient and more widely applicable than any other general method.

Dynamic programming has been extensively developed since the late 1950s, including extensions to partially observable MDPs (surveyed by Lovejoy, 1991), many applications (surveyed by White, 1985, 1988, 1993), approximation methods (surveyed by Rust, 1996), and asynchronous methods (Bertsekas, 1982, 1983).

Many excellent modern treatments of dynamic programming are available (e.g., Bertsekas, 2005, 2012; Puterman, 1994; Ross, 1983; and Whittle, 1982, 1983). Bryson (1996) provides an authoritative history of optimal control.

Page 24: Shanghai deep learning meetup 4

Atari Games

• https://deepmind.com/dqn.html

Page 25: Shanghai deep learning meetup 4

Atari Games

• Breakout

• https://www.youtube.com/watch?v=UXurvvDY93o

• https://github.com/corywalker/deep_q_rl/tree/pull_request

• van Hasselt, H., Guez, A., & Silver, D. (2015). Deep Reinforcement Learning with Double Q-learning. arXiv preprint arXiv:1509.06461.

Page 26: Shanghai deep learning meetup 4

Reinforcement Learning

• Another paradigm of machine learning

• learning from interaction

Page 27: Shanghai deep learning meetup 4

Ingredients of RL

• Markov Decision Process

• Discounted Future Reward (see the formula after this list)

• Q-learning

Page 28: Shanghai deep learning meetup 4

Questions in Reinforcement learning

• What are the main challenges in reinforcement learning? We will cover the credit assignment problem and the exploration-exploitation dilemma here.

• How to formalize reinforcement learning in mathematical terms? We will define the Markov Decision Process and use it to reason about reinforcement learning.

• How do we form long-term strategies? We define the “discounted future reward”, which forms the basis for the algorithms in the next sections.

Page 29: Shanghai deep learning meetup 4

Questions in Reinforcement learning (cont.)

• How can we estimate or approximate the future reward? A simple table-based Q-learning algorithm is defined and explained here (see the sketch after this list).

• What if our state space is too big? Here we see how the Q-table can be replaced with a (deep) neural network.

• What do we need to make it actually work? The experience replay technique, which stabilizes learning with neural networks, will be discussed here.

• Are we done yet? Finally, we will consider some simple solutions to the exploration-exploitation problem.

Page 30: Shanghai deep learning meetup 4
Page 31: Shanghai deep learning meetup 4
Page 32: Shanghai deep learning meetup 4
Page 33: Shanghai deep learning meetup 4
Page 34: Shanghai deep learning meetup 4
Page 35: Shanghai deep learning meetup 4

Go Deeper: Reinforcement Learning

• Deep Q-Network [Volodymyr Mnih]

Page 36: Shanghai deep learning meetup 4

Deep Q-Network

• Deep Q-Network

• Experience Replay

• Exploration-Exploitation

Page 37: Shanghai deep learning meetup 4

Structure of DQN

Page 38: Shanghai deep learning meetup 4

Parameter settings

Page 39: Shanghai deep learning meetup 4
Page 40: Shanghai deep learning meetup 4

• The state of the environment in the Breakout game can be defined by the location of the paddle, the location and direction of the ball, and the presence or absence of each individual brick. This intuitive representation, however, is game-specific.

Page 41: Shanghai deep learning meetup 4

• If we apply the same preprocessing to game screens as in the DeepMind paper – take the four last screen images, resize them to 84×84 and convert to grayscale with 256 gray levels – we would have 256^(84×84×4) ≈ 10^67970 possible game states. This means 10^67970 rows in our imaginary Q-table.

Page 42: Shanghai deep learning meetup 4

Deep Q-learning algorithm

Page 43: Shanghai deep learning meetup 4

Experience replay

Page 44: Shanghai deep learning meetup 4

Deep Q-learning algorithm with experience replay

Page 45: Shanghai deep learning meetup 4

Pipeline of DQN

Page 46: Shanghai deep learning meetup 4
Page 47: Shanghai deep learning meetup 4
Page 48: Shanghai deep learning meetup 4
Page 49: Shanghai deep learning meetup 4

Extended Data Figure 1 | Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced during a combination of human and agent play in Space Invaders. The plot was generated by running the t-SNE algorithm [25] on the last hidden layer representation assigned by DQN to game states experienced during a combination of human (30 min) and agent (2 h) play. The fact that there is similar structure in the two-dimensional embeddings corresponding to the DQN representation of states experienced during human play (orange points) and DQN play (blue points) suggests that the representations learned by DQN do indeed generalize to data generated from policies other than its own. The presence in the t-SNE embedding of overlapping clusters of points corresponding to the network representation of states experienced during human and agent play shows that the DQN agent also follows sequences of states similar to those found in human play. Screenshots corresponding to selected states are shown (human: orange border; DQN: blue border).

Page 50: Shanghai deep learning meetup 4

• Double Q-learning http://arxiv.org/abs/1509.06461 (target formula after this list)

• Prioritized Experience Replay http://arxiv.org/abs/1511.05952

• Dueling Network Architecture http://arxiv.org/abs/1511.06581

• extension to continuous action space http://arxiv.org/abs/1509.02971

Page 51: Shanghai deep learning meetup 4

• But beware: deep Q-learning has been patented by Google!

Page 52: Shanghai deep learning meetup 4

OpenAI

• From (partially) closed to open

• Released a benchmark platform for reinforcement learning

Page 53: Shanghai deep learning meetup 4

Plan for the DL program

• explanations about the design, the implementation, and the tricks inside DL

• instructions for solving problems (we prefer using DL)

• inspirations for novel thoughts and applications

Page 54: Shanghai deep learning meetup 4

DL startups

• Clarifai

• AlchemyAPI

• MetaMind

• …

Page 55: Shanghai deep learning meetup 4

online courses

• https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/ (Lectures 15–16)

• http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html (RL)

• http://rll.berkeley.edu/deeprlcourse/

Page 56: Shanghai deep learning meetup 4

Benchmarks

• RLLAB

• OpenAI gym

Page 57: Shanghai deep learning meetup 4

Recall the Plan

• explanations about the design, the implementation, and the tricks inside DL

• instructions for solving problems (we prefer using DL)

• inspirations for novel thoughts and applications

Page 58: Shanghai deep learning meetup 4

Standpoint

• independent researchers and practitioners

• open to novel ideas

• focus on technology for addressing humanity’s issues

• open to sponsors

Page 59: Shanghai deep learning meetup 4

Tracking

• Trello board

• Meetup page

• Periscope

Page 60: Shanghai deep learning meetup 4
Page 61: Shanghai deep learning meetup 4

References

• http://www.nervanasys.com/demystifying-deep-reinforcement-learning/

• http://www.nature.com/news/game-playing-software-holds-lessons-for-neuroscience-1.16979

• http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf