
Towards Vision-Based Deep Reinforcement Learning for Robotic Motion Control

Fangyi Zhang, Jürgen Leitner, Michael Milford, Ben Upcroft, Peter Corke
ARC Centre of Excellence for Robotic Vision (ACRV)

Queensland University of Technology (QUT)
[email protected]

Abstract

Manipulation in highly dynamic and complex environments is challenging for robots. This paper introduces a machine learning based system for controlling a robotic manipulator with visual perception only. The capability to autonomously learn robot controllers solely from camera images and without any prior knowledge is shown for the first time. We build upon the success of recent deep reinforcement learning and develop a system for learning target reaching with a three-joint robot manipulator using external visual observation of the manipulator. In simulation, a Deep Q Network (DQN) was demonstrated to perform target reaching after training. Transferring the network to real hardware and real observation in a naive approach failed, but experiments show that the network works when replacing camera images with synthetic images generated by a simulator according to real-time robot joint angles.

1 Introduction

Robots are widely used to complete various manipulation tasks in industrial manufacturing, where environments are relatively static and simple. However, these operations remain challenging for robots in the highly dynamic and complex environments commonly encountered in everyday life. Humans, in contrast, manipulate with ease in such environments. We seem to learn manipulation skills by observing how others perform them (learning from observation), as well as master new skills through trial and error (learning from exploration). Inspired by this, we want robots to learn and master manipulation skills in the same way.

To give robots the ability to learn from exploration, methods are required that can learn autonomously and that are flexible across a range of differing manipulation tasks.

Figure 1: Baxter's arm being controlled by a trained Deep Q Network (DQN). Synthetic images (on the right) are fed into the DQN to overcome some of the real-world issues encountered, such as image noise and the precise location of the arm.

A promising candidate for autonomous learning in this regard is Deep Reinforcement Learning (DRL), which combines reinforcement learning and deep learning. One topical example of DRL is the Deep Q Network (DQN), which, after learning to play Atari 2600 games over 38 days, was able to match human performance [Mnih et al., 2013; Mnih et al., 2015]. Despite their promise, applying DQNs to "perfect" and relatively simple computer game worlds is a far cry from deploying them in complex robotic manipulation tasks, especially when factors such as sensor noise and image offsets are considered.

This paper takes the first steps towards enabling DQNs to be used for learning robotic manipulation. We focus on learning these skills from (external) visual observation of the manipulator, without any prior knowledge of configuration or joint state. Towards this end, we assess the feasibility of using DQNs to perform a simple target reaching task, an important component of general manipulation tasks such as object grasping.


In particular, we make the following contributions:

• We present a DQN-based learning system for a target reaching task. The system consists of three components: a 2D robotic arm simulator for target reaching, a DQN learner, and an interface to enable operation on a Baxter robot.

• We train agents in simulation and evaluate them in both simulation and real-world target reaching experiments. The experiments in simulation are conducted with varying levels of noise, image offsets, initial arm poses and link lengths, which are common concerns in robotic motion control and manipulation.

• We identify and discuss a number of issues and opportunities for future work towards enabling vision-based deep reinforcement learning in real-world robotic manipulation.

2 Related Work

2.1 Vision-based Robotic Manipulation

Vision-based robotic manipulation is the process by which robots use their manipulators (such as robotic arms) to rearrange environments [Mason, 2001], based on camera images. Early vision-based robotic manipulation was implemented using pose-based (position and orientation) closed-loop control, where vision was typically used to extract the pose of an object as an input for a manipulation controller at the beginning of a task [Kragic and Christensen, 2002].

Most current vision-based robotic manipulation methods are closed-loop, based on visual perception. A vision-based manipulation system was implemented on a Johns Hopkins "Steady Hand Robot" for cooperative manipulation at millimeter to micrometer scales, using virtual fixtures [Bettini et al., 2004]. With both monocular and binocular vision cues, various closed-loop visual strategies were applied to enable robots to manipulate both known and unknown objects [Kragic et al., 2005]. However, each of these methods is effective only for certain tasks.

Various learning methods have been applied to implement complex manipulation tasks in the real world. With continuous hidden Markov models (HMMs), a humanoid robot was able to learn dual-arm manipulation tasks from human demonstrations through vision [Asfour et al., 2008]. However, most of these algorithms are designed for specific tasks and need considerable prior knowledge; they are not flexible enough to learn a range of different manipulation tasks.

2.2 Reinforcement Learning in Robotics

Reinforcement Learning (RL) [Sutton and Barto, 1998; Kormushev et al., 2013] has been applied in robotics,

as it promises a way to learn complex actions on complex robotic systems by only informing the robot whether its actions were successful (positive reward) or not (negative reward). [Peters et al., 2003] reviewed some RL concepts in terms of their applicability to controlling complex humanoid robots, highlighting issues with greedy policy search and gradient-based methods. How to generate the right reward is an active topic of research. Intrinsic motivation and curiosity have been shown to provide a means to explore large state spaces, such as those found on complex humanoids, faster and more efficiently [Frank et al., 2014].

2.3 Deep Visuomotor Policies

To enable robots to learn manipulation skills with little prior knowledge, a convolutional neural network (CNN) based policy representation architecture (deep visuomotor policies) and its guided policy search method were introduced by Levine et al. [Levine et al., 2015a; Levine et al., 2015b]. The deep visuomotor policies map joint angles and camera images directly to joint torques. Robot configuration is the only necessary prior knowledge. The policy search method consists of two phases, i.e., an optimal control phase and a supervised learning phase. The training consists of three procedures, i.e., pose CNN training, trajectory pre-training, and end-to-end training.

The deep visuomotor policies did enable robots to learn manipulation skills with little prior knowledge through supervised learning, but pre-collected datasets were necessary. The human involvement in dataset collection makes this method less autonomous. Besides, the training method, specifically designed to speed up contact-rich manipulation learning, makes it less flexible for other manipulation tasks.

2.4 Deep Q Network

The DQN, a topical example of DRL, satisfies both the autonomy and flexibility requirements for learning from exploration. It successfully learnt to play 49 different Atari 2600 games, achieving a human level of control [Mnih et al., 2015]. The DQN uses a deep convolutional neural network (CNN) [Krizhevsky et al., 2012] to approximate a Q-value function, mapping raw pixel images directly to actions. No hand-crafted feature extraction is needed; the algorithm simply improves its policy by playing the games over and over again. It learnt to play 49 different games using the same network architecture with no modification.
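For reference, the Q-value function in the DQN is learnt by minimising the standard objective from [Mnih et al., 2015], using experience replay and a periodically updated target network with parameters θ⁻:

    L(θ) = E_(s,a,r,s') [ ( r + γ max_a' Q(s', a'; θ⁻) − Q(s, a; θ) )² ]

where γ is the discount factor and (s, a, r, s') are transitions sampled from the replay memory.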

The DQN is defined by its inputs – raw pixels of game video frames and received rewards – and its outputs, i.e., the number of available actions in the game [Mnih et al., 2015]. This number of actions is the only prior knowledge, which means no robot configuration information is known to the agent when learning DQN-based motion control.


Figure 2: Schematic of the DQN layers for end-to-end learning and their respective outputs. Four input images are reshaped (Rs) and then fed into the DQN as grey-scale images (converted from RGB). The DQN consists of three convolutional layers, each followed by a rectifier layer (Rf), followed by a reshaping layer (Rs) and two fully connected layers (again with a rectifier layer in between). The normalized outputs of each layer are visualized. (Note: the outputs of the last four layers are shown as matrices instead of vectors.)

However, in the DQN training process the Atari 2600 game engine served as a reward function; for robotic motion control no such engine exists. To apply the DQN to robotic motion control, a reward function is therefore needed to assess trials. Besides, sensing noise and higher complexity and dynamics are inevitable issues in real-world applications.

3 Problem Definition and System Description

A common problem in robotic manipulation is to reach for the object to be interacted with. This target reaching task is defined as controlling a robot arm such that its end-effector reaches a specific target configuration. We are interested in the case in which the robot has to estimate its own configuration and the target position from visual inputs only. To learn such a task, we developed a system consisting of three parts:

• a 2D simulator for robotic target reaching, creating the visual inputs to the learner,

• a deep reinforcement learning framework based on the implementation by Google DeepMind, named Deep Q Network (DQN) [Mnih et al., 2015], and

• an interface to control a Baxter robot directly with the outputs from the DQN.

3.1 DQN-based Learning System

The DQN adopted here has the same architecture as that used for playing Atari games, which contains three convolutional layers and two fully connected layers [Mnih et al., 2015]. Fig. 2 shows the architecture and exemplary output of each layer. The inputs to the DQN are the reward value and an image; its output is the index of the action to take. The DQN learns target reaching skills through its interaction with the target reaching simulator.
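As an illustration, a minimal PyTorch-style sketch of this network is given below. The layer sizes follow the Atari DQN of [Mnih et al., 2015] (they are not listed explicitly in this paper, so treat them as an assumption), with the output dimension set to the nine actions of our simulator:

    import torch.nn as nn

    class DQN(nn.Module):
        """Maps a stack of four 84x84 grey-scale frames to one Q-value per action."""
        def __init__(self, num_actions=9):
            super().__init__()
            self.net = nn.Sequential(
                # three convolutional layers, each followed by a rectifier (Rf)
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
                # reshaping layer (Rs) followed by two fully connected layers
                nn.Flatten(),
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                nn.Linear(512, num_actions),
            )

        def forward(self, x):
            return self.net(x)  # shape: (batch, num_actions)

The action sent to the simulator (or robot) is simply the index of the largest output, e.g. action = q_values.argmax(dim=1).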

Figure 3: System overview

An overview of the system framework for both learning in simulation and testing on a real robot is shown in Fig. 3.

In the first phase (training or testing in simulation), the target reaching simulator provides the reward value (R) and image (I). R is used for training the network. The action output (A) of the DQN is directly sent to the simulated robotic arm.

When testing on a Baxter robot using camera images, an external camera provides the input images (I). The action output (A) of the DQN is sent to the robot through our new interface. The interface then controls Baxter to implement the action according to A, by sending the updated robot pose (q′).

When testing on a Baxter robot using synthetic images, the information flow is similar to that in the second phase, except that the input images (I) are provided by the simulator. These are not the images provided in the first phase; they are synthetic images.


Figure 4: Our simulator for the target reaching task, providing the visual inputs to the DQN learner. (a) Schematic diagram showing the three joints (S1, E1, W1), the end-effector, the target and the completion area. (b) The robot simulator during a successful reach.

The simulator generates the synthetic images according to the real-time joint poses (P) of a Baxter robot, provided by the interface to Baxter.

3.2 Target Reaching Simulator

We simulate the task of controlling a three-joint robotic arm in 2D to reach a target (Fig. 4). As shown in Fig. 4(a), the robotic arm consists of four links and three joints, whose configurations are consistent with the specifications of a Baxter arm, including joint constraints. The blue spot is the target to be reached; the red spot marks the position of the end-effector. The simulator is controlled by sending commands to the individual joints "S1", "E1" and "W1". The simulator screen resolution is 160 × 320.

The corresponding real scenario that the simulator represents is the following: with appropriate constant angles for the other joints of a Baxter arm, the arm moves in a vertical plane controlled by joints "S1", "E1" and "W1", and a controller (game player) observes the arm through an external camera placed directly beside it with a horizontal point of view. The three joints are in position control mode. The background is white.

The 2D simulator is used as a target reaching video game in connection with the DQN setup. It provides the raw pixel inputs to the network and has nine action options, i.e., three buttons for each joint: joint angle increase, decrease and hold. The joint angle increase/decrease step is constant at 0.02 rad. At the beginning of each round, joints "S1", "E1" and "W1" are set to a certain initial pose, such as [0.0, 0.0, 0.0] rad, and the target is randomly selected. One round is an entire trial up to its terminal state.
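As a sketch, the nine discrete actions can be represented as a mapping from the DQN's output index to a joint and an angle change; the enumeration order below is an assumption for illustration, not necessarily the exact one used in our simulator:

    STEP = 0.02  # constant joint-angle step in rad per button press

    # action index -> (joint name, angle change): increase, decrease or hold
    ACTIONS = {
        0: ('S1', +STEP), 1: ('S1', -STEP), 2: ('S1', 0.0),
        3: ('E1', +STEP), 4: ('E1', -STEP), 5: ('E1', 0.0),
        6: ('W1', +STEP), 7: ('W1', -STEP), 8: ('W1', 0.0),
    }

    def apply_action(joint_angles, action_index):
        """Return the joint angles after one discrete action (position control)."""
        joint, delta = ACTIONS[action_index]
        updated = dict(joint_angles)
        updated[joint] += delta
        return updated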

In the game, a reward value is returned for each button press.

Algorithm 1: Reward Function

    input : Pt: the target 2D coordinates;
            Pe: the end-effector 2D coordinates after an action.
    output: R: the reward for the current state;
            T: whether the game is terminal.

     1  Dis = ComputeDistance(Pt, Pe);
     2  DisChange = Dis − PreviousDis;
     3  if DisChange > 0 then
     4      R = −1;
     5  else if DisChange < 0 then
     6      R = 1;
     7  else
     8      R = 0;
     9  end
    10  Racc = Rt + Rt−1 + Rt−2;
    11  if Racc < −1 then
    12      T = True;
    13  else
    14      T = False;
    15  end

The reward value is determined by a reward function, introduced in Section 3.3. The terminal state of a round is determined by the reward function as well. For a player, the goal is to obtain as high an accumulated reward as possible in each round. We call this accumulated reward the score of a round.

3.3 Reward Function

To stay consistent with the DQN setup, the reward function has two return values: one is the reward for each action; the other indicates whether the target reaching game is terminal. The algorithm is shown in Algorithm 1. The reward of each action is determined by the change in distance between the end-effector and the target: if the distance decreases, the reward function returns 1; if it increases, it returns −1; otherwise it returns 0. If the sum of the latest three rewards is smaller than −1, the game terminates.
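A minimal Python sketch of Algorithm 1 is given below; it assumes 2D coordinates as sequences and keeps the previous distance and recent rewards in a small state dictionary (the bookkeeping details are an assumption of this sketch):

    import numpy as np

    def reward_function(p_target, p_end_effector, state):
        """Return (R, T) following Algorithm 1."""
        dist = float(np.linalg.norm(np.subtract(p_target, p_end_effector)))
        dist_change = dist - state.get('previous_dist', dist)
        if dist_change > 0:
            reward = -1      # end-effector moved away from the target
        elif dist_change < 0:
            reward = 1       # end-effector moved closer to the target
        else:
            reward = 0       # no change in distance
        state['previous_dist'] = dist
        state.setdefault('recent_rewards', []).append(reward)
        terminal = sum(state['recent_rewards'][-3:]) < -1
        return reward, terminal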

4 Experiments and Results

To evaluate the feasibility of the DQN-based system in learning to perform target reaching, we conducted experiments in both simulation and real-world scenarios. The experiments consist of three phases: training in simulation, testing in simulation, and testing in the real world.

4.1 Training in Simulation Scenarios

To evaluate the robustness of the trained agents to some common concerns in robotic manipulation, we trained several agents with different simulator settings. The different settings include sensing noise, image offsets, and variations in initial arm pose and link length.


Figure 5: Screenshots highlighting the different training scenarios for the agents. (a) Setting A: simulation images; (b) Setting B: simulation images + noise; (c) Setting C: Setting B + random initial pose; (d) Setting D: Setting C + random image offset; (e) Setting E: Setting D + random link length changes.

Table 1: Agents and Training Settings

    Agent   Simulator Setting
    A       constant initial pose
    B       Setting A + random image noise
    C       Setting B + random initial pose
    D       Setting C + random image offset
    E       Setting D + random link length

The setting details for training the five agents are shown in Table 1, and their screenshots are shown in Fig. 5.

Agent A was trained in Setting A, where the 2D robotic arm was initialized to the same pose ([0.0, 0.0, 0.0] rad) at the beginning of each round and no image noise was added. To simulate camera sensing noise, random noise was added in Setting B, on the basis of Setting A. The noise followed a uniform distribution with a scale between −0.1 and 0.1 (for float pixel values).

In Setting C, in addition to random image noise, the initial arm pose was randomly selected. In the training of Agent D, random image offsets were added on the basis of Setting C; the offset ranges in the u and v directions were [−23, 7] and [−40, 20] pixels, respectively. Agent E was trained with varying arm link lengths; the link length variation ratio was [−4.2, 12.5]% with respect to the link lengths in the previous four settings. The image offsets and link lengths were randomly selected at the beginning of each round and stayed unchanged for the entire round (they did not vary at each frame).
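These randomisations can be summarised as a small augmentation routine; the sketch below follows the ranges listed above but the sampling code itself is ours, not part of the original simulator:

    import numpy as np

    rng = np.random.default_rng()

    def sample_round_randomisation():
        """Drawn once per round: the image offset and link lengths stay fixed
        within a round, as in Settings D and E."""
        return {
            'offset_u': int(rng.integers(-23, 8)),           # pixels
            'offset_v': int(rng.integers(-40, 21)),          # pixels
            'link_scale': 1.0 + rng.uniform(-0.042, 0.125),  # link-length ratio
        }

    def add_pixel_noise(image):
        """Uniform noise in [-0.1, 0.1] on float pixel values, redrawn every
        frame (Settings B to E)."""
        noisy = image + rng.uniform(-0.1, 0.1, size=image.shape)
        return np.clip(noisy, 0.0, 1.0)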

All agents were trained for more than 4 million frames within 160 hours. Due to the differences in setting complexity, the time cost for the simulator to update each frame varies across the five settings.

Figure 6: Action Q-value convergence curves for Agents A to E, plotted as average maximum action Q-value against training epoch. Each epoch contains 50 thousand training steps. The Q-values are the average of the estimated maximum scores for all states in a validation set. The validation set has 500 frames.

Therefore, the exact numbers of frames used for the five agents differ: 6.475, 6.275, 5.225, 4.75 and 6.35 million, respectively.

The action Q-value convergence curves are shown in Fig. 6, plotted against training epochs. Fig. 6 shows the convergence over the first 80 epochs, i.e., 4 million training steps. The Q-values are the average of the estimated maximum scores for all states in a validation set. As mentioned in Section 3.2, the score represents the accumulated reward an agent obtains in one entire round. The validation set was randomly selected at the beginning of each training run.
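Concretely, with a fixed validation set V of 500 frames, the value reported for each epoch is the mean greedy Q-value estimate

    Q̄ = (1 / |V|) Σ_{s ∈ V} max_a Q(s, a; θ),

evaluated with the network parameters θ at the end of that epoch.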

From Fig. 6, we can observe that all five agents converge towards a certain Q-value, although their values differ.


One thing we have to emphasize is that this convergence concerns only scores (the accumulated reward in each round). A high score might, but does not necessarily, indicate high performance of an agent in performing target reaching, since our reward function is based on distance changes, which cannot completely capture the target reaching performance.

4.2 Testing in Simulation Scenarios

We tested the five agents in simulation with the 2D simulator. Each agent was tested in all five settings listed in Table 1. Each test ran for 200 rounds, i.e., terminated 200 times. More testing rounds would bring the results closer to the ground truth, but would require too much time.

In the testing, task success rates were evaluated. In the computation of success rates, a trial is regarded as a success when the end-effector enters a completion area with a radius of 16 cm around the target, shown as the grey circle in Fig. 4(a), which is twice the size of the target circle. The radius of 16 cm is equivalent to 15 pixels on the simulator screen. However, for the DQN this completion area is an ellipse (a = 8 pixels, b = 4 pixels), since the simulator screen is resized from 160 × 320 to 84 × 84 before being input to the learning core.
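The ellipse follows from the anisotropic resize: 15 pixels scale to roughly 15 × 84/160 ≈ 8 pixels along one image dimension and 15 × 84/320 ≈ 4 pixels along the other, giving the semi-axes a = 8 and b = 4.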

Table 2 shows the success rates of the different agents in the different settings after 3 million training steps (60 epochs). The data on the diagonal show the success rate of each agent tested in the same setting in which it was trained, e.g., Agent A tested in Setting A.

We also ran experiments with agents from different stages of training. Table 3 shows the success rates of the agents after certain numbers of training steps. The success rates of each agent were measured in the same simulator setting in which it was trained, i.e., the diagonal case of Table 2. In Table 3, "f" indicates the final number of frames used to train each agent within 160 hours, as mentioned in Section 4.1.

Our discussion of the data in Tables 2 and 3 is based on the assumption that some outliers appeared accidentally due to the limited number of testing rounds. Although 200 testing rounds are sufficient to extract trends in the success rates, they are insufficient to establish the ground truth, and minor success rate distortions occur occasionally. To make the conclusions more convincing, more study is necessary.

From Table 2, we find that Agents A and B can both adapt to Settings A and B, but cannot adapt to the other three settings. This shows that the DQN itself is robust to random image noise, but not to varying initial arm poses, link lengths, or image offsets.

Agents C, D and E can adapt to the corresponding settings in which they were trained. This shows that the DQN can learn to adapt to these noise factors if they are present in the training process.

Table 2: Success rates (%) in different settings

    Agent   Setting A   Setting B   Setting C   Setting D   Setting E
    A       51.0        53.0        14.0        8.5         8.5
    B       50.5        49.5        11.0        8.0         10.0
    C       32.0        34.5        36.0        22.5        14.0
    D       13.5        16.5        22.0        19.5        15.0
    E       13.0        16.5        20.0        16.5        19.0

Table 3: Success rates (%) after different training steps

    Agent/Setting   Training Steps / million
                    1       2       3       4       f
    A               36.0    43.0    51.0    36.0    36.5
    B               58.0    55.5    49.5    51.5    13.5
    C               30.5    33.0    36.0    48.0    13.5
    D               16.5    17.5    19.5    26.5    14.0
    E               13.0    18.5    19.0    23.0    27.0

Besides, the success rates of Agents D and E do not differ much across the five settings. Other than the settings in which they were trained, they can also adapt to the other four settings, and even achieve higher success rates in Setting C, which is not their training setting. This indicates that agents trained in more complicated settings can adapt to simpler settings.

From the data in Table 3, we find that the success rate of each agent normally rises with more training steps. However, some rates drop after a certain number of training steps; e.g., the success rate of Agent A drops after 4 million training steps. Theoretically, with an appropriate reward function, the DQN should perform better and better, and the success rates should keep rising.

This drop was quite possibly caused by the reward function, which can guide the agent in the wrong direction. In our case, the evaluation is based on success rates, but the reward function is based on distance changes; the relation between the two is only indirect, and this indirect relation opens the possibility of incorrect guidance. This should be considered carefully in future work.

Table 3 also shows that the success rate of an agent trained in a more complicated setting is normally lower than that of one trained in a simpler setting, and that it needs more training time to reach the same level. For example, the success rate of Agent E is lower than that of Agent D at each training stage, but approaches that of Agent D at a later stage.


In general, whether or not the above assumption holds, the data in Tables 2 and 3 at least show that the DQN is able to learn to perform target reaching from exploration in simulation. More study is necessary to increase the success rates.

4.3 Real World Experiment Using Camera Images

To check the feasibility of the trained agents in the real world, we performed a target reaching experiment in a real scenario using camera images, i.e., the second phase mentioned in Section 3.1. In this experiment, we used Agent B trained with 3 million frames, which achieved relatively high success rates in both Setting A and Setting B in the simulation tests.

The experimental setting was arranged to match the scenario simulated by the 2D simulator: a Baxter arm moving in a vertical plane with a white background. A grey-scale camera was placed beside the arm, observing it from a horizontal point of view. (For the DQN, the grey-scale camera is equivalent to a colour camera: even the images from Atari games and the 2D target reaching simulator are RGB images, and they are converted to grey-scale prior to being input to the network.) The background was a white sheet. The testing scene and a sample input to the DQN are shown in Fig. 7(a) and 7(b), respectively.
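As a sketch, the per-frame preprocessing in this experiment can be written as follows; the crop rectangle is specific to our camera setup and is therefore a hypothetical parameter here:

    import numpy as np
    import cv2

    def preprocess(frame, crop):
        """Crop/mask the camera frame to the simulated 160x320 view, convert it
        to grey-scale and resize it to the 84x84 network input."""
        x, y, w, h = crop                          # hypothetical crop rectangle
        patch = frame[y:y + h, x:x + w]
        grey = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
        small = cv2.resize(grey, (84, 84))
        return small.astype(np.float32) / 255.0    # float pixel values in [0, 1]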

In the experiment, to make the agent work in the real world, we tried to match the arm position (in images) in the real scenario to that in the simulation scenario. The position adjustment was made by changing the camera pose and the image cropping parameters. However, no matter how we adjusted them, the agent did not reach the target: the success rate was 0.

Beyond the success rate, we also obtained a qualitative result: Agent B mapped specific input images to certain actions, but the mapping was ineffective for performing target reaching. There was some kind of mapping distortion between the real and simulation scenarios, which might be caused by the biases between real-scenario and simulation-scenario images.

4.4 Real World Experiment Using Simulator Images

To avoid the biases between the real and simulation scenarios, we performed another real-world experiment using simulator images. In this experiment, we replaced camera images with images generated by the 2D simulator. The simulator images were generated according to the real-time joint angles ("S1", "E1" and "W1") of a Baxter robot, which made the input images to the DQN exactly consistent with those seen in training. All other settings were the same as in Section 4.3, as shown in Fig. 1.
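One control step of this experiment can be sketched as below; read_joint_angles, render and send_joint_command stand in for our Baxter interface and the 2D simulator and are hypothetical names:

    def control_step(dqn, frame_stack, baxter, simulator):
        """Render the simulator at the robot's current joint angles (P), query
        the DQN, and command the robot with the updated pose (q')."""
        angles = baxter.read_joint_angles()                  # {'S1': ..., 'E1': ..., 'W1': ...}
        image = simulator.render(angles)                     # synthetic input image I
        frame_stack.push(image)                              # stack of the last four frames
        action = int(dqn(frame_stack.as_tensor()).argmax())  # action index A
        joint, delta = ACTIONS[action]                       # action mapping sketched in Section 3.2
        baxter.send_joint_command(joint, angles[joint] + delta)
        return action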

In this experiment, we used the same agent as in Section 4.3, i.e., Agent B trained with 3 million frames.

Figure 7: Testing scene (a) and a sample input image (b) for the real-world experiment using camera images. In the testing scene, a Baxter arm moved in a vertical plane with a white sheet as the background. To guarantee that the images input to the DQN have an appearance as consistent as possible with those in the simulation scenarios, camera images were cropped and masked with a boundary taken from the background of a simulator screenshot.

The agent achieved a success rate consistent with that obtained in the simulation-scenario testing.

From these results, we conclude that the reason the agent failed to complete the target reaching task with camera images is the bias in the input images. These biases might come from camera pose variations, colour and shape distortions, or other factors. More study is necessary to determine exactly where the biases come from (the bias factors).

5 Conclusion and Discussion

The DQN-based system is able to learn to perform target reaching from exploration in simulation, using only visual observation and no prior knowledge. However, the agent (Agent B) trained in simulation failed to perform target reaching in the real-world experiment using camera images as inputs. In contrast, in the real-world experiment using simulator images as inputs, the agent achieved a success rate consistent with that in simulation. These two results show that the failure in the real-world experiment with camera images was caused by the input image biases between the real and simulation scenarios. More work is necessary to determine the bias factors.

To make the system work in real scenarios with camera images, we can either decrease the biases or make the DQN robust


to the bias factors, so that agents trained in simulation work in real scenarios. Decreasing the biases is a trade-off between making the simulator more consistent with real scenarios and preprocessing the input images to make them more consistent with those in simulation.

Regarding making the DQN robust to the biases, there are four possible approaches: adding variations of the bias factors into the simulation scenarios, adding a fine-tuning phase in real scenarios after training in simulation, training in real scenarios directly, and designing a new DRL architecture (which could still be a DQN) that is robust to the bias factors. To increase the success rate and make the system more effective and efficient, more study is necessary on the design of the reward function and the simulator settings.

In addition to solving the bias problem, more study is necessary on the design of the reward function. A good reward function is the key to obtaining effective motion control or even manipulation skills, and it also speeds up the learning process. The reward function used in this work is only a first step and is far from being a good reward function. Beyond effectiveness and efficiency, a good reward function also needs to be flexible across a range of general-purpose motion control or even manipulation tasks.

Besides, the visual perception in this work comes from an external monocular camera. An on-robot stereo camera or RGB-D sensor would be a more effective and practical solution for applications in the 3D real world. The joint control mode in this work is position control; other control modes such as speed control and torque control are more common and more appropriate for dynamic motion control and manipulation in real-world applications.

Acknowledgements

This work was partially sponsored by the Australian Research Council (ARC) Centre of Excellence for Robotic Vision. Computational resources and services used in this work were partially provided by the HPC and Research Support Group, Queensland University of Technology (QUT).

References

[Asfour et al., 2008] Tamim Asfour, Pedram Azad, Florian Gyarfas, and Rüdiger Dillmann. Imitation learning of dual-arm manipulation tasks in humanoid robots. International Journal of Humanoid Robotics, 5(02):183–202, 2008.

[Bettini et al., 2004] Alessandro Bettini, Panadda Marayong, Samuel Lang, Allison M. Okamura, and Gregory D. Hager. Vision-assisted control for manipulation using virtual fixtures. IEEE Transactions on Robotics, 20(6):953–966, 2004.

[Frank et al., 2014] Mikhail Frank, Jürgen Leitner, Marijn Stollenga, Alexander Förster, and Jürgen Schmidhuber. Curiosity driven reinforcement learning for motion planning on humanoids. Frontiers in Neurorobotics, 7(25), 2014.

[Kormushev et al., 2013] Petar Kormushev, Sylvain Calinon, and Darwin G. Caldwell. Reinforcement learning in robotics: Applications and real-world challenges. Robotics, 2(3):122–148, 2013.

[Kragic and Christensen, 2002] Danica Kragic and Henrik I. Christensen. Survey on visual servoing for manipulation. Technical report, Computational Vision and Active Perception Laboratory, Royal Institute of Technology, Stockholm, Sweden, 2002.

[Kragic et al., 2005] Danica Kragic, Mårten Björkman, Henrik I. Christensen, and Jan-Olof Eklundh. Vision for robotic object manipulation in domestic settings. Robotics and Autonomous Systems, 52(1):85–100, 2005.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.

[Levine et al., 2015a] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. CoRR, abs/1504.00702, 2015.

[Levine et al., 2015b] Sergey Levine, Nolan Wagener, and Pieter Abbeel. Learning contact-rich manipulation skills with guided policy search. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 156–163, 2015.

[Mason, 2001] Matthew T. Mason. Mechanics of Robotic Manipulation. MIT Press, 2001.

[Mnih et al., 2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[Peters et al., 2003] Jan Peters, Sethu Vijayakumar, and Stefan Schaal. Reinforcement learning for humanoid robotics. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots, pages 1–20, 2003.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.