Value Iteration Networks
A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel
Dept. of Electrical Engineering and Computer Sciences, UC Berkeley
Presenter: Keisuke Fujimoto (ABEJA, Twitter @peisuke)
Value Iteration Networks
Purpose: Machine learning based robot path planning. The planner works in new environments not included in the training dataset.
Strategy: Prediction of the optimal action. The method learns the reward of each place and the actions that collect high rewards.
Result: Planning on 28x28 grid maps; also applicable to continuous-control robots.
[Diagram: map, pose, velocity, and goal in; action out]
Background
Target: Autonomous robots
• Manipulation robots, navigation robots, transfer robots
Problem:
• Reinforcement learning cannot work outside of its training environments.
[Figure: a manipulation robot reaching a target object; a navigation robot reaching a goal]
Contribution
• Value Iteration Networks (VIN)
• Model-free training
  • It does not require robot dynamics models.
• Generalized action prediction in new environments
  • It can work outside of the training environments.
• Key approach
  • Represents value-iteration planning by a CNN
  • Predicts a reward map and computes the sum of future rewards.
Overview of VIN
Input: state of the robot (pose, velocity), goal, map (left fig.)
Output: action (direction, motor torque)
Strategy: determine the optimal action using the predicted rewards (right fig.).
[Figure: state (left) and predicted reward map (right)]
Reward propagation
• The action can be determined from the sum of future rewards generated by reward propagation (see the sketch after the figure).
[Figure: one-step propagation example. Panels: the map, the reward from the map (goal +1, obstacles -10), and the propagated reward maps for the left-move and up-move actions.]
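To make the propagation concrete, here is a minimal numpy sketch of value-iteration-style propagation (the grid, reward values, discount, and four-action set are illustrative assumptions, not the slide's exact toy numbers): each action shifts the value map by one cell, shifted values are discounted, and a cell keeps the best of its own reward or the best move.

```python
import numpy as np

# Illustrative 5x5 reward map: +1 at the goal, -10 at obstacle cells.
R = np.zeros((5, 5))
R[1, 3] = 1.0          # goal
R[0:3, 0] = -10.0      # a wall of obstacles

gamma = 0.9            # discount factor (assumed)

def shift(v, action):
    """Value of the cell the robot lands in after one move; off-map is forbidden."""
    out = np.full_like(v, -np.inf)
    if action == "left":
        out[:, 1:] = v[:, :-1]
    elif action == "right":
        out[:, :-1] = v[:, 1:]
    elif action == "up":
        out[1:, :] = v[:-1, :]
    elif action == "down":
        out[:-1, :] = v[1:, :]
    return out

actions = ("left", "right", "up", "down")
V = R.copy()
for _ in range(10):    # repeated propagation (the slides use 20 iterations on 28x28)
    Q = np.stack([gamma * shift(V, a) for a in actions])   # one map per action
    V = np.maximum(R, Q.max(axis=0))   # best of immediate reward or moving on
print(V.round(2))      # 0.9 appears next to the goal, 0.81 two cells away, ...
```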
Determination of action
• At a cell the reward has propagated to, the optimal action is the one with the maximum propagated reward (middle fig.)
• The optimal action is therefore determined from the propagated rewards (right fig.; see the snippet after the figure)
[Figure: per-action propagated maps (left-move, up-move), their element-wise max, and the value map after repeated reward propagation, with the current robot pose marked.]
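Continuing the earlier sketch, picking the action at the robot's cell is just an argmax over the per-action propagated maps (the pose below is illustrative):

```python
r, c = 4, 0                                    # current robot pose (illustrative)
Q = np.stack([gamma * shift(V, a) for a in actions])
best = actions[int(np.argmax(Q[:, r, c]))]     # action with the max propagated reward
print("optimal action at", (r, c), "is", best)
```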
Value Iteration Module
• Reward propagation with a convolutional neural network (a Chainer sketch follows)
• The input is the reward map and the output is the sum-of-future-rewards map
• Q is the hidden per-action reward map; V is the sum-of-future-rewards map
[Diagram: reward map → convolution → Q → channel-wise max → V → output, iterated]
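Since the presenter's code is in Chainer, here is a minimal Chainer sketch of this module (a sketch under my own assumptions, not the repository's exact code): the reward and value maps are concatenated, a convolution produces one Q channel per action, and V is the channel-wise max, repeated k times.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class VIModule(chainer.Chain):
    """Value-iteration block: conv over [R, V] -> Q, channel-wise max -> V."""
    def __init__(self, n_actions=10):                # 10 Q channels, as on the slides
        super(VIModule, self).__init__()
        with self.init_scope():
            # Two input channels: the reward map R and the current value map V.
            self.conv = L.Convolution2D(2, n_actions, ksize=3, pad=1, nobias=True)

    def __call__(self, r, k=20):                     # k recurrences (20 for 28x28 maps)
        v = r                                        # one simple initialization choice
        for _ in range(k):
            q = self.conv(F.concat((r, v), axis=1))  # Q: one channel per action
            v = F.max(q, axis=1, keepdims=True)      # V = max over actions
        return q, v
```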
Value Iteration Networks
• Deep architecture of Value Iteration Networks (a sketch follows)
• The input is the map and state; fR predicts the reward map
• The attention module crops the value map around the robot position
• 𝜓 outputs the optimal action
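Putting the pieces together, a minimal end-to-end sketch in the same style (the layer sizes and the simple "select the Q values at the robot's cell" attention are my assumptions, not the paper's exact configuration):

```python
class VIN(chainer.Chain):
    """Map + goal -> fR (reward map) -> VI module -> attention -> psi -> action."""
    def __init__(self, n_actions=10):
        super(VIN, self).__init__()
        with self.init_scope():
            # fR: a small CNN predicting the reward map from map and goal channels.
            self.h = L.Convolution2D(2, 150, ksize=3, pad=1)
            self.fr = L.Convolution2D(150, 1, ksize=1, nobias=True)
            self.vi = VIModule(n_actions)               # from the previous sketch
            self.psi = L.Linear(n_actions, n_actions)   # attended Q -> action logits

    def __call__(self, x, pos, k=20):
        # x: (B, 2, H, W) map+goal image; pos: (B, 2) integer robot positions.
        reward = self.fr(F.relu(self.h(x)))
        q, _ = self.vi(reward, k)
        # Attention: select the Q values at each robot's cell.
        n, a, hgt, wid = q.shape
        flat = F.reshape(F.transpose(q, (0, 2, 3, 1)), (n * hgt * wid, a))
        idx = self.xp.arange(n) * hgt * wid + pos[:, 0] * wid + pos[:, 1]
        return self.psi(flat[idx])                      # (B, n_actions) logits
```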
Attention function
• The attention module crops a subset of the values around the current robot pose.
• The optimal action depends only on the values near the current robot pose.
• Thanks to this attention module, predicting the optimal action becomes easy (a cropping sketch follows the figure).
[Figure: 5x5 value map with the robot's cell marked; the attention module selects the 3x3 area around it.]
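A small numpy illustration of the window cropping shown in the figure (the 3x3 window size, the padding value, and the helper name attention_crop are assumptions for this example):

```python
import numpy as np

def attention_crop(value_map, pose, size=3):
    """Return the size x size window of the value map centered on the robot pose."""
    half = size // 2
    padded = np.pad(value_map, half, mode="constant", constant_values=-10.0)
    r, c = pose[0] + half, pose[1] + half     # pose in padded coordinates
    return padded[r - half:r + half + 1, c - half:c + half + 1]

# e.g. attention_crop(V, (2, 3)) returns the 3x3 patch around cell (2, 3)
```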
Grid-World Domain
Environment:
• Occupancy grid maps; test sizes from 8x8 to 28x28
• The number of recurrences is 20 for the 28x28 maps
• Training dataset: 5000 maps, 7 trajectories (a training sketch follows the list)
Network arch.:
• Map, goal → CNN → reward map → VI module → attention → FC layer → action (the current position feeds the attention)
• 3-layer net with 150 hidden nodes; 10 channels in the Q-layer; 80 parameters
Competitive methods: CNN-based Deep Q-Network; direct action prediction using an FCN
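The networks are trained from the demonstration trajectories; below is a hedged sketch of one supervised update in Chainer, continuing the earlier sketches (the optimizer choice, batch variables, and function name are illustrative assumptions):

```python
import chainer.functions as F
from chainer import optimizers

model = VIN(n_actions=10)          # from the earlier sketch
opt = optimizers.Adam()            # optimizer choice is an assumption
opt.setup(model)

def update(maps, positions, expert_actions):
    """One step: match the expert's action at each visited state."""
    logits = model(maps, positions)
    loss = F.softmax_cross_entropy(logits, expert_actions)
    model.cleargrads()
    loss.backward()
    opt.update()
    return float(loss.array)
```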
Results of Grid-World Domain
[Figure: predicted path, reward map, and sum-of-future-rewards map]
Mars Rover Navigation
Environment:
• Navigating the surface of Mars with a rover.
• The path is predicted from only the surface image, without obstacle information.
• The success rate is 90.3%.
Red points mark sharp elevation changes; at prediction time, VIN does not use the elevation shape information.
Continuous Control
Environment:
• Applied to a continuous control space.
• The grid size is 28x28.
• The input is position and velocity, given as floating-point data.
• The output is 2D continuous control parameters.
[Figure: comparison of the final distance to the goal]
(This result is from the authors' presentation.)
WebNav Challenge
Environment:
• Navigate website links to find a query.
• Features: average word embeddings.
• Uses an approximate graph for planning.
Evaluation:
• Success rate within the top-4 predictions.
• Test set 1: start from the index page.
• Test set 2: start from a random page.
Result:
Conclusion
Purpose:
• Machine learning based robot path planning.
Method:
• Learn the reward of each place and predict the action using the propagated rewards.
Result:
• VIN policies learn an approximate planning computation relevant for solving the task.
• Applicable to grid-worlds, continuous control, and even navigation of Wikipedia links.
Code: https://github.com/peisuke/vin (this code is implemented in Chainer!)
Twitter: @peisuke
We are hiring!! https://www.wantedly.com/companies/abeja