IEEE ROBOTICS AND AUTOMATION LETTERS. PREPRINT VERSION. ACCEPTED JANUARY, 2020

Learning to Walk a Tripod Mobile Robot Using Nonlinear Soft Vibration Actuators with Entropy Adaptive Reinforcement Learning

Jae In Kim*1, Mineui Hong*2, Kyungjae Lee2, DongWook Kim1, Yong-Lae Park1, and Songhwai Oh2

Abstract—Soft mobile robots have shown great potential in unstructured and confined environments by taking advantage of their excellent adaptability and high dexterity. However, there are several issues to be addressed in soft robots, such as actuating speed and controllability. In this paper, a new vibration actuator is proposed using the nonlinear stiffness characteristic of a hyperelastic material, which creates continuous vibration of the actuator. By integrating three of the proposed actuators, we also present an advanced soft mobile robot with high degrees of freedom of movement. However, since the dynamic model of a soft mobile robot is generally hard to obtain, it is difficult to design a controller for the robot. In this regard, we present a method to train a controller using a novel reinforcement learning (RL) algorithm called adaptive soft actor-critic (ASAC). ASAC gradually reduces a parameter called the entropy temperature, which regulates the entropy of the control policy. In this way, the proposed method can narrow down the search space during training and reduce the duration of the demanding data collection process in real-world experiments. To verify the robustness and controllability of our robot and the RL algorithm, experiments on zig-zagging path tracking and obstacle avoidance were conducted, and the robot successfully finished the missions with only an hour of training time.

Index Terms—Modeling, Control, and Learning for Soft Robots, Hydraulic/Pneumatic Actuators, Motion and Path Planning

Manuscript received: September 10, 2019; Revised: December 13, 2019; Accepted: January 12, 2020.

This letter was recommended for publication by Editor K.-J. Cho upon evaluation of the Associate Editor and Reviewers' comments. This work was supported in part by the National Research Foundation (NRF) Grant funded by the Korean Government (MSIT) (Grant No. NRF-2016R1A5A1938472), in part by the Technology Innovation Program (Grant No. 2017-10069072) funded by the Ministry of Trade, Industry & Energy (MOTIE), Korea, and in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) funded by the Korea Government (MSIT) (Grant No. 2019-0-01190, [SW Star Lab] Robot Learning: Efficient, Safe and Socially-Acceptable Machine Learning). (J. I. Kim and M. Hong equally contributed to this work. Corresponding authors: Y.-L. Park and S. Oh.)

1 J. I. Kim, D. Kim, and Y.-L. Park are with the Department of Mechanical and Aerospace Engineering, the Soft Robotics Research Center (SRRC), and the Institute of Advanced Machine Design (IAMD), Seoul National University, Seoul 08826, Republic of Korea (e-mails: {snu08mae, shigumchis, ylpark}@snu.ac.kr).

2 M. Hong, K. Lee, and S. Oh are with the Department of Electrical and Computer Engineering and ASRI, Seoul National University, Seoul 08826, Republic of Korea (e-mails: {mineui.hong, kyungjae.lee}@rllab.snu.ac.kr, [email protected]).

Digital Object Identifier (DOI): see top of this page.

Fig. 1. A tripod mobile robot consisting of vibration actuators, a DC motor, and a rotation plate.

I. INTRODUCTION

Soft mobile robots have great potential in the area of field robotics, since they can perform tasks that are difficult for conventional mobile robots, such as locomotion and navigation in unstructured and confined environments, by utilizing their high adaptability to their surroundings and the dexterity of manipulating their own bodies [1]. Among these, soft mobile robots with pneumatic actuators that provide relatively high force-to-weight ratios and durability have been widely employed due to their simple design and light weight [2]. Nevertheless, there is a limitation in deploying these robots in real-world missions, due to the relatively slow actuating speed of pneumatic actuators and the difficulty of control with traditional methods, such as feedback control [3], since it is hard to obtain accurate dynamic models of the robots.

To address these issues, a soft membrane vibration actuator was proposed in our prior work [4], composed of a soft membrane, a vibration shaft, and a rigid housing, for continuous vibration with a constant input pressure. We were able not only to increase the actuating speed with this actuator but also to demonstrate the controllability of a tripod mobile robot composed of these vibration actuators, by learning the dynamic model of the robot using Gaussian process regression (GPR), a non-parametric method convenient for modeling soft robots [5]. However, several challenges remain in the design and control of the actuator. First, the gap between the shaft and the hub inside the chamber, required for the vibration, caused air leakage during actuation, since the gap could not be removed or further reduced due to friction. Another issue is the reduced actuation performance when the actuator is in contact with an object, due to the exposed part of the shaft. In addition, the GPR used to model the dynamics of the robot is a supervised method, and therefore the training data had to be collected manually, which made the process expensive and time-consuming.


Fig. 2. The design of (a) a tripod robot and (b) a soft vibration actuator. (c) Internal design of the chamber housing represented using cross sections.

In this paper, we propose a new design of the vibration actuator by replacing the material of the rigid shaft with an elastomer with nonlinear stiffness. This solves the shaft friction and air leakage problems, since the actuator no longer requires a hub for the shaft. In addition, the soft shaft moves only inside the chamber, allowing continuous vibration regardless of contact with external objects. A new tripod mobile robot was built using the new actuators. For control, we utilize model-free reinforcement learning (RL) with neural networks as function approximators. Since RL autonomously prioritizes control actions based on their potential to obtain higher rewards during training, RL can efficiently learn to control a robot with complex dynamics, as demonstrated in [6], [7]. Specifically, we propose a maximum entropy RL method with entropy temperature adaptation. In maximum entropy RL algorithms [8], [9], Shannon entropy maximization has been employed to encourage exploration by promoting random behaviors. Since the data collected by maximum entropy RL cover a wide range of the state and control spaces, the learned feedback controller can be robust against unexpected situations. However, the maximum entropy framework may hamper the exploitation of the policy, as shown in [8], since it prevents convergence of the policy. To alleviate this issue while taking advantage of entropy maximization, we control the level of exploration by scheduling the entropy temperature, α. Thus, at the beginning of learning, our method collects a wide range of data, and it converges as learning progresses by gradually decreasing α. As a result, our algorithm can control the developed soft robot to follow a desired path after training with only 2,500 data instances, whereas 46,000 manually collected data instances were required to learn the whole dynamics model using GPR in the previous work.

II. DESIGN

A. Tripod Mobile Robot

We designed a new mobile robot to enhance the robustness and dexterity of the previous version [4], as shown in Fig. 2-(a). For robustness against external contacts, a new vibration actuator was designed, and three identical actuators were arranged to form an equilateral triangle (in the top view) of the robot in order to increase stability during ground contact.

For dexterity, the vibration amplitude of the actuator was increased by adding a 100 g weight to the top of each actuator. Also, a direct-current (DC) motor was installed at the center of the robot, combined with the rotating plate, to control the direction of rotation. As a result, the mobile robot is capable of making various motions, such as bi-directional rotations and translation, with a combination of the three vibration modes of the actuators. When one of the actuators is driven, the robot moves in the direction of that actuator. Also, the orientation of the robot can be controlled by rotating the motor. In addition, by driving the motor clockwise and vibrating the three actuators at a constant frequency, the robot rotates counterclockwise without translation, and vice versa.

B. Soft Vibration Actuator

As shown in Fig. 2-(b), the actuator consists of a chamber housing, a soft shaft, and a soft membrane. The soft shaft is coupled with the soft membrane, and the membrane is combined with the chamber housing. The chamber housing has an air inlet and an outlet, which are marked by the blue and red circles. The air flow through the chamber housing is shown in Fig. 2-(c) using the cross-sectional view of the chamber housing. In this new design, we solved the friction and air leakage problems of the previous actuator. The soft shaft no longer vibrates along the passageway of the chamber housing, and the head of the shaft is able to completely block the exhaust of the chamber housing. In addition, since the soft shaft is located inside the chamber housing, the actuator can vibrate continuously and robustly regardless of contact with external objects.

C. Vibration Mechanism of Actuator

The proposed actuator makes use of the nonlinear stiffness characteristic of a hyperelastic material (Ecoflex 30), which shows a nonlinear stress-strain behavior over large strains, to generate vibration. Fig. 3-(a) shows the initial state of the actuator. In this state, the exhaust of the chamber housing is closed due to the length constraint of the soft shaft. The soft membrane expands until the vertical distance from the soft membrane to the exhaust and the length of the soft shaft become the same, as shown in Fig. 3-(b). If the internal pressure were the same as the atmospheric pressure here, the exhaust valve would open. However, the flange of the soft shaft is pushed upward to close the exhaust by the relatively high internal pressure that has already contributed to the expansion of the soft membrane. With the exhaust closed, the soft membrane continues to expand, as shown in Fig. 3-(c), causing the soft shaft to begin to stretch. Due to the stress formed inside the soft shaft, the flange now begins to be pushed downward. As the membrane inflates further, the soft shaft is stretched further, so this downward force becomes stronger. Due to the nonlinear stiffness of the membrane and the shaft, there is a moment when the downward force becomes larger than the upward force. As a result, the exhaust opens and the upward force applied to the flange is canceled. Then, the stretched soft shaft quickly returns to its original state, as shown in Fig. 3-(d). Also, as the air inside the chamber escapes, the soft membrane returns to its initial state and the soft shaft closes the exhaust again. Through this process, the actuator continues to vibrate.

Fig. 3. Vibration sequence of the actuator. (a) The initial state: the chamber housing is closed. (b) The soft membrane is expanded according to the length of the soft shaft; the chamber is still closed due to the internal pressure. (c) The maximum inflation state, in which the upward force due to the internal pressure and the downward force due to the internal stress of the soft shaft are equal. (d) The deflating state: as the air escapes, the soft membrane and the soft shaft return to their original positions. The actuator vibrates by repeating (a) through (d).

D. Opening Principle of the Exhaust Using the Nonlinear Stiffness Characteristic of Hyperelastic Structures

In order to explain the opening principle of the exhaust, the relationship between the upward and downward forces of the soft shaft was analyzed. The nonlinear stiffness varies according to the shape of the hyperelastic structure. As shown in Fig. 4, the soft membrane is assumed to inflate into a partial sphere under the internal pressure p, and the soft shaft elongates with the inflation of the soft membrane.

The elastic strain energy density W in the Ogden model [10] (with µ1 = 0, µ2 = E/6) is given as

W(\lambda_1, \lambda_2, \lambda_3) = \frac{E}{24}\left(\lambda_1^4 + \lambda_2^4 + \lambda_3^4 - 3\right).   (1)

The Cauchy stress is given as

\sigma_i = \lambda_i \frac{\partial W}{\partial \lambda_i} - p^*,   (2)

where p^* is the hydrostatic pressure, which is determined from the boundary conditions, and \sigma_i and \lambda_i are the principal stress and principal stretch, respectively.

Fig. 4. (a) The initial state of the hyperelastic structure. (b) The inflated state under internal pressure p.

The hyperelastic material is assumed to be incompressible:

\lambda_1 \cdot \lambda_2 \cdot \lambda_3 = 1.   (3)

It is also assumed that the soft membrane is subjected to equibiaxial loading (\sigma_1 = \sigma_2 = \sigma, \sigma_3 = 0). From Equations (2) and (3), the principal stretches are

\lambda_1 = \lambda_2 = \lambda, \qquad \lambda_3 = \frac{1}{\lambda^2}.   (4)

Using the geometric relationships for the volume of a spherical cap, V_c = \frac{\pi L^2}{3}(3R_c - L), and the radius of curvature, R_c = \frac{L^2 + r_0^2}{2L}, the principal stretch \lambda is given as

\lambda = \frac{R_c \theta}{r_0} = \frac{L^2 + r_0^2}{2 L r_0} \arcsin\!\left(\frac{2 L r_0}{L^2 + r_0^2}\right),   (5)

where L is the inflation length of the actuator [11], r_0 is the initial radius of the membrane, and \theta = \arcsin(r_0 / R_c).

The total potential energy \Pi_{tot} is the strain energy, W \cdot V_0, minus the work done by the pressure, V_c \cdot p, where W is the elastic strain energy density, V_0 is the initial volume of the soft circular membrane, V_c is the volume of the spherical cap, and p is the internal pressure:

\Pi_{tot} = W \cdot V_0 - V_c \cdot p.   (6)

At static equilibrium,

\frac{\partial \Pi_{tot}(L, p)}{\partial L} = 0.   (7)

Solving Equation (7), we can finally find the inflation length L as a function of the internal pressure p:

L = f(p).   (8)

The soft shaft is assumed to be subjected to uniaxial loading (\sigma_1 = \sigma, \sigma_2 = \sigma_3 = 0). From Equations (2) and (3), the stretches and the stress of the shaft can be expressed as

\lambda_1 = \lambda(p) = \frac{L(p) + h_0}{h_0}, \qquad \lambda_2 = \lambda_3 = \frac{1}{\sqrt{\lambda}}, \qquad \sigma(p) = \frac{E}{6}\left(\lambda(p)^4 - \frac{1}{\lambda(p)^2}\right),   (9)

where h_0 is the initial length of the soft shaft.

Finally, we can find the upward force F_{upward} and the downward force F_{downward} acting on the flange from the internal pressure p and the stress \sigma(p) of the shaft, respectively, as shown in Fig. 4:

F_{upward} = p \times A_{red},   (10)

F_{downward} = \sigma(p) \times A_{blue}.   (11)

From Equations (10) and (11), it can be noted that F_{downward} increases more rapidly than F_{upward}, since the increase in the stress \sigma(p) is larger than that in p. Therefore, due to this nonlinear characteristic of the soft structure with respect to the internal pressure p, the exhaust opens when F_{downward} becomes greater than F_{upward}. From this principle, the actuator can generate vibration.
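To make the analysis concrete, the sketch below evaluates Equations (5)–(11) numerically: it solves the equilibrium condition (7) for the inflation length L at a given pressure and compares the resulting upward and downward forces. All material and geometric values are illustrative assumptions, not the measured parameters of the fabricated actuator, so the tabulated forces (and whether or where they cross over) are only indicative.

```python
# Minimal numerical sketch of Eqs. (5)-(11). All material/geometry values are
# illustrative assumptions, not the measured parameters of the actuator.
import numpy as np
from scipy.optimize import minimize_scalar

E = 70e3          # Young's modulus of the elastomer [Pa] (assumed)
r0 = 0.02         # initial membrane radius [m] (assumed)
t0 = 0.002        # membrane thickness [m] (assumed)
h0 = 0.015        # initial soft-shaft length [m] (assumed)
V0 = np.pi * r0**2 * t0            # initial membrane volume (thin-disc assumption)
A_RED, A_BLUE = 3e-5, 1e-5         # flange areas of Fig. 4 [m^2] (assumed)

def membrane_stretch(L):
    """Equibiaxial stretch of the inflated membrane, Eq. (5)."""
    Rc = (L**2 + r0**2) / (2.0 * L)
    return Rc / r0 * np.arcsin(2.0 * L * r0 / (L**2 + r0**2))

def strain_energy_density(lam):
    """Ogden model with mu1 = 0, mu2 = E/6 under equibiaxial loading, Eqs. (1), (4)."""
    return E / 24.0 * (2.0 * lam**4 + lam**-8 - 3.0)

def total_potential(L, p):
    """Eq. (6): strain energy minus pressure work over the spherical cap."""
    Rc = (L**2 + r0**2) / (2.0 * L)
    Vc = np.pi * L**2 / 3.0 * (3.0 * Rc - L)
    return strain_energy_density(membrane_stretch(L)) * V0 - Vc * p

def inflation_length(p):
    """L = f(p) from the equilibrium condition (7). Eq. (5) uses the principal
    arcsin branch, so the search is limited to caps up to a hemisphere (L <= r0)."""
    res = minimize_scalar(total_potential, args=(p,), bounds=(1e-6, r0), method="bounded")
    return res.x

def forces(p):
    """Eqs. (9)-(11): upward (pressure) and downward (shaft-stress) forces on the flange."""
    lam_shaft = (inflation_length(p) + h0) / h0
    sigma = E / 6.0 * (lam_shaft**4 - lam_shaft**-2)
    return p * A_RED, sigma * A_BLUE

# Sweep the pressure and tabulate the two forces; where (or whether) F_down
# overtakes F_up depends entirely on the assumed geometry and material.
for p in np.linspace(5e3, 1.0e5, 5):
    f_up, f_down = forces(p)
    print(f"p = {p/1e3:5.1f} kPa   L = {inflation_length(p)*1e3:5.2f} mm   "
          f"F_up = {f_up:6.3f} N   F_down = {f_down:6.3f} N")
```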

E. Effect of Weight

If an additional weight is applied to the actuator, the vertical displacement of the chamber due to inflation will be smaller than that without the weight. In other words, a higher pressure is required to inflate the actuator to the same volume with a weight, which means an increase in the upward force applied to the flange of the soft shaft, so an additional downward force is required to open the exhaust. As a result, if a weight is added to the top of the actuator, the maximum inflation displacement of the actuator increases, as shown in Fig. 5.

Fig. 5. Maximum inflation state (a) without a weight and (b) with a weight. A weight increases the inflation length of the actuator.

III. LEARNING FOR CONTROL

To control the proposed robot, it is essential to design a feedback controller. However, the complex dynamics of the robot make it difficult to design a controller explicitly. In particular, a motion generated by the soft membrane vibration actuators is affected by multiple complex physical properties of the environment, such as the pneumatic lines, the friction and flatness of the ground, and the surface condition of the membrane. Analyzing a dynamic system considering all these factors is a demanding task, if not impossible. To handle this issue, we employ a model-free RL method, which can learn the parameters of a feedback controller even if the dynamics model is unknown. Furthermore, since it is difficult to collect a large amount of data directly from the robot in the real world, the learning algorithm needs to be sample efficient while exploring enough of the search space to find an optimal controller for the robot. In this regard, we utilize a novel RL method, which is sample efficient and robust to hyperparameters, to learn a feedback controller without knowledge of the dynamics model.

A. Problem Formulation

Our controller aims at making the robot move to the desired position, while maintaining its orientation toward the target. For that, we first define a state space, which represents the current status of the robot, and an action space, the possible control inputs that can be chosen by the controller. The proposed robot has three soft membrane vibration actuators and one motor for controlling the angular momentum of the robot. Hence, our action space (or control input) is defined as a 4D vector a_t = [p_1, p_2, p_3, δM], where p_i indicates the input pressure of each vibration actuator, M indicates the angular velocity of the motor, and δM is the change in the motor speed. Note that, since the motor has a control delay, a drastic change of the motor signal may induce unstable motion and inconsistent movements. Hence, we smoothly change the motor speed by controlling δM rather than directly controlling M. The state of the feedback controller is then defined as s_t := [δθ_t, d_t, M_t], where δθ_t := θ_g − θ_t is the angular difference between the heading and the goal direction, d_t := \sqrt{x_t^2 + y_t^2} is the Euclidean distance to the desired position, and M_t is the current motor speed.

We define the feedback controller as a Gaussian policy function, (µ_t, σ_t) = f_φ(s_t), where µ_t and σ_t are the mean and standard deviation of a Gaussian distribution, and φ is the parameter of the controller. In particular, we model f_φ as a neural network, which has shown high performance in modeling complex nonlinear functions. Since the robot needs to explore the state and action spaces during the training phase, we sample a control input from the Gaussian policy as

a_t \sim \pi_\phi(a_t|s_t) := \mathcal{N}(a_t; f_\phi(s_t)).   (12)

In the testing phase, the mean µ_t is used as the feedback control. We also design a reward function r(s_t), which assigns a higher score as a control reduces the gap between the robot's current state and the desired state:

r(s_t) := -\sqrt{\delta x_t^2 + \delta y_t^2} - \delta\theta_t + c,   (13)

where c is an alive reward (in the experiments, c = 2 is used).

During the learning phase, the robot starts from an initial state s_0 ∼ d(s_0), samples an action a_t from the Gaussian controller π_φ(·|s_t), and executes the sampled control. As a result of the control, the robot transitions to the next state s_{t+1} according to the unknown dynamics P(s_{t+1}|a_t, s_t) and receives a reward r_{t+1} = r(s_{t+1}). As we sequentially control the robot, a trajectory of states, actions, and rewards τ = (s_0, a_0, s_1, r_1, a_1, s_2, r_2, ···) is generated. Finally, the purpose of an RL algorithm is to maximize the following objective function (also known as the expected return):

\max_{\pi \in \Pi} \; \mathbb{E}_{\tau \sim P, \pi}\!\left[\sum_{t=1}^{\infty} \gamma^t r_t\right],   (14)


by updating the parameter φ based on sampled trajectories, where γ ∈ (0, 1) is a discount factor. If the robot achieves the maximum expected return, it indicates that we have found an optimal feedback controller.
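For concreteness, a minimal sketch of how the state s_t, the reward of Equation (13), and the delta-control of the motor speed could be computed from the robot pose is given below. The pose is assumed to come from the overhead-camera marker detection described in Section IV-A, and the function names are hypothetical.

```python
# Sketch of the state and reward of Sec. III-A / Eq. (13). The pose (x, y in
# the goal frame, heading theta) is assumed to come from the overhead-camera
# marker detection of Sec. IV-A; the function names here are hypothetical.
import numpy as np

C_ALIVE = 2.0   # alive reward c used in the experiments

def make_state(x, y, theta, theta_goal, motor_speed):
    """State s_t = [delta_theta, d, M] as defined in Sec. III-A."""
    delta_theta = theta_goal - theta     # angular difference to the goal direction
    d = np.hypot(x, y)                   # Euclidean distance to the goal
    return np.array([delta_theta, d, motor_speed], dtype=np.float32)

def reward(dx, dy, delta_theta, c=C_ALIVE):
    """Reward of Eq. (13), as written: closer and better aligned -> higher reward."""
    return -np.hypot(dx, dy) - delta_theta + c

def apply_action(action, motor_speed):
    """Split a 4D action [p1, p2, p3, dM] into actuator pressures and the
    smoothly updated motor speed (delta-control of M, Sec. III-A)."""
    p1, p2, p3, d_m = action
    return (p1, p2, p3), motor_speed + d_m

# Example: robot 20 cm from the goal, heading 0.3 rad away from it, motor idle.
s = make_state(x=0.12, y=0.16, theta=0.0, theta_goal=0.3, motor_speed=0.0)
print(s, reward(dx=0.12, dy=0.16, delta_theta=0.3))
```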

B. Maximum Entropy Reinforcement Learning Preliminaries

Maximum entropy reinforcement learning [12]–[14] maximizes both the sum of expected rewards and the Shannon entropy of the policy distribution, i.e., H(π_φ(·|s)) = \mathbb{E}_{a \sim \pi_\phi}[-\log(\pi_\phi(a|s))], which is also known as an entropy bonus. An optimal policy π*_α of maximum entropy RL is defined by

\pi^*_\alpha := \arg\max_\pi \; \mathbb{E}_{\tau \sim P, \pi}\!\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) + \alpha H(\pi(\cdot|s_t))\right)\right],   (15)

where r(s, a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}[r(s')], and the entropy temperature α ∈ [0, ∞) is a parameter that determines the relative importance of the entropy with respect to the reward. Note that the objective of π*_α depends on the temperature α: if α = 0, the objective of (15) becomes the original objective (14); on the contrary, if α is large, the robot tries random actions to maximize the Shannon entropy.

When the dynamic model P is fully known, Haarnoja et al. [8] proposed soft policy iteration, which can achieve the optimal policy of (15). In soft policy iteration, the soft state value V^π_α and the soft state-action value (or soft action value) Q^π_α are defined as

Q^\pi_\alpha(s, a) := \mathbb{E}_{\tau \sim P, \pi}\!\left[r(s_0, a_0) + \gamma V^\pi_\alpha(s_1) \,\middle|\, s_0 = s, a_0 = a\right],

V^\pi_\alpha(s) := \mathbb{E}_{\tau \sim P, \pi}\!\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) + \alpha H(\pi(\cdot|s_t))\right) \,\middle|\, s_0 = s\right],   (16)

where these values indicate the expected sum of rewards and entropy. Soft policy iteration iteratively evaluates Q^π_α and updates π based on Q^π_α. Furthermore, it has been shown in [12] that soft policy iteration converges to the optimality condition of (15), which is given by

\pi^*_\alpha(a|s) = \frac{\exp\!\left(\frac{1}{\alpha} Q^*_\alpha(s, a)\right)}{\int_\mathcal{A} \exp\!\left(\frac{1}{\alpha} Q^*_\alpha(s, a')\right) da'},   (17)

which is called the soft Bellman optimality (SBO) equation [15].

In [8], Haarnoja et al. extended soft policy iteration to the soft actor-critic (SAC) method, which can be applied to RL problems. SAC has benefits in terms of exploration and empirically shows good sample efficiency. However, a disadvantage is that the algorithm is especially sensitive to the entropy temperature α, as mentioned in [8]. Since the temperature must be tuned manually for each learning task, it is difficult to apply the algorithm to learning in real-world experiments.
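The role of the temperature in Equation (17) can be illustrated with a small numerical example over a discretized action set (the Q-values below are arbitrary): as α grows, the SBO policy flattens toward a uniform distribution, and as α shrinks, it concentrates on the greedy action.

```python
# Effect of the entropy temperature alpha on the SBO policy of Eq. (17),
# shown on a discretized action set with arbitrary illustrative Q-values.
import numpy as np

def sbo_policy(q_values, alpha):
    """pi*_alpha(a|s) proportional to exp(Q(s, a) / alpha), normalized over actions."""
    logits = q_values / alpha
    logits = logits - logits.max()       # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

q = np.array([1.0, 0.8, 0.2, -0.5])      # soft action values for one state (illustrative)
for alpha in (1.0, 0.2, 0.01):
    print(f"alpha = {alpha:>4}: {np.round(sbo_policy(q, alpha), 3)}")
# Large alpha -> near-uniform policy (exploration); alpha -> 0 -> greedy on argmax Q.
```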

C. Entropy Temperature Adaptation

While SAC empirically showed that a large entropy helps exploration, there exists a drawback of entropy maximization. From the optimality condition (17), we can observe that if α goes to zero, the optimal policy of the maximum entropy RL problem (15) converges to the optimal policy of the original RL problem (14), since the effect of the entropy decreases as α goes to zero. From this observation, it follows that, by gradually reducing the entropy temperature to zero, we can recover the original optimal policy. Hence, we schedule the entropy temperature to decrease from an initial value to zero.

When it comes to the reduction of α, we should consider the estimation error of Q^π_α. As mentioned in Section III-B, in the RL problem, due to the absence of the dynamics model, the expectation in (16) is intractable. Thus, we estimate V^π_α and Q^π_α by training neural networks, similarly to other existing methods [6], [8], [12], [16]. Since we update π based on Q^π_α, if we drastically reduce α to zero, the policy can become too greedy with respect to mis-estimated values and cannot explore the action space thoroughly to find a better policy. Therefore, we present a method to schedule the temperature using the trust region method [16]. The trust region method ensures a monotonic improvement of performance by limiting the Kullback-Leibler (KL) divergence between the old and new policies with a threshold δ. Applying the concept of the trust region to our approach, we can expect the algorithm to automatically adjust the temperature by an amount that does not hamper sufficient exploration.

The proposed method consists of two parts: first, for a given α_m, we obtain a near-optimal policy π*_{α_m} by running SAC; second, α_m is reduced using the trust region method. Hence, the policy learned by SAC converges to π*_{α_m} and, as α_m decreases, π*_{α_m} converges to π*, the optimal policy of the original RL problem. Note that SAC converges to π*_{α_m} within a small number of iterations in practice; the detailed settings can be found in Section IV.

Now, let Q^π_α denote an estimated soft action value function of a policy π, Q^π = Q^π_α|_{α=0} denote an estimated state-action value function without considering the entropy, and ρ^π(s) = (1 − γ)\sum_{t=0}^{\infty} \gamma^t P(s_t = s) denote the discounted visitation frequency of a state s. Then, we update α_m by solving the following optimization problem:

\underset{\alpha_{m+1}}{\text{maximize}} \;\; \mathbb{E}_{s \sim \rho^{\pi_{\alpha_m}},\, a \sim \pi_{\alpha_m}}\!\left[\frac{\pi_{\alpha_{m+1}}(a|s)}{\pi_{\alpha_m}(a|s)} Q^{\pi_{\alpha_m}}(s, a)\right] \quad \text{subject to} \;\; \mathbb{E}_{s \sim \rho^{\pi_{\alpha_m}}}\!\left[D_{KL}(\pi_{\alpha_m} \| \pi_{\alpha_{m+1}})\right] \leq \delta,   (18)

where D_{KL}(\pi_{\alpha_m} \| \pi_{\alpha_{m+1}}) indicates the KL-divergence, defined as \int_\mathcal{A} \pi_{\alpha_m}(a|s) \log\frac{\pi_{\alpha_m}(a|s)}{\pi_{\alpha_{m+1}}(a|s)}\,da, which measures the difference between the two policies π_{α_m} and π_{α_{m+1}} for a state s. Note that π_{α_m} indicates the optimal policy of (15) when α = α_m, which is obtained by SAC. Now, note that

\frac{d}{d\alpha_{m+1}} \mathbb{E}\!\left[\frac{\pi_{\alpha_{m+1}}(a|s)}{\pi_{\alpha_m}(a|s)} Q^{\pi_{\alpha_m}}(s, a)\right] = -\frac{1}{\alpha_m^2} \mathbb{E}\!\left[\left(Q^{\pi_{\alpha_m}}(s, a) - V^{\pi_{\alpha_m}}(s)\right)^2\right] \leq 0,   (19)

where V^{\pi_{\alpha_m}}(s) = \int_\mathcal{A} Q^{\pi_{\alpha_m}}(s, a)\,\pi_{\alpha_m}(a|s)\,da, and \mathbb{E}[\cdot] denotes \mathbb{E}_{s \sim \rho^{\pi_{\alpha_m}},\, a \sim \pi_{\alpha_m}}[\cdot] for simplicity. Equation (19) means that the objective always increases as α_{m+1} decreases. Thus, α_{m+1} lies on the boundary of the KL constraint, and we can solve Equation (18) with a quadratic approximation of the KL-divergence using a Taylor expansion:

\mathbb{E}\!\left[D_{KL}(\pi_{\alpha_m} \| \pi_{\alpha_{m+1}})\right] \approx \frac{(\alpha_{m+1} - \alpha_m)^2}{2\alpha_m^4} \mathbb{E}\!\left[\left(Q^{\pi_{\alpha_m}}(s, a) - V^{\pi_{\alpha_m}}(s)\right)^2\right] = \delta.   (20)

Then, we can get α_{m+1} as

\alpha_{m+1} = \alpha_m - \alpha_m^2 \sqrt{\frac{2\delta}{\mathbb{E}\!\left[A^{\pi_{\alpha_m}}(s, a)^2\right]}},   (21)

where A^{\pi_{\alpha_m}}(s, a) = Q^{\pi_{\alpha_m}}(s, a) - V^{\pi_{\alpha_m}}(s).

Furthermore, we theoretically prove that the policy converges to the optimal policy when applying SAC with the sequence of temperatures {α_m}.

Theorem 1: Consider a sequence of coefficients {α_m} generated from Equation (21). Then, repeated application of soft policy iteration with {α_m}, from any initial policy π_0, converges to an optimal policy π*.

Theorem 1 indicates that scheduling the temperature with the proposed method ensures not only the improvement of the performance but also the optimality of the algorithm. The detailed derivations of all the equations and theorems in this section are included in the supplementary material [17].
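A minimal sketch of the resulting temperature update is given below: given a batch of advantage estimates A = Q − V from the auxiliary networks, the new α follows directly from Equation (21). The non-negativity clamp and the numerical values of δ and the batch are our own illustrative choices.

```python
# Trust-region temperature update of Eq. (21). `advantages` holds samples of
# A(s, a) = Q(s, a) - V(s) estimated with the auxiliary networks of Sec. III-D.
import numpy as np

def update_temperature(alpha, advantages, delta):
    """alpha_{m+1} = alpha_m - alpha_m^2 * sqrt(2 * delta / E[A^2])   (Eq. 21)."""
    mean_sq_adv = float(np.mean(np.square(advantages)))
    if mean_sq_adv <= 0.0:
        return alpha                     # degenerate batch: leave alpha unchanged
    new_alpha = alpha - alpha**2 * np.sqrt(2.0 * delta / mean_sq_adv)
    return max(new_alpha, 0.0)           # non-negativity clamp (our own safeguard)

# Example with an arbitrary batch of advantage samples and an illustrative delta.
alpha = 0.2                              # initial temperature used in the experiments
adv_batch = np.random.default_rng(0).normal(scale=0.5, size=256)
print(update_temperature(alpha, adv_batch, delta=0.01))
```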

D. Algorithm

In this section, we present an actor-critic algorithm, called adaptive soft actor-critic (ASAC), which schedules the entropy temperature using the proposed trust region method. To handle continuous state and action domains, ASAC maintains seven networks to model value and policy functions: a soft action value Q^α_{θ_{1,2}}, a soft state value V^α_ψ, a target state value V^α_{ψ̄}, an action value Q_λ, a state value V_ω, and a policy π_φ (subscripts denote the network parameters of each function). The soft action value and soft state value functions are needed to update the policy, and the action value and state value functions are needed to update the temperature α. Also, we utilize a replay buffer D, which stores every transition (s_t, a_t, r_{t+1}, s_{t+1}) obtained by interacting with the environment. The overall procedure is summarized in Algorithm 1.

Algorithm 1 Adaptive Soft Actor-Critic
  Initialize parameter vectors ψ, ψ̄, θ_i, φ, λ, ω, the entropy coefficient α, and the replay buffer D.
  for each iteration do
      for each environment step do
          Sample a transition {s_t, a_t, r(s_t, a_t), s_{t+1}} and store it in the replay buffer D.
      end
      for each gradient step do
          Minimize J_{V^α}(ψ), J_{Q^α}(θ_{1,2}), J_Q(λ), J_V(ω), and J_π(φ) using stochastic gradient descent.
          ψ̄ ← (1 − τ)ψ̄ + τψ
      end
      if π_φ has converged then
          Update α with the trust region method.
      end
  end
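The following sketch mirrors the control flow of Algorithm 1. The env, agent, and replay_buffer objects and their methods are hypothetical placeholders standing in for the seven networks and the losses defined below; only the loop structure and the convergence-triggered α update follow the algorithm.

```python
# High-level control flow of Algorithm 1 (ASAC). The `env`, `agent`, and
# `replay_buffer` interfaces are hypothetical placeholders; the networks and
# losses they hide are those of Sec. III-D, and the alpha update follows Eq. (21).
def asac_train(env, agent, replay_buffer, num_iterations,
               env_steps_per_iter, grad_steps_per_iter, conv_threshold):
    prev_policy_loss = None
    for _ in range(num_iterations):
        # Collect transitions with the current stochastic policy.
        state = env.reset()
        for _ in range(env_steps_per_iter):
            action = agent.sample_action(state)
            next_state, reward, done = env.step(action)
            replay_buffer.add(state, action, reward, next_state)
            state = env.reset() if done else next_state

        # Gradient steps on J_{V^alpha}, J_{Q^alpha}, J_Q, J_V, and J_pi,
        # followed by the exponential-moving-average target update.
        for _ in range(grad_steps_per_iter):
            agent.update_critics_and_policy(replay_buffer.sample())
            agent.update_target_value_network()

        # If the policy objective has stopped improving, shrink alpha (Eq. 21).
        policy_loss = agent.policy_objective(replay_buffer.sample())
        if prev_policy_loss is not None:
            # abs() in the denominator is our own safeguard for negative objectives.
            rel_improvement = (prev_policy_loss - policy_loss) / abs(prev_policy_loss)
            if rel_improvement < conv_threshold:
                agent.update_temperature()   # trust-region alpha update
        prev_policy_loss = policy_loss
    return agent
```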

The objective functions of the soft action value and the soft state value are defined in the same way as in SAC [8]:

J_{V^\alpha}(\psi) = \mathbb{E}_{\mathcal{D}}\!\left[\frac{1}{2}\left(V^\alpha_\psi(s) - \mathbb{E}_{a \sim \pi_\phi}\!\left[\min_i Q^\alpha_{\theta_i}(s, a) - \alpha \log \pi_\phi(a|s)\right]\right)^2\right],

J_{Q^\alpha}(\theta_i) = \mathbb{E}_{\mathcal{D}}\!\left[\frac{1}{2}\left(Q^\alpha_{\theta_i}(s, a) - \left(r + \gamma V^\alpha_{\bar\psi}(s')\right)\right)^2\right].

The target value network parameter ψ̄ is then updated as an exponential moving average of the value network parameter ψ.

Also, we model the policy function as a hyperbolic tangent of a Gaussian random variable, i.e., a := f_φ(s, ε) = tanh(µ_φ(s) + ε σ_φ(s)), where µ_φ(s) and σ_φ(s) are the outputs of π_φ, and ε ∼ N(0, I). To approximate soft policy iteration, the policy function is trained to minimize the expected KL-divergence given by

\mathbb{E}_{s \sim \mathcal{D}}\!\left[D_{KL}\!\left(\pi_\phi(\cdot|s) \,\middle\|\, \frac{\exp\!\left(\frac{Q^\alpha_\theta(s, a)}{\alpha}\right)}{\int_\mathcal{A} \exp\!\left(\frac{Q^\alpha_\theta(s, a')}{\alpha}\right) da'}\right)\right].   (22)

Then, using the reparameterization trick as in [8], minimizing (22) can be changed to minimizing the following objective:

J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D},\, \epsilon \sim \mathcal{N}}\!\left[\alpha \log \pi_\phi(f_\phi(\epsilon; s)|s) - Q^\alpha_{\theta_1}(s, f_\phi(\epsilon; s))\right].   (23)
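A sketch of this squashed-Gaussian policy and the reparameterized objective (23) in PyTorch is shown below. The hidden size follows the single 300-unit layer reported in Section IV-B, while the log-standard-deviation head, its clamping range, and the soft_q1 callable are assumptions; the tanh log-determinant correction follows the standard SAC implementation [8].

```python
# Sketch of the squashed-Gaussian policy and the reparameterized policy loss of
# Eq. (23) in PyTorch. Hidden size follows Sec. IV-B; the log-std head, its
# clamping range, and the `soft_q1` callable are assumptions.
import torch
import torch.nn as nn

class TanhGaussianPolicy(nn.Module):
    def __init__(self, state_dim=3, action_dim=4, hidden=300):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, action_dim)
        self.log_std_head = nn.Linear(hidden, action_dim)

    def sample(self, state):
        h = self.body(state)
        mu, log_std = self.mu_head(h), self.log_std_head(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                       # reparameterized draw: mu + eps * sigma
        a = torch.tanh(u)                        # squash to the bounded action range
        # log pi(a|s) with the tanh change-of-variables correction, as in SAC [8].
        log_prob = dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)
        return a, log_prob.sum(dim=-1, keepdim=True)

def policy_loss(policy, soft_q1, states, alpha):
    """J_pi(phi) of Eq. (23): E[alpha * log pi(a|s) - Q^alpha_theta1(s, a)]."""
    actions, log_probs = policy.sample(states)
    return (alpha * log_probs - soft_q1(states, actions)).mean()
```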

Furthermore, we estimate V^{π_φ} and Q^{π_φ} to compute (21) by adding two networks, V_ω and Q_λ, which are trained to minimize the squared residual errors:

J_V(\omega) = \mathbb{E}_{s \sim \mathcal{D}}\!\left[\frac{1}{2}\left(V_\omega(s) - \mathbb{E}_{a \sim \pi_\phi}[Q_\lambda(s, a)]\right)^2\right],

J_Q(\lambda) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\!\left[\frac{1}{2}\left(Q_\lambda(s, a) - \left(r + \gamma V_\omega(s')\right)\right)^2\right].   (24)
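A corresponding sketch of the two auxiliary losses in (24) is given below, reusing the policy interface sketched after Equation (23); the discount factor value and the one-sample estimate of the inner expectation are our own simplifications.

```python
# Sketch of the auxiliary (entropy-free) value losses of Eq. (24) in PyTorch,
# reusing the policy interface sketched after Eq. (23). `q_net` (Q_lambda) and
# `v_net` (V_omega) are assumed MLPs; gamma and the one-sample expectation are
# our own simplifications.
import torch
import torch.nn.functional as F

def auxiliary_value_losses(q_net, v_net, policy, batch, gamma=0.99):
    s, a, r, s_next = batch                       # tensors sampled from the replay buffer D
    with torch.no_grad():
        a_pi, _ = policy.sample(s)                # actions drawn from the current policy
        v_target = q_net(s, a_pi)                 # one-sample estimate of E_{a~pi}[Q_lambda(s, a)]
        q_target = r + gamma * v_net(s_next)      # bootstrapped target for Q_lambda
    j_v = 0.5 * F.mse_loss(v_net(s), v_target)    # J_V(omega)
    j_q = 0.5 * F.mse_loss(q_net(s, a), q_target) # J_Q(lambda)
    return j_v, j_q
```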

Finally, we determine whether the policy has converged by monitoring the change in J_π(φ). If the relative change in J_π(φ) is less than a threshold Δ, i.e., \frac{J_\pi(\phi_{old}) - J_\pi(\phi_{new})}{J_\pi(\phi_{old})} < \Delta, we assume that π_φ has converged and decrease the temperature using (21). In practice, the proposed criterion can be satisfied not only when the policy converges to π*_α, but also when it struggles to find a better policy. In that case, reducing the coefficient leads to more efficient exploration and helps the policy escape from a sub-optimal policy.

IV. EXPERIMENT

A. Platform Setup

Our experimental setup consists of a single workstation, a camera, and the proposed tripod robot. For the workstation, we use an Intel Core i5-6600 quad-core CPU and a Titan X GPU for learning the network parameters. Furthermore, an Intel RealSense D435 camera, attached at a height of one meter above the robot, is used for sensing the robot. As the camera captures an image of the robot and passes it to the workstation, the position and heading direction of the robot are extracted by detecting colored markers on the robot. To control the robot, the pneumatic pressure is controlled by a pressure regulator (SMC ITV2050). Fig. 6 shows how the components of our experimental setup interact with each other.


Fig. 6. Given a state of the robot extracted from the image taken by the camera, the next action is sampled from the policy of the controller. A replay buffer stores transitions, which are used for training the controller.

B. Reinforcement Learning Setup

In the real-robot experiments, we train a feedback controller for the proposed robot using the reward function defined in (13). We compare the proposed method to SAC with automatic entropy adjustment (AEA) [6] and to SAC with fixed temperatures (α = 0.01 and α = 0.2). SAC-AEA also controls the temperature α of the entropy, where the temperature is adjusted by keeping the entropy greater than a predefined threshold. SAC with α = 0.01 and α = 0.2 is used to verify the effect of the entropy.

All value and policy functions of all algorithms are parameterized with a single hidden layer of 300 units, and we use the Adam optimizer [18] to learn the parameters. The initial α of ASAC is set to 0.2 based on a simulation experiment, whose result is included in the supplementary material [17], and the target entropy of SAC-AEA is set to the negative of the action dimension, as suggested in [6]. For all algorithms, we train each controller for 50 episodes. For each episode, a goal point is sampled about 20 centimeters away from the robot in a uniformly random direction within ±π/4 rad, and each controller is required to move the robot to the goal within 50 steps. At each step, the controller samples an action and executes it for one second, i.e., a 1 Hz control frequency. Therefore, a total of 2,500 steps of control actions (≈50 minutes) are used for training.

C. Robustness, Accuracy, and Success Rate Index

We compare the performance of the feedback controllers learned by four algorithms: SAC with α = 0.01, SAC with α = 0.2, SAC-AEA, and ASAC. For the comparative evaluation, we let each controller track goal points placed in 20 different directions (from −π/2 to π/2 rad). Ten of the target points in the evaluation set are in directions more than π/4 (or less than −π/4) away from the initial heading of the robot, which are not included in the training episodes. Through this experiment, the robustness of the learned controllers can be verified by evaluating them on these unexperienced tasks. The accuracy of the controllers is evaluated by measuring the root mean square (RMS) of the distance between the goal and the robot, d = \sqrt{\delta x^2 + \delta y^2}, and of the angular difference between the heading and the goal direction, δθ = θ_g − θ_t, over the whole episodes. In addition, we measure the success rate, where the robot is assumed to succeed in tracking when it reaches within three centimeters of the target point.

TABLE I
COMPARISON OF THE PERFORMANCE OF THE LEARNED CONTROLLERS

Algorithm         RMS(d) (cm)   RMS(δθ) (rad)   Success Rate (|δθ_0| ≤ π/4)   Success Rate (|δθ_0| > π/4)
SAC (α = 0.2)         8.56           1.15                  0.80                          0.80
SAC (α = 0.01)       11.64           0.80                  0.80                          0.30
SAC-AEA               8.54           0.78                  0.90                          1.00
ASAC (ours)           7.48           0.56                  1.00                          1.00

D. Robot Controllability

To show the controllability and dexterity of the proposed tripod robot for advanced missions with the controller learned by our algorithm, we demonstrate a zig-zagging path-following experiment and an obstacle-avoidance experiment. A path is given by a human or a planning algorithm, and the robot tracks the given waypoints in order.

V. RESULTS

A. Learning Feedback Controller

We evaluate the return, i.e., the cumulative sum of rewards R = \sum_{t=1}^{T} r_t, to check the performance of the controller during training. Fig. 7(a) shows how the sum of rewards changes as the number of sampled transitions increases for each learning algorithm; ASAC shows the fastest convergence and the highest sum of rewards compared to the other algorithms. In particular, the controller learned by ASAC was able to reach any target point with only about 1,500 steps (≈30 minutes) of training. Furthermore, as the controller is trained for more episodes, the movement of the robot becomes more stable and faster. Since ASAC gradually reduces the influence of the entropy term as α decreases, ASAC shows stable convergence. However, since SAC-AEA constrains the entropy, it hampers increasing the accuracy of the feedback controller. In this regard, we can conclude that ASAC makes the controller more precise with fewer training steps, while the constrained entropy in SAC-AEA hampers accurate control.

B. Robustness, Accuracy, and Success Rate of Learned Feedback Controllers

The results of the experiments for evaluating the performance of the learned controllers are shown in Table I. The accuracy of the controllers is measured as the root mean squared difference from the desired position (RMS(d)) and heading angle (RMS(δθ)). We also present the success rates for both the experienced (|δθ_0| ≤ π/4) and unexperienced (|δθ_0| > π/4) scenarios.

As a result, ASAC shows the best accuracy and success rate over the other algorithms. In particular, ASAC shows a 100% success rate even for the unexperienced scenarios, while SAC with less entropy maximization (α = 0.01) shows a success rate of only 30%. This result indicates that ASAC takes advantage of maximizing entropy, in that it explores diverse policies and is robust under unexpected situations, and it also makes the controller more accurate by automatically reducing the entropy in the end.

[Fig. 7(a) plot: x-axis "Sampled Transitions" (0–2500), y-axis "Sum of Reward" (0–80); curves: ASAC, SAC-AEA, SAC (α = 0.01), SAC (α = 0.2).]

Fig. 7. (a) Comparison of the performance of the controllers trained using SAC with fixed temperatures (α = 0.2, 0.01), SAC-AEA, and ASAC. The cumulative sum of rewards is evaluated every 10 episodes, and five different scenarios are used for each evaluation. We repeated the whole training process five times for each algorithm; the mean value is plotted as a solid line and one standard deviation as a shaded region. (b) The tripod robot follows the zig-zagging path. (c) The tripod robot avoids the obstacle by following the planned path. In (b) and (c), blue circles are planned waypoints, a green line represents the trajectory of the robot, and red arrows show the heading direction of the robot at each point.

C. Controllability

In the zig-zagging path tracking and obstacle avoidance experiments, we used the RRT* algorithm [19] for planning the path, and the controller trained using ASAC was used to control the robot. The zig-zagging path tracking is a challenging task, since the robot has to consistently change its heading direction by more than 90 degrees. Fig. 7(b) shows that the robot can follow the given path represented by waypoints, which indicates that the robot driven by the learned controller has high controllability over its direction. Also, as shown in Fig. 7(c), the robot was able to dexterously avoid the obstacle block while tracking the planned path and keeping its heading direction toward each waypoint. Videos of the zig-zagging path tracking and obstacle avoidance experiments are presented in our video submission.

VI. CONCLUSION

In this paper, a new pneumatic vibration actuator was developed that utilizes the nonlinear stiffness characteristics of a hyperelastic material to ensure vibration stability and robustness against the external environment. Based on this actuator, we proposed an advanced soft mobile robot capable of orientation control, which was not possible in our previous work. To control the robot, we presented a reinforcement learning algorithm called adaptive soft actor-critic (ASAC), which provides an efficient exploration strategy and is adaptive to various control tasks. As a result, the feedback controller trained by ASAC not only accurately controls the robot but is also robust against unexpected situations, as demonstrated in the experiments.

REFERENCES

[1] S. Kim, C. Laschi, and B. Trimmer, "Soft robotics: a bioinspired evolution in robotics," Trends in Biotechnology, vol. 31, no. 5, pp. 287–294, 2013.
[2] D. J. Preston, H. J. Jiang, V. Sanchez, P. Rothemund, J. Rawson, M. P. Nemitz, W.-K. Lee, Z. Suo, C. J. Walsh, and G. M. Whitesides, "A soft ring oscillator," Science Robotics, vol. 4, no. 31, p. eaaw5496, 2019.
[3] T. G. Thuruthel, E. Falotico, F. Renda, and C. Laschi, "Model-based reinforcement learning for closed-loop dynamic control of soft robotic manipulators," IEEE Transactions on Robotics, vol. 35, no. 1, pp. 124–134, 2018.
[4] D. Kim, J. I. Kim, and Y.-L. Park, "A simple tripod mobile robot using soft membrane vibration actuators," IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2289–2295, 2019.
[5] C. E. Rasmussen, "Gaussian processes in machine learning," in Summer School on Machine Learning. Springer, 2003, pp. 63–71.
[6] T. Haarnoja, A. Zhou, S. Ha, J. Tan, G. Tucker, and S. Levine, "Learning to walk via deep reinforcement learning," in Proceedings of the 15th Robotics: Science and Systems, RSS, 2019.
[7] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control," in Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA: JMLR.org, 2016, pp. 1329–1338.
[8] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, 2018, pp. 1856–1865.
[9] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine, "Soft actor-critic algorithms and applications," CoRR, vol. abs/1812.05905, 2018.
[10] G. Marckmann and E. Verron, "Comparison of hyperelastic models for rubber-like materials," Rubber Chemistry and Technology, vol. 79, no. 5, pp. 835–858, 2006.
[11] E. W. Weisstein, "Spherical cap," 2008. [Online]. Available: http://thznetwork.net/index.php/thz-images
[12] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, "Reinforcement learning with deep energy-based policies," in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 2017, pp. 1352–1361.
[13] M. Bloem and N. Bambos, "Infinite time horizon maximum causal entropy inverse reinforcement learning," in 53rd IEEE Conference on Decision and Control, Dec. 2014, pp. 4911–4916.
[14] J. Schulman, P. Abbeel, and X. Chen, "Equivalence between policy gradients and soft Q-learning," CoRR, vol. abs/1704.06440, 2017. [Online]. Available: http://arxiv.org/abs/1704.06440
[15] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[16] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz, "Trust region policy optimization," in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 2015, pp. 1889–1897.
[17] J. I. Kim, M. Hong, K. Lee, D. Kim, Y.-L. Park, and S. Oh, "Learning to walk a tripod mobile robot using nonlinear soft vibration actuators with entropy adaptive reinforcement learning: Supplementary material." [Online]. Available: http://rllab.snu.ac.kr/publications/papers/2020 raladasac supp.pdf/
[18] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[19] S. Karaman and E. Frazzoli, "Sampling-based algorithms for optimal motion planning," I. J. Robotics Res., vol. 30, no. 7, pp. 846–894, 2011.