
Balancing an Inverted Pendulum

Kajal Damji Gada

ENPM808F Robot Learning
Final Project

12th December 2016


Contents

Abstract
1 Introduction
2 Related Work
3 Approach
  3.1 Q-learning: Introduction
  3.2 Q-learning: Exploration
  3.3 Q-learning: Formula
4 Implementation
5 Results
6 Analysis
7 Conclusion
8 Future Work
References
Appendix


Abstract

1 Introduction

An inverted pendulum is a touchstone that every robotics student encounters at some point [1]. From stabilization of an unstable open-loop system to the real-world application of the Segway, it is a benchmark in Control Theory and Robotics. It is also a good application to aid in learning a new algorithm, which in this scenario is Q-learning. Thus, the goal of this project is to understand the working of Q-learning, a machine learning algorithm, by implementing it for an inverted pendulum.

Figure 1: (a) Segway [2] (b) Furuta Pendulum [3]

The inverted pendulum problem has many variations: the Furuta Pendulum [3], the Double Inverted Pendulum [4], etc. In this project, the case of an inverted pendulum on a cart is considered. The system may appear simplistic in design. However, it is a non-linear system with a statically stable equilibrium point at the hanging (face-down) position and an unstable equilibrium point at the upright position.

This makes designing a control system for an inverted pendulum a challenging problem. In the case of Q-learning, the model does not need to be known: Q-learning is regarded as a model-free [5] reinforcement learning algorithm. However, it comes with its own set of challenges, one of the most important being discretization of the model, since Q-learning works on a discrete system with an end-game reward.

Literature related to this project is discussed in Section 2. In Section 3, the plan for the project problem is charted out. In Sections 4 and 5, the actual implementation and results are shown. The results are analyzed in Section 6, and conclusions are drawn in Section 7.


2 Related Work

The work by Lasse Scherffig [6] starts with an explanation of reinforcement learning theory and goes on to explain the difference between supervised learning and reinforcement learning. The main difference is that reinforcement learning does not have a set of sample actions to imitate; it in fact learns by exploring and assessing the rewards.

The paper then discusses the inverted pendulum model, followed by the work done. The paper addresses two problems: balancing and full control. Balancing is about maintaining balance when in the face-up position, and full control is about getting to the face-up position from any position, including the face-down position. While the first problem is solved using Q-learning, the second uses an Artificial Neural Network (ANN), as the number of states is too large.

In a second paper, the author discusses the use of a resource-allocation network with Q-learning [7]. The paper starts with a discussion of the use of supervised learning and memorization for balancing an inverted pendulum; that method essentially memorizes each move using Gaussian units. The discussion then moves on to how Q-learning can be used to solve the problem.

Figure 2: Q-learning network with Restart algorithm [7]

Instead of using a Q-table, the paper proposes the Q-learning network shown in Figure 2. The point is that instead of storing each state-action pair and building a large memorization table as in supervised learning, a network is used and its resources are reallocated. Every time a new state-action pair is learnt, it is stored at the unit that is least useful. This approach is called the Restart Algorithm and gives better results than a combination of supervised learning and memorization.


3 Approach

3.1 Q-learning: Introduction

The task is defined as balancing an inverted pendulum on a cart in an upright position. The method chosen for this task is a machine learning algorithm: Q-learning. It is a method that does not require knowledge of the model for learning; it learns by experiencing the rewards for taking a sequence of actions [5].

Figure 3: Interface between Agent and Environment in Q-learning [6]

In other words, the agent takes an action and observes the result in the form of a reward from the environment, as shown in Figure 3. The reward is stored in a table, called the Q-table, along with the state. The next time the same state is encountered, the agent decides which action to take based on the rewards learned previously.

3.2 Q-learning: Exploration

A good reward would lead to taking the action again, and a bad reward would lead to not taking it again. But what if there was a better reward? Thus, there is a component of exploration: when deciding the next action, the agent sometimes takes an action it has not explored, even when an existing action gives a good reward.
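As a concrete illustration, the following is a minimal sketch of this ε-greedy selection in Python; the function and variable names here are illustrative, and the appendix functions calculate_prob and choose_action implement the same idea.

import random

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon, explore: pick any action at random,
    # even when a known action already looks good.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Otherwise exploit: take the action with the highest learned value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: Q-values for the two actions (left, right) in some state
action = epsilon_greedy([0.3, 1.5], epsilon=0.2)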

The size of the Q-table is decided based on the available combinations of state-action pairs. This also affects the number of iterations that must be performed to obtain satisfactory results.

3.3 Q-learning: Formula

For each iteration, the current state (s) is observed. An action is chosen for execution based on equation (1), and the Q-table is then updated for the chosen action as given in equation (2):

π(s) = argmax_a Q(s, a)    (1)

Q(s, a) = r + γ max_{a'} Q(s', a')    (2)

where π(s) is the policy for state s, a is the chosen action, r is the reward for the chosen action, γ is the delayed reward factor, and s' is the new state after the action is executed [6].
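For instance, a single tabular update of equation (2) could look like the sketch below, using a dictionary-based Q-table with illustrative names. Note that the full program in the appendix additionally blends the new estimate with the old one using a learning rate α.

gamma = 0.5  # delayed reward factor

def q_update(Q, s, a, r, s_next, actions):
    # One Q-learning backup for the state-action pair (s, a).
    # Q maps (state, action) -> learned value; unseen pairs default to 0.
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = r + gamma * best_next

# Example: action 1 (move right) from the upright state earned reward 10
Q = {}
q_update(Q, s=(0, 0, 0, 0), a=1, r=10, s_next=(0, 0, 0.01, 0), actions=(0, 1))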


4 Implementation

The program is implemented in Python 3. The code builds the Q-table over multiple iterations and stores the best result. The best result can then be played as an animation using the Penplot class from the plot.py file.

The program (inverted_pendulum_q_learning) starts with an empty Q-table. The program iterates over multiple episodes, and for each episode the current state is randomized. A policy is calculated for the current state and all actions. An action is chosen based on the calculated policy and executed.

Based on the chosen action, a new state is calculated from the system model. From this new state, a reward is calculated; the reward depends on the position of the cart and the angle of the pendulum. The reward is used to update the Q-table. If the pendulum is dropped, a new episode begins with a new random start state.

Note that an inverted pendulum is a continuous system. Thus, each state variable is discretized for implementation (a short sketch of this binning follows the lists below).

The states chosen are:

• Position of cart (x)

• Linear velocity of cart (ẋ)

• Angle of pendulum with cart (θ)

• Angular velocity of pendulum (θ̇)

Next, the actions set includes:

• Move left (−1)

• Move right (1)

Thus, the cart moves with a force of F Newtons to the left or right based on the chosen action. F is set to 10 N and can be changed. Other variables include:

• Magnitude of Force on cart (F ) and Gravity constant (g)

• Mass of cart (mc), Mass of pole (mp) and Length of Pole (lp)

• Reward delay factor (γ)

• Exploration factor (ε)
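As an illustration of this discretization, the following minimal sketch bins the pendulum angle θ; it reproduces the eight intervals used by calculate_index in the appendix, with finer resolution near the upright position.

from bisect import bisect_right
from math import degrees

# Bin edges in degrees; narrower bands near the upright position.
THETA_EDGES = [-12.0, -6.0, -1.0, 0.0, 1.0, 6.0, 12.0]

def theta_bin(theta_rad):
    # Map a continuous angle (in radians) to a discrete index 0..7.
    return bisect_right(THETA_EDGES, degrees(theta_rad))

# Example: an angle of 0.5 degrees falls in bin 4,
# the band just to the right of upright.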


5 Results

Figure 5 shows an example of the results after 1,000,000 episodes. As seen, the pendulum is able to maintain itself in the upright position and eventually stops when it reaches the end of the cart track (beyond 2.4 units).

Figure 4: Snapshot of animation for Inverted Pendulum Balancing

It can be seen that since the reward is maximal at the top, the system attempts to maintain that state. Note that this system is only dynamically stable, and thus the cart must move continuously to keep the pendulum at the unstable equilibrium point.

Figure 5: Results


6 Analysis

Based on the results of various experimental runs, it was observed that the system is able to identify a policy for maintaining the angle of the pendulum between −1 and 1 degrees. To assist in learning, the initial trials had the start state at (x, ẋ, θ, θ̇) = (0, 0, 0, 0). In later episodes, the system starts with a randomized initial state. This helps it learn better in fewer iterations.

To achieve better results, another method would be to create more discrete states. This also applies to the case where the algorithm should learn to bring the pendulum up from the face-down to the face-up position. However, a Q-table would not be ideal for a high number of states. For such cases, Artificial Neural Networks (ANNs) should be considered, as shown in [6].

7 Conclusion

The project was concluded by implementing the Q-learning algorithm to balance an inverted pendulum in an upright position. It was also realized that a continuous system is difficult to implement: it requires discretization of states, which can prove challenging.

If the discretization is too coarse, the transition from one state to another is less accurate, while with more states the Q-table becomes quite big. With many states, even more iterations are required to learn and build the Q-table. In such cases, other options such as Artificial Neural Networks should be explored.

8 Future Work

This project focused on balancing the pendulum; a natural extension would be to get the pendulum to come into the upright position from the face-down position.

Figure 6: Balancing a glass of Wine

Another interesting piece of future work would be to learn to balance the inverted pendulum while moving in a particular direction. This could apply to a scenario where a mobile robot brings you a glass of wine while balancing it at the end of a stick (an inverted pendulum), as shown in Figure 6.


References

[1] O. Boubaker, "The Inverted Pendulum: A Fundamental Benchmark in Control Theory and Robotics", International Conference on Education and e-Learning Innovations (ICEELI), 2012.

[2] W. Younis and M. Abdelati, "Design and Implementation of an Experimental Segway Model", AIP Conference Proceedings, vol. 1107, pp. 350-354, 2009.

[3] J. Á. Acosta, "Furuta's Pendulum: A Conservative Nonlinear Model for Theory and Practice", Mathematical Problems in Engineering, 2010.

[4] T. Henmi, M. Deng, A. Inoue, N. Ueki and Y. Hirashima, "Swing-up Control of a Serial Double Inverted Pendulum", American Control Conference, 2004.

[5] C. J. C. H. Watkins and P. Dayan, "Technical Note: Q-learning", Machine Learning, pp. 279-292, 1992.

[6] L. Scherffig, "Reinforcement Learning in Motor Control".

[7] C. W. Anderson, "Q-Learning with Hidden-Unit Restarting".


Appendix

Read Me

The program is coded in Python 3. To run the program:

python3 inverted_pendulum_q_learning.py

Ensure that both files, (1) inverted_pendulum_q_learning.py and (2) plot.py, are in the same folder. Make the scripts executable before running them. In Ubuntu:

chmod +x inverted_pendulum_q_learning.py
chmod +x plot.py

To change the values of parameters such as γ, ε, etc., change the values at the start of the file. To change the display settings, use the command:

Penplot(best_states, anime=True, fig=True)

where anime=True enables the animation and fig=True enables the graphs.


Main Program (in Python 3)

#!/usr/bin/env python3

import numpy as np
from plot import Penplot
import random
from math import degrees, sin, cos

# ------------------------------ #
#        CONSTANT VALUES         #
# ------------------------------ #

mass_pole = 0.1
mass_cart = 0.5
mass_total = mass_pole + mass_cart

length_pole = 0.3

force_magnitude = 2
constant_gravity = 9.8

tau = 0.02    # integration time step (s)
alpha = 0.5   # learning rate
gamma = 0.5   # delayed reward factor

epsilon = 0.2  # exploration factor

# ------------------------------ #
#           FUNCTIONS            #
# ------------------------------ #

def calculate_index(current_state):
    # Discretize the continuous state (x, x_dot, theta, theta_dot)
    # into Q-table indices.

    if current_state[0] < -0.8:
        x = 0
    elif current_state[0] < 0.8:
        x = 1
    else:
        x = 2

    if current_state[1] < -0.5:
        x_dot = 0
    elif current_state[1] < 0.5:
        x_dot = 1
    else:
        x_dot = 2

    if degrees(current_state[2]) < -12.0:
        theta = 0
    elif degrees(current_state[2]) < -6.0:
        theta = 1
    elif degrees(current_state[2]) < -1.0:
        theta = 2
    elif degrees(current_state[2]) < 0.0:
        theta = 3
    elif degrees(current_state[2]) < 1.0:
        theta = 4
    elif degrees(current_state[2]) < 6.0:
        theta = 5
    elif degrees(current_state[2]) < 12.0:
        theta = 6
    else:
        theta = 7

    if degrees(current_state[3]) < -50.0:
        theta_dot = 0
    elif degrees(current_state[3]) < -25.0:
        theta_dot = 1
    elif degrees(current_state[3]) < 25.0:
        theta_dot = 2
    elif degrees(current_state[3]) < 50.0:
        theta_dot = 3
    else:
        theta_dot = 4

    return x, x_dot, theta, theta_dot

def calculate_prob(current_state, Q_table):
    # Epsilon-greedy policy: the greedy action gets probability
    # 1 - epsilon + epsilon/2 and the other action gets epsilon/2.

    policy = []

    x, x_dot, theta, theta_dot = calculate_index(current_state)

    value = [Q_table[action, x, x_dot, theta, theta_dot] for action in range(2)]

    for action in value:
        if action == max(value):
            policy.append(1.0 - epsilon + epsilon / 2)
        else:
            policy.append(epsilon / 2)

    if sum(policy) == 1.0:
        return policy
    else:
        # Both actions tied for the maximum; fall back to a uniform policy.
        policy = [0.5, 0.5]
        return policy

def choose_action(policy):
    # Sample an action (0 = left, 1 = right) from the policy probabilities.

    prob_num = random.randrange(0, 100) / 100.0

    if prob_num <= policy[0]:
        action_choosen = 0
    else:
        action_choosen = 1

    return action_choosen

def update_state(current_state, action_choosen):
    # Simulate one time step of the cart-pole dynamics (Euler integration).

    x_cur, x_dot_cur, theta_cur, theta_dot_cur = current_state

    if action_choosen == 0:   # action 0 is left
        force_value = -force_magnitude
    else:                     # action 1 is right
        force_value = force_magnitude

    temp = (force_value + (mass_pole * length_pole) * theta_dot_cur**2 * sin(theta_cur)) / mass_total

    theta_acc = (constant_gravity * sin(theta_cur) - cos(theta_cur) * temp) / \
                (length_pole * ((4.0 / 3.0) - mass_pole * cos(theta_cur)**2 / mass_total))

    x_acc = temp - (mass_pole * length_pole) * theta_acc * cos(theta_cur) / mass_total

    x_new = x_cur + (tau * x_dot_cur)
    x_dot_new = x_dot_cur + (tau * x_acc)
    theta_new = theta_cur + (tau * theta_dot_cur)
    theta_dot_new = theta_dot_cur + (tau * theta_acc)

    return x_new, x_dot_new, theta_new, theta_dot_new

def update_Qtable(current_state, action_choosen, new_state, reward, Q_table):
    # Q-learning backup: Q(s,a) += alpha * (r + gamma * max Q(s',a') - Q(s,a)).

    x, x_dot, theta, theta_dot = calculate_index(new_state)
    Q_max = max(Q_table[0, x, x_dot, theta, theta_dot],
                Q_table[1, x, x_dot, theta, theta_dot])

    x, x_dot, theta, theta_dot = calculate_index(current_state)
    Q_cur = Q_table[action_choosen, x, x_dot, theta, theta_dot]

    Q_table[action_choosen, x, x_dot, theta, theta_dot] = Q_cur + alpha * (reward + (gamma * Q_max) - Q_cur)

    return Q_table

def take_action(current_state, Q_table):

    policy = calculate_prob(current_state, Q_table)
    action_choosen = choose_action(policy)
    new_state = update_state(current_state, action_choosen)

    # Reward favours small pendulum angles while the cart stays on the track.
    reward = 0

    if abs(new_state[0]) < 2.4:
        if abs(degrees(new_state[2])) < 1.0:
            reward = 10
        elif abs(degrees(new_state[2])) < 3.0:
            reward = 5
        elif abs(degrees(new_state[2])) < 6.0:
            reward = 2
        elif abs(degrees(new_state[2])) < 20.0:
            reward = 1

    Q_table = update_Qtable(current_state, action_choosen, new_state, reward, Q_table)

    return reward, new_state, Q_table

# ------------------------------ #
#          MAIN PROGRAM          #
# ------------------------------ #

Q_table = np.zeros([2, 3, 3, 8, 5])  # action(2) * state_x(3) * state_x_dot(3) * state_theta(8) * state_theta_dot(5)

max_steps = 0
best_states = []

max_episodes = 1000000
# max_episodes = 10000

for episode in range(1, max_episodes + 1):

    states = []

    # Start states are drawn from a gradually wider range as training progresses.
    if episode < 10000:
        current_state = (0, 0, random.randrange(-1, 1), 0)  # start state ~ 0
    elif episode < 20000:
        current_state = (0.1 * random.randrange(-5, 5), 0, random.randrange(-3, 3), 0)
    elif episode < 30000:
        current_state = (0.1 * random.randrange(-8, 8), 0, random.randrange(-5, 5), 0)
    elif episode < 50000:
        current_state = (0.1 * random.randrange(-15, 15), 0, random.randrange(-12, 12), 0)
    else:
        current_state = (0.1 * random.randrange(-20, 20), 0, random.randrange(-15, 15), 0)

    states.append(current_state)

    for step in range(1, 1000):

        reward, new_state, Q_table = take_action(current_state, Q_table)
        current_state = new_state
        states.append(current_state)

        if reward < 1:  # Pendulum dropped

            if step > max_steps:
                best_states = states
                max_steps = step

            if (episode % 10000) == 0:
                print('After', episode, 'episode')
                print('Max steps:', max_steps)
                print('---------------------')

                # Penplot(best_states, anime=True, fig=False)

                # Gradually reduce exploration.
                epsilon -= 0.002

                if epsilon < 0:
                    epsilon = 0

            break

Penplot(best_states, anime=True, fig=True)

# -------------------------------------------------- #


Program for Animation (in Python 3)

#!/usr/bin/env python3

import math
import matplotlib
matplotlib.use('Qt5Agg')
import matplotlib.pyplot as plt
import matplotlib.animation as animation

class Penplot(object):
    def __init__(self, states, anime=False, fig=False):
        self.anime = anime
        self.fig = fig
        self.x = [state[0] for state in states]
        self.x_dot = [state[1] for state in states]
        self.theta = [state[2] for state in states]
        self.theta_dot = [state[3] for state in states]
        self.process()

    def plot(self, data):
        # Draw one animation frame: cart position and pole endpoint.
        x, theta, frame = data
        self.time_text.set_text("time: %.2f s\nstep: %d" % (frame * 0.02, frame))

        y = 0.05
        theta_x = x + math.sin(theta) * 0.25
        theta_y = y + math.cos(theta) * 0.25

        self.car.set_data([x], [y / 2.0])
        self.line.set_data((x, theta_x), (y, theta_y))

    def gen(self):
        # Yield (x, theta, frame) for every stored state.
        for frame in range(len(self.x)):
            yield self.x[frame], self.theta[frame], frame

    def process(self):
        if self.anime:
            fig = plt.figure(figsize=(20, 4.5))
            ax = fig.add_subplot(1, 1, 1)
            ax.set_xlim(-3.0, 3.0)
            ax.set_ylim(-0.1, 0.9)
            ax.grid()

            self.time_text = ax.text(0.05, 0.9, "", transform=ax.transAxes)
            self.car, = ax.plot([], [], "s", ms=15)
            self.line, = ax.plot([], [], "b-", lw=2)

            ani = animation.FuncAnimation(fig, self.plot, self.gen, interval=1,
                                          repeat_delay=3000, repeat=True)

            plt.show()

        if self.fig:
            steps = range(len(self.x))

            plt.subplot(2, 1, 1)
            plt.title("x, theta")
            plt.plot(steps, self.x, label="x")
            plt.plot(steps, self.theta, label="theta")
            plt.legend(loc="best")
            plt.grid()

            plt.subplot(2, 1, 2)
            plt.title("x_dot, theta_dot")
            plt.plot(steps, self.x_dot, label="x_dot")
            plt.plot(steps, self.theta_dot, label="theta_dot")
            plt.legend(loc="best")
            plt.grid()
            plt.show()
            plt.close()
