
Page 1: An automated trading system based on Recurrent Reinforcement Learning

Students: Lior Kupfer, Pavel Lifshits
Supervisor: Andrey Bernstein
Advisor: Prof. Nahum Shimkin

Technion – Israel Institute of Technology
Faculty of Electrical Engineering
Control and Robotics Laboratory

Page 2: Outline

Introduction
Notations
The System
The Learning Algorithm
Project Goals
Results
◦ Artificial Time Series (the AR case)
◦ Real Foreign Exchange / Stock Data
Conclusions
Future work


Page 3: Introduction

Using Machine Learning methods for trading

◦ A relatively new approach to financial trading

◦ Learning algorithms are used to predict the rise and fall of asset prices before they occur

◦ An optimal trader would buy an asset before its price rises, and sell the asset before its value declines


Page 4: Introduction


Trading technique

◦ An asset trader was implemented using recurrent reinforcement learning (RRL), as suggested by Moody and Saffell (2001)

◦ It is a gradient ascent algorithm which attempts to maximize a utility function known as Sharpe's ratio

◦ We denote by θ a parameter vector which completely defines the actions of the trader

◦ By choosing an optimal parameter vector for the trader, we attempt to take advantage of asset price changes

Page 5: Introduction


Due to transaction costs, which include
◦ Commissions
◦ Bid/Ask spreads
◦ Price slippage
◦ Market impact

our constraints are
◦ We cannot trade arbitrarily frequently
◦ We cannot make large changes in portfolio composition

Model assumptions
◦ Fixed position size
◦ Single security

Page 6: Notations


◦ $\mu$ – fixed quantity of securities traded

◦ The price series is $Z = \{z_1, z_2, \dots, z_T\}$

◦ $r_t = z_t - z_{t-1}$ – the corresponding price changes

◦ $F_t \in \{1, 0, -1\} = \{\text{Long}, \text{Neutral}, \text{Short}\}$ – our position at each time step

◦ $R_t = \mu \left( F_{t-1} r_t - \delta \, |F_t - F_{t-1}| \right)$ – the system return at each time step, where $\mu$ is the number of securities and $\delta$ is the commission rate
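As a concrete reading of the return definition above, the following minimal sketch computes the per-step returns $R_t$ for a given price series and position sequence. The function name and the toy numbers are ours, for illustration only.

```python
import numpy as np

def step_returns(prices, positions, mu=1.0, delta=0.001):
    """R_t = mu * (F_{t-1} * r_t - delta * |F_t - F_{t-1}|).

    prices    : z_1 .. z_T
    positions : F_1 .. F_T, each in {-1, 0, 1}
    mu        : number of securities held per unit position
    delta     : commission rate per unit of position change
    """
    z = np.asarray(prices, dtype=float)
    F = np.asarray(positions, dtype=float)
    r = np.diff(z)                         # r_t = z_t - z_{t-1}
    dF = np.abs(np.diff(F))                # |F_t - F_{t-1}|
    return mu * (F[:-1] * r - delta * dF)  # R_2 .. R_T

# toy example: go long, stay long, then close the position
print(step_returns([100.0, 101.0, 100.5], [1, 1, 0]))
```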

Page 7: Notations


◦ $P_T$ – the additive profit accumulated over $T$ time periods:

  $P_T = \sum_{t=1}^{T} R_t$, with $P_0 = 0$ and usually $F_0 = F_T = 0$

◦ $U$ – the performance criterion: $U_T = U(R_1, R_2, \dots, R_T)$

◦ $D_t = U_t - U_{t-1}$ is the marginal increase in the performance
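A minimal sketch of the additive profit and of one natural choice of performance criterion, the sample Sharpe ratio of the per-step returns (the differentiable version actually used for learning is introduced on Page 12). Function names are ours.

```python
import numpy as np

def additive_profit(R):
    """P_T = sum_{t=1}^{T} R_t, with P_0 = 0."""
    return float(np.sum(R))

def sharpe_ratio(R, eps=1e-12):
    """U_T = mean(R_1..R_T) / std(R_1..R_T): the sample Sharpe ratio."""
    R = np.asarray(R, dtype=float)
    return float(R.mean() / (R.std() + eps))

R = [0.4, -0.1, 0.3, 0.2]
print(additive_profit(R), sharpe_ratio(R))
```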

Page 8: The system


◦ $\theta$ – the parameter vector (which we attempt to learn)
◦ $I_t$ – the information available at time $t$ (in our case, the price changes)
◦ $e_t$ – a stochastic extension (noise) whose level can be varied to control "exploration vs. exploitation"

Our system is a single-layer recurrent neural network.

Formally:

◦ $F_t = F(\theta_{t-1};\, F_{t-1}, I_t, e_t)$

◦ $F_t = \tanh(\theta^T V_t)$, where $\theta = (u, v_0, v_1, \dots, v_m, w)^T$ and $V_t = (F_{t-1}, r_t, r_{t-1}, \dots, r_{t-m}, 1)^T$
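A sketch of one forward step of this recurrent trader, using the parameter layout reconstructed above; the optional Gaussian noise plays the role of the stochastic extension $e_t$. Names and default values are ours.

```python
import numpy as np

def trader_step(theta, r_window, F_prev, noise_std=0.0, rng=None):
    """One forward step: F_t = tanh(theta . V_t), V_t = (F_{t-1}, r_t..r_{t-m}, 1).

    theta    : parameter vector (u, v_0, ..., v_m, w), length m + 3
    r_window : most recent price changes (r_t, r_{t-1}, ..., r_{t-m}), length m + 1
    F_prev   : previous position F_{t-1}
    """
    V = np.concatenate(([F_prev], r_window, [1.0]))
    F = np.tanh(theta @ V)
    if noise_std > 0.0:                      # stochastic extension e_t (exploration)
        rng = rng or np.random.default_rng(0)
        F = float(np.clip(F + rng.normal(0.0, noise_std), -1.0, 1.0))
    return F

theta = 0.1 * np.ones(5)                     # m = 2 autoregressive inputs
print(trader_step(theta, r_window=[0.1, -0.2, 0.05], F_prev=0.0, noise_std=0.05))
```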


Page 10: The learning algorithm


We use reinforcement learning (RL) to adjust the parameters of the system so as to maximize our performance criterion of choice

RL – a learning paradigm that lies between supervised and unsupervised learning

RL framework:
◦ Agent & Environment
◦ Reward
◦ Expected Return
◦ Policy Learning

RL modus operandi
◦ The agent perceives the state of the environment $s_t$ and chooses an action $a_t$. It subsequently observes the new state of the environment $s_{t+1}$ and receives a reward $r_t$.
◦ Aim: learn a policy $\pi$ (a mapping from states to actions) which optimizes the expected return.
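The interaction protocol described above can be written as a short loop. This is a generic sketch; the `env` and `policy` objects are placeholders, not part of the project code.

```python
def run_episode(env, policy, T):
    """Generic RL loop: observe s_t, act a_t, receive r_t and the next state s_{t+1}."""
    s = env.reset()
    total_reward = 0.0
    for _ in range(T):
        a = policy(s)                      # pi: state -> action
        s_next, r = env.step(a)            # environment transition and reward
        policy.update(s, a, r, s_next)     # nudge the policy towards higher return
        total_reward += r
        s = s_next
    return total_reward
```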

Page 11: The learning algorithm


RL approaches

◦ Direct RL – the policy is represented directly, and the reward function (immediate feedback) is used to adjust the policy on the fly, e.g. policy search.

◦ Value-function RL – values are assigned to each state (or state–action pair). Values correspond to estimates of future expected returns, or in other words, to the long-term desirability of states. These values help guide the agent towards the optimal policy, e.g. TD-Learning, Q-Learning.

◦ Actor–Critic – the model is split into two parts: the critic, which maintains the state value estimate V, and the actor, which is responsible for choosing the appropriate action at each state.

Page 12: The learning algorithm


In RRL we learn the policy by gradient ascent on the performance function

The performance function can be
◦ Profits
◦ Sharpe's ratio
◦ Sterling ratio
◦ Downside deviation

Moody suggests an additive and differentiable approximation of Sharpe's ratio – the differential Sharpe's ratio

Gradient ascent update:

$\theta_t = \theta_{t-1} + \rho \, \dfrac{dU_t(\theta)}{d\theta}$

where $\rho$ is the learning rate and $U_t = \hat{S}_t$ is the moving Sharpe ratio

$\hat{S}_t = \dfrac{A_t}{\sqrt{B_t - A_t^2}}$

built from first-order expansions of exponential moving averages:

$A_t = A_{t-1} + \eta \, \Delta A_t = A_{t-1} + \eta \, (R_t - A_{t-1})$ – moving average of the returns

$B_t = B_{t-1} + \eta \, \Delta B_t = B_{t-1} + \eta \, (R_t^2 - B_{t-1})$ – moving average of the squared returns (used for the standard deviation)

$\rho$ – learning rate, $\eta$ – decay parameter of the moving averages
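A minimal sketch of these moving estimates together with the differential Sharpe ratio $D_t$ (the marginal increase whose derivative is developed on the next page); it follows the reconstruction above and Moody & Saffell [1], with our own variable names.

```python
class DifferentialSharpe:
    """Moving estimates A_t, B_t and the differential Sharpe ratio D_t."""

    def __init__(self, eta=0.01):
        self.eta = eta      # decay parameter of the exponential moving averages
        self.A = 0.0        # moving estimate of the mean of R_t
        self.B = 0.0        # moving estimate of the second moment of R_t

    def update(self, R):
        dA = R - self.A                       # Delta A_t
        dB = R * R - self.B                   # Delta B_t
        denom = (self.B - self.A ** 2) ** 1.5
        D = 0.0 if denom <= 0 else (self.B * dA - 0.5 * self.A * dB) / denom
        self.A += self.eta * dA               # A_t = A_{t-1} + eta * Delta A_t
        self.B += self.eta * dB               # B_t = B_{t-1} + eta * Delta B_t
        return D

ds = DifferentialSharpe(eta=0.05)
print([round(ds.update(R), 3) for R in (0.2, -0.1, 0.3, 0.1)])
```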

Page 13: The learning algorithm


Now we develop $\dfrac{dU_t}{d\theta}$

The differential Sharpe ratio is

$D_t \equiv \dfrac{d\hat{S}_t}{d\eta} = \dfrac{B_{t-1}\,\Delta A_t - \tfrac{1}{2} A_{t-1}\,\Delta B_t}{\left( B_{t-1} - A_{t-1}^2 \right)^{3/2}}$

To calculate $\dfrac{dU_t}{d\theta}$ we will need to calculate $\dfrac{dD_t}{dR_t}$, where

$\dfrac{dD_t}{dR_t} = \dfrac{B_{t-1} - A_{t-1} R_t}{\left( B_{t-1} - A_{t-1}^2 \right)^{3/2}}$

The chain rule then gives

$\dfrac{dU_t}{d\theta} = \dfrac{dU_t}{dR_t} \left[ \dfrac{dR_t}{dF_t} \dfrac{dF_t}{d\theta} + \dfrac{dR_t}{dF_{t-1}} \dfrac{dF_{t-1}}{d\theta} \right]$

◦ Note: since the trader is recurrent,

$\dfrac{dF_t}{d\theta} = \dfrac{\partial F_t}{\partial \theta} + \dfrac{\partial F_t}{\partial F_{t-1}} \dfrac{dF_{t-1}}{d\theta}$

so the derivative of the current trade depends on the derivatives of all previous trades.
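Putting the pieces together, the sketch below is one possible online implementation of this update: a forward pass through the tanh trader, the differential Sharpe ratio as the instantaneous utility, and the recurrent derivative $dF_t/d\theta$ carried from step to step. It is a simplified illustration (continuous positions, a single pass, our own variable names and default parameters), not the project's exact code.

```python
import numpy as np

def train_rrl(r, m=8, rho=0.01, eta=0.01, mu=1.0, delta=0.001, seed=0):
    """One online training pass of the RRL trader over a series of price changes r_t."""
    rng = np.random.default_rng(seed)
    theta = 0.1 * rng.standard_normal(m + 3)   # (u, v_0..v_m, w)
    dF_prev = np.zeros(m + 3)                  # dF_{t-1}/dtheta, carried recurrently
    F_prev = 0.0                               # previous position F_{t-1}
    A = B = 0.0                                # moving estimates for the Sharpe ratio

    for t in range(m, len(r)):
        # forward pass: V_t = (F_{t-1}, r_t, ..., r_{t-m}, 1),  F_t = tanh(theta . V_t)
        V = np.concatenate(([F_prev], r[t - m:t + 1][::-1], [1.0]))
        F = np.tanh(theta @ V)

        # trading return R_t = mu * (F_{t-1} r_t - delta |F_t - F_{t-1}|)
        R = mu * (F_prev * r[t] - delta * abs(F - F_prev))

        # dD_t/dR_t from the differential Sharpe ratio (using A_{t-1}, B_{t-1})
        denom = (B - A ** 2) ** 1.5
        dD_dR = 0.0 if denom <= 0 else (B - A * R) / denom

        # dR_t/dF_t and dR_t/dF_{t-1}
        sgn = np.sign(F - F_prev)
        dR_dF = -mu * delta * sgn
        dR_dFprev = mu * r[t] + mu * delta * sgn

        # recurrent derivative: dF_t/dtheta = (1 - F_t^2) (V_t + u * dF_{t-1}/dtheta)
        dF = (1.0 - F ** 2) * (V + theta[0] * dF_prev)

        # gradient ascent on the instantaneous utility
        theta += rho * dD_dR * (dR_dF * dF + dR_dFprev * dF_prev)

        # roll the moving estimates and the recurrent state forward
        A += eta * (R - A)
        B += eta * (R * R - B)
        F_prev, dF_prev = F, dF

    return theta

# usage on synthetic price changes
r = np.random.default_rng(1).normal(0.0, 1.0, size=500)
print(train_rrl(r).round(3))
```

In a full training session this pass would be repeated for several epochs over the training window before trading the test window.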

Page 14: Project goals


Investigate Reinforcement Learning by policy gradient

Implement an automated trading system which learns its trading strategy using the Recurrent Reinforcement Learning algorithm

Analyze the system’s results & structure

Suggest and examine improvement methods

Page 15: Results


Dataset: Artificial Time Series
Goals:
• Show the system can learn
• Analyze the effect of the parameters
• Validate various model approximations

Dataset: Real Foreign Exchange EUR/USD Data
Goals:
• Show the system can learn a profitable strategy on real data
• Search for possible improvements

Page 16: Results


The challenges we face

◦ Model parameters:
  $M$ – number of autoregressive inputs
  $\rho$ – learning rate
  $\eta$ – decay parameter
  adaptation rate
  $n_e$ – number of learning epochs
  $L_{train}$ – size of the training set
  $L_{test}$ – size of the test set
  $q_l, q_h$ – quantization levels for $\tanh$
  $\delta$ – transaction cost

◦ If and how to normalize the learned weights?

◦ How to normalize the input? The averages change over time (the series is non-stationary) – we assume that this change is slower than "how far back we look" (a rolling-window normalization sketch follows below).
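One plausible answer to the input-normalization question is a rolling z-score over a trailing window that is long compared with the $M$ autoregressive inputs but short compared with the drift of the averages. This is an illustrative choice on our part, not necessarily the scheme used in the project.

```python
import numpy as np

def rolling_zscore(r, window=200, eps=1e-8):
    """Normalize price changes by the mean and std of a trailing window."""
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    for t in range(len(r)):
        w = r[max(0, t - window + 1): t + 1]      # trailing window ending at t
        out[t] = (r[t] - w.mean()) / (w.std() + eps)
    return out

print(rolling_zscore([0.1, 0.2, -0.1, 0.3, 0.0], window=3).round(2))
```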

Page 17: Results – Artificial series


$r_t$ – the return series is generated by an AR(p) process (a generator sketch follows this list)

We analyze the effect of
◦ transaction costs
◦ quantization levels
◦ number of autoregressive inputs

on
◦ Sharpe's ratio
◦ trading frequency
◦ profits

We also examine the effect of the initial conditions
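A minimal generator for such artificial experiments: returns $r_t$ drawn from an AR(p) process (an AR(2) instance here, since the later plots mention AR(2)) and the matching price series. The coefficients, noise level, and starting price are illustrative only.

```python
import numpy as np

def ar_returns(coeffs, T, noise_std=1.0, seed=0):
    """Generate r_t from an AR(p) process: r_t = sum_i a_i * r_{t-i} + noise."""
    a = np.asarray(coeffs, dtype=float)
    p = len(a)
    rng = np.random.default_rng(seed)
    r = np.zeros(T + p)
    for t in range(p, T + p):
        r[t] = a @ r[t - p:t][::-1] + rng.normal(0.0, noise_std)
    return r[p:]

r = ar_returns([0.6, -0.3], T=250)       # a stationary AR(2) return series
prices = 1100.0 + np.cumsum(r)           # corresponding price series z_t
print(prices[:5].round(2))
```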

Page 18: Results – Artificial series


[Figure: an example run on an artificial series over 250 time steps – panels show the price, the position $F_t$, the profits in %, and the Sharpe ratio.]

Page 19: Results – Artificial series


[Figure: bar charts of trading frequency vs. transaction costs and profits vs. transaction cost, for transaction costs of 0%, 0.1%, 0.5% and 1%.]

Page 20: Results – Artificial series


[Figure: Sharpe's ratio per epoch – Sharpe's ratio as a function of the training epoch.]

Page 21: Results – Artificial series


[Figure: profits with prices generated by an i.i.d. process vs. profits with prices generated by an AR(2) process, over 250 time steps.]

Page 22: Results – Artificial series


[Figure: accumulated profits over 250 time steps for a 2-positions trader, a 3-positions trader with equal levels, and a 3-positions conservative trader, together with the positions each trader takes.]

                                    % Long positions   % Neutral positions   % Short positions
2 positions trader                        51.2%                 0%                 48.8%
3 positions trader                        40%                   25.2%              34.8%
3 positions conservative trader           31.2%                 48.8%              20%
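The traders compared above differ in how the continuous $\tanh$ output is quantized into {Long, Neutral, Short}. The thresholding below is one plausible reading of the quantization levels $q_l, q_h$ from the parameter list on Page 16; the exact levels used for the "conservative" variant are not recoverable from the slides.

```python
def quantize_position(F, q_low=None, q_high=None):
    """Map a continuous output F in (-1, 1) to a discrete position.

    2 positions : no neutral band, the sign of F decides Long / Short.
    3 positions : outputs inside [q_low, q_high] become Neutral (0);
                  a wider band gives a more 'conservative' trader.
    """
    if q_low is None or q_high is None:      # 2-positions trader
        return 1 if F >= 0 else -1
    if F > q_high:
        return 1                             # Long
    if F < q_low:
        return -1                            # Short
    return 0                                 # Neutral

print([quantize_position(f) for f in (-0.7, 0.1, 0.8)])             # 2 positions
print([quantize_position(f, -0.3, 0.3) for f in (-0.7, 0.1, 0.8)])  # 3 positions
```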

Page 23: Results – Real Forex Data


The price series is the Euro vs. US Dollar (EUR/USD) exchange rate between 21/05/2007 and 15/01/2010, sampled at 15-minute data points

We compare our trader with
◦ A random strategy with uniformly distributed positions
◦ A Buy & Hold strategy of the Euro against the US Dollar
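A sketch of these two benchmarks with the same return convention as before: a uniformly random position sequence (the "Monkey") and an always-long Buy & Hold position. The synthetic price changes and the commission value are illustrative only.

```python
import numpy as np

def strategy_profit(positions, r, mu=1.0, delta=0.0):
    """Accumulated profit of a given position sequence on price changes r."""
    F = np.asarray(positions, dtype=float)
    R = mu * (F[:-1] * r[1:] - delta * np.abs(np.diff(F)))   # R_t per step
    return float(np.sum(R))

rng = np.random.default_rng(0)
r = rng.normal(0.0, 1e-4, size=1000)          # stand-in for 15-minute EUR/USD changes

buy_and_hold = np.ones(len(r))                # always long the Euro
monkey = rng.choice([-1, 0, 1], size=len(r))  # uniformly random positions

print(strategy_profit(buy_and_hold, r), strategy_profit(monkey, r, delta=0.001))
```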

Page 24: Results – Real Forex Data


No commissions

[Figure: profits of our trader vs. the random ("Monkey") strategy and the Buy & Hold strategy, together with the EUR/USD price (1 EUR = x USD), over 250 time steps, with no commissions.]

Page 25: Results – Real Forex Data


With commissions (0.1%)

[Figure: profits of our trader vs. the random ("Monkey") strategy and the Buy & Hold strategy, together with the EUR/USD price (1 EUR = x USD), over 250 time steps, with 0.1% commissions.]

Page 26: Conclusions


RRL performs better than the random strategy

Positive Sharpe Ratios achieved in most cases

RRL seems to struggle during volatile periods

Large variance is a major cause for concern

Can’t unravel complex relationships in the data

Changes in market condition lead to waste of all the system’s learning during the training phase (but most learning systems suffer from this).

Page 27: Conclusions


When trading real data, the transaction cost is a killer

Normalizing the input series can be a real challenge
◦ The input series are non-stationary
◦ We assume the averages change slowly relative to the window of AR inputs to the system

Normalizing the weights is done heuristically
◦ A threshold method leads to the best results on both artificial and real data

There is redundancy when the input series are ARMA processes

Long training sessions under constant market conditions lead to overfitting

Page 28: Future work


Wrapping the system with a risk-management layer (e.g. stop-loss, a retraining trigger, shutting the system down under anomalous behavior)

Dynamic adjustment of external parameters (such as the learning rate)

Working with more than one security

Working with variable-size positions

Working in coordination with another expert system (based on other algorithms)

Page 29: Acknowledgment


We would like to thank our project supervisor Andrey Bernstein for his guidance, and Prof. Nahum Shimkin for advising us, for allowing us to pursue a research project of our own interest, and for sharing his experience with us.

Additionally, we would like to thank Prof. Ron Meir and Prof. Neri Merhav for the time they spent consulting us.

Special warm thanks to Gabriel Molina from Stanford University and Tikesh Ramtohul from the University of Basel for their priceless help.

Page 30: Questions?


Page 31: References


[1] J. Moody and M. Saffell, "Learning to Trade via Direct Reinforcement", IEEE Transactions on Neural Networks, Vol. 12, No. 4, July 2001.

[2] C. Gold, "FX Trading via Recurrent Reinforcement Learning", IEEE International Conference on Computational Intelligence for Financial Engineering (CIFEr), Hong Kong, 2003.

[3] M. A. H. Dempster and V. Leemans, "An Automated FX Trading System Using Adaptive Reinforcement Learning", Expert Systems with Applications, Vol. 30, pp. 543–552, 2006.