
BioIntelligence Lab. 1

Learning to Trade via Direct Reinforcement

John Moody and Matthew Saffell, IEEE Transactions on Neural Networks, 12(4), pp. 875–889, 2001

Summarized by Jangmin O

BioIntelligence Lab. 2

Author

J. Moody
Director of the Computational Finance Program and a Professor of CSEE at Oregon Graduate Institute of Science and Technology
Founder & President of Nonlinear Prediction Systems
Program Co-Chair for Computational Finance 2000
A past General Chair and Program Chair of NIPS
A member of the editorial board of Quantitative Finance

BioIntelligence Lab. 3

I. Introduction

BioIntelligence Lab. 4

Optimizing Investment Performance

Characteristic: path-dependent

Methods: Direct Reinforcement learning (DR), Recurrent Reinforcement Learning [1, 2]
No need for a forecasting model
Single security trading or asset allocation

Recurrent Reinforcement Learning (RRL)
Adaptive policy search
Learns the investment strategy on-line
No need to learn a value function
Immediate rewards are available in financial markets

BioIntelligence Lab. 5

Difference between RRL & Q or TD

The financial decision-making problem is well suited to RRL

Immediate feedback available

Performance criteria: risk-adjusted investment returns
Sharpe ratio
Downside risk minimization

Differential form

BioIntelligence Lab. 6

Experimental Data

U.S. dollar/British Pound foreign exchange market

S&P 500 Stock Index and Treasury Bills

RRL vs. Q-learning: Bellman's curse of dimensionality

BioIntelligence Lab. 7

II. Trading Systems and Performance Criteria

BioIntelligence Lab. 8

Structure of Trading Systems (1)

An agent: assumptions

Trades a fixed position size in a single market
Trader's position at time t: F_t ∈ {+1, 0, −1}
Long: buy, Neutral: stay out of the market, Short: sell short
Profit R_t is realized at the end of the interval (t−1, t]: the gain or loss from holding position F_{t−1}, plus the transaction cost of moving from position F_{t−1} to F_t
A recurrent structure is required in order to make decisions that account for transaction costs, market impact, taxes, etc.

BioIntelligence Lab. 9

Structure of Trading Systems (2)

A single-asset trading system

θ_t: system parameters at time t
I_t: information set at time t
z_t: price series, y_t: other external variable series

F_t = F(\theta_t; F_{t-1}, I_t)

I_t = \{ z_t, z_{t-1}, z_{t-2}, \ldots ;\; y_t, y_{t-1}, y_{t-2}, \ldots \}

Simple example:

F_t = \mathrm{sign}(u F_{t-1} + v_0 r_t + v_1 r_{t-1} + \cdots + v_m r_{t-m} + w)
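A minimal sketch of this recurrent decision rule in Python (not the authors' implementation; the parameter packaging and window length are assumptions):

```python
import numpy as np

def trade_decision(theta, F_prev, returns_window):
    """One step of the recurrent trader F_t = sign(u*F_{t-1} + sum_i v_i*r_{t-i} + w).

    theta          -- dict with scalar 'u', weight vector 'v' (length m+1), and bias 'w'
    F_prev         -- previous position F_{t-1} in {-1, 0, +1}
    returns_window -- array [r_t, r_{t-1}, ..., r_{t-m}] of recent price changes
    """
    activation = theta["u"] * F_prev + np.dot(theta["v"], returns_window) + theta["w"]
    return np.sign(activation)  # +1 long, -1 short (0 only if the activation is exactly zero)
```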

BioIntelligence Lab. 10

Profit and Wealth for Trading Systems (1)

Performance function U() for a risk-insensitive trader = profit

Additive profits
Appropriate when each trade is for a fixed number μ of shares or contracts of the security
r_t = z_t − z_{t−1}: return of the risky asset
r_t^f: return of the risk-free asset (e.g., T-bills)
δ: transaction cost rate
Trader's wealth: W_T = W_0 + P_T

P_T = \sum_{t=1}^{T} R_t, \qquad R_t = \mu \left\{ r_t^f + F_{t-1}(r_t - r_t^f) - \delta\, |F_t - F_{t-1}| \right\}
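A small Python helper computing these additive returns and the cumulative profit P_T (a sketch; array conventions and argument names are assumptions):

```python
import numpy as np

def additive_profit(F, r, r_f, delta, mu=1.0):
    """P_T = sum_t mu*{ r_f_t + F_{t-1}*(r_t - r_f_t) - delta*|F_t - F_{t-1}| }.

    F     -- positions F_0..F_T (length T+1), each in {-1, 0, +1}
    r     -- risky-asset price changes r_1..r_T (length T)
    r_f   -- risk-free price changes r_f_1..r_f_T (length T)
    delta -- transaction cost per unit change in position
    mu    -- fixed number of shares or contracts traded
    """
    F, r, r_f = (np.asarray(a, dtype=float) for a in (F, r, r_f))
    R = mu * (r_f + F[:-1] * (r - r_f) - delta * np.abs(np.diff(F)))
    return R.sum(), R  # total profit P_T and the per-period returns R_t
```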

BioIntelligence Lab. 11

Profit and Wealth for Trading Systems (2)

Multiplicative profits
A fixed fraction ν > 0 of accumulated wealth is invested in each trade
r_t = (z_t / z_{t−1}) − 1

In the case of no short sales and ν = 1:

W_T = W_0 \prod_{t=1}^{T} \{ 1 + R_t \}

\{ 1 + R_t \} = \left\{ 1 + (1 - F_{t-1})\, r_t^f + F_{t-1} r_t \right\} \left( 1 - \delta\, |F_t - F_{t-1}| \right)
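For comparison with the additive case, a sketch of the multiplicative wealth computation (assuming long/neutral positions only, per the no-short-sales condition above; names are illustrative):

```python
import numpy as np

def multiplicative_wealth(F, r, r_f, delta, W0=1.0):
    """W_T = W_0 * prod_t {1 + (1-F_{t-1})*r_f_t + F_{t-1}*r_t} * (1 - delta*|F_t - F_{t-1}|)."""
    F, r, r_f = (np.asarray(a, dtype=float) for a in (F, r, r_f))
    gross = 1.0 + (1.0 - F[:-1]) * r_f + F[:-1] * r   # return from holding position F_{t-1}
    cost = 1.0 - delta * np.abs(np.diff(F))           # transaction-cost factor
    return W0 * np.prod(gross * cost)
```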

BioIntelligence Lab. 12

Performance Criteria

U_T in general form: U(R_T, …, R_t, …, R_2, R_1; W_0)
Simple form U(W_T): standard economic utility

Path-dependent performance functions: Sharpe ratio, etc.

Moody's focus: the marginal increase of U_t caused by R_t at each time step

Differential performance criteria

D_t \equiv U_t - U_{t-1}

BioIntelligence Lab. 13

Differential Sharpe Ratio (1)

Sharpe ratio: risk-adjusted return

Differential Sharpe ratio: for on-line learning we need the influence of the return R_t at time t, so exponential moving averages are used

First-order Taylor expansion in the adaptation rate η

S_T = \frac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}

S_t \approx S_{t-1} + \eta \left. \frac{dS_t}{d\eta} \right|_{\eta=0} + O(\eta^2), \qquad D_t \equiv \left. \frac{dS_t}{d\eta} \right|_{\eta=0}

When η = 0, S_t = S_{t−1}.

BioIntelligence Lab. 14

Differential Sharpe Ratio (2)

Exponential moving averages with adaptation rate η

Sharpe ratio

From the Taylor expansion,

A_t = A_{t-1} + \eta \, \Delta A_t = A_{t-1} + \eta \,(R_t - A_{t-1})
B_t = B_{t-1} + \eta \, \Delta B_t = B_{t-1} + \eta \,(R_t^2 - B_{t-1})

S_t = \frac{A_t}{K_\eta \,(B_t - A_t^2)^{1/2}}, \qquad K_\eta = \left( \frac{1 - \eta/2}{1 - \eta} \right)^{1/2}

D_t \equiv \left. \frac{dS_t}{d\eta} \right|_{\eta=0} = \frac{B_{t-1}\, \Delta A_t - \tfrac{1}{2} A_{t-1}\, \Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}

R_t > A_{t−1}: increased reward

R_t^2 > B_{t−1}: increased risk
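A minimal on-line update of the differential Sharpe ratio following the recursions above (a sketch; the small eps guard and the default adaptation rate are additions not specified by the slide):

```python
def differential_sharpe(R_t, A_prev, B_prev, eta=0.01, eps=1e-12):
    """Return (D_t, A_t, B_t) given the new return R_t and the previous
    exponential moving averages A_{t-1} of R and B_{t-1} of R^2."""
    dA = R_t - A_prev
    dB = R_t ** 2 - B_prev
    # D_t = dS_t/d(eta) evaluated at eta = 0
    D_t = (B_prev * dA - 0.5 * A_prev * dB) / ((B_prev - A_prev ** 2) ** 1.5 + eps)
    A_t = A_prev + eta * dA
    B_t = B_prev + eta * dB
    return D_t, A_t, B_t
```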

BioIntelligence Lab. 15

Differential Sharpe Ratio (3)

Derivative with respect to R_t

D_t is maximized at R_t = B_{t−1}/A_{t−1}

Meaning of the differential Sharpe ratio
Makes on-line learning possible: easily computed from A_{t−1} and B_{t−1}
Recursive updating is possible
Recent returns are weighted most strongly
Interpretability: the contribution of each return R_t becomes visible

\frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} R_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}

BioIntelligence Lab. 16

III. Learning to Trade

BioIntelligence Lab. 17

Reinforcement Framework

RL
Maximizing the expected reward
Trial-and-error exploration of the environment

Comparison with supervised learning [1, 2]
Supervised learning becomes problematic when transaction costs are included
Structural credit assignment vs. temporal credit assignment

Types of RL
DR: policy search
Q-learning: value function
Actor-critic methods

BioIntelligence Lab. 18

Recurrent Reinforcement Learning (1)

Goal
For the trading system F_t(θ), find the parameters θ that maximize U_T

Example trading system

Trading return

Gradient after a sequence of T periods

F_t = F(\theta_t; F_{t-1}, I_t)

I_t = \{ z_t, z_{t-1}, z_{t-2}, \ldots ;\; y_t, y_{t-1}, y_{t-2}, \ldots \}

P_T = \sum_{t=1}^{T} R_t, \qquad R_t = \mu \left\{ F_{t-1} r_t - \delta\, |F_t - F_{t-1}| \right\}

\frac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \frac{dU_T}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right\}

BioIntelligence Lab. 19

Recurrent Reinforcement Learning (2)

Learning techniques

Back-propagation through time (BPTT)

Temporal dependencies

Stochastic (on-line) version: keep only the terms that depend on the most recent return R_t

\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}} \frac{dF_{t-1}}{d\theta}

\frac{dU_t(\theta_t)}{d\theta_t} \approx \frac{dU_t}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta_{t-1}} \right\}

With the differential performance criterion D_t:

\frac{dD_t(\theta_t)}{d\theta_t} \approx \frac{dD_t}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta_{t-1}} \right\}
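A minimal sketch of one on-line RRL step for a single tanh-unit trader, combining the recursions above with the differential Sharpe ratio derivative dD_t/dR_t (not the authors' code; the input layout, the return definition R_t = μ(F_{t−1} r_t − δ|F_t − F_{t−1}|), and the default constants are assumptions):

```python
import numpy as np

def rrl_step(theta, dF_prev, F_prev, x_t, r_t, dD_dR, mu=1.0, delta=0.001, rho=0.01):
    """One on-line RRL update.

    theta   -- parameter vector; its last entry is u, the weight on the recurrent input F_{t-1}
    dF_prev -- dF_{t-1}/dtheta carried over from the previous step
    F_prev  -- previous (continuous) position F_{t-1}
    x_t     -- input vector [1, r_t, r_{t-1}, ..., r_{t-m}, F_{t-1}]
    dD_dR   -- dD_t/dR_t, e.g. the differential Sharpe ratio derivative
    """
    F_t = np.tanh(np.dot(theta, x_t))                      # trader output in (-1, 1)
    # Recurrent gradient: dF_t/dtheta = (1 - F_t^2) * (x_t + u * dF_{t-1}/dtheta)
    dF_t = (1.0 - F_t ** 2) * (np.asarray(x_t, float) + theta[-1] * dF_prev)
    # Partials of R_t = mu*(F_{t-1}*r_t - delta*|F_t - F_{t-1}|)
    s = np.sign(F_t - F_prev)
    dR_dFt, dR_dFprev = -mu * delta * s, mu * (r_t + delta * s)
    # Chain rule for the differential criterion, then gradient ascent
    dD_dtheta = dD_dR * (dR_dFt * dF_t + dR_dFprev * dF_prev)
    return theta + rho * dD_dtheta, F_t, dF_t
```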

BioIntelligence Lab. 20

Recurrent Reinforcement Learning (3)

Reminder
Moody focuses on optimizing D_t, an immediate measure of the effect of a specific action
[1, 2]: portfolio optimization, etc.

BioIntelligence Lab. 21

Value Function (1)

Implicitly learning correct actions through value iteration

Value function
Discounted future rewards received from state x when following policy π

V^{\pi}(x) = \sum_{a} \pi(x,a) \sum_{y} p_{xy}(a) \left\{ D(x,y,a) + \gamma V^{\pi}(y) \right\}

π(x, a): probability of taking action a in state x
p_{xy}(a): probability of the transition x → y when taking action a
D(x, y, a): immediate reward for the transition x → y under action a
γ: discount factor between immediate and future rewards

BioIntelligence Lab. 22

Value Function (2)

Optimal value function & Bellman's optimality equation

Value iteration update: converges to the optimal solution

Optimal policy

V^{*}(x) = \max_{\pi} V^{\pi}(x)

V^{*}(x) = \max_{a} \sum_{y} p_{xy}(a) \left\{ D(x,y,a) + \gamma V^{*}(y) \right\}

V_{t+1}(x) = \max_{a} \sum_{y} p_{xy}(a) \left\{ D(x,y,a) + \gamma V_{t}(y) \right\}

a^{*}(x) = \arg\max_{a} \sum_{y} p_{xy}(a) \left\{ D(x,y,a) + \gamma V^{*}(y) \right\}

BioIntelligence Lab. 23

Q-Learning

Q-function: the expected future reward of taking the current action in the current state

Value iteration update: converges to the optimal Q-function

Calculating the best action requires no knowledge of p_{xy}(a)

Q^{*}(x,a) = \sum_{y} p_{xy}(a) \left\{ D(x,y,a) + \gamma \max_{b} Q^{*}(y,b) \right\}

Q_{t+1}(x,a) = \sum_{y} p_{xy}(a) \left\{ D(x,y,a) + \gamma \max_{b} Q_{t}(y,b) \right\}

a^{*}(x) = \arg\max_{a} Q^{*}(x,a)

Error function for a function approximator (e.g., a neural network):

\frac{1}{2} \left( D(x,y,a) + \gamma \max_{b} Q(y,b) - Q(x,a) \right)^{2}
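A small tabular sketch of the value-iteration update for Q shown above (assumes the transition probabilities and rewards are known and stored as arrays; names and shapes are illustrative):

```python
import numpy as np

def q_value_iteration(P, D, gamma=0.9, iters=500):
    """P[x, a, y] = p_xy(a); D[x, a, y] = immediate reward D(x, y, a)."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        # Q_{t+1}(x,a) = sum_y p_xy(a) * { D(x,y,a) + gamma * max_b Q_t(y,b) }
        Q = np.einsum("xay,xay->xa", P, D + gamma * Q.max(axis=1)[None, None, :])
    policy = Q.argmax(axis=1)  # a*(x) = argmax_a Q*(x,a)
    return Q, policy
```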

BioIntelligence Lab. 24

IV. Empirical Results

1. Artificial price series

2. U.S. Dollar/British Pound Exchange rate

3. Monthly S&P 500 stock index

BioIntelligence Lab. 25

A trading system based on DR

BioIntelligence Lab. 26

Artificial price series

Data: autoregressive trend processes

10,000-sample evaluation

Is RRL a suitable tool for learning trading strategies? How does the number of trades change as transaction costs increase?

p(t) = p(t-1) + \beta(t-1) + k\,\epsilon(t)
\beta(t) = \alpha\,\beta(t-1) + \nu(t)

z(t) = \exp\!\left( \frac{p(t)}{R} \right)
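A sketch of generating such a series in Python (the noise distributions, the values of α and k, and the use of the range of p(t) as the normalizer R are assumptions not fixed by the slide):

```python
import numpy as np

def artificial_prices(n=10000, alpha=0.9, k=3.0, seed=0):
    """Autoregressive trend process p(t) and price z(t) = exp(p(t)/R)."""
    rng = np.random.default_rng(seed)
    p = np.zeros(n)
    beta = 0.0
    for t in range(1, n):
        p[t] = p[t - 1] + beta + k * rng.standard_normal()  # p(t) = p(t-1) + beta(t-1) + k*eps(t)
        beta = alpha * beta + rng.standard_normal()          # beta(t) = alpha*beta(t-1) + nu(t)
    R = p.max() - p.min()                                     # scale normalizer (assumed: range of p)
    return np.exp(p / R)                                      # z(t) = exp(p(t)/R)
```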

BioIntelligence Lab. 27

Simulation over 10,000 samples

{long, short} positions only

Performance degrades over a stretch of roughly 2,000 periods

BioIntelligence Lab. 28

Zoom-in on samples 9,000 onward

= 0.01

BioIntelligence Lab. 29

Number of trades

Cumulative profit

Sharpe ratio

Results averaged over 100 trials; 100 epochs of training plus on-line adaptation; transaction costs of 0.2%, 0.5%, and 1%

BioIntelligence Lab. 30

U.S. Dollar/British Pound Foreign Exchange Trading

{long, neutral, short} trading system

30-minute U.S. Dollar/British Pound foreign exchange (FX) rate data
Traded 24 hours a day, 5 days a week; data from January through August 1996

Strategy: train on 2,000 data points, trade on the next 480 data points (2 weeks), then slide the window and retrain

Results: 15% annualized return with an annualized Sharpe ratio of 2.3; on average, one trade every 5 hours

Not taken into account: trading concentrated around price peaks, and market illiquidity

BioIntelligence Lab. 31

BioIntelligence Lab. 32

S&P 500/T-Bill Asset Allocation (1)

Overview

Long position: invested in the S&P 500, earning no T-Bill interest
Short position: earns twice the T-Bill rate

Dividends are reinvested

T-Bill yields

S&P 500 dividends

BioIntelligence Lab. 33

S&P 500/T-Bill Asset Allocation (2)

Simulation

Data (1950–1994): initial training (through 1969) + test (1970 onward)
Training window: 10 years of training + 10 years of validation
Input features: 84 (financial + macroeconomic) series

RRL-Trader: a single tanh unit, with weight decay

Q-Trader: uses bootstrap samples; 2-layer feedforward NN (30 tanh units); the bias/variance trade-off is handled by selecting among models with 10, 20, 30, and 40 units

BioIntelligence Lab. 34

Voting methods

RRL: committee of 30 runs, Q: committee of 10 runs; transaction cost 0.5%; profits reinvested; multiplicative profit ratio

Buy and Hold: 1,348%; Q-Trader: 3,359%; RRL-Trader: 5,860%

BioIntelligence Lab. 35

Underlying premise: over the 25 years from 1970 to 1994, the U.S. stock/Treasury markets were predictable.

Oil shock

Monetary tightening

Market correction

Market crash

Gulf War

Statistically significant

BioIntelligence Lab. 36

Sensitivity Analysis

S_i \equiv \frac{\left| dF/dx_i \right|}{\max_j \left| dF/dx_j \right|}

Expected inflation
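A sketch of computing these normalized sensitivities for a trained trader F (a finite-difference approximation used here purely for illustration; the function name and step size are assumptions):

```python
import numpy as np

def sensitivities(F, x, h=1e-4):
    """S_i = |dF/dx_i| / max_j |dF/dx_j| at the input point x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (F(x + e) - F(x - e)) / (2.0 * h)  # central difference for dF/dx_i
    return np.abs(grad) / np.abs(grad).max()
```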

BioIntelligence Lab. 37

V. Learn the Policy or Learn the Value?

BioIntelligence Lab. 38

Immediate vs. Future Rewards

Reinforcement signal
Immediate (RRL) or delayed (Q-learning, dynamic programming, or TD)

RRL: the policy is represented directly; learning a value function is bypassed

Q-learning: the policy is represented indirectly

BioIntelligence Lab. 39

Policies vs. Values

Some limitations of the value-function approach
The original formulation of Q-learning assumes discrete action & state spaces
Curse of dimensionality
Policies derived from Q-learning tend to be brittle: small changes in the value function may lead to large changes in the policy
Large-scale noise and non-stationarity may cause severe problems

RRL's advantages
The policy is represented directly: a simpler functional form suffices
Can produce real-valued actions
More robust in noisy environments; quick adaptation to non-stationarity

BioIntelligence Lab. 40

An Example

Simple trading system: {buy, sell} a single asset. Assumption: r_{t+1} is known in advance.

No need for future rewards: γ = 0

The policy function is trivial: a_t = r_{t+1}

1 tanh unit is sufficient

Value function: must be able to represent XOR, so 2 tanh units are needed

BioIntelligence Lab. 41

Conclusion

How to train trading systems via DR
RRL algorithm
Differential Sharpe ratio & differential downside deviation ratio
RRL is more efficient than Q-learning in the financial domain.
