
BioIntelligence Lab. 1

Learning to Trade via Direct Reinforcement

John Moody and Matthew Saffell, IEEE Transactions on Neural Networks, 12(4), pp. 875–889, 2001

Summarized by Jangmin O

BioIntelligence Lab. 2

Author

J. Moody
Director of the Computational Finance Program and a Professor of CSEE at Oregon Graduate Institute of Science and Technology
Founder & President of Nonlinear Prediction Systems
Program Co-Chair for Computational Finance 2000
A past General Chair and Program Chair of NIPS
A member of the editorial board of Quantitative Finance

BioIntelligence Lab. 3

I. Introduction

BioIntelligence Lab. 4

Optimizing Investment Performance

Characteristic: path-dependent

Methods: Direct Reinforcement learning (DR), Recurrent Reinforcement Learning [1, 2]
No need for a forecasting model
Single security trading or asset allocation

Recurrent Reinforcement Learning (RRL)
Adaptive policy search
Learns the investment strategy on-line
No need to learn a value function
Immediate rewards are available in financial markets

BioIntelligence Lab. 5

Difference between RRL & Q or TD

The financial decision-making problem is well suited to RRL

Immediate feedback available

Performance criteria: risk-adjusted investment returns
Sharpe ratio
Downside risk minimization

Differential form

BioIntelligence Lab. 6

Experimental Data

U.S. dollar/British Pound foreign exchange market

S&P 500 Stock Index and Treasury Bills

RRL vs. Q-learning: Bellman's curse of dimensionality

BioIntelligence Lab. 7

II. Trading Systems and Performance Criteria

BioIntelligence Lab. 8

Structure of Trading Systems (1)

An agent: assumptions

Trades a fixed position size in a single market
Trader's position at time t: F_t ∈ {+1, 0, −1}
Long: buy, Neutral: stay out of the market, Short: sell short
Profit R_t is realized at the end of the interval (t−1, t]: the gain or loss from holding position F_{t−1}, plus the transaction cost of moving from position F_{t−1} to F_t
A recurrent structure is required in order to make decisions that account for transaction costs, market impact, taxes, etc.

BioIntelligence Lab. 9

Structure of Trading Systems (2)

A single-asset trading system

θ_t: system parameters at time t
I_t: information set at time t
z_t: price series, y_t: other external variable series

F_t = F(\theta_t; F_{t-1}, I_t)

I_t = \{ z_t, z_{t-1}, z_{t-2}, \ldots ;\; y_t, y_{t-1}, y_{t-2}, \ldots \}

Simple example:

F_t = \mathrm{sign}(u F_{t-1} + v_0 r_t + v_1 r_{t-1} + \cdots + v_m r_{t-m} + w)
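A minimal sketch of this recurrent decision rule in Python (not the authors' implementation; the parameter packaging and window length are assumptions):

```python
import numpy as np

def trade_decision(theta, F_prev, returns_window):
    """One step of the recurrent trader F_t = sign(u*F_{t-1} + sum_i v_i*r_{t-i} + w).

    theta          -- dict with scalar 'u', weight vector 'v' (length m+1), and bias 'w'
    F_prev         -- previous position F_{t-1} in {-1, 0, +1}
    returns_window -- array [r_t, r_{t-1}, ..., r_{t-m}] of recent price changes
    """
    activation = theta["u"] * F_prev + np.dot(theta["v"], returns_window) + theta["w"]
    return np.sign(activation)  # +1 long, -1 short (0 only if the activation is exactly zero)
```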

BioIntelligence Lab. 10

Profit and Wealth for Trading Systems (1)

Performance function U() for a risk-insensitive trader = profit

Additive profits
Appropriate when each trade is for a fixed number μ of shares or contracts of the security
r_t = z_t − z_{t−1}: return of the risky asset
r_t^f: return of the risk-free asset (e.g., T-bills)
δ: transaction cost rate
Trader's wealth: W_T = W_0 + P_T

P_T = \sum_{t=1}^{T} R_t, \qquad R_t = \mu \left\{ r_t^f + F_{t-1}(r_t - r_t^f) - \delta\, |F_t - F_{t-1}| \right\}
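A small Python helper computing these additive returns and the cumulative profit P_T (a sketch; array conventions and argument names are assumptions):

```python
import numpy as np

def additive_profit(F, r, r_f, delta, mu=1.0):
    """P_T = sum_t mu*{ r_f_t + F_{t-1}*(r_t - r_f_t) - delta*|F_t - F_{t-1}| }.

    F     -- positions F_0..F_T (length T+1), each in {-1, 0, +1}
    r     -- risky-asset price changes r_1..r_T (length T)
    r_f   -- risk-free price changes r_f_1..r_f_T (length T)
    delta -- transaction cost per unit change in position
    mu    -- fixed number of shares or contracts traded
    """
    F, r, r_f = (np.asarray(a, dtype=float) for a in (F, r, r_f))
    R = mu * (r_f + F[:-1] * (r - r_f) - delta * np.abs(np.diff(F)))
    return R.sum(), R  # total profit P_T and the per-period returns R_t
```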

BioIntelligence Lab. 11

Profit and Wealth for Trading Systems (2)

Multiplicative profits
A fixed fraction ν > 0 of accumulated wealth is invested in each trade
r_t = (z_t / z_{t−1}) − 1

In the case of no short sales and ν = 1:

W_T = W_0 \prod_{t=1}^{T} \{ 1 + R_t \}

\{ 1 + R_t \} = \left\{ 1 + (1 - F_{t-1})\, r_t^f + F_{t-1} r_t \right\} \left( 1 - \delta\, |F_t - F_{t-1}| \right)
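For comparison with the additive case, a sketch of the multiplicative wealth computation (assuming long/neutral positions only, per the no-short-sales condition above; names are illustrative):

```python
import numpy as np

def multiplicative_wealth(F, r, r_f, delta, W0=1.0):
    """W_T = W_0 * prod_t {1 + (1-F_{t-1})*r_f_t + F_{t-1}*r_t} * (1 - delta*|F_t - F_{t-1}|)."""
    F, r, r_f = (np.asarray(a, dtype=float) for a in (F, r, r_f))
    gross = 1.0 + (1.0 - F[:-1]) * r_f + F[:-1] * r   # return from holding position F_{t-1}
    cost = 1.0 - delta * np.abs(np.diff(F))           # transaction-cost factor
    return W0 * np.prod(gross * cost)
```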

BioIntelligence Lab. 12

Performance Criteria

U_T in general form: U(R_T, …, R_t, …, R_2, R_1; W_0)
Simple form U(W_T): standard economic utility

Path-dependent performance functions: Sharpe ratio, etc.

Moody's focus: the marginal increase of U_t caused by R_t at each time step

Differential performance criteria

D_t \equiv U_t - U_{t-1}

BioIntelligence Lab. 13

Differential Sharpe Ratio (1)

Sharpe ratio: risk-adjusted return

Differential Sharpe ratio: for on-line learning we need the influence of the return R_t at time t, so exponential moving averages are used

First-order Taylor expansion in the adaptation rate η

S_T = \frac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}

S_t \approx S_{t-1} + \eta \left. \frac{dS_t}{d\eta} \right|_{\eta=0} + O(\eta^2), \qquad D_t \equiv \left. \frac{dS_t}{d\eta} \right|_{\eta=0}

When η = 0, S_t = S_{t−1}.

BioIntelligence Lab. 14

Differential Sharpe Ratio (2)

Exponential moving averages with adaptation rate η

Sharpe ratio

From the Taylor expansion,

A_t = A_{t-1} + \eta \, \Delta A_t = A_{t-1} + \eta \,(R_t - A_{t-1})
B_t = B_{t-1} + \eta \, \Delta B_t = B_{t-1} + \eta \,(R_t^2 - B_{t-1})

S_t = \frac{A_t}{K_\eta \,(B_t - A_t^2)^{1/2}}, \qquad K_\eta = \left( \frac{1 - \eta/2}{1 - \eta} \right)^{1/2}

D_t \equiv \left. \frac{dS_t}{d\eta} \right|_{\eta=0} = \frac{B_{t-1}\, \Delta A_t - \tfrac{1}{2} A_{t-1}\, \Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}

R_t > A_{t−1}: increased reward

R_t^2 > B_{t−1}: increased risk
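A minimal on-line update of the differential Sharpe ratio following the recursions above (a sketch; the small eps guard and the default adaptation rate are additions not specified by the slide):

```python
def differential_sharpe(R_t, A_prev, B_prev, eta=0.01, eps=1e-12):
    """Return (D_t, A_t, B_t) given the new return R_t and the previous
    exponential moving averages A_{t-1} of R and B_{t-1} of R^2."""
    dA = R_t - A_prev
    dB = R_t ** 2 - B_prev
    # D_t = dS_t/d(eta) evaluated at eta = 0
    D_t = (B_prev * dA - 0.5 * A_prev * dB) / ((B_prev - A_prev ** 2) ** 1.5 + eps)
    A_t = A_prev + eta * dA
    B_t = B_prev + eta * dB
    return D_t, A_t, B_t
```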

BioIntelligence Lab. 15

Differential Sharpe Ratio (3)

Derivative with respect to R_t

D_t is maximized at R_t = B_{t−1}/A_{t−1}

Meaning of the differential Sharpe ratio
Makes on-line learning possible: easily computed from A_{t−1} and B_{t−1}
Recursive updating is possible
Recent returns are weighted most strongly
Interpretability: the contribution of each return R_t becomes visible

\frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} R_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}

BioIntelligence Lab. 16

III. Learning to Trade

BioIntelligence Lab. 17

Reinforcement Framework

RL
Maximizing the expected reward
Trial-and-error exploration of the environment

Comparison with supervised learning [1, 2]
Supervised learning becomes problematic when transaction costs are included
Structural credit assignment vs. temporal credit assignment

Types of RL
DR: policy search
Q-learning: value function
Actor-critic methods

BioIntelligence Lab. 18

Recurrent Reinforcement Learning (1)

Goal
For the trading system F_t(θ), find the parameters θ that maximize U_T

Example trading system

Trading return

Gradient after a sequence of T periods

F_t = F(\theta_t; F_{t-1}, I_t)

I_t = \{ z_t, z_{t-1}, z_{t-2}, \ldots ;\; y_t, y_{t-1}, y_{t-2}, \ldots \}

P_T = \sum_{t=1}^{T} R_t, \qquad R_t = \mu \left\{ F_{t-1} r_t - \delta\, |F_t - F_{t-1}| \right\}

\frac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \frac{dU_T}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right\}

BioIntelligence Lab. 19

Recurrent Reinforcement Learning (2)

Learning techniques

Back-propagation through time (BPTT)

Temporal dependencies

Stochastic (on-line) version: keep only the terms that depend on the most recent return R_t

\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}} \frac{dF_{t-1}}{d\theta}

\frac{dU_t(\theta_t)}{d\theta_t} \approx \frac{dU_t}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta_{t-1}} \right\}

With the differential performance criterion D_t:

\frac{dD_t(\theta_t)}{d\theta_t} \approx \frac{dD_t}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta_{t-1}} \right\}
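A minimal sketch of one on-line RRL step for a single tanh-unit trader, combining the recursions above with the differential Sharpe ratio derivative dD_t/dR_t (not the authors' code; the input layout, the return definition R_t = μ(F_{t−1} r_t − δ|F_t − F_{t−1}|), and the default constants are assumptions):

```python
import numpy as np

def rrl_step(theta, dF_prev, F_prev, x_t, r_t, dD_dR, mu=1.0, delta=0.001, rho=0.01):
    """One on-line RRL update.

    theta   -- parameter vector; its last entry is u, the weight on the recurrent input F_{t-1}
    dF_prev -- dF_{t-1}/dtheta carried over from the previous step
    F_prev  -- previous (continuous) position F_{t-1}
    x_t     -- input vector [1, r_t, r_{t-1}, ..., r_{t-m}, F_{t-1}]
    dD_dR   -- dD_t/dR_t, e.g. the differential Sharpe ratio derivative
    """
    F_t = np.tanh(np.dot(theta, x_t))                      # trader output in (-1, 1)
    # Recurrent gradient: dF_t/dtheta = (1 - F_t^2) * (x_t + u * dF_{t-1}/dtheta)
    dF_t = (1.0 - F_t ** 2) * (np.asarray(x_t, float) + theta[-1] * dF_prev)
    # Partials of R_t = mu*(F_{t-1}*r_t - delta*|F_t - F_{t-1}|)
    s = np.sign(F_t - F_prev)
    dR_dFt, dR_dFprev = -mu * delta * s, mu * (r_t + delta * s)
    # Chain rule for the differential criterion, then gradient ascent
    dD_dtheta = dD_dR * (dR_dFt * dF_t + dR_dFprev * dF_prev)
    return theta + rho * dD_dtheta, F_t, dF_t
```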

BioIntelligence Lab. 20

Recurrent Reinforcement Learning (3)

Reminder
Moody focuses on optimizing D_t, an immediate measure of the effect of a specific action
[1, 2]: portfolio optimization, etc.

BioIntelligence Lab. 21

Value Function (1)

Implicitly learning correct actions through value iteration

Value function
Discounted future rewards received from state x when following policy π

V^{\pi}(x) = \sum_{a} \pi(x,a) \sum_{y} p_{xy}(a) \left\{ D(x,y,a) + \gamma V^{\pi}(y) \right\}

π(x, a): probability of taking action a in state x
p_{xy}(a): probability of the transition x → y when taking action a
D(x, y, a): immediate reward for the transition x → y under action a
γ: discount factor between immediate and future rewards

BioIntelligence Lab. 22

Value Function (2)

Optimal value function & Bellman's optimality equation

Value iteration update: converges to the optimal solution

Optimal policy

V^{*}(x) = \max_{\pi} V^{\pi}(x)

V^{*}(x) = \max_{a} \sum_{y} p_{xy}(a) \left\{ D(x,y,a) + \gamma V^{*}(y) \right\}

V_{t+1}(x) = \max_{a} \sum_{y} p_{xy}(a) \left\{ D(x,y,a) + \gamma V_{t}(y) \right\}

a^{*}(x) = \arg\max_{a} \sum_{y} p_{xy}(a) \left\{ D(x,y,a) + \gamma V^{*}(y) \right\}

BioIntelligence Lab. 23

Q-Learning

Q-function: the expected future reward of taking the current action in the current state

Value iteration update: converges to the optimal Q-function

Calculating the best action requires no knowledge of p_{xy}(a)

Q^{*}(x,a) = \sum_{y} p_{xy}(a) \left\{ D(x,y,a) + \gamma \max_{b} Q^{*}(y,b) \right\}

Q_{t+1}(x,a) = \sum_{y} p_{xy}(a) \left\{ D(x,y,a) + \gamma \max_{b} Q_{t}(y,b) \right\}

a^{*}(x) = \arg\max_{a} Q^{*}(x,a)

Error function for a function approximator (e.g., a neural network):

\frac{1}{2} \left( D(x,y,a) + \gamma \max_{b} Q(y,b) - Q(x,a) \right)^{2}
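A small tabular sketch of the value-iteration update for Q shown above (assumes the transition probabilities and rewards are known and stored as arrays; names and shapes are illustrative):

```python
import numpy as np

def q_value_iteration(P, D, gamma=0.9, iters=500):
    """P[x, a, y] = p_xy(a); D[x, a, y] = immediate reward D(x, y, a)."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        # Q_{t+1}(x,a) = sum_y p_xy(a) * { D(x,y,a) + gamma * max_b Q_t(y,b) }
        Q = np.einsum("xay,xay->xa", P, D + gamma * Q.max(axis=1)[None, None, :])
    policy = Q.argmax(axis=1)  # a*(x) = argmax_a Q*(x,a)
    return Q, policy
```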

BioIntelligence Lab. 24

IV. Empirical Results

1. Artificial price series

2. U.S. Dollar/British Pound Exchange rate

3. Monthly S&P 500 stock index

BioIntelligence Lab. 25

A trading system based on DR

BioIntelligence Lab. 26

Artificial price series

Data: autoregressive trend processes

10,000-sample evaluation

Is RRL a suitable tool for learning trading strategies? How does the number of trades change as transaction costs increase?

p(t) = p(t-1) + \beta(t-1) + k\,\epsilon(t)
\beta(t) = \alpha\,\beta(t-1) + \nu(t)

z(t) = \exp\!\left( \frac{p(t)}{R} \right)
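A sketch of generating such a series in Python (the noise distributions, the values of α and k, and the use of the range of p(t) as the normalizer R are assumptions not fixed by the slide):

```python
import numpy as np

def artificial_prices(n=10000, alpha=0.9, k=3.0, seed=0):
    """Autoregressive trend process p(t) and price z(t) = exp(p(t)/R)."""
    rng = np.random.default_rng(seed)
    p = np.zeros(n)
    beta = 0.0
    for t in range(1, n):
        p[t] = p[t - 1] + beta + k * rng.standard_normal()  # p(t) = p(t-1) + beta(t-1) + k*eps(t)
        beta = alpha * beta + rng.standard_normal()          # beta(t) = alpha*beta(t-1) + nu(t)
    R = p.max() - p.min()                                     # scale normalizer (assumed: range of p)
    return np.exp(p / R)                                      # z(t) = exp(p(t)/R)
```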

BioIntelligence Lab. 27

Simulation over 10,000 samples

{long, short} positions only

Performance degrades over a stretch of roughly 2,000 periods

BioIntelligence Lab. 28

Zoom-in on samples 9,000 onward

= 0.01

BioIntelligence Lab. 29

Number of trades

Cumulative profit

Sharpe ratio

Results averaged over 100 trials; 100 epochs of training plus on-line adaptation; transaction costs of 0.2%, 0.5%, and 1%

BioIntelligence Lab. 30

U.S. Dollar/British Pound Foreign Exchange Trading

{long, neutral, short} trading system

30-minute U.S. Dollar/British Pound foreign exchange (FX) rate data
Traded 24 hours a day, 5 days a week; data from January through August 1996

Strategy: train on 2,000 data points, trade on the next 480 data points (2 weeks), then slide the window and retrain

Results: 15% annualized return with an annualized Sharpe ratio of 2.3; on average, one trade every 5 hours

Not taken into account: trading concentrated around price peaks, and market illiquidity

BioIntelligence Lab. 31

BioIntelligence Lab. 32

S&P 500/T-Bill Asset Allocation (1)

Overview

Long position: invested in the S&P 500, earning no T-Bill interest
Short position: earns twice the T-Bill rate

Dividends are reinvested

T-Bill yields

S&P 500 dividends

BioIntelligence Lab. 33

S&P 500/T-Bill Asset Allocation (2)

Simulation

Data (1950–1994): initial training (through 1969) + test (1970 onward)
Training window: 10 years of training + 10 years of validation
Input features: 84 (financial + macroeconomic) series

RRL-Trader: a single tanh unit, with weight decay

Q-Trader: uses bootstrap samples; 2-layer feedforward NN (30 tanh units); the bias/variance trade-off is handled by selecting among models with 10, 20, 30, and 40 units

BioIntelligence Lab. 34

Voting methods

RRL: committee of 30 runs, Q: committee of 10 runs; transaction cost 0.5%; profits reinvested; multiplicative profit ratio

Buy and Hold: 1,348%; Q-Trader: 3,359%; RRL-Trader: 5,860%

BioIntelligence Lab. 35

Underlying premise: over the 25 years from 1970 to 1994, the U.S. stock/Treasury markets were predictable.

Oil shock

Monetary tightening

Market correction

Market crash

Gulf War

Statistically significant

BioIntelligence Lab. 36

Sensitivity Analysis

S_i \equiv \frac{\left| dF/dx_i \right|}{\max_j \left| dF/dx_j \right|}

Expected inflation
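A sketch of computing these normalized sensitivities for a trained trader F (a finite-difference approximation used here purely for illustration; the function name and step size are assumptions):

```python
import numpy as np

def sensitivities(F, x, h=1e-4):
    """S_i = |dF/dx_i| / max_j |dF/dx_j| at the input point x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (F(x + e) - F(x - e)) / (2.0 * h)  # central difference for dF/dx_i
    return np.abs(grad) / np.abs(grad).max()
```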

BioIntelligence Lab. 37

V. Learn the Policy or Learn the Value?

BioIntelligence Lab. 38

Immediate vs. Future Rewards

Reinforcement signal
Immediate (RRL) or delayed (Q-learning, dynamic programming, or TD)

RRL: the policy is represented directly; learning a value function is bypassed

Q-learning: the policy is represented indirectly

BioIntelligence Lab. 39

Policies vs. Values

Some limitations of the value-function approach
The original formulation of Q-learning assumes discrete action & state spaces
Curse of dimensionality
Policies derived from Q-learning tend to be brittle: small changes in the value function may lead to large changes in the policy
Large-scale noise and non-stationarity may cause severe problems

RRL's advantages
The policy is represented directly: a simpler functional form suffices
Can produce real-valued actions
More robust in noisy environments; quick adaptation to non-stationarity

BioIntelligence Lab. 40

An Example

Simple trading system: {buy, sell} a single asset. Assumption: r_{t+1} is known in advance.

No need for future rewards: γ = 0

The policy function is trivial: a_t = r_{t+1}

1 tanh unit is sufficient

Value function: must be able to represent XOR, so 2 tanh units are needed

BioIntelligence Lab. 41

Conclusion

How to train trading systems via DR
RRL algorithm
Differential Sharpe ratio & differential downside deviation ratio
RRL is more efficient than Q-learning in the financial domain.
