Model-based Reinforcement Learning for Predictions and Control for Limit Order Books

Haoran Wei,1* Yuanbo Wang,2* Lidia Mangu,2** Keith Decker,1**
1 University of Delaware

2 J.P. Morgan
[email protected], [email protected], [email protected], [email protected]

Abstract

We build a profitable electronic trading agent with Reinforcement Learning that places buy and sell orders in the stock market. An environment model is built only with historical observational data, and the RL agent learns the trading policy by interacting with the environment model instead of with the real market, to minimize the risk and potential monetary loss. Trained in an unsupervised and self-supervised fashion, our environment model learns a temporal and causal representation of the market in latent space through deep neural networks. We demonstrate that the trading policy trained entirely within the environment model can be transferred back into the real market and maintain its profitability. We believe that this environment model can serve as a robust simulator that predicts market movement as well as trade impact for further studies.

Introduction

Model-free Reinforcement Learning (MFRL) using deep-learning architectures has achieved robust performance on a variety of complicated tasks, ranging from the classic Atari 2600 video games (Mnih et al. 2015) to locomotion tasks (Lillicrap et al. 2015), and has even delivered super-human performance in challenging exploration domains (Salimans and Chen 2018). Because such approaches learn by trial and error, one of their limitations is that they usually require a large number of interactions with the environment for training purposes. This challenge becomes even more significant when these interactions are expensive or even dangerous (e.g., financial trading or self-driving cars).

Building a simulator of the environment could alleviate such problems. By learning a dynamics model that predicts the next state given the current state and action, model-based RL (MBRL) enables agents to explore inside the simulator safely. In recent research, such methods have proved to be sample efficient on various tasks without compromising policy effectiveness compared to model-free approaches (Kaiser et al. 2019; Zhang et al. 2018). Additionally, an environment model that enables predictions of the future is not only appealing to RL but also has general applications across various domains.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Most MBRL work still relies on some interactive feedback from reality to update the simulator online. This is crucial to avoid mismatch problems: the approximation error of the environment model may lead the RL policy in the wrong direction. With a fixed data set to train the policy on, this deviation from the optimal strategy becomes challenging to self-correct. Recent studies have started challenging this requirement for real-time interactions with the environment. David Ha et al. (Ha and Schmidhuber 2018) experimented with training policies for the VizDoom game completely inside the environment model, and the resulting policy outperformed benchmarks after being transferred back into the original game. Similarly, robust performance was observed in applying model-based RL methods to learn car-driving policies using observational data (Henaff, Canziani, and LeCun 2019). To advance MBRL systems that require no interactive data from the real environment in the more complex real world, we propose a model-based RL framework and test its robustness in the electronic trading domain.

Interestingly enough, we observe similarities between the implicit structure of MBRL and the explicit design of an electronic trading system. In the most general and simplistic case, an electronic trading system consists of a market simulator and decision strategies. The simulator predicts market dynamics using quantitative and statistical models, and it has inspired extensive research (Ang and Bekaert 2006; Bacchetta, Mertens, and Van Wincoop 2009). The decision strategies are usually influenced by domain knowledge, heuristics, observations, and sometimes the preferences of the algorithm users. There are two main approaches to building market simulators: traditional statistical methods and machine learning methods. Traditional statistical approaches try to model linear processes that underlie the generation of the time-series. For instance, the Glosten model (Rosu 2009) assumes that the price impact of trades is linear, immediate, and not state-dependent. The Vector Auto-Regressive (VAR) model (Zivot and Wang 2006) addresses the non-stationary aspect of financial time-series and can forecast with multiple variables. Similarly, in (Beltran, Grammig, and Menkveld 2005), researchers investigate the near-linear dynamics between sequences of orders and the evolution of the market. On the other hand, machine learning approaches, which do not assume linearity and require little prior knowledge about the input data, have also shown promising performance in market forecasting (Zhang, Zohren, and Roberts 2019; Ntakaris et al. 2018; Tsantekidis et al. 2017a; Tsantekidis et al. 2017b; Dixon, Klabjan, and Bang 2017).

Two drawbacks of this explicit simulator-policy framework are the accumulation of approximation errors from the simulator and the explosion in the number of corner cases in hand-crafted decision policies. There have been efforts to bypass the market-simulating stage and directly train an RL agent that takes market conditions as input and outputs decisions. (Bacoyannis et al. 2018) applies a model-free technique to solve trade execution problems. (Nevmyvaka, Feng, and Kearns 2006) trains an RL policy to replace hand-crafted decision rules while still employing a traditional market simulator. However, the same challenge persists: RL agents are still trained on static, pre-generated time-series data.

Given the structural similarities between MBRL and electronic trading systems, we think a model-based RL agent trained on observational data could potentially solve the problems in electronic trading listed above. To our knowledge, no prior work has been done on this topic, and we therefore aim to bridge the gap. We make two main contributions: (1) In our MBRL framework, we use latent representation learning to model not only the state space but also the rewards. We demonstrate the effectiveness of such representation learning in the financial domain, where data is high-dimensional and non-stationary. (2) Using this model of the environment, we show that our model-based agent consistently outperforms commonly used benchmark trading strategies. This approach enables the learning of profitable trading policies using observational data with no environment interaction or labeling by human experts. The project code will be released for replication.

Related Work

Two studies inspired our work. In World Models (Ha and Schmidhuber 2018), the authors employ generative models to obtain latent representations of the environment, which enables the RL agent to learn a compact yet effective policy. The other recent study builds a neural-network-simulated environment env' that not only shares an action space and reward space with the original environment env but also produces observations in the same format (Kaiser et al. 2019).

In reality, there are many scenarios where real environment interactions are costly or not feasible, such as autonomous driving (Wu et al. 2017), recommendation systems (Zhao et al. 2019), and trading systems. These environments pose a challenge for RL. However, if the next state can be predicted, real environment interactions may no longer be necessary. Lukasz Kaiser et al. (Kaiser et al. 2019) present a completely model-based RL approach to playing Atari games, where a CNN-based prediction model predicts the next game frame given the previous frames and action, and CNN layers extract hidden features autonomously. A variational layer is used as the last layer to learn the posterior of the next frame context, so that environment stochasticity is taken into account and learning is shown to be improved (Ha and Schmidhuber 2018; Oh et al. 2015; Leibfried, Kushman, and Hofmann 2016). Recently, model-based RL has also been used in recommendation systems (Zhao et al. 2019) to conduct random exploration without overwhelming users. However, few real-world applications have been seen compared to the wide adoption in the gaming domain. This is understandable because the real world has more complicated and uncertain factors to model.

In the finance domain, RL has been applied to many different problems (Fischer 2018), especially the design of electronic trading strategies (Bacoyannis et al. 2018; Bertoluzzo and Corazza 2012). However, most of this work has been done with model-free RL, such as Deep Q-networks (DQN) (Huang 2018), which have lower sample efficiency. Model-based methods, in contrast, require many fewer training samples; however, there is no existing finance RL model for random exploration. Our work aims to show that 1) a trading model can be built from historical observational data, and 2) model-based RL has potential in time-series decision making.

Background

In this section, we introduce some important elements of electronic trading systems and how they translate to concepts familiar to RL research. We also give a brief overview of the RL methods used in this study.

Trading Problem

Limit Order Books (LOBs) are used by more than half of the exchange markets in the world (Rosu and others 2010). An LOB has two types of orders: bid and ask. At any given time t, a bid (ask) order is an order to buy (sell) a certain quantity (aka size), bs(t) (as(t)), of a financial instrument at or below (above) a specified price bp(t) (ap(t)), as shown in Fig 1. Orders submitted at time t are sorted into different levels based on their prices. For instance, the lowest ask price and the highest bid price are grouped into the first level, followed by the second-lowest ap and the second-highest bp as the second level, and so on. {ap, bp, as, bs}_i are vectors of prices and quantities at the different price levels i. The time-series evolution of an LOB can be seen as a 3-dimensional tensor: the first dimension represents time, the second dimension is the level, and the third represents prices and order quantities on both the buy and sell sides (Gould et al. 2013). When submitted orders are executed by an LOB's trade-matching algorithm, the orders' price and quantity, together with the direction (bid or ask), are removed from the LOB and recorded in a historical trade print.

Mid Price is the mean value of the first-level ask and bid prices, as in Eq 1:

mid_t = \frac{ap_1 + bp_1}{2}    (1)

Taking the slice of the LOB in Fig 1 as an example, the mid price at this time t is (408.0 + 408.3)/2 = 408.15. The movement of the mid price is commonly used to approximate market change. In this study, we use the mid price to calculate the reward.
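As a concrete illustration of this layout, the sketch below assembles the time x level x field tensor described above from toy numbers and computes the level-1 mid price of Eq 1. The field ordering and the synthetic values are our own assumptions for illustration, not the paper's dataset.

```python
import numpy as np

# Toy LOB window: T time ticks, L price levels, 4 fields per level
# (ask price, ask size, bid price, bid size) -- field order is an assumption.
T, L = 40, 3
rng = np.random.default_rng(0)

ask_price = 408.3 + 0.1 * np.arange(L) + 0.05 * rng.standard_normal((T, L))
bid_price = 408.0 - 0.1 * np.arange(L) + 0.05 * rng.standard_normal((T, L))
ask_size = rng.integers(100, 1000, size=(T, L)).astype(float)
bid_size = rng.integers(100, 1000, size=(T, L)).astype(float)

# 3-D LOB tensor: time x level x {ap, as, bp, bs}
lob = np.stack([ask_price, ask_size, bid_price, bid_size], axis=-1)
print(lob.shape)  # (40, 3, 4)

# Flattened T x 4L view, the shape later fed to the auto-encoder
lob_flat = lob.reshape(T, 4 * L)

# Level-1 mid price at each tick (Eq 1)
mid = (lob[:, 0, 0] + lob[:, 0, 2]) / 2.0
print(mid[:3])
```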


Figure 1: A slice of a limit order book (LOB) with three levels on both the ask and bid sides

Trade prints are the records of executed trades; they contain the direction (buy or sell), trading price, and quantity. The collected trade prints may have been executed by different agents in the market. In this study, we use historical trade prints as our RL agent's exploration actions. We also include the sequence of trade prints prior to the target action as part of the state. This data provides crucial information for the state transition probability model (details are in the next section). We observe a similar problem setup in (Henaff, Canziani, and LeCun 2019), where the authors use a target car's driving trajectory for RL exploration and the surrounding cars' trajectories as part of the states.

RL for the Agent

We test three commonly used RL algorithms for policy training: Double Deep Q-Network (Double DQN) (Van Hasselt, Guez, and Silver 2016), Policy Gradient (Williams 1992), and Advantage Actor-Critic (A2C) (Mnih et al. 2016).

Double DQN: A deep Q-network is a neural network that approximates the Q-value of an input state-action pair. It is optimized by minimizing the squared error between the predicted and target Q-values. Double DQN uses two networks: an online network (Q, θ) for selecting actions, and a target network (Q, θ⁻) for determining their values. The online network (Q, θ) is optimized by minimizing the prediction error L(θ) with respect to the target Q-value:

L(\theta) = \mathbb{E}\left[\left(Q(s, a; \theta) - Q_{target}\right)^2\right]

Q_{target} = r + \gamma\, Q\left(s', \arg\max_{a'} Q(s', a'; \theta);\ \theta^-\right)    (2)

Q_{target} is the output of the target network (Q, θ⁻), whose weights θ⁻ remain unchanged except for a periodic copy of the weights from the online Q-network (Q, θ). Having a separate target Q-network helps reduce the policy variance caused by oscillations of the target value.
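A minimal PyTorch sketch of the Double DQN target in Eq 2 follows, assuming the paper's 21 discrete trade actions and a latent-state input; the network sizes and toy batch are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 16, 21, 0.99  # illustrative sizes

def make_q_net():
    # Small MLP Q-network over the latent state (architecture is an assumption)
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

online_q, target_q = make_q_net(), make_q_net()
target_q.load_state_dict(online_q.state_dict())  # periodic copy of online weights

def double_dqn_loss(s, a, r, s_next):
    # Q(s, a; theta) for the actions actually taken
    q_sa = online_q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # action selection by the online network, evaluation by the target network (Eq 2)
        a_star = online_q(s_next).argmax(dim=1, keepdim=True)
        q_target = r + GAMMA * target_q(s_next).gather(1, a_star).squeeze(1)
    return nn.functional.mse_loss(q_sa, q_target)

# toy batch
s = torch.randn(8, STATE_DIM); s_next = torch.randn(8, STATE_DIM)
a = torch.randint(0, N_ACTIONS, (8,)); r = torch.randn(8)
print(double_dqn_loss(s, a, r, s_next))
```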

Policy Gradient: In the policy gradient (PG) algorithm, a policy is directly modeled with a θ-parameterized function π(a|s, θ). Given π(a|s, θ) and the environment model p(s'|s, a), we can generate a trajectory τ = (s_1, a_1, s_2, a_2, ..., s_t, a_t) and its accumulated reward, as in Eq 3. The idea is to maximize the accumulated reward J(θ) by repeatedly updating θ with gradient ascent along ∇J(θ):

J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{H} \log \pi(a|s, \theta)\, G_{\tau}\right], \quad \text{where } G_{\tau} = \sum_{t=0}^{H} \gamma^{t} r_{t}    (3)

Compared to value-based RL methods, PG learns a policy directly. It is also less sensitive to the value-overestimation problems common in Q-learning (caused by the "max" operation). However, one drawback is that reward accumulation along a trajectory may cause high policy variance.
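A short REINFORCE-style sketch of Eq 3 in PyTorch, using the whole-trajectory return G_τ as the weight on every log-probability; the network and toy trajectory are illustrative assumptions.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 16, 21, 0.99
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))

def reinforce_loss(states, actions, rewards):
    # Discounted trajectory return G_tau from Eq 3
    g_tau = sum((GAMMA ** t) * r for t, r in enumerate(rewards))
    log_probs = torch.log_softmax(policy(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Negative of J(theta), so that gradient descent performs gradient ascent on J
    return -(chosen * g_tau).sum()

states = torch.randn(5, STATE_DIM)
actions = torch.randint(0, N_ACTIONS, (5,))
rewards = [0.1, -0.2, 0.3, 0.0, 0.5]
reinforce_loss(states, actions, rewards).backward()
```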

Advantage Actor-Critic (A2C): A2C is a hybrid RL method combining policy-gradient and value-based methods. It consists of two networks: an actor and a critic. The actor network updates the policy π(a|s, θ) by maximizing the objective function

J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{H} \log \pi(a|s, \theta)\, A(s, a)\right]    (4)

where A(s, a) is the advantage value representing how much better (or worse) a given action performs in state s compared to the average performance over all actions. It is updated as

A(s_t, a_t) = r_t + \gamma V(s_{t+1}; w) - V(s_t; w)    (5)

V(s; w) is the state utility representing the average performance over all actions in a given state, computed by the critic network. The critic network's parameter w is updated by gradient descent on the TD-error

J(w) = \left(r_t + \gamma V(s_{t+1}; w^-) - V(s_t; w)\right)^2    (6)

where w⁻ is the parameter from the critic's previous update. The advantage of A2C is twofold: 1) policy variance is reduced due to the advantage value; 2) the policy is updated directly instead of via a value estimation function.
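A compact PyTorch sketch of the A2C update (Eqs 4 to 6), with separate actor and critic networks; the layer sizes, the single-step batch, and the use of a no-grad target in place of frozen parameters w⁻ are illustrative assumptions.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 16, 21, 0.99
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))

def a2c_losses(s, a, r, s_next):
    v_s = critic(s).squeeze(1)
    with torch.no_grad():
        # TD target built from the critic's value of the next state (Eq 5)
        td_target = r + GAMMA * critic(s_next).squeeze(1)
        advantage = td_target - v_s.detach()
    log_probs = torch.log_softmax(actor(s), dim=1)
    chosen = log_probs.gather(1, a.unsqueeze(1)).squeeze(1)
    actor_loss = -(chosen * advantage).sum()               # negative of Eq 4
    critic_loss = nn.functional.mse_loss(v_s, td_target)   # TD-error objective (Eq 6)
    return actor_loss, critic_loss

s = torch.randn(8, STATE_DIM); s_next = torch.randn(8, STATE_DIM)
a = torch.randint(0, N_ACTIONS, (8,)); r = torch.randn(8)
actor_loss, critic_loss = a2c_losses(s, a, r, s_next)
```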

Problem Formulation and Dataset

Limit Order Book data is a time-series with a high sampling frequency. We model the environment as a Markov Decision Process (MDP). We use time-series data (LOB + trade prints) for one stock traded on the Hong Kong Stock Exchange. The update frequency of the LOB is ∼0.17s. We use January 2018 ∼ March 2018 (61 days) of data for training, with 20% held out as the validation set, and test model performance on April 2018 (19 days). In all, the dataset contains approximately 6 million transitions for training and 2 million for testing.

MDP Formulation

We model the trading problem as a Markov Decision Process (MDP) represented as {S, A, R, T, ρ_0}. We explain how to build the MDP from trading elements in this section.

State Space (S): s_t = {ae(ob_{t-T:t}), u_t, po_t}. z_t = ae(ob_{t-T:t}) is a latent representation of the LOB within a time duration T; we discuss the latent representation model further in the following section. u_t is a vector of trade prints occurring within the same duration T. po_t is the RL agent's position at time t. The position reflects the private inventory held by the agent, which, in our case, is bounded by (−po_max, po_max).

Figure 2: A demonstration of forming a state-action-state transition with limit order books and trade prints (the subscripts are time ticks for the LOB and are indices only for the trade prints)

Action Space (A): a_t = ±q. Each action is the RL agent's decision to trade. The decision includes the price, quantity, and direction of the trade. In our study, we assume that the trading price is set at the mid-price and can be directly calculated from the LOB update. Therefore, the RL agent's action contains only q, the absolute quantity, and ±, the trading direction (sell/buy).

Reward Function (R): We use a mark-to-market PnL to calculate the agent's reward. It is defined as

R(t) = \Delta mid_{s_t, s_{t+1}} \times po_t    (7)

where po_t is the RL agent's position at time t and \Delta mid_{s_t, s_{t+1}} is the difference in average LOB mid-price between state s_t and s_{t+1}.

Transitions (T(s_{t+1}|s_t, a_t)): We use observed trajectories of state transitions to train the environment model; one transition is demonstrated in Fig 2. Specifically, we iterate through the trade prints in the historical data and treat each trade as the target action of a transition. The same action may become part of the state u_t in the next transition, when the next trade becomes the target action.

Initial State (ρ_0): The initial state is sampled uniformly from the first states of all days in the training dataset.

Data Preprocessing

For each LOB time-series sequence ob_{t-T:t} of length T, we use the feature-level min and max to normalize the data across time. For trade-quantity normalization, we first exclude outlier trades whose quantity is below 100 or above 1000. We then use 100 and 1000 as the boundaries for min-max normalization; these boundaries are determined empirically from the data distribution. We also apply min-max normalization followed by a sigmoid transformation to the rewards. It is worth noting that the reward transformation is only applied inside the world model during exploration, for training purposes.
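The normalization described above can be sketched as follows; the exact scaling of the sigmoid applied to rewards is not specified in the paper, so the centering used below is an assumption.

```python
import numpy as np

Q_MIN, Q_MAX = 100.0, 1000.0  # empirical trade-quantity boundaries from the paper

def normalize_lob(window):
    # window: T x 4L array; per-feature min-max normalization across time
    lo = window.min(axis=0, keepdims=True)
    hi = window.max(axis=0, keepdims=True)
    return (window - lo) / np.maximum(hi - lo, 1e-8)

def normalize_quantity(q):
    # Drop outlier trades outside [100, 1000], then min-max normalize
    q = q[(q >= Q_MIN) & (q <= Q_MAX)]
    return (q - Q_MIN) / (Q_MAX - Q_MIN)

def transform_reward(r, r_min, r_max):
    # Min-max normalization followed by a sigmoid; used only inside the world
    # model during exploration (centering at 0.5 is an assumption)
    x = (r - r_min) / max(r_max - r_min, 1e-8)
    return 1.0 / (1.0 + np.exp(-(x - 0.5)))
```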

Model

This work has two parts: (1) a world model that consists of a latent representation of the LOB, a state-action transition model, and a reward model; and (2) a trading agent trained on the world model with three widely used RL methods.

World Model

The world model consists of three parts: a latent representation model, a transition model, and a reward model.

Latent Representation Model (Auto-encoder): The LOB data contains important time-series market information, but it is difficult to learn from due to its high dimensionality. The role of the auto-encoder (AE), shown in Fig 3(a), is to find an abstract, low-dimensional representation of the LOB observations that is easier to learn from. The input and output of the AE are both of dimension T × 4L, where L is the number of levels in the LOB; we use 3 levels in our experiments. After the AE is trained, we take the middle layer of dimension 1 × m as the latent representation of the high-dimensional LOB data. Here we have m = 16.
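A minimal sketch of the latent representation step in PyTorch, simplified to fully connected layers; the paper's actual model is a CNN auto-encoder (see Fig 5), so the layer choice here is an assumption made only to show the T × 4L input and 1 × m latent bottleneck.

```python
import torch
import torch.nn as nn

T, L, M = 40, 3, 16          # window length, LOB levels, latent size (paper values)
IN_DIM = T * 4 * L           # flattened T x 4L window

class LOBAutoEncoder(nn.Module):
    """Simplified fully connected auto-encoder; the paper uses a CNN version."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(IN_DIM, 128), nn.ReLU(), nn.Linear(128, M))
        self.decoder = nn.Sequential(nn.Linear(M, 128), nn.ReLU(), nn.Linear(128, IN_DIM))

    def forward(self, x):
        z = self.encoder(x)          # 1 x m latent representation of the LOB window
        return self.decoder(z), z

ae = LOBAutoEncoder()
x = torch.rand(32, IN_DIM)                 # batch of normalized LOB windows
recon, z = ae(x)
loss = nn.functional.mse_loss(recon, x)    # unsupervised reconstruction objective
```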

Transition Model (RNN-MDN): We use an RNN-MDN model to learn the state-action transitions P(s'|s, a). This model also addresses trades' impact on the market: we use a long sequence of (s, a) pairs to train the RNN and learn both short-term and long-term impacts. We further combine the RNN with a Mixture Density Network (MDN) (Bishop 1994). This approach approximates the output of the RNN as a mixture of distributions rather than a deterministic prediction of s'. The MDN is written as

p(s'|s, a) = \sum_{k=1}^{K} w_k(s, a)\, D\left(s' \mid \mu_k(s, a), \sigma_k^2(s, a)\right)    (8)

where D(·) is a presumed distribution, such as a Gaussian or Bernoulli. The length of the input sequence to the RNN is N, and the (N+1)-th state is predicted from the previous N elements, as shown in Fig 3(b). This approach has been widely applied to sequence-generation tasks (Ha and Eck 2017).
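The sketch below shows one way to wire an RNN-MDN head for Eq 8, assuming Gaussian components (K = 5 and 128 RNN units as in the paper) that are independent across the m = 16 latent dimensions; that independence, the scalar action encoding, and the layer layout are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

M, ACTION_DIM, K, HIDDEN = 16, 1, 5, 128   # latent size, action size, mixtures, RNN units

class RNNMDN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(M + ACTION_DIM, HIDDEN, batch_first=True)
        # Mixture weights, means and log std-devs for each of the K Gaussians
        self.head = nn.Linear(HIDDEN, K * (1 + 2 * M))

    def forward(self, z_seq, a_seq):
        out, _ = self.rnn(torch.cat([z_seq, a_seq], dim=-1))
        h = self.head(out[:, -1])                      # last step predicts z_{t+1}
        logits = h[:, :K]
        mu = h[:, K:K + K * M].view(-1, K, M)
        log_sigma = h[:, K + K * M:].view(-1, K, M)
        return logits, mu, log_sigma

def mdn_nll(logits, mu, log_sigma, z_next):
    # Negative log-likelihood of Eq 8 with diagonal Gaussian components
    comp = torch.distributions.Normal(mu, log_sigma.exp())
    log_prob = comp.log_prob(z_next.unsqueeze(1)).sum(-1)     # (batch, K)
    log_w = torch.log_softmax(logits, dim=-1)
    return -torch.logsumexp(log_w + log_prob, dim=-1).mean()

model = RNNMDN()
z_seq, a_seq = torch.randn(4, 10, M), torch.randn(4, 10, ACTION_DIM)
logits, mu, log_sigma = model(z_seq, a_seq)
loss = mdn_nll(logits, mu, log_sigma, torch.randn(4, M))
```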

Reward Model: Since the PnL depends on the next state, which is unseen by the RL agent, a regression model is used to predict the change of mid-price from the current latent state and the predicted next latent state:

r_t = R(z_t, z_{t+1}; \beta) \times po_t    (9)

where β is the reward model's parameter. In the reward model, the position po depends on the executed actions and a hand-crafted capacity that caps it. The next position, given the current position and action, is calculated as

po_{t+1} = \begin{cases} \min(po_t + |a_t|,\ po_{max}) & \text{when } a_t > 0 \\ \max(po_t - |a_t|,\ -po_{max}) & \text{when } a_t < 0 \end{cases}    (10)

where (−po_max, po_max) defines the position capacity.
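The position bookkeeping of Eq 10 and the mark-to-market reward can be sketched directly; inside the world model the mid-price change comes from the learned regressor of Eq 9, which is replaced here by a plain argument, and the capacity value is illustrative.

```python
PO_MAX = 1000  # position capacity (illustrative value)

def next_position(po, action_qty):
    # Eq 10: positions are clipped to the capacity (-PO_MAX, PO_MAX)
    if action_qty > 0:                      # buy
        return min(po + abs(action_qty), PO_MAX)
    if action_qty < 0:                      # sell
        return max(po - abs(action_qty), -PO_MAX)
    return po

def mark_to_market_reward(delta_mid, po):
    # Eq 7 / Eq 9: predicted (or realized) mid-price change times the position
    return delta_mid * po

po = next_position(0, +300)                 # buy 300 -> position 300
r = mark_to_market_reward(0.05, po)         # reward 15.0 if the mid price rises by 0.05
```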


Figure 3: (a) Extracting the latent LOB representation with an auto-encoder (AE). (b) The transition model: a sequence-to-sequence RNN-MDN. (c) The RL agent's workflow: exploring with the environment model and exploiting in the real environment (dashed lines represent exploitation, solid lines represent exploration).

The latent representation model, transition model, and reward model together form the world model. We then train the RL trading agent purely on this world model while maximizing the total reward over a certain time horizon.

Agent Model

The agent learns its policy by exploring entirely within the pre-trained environment model. Starting from a randomly selected initial state, the RL agent outputs an action; the state and action are then fed back into the environment model to predict the next state, and so on, until a stopping criterion is met. The workflow is illustrated in Fig 3(c). Unlike Atari games, which usually have a clear terminal state, the termination of trading is less well defined. Here we fix the time horizon and train the RL policy to maximize the total reward. Once a policy has converged, we evaluate it repeatedly on historical data following similar steps: 1) form the current state from the latent LOB (z_t) and the corresponding trade prints (u_t); 2) let the RL agent take an action; 3) collect a reward; 4) if time is not up, return to step 1.
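Putting the pieces together, a training rollout inside the world model follows the workflow of Fig 3(c). The sketch below is a self-contained interface illustration: the stub world model, the method names, and the random policy are placeholders of our own, not the paper's actual API.

```python
import random

HORIZON = 500                              # training time horizon in states (paper value)
ACTIONS = list(range(-1000, 1001, 100))    # 21 discrete trade quantities

class StubWorldModel:
    """Stand-in for the learned world model (AE + RNN-MDN + reward model)."""
    def sample_initial_state(self):
        return [random.random() for _ in range(16)]       # 16-dim latent state
    def predict_next(self, state, action):
        return [random.random() for _ in range(16)]       # would sample the RNN-MDN
    def predict_reward(self, state, next_state, action):
        return random.uniform(-1.0, 1.0)                  # would use the reward model

def rollout(world_model, policy):
    state = world_model.sample_initial_state()            # random day's first state
    trajectory = []
    for _ in range(HORIZON):
        action = policy(state)                             # trade quantity in ACTIONS
        next_state = world_model.predict_next(state, action)
        reward = world_model.predict_reward(state, next_state, action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory                                      # used to update the RL policy

traj = rollout(StubWorldModel(), policy=lambda s: random.choice(ACTIONS))
```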

Experiment

We discuss the details of training, testing, and performance analysis in this section.

Benchmark

We use a momentum-driven trading strategy and a classifier-based strategy as our benchmarks. The former is a well-performing industrial strategy, and the latter is the state of the art with deep learning networks. We also use a greedy optimal strategy to measure how close the RL policy is to the optimal policy.

Momentum-driven Strategy: The momentum is calculated by subtracting the opening mid-price from the closing mid-price. For example, if there are 40 time ticks in one state, the closing mid-price is the mid-price at the 40th time tick, and the opening mid-price is the one at the 1st time tick. It roughly reflects the price change within one state, and the change is assumed to be carried over to the next state.

Table 1: Classifier Performance

class        precision   recall   F1     support
down         0.66        0.62     0.64   1799
no change    0.90        0.91     0.90   6401
up           0.63        0.63     0.63   1800

The movement can fall into three classes, up, down, and no-change, w.r.t. a pre-defined threshold. The trading policy is hand-crafted with a fixed quantity for each action in the action space. If the agent always trades the maximum quantity with each action, whether buying or selling, it is "aggressive". An aggressive agent may achieve the largest reward but also bears the biggest risk of losing money. The agent could instead trade the minimum quantity with each action. Such an agent is "conservative": it has smaller risk but may miss significant gains.
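The momentum signal and its mapping to aggressive or conservative actions can be sketched as below; the threshold value and the chosen trade quantities are illustrative assumptions, not the paper's tuned settings.

```python
import numpy as np

THRESHOLD = 0.05            # mid-price move treated as "no change" (illustrative)
Q_MAX, Q_MIN = 1000, 100    # aggressive vs. conservative trade quantities

def momentum_signal(mid_prices):
    # Closing mid-price minus opening mid-price within one state (e.g. 40 ticks)
    move = mid_prices[-1] - mid_prices[0]
    if move > THRESHOLD:
        return "up"
    if move < -THRESHOLD:
        return "down"
    return "no change"

def momentum_action(signal, aggressive=True):
    qty = Q_MAX if aggressive else Q_MIN
    if signal == "up":
        return +qty        # buy: the move is assumed to carry over to the next state
    if signal == "down":
        return -qty        # sell
    return 0               # no trade

mids = 408.0 + 0.01 * np.arange(40)
print(momentum_action(momentum_signal(mids)))   # aggressive buy of 1000
```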

Classifier-based Strategy: A classifier is trained on historical LOB data to predict the mid-price movement in the next state based on the current state. As with the momentum-based strategy, a hand-crafted policy is applied to the classification output, and it can be anywhere between "aggressive" and "conservative." In this study, we build this classifier with three CNN layers and one dense layer (64 neurons) with three output states (the same configuration as the work of (Zhang, Zohren, and Roberts 2019)). We use categorical cross-entropy as the loss function. The performance of the classifier on randomly sampled testing data is shown in Table 1.

Greedy Optimal: This strategy assumes that the agent knows the future LOB in the next state. The requirement for future knowledge makes it unrealistic in the real world; however, it provides a reference for measuring how close the RL policy is to the optimal one. The globally optimal strategy can be found with breadth-first search, demonstrated in Fig 4(a), where trades are discretized into 21 actions. This is computationally intractable for a trading strategy: the complexity can be above 21^1000 for 1000 time ticks, and in a liquid market a 1000-time-tick window is only ∼3 minutes. To overcome this obstacle, we use a greedy optimal strategy: we only expand the action with the maximum cumulative reward at each level (each time step), without visiting backward in time, as shown in Fig 4(b). The greedy optimal policy does not guarantee the global optimum; however, it reduces the computational complexity from exponential to polynomial.

Figure 4: (a) Breadth-first search for the optimal trading policy; (b) the greedy optimal strategy. Nodes are states, branches are actions, and each level is one time step.
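A sketch of the greedy-optimal baseline: at every tick it expands only the best action under the known future mid-price instead of the full 21^N tree. Scoring each candidate by its one-step reward at the resulting position, and the toy price path, are simplifying assumptions on our part.

```python
PO_MAX = 1000
ACTIONS = list(range(-1000, 1001, 100))   # 21 discrete trade quantities

def clip_position(po, qty):
    # Position capacity as in Eq 10
    return max(-PO_MAX, min(PO_MAX, po + qty))

def greedy_optimal(mid_prices):
    """Expand only the best action at each tick, given the future mid-price."""
    po, total_pnl, actions = 0, 0.0, []
    for t in range(len(mid_prices) - 1):
        delta_mid = mid_prices[t + 1] - mid_prices[t]     # look-ahead, unrealistic live
        # best one-step reward over the candidate actions at the resulting position
        best_qty = max(ACTIONS, key=lambda q: delta_mid * clip_position(po, q))
        po = clip_position(po, best_qty)
        total_pnl += delta_mid * po
        actions.append(best_qty)
    return total_pnl, actions

pnl, acts = greedy_optimal([408.0, 408.1, 408.05, 408.2])
print(round(pnl, 2), acts[:3])
```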

Model Architecture

We use a CNN-based auto-encoder to represent the original LOB data with latent states; the network architecture is listed in Fig 5. The length of the trading records u for each state is limited to 10 with post zero-padding. In the transition model (RNN-MDN), one layer of the RNN has 128 neurons, and its input and output sequence length is 10. We assume the state distribution is Gaussian and use 5 Gaussian components. The reward model is composed of one layer of 128 LSTM units and one dense layer with 40 units. The actions for the greedy optimal policy and the RL agent are discretized into 21 classes w.r.t. the trade quantity, representing [−1000, −900, ..., 0, ..., 900, 1000]. The training time horizon is set to 500, and each state includes 40 time ticks; thus each training epoch covers ∼1h of market time. The training horizon can easily be scaled up according to different demands.

Figure 5: CNN-based auto-encoder architecture (kernels 2x2@32, 2x2@16, 5x3@16, 5x3@1, 2x2@16, 2x2@16, with the latent state in the middle). Yellow layers are upsampling/downsampling layers; blue layers are convolutional layers.

Experiment Results

We randomly picked four days in April 2018, with 300 × 40 time ticks (∼0.56 hr) each day, to demonstrate performance; the corresponding mid-price for each day is shown on top in Fig 6. We considered a transaction fee of 2% of PnL per quantity unit. The shaded region for the momentum-based and the classifier-based strategies represents all possible performances between aggressive and conservative policies, and its upper edge shows the best performance at any given time tick; however, the best policy may not always be the same one. For the sake of reducing risk and increasing cumulative reward, a mixed strategy is more reasonable, but it requires more complex hand-crafted policies for the benchmarks. Alternatively, the RL policy is a straightforward way to obtain a mixed strategy in which actions with various quantities are taken in different states.

As shown in Fig 6, the RL agent's performance is, in general, very close to the greedy-optimal solution and outperforms the other two benchmarks' average performance by ∼10% to 30%. In some time segments, the aggressive momentum-based agent beats our RL agent (e.g., t = 100 ∼ t = 200 in Fig 6(f)). This is because the quantity of each RL action during this period is smaller than the momentum-based agent's, to avoid the future risk predicted by the transition model; that also explains why the RL agent's cumulative PnL surpasses the momentum-based agent's after t = 200, when the mid-price starts dropping. A2C overall provides more stable performance than PG and DQN. DQN sometimes performs poorly, and this can be improved with a better state representation (to be addressed in our future work). The classification-based agent performs relatively poorly, and we believe this is caused by classification bias: the classifier tends to misclassify the movement as "no change" because this is the dominant class, and thus the classifier-based agent takes fewer actions. This is a real-world bottleneck for classification methods in finance, because the market generally does not change much within a short period. One possible improvement is to extend the window length of each state so that each state spans a longer time horizon and more changes may be captured. However, this decreases the action granularity, because only one action is associated with each state.

Explanation and Analysis

To understand and explain the performance better, we also visualize the action selections in Fig 7, with the mid-price highlighted in blue. For ease of observation, we only show the first 100 time ticks. Three agents (RL (PG), momentum-based, and classifier-based) have similar action frequencies. The RL agent's strategy can be summarized as: always sell around the local peaks and buy around the local valleys. The momentum-based policy switches between "buy" and "sell" actions more frequently, and this may cause a lower position so that smaller rewards are sometimes collected. The classifier-based agent has a relatively lower action frequency and sometimes takes irrational actions, such as "buy" at the local peaks in the first five time ticks. The greedy optimal agent has the highest trading frequency, and its main strategy is to maintain a high position and not to take actions when the one-step PnL is negative. We also compare the performance variance of the different methods in descending markets (Fig 8(a)), ascending markets (Fig 8(b)), oscillating markets (Fig 8(c)), and overall (Fig 8(d)). An ascending market is one whose mid-price mainly goes up over time (and similarly for the others). RL performs very well in descending and ascending markets, much better than the state-of-the-art classifier-based approach. However, it does not perform well in the oscillating market, which is caused by insufficient training data (only ∼15.8% of the total training data). We will address this problem in our future work.

Figure 6: Trading agent performance for the momentum-based, classification-based, RL-based, and greedy-optimal agents, tested on 4 random days. The top row shows the mid price on each day; the bottom row shows the cumulative PnL over time ticks (x40) for RL_A2C, RL_DQN, RL_PG, momentum, clf, and greedy_optimal.

Figure 7: Performance explanation in terms of action selection for (a) the RL-based (PG), (b) momentum-based, (c) classifier-based, and (d) greedy optimal agents: red shadows represent "sell", green shadows represent "buy", and no actions are taken in blank areas. The blue line is the corresponding mid price.

Figure 8: Performance variance (cumulative PnL) of PG, A2C, DQN, Momentum, Clf, and Greedy in the descending market (a), ascending market (b), oscillating market (c), and across all market trends (d).

We also show the transferability of the RL policy from the world model to the real environment. For five randomly picked days, the same policy is run in both the world model and the real environment on each day, and we then compare the cumulative rewards. As shown in Fig 9, the asymptotic performance (cumulative reward) of the RL policy demonstrates its transferability from the environment model to the real world.

Figure 9: RL policy (PG) transferability: the policy trained fully on the environment model has acceptable performance in the real world (curves show the cumulative reward from the world model and from the real world).

Conclusion and Future Work

We propose a model-based RL approach for learning a trading policy in finance. Our work shows the potential of RL in domains where the state space is high-dimensional and real environment interactions are expensive or infeasible. We also contribute a framework for modeling trading markets for future studies. The trading market is a complex domain, and more dynamic factors should be considered, including broker fees, dynamic transaction fees, etc. Some hand-crafted rules could be combined with the RL agent (e.g., setting a risk threshold). System time latency is another concern: a delayed response may influence the trading policy's efficacy. Market data with different liquidity should also be tested with our RL approach to demonstrate the trading strategy's robustness.

References

[Ang and Bekaert 2006] Ang, A., and Bekaert, G. 2006. Stock return predictability: Is it there? The Review of Financial Studies 20(3):651–707.

[Bacchetta, Mertens, and Van Wincoop 2009] Bacchetta, P.; Mertens, E.; and Van Wincoop, E. 2009. Predictability in financial markets: What do survey expectations tell us? Journal of International Money and Finance 28(3):406–426.

[Bacoyannis et al. 2018] Bacoyannis, V.; Glukhov, V.; Jin, T.; Kochems, J.; and Song, D. R. 2018. Idiosyncrasies and challenges of data driven learning in electronic trading. arXiv preprint arXiv:1811.09549.

[Beltran, Grammig, and Menkveld 2005] Beltran, H.; Grammig, J.; and Menkveld, A. J. 2005. Understanding the limit order book: Conditioning on trade informativeness. Technical report, CFR Working Paper.

[Bertoluzzo and Corazza 2012] Bertoluzzo, F., and Corazza, M. 2012. Testing different reinforcement learning configurations for financial trading: Introduction and applications. Procedia Economics and Finance 3:68–77.

[Bishop 1994] Bishop, C. M. 1994. Mixture density networks.

[Dixon, Klabjan, and Bang 2017] Dixon, M.; Klabjan, D.; and Bang, J. H. 2017. Classification-based financial markets prediction using deep neural networks. Algorithmic Finance 6(3-4):67–77.

[Fischer 2018] Fischer, T. G. 2018. Reinforcement learning in financial markets - a survey. Technical report, FAU Discussion Papers in Economics.

[Gould et al. 2013] Gould, M. D.; Porter, M. A.; Williams, S.; McDonald, M.; Fenn, D. J.; and Howison, S. D. 2013. Limit order books. Quantitative Finance 13(11):1709–1742.

[Ha and Eck 2017] Ha, D., and Eck, D. 2017. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477.

[Ha and Schmidhuber 2018] Ha, D., and Schmidhuber, J. 2018. World models. arXiv preprint arXiv:1803.10122.

[Henaff, Canziani, and LeCun 2019] Henaff, M.; Canziani, A.; and LeCun, Y. 2019. Model-predictive policy learning with uncertainty regularization for driving in dense traffic. arXiv preprint arXiv:1901.02705.

[Huang 2018] Huang, C. Y. 2018. Financial trading as a game: A deep reinforcement learning approach. arXiv preprint arXiv:1807.02787.

[Kaiser et al. 2019] Kaiser, L.; Babaeizadeh, M.; Milos, P.; Osinski, B.; Campbell, R. H.; Czechowski, K.; Erhan, D.; Finn, C.; Kozakowski, P.; Levine, S.; et al. 2019. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374.

[Leibfried, Kushman, and Hofmann 2016] Leibfried, F.; Kushman, N.; and Hofmann, K. 2016. A deep learning approach for joint video frame and reward prediction in Atari games. arXiv preprint arXiv:1611.07078.

[Lillicrap et al. 2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

[Mnih et al. 2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.

[Mnih et al. 2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.

[Nevmyvaka, Feng, and Kearns 2006] Nevmyvaka, Y.; Feng, Y.; and Kearns, M. 2006. Reinforcement learning for optimized trade execution. In Proceedings of the 23rd International Conference on Machine Learning, 673–680. ACM.

[Ntakaris et al. 2018] Ntakaris, A.; Magris, M.; Kanniainen, J.; Gabbouj, M.; and Iosifidis, A. 2018. Benchmark dataset for mid-price forecasting of limit order book data with machine learning methods. Journal of Forecasting 37(8):852–866.

[Oh et al. 2015] Oh, J.; Guo, X.; Lee, H.; Lewis, R. L.; and Singh, S. 2015. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, 2863–2871.

[Rosu and others 2010] Rosu, I., et al. 2010. Liquidity and information in order driven markets. Chicago Booth School of Business.

[Rosu 2009] Rosu, I. 2009. A dynamic model of the limit order book. The Review of Financial Studies 22(11):4601–4641.

[Salimans and Chen 2018] Salimans, T., and Chen, R. 2018. Learning Montezuma's Revenge from a single demonstration. arXiv preprint arXiv:1812.03381.

[Tsantekidis et al. 2017a] Tsantekidis, A.; Passalis, N.; Tefas, A.; Kanniainen, J.; Gabbouj, M.; and Iosifidis, A. 2017a. Forecasting stock prices from the limit order book using convolutional neural networks. In 2017 IEEE 19th Conference on Business Informatics (CBI), volume 1, 7–12. IEEE.

[Tsantekidis et al. 2017b] Tsantekidis, A.; Passalis, N.; Tefas, A.; Kanniainen, J.; Gabbouj, M.; and Iosifidis, A. 2017b. Using deep learning to detect price change indications in financial markets. In 2017 25th European Signal Processing Conference (EUSIPCO), 2511–2515. IEEE.

[Van Hasselt, Guez, and Silver 2016] Van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.

[Williams 1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.

[Wu et al. 2017] Wu, C.; Kreidieh, A.; Parvate, K.; Vinitsky, E.; and Bayen, A. M. 2017. Flow: Architecture and benchmarking for reinforcement learning in traffic control. arXiv preprint arXiv:1710.05465.

[Zhang et al. 2018] Zhang, M.; Vikram, S.; Smith, L.; Abbeel, P.; Johnson, M.; and Levine, S. 2018. SOLAR: Deep structured representations for model-based reinforcement learning.

[Zhang, Zohren, and Roberts 2019] Zhang, Z.; Zohren, S.; and Roberts, S. 2019. DeepLOB: Deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing 67(11):3001–3012.

[Zhao et al. 2019] Zhao, X.; Xia, L.; Zhao, Y.; Yin, D.; and Tang, J. 2019. Model-based reinforcement learning for whole-chain recommendations. arXiv preprint arXiv:1902.03987.

[Zivot and Wang 2006] Zivot, E., and Wang, J. 2006. Vector autoregressive models for multivariate time series. Modeling Financial Time Series with S-PLUS, 385–429.