
  • School of Education, Culture and Communication
    Division of Applied Mathematics

    MASTER THESIS IN MATHEMATICS / APPLIED MATHEMATICS

    Generative Neural Network for Portfolio Optimization

    by

    Mengxin Liu

    Masterarbete i matematik / tillämpad matematik

    DIVISION OF APPLIED MATHEMATICS
    Mälardalen University

    SE-721 23 Västerås, Sweden

  • School of Education, Culture and Communication
    Division of Applied Mathematics

    Master thesis in mathematics / applied mathematics

    Date: 2021-01-15

    Project name: Generative Neural Network for Portfolio Optimization

    Author: Mengxin Liu

    Supervisor(s): Supervisor at Qognica AB: George Fodor; Supervisor at MDH: Olha Bodnar

    Reviewer: Christopher Engström

    Examiner: Daniel Andrén

    Comprising: 30 ECTS credits

  • Contents

    1 Introduction
      1.1 Problem description
      1.2 Literature review
      1.3 Outline

    2 Traditional Portfolio Optimization Method
      2.1 Mean-Variance Portfolio Optimization
      2.2 CAPM

    3 Limitation of Traditional Portfolio Optimization
      3.1 Drawbacks within Assumptions
      3.2 Drawbacks within Applications

    4 Preprocessing
      4.1 Distribution of Daily Return
        4.1.1 Normal Distribution
        4.1.2 Student's t Distribution
        4.1.3 Complete Data
      4.2 Whether to Include Technical Indicators
      4.3 Scaling
        4.3.1 Standard Score
        4.3.2 Min-Max Scaler
        4.3.3 Robust Scaler
        4.3.4 Max Abs Scaler
        4.3.5 Power Transform

    5 Artificial Neural Network
      5.1 Introduction to Neural Network
        5.1.1 Relation between Different Concepts
        5.1.2 Definition of Artificial Neural Network
        5.1.3 Differences between ANN and Statistical Method
      5.2 Training Neural Network
        5.2.1 Hyperparameters
      5.3 Activation Function
        5.3.1 Sigmoid function
        5.3.2 Hyperbolic Tangent function
        5.3.3 Rectified Linear Unit function
        5.3.4 Exponential Linear Unit
        5.3.5 Leaky ReLU
      5.4 Approaches to Prevent Overfitting
        5.4.1 Increase Data Size
        5.4.2 Reduce Size of Neural Network
        5.4.3 L1 Regularization
        5.4.4 L2 Regularization
        5.4.5 Dropout
      5.5 Supervised Learning and Unsupervised Learning
      5.6 Generative Adversarial Network
        5.6.1 Cost function
      5.7 Implement Neural Network in Portfolio Optimization
        5.7.1 How to Optimize Portfolio from Output of Neural Network

    6 Empirical Study
      6.1 Data Software and Hardware
        6.1.1 Data and Data Source
        6.1.2 Software Choice
        6.1.3 Hardware
      6.2 Risk Measurement
        6.2.1 Volatility
        6.2.2 Value at Risk
        6.2.3 Conditional Value at Risk
      6.3 Monte Carlo Simulation
        6.3.1 Simulated Path of Monte Carlo Simulation
        6.3.2 Calculate VaR using Monte Carlo Method
        6.3.3 Calculate CVaR using Monte Carlo Method
        6.3.4 Markowitz GMV Portfolio Selection
      6.4 Studies on GAN
        6.4.1 Structure of GAN
        6.4.2 Key Point on Selecting Batches
        6.4.3 Output from GAN
        6.4.4 The Effect of Epoch
        6.4.5 The Effect of Batchsize
        6.4.6 The Effect of Latent Dimension
        6.4.7 A Portfolio Optimization Example

    7 Discussion
      7.1 Advantages of Generative Neural Network Portfolio Optimization
      7.2 Disadvantages of Generative Neural Network Portfolio Optimization

    8 Further Research and Conclusion
      8.1 Further Research
      8.2 Conclusion

    A Weight of GMV portfolio
      A.1 Weights of GMV Portfolio

    B VaR and CVaR of Stocks Using Normal Monte Carlo
      B.1 Part 1
      B.2 Part 2
      B.3 Part 3

    C VaR of the First GAN Result
      C.1 Part 1
      C.2 Part 2

    D Epoch Study

    E Batchsize Study

    F Latent Dimension Study

  • List of Figures

    3.1 Rolling correlation coefficient between AAK and ABB

    4.1 Histogram of ABB
    4.2 Comparison Between Histogram and Normal Distribution PDF
    4.3 Comparison Between Histogram and Student's t Distribution PDF

    5.1 A brief description of the relation between three different concepts
    5.2 LTU unit
    5.3 Neural Network
    5.4 Graph of Sigmoid function
    5.5 Graph of Hyperbolic Tangent function
    5.6 Graph of Rectified Linear Unit function
    5.7 Graph of Exponential Linear Unit
    5.8 Graph of Leaky ReLU
    5.9 Graphical Explanation of Autoencoder
    5.10 Graphical Representation of GAN

    6.1 One path generated by Monte Carlo simulation
    6.2 VaR of ABB using Monte Carlo simulation
    6.3 CVaR of ABB using Monte Carlo simulation
    6.4 VaR and CVaR of ABB using Monte Carlo simulation
    6.5 Value of GMV portfolio in 10 years
    6.6 VaR and CVaR of GMV portfolio using Monte Carlo simulation
    6.7 Data structure of input
    6.8 ABB price paths generated by GAN
    6.9 Histogram comparison
    6.10 Histogram comparison (daily return)
    6.11 Heatmap of one generated data set
    6.12 Heatmap of real stock returns data
    6.13 Rolling Correlation of Neural Network
    6.14 Value of Portfolio in 10 years

  • Abstract

This thesis aims to overcome the drawbacks of traditional portfolio optimization by employing Generative Deep Neural Networks on real stock data. The proposed framework is capable of generating return data that have statistical characteristics similar to the original stock data. The results are acquired using the Monte Carlo simulation method and presented in terms of individual risk. The method is tested on real Swedish stock market data. A practical example demonstrates how to optimize a portfolio based on the output of the proposed Generative Adversarial Networks.

  • Acknowledgements

I would like to thank everyone at Qognica AB for giving me the chance of doing this thesis. It has been a really enjoyable journey for me. I would also like to thank my supervisor Olha Bodnar for giving me constructive opinions.

  • Chapter 1

    Introduction

1.1 Problem description

Portfolios are sets of financial assets selected to optimize the trade-off between risk and return. Optimal portfolios define a line in the risk vs. return plane called the efficient frontier. The optimization process as such is done by portfolio managers. A manager selecting assets will typically consider factors such as the risk aversion of the investor, the risk/return profile of each asset, the risk-free rate, and the borrowing rate. Advances in financial engineering have led to increased sophistication both on the optimization-instrument side and in the investor's understanding of risk. This trend could be accelerated using recent results in machine learning methods and in advanced computerized mathematical modelling tools.

In order to construct a portfolio, it is important to model the assets. The assets in a portfolio can be represented as a combination of weight, expected return, and risk. The weight $w_i$ represents the portion of stock $i$ in the portfolio. The expected return $\mu_i$ represents the investor's expectation of the future return of stock $i$. Normally, risk is measured by a function that considers the standard deviation $\sigma$.

The Harry Markowitz paper [28] gives a solution for how to construct a portfolio based on the formulation introduced above. The weights of each stock can be represented by a weight vector $\vec{W}^T = (w_1, w_2, \cdots, w_N)$. In order to calculate the variance of the whole portfolio, the covariance matrix $\Sigma$ needs to be calculated:

\[
\Sigma = \begin{pmatrix}
\sigma_{1,1} & \cdots & \sigma_{1,N} \\
\vdots & \ddots & \vdots \\
\sigma_{N,1} & \cdots & \sigma_{N,N}
\end{pmatrix}
\]

where $\sigma_{N,N} = \sigma_N^2$ is the variance of asset $N$ and $\sigma_{i,j}$ is the covariance between assets $i$ and $j$. Now the portfolio's risk $\sigma_p$ can be calculated using the formula:

\[
\sigma_p^2 = \vec{W}^T \Sigma \vec{W}
\]

The problem of minimizing the risk of the portfolio can be solved using techniques like Lagrange multipliers. The resulting portfolio is called the Global Minimum Variance (GMV) portfolio.

If the investors want to have more return while controlling the risk, then the optimization problem can be formulated using the concept of the Sharpe ratio [43]. The Sharpe ratio $S_p$ can be calculated with the formula:

\[
S_p = \frac{\mu_p - r_f}{\sigma_p}
\]

where $r_f$ is the risk-free rate, $\mu_p$ is the expected return of the portfolio, and $\sigma_p$ is the volatility of the portfolio.

Solving the optimization problem that maximizes the Sharpe ratio gives the so-called optimal portfolio. We can define the negative Sharpe ratio as the cost function; the optimization problem then minimizes this cost function.

In this thesis, some questions are raised. Is the mean-variance portfolio optimization framework a good portfolio selection method? Estimating risk by the standard deviation alone might not capture all the regularities that could identify risk patterns. Recently, with the development of computing power, Artificial Intelligence methods, and especially neural network algorithms, have become more and more important in many fields such as computer vision, due to their capacity to recognize patterns. This leads to the problem of this thesis. Is it possible to apply neural network algorithms in the portfolio selection process? How does a neural-network-based portfolio perform compared with the Markowitz portfolio selection framework?

This thesis aims to find answers to these questions. A designed unsupervised neural network will try to extract features from the existing stock data. The neural network will then generate many return series that have characteristics similar to the original data, and a portfolio will be constructed based on the generated data. The designed portfolio will have minimum risk (measured by standard deviation, value at risk, or conditional value at risk).

1.2 Literature review

The aim of academic studies in modelling time series is to find a model that better describes time series characteristics. A more complete model of a time series will give a better prediction or estimation, and the prediction will be used for optimization. A model that seeks a better representation of a time series has two parts: structure and parameters. When we try to choose a model, we want to select a structure that leads to the least amount of parameters. Among the many time series models proposed for financial applications, the Autoregressive Integrated Moving Average (ARIMA) model [47] is a good model with good prediction power and few parameters. As a common rule, Occam's Razor [6] states that the simpler model is preferred in any case, this being a normal regularization principle. Apart from this advantage, what makes ARIMA interesting in financial applications is its capability of simulating Brownian motion, which is one of the most common ways to model prices of financial assets. When investors try to identify the parameters of ARIMA, they are essentially doing statistical modelling, which is built upon statistical assumptions [11]. However, if we want to build a model that has no statistical assumptions, ARIMA is not the most suitable choice. Hence we want to build a model based on no statistical assumption, enabling us to find correlations or patterns that are hard to recognize in a normal setting. Compared to the traditional method, this will give a more precise estimation of the financial characteristics. To address this question, we choose to implement ideas from Artificial Neural Network research, because it is a widely developed field that gives us a new way to model time series (financial time series in this case).

Since the introduction of the neural network, many researchers have been trying to implement artificial neural network techniques in financial applications. The article by Cavalcante [7] categorizes the machine-learning-related articles and summarizes the core implications according to their directions. From the author's summary, the most common application of machine learning in finance is price prediction. Machine learning techniques can also be applied in other tasks such as feature extraction and outlier detection.

Under the category of price prediction, several articles try to achieve this goal. W. Bao [4] proposes a framework to predict stock prices. Index data are fed into a wavelet transform system, whose purpose is to denoise the price data. The data then go through a stacked autoencoder; the autoencoder is an unsupervised learning method designed to extract deep features from the data. Subsequently, the extracted features are fed into a long short-term memory model (LSTM) in order to acquire a one-step-ahead prediction. According to the author, the proposed framework is capable of predicting price data with a coefficient of determination R² above 90%. This demonstrates the potential of Artificial Neural Networks in the financial market.

Another approach to predicting stock prices involves a commonly implemented way of processing data: technical indicators. Tegner [46] suggests a method to predict future financial asset prices. In his suggested framework, the input of the neural network consists of prices and technical indicators like moving average and momentum. A selection is then conducted in order to find the technical indicators that have more importance than others. To acquire the prediction from the neural network, the author chooses to feed the selected data into a Recurrent Neural Network (RNN). According to the article, the most effective method achieves an accuracy of 52%. This is one of the articles that incorporate the idea of technical analysis with the power of Artificial Neural Networks.

One of the latest topics in artificial neural networks is the generative model. One framework, the Variational Autoencoder (VAE) [24], is gaining more attention. It can be applied in many applications like text generation [50][42] and also image-related tasks [33]. The Variational Autoencoder can be seen as an extension of the autoencoder; its probabilistic characteristics enable it to generate different outputs with similar characteristics. It belongs to the family of generative autoencoders, which have the ability to generate new data, making them an interesting topic in financial applications.

Another generative model, the Generative Adversarial Network (GAN) [18], has also been applied in many research fields. X. Zhou [52] proposes a framework that implements GAN with high-frequency data to predict future stock data. The framework incorporates Long Short-Term Memory with GAN to predict the stock price. The performance is measured with two measurements, Root Mean Squared Relative Error (RMSRE) and Direction Prediction Accuracy (DPA). The results indicate that GAN could be a good topic in finance-related applications.

The power of GAN is that there are many variations under the GAN category. The Deep Convolutional Generative Adversarial Network (DCGAN) [34] is one common variation of GAN. This technique combines a Deep Convolutional Neural Network with GAN and is commonly applied in image-related applications. Another variation called Wasserstein GAN [3] and its improved version [50] give better results than the normal GAN structure in some applications.

Reinforcement learning can also be applied in the financial field [10][31]. These articles try to implement reinforcement learning in trading execution. The results demonstrate that reinforcement learning can be applied to the buy-or-sell trading execution problem.

To better understand the ideas and applications of some references, we also choose to run some programs to test the different neural network models, which will be reflected in our empirical studies part.

1.3 Outline

This thesis has the following structure. In Chapter 2 we give an introduction to the traditional model for portfolio optimization, which will give the reader a better understanding of portfolio optimization. In Chapter 3, we discuss the drawbacks of traditional portfolio optimization. In Chapter 4, we give our methods for preprocessing data, including filling and scaling data. Chapter 5 contains the introduction to the Artificial Neural Network and to the Generative Adversarial Network that is implemented in this thesis. In Chapter 6, the empirical study results and a study on the effect of hyperparameters are presented. In Chapter 7, the advantages and disadvantages are discussed; based on the results in the empirical studies part, a discussion is presented on the proposed Generative Neural Network portfolio optimization framework. Finally, in Chapter 8, some directions for further studies, especially directions that can improve the proposed framework, are suggested, and we give the conclusions of our proposed framework.

  • Chapter 2

Traditional Portfolio Optimization Method

Modern portfolio theory begins with the paper by Harry Markowitz in 1952 [28]. This paper updated the investor's relationship with risk and return. Before the era of Modern Portfolio Theory, risk was not properly incorporated in the stock selection process; investors focused more on the return of the individual stock. Modern portfolio theory allows investors to make decisions in terms of risk and return. The weights of the constructed portfolio can be calculated by solving an optimization problem. The following sections give a brief introduction to the two most commonly applied modern portfolio theories.

2.1 Mean-Variance Portfolio Optimization

Mean-Variance Portfolio Optimization starts with the assumption that the investor at time $t$ will hold the portfolio for a time period $\Delta t$. The portfolio will be judged based on its terminal value at time $t + \Delta t$. Under this theory, the portfolio selection process is a trade-off between return and risk.

Suppose that an investor needs to construct a portfolio from a pool of $N$ risky assets. Denote $w$ as the weight vector, which represents the weight of each stock. The weight vector can be written as $w = (w_1, w_2, \cdots, w_n)$. Then, to represent that the investor fully invests his/her money, we introduce the first constraint of portfolio optimization:

\[
\sum_{i=1}^{N} w_i = 1 \tag{2.1}
\]

This constraint represents that the investor needs to invest all the available money into risky assets. Therefore the sum of the weights equals one.

Then the investor needs to estimate the expected return of each stock, either from a statistical model or another method. The asset returns are denoted as $\mu = (\mu_1, \mu_2, \cdots, \mu_n)$. Next, the variance-covariance matrix needs to be calculated. The variance-covariance matrix $\Sigma$ can be written as:

\[
\Sigma = \begin{pmatrix}
\sigma_{1,1} & \cdots & \sigma_{1,N} \\
\vdots & \ddots & \vdots \\
\sigma_{N,1} & \cdots & \sigma_{N,N}
\end{pmatrix} \tag{2.2}
\]

where $\sigma_{i,j}$ denotes the covariance between assets $i$ and $j$.

With these assumptions, we have the expected return of the portfolio $\mu_p$:

\[
\mu_p = w^T \mu \tag{2.3}
\]

and the variance of the portfolio $\sigma_p^2$:

\[
\sigma_p^2 = w^T \Sigma w \tag{2.4}
\]

Now we can form an optimization problem that minimizes the risk given a target expected return $\mu_0$:

\[
\min_{w} \; w^T \Sigma w
\]

Subject to

\[
\mu_0 = w^T \mu, \qquad w^T I = 1, \quad I = [1, 1, \cdots, 1]
\]

This optimization can be solved using Lagrange multipliers, and the solution is [15]:

\[
w = j + k\mu_0 \tag{2.5}
\]

where $j$ and $k$ are given by

\[
j = \frac{1}{ln - m^2}\,\Sigma^{-1}[nI - m\mu], \qquad
k = \frac{1}{ln - m^2}\,\Sigma^{-1}[l\mu - mI]
\]

and

\[
l = I^T \Sigma^{-1} I, \qquad m = I^T \Sigma^{-1} \mu, \qquad n = \mu^T \Sigma^{-1} \mu
\]

Now, with different choices of $\mu_0$, the optimization problem can be solved to obtain the portfolio weights. The variance of each such portfolio can then be calculated using Equation 2.4. This lets us form many expected return and standard deviation pairs, which together form the efficient frontier.

The efficient frontier starts from the Global Minimum Variance (GMV) portfolio. The optimization problem for this portfolio can be described as:

\[
\min_{w} \; w^T \Sigma w
\]

Subject to

\[
w^T I = 1, \quad I = [1, 1, \cdots, 1]
\]

The solution of this optimization problem is [15]:

\[
w = \frac{1}{I^T \Sigma^{-1} I}\,\Sigma^{-1} I
\]

If the investor has a risk aversion parameter, denoted $\lambda$, then the optimization problem can be formulated as:

\[
\max_{w} \; \left( w^T \mu - \lambda\, w^T \Sigma w \right)
\]

Subject to

\[
w^T I = 1, \quad I = [1, 1, \cdots, 1]
\]
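To make the closed-form solutions above concrete, the following is a minimal NumPy sketch that computes the GMV weights and the minimum-variance weights for a target return $\mu_0$ (Equation 2.5) from a matrix of historical returns. The simulated data and variable names are illustrative assumptions; the thesis itself works with real Swedish stock data.

```python
import numpy as np

def gmv_weights(returns):
    """Global Minimum Variance weights: w = Sigma^{-1} I / (I^T Sigma^{-1} I)."""
    sigma = np.cov(returns, rowvar=False)        # sample covariance matrix (N x N)
    ones = np.ones(sigma.shape[0])
    w = np.linalg.solve(sigma, ones)             # Sigma^{-1} I
    return w / (ones @ w)

def min_variance_weights(returns, mu0):
    """Minimum-variance weights for a target return mu0 (Equation 2.5): w = j + k*mu0."""
    sigma = np.cov(returns, rowvar=False)
    mu = returns.mean(axis=0)                    # expected returns estimated by sample means
    ones = np.ones(len(mu))
    sinv_ones = np.linalg.solve(sigma, ones)     # Sigma^{-1} I
    sinv_mu = np.linalg.solve(sigma, mu)         # Sigma^{-1} mu
    l = ones @ sinv_ones
    m = ones @ sinv_mu
    n = mu @ sinv_mu
    d = l * n - m ** 2
    j = (n * sinv_ones - m * sinv_mu) / d
    k = (l * sinv_mu - m * sinv_ones) / d
    return j + k * mu0

# Example with simulated daily returns for 4 assets
rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.01, size=(500, 4))
print(gmv_weights(returns).round(3))             # weights sum to 1
print(min_variance_weights(returns, mu0=0.0006).round(3))
```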

2.2 CAPM

The Capital Asset Pricing Model (CAPM) is an equilibrium asset pricing model. The CAPM is founded on the following assumptions [14]:

1. The investor makes decisions based on the expected return and the standard deviation of return.

    2. Investors are rational and risk-averse.

    3. Investors use Modern Portfolio Theory to do portfolio diversification.

    4. Investors invest in the same time period.

    5. Investors all have the same expected return and risk evaluation of all assets.

6. Investors can borrow or lend at the risk-free rate in unlimited amounts.

    7. There is no transaction cost.

To introduce the formula of CAPM, we start with a more general case: the single-index model. The single-index model can be described as the linear regression between the index return and the stock return. In other words, the return of stock $i$ can be described as:

\[
R_i = a_i + \beta_i R_m
\]

where $a_i$ is the part of the stock return that is unrelated to the market return, $R_m$ is the return of the market, and $\beta_i$ is a constant that describes the relation between the market return and the return of the stock.

Then rewrite $a_i$ as:

\[
a_i = \alpha_i + e_i
\]

where $\alpha_i$ is the mean value of $a_i$ and $e_i$ is the random part of $a_i$ with expected value 0. Now the return of stock $i$ can be written as:

\[
R_i = \alpha_i + \beta_i R_m + e_i
\]

It then follows that the correlation between $e_i$ and $R_m$ is 0. Now we have:

1. The mean return: $\bar{R}_i = \alpha_i + \beta_i \bar{R}_m$

2. The variance of the return: $\sigma_i^2 = \beta_i^2 \sigma_m^2 + \sigma_{e_i}^2$

3. The covariance between the returns of stock $i$ and stock $j$: $\sigma_{ij} = \beta_i \beta_j \sigma_m^2$

The proofs of the above formulas can be found in [13]. The formulation of CAPM is written as:

\[
R_i = R_f + \beta_i (R_M - R_f)
\]

This is the standard form of CAPM. According to the formula, the expected return of a particular stock can be calculated from the beta of the stock, the risk-free rate on the market, and the expected return of the market. This standard form of the CAPM is also known as the Sharpe-Lintner-Mossin form. There are many other forms that try to solve some of the problems in the standard form.

In theory, CAPM is a good estimation of the expected return of stocks. However, in reality, implementation is more complicated. From the CAPM formula, we can see that the variables are expressed in terms of future values; in other words, investors need to estimate the future return of the market and the future beta of the stocks. This exposes a problem: large-scale systematic data for estimating these expectations does not exist, therefore the accuracy of CAPM cannot be guaranteed.
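As a small numerical illustration of the CAPM formula (the numbers below are made up for the example, not taken from the thesis):

```python
# Expected return under CAPM: R_i = R_f + beta_i * (R_M - R_f)
risk_free = 0.01       # assumed risk-free rate (1%)
market_return = 0.07   # assumed expected market return (7%)
beta = 1.2             # assumed beta of the stock

expected_return = risk_free + beta * (market_return - risk_free)
print(expected_return)  # 0.082, i.e. an expected return of 8.2%
```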


  • Chapter 3

Limitation of Traditional Portfolio Optimization

3.1 Drawbacks within Assumptions

Traditional portfolio optimization is useful in many ways; however, it has many drawbacks. A Lagrange multiplier is implemented to solve the optimization problem, and for the solution to be optimal it is necessary to fulfil the Karush-Kuhn-Tucker conditions [26]. The necessary conditions require the whole process to be stationary, but this is not necessarily the case, as there is no proof for that. Finally, the whole optimization process does not take into account the stochastic character of the data. Therefore this kind of optimization is not robust enough, since it ignores one of the most important characteristics of stock price data. When we talk about a potential solution to this type of optimization, stochastic properties should not be ignored; after all, investors want to find a portfolio that can give them a secure position in most cases.

3.2 Drawbacks within Applications

When investors try to implement Markowitz portfolio optimization theory in reality, they may face another weakness of Modern Portfolio Theory: the difficulty of estimating the required inputs. Starting with the GMV portfolio, the required input of GMV portfolio optimization is the covariance matrix. The core idea behind this optimization is that by combining the covariance and variance of each stock, the investor can find the right combination to minimize the variance of the designed portfolio. The problem with this idea is that covariance is not a good representation of the relation between two stocks. Specifically, a single number does not have the capability of explaining how two stocks move together. In Figure 3.1 we present the rolling correlation coefficient between the stocks AAK and ABB.


  • Figure 3.1: Rolling correlation coefficient between AAK and ABB

As can be seen in the figure, the correlation between the stocks varies a lot; therefore, an optimization based on the covariance matrix cannot construct a portfolio that gives us the minimum portfolio variance in the future.
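A minimal pandas sketch of the rolling correlation computation behind Figure 3.1 is shown below. The synthetic prices, the column names, and the 60-day window are illustrative assumptions; the thesis uses real Swedish stock data and does not state the window length here.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices for two stocks; the thesis uses real Swedish stocks
# (e.g. AAK and ABB) loaded from a data provider instead.
rng = np.random.default_rng(1)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0.0002, 0.01, size=(1000, 2)), axis=0)),
    columns=["AAK", "ABB"],
)

log_returns = np.log(prices).diff().dropna()
# 60-day rolling correlation between the two daily return series
rolling_corr = log_returns["AAK"].rolling(window=60).corr(log_returns["ABB"])
print(rolling_corr.describe())
```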

Besides the difficulty of finding an accurate covariance matrix, if the investors also care about the return, then they need to provide another input: the expected return vector. The expected return is what investors expect from a stock in the future. It is very hard to estimate the future return of a stock, therefore investors may have inaccurate estimations. Since the maximum Sharpe ratio portfolio optimization is very sensitive to the expected return [14], this results in a scenario that can be described as "garbage in, garbage out". If investors have a wrong estimation of the expected return, then the constructed portfolio cannot have a good performance.


  • Chapter 4

    Preprocessing

Sometimes the original data contains missing values, which should either be ignored or filled in. In this thesis, the core idea is that the missing data should be filled. In the following section, an approach is proposed to fill the missing data. Note that we implement distribution studies only to gain knowledge about how to fill the missing data. Since missing data does not contribute much to the whole dataset, this does not change our premise for the main neural network application part: we make no statistical assumption about the distribution of asset returns.

4.1 Distribution of Daily Return

To compute the daily return of stocks, the log return of the daily stock price data is selected to represent the daily return. The formula for calculating the log return is [47]:

\[
R_i = \ln(P_f / P_i)
\]

where $R_i$ is the return at the end of the period, $P_f$ is the price at the end of the period, and $P_i$ is the price at the beginning of the period.

The reason for choosing log returns is that we can calculate the return over a period by summing up all individual log returns. For example, denote $P_n$ as the price at time $n$; then the log return over the period 1 to $n$ can be calculated as:

\[
R_{1-n} = \ln\left(\frac{P_n}{P_1}\right) = \ln\left(\frac{P_2}{P_1} \cdot \frac{P_3}{P_2} \cdots \frac{P_n}{P_{n-1}}\right)
\]

And we know that:

\[
\ln(a \cdot b) = \ln(a) + \ln(b)
\]

Consequently:

\[
R_{1-n} = \sum_{i=1}^{n} R_i
\]
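The additivity of log returns can be verified with a few lines of NumPy (the price series below is made up for the illustration):

```python
import numpy as np

prices = np.array([100.0, 101.5, 99.8, 102.3, 103.1])   # made-up price series

log_returns = np.diff(np.log(prices))              # R_i = ln(P_i / P_{i-1})
total_log_return = np.log(prices[-1] / prices[0])  # R_{1-n} = ln(P_n / P_1)

# The sum of the daily log returns equals the log return over the whole period
assert np.isclose(log_returns.sum(), total_log_return)
print(log_returns.sum(), total_log_return)
```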

Also, if we choose to use simple returns, then there is a limit on the return. The simple return ranges from $-1$ to $+\infty$, because the stock price cannot fall by more than its current price. This creates difficulties in studying or emulating daily returns. Log returns, on the other hand, range from $-\infty$ to $+\infty$, making them a good choice for calculating daily returns in our studies.

To identify the statistical properties of the daily stock returns, it is helpful to plot a histogram of the daily returns. Figure 4.1 shows the histogram of the ABB daily returns. This gives us a first glimpse of the statistical properties of stock returns.


    Figure 4.1: Histogram of ABB

With the histogram of the daily log returns, we now look for the distribution that best describes the daily returns of the stock.

To find the distribution that fits the daily stock returns best, some potential candidates are suggested. In the following part, these distributions are introduced and their fitted probability density functions (PDFs) are compared with the histogram of the stock returns.

    12

4.1.1 Normal Distribution

A random variable $X$ follows a normal distribution if its probability density function can be written as [39]:

\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty \tag{4.1}
\]

where $-\infty < \mu < \infty$ and $0 < \sigma^2 < \infty$. We denote $\mu$ as the mean and $\sigma^2$ as the variance. A random variable with mean $\mu$ and variance $\sigma^2$ can then be written as $X \sim N(\mu, \sigma^2)$.

Random variables generated from the normal distribution are not suitable for this case. Figure 4.2 shows the PDF of the normal distribution with mean and variance calculated from the historical daily returns. From the figure we can see that the PDF of the normal distribution does not fit well with the histogram.


    Figure 4.2: Comparison Between Histogram and Normal Distribution PDF

4.1.2 Student's t Distribution

To introduce the Student's t distribution, we first introduce the concept of the gamma function. The gamma function $\Gamma(z)$ can be written as:

\[
\Gamma(z) = \int_0^\infty x^{z-1} e^{-x}\, dx
\]

A random variable $X$ follows a Student's t distribution with $\nu$ degrees of freedom if its probability density function can be written as [39]:

\[
f(x;\nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \quad -\infty < x < \infty
\]

The comparison between the histogram and the Student's t distribution PDF is presented in Figure 4.3.


    Figure 4.3: Comparison Between Histogram and Student’s t Distribution PDF

The Student's t distribution can be viewed as a generalization of the Cauchy distribution and the normal distribution. First, we look at the probability density function of the Cauchy distribution [39]:

\[
f(x) = \frac{1}{\pi\left(1 + x^2\right)}, \quad -\infty < x < \infty
\]

This probability density function can be viewed as the special case where the degree of freedom is $\nu = 1$. If we instead let the degree of freedom $\nu \to \infty$, then we have:

\[
\lim_{\nu\to\infty} f(x;\nu) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}, \quad -\infty < x < \infty
\]

Comparing this result with Equation 4.1, we see that it has the form of the standard normal distribution.


4.1.3 Complete Data

Before filling the missing data, we need to select which stocks we want to fill. Because the missing data is generated from the fitted distribution, if our inputs to the neural network are filled with too many random variables, the true information inside the real stock prices will be distorted. Hence we select only stocks whose missing data is shorter than one-fourth of the total length of the data.

In terms of distribution, the Student's t distribution is chosen because it fits well with the statistical properties of the daily stock returns.

To get the parameters of the distribution, we use the Python package Scipy [48]: its functions are used to fit the distribution to the return data and to generate daily returns that follow the fitted distribution.
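A minimal sketch of this filling procedure with SciPy is given below; the synthetic return series, the variable names, and the number of missing values are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Fit a Student's t distribution to observed daily log returns, then draw
# from the fitted distribution to replace missing values. The observed
# returns here are simulated; the thesis fits real stock returns instead.
rng = np.random.default_rng(2)
observed_returns = stats.t.rvs(df=4, loc=0.0003, scale=0.012, size=2000, random_state=rng)

df, loc, scale = stats.t.fit(observed_returns)     # estimate degrees of freedom, location, scale
n_missing = 25                                     # assumed number of missing observations
filled_values = stats.t.rvs(df, loc=loc, scale=scale, size=n_missing, random_state=rng)
print(round(df, 2), round(loc, 5), round(scale, 5))
print(filled_values[:5])
```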

4.2 Whether to Include Technical Indicators

In the finance industry, technical indicators are applied in many trading strategies. However, in essence, technical indicator analysis is not an exact science [32]. Because it is a reflection of the market price trend, technical analysis aims to find the market trend at an early stage. Technical analysis holds that crowd psychology affects the stock price, so studying the market trend orients investors to decide on buying and selling with some degree of confidence.

Some studies also choose to use technical indicators as an input. Studies like Tegner 2018 [46] and Widegren 2017 [49] combine price data and technical indicators and feed them into an Artificial Neural Network.

However, in this thesis technical indicators are not used as an input to the designed neural network. Technical indicators and technical analysis aim to provide predictions of future asset prices. In this thesis, prediction is not important; in other words, we do not care about the price of tomorrow or next month. The focus of this work is to estimate the risk of stocks over any given period, the potential loss of the portfolio to be exact. Therefore applying technical indicators cannot provide any improvement to the risk estimation and could potentially generate noise and affect our estimations.

4.3 Scaling

Before feeding data into the neural network, it is important to scale the data, because the ranges of the different inputs may differ from each other. If the data is fed into the neural network or machine learning algorithm without any preprocessing, the result will be deeply compromised by the differing data ranges. In the following section, several scaling techniques are introduced.


4.3.1 Standard Score

The formula for the standard score can be expressed as [53]:

\[
\hat{X} = \frac{x - \mu}{\sigma}
\]

where $\mu$ is the mean of the data and $\sigma$ is the standard deviation of the data.

The standard score is the most common scaling technique and is applied in many machine learning algorithms. However, in our application this method is not appropriate, because our designed network requires data inputs in the range of $-1$ to $1$. The standard score scales data based on the standard deviation, and from Figure 4.2 one can see that the stock data tends toward a fat-tailed distribution, so the scaled data cannot fulfil our requirement for the proposed framework.

4.3.2 Min-Max Scaler

Another common technique that can be implemented in our application is the Min-Max scaler. The scaling formula can be expressed as [35]:

\[
\hat{X} = \frac{x - \min(x)}{\max(x) - \min(x)}
\]

This transforms the data into the range 0 to 1; the whole data set is scaled based on the maximum and the minimum of the data. Although this technique ensures that the result is in the range of 0 to 1, in the context of stock data the whole data set will be scaled according to the outliers, hence the scaled data will be overly concentrated in a narrow interval, resulting in inaccurate outputs from the neural network.

4.3.3 Robust Scaler

To solve the issue we faced with the Min-Max Scaler, we can implement another type of scaler: the Robust Scaler. The formula for the robust scaler is [35]:

\[
\hat{X} = \frac{x - Q_2}{Q_3 - Q_1}
\]

where $Q_1$ is the 25th percentile of $x$, $Q_2$ is the median of $x$, and $Q_3$ is the 75th percentile of $x$.

With the Robust Scaler, the scaled data will be more evenly distributed. Therefore, in the presence of large outliers, the Robust Scaler allows the data to be spread over a wider range than the Min-Max Scaler.

4.3.4 Max Abs Scaler

The Max Abs Scaler is a variant of the Min-Max Scaler; compared to the Min-Max Scaler, the Max Abs Scaler scales the data into the range $-1$ to $1$. The formula for the Max Abs Scaler is [53]:

\[
\hat{X} = \frac{x}{\max(\mathrm{abs}(x))}
\]

4.3.5 Power Transform

The power transform is a technique that tries to transform data toward a more normally distributed shape using a power function. Under this category, there are two major approaches: the Box-Cox Transform [40] and the Yeo-Johnson Transform [51]. For this application, we have to use the Yeo-Johnson Transform, because the Box-Cox Transform can only be applied to strictly positive values; since the Box-Cox Transform is not implemented, it is not introduced here. The Yeo-Johnson Transform is formulated as:

\[
y_i^{(\lambda)} =
\begin{cases}
\left((y_i + 1)^{\lambda} - 1\right)/\lambda & \text{if } \lambda \neq 0,\; y_i \geq 0 \\
\log(y_i + 1) & \text{if } \lambda = 0,\; y_i \geq 0 \\
-\left[(-y_i + 1)^{(2-\lambda)} - 1\right]/(2-\lambda) & \text{if } \lambda \neq 2,\; y_i < 0 \\
-\log(-y_i + 1) & \text{if } \lambda = 2,\; y_i < 0
\end{cases}
\]

where $0 \leq \lambda \leq 2$. The Yeo-Johnson Transform is chosen for this application because it can be applied to negative data. The transformed data will display an approximately normal distribution. After performing the power transform, we apply the Max Abs Scaler to make the scaling compatible with our network's required inputs. Because the power transform is not a linear scaler, outliers have less impact on the subsequent Max Abs Scaler result than they would without the power transform.
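As a sketch of this scaling chain, the following uses scikit-learn's PowerTransformer (Yeo-Johnson) followed by MaxAbsScaler on a synthetic fat-tailed return matrix; the data and shapes are illustrative, and the thesis does not specify which library implementation it uses.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, PowerTransformer

# Yeo-Johnson power transform followed by Max Abs scaling, so that the data
# end up in [-1, 1]. The fat-tailed return matrix below is synthetic.
rng = np.random.default_rng(3)
returns = 0.01 * rng.standard_t(df=4, size=(1000, 5))

power = PowerTransformer(method="yeo-johnson")    # handles negative values
max_abs = MaxAbsScaler()

scaled = max_abs.fit_transform(power.fit_transform(returns))
print(scaled.min(), scaled.max())                 # within [-1, 1]
```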


  • Chapter 5

    Artificial Neural Network

    5.1 Introduction to Neural Network

5.1.1 Relation between Different Concepts

With the evolution of computer-related technologies, especially the GPU¹, Artificial Intelligence has become applicable rather than just a proposed concept. Terminology like machine learning and neural network has become more and more popular and frequently appears in articles and media. However, some readers may have difficulties understanding the relations between these concepts. In this section, we give a small introduction to them. Figure 5.1 is a brief description of the three most common concepts in the Artificial Intelligence research field.

    Figure 5.1: A brief description of the relation between three different concepts

The concept of Artificial Intelligence began in the 1950s; the core idea of A.I. is to automatically perform intellectual tasks that are normally done by humans. In the following years, this concept has been continuously developed, and under it a new approach was proposed: Machine Learning.

¹ https://www.nvidia.com/en-us/about-nvidia/ai-computing/

In the traditional model-based method, data and rules are fed into the system, and the result is then calculated according to the data and the rules. Machine Learning implements a whole new paradigm: a relation is found from the given inputs and results. In other words, a machine learning program aims to replicate the given results from the given inputs. When a machine learning model is trained, we can give it a new set of data as input and obtain an output from the machine learning algorithm.

To explain the concept of Deep Learning, we first elaborate on how a machine learning algorithm works. In a machine learning program, three types of information are necessary: input data, which can for example be numbers, pictures, or sound; examples of the known results, which can be the tags of pictures in an image recognition task; and finally a measurement of the algorithm's performance. This measurement measures the distance between the result of the algorithm and the expected output. Adjustments are then made according to this predetermined measurement, and the output of the machine learning algorithm improves accordingly.

Now we come to the difference between Machine Learning and Deep Learning. Contrary to first impressions, compared to a normal machine learning algorithm the deep learning algorithm will not necessarily give a deeper interpretation of the data. The term Deep Learning defines algorithms that feed input data through a successive set of layers that give increasingly meaningful interpretations [16]. By doing this, input data can be represented by different layers of interpretation. The depth of the model describes the number of layers contributing to the model. Under the definition of Deep Learning, the Artificial Neural Network is one commonly applied technique. In the following section, a detailed explanation of the neural network is presented.

5.1.2 Definition of Artificial Neural Network

The concept of the Artificial Neural Network (ANN) borrows the idea of how biological neurons work in real life.

The term Artificial Neural Network is defined as an interconnected assembly of simple elements, nodes or units, whose functionality is similar to the biological neuron. The processing ability is stored in the inter-unit weights, which can be obtained from learning [21].

The first concept of the Artificial Neural Network starts with the paper by McCulloch and Pitts (1943) [29]. The paper proposes a computational model that imitates the way neurons work when performing complex computation. This is the world's first Artificial Neural Network structure. One of the simplest ANN networks, the Perceptron, was proposed by Frank Rosenblatt in 1957 [38]. It is a variation of another network, the Linear Threshold Unit (LTU). Figure 5.2 is a description of the LTU.



    Figure 5.2: LTU unit

First, the LTU unit computes the weighted sum of its inputs, which can be expressed as $(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n)$. We denote the weighted sum as $z$ and can rewrite it as $z = w^T x$. The LTU unit then applies a step function to the weighted sum $z$; the output is represented as $G(x) = \mathrm{Step}(w^T x)$. With the LTU unit defined, we can introduce the term Perceptron. A Perceptron consists of one layer of LTU units, with each neuron connected to the input layer [17]. When we stack multiple Perceptrons together, we create a Multi-Layer Perceptron (MLP).
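A single LTU can be written in a few lines of Python; this is only an illustration of the weighted sum followed by a step function, with made-up inputs and weights.

```python
import numpy as np

def ltu(x, w, threshold=0.0):
    """Linear Threshold Unit: weighted sum z = w^T x followed by a step function."""
    z = np.dot(w, x)
    return 1 if z >= threshold else 0

x = np.array([1.0, 0.5, -0.2])   # made-up inputs
w = np.array([0.3, 0.8, 1.0])    # made-up weights
print(ltu(x, w))                 # 0.3 + 0.4 - 0.2 = 0.5 >= 0, so the unit outputs 1
```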

An MLP has an input layer, which handles the input. In the middle, programmers can choose to have one or multiple layers of LTU units; a middle layer is called a hidden layer. The number of hidden layers is predetermined and can be adjusted according to the application. The last hidden layer is connected to an output layer. If an ANN has more than 2 hidden layers, it is called a Deep Neural Network (DNN). Figure 5.3 is an example of a Deep Neural Network. As the figure shows, this network has 2 hidden layers with 5 and 4 LTU units, respectively.



    Figure 5.3: Neural Network

For convenience, in the remaining parts of this thesis the term neural network will be used interchangeably with artificial neural network.

5.1.3 Differences between ANN and Statistical Method

Traditionally, when one wants to complete a task, a common option is a statistical method. To explain the philosophy of the ANN, assume that we want to solve a practical problem: identifying handwritten digits. To complete this task using statistical techniques, a model has to be proposed. This model is an appropriate representation of the relationships between inputs and outputs. Denote this model as $y = f(x, \beta)$. In this case $x$ is the input (picture) and $y$ is the desired output (identified digit). The input of this task can be a large amount of data, and the function $f$ is unknown. Consequently, it requires a large number of parameters to give a relatively accurate model. This means that the model for identifying handwritten digits is large and complex.

The Neural Network, on the other hand, takes a different approach. Compared to the statistical approach, a Neural Network model has far more parameters, and therefore there are many combinations of parameters. In reality, different combinations of parameters can sometimes give the same output, making it hard to interpret the parameters inside the neural network. To be more precise, a Neural Network works as a black-box method and does not give an interpretable result from its parameters [17]. However, in this example we just want a model that can recognize handwritten digits and do not care about the relationships between pixels; the Neural Network works well in this kind of application. Moreover, in the financial market there are hundreds of variables, hence finding a robust statistical model is very hard. The Neural Network is a good choice in this type of situation.


5.2 Training Neural Network

First, in a linear model, the relation between input and output can be expressed as:

\[
f(x) = w^T x + b
\]

where $x$ is the input, $w$ the weight, and $b$ the bias. This can be elaborated as:

\[
f(x) = (w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b)
\]

The output of one layer can then be fed as input to the subsequent layer. However, the previous equation can only represent a linear relationship, therefore it is important to convert the output to a non-linear relationship. To achieve that, an activation function is applied. Now the output can be written as:

\[
u = \tau(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b)
\]

where $u$ is the output and $\tau$ the activation function.

Next we need to demonstrate how to estimate the parameters of the model. The weights of a model are chosen to minimize the difference between the output and the expected output, in other words the error. The error of a model can be expressed in terms of the Mean Squared Error:

\[
E = \sum_l \sum_i \left(\hat{y}_{li} - y_{li}\right)^2
\]

Other error measures can substitute for the MSE in the Neural Network. When the structure and loss function of the designed network are known, the neural network problem becomes a non-linear optimization problem. This type of problem can be solved in many ways; for Neural Networks the choice is the Backpropagation (Gradient Descent) algorithm. The weight $w_i$ is altered according to the error, and the easiest way to reflect this idea can be written as:

\[
\Delta w_i = \alpha \frac{dE}{dw_i} \tag{5.1}
\]

Denote $w(k)$ as the weights at iteration $k$; then the weights at iteration $k+1$ become:

\[
w(k+1) = w(k) + \Delta w(k) \tag{5.2}
\]

If we want to minimize the error, the direction should be the opposite of the gradient, so we can combine Formulas 5.1 and 5.2 as:

\[
w(k+1) = w(k) - \alpha \frac{dE}{dw(k)} \tag{5.3}
\]

This is the simplest form of Gradient Descent. One of its drawbacks is that the calculation of the sum of all gradients is required, and therefore it is computationally heavy. Stochastic Gradient Descent is designed to mitigate the workload of the gradient descent algorithm. Instead of calculating the sum of all gradients, it randomly selects observations to calculate the gradient.

This process is represented as:

\[
E(w) = \sum_{n=1}^{N} E_n(w)
\]

Then we have:

\[
w(k+1) = w(k) - \alpha \frac{dE_n}{dw(k)}
\]

In practical applications, both gradient descent and stochastic gradient descent require the gradient to be computed. This can be achieved using the chain rule, considering the output of the network as a function of the weights.

5.2.1 Hyperparameters

Hyperparameters are parameters that are specified before the training process; unlike the normal parameters of a neural network, hyperparameters cannot be derived or improved by the normal training process. Therefore it is important to select a good set of hyperparameters. In some cases, hyperparameter optimization techniques can be applied to improve the accuracy. However, in this thesis, due to the length and complexity of this topic, such techniques are not implemented. Instead, the chosen hyperparameters are given, and readers can choose to improve the network using their own methods. To explain these terms, we need to start from the gradient descent technique, which, as introduced before, is an iterative method. Following the previous notation, the parameter $\alpha$ is called the learning rate in practical applications. This parameter is itself a hyperparameter that needs to be optimized in some applications.

If we could feed all the data we have to the neural network at once, then there would be no need for a batch size. However, in almost all cases this cannot be achieved, because the data is too large for the computer to handle at once. To solve this problem, we divide our data into smaller pieces and update the weights inside the neural network for each piece of data; in the end, we get the weights of the trained neural network. The size of each small piece is the batch size.

One epoch is defined as the whole data set passing through the neural network one time. Readers who are not familiar with this topic may then ask: why do we need more than one epoch, or in other words, why do we feed the same data more than once? Gradient descent is an iterative method, so for a limited dataset one epoch will not give a satisfying result, in other words an underfitted result. However, for a large number of epochs, the weights inside the network become too focused on the training data, resulting in an overfitted model.

With the epoch defined, we can define an iteration, which is a term that describes how many batches are required to finish one epoch. This is not a hyperparameter, since we have already defined the batch size.
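The update rule in Equation 5.3 and the hyperparameters above (learning rate, batch size) can be illustrated with a small NumPy sketch of mini-batch stochastic gradient descent for a single linear unit with squared error; the data and all hyperparameter values here are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(256, 3))                 # inputs
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1      # targets from a known linear rule

w, b = np.zeros(3), 0.0                       # initial weights and bias
alpha = 0.05                                  # learning rate (a hyperparameter)
batch_size = 32                               # another hyperparameter

for step in range(300):                       # each step uses one random mini-batch
    idx = rng.choice(len(X), size=batch_size, replace=False)
    error = X[idx] @ w + b - y[idx]
    grad_w = 2 * X[idx].T @ error / batch_size   # dE/dw on the mini-batch
    grad_b = 2 * error.mean()
    w -= alpha * grad_w                       # w(k+1) = w(k) - alpha * dE/dw(k)
    b -= alpha * grad_b

print(w.round(3), round(b, 3))                # close to [0.5, -1.0, 2.0] and 0.1
```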


5.3 Activation Function

As illustrated before, a neural network requires an activation function to transform a linear function into a non-linear function. In the following part, we introduce some commonly implemented activation functions [5].

5.3.1 Sigmoid function

The Sigmoid function, also known as the logistic function, is one common activation function implemented in neural networks. The formula for the Sigmoid function is:

\[
f(x) = \frac{1}{1 + e^{-x}}
\]

The graph of the Sigmoid function is given in Figure 5.4.

    Figure 5.4: Graph of Sigmoid function

5.3.2 Hyperbolic Tangent function

The Hyperbolic Tangent function (Tanh) is another commonly used activation function. It is a zero-centered function with limits between $-1$ and $1$. The output of the Hyperbolic Tangent function can be calculated using the following formula:

\[
f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
\]

Figure 5.5 shows the graph of the Tanh function.


  • Figure 5.5: Graph of Hyperbolic Tangent function

5.3.3 Rectified Linear Unit function

The Rectified Linear Unit function (ReLU) is a commonly used activation function in deep learning. The ReLU function can be represented as:

\[
f(x) = \max(0, x)
\]

Because the positive part of the function is linear, the ReLU function is easier to optimize using the gradient descent method. Figure 5.6 shows the graph of this function.

    Figure 5.6: Graph of Rectified Linear Unit function


5.3.4 Exponential Linear Unit

The Exponential Linear Unit (ELU) is a variation of ReLU and can converge faster than the regular version of the activation function. The ELU function is formulated as:

\[
f(z) =
\begin{cases}
z, & z > 0 \\
\alpha(e^{z} - 1), & z \leq 0
\end{cases}
\]

The difference between ReLU and ELU lies in the negative part of the function: ELU saturates smoothly towards $-\alpha$, while ReLU is cut off sharply at zero. Figure 5.7 shows the graph of this function with $\alpha = 0.7$.

    Figure 5.7: Graph of Exponential Linear Unit

5.3.5 Leaky ReLU

Leaky ReLU is a variant of ReLU; the formula is:

\[
f(x) =
\begin{cases}
x, & x > 0 \\
\alpha x, & x \leq 0
\end{cases}
\]

Figure 5.8 shows the graph of Leaky ReLU with $\alpha = 0.1$.


  • Figure 5.8: Graph of Leaky ReLU
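For reference, the activation functions of this section can be written directly in NumPy; the $\alpha$ values used below are the illustrative ones from Figures 5.7 and 5.8.

```python
import numpy as np

# The activation functions of Section 5.3 written as NumPy functions.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=0.7):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

x = np.linspace(-3.0, 3.0, 7)
for f in (sigmoid, tanh, relu, elu, leaky_relu):
    print(f.__name__, np.round(f(x), 3))
```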

5.4 Approaches to Prevent Overfitting

Overfitting is a common problem in the machine learning area, and in neural networks it is not uncommon to encounter it. Overfitting will compromise the performance and accuracy of the neural network in its actual applications. Hence it is necessary to take measures to prevent overfitting, and we therefore introduce several such measures.

5.4.1 Increase Data Size

The most obvious and easiest solution is increasing the size of the data. After all, the cause of overfitting is that there is not enough data to fully train a complicated neural network. However, in many circumstances it is not possible to acquire more data; in our case, there is no more stock return data than what already exists on the market. Therefore other measures need to be applied to prevent overfitting the training data.

5.4.2 Reduce Size of Neural Network

Another approach is reducing the size of the neural network. To elaborate on this concept, we have to define the complexity (capacity) of a neural network. The capacity of a neural network is defined as the number of trainable parameters in the network [23]. A complex neural network has more parameters, which means that it has more capacity to learn and even perfectly represent the training data. For example, assume that our training data consists of 10000 numbers; a network with 200000 trainable parameters will easily find a perfect fit for the training data set. In this case, we have an overfitted neural network.

An overfitted model will not provide any significant prediction for a new set of data, because the network itself is a perfect representation of the training data rather than a general description of the data. Naturally, to solve this problem, the number of trainable parameters in the network needs to be reduced. On the other hand, a network with 100 trainable parameters


will also not give any meaningful prediction of the data, because a network of that size does not have the capacity to represent the training data; in the actual application phase such a model cannot give a meaningful result. Hence finding the right number of trainable parameters in a neural network is the key to training a well-performing neural network.

5.4.3 L1 Regularization

One technique that can mitigate the effect of overfitting is to regularize the size of the parameters; this can be achieved by introducing a regularization term into the loss function. Denote L as the loss function, then this process can be expressed as[46]:

L(f(w,b), y) + \lambda R(w)

where R(w) is the regularization function. If the regularization function is the L1 norm, it is called L1 regularization, which is expressed as:

R(w) = \|w\| = \sum_{k=1}^{L} \sum_{i,j} \left| W^{k}_{i,j} \right|

5.4.4 L2 Regularization

When we replace the L1 norm with the L2 norm, we obtain L2 regularization. The L2 norm is formulated as:

R(w) = \|w\|^{2} = \sum_{m} \sum_{k} W^{2}_{m,k}

The adjusted loss function is then minimized in the neural network, which means that the loss function is minimized under the condition that the elements of the weights do not become too large. The parameter λ controls the amount of regularization. This parameter should be chosen in a balanced way, because a large λ will result in an underfitting model.
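
As an illustration, assuming the Keras framework used later in the empirical part, an L1 or L2 penalty can be attached to a layer as shown below; the layer sizes and the value λ = 0.01 are arbitrary choices for the example, not the settings used in this thesis.

```python
from tensorflow.keras import layers, models, regularizers

# Hypothetical network: 64 inputs, one hidden layer, one output.
# kernel_regularizer adds lambda * R(w) to the loss that is minimized.
model = models.Sequential([
    layers.Dense(32, activation="relu", input_shape=(64,),
                 kernel_regularizer=regularizers.l1(0.01)),   # L1 penalty
    layers.Dense(1, activation="sigmoid",
                 kernel_regularizer=regularizers.l2(0.01)),   # L2 penalty
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```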

5.4.5 Dropout

Another way to prevent overfitting is to apply the dropout technique, proposed by Srivastava et al. in 2014[44]. First, denote the layers by l ∈ {1, ..., L}, u^{(l)} as the input to layer l, y^{(l)} as the output from layer l, and w^{(l)} and b^{(l)} as the weights and bias at layer l. Then a normal feed-forward network is represented as:

u^{(l+1)}_{i} = w^{(l+1)}_{i} y^{(l)} + b^{(l+1)}_{i}

y^{(l+1)}_{i} = \tau\left( u^{(l+1)}_{i} \right)


where τ is the activation function. After implementing dropout, the network looks like:

r^{(l)}_{j} \sim \mathrm{Bernoulli}(P)

\tilde{y}^{(l)} = r^{(l)} \ast y^{(l)}

u^{(l+1)}_{i} = w^{(l+1)}_{i} \tilde{y}^{(l)} + b^{(l+1)}_{i}

y^{(l+1)}_{i} = \tau\left( u^{(l+1)}_{i} \right)

Here ∗ denotes the element-wise product, and the vector r^{(l)} follows a Bernoulli distribution with probability P.
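
The equations above can be written directly in NumPy. The sketch below is a minimal forward pass through one dropped-out layer; the layer sizes, the keep probability P = 0.8 and the choice of tanh as τ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 0.8                                 # keep probability of the Bernoulli mask
n_in, n_out = 10, 5

y_l = rng.normal(size=n_in)             # output y^(l) of the previous layer
W = rng.normal(size=(n_out, n_in))      # weights w^(l+1)
b = rng.normal(size=n_out)              # bias b^(l+1)

r = rng.binomial(1, P, size=n_in)       # r^(l) ~ Bernoulli(P)
y_tilde = r * y_l                       # element-wise product r^(l) * y^(l)
u_next = W @ y_tilde + b                # u^(l+1) = w^(l+1) y~^(l) + b^(l+1)
y_next = np.tanh(u_next)                # y^(l+1) = tau(u^(l+1))
print(y_next)
```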

5.5 Supervised Learning and Unsupervised Learning

The traditional application of a neural network is classification: given a set of data and the corresponding labels (tags) of the data set, an artificial neural network is trained on the data input and the corresponding labels. This kind of task is called Supervised Learning, because the labels or examples are provided by a human; in other words, the algorithm tries to replicate the human's judgment or decision. To be more precise, supervised learning requires us to provide data inputs as well as responses or outcomes, and its job is to predict the output for a given input.

However, in the stock or derivative market it is hard to implement supervised learning. The reason is the difficulty of finding robust examples or tags for the neural network. For example, if we want to train a neural network that can distinguish good stocks from under-performing stocks, we need to give the neural network examples, but recognizing a good stock is not as easy as it sounds. First, there is no universally recognized standard for a well-performing stock, and we cannot guarantee that our examples are correct or will have a good return in the future. This leads to the second difficulty of implementing supervised learning: enforcing biases. Even if we manage to find tags or examples for a supervised neural network, the designed network may amplify human judgments and potentially give false answers. After all, we cannot differentiate luck from skill; to put it another way, we cannot tell whether investors earn money because they are lucky or because they possess the necessary expertise.

To solve this problem, another type of learning, Unsupervised Learning, is applied. Unsupervised learning only needs input data and tries to extract features by itself without any guidance from a human, rather than making predictions against provided labels. Consequently, unsupervised learning tends to find interesting features that cannot easily be found by a human, making it suitable for applications in the stock market and portfolio optimization, because it may reveal interesting new features. Apart from this advantage, it is easier to acquire unlabelled stock data, since tagging data takes extra effort.

Under the category of unsupervised learning, one particular application is gaining attention in both research and applications: the Autoencoder. An autoencoder is a network that


has the same input and output data. A simple explanation of an autoencoder is illustrated in Figure 5.9:

[Diagram: Data → Encoder → Compressed Data → Decoder → Data]

    Figure 5.9: Graphical Explanation of Autoencoder

First, the data is fed into an encoder, whose output has fewer dimensions than the original data. Next, a decoder processes the compressed data and returns data with the same dimension as the input. An autoencoder thus forces the data to be compressed into a lower dimension. By doing so, the algorithm has to extract useful features from the data in order to minimize the difference between the original data and the recovered data.
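
A minimal Keras sketch of such an autoencoder is given below, assuming an input of 64 features compressed to an 8-dimensional code; the sizes and activations are illustrative and not a configuration used in this thesis.

```python
from tensorflow.keras import layers, models

n_features, code_dim = 64, 8                                    # illustrative dimensions

inputs = layers.Input(shape=(n_features,))
code = layers.Dense(code_dim, activation="relu")(inputs)        # encoder
outputs = layers.Dense(n_features, activation="linear")(code)   # decoder

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# Training uses the input itself as the target, so the network must learn a
# compressed representation that can reconstruct the data:
# autoencoder.fit(x_train, x_train, epochs=50, batch_size=32)
```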

Apart from feature extraction, one type of autoencoder, the generative autoencoder, can also randomly generate data that is similar to the original data. This type of generation opens a new direction for Monte-Carlo simulation: more realistic random data will give a better estimate of the portfolio risk, and hence the portfolio optimization will select a lower-risk portfolio.

5.6 Generative Adversarial Network

The Generative Adversarial Network (GAN) is a type of generative network introduced by I. Goodfellow in 2014[19]. It consists of two parts: a generator and a discriminator (called a critic in some literature). The core idea of a GAN is a competition between the generator and the discriminator. The discriminator tries to identify whether a sample is real or not; it is a supervised network with a binary output. In our setting, the real examples fed to this network are the actual stock price data.

The job of the discriminator is to classify data into two groups: real and fake. At the same time, the generator tries to generate sample data that can be classified as real. As a real-world analogy, the generator can be described as a counterfeit artist who tries to create a fake copy of a world-famous painting, while the discriminator is an art specialist who can distinguish fake paintings from real artwork. Under this metaphor, the training process consists of the counterfeit artist repeatedly sending paintings to the art specialist, who tells whether each painting is real or not. Once the paintings are classified as real artwork, the counterfeit artist can produce an unlimited number of paintings that have the same characteristics as a real painting. A graphical representation of this process is shown in Figure 5.10.


[Diagram: Noise → Generator → Generated Data; Real Data and Generated Data → Discriminator (train / identify); Output: data with the same characteristics as real data]

Figure 5.10: Graphical Representation of GAN

To help readers further understand GAN, a more detailed explanation is now introduced.

First, denote the Discriminator and Generator as functions D and G, with parameters θ_D and θ_G respectively. Under this notation the optimization process of a GAN can be represented as[18]: the Discriminator tries to minimize C_D(θ_D, θ_G) while only changing the parameter θ_D, and the Generator tries to minimize C_G(θ_D, θ_G) while only changing the parameter θ_G. This process is in essence an optimization problem whose objective is to find the minimum point (although sometimes only a local minimum is found).

As introduced before, the training process uses the Stochastic Gradient Descent method. Denote the input noise to the Generator as x_G and the observed variable (processed stock returns in this case) as x_D. Because the neural network algorithm cannot process the entire data set at once, the data is divided into several batches, which are fed into the network to update the gradients. Note that the two networks are updated simultaneously[18]: at each step, the Discriminator updates the parameter θ_D to reduce the value of C_D, while the Generator updates the parameter θ_G to reduce the value of C_G.

5.6.1 Cost function

As previously explained, training a neural network is an optimization problem, so it is important to specify the cost function. The discriminator in a GAN can be constructed as a normal deep neural network with a Binary Cross-Entropy loss function, because in this case the tags for the data are simply real and fake. This type of problem can be described as a binary classification problem.


The cost function of the Discriminator is given as[19]:

C_D\left(\theta^{(D)}, \theta^{(G)}\right) = -\frac{1}{2}\,\mathbb{E}_{x_D \sim p_{\mathrm{data}}} \log D(x_D) - \frac{1}{2}\,\mathbb{E}_{x_G} \log\left(1 - D(G(x_G))\right) \qquad (5.4)

Readers who are familiar with neural networks will recognize that this is the standard form of the cost function for a binary classification problem. The Discriminator implements the core idea of binary classification, with tags marking the actual data and the data generated by the Generator.

In terms of the cost function of the Generator, the objective is to minimize the Cross-Entropy between the output of the Generator and the actual data. The Cross-Entropy loss function measures the distance between the empirical data distribution and the model distribution, so by repeated training the generated data will approach a similar distribution. The cost function of the Generator is given as[18]:

C_G = -\frac{1}{2}\,\mathbb{E}_{x_G} \log D(G(x_G)) \qquad (5.5)

This cost function can be explained as the Generator trying to maximize the log probability that the discriminator classifies the generated data as real, i.e. that the discriminator fails to recognize the difference between the actual data and the data produced by the Generator.
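
To connect the cost functions (5.4) and (5.5) to an implementation, the sketch below shows one simultaneous update step in Keras, expressing both costs through binary cross-entropy. It assumes three models that are not defined here: a compiled `discriminator`, a `generator`, and a stacked model `gan` (generator followed by the discriminator with the discriminator's weights frozen); the batch size and latent dimension are placeholders, and the thesis code (Chapter 6) may be organized differently.

```python
import numpy as np

latent_dim, batch_size = 200, 60   # placeholder hyperparameters

def train_step(generator, discriminator, gan, real_batch):
    # Discriminator step: minimize C_D, i.e. binary cross-entropy with
    # label 1 for real data and label 0 for generated data.
    noise = np.random.normal(size=(batch_size, latent_dim))
    fake_batch = generator.predict(noise, verbose=0)
    discriminator.train_on_batch(real_batch, np.ones((batch_size, 1)))
    discriminator.train_on_batch(fake_batch, np.zeros((batch_size, 1)))

    # Generator step: minimize C_G by asking the frozen discriminator
    # (inside the stacked `gan` model) to label generated samples as real.
    noise = np.random.normal(size=(batch_size, latent_dim))
    gan.train_on_batch(noise, np.ones((batch_size, 1)))
```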

5.7 Implement Neural Network in Portfolio Optimization

With the necessary introductions made, a brief description is now presented of how this type of neural network is applied. Investors can use neural network techniques, specifically generative models, to conduct portfolio optimization. As suggested before, we can bring the idea of Monte-Carlo simulation to our result, since the generated data has similar characteristics to actual stock returns. With this kind of procedure, the neural network has the capability to simulate stock prices in different scenarios. Hence, the optimized portfolio is more robust than one from traditional Markowitz portfolio optimization. In other words, the designed portfolio may not be the one that gives the highest return or the smallest risk in one particular case; however, since the Monte-Carlo simulation covers the majority of the scenarios, the designed portfolio provides a more secure position across all possible outcomes compared to the static optimization of a standard portfolio optimization method.

5.7.1 How to Optimize Portfolio from Output of Neural Network

Since the goal of the portfolio optimization is to minimize the risk of the constructed portfolio (in terms of VaR or CVaR), it is necessary to determine the weight of each stock that gives the least amount of risk in terms of value at risk or conditional value at risk. This kind of idea has been implemented before: Rockafellar 2000[37] gives a solution for portfolio optimization in terms of conditional value at risk. The method in this article is considered groundbreaking; some even call it Markowitz 2.0 to signify its importance. However, this method cannot be used in our application, because it assumes a particular distribution of returns (a smooth


multivariate discrete distribution, to be exact). As demonstrated before, we do not want to make any distributional assumption about asset returns. To solve this problem, a grid search is implemented to find the optimal weights of the portfolio. This is not an optimal solution, but due to time constraints we choose this simple approach. Readers are encouraged to research this further and propose a better solution.
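
As an illustration of the grid-search idea (not the exact implementation used in this thesis), the sketch below enumerates candidate weight vectors on a coarse grid for a small number of assets and keeps the one with the lowest 95% CVaR of the simulated portfolio loss. The array `terminal_values` is an assumed (n_paths, n_assets) matrix of simulated one-year terminal values of 1 unit invested in each asset.

```python
import itertools
import numpy as np

def cvar_loss(portfolio_terminal_values, alpha=0.05):
    # Loss relative to the initial value 1; the 95% CVaR of the loss is the
    # average loss over the worst 5% of the simulated scenarios.
    losses = 1.0 - portfolio_terminal_values
    var = np.quantile(losses, 1.0 - alpha)
    return losses[losses >= var].mean()

def grid_search_weights(terminal_values, steps=5):
    # terminal_values: shape (n_paths, n_assets). Only feasible for a handful
    # of assets, since the grid grows exponentially with their number.
    n_assets = terminal_values.shape[1]
    grid = np.linspace(0.0, 1.0, steps + 1)
    best_w, best_risk = None, np.inf
    for w in itertools.product(grid, repeat=n_assets):
        w = np.asarray(w)
        if not np.isclose(w.sum(), 1.0):      # keep only fully invested portfolios
            continue
        risk = cvar_loss(terminal_values @ w)
        if risk < best_risk:
            best_w, best_risk = w, risk
    return best_w, best_risk

# Example with placeholder simulated terminal values for 3 assets.
rng = np.random.default_rng(0)
tv = np.exp(rng.normal(0.05, 0.2, size=(10_000, 3)))
print(grid_search_weights(tv))
```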


Chapter 6

    Empirical Study

6.1 Data, Software and Hardware

6.1.1 Data and Data Source

The data source of this thesis is Yahoo Finance; we use its API (Application Programming Interface) to download data from the server. In terms of data, we select stocks listed on Nasdaq Stockholm, which comprises 378 stocks1. The data covers the period from 2009-02-09 to 2019-02-08. Prices are quoted daily and consist of the Open, Close, High, Low and Adjusted Close price of each day. The data also contains the daily trading volume of each stock.
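
The thesis downloads this data through the Yahoo Finance API, but the exact client is not specified here; as an assumed illustration, the community package yfinance is one way to obtain the same fields (Open, High, Low, Close, Adjusted Close, Volume). The ticker below is only an example of a Nasdaq Stockholm listing on Yahoo Finance.

```python
import yfinance as yf

# Illustrative download for one Stockholm-listed ticker over the study period.
data = yf.download("ABB.ST",
                   start="2009-02-09",
                   end="2019-02-08",
                   auto_adjust=False)   # keep Adjusted Close as a separate column
print(data.head())                      # Open, High, Low, Close, Adj Close, Volume
```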

6.1.2 Software Choice

For the empirical studies, the programming language Python is selected, together with the necessary scientific packages such as numpy[22], scipy[48] and pandas[30]. For the neural network part of the empirical studies, the framework is Keras[8] with a Tensorflow-GPU backend[1], which means that the neural networks are run on the GPU of the computer. The operating system is Ubuntu 18.042.

6.1.3 Hardware

Throughout this thesis, all code is run on a PC with an Intel i5-8400 processor and 8GB of RAM. The graphics card is an Nvidia GTX 1060 with 6GB of graphics memory.

1 http://www.nasdaqomxnordic.com/aktier/listed-companies/stockholm
2 https://www.ubuntu.com/desktop


6.2 Risk Measurement

6.2.1 Volatility

The most common way to measure the risk of an investment asset is volatility; mathematically, volatility is defined as the standard deviation of the return. The core idea of using the standard deviation as a risk representation is that the average is the expected outcome of a stock, so a stock whose returns deviate more from the mean carries more risk. In statistical terms, this deviation is measured with the variance or standard deviation.

6.2.2 Value at Risk

Nonetheless, using the standard deviation to measure the risk of investment assets has its disadvantages, because it only represents the deviation from the mean regardless of the direction of the return, while investors prefer deviations in the positive direction. Consequently, another representation of risk can be defined, specifically one that measures downside risk. One such measurement is Value at Risk (VaR). Denote by X the value of the investment asset and let the parameter satisfy 0% < α < 100%; then the α VaR is defined as[12]:

\mathrm{VaR}_{\alpha}(X) = \min\{c : P(X \le c) \ge \alpha\}

VaR can be interpreted as the minimum loss in the α worst-case scenarios, i.e. at the (1−α) confidence level. By applying VaR, investors can better estimate the downside risk of their assets.

6.2.3 Conditional Value at Risk

Another risk representation is the Conditional Value at Risk (CVaR). Denote by X the value of the investment asset and let 0% < α < 100%; then the α CVaR is defined as[37]:

\mathrm{CVaR}_{\alpha}(X) = \mathbb{E}\left[\, X \mid X \le \mathrm{VaR}_{\alpha}(X) \,\right]

Compared to VaR, CVaR has several distinct advantages, which make it preferable. In Sarykalin 2008[41] the authors discuss these advantages in several respects. Compared to VaR, CVaR has the following advantages (a short numerical sketch of both measures follows this list):

1. CVaR has better mathematical properties: the risk measure represented by CVaR is coherent.

2. CVaR deviation can represent risk; in other words, it is a good substitute for the standard deviation.

3. Risk management based on CVaR is more efficient than risk management based on VaR. To be more precise, CVaR can be optimized with regular optimization methods.

4. CVaR takes into account the losses that exceed a certain level, whereas VaR does not consider the magnitude of losses beyond that level.
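
As announced above, the sketch below computes both measures from a sample of simulated terminal values, using the value-as-a-proportion convention adopted later in this chapter (95% level, i.e. a 5% tail): VaR is the 5% quantile of the terminal values and CVaR is the mean of the values below it. The lognormal test sample is a placeholder, not the thesis data.

```python
import numpy as np

def var_cvar(terminal_values, alpha=0.05):
    # terminal_values: simulated values of 1 unit invested, after one year.
    var = np.quantile(terminal_values, alpha)            # 95% VaR (value convention)
    cvar = terminal_values[terminal_values <= var].mean()  # mean of the worst 5%
    return var, cvar

# Illustration with normally distributed yearly log returns.
rng = np.random.default_rng(1)
sample = np.exp(rng.normal(0.05, 0.2, size=250_000))
print(var_cvar(sample))
```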


6.3 Monte Carlo Simulation

A Monte Carlo simulation is a type of simulation that conducts random sampling repeatedly and then uses statistical methods to analyze the result[36]. To put it in more detail, it is a simulation of alternative scenarios: the results from a Monte Carlo simulation could be the daily returns of financial assets that would occur if the actual outcomes had not happened. By repeatedly simulating these potential realities, a more accurate estimation of the financial assets can be achieved.

To perform a Monte Carlo simulation, we should first identify a statistical distribution; the most common choice is the normal distribution. We then draw random variables from this distribution to represent the daily returns of a stock, and calculate the value of the stock at a given time along a specific path.

6.3.1 Simulated Path of Monte Carlo Simulation

Following the introduction in the previous section, we choose to draw random samples from the Gaussian distribution. In this case, we use daily log returns as the random variables, because this makes it easier to calculate the return over a given period.

To demonstrate path generation, we show one of the paths generated by the Monte Carlo simulation based on the statistical properties of the ABB stock. Figure 6.1 is an example path generated by the Monte Carlo simulation.

    0 50 100 150 200 250Days

    1.0

    1.1

    1.2

    1.3

    1.4

    Val

    ue

    ABB price path

    Figure 6.1: One path generate by Monte Carlo simulation
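
A minimal sketch of this path generation is shown below, assuming i.i.d. Gaussian daily log returns whose mean and standard deviation would be estimated from the historical series; the values of μ and σ used in the call are placeholders rather than ABB's actual estimates.

```python
import numpy as np

def simulate_paths(mu, sigma, n_days=252, n_paths=10_000, seed=0):
    # Draw i.i.d. Gaussian daily log returns and compound them so that each
    # path starts at value 1; returns an array of shape (n_paths, n_days + 1).
    rng = np.random.default_rng(seed)
    log_returns = rng.normal(mu, sigma, size=(n_paths, n_days))
    paths = np.exp(np.cumsum(log_returns, axis=1))
    return np.hstack([np.ones((n_paths, 1)), paths])

# Placeholder daily log-return parameters; in the thesis they would be
# estimated from the historical ABB return series.
paths = simulate_paths(mu=0.0003, sigma=0.015)
terminal_values = paths[:, -1]   # feeds the VaR/CVaR computation above
```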


6.3.2 Calculate VaR using Monte Carlo Method

VaR can be calculated using several approaches; a detailed explanation of the techniques can be found in Duffie 1997[12]. In this part, the Monte Carlo method is chosen to calculate VaR, because it provides a good comparison with our proposed Monte Carlo method that incorporates deep learning.

In this example, the VaR of the stocks is calculated over a one-year period. We apply a statistical method to generate random returns over one year, repeat this several times and compute the terminal value of each path. To calculate the 95% VaR of the stocks, the 5% quantile of the yearly terminal values is computed. Figure 6.2 shows the VaR of ABB using the Monte Carlo method with 250000 generated paths. In this thesis, VaR is represented in a different format: instead of representing the loss in terms of money, we represent it as a proportion of the initial value. For example, if the 95% VaR of a stock is 0.5, then in the worst 5% of scenarios the minimum loss of this stock is 1 − 0.5 = 50%. For convenience, in the following studies a 95% confidence level is chosen when computing VaR and CVaR.

    Figure 6.2: VaR of ABB using Monte Carlo simulation

6.3.3 Calculate CVaR using Monte Carlo Method

The first step in calculating CVaR is to compute VaR; then the terminal values that are lower than the VaR are selected and their mean is computed. Figure 6.3 shows the CVaR of ABB using the Monte Carlo method with 250000 generated paths.


Figure 6.3: CVaR of ABB using Monte Carlo simulation

As follows from the definition, the CVaR (expressed as a proportion of the initial value) is lower than the VaR. To help readers understand the relation between VaR and CVaR, we give the comparison in one figure: Figure 6.4 shows the VaR and CVaR of ABB using the Monte Carlo method with 250000 generated paths.

    Figure 6.4: VaR and CVaR of ABB using Monte Carlo simulation


In Appendix B, we give the VaR and CVaR of all the stocks fed into the proposed neural network structure.

6.3.4 Markowitz GMV Portfolio Selection

The Global Minimum Variance (GMV) portfolio, as introduced before, is the most traditional form of portfolio optimization; its goal is to minimize the risk of the portfolio. Since the objective of this thesis is to find a framework that reduces risk in terms of variance or CVaR, it is necessary to also perform a portfolio optimization using the traditional method.

A program that computes the weights of the global minimum variance portfolio is executed; the weights are presented in the appendix.
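
For reference, when short selling is allowed the GMV weights have the standard closed-form solution w = Σ⁻¹1 / (1ᵀΣ⁻¹1), where Σ is the covariance matrix of returns. The sketch below computes these weights from a return matrix; whether the thesis program uses this closed form or a numerical optimizer (e.g. with a no-short-selling constraint) is not specified here, and the random data stands in for the historical return matrix.

```python
import numpy as np

def gmv_weights(returns):
    # returns: array of shape (n_days, n_assets) with daily returns.
    # Closed-form GMV solution: w = Sigma^{-1} 1 / (1' Sigma^{-1} 1).
    sigma = np.cov(returns, rowvar=False)
    ones = np.ones(sigma.shape[0])
    w = np.linalg.solve(sigma, ones)
    return w / w.sum()

rng = np.random.default_rng(2)
weights = gmv_weights(rng.normal(0.0005, 0.01, size=(2500, 5)))
print(weights, weights.sum())   # weights sum to 1
```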

As can be seen in the table, the weights of this portfolio are overly concentrated, so it is not a very diversified portfolio. The value of the GMV portfolio over 10 years is presented in Figure 6.5.

    Figure 6.5: Value of GMV portfolio in 10 years

To evaluate the performance of the constructed GMV portfolio, we present the traditional average yearly return and volatility. The return and risk of the GMV portfolio are presented in Table 6.1.

Yearly Return    Yearly Volatility
6.14%            6.44%

    Table 6.1: Yearly return and volatility of GMV portfolio


This portfolio optimization gives good insight into the constructed portfolio: it has a relatively good return with low volatility. However, this evaluation is not suitable for comparison with our proposed neural-network-based minimum-risk portfolio. Therefore we also present the VaR and CVaR of this constructed portfolio, to be compared with the subsequent empirical studies. Figure 6.6 shows the comparison between VaR and CVaR, both calculated with the same Monte Carlo simulation method using 250000 paths.

    Figure 6.6: VaR and CVaR of GMV portfolio using Monte Carlo simulation

    6.4 Studies on GAN

6.4.1 Structure of GAN

In this thesis, a GAN is constructed according to the structure introduced earlier. The generator consists of two fully connected intermediate layers with the Leaky ReLU activation function and parameter α = 0.2. The output of the second layer is passed to a layer with a flattened number of nodes and the Hyperbolic Tangent activation function, and is then reshaped to the same shape as the input. The discriminator network contains two fully connected layers with Leaky ReLU as the activation function, also with parameter α = 0.2; its output is connected to a layer with a single node and a sigmoid activation function. The discriminator takes as input both the actual price data and the simulated data generated by the generator network. In terms of the actual input, the data is reshaped in the following way:

The input to the network is a rank-3 tensor (three-dimensional tensor): the first axis represents time, the second axis represents features of the stocks (for example daily returns or closing prices), and the third axis represents the different stocks. The output of the generator network has the


same data structure. This is demonstrated in Figure 6.7. This kind of structure allows us to add features to the input flexibly.

[Diagram: a rank-3 tensor with axes Time, Features and Stocks]

Figure 6.7: Data structure of input

The generator takes a different kind of input. The nature of a GAN's generator requires noise as input, and we choose to feed Gaussian white noise to the generator. We also need to specify the dimension of the noise; this latent dimension is a hyperparameter that needs to be defined before the training process.

We then program the GAN in Python using the package Keras; the code borrows some structure and methods from open-source code on GitHub[27].
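
A simplified Keras sketch consistent with the description above is given below. The layer widths (128, 256, 512), latent dimension 200 and Leaky ReLU slope α = 0.2 follow the hyperparameters quoted in Section 6.4.3, but the input shape is illustrative and the exact code of the thesis (which builds on [27]) may differ in details.

```python
import numpy as np
from tensorflow.keras import layers, models, optimizers

latent_dim = 200
n_days, n_features, n_stocks = 252, 1, 20        # illustrative input shape
sample_shape = (n_days, n_features, n_stocks)

def build_generator():
    return models.Sequential([
        layers.Dense(128, input_dim=latent_dim),
        layers.LeakyReLU(alpha=0.2),
        layers.Dense(256),
        layers.LeakyReLU(alpha=0.2),
        layers.Dense(512),
        layers.LeakyReLU(alpha=0.2),
        layers.Dense(int(np.prod(sample_shape)), activation="tanh"),
        layers.Reshape(sample_shape),            # same shape as the real input tensor
    ])

def build_discriminator():
    model = models.Sequential([
        layers.Flatten(input_shape=sample_shape),
        layers.Dense(512),
        layers.LeakyReLU(alpha=0.2),
        layers.Dense(256),
        layers.LeakyReLU(alpha=0.2),
        layers.Dense(1, activation="sigmoid"),   # real (1) vs fake (0)
    ])
    model.compile(optimizer=optimizers.Adam(1e-4), loss="binary_crossentropy")
    return model

# Stacked model used to train the generator while the discriminator is frozen,
# matching the training-step sketch in Section 5.6.1.
generator, discriminator = build_generator(), build_discriminator()
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer=optimizers.Adam(1e-4), loss="binary_crossentropy")
```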

6.4.2 Key Point on Selecting Batches

Some readers may be familiar with neural network applications in image recognition or generation; in those applications a batch can be randomly selected from the data set (the pictures) without any consideration of order. By doing so, the training data set is effectively extended, improving the training process. However, in this application that technique is not used, because we do not want any information leakage during training. To explain this, consider visual recognition: when a program tries to identify handwritten digits, the training data are pictures of handwritten digits, and the order of the pictures is not important; in other words, we can feed the last picture first and the result will not change much.

In this application, by contrast, the order of the data is very important. Because stock prices and returns are time series, time is a key factor in the structure of the input data. If the same shuffling approach were applied, information from the future would leak into the present. To put it as a metaphor: if future price data were known today, would investors have the same estimation of stock


risk? Hence, in the empirical study the batches are selected in time order, which ensures that no information from the future leaks into the present.
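
A sketch of this time-ordered batch selection is given below: each batch is a contiguous window of the return array in chronological order, so no future observation enters an earlier batch. The batch size of 60 matches the hyperparameter quoted later; the placeholder data stands in for the processed return tensor.

```python
import numpy as np

def time_ordered_batches(data, batch_size=60):
    # data: array whose first axis is time (e.g. daily return observations).
    # Yields contiguous, non-overlapping batches in chronological order,
    # never shuffling across time, so no future information leaks backwards.
    for start in range(0, data.shape[0] - batch_size + 1, batch_size):
        yield data[start:start + batch_size]

returns = np.random.normal(size=(2500, 20))   # placeholder return matrix
for batch in time_ordered_batches(returns):
    pass   # each batch would be fed to the GAN training step here
```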

6.4.3 Output from GAN

To obtain simulated paths from the GAN, the designed GAN first has to be trained on real data. Then, to obtain the simulated price paths, noise is generated and fed into the trained generator.

In the price simulation studies, we simulate the price movements of the selected stocks over a one-year period (252 trading days). To be more precise, each stock starts at value 1, and the value of each stock at the end of the one-year period is recorded. Within this one-year period, the result is represented as a matrix, with one axis representing time and the other representing the stocks.

To simulate the potential realities in this one-year period, we repeatedly generate stock values; in this case, we do it 10000 times. This number is selected based on the RAM constraints of the working computer, since the generated tensor occupies about 4GB of RAM. Readers with more RAM can increase this number to obtain a more accurate estimation of risk.
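
Under the same naming as the earlier sketches (a trained Keras `generator` with latent dimension 200), drawing the 10000 simulated realizations amounts to feeding that many Gaussian noise vectors through the generator; any rescaling from the network's tanh output back to price levels is omitted here.

```python
import numpy as np

n_simulations, latent_dim = 10_000, 200

# Gaussian white noise with mean 0 and standard deviation 1, one latent
# vector per simulated realization of the one-year scenario.
noise = np.random.normal(0.0, 1.0, size=(n_simulations, latent_dim))

# Each output has the same (time, features, stocks) shape as the real input;
# `generator` is assumed to be the trained generator sketched above.
simulated = generator.predict(noise, batch_size=500)
print(simulated.shape)
```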

The first 5 simulated paths for one of the assets are given to help readers understand the generated paths. The model is trained with the following parameters: 80 epochs, batch size 60, latent dimension 200; the generator's layers have 128, 256 and 512 nodes respectively.

    0 50 100 150 200 250Days

    0.8

    0.9

    1.0

    1.1

    Val

    ue

    Figure 6.8: ABB Price paths generated by GAN

To further examine the output of the GAN, it is necessary to study the statistical properties of the output. The most obvious way is to examine the histogram of the generated Monte-Carlo


[Histogram: terminal values after one year; legend: Normal vs Network]

Figure 6.9: Histogram comparison

simulation. First, we present the histogram of the terminal value after one year with 10000 samples. To help readers better understand the differences between the neural network and the traditional Monte-Carlo simulation, Figure 6.9 shows the comparison of the terminal-value histograms.

From the figure, it is obvious that the neural network estimates more downside risk than the normal Monte Carlo simulation. Hence, from these results investors can infer that, according to the neural network, the ABB stock has more downside risk than the estimate based on the normal distribution. To further understand the output, the daily returns generated by the neural network are presented and compared with normally distributed data. The comparison of simulated ABB daily returns is presented in Figure 6.10.

As demonstrated before, in a portfolio setting it is crucial to estimate the covariance between different stocks; a correct estimation of the covariance is vital for constructing a good low-risk portfolio. The heatmap of the correlation coefficient matrix is presented in Figure 6.11. The data is one of the simulated one-year stock return data sets generated by the neural-network-based Monte Carlo simulation.


[Histogram: simulated daily returns; legend: Normal vs Network]

Figure 6.10: Histogram comparison (daily return)

Figure 6.11: Heatmap of one set of generated data

To verify the accuracy of the results, the heatmap of the correlation coefficient matrix of the real 10-year stock return data is given in Figure 6.12.


Figure 6.12: Heatmap of real stock returns data

As demonstrated before, the covariance between stocks can vary a lot over time, so the simulated data set should exhibit similar characteristics. Therefore the rolling correlation between stocks is investigated. Figure 6.13 is the graphical representation for the stocks AAK and ABB. The result shows that the data generated by the neural network can reproduce the time-varying covariance between different stocks. Compared to the traditional Monte Carlo method with data generated from a multivariate normal distribution, this method should give a more reliable estimation of risk.
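
As an illustration, assuming the simulated returns are stored in a pandas DataFrame with one column per stock, the rolling correlation between AAK and ABB can be computed as follows; the 60-day window is an arbitrary choice for the example, and the random data is a placeholder for the simulated returns.

```python
import numpy as np
import pandas as pd

# Placeholder simulated daily returns for the two stocks over one year.
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(0.0, 0.01, size=(252, 2)), columns=["AAK", "ABB"])

# Rolling 60-day correlation between the two return series.
rolling_corr = df["AAK"].rolling(window=60).corr(df["ABB"])
print(rolling_corr.dropna().head())
```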


Figure 6.13: Rolling Correlation of Neural Network

We then compare the generated paths with the normal paths and, more importantly, compute the risk of the assets in terms of VaR and CVaR. With the output from the GAN, we can run a VaR and CVaR calculation similar to the one above. The result from the designed GAN with the previously specified hyperparameters is presented in Appendix C. From the output one can observe that this application gives good VaR estimations for many stocks, while some estimations are not realistic: a VaR above 1 cannot be correct, since it would mean that in the worst-case scenarios the stock still has a positive return. Naturally, we ask why there are wrong estimations in the result, given that the GAN estimations for SAAB, ICA and ATRE are very similar to the traditional VaR estimations. Since all the results are generated by the generator from random noise with mean 0 and standard deviation 1, the GAN does possess the ability to replicate the actual data.

The reason behind this problem may be the GAN's sensitivity to hyperparameters, which makes a GAN harder to train than a normal neural network[2]. In the following parts, a small experiment is conducted to observe the effect of the hyperparameters on the result of the GAN. Since 196 stocks is too large a sample for this type of study, in the subsequent experiments the sample size is reduced to 20 stocks to better isolate the effect of the different parameters.

6.4.4 The Effect of Epoch

In this section, we run the program with the same parameters except for the number of epochs; the results, the VaRs of these 20 stocks, are presented in Appendix D. We start from 1 epoch; unsurprisingly, 1 epoch does not give a satisfactory result, as the output is closer to the initial state of the neural network. As we can see in the result, the output from the GAN is completely wrong,


with no practical meaning; this corresponds to the nature of the generator, since it takes noise as input.

Then, as the number of epochs grows, the results start showing some accurate estimations of the stocks' risk; however, the inst