

Uppsala University
Master's thesis

High-variance multivariate time series forecasting using machine learning

Nikola Katardjiev

Subject: Information Systems
Corresponds to: 30 hp

Supervised by Andreas Hamfelt
Examined by Steve McKeever

June 4, 2018


Abstract

There are several tools and models found in machine learning that can be used to forecast a certain time series; however, it is not always clear which model is appropriate for selection, as different models are suited for different types of data, and domain-specific transformations and considerations are usually required. This research aims to examine the issue by modeling four types of machine- and deep learning algorithms - a support vector machine, a random forest, a feed-forward neural network, and an LSTM neural network - on a high-variance, multivariate time series to forecast trend changes one time step in the future, accounting for lag. The models were trained on clinical trial data of patients in an alcohol addiction treatment plan provided by an Uppsala-based company. The results showed moderate performance differences, with a concern that the models were performing a random walk or naive forecast. Further analysis showed that at least one model, the feed-forward neural network, was not doing so and was able to make meaningful forecasts one time step into the future. In addition, the research also examined the effect of optimization processes by comparing a grid search, a random search, and a Bayesian optimization process. In all cases, the grid search found the lowest minima, though its slow runtimes were consistently beaten by Bayesian optimization, whose performance was only slightly lower than that of the grid search.

Key words — Data science, alcohol abuse, time series, forecasting, machine learning, deep learning, neural networks, regression


Summary

There are several tools and models in machine learning that can be used to perform time series forecasts, but it is rarely clear which model is the appropriate choice, since different models are suited to different kinds of data. This research aims to examine the problem by training four models - a support vector machine, a random forest, a neural network, and an LSTM network - on a high-variance multivariate time series to predict trend differences one time step ahead, controlling for lag. The models were trained on clinical trial data from patients participating in an alcohol addiction treatment plan run by an Uppsala-based company. The results showed some moderate performance differences, and there was a concern that the models were performing a random walk forecast. The analysis revealed, however, that one of the neural network models was not making such a forecast, but instead produced meaningful predictions. The research also examined the effect of optimization processes by comparing a grid search, a random search, and Bayesian optimization. In all cases the grid search found the lowest minimum, but its slow runtimes were consistently beaten by Bayesian optimization, which also performed on a par with the grid search.

Key words — Data science, alcohol abuse, time series analysis, forecasting, machine learning, deep learning, neural networks, regression


Acknowledgements

I would like to thank my supervisor, Andreas Hamfelt, for his consistent help in outlining my thesis, as well as for his valuable feedback and comments.

In addition, I would like to give a special thanks to Prashant Singh for his guidance and expertise in the machine learning field, a landscape previously unknown to me.

Finally, this research could not have been done without the help of Fredrik Remaeus, Andreas Zetterström, and Markku Hämäläinen, as well as everyone else at Kontigo Care AB, for their immeasurable support, and I am extremely grateful for the resources and advice they provided for this research.


Contents

1 Introduction
  1.1 Background
  1.2 Problem area
  1.3 Research questions and purpose
  1.4 Delimitations

2 Related work
  2.1 Machine learning
  2.2 Substance abuse and machine learning
  2.3 Regression and time series forecasting
  2.4 Optimization processes

3 Theoretical framework
  3.1 Data science
    3.1.1 Time series forecasting
  3.2 Machine learning
    3.2.1 Deep learning
    3.2.2 Machine training process
  3.3 Classical machine learning algorithms
    3.3.1 Support vector machines
    3.3.2 Decision trees
    3.3.3 Random forest algorithm
  3.4 Deep learning models
    3.4.1 Multi-layer feed-forward neural network
    3.4.2 Recurrent neural network and LSTMs
  3.5 Persistence forecasts
    3.5.1 ARIMA regression models
    3.5.2 Random biased forecast
  3.6 Development tools
    3.6.1 Python and TensorFlow
    3.6.2 R - statistical programming language

4 Methodology
  4.1 Research paradigm and approach
  4.2 Training and validation data
  4.3 Feature selection and pre-processing
  4.4 Model development, training process, and data collection
    4.4.1 Grid search
    4.4.2 Manual and random search
    4.4.3 Bayesian optimization
  4.5 K-fold cross-validation
  4.6 Evaluation

5 Results
  5.1 Summary of model performances
  5.2 Differences in optimization techniques
  5.3 Results from the support vector regressor
  5.4 Results from the Random Forest algorithm
  5.5 Results from the multilayer feed-forward neural network
  5.6 Results from the recurrent neural network

6 Analysis
  6.1 The naive forecasting problem
  6.2 Model validity
  6.3 Ensemble learning
  6.4 Verification with an independent data set
  6.5 Analysis of optimization processes

7 Discussion and conclusions
  7.1 Summary of research findings
  7.2 Potential model improvements
    7.2.1 Further data transformations
    7.2.2 Defining outputs categorically
  7.3 Conclusions

8 Future work


List of Figures

1 A support vector machine with a linear kernel separating two dummy classes
2 Example of a non-linear kernel in a support vector machine
3 Example of a decision tree
4 Example of a simple feed-forward neural network with one hidden layer
5 Example of an artificial neuron
6 An illustration of how past states are passed through a recurrent neural network
7 A single LSTM containing a cell, an input gate, an output gate, and a forget gate
8 A Bayesian optimization on a univariate Gaussian process
9 A k-fold cross-validation diagram with k = 10
10 Performance of a support vector machine on the testing set with no historical data
11 Performance of a support vector machine on the testing set with historical data
12 Performance of a random forest algorithm over the testing set with no historical data
13 Performance of a random forest algorithm over the testing set with historical data
14 Zoomed view of changes in Random Forest performances
15 Accuracy changes of a Random Forest in relation to the number of trees
16 Performance of an artificial neural network over the testing set with no historical data
17 Performance of an artificial neural network over the testing set with historical data
18 Change in the loss function of an artificial neural network over 500 epochs
19 Performance of a recurrent LSTM network on the testing set with no historical data
20 Performance of a multivariate LSTM network over the testing set with historical data
21 Bias depiction of the LSTM network
22 Change in the loss function of an LSTM network over 250 epochs
23 Example of a naive ARIMA forecast
24 Showcase of the ensemble learner making predictions on the testing set
25 Predictions of the neural network on an independent data set


List of Tables

1 Summary of algorithm performance without historical data
2 Summary of algorithm performance with historical data
3 Best performances of optimization steps, with historical data excluded, measured in RMSE
4 Best performances of optimization steps, with historical data included, measured in RMSE
5 Persistence forecasts compared to trained models
6 Predicted changes in trends by each model, measured in number of occurrences, without historical data
7 Predicted changes in trends by each model, measured in number of occurrences, with historical data
8 Performance of the ensemble learner without historical data
9 Performance of the ensemble learner with historical data
10 Predicted trend changes from the ensemble learner, depending on the presence of historical data
11 Predicted trend changes of the neural network using independent data, with and without historical data
12 Performance differences in percentage between Bayesian optimization and grid search on both historical and non-historical datasets


1 Introduction

This chapter discusses the background knowledge of alcohol substance abuse, data-driven science, and time series forecasting with respect to machine learning, as well as providing the topical case from which this research is built. In addition, it aims to define the problem area, subsequent research questions, and delimitations pertaining to this study.

1.1 Background

One of the driving factors behind data-driven science is the potential ability to predict future trends and changes in data [6]. The concept of extrapolating data, commonly referred to as time series forecasting, has long been present in classical statistical modeling, with a plethora of methods available [22] and varying degrees of success. For example, successful applications of time series forecasting have been used to predict the rise and fall of various stock markets [7], company bankruptcies [2], or, in the field of medicine, potential prognoses of cancer patients [5].

The task of time series forecasting can range from trivial to seemingly impossible depending on the data available; any patterns, periodicities, and possible regularities a researcher may detect in the available sample data could be obfuscated by noise, errors, and so on [6]. As a consequence, traditional statistical techniques may prove difficult to apply to certain types of data sets that are, for example, heavily non-linear, periodic only during certain intervals, or that cannot be modelled with simple equations.

One area that has shown promise in time series forecasting is machine learning [4]. Rather than manually testing, adjusting, and re-testing equations in order to look for a line of best fit, as is the convention in classical statistics, we can provide an algorithm, or a set of algorithms, rooted in statistics and probabilistic theory, with a collection of training data. The algorithm can subsequently 'learn' from the data set, adjusting on its own the possible lines of best fit on a case-by-case basis [14]. This could help shift the burden of analyzing and interpreting the data from the researcher onto the computer. With the advent of deep learning and increased processing power, a door has been opened to tackle more complex forecasting problems, explained further in the theoretical framework of this paper. However, applications of time series forecasting in machine learning are highly domain-specific, and require careful data transformations and considerations as to what type of behaviour the model should expect, making it unclear how any one type of model can be reapplied in the future to a different domain.

To provide a practical example, an organization involved in the treatment of alcohol abuse patients could track the performance of its patients on a daily basis: how well they are meeting their goals, how often they have a drink on the weekend, and so on. This data is arranged sequentially, and a model could be used to predict the performance x steps into the future. Making such a prediction by hand is laborious at best, given that one has to account for everything from salary day to weekend drinking to bouts of depression. However, providing this data to a machine could ease the burden tremendously. Given that machine learning has been applied to forecast other types of behaviours and trend changes, it follows that a careful application of those routines and appropriate transformations should make it possible to forecast trend changes in alcohol abuse behaviour.

1.2 Problem area

Consider the case of the alcohol abuse patients given above. It is not immediately apparent how one would go about applying machine learning processes to such a scenario, and even less so with the specificities of time series forecasting. Indeed, the area is far from well investigated, with very few, if any, sources attempting to forecast such behaviour, despite the prevalence of alcohol abuse in modern countries. For example, there are no rules of thumb concerning methodology and theories for applying machine learning in this type of patient care, leaving a significant knowledge gap that could be filled to provide better patient care. Hence, any heuristics developed in machine learning for making forecasting decisions in a more general sense have not been proven applicable in this particular domain.

In a broader scope, though machine learning can prove a powerful tool for analyzing and forecasting a particular data set, it is not always apparent which machine learning model can best approximate a time series. Conventional decision-making tends towards a trial-and-error approach, where multiple algorithms are trained on the same data set until a breakthrough is achieved and a sufficiently accurate prediction can be made [24]. To make matters worse, an algorithm that may intuitively appear to work rather well with a particular data set could prove a misfit, and it is only by attempting to fit the data to the model that one can realize this.

Further problems arise as additional variables are included in a data set. A time series is commonly thought of as a set of values matched along time, such that one value corresponds to one moment in time. However, in actuality any one data point may be a composite of multiple variables, resulting in a multivariate time series. These are increasingly complex and difficult to map, and there is little research showcasing algorithm performance on them.

There are additional unclarities in the development of forecasting algorithms. For example, a common question that machine learning scientists are faced with while developing an artificial neural network concerns the number of hidden layers that ought to be included. However, there is little consensus beyond a trial-and-error approach that depends on the data set the network is intended to be trained on. Being able to accurately tune these parameters is important, as data scientists may often experience inaccurate results owing to rather simple parameter tuning errors [39].

It is not unreasonable, then, to assume that there is a substantial knowledge gap in the development of heuristics for machine learning applications in a specialized field such as that of alcohol abuse treatment. It also follows that there is additional room to investigate how models interact with unknown data series to better understand the performance of machine learning algorithms in forecasting multivariate time series.

1.3 Research questions and purpose

Given the obscure and impenetrable nature of machine learning, its application, and the subsequent analysis and ability to draw conclusions from it, it follows that there are substantial difficulties in applying it to a field to which so little research has been devoted (as detailed in the related work section), given a lack of standards and best practices.

The aim of this research is therefore two-fold; it first aims to examine correlations between alcohol abuse patients' history and their likelihood of relapsing by applying different machine learning models to such unexamined territory. It also seeks to investigate and explore how different machine learning and deep learning algorithms perform on high-variance, multivariate time series. The following questions are posed to that end:

1. To what extent do machine- and deep learning models accurately forecast relapses of alcohol abuse?

2. If their accuracy is sufficiently high, to what extent do those accuracies differ?

3. How do optimization techniques differ in terms of performance on multivariate regression problems?

The questions are to be answered with regard to the root mean square error (RMSE) of each model's predictions. The highly case-specific nature of the RMSE is explained in the methodology, and definitions of the models and optimizations being tested are found in the theoretical framework.

1.4 Delimitations

Certain delimitations have been made in order to focus the scope of the research. The first is in the data selection; the training, test, and validation data are taken from the same source: patient data from an Uppsala-based eHealth company called Kontigo Care AB. This data is further discussed in the methodology, and is the core of answering the first research question. Additional time series that could also appear asynchronous are not being considered at this stage. In addition, the paper is limited by the choice of machine learning algorithms being trained on the data. They were selected in a literature review detailed in the following section, based on their promise as forecasting tools. However, it should be noted that there are additional algorithms that can be trained for regression tasks as well, and these could be included in future research. In addition, the algorithms being developed and tested here are primarily supervised learning tasks. Finally, the research considers primarily only Kontigo Care's patient data as input sources, and other data sources could prove useful in better generalizing eventual conclusions (though an independent dataset unrelated to the domain at hand is used to support or reject generalizability).

In addition, delimitations in the approach are made concerning the subject matter of alcohol abuse; the aim of the project is to investigate the application of machine learning to such a field, rather than examining its intricacies (in other words, considering the field of alcohol abuse and the subsequent patient care in its own context is outside the scope of this paper; it merely concerns the application of treated data), and as such, the theoretical scope of this paper is focused primarily on making predictions on datasets that have already undergone some form of context-sensitive preprocessing and treatment.


2 Related work

Using machine learning at large to solve time series forecasting has been tried in several different ways, ranging from linear models to highly complex multivariate, non-linear models. Modern research in the area has tended away from conventional and relatively simple models in favour of deep learning models, with LSTM (long short-term memory) networks leading the charge. This section provides a quick overview of machine learning, how it has been applied in the specific domain of substance abuse, the current landscape of time series forecasting, and their related optimization techniques.

2.1 Machine learning

The field of machine learning has seen a significant amount of recent research invested into image recognition and classification, with convolutional neural networks being a key ingredient. However, recurrent neural networks - a relatively new addition to the field of machine learning - have enjoyed some attention due to their success in dealing with natural languages. Owing to the development of the LSTM cell (discussed further in Section 3.4.2), recurrent networks saw a significant boost in popularity, seeing usage in word prediction software used in smartphones, for example.

One key area of progress made in machine learning in recent times is the transition from conventional machine learning to deep learning, and as such, modern research tends to reflect this. A complex deep learning model developed at Oxford [43] to solve financial data problems was able to approach these with greater success rates than when applying conventional finance theory.

The embrace of deep learning has opened doors to further development in multivariate time series forecasting; as deep learning models can solve exponentially complex problems, it follows that they should also be able to handle the exponentially large amounts of data that come with increasing dimensionality, as is so often the case.

2.2 Substance abuse and machine learning

As previously established, pickings are slim with regard to the application of machine learning to substance abuse patients. Indeed, only a handful of papers appear to have tackled the issue, such as [49], which developed a classification-based approach (refer to the theoretical framework) to identifying relapses during ongoing patient treatment. They compared two models, a decision tree classifier and a Bayesian network, on 73 in-treatment patients, mapping outcomes as either successful or relapsing, with both models yielding a success rate higher than 50% (in other words, they outperformed random predictions). However, the input data was limited in comparison to that of the models this paper intends to build. It follows, then, that there is evidence to suggest that treating such a problem in a regression-based manner could succeed.


2.3 Regression and time series forecasting

Modern research concerning time series forecasting can primarily be found in the field of machine learning; a paper by [41], published in 2018, details the development of an LSTM network capable of not only forecasting future time steps, but also performing anomaly detection on that time series. In other words, it can forecast a series and anticipate when aberrant behaviour will occur x time steps in the future.

Similarly, [42] used a recurrent ARIMA-like neural network to forecast several types of time series, ranging from Turkish stock prices to Australian beer consumption, producing a non-linear auto-regressive function as its output. This type of hybrid network has been developed in previous examples as well; [23] tested a case using a feed-forward neural network and an ARIMA model in tandem to forecast stock markets, with results surpassing either model alone. There is evidence to suggest that hybrid ensemble models such as these can have improved performance on time series forecasting.

2.4 Optimization processes

Research in techniques used to optimize machine learning algorithms and determine their hyperparameters (i.e. model parameters set by the developer and not trained by the model itself) is rather limited; to a large extent, finding sufficiently adequate solutions is done through a trial-and-error process, using rules of thumb to select hyperparameters for the algorithm in question.

Consensus appears to be that the de facto method for optimization, the grid search, may in the future be replaced by Bayesian optimization. However, it too suffers from notable drawbacks. The key behind this method is the acquisition function, which suffers from increasing costs as data dimensionality increases. [43] details the usage of dropout to optimize only a subset of the data variables at each step to mitigate the costly effects of increased dimensionality.

One may be surprised to read that one method currently used for testing and setting optimal hyperparameters is random search, discussed further in the theoretical framework; [15] concluded that randomly setting parameters, rather than considering the entire grid of possible settings, was more efficient than competing approaches, though this comparison did not consider Gaussian processes such as Bayesian optimization.

Given the remaining unclarities in performance differences, it stands to reason that more tests ought to be conducted to further show where these methods differ and where they do not.


3 Theoretical framework

This section is devoted to explaining the core concepts of time series analysis and forecasting, as well as the machine learning algorithms being tested in this research. In addition, the development tools Python and R, as well as the machine learning framework TensorFlow, are discussed in this section.

3.1 Data science

The field of data science (sometimes referred to as data-driven science), though having grown immensely in relation to the advancing speed of computers, is still rather diffuse [19], and it may not be inherently obvious what it is, how it is executed, and what its objectives are. However, for the purposes of this research, it may be defined rather simply as the field of automated data analysis for the purposes of decision-making, falling closely in line with data-driven decision making [17]. In a broader scope, it is a discipline that shares certain aspects with business analytics and data mining, and has become extraordinarily popular with large-scale organizations in the last decade [11].

It is important at this stage to distinguish that there is a difference between data science and machine learning, lest there be any confusion; though the latter is commonly considered a part of data science, there are several techniques and branches of data science not covered by machine learning [13]. For example, data visualization and business intelligence, often considered core parts of data science, are not inherently related to machine learning [13].

3.1.1 Time series forecasting

With the advent of big data (and hence, the quality of the data) and increased processing power, the field of time series analysis has seen further development in machine learning and data science, as calculations that were computationally complex 20 years ago are now trivial by comparison. We commonly view a time series as a set of data points, sequentially ordered with a time stamp. This is conventionally visualized by a time graph [18].

However, there is another way time series can be represented. Given that each data entry consists of some value x and some time stamp y, a single time series can be represented by a 2-dimensional matrix with respect to x and y. In fact, even if a data entry contains multiple values, it can simply be represented by a matrix, or tensor, of that size. For example, a data entry with three values, x, y, and z, will have a dimensionality of 3. This can be scaled to any arbitrary size; if the data entries have n points, they can simply be represented by an n-dimensional matrix.

Being able to consider time series as matrices rather than graphs will prove useful, as machine learning algorithms commonly, and almost exclusively, consider only tensors as inputs, which can conveniently be represented as matrices for the purposes of machine learning [24].
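To make the matrix view concrete, the following is a minimal NumPy sketch of a toy multivariate series arranged one time step per row, together with input/target pairs for a one-step-ahead forecast; the values, the three-variable layout, and the choice of target variable are purely illustrative and not taken from the thesis data.

```python
import numpy as np

# Toy multivariate time series: one row per time step,
# one column per variable (dimensionality 3 in this example).
series = np.array([
    [0.2, 1.0, 5.0],
    [0.3, 0.9, 5.2],
    [0.5, 1.1, 5.1],
    [0.4, 1.3, 5.4],
    [0.6, 1.2, 5.3],
])

# Build a supervised set for one-step-ahead forecasting:
# inputs are the observations at time t, targets are the first
# variable at time t+1 (which variable is forecast is illustrative).
X = series[:-1]          # shape (4, 3): rows t = 0..3
y = series[1:, 0]        # shape (4,):  first variable at t = 1..4

print(X.shape, y.shape)  # (4, 3) (4,)
```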


3.2 Machine learning

A machine learning algorithm is, simply put, a set of input-output computer instructions with the ability to self-modify its behaviour such that its outputs better match the values the user expects. In essence, such algorithms can learn about the input data, on which they are repeatedly trained with the intention of discovering trends, movements, and so forth. We also define a model as a machine learning algorithm that has undergone or is undergoing the training process.

Machine learning tasks are traditionally defined in two ways: as supervised, and as unsupervised. In the former, algorithms are fed a set of inputs and outputs, and are left to discover a pattern or rule that links the two. In unsupervised tasks, however, no outputs are provided; the algorithm is left entirely to its own devices to discover appropriate trends [4]. As a practical example, a supervised learning task could be to classify car chassis variants based on whether they are sedans, station wagons, or hatchbacks. If it were an unsupervised task, actual knowledge of the chassis would be withheld, allowing for more unpredictable, yet still potentially interesting, results.

In addition, machine learning can be categorized based on application usage [4]. The algorithms can be trained for classification, in which the objective is to correctly identify which class an entry belongs to - such as in the example with the car chassis - or regression, where the goal of the model is to predict a continuous, rather than discrete, value [23], as is the objective in this research.

3.2.1 Deep learning

An important distinction needs to be made going forward: that between machine learning and deep learning. We can consider machine learning algorithms in terms of layers, where conventional algorithms have three layers: an input layer, an output layer, and one layer which does calculations and partakes in the actual learning process [13]. However, we can add additional, hidden layers between our inputs and outputs to create more complex algorithms. This is commonly referred to as deep learning, a branch of machine learning [13]. In other words, all deep learning algorithms are machine learning algorithms, but not all machine learning algorithms are deep learning algorithms.

Out of the algorithms described below, ARIMA regression models, support vector machines, boosted decision trees, and random forests are traditional machine learning methods, while the two types of neural networks are deep learning methods. As such, this research employs classical machine learning as well as the more modern deep learning algorithms.

3.2.2 Machine training process

In order to make accurate predictions, a model needs to be trained on some data. This section describes the process through which a supervised learning task is carried out. Assuming the data has been collected and prepared such that a tensor containing the necessary entries is available, the entries are divided into three separate sets: a training set, a validation set, and a testing set. The training set and validation set are used in conjunction during the training process, and when the model is completed, it is tested against the testing set [6].

The training process for a supervised learning task is commonly defined in terms of epochs; each epoch is a run through a selection of the training data, after which the model is allowed to make predictions on a portion of the validation set. From this, we calculate the error margin of the predictions, and the model adjusts accordingly. Once all epochs have been run and the training data exhausted, the model is subjected to the testing data, comparing its predictions to the real values of the test data [7]. This is, in effect, the accuracy of the model.
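As a minimal sketch of this split, the snippet below carves dummy data into training, validation, and test sets with scikit-learn; the 70/15/15 proportions and the decision not to shuffle the sequentially ordered entries are illustrative assumptions, not the setup used in this research.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data standing in for the prepared tensor of entries.
X = np.random.rand(200, 3)
y = np.random.rand(200)

# First carve off a held-out test set, then split the remainder into
# training and validation data; shuffle=False preserves time order.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, shuffle=False)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, shuffle=False)

print(len(X_train), len(X_val), len(X_test))
```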

3.3 Classical machine learning algorithms

This research aims to test two types of models: classical machine learning models and deep learning ones. The classical models used in the research are described below. It must be mentioned that decision trees are included for a pedagogical purpose, as a part of the random forest, and not as an independent model.

3.3.1 Support vector machines

Support vector machines (SVMs) are a traditional machine learning algorithm that sees a high level of usage to this day, despite the prevalence of deep learning models, courtesy of their speed and high level of accuracy, and they are able to solve both binary and non-binary classification problems. Moreover, modified versions of support vector machines designed to tackle regression problems exist, justifying their inclusion in this paper.

A support vector machine operates by mapping the data set onto a hyperplane whose dimension $R$ corresponds to the number of data set features $w_1, w_2, \dots, w_{n-1}, w_n$. It then places another hyperplane of dimension $R-1$ such that it separates the different classes in the data set. For example, if the data set has two features, a simple line or vector is enough to divide the feature set. However, if the data set has three features, a 2-dimensional plane is necessary to fully separate the classes. In addition, two vectors accompany the separating hyperplane under two-fold conditions: they must lie parallel to the separating hyperplane, and they must intersect the data points closest to the separating hyperplane on either side of it. See Figure 1 for a visual representation of a support vector machine.


Figure 1: A support vector machine with a linear kernel separating two dummy classes [31]

In Figure 1, the line bifurcating the two classes is entirely straight; this is a linear and binary classification problem, and is relatively simple to solve. However, not all problems are this trivial, and using a single vector may not always be appropriate. Support vector machines take advantage of a kernel, which allows the algorithm to map the problem in different dimensions. A non-linear kernel, such as the radial basis function kernel, maps a 2-dimensional problem onto a 3-dimensional space with a plane separating the classes. When the solution is restated in two dimensions, the line of separation appears non-linear; in the case of Figure 2, it takes the form of an oblong. This can be done for any arbitrary number of dimensions, but is easiest to conceptualize in the shift from two to three dimensions.


Figure 2: Example of a non-linear kernel in a support vector machine. [31]

A support vector machine may seem apt for classification problems, and indeed it is, but it also has plenty of usage in regression-based problems, in which case it is referred to as a support vector regressor, as it will be in this paper from here on.
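A minimal scikit-learn sketch of such a support vector regressor follows; the RBF kernel and the hyperparameters C and epsilon are illustrative defaults, not the values tuned in this research.

```python
import numpy as np
from sklearn.svm import SVR

# Dummy regression data standing in for the preprocessed time series.
X = np.random.rand(100, 3)
y = np.random.rand(100)

# A support vector regressor with a radial basis function kernel.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
svr.fit(X, y)
y_pred = svr.predict(X)
```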

3.3.2 Decision trees

Decision trees, arguably most well known for their usage in classification problems, are conventional decision support tools mapping outcomes, costs, and such onto a tree-like structure where each path represents a set of circumstances [33]. These have traditionally been constructed by hand, but have found a resurgence in machine learning with the development of models that can automatically generate decision trees.

Decision trees are commonly developed using the ID3 algorithm, which, given a set of attributes S, calculates an information gain IG(S) for each attribute, followed by a selection of the attribute with the largest information gain. S is split by the selected attribute, and the process is repeated for each attribute in each split until the decision tree is finished [40].

Information gain in a decision tree, for any given attribute, is defined as

$$IG(S) = P(n)\log_2 P(n) - P(m)\log_2 P(m) \tag{1}$$

where $P(n)$ is the proportion of positive attribute examples in the data set, and $P(m)$ is the proportion of negative examples in the same data set [40].
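The computation can be sketched as below, using the conventional entropy-based form of ID3 information gain (the entropy of the parent node, with the usual negative signs, minus the weighted entropy of its children); the binary labels and the split are hypothetical and only meant to illustrate the idea.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a binary label array (conventional negative-sign form)."""
    p_pos = np.mean(labels)
    p_neg = 1.0 - p_pos
    terms = [p * np.log2(p) for p in (p_pos, p_neg) if p > 0]
    return -sum(terms)

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical binary outcomes split on some attribute.
parent = np.array([1, 1, 1, 0, 0, 0, 1, 0])
left, right = parent[:4], parent[4:]
print(information_gain(parent, left, right))
```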

Decision trees come with several advantages over other machine learning algorithms. In simple cases, they are easy to follow and understand, and have substantial explanatory power; due to their visual and intuitive nature, any decision, even in more complex trees, can be justified by following the generated graphs to completion. In addition, due to the simple method of generating decision trees, decision tree learning is quick and easy compared to more complex, deep learning algorithms [4].


Figure 3: Example of a decision tree. [31]

However, given the complexity of the data at hand, and its high variance, decision trees are given weight only as building blocks for the random forest algorithm, rather than as an independent model unto themselves.

3.3.3 Random forest algorithm

The random forest algorithm is one of many ensemble algorithms, which are defined by being an aggregate of classifiers. The premise of ensemble algorithms is that even if each constituent classifier can correctly identify an attribute that correlates with a positive example, it alone is still not sufficient. However, by aggregating multiple weak classifiers, a single, strong prediction can emerge. A random forest is an ensemble of decision trees, each operating independently and making a prediction as to where an example data entry belongs. The forest aggregates the results and chooses the strongest prediction [26]. The random forest algorithm, suitable for both classification and regression purposes, is described by [26] as follows:

1. Draw n bootstrap samples from the data set.

2. For each sample, grow an unpruned decision tree such that each node is split using a random selection of k features from the data set, selecting the best split based on those.

3. Each tree will generate a prediction, and the forest algorithm will select the most predicted (i.e. strongest) prediction.


Though random forests are more complex to understand than single decision trees, studies [28] appear to indicate that they show more promise in solving regression problems with higher levels of accuracy.
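A minimal scikit-learn sketch of the algorithm is given below; note that for regression the forest averages the individual tree outputs rather than taking a majority vote, and the number of trees shown is illustrative rather than the value tuned in this research.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Dummy data standing in for the preprocessed time series features.
X = np.random.rand(100, 3)
y = np.random.rand(100)

# An ensemble of bootstrapped regression trees; predictions are the
# average of the individual tree outputs.
forest = RandomForestRegressor(n_estimators=100, bootstrap=True)
forest.fit(X, y)
y_pred = forest.predict(X)
```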

3.4 Deep learning models

The previously discussed models are classical machine learning, with a single step between inputs and outputs; in addition, their training processes are quite brief. However, the following two models are considered deep learning, with multiple hidden layers and steps, and far longer training processes.

3.4.1 Multi-layer feed-forward neural network

The feed-forward neural network (referred to on occasion in this paper as ANN) is the earliest and simplest version of a neural network, and is the foundation for the other types of neural networks discussed in this research. It is simply defined as an input layer, an output layer, and several hidden layers. Each layer consists of multiple artificial neurons, which are tasked with feeding data forward to the next layer. The number of neurons in the input layer must correspond to the expected size of the input tensor, while the number of neurons in the output layer ought to correspond to the expected number of outcomes [30]. It is worth noting that the input data for a feed-forward neural network consists of a single 1-dimensional tensor. For example, if one wishes to process an image with the proportions 28 × 28, 784 input nodes with 784 corresponding weights are required; it is easy to see how this can spiral out of control as image sizes grow.

Visually, a neural network could be represented, for example, as in Figure 4. Formally, there are no connections between neurons in the same layer, while between layers, neurons are fully connected.

Figure 4: Example of a simple feed-forward neural network with one hidden layer. [30]

Each node in all the layers of the network represents an artificial neuron, a mathematical model intended to emulate the role of a neuron in a physical brain. Each neuron consists of a set of inputs, some type of activation or transfer function, and an output [30]. The inputs are passed from neurons in previous layers of the network with some type of weight attached to them. In addition, each neuron contains a single variable referred to as the threshold or bias. The end result is an activation or output, which is passed on to the next hidden layer, and eventually the output layer of the neural network.

Figure 5: Example of an artificial neuron. [30]

Figure 5 showcases the components of an artificial neuron in a feed-forward neural network. Inputs $x_i$ are weighted by $w_i$ before being fed into a transfer function, with the intention of introducing non-linearity to the system. The three most common types are the sigmoid, tanh, and ReLU functions (the latter of which has gained in popularity due to the increasing sizes of multi-layer neural networks), described in equations (2), (3), and (4). In the context of neural networks and their activation functions, the weighted input sum $\sum_n x_n w_{nj}$ is denoted by $z$.

$$\mathrm{Sig}(z) = \frac{1}{1 + e^{-z}} \tag{2}$$

$$\tanh(z) = \frac{\sinh(z)}{\cosh(z)} \tag{3}$$

$$\mathrm{ReLU}(z) = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{if } z \geq 0 \end{cases} \tag{4}$$
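The three activation functions in equations (2)-(4) can be written directly in NumPy; the sketch below is purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # equation (2)

def tanh(z):
    return np.sinh(z) / np.cosh(z)    # equation (3); equivalent to np.tanh(z)

def relu(z):
    return np.where(z < 0, 0.0, z)    # equation (4)

z = np.linspace(-2, 2, 5)
print(sigmoid(z), tanh(z), relu(z))
```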


In supervised learning, neural network performance is improved through a backpropagation algorithm [30]. An error function is defined, which estimates the margin of error between the neural network's outputs and those provided by the 'teacher', accounting for the weights and biases currently in place [16].

The error function of a neural network is conventionally described with the mean square error of the outputs [16],

$$E(X, \theta_t) = \frac{1}{2N} \sum_{i=1}^{N} (y'_i - y_i)^2 \tag{5}$$

where $y'_i$ is the actual output and $y_i$ is the predicted output made by the neural network. The parameters (weights and biases) for any one iteration can be calculated by

$$\theta_{t+1} = \theta_t - \alpha \frac{\partial E(X, \theta_t)}{\partial \theta} \tag{6}$$

$\theta_t$ denotes the network parameters at iteration $t$, while $\alpha$ is the learning rate of the network; learning rates are set independently by the researcher, and constitute one of the hyperparameters that allow for fine-tuning of a system. A high learning rate implies an algorithm that learns quickly, but is prone to inaccuracies in the process. A lower learning rate, however, takes more time and is more likely to fall into a local minimum, but will yield more accurate results. The dependency on past iterations is clear; it follows that if $\theta_{t+1}$ can be calculated given $\theta_t$ and $\theta_{t-1}$, then $\theta_t$ can be calculated given $\theta_{t-1}$ and $\theta_{t-2}$. This process is repeated until the initial iteration is reached. The partial derivatives can then be calculated using the chain rule. We refer to this entire process of estimating the error and optimizing our parameters in response as gradient descent [30]. In truth, this is a simplification of the problem at large, but for the purposes of this research, this explanation is sufficient. The following is a step-by-step explanation of how backpropagation is used to correct a neural network:

1. A feed-forward neural network receives training data as its input, producing a set of outputs.

2. The error function is applied to calculate to what extent the model was inaccurate.

3. Partial derivatives are calculated with respect to model parameters.

4. Model parameters are adjusted via gradient descent, with consideration for the learning rate $\alpha$.
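As a minimal illustration of steps 1-4, the Keras sketch below fits a one-hidden-layer feed-forward network with a mean squared error loss (cf. equation (5)) and plain stochastic gradient descent; the layer sizes, optimizer, and epoch count are assumptions made for illustration, not the architecture developed in this research.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Dummy data standing in for the prepared training and validation tensors.
X_train, y_train = np.random.rand(100, 3), np.random.rand(100)
X_val, y_val = np.random.rand(20, 3), np.random.rand(20)

# One hidden layer with a ReLU activation and a linear output for regression.
model = Sequential([
    Dense(16, activation="relu", input_shape=(3,)),
    Dense(1),
])

# Mean squared error as the loss and stochastic gradient descent as the
# optimizer; each epoch runs the backpropagation procedure described above.
model.compile(optimizer="sgd", loss="mse")
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50, verbose=0)
```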

3.4.2 Recurrent neural network and LSTMs

Recurrent neural networks are another branch off the traditional feed-forward networks; whereas classical networks take only the latest data values as their inputs, recurrent neural networks are also able to recall their own previous states and use them to make a prediction. A recurrent neural network continuously samples new data examples while storing its past states in a context unit, leaving them available for future predictions. In other words, a prediction $y_t$ is influenced by the prediction $y_{t-1}$ [45]. Much as in the case of backpropagation, given that $y_t$ is influenced by $y_{t-1}$, it follows that $y_{t-1}$ is influenced by $y_{t-2}$, and thus $y_t$ must be influenced by $y_{t-2}$. It is easy to see, then, that the prediction $y_t$ is influenced not just by the last prediction, but by other preceding predictions as well [34]. Figure 6 illustrates this dependency on previous predictions.

Figure 6: An illustration of how past states are passed through a recurrent neural network [45]

For example, one area of usage for recurrent neural networks is in automated sentence construction. A sentence is, at its simplest, a sequential order of words; a recurrent network can backtrack through the sentence, discover all the states it could take on a case-by-case basis, and successfully predict what the next word in that sentence will be. This type of application has seen usage in several fields, from novelties such as automated clickbait headline generation to more practical usages such as machine-written news articles [45]. It has also seen success in forecasting areas [36], making it a suitable inclusion for this research.

Much like other neural networks, recurrent networks take advantage of backpropagation to optimize and tune their weights and biases. However, unlike for a feed-forward neural network, a backpropagation algorithm for a recurrent neural network needs to account for time. This is referred to as backpropagation through time (BPTT) [45]. Conceptually, it is not much different from traditional backpropagation; however, instead of just propagating back through the nodes of the current network, the algorithm propagates through each time step, each containing a copy of the network at that particular step. Referring back to Figure 6, it is clear how b2 can be calculated as a composite function of the network's past states.

However, a problem can occur with recurrent neural networks as they increase in size (and time); due to the repeated multiplication operations executed in the backpropagation step, the further back in time the backpropagation algorithm traverses, the more flattened the gradients become, caused by the continuous application of activation functions. Eventually, it becomes nigh impossible to detect any change in slope [45]. This is known as the vanishing gradient problem.

Figure 7: A single LSTM containing a cell, an input gate, an output gate, and a forget gate [45]

One solution to the vanishing gradient problem appears in the form of long short-term memory units (LSTMs). Rather than deriving past errors each time the backpropagation algorithm traverses past network states, LSTMs store errors over some time period [44]. Each LSTM is composed of four features: a cell, an input gate, an output gate, and a forget gate. Each feature can be thought of as a classic artificial neuron with its own biases and weights; training an LSTM network adjusts these weights such that each LSTM will hold onto a certain 'memory' accordingly [45]. Hence, modern recurrent neural networks are, almost in their entirety and with very limited exceptions, constructed with LSTM cells as their foundation.
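A minimal Keras sketch of such an LSTM regressor is shown below; the window length, layer size, and optimizer are illustrative assumptions, not the configuration used in this research.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Dummy data: 100 samples, each a window of 5 past time steps with 3 features.
X_train = np.random.rand(100, 5, 3)
y_train = np.random.rand(100)

# A single LSTM layer followed by a linear output for one-step-ahead regression.
model = Sequential([
    LSTM(32, input_shape=(5, 3)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=50, verbose=0)
```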

3.5 Persistence forecasts

In addition to the tested models, this research includes two classical persistence forecasts to verify the validity of the resulting model predictions: a simple ARIMA process, and a biased random forecast.

3.5.1 ARIMA regression models

The autoregressive integrated moving average (ARIMA) is a traditional time series algorithm designed to be fitted to a time series for both analytical and forecasting purposes [23]. It is a generalized form of the ARMA forecasting model in which data is assumed to be non-stationary. ARIMA is further subdivided into seasonal and non-seasonal models; which one is used depends on the data being fitted. For example, anticipating that there is some type of seasonality or periodicity in the data may signify that the seasonal ARIMA is a more suitable choice than the non-seasonal equivalent [23].

It is a necessary condition that a data set intended to be used with any ARMA model is stationary, meaning that it has a constant mean and variance, regardless of time. ARIMA models can deal with this issue through an approach called differencing, in which data values are replaced with the difference between two consecutive values. As such, any data point $y_t$ in the data set is replaced:

$$y'_t = y_t - y_{t-1} \tag{7}$$

If the model data is seasonal, seasonal differencing is applied, where m represents the number of seasons in the data set:

$$y'_t = y_t - y_{t-m} \tag{8}$$
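The differencing operations in equations (7) and (8) can be sketched in a few lines of NumPy; the series values and the number of seasons m are hypothetical.

```python
import numpy as np

y = np.array([5.0, 5.2, 5.1, 5.4, 5.3, 5.6])

first_diff = y[1:] - y[:-1]      # equation (7): y'_t = y_t - y_{t-1}
m = 2                            # hypothetical number of seasons
seasonal_diff = y[m:] - y[:-m]   # equation (8): y'_t = y_t - y_{t-m}
```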

Repeating this process for all data points provides a stationary series with a constant mean and variance. [23] provides a formalized definition of a prediction based on an ARIMA model as:

$$y_t = \theta_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_n y_{t-n} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \dots - \theta_n \varepsilon_{t-n} \tag{9}$$

$y_t$ is defined as the data value at any given time $t$, while $\theta_i$ and $\phi_j$ are model parameters. In addition, model errors are accounted for by the variable $\varepsilon$.


Hence, it follows that, provided the correct parameters are found and the underlying assumptions of ARIMA fit the data, $y_t$ can be correctly predicted. In essence, this can be interpreted as a function where a predicted value is a weighted sum of previous values, accounting for model errors.

For extremely high-variance time series that are prone to constant fluctuations, such as stock markets, a commonly accepted ARIMA approach is a naive ARIMA, based on a random walk, where the best-fitted model is described as:

$$y_t = y_{t-1} + \varepsilon_t \tag{10}$$

$\varepsilon_t$ is some noise at time $t$; for very small values of $\varepsilon$, the forecast becomes entirely naive as the value approaches 0, where:

$$y_t = y_{t-1} \tag{11}$$

We consider this a naive ARIMA forecast, and a good approximation of the currently accepted dominant linear model for high-variance forecasting [23]. Hence, the naive ARIMA is used as the first persistence forecast in this research.

3.5.2 Random biased forecast

Another way of developing a persistence forecast is to randomly predict a value within the range of possible outputs. It follows that a model which can produce better estimates than such a random predictor must have found some, albeit weak, correlation between the inputs and the outputs it is mapped to.

However, merely forecasting any value within the maximum range is a notably weak baseline, as certain values will obviously never be forecast by a model with access to historical data. This research instead considers a biased forecast that builds on the last observed value with added noise within some range [a, b].

This random biased forecast is defined as follows:

y'_t = y_{t-1} + \mathrm{random}(a, b)    (12)

Here a and b are simply the lower and upper bounds of the random noise in the forecast. A model that can consistently produce better results than such a forecast can plausibly be argued to make valid forecasts. The approach may seem rather similar to the naive ARIMA previously described, and indeed it is, as outputs change in a similar fashion over time. However, the key difference lies in the size of the added noise; a naive ARIMA adds only a small noise corresponding to likely movements in the data, while the random bias defined here corresponds to the maximum possible change in the forecasts.
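The random biased forecast of Equation 12 can be sketched in the same style; the bounds of ±21 are an illustrative choice taken from the maximum per-step AMI change reported later in the paper, and the series values are again invented.

import numpy as np

rng = np.random.default_rng(seed=0)

def random_biased_forecast(series, a, b):
    """Biased persistence: y'_t = y_{t-1} + random(a, b) (Equation 12)."""
    noise = rng.uniform(a, b, size=series.size - 1)
    return series[:-1] + noise

ami = np.array([80.0, 75.0, 77.0, 90.0, 88.0])  # illustrative series
print(random_biased_forecast(ami, a=-21, b=21))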

3.6 Development tools

This section contains a short explanation of the tools used and the required packages, as well as a justification of the choices made.


3.6.1 Python and TensorFlow

The main programming language used for model development in this research is Python, which has become a de facto standard language in machine learning research communities. It is a high-level, interpreted language with a strong emphasis on readability and understandability. In addition, its community-driven development model and heavily research-oriented user base were also beneficial factors. Other machine learning platforms, such as Microsoft's Azure ML, were dismissed due to an apparent lack of flexibility and customizability compared to Python. Its large selection of libraries and features that help streamline the development process, as well as its heavily machine learning-oriented toolkit, made Python the primary choice for this task.

Though machine learning, and especially deep learning, has traditionally been a rather difficult subject to break into, recent advances have flattened the learning curve significantly. On February 11, 2017, Google released version 1.0 of its Python-based software library TensorFlow [48], which it had open-sourced in late 2015. TensorFlow has several uses, but its primary domain is machine learning, where it simplifies the development process. In addition, this research employs Keras, a neural network library implemented on top of TensorFlow, which streamlines the process even further.

Additional Python packages used in this project are scikit-learn, NumPy, pandas, and Matplotlib. scikit-learn includes several tools for streamlining data treatment in machine learning, while the others are used to simplify tool development.

3.6.2 R - statistical programming language

Python and TensorFlow are useful tools for developing deep learning algorithms and neural networks, but for the purposes of classical machine learning and other regression models, R is a powerful statistics-based programming language and is employed to test those. It is rooted in data mining and data analysis, making it a suitable tool for the purposes of this research, primarily for performing rudimentary data transformations to better fit the models.


4 Methodology

This section details the steps taken in order to answer the research question with respect to the models described previously. It covers the tools of analysis and data collection, as well as the training data provided for the purposes of the research.

4.1 Research paradigm and approach

The research is largely untouched by social behaviour and corresponding expectations; as such, it is easy to dismiss the relatively informal alternatives of interpretivism and critical research [21] (though it should be mentioned that, had the paper attempted to investigate patient history and predicted behaviour in a non-machine-learning process with data treatment included in the methodology, the approach chosen would be irrelevant). Instead, this project is formally positivist in its design and execution.

The strategy and approach considered most suitable for a project of this nature is a formal experiment; the environment and variables are entirely within the control of the developer, meaning that it is relatively trivial to account for external factors influencing the experiment, as supported by [21]. Another prospective strategy in the form of design science was considered but later dismissed, given the potential struggle of producing evidence that could help answer the research questions. One could also question whether the usage of a monolithic database pertaining to only one company would make the research more appropriate for a case study, and there is indeed a case to be made there. However, as described in the next section, the data in question has been shown in clinical trials to accurately account for alcohol abuse, which is enough of a foundation to suggest that forecasting changes in that data is sufficient to forecast changes in alcohol abuse, given a previously proven correlation.

Three null hypotheses are proposed for the experiment:

1. There is no correlation between a patient's historical data and forecasted AMI values that a machine learning model can infer.

2. Deep learning algorithms offer no significant advantage in time series forecasting compared to classical machine learning algorithms.

3. There is no noticeable difference in the performance of optimization techniques on machine learning regressors.

The research considers the specific case of Kontigo Care AB, an Uppsala-based eHealth company with a focus on addiction treatment. Their primary construct, the Addiction Monitoring Index (AMI), is examined and forecast, applying the common machine learning practices and rules of thumb previously discussed to that end. Developing a model that can successfully forecast this scenario could be used to test the null hypotheses.


4.2 Training and validation data

The data provided in this particular instance concerns their Previct Alcohol© product, which aids in the treatment of alcohol abuse patients, and consists of a large set of data entries on a patient-by-patient basis with high dimensionality in each data entry. The data is described in further detail in the clinical trials performed by Kontigo Care [8], but is briefly summarized here for the sake of convenience.

The data collection in the clinical trials was done through a breathalyzer-based trial in which 30 patients, based in Sweden, were provided with a tool for measuring their breath alcohol between two and four times per day, the resulting data being a blood alcohol content (BAC) value in permille. As a control variable, phosphatidylethanol (PEth) values were sampled in the clinical trials to evaluate the accuracy of the BAC values, as PEth is considered a strong indicator of alcohol consumption due to its long half-life, ranging from 4 to 12 days.

The clinical trials were used in the development of the so-called Addiction Monitoring Index (AMI), a time-based metric of how each patient is performing with respect to their individual goal of either staying completely sober or reducing overall alcohol consumption. The conclusion of the clinical trials was that AMI processes were advantageous in recovery monitoring; as such, this research makes the assumption that AMI is a valid predictor of blood alcohol content and that it can be representative of alcohol abuse patients' behaviour in relation to alcohol intake. In this research, AMI can be considered an output value ranging from 0 to 100, with a maximum change between time steps of 21.

With respect to the privacy concerns of Kontigo Care AB, other metrics or derived parameters included in the data are not discussed further in this paper. For now, the following parameters for any one data entry are explicitly considered:

1. The amount of blood alcohol content in permille (BAC)

2. The amount of phosphatidyl ethanol (PEth)

3. The Addiction Monitoring Index (AMI)

The exact distribution and split of training, validation, and test data remains to be seen, as it is part of the development process and is decided in the optimization process (admittedly, there is little research indicating correlations between train/test splits and model performance).

4.3 Feature selection and pre-processing

Given the high dimensionality of the training data, the step of pre-processing the data becomes increasingly crucial. This is important because adding too many parameters may result in trivial correlations that do not provide any useful predictions, and causes gradient descent to get stuck in local minima far too easily. Dimensionality reduction methods such as Principal Component Analysis (PCA) could be chosen [3], but it is not unheard of to apply rules of thumb in selecting features to be used in the training and validation process. One may choose to consult a domain expert in evaluating which features or parameters ought to be included or excluded in the training process. In this research, feature selection was done based on advice from Kontigo Care's previous data analysis of correlations within the data.

Each type of machine learning algorithm 'expects' the data to arrive in some particular input form. For an ANN, this is a 1-dimensional tensor, while a recurrent network expects a set of data entries forming a time series. In all forms of supervised learning, however, the data needs to be converted into a supervised learning problem, which involves specifying a set of inputs X_1, X_2, ..., X_n and some output Y. For example, predicting AMI could be done by specifying BAC and PEth as inputs X_1, X_2 and AMI as Y, and providing this as a supervised learning problem to the algorithms.

In order to account for patient history, time lag is employed; the experiment is done once with a time lag of 1 (in other words, providing inputs for day 1 to predict outputs of day 2), and once with a time lag of 7 (inputs for days 1-7 to predict the output of day 8).
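A sketch of how such a supervised reframing with lag can be carried out with pandas; the column names BAC, PEth, and AMI are illustrative stand-ins for the actual feature set, and n_lags = 7 would mirror the second experiment.

import pandas as pd

def make_supervised(df, n_lags):
    """Reframe a multivariate time series as a supervised learning problem.

    df:     one row per day, with columns such as 'BAC', 'PEth', 'AMI'.
    n_lags: number of past days used as inputs (7 in the second experiment).
    Returns inputs X (days t-n_lags+1 .. t) and target y (AMI on day t+1).
    """
    lagged = [df.shift(lag).add_suffix(f"_t-{lag}") for lag in range(n_lags - 1, -1, -1)]
    X = pd.concat(lagged, axis=1)
    y = df["AMI"].shift(-1).rename("AMI_next")   # tomorrow's AMI is the output Y
    joined = pd.concat([X, y], axis=1).dropna()  # drop rows left incomplete by shifting
    return joined.drop(columns="AMI_next"), joined["AMI_next"]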

Feature selection was, ultimately, not done on a model-by-model basis; rather, to ensure parity, all models were given the exact same data sets, transformed only marginally to suit each model's expected input and output formats, though the idea of considering model architecture when providing input data was originally intended.

4.4 Model development, training process, and data collection

Each model is developed and tested independently, but the process of data collection, once the algorithms have been developed and their stability tested, follows the same three-step approach for any algorithm:

1. Train the algorithm using the split training data.

2. Test the resulting model with respect to the unseen test data.

3. Perform optimizations and validations to examine algorithm performance.

For a classification problem, the key metric is accuracy, defined as the number of correct predictions relative to the number of incorrect predictions. However, with a regression problem, things operate slightly differently. If a predicted value is a handful of decimals off from the real measurement, it is technically incorrect, but it is still extremely close and a useful prediction nonetheless. Hence, for regression problems, the main measurement for estimating the performance of an algorithm is usually some type of loss function, expressed as how far a prediction is from the true value [2]. In this case, data collection is done through the root mean squared error (RMSE) of an algorithm. Data collection is also done by examining predicted trend changes in each model to examine their capabilities as forecasting tools.

In addition, each algorithm can be further tuned or tweaked through hyperparameter optimization to provide better estimates. Below are described some methods used in this research to tweak the model parameters on an individual basis.

Models were selected from a combination of a literature review of available models capable of solving regression problems with a machine learning approach and informal consultations with academics in the field, with the main criteria being that there had to be a mixture of deep learning and classical machine learning models, and that they were popular choices in the field; the four tested models fit the criteria, though there were notable exclusions.

4.4.1 Grid search

The grid search is arguably the most conventional form of hyperparameter optimization, and is quite like a brute-force approach; a range of hyperparameters is defined (such as a minimum/maximum number of layers in a neural network and the number of nodes in each layer), and the grid search trains and re-trains the network, iterating through the range of values. For example, suppose that we define the following ranges of hyperparameters for a neural network:

\#\mathrm{layers} = \{1, 5, 10\}    (13)
\#\mathrm{nodes} = \{10, 50, 100\}    (14)

A grid search will test each possible combination of the hyperparameters through the Cartesian product of the ranges, followed by some sort of validation, deciding which combination gave the greatest results. The data collection provided by the grid search may help answer some of the research questions defined earlier in the paper.

It ought to be mentioned that there are some noticeable downsides to the grid search. The brute-force technique leaves it open to problems concerning the number of parameters involved in the data, as the time taken to complete the grid search scales exponentially as the number of dimensions increases, commonly referred to as the curse of dimensionality [46].
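As a concrete sketch, a grid search over the two support vector regressor hyperparameters reported in the Results (C and gamma) can be run with scikit-learn's GridSearchCV; the toy data below merely stands in for the actual lagged BAC/PEth/AMI features, and the grid values are illustrative.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((100, 3))   # toy stand-in for the real feature matrix
y = rng.random(100) * 100  # toy stand-in for AMI values

param_grid = {"C": [1, 10, 100], "gamma": [0.0001, 0.001, 0.01]}

search = GridSearchCV(
    SVR(),
    param_grid,
    scoring="neg_root_mean_squared_error",  # negated so that higher is better
    cv=5,                                    # k-fold cross-validation (Section 4.5)
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)  # best combination and its RMSE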

4.4.2 Manual and random search

There are two other approaches quite similar to the common grid search that can be considered for the research: the manual search and the random search, both of which are quite similar to the grid search in terms of execution. The manual search is rather trivial [15]; using rules of thumb and intuition, we select hyperparameters that we suspect will yield positive results and test multiple candidates to see which ones perform the best, an approach often referred to as the grad student search. Amusing nomenclature aside, there is no guarantee that the results will be useful to any extent, given the lack of a systematic approach taken.

There is also the option of a random search [15]; in order to overcome the curse of dimensionality faced by the grid search, a random search selects hyperparameter values at random and tests the algorithm on those, traversing the grid without direction. Though it lessens the impact of higher dimensionalities, it does not guarantee an optimal solution like the grid search does.
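A corresponding sketch using scikit-learn's RandomizedSearchCV; n_iter caps the number of sampled candidates, which is what sidesteps the dimensionality problem, and the sampling distributions here are assumptions chosen for illustration.

import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X, y = rng.random((100, 3)), rng.random(100) * 100  # toy stand-ins

param_distributions = {"C": loguniform(1e0, 1e3), "gamma": loguniform(1e-5, 1e-1)}

search = RandomizedSearchCV(
    SVR(),
    param_distributions,
    n_iter=20,                              # only 20 random points are evaluated
    scoring="neg_root_mean_squared_error",
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)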

The data collection from the manual and random search appears in a form comparable to the grid search, making comparisons and analysis relatively trivial.

4.4.3 Bayesian optimization

Though manual, random, and grid search all have the ability to yield positive, tangible results, they suffer from significant drawbacks. They either take significant time and resources (a large concern for the grid search), or are not guaranteed to provide the best resulting parameters (as is the case with the manual and random searches). Advancements in the field of cost function optimization have yielded several answers to these conundrums, one of which is the Bayesian optimization process [9]. Rather than seemingly aimlessly scouring potential parameters in the off-chance that one set will be the best, Bayesian optimization attempts to include the concept of learning, and it does so through its namesake, the famous Bayes' theorem, which is defined as follows:

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}    (15)

In layman's terms, it is a description of the probability of a certain event A happening given that event B has already occurred. It assumes a distribution where the probability of A occurring is proportional to the probability of B given A multiplied by the a priori (prior) probability of A. Bayes' theorem is commonly described as a theorem for estimating the reliability of evidence, and is the foundation of Bayesian statistics, but recent advancements in the field have resulted in Bayesian optimization. The concepts behind Bayes' theorem can be applied to any Gaussian process (commonly thought of in the form of multivariate normal distributions) using a method called an acquisition function [9] to decide which location to sample next. See Figure 8 for an example of how the Bayesian optimization process applies to a distribution of a single variable, sampling three locations in total.


Figure 8: A Bayesian optimization on a univariate Gaussian process. Note how new locations are selected on the curve, and how the uncertainty of potential distributions adjusts accordingly. [9]

Bayesian optimization is, as such, commonly applied to large stochastic data sets, such as in the case of machine learning, using such acquisition functions to find suitable hyperparameters [9], and it is considered a suitable choice for the purposes of this research.
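The thesis does not name the library used for this step; as one possible sketch, the scikit-optimize package exposes a Gaussian-process-based gp_minimize that plays the role described above, here minimising the cross-validated RMSE of an SVR over C and gamma on toy stand-in data.

import numpy as np
from skopt import gp_minimize
from skopt.space import Real
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X, y = rng.random((100, 3)), rng.random(100) * 100  # toy stand-ins

def objective(params):
    C, gamma = params
    scores = cross_val_score(SVR(C=C, gamma=gamma), X, y,
                             scoring="neg_root_mean_squared_error", cv=5)
    return -scores.mean()  # gp_minimize minimises, so return positive RMSE

result = gp_minimize(
    objective,
    dimensions=[Real(1e0, 1e3, prior="log-uniform"),     # C
                Real(1e-5, 1e-1, prior="log-uniform")],   # gamma
    n_calls=30,            # 30 evaluations guided by the acquisition function
    random_state=0,
)
print(result.x, result.fun)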

4.5 K-fold cross-validation

For any predictive model, it is important to have some core metric for evaluating performance. This is commonly done through cross-validation, which assesses the extent to which a certain model can be applied to data outside of the training data. It is especially useful in the case of machine learning, where overfitting is a very tangible issue [24].


K-fold cross-validation is an attempt to wrestle with the assumption that the training and validation data match real-world data, as there is little guarantee that a model which happens to fit the validation set well will match other data sets just as well. Suppose, for example, that the training set just happens to consist of entirely linear data; the model would then of course be trained as a linear model. Cross-validation overcomes this issue by further dividing the training set into subsets of training and validation sets [47].

For example, suppose a data set X is divided into subsets X_1, X_2, X_3, and X_4. The model can be trained by selecting three of these, referred to as folds, as training sets, with the remaining fold used to validate them. The model is then re-trained by sampling the folds in a different arrangement until all arrangements have been exhausted. This is known as k-fold cross-validation, and it has become a commonplace approach to evaluating machine learning models. As such, it is the de facto option in this paper as well. See Figure 9 for a visual example of k-fold cross-validation.
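A sketch of the four-fold procedure described above, using scikit-learn; the Random Forest regressor and the toy data are illustrative placeholders, not the study's actual configuration.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X, y = rng.random((100, 3)), rng.random(100) * 100  # toy stand-ins

kfold = KFold(n_splits=4)  # four folds, mirroring the X1..X4 example
scores = cross_val_score(RandomForestRegressor(n_estimators=200, random_state=0),
                         X, y, scoring="neg_root_mean_squared_error", cv=kfold)
print(-scores)         # RMSE on each held-out fold
print(-scores.mean())  # averaged across the four folds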

Figure 9: A k-fold cross-validation diagram with k = 10 [47]

4.6 Evaluation

The evaluation will be two-part; in order to answer the research questions, performance metrics for each individual algorithm will be obtained using the root mean square error. The algorithms are compared to one another based on these results, and the experiment is repeated with the inclusion of historical data.

Optimization data is gathered in the model development process; in order to set optimal parameters for each model, this is a crucial step, and to this end, the root mean square error of each optimization process's best-chosen parameters is recorded. Much like the case of model testing, this is done for all models with and without historical data.

The analysis is done by comparing model performances to that of a naive persistence forecast (i.e., a model where the input for any time step is also the output); this is to account for the possibility that the models themselves are performing a naive forecast. In addition, the models are validated on their ability to successfully predict trend changes in the time series. Finding exact values is less relevant for the purposes of the subject data than anticipating how those changes come about. This is also supplemented by a random persistence forecast; in the event that the models are not performing a naive forecast, it is expected that they outperform a forecast of chance, at the very least.

In order to estimate optimization differences, we compare average changes in performance measured as a percentage of RMSE, expressing the loss of performance incurred by moving away from the best-performing process.


5 Results

All algorithms were optimized using a combination of grid search, random search, and Bayesian optimization. The training process used k-fold cross-validation. In addition, the algorithms were trained twice: once with no historical data, and once with historical data of 7 time steps. This was decided through rule of thumb in consideration of the subject data. For clarification purposes, historical data with 7 time steps means that any data entry also contains the previous 6 time steps. For example, such a data entry would contain the blood alcohol content for time step t, but it would also contain the blood alcohol content for time steps t-1, t-2, ..., t-6.

The feed-forward and LSTM network parameters that were optimized were the number of epochs and the batch size, while the Random Forest parameters that were examined were the number of trees, the maximum depth, the maximum number of features, and the maximum number of leaves. Gamma and the C-value were tested for the support vector regressor.

Four algorithms have been fully tested with regard to the research plan: a feed-forward deep learning neural network, a recurrent LSTM network, a support vector regressor, and a Random Forest algorithm. The following sections give a detailed explanation of how each algorithm performed, but for the sake of understanding the conclusions, reading the summaries should be sufficient.

5.1 Summary of model performances

This section details how each algorithm performed in predicting the data exactly one time step into the future. Table 1 showcases how each algorithm performed with no historical data, while Table 2 contains the results of algorithm performance using historical data with 7 time steps (as advised by the company to whom the data belongs).

All models showed similar results on the training set without historical data, with root mean square errors (RMSE) in the range of 14-16 (since AMI has a range of 100, it is not inappropriate to read the RMSE not only in AMI units but also as an error percentage, if one were to choose to do so). In addition, as was to be expected, all algorithms suffered reduced performance on the testing set. This is a consequence of the machine learning process and is considered normal behaviour, as it is unlikely that a model would be better at predicting patterns it has not seen than patterns for which a key was provided beforehand.

Algorithm                      Train RMSE   Test RMSE   Best parameters
Support vector regressor       16.386       16.599      C = 10, gamma = 0.001
Random Forest                  13.453       15.099      #trees = 200
Feed-forward neural network    14.494       15.033      #epochs = 200, batches = 40
LSTM network                   14.663       14.808      #epochs = 200, batches = 10

Table 1: Summary of algorithm performance without historical data

All algorithms showed notable improvement on the training set with historical data, though it is unclear whether or not that is a consequence of chance, as this has not been conclusively proven or disproven. It is also notable that for the support vector regressor, test RMSE was lower than train RMSE, suggesting improved performance on the testing set. This is abnormal behaviour (again, one would expect RMSE to increase when considering the testing set), but it could be explained as a coincidence of the data selection, as it is highly unexpected that a machine learning model would perform better on unseen data than on seen data. However, its test scores were in line with the other models when using historical data, and it is possible that the other models were overfitting onto the training set, though this was not expanded upon.

Algorithm                      Train RMSE   Test RMSE   Best parameters
Support vector regressor       12.209       11.209      C = 24, gamma = 0.0001
Random Forest                  9.341        11.743      #trees = 200
Feed-forward neural network    10.920       11.025      #epochs = 200, batches = 40
LSTM network                   9.494        11.372      #epochs = 200, batches = 10

Table 2: Summary of algorithm performance with historical data

5.2 Differences in optimization techniques

A combination of grid search, random search, and Bayesian optimization has been performed on the four tested models to determine their hyperparameters (showcased in Table 4). Again, here too, we measure results in root mean square error (RMSE). Much like in the case of model training, optimization was performed in two stages: once without historical data, and once with it. The grid search was consistently able to find optimal hyperparameters, though it suffers from exponential run times as the number of dimensions and parameters increases; for a large enough dataset, this would spell a death sentence.

However, the Bayesian optimization method was able to find (marginally) better hyperparameters than the grid search in one case, and its remaining performances on the non-historical dataset came close to those of the grid search. It is also worth mentioning that Bayesian optimization, despite providing similar results to the grid search, is a significantly faster approach. Randomly searching parameters proved relatively fruitless, as it was only once able to find the best parameters.

Algorithm                      Grid search   Random search   Bayesian optimization
Support vector regressor       17.779        16.599          17.885
Random Forest                  15.099        15.231          15.244
Feed-forward neural network    15.033        16.110          15.132
LSTM network                   14.808        15.750          14.663

Table 3: Best performances of optimization steps, measured in RMSE, with historical data excluded

The process is repeated with historical data included in Table 4, and we find that, on average, both Bayesian optimization and the grid search show increased potential in finding lower minima, leaving the random search behind.


Algorithm                      Grid search   Random search   Bayesian optimization
Support vector regressor       11.209        16.843          11.618
Random Forest                  11.704        12.041          11.334
Feed-forward neural network    13.043        16.110          11.025
LSTM network                   11.372        14.522          11.731

Table 4: Best performances of optimization steps, measured in RMSE, with historical data included

In general, grid searching provided the lowest RMSE scores, with only one exception. However, Bayesian optimization oftentimes came close to the results of grid searching with far lower runtimes by comparison. Though exact numbers were not measured (the potential gap in runtimes was overlooked at the time of model development), the gap was substantial, and what Bayesian optimization could do in hours would often take days if not weeks to perform with a grid search. Meanwhile, the random search method was left behind, with runtimes somewhere in between, but without the results to support it.

5.3 Results from the support vector regressor

The support vector machine originally performed noticeably worse than the other algorithms on the training sets, with higher RMSE across the board, with and without historical data; performing optimizations to find relevant parameters was a crucial step. However, its performance on the testing set was more in line with its competitors. This could indicate that the support vector regressor does not suffer from overfitting to the same extent as the other algorithms. Without historical data, the algorithm struggled to approximate certain parts of the data, at times making sporadic predictions that cannot be used for practical purposes.


Figure 10: Performance of a support vector machine on the testing set with no historical data. The Y-axis corresponds to output values, while the X-axis is a range of input time steps. This holds true for all performance graphs further on.

With historical data, the algorithm was able to make better predictions at the macroscopic level. It was able to approximate the more periodic sections of the data, as showcased in Figure 11. However, it appeared to exemplify the behaviour of a naive forecast (also known as a random walk); its predictions would oftentimes appear one time step too late, and it is possible the model simply chose to predict the values of time step t to be the same as those of time step t-1.


Figure 11: Performance of a support vector machine on the testing set with historical data

On the whole, the support vector regressor performed surprisingly well; though support vector machines find their main strengths in classification, the regressor produced RMSE values consistent with those of the other algorithms. Indeed, in the case with historical data included, the support vector regressor actually has the lowest RMSE of all four algorithms at 10.096 on the testing set. Given that LSTM networks are specialized tools for examining historical data, this is notable, but it must be said that this could very well be a quirk of the data being given; if the historical data contains little to no value for predictions, then history-based models will naturally suffer.

5.4 Results from the Random Forest algorithm

Similarly, the Random Forest, with RMSE scores in line with the other models, suffers from sporadic and seemingly random predictions at times, often failing to anticipate stationary periods in the data. Figure 12 contains a sample of the random forest's predictions in comparison to real observations.


Figure 12: Performance of a random forest algorithm over the testing set with no historical data

With historical data, the performance of the random forest is less sporadic and better approximates the observed outputs on a larger scale, as seen in Figure 13. It is able to approximate the output function relatively well, with substantial deviations from the observations being more infrequent than in its non-historical counterpart. In addition, testing RMSE saw a large decrease from 15.244 to 11.743. This does seem to indicate some relational significance in the historical data.


Figure 13: Performance of a random forest algorithm over the testing set with historical data

In terms of optimization, the random forest algorithm does not see notable improvement from the tweaking of its hyperparameters; if one were to go out of their way, performance could be negatively impacted, but that could be considered an edge case, and not necessarily representative of work with random forests at large. The hyperparameter with the most explanatory power, the number of trees in the forest, only yields a minimal performance improvement very early on.


Figure 14: The number of trees (estimators) in a Random Forest shows a significant improvement in the single digits

In Figure 14, it can be seen how the performance of the random forest (represented by the score in the graph) rapidly rises as the number of trees iterates through the single digits. In this case, it corresponds to an approximate improvement of 4% for the model. However, the fit to the data does not improve significantly once the number of trees exceeds 10 in total; this alone would suggest that the predictive model in question is rather simple. Whether or not it is worth performing a grid search to attain this increase is worth considering, though the random forest's quick training process weighs heavily in its favour.


Figure 15: Accuracy changes of a Random Forest in relation to the number of trees

The performance of the random forest is notable in that it is able to arrive at conclusions far more quickly than its competitors; the neural networks require a significant training process that takes up a substantial amount of time (exponentially more so if one applies a grid search to optimize hyperparameters), and the support vector machine can oftentimes make trivial errors, as was the case in the tail end of Figure 10. Indeed, the main benefits of optimizing the Random Forest came in the early stages, with a sharp upturn in performance, but almost none in the latter stages from 20 trees onwards, as shown in Figure 15.

5.5 Results from the multilayer feed-forward neural network

Figure 16 contains a sample of the feed-forward network's predictions in comparison to real observations. Though it is able to approximate the function on a macroscopic level, its predictions are often made too late (i.e., one time step after the observation) or are highly sporadic, unable to account for stationary sections in the data.

The feed-forward network matched the performance of the random forest regressor, with similar RMSE values across the board. There was a suspicion here too that the model could be naively predicting tomorrow's value by over-estimating today's value and merely adding some noise to that effect. Again, such a phenomenon would render the model relatively pointless; though it would produce low error scores, such predictions are irrelevant as they do not contain any meaning extracted from the data itself.

Figure 16: Performance of an artificial neural network over the testing set with no historical data

The feed-forward network saw a notable improvement in RMSE when the input data included historical data (again with 7 time steps in the past), decreasing from 15.033 to 13.043. However, though the error of each prediction was lower than previously, the likelihood of the model performing a naive forecast increased; predictions where the observations decreased from the previous time step were oftentimes incorrect, hamstringing the model in terms of useful forecasts.


Figure 17: Performance of an artificial neural network over the testing set with historical data

The picture starts to clear up as one observes the loss function for the feed-forward network, depicted in Figure 18; though performance continues to improve on the training set (illustrated in blue), testing performance improves in a stable manner only for the very first few epochs, then becomes more unstable and sporadic despite increased training. This may indicate a high level of overfitting and possibly an inability to find highly relevant correlations in the data, though it is inconclusive without further analysis. In order to determine that, persistence forecasts are used in the analysis.


Figure 18: Change in the loss function of an artificial neural network over 500 epochs

The feed-forward network shows results comparable to the random forest and support vector machine in terms of RMSE, but its long training processes and exponentially long optimization processes make it difficult to work with, and the naive forecast problem could be at play here.

5.6 Results from the recurrent neural network

The LSTM network had low RMSE scores, indicating positive results; though its initial optimization processes were noticeably slower than those of the feed-forward network (contrast Figure 18 and Figure 22), it continued to improve on the testing set, unlike the feed-forward network, whose testing loss levelled out quickly. Without historical data, as was the case in Figure 19, the network performed on par with the other models, with similar RMSE scores on both the training and the testing sets. Based on this alone, one would conclude that the model outperformed the others.


Figure 19: Performance of a recurrent LSTM network on the testing set with no historical data

Though the LSTM network performed very well in terms of loss function and RMSE, as seen in Figure 22, a problem arose upon further inspection; though each prediction closely approximated the placement of the actual value, it was time-shifted one step back, suggesting that the model was either heavily oversampling the previous time step or performing a naive forecast; either way, the result would be the same. In other words, today's forecast corresponded much more closely to yesterday's observation than to today's observation.

The LSTM model saw some improvement in its error rates when accounting for historical data, and its predictions usually approximated the observed outputs closely. However, much like the feed-forward network, it tended towards a naive forecast, hinting at a lack of meaningful relationships in the data, though this has not been ascertained. In addition, the gap between training and testing RMSE increased, suggesting increased overfitting.


Figure 20: Performance of a multivariate LSTM network over the testing set with historical data

One interesting quirk of the model was that it carried an inherent bias without historical data; it was entirely unable to predict values greater than 90, suggesting a bias of -10 at its upper limits. It is unclear why this occurs only with the non-historical model, as the behaviour is not present in any other of the models tested in the entire research. Indeed, the historical LSTM model was able to eliminate this bias, accurately making predictions in the range 90-100.


Figure 21: The performance of the LSTM network saw a bias of exactly -10 at its upper limits

The loss function of the LSTM network is as one would expect; after a small number of iterations, the loss on the testing set begins to overtake that of the validation set. The model's consistent improvement on the testing set suggests that it is not overfitting either. However, the recurring issue of naive forecasts could be at play here; if any value is preceded by another value, a naive forecast is possible. This relationship is true for both the training and the testing set, meaning that a naive forecast is always possible, which could explain why the network continues to improve so well on the testing set, and why the training and testing sets are so in line with each other.


Figure 22: Change in the loss function of an LSTM network over 250 epochs


6 Analysis

This section is devoted to interpreting the results and providing a background for understanding their consequences. It also displays some measures taken to improve model performance, as well as a discussion on the optimization query posed at the beginning of this research.

The research found little difference in performance between machine learning and deep learning models on the time series data, with or without historical data. Indeed, all algorithms performed within the same range of test RMSE, from 10.096 at the lowest to 13.372 at the highest. However, two notable issues were blatantly obvious following the data collection stage; the models were suspected to occasionally resort to naive forecasting similar to a random walk (albeit to varying degrees), and it remained unclear whether or not their predictive capabilities were valid in a more practical context.

6.1 The naive forecasting problem

Examining the output graphs, naive forecasting was indeed an issue to be considered; irrespective of the values of time steps t-7 to t-2, the time step t-1 could potentially weigh far more heavily in making predictions. The relationship was not entirely linear, however, as the graphs in Section 5 displayed periods in the data where the models did not make naive forecasts. In order to confirm the extent to which the models performed a naive forecast, an ARIMA forecast was executed, feeding it univariate data of the observed outputs as inputs (rather than the multivariate data of the previous models). This forecast is expected to behave strongly like a random walk, given the stochastic nature of the outputs (i.e., there is little correlation between two measured outputs).

If the models are performing a naive forecast, then it is expected that their root mean square error should correspond, at least approximately, to that of the ARIMA forecast. If the RMSE deviates enough (regardless of whether it is higher or not), it suggests that the models are not performing a naive forecast or a random walk on the data. It is not expected that the models should outperform the naive forecast with a lower RMSE, due to the random spread and high variance of the data.


Figure 23: Example of a naive ARIMA forecast. Though the model appears to very closely approximate observed values, each prediction is time-shifted, making it a poor predictor of high-variance series.

The ARIMA forecast produces an RMSE of 5.643, significantly lower than any of the tested models (see Table 5). At first glance, this would indicate that the models are simply producing poor results, though this is not necessarily the case; the value of a predictive tool lies in predicting the direction of the short-term trend, something a naive forecast will consistently get wrong. Considering the naive ARIMA as the best-performing persistence predictor, and the biased random forecast as the worst-performing persistence predictor, any model that can forecast non-trivial results is expected to land somewhere in between these two (a model that could outperform a random walk on such a sporadic, high-variance time series could be applied to any time series and could effectively predict the future for any data; needless to say, the development of such a tool is outside the scope of this paper). In addition to the naive ARIMA forecast previously discussed, a random biased forecast was added, as explained in the methodology.


Algorithm                      RMSE with history   RMSE without history
Support vector regressor       11.209              16.559
Random Forest                  9.341               15.099
Feed-forward neural network    11.025              15.033
LSTM network                   11.372              13.663
ARIMA forecast                 5.643               5.643
Random forecast                13.566              13.566

Table 5: Model performances with and without historical data compared to two persistence forecasts: an ARIMA model and the biased random forecast.

We find that the models' RMSE values do not match those of the naive ARIMA forecast; hence, it follows that they are not performing the same type of forecast, and as such, further analysis is required to determine how valid they are as forecasting tools. Crucially, however, all models were able to outperform the random predictor; in other words, it is possible to argue that there is some correlation in the data with 7 days of lag, though there is evidence to suggest that the correlation is not strong enough to conquer a naive forecast.

In addition, the loss function of the feed-forward network cannot be ignored; Figure 18 shows that predictions on the validation data do not match the behaviour of predictions on the training set, whereas in a naive forecast one would expect the opposite (in other words, the pattern that yesterday's observation is today's prediction would hold true for both training and validation data). If the network were performing such a forecast, its loss function would be better represented by Figure 22, as the loss on the training and validation data would have been near identical. Given that the two loss behaviours are mutually exclusive, it follows that at least one model does not perform a naive forecast, and since its RMSE is lower than that of a random forecast, there is some correlation, albeit a weak one, in the data.

A final point to acknowledge is the difference made by including historical data in the forecast; that inclusion proves sufficient to lower model RMSE below that of the random forecast. Though this provides only a baseline to compare against, it is evidence to suggest that historical data makes a notable difference in this case. We can state with some confidence that alcoholic relapse prediction benefits from the addition of history corresponding to 7 time steps. There is some meaning behind this given the trouble caused by naive forecasting; a model that finds naive predictions to be the best would not improve from additional history, and would continue to naively predict future values. Hence, there is evidence, albeit inconclusive, that the models may be able to resolve the issue.

With these facts combined, it is reasonable to conclude that a feed-forward neural network can, indeed, make non-naive forecasts on patient abuse cases that are better than chance.


6.2 Model validity

It is clear that at least one model is not performing an entirely naive forecast, but rather a non-trivial prediction; hence, it must be ascertained to what extent the models are able to accurately predict a change in the direction of the trend (i.e., whether the trend will be positive or negative). In other words, a change-trending value is defined as being either lower than both its preceding and succeeding values, or higher than both. This can be checked in a straightforward manner by examining the output data, comparing three predicted values at a time to three observed values at a time and returning a positive answer if the predicted values correctly show a peak or a dip, as shown here in pseudo-code:

Algorithm 1: Tracking changes in data series trend
Result: outputs the number of true positives, true negatives, false positives, and false negatives

array observations;
array predictions;
for i in observations do
    if change in observations[i] and change in predictions[i] then
        increment true positives;
    end
    if no change in observations[i] and no change in predictions[i] then
        increment true negatives;
    end
    if no change in observations[i] and change in predictions[i] then
        increment false positives;
    end
    if change in observations[i] and no change in predictions[i] then
        increment false negatives;
    end
end

This algorithm could be improved significantly by expanding it to also account for long-term trends rather than only sudden movements in the data, but as the value of the predictive tools within the context of the data is to account for changes one time step in the future, a small window was chosen. However, it is important to keep this decision in mind going forward.
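A runnable Python version of Algorithm 1, under the peak-or-dip definition of a trend change given above; how the first and last points and exact ties are handled is an assumption, since the pseudo-code leaves it open.

def count_trend_changes(observations, predictions):
    """Count true/false positives and negatives for predicted peaks and dips."""

    def is_change(series, i):
        # A point is a trend change if it is a local peak or a local dip.
        return (series[i] > series[i - 1] and series[i] > series[i + 1]) or \
               (series[i] < series[i - 1] and series[i] < series[i + 1])

    tp = tn = fp = fn = 0
    for i in range(1, len(observations) - 1):  # endpoints lack a full window
        observed = is_change(observations, i)
        predicted = is_change(predictions, i)
        if observed and predicted:
            tp += 1
        elif not observed and not predicted:
            tn += 1
        elif predicted:   # predicted a change that did not happen
            fp += 1
        else:             # missed a change that did happen
            fn += 1
    return tp, tn, fp, fn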

Table 6 showcases the number of correctly predicted peaks and dips each model made against the number of incorrectly predicted peaks and dips, using models trained without historical data. The testing data shown here, comprising 1510 observed trend changes, is made up of the clinical trials performed by Kontigo Care; additional data was used in the training and testing process of the models, but it is not displayed here.


Algorithm         True positives   True negatives   False positives   False negatives   % of accurate predictions
SVM               816              1582             1512              694               52.08%
Random Forest     891              1372             1722              619               50.03%
ANN               864              1453             1641              646               50.03%
LSTM network      356              2563             613               1154              62.29%
ARIMA             708              2120             1068              849               59.59%
Random forecast   977              1100             1995              533               45.33%

Table 6: Predicted changes in trends by each model, measured in number of occurrences, without historical data.

The LSTM network was notably out of step concerning predicted changes in the time series, contrasting sharply with its low RMSE scores. Though it predicted true negatives in trend changes far more often than its competitors, it suffered a noticeable gap in predicting true positives. One hypothesis that could explain this is the LSTM network's upper bias of -10; given its propensity for missing high-value data entries, and how frequently those occur, it may be that despite its low RMSE, the network incorrectly treats direction changes outside of its bias.

Meanwhile, the feed-forward network, Random Forest, and support vector machine remained on par with each other; the two former models managed to maintain an identical percentage of accurate predictions. All three models failed to account for true positives as often as the random forecast, but this was significantly compensated for by their ability to predict true negatives.

The rather low incidence of positive predictions suggests that the tools' predictive capabilities are rather limited; on average, almost half of all trend changes remain unaccounted for, suggesting that the models tend to miss changes about as often as they accurately predict them. Nevertheless, there is a noticeable gap in performance between the tested algorithms and the ARIMA forecast; they perform considerably better at forecasting direction changes in the data trend. Though the ARIMA model shows a high accuracy of 59.59%, the figure is deceptive; the majority of its positive results arrive from its true negative predictions.

The models were later re-examined with historical data included:

Algorithm         True positives   True negatives   False positives   False negatives   % of accurate predictions
SVM               782              1714             1380              728               54.21%
Random Forest     728              1745             1349              782               53.71%
ANN               823              1536             1558              687               51.23%
LSTM network      170              2761             333               1340              63.66%
ARIMA             708              2120             1068              849               59.59%
Random forecast   977              1100             1995              533               45.33%

Table 7: Predicted changes in trends by each model, measured in number of occurrences, with historical data.

All four models benefited, at least on a surface level, from the presence of historical data, with accuracy improving slightly across the board. Notably, the Random Forest model saw a significant performance boost of almost 4%. It is unclear why no other model saw comparable changes to its accuracy.


However, the majority of the model improvement came from an increased prediction rate of true negatives; in other words, given historical data, the models were more accurate in predicting occasions when the trend did not change. Indeed, predicting true positives saw a notable downturn, with the aforementioned Random Forest algorithm dropping from 891 true positive predictions down to 728. It is suspected that the inclusion of historical data caused increased overfitting, with models tending more towards naive forecasting. This is in line with the naive ARIMA forecast, which is similarly positioned: low RMSE scores, a high rate of true negative predictions, but a low rate of true positive predictions.

6.3 Ensemble learning

The issue of low predictive trend change accuracies despite the models' low mean square error rates suggests room for improvement, and it is possible that any success could be attributed to the data in question. As such, further analysis was done by developing an ensemble learner to investigate whether or not prediction stability could be improved. This is justified by studies in the related work on time series forecasting using machine learning, and similar approaches are employed here.

Using a strategy similar to the Random Forest, in which multiple weak predictors combine to make a single strong predictor, an ensemble learner taking the aggregate predictions of the four developed models was constructed, with the aim being to see whether the models performed weak predictions owing to either insufficient data or a lack of meaning behind the existing data. The results of the new ensemble learner are showcased below, in comparison to its four constituent models:

Algorithm                      Train RMSE   Test RMSE   % of accurate predictions
Support vector regressor       16.386       16.599      52.08%
Random Forest                  13.453       15.099      50.03%
Feed-forward neural network    14.494       15.033      50.03%
LSTM network                   14.663       14.808      62.29%
Ensemble learner               13.964       14.467      49.82%

Table 8: Performance of the ensemble learner compared to previously tested models without historical data included
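The text does not state how the four constituent predictions were aggregated; the sketch below assumes a plain mean of the individual forecasts, in the spirit of the Random-Forest-style averaging described above.

import numpy as np

def ensemble_predict(models, X):
    """Aggregate predictions from already-fitted constituent regressors.

    models: fitted models exposing .predict(X) (e.g. the SVR, Random Forest,
            feed-forward network, and LSTM wrappers).
    Assumes simple averaging, which the thesis does not specify.
    """
    stacked = np.column_stack([np.ravel(m.predict(X)) for m in models])
    return stacked.mean(axis=1)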

The ensemble learner proved poor at maintaining low RMSE scores without historical data, with an accuracy lower than any other individual model. This may be due to the inclusion of the LSTM model in the ensemble learner; Table 7 shows how rarely the LSTM network makes accurate predictions about trend changes.

The model was trained and tested twice, the second time with historical data included:


Algorithm                      Train RMSE   Test RMSE   % of accurate predictions
Support vector regressor           12.209      11.209                      54.21%
Random Forest                       9.341      11.743                      53.71%
Feed-forward neural network        10.920      11.025                      51.23%
LSTM network                        9.494      11.372                      63.66%
Ensemble learner                   10.052      10.360                      52.36%

Table 9: Performance of the ensemble learner compared to previously tested models with historical data included.

With models trained on historical data, the ensemble learner has an RMSE lower than most of its constituents, though its accuracy is considerably lower than that of three of the four constituents. Here, too, it appears that the models overfit and cannot reliably predict changes in the trends, due to the influence of a naive forecast.

In addition, trend change predictions were also examined on the ensemble learner. For the sake of brevity, success rates are appended onto Tables 8 and 9, but previous trend checks are not restated. The exact results of the ensemble learner's ability to predict trend changes are displayed in Table 10.

Historical data   True positives   True negatives   False positives   False negatives   % of accurate predictions

no                           885             1414              1680               625                      49.82%
yes                          717             1694              1400               793                      52.36%

Table 10: Predicted trend changes from the ensemble learner, depending on the presence of historical data.

The change from non-historical to historical data in the training process for the ensemble learner proves, rather unsurprisingly, to be similar to that of its constituents; as historical data is included, both true and false negatives increase at the expense of both true and false positives, with a notable increase in the percentage of accurate predictions. This may be because the model is either overfitting (a rather unlikely prospect given its low RMSE) or tending towards a naive forecast, much like its constituents. On the whole, it is reasonable to suggest that the transition to an ensemble learner did not improve performance to any relevant degree, as seen in Figure 24.


Figure 24: Showcase of the ensemble learner making predictions on the testing set.

6.4 Verification with an independent data set

Given that all models took on some aspects of a naive forecast to varying degrees, it was considered relevant to verify this phenomenon. To confirm that it was not simply a quirk of the particular high-variance data set used to develop the models (and, in doing so, confirm that it is part of the model behaviour), the neural network, chosen due to the aforementioned proof that it is not entirely undergoing a naive forecast, was re-trained using an independent high-variance data set, as discussed previously. It was trained twice, again with and without historical data. The results are displayed in Table 11. Note that RMSE has been intentionally omitted, as the resulting error would be proportionally different, given that the independent data set contains largely different spans of values.

Historical data   True positives   True negatives   False positives   False negatives   % of accurate predictions

no                          2466             5368              3704              2914                      54.25%
yes                         2181             5832              3240              3197                      55.64%

Table 11: Predicted trend changes of the neural network using independent data, with and without historical data.


The distribution of positive and negative predictions is in agreement with previous tests, with the majority of true predictions appearing in the form of negatives, and a similar rate of prediction. Indeed, a quick inspection of the output graphs reveals the network's tendency to perform naive predictions, as is the case in Figure 25.

Figure 25: The neural network made naive predictions even on an independent data set.

Again, we see that predictions are strongly lagged; the current value is the strongest indicator of tomorrow's value. Hence, it follows that the neural network does not necessarily solve such high-variance problems to a satisfactory extent. This is argued from two assumptions: firstly, that some noise is to be expected when applying a machine learning algorithm over an explicit naive forecasting model, and secondly, that a naive forecast is the best proven indicator of a trend. Further testing with competing models is not expected to differ in results, and it is increasingly likely that highly specific and context-relevant data transformations may be required to perform accurate predictions.
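One hedged way to check for this behaviour is to compare a model's test error against an explicit lag-1 persistence (naive) forecast, and to measure how strongly its predictions correlate with the previous observed value. The sketch below is illustrative only and assumes the observed and predicted series are available as aligned NumPy arrays.

import numpy as np

def naive_forecast_check(y_true, y_pred):
    # Persistence forecast: predict tomorrow's value as today's observed value.
    naive = y_true[:-1]
    target = y_true[1:]
    model = y_pred[1:]

    naive_rmse = np.sqrt(np.mean((target - naive) ** 2))
    model_rmse = np.sqrt(np.mean((target - model) ** 2))

    # If the model correlates more strongly with yesterday's value than with
    # today's, it is likely behaving as a (noisy) naive forecaster.
    corr_with_previous = np.corrcoef(model, naive)[0, 1]
    corr_with_actual = np.corrcoef(model, target)[0, 1]
    return naive_rmse, model_rmse, corr_with_previous, corr_with_actual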

6.5 Analysis of optimization processes

The results from the research investigating optimizations were arguably trivial; the grid search consistently yielded good results, but with runtimes exponentially longer than its competitors, running for days as opposed to a matter of hours. Though the random search has been touted as a surprisingly useful technique, the high error rates of models trained with it - only occasionally coming close to rivalling the other techniques - quickly eliminated it as a valid candidate going forward, as shown in Table 4. As such, further analysis focuses only on the differences between Bayesian optimization and grid search.

In order to ascertain the differences between Bayesian optimization and grid search, some very simple statistics are computed (given the absence of exact timestamps, it must be said that the impact of these is somewhat diminished):

1. Performance difference in percentage on a non-historical dataset.

2. Performance difference in percentage on a historical dataset.

3. Average performance difference on both datasets.

The results are presented in Table 12. We find a large difference in performance between tests containing historical data and those without, though this may be attributed to a smaller sample size. Nevertheless, in either case, the performance difference failed to justify the overwhelmingly large increase in execution time.

Without historical data   With historical data   Average performance difference
0.796%                    7.037%                 3.481%

Table 12: Performance differences in percentage between Bayesian optimization and grid search on both historical and non-historical datasets.

Bayesian optimization proved to be a useful tool for optimizing deep learning models, as the exponential run time of the grid search made it obsolete in comparison to the Bayesian model. Indeed, later developments and changes to models for the purposes of this research were often checked with Bayesian optimization due to its high performance rate. However, this outcome may have been a consequence of the models chosen, and may not always hold for different parameter sets.

The high speed and subsequent accuracy of Bayesian optimization is worth restating; given that the de jure mode of the grid search is less prevalent than rules of thumb, Bayesian optimization fills a noticeable gap, and its performance differences relative to the grid search are negligible. As such, going forward, Bayesian optimization ought to be considered as a tool when developing regression-based machine learning models. In a moment of reflection, however, it must be said that there are some deficiencies in the strategy used to investigate this phenomenon; first, the exclusion of measured execution time is notable, and further discussion could have been had with its inclusion. Second, the very small sample space limits the ability to generalize the research further, though the deterministic nature of the subject means, if nothing else, that these results are valid for AMI-based predictors.
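For readers wishing to reproduce a comparable hyperparameter search, the sketch below uses gp_minimize from the scikit-optimize library as one possible Bayesian optimization routine. The search space, the random forest model, the cross-validation setup, and the library choice are illustrative assumptions; the thesis does not prescribe this particular implementation, and X_train and y_train are assumed to exist beforehand.

import numpy as np
from skopt import gp_minimize
from skopt.space import Integer, Real
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical search space for a random forest regressor.
space = [
    Integer(10, 500, name="n_estimators"),
    Integer(2, 20, name="max_depth"),
    Real(0.1, 1.0, name="max_features"),
]

def objective(params):
    # Return cross-validated RMSE for one hyperparameter setting.
    n_estimators, max_depth, max_features = params
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        max_features=max_features,
        random_state=0,
    )
    scores = cross_val_score(
        model, X_train, y_train,
        scoring="neg_root_mean_squared_error", cv=5,
    )
    return -np.mean(scores)

# result = gp_minimize(objective, space, n_calls=50, random_state=0)
# print(result.x, result.fun)  # best hyperparameters and their RMSE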


7 Discussion and conclusions

This section is devoted to discussing the impact of the research and the value of the methods chosen in the analysis, and to identifying deficiencies in the research.

7.1 Summary of research findings

This research set out with multiple goals: it aimed to establish whether or not machine learning models could, to some extent, forecast rapid changes in an alcohol abuse patient's behaviour one time step in advance, to investigate model differences in making predictions on such datasets, and to examine what differences could be found between optimization techniques.

First question

The analysis in the previous section suggested that the first goal was met with some success. We restate the first null hypothesis:

There is no correlation between a patient's historical data and forecasted alcohol behaviour that a machine learning model can infer.

In the analysis, it was found that all models produced RMSE values lower than a random forecast. In addition, it was also shown that the feed-forward neural network could not have performed a naive forecast, as its loss function did not reflect this. As such, one must conclude that the feed-forward network was able to find a weak correlation between the inputs and outputs. Hence, we reject the null hypothesis and accept the alternative:

There is a correlation between a patient's historical data and forecasted alcohol behaviour.

There are some caveats to this statement. First, the findings were obtained using a machine learning approach, and it is possible that these correlations could be explained far more clearly using classical statistical tools. Second, this correlation was confirmed using a feed-forward neural network, and it is not confirmed whether the conclusion holds for the other models; even though they produced very similar results in terms of RMSE and trend changes, it is possible that this is due either to chance or to the nature of the data. The notion that they are performing a naive forecast has neither been confirmed nor dismissed. The analysis hinted that the inclusion of patient history was evidence against naive forecasting, but it could also be the case that the added history merely decreased the noise of the naive forecast.

Second question

However, concerning the second research question and goal, which aimed to examine whether there were any differences between classical machine- and deep learning models, the question remains unanswered. Indeed, as three of the four models produced similar error rates and trend change predictions, it is difficult to draw a conclusion; both results and analysis were inconclusive on model performances, with a notable lack of meta-information about how the models arrive at their conclusions. Given the inherent bias carried by the LSTM network and its notably undocumented inability to make accurate predictions on the higher end of outputs (90 or higher), it is likely that this is a malfunction in model development, and not representative of LSTM networks at large. One must conclude, then, that it is unclear at this stage whether or not there is a difference between machine- and deep learning models at forecasting these types of problems.

Third question

The results showed that both Bayesian optimization and grid search found very similar optima across the board. However, a quick investigation of the results found that in this specific case, Bayesian optimization corresponded to an approximate average 3.5% loss in performance. In the context of the subject matter of AMI predictions, this corresponds to a far smaller number; a quick estimate would suggest that the impact on predictions would be 3.5% of an approximate RMSE of 10, or roughly 0.35 RMSE units. Given the massive gap in runtime, this small impact is negligible at best in this specific context. Moreover, little analysis is needed to see that the random search is left far behind, with runtimes slower than Bayesian optimization and end performance worse than both competitors. One would inevitably have to conclude that the null hypothesis does not hold, as the random search is an objectively inferior alternative to both grid search and Bayesian optimization, given the choice of all three.

7.2 Potential model improvements

Certain elements of model development were overlooked: some intentionally, while others were discovered only in the latter stages of the research process. Hence, it stands to reason that these ought to be discussed. A reflection on the feed-forward neural network, the model recommended by this research for this type of time series analysis, identified two areas that could improve performance. Preprocessing the data and making relevant transformations could improve model performance, a step intentionally omitted in this research. In a similar vein, one could even recode the outputs to treat the regression problem as a classification issue, given the limited range of 'relevant' predictions (i.e., in the context of the data being used, a predicted value of 34 is not substantially different from a prediction of 33).


7.2.1 Further data transformations

The topic of data treatment and preprocessing was left relatively untouched in this research; a core assumption when developing the models was that their task was to find explicit relationships in the untreated data. However, the research strongly suggests that this is not the case, meaning that, going forward, a more rigorous approach to preprocessing is required. There are two large questions looming over the data that need to be dealt with.

First, one must ask whether the inclusion of historical data increases the level of noise without considering the level of persistence (i.e., the number of time steps forward over which each sample exerts influence). For example, if one were to forecast tomorrow's weather, providing historical data for the past seven years could be considered highly excessive, as hardly anything outside the past few days has any tangible bearing on what tomorrow's weather will be like. One could preprocess the data and train the models in such a way as to account for this, by having multiple models trained on short-term and long-term time lags.
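As an illustration of this idea, the sketch below builds lagged feature frames for two different window lengths from a univariate series, which is one plausible way to prepare separate short-term and long-term training sets. The window sizes, the pandas-based approach, and the variable names are assumptions, not the preprocessing actually used in this research.

import pandas as pd

def make_lagged_features(series, n_lags):
    # Each row contains the previous n_lags observations as features
    # and the current value as the prediction target.
    df = pd.DataFrame({"y": series})
    for lag in range(1, n_lags + 1):
        df[f"lag_{lag}"] = df["y"].shift(lag)
    df = df.dropna()
    return df.drop(columns="y"), df["y"]

# Hypothetical short-term (one week) and long-term (one month) windows:
# X_short, y_short = make_lagged_features(daily_scores, n_lags=7)
# X_long, y_long = make_lagged_features(daily_scores, n_lags=30)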

Second, it has been common practice in machine learning to either ignore data entries containing missing values, or to crudely fill the missing values with zeroes. This carries a certain level of bias, especially in regressive features in which the value 0 holds meaning. In those cases, any missing value would be cast as 0, overrepresenting it in the data samples. A work-around is to consider one-hot encoding, in which a categorical variable is replaced by a number of boolean flags corresponding to the number of options in that variable. However, this will inevitably invoke the curse of dimensionality; if a variable has 11 possible options, its one-hot equivalent results in 11 columns: an explosion in the number of features. With added time lag, those 11 columns would be added again for each included time step.
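A minimal sketch of this trade-off, assuming a pandas DataFrame with a single hypothetical categorical column of 11 levels, might look as follows.

import pandas as pd

# Hypothetical frame: one categorical feature with 11 possible options.
df = pd.DataFrame({"treatment_phase": [f"phase_{i % 11}" for i in range(100)]})

# One-hot encoding replaces the single column with 11 boolean flag columns.
encoded = pd.get_dummies(df, columns=["treatment_phase"])
print(encoded.shape)  # (100, 11): the feature count has grown elevenfold

# With a time lag of k steps, each lagged copy of the variable would add
# another 11 columns, so the feature count grows roughly as 11 * (k + 1).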

7.2.2 Defining outputs categorically

There is strong evidence to suggest that there is a relationship, albeit a weak one, between a patient's past behaviour and future incidence of alcohol abuse. However, this lacks a certain level of generalizability, as the methodology in question was built on a single company's own construct for measuring sobriety. In order to examine such a subject in a more general sense, one would have to incorporate aspects touched upon by [49], with a significant change being the treatment of output variables.

The goal of this research was, indeed, to investigate the aforementioned relationship using a machine learning-based approach, though it was entirely regression-based. If one were to discard the goal of investigating a generalizable approach to multivariate time series forecasting using the methods chosen here, the study could be adapted to provide a categorical statement. Suppose that instead of the output being a float value from 0 to 100, it instead represented a boolean of 0 or 1 indicating whether or not a patient relapsed that particular day. Proving such a relationship could show results usable by other agencies and, more importantly, would show the relationship more strongly than in the current state.
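A hedged sketch of this recoding is given below; the relapse threshold, the variable names, and the choice of classifier are hypothetical illustrations rather than values or tools taken from this research.

from sklearn.ensemble import RandomForestClassifier

# Hypothetical threshold: scores below 50 are treated as a relapse day.
RELAPSE_THRESHOLD = 50

def recode_as_classification(scores):
    # Turn the continuous 0-100 sobriety score into a binary relapse label.
    return (scores < RELAPSE_THRESHOLD).astype(int)

# Usage sketch, assuming a feature matrix X and a continuous target y_score:
# y_label = recode_as_classification(y_score)
# clf = RandomForestClassifier(random_state=0).fit(X, y_label)
# relapse_probability = clf.predict_proba(X)[:, 1]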

7.3 Conclusions

This paper originally set out to examine three things: whether there is a correlation between patient history and future behaviour that can be found with machine learning, whether or not there is a difference between different types of machine- and deep-learning models on high-variance regression problems, and finally, to what extent different hyperparameter optimization processes impact performance.

It was shown that, if nothing else, a feed-forward neural network can make predictions one day in advance that are better than those of chance; given the proof that it is not making a naive forecast (as stated previously, a model trained on naive predictions would see similar losses on training and validation), it follows that it has discovered some non-trivial pattern. The significance of such predictions ought not to be understated; there is a significant gap in the domain-specific applications of machine learning in this field, and this paper serves to show that neural networks can indeed be used to forecast relapses in alcohol abuse. Hence, it follows that further research in this area is highly advised.

However, there are some limitations that ought to be considered. The analysis clearly showed that the neural network was unable to provide more relevant predictions than simple models such as the ARIMA process, limiting its impact. Furthermore, it ought to be stated that though there is evidence to suggest that there is some meaningful pattern in the data, it cannot be ascertained with the use of a neural network as to what that pattern actually is, limiting explainability.

Given the lack of knowledge surrounding each model, it is difficult to determine the extent to which a naive forecast impacted the classical machine learning models; as their training processes do not include a loss-over-time model, the exclusionary logic used to determine the nature of the neural network's predictions cannot be applied here. Though their distributions of RMSE values and trend change predictions are in line with the feed-forward neural network, it cannot be said that the relationships they map are meaningful, as the actual values may be attributed to chance or coincidence.

The analysis contained a brief attempt at generalizing the conclusions of the neural network, by mapping inputs and outputs of an independent dataset. The results were inconclusive, with values being far more strongly lagged than in the case of the subject data. However, it should perhaps be restated that this is not necessarily an issue with the chosen and developed models; as previously stated, the random walk forecast is a valid forecasting method, and is commonly used in finance and stock forecasting - multidimensional, high-variance, potentially periodic data that resembles the subject data in this paper. It can be said, then, that generalizing forecasting models requires a drastically different approach than was taken here.

Going forward, it should perhaps be noted that little data treatment, other than fitting the data to the models, was done, albeit by intention. Given that little thought was given to dealing with missing values or noise (again, by design), more careful, context-sensitive data treatment and preprocessing may give way to more accurate predictions on problems the models can resolve. However, as it stands, building a general-purpose time series forecasting tool has proven a seemingly insurmountable task.

A key take-away that ought not to go unmentioned is the extent to which Bayesian optimization proved faster than its competitors. Though time measurements were not taken (at the time, such a level of performance difference was not expected, and time was not considered as a dependent variable), runtimes were orders of magnitude faster than the grid search, and with results close to it, it can easily be argued that Bayesian optimization should, in the future, replace the grid search as the de facto optimization process for machine learning models.

In addition, though the usage of Bayesian optimization for choosing hyperparameters was a successful endeavour, the approach by which the research conducted it was fundamentally flawed, with no consideration of execution times and a limited number of trials, limiting generalizability. Though it was orders of magnitude faster than the grid search approach, having exact timestamps would strengthen the position of this paper considerably.

The generalizability of the conclusions can still be questioned; the regression problem being posed was the mapping of a value constructed by a single company. Though there is evidence to suggest that machine learning models can be applied in the domain, it has not been indisputably proven; to do that, one would have to make predictions on more general outputs, such as blood alcohol content, a far more difficult task.


8 Future work

The research was able to show that a feed-forward neural network could make non-trivial predictions on patient behaviour in the domain of alcohol abuse, and it can be argued that it avoided the naive forecasting trap. However, the model was not explicitly trained with this in mind, and it is unclear whether or not a naive forecast could become prevalent if the same model were applied in another domain. One would inevitably have to take steps to account for that when generalizing the model to other domains. One possible measure is to exclude output values when including historical data. For example, in this research, each input-output pair could be mapped as:

y_t = f(x_{t-1}, x_{t-2}, \ldots, x_{t-n})    (16)

In reality, x_{t-n} is a composite of multiple values, including the observed value of y_{t-n}. Excluding that from the model could help limit the impact of naive predictions by ensuring the model does not know what the previous value is, at the cost of a lower-information system. One area of future research is to consider a model that does not have historical outputs in its inputs.
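A small illustration of this exclusion, assuming a pandas DataFrame whose columns include the forecast target alongside other behavioural covariates (the column names below are hypothetical), might look like this.

import pandas as pd

def lagged_inputs_without_target(df, target_col, n_lags):
    # Build lagged input features while excluding past values of the target,
    # so the model cannot simply repeat the previous output value.
    covariates = df.drop(columns=[target_col])
    frames = [covariates.shift(lag).add_suffix(f"_lag{lag}")
              for lag in range(1, n_lags + 1)]
    combined = pd.concat(frames + [df[target_col]], axis=1).dropna()
    return combined.drop(columns=[target_col]), combined[target_col]

# Hypothetical usage, where 'ami_score' is the forecast target:
# X, y = lagged_inputs_without_target(patient_df, "ami_score", n_lags=7)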

A confirmation of the nature of the predictions was only considered for the feed-forward neural network, given the presence of a loss function (the main justification for considering the predictions valid). However, similar tests or examinations were not conducted for the competing models (though some argument can be made that the LSTM network was not only flawed due to its bias at its upper limit, but also made wholly naive predictions), leaving room for improvement in future work to confirm whether they, too, were performing non-trivial predictions.

Furthermore, another step that could be taken is to classify the time series independently before beginning a forecast; developing a classification algorithm that decides which type of patient each time series represents before solving the regression problem would generate additional data that could help create a stronger correlation between the existing data and predicted values. In addition, with sufficient data, one could train multiple models that identify different types of behaviour in the time series; it is conceivable that a short-term model could be used with time windows of size 7, while a long-term model could use a window as long as a year, to provide more overarching estimations about a person's behaviour.

In addition, a more serious consideration of the ensemble learner could better approximate a single strong predictor; by better considering the data sections that each model struggles with, an ensemble could be developed with this in mind. For example, four networks trained to identify four different phenomena could be aggregated into one modular network, as has been done before [7]. Another approach would be, rather than taking a flat average of predictions, to weight predictions based on their deviation from the would-be average and adjust them accordingly, such that predictions with high variance and standard deviation are down-weighted, while more stable predictions are up-weighted.
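One hedged interpretation of this weighting scheme is sketched below: each constituent's prediction is weighted by the inverse of its absolute deviation from the ensemble mean, so outlying predictions contribute less. This is only one possible reading of the proposal, and the function name and epsilon constant are assumptions.

import numpy as np

def deviation_weighted_average(predictions, eps=1e-6):
    # predictions has shape (n_models, n_samples); predictions that deviate
    # strongly from the flat average are down-weighted.
    predictions = np.asarray(predictions, dtype=float)
    flat_mean = predictions.mean(axis=0)

    weights = 1.0 / (np.abs(predictions - flat_mean) + eps)
    weights /= weights.sum(axis=0)
    return (weights * predictions).sum(axis=0)

# Example: three models forecasting four time steps.
# preds = [[60, 55, 70, 40], [62, 54, 69, 45], [80, 53, 71, 41]]
# print(deviation_weighted_average(preds))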


One goal that the paper set out to accomplish but did not achieve was generalizing the model to solve high-variance regression problems for any dataset; when testing with an independent dataset pertaining neither to the company providing the original data nor even to the same domain, a feed-forward network, proven to make useful predictions to some extent on the original dataset, failed to do so on the independent dataset. It appears that substituting datasets without a higher level of context-sensitive consideration is ill-advised, and future research in the area ought to consider how to perform data preprocessing and appropriate transformations to fit forecasting models, rather than making time-consuming and often obscure changes to the models already in place.

In short, there are several substantial ways in which further research into machine learning-based forecasting ought to be conducted in order to avoid some of the pitfalls and problems encountered throughout the writing of this paper.


References

[1] Jinbo Bi, Jiangwen Sun, Wu Yu, Howard Tennen, and Stephen Armeli. A Machine Learning Approach to College Drinking Prediction and Risk Factor Identification. University of Connecticut, 2013.

[2] Sangjae Lee and Wu Sung Choi. A multi-industry bankruptcy prediction model using back-propagation neural network and multivariate discriminant analysis. College of Business Administration, Sejong University, Seoul, Expert Systems with Applications(40):2941–2946, 2013.

[3] Arnaz Malhi and Robert X. Gao. PCA-Based Feature Selection Scheme for Machine Defect Classification. IEEE Transactions on Instrumentation and Measurement(53):1517–5125, 2004.

[4] Joseph A. Cruz and David S. Wishart. Applications of Machine Learning in Cancer Prediction and Prognosis. Departments of Biological Science and Computing Science, University of Alberta, Edmonton, 2006.

[5] Michelino De Laurentiis, Sabino De Placido, Angelo R. Bianco, Gary M. Clark, and Peter M. Ravdin. A Prognostic Model That Makes Quantitative Estimates of Cancer Patients. Clinical Cancer Research(5):4133–4139, 1999.

[6] Kanad Chakraborty, Kishan Mehrotra, Chilukuri K. Mohan, and Sanjay Ranka. Forecasting the Behavior of Multivariate Time Series using Neural Networks. College of Engineering and Computer Science, Syracuse University, 1990.

[7] Takashi Kimoto, Kazuo Asakawa, Morio Yoda, and Masakazu Takeoka. Stock Market Prediction System with Modular Neural Networks. IJCNN International Joint Conference on Neural Networks, 1990.

[8] Markku D. Hämäläinen, Andreas Zetterström, Maria Winkvist, Marcus Söderquist, Elin Karlberg, Patrik Öhagen, Karl Andersson, and Fred Nyberg. Real time monitoring using a breathalyzer-based eHealth system can identify lapse/relapse patterns in alcohol use disorder patients. Alcohol and Alcoholism, 1–8, doi: 10.1093/alcalc/agy011, 2018.

[9] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. Cornell University, 2010.

[10] Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research(12):2825–2830, 2011.

[11] Thomas H. Davenport and D.J. Patil. Data Scientist: The Sexiest Job of the 21st Century. Available online at: https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century. Harvard Business Review, 2012. [Visited 20-2-2018]

[12] MathWorks Inc. Convolutional Neural Network. Available online at: https://se.mathworks.com/discovery/convolutional-neural-network.html. [Visited 20-2-2018]

[13] Vincent Granville. Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics. Available online at: https://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning. [Visited 20-2-2018]

[14] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.

[15] James Bergstra and Yoshua Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research(13):281–305, 2012.

[16] John McGonagle, George Shaikouski, and Andrew Hsu. Backpropagation. Available online at: https://brilliant.org/wiki/backpropagation/. Brilliant Math Science Wiki. [Visited 20-2-2018]

[17] Foster Provost and Tom Fawcett. Data Science and its Relationship to Big Data and Data-Driven Decision Making. Big Data(1):51–66, 2013.

[18] James D. Hamilton. Time Series Analysis. Princeton University Press, Princeton, New Jersey, 1994.

[19] Jeffrey M. Stanton. Introduction to Data Science. School of Information Studies, Syracuse University, Syracuse, New York, 2013.

[20] Robert E. Schapire. The Boosting Approach to Machine Learning: An Overview. In Nonlinear Estimation and Classification, Springer, 2003. AT&T Labs Research, Shannon Laboratory, Florham Park, New Jersey.

[21] Briony J. Oates. Researching Information Systems and Computing. SAGE Publications, London, United Kingdom, 2006.

[22] S. Makridakis, A. Andersen, R. Carbone, R. Fildes, M. Hibon, R. Lewandowski, J. Newton, E. Parzen, and R. Winkler. The Accuracy of Extrapolation (Time Series) Methods: Results of a Forecasting Competition. Journal of Forecasting(1):111–153, 1982.

[23] G. Peter Zhang. Time series forecasting using a hybrid ARIMA and neural network model. Department of Management, J. Mack Robinson College of Business, Georgia State University, University Plaza, Atlanta, 2001.

[24] Jason Brownlee. Python Machine Learning Mini-Course - Machine Learning Mastery. Available at: https://machinelearningmastery.com/python-machine-learning-mini-course/. [Visited 20-2-2018]

[25] Adit Deshpande. A Beginner's Guide To Understanding Convolutional Neural Networks. Available at: https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/. [Visited 20-2-2018]

[26] Andy Liaw and Matthew Wiener. Classification and Regression by randomForest. R News(2):18–22, 2002.

[27] Mikolaj Binkowski, Gautier Marti, and Philippe Donnat. Autoregressive Convolutional Neural Networks for Asynchronous Time Series. Cornell University, Ithaca, New York, 2017.

[28] Mark R. Segal. Machine Learning Benchmarks and Random Forest Regression. University of California, San Francisco, 2004.

[29] J. Elith, J.R. Leathwick, and T. Hastie. A working guide to boosted regression trees. Available online at: http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2656.2008.01390.x/full [Visited 20-2-2018]

[30] D. Svozil, V. Kvasnicka, and J. Pospichal. Introduction to multi-layer feed-forward neural networks. Chemometrics and Intelligent Laboratory Systems(39):43–62, 1997.

[31] LexPredict, LLC. A decision tree case study. Available online at: https://www.lexpredict.com/2017/02/case-study-decision-tree/ [Visited 20-2-2018]

[32] Stanford University. CS231n Convolutional Neural Networks for Visual Recognition. Available online at: http://cs231n.github.io/ [Visited 20-2-2018]

[33] Thomas G. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning(40):139–157, 2000.

[34] Hussein Dia. An object-oriented neural network approach to short-term traffic forecasting. European Journal of Operational Research(131):253–261, 2001.

[35] Thanasis G. Barbounis, John B. Theocharis, Minas C. Alexiadis, and Petros S. Dokopoulos. Long-Term Wind Speed and Power Forecasting Using Local Recurrent Neural Network Models. IEEE Transactions on Energy Conversion(21):273–283, 2006.

[36] Chung-Ming Kuan and Tung Liu. Forecasting Exchange Rates Using Feed-Forward and Recurrent Neural Networks. Bureau of Economic and Business Research, College of Commerce and Business Administration, University of Illinois, 1993.

[37] Dominique T. Shipmon, Jason M. Gurevitch, Paolo M. Piselli, and Steve Edwards. Time Series Anomaly Detection: Detection of Anomalous Drops with Limited Features and Sparse Examples in Noisy Highly Periodic Data. Google, Inc., Cambridge, MA, USA, 1990.

[38] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning Long-Term Dependencies With Gradient Descent is Difficult. IEEE Transactions on Neural Networks(5), 1994.

[39] Jasper Snoek and Hugo Larochelle. Practical Bayesian Optimization of Machine Learning Algorithms. School of Engineering and Applied Sciences, Harvard University, 2012.

[40] Wei Peng, Juhua Chen, and Haiping Zhou. An Implementation of ID3 - Decision Tree Learning Algorithm. University of New South Wales, School of Computer Science & Engineering, Sydney, 1994.

[41] Tae Jun Lee, Justin Gottschlich, Nesime Tatbul, Eric Metcalf, and Stan Zdonik. Greenhouse: A Zero-Positive Machine Learning System for Time-Series Anomaly Detection. Brown University, Providence, Rhode Island, United States, 2018.

[42] Esra Akdeniz, Erol Egrioglu, Eren Bas, and Ufuk Yolcu. An ARMA Type Pi-Sigma Artificial Neural Network for Nonlinear Time Series Forecasting. Journal of Artificial Intelligence and Soft Computing Research(8):121–132, 2018.

[43] Cheng Li, Sunil Gupta, Santu Rana, Vu Nguyen, Svetha Venkatesh, and Alistair Shilton. High Dimensional Bayesian Optimization Using Dropout. Centre for Pattern Recognition and Data Analytics, Deakin University, Australia, 2018.

[44] Xingjian Shi, Zhourong Chen, Hao Wang, and Dit-Yan Yeung. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, 2015.

[45] Deep Learning 4 Java. A Beginner's Guide to Recurrent Networks and LSTMs. Available online at: https://deeplearning4j.org/lstm.html [Visited 20-2-2018]

[46] Peter Gleeson. The Curse of Dimensionality. Available online at: https://medium.freecodecamp.org/the-curse-of-dimensionality-how-we-can-save-big-data-from-itself-d9fa0f872335

[47] Karl Rosaen. K-fold cross-validation. Available online at: http://karlrosaen.com/ml/learning-log/2016-06-20/5 [Visited 20-3-2018]

[48] Google Inc. TensorFlow. Available online at: https://www.tensorflow.org [Visited 20-2-2018]

[49] J.P. Connor, M. Symons, G.F.X. Feeney, R. McD. Young, and J. Wiles. The Application of Machine Learning Techniques as an Adjunct to Clinical Decision Making in Alcohol Dependence Treatment. Substance Use and Misuse, 42:14, 2193–2206, 2007.