UNIFORM INTERVAL NORMALIZATION: DATA REPRESENTATION …

UNIFORM INTERVAL NORMALIZATION:DATA REPRESENTATION OF SPARSE AND NOISY

DATA SETS FOR MACHINE LEARNING

Simon Sävhammar

Supervisor: Jonas MellinExaminer: Joe Steinhauer

Master Degree Project in Informaticswith a specialization in Data Science

Spring term 2020

AcknowledgmentsI would like to extend sincere thanks to the researchers at Sahlgrenska University Hospital for their work and for giving me access to the MoodMapper data set:

Steinn Steingrimsson, Docent and Specialist in Psychiatry, Medicine doctor, Psykiatri Affektiva, Sahlgrenska University Hospital

Ulla Karilampi, Researcher PhD/Psychologist MSc, Kunskapsnätverket Digitalisering och Innovation, Sahlgrenska University Hospital.

I am also grateful to Stefan Andersson (www.safebase.se) and Verklighetslabbet via Susanna Bjälevik (www.verklighetslabbet.se), for their work and for giving me access to the Safebase data set.

http://www.verklighetslabbet.se/

This Master Thesis constitutes a part of the project: BaltSe@nioR 2.0: Innovative solutions to supportBSR in providing more senior - friendly public spaces due to increased capacity of BSR companies and

public institutions, part-financed by the European Union (European Regional Development Fund)within INTERREG Baltic Sea Region Programme.

AbstractThe uniform interval normalization technique is proposed as an approach to handle sparse data and tohandle noise in the data. The technique is evaluated transforming and normalizing the MoodMapper

and Safebase data sets, the predictive capabilities are compared by forecasting the data set with aLSTM model. The results are compared to both the commonly used MinMax normalization technique

and MinMax normalization with a time2vec layer. It was found the uniform interval normalizationperformed better on the sparse MoodMapper data set, and the denser Safebase data set. Future works

consist of studying the performance of uniform interval normalization on other data sets and with othermachine learning models.

keywords: Multivariate time series, forecasting, machine learning, LSTM, data representation,fuzzification

Table of Contents1. Introduction............................................................................................................................................42. Background............................................................................................................................................43. Problem Definition...............................................................................................................................104. Method.................................................................................................................................................11

4.1 Scientific Method..........................................................................................................................114.2 Planning.........................................................................................................................................114.3 Validity..........................................................................................................................................124.4 MoodMapper.................................................................................................................................134.5 Safebase.........................................................................................................................................154.6 uniform interval normalization......................................................................................................164.7 LSTM............................................................................................................................................16

4.7.1 LSTM 1..................................................................................................................................174.8 LSTM+t2v.....................................................................................................................................184.9 Experimental setup .......................................................................................................................20

4.9.1 Metric.....................................................................................................................................205. Results..................................................................................................................................................21

5.1 MoodMapper.................................................................................................................................225.2 Safebase.........................................................................................................................................355.3 Analysis.........................................................................................................................................39

6. Discussion............................................................................................................................................406.1 Effects of the application of uniform interval normalization........................................................406.2 Generalization...............................................................................................................................416.3 Future work...................................................................................................................................416.4 Validity..........................................................................................................................................42

7. Ethic.....................................................................................................................................................428. Conclusion...........................................................................................................................................43

1. IntroductionBipolar disorder affects about two percent of the world's population (Merikangas et. al. (2017). A person suffering from Bipolar disorder experiences mood phases ranging from depression to manic episodes. The recurring manic phases or phases of depression have a major impact on a persons life quality, but early pharmacological intervention can reduce the negative effect of the relapses. However,patients themselves are often unaware of their need of treatment, and monitoring through routine checkups are often not frequent enough (Antosik-Wójcińska et. al., 2020). Antosik-Wójcińska et. al. (2020), reviews the usage of smartphone as a monitoring tool, with machine learning techniques as a predictive tool, and found it to be an area with potential. The hospital of Sahlgrenska in Gothenburg, collected data from individuals suffering from bipolar disorder through a smart phone application called MoodMapper, and in this study how well the data can be used to generate models with high predictability is explored. The data in the MoodMapper data set consists of several multi-variate time series, one multivariate time series per individual who participated in the data collection. The MoodMapper data set is sparse, meaning there are often long periods of time between two events, and the data set contains noise, caused by errors in the sensors. For example, the accelerometer measuring the steps a participant takes, shuts off when the battery level is low, leading to no data being registered by the server. Sparse and noisy data sets are fairly common in the field of psychiatry (Cearns, M, Hahn,T. & Baune, B.T, 2019), and increases the difficulty for deep learning models to identify patterns and therefore decrease the models performance, as well as reduces the reproducibility of research in the field.

The purpose of this study is to examine if reducing the sparsity and noise by fuzzifying the data, could improve the performance of a machine learning model, with respect to the predictability. When the machine learning model is used to forecast a time series. A technique called uniform interval normalization, is used to fuzzify the MoodMapper and Safebase data sets. The machine learning model selected is a Long Short Term Memory (LSTM) neural network. To improve the models ability to detect seasonal patterns, a time2vec layer is applied to the MinMax normalized data, and compared to the other two approaches, uniform interval normalization and MinMax without time2vec. The uniform interval normalization technique is also applied to the Safebase data set, which contains data from bed scales, installed on beds of patients with dementia. The data in the data set is dense with a low amount of noise, and is likely to contain patterns. The Safebase data set is included to examine how uniform interval normalization perform on a dense data set.

The results shows that the fuzzified data, is easier to predict for both data sets, but the effect is greater on the MoodMapper data set than the Safebase data set.

In the following chapter, the general techniques included in the study are explained, the third chapter the research problem is defined, the fourth chapter contains the details of the implementations, the fifth chapter presents the results, the sixth chapter the discussion and further works, the seventh chapter discusses ethical questions and the eight chapter contains the conclusions of the study.

2. BackgroundBipolar disorder is a chronic disease with a high mortality rate, a person who suffers from bipolar disorder experience fluctuations between different mood phases ranging from manic phases to phases with low activity and depression. The recurrent nature of the phases has a severe impact on the quality

1

of life for persons with bipolar disorder, but if medication can be given in time decreases the severity ofthe symptoms and reduces the risk an episode converts to full-blown illness (Antosik-Wójcińska, Dominiak, Chojnacka, Kaczmarek-Majer, Opara, Radziszewska, Olwert & Święcicki 2020).

The MoodMapper data set was created by collecting data with a smart phone application, from volunteers with bipolar disorder. The data set consists of several multivariate time series with data of for example, phone calls, text messages and number of steps taken, if it is possible to forecast the activity of the participants, it could be possible to identify patterns in the behavior of the participant signaling the beginning of a relapse and distribute mood stabilizing drugs in order to prevent or reduce the effects of the relapse.

2.1 Statistics and machine learning

A time series is a sequence of observations recorded at different times and ordered by the time they occurred. In a univariate time series each observation is a single data points, while in a multivariate time series each observation is a vector of data points with known or assumed interrelation. Univariate and multivariate time series are a commonly found in various areas e.g., economics, health care analysis, real estate, etc. Time series are studied for several purposes for example to compare trends and examine the increase or decrease of a variable over time, to identify cyclic or seasonal patterns and to do time series forecasting.

Time series forecasting is a well studied area within data science and statistics. To forecast is according to the Merriam-Webster dictionary (https://www.merriam-webster.com/dictionary/forecast) “to calculate or predict (some future event or condition) usually as a result of study and analysis of available pertinent data” and to indicate as likely to occur. Forecasting of time series data is then to predict future data based on the available historic and current data. Forecasting of time series is according to Faloutsos, Flunkert, Gasthaus, Januschowski and Wang (2019) one of the most sought-after and difficult data mining task, since it has a high importance in industrial, social and scientific applications. Example applications of forecasting are inventory management by forecasting product supply and demand, managing the distribution of electricity by forecasting the production of wind turbine power (Zhang, Yan, Infield, Liu & Lian, 2019) and predicting which patient would develop Alzheimer's disease during a time period (Hong, Lin, Yang, Cai & Clawson, 2020). Common approaches to time series forecasting has been statistical methods such as linear regression and Autoregressive Integrated Moving Average (ARIMA) but in the last three decades, machine learning models such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN) have successfully been used to forecast time series and found to be competitive to the classical statistical models (Bontempi, Ben Taieb & Le Borgne, 2013).

Over the last decade LSTM has become a popular approach for time series forecasting. For example Zhou, Fang, Xu, Zhang, Ding, Jiang and Ji (2020), compared a LSTM network with ARIMA to predict the energy consumption of air-conditioning systems. The results showed that LSTM outperformed ARIMA, based on the Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). Another example is Meng, Liu, and Wu (2020), who compared ARIMA and LSTM on the task of predicting the rice yield. The results of the study showed that LSTM performed better than ARIMA. However Yamak, Yujian, and Gadosey (2019) found ARIMA to perform better than LSTM when predicting the exchange rate from Bitcoin to Dollars, which the authors claim could be caused by several factors, for example the low number of data points in the data set which could negatively impact the performance of LSTM.

2

LSTM is a type of Recurrent Neural Network (RNN). A RNN uses a hidden state to keep information from previous states while going forward over the input sequence. The current state in a RNN takes thehidden state from the previous step and the current data from the input sequence as input to update and to output the next hidden state.

Figure 1 shows a diagram of a RNN cell, which demonstrates how a RNN functions. The input data X is inputted in to the cell, the data is then processed by the activation function tanh. The resulting output is passed back to the cell, and used together with the next input. This is the recursion of the RNN, data from the previous step is reused in the current step. The output of the RNN, Y, is obtained by passing data from the RNN cell to an output layer. Figure 2 demonstrates how the above RNN cell, is rolled outover several time steps.

3

Figure 1:Diagram of a RNN cell. X is the input to the cell, ftanh is the activation function,tanh is commonly used in RNN. Y is the outputof the cell and the arrow from and to the cell represents the data in the output that is used in the input.

X

Y

ftanh

Figure 2 illustrates and example of how a RNN computes the output Yt from the input sequence X. The input sequence is a sequence of data points ordered sequentially. The hidden state is an internal representation, used by the RNN to remember data from the previous time step. At the first step in the input sequence, the input and an initial hidden state are inputted into the RNN cell, to calculate the nexthidden state h1. At the next time step, the previous hidden state h1 and the input x2 computes the next hidden state. The process repeats until all inputs in the sequence has been processed and the last hidden state ht is the output of the RNN. The last hidden state ht is passed to the output layer, in order to outputYt.

For example, if a RNN is used to predict the last word of a sentence, based on the previous words in thesentence. The input sequence is the words in the sentence, in a sequential order and the input at the firststep, is the first word in the sentence and the input at the last step is the second to last word in the sentence. The hidden state is a vector representation of the output, at the previous step. To predict the last word, the hidden state at the last step, is passed to the output layer.

Another approach is to output the hidden state at every step instead of only the last step, for example to predict the next word in a sentence based on the previous word.

RNN uses the gradient to carry information about the error between the target value and the output value, from the output layer to the input. The information is then used to update the weights in the network. However a common problem in RNN is the exploding or vanishing gradient problem, which occurs when the gradient either grow exponentially or shrinks exponentially towards zero (Bengio, Simard & Frasconi, 1994; Pascanu, Mikolov & Bengio, 2013). To solve the problem of the exploding

4

Figure 2: Illustration of a RNN cell adapted from Li, Johnson and Yeung (2017). The RNN cell is rolled out over t time steps, with an output at the last time step. The hidden states h are represented as rectangles, the recursive cells f as squares, input x as circlesand output y as a circle.

The weight matrices are not included in the illustration.

or vanishing gradient and to improve the long term memory of RNNs, LSTM was proposed by Hochreiter and Schmidhuber (1997).

A LSTM network consists of a series of LSTM cells (figure 3). The differences between a RNN and a LSTM network, is that in the LSTM network each cell has a cell state, which stores data over several time steps and gates which controls the data being written and read from the cell state. The gates are celled the input gate, the forget gate and the output gate.

Figure 3 shows a diagram of the LSTM cell the input Xt and previous hidden state ht-1, is passed to eachof the four gates. Where ft, is the forget gate, regulating what is removed from the cell state, it is the input gate, which decides what to write to the cell state, gt is the activation function and ot is the output gate. The output gate determines what data from the input and previous hidden state, that should be multiplied with the cell state, to generate the next hidden state.

f t=σ(W f∗x t+U f ∗h t-1+b f ) (1)

it=σ(W i∗x t+U i∗ht-1+b i) (2)

5

Figure 3: Diagram of LSTM cell based on Gruet, Chandorkar, Sicard and Camporeale (2018) and Li, Johnson and Yeung (2017). The input X is stacked with the previous hidden state h, and the passed to the four gates, f is the forgetgate which determines what in the cell state C to forget, i is the input gate which determines what to write to the cell state, g is the activation similar to a RNN and o is the output gate which determines what to output.

ot=σ(W o∗x t+U o∗h t-1+bo) (3)

The output of the forget gate ft is calculated by equation 1, where W and U are weight matrices and b is the bias, xt is the input at step t and ht-1 is the hidden state from the previous step and sigma is the sigmoid function. The result from the gates are vectors with the same dimensionality as the cell state. The vectors consists of values between zero and one where values close to one indicates the position should be kept while values close to zero indicates it should be deleted.

Equation 2 and 3 shows the calculation of the input gate and output gate. The input gate decides how much of the input should be added to the cells state and the output gate how much of the cell state should be used as output from the LSTM cell.

g t=tanh(W g∗x t+U g∗h t-1+bg) (4)

Equation 4 is the cell activation function, tanh, which is the same as in a RNN.

c t= f t ∘ ct-1+i t∘ g t (5)

Equation 5 defines the cell state and is elementwise multiplication of the previous cell state ct-1 and the forget gate ft added with elementwise multiplication of the input gate it and cell activation gt.

h t=ot ∘tanh (c t) (6)

The hidden state and output of the LSTM cell is calculated by elementwise multiplication of the output gate ot and cell state ct as (equation 6).

LSTM networks has been in several different field, for example, Hosseini and Sarrafzadeh (2019) used LSTM auto-encoders in order to detect early signs of negative health events. The data set was from the Preterm Infant Cardio-Respiratory Signals Database (PICS), and the data is collected from ten infants with slower than normal heart rate (Bradycardia). Since the data was sparsely annotated, an unsupervised learning model was implemented. The result of the study showed the LSTM model to outperform the anomaly detection models used for the comparison.

The LSTM cell handles an input sequence in a similar way as a RNN, and can be rolled out over time in the same was as the RNN in figure 2.

2.2 Sparsity, noise and fuzzification

During data collection with smart phones, events could occur that introduces uncertainties in the data. Yang, Hanneghan, Qi, Deng, Dong and Fan (2015) discusses uncertainty in the context of life-logging of physical activity data and defines two categories of uncertainties, Irregular Uncertainty (IU) and Regular Uncertainty (RU). The IU is random, could be accidental or caused by misuse of the application used for data collection, and considerably affects the efficiency and accuracy of the data analysis. The RU is frequent, persistent and can not be completely removed, e.g., errors in the devices sensors used for measurement or when the person using the application would measure activity in a new environment.

The uncertainties introduced during the data collection, leads to sparsity and noise being added to the data. Where sparsity means there are several missing observations, and noise means irregular behavior. For example, if a smart phone is used to measure the number of steps a user takes during a day, sparsitywould be introduces to the data, when the used walks around without the phone. The steps taken by the user should have been recorded, but since it is not the data will be filled with empty observations

6

during those time points. Noise could be caused by infrequent step counts. For example, if the steps taken are measured on a weekly basis, and the user goes on a week long hiking vacation. The week during the vacation, will be be different from the normal behavior of the user. Another cause of noise could be is a user shakes the phone in order to inflate the step count. In both cases noisy data points are introduced in to the data, which are likely difficult to predict.

Fuzzification is a technique that transforms data and that can reduce the effect of sparsity and noise (Ohlbach, 2004).

Figure 4 displays an example of fuzzification, the figure contains two series of observations, of the same data. Where one variable (e.g., time) is placed on the x-axis, the value of the measured variable is placed on the y-axis and the dots the observed values at a given time. The above series is without fuzzification, and contains both sparsity and noise. The sparsity is time points without any observed values and noise is irregular measurements. In the bottom time series the data is fuzzified by dividing it, into three intervals and the measuring the total number of dots within the intervals. Each interval canbe represented either as a single data point, or the same value repeated over the full interval. The fuzzified time series can handle the irregular distribution of data points in the last two intervals and identify the same level of activity in both.

7

Figure 4: Example of fuzzification. The above diagram is before fuzzification and the bottom diagram after fuzzification.

Figure 5 shows two time series, measuring the steps taken of an individual during 24-hours. Where time is placed on the x-axis, starting from midnight and ending at 23.00, and the number of steps taken on the y-axis. The top time series is the steps taken during Monday and the bottom time series the number of steps taken during Tuesday.

During Monday the individual is following the normal behavior., there are no steps taken during the night hours, and the first steps are registered at 06.00. At 10.00 there is a peak of activity and varying activity during the rest of the day. However during Tuesday the individual have some activity that is different from the normal behavior. The individual wakes up one hour later than usual, and the peak shifts from ten to eleven. During the afternoon the individual, does not register any steps taken, from 16.00 to 17.00, either the individual did not carry the phone or did not take any steps.

Fuzzification can be used to smooth out the data points and reduce the effect of sparsity and noise in the data.

8

Figure 5: Example of number of steps taken by an individual during 24 hours. The top time series is the steps taken on Monday, and the bottom time series the number of steps taken during Tuesday.

Figure 6 shows the same time series as in figure 5, but in fuzzified. Since the data points has been smoothed over a longer interval of time, the difference between the two days are less clear. The effect of that the individual woke up one hour later on Tuesday than Monday, is not as noticeable in the fuzzified time series, the peak happens in the interval before lunch. During the afternoon, the interval from 15.00 to 17.00 is noticeable lower during Tuesday than Monday, since the missing data points means less activity was observed. However the problem of sparsity is removed, by smoothing the neighboring over the missing time points.

3. Problem DefinitionThe sparsity and noisiness of the MoodMapper data set increases the difficulty for deep learning techniques to identify patterns and to perform accurate forecasting of the multivariate time series. The aim of this study is therefore to alleviate the problem of sparsity and noisiness by examining, how transforming the MoodMapper data set with the uniform interval normalization technique, affects the predictability of the data set.

The sparsity of the MoodMapper data set has mainly two causes, first the infrequency of the events for some of the features. When the time series is sampled, the features with infrequent observations, will have a high number of empty (zero valued) data points. The second cause of sparsity is when the application is shutoff by in order to conserve battery. During the time the application is off, no data is recorded and instead empty data points are added to the data set. The data set contains noise which is

9

Figure 6: Fuzzification example of number of steps taken by an individual during 24 hours. The top time series is the steps taken on Monday, and the bottom time series the number of steps taken during Tuesday.

difficult to discern from the normal data points.

The causes of the noise is when sensors has incorrectly recorded data and irregularities in the behavior of a participant. Example of sensor errors are when the tilt of the smart phone affects the accelerometer which count the steps taken (Kannan, 2015), and moving between the border of two cell towers could lead to a ping pong effect (Iovan et. al., 2013) which inflates the count of cell tower connections. Examples of irregularities in the behavior of a participant, is when a participant goes on a hiking trip, which inflates the number of steps compared to other days, or if a participant shakes the phone in order to increase the number of steps.

In order to accurately forecast based on the MoodMapper data set, it is necessary to handle sparsity andnoisiness of the data set. The uniform interval normalization smooths data over time periods, and thereby handles the sparsity and noise by fuzzifying the data set. The fuzzification of the data could lead to a more effective deep learning model with regards to predictability. Since the model do not haveto pinpoint exact times events occur, but instead predict the level of activity over a time interval.

To evaluate the effect uniform interval normalization has on the predictability of the data set, several different configurations of the uniform interval normalization's interval lengths and overlap lengths is applied to the data set. A LSTM neural network is used to forecast the transformed time series, with different configurations for prediction interval. The results of the forecasting with the uniform interval normalization time series will be compared with forecasting of the data set when both normalized with MinMax scaling and when transformed to a set of sine functions with a time2vec layer (Kazemi, et. al 2019). The hypothesis is that fuzzifying the MoodMapper data set, by applying the uniform interval normalization technique, would increase the performance of the used LSTM model, with respect to predictability. Compared to both when the data is normalized with MinMax normalization and when the data is normalized with MinMax normalization and a time2vec layer is added. The predictability is measured, by the RMSE metric and the Mean Arctangent Absolute Percentage Error (MAPE).

The studied research questions are:

(R1) What is the effect with respect to predictability of activity level, when the sparsity and noisiness in activity of daily life data set is handled by fuzzification?

(R2) Is there a noticeable effect with respect to predictability of activity level, when a dense data set with a low amount of noisiness is fuzzified?

(R3) How do the number of hours in each fuzzified interval, affect the predictability of the activity level?

(R4) Does the predictability of the activity level increase when the intervals are overlapped?

The objectives of the study is to:

(O1) handle validity threats

(O2) evaluate interval configurations, by varying the uniform interval normalization, interval length parameter

(O3) evaluate step size configurations, by varying the uniform interval normalization, step size parameter

(O4) compare the effect of applying uniform interval normalization, with MinMax normalization (base line) and MinMax normalization with a time2vec layer, with respectto predictability

10

(O5) examine the effect of uniform interval normalization on the Safebase data set with respect to predictability

4. Method

4.1 Scientific Method

To determine if the hypothesis of applying the uniform interval normalization technique on the MoodMapper data set increases the predictability is true or not, experiments must be performed and theresults compared to both a default normalization technique (i.e., MinMax normalization) and time2vec (Kazemi et. al., 2019). Time2vec has been proven to be an effective technique for handling periodicity in data. The correctness of the hypothesis cannot be decided by other scientific methods, for example, conducting a survey, since the uniform interval normalization technique has not been used previously and the MoodMapper data set has not been analyzed with regards to predictability. The experiment wasplanned according to the guidelines suggested by Wohlin et. al. (2012), with the goal to analyze how applying uniform interval normalization affects the predictability of the MoodMapper data set. The main study object of the experiment is the uniform interval normalization technique, with MinMax normalization and time2vec as the baseline.

The purpose is to evaluate the effect the application of uniform interval normalization has on the predictability of the MoodMapper data set. Compared to the other two techniques, MinMax normalization and time2vec. The perspective of the experiments is from the view of a data analyst and researcher, who examines if there is a significant difference in the predictability when uniform interval normalization is applied. The context of the experiment is an offline environment, the data sets of MoodMapper and Safebase, and the techniques are implemented in Python and Keras (Chollet, 2015) running on Tensorflow (Abadi, et. al., 2016).

4.2 Planning

The experiment is planned to examine the effect of applying uniform interval normalization, on both the MoodMapper data set, which has a higher degree of sparsity and noisiness, and the Safebase data set which has a lower degree of sparsity and noisiness. The Null hypothesis is that applying uniform interval normalization on the MoodMapper data set will not increase the performance of the LSTM model with regards to predictability compared to the other two techniques, MinMax normalization and time2vec. The alternative hypothesis is that applying uniform interval normalization will increase the predictability of the MoodMapper data set compared to the other two techniques. The smoothing of data points over longer time intervals introduces a level of fuzziness to the data which could reduce the difficulty of the forecasting task.

The independent variables are the choice of data set, the normalization techniques and the hyper parameters tuned in the LSTM model. The sample size is an independent variable, because it varies both per participant and data set, and affects the performance of the LSTM model. While the dependentvariable is the predictability measured by calculating the RMSE of the approaches. The experimental design is one factor with three treatments with a randomized blocked design, the factor is the predictability of the MoodMapper data set and the three treatments are the uniform interval normalization technique, MinMax normalization and time2vec. Each treatment is applied to all subjects, but since the order in which the treatments are applied, is not important there is no

11

randomization in order of application. The treatments are balanced in the sense that an equal amount of data points are available for each technique and configuration.

The LSTM models and time2vec layer are implemented with Python 3.7 and the Keras API (Chollet, 2015) running on Tensorflow (Abadi, et. al. 2016), and the normalization methods are implemented in Python.

The objectives during the experiment are to clean the data of invalid numbers (nan) and obvious errors for example, negative number of steps, implement the necessary software, tune the neural networks, do forecasting with LSTM for each of the three techniques, uniform interval normalization, MinMax normalization and time2vec, calculate the RMSE and finally to compare and visualize the results.

4.3 Validity

Threats to the conclusion validity of the study are low statistical power, fishing and the error rate and reliability of measures. Low statistical power affects this study since the regularization techniques and the initialization of the weights in a neural network uses random number generators. Which leads to that the results of the same configuration could vary over several runs.

Fishing and the error rate has two separate parts. The first part is when searching or fishing for a result, the researcher then influences the result by for example selecting settings that leads to a specific outcome. The second part is about the significance level and the need to adjust it when multiple analysis are conducted. Fishing and the error rate is a threat to the validity, since the hypothesis assumes uniform interval normalization will increase the predictability of the data set. Compared to the other two techniques, MinMax normalization and time2vec. This could lead to results being interpretedin favor of the desired outcome or parameters tuned in favor of uniform interval normalization.

Threats to the construct validity includes mono-operation bias, mono-method bias and restricted generalization across constructs. The of mono-operation bias, is if only the MoodMapper data set is used, it is not possible to generalize the effect on predictability to other data sets. A similar threat is the mono-method bias if the effect on the predictability is measured by one measurement it might bias the results towards a certain technique. The last identified threat to the construct validity is restricted generalization across constructs. Since uniform interval normalization reduces the number of data points by grouping them into one data point, it could reduce the possibility to generalize the technique to other data sets even if the effect on the predictability is higher than the other techniques.

A threat to the external validity is the interaction of setting and treatment, if the predictability is low forall techniques it could indicate LSTM is not suitable for forecasting of time-series.

12

Table 1: Summary of validity threat and how they were handled. First column name of the validity threat, second column how they were managed.

Validity threat Handled by

Low statistical power Each configuration run over ten trials.

Fishing rate and error rate Information of LSTM, layers and tuning of hyper parameters.

Mono-operation bias Including two data sets in the experiment.

Mono-method bias Usage of two measures

Restricted generalization across constructs Partially a threat (reduces sample size)

Interaction of setting and treatment Brief survey of LSTM and time series forecasting.

Table 1 summarizes the validity threat (column one) and how they were managed (column two). The threat of low statistical power was handled by running each configurations ten times. The variance for most of the configurations was low, and an identical experimental setup should yield similar results. However since the seed was not set manually, it is not possible to exactly reproduce the results.

The validity threat of fishing and the error rate was handled by providing information of the LSTM models used and the configurations of the hyper parameters in chapter four. The information can be used to implement an identical model and tune it with the same hyper parameters, in order to analyze a data set. However since a permission is required to access the MoodMapper and Safebase data set, it might not be possible to use the same data.

The validity threat of the reliability of was handled by including the mean of the measures in tables, to avoid a visual comparison of the box plots. In the cases where one configuration is not clearly better than another and a statistical test is needed, they are assumed to be tied. The threat of inadequate preoperational explication of constructs was managed by defining predictability as how well a data point can be predicted, based on a number of previous data points, and where the performance is measured by RMSE and MAAPE. Two data sets are used in the experiment one sparser and with more noise (MoodMapper) and one with more dense data and with less noise, in order both to examine how uniform interval normalization performs on data sets with different characteristics and to manage the mono-bias operation threat. To handle the mono-method bias threat two measures were used, RMSE and MAAPE. The threat of restricted generalization across construct is partially a threat, since uniform interval normalization reduces the number of data points. Even though the technique performs better than MinMax and time2vec, the reduction of data points excludes some of the participants. However the number of data points is generally a limiting factor for machine learning models, if the sample size is not large enough the model cannot learn patterns in the data. The interaction of setting and treatment,several articles were found where LSTM had successfully need used to forecast time series, and in several cases it was the best performing technique.

4.4 MoodMapper

The MoodMapper data set was created in a previous study by gathering data from a smart phone application and contain data from eight participants, seven of the participants have Bipolarity disorder and one participant is a control participant. The time period the participants used the application varies

13

between them and ranges between four and 22 months. During the period the participants used the application, it recorded and sent data about the number of steps taken, the duration of calls measured in seconds, the direction of the call (incoming call or outgoing call), number of characters used in incoming and outgoing text messages, which cell tower the phone is currently connected to, and when the screen turns on and off. The time the event occurred is registered as a timestamp with the year, month, day, hour, minute and second.

The MoodMapper data set is loaded from several files and transformed to one multivariate time-series per participant. During transformation process the data for the different features is sampled with the granularity of one hour and aggregated into a multivariate time series. This introduces several empty records (figure 7).

14

Figure 7: Example of data set after the features have been aggregated into a data table for one participant, several empty records are introduced at time stamps the participant is lacking activity for a feature.

Table 2: Binning method per feature.

Feature Bin method

Call duration (incoming) Sum

Call duration (outgoing) Sum

Text characters (incoming) Sum

Text characters (outgoing) Sum

Cell tower Count

Screen (on) Count

Screen (off) Count

Step count Sum

Table 2 shows the binning method applied per feature, where the first column is the feature and the second column is the binning method. Sum means the values for the recorded events where summed upinto one record for each hour, i.e., if a participant during an hour spent five minutes on the phone and then ten minutes, the sum for the hour will be 15 minutes. The count was calculated by counting the number of records during an hour, i.e., if a participant during an hour switches on the screen four times it creates four rows in the data set, after binning there will be one row with the value of four for the hour.

15

Table 3: Summary of the MoodMapper data set after preprocessing. The first column is the participant id, second column the number of months the participant used the application, the third column the number of streams in the time series (features) and the last column the number of data points (rows).

User id Months Feature count Data points (count)

QPJ42J2B9MPB(sb001)

6* 7 4 080

JJ4HF7RHBC8N (sb002)

11 7 8 568

RQDPZADXJ8CP(sb003)

6 6 4 560

ATRDDP78WJH7(sb004)

4 7 3 408

RYY7EQWJ7WDB(sb005)

4* 6 4 393

W8Z3K4B2DEWE(sb006)

22 7 16 224

DTAQMMDQWMCF(sb007)

4 8 2 784

CCEZZCCXKRAK(sb008)

8 8 6 528

* The participants had a few data points at a date several months after the last continuous date. The data at the unconnected months were removed and is not included in the count.

Table 3 summaries the preprocessed data. The first column contains the user identity assigned to each participant, the identity does not hold any personal information that could be used to identify the person. The second column contains the number of months the participant used the application to record data. The third column is the number of features for which data was recorded, the fourth columnis the number of rows in the data set after preprocessing.

Data from four of the eight participants was selected from the MoodMapper data set, the criteria for selection is the amount of available data points. Since the time period the participants have used the application and the frequency of observations varies between them, the number of recorded data points varies between the participants. Neural networks suffers from overfitting issues when the sample size islow (Chowdhury, Dong & Li, 2019), therefore if participants have too few data points it might not be possible for the LSTM model to do any reasonable predictions. Whereas a higher number of data pointsimproves the training of LSTM neural networks compared to a lower number of data points, and increases the possibility the LSTM model will be able to accurately predict the patterns of the participants. Therefore participants with less than one thousand data points after the data set has been preprocessed and normalized with uniform interval normalization were excluded from the study. The selected participants to be included in the experiment are JJ4HF7RHBC8N, RQDPZADXJ8CP, W8Z3K4B2DEWE and CCEZZCCXKRAK.

4.5 Safebase

The Safebase data set consists of data from bed scales, installed on beds for patients with dementia.

16

Four scales are installed on each bed and the scale weights are measured with the granularity of a second.

Table 4: Summary of the Safebase data set. First column the identification number of the bed, column the number of features (scales) and the third column the number of data points.

Identification Number of features Number of data points

141123 4 13 892 320

341767 4 15 40 48 28

406322 4 14 425 489

465792 4 15 423 100

466438 4 14 750 442

466440 4 14 934 652

466497 4 15 367 091

479449 4 15 463 076

513067 4 15 441 978

Table 4 shows the summary of the Safebase data set. The first column contains the identification number, that identifies each bed in the data set. The identification number cannot be used to identify a patient, and is only used to map data to the bed it was collected from. The second column displays the number of features (i.e., scales) of each patient, which is four since each bed has one scale installed on every leg. The third column shows the number of data points for each patient.

One patient is selected from the data set, the selection is done arbitrary, by picking a identification number. From the chosen patient a period of two weeks are extracted, by finding a connected sequence with a low amount of missing data. In order to evaluate the effect on predictability, the application of uniform interval normalization has on a dense data set with a low amount of noise. The data is then sampled on the granularity of a minute, and by averaging all values during each minute. The sampling is done, in order to reduce the number of data points and shorten the time it takes to tune the parametersand the training time of the LSTM model.

4.6 Uniform interval normalization

The uniform interval normalization handles the sparsity and noise in the the multivariate time series by smoothing events over longer time intervals and transforming the data to activity per time period. This reduce the effect daily events happening on slightly different times has on the predictability and fills in the gaps between events that occurs less frequent than other events. The values for each feature in an interval are summed up and then divided by a normalization factor, resulting in one value per feature for the time period.

Uniform interval normalization has three parameters the length of the intervals, the steps between the start of intervals (which determines the overlap of intervals) and the normalization factor.

17

Uniform interval normalization divides a time series into uniform intervals, and then applies a low-passfilter to filter the intervals. Figure 8 shows a diagram of the intervals of uniform interval normalization,where L is the length of an interval and S is the step size, which determines the distance from the start of one interval to the start of the next interval. The length of intervals is measured in time steps, and decides the time a data point will be smoothed out over. Increasing the length of intervals increases the effect on the sparsity of the data set since the number of timestamps without events are reduced. The step size determines how many time steps there are between the start of two intervals, higher values forthe step size parameter creates less overlap of the intervals but reduces the number of data points in the normalized data set more than a lower step size.

A low-pass filter, allows lower frequencies to pass through, but filters out higher frequencies (Smith, 2008). Figure 9 shows a diagram of a low pass filter, where the frequency is on the x-axis, the gain on the y-axis, fc is the cutoff point and the box represent the passband. If the incoming signal has a frequency that is higher than the cutoff point, it will be multiplied with a gain of zero, which effectivelycancels it out. When the frequency is lower than the cutoff point it will be multiplied with one, which means it is allowed to pass through the filter.

18

Figure 9: Diagram of a low-pass filter, adapted from Smith (2008).

Figure 8: Diagram of the intervals of uniform interval normalization, L is the length of the interval, and S is the distance from the start of one interval to the start of the next interval.

L

S

y (t)= x (t)+ x( t−1) (7)

According to Smith (2008) the simplest low-pass filter is given by equation 7, where y(t) is the output amplitude at time t and x(t) is the input amplitude at time t. The output amplitude of the filter, is the current input amplitude, added with the precious input amplitude.

y (t)=x (t )+x (t−1)+x (t−2)+x (t−n)

norm(8)

Uniform interval normalization, filter values according to equation 8, where n is the length of an interval, and norm is the normalization factor.

Uniform interval normalization is applied to the MoodMapper and the Safebase data sets, which are multivariate time series and the normalization factor is calculated on a per feature basis. Where the highest value for the feature over all intervals are used.

V n(t )=

∑t= l

u

vtn

V maxn for all t ∈T (9)

Equation 9 shows how the feature vector V n with n features is calculated for a participant. Where l

is the time point at the start of an interval and u is the last timestamp included in the interval, v in is

the feature vector at time step i, V maxn is a feature vector with the maximum value for each feature in

the population of participants and T is all time points in the time series.

Periodicity in the data was represented by encoding the position of the intervals during the day. In otherwords the first interval starting at midnight is encoded as the first interval by adding a float number, thenext interval as the second by increasing the float with a small value. The encoding encodes all intervals and for the last interval the float has the highest value but is below one.

4.7 MinMax normalization

yi=xi−min(X )

max (X )−min( X )(10)

Equation 10 is the formula for the MinMax normalization, where yi is the normalized value at position i and xi is the actual value at position i, min(X) is the minimum value of the data set and max(X) is the maximum value of the data set. During the experiments, the software library of scikit-learn (Pedregosa,et. al. 2011), where used to normalize the data.

Before the data was normalized the hours during the day were encoded by an integer value ranging from 1 to 24, in order to increase the LSTM models ability to detect periodic patterns. The value used to encode time is then normalized with the rest of the data, in order to generate values ranging from one

19

to zero.

4.8 LSTM

Two different LSTM models were implemented in Keras and tuned for the experiment, LSTM 1 and LSTM with an implementation of a time2vec layer (LSTM+t2v). LSTM 1 was used for many-to-one forecasts. The LSTM+t2v model uses a custom time2vec layer that transform the input into a sum of sine functions, which are then used as input into the LSTM model. LSTM+t2v is an interpretation of the work of Godfrey and Gashler (2017) as well as Kazemi et. al. (2019).

The LSTM models used for forecasting takes a sequence of data points as input, learns patterns in the sequence and then predicts the data points that would come after the sequence. The output sequence could either be a single value or several values.

The following example is an example with a single value in the output sequence. If a data set consists of a series of integers like, {1,2,3,1,2,3,1,2,3}, and a LSTM model is to be used to forecast the series. The series is split into an input sequence and an output sequence, the input series could be {1,2,3,1,2,3,1,2} and the output sequence or target value is then a single value {3}. The model can then learn the input sequence and predict a single integer that would be the next in the series.

Another example with several data points in the output sequence is, if the same series of integers are used, but split into the input sequence of {1,2,3,1,2,3} and the output sequence of the model {1,2,3}. The model can then learn to recognize and predicts longer portions of a series, and in the case of time series forecast further into the future.

The first approach is usually called many-to-one prediction, since several data points in the input sequence are used to forecast a single value in the output sequence. The second approach is called many-to-many, since several values in the input sequence are used to forecast several values in the output sequence. The many-to-many could be more desirable, since it looks further into the future but is also more complex. As stated by Bontempi, Ben Taieb and Le Borgne (2013):

“If the one-step forecasting of a time series is already a challenging task, performing multi-step forecasting is more difficult because of additional complications, like accumulation of errors, reduced accuracy, and increased uncertainty.” .

The LSTM models have a poor performance on the many-to-one predictions of the MoodMapper data set, especially of the sparser features. Because of the poor performance and due time limitations, a many-to-many model was not implemented.

4.8.1 LSTM 1

LSTM 1 consist of a LSTM input layer LSTM hidden layers and one dense output layer. The model was used for many to one predictions, where the input sequence consists of several data points, and the output sequence only has one single data point.

20

Figure 10 displays the layers in the LSTM 1 model. Where each of the LSTM layers consists of a single LSTM unit (figure 3). The dense layer transform the output sequence into one data point, which is the predicted value for the input sequence.

In order to execute the experiment, the hyper parameters of the LSTM model, has to be tuned for each technique, since the characteristics of the data are different depending on which of the three techniques are applied, for example the sample size, level of sparsity varies for the techniques. The sample size hasan impact on how complex patterns a machine learning model can learn and the effect dropout has on during the training of the model (Srivastava et. al. 2014).

In order to find the optimal number of layers for the MoodMapper and Safebase data sets, several different models with varying number of layers were tested. The first model contained a single LSTM layer, the number of layers were increased by one for the next model and the third model had three layers. Two models with more than three layers were tested, one with four and one with five layers. However, it was found that the models with more than three layers overfitted on the training data and performed worse on the test data. For each model, the three different techniques, uniform interval normalization, MinMax normalization and time2vec, were applied over ten runs.

21

Figure 10: Diagram of the LSTM 1 model. The model has three LSTM layers each consisting of a single LSTM unit, the final dense layer transforms the sequence into the output data point.

LSTM LAYER

Dense layer

LSTM LAYER

LSTM LAYERInput sequence Output

Table 5: Summary of the range of tested values,for the hyper parameters the LSTM model.First column type of parameter and the second column range of tested values.Parameter Range of tested values

Epoch 100-1000

Batch size 16-512

Learning rate 0.0001 - 0.01

Dropout 0.6

Units 25-125

t2v K 20-175

Table 5 shows the hyper parameter that were tuned and the range of values that were tested. The first column is the name of the parameter, where epoch is the number of times the training data passes through the model, batch size is how many samples passes through the model before the weights are updated, learning rate determines by how much the weights in the model are updated, dropout is used to remove a random number of element in the input sequence, units are the length of the output sequence after each LSTM layer and also determines the dimensionality of the gates in the model and t2v K is the number of nodes in the t2v layer. The second column contains range of the tested values.

The same tuning process was followed for each of the three techniques, uniform interval normalization,MinMax normalization and time2vec. However due to the limited time available to train the models, the time necessary to complete one training for a configuration and the lack of preexisting knowledge on the optimal hyper parameter setting for the MoodMapper and Safebase data sets. It was not feasible to find the optimal settings for each of the techniques. Instead the objectives during the tuning process was to, remove or minimize overfitting issues, spend a similar amount of time on each configuration and to find the settings that lead to the lowest Mean Squared Error (MSE) on the test set. The starting point for each technique was the default values as defined in the Keras API of LSTM. Each of the hyper parameter was varied one at a time, until the value which lead to the lowest MSE was found. Theorder the parameters were tuned, corresponds to the order of table 5.

4.9 LSTM+t2v

The LSTM+t2v model uses a custom layer, time2vec to detect and model periodicity in the MoodMapper data set, the layer was implemented in Keras and based on time2vec (Kazemi et. al. 2019) and Neural decomposition (ND) which was proposed by Godfrey and Gashler (2017). The time2vec layer is an interpretation and extension of time2vec and ND that can handle more than one input vector in order to process and transform multivariate time-series.

time2vec( X )[i ]={ wn,i∗xn+bn,i if i=0sin(wn,i∗xn+bn,i)∗an,i if 1⩽i⩽k

for all xn∈ X (11)

22

Equation 11 is adapted from Godfrey and Gashler (2017) but extends the equation to multivariate inputs. X is the input matrix consisting of one or more input vectors, xn is an input vector in X,

wn,L , bn,L is the weight and bias matrices used to calculate the linear term for input vector xn ,wn,i and bn,i are the weight and bias matrices for the sinusoids and corresponds to the frequency

and phase shift of a sinusoid, and an,i is the weight matrix for the amplitude.

Figure 11 shows a diagram of the time2vec implementation. The layer is implemented as a feed forward neural network with a single hidden layer, where each neuron in the hidden layer is connected to one input neuron. The input x to the time2vec layer is one vector per stream in a multivariate time-

23

Figure 11: Example of a time2vec layer with two inputs vectors x1 and x 2 , four

neurons with sine activation functions n11 , n12 , n21 and n22 , two neurons

with linear functions n1L and n2L . The output from the sine activation functions are multiplied with an amplitude before being inputted into the LSTM layer.

series, the input is multiplied with frequency w and the phase shift is added as a bias b, the sine function is used as activation function in the non-linear neurons, and no activation function is applied in the linear neurons. The non-linear output is multiplied with an amplitude before being inputted into the LSTM model. The LSTM model used in LSTM+t2v is the same as LSTM 1 (figure 10).

There are two possible approaches to connect the inputs to the hidden layer, to connected each input node to all nodes in the hidden layer (fully connected) or to connect the inputs to separate nodes in the hidden layer. Both approaches were tested by classifying a synthetic data set, the first data set describedin Kazemi et. al. (2019) and by forecasting a data set of the water level in Venice available at https://www.kaggle.com/lbronchal/venezia. The difference between the two approaches were not statistically significant for either of the two data sets.

In Godfrey and Gashler (2017) includes an amplitude variable in their equation, which is removed by Kazemi et. al. (2019). During the implementation both including and not including the amplitude variable were tested, by classifying the synthetic data set and forecasting the water level of Venice. The variant with the amplitude performed better on the water level data set, if the data was not normalized between zero and one. However if the data set was normalized between zero and one, there were not a statistically significant difference between the two approaches.

The implemented time2vec layer connects the input to separate nodes in the hidden layer, in order to detect periodic patterns in separate features and not in the combined features. The version with amplitude was selected, since it did perform better in the case when the data was not normalized, even though in the experiment all data is normalized between zero and one.

4.10 Experimental setup

This section covers how the experiment were setup, naming of configurations and hyper parameter settings.

The experiment is run over three different computers with different operating systems, which are Windows 7, Windows 10 and Ubuntu. Some of the configurations are run on one computer and others are split over two or three computers. There were no noticeable difference on runs from the same configuration, when they were executed on different systems. The implementation of uniform interval normalization is available at https://github.com/Simon84s/Uniform_Interval_Normalization.

24

Figure 12 is a diagram of the experimental setup, for the MoodMapper and Safebase data sets. Three different techniques are compared to each other, uniform interval normalization, MinMax normalization and time2vec. The bottom three squares is the type of normalization technique applied tothe data sets. The time2vec square above the right most MinMax normalization square, is meant to demonstrate, that after the data has been normalized it is feed into the time2vec layer before being inputted into the LSTM model. The Ellipses visualizes that the parameter setting are different for each of the techniques. A diagram of the LSTM model used during the experiments, is displayed in figure10.

4.10.1 Preliminary: MoodMapper

The data set was split into a training and a test set. For each participant the first 80 percent of the the time series were selected for training and the last 20 percent for the test set. For validation 25 percent ofthe training data was used.

Table 6: Summary of the labeling of the uniform interval normalization configurations for the MoodMapper data set. First column the assigned label used for identification, second column the number of hours used for the interval length, third column

25

Figure 12: Diagram of the experimental setup. The squares represent the techniques, uniform interval normalization, MinMax normalization and time2vec. The ellipses arethe hyper parameter settings of the LSTM Model and the rectangle is the LSTM modelas shown in figure 10.

LSTM

Uniform interval

normalization

MinMax normalization

time2vec

MinMax normalization

Setting: uniform interval

normalizationSetting: MinMax Setting: time2vec

number of hours between the start of each interval.

Label Interval length (hours) Step size (hours)

6h_s3h 6 3

6h_s6h 6 6

12h_s3h 12 3

12h_s6h 12 6

Table 6 displays the labels for the configurations of uniform interval normalization, used for the trials with the MoodMapper data set. The first column is the label used as reference in the report, column is the length of the interval in hours and column is the number of hours between the start of two intervals.

Table 7: Sample sizes for each participant and normalization technique. The first column contains the identification of the participants, the second column the sample size for a three hour step size, the third column the sample size for a size hour step size and the

fourth column the sample size when MinMax normalization is applied instead of uniform interval normalization.

Participant ID Sample size: Step size three hours

Sample size: Step size six hours

Sample size: MinMax & time2vec

sb002 2 856 1 556 8 568

sb003 2 576 760 4 560

sb006 5 416 2 708 16 224

sb008 2 184 1 092 6 528

Table 7 shows the samples sizes for each participant and each step size configuration, where the samples size are the number of data points in the data set. The first column contains the identification of the participants, the second column the sample size when a three hour step size is used, the third column the sample size when a six hour step size is used and the fourth column when MinMax normalization is applied instead of uniform interval normalization. The sample size for time2vec is the same as for MinMax normalization.

A lower step size means more intervals are created, than when a higher step size is used. Since the number of hours between each interval will be less, for the lower step size. The number of samples in the sample size has a effect on the LSTM model, and a higher number of samples can increase the performance of a LSTM model.

4.10.2 MoodMapper: LSTM configuration

Uniform interval normalization and MinMax normalization are run on the LSTM 1 model, and time2vec on the LSTM+t2v model.

26

Table 8: Table 4: Hyper parameter settings for LSTM 1, first column is the corresponding name of the parameterin the Keras LSTM API, the second column is the settings for uniform interval normalization, the third column the settings for MinMax normalization and the fourth column the settings time2vec.

Parameter Setting: uniform interval normalization Setting: MinMax Setting: time2vec

Epoch 500 500 300

Batch size 32 256 200

Learning rate 0.001 0.001 0.0001

Dropout 0.6 0.2 0

Units 75 75 75

t2v K - - 170

Table 8 shows the hyper parameter settings used for the uniform interval normalization and MinMax normalization. First column the name of the parameter, the second column contains the settings for uniform interval normalization, the third column the settings for MinMax normalization and the fourth column the settings for time2vec.

The number of epochs are 500 for uniform interval normalization and MinMax normalization, but 300 for time2vec. The tuning process showed, that in all cases the models started having overfitting issues, if the number of epochs were higher than 500. The loss function continued to slightly improve during training, but worsened on the validation data.

The batch size determines how often the weights of the model are updated, it was found that for uniform interval normalization a lower batch size gave better performance, while for MinMax and time2vec a higher batch size had to be used to avoid overfitting issues. Since uniform interval normalization has a lower sample size than both MinMax and time2vec and processes less samples each epoch it was expected a lower batch size for uniform interval normalization and a higher batch size for MinMax and time2vec had to be used.

The learning rate was set to 0.001 for both uniform interval normalization and MinMax normalization which is the standard setting for the Adam optimizer used in the experiment. For time2vec a lower learning rate was set, since the model was prone to overfitting. Lower setting for the learning rate lead to LSTM 1 not learning enough during training to forecast the test data and a higher setting lead to the model being overfitted on the training data, which lead to poor performance on the unseen test data.

During the tuning process it was observed that all models had tendencies to become overfitted. In orderto manage the overfitting issue in the LSTM 1 model dropout layers were added and dropout rates of 0,0.2, 0.4 and 0.6 were tested. For uniform interval normalization a high dropout rate was required to reduce the overfitting on the training data, but for MinMax a lower dropout rate gave better performance. Since MinMax normalization has a higher sample size than uniform interval normalization, the lower dropout rate support the findings of Srivastava, Hinton, Krizhevsky, Sutskeverand Salakhutdinov (2014). Srivastava et. al. (2014) found that the effect of dropout increases with the size of the sample size, up to a certain point, after which the effect declines.

The units parameter in Keras is the size of the output of each LSTM block and was set to 75 for both configurations, higher values tended to result in overfitting issues and longer training times where lower values resulted in worse performance.

27

4.10.3 MoodMapper: Input and output sequence

Table 9: Summary of the input and output sequence configurations for the MoodMapper data set.The first column contains the label of the normalization technique, the second column number of data points in the input sequence and the third column the number of data points in the outputsequence.

Label Input sequence (data points) output sequence (data points)

6h_s3h 56 1

6h_s6h 56 1

12h_s3h 56 1

12h_s6h 56 1

MinMax 168 1

Time2vec 168 1

Table 9 displays the number of data points in the input and outputs sequences. First column has the label of the normalization technique, the top four are the configurations for the uniform interval normalization. The second column is the number of data points in the input sequence, which is the number of data points the LSTM model uses to train on. The third column the number of data points in the output sequence, which is the number of data points that is forecast by the LSTM model.

For uniform interval normalization the length of the input sequence, is one week for the six hour step size configurations, and two weeks for the three hour step size configurations. The output sequence is one data point, which corresponds to one interval. For MinMax and time2vec, the length of the input sequence is a week. The output sequence is one data point and is equivalent to one hour. In the case of MinMax and time2vec, it was assumed the longer input sequence is needed, in order to cover enough days for the model to learn daily behavior.

4.10.4 Preliminary: Safebase

The data set were split into a training and a test set. For each configuration the first 80 percent of the the time series were selected for training and the last 20 percent for the test set. For validation 25 percent of the training data was used.

Table 10: Summary of the labeling of the configurations of the uniform interval normalization, for the Safebase data set. First column label used for identification, second column the number of minutes in each interval, third column the number of minutes between the start of each interval.

Label Interval length (minutes) Step size (minutes)

15m_s3 15 3

30m_s15 30 15

60m_s20 60 20

Table 10 shows the labels for the configurations of uniform interval normalization, used for the trials with the MoodMapper data set. The first column is the label used as reference in the report, the second

28

column is the length of the interval in hours and the third column is the number of hours between the start of two intervals.

Table 11: Summary of the sample size. The first column contains the label of the configuration and the second column contains the sample size.

Label Sample size

15m_s3 4 033

30m_s15 1 345

60m_s20 1 009

MinMax & time2vec 20 165

Table 11 summarizes the sample size per configuration. The first column contains the label of the configuration, the first three rows are the uniform interval normalization configurations and the final rows is the MinMax normalization and time2vec, the second column displays the number of data pointsin the sample.

4.10.5 Safebase: LSTM configurations

Uniform interval normalization and MinMax normalization are run on the LSTM 1 model, and time2vec on the LSTM+t2v model.

Table 12: Settings of the hyper parameters for the LSTM models. First column name of the parameter, second column the settings for uniform interval normalization, third column settings for MinMax normalization and fourth column settings for time2vec.

Parameter Setting: uniform interval normalization

Setting: MinMax Setting:time2vec

Epoch 400 600 400

Batch size 128 128 128

Learning rate 0.0001 0.0001 0.0001

Dropout 0 0.2 0

Units 125 75 75

t2v K - - 128

Table 12 contains the settings of the hyper parameters of the LSTM models. First column the name of the parameter, the second column contains the settings for uniform interval normalization, the third column the settings for MinMax normalization and the fourth column the settings for time2vec.

The number of epochs varies between the different models and normalization techniques, where MinMax normalization has a higher number of epochs than the other two techniques. When the model was tested on MinMax normalization, with the same settings as the other two techniques, the model

29

was prone to overfitting. However, by introducing a level of dropout the overfitting issues were solved. A higher number of epochs were then tested on the model with dropout, and it lead to a lower value of the loss function on the test set.

The batch size and learning rate was set to the same values for all models and configurations. Uniform interval normalization and time2vec did not experience overfitting issues.

4.10.6 Safebase: Input and output sequence

Table 13: Summary of the input and output sequence configurations. The first column contains the label of the normalization technique, the second column the number of data points in the input sequence and the third column number of data points in the output sequence.

Label Input sequence (data points) output sequence (data points)

15m_s3 20 1

30m_s15 20 1

60m_s20 20 1

MinMax 24 1

Time2vec 24 1

Table 13 displays the number of data points in the input and output sequences, for the different configurations. The first three rows are the different uniform interval normalization configurations, for which twenty data points are used as input. For MinMax normalization and time2vec 24 data points areused.

4.11 Metric

Since RMSE is one of the most commonly used metrics to evaluate machine learning techniques, it is selected as one of the metrics for the experiments.

RMSE=√ 1n∑n=1

n

( yn− yn)2 (12)

RMSE is defined by equation 12, where n is the number of samples, y is the target value and y is the predicted value. Larger errors have a disproportionately large effect on RMSE compared to smaller values, and this punishes one large error more than several smaller errors. This is a desired effect in the context of the experiment, since several forecasts that only missed by a few units, is not as bad as forecasting one large value instead of a small value, or forecasting a small value instead of a large value.

Mean Absolute Percentage Error (MAPE) is one of the most used metrics for measuring forecasting accuracy, but since the MoodMapper data set and the Safebase data set can have zero as an actual valueMAPE cannot be used. Kim and Kim (2016) proposed Mean Arctangent Absolute Percentage Error (MAAPE) as an extension to MAPE. MAAPE can handle zero as an actual value and is motivated geometrically, however since the MAAPE's value is given in radians it is less intuitive than MAPE.

30

MAAPE=1n∑i=1

m

∣arctan∣ai− y i∣

∣a i∣ ∣ (13)

Equation 13 shows the definition of MAAPE, where a i is the actual value and y i is the predictedvalue, the value of MAAPE ranges from 0 to π/2 and a smaller value is better than a larger.

5. ResultsThis section summarizes the result of the experiment, each configuration was executed ten times in order to reduce the effect of good or bad starting weights would have on the result.

5.1 MoodMapper

The results are displayed with box plots for the selected participants in the MoodMapper data set. The errors are calculated over all features in the forecast of the multivariate time series.

The box plots in this chapter are structured in the following manner. The type of configurations are

31

Figure 13: Summary the MAAPE metric for the MoodMapper data set. Type of configuration placed on the x-axis, the metric is on the y-axis. Top left box plot is for participant sb002, top right sb003, bottom left sb006 and bottom right sb008

placed on the x-axis, where 12h_s3 is an interval with the length of twelve hours and three hours between each interval, 12h_s6 is an interval with the length of twelve hours and six hours between each interval, 6h_s3 is an interval with the length of six hours and three hours between each interval, 6h_s6 is an interval with the length of six hours and six hours between each interval, MM is the MinMax normalization and t2v is time2vec. The metric is placed on the y-axis, for both RMSE and MAAPE a lower metric value is better than a higher.

Figure 13 shows the result for the MAAPE metric for all participants. The uniform interval normalization configurations, performed better than the MinMax normalization and time2vec for all theparticipants. Where the twelve hour interval length with a three hour step size was the best setting.

Figure 14 shows the result for the RMSE metric for all participants. The uniform interval normalizationconfigurations, achieves a lower RMSE than the MinMax normalization and time2vec for all the participants. The best setting for uniform interval normalization, was the twelve hour interval length and three hour step size.

Tables 14-17 summaries the ranks of the techniques. The first column is the name of the technique, the second column the rank according to RMSE, the third column the rank according to the MAAPE, the fourth column the mean RMSE over ten runs and the fifth column the mean MAAPE over ten runs.

32

Figure 14: Summary of the RMSE metric for the MoodMapper data set. Type of configuration placed on the x-axis, the metric is on the y-axis. Top left box plot is for participant sb002, top right sb003, bottom left sb006 and bottom right sb008. The RMSE between participants are not comparable, since the scale are dependent on the participant.

Table 14: Table 6: Summary of rank based on RMSE and MAAPE for participant sb002. Type of data representation in first column, rank based on RMSE in second column, rank according to MAAPE in third column, mean RMSE over ten trials in fourth column and mean of MAAPE over ten trials in fifth column.

Technique Rank RMSE Rank MAAPE RMSE (mean) MAAPE (mean)

MinMax (MM) 5 5 32.07 1.31

Time2vec (t2v) 6 6 32.2 1.32

interval 6h step 3h (6h_s3) 4 3 13.68 1.06




Table 14 summarizes the rank assigned based on the box plots for participant sb002. The best configuration for was the twelve hour interval with a step size of three and the other interval configurations performed better than MinMax and time2vec.

Table 15: Summary of rank based on RMSE and MAAPE for participant sb003. Type of data representation in first column, rank based on RMSE in second column, rank according to MAAPE in third column, mean RMSE over ten trials in fourth column and mean of MAAPE over ten trials in fifth column.


MinMax (MM) 4 4 274.26 1.15

Time2vec (t2v) 4 4 276.27 1.56





Table 15 summarizes the rank assigned based on the box plots for participant sb003. The best configuration for participant sb002 was the twelve hour interval with a step size of three and the other interval configurations performed better than MinMax and time2vec.

33



MinMax (MM) 5 4 133.93 1.24

Time2vec (t2v) 5 4 136.76 1.25








MinMax (MM) 5 5 171.63 1.29

Time2vec (t2v) 4 5 160.7 1.28

interval 6h step 3h (6h_s3) 3 3 89.61 1



interval 12h step 6h (12h_s6) 2 2 82 0


5.2 Safebase

In figure 15, figure 16, figure 17 and figure 18 the results of the Safebase data set are presented with box plots. The type of data representation are placed on the x-axis, where 15m_s3 is an interval with the length of fifteen minutes and three minutes between each interval, 30m_s15 is an interval with the length of thirty minutes and fifteen minutes between each interval, 60m_s20 is an interval with the length of sixty minutes and twenty minutes between each interval, MM is the MinMax normalization and t2v is time2vec. The metric is placed on the y-axis.

34

35

Figure 15: Box plot of the MAAPE for the Safebase data set. The first number is the interval length and the second the step size (15m_s3, 30m_s15, 60m_s20), MM is the MinMax normalization and t2v is time2vec.

Figure 15 shows the result for the MAAPE score for the Safebase data set, a lower score is better than ahigher score, and figure 16 displays a zoomed in view to visualize the differences between the techniques. The fifteen minute interval with a step size of three minutes has the best MAAPE score, butis overlapping with the sixty minutes interval. However since the distribution (except one outlier) of theMAAPE score for the fifteen minute interval is below the mean of the MAAPE score for the twenty minute interval, the difference is significant. Time2vec and MinMax normalization has the third best MAAPE score, the difference between the techniques is likely not significant. The thirty minute interval length has the worst MAAPE score.

36

Figure 16: Zoomed in view of the MAAPE scored for the Safebase data set. The zoomed in view enhances the differences of the boxes in the plot.

The result for the RMSE of the Safebase data set is shown in figure 17, where a lower RMSE score is better than a higher score. The uniform interval normalization with an interval of fifteen minute and with three minutes between the start of each interval achieves the best RMSE score. The sixty minute interval with a twenty minutes between the start of each interval is the second best data representation. MinMax normalization and time2vec have similar RMSE scores, but time2ve seem to have a slightly better score. The thirty minute interval with fifteen minutes between the start of each interval has the worst RMSE score.

37

Figure 17: Box plot of the RMSE for the Safebase data set. The first number is the interval length and the second the step size (15m_s3, 30m_s15, 60m_s20), MM is the MinMax normalization and t2v is time2vec.

Figure 18 zooms in on boxes in order to enhance the differences and visualize them more clearly.

Table 18: Summary of rank based on RMSE and MAAPE for the Safebase data set . Type of data representation in first column, rank based on RMSE in second column, rank according to MAAPE in third column, mean RMSEover ten trials in fourth column and mean of MAAPE over ten trials in fifth column.


MinMax (MM) 3 3 1.68 0.36

Time2vec (t2v) 3 3 1.67 0.36

interval 15m step 3m (15m_s3) 1 1 1.13 0.32



Table 18 summarizes the rank assigned based on the box plots, and compares the mean values of the RMSE and MAAPE for the different techniques. The first column contains the label of the applied

38

Figure 18: Zoomed in view of the MAAPE scored for the Safebase data set. The zoomed in view enhances the differences of the boxes inthe plot.

technique, the abbreviation in the parenthesis corresponds to the label in the box plot. The second column is the assigned ranked according the box plot of the RMSE, the third column the rank based on the MAAPE box plot, the fourth column the mean value of the RMSE, calculated over ten trials. The fourth column is the mean MAAPE calculated over ten trials. The best configuration for the safe base data set was the fifteen minutes interval with three minutes between each interval. The worst configuration was the thirty minutes interval with fifteen minutes between each interval, which performed worse than MinMax and time2vec.

5.3 Analysis

Uniform interval normalization smooths activity happening at specific time point over longer time intervals, and functions as a simple low-pass filter (e.g., Moving Average). When the data is smoothed over a longer time interval, high frequency noise is filtered out while low frequency signals pass through unchanged. This manages irregular events that occurs with a high frequency, and when events that usually occurs regularly, happens at unexpected times. The filtering and the reduction of sparsity, leads to a better performance of the LSTM model, on the MoodMapper and Safebase data set comparedto both MinMax normalization and MinMax normalization with the time2vec layer, as seen in the experiment.

For the MoodMapper data set, the longer interval configuration (i.e., twelve hour intervals) has a better performance, than the shorter interval (i.e., six hour interval). The longer intervals reduces the sparseness in the data more than the shorter intervals, and the lower cutoff frequency means that more noise are filtered out compared to the shorter interval. For twelve hour interval length configurations, a step size of three hours performed better than the six hours step size. When the step size is greater, the sample size is reduced more compared to a smaller step size, which lead to a worse performance of the LSTM model. The worse performance on the smaller sample size agrees with the findings of Acikmeseand Alptekin, (2019). Who found that their LSTM model had problems generalizing, since their samplesize was small. The optimal step size for the configurations was not finalized in the experiment and is left as future work.

The differences between the techniques are less for the Safebase data set than for the MoodMapper dataset, since the Safebase data set is more dense compared to MoodMapper and contains less frequent irregular events. Reducing the sparseness and filtering out noise, does not have the same effect on the predictability compared to the MoodMapper data set.

That the sparseness of the data has an effect on the accuracy of the LSTM model, can be seen when comparing features with more sparse data for example, outgoing text messages (appendix H & appendix I), to features with less sparse data for example number of cell towers the phone is connects to each hour (appendix J & appendix K).

Zoomed in views of the box plots, more clearly visualizing the differences between the configurations can be found in Appendices A-G.

6. DiscussionThis chapter discusses the effect of the uniform interval normalization, how generalizable uniform interval normalization is, related research and future research

39

6.1 Effects of the application of uniform interval normalization

The uniform interval normalization can be seen as generating a measure or ratio of activity for the participants and intervals in the data set. The measure is a sum of the registered activity for a participant during an interval divided by a normalization factor. The normalization factor decides the scale of the data, and determines how the transformed values should be interpreted. During this study the normalization factor was set per feature and to be the maximum value of an interval in the population for the feature. The motivation to selecting the maximum in the population for each feature was to transform the data to values between zero and one, and to generate a interpretable activity measure. For example an activity score of zero point five can indicate an activity level between half activity during the full interval, or full activity half of the interval, relative to the maximum activity in the population. Another possible approach could have been to use the highest measured activity for an individual for the current time period. The measured activity would then indicate the relative activity level during the specific interval instead of relative activity level over all intervals.

6.2 Generalization

To investigate how generalizable uniform interval normalization is, it is of interest to apply it both to a wider range of data sets, to other machine learning models and for other tasks than forecasting. A characteristic that could limit the generalization of uniform interval normalization, is that when the datapoints are summed up, the characteristics of the individual data points are lost. For tasks where the characteristics are important on a per data point level, for example, anomaly detection, the application of uniform interval normalization might not suitable.

6.3 Related works

Kazemi et. al. (2019) proposed time2vec and found it to be an effective approach to learn characteristics of time. Time2vec was found to improve the results in the majority of the experiments, and never to deteriorate it. However, in this study the implemented time2vec layer performed worse forthe MoodMapper data set, compared to not using it. The performance of time2vec indicates it is not able to encode the periodic patterns in the sparse data set of MoodMapper. For the Safebase data set time2vec performed similar or slightly better than MinMax normalization without time2vec. Which is similar to the results of Kazemi et. al. (2019).

Ul Alam et. al. (2020), used a LSTM model to predict sepsis, based on sparse medical data. The results of their study found the LSTM model outperformed existing scoring systems, and that the size of the time window had a major impact on the predictive performance of the model. Ul Alam et. al. (2020), evaluated six different windows sizes ranging from one to eight hours and found the two hour window size to have the best effect on the predictability. The windows in the study done by Ul Alam et. al. (2020) was not overlapping. In contrast to the results of Ul Alam (2020), the results of the experiment on the MoodMapper data set and the Safebase data set. Shows the longest window size has the best performance when the data is sparse, and the shortest window size when the data is less sparse. The overlapping of intervals might have an effect that improves the effect on predictability, and could be an area of interest for future works.

6.4 Future works

To detect relapses in a data set containing data from individuals with bipolar disorder, anomaly

40

detection could be a more effective approach than forecasting. However, the MoodMapper data set do not include relapses or labeled anomalies. Detected anomalies would not be an indication a participant is about to experience or already experiences a relapse, and the results would probably not be meaningful. The assumption is that forecasting a time series with relapses is not more difficult than forecasting a time series without relapses and a machine learning model would perform similarly in both cases. If the behavior of a participant is clearly different from the forecast behavior, it could indicate the participant is in need of health care.

For the Safebase data set, the effect of the step size and interval length requires to be studied further. Since the middle interval length is the worst, and not the shortest or longest, another factor than the length of intervals affect the predictability of the data. It could be the differing step sizes affects the result, but to determine the exact cause another experiment must be performed.

To better be able to evaluate the accuracy of the models the Mean Absolute Scaled Errors (MASE), could be used to measure the performance. The MASE is easier to interpret than RMSE and MAAPE, since a MASE value above one indicate a naive one step ahead method, performs better than the approach under consideration (Hyndman & Koehler, 2006).

Uniform interval normalization was compared to MinMax normalization, it is of interest to compare the technique with other techniques that handles sparsity and noisiness in data. For example, other fuzzification approaches to forecasting of time series.

7. EthicsThe MoodMapper and Safebase data sets contain clinical data collect from people with health issues. It is therefore important the data cannot be used to identify an individual and to have ethical appeals. The MoodMapper data was collected under the ethical approval of the project https://www.researchweb.org/is/vgr/project/221821. The data in both the MoodMapper and the Safebase data set are anonymized and it is not possible to obtain information from the data, which could lead to a person being identified. There is no information about a persons health in the data, so it is not possible to know if the participant is a control person or suffers from any health condition.

The MoodMapper data set has hash codes for the cell tower a participant is connected to, as well as hash codes for telephone numbers. A person with enough authority for example a police officer, might be able to identify the cell towers from the hash codes and find out to whom the data belongs. However, it would probably not be possible for a person without special permission to map the hash codes, to cell tower location. Furthermore, in this study the hash codes themselves are not utilized after the data has been transformed, instead the codes are used to calculate a count and then removed from the data.

If forecasting of a persons habits can be done accurately enough, the results could be used in unethical ways, for example to stalk a person, to sell information to third parties to generate ad offers and ad placement.

8. ConclusionThe uniform interval normalization technique increased the predictability of both the MoodMapper andthe Safebase data set. The increase of predictability is greater on the MoodMapper data set, which is likely due to that the data set is sparser and have more noise than the Safebase data set. The effect on

41

predictability is therefore more noticeable. Uniform interval normalization can be used to improve the predictability of sparse and noisy data sets, in order to perform forecasting and possibly other machine learning tasks. The contributions of this study is the application of uniform interval normalization on the MoodMapper and Safebase data sets, and insights about how the sparsity of the data in the MoodMapper data set affects the predictability. Where features with denser data are possible to forecastfor time periods up to twelve hours, when uniform interval normalization is applied.

The answer to the research questions are as follows. Research question (R1), if fuzzification could improve the predictability of activity level, when the data is sparse and noisy. Was answered by the runs on the MoodMapper data set, and it was shown that by fuzzifying the data set, the predictability ofthe activity level was increased compared when it was not fuzzified. The second research question(R2), that was about the effect of fuzzifying a dense data set with a low amount of noise, with respect to predictability of activity level. Was answered by the runs on the Safebase data set, and it was shown the fuzzification had a low effect on the predictability of activity level, but the effect is probably not relevant. Research question (R3), was about the effect of the length of the intervals, and it was shown during the runs on the MoodMapper data set, that longer intervals increased the predictability with respect to activity level. During the runs of of the Safebase data set it was shown shorter intervals gave a better predictability, but the difference were minor. The conclusion drawn is that for sparse data sets with a high amount of noise, longer intervals give a better predictability. Since longer intervals decreases the sparseness and noisiness more than shorter intervals. The final research question (R4)) was about the overlapping of intervals or step size. The answer to the question is, the step size affect the sample size, which is seen in table 7 and table 11, and the sample size has an effect on the predictability. Since for LSTM models a higher samples size generally leads to better performance thana smaller samples size, when the same data set is used.

42

ReferencesAbadi, M. et. al. (2016). TensorFlow: A System for Large-Scale Machine Learning. In 12th Symposium on Operating Systems Design and Implementation. pp. 256-283.

Acikmese, Y., & Alptekin, S. E. (2019). Prediction of stress levels with LSTM and passive mobile sensors. Paper presented at the Procedia Computer Science, , 159 658-667. doi:10.1016/j.procs.2019.09.22

Antosik-Wójcińska, A. Z., Dominiak, M., Chojnacka, M., Kaczmarek-Majer, K., Opara, K. R., Radziszewska, W., Olwert, A. & Święcicki, L. (2020). Smartphone as a monitoring tool for bipolar disorder: A systematic review including data analysis, machine learning algorithms and predictive modelling. International Journal of Medical Informatics, 138 doi:10.1016/j.ijmedinf.2020.104131

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent isdifficult. IEEE Transactions on Neural Networks, 5(2), 157-166. doi:10.1109/72.279181

Bontempi, G., Ben Taieb, S., & Le Borgne, Y. (2013). Machine learning strategies for time series forecasting doi:10.1007/978-3-642-36318-4 3

Cearns, M., Hahn, T. & Baune, B.T. (2019). Recommendations and future directions for supervised machine learning in psychiatry. Transl Psychiatry 9, 271. https://doi-org.libraryproxy.his.se/10.1038/s41398-019-0607-2

Chollet, F. (2015). keras. https://keras.io.

Chowdhury, S., Dong, X., & Li, X. (2019). Recurrent Neural Network Based Feature Selection for High Dimensional and Low Sample Size Micro-array Data. IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 4823-4828, doi: 10.1109/BigData47090.2019.9006432.

Faloutsos, C., Flunkert, V., Gasthaus, J., Januschowski, T., & Wang, Y. (2019). Forecasting big time series: Theory and practice. Paper presented at the Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2309-2310. doi:10.1145/3292500.3332289

Godfrey, L. B., & Gashler, M. S. (2018). Neural decomposition of time-series data for effective generalization. IEEE Transactions on Neural Networks and Learning Systems, 29(7), 2973-2985. doi:10.1109/TNNLS.2017.2709324

Gruet, M. A., Chandorkar, M., Sicard, A., & Camporeale, E. (2018). Multiple-hour-ahead forecast of the dst index using a combination of long short-term memory neural network and gaussian process. Space Weather, 16(11), 1882-1896. doi:10.1029/2018SW001898

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. doi:10.1162/neco.1997.9.8.1735.

43

Hong, X., Lin, R., Yang, C., Cai, C., & Clawson, K. (2020). ADPM: An alzheimer's disease prediction model for time series neuroimage analysis. IEEE Access, 8, 62601-62609. doi:10.1109/ACCESS.2020.2979969

Hosseini, A. & Sarrafzadeh, M (2019). Unsupervised Prediction of Negative Health Events Ahead of Time. IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Chicago, IL, USA, 2019, pp. 1-4, doi: 10.1109/BHI.2019.8834550.

Hyndman, R. & Koehler, A. (2006). Another look at measrures of forecast accuracy. International Journal of Forecasting Volume 22, Issue 4, October–December 2006, Pages 679-688

Iovan, C., Olteanu-Raimond, A.-M., Couronné, T. & Smoreda, Z. (2013). Moving and calling: Mobile phone data quality measurements and spatiotemporal uncertainty in human mobility studies doi:10.1007/978-3-319-00615-4_1

Kannan, R. (2015). Orientation estimation based on LKF using differential state equation. IEEE Sensors Journal, vol. 15, no. 11, pp. 6156-6163.

Kim, S., & Kim, H. (2016). A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32(3), 669-679. doi:10.1016/j.ijforecast.2015.12.003

Li, F., Johnson, J. & Yeung, S. (2017). Recurrent Neural Networks. Lecture notes, CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University, 2017.05.01. http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf, Accessed:2020.04.29.

Manashty, A. & Light Thomson, J. (2017). A New Temporal Abstraction for Health Diagnosis Prediction using Deep Recurrent Networks. In Proceedings of the 21st International Database Engineering & Applications Symposium (IDEAS 2017). Association for Computing Machinery, New York, NY, USA, 14–19.

Meng, X., Liu, M., & Wu, Q. (2020). Prediction of rice yield via stacked LSTM. International Journal of Agricultural and Environmental Information Systems, 11(1), 86-95. doi:10.4018/IJAEIS.2020010105.

Ohlbach, H. J. (2004). Relations between fuzzy time intervals. Paper presented at the Proceedings of the International Workshop on Temporal Representation and Reasoning, , 11 44-51.

Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. 30th International Conference on Machine Learning, ICML 2013, (PART 3) 2347-2355.

Pedregosa et. al. (2011) Scikit-learn, Machine Learning in Python. Journal of Machine Learning Research 12, pp. 2825-2830.

Smith, J.O. (2007). Introduction ot Digital Filters with Audio Applications. Stanford, Calif.: Stanford Univ.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A

44

Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, pp. 1929-1958.

Ul Alam, M., Henriksson, A., Valik, J.K., Ward, L., Naucler, P., Dalianis, H. (2020). Deep learning from heterogeneous sequences of sparse medical data for early prediction of spesis. HEALTHINF 2020- 13th International Conference on Health Informatics, part of 13th International Joint Conference on Biomedical Engineering Systems and Technologies, BIOSTEC 20202020, Pages 45-55.

Wohlin, C., Runeson, P., Host, M., Ohlsson, M., Regnell, B. & Wesslen, A. (2012). Experimentation In Software Engineering. New York: Springer-Verlag New York.

Xiao P., Lee JY., Tao S., Hwang YS. & Sakai T. (2020). Generating Short Product Descriptors Based onVery Little Training Data. In: Wang F. et al. (eds) Information Retrieval Technology. AIRS 2019. Lecture Notes in Computer Science, vol 12004, pp 133-144. Springer, Cham

Yamak, P. T., Yujian, L., & Gadosey, P. K. (2019). A comparison between ARIMA, LSTM, and GRU for time series forecasting. Paper presented at the ACM International Conference Proceeding Series, 49-55. doi:10.1145/3377713.3377722

Yang, P., Hanneghan, M., Qi, J., Deng, Z., Dong, F. & Fan, D. (2015). Improving the Validity of Lifelogging Physical Activity Measures in an Internet of Things Environment. IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing, Liverpool, 2015, pp. 2309-2314. DOI: 10.1109/CIT/IUCC/DASC/PICOM.2015.341

Zhang, J., Yan, J., Infield, D., Liu, Y., & Lien, F. -. (2019). Short-term forecasting and uncertainty analysis of wind turbine power based on long short-term memory network and gaussian mixture model.Applied Energy, 241, 229-244. doi:10.1016/j.apenergy.2019.03.044

Zhou, C., Fang, Z., Xu, X., Zhang, X., Ding, Y., Jiang, X., & ji, Y. (2020). Using long short-term memory networks to predict energy consumption of air-conditioning systems. Sustainable Cities and Society, 55 doi:10.1016/j.scs.2019.102000

45

Appendix A Box plots sb002

46

Appendix B Zoomed view sb002 (RMSE)

47

Appendix C Zoomed view sb002 (MAAPE)

48

Appendix D Zoomed view sb003 (RMSE)

49

Appendix E Zoomed view sb003 (MAAPE)

50

Appendix F Zoomed view sb006 (MAAPE)

51

Appendix G Zoomed view sb008 (RMSE & MAAPE)

52

Appendix H sb006 (text, length out)

53

Appendix I Zoomed view of sb006 (text, length out)

54

Appendix J sb006 12hour/3h step size (cell tower, count)

55

Appendix K Zoomed view sb006 12hour/3h step size(cell tower, count)

56

Appendix L Example graph safebase (black line original time series, orange line prediction)

57

Appendix M Example graph safebase, zoomed view (black line original time series, orange line prediction)

58

Documents

UNIFORM INTERVAL NORMALIZATION: DATA REPRESENTATION …