
Supplementary material: An advanced data-driven hybrid model of SARIMA-NNNAR for tuberculosis incidence time series forecasting in Qinghai Province, China

Yongbin Wang1,*, Chunjie Xu2,*, Yuchun Li1, Weidong Wu1, Lihui Gui1, Jingchao Ren1, Sanqiao

Yao1

1 Department of Epidemiology and Health Statistics, School of Public Health, Xinxiang Medical

University, Xinxiang, Henan Province, P.R. China; 2 Department of Occupational and

Environmental Health, School of Public Health, Capital Medical University, Beijing, P.R. China

Correspondence: Sanqiao Yao and Yongbin Wang

Department of Epidemiology and Health Statistics, School of Public Health, Xinxiang Medical

University, Xinxiang 453000, Henan Province, P.R. China

Tel +86 037 383 1646

Email [email protected](Sanqiao Yao); [email protected](Yongbin Wang)

*These authors contributed equally to this paper

Table S1 Resulting parameter estimates and their statistical tests of the best-fitting

SARIMA(1,0,1)(0,1,2)12 model based on the TB incidence data between January 2004 and July

2016

Variables   Estimates   Standard error   t          P
AR1          0.879      0.068             12.926    <0.001
MA1         -0.561      0.011            -51.000    <0.001
SMA1        -0.806      0.091             -8.857    <0.001
SMA2         0.227      0.097              2.340     0.020

Abbreviations: SARIMA, seasonal autoregressive integrated moving average; AR1,

autoregressive, lag1; MA1, moving average, lag1; SMA1, seasonal moving average, lag1; SMA2,

seasonal moving average, lag2.
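The t column in Table S1 is consistent with each estimate divided by its standard error. The following is a minimal Python sketch that recomputes that column from the tabulated values; the names `params` and `t_statistic` are illustrative and do not come from the original analysis.

```python
# Estimates and standard errors copied from Table S1.
params = {
    "AR1": (0.879, 0.068),
    "MA1": (-0.561, 0.011),
    "SMA1": (-0.806, 0.091),
    "SMA2": (0.227, 0.097),
}


def t_statistic(estimate, std_error):
    """Return the estimate divided by its standard error."""
    return estimate / std_error


# Round to three decimals, matching the precision used in the table.
t_values = {
    name: round(t_statistic(estimate, std_error), 3)
    for name, (estimate, std_error) in params.items()
}

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
```

A parameter is usually retained when the absolute t value is large enough for the P value to fall below the chosen significance level, which holds for all four terms here.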


Table S2 Ljung-Box Q statistics at various lags for the residual series yielded by the three best-performing techniques, based on the TB incidence data between January 2004 and July 2016

Lags   SARIMA model          NNNAR model           SARIMA-NNNAR model
       Box-Ljung Q   P       Box-Ljung Q   P       Box-Ljung Q   P
1       0.031        0.861    0.698        0.404    0.015        0.901
3       1.036        0.793    2.263        0.520    0.172        0.982
6      12.255        0.057   10.415        0.108    2.206        0.900
9      17.203        0.046   16.629        0.055    7.014        0.636
12     17.770        0.123   18.245        0.108    8.844        0.716
15     18.802        0.223   19.844        0.178    9.971        0.822
18     25.893        0.102   28.464        0.055   14.690        0.683
21     27.706        0.149   29.045        0.064   17.466        0.683
24     28.365        0.245   32.702        0.111   18.968        0.754
27     29.440        0.340   34.605        0.182   19.343        0.857
30     30.528        0.439   35.810        0.214   25.268        0.712
33     30.951        0.570   36.462        0.311   27.091        0.756
36     32.875        0.618   37.547        0.398   36.379        0.451

Abbreviations: SARIMA, seasonal autoregressive integrated moving average; NNNAR, neural

nonlinear autoregression.
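For readers who want to see how a Ljung-Box Q statistic such as those in Table S2 is formed, the sketch below implements the textbook formula Q = n(n+2) Σ r_k² / (n−k), summed over lags k = 1 to h. It is a plain-Python illustration under that standard definition, not the software used to produce the table; the function names and the toy residual series are assumptions made for the example.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic using lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )


# Toy residuals (illustrative only, not the study data).
toy_residuals = [1.0, -1.0] * 5
q_lag1 = ljung_box_q(toy_residuals, max_lag=1)
print(round(q_lag1, 3))
```

The P value is then read from a chi-square distribution; a large P value means the residual autocorrelations up to that lag are jointly indistinguishable from white noise.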


Table S3 ARCH effects at various lags for the actual TB incidence rate and for the residual series yielded by the three best-performing techniques, based on the TB incidence data between January 2004 and July 2016

Lags   Actual values      SARIMA model       NNNAR model        SARIMA-NNNAR model
       LM-test   P        LM-test   P        LM-test   P        LM-test   P
1       54.735   <0.001     2.869   0.090      6.605   0.010      1.829   0.176
3       62.912   <0.001     3.018   0.389      6.685   0.083      2.379   0.498
6       67.562   <0.001     6.870   0.333      6.477   0.372      4.642   0.591
9       73.503   <0.001    10.298   0.327      8.934   0.443      6.152   0.725
12      85.778   <0.001    13.166   0.357      9.540   0.656      6.694   0.877
15      92.005   <0.001    13.507   0.563     10.216   0.806      8.586   0.898
18      95.172   <0.001    21.299   0.265     13.327   0.772      9.690   0.942
21      94.102   <0.001    21.460   0.431     18.240   0.634     15.149   0.815
24      98.897   <0.001    20.867   0.647     16.656   0.863     21.403   0.615
27     101.740   <0.001    19.814   0.839     22.095   0.733     25.333   0.556
30     100.750   <0.001    19.366   0.932     26.198   0.665     26.051   0.673
33      97.734   <0.001    21.655   0.935     26.283   0.790     31.181   0.558
36      95.380   <0.001    27.301   0.851     27.673   0.839     38.237   0.368

Abbreviations: ARCH, autoregressive conditional heteroscedastic; SARIMA, seasonal

autoregressive integrated moving average; NNNAR, neural nonlinear autoregression; LM,

Lagrangian multiplier.
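As an illustration of the LM statistics in Table S3, the sketch below computes Engle's ARCH LM statistic for the single-lag case only: the squared series is regressed on its own first lag, and the statistic is the number of regression observations times R². This assumes the usual T·R² form of the test; the software and settings behind the table are not stated here, and the function name and toy input are hypothetical.

```python
import math


def arch_lm_lag1(residuals):
    """ARCH LM statistic at lag 1: (number of pairs) * R^2 from regressing
    e_t^2 on a constant and e_{t-1}^2."""
    squared = [e * e for e in residuals]
    y = squared[1:]   # e_t^2
    x = squared[:-1]  # e_{t-1}^2
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # With one regressor, R^2 equals the squared sample correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return n * r_squared


# Toy series whose squares are 1, 2, 3, 4, 5 (illustrative only).
toy = [math.sqrt(v) for v in (1, 2, 3, 4, 5)]
lm_stat = arch_lm_lag1(toy)
print(round(lm_stat, 3))
```

Higher lags extend the regression with more lagged squared terms, and the statistic is compared with a chi-square distribution whose degrees of freedom equal the number of lags. A small P value, as for the actual values in the table, points to conditional heteroscedasticity.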


Table S4 MSE and R² values in the training and validation sets corresponding to the different

hidden layer units in the 5-data ahead forecasting

Size   Training set     Validation set
       MSE     R²       MSE     R²
1      0.630   0.505    0.731   0.424
2      0.218   0.763    0.668   0.501
3      0.210   0.778    0.353   0.725
4      0.311   0.669    0.915   0.574
5      0.071   0.923    0.124   0.929
6      0.332   0.723    1.608   0.320
7      0.054   0.942    1.065   0.796
8      0.014   0.985    0.799   0.564
9      0.023   0.976    1.322   0.160
10     0.013   0.986    1.876   0.543
11     0.030   0.969    1.611   0.478
12     0.002   0.998    0.263   0.795
13     0.006   0.993    0.537   0.660
14     0.001   0.998    2.683   0.336
15     0.002   0.997    0.710   0.573
16     0.001   0.999    0.658   0.628
17     0.006   0.993    0.810   0.397
18     0.064   0.936    2.326   0.474
19     0.002   0.998    0.540   0.603
20     0.012   0.987    0.544   0.745

Abbreviations: MSE, mean squared error.

Note: Detailed descriptions of the Jordan neural network can be found in references 1 and 2. In our work, a validation set of 12 observations was randomly selected from the data, excluding the testing samples.
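One plausible way to read Table S4 is to pick the number of hidden units with the lowest validation MSE. The sketch below does that with the tabulated values and also shows a random draw of 12 validation indices. The selection rule, the seed, the helper names, and the sample count of 151 (the number of months from January 2004 to July 2016) are assumptions made for illustration, not details confirmed by the source.

```python
import random

# Validation-set MSE by number of hidden units, copied from Table S4.
validation_mse = {
    1: 0.731, 2: 0.668, 3: 0.353, 4: 0.915, 5: 0.124,
    6: 1.608, 7: 1.065, 8: 0.799, 9: 1.322, 10: 1.876,
    11: 1.611, 12: 0.263, 13: 0.537, 14: 2.683, 15: 0.710,
    16: 0.658, 17: 0.810, 18: 2.326, 19: 0.540, 20: 0.544,
}


def best_hidden_size(mse_by_size):
    """Return the hidden-layer size with the smallest validation MSE."""
    return min(mse_by_size, key=mse_by_size.get)


def random_validation_split(n_samples, n_validation=12, seed=0):
    """Randomly hold out n_validation indices; the rest form the training set."""
    rng = random.Random(seed)  # fixed seed only so the example is repeatable
    validation_idx = sorted(rng.sample(range(n_samples), n_validation))
    held_out = set(validation_idx)
    training_idx = [i for i in range(n_samples) if i not in held_out]
    return training_idx, validation_idx


best_size = best_hidden_size(validation_mse)
training_idx, validation_idx = random_validation_split(n_samples=151)
print(best_size, len(training_idx), len(validation_idx))
```

Under this rule the table points to 5 hidden units. Larger networks reach lower training MSE but much higher validation MSE, which is the usual sign of overfitting.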


Table S5 MSE and R² values in the training and validation sets corresponding to the different

hidden layer units in the 12-data ahead forecasting

Size   Training set     Validation set
       MSE     R²       MSE     R²
1      0.67    0.498    0.834   0.229
2      0.395   0.609    1.361   0.217
3      0.17    0.831    1.100   0.065
4      0.125   0.874    1.083   0.219
5      0.119   0.871    0.341   0.810
6      0.045   0.955    1.870   0.042
7      0.015   0.984    2.161   0.128
8      0.020   0.980    1.553   0.155
9      0.004   0.995    0.835   0.250
10     0.003   0.997    0.957   0.117
11     0.010   0.990    0.952   0.309
12     0.002   0.998    1.402   0.016
13     0.001   0.999    1.684   0.002
14     0.005   0.995    2.756   0.183
15     0.002   0.998    1.300   0.091
16     0.002   0.998    1.790   0.112
17     0.000   1.000    0.770   0.223
18     0.000   1.000    1.394   0.042
19     0.000   1.000    0.813   0.306
20     0.000   1.000    1.159   0.082

Abbreviations: MSE, mean squared error.


Table S6 Comparison of the fitting and predictive performance measures among the four best-performing models

Models          Fitting power                         Projected power
                MAE      MAPE     RMSE     MER        MAE      MAPE     RMSE     MER

In-sample dataset during January 2004 to December 2015 (fitting); 12 step-ahead projections (projected)
SARIMA          0.746    9.525    1.008    0.095      0.972    8.685    1.153    0.091
NNNAR           0.463    6.767    0.625    0.058      1.238   11.176    1.558    0.116
SARIMA-NNNAR    0.424    5.610    0.564    0.053      0.803    7.649    0.979    0.075
Jordan          0.604    8.553    0.778    0.076      1.491   13.152    1.834    0.140
Reduced percentages (%)
C versus A     43.164   41.053   44.048   44.211     17.372   11.968   15.120   17.415
C versus B      5.228   12.632    6.052    5.263     44.736   40.621   50.304   44.797
C versus D     29.801   34.409   27.506   30.263     46.144   41.842   46.619   46.429

In-sample dataset during January 2004 to July 2016 (fitting); 5 step-ahead projections (projected)
SARIMA          0.724    9.137    1.014    0.090      0.795    9.450    0.920    0.091
NNNAR           0.606    8.477    0.803    0.074      0.735    8.860    0.914    0.084
SARIMA-NNNAR    0.508    6.596    0.722    0.063      0.656    7.879    0.803    0.075
Jordan          0.581    7.946    0.765    0.071      0.794    8.940    0.806    0.088
Reduced percentages (%)
C versus A     29.881   27.790   28.839   29.978     17.526   16.614   12.685   17.453
C versus B     16.218   22.170   10.138   14.459     10.724   11.061   12.102   10.689
C versus D     12.565   16.990    5.621   11.268     17.380   11.868    0.372   14.773

Abbreviations: SARIMA, seasonal autoregressive integrated moving average; NNNAR, neural

nonlinear autoregression; MAE, mean absolute error, MAPE, mean absolute percentage error;

RMSE, root mean squared error; MER, mean error rate. A is the SARIMA approach; B is the

NNNAR approach; C is the SARIMA-NNNAR hybrid approach; D stands for the Jordan neural

network.
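The four accuracy measures and the "reduced percentages" in Table S6 can be sketched as follows. The formulas are the common definitions of MAE, MAPE, and RMSE; MER is assumed here to be the MAE divided by the mean of the observed values, and the reduction is assumed to be the relative drop from the comparison model to the hybrid. Both are assumptions to check against the main paper, and all function names are illustrative.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mer(actual, predicted):
    """Mean error rate, assumed to be MAE divided by the mean observed value."""
    return mae(actual, predicted) / (sum(actual) / len(actual))


def reduction_percent(baseline, hybrid):
    """Relative reduction (%) of the hybrid's error versus a baseline error."""
    return 100 * (baseline - hybrid) / baseline


# Fitting MAE of SARIMA (A) and SARIMA-NNNAR (C), 12-step block of Table S6.
c_vs_a_mae = reduction_percent(0.746, 0.424)
print(round(c_vs_a_mae, 3))
```

With the rounded MAE values from the table, the C versus A reduction comes out at about 43.16%, in line with the tabulated figure. Small differences in other cells are expected because the published percentages were presumably computed from unrounded errors.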

Figure S1. Architectural layout of a simple neural network nonlinear autoregressive (NNNAR(p, k)) method. The layout comprises a hidden layer with k neurons (k=7 in this layout) and p delays (p=3 in this layout), and an output layer with 1 neuron. It is a two-layer feed-forward network with a sigmoid transfer function in the hidden layer and a linear transfer function in the output layer. The network uses the observations at the p most recent lags as inputs for training and forecasting. An extension of the basic network, NNNAR(p, P, k)m, additionally uses the last P observations from the same season (with seasonal period m) as inputs.
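The architecture in this caption can be sketched as a single forward pass: build the lagged input vector, apply a sigmoid hidden layer, then a linear output neuron. The code below is an untrained illustration with hypothetical names and random weights; it is not the fitted network from the study.

```python
import math
import random


def lagged_inputs(series, t, p, P=0, m=12):
    """Inputs for predicting series[t]: lags 1..p plus seasonal lags m, 2m, ..., P*m."""
    lags = list(range(1, p + 1)) + [j * m for j in range(1, P + 1)]
    if t - max(lags) < 0:
        raise ValueError("not enough history for the requested lags")
    return [series[t - lag] for lag in lags]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Two-layer feed-forward pass: sigmoid hidden layer, linear output neuron."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sum(v * h for v, h in zip(output_weights, hidden)) + output_bias


# NNNAR(3, 7) layout as in the figure: p = 3 delays, k = 7 hidden neurons.
p, k = 3, 7
rng = random.Random(0)  # random, untrained weights for illustration only
hidden_weights = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(k)]
hidden_biases = [rng.uniform(-1, 1) for _ in range(k)]
output_weights = [rng.uniform(-1, 1) for _ in range(k)]

toy_series = [5.0, 6.0, 7.5, 6.5, 5.5, 6.0]  # made-up values
x = lagged_inputs(toy_series, t=5, p=p)
prediction = forward(x, hidden_weights, hidden_biases, output_weights, 0.0)
print(x, prediction)
```

For the seasonal variant, calling `lagged_inputs(series, t, p, P=1, m=12)` appends the value from 12 months earlier to the input vector, so the hidden layer then needs p + P weights per neuron.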


Figure S2 Flow chart of the SARIMA-NNNAR hybrid technique.


Figure S3 Sample ACF and PACF plots for the original TB incidence rate time series during January 2004 through December 2015 in Qinghai Province. The original series showed a marked seasonal pattern, as indicated by the spikes at lags 1, 12, and 24 in the ACF and PACF plots. A seasonal difference was therefore taken to stabilize the mean and variance over time.
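A seasonal difference of a monthly series subtracts from each observation the value 12 months earlier. The sketch below shows the operation on a made-up series with a trend and a repeating 12-month pattern; the function name and data are illustrative only.

```python
def seasonal_difference(series, period=12):
    """Return y_t - y_{t-period} for every t with enough history."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Made-up monthly series: repeating 12-month pattern plus a linear trend.
toy_series = [(i % 12) + i for i in range(36)]
differenced = seasonal_difference(toy_series, period=12)
print(differenced[:5])
```

The repeating pattern cancels out and only the year-over-year change remains, which is why the differenced series in the next figure looks more stationary. The first 12 observations are lost in the process.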


Figure S4 Sample ACF and PACF plots for the seasonally differenced TB incidence rate time series during January 2004 through December 2015 in Qinghai Province. These plots indicate that, after seasonal differencing, the series looks much more stationary than the original series. The local maxima at lags 1 and 12 in the ACF and PACF plots suggest that p, P, q, and Q may take values of 1 or 2.


Figure S5 Diagnostic results for the residual series of the TB morbidity rate between January 2004 and July 2016 from the SARIMA(1,0,1)(0,1,2)12 model. (A) Standardized error series; (B) autocorrelation function (ACF) plot for the error series; (C) partial autocorrelation function (PACF) plot for the error series; (D) Q-statistic P-values. Based on these graphs, the derived SARIMA model appears to be appropriate for fitting the data.


Figure S6 Diagnostic results for the residual series of the TB morbidity rate between January 2004 and July 2016 from the NNNAR(5,1,4)12 model. (A) Standardized error series; (B) autocorrelation function (ACF) plot for the errors; (C) partial autocorrelation function (PACF) plot for the errors; (D) Q-statistic P-values. These graphs suggest that the obtained NNNAR model is suitable for fitting the data.


Figure S7 Diagnostic results for the residual series of the TB morbidity rate between January 2004 and July 2016 from the SARIMA-NNNAR(2,18) hybrid model. (A) Standardized error series; (B) autocorrelation function (ACF) plot for the errors; (C) partial autocorrelation function (PACF) plot for the errors; (D) Q-statistic P-values. These graphs suggest that the derived SARIMA-NNNAR hybrid technique adequately captures the dynamic structure of the data.


Figure S8 Training error across iterations for the Jordan neural network. (A) Training error in the 5-data ahead forecasting; (B) training error in the 12-data ahead forecasting.


1 Wu W, An SY, Guan P, Huang DS, Zhou BS. Time series analysis of human brucellosis in

mainland China by using Elman and Jordan recurrent neural networks. BMC Infect Dis.

2019;19(1):414. doi:10.1186/s12879-019-4028-x

2 Bilski J, Smoląg J. Parallel Approach to Learning of the Recurrent Jordan Neural Network.

International Conference on Artificial Intelligence and Soft Computing. 2013.
