Supplementary Materials
1. Features Considered for Prediction
All features employed in prediction processes are listed in supplemental table 1.
Supplemental Table 1. List of 93 features used in machine learning based prediction of outcome. MoCA in year 4 was predicted.

Feature #  Abbreviation   Date (Year)  Feature Name
1,2        MDS_UPDRS I    0&1          Movement Disorder Society Unified Parkinson's Disease Rating Scale, Part I
3,4        MDS_UPDRS I-Q  0&1          Movement Disorder Society Unified Parkinson's Disease Rating Scale, Part I Questionnaire
5,6        MDS_UPDRS II   0&1          Movement Disorder Society Unified Parkinson's Disease Rating Scale, Part II
7,8        MDS_UPDRS III  0&1          Movement Disorder Society Unified Parkinson's Disease Rating Scale, Part III
9,10       MoCA           0&1          Montreal Cognitive Assessment
11,12      AGE, GENDER                 Age, Gender
13,14      DD DIAG, SYM                Disease Duration: Diagnosis, Symptom
15,16      REM            0&1          REM Sleep Behavior Disorder Questionnaire
17,18      ESS            0&1          Epworth Sleepiness Scale
19         UPSIT-BL1      0            University of Pennsylvania Smell Identification Test, Booklet 1
20         UPSIT-BL2      0            University of Pennsylvania Smell Identification Test, Booklet 2
21         UPSIT-BL3      0            University of Pennsylvania Smell Identification Test, Booklet 3
22         UPSIT-BL4      0            University of Pennsylvania Smell Identification Test, Booklet 4
23,24      SDMT           0&1          Symbol Digit Modalities Test
25,33      LNS-NUM1       0&1          Letter Number Sequencing, NUM 1
26,34      LNS-NUM2       0&1          Letter Number Sequencing, NUM 2
27,35      LNS-NUM3       0&1          Letter Number Sequencing, NUM 3
28,36      LNS-NUM4       0&1          Letter Number Sequencing, NUM 4
29,37      LNS-NUM5       0&1          Letter Number Sequencing, NUM 5
30,38      LNS-NUM6       0&1          Letter Number Sequencing, NUM 6
31,39      LNS-NUM7       0&1          Letter Number Sequencing, NUM 7
32,40      LNS-Total NUM  0&1          Letter Number Sequencing, total score
41,45      HVLT-R-Q       0&1          Hopkins Verbal Learning Test – Revised (Derived: Total Recall T-Score)
42,46      HVLT-R-R       0&1          Hopkins Verbal Learning Test – Revised (Derived: Delayed Recall T-Score)
43,47      HVLT-R-S       0&1          Hopkins Verbal Learning Test – Revised (Derived: Retention T-Score)
44,48      HVLT-R-T       0&1          Hopkins Verbal Learning Test – Revised (Derived: Recognition Discrimination Index T-Score)
49,50      BJLO           0&1          Benton Judgment of Line Orientation
51,52      STAIA          0&1          State-Trait Anxiety Inventory for Adults
53,54      QUIP           0&1          Questionnaire for Impulsive-Compulsive Disorders
55,56      GDS            0&1          Geriatric Depression Scale
57,58      SCOPA-AUT      0&1          Scales for Outcomes in Parkinson's Disease – Autonomic
59         MS&EADL        0            Modified Schwab & England Activities of Daily Living
60,61      GNE            0&1          General Neurological Exam
62,63      NECN           0&1          Neurological Exam, Cranial Nerves
64,72      VS-I           0&1          Vital Signs: Temperature (in Celsius)
65,73      VS-J           0&1          Vital Signs: Arm used for blood pressure
66,74      VS-K           0&1          Vital Signs: Supine BP, systolic
67,75      VS-L           0&1          Vital Signs: Supine BP, diastolic (before)
68,76      VS-M           0&1          Vital Signs: Supine heart rate (before)
69,77      VS-N           0&1          Vital Signs: Standing BP, systolic
70,78      VS-O           0&1          Vital Signs: Standing BP, diastolic
71,79      VS-P           0&1          Vital Signs: Standing heart rate
80,81      SoE            0&1          Socio-Economics
82         FH-Q1                       Family History: Biological mother with PD
83         FH-Q2                       Family History: Biological father with PD
84         FH-Q3                       Family History: Full siblings
85         FH-Q4                       Family History: Full siblings with PD
86         FH-Q5                       Family History: Maternal grandparents with PD
87         FH-Q6                       Family History: Paternal grandparents with PD
88         FH-Q7                       Family History: Maternal aunts and uncles
89         FH-Q8                       Family History: Maternal aunts and uncles with PD
90         FH-Q9                       Family History: Paternal aunts and uncles
91         FH-Q10                      Family History: Paternal aunts and uncles with PD
92         FH-Q11                      Family History: How many children do you have
93         FH-Q12                      Family History: How many children with PD
In the present work, predictor algorithms were selected from various families of learner and regressor algorithms; in addition, a range of Feature Subset Selector Algorithms (FSSAs) were considered in combination with the learner machines, to optimize prediction of outcome.
2. Predictor machines
Below, we elaborate on the 10 predictor machines used in our present work.
2.1. MLP-BP (Multi-Layer Perceptron-Back propagation)
A multilayer perceptron is a feed-forward artificial neural network model that maps sets of input data onto appropriate outputs. It uses three or more layers of neurons (nodes) with nonlinear activation functions, and is more powerful than the single-layer perceptron in that it can distinguish data that are not linearly separable, i.e. not separable by a hyperplane [1], [2]. A diagram of MLP-BP is shown in Supplemental Figure 1. In this specific work, we used a three-layer neural network, and the number of neurons in each layer was adjusted automatically via Q-learning.
Supplemental Figure 1. Diagram of MLP-BP
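The structure above can be sketched with scikit-learn's MLPRegressor as a stand-in (a minimal illustration on synthetic data; the hidden-layer sizes below are placeholders for the values the paper tuned automatically via Q-learning, which is not reproduced here):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # 200 samples, 5 features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # synthetic target for the demo

# Three-layer feed-forward network trained with back-propagation;
# hidden-layer sizes are placeholders for the Q-learning-tuned values.
mlp = MLPRegressor(hidden_layer_sizes=(32, 16), activation="relu",
                   solver="adam", max_iter=3000, random_state=0)
mlp.fit(X, y)
print(round(mlp.score(X, y), 2))   # coefficient of determination on the training set
```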
2.2. RNN (Recurrent Neural Network)
A recurrent neural network is a deep learning algorithm. The RNN, a neural network fundamentally different from feed-forward architectures, has been investigated for modelling of nonlinear behavior [3], [4]. A diagram of the RNN is shown in Supplemental Figure 2. In this specific work, we used a many-inputs-to-one-output model with a three-layer neural network, and the number of neurons in each layer was adjusted automatically via Q-learning.
Supplemental Figure 2. Diagram of the RNN
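The many-inputs-to-one-output structure can be illustrated in a few lines of NumPy (a sketch with random, untrained weights that shows only the forward pass; the trained network and its Q-learning tuner are not reproduced):

```python
import numpy as np

def rnn_many_to_one(x_seq, Wx, Wh, Wo, bh, bo):
    """Forward pass of a simple many-inputs-to-one-output RNN:
    the hidden state is updated at every time step and only the
    final hidden state is mapped to the single output."""
    h = np.zeros(Wh.shape[0])
    for x_t in x_seq:
        h = np.tanh(Wx @ x_t + Wh @ h + bh)   # recurrent state update
    return Wo @ h + bo                        # read out the last state

rng = np.random.default_rng(1)
n_in, n_hid = 4, 8
Wx = rng.normal(scale=0.5, size=(n_hid, n_in))
Wh = rng.normal(scale=0.5, size=(n_hid, n_hid))
Wo = rng.normal(scale=0.5, size=(1, n_hid))
bh, bo = np.zeros(n_hid), np.zeros(1)

x_seq = rng.normal(size=(6, n_in))            # 6 time steps of 4 inputs each
y_hat = rnn_many_to_one(x_seq, Wx, Wh, Wo, bh, bo)
print(y_hat.shape)
```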
2.3. RBF (Radial Basis Function)
The radial basis function network is a machine learning method for making predictions and forecasting. Radial basis functions are embedded into a two-layer feed-forward neural network: between the input and output layers there is a layer of processing units, called hidden units, which implement the radial basis functions [5]. A diagram of the RBF network is shown in Supplemental Figure 3. In our work, the number of neurons in the hidden layer was adjusted automatically via Q-learning.
Supplemental Figure 3. Diagram of the RBF
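One simple way to realize this two-layer structure is to fix Gaussian hidden units and fit the linear output layer by least squares (a toy sketch on synthetic data, not the paper's implementation; centers and width are arbitrary):

```python
import numpy as np

def rbf_design(X, centers, width):
    # Hidden layer: Gaussian radial basis activations around fixed centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0])                       # smooth target to approximate

centers = np.linspace(-1, 1, 10)[:, None]     # 10 hidden units
Phi = rbf_design(X, centers, width=0.3)
# Output layer: linear weights fitted by least squares.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_hat = Phi @ w
print(round(float(np.abs(y_hat - y).mean()), 3))   # mean absolute error
```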
2.4. LOLIMOT (Local Linear Model Trees)
The aim of the local linear model trees is to enable fast and easy-to-use nonlinear system identification.
LOLIMOT is a fast incremental construction algorithm for local linear neuro fuzzy models also known as
Takagi-Sugeno fuzzy models. This algorithm shortens the overall modeling development time by
reducing the number of required trial and error steps for identification of patterns. In order to achieve this
goal, the implemented algorithm has to converge in a reliable manner without any random influences. The
network structure of a local linear neuro fuzzy model is depicted in Supplemental Figure 4. Each neuron
realizes a local linear model (LLM) and an associated validity function that determines the region of
validity of the LLM [6], [7]. In this specific work, the number of neurons in the hidden layer was adjusted automatically via Q-learning.
Supplemental Figure 4. Network structure of a local linear neuro fuzzy model with M neurons for nx LLM inputs;
and nz validity function inputs Zi.
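The weighted-sum structure of such a model can be illustrated with a one-dimensional toy example (this sketch shows only the evaluation of fixed local linear models and Gaussian validity functions; the incremental tree-construction step of LOLIMOT is omitted, and all parameter values are made up):

```python
import numpy as np

def local_linear_predict(x, models):
    """Evaluate a local linear neuro-fuzzy model: each neuron holds a
    local linear model (slope a, offset b) and a Gaussian validity
    function centered at c with width s; the output is the validity-
    weighted sum of the local model outputs."""
    x = np.asarray(x, dtype=float)
    phi = np.array([np.exp(-(x - c) ** 2 / (2 * s ** 2))
                    for _, _, c, s in models])
    phi /= phi.sum(axis=0)                     # normalized validity functions
    local = np.array([a * x + b for a, b, _, _ in models])
    return (phi * local).sum(axis=0)

# Two local models: y ~ x on the left half, y ~ 2x - 1 on the right half.
models = [(1.0, 0.0, -0.5, 0.3), (2.0, -1.0, 0.5, 0.3)]
out = local_linear_predict([-0.5, 0.5], models)
print(out)
```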
2.5. DTC (Decision Tree Classification)
The decision tree classification technique is one of the most popular techniques in the emerging field of data mining. There are various methods for constructing a DTC; the Iterative Dichotomiser 3 (ID3) is the basic algorithm [8]. Many classification approaches exist, including instance-based methods, neural networks, Bayesian networks, support vector machines, and decision trees. The DTC classifies samples by sorting them down the tree from the root to some leaf node, which provides the classification of the sample. Each node in the tree specifies a test of some attribute of the sample, and each branch descending from that node corresponds to one of the possible values of this attribute [9]. Supplemental Figure 5 shows a diagram of the decision tree algorithm. In this specific work, the maximum depth was not set, so the algorithm continued until all leaves were pure. To measure the quality of a split, the Gini function was used (the Gini function describes the impurity of each node; each child node is purer than its parent node, so that the Gini function is minimized).
Supplemental Figure 5. An example of the DTC algorithm
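A minimal sketch of this configuration, assuming scikit-learn's DecisionTreeClassifier (the Gini criterion and unlimited depth mirror the settings described above; the iris data set is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# As in the text: Gini impurity for split quality, and no maximum depth,
# so splitting continues until all leaves are (nearly) pure.
clf = DecisionTreeClassifier(criterion="gini", max_depth=None, random_state=0)
clf.fit(X, y)
print(clf.get_depth(), round(clf.score(X, y), 3))   # depth and training accuracy
```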
2.6. RFA (Random Forest Algorithm)
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them [10], [11]. A diagram of the RFA is shown in Supplemental Figure 6. The depth of the structure was adjusted automatically via Q-learning; the number of trees and the number of splits were set to 1000 and 5, respectively.
Supplemental Figure 6. A diagram of the RFA algorithm
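A scaled-down sketch with scikit-learn's RandomForestRegressor (synthetic data; fewer trees and a fixed depth are used here so the example runs quickly, whereas the paper used 1000 trees, 5 splits and a Q-learning-tuned depth):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Each tree is grown on a bootstrap sample, i.e. an independently drawn,
# identically distributed random vector per tree, as described above.
rf = RandomForestRegressor(n_estimators=200, max_depth=5, random_state=0)
rf.fit(X, y)
print(round(rf.score(X, y), 2))
```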
2.7. BRR (Bayesian Ridge Regression)
The Bayesian approach views a parameter estimate as random rather than fixed. It combines prior information about the parameter (the prior distribution) with the observed data (the likelihood function) to obtain the distribution of the parameter given the data, i.e. the posterior distribution [12]. Bayesian ridge regression (BRR) corresponds to the particular case α₁ = · · · = α_m. By regularizing all features identically, BRR is not well suited when only a few features are relevant [13]. The BRR model assumes a normal likelihood density n_n(y | Xβ, σ²I_n) for the data. The conditional random variables β, σ² | λ may be assigned a conjugate multivariate normal (n) inverse-gamma (ig) prior distribution with probability density function defined by [14]:

π(β, σ² | λ) = n(β | 0, σ²λ⁻¹ I_p) ig(σ² | a, b)

and λ is assigned a gamma ga(λ | a_λ, b_λ) prior distribution [15]. In this specific work, the maximum number of iterations was set to 1000, with a stopping condition when β converged to a tolerance of 10⁻³. The initial values of λ and α were set to 10⁻⁶.
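A minimal sketch with scikit-learn's BayesianRidge, whose tolerance and hyper-prior parameters correspond to the 10⁻³ and 10⁻⁶ values reported above (synthetic sparse data, not the paper's pipeline):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 10))
coef = np.zeros(10)
coef[:3] = [2.0, -1.0, 0.5]                    # only 3 relevant features
y = X @ coef + rng.normal(scale=0.1, size=150)

# tol and the gamma hyper-prior parameters mirror the reported settings
# (tolerance 1e-3; initial alpha and lambda values 1e-6).
brr = BayesianRidge(tol=1e-3, alpha_1=1e-6, alpha_2=1e-6,
                    lambda_1=1e-6, lambda_2=1e-6)
brr.fit(X, y)
print(np.round(brr.coef_[:3], 1))   # posterior mean estimate of beta
```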
2.8. PAR (Passive Aggressive Regression)
Passive-aggressive (PAR) algorithms are a family of online algorithms for supervised learning [16]. They are similar to the perceptron in that they do not require a learning rate. This learning method updates the model when a new example is misclassified or its score does not exceed some predefined margin. PAR algorithms have proven a very successful and popular online learning technique for many real-world applications [17]. On each round, the PAR algorithm solves a constrained optimization problem that balances two competing goals: being conservative, in order to retain information acquired on preceding rounds, and being corrective, in order to make a more accurate prediction when presented with the same instance again. PAR algorithms enjoy a certain popularity in the natural language processing community, where they are often used for large-scale batch learning [18]. In this specific work, the maximum number of epochs was set to 1000, with step size 1; iterations stopped if the current loss exceeded the previous loss plus 10⁻³. Furthermore, the value of epsilon was adjusted automatically via Q-learning. The epsilon parameter is a threshold on the difference between the predicted and target values: if the difference fell below epsilon, the model was not updated.
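A minimal sketch with scikit-learn's PassiveAggressiveRegressor (synthetic data; max_iter and tol follow the text, while epsilon is fixed at an arbitrary 0.1 here, whereas the paper tuned it via Q-learning):

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 6))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0, 1.5])

# epsilon is the no-update margin: examples whose prediction error is
# below epsilon leave the model unchanged ("passive"); larger errors
# trigger an aggressive corrective update.
par = PassiveAggressiveRegressor(max_iter=1000, tol=1e-3, epsilon=0.1,
                                 random_state=0)
par.fit(X, y)
print(round(par.score(X, y), 2))
```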
2.9. Theil-Sen Regression
Theil-Sen regression, a nonparametric estimation technique, uses the median instead of the mean; the median is not sensitive to outliers in the data. Slope estimates b₁ are calculated for all pairwise sets of observations in the data, and the median of all these slopes is taken as the Theil-Sen slope estimator. The y-intercept is obtained by taking the median of all the differences (yᵢ − b₁xᵢ) [19]. One interesting way to characterize the slope of the least-squares regression line is that it is the solution of ρ(x, r(β)) = 0, where ρ is the Pearson correlation coefficient and r(β) are the residuals from a fitted line with slope β. A nonparametric counterpart to this is Theil-Sen regression, which satisfies τ(x, r(β)) = 0, where τ is Kendall's tau, a rank-based alternative to the correlation coefficient. This was proposed by Theil [20]; Sen [21] extended the results and added a confidence interval estimate. The approach is well known in selected fields (e.g. astronomy) and almost completely unknown in others. It has strong resistance to outliers and nearly full efficiency compared to linear regression when the errors are Gaussian. In our work, the maximum number of iterations was set to 1000. Because the number of least-squares solutions can be extremely large, the numbers of samples and subsamples were limited to 20000. The tolerance parameter of the spatial median was adjusted automatically via Q-learning, with initial value 10⁻³.
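The median-of-slopes construction can be written directly (a toy one-dimensional sketch; production implementations such as scikit-learn's TheilSenRegressor use spatial medians and subsampling, as described above):

```python
import numpy as np
from itertools import combinations

def theil_sen(x, y):
    """Theil-Sen estimator: the slope is the median of all pairwise
    slopes, and the intercept is the median of y - b1 * x."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[j] != x[i]]
    b1 = float(np.median(slopes))
    b0 = float(np.median(y - b1 * x))
    return b0, b1

x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
y[3] = 100.0                        # a gross outlier barely moves the fit
b0, b1 = theil_sen(x, y)
print(round(b0, 1), round(b1, 1))   # close to the true intercept 1 and slope 2
```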
2.10. LassoLars (Least Absolute Shrinkage and Selection Operator – Least Angle Regression)
LASSO is a powerful method that performs two main tasks: regularization and feature selection. The LASSO method puts a constraint on the sum of the absolute values of the model parameters: the sum has to be less than a fixed value (upper bound). To do so, the method applies a shrinking (regularization) process that penalizes the coefficients of the regression variables, shrinking some of them to zero. During the feature selection process, the variables that still have a non-zero coefficient after the shrinking process are selected to be part of the model. The goal of this process is to minimize the prediction error [22]. Least angle regression ("LARS") relates to the classic model-selection method known as forward selection, or "forward stepwise regression", given a collection of possible predictors; forward selection is an aggressive fitting technique that can be overly greedy. LassoLars is the LASSO model fit with least angle regression (LARS): a linear model trained with an L1 prior as regularizer [23]. In this specific work, the maximum number of iterations was set to 1000, and the value of the α parameter (ranging between 0.01 and 1), which multiplies the penalty term, was adjusted automatically via Q-learning.
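A minimal sketch with scikit-learn's LassoLars (synthetic sparse data; α is fixed at an arbitrary mid-range value here, whereas the paper tuned it in [0.01, 1] via Q-learning):

```python
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))
coef = np.zeros(20)
coef[[2, 7, 11]] = [3.0, -2.0, 1.5]                 # sparse ground truth
y = X @ coef + rng.normal(scale=0.1, size=100)

# alpha multiplies the L1 penalty term; the shrinking process drives
# most coefficients to exactly zero, performing feature selection.
model = LassoLars(alpha=0.05, max_iter=1000)
model.fit(X, y)
print(int((model.coef_ != 0).sum()))                # number of selected features
```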
3. Feature Subset Selector Algorithms (FSSAs)
Below we describe the 6 FSSAs used in our present work.
3.1. GA (Genetic Algorithm)
GA is a heuristic solution search technique inspired by natural evolution. It is a robust and flexible approach that can be applied to a wide range of learning and optimization problems, and is particularly suited to problems where traditional optimization techniques break down, either due to the irregular structure of the search space (for example, absence of gradient information) or because the search becomes computationally intractable [24]. The traditional theory of GAs assumes that, at a very general level of description, GAs work by discovering, emphasizing, and recombining good building blocks of solutions in a highly parallel fashion. The idea is that good solutions tend to be made up of good building blocks: combinations of bit values that often confer higher fitness on the strings in which they are present [25]. Regulatory parameters such as Maximum Number of Iterations, Population Size, Crossover Percentage, Number of Offspring (Parents), Mutation Percentage, Number of Mutants, Mutation Rate and Selection Pressure were set to 180, 250, 0.80, 160, 0.30, 60, 0.02 and 8, respectively.
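The discover-emphasize-recombine loop can be sketched for binary feature-subset selection (a toy example: the fitness function, population size and generation count below are made up for illustration; a real FSSA would score each subset by a learner's validation error):

```python
import random
random.seed(0)

N_FEATURES, TARGET = 12, {1, 4, 7}     # toy problem: 3 genuinely useful features

def fitness(mask):
    # Toy fitness: reward covering the useful features, penalize subset size.
    hits = sum(mask[i] for i in TARGET)
    return hits - 0.05 * sum(mask)

def crossover(a, b):
    cut = random.randrange(1, N_FEATURES)       # one-point crossover
    return a[:cut] + b[cut:]

def mutate(mask, rate=0.02):                    # per-bit mutation rate
    return [bit ^ (random.random() < rate) for bit in mask]

pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(30)]
for _ in range(60):                             # generations
    pop.sort(key=fitness, reverse=True)
    elite = pop[:10]                            # selection
    children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                for _ in range(20)]
    pop = elite + children

best = max(pop, key=fitness)
print([i for i, bit in enumerate(best) if bit])
```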
3.2. SA (Simulated Annealing)
Simulated annealing is a multivariable optimization technique based on the Monte Carlo method used in statistical mechanical studies of condensed systems; it proceeds by drawing an analogy between energy minimization in physical systems and cost minimization in design applications [26], [27]. Regulatory parameters such as Desired Number of Selected Features, Maximum Number of Iterations, Maximum Number of Sub-iterations, Initial Temperature, and Temperature Reduction Rate were set to 18, 180, 200, 10 and 0.99, respectively.
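The annealing loop can be sketched for feature-subset selection (a toy example: the cost function and sub-iteration count are illustrative, while the initial temperature of 10 and reduction rate of 0.99 mirror the reported settings):

```python
import math, random
random.seed(1)

N, TARGET = 12, {2, 5, 9}      # toy problem: 3 genuinely useful features

def cost(mask):
    # Toy cost: misses of useful features plus a small subset-size penalty.
    return (3 - sum(mask[i] for i in TARGET)) + 0.05 * sum(mask)

state = [random.randint(0, 1) for _ in range(N)]
best = state[:]
temp = 10.0                                   # initial temperature
for _ in range(180):                          # outer iterations
    for _ in range(50):                       # sub-iterations
        cand = state[:]
        cand[random.randrange(N)] ^= 1        # flip one feature in or out
        delta = cost(cand) - cost(state)
        # Accept improvements always; accept worse moves with Boltzmann prob.
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            state = cand
            if cost(state) < cost(best):
                best = state[:]
    temp *= 0.99                              # temperature reduction rate

print([i for i, bit in enumerate(best) if bit])
```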
3.3. DE (Differential Evolution)
The DE algorithm is a heuristic approach with three main advantages: finding the true global minimum regardless of the initial parameter values, fast convergence, and the use of few control parameters. DE is a population-based algorithm that, like genetic algorithms, uses crossover, mutation and selection operators [28], [29]. Regulatory parameters such as Desired Number of Selected Features, Lower Bound of Variables, Upper Bound of Variables, Maximum Number of Iterations, Population Size, Lower Bound of Scaling Factor, Upper Bound of Scaling Factor and Crossover Probability were set to 18, 0, 1, 180, 250, 0.20, 0.80, and 0.20, respectively.
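The mutation-crossover-selection cycle of DE can be sketched on a toy continuous objective (the bounds [0, 1], scaling-factor range [0.2, 0.8] and crossover probability 0.2 mirror the reported settings; the objective itself is made up):

```python
import random
random.seed(2)

def objective(v):                  # toy objective; global minimum at v = 0.7
    return sum((x - 0.7) ** 2 for x in v)

DIM, NP, F_LO, F_HI, CR = 5, 20, 0.2, 0.8, 0.2
pop = [[random.random() for _ in range(DIM)] for _ in range(NP)]

for _ in range(180):                                   # generations
    for i in range(NP):
        a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
        F = random.uniform(F_LO, F_HI)                 # scaling factor
        j_rand = random.randrange(DIM)
        trial = []
        for j in range(DIM):
            if random.random() < CR or j == j_rand:    # crossover
                x = a[j] + F * (b[j] - c[j])           # differential mutation
                trial.append(min(1.0, max(0.0, x)))    # respect the bounds
            else:
                trial.append(pop[i][j])
        if objective(trial) <= objective(pop[i]):      # greedy selection
            pop[i] = trial

best = min(pop, key=objective)
print(round(objective(best), 4))
```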
3.4. ACO (Ant Colony Optimization algorithm)
ACO is an optimization technique introduced in the early 1990s. The inspiring source of ant colony optimization is the foraging behavior of real ant colonies. This behavior is exploited in artificial ant colonies for the search of approximate solutions to discrete optimization problems, to continuous optimization problems, and to important problems in telecommunications, such as routing and load balancing [30], [31]. Regulatory parameters such as Desired Number of Selected Features, Maximum Number of Iterations, Number of Ants (Population Size), Initial Pheromone, Pheromone Exponential Weight, Heuristic Exponential Weight and Evaporation Rate were set to 18, 180, 250, 1, 1, 1 and 0.05, respectively.
3.5. PSO (Particle Swarm Optimization algorithm)
Particle swarm optimization is a heuristic global optimization algorithm based on swarm intelligence; it originates from research on bird flocking and fish schooling behavior. The algorithm is widely used and rapidly developed owing to its easy implementation and the few parameters that require tuning [32], [33]. Regulatory parameters such as Desired Number of Selected Features, Lower Bound of Variables, Upper Bound of Variables, Maximum Number of Iterations, Population Size (Swarm Size), Inertia Weight, Inertia Weight Damping Ratio, Personal Learning Coefficient and Global Learning Coefficient were set to 18, 0, 1, 180, 250, 0.73, 1, 1.50 and 1.50, respectively.
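The velocity-and-position update at the heart of PSO can be sketched as follows (toy objective; the inertia weight 0.73 and learning coefficients 1.5 mirror the reported settings, the rest is illustrative):

```python
import random
random.seed(3)

def objective(v):                 # toy objective; global minimum at v = 0.3
    return sum((x - 0.3) ** 2 for x in v)

DIM, SWARM, W, C1, C2 = 5, 30, 0.73, 1.5, 1.5
pos = [[random.random() for _ in range(DIM)] for _ in range(SWARM)]
vel = [[0.0] * DIM for _ in range(SWARM)]
pbest = [p[:] for p in pos]                     # personal bests
gbest = min(pbest, key=objective)               # global best

for _ in range(180):                            # iterations
    for i in range(SWARM):
        for d in range(DIM):
            r1, r2 = random.random(), random.random()
            vel[i][d] = (W * vel[i][d]
                         + C1 * r1 * (pbest[i][d] - pos[i][d])   # personal pull
                         + C2 * r2 * (gbest[d] - pos[i][d]))     # social pull
            pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
        if objective(pos[i]) < objective(pbest[i]):
            pbest[i] = pos[i][:]
    gbest = min(pbest, key=objective)

print(round(objective(gbest), 4))
```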
3.6. NSGAII (Non-dominated Sorting Genetic Algorithm II)
Multi-objective evolutionary algorithms (EAs) that use non-dominated sorting and sharing have been criticized mainly for (1) their O(MN³) computational complexity (where M is the number of objectives and N is the population size), (2) their non-elitist approach, and (3) the need to specify a sharing parameter. The non-dominated sorting genetic algorithm II (NSGA-II) is a non-dominated sorting-based multi-objective EA (MOEA) that alleviates all three difficulties [34]: a fast non-dominated sorting approach with O(MN²) computational complexity is used, and a selection operator creates a mating pool by combining the parent and offspring populations and selecting the best solutions (with respect to fitness and spread). All the optimizing machines aim to minimize error by selecting the best combination, whereas NSGAII additionally aims to reduce the number of features [34], [35]. Regulatory parameters such as Maximum Number of Iterations, Population Size, Crossover Percentage, Number of Parents (Offspring), Mutation Percentage, Number of Mutants and Mutation Rate were set to 180, 250, 0.7, 176, 100 and 0.10, respectively.
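The non-dominated sorting step that underlies NSGA-II can be sketched directly (a toy example with two minimized objectives, prediction error and number of features; the simple per-front scan below favors clarity over the fast bookkeeping of [34], and the data points are made up):

```python
def dominates(p, q):
    """p dominates q if it is no worse in every objective and strictly
    better in at least one (both objectives are minimized here)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def nondominated_fronts(points):
    """Sort points into successive non-dominated fronts, as in NSGA-II."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

# Objectives per candidate feature subset: (prediction error, # of features).
pts = [(1.83, 23), (1.60, 6), (1.55, 17), (2.00, 8), (1.55, 9), (1.90, 30)]
print(nondominated_fronts(pts))    # first front = the Pareto-optimal subsets
```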
4. Results of Feature Selection using NSGAII
In the present work, NSGAII searched for the most optimal combinations among the 93 features. It enabled selection of different combinations of features, depending on the number of epochs and on adjustments to select additional features. Like the other FSSAs, this machine was limited by its number of epochs; it therefore discovered only 12 best optimal combinations, although other optimal combinations probably exist. NSGAII is an optimization framework in which the most effective features are selected while the number of features is reduced (penalized), thus favoring combinations with few features; the smallest optimal combination contained 6 features. The different selections of features by NSGAII are shown in Supplemental Table 2.
Supplemental Table 2. Selected features by NSGAII as optimal combinations.
[Table listing the feature numbers in each of the 12 optimal combinations (of 6, 8, 9, 11, 12, 13, 14, 15, 17, 21, 22 and 23 features); the individual cell values are not recoverable from the extracted text. The 6-feature combination comprised features 9, 10, 16, 28, 35 and 51.]
We selected the 184 patients who had all 93 features available and generated 10 different arrangements among these patients. We then chose features according to the combinations selected in Supplemental Table 2. All combinations were applied to LOLIMOT, and Q-learning was utilized; the results are shown in Supplemental Figure 7 (blue line). Supplemental Table 3 shows p-values of the combinations selected by NSGAII versus the best combination result from part A (1.83 ± 0.13).
Supplemental Figure 7. Comparison between the first test and the new test using the expanded subject set (x-axis: selected optimal combination, COM.6 to COM.23; y-axis: absolute error, 1.5 to 1.9; blue: main test, red: additional independent test).
Supplemental Table 3. P-values for combinations selected by NSGAII, relative to best combination from part A of
manuscript.
Combination (# of optimal features)   6      8     9      11    12     13     14     15    17    21    22    23
P-value                               0.017  0.22  0.007  0.09  0.073  0.005  0.045  0.13  0.48  0.25  0.27  0.42
Supplemental Table 4. Selection of patients beyond original 184 patients for additional independent testing
Alternatively, we created additional independent sets, as shown in Supplemental Table 4, of patients having the features listed in Supplemental Table 2. For instance, for the combination with 6 features as selected by NSGAII, 308 patients (206 males, 102 females; average age 67.5 ± 9.9 years, range [39, 91]; average MoCA outcome 26.5 ± 3.5, range [11, 30]) had features 9, 10, 16, 28, 35 and 51 available (selected previously as the most vital features). This allowed additional validation of our work. In the expanded sets (Supplemental Table 4), e.g. 308 total patients for the 6-feature combination, we created a single arrangement for training, training validation and final testing (65% for training, 5% for training validation and 30% for final testing), while ensuring that the new final test set (93 new patients) only included newly added patients, for completely independent testing. For combination 6 with the new set of patients, the mean absolute error obtained by applying the 6 features to LOLIMOT accompanied by Q-learning was about 1.60, approximately similar to the mean absolute error (1.68 ± 0.12) reached in the prior main final test (~30% of the 184 patients who had all features). For the other combinations, training, training validation and final testing could be performed analogously according to the features selected in Supplemental Table 2. Results for these combinations are also shown in Supplemental Figure 7 (red line).
Combination (# of optimal features)   6         8         9         11        12        13        14        15        17        21        22        23
Total patients                        308       234       290       232       286       286       282       282       282       228       228       228
Female #                              102       80        193       153       191       191       188       189       189       151       151       151
Male #                                206       154       97        79        95        95        94        93        93        77        77        77
New patients                          124       50        106       48        102       102       98        98        98        44        44        44
Range: MoCA outcome                   11-30     11-30     11-30     11-30     11-30     11-30     13-30     17-30     13-30     13-30     13-30     13-30
Average: MoCA outcome                 26.5±3.5  26.6±3.5  26.5±3.3  26.5±3.5  26.5±3.4  26.5±3.4  26.6±3.2  26.7±2.9  26.6±3.2  26.6±3.3  26.6±3.3  26.6±3.3
Range: Age                            39-91     39-91     39-91     39-91     39-91     39-91     39-91     39-91     39-91     39-91     39-91     39-91
Average: Age                          67.5±9.9  68.2±9.5  67.6±9.7  68.2±9.5  67.7±9.6  67.7±9.7  67.8±9.7  67.8±9.7  67.8±9.7  68.4±9.5  68.4±9.5  68.4±9.5
Performance for the prior final tests and the new final tests is shown in Supplemental Figure 7. All mean absolute errors for the new patients were less than or equal to the mean absolute errors of the prior final test (the lowering is attributed to the larger training set in the new expanded patient set), indicating that the utilized machines generalize well.
According to Supplemental Table 2, the 5 most used features within all the 12 optimal combinations
were features 9, 10, 16, 28 and 51, namely: (i,ii) MoCA years 0 and 1, (iii) REM (Sleep Behavior
Disorder Questionnaire) year 1, (iv) LNS (Letter Number Sequencing) Number 4 year 0, and (v) STAIA
(State‐Trait Anxiety Inventory for Adults) year 0. As such, these features were the most prominent and
predictive factors. The frequency of usage of each feature within the various optimal combinations is also
shown in Supplemental Figure 8.
Supplemental Figure 8. Number of times (i.e. frequency) of usage of each feature within the optimal
combinations. Features 9, 10, 16, 28 and 51 were the most commonly used.
References
[1] S. Alsmadi, M. Khalil et al., "Back Propagation Algorithm: The Best Algorithm," IJCSNS International Journal of Computer Science and Network Security, vol. 9, no. 4, pp. 378-383, 2009.
[2] D. Rumelhart, G. Hinton and R. Williams, "Learning Representations by Back-Propagating Errors," Nature, vol. 323, pp. 533-536, 1986.
[3] A. Townley, M. Ilchmann and et al, "Existence and Learning of Oscillations in Recurrent Neural Networks," IEEE TRANSACTIONS ON NEURAL NETWORKS, vol. 11, no. 1, pp. 205-214, 2000.
[4] N. Maknickiene, . V. Rutkauskas and et al, "Investigation of financial market prediction by recurrent neural network," Innovative Infotechnologies for Science, Business and Education, vol. 11, no. 2, pp. 3-8, 2011.
[5] Y. Arora, A. Singhal and A. Bansal, "A Study of Applications of RBF Network," International Journal of Computer Applications, vol. 94, no. 2, pp. 17-20, 2014.
[6] O. Nelles, A. Fink and R. Isermann, "Local Linear Model Trees (LOLIMOT) Toolbox for Nonlinear System Identification," science Direct (IFAC System Identification), vol. 33, no. 15, pp. 845-850, 2000.
[7] J. Mart´ınez-Morales and E. Palacios, "Modeling of internal combustion engine emissions by," SciVerse Science Direct, vol. 3, pp. 251-258, 2012.
[8] M. RodneyOD and P. Goodman, "Decision tree design using information theory," Knowledge Acquisition, vol. 2, pp. 1-19, 1990.
[9] S. Chourasia, "Survey paper on improved methods of ID3 decision tree," International Journal of Scientific and Research Publications, vol. 3, no. 12, pp. 1-4, 2013.
[10] L. Breiman, "Random Forests," Machine Learning, vol. 45, p. 5–32, 2001.
[11] A. Jehad, R. Khan and N. Ahmad, "Random Forests and Decision Trees," IJCSI International Journal of Computer Science Issues, vol. 9, no. 5, pp. 272-278, 2012.
[12] A. Efendi, "A simulation study on Bayesian Ridge regression models for several collinearity levels," in AIP Conference Proceedings, 2017.
[13] C. M. Bishop, Pattern Recognition and Machine Learning, New York: Springer, 2006.
[14] G. Karabatsos, "Fast Marginal Likelihood Estimation of the Ridge Parameter(s) in Ridge Regression and Generalized Ridge Regression for Big Data," Statistics-, pp. 1-44, 2015.
[15] D. Denison,, C. Holmes and et al, Bayesian Methods for Nonlinear Classification and Regression, New York: John Wiley and Sons, 2002.
[16] K. Crammer, O. Dekel and et al, "On-line passive-aggressive algorithms," Journal of Machine Learning Research, vol. 7, pp. 551-585, 2006.
[17] J. Lu, P. Zhao and C. H. Steven , "Online Passive Aggressive Active Learning and its," JMLR: Workshop and Conference Proceedings, vol. 39, pp. 266-282, 2014.
[18] M. Blondel, Y. Kubo and N. Ueda, "Online Passive-Aggressive Algorithms for Non-Negative Matrix Factorization and Completion," Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, PMLR, vol. 33, pp. 96-104, 2014.
[19] S. H. Shah, A. Rashid, and et al, "A Comparative Study of Ordinary Least Squares Regression and Theil-Sen Regression through Simulation in the Presence of Outliers," Lasbela, U. J.Sci.Techl, vol. V, pp. 137-142, 2016.
[20] H. Theil, "A Rank-Invariant Method of Linear and Polynomial Regression Analysis, I, II, III," Nederl. Akad. Wetensch. Proc., vol. 53, pp. 386-392, 521-525, 1397-1412, 1950.
[21] P. Sen, "Estimates of the Regression Coefficient Based on Kendall's Tau," Journal of the American Statistical Association, vol. 63, no. 324, pp. 1379-1389, 1968.
[22] V. Fonti, "Feature Selection using LASSO," VU Amsterdam, Amsterdam, 2017.
[23] B. Efron, T. Hastie and et al, "Least angle regression," The Annals of Statistics, vol. 32, pp. 407-499, 2004.
[24] J. McCall, "Genetic algorithms for modelling and optimisation," Journal of Computational and Applied Mathematics, vol. 184, p. 205–222, 2004.
[25] M. Mitchell, "Genetic Algorithms: An Overview," Complexity, vol. 1, no. 1, pp. 31-39, 1995.
[26] W. Dolan, P. Cummings and M. LeVan, "Process Optimization via Simulated," AIChE Journal, vol. 35, pp. 725-736, 1989.
[27] S. Kirkpatrick, C. Gelatt and M. Vecchi, "Optimization by Simulated Annealing," Science, New Series, vol. 220, pp. 671-680, 1983.
[28] D. KARABOGA and S. OKDEM, "A Simple and Global Optimization Algorithm for," Turk J Elec Engin, vol. 12, pp. 53-60, 2004.
[29] A. Musrrat, M. Pant and A. Abraham, "Simplex Differential Evolution," Acta Polytechnica Hungarica, vol. 6, pp. 95-115, 2009.
[30] C. Blum, "Ant colony optimization: Introduction and recent trends," Physics of Life Reviews, vol. 2, pp. 353-373, 2005.
[31] P. Sivakumar and K. Elakia, "A Survey of Ant Colony Optimization," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 6, no. 3, pp. 574-578, 2016.
[32] Q. Bai, "Analysis of Particle Swarm Optimization Algorithm," Computer and Information Science, vol. 3, pp. 180-184, 2010.
[33] S. Singh, "A Review on Particle Swarm Optimization Algorithm," International Journal of Scientific & Engineering Research, vol. 5, no. 4, pp. 551-553, 2014.
[34] K. Deb, A. Pratap, S. Agarwal and T. Meyarivan, "A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182-197, 2002.
[35] Y. Yusoff, M. Ngadiman and A. Mohd Zain, "Overview of NSGA-II for Optimizing Machining Process Parameters," Procedia Engineering, vol. 15, p. 3978 – 3983, 2011.