Automatic Feature Engineering with RapidMiner Auto Model:

Rapidly identifying alcoholics from their EEGs with ease, precision and accuracy.

By

Dr Gwin NYAKUENGAMA

DatAnalytics

Email: DatAnalytics@iinet.com.au

Webpage: https://dat-analytics.net/

Automatic Feature Engineering with RapidMiner Auto Model...2019/02/06  · RapidMiner Auto Model’s Feature Engineering engine automatically identified the most useful ML features




AIM

To use RapidMiner Auto Model’s automatic feature engineering to identify alcoholics from their electroencephalograms (EEGs).


KEYWORDS

Electroencephalogram (EEG); Alcoholics; RapidMiner Auto Model; Automatic Feature Engineering; Machine Learning; Classification; Deep Learning; Decision Tree; Random Forest; Gradient Boosted Tree; Support Vector Machine


ACKNOWLEDGEMENTS

We are grateful to:

• UCI for their EEG dataset and their image used in the main title;

• Previous scholars cited in this study;

• RapidMiner, Microsoft and Stata Corp for their software; and

• Our friends for their support and encouragement.


INTRODUCTION

Increasingly, both supervised and unsupervised machine learning (ML) are being used to study the adverse medical effects of alcohol on the human brain (see the literature reviews by Rangaswamy and Porjesz, 2014 and Priya et al., 2018).

In ML, features are individual measurable characteristics or dimensions that best represent the data under study. Features may be numeric values, strings or other variables. Feature engineering is the science of generating and selecting optimal features for model building and validation. Conventional least-squares statistical methods are inapplicable here because the features are autocorrelated (non-independent).

Careful feature selection is the linchpin of optimal ML model performance in terms of accuracy, precision and recall. Under- and over-fitted models carry dangers such as long machine processing times and poor model performance (see Nyakuengama 2019).
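As a reminder of how the three performance measures named above are defined, the following minimal sketch computes them from confusion-matrix counts for a two-class problem (here, alcoholic as the positive class). The function name and the example counts are illustrative assumptions and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision and recall from confusion-matrix counts.

    tp / fp: positives predicted correctly / incorrectly
    fn / tn: negatives predicted incorrectly / correctly
    """
    total = tp + fp + fn + tn
    # Accuracy: share of all predictions that are correct.
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true positives that the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Illustrative counts only (hypothetical, not taken from the study).
example = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```

A model scoring 100% on all three measures, as reported later for the best models, corresponds to fp = 0 and fn = 0.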

In this study, we successfully built five models in RapidMiner Auto Model, namely Deep Learning, Decision Tree, Random Forest, Gradient Boosted Tree and Support Vector Machine. These models made use of RapidMiner’s Feature Engineering engine.


METHOD

The UCI EEG dataset was described previously (Wang et al., 2014; Zhu et al., 2014). Variables comprised:

• Subject ID, Category (Control / Alcoholic), Time, Trial number, Sensor Position, Matching and Sensor Value (EEG).

Data preparation and preliminary data visualization were carried out in R, Stata and MS EXCEL:

• EEG signal data from SMNI_CMI_TRAIN.tar.gz were used to both train and validate models (see below). The data comprised eight control and eight alcoholic subjects.

• All 64 channels were used.

• A minimum amount of data (14% of the original dataset) was kept to minimize PC processing time:

o Only data with time equal to or greater than 0.79 were kept. This value was selected after inspecting the original EEG 3D images (see the figure "Visualization of EEG readings in control and alcoholic subjects (from original study)"), which showed a large visual difference between controls and alcoholics at times equal to or greater than 0.79.

o Only data for 44 of the 64 sensor positions were kept, namely those showing a clear visual difference between controls and alcoholics (see the figures "Visualization of EEG readings in control and alcoholic subjects" and "Selected sensor positions").
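The data reduction described above was done in R, Stata and MS EXCEL; the code used is not part of this presentation. Purely as an illustration, the two filters (time threshold and sensor-position subset) could be sketched in Python as follows. The column names, the in-memory sample rows and the three-position subset are assumptions made for this example and do not come from the study's files.

```python
import csv
import io

# Hypothetical column names and made-up rows, for illustration only.
SAMPLE_CSV = """subject_id,category,time,trial,sensor_position,matching,sensor_value
s1,Alcoholic,0.50,0,AF1,cond1,1.2
s1,Alcoholic,0.80,0,AF1,cond1,-3.4
s2,Control,0.79,1,FP1,cond1,0.7
s2,Control,0.95,1,CP2,cond1,2.1
"""

# Small subset of the selected sensor positions, to keep the example short.
KEPT_POSITIONS = {"AF1", "CP2", "FP2"}


def reduce_rows(rows, kept_positions, min_time=0.79):
    """Keep rows at or after min_time whose sensor position is in kept_positions."""
    return [
        row
        for row in rows
        if float(row["time"]) >= min_time
        and row["sensor_position"] in kept_positions
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
reduced = reduce_rows(rows, KEPT_POSITIONS)
# Share of the sample rows that survive both filters.
kept_fraction = len(reduced) / len(rows)
```

On the real dataset the same two conditions, applied with the full list of selected positions, are what the text reports as leaving 14% of the original data.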

Data was imported into RapidMiner Auto Model and ML experiments were undertaken as described previously (Nyakuengama 2018).


Visualization of EEG readings in control and alcoholic subjects (from original study)


Visualization of EEG readings in control and alcoholic subjects

[Figure panels: Locations on head (after Wang et al., 2014); Current study (summarized using Stata, plotted in MS EXCEL)]


Selected sensor positions

[Figure: Locations on head (after Wang et al., 2014)]

Used in current study:

AF1    CP2    FP2    P5
AF2    CP3    FPZ    P6
AF7    CP4    O2     P7
AF8    CP5    P01    P8
AFZ    CP6    P02    PZ
C1     CPZ    P07    T7
C2     F1     P08    T8
C3     F2     P1     TP7
C4     F3     P2     TP8
CP1    F7     P3     X
       F8     P4     Y


DATA IMPORTATION INTO RAPIDMINER AUTO MODEL


VARIABLE SELECTION


MODEL SPECIFICATION IN RAPIDMINER AUTO MODEL


RESULTS


MODEL APPLICATION


DISCUSSION

The best models built in RapidMiner Auto Model all achieved high (100%) and similar performance on every measure (i.e. Accuracy, Precision, Recall and ROC AUC). These were Random Forest (shortest run-time), Gradient Boosted Trees, Decision Tree and Deep Learning (longest run-time). Support Vector Machine was comparatively the worst model. The Naïve Bayes model failed completely; experiments using this model on large datasets with many features often fail (see references cited by Nyakuengama 2019).

Among the best models, we would choose Random Forest on account of its shortest run-time. If run-time were not a concern, any of the best models could have been chosen.
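The selection rule described above (take the models that tie on the best performance, then prefer the one with the shortest run-time) can be sketched as follows. The model labels, accuracies and run-times are hypothetical placeholders, not measurements from this study, and the helper name is an assumption made for this example.

```python
def pick_model(results, tolerance=0.0):
    """Among models within `tolerance` of the best accuracy, return the fastest."""
    best_accuracy = max(r["accuracy"] for r in results)
    # Models whose accuracy is tied with (or within tolerance of) the best.
    tied = [r for r in results if best_accuracy - r["accuracy"] <= tolerance]
    # Break the tie on run-time: shorter is better.
    return min(tied, key=lambda r: r["runtime_s"])


# Placeholder values for illustration only (not results from the study).
results = [
    {"model": "A", "accuracy": 1.00, "runtime_s": 120.0},
    {"model": "B", "accuracy": 1.00, "runtime_s": 15.0},
    {"model": "C", "accuracy": 0.90, "runtime_s": 5.0},
]

chosen = pick_model(results)
```

Here model C is the fastest overall but is excluded because it does not tie on accuracy, so the rule returns model B.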

Wang et al. (2014) previously found that:

• Some sensor positions were most useful in identifying alcoholic EEG signals. All of these sensors were also included in this study;

• A technique called K-Nearest Neighbour achieved a high accuracy of 95%;

• A data reduction technique called PCA-GE achieved comparable accuracy while using only a third of the data and a third of the run-time of other techniques; and

• Using only a third of the data and 19 channels yielded a good accuracy of around 92%.

While the current study is not directly comparable to that of Wang et al. (2014), it is noteworthy that current experiments in RapidMiner Auto Model achieved the superior performance of 100% in each of accuracy, precision and recall with just 14% of the data and 44 channels.


RapidMiner Auto Model’s Feature Engineering engine automatically identified the most useful ML features (based on their weights) as sensor values, times, matching, channels and sensor position. It is unsurprising that the usefulness of the features reflected the very experimental design of the study – the observed EEG sensor values of each individual were measured at set times, after controlling for matching, channels and the sensor positions in what is known in statistical terms as a ‘nested’ experimental design.

In this study we successfully used the Random Forest model to identify alcoholics in a new EEG signal dataset (SMNI_CMI_TEST.tar.gz) that had not been used to develop the model. In our RapidMiner Auto Model experiment this step is called model application (see the figure titled MODEL APPLICATION).


CONCLUSION

Our study showed that RapidMiner is a machine learning tool-of-choice when investigating changes in the human brain’s EEG signals arising from alcoholism, primarily because of:

• Availability of an ensemble of machine learning models;

• A state-of-the-art, automatic Feature Engineering engine and its ability to handle large datasets with several features;

• Its rapid, accurate and precise results; and

• Transparent processes around data cleansing, model building, model validation and model application on new data.

The reader is welcome to contact the author, at the email address given on the title page, to discuss any aspect of this study.


BIBLIOGRAPHY

Nyakuengama, J.G. 2018: Use of RapidMiner - Auto Model To Predict Customer Churn. https://dat-analytics.net/2018/07/28/use-of-rapidminer-auto-model-to-predict-customer-churn/

Nyakuengama, J.G. 2019: Part I: Automatic Machine Learning Document Classification – An Introduction. https://provalisresearch.com/blog/automatic-machine-learning-document-classification/

Priya, A.; Yadav, P.; Jain, S.; Bajaj, V. 2018: Efficient method for classification of alcoholic and normal EEG signals using EMD. The Journal of Engineering, Vol. 2018, Issue 3, pp. 166–172.

Rangaswamy, M.; Porjesz, B. 2014: Chapter 3 - Understanding alcohol use disorders with neuroelectrophysiology. Handbook of Clinical Neurology, Vol. 125 (3rd series). Alcohol and the Nervous System. E.V. Sullivan and A. Pfefferbaum, Editors.

Wang, S.; Li, Y.; Wen, P.P.; Lai, D. 2014: Data selection in EEG Signals Classification. https://eprints.usq.edu.au/28810/1/Data%20Selection%20in%20EEG%20Signals%20Classification.pdf

Zhu, G.; Li, Y.; Wen, P.P. 2014: Analysis of alcoholic EEG signals based on horizontal visibility graph entropy. Brain Inform. 2014 Dec; 1(1-4): 19–25.
