From Experimental to Applied Predictive Analytics on Big Data - Milan Vukicevic

Milan Vukićević, Assistant Professor @ University of Belgrade, Faculty of Organizational Sciences

From Experimental to Applied Predictive Analytics on Big Data - Challenges and Case Studies


Big Data Promise

• Healthcare
• Marketing
• Finance
• Banking
• Telco
• Car industry
• Education
• Etc…

• Segmentation
• Churn prediction
• Risk management
• Sentiment analyses
• Automatic recommendations
• Fraud detection
• Diagnostics
• Etc…

There is a large gap between actual data usage and potential data usage in many application areas that prevents a paradigm shift from delayed interventional to predictive and prescriptive decision making.

From predictive to prescriptive decision making – high financial and human benefits

Challenges

• High complexity of the problems

• Multi-modality of the data

• High cost of wrong decisions

• Interpretability

• From predictive to prescriptive

• Privacy concerns

• Integration of Domain Knowledge and Data Driven Methods

Van Poucke S, Thomeer M, Heath J, Vukicevic M (2016) Are Randomized Controlled Trials the (G)old Standard? From Clinical Intelligence to Prescriptive Analytics, J Med Internet Res 2016;18(7):e185. URL: http://www.jmir.org/2016/7/e185/, doi:10.2196/jmir.5549

Multiparameter Intelligent Monitoring in Intensive Care (MIMIC)

• 58,976 ICU admissions (medical, surgical, coronary care and neonatal)

• 48,000 distinct patients, admitted to Beth Israel Deaconess Medical Center (Boston, MA) from 2001 to 2012.

• Highly detailed and heterogeneous data (lab tests, vital signs, symptoms, medical imaging, notes, waveforms, etc.).

• Available to other researchers and there are no privacy concerns

State Inpatient Databases (SID), Agency for Healthcare Research and Quality Healthcare Cost and Utilization Project (HCUP)

• 330 million inpatient discharges from 46 states in the USA.

• This data tracks all hospital admissions at the individual level.

• Diagnoses and procedures coded in ICD-9-CM, plus demographics and administrative data for each admission (e.g., sex, age, month of admission, length of stay, total charges in USD, etc.).

• Open data

“Deep” data

“Wide” data

Data Sources

Bringing Predictive Analytics to Domain Experts

Scalable Predictive Analysis in Critically Ill Patients Using a Visual Open Data Analysis Platform

Van Poucke, S., Zhang, Z., Schmitz, M., Vukicevic, M., Vander Laenen, M., Celi, L. A., & De Deyne, C. (2016). Scalable Predictive Analysis in Critically Ill Patients Using a Visual Open Data Analysis Platform. PloS one, 11(1).

• Open Visual Platform

• High flexibility (sub-process/macro structure)

• Wrappers for Hadoop Stack

• Python and R scripting

Radoop (RapidMiner - Hadoop) access to Hive repository – MIMIC-III database


Automatic algorithm selection and parameter optimization

Interpretability and predictive performance

Early Prediction of Mortality Based on Laboratory Tests and Sparse Predictive Models in a Critically Ill Chronic Renal Failure Patient Population

Vukicevic M., Van Poucke S., Radovanovic S., Boer W., Stiglic G., Delibasic B. (2016) Early Prediction of Mortality Based on Laboratory Tests and Sparse Predictive Models in a Critically Ill Chronic Renal Failure Patient Population, Artificial Intelligence in Medicine, Under review

Data:
• Lab tests: mean, standard deviation and number of tests per day
• Of all renal failure patients in this study, 27.0% did not survive hospitalization

Tasks:
• Mortality risk prediction
• Identification of important lab tests over days of ICU admission
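The per-day lab test features listed above (mean, standard deviation and number of tests) could be derived roughly as follows. This is a minimal sketch, not the study's preprocessing: the tuple layout, the function name `daily_lab_features` and the sample values are assumptions made for illustration, and the population standard deviation is used so that a single test on a given day yields 0.0.

```python
from collections import defaultdict
from statistics import mean, pstdev


def daily_lab_features(measurements):
    """Aggregate raw lab measurements into per-day summary features.

    measurements: iterable of (patient_id, day, test_name, value) tuples
    (a hypothetical layout chosen for this sketch).
    Returns {(patient_id, day, test_name): {"mean", "std", "count"}}.
    """
    groups = defaultdict(list)
    for patient_id, day, test_name, value in measurements:
        groups[(patient_id, day, test_name)].append(float(value))

    features = {}
    for key, values in groups.items():
        features[key] = {
            "mean": mean(values),
            "std": pstdev(values),  # population std; 0.0 for a single test
            "count": len(values),
        }
    return features


# Made-up example values, for illustration only.
rows = [
    ("p1", 1, "creatinine", 2.0),
    ("p1", 1, "creatinine", 4.0),
    ("p1", 2, "creatinine", 3.0),
]
feats = daily_lab_features(rows)
```

Each (patient, day, test) triple then contributes three columns to the feature space, which is what lets a sparse model point at specific tests on specific days.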

Results:


GHFCS: ICD-9 based Feature Space Compression for 30-day Hospital Re-admission prediction

Radovanovic, S, Vukicevic, M, Kovacevic, A, Stiglic, G, Obradovic, Z. (2015) “Domain knowledge based hierarchical feature selection for 30-day hospital readmission prediction”, Proc. AIME 2015, the 15th Conference on Artificial Intelligence in Medicine, Pavia, Italy, June, 2015.

Predicting Hospital Re-admissions - high impact on improvement of healthcare services and reducing costs

Objective: develop efficient feature compression
• Interpretable predictive models
• No loss in predictive performance

Results:
• Traditional methods reduce feature space, but result in significant loss in predictive accuracy
• GHFCS gave the most interpretable solution (20 features) without loss in predictive performance compared to similar methods
• Multi-scale learning

Harmonic mean between Area Under Curve (AUC) and Feature Space Compression (FSC)

AUC and FSC are equally important

AUC is 5 times more important than FSC
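One way to read the two weighting scenarios above is as a weighted harmonic mean, analogous to how the F-measure combines precision and recall. The sketch below assumes that form and assumes FSC is expressed as a fraction in (0, 1]; the exact formula used in the study is not given on the slide, and the input values are made up.

```python
def weighted_harmonic_mean(auc, fsc, auc_weight=1.0, fsc_weight=1.0):
    """Weighted harmonic mean of AUC and feature space compression (FSC).

    Assumed form: (w_auc + w_fsc) / (w_auc / AUC + w_fsc / FSC).
    Returns 0.0 if either input is not positive.
    """
    if auc <= 0 or fsc <= 0:
        return 0.0
    return (auc_weight + fsc_weight) / (auc_weight / auc + fsc_weight / fsc)


# Illustrative values only: AUC = 0.8, FSC = 0.5.
equal = weighted_harmonic_mean(0.8, 0.5)                      # equal importance
auc_heavy = weighted_harmonic_mean(0.8, 0.5, auc_weight=5.0)  # AUC weighted 5x
```

With the larger AUC weight the combined score moves toward the AUC value, so a method that compresses aggressively but loses accuracy is penalized more.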

Application: re-admission prediction for pediatric patient data (HCUP) from CA, with 851 features on the lowest level of the hierarchy.

GHFCS (Group hierarchical feature compression and selection)

• Aggregates features on the highest levels possible (without loss in information potential)

• Allows comparison of feature information potential on all ICD-9 hierarchical levels and paths.

• Challenge: high sparsity and dimensionality of data.

The idea: network compression – aggregate data based on ICD-9 hierarchical graph
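The basic aggregation step, replacing several sparse leaf-level ICD-9 features with one feature at a higher level of the hierarchy, can be sketched as below. This is not GHFCS itself: GHFCS chooses the level per branch based on information potential, whereas this sketch rolls every code up to one fixed level (the category before the decimal point). The function names and the patient data are assumptions for illustration.

```python
def roll_up(code):
    """Map an ICD-9-CM code to its parent category (the part before the dot)."""
    return code.split(".")[0]


def compress(patients):
    """Aggregate each patient's leaf-level codes to category level.

    patients: {patient_id: set of ICD-9-CM code strings}.
    A category feature is present if any of its child codes is present.
    """
    return {pid: {roll_up(c) for c in codes} for pid, codes in patients.items()}


def feature_count(patients):
    """Number of distinct binary features needed to encode the patients."""
    return len(set().union(*patients.values()))


# Hypothetical patients, for illustration only.
patients = {
    "P1": {"493.00", "493.90", "486"},
    "P2": {"493.20", "250.00"},
    "P3": {"250.01"},
}
compressed = compress(patients)
before, after = feature_count(patients), feature_count(compressed)
```

In this toy case six leaf-level features collapse into three category-level features, which reduces both dimensionality and sparsity.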

Data & Knowledge

Jovanovic M, Radovanovic S, Vukicevic M, Van Poucke S, Delibasic B (2016), Building interpretable predictive models for pediatric hospital readmission using tree-lasso logistic regression, Artificial Intelligence In Medicine, DOI:10.1016/j.artmed.2016.07.003.

Integration of Domain Knowledge and Sparse Predictive Method

Building Interpretable Predictive Models for Pediatric Hospital Readmission Using Tree-Lasso Logistic Regression

Similar Predictive Performance

Quantification of information loss

Methods: In this paper various methods for data normalization (z-transformation, range transformation, proportion transformation and interquartile range) are presented and visualized discussing the most suited approach for platelet function data series.

Interventions/ Results: Normalization was calculated per assay (test) for all time points and per time point for all tests.

Conclusions: Interquartile range, range transformation and z-transformation demonstrated the correlation as calculated by the Spearman’s correlation test, when normalized per assay (test) for all time points. When normalizing per time point for all tests, no correlation could be abstracted from the charts as was the case when using all data as one dataset for normalization.
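The four normalization methods named above can be written down compactly. The definitions below are common textbook forms and are assumptions here, since the slide does not spell out the formulas (in particular, the interquartile variant is taken as (x - median) / IQR). The assay names and values are made up, and the last lines contrast normalizing per assay over all time points with normalizing per time point over all assays.

```python
from statistics import mean, median, quantiles, stdev


def z_transform(xs):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]


def range_transform(xs):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def proportion_transform(xs):
    """Express each value as a share of the series total."""
    total = sum(xs)
    return [x / total for x in xs]


def iqr_transform(xs):
    """Subtract the median and divide by the interquartile range (assumed form)."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]


# Rows are assays, columns are time points (illustrative values only).
series = {
    "assay_a": [10.0, 20.0, 30.0, 40.0, 50.0],
    "assay_b": [1.0, 2.0, 3.0, 4.0, 5.0],
}

# Per assay, over all time points: each assay keeps its own time profile.
per_assay = {name: z_transform(values) for name, values in series.items()}

# Per time point, over all assays: each column is rescaled on its own.
columns = zip(*series.values())
per_time_point = [range_transform(list(col)) for col in columns]
```

Per-assay normalization puts the two assays on a common scale while preserving their shape over time, which is the setting in which the slide reports that correlation could be demonstrated.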

Normalization Methods in Time Series of Platelet Function Assays. A SQUIRE Compliant Study

Van Poucke S, Zhang Z, Roest M, Vukicevic M, Beran M, Lauwereins M, Zheng M-H, Henskens Y, Lancé M, Marcus A (2016) Normalization Methods in Time Series of Platelet Function Assays. A SQUIRE Compliant Study, Medicine, 95(28), DOI: 10.1097/MD.0000000000004188

Data Propositionalization For Improving 30-day Hospital Re-admission Prediction

Radovanović, S., Vukićević, M., Kovačević, A., Delibašić, B., & Suknović, M. (2015). Data Propositionalization For Improving 30-day Hospital Re-admission Prediction. Proceedings of the 42nd Symposium of Operational Research – SYM-OP-IS 2015, September 2015, Belgrade, Serbia.

Predicting Hospital Re-admissions - high impact on improvement of healthcare services and reducing costs

Problem: 30-day Hospital re-admission prediction

Data: pediatric patients from CA (HCUP)

The idea: use propositionalization for feature extraction and selection.

• Feature compression
• Interpretability
• No loss in predictive performance

Results:
• Developed RapidMiner operator for feature creation based on association rules.
• Propositionalization helped feature extraction and selection. Greatly reduced feature space!
• Propositionalization performed better on 7 out of 10 data subsets.

Patient  ICD_1  ICD_2  ICD_15  ICD_999  Readmit30
P1       1      0      0       1        1
P2       1      0      1       0        0
...      ...    ...    ...     ...      ...
Pn       0      1      0       0        0

Patient  ICD_1 & ICD_999 => Readmit30  ICD_2 => ICD_12  Readmit30
P1       1                             0                1
P2       1                             0                0
...      ...                           ...              ...
Pn       0                             1                0

Challenge: high sparsity and dimensionality of data.

Association rules

Experiment: Propositionalization was applied on the top ten most common pediatric diagnoses (851 binary features). Logistic regression was performed on the newly created feature space. Performance is compared with performance on the original space.
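The feature construction step can be illustrated with a short sketch: each association rule becomes one binary feature that is 1 when a patient has every code in the rule's left-hand side. The rule mining itself is not shown, and the rules, patients and function name below are hypothetical; this is not the RapidMiner operator mentioned in the results.

```python
def rule_features(patients, antecedents):
    """Build one binary feature per rule antecedent.

    patients: {patient_id: set of diagnosis codes}.
    antecedents: list of frozensets, the left-hand sides of mined rules.
    A feature is 1 when the patient has every code in the antecedent.
    """
    return {
        pid: [int(antecedent <= codes) for antecedent in antecedents]
        for pid, codes in patients.items()
    }


# Hypothetical rule antecedents and patients, for illustration only.
antecedents = [frozenset({"ICD_1", "ICD_999"}), frozenset({"ICD_2"})]
patients = {
    "P1": {"ICD_1", "ICD_999"},
    "P2": {"ICD_1", "ICD_15"},
    "Pn": {"ICD_2"},
}
new_space = rule_features(patients, antecedents)
```

The new space has as many columns as there are rules, which is typically far fewer than the original set of binary diagnosis features.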

Low Level

Decision Support System for Hospital Readmission Prediction Based on Meta-Heuristic Feature Selection and Stacking

Radovanovic, S., Vukicevic, M., Kovacevic, A., Delibasic, B., Suknovic, M. (2015). Decision Support System for Hospital Readmission Prediction Based on Meta-Heuristic Feature Selection and Stacking. Proceedings of the 5th RapidMiner Community Meeting and Conference – RapidMiner Wisdom 2015. Ljubljana, Slovenia, August 2015. Springer International Publishing.


Data: pediatric patients from CA (HCUP)

Patient  ICD_1  ICD_2  ICD_15  ICD_999  Readmit30
P1       1      0      0       1        1
P2       1      0      1       0        0
...      ...    ...    ...     ...      ...
Pn       0      1      0       0        0

Patient  Y1     Y2     ...  Y16    Readmit30
P1       0.789  0.489  ...  0.948  1
P2       0.001  0.089  ...  0.512  0
...      ...    ...    ...  ...    ...
Pn       0.025  0.001  ...  0.302  0

High Level

PSO

Challenge: high sparsity and dimensionality of data.

The idea: create a stacked model using multiple models:
• Apply weak learners
• Create new dataset
• Perform metaheuristic based wrapper feature selection
• Give recommendation to medical doctor

Experiment: Logistic regression was applied on confidences of 16 weak learners. Seven wrapper feature selection techniques were performed. The performances of the feature selection techniques are compared.

Results:
• Developed RapidMiner operators for metaheuristic based feature selection/weighting and parameter optimization.
• Feature selection improved stacked model.
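A wrapper feature selection loop driven by a simple metaheuristic can be sketched as follows, here with hill climbing over a bit mask of the 16 weak-learner confidences. The `score` callback is a stand-in: in a real wrapper it would train the stacked logistic regression on the selected columns and return a validation measure such as AUC. The toy scoring function, its set of "useful" columns and all names are assumptions for illustration, not the RapidMiner operators from the talk.

```python
import random


def hill_climb_selection(n_features, score, iterations=2000, seed=0):
    """Hill climbing over feature masks.

    Starts from all features selected, flips one randomly chosen feature
    per step, and keeps the change only if the score strictly improves.
    """
    rng = random.Random(seed)
    current = [True] * n_features
    best_score = score(current)
    for _ in range(iterations):
        candidate = current[:]
        i = rng.randrange(n_features)
        candidate[i] = not candidate[i]
        if not any(candidate):
            continue  # never evaluate an empty feature set
        candidate_score = score(candidate)
        if candidate_score > best_score:
            current, best_score = candidate, candidate_score
    return current, best_score


# Hypothetical scoring: columns 0 and 3 carry signal, every kept column costs 0.1.
USEFUL = {0, 3}


def toy_score(mask):
    gain = sum(1 for i in USEFUL if mask[i])
    return gain - 0.1 * sum(mask)


mask, best = hill_climb_selection(16, toy_score)
selected = [i for i, keep in enumerate(mask) if keep]
```

Other metaheuristics differ mainly in how the candidate is generated and when a worse candidate may still be accepted, so they can reuse the same `score` interface.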

[Figure labels: Logistic regression, ESRS, SA, HC, VNS, ILS]


LACK OF THE DATA

(Rare diseases, Expensive trials, Privacy etc.)


Vukicevic, M, Radovanovic, S, Kovacevic, A, Stiglic G, Obradovic, Z. (2015) “Improving hospital readmission prediction using domain knowledge based virtual examples”, Proc. KMO 2015, the 10th Conference on Knowledge Management in Organization, Maribor, Slovenia, August, 2015

Experiment: Logistic regression was applied on data enriched with Virtual examples constructed by ICD-9 VEG. Performance is compared with several oversampling and ensemble techniques

Result: this strategy improves predictive performance and allows generation of unobserved comorbidities


Data: pediatric patients from CA (HCUP)

ICD-9-VEG: a tool for generating data and knowledge based Virtual Examples. It uses randomization which is controlled by the ICD-9 graph.

Application: using ICD-9 VEG to generate virtual examples for rare diseases and comorbidities, thus removing the bias of the algorithm towards frequently observed ones.

ICD9-VEG: EHR and ICD-9 based learning for Re-admission Risk Prediction

The idea: use prior knowledge from ICD-9 ontology for randomization.
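One simple way to picture hierarchy-controlled randomization is to swap a diagnosis code only for a sibling under the same ICD-9 parent, so a virtual example stays clinically close to the original record. The sketch below is an assumption about how such a generator could look, not the ICD-9-VEG implementation: the sibling rule (same category before the decimal point), the swap probability and the code list are all illustrative.

```python
import random


def sibling_map(known_codes):
    """Map each code to the other known codes in the same ICD-9 category."""
    groups = {}
    for code in known_codes:
        groups.setdefault(code.split(".")[0], []).append(code)
    return {
        code: [s for s in group if s != code]
        for group in groups.values()
        for code in group
    }


def virtual_example(codes, siblings, rng, swap_probability=0.5):
    """Copy a patient's code set, randomly swapping codes for siblings.

    A swap never leaves the original code's category, so the hierarchy
    constrains the randomization.
    """
    result = set()
    for code in sorted(codes):
        options = siblings.get(code, [])
        if options and rng.random() < swap_probability:
            result.add(rng.choice(options))
        else:
            result.add(code)
    return result


# Illustrative code list and patient record.
known = ["493.00", "493.20", "493.90", "486", "250.00", "250.01"]
siblings = sibling_map(known)
rng = random.Random(0)
original = {"493.90", "486", "250.00"}
virtual = [virtual_example(original, siblings, rng) for _ in range(5)]
```

Because swaps stay within a category, rarely observed codes can appear in virtual examples even when the original data contains only their more frequent siblings.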

Cumulatively, rare diseases account for a large portion of re-admissions.

X-axis: diseases in ascending order by frequency of appearance
Y-axis: cumulative share of each disease in the total number of readmissions

Most diseases are rarely observed.

Some diseases have re-admission risks similar to those of whole ICD groups

Privacy Preserving DSS for reducing Hospital Re-admission rates based on predictive models and knowledge and data sharing

Vukicevic, M., Radovanovic, S., Stiglic, G., Delibasic, B., Van Poucke S., Obradovic, Z. (2016) (in press) "A Data and Knowledge Driven Randomization Technique for Privacy-Preserving Data Enrichment in Hospital Readmission Prediction," 5th Workshop on Data Mining for Medicine and Healthcare, 2015 SIAM Int’l Conf. Data Mining (SDM), Miami, FL, May 2016.

Vukicevic, M, Radovanovic, S, Kovacevic, A, Delibasic B, Suknovic M (2015) “Privacy Preserving DSS for reducing Hospital Re-admission rates based on predictive models and knowledge and data sharing”, Proc. of International Conference Decision Support Systems Technology, Belgrade, Serbia, May, 2015

Lack of data is often the major obstacle for evolving highly accurate predictive models.
Reasons: rare diseases, long and expensive procedures for data collection and confidentiality of personally sensitive information.

Privacy preserving DSS for sharing of information about EHRs between hospitals, while preserving privacy (through a common VE repository).

Prevention of data quality loss by randomization: the VE generator can use some domain knowledge source (ontologies or rules) in order to randomize the original data in a controlled manner.

Experiments:
• Original – LR was evaluated on the original data from each hospital separately.
• Shared – an LR model was created on data from all hospitals (simulation of a situation where data could be shared).
• Virtual Examples – original data from each hospital was enriched with data from the VE repository.

[Figure: AUC performance per hospital (x-axis: Hospitals; y-axis: AUC Performance)]

• Models built on individual hospitals showed drastically worse performance than on Shared or Enriched data.

• Models built on Enriched data showed performance comparable to models built on Shared data.

• On 3 of the 8 hospitals, models built on Enriched data showed better performance than models built on Shared data.

White-Box (Glass-Box) Design Algorithm Understanding, Construction and Selection

Incremental Algorithm Design

• K-means (Hartigan et al, 1979)
• K-medoids (Kaufman & Rousseeuw, 1990)
• X-means (Pelleg and Moore, 2000)
• G-means (Hamerly and Elkan, 2003)
• MPCKmeans (Basu et al, 2004)
• K-means++ (Arthur et al, 2007)
• etc.

Berkhin (2006): “K-means is by far the most popular clustering algorithm used in scientific and industrial applications”.

Top 10 algorithms in data mining (Wu et al. 2008) include C4.5, CART and K-means

Why?

• Simplicity of use,

• Intuitively understandable

• Low cost of computation

• Implemented in most DM environments

K-means qualities:

Why?

• No standards – a lot of re-implementation of existing components

• Different implementation in different software

• Algorithms are kept in literature and testing environments

• Slow integration of new algorithms in popular software

Black box design:

B. Delibasic, M. Vukicevic, M. Jovanovic, K. Kirchner, J. Ruhland, M. Suknovic (2012) An architecture for component-based design of representative-based clustering algorithms, Data & Knowledge Engineering.

Delibašić B, Kirchner K, Ruhland J, Jovanović M, Vukićević M (2009) Reusable components for partitioning clustering algorithms, Artificial Intelligence Review, 32(1-4)

Generic (white-box, glass-box) algorithms
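The white-box idea can be illustrated with a representative-based clustering routine whose parts are passed in as interchangeable components. The sketch below is an assumption about what such a decomposition might look like (initialization, distance measure and representative update as plain functions); it is not the architecture from the cited papers, and all names and data points are illustrative.

```python
import random


def euclidean(a, b):
    """Distance component: straight-line distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Distance component: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def random_init(points, k, rng):
    """Initialization component: pick k distinct points as representatives."""
    return rng.sample(points, k)


def mean_update(members):
    """Update component: the representative is the coordinate-wise mean."""
    return tuple(sum(dim) / len(members) for dim in zip(*members))


def cluster(points, k, init=random_init, distance=euclidean,
            update=mean_update, iterations=20, seed=0):
    """Generic representative-based clustering assembled from components."""
    rng = random.Random(seed)
    centers = init(points, k, rng)

    def nearest(p):
        return min(range(k), key=lambda i: distance(p, centers[i]))

    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p)].append(p)
        # An empty cluster keeps its previous representative.
        centers = [update(g) if g else c for g, c in zip(groups, centers)]
    return centers, [nearest(p) for p in points]


# Two well-separated groups of made-up points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = cluster(points, k=2)
_, labels_manhattan = cluster(points, k=2, distance=manhattan)
```

With the defaults this behaves like K-means; swapping a single component (for example the distance measure) yields a different algorithm without re-implementing the rest, which is the point of component-based design.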

Vukicevic M, Kirchner K, Delibasic B, Jovanovic M, Ruhland J, Suknovic M (2012) Finding best algorithmic components for clustering microarray data, Knowledge and Information Systems

Vukicevic, M., Delibasic, B., Obradovic, Z., Jovanovic, M., Suknovic, M. (2012) "A Method for Design of Data-tailored Partitioning Algorithms for Optimizing the Number of Clusters in Microarray Analysis," Proc. 2012 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, San Diego, CA, May 2012.

Vukicevic M, Delibasic B, Jovanovic M, Suknovic M, Obradovic Z (2011) Internal Evaluation Measures as Proxies for External Indices in Clustering Gene Expression Data, In Proc. of the 2011 IEEE International Conference on Bioinformatics and Biomedicine (BIBM11), Atlanta, Georgia, USA, Nov. 12-15, 574-577

Vukicevic M., Radovanovic S., Delibasic B., Suknovic M., (2015) Extending meta-learning framework for clustering gene expression data with component based algorithm design and internal evaluation measures, International Journal of Data Mining and Bioinformatics, ISSN online: 1748-5681, In Press

Vukicevic M., Radovanovic S., Milovanovic M., Minovic M. (2014) Cloud Based Meta-learning System for Predictive Modeling of Biomedical Data, The Scientific World Journal.

Problem: Predict the state of disease based on gene-expression data
• More than 30 datasets
• High dimensionality (over 5000 features)
• Less than 100 examples on each dataset

Model:
• Extended meta-feature space (algorithm descriptions, internal validation measures)
• Extended algorithm space – over 500 algorithms evaluated

Results:

Conclusions:
• Algorithm descriptions and internal validation measures greatly improve predictive accuracy of meta-models
• RC based algorithms allow detailed examination of algorithm parts that lead to high quality cluster models
• Domain specific meta-data have high influence on predictive performance (e.g. type of chip)

Classification algs. fail to provide accurate models

Extending Meta Learning Framework With Reusable Components

• Further investigation of “Deep Data”

• Predictive analytics on heterogeneous medical sources (lab tests, comorbidities, nursing notes, imaging)

• Sensor data (smart cities) and wearable device data

• Integration of data driven methods and knowledge sources (ontologies, rulesets etc.)

• Etc.

Future work

Collaborators

Data & Analytics
