Milan Vukićević, Assistant Professor @ University of Belgrade, Faculty of Organizational Sciences
From Experimental to Applied Predictive Analytics on Big Data - Challenges and Case Studies
Big Data Promise
• Healthcare
• Marketing
• Finance
• Banking
• Telco
• Car industry
• Education
• Etc…

• Segmentation
• Churn prediction
• Risk management
• Sentiment analyses
• Automatic recommendations
• Fraud detection
• Diagnostics
• Etc…
There is a large gap between actual data usage and potential data usage in many application areas that prevents a paradigm shift from delayed interventional to predictive and prescriptive decision making.
From predictive to prescriptive decision making – high financial and human benefits
Challenges
• High complexity of the problems
• Multi-modality of the data
• High cost of wrong decisions
• Interpretability
• From predictive to prescriptive
• Privacy concerns
• Integration of Domain Knowledge and Data Driven Methods
Van Poucke S, Thomeer M, Heath J, Vukicevic M (2016) Are Randomized Controlled Trials the (G)old Standard? From Clinical Intelligence to Prescriptive Analytics, J Med Internet Res 2016;18(7):e185. URL: http://www.jmir.org/2016/7/e185/, doi:10.2196/jmir.5549
Multiparameter Intelligent Monitoring in Intensive Care (MIMIC)
• 58,976 ICU admissions (medical, surgical, coronary care and neonatal)
• 48,000 distinct patients, admitted to Beth Israel Deaconess Medical Center (Boston, MA) from 2001 to 2012.
• Highly detailed and heterogeneous data (lab tests, vital signs, symptoms, medical imaging, notes, waveforms, etc.).
• Available to other researchers and there are no privacy concerns
State Inpatient Databases (SID), Agency for Healthcare Research and Quality Healthcare Cost and Utilization Project (HCUP)
• 330 million inpatient discharges from 46 US states.
• This data tracks all hospital admissions at the individual level.
• Diagnoses and procedures coded in ICD-9-CM, plus demographic and administrative data for each admission (e.g., sex, age, month of admission, length of stay, total charges in USD, etc.).
• Open data
“Deep” data
“Wide” data
Data Sources
Bringing Predictive Analytics to Domain Experts
Scalable Predictive Analysis in Critically Ill Patients Using a Visual Open Data Analysis Platform
Van Poucke, S., Zhang, Z., Schmitz, M., Vukicevic, M., Vander Laenen, M., Celi, L. A., & De Deyne, C. (2016). Scalable Predictive Analysis in Critically Ill Patients Using a Visual Open Data Analysis Platform. PloS one, 11(1).
• Open Visual Platform
• High flexibility (sub-process/macro structure)
• Wrappers for Hadoop Stack
• Python and R scripting
Radoop (RapidMiner - Hadoop) access to Hive repository – MIMIC III database
Automatic algorithm selection and parameter optimization
Interpretability and predictive performance
Early Prediction of Mortality Based on Laboratory Tests and Sparse Predictive Models in a Critically Ill Chronic Renal Failure Patient Population
Vukicevic M., Van Poucke S., Radovanovic S., Boer W., Stiglic G., Delibasic B. (2016) Early Prediction of Mortality Based on Laboratory Tests and Sparse Predictive Models in a Critically Ill Chronic Renal Failure Patient Population, Artificial Intelligence in Medicine, Under review
Data:
• Lab tests: mean, standard deviation and number of tests per day
• Of all renal failure patients in this study, 27.0% did not survive hospitalization
Tasks:
• Mortality risk prediction
• Identification of important lab tests over days of ICU admission
Results:
GHFCS: ICD-9 based Feature Space Compression for 30-day Hospital Re-admission prediction
Radovanovic, S, Vukicevic, M, Kovacevic, A, Stiglic, G, Obradovic, Z. (2015) “Domain knowledge based hierarchical feature selection for 30-day hospital readmission prediction”, Proc. AIME 2015, the 15th Conference on Artificial Intelligence in Medicine, Pavia, Italy, June 2015.
Predicting Hospital Re-admissions - high impact on improvement of healthcare services and reducing costs
Objective:
• Develop efficient feature compression
• Interpretable predictive models
• No loss in predictive performance
Results:
• Traditional methods reduce the feature space, but result in a significant loss in predictive accuracy
• GHFCS gave the most interpretable solution (20 features) without loss of predictive performance compared to similar methods
• Multi-scale learning
Harmonic mean between Area Under Curve (AUC) and Feature Space Compression (FSC)
AUC and FSC are equally important
AUC is 5 times more important than FSC
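The figure labels above describe a harmonic mean of AUC and FSC under two weightings. The slides do not give the exact formula, so the following is only a minimal sketch, assuming a standard weighted harmonic mean and assuming FSC is expressed as a fraction in (0, 1]; the function name and the `auc_weight` parameter are hypothetical.

```python
def weighted_harmonic_mean(auc, fsc, auc_weight=1.0):
    """Weighted harmonic mean of AUC and feature space compression (FSC).

    auc_weight=1 treats both measures as equally important;
    auc_weight=5 weights AUC five times as heavily as FSC.
    Assumption: both inputs are fractions in (0, 1].
    """
    if auc <= 0 or fsc <= 0:
        return 0.0
    # Weighted harmonic mean: (sum of weights) / (sum of weight / value)
    return (auc_weight + 1.0) / (auc_weight / auc + 1.0 / fsc)


# Same model scored under the two weightings named on the slide
equal = weighted_harmonic_mean(0.8, 0.4, auc_weight=1)
auc_heavy = weighted_harmonic_mean(0.8, 0.4, auc_weight=5)
```

With a larger AUC weight, a model with high AUC but modest compression is penalized less, which is the trade-off the two figure panels contrast.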
Application: re-admission prediction for pediatric patient data (HCUP) from CA; 851 features on the lowest level of the hierarchy
GHFCS (Group hierarchical feature compression and selection)
• Aggregates features on the highest levels possible (without loss in information potential)
• Allows comparison of feature information potential on all ICD-9 hierarchical levels and paths.
• Challenge: high sparsity and dimensionality of data.
The idea: network compression – aggregate data based on ICD-9 hierarchical graph
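To illustrate the aggregation idea only, here is a minimal sketch that rolls leaf-level ICD-9 codes up to their 3-digit categories and rebuilds the binary feature matrix. It is not GHFCS itself: GHFCS chooses the aggregation level per path based on information potential, while this sketch always aggregates to one fixed level. The patient data and helper names are hypothetical.

```python
def category(icd9_code):
    """Return the 3-digit ICD-9 category of a code, e.g. '493.01' -> '493'."""
    return icd9_code.split(".")[0]


def build_matrix(patients, key=lambda code: code):
    """Build a binary patient-by-feature matrix.

    patients: dict mapping patient id -> set of ICD-9 codes
    key: maps each code to the feature it contributes to
         (identity = leaf level, `category` = 3-digit level)
    """
    columns = sorted({key(code) for codes in patients.values() for code in codes})
    rows = {
        pid: [int(any(key(code) == col for code in codes)) for col in columns]
        for pid, codes in patients.items()
    }
    return columns, rows


# Hypothetical toy data: two asthma sub-codes and one diabetes code
patients = {"P1": {"493.01", "493.90", "250.00"}, "P2": {"493.92"}}
leaf_columns, leaf_rows = build_matrix(patients)                 # 4 sparse columns
cat_columns, cat_rows = build_matrix(patients, key=category)     # 2 denser columns
```

Aggregating to the category level halves the number of columns in this toy case and makes P1 and P2 share a feature that they did not share at the leaf level.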
Data & Knowledge
Jovanovic M, Radovanovic S, Vukicevic M, Van Poucke S, Delibasic B (2016), Building interpretable predictive models for pediatric hospital readmission using tree-lasso logistic regression, Artificial Intelligence In Medicine, DOI:10.1016/j.artmed.2016.07.003.
Integration of Domain Knowledge and Sparse Predictive Method
Building Interpretable Predictive Models for Pediatric Hospital Readmission Using Tree-Lasso Logistic Regression
Similar Predictive Performance
Quantification of information loss
Methods: In this paper various methods for data normalization (z-transformation, range transformation, proportion transformation and interquartile range) are presented and visualized discussing the most suited approach for platelet function data series.
Interventions/ Results: Normalization was calculated per assay (test) for all time points and per time point for all tests.
Conclusions: Interquartile range, range transformation and z-transformation demonstrated the correlation as calculated by the Spearman’s correlation test, when normalized per assay (test) for all time points. When normalizing per time point for all tests, no correlation could be abstracted from the charts as was the case when using all data as one dataset for normalization.
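The four normalization methods named above can be sketched as follows. These are common textbook definitions and are an assumption here; the exact variants used in the study (for example sample versus population standard deviation, or how the interquartile range is scaled) are not stated in the slides.

```python
import statistics


def z_transformation(values):
    """Center on the mean and scale by the (sample) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


def range_transformation(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so the minimum maps to new_min and the maximum to new_max."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def proportion_transformation(values):
    """Express each value as a share of the series total."""
    total = sum(values)
    return [v / total for v in values]


def interquartile_range_transformation(values):
    """Center on the median and scale by the interquartile range (Q3 - Q1)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    median = statistics.median(values)
    return [(v - median) / (q3 - q1) for v in values]


# Normalizing per assay for all time points: apply a method to each assay's
# series separately (hypothetical assay names and values).
series_by_assay = {"assay_A": [2, 4, 6], "assay_B": [10, 30, 20]}
normalized = {name: range_transformation(vals) for name, vals in series_by_assay.items()}
```

Normalizing per time point for all tests would instead group the values by time point across assays before calling the same functions.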
Normalization Methods in Time Series of Platelet Function Assays. A SQUIRE Compliant Study
Van Poucke S, Zhang Z, Roest M, Vukicevic M, Beran M, Lauwereins M, Zheng M-H, Henskens Y, Lancé M, Marcus A (2016) Normalization Methods in Time Series of Platelet Function Assays. A SQUIRE Compliant Study, Medicine, 95(28), DOI: 10.1097/MD.0000000000004188
Data Propositionalization For Improving 30-day Hospital Re-admission Prediction
Radovanović, S., Vukićević, M., Kovačević, A., Delibašić, B., & Suknović, M. (2015). Data Propositionalization For Improving 30-day Hospital Re-admission Prediction. Proceedings of the 42nd Symposium of Operational Research – SYM-OP-IS 2015, September 2015, Belgrade, Serbia.
Predicting Hospital Re-admissions - high impact on improvement of healthcare services and reducing costs
Problem: 30-day Hospital re-admission prediction
Data: pediatric patients from CA (HCUP)
The idea: use propositionalization for feature extraction and selection.
• Feature compression
• Interpretability
• No loss in predictive performance
Results:
• Developed a RapidMiner operator for feature creation based on association rules.
• Propositionalization helped feature extraction and selection and greatly reduced the feature space.
• Propositionalization performed better on 7 out of 10 data subsets.
Patient  ICD_1  ICD_2  ICD_15  ICD_999  Readmit30
P1       1      0      0       1        1
P2       1      0      1       0        0
...      ...    ...    ...     ...      ...
Pn       0      1      0       0        0
Patient  ICD_1 & ICD_999 => Readmit30  ICD_2 => ICD_12  Readmit30
P1       1                             0                1
P2       1                             0                0
...      ...                           ...              ...
Pn       0                             1                0
Challenge: high sparsity and dimensionality of data.
Association rules
Experiment: Propositionalization was applied to the ten most common pediatric diagnoses (851 binary features). Logistic regression was performed on the newly created feature space. Performance is compared with performance on the original space.
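The feature-creation step can be sketched as follows: each rule becomes a binary feature that is 1 when all features in its antecedent are present for a patient. This is a minimal illustration, assuming the association rules have already been mined; it is not the RapidMiner operator mentioned in the results, and the records and rules below are hypothetical.

```python
def propositionalize(records, rules):
    """Replace original binary features with rule-based features.

    records: dict mapping patient id -> dict of feature name -> 0/1
    rules: list of tuples, each holding the antecedent feature names of one rule
    Returns the new feature names and one row of 0/1 values per patient.
    """
    feature_names = [" & ".join(rule) for rule in rules]
    rows = {
        pid: [int(all(rec.get(name, 0) == 1 for name in rule)) for rule in rules]
        for pid, rec in records.items()
    }
    return feature_names, rows


# Hypothetical records and rule antecedents
records = {
    "P1": {"ICD_1": 1, "ICD_2": 0, "ICD_15": 0, "ICD_999": 1},
    "P2": {"ICD_1": 1, "ICD_2": 0, "ICD_15": 1, "ICD_999": 0},
    "Pn": {"ICD_1": 0, "ICD_2": 1, "ICD_15": 0, "ICD_999": 0},
}
rules = [("ICD_1", "ICD_999"), ("ICD_2",)]
names, rows = propositionalize(records, rules)
```

Four original columns collapse into two rule features here; a classifier such as logistic regression would then be trained on `rows` instead of the original sparse matrix.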
Low Level
Decision Support System for Hospital Readmission Prediction Based on Meta-Heuristic Feature Selection and Stacking
Radovanovic, S., Vukicevic, M., Kovacevic, A., Delibasic, B., Suknovic, M. (2015). Decision Support System for Hospital Readmission Prediction Based on Meta-Heuristic Feature Selection and Stacking. Proceedings of the 5th RapidMiner Community Meeting and Conference – RapidMiner Wisdom 2015. Ljubljana, Slovenia, August 2015. Springer International Publishing.
Problem: 30-day Hospital re-admission prediction
Data: pediatric patients from CA (HCUP)
Patient  ICD_1  ICD_2  ICD_15  ICD_999  Readmit30
P1       1      0      0       1        1
P2       1      0      1       0        0
...      ...    ...    ...     ...      ...
Pn       0      1      0       0        0
Patient  Y1     Y2     ...  Y16    Readmit30
P1       0.789  0.489  ...  0.948  1
P2       0.001  0.089  ...  0.512  0
...      ...    ...    ...  ...    ...
Pn       0.025  0.001  ...  0.302  0
High Level
PSO
Challenge: high sparsity and dimensionality of data.
The idea: Create a stacked model using multiple models:
• Apply weak learners
• Create new dataset
• Perform metaheuristic based wrapper feature selection
• Give recommendation to medical doctor
Experiment: Logistic regression was applied to the confidences of 16 weak learners. Seven wrapper feature selection techniques were applied, and their performance was compared.
Results:
• Developed RapidMiner operators for metaheuristic based feature selection/weighting and parameter optimization.
• Feature selection improved stacked model.
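As an illustration of wrapper feature selection over the stacked dataset, here is a minimal hill-climbing sketch, assuming that "HC" in the slide labels stands for hill climbing. The score function is supplied by the caller; in the setting above it would be something like the cross-validated performance of logistic regression on the selected weak-learner confidences. The toy score below is hypothetical and stands in for that evaluation.

```python
import random


def hill_climb_select(n_features, score, max_iters=200, seed=0):
    """Wrapper feature selection by simple hill climbing.

    Starts with all features selected, flips one randomly chosen feature
    per iteration, and keeps the flip only if the score improves.
    score: callable taking a list of booleans (the selection mask).
    """
    rng = random.Random(seed)
    current = [True] * n_features
    best = score(current)
    for _ in range(max_iters):
        i = rng.randrange(n_features)
        candidate = current.copy()
        candidate[i] = not candidate[i]
        candidate_score = score(candidate)
        if candidate_score > best:
            current, best = candidate, candidate_score
    return current, best


# Hypothetical score: features 0 and 2 are useful, every selected feature costs 0.1
USEFUL = {0, 2}


def toy_score(mask):
    gain = sum(1 for i in USEFUL if mask[i])
    return gain - 0.1 * sum(mask)


mask, best_score = hill_climb_select(5, toy_score)
```

The other metaheuristics named on the slide differ mainly in how they propose and accept candidate masks; the wrapper structure (propose a subset, evaluate the model, keep or reject) stays the same.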
Logistic regression
ES, RS, SA, HC, VNS, ILS
source:http://www.acclaimclipart.com/
LACK OF THE DATA
(Rare diseases, Expensive trials, Privacy etc.)
Vukicevic, M, Radovanovic, S, Kovacevic, A, Stiglic, G, Obradovic, Z. (2015) “Improving hospital readmission prediction using domain knowledge based virtual examples”, Proc. KMO 2015, the 10th Conference on Knowledge Management in Organization, Maribor, Slovenia, August 2015
Experiment: Logistic regression was applied on data enriched with Virtual examples constructed by ICD-9 VEG. Performance is compared with several oversampling and ensemble techniques
Result: this strategy improves predictive performance and allows generation of unobserved comorbidities
Problem: 30-day Hospital re-admission prediction
Data: pediatric patients from CA (HCUP)
ICD-9-VEG: a tool for generating data- and knowledge-based Virtual Examples. It uses randomization that is controlled by the ICD-9 graph.
Application: using ICD-9 VEG to generate virtual examples for rare diseases and comorbidities, thus removing the algorithm's bias towards frequently observed ones.
ICD9-VEG: EHR and ICD-9 based learning for Re-admission Risk Prediction
The idea: use prior knowledge from ICD-9 ontology for randomization.
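The slides do not spell out how ICD-9 VEG randomizes records, so the following is only one plausible reading, offered as a hedged sketch: a virtual example is created by swapping one of a patient's codes for a sibling code from the same 3-digit ICD-9 category, so the randomization stays within clinically related diagnoses. All names and data are hypothetical.

```python
import random


def category(code):
    """Return the 3-digit ICD-9 category of a code, e.g. '493.01' -> '493'."""
    return code.split(".")[0]


def virtual_example(codes, known_codes, rng):
    """Create a virtual example by replacing one code with a sibling.

    codes: set of ICD-9 codes observed for one patient
    known_codes: all codes available in the hierarchy (assumed given)
    rng: random.Random instance, passed in for reproducibility
    Returns the codes unchanged if the chosen code has no unused sibling.
    """
    codes = set(codes)
    target = rng.choice(sorted(codes))
    options = sorted(
        c for c in known_codes
        if category(c) == category(target) and c not in codes
    )
    if not options:
        return codes
    return (codes - {target}) | {rng.choice(options)}


# Hypothetical code list: three asthma sub-codes and one diabetes code
known_codes = {"493.00", "493.01", "493.90", "250.00"}
example = virtual_example({"493.00", "250.00"}, known_codes, random.Random(0))
```

Because replacements are drawn from the hierarchy rather than from observed records, this kind of generator can produce code combinations that never appeared in the data, which matches the stated goal of covering rare diseases and unobserved comorbidities.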
Cumulatively, rare diseases account for a large portion of re-admissions.
X-axis: diseases in ascending order by frequency of appearance
Y-axis: cumulative share of each disease in the total number of re-admissions
Most diseases are rarely observed.
Some diseases have re-admission risks similar to those of their whole ICD groups
Privacy Preserving DSS for reducing Hospital Re-admission rates based on predictive models and knowledge and data sharing
Vukicevic, M., Radovanovic, S., Stiglic, G., Delibasic, B., Van Poucke S., Obradovic, Z. (2016) (in press) "A Data and Knowledge Driven Randomization Technique for Privacy-Preserving Data Enrichment in Hospital Readmission Prediction," 5th Workshop on Data Mining for Medicine and Healthcare, 2015 SIAM Int’l Conf. Data Mining (SDM), Miami, FL, May 2016.
Vukicevic, M, Radovanovic, S, Kovacevic, A, Delibasic B, Suknovic M (2015) “Privacy Preserving DSS for reducing Hospital Re-admission rates based on predictive models and knowledge and data sharing”, Proc. of International Conference Decision Support Systems Technology, Belgrade, Serbia, May 2015
Lack of data is often the major obstacle to developing highly accurate predictive models.
Reasons: rare diseases, long and expensive procedures for data collection, and confidentiality of personally sensitive information.
Privacy-preserving DSS for sharing information about EHRs between hospitals while preserving privacy (through a common VE repository).
Prevention of data quality loss by randomization: the VE generator can use a domain knowledge source (ontologies or rules) to randomize the original data in a controlled manner.
Experiments:
• Original: LR was evaluated on the original data from each hospital separately.
• Shared: an LR model was created on data from all hospitals (simulating a situation where data could be shared).
• Virtual Examples: original data from each hospital was enriched with data from the VE repository.
[Figure: AUC performance per hospital (x-axis: Hospitals, y-axis: AUC Performance)]
• Models built on individual hospitals showed drastically worse performance than those built on Shared or Enriched data.
• Models built on Enriched data showed performance comparable to those built on Shared data.
• For 3 of 8 hospitals, models built on Enriched data outperformed models built on Shared data.
White-Box (Glass-Box) Design Algorithm Understanding, Construction and Selection
Incremental Algorithm Design
• K-means (Hartigan et al., 1979)
• K-medoids (Kaufman & Rousseeuw, 1990)
• X-means (Pelleg and Moore, 2000)
• G-means (Hamerly and Elkan, 2003)
• MPCKmeans (Basu et al., 2004)
• K-means++ (Arthur et al., 2007)
• etc.
Berkhin (2006): “K-means is by far the most popular clustering algorithm used in scientific and industrial applications".
Top 10 algorithms in data mining (Wu et al. 2008): C4.5, CART, K-means
Why?
• Simplicity of use,
• Intuitively understandable
• Low cost of computation
• Implemented in most DM environments
K-means qualities:
Why?
• No standards – a lot of re-implementation of existing components
• Different implementation in different software
• Algorithms are kept in literature and testing environments
• Slow integration of new algorithms in popular software
Black box design:
Delibasic B, Vukicevic M, Jovanovic M, Kirchner K, Ruhland J, Suknovic M (2012) An architecture for component-based design of representative-based clustering algorithms, Data & Knowledge Engineering.
Delibašić B, Kirchner K, Ruhland J, Jovanović M, Vukićević M (2009) Reusable components for partitioning clustering algorithms, Artificial Intelligence Review, 32(1-4)
Generic (white-box, glass-box) algorithms
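To make the white-box idea concrete, here is a minimal sketch of a representative-based clustering loop whose parts (initialization, distance measure, representative update) are interchangeable functions. It is an illustration of the general principle only, not the component architecture from the cited papers; all function names are hypothetical.

```python
import random


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def random_init(points, k, rng):
    """Initialization component: pick k distinct points as representatives."""
    return rng.sample(points, k)


def mean_update(group):
    """Update component in the style of K-means: the centroid of the group."""
    return tuple(sum(dim) / len(group) for dim in zip(*group))


def medoid_update(group, distance=euclidean):
    """Update component in the style of K-medoids: the most central member."""
    return min(group, key=lambda p: sum(distance(p, q) for q in group))


def assign(points, reps, distance):
    """Assign every point to its nearest representative."""
    groups = [[] for _ in reps]
    for p in points:
        nearest = min(range(len(reps)), key=lambda i: distance(p, reps[i]))
        groups[nearest].append(p)
    return groups


def cluster(points, k, init=random_init, distance=euclidean,
            update=mean_update, iters=20, seed=0):
    """Generic loop: swapping a component yields a different algorithm."""
    rng = random.Random(seed)
    reps = init(points, k, rng)
    for _ in range(iters):
        groups = assign(points, reps, distance)
        # Keep the old representative if a group ends up empty
        reps = [update(g) if g else reps[i] for i, g in enumerate(groups)]
    return reps, assign(points, reps, distance)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, _ = cluster(points, 2)                       # K-means-like
medoids, _ = cluster(points, 2, update=medoid_update)   # K-medoids-like
```

Because each part is exposed, a new initialization or update rule can be tested in combination with existing parts instead of re-implementing a whole algorithm.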
Vukicevic M, Kirchner K, Delibasic B, Jovanovic M, Ruhland J, Suknovic M (2012) Finding best algorithmic components for clustering microarray data, Knowledge and Information Systems
Vukicevic, M., Delibasic, B., Obradovic, Z., Jovanovic, M., Suknovic, M. (2012) "A Method for Design of Data-tailored Partitioning Algorithms for Optimizing the Number of Clusters in Microarray Analysis," Proc. 2012 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, San Diego, CA, May 2012.
Vukicevic M, Delibasic B, Jovanovic M, Suknovic M, Obradovic Z (2011) Internal Evaluation Measures as Proxies for External Indices in Clustering Gene Expression Data, In Proc. of the 2011 IEEE International Conference on Bioinformatics and Biomedicine (BIBM11), Atlanta, Georgia, USA, Nov. 12-15, 574-577
Vukicevic M., Radovanovic S., Delibasic B., Suknovic M. (2015) Extending meta-learning framework for clustering gene expression data with component based algorithm design and internal evaluation measures, International Journal of Data Mining and Bioinformatics, ISSN online: 1748-5681, In Press
Vukicevic M., Radovanovic S., Milovanovic M., Minovic M. (2014) Cloud Based Meta-learning System for Predictive Modeling of Biomedical Data, The Scientific World Journal.
Problem: Predict the state of disease based on gene-expression data
• More than 30 datasets
• High dimensionality (over 5000 features)
• Less than 100 examples on each dataset
Model: Extended meta – feature space (algorithm descriptions, internal validation measures)
Extended algorithm space – over 500 algorithms evaluated
Results:
Conclusions:
• Algorithm descriptions and internal validation measures greatly improve predictive accuracy of meta-models
• RC based algorithms allow detailed examination of algorithm parts that lead to high quality cluster models
• Domain specific meta-data have high influence on predictive performance (e.g. type of chip)
Classification algs. fail to provide accurate models
Extending Meta Learning Framework With Reusable Components
• Further investigation of “Deep Data”
• Predictive analytics on heterogeneous medical sources (lab tests, comorbidities, nursing notes, imaging)
• Sensor data (smart cities) and wearable device data
• Integration of data driven methods and knowledge sources (ontologies, rulesets etc.)
• Etc.
Future work
Collaborators