1
Philips HealthCare Informatics
A Perspective on Big Data, Analytics and AI
John Huffman, CTO, Philips Healthcare Informatics
September 2016, Utrecht, NL
2
A Little Bit About My Background
35 years or so of AI, reasoning and knowledge integration
• Started at Thinking Machines when it was founded in the early 80’s
– Worked with Danny Hillis and Brewster Kahle on The Connection Machine
• MCC (US Fifth Generation Project)
– Worked with Doug Lenat on AI and CYC (comprehensive common sense knowledge and reasoning project)
– Liaison to NLP and CHI groups
• Progressively worked on systems of integrated information, knowledge representation, workflow and integrated decision support through start-ups (usually my own) and finally larger companies
– Aware, SGI, Stentor, Poiesis Informatics, Philips
5
Advanced Analytics Process*
Multi-Stage Process
*CRISP-DM – Cross-Industry Standard Process for Data Mining
6
Too much focus on one component…
Multi-Stage Process
*CRISP-DM – Cross-Industry Standard Process for Data Mining
8
Analytics Lifecycle Overview
[Diagram labels: Data Ingestion, Landing Zone, Data Processing, ETLed Processed Zone, Data Cleaning, Cleaned Data, Anonymized Data Repository, Data Science, Data Scientist, Model Training, Model Evaluation, Model Repository, Production, Big Data Platform]
9
Analytics Lifecycle (more detail)
[Diagram labels: Hosted solution; Big Data platform (Original raw data, ETLed data, Anonymized data, ETLs, Data Access, Processing); Data Science Platform (Analytics and ML); Data Preparation Phase; Data Science Hosted Cluster (Create Model); Model Staging Hosted Cluster (Evaluation); Production Cluster; Feature Eng; ML Framework; ML R lib; ML Py lib; ML Algos; IPs; REST ML APIs; Models; ML Scoring Service; Model Evaluator Service; Predictive Model Evaluator; Evaluate Model; Operationalize Model; Predictive model creation; Predictive Analytical Apps; Domain Services; Proposition Owner; Scripts and Model Rep.]
10
Challenges in Data Collection and Processing
Before any analytics can start…
• Data Identification, Collection and Preparation
– Domain knowledge is important to discriminate relevant data
• ETL – extracting relevant data from raw data
• Massaging – pre-processing the data
– [Automatic] annotation of data (e.g. masking of bones in chest X-ray)
• Normalization of the data
– Especially complex when data is received from multiple sources
• Aggregation of data
– For purpose of statistical analysis
• Note – All the above steps must be done on the same set of technologies that will be present during the deployment of the resultant model
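As a minimal sketch of the normalization and aggregation steps above, the Python below converts weights reported in different units by different sites into one canonical unit and then averages them per site. The record layout, the site names, and the helper names (`normalize`, `aggregate_mean_by_site`) are assumptions made for illustration only; they are not part of any platform described in this deck.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical conversion factors to a canonical unit (kilograms).
TO_KG = {"kg": 1.0, "lb": 0.45359237, "g": 0.001}


def normalize(record):
    """Return a copy of the record with the weight expressed in kilograms."""
    factor = TO_KG[record["unit"]]
    return {
        "site": record["site"],
        "patient": record["patient"],
        "weight_kg": round(record["weight"] * factor, 2),
    }


def aggregate_mean_by_site(records):
    """Group normalized records by site and average the weights."""
    by_site = defaultdict(list)
    for r in records:
        by_site[r["site"]].append(r["weight_kg"])
    return {site: mean(weights) for site, weights in by_site.items()}


# Illustrative records: site A reports kilograms, site B mixes pounds and grams.
raw = [
    {"site": "A", "patient": "p1", "weight": 70.0, "unit": "kg"},
    {"site": "A", "patient": "p2", "weight": 80.0, "unit": "kg"},
    {"site": "B", "patient": "p3", "weight": 154.0, "unit": "lb"},
    {"site": "B", "patient": "p4", "weight": 65000.0, "unit": "g"},
]

normalized = [normalize(r) for r in raw]
site_means = aggregate_mean_by_site(normalized)
print(site_means)
```

Keeping `normalize` as a single shared function is one way to honor the note above: the same code path can then be applied to training data and to data seen after deployment.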
11
Training and Validating the Model
Which method is appropriate?
• Effective model creation requires an understanding of the nuances and strengths of different methods
– Selection of the right method depending on the task: Classification/Regression/Clustering/Dimensionality reduction…
• Identification and computation of the metric(s) to evaluate the model
– Requires training and test data
• Ensure there is no overfitting
• Validate the model
– On extended data sets, cohort variation
• Fine tune the parameters of the model
• Note – All the above steps to be done on the same set of technologies that will be present during the deployment
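The role of separate training and test data in spotting overfitting can be sketched as follows. This is an illustrative toy, not the method used on any Philips platform: the data is synthetic, the 1-nearest-neighbor classifier is chosen only because it memorizes its training set, and all names (`make_data`, `predict_1nn`, `accuracy`) are hypothetical.

```python
import random


def make_data(n, rng):
    """Synthetic 1-D data: class 0 centered at 0.0, class 1 centered at 1.5."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.5 * label, 1.0)
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Return the label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def accuracy(train, samples):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict_1nn(train, x) == y for x, y in samples)
    return hits / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
data = make_data(400, rng)
split = int(0.75 * len(data))
train, test = data[:split], data[split:]

train_acc = accuracy(train, train)  # each point is its own nearest neighbor
test_acc = accuracy(train, test)    # held-out data the model never saw
gap = train_acc - test_acc
print(f"train={train_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
```

A large gap between training and test accuracy is the usual warning sign of overfitting; the same comparison can be repeated on extended data sets or different cohorts to validate the model further.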
12
Challenges in Deployment and Operations
• Installation (On-Premise, Cloud, Hybrid)
• Configuration
• Health Monitoring
• Auto-Scaling
• Multi-Tenancy
• Disaster Recovery
• Licensing
• Performance Monitoring
• Metering and Billing
• Upgrades
• Snapshots
• Certificate Management
• Resource Utilization and Trending
• Privacy and Security
13
These Methods Are Not New
Decades to centuries old technologies
• Neural Networks
– (1943) by Warren McCulloch and Walter Pitts, originally called threshold logic
• Deep Learning
– (1965) Ivakhnenko and Lapa, papers in 1971 already described deep networks with 8 layers trained by the group method of data handling algorithm
• Random Decision Forest
– (1995) Ho
• Big Data (MapReduce)
– 2000-2004 various papers, underlying methods well-known in the mid-90’s. Apache Hadoop (open source) has been available since 2006, with its 1.0 release in 2011
• Bayesian methods
– Bayes lived in the 1700’s. Naïve Bayes methods since the 50’s
14
Some Lessons from AI History
Well-known that data is much more important than method…
• Just Google
– “More data and simple algorithms beat complex analytics methods”
• This is well-known from expert system and AI experience
– “Brittleness”: application of models on data outside the training domain frequently fails in unusual, unexpected ways
– Marvin Minsky, “Society of Mind”: complex and intelligent behavior comes from the orchestration of simple agents
• Without a broad, semantically interoperable, clean data repository – complex analytics, decision support algorithms, and workflow optimizations cannot be derived
• Data is the intellectual property in this domain
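One simple way to make the brittleness point concrete is to check whether an input lies outside the range of values seen during training before trusting a model's output on it. The sketch below is only an illustration of that idea under assumed feature names (`age`, `heart_rate`) and hypothetical helpers (`fit_ranges`, `out_of_domain`); it is not a technique attributed to the talk, and a per-feature range check is a crude stand-in for real domain-shift detection.

```python
def fit_ranges(rows):
    """Record the minimum and maximum seen for each feature in training."""
    keys = rows[0].keys()
    return {k: (min(r[k] for r in rows), max(r[k] for r in rows)) for k in keys}


def out_of_domain(ranges, row):
    """Return the features whose value lies outside the training range."""
    return [k for k, (lo, hi) in ranges.items() if not lo <= row[k] <= hi]


# Illustrative training data drawn from an adult cohort.
training_rows = [
    {"age": 34, "heart_rate": 62},
    {"age": 51, "heart_rate": 75},
    {"age": 68, "heart_rate": 88},
]
ranges = fit_ranges(training_rows)

in_domain_flags = out_of_domain(ranges, {"age": 45, "heart_rate": 70})
pediatric_flags = out_of_domain(ranges, {"age": 7, "heart_rate": 130})
print(in_domain_flags)
print(pediatric_flags)
```

An empty list means every feature falls inside the ranges seen in training; a non-empty list names the features for which the model is being asked to extrapolate.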
15
Analytics Stack
Analytics is a set of tools – not a solution
[Diagram labels: Analytical Apps, Clinical Image Analytics, Clinical Text Analytics, 3rd Party Apps, R SDK, Py SDK, REST Machine Learning APIs, JDBC/ODBC, General ML Algorithms, IPs, Deep Learning libraries, NLP building blocks, Distributed Processing framework, Model Rep., Scripts Rep., Data Repositories (S3, HDFS, Hive…)]
• Provide easy to use SDKs (R and Python)
• Prebaked thin client development environments
– RStudio and Jupyter
• All ML capabilities are exposed via RESTful APIs
• Provide higher level abstraction APIs for Clinical Text and Clinical Images
• Provide building blocks for NLP and DL frameworks
• Host Research IP assets
• Persist the models and scripts in repositories (shared across development and deployment clusters)
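The last two ideas, a model repository shared between development and deployment and scoring exposed through a JSON request/response interface, can be sketched with the Python standard library alone. Everything here is an assumption for illustration: the on-disk JSON layout, the toy linear model, the model name `risk_v1`, and the function names are hypothetical and do not describe the actual HSDP APIs. No HTTP server is started; `handle_score_request` only shows what the body of a REST handler might do.

```python
import json
import tempfile
from pathlib import Path


def save_model(repo_dir, name, model):
    """Persist a model (a plain dict of parameters) as JSON in the repository."""
    path = Path(repo_dir) / f"{name}.json"
    path.write_text(json.dumps(model))
    return path


def load_model(repo_dir, name):
    """Load a previously persisted model from the repository."""
    return json.loads((Path(repo_dir) / f"{name}.json").read_text())


def handle_score_request(repo_dir, body):
    """Score one JSON request body, as a REST endpoint handler might."""
    request = json.loads(body)
    model = load_model(repo_dir, request["model"])
    features = request["features"]
    score = model["bias"] + sum(
        model["weights"][k] * features[k] for k in model["weights"]
    )
    return json.dumps({"model": request["model"], "score": score})


with tempfile.TemporaryDirectory() as repo:
    # "Development" side: store a toy linear model in the shared repository.
    save_model(repo, "risk_v1", {"bias": 1.0, "weights": {"age": 0.5, "bmi": 2.0}})
    # "Deployment" side: score a request using the same repository.
    response = handle_score_request(
        repo, json.dumps({"model": "risk_v1", "features": {"age": 40, "bmi": 25}})
    )
    result = json.loads(response)

print(result)
```

Because both sides read the model through `load_model`, the deployed scorer uses exactly the parameters that were persisted during development, which is the point of sharing the repository across clusters.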
16
Philips Approach - HSDP
Analytics and Big Data are an integrated component of the platform
A tailored set of capabilities and tools, optimized for rapid prototyping and development of healthcare and health-related applications
• Connect – Manages, updates, monitors and remotely controls smart devices
• Store – Acquires, accesses and manages personal data from devices and applications through a cloud-hosted repository
• Authorize – Securely identifies users, authorizes consent, ensures data privacy and tracks user activity
• Share – Standardizes interfaces between HealthSuite enabled applications and devices with third-party systems
• Orchestrate – Provides functionality to help complete routine tasks and coordinate communications among users
• Host – Provides managed infrastructure to monitor the health of systems and performance of applications
• Analyze – Offers the foundational infrastructure to build decision support algorithms and machine learning applications