Getting Started: Modeling the Structure and Operations of Big Data
Session BG2, February 11, 2019
Deepesh Chandra, Associate Partner & Senior Expert
Pierre-Arnaud Klaskala, Associate Partner, Director of Product & Technology
Conflict of Interest
Deepesh Chandra, Associate Partner & Senior Expert
Pierre-Arnaud Klaskala, Associate Partner, Director of Product & Technology
Have no real or apparent conflicts of interest to report.
Learning Objectives
• Provide a technical overview of big data analytics
• Describe big data storage, frameworks, and other critical aspects of usable healthcare data structures
• Explore uses of healthcare structured/unstructured data and metadata
• Discuss transforming legacy data into trusted and actionable data structures
• Assess data analytics, data visualization, and business intelligence and their roles in big data
Contents
Introduction
Big data components
Building trusted and usable data structures
Analytics and visualization in big data
Key learnings
The opportunity represented by advanced analytics and digital in healthcare, and the urgency to act
The Challenge (1)
• $3.0T spent on healthcare in the US in 2015, more than 18% of GDP
• 1.9%: US healthcare spending grows 1.9 percentage points faster than GDP (OECD historical rate)
• 0.5%: annual growth in US healthcare labor productivity over this same period
The Current State (2)
Despite massive investment in IT, the industry still lags in maturity of advanced analytics (AA) and digital capabilities
• 12th out of 13 industries in the McKinsey Advanced Analytics maturity index
• 8th out of 9 industries in the McKinsey Digitization maturity index
• 11th out of 13 industries in terms of readiness to adopt and employ AI
The Opportunity (3)
• 20% of all local VC investment in SF in 2017 went into the AI, Big Data & Analytics sub-sector
SOURCE:
1 OECD Policy Implications of the New Economy 2000 -50 (2001); Global Insight WMM2000 -37; Espicom: World Pharmaceutical Fact Book 2008; International Monetary Fund, World Economic Outlook Database, October 2009; McKinsey
2 McKinsey Global Institute: AI, the next digital frontier; The age of analytics: competing in a data-driven world
3 Fuel by McKinsey
Effective healthcare advanced analytics and digital transformations require work across the entire analytics workflow
Analytics workflow: Source of value, Data ecosystem, Modeling insights, Workflow integration, Adoption
Phases: Analytics-to-insights, Insights-to-impact
Foundations: Technology and infrastructure, Organization and governance
SOURCE: McKinsey Analytics; McKinsey Global Institute analysis
Big data, advanced analytics, and digital need to be combined to capture business opportunities
• Big data: ingest, manage, integrate, and analyze large and complex data
• Advanced analytics: enable more sophisticated predictive and prescriptive analytics, and work against large, incomplete, or unstructured data
• Digital: application of modern (digital) technologies to core business processes
Healthcare data spans the spectrum of data complexity
Spectrum: Structured, Semi-Structured, Unstructured
Examples shown:
• Healthcare claims
• Scheduling data
• EDI communications
• Sensors and fitness trackers
• Social media
• Email, PDF, PPTX, DOCX
• Audio recording
• Medical images
• Clinical notes
80% of all data is unstructured (1), and it is growing at a CAGR of 36% (2)
1 Source: International Data Corporation, EMC Corporation, Harmony Healthcare IT
2 Source: International Data Corporation
New opportunities create requirements that traditional data stacks cannot meet
Master blueprint for a data architecture transformation
• … enabling new business insights
• … improving business transparency
• … lowering cost of IT and operations
• … increasing business agility
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" – Big Data architecture and technologies
Big data components
What is a data lake?
• Persists all raw source data in a common place (including history)
• Provides data storage and processing at extremely low cost
• Easily connects with data discovery tools to explore data
• Allows searching and integrating data without knowing the exact schema of the data
• Stores relational data as well as media, emails, PDFs and more (unstructured)
A data lake is NOT a data warehouse
▪ No facility to generate reports
▪ No harmonization or integration of data
▪ Data may be wrong or inaccurate
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" – Big Data architecture and technologies
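The "persist raw, interpret on read" idea on this slide can be illustrated with a minimal Python sketch. The `RawStore` class, its method names, and the sample records are hypothetical and not part of the presentation; a real data lake would sit on distributed file or object storage rather than an in-memory list.

```python
import json
from datetime import datetime, timezone


class RawStore:
    """Toy in-memory stand-in for a data lake's raw storage (hypothetical)."""

    def __init__(self):
        self._objects = []  # append-only, so history is preserved

    def persist(self, source, payload):
        """Store the payload exactly as received, with minimal metadata."""
        self._objects.append({
            "source": source,
            "received_at": datetime.now(timezone.utc).isoformat(),
            "payload": payload,  # never validated or transformed here
        })

    def search(self, field):
        """Schema-on-read: find JSON objects that contain `field`."""
        hits = []
        for obj in self._objects:
            try:
                doc = json.loads(obj["payload"])
            except (TypeError, ValueError):
                continue  # e.g. emails or PDFs: stored, but not parsed
            if isinstance(doc, dict) and field in doc:
                hits.append((obj["source"], doc[field]))
        return hits


store = RawStore()
store.persist("claims", '{"patient_id": 17, "amount": 120.5}')
store.persist("scheduling", '{"patient_id": 17, "slot": "2019-02-11T09:00"}')
store.persist("email", "Free-text message, not JSON")
store.persist("claims", '{"patient_id": "unknown"}')  # wrong data is stored too

print(store.search("patient_id"))
```

Note that the store accepts the inconsistent `"unknown"` value without complaint, which mirrors the slide's warning that data in a lake may be wrong or inaccurate.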
The data lake is the first step of the analytics journey and the center of the big data stack
Workflow automation
• Data Lake (transfer/clean up/expand): collection of a comprehensive and valid data set
• Analytics Garage (analyze/test/optimize): Analytics Garage with a variety of tools for analyzing the data
• Rapid Prototyping (visualize/test/improve): fast development of a prototype based on convincing ideas
• App Factory (develop/automate/operate): development of successful prototypes as solutions
▪ Data transfer
▪ Workplace
▪ Backups
▪ External data sources
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" – Big Data architecture and technologies
Landing zone, data lake and analytics environment constitute the central elements of the data lake architecture
Architecture: Data sources, Landing zone, Data lake, Advanced Analytics Environment
Data flow:
• Plain data without tagging
• Plain data with basic tagging
• Raw data fully tagged
• Prepared data
• Data for Analysis
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" – Big Data architecture and technologies
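As an illustration of this data flow, the following Python sketch moves one data set through the five states named on the slide. Which zone performs which step, the tag names (`source_system`, `owner`, `domain`, `use_case`), and the cleaning rule are assumptions made for the example, not details from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    name: str
    rows: list
    tags: dict = field(default_factory=dict)
    stage: str = "plain data without tagging"


def land(ds, source_system):
    """Landing zone (assumed role): attach basic tags only."""
    ds.tags["source_system"] = source_system
    ds.stage = "plain data with basic tagging"
    return ds


def store_raw(ds, owner, domain):
    """Data lake, raw data: complete the tags, leave the rows untouched."""
    ds.tags.update(owner=owner, domain=domain)
    ds.stage = "raw data fully tagged"
    return ds


def prepare(ds):
    """Data lake, prepared data: copy the rows, dropping empty ones (illustrative rule)."""
    rows = [r for r in ds.rows if r]
    return DataSet(ds.name + "_prepared", rows, dict(ds.tags), "prepared data")


def release_for_analysis(ds, use_case):
    """Analytics environment: hand over a tagged copy for one use case."""
    tags = dict(ds.tags, use_case=use_case)
    return DataSet(ds.name + "_" + use_case, list(ds.rows), tags, "data for analysis")


raw = store_raw(land(DataSet("claims", [{"id": 1}, {}, {"id": 2}]), "billing"), "finance", "claims")
analysis = release_for_analysis(prepare(raw), "denials")
print(raw.stage, len(raw.rows))
print(analysis.stage, analysis.rows, analysis.tags)
```

Preparation works on a copy, so the raw data set keeps all three original rows even after the empty one is dropped downstream.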
The data lake is structured into different zones that distinguish raw and production data
Landing zone
Data Lake
• Raw zone (I, II, III): tagging describes data; file storage
• Preparation
• Production zone: graph DB, file storage, relational DB
• Governance: data catalogue, taxonomy, lineage, access management, retention management
Connected via API: Advanced Analytics Environment
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" – Big Data architecture and technologies
The production zone comprises further sub-zones for specialized production purposes
Landing zone
Data Lake
• Raw zone (I, II, III): tagging describes data
• Preparation
• Production zone: Corporate production, Use case analytics zone, Satellite zones (I, II, III)
• Governance: data catalogue, taxonomy, lineage, access management, retention management
Connected via API: Advanced analytics env., Analytics workbench, DWH, Analytical apps
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" – Big Data architecture and technologies
Big data reference architecture
Layers: Data Sources, Data Ingestion, Data Lake, Data preparation Layer, Analytics Layer, Serving Layer, Frontend Layer
Cross-cutting: Hosting, Security, Monitoring and Scheduling; Meta data management, Data Governance, Data Lineage
Data Sources
• Structured Data: Electronic Medical Records; Billing and Charge Data; PO/Supply Chain; HR and operational data
• Unstructured Data: Medical images; External Sources; Web Logs; Social Media
Components shown
• Batch Ingestion; Streaming Ingestion; Extract & Load
• Transient Landing Zone; Enterprise Data Lake; Curated Zone; ODS Layer (Warm Data)
• Batch Processing; Near Real-time/Real-time Processing; Stream processing layer; Big Data Preparation Tool
• Multi-Domain MDM; Cleansed, Validated Customer data; Golden Records
• Collaborative Data science Platform; Streaming Analytics; Real time Views; Real time analytical decisions
• Data marts; Data Access Layer; Customer 360 degree Platform; Delivery Hub to Source System
• Business Intelligence Dashboards; Analytical Apps
Hot path to support streaming use cases
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" – Big Data architecture and technologies
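The coexistence of batch processing and a hot path for streaming can be sketched in a few lines of Python. This is one common way to combine the two (often described as a lambda-style design); the presentation does not name a specific pattern, and the event shape and function names below are hypothetical.

```python
from collections import Counter


def batch_view(events):
    """Batch processing: recompute visit counts over everything already in the lake."""
    return Counter(e["clinic"] for e in events)


class RealTimeView:
    """Stream processing: update counts one event at a time (the hot path)."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["clinic"]] += 1


def serve(batch, realtime):
    """Serving layer: batch result plus events seen since the last batch run."""
    return dict(batch + realtime.counts)


historical = [{"clinic": "A"}, {"clinic": "A"}, {"clinic": "B"}]
hot_path = RealTimeView()
for event in [{"clinic": "B"}, {"clinic": "C"}]:
    hot_path.on_event(event)

print(serve(batch_view(historical), hot_path))
```

In this sketch the real-time view would be reset after each batch run so that events are not counted twice; that housekeeping is omitted for brevity.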
The big data and analytics tool vendor landscape is immensely diverse and highly dynamic
Reference architecture layers: Data Sources, Data Ingestion, Data Lake, Data preparation Layer, Analytics Layer, Serving Layer, Frontend Layer
Cross-cutting: Hosting, Security, Monitoring and Scheduling; Meta data management, Data Governance, Data Lineage
Hot path to support streaming use cases
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" – Big Data architecture and technologies
Building trusted and usable data structures
Key data governance processes and supporting tools
Dimensions and key things to have (tools)
Metadata mgmt
• Business glossary
• Metadata management software
• Data lineage
• ETL code generation automated
Data quality
• Data quality tool deployed, covering data profiling, matching, cleansing, monitoring
Master data mgmt
• MDM tool
• Integration with other systems and processes
Data governance
• Data owners defined
• Data governance body
• Define data governance process
SOURCE: Digital McKinsey - Building best-in-class Data Management Architecture
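To show what "matching" and "cleansing" can mean in a master data context, here is a minimal Python sketch that merges duplicate person records into one consolidated record each. The matching rule (same name and date of birth), the "first non-empty value wins" merge policy, and all field names are assumptions for illustration; commercial MDM tools use far richer rules.

```python
def normalize(record):
    """Cleansing: trim and lowercase every string field."""
    return {k: v.strip().lower() if isinstance(v, str) else v for k, v in record.items()}


def match_key(record):
    """Matching rule (assumed): same name and date of birth means same person."""
    return (record.get("name"), record.get("dob"))


def golden_records(records):
    """Merge matched records; the first non-empty value per field wins."""
    merged = {}
    for rec in map(normalize, records):
        golden = merged.setdefault(match_key(rec), {})
        for key, value in rec.items():
            if value not in (None, "") and key not in golden:
                golden[key] = value
    return list(merged.values())


patients = [
    {"name": " Ann Lee", "dob": "1980-01-02", "phone": ""},
    {"name": "ann lee ", "dob": "1980-01-02", "phone": "555-0100"},
    {"name": "Bo Chan", "dob": "1975-05-05", "phone": None},
]
for record in golden_records(patients):
    print(record)
```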
Data quality diagnostic criteria
Normalization (2)
▪ "Good": table refers to clear directories; there is a unique key
▪ "Poor": data are stored in a big table, no directories available; key is not available
Time completeness
▪ "Good": number of entries per month from the start of data acquisition deviates by less than 50% from the median (1)
▪ "Satisfactory": number of entries deviates from the mean by more than 50% in at least one of the periods
▪ "Poor": number of entries deviates from the mean more than 2 times in at least one of the periods
Correctness
▪ "Good": no outliers (>500% of the median) (1)
▪ "Poor": more than 1% of outliers with a delta of more than 500% of the median
Quality
▪ "Good": values are presented fully and sufficiently (filled in for 90% and above)
▪ "Satisfactory": insignificant gaps (<30%) in at least one attribute
▪ "Poor": >30% of gaps in at least one attribute
1 except for pre-agreed cases
2 optional criterion for organizing data in Vertica or DB2
SOURCE: Digital McKinsey - Building best-in-class Data Management Architecture
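Three of these criteria are numeric and can be approximated in code. The Python sketch below is one possible reading of the thresholds, not the presenters' implementation: "more than 2 times" is interpreted as differing from the mean by a factor above 2, and the "Satisfactory" band for correctness (which the slide leaves empty) is an assumption. Function names are hypothetical.

```python
from statistics import mean, median


def rate_time_completeness(monthly_counts):
    """Rate how evenly records are spread across months."""
    med, avg = median(monthly_counts), mean(monthly_counts)
    if all(abs(c - med) < 0.5 * med for c in monthly_counts):
        return "Good"
    # Assumed reading of "more than 2 times": off from the mean by a factor above 2.
    if any(c > 2 * avg or c < avg / 2 for c in monthly_counts):
        return "Poor"
    return "Satisfactory"


def rate_correctness(values):
    """Rate the share of outliers, defined as values above 500% of the median."""
    med = median(values)
    outliers = sum(1 for v in values if v > 5 * med)
    if outliers == 0:
        return "Good"
    # The slide gives no "Satisfactory" cell here; treating <=1% as such is an assumption.
    return "Poor" if outliers / len(values) > 0.01 else "Satisfactory"


def rate_fill(records, attributes):
    """Rate completeness by the worst-filled attribute (the "Quality" row)."""
    worst_gap = max(
        sum(1 for r in records if r.get(a) is None) / len(records) for a in attributes
    )
    if worst_gap <= 0.10:
        return "Good"
    return "Poor" if worst_gap > 0.30 else "Satisfactory"


print(rate_time_completeness([100, 100, 100, 160]))
print(rate_correctness([10] * 9 + [100]))
print(rate_fill([{"a": 1, "b": None}, {"a": 2, "b": 3}, {"a": 3, "b": 4}, {"a": 4, "b": 5}], ["a", "b"]))
```

The footnoted exception for pre-agreed cases is not modeled; in practice such cases would be filtered out before rating.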
Example of end product – data quality diagnostics
Data catalog tools usually come with 8 core functionalities
1. Metadata repositories
2. Business glossary
3. Data lineage
4. Impact analysis
5. Rules management
6. Semantic frameworks
7. Metadata ingestion
8. Collaboration
Data catalog capabilities
SOURCE: Digital McKinsey - Data catalogs as metadata management solution
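Two of the listed capabilities, data lineage and impact analysis, amount to walking a dependency graph in opposite directions. The Python sketch below assumes lineage is recorded as a simple mapping from each data set to the data sets derived from it; the data set names and function names are hypothetical, and real catalog tools store much richer metadata.

```python
from collections import deque

# Lineage edges: data set -> data sets derived directly from it (hypothetical names).
LINEAGE = {
    "emr_raw": ["encounters_clean"],
    "billing_raw": ["charges_clean"],
    "encounters_clean": ["readmission_mart", "cost_per_encounter"],
    "charges_clean": ["cost_per_encounter"],
    "cost_per_encounter": ["finance_dashboard"],
}


def impacted_by(dataset, lineage=LINEAGE):
    """Impact analysis: everything downstream of `dataset`, breadth-first."""
    seen, queue = [], deque([dataset])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


def upstream_of(dataset, lineage=LINEAGE):
    """Data lineage: direct and indirect sources of `dataset`."""
    parents = {}
    for source, targets in lineage.items():
        for target in targets:
            parents.setdefault(target, []).append(source)
    # Walking the reversed graph reuses the same traversal.
    return impacted_by(dataset, parents)


print(impacted_by("emr_raw"))
print(upstream_of("finance_dashboard"))
```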
Analytics and visualization in big data
Analytics and visualization are fed from the data lake
Workflow automation
• Data Lake (transfer/clean up/expand): collection of a comprehensive and valid data set
• Analytics Garage (analyze/test/optimize): Analytics Garage with a variety of tools for analyzing the data
• Rapid Prototyping (visualize/test/improve): fast development of a prototype based on convincing ideas
• App Factory (develop/automate/operate): development of successful prototypes as solutions
▪ Data transfer
▪ Workplace
▪ Backups
▪ External data sources
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" – Big Data architecture and technologies
A typical big data stack has a range of coding and visualization tools
Clients and Server
Tool categories: Graphical coding; Plain coding; Exploration and Visualization; Specific Use Cases; Ext. APIs; Supporting infrastructure services
Compute tiers: Application server compute (analyst workbench); Plain compute (analyst backend); Database compute (data lake)
Options for compute engines: MapReduce, Sparkling Water, + Others
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" – Big Data architecture and technologies
Key learnings
We believe that effective healthcare advanced analytics and digital transformations require work across the entire analytics workflow
Analytics workflow: Source of value, Data ecosystem, Modeling insights, Workflow integration, Adoption
Phases: Analytics-to-insights, Insights-to-impact
Foundations: Technology and infrastructure, Organization and governance
SOURCE: McKinsey Analytics; McKinsey Global Institute analysis
Five insights into building a great big data analytic platform
#1 - Ensure everything you do starts delivering impact within six months
#2 - Use existing data to build in bite-size chunks
#3 - Deploy analytics only to solve tangible business problems
#4 - Invest twice as much in your talent, culture, and processes as in tools
#5 - Democratize data across your business to catalyze innovation from within
Questions
Deepesh Chandra, Associate Partner & Senior Expert
Pierre-Arnaud Klaskala, Associate Partner, Director of Product & Technology
Please complete the online session evaluation!