Page 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Leveraging Big Data for Insurance Insights Without Putting PII/PHI at Risk
February 25, 2016
Today’s Speakers
Syed Mahmood, Sr. Product Marketing Manager – Hortonworks
Cindy Maike, GM-Insurance – Hortonworks
Venkat Subramanian, CTO and VP of Engineering – Dataguise
Is Sensitive Data (PII/PHI) a challenge for your company’s analytics & big data programs? A. Yes B. No
If Yes, do you have capabilities in place to manage sensitive data discovery, protection and audit? A. Yes B. No
Big Data Business Insights Insurance Opportunities
Data Privacy Protection Requirements • Regulatory • Customer Expectations
The Insurance Data Landscape has Changed Dramatically
Customer centric / need based Insurance Offerings
500 GB of data per vehicle per year in UBI (usage-based insurance) programs
Drones will make the workflow efficient by 2020
Digital becoming consumer / Insured preferred interaction channel
Growing availability & usage of geospatial data
Change in Claim frequency & severity, fraud anomaly analytics
Industry Opportunity
High-performance analytics, or a combination of structured and unstructured data, is changing
how the insurance industry operates after decades of conservatism.
View of Insurance Industry Data Landscape
[Figure: insurance data sources plotted by data velocity (batch to real-time) against data variety (structured, semi-structured, unstructured). Sources shown:]
Weather-event
Drone image feeds
Social media
Sensor (IoT)
Geo-location
Deposition recording
Notes and diary
Medical records & bills
Transcriptions
Photos
Investigation
TPA invoices
FNOL intake
Claims triage Vendor invoices
Forms and letters
Claim system
Policy verification
Applications/Submissions
3rd party risk models
Prior loss runs
New Opportunities – Security Challenges
Columns: Use Cases & Opportunities | Data Sources (examples) | New Security Challenges

Know Your Customer
Data sources: Application documents, clickstream and web logs, marketing research, CRM records, and social media
• Coverage for multiple file types and sources
• Critical detection to find and measure sensitivity risk

Claims Optimization & Fraud Detection
Data sources: Policy records, claims databases, receipts, accident reports, emails, and transcriptions
• Reduce or eliminate PCI scope for Hadoop
• Detect new sensitivity risks in hard-to-reach unstructured data

Evaluate Risk / New Products
Data sources: Mobile telematics, sensor data, social media, and voice-to-text files
• High scale
• Large sets of small files
• Detection and protection of unstructured data

Traditional Documents & Attachments
Data sources: Claims data, insured prior loss data, and claims adjuster notes
• Masking of sensitive data for data sharing
• Sensitive data auditing

Third-party Data Sharing
Data sources: Reporting bureaus, third-party claims administrators (TPAs), telematics service providers (TSPs)
• Tiered access: highly granular roles with differing needs/views for sensitive data
Hortonworks + Dataguise = SECURE BUSINESS EXECUTION
CTO, DATAGUISE
VENKAT SUBRAMANIAN
Dataguise enables Secure Business Execution for data-driven enterprises
by delivering data-centric security solutions that Detect, Audit, Protect and Monitor
sensitive data assets where they are and wherever they move
across repositories.
©2015 Dataguise, Inc. Confidential and Proprietary. Contains confidential and proprietary information and may not be disclosed by the recipient to any third party.
Secure Business Execution
The ability of an Enterprise to safely and responsibly leverage the value of all of their data assets
for the purpose of gaining new business insights,
maximizing competitive advantage, and driving revenue growth
Business Intelligence Trend for 2016
Shift from IT-led, system-of-record reporting to pervasive, business-led, self-service analytics
• Easy-to-use, fast, agile BI & Analytics
• Deeper insights into diverse data sources
** Rita Sallam, Gartner
Data is your biggest Asset
It is also your biggest Vulnerability
DgSecure
• DETECT: Where sensitive content is present in structured, unstructured, and semi-structured data
• AUDIT: Who has access to which sensitive data; identify misalignments and risk factors
• PROTECT: Sensitive data at the element level; encrypt/decrypt with RBAC, mask
• MONITOR: Based on metadata, track how and where sensitive data is being accessed through a 360° dashboard
Across Hadoop, RDBMS, Files, NoSQL DB
On Premise, in the Cloud, or Hybrid
PHI: Guidance for Data De-Identification Sensitive/Privacy Data
• Name
• Address
• Dates – Birth, Death, ..
• Telephone Numbers
• Device Identifiers and serial numbers
• Email addresses
• SSN
• Medical record numbers
• Account Numbers
• …
Secure Environment: Perimeter Security, Volume/File Encryption
• I have strong perimeter security (Physical Security, Firewall, IDS/IPS…). Isn’t that enough?
• I have turned on volume/file-level encryption (control data access, meeting regulatory compliance). Isn’t this enough?
Need BOTH, and more!
What Should We Do?
1. Precisely locate sensitive content across ALL repositories
2. Protect those assets appropriately – masking, encryption
3. Open up ‘controlled’ access to data now that sensitive elements are protected
4. Enable employees, trusted partners and customers to make data-driven decisions
RISKS: BREACH, SECURITY, COMPLIANCE
VALUE: REVENUE, DATA DRIVEN DECISIONS, BUSINESS INTELLIGENCE
At the cell-level…
How do we do it in DgSecure
Complex Sensitive Data Discovery
Sensitive Data Type | Sample Data
Address | 50920 April Blvd. Apt. 181, Lalana ME 83271 | 1000 Coney Island Ave. Brooklyn NY 11230
Name | George Smith | Smith, A. George
Credit Card Number | 3710 664089 10315 | 345039502030507 | 3780-331072-30547
Telephone Number | (510) 824-1036 | 510-824-1036 | 510.814.1036 | 5108141036
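The format variations in the sample data above are why simple exact-match lookups miss sensitive values. As a rough illustration only (this is not how DgSecure is implemented, and the patterns and function names here are hypothetical), a pattern-based detector for telephone and credit card numbers might look like the following sketch. The Luhn checksum is a standard way to reduce false positives on card-like digit runs.

```python
import re

# Hypothetical patterns for illustration; real detectors need far broader coverage.
PHONE_RE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.]?)\d{3}[-.]?\d{4}(?!\d)")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_phones(text: str) -> list[str]:
    """Return telephone-number candidates in the order they appear."""
    return [m.group(0) for m in PHONE_RE.finditer(text)]


def find_cards(text: str) -> list[str]:
    """Return card-number candidates that also pass the Luhn check."""
    hits = []
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group(0))
        if luhn_ok(digits):
            hits.append(m.group(0))
    return hits
```

Names and addresses, the other two rows in the table, generally cannot be found with patterns alone and need dictionaries or context, which this sketch does not attempt.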
Sensitive Data Protection: Masking & Encryption in Hadoop
• MASKING
– Obfuscation, one-way operation
– Multiple options in DgSecure – fictitious but realistic values, X’ing out part of the content…
– Consistent masking to retain statistical distribution of data
• ENCRYPTION
– Encrypted cell/row
– Accessible by authorized users only
– Hive, bulk, via App
– Granular protection
• REDACTION
– X’ing out entire sensitive data cell
– Nullifying
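To make the three options above concrete, here is a minimal sketch of partial masking, consistent masking, and redaction. It is an assumption-laden illustration, not DgSecure's algorithms: the function names and the demo key are hypothetical, and the consistent mask uses a keyed HMAC from the Python standard library as a stand-in technique.

```python
import hashlib
import hmac

# Hypothetical secret for the sketch; a real deployment would manage keys securely.
MASKING_KEY = b"demo-only-key"


def partial_mask(value: str, keep_last: int = 4, mask_char: str = "X") -> str:
    """X out every alphanumeric character except the last `keep_last` ones."""
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - keep_last, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)   # keep separators and the trailing characters
    return "".join(out)


def consistent_mask_digits(value: str, key: bytes = MASKING_KEY) -> str:
    """Replace each digit with a digit derived from an HMAC of the whole value.

    Equal inputs always give equal outputs, and the result cannot be
    reversed without guessing the input, so it is a one-way operation.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    i = 0
    for ch in value:
        if ch.isdigit():
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)   # preserve the original format
    return "".join(out)


def redact(value: str) -> str:
    """X out the entire cell."""
    return "X" * len(value)
```

Because the consistent mask is deterministic, the same SSN masks to the same value in every file, so counts and joins on the masked column still line up.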
[Screenshots: Masking Data in Hadoop (Cell Level); Encrypting Data in Hadoop (Cell Level)]
Decryption through Hive queries
[Screenshots: query results for a user WITHOUT access privileges on Names & SSN, and for a user WITH access privileges on Names & SSN]
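The with/without-privileges contrast can be sketched as a per-column entitlement check. This is a simplified stand-in, assuming a hypothetical `render_row` helper: it substitutes a placeholder where the real system would leave the encrypted value, and it performs no actual decryption.

```python
from typing import Mapping

PLACEHOLDER = "<protected>"


def render_row(row: Mapping[str, str],
               sensitive_columns: set[str],
               user_entitlements: set[str]) -> dict[str, str]:
    """Return the row as a given user would see it.

    Sensitive columns the user is not entitled to are replaced by a
    placeholder; every other column passes through unchanged.
    """
    visible = {}
    for column, value in row.items():
        if column in sensitive_columns and column not in user_entitlements:
            visible[column] = PLACEHOLDER
        else:
            visible[column] = value
    return visible
```

With `sensitive_columns = {"name", "ssn"}`, a user with no entitlements sees placeholders in both columns, while a user entitled to both sees the row unchanged.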
Encryption or Masking in Hadoop
[Figure, shown as a series of builds: big data use cases, labeled Analytic and Transactional, alongside the sensitive data elements they touch. Later builds mark elements with Mask or Encrypt.]
Use cases: Trading System Perf., Customer retention, Payments Risk Mgmt., IT Security Intelligence, Dynamic pricing, Process efficiency, Log analysis, Insurance Premiums, Clinical trial analysis, Smart metering, Risk Modeling, Supply chain optimization, Brand sentiment, Real-time upsell, Monitoring Sensors, Person of Interest Discovery, Session Optimization
Sensitive data elements: Name, Personal Health Info, Credit Card Number, Social Security Number, Date of Birth (DOB), IP Address, URL, Email Address, Telephone Number, Credit limit, Purchase amount, Customer lifetime value, Address, Device ID, Transaction Date, VIN, Medical test results, Biometric IDs
How does this work in DgSecure
HIGH-LEVEL DgSECURE FOR HADOOP FUNCTIONALITY
• Policy Management
– Domain definition
– Custom elements: composite, dependent
– Policy: per data feed?
– Protection options
• Detection
– In-flight, within HDFS
– Full vs. incremental
– Structured vs. semi/unstructured
– Quick scan
– Element count
• Auditing
– Files/Dirs: sensitive elements, protected?, who has access
– Users: what can they see
• Protection
– Domain based
– Masking
– Redaction
– Encryption: field or record, AES or FPE
• Reporting
– Job level: sensitive elements, directories & files, remediation applied
– Dashboard: directory or by policy, drill-down
– Audit report: user actions
– Notifications
[Screenshots of DgSecure screens: Set Policy; Data Elements; Define/Execute Detection/Protection Task; Discovery Task Result; Masking Task Result; Dashboard; Entitlement Reports; Audit Reports]
Sample Secure Business Workflow in an Enterprise
Sample End to End Flow
1. CISO/CPO: Set policy per data feed type
2. Data Asset Owner: Provenance metadata
3. IT/Set Process: Run Discovery to detect sensitive data; metadata to repository (Atlas)
4. IT/Set Process: Use metadata to set access control in Ranger
5. Run Masking/Encryption to protect sensitive data; metadata incl. lineage to repository (Atlas)
6. IT/Set Process: Use metadata to set access control in Ranger
7. Data Asset Owner adds annotations and adds to Data Asset Index
8. Data Scientist browses available data sets and makes access request
9. Data Owner approves request; sets access control in Ranger
10. Data Scientist runs data mining/BI/Analytics
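The hand-off in this flow, where a policy set per data feed is combined with discovery metadata to decide protection and access, can be sketched as a small lookup. Everything here is hypothetical (`ElementPolicy`, `plan_protection`, the tag and group names); it does not use or imitate the Atlas or Ranger APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElementPolicy:
    protection: str               # e.g. "mask", "encrypt", or "redact"
    allowed_groups: frozenset     # groups that may see clear values


def plan_protection(discovered: dict[str, Optional[str]],
                    policy: dict[str, ElementPolicy]):
    """Combine discovery metadata with the policy for a data feed.

    `discovered` maps column name -> sensitive-element tag (None if the
    discovery run found nothing). Returns (plan, unresolved), where `plan`
    maps tagged columns to their policy and `unresolved` lists columns
    whose tag has no policy yet, so they can be reviewed rather than
    silently treated as safe.
    """
    plan = {}
    unresolved = []
    for column, tag in discovered.items():
        if tag is None:
            continue
        if tag in policy:
            plan[column] = policy[tag]
        else:
            unresolved.append(column)
    return plan, unresolved
```

In the flow above, the `plan` would drive the masking/encryption job and the access-control settings, while `unresolved` goes back to the policy owner.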
Ranger and Knox: Building on the Vision of Comprehensive Security Syed Mahmood
Security Challenges of Data Lake
Central repository of critical and sensitive data
Data maintained over long duration
External ecosystem is in flux
Users can access and analyze data in new and different ways
Differentiator 1: Comprehensive Approach to Security
In order to protect any data system you must implement the following:
• Administration: Central management and consistent security. How do I set policy across the entire cluster?
• Authentication: Authenticate users and systems. Who am I/prove it?
• Authorization: Provision access to data. What can I do?
• Audit: Maintain a record of data access. What did I do?
• Data Protection: Protect data at rest and in motion. How can I encrypt at rest and over the wire?
HDP Security: Comprehensive, Complete, Extensible
Security in HDP is the most comprehensive, complete and extensible for Hadoop
• Administration: Central management and consistent security. Single administrative console to set policy across the entire cluster (Apache Ranger)
• Authentication: Authenticate users and systems. Authentication for perimeter and cluster; integrates with existing Active Directory and LDAP solutions (Kerberos, Apache Knox)
• Authorization: Provision access to data. Consistent authorization controls across all Apache components within HDP (Apache Ranger)
• Audit: Maintain a record of data access. Record of data access events across all components that is consistent and accessible (Apache Ranger, Apache Atlas)
• Data Protection: Protect data at rest and in motion. Encrypts data in motion and data at rest; refer to partner encryption solutions for broader needs (HDFS TDE with Ranger KMS)
Differentiator 2: Security Built into the Platform
[Figure: Hortonworks Data Platform 2.3 stack]
• Governance & Integration: Data Lifecycle & Governance (Falcon, Atlas); Data Workflow (Sqoop, Flume, Kafka, NFS, WebHDFS)
• Data Access: Batch (MapReduce), Script (Pig), Search (Solr), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-memory (Spark), Others (ISV Engines), running on Tez and Slider over YARN: Data Operating System
• Data Management: HDFS (Hadoop Distributed File System)
• Security: Administration, Authentication, Authorization, Auditing, Data Protection (Ranger, Knox, Atlas, HDFS Encryption)
• Operations: Provisioning, Managing, & Monitoring (Ambari, Cloudbreak, Zookeeper); Scheduling (Oozie)
• Deployment Choice: Linux, Windows, On-Premise, Cloud
Security Built into the Platform
• Security is consistently administered across data access engines
• Build or retire applications without impacting security
Security in Hadoop with HDP
HDP 2.3: Centralized Security Administration with Ranger
• Authentication (Who am I/prove it?): Kerberos; API security with Apache Knox
• Authorization (What can I do?): Fine-grain access control with Apache Ranger
• Audit (What did I do?): Centralized audit reporting with Apache Ranger
• Data Protection (Can data be encrypted at rest and over the wire?): Wire encryption in Hadoop; HDFS Encryption with Ranger KMS
Apache Ranger Comprehensive security for Enterprise Hadoop
Centralized Security with Ranger
Centralized platform
• Centralized platform to define, administer and manage security policies consistently
• Define security policy once and apply it to all the applicable components across the stack
Fine-grained security definition
• Administer security for:
– Database
– Table
– Column
– LDAP Groups
– Specific Users
Deep visibility
• Administrators have complete visibility into the security administration process
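To show what a policy scoped to database, table, and column and granted to groups or specific users amounts to, here is a minimal deny-by-default evaluator. It is a sketch under stated assumptions: the `Policy` class and `is_allowed` function are hypothetical, and this does not reproduce Ranger's policy format, plugins, or evaluation rules.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Policy:
    database: str                      # resource patterns; "*" matches anything
    table: str
    column: str
    users: set = field(default_factory=set)
    groups: set = field(default_factory=set)

    def matches(self, database: str, table: str, column: str) -> bool:
        return (fnmatchcase(database, self.database)
                and fnmatchcase(table, self.table)
                and fnmatchcase(column, self.column))


def is_allowed(policies, user, user_groups, database, table, column) -> bool:
    """Deny by default; allow if any matching policy names the user or a group."""
    groups = set(user_groups)
    for p in policies:
        if p.matches(database, table, column) and (user in p.users or p.groups & groups):
            return True
    return False
```

Defining the policy list once and consulting it from every access path mirrors the "define once, apply across the stack" idea on the slides above.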
Authorization and Auditing with Ranger
[Figure: Enterprise users reach Hadoop components (HDFS, HBase, HiveServer2, Knox, Storm, Solr, Kafka, YARN, TBD), each fronted by a Ranger Plugin. The Ranger Administration Portal, Ranger Policy Server, and Ranger Audit Server sit alongside, with an Integration API to legacy tools and data governance (RDBMS, HDFS). Legend: Currently Supported in HDP 2.2; Future Additions.]
Apache Atlas is Now Included in HDP
[Figure: Apache Atlas components: Knowledge Store (Models, Type-System, Policy Rules, Taxonomies), Audit Store, Tag Based Policies, Data Lifecycle Management, Real Time Tag Based Access Control, REST API, Services (Search, Lineage, Exchange). Industry models: Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Energy (PPDM), Retail (PCI, PII), Other (CWM).]
• REST API: Modern, flexible access to Atlas services, HDP components and external tools
• Search: SQL-like DSL (Domain Specific Language); support for keyword, faceted and full-text searches
• Lineage: Capture all SQL runtime activity on HiveServer2, providing lineage for both data and schema
• Exchange: Leverage existing metadata by importing it from ETL tools, ERP systems and data warehouses; export metadata to downstream systems
Apache Atlas Vision 2015
Metadata Services
• Business Taxonomy: classification
• Operational Data: Model for Hive: DB, Tables, Col,
• Centralized location for all metadata inside HDP
• Single interface point for metadata exchange with platforms outside of HDP
• Search & Prescriptive Lineage: Model and Audit
[Figure: Apache Atlas connected to Hive, Ranger, Falcon, Kafka, and Storm]
Summary
The Insurance Data Landscape has Changed
• The insurance industry is joining and analyzing data which has never been analyzed before
• Many of these sources can be “murky” and sensitive
• Traditional PII/PHI data sources ingested into Hadoop need to be:
– Discovered
– Protected
• Protecting PII/PHI data is not an option for Insurers, TPAs and Brokers: it is a requirement
Questions?
Call to Action
Additional Information:
• Data Protection Optimized for Insurance Big Data – A Dataguise and Hortonworks Capability Overview
• Hortonworks: Comprehensive Security in Hadoop – Solving Security in Hadoop Whitepaper
• Hortonworks: Building Governance into Big Data – Whitepaper