Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN CABOT at Big Data Spain 2012

1

Tel: +33 (0)1 58 56 10 00

Fax: +33 (0)1 58 56 10 01

www.octo.com © OCTO 2012

50, avenue des Champs-Elysées

75008 Paris - FRANCE

Health Insurance Predictive Analysis

with MapReduce and Machine Learning

Julien Cabot

Managing Director

OCTO

@julien_cabot

Madrid

16th of November 2012 www.bigdataspain.org

2

Internet as a Data Source…

Internet as the voice of the crowd

3

… in Healthcare

71% about

• Illness

• Symptom

• Medicine

• Advice / opinion

Main sources are old-school forums, not social networks

4

Benefits for the insurance company?

• Understand the patients' subjects of interest to design customer-centric products and marketing actions

• Anticipate the psycho-social effects of the Internet to prevent excessive consultations (and reimbursements)

• Predict claims by monitoring requests about symptoms and drugs

5

How to run the predictive analysis?

6

The data problem

• Understand the semantic field of healthcare as it is used on the Internet

• Find correlations between the evolution of claims and many millions of unidentified external variables

• Find correlated variables that anticipate the claims

We need some help from Machine Learning!

7

Correlation search in external datasets

Inputs to the Correlation Search Machine:

• Google search volume of symptom and drug keywords → trends of medical keywords searched in Google

• Forum messages, after automated tokenization per posted date and semantic tagging → trends of medical keywords used in forums

• Socio-economic context from Open Data initiatives → trends of socio-economic factors

• Health claims by act typology

Output: matrix of determination coefficients (R²), sorted
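A minimal sketch of what such a correlation search could look like, assuming monthly series of equal length held as plain Python lists. Each candidate variable is scored by the R² of a straight-line fit against the claims shifted by a lag, and the rows come back sorted by R². The slides use SVR for the fit; ordinary least squares is swapped in here only to keep the sketch short. The names `r_squared` and `correlation_search`, and the toy data, are hypothetical.

```python
from statistics import mean

def r_squared(x, y):
    """R² of the least-squares line y ~ a*x + b (squared Pearson correlation)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series explains nothing
    return (sxy * sxy) / (sxx * syy)

def correlation_search(variables, claims, max_lag=3):
    """Return (name, lag, R²) tuples sorted by decreasing R².

    A lag of k compares the variable at month t with the claims at month
    t + k, so k >= 1 means the variable moves ahead of the claims.
    """
    rows = []
    for name, series in variables.items():
        for lag in range(1, max_lag + 1):
            x = series[:len(series) - lag]
            y = claims[lag:]
            rows.append((name, lag, r_squared(x, y)))
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Toy data, invented for illustration only.
claims = [10, 12, 15, 11, 9, 14, 18, 13]
variables = {
    "flu_searches": [12, 15, 11, 9, 14, 18, 13, 16],  # claims, one month early
    "unrelated": [5, 5, 6, 5, 6, 5, 6, 5],
}
ranking = correlation_search(variables, claims)
```

Restricting the lag to values of at least one month keeps only variables that move before the claims do, which is what makes them usable for prediction.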

8

Understand the semantic field of Healthcare

How to tag healthcare words?

Build the healthcare semantic field keywords database:

1-Build a first list of keywords

2-Enrich the list with highly searched keywords

3-Learn automatically from Wikipedia Medical Categories

Process the messages:

• Message tokenization by date

• Word stemming, tagging and common word filtering with NLTK

Result: timelines of healthcare keywords
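The tagging steps on this slide (tokenization by date, common-word filtering, stemming, tagging against a keyword database) can be sketched as follows. This is an illustration under loud assumptions: `toy_stem` is a crude stand-in for NLTK's Snowball stemmer, and `STOP_WORDS` and `HEALTH_STEMS` are tiny hypothetical samples, not the database learned from Wikipedia.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-ins: the slides use NLTK for stemming and common-word
# filtering, and a keyword database learned from Wikipedia medical categories.
STOP_WORDS = {"the", "and", "for", "have", "with", "since", "after"}
HEALTH_STEMS = {"headach", "fever", "aspirin"}

def toy_stem(word):
    """Very crude suffix stripping, for illustration only (not Snowball)."""
    for suffix in ("es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

def tag_message(message):
    """Return the healthcare stems found in one message."""
    tokens = re.findall(r"\w+", message.lower())
    stems = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    return [s for s in stems if s in HEALTH_STEMS]

def keyword_timelines(dated_messages):
    """Map each healthcare stem to a Counter of occurrences per date."""
    timelines = defaultdict(Counter)
    for date, message in dated_messages:
        for stem in tag_message(message):
            timelines[stem][date] += 1
    return timelines

# Toy messages, invented for illustration only.
messages = [
    ("2012-01-03", "Headaches and fever since Monday"),
    ("2012-01-03", "Aspirin for the headache"),
    ("2012-01-04", "Fever after the holidays"),
]
timelines = keyword_timelines(messages)
```

Stemming before the lookup is what lets "headache" and "headaches" feed the same timeline.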

9

Compare the evolution of the variable and the claims over time

Find a non-linear regression and learn a polymorphic predictive function f(x) from the dataset with Support Vector Regression (SVR)

How to find correlations between time series?

[Figure: y plotted against x, showing the fitted function f(x) and the tube bounded by f(x) + ε and f(x) - ε]

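Problem to solve

$$\min_{w} \; \frac{1}{2} w^{T} w \quad \text{subject to} \quad \begin{cases} y_i - (w^{T}\phi(x_i) + b) \le \varepsilon \\ (w^{T}\phi(x_i) + b) - y_i \le \varepsilon \end{cases}$$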

Resolution

• Stochastic gradient descent

• Test the response through the coefficient of determination R²

Open source ML libraries help!
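To make the resolution step concrete, here is a minimal sketch of stochastic subgradient descent on the ε-insensitive loss, followed by an R² check. It assumes a one-dimensional linear model (the feature map ϕ taken as the identity) and a fixed visiting order so the run is deterministic. It is not how any particular library implements SVR, and every name and the toy data are hypothetical.

```python
def predict(w, b, x):
    """Linear model f(x) = w*x + b (feature map phi taken as the identity)."""
    return w * x + b

def eps_loss(w, b, xs, ys, eps):
    """Mean epsilon-insensitive loss: errors inside the tube cost nothing."""
    return sum(max(0.0, abs(y - predict(w, b, x)) - eps)
               for x, y in zip(xs, ys)) / len(xs)

def sgd_step(w, b, x, y, eps, lr, lam):
    """One subgradient step on lam/2 * w^2 + the epsilon-insensitive loss."""
    residual = y - predict(w, b, x)
    w -= lr * lam * w                # shrink w (regularisation term)
    if residual > eps:               # point above the tube: raise f(x)
        w, b = w + lr * x, b + lr
    elif residual < -eps:            # point below the tube: lower f(x)
        w, b = w - lr * x, b - lr
    return w, b

def fit(xs, ys, eps=0.1, lr=0.01, lam=0.0, epochs=500):
    """Repeat single-sample steps over the data in a fixed order."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            w, b = sgd_step(w, b, x, y, eps, lr, lam)
    return w, b

def r_squared(ys, preds):
    """Coefficient of determination of the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Toy data, invented for illustration only: y = 2x + 1.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2 * x + 1 for x in xs]
w, b = fit(xs, ys)
r2 = r_squared(ys, [predict(w, b, x) for x in xs])
```

Points whose residual stays within ε trigger no update, which is the code counterpart of the two inequality constraints above.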

10

The volume of external data collected so far is large but not huge (~10 Gb)

Data Processing Profiles

• Data aggregation, e.g. Select … Group By Date

• Correlation search, e.g. SVR computing

Data volume: ~5 Gb × 12³ = 8.64 Tb

We need parallel computing to divide the RAM requirement and the processing time!
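As a check on the arithmetic behind the data volume figure, reading the slide's multiplier as 12 cubed (an assumption: it is the reading under which the stated total holds) and taking 1 Tb as 1000 Gb:

```latex
5\,\text{Gb} \times 12^{3} = 5 \times 1728\,\text{Gb} = 8640\,\text{Gb} = 8.64\,\text{Tb}
```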

11

How to build the platform?

12

IT drivers

Requirements → IT drivers

• Data aggregation (aggregate data from Mb to Gb files with sequential reading) → IO Elasticity

• Large tasks execution (SVR, NLP execution time is ~100 ms per task) → CPU Elasticity

• Large RAM execution (process many Tb of in-memory data) → RAM Elasticity

• Low CAPEX, low OPEX (increase the ROI of the research project while decreasing the TCO) → Commodity HW, OSS SW, Cost Elasticity

13

Available solutions

[Comparison matrix]

Criteria: IO Elasticity, CPU Elasticity, OSS Software, Cost Elasticity, RAM Elasticity, Commodity Hardware

Solutions compared: RDBMS, Hadoop, AWS Elastic MapReduce, HPC, In Memory analytics

Cell annotations: "With repartitioning" (three cells), "Through Task" (two cells)

14

AWS Elastic MapReduce Architecture

Source: AWS

15

Hadoop components

HDFS Distributed file storage

MapReduce Parallel processing framework

Pig Flow processing

Streaming MR scripting

Hive SQL-like querying

BI tools Tableau, Pentaho, …

Mahout Machine Learning

Hama Bulk synchronous processing

Data mining tools R, SAS

Sqoop RDBMS integration

Zookeeper Coordination service

Flume Data stream integration

Hue Hadoop GUI

HBase NoSQL on HDFS

Solr Full text search

Oozie MR workflow

Custom App Java, C#, PHP, …

Grid of commodity hardware – storage and processing

16

General architecture of the platform

• AWS S3: store raw data, store results files

• Master Instance

• Core Instances 1 and 2 (2 x m2.4xlarge)

• Task Instances 1 to 4 (4 x m2.4xlarge), for SVR and NLP processing only

• Redis: store detailed results for drill down

• DataViz Application

17

Data aggregation with Pig Job flow

-- Count the forum messages posted on each date.
-- LOAD without a USING clause reads tab-separated fields.
records = LOAD '/input/forums/messages.txt'
    AS (str_date:chararray, message:chararray, url:chararray);
date_grouped = GROUP records BY str_date;
results = FOREACH date_grouped GENERATE group, COUNT(records);
DUMP results;

Num_of_messages_by_date.pig
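For readers less familiar with Pig, the same aggregation can be sketched in plain Python. This is only an illustration of what the job computes, not part of the platform. The `;` separator is an assumption borrowed from the mapper shown later in the deck (Pig's default loader expects tabs), and the sample records are invented.

```python
from collections import Counter

def messages_per_date(lines, sep=";"):
    """Count messages per date, mirroring GROUP ... BY str_date + COUNT."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        str_date = line.split(sep, 1)[0]  # the first field is the date
        counts[str_date] += 1
    return sorted(counts.items())

# Sample records, invented for illustration only.
sample = [
    "2012-01-03;first message;http://example.org/1",
    "2012-01-03;second message;http://example.org/2",
    "2012-01-04;third message;http://example.org/3",
]
result = messages_per_date(sample)
```

On a single machine this is enough for a file of a few Gb. The Pig version matters once the input no longer fits one node's sequential read budget.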

18

Hadoop streaming runs map/reduce jobs with any

executables or scripts through standard input and

standard output

On a cluster, the job behaves like this pipeline:

cat input.txt | map.py | sort | reduce.py

Why Hadoop streaming?

Intensive use of NLTK for Natural Language Processing

Intensive use of NumPy and scikit-learn for Machine Learning

Hadoop streaming
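The shell pipeline above can be imitated in a single Python process to test a mapper and reducer before submitting a job. The sketch below uses a word count, which is a swapped-in example rather than the deck's job, and relies on the tab character that Hadoop streaming uses by default to separate key from value. The function names are hypothetical.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reducer(sorted_records):
    """Sum the counts of consecutive records sharing the same key."""
    keyed = (record.split("\t") for record in sorted_records)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))

def run_locally(lines):
    """Equivalent of: cat input.txt | map.py | sort | reduce.py"""
    return list(reducer(sorted(mapper(lines))))

output = run_locally(["fever and cough", "cough again"])
```

The `sorted` call plays the role of the shuffle phase: the reducer only works because records with the same key arrive next to each other.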

19

Stemmed word distribution with Hadoop streaming, mapper.py

import sys

from nltk.tokenize import regexp_tokenize
from nltk.stem.snowball import FrenchStemmer

# Build the stemmer once instead of once per input line
stemmer = FrenchStemmer()

# Input comes from STDIN, one "date;message;url" record per line
for line in sys.stdin:
    line = line.strip()
    try:
        str_date, message, url = line.split(";")
    except ValueError:
        continue  # skip records that do not have exactly three fields
    for token in regexp_tokenize(message, pattern=r'\w+'):
        word = stemmer.stem(token)
        # Keep stems of at least three characters, emitted as "stem;date"
        if len(word) >= 3:
            print('%s;%s' % (word, str_date))

Stem_distribution_by_date/mapper.py

20

Stemmed word distribution with Hadoop streaming, reducer.py

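import sys
import json
from itertools import groupby
from operator import itemgetter

from nltk.probability import FreqDist

def read(f):
    # Yield [stem, date] pairs from the sorted mapper output
    for line in f:
        line = line.strip()
        if line:
            yield line.split(';')

data = read(sys.stdin)
# Hadoop streaming sorts the mapper output, so equal stems are adjacent
for current_stem, group in groupby(data, itemgetter(0)):
    values = [item[1] for item in group]
    # Count how many times the stem occurs on each date
    freq_dist = FreqDist(values)
    print("%s;%s" % (current_stem, json.dumps(dict(freq_dist))))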

Stem_distribution_by_date/reducer.py
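The reducer prints one line per stem in the form `stem;JSON`, where the JSON object maps dates to counts. A hedged sketch of a possible next step, turning one such line into a monthly timeline, is shown below. It assumes ISO dates (`YYYY-MM-DD`), which the slides do not state, and `monthly_timeline` and the sample line are hypothetical.

```python
import json
from collections import Counter

def monthly_timeline(reducer_line):
    """Parse 'stem;{json}' and sum the daily counts into 'YYYY-MM' buckets."""
    stem, payload = reducer_line.rstrip("\n").split(";", 1)
    daily = json.loads(payload)
    monthly = Counter()
    for str_date, count in daily.items():
        monthly[str_date[:7]] += count  # assumes ISO dates: YYYY-MM-DD
    return stem, sorted(monthly.items())

# Sample reducer output line, invented for illustration only.
line = 'fievr;{"2012-01-03": 2, "2012-01-20": 1, "2012-02-02": 4}'
stem, timeline = monthly_timeline(line)
```

A monthly series per stem is the kind of input the correlation search needs to compare against monthly claims.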

21

Conclusions

22

Conclusions

The correlation search currently identifies 462 variables correlated with the claims, with R² ≥ 80% and a lag ≥ 1 month

Amazon Elastic MapReduce provides both the elasticity required by the shape of the jobs and the cost elasticity

Monthly cost with zero activity : < 5 €

Monthly cost with intensive activity : < 1 000 €

The equivalent cost of the platform would be around 50 000 €

The S3 transfer overhead is not a problem given the volume of stored data

During correlation search processing, at most 80% of the virtual CPUs are used, because job scheduling reaches a parallelism factor of 36 instead of 48 with respect to SMP

23

Future work

Data mining

Increase the number of data sources

Test the robustness of the predictive model over time

Reduce the overfitting of the correlation

Enhance the correlation search for words by testing combinations

IT

Move only the correlation search to a MapReduce engine designed for SMP architectures and clusters of cores, inspired by Stanford Phoenix and Nokia Disco

Industrialize the data mining components as a platform that generalizes to property and casualty (IARD) insurance, banking, e-commerce, telecoms and retail

24

OCTO in a nutshell

Business case and benchmark studies

Business Proof of Concept

Data feeds : Web Trends

Big Data and Analytics architecture design

Big data project delivery

Training, seminar : Big Data, Hadoop

Big data Analytics Offer

Established in 1998

175 employees

19.5 million turnover worldwide (2011)

Verticals-based organization

Banking – Financial Services; Insurance; Media – Internet – Leisure; Industry – Distribution; Telecom – Services

IT Consulting firm OCTO offices

25

Thank you!
