Session presented at Big Data Spain 2012 Conference 16th Nov 2012 ETSI Telecomunicacion UPM Madrid www.bigdataspain.org More info: http://www.bigdataspain.org/es-2012/conference/health-insurance-predictive-analysis-with-hadoop-and-machine-learning/julien-cabot
1
Tél : +33 (0)1 58 56 10 00
Fax : +33 (0)1 58 56 10 01
www.octo.com © OCTO 2012
50, avenue des Champs-Elysées
75008 Paris - FRANCE
Health Insurance Predictive Analysis
with MapReduce and Machine Learning
Julien Cabot
Managing Director
OCTO
@julien_cabot
Madrid
16th of November 2012 www.bigdataspain.org
2
Internet as a Data Source…
Internet as the voice of the crowd
3
… in Healthcare
71% about
• Illness
• Symptom
• Medicine
• Advice / opinion
Main sources are old-school forums, not social networks
4
Benefits for the Insurance Company?
• Understand the patients' subjects of interest to design customer-centric products and marketing actions
• Anticipate the psycho-social effect of the Internet to prevent excessive consultations (and reimbursements)
• Predict the claims by monitoring the requests about symptoms and drugs
5
How to run the predictive analysis?
6
The data problem
• Understand the semantic field of Healthcare… as used on the Internet
• Find correlations between the evolution of claims and… many millions of unidentified external variables
• Find correlated variables… that anticipate the claims
We need some help from Machine Learning!
7
Correlation search in external datasets
Inputs to the Correlation Search Machine:
• Google search volume of symptom and drug keywords → trends of medical keywords searched in Google
• Automated tokenization of forum messages per posted date and semantic tagging → trends of medical keywords used in forums
• Socio-economic context from Open Data initiatives → trends of socio-economic factors
• Health claims by act typology
Output: sorted matrix of determination coefficients (R²)
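The correlation search described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the `max_lag` parameter and the toy series are hypothetical, and R² is computed here for a simple linear fit (squared Pearson correlation) rather than for the SVR model used in the talk.

```python
def r_squared(xs, ys):
    """R² of a simple linear fit between two series (squared Pearson r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series explains nothing
    return cov * cov / (var_x * var_y)


def lagged_pairs(variable, claims, lag):
    """Align variable[t] with claims[t + lag] (lag >= 1)."""
    return variable[:-lag], claims[lag:]


def correlation_search(variables, claims, max_lag=3):
    """Return (name, lag, R²) rows sorted by decreasing R².

    `variables` maps a hypothetical variable name to its time series;
    only positive lags are tested, so a variable must anticipate the claims.
    """
    rows = []
    for name, series in variables.items():
        for lag in range(1, max_lag + 1):
            xs, ys = lagged_pairs(series, claims, lag)
            rows.append((name, lag, r_squared(xs, ys)))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Each row of the result plays the role of one cell of the sorted R² matrix: the top rows are the candidate variables that best anticipate the claims at a given lag.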
8
Understand the semantic field of Healthcare
How to tag Healthcare words?
1-Build a first list of keywords
2-Enrich the list with highly searched keywords
3-Learn automatically from Wikipedia Medical Categories
→ Healthcare semantic field keywords database
Processing of forum messages:
• Message tokenization by date
• Word stemming, tagging and common word filtering with NLTK
→ Timelines of healthcare keywords
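The tagging pipeline above (tokenize by date, filter common words, stem, keep healthcare keywords) can be sketched as follows. Everything here is an assumption for illustration: the tiny keyword set and stop-word list are hypothetical, and `naive_stem` is a crude stand-in for the NLTK stemmer used in the talk.

```python
import re
from collections import Counter, defaultdict

# Hypothetical mini keyword database and stop-word list, for illustration only.
HEALTH_STEMS = {"migrain", "gripp", "toux"}
STOP_WORDS = {"les", "des", "une", "pour", "avec"}


def naive_stem(token):
    """Crude stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("es", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def keyword_timeline(messages):
    """Build {stem: Counter({date: occurrences})} from (date, text) pairs."""
    timeline = defaultdict(Counter)
    for date, text in messages:
        for token in re.findall(r"\w+", text.lower()):
            if token in STOP_WORDS:
                continue  # common word filtering
            stem = naive_stem(token)
            if stem in HEALTH_STEMS:  # semantic tagging against the keyword database
                timeline[stem][date] += 1
    return timeline
```

The resulting per-stem counters are the keyword timelines that feed the correlation search.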
9
How to find correlations between time series?
Compare the evolution of the variable and of the claims over time
Find a non-linear regression and learn a polymorphic predictive function f(x) from the dataset with Support Vector Regression (SVR)
[Figure: regression curve f(x) with its margins f(x) + ε and f(x) - ε; axes x and y]
Problem to solve
$$\min_{w} \; \frac{1}{2} w^{T} w$$
subject to
$$y_i - (w^{T} \phi(x_i) + b) \le \varepsilon$$
$$(w^{T} \phi(x_i) + b) - y_i \le \varepsilon$$
Resolution
• Stochastic gradient descent
• Test the response through the coefficient of determination R²
Open source ML libraries help!
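As a minimal sketch of the resolution steps above, the following fits an ε-insensitive regression by stochastic gradient descent and checks it with R². It is an assumption-laden simplification, not the talk's implementation: the feature map ϕ is taken as the identity (a linear model on a single variable), the hard constraints are relaxed into a penalized loss, and all names and default hyperparameters are hypothetical.

```python
import random


def fit_linear_svr_sgd(xs, ys, epsilon=0.1, reg=1e-4, lr=0.005, epochs=300, seed=0):
    """Fit y ≈ w*x + b by SGD on the ε-insensitive loss with L2 regularization."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            residual = ys[i] - (w * xs[i] + b)
            # Points inside the ε-tube contribute no loss gradient.
            if residual > epsilon:
                direction = 1.0
            elif residual < -epsilon:
                direction = -1.0
            else:
                direction = 0.0
            w += lr * (direction * xs[i] - reg * w)
            b += lr * direction
    return w, b


def coefficient_of_determination(ys, preds):
    """R² = 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot
```

In practice an open source library would supply a kernelized SVR; the sketch only shows how the ε-tube turns into an update rule and how R² scores the fitted function.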
10
Data Processing Profiles
The current volume of external data grabbed is large but not huge (~10 GB)
• Data aggregation, e.g. Select … Group By Date
• Correlation search, e.g. SVR computing
Data volume: ~5 GB × 12³ = 8.64 TB
We need parallel computing to divide the RAM requirement and the processing time!
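Taking the data volume figure above as 5 GB multiplied by 12 cubed, the arithmetic can be checked with a short sketch. The helper `per_worker_gb` and the worker count used with it are hypothetical; they only illustrate how splitting the work divides the memory requirement.

```python
BASE_GB = 5          # base volume quoted on the slide
FACTOR = 12 ** 3     # 12 cubed = 1728

total_gb = BASE_GB * FACTOR   # 8640 GB
total_tb = total_gb / 1000    # 8.64 TB, using 1 TB = 1000 GB


def per_worker_gb(volume_gb, workers):
    """Even share of the volume per worker (hypothetical helper)."""
    return volume_gb / workers
```

For example, `per_worker_gb(total_gb, 6)` gives 1440 GB per worker for an assumed six workers, which is why the work must also be streamed through tasks rather than held in memory all at once.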
11
How to build the platform?
12
IT drivers
Requirements → IT drivers
• Data aggregation: aggregate data from MB to GB files with sequential reading → IO Elasticity
• Large tasks execution: SVR and NLP execution time is ~100 ms per task → CPU Elasticity
• Large RAM execution: process many TB of in-memory data → RAM Elasticity
• Low CAPEX, low OPEX: increase the ROI of the research project while decreasing the TCO → OSS software, commodity hardware, Cost Elasticity
13
Available solutions
[Comparison matrix. Rows: RDBMS, Hadoop, AWS Elastic MapReduce, HPC, In Memory analytics. Columns: IO Elasticity, CPU Elasticity, OSS Software, Cost Elasticity, RAM Elasticity, Commodity Hardware. Cell annotations: "With repartitioning" (three cells), "Through Task" (two cells).]
14
AWS Elastic MapReduce Architecture
Source: AWS
15
Hadoop components
HDFS: Distributed file storage
MapReduce: Parallel processing framework
Pig: Flow processing
Streaming: MR scripting
Hive: SQL-like querying
BI tools: Tableau, Pentaho, …
Mahout: Machine Learning
Hama: Bulk synchronous processing
Data mining tools: R, SAS
Sqoop: RDBMS integration
Zookeeper: Coordination service
Flume: Data stream integration
Hue: Hadoop GUI
HBase: NoSQL on HDFS
Solr: Full text search
Oozie: MR workflow
Custom App: Java, C#, PHP, …
Grid of commodity hardware – storage and processing
16
General architecture of the platform
AWS S3
• Store raw data
• Store results files
Master Instance
Core Instances 1 & 2: 2 x m2.4xlarge
Task Instances 1, 2, 3 & 4: 4 x m2.4xlarge
• For SVR and NLP processing only
Redis
• Store detailed results for drill down
DataViz Application
17
Data aggregation with Pig Job flow
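-- Load the forum messages (PigStorage defaults to tab-separated fields)
records = LOAD '/input/forums/messages.txt'
    AS (str_date:chararray, message:chararray, url:chararray);
-- Group the messages by posting date
date_grouped = GROUP records BY str_date;
-- Count the messages in each date group
results = FOREACH date_grouped GENERATE group, COUNT(records);
DUMP results;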
Num_of_messages_by_date.pig
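For checking the Pig job on a small sample, the same group-and-count aggregation can be sketched in plain Python. The function name and the `delimiter` parameter are hypothetical; the tab default mirrors Pig's default field separator.

```python
from collections import Counter


def count_messages_by_date(lines, delimiter="\t"):
    """Count records per date from 'date<delim>message<delim>url' lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != 3:
            continue  # skip malformed records
        counts[fields[0]] += 1
    return sorted(counts.items())
```

The sorted list of (date, count) pairs corresponds to what `DUMP results;` prints, up to ordering.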
18
Hadoop streaming
Hadoop streaming runs map/reduce jobs with any executable or script, communicating through standard input and standard output
Conceptually, it works like this pipeline (distributed over a cluster):
cat input.txt | map.py | sort | reduce.py
Why Hadoop streaming?
• Intensive use of NLTK for Natural Language Processing
• Intensive use of NumPy and Sklearn for Machine Learning
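The map, sort, reduce pipeline can be imitated locally before submitting a job. The sketch below is a plain word count, not the talk's job; the function names are hypothetical, and the tab-separated "key, value" lines follow Hadoop streaming's default convention.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reducer(sorted_lines):
    """Sum the counts of adjacent identical keys; input must be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


def run_locally(lines):
    """Imitate `cat input | map | sort | reduce` in a single process."""
    return list(reducer(sorted(mapper(lines))))
```

The `sorted` call stands in for the shuffle phase: the reducer relies on identical keys arriving next to each other, which is why `groupby` is sufficient.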
19
Stemmed word distribution with Hadoop streaming, mapper.py
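import sys
from nltk.tokenize import regexp_tokenize
from nltk.stem.snowball import FrenchStemmer

# Build the stemmer once instead of once per input line
stemmer = FrenchStemmer()

# Input comes from STDIN: one "date;message;url" record per line
for line in sys.stdin:
    fields = line.strip().split(";")
    if len(fields) != 3:
        continue  # skip malformed records (e.g. a ";" inside the message)
    str_date, message, url = fields
    # Split the message into word tokens
    tokens = regexp_tokenize(message, pattern=r'\w+')
    for token in tokens:
        word = stemmer.stem(token)
        # Ignore very short stems
        if len(word) >= 3:
            # Emit "stem;date" for the reducer
            print('%s;%s' % (word, str_date))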
Stem_distribution_by_date/mapper.py
20
Stemmed word distribution with Hadoop streaming, reducer.py
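import sys
import json
from itertools import groupby
from operator import itemgetter
from nltk.probability import FreqDist

def read(f):
    # Yield [stem, date] pairs from "stem;date" lines
    for line in f:
        line = line.strip()
        yield line.split(';')

# The mapper output arrives sorted, so equal stems are adjacent
data = read(sys.stdin)
for current_stem, group in groupby(data, itemgetter(0)):
    # Collect the dates on which this stem appeared
    values = [item[1] for item in group]
    # Count the occurrences of the stem per date
    freq_dist = FreqDist(values)
    print("%s;%s" % (current_stem, json.dumps(freq_dist)))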
Stem_distribution_by_date/reducer.py
21
Conclusions
22
Conclusions
The correlation search currently identifies 462 variables correlated with an R² >= 80% and a lag >= 1 month
Amazon Elastic MapReduce provides both the elasticity required by the morphology of the jobs and the cost elasticity
Monthly cost with zero activity : < 5 €
Monthly cost with intensive activity : < 1 000 €
The equivalent cost of the platform would be around 50 000 €
The S3 transfer overhead is not a problem given the volume of stored data
During correlation search processing, at most 80% of the virtual CPUs are used, because job scheduling reaches a parallelism factor of 36 instead of 48 regarding SMP
23
Future work
Data mining
• Increase the number of data sources
• Test the robustness of the predictive model over time
• Reduce the overfitting of the correlation
• Enhance the correlation search for words by testing combinations
IT
• Switch only the correlation search to a MapReduce engine for SMP architectures and clusters of cores, inspired by the Stanford Phoenix and Nokia Disco engines
• Industrialize the data mining components as a platform, to generalize to IARD (property and casualty) insurance, banking, e-commerce, telecoms and retail
24
OCTO in a nutshell
IT Consulting firm
• Established in 1998
• 175 employees
• 19.5 million turnover worldwide (2011)
• Verticals-based organization: Banking – Financial Services; Insurance; Media – Internet – Leisure; Industry – Distribution; Telecom – Services
Big Data Analytics Offer
• Business case and benchmark studies
• Business Proof of Concept
• Data feeds: Web Trends
• Big Data and Analytics architecture design
• Big Data project delivery
• Training, seminar: Big Data, Hadoop
OCTO offices
25
Thank you!