Predictive Modeling CAS Reinsurance Seminar
May 7, 2007
Louise Francis, FCAS, MAAA
Francis Analytics and Actuarial Data Mining, Inc.
Actuarial Data Mining Services
Francis Analytics
www.data-mines.com
Why Predictive Modeling?
• Better use of data than traditional methods
• Advanced methods for dealing with messy data now available
Data Mining Goes Prime Time
Becoming A Popular Tool In All Industries
Real Life Insurance Application – The “Boris Gang”
Predictive Modeling Family
Predictive Modeling
• Classical Linear Models
• GLMs
• Data Mining
Data Quality: A Data Mining Problem
• Actuary reviewing a database
A Problem: Nonlinear Functions
An Insurance Nonlinear Function: Provider Bill vs. Probability of Independent Medical Exam
[Figure: probability of an IME (y-axis, 0.30 to 0.90) by Provider 2 Bill (x-axis, 0 to 11,368).]
Classical Statistics: Regression
• Estimation of parameters: Fit line that minimizes deviation between actual and fitted values
[Figure: Workers Comp Severity Trend. Severity and fitted values (y-axis, $0 to $10,000) by year (x-axis, 1990 to 2004).]
$$\min \sum_i (Y_i - \hat{Y}_i)^2$$
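As a rough illustration of fitting a line that minimizes the squared deviation between actual and fitted values, here is a minimal Python sketch. The function names and the severity figures are hypothetical and chosen only for illustration; they are not the data behind the chart.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_error(xs, ys, intercept, slope):
    """Sum of squared deviations between actual and fitted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical severities by year (illustrative values only).
years = [1990, 1992, 1994, 1996, 1998, 2000]
severity = [3000.0, 3600.0, 4100.0, 4900.0, 5400.0, 6000.0]

b0, b1 = fit_line(years, severity)
best_sse = sum_squared_error(years, severity, b0, b1)
```

Any other intercept or slope gives a larger sum of squared errors on the same data, which is what "fitting" means here.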
Generalized Linear Models
Common Links for GLMs
The identity link: $h(Y) = Y$
The log link: $h(Y) = \ln(Y)$
The inverse link: $h(Y) = \frac{1}{Y}$
The logit link: $h(Y) = \ln\left(\frac{Y}{1 - Y}\right)$
The probit link: $h(Y) = \Phi^{-1}(Y)$, where $\Phi$ denotes the normal CDF
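The common links can be written as small Python functions. This is only a sketch using the standard textbook definitions (in particular, the probit link is taken to be the inverse of the standard normal CDF); the function names are illustrative.

```python
import math
from statistics import NormalDist


def identity_link(y):
    return y


def log_link(y):
    return math.log(y)


def inverse_link(y):
    return 1.0 / y


def logit_link(y):
    # Log-odds; y must lie strictly between 0 and 1.
    return math.log(y / (1.0 - y))


def probit_link(y):
    # Inverse of the standard normal CDF; y must lie strictly between 0 and 1.
    return NormalDist().inv_cdf(y)
```

For a probability of 0.5, both the logit and the probit link return 0, which is why the two links often give similar fits near the middle of the range.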
Major Kinds of Data Mining
• Supervised learning
  – Most common situation
  – A dependent variable
    • Frequency
    • Loss ratio
    • Fraud/no fraud
  – Some methods
    • Regression
    • CART
    • Some neural networks
• Unsupervised learning
  – No dependent variable
  – Group like records together
    • A group of claims with similar characteristics might be more likely to be fraudulent
    • Ex: Territory assignment, Text Mining
  – Some methods
    • Association rules
    • K-means clustering
    • Kohonen neural networks
Desirable Features of a Data Mining Method
• Any nonlinear relationship can be approximated
• A method that works when the form of the nonlinearity is unknown
• The effect of interactions can be easily determined and incorporated into the model
• The method generalizes well on out-of-sample data
The Fraud Surrogates used as Dependent Variables
• Independent Medical Exam (IME) requested
• Special Investigation Unit (SIU) referral
  – (IME successful)
  – (SIU successful)
• Data: Detailed Auto Injury Claim Database for Massachusetts
• Accident Years (1995-1997)
Predictor Variables
• Claim file variables
  – Provider bill, Provider type
  – Injury
• Derived from claim file variables
  – Attorneys per zip code
  – Docs per zip code
• Using external data
  – Average household income
  – Households per zip
Different Kinds of Decision Trees
• Single Trees (CART, CHAID)
• Ensemble Trees, a more recent development (TREENET, RANDOM FOREST)
– A composite or weighted average of many trees (perhaps 100 or more)
Non Tree Methods
• MARS – Multivariate Adaptive Regression Splines
• Neural Networks
• Naïve Bayes (Baseline)
• Logistic Regression (Baseline)
Classification and Regression Trees (CART)
• Tree splits are binary
• If the dependent variable is numeric, the split is based on R² or on the sum or mean of squared errors
  – For any variable, choose the two-way split of the data that reduces the MSE the most
  – Do this for all independent variables
  – Choose the variable that reduces the squared errors the most
• When the dependent variable is categorical, other goodness-of-fit measures (Gini index, deviance) are used
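The split search for a numeric dependent variable can be sketched as follows. This is a simplified illustration for a single predictor, not the algorithm of any particular CART package; the function names and the bill and paid amounts are hypothetical.

```python
def sse(values):
    """Sum of squared deviations of the values around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (cutpoint, sse_after_split) for the best two-way split x < cutpoint."""
    pairs = sorted(zip(xs, ys))
    best_cut, best_sse = None, sse(ys)  # start from the unsplit data
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cutpoint between equal predictor values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_cut, best_sse = cut, total
    return best_cut, best_sse


# Hypothetical provider bills and paid losses (illustrative values only).
bills = [100, 200, 300, 6000, 7000, 8000]
paid = [1000.0, 1200.0, 1100.0, 9000.0, 9500.0, 8800.0]

cutpoint, sse_after = best_split(bills, paid)
```

Repeating this search for every independent variable and keeping the variable with the largest reduction gives one node of the tree; the procedure is then applied again within each partition.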
CART – Example of 1st split on Provider 2 Bill, With Paid as Dependent
• For the entire database, total squared deviation of paid losses around the predicted value (i.e., the mean) is 4.95 × 10^13. The SSE declines to 4.66 × 10^13 after the data are partitioned using $5,021 as the cutpoint.
• Any other partition of the provider bill produces a larger SSE than 4.66 × 10^13. For instance, if a cutpoint of $10,000 is selected, the SSE is 4.76 × 10^13.
1st Split
All Data: Mean = 11,224
  – Bill < 5,021: Mean = 10,770
  – Bill >= 5,021: Mean = 59,250
[Figure: regression tree on mp2.bill. Root split at mp2.bill < 544.5, with further splits at 3.5, 4055.5, 1443.5, 16659, and 5151.5; terminal-node values are 0.02254, 0.04817, 0.07767, 0.08832, 0.11480, 0.13330, and 0.06980.]
Continue Splitting to get more homogenous groups at terminal nodes
Ensemble Trees: Fit More Than One Tree
• Fit a series of trees
• Each tree added improves the fit of the model
• Average or Sum the results of the fits
• There are many methods to fit the trees and prevent overfitting
  – Boosting: Iminer Ensemble and Treenet
  – Bagging: Random Forest
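The idea of fitting a series of trees, each improving the fit, and summing the results can be sketched with a toy boosting loop. This is a simplified illustration using one-split trees (stumps) fit to residuals; it is not the algorithm used by Treenet or Iminer, and the function names, learning rate, and data are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Best one-split regression tree: returns (cut, left_mean, right_mean)."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < cut]
        right = [r for x, r in zip(xs, residuals) if x >= cut]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, cut, lm, rm)
    return best[1:]


def boost(xs, ys, n_trees, learning_rate=0.1):
    """Fit n_trees stumps in sequence, each to the residuals of the current fit."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        cut, lm, rm = fit_stump(xs, residuals)
        stumps.append((cut, lm, rm))
        preds = [p + learning_rate * (lm if x < cut else rm)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the (shrunken) contributions of all stumps on top of the mean."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if x < cut else rm)
                      for cut, lm, rm in stumps)


def training_sse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data with a nonlinear (hump-shaped) relationship.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.05, 0.06, 0.09, 0.12, 0.13, 0.12, 0.10, 0.08]

model_1 = boost(xs, ys, n_trees=1)
model_50 = boost(xs, ys, n_trees=50)
```

On the training data, the 50-tree model fits more closely than the 1-tree model. Fit on the training data alone says nothing about overfitting, which is why practical methods add safeguards such as shrinkage and out-of-sample testing.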
Treenet Prediction of IME Requested
[Figure: Treenet predicted probability of an IME (y-axis, 0.30 to 0.90) by Provider 2 Bill (x-axis, 0 to 11,368).]
Three Layer Neural Network
• Input Layer (Input Data)
• Hidden Layer (Process Data)
• Output Layer (Predicted Value)
Neural Networks
• Also minimizes squared deviation between fitted and actual values
• Can be viewed as a non-parametric, non-linear regression
Hidden Layer of Neural Network (Input Transfer Function)
[Figure: Logistic Function for Various Values of w1. Curves for w1 = -10, -5, -1, 1, 5, and 10, with X from -1.2 to 0.8 on the x-axis and the function value from 0.0 to 1.0 on the y-axis.]
The Activation Function (Transfer Function)
• The sigmoid logistic function
$$f(Y) = \frac{1}{1 + e^{-Y}}$$
$$Y = w_0 + w_1 X_1 + w_2 X_2 + \dots + w_n X_n$$
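A single hidden node applying the logistic activation to a weighted sum of inputs might look like this in Python. It is a minimal sketch; the function names and the weights in the usage lines are illustrative assumptions.

```python
import math


def logistic(y):
    """Sigmoid logistic function f(Y) = 1 / (1 + exp(-Y))."""
    return 1.0 / (1.0 + math.exp(-y))


def hidden_node(weights, inputs):
    """weights[0] is the constant w0; weights[1:] multiply the inputs X1..Xn."""
    y = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return logistic(y)


# Larger |w1| makes the curve steeper around X = 0 (w0 = 0 here).
steep = hidden_node([0.0, 10.0], [0.3])
gentle = hidden_node([0.0, 1.0], [0.3])
flipped = hidden_node([0.0, -10.0], [0.3])
```

With a positive input, a larger positive weight pushes the output closer to 1, and a negative weight pushes it below 0.5.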
Neural Network: Provider 2 Bill vs. IME Requested
[Figure: fitted neural net prediction (y-axis, 0.04 to 0.14) by Provider 2 Bill (x-axis, 0 to 100,000).]
MARS: Provider 2 Bill vs. IME Requested
[Figure: MARS predicted IME (y-axis, 0.05 to 0.13) by Provider 2 Bill (x-axis, 0 to 4,000).]
How MARS Fits Nonlinear Function
• MARS fits a piecewise regression
  – BF1 = max(0, X – 1,401.00)
  – BF2 = max(0, 1,401.00 – X)
  – BF3 = max(0, X – 70.00)
  – Y = 0.336 + 0.145626E-03 * BF1 – 0.199072E-03 * BF2 – 0.145947E-03 * BF3
  – BF1, BF2, BF3 are basis functions
• MARS uses statistical optimization to find the best basis function(s)
• A basis function is similar to a dummy variable in regression: it acts like a combination of a dummy indicator and a linear independent variable
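The fitted piecewise regression above can be evaluated directly. The sketch below uses the knots and coefficients from the slide; the function names are illustrative.

```python
def hinge(value):
    """Basis function building block: max(0, value)."""
    return max(0.0, value)


def mars_prediction(x):
    """Evaluate the fitted MARS model at Provider 2 Bill = x."""
    bf1 = hinge(x - 1401.00)
    bf2 = hinge(1401.00 - x)
    bf3 = hinge(x - 70.00)
    return 0.336 + 0.145626e-03 * bf1 - 0.199072e-03 * bf2 - 0.145947e-03 * bf3


# The slope changes at the knots 70 and 1,401.
predictions = {bill: mars_prediction(bill) for bill in (0, 70, 1401, 4000)}
```

Each basis function is zero on one side of its knot and linear on the other, so the sum is a continuous line whose slope changes at each knot.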
Baseline Method: Naive Bayes Classifier
• Naive Bayes assumes that the features (predictor variables) are independent conditional on each category
• The probability that an observation X will have a specific set of values for the independent variables is the product of the conditional probabilities of observing each of the values given target category c_j, j = 1 to m (m is typically 2)
$$P(X_1, X_2, \dots, X_n \mid c_j) = \prod_{i=1}^{n} P(X_i = x_i \mid c_j)$$
where $x_1, x_2, \dots, x_n$ are specific values for the independent variables
Naïve Bayes Formula
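$$P(C = c_j \mid X_1, X_2, \dots, X_n) = \frac{P(C = c_j, X_1, X_2, \dots, X_n)}{P(X_1, X_2, \dots, X_n)} \quad \text{(Bayes Rule)}$$
$$P(C = c_j \mid X_1, X_2, \dots, X_n) = \frac{P(C = c_j) \prod_{i=1}^{n} P(X_i \mid c_j)}{P(X_1, X_2, \dots, X_n)}$$
The denominator $P(X_1, X_2, \dots, X_n)$ is a constant.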
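A bare-bones version of the formula for categorical predictors can be sketched as below. It estimates each probability by simple counting and omits smoothing, so a value never seen with a category gives that category a probability of zero. The function names and the toy claim records are hypothetical.

```python
from collections import Counter, defaultdict


def train(records, labels):
    """Count categories and, per (feature index, category), the feature values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)
    for record, c in zip(records, labels):
        for i, value in enumerate(record):
            value_counts[(i, c)][value] += 1
    return class_counts, value_counts


def posterior(model, record):
    """P(C = c | record) for every category c, under conditional independence."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(C = c)
        for i, value in enumerate(record):
            score *= value_counts[(i, c)][value] / n_c  # P(X_i = x_i | c)
        scores[c] = score
    # Dividing by the sum plays the role of the constant denominator.
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}


# Hypothetical claims: (bill size, provider type) and whether an IME was requested.
records = [("high", "chiro"), ("high", "md"), ("low", "chiro"), ("low", "md"),
           ("low", "md"), ("high", "md"), ("low", "chiro"), ("high", "chiro")]
labels = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]

model = train(records, labels)
probs = posterior(model, ("high", "chiro"))
```

Because the denominator is the same for every category, it is enough to compute the numerator for each category and rescale so the results sum to one.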
Advantages/Disadvantages
• Computationally efficient
• Under many circumstances has performed well
• Assumption of conditional independence often does not hold
• Can’t be used directly for numeric variables; they must first be grouped into categories
Naïve Bayes Predicted IME vs. Provider 2 Bill
[Figure: mean probability of an IME (y-axis, 0.06 to 0.14) by Provider 2 Bill (x-axis, 0 to 13,767).]
True/False Positives and True/False Negatives (Type I and Type II Errors)
The “Confusion” Matrix
• Choose a “cut point” in the model score.
• Claims with a score above the cut point are classified “yes”.

Sample Confusion Matrix: Sensitivity and Specificity

                  True Class
Prediction        No       Yes      Row Total
No                800      200      1,000
Yes               200      400      600
Column Total      1,000    600

                  Correct  Total    Percent Correct
Sensitivity       400      600      67%
Specificity       800      1,000    80%
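The two rates can be computed from the four cells of the sample matrix. The sketch below uses the standard definitions with “yes” as the positive class (sensitivity is the share of true “yes” claims predicted “yes”; specificity is the share of true “no” claims predicted “no”); the function and variable names are illustrative.

```python
def sensitivity(true_pos, false_neg):
    """Share of actual 'yes' cases that the model classifies as 'yes'."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of actual 'no' cases that the model classifies as 'no'."""
    return true_neg / (true_neg + false_pos)


# Cells of the sample matrix (rows are predictions, columns are true classes).
true_neg = 800   # predicted no,  actually no
false_neg = 200  # predicted no,  actually yes
false_pos = 200  # predicted yes, actually no
true_pos = 400   # predicted yes, actually yes

sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)
```

Raising the cut point classifies fewer claims as “yes”, which tends to raise specificity and lower sensitivity; lowering it does the opposite.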
ROC Curves and Area Under the ROC Curve
• Want good performance both on sensitivity and specificity
• Sensitivity and specificity depend on the cut points chosen
  – Choose a series of different cut points, and compute sensitivity and specificity for each of them
  – Graph the results
    • Plot sensitivity vs. 1 – specificity
    • Compute an overall measure of “lift”, or area under the curve
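The steps above can be sketched in a few lines of Python. This is a simplified illustration: it treats a score at or above the cut point as “yes” (a convention chosen here for convenience), uses every distinct score as a cut point, and integrates with the trapezoid rule. The function names and the scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per cut point.

    labels are 1 for 'yes' and 0 for 'no'.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cut point above every score: nothing classified 'yes'
    for cut in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under the plotted ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical model scores and true outcomes (1 = IME requested).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auroc = area_under_curve(points)
```

An area of 0.5 corresponds to a model that ranks claims no better than chance, and an area of 1.0 to a model that ranks every “yes” claim above every “no” claim.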
TREENET ROC Curve – IME
AUROC = 0.701
Ranking of Methods/Software – IME Requested
Method/Software          AUROC    Lower Bound   Upper Bound
Random Forest            0.7030   0.6954        0.7107
Treenet                  0.7010   0.6935        0.7085
MARS                     0.6974   0.6897        0.7051
S-PLUS Neural            0.6961   0.6885        0.7038
S-PLUS Tree              0.6881   0.6802        0.6961
Logistic                 0.6771   0.6695        0.6848
Naïve Bayes              0.6763   0.6685        0.6841
SPSS Exhaustive CHAID    0.6730   0.6660        0.6820
CART Tree                0.6694   0.6613        0.6775
Iminer Neural            0.6681   0.6604        0.6759
Iminer Ensemble          0.6491   0.6408        0.6573
Iminer Tree              0.6286   0.6199        0.6372
Some Software Packages That Can be Used
• Excel
• Access
• Free Software
  – R
  – Web based software
• S-Plus (similar to commercial version of R)
• SPSS
• CART/MARS
• Data Mining suites (SAS Enterprise Miner/SPSS Clementine)
References
• Derrig, R., Francis, L., “Distinguishing the Forest from the Trees: A Comparison of Tree Based Data Mining Methods”, CAS Winter Forum, March 2006, www.casact.org
• Derrig, R., Francis, L., “A Comparison of Methods for Predicting Fraud”, Risk Theory Seminar, April 2006
• Francis, L., “Taming Text: An Introduction to Text Mining”, CAS Winter Forum, March 2006, www.casact.org
• Francis, L.A., “Neural Networks Demystified”, Casualty Actuarial Society Forum, Winter, pp. 254-319, 2001
• Francis, L.A., “Martian Chronicles: Is MARS Better than Neural Networks?”, Casualty Actuarial Society Forum, Winter, pp. 253-320, 2003
• Dhar, V., “Seven Methods for Transforming Corporate Data into Business Intelligence”, Prentice Hall, 1997
• The web site www.data-mines.com has some tutorials and presentations
Predictive Modeling CAS Reinsurance Seminar
May, 2006