Upload
lee-lester
View
216
Download
0
Tags:
Embed Size (px)
Citation preview
1
Data Quality:GIRO Working Party Paper & Methods to Detect Data Glitches
CAS Annual MeetingLouise Francis
Francis Analytics and Actuarial Data Mining, Inc.www.data-mines.com
2
Objectives Review GIRO Data Quality Working Party
Paper Discuss Data Quality in Predictive Modeling
Focus on a key part of data preparation: Exploratory data analysis
3
GIRO Working Party Paper
Literature Review Horror Stories Survey Experiment Actions Concluding Remarks
4
Insurance Data Management Association
The IDMA is an American organisation which promotes professionalism in the Data Management discipline through education, certification and discussion forums
The IDMA web site:
• Suggests publications on data quality,• Describes a data certification model, and• Contains Data Management Value Propositions
which document the value to various insurance industry stakeholders of investing in data quality
http://www.idma.org
5
Horror Stories – Non-Insurance
Heart-and-Lung Transplant – wrong blood type
Bombing of Chinese Embassy in Belgrade Mars Orbiter – confusion between imperial
and metric units Fidelity Mutual Fund – withdrawal of dividend Porter County, Illinois – Tax Bill and Budget
Shortfall
6
Horror Stories – Rating/Pricing
Examples faced by ISO: Exposure recorded in units of $10,000 instead
of $1,000 Large insurer reporting personal auto data as
miscellaneous and hence missed from ratemaking calculations
One company reporting all its Florida property losses as fire (including hurricane years)
Mismatched coding for policy and claims data
7
Horror Stories - Katrina
US Weather models underestimated costs Katrina by approx. 50% (Westfall, 2005)
2004 RMS study highlighted exposure data that was: Out-of-date Incomplete Mis-coded
Many flood victims had no flood insurance after being told by agents that they were not in flood risk areas.
8
Data Quality Survey of Actuaries
Purpose: Assess the impact of data quality issues on the work of general insurance actuaries
2 questions: percentage of time spent on data quality
issues proportion of projects adversely affected by
such issues
9
Results - Percentage of Time
Employer No. Mean Median Min Max
Insurer/Reinsurer
17 26.4% 25.0% 5.0% 50.0%
Consultancy 13 27.1% 25.0% 7.5% 60.0%
Other 8 23.4% 12.5% 2.0% 75.0%
All 38 26.0% 25.0% 2.0% 75.0%
10
Results - Percentage of Projects
Employer No Mean Median Min Max
Insurer/Reinsurer
15 27.9% 20.0% 5.0% 60.0%
Consultancy 13 43.3% 35.0% 10.0% 100.0%
Other 8 22.6% 20.0% 1.0% 50.0%
All 36 32.3% 30.0% 1.0% 100.0%
11
Survey Conclusions
Data quality issues have a significant impact on the work of general insurance actuaries about a quarter of their time is spent on such issues about a third of projects are adversely affected
The impact varies widely between different actuaries, even those working in similar organisations
Limited evidence to suggest that the impact is more significant for consultants
12
Hypothesis
Uncertainty of actuarial estimates of ultimate incurred losses based on poor data is significantly greater than that of good data
13
Data Quality Experiment
Examine the impact of incomplete and/or erroneous data on actuarial estimate of ultimate losses and the loss reserves
Use real data with simulated limitations and/or errors and observe the potential error in the actuarial estimates
14
Data Used in Experiment
Real data for primary private passenger bodily injury liability business for a single no-fault state
Eighteen (18) accident years of fully developed data; thus, true ultimate losses are known
15
Actuarial Methods Used
Paid chain ladder models Incurred chain ladder models Frequency-severity models
Inverse power curve for tail factors
No judgment used in applying methods
16
Completeness of Data Experiments
Vary size of the sample; that is, 1) All years
2) Use only 7 accident years
3) Use only last 3 diagonals
17
Data Error Experiments
Simulated data quality issues1. Misclassification of losses by accident year
2. Early years not available
3. Late processing of financial information
4. Paid losses replaced by crude estimates
5. Overstatements followed by corrections in following period
6. Definition of reported claims changed
18
Measure Impact of Data Quality
• Compare Estimated to Actual Ultimates• Use Bootstrapping to evaluate effect of
difference samples on results
19
Results – Experiment 1
More data generally reduces the volatility of the estimation errors
20
Results – Experiment 2
Extreme volatility, especially those based on paid data
Actuaries ability to recognise and account for data quality issues is critical
Actuarial adjustments to the data may never fully correct for data quality issues
21
Results – Bootstrapping
Less dispersion in results for error free data Standard deviation of estimated ultimate
losses greater for the modified data (data with errors)
Confirms original hypothesis
22
Conclusions Resulting from Experiment
Greater accuracy and less variability in actuarial estimates when: Quality data used Greater number of accident years used
Data quality issues can erode or even reverse the gains of increased volumes of data: If errors are significant, more data may worsen estimates due
to the propagation of errors for certain projection methods Significant uncertainty in results when:
Data is incomplete Data has errors
23
Predictive Modeling & Exploratory Data Analysis
24
Modeling Process
Modeling
Evaluation
Deployment
Data Preparation
Data Understanding
Business Understanding
2
00Complex Model Artifact
25
Data Preprocessing
Select Data
Clean Data
Construct Data
Rational for inclusion
Data Cleaning Report
Data Attributes Generate records
Integrate Data Merged records
Format Data
Data sets
26
Exploratory Data Analysis
Typically the first step in analyzing data Makes heavy use of graphical techniques Also makes use of simple descriptive
statistics Purpose
Find outliers (and errors) Explore structure of the data
27
Example Data
Private passenger auto Some variables are:
Age Gender Marital status Zip code Earned premium Number of claims Incurred losses Paid losses
28
Some Methods for Numeric Data
Visual Histograms Box and Whisker Plots Stem and Leaf Plots
Statistical Descriptive statistics Data spheres
29
Histograms
Can do them in Microsoft Excel
30
Histograms of Age VariableVarying Window Size
16 33 51 68 85Age
0
20
40
60
80
16 23 30 37 44 51 57 64 71 78 85
Age
0
5
10
15
20
16 24 31 39 47 54 62 70 77 85 Age0
10
20
30
40
31
Formula for Window Width
1
3
3.5
standard deviation
N=sample size
h =window width
h
N
32
Example of Suspicious Value
600 900 1200 1500 1800
License Year
0
5,000
10,000
15,000
20,000
25,000F
req
ue
nc
y
33
Box and Whisker Plot
-20
0
20
40
60
80
No
rmally
Dis
trib
ute
d A
ge
+1.5*midspread
median
Outliers
75th Percentile
25th Percentile
34
Plot of Heavy Tailed DataPaid Losses
-10000
10000
30000
50000
70000
90000
110000
Pa
id L
oss
35
Heavy Tailed Data – Log Scale
101
102
103
104
105
106
107P
aid
Loss
36
Box and Whisker Example
16 1820 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 96
Age
0
200
400
600
800
1,000
Fre
qu
ency
37
Descriptive StatisticsAnalysis ToolPak
Statistic Policyholder Age
Mean 36.9Standard Error 0.1Median 35.0Mode 32.0Standard Deviation 13.2Sample Variance 174.4Kurtosis 0.5Skewness 0.7Range 84Minimum 16Maximum 100Sum 1114357Count 30226Largest(2) 100Smallest(2) 16
38
Descriptive Statistics
License Year has minimum and maximums that are impossible
N Minimum Maximum Mean Std.
Deviation
License Year 30,250 490 2,049 1,990 16.3
Valid N 30,250
39
Data Spheres: The Mahalanobis Distance Statistic
1( ) ' ( )
is a vector of variables
is a vector of means
is a variance-covariance matrix
MD x μ Σ x μ
x
μ
Σ
40
Screening Many Variables at Once
Plot of Longitude and Latitude of zip codes in data Examination of outliers indicated drivers in Ca and PR
even though policies only in one mid-Atlantic state
1( ) ' ( ) MD x μ Σ x μ
41
Records With Unusual Values Flagged
Policy ID Mahalanobis
Depth Percentile of Mahalanobis Age
License Year
Number of Cars
Number of Drivers
Model Year
Incurred Loss
22244 59 100 27 1997 3 6 1994 4,456 6159 60 100 22 2001 2 6 1993 0
22997 65 100 NA NA 2 1 1954 0 5412 61 100 17 2003 3 6 1994 0
30577 72 100 43 1979 3 1 1952 0 28319 8,490 100 30 490 1 1 1987 0 27815 55 100 44 1976 -1 0 1959 0 16158 24 100 82 1938 1 1 1989 61,187 4908 25 100 56 1997 4 4 2003 35,697
28790 24 100 82 2039 1 1 1985 27,769
42
Categorical Data: Data Cubes
43
Categorical Data
Data Cubes Usually frequency tables Search for missing values coded as blanks
Gender
Frequency Percent 5,054 14.3 F 13,032 36.9 M 17,198 48.7 Total 35,284 100
44
Categorical Data
Table highlights inconsistent coding of marital status
Marital Status
Frequency Percent 5,053 14.3 1 2,043 5.8 2 9,657 27.4 4 2 0 D 4 0 M 2,971 8.4 S 15,554 44.1 Total 35,284 100
45
Screening for Missing Data
BUSINESS
TYPE Gender Age License Year
Valid 35,284 35,284 30,242 30,250 N
Missing 0 0 5,042 5,034
25 27.00 1,986.00
50 35.00 1,996.00 Percentiles
75 45.00 2,000.00
46
Blanks as Missing
Frequency Percent Valid Percent Cumulative
Percent
5,054 14.3 14.3 14.3
F 13,032 36.9 36.9 51.3
M 17,198 48.7 48.7 100.0 Valid
Total 35,284 100.0 100.0
47
Conclusions
Anecdotal horror stories illustrate possible dramatic impact of data quality problems
Data quality survey suggests data quality issues have a significant cost on actualial work
Data Quality Working Party urges actuaries to become data quality advocated within their organizations
Techniques from Exploratory Data Analysis can be used to detect data glitches
48
Library for Getting Started
Dasu and Johnson, Exploratory Data Mining and Data Cleaning, Wiley, 2003
Francis, L.A., “Dancing with Dirty Data: Methods for Exploring and Cleaning Data”, CAS Winter Forum, March 2005, www.casact.org
The Data Quality Working Party Paper will be Posted on the GIRO web site in about nine months