48
1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com [email protected]

1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

Embed Size (px)

Citation preview

Page 1: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

1

Data Quality:GIRO Working Party Paper & Methods to Detect Data Glitches

CAS Annual MeetingLouise Francis

Francis Analytics and Actuarial Data Mining, Inc.www.data-mines.com

[email protected]

Page 2: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

2

Objectives Review GIRO Data Quality Working Party

Paper Discuss Data Quality in Predictive Modeling

Focus on a key part of data preparation: Exploratory data analysis

Louise Francis
Page 3: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

3

GIRO Working Party Paper

Literature Review Horror Stories Survey Experiment Actions Concluding Remarks

Page 4: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

4

Insurance Data Management Association

The IDMA is an American organisation which promotes professionalism in the Data Management discipline through education, certification and discussion forums

The IDMA web site:

• Suggests publications on data quality,• Describes a data certification model, and• Contains Data Management Value Propositions

which document the value to various insurance industry stakeholders of investing in data quality

http://www.idma.org

Page 5: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

5

Horror Stories – Non-Insurance

Heart-and-Lung Transplant – wrong blood type

Bombing of Chinese Embassy in Belgrade Mars Orbiter – confusion between imperial

and metric units Fidelity Mutual Fund – withdrawal of dividend Porter County, Illinois – Tax Bill and Budget

Shortfall

Page 6: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

6

Horror Stories – Rating/Pricing

Examples faced by ISO: Exposure recorded in units of $10,000 instead

of $1,000 Large insurer reporting personal auto data as

miscellaneous and hence missed from ratemaking calculations

One company reporting all its Florida property losses as fire (including hurricane years)

Mismatched coding for policy and claims data

Page 7: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

7

Horror Stories - Katrina

US Weather models underestimated costs Katrina by approx. 50% (Westfall, 2005)

2004 RMS study highlighted exposure data that was: Out-of-date Incomplete Mis-coded

Many flood victims had no flood insurance after being told by agents that they were not in flood risk areas.

Page 8: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

8

Data Quality Survey of Actuaries

Purpose: Assess the impact of data quality issues on the work of general insurance actuaries

2 questions: percentage of time spent on data quality

issues proportion of projects adversely affected by

such issues

Page 9: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

9

Results - Percentage of Time

Employer No. Mean Median Min Max

Insurer/Reinsurer

17 26.4% 25.0% 5.0% 50.0%

Consultancy 13 27.1% 25.0% 7.5% 60.0%

Other 8 23.4% 12.5% 2.0% 75.0%

All 38 26.0% 25.0% 2.0% 75.0%

Page 10: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

10

Results - Percentage of Projects

Employer No Mean Median Min Max

Insurer/Reinsurer

15 27.9% 20.0% 5.0% 60.0%

Consultancy 13 43.3% 35.0% 10.0% 100.0%

Other 8 22.6% 20.0% 1.0% 50.0%

All 36 32.3% 30.0% 1.0% 100.0%

Page 11: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

11

Survey Conclusions

Data quality issues have a significant impact on the work of general insurance actuaries about a quarter of their time is spent on such issues about a third of projects are adversely affected

The impact varies widely between different actuaries, even those working in similar organisations

Limited evidence to suggest that the impact is more significant for consultants

Page 12: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

12

Hypothesis

Uncertainty of actuarial estimates of ultimate incurred losses based on poor data is significantly greater than that of good data

Page 13: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

13

Data Quality Experiment

Examine the impact of incomplete and/or erroneous data on actuarial estimate of ultimate losses and the loss reserves

Use real data with simulated limitations and/or errors and observe the potential error in the actuarial estimates

Page 14: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

14

Data Used in Experiment

Real data for primary private passenger bodily injury liability business for a single no-fault state

Eighteen (18) accident years of fully developed data; thus, true ultimate losses are known

Page 15: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

15

Actuarial Methods Used

Paid chain ladder models Incurred chain ladder models Frequency-severity models

Inverse power curve for tail factors

No judgment used in applying methods

Page 16: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

16

Completeness of Data Experiments

Vary size of the sample; that is, 1) All years

2) Use only 7 accident years

3) Use only last 3 diagonals

Page 17: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

17

Data Error Experiments

Simulated data quality issues1. Misclassification of losses by accident year

2. Early years not available

3. Late processing of financial information

4. Paid losses replaced by crude estimates

5. Overstatements followed by corrections in following period

6. Definition of reported claims changed

Page 18: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

18

Measure Impact of Data Quality

• Compare Estimated to Actual Ultimates• Use Bootstrapping to evaluate effect of

difference samples on results

Page 19: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

19

Results – Experiment 1

More data generally reduces the volatility of the estimation errors

Page 20: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

20

Results – Experiment 2

Extreme volatility, especially those based on paid data

Actuaries ability to recognise and account for data quality issues is critical

Actuarial adjustments to the data may never fully correct for data quality issues

Page 21: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

21

Results – Bootstrapping

Less dispersion in results for error free data Standard deviation of estimated ultimate

losses greater for the modified data (data with errors)

Confirms original hypothesis

Page 22: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

22

Conclusions Resulting from Experiment

Greater accuracy and less variability in actuarial estimates when: Quality data used Greater number of accident years used

Data quality issues can erode or even reverse the gains of increased volumes of data: If errors are significant, more data may worsen estimates due

to the propagation of errors for certain projection methods Significant uncertainty in results when:

Data is incomplete Data has errors

Page 23: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

23

Predictive Modeling & Exploratory Data Analysis

Page 24: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

24

Modeling Process

Modeling

Evaluation

Deployment

Data Preparation

Data Understanding

Business Understanding

2

00Complex Model Artifact

Page 25: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

25

Data Preprocessing

Select Data

Clean Data

Construct Data

Rational for inclusion

Data Cleaning Report

Data Attributes Generate records

Integrate Data Merged records

Format Data

Data sets

Page 26: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

26

Exploratory Data Analysis

Typically the first step in analyzing data Makes heavy use of graphical techniques Also makes use of simple descriptive

statistics Purpose

Find outliers (and errors) Explore structure of the data

Page 27: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

27

Example Data

Private passenger auto Some variables are:

Age Gender Marital status Zip code Earned premium Number of claims Incurred losses Paid losses

Page 28: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

28

Some Methods for Numeric Data

Visual Histograms Box and Whisker Plots Stem and Leaf Plots

Statistical Descriptive statistics Data spheres

Page 29: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

29

Histograms

Can do them in Microsoft Excel

Page 30: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

30

Histograms of Age VariableVarying Window Size

16 33 51 68 85Age

0

20

40

60

80

16 23 30 37 44 51 57 64 71 78 85

Age

0

5

10

15

20

16 24 31 39 47 54 62 70 77 85 Age0

10

20

30

40

Page 31: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

31

Formula for Window Width

1

3

3.5

standard deviation

N=sample size

h =window width

h

N

Page 32: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

32

Example of Suspicious Value

600 900 1200 1500 1800

License Year

0

5,000

10,000

15,000

20,000

25,000F

req

ue

nc

y

Page 33: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

33

Box and Whisker Plot

-20

0

20

40

60

80

No

rmally

Dis

trib

ute

d A

ge

+1.5*midspread

median

Outliers

75th Percentile

25th Percentile

Page 34: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

34

Plot of Heavy Tailed DataPaid Losses

-10000

10000

30000

50000

70000

90000

110000

Pa

id L

oss

Page 35: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

35

Heavy Tailed Data – Log Scale

101

102

103

104

105

106

107P

aid

Loss

Page 36: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

36

Box and Whisker Example

16 1820 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 96

Age

0

200

400

600

800

1,000

Fre

qu

ency

Page 37: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

37

Descriptive StatisticsAnalysis ToolPak

Statistic Policyholder Age

Mean 36.9Standard Error 0.1Median 35.0Mode 32.0Standard Deviation 13.2Sample Variance 174.4Kurtosis 0.5Skewness 0.7Range 84Minimum 16Maximum 100Sum 1114357Count 30226Largest(2) 100Smallest(2) 16

Page 38: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

38

Descriptive Statistics

License Year has minimum and maximums that are impossible

N Minimum Maximum Mean Std.

Deviation

License Year 30,250 490 2,049 1,990 16.3

Valid N 30,250

Page 39: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

39

Data Spheres: The Mahalanobis Distance Statistic

1( ) ' ( )

is a vector of variables

is a vector of means

is a variance-covariance matrix

MD x μ Σ x μ

x

μ

Σ

Page 40: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

40

Screening Many Variables at Once

Plot of Longitude and Latitude of zip codes in data Examination of outliers indicated drivers in Ca and PR

even though policies only in one mid-Atlantic state

1( ) ' ( ) MD x μ Σ x μ

Page 41: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

41

Records With Unusual Values Flagged

Policy ID Mahalanobis

Depth Percentile of Mahalanobis Age

License Year

Number of Cars

Number of Drivers

Model Year

Incurred Loss

22244 59 100 27 1997 3 6 1994 4,456 6159 60 100 22 2001 2 6 1993 0

22997 65 100 NA NA 2 1 1954 0 5412 61 100 17 2003 3 6 1994 0

30577 72 100 43 1979 3 1 1952 0 28319 8,490 100 30 490 1 1 1987 0 27815 55 100 44 1976 -1 0 1959 0 16158 24 100 82 1938 1 1 1989 61,187 4908 25 100 56 1997 4 4 2003 35,697

28790 24 100 82 2039 1 1 1985 27,769

Page 42: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

42

Categorical Data: Data Cubes

Page 43: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

43

Categorical Data

Data Cubes Usually frequency tables Search for missing values coded as blanks

Gender

Frequency Percent 5,054 14.3 F 13,032 36.9 M 17,198 48.7 Total 35,284 100

Page 44: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

44

Categorical Data

Table highlights inconsistent coding of marital status

Marital Status

Frequency Percent 5,053 14.3 1 2,043 5.8 2 9,657 27.4 4 2 0 D 4 0 M 2,971 8.4 S 15,554 44.1 Total 35,284 100

Page 45: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

45

Screening for Missing Data

BUSINESS

TYPE Gender Age License Year

Valid 35,284 35,284 30,242 30,250 N

Missing 0 0 5,042 5,034

25 27.00 1,986.00

50 35.00 1,996.00 Percentiles

75 45.00 2,000.00

Page 46: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

46

Blanks as Missing

Frequency Percent Valid Percent Cumulative

Percent

5,054 14.3 14.3 14.3

F 13,032 36.9 36.9 51.3

M 17,198 48.7 48.7 100.0 Valid

Total 35,284 100.0 100.0

Page 47: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

47

Conclusions

Anecdotal horror stories illustrate possible dramatic impact of data quality problems

Data quality survey suggests data quality issues have a significant cost on actualial work

Data Quality Working Party urges actuaries to become data quality advocated within their organizations

Techniques from Exploratory Data Analysis can be used to detect data glitches

Page 48: 1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,

48

Library for Getting Started

Dasu and Johnson, Exploratory Data Mining and Data Cleaning, Wiley, 2003

Francis, L.A., “Dancing with Dirty Data: Methods for Exploring and Cleaning Data”, CAS Winter Forum, March 2005, www.casact.org

The Data Quality Working Party Paper will be Posted on the GIRO web site in about nine months