Data Clustering with Actuarial Applications

Guojun Gan, Emiliano Valdez
Department of Mathematics, University of Connecticut, Storrs, CT, USA

2017 Advances in Predictive Analytics (APA) conference, Waterloo, Canada
December 1, 2017




Outline

- Data clustering
- An application


Data clustering

- The process of dividing a set of objects into homogeneous groups
- Originating in anthropology and psychology in the 1930s (Driver and Kroeber, 1932; Zubin, 1938; Tryon, 1939), data clustering is now one of the most popular tools for exploratory data analysis


Data clustering is a major task of data mining

Unsupervised learning    Supervised learning
---------------------    --------------------
Data clustering          Classification
Association rules        Numerical prediction


Definition of clusters

Bock (1989) suggested the following criteria for data points in a cluster:

1. Share the same or closely related properties;
2. Have small mutual distances;
3. Have “contacts” or “relations” with at least one other data point in the cluster;
4. Be clearly distinguishable from the data points that are not in the cluster.


Examples of clusters

[Figure: examples of clusters.]


Data types

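\[
\begin{array}{c|cccc}
             & V_1    & V_2    & \cdots & V_d    \\ \hline
\mathbf{x}_1 & x_{11} & x_{12} & \cdots & x_{1d} \\
\mathbf{x}_2 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots       & \vdots & \vdots & \cdots & \vdots \\
\mathbf{x}_n & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{array}
\]

Table: A dataset in tabular form.

Types of variables:
- Discrete
- Continuous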


Dissimilarity measures

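A distance measure $D$ is a binary function that satisfies the following conditions (Anderberg, 1973):

1. $D(\mathbf{x}, \mathbf{y}) \ge 0$ (Nonnegativity);
2. $D(\mathbf{x}, \mathbf{y}) = D(\mathbf{y}, \mathbf{x})$ (Symmetry);
3. $D(\mathbf{x}, \mathbf{y}) = 0$ if and only if $\mathbf{x} = \mathbf{y}$ (Reflexivity);
4. $D(\mathbf{x}, \mathbf{z}) \le D(\mathbf{x}, \mathbf{y}) + D(\mathbf{y}, \mathbf{z})$ (Triangle inequality),

where $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$ are arbitrary data points.

An example is the Minkowski distance of order $p \ge 1$:

\[
D_{\min}(\mathbf{x}, \mathbf{y}) = \left( \sum_{j=1}^{d} |x_j - y_j|^{p} \right)^{\frac{1}{p}}, \tag{1}
\]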
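Equation (1) can be checked numerically with a few lines of Python. The sketch below is only an illustration: the function name `minkowski_distance` and the sample points are made up for this example, and the guard on `p` reflects the fact that the triangle inequality requires p ≥ 1.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length points, as in (1)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    if p < 1:
        raise ValueError("p must be at least 1 for the triangle inequality to hold")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Illustrative points (not from the slides).
x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)
manhattan = minkowski_distance(x, y, p=1)  # p = 1: sum of absolute differences
euclidean = minkowski_distance(x, y, p=2)  # p = 2: Euclidean distance
print(manhattan, euclidean)
```

Setting p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance used in the k-means objective later in the talk.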


Clustering algorithms

[Figure nodes: Clustering Algorithms; Hierarchical Algorithms; Partitional Algorithms; Agglomerative Algorithms; Divisive Algorithms; Crisp Algorithms; Fuzzy Algorithms]

Figure: Taxonomy of clustering algorithms.


Cluster validity

- Internal validity indices evaluate the clustering results based only on quantities and features inherited from the underlying dataset.
- External validity indices evaluate the clustering results based on a prespecified structure imposed on the underlying dataset.
- Relative validity indices evaluate the results of a clustering algorithm against the results of a different clustering algorithm or the results of the same algorithm but with different parameters.


k-means

Given a set of $n$ data points $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, the k-means algorithm tries to divide the dataset into $k$ clusters by minimizing the following objective function:

\[
P(U, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{il} \, \|\mathbf{x}_i - \mathbf{z}_l\|^2, \tag{2}
\]

where $k$ is the desired number of clusters specified by the user, $U = (u_{il})_{n \times k}$ is an $n \times k$ partition matrix, $Z = \{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_k\}$ is a set of cluster centers, and $\|\cdot\|$ is the $L_2$ norm or Euclidean distance. The partition matrix $U$ satisfies the following conditions:

\[
u_{il} \in \{0, 1\}, \quad i = 1, 2, \ldots, n, \; l = 1, 2, \ldots, k, \tag{3a}
\]

\[
\sum_{l=1}^{k} u_{il} = 1, \quad i = 1, 2, \ldots, n. \tag{3b}
\]
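The slides state the objective (2) but not the procedure used to minimize it. A common choice is Lloyd-style alternation between assigning points to their nearest center and recomputing centers as cluster means. The pure-Python sketch below assumes that procedure. The function names, the random initialization, and the toy data are illustrative assumptions.

```python
import random


def sq_dist(x, z):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd-style k-means (an assumed procedure); returns (labels, centers, objective)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # random initial centers
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: u_il = 1 only for the nearest center, so (3a)-(3b) hold.
        new_labels = [min(range(k), key=lambda l: sq_dist(x, centers[l])) for x in points]
        # Update step: each center becomes the mean of its current members.
        for l in range(k):
            members = [x for x, lab in zip(points, new_labels) if lab == l]
            if members:  # an empty cluster keeps its previous center
                centers[l] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    # Objective (2): sum of squared distances to the assigned centers.
    objective = sum(sq_dist(x, centers[lab]) for x, lab in zip(points, labels))
    return labels, centers, objective


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers, objective = kmeans(points, k=2)
print(labels, objective)
```

Each label plays the role of a row of U: point i has u_il = 1 for its label l and 0 elsewhere.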


Truncated fuzzy c-means

Let $T$ be an integer such that $1 \le T \le k$, and let $\mathcal{U}_T$ be the set of fuzzy partition matrices $U$ such that each row of $U$ has at most $T$ nonzero entries.

The TFCM algorithm (Gan et al., 2016) aims to find a truncated fuzzy partition matrix $U$ and a set of cluster centers $Z$ that minimize the following objective function:

\[
P(U, Z) = \sum_{i=1}^{n} \sum_{l=1}^{k} u_{il}^{\alpha} \left( \|\mathbf{x}_i - \mathbf{z}_l\|^2 + \epsilon \right), \tag{4}
\]

where $\alpha > 1$ is the fuzzifier, $U \in \mathcal{U}_T$, $Z = \{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_k\}$ is a set of cluster centers, $\|\cdot\|$ is the $L_2$ norm or Euclidean distance, and $\epsilon$ is a small positive number used to prevent division by zero.
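To make the truncation concrete, the sketch below builds one membership row with at most T nonzero entries and evaluates objective (4). The slide does not give the membership update, so the formula used here (the usual fuzzy c-means membership, restricted to the T nearest centers) is an assumption, as are the function names and the toy data. It also shows where ε matters: the distance term is raised to a negative power.

```python
def sq_dist(x, z):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def truncated_memberships(x, centers, T, alpha=2.0, eps=1e-6):
    """One row of U with at most T nonzero entries that sum to 1.

    Assumption: standard fuzzy c-means memberships, computed only over the
    T centers nearest to x. The slides do not state the update formula.
    """
    d = [sq_dist(x, z) + eps for z in centers]  # eps keeps every term positive
    nearest = sorted(range(len(centers)), key=lambda l: d[l])[:T]
    weights = {l: d[l] ** (-1.0 / (alpha - 1.0)) for l in nearest}
    total = sum(weights.values())
    return [weights[l] / total if l in weights else 0.0 for l in range(len(centers))]


def tfcm_objective(points, centers, U, alpha=2.0, eps=1e-6):
    """Evaluate objective (4) for a given membership matrix U."""
    return sum(
        (u ** alpha) * (sq_dist(x, z) + eps)
        for x, row in zip(points, U)
        for u, z in zip(row, centers)
    )


# Toy data (illustrative only).
points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0)]
centers = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0), (20.0, 20.0)]
U = [truncated_memberships(x, centers, T=2) for x in points]
value = tfcm_objective(points, centers, U)
print(U, value)
```

With T = 1 each row has a single entry equal to 1, which is the crisp partition of k-means. With T = k nothing is truncated.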


Hierarchical k-means

Hierarchical k-means (Nister and Stewenius, 2006) uses a divisive approach to apply the traditional k-means with small k's repeatedly until the desired number of clusters is reached.

Algorithm 1: Pseudo-code of hierarchical k-means.
Input: A dataset X, k
Output: k clusters

Apply the k-means algorithm to divide the dataset into two clusters;
repeat
    Apply the k-means algorithm to divide the largest existing cluster into two clusters;
until the number of clusters is equal to k;
Return the k clusters;
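Algorithm 1 can be turned into a short runnable sketch. The version below is pure Python and makes several assumptions not stated on the slide: "largest" is taken to mean the cluster with the most points, the 2-means step uses Lloyd-style iterations with random initial centers, and a `while` loop replaces `repeat ... until` so that k = 1 and k = 2 terminate correctly. All function names are illustrative.

```python
import random


def sq_dist(x, z):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def two_means(points, rng, n_iter=50):
    """Split points into two groups with Lloyd-style k-means (k = 2)."""
    centers = rng.sample(points, 2)
    for _ in range(n_iter):
        groups = ([], [])
        for x in points:
            idx = 1 if sq_dist(x, centers[1]) < sq_dist(x, centers[0]) else 0
            groups[idx].append(x)
        if not groups[0] or not groups[1]:
            break  # degenerate split; the caller handles it
        new_centers = [mean(groups[0]), mean(groups[1])]
        if new_centers == centers:
            break  # centers are stable
        centers = new_centers
    return groups


def hierarchical_kmeans(points, k, seed=0):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()  # cluster with the most points (assumed meaning)
        if len(largest) < 2:
            raise ValueError("not enough points to form k clusters")
        left, right = two_means(largest, rng)
        if not left or not right:  # e.g. duplicated points
            left, right = largest[:1], largest[1:]
        clusters.extend([left, right])
    return clusters


# Toy data: five points near 0 and three points near 100 on a line.
points = [(float(i), 0.0) for i in range(5)] + [(100.0 + i, 0.0) for i in range(3)]
clusters = hierarchical_kmeans(points, k=3)
print([len(c) for c in clusters])
```

Because every step is a 2-means problem on a single cluster, the cost per split stays small even when k is large.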


An application in variable annuity valuation

The metamodeling approach consists of the following major steps:

1. selecting a small number of representative contracts
2. using Monte Carlo simulation to calculate the fair market values (or other quantities of interest) of the representative contracts
3. building a regression model (i.e., the metamodel) based on the representative contracts and their fair market values
4. using the regression model to value the whole portfolio of variable annuity contracts
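The four steps above can be sketched end to end on toy data. Everything in the block below is a stand-in chosen for brevity and is not the authors' setup: contracts have a single made-up feature, `toy_valuation` is a deterministic placeholder for Monte Carlo valuation, the cluster centers are hard-coded, and a least-squares line replaces the metamodel.

```python
def toy_valuation(contract):
    """Step 2 stand-in: placeholder for the Monte Carlo valuation (hypothetical)."""
    return 5.0 + 0.1 * contract[0]


def nearest_contract(center, portfolio):
    """Step 1: pick the contract closest to a cluster center as its representative."""
    return min(portfolio, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, center)))


def fit_line(xs, ys):
    """Step 3 stand-in: least-squares line, returned as (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


# Toy portfolio: 100 contracts described by one feature (illustrative only).
portfolio = [(float(v),) for v in range(100, 1100, 10)]
centers = [(200.0,), (500.0,), (900.0,)]  # assumed output of a clustering step

reps = [nearest_contract(z, portfolio) for z in centers]       # step 1
rep_values = [toy_valuation(c) for c in reps]                  # step 2
intercept, slope = fit_line([c[0] for c in reps], rep_values)  # step 3
estimate = sum(intercept + slope * c[0] for c in portfolio)    # step 4
print(estimate)
```

The point of the design is that the expensive valuation runs only on the representatives (three calls here), while the cheap fitted model is applied to every contract.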


A synthetic portfolio with 190,000 variable annuity policies

[Histogram: x-axis FMV (0 to 2500), y-axis Frequency (0 to 20000)]

Figure: A histogram of the fair market values. The fair market values are in 1000s.


Clustering results

Table: Performance of the TFCM algorithm and the hierarchical k-means on the VA data. The runtime is in seconds.

           Hkmean (340)   Hkmean (680)   TFCM (340)   TFCM (680)
RWCSS              0.90           0.76         0.82         0.66
Runtime          130.02         136.19     2,647.11     5,544.81

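The RWCSS measure is defined as follows:

\[
\mathrm{RWCSS} = \frac{\sum_{l=1}^{k} \sum_{\mathbf{x} \in C_l} \sum_{j=1}^{d} (x_j - z_{lj})^2}{\sum_{\mathbf{x} \in X} \sum_{j=1}^{d} (x_j - \bar{x}_j)^2}, \tag{5}
\]

where $C_l$ denotes the $l$th cluster, $\mathbf{z}_l$ is the center of the $l$th cluster, and $\bar{\mathbf{x}}$ is the center of the whole dataset $X$.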
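Equation (5) translates directly into code. In the sketch below the function name and the toy clusters are illustrative. When no centers are supplied, each center is assumed to be the cluster mean, which matches k-means but need not hold for other algorithms.

```python
def rwcss(clusters, centers=None):
    """RWCSS as in (5): within-cluster sum of squares over total sum of squares.

    clusters: list of clusters, each a non-empty list of equal-length points.
    centers:  optional list of cluster centers; defaults to the cluster means
              (an assumption that matches k-means).
    """
    def mean(points):
        return [sum(col) / len(points) for col in zip(*points)]

    def sum_sq(points, center):
        return sum((a - b) ** 2 for x in points for a, b in zip(x, center))

    if centers is None:
        centers = [mean(c) for c in clusters]
    all_points = [x for c in clusters for x in c]
    within = sum(sum_sq(c, z) for c, z in zip(clusters, centers))
    total = sum_sq(all_points, mean(all_points))
    return within / total


# Toy example: two tight clusters that are far apart give a small ratio.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
ratio = rwcss(clusters)
print(ratio)
```

A single cluster containing all points gives a ratio of 1, and a lower value means the clusters explain more of the total variation.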


Predictive modeling results I

Table: Accuracy and runtime (in seconds) of the ordinary kriging model based on different clustering results.

           Hkmean (340)   Hkmean (680)   TFCM (340)   TFCM (680)
PE                -0.02           0.02         0.01         0.02
R^2                0.82           0.92         0.81         0.92
Runtime          329.50         787.11       334.62       808.99


Predictive modeling results II

Figure: Scatter and QQ plots of the ordinary kriging model based on the clustering result from hierarchical k-means with k = 340.


Predictive modeling results III

Figure: Scatter and QQ plots of the ordinary kriging model based on the clustering result from hierarchical k-means with k = 680.


Predictive modeling results IV

Figure: Scatter and QQ plots of the ordinary kriging model based on the clustering result from TFCM with k = 340.


Predictive modeling results V

Figure: Scatter and QQ plots of the ordinary kriging model based on the clustering result from TFCM with k = 680.


Acknowledgements

This work is supported by a CAE (Centers of Actuarial Excellence) grant [1] from the Society of Actuaries.

[1] http://actscidm.math.uconn.edu


References

Anderberg, M. (1973). Cluster Analysis for Applications. Academic Press, New York.

Bock, H. (1989). Probabilistic aspects in cluster analysis. In Opitz, O., editor, Conceptual and Numerical Analysis of Data, pages 12–44, Augsburg, FRG. Springer-Verlag.

Driver, H. E. and Kroeber, A. L. (1932). Quantitative expression of cultural relationships. University of California Publications in American Archaeology and Ethnology, 31(4):211–256.

Gan, G., Lan, Q., and Ma, C. (2016). Scalable clustering by truncated fuzzy c-means. Big Data and Information Analytics, 1(2/3):247–259.

Nister, D. and Stewenius, H. (2006). Scalable recognition with a vocabulary tree. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 2161–2168.

Tryon, R. C. (1939). Cluster analysis; correlation profile and orthometric (factor) analysis for the isolation of unities in mind and personality. Edwards Brothers, Inc., Ann Arbor, MI.

Zubin, J. (1938). A technique for measuring like-mindedness. Journal of Abnormal and Social Psychology, 33(4):508–516.