Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica...

Preview:

Citation preview

A robust clustering approach to fraud detection

Luis Angel Garcıa-Escudero

Dpto. de Estadıstica e I.O. and IMUVA - Universidad de Valladolid

joint (and on-going...) work with A. Mayo-Iscar, A. Gordaliza, C. Matran(U. Valladolid) and colleagues from M. Riani, A. Cerioli (U. Parma) and D.Perrotta, F. Torti (JRC-Ispra)

Conference on Benford’s Law for fraud detection. Stresa, July 2019 1

1. CLUSTERING AND ROBUSTNESS

• Clustering is the task of grouping a set of objects in such a way thatobjects in the same cluster are more similar to each other than to thosein other clusters:

• Sample mean:

� m = 1n

∑ni=1 xi minimizes

∑ni=1 ‖xi −m‖2

� m may be seen as the “center” of a data-cloud:

0 1 2 3 4

01

23

4

x1

y2 X

• k clusters ⇒ k “data-clouds” ⇒ k-means

• k-means: Search for

� k centers m1, ...,mk

� a partition {R1, ..., Rk} of {1, 2, ..., n}

minimizingk∑j=1

∑i∈Rj

‖xi −mj‖2.

• Cluster j:

Rj = {i : ‖xi −mj‖ ≤ ‖xi −ml‖ for every l = 1, ..., k}

(...assignment to the closest center...)

• Robustness: Many statistical procedures are strongly affected by evenfew outlying observations:

� The mean is not robust:

x =1.72 + 1.67 + 1.80 + 1.70 + 1.82 + 1.73 + 1.78

7= 1.745

x =1.72 + 1.67 + 1.80 + 1.70 + 182 + 1.73 + 1.78

7= 27.485

� k-means inherits that lack of robustness from the mean

• Lack of robustness of k-means:

−20 −10 0 10 20 30

−20

−10

010

2030

(a) 3−means

x1

x2

2 groups artific. joined

0 50 100 150

−40

−20

020

40

(b) 2−means

x1

x2

Cluster 1

Cluster 2

• Outliers can be seen as “clusters by themselves”

• So, why not increasing the number of clusters...?

� But:

· Due to (physical, economical,...) reasons we could have an initialidea of k without being aware of the existence of outliers

· “Radial/background” noise requires large k’s

• Moreover, the detection of outliers may be the goal itself!!!

• Outliers in trade data can be associated to “frauds”:

� Heterogeneous sources of data (clustering) + Few outliers (frauds??)

0 50 100 150 200

010

020

030

040

050

060

0

Trade data

Quantity (tons)

Valu

e (1

000

euro

s)

2.- TRIMMED k-MEANS

• Trimming is the oldest and most widely used way to achieve robustness.

• Trimmed mean: The proportion α/2 smallest and α/2 largestobservations are discarded before computing the mean:

−2 0 2 4 6

−1.0

−0.5

0.0

0.5

1.0

x

Trimmed Trimmed

• But,... how to trim in clustering?

� Why not trimming outlying “bridge” points?

0 5 10 15

−1.0

−0.5

0.00.5

1.0

x

Non−trimmed ’bridge’ points

� Why a symmetric trimming?

0 10 20 30

−1.0

−0.5

0.0

0.5

1.0

x

Symmetric trimming?

� How to trim in multivariate clustering problems?

• Idea: Data itself tell us which are the most outlying observations!!

� Data-driven, adaptive, impartial,... trimming!

• Trimmed k-means: we search for

� k centers m1, ...,mk and

� a partition {R0, R1, ..., Rk} of {1, 2, ..., n} with #R0 = [nα]

minimizingk∑j=1

∑i∈Rj

‖xi −mj‖2.

[A fraction α of data is not taken into account Trimmed]

• Black circles: trimmed points (k = 3 and α = 0.05):

−20 −10 0 10 20 30

−20

020

40

(a)

x1

x2

−5 0 5

−50

510

(b)

x1

x2

• Old Faithful Geyser data: x1 = “Eruption length”, x2 = “Previouseruption length” and n = 271

Classificationk = 3, α = 0.03

Eruption length

Pre

viou

s er

uptio

n le

ngth

1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

5.0

� k = 3 and α = 0.03 (0.03 · 271 ' 9 trimmed obs.): 6 rare “short-followed-by-short” eruptions trimmed, 3 bridge points...

3.- ROBUST MODEL-BASED CLUSTERING

• k-means and trimmed k-means prefer spherical clusters:

−6 −4 −2 0 2 4 6

−6−4

−20

24

6

(a) 2−means (spherical groups)

−6 −4 −2 0 2 4 6

−20

−10

010

20

(b) 2−means (elliptical groups)

• Elliptically contoured clusters?

• Multivariate normal distributions with densities φ(·;µ,Σ):

� µ =

(22

)and Σ =

(1 00 1

)[spherical] in (a)

� µ =

(22

)and Σ =

(2 11 1

)[non-spherical] in (b)

−2 0 2 4 6

−20

24

6

(a)

x1

y2

−2 0 2 4 6

−20

24

6

(b)

x1

y2

φ(x;µ,Σ) = (2π)−p/2|Σ|−1/2 exp(− (x− µ)′Σ−1(x− µ)/2

)

• Trimmed likelihoods: Search for

� k centers m1, ...,mk,

� k scatter matrices S1, ..., Sk, and,

� a partition {R0, R1, ..., Rk} of {1, 2, ..., n} with #R0 = [nα]

maximizing

k∑j=1

∑xi∈Rj

log φ(xi;mj, Sj) (obs. in R0 not taken into account)

Garcıa-Escudero et al 2008, Neykov et al 2007, Gallegos and Ritter 2005,...

• Constraints on the Sj scatter matrices needed:

� Unbounded target likelihood functions� Avoid detecting (non-interesting) “spurious” clusters

• Control relative axes’ lengths (eigenvalues constraints):

c = 1 Large c value

• The FSDA Matlab toolbox:

• The R package tclust at CRAN repository:

• The R package tclust :

> library(tclust)

• tkmeans(data, k, alpha)

� k = “number of groups”

� alpha = “trimming proportion”

• tclust(data, k, alpha, restr.fact, ...)

� restr.fact = “Strength of the constraints”

• tclust(X,k=3,alpha=0.03,restr.fact=50)

Large restr.factk = 3, α = 0.03

x1

x2

0 5 10 15 20

05

1015

20

• Old Faithful Geyser data again:

Classificationk = 4, α = 0

Eruption length

Prev

ious e

rupti

on le

ngth

1.5 2.5 3.5 4.5

1.52.0

2.53.0

3.54.0

4.55.0

Classificationk = 3, α = 0.03

Eruption lengthPr

eviou

s eru

ption

leng

th1.5 2.5 3.5 4.5

1.52.0

2.53.0

3.54.0

4.55.0

• Why k = 3 and α = 0.03 was a sensible solution?

• Applying ctlcurves to the Old Faithful Geyser data:

0.0 0.2 0.4 0.6 0.8

−800

−600

−400

−200

0CTL−Curves

α

Objec

tive F

uncti

on Va

lue

Restriction Factor = 50

5

5

5

5555555

555555555

55555555555

4

4

44444444444444

444

44444444444

3

3

3333333333333333

3333333333

33

2

2

222222222

2

22222222

2222222222

1

111111111

11

1

1111111111

1111111

0.00 0.02 0.04 0.06 0.08 0.10

−800

−700

−600

−500

−400

CTL−Curves

α

Objec

tive F

uncti

on Va

lue

Restriction Factor = 50

55

5 5 55

5 5 5 55 5 5 5 5

5 5 5 55

44 4

4 44 4 4

4 44 4

4 4 44

4 4 44

33

3

33

3 3 33 3

3 33 3 3

3 3 3 3 3

22

2

22

2 2 22 2

2 22 2 2

2 22 2

2

11 1

1 11 1 1

1 11 1

1 1 11 1

1 11

4.- ROBUST CLUSTERING AROUND LINEARSUBSPACES

• Robust linear grouping: Higher p dimensions, but assuming that ourdata “live” in k low-dimensional (affine) subspaces...

� We search for

· k linear subspaces h1, ..., hk in Rp

· a partition {R0, R1, ..., Rk} of {1, 2, ..., n} with #R0 = [nα]

minimizingk∑j=1

∑i∈Rj

‖xi − Prhj(xi)‖2.

� Prh(·) denotes the “orthogonal” projection onto the linear subspace h

• Example: Three linear structures in presence of noise:

1 1.5 2 2.5 3 3.5 4 4.5 5−6

−4

−2

0

2

4

6

X

Y

1 1.5 2 2.5 3 3.5 4 4.5 5−6

−4

−2

0

2

4

6

X

Y(a) α = 0 (b) α = 0.1 (◦ = “Trimmed”)

Trimmed “mixtures of regressions” can also be applied...

• k = 1 case ⇒ Robust “Principal Components Analysis (PCA)”:

� PCA provides a q-dimensional (q << p) representation of data by

minBq,Aq,m

n∑i=1

||xi − xi||2 for

xi = Prh(xi) = xi(Bq,Aq,m) = m + Bqai

· Aq =

−a1−· · ·−ai−· · ·−an−

is the scores matrix (n× q)

· Bq =

−b1−· · ·−bj−· · ·−bp−

is a matrix (p × q) whose columns generate a q-dimensional approximating subspace h

• Principal Components Analysis is highly non-robust!!!

• Least Trimmed Squares PCA (Maronna 2005): Minimize

n∑i=1

wi‖xi − xi‖2 =

n∑i=1

wi‖xi − xi(Bq,Aq,m)‖2,

with {wi}ni=1 being “0-1 weights” such that

n∑i=1

wi = [n(1− α)]

� Weights: wi =

{1 If xi is not trimmed

0 If xi is trimmed.

• Cases → xi = (xi1, ..., xip)′ ∈ Rp and Cells → xij ∈ R

� i denotes a country (or a trader; company;...) for i = 1, ..., n

� xij is the “quantity-value ratio” for country i in the j-th month (orthe j-th year; the j-th product;...) for j = 1, ..., p

• Casewise trimming: Trim xi cases with (at least one) outlying xij

n = 100× p = 4 data matrix with 2% outlying cells:

Outlying xij cells Trimmed xi cases (black lines)

• But when the dimension p increases... we do not expect many xicompletely free of outlying xij cells:

n = 100× p = 80 data matrix with 2% outlying cells:

Outlying xij cells Trimmed xi cases (black lines)

• Cellwise trimming:

� Only trimming outlying cells... (⇒ “Particular” frauds identified...??)

• PCA approximation xi = m + Bqai = (xi1, ..., xip)T re-written as

xij = mj + aTi bj.

• Cellwise LTS (Cevallos-Valdiviezo 2016): Minimize

n∑i=1

wij(xij −mj − aTi bj)2

� wij = 0 if cell xij is trimmed and wij = 1 if not with

n∑i=1

wij = [n(1− α)], for j = 1, ..., p.

• Different patterns/structures in data ⇒ G subspace approximations:

xgi(Bgqg,A

gqg,m

g)

= mg + Bgqga

gi or xgij = mg

j + (agi )Tbgj ,

for g = 1, ..., G

• Minimize

minwgij,B

gq ,A

gq,mg

n∑i=1

p∑j=1

G∑g=1

wgij(xij − xgij)

2.

� wgij = 1 if cell xij is assigned to cluster g and non-trimmed and 0otherwise

� Appropriate constraints on the wgij

q1, ..., qG are intrinsic dimensions...

• Example 1: n = 400 in dimension p = 100 with 2 groups and 2%“scattered” outliers:

0.0 0.2 0.4 0.6 0.8 1.0

−20

−10

010

2030

40

0.0 0.2 0.4 0.6 0.8 1.0

−20

−10

010

2030

40

• k = 2, q = 2 and α = 0.05:

“-” are the trimmed cells

• Cluster means and trimmed cells (◦):

• Example 2: n = 400 in dimension p = 100 with 2 groups and fewcurves with 20% consecutive cells corrupted:

0.0 0.2 0.4 0.6 0.8 1.0

−40

−20

020

40

0.0 0.2 0.4 0.6 0.8 1.0

−40

−20

020

40

• Results:

• Real data example: Average daily temperatures in 83 Spanishmeteorologic stations between 2007-2009 (n = 83 and p = 1096).

• Artificial outliers:

� Two periods of 50 consecutive days in Oviedo replaced by 0oC.

� 150 consecutive days in Huelva temperature replaced by 0oC.

• Cluster means:

� “Meseta” (Central plateau-Castile): Mediterranean:

� Cantabrian Coast: Canary Islands:

• Clustered stations:

• Clusters found and trimmed cells:

−10 −5 0 5 10 15

05

1015

First two scores of cluster 1

PONFERRADA

OURENSE

SORIA

BURGOS

VALLADOLID

ÁVILA

PUERTO DE NAVAC

SEGOVIA

VALLADOLID AIR

ZAMORA

LEÓN

SALAMANCA AIR

SALAMANCAMADRID AIRTORREJÓN ARDOZ

COLMENAR V.

MADRIDMADRID CUAT VI

GETAFE

TOLEDOCIUDAD REALGRANADA BA

GRANADA AIR

CUENCA

ALBACETE BA

ALBACETE

TERUEL

FORONDA

DAROCA

−5 0 5 10 15

05

10

First two scores of cluster 2

BARCELONA AIR

BARCELONA FABRA

CACERES

BADAJOZ AIR

HUELVA R. ESTE

JAÉN

CORDOBA

SEVILLA

MORÓN FRONTE

ROTA

JEREZ FRONT

MELILLAMALAGA

ALMERIA AIR SAN JAVIERMURCIA

ALCANTARILLAALICANTE AIR

ALICANTE

VALENCIA AIR

VALENCIA

CASTELLÓN ALMAZ

LOGROÑO

PAMPLONAPAMPLONA AIR

ZARAGOZA

LLEIDA

HUESCA AIR

TORTOSA

PALMA PORT

PALMA AIRMENORCA

IBIZA

−5 0 5

−4

−3

−2

−1

01

2

First two scores of cluster 3

HONDARRIBIA

SAN SEBASTIÁN

BILBAO

SANTANDER AIR

SANTANDER

GIJÓN PORT VITORIAOVIEDO

CORUÑACORUÑA AIR

SANTIAGO DE C.

PONTEVEDRA

VIGO

LUGO

TENERIFE NORTE

−1 0 1 2 3

−1

01

2

First two scores of cluster 4

LANZAROTE

LA PALMA

FUERTEVENTURA

STA CRUZ TENER

GRAN CANARIA

HIERRO

BARCELONA AIRBARCELONA FABRA

HONDARRIBIASAN SEBASTIÁN

BILBAOSANTANDER AIR

SANTANDERGIJÓN PORT

VITORIAOVIEDO

CORUÑACORUÑA AIR

SANTIAGO DE C.PONTEVEDRA

VIGOLUGO

PONFERRADAOURENSE

SORIABURGOS

VALLADOLIDÁVILA

PUERTO DE NAVACSEGOVIA

VALLADOLID AIRZAMORA

LEÓNSALAMANCA AIR

SALAMANCAMADRID AIR

TORREJÓN ARDOZCOLMENAR V.

MADRIDMADRID CUAT VI

GETAFETOLEDO

CACERESCIUDAD REALBADAJOZ AIR

HUELVA R. ESTEJAÉN

CORDOBAGRANADA BA

GRANADA AIRSEVILLA

MORÓN FRONTEROTA

JEREZ FRONTMELILLAMALAGA

ALMERIA AIRSAN JAVIER

MURCIAALCANTARILLAALICANTE AIR

ALICANTECUENCA

ALBACETE BAALBACETE

TERUELVALENCIA AIR

VALENCIACASTELLÓN ALMAZ

FORONDALOGROÑO

PAMPLONAPAMPLONA AIR

DAROCAZARAGOZA

LLEIDAHUESCA AIR

TORTOSAPALMA PORT

PALMA AIRMENORCA

IBIZALANZAROTE

LA PALMAFUERTEVENTURATENERIFE NORTESTA CRUZ TENER

GRAN CANARIAHIERRO

0 300 600 900Dias

Clusters

0

1

2

3

4

• Reconstructed curves “ ” and true real data “ ” in Oviedo:

• Conclusions:

� Different patterns/structures in data ⇒ Cluster Analysis

� Robust clustering aimed at (jointly) detecting main clusters(bulk of data) and outliers ⇒ Potential “frauds”...

� Higher dimensional problems: Assume clusters “living” in low-dimensional subspaces

� “Casewise” and “cellwise” trimming

Some References:

· Cuesta-Albertos, J.A., Gordaliza, A. and Matran, C. (1997), “Trimmed

k-means: An attempt to robustify quantizers,” Ann. Statist., 25, 553-576.

· Garcıa-Escudero, L.A. and Gordaliza, A. (1999), “Robustness properties of

k-means and trimmed k-means,” J. Amer. Statist. Assoc., 94, 956-969.

· Garcıa-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A.

(2008), “A General Trimming Approach To Robust Cluster Analysis,” Ann. Statist.,

36, 1324-1345.

· Garcıa-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A.

(2010), “A review of robust clustering methods,” Advances in Data Analysis and

Classification, 4, 89-109.

· Fritz, H.,Garcıa-Escudero, L.A. and Mayo-Iscar; A (2012), “tclust: An R

package for a trimming approach to Cluster Analysis,” Journal of Statistical Software,

47, Issue 12.

Thanks for your attention!!!

Recommended