Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica...

A robust clustering approach to fraud detection

Luis Angel Garcıa-Escudero

Dpto. de Estadıstica e I.O. and IMUVA - Universidad de Valladolid

joint (and on-going...) work with A. Mayo-Iscar, A. Gordaliza, C. Matran(U. Valladolid) and colleagues from M. Riani, A. Cerioli (U. Parma) and D.Perrotta, F. Torti (JRC-Ispra)

Conference on Benford’s Law for fraud detection. Stresa, July 2019 1

1. CLUSTERING AND ROBUSTNESS

• Clustering is the task of grouping a set of objects in such a way thatobjects in the same cluster are more similar to each other than to thosein other clusters:

• Sample mean:

� m = 1n

∑ni=1 xi minimizes

∑ni=1 ‖xi −m‖2

� m may be seen as the “center” of a data-cloud:

0 1 2 3 4

• k clusters ⇒ k “data-clouds” ⇒ k-means

• k-means: Search for

� k centers m1, ...,mk

� a partition {R1, ..., Rk} of {1, 2, ..., n}

minimizingk∑j=1

∑i∈Rj

‖xi −mj‖2.

• Cluster j:

Rj = {i : ‖xi −mj‖ ≤ ‖xi −ml‖ for every l = 1, ..., k}

(...assignment to the closest center...)

• Robustness: Many statistical procedures are strongly affected by evenfew outlying observations:

� The mean is not robust:

x =1.72 + 1.67 + 1.80 + 1.70 + 1.82 + 1.73 + 1.78

7= 1.745

x =1.72 + 1.67 + 1.80 + 1.70 + 182 + 1.73 + 1.78

7= 27.485

� k-means inherits that lack of robustness from the mean

• Lack of robustness of k-means:

−20 −10 0 10 20 30

(a) 3−means

2 groups artific. joined

0 50 100 150

(b) 2−means

Cluster 1

Cluster 2

• Outliers can be seen as “clusters by themselves”

• So, why not increasing the number of clusters...?

� But:

· Due to (physical, economical,...) reasons we could have an initialidea of k without being aware of the existence of outliers

· “Radial/background” noise requires large k’s

• Moreover, the detection of outliers may be the goal itself!!!

• Outliers in trade data can be associated to “frauds”:

� Heterogeneous sources of data (clustering) + Few outliers (frauds??)

0 50 100 150 200

Trade data

Quantity (tons)

2.- TRIMMED k-MEANS

• Trimming is the oldest and most widely used way to achieve robustness.

• Trimmed mean: The proportion α/2 smallest and α/2 largestobservations are discarded before computing the mean:

−2 0 2 4 6

−1.0

−0.5

Trimmed Trimmed

• But,... how to trim in clustering?

� Why not trimming outlying “bridge” points?

0 5 10 15

−1.0

−0.5

0.00.5

Non−trimmed ’bridge’ points

� Why a symmetric trimming?

0 10 20 30

−1.0

−0.5

Symmetric trimming?

� How to trim in multivariate clustering problems?

• Idea: Data itself tell us which are the most outlying observations!!

� Data-driven, adaptive, impartial,... trimming!

• Trimmed k-means: we search for

� k centers m1, ...,mk and

� a partition {R0, R1, ..., Rk} of {1, 2, ..., n} with #R0 = [nα]

minimizingk∑j=1

∑i∈Rj

‖xi −mj‖2.

[A fraction α of data is not taken into account Trimmed]

• Black circles: trimmed points (k = 3 and α = 0.05):

−20 −10 0 10 20 30

−5 0 5

• Old Faithful Geyser data: x1 = “Eruption length”, x2 = “Previouseruption length” and n = 271

Classificationk = 3, α = 0.03

Eruption length

1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

� k = 3 and α = 0.03 (0.03 · 271 ' 9 trimmed obs.): 6 rare “short-followed-by-short” eruptions trimmed, 3 bridge points...

3.- ROBUST MODEL-BASED CLUSTERING

• k-means and trimmed k-means prefer spherical clusters:

−6 −4 −2 0 2 4 6

−6−4

(a) 2−means (spherical groups)

−6 −4 −2 0 2 4 6

(b) 2−means (elliptical groups)

• Elliptically contoured clusters?

• Multivariate normal distributions with densities φ(·;µ,Σ):

� µ =

)and Σ =

(1 00 1

)[spherical] in (a)

� µ =

)and Σ =

(2 11 1

)[non-spherical] in (b)

−2 0 2 4 6

φ(x;µ,Σ) = (2π)−p/2|Σ|−1/2 exp(− (x− µ)′Σ−1(x− µ)/2

• Trimmed likelihoods: Search for

� k centers m1, ...,mk,

� k scatter matrices S1, ..., Sk, and,

� a partition {R0, R1, ..., Rk} of {1, 2, ..., n} with #R0 = [nα]

maximizing

k∑j=1

∑xi∈Rj

log φ(xi;mj, Sj) (obs. in R0 not taken into account)

Garcıa-Escudero et al 2008, Neykov et al 2007, Gallegos and Ritter 2005,...

• Constraints on the Sj scatter matrices needed:

� Unbounded target likelihood functions� Avoid detecting (non-interesting) “spurious” clusters

• Control relative axes’ lengths (eigenvalues constraints):

c = 1 Large c value

• The FSDA Matlab toolbox:

• The R package tclust at CRAN repository:

• The R package tclust :

> library(tclust)

• tkmeans(data, k, alpha)

� k = “number of groups”

� alpha = “trimming proportion”

• tclust(data, k, alpha, restr.fact, ...)

� restr.fact = “Strength of the constraints”

• tclust(X,k=3,alpha=0.03,restr.fact=50)

Large restr.factk = 3, α = 0.03

0 5 10 15 20

• Old Faithful Geyser data again:

Classificationk = 4, α = 0

Eruption length

ious e

1.5 2.5 3.5 4.5

1.52.0

2.53.0

3.54.0

4.55.0

Classificationk = 3, α = 0.03

Eruption lengthPr

th1.5 2.5 3.5 4.5

1.52.0

2.53.0

3.54.0

4.55.0

• Why k = 3 and α = 0.03 was a sensible solution?

• Applying ctlcurves to the Old Faithful Geyser data:

0.0 0.2 0.4 0.6 0.8

−800

−600

−400

−200

0CTL−Curves

tive F

Restriction Factor = 50

5555555

555555555

55555555555

44444444444444

44444444444

3333333333333333

3333333333

222222222

22222222

2222222222

111111111

1111111111

1111111

0.00 0.02 0.04 0.06 0.08 0.10

−800

−700

−600

−500

−400

CTL−Curves

tive F

Restriction Factor = 50

5 5 55

5 5 5 55 5 5 5 5

5 5 5 55

4 44 4 4

4 44 4

4 4 44

3 3 33 3

3 33 3 3

3 3 3 3 3

2 2 22 2

2 22 2 2

2 22 2

1 11 1 1

1 11 1

1 1 11 1

4.- ROBUST CLUSTERING AROUND LINEARSUBSPACES

• Robust linear grouping: Higher p dimensions, but assuming that ourdata “live” in k low-dimensional (affine) subspaces...

� We search for

· k linear subspaces h1, ..., hk in Rp

· a partition {R0, R1, ..., Rk} of {1, 2, ..., n} with #R0 = [nα]

minimizingk∑j=1

∑i∈Rj

‖xi − Prhj(xi)‖2.

� Prh(·) denotes the “orthogonal” projection onto the linear subspace h

• Example: Three linear structures in presence of noise:

1 1.5 2 2.5 3 3.5 4 4.5 5−6

Y(a) α = 0 (b) α = 0.1 (◦ = “Trimmed”)

Trimmed “mixtures of regressions” can also be applied...

• k = 1 case ⇒ Robust “Principal Components Analysis (PCA)”:

� PCA provides a q-dimensional (q << p) representation of data by

minBq,Aq,m

n∑i=1

||xi − xi||2 for

xi = Prh(xi) = xi(Bq,Aq,m) = m + Bqai

· Aq =

−a1−· · ·−ai−· · ·−an−

is the scores matrix (n× q)

· Bq =

−b1−· · ·−bj−· · ·−bp−

is a matrix (p × q) whose columns generate a q-dimensional approximating subspace h

• Principal Components Analysis is highly non-robust!!!

• Least Trimmed Squares PCA (Maronna 2005): Minimize

n∑i=1

wi‖xi − xi‖2 =

n∑i=1

wi‖xi − xi(Bq,Aq,m)‖2,

with {wi}ni=1 being “0-1 weights” such that

n∑i=1

wi = [n(1− α)]

� Weights: wi =

{1 If xi is not trimmed

0 If xi is trimmed.

• Cases → xi = (xi1, ..., xip)′ ∈ Rp and Cells → xij ∈ R

� i denotes a country (or a trader; company;...) for i = 1, ..., n

� xij is the “quantity-value ratio” for country i in the j-th month (orthe j-th year; the j-th product;...) for j = 1, ..., p

• Casewise trimming: Trim xi cases with (at least one) outlying xij

n = 100× p = 4 data matrix with 2% outlying cells:

Outlying xij cells Trimmed xi cases (black lines)

• But when the dimension p increases... we do not expect many xicompletely free of outlying xij cells:

n = 100× p = 80 data matrix with 2% outlying cells:

Outlying xij cells Trimmed xi cases (black lines)

• Cellwise trimming:

� Only trimming outlying cells... (⇒ “Particular” frauds identified...??)

• PCA approximation xi = m + Bqai = (xi1, ..., xip)T re-written as

xij = mj + aTi bj.

• Cellwise LTS (Cevallos-Valdiviezo 2016): Minimize

n∑i=1

wij(xij −mj − aTi bj)2

� wij = 0 if cell xij is trimmed and wij = 1 if not with

n∑i=1

wij = [n(1− α)], for j = 1, ..., p.

• Different patterns/structures in data ⇒ G subspace approximations:

xgi(Bgqg,A

= mg + Bgqga

gi or xgij = mg

j + (agi )Tbgj ,

for g = 1, ..., G

• Minimize

minwgij,B

n∑i=1

p∑j=1

G∑g=1

wgij(xij − xgij)

� wgij = 1 if cell xij is assigned to cluster g and non-trimmed and 0otherwise

� Appropriate constraints on the wgij

q1, ..., qG are intrinsic dimensions...

• Example 1: n = 400 in dimension p = 100 with 2 groups and 2%“scattered” outliers:

0.0 0.2 0.4 0.6 0.8 1.0

• k = 2, q = 2 and α = 0.05:

“-” are the trimmed cells

• Cluster means and trimmed cells (◦):

• Example 2: n = 400 in dimension p = 100 with 2 groups and fewcurves with 20% consecutive cells corrupted:

0.0 0.2 0.4 0.6 0.8 1.0

• Results:

• Real data example: Average daily temperatures in 83 Spanishmeteorologic stations between 2007-2009 (n = 83 and p = 1096).

• Artificial outliers:

� Two periods of 50 consecutive days in Oviedo replaced by 0oC.

� 150 consecutive days in Huelva temperature replaced by 0oC.

• Cluster means:

� “Meseta” (Central plateau-Castile): Mediterranean:

� Cantabrian Coast: Canary Islands:

• Clustered stations:

• Clusters found and trimmed cells:

−10 −5 0 5 10 15

First two scores of cluster 1

PONFERRADA

OURENSE

BURGOS

VALLADOLID

ÁVILA

PUERTO DE NAVAC

SEGOVIA

VALLADOLID AIR

ZAMORA

SALAMANCA AIR

SALAMANCAMADRID AIRTORREJÓN ARDOZ

COLMENAR V.

MADRIDMADRID CUAT VI

GETAFE

TOLEDOCIUDAD REALGRANADA BA

GRANADA AIR

CUENCA

ALBACETE BA

ALBACETE

TERUEL

FORONDA

DAROCA

−5 0 5 10 15

BARCELONA AIR

BARCELONA FABRA

CACERES

BADAJOZ AIR

HUELVA R. ESTE

CORDOBA

SEVILLA

MORÓN FRONTE

JEREZ FRONT

MELILLAMALAGA

ALMERIA AIR SAN JAVIERMURCIA

ALCANTARILLAALICANTE AIR

ALICANTE

VALENCIA AIR

VALENCIA

CASTELLÓN ALMAZ

LOGROÑO

PAMPLONAPAMPLONA AIR

ZARAGOZA

LLEIDA

HUESCA AIR

TORTOSA

PALMA PORT

PALMA AIRMENORCA

−5 0 5

HONDARRIBIA

SAN SEBASTIÁN

BILBAO

SANTANDER AIR

SANTANDER

GIJÓN PORT VITORIAOVIEDO

CORUÑACORUÑA AIR

SANTIAGO DE C.

PONTEVEDRA

TENERIFE NORTE

−1 0 1 2 3

LANZAROTE

LA PALMA

FUERTEVENTURA

STA CRUZ TENER

GRAN CANARIA

HIERRO

BARCELONA AIRBARCELONA FABRA

HONDARRIBIASAN SEBASTIÁN

BILBAOSANTANDER AIR

SANTANDERGIJÓN PORT

VITORIAOVIEDO

CORUÑACORUÑA AIR

SANTIAGO DE C.PONTEVEDRA

VIGOLUGO

PONFERRADAOURENSE

SORIABURGOS

VALLADOLIDÁVILA

PUERTO DE NAVACSEGOVIA

VALLADOLID AIRZAMORA

LEÓNSALAMANCA AIR

SALAMANCAMADRID AIR

TORREJÓN ARDOZCOLMENAR V.

MADRIDMADRID CUAT VI

GETAFETOLEDO

CACERESCIUDAD REALBADAJOZ AIR

HUELVA R. ESTEJAÉN

CORDOBAGRANADA BA

GRANADA AIRSEVILLA

MORÓN FRONTEROTA

JEREZ FRONTMELILLAMALAGA

ALMERIA AIRSAN JAVIER

MURCIAALCANTARILLAALICANTE AIR

ALICANTECUENCA

ALBACETE BAALBACETE

TERUELVALENCIA AIR

VALENCIACASTELLÓN ALMAZ

FORONDALOGROÑO

PAMPLONAPAMPLONA AIR

DAROCAZARAGOZA

LLEIDAHUESCA AIR

TORTOSAPALMA PORT

PALMA AIRMENORCA

IBIZALANZAROTE

LA PALMAFUERTEVENTURATENERIFE NORTESTA CRUZ TENER

GRAN CANARIAHIERRO

0 300 600 900Dias

Clusters

• Reconstructed curves “ ” and true real data “ ” in Oviedo:

• Conclusions:

� Different patterns/structures in data ⇒ Cluster Analysis

� Robust clustering aimed at (jointly) detecting main clusters(bulk of data) and outliers ⇒ Potential “frauds”...

� Higher dimensional problems: Assume clusters “living” in low-dimensional subspaces

� “Casewise” and “cellwise” trimming

Some References:

· Cuesta-Albertos, J.A., Gordaliza, A. and Matran, C. (1997), “Trimmed

k-means: An attempt to robustify quantizers,” Ann. Statist., 25, 553-576.

· Garcıa-Escudero, L.A. and Gordaliza, A. (1999), “Robustness properties of

k-means and trimmed k-means,” J. Amer. Statist. Assoc., 94, 956-969.

· Garcıa-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A.

(2008), “A General Trimming Approach To Robust Cluster Analysis,” Ann. Statist.,

36, 1324-1345.

· Garcıa-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A.

(2010), “A review of robust clustering methods,” Advances in Data Analysis and

Classification, 4, 89-109.

· Fritz, H.,Garcıa-Escudero, L.A. and Mayo-Iscar; A (2012), “tclust: An R

package for a trimming approach to Cluster Analysis,” Journal of Statistical Software,

47, Issue 12.

Thanks for your attention!!!

Luis Angel Garc a-Escudero - European Commission · Luis Angel Garc a-Escudero Dpto. de Estad stica...

Documents

EXPLORATORY TOOLS IN MODEL-BASED …bbercu/Event/Bosantouval...EXPLORATORY TOOLS IN MODEL-BASED CLUSTERING L.A. Garc ‡a-Escudero, A. Gordaliza, C. Matr an and A. Mayo-Iscar Dpto

Inteligencia Artificial Avanzada - Raul Benítez, Gerard Escudero

señor escudero

Escudero, pluas, solorzano

5 Noesiterapia Escudero

Escudero & Brown Teaser

Escudero 2

22 23 Y 25 28 - gob.mx · maria suncion ¡barra lopez y otro 354 ¡ 2017 aguirre escudero pedro vs angel ernesto camacho hernandez y otros 437f 2017 ... garcia hernandez rey francisco

Sampling Dynamic Ocean Fields - UC3Magolaya/download/papers/JOE-12.pdf · Sampling Dynamic Ocean Fields Angel Garc´ıa-Olaya, Fr´ed ´eric Py, Jnaneshwar Das, Kanna Rajan Abstract

Portfolio: Javier Gómez Escudero

Francis Escudero Podcast Transcript

Periodico El escudero

Escudero Et Al 2013_Purinergic Signalling

Escudero Cecilia y Maribel

JESUS ESCUDERO

UNIVERSIDAD AUTONOMA DE CIUDAD JUAREZ - uacj.mx · universidad autonoma de ciudad juarez ... chavez escudero myriam 18 8 a 10 fierro ruiz cesar ... carlos miguel angel 18 8 a 10 fernandez

UNIVERSIDAD COMPLUTENSE DE MADRID · 2019. 8. 28. · Robredo, David G omez, Mar a Pe, Mari Angeles Garc a, Jorge Gonz alez, David Alfaya y Luis ... Juan Angel Rojo Carulli, 18 de

“Education, Not Deportation”: Programs Kevin Escudero

Escudero diaz jose_4.4

PROGRAM - Caribe Federal Credit Union · President’s Report i. ... Lisa V. Rodríguez Sales and Services Executive ... Angel R. Escudero Network Administrator