An extended K-means++ with mixed attributes for outlier detection Presented by Miss Sarunya Kanjanawattana

Page 1: An extended K-means++ with mixed attributes for outlier detection

An extended K-means++ with mixed attributes for outlier detection

Presented by Miss Sarunya Kanjanawattana

Page 2: An extended K-means++ with mixed attributes for outlier detection

Examination Committee

Dr. Sumanta Guha (Chairperson)
Prof. Dr. Phan Minh Dung (Committee)
Dr. Matthew N. Dailey (Committee)

Page 3: An extended K-means++ with mixed attributes for outlier detection

:: Agenda ::

• Background
• Literature review
• Methodologies

Page 4: An extended K-means++ with mixed attributes for outlier detection

Background
• Problem statement
• Objective of the study
• Scope and Limitation
• Contribution

Page 5: An extended K-means++ with mixed attributes for outlier detection

« Background »

• Data mining:
– Huge volumes of data and information are collected in databases.
– This volume of data far exceeds the human ability to analyze it and extract valuable information for decision-making support.

“Data mining helps to transform the collected data into valuable information.”

Page 6: An extended K-means++ with mixed attributes for outlier detection

« Background »

• Outlier detection:
– Outlier clustering is a popular methodology used to detect fraud in data sets.
– It identifies each data point as “normal” or “outlier”.

Outlier data point => fraudulent sample

Page 7: An extended K-means++ with mixed attributes for outlier detection

« Background »

• Fraud detection
– Health insurance fraud detection is a beneficial and challenging task.
– Detection helps to reveal patterns of fraud and abuse.

Example: Health insurance fraud led by institutions or health professionals includes the falsification of information on forms.

Page 8: An extended K-means++ with mixed attributes for outlier detection

« Background »

• The National Health Security Office (NHSO)
– is an autonomous state agency, officially founded in 2002 under the National Health Security Act.
– The vital duties of the NHSO are to manage the health security fund and allocate the subsidiary budget to 236 clinics and 963 hospitals, in order to promote and develop a good health care system for all Thai people.

Page 9: An extended K-means++ with mixed attributes for outlier detection

« Problem statement »

• Fraud and abuse
– lead to significant additional expense in the health care system.

• A case study: the NHSO database
– It holds a large amount of data.
– New transactions arrive constantly, every hour of every day.
– The data become too large for human inspection to detect fraud.

• Outlier clustering approach:
– A faster and more accurate algorithm is needed to monitor outliers.

Page 10: An extended K-means++ with mixed attributes for outlier detection

« Objective of the study »

• To provide a process for extracting fraud instances and uncovering unusual activities in the NHSO.

• To extend K-means++, a variation of the standard k-means algorithm, to handle datasets with mixed attributes for detecting outliers.

• To answer what is the optimal “”.

Page 11: An extended K-means++ with mixed attributes for outlier detection

« Scope and Limitation »

• The data source covers only 4 provinces in Thailand
– Nakhon Ratchasima, Chaiyaphum, Buriram and Surin.

• The transactions come from a group of high-cost diseases
– Fraudulent behavior is more likely to occur in this group than in other groups of diseases.

Page 12: An extended K-means++ with mixed attributes for outlier detection

« Contribution »

• The proposed study provides a methodology to detect fraud and abuse in the NHSO, Thailand, and will present results of outlier clustering.

• This study proposes a novel algorithm, based on an extended K-means++, that works with mixed attributes and detects outliers.

Page 13: An extended K-means++ with mixed attributes for outlier detection

Literature review
• Fraud detection
• The process of data mining

Page 14: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

Yi et al. 2006:
• Aims to understand and detect suspicious health care fraud in large databases using clustering techniques.

• Compares two clustering tools: SAS EM and CLUTO.

• The experimental results indicate that CLUTO is faster than SAS EM, while SAS EM provides more useful clusters than CLUTO.

Fraud detection

Page 15: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

Liou, Tang, and Chen 2008:
• Applies data mining techniques to detect fraudulent or abusive reporting by healthcare providers, using their invoices for diabetic outpatient services.

• Compares logistic regression, neural networks and classification trees.

• The classification tree model performs best, with an overall correct identification rate of 99%.

Fraud detection

Page 16: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• Data preprocessing
– Data obtained from real databases are often incomplete, noisy and inconsistent.

– The goal of data preprocessing is to clean a rough data set in order to improve accuracy.

– The steps of data preprocessing:
  • data cleaning, data transformation and integration, and data reduction.

The process of data mining

Page 17: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• Data preprocessing
Wang and Chiang 2009:
– presents an efficient data preprocessing procedure for support vector clustering (SVC) that reduces the size of the training dataset.

The process of data mining

Page 18: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• K-means algorithm

The process of data mining

Page 19: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• K-means algorithm
– The benefits of K-means
  • Speed and simplicity: the algorithm is easy to understand and implement.

– The shortcomings of K-means
  • dependency on the number of clusters
  • degeneracy

The process of data mining

Page 20: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• K-means++ algorithm

The process of data mining

Page 21: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

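• K-means++ algorithm
• Arthur and Vassilvitskii 2007
– Fast and more efficient
  • K-means: O(i * n * k) distance computations for i iterations, n points and k clusters
  • K-means++: seeding gives an expected clustering cost within a factor of O(log k) of the optimum
– It does not work well with a dataset that combines categorical and numerical attributes.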

The process of data mining

Page 22: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• K-means++ algorithm• Example

The process of data mining

(k=3)

D(x) = the shortest distance from a data point x to the closest center we have already chosen.

Page 23: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• K-means++ algorithm• Example

The process of data mining

(k=3)

Page 24: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• K-means++ algorithm• Example

The process of data mining

D² = 8² + 4²
D² = 7² + 3²
D² = 1² + 7²
D² = 2² + 1²

(k=3)

Page 26: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• K-means++ algorithm• Example

The process of data mining

D² = 1² + 1²
D² = 1² + 7²
D² = 2² + 1²

(k=3)

Page 28: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• K-means++ algorithm• Example

The process of data mining

(k=3)

Page 29: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• Y-means algorithm

The process of data mining

Page 30: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• Y-means algorithm
• Guan, Ghorbani, and Belacel 2003
– Based on the K-means algorithm
– It overcomes two shortcomings of K-means:
  • dependency on the number of clusters, and degeneracy

The process of data mining

Page 31: An extended K-means++ with mixed attributes for outlier detection

« Literature review »

• Koufakou, Ortiz, Georgiopoulos, Anagnostopoulos, and Reynolds 2007
– Introduced a strategy named “Attribute Value Frequency (AVF)”.
– It is a fast and scalable outlier detection strategy for categorical data.

The process of data mining
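A minimal sketch of the AVF idea, for illustration only: each row is scored by the average frequency of its attribute values, and the rows with the lowest scores are the outlier candidates. The function name `avf_scores` and the toy rows are hypothetical, not taken from the cited paper or from this study.

```python
from collections import Counter

def avf_scores(rows):
    """Score each row by the mean frequency of its categorical values.

    A low score means the row is built from rare values, which makes it
    an outlier candidate under the AVF strategy.
    """
    n_attrs = len(rows[0])
    # One frequency table per attribute (column).
    counts = [Counter(row[j] for row in rows) for j in range(n_attrs)]
    return [
        sum(counts[j][row[j]] for j in range(n_attrs)) / n_attrs
        for row in rows
    ]

# Hypothetical toy data with two categorical attributes.
toy_rows = [("A", "C"), ("A", "C"), ("A", "D"), ("B", "D"),
            ("B", "C"), ("B", "E"), ("A", "D")]
scores = avf_scores(toy_rows)
rarest = min(range(len(toy_rows)), key=lambda i: scores[i])
print(toy_rows[rarest], scores[rarest])
```

The row ("B", "E") receives the lowest score here because "E" occurs only once in its column.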

Page 32: An extended K-means++ with mixed attributes for outlier detection

Methodologies
• Methodology
• Data collection
• Data evaluation
• Tasks and timeline

Page 33: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• The method is divided into 3 phases.
• Phase 1: Data preprocessing
– Convert categorical data to numeric data

• Phase 2: Clustering
– Performed with the K-means++ algorithm

• Phase 3: Outlier detection
– Local and global outliers
– Determine which clusters are outliers

Page 34: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Overview of the extended K-means++ algorithm

Page 35: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 1: Data preprocessing

Page 36: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 1: Data preprocessing
1) Normalize the numeric attributes' values into the range 0 to 1.

Attribute W   Attribute X   Attribute Y   Attribute Z
A             C             100           100
A             C             300           900
A             D             800           800
B             D             900           200
B             C             200           800
B             E             600           900
A             D             700           100

Page 37: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 1: Data preprocessing
1) Normalize the numeric attributes' values into the range 0 to 1.

Attribute W   Attribute X   Attribute Y   Attribute Z
A             C             0.1           0.1
A             C             0.3           0.9
A             D             0.8           0.8
B             D             0.9           0.2
B             C             0.2           0.8
B             E             0.6           0.9
A             D             0.7           0.1
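The normalization step can be sketched as follows. The slide does not state the scaling rule; its values (100 becomes 0.1, 900 becomes 0.9) are consistent with dividing by 1000 rather than with min-max scaling, so the constant `ASSUMED_SCALE` below is an assumption inferred from the table, and `normalize` is a hypothetical helper name.

```python
# Raw numeric attributes Y and Z from the example table.
raw_y = [100, 300, 800, 900, 200, 600, 700]
raw_z = [100, 900, 800, 200, 800, 900, 100]

# ASSUMPTION: the table's normalized values equal value / 1000.
ASSUMED_SCALE = 1000

def normalize(values, scale=ASSUMED_SCALE):
    """Map each value into [0, 1] by dividing by a fixed upper bound."""
    return [v / scale for v in values]

norm_y = normalize(raw_y)
norm_z = normalize(raw_z)
print(norm_y)
print(norm_z)
```

With min-max scaling the smallest value of each attribute would map to 0 and the largest to 1, which does not match the table; that is why a fixed divisor is used in this sketch.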

Page 38: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 1: Data preprocessing
2) The categorical attribute with the largest number of distinct items is selected as the base attribute.

Attribute W   Attribute X   Attribute Y   Attribute Z
A             C             0.1           0.1
A             C             0.3           0.9
A             D             0.8           0.8
B             D             0.9           0.2
B             C             0.2           0.8
B             E             0.6           0.9
A             D             0.7           0.1

Attribute W: 2 items (A, B). Attribute X: 3 items (C, D, E), so Attribute X is the base attribute.

Page 39: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 1: Data preprocessing
3) Count the frequency of co-occurrence of items, represented by matrix M.

Attribute W   Attribute X   Attribute Y   Attribute Z
A             C             0.1           0.1
A             C             0.3           0.9
A             D             0.8           0.8
B             D             0.9           0.2
B             C             0.2           0.8
B             E             0.6           0.9
A             D             0.7           0.1

Matrix M =

    A  B  C  D  E
A   4  0  2  2  0
B   0  3  1  1  1
C   0  0  3  0  0
D   0  0  0  3  0
E   0  0  0  0  1
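As a rough illustration, the co-occurrence counts can be built in a few lines. The sketch assumes, as the example matrix suggests, that the diagonal holds each item's own frequency and the upper triangle holds how often two items appear in the same row; `cooccurrence` is a hypothetical helper name.

```python
from itertools import combinations

# Categorical columns W and X of the example table, one tuple per row.
rows = [("A", "C"), ("A", "C"), ("A", "D"), ("B", "D"),
        ("B", "C"), ("B", "E"), ("A", "D")]
items = ["A", "B", "C", "D", "E"]

def cooccurrence(rows, items):
    """Upper-triangular counts: diagonal = item frequency,
    off-diagonal = number of rows containing both items."""
    index = {item: i for i, item in enumerate(items)}
    n = len(items)
    m = [[0] * n for _ in range(n)]
    for row in rows:
        for item in row:
            m[index[item]][index[item]] += 1
        for a, b in combinations(row, 2):
            i, j = sorted((index[a], index[b]))
            m[i][j] += 1
    return m

M = cooccurrence(rows, items)
for label, line in zip(items, M):
    print(label, line)
```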

Page 40: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 1: Data preprocessing
4) Calculate the similarity between items, represented by equation D. The worked values are consistent with D_xy = M(x,y) / (M(x,x) + M(y,y) − M(x,y)).

Matrix M =

    A  B  C  D  E
A   4  0  2  2  0
B   0  3  1  1  1
C   0  0  3  0  0
D   0  0  0  3  0
E   0  0  0  0  1

Similarity   Calculated value
D_AC         2 / (4 + 3 − 2) = 0.4
D_AD         2 / (4 + 3 − 2) = 0.4
D_AE         0 / (4 + 1 − 0) = 0
D_BC         1 / (3 + 3 − 1) = 0.2
D_BD         1 / (3 + 3 − 1) = 0.2
D_BE         1 / (3 + 1 − 1) = 0.33
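The similarity values can be checked with a short sketch. It assumes the Jaccard-style ratio implied by the worked values (pair count divided by the two item counts minus the pair count); the slide's own equation is not reproduced in the text, so the function `similarity` is an illustration rather than the author's definition.

```python
items = ["A", "B", "C", "D", "E"]

# Co-occurrence matrix M from the example (upper triangular).
M = [
    [4, 0, 2, 2, 0],
    [0, 3, 1, 1, 1],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 3, 0],
    [0, 0, 0, 0, 1],
]

def similarity(x, y):
    """ASSUMED form: M(x,y) / (M(x,x) + M(y,y) - M(x,y))."""
    i, j = sorted((items.index(x), items.index(y)))
    both = M[i][j]
    return both / (M[i][i] + M[j][j] - both)

# Similarity of each non-base item (A, B) to each base item (C, D, E).
sims = {(x, y): similarity(x, y) for x in "AB" for y in "CDE"}
for pair, value in sims.items():
    print(pair, round(value, 2))
```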

Page 41: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 1: Data preprocessing
5) Find the group variance of each numeric attribute as the within-group sum of squares, SSw = Σ (x − group mean)², for each base item:

Y attribute

Base item   Mean                           SSw
C           (0.1 + 0.3 + 0.2) / 3 = 0.2    0.01 + 0.01 + 0 = 0.02
D           (0.8 + 0.9 + 0.7) / 3 = 0.8    0 + 0.01 + 0.01 = 0.02
E           0.6 / 1 = 0.6                  0

Z attribute

Base item   Mean                           SSw
C           (0.1 + 0.9 + 0.8) / 3 = 0.6    0.25 + 0.09 + 0.04 = 0.38
D           (0.8 + 0.2 + 0.1) / 3 = 0.37   0.185 + 0.029 + 0.073 = 0.287
E           0.9 / 1 = 0.9                  0

Σ SSw(Y) = 0.04
Σ SSw(Z) = 0.667

<< Select Y (the attribute with the smaller Σ SSw)
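A small sketch of this selection step, assuming SSw is the sum of squared deviations from each base item's group mean. It recomputes the sums directly from the example table instead of copying the slide's figures; the helper name `within_group_ss` is hypothetical.

```python
# Base attribute X and the normalized numeric attributes Y and Z.
base = ["C", "C", "D", "D", "C", "E", "D"]
y = [0.1, 0.3, 0.8, 0.9, 0.2, 0.6, 0.7]
z = [0.1, 0.9, 0.8, 0.2, 0.8, 0.9, 0.1]

def within_group_ss(labels, values):
    """Sum over groups of the squared deviations from the group mean."""
    groups = {}
    for label, value in zip(labels, values):
        groups.setdefault(label, []).append(value)
    total = 0.0
    for vals in groups.values():
        mean = sum(vals) / len(vals)
        total += sum((v - mean) ** 2 for v in vals)
    return total

ss_y = within_group_ss(base, y)
ss_z = within_group_ss(base, z)

# The attribute whose values vary least within each base item is selected.
selected = "Y" if ss_y < ss_z else "Z"
print(round(ss_y, 3), round(ss_z, 3), selected)
```

Y is selected because its values are far more tightly grouped by base item than those of Z.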

Page 42: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 1: Data preprocessing
6) Every base item is quantified by assigning it the mean of its corresponding values in the selected numeric attribute.

Y attribute

Base item   Mean
C           (0.1 + 0.3 + 0.2) / 3 = 0.2
D           (0.8 + 0.9 + 0.7) / 3 = 0.8
E           0.6 / 1 = 0.6

Attribute W   Attribute X   Attribute Y   Attribute Z
A             0.2 (C)       0.1           0.1
A             0.2 (C)       0.3           0.9
A             0.8 (D)       0.8           0.8
B             0.8 (D)       0.9           0.2
B             0.2 (C)       0.2           0.8
B             0.6 (E)       0.6           0.9
A             0.8 (D)       0.7           0.1

Page 43: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 1: Data preprocessing
7) All other categorical items are quantified by applying the function F: the sum, over the base items, of (similarity to the base item × value of the base item).

Attribute W   Attribute X   Attribute Y   Attribute Z
0.4 (A)       0.2 (C)       0.1           0.1
0.4 (A)       0.2 (C)       0.3           0.9
0.4 (A)       0.8 (D)       0.8           0.8
0.398 (B)     0.8 (D)       0.9           0.2
0.398 (B)     0.2 (C)       0.2           0.8
0.398 (B)     0.6 (E)       0.6           0.9
0.4 (A)       0.8 (D)       0.7           0.1

F(A) = 0.4 * 0.2 + 0.4 * 0.8 + 0 * 0.6 = 0.4

F(B) = 0.2 * 0.2 + 0.2 * 0.8 + 0.33 * 0.6 = 0.398

*All data in the data set are now numeric.
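The quantification of the non-base items can be sketched as a weighted sum, following the two worked lines for F(A) and F(B). The dictionary names and the helper `quantify` are hypothetical; the similarity of B to E uses the rounded value 0.33 from the example.

```python
# Numeric values assigned to the base items (means in attribute Y).
base_value = {"C": 0.2, "D": 0.8, "E": 0.6}

# Similarity of each non-base item to each base item (from step 4).
similarity = {
    "A": {"C": 0.4, "D": 0.4, "E": 0.0},
    "B": {"C": 0.2, "D": 0.2, "E": 0.33},
}

def quantify(item):
    """Weighted sum of base-item values, weighted by similarity."""
    return sum(similarity[item][b] * base_value[b] for b in base_value)

f_a = quantify("A")
f_b = quantify("B")
print(round(f_a, 3), round(f_b, 3))
```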

Page 44: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 2: Clustering

Probability: P(x) = D(x)² / Σ D(x')², where the sum runs over all data points x'.

D(x): the shortest distance from a data point x to the closest center we have already chosen.
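For reference, the K-means++ seeding rule of Arthur and Vassilvitskii can be sketched as below: the first center is drawn uniformly, and each later center is drawn with probability proportional to D(x)². This is a generic illustration, not the extended algorithm of this study; the function names and the sample points are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers using D(x)^2-weighted sampling."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # D(x)^2: squared distance from x to its closest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Hypothetical 2-D points forming three loose groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0),
          (9.0, 9.5), (1.0, 9.0), (2.0, 8.5)]
centers = kmeanspp_seed(points, k=3, rng=random.Random(0))
print(centers)
```

A point that is already a center has D(x)² = 0, so it cannot be drawn again while other distinct points remain.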

Page 45: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 2: Clustering
– Define initial values:
  • Cluster width
    – used to detect local outliers
    – set to 2.32, following a previous study
  • Cluster population ratio
    – used to detect global outliers
    – assumed to be 0.9

The detection rate and false negative rate should reach their best values at the optimal “”.

Page 46: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 3: Outlier detection

Page 47: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 3: Outlier detection
– There are 2 stages
  • Local outlier detection, based on the cluster width

Page 48: An extended K-means++ with mixed attributes for outlier detection

« Methodologies »

• Phase 3: Outlier detection
– There are 2 stages
  • Global outlier detection, based on the population ratio
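The slides name the two stages but the exact decision rules are not reproduced in the text, so the sketch below rests on two loudly stated assumptions: a point is a local outlier when its distance to the cluster centroid exceeds the cluster width (2.32) times the root-mean-square distance in that cluster, and a cluster is a global outlier when the larger clusters before it already cover the population ratio (0.9) of all points. Both rules and all names are hypothetical illustrations, not the author's method.

```python
import math

CLUSTER_WIDTH = 2.32     # value quoted on the slides
POPULATION_RATIO = 0.9   # value assumed on the slides

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def local_outliers(points, width=CLUSTER_WIDTH):
    """ASSUMED rule: distance to centroid > width * RMS distance."""
    c = centroid(points)
    dists = [math.dist(p, c) for p in points]
    sigma = math.sqrt(sum(d * d for d in dists) / len(dists))
    return [p for p, d in zip(points, dists) if d > width * sigma]

def global_outlier_clusters(clusters, ratio=POPULATION_RATIO):
    """ASSUMED rule: walking clusters from largest to smallest, a cluster
    is an outlier once earlier clusters cover `ratio` of all points."""
    total = sum(len(pts) for pts in clusters.values())
    covered = 0
    outlier_ids = []
    by_size = sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)
    for cluster_id, pts in by_size:
        if covered >= ratio * total:
            outlier_ids.append(cluster_id)
        covered += len(pts)
    return outlier_ids

# Hypothetical clusters: one large cluster with a stray point, one tiny cluster.
clusters = {
    "big": [(0.0,)] * 19 + [(10.0,)],
    "small": [(50.0,), (51.0,)],
}
local = local_outliers(clusters["big"])
global_ids = global_outlier_clusters(clusters)
print(local, global_ids)
```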

Page 49: An extended K-means++ with mixed attributes for outlier detection

« Data collection »

• A real dataset provided by the National Health Security Office of Thailand is used to demonstrate the effectiveness of the proposed method.

• The primary data will be gathered from the database, especially the statement information that contains all financial transactions.

Page 50: An extended K-means++ with mixed attributes for outlier detection

« Data collection »

• Overview of data set

Page 51: An extended K-means++ with mixed attributes for outlier detection

« Data evaluation »

• Outlier detection accuracy rate: the proportion of actual outliers that this approach correctly identifies as outliers.

• False positive rate: the proportion of normal points erroneously identified as outliers.
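Both measures can be computed from ground-truth labels and predictions, as in the sketch below. It assumes each rate is a proportion (detected outliers over all true outliers, and flagged normal points over all normal points); the function name and the sample labels are hypothetical.

```python
def detection_and_false_positive_rate(actual, predicted):
    """Return (detection rate, false positive rate).

    `actual` and `predicted` are equal-length lists of booleans,
    where True means "outlier".
    """
    true_pos = sum(1 for a, p in zip(actual, predicted) if a and p)
    false_pos = sum(1 for a, p in zip(actual, predicted) if not a and p)
    n_outliers = sum(1 for a in actual if a)
    n_normal = len(actual) - n_outliers
    detection = true_pos / n_outliers if n_outliers else 0.0
    false_positive = false_pos / n_normal if n_normal else 0.0
    return detection, false_positive

# Hypothetical labels: 2 true outliers, 4 normal points.
actual = [True, True, False, False, False, False]
predicted = [True, False, True, False, False, False]
detection, false_positive = detection_and_false_positive_rate(actual, predicted)
print(detection, false_positive)
```

In this toy case one of the two outliers is found and one of the four normal points is flagged by mistake.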

Page 52: An extended K-means++ with mixed attributes for outlier detection

« Tasks and timeline »

Page 53: An extended K-means++ with mixed attributes for outlier detection

Thank you
Dr. Sumanta Guha (Chairperson)
Prof. Dr. Phan Minh Dung (Committee)
Dr. Matthew N. Dailey (Committee)

Page 54: An extended K-means++ with mixed attributes for outlier detection

Questions?