
Project 0th Review


Data Mining / Clustering

A Combined Approach for Clustering based on the GSA-KM and Genetic Algorithms

Divakar Raj.M (0901016)

Dilip.M (0901015)

Kishore Kumar.C (0901036)

IV CSE - A

Under the guidance of

Mr.P.Perumal

Associate Professor

Department of Computer Science and Engineering (UG)


Introduction to Data Mining

• Data mining (knowledge discovery in databases):

– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

• Potential Applications

– Market analysis and management

– Risk analysis and management

– Fraud detection and management

– Text mining (news group, email, documents) and Web analysis

– Intelligent query answering


Data Mining: A KDD Process

– Data mining: the core of the knowledge discovery process.

[Figure: KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]


Architecture of a Typical Data Mining System

[Figure: architecture layers, bottom to top. Databases and Data Warehouse (data cleaning, data integration, filtering) → Database or data warehouse server → Data mining engine → Pattern evaluation → Graphical user interface, with a Knowledge base alongside]


Data Mining Functionalities

• Concept description: Characterization and discrimination

– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

• Association (correlation and causality)

– Multi-dimensional vs. single-dimensional association

– age(X, “20..29”) ∧ income(X, “20..29K”) ⇒ buys(X, “PC”)

– contains(T, “computer”) ⇒ contains(T, “software”)
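The slide gives association rules without saying how their strength is judged. Support and confidence are the standard measures, and the sketch below computes both for the rule "computer ⇒ software". The transactions are invented for illustration, and the function names `support` and `confidence` are assumptions rather than part of the original material.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market-basket data, invented for illustration only.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)
```

Here two of the four transactions contain both items, and two of the three transactions with a computer also contain software.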


• Classification and Prediction

– Finding models (functions) that describe and distinguish classes or concepts for future prediction

– E.g., classify countries based on climate, or classify cars based on gas mileage

– Presentation: decision-tree, classification rule, neural network

– Prediction: Predict some unknown or missing numerical values

• Cluster analysis

– Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

– Clustering based on the principle: maximizing the intra-class similarity and minimizing the inter-class similarity


• Outlier analysis

– Outlier: a data object that does not comply with the general behavior of the data

– It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis

• Trend and evolution analysis

– Trend and deviation: regression analysis

– Sequential pattern mining, periodicity analysis

– Similarity-based analysis


Issues in Data Mining

• Individual privacy

• Data integrity

• Relational database structure vs. multidimensional structure

• Cost

• Mining methodology and user interaction issues

• Performance issues

• Issues relating to the diversity of database types


Applications

• Database analysis and decision support

– Market analysis and management

• Target Marketing, Customer Relationship Management, Market Basket Analysis, Cross Selling, Market Segmentation

– Risk analysis and management

• Forecasting, Customer Retention, Improved Underwriting, Quality Control, Competitive Analysis


• Text mining (news group, email, documents) and Web analysis

• Intelligent query answering

• Sports

• Astronomy

• Internet Web Surf-Aid


Clustering

• Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions

• The result is a set of meaningful subclasses called clusters


Cluster Analysis

• Cluster: a collection of data objects

– Similar to one another within the same cluster

– Dissimilar to the objects in other clusters

• Cluster analysis

– Grouping a set of data objects into clusters

• Clustering is unsupervised classification: no predefined classes

• Typical applications

– As a stand-alone tool to get insight into data distribution

– As a preprocessing step for other algorithms


What Is Good Clustering?

• A good clustering method will produce high quality clusters with

– high intra-class similarity

– low inter-class similarity

• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
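One rough way to check the "high intra-class similarity, low inter-class similarity" criterion is to compare average distances within clusters against average distances between clusters. The Python sketch below assumes Euclidean distance as the (dis)similarity measure, which the slides do not specify; the points and function names are hypothetical.

```python
import math
from itertools import combinations


def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    dists = [math.dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(dists) / len(dists)


def mean_inter_distance(clusters):
    """Average distance between pairs of points from different clusters."""
    dists = [
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    ]
    return sum(dists) / len(dists)


# Hypothetical 2-D points already grouped into two clusters.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]

intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"intra={intra:.2f} inter={inter:.2f}")
```

A small intra-cluster average combined with a large inter-cluster average suggests compact, well-separated clusters under this particular distance measure.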


Requirements of Clustering in Data Mining

• Scalability

• Ability to deal with different types of attributes

• Discovery of clusters with arbitrary shape

• Minimal requirements for domain knowledge to determine input parameters

• Able to deal with noise and outliers

• Insensitive to order of input records

• High dimensionality

• Incorporation of user-specified constraints

• Interpretability and Usability


Major Clustering Approaches

• Partitioning algorithms: Construct various partitions and then evaluate them by some criterion

• Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion

• Density-based: based on connectivity and density functions

• Grid-based: based on a multiple-level granularity structure

• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model


Issues of Clustering

• Assessment of results

• Choice of appropriate number of clusters

• Data preparation

• Proximity measures

• Handling outliers


General Applications of Clustering

• Pattern Recognition

• Image Processing

• Economic Science (especially market research)

• WWW

– Document classification

– Cluster Weblog data to discover groups of similar access patterns


Examples of Clustering Applications

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

• Land use: Identification of areas of similar land use in an earth observation database

• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

• City-planning: Identifying groups of houses according to their house type, value, and geographical location

• Earthquake studies: Observed earthquake epicenters should be clustered along continental faults


Literature Survey

[1] An Architecture for Component-Based Design of Representative-Based Clustering Algorithms

Boris Delibašić, Milan Vukićević, Miloš Jovanović, Kathrin Kirchner, Johannes Ruhland, Milija Suknović (2012)

[2] The Research of Imbalanced Data Set of Sample Sampling Method Based on K-Means Cluster and Genetic Algorithm

Yang Yong (2012)

[3] A Combined Approach for Clustering based on K-means and Gravitational Search Algorithms

Abdolreza Hatamlou, Salwani Abdullah, Hossein Nezamabadi-pour (2012)


An Architecture for Component-Based Design of Representative-Based Clustering Algorithms

• Based on reusable components

• Components are derived from K-Means-like algorithms and their extensions

• A new algorithm is built by exchanging components taken from the original algorithms and their improvements

• Comparison and evaluation are possible using representative-based clustering algorithms


The Research of Imbalanced Data Set of Sample Sampling Method Based on K-Means Cluster and Genetic Algorithm

• K-Means is used to cluster the data; within each cluster, a GA generates new samples and checks their validity

• Improves classification performance on imbalanced datasets

• Generates new samples for the minority class of the imbalanced dataset

• Focuses on the classification accuracy of minority classes


A Combined Approach for Clustering based on K-means and Gravitational Search Algorithms

• A hybrid data clustering algorithm based on GSA and k-means (GSA-KM) is presented

• It uses the advantages of both algorithms

• Comparison of the performance of GSA-KM with other well-known algorithms

– K-means

– Genetic Algorithm (GA)

– Simulated Annealing (SA)

– Ant Colony Optimization (ACO)

– Honey Bee Mating Optimization (HBMO)

– Particle Swarm Optimization (PSO)

– Gravitational Search Algorithm (GSA)

• Comparison based on real and standard datasets from the UCI repository


Existing System


K-Means

• One of the most efficient and widely used clustering algorithms

• Starts with random or heuristic-based centroids for the desired clusters

• Assigns every data object to the closest centroid

• Iteratively refines the current centroids toward (near) optimal ones by recalculating each centroid as the mean of the data objects in its cluster

• Terminates when any one of the specified termination criteria is met: a predetermined maximum number of iterations is reached, a (near) optimal solution is found, or the maximum search time is reached
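The K-Means steps listed above can be sketched in a few lines. This is a minimal Python illustration (chosen for brevity; the project itself names C#), assuming Euclidean distance, random initial centroids drawn from the data, and two stopping rules: unchanged centroids or an iteration cap. The function name `kmeans`, its parameters, and the sample points are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(max_iter):
        # Assignment step: each point joins its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

An empty cluster simply keeps its previous centroid here; other implementations re-seed it instead, and the slides do not say which policy is intended.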


Gravitational Search Algorithm

• Inspired by the physical phenomenon of gravity

• Based on the interaction of masses in the universe via Newton's law of gravitation

• Attraction depends on the magnitudes of the masses and the distance between them

• $F = G \frac{M_1 M_2}{R^2}$
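To make the formula concrete, the sketch below evaluates Newton's law for two masses and shows the inverse-square dependence on distance. It illustrates only the physical law quoted on the slide; the force and update rules actually used inside GSA are not reproduced here. The function name and the choice of passing `g` as a parameter are assumptions.

```python
G = 6.674e-11  # physical gravitational constant, N*m^2/kg^2


def gravitational_force(m1, m2, r, g=G):
    """Magnitude of the Newtonian attraction between two point masses."""
    if r <= 0:
        raise ValueError("distance must be positive")
    return g * m1 * m2 / r ** 2


# With g = 1 for readability: doubling the distance quarters the force.
f_near = gravitational_force(5.0, 3.0, 2.0, g=1.0)
f_far = gravitational_force(5.0, 3.0, 4.0, g=1.0)
print(f_near, f_far)
```

Heavier masses attract more strongly, which is the intuition GSA borrows: better candidate solutions are given larger masses and pull the others toward them.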


Drawbacks of Existing System

K-Means

• Performance is highly dependent on the initial state of the centroids

• May converge to a local optimum rather than the global optimum

• The number of clusters is needed as input to the algorithm, i.e. the number of clusters is assumed known
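The first two drawbacks can be reproduced on a tiny one-dimensional example. The sketch below runs the same K-Means iterations from two different sets of initial centroids and reports the sum of squared errors (SSE) for each. The data, the helper name `kmeans_1d`, and the use of SSE as the quality measure are assumptions made for illustration.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Run Lloyd iterations on 1-D data from the given initial centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        updated = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    # Sum of squared errors of each value to its nearest final centroid.
    sse = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, sse


# Three obvious groups: {0, 1}, {10, 11}, {20, 21}.
values = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]
good_centroids, good_sse = kmeans_1d(values, [0.0, 10.0, 20.0])
bad_centroids, bad_sse = kmeans_1d(values, [0.0, 1.0, 10.0])
print(good_sse, bad_sse)
```

The first start settles on centroids 0.5, 10.5 and 20.5 with an SSE of 1.5. The second spends two centroids on the first group, merges the other two groups around 15.5, and stalls at an SSE of 101.0, a local optimum that further iterations cannot escape.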


GSA-KM

• Built on three main steps

1. The k-means algorithm is applied to the selected dataset to produce near-optimal centroids for the desired clusters

2. An initial population of candidate solutions is produced

3. GSA is applied to this population


Ways to Produce the Initial Population

• One candidate solution is the output of the k-means algorithm obtained in the previous step

• Three candidate solutions are created from the dataset itself, and the remaining solutions are generated randomly

• GSA is then employed to determine an optimal solution for the clustering problem
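The population recipe above could look roughly like the following Python sketch. The slides do not say how the three dataset-based candidates are built, so this sketch assumes each one is k data points drawn at random; the random candidates are assumed to be uniform inside the data's bounding box. The function name, parameters, and sample data are all hypothetical.

```python
import random


def build_initial_population(data, kmeans_centroids, size, seed=0):
    """Assemble candidate solutions; each one is a list of k centroids."""
    if size < 4:
        raise ValueError("size must be at least 4 (1 k-means + 3 data-based)")
    rng = random.Random(seed)
    k = len(kmeans_centroids)
    dims = range(len(data[0]))
    low = [min(p[d] for p in data) for d in dims]
    high = [max(p[d] for p in data) for d in dims]

    # 1) One candidate is the k-means output from the previous step.
    population = [list(kmeans_centroids)]
    # 2) Three candidates come from the dataset itself
    #    (ASSUMPTION: k data points drawn at random for each).
    for _ in range(3):
        population.append(rng.sample(data, k))
    # 3) The rest are random points inside the data's bounding box
    #    (ASSUMPTION: uniform sampling per dimension).
    while len(population) < size:
        population.append(
            [tuple(rng.uniform(low[d], high[d]) for d in dims) for _ in range(k)]
        )
    return population


# Hypothetical dataset and a hypothetical k-means result for k = 2.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
kmeans_centroids = [(4 / 3, 4 / 3), (25 / 3, 25 / 3)]
population = build_initial_population(data, kmeans_centroids, size=10)
print(len(population))
```

Each entry of `population` is one complete clustering candidate, which is what a population-based search such as GSA would then move through the search space.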


Reasons for Efficiency

• Fewer iterations and function evaluations are needed to find a near-global optimum than with the original GSA alone

• Because the initial population already contains a good candidate solution, GSA searches for near-global optima in a promising region of the search space and therefore finds a higher-quality solution than the original GSA alone


Proposed System

• On top of the existing GSA-KM, we intend to apply a Genetic Algorithm to further increase the efficiency and speed of clustering

• The proposed system is expected to combine the advantages of these methods and to be faster and more efficient than both traditional clustering algorithms and GSA-KM
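The slides do not specify how the Genetic Algorithm is combined with GSA-KM, so the following is only one plausible sketch in Python. It assumes each chromosome is a list of centroids, fitness is the sum of squared errors (lower is better), and the operators are single-point crossover, Gaussian mutation, and elitism. Every name and parameter value below is hypothetical.

```python
import math
import random


def sse(data, centroids):
    """Fitness to minimise: squared distance of each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in data)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover on two equal-length centroid lists (k >= 2)."""
    cut = rng.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:]


def mutate(centroids, rng, rate=0.1, sigma=0.5):
    """With probability `rate`, nudge a centroid by Gaussian noise."""
    return [
        tuple(x + rng.gauss(0.0, sigma) for x in c) if rng.random() < rate else c
        for c in centroids
    ]


def next_generation(data, population, rng):
    """Keep the best candidate (elitism) and breed the rest from the top half."""
    ranked = sorted(population, key=lambda s: sse(data, s))
    parents = ranked[: max(2, len(ranked) // 2)]
    children = [ranked[0]]
    while len(children) < len(population):
        a, b = rng.sample(parents, 2)
        children.append(mutate(crossover(a, b, rng), rng))
    return children


# Hypothetical data and candidate solutions for k = 2.
data = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (8.0, 9.0)]
population = [
    [(1.0, 1.5), (8.0, 8.5)],
    [(0.0, 0.0), (9.0, 9.0)],
    [(4.0, 4.0), (5.0, 5.0)],
    [(1.0, 1.0), (2.0, 2.0)],
]
rng = random.Random(0)
best_before = min(sse(data, s) for s in population)
population = next_generation(data, population, rng)
best_after = min(sse(data, s) for s in population)
print(best_before, best_after)
```

Because the best candidate is always carried over unchanged, the best fitness can never get worse from one generation to the next under these assumptions.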


Implementation Details

• Programming language: C#

• Database: MS Access

• The given repository is clustered using K-Means and GSA, together called GSA-KM, and a Genetic Algorithm is used to enhance the performance

• The performance is measured and compared with that of other clustering algorithms




Thank You !!!
