Project 0th Review

Data Mining / Clustering

A Combined Approach for Clustering based on GSA-KM and Genetic Algorithms

Divakar Raj.M (0901016)

Dilip.M (0901015)

Kishore Kumar.C (0901036)

IV CSE - A

Under the guidance of

Mr.P.Perumal

Associate Professor

Department of Computer Science and Engineering (UG)



Introduction to Data Mining

• Data mining (knowledge discovery in databases):

– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

• Potential Applications

– Market analysis and management

– Risk analysis and management

– Fraud detection and management

– Text mining (news group, email, documents) and Web analysis

– Intelligent query answering



Data Mining: A KDD Process

– Data mining: the core of the knowledge discovery process.

[Figure: the KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]


Architecture of a Typical Data Mining System

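[Figure: layered architecture, bottom to top. Databases and Data Warehouse → Data cleaning, data integration, and filtering → Database or data warehouse server → Data mining engine → Pattern evaluation → Graphical user interface. A Knowledge-base supports the data mining engine and pattern evaluation layers.]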


Data Mining Functionalities

• Concept description: Characterization and discrimination

– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

• Association (correlation and causality)

– Multi-dimensional vs. single-dimensional association

– age(X, “20..29”) ∧ income(X, “20..29K”) → buys(X, “PC”)

– contains(T, “computer”) → contains(T, “software”)



• Classification and Prediction

– Finding models (functions) that describe and distinguish classes or concepts for future prediction

– E.g., classify countries based on climate, or classify cars based on gas mileage

– Presentation: decision-tree, classification rule, neural network

– Prediction: Predict some unknown or missing numerical values

• Cluster analysis

– Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

– Clustering based on the principle: maximizing the intra-class similarity and minimizing the inter-class similarity



• Outlier analysis

– Outlier: a data object that does not comply with the general behavior of the data

– It can be considered as noise or an exception, but is quite useful in fraud detection and rare-event analysis

• Trend and evolution analysis

– Trend and deviation: regression analysis

– Sequential pattern mining, periodicity analysis

– Similarity-based analysis


Issues in Data mining

• Individual privacy

• Data integrity

• Relational database structure vs. multidimensional structure

• Issue of cost

• Mining methodology and user interaction issues

• Performance issues

• Issues relating to the diversity of database types



Applications

• Database analysis and decision support

– Market analysis and management

• Target Marketing, Customer Relationship Management, Market Basket Analysis, Cross Selling, Market Segmentation

– Risk analysis and management

• Forecasting, Customer Retention, Improved Underwriting, Quality Control, Competitive Analysis


Applications

• Text mining (news group, email, documents) and Web analysis

• Intelligent query answering

• Sports

• Astronomy

• Internet Web Surf-Aid



Clustering

• Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions

• The result is a set of meaningful subclasses, called clusters



Cluster Analysis

• Cluster: a collection of data objects

– Similar to one another within the same cluster

– Dissimilar to the objects in other clusters

• Cluster analysis

– Grouping a set of data objects into clusters

• Clustering is unsupervised classification: no predefined classes

• Typical applications

– As a stand-alone tool to get insight into data distribution

– As a preprocessing step for other algorithms



What Is Good Clustering?

• A good clustering method will produce high quality clusters with

– high intra-class similarity

– low inter-class similarity

• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
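A minimal Python sketch of how the two criteria above could be quantified, assuming Euclidean distance as the similarity measure. The function names are illustrative and not part of the reviewed work. A low total distance of points to their own cluster mean corresponds to high intra-class similarity, and a large distance between cluster means corresponds to low inter-class similarity.

```python
import math


def cluster_means(points, labels):
    """Return {label: mean point} for the given assignment."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: tuple(sum(x) / len(g) for x in zip(*g)) for lab, g in groups.items()}


def intra_cluster_distance(points, labels):
    """Sum of distances from each point to the mean of its own cluster (lower is better)."""
    means = cluster_means(points, labels)
    return sum(math.dist(p, means[lab]) for p, lab in zip(points, labels))


def min_inter_cluster_distance(points, labels):
    """Smallest distance between two cluster means (higher is better); needs >= 2 clusters."""
    means = list(cluster_means(points, labels).values())
    return min(math.dist(a, b) for i, a in enumerate(means) for b in means[i + 1:])


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
compact = [0, 0, 1, 1]    # groups points that lie close together
stretched = [0, 1, 0, 1]  # groups points that lie far apart

print(intra_cluster_distance(points, compact), min_inter_cluster_distance(points, compact))      # 4.0 10.0
print(intra_cluster_distance(points, stretched), min_inter_cluster_distance(points, stretched))  # 20.0 2.0
```

The first assignment scores better on both measures, which matches the intuition of a good clustering.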


Requirements of Clustering in Data Mining

• Scalability

• Ability to deal with different types of attributes

• Discovery of clusters with arbitrary shape

• Minimal requirements for domain knowledge to determine input parameters

• Ability to deal with noise and outliers

• Insensitivity to the order of input records

• Ability to handle high dimensionality

• Incorporation of user-specified constraints

• Interpretability and Usability


Major Clustering Approaches

• Partitioning algorithms: Construct various partitions and then evaluate them by some criterion

• Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion

• Density-based: based on connectivity and density functions

• Grid-based: based on a multiple-level granularity structure

• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model


Issues of Clustering

• Assessment of results

• Choice of appropriate number of clusters

• Data preparation

• Proximity measures

• Handling outliers



General Applications of Clustering

• Pattern Recognition

• Image Processing

• Economic Science (especially market research)

• WWW

– Document classification

– Cluster Weblog data to discover groups of similar access patterns


Examples of Clustering Applications

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

• Land use: Identification of areas of similar land use in an earth observation database

• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

• City-planning: Identifying groups of houses according to their house type, value, and geographical location

• Earthquake studies: Observed earthquake epicenters should be clustered along continental faults


Literature Survey

[1] An Architecture for Component-Based Design of Representative-Based Clustering Algorithms

Boris Delibašić, Milan Vukićević, Miloš Jovanović, Kathrin Kirchner, Johannes Ruhland, Milija Suknović (2012)

[2] The Research of Imbalanced Data Set of Sample Sampling Method Based on K-Means Cluster and Genetic Algorithm

Yang Yong (2012)

[3] A Combined Approach for Clustering based on K-means and Gravitational Search Algorithms

Abdolreza Hatamlou, Salwani Abdullah, Hossein Nezamabadi-pour (2012)



An Architecture for Component-Based Design of Representative-Based Clustering Algorithms

• Based on reusable components

• Components are derived from K-means-like algorithms and their extensions

• New algorithms are built by exchanging components between the original algorithms and their improvements

• The architecture makes it possible to compare and evaluate representative-based clustering algorithms



The Research of Imbalanced Data Set of Sample Sampling Method Based on K-Means Cluster and Genetic Algorithm

• K-means is used to cluster the data; within each cluster, GA is used to validate samples and to generate new ones

• Improves classification performance on imbalanced datasets

• Generates new samples for the minority class of the imbalanced dataset

• Focuses on the classification accuracy of the minority classes



A Combined Approach for Clustering based on K-means and Gravitational Search Algorithms

• A hybrid data clustering algorithm based on GSA and k-means (GSA-KM) is presented

• It uses the advantages of both algorithms

• Comparison of the performance of GSA-KM with other well-known algorithms

– K-means

– Genetic Algorithm (GA)

– Simulated Annealing (SA)

– Ant Colony Optimization (ACO)

– Honey Bee Mating Optimization (HBMO)

– Particle Swarm Optimization (PSO)

– Gravitational Search Algorithm (GSA)

• Comparison based on real and standard datasets from the UCI repository



Existing System


K-Means

• One of the most efficient and famous clustering algorithms

• Starts with some random or heuristic-based centroids for the desired clusters

• Assigns every data object to the closest centroid

• Iteratively refines the current centroids to reach the (near) optimal ones by calculating the mean value of data objects within their respective clusters

• The algorithm will terminate when any one of the specified termination criteria is met (i.e., a predetermined maximum number of iterations is reached, a (near) optimal solution is found, or the maximum search time is reached)
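The steps above can be sketched in Python as follows. This is an illustrative version that assumes Euclidean distance and caller-supplied initial centroids. The function name and the choice to leave an empty cluster's centroid unchanged are assumptions, and only two of the termination criteria (maximum iterations and no further centroid movement) are implemented.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd-style k-means; `points` and `centroids` are lists of equal-length tuples."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: every data object joins its closest centroid.
        labels = [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: every centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # assumption: an empty cluster keeps its centroid
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, labels


data = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centroids, final_labels = kmeans(data, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)  # [(1.0, 1.5), (8.5, 8.0)]
print(final_labels)     # [0, 0, 1, 1]
```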


Existing System


Gravitational Search Algorithm

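• Inspired by the physical phenomenon of gravity

• Based on the interaction of masses in the universe via the Newtonian law of gravity

• Attraction depends on the masses and the distance between them

• $F = G \frac{M_1 M_2}{R^2}$, where $G$ is the gravitational constant, $M_1$ and $M_2$ are the two masses, and $R$ is the distance between them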
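A hedged Python sketch of how this law can drive a search: each candidate solution is an agent whose mass grows with its fitness, and heavier agents pull the others toward them. It follows the general scheme of GSA as published by Rashedi et al., with several simplifications that are assumptions of this sketch. Every agent attracts every other (the published algorithm restricts attraction to the best agents over time), the parameter defaults `g0` and `alpha` are illustrative, and the force is divided by R rather than R², as in the published algorithm.

```python
import math
import random


def gsa_step(positions, velocities, fitness, t, max_iter,
             g0=100.0, alpha=20.0, eps=1e-12, rng=random):
    """Run one GSA iteration for a minimisation problem, updating agents in place."""
    n, dim = len(positions), len(positions[0])
    fit = [fitness(p) for p in positions]
    best, worst = min(fit), max(fit)

    # Masses: the better (lower) the fitness, the heavier the agent.
    if best == worst:
        masses = [1.0 / n] * n
    else:
        raw = [(worst - f) / (worst - best) for f in fit]
        total = sum(raw)
        masses = [m / total for m in raw]

    # The gravitational "constant" decays over time, narrowing the search.
    g = g0 * math.exp(-alpha * t / max_iter)

    # Acceleration of agent i: randomly weighted sum of the pulls of all other agents.
    # Because a_i = F_i / M_i, the agent's own mass cancels out.
    accelerations = []
    for i in range(n):
        acc = [0.0] * dim
        for j in range(n):
            if i == j:
                continue
            r = math.dist(positions[i], positions[j])
            pull = rng.random() * g * masses[j] / (r + eps)
            for d in range(dim):
                acc[d] += pull * (positions[j][d] - positions[i][d])
        accelerations.append(acc)

    # Velocity keeps a random fraction of its old value; then the agents move.
    for i in range(n):
        for d in range(dim):
            velocities[i][d] = rng.random() * velocities[i][d] + accelerations[i][d]
            positions[i][d] += velocities[i][d]
    return best  # best fitness seen before the move
```

For clustering, each position would hold the flattened coordinates of all centroids and `fitness` would be the clustering objective, such as the sum of intra-cluster distances.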


Drawbacks of Existing System

K-Means

• Performance is highly dependent on the initial state of the centroids

• May converge to a local optimum rather than the global optimum

• The number of clusters is needed as input to the algorithm, i.e. the number of clusters is assumed known
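The first two drawbacks can be reproduced on a tiny example. The Python sketch below is illustrative only: the helper names and the four-point dataset are made up for this purpose. It runs the same k-means loop from two different initial centroid pairs and compares the resulting sum of squared errors (SSE).

```python
import math


def run_kmeans(points, centroids, max_iter=100):
    """Compact k-means loop that returns the final centroids."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        updated = [tuple(sum(x) / len(c) for x in zip(*c)) if c else old
                   for c, old in zip(clusters, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)


# Four corners of a 4 x 1 rectangle; the natural clusters are the left and right pairs.
corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good = run_kmeans(corners, [(0.0, 0.5), (4.0, 0.5)])  # starts near the left and right pairs
poor = run_kmeans(corners, [(2.0, 0.0), (2.0, 1.0)])  # starts between the pairs

print(sse(corners, good))  # 1.0
print(sse(corners, poor))  # 16.0
```

The second run stops immediately because its centroids are already the means of the points assigned to them, even though the result is far from the best partition.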



GSA-KM

• Built on three main steps

1. GSA-KM applies the k-means algorithm to the selected dataset to produce near-optimal centroids for the desired clusters

2. It then produces an initial population of candidate solutions

3. Finally, it applies GSA to this population



Producing the Initial Population

• One candidate solution is the output of the k-means algorithm from the previous step

• Three candidate solutions are created from the dataset itself, and the remaining solutions are produced randomly

• GSA is then employed to determine an optimal solution for the clustering problem
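A possible shape for this population, sketched in Python. The slides do not say how the three dataset-based candidates are derived, so this sketch assumes each one is a random sample of k data points. Drawing the random candidates uniformly within the per-dimension bounds of the data is likewise an assumption, and the function name is hypothetical.

```python
import random


def initial_population(points, kmeans_centroids, size, rng=random):
    """Build `size` candidate solutions (size >= 4), each a list of k centroids."""
    k, dim = len(kmeans_centroids), len(points[0])
    lows = [min(p[d] for p in points) for d in range(dim)]
    highs = [max(p[d] for p in points) for d in range(dim)]

    # Candidate 1: the centroids produced by k-means in the previous step.
    population = [[list(c) for c in kmeans_centroids]]

    # Candidates 2-4: derived from the dataset (assumption: k sampled data points each).
    for _ in range(3):
        population.append([list(p) for p in rng.sample(points, k)])

    # Remaining candidates: random centroids inside the bounding box of the data.
    while len(population) < size:
        population.append([[rng.uniform(lows[d], highs[d]) for d in range(dim)]
                           for _ in range(k)])
    return population


dataset = [(0.0, 0.0), (1.0, 3.0), (4.0, 1.0), (5.0, 5.0), (2.0, 2.0)]
seed_centroids = [(0.5, 1.5), (4.5, 3.0)]  # stand-in for a k-means result
population = initial_population(dataset, seed_centroids, size=8, rng=random.Random(1))
print(len(population))  # 8
```

Seeding the population with the k-means result gives GSA at least one good starting point, while the other candidates keep the search diverse.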



Reasons for Efficiency

• Decreases the number of iterations and function evaluations needed to find a near-global optimum, compared with the original GSA alone

• With a good candidate solution already in the initial population, GSA can search for near-global optima in a promising region of the search space and therefore find a higher-quality solution than the original GSA alone



Proposed System

• In addition to GSA-KM, we intend to apply a Genetic Algorithm to further improve the efficiency and speed of the clustering

• The proposed system is expected to combine the advantages of both and to be faster and more efficient than traditional clustering algorithms and GSA-KM alone
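The slides do not yet specify how the Genetic Algorithm will be combined with GSA-KM. As one possibility, the Python sketch below shows a single GA generation over candidate solutions encoded as lists of centroids. The operator choices (elitism, tournament selection, centroid-level uniform crossover, Gaussian mutation), the parameter values, and the function name are all assumptions of this sketch.

```python
import random


def ga_generation(population, fitness, mutation_rate=0.1, sigma=0.1, rng=random):
    """Return the next generation for a minimisation problem (population size >= 2)."""
    scored = sorted(population, key=fitness)

    # Elitism: carry the best candidate over unchanged, so quality never degrades.
    next_gen = [[list(c) for c in scored[0]]]

    def pick():
        # Tournament selection: the fitter of two randomly chosen candidates.
        a, b = rng.sample(scored, 2)
        return a if fitness(a) <= fitness(b) else b

    while len(next_gen) < len(population):
        p1, p2 = pick(), pick()
        # Uniform crossover: each centroid is inherited from one of the two parents.
        child = [list(c1 if rng.random() < 0.5 else c2) for c1, c2 in zip(p1, p2)]
        # Mutation: occasionally nudge a coordinate with Gaussian noise.
        for centroid in child:
            for d in range(len(centroid)):
                if rng.random() < mutation_rate:
                    centroid[d] += rng.gauss(0.0, sigma)
        next_gen.append(child)
    return next_gen
```

In a combined scheme, `fitness` would be the same clustering objective that GSA-KM minimises, so that candidates can move between the two algorithms without re-encoding.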



Implementation Details

• Programming language: C#

• Database: MS Access

• The datasets from the given repository are clustered using K-means and GSA, together called GSA-KM, and a Genetic Algorithm is used to improve the performance

• The performance is measured and compared with that of other clustering algorithms



References

[1] C.L. Blake, C.J. Merz

UCI repository of machine learning databases

http://www.ics.uci.edu/-learn/MLRepository.html

[2] S. Das, A. Abraham, A. Konar

Metaheuristic pattern clustering: an overview

Studies in Computational Intelligence (2009)

[3] L. Kaufman, P.J. Rousseeuw

Finding Groups in Data: An Introduction to Cluster Analysis

John Wiley & Sons, New York, (1990)

[4] M.B. Adil

Modified global-means algorithm for minimum sum-of-squares clustering problems

Pattern Recognition 41 (10) (2008)

[5] E. Rashedi, H. Nezamabadi-pour, S. Saryazdi

GSA: a gravitational search algorithm

Information Sciences 179 (13) (2009)



[6] A. Likas, N. Vlassis, J.J. Verbeek

The global k-means clustering algorithm

Pattern Recognition 36 (2) (2003)

[7] M. Mahdavi

Novel meta-heuristic algorithms for clustering web documents

Applied Mathematics and Computation (2008)

[8] M. Moshtaghi

Clustering ellipses for anomaly detection

Pattern Recognition 44 (2011)

[9] B. Saglam, et al.,

A mixed-integer programming approach to the clustering problem with an application in customer segmentation

European Journal of Operational Research 173 (3) (2006)

[10] A.K. Jain

Data clustering: 50 years beyond K-means

Pattern Recognition Letters 31 (8) (2010)



Thank You !!!
