1 INTRODUCTION
1.1 Introduction:
Clustering is a central process in engineering and in many fields of scientific
research; it tries to group a set of points into clusters such that points in the
same cluster are more homogeneous to each other than points in different
clusters. Document clustering groups documents based on the similarity among
them, in an unsupervised manner. It is used in quick topic extraction, filtering
and information retrieval. We face an ever-increasing volume of text documents:
vast collections of documents in repositories, digital libraries and digitized
personal information such as articles and emails. These have brought challenges
for the effective and efficient organization of text documents.
There is no single optimization method known that solves all optimization
problems. Many optimization methods have been developed in recent years for
solving different types of optimization problems. The modern
optimization methods (sometimes called nontraditional optimization methods) are
very powerful and popular methods for solving complex engineering problems.
These methods are particle swarm optimization algorithm, neural networks, genetic
algorithms, ant colony optimization, artificial immune systems, and fuzzy
optimization.
The Particle Swarm Optimization (PSO) algorithm is a population-based stochastic
search algorithm and an alternative approach to complex non-linear optimization
problems. PSO was first introduced by Kennedy and Eberhart in 1995, and its basic
idea was originally inspired by simulating the social behavior of animals such as
bird flocking and fish schooling. It is based on the natural process of group
communication, by which individual knowledge is shared when a group of birds or
insects searches for food or migrates through a search space, even though no
individual knows where the best position is. Owing to this social behavior, if any
member finds a desirable path, the rest of the members quickly follow.
The PSO algorithm thus learns from animal behavior to solve optimization
problems. In PSO, each member of the population is called a particle and the
population is called a swarm. Starting from a randomly initialized population,
each particle moves through the search space in a direction that combines its own
best-known position with the swarm's best-known position.
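The standard PSO update rules described by Kennedy and Eberhart can be sketched as follows. This is an illustrative implementation minimizing a simple sphere function; the inertia weight `w` and acceleration coefficients `c1`, `c2` are commonly used example values, not settings taken from this thesis.

```python
import random

def pso(objective, dim=2, swarm_size=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    # Randomly initialise positions and velocities.
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm_size)]
    vel = [[0.0] * dim for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]                   # each particle's best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(swarm_size), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm's best position
    for _ in range(iters):
        for i in range(swarm_size):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # velocity = inertia + cognitive pull + social pull
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Minimise the sphere function f(x) = sum of x_d^2 over two dimensions.
best, best_val = pso(lambda x: sum(v * v for v in x))
```

The swarm converges toward the origin, the global minimum of the sphere function, which shows how the cognitive and social terms jointly steer every particle.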
This thesis also studies a meta-heuristic called tabu search and discusses its
features. Tabu search is one of the most efficient heuristics for finding quality
solutions in relatively short running times. Its principal characteristic is a
mechanism inspired by human memory: information stored in memory is used to guide
and restrict the future search so as to obtain quality solutions and to overcome
local optimality. This thesis describes the working of the tabu search algorithm
on document clustering problems and its combination with other optimization
techniques.
PSO has also been applied to the economic dispatch (ED) problem in power systems.
Many nonlinear characteristics of the generator, such as ramp rate limits,
prohibited operating zones, and nonsmooth cost functions, were considered in
practical generator operation. The feasibility of the method was demonstrated on
three different systems and compared with the GA method in terms of solution
quality and computational efficiency. The experimental results showed that PSO
was indeed capable of efficiently obtaining higher-quality solutions to ED
problems.
Applications of Tabu Search (TS), a heuristic method originally proposed by
Glover in 1986, to various combinatorial problems have appeared in the operations
research literature. In several cases, the methods described provide solutions very close to
optimality and are among the most effective, if not the best, to tackle the difficult
problems at hand. These successes have made TS extremely popular among those
interested in finding good solutions to the large combinatorial problems
encountered in many practical settings. Several papers, book chapters, special issues
and books have surveyed the rich TS literature (a list of some of the most important
references is provided in a later section). In spite of this abundant literature, there
still seem to be many researchers who, while they are eager to apply TS to new
problem settings, find it difficult to properly grasp the fundamental concepts of the
method, its strengths and its limitations, and to come up with effective
implementations. The purpose of this paper is to address this situation by providing
an introduction in the form of a tutorial focusing on the fundamental concepts of
TS. Throughout the paper, two relatively straightforward, yet challenging and
relevant, problems will be used to illustrate these concepts: the Classical Vehicle
Routing Problem (CVRP) and the Capacitated Plant Location Problem (CPLP). These
will be introduced in the following section. The remainder of the paper is organized
as follows. The basic concepts of TS (search space, neighborhoods, and short-term
tabu lists) are described and illustrated in Section 2. Intermediate, yet critical,
concepts, such as intensification and diversification, are described in Section 3. This
is followed in Section 4 by a brief discussion of advanced topics and recent trends in
TS, and in Section 5 by a short list of key references on TS and its
applications. Section 6 provides practical tips for newcomers struggling with
unforeseen problems as they first try to apply TS to their favorite problem. Section 7
concludes the paper with some general advice on the application of TS to
combinatorial problems.
Tabu search (TS) has its antecedents in methods designed to cross boundaries of feasibility or local optimality treated as barriers in classical procedures, and to systematically impose and release constraints to permit exploration of otherwise forbidden regions (Glover, 1977). The tabu search name and terminology come from Glover (1986). A distinguishing feature of the approach is its use of adaptive memory and special associated problem-solving strategies. TS provides the origin of the memory-based and strategy-intensive focus in the metaheuristic literature, as opposed to methods that are memory-less or use only a weak inheritance-based memory. It is also responsible for emphasizing the use of structured designs to exploit historical patterns of search, as opposed to processes that rely almost exclusively on randomization.
The fundamental principles of tabu search were elaborated in a series of papers in
the late 1980s and early 1990s, and have been assembled in the book Tabu Search
(Glover and Laguna, 1997). The remarkable success of tabu search in solving hard
optimization problems (especially those arising in real-world applications) has
caused an explosion of new TS applications in the last several years.
The tabu search philosophy is to derive and exploit a collection of intelligent problem-solving strategies, based on implicit and explicit learning procedures. The adaptive memory framework of TS not only involves the exploitation of the history of the problem-solving process, but also entails the creation of structures to make such exploitation possible. Problem-solving history extends to experience gained from solving multiple instances of a problem class by joining TS with an associated learning approach called Target Analysis (see, e.g., chapter 9 of Glover and Laguna, 1997).
TS is an iterative procedure designed for the solution of optimization problems. TS starts with a random solution and evaluates the fitness function for that solution. Then all possible neighbors of the current solution are generated and evaluated. A neighbor is a solution which can be reached from the current solution by a simple, basic transformation. If the best of these neighbors is not in the tabu list, it is picked as the new current solution. The tabu list keeps track of previously explored solutions and prohibits TS from revisiting them. Thus, if the best neighbor solution is worse than the current one, TS will go uphill; in this way, local minima can be overcome. Any reversal of these solutions or moves is then forbidden and is classified as tabu. Aspiration criteria which allow overriding of tabu status can be introduced for moves that are still found to lead to a better fitness than that of the current optimum. If no more neighbors are available (all are tabu), or when no improvement is found during a predetermined number of iterations, the algorithm stops; otherwise, it continues the TS procedure.
Engineering and technology have been continuously providing examples of difficult
optimization problems. Here we present the tabu search technique which
with its various ingredients may be viewed as an engineer designed approach: no
clean proof of convergence is known but the technique has shown a remarkable
efficiency on many problems. The roots of tabu search go back to the 1970's; it was
first presented in its present form by Glover [Glover, 1986]; the basic ideas have also
been sketched by Hansen [Hansen 1986]. Additional efforts of formalization are
reported in [Glover, 1989], [de Werra & Hertz, 1989], [Glover, 1990]. Many
computational experiments have shown that tabu search has now become an
established optimization technique which can compete with almost all known
techniques and which - by its flexibility - can beat many classical procedures. Up to
now, there is no formal explanation of this good behavior. Recently, theoretical
aspects of tabu search have been investigated [Faigle & Kern, 1992], [Glover, 1992],
[Fox, 1993]. A didactic presentation of tabu search and a series of applications have
been collected in a recent book [Glover, Taillard, Laguna & de Werra, 1992]. Its
interest lies in the fact that success with tabu search often implies that a serious
effort of modeling be done from the beginning. The applications in [Glover, Taillard,
Laguna & de Werra, 1992] provide many such examples together with a collection of
references. A huge number of optimization techniques have been suggested by
researchers in different fields, and countless refinements have made these
techniques work on specific types of applications. All these procedures are based on
some common ideas and are furthermore characterized by a few additional specific
features. Among the optimization procedures the iterative techniques play an
important role: for most optimization problems no procedure is known in general to
get directly an "optimal" solution. The general step of an iterative procedure consists
in constructing from a current solution i a next solution j and in checking whether
one should stop there or perform another step. Neighbourhood search methods are
iterative procedures in which a neighbourhood N(i) is defined for each feasible
solution i, and the next solution j is searched among the solutions in N(i).
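A minimal sketch of such a neighbourhood search extended with a short-term tabu list is given below. The problem here (minimizing a Hamming-distance objective over bit strings, where a neighbour differs in one bit) is a hypothetical choice made only for illustration.

```python
from collections import deque

def tabu_search(objective, start, tabu_size=5, max_iters=50):
    """Minimise objective over bit strings; a neighbour differs in one bit."""
    current = list(start)
    best, best_val = current[:], objective(current)
    tabu = deque(maxlen=tabu_size)          # recently flipped bit positions
    for _ in range(max_iters):
        # Generate and evaluate all non-tabu neighbours N(i) of the current solution.
        candidates = []
        for i in range(len(current)):
            if i in tabu:
                continue
            neighbour = current[:]
            neighbour[i] ^= 1                # basic transformation: flip one bit
            candidates.append((objective(neighbour), i, neighbour))
        if not candidates:                   # all moves are tabu: stop
            break
        val, i, neighbour = min(candidates)
        # Accept the best neighbour even if it is worse (an uphill move),
        # which is how TS escapes local optima.
        current = neighbour
        tabu.append(i)                       # forbid reversing this move for a while
        if val < best_val:
            best, best_val = neighbour[:], val
    return best, best_val

# Hypothetical objective: number of bits differing from a target string.
target = [1, 0, 1, 1, 0, 1, 0, 0]
obj = lambda s: sum(a != b for a, b in zip(s, target))
best, best_val = tabu_search(obj, [0] * 8)
```

The tabu list stores the attributes of recent moves (here, flipped positions) rather than full solutions, which is the usual short-term memory design.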
Non-linear optimization problems are defined by non-linear constraints and/or a
non-linear objective. Such problems arise in several domains, including
chemical engineering, energy analysis, environmental planning, biotechnology and
thermal processes, among others. Different techniques and methods are employed
to model and solve these problems. A literature survey shows that the most used
techniques are evolutionary algorithms [1, 2], swarm optimization [6] and non-linear
mathematical programming [15]. Leyffer and Mahajan (2010) present a survey of
non-linearly constrained software and methods, focusing on the contrasting
strategies of local optimization and global optimization [15]. Some of those
approaches such as Genetic algorithms are reported to require a lot of parameters
and to entail considerable effort to implement. In the thermal engineering field,
many complex optimization problems arise in practice. Recently, non-linear
optimization problems have increasingly been subjected to analysis by non-
traditional optimization techniques. Patel and Rao [17, 18] recommend the use of
particle swarm optimization (PSO) based on case studies showing that PSO is simple
in concept, requires few parameters, is easy to implement and performs well
compared to traditional techniques like genetic algorithms [17, 18]. The PSO method
has produced good outcomes for a variety of optimization problems, but many
authors have pointed out a limitation in its ability to diversify the population (see [8,
24]). To deal with this problem, research efforts are underway on several fronts to
hybridize the PSO method with other meta-heuristics. The most commonly used
methods to create PSO hybrids are genetic algorithms and differential
evolution algorithms [24]. For global optimization, a PSO-TS hybrid algorithm which
joins PSO with tabu search (TS) has been proposed in [11]. More recently, Shelokar
et al. hybridize PSO with ant colony algorithm for continuous optimization [21]. In
this work, we focus on a thermal optimization problem known as the T-junction
problem, which consists in designing the main channel in electrical machines
responsible for evacuating generated heat. The objective is to determine the ideal
channel features that optimize the temperature in the system. This problem,
identified through a collaborative industrial project, can be formulated as a
constrained non-linear optimization problem (CNOP). The fitness function used to
evaluate solutions of this problem takes extensive computation time. The use of
meta-heuristics like genetic algorithms in this case has proved to be very time
consuming. We apply the PSO meta-heuristic to solve the problem due to its simple
implementation and the limited number of parameters to adjust, as well as for the
ability to control its fitness function effectively. To avoid premature convergence of
our method, a tabu search procedure is embedded within the PSO.
High-density DNA microarrays are one of the most powerful tools for functional
genomic studies and the development of microarray technology allows for
measuring expression levels of thousands of genes simultaneously Schena et al.
(1995). Recent studies have shown that one of the most important applications of
microarrays is tumor classification (Cho et al., 2003; Li et al., 2004). Gene selection is
an important component for gene expression-based tumor classification systems.
Microarray experiments generate large datasets with expression values for
thousands or even tens of thousands of genes but no more than a few tissue samples.
Most of the genes monitored in microarray may be irrelevant to analysis and the use
of all the genes may potentially inhibit the prediction performance of classification
rule by masking the contribution of the relevant genes (Li, 2006; Li and Yang, 2002;
Stephanopoulos et al., 2002; Nguyen and Rocke, 2002; Biceiato et al., 2003; Tan et
al., 2004). An efficient way to solve this problem is gene selection, and the
selection of discriminatory genes is critical to improving accuracy and
decreasing computational complexity and cost.
By selecting relevant genes, conventional classification techniques can be applied to
the microarray data. Gene selection may highlight those relevant genes and it could
enable biologists to gain significant insight into the genetic nature of the disease and
the mechanisms responsible for it (Guyon et al., 2002; Wang et al., 2005). Several
gene selections techniques have been employed in classification problems, such as t-
test filtering approach, as well as some artificial intelligence techniques such as
genetic algorithms (GAs), evolution algorithms (EAs) (Golub et al., 1999; Furey et al.,
2000; Xiong et al., 2001; Peng et al., 2003; Li et al., 2005; Tibshirani et al., 2002; Sima
and Dougherty, 2006), simulated annealing, tabu search and particle swarm
optimization. The particle swarm optimization (PSO) algorithm (Kennedy and Eberhart,
1995; Shi and Eberhart, 1998; Clerc and Kennedy, 2002) was proposed by James
Kennedy and R.C. Eberhart in 1995, motivated by the social behavior of organisms
such as bird flocking and fish schooling. Particle swarm optimization comprises a
very simple concept and can be implemented in a few lines of computer code. It
requires only a few parameters to adjust, and is computationally inexpensive in
terms of both memory requirements and speed. A modified discrete PSO algorithm
has been proposed in our previous study (Shen et al., 2004a,b, in press) to reduce
dimensionality, and it showed satisfactory performance. Although PSO has proved to be a
potent search technique for solving optimization problems, there are still many complex
situations where PSO tends to converge to local optima and does not perform
particularly well. Tabu search (TS) is a powerful optimization procedure that has
been successfully applied to a number of combinatorial optimization problems
Glover (1986). It has the ability to avoid convergence to local minima by employing a
flexible memory system. But the convergence speed of TS depends on the initial
solution and the parallelism of PSO population would help the TS find the promising
regions of the search space very quickly. In this paper, we develop a hybrid PSO and
TS (HPSOTS) approach for gene selection for tumor classification. The incorporation
of TS as a local improvement procedure enables the algorithm HPSOTS to overleap
local optima and show satisfactory performance. The formulation and corresponding
programming flow chart are presented in detail in the paper. To evaluate the
performance of HPSOTS, the proposed approach is applied to three publicly available
microarray datasets. Moreover, we compare the performance of HPSOTS on these
datasets to those of stepwise selection and of the pure TS and PSO algorithms. It has been
demonstrated that the HPSOTS is a useful tool for gene selection and mining high
dimension data.
1.2 Motivation:
PSO performs excellently in global search but not so well in local search;
meanwhile, TS performs excellently in local search but not so well in global search.
Therefore, in this thesis the two algorithms are combined so that the new hybrid
algorithm conducts both a global search and a local search in every iteration,
which significantly increases the probability of finding the optimal points.
However, to the best of the author's knowledge, TSPSO has not been used to cluster
text documents. In this study a document clustering algorithm based on TSPSO is
proposed.
1.3 Thesis Overview:
This thesis involves clustering documents into categories using optimization
algorithms. We start with a data matrix obtained from the text documents after
the preprocessing steps. In this data matrix, each row is a document vector and
each column holds the weight of a significant term. The data matrix is provided
as input to the AMOC algorithm to find the value of k, and the resulting k is
given to the PSO, TS and TSPSO algorithms to cluster the documents. The results
obtained from this process are compared using the obtained VRC values as well as
their time complexities.
1.4 Clustering
A general definition of clustering is stated by Brian Everitt et al. [6]: given a
number of objects or individuals, each of which is described by a set of numerical
measures, devise a classification scheme for grouping the objects into a number of
classes such that objects within classes are similar in some respect and unlike
those from other classes; the number of classes and the characteristics of each
class are to be determined. The clustering problem can be formalized as an
optimization problem, i.e. the minimization or maximization of a function subject
to a set of constraints. The goal of clustering can be defined as follows:
Given
I. a dataset X = {x1, x2, …. , xn}
II. the desired number of clusters k
III. a function f that evaluates the quality of clustering
we want to compute a mapping
γ : {1, 2, …, n} → {1, 2, …, k}
that minimizes the function f subject to some constraints. The function f that
evaluates the clustering quality is often defined in terms of similarity between
objects; it is also called the distortion function or divergence. The similarity
measure is the key input to a clustering algorithm.
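As a concrete instance of the function f, the within-cluster sum of squared distances to each cluster centroid can be evaluated for a given mapping γ. The toy dataset and the two candidate mappings below are invented for illustration.

```python
def within_cluster_ss(X, gamma, k):
    """f(gamma): total squared distance of each point to its cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [X[i] for i in range(len(X)) if gamma[i] == c]
        if not members:
            continue
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim))
                     for p in members)
    return total

# Two tight groups of points; gamma assigns each point a cluster index.
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
good = within_cluster_ss(X, [0, 0, 1, 1], k=2)   # matches the obvious grouping
bad = within_cluster_ss(X, [0, 1, 0, 1], k=2)    # splits each natural cluster
```

A clustering algorithm searches over mappings γ for one that minimizes this f; here the natural grouping yields a far smaller objective value than the shuffled one.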
1.5 Document clustering
Clustering of documents is used to group documents into relevant topics. The
major difficulty in document clustering is its high dimensionality, which calls
for efficient algorithms that can handle high-dimensional clustering. Document
clustering is a major topic in the information retrieval area; examples include
search engines. The basic steps used in the document clustering process are shown
in Figure 2.
The goal of a document clustering scheme is to minimize intra-cluster distances
between documents, while maximizing inter-cluster distances (using an appropriate
distance measure between documents). A distance measure (or, dually, similarity
measure) thus lies at the heart of document clustering. The large variety of
documents makes it almost impossible to create a general algorithm that works
best for all kinds of datasets.
Figure 2. Flow diagram representing the basic steps in text clustering
Preprocessing
Text document preprocessing basically consists of stripping all formatting from
the article, including capitalization, punctuation, and extraneous markup (like
the dateline and tags). Then the stop words are removed. Stop words (i.e.,
pronouns, prepositions, conjunctions, etc.) are words that do not carry semantic
meaning; they can be eliminated using a stop-word list. Eliminating stop words
greatly reduces the amount of noise in the text collection and makes the
computation easier, leaving a condensed version of the documents containing
content words only. The next step is stemming. Stemming is the process of
reducing derived words to their root form. For English documents, a well-known
algorithm called the Porter stemmer [7] is used; the performance of text
clustering can be improved by using it.
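The preprocessing steps above can be sketched as follows. The short stop-word list and the naive suffix stripping are simplified stand-ins for a full stop-word list and the Porter stemmer, used only to make the pipeline concrete.

```python
import re
import string

# A tiny illustrative stop-word list; real systems use a much longer one.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "for", "by"}

def naive_stem(word):
    """Crude suffix stripping standing in for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                    # strip capitalization
    text = re.sub(r"<[^>]+>", " ", text)                   # strip markup tags
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

tokens = preprocess("The <b>Clustered</b> documents are grouped by topics.")
```

After stripping formatting, removing stop words and stemming, only condensed content words remain, ready to be turned into a document vector.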
Document Representation
Preprocessing is done to represent the data in a form that can be used for
clustering. There are many ways of representing documents, such as the vector
space model and graphical models [11].
Vector Space Model
The Vector Space Model (VSM) is the simplest level of document representation
for clustering [18]. Given a document collection, every word present in the
collection is counted as a dimension. If there are d separate words in total,
each document is treated as a d-dimensional vector whose coordinate values are
the frequencies of appearance of the words in that document. Consequently, this
vector is very high-dimensional but extremely sparse, because a collection
normally contains so many documents that only a tiny portion of the words
actually belongs to an individual document.
This representation model treats words as independent entities, completely
ignoring the structural information inside documents, such as syntax and the
meaningful relationships between words or between sentences. Recently, many
efforts have been made to find better ways of representing text documents. As
mentioned, sparsity is a problem of the VSM: a document vector has so many
unrelated dimensions that they may hide its actual meaning. Researchers have
tried to make use of the semantic relatedness of words, or to find some sort of
concepts, instead of words, to represent documents. Nevertheless, the VSM's
simplicity facilitates fast computation while providing sufficient numerical and
statistical information; hence, it is the common model used in most clustering
algorithms nowadays.
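Building raw term-frequency vectors under the VSM can be sketched as follows; the toy documents are hypothetical, already preprocessed, and small enough to inspect by hand.

```python
def build_tf_vectors(docs):
    """Map each document to a d-dimensional term-frequency vector."""
    # One dimension per distinct word in the collection.
    vocab = sorted({w for doc in docs for w in doc.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)
        for w in doc.split():
            v[index[w]] += 1          # coordinate = frequency of the word
        vectors.append(v)
    return vocab, vectors

docs = ["cluster documents by topic",
        "cluster text documents",
        "topic models for text"]
vocab, vectors = build_tf_vectors(docs)
```

Even in this tiny collection each vector already has zero entries for the words it does not contain, which is the sparsity discussed above.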
The weights assigned to each term can be either the term frequency (tf) or the
term frequency-inverse document frequency (tf-idf). In the first case, the
frequency of occurrence of each term in a document is included in the vector
d_tf = (tf_1, tf_2, …, tf_m), where tf_i is the frequency of the i-th term in the
document. Usually, very common words are removed and the terms are stemmed. A
refinement of this weighting scheme is the so-called tf-idf weighting scheme. In
this approach, a term that appears in many documents should not be regarded as
more important than one that appears in few documents, and for this reason it is
de-emphasized.
Figure 2.3: Vector space model
Figure 2.3 illustrates the vector space model. After preprocessing the document
dataset, we have the list of words that occur across the documents. This word
list is then used as the set of dimensions for representing the documents as
vectors. Since documents share many words across the dataset, the dimensionality
is high. Figure 2.3 shows three terms (TERM1, TERM2, TERM3) common to three
documents (DOC1, DOC2, DOC3), so the three terms are taken as the dimensions of
the space, and the documents are drawn in this space as vectors.
Let N be the total number of documents in the collection; let df_i (document
frequency) be the number of documents in which the term k_i appears; and let
freq_{i,j} be the raw frequency of the term k_i in the document d_j.
The inverse document frequency (idf_i) for k_i is defined as:
idf_i = log(N / df_i) (2.1)
The tf-idf weight of term i in document j is computed by:
w_{ij} = freq_{i,j} × log(N / df_i) (2.2)
To account for documents of different lengths, each vector is normalized so that
it is of unit length.
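Equations (2.1) and (2.2), followed by the unit-length normalization, can be checked numerically with a small sketch; the term-document counts below are invented for illustration.

```python
import math

def tfidf_weights(freq):
    """freq[j][i]: raw count of term i in document j; returns tf-idf weights (2.2)."""
    N = len(freq)
    n_terms = len(freq[0])
    # df_i: number of documents containing term i.
    df = [sum(1 for j in range(N) if freq[j][i] > 0) for i in range(n_terms)]
    idf = [math.log(N / df[i]) for i in range(n_terms)]          # equation (2.1)
    w = [[freq[j][i] * idf[i] for i in range(n_terms)] for j in range(N)]
    # Normalise each document vector to unit length.
    for j in range(N):
        norm = math.sqrt(sum(x * x for x in w[j]))
        if norm > 0:
            w[j] = [x / norm for x in w[j]]
    return w

# Term 0 appears in all 4 documents, so its idf (and hence its weight) is zero.
freq = [[2, 0, 1], [1, 3, 0], [1, 0, 0], [3, 1, 2]]
w = tfidf_weights(freq)
```

The first term is de-emphasized to zero weight precisely because it appears in every document, which is the behavior the tf-idf scheme is designed to produce.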
The main advantages of the Vector Space Model (VSM) are:
♦ The documents are sorted by decreasing similarity with the query q.
♦ The terms are weighted by importance.
♦ It allows for partial matching: the documents need not have exactly the same
terms as the query.
One disadvantage of the VSM is that the terms are assumed to be independent.
Moreover, the weighting is intuitive and not very formal.
Dimension reduction techniques
Dimension reduction can be divided into feature selection and feature extraction.
Feature selection is the process of selecting a smaller subset of features from a
larger set of inputs, while feature extraction transforms the high-dimensional
data space into a space of low dimension. The goal of dimension reduction methods
is to allow fewer dimensions for broader comparisons of the concepts contained in
a text collection.
Similarity Measurement
Accurate clustering requires a precise definition of the closeness between a
pair of objects, in terms of either pairwise similarity or distance. Before
clustering, a similarity/distance measure must be determined. The measure reflects
the degree of closeness or separation of the target objects and should correspond to
the characteristics that are believed to distinguish the clusters embedded in the
data. In many cases, these characteristics are dependent on the data or the problem
context at hand, and there is no measure that is universally best for all kinds of
clustering problems.
Moreover, choosing an appropriate similarity measure is also crucial for
cluster analysis, especially for particular types of clustering algorithms. For example,
the density-based clustering algorithms, such as DBSCAN, rely heavily on the
similarity computation. Density-based clustering finds clusters as dense areas in the
data set, and the density of a given point is in turn estimated as the closeness of the
corresponding data object to its neighboring objects. Recalling that closeness is
quantified as the distance/similarity value, we can see that a large number of
distance/similarity computations are required for finding dense areas and
estimating the cluster assignment of new data objects. Therefore, understanding the effectiveness
of different measures is of great importance in helping to choose the best one.
In general, similarity/distance measures map the distance or similarity between
the symbolic descriptions of two objects into a single numeric value, which
depends on two factors: the properties of the two objects and the measure itself.
Four such measures [23] are discussed below.
Euclidean Distance
Euclidean distance is a popular measure used in data clustering. The distance
between two documents d_i and d_j is calculated as
D(d_i, d_j) = ( Σ_{t=1}^{m} (w_{t,i} − w_{t,j})² )^{1/2} (2.3)
where w_{t,i} is the weight of term t in document d_i. It is used in the
traditional k-means algorithm [2]. The objective of k-means is to minimize the
Euclidean distance between the objects of a cluster and that cluster's centroid:
min Σ_{k=1}^{K} Σ_{d_i ∈ C_k} D(d_i, c_k)² (2.4)
where c_k is the centroid of cluster C_k.
Cosine Similarity
When documents are represented as term vectors, the similarity of two documents
corresponds to the correlation between the vectors. This is quantified as the
cosine of the angle between the vectors, the so-called cosine similarity. Cosine
similarity is one of the most popular similarity measures applied to text
documents, used in numerous information retrieval applications [11] and in the
clustering toolkit of [13]. An important property of the cosine similarity is its
independence of document length. The similarity of two document vectors d_i and
d_j, Sim(d_i, d_j), is defined as the cosine of the angle between them. For unit
vectors, this equals their inner product:
Sim(d_i, d_j) = cos(d_i, d_j) = (d_i · d_j) / (|d_i| |d_j|) (2.5)
The cosine measure is used in a variant of k-means called spherical k-means [4].
While k-means aims to minimize the Euclidean distance, spherical k-means intends
to maximize the cosine similarity between the documents in a cluster and that
cluster's centroid:
max Σ_{k=1}^{K} Σ_{d_i ∈ C_k} (d_i · c_k) / (|d_i| |c_k|) (2.6)
Jaccard Coefficient
The Jaccard coefficient, which is sometimes referred to as the Tanimoto
coefficient, measures similarity as the intersection divided by the union of the
objects. For text documents, the Jaccard coefficient compares the summed weight
of shared terms to the summed weight of terms that are present in either of the
two documents but are not shared. Given non-unit document vectors d_i and d_j,
their Jaccard coefficient is:
Sim_J(d_i, d_j) = (d_i · d_j) / (|d_i|² + |d_j|² − d_i · d_j) (2.7)
Pearson Correlation Coefficient
Correlation clustering, introduced by Bansal, Blum and Chawla [9], provides a
method for clustering a set of objects into the best possible number of clusters
without specifying that number in advance. Correlation clustering does not
require a bound on the number of clusters into which the data is partitioned;
rather, as in [10], it divides the data into the optimal number of clusters based
on the similarity between the data points. In their paper [9], Bansal et al.
discuss two objectives of correlation clustering: minimizing disagreements and
maximizing agreements between clusters.
The normalized Pearson correlation is defined as:
Sim_P(d_i, d_j) = Σ_{t=1}^{m} (w_{t,i} − w̄_i)(w_{t,j} − w̄_j) / ( Σ_{t=1}^{m} (w_{t,i} − w̄_i)² × Σ_{t=1}^{m} (w_{t,j} − w̄_j)² )^{1/2} (2.8)
where w̄_x denotes the average feature value of x over all dimensions.
In [20], Strehl et al. compared four measures: Euclidean, cosine, Pearson
correlation and extended Jaccard, and concluded that cosine and extended Jaccard
are the best ones for web documents.
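The four measures above can be sketched for a pair of term-weight vectors as follows. The example vectors are hypothetical, and d_j is deliberately a scaled copy of d_i to exhibit the length-independence of the cosine measure.

```python
import math

def euclidean(a, b):
    """Equation (2.3): straight-line distance between weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Equation (2.5): cosine of the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def jaccard(a, b):
    """Equation (2.7): shared weight over combined (non-shared) weight."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

def pearson(a, b):
    """Equation (2.8): correlation of the mean-centred weight vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den

di = [1.0, 2.0, 0.0, 1.0]
dj = [2.0, 4.0, 0.0, 2.0]   # dj is di scaled by 2
```

Because d_j only differs from d_i in length, the cosine and Pearson measures report perfect similarity while the Euclidean distance is non-zero, illustrating why the choice of measure matters.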
1.6 Clustering Applications
Clustering is the most common form of unsupervised learning and is a major tool
in a number of applications in many fields of business and science. Here we
summarize the basic directions in which clustering is used.
• Finding Similar Documents This feature is often used when the user has spotted
one “good” document in a search result and wants more-like-this. The interesting
property here is that clustering is able to discover documents that are conceptually
alike in contrast to search-based approaches that are only able to discover whether
the documents share many of the same words.
• Organizing Large Document Collections Document retrieval focuses on finding
documents relevant to a particular query, but it fails to solve the problem of making
sense of a large number of uncategorized documents. The challenge here is to
organize these documents in a taxonomy identical to the one humans would create
given enough time and use it as a browsing interface to the original collection of
documents.
• Duplicate Content Detection In many applications there is a need to find
duplicates or near-duplicates in a large number of documents. Clustering is
employed for plagiarism detection, grouping of related news stories and to
reorder search result rankings (to ensure higher diversity among the topmost
documents). Note that in such applications the description of clusters is rarely
needed.
• Recommendation System In this application a user is recommended articles based
on the articles the user has already read. Clustering of the articles makes this
possible in real time and substantially improves recommendation quality.
• Search Optimization Clustering improves the quality and efficiency of search
engines, since a user query can first be compared to the clusters instead of
directly to the documents, and the search results can then be arranged more
easily.
1.7 Challenges in Document Clustering
Document clustering has been studied for decades, but it is still far from
being a trivial or solved problem. The challenges are:
1. Selecting appropriate features of the documents that should be used for
clustering.
2. Selecting an appropriate similarity measure between documents.
3. Selecting an appropriate clustering method utilizing the above similarity measure.
4. Implementing the clustering algorithm in an efficient way that makes it feasible in
terms of required memory and CPU resources.
5. Finding ways of assessing the quality of the performed clustering.
Furthermore, with medium to large document collections (10,000+ documents), the
number of term-document relations is fairly high (millions+), and the computational
complexity of the applied algorithm is thus a central factor in whether it is feasible
for real-life applications. If a dense matrix is constructed to represent term-document
relations, this matrix could easily become too large to keep in memory; e.g. 100,000
documents × 100,000 terms = 10^10 entries ≈ 40 GB using 32-bit floating point
values. If the vector model is applied, the dimensionality of the resulting vector
space will likewise be quite high (10,000+). This means that simple operations, like
finding the Euclidean distance between two documents in the vector space, become
time consuming tasks.
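The 40 GB figure can be checked in a couple of lines, using the collection size from the text:

```python
docs, terms = 100_000, 100_000
bytes_per_entry = 4  # 32-bit floating point

total_bytes = docs * terms * bytes_per_entry
print(total_bytes)  # 40000000000 bytes, i.e. 40 GB
```

In practice such term-document matrices are extremely sparse, which is why sparse representations are used instead of a dense matrix.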
PARTITIONAL CLUSTERING
Partitional clustering algorithms aim for maximum similarity within clusters
and minimum similarity between clusters. The most popular partition-based
clustering algorithm is the K-means algorithm, because of its easy implementation
and simplicity. Its main drawback, however, is that the value of K is difficult to
predict. To overcome this drawback we use Automatic Merging of Optimal Clusters
(AMOC). The aim of AMOC is to automatically generate optimal clusters for the given
datasets. AMOC is an extension of k-means with a two-phase iterative procedure that
merges clusters and validates the merging result, in order to find the optimal
clusters by automatically combining clusters.
Let X = {X1, X2, …, Xm} be a set of m objects, where every individual object Xi is
represented as [xi1, xi2, …, xin] and n is the number of attributes. The algorithm
takes kmax as the upper bound of the number of clusters. It iteratively merges the
cluster having the lowest probability with its nearest cluster and validates the
merging result using the Rand index.
Steps:
1. Initialize kmax to the square root of the total number of objects.
2. Randomly assign objects to the kmax cluster centroids.
3. Find the clusters using k-means.
4. Calculate the intra-cluster distance.
5. Find the cluster that has minimal probability and combine it with its closest
cluster. Recalculate the centroids and decrease the number of clusters by one.
6. If step 5 has been executed for every cluster, go to step 7; otherwise go to
step 5.
7. If there is no change in the number of clusters, stop; otherwise go to
step 2.
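The merge loop above can be sketched compactly. This is an illustration, not the full algorithm: it assumes a plain NumPy k-means, uses cluster size as the cluster "probability", omits the Rand-index validation step, and simply merges down until two clusters remain; all names are hypothetical:

```python
import numpy as np

def kmeans(X, centroids, iters=20):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def amoc_sketch(X, seed=0):
    rng = np.random.default_rng(seed)
    k = max(2, int(np.sqrt(len(X))))                            # step 1: kmax = sqrt(m)
    centroids = X[rng.choice(len(X), k, replace=False)].copy()  # step 2: random centroids
    while k > 2:
        centroids, labels = kmeans(X, centroids)                # step 3
        sizes = np.array([(labels == j).sum() for j in range(k)])
        small = int(sizes.argmin())                             # step 5: lowest-"probability" cluster
        others = [j for j in range(k) if j != small]
        dists = [np.linalg.norm(centroids[small] - centroids[j]) for j in others]
        nearest = others[int(np.argmin(dists))]
        total = sizes[small] + sizes[nearest]
        if total > 0:                                           # merge into nearest neighbour
            centroids[nearest] = (sizes[small] * centroids[small]
                                  + sizes[nearest] * centroids[nearest]) / total
        centroids = np.delete(centroids, small, axis=0)
        k -= 1                                                  # one cluster fewer
    return kmeans(X, centroids)

# Two well-separated toy blobs (illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
centroids, labels = amoc_sketch(X)
```

In the full AMOC, the Rand index decides whether a merge is kept, so the loop can stop before reaching two clusters.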
Criterion Function
The most frequently used partitional clustering criterion is the Variance
Ratio Criterion (VRC). It is formulated as:
VRC = (B / W) * ((n - k) / (k - 1))   (1)
Here B and W denote the between-cluster and within-cluster variations,
respectively. They are defined as:
W = Σj Σi (oij − ōj)T (oij − ōj)   (2)
B = Σj nj (ōj − ō)T (ōj − ō)   (3)
where nj denotes the cardinality of cluster cj, oij denotes the ith object assigned to
cluster cj, ō denotes the n-dimensional vector of overall sample means (the data
centroid), and ōj denotes the n-dimensional vector of sample means within the jth
cluster (the cluster centroid). The between-cluster variation has (k - 1) degrees of
freedom and the within-cluster variation has (n - k) degrees of freedom.
As a consequence, compact and well-separated clusters are expected to have small
values of W and large values of B. Hence, the better the data partition, the greater
the value of VRC. The normalization term (n - k)/(k - 1) prevents the ratio from
increasing monotonically with the number of clusters, making VRC an optimization
(maximization) criterion.
PSO
Particle swarm optimization (PSO) is a computational method that optimizes
a problem by iteratively trying to improve a candidate solution with regard to a
given measure of quality. PSO maintains a population of candidate solutions
(particles) and moves these particles around the search space according to simple
mathematical formulae over the particles' positions and velocities.
vid = w*vid + c1*rand1*(pid - xid) + c2*rand2*(pgd - xid)   (4)
xid = xid + vid   (5)
where w is the inertia weight factor; pid is the position at which particle i achieved
its best (local best) value; pgd is the position at which the overall (global) best
value was achieved; c1 and c2 are constants called acceleration coefficients; d is the
dimension of the search domain; and rand1, rand2 are random values uniformly
distributed in the interval [0, 1].
Each particle's movement is influenced by its local best position and is also
guided toward the best known positions in the search space, which are updated as
better positions are found by other particles. This moves the swarm toward the
best positions. A step-by-step overview of the PSO clustering algorithm is given
below:
Step 1: Initialize the population randomly.
Step 2: Perform the following for each particle:
(a) Update the particle's velocity and position using equations (4) and (5)
to generate the next solution.
(b) Compute the fitness value using the fitness function (1).
Step 3: Repeat step 2 until one of the conditions below is fulfilled:
(a) The number of iterations performed reaches the maximum.
(b) The average change in fitness values is negligible.
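The loop above can be sketched with equations (4) and (5) as the update rule. For illustration the fitness here is a simple sphere function rather than the clustering criterion, and the swarm size and coefficient values are illustrative choices:

```python
import numpy as np

def pso(fitness, dim=2, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO minimizing `fitness`; returns the global best position and value."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))   # positions
    v = np.zeros((n_particles, dim))             # velocities
    pbest = x.copy()                             # per-particle best positions
    pbest_f = np.array([fitness(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()           # global best
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)   # eq. (4)
        x = x + v                                               # eq. (5)
        f = np.array([fitness(p) for p in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        g = pbest[pbest_f.argmin()].copy()
    return g, pbest_f.min()

best, best_f = pso(lambda p: float((p ** 2).sum()))
```

On the sphere function the swarm typically contracts toward the origin within a few dozen iterations, illustrating the gbest/pbest attraction in equation (4).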
Tabu search
Fred Glover proposed an approach in 1986, called Tabu Search (TS), which
allows Local Search (LS) methods to overcome local optima. The main concept
of TS is to continue the local search whenever it reaches a local optimum by
allowing non-improving moves. What distinguishes tabu search from other
metaheuristic approaches is the notion of the tabu list: a record of previously
visited solutions, including disallowed moves. Because a short-term memory is
used, the list stores only some attributes of solutions instead of whole solutions,
and revisiting recent solutions is not permitted.
Steps:
Step 1: Create an initial solution x.
Step 2: Initialize the tabu list.
Step 3: While the set X′ of candidate solutions is not complete:
Step 3.1: Compute a candidate solution x′ from the present solution x.
Step 3.2: Add x′ to X′ if x′ is not tabu, or if at least one aspiration criterion is satisfied.
Step 4: Select the best candidate solution x* in X′.
Step 5: If fitness(x) < fitness(x*), then x = x*.
Step 6: Update the tabu list.
Step 7: If the termination criterion is reached, finish; otherwise go to step 3.
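A compact sketch of these steps on a toy maximization problem: maximizing the number of ones in a bit string, with single bit flips as moves and recently flipped positions as the tabu attributes. The problem, the tenure, and the always-move-to-best-candidate policy are illustrative choices, not prescribed by the text:

```python
from collections import deque
import random

def tabu_search(n_bits=16, iters=50, tenure=5, seed=0):
    rng = random.Random(seed)
    fitness = lambda s: sum(s)                      # toy objective: count of ones
    x = [rng.randint(0, 1) for _ in range(n_bits)]  # step 1: initial solution
    best, best_f = x[:], fitness(x)
    tabu = deque(maxlen=tenure)                     # step 2: short-term memory
    for _ in range(iters):
        candidates = []
        for i in range(n_bits):                     # step 3: neighbourhood = bit flips
            neighbour = x[:]
            neighbour[i] ^= 1
            f = fitness(neighbour)
            # step 3.2: admit if not tabu, or if aspiration (new best) holds
            if i not in tabu or f > best_f:
                candidates.append((f, i, neighbour))
        if not candidates:
            break
        f, i, x = max(candidates)                   # step 4: best admissible candidate
        if f > best_f:                              # step 5: record improvement
            best, best_f = x[:], f
        tabu.append(i)                              # step 6: update tabu list
    return best, best_f

best, best_f = tabu_search()
print(best_f)  # 16 (the all-ones optimum)
```

The tabu memory is what lets the search walk downhill out of a local optimum without immediately undoing the move that got it there.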
TSPSO:
In this section we introduce the TSPSO algorithm, which combines the
PSO technique with TS. Particle swarm optimization (PSO) is a computational
method used to optimize a result by iteratively trying to improve a candidate
solution with regard to a given measure of quality. It is a metaheuristic method:
it makes few or no assumptions about the problem being optimized and can
search very large spaces of candidate solutions. PSO does not use the gradient
of the problem being optimized, which means it does not require the optimization
problem to be differentiable, as is required by classic optimization methods such
as quasi-Newton methods and gradient descent. PSO can therefore also be used
on optimization problems that are noisy, asymmetric, time-varying, and so on.
However, PSO suffers from two weaknesses: (i) it is easily confined to local
minima; (ii) it costs too much time to converge, especially in a complex
high-dimensional space. When an optimal solution is found, all the particles are
situated at the same local minimum; after that, it is nearly impossible for the
particles to move and search further because of the velocity update equation.
To overcome the aforementioned problems, we propose a hybrid approach that
combines PSO with Tabu Search (TS), a local search technique. By combining
TS and PSO we exploit the search abilities of both algorithms and avoid the
flaws of each.
The flow chart of TSPSO is shown in Fig. 2.
The TSPSO steps are listed below:
Step 1: Initialize the population randomly.
Step 2: Compute the fitness function (1) for each particle.
Step 3: Randomly divide the population into two halves:
(a) One half of the population is updated by PSO, i.e. the position and
velocity of each particle are updated.
(b) The other half of the population is updated by TS, which searches for the
local best solution for each particle.
Step 4: Merge the two halves of the population, and update the "pbest" and
"gbest" particles and the tabu list (TL).
Step 5: Iterate steps 2-4 until the termination condition is reached.
Step 6: Output the result.
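The split-and-merge loop above can be sketched schematically. This is an illustration under simplifying assumptions: the objective is a simple sphere function rather than the clustering criterion, and the TS half is reduced to a best-of-a-few-perturbations local search without the full tabu machinery; all names and parameter values are illustrative:

```python
import numpy as np

def tspso(fitness, dim=2, n=20, iters=60, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n, dim))               # step 1: random population
    v = np.zeros((n, dim))
    pbest = x.copy()
    pbest_f = np.array([fitness(p) for p in x])    # step 2: fitness
    g = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        half = rng.permutation(n)                  # step 3: random split
        pso_idx, ts_idx = half[: n // 2], half[n // 2:]
        # (a) PSO half: velocity/position update, equations (4)-(5)
        r1 = rng.random((len(pso_idx), dim))
        r2 = rng.random((len(pso_idx), dim))
        v[pso_idx] = (w * v[pso_idx]
                      + c1 * r1 * (pbest[pso_idx] - x[pso_idx])
                      + c2 * r2 * (g - x[pso_idx]))
        x[pso_idx] = x[pso_idx] + v[pso_idx]
        # (b) TS half (simplified): keep the best of a few local perturbations
        for i in ts_idx:
            moves = x[i] + rng.normal(0, 0.3, (5, dim))
            f = np.array([fitness(m) for m in moves])
            if f.min() < fitness(x[i]):
                x[i] = moves[f.argmin()]
        # step 4: merge and update pbest / gbest
        f = np.array([fitness(p) for p in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        g = pbest[pbest_f.argmin()].copy()
    return g, pbest_f.min()                        # step 6: output

best, best_f = tspso(lambda p: float((p ** 2).sum()))
```

The intent of the split is that the PSO half supplies global exploration while the local-search half refines solutions, so the merged population inherits both behaviours.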
2. Literature Reviews
Tabu-KM: A Hybrid Clustering Algorithm Based on Tabu Search Approach
Abstract
The clustering problem under the criterion of minimum sum of squares is a non-convex and non-linear program that possesses many locally optimal values, so that its solution often falls into these traps and therefore cannot converge to the globally optimal solution. In this paper, an efficient hybrid optimization algorithm called Tabu-KM is developed for solving this problem. It brings together the optimization property of tabu search and the local search capability of the k-means algorithm. The contribution of the proposed algorithm is to produce a tabu space for escaping from the trap of local optima and finding better solutions effectively. The Tabu-KM algorithm is tested on several simulated and standard datasets and its performance is compared with the k-means, simulated annealing, tabu search, genetic algorithm, and ant colony optimization algorithms. The experimental results on simulated and standard test problems demonstrate the robustness and efficiency of the algorithm and confirm that the proposed method is a suitable choice for solving data clustering problems.
Introduction
Clustering is an important process in engineering and other fields of scientific research. It is the process of grouping patterns into a number of clusters, each of which contains the patterns that are similar to each other according to a specified similarity measure. Clustering is a sequential process, which takes data as a raw material and produces clusters as a result without any predetermined goal [16]. To analyze the clusters, the objects are represented by points in N-dimensional space, where the components of each vector are the values of the object's attributes, and the objective is to classify these points into K clusters such that a certain similarity measure is optimized. We consider the clustering problem stated as follows: given N objects in an n-dimensional space, allocate each object to one of K clusters such that the sum of squared Euclidean distances between each object and the center of its cluster is minimized. (Corresponding author: M. Yaghini, yaghini@iust.ac.ir. Paper first received April 07, 2010, and in revised form June 24, 2010.)
The clustering problem can be mathematically described as follows:
Min F(W, C) = Σ(i=1..N) Σ(j=1..K) wij ||xi − cj||²   (1)
where Σ(j=1..K) wij = 1 for i = 1,…,N. If object xi is allocated to cluster Cj, then wij equals 1; otherwise wij equals 0. In equation 1, N denotes the number of objects, K denotes the number of clusters, X = {x1, x2,…, xN} denotes the set of N objects, C = {c1,…,cK} denotes the set of K clusters, and W denotes the 0-1 assignment matrix. The cluster center cj is calculated as follows:
cj = (1/nj) Σ(xi ∈ cj) xi,   j = 1,…,K   (2)
where nj denotes the number of objects belonging to cluster cj. It is known that this clustering problem is non-convex and non-linear, possesses many locally optimal values, and its solution therefore often falls into these traps [24]. The k-means algorithm is one of the popular center-based algorithms [18], which has been proved to converge only to a local minimum under certain conditions. The criterion it uses minimizes the total mean squared distance from each point to that point's closest center. However, there are two main problems with the k-means method [21], [19]. First, the algorithm depends on the initial states and the value of K. Second, it easily converges to local optima that may be much worse than the desired globally optimal solution. In this paper, a new efficient algorithm is designed and implemented based on the tabu search approach for escaping from local optima. The key idea of the proposed algorithm is to produce a tabu space and select new cluster centers from objects not in the tabu space. (Keywords: clustering problem, hybrid algorithm, tabu search algorithm, k-means algorithm. International Journal of Industrial Engineering & Production Research, September 2010, Volume 21, Number 2, pp. 71-79, ISSN 2008-4889; M. Yaghini & N. Ghazanfari.)
Then the k-means algorithm is run as a local search. This paper is organized as follows: the tabu search approach for clustering and related works are reviewed in section 2. In section 3, we propose the Tabu-KM algorithm and give detailed descriptions. Section 4 presents experimental results with simulated and standard datasets that show our method outperforms some other methods. Finally, conclusions of the current work are reported in section 5.
Conclusion
An effective hybrid clustering algorithm based on tabu search approach called TabuKM is developed by integrating the tabu space and move generator for restricting objects to select as center of cluster. Tabu-KM
algorithm is used to escape from the trap of local optima and to find better solutions in the clustering problem under the criterion of minimum sum of squares. To produce the tabu space, two strategies are investigated: a spherical space around the center of the cluster with a fixed or dynamic radius. In addition, three different strategies are discussed for selecting objects as the center of a new cluster and generating a feasible solution: (1) move to the object closest to the center of the initial k-means cluster, (2) move to the object closest to the center of the current cluster, (3) move to the object closest to the center of the best-so-far cluster. All the above-mentioned strategies were investigated. According to the results, the dynamic
A Survey on K-mean Clustering and Particle Swarm Optimization
Abstract
In data mining, clustering is an important research topic with a wide range of unsupervised classification applications. Clustering is a technique which divides data into meaningful groups. K-means is one of the popular clustering algorithms; it is widely used to minimize the squared distance between the feature values of two points residing in the same cluster. Particle swarm optimization is an evolutionary computation technique which finds optimum solutions in many applications. Using PSO, the clustering result can be optimized in order to obtain a more precise clustering. In this paper, we present a comparison of K-means clustering and particle swarm optimization.
Introduction
Clustering is a technique which divides data objects into groups based on the information found in data that describes the objects and relationships among them, their feature values which can be used in many applications, such as knowledge discovery, vector quantization, pattern recognition, data mining, data dredging and etc. [1] There are mainly two techniques for clustering: hierarchical clustering and partitioned clustering. Data are not partitioned into a particular cluster in a single step, but a series of partitions takes place in hierarchical clustering, which may run from a single cluster containing all objects to n clusters each containing a single object. And each cluster can have sub clusters, so it can be viewed as a tree, a node in the tree is a cluster, the
root of the tree is the cluster containing all the objects, and each node, except the leaf nodes, is the union of its children. But in partitioned clustering, the algorithms typically determine all clusters at once, it divides the set of data objects into non-overlapping clusters, and each data object is in exactly one cluster. Particle swarm optimization (PSO) has gained much attention, and it has been applied in many fields [2]. PSO is a useful stochastic optimization algorithm based on population. The birds in a flock are represented as particles, and particles are considered as simple agents flying through a problem area. And in the multi-dimensional problem space, the particle’s location can represent the solution for the problem. But the PSO may lack global search ability at the end of a run due to the utilization of a linearly decreasing inertia weight and PSO may fail to find the required optima when the problem to be solved is too complicated and complex. K-means is the most widely used and studied clustering algorithm. Given a set of n data points in real d-dimensional space (Rd), and an integer k, the clustering problem is to determine a set of k points in Rd, the set of points is called cluster centres, the set of n data points are divided into k groups based on the distance between them and cluster centres. K means algorithm is flexible and simple. But it has some limitation, the cluster result mainly depends on the selection of initial cluster centroids and it may converge to the local optima [3]. However, the same initial cluster centre in a data space can always generate the same cluster results, if a good cluster centre can always be obtained, the K-means will work well.
Conclusion
From the study of K-means clustering and particle swarm optimization, we can say that K-means depends on the initial conditions, which may cause the algorithm to converge to suboptimal solutions. On the other hand, particle swarm optimization is less sensitive to initial conditions due to its population-based nature, and is therefore more likely to find a near-optimal solution.
Cluster Analysis by Variance Ratio Criterion and Firefly Algorithm
Abstract
In order to solve the cluster analysis problem more efficiently, we presented a new approach based on firefly algorithm (FA). First, we created the optimization model using the variance ratio criterion (VRC)
as the fitness function. Second, FA was introduced to find the maximal point of the VRC. The experimental dataset contains 400 data points in 4 groups with three different levels of overlapping degree: non-overlapping, partially overlapping, and severely overlapping. We compared the FA with the genetic algorithm (GA) and combinatorial particle swarm optimization (CPSO). Each algorithm was run 20 times. The results show that FA found the largest VRC values among all three algorithms, while costing the least time. Therefore, FA is effective and rapid for the cluster analysis problem.
Introduction
Cluster analysis is the assignment of a set of observations into subsets without any a priori knowledge, so that observations in the same cluster are more similar to each other than to those in other clusters [1]. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields [2], including machine learning [3], data mining [4], pattern recognition [5], image analysis [6], medical image classification [7], and bioinformatics [8]. Cluster analysis can be achieved by various algorithms that differ significantly. Those methods can be basically classified into four categories: I. Hierarchical Methods. They find successive clusters using previously established clusters. They can be further divided into agglomerative methods and divisive methods [9]. Agglomerative algorithms start with one-point clusters and recursively merge the two or more most appropriate clusters [10]. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters [11]. II. Partition Methods. They generate a single partition of the data with a specified or estimated number of non-overlapping clusters, in an attempt to recover natural groups present in the data [12]. III. Density-based Methods. They are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN [13] is the typical algorithm of this kind. IV. Subspace Methods. They look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods can thus ignore irrelevant attributes [14]. In this study, we focus our attention on partition clustering methods. The K-means clustering [15] and the fuzzy c-means clustering (FCM) [16] are two typical algorithms of this type.
They are iterative algorithms and the solution obtained depends on the selection of the initial partition and may converge to a local minimum of criterion function value if the initial partition is not properly chosen [17].
Branch and bound algorithm was proposed to find the global optimum clustering. However, it takes too much computation time [18]. In the last decade, evolutionary algorithms were proposed to clustering problem since they are not sensitive to initial values and able to jump out of local minimal point. For example, Lin et al. [19] Cluster Analysis by Variance Ratio Criterion and Firefly Algorithm Yudong Zhang, Dayong Li International Journal of Digital Content Technology and its Applications(JDCTA) Volume7,Number3,February 2013 doi:10.4156/jdcta.vol7.issue3.84 689 pointed out that k-Anonymity has been widely adopted as a model for protecting public released microdata from individual identification. Their work proposed a novel genetic algorithm-based clustering approach for k-anonymization. Their proposed approach adopted various heuristics to select genes for crossover operations. Experimental results showed that their approach can further reduce the information loss caused by traditional clustering-based k-anonymization techniques. Chang et al. [20] proposed a new clustering algorithm based on genetic algorithm (GA) with gene rearrangement (GAGR), which in application may effectively remove the degeneracy for the purpose of a more efficient search. They used a new crossover operator that exploited a measure of similarity between chromosomes in a population. They also employed adaptive probabilities of crossover and mutation to prevent the convergence of the GAGR to a local optimum. Using the real-world data sets, they compared the performance of GAGR clustering algorithm with K-means algorithm and other GA methods. Their experiment results demonstrated that the GAGR clustering algorithm had high performance, effectiveness and flexibility. Agard et al. [21] pointed out defining an efficient bill of materials for a family of complex products was a real challenge for companies, largely because of the diversity they offered to consumers. 
Their solution is to define a set of components (called modules), each of which contains a set of primary functions. An individual product is then built by combining selected modules. The industrial problem leads, in turn, to a complex optimization problem. They solved the problem via a simulated annealing method based on a clustering approach. Jarboui et al. [12] presented a new clustering approach based on the combinatorial particle swarm optimization (CPSO) algorithm. Each particle was represented as a string of length n (where n is the number of data points), and the ith element of the string denoted the group number assigned to object i. An integer vector corresponded to a candidate solution to the clustering problem. A swarm of particles was initiated and flew through the solution space,
targeting the optimal solution. To verify the efficiency of the proposed CPSO algorithm, comparisons with a genetic algorithm were performed. Computational results showed that their proposed CPSO algorithm was very competitive and outperformed the genetic algorithm. Niknam et al. [22] considered that the k-means algorithm depends highly on the initial state and converges to a local optimum solution. Therefore, they presented a new hybrid evolutionary algorithm to solve the nonlinear partitional clustering problem. Their proposed hybrid evolutionary algorithm was the combination of the FAPSO (fuzzy adaptive particle swarm optimization), ACO (ant colony optimization) and k-means algorithms, called FAPSO-ACO-K, which can find better cluster partitions. The performance of their proposed algorithm was evaluated through several benchmark data sets. Their simulation results showed that the performance of the proposed algorithm was better than other algorithms such as PSO, ACO, simulated annealing (SA), the combination of PSO and SA (PSO-SA), the combination of ACO and SA (ACO-SA), the combination of PSO and ACO (PSO-ACO), the genetic algorithm (GA), tabu search (TS), honey bee mating optimization (HBMO) and k-means for the partitional clustering problem. However, the aforementioned algorithms did not yet perform ideally. They sometimes converged too slowly, or even converged to local minimum points, leading to a wrong solution. Recently, the firefly algorithm (FA) has become a hot nature-inspired technique and has been used for solving nonlinear multimodal optimization problems in dynamic environments [23]. The algorithm is based on the behavior of fireflies. In social insect colonies, each firefly seems to have its own plans, and yet the group acts as a whole and appears to be highly organized. Scholars have published an immense literature reporting that its performance, effectiveness, and robustness are superior to GA, PSO, and other global algorithms in a wide range of fields [23, 24].
The structure of the rest of this paper was organized as follows. Next section 2 defined the partitional problem, and gave the encoding strategy and clustering criterion. Section 3 introduced the firefly algorithm. Experiments in section 4 contained three types of artificial data with different overlapping degree. Final section 5 was devoted to conclusions and future work.
Conclusion
We first investigated the optimization model, including both the encoding strategy and the VRC criterion function. Afterwards, an FA algorithm was introduced for solving the model. Experiments on three types of
artificial data with different overlapping degrees all demonstrate that the FA is more robust and costs less time than either GA or CPSO. Future work contains the following points: 1) develop a method that can determine the number of clusters automatically; 2) use more benchmark data to test the FA; 3) apply our FA to practical clustering problems, including mathematics [30], face estimation [31], image segmentation [32], image registration [33], image classification [34], UCAV path planning [35], and prediction [36].
Document Clustering: The Next Frontier
Introduction
The proliferation of documents, on both the Web and in private systems, makes knowledge discovery in document collections arduous. Clustering has long been recognized as a useful tool for the task. It groups like items together, maximizing intra-cluster similarity and inter-cluster distance. Clustering can provide insight into the make-up of a document collection and is often used as the initial step in data analysis. While most document clustering research to date has focused on moderate-length single-topic documents, real-life collections are often made up of very short or long documents. Short documents do not contain enough text to accurately compute similarities. Long documents often span multiple topics that general document similarity measures do not take into account. In this paper we will first give an overview of general purpose document clustering, and then focus on recent advancements in the next frontier in document clustering: long and short documents.
Conclusion
This chapter primarily focused on reviewing some recently developed text clustering methods that are specifically suited for long and for short document collections. These types of document collections introduce new sets of challenges. Long documents are by their nature multi-topic and as such the underlying document clustering methods must explicitly focus on modeling and/or accounting for these topics. On the other hand, short documents often contain domain-specific vocabulary, are very noisy, and their proper modeling/understanding often requires the incorporation of external information. We strongly believe research in clustering long and short documents is in its early stages and many new
methods will be developed in the years to come. Moreover, many real datasets are not only composed of standard, long, or short documents, but rather documents of mixed length. Current scholarship lacks studies on these types of data. Since different methods are often used for clustering standard, long, or short documents, new methods or frameworks should be investigated that address mixed collections. Traditional document clustering is also faced with new challenges. Today's very large, high-dimensional document collections often lead to multiple valid clustering solutions. Subspace/projective clustering approaches [67], [82] have been used to cope with high dimensionality when performing the clustering task. Ensemble clustering [40] and multiview/alternative clustering approaches [58], [91], which aim to summarize or detect different clustering solutions, have been used to manage the availability of multiple, possibly alternative clusterings for a given dataset. Relatively little work has been done so far in document clustering research to take advantage of lessons learned from these methods. Integrating subspace/ensemble/multi-view clustering with topic models or segmentation may lead to developing the next-generation clustering methods specialized for the document domain. Some topics that we have only briefly touched on in this article are further detailed in other chapters of this book. Other topics related to clustering documents, such as semisupervised clustering, stream document clustering, parallel clustering algorithms, and kernel methods for dimensionality reduction or clustering, were left for further study. Interested readers may consult document clustering surveys by Aggarwal and Zhai [3], Andrews and Fox [9], and Steinbach et al.
Discrete PSO with GA Operators for Document Clustering
Abstract
The paper presents a Discrete PSO algorithm for document clustering problems. This algorithm is a hybrid of PSO with GA operators. The proposed system is based on a population-based heuristic search technique, which can be used to solve combinatorial optimization problems, modeled on the concepts of cultural and social rules derived from the analysis of swarm intelligence (PSO), with GA operators such as crossover and mutation. In standard PSO, the non-oscillatory route can quickly cause a particle to stagnate, and it may also prematurely converge on suboptimal solutions that are not even guaranteed to be locally optimal. In this paper a modification strategy is proposed for the particle swarm optimization (PSO) algorithm and applied to the document corpus. The strategy adds reproduction, using crossover and mutation operators, when stagnation in the movement of a particle is identified. Reproduction has the capability to achieve faster convergence and better solutions. Experimental results are examined on the document corpus. They demonstrate that the proposed DPSO algorithm statistically outperforms the simple PSO.
Introduction
Document clustering is the automatic grouping of text documents into clusters, so that documents within a cluster have high similarity to one another but are dissimilar to documents in other clusters. Unlike document classification [22], no labeled documents are provided in clustering; hence, clustering is also known as unsupervised learning. Document clustering is widely applicable in areas such as search engines, web mining, information retrieval and topological analysis. It has become an increasingly important task in analyzing the huge numbers of documents distributed among various sites. The challenge is to analyze this enormous number of extremely high-dimensional distributed documents and to organize them in a way that enables better search and knowledge extraction without introducing much extra cost and complexity. Clustering, in data mining, is useful for discovering distribution patterns in the underlying data. K-means and its variants [14][15] represent the category of partitioning clustering algorithms that create a flat, non-hierarchical clustering consisting of k clusters. The K-means algorithm iteratively refines a randomly chosen set of k initial centroids, minimizing the average distance (i.e., maximizing the similarity) of documents to their closest (most similar) centroid. A common document clustering method [1][19] first calculates the similarities between all pairs of documents and then clusters documents together if their similarity is above a given threshold. The common clustering techniques are partitioning and hierarchical [11]; most document clustering algorithms can be classified into these two groups. In this study, a document clustering algorithm based on DPSO is proposed. The remainder of this paper is organized as follows: Section II covers related work on document clustering using PSO. Section III gives an overview of PSO.
The DPSO with GA operators clustering algorithm is described in Section IV. Section V presents the detailed experimental setup and results comparing the performance of the proposed algorithm with the standard PSO (SPSO) and K-means approaches.
Conclusion
The proposed system uses the vector space model for document representation. The corpora contain 1460 documents (CISI), 1400 (Cranfield) and 82 (ADI). Each particle in the swarm is represented by 2942 dimensions. The advantages of PSO are that it has very few parameters to deal with and a large number of processing elements (the dimensions), which enable it to fly around the solution space effectively. On the other hand, it converges to a solution very quickly, which must be dealt with carefully when using it for combinatorial optimization problems. In this study, the proposed DPSO with GA operators algorithm, developed for the much more complex, NP-hard document clustering problem, is verified on the document corpus. It is shown that it increases the performance of the clustering, and the best results are derived from the proposed technique. Consequently, the proposed technique markedly increased the success of the document clustering. The main objective of the paper is to improve the fitness value of the problem. The fitness value achieved by the standard PSO is low, since its stagnation causes premature convergence. This can be handled by the DPSO with the crossover and mutation operators of the Genetic Algorithm, which try to avoid the stagnation behavior of the particles. The proposed system does not always avoid the stagnation behavior of the particles, but when it does, that escape is the source of the improvement in the particles' positions.
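The vector space model mentioned above can be illustrated with a short, self-contained Java sketch. Everything here (the toy three-document corpus, raw-count term frequency, logarithmic IDF, cosine similarity) is a generic textbook formulation for illustration only, not the preprocessing actually used in this project:

```java
import java.util.*;

public class TfIdfSketch {

    // Term frequency of each word in one document.
    static Map<String, Integer> tf(String[] doc) {
        Map<String, Integer> m = new HashMap<>();
        for (String w : doc) m.merge(w, 1, Integer::sum);
        return m;
    }

    // idf(t) = log(N / df(t)), where df(t) = number of documents containing t.
    static double idf(String term, String[][] docs) {
        int df = 0;
        for (String[] d : docs)
            if (Arrays.asList(d).contains(term)) df++;
        return Math.log((double) docs.length / df);
    }

    // TF-IDF weight vector of one document, stored sparsely as a map.
    static Map<String, Double> tfIdfVector(String[] doc, String[][] docs) {
        Map<String, Double> v = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf(doc).entrySet())
            v.put(e.getKey(), e.getValue() * idf(e.getKey(), docs));
        return v;
    }

    // Cosine similarity between two sparse TF-IDF vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (String t : a.keySet()) dot += a.get(t) * b.getOrDefault(t, 0.0);
        for (double v : a.values()) na += v * v;
        for (double v : b.values()) nb += v * v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /* Documents 1 and 2 share the terms "clustering" and "documents",
       so they should score as more similar than documents 0 and 2,
       which share no terms at all. */
    static boolean demo() {
        String[][] docs = {
            {"pso", "swarm", "optimization"},
            {"pso", "clustering", "documents"},
            {"documents", "clustering", "kmeans"}
        };
        Map<String, Double> v0 = tfIdfVector(docs[0], docs);
        Map<String, Double> v1 = tfIdfVector(docs[1], docs);
        Map<String, Double> v2 = tfIdfVector(docs[2], docs);
        return cosine(v1, v2) > cosine(v0, v2);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that a term occurring in every document gets an IDF of zero and drops out of the comparison, which is exactly why IDF weighting suppresses uninformative common words.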
3. System Design
3.1 Hardware and software specifications
H/W System Configuration:-
Processor - Pentium i5
Speed - 2.3 GHz
RAM - 4 GB
Hard Disk - 500 GB
Key Board - Standard Laptop Keyboard
Mouse - USB mouse
S/W System Configuration:-
Operating System : Windows 10
Development tool : NetBeans 7.0.1
Language : Java
Language Version : JDK 1.7
Technologies : AWT, Swing
3.2 UML Diagrams
Use case diagram
There is only one actor; he can access the following functionality:
Reading the vector and feature files from the user
Applying AMOC for generating clusters
Applying Tabu Search for testing the TSPSO values
Applying PSO for testing the TSPSO values
Using TSPSO to solve the document cluster analysis difficulties more efficiently and quickly
(Use case diagram nodes: User; Read Vector and Feature File; Apply AMOC; Apply Tabu Search; Apply PSO; Apply TSPSO; View Results)
Class Diagram:
Here MiningExecuter is the main class; it utilizes the methods of OptionSelection. When the user invokes the action function, the ResultForm is invoked, and the inputs of the ResultForm are given to the DocClusteringModel. Finally the Output class is executed with the inputs of DocClustering; the StartUp class is responsible for the creation of DocClustering.
Sequence diagram
Here the User is the main actor. Whenever he wants to view the datasets, he can request them using the vector and features files. Similarly, he can optimize the number of clusters present in the datasets using AMOC. Whenever he wants to apply PSO for generating one of the test cases for TSPSO, he can do so; he can apply TSPSO for generating efficient clusters; and finally he can view all the results when required.
Lifelines: User; Read_dataset; Apply AMOC; Apply PSO; Apply TSPSO; View Results
1 : Browse Vector and Feature File()
2 : Vector and Feature File Successfully Read()
3 : Apply AMOC()
4 : Optimized Number of Clusters Arrived()
5 : Apply PSO()
6 : PSO Applied()
7 : Apply TSPSO()
8 : TSPSO Applied()
9 : View Results()
10 : Results Viewed()
Activity diagram
The behavior of the system in terms of activities is described below. As the user initiates the process, he browses for the features and vectors files. Then he applies AMOC for generating clusters. He can then apply PSO for generating test sample 1 for TSPSO, and apply TS for generating test sample 2. Finally these are used for the TSPSO test, and TSPSO is applied for generating efficient clusters.
User application
Browse Files
Apply AMCO
Apply PSO
Apply TS
Apply TSPSO
View Results
State chart diagram
This state chart describes the sequence of states and user interactions. As the user initiates the process, he browses for the features and vectors files. Then he applies AMOC for generating clusters. He can then apply PSO for generating test sample 1 for TSPSO, and apply TS for generating test sample 2. Finally these are used for the TSPSO test, and TSPSO is applied for generating efficient clusters.
Browse Files
Apply AMCO
Apply PSO
Apply TS
Apply TSPSO
View Results
Component Diagram:
The figure shows the various interactions of the user with the different components. The user interacts with the Read Datasets component to browse the vectors and features files, which on success are read successfully; he interacts with the AMOC component to apply it and get optimized clusters; he then interacts with the PSO component to apply it and generate samples; similarly, he interacts with the TSPSO component, which on success generates optimized clusters.
ALGORITHMS
1. PSO
Let S be the number of particles in the swarm, each having a position xi ∈ Rn in the search space and a velocity vi ∈ Rn. Let pi be the best known position of particle i and let g be the best known position of the entire swarm. A basic PSO algorithm is then:

For each particle i = 1, ..., S:
    Initialize the particle's position with a uniformly distributed random vector: xi ~ U(blo, bup), where blo and bup are the lower and upper boundaries of the search space
    Initialize the particle's best known position to its initial position: pi ← xi
    If f(pi) < f(g), update the swarm's best known position: g ← pi
    Initialize the particle's velocity: vi ~ U(-|bup - blo|, |bup - blo|)
Until a termination criterion is met (e.g. a number of iterations performed, or a solution with an adequate objective function value is found), repeat:
    For each particle i = 1, ..., S:
        Pick random numbers rp, rg ~ U(0, 1)
        For each dimension d = 1, ..., n:
            Update the particle's velocity: vi,d ← ω vi,d + φp rp (pi,d - xi,d) + φg rg (gd - xi,d)
        Update the particle's position: xi ← xi + vi
        If f(xi) < f(pi):
            Update the particle's best known position: pi ← xi
            If f(pi) < f(g), update the swarm's best known position: g ← pi
Now g holds the best found solution.
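As a sanity check, the pseudocode above can be turned into a small self-contained Java program. The sketch below minimizes a two-dimensional sphere function f(x) = x1² + x2²; the objective, bounds, swarm size and iteration count are arbitrary illustrative choices, while ω = 0.72 and φp = φg = 1.42 match the coefficients used in the sample code later in this report:

```java
import java.util.Random;

public class PsoSketch {

    // Objective: sphere function, global minimum f = 0 at the origin.
    static double f(double[] x) {
        double s = 0;
        for (double v : x) s += v * v;
        return s;
    }

    /* Runs basic PSO and returns f at the best position found. */
    static double optimize(long seed) {
        final int S = 30, n = 2, iters = 300;
        final double blo = -5, bup = 5, w = 0.72, phiP = 1.42, phiG = 1.42;
        Random r = new Random(seed);
        double[][] x = new double[S][n], v = new double[S][n], p = new double[S][n];
        double[] g = null;
        // Initialization: random positions and velocities, personal/global bests.
        for (int i = 0; i < S; i++) {
            for (int d = 0; d < n; d++) {
                x[i][d] = blo + r.nextDouble() * (bup - blo);
                v[i][d] = (r.nextDouble() * 2 - 1) * (bup - blo);
                p[i][d] = x[i][d];
            }
            if (g == null || f(p[i]) < f(g)) g = p[i].clone();
        }
        // Main loop: velocity and position updates, best-position bookkeeping.
        for (int t = 0; t < iters; t++) {
            for (int i = 0; i < S; i++) {
                double rp = r.nextDouble(), rg = r.nextDouble();
                for (int d = 0; d < n; d++) {
                    v[i][d] = w * v[i][d] + phiP * rp * (p[i][d] - x[i][d])
                                          + phiG * rg * (g[d] - x[i][d]);
                    x[i][d] += v[i][d];
                }
                if (f(x[i]) < f(p[i])) {
                    p[i] = x[i].clone();
                    if (f(p[i]) < f(g)) g = p[i].clone();
                }
            }
        }
        return f(g);
    }

    public static void main(String[] args) {
        System.out.println("best f = " + optimize(42));
    }
}
```

With these settings the swarm reliably drives the best fitness close to zero within a few hundred iterations.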
2. Tabu search:
Steps involved:
Step 1: Create an initial solution x.
Step 2: Initialize the tabu list.
Step 3: While the set X′ of candidate solutions is not complete:
    Step 3.1: Compute a candidate solution x′ from the present solution x.
    Step 3.2: Add x′ to X′ if x′ is not tabu, or if at least one aspiration criterion is satisfied.
Step 4: Select the best candidate solution x* in X′.
Step 5: If fitness(x) < fitness(x*) then x = x*.
Step 6: Update the tabu list.
Step 7: If the termination criterion is reached, finish; otherwise go to Step 3.
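A minimal Java rendering of these steps on a toy one-dimensional problem. The neighborhood (x ± 1), the tabu tenure of 5, and the aspiration rule (accept a tabu move if it beats the best solution found so far) are illustrative assumptions, not the settings used in the project:

```java
import java.util.*;

public class TabuSketch {

    // Toy objective with its minimum at x = 17.
    static double fitness(int x) {
        return (x - 17) * (x - 17);
    }

    /* Basic tabu search over the integers. Neighbors of x are x-1 and x+1;
       the tabu list holds recently visited points; aspiration admits a tabu
       move only when it improves on the incumbent best. */
    static int search(int start, int iters) {
        int x = start, best = start;
        Deque<Integer> tabu = new ArrayDeque<>();        // Step 2: tabu list
        final int tenure = 5;
        for (int it = 0; it < iters; it++) {             // Step 7: stop by iteration count
            List<Integer> candidates = new ArrayList<>();
            for (int cand : new int[]{x - 1, x + 1}) {   // Step 3.1: candidate moves
                boolean aspires = fitness(cand) < fitness(best);
                if (!tabu.contains(cand) || aspires)     // Step 3.2: tabu / aspiration check
                    candidates.add(cand);
            }
            if (candidates.isEmpty()) break;
            int xStar = candidates.get(0);               // Step 4: best candidate
            for (int c : candidates)
                if (fitness(c) < fitness(xStar)) xStar = c;
            x = xStar;                                   // move even if worse: escapes local optima
            if (fitness(x) < fitness(best)) best = x;    // Step 5: keep the incumbent
            tabu.addLast(x);                             // Step 6: update the tabu list
            if (tabu.size() > tenure) tabu.removeFirst();
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(search(0, 100));   // reaches the optimum x = 17
    }
}
```

The key behavior to notice is that the search always moves to the best admissible neighbor, even a worsening one, while the tabu list prevents it from immediately undoing that move; the incumbent best is tracked separately and returned at the end.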
Criterion Function
The Variance Ratio Criterion (VRC) is the most widely used partitional clustering criterion:

VRC = [ Σ(j=1..k) nj ‖ōj - ō‖² / (k - 1) ] / [ Σ(j=1..k) Σ(i=1..nj) ‖oij - ōj‖² / (n - k) ]

where nj denotes the cardinality of cluster cj, oij denotes the i-th object assigned to cluster cj, ō denotes the n-dimensional vector of overall sample means (data centroid), and ōj denotes the n-dimensional vector of sample means within the j-th cluster (cluster centroid). The between-cluster variation has (k - 1) degrees of freedom, and the within-cluster variation has (n - k).
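The VRC can be computed directly from its definition. The sketch below evaluates it for a tiny invented one-dimensional dataset with clusters {0, 1} and {4, 5}: the between-cluster sum of squares is 16 with k - 1 = 1 degree of freedom, the within-cluster sum is 1 with n - k = 2, so VRC = (16/1)/(1/2) = 32:

```java
public class VrcSketch {

    /* Variance Ratio Criterion for 1-D data:
       VRC = (SSB / (k - 1)) / (SSW / (n - k)). */
    static double vrc(double[] data, int[] cluster, int k) {
        int n = data.length;
        double overall = 0;
        for (double v : data) overall += v;
        overall /= n;                                     // data centroid
        double[] mean = new double[k];
        int[] size = new int[k];
        for (int i = 0; i < n; i++) {
            mean[cluster[i]] += data[i];
            size[cluster[i]]++;
        }
        for (int j = 0; j < k; j++) mean[j] /= size[j];   // cluster centroids
        double ssb = 0, ssw = 0;
        for (int j = 0; j < k; j++)                       // between-cluster sum of squares
            ssb += size[j] * (mean[j] - overall) * (mean[j] - overall);
        for (int i = 0; i < n; i++)                       // within-cluster sum of squares
            ssw += (data[i] - mean[cluster[i]]) * (data[i] - mean[cluster[i]]);
        return (ssb / (k - 1)) / (ssw / (n - k));
    }

    public static void main(String[] args) {
        double[] data = {0, 1, 4, 5};
        int[] cluster = {0, 0, 1, 1};
        System.out.println(vrc(data, cluster, 2));   // 32.0
    }
}
```

Larger VRC values indicate tighter, better-separated clusters, which is why it serves as the comparison measure in the evaluation later in this report.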
3. TSPSO
Steps involved:
Step 1: Initialize the population randomly.
Step 2: Compute the fitness function (1) for each particle.
Step 3: Randomly divide the population into two halves:
    a) One half of the population is updated by PSO, i.e. the position and velocity of each particle are updated.
    b) The other half of the population is updated by TS, which searches for the local best solution for each particle.
Step 4: Merge the two halves of the population, and update the "pbest" and "gbest" particles and the tabu list (TL).
Step 5: Repeat Steps 2-4 until the termination condition is reached.
Step 6: Output the result.
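The split-update-merge scheme of Steps 3 and 4 can be sketched in Java as follows. This toy version minimizes a one-dimensional sphere function; the PSO coefficients, the TS probe step, the tabu tenure, and the objective are all illustrative assumptions rather than the project's actual settings:

```java
import java.util.*;

public class TspsoSketch {

    static double f(double x) { return x * x; }   // toy fitness, optimum at 0

    static double run(long seed) {
        final int S = 20, iters = 200;
        Random r = new Random(seed);
        double[] x = new double[S], v = new double[S], pbest = new double[S];
        double gbest;
        for (int i = 0; i < S; i++) {             // Step 1: random initialization
            x[i] = r.nextDouble() * 10 - 5;
            pbest[i] = x[i];
        }
        gbest = pbest[0];
        for (int i = 1; i < S; i++)
            if (f(pbest[i]) < f(gbest)) gbest = pbest[i];
        Deque<Double> tabu = new ArrayDeque<>();  // shared tabu list (TL)
        for (int t = 0; t < iters; t++) {
            // Step 3: random split; first half of a shuffled index list gets PSO.
            Integer[] idx = new Integer[S];
            for (int i = 0; i < S; i++) idx[i] = i;
            Collections.shuffle(Arrays.asList(idx), r);
            for (int m = 0; m < S; m++) {
                int i = idx[m];
                if (m < S / 2) {                  // Step 3a: PSO velocity/position update
                    v[i] = 0.72 * v[i] + 1.42 * r.nextDouble() * (pbest[i] - x[i])
                                       + 1.42 * r.nextDouble() * (gbest - x[i]);
                    x[i] += v[i];
                } else {                          // Step 3b: TS local probe around x[i]
                    double cand = x[i] + (r.nextDouble() - 0.5);
                    if (!tabu.contains(cand) && f(cand) < f(x[i])) {
                        x[i] = cand;
                        tabu.addLast(cand);
                        if (tabu.size() > 10) tabu.removeFirst();
                    }
                }
                // Step 4: the merged population updates pbest and gbest.
                if (f(x[i]) < f(pbest[i])) pbest[i] = x[i];
                if (f(pbest[i]) < f(gbest)) gbest = pbest[i];
            }
        }
        return f(gbest);                          // Step 6: output
    }

    public static void main(String[] args) {
        System.out.println("best fitness = " + run(7));
    }
}
```

The design intent mirrors the steps above: the PSO half provides fast global movement, the TS half provides tabu-guarded local refinement, and both halves share the same pbest/gbest bookkeeping after the merge.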
4. Implementation
4.1 Introduction to technologies
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is carried out, to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential. Three key considerations involved in the feasibility analysis are:
ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, so the expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, so that only minimal or no changes are required for implementing it.
SOCIAL FEASIBILITY
This study checks the level of acceptance of the system by the user. It includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must accept it as a necessity. The level of acceptance by the users depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to offer constructive criticism, which is welcomed, as he is the final user of the system.
4.2 Sample Code
package miner.psoAlgo;

import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;
import miner.*;

public class pso {

    float tfIdf[][];
    float particles[][][];
    float fitness[];
    float partiVelocity[][][];
    float pBest[][][];
    public float gBest[][];
    float newFitness[];
    float gBestFitness;
    int clusterSize[];
    float distance[];
    float intraclustDistance[];
    boolean clusterPoints[][];

    small little = new small();   // project helper class holding a value/position pair

    /* Reads the TF-IDF matrix from file, one column at a time. */
    public void extractData() throws IOException {
        Scanner s = null;
        try {
            s = new Scanner(new BufferedReader(new FileReader("c:\\dc\\tfIdfMatrix.txt")));
            String a;
            int col = -1;
            while (s.hasNext()) {
                a = s.next();
                if (a.indexOf("column") != -1) {
                    col++;
                    for (int j = 0; j < tfIdf.length; j++) {
                        a = s.next();
                        tfIdf[j][col] = Float.parseFloat(a);
                    }
                }
            }
        } catch (IOException e) {
            JOptionPane.showMessageDialog(null, e.toString(), "pso-extractData()", JOptionPane.ERROR_MESSAGE);
        } finally {
            if (s != null)
                s.close();
        }
    }

    public pso() {
    }
    public pso(int Rows, int Columns, int noOfClusters, int noOfParticles) {
        System.out.println("parameterised Constructor Executed");
        tfIdf = new float[Rows][Columns];
        System.out.println("The size of the matrix is:" + tfIdf.length + "\t" + tfIdf[0].length);
        particles = new float[noOfParticles][noOfClusters][Columns];
        fitness = new float[particles.length];
        partiVelocity = new float[noOfParticles][noOfClusters][Columns];
        pBest = new float[noOfParticles][noOfClusters][Columns];
        gBest = new float[noOfClusters][Columns];
        newFitness = new float[particles.length];
        clusterSize = new int[particles[0].length];
        distance = new float[particles[0].length];
        intraclustDistance = new float[particles[0].length];
        Arrays.fill(fitness, 0);
        for (int i = 0; i < gBest.length; i++)
            Arrays.fill(gBest[i], 0);
        for (int i = 0; i < pBest.length; i++)
            for (int j = 0; j < pBest[i].length; j++) {
                Arrays.fill(partiVelocity[i][j], 0);
                Arrays.fill(pBest[i][j], 0);
            }
    }

    /* Seeds every particle's centroids with randomly chosen document vectors. */
    public void assignParticles() throws IOException {
        int Particles[];
        try {
            numberGenerator n = new numberGenerator();   // project helper class
            Particles = n.extractNumbers(particles.length * particles[0].length);
            System.out.println(Particles.length);
            int l = 0;
            for (int i = 0; i < particles.length; i++) {
                for (int j = 0; j < particles[0].length; j++) {
                    for (int k = 0; k < particles[0][0].length; k++) {
                        particles[i][j][k] = tfIdf[Particles[l] - 1][k];
                    }
                    l++;
                }
            }
        } catch (IOException e) {
            JOptionPane.showMessageDialog(null, e.toString(), "pso-assignParticles()", JOptionPane.ERROR_MESSAGE);
        }
    }
    /* Euclidean distance between two vectors. */
    public float eucliDistance(float a[], float b[]) {
        float distance = 0, temp;
        for (int i = 0; i < a.length; i++) {
            temp = a[i] - b[i];
            distance += temp * temp;
        }
        return (float) Math.sqrt(distance);
    }

    /* Returns the smallest non-zero entry of the array and its position. */
    public small Small(float distance[]) {
        small a = new small();
        a.distance = distance[0];
        a.pos = 0;
        for (int i = 1; i < distance.length; i++) {
            if (a.distance == 0 && i == 1) {
                // Skip over leading zeros to find the first non-zero entry.
                int j = i;
                while (true) {
                    if (distance[j] != 0) {   // original tested distance[i] here by mistake
                        a.distance = distance[j];
                        break;
                    }
                    j++;
                }
            }
            if (a.distance > distance[i] && distance[i] != 0) {
                a.distance = distance[i];
                a.pos = i;
            }
        }
        return a;
    }
    /* Fitness of each particle = average intra-cluster distance over its clusters. */
    public void calFitness() {
        for (int i = 0; i < particles.length; i++) {
            newFitness[i] = 0;
            for (int l = 0; l < particles[0].length; l++) {
                clusterSize[l] = 0;
                distance[l] = 0f;
                intraclustDistance[l] = 0f;
            }
            // Assign every document to the nearest centroid of particle i.
            for (int j = 0; j < tfIdf.length; j++) {
                for (int k = 0; k < particles[i].length; k++) {
                    distance[k] = eucliDistance(tfIdf[j], particles[i][k]);
                }
                little = Small(distance);
                intraclustDistance[little.pos] += little.distance;
                clusterSize[little.pos]++;
            }
            for (int k = 0; k < particles[0].length; k++) {
                intraclustDistance[k] = intraclustDistance[k] / clusterSize[k];
                if (Float.isNaN(intraclustDistance[k]))
                    intraclustDistance[k] = 3.3406782f;   // penalty value for an empty cluster
                System.out.println("The intracluster distance in cluster:" + k + " is" + intraclustDistance[k]);
            }
            System.out.println();
            for (int k = 0; k < particles[0].length; k++)
                newFitness[i] += intraclustDistance[k];
            newFitness[i] = newFitness[i] / particles[0].length;
            if (Float.isNaN(newFitness[i]))
                newFitness[i] = fitness[i];
            System.out.println("The Fitness of particle " + i + " is: " + newFitness[i]);
            System.out.println();
        }
    }
    /* Standard PSO velocity and position update (omega = 0.72, c1 = c2 = 1.42). */
    public void changePartiVelocityLocation() {
        float rand1 = (float) Math.random();
        float rand2 = (float) Math.random();
        while (rand1 == rand2)   // redraw so the two random factors differ (the original condition was inverted)
            rand2 = (float) Math.random();
        for (int i = 0; i < particles.length; i++) {
            for (int j = 0; j < particles[i].length; j++) {
                for (int k = 0; k < particles[i][j].length; k++) {
                    partiVelocity[i][j][k] = (float) (0.72 * partiVelocity[i][j][k]
                            + 1.42 * rand1 * (pBest[i][j][k] - particles[i][j][k])
                            + 1.42 * rand2 * (gBest[j][k] - particles[i][j][k]));
                    if (Float.isNaN(partiVelocity[i][j][k]))
                        partiVelocity[i][j][k] = 0;
                    particles[i][j][k] += partiVelocity[i][j][k];
                }
            }
        }
    }

    /* Updates each particle's personal best (lower fitness is better). */
    public void findpBest() {
        for (int i = 0; i < fitness.length; i++) {
            if (fitness[i] > newFitness[i]) {
                fitness[i] = newFitness[i];
                for (int j = 0; j < particles[0].length; j++) {
                    System.arraycopy(particles[i][j], 0, pBest[i][j], 0, particles[i][j].length);
                }
            }
        }
    }

    /* Updates the swarm's global best position and fitness. */
    public void findgBest(int i) {
        small a = Small(fitness);
        int flag = 0;
        if (i == 0) {
            gBestFitness = a.distance;
            flag = 1;
        } else if (i == 1 && gBestFitness == 0) {
            gBestFitness = a.distance;
        } else if (a.distance != 0 && gBestFitness > a.distance) {
            gBestFitness = a.distance;
            flag = 1;
        }
        System.out.println("The gBest Fitness is:" + gBestFitness);
        if (flag == 1) {
            for (int j = 0; j < particles[0].length; j++) {
                System.arraycopy(particles[a.pos][j], 0, gBest[j], 0, particles[0][0].length);
            }
        }
    }
    /* Returns true when every particle has converged to the same fitness value. */
    public boolean checkFitness() {
        float a = newFitness[0];
        int count = 0;
        for (int i = 1; i < newFitness.length; i++)
            if (Math.abs(a - newFitness[i]) == 0)
                count++;
        if (count == newFitness.length - 1) {
            System.out.println("After checking Fitness:");
            for (int l = 0; l < newFitness.length; l++)
                System.out.println(newFitness[l]);
            return true;
        }
        return false;
    }

    /* Main PSO loop: runs n iterations or stops early when the swarm stagnates. */
    public void psoalg(int n) {
        for (int i = 0; i < n; i++) {
            System.out.println();
            System.out.println("iteration: " + i);
            System.out.println();
            calFitness();
            if (i == 0) {
                System.arraycopy(newFitness, 0, fitness, 0, newFitness.length);
                System.out.println("Fitness:" + fitness[0]);
                System.out.println();
                for (int j = 0; j < fitness.length; j++) {
                    for (int k = 0; k < particles[0].length; k++) {
                        System.arraycopy(particles[j][k], 0, pBest[j][k], 0, particles[j][k].length);
                    }
                }
            } else {
                findpBest();
            }
            findgBest(i);
            changePartiVelocityLocation();
            if (checkFitness()) {
                System.out.println("Yes");
                break;
            }
        }
    }
    /* Writes every particle's cluster centroids to a text file. */
    public void show() throws IOException {
        PrintWriter out = null;
        try {
            out = new PrintWriter(new FileWriter("c:\\dc\\psoparticles.txt"));
            for (int i = 0; i < particles.length; i++) {
                out.println("particle:" + (i + 1));
                for (int j = 0; j < particles[0].length; j++) {
                    out.println("Cluster:" + (j + 1));
                    for (int k = 0; k < particles[0][0].length; k++) {
                        out.print(particles[i][j][k] + "\t");
                    }
                    out.println();
                }
            }
        } catch (IOException e) {
            JOptionPane.showMessageDialog(null, e.toString(), "pso-show()", JOptionPane.ERROR_MESSAGE);
        } finally {
            if (out != null)
                out.close();
        }
    }

    /* Average pairwise distance between the global-best centroids. */
    public float centToCentDistance() {
        float result = 0;
        for (int i = 0; i < gBest.length; i++) {
            for (int j = i + 1; j < gBest.length; j++) {
                float temp = eucliDistance(gBest[i], gBest[j]);
                System.out.println("The distance from centroid" + (i + 1) + " to centroid" + (j + 1) + " is : " + temp);
                result += temp;
            }
        }
        int n = (gBest.length * (gBest.length - 1)) / 2;
        result = result / n;
        System.out.println("The average distance is:" + result);
        return result;
    }
    /* Intra-cluster fitness of the global-best solution; also records each
       document's cluster membership in clusterPoints. */
    public float intDist() {
        float distancei[] = new float[gBest.length];
        float intraclustDistancei[] = new float[gBest.length];
        int clusterSizei[] = new int[gBest.length];
        float fitnessi = 0;
        small littlei;
        clusterPoints = new boolean[tfIdf.length][gBest.length];   // entries default to false
        for (int j = 0; j < tfIdf.length; j++) {
            for (int k = 0; k < gBest.length; k++) {
                distancei[k] = eucliDistance(tfIdf[j], gBest[k]);
            }
            littlei = Small(distancei);
            clusterPoints[j][littlei.pos] = true;
            intraclustDistancei[littlei.pos] += littlei.distance;
            clusterSizei[littlei.pos]++;
        }
        for (int k = 0; k < gBest.length; k++)
            intraclustDistancei[k] = intraclustDistancei[k] / clusterSizei[k];
        for (int k = 0; k < gBest.length; k++) {
            System.out.println("Cluster" + k + ":" + intraclustDistancei[k]);
            fitnessi += intraclustDistancei[k];
        }
        fitnessi = fitnessi / gBest.length;
        System.out.println("The gBest fitness is:" + fitnessi);
        return fitnessi;
    }

    /* Builds a printable listing of the documents in each cluster. */
    public String finddocclust() {
        String clust = "";
        for (int i = 0; i < clusterPoints[0].length; i++) {
            clust += "The documents under cluster: " + i + " are:" + "\n";
            int flag = 0;
            for (int j = 0; j < clusterPoints.length; j++) {
                if (clusterPoints[j][i]) {
                    flag++;
                    clust += Integer.toString(j);
                    if (flag % 5 == 0)
                        clust += "\n";
                    else
                        clust += "\t";
                    if (flag == 5)
                        flag = 0;
                }
            }
            clust += "\n" + "**************************************" + "\n";
        }
        System.out.println("The cluster result is:");
        System.out.println(clust);
        return clust;
    }
}
5. Testing
5.1 Unit Testing
Tests for Input
Test case: What happens when we press "ok" leaving all fields empty?
Expected Output: When the user clicks ok without any input for the fields, an error message should be prompted in a dialog box saying "Select appropriate fields properly".
Observed Output: When "ok" is pressed, the error is prompted in the dialog box. The error shown is the same as expected.
No errors are displayed when all fields are entered correctly.
Tests for empty features field
Test case: What happens when we press "ok" leaving the features field empty?
Expected Output: When the user clicks ok without any input for the features field, an error message should be prompted in a dialog box saying "Select features fields properly".
Observed Output: When "ok" is pressed, the error is prompted in the dialog box. The error shown is the same as expected.
No errors are displayed when the features field is entered correctly.
Tests for empty vectors field
Test case: What happens when we press "ok" leaving the vectors field empty?
Expected Output: When the user clicks ok without any input for the vectors field, an error message should be prompted in a dialog box saying "Select vectors fields properly".
Observed Output: When "ok" is pressed, the error is prompted in the dialog box. The error shown is the same as expected.
No errors are displayed when the vectors field is entered correctly.
Tests for empty algorithms field
Test case: What happens when we press "ok" leaving the algorithms field unselected?
Expected Output: When the user clicks ok without selecting an algorithm, an error message should be prompted in a dialog box saying "Select Algorithms fields properly".
Observed Output: When "ok" is pressed, the error is prompted in the dialog box. The error shown is the same as expected.
No errors are displayed when an algorithm field is selected correctly.
5.2 Performance Evaluation
The table 6.1 contains the VRC values of the different algorithms on different
datasets listed below. We compare the VRC values of PSO,TS and TSPSO clustering
algorithms . The graph is plotted for the VRC values of these algorithms.
Table 5.1: VRC values of Three algorithms
Data PSO TS TSPSO
Dataset1 0.489 0.458 0.38
Dataset2 0.502 0.491 0.305
Dataset3 0.561 0.482 0.4
Using the VRC equation given in the Criterion Function section, we calculate the VRC values of each algorithm in every iteration when applied to the different document datasets. The above table gives the VRC values of the PSO, TS and TSPSO clustering algorithms on the corresponding dataset.
[Figure: grouped bar chart; x-axis: datasets (tr12, tr11, fbis, re0, re1); y-axis: score values (0 to 0.7); series: Bisecting Incremental K-Means, Incremental K-Means, K-Means, Spherical K-Means]
Figure 5.2 : Performance evaluation of the algorithms
Figure 5.2 shows the VRC values of the three algorithms. The datasets are on the x-axis and the VRC values on the y-axis. For each dataset, the VRC values of the three clustering algorithms are plotted. The cyan plot represents the incremental K-way clustering using MVS, the yellow represents the K-Means F-Score values, and the brown represents the Spherical K-Means values. From the figure, we conclude that the Bisecting F-Score value for each dataset is high compared to the other algorithms.
6. Results
This is the home page, where the user must click Next in order to carry out the tasks that need to be performed.
The above figure demonstrates the user-selectable fields:
1. Enter vector file: the user enters the location of the vector file, to provide the id as input.
2. Enter features file: the user enters the location of the features file, to provide the data related to the id.
3. AMOC: used for generating the clusters from the given input files.
4. TSPSO: used for finding the cluster that is needed.
5. Particle Swarm Optimization, Tabu Search clustering: used for testing the TSPSO results generated.
The above figure demonstrates the selection of the vector file as input.
The above figure demonstrates the input of the features file.
The above figure demonstrates the id that is selected.
The above figure demonstrates the selection of AMOC for cluster generation.
The above figure demonstrates the successful completion of mining.
The above figure shows the result generated for the given input files and cluster.
The above figure shows the fitness function results generated using the input values.
The above figure shows the final result generated after successful execution.
7. CONCLUSION
In this thesis a new hybrid approach that combines Tabu search and basic PSO is proposed to solve the problem of document clustering. PSO has proved to be an effective optimization technique for solving combinatorial optimization problems. Tabu search, an efficient local search procedure, helps to explore solutions in different regions of the solution space. The hybrid algorithm is a blended technique that combines features of basic PSO and TS, and the quality of the solutions it obtains strongly substantiates its effectiveness for document clustering in an IR system. We also compared TSPSO with particle swarm optimization (PSO) and Tabu search (TS). The results show that TSPSO has the largest VRC values among all the algorithms, which indicates that TSPSO is effective for the document cluster analysis problem. Future work includes using more standard datasets to test the performance of TSPSO.
The clustering algorithms were applied to different datasets, and the results of the proposed TSPSO algorithm were compared with those of the other existing algorithms. Finally, the VRC values of each algorithm were compared, and we conclude that the TSPSO algorithm gives more accurate clusters than the remaining algorithms.
References
1. P. Jaganathan, S. Jaiganesh: "An improved K-means algorithm combined with Particle Swarm Optimization approach for efficient web document clustering". International Conference on Green Computing, Communication and Conservation of Energy (ICGCE), IEEE (2013).
2. M. Yaghini, N. Ghazanfari: "Tabu-KM: A Hybrid Clustering Algorithm Based on Tabu Search Approach". International Journal of Industrial Engineering & Production Research, September (2010), Volume 21.
3. Pritesh Vora, Bhavesh Oza: "A Survey on K-mean Clustering and Particle Swarm Optimization". International Journal of Science and Modern Engineering (IJISME), ISSN: 2319-6386, Volume-1, Issue-3, February (2013).
4. Yudong Zhang, Dayong Li: "Cluster Analysis by Variance Ratio Criterion and Firefly Algorithm". International Journal of Digital Content Technology and its Applications (JDCTA), Volume 7, Number 3, February 2013.
5. Karypis, G.: CLUTO: A Clustering Toolkit. Technical report, Dept. of Computer Science, Univ. of Minnesota (2013). http://glaros.dtc.umn.edu/~gkhome/views/cluto
6. K. Premalatha, A.M. Natarajan: Discrete PSO with GA Operators for Document Clustering. International Journal of Recent Trends in Engineering, Vol 1, No. 1, May 2009.
Sites Referred:
http://java.sun.com
http://www.sourcefordgde.com
http://www.networkcomputing.com/
http://www.roseindia.com/
http://www.java2s.com/
http://www.javadb.com/