SHORT PAPER
A point symmetry-based clonal selection clustering algorithm and its application in image compression
Ruochen Liu • Fei He • Jing Liu • Wenping Ma •
Yangyang Li
Received: 13 February 2012 / Accepted: 17 June 2013
© Springer-Verlag London 2013
Abstract To cluster data set with the character of sym-
metry, a point symmetry-based clonal selection clustering
algorithm (PSCSCA) is proposed in this paper. Firstly, an
immune vaccine operator is introduced to the classical
clonal selection algorithm, which can gain a priori
knowledge of pending problems so as to accelerate the
convergent speed. Secondly, a point symmetry-based
similarity measure is used to evaluate the similarity
between two samples. Finally, both a kd-tree-based approximate nearest neighbor search and a k-nearest-neighbor consistency strategy are used to reduce the computation complexity and improve the clustering accuracy.
In the experiments, first of all, four real-life data sets and
four synthetic data sets are used to test the performance of
PSCSCA. PSCSCA is also compared with multiple existing
algorithms in terms of clustering accuracy and convergent
speed. In addition, PSCSCA is applied to a real-world
application, namely natural image compression, with good
performance being obtained.
Keywords Clustering · Clone selection · Point symmetry distance · Immune vaccine · Image compression
1 Introduction
Clustering technique is an effective tool for exploring the
underlying structure of a given data set; its main objective
is to partition a given data set into homogeneous groups
(called clusters) in such a way that patterns within a
cluster are more similar to each other than patterns
belonging to different clusters. Clustering technique has
been applied to a wide variety of engineering and scien-
tific disciplines such as medicine, psychology, biology,
sociology, pattern recognition, and image processing
[1–3].
Various clustering algorithms have been developed [4].
Among them, the k-means algorithm is one of the most
popular algorithms. It can be implemented easily and has
high efficiency. The k-means algorithm and its variants
have been successfully employed in many practical appli-
cations. However, most partition-based clustering algo-
rithms, including k-means algorithm, are based on a given
objective function and assume that the number of clusters
is known a priori. Moreover, these approaches obtain the final clustering results by alternating optimization, whose iterative nature makes them sensitive to initialization and susceptible to local optima. Recently,
genetic algorithm (GA) inspired by Darwinian evolution
and genetics provides a new solution for clustering analy-
sis. Especially, many partitional clustering algorithms
based on GA have been proposed. We will give a review of GA-based clustering techniques in Sect. 2.
Most of the existing evolutionary clustering algorithms
employ a Euclidean distance metric to construct their fit-
ness functions. They work well on data sets in which the
natural clusters are nearly hyperspherical and linearly
separable, but for complex data sets which are non-spher-
ical and linearly non-separable, the performances of these
algorithms drop sharply. Thus, some researchers use the
characteristic of clusters, ‘‘symmetry’’, to improve the
performance of data partitions [5, 6]. In 2007, Bandyo-
padhyay et al. [7, 8] developed a genetic algorithm with
point symmetry distance-based clustering (GAPS). The
R. Liu (✉) · F. He · J. Liu · W. Ma · Y. Li
Key Laboratory of Intelligent Perception and Image
Understanding of Ministry of Education of China,
Xidian University, Xi’an 710071, China
e-mail: [email protected]
Pattern Anal Applic
DOI 10.1007/s10044-013-0344-8
authors claimed that the GAPS had good performance on
both convex and non-convex clusters of any shape and size
as long as the clusters do have some symmetry property.
However, due to some limitations such as prematurity and
degradation in GA, GAPS does not have a prominent mean
performance when executed for many runs. Especially, for
some special data sets, i.e., two almost inscribed circles,
GAPS has a high misclassification rate.
A promising and recently introduced approach to
numerical optimization, which is rather unknown outside
the search heuristics field, is the clonal selection algo-
rithm inspired by the clonal selection theory [9] in immunology. The clonal selection algorithm (CSA) [10–13] for optimization evolves solutions to problems via repeated application of a cloning, mutation, and selection cycle to a population of candidate solutions, while retaining good solutions in the population. Compared to the great number
of studies on partitional clustering with GAs, only few
applications using CSA [14–17] can be found in the
literature.
In this paper, a new immune algorithm for clustering data sets with the character of symmetry is proposed. Firstly, an improved CSA is employed to construct a
novel clustering algorithm, in which the clonal mutation
based on the simulated annealing is used to improve the
effectiveness of the proposed clustering algorithm. Sec-
ondly, based on the immune vaccine theory, an immune
vaccine operator is proposed, which helps evolve ordinary
antibodies by using the information of the excellent anti-
body so as to speed up the evolution of the whole antibody
population. By introducing this immune vaccine operator
to CSA, the performance of the classical clonal selection
clustering algorithm is improved and some limitations such
as prematurity and degradation in GA are overcome to
some extent. Thirdly, to cluster data sets with the character
of symmetry, the point symmetry-based distance (PS-dis-
tance) is used to construct the antibody affinity function.
Moreover, both kd-trees-based approximate nearest
neighbor search and k-nearest-neighbor consistency are
used to reduce the computation complexity and ameliorate
the clustering accuracy.
We first apply the proposed algorithm to several real-life
data sets and synthetic data sets. To test the performance of the proposed algorithm in dealing with complex problems, we apply it to natural image compression, in which both large data sets and large codebooks are involved.
The rest of the paper is organized as follows. In Sect. 2,
some related backgrounds are introduced. In Sect. 3, the
details of the proposed algorithm are presented. In Sect. 4,
the proposed algorithm is used to solve the benchmark
clustering problems. In Sect. 5, the proposed algorithm is
applied to several image compression problems. We con-
clude the paper in Sect. 6.
2 Related background
2.1 Problem definition [18]
Let $O = \{o_1, o_2, \ldots, o_n\}$ be a set of $n$ objects and let $X_{n \times p}$ be the profile data matrix, with $n$ rows and $p$ columns. Each $i$th object is characterized by a real-valued $p$-dimensional profile vector $x_i$, $i = 1, 2, \ldots, n$, where each element $x_{ij}$ of $x_i$ corresponds to the $j$th real-valued feature ($j = 1, 2, \ldots, p$) of the $i$th object ($i = 1, 2, \ldots, n$).

Given $X_{n \times p}$, the goal of a partitional clustering algorithm is to determine a partition $P = \{P_1, P_2, \ldots, P_k\}$ (i.e., $\forall i, P_i \ne \emptyset$; $\forall i, j$ with $i \ne j$, $P_i \cap P_j = \emptyset$; $\cup_{i=1}^{k} P_i = X$) such that objects which belong to the same cluster are as similar to each other as possible, while objects which belong to different clusters are as dissimilar as possible.
The number of all possible partitions of data X of n
elements into k non-empty subsets is given as follows:
$$P(n, k) = \frac{1}{k!} \sum_{m=0}^{k} (-1)^m \binom{k}{m} (k - m)^n \qquad (1)$$
From Eq. (1), it is easy to see that even for small $n$ and $k$, the number of possible partitions is extremely large, not to mention large-scale clustering problems. Simple
local search methods, such as hill-climbing algorithms, are
utilized to find the partition, but they are easily stuck in
local optima and therefore cannot guarantee optimality
[19]. A large number of clustering algorithms are based on
iterative optimization such as the popular k-means and its
variants. These algorithms begin with a solution and then
repeatedly improve the solution until no further (local)
improvements can be made. However, k-means suffers
from the possibility of getting trapped into local optima.
Evolutionary algorithms, such as genetic algorithms, are
stochastic search heuristics and widely believed to be
effective in solving NP complete global optimization
problems, obviously providing an alternative to k-means
and simulated annealing in clustering analysis.
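To make the scale implied by Eq. (1) concrete, the following minimal Python sketch (ours, not part of the paper's Matlab/C++ implementation) evaluates $P(n, k)$ directly:

```python
from math import comb, factorial

def num_partitions(n: int, k: int) -> int:
    """Number of partitions of n objects into k non-empty subsets,
    i.e., the Stirling number of the second kind, per Eq. (1)."""
    return sum((-1) ** m * comb(k, m) * (k - m) ** n
               for m in range(k + 1)) // factorial(k)

print(num_partitions(10, 3))   # 9330
print(num_partitions(25, 5))   # already exceeds 10**15
```

Even these tiny instances rule out exhaustive search, which motivates the heuristic approaches discussed above.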
2.2 Related work
Genetic algorithms have been applied to partitional clus-
tering in many ways. They can be grouped into three main
categories based on their encoding strategy: (i) direct
encoding of the object–cluster association; (ii) encoding of
cluster separating boundaries and (iii) centroid/medoid/
representative point and variation parameter encoding for
each cluster.
In 1979, Raghavan and Birchand [20] proposed the first
GA-based clustering algorithm, which belongs to the first
approach of using a direct encoding of the object–cluster
association. The idea is to use an encoding that allocates
directly n objects to k clusters, so that each candidate
solution consists of n solution parameters with integer
values in the interval [1, k]. Falkenauer [21] proposed an
improved version to tackle the redundancy drawback in the
representation scheme.
The second kind of GA approach to partitional cluster-
ing is to encode cluster separating boundaries. Bandyo-
padhyay et al. [22–25] used GAs to determine hyperplanes
as decision boundaries, which divide the attribute feature
space to separate the clusters. For this, they encode the
location and orientation of a set of hyperplanes with a
candidate solution representation of flexible length. Apart
from minimizing the number of misclassified objects, their
approach tries to minimize the number of required hyper-
planes. Another interesting and more flexible approach by
Bandyopadhyay and Maulik [26] is to determine the
boundaries between clusters by connected linear segments
instead of rigid planes.
The third way to use GAs in partitional clustering is to
encode a representative variable (typically a centroid) and
optionally a set of parameters to describe the extent and
shape of the variance for each cluster. Maulik and Ban-
dyopadhyay [27] proposed GA clustering, in which each string is a sequence of real numbers representing the $k$ cluster centers, and each object is then allocated to the cluster associated with its nearest representative point, where 'nearest' refers to the Euclidean distance. The
fitness of a candidate solution is then computed as the
adequacy of the identified partition. Many studies [28, 29]
have shown that this approach is more robust in converging
toward the optimal partition than classic partitional algo-
rithms. In 2010 [30], Gong et al. proposed a manifold
evolutionary clustering algorithm (MEC). In MEC, each individual is a sequence of integers representing the sequence numbers of the $k$ cluster representatives. The length of a chromosome is $k$ genes, of which the first gene represents the first cluster, the second represents the second cluster, and so on. MEC is applied to solve
image classification problems.
Recently, there have been two new trends in the study of
partitional clustering based on stochastic search heuristics.
Some researchers try to determine the numbers of clusters
by using natural computation such as GA, particle swarm
optimization (PSO), and differential evolution (DE) and
CSA [31–33].
On the other hand, some researchers have proposed special clustering validity indices for special data sets. Most existing work focuses on proposing a certain cluster validity index and finding the optimal partition for some special data. The measure of clustering validity is usually based on the Euclidean distance, as in the Davies–Bouldin (DB) index [34], Dunn's index [35], the Xie–Beni (XB) index [36], and the PBM index proposed by Bandyopadhyay et al. [37], the last being the most commonly used among the validity measures based on Euclidean distance. Bo et al. [38]
designed a density-sensitive similarity measure or a man-
ifold distance to define cluster validity index for data with
manifold characters. Swagatam et al. [31] used a kernel function-induced index to cluster some special data.
Each approach has its own merits and disadvantages.
Meila and Heckerman provide a comparison of some
clustering methods and initialization strategies in [39].
Bandyopadhyay and Maulik [40] have provided a comparison of several validity indices that do not assume any underlying distribution of the data sets, used with nonparametric genetic clustering algorithms.
In this paper, we introduce a new cluster validity index, namely, the point symmetry distance-based index, and give a brief review of it. The first symmetry-based clustering technique was proposed by Su and Chou [5]. They assign points to a particular cluster if the points present a symmetrical structure with respect to the cluster center. A new type of non-metric distance, based on point symmetry, is proposed and used in a k-means-based clustering algorithm, referred to as the symmetry-based k-means (SBKM) algorithm. SBKM is found to provide good
performance on different types of data sets where the
clusters have internal symmetry. However, it can be shown
that SBKM fails for some data sets where the clusters
themselves are symmetrical with respect to some inter-
mediate point, since the point symmetry distance ignores
the Euclidean distance in its computation. Though this has
been mentioned in a subsequent paper by Chou et al. [6]
where they have suggested a modification (MOD-SBKM),
the modified measure has the same limitation as the pre-
vious one. In 2007, Bandyopadhyay et al. [7, 8] developed
a GA with point symmetry distance-based clustering
(GAPS). In GAPS, chromosomes in the initial population are encoded as real-coded strings. In the evolutionary procedure, adaptive mutation and crossover probabilities are used, and a new point symmetry-based distance
(PS-distance), which incorporates both the Euclidean dis-
tance and a measure of symmetry, is proposed. GAPS has
good performance on both convex and non-convex clusters
of any shape and sizes, as long as the clusters do have some
symmetry property.
3 Point symmetry-based clonal selection clustering
algorithm
3.1 Clonal selection principle
The clonal selection theory, a famous theory in immunol-
ogy, was put forward by Burnet [9]. Its main ideas are that
the antigen can selectively react with antibodies, which are natively produced and spread on the cell surface in the form of peptides. When an antigen is detected, those
antibodies that best recognize an antigen will proliferate by
cloning. This process is called clonal selection principle.
The new cloned cells undergo high rate of mutations or
hypermutation to increase their receptor population (called
repertoire). These mutations experienced by the clones are
proportional to their affinity to the antigen. Based on the
theory of clonal selection and hypermutation, CLONALG
is proposed by de Castro and Von Zuben [10]. In CLO-
NALG, only a part of the fittest antibodies are selected to
be cloned proportionally to their antigenic affinities.
According to the antigenic affinity of antibody, CLONALG
creates a proportion of new antibodies to replace the lowest
Ab–Ab fitness value antibodies in the current population
and reserves the best fitness value antibodies for recycling.
Later on, more CSA-based optimization algorithms were proposed, but up to now few of them have been applied to clustering problems.
3.2 Point symmetry-based clonal selection clustering
algorithm (PSCSCA)
The main loop of the proposed point symmetry-based clonal selection clustering algorithm is given in Algorithm 1. Considering clustering the data set $x_j \in \Re^d$, $j = 1, 2, \ldots, n$, into $k$ subsets $c_1, c_2, \ldots, c_k$ ($k \le n$), some key operators employed in the proposed algorithm are presented in the following sections.
3.2.1 Antibody encoding and population initialization
In the proposed algorithm, each antibody is encoded by a
real number vector which represents the coordinates of the
centers. That is, antibody $A_i$ encodes the centers of $k$ clusters in the $d$-dimensional space, and its length is $L_i = k \times d$:

$$A_i = \{\underbrace{a_{11} a_{12} \ldots a_{1d}}_{c_1} \; \ldots \; \underbrace{a_{i1} \ldots a_{id}}_{c_i} \; \ldots \; \underbrace{a_{k1} a_{k2} \ldots a_{kd}}_{c_k}\} \qquad (2)$$

where $c_1, c_2, \ldots, c_k$ are the clustering centers.
First, we initialize the antibody population randomly,
and five iterations of the k-means algorithm are executed
on the set of centers encoded in each antibody in the initial
antibody population. Then, the refined antibody population
$A(0)$ is evolved by CSA. It should be noted that running more than five iterations of k-means did not further improve the clustering; this is an experimental observation.
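To illustrate the encoding and initialization described above, here is a minimal Python sketch (the function names and NumPy realization are ours; the paper's implementation uses Matlab and C++):

```python
import numpy as np

def init_population(data: np.ndarray, n_antibodies: int, k: int,
                    refine_iters: int = 5) -> np.ndarray:
    """Each antibody is a flat vector of k cluster centers (length k*d),
    initialized from random data points and refined by a few k-means steps."""
    n, d = data.shape
    pop = np.empty((n_antibodies, k * d))
    for i in range(n_antibodies):
        centers = data[np.random.choice(n, k, replace=False)].copy()
        for _ in range(refine_iters):                 # short k-means refinement
            labels = np.argmin(
                ((data[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = data[labels == j].mean(axis=0)
        pop[i] = centers.ravel()
    return pop
```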
3.2.2 PS distance-based antibody affinity
Symmetry is a characteristic of geometrical shapes, equa-
tions and other objects; we say that such an object is
symmetric with respect to a given operation if this opera-
tion, when applied to the object, does not appear to change
it. The three main symmetrical operations are reflection,
rotation, and translation. A reflection "flips" an object over a line, inverting it to its mirror image. A rotation turns an object about a point, its center. Strictly speaking, in this paper, data with the character
of symmetry means that the data have approximate rotational symmetry; in other words, data with the character of
symmetry include three kinds of instances: (1) different
clusters are symmetrical with respect to some intermediate
point, which is called data with character of external
symmetry; (2) some point in each cluster is symmetrical
with respect to its clustering center, which is called data
with character of within symmetry; (3) it also includes data
with a character of combined external symmetry with
within symmetry.
To cluster the data with character of symmetry, Ban-
dyopadhyay et al. [8] proposed a new definition of the PS-
based distance that can overcome the limitations of the
definition proposed by Su [6]. Based on their PS-based
distance, a GA with PS-based distance (GAPS) is proposed
and the experimental results show that GAPS has a good
performance on both convex and non-convex clusters of
any shape and size as long as the clusters do have some
symmetry property. The new proposed PS-based distance
is defined as:
$$d_{ps}(x, c) = \frac{d_1 + d_2}{2} \times d_e(x, c) \qquad (3)$$
where $d_e(x, c)$ is the Euclidean distance between $x$ and $c$, $x'$ is the symmetric point of $x$ with respect to center $c$, and $d_1$ and $d_2$ are, respectively, the Euclidean distances between $x'$ and its first and second nearest neighbors, as shown in Fig. 1.
In Fig. 1, $d_{e1}(x, c_1)$ is the Euclidean distance between $x$ and $c_1$, $x'$ is the symmetric point of $x$ with respect to center $c_1$, and $d_1$ and $d_2$ are, respectively, the Euclidean distances between $x'$ and its first nearest neighbor $x_1$ and second nearest neighbor $x_2$. $d_{e2}(x, c_2)$ is the Euclidean distance between $x$ and $c_2$, $x''$ is the symmetric point of $x$ with respect to center $c_2$, and $d_3$ and $d_4$ are, respectively, the Euclidean distances between $x''$ and its first nearest neighbor $x_3$ and second nearest neighbor $x_4$.
According to Eq. (3), the following equations hold:

$$d_{ps}(x, c_1) = \frac{d_1 + d_2}{2} \times d_{e1}(x, c_1) \qquad (4)$$

$$d_{ps}(x, c_2) = \frac{d_3 + d_4}{2} \times d_{e2}(x, c_2) \qquad (5)$$

If $\frac{d_1 + d_2}{2} > \frac{d_3 + d_4}{2}$ but $d_{e1}(x, c_1) \ll d_{e2}(x, c_2)$, then $d_{ps}(x, c_1) < d_{ps}(x, c_2)$ can still hold. Taking both $d_1$ and $d_2$ into account makes the PS distance more robust and noise resistant.
Based on the new definition of the PS-based distance
proposed by Bandyopadhyay et al., we define the antibody
affinity as follows:
$$\mathrm{aff}(A_j) = \frac{1}{\sum_{j=1}^{k} \sum_{x_i \in c_j} d_{ps}(x_i, c_j)} \qquad (6)$$
The antibody affinity is maximized to evolve the proper
cluster centers and obtain the optimal partition.
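The PS-distance of Eq. (3) and the affinity of Eq. (6) can be sketched as follows (ours; the kd-tree speedup of Sect. 3.2.3 is used for the nearest-neighbor queries, and assigning each point to the center with minimal PS-distance is a simplification of the full assignment rule):

```python
import numpy as np
from scipy.spatial import cKDTree

def ps_distance(x, center, tree: cKDTree) -> float:
    """PS-distance of Eq. (3): reflect x through the center, average the
    distances from the reflected point to its two nearest data points,
    and scale by the Euclidean distance between x and the center."""
    reflected = 2.0 * center - x                 # point symmetric to x w.r.t. center
    (d1, d2), _ = tree.query(reflected, k=2)     # nearest-neighbor distances
    return (d1 + d2) / 2.0 * np.linalg.norm(x - center)

def affinity(antibody, data, tree, k, d) -> float:
    """Antibody affinity of Eq. (6): reciprocal of the total PS-distance of
    every point to its nearest (in PS-distance) encoded center."""
    centers = antibody.reshape(k, d)
    total = sum(min(ps_distance(x, c, tree) for c in centers) for x in data)
    return 1.0 / total
```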
3.2.3 k-nearest-neighbor consistency and kd-trees-based
approximate nearest neighbor
GAPS proposed by Bandyopadhyay et al. works well for most data sets with the symmetry property; however, we found that GAPS misclassifies some special data sets, resulting in low accuracy. For example, consider a data set with the geometrical shape of two adjacent cirques, as shown in Fig. 2. The symmetrical point of $x$ with respect to $c_1$ is $x'$ and the symmetrical point of $x$ with respect to $c_2$ is $x''$; $d_1$ and $d_2$ are the Euclidean distances from $x'$ to its first and second nearest neighbors, $d_3$ and $d_4$ are the Euclidean distances from $x''$ to its first and second nearest neighbors, and $d_{e1}$ and $d_{e2}$ are the Euclidean distances of $x$ from $c_1$ and $c_2$, respectively.
Based on Eq. (3), we can compute the PS-based distances of $x$ with respect to $c_1$ and $c_2$ as follows:

$$d_{ps}(x, c_1) = \frac{d_1 + d_2}{2} \times d_{e1}(x, c_1) \qquad (7)$$

$$d_{ps}(x, c_2) = \frac{d_3 + d_4}{2} \times d_{e2}(x, c_2) \qquad (8)$$
Fig. 1 The PS-based distance proposed by Sanghamitra et al.
Fig. 2 An example of Sanghamitra's point symmetry distance
Obviously, $d_{ps}(x, c_1) > d_{ps}(x, c_2)$, so $x$ will be misplaced into the bigger cirque; this can be remedied by using k-nearest-neighbor consistency in data clustering.
The nearest neighbor consistency is a central concept in
statistical pattern recognition, especially the k-nearest-
neighbor classification method and its strong theoretical
foundation. Chris et al. [41] extended this concept to data clustering, requiring that, for any data point in a cluster, its k-nearest neighbors and mutual nearest neighbors should also be in the same cluster. For the above example, the first nearest neighbor $y$ and the second nearest neighbor $z$ of point $x$ should also be in the same cluster, according to the k-nearest-neighbor consistency, so as to
neighbor consistency will be combined with the PS-based
distance in the proposed clustering.
According to the preceding description, we need to compute the k-nearest neighbors of each $x_i$ for the k-nearest-neighbor consistency check, and to find the first and second nearest neighbors of the reflected point $x'$ in the symmetrical distance of Eq. (3).
To reduce the computational complexity, here ANN
search using the kd-tree [42] approach is adopted. Suppose
a set of data points in d-dimensional space is given. These
points are preprocessed into a kd-tree structure, so that
given any query point q, the nearest or generally k-nearest
points to q can be reported efficiently. Euclidean distance is
used to define the distance between two points.
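The paper does not reproduce the exact reassignment rule, so the following Python sketch is only one plausible realization of the k-nearest-neighbor consistency repair (the majority-vote rule is our assumption):

```python
import numpy as np
from scipy.spatial import cKDTree

def enforce_knn_consistency(data, labels, k=3):
    """One sweep of a k-nearest-neighbor consistency repair: reassign any
    point whose label disagrees with all of its k nearest neighbors."""
    tree = cKDTree(data)
    _, nbr_idx = tree.query(data, k=k + 1)        # +1: first hit is the point itself
    new_labels = labels.copy()
    for i, nbrs in enumerate(nbr_idx[:, 1:]):
        nbr_labels = labels[nbrs]
        if not np.any(nbr_labels == labels[i]):   # inconsistent with every neighbor
            vals, counts = np.unique(nbr_labels, return_counts=True)
            new_labels[i] = vals[np.argmax(counts)]
    return new_labels
```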
3.2.4 Immune vaccine operator
3.2.4.1 Related definitions We first introduce the defini-
tions of vaccine, vaccination, and immune selection [43] in
artificial immune system. The antibody coding of variable
$x$ is $a = (a_1, a_2, \ldots, a_l)$, denoted as $a = e(x)$. According to the biological concept, $a_i$ of the antibody $a$ is called a gene or allele, and the value of $a_i$ is related to the method of encoding. Variable $x$ is defined as the decoding of antibody $a$, denoted as $x = e^{-1}(a)$.

Definition 1 A vaccine can be defined as the estimate of the genes of the optimal individual. Considering the maximization of an affinity function $\mathrm{aff}$, if for antibody $a = (a_1, a_2, \ldots, a_l)$ and $\forall b \in A$, $b = (b_1, b_2, \ldots, b_l)$, the following formula holds:

$$\mathrm{aff}(a) = \mathrm{aff}(e^{-1}(a_1, \ldots, a_i, \ldots, a_l)) \ge \mathrm{aff}(b) = \mathrm{aff}(e^{-1}(b_1, \ldots, b_i, \ldots, b_l)) \qquad (9)$$

then antibody $a = (a_1, a_2, \ldots, a_l)$ is called a choiceness (elite) individual of the antibody population $A$. Let $H = \{a \mid a \text{ is a choiceness individual of the antibody population } A\}$. If a character $s$ (a real number or a binary value) satisfies

$$\forall a \in H, \quad a_i = s,$$

then $s$ is called the choiceness allele at the $i$th bit of $A$, denoted as $A_i$. A vector $v$ is a mode if, for $\forall i = 1, 2, \ldots, l$,

$$v_i = \begin{cases} A_i & \text{if the choiceness allele at bit } i \text{ of } A \text{ exists} \\ * & \text{otherwise} \end{cases} \qquad (10)$$

and such a $v$ is called a vaccine of the antibody population $A$.
Vaccination means the operation of amending some gene bits of an antibody $b$ according to the extracted vaccine, so as to gain higher affinity with greater probability, namely

$$b_i' = \begin{cases} v_i & v_i \ne * \\ b_i & v_i = * \end{cases}, \quad i = 1, 2, \ldots, l \qquad (11)$$
The immune selection includes the immune test and the
simulated annealing selection. This will be introduced in
the following section.
3.2.4.2 Immune vaccine operator The immune vaccine
operator includes three operations, namely vaccine extraction, vaccination, and immune selection. Suppose the current antibody population is $A(t) = \{A_1, A_2, \ldots, A_N\}$, the encoding length of each antibody $A_i$ is $l$, the vaccine set is $v = \{v_1, v_2, \ldots, v_l\}$, the antibody affinities of $A(t)$ are $\mathrm{Aff}(t) = \{\mathrm{aff}_1, \mathrm{aff}_2, \ldots, \mathrm{aff}_N\}$, and the probability of extracting a vaccine from each antibody is $P_{ex} = \{P_{ex}(A_1), \ldots, P_{ex}(A_N)\}$, where:

$$P_{ex}(A_i) = \mathrm{aff}(A_i) \Big/ \sum_{j=1}^{N} \mathrm{aff}(A_j) \qquad (12)$$
The main loops of the immune vaccine operator are explained in Algorithms 2 and 3.
Too large a vaccine set can affect the performance, and
inferior vaccine extracted from the initial antibody popu-
lation can lead to a vaccine set with bad quality. To deal
with these two problems, some low-quality vaccines in the vaccine set are deleted after the vaccinate operation. The vaccinate operation proceeds as in Algorithm 3.
The immune selection includes two operations. The first
one is the immune test, i.e., testing the antibodies. If the
affinity is smaller than that of the parent, which means
serious degeneration must have happened after vaccination,
then instead of the individual, the parent will participate in
the next competition. The second one is the simulated
annealing selection, i.e., selecting an individual Ai in the
present offspring to join the new parents with the proba-
bility Pvc Aið Þ:
Pvc Aið Þ ¼exp aff Aið Þ=Tð Þ
PNi¼1 exp aff Aið Þ=Tð Þ
ð13Þ
where T is the annealing temperature and N is the size of
the population.
Extract vaccine operation and vaccinate operation are
important in the proposed algorithm. Based on reasonable
selection of vaccines, the proposed algorithm can improve
the affinity of antibody population and speed up the con-
vergent rate by using an immune vaccine operator.
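Since Algorithms 2 and 3 are not reproduced in the text, the sketch below is only one plausible reading of vaccine extraction and vaccination: antibodies are sampled with the affinity-proportional probability of Eq. (12), genes the sampled antibodies agree on become vaccine bits, and NaN plays the role of the wildcard '*' of Eq. (10):

```python
import numpy as np

def extract_vaccine(pop, affinities, n_samples, tol=1e-3):
    """One plausible vaccine extraction (an assumption, ours): sample
    antibodies with probability proportional to affinity, Eq. (12), and
    keep the gene positions they (nearly) agree on; the rest become
    wildcards, encoded here as NaN."""
    p = affinities / affinities.sum()
    idx = np.random.choice(len(pop), size=n_samples, p=p, replace=False)
    elite = pop[idx]
    agree = np.ptp(elite, axis=0) < tol            # genes with near-consensus
    return np.where(agree, elite.mean(axis=0), np.nan)

def vaccinate(antibody, vaccine):
    """Vaccination, Eq. (11): overwrite gene bits where the vaccine is defined."""
    return np.where(np.isnan(vaccine), antibody, vaccine)
```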
3.2.5 Clone
Suppose an antibody population $A(t)$ at generation $t$. After the clone operation, the new antibody population $A^c(t)$ is obtained as follows:

$$A^c(t) = T_c^C(A(t)) = \{A_i^c(t) \mid i = 1, \ldots, N\} = \{T_c^C(A_1(t)), \ldots, T_c^C(A_i(t)), \ldots, T_c^C(A_N(t))\} \qquad (14)$$

where $T_c^C(A_i(t)) = I_i \times A_i(t)$, $i = 1, 2, \ldots, N$, and $I_i$ is a vector of length $q_i$, so that the antibody set $A_i^c(t)$ contains $q_i$ copies of antibody $A_i$:

$$q_i(t) = g(N_c, \mathrm{aff}(A_i(t))) \qquad (15)$$

where $\mathrm{aff}(\cdot)$ is the antibody affinity with the antigen and $N_c$ is a constant related to the total scale of antibodies to be cloned. Generally, $q_i(t)$ is defined as follows:

$$q_i(t) = \mathrm{Int}\left(N_c \cdot \frac{\mathrm{aff}(A_i(t))}{\sum_{j=1}^{N} \mathrm{aff}(A_j(t))}\right), \quad i = 1, 2, \ldots, N \qquad (16)$$

where $\mathrm{Int}(x)$ returns the smallest integer greater than or equal to $x$. Thus, the clone scale of a given antibody is adjusted to its affinity: the larger the affinity, the larger the clone scale, and vice versa.
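A minimal sketch of the clone operator (ours), realizing Eq. (16) with NumPy:

```python
import numpy as np

def clone(pop, affinities, Nc):
    """Clone operator, Eqs. (14)-(16): each antibody receives a number of
    copies proportional to its share of the total affinity; Int(x) is the
    smallest integer >= x, i.e., the ceiling."""
    copies = np.ceil(Nc * affinities / affinities.sum()).astype(int)
    parents = np.repeat(np.arange(len(pop)), copies)  # remember each clone's parent
    return pop[parents], parents
```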
3.2.6 Hypermutation
In the proposed algorithm, simulated annealing mutation is
used. After the clone operation, each antibody in the pop-
ulation is cloned to form a corresponding seed antibody
population with different size based on its affinity function
value. Then, all seed antibody populations will experience a
simulated annealing mutation, in which the extent of
mutation decreases along with the evolutionary process.
Suppose the maximum evolutionary generation is $m$; for antibody $A_i$ with $A_i \in [lb, ub]$ and two random numbers $r_1, r_2 \in [0, 1]$, the simulated annealing mutation in generation $i$ is defined as follows:

$$A_i' = \begin{cases} A_i + (ub - A_i)\, r_1 \left(1 - \dfrac{i}{m}\right)^b, & r_1 < 0.5 \\[2mm] A_i - (A_i - lb)\, r_2 \left(1 - \dfrac{i}{m}\right)^b, & r_2 \ge 0.5 \end{cases} \qquad (17)$$

where $b$ is a real parameter which controls the annealing speed.
The original antibody population $A(t)$ is not involved in the mutation operation, in order to protect some important antibody genes in it.
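A sketch of the simulated annealing mutation of Eq. (17) (ours; for brevity it draws a single random number per gene, whereas the paper distinguishes $r_1$ and $r_2$):

```python
import numpy as np

def sa_mutate(antibody, gen, max_gen, lb, ub, b=5.0):
    """Simulated annealing mutation, Eq. (17): the perturbation range
    shrinks as (1 - gen/max_gen)**b along the evolutionary process."""
    r = np.random.rand(*antibody.shape)
    decay = (1.0 - gen / max_gen) ** b
    up = antibody + (ub - antibody) * r * decay     # move toward the upper bound
    down = antibody - (antibody - lb) * r * decay   # move toward the lower bound
    return np.where(r < 0.5, up, down)
```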
3.2.7 Clonal selection $T_s^C$
After the mutation operation, the best antibody with the
maximal affinity will be kept in the new antibody
population with a probability $p_{ts}$. The best antibody is saved in $B_i(t)$ as follows:

$$B_i(t) = \left\{ A_{ij}(t) \;\middle|\; j = \arg\max_{j} \mathrm{aff}(A_{ij}(t)), \; j = 1, \ldots, q_i(t) - 1 \right\} \qquad (18)$$

The probability $p_{ts}$ with which the new antibody $B_i(t)$ displaces the original antibody $A_i(t)$ is defined as:

$$p_{ts}(A_i(t+1) = B_i(t)) = \begin{cases} 1 & \mathrm{aff}(B_i(t)) > \mathrm{aff}(A_i(t)) \\ \exp\left(-\dfrac{\mathrm{aff}(A_i(t)) - \mathrm{aff}(B_i(t))}{\beta}\right) & \mathrm{aff}(B_i(t)) \le \mathrm{aff}(A_i(t)) \text{ and } A_i(t) \text{ is not the current best antibody} \\ 0 & \mathrm{aff}(B_i(t)) \le \mathrm{aff}(A_i(t)) \text{ and } A_i(t) \text{ is the current best antibody} \end{cases} \qquad (19)$$

where $\beta > 0$ is a constant related to the diversity of the antibody population. Generally, the greater $\beta$ is, the better the diversity, and vice versa.
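The resulting selection rule of Eqs. (18)–(19) can be sketched as follows (ours):

```python
import numpy as np

def clonal_select(parent, parent_aff, best_clone, clone_aff,
                  beta=1.0, parent_is_global_best=False):
    """Clonal selection, Eqs. (18)-(19): accept a better clone always,
    a worse one with a simulated-annealing probability, and never
    replace the current best antibody with a worse clone."""
    if clone_aff > parent_aff:
        return best_clone, clone_aff
    if parent_is_global_best:
        return parent, parent_aff
    if np.random.rand() < np.exp(-(parent_aff - clone_aff) / beta):
        return best_clone, clone_aff
    return parent, parent_aff
```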
3.3 Complexity analysis
Suppose the size of the population is $N$, the maximum generation is $M$, and the dimension of the data set is $d$. The time complexity of constructing a $d$-dimensional tree over $n$ data points is $O(dn \log n)$, and the complexity of finding the two nearest neighbors of a point by the kd-tree is $O(d \log n)$. From the above flow of calculating affinity, the $n$ data points are clustered into $k$ classes, giving $n \times k$ symmetry points. For each symmetry point, the algorithm requires its two nearest neighbors, so the complexity of calculating one antibody affinity is $O(dnk \log n)$, and the complexity of maintaining the consistency of the nearest neighbors is $O(dn \log n)$. The affinity is updated $N$ times in each generation, so the complexity per generation is $O(Ndkn \log n)$, and the time complexity required by the immune vaccine in each generation is $O(dkN)$. The complexity of GAPS is:

$$O(dn \log n + MNdkn \log n) = O(MNdkn \log n) \qquad (20)$$

For PSCSCA, the size of the population is set as $N$, the clonal size is $N_c$, the maximum generation is $M$, and the dimension of the data set is $d$. Then, the complexity of PSCSCA is:

$$O(dn \log n + M(N_c dkn \log n + dkN_c)) = O(MN_c dkn \log n) \qquad (21)$$

Generally, only a constant coefficient differs, and the complexity of PSCSCA remains the same as that of GAPS if the clonal size $N_c$ in PSCSCA is set equal to $N$ in GAPS.
4 Experiments on benchmark clustering problems
To validate the performance of PSCSCA, we first apply
it to 12 benchmark clustering problems. The results are
compared with those of k-means, symmetry-based k-
means (SBKM) [5], the modified SBKM (MOD-SBKM)
[6], and genetic algorithm with point symmetry distance
(GAPS) [7] in terms of clustering accuracy, Minkowski Score (MS) [44], and convergence speed. In all the
algorithms, the desired number of clusters is set in
advance. Then, a detailed analysis of parameters used in
the proposed algorithm such as the size of the antibody
population $N$, the clonal scale $N_c$, the simulated annealing mutation parameter $b$, and the diversity control coefficient $\beta$ are presented. The influence of the immune
vaccine operator on the performance of PSCSCA is also
analyzed.
All parameters of PSCSCA were determined experi-
mentally and parameter analyses can be found in Sect. 4.3.
Most of the parameters used in SBKM, MOD-SBKM, and
GAPS are specified according to the original papers for the
best performance.
PSCSCA is implemented with the following parameters: the antibody population size $N = 10$, the clonal scale $N_c = 20$, the annealing temperature in immune selection $T = 1$, the simulated annealing mutation parameter $b = 5$, the diversity control coefficient $\beta = 1$, and $\varepsilon = 1 \times 10^{-5}$.
The parameters in GAPS are set as follows: the population size $N = 20$, the initial crossover probability $p_c = 0.8$, the initial mutation probability $p_m = 0.02$, and $\varepsilon = 1 \times 10^{-5}$. For k-means, SBKM, and MOD-SBKM, $\varepsilon = 1 \times 10^{-5}$.
The algorithms are implemented in Matlab and C++. All experiments were performed on a personal computer with an Intel(R) Core(TM) 1.86 GHz CPU and 2 GB RAM.
4.1 Data sets
In this subsection, four UCI data sets [45] are used to test
the performance of the proposed algorithm. They are Glass,
Iris, Lung-Cancer, and Wine. There are also four synthesized data sets [7, 8], namely, Data 1, Data 2, Data 3, and Data 4, as shown in Fig. 3. The description of all test data sets is given in Table 1.
4.2 Comparison of the clustering performance
For each data set, 20 independent runs of each algorithm
are executed and the mean clustering accuracy (CA) and its
standard variance (Std), Minkowski Score (MS), and its
standard variance (Std) are reported.
Clustering accuracy and MS are used to show the dif-
ference between the clustering results and the true parti-
tion, which are in the range of [0, 1]. The larger the CA, the
better is the clustering result, and the smaller the MS, the
better are the clustering results. Simple mathematical definitions for them are given as follows:
Considering the data set $x_j \in \Re^d$, $j = 1, 2, \ldots, n$, to be clustered into $k$ clusters, suppose the known true partition is $U = \{u_1, u_2, \ldots, u_k\}$, the clustering result is $V = \{v_1, v_2, \ldots, v_{k_1}\}$, and $\mathrm{Confusion}(i, j)$ denotes the number of data points that are both in the true cluster $u_i$ and in the cluster $v_j$. Then the clustering accuracy (CA) is defined as:

$$\mathrm{CA} = 1 - \frac{1}{n} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k_1} \mathrm{Confusion}(i, j) \qquad (22)$$
Let $a$ be the number of pairs of points that are in the same cluster in both $U$ and $V$, $b$ the number of pairs that are in the same cluster only in $V$, and $c$ the number of pairs that are in the same cluster only in $U$. MS can be calculated as follows:

$$\mathrm{MS}(U, V) = \sqrt{\frac{b + c}{a + c}} \qquad (23)$$
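For reference, MS can be computed from two label vectors as in the sketch below (ours; CA additionally requires matching predicted clusters to true classes, which we omit here):

```python
import numpy as np

def minkowski_score(true_labels, pred_labels):
    """Minkowski Score, Eq. (23): a = pairs co-clustered in both partitions,
    b = pairs co-clustered only in the prediction, c = only in the truth."""
    same_u = true_labels[:, None] == true_labels[None, :]
    same_v = pred_labels[:, None] == pred_labels[None, :]
    iu = np.triu_indices(len(true_labels), k=1)     # count each pair once
    u, v = same_u[iu], same_v[iu]
    a = np.sum(u & v)
    b = np.sum(~u & v)
    c = np.sum(u & ~v)
    return np.sqrt((b + c) / (a + c))
```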
Fig. 3 Synthesized data sets: Data 1, Data 2, Data 3, and Data 4
Table 1 Description of the data sets

Data set      Number of samples   Number of clusters   Data dimension
Glass         214                 6                    9
Iris          150                 3                    4
Lung-Cancer   32                  3                    56
Wine          178                 3                    13
Data 1        400                 2                    2
Data 2        400                 3                    2
Data 3        250                 5                    2
Data 4        400                 2                    2
4.2.1 Comparisons on accuracy and MS scores
The mean results of the two metrics, namely CA and MS, are shown in Table 2, where bold values indicate the best results among the five algorithms.
As shown in Table 2, in 20 independent runs, for Glass, the mean clustering accuracies of all five algorithms are not good. The best result among the five algorithms is that of GAPS, whose mean clustering accuracy is 59.78972 %; PSCSCA takes the second place. The mean clustering accuracy of MOD-SBKM is 56.46028 %, which is the worst result. For the MS scores, SBKM has the lowest MS score among the five algorithms, and k-means takes the second place.
For the Lung-Cancer data set, PSCSCA gets the best mean clustering accuracy, improved by 4.53 % compared to GAPS; the best MS score among the five algorithms is that of k-means, with the proposed PSCSCA taking the second place.
For Iris and Wine, in terms of both the mean clustering accuracy and the MS score, the result of PSCSCA is the best among the five algorithms. The mean clustering accuracies of PSCSCA are 3.26, 0.98, 6.51, and 2.69 % higher than those of GAPS, respectively.
For the synthesized data sets, both PSCSCA and GAPS get better results over 20 runs. All mean clustering accuracies of PSCSCA are more than 97 % for the four synthesized data sets. In particular, for Data 3 and Data 4, the mean clustering accuracies of PSCSCA are 100 %, and optimal partitions are obtained in each run since the standard deviations are zero. For Data 1, the mean clustering accuracy of PSCSCA is more than 99 %. For Data 2, the mean clustering accuracy of PSCSCA is 97.975 %, while the result of GAPS is similar and better than those of k-means, SBKM, and MOD-SBKM. Compared to GAPS, the mean clustering accuracies of PSCSCA are 3.92, 2.7, 0.03, 2.04, 10.83, and 12.6 % higher for the six synthesized data sets.
For MS, PSCSCA is superior to the other four algo-
rithms when clustering all the synthesized data sets, which
also indicates that the proposed algorithm is robust when
clustering the data sets with character of symmetry.
In conclusion, PSCSCA has good performance on clustering data sets with the character of symmetry, as indicated by both the mean clustering accuracy and MS. For Glass and Lung-Cancer, PSCSCA is not as good, since these two data sets have a more complex distribution than the other data sets, and the superiority of PSCSCA lies in dealing with data sets with the character of symmetry.
4.2.2 Comparison of the visual clustering results
The visual clustering results for Data 2 and Data 4 are shown in Figs. 4 and 5, which are the best clustering results
Table 2 Clustering results over benchmark clustering problems

Data set      Algorithm   Mean CA (Std) (%)    MS (Std)
Glass         k-means     57.48 (4.5747)       1.0557 (0.0240)
              SBKM        57.22 (3.3370)       1.0503 (0.0169)
              MOD-SBKM    56.46 (1.7680)       1.1135 (0.0198)
              GAPS        59.79 (2.5572)       1.0775 (0.0231)
              PSCSCA      58.41 (1.0771)       1.1048 (0.0279)
Iris          k-means     89.33 (0.3351)       0.6134 (0)
              SBKM        61.37 (2.3295)       0.9781 (0.0240)
              MOD-SBKM    70.56 (8.5815)       0.6961 (0.0915)
              GAPS        90.67 (0.2163)       0.5666 (0)
              PSCSCA      93.93 (0.2981)       0.4680 (0.0312)
Lung-Cancer   k-means     56.25 (3.1901)       0.8273 (0.0050)
              SBKM        52.18 (7.8272)       1.0856 (0.0269)
              MOD-SBKM    52.58 (7.9752)       1.0366 (0.0619)
              GAPS        59.69 (6.4821)       1.0151 (0.0518)
              PSCSCA      64.22 (6.0748)       1.0051 (0.0671)
Wine          k-means     92.69 (0.2749)       0.5299 (0)
              SBKM        92.08 (1.1146)       0.5745 (0.1328)
              MOD-SBKM    83.61 (10.4825)      0.8284 (0.0999)
              GAPS        93.03 (0.2867)       0.5149 (0.0194)
              PSCSCA      95.73 (0.5611)       0.4107 (0.029)
Data 1        k-means     82.50 (1.5904)       0.7599 (0)
              SBKM        90.80 (9.2615)       0.5254 (0.1533)
              MOD-SBKM    87.18 (17.922)       0.6970 (0.2236)
              GAPS        95.49 (1.4792)       0.4366 (0.0485)
              PSCSCA      99.40 (0.3828)       0.1181 (0.1015)
Data 2        k-means     94.25 (0)            0.3683 (0)
              SBKM        86.75 (10.5317)      0.4828 (0.2844)
              MOD-SBKM    86.66 (9.3128)       0.3973 (0.1303)
              GAPS        95.28 (0.1025)       0.3373 (0.0039)
              PSCSCA      97.98 (0.3652)       0.2267 (0.0186)
Data 3        k-means     99.60 (0.20)         0.1258 (0)
              SBKM        27.30 (1.7333)       1.4730 (0.1910)
              MOD-SBKM    46.77 (14.81)        0.9311 (0.3452)
              GAPS        97.96 (0.3952)       0.2832 (0.0218)
              PSCSCA      100 (0)              0 (0)
Data 4        k-means     57.50 (0.3)          0.9204 (0.0006)
              SBKM        78.95 (6.7279)       0.8873 (0.2602)
              MOD-SBKM    74.43 (5.96)         0.9197 (0.0021)
              GAPS        87.40 (11.0941)      0.8174 (0.1889)
              PSCSCA      100 (0)              0.4147 (0)
among the 20 independent runs of the five algorithms.
Because of the high dimension, the results of UCI data sets
are not shown.
As shown in Fig. 4, for Data 2, k-means shows a poor performance, while the other four algorithms do better, but all of them misplace some points at the intersection of the ring and the slant line. Many data in the ring-shaped part are wrongly partitioned into the slant portion by k-means. For SBKM, a part of the upper ring goes into the linear cluster; MOD-SBKM wrongly partitions a few data in the linear cluster into the ring-shaped part; for GAPS and PSCSCA, a part of the upper ring goes into the linear cluster, and the number of data wrongly partitioned by PSCSCA is less than that by GAPS.
Figure 5 shows the results of the five algorithms for Data 4. k-means, SBKM, and MOD-SBKM show poor performances. For GAPS, portions in the big ellipse are wrongly partitioned. The reason is that some data in the center of the big ellipse are symmetric not only to some data in the small ellipse, but also to some data in the big ellipse, and the symmetry distance $d_{ps}$ of these data from the center of the big ellipse is smaller, leading to a wrong partition. The mean accuracy produced by GAPS is 87.4 %. Experimental results show that the k-nearest-neighbor consistency strategy used in PSCSCA can help to improve the clustering results distinctly.
4.2.3 Comparison on convergent speed of GAPS
and PSCSCA
Since k-means, SBKM, and MOD-SBKM are all based on
k-means method, their convergent speeds are faster than
that of GAPS and PSCSCA, which is a population-based
evolutionary algorithm. In this subsection, we also use the
number of objective function evaluations (FEs) as a mea-
sure of the computational cost. From the data provided in Table 2, we choose a threshold value of the clustering
accuracy for each data set; this threshold value is somewhat
equal to or larger than the minimum accuracy attained by
each clustering algorithm (PSCSCA and GAPS). Now, we
run each algorithm on each data set and stop as soon as it
Fig. 4 The clustering results of Data 2: a Data 2; b clustering result by using k-means; c clustering result by using SBKM; d clustering result by using MOD-SBKM; e clustering result by using GAPS; f clustering result by using PSCSCA
achieves the threshold accuracy. The parameters of the two
algorithms are the same as those in Sect. 4.1. PSCSCA and
GAPS are performed with 20 independent runs and halted when they reach the smaller of the two accuracies in Table 2. For example, for Glass, the smaller accuracy attained by PSCSCA and GAPS is 58.41 % (Table 2), so both algorithms stop when their clustering accuracy reaches 58 %.
The mean number of function evaluations and the standard deviation are given in Table 3. For Data 1, Data 2, Data 3, Data 4, Glass, Iris, Liver-Disorder, Lung-Cancer, and New-Thyroid, the mean numbers of function evaluations of PSCSCA are less than those of GAPS by 952, 122, 139, 537, 424, 5, 224, 370, and 149 evaluations, respectively. This indicates that the convergent speed of PSCSCA is faster than that of GAPS. For Data 5, Data 6, and Wine, PSCSCA requires a larger number of mean function evaluations than GAPS. For most data sets used in this paper, the convergent speed of PSCSCA is faster than that of GAPS.
4.3 Parameter analysis
Most optimization algorithms based on natural computation are stochastic search algorithms; some parameters employed in these algorithms have a strong effect on the stability and convergent speed. In this section, we will
Fig. 5 The clustering results of Data 4: a Data 4; b clustering result by using k-means; c clustering result by using SBKM; d clustering result by using MOD-SBKM; e clustering result by using GAPS; f clustering result by using PSCSCA
Table 3 Mean and standard deviations of FEs required by PSCSCA and GAPS

Data set      Threshold of accuracy (%)   GAPS          PSCSCA
Glass         58                          792 (264)     368 (16)
Iris          90                          614 (9)       609 (16)
Lung-Cancer   59                          738 (107)     368 (16)
Wine          93                          643 (254)     672 (98)
Data 1        95                          1,544 (107)   592 (16)
Data 2        95                          803 (137)     681 (123)
Data 3        97                          905 (78)      368 (16)
Data 4        87                          912 (127)     1,033 (98)
focus on four important parameters: the size of the antibody population $N$, the clonal scale $N_c$, the simulated annealing mutation parameter $b$, and the diversity control coefficient $\beta$, and analyze their influence on the performance of the proposed algorithm.

The initial values of the parameters are set as follows: the antibody population size $N = 10$, the clonal scale $N_c = 20$, the annealing temperature in immune selection $T = 1$, the simulated annealing mutation parameter $b = 5$, the diversity control coefficient $\beta = 1$, and $\varepsilon = 1 \times 10^{-5}$. To determine proper values of $N$, $N_c$, $b$, and $\beta$, a series of experiments has been conducted on Data 4. In this section, we perform each test in 30 independent runs.
4.3.1 Sensitivity in relation to N and Nc
The experimental results of PSCSCA with $N$ increased from 5 to 25 in steps of 5 and $N_c$ increased from 10 to 40 in steps of 10 are shown in Fig. 6. The data are the statistical results of the mean number of objective function evaluations (FEs) and the mean affinity value.

The results in Fig. 6 show that the size of the antibody population and the clonal scale have a large effect on the number of function evaluations and the affinity value. The mean number of function evaluations increases linearly with $N$ and $N_c$, but the mean affinity value shows no such relationship. In addition, a larger clonal scale helps to extend the search scope, and a larger antibody population size improves the population diversity. However, considering the computational cost, choosing too large an $N$ and $N_c$ is not appropriate. Conversely, too small an $N$ and $N_c$ will make the affinity value small, so that PSCSCA cannot find the optimal partition. We conclude that the antibody population size should be set to 10–15 and the clonal scale to 20–25. In this paper, we set $N = 10$ and $N_c = 20$.
4.3.2 Sensitivity in relation to b and β

The experimental results of PSCSCA with $b$ increased from 1 to 10 in steps of 1 and $\beta$ increased from 0.5 to 2.5 in steps of 0.5 are shown in Fig. 7. The data are the statistical results of the mean number of function evaluations and the mean affinity value.

As shown in Fig. 7, the simulated annealing mutation parameter $b$ and the diversity control coefficient $\beta$ have a strong influence on the number of function evaluations and the affinity value. When $\beta \ge 1.5$, the computational cost is too high; when $b < 4$ or $b > 8$, the affinity value is poor and the optimal partition cannot be found. Balancing these two factors, the simulated annealing mutation parameter is set to 4–8 and the diversity control coefficient to 0.5–1. In this paper, we set $b = 5$ and $\beta = 1$.
Fig. 6 The influence of $N$ and $N_c$ on the performance of the proposed algorithm: a the influence on the mean FEs; b the influence on the mean affinity value

Fig. 7 The influence of $b$ and $\beta$ on the performance of the proposed algorithm: a the influence on the mean FEs; b the influence on the mean affinity value
4.3.3 Influence of immune vaccine operator
on the performance of PSCSCA
In this section, we use the above eight data sets to test the
influence of immune vaccine operator on the performance
of PSCSCA. We perform the PSCSCA with immune vac-
cine operator and the algorithm without immune vaccine
operator, respectively, for 20 runs on all test data sets. The
experimental results are the statistical results of the mean
clustering accuracy and its standard deviations, and the
average number of function evaluations and its standard
deviation.
Figure 8 shows the influence of the immune vaccine operation on the clustering accuracy. In Fig. 8, the red bars denote the mean clustering accuracy of the algorithm with the immune vaccine operation, denoted IVO for short; the blue bars denote the mean clustering accuracy of the algorithm without the immune vaccine operation, denoted no IVO for short.
As illustrated by Fig. 8, for most of the data sets used in this
paper, the algorithm with immune vaccine operation,
PSCSCA, has better mean clustering accuracy than the
algorithm without immune vaccine operation.
Figure 9 shows the influence of the immune vaccine operation on the mean FEs. In Fig. 9, the red bars denote the mean FEs of the algorithm with the immune vaccine operation (IVO); the blue bars denote the mean FEs of the algorithm without the immune vaccine operation (no IVO).
From Fig. 9, it is easy to see that the algorithm with the immune vaccine operation, PSCSCA, requires fewer function evaluations than the algorithm without it. We can conclude that the immune vaccine operation helps the algorithm converge faster.
5 Experiments on real-world image compression
Image compression applies digital compression technology
to digital images, which aims to reduce the redundancy of
image data and the volume of images so as to store and
transmit data effectively. Generally, image compression
technology could be divided into lossless and lossy image
compression. Among the various kinds of lossy compres-
sion methods, vector quantization (VQ) is one of the most
popular and widely used methods [46, 47]. VQ mainly
includes three parts: codebook generation, encoding, and
decoding, in which codebook generation is the key factor
that will affect the performance of the whole image com-
pression process. The main ideas of codebook design can
be summarized as follows. Let the number of training vectors be $M$ and the number of codewords be $N$; the codebook design problem means dividing the $M$ training vectors into $N$ clusters, which is an NP-hard problem. For large
Fig. 8 The influence of immune vaccine operation on clustering accuracy (mean clustering accuracy per data set, with and without IVO)
Fig. 9 The influence of immune vaccine operation on the mean number of FEs (mean FEs per data set, with and without IVO)
Pattern Anal Applic
123
M and N, the traditional search algorithms such as Linde–
Buzo–Gray (LBG) [48] and K-means can hardly find the
global optimal classification. Later, many algorithms for
optimal codebook design have been proposed to improve
the traditional search algorithms. One can easily infer from Sect. 4 that PSCSCA can also be used for such clustering purposes with satisfactory results. In this section, we apply PSCSCA to natural image compression.
5.1 Framework for image compression
The proposed clustering algorithm is introduced into the framework of M/RVQ [49] to cluster the image data to be compressed and obtain the optimal codebook. Let the size of the mean
codebook be n1, and the size of the residual vector code-
book be $n_2$. First, segment the image into matrix vectors and extract the mean value for quantization. Second, quantize the residual vector. The mean value adopts scalar quantization and the residual vector uses vector quantization, so the quantization error of the mean value does not affect the quantization error of the residual vector. The main steps are presented as follows.
Step 1 Segment the image into non-overlapping blocks of size $p \times p$. If a block at the image edge is smaller than $p \times p$, pad it with 0. Each block is then a matrix vector.

Step 2 Calculate the mean value $b$ of each matrix vector as follows:

$$b = \frac{1}{p^2} \sum_{i=1}^{p} \sum_{j=1}^{p} x_{i,j} \qquad (24)$$

where $x_{i,j}$ denotes the pixel value at $(i, j)$, $i, j = 1, 2, \ldots, p$, in the matrix vector.

Step 3 Cluster the mean values $b$ of all vectors by using PSCSCA and obtain $n_1$ cluster centers as the mean value codebook.

Step 4 Quantize $b$ with scalar quantization, $\bar{b} = \arg\min_{x \in \text{mean value codebook}} |b - x|$, and use the index of $\bar{b}$ in the mean value codebook as its coding $c_b$, namely, $c_b = \arg\min_{i=1,2,\ldots,n_1} |b - x_i|$, $x_i \in$ the mean value codebook.

Step 5 Calculate the residual vector $z_{i,j} = x_{i,j} - b$.

Step 6 Cluster the residual vectors $z$ by using PSCSCA and obtain $n_2$ cluster centers as the residual vector codebook.

Step 7 Quantize $z$ with vector quantization to obtain $\bar{z} = \arg\min_{x \in \text{residual vector codebook}} \|z - x\|$. Use the index of $\bar{z}$ in the residual vector codebook as its coding $c_z$, namely, $c_z = \arg\min_{i=1,2,\ldots,n_2} \|z - x_i\|$, $x_i \in$ the residual vector codebook.

Step 8 Divide the received data into the mean value coding $c_b$ and the residual vector coding $c_z$. Decode the mean value coding $c_b$ into the mean value $\bar{b}$ according to the mean value codebook, and decode the residual vector coding $c_z$ into the residual vector $\bar{z}$ according to the residual vector codebook.

Step 9 Obtain the grayscale matrix vector $\bar{b} + \bar{z}$ of the image block of size $p \times p$, where $\bar{b}$ is the decoded mean value and $\bar{z}$ is the decoded residual vector.

Step 10 According to the order of segmentation in Step 1, combine the image blocks of size $p \times p$ into an image of the same size as the original image.
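A compact sketch of this encode/decode pipeline follows (ours; codebook training by PSCSCA in Steps 3 and 6 is abstracted away, and the image dimensions are assumed to be multiples of $p$):

```python
import numpy as np

def mrvq_encode(img, mean_codebook, resid_codebook, p=4):
    """Sketch of the M/RVQ coding loop (Steps 1-7): for each p-by-p block,
    scalar-quantize its mean against the mean codebook, then
    vector-quantize the residual against the residual codebook."""
    h, w = img.shape
    codes = []
    for r in range(0, h, p):
        for c in range(0, w, p):
            block = img[r:r + p, c:c + p].astype(float)
            b = block.mean()                                      # Eq. (24)
            cb = int(np.argmin(np.abs(mean_codebook - b)))        # Step 4
            z = (block - b).ravel()                               # Step 5
            cz = int(np.argmin(((resid_codebook - z) ** 2).sum(1)))  # Step 7
            codes.append((cb, cz))
    return codes

def mrvq_decode(codes, mean_codebook, resid_codebook, shape, p=4):
    """Steps 8-10: rebuild each block as decoded mean + decoded residual."""
    h, w = shape
    out = np.empty(shape)
    it = iter(codes)
    for r in range(0, h, p):
        for c in range(0, w, p):
            cb, cz = next(it)
            out[r:r + p, c:c + p] = (mean_codebook[cb]
                                     + resid_codebook[cz].reshape(p, p))
    return out
```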
5.2 Results and analysis
Three grayscale images "Barbara", "Cameraman", and "Peppers" are used as the training images (as shown in Fig. 10), and "Lena", "Ceramic", "Cat", "Fruit", "Boat", and "Airplane" are used as the test images. The results are compared with those of LBG, self-organizing mapping (SOM) [50], GAPS, and modified K-means (Mod-KM)
Fig. 10 Training images: a Barbara (512 × 512); b Cameraman (256 × 256); c Peppers (512 × 512)
[51] in terms of the peak signal-to-noise ratio (PSNR), the running time, and the visual effect of the recovered images.
For an $m \times m$ grayscale image, the PSNR is defined as follows:

$$\mathrm{PSNR} = 10 \lg \left( \frac{255^2}{\sum_{i=1}^{m} \sum_{j=1}^{m} (x_{ij} - \bar{x}_{ij})^2 / m^2} \right) \qquad (25)$$

where $x_{ij}$ is the original pixel value and $\bar{x}_{ij}$ is the compressed pixel value.
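In code, Eq. (25) amounts to (a sketch, ours):

```python
import numpy as np

def psnr(original: np.ndarray, compressed: np.ndarray) -> float:
    """PSNR of Eq. (25) for 8-bit grayscale images (peak value 255)."""
    mse = np.mean((original.astype(float) - compressed.astype(float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```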
Parameters in PSCSCA are set as follows: the size of the antibody population $N$ is 20; the clonal size $N_c$ is 40; the annealing temperature in immune selection $T = 1$; the simulated annealing mutation parameter $b = 5$; the diversity control coefficient $\beta = 1$; the size of the mean value codebook $n_1$ is 16; the size of the residual vector codebook $n_2$ is 256; and the size of each block $p$ is 4.
The comparisons of the visual quality of the image compression using these five algorithms are presented in Figs. 11, 12, 13, 14, 15, and 16. We choose the best results obtained by the five algorithms among ten independent runs. Table 4 presents the mean PSNR and its standard variance, and Table 5 shows the mean running time of LBG, SOM, Mod-KM, GAPS, and PSCSCA over the ten independent runs. In Table 4, bold values indicate the best results among the five algorithms.
The test image Lena was compressed and recovered by
using LBG, SOM, Mod-KM, GAPS, and PSCSCA, and the
results are given in Fig. 11. As can be seen, the borders in the PSCSCA result are softer and closer to the original image than those of the other algorithms.
Figure 12 shows the visual quality of the five algorithms for Ceramic. The visual effect of PSCSCA is comparatively better, and the texture obtained by PSCSCA is more legible than those of the other four algorithms.
The test image Cat was compressed and recovered by
using LBG, SOM, Mod-KM, GAPS and PSCSCA, and the
results are given in Fig. 13. As can be seen, the hair on the tail of the cat is sharper in the PSCSCA result than in those of the other four algorithms.
The test image Fruit was compressed and recovered by
using LBG, SOM, Mod-KM, GAPS, and PSCSCA, and the
Fig. 11 Testing results with Lena: a the original image of Lena (512 × 512); b Lena recovered by using LBG; c Lena recovered by using SOM; d Lena recovered by using Mod-KM; e Lena recovered by using GAPS; f Lena recovered by using PSCSCA
Fig. 12 Testing results with Ceramic: a the original image of Ceramic (200 × 200); b Ceramic recovered by using LBG; c Ceramic recovered by using SOM; d Ceramic recovered by using Mod-KM; e Ceramic recovered by using GAPS; f Ceramic recovered by using PSCSCA
Fig. 13 Testing results with Cat: a the original image of Cat (525 × 700); b Cat recovered by using LBG; c Cat recovered by using SOM; d Cat recovered by using Mod-KM; e Cat recovered by using GAPS; f Cat recovered by using PSCSCA
Fig. 14 Testing results with Fruit: a the original image of Fruit (338 × 450); b Fruit recovered by using LBG; c Fruit recovered by using SOM; d Fruit recovered by using Mod-KM; e Fruit recovered by using GAPS; f Fruit recovered by using PSCSCA
Fig. 15 Testing results with Boat: a the original image of Boat (512 × 512); b Boat recovered by using LBG; c Boat recovered by using SOM; d Boat recovered by using Mod-KM; e Boat recovered by using GAPS; f Boat recovered by using PSCSCA
Table 4 Comparison of the mean PSNR (Std)

Image     LBG               SOM               MOD-KM            GAPS              PSCSCA
Lena      25.8654 (2.8250)  26.2623 (0.3576)  25.0783 (2.3252)  25.6916 (3.4225)  27.0018 (3.8017)
Ceramic   24.0017 (2.9439)  24.3034 (3.1638)  24.0322 (2.4654)  23.7815 (2.3516)  24.7699 (2.4732)
Cat       29.1653 (6.5859)  29.6974 (4.3643)  29.4981 (3.6593)  29.6238 (4.1523)  30.6965 (3.9877)
Fruit     24.8120 (1.8583)  25.0764 (3.4442)  24.9364 (2.8862)  24.9817 (2.9615)  25.1923 (2.9226)
Boat      24.2294 (2.4995)  24.2959 (0.1878)  26.2763 (0.7853)  27.5428 (3.3213)  27.7303 (3.1638)
Airplane  23.8514 (2.6010)  23.9629 (0.9031)  27.3021 (0.7454)  27.2531 (3.6173)  27.3759 (3.2731)

Bold values indicate the best results among the five algorithms
Table 5 Comparison of mean running time (in seconds)

Items          LBG       SOM        MOD-KM     GAPS       PSCSCA
Training time  90.6503   1.73E+02   1.11E+03   7.62E+03   7.57E+03
Testing time
  Lena         3.9723    3.9775     3.7841     3.1315     3.7554
  Ceramic      0.5594    0.5764     0.5421     0.6157     0.5574
  Cat          5.4069    5.2655     4.9398     5.3797     5.1376
  Fruit        2.1603    2.1684     2.0876     2.2816     2.1479
  Boat         4.0109    3.9680     3.7803     3.4933     3.8939
  Airplane     3.8732    3.9696     3.7240     3.5447     3.7952
Fig. 16 Testing results with Airplane: a the original image of Airplane (512 × 512); b Airplane recovered by using LBG; c Airplane recovered by using SOM; d Airplane recovered by using Mod-KM; e Airplane recovered by using GAPS; f Airplane recovered by using PSCSCA
Figure 15 shows the visual quality of the five algorithms for Boat. The visual effect of PSCSCA is comparatively better; in particular, the edge of the mast is smoother.
Figure 16 shows the visual quality of the five algorithms for Airplane. The holistic effect obtained by the proposed algorithm is better than those obtained by the others, and the characters ''F16'' are clearer in the image recovered by the proposed algorithm than in the images recovered by the others.
For the six test images used in the experiments, the compression results of the proposed method are better than those of the other methods. Compared with LBG, SOM, Mod-KM, and GAPS, the proposed method produces higher-quality edges. This is further confirmed by Table 4: for all test images, PSCSCA achieves the highest PSNR.
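The PSNR values reported in Table 4 follow the standard definition for 8-bit grayscale images, PSNR = 10 log10(255^2 / MSE), where MSE is the mean squared error between the original and the recovered image. A minimal Python sketch of this measurement (the function name psnr is ours, not taken from the paper):

import numpy as np

def psnr(original, recovered):
    """Peak signal-to-noise ratio (dB) between two 8-bit grayscale images."""
    diff = original.astype(np.float64) - recovered.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float('inf')  # the two images are identical
    return 10.0 * np.log10(255.0 ** 2 / mse)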
Figure 17 shows how the PSNR obtained by the two population-based evolutionary methods, GAPS and PSCSCA, changes with the evolutionary generation for Lena. From Fig. 17, we can see that the PSNRs obtained by both GAPS and PSCSCA increase as the evolutionary generation increases, and that PSCSCA performs better than GAPS in terms of PSNR.
The running times, including training time and testing time, of the five algorithms are given in Table 5. We can see that the training times of LBG, SOM, Mod-KM, GAPS, and PSCSCA increase in that order. However, the differences in the testing times of the five algorithms when compressing the six natural images are small, because the main differences among these algorithms lie in how the codebooks are trained, while their compression and decompression procedures are similar. Generally speaking, a better training method produces better compression results, which is the main idea behind all kinds of improved compression algorithms; however, better compression results come at a higher computational cost. Although the training times of PSCSCA and GAPS are about seven times that of Mod-KM, once training is complete, the resulting codebook can be reused to compress further test images; therefore, the increase in training time is acceptable. PSCSCA and GAPS require similar training times because they use similar methods for training codebooks.
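The similarity of the testing times has a simple explanation: whichever algorithm trained the codebook, compression reduces to a nearest-codeword search per image block, and decompression to a codebook lookup. The following Python sketch illustrates this shared encode/decode stage, assuming grayscale images, 4 × 4 blocks (16-dimensional codewords), and image dimensions divisible by the block size; the function names compress and decompress are ours:

import numpy as np

def compress(image, codebook, block=4):
    """Encode: map every block x block patch to its nearest codeword index."""
    h, w = image.shape  # assumed divisible by block
    patches = (image.reshape(h // block, block, w // block, block)
                    .swapaxes(1, 2)
                    .reshape(-1, block * block)
                    .astype(np.float64))
    # Squared Euclidean distance from every patch to every codeword
    dist = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def decompress(indices, codebook, shape, block=4):
    """Decode: replace every index by its codeword and reassemble the image."""
    h, w = shape
    patches = codebook[indices].reshape(h // block, w // block, block, block)
    return patches.swapaxes(1, 2).reshape(h, w)

Only the codebook passed to these routines differs among LBG, SOM, Mod-KM, GAPS, and PSCSCA, which is consistent with the nearly identical testing times in Table 5.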
6 Conclusion
Clustering is an important technique in data mining, and in real-world applications we often encounter data sets with the character of symmetry. In this paper, a point symmetry-based similarity measure and an immune vaccine mechanism were introduced into the classical clonal selection algorithm to cluster such data sets. Firstly, the proposed algorithm inherits the ability of CSA to combine global search with local search. Secondly, by introducing the immune vaccine operation, the proposed algorithm can utilize prior knowledge accumulated during the evolutionary process. Finally, a point symmetry-based similarity measure is used to evaluate the similarity between two samples, which makes it possible to detect both convex and non-convex clusters. Experimental results on different data sets demonstrate the superiority of PSCSCA over k-means, SBKM, Mod-SBKM, and GAPS. A promising direction for future research is the development of new cluster validity indices as well as automatic clustering methods based on the proposed algorithm.
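For reference, the point symmetry-based distance of a sample x with respect to a cluster center c, as defined in [7], multiplies the Euclidean distance d_e(x, c) by a symmetry term: the average distance from the reflected point x* = 2c - x to its knear nearest neighbors in the data set (knear = 2 in [7]). A minimal Python sketch, substituting SciPy's cKDTree for the ANN library cited in [42]; the function name ps_distance is ours:

import numpy as np
from scipy.spatial import cKDTree

def ps_distance(x, center, tree, knear=2):
    """Point symmetry-based distance of x with respect to center (cf. [7])."""
    reflected = 2.0 * center - x            # mirror image of x about the center
    d, _ = tree.query(reflected, k=knear)   # kd-tree nearest-neighbor search
    return float(np.mean(d)) * float(np.linalg.norm(x - center))

# The kd-tree is built once over the whole data set: tree = cKDTree(data)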
Acknowledgments The authors would like to thank the editor and
the reviewers for helpful comments that greatly improved the paper.
This work is supported by the National Natural Science Foundation of
China under Grant No. 61203303 and No. 61272279, the Provincial
Natural Science Foundation of Shaanxi of China (No. 2010JM8030),
and the Fundamental Research Funds for the Central Universities
(No. K50511020014).
References
1. Evangelou IE, Hadjimitsis DG, Lazakidou AA (2001) Data mining and knowledge discovery in complex image data using artificial neural networks. In: Proceedings of the workshop complex reason. http://www.cs.ucy.ac.cy/~lazakid/Publications/../p
2. Rao MR (1971) Cluster analysis and mathematical programming.
J Amer Stat Assoc 66(335):622–626
3. Lillesand T, Kiefer R (1994) Remote Sensing and Image Interpretation. Wiley, Hoboken
4. Saha S, Bandyopadhyay S (2010) A new multiobjective cluster-
ing technique based on the concepts of stability and symmetry.
Knowl Inf Syst 23:1–27
5. Su MC, Chou C-H (2001) A modified version of the k-means
algorithm with a distance based on cluster symmetry. IEEE Trans
Pattern Anal Mach Intell 23(6):674–680
Fig. 17 Variation of PSNR (dB) with evolutionary generation for Lena, obtained by GAPS and PSCSCA
6. Chou CH, Su MC, Lai E (2002) Symmetry as a new measure for
cluster validity. In: Second WSEAS international conference on
scientific computation and soft computing, pp 209–213
7. Bandyopadhyay S, Saha S (2007) GAPS: a clustering method
using a new point symmetry-based distance measure. Pattern
Recogn 40(12):3430–3451
8. Saha S, Bandyopadhyay S (2008) Application of a new sym-
metry-based cluster validity index for satellite image segmenta-
tion. IEEE Geosci Remote Sens Lett 5(2):166–170
9. Burnet MF (1957) A modification of Jerne’s theory of antibody production using the concept of clonal selection. Aust J Sci 20(1):67–76
10. De Castro LN, Von Zuben FJ (2002) Learning and optimization
using the clonal selection principle. IEEE Trans Evol Comput
6(3):239–251
11. Cutello V, Narzisi G, Nicosia G (2005) Clonal selection algorithms: a comparative case study using effective mutation potentials. In: Proceedings of 4th international conference on artificial immune systems. Lecture Notes in Computer Science, vol 3627, pp 13–28
12. Gong M et al (2009) Immune algorithm with orthogonal design
based initialization, cloning, and selection for global optimiza-
tion. Knowl Inf Syst. doi:10.1007/s10115-009-0261-8
13. Liu RC, Jiao LC (2007) An immune memory clonal strategy
algorithm. J Comput Theor Nanosci 4(7–8):1399–1404
14. De Castro LN, Von Zuben FJ (2000) An evolutionary immune
network for data clustering. In: Proceedings of the IEEE
SBRN’00 (Brazilian Symposium on Artificial Neural Networks).
Rio de Janeiro, pp 84–89
15. Li J, Gao X, Jiao L (2004) A CSA-based clustering algorithm for large data sets with mixed numeric and categorical values. In: Proceedings of the world congress on intelligent control and automation (WCICA2004), Hangzhou, pp 2003–2007
16. Liu R, Shen Z, Jiao L, Zhang W (2010) Immunodominance based clonal selection clustering algorithm. In: Proceedings of the 2010 IEEE congress on evolutionary computation, CEC2010, Barcelona, Spain, 18–23 July, pp 2912–2918
17. Liu RC, Zhang W, Jiao LC, Liu F (2010) A multiobjective
immune clustering ensemble technique applied to unsupervised
SAR image segmentation. In: Proceedings of the ACM interna-
tional conference on image and video retrieval, ACM-CIVR,
pp 158–165
18. Paterlini S, Krink T (2006) Differential evolution and particle swarm optimization in partitional clustering. Comput Stat Data Anal 50:1220–1247
19. Cowgill M, Harvey R, Watson L (1999) A genetic algorithm
approach to cluster analysis. Comput Math Appl 37(7):99–108
20. Raghavan VV, Birchand K (1979) A clustering strategy based on
a formalism of the reproductive process in a natural system. In:
Proceedings of the second international conference on informa-
tion storage and retrieval, pp 10–22
21. Falkenauer E (1998) Genetic Algorithms and Grouping Problems.
Wiley, Chichester
22. Bandyopadhyay S, Murthy CA, Pal SK (1995) Pattern classifi-
cation with genetic algorithms. Pattern Recogn Lett 16:801–808
23. Bandyopadhyay S, Murthy CA, Pal SK (1998) Pattern classifi-
cation using genetic algorithm: determination of H. Pattern Re-
cogn Lett 19:1171–1181
24. Bandyopadhyay S, Pal SK, Murthy CA (1998) Simulated
annealing based pattern classification. J Inf Sci 109:165–184
25. Bandyopadhyay S, Murthy CA, Pal SK (1999) Theoretic perfor-
mance of genetic pattern classifier. J Franklin Inst 336:387–422
26. Bandyopadhyay S, Maulik U (2002) Genetic clustering for
automatic evolution of clusters and application to image classi-
fication. Pattern Recogn 35:1197–1208
27. Bandyopadhyay S, Maulik U (2002) An evolutionary technique
based on K-means algorithm for optimal clustering in R^N. Inf Sci
146:221–237
28. Sarafis I, Zalzala AMS, Trinder P (2002) A genetic rule-based
data clustering toolkit. In: Fogel DB, El-Sharkawi MA,Yao X,
Greenwood G, Iba H, Marrow P, Shackleton M (eds), Proceed-
ings of the 2002 congress on evolutionary computation (CEC-
2002). IEEE Press, Piscataway, pp 1238–1243
29. Maulik U, Bandyopadhyay S (2000) Genetic algorithm-based
clustering technique. Pattern Recogn 33:1455–1465
30. Chiou YC, Lan LW (2001) Theory and methodology genetic
clustering algorithms. Eur J Oper Res 135:413–427
31. Bandyopadhyay S, Saha S (2007) VGAPS: a clustering method
using a new point symmetry-based distance measure. Pattern
Recogn 40(12):3430–3451
32. Das S, Abraham A, Konar A (2008) Automatic kernel clustering with a multi-elitist particle swarm optimization algorithm. Pattern Recogn Lett 29:688–699
33. Das S, Abraham A, Konar A (2008) Automatic clustering using an improved differential evolution algorithm. IEEE Trans Syst Man Cybern Part A Syst Humans 38(1):218–237
34. Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell PAMI-1(2):224–227
35. Dunn JC (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. In: Communication in Statistics, vol 3. Taylor & Francis Group, London, pp 32–57
36. Xie XL, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13(8):841–847
37. Pakhira MK, Bandyopadhyay S, Maulik U (2004) Validity index for crisp and fuzzy clusters. Pattern Recogn 37:487–501
38. Wang L, Bo LF, Jiao LC (2007) Density-sensitive semi-super-
vised spectral clustering. J Software 18(18):2412–2422
39. Meila M, Heckerman D (1998) An experimental comparison of several clustering and initialization methods. In: Proceedings of the 14th conference on uncertainty in artificial intelligence. Morgan Kaufmann, Canada, pp 386–395
40. Bandyopadhyay S, Maulik U (2001) Nonparametric genetic clustering: comparison of validity indices. IEEE Trans Syst Man Cybern Part C Appl Rev 31(1):120–125
41. Ding C, He X (2004) K-Nearest-Neighbor in data clustering:
Incorporating local information into global optimization. In:
Proceedings of the ACM Symposium on Applied Computing.
ACM Press, Nicosia, pp 584–589
42. Mount DM, Arya S (2005) ANN: a library for approximate nearest neighbor searching. http://www.cs.umd.edu/~mount/ANN
43. Jiao LC, Wang L (2000) A novel genetic algorithm based on immunity. IEEE Trans Syst Man Cybern Part A Syst Humans 30:552–561
44. Ben-Hur A, Guyon I (2003) Detecting stable clusters using principal component analysis. In: Methods in molecular biology. Humana Press, New York
45. UC Irvine Machine Learning Repository: http://archive.ics.uci.
edu/ml/datasets.html
46. Li RY et al (2002) Image compression using transformed vector
quantization. Image Vis Comput 20:37–45
47. Baker RL, Gray RM (1982) Image compression using nonadaptive spatial vector quantization. In: Proceedings of the 16th Asilomar conference on circuits, systems, and computers, pp 55–61
48. Linde Y, Buzo A, Gray R (1980) An algorithm for vector
quantizer design. IEEE Trans Commun 28:84–94
49. Baker RL (1984) Vector quantization of digital images. PhD dissertation, Stanford University, Stanford
50. Horzyk A (2005) Unsupervised clustering using self-optimizing neural networks. In: Proceedings of the 2005 5th international conference on intelligent systems design and applications (ISDA’05), pp 118–123
51. Chou CH, Su MC, Lai E (2004) A new cluster validity measure and
its application to image compression. Pattern Anal Appl 7:205–220