
A point symmetry-based clonal selection clustering algorithm and its application in image compression



SHORT PAPER

A point symmetry-based clonal selection clustering algorithm and its application in image compression

Ruochen Liu · Fei He · Jing Liu · Wenping Ma · Yangyang Li

Received: 13 February 2012 / Accepted: 17 June 2013

© Springer-Verlag London 2013

R. Liu (✉) · F. He · J. Liu · W. Ma · Y. Li
Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi'an 710071, China
e-mail: [email protected]

Pattern Anal Applic, DOI 10.1007/s10044-013-0344-8

Abstract To cluster data sets with the character of symmetry, a point symmetry-based clonal selection clustering algorithm (PSCSCA) is proposed in this paper. Firstly, an immune vaccine operator is introduced into the classical clonal selection algorithm; it extracts a priori knowledge of the problem at hand so as to accelerate convergence. Secondly, a point symmetry-based similarity measure is used to evaluate the similarity between two samples. Finally, both kd-tree-based approximate nearest neighbor searching and a k-nearest-neighbor consistency strategy are used to reduce the computational complexity and improve the clustering accuracy. In the experiments, four real-life data sets and four synthetic data sets are first used to test the performance of PSCSCA, and PSCSCA is compared with several existing algorithms in terms of clustering accuracy and convergence speed. In addition, PSCSCA is applied to a real-world application, namely natural image compression, with good performance being obtained.

Keywords Clustering · Clone selection · Point symmetry distance · Immune vaccine · Image compression

1 Introduction

Clustering is an effective tool for exploring the underlying structure of a given data set; its main objective is to partition a given data set into homogeneous groups (called clusters) in such a way that patterns within a cluster are more similar to each other than patterns belonging to different clusters. Clustering techniques have been applied to a wide variety of engineering and scientific disciplines such as medicine, psychology, biology, sociology, pattern recognition, and image processing [1–3].

Various clustering algorithms have been developed [4]. Among them, the k-means algorithm is one of the most popular: it can be implemented easily and is highly efficient. The k-means algorithm and its variants have been successfully employed in many practical applications. However, most partition-based clustering algorithms, including k-means, optimize a given objective function and assume that the number of clusters is known a priori. Moreover, these approaches apply an alternating optimization technique to obtain the final clustering results, whose iterative nature makes them sensitive to initialization and susceptible to local optima. Recently, the genetic algorithm (GA), inspired by Darwinian evolution and genetics, has provided a new approach to clustering analysis; in particular, many GA-based partitional clustering algorithms have been proposed. We review GA-based clustering techniques in Sect. 2.

Most existing evolutionary clustering algorithms employ a Euclidean distance metric to construct their fitness functions. They work well on data sets in which the natural clusters are nearly hyperspherical and linearly separable, but on complex data sets that are non-spherical and linearly non-separable, their performance drops sharply. Thus, some researchers exploit a characteristic of clusters, symmetry, to improve the quality of data partitions [5, 6]. In 2007, Bandyopadhyay et al. [7, 8] developed a genetic algorithm with point symmetry distance-based clustering (GAPS).


The authors claimed that GAPS performs well on both convex and non-convex clusters of any shape and size, as long as the clusters have some symmetry property. However, owing to limitations of GAs such as prematurity and degradation, GAPS does not achieve a prominent mean performance when executed over many runs. In particular, for some special data sets, e.g., two almost inscribed circles, GAPS has a high misclassification rate.

A promising and recently introduced approach to numerical optimization, which is rather unknown outside the search heuristics field, is the clonal selection algorithm, inspired by the clonal selection theory [9] in immunology. The clonal selection algorithm (CSA) [10–13] evolves solutions to optimization problems via repeated application of a cloning, mutation, and selection cycle to a population of candidate solutions, retaining good solutions in the population. Compared to the great number of studies on partitional clustering with GAs, only a few applications using CSA [14–17] can be found in the literature.

In this paper, a new immune algorithm for clustering data sets with the character of symmetry is proposed. Firstly, an improved CSA is employed to construct a novel clustering algorithm, in which a clonal mutation based on simulated annealing is used to improve the effectiveness of the proposed clustering algorithm. Secondly, based on immune vaccine theory, an immune vaccine operator is proposed, which helps evolve ordinary antibodies by using information from excellent antibodies so as to speed up the evolution of the whole antibody population. By introducing this immune vaccine operator into CSA, the performance of the classical clonal selection clustering algorithm is improved, and limitations such as prematurity and degradation in GA are overcome to some extent. Thirdly, to cluster data sets with the character of symmetry, the point symmetry-based distance (PS-distance) is used to construct the antibody affinity function. Moreover, both kd-tree-based approximate nearest neighbor search and k-nearest-neighbor consistency are used to reduce the computational complexity and improve the clustering accuracy.

We first apply the proposed algorithm to several real-life and synthetic data sets. To test its performance on complex problems, we then apply it to natural image compression, in which both large data sets and large codebooks are involved.

The rest of the paper is organized as follows. In Sect. 2, related background is introduced. In Sect. 3, the details of the proposed algorithm are presented. In Sect. 4, the proposed algorithm is applied to benchmark clustering problems. In Sect. 5, it is applied to several image compression problems. We conclude the paper in Sect. 6.

2 Related background

2.1 Problem definition [18]

Let $O = \{o_1, o_2, \ldots, o_n\}$ be a set of $n$ objects and $X_{n \times p}$ be the profile data matrix, with $n$ rows and $p$ columns. Each $i$th object is characterized by a real-valued $p$-dimensional profile vector $x_i$, $i = 1, 2, \ldots, n$, where each element $x_{ij}$ of $x_i$ corresponds to the $j$th real-valued feature ($j = 1, 2, \ldots, p$) of the $i$th object ($i = 1, 2, \ldots, n$).

Given $X_{n \times p}$, the goal of a partitional clustering algorithm is to determine a partition $P = \{P_1, P_2, \ldots, P_k\}$ (i.e., $\forall i,\ P_i \neq \emptyset$; $\forall i, j$ with $i \neq j$, $P_i \cap P_j = \emptyset$; $\bigcup_{i=1}^{k} P_i = X$) such that objects belonging to the same cluster are as similar to each other as possible, while objects belonging to different clusters are as dissimilar as possible.

The number of all possible partitions of a data set $X$ of $n$ elements into $k$ non-empty subsets is given as follows:

$$P(n, k) = \frac{1}{k!} \sum_{m=1}^{k} (-1)^m \binom{k}{m} (k - m)^n \qquad (1)$$

From Eq. (1) it is easy to see that even for small $n$ and $k$ the number of candidate partitions is extremely large, not to mention for large-scale clustering problems. Simple local search methods, such as hill-climbing algorithms, can be used to find a partition, but they easily get stuck in local optima and therefore cannot guarantee optimality [19]. A large number of clustering algorithms are based on iterative optimization, such as the popular k-means and its variants. These algorithms begin with a solution and then repeatedly improve it until no further (local) improvements can be made; consequently, k-means suffers from the possibility of getting trapped in local optima. Evolutionary algorithms, such as genetic algorithms, are stochastic search heuristics widely believed to be effective in solving NP-complete global optimization problems, and they provide an alternative to k-means and simulated annealing in clustering analysis.
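To make the combinatorial explosion concrete, the following minimal Python sketch evaluates Eq. (1) using the standard closed form of the Stirling number of the second kind (note that the standard form sums from m = 0); even modest n and k already give an astronomical number of partitions.

```python
# Number of ways to partition n objects into k non-empty clusters
# (Stirling number of the second kind; cf. Eq. 1).
from math import comb, factorial

def partition_count(n: int, k: int) -> int:
    # S(n, k) = (1/k!) * sum_{m=0}^{k} (-1)^m * C(k, m) * (k - m)^n
    return sum((-1) ** m * comb(k, m) * (k - m) ** n
               for m in range(k + 1)) // factorial(k)

print(partition_count(10, 3))   # 9330
print(partition_count(50, 5))   # ~7.4e32: exhaustive enumeration is hopeless
```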

2.2 Related work

Genetic algorithms have been applied to partitional clustering in many ways. The approaches can be grouped into three main categories based on their encoding strategy: (i) direct encoding of the object–cluster association; (ii) encoding of cluster-separating boundaries; and (iii) encoding of a centroid/medoid/representative point and variation parameters for each cluster.

In 1979, Raghavan and Birchand [20] proposed the first GA-based clustering algorithm, which belongs to the first approach of directly encoding the object–cluster association. The idea is to use an encoding that directly allocates the $n$ objects to $k$ clusters, so that each candidate solution consists of $n$ solution parameters with integer values in the interval $[1, k]$. Falkenauer [21] proposed an improved version to tackle the redundancy drawback of this representation scheme.

The second kind of GA approach to partitional clustering is to encode cluster-separating boundaries. Bandyopadhyay et al. [22–25] used GAs to determine hyperplanes as decision boundaries that divide the attribute feature space to separate the clusters. For this, they encode the location and orientation of a set of hyperplanes in a candidate solution representation of flexible length. Apart from minimizing the number of misclassified objects, their approach tries to minimize the number of required hyperplanes. Another interesting and more flexible approach, by Bandyopadhyay and Maulik [26], is to determine the boundaries between clusters by connected linear segments instead of rigid planes.

The third way to use GAs in partitional clustering is to encode a representative point (typically a centroid) and, optionally, a set of parameters describing the extent and shape of the variance of each cluster. Maulik and Bandyopadhyay [27] proposed GA clustering, in which each string is a sequence of real numbers representing the $k$ cluster centers; each object is then allocated to the cluster associated with its nearest representative point, where 'nearest' refers to the Euclidean distance. The fitness of a candidate solution is then computed as the adequacy of the identified partition. Many studies [28, 29] have shown that this approach converges toward the optimal partition more robustly than classic partitional algorithms. In 2010, Gong et al. [30] proposed a manifold evolutionary clustering algorithm (MEC). In MEC, each individual is a sequence of integers representing the sequence numbers of the $k$ cluster representatives. The length of a chromosome is $k$ words, of which the first gene represents the first cluster, the second represents the second cluster, and so on. MEC has been applied to image classification problems.

Recently, there have been two new trends in the study of partitional clustering based on stochastic search heuristics. Some researchers try to determine the number of clusters automatically by using natural computation techniques such as GA, particle swarm optimization (PSO), differential evolution (DE), and CSA [31–33].

On the other hand, some researchers have proposed special clustering validity indices for particular kinds of data sets. Most existing work focuses on proposing a certain cluster validity index and finding the optimal partition for some special data. The measures of clustering validity are usually based on the Euclidean distance, such as the Davies–Bouldin (DB) index [34], Dunn's index [35], the Xie–Beni (XB) index [36], and the PBM index proposed by Bandyopadhyay et al. [37], the last being the most commonly used among the Euclidean-distance-based validity measures. Bo et al. [38] designed a density-sensitive similarity measure, or manifold distance, to define a cluster validity index for data with manifold characteristics. Swagatam et al. [31] used a kernel-function-induced index to cluster some special data. Each approach has its own merits and disadvantages. Meila and Heckerman provide a comparison of some clustering methods and initialization strategies in [39]. Bandyopadhyay and Maulik [40] have provided a comparison of several validity indices that do not assume any underlying distribution of the data sets, used with non-parametric genetic clustering algorithms.

In this paper, we introduce a new cluster validity index, namely a point symmetry distance-based index, and give a brief review of it. The first symmetry-based clustering technique was proposed by Su and Chou [5]. They assign points to a particular cluster if the points present a symmetrical structure with respect to the cluster center. A new type of non-metric distance, based on point symmetry, was proposed and used in a k-means-based clustering algorithm, referred to as the symmetry-based k-means (SBKM) algorithm. SBKM provides good performance on different types of data sets where the clusters have internal symmetry. However, it can be shown that SBKM fails for some data sets where the clusters themselves are symmetrical with respect to some intermediate point, since this point symmetry distance ignores the Euclidean distance in its computation. Although this was addressed in a subsequent paper by Chou et al. [6], where a modification (MOD-SBKM) is suggested, the modified measure has the same limitation as the previous one. In 2007, Bandyopadhyay et al. [7, 8] developed a GA with point symmetry distance-based clustering (GAPS). In GAPS, chromosomes in the initial population are encoded as real-coded strings. In the evolutionary procedure, adaptive mutation and crossover probabilities are used, and a new point symmetry-based distance (PS-distance), which incorporates both the Euclidean distance and a measure of symmetry, is proposed. GAPS performs well on both convex and non-convex clusters of any shape and size, as long as the clusters have some symmetry property.

3 Point symmetry-based clonal selection clustering algorithm

3.1 Clonal selection principle

The clonal selection theory, a famous theory in immunology, was put forward by Burnet [9]. Its main idea is that an antigen selectively reacts with antibodies, which are natively produced and spread on the cell surface in the form of peptides. When an antigen is detected, those antibodies that best recognize it proliferate by cloning; this process is called the clonal selection principle. The newly cloned cells undergo a high rate of mutation, or hypermutation, to increase the diversity of the receptor population (called the repertoire); the mutations experienced by the clones are proportional to their affinity to the antigen. Based on the theory of clonal selection and hypermutation, CLONALG was proposed by de Castro and Von Zuben [10]. In CLONALG, only a portion of the fittest antibodies are selected to be cloned, proportionally to their antigenic affinities. According to the antigenic affinity of each antibody, CLONALG creates a proportion of new antibodies to replace the antibodies with the lowest Ab–Ab fitness values in the current population and reserves the antibodies with the best fitness values for recycling. Later on, further optimization algorithms of this kind were proposed, but up to now few of them have been applied to clustering problems.

3.2 Point symmetry-based clonal selection clustering algorithm (PSCSCA)

The main loop of the proposed point symmetry-based clonal selection clustering algorithm is given in Algorithm 1. Considering clustering the data set $x_j \in \mathbb{R}^d$, $j = 1, 2, \ldots, n$ into $k$ subsets $c_1, c_2, \ldots, c_k$, $k \le n$, the key operators employed in the proposed algorithm are presented in the following sections.
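Algorithm 1 appears only as a figure in the published version. The following Python sketch is a hypothetical reconstruction of a clonal selection clustering loop in its spirit; for brevity it scores antibodies with the plain Euclidean within-cluster distance rather than the PS-distance of Eq. (6), and it omits the vaccine operator and the k-means refinement described below.

```python
import numpy as np

def affinity(centers, data):
    # 1 / (total distance of each point to its nearest center), cf. Eq. (6)
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return 1.0 / (d.min(axis=1).sum() + 1e-12)

def csa_cluster(data, k, N=10, Nc=20, max_gen=100, beta=5, seed=0):
    rng = np.random.default_rng(seed)
    lb, ub = data.min(axis=0), data.max(axis=0)
    pop = rng.uniform(lb, ub, size=(N, k, data.shape[1]))  # N antibodies, each = k centers
    for gen in range(max_gen):
        aff = np.array([affinity(a, data) for a in pop])
        q = np.ceil(Nc * aff / aff.sum()).astype(int)      # clone scale, cf. Eq. (16)
        scale = (1 - gen / max_gen) ** beta                # annealing schedule, cf. Eq. (17)
        new_pop = []
        for a, qi in zip(pop, q):
            noise = rng.normal(0.0, 1.0, size=(max(int(qi), 1),) + a.shape)
            clones = a + noise * scale * (ub - lb)
            cand = np.concatenate([a[None], clones])       # parent competes with its clones
            new_pop.append(max(cand, key=lambda c: affinity(c, data)))
        pop = np.array(new_pop)
    return max(pop, key=lambda a: affinity(a, data))       # best k centers found

# toy usage: two well-separated blobs
pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 6.0])
print(csa_cluster(pts, k=2, max_gen=40))
```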

3.2.1 Antibody encoding and population initialization

In the proposed algorithm, each antibody is encoded as a real-valued vector representing the coordinates of the cluster centers. That is, antibody $A_i$ encodes the centers of $k$ clusters in the $d$-dimensional space, and its length is $L_i = k \times d$:

$$A_i = \{\underbrace{a_{11} a_{12} \ldots a_{1d}}_{c_1} \;\ldots\; \underbrace{a_{i1} \ldots a_{id}}_{c_i} \;\ldots\; \underbrace{a_{k1} a_{k2} \ldots a_{kd}}_{c_k}\} \qquad (2)$$

where $c_1, c_2, \ldots, c_k$ are the cluster centers.

First, we initialize the antibody population randomly, and five iterations of the k-means algorithm are executed on the set of centers encoded in each antibody of the initial population. The refined antibody population $A(0)$ is then evolved by the CSA. Note that running more than five iterations of k-means did not improve the clustering in our experiments.
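A minimal sketch of this encoding and initialization, assuming NumPy and a hand-rolled Lloyd update (the helper below is illustrative, not the authors' code):

```python
import numpy as np

def init_population(data, k, N, rng=np.random.default_rng(0)):
    n, d = data.shape
    lb, ub = data.min(axis=0), data.max(axis=0)
    pop = []
    for _ in range(N):
        centers = rng.uniform(lb, ub, size=(k, d))
        for _ in range(5):                        # five k-means refinement passes
            dist = np.linalg.norm(data[:, None] - centers[None], axis=2)
            label = dist.argmin(axis=1)
            for j in range(k):
                if np.any(label == j):            # skip empty clusters
                    centers[j] = data[label == j].mean(axis=0)
        pop.append(centers.ravel())               # antibody A_i, length L_i = k*d (Eq. 2)
    return np.array(pop)

print(init_population(np.random.rand(60, 2), k=3, N=5).shape)   # (5, 6)
```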

3.2.2 PS distance-based antibody affinity

Symmetry is a characteristic of geometrical shapes, equations, and other objects; we say that such an object is symmetric with respect to a given operation if this operation, when applied to the object, does not appear to change it. The three main symmetry operations are reflection, rotation, and translation. A reflection 'flips' an object over a line, inverting it to its mirror image; a rotation turns an object about a point, its center. Strictly speaking, in this paper, data with the character of symmetry means data with approximate rotational symmetry. In other words, data with the character of symmetry include three kinds of instances: (1) different clusters are symmetrical with respect to some intermediate point, which is called external symmetry; (2) points in each cluster are symmetrical with respect to their cluster center, which is called within-cluster symmetry; and (3) data combining external symmetry with within-cluster symmetry.

To cluster data with the character of symmetry, Bandyopadhyay et al. [8] proposed a new definition of the PS-based distance that overcomes the limitations of the definition proposed in [6]. Based on their PS-based distance, a GA with PS-based distance (GAPS) was proposed, and the experimental results show that GAPS performs well on both convex and non-convex clusters of any shape and size as long as the clusters have some symmetry property. The new PS-based distance is defined as:

$$d_{ps}(x, c) = \frac{d_1 + d_2}{2} \times d_e(x, c) \qquad (3)$$

where $d_e(x, c)$ is the Euclidean distance between $x$ and $c$, $x'$ is the symmetric point of $x$ with respect to the center $c$ (i.e., $x' = 2c - x$), and $d_1$ and $d_2$ are, respectively, the Euclidean distances between $x'$ and its first and second nearest neighbors, as shown in Fig. 1.

In Fig. 1, $d_{e1}(x, c_1)$ is the Euclidean distance between $x$ and $c_1$, $x'$ is the symmetric point of $x$ with respect to the center $c_1$, and $d_1$ and $d_2$ are, respectively, the Euclidean distances between $x'$ and its first nearest neighbor $x_1$ and second nearest neighbor $x_2$. Likewise, $d_{e2}(x, c_2)$ is the Euclidean distance between $x$ and $c_2$, $x''$ is the symmetric point of $x$ with respect to the center $c_2$, and $d_3$ and $d_4$ are, respectively, the Euclidean distances between $x''$ and its first nearest neighbor $x_3$ and second nearest neighbor $x_4$.

According to Eq. (3), the following equations hold:

$$d_{ps}(x, c_1) = \frac{d_1 + d_2}{2} \times d_{e1}(x, c_1) \qquad (4)$$

$$d_{ps}(x, c_2) = \frac{d_3 + d_4}{2} \times d_{e2}(x, c_2) \qquad (5)$$

If $\frac{d_1 + d_2}{2} > \frac{d_3 + d_4}{2}$ and $d_{e1}(x, c_1) \ll d_{e2}(x, c_2)$, then $d_{ps}(x, c_1) < d_{ps}(x, c_2)$. Taking both $d_1$ and $d_2$ into account makes the PS distance more robust and noise resistant.

Based on this new definition of the PS-based distance proposed by Bandyopadhyay et al., we define the antibody affinity as follows:

$$\mathrm{aff}(A_j) = \frac{1}{\sum_{j=1}^{k} \sum_{x_i \in c_j} d_{ps}(x_i, c_j)} \qquad (6)$$

The antibody affinity is maximized to evolve the proper cluster centers and obtain the optimal partition.
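As an illustration, the following sketch computes the PS-distance of Eq. (3) and the affinity of Eq. (6), assuming SciPy's cKDTree as the nearest-neighbor structure (the paper uses an approximate kd-tree search, Sect. 3.2.3; exact queries are used here for simplicity).

```python
import numpy as np
from scipy.spatial import cKDTree

def ps_distance(x, c, tree):
    x_sym = 2.0 * c - x                    # reflected point x' of x about center c
    (d1, d2), _ = tree.query(x_sym, k=2)   # distances from x' to its two nearest neighbors
    return (d1 + d2) / 2.0 * np.linalg.norm(x - c)   # Eq. (3)

def ps_affinity(centers, data, tree):
    # Assign each point to the center with the smallest PS-distance, then
    # invert the total PS-distance (Eq. 6): higher affinity = tighter fit.
    total = sum(min(ps_distance(x, c, tree) for c in centers) for x in data)
    return 1.0 / (total + 1e-12)

rng = np.random.default_rng(0)
data = rng.random((200, 2))
tree = cKDTree(data)                       # built once per data set
centers = rng.random((3, 2))
print(ps_affinity(centers, data, tree))
```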

3.2.3 k-nearest-neighbor consistency and kd-tree-based approximate nearest neighbor

GAPS works well for most data sets with the symmetry property; however, we found that GAPS also misestimates some special data sets, which results in low accuracy. Consider, for example, the data set with the geometrical shape of two adjacent cirques shown in Fig. 2. The symmetrical point of $x$ with respect to $c_1$ is $x'$ and the symmetrical point of $x$ with respect to $c_2$ is $x''$; $d_1$ and $d_2$ are the Euclidean distances from $x'$ to its first and second nearest neighbors, $d_3$ and $d_4$ are the Euclidean distances from $x''$ to its first and second nearest neighbors, and $d_{e1}$ and $d_{e2}$ are the Euclidean distances from $x$ to $c_1$ and $c_2$, respectively. Based on Eq. (3), we can compute the PS-based distances of $x$ with respect to $c_1$ and $c_2$ as:

$$d_{ps}(x, c_1) = \frac{d_1 + d_2}{2} \times d_{e1}(x, c_1) \qquad (7)$$

$$d_{ps}(x, c_2) = \frac{d_3 + d_4}{2} \times d_{e2}(x, c_2) \qquad (8)$$

Fig. 1 The PS-based distance proposed by Sanghamitra et al.
Fig. 2 An example of Sanghamitra's point symmetry distance

Obviously, $d_{ps}(x, c_1) > d_{ps}(x, c_2)$, so $x$ will be misplaced into the bigger cirque; this is remedied by using k-nearest-neighbor consistency in the data clustering.

Nearest neighbor consistency is a central concept in statistical pattern recognition, especially in the k-nearest-neighbor classification method with its strong theoretical foundation. Ding et al. [41] extended this concept to data clustering, requiring that for any data point in a cluster, its k-nearest neighbors and mutual nearest neighbors should also be in the same cluster. In the above example, according to k-nearest-neighbor consistency, the first nearest neighbor $y$ and the second nearest neighbor $z$ of point $x$ should be in the same cluster as $x$, which avoids the misclassification. In this paper, k-nearest-neighbor consistency is combined with the PS-based distance in the proposed clustering algorithm.

As described above, the algorithm must compute the k-nearest neighbors of each $x_i$ for the k-nearest-neighbor consistency check, and must find the first and second nearest neighbors of the reflected point in the symmetrical distance of Eq. (3). To reduce the computational complexity, approximate nearest neighbor (ANN) search using the kd-tree approach [42] is adopted. Suppose a set of data points in $d$-dimensional space is given. These points are preprocessed into a kd-tree structure so that, given any query point $q$, the nearest or, more generally, the k-nearest points to $q$ can be reported efficiently. The Euclidean distance is used as the distance between two points.
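One simple way to realize the consistency check with a kd-tree, again assuming SciPy's cKDTree and interpreting consistency as a majority vote over each point's k nearest neighbors (a simplification of the strategy in [41]):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_consistent_labels(data, labels, k=2):
    tree = cKDTree(data)
    _, idx = tree.query(data, k=k + 1)      # k+1: each point is its own nearest hit
    new_labels = labels.copy()
    for i, neigh in enumerate(idx):
        votes = labels[neigh[1:]]           # labels of the k true neighbors
        new_labels[i] = np.bincount(votes).argmax()
    return new_labels

rng = np.random.default_rng(0)
data = rng.random((100, 2))
labels = rng.integers(0, 2, size=100)
print(knn_consistent_labels(data, labels, k=2)[:10])
```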

3.2.4 Immune vaccine operator

3.2.4.1 Related definitions We first introduce the definitions of vaccine, vaccination, and immune selection [43] in artificial immune systems. The antibody coding of a variable $x$ is $a = (a_1, a_2, \ldots, a_l)$, denoted $a = e(x)$. Following the biological analogy, $a_i$ is called a genetic gene or allele of the antibody $a$, and the value of $a_i$ depends on the encoding method. The variable $x$ is the decoding of antibody $a$, denoted $x = e^{-1}(a)$.

Definition 1 A vaccine is an estimate of the genes of the optimal individual. Considering the maximization of the affinity function $\mathrm{aff}$, if for antibody $a = (a_1, a_2, \ldots, a_l)$ and $\forall b \in A$, $b = (b_1, b_2, \ldots, b_l)$, the following holds:

$$\mathrm{aff}(a) = \mathrm{aff}(e^{-1}(a_1, \ldots, a_i, \ldots, a_l)) \ge \mathrm{aff}(b) = \mathrm{aff}(e^{-1}(b_1, \ldots, b_i, \ldots, b_l)) \qquad (9)$$

then antibody $a = (a_1, a_2, \ldots, a_l)$ is called a choiceness (elite) individual of the antibody population $A$. Let $H = \{a \mid a \text{ is a choiceness individual of the antibody population}\}$. If a character $s$ (a real number or a binary value) satisfies $\forall a \in H,\ a_i = s$, then $s$ is the $i$th choiceness allele bit of $A$, denoted $A_i$. A string $v$ is called a mode if, for $\forall i = 1, 2, \ldots, l$,

$$v_i = \begin{cases} A_i & \text{the choiceness allele bit of } A \\ * & \text{otherwise} \end{cases} \qquad (10)$$

Such a $v$ is called a vaccine of the antibody population $A$.

Vaccination is the operation of amending some gene bits of an antibody $b$ according to the extracted vaccine, so as to gain a higher affinity with greater probability, namely

$$b'_i = \begin{cases} v_i & v_i \neq * \\ b_i & v_i = * \end{cases}, \quad i = 1, 2, \ldots, l \qquad (11)$$

The immune selection consists of the immune test and the simulated annealing selection; it is introduced in the following section.

3.2.4.2 Immune vaccine operator The immune vaccine operator comprises three operations: extract vaccine, vaccinate, and immune selection. Suppose the current antibody population is $A(t) = \{A_1, A_2, \ldots, A_N\}$, the encoding length of each antibody $A_i$ is $l$, the vaccine set is $v = \{v_1, v_2, \ldots, v_l\}$, and the antibody affinities of $A(t)$ are $\mathrm{Aff}(t) = \{\mathrm{aff}_1, \mathrm{aff}_2, \ldots, \mathrm{aff}_N\}$. The probability of extracting the vaccine from each antibody is $P_{ex} = \{P_{ex}(A_1), \ldots, P_{ex}(A_N)\}$, where:

$$P_{ex}(A_i) = \mathrm{aff}(A_i) \Big/ \sum_{j=1}^{N} \mathrm{aff}(A_j) \qquad (12)$$

The main loops of the immune vaccine operator are given in Algorithms 2 and 3.

Too large a vaccine set can hurt performance, and inferior vaccines extracted from the initial antibody population can lead to a vaccine set of bad quality. To deal with these two problems, low-quality vaccines are deleted from the vaccine set after the vaccinate operation. The vaccinate operation proceeds as follows.
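The corresponding listings also appear only as figures in the published version. Below is a hypothetical sketch of vaccine extraction (Eq. 12) and vaccination (Eq. 11) on real-valued antibodies, using NaN in place of the undefined symbol '*' and an agreement tolerance, since real-valued genes rarely coincide exactly.

```python
import numpy as np

def extract_vaccine(pop, aff, rng, tol=1e-3):
    # Sample elite antibodies in proportion to affinity, Eq. (12); keep a
    # gene in the vaccine only where the sampled elites (nearly) agree.
    p = aff / aff.sum()
    elite = pop[rng.choice(len(pop), size=max(2, len(pop) // 2), p=p)]
    agree = np.ptp(elite, axis=0) < tol                 # genes the elites share
    return np.where(agree, elite.mean(axis=0), np.nan)  # NaN plays the role of '*'

def vaccinate(antibody, vaccine):
    # Eq. (11): take v_i where the vaccine is defined, keep b_i elsewhere.
    return np.where(np.isnan(vaccine), antibody, vaccine)

rng = np.random.default_rng(0)
pop = rng.random((10, 6))          # 10 antibodies with 6 real-valued genes
aff = rng.random(10)
v = extract_vaccine(pop, aff, rng)
print(vaccinate(pop[0], v))
```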

The immune selection includes two operations. The first is the immune test, i.e., testing the antibodies: if the affinity of an antibody is smaller than that of its parent, serious degeneration must have happened during vaccination, and the parent, instead of this individual, participates in the next competition. The second is the simulated annealing selection, i.e., selecting an individual $A_i$ from the present offspring to join the new parent population with probability $P_{vc}(A_i)$:

$$P_{vc}(A_i) = \frac{\exp(\mathrm{aff}(A_i)/T)}{\sum_{i=1}^{N} \exp(\mathrm{aff}(A_i)/T)} \qquad (13)$$

where $T$ is the annealing temperature and $N$ is the size of the population.

The extract vaccine and vaccinate operations are important in the proposed algorithm. Based on a reasonable selection of vaccines, the immune vaccine operator allows the proposed algorithm to improve the affinity of the antibody population and speed up the convergence rate.

3.2.5 Clone

Consider an antibody population $A(t)$ at generation $t$. After the clone operation, the new antibody population $A^c(t)$ is obtained as follows:

$$A^c(t) = T_c^C(A(t)) = \{A_i^c(t) \mid i = 1, \ldots, N\} = \{T_c^C(A_1(t)), \ldots, T_c^C(A_i(t)), \ldots, T_c^C(A_N(t))\} \qquad (14)$$

where $T_c^C(A_i(t)) = I_i \times A_i(t)$, $i = 1, 2, \ldots, N$, and $I_i$ is a vector of ones with length $q_i$, so that the antibody set $A_i^c(t)$ contains $q_i$ copies of antibody $A_i$:

$$q_i(t) = g(N_c, \mathrm{aff}(A_i(t))) \qquad (15)$$

where $\mathrm{aff}(\cdot)$ is the antibody affinity with the antigen and $N_c$ is a constant related to the total scale of antibodies to be cloned. Generally, $q_i(t)$ is defined as:

$$q_i(t) = \mathrm{Int}\left(N_c \cdot \frac{\mathrm{aff}(A_i(t))}{\sum_{j=1}^{N} \mathrm{aff}(A_j(t))}\right), \quad i = 1, 2, \ldots, N \qquad (16)$$

where $\mathrm{Int}(x)$ returns the smallest integer greater than or equal to $x$. Thus, the clone scale of a given antibody is adjusted to its affinity: the larger the affinity, the larger the clone scale, and vice versa.
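A minimal sketch of Eqs. (14)–(16):

```python
import numpy as np

def clone(pop, aff, Nc):
    # Each antibody receives a number of copies proportional to its affinity.
    q = np.ceil(Nc * aff / aff.sum()).astype(int)   # Int() = ceiling, Eq. (16)
    return [np.repeat(a[None], qi, axis=0) for a, qi in zip(pop, q)], q

pop = np.random.rand(4, 6)                          # 4 antibodies, 6 genes
aff = np.array([1.0, 2.0, 3.0, 4.0])
clones, q = clone(pop, aff, Nc=20)
print(q)                                            # [2 4 6 8]
```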

3.2.6 Hypermutation

In the proposed algorithm, a simulated annealing mutation is used. After the clone operation, each antibody has been expanded into a corresponding seed antibody population whose size depends on its affinity value. All seed antibody populations then undergo a simulated annealing mutation, in which the extent of mutation decreases along the evolutionary process. Suppose the maximum evolutionary generation is $m$; for an antibody $A_i$ with $A_i \in [lb, ub]$ and two random numbers $r_1, r_2 \in [0, 1]$, the simulated annealing mutation at generation $i$ is defined as:

$$A'_i = \begin{cases} A_i + (ub - A_i)\, r_1 \left(1 - \dfrac{i}{m}\right)^{\beta}, & r_1 < 0.5 \\[2mm] A_i - (A_i - lb)\, r_2 \left(1 - \dfrac{i}{m}\right)^{\beta}, & r_2 \ge 0.5 \end{cases} \qquad (17)$$

where $\beta$ is a real parameter that controls the annealing speed. The initial antibody population $A(t)$ is not involved in the mutation operation, in order to protect the important antibody genes it carries.
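A minimal sketch of Eq. (17), interpreting the two branches as a single coin flip on r1 (an assumption; the printed equation conditions the two branches on two different random numbers):

```python
import numpy as np

def sa_mutate(A, lb, ub, i, m, beta=5, rng=np.random.default_rng(0)):
    r1, r2 = rng.random(2)
    shrink = (1.0 - i / m) ** beta        # perturbation range decays over generations
    if r1 < 0.5:
        return A + (ub - A) * r1 * shrink  # move toward the upper bound
    return A - (A - lb) * r2 * shrink      # move toward the lower bound

A = np.array([2.0, 4.0])
print(sa_mutate(A, lb=np.zeros(2), ub=np.ones(2) * 10, i=10, m=100))
```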

3.2.7 Clonal selection $T_s^C$

After the mutation operation, the best antibody, i.e., the one with maximal affinity, is kept in the new antibody population with a probability $p_{ts}$. The best antibody is saved in $B_i(t)$ as follows:

$$B_i(t) = \{A_{ij}(t) \mid j = \arg\max_j [\mathrm{aff}(A_{ij}(t))],\; j = 1, \ldots, q_i(t) - 1\} \qquad (18)$$

The probability $p_{ts}$ with which the new antibody $B_i(t)$ displaces the initial antibody $A_i(t)$ is defined as:

$$p_{ts}(A_i(t+1) = B_i(t)) = \begin{cases} 1 & \mathrm{aff}(B_i(t)) > \mathrm{aff}(A_i(t)) \\[1mm] \exp\left(-\dfrac{\mathrm{aff}(A_i(t)) - \mathrm{aff}(B_i(t))}{b}\right) & \mathrm{aff}(B_i(t)) \le \mathrm{aff}(A_i(t)) \text{ and } A_i(t) \text{ is not the current best antibody} \\[1mm] 0 & \mathrm{aff}(B_i(t)) \le \mathrm{aff}(A_i(t)) \text{ and } A_i(t) \text{ is the current best antibody} \end{cases} \qquad (19)$$

where $b > 0$ is a constant related to the diversity of the antibody population. Generally, the greater the $b$, the better the diversity, and vice versa.

3.3 Complexity analysis

Suppose the size of the population is $N$, the maximum generation is $M$, and the dimension of the data set is $d$. The time complexity of constructing a $d$-dimensional kd-tree over $n$ data points is $O(dn \log n)$, and the complexity of retrieving the two nearest neighbors of a point by kd-tree is $O(d \log n)$. From the above flow of calculating the affinity, the $n$ data points are clustered into $k$ classes, giving $n \times k$ symmetry points, and for each symmetry point the algorithm requires its two nearest neighbors. So, the complexity of calculating each antibody affinity is $O(dnk \log n)$, and the complexity of maintaining the nearest-neighbor consistency is $O(dn \log n)$. The affinity is updated $N$ times in each generation, so the per-generation complexity is $O(Ndkn \log n)$, and the time required by the immune vaccine operator in each generation is $O(dkN)$. The complexity of GAPS is:

$$O(dn \log n + MNdkn \log n) = O(MNdkn \log n) \qquad (20)$$

For PSCSCA, with population size $N$, clonal size $N_c$, maximum generation $M$, and data dimension $d$, the complexity is:

$$O(dn \log n + M(N_c dkn \log n + dkN_c)) = O(MN_c dkn \log n) \qquad (21)$$

Generally, only a constant coefficient differs, and the complexity of PSCSCA is the same as that of GAPS if the clonal size $N_c$ in PSCSCA is set equal to $N$ in GAPS.

4 Experiments on benchmark clustering problems

To validate the performance of PSCSCA, we first apply it to 12 benchmark clustering problems. The results are compared with those of k-means, symmetry-based k-means (SBKM) [5], the modified SBKM (MOD-SBKM) [6], and the genetic algorithm with point symmetry distance (GAPS) [7] in terms of clustering accuracy, Minkowski Score (MS) [44], and convergence speed. In all the algorithms, the desired number of clusters is set in advance. We then present a detailed analysis of the parameters used in the proposed algorithm, namely the size of the antibody population $N$, the clonal scale $N_c$, the simulated annealing mutation parameter $\beta$, and the diversity control coefficient $b$. The influence of the immune vaccine operator on the performance of PSCSCA is also analyzed.

All parameters of PSCSCA were determined experimentally; the parameter analyses can be found in Sect. 4.3. Most of the parameters used in SBKM, MOD-SBKM, and GAPS are specified according to the original papers for the best performance.

PSCSCA is implemented with the following parameters: the antibody population size $N = 10$, the clonal scale $N_c = 20$, the annealing temperature in immune selection $T = 1$, the simulated annealing mutation parameter $\beta = 5$, the diversity control coefficient $b = 1$, and $\varepsilon = 10^{-5}$. The parameters of GAPS are set as follows: the population size $N = 20$, the initial crossover probability $p_c = 0.8$, the initial mutation probability $p_m = 0.02$, and $\varepsilon = 10^{-5}$. For k-means, SBKM, and MOD-SBKM, $\varepsilon = 10^{-5}$. The algorithms are implemented in Matlab and C++. All experiments were performed on a personal computer with an Intel(R) Core(TM) 1.86 GHz CPU and 2 GB RAM.



4.1 Data sets

In this subsection, four UCI data sets [45] are used to test the performance of the proposed algorithm: Glass, Iris, Lung-Cancer, and Wine. There are also four synthesized data sets [7, 8], namely Data 1, Data 2, Data 3, and Data 4, as shown in Fig. 3. A description of all test data sets is given in Table 1.

4.2 Comparison of the clustering performance

For each data set, 20 independent runs of each algorithm are executed, and the mean clustering accuracy (CA) with its standard deviation (Std) and the Minkowski Score (MS) with its standard deviation (Std) are reported.

Clustering accuracy and MS measure the difference between the clustering result and the true partition; CA lies in the range [0, 1]. The larger the CA, the better the clustering result; the smaller the MS, the better the clustering result. Simple mathematical definitions are given as follows.

Consider the data set $x_j \in \mathbb{R}^d$, $j = 1, 2, \ldots, n$, to be clustered into $k$ clusters. Suppose the known true partition is $U = \{u_1, u_2, \ldots, u_k\}$, the clustering result is $V = \{v_1, v_2, \ldots, v_{k_1}\}$, and $\mathrm{Confusion}(i, j)$ denotes the number of data points that are both in the true cluster $u_i$ and in the cluster $v_j$. Then the clustering accuracy (CA) is defined as:

$$CA = 1 - \frac{1}{n} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k_1} \mathrm{Confusion}(i, j) \qquad (22)$$

Let $a$ be the number of pairs of points that are in the same cluster in both $U$ and $V$, $b$ be the number of pairs that are in the same cluster only in $V$, and $c$ be the number of pairs that are in the same cluster only in $U$. MS is calculated as:

$$MS(U, V) = \sqrt{\frac{b + c}{a + c}} \qquad (23)$$
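For reference, a minimal Python sketch of both metrics; for CA, clusters are matched to true classes greedily via the confusion matrix (the paper does not spell out the matching step), and for MS the pairs are counted directly.

```python
import numpy as np
from itertools import combinations

def clustering_accuracy(true, pred):
    k1, k2 = true.max() + 1, pred.max() + 1
    conf = np.zeros((k1, k2), int)
    for t, p in zip(true, pred):
        conf[t, p] += 1
    # fraction of points left after matching each cluster to its dominant class
    matched = sum(conf[:, j].max() for j in range(k2))
    return matched / len(true)

def minkowski_score(true, pred):
    a = b = c = 0
    for i, j in combinations(range(len(true)), 2):
        same_t, same_p = true[i] == true[j], pred[i] == pred[j]
        a += same_t and same_p          # together in both U and V
        b += same_p and not same_t      # together only in V
        c += same_t and not same_p      # together only in U
    return np.sqrt((b + c) / (a + c))

true = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([0, 0, 1, 1, 1, 1])
print(clustering_accuracy(true, pred), minkowski_score(true, pred))
```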

Fig. 3 Synthesized data sets (scatter plots of Data 1, Data 2, Data 3, and Data 4)

Table 1 Description of the data sets

Data set      Number of samples   Number of clusters   Data dimension
Glass         214                 6                    9
Iris          150                 3                    4
Lung-Cancer   32                  3                    56
Wine          178                 3                    13
Data 1        400                 2                    2
Data 2        400                 3                    2
Data 3        250                 5                    2
Data 4        400                 2                    2


4.2.1 Comparisons on accuracy and MS scores

The mean results of the two metrics, CA and MS, are shown in Table 2. Bold values in Table 2 denote the best results among the five algorithms.

As shown in Table 2, over 20 independent runs on Glass, the mean clustering accuracies of all five algorithms are not good. The best result is that of GAPS, whose mean clustering accuracy is 59.78972 %; PSCSCA takes second place. The mean clustering accuracy of MOD-SBKM is 56.46028 %, which is the worst result. For the MS scores, SBKM has the lowest MS score of the five algorithms, and k-means takes second place.

For the Lung-Cancer data set, PSCSCA achieves the best mean clustering accuracy, improved by 4.53 % compared to GAPS; the best MS score among the five algorithms is that of k-means, with the proposed PSCSCA in second place.

For Iris and Wine, in terms of both the mean clustering accuracy and the MS score, PSCSCA gives the best result of the five algorithms. The mean clustering accuracies of PSCSCA are 3.26, 0.98, 6.51, and 2.69 % higher than those of GAPS, respectively.

For the synthesized data sets, both PSCSCA and GAPS obtain good results over 20 runs. The mean clustering accuracies of PSCSCA exceed 97 % on all four synthesized data sets. In particular, for Data 3 and Data 4 the mean clustering accuracies of PSCSCA are 100 %, and optimal partitions are obtained in every run, since the standard deviations are zero. For Data 1, the mean clustering accuracy of PSCSCA is more than 99 %. For Data 2, the mean clustering accuracy of PSCSCA is 97.975 %, while the result of GAPS is similar and better than those of k-means, SBKM, and MOD-SBKM. Compared to GAPS, the mean clustering accuracies of PSCSCA are 3.92, 2.7, 0.03, 2.04, 10.83, and 12.6 % higher on the six synthesized data sets.

For MS, PSCSCA is superior to the other four algorithms on all the synthesized data sets, which also indicates that the proposed algorithm is robust when clustering data sets with the character of symmetry.

In conclusion, PSCSCA performs well on clustering data sets with the character of symmetry, as shown by both the mean clustering accuracy and the MS. For Glass and Lung-Cancer, PSCSCA is not as good, since these two data sets have a more complex distribution than the others, and the strength of PSCSCA lies in dealing with data sets with the character of symmetry.

4.2.2 Comparison of the visual clustering results

Table 2 Clustering results over benchmark clustering problems

Data set      Algorithm   Mean CA (Std) (%)   MS (Std)
Glass         k-means     57.48 (4.5747)      1.0557 (0.0240)
              SBKM        57.22 (3.3370)      1.0503 (0.0169)
              MOD-SBKM    56.46 (1.7680)      1.1135 (0.0198)
              GAPS        59.79 (2.5572)      1.0775 (0.0231)
              PSCSCA      58.41 (1.0771)      1.1048 (0.0279)
Iris          k-means     89.33 (0.3351)      0.6134 (0)
              SBKM        61.37 (2.3295)      0.9781 (0.0240)
              MOD-SBKM    70.56 (8.5815)      0.6961 (0.0915)
              GAPS        90.67 (0.2163)      0.5666 (0)
              PSCSCA      93.93 (0.2981)      0.4680 (0.0312)
Lung-Cancer   k-means     56.25 (3.1901)      0.8273 (0.0050)
              SBKM        52.18 (7.8272)      1.0856 (0.0269)
              MOD-SBKM    52.58 (7.9752)      1.0366 (0.0619)
              GAPS        59.69 (6.4821)      1.0151 (0.0518)
              PSCSCA      64.22 (6.0748)      1.0051 (0.0671)
Wine          k-means     92.69 (0.2749)      0.5299 (0)
              SBKM        92.08 (1.1146)      0.5745 (0.1328)
              MOD-SBKM    83.61 (10.4825)     0.8284 (0.0999)
              GAPS        93.03 (0.2867)      0.5149 (0.0194)
              PSCSCA      95.73 (0.5611)      0.4107 (0.029)
Data 1        k-means     82.50 (1.5904)      0.7599 (0)
              SBKM        90.80 (9.2615)      0.5254 (0.1533)
              MOD-SBKM    87.18 (17.922)      0.6970 (0.2236)
              GAPS        95.49 (1.4792)      0.4366 (0.0485)
              PSCSCA      99.40 (0.3828)      0.1181 (0.1015)
Data 2        k-means     94.25 (0)           0.3683 (0)
              SBKM        86.75 (10.5317)     0.4828 (0.2844)
              MOD-SBKM    86.66 (9.3128)      0.3973 (0.1303)
              GAPS        95.28 (0.1025)      0.3373 (0.0039)
              PSCSCA      97.98 (0.3652)      0.2267 (0.0186)
Data 3        k-means     99.60 (0.20)        0.1258 (0)
              SBKM        27.30 (1.7333)      1.4730 (0.1910)
              MOD-SBKM    46.77 (14.81)       0.9311 (0.3452)
              GAPS        97.96 (0.3952)      0.2832 (0.0218)
              PSCSCA      100 (0)             0 (0)
Data 4        k-means     57.50 (0.3)         0.9204 (0.0006)
              SBKM        78.95 (6.7279)      0.8873 (0.2602)
              MOD-SBKM    74.43 (5.96)        0.9197 (0.0021)
              GAPS        87.40 (11.0941)     0.8174 (0.1889)
              PSCSCA      100 (0)             0.4147 (0)


The visual clustering results for Data 2 and Data 4 are shown in Figs. 4 and 5; these are the best clustering results among the 20 independent runs of the five algorithms. Because of their high dimension, the results on the UCI data sets are not shown.

As shown in Fig. 4, for Data 2, k-means performs poorly, while the other four algorithms do better, but all of them misplace points at the intersecting part of the ring and the slanted line. Many points of the ring-shaped part are wrongly partitioned into the slanted portion by k-means. For SBKM, a part of the upper ring goes into the line; MOD-SBKM wrongly partitions a few points of the line into the ring-shaped part; for GAPS and PSCSCA, a part of the upper ring goes into the line, but the number of points wrongly partitioned by PSCSCA is smaller than that of GAPS.

Figure 5 shows the results of the five algorithms on Data 4. k-means, SBKM, and MOD-SBKM perform poorly. For GAPS, portions of the big ellipse are wrongly partitioned. The reason is that some points near the center of the big ellipse are symmetric not only to points in the small ellipse, but also to points in the big ellipse, and the symmetry distance $d_{ps}$ of these points from the center of the big ellipse is smaller, leading to wrong partitions. The mean accuracy produced by GAPS is 87.4 %. The experimental results show that the k-nearest-neighbor consistency strategy used in PSCSCA helps to improve the clustering results distinctly.

4.2.3 Comparison of the convergence speed of GAPS and PSCSCA

Since k-means, SBKM, and MOD-SBKM are all based on the k-means method, they converge faster than GAPS and PSCSCA, which are population-based evolutionary algorithms. In this subsection, we use the number of objective function evaluations (FEs) as a measure of the computational cost. From the accuracies reported in Table 2, we choose a threshold clustering accuracy for each data set; this threshold is approximately equal to or larger than the minimum accuracy attained by either clustering algorithm (PSCSCA or GAPS). We then run each algorithm on each data set, stopping as soon as it achieves the threshold accuracy.

Fig. 4 The clustering results of Data 2: a Data 2; b clustering result by using k-means; c clustering result by using SBKM; d clustering result by using MOD-SBKM; e clustering result by using GAPS; f clustering result by using PSCSCA


The parameters of the two algorithms are the same as those in Sect. 4. PSCSCA and GAPS are run 20 independent times and halted when they reach the smaller of the two accuracies reported in Table 2. For example, for Glass, the smaller accuracy achieved by PSCSCA and GAPS in Table 2 is 58.41 %, so both algorithms stop when their clustering accuracy reaches 58 %.

The mean numbers of function evaluations and the standard deviations are given in Table 3. For Data 1, Data 2, Data 3, Data 4, Glass, Iris, Liver-Disorder, Lung-Cancer, and New-Thyroid, the mean numbers of function evaluations of PSCSCA are fewer than those of GAPS by 952, 122, 139, 537, 424, 5, 224, 370, and 149, respectively. This indicates that PSCSCA converges faster than GAPS. For Data 5, Data 6, and Wine, PSCSCA requires a larger mean number of function evaluations than GAPS. For most of the data sets used in this paper, the convergence speed of PSCSCA is faster than that of GAPS.

4.3 Parameter analysis

Most optimization algorithms based on natural computation are stochastic search algorithms, and some of the parameters employed in them have a strong effect on stability and convergence speed.

Fig. 5 The clustering results of Data 4: a Data 4; b clustering result by using k-means; c clustering result by using SBKM; d clustering result by using MOD-SBKM; e clustering result by using GAPS; f clustering result by using PSCSCA

Table 3 Mean and standard deviations of FEs required by PSCSCA and GAPS

Data set      Threshold of accuracy (%)   GAPS          PSCSCA
Glass         58                          792 (264)     368 (16)
Iris          90                          614 (9)       609 (16)
Lung-Cancer   59                          738 (107)     368 (16)
Wine          93                          643 (254)     672 (98)
Data 1        95                          1,544 (107)   592 (16)
Data 2        95                          803 (137)     681 (123)
Data 3        97                          905 (78)      368 (16)
Data 4        87                          912 (127)     1,033 (98)


In this section, we focus on four important parameters: the size of the antibody population $N$, the clonal scale $N_c$, the simulated annealing mutation parameter $\beta$, and the diversity control coefficient $b$, and analyze their influence on the performance of the proposed algorithm.

The initial values of the parameters are set as follows: the antibody population size $N = 10$, the clonal scale $N_c = 20$, the annealing temperature in immune selection $T = 1$, the simulated annealing mutation parameter $\beta = 5$, the diversity control coefficient $b = 1$, and $\varepsilon = 10^{-5}$. To determine proper values of $N$, $N_c$, $\beta$, and $b$, a series of experiments was conducted on Data 4. Each test in this section is performed over 30 independent runs.

4.3.1 Sensitivity in relation to N and Nc

The experimental results of PSCSCA with $N$ increased from 5 to 25 in steps of 5 and $N_c$ increased from 10 to 40 in steps of 10 are shown in Fig. 6. The data are the statistics of the mean number of objective function evaluations (FEs) and the mean affinity value.

The results in Fig. 6 show that the size of the antibody population and the clonal scale have a large effect on the number of function evaluations and on the affinity value. The mean number of function evaluations increases roughly linearly with $N$ and $N_c$, but the mean affinity value shows no such relationship. A larger clonal scale helps extend the search scope, and a larger antibody population improves the population diversity; however, considering the computational cost, choosing too large an $N$ or $N_c$ is not appropriate. Conversely, too small an $N$ or $N_c$ yields a small affinity value, so that PSCSCA cannot find the optimal partition. We conclude that the antibody population size should be set to 10–15 and the clonal scale to 20–25. In this paper, we set $N = 10$ and $N_c = 20$.

4.3.2 Sensitivity in relation to $\beta$ and $b$

The experimental results of PSCSCA with $\beta$ increased from 1 to 10 in steps of 1 and $b$ increased from 0.5 to 2.5 in steps of 0.5 are shown in Fig. 7. The data are the statistics of the mean number of function evaluations and the mean affinity value.

As shown in Fig. 7, the simulated annealing mutation parameter $\beta$ and the diversity control coefficient $b$ have a strong influence on the number of function evaluations and on the affinity value. When $b \ge 1.5$, the computational cost is too high; when $\beta < 4$ or $\beta > 8$, the affinity value is poor and the optimal partition cannot be found. Balancing these two factors, the simulated annealing mutation parameter should be set to 4–8 and the diversity control coefficient to 0.5–1. In this paper, we set $\beta = 5$ and $b = 1$.

Fig. 7 The influence of $\beta$ and $b$ on the performance of the proposed algorithm: a the influence on the mean FEs; b the influence on the mean affinity value

Fig. 6 The influence of $N$ and $N_c$ on the performance of the proposed algorithm: a the influence on the mean FEs; b the influence on the mean affinity value


4.3.3 Influence of the immune vaccine operator on the performance of PSCSCA

In this section, we use the above eight data sets to test the influence of the immune vaccine operator on the performance of PSCSCA. We run PSCSCA with the immune vaccine operator and the same algorithm without the immune vaccine operator for 20 runs each on all test data sets. The experimental results reported are the mean clustering accuracy with its standard deviation, and the average number of function evaluations with its standard deviation.

Figure 8 shows the influence of the immune vaccine operation on the clustering accuracy. In Fig. 8, the red bars denote the mean clustering accuracy of the algorithm with the immune vaccine operation (IVO for short), and the blue bars denote the mean clustering accuracy of the algorithm without the immune vaccine operation (no IVO for short).

As illustrated by Fig. 8, for most of the data sets used in this paper, the algorithm with the immune vaccine operation, PSCSCA, has a better mean clustering accuracy than the algorithm without it.

Figure 9 shows the influence of the immune vaccine operation on the mean FEs. In Fig. 9, the red bars denote the mean FEs of the algorithm with the immune vaccine operation (IVO), and the blue bars denote the mean FEs of the algorithm without it (no IVO).

From Fig. 9, it is easy to see that the algorithm with the immune vaccine operation, PSCSCA, requires fewer function evaluations than the algorithm without it. We conclude that the immune vaccine operation helps the algorithm converge faster.

5 Experiments on real-world image compression

Image compression applies digital compression technology to digital images; it aims to reduce the redundancy of image data and the volume of images so as to store and transmit data effectively. Generally, image compression technology can be divided into lossless and lossy image compression. Among the various kinds of lossy compression methods, vector quantization (VQ) is one of the most popular and widely used [46, 47]. VQ mainly includes three parts: codebook generation, encoding, and decoding, in which codebook generation is the key factor affecting the performance of the whole image compression process. The main idea of codebook design can be summarized as follows. Let the number of training vectors be M and the number of codewords be N; the codebook design problem then means dividing the M training vectors into N clusters, which is an NP-hard problem.

Fig. 8 The influence of immune vaccine operation on clustering accuracy
Fig. 9 The influence of immune vaccine operation on the mean number of FEs


For large M and N, traditional search algorithms such as Linde–Buzo–Gray (LBG) [48] and k-means can hardly find the globally optimal classification. Many algorithms for optimal codebook design have since been proposed to improve on the traditional search algorithms. One can easily infer from Sect. 4 that PSCSCA itself can be used for such clustering purposes with satisfactory results. In this section, we therefore apply PSCSCA to natural image compression.

5.1 Framework for image compression

The proposed clustering algorithm is introduced into the mean/residual vector quantization (M/RVQ) framework [49] to cluster the data of the compressed images and obtain the optimal codebooks. Let the size of the mean codebook be $n_1$ and the size of the residual vector codebook be $n_2$. First, the image is segmented into block vectors and the mean value of each block is extracted and quantized; second, the residual vector is quantized. The mean value uses scalar quantization and the residual vector uses vector quantization, so the quantization error of the mean value does not affect the quantization error of the residual vector. The main steps are as follows.

Step 1 Segment the image into non-overlapping blocks of size $p \times p$. If a block at the image edge is smaller than $p \times p$, pad it with zeros. Each block is then a block vector.

Step 2 Calculate the mean value $b$ of each block vector as:

$$b = \frac{1}{p^2} \sum_{i=1}^{p} \sum_{j=1}^{p} x_{i,j} \qquad (24)$$

where $x_{i,j}$ denotes the pixel value at position $(i, j)$, $i, j = 1, 2, \ldots, p$, in the block.

Step 3 Cluster the mean values $b$ of all block vectors by using PSCSCA and take the $n_1$ cluster centers as the mean value codebook.

Step 4 Quantize $b$ by scalar quantization, $\bar{b} = \arg\min_{x \in \text{mean value codebook}} |b - x|$, and use the index of $\bar{b}$ in the mean value codebook as its code $c_b$, namely $c_b = \arg\min_{i = 1, 2, \ldots, n_1} |b - x_i|$, $x_i \in$ the mean value codebook.

Step 5 Calculate the residual vector $z_{i,j} = x_{i,j} - \bar{b}$.

Step 6 Cluster the residual vectors $z$ by using PSCSCA and take the $n_2$ cluster centers as the residual vector codebook.

Step 7 Quantize $z$ by vector quantization to obtain $\bar{z} = \arg\min_{x \in \text{residual vector codebook}} \|z - x\|$. Use the index of $\bar{z}$ in the residual vector codebook as its code $c_z$, namely $c_z = \arg\min_{i = 1, 2, \ldots, n_2} \|z - x_i\|$, $x_i \in$ the residual vector codebook.

Step 8 Split the received data into the mean value code $c_b$ and the residual vector code $c_z$. Decode $c_b$ into the mean value $\bar{b}$ according to the mean value codebook, and decode $c_z$ into the residual vector $\bar{z}$ according to the residual vector codebook.

Step 9 Obtain the grayscale block vector $\bar{b} + \bar{z}$ of the image block of size $p \times p$, where $\bar{b}$ is the mean value and $\bar{z}$ is the residual vector.

Step 10 According to the segmentation order of Step 1, combine the $p \times p$ image blocks into an image of the same size as the original.
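A minimal sketch of the per-block encode/decode path (Steps 2, 4, 5, and 7–9), assuming the two codebooks have already been produced by clustering; the codebooks below are random placeholders rather than trained ones.

```python
import numpy as np

def encode_block(block, mean_cb, res_cb):
    b = block.mean()                                         # block mean, Eq. (24)
    cb = int(np.argmin(np.abs(mean_cb - b)))                 # Step 4: nearest mean codeword
    z = block.ravel() - mean_cb[cb]                          # Step 5: residual vector
    cz = int(np.argmin(np.linalg.norm(res_cb - z, axis=1)))  # Step 7: nearest residual codeword
    return cb, cz

def decode_block(cb, cz, mean_cb, res_cb, p=4):
    return (mean_cb[cb] + res_cb[cz]).reshape(p, p)          # Steps 8-9

rng = np.random.default_rng(0)
p = 4
mean_cb = np.linspace(0.0, 255.0, 16)                        # n1 = 16 mean codewords
res_cb = rng.normal(0.0, 10.0, size=(256, p * p))            # n2 = 256 residual codewords
block = rng.integers(0, 256, size=(p, p)).astype(float)
cb, cz = encode_block(block, mean_cb, res_cb)
print(np.abs(decode_block(cb, cz, mean_cb, res_cb) - block).mean())
```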

5.2 Results and analysis

Three grayscale images "Barbara", "Cameraman", and "Peppers" are used as the training images (as shown in Fig. 10), and "Lena", "Ceramic", "Cat", "Fruit", "Boat", and "Airplane" are used as the test images. The results are compared with those of LBG, self-organizing mapping (SOM) [50], GAPS, and modified K-means (Mod-KM)

Fig. 10 Training images: a Barbara (512 × 512); b Cameraman (256 × 256); c Peppers (512 × 512)


[51] in terms of the peak signal-to-noise ratio (PSNR), the running time, and the visual quality of the recovered images.

For an m × m grayscale image, the PSNR (in dB) is defined as follows:

$\mathrm{PSNR} = 10 \lg \left( \frac{255^{2}}{\frac{1}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( x_{ij} - \hat{x}_{ij} \right)^{2}} \right)$   (25)

where $x_{ij}$ is the original pixel value and $\hat{x}_{ij}$ is the compressed pixel value.
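The PSNR of Eq. (25) is straightforward to transcribe; the following sketch assumes 8-bit grayscale images held in NumPy arrays (the function name is ours):

import numpy as np


def psnr(original, compressed):
    """PSNR of Eq. (25), in dB, for an 8-bit grayscale image."""
    diff = original.astype(np.float64) - compressed.astype(np.float64)
    mse = np.mean(diff ** 2)          # (1/m^2) * sum of squared errors
    if mse == 0.0:
        return float("inf")           # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)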

Parameters in PSCSCA are set as follows: the size of the antibody population N is 20; the clonal size Nc is 40; the annealing temperature in immune selection is T = 1; the simulated annealing mutation parameter is set to 5 and the diversity control coefficient to 1; the size of the mean value codebook n1 is 16; the size of the residual vector codebook n2 is 256; and the block size p is 4.
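For reference, the settings above can be gathered into a single mapping; this is a convenience sketch only, and the key names are our own choice rather than identifiers from any published PSCSCA code:

# Experimental settings quoted above; key names are illustrative only.
PSCSCA_PARAMS = {
    "N": 20,            # antibody population size
    "Nc": 40,           # clonal size
    "T": 1.0,           # annealing temperature in immune selection
    "sa_mutation": 5.0, # simulated annealing mutation parameter
    "diversity": 1.0,   # diversity control coefficient
    "n1": 16,           # mean value codebook size
    "n2": 256,          # residual vector codebook size
    "p": 4,             # block size (p x p blocks)
}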

The visual quality of the image compression achieved by these five algorithms is compared in Figs. 11, 12, 13, 14, 15, and 16; for each algorithm, we show the best result among ten independent runs. Table 4 presents the mean PSNR and its standard deviation, and Table 5 shows the mean running time of LBG, SOM, Mod-KM, GAPS, and PSCSCA over the ten independent runs. In Table 4, bold values indicate the best results among the five algorithms.

The test image Lena was compressed and recovered by

using LBG, SOM, Mod-KM, GAPS, and PSCSCA, and the

results are given in Fig. 11. As can be seen, the borders recovered by PSCSCA are softer and closer to the original image than those recovered by the other algorithms.

Figure 12 shows the visual quality of the five algorithms for Ceramic. The visual effect of PSCSCA is comparatively better, and the texture obtained by PSCSCA is more legible than those of the other four algorithms.

The test image Cat was compressed and recovered by

using LBG, SOM, Mod-KM, GAPS and PSCSCA, and the

results are given in Fig. 13. As can be seen, the hair on the cat's tail recovered by PSCSCA is sharper than in the results of the other four algorithms.

The test image Fruit was compressed and recovered by

using LBG, SOM, Mod-KM, GAPS, and PSCSCA, and the

Fig. 11 Testing results with Lena: a the original image of Lena (512 × 512); b Lena recovered by using LBG; c Lena recovered by using SOM; d Lena recovered by using Mod-KM; e Lena recovered by using GAPS; f Lena recovered by using PSCSCA


Fig. 12 Testing results with Ceramic: a the original image of Ceramic (200 × 200); b Ceramic recovered by using LBG; c Ceramic recovered by using SOM; d Ceramic recovered by using Mod-KM; e Ceramic recovered by using GAPS; f Ceramic recovered by using PSCSCA

Fig. 13 Testing results with Cat: a the original image of Cat (525 × 700); b Cat recovered by using LBG; c Cat recovered by using SOM; d Cat recovered by using Mod-KM; e Cat recovered by using GAPS; f Cat recovered by using PSCSCA


Fig. 14 Testing results with Fruit: a the original image of Fruit (338 × 450); b Fruit recovered by using LBG; c Fruit recovered by using SOM; d Fruit recovered by using Mod-KM; e Fruit recovered by using GAPS; f Fruit recovered by using PSCSCA

Fig. 15 Testing results with Boat: a the original image of Boat (512 × 512); b Boat recovered by using LBG; c Boat recovered by using SOM; d Boat recovered by using Mod-KM; e Boat recovered by using GAPS; f Boat recovered by using PSCSCA


Table 4 Comparison of the mean PSNR in dB (standard deviation in parentheses)

Image     LBG               SOM               Mod-KM            GAPS              PSCSCA
Lena      25.8654 (2.8250)  26.2623 (0.3576)  25.0783 (2.3252)  25.6916 (3.4225)  27.0018 (3.8017)
Ceramic   24.0017 (2.9439)  24.3034 (3.1638)  24.0322 (2.4654)  23.7815 (2.3516)  24.7699 (2.4732)
Cat       29.1653 (6.5859)  29.6974 (4.3643)  29.4981 (3.6593)  29.6238 (4.1523)  30.6965 (3.9877)
Fruit     24.8120 (1.8583)  25.0764 (3.4442)  24.9364 (2.8862)  24.9817 (2.9615)  25.1923 (2.9226)
Boat      24.2294 (2.4995)  24.2959 (0.1878)  26.2763 (0.7853)  27.5428 (3.3213)  27.7303 (3.1638)
Airplane  23.8514 (2.6010)  23.9629 (0.9031)  27.3021 (0.7454)  27.2531 (3.6173)  27.3759 (3.2731)

Bold values indicate the best results among the five algorithms

Table 5 Comparison of mean running time (in seconds)

Items          LBG      SOM       Mod-KM    GAPS      PSCSCA
Training time  90.6503  1.73E+02  1.11E+03  7.62E+03  7.57E+03
Testing time
  Lena         3.9723   3.9775    3.7841    3.1315    3.7554
  Ceramic      0.5594   0.5764    0.5421    0.6157    0.5574
  Cat          5.4069   5.2655    4.9398    5.3797    5.1376
  Fruit        2.1603   2.1684    2.0876    2.2816    2.1479
  Boat         4.0109   3.9680    3.7803    3.4933    3.8939
  Airplane     3.8732   3.9696    3.7240    3.5447    3.7952

Fig. 16 Testing results with Airplane: a the original image of Airplane (512 × 512); b Airplane recovered by using LBG; c Airplane recovered by using SOM; d Airplane recovered by using Mod-KM; e Airplane recovered by using GAPS; f Airplane recovered by using PSCSCA


results are given in Fig. 14. As can be seen, the result of

PSCSCA is softer than those of the other algorithms.

Figure 15 shows the visual quality of the five algorithms for Boat. The visual effect of PSCSCA is comparatively better; in particular, the edge of the mast is smoother.

Figure 16 shows the visual quality of the five algorithms for Airplane. The holistic effect obtained by the proposed algorithm is better than those of the others, and the characters "F16" are clearer in the result of the proposed algorithm than in the others.

For the six test images used in the experiments, the compression results of the proposed method are better than those of the others: compared with LBG, SOM, Mod-KM, and GAPS, the proposed method produces higher-quality edges. This is further confirmed by Table 4, where PSCSCA achieves the highest PSNR for all test images.

Figure 17 shows how the PSNR obtained for Lena by the two population-based evolutionary methods, GAPS and PSCSCA, changes with the evolutionary generation. From Fig. 17, we can see that the PSNR obtained by both GAPS and PSCSCA increases with the evolutionary generation, and that PSCSCA performs better than GAPS in terms of PSNR.

[Fig. 17 Variation of PSNR (dB) with evolutionary generation (0–20) for Lena, obtained by GAPS and PSCSCA]

The running times, including training time and testing time, of the five algorithms are given in Table 5. We can see that the training times of LBG, SOM, Mod-KM, GAPS, and PSCSCA increase in that order. However, the differences in the testing times of the five algorithms on the six natural images are small, because the main differences among these algorithms lie in training the codebooks, while their compressing and decompressing procedures are similar. Generally speaking, a better training method produces better compression results, which is the main idea behind all kinds of improved compression algorithms; however, better compression results come at a higher computational cost. Although the training times of PSCSCA and GAPS are about seven times that of Mod-KM, a codebook needs to be trained only once and can then be used to compress any number of test images, so the increase in training time is acceptable. PSCSCA and GAPS require similar training times because they use similar codebook training methods.

6 Conclusion

Clustering is an important technique in data mining, and in real-world applications we often encounter data sets with the character of symmetry. In this paper, a point symmetry-based similarity measure and an immune vaccine mechanism are introduced into the classical clonal selection algorithm to cluster such data sets. Firstly, the proposed algorithm inherits the ability of CSA to combine global search with local search. Secondly, by introducing the immune vaccine operation, the proposed algorithm can utilize prior knowledge accumulated during evolution. Finally, the point symmetry-based similarity measure used to evaluate the similarity between two samples can detect both convex and non-convex clusters. Experimental results on different data sets demonstrate the superiority of PSCSCA over k-means, SBKM, Mod-SBKM, and GAPS. A promising direction for future research is the development of new cluster validity indices as well as automatic clustering methods based on the proposed algorithm.

Acknowledgments The authors would like to thank the editor and

the reviewers for helpful comments that greatly improved the paper.

This work is supported by the National Natural Science Foundation of

China under Grant No. 61203303 and No. 61272279, the Provincial

Natural Science Foundation of Shaanxi of China (No. 2010JM8030),

and the Fundamental Research Funds for the Central Universities

(No. K50511020014).

References

1. Evangelou IE, Hadjimitsis DG, Lazakidou AA (2001) Data mining and knowledge discovery in complex image data using artificial neural networks. In: Proceedings of the workshop complex reason. http://www.cs.ucy.ac.cy/*lazakid/Publications/../p
2. Rao MR (1971) Cluster analysis and mathematical programming. J Am Stat Assoc 66(335):622–626
3. Lillesand T, Keifer R (1994) Remote sensing and image interpretation. Wiley, Hoboken
4. Saha S, Bandyopadhyay S (2010) A new multiobjective clustering technique based on the concepts of stability and symmetry. Knowl Inf Syst 23:1–27
5. Su MC, Chou CH (2001) A modified version of the k-means algorithm with a distance based on cluster symmetry. IEEE Trans Pattern Anal Mach Intell 23(6):674–680
6. Chou CH, Su MC, Lai E (2002) Symmetry as a new measure for cluster validity. In: Second WSEAS international conference on scientific computation and soft computing, pp 209–213
7. Bandyopadhyay S, Saha S (2007) GAPS: a clustering method using a new point symmetry-based distance measure. Pattern Recogn 40(12):3430–3451
8. Saha S, Bandyopadhyay S (2008) Application of a new symmetry-based cluster validity index for satellite image segmentation. IEEE Geosci Remote Sens Lett 5(2):166–170
9. Burnet FM (1957) A modification of Jerne's theory of antibody production using the concept of clonal selection. Australian J Sci 20(1):67–76
10. De Castro LN, Von Zuben FJ (2002) Learning and optimization using the clonal selection principle. IEEE Trans Evol Comput 6(3):239–251
11. Cutello V, Narzisi G, Nicosia G (2005) Clonal selection algorithms: a comparative case study using effective mutation potentials. In: Proceedings of the 4th international conference on artificial immune systems. Lecture notes in computer science, vol 3627, pp 13–28
12. Gong M et al (2009) Immune algorithm with orthogonal design based initialization, cloning, and selection for global optimization. Knowl Inf Syst. doi:10.1007/s10115-009-0261-8
13. Liu RC, Jiao LC (2007) An immune memory clonal strategy algorithm. J Comput Theor Nanosci 4(7–8):1399–1404
14. De Castro LN, Von Zuben FJ (2000) An evolutionary immune network for data clustering. In: Proceedings of the IEEE SBRN'00 (Brazilian symposium on artificial neural networks), Rio de Janeiro, pp 84–89
15. Li J, Gao X, Jiao L (2004) A CSA-based clustering algorithm for large data sets with mixed numeric and categorical values. In: Proceedings of the world congress on intelligent control and automation (WCICA2004), Hangzhou, pp 2003–2007
16. Liu R, Shen Z, Jiao L, Zhang W (2010) Immunodominance based clonal selection clustering algorithm. In: Proceedings of the 2010 IEEE congress on evolutionary computation (CEC2010), Barcelona, Spain, 18–23 July, pp 2912–2918
17. Liu RC, Zhang W, Jiao LC, Liu F (2010) A multiobjective immune clustering ensemble technique applied to unsupervised SAR image segmentation. In: Proceedings of the ACM international conference on image and video retrieval (ACM-CIVR), pp 158–165
18. Paterlini S, Krink T (2006) Differential evolution and particle swarm optimization in partitional clustering. Comput Stat Data Anal 50:1220–1247
19. Cowgill M, Harvey R, Watson L (1999) A genetic algorithm approach to cluster analysis. Comput Math Appl 37(7):99–108
20. Raghavan VV, Birchand K (1979) A clustering strategy based on a formalism of the reproductive process in a natural system. In: Proceedings of the second international conference on information storage and retrieval, pp 10–22
21. Falkenauer E (1998) Genetic algorithms and grouping problems. Wiley, Chichester
22. Bandyopadhyay S, Murthy CA, Pal SK (1995) Pattern classification with genetic algorithms. Pattern Recogn Lett 16:801–808
23. Bandyopadhyay S, Murthy CA, Pal SK (1998) Pattern classification using genetic algorithm: determination of H. Pattern Recogn Lett 19:1171–1181
24. Bandyopadhyay S, Pal SK, Murthy CA (1998) Simulated annealing based pattern classification. Inf Sci 109:165–184
25. Bandyopadhyay S, Murthy CA, Pal SK (1999) Theoretic performance of genetic pattern classifier. J Franklin Inst 336:387–422
26. Bandyopadhyay S, Maulik U (2002) Genetic clustering for automatic evolution of clusters and application to image classification. Pattern Recogn 35:1197–1208
27. Bandyopadhyay S, Maulik U (2002) An evolutionary technique based on K-means algorithm for optimal clustering in R^N. Inf Sci 146:221–237
28. Sarafis I, Zalzala AMS, Trinder P (2002) A genetic rule-based data clustering toolkit. In: Fogel DB, El-Sharkawi MA, Yao X, Greenwood G, Iba H, Marrow P, Shackleton M (eds) Proceedings of the 2002 congress on evolutionary computation (CEC2002). IEEE Press, Piscataway, pp 1238–1243
29. Maulik U, Bandyopadhyay S (2000) Genetic algorithm-based clustering technique. Pattern Recogn 33:1455–1465
30. Chiou YC, Lan LW (2001) Theory and methodology: genetic clustering algorithms. Eur J Oper Res 135:413–427
31. Bandyopadhyay S, Saha S (2007) VGAPS: a clustering method using a new point symmetry-based distance measure. Pattern Recogn 40(12):3430–3451
32. Das S, Abraham A, Konar A (2008) Automatic kernel clustering with a multi-elitist particle swarm optimization algorithm. Pattern Recogn Lett 29:688–699
33. Das S, Abraham A, Konar A (2008) Automatic clustering using an improved differential evolution algorithm. IEEE Trans Syst Man Cybern Part A Syst Humans 38(1):218–237
34. Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell PAMI-1(2):224–227
35. Dunn JC (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3(3):32–57
36. Xie XL, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13(8):841–847
37. Pakhira MK, Bandyopadhyay S, Maulik U (2004) Validity index for crisp and fuzzy clusters. Pattern Recogn 37:487–501
38. Wang L, Bo LF, Jiao LC (2007) Density-sensitive semi-supervised spectral clustering. J Software 18(18):2412–2422
39. Meila M, Heckerman D (1998) An experimental comparison of several clustering and initialization methods. In: Proceedings of the 14th conference on uncertainty in artificial intelligence. Morgan Kaufmann, pp 386–395
40. Bandyopadhyay S, Maulik U (2001) Nonparametric genetic clustering: comparison of validity indices. IEEE Trans Syst Man Cybern Part C Appl Rev 31(1):120–125
41. Ding C, He X (2004) K-nearest-neighbor in data clustering: incorporating local information into global optimization. In: Proceedings of the ACM symposium on applied computing. ACM Press, Nicosia, pp 584–589
42. Mount DM, Arya S (2005) ANN: a library for approximate nearest neighbor searching. http://www.cs.umd.edu/mount/ANN
43. Jiao LC, Wang L (2000) A novel genetic algorithm based on immunity. IEEE Trans Syst Man Cybern Part A Syst Humans 30:552–561
44. Ben-Hur A, Guyon I (2003) Detecting stable clusters using principal component analysis. In: Methods in molecular biology. Humana Press, New York
45. UC Irvine Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets.html
46. Li RY et al (2002) Image compression using transformed vector quantization. Image Vis Comput 20:37–45
47. Baker RL, Gray RM (1982) Image compression using nonadaptive spatial vector quantization. In: Proceedings of the 16th Asilomar conference on circuits, systems, and computers, pp 55–61
48. Linde Y, Buzo A, Gray R (1980) An algorithm for vector quantizer design. IEEE Trans Commun 28:84–94
49. Baker RL (1984) Vector quantization of digital images. PhD dissertation, Stanford University, Stanford
50. Horzyk A (2005) Unsupervised clustering using self-optimizing neural networks. In: Proceedings of the 2005 5th international conference on intelligent systems design and applications (ISDA'05), pp 118–123
51. Chou CH, Su MC, Lai E (2004) A new cluster validity measure and its application to image compression. Pattern Anal Appl 7:205–220
