
Soft Comput (2016) 20:4733–4751 DOI 10.1007/s00500-015-1701-x

FOCUS

Multi-objective semi-supervised clustering for automatic pixel classification from remote sensing imagery

Abhay Kumar Alok1 · Sriparna Saha1 · Asif Ekbal1

Published online: 16 May 2015
© Springer-Verlag Berlin Heidelberg 2015

Abstract  Classifying the pixels of satellite images into homogeneous regions is a very challenging task, as different regions have different types of land covers. Some land covers contain more regions, while some contain relatively smaller regions (e.g., bridges, roads). In satellite image segmentation, no prior information is available about the number of clusters. In this paper, we solve this problem using the concepts of semi-supervised clustering, which utilizes the properties of both unsupervised and supervised classification. Three cluster validity indices are utilized, which are simultaneously optimized using AMOSA, a modern multiobjective optimization technique based on the concepts of simulated annealing. The first two cluster validity indices, the symmetry distance based Sym-index and the Euclidean distance based I-index, are based on unsupervised properties. The last one is a supervised information based cluster validity index, the Minkowski index. To obtain the supervised information, the fuzzy C-means clustering technique is first applied. Thereafter, based on the highest membership values of the data points to their respective clusters, 10% of the data points with their class labels are chosen at random. The effectiveness of this proposed semi-supervised clustering technique is demonstrated

Communicated by Y.-S. Ong.

B Abhay Kumar Alok
[email protected]; [email protected]

Sriparna Saha
[email protected]

Asif Ekbal
[email protected]

1 Computer Science Engineering, Indian Institute of Technology, Patna, India

on three satellite image data sets of different cities of India. Results are also compared with existing clustering techniques.

Keywords  Remote sensing satellite image segmentation · Cluster validity index · Sym-index · I-index · MS-index · Semi-supervised clustering · AMOSA · Multiobjective optimization · Fuzzy C-means · Silhouette-index

Abbreviations

SOO     Single objective optimization
MOO     Multiobjective optimization
AMOSA   Archived multiobjective simulated annealing based technique
SA      Simulated annealing
FCM     Fuzzy C-means
MOGA    Multiobjective genetic algorithm

1 Introduction

The major challenge in remote sensing applications is to classify the pixels of images into homogeneous regions, because these regions consist of different types of land covers. Some land covers have a large number of pixels, while some others occupy very few pixels. So this type of problem is modeled as a clustering/segmentation problem (Maulik and Bandyopadhyay 2003). The other problem associated with satellite images is that no prior information is available about the actual number of clusters. To solve these problems, several unsupervised and supervised clustering techniques have been developed. In Sathya and Malathi (2011) a tool is developed using two algorithms, namely


Fig. 1 Example of semi-supervised clustering

the back propagation algorithm of artificial neural networks and the K-means algorithm, for segmentation and classification of remote sensing images from a wide database of images. In Fan et al. (2009), a single point iterative weighted fuzzy C-means clustering algorithm is proposed for remote sensing image segmentation. In recent years, genetic algorithms have been widely used for solving the pixel classification problem from remote sensing images (Bandyopadhyay et al. 2007; Bandyopadhyay and Pal 2001). In Bandyopadhyay et al. (2007), a genetic algorithm based fuzzy clustering technique is developed for segmenting remote sensing images. In Bandyopadhyay and Saha (2008), a point symmetry based distance is used to develop a cluster validity index, named Sym-index. Thereafter this index is used to automatically predict the number of clusters from some satellite images. Some differential evolution based clustering techniques are proposed in Das and Konar (2009) for segmenting remote sensing images. In Maulik and Saha (2010, 2009) differential evolution is modified to develop some clustering techniques to solve the problem of image segmentation. All the above-mentioned approaches utilize only a single quality measure to capture the goodness of the obtained segmentation. But satellite images contain segments having different shapes and sizes. In order to capture clusters having different shapes, multiple cluster quality measures capturing different properties of partitionings need to be simultaneously optimized. In view of this, satellite image segmentation has been posed as a multiobjective optimization problem, and some multiobjective approaches have been developed to solve such problems. In

Bandyopadhyay et al. (2007), a multiobjective fuzzy clustering technique is developed for classification of remote sensing images. In Saha et al. (2012a), a support vector machine based post-processing technique is coupled with multiobjective fuzzy clustering for improving the accuracies of segmentation of remote sensing satellite images. In Saha et al. (2012a), solutions of different fuzzy clustering techniques are combined using a support vector machine based post-processing technique. In Saha and Bandyopadhyay (2010), a multiobjective simulated annealing based technique, which uses the concept of multiple centers to represent a particular cluster, is developed for remote sensing image segmentation. All the above-mentioned clustering techniques utilize the available unlabeled data for classification into different categories. These algorithms do not rely on any supervised information available from the data.

In recent years a new classification paradigm, named semi-supervised classification, has been developed (Saha et al. 2012b). This combines the usefulness of both supervised and unsupervised classification techniques. It lies half way between supervised and unsupervised classification (Chapelle et al. 2006; Basu et al. 2004; Bilenko et al. 2004; Altun et al. 2005; Chapelle and Zien 2004). It utilizes a large amount of unlabeled information together with a small amount of labeled information. First some unsupervised classification techniques are applied, and then the obtained partitions are fine-tuned with the help of the available supervised information. An example of a semi-supervised clustering framework is shown in Fig. 1. It has been shown in the literature that semi-supervised classification techniques perform better


than the existing supervised and unsupervised classification techniques (Chapelle et al. 2006; Basu et al. 2004; Bilenko et al. 2004; Altun et al. 2005; Chapelle and Zien 2004).

Inspired by this, in the current study we have made an attempt to use sophisticated semi-supervised classification techniques for remote sensing image segmentation. Here, we have posed the segmentation problem of remote sensing satellite images as a semi-supervised classification problem. Existing techniques of satellite image segmentation utilize only the available unlabeled data for partitioning. A new way of generating the labeled data without utilizing any human annotator is proposed, and it is thereafter applied on the available satellite images to obtain some labeled data. A recently developed semi-supervised clustering technique (Saha et al. 2012b) utilizing the labeled data is used to automatically partition the available satellite images. Any semi-supervised classification technique targets two objectives: the partitioning of points into different clusters should be good, and there should not be any violation of the supervised information, which means that points actually belonging to a particular class or cluster must be preserved in the same group after application of the clustering technique. In order to measure these two properties, two different types of cluster validity indices are used: internal and external. An internal validity index utilizes intrinsic properties of the data items, while an external validity index utilizes the supervised information given in the form of class labels of data items. In order to simultaneously optimize these cluster validity indices, in the current work we have posed semi-supervised classification as a multiobjective optimization (MOO) problem. MOO has a different perspective compared to single objective optimization (SOO). In SOO we need to optimize a single objective function, but in MOO we need to optimize more than one objective function. SOO provides a single solution as the final one, but MOO provides a set of solutions on the final Pareto optimal front.
All the solutions produced by a MOO based technique are equally important and are non-dominating to each other. This set is also termed the set of Pareto-optimal solutions. Due to the presence of multiple solutions, evolutionary algorithms are widely used for solving multiobjective optimization problems.

The current work reports the application of a semi-supervised clustering technique for remote sensing satellite image segmentation. A multiobjective simulated annealing based technique, AMOSA (Bandyopadhyay et al. 2008), is utilized as the underlying optimization strategy. Three objective functions are simultaneously optimized for determining the appropriate partitioning. The first two objective functions are internal indices of cluster validity based on unsupervised properties of the data set: the symmetry distance based

Sym-index (Bandyopadhyay and Saha 2008) and the Euclidean distance based I-index (Maulik and Bandyopadhyay 2002). The last one is an external index of cluster validity, the Minkowski Score or MS-index (Ben-Hur and Guyon 2003), based on supervised information, i.e., prior class label information of data points. The performance of the multiobjective simulated annealing (MOSA) based semi-supervised clustering technique (semi-ImClustMOO) has been demonstrated on the SPOT (Systeme Probatoire d'Observation de la Terre) satellite image of Kolkata, the IRS image of Kolkata and the IRS image of Mumbai. The effectiveness of the proposed semi-supervised clustering technique is compared with that of the popular fuzzy C-means (FCM) algorithm (Bezdek 1981) and the multiobjective genetic algorithm based MOGA clustering (Bandyopadhyay et al. 2007). Fuzzy C-means is a popular clustering technique often used for satellite image segmentation; it is a simple approach for solving the satellite image segmentation problem. The multiobjective genetic algorithm based clustering technique, MOGA, is a recently developed fuzzy clustering technique for satellite image segmentation. It has already been illustrated in Bandyopadhyay et al. (2007) that MOGA performs better than FCM and a single objective genetic algorithm based fuzzy clustering technique (Bezdek 1981). The results reported in the current paper show that the proposed semi-supervised approach performs better than the FCM and MOGA based techniques. This demonstrates the utility of semi-supervised classification for segmenting satellite images.

The main contributions of the current paper are as follows:

– A new approach using the fuzzy C-means clustering technique is developed for generating some labeled data from given unlabeled data without using any human annotator.

– To the best of our knowledge, this is the first attempt where a semi-supervised classification technique is developed for remote sensing satellite image segmentation.

– The proposed technique utilizes a simulated annealing based multiobjective optimization technique, AMOSA, as the underlying optimization strategy.

– Results are illustrated on three remote sensing satellite images of parts of the cities of Kolkata and Mumbai. The obtained results are compared with two recent image segmentation techniques, FCM and MOGA.

This paper is arranged as follows: Sect. 2 discusses generating labeled data using the fuzzy C-means clustering technique. Section 3 discusses the proposed multiobjective semi-supervised clustering technique. Section 4 reports the experimental results obtained. Finally, Sect. 5 concludes the paper.


2 Procedure of generating labeled data using fuzzy C-means clustering technique

Fuzzy C-means (FCM) (Bezdek 1981) is a very popular clustering technique, used widely in pattern recognition, which incorporates the property of fuzzy logic: a single data point may belong to two or more clusters. FCM is based on a single objective function (given below) which should be minimized:

J_m = \sum_{j=1}^{N} \sum_{c=1}^{C} u_{c,j}^{m} \, D^2(z_c, x_j), \quad 1 \le m \le \infty \qquad (1)

Here, N represents the number of data points, C is the total number of clusters, u denotes the fuzzy membership and m represents the fuzzy exponent. Let x_j denote the j-th data point, let z_c be the center of the c-th cluster, and let the distance of data point x_j from the cluster center z_c be represented by D(z_c, x_j).

Initially, the FCM algorithm starts by randomly picking C cluster centers. Then, in every iteration, it evaluates the fuzzy membership of each data point with respect to every cluster according to the following equation:

u_{c,i} = \frac{\left( \frac{1}{D(z_c, x_i)} \right)^{\frac{1}{m-1}}}{\sum_{j=1}^{C} \left( \frac{1}{D(z_j, x_i)} \right)^{\frac{1}{m-1}}}, \quad 1 \le c \le C, \; 1 \le i \le N \qquad (2)

where D(z_c, x_i) and D(z_j, x_i) represent the distances between x_i and z_c, and between x_i and z_j, respectively. After evaluation of the fuzzy membership of each data point, the cluster centers are updated with the help of the following equation:

z_c = \frac{\sum_{i=1}^{N} u_{c,i}^{m} x_i}{\sum_{i=1}^{N} u_{c,i}^{m}}, \quad 1 \le c \le C \qquad (3)

The two steps mentioned above, evaluation of fuzzy memberships and re-computation of cluster centers, are executed repeatedly until there is no change in the cluster centers. Final membership values are obtained considering each cluster individually. We sort all the points of an individual cluster c based on their membership values, select the top 10C % points and assign cluster label c to them. Here C is the total number of clusters. This labeled information is used as the supervised information for the proposed semi-supervised clustering technique.
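As a concrete illustration of the two alternating updates, the following is a minimal NumPy sketch of the FCM loop described by Eqs. (2) and (3); the function name, convergence tolerance, and the optional `init` argument are our own choices, not the authors' implementation.

```python
import numpy as np

def fcm(X, C, m=2.0, iters=100, init=None, seed=0):
    """Minimal fuzzy C-means: alternate the membership update (Eq. 2)
    and the center update (Eq. 3) until the centers stop moving."""
    rng = np.random.default_rng(seed)
    Z = np.array(init, float) if init is not None \
        else X[rng.choice(len(X), C, replace=False)]
    for _ in range(iters):
        # N x C matrix of distances from every point to every center
        D = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        D = np.maximum(D, 1e-12)                      # guard against D = 0
        w = (1.0 / D) ** (1.0 / (m - 1.0))            # numerator of Eq. (2)
        U = w / w.sum(axis=1, keepdims=True)          # memberships, rows sum to 1
        Um = U ** m
        Z_new = (Um.T @ X) / Um.sum(axis=0)[:, None]  # Eq. (3)
        if np.allclose(Z, Z_new, atol=1e-9):
            break
        Z = Z_new
    return U, Z
```

Note that the sketch uses the exponent 1/(m − 1) exactly as printed in Eq. (2); the membership formula derived by minimizing Eq. (1) with squared distances carries the exponent 2/(m − 1) instead.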

Illustration with example  Let us assume that a total of N = 10 two-dimensional data points are available. These data points are (0.1539, 0.8988), (0.9911, 0.3667), (0.7194, 0.0805), (0.0745, 0.6993), (0.56023, 0.2570), (0.3725, 0.7646), (0.8112, 0.7686), (0.3052, 0.0864), (0.9793, 0.9127) and (0.6717, 0.0862). We have then applied the FCM algorithm on these data points with K = 2 clusters. The lowest objective function value attained by FCM

with K = 2 clusters is 0.7588. The obtained membership values of the data points with respect to cluster 1 and cluster 2 are as follows: (0.0765, 0.9235), (0.8191, 0.1809), (0.9655, 0.0345), (0.1398, 0.8602), (0.9582, 0.0418), (0.0010, 0.9990), (0.3780, 0.6220), (0.7633, 0.2367), (0.4029, 0.5971) and (0.9695, 0.0305). The first membership value is for cluster 1 and the second one is for cluster 2. Based on these membership values, we can assign each data point to a particular cluster. From the above-mentioned membership values, it can easily be concluded that the points (1, 4, 6, 7, 9) belong to cluster 2, and the points (2, 3, 5, 8, 10) belong to cluster 1. Further, these points are labeled accordingly for supervised information. Finally, we select 10% labeled points from each cluster to be used as the supervised information of the proposed semi-supervised classification technique. The points of cluster 1 sorted by membership value are (10, 3, 5, 2, 8). As the number of points is 10, 10% of the data points equals 1. Thus we select point 10 with class label 1 as the labeled information. Similarly, the points of cluster 2 sorted by membership value are (6, 1, 4, 7, 9). Thus from here we select point 6 with class label 2 as the labeled information.
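The selection step of this example can be reproduced in a few lines. This is an illustrative sketch with our own helper name and output format, using the membership values listed above with 0-based point and cluster indices.

```python
def pick_labeled(U, frac=0.10):
    """Assign each point to its max-membership cluster, then keep the
    top-membership 'frac' fraction of the N points per cluster as labels."""
    take = max(1, round(frac * len(U)))
    clusters = {}
    for idx, row in enumerate(U):
        c = max(range(len(row)), key=lambda k: row[k])   # max-membership cluster
        clusters.setdefault(c, []).append((row[c], idx))
    labels = {}
    for c, members in clusters.items():
        for _, idx in sorted(members, reverse=True)[:take]:
            labels[idx] = c
    return labels

# membership values (cluster 1, cluster 2) from the worked example above
U = [(0.0765, 0.9235), (0.8191, 0.1809), (0.9655, 0.0345), (0.1398, 0.8602),
     (0.9582, 0.0418), (0.0010, 0.9990), (0.3780, 0.6220), (0.7633, 0.2367),
     (0.4029, 0.5971), (0.9695, 0.0305)]
labels = pick_labeled(U)
# index 9 (point 10) is kept for cluster index 0 (cluster 1 in the text),
# and index 5 (point 6) for cluster index 1 (cluster 2), matching the example
```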

3 Proposed multiobjective based semi-supervised clustering technique

In the current paper, to automatically segment remote sensing satellite images, a multiobjective semi-supervised clustering technique is proposed. The proposed technique uses the search capability of AMOSA (Bandyopadhyay et al. 2008), a modern multiobjective optimization technique based on the concepts of simulated annealing, to automatically determine the appropriate partitioning of remote sensing satellite images. The proposed technique is a semi-supervised clustering technique; it utilizes some amount of labeled information. In the current paper we have assumed that for 10% of the data points the actual class labels are known. In order to generate the labeled data, we have used the fuzzy C-means (Bezdek 1981) based approach developed in Sect. 2. The flowchart of the proposed algorithm is given in Fig. 6. The basic steps of the proposed technique are enumerated below:

1. First execute the fuzzy C-means clustering technique on the given unlabeled data to determine the partitioning.

2. Execute the procedure mentioned in Sect. 2 to determine the labeled information of a few data points.

3. Initialize the solutions of the archive, following the procedure mentioned in Sect. 3.2.

4. Assign the points to the different clusters encoded in a particular string using the procedure mentioned in Sect. 3.3.

5. Determine the objective function values of a particular solution using the procedure mentioned in Sect. 3.4.


6. Perturb the solutions of the archive using the mutation operations mentioned in Sect. 3.5.

7. Apply the steps of the AMOSA based optimization technique to simultaneously optimize the three cluster validity indices: two internal cluster validity indices and one external cluster validity index.

8. Finally, a set of non-dominated solutions is obtained on the final Pareto optimal front. Each of these solutions represents a partitioning of the given data set. Select a single solution from the final Pareto optimal front using the method mentioned in Sect. 3.5 and report it.

3.1 The SA based MOO algorithm: AMOSA

Here, AMOSA (Bandyopadhyay et al. 2008), the archived multiobjective simulated annealing based technique, which is a generalized version of the probabilistic metaheuristic simulated annealing (SA) algorithm using the concepts of multiobjective optimization (MOO), is used as the underlying optimization strategy. MOO is applied to solve real-world problems where there are several objectives which conflict with each other. In general, a MOO algorithm provides a set of non-dominated solutions; none of these solutions is dominated by any other. The set of these non-dominated solutions is known as the Pareto optimal front. In recent years, researchers have developed a large number of multiobjective evolutionary algorithms (MOEA) to solve multiobjective optimization problems.

Simulated annealing (SA) is a generic search technique used to solve difficult optimization problems. SA follows the principles of statistical mechanics (Kirkpatrick et al. 1983). It has already been shown that SA reaches the global optimum if it is annealed sufficiently slowly (Geman and Geman 1984). The single objective version of SA is quite popular for solving many single objective optimization problems, but SA has some limitations in solving multiobjective optimization problems. One of them is that SA produces a single solution after each execution.

Recently Bandyopadhyay et al. (2008) developed a multiobjective version of SA, popularly known as AMOSA, which rectifies the problems associated with the single objective version of SA. Here, AMOSA is used as the underlying optimization technique for partitioning a data set. The AMOSA algorithm uses the concept of an archive, where the non-dominated solutions seen so far are stored. Two limits are imposed on the archive size: a hard or strict limit denoted by HL, and a soft limit denoted by SL. In this algorithm, a total of γ × SL (γ > 1) solutions are first initialized. Each solution represents a state in the search space. For a given number of iterations, the multiple objective functions are computed and each solution is refined with the help of hill-climbing and the domination relation.

Thereafter, the obtained non-dominated solutions are stored in the archive. This process continues until the size of the archive grows to SL. When the archive size exceeds the soft limit SL, the single linkage clustering technique is used to reduce the size of the archive to HL. Thereafter, one point is randomly selected from the archive. This selected point is considered as current-pt, or the initial solution, at temperature T = T_max. Now, to generate a new-pt, current-pt is perturbed and the objective functions of new-pt are computed. Thereafter, the dominance status of new-pt is compared with respect to current-pt and the solutions in the archive. Different cases of domination status are shown in Figs. 2, 3 and 4. Now a new term, the amount of domination \Delta dom(x, y) between two solutions x and y, is defined as follows:

\Delta dom(x, y) = \prod_{i=1, \, f_i(x) \ne f_i(y)}^{M} \frac{|f_i(x) - f_i(y)|}{R_i},

where f_i(x) and f_i(y) are the i-th objective values of the two solutions and R_i is the corresponding range of the i-th objective function. Based on these dominance relations, three cases may arise for the acceptance of (1) current-pt, (2) new-pt, or (3) a solution from the archive. If the archive overflows again, the clustering technique is applied once more to reduce the archive size to HL. This process continues iter times for each temperature, which is annealed with a cooling rate of α (< 1), till the minimum temperature T_min is attained. Once T_min is attained, the process stops, and the archive with the final non-dominated solutions is reported.

Fig. 2 Different cases when new-pt is dominated by current-pt

Fig. 3 Different cases when new-pt and current-pt are non-dominating

Fig. 4 Different cases when new-pt dominates the current-pt

The AMOSA algorithm is shown in Fig. 5. In Bandyopadhyay et al. (2008), it has been elaborately shown that the performance of AMOSA is better than that of NSGA-II (Deb 2001) and some other well-known MOO algorithms (Fig. 6).
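The amount-of-domination computation can be sketched in a few lines; the function name is our own, and we assume the absolute difference |f_i(x) − f_i(y)| in each factor.

```python
def amount_of_domination(fx, fy, ranges):
    """Delta_dom(x, y): product, over the objectives where the two
    solutions differ, of |f_i(x) - f_i(y)| / R_i."""
    dom = 1.0
    for a, b, r in zip(fx, fy, ranges):
        if a != b:                       # skip objectives with equal values
            dom *= abs(a - b) / r
    return dom

# three objectives: the solutions differ by 0.4 (range 1) and 1.0 (range 4),
# and are equal in the second objective, so delta = 0.4 * 0.25 = 0.1
delta = amount_of_domination((0.2, 0.5, 1.0), (0.6, 0.5, 2.0), (1.0, 1.0, 4.0))
```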

3.2 Encoding of a state and initialization of archive members

In semi-ImClustMOO, a set of real numbers represents a state of AMOSA. Here multiple centers are used to represent a partitioning, as done in Saha and Bandyopadhyay (2013).

These real numbers in fact indicate the locations of the centers of the clusters. Hence the modern MOO technique AMOSA can automatically determine the proper set of cluster coordinates and the respective partitionings of the data items. Suppose a state comprises the encoded centers of K partitions, and let us assume that each cluster is decomposed into C small sub-clusters. Then the length of that state will be K × C × F, where F is the number of features present in the data set. The representation of one cluster using multiple centers is shown in Fig. 7. Here there are a total of 6 whole clusters (i.e., K = 6) and each whole cluster is further partitioned into two different sub-clusters (C = 2). Let the number of features present in the data set be F = 2. The j-th sub-cluster of the i-th cluster is represented by c_{ij} = (cx_{ij}, cy_{ij}). Then the entire state will look like (cx_{11}, cy_{11}, cx_{12}, cy_{12}, cx_{21}, cy_{21}, cx_{22}, cy_{22}, cx_{31}, cy_{31}, cx_{32}, cy_{32}, cx_{41}, cy_{41}, cx_{42}, cy_{42}, cx_{51}, cy_{51}, cx_{52}, cy_{52}, cx_{61}, cy_{61}, cx_{62}, cy_{62}). Here, the number of clusters K_i encoded in a particular string i of the archive is determined as follows: K_i = (rand() (K_max − 1)) + 2, where K_max represents the upper limit of the number of clusters and rand() is a function which returns an integer. So the number of initial clusters can vary in the range 2 to K_max. A minimum distance based criterion is used for assigning points to the different clusters. After forming the initial partitions, C sub-cluster centers are chosen for each cluster. These sub-cluster centers are then encoded in the string to represent a particular partitioning. So the total number of centers encoded in that string is C × K.

Example  Suppose a cluster is divided into 10 small sub-clusters; then the cluster is encoded as follows in the form of a state: (4.5, 3.5, 5.5, 7.7, 3.2, 6.3, 7.7, 4.2, 9.2, 7.5, 9.9, 8.3, 2.1, 3.5, 2.5, 6.6, 7.7, 9.4, 8.8, 6.9). Here, (cx_{11}, cy_{11}) = (4.5, 3.5), (cx_{12}, cy_{12}) = (5.5, 7.7), (cx_{13}, cy_{13}) = (3.2, 6.3), (cx_{14}, cy_{14}) = (7.7, 4.2), (cx_{15}, cy_{15}) = (9.2, 7.5), (cx_{16}, cy_{16}) = (9.9, 8.3), (cx_{17}, cy_{17}) = (2.1, 3.5), (cx_{18}, cy_{18}) = (2.5, 6.6), (cx_{19}, cy_{19}) = (7.7, 9.4) and (cx_{1,10}, cy_{1,10}) = (8.8, 6.9).
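The state layout above can be sketched as follows. This is a simplified illustration with our own names: it draws K uniformly in [2, K_max] and fills the K × C × F coordinates at random, whereas the procedure in the text forms initial partitions first and then extracts C sub-cluster centers per cluster.

```python
import random

def init_state(K_max, C, F, low=0.0, high=10.0, rng=None):
    """Encode one archive member: choose the number of whole clusters K
    in [2, K_max], then lay out K * C sub-cluster centers with F
    coordinates each as one flat real-valued vector."""
    rng = rng or random.Random()
    K = rng.randint(2, K_max)                      # 2 <= K <= K_max
    state = [rng.uniform(low, high) for _ in range(K * C * F)]
    return K, state

K, state = init_state(K_max=6, C=2, F=2, rng=random.Random(42))
# the state length is always K * C * F: K whole clusters, 2 sub-centers each, 2-D
```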


Fig. 5 AMOSA Algorithm

3.3 Assignment of points

Here, we have considered each sub-cluster as a separate cluster for the allocation process. Now, assume that each state comprises K whole clusters and each cluster is further partitioned into C sub-clusters. For the assignment, a minimum Euclidean distance based criterion is used. A particular point y_j is allocated to the (v, t)-th sub-cluster, where

(v, t) = argmin_{p = 1, ..., K; \; n = 1, ..., C} \{ d_e(z_p^n, y_j) \}.

Here z_p^n is the n-th sub-cluster center of the p-th cluster, and d_e(z_p^n, y_j) denotes the Euclidean distance between the point y_j and the cluster center z_p^n. Thereafter, the partition matrix can be formulated as follows: X[(v − 1) × C + t][j] = 1, and X[c][j] = 0 for all other c = 1, ..., K × C.


Fig. 6 Flow chart of semi-ImClustMOO algorithm

Fig. 7 Cluster representation using multiple centers


Example Let a particular state hold 2 whole clusters (i.e., K = 2); each whole cluster is further partitioned into four different sub-clusters (C = 4), and the number of features present in the data set is F = 2. Here, (cx11, cy11) = (2.5, 3.7), (cx12, cy12) = (2.7, 3.5), (cx13, cy13) = (3.2, 4.5), (cx14, cy14) = (1.2, 3.2) and (cx21, cy21) = (4.3, 1.9), (cx22, cy22) = (5.5, 4.2), (cx23, cy23) = (6.1, 2.5), (cx24, cy24) = (7.5, 3.6). Let us consider two data points X1 = (1.9, 2.7) and X2 = (5.7, 6.8). For the purpose of assignment, the Euclidean distance of each point with respect to all sub-cluster centers is computed. The Euclidean distances of X1 with respect to (cx11, cy11), (cx12, cy12), (cx13, cy13), (cx14, cy14), (cx21, cy21), (cx22, cy22), (cx23, cy23) and (cx24, cy24) are 1.1662, 1.1314, 2.2204, 0.8602, 2.5298, 3.9000, 4.2048 and 5.6719, respectively. Now we select the minimum of these distances: minimum(1.1662, 1.1314, 2.2204, 0.8602, 2.5298, 3.9000, 4.2048, 5.6719) equals 0.8602. So point X1 is assigned to sub-cluster (cx14, cy14). Similar steps are followed for all the data points for the purpose of assignment.
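The assignment step above can be sketched in Python as follows (an illustrative sketch, not the authors' implementation; the helper name, the 0-based indices and the returned flat index are our own):

```python
import math

def assign_to_subclusters(points, centers):
    """Assign each point to its nearest sub-cluster center.

    `centers[p][n]` is the nth sub-cluster center of the pth whole cluster.
    Returns, for each point, the (whole-cluster, sub-cluster) pair and a
    flat index v * C + t mirroring the partition-matrix encoding
    X[(v - 1) * C + t][j] = 1 of the text (0-based here).
    """
    C = len(centers[0])  # number of sub-clusters per whole cluster
    assignments = []
    for y in points:
        # minimize the Euclidean distance over all (cluster, sub-cluster) pairs
        v, t = min(
            ((p, n) for p in range(len(centers)) for n in range(C)),
            key=lambda pn: math.dist(centers[pn[0]][pn[1]], y),
        )
        assignments.append((v, t, v * C + t))
    return assignments

# The worked example: K = 2 whole clusters, C = 4 sub-clusters each.
centers = [
    [(2.5, 3.7), (2.7, 3.5), (3.2, 4.5), (1.2, 3.2)],
    [(4.3, 1.9), (5.5, 4.2), (6.1, 2.5), (7.5, 3.6)],
]
print(assign_to_subclusters([(1.9, 2.7), (5.7, 6.8)], centers))
```

As in the text, X1 = (1.9, 2.7) lands in sub-cluster (cx14, cy14), i.e., whole cluster 1, sub-cluster 4 (indices (0, 3) here).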

3.4 Objective functions

Three objective functions are considered for the purpose of optimization. The first two are internal cluster validity indices, which rely on natural characteristics of the data sets. The last one measures the similarity with the available supervised information; this is also called an external cluster validity index. The three objective functions used are Sym-index (Bandyopadhyay and Saha 2008), I-index (Maulik and Bandyopadhyay 2002) and Minkowski index (Ben-Hur and Guyon 2003).

3.4.1 Sym-index: validity measure based on the property of symmetry

Here, a point symmetry based distance dps(y, z) is used for evaluating this cluster validity index. The components which help to evaluate dps(y, z) are as follows: suppose there is a point y in a given cluster of n points. The symmetrical (reflected) point of y with respect to the cluster center z is determined as 2 × z − y; let this reflected point be denoted by y∗. The knear nearest neighbors of y∗ are then found; let them be at Euclidean distances di, i = 1, 2, . . . , knear. Now dps can be calculated as follows:

dps(y, z) = dsym(y, z) × de(y, z),   (4)

         = (Σ_{i=1}^{knear} di / knear) × de(y, z),   (5)

where de(y, z) is the Euclidean distance between the points y and z, and dsym(y, z) is a symmetry measure of y with respect to z, defined as Σ_{i=1}^{knear} di / knear. Here, we cannot choose knear equal to 1: if knear = 1 and the reflected point y∗ happens to exist in the data set, then dps(y, z) = 0, and the impact of the Euclidean distance would be ignored. On the contrary, if we choose a large value of knear, it may diminish the symmetry property of a point with respect to a particular cluster center. So, to rectify this problem, here we set knear equal to 2. The suitable value of knear largely depends on the data distribution, so determining it automatically remains an interesting and challenging problem. The properties of dps(y, z) are described elaborately in Bandyopadhyay and Saha (2008).

The objective function Sym-index is an internal cluster validity index, used to measure the overall average symmetry with respect to cluster centers. Consider a partition of the data set Y = {y_j : j = 1, 2, . . . , n} into K clusters, where the center of cluster i is computed as zi = (Σ_{j=1}^{ni} y_i^j)/ni. Here, ni (i = 1, 2, . . . , K) is the number of points in cluster i and y_i^j denotes the jth point of the ith cluster. The new cluster validity function Sym is defined as

Sym(K) = (1/K) × (1/EK) × DK   (6)

Here,

EK = Σ_{i=1}^{K} Ei,   (7)

such that

Ei = Σ_{j=1}^{ni} d*ps(y_i^j, zi)   (8)

and

DK = max_{i,j=1}^{K} ||zi − zj||   (9)

Here, DK is the maximum separation between two cluster centers among all pairs of centers; separation is measured in terms of Euclidean distance. The quantity d*ps(y_i^j, zi) is computed according to Eq. (5) with one constraint: the first knear nearest neighbors of y∗_j = 2 × zi − y_i^j are searched only among the data points belonging to cluster i. In other words, both y_i^j and the knear nearest neighbors of y∗_j, the reflected point of y_i^j with respect to zi, should belong to the ith cluster. To obtain the actual number of clusters, and to avoid overlapping, Sym-index should be maximized.

As formulated in Eq. (6), Sym-index consists of three factors, 1/K, 1/EK and DK. The first factor decreases as K increases; since Sym-index is to be maximized, this factor favors a smaller value of K. The second factor reflects the total symmetry present within the clusters: for clusters with good symmetrical structure, the EK value is small. In general, as K increases the clusters become more symmetric, and from Eq. (5) it can be seen that as de(y, z) decreases, EK also decreases, which in turn increases the value of Sym-index; maximization of Sym-index therefore tends to increase K through this factor. Finally, the third factor, DK, which measures the maximum separation between a pair of cluster centers, also increases with K. The constraint imposed on DK is that its maximum value is bounded by the maximum separation between a pair of data items in the data set. Due to the complementary nature of these three factors, Sym-index is expected to determine the true partitioning.

Example Let there be K = 2 clusters with 4 data points in each. The data points of cluster 1 are (2, 2), (3, 5), (4, 7), (5, 8), and those of cluster 2 are (3, 1), (7, 10), (9, 11), (5, 2). The center of cluster 1 is (3.5, 5.5). The reflected point X∗ of point (2, 2) w.r.t. the cluster center (3.5, 5.5) is (5, 9). The two nearest neighbors of X∗ are (5, 8) and (4, 7), at distances d1 = √((5 − 5)² + (9 − 8)²) = 1 and d2 = √((5 − 4)² + (9 − 7)²) = 2.2361, respectively. The Euclidean distance de of point (2, 2) from the cluster center (3.5, 5.5) is 3.8079. So, according to Eqs. (4) and (5), the point symmetry distance of (2, 2) w.r.t. (3.5, 5.5) is ((1 + 2.2361)/2) × 3.8079 = 6.1612. Similarly, for the data points (3, 5), (4, 7) and (5, 8), the point symmetry based distances are 1.1441, 2.5583 and 4.7172, respectively. So the total symmetrical deviation for cluster 1 is 14.5808 (6.1612 + 1.1441 + 2.5583 + 4.7172). In a similar way, the total symmetrical deviation for cluster 2 is 46.4349 (13.4339 + 16.290 + 8.8632 + 7.8468). So the total symmetrical deviation EK is 61.0157 (46.4349 + 14.5808). Now, DK of Eq. (9) is 2.6926. Based on these values, the Sym-index according to Eq. (6) is 0.0221 (0.5 × 0.0164 × 2.6926).
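The computations above can be sketched in Python (a two-dimensional illustrative sketch, not the authors' implementation; following the worked example, the point y itself is excluded from the knear-nearest-neighbor search for its reflected point):

```python
import math

def d_ps(y, center, cluster_points, knear=2):
    """Point symmetry distance of Eq. (5): the mean distance of the knear
    nearest neighbors of the reflected point 2*center - y, multiplied by
    the Euclidean distance de(y, center). 2-d points only; y itself is
    excluded from the neighbor search, as in the worked example."""
    reflected = (2 * center[0] - y[0], 2 * center[1] - y[1])
    dists = sorted(math.dist(reflected, q) for q in cluster_points if q != y)
    d_sym = sum(dists[:knear]) / knear
    return d_sym * math.dist(y, center)

def sym_index(clusters, centers):
    """Sym-index of Eq. (6): (1/K) * (1/E_K) * D_K, to be maximized.
    Neighbors of each reflected point are searched within its own cluster."""
    K = len(clusters)
    E_K = sum(d_ps(y, c, pts) for pts, c in zip(clusters, centers) for y in pts)
    D_K = max(math.dist(ci, cj) for ci in centers for cj in centers)
    return (1.0 / K) * (1.0 / E_K) * D_K

# Cluster 1 of the worked example.
pts1 = [(2, 2), (3, 5), (4, 7), (5, 8)]
c1 = (3.5, 5.5)
print([round(d_ps(y, c1, pts1), 4) for y in pts1])
# ≈ [6.1612, 1.1441, 2.5583, 4.7172] (the values quoted in the text, up to rounding)
```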

3.4.2 I-index: cluster validity measure based on popular Euclidean distance

I-index (Maulik and Bandyopadhyay 2002), another internal cluster validity index based on Euclidean distance, is used as the second objective function. It is formulated as follows:

I(K) = ((1/K) × (E1/EK) × DK)^p,   (10)

where K denotes the number of clusters, EK = Σ_{k=1}^{K} Σ_{j=1}^{nk} de(ck, x_k^j) and DK = max_{i,j=1}^{K} de(ci, cj). Here, cj denotes the center of the jth cluster, x_k^j denotes the jth point of the kth cluster, and nk is the number of data points present in the kth cluster. The value of K which maximizes I-index is considered as the optimal number of clusters.

The three factors constituting I-index are 1/K, E1/EK, and DK. The first factor reduces the value of I as K increases. The second factor is the ratio of E1, which is constant for a given data set, to EK; as K increases, EK decreases, so this factor rewards the formation of more compact clusters. Finally, the third factor, DK, which measures the maximum separation between a pair of cluster centers over all possible pairs, increases with K. The constraint imposed on DK is that its maximum value is bounded by the maximum separation between a pair of data items in the data set. Due to the complementary nature of these three factors, I-index is able to determine the optimal partitioning. In this work, we have used p = 2.

Example Let there be K = 2 clusters with 4 data points in each. The data points of cluster 1 are (2, 2), (3, 5), (4, 7), (5, 8), and those of cluster 2 are (3, 1), (7, 10), (9, 11), (5, 2). To calculate E1, assume there is only one cluster containing all N = 8 data points: (2, 2), (3, 5), (4, 7), (5, 8), (3, 1), (7, 10), (9, 11), (5, 2). The center of the whole data set is (4.75, 5). Computing the Euclidean distance of each data point w.r.t. this center gives d1 = 4.0697, d2 = 1.7500, d3 = 2.1360, d4 = 3.0104, d5 = 4.3661, d6 = 5.4829, d7 = 4.2500, and d8 = 3.0104. So E1, the summation of all these distances (d1 + d2 + · · · + d8), equals 28.075. Now consider the clusters separately, together with the data points belonging to each. For the first cluster, the center is (3.5, 5.5), and the distances of (2, 2), (3, 5), (4, 7), (5, 8) from it are d1 = 3.8079, d2 = 0.7071, d3 = 1.5811, and d4 = 2.9155, respectively; their sum is 9.0116 (3.8079 + 0.7071 + 1.5811 + 2.9155). Similarly, for the second cluster, the sum of the distances of all its points from its center is 15.9340. EK is the sum of these within-cluster distance totals, so EK = 24.9456 (9.0116 + 15.9340). The value of DK, the distance between the two cluster centers, is 2.6926. Based on these values, the value of I-index is 2.2958.
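Eq. (10) can be sketched as follows (an illustrative sketch with a function name of our own; cluster centers and the grand center are taken as coordinate means, and the example data set is synthetic, chosen so the result is easy to verify by hand):

```python
import math

def i_index(clusters, p=2):
    """I-index of Eq. (10): ((1/K) * (E1/EK) * DK) ** p, to be maximized.
    `clusters` is a list of lists of points (tuples of equal dimension)."""
    dim = len(clusters[0][0])
    centers = [
        tuple(sum(x[d] for x in pts) / len(pts) for d in range(dim))
        for pts in clusters
    ]
    all_points = [x for pts in clusters for x in pts]
    grand = tuple(sum(x[d] for x in all_points) / len(all_points)
                  for d in range(dim))
    E1 = sum(math.dist(x, grand) for x in all_points)          # one big cluster
    EK = sum(math.dist(x, c)                                   # within-cluster
             for pts, c in zip(clusters, centers) for x in pts)
    DK = max(math.dist(ci, cj) for ci in centers for cj in centers)
    return ((1.0 / len(clusters)) * (E1 / EK) * DK) ** p

# Two tight, well-separated pairs: EK = 4, E1 = 4*sqrt(26), DK = 10,
# hence I = (0.5 * sqrt(26) * 10)**2 = 25 * 26 = 650 exactly.
print(i_index([[(0, 0), (0, 2)], [(10, 0), (10, 2)]]))  # ≈ 650.0
```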


3.4.3 Minkowski index: MS-index

Minkowski index (Ben-Hur and Guyon 2003) is an external cluster validity index, which evaluates the quality of an obtained partitioning against a given true solution. Let the true solution be T, and the obtained clustering solution be U. To define the Minkowski index, three quantities are computed, denoted by n11, n01 and n10. Here, n11 is the total number of pairs of data points situated in the same cluster in both T and U; n01 is the total number of pairs of data points situated in the same cluster of U but in different clusters of T; and n10 is the total number of pairs of data points situated in the same cluster of T but in different clusters of U. The Minkowski index can then be defined as follows:

DM(T, U) = √((n01 + n10)/(n11 + n10))   (11)

The minimum value of the Minkowski index indicates the best agreement with the true partitioning. For each string, the Minkowski index value is calculated over the 10 % of data points for which class label information is known. This is then used as the third objective function.
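A direct pair-counting sketch of Eq. (11) (the function name is ours; labels are compared pairwise, so the computation is O(n²) in the number of labeled points):

```python
import math
from itertools import combinations

def minkowski_index(true_labels, pred_labels):
    """Minkowski index of Eq. (11): sqrt((n01 + n10) / (n11 + n10)).
    Lower values indicate better agreement with the true labeling T."""
    n11 = n01 = n10 = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_t = true_labels[i] == true_labels[j]
        same_u = pred_labels[i] == pred_labels[j]
        if same_t and same_u:
            n11 += 1            # together in both T and U
        elif same_u:
            n01 += 1            # together in U only
        elif same_t:
            n10 += 1            # together in T only
    return math.sqrt((n01 + n10) / (n11 + n10))

print(minkowski_index([0, 0, 1, 1], [0, 0, 0, 1]))
# n11 = 1, n01 = 2, n10 = 1 -> sqrt(3/2) ≈ 1.2247
```

A perfect match gives n01 = n10 = 0 and hence an index of 0.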

3.5 Other steps

To evaluate the three objective functions, the whole partitioning is generated after joining the sub-cluster centers. To optimize these three objective functions simultaneously, AMOSA is utilized as the underlying optimization strategy. Three types of mutation operations have been used:

1. In order to perturb each individual cluster center, we have used the Laplacian distribution, p(ε) ∝ e^(−|ε − μ|/δ), where the scaling factor δ sets the magnitude of the perturbation used to generate a new value for that particular position, and μ represents the value at the position to be perturbed. We have kept the scaling factor δ equal to 1.0. The newly generated value replaces the old value. This perturbation operation is applied to all dimensions independently.

2. In order to reduce the number of clusters encoded in a string by 1, a cluster center is removed from the string.

3. In order to increase the number of clusters encoded in a string by 1, a new cluster center is added to the string; it is a randomly chosen data point from the entire data set.

If any string is selected for mutation, then any of theabove mentioned mutation types is applied with uniformprobability.

Example Let a state look like (2.5, 3.5, 4.2, 5.3, 6.1, 7.2), representing three cluster centers in the 2-d plane: (2.5, 3.5), (4.2, 5.3), and (6.1, 7.2).

– If mutation type 1 is selected and the second position in the state is picked for perturbation, then each dimension of (4.2, 5.3) is changed by some value drawn using the Laplacian distribution.

– If mutation type 2 is selected, a center will be removed from the state. If position 2 is selected for mutation, then after mutation the final state will look like (2.5, 3.5, 6.1, 7.2).

– If mutation type 3 is selected, a new center will be added. Let the point (8.9, 9.2) be randomly selected from the data set; after addition, the final state will look like (2.5, 3.5, 4.2, 5.3, 6.1, 7.2, 8.9, 9.2).

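The three mutation operators can be sketched as follows (an illustrative sketch of the scheme, not the authors' code; the Laplace sample is drawn via an exponentially distributed magnitude −δ·ln(U) with a random sign, which is equivalent to sampling from the stated distribution with μ at the perturbed value):

```python
import math
import random

def mutate(state, data, delta=1.0):
    """Apply one of the three mutation types with equal probability.
    `state` is a list of cluster centers (tuples); `data` is the data set.
    Returns a new state; the input state is left unchanged."""
    op = random.randrange(3)
    new_state = [tuple(c) for c in state]
    if op == 0:
        # Type 1: Laplacian perturbation of one center, each dimension
        # independently: mu +/- delta * Exp(1), i.e. a Laplace(mu, delta) draw.
        i = random.randrange(len(new_state))
        new_state[i] = tuple(
            mu - random.choice((-1, 1)) * delta * math.log(random.random())
            for mu in new_state[i]
        )
    elif op == 1 and len(new_state) > 2:
        # Type 2: delete a randomly chosen center (K decreases by 1).
        del new_state[random.randrange(len(new_state))]
    else:
        # Type 3: add a randomly chosen data point as a new center
        # (also used here as a fallback when K is already minimal).
        new_state.append(random.choice(data))
    return new_state
```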

As the proposed semi-supervised clustering technique utilizes three cluster validity indices as the objective functions, their simultaneous optimization produces a set of solutions on the final Pareto front. In order to compare the proposed approach with other approaches, we need to select a single solution from the final Pareto front. Here we have selected the solution which corresponds to the minimum value of the Minkowski score computed over the 10 % labeled data.

4 Experimental results

This section describes the image data sets and the experimental results obtained after application of the semi-ImClustMOO algorithm to remote sensing satellite images. The proposed semi-ImClustMOO algorithm is executed with the following parameter combination: Tmin = 0.01, Tmax = 10, SL = 20, HL = 10, α = 0.8, and iter = 10. The semi-ImClustMOO algorithm produces a large collection of non-dominated solutions on the final Pareto optimal front, so a particular solution is selected from the front using the minimum Minkowski index value. The effectiveness of the semi-ImClustMOO algorithm is compared with the FCM (Bezdek 1981) and MOGA (Bandyopadhyay et al. 2007) clustering techniques. In order to compare the performance of these three partitioning techniques, we have plotted the segmented images obtained by each of them. We have also calculated the Silhouette index (Rousseeuw 1987) and I-index (Maulik and Bandyopadhyay 2002) values of the partitionings obtained by the three techniques; for a good partitioning, both values should be maximized. Tables 1 and 2


Table 1 Silhouette index values obtained for optimal solutions after execution of different algorithms

Algorithm          SPOT Kolkata    IRS Kolkata    IRS Mumbai
                   K    s(C)       K    s(C)      K    s(C)
Semi-ImClustMOO    7    0.6129     6    0.6372    6    0.5976
FCM                7    0.4487     4    0.4832    6    0.4642
MOGA               7    0.5608     4    0.5892    7    0.5592

Bold values indicate that the index value obtained by the proposed algorithm is better than those of the other algorithms for the given data set

Table 2 I-index values obtained for optimal solutions after execution of different algorithms

Algorithm          SPOT Kolkata    IRS Kolkata    IRS Mumbai
                   K    I          K    I         K    I
Semi-ImClustMOO    7    101.0130   6    110.4084  6    244.6950
FCM                7    82.2081    4    31.1697   6    178.0322
MOGA               7    97.6453    4    96.2788   7    183.7727

Bold values indicate that the index value obtained by the proposed algorithm is better than those of the other algorithms for the given data set

show the Silhouette index and I-index values of the partitionings obtained by the three techniques, respectively. The results show that the proposed semi-ImClustMOO algorithm attains the maximum values of the Silhouette index and I-index for all the image data sets compared to the two popular image segmentation techniques, FCM and MOGA.

4.1 Silhouette index (Rousseeuw 1987)

It is the most widely used index for measuring the quality of the partitioning obtained by a clustering technique. It evaluates two quantities which capture intra- and inter-cluster similarity. Here, la(j) is the average distance of point j from the remaining points of its own cluster, and mb(j) is the minimum, over the other clusters, of the average distance of point j from the points of that cluster. Based on these quantities, the silhouette width s(j) of point j is calculated as follows:

s(j) = (mb(j) − la(j)) / max{la(j), mb(j)}   (12)

The Silhouette index S(C) (Rousseeuw 1987) is the average of s(j) over all the data points. It indicates how well the data have been clustered: S(C) quantifies the two objectives of cluster separability and compactness in a natural way. The S(C) index takes values in the range −1 to +1, and a good partitioning corresponds to a high positive value of S(C).
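Eq. (12) and its average can be sketched as follows (an illustrative sketch; it assumes every cluster contains at least two points, and the function name is our own):

```python
import math

def silhouette(points, labels):
    """Average Silhouette index S(C): the mean of Eq. (12) over all points.
    la(j) is the mean distance to the point's own cluster; mb(j) is the
    smallest mean distance to any other cluster."""
    total = 0.0
    for j, y in enumerate(points):
        own = [q for q, l in zip(points, labels)
               if l == labels[j] and q is not y]
        la = sum(math.dist(y, q) for q in own) / len(own)
        mb = min(
            sum(math.dist(y, q) for q, l in zip(points, labels) if l == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != labels[j]
        )
        total += (mb - la) / max(la, mb)
    return total / len(points)

# Two tight, well-separated pairs give a value close to +1.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(round(silhouette(pts, [0, 0, 1, 1]), 4))  # ≈ 0.9002
```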

4.1.1 SPOT image of Kolkata (Richards and Richards 1999)

The size of the SPOT image of Kolkata is 512 × 512. The image is presented in three bands: red, green and near-infrared.

Band 1: red band having wavelength in the range of 0.61–0.68 µm.

Fig. 8 a Distribution of pixels of SPOT image of Kolkata in the feature space. b SPOT image of Kolkata projected in the NIR band after application of histogram equalization


Fig. 9 Scatter plot of some randomly selected data points of SPOT image of Kolkata having 7 classes

Fig. 10 Graph showing the variations of MS scores with number of clusters for SPOT image of Kolkata

Band 2: green band having wavelength in the range of 0.50–0.59 µm.
Band 3: near-infrared band having wavelength in the range of 0.79–0.89 µm.

Thus, here each pixel is associated with three features: the three intensity values corresponding to the different bands.

Figure 8a shows the pixel distribution of the image in the feature space. Figure 8b illustrates the important land covers of Kolkata in the infrared band of the input image. This satellite image consists of seven classes (Bandyopadhyay and Pal 2001): vegetation, habitation, open space, turbid water, concrete, pond water and roads (including bridges). In Fig. 8b, a prominent black stretch can be seen; this is the river Hooghly. Below the river, there are two different black patches towards the left-hand side of the image. These are water bodies: the one to the left is the Garden Reach lake, and the one to the right is the Khidirpore dockyard; both are classified as pond water. To the right of these water bodies, a thin line can be seen, which is a canal called the Talis nala, classified as turbid water. On the right-hand side, above the Talis nala, there is a triangular patch known as the race course. A thin line starting at the top right corner of the image and running towards its middle is the Beleghata canal, classified as turbid water, with a road by its side. In the middle top portion of the image, there are several roads, together with habitation and concrete, and a portion of a bridge is also visible. Figure 9 shows a scatter plot of data points belonging to the seven classes, which are highly overlapping in nature. In the proposed semi-ImClustMOO algorithm, the number of clusters is varied from 2 to 10. The multiobjective semi-ImClustMOO algorithm produces a large number of non-dominated solutions, and a single solution is selected based on the minimum MS index (Ben-Hur and Guyon 2003) value. The variation of MS index values for different numbers of clusters is shown in Fig. 10. The minimum MS index value of 0.3343 corresponds to a solution having seven clusters encoded in it. To further demonstrate the effectiveness of the semi-ImClustMOO algorithm, the Silhouette index (Rousseeuw 1987) has been calculated; the mean Silhouette index value of the optimal partitioning identified by the proposed algorithm for K = 7 is 0.6179. The partitioning provided by the semi-ImClustMOO algorithm identifies almost all the regions properly, as shown in Fig. 11. The Talis nala is properly detected by the proposed method, as illustrated in Fig. 11, and the small bridge over the river Hooghly is also properly detected. The segmented image obtained after execution of the FCM (Bezdek 1981) clustering algorithm for K = 7 is illustrated in Fig. 12; the mean Silhouette index value for this partitioning is 0.4487. Now from Fig. 12, it


Fig. 11 Clustered SPOT image using semi-ImClustMOO algorithm

Fig. 12 Clustered SPOT image using FCM algorithm

Fig. 13 Clustered SPOT image using MOGA algorithm

can be easily seen that the FCM algorithm fails to detect the bridge, and that there is confusion among the turbid water, pure water and concrete classes. Next, the multi-objective genetic clustering algorithm, MOGA (Bandyopadhyay et al. 2007), is applied to the SPOT image data set. The partitioning obtained for K = 7 is shown in Fig. 13. Although it identifies all the regions reasonably well, the partitioning still shows some confusion between the concrete and habitation classes; nevertheless, it works better than the FCM algorithm. The mean Silhouette index value of the partitioning obtained by MOGA (Bandyopadhyay et al. 2007) is 0.5608. The Silhouette index values clearly show that the newly proposed semi-ImClustMOO algorithm performs the best compared to the other clustering algorithms, FCM and MOGA. The partitioning results also show that there are few ambiguities in the partitioning identified by the multi-objective semi-ImClustMOO algorithm.

4.2 IRS image of Kolkata

The size of the IRS image of Kolkata is 512 × 512. The image data used here were acquired from the IRS satellite (IRS-1A) using the LISS-II sensor, which has a resolution of 36.25 m × 36.25 m. The image is presented in four bands: blue, green, red and near-infrared.

Band 1: blue band having wavelength in the range of0.45–0.52 µm.


Fig. 14 a Distribution of pixels of IRS image of Kolkata in the space of the first three features. b IRS image of Kolkata in the near-infrared band with histogram equalization

Fig. 15 Plot of four-dimensional data in two dimensions for randomly selected data points of IRS image of Kolkata having 6 classes

Band 2: green band having wavelength in the range of 0.52–0.59 µm.
Band 3: red band having wavelength in the range of 0.62–0.68 µm.
Band 4: near-infrared band having wavelength in the range of 0.77–0.86 µm.

Thus, here each pixel is associated with four features: the four intensity values corresponding to the different bands.

Figure 14a shows the pixel distribution of the image in the space of the first three features. Figure 14b illustrates the important land covers of the IRS image of Kolkata in the infrared band. From Bandyopadhyay and Pal (2001), it is found that this image contains four clusters belonging to four different classes: turbid water, pond water, concrete and open space. From Fig. 14b, it can be easily observed that the river Hooghly, cutting across the middle of the image, has been classified as turbid water. Several fisheries located towards the lower right portion are classified as pond water. Towards the lower right side of the image, there is a township called Salt Lake, which comes under the concrete and open space categories. The canal which bounds the top portion of the township is classified as pond water. On the right-hand side of the image there are two parallel lines, which correspond to the airstrips of Dumdum airport; these come under the concrete category. Apart from the above-mentioned regions, there are several water bodies, roads, etc., in this image. Figure 15 shows the plot of the four-dimensional data projected onto two dimensions, with six highly overlapping classes. The semi-ImClustMOO technique evolves six clusters from this data set; the partitioning is shown in Fig. 17. In the proposed semi-ImClustMOO algorithm, the number of clusters is varied in the range 2–10, and the minimum value of the MS index is used to select a single optimal solution from the non-dominated front. The variation of MS index values for different numbers of clusters is shown in Fig. 16. The minimum MS index value obtained for the optimal partitioning identified by the semi-ImClustMOO algorithm for K = 6 is 0.3211, and the corresponding mean Silhouette index value is 0.6372. The partitioning provided by the semi-ImClustMOO algorithm separates almost all the regions well, as shown in Fig. 17. The canal above Salt Lake has been correctly classified as pond water, and the two parallel lines corresponding to the airstrips of Dumdum airport have been correctly classified as concrete. The partitioning obtained after application of the FCM (Bezdek 1981) algorithm for K = 4 is shown in Fig. 18; the mean Silhouette index value for this partitioning is 0.4832. From Fig. 18, it can be easily seen that the FCM algorithm is unable to differentiate the river Hooghly from the city regions, which belong to different classes. Although it classifies some portions well, such as the canal bounding Salt Lake and the airstrips, significant confusion remains. The partitioning obtained after application of the MOGA (Bandyopadhyay et al. 2007) algorithm for K = 4 is shown in Fig. 19. The mean Silhouette index value for this partitioning is 0.5892. MOGA works better than the FCM algorithm and identifies all the regions reasonably well, but some confusion still exists in the obtained partitioning.


Fig. 16 Graph showing the variations of MS scores with number of clusters for IRS image of Kolkata

Fig. 17 Clustered IRS image of Kolkata using semi-ImClustMOO algorithm

Fig. 18 Clustered IRS image of Kolkata using FCM algorithm

Fig. 19 Clustered IRS image of Kolkata using MOGA algorithm


Fig. 20 a Distribution of pixels of IRS image of Mumbai in the space of the first three features. b IRS image of Mumbai in the near-infrared band with histogram equalization

Fig. 21 Plot of four-dimensional data in two dimensions for randomly selected data points of IRS image of Mumbai having 6 classes

4.3 IRS image of Mumbai

The size of the IRS image of Mumbai is 512 × 512. The IRS image of Mumbai was obtained using the Linear Imaging Self-Scanning (LISS-II) sensor. The image is presented in four bands: blue, green, red and near-infrared. Thus, here each pixel is associated with four features: the four intensity values corresponding to the different bands.

Figure 20a shows the pixel distribution of the image in the space of the first three features. Figure 20b illustrates the important land covers of the IRS image of Mumbai in the near-infrared band. From Bandyopadhyay and Pal (2001), it is seen that this image contains seven clusters: concrete, open space (OS1 and OS2), vegetation, habitation and turbid water (TW1 and TW2). From Fig. 20, it can be seen that three sides of the city area are surrounded by the Arabian sea, which falls under the turbid water category. There are several islands towards the bottom right of the image; the famous one is the Elephanta island. These islands come under the open space and vegetation categories. The dockyard, with its three finger-like structures, can easily be seen on the south-eastern part of Mumbai; this portion comes under the concrete and habitation categories. Figure 21 shows the two-dimensional plot of the four-dimensional data points, with six highly overlapping classes. The semi-ImClustMOO clustering technique evolves six clusters from this data set, as shown in Fig. 23. In the proposed semi-ImClustMOO algorithm, the number of clusters is varied from 2 to 10, and the minimum value of the MS index is used to select a single optimal solution from the non-dominated front. The variation of the MS index with the number of clusters is shown in Fig. 22. The MS index value obtained for the optimal partitioning identified by the semi-ImClustMOO technique for K = 6 is 0.4211, and the corresponding mean Silhouette index value is 0.5976. The partitioning provided by the semi-ImClustMOO algorithm separates almost all the regions well, as shown in Fig. 23. The semi-ImClustMOO algorithm is able to classify the water body of the Arabian sea into three different classes, and it also classifies the islands and the dockyard correctly. The partitioning obtained using the FCM (Bezdek 1981) algorithm for K = 6 is shown in Fig. 24; the mean Silhouette index value for this partitioning is 0.4642. From Fig. 24, it is very difficult to distinguish the Arabian sea, so there is a significant amount of confusion in the partitioning obtained by the FCM algorithm. The partitioning obtained using the MOGA (Bandyopadhyay et al. 2007) algorithm for K = 7 is shown in Fig. 25. It classifies the water body of the Arabian sea into two different classes and works better than the FCM algorithm; the mean Silhouette index value for this partitioning is 0.5592. Table 1 shows the Silhouette index values corresponding to the partitionings identified by the proposed semi-ImClustMOO, FCM and MOGA algorithms. The Silhouette index value obtained by the semi-ImClustMOO algorithm is higher than those obtained by the FCM and MOGA algorithms. Table 2 shows the I-index values corresponding to the partitionings identified by the proposed semi-ImClustMOO,


Fig. 22 Graph showing the variations of MS scores with number of clusters for IRS image of Mumbai

Fig. 23 Clustered IRS image of Mumbai using semi-ImClustMOO algorithm

Fig. 24 Clustered IRS image of Mumbai using FCM algorithm

Fig. 25 Clustered IRS image of Mumbai using MOGA algorithm


FCM and MOGA algorithms. The I-index value obtained by the semi-ImClustMOO algorithm is higher than those obtained by the FCM and MOGA algorithms.

5 Conclusions

In this paper, the classification of satellite images is modeled as semi-supervised clustering of pixels in different intensity spaces. A multi-objective optimization based semi-supervised clustering technique, semi-ImClustMOO, has been developed for partitioning satellite images. To obtain the true partitioning, three objective functions are used, and a modern multiobjective simulated annealing based optimization strategy, AMOSA, is utilized to optimize the three objectives simultaneously. The first two objectives are cluster validity indices based on internal properties of the data, and the last one is a cluster validity measure based on external knowledge. In order to generate the labeled data, the fuzzy C-means (FCM) clustering technique is utilized: based on the highest membership values of the points with respect to the different clusters obtained by FCM, 10 % of the data points are selected and used as the supervised information in the proposed semi-supervised clustering technique. A multi-center approach is utilized to represent each cluster.

The effectiveness of the semi-ImClustMOO clustering technique has been compared with that of the FCM and MOGA clustering techniques for partitioning three remote sensing satellite images. Future work includes the introduction of further semi-supervised clustering techniques based on other optimization strategies, such as genetic algorithms or differential evolution. We would also like to develop additional objective functions based on internal or external criteria.

References

Altun Y, McAllester D, Belkin M (2005) Maximum margin semi-supervised learning for structured variables. In: Advances in neural information processing systems. MIT Press, Cambridge, MA, pp 33–40

Bandyopadhyay S, Pal SK (2001) Pixel classification using variable string genetic algorithms with chromosome differentiation. IEEE Trans Geosci Remote Sens 39(2):303–308

Bandyopadhyay S, Saha S (2008) A point symmetry-based clustering technique for automatic evolution of clusters. IEEE Trans Knowl Data Eng 20(11):1441–1457

Bandyopadhyay S, Maulik U, Mukhopadhyay A (2007) Multiobjective genetic clustering for pixel classification in remote sensing imagery. IEEE Trans Geosci Remote Sens 45(5):1506–1511

Bandyopadhyay S, Saha S, Maulik U, Deb K (2008) A simulated annealing-based multiobjective optimization algorithm: AMOSA. IEEE Trans Evolut Comput 12(3):269–283

Basu S, Banerjee A, Mooney RJ (2004) Active semi-supervision for pairwise constrained clustering. In: Proceedings of the SIAM International Conference on Data Mining (SDM-2004), Buena Vista, FL, pp –344

Ben-Hur A, Guyon I (2003) Detecting stable clusters using principal component analysis. In: Functional genomics. Springer, Berlin, pp 159–182

Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer, Norwell

Bilenko M, Basu S, Mooney RJ (2004) Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the twenty-first international conference on machine learning. ACM, pp 81–88

Chapelle O, Zien A (2004) Semi-supervised classification by low density separation. In: AISTATS. MIT Press, Cambridge, MA

Chapelle O, Schölkopf B, Zien A et al (2006) Semi-supervised learning, vol 2. MIT Press, Cambridge

Das S, Konar A (2009) Automatic image pixel clustering with an improved differential evolution. Appl Soft Comput 9(1):226–236

Deb K (2001) Multi-objective optimization using evolutionary algorithms, vol 16. Wiley, New York

Fan J, Han M, Wang J (2009) Single point iterative weighted fuzzy C-means clustering algorithm for remote sensing image segmentation. Pattern Recogn 42(11):2527–2540

Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6:721–741

Kirkpatrick S, Gelatt CD, Vecchi MP et al (1983) Optimization by simulated annealing. Science 220(4598):671–680

Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654

Maulik U, Bandyopadhyay S (2003) Fuzzy partitioning using a real-coded variable-length genetic algorithm for pixel classification. IEEE Trans Geosci Remote Sens 41(5):1075–1081

Maulik U, Saha I (2009) Modified differential evolution based fuzzy clustering for pixel classification in remote sensing imagery. Pattern Recogn 42(9):2135–2149

Maulik U, Saha I (2010) Automatic fuzzy clustering using modified differential evolution for image classification. IEEE Trans Geosci Remote Sens 48(9):3503–3510

Richards JA, Richards J (1999) Remote sensing digital image analysis, vol 3. Springer, Berlin

Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

Saha S, Bandyopadhyay S (2010) Application of a multiseed-based clustering technique for automatic satellite image segmentation. IEEE Geosci Remote Sens Lett 7(2):306–308

Saha S, Bandyopadhyay S (2013) A generalized automatic clustering algorithm in a multiobjective framework. Appl Soft Comput 13(1):89–108

Saha I, Maulik U, Bandyopadhyay S, Plewczynski D (2012a) SVMeFC: SVM ensemble fuzzy clustering for satellite image segmentation. IEEE Geosci Remote Sens Lett 9(1):52–55

Saha S, Ekbal A, Alok AK (2012b) Semi-supervised clustering using multiobjective optimization. In: 2012 12th International Conference on Hybrid Intelligent Systems (HIS), IEEE, pp 360–365

Sathya P, Malathi L (2011) Classification and segmentation in satellite imagery using back propagation algorithm of ANN and k-means algorithm. Int J Mach Learn Comput 1(4):422–426
